Linux Trace Kernel
* [PATCH v8 08/46] KVM: Provide generic interface for checking memory private/shared status
From: Ackerley Tng via B4 Relay @ 2026-06-19  0:31 UTC (permalink / raw)
  To: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
	wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
	aneesh.kumar, liam, Paolo Bonzini, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen, Yuanchu Xie,
	Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt, Kiryl Shutsemau,
	Baoquan He, Jason Gunthorpe, Vlastimil Babka, Baoquan He
  Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco, Ackerley Tng
In-Reply-To: <20260618-gmem-inplace-conversion-v8-0-9d2959357853@google.com>

From: Sean Christopherson <seanjc@google.com>

Introduce a generic kvm_mem_is_private() interface using a static call to
determine if a GFN is private. This allows the implementation for checking
a GFN's private/shared status to be set at runtime.

In preparation for choosing between a guest_memfd lookup and the existing
VM attribute lookup, rename the existing VM-attribute-based check to
kvm_vm_mem_is_private() to emphasize that it looks up VM attributes.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
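Note for readers: the dispatch described above can be sketched in plain
userspace C.  This is only an illustration under stated assumptions: an
ordinary function pointer stands in for the kernel's static call, and
struct vm_model, mem_is_private_ret0() and the 64-bit bitmap are
hypothetical stand-ins that do not exist in KVM.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t gfn_t;

/* Hypothetical stand-in for struct kvm: one "private" bit per GFN, 64 GFNs. */
struct vm_model {
	uint64_t private_bitmap;
};

typedef bool (mem_is_private_t)(struct vm_model *vm, gfn_t gfn);

/* Default target, modelling the "return 0" default of the static call. */
static bool mem_is_private_ret0(struct vm_model *vm, gfn_t gfn)
{
	(void)vm;
	(void)gfn;
	return false;
}

/* A plain function pointer plays the role of the static call here. */
static mem_is_private_t *mem_is_private_impl = mem_is_private_ret0;

/* Models the VM-attribute-based lookup. */
static bool vm_mem_is_private(struct vm_model *vm, gfn_t gfn)
{
	return gfn < 64 && ((vm->private_bitmap >> gfn) & 1);
}

/* Generic entry point: callers never learn which lookup is behind it. */
static bool mem_is_private(struct vm_model *vm, gfn_t gfn)
{
	return mem_is_private_impl(vm, gfn);
}

/* Models the one-time selection of an implementation at init. */
static void init_memory_attributes(void)
{
	mem_is_private_impl = vm_mem_is_private;
}
```

Until init_memory_attributes() runs, every GFN is reported as shared,
which is the behaviour the "return 0" default is meant to model.
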
 include/linux/kvm_host.h | 12 +++++++++++-
 virt/kvm/kvm_main.c      | 15 +++++++++++++++
 2 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index eb26d4ea8945a..3915da2a61778 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2546,7 +2546,7 @@ bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
 bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
 					 struct kvm_gfn_range *range);
 
-static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
+static inline bool kvm_vm_mem_is_private(struct kvm *kvm, gfn_t gfn)
 {
 	return kvm_get_vm_memory_attributes(kvm, gfn) & KVM_MEMORY_ATTRIBUTE_PRIVATE;
 }
@@ -2557,6 +2557,16 @@ static inline bool kvm_mem_range_is_private(struct kvm *kvm, gfn_t start,
 						  KVM_MEMORY_ATTRIBUTE_PRIVATE,
 						  KVM_MEMORY_ATTRIBUTE_PRIVATE);
 }
+#endif  /* CONFIG_KVM_VM_MEMORY_ATTRIBUTES */
+
+#ifdef kvm_arch_has_private_mem
+typedef bool (kvm_mem_is_private_t)(struct kvm *kvm, gfn_t gfn);
+DECLARE_STATIC_CALL(__kvm_mem_is_private, kvm_mem_is_private_t);
+
+static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
+{
+	return static_call(__kvm_mem_is_private)(kvm, gfn);
+}
 #else
 static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
 {
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 6669f1477013c..8b238e461b854 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2627,6 +2627,20 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
 }
 #endif /* CONFIG_KVM_VM_MEMORY_ATTRIBUTES */
 
+#ifdef kvm_arch_has_private_mem
+DEFINE_STATIC_CALL_RET0(__kvm_mem_is_private, kvm_mem_is_private_t);
+EXPORT_STATIC_CALL_GPL(__kvm_mem_is_private);
+
+static void kvm_init_memory_attributes(void)
+{
+#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
+	static_call_update(__kvm_mem_is_private, kvm_vm_mem_is_private);
+#endif
+}
+#else
+static void kvm_init_memory_attributes(void) { }
+#endif
+
 struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn)
 {
 	return __gfn_to_memslot(kvm_memslots(kvm), gfn);
@@ -6528,6 +6542,7 @@ int kvm_init(unsigned vcpu_size, unsigned vcpu_align, struct module *module)
 	kvm_preempt_ops.sched_in = kvm_sched_in;
 	kvm_preempt_ops.sched_out = kvm_sched_out;
 
+	kvm_init_memory_attributes();
 	kvm_init_debug();
 
 	r = kvm_vfio_ops_init();

-- 
2.55.0.rc0.738.g0c8ab3ebcc-goog




* [PATCH v8 07/46] KVM: Rename memory attribute APIs to prepare for in-place gmem conversion
From: Ackerley Tng via B4 Relay @ 2026-06-19  0:31 UTC (permalink / raw)
In-Reply-To: <20260618-gmem-inplace-conversion-v8-0-9d2959357853@google.com>

From: Sean Christopherson <seanjc@google.com>

Rename memory attribute APIs to add a "vm_" in the name in anticipation of
moving PRIVATE tracking into guest_memfd, to allow in-place conversion
between SHARED and PRIVATE.  At that point, there will effectively be two
(potential) sources of memory attributes: the VM and guest_memfd.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
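Note for readers: the semantics of the renamed range helper (true only if
every GFN in [start, end) has attributes whose masked bits equal the
requested value) can be sketched in userspace C.  This is a simplified
model under stated assumptions: a flat array replaces the per-VM xarray,
and range_has_attrs(), range_is_private() and MODEL_ATTR_PRIVATE are
hypothetical names chosen for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Model value for the PRIVATE attribute bit (assumption for this sketch). */
#define MODEL_ATTR_PRIVATE (1ul << 3)

/*
 * Return true if all entries in [start, end) satisfy
 * (entry & mask) == attrs.  A request for bits outside the mask can
 * never match, so it is rejected up front.
 */
static bool range_has_attrs(const unsigned long *attrs_by_gfn,
			    size_t start, size_t end,
			    unsigned long mask, unsigned long attrs)
{
	size_t gfn;

	if (attrs & ~mask)
		return false;

	for (gfn = start; gfn < end; gfn++) {
		if ((attrs_by_gfn[gfn] & mask) != attrs)
			return false;
	}
	return true;
}

/* Convenience wrapper in the spirit of the range "is private" helper. */
static bool range_is_private(const unsigned long *attrs_by_gfn,
			     size_t start, size_t end)
{
	return range_has_attrs(attrs_by_gfn, start, end,
			       MODEL_ATTR_PRIVATE, MODEL_ATTR_PRIVATE);
}
```
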
 arch/x86/kvm/mmu/mmu.c   |  6 +++---
 include/linux/kvm_host.h | 15 +++++++++++----
 virt/kvm/guest_memfd.c   |  6 +++---
 virt/kvm/kvm_main.c      | 16 ++++++++--------
 4 files changed, 25 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index e0005a21b6e22..cbc50aef801fb 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -8087,11 +8087,11 @@ static bool hugepage_has_attrs(struct kvm *kvm, struct kvm_memory_slot *slot,
 	const unsigned long end = start + KVM_PAGES_PER_HPAGE(level);
 
 	if (level == PG_LEVEL_2M)
-		return kvm_range_has_memory_attributes(kvm, start, end, ~0, attrs);
+		return kvm_range_has_vm_memory_attributes(kvm, start, end, ~0, attrs);
 
 	for (gfn = start; gfn < end; gfn += KVM_PAGES_PER_HPAGE(level - 1)) {
 		if (hugepage_test_mixed(slot, gfn, level - 1) ||
-		    attrs != kvm_get_memory_attributes(kvm, gfn))
+		    attrs != kvm_get_vm_memory_attributes(kvm, gfn))
 			return false;
 	}
 	return true;
@@ -8191,7 +8191,7 @@ void kvm_mmu_init_memslot_memory_attributes(struct kvm *kvm,
 		 * be manually checked as the attributes may already be mixed.
 		 */
 		for (gfn = start; gfn < end; gfn += nr_pages) {
-			unsigned long attrs = kvm_get_memory_attributes(kvm, gfn);
+			unsigned long attrs = kvm_get_vm_memory_attributes(kvm, gfn);
 
 			if (hugepage_has_attrs(kvm, slot, gfn, level, attrs))
 				hugepage_clear_mixed(slot, gfn, level);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index d370e834d619e..eb26d4ea8945a 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2534,13 +2534,13 @@ static inline bool kvm_memslot_is_gmem_only(const struct kvm_memory_slot *slot)
 }
 
 #ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
-static inline unsigned long kvm_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
+static inline unsigned long kvm_get_vm_memory_attributes(struct kvm *kvm, gfn_t gfn)
 {
 	return xa_to_value(xa_load(&kvm->mem_attr_array, gfn));
 }
 
-bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
-				     unsigned long mask, unsigned long attrs);
+bool kvm_range_has_vm_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
+					unsigned long mask, unsigned long attrs);
 bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
 					struct kvm_gfn_range *range);
 bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
@@ -2548,7 +2548,14 @@ bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
 
 static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
 {
-	return kvm_get_memory_attributes(kvm, gfn) & KVM_MEMORY_ATTRIBUTE_PRIVATE;
+	return kvm_get_vm_memory_attributes(kvm, gfn) & KVM_MEMORY_ATTRIBUTE_PRIVATE;
+}
+static inline bool kvm_mem_range_is_private(struct kvm *kvm, gfn_t start,
+					    gfn_t end)
+{
+	return kvm_range_has_vm_memory_attributes(kvm, start, end,
+						  KVM_MEMORY_ATTRIBUTE_PRIVATE,
+						  KVM_MEMORY_ATTRIBUTE_PRIVATE);
 }
 #else
 static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index b4c24fdf159f6..8101f64e0366f 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -915,9 +915,9 @@ static long __kvm_gmem_populate(struct kvm *kvm, struct kvm_memory_slot *slot,
 
 	folio_unlock(folio);
 
-	if (!kvm_range_has_memory_attributes(kvm, gfn, gfn + 1,
-					     KVM_MEMORY_ATTRIBUTE_PRIVATE,
-					     KVM_MEMORY_ATTRIBUTE_PRIVATE)) {
+	if (!kvm_range_has_vm_memory_attributes(kvm, gfn, gfn + 1,
+						KVM_MEMORY_ATTRIBUTE_PRIVATE,
+						KVM_MEMORY_ATTRIBUTE_PRIVATE)) {
 		ret = -EINVAL;
 		goto out_put_folio;
 	}
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 7b989b659cf82..6669f1477013c 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2419,7 +2419,7 @@ static int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
 #endif /* CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT */
 
 #ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
-static u64 kvm_supported_mem_attributes(struct kvm *kvm)
+static u64 kvm_supported_vm_mem_attributes(struct kvm *kvm)
 {
 #ifdef kvm_arch_has_private_mem
 	if (!kvm || kvm_arch_has_private_mem(kvm))
@@ -2433,19 +2433,19 @@ static u64 kvm_supported_mem_attributes(struct kvm *kvm)
  * Returns true if _all_ gfns in the range [@start, @end) have attributes
  * such that the bits in @mask match @attrs.
  */
-bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
-				     unsigned long mask, unsigned long attrs)
+bool kvm_range_has_vm_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
+					unsigned long mask, unsigned long attrs)
 {
 	XA_STATE(xas, &kvm->mem_attr_array, start);
 	unsigned long index;
 	void *entry;
 
-	mask &= kvm_supported_mem_attributes(kvm);
+	mask &= kvm_supported_vm_mem_attributes(kvm);
 	if (attrs & ~mask)
 		return false;
 
 	if (end == start + 1)
-		return (kvm_get_memory_attributes(kvm, start) & mask) == attrs;
+		return (kvm_get_vm_memory_attributes(kvm, start) & mask) == attrs;
 
 	guard(rcu)();
 	if (!attrs)
@@ -2567,7 +2567,7 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
 	mutex_lock(&kvm->slots_lock);
 
 	/* Nothing to do if the entire range has the desired attributes. */
-	if (kvm_range_has_memory_attributes(kvm, start, end, ~0, attributes))
+	if (kvm_range_has_vm_memory_attributes(kvm, start, end, ~0, attributes))
 		goto out_unlock;
 
 	/*
@@ -2606,7 +2606,7 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
 	/* flags is currently not used. */
 	if (attrs->flags)
 		return -EINVAL;
-	if (attrs->attributes & ~kvm_supported_mem_attributes(kvm))
+	if (attrs->attributes & ~kvm_supported_vm_mem_attributes(kvm))
 		return -EINVAL;
 	if (attrs->size == 0 || attrs->address + attrs->size < attrs->address)
 		return -EINVAL;
@@ -4926,7 +4926,7 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
 		return 1;
 #ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
 	case KVM_CAP_MEMORY_ATTRIBUTES:
-		return kvm_supported_mem_attributes(kvm);
+		return kvm_supported_vm_mem_attributes(kvm);
 #endif
 #ifdef CONFIG_KVM_GUEST_MEMFD
 	case KVM_CAP_GUEST_MEMFD:

-- 
2.55.0.rc0.738.g0c8ab3ebcc-goog




* [PATCH v8 06/46] KVM: Enumerate support for PRIVATE memory iff kvm_arch_has_private_mem is defined
From: Ackerley Tng via B4 Relay @ 2026-06-19  0:31 UTC (permalink / raw)
In-Reply-To: <20260618-gmem-inplace-conversion-v8-0-9d2959357853@google.com>

From: Ackerley Tng <ackerleytng@google.com>

Explicitly guard reporting support for KVM_MEMORY_ATTRIBUTE_PRIVATE based
on kvm_arch_has_private_mem being #defined, in anticipation of decoupling
kvm_supported_mem_attributes() from CONFIG_KVM_VM_MEMORY_ATTRIBUTES.
guest_memfd support for memory attributes will be unconditional to avoid
yet more macros (all architectures that support guest_memfd are expected to
use per-gmem attributes at some point).  At that point, enumerating support
for KVM_MEMORY_ATTRIBUTE_PRIVATE based solely on memory attributes being
supported _somewhere_ would result in KVM over-reporting support on arm64.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
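Note for readers: the guard relies on #ifdef being able to test for a
function-like macro by name.  A minimal userspace sketch of the pattern
follows, assuming hypothetical names (struct vm_model,
arch_has_private_mem(), supported_mem_attributes(), MODEL_ATTR_PRIVATE)
that only model the KVM ones.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Model value for the PRIVATE attribute bit (assumption for this sketch). */
#define MODEL_ATTR_PRIVATE (1ull << 3)

/* Hypothetical stand-in for struct kvm. */
struct vm_model {
	bool has_private_mem;
};

/* An architecture opts in by defining a function-like macro of this name. */
#define arch_has_private_mem(vm) ((vm)->has_private_mem)

static uint64_t supported_mem_attributes(struct vm_model *vm)
{
#ifdef arch_has_private_mem
	/* In this model, a NULL vm asks what any VM could support. */
	if (!vm || arch_has_private_mem(vm))
		return MODEL_ATTR_PRIVATE;
#endif
	/* Without the opt-in macro, PRIVATE is never reported. */
	return 0;
}
```

If the #define is removed, the function compiles down to "return 0", so
an architecture that never opts in cannot over-report support.
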
 virt/kvm/kvm_main.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 1ccc4895a4c26..7b989b659cf82 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2421,8 +2421,10 @@ static int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
 #ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
 static u64 kvm_supported_mem_attributes(struct kvm *kvm)
 {
+#ifdef kvm_arch_has_private_mem
 	if (!kvm || kvm_arch_has_private_mem(kvm))
 		return KVM_MEMORY_ATTRIBUTE_PRIVATE;
+#endif
 
 	return 0;
 }

-- 
2.55.0.rc0.738.g0c8ab3ebcc-goog




* [PATCH v8 05/46] KVM: Make CONFIG_KVM_VM_MEMORY_ATTRIBUTES selectable
From: Ackerley Tng via B4 Relay @ 2026-06-19  0:31 UTC (permalink / raw)
In-Reply-To: <20260618-gmem-inplace-conversion-v8-0-9d2959357853@google.com>

From: Ackerley Tng <ackerleytng@google.com>

Make CONFIG_KVM_VM_MEMORY_ATTRIBUTES user-selectable, but only when a
(CoCo) VM type that might use per-VM memory attributes is enabled.

Also document that CONFIG_KVM_VM_MEMORY_ATTRIBUTES is specifically about
the private/shared attribute.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/Kconfig | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 24f96396cfa1c..c28393dc664eb 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -81,13 +81,16 @@ config KVM_WERROR
 	  If in doubt, say "N".
 
 config KVM_VM_MEMORY_ATTRIBUTES
-	bool
+	depends on KVM_SW_PROTECTED_VM || KVM_INTEL_TDX || KVM_AMD_SEV
+	bool "Enable per-VM PRIVATE vs. SHARED attributes (for CoCo VMs)"
+	help
+	  Enable support for tracking PRIVATE vs. SHARED memory using per-VM
+	  memory attributes.
 
 config KVM_SW_PROTECTED_VM
 	bool "Enable support for KVM software-protected VMs"
 	depends on EXPERT
 	depends on KVM_X86 && X86_64
-	select KVM_VM_MEMORY_ATTRIBUTES
 	help
 	  Enable support for KVM software-protected VMs.  Currently, software-
 	  protected VMs are purely a development and testing vehicle for
@@ -138,7 +141,6 @@ config KVM_INTEL_TDX
 	bool "Intel Trust Domain Extensions (TDX) support"
 	default y
 	depends on INTEL_TDX_HOST
-	select KVM_VM_MEMORY_ATTRIBUTES
 	select HAVE_KVM_ARCH_GMEM_POPULATE
 	help
 	  Provides support for launching Intel Trust Domain Extensions (TDX)
@@ -162,7 +164,6 @@ config KVM_AMD_SEV
 	depends on KVM_AMD && X86_64
 	depends on CRYPTO_DEV_SP_PSP && !(KVM_AMD=y && CRYPTO_DEV_CCP_DD=m)
 	select ARCH_HAS_CC_PLATFORM
-	select KVM_VM_MEMORY_ATTRIBUTES
 	select HAVE_KVM_ARCH_GMEM_PREPARE
 	select HAVE_KVM_ARCH_GMEM_INVALIDATE
 	select HAVE_KVM_ARCH_GMEM_POPULATE

-- 
2.55.0.rc0.738.g0c8ab3ebcc-goog




* [PATCH v8 04/46] KVM: Decouple kvm_arch_has_private_mem from CONFIG_KVM_VM_MEMORY_ATTRIBUTES
From: Ackerley Tng via B4 Relay @ 2026-06-19  0:31 UTC (permalink / raw)
In-Reply-To: <20260618-gmem-inplace-conversion-v8-0-9d2959357853@google.com>

From: Sean Christopherson <seanjc@google.com>

Once memory attributes can be tracked in guest_memfd, having private
memory no longer depends on CONFIG_KVM_VM_MEMORY_ATTRIBUTES.

With this change, x86 defines kvm_arch_has_private_mem() whenever support
for at least one CoCo platform (or the CONFIG_KVM_SW_PROTECTED_VM testing
vehicle) is compiled in.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Co-developed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
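Note for readers: the arch-override-with-generic-fallback pattern used
here can be sketched in userspace C.  The MODEL_* symbols, struct vm_model
and arch_has_private_mem() are hypothetical stand-ins for the Kconfig
options and KVM names; none of them exist in the kernel.

```c
#include <stdbool.h>

/* Hypothetical stand-in for struct kvm. */
struct vm_model {
	bool has_private_mem;
};

/*
 * "Arch header": define the macro only if at least one VM type that can
 * have private memory is built in.
 */
#if defined(MODEL_SW_PROTECTED_VM) ||	\
	defined(MODEL_INTEL_TDX) ||	\
	defined(MODEL_AMD_SEV)
#define arch_has_private_mem(vm) ((vm)->has_private_mem)
#endif

/*
 * "Generic header": key the fallback off the macro itself rather than
 * off an unrelated config option, so the two can never disagree.
 */
#ifndef arch_has_private_mem
static inline bool arch_has_private_mem(struct vm_model *vm)
{
	(void)vm;
	return false;
}
#endif
```

Built without any MODEL_* symbol, the fallback always answers false, even
for a VM whose has_private_mem field happens to be set.  Building with,
for example, -DMODEL_INTEL_TDX switches to the per-VM field.
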
 arch/x86/include/asm/kvm_host.h | 4 +++-
 include/linux/kvm_host.h        | 2 +-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 8e8eb8a5e8a6b..1bde67cf6eb0e 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2394,7 +2394,9 @@ void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
 		       int tdp_max_root_level, int tdp_huge_page_level);
 
 
-#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
+#if defined(CONFIG_KVM_SW_PROTECTED_VM) ||	\
+	defined(CONFIG_KVM_INTEL_TDX) ||	\
+	defined(CONFIG_KVM_AMD_SEV)
 #define kvm_arch_has_private_mem(kvm) ((kvm)->arch.has_private_mem)
 #endif
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 201d0f2143976..d370e834d619e 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -722,7 +722,7 @@ static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu)
 }
 #endif
 
-#ifndef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
+#ifndef kvm_arch_has_private_mem
 static inline bool kvm_arch_has_private_mem(struct kvm *kvm)
 {
 	return false;

-- 
2.55.0.rc0.738.g0c8ab3ebcc-goog




* [PATCH v8 03/46] KVM: Move KVM_VM_MEMORY_ATTRIBUTES config definition to x86
From: Ackerley Tng via B4 Relay @ 2026-06-19  0:31 UTC (permalink / raw)
In-Reply-To: <20260618-gmem-inplace-conversion-v8-0-9d2959357853@google.com>

From: Sean Christopherson <seanjc@google.com>

Bury KVM_VM_MEMORY_ATTRIBUTES in x86 to discourage other architectures
from adding support for per-VM memory attributes, because tracking private
vs. shared memory on a per-VM basis is now deprecated in favor of tracking
on a per-guest_memfd basis, and while RWX memory attributes are on the
horizon, they too are expected to be x86-only.

This will also allow modifying KVM_VM_MEMORY_ATTRIBUTES to be
user-selectable (in x86) without creating weirdness in KVM's Kconfigs.
Now that guest_memfd supports in-place conversions, it's entirely possible
to run x86 CoCo VMs without support for KVM_VM_MEMORY_ATTRIBUTES.

Leave the code itself in common KVM so that it's trivial to undo this
change if new per-VM attributes do come along.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 arch/x86/kvm/Kconfig | 3 +++
 virt/kvm/Kconfig     | 3 ---
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 26f6afd51bbdc..24f96396cfa1c 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -80,6 +80,9 @@ config KVM_WERROR
 
 	  If in doubt, say "N".
 
+config KVM_VM_MEMORY_ATTRIBUTES
+	bool
+
 config KVM_SW_PROTECTED_VM
 	bool "Enable support for KVM software-protected VMs"
 	depends on EXPERT
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 5119cb37145fc..297e4399fbd49 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -100,9 +100,6 @@ config KVM_ELIDE_TLB_FLUSH_IF_YOUNG
 config KVM_MMU_LOCKLESS_AGING
        bool
 
-config KVM_VM_MEMORY_ATTRIBUTES
-       bool
-
 config KVM_GUEST_MEMFD
        select XARRAY_MULTI
        bool

-- 
2.55.0.rc0.738.g0c8ab3ebcc-goog




* [PATCH v8 01/46] KVM: guest_memfd: Introduce per-gmem attributes, use to guard user mappings
From: Ackerley Tng via B4 Relay @ 2026-06-19  0:31 UTC (permalink / raw)
In-Reply-To: <20260618-gmem-inplace-conversion-v8-0-9d2959357853@google.com>

From: Sean Christopherson <seanjc@google.com>

Start plumbing in guest_memfd support for in-place private<=>shared
conversions by tracking attributes via a maple tree.  KVM currently tracks
private vs. shared attributes on a per-VM basis, which made sense when a
guest_memfd _only_ supported private memory, but tracking per-VM simply
can't work for in-place conversions as the shared/private status of a given
page needs to be per-gmem_inode, not per-VM.

Use the filemap invalidation lock to protect the maple tree, as taking the
lock for read when faulting in memory (for userspace or the guest) isn't
expected to result in meaningful contention, and using a separate lock
would add significant complexity (avoiding deadlock is quite difficult).

Co-developed-by: Vishal Annapurve <vannapurve@google.com>
Signed-off-by: Vishal Annapurve <vannapurve@google.com>
Co-developed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
Co-developed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
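Note for readers: the per-inode tracking described above can be modelled
in userspace C.  This sketch makes several simplifying assumptions: a
flat array with one entry per page index replaces the maple tree, there
is no locking, and struct gmem_model, the gmem_model_*() helpers and the
MODEL_* constants are hypothetical names used only for illustration.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Model values for the flag and attribute bits (assumptions). */
#define MODEL_FLAG_INIT_SHARED	(1ull << 1)
#define MODEL_ATTR_PRIVATE	(1ull << 3)

struct gmem_model {
	uint64_t *attrs;	/* one entry per page index */
	size_t nr_pages;
};

/*
 * Give every index a default attribute at creation time, so a lookup
 * never has to deal with a missing entry.
 */
static int gmem_model_init(struct gmem_model *g, size_t nr_pages,
			   uint64_t flags)
{
	uint64_t def = (flags & MODEL_FLAG_INIT_SHARED) ? 0 : MODEL_ATTR_PRIVATE;
	size_t i;

	g->attrs = malloc(nr_pages * sizeof(*g->attrs));
	if (!g->attrs)
		return -ENOMEM;

	g->nr_pages = nr_pages;
	for (i = 0; i < nr_pages; i++)
		g->attrs[i] = def;
	return 0;
}

static int gmem_model_is_shared(const struct gmem_model *g, size_t index)
{
	return !(g->attrs[index] & MODEL_ATTR_PRIVATE);
}

/* Userspace faults are allowed only on indices that are currently shared. */
static int gmem_model_fault_user(const struct gmem_model *g, size_t index)
{
	if (index >= g->nr_pages)
		return -EFAULT;	/* the model's stand-in for SIGBUS */

	return gmem_model_is_shared(g, index) ? 0 : -EACCES;
}
```

Because the decision is made per index in the file rather than per VM, a
single page can flip between shared and private without affecting its
neighbours, which is what in-place conversion needs.
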
 virt/kvm/guest_memfd.c | 133 +++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 117 insertions(+), 16 deletions(-)

diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 86690683b2fe3..b4c24fdf159f6 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -4,6 +4,7 @@
 #include <linux/falloc.h>
 #include <linux/fs.h>
 #include <linux/kvm_host.h>
+#include <linux/maple_tree.h>
 #include <linux/mempolicy.h>
 #include <linux/pseudo_fs.h>
 #include <linux/pagemap.h>
@@ -33,6 +34,13 @@ struct gmem_inode {
 	struct list_head gmem_file_list;
 
 	u64 flags;
+	/*
+	 * Every index in this inode, whether memory is populated or
+	 * not, is tracked in attributes. The entire range of indices,
+	 * corresponding to the size of this inode, is represented in
+	 * this maple tree.
+	 */
+	struct maple_tree attributes;
 };
 
 static __always_inline struct gmem_inode *GMEM_I(struct inode *inode)
@@ -60,6 +68,24 @@ static pgoff_t kvm_gmem_get_index(struct kvm_memory_slot *slot, gfn_t gfn)
 	return gfn - slot->base_gfn + slot->gmem.pgoff;
 }
 
+static u64 kvm_gmem_get_attributes(struct inode *inode, pgoff_t index)
+{
+	struct maple_tree *mt = &GMEM_I(inode)->attributes;
+	void *entry = mtree_load(mt, index);
+
+	return WARN_ON_ONCE(!entry) ? 0 : xa_to_value(entry);
+}
+
+static bool kvm_gmem_is_private_mem(struct inode *inode, pgoff_t index)
+{
+	return kvm_gmem_get_attributes(inode, index) & KVM_MEMORY_ATTRIBUTE_PRIVATE;
+}
+
+static bool kvm_gmem_is_shared_mem(struct inode *inode, pgoff_t index)
+{
+	return !kvm_gmem_is_private_mem(inode, index);
+}
+
 static int __kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slot,
 				    pgoff_t index, struct folio *folio)
 {
@@ -397,10 +423,13 @@ static vm_fault_t kvm_gmem_fault_user_mapping(struct vm_fault *vmf)
 	if (((loff_t)vmf->pgoff << PAGE_SHIFT) >= i_size_read(inode))
 		return VM_FAULT_SIGBUS;
 
-	if (!(GMEM_I(inode)->flags & GUEST_MEMFD_FLAG_INIT_SHARED))
-		return VM_FAULT_SIGBUS;
+	filemap_invalidate_lock_shared(inode->i_mapping);
+	if (kvm_gmem_is_shared_mem(inode, vmf->pgoff))
+		folio = kvm_gmem_get_folio(inode, vmf->pgoff);
+	else
+		folio = ERR_PTR(-EACCES);
+	filemap_invalidate_unlock_shared(inode->i_mapping);
 
-	folio = kvm_gmem_get_folio(inode, vmf->pgoff);
 	if (IS_ERR(folio)) {
 		if (PTR_ERR(folio) == -EAGAIN)
 			return VM_FAULT_RETRY;
@@ -557,6 +586,51 @@ bool __weak kvm_arch_supports_gmem_init_shared(struct kvm *kvm)
 	return true;
 }
 
+static int kvm_gmem_init_inode(struct inode *inode, loff_t size, u64 flags)
+{
+	struct gmem_inode *gi = GMEM_I(inode);
+	MA_STATE(mas, &gi->attributes, 0, (size >> PAGE_SHIFT) - 1);
+	u64 attrs;
+	int r;
+
+	inode->i_op = &kvm_gmem_iops;
+	inode->i_mapping->a_ops = &kvm_gmem_aops;
+	inode->i_mode |= S_IFREG;
+	inode->i_size = size;
+	mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
+
+	/*
+	 * guest_memfd memory is neither migratable nor swappable: set
+	 * inaccessible to gate off both.
+	 */
+	mapping_set_inaccessible(inode->i_mapping);
+	WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
+
+	gi->flags = flags;
+
+	mt_set_external_lock(&gi->attributes,
+			     &inode->i_mapping->invalidate_lock);
+
+	/*
+	 * Store default attributes for the entire gmem instance. Ensuring every
+	 * index is represented in the maple tree at all times simplifies the
+	 * conversion and merging logic.
+	 */
+	attrs = gi->flags & GUEST_MEMFD_FLAG_INIT_SHARED ? 0 : KVM_MEMORY_ATTRIBUTE_PRIVATE;
+
+	/*
+	 * Acquire the invalidation lock purely to make lockdep happy.  The
+	 * maple tree library expects all stores to be protected via the lock,
+	 * and the library can't know when the tree is reachable only by the
+	 * caller, as is the case here.
+	 */
+	filemap_invalidate_lock(inode->i_mapping);
+	r = mas_store_gfp(&mas, xa_mk_value(attrs), GFP_KERNEL);
+	filemap_invalidate_unlock(inode->i_mapping);
+
+	return r;
+}
+
 static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
 {
 	static const char *name = "[kvm-gmem]";
@@ -587,16 +661,9 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
 		goto err_fops;
 	}
 
-	inode->i_op = &kvm_gmem_iops;
-	inode->i_mapping->a_ops = &kvm_gmem_aops;
-	inode->i_mode |= S_IFREG;
-	inode->i_size = size;
-	mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
-	mapping_set_inaccessible(inode->i_mapping);
-	/* Unmovable mappings are supposed to be marked unevictable as well. */
-	WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
-
-	GMEM_I(inode)->flags = flags;
+	err = kvm_gmem_init_inode(inode, size, flags);
+	if (err)
+		goto err_inode;
 
 	file = alloc_file_pseudo(inode, kvm_gmem_mnt, name, O_RDWR, &kvm_gmem_fops);
 	if (IS_ERR(file)) {
@@ -799,9 +866,13 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
 	if (!file)
 		return -EFAULT;
 
+	filemap_invalidate_lock_shared(file_inode(file)->i_mapping);
+
 	folio = __kvm_gmem_get_pfn(file, slot, index, pfn, max_order);
-	if (IS_ERR(folio))
-		return PTR_ERR(folio);
+	if (IS_ERR(folio)) {
+		r = PTR_ERR(folio);
+		goto out;
+	}
 
 	if (!folio_test_uptodate(folio)) {
 		clear_highpage(folio_page(folio, 0));
@@ -817,6 +888,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
 	else
 		folio_put(folio);
 
+out:
+	filemap_invalidate_unlock_shared(file_inode(file)->i_mapping);
 	return r;
 }
 EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_gmem_get_pfn);
@@ -948,6 +1021,15 @@ static struct inode *kvm_gmem_alloc_inode(struct super_block *sb)
 
 	mpol_shared_policy_init(&gi->policy, NULL);
 
+	/*
+	 * Memory attributes are protected by the filemap invalidation lock, but
+	 * the lock structure isn't available at this time.  Immediately mark
+	 * maple tree as using external locking so that accessing the tree
+	 * before it's fully initialized results in NULL pointer dereferences
+	 * and not more subtle bugs.
+	 */
+	mt_init_flags(&gi->attributes, MT_FLAGS_LOCK_EXTERN | MT_FLAGS_USE_RCU);
+
 	gi->flags = 0;
 	INIT_LIST_HEAD(&gi->gmem_file_list);
 	return &gi->vfs_inode;
@@ -955,7 +1037,26 @@ static struct inode *kvm_gmem_alloc_inode(struct super_block *sb)
 
 static void kvm_gmem_destroy_inode(struct inode *inode)
 {
-	mpol_free_shared_policy(&GMEM_I(inode)->policy);
+	struct gmem_inode *gi = GMEM_I(inode);
+
+	mpol_free_shared_policy(&gi->policy);
+
+	/*
+	 * Note!  Checking for an empty tree is functionally necessary
+	 * to avoid explosions if the tree hasn't been fully
+	 * initialized, i.e. if the inode is being destroyed before
+	 * guest_memfd can set the external lock, lockdep would find
+	 * that the tree's internal ma_lock was not held.
+	 */
+	if (!mtree_empty(&gi->attributes)) {
+		/*
+		 * Acquire the invalidation lock purely to make lockdep happy,
+		 * the inode is unreachable at this point.
+		 */
+		filemap_invalidate_lock(inode->i_mapping);
+		__mt_destroy(&gi->attributes);
+		filemap_invalidate_unlock(inode->i_mapping);
+	}
 }
 
 static void kvm_gmem_free_inode(struct inode *inode)

-- 
2.55.0.rc0.738.g0c8ab3ebcc-goog




* [PATCH v8 02/46] KVM: Rename KVM_GENERIC_MEMORY_ATTRIBUTES to KVM_VM_MEMORY_ATTRIBUTES
From: Ackerley Tng via B4 Relay @ 2026-06-19  0:31 UTC (permalink / raw)
  To: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
	wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
	aneesh.kumar, liam, Paolo Bonzini, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen, Yuanchu Xie,
	Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt, Kiryl Shutsemau,
	Baoquan He, Jason Gunthorpe, Vlastimil Babka, Baoquan He
  Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco, Ackerley Tng
In-Reply-To: <20260618-gmem-inplace-conversion-v8-0-9d2959357853@google.com>

From: Sean Christopherson <seanjc@google.com>

Rename the per-VM memory attributes Kconfig to make it explicitly about
per-VM attributes in anticipation of adding memory attributes support to
guest_memfd, at which point it will be possible (and desirable) to have
memory attributes without the per-VM support, even on x86.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 arch/x86/include/asm/kvm_host.h |  2 +-
 arch/x86/kvm/Kconfig            |  6 +++---
 arch/x86/kvm/mmu/mmu.c          |  2 +-
 arch/x86/kvm/x86.c              |  2 +-
 include/linux/kvm_host.h        |  8 ++++----
 include/trace/events/kvm.h      |  4 ++--
 virt/kvm/Kconfig                |  2 +-
 virt/kvm/kvm_main.c             | 14 +++++++-------
 8 files changed, 20 insertions(+), 20 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index eee473717c0e5..8e8eb8a5e8a6b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2394,7 +2394,7 @@ void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
 		       int tdp_max_root_level, int tdp_huge_page_level);
 
 
-#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
+#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
 #define kvm_arch_has_private_mem(kvm) ((kvm)->arch.has_private_mem)
 #endif
 
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 801bf9e520db3..26f6afd51bbdc 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -84,7 +84,7 @@ config KVM_SW_PROTECTED_VM
 	bool "Enable support for KVM software-protected VMs"
 	depends on EXPERT
 	depends on KVM_X86 && X86_64
-	select KVM_GENERIC_MEMORY_ATTRIBUTES
+	select KVM_VM_MEMORY_ATTRIBUTES
 	help
 	  Enable support for KVM software-protected VMs.  Currently, software-
 	  protected VMs are purely a development and testing vehicle for
@@ -135,7 +135,7 @@ config KVM_INTEL_TDX
 	bool "Intel Trust Domain Extensions (TDX) support"
 	default y
 	depends on INTEL_TDX_HOST
-	select KVM_GENERIC_MEMORY_ATTRIBUTES
+	select KVM_VM_MEMORY_ATTRIBUTES
 	select HAVE_KVM_ARCH_GMEM_POPULATE
 	help
 	  Provides support for launching Intel Trust Domain Extensions (TDX)
@@ -159,7 +159,7 @@ config KVM_AMD_SEV
 	depends on KVM_AMD && X86_64
 	depends on CRYPTO_DEV_SP_PSP && !(KVM_AMD=y && CRYPTO_DEV_CCP_DD=m)
 	select ARCH_HAS_CC_PLATFORM
-	select KVM_GENERIC_MEMORY_ATTRIBUTES
+	select KVM_VM_MEMORY_ATTRIBUTES
 	select HAVE_KVM_ARCH_GMEM_PREPARE
 	select HAVE_KVM_ARCH_GMEM_INVALIDATE
 	select HAVE_KVM_ARCH_GMEM_POPULATE
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 26ed97efda919..e0005a21b6e22 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -7998,7 +7998,7 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
 		vhost_task_stop(kvm->arch.nx_huge_page_recovery_thread);
 }
 
-#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
+#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
 static bool hugepage_test_mixed(struct kvm_memory_slot *slot, gfn_t gfn,
 				int level)
 {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d9d51803b7b20..2fde594e86d72 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13569,7 +13569,7 @@ static int kvm_alloc_memslot_metadata(struct kvm *kvm,
 		}
 	}
 
-#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
+#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
 	kvm_mmu_init_memslot_memory_attributes(kvm, slot);
 #endif
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index ab8cfaec82d31..201d0f2143976 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -722,7 +722,7 @@ static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu)
 }
 #endif
 
-#ifndef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
+#ifndef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
 static inline bool kvm_arch_has_private_mem(struct kvm *kvm)
 {
 	return false;
@@ -871,7 +871,7 @@ struct kvm {
 #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
 	struct notifier_block pm_notifier;
 #endif
-#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
+#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
 	/* Protected by slots_lock (for writes) and RCU (for reads) */
 	struct xarray mem_attr_array;
 #endif
@@ -2533,7 +2533,7 @@ static inline bool kvm_memslot_is_gmem_only(const struct kvm_memory_slot *slot)
 	return slot->flags & KVM_MEMSLOT_GMEM_ONLY;
 }
 
-#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
+#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
 static inline unsigned long kvm_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
 {
 	return xa_to_value(xa_load(&kvm->mem_attr_array, gfn));
@@ -2555,7 +2555,7 @@ static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
 {
 	return false;
 }
-#endif /* CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES */
+#endif /* CONFIG_KVM_VM_MEMORY_ATTRIBUTES */
 
 #ifdef CONFIG_KVM_GUEST_MEMFD
 int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
diff --git a/include/trace/events/kvm.h b/include/trace/events/kvm.h
index b282e3a867696..1ba72bd73ea2f 100644
--- a/include/trace/events/kvm.h
+++ b/include/trace/events/kvm.h
@@ -358,7 +358,7 @@ TRACE_EVENT(kvm_dirty_ring_exit,
 	TP_printk("vcpu %d", __entry->vcpu_id)
 );
 
-#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
+#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
 /*
  * @start:	Starting address of guest memory range
  * @end:	End address of guest memory range
@@ -383,7 +383,7 @@ TRACE_EVENT(kvm_vm_set_mem_attributes,
 	TP_printk("%#016llx -- %#016llx [0x%lx]",
 		  __entry->start, __entry->end, __entry->attr)
 );
-#endif /* CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES */
+#endif /* CONFIG_KVM_VM_MEMORY_ATTRIBUTES */
 
 TRACE_EVENT(kvm_unmap_hva_range,
 	TP_PROTO(unsigned long start, unsigned long end),
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 794976b88c6f9..5119cb37145fc 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -100,7 +100,7 @@ config KVM_ELIDE_TLB_FLUSH_IF_YOUNG
 config KVM_MMU_LOCKLESS_AGING
        bool
 
-config KVM_GENERIC_MEMORY_ATTRIBUTES
+config KVM_VM_MEMORY_ATTRIBUTES
        bool
 
 config KVM_GUEST_MEMFD
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index e44c20c049610..1ccc4895a4c26 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1115,7 +1115,7 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
 	spin_lock_init(&kvm->mn_invalidate_lock);
 	rcuwait_init(&kvm->mn_memslots_update_rcuwait);
 	xa_init(&kvm->vcpu_array);
-#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
+#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
 	xa_init(&kvm->mem_attr_array);
 #endif
 
@@ -1300,7 +1300,7 @@ static void kvm_destroy_vm(struct kvm *kvm)
 	cleanup_srcu_struct(&kvm->irq_srcu);
 	srcu_barrier(&kvm->srcu);
 	cleanup_srcu_struct(&kvm->srcu);
-#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
+#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
 	xa_destroy(&kvm->mem_attr_array);
 #endif
 	kvm_arch_free_vm(kvm);
@@ -2418,7 +2418,7 @@ static int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
 }
 #endif /* CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT */
 
-#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
+#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
 static u64 kvm_supported_mem_attributes(struct kvm *kvm)
 {
 	if (!kvm || kvm_arch_has_private_mem(kvm))
@@ -2623,7 +2623,7 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
 
 	return kvm_vm_set_mem_attributes(kvm, start, end, attrs->attributes);
 }
-#endif /* CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES */
+#endif /* CONFIG_KVM_VM_MEMORY_ATTRIBUTES */
 
 struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn)
 {
@@ -4922,7 +4922,7 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
 	case KVM_CAP_SYSTEM_EVENT_DATA:
 	case KVM_CAP_DEVICE_CTRL:
 		return 1;
-#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
+#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
 	case KVM_CAP_MEMORY_ATTRIBUTES:
 		return kvm_supported_mem_attributes(kvm);
 #endif
@@ -5326,7 +5326,7 @@ static long kvm_vm_ioctl(struct file *filp,
 		break;
 	}
 #endif /* CONFIG_HAVE_KVM_IRQ_ROUTING */
-#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
+#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
 	case KVM_SET_MEMORY_ATTRIBUTES: {
 		struct kvm_memory_attributes attrs;
 
@@ -5337,7 +5337,7 @@ static long kvm_vm_ioctl(struct file *filp,
 		r = kvm_vm_ioctl_set_mem_attributes(kvm, &attrs);
 		break;
 	}
-#endif /* CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES */
+#endif /* CONFIG_KVM_VM_MEMORY_ATTRIBUTES */
 	case KVM_CREATE_DEVICE: {
 		struct kvm_create_device cd;
 

-- 
2.55.0.rc0.738.g0c8ab3ebcc-goog




* [PATCH v8 00/46] guest_memfd: In-place conversion support
From: Ackerley Tng via B4 Relay @ 2026-06-19  0:31 UTC (permalink / raw)
  To: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
	wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
	aneesh.kumar, liam, Paolo Bonzini, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen, Yuanchu Xie,
	Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt, Kiryl Shutsemau,
	Baoquan He, Jason Gunthorpe, Vlastimil Babka, Baoquan He
  Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco, Ackerley Tng

This is v8 of guest_memfd in-place conversion support.

Until now, guest_memfd has only supported using an entire inode's worth of
memory as either all-shared or all-private. CoCo VMs may request that guest
memory be converted between private and shared states, and the only way to
support that currently is to have the userspace VMM provide two sources of
backing memory from completely different areas of physical memory.

pKVM has a use case for in-place sharing: the guest and host may be
cooperating on given data, and pKVM doesn't protect data through
encryption, so copying that given data between different areas of physical
memory as part of conversions would be unnecessary work.

This series also serves as a foundation for guest_memfd huge page
support. Currently, guest_memfd only supports PAGE_SIZE pages, so if two
sources of backing memory are used, the userspace VMM can keep total memory
utilization steady by punching out the pages that are not in use. Once huge
pages are available in guest_memfd, even if the backing memory source
supports hole punching within a huge page, punching out pages to bound a
VM's total memory utilization would introduce a lot of fragmentation.

In-place conversion avoids fragmentation by allowing the same physical
memory to be used for both shared and private memory, with guest_memfd
tracking the shared/private status of all the pages at a per-page
granularity.

The central principle, which guest_memfd continues to uphold, is that any
guest-private page will not be mappable to host userspace. All pages will
be mmap()-able in host userspace, but accesses to guest-private pages (as
tracked by guest_memfd) will result in a SIGBUS.
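
To make the principle above concrete, here is a minimal userspace sketch
(not code from this series) of per-page shared/private tracking. All
sketch_* names and the fixed SKETCH_NR_PAGES size are hypothetical. The
series itself keeps attributes in the guest_memfd inode and raises SIGBUS
on a host access to a private page; the sketch stands in for that with a
flat array and a plain error return.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model only; guest_memfd does not use these names. */
#define SKETCH_NR_PAGES 16

struct sketch_gmem {
	/* One entry per page: true if the page is guest-private. */
	bool is_private[SKETCH_NR_PAGES];
};

/* Convert pages [start, start + nr) to private or shared. */
static int sketch_set_attributes(struct sketch_gmem *g, size_t start,
				 size_t nr, bool make_private)
{
	assert(g);

	/* Reject ranges that run past the end, without overflowing. */
	if (start > SKETCH_NR_PAGES || nr > SKETCH_NR_PAGES - start)
		return -EINVAL;

	for (size_t i = start; i < start + nr; i++)
		g->is_private[i] = make_private;
	return 0;
}

/*
 * Model a host userspace access. Every page is mappable, but touching a
 * private page fails; -EFAULT here stands in for the SIGBUS.
 */
static int sketch_host_access(const struct sketch_gmem *g, size_t index)
{
	assert(g);

	if (index >= SKETCH_NR_PAGES)
		return -EINVAL;
	return g->is_private[index] ? -EFAULT : 0;
}
```

With a zero-initialized struct sketch_gmem (all pages shared), converting
pages 2..5 to private makes sketch_host_access() fail for exactly those
pages and keep succeeding for their neighbours, which is the per-page
granularity described above.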

This series introduces a guest_memfd ioctl (not kvm, vm or vcpu, but
guest_memfd ioctl) that allows userspace to set memory
attributes (shared/private) directly through the guest_memfd. This is the
appropriate interface because shared/private-ness is a property of memory
and hence the request should be sent directly to the memory provider -
guest_memfd.

Tested with both CONFIG_KVM_VM_MEMORY_ATTRIBUTES enabled and disabled:

+ tools/testing/selftests/kvm/guest_memfd_test.c
+ tools/testing/selftests/kvm/pre_fault_memory_test.c
+ tools/testing/selftests/kvm/x86/guest_memfd_conversions_test.c
+ tools/testing/selftests/kvm/x86/private_mem_conversions_test.c
+ tools/testing/selftests/kvm/x86/private_mem_kvm_exits_test.c

Updates for this revision:

+ Updated the series to _not_ deprecate all of VM memory attributes, but
  only deprecate tracking of the PRIVATE attribute in VM memory
  attributes. This takes into account upcoming RWX attributes support,
  which will be tracked at the VM level.
+ Reshuffled the earlier commits that deal with preparing KVM to stop
  seeing VM memory attributes as the only source of attributes.
+ Addressed comments from v7

TODOs

+ Retest with TDX selftests. v7 was tested with TDX [12], but the setup there was
  wrong. Conversions were successful (no errors), but the shared memory being
  tested was actually in a completely different host physical page.
+ Retest with SNP selftests. v6 was tested with SNP; I ported that to v7
  and those ran fine too. Just need to double-check for v8.

This series is based on kvm-x86/next, and here's the tree for your convenience:

https://github.com/googleprodkernel/linux-cc/commits/guest_memfd-inplace-conversion-v8

Older series:

+ RFCv7 is at [11]
+ RFCv6 is at [10]
+ RFCv5 is at [8]
+ RFCv4 is at [7]
+ RFCv3 is at [6]
+ RFCv2 is at [5]
+ RFCv1 is at [4]
+ Previous versions of this feature, part of other series, are available at
  [1][2][3].

[1] https://lore.kernel.org/all/bd163de3118b626d1005aa88e71ef2fb72f0be0f.1726009989.git.ackerleytng@google.com/
[2] https://lore.kernel.org/all/20250117163001.2326672-6-tabba@google.com/
[3] https://lore.kernel.org/all/b784326e9ccae6a08388f1bf39db70a2204bdc51.1747264138.git.ackerleytng@google.com/
[4] https://lore.kernel.org/all/cover.1760731772.git.ackerleytng@google.com/T/
[5] https://lore.kernel.org/all/cover.1770071243.git.ackerleytng@google.com/T/
[6] https://lore.kernel.org/r/20260313-gmem-inplace-conversion-v3-0-5fc12a70ec89@google.com/T/
[7] https://lore.kernel.org/all/20260326-gmem-inplace-conversion-v4-0-e202fe950ffd@google.com/T/
[8] https://lore.kernel.org/r/20260428-gmem-inplace-conversion-v5-0-d8608ccfca22@google.com
[9] https://lore.kernel.org/all/20260414-selftest-global-metadata-v1-0-fd223922bc57@google.com/T/
[10] https://lore.kernel.org/r/20260507-gmem-inplace-conversion-v6-0-91ab5a8b19a4@google.com
[11] https://lore.kernel.org/r/20260522-gmem-inplace-conversion-v7-0-2f0fae496530@google.com
[12] https://lore.kernel.org/all/20260605134153.204152-1-ackerleytng@google.com/

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
Ackerley Tng (27):
      KVM: Make CONFIG_KVM_VM_MEMORY_ATTRIBUTES selectable
      KVM: Enumerate support for PRIVATE memory iff kvm_arch_has_private_mem is defined
      KVM: guest_memfd: Introduce function to check GFN private/shared status
      KVM: guest_memfd: Only prepare folios for private pages
      KVM: guest_memfd: Add base support for KVM_SET_MEMORY_ATTRIBUTES2
      KVM: guest_memfd: Ensure pages are not in use before conversion
      KVM: guest_memfd: Call arch invalidate hooks on conversion
      KVM: guest_memfd: Return early if range already has requested attributes
      KVM: guest_memfd: Advertise KVM_SET_MEMORY_ATTRIBUTES2 ioctl
      KVM: guest_memfd: Handle lru_add fbatch refcounts during conversion safety check
      KVM: guest_memfd: Use actual size for invalidation in kvm_gmem_release()
      KVM: guest_memfd: Determine invalidation filter from memory attributes
      KVM: guest_memfd: Zero page while getting pfn
      KVM: TDX: Make source page optional for KVM_TDX_INIT_MEM_REGION
      KVM: guest_memfd: Make in-place conversion the default
      KVM: selftests: Test basic single-page conversion flow
      KVM: selftests: Test conversion flow when INIT_SHARED
      KVM: selftests: Test conversion precision in guest_memfd
      KVM: selftests: Test conversion before allocation
      KVM: selftests: Convert with allocated folios in different layouts
      KVM: selftests: Test that truncation does not change shared/private status
      KVM: selftests: Add helpers to pin pages with CONFIG_GUP_TEST
      KVM: selftests: Test conversion with elevated page refcount
      KVM: selftests: Reset shared memory after hole-punching
      KVM: selftests: Provide function to look up guest_memfd details from gpa
      KVM: selftests: Make TEST_EXPECT_SIGBUS thread-safe
      KVM: selftests: Update private_mem_conversions_test to mmap() guest_memfd

Michael Roth (1):
      KVM: SEV: Make 'uaddr' parameter optional for KVM_SEV_SNP_LAUNCH_UPDATE

Sean Christopherson (18):
      KVM: guest_memfd: Introduce per-gmem attributes, use to guard user mappings
      KVM: Rename KVM_GENERIC_MEMORY_ATTRIBUTES to KVM_VM_MEMORY_ATTRIBUTES
      KVM: Move KVM_VM_MEMORY_ATTRIBUTES config definition to x86
      KVM: Decouple kvm_has_arch_private_mem from CONFIG_KVM_VM_MEMORY_ATTRIBUTES
      KVM: Rename memory attribute APIs to prepare for in-place gmem conversion
      KVM: Provide generic interface for checking memory private/shared status
      KVM: guest_memfd: Wire up core private/shared attribute interfaces
      KVM: Consolidate private memory and guest_memfd ifdeffery in kvm_host.h
      KVM: guest_memfd: Enable INIT_SHARED on guest_memfd for x86 Coco VMs
      KVM: selftests: Create gmem fd before "regular" fd when adding memslot
      KVM: selftests: Rename guest_memfd{,_offset} to gmem_{fd,offset}
      KVM: selftests: Add support for mmap() on guest_memfd in core library
      KVM: selftests: Add selftests global for guest memory attributes capability
      KVM: selftests: Add helpers for calling ioctls on guest_memfd
      KVM: selftests: Test that shared/private status is consistent across processes
      KVM: selftests: Provide common function to set memory attributes
      KVM: selftests: Check fd/flags provided to mmap() when setting up memslot
      KVM: selftests: Update private memory exits test to work with per-gmem attributes

 Documentation/virt/kvm/api.rst                     |  78 +++-
 .../virt/kvm/x86/amd-memory-encryption.rst         |  13 +-
 Documentation/virt/kvm/x86/intel-tdx.rst           |   4 +
 arch/x86/include/asm/kvm_host.h                    |   4 +-
 arch/x86/kvm/Kconfig                               |  15 +-
 arch/x86/kvm/mmu/mmu.c                             |   8 +-
 arch/x86/kvm/svm/sev.c                             |  16 +-
 arch/x86/kvm/vmx/tdx.c                             |  11 +-
 arch/x86/kvm/x86.c                                 |  15 +-
 include/linux/kvm_host.h                           |  74 +--
 include/trace/events/kvm.h                         |   4 +-
 include/uapi/linux/kvm.h                           |  16 +
 mm/swap.c                                          |   2 +
 tools/testing/selftests/kvm/Makefile.kvm           |   1 +
 tools/testing/selftests/kvm/include/kvm_util.h     | 139 +++++-
 tools/testing/selftests/kvm/include/test_util.h    |  34 +-
 tools/testing/selftests/kvm/lib/kvm_util.c         | 164 ++++---
 tools/testing/selftests/kvm/lib/test_util.c        |   7 -
 .../kvm/x86/guest_memfd_conversions_test.c         | 509 +++++++++++++++++++++
 .../kvm/x86/private_mem_conversions_test.c         |  53 ++-
 .../selftests/kvm/x86/private_mem_kvm_exits_test.c |  36 +-
 virt/kvm/Kconfig                                   |   4 +-
 virt/kvm/guest_memfd.c                             | 474 +++++++++++++++++--
 virt/kvm/kvm_main.c                                |  86 +++-
 24 files changed, 1547 insertions(+), 220 deletions(-)
---
base-commit: b7fbe9a1bf9ee6c967ef77d366ca58c35fcf1887
change-id: 20260225-gmem-inplace-conversion-bd0dbd39753a

Best regards,
--
Ackerley Tng <ackerleytng@google.com>




* Re: [PATCH v7 10/42] KVM: guest_memfd: Ensure pages are not in use before conversion
From: Ackerley Tng @ 2026-06-19  0:17 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE), aik, andrew.jones, binbin.wu, brauner,
	chao.p.peng, david, ira.weiny, jmattson, jthoughton, michael.roth,
	oupton, pankaj.gupta, qperret, rick.p.edgecombe, rientjes,
	shivankg, steven.price, tabba, willy, wyihan, yan.y.zhao,
	forkloop, pratyush, suzuki.poulose, aneesh.kumar, liam,
	Paolo Bonzini, Sean Christopherson, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
	Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe
  Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <509f9a66-5ae9-4c05-bef1-ced89fd29bf0@kernel.org>

"Vlastimil Babka (SUSE)" <vbabka@kernel.org> writes:

> On 5/23/26 02:17, Ackerley Tng via B4 Relay wrote:
>> From: Ackerley Tng <ackerleytng@google.com>
>>
>> When converting memory to private in guest_memfd, it is necessary to ensure
>> that the pages are not currently being accessed by any other part of the
>> kernel or userspace to avoid any current user writing to guest private
>> memory.
>>
>> guest_memfd checks for unexpected refcounts to determine whether a page is
>> still in use. The only expected refcounts after unmapping the range
>> requested for conversion are those that are held by guest_memfd itself.
>
> Is it sufficient to only check, and not also freeze the refcount? (i.e.
> using folio_ref_freeze()), because without freezing, anything (e.g.
> compaction's pfn-based scanner) could do a speculative folio_try_get() and
> the checked refcount becomes stale.
>

I believe there's no issue here, since the main thing here is to check
for long-term pins on the folio. Perhaps David can help me verify. :)

> Might be ok if we know that no such speculative increment can result in
> actually touching the page contents, and the extra refcount and something
> inspecting the struct folio won't interfere with anything else. Then it
> could be just a comment mentioning why it's safe.
>

In this series guest_memfd doesn't change anything in folio metadata,
guest_memfd only updates the attributes tracked in the guest_memfd
inode, and updates the RMP table for SNP.

With the upcoming huge page support, guest_memfd needs to split/merge
the folio, which means updates to folio metadata. That will need a
closer look.

I haven't added the comment, mostly because it's a long weekend here and
I'd like to get Sashiko to run on it over the weekend. We should
definitely continue this discussion on v8!

> IIRC the compaction's scanning can result in a migration here so it's
> probably ok?
>

Migration isn't supported for guest_memfd yet, so I think that's ok.

>> Update the kvm_memory_attributes2 structure to include an error_offset
>> field. This allows KVM to report the exact offset where a conversion
>> failed to userspace. If the safety check fails, return -EAGAIN and copy
>> the error_offset back to userspace so that it can potentially retry the
>> operation or handle the failure gracefully.
>>
>> Suggested-by: David Hildenbrand <david@kernel.org>
>> Co-developed-by: Vishal Annapurve <vannapurve@google.com>
>> Signed-off-by: Vishal Annapurve <vannapurve@google.com>
>> Reviewed-by: Fuad Tabba <tabba@google.com>
>> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
>>
>> [...snip...]
>>


* Re: [PATCH] tracing/user_events: fix use-after-free of enabler in user_event_mm_dup()
From: Beau Belgrave @ 2026-06-19  0:12 UTC (permalink / raw)
  To: Michael Bommarito
  Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	linux-trace-kernel, linux-kernel, stable
In-Reply-To: <20260618222743.538915-1-michael.bommarito@gmail.com>

On Thu, Jun 18, 2026 at 06:27:43PM -0400, Michael Bommarito wrote:
> user_event_enabler_destroy() removes an enabler from the per-mm
> mm->enablers list with list_del_rcu() and then frees it immediately with
> kfree(). That list is walked locklessly by user_event_mm_dup() during
> fork(), under rcu_read_lock() only:
> 
> 	rcu_read_lock();
> 	list_for_each_entry_rcu(enabler, &old_mm->enablers, mm_enablers_link)
> 		...
> 
> user_event_mm_dup() does not take event_mutex. The per-enabler destroy
> path user_events_ioctl_unreg() (DIAG_IOCSUNREG) takes event_mutex but
> nothing that excludes the dup walk. Threads that share an mm share one
> user_event_mm and one enabler list, so an unregister on one thread can
> free an enabler while another thread is forking and user_event_mm_dup()
> is mid-walk. The walk then dereferences the freed enabler (for example
> enabler->event in user_event_enabler_dup()).
> 
> This is reachable by an unprivileged task that can open user_events_data:
> a single multithreaded process that registers an enabler and then
> concurrently unregisters it and calls fork() triggers the race. KASAN
> reports a slab-use-after-free read in user_event_enabler_dup() called
> from user_event_mm_dup() and copy_process() during clone(); with
> kasan.fault=panic the kernel panics.
> 
> Free the enabler after a grace period with kfree_rcu(), matching the
> list_del_rcu() removal and the rcu_read_lock() readers in
> user_event_mm_dup(). Add an rcu_head to struct user_event_enabler for
> this. The error path in user_event_enabler_create() keeps using kfree()
> because that enabler is freed before it is published to the RCU list.
> 
> Cc: stable@vger.kernel.org
> Fixes: 7235759084a4 ("tracing/user_events: Use remote writes for event enablement")
> Assisted-by: Claude:claude-opus-4-8
> Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
> ---
> 
> Notes:
>     KASAN on the unpatched tree (v7.1, x86-64, CONFIG_KASAN=y, SMP):
>     
>       BUG: KASAN: slab-use-after-free in user_event_enabler_dup+0x50a/0x540
>       Read of size 8 (enabler->event, 16 bytes into a freed kmalloc-cg-64):
>         user_event_enabler_dup
>         user_event_mm_dup
>         copy_process
>         __do_sys_clone
>       Allocated by the registering task; freed on another CPU via the
>       DIAG_IOCSUNREG path. With kasan.fault=panic the access panics.
>     
>     After the patch the same reproducer runs cleanly (no splat, no panic)
>     across the full window, and a serialized control (same paths, no
>     concurrency) is clean on both stock and patched.
>     
>     Re-ran tools/testing/selftests/user_events on stock and patched, both
>     clean: abi_test pass:6/6, dyn_test pass:4/4, ftrace_test pass:6/6.
> 
>  kernel/trace/trace_events_user.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/trace/trace_events_user.c b/kernel/trace/trace_events_user.c
> index c4ba484f7b38b..412ca1e3a40cf 100644
> --- a/kernel/trace/trace_events_user.c
> +++ b/kernel/trace/trace_events_user.c
> @@ -109,6 +109,9 @@ struct user_event_enabler {
>  
>  	/* Track enable bit, flags, etc. Aligned for bitops. */
>  	unsigned long		values;
> +
> +	/* Defer free so RCU list readers (user_event_mm_dup) are safe. */
> +	struct rcu_head		rcu;
>  };
>  
>  /* Bits 0-5 are for the bit to update upon enable/disable (0-63 allowed) */
> @@ -404,7 +407,12 @@ static void user_event_enabler_destroy(struct user_event_enabler *enabler,
>  	/* No longer tracking the event via the enabler */
>  	user_event_put(enabler->event, locked);
>  
> -	kfree(enabler);
> +	/*
> +	 * The enabler is removed from an RCU-traversed list
> +	 * (user_event_mm_dup walks mm->enablers under rcu_read_lock only),
> +	 * so the backing memory must outlive a grace period.
> +	 */
> +	kfree_rcu(enabler, rcu);
>  }
>  
>  static int user_event_mm_fault_in(struct user_event_mm *mm, unsigned long uaddr,
> -- 
> 2.53.0

Thanks for fixing this!

Acked-by: Beau Belgrave <beaub@linux.microsoft.com>

Thanks,
-Beau


* Re: [PATCH 4/4] tracing: trace_fprobe: fix typo in function name
From: Masami Hiramatsu @ 2026-06-18 23:43 UTC (permalink / raw)
  To: Martin Kaiser; +Cc: Steven Rostedt, linux-trace-kernel, linux-kernel
In-Reply-To: <20260507081041.885781-5-martin@kaiser.cx>

On Thu,  7 May 2026 10:09:09 +0200
Martin Kaiser <martin@kaiser.cx> wrote:

> The function name should be __register_tracepoint_fprobe.
> 

This looks good to me. Let me pick it.

Thanks,

> Signed-off-by: Martin Kaiser <martin@kaiser.cx>
> ---
>  kernel/trace/trace_fprobe.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/trace/trace_fprobe.c b/kernel/trace/trace_fprobe.c
> index 9f5f08c0e7c2..4d1abbf66229 100644
> --- a/kernel/trace/trace_fprobe.c
> +++ b/kernel/trace/trace_fprobe.c
> @@ -764,7 +764,7 @@ static int unregister_fprobe_event(struct trace_fprobe *tf)
>  	return trace_probe_unregister_event_call(&tf->tp);
>  }
>  
> -static int __regsiter_tracepoint_fprobe(struct trace_fprobe *tf)
> +static int __register_tracepoint_fprobe(struct trace_fprobe *tf)
>  {
>  	struct tracepoint_user *tuser __free(tuser_put) = NULL;
>  	struct module *mod __free(module_put) = NULL;
> @@ -836,7 +836,7 @@ static int __register_trace_fprobe(struct trace_fprobe *tf)
>  	tf->fp.flags &= ~FPROBE_FL_DISABLED;
>  
>  	if (trace_fprobe_is_tracepoint(tf))
> -		return __regsiter_tracepoint_fprobe(tf);
> +		return __register_tracepoint_fprobe(tf);
>  
>  	/* TODO: handle filter, nofilter or symbol list */
>  	return register_fprobe(&tf->fp, tf->symbol, NULL);
> -- 
> 2.43.7
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>


* Re: [PATCH 3/4] tracing: probes: fix typo in a log message
From: Masami Hiramatsu @ 2026-06-18 23:43 UTC (permalink / raw)
  To: Martin Kaiser; +Cc: Steven Rostedt, linux-trace-kernel, linux-kernel
In-Reply-To: <20260507081041.885781-4-martin@kaiser.cx>

On Thu,  7 May 2026 10:09:08 +0200
Martin Kaiser <martin@kaiser.cx> wrote:

> Fix a typo ("Invalid $-variable") in a log message.
> 
> Signed-off-by: Martin Kaiser <martin@kaiser.cx>

This looks good to me. Let me pick it.

Thanks,

> ---
>  kernel/trace/trace_probe.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
> index 262d8707a3df..df68d40de161 100644
> --- a/kernel/trace/trace_probe.h
> +++ b/kernel/trace/trace_probe.h
> @@ -509,7 +509,7 @@ extern int traceprobe_define_arg_fields(struct trace_event_call *event_call,
>  	C(NO_RETVAL,		"This function returns 'void' type"),	\
>  	C(BAD_STACK_NUM,	"Invalid stack number"),		\
>  	C(BAD_ARG_NUM,		"Invalid argument number"),		\
> -	C(BAD_VAR,		"Invalid $-valiable specified"),	\
> +	C(BAD_VAR,		"Invalid $-variable specified"),	\
>  	C(BAD_REG_NAME,		"Invalid register name"),		\
>  	C(BAD_MEM_ADDR,		"Invalid memory address"),		\
>  	C(BAD_IMM,		"Invalid immediate value"),		\
> -- 
> 2.43.7
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>


* [PATCH] tracing/user_events: fix use-after-free of enabler in user_event_mm_dup()
From: Michael Bommarito @ 2026-06-18 22:27 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers
  Cc: Beau Belgrave, linux-trace-kernel, linux-kernel, stable

user_event_enabler_destroy() removes an enabler from the per-mm
mm->enablers list with list_del_rcu() and then frees it immediately with
kfree(). That list is walked locklessly by user_event_mm_dup() during
fork(), under rcu_read_lock() only:

	rcu_read_lock();
	list_for_each_entry_rcu(enabler, &old_mm->enablers, mm_enablers_link)
		...

user_event_mm_dup() does not take event_mutex. The per-enabler destroy
path user_events_ioctl_unreg() (DIAG_IOCSUNREG) takes event_mutex but
nothing that excludes the dup walk. Threads that share an mm share one
user_event_mm and one enabler list, so an unregister on one thread can
free an enabler while another thread is forking and user_event_mm_dup()
is mid-walk. The walk then dereferences the freed enabler (for example
enabler->event in user_event_enabler_dup()).

This is reachable by an unprivileged task that can open user_events_data:
a single multithreaded process that registers an enabler and then
concurrently unregisters it and calls fork() triggers the race. KASAN
reports a slab-use-after-free read in user_event_enabler_dup() called
from user_event_mm_dup() and copy_process() during clone(); with
kasan.fault=panic the kernel panics.

Free the enabler after a grace period with kfree_rcu(), matching the
list_del_rcu() removal and the rcu_read_lock() readers in
user_event_mm_dup(). Add an rcu_head to struct user_event_enabler for
this. The error path in user_event_enabler_create() keeps using kfree()
because that enabler is freed before it is published to the RCU list.

Cc: stable@vger.kernel.org
Fixes: 7235759084a4 ("tracing/user_events: Use remote writes for event enablement")
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
---

Notes:
    KASAN on the unpatched tree (v7.1, x86-64, CONFIG_KASAN=y, SMP):
    
      BUG: KASAN: slab-use-after-free in user_event_enabler_dup+0x50a/0x540
      Read of size 8 (enabler->event, 16 bytes into a freed kmalloc-cg-64):
        user_event_enabler_dup
        user_event_mm_dup
        copy_process
        __do_sys_clone
      Allocated by the registering task; freed on another CPU via the
      DIAG_IOCSUNREG path. With kasan.fault=panic the access panics.
    
    After the patch the same reproducer runs cleanly (no splat, no panic)
    across the full window, and a serialized control (same paths, no
    concurrency) is clean on both stock and patched.
    
    Re-ran tools/testing/selftests/user_events on stock and patched, both
    clean: abi_test pass:6/6, dyn_test pass:4/4, ftrace_test pass:6/6.

 kernel/trace/trace_events_user.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/kernel/trace/trace_events_user.c b/kernel/trace/trace_events_user.c
index c4ba484f7b38b..412ca1e3a40cf 100644
--- a/kernel/trace/trace_events_user.c
+++ b/kernel/trace/trace_events_user.c
@@ -109,6 +109,9 @@ struct user_event_enabler {
 
 	/* Track enable bit, flags, etc. Aligned for bitops. */
 	unsigned long		values;
+
+	/* Defer free so RCU list readers (user_event_mm_dup) are safe. */
+	struct rcu_head		rcu;
 };
 
 /* Bits 0-5 are for the bit to update upon enable/disable (0-63 allowed) */
@@ -404,7 +407,12 @@ static void user_event_enabler_destroy(struct user_event_enabler *enabler,
 	/* No longer tracking the event via the enabler */
 	user_event_put(enabler->event, locked);
 
-	kfree(enabler);
+	/*
+	 * The enabler is removed from an RCU-traversed list
+	 * (user_event_mm_dup walks mm->enablers under rcu_read_lock only),
+	 * so the backing memory must outlive a grace period.
+	 */
+	kfree_rcu(enabler, rcu);
 }
 
 static int user_event_mm_fault_in(struct user_event_mm *mm, unsigned long uaddr,
-- 
2.53.0
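
The ordering the patch relies on (unlink first, free only once no reader can still hold the pointer) can be illustrated with a single-threaded userspace toy. This is only a sketch under loud assumptions: every toy_* name is hypothetical, a plain reader counter stands in for rcu_read_lock()/rcu_read_unlock(), and nothing here is the kernel's RCU or the actual user_events code.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct user_event_enabler. */
struct toy_enabler {
	struct toy_enabler *next;      /* link in the published list */
	struct toy_enabler *free_next; /* link in the deferred-free list */
	int event;
};

/* Hypothetical stand-in for the per-mm state. */
struct toy_mm {
	struct toy_enabler *enablers;  /* list that readers walk */
	struct toy_enabler *pending;   /* unlinked, waiting for readers to leave */
	int readers;                   /* models read-side critical sections */
	int freed;                     /* how many nodes were really freed */
};

/* Free everything on the deferred list; only safe with no readers. */
static void toy_reap(struct toy_mm *mm)
{
	while (mm->pending) {
		struct toy_enabler *e = mm->pending;

		mm->pending = e->free_next;
		free(e);
		mm->freed++;
	}
}

static void toy_read_lock(struct toy_mm *mm)
{
	mm->readers++;
}

/* The last reader leaving ends the toy "grace period". */
static void toy_read_unlock(struct toy_mm *mm)
{
	if (--mm->readers == 0)
		toy_reap(mm);
}

/* Publish a new node at the head of the list. */
static struct toy_enabler *toy_add(struct toy_mm *mm, int event)
{
	struct toy_enabler *e = calloc(1, sizeof(*e));

	if (!e)
		return NULL;
	e->event = event;
	e->next = mm->enablers;
	mm->enablers = e;
	return e;
}

/* Unlink now, free later: the shape of list_del_rcu() + kfree_rcu(). */
static void toy_destroy(struct toy_mm *mm, struct toy_enabler *e)
{
	struct toy_enabler **pp = &mm->enablers;

	while (*pp && *pp != e)
		pp = &(*pp)->next;
	if (!*pp)
		return;              /* not on the list */
	*pp = e->next;               /* e->next stays intact for a walker on e */

	e->free_next = mm->pending;
	mm->pending = e;
	if (mm->readers == 0)
		toy_reap(mm);        /* nobody can still see it */
}
```

In this model, a toy_destroy() issued between toy_read_lock() and toy_read_unlock() leaves the node readable until the reader leaves, which is the property the unpatched kfree() path lacks.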


^ permalink raw reply related

* Re: [PATCH 0/3] rv/reactors: fix lockdep warning and add KUnit tests
From: Gabriele Monaco @ 2026-06-18 15:35 UTC (permalink / raw)
  To: Wen Yang; +Cc: Nam Cao, linux-trace-kernel, linux-kernel
In-Reply-To: <4053c9bb-6229-438c-8c14-917909c1618f@linux.dev>

On Thu, 2026-06-18 at 01:11 +0800, Wen Yang wrote:
> Thank you for your feedback.
> I am using a WSL dev environment with 12 cores and 16GB. The config
> of the tested kernel code is as follows:

Uhm, that's a strange one; I cannot get a machine like that.
The closest is a 16-CPU machine where I can limit the resources in vng.

> And then, using vng to build and run kselftests (since kunit is
> already 
> built-in) can reproduce this issue:
> 
> $ vng --build
> 
> $ vng -v --run arch/x86/boot/bzImage --user root -- 
> tools/testing/selftests/verification/verificationtest-ktap

Well whenever I pass some argument to vng (instead of just vng -v that brings
up an interactive shell), I see an unrelated lockdep splat in
timekeeping_init(), but all clear when the KUnit runs..

I'm going to try and understand better what's going on, I don't think I can
reproduce it easily.

Thanks,
Gabriele


^ permalink raw reply

* [PATCH] rv: update rvgen monitor synthesis documentation path
From: Yu Chuanyu via B4 Relay @ 2026-06-18 13:45 UTC (permalink / raw)
  To: Steven Rostedt, Gabriele Monaco
  Cc: linux-trace-kernel, linux-kernel, Yu Chuanyu

From: Yu Chuanyu <lucayu.alight@gmail.com>

The rvgen source comments still refer to da_monitor_synthesis.rst, which
no longer exists. The documentation is now available in
monitor_synthesis.rst. Update both references to point to the current
file.

Signed-off-by: Yu Chuanyu <lucayu.alight@gmail.com>
---
 tools/verification/rvgen/__main__.py    | 2 +-
 tools/verification/rvgen/rvgen/dot2k.py | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/verification/rvgen/__main__.py b/tools/verification/rvgen/__main__.py
index 5c923dc..2a2bb03 100644
--- a/tools/verification/rvgen/__main__.py
+++ b/tools/verification/rvgen/__main__.py
@@ -6,7 +6,7 @@
 # dot2k: transform dot files into a monitor for the Linux kernel.
 #
 # For further information, see:
-#   Documentation/trace/rv/da_monitor_synthesis.rst
+#   Documentation/trace/rv/monitor_synthesis.rst
 
 if __name__ == '__main__':
     from rvgen.dot2k import da2k, ha2k
diff --git a/tools/verification/rvgen/rvgen/dot2k.py b/tools/verification/rvgen/rvgen/dot2k.py
index 110cfd6..326984f 100644
--- a/tools/verification/rvgen/rvgen/dot2k.py
+++ b/tools/verification/rvgen/rvgen/dot2k.py
@@ -6,7 +6,7 @@
 # dot2k: transform dot files into a monitor for the Linux kernel.
 #
 # For further information, see:
-#   Documentation/trace/rv/da_monitor_synthesis.rst
+#   Documentation/trace/rv/monitor_synthesis.rst
 
 from collections import deque
 from .dot2c import Dot2c

---
base-commit: e771677c937da5808f7b6c1f0e4a97ec1a84f8a8
change-id: 20260618-rvgen-doc-path-11695c57153d

Best regards,
--  
Yu Chuanyu <lucayu.alight@gmail.com>



^ permalink raw reply related

* Re: [PATCH v3 09/13] verification/rvgen: Delete __parse_constraint()
From: Nam Cao @ 2026-06-18 13:24 UTC (permalink / raw)
  To: Gabriele Monaco
  Cc: Steven Rostedt, Wander Lairson Costa, linux-trace-kernel,
	linux-kernel
In-Reply-To: <9035cc5b83dda3a8ec06e8488fba62ceb7431123.camel@redhat.com>

Gabriele Monaco <gmonaco@redhat.com> writes:
> Yeah, I don't see it explicitly mandated in the theory, but the
> description (from the sources) states:
>
>   The value of a clock thus denotes the amount of time that has been  
> elapsed since its last reset
>
> But it also says (emphasis added by me):
>
>   Clocks /can/ be reset to zero after which they start increasing ...
>
> Nowhere it says clocks /must/ be reset, their value simply won't make
> sense (according to the definition).
>
> Now in our implementation we may have some automatic reset when the
> monitor starts (I'm planning that to avoid invalid states), which could
> make explicit resets superfluous in some cases.

Resetting the clocks on monitor start sounds sensible.

> Let's leave that to the user for now and skip this check.

Thanks,
Nam

^ permalink raw reply

* Re: [PATCH v3] mm/lruvec: trace LRU add drains and drain-all requests
From: David Hildenbrand (Arm) @ 2026-06-18 12:38 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE), Shakeel Butt
  Cc: JP Kobryn, linux-mm, willy, usama.arif, akpm, mhocko, rostedt,
	mhiramat, mathieu.desnoyers, kasong, qi.zheng, baohua,
	axelrasmussen, yuanchu, weixugc, chrisl, shikemeng, nphamcs,
	baoquan.he, youngjun.park, linux-kernel, linux-trace-kernel
In-Reply-To: <bbcb6db5-6a01-46b7-979f-dadd52a5176f@kernel.org>

On 6/18/26 10:30, Vlastimil Babka (SUSE) wrote:
> On 6/18/26 10:21, David Hildenbrand (Arm) wrote:
>> On 6/17/26 20:18, Vlastimil Babka (SUSE) wrote:
>>>
>>> Yeah and I don't recall ever that a change to a mm tracepoint would ever
>>> break someone who'd complain and we'd have to revert it.
>> Really? :)
>>
>> Read the context of the link I posted once more.
> 
> Ah, I see. I've only read the single mail from Steven that referred to the
> old powertop breakage and didn't notice the context.
> 
> But I don't think these worries should stop us from adding easily usable
> tracepoints.

Steve explained how scheduler people apparently handle it without
trace events.

You can always remove/modify tracepoints, but not trace events.

Anyhow, just wanted to mention it, because so far MM didn't really know about
this implication.

-- 
Cheers,

David

^ permalink raw reply

* Re: [RFC PATCH 3/3] mm/compaction: respect compact_unevictable_allowed in alloc_contig path
From: Wandun @ 2026-06-18 11:47 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE), linux-mm, linux-kernel,
	linux-trace-kernel, linux-rt-devel
  Cc: akpm, surenb, mhocko, jackmanb, hannes, ziy, rostedt, mhiramat,
	mathieu.desnoyers, david, ljs, liam, rppt, bigeasy, clrkwllms,
	Alexander.Krabler
In-Reply-To: <9890b8f5-69b9-49bc-8ed6-ea47723b644e@kernel.org>



On 6/18/26 02:57, Vlastimil Babka (SUSE) wrote:
> On 6/4/26 04:38, Wandun Chen wrote:
>> From: Wandun Chen <chenwandun@lixiang.com>
>>
>> vm.compact_unevictable_allowed=0 is used to prevent compacting
>> unevictable pages. However, isolate_migratepages_range() passes
>> ISOLATE_UNEVICTABLE regardless of this sysctl, so the setting
>> has no effect in the alloc_contig path.
>>
>> Fix it by:
>>   - Keep ISOLATE_UNEVICTABLE for CMA allocation, discussed in [1].
>>   - Honour sysctl_compact_unevictable_allowed for non-CMA allocation.
>>
>> Suggested-by: Vlastimil Babka <vbabka@suse.cz>
>> Signed-off-by: Wandun Chen <chenwandun@lixiang.com>
>> Link: https://lore.kernel.org/all/25ba0d77-eb61-4efc-b2fc-73878cbd85c1@suse.cz/ [1]
> 
> There was also the "Ideally by not having mlock'd pages in CMA areas at
> all." part. Is it the case? It was more elaborated here:

Yes, it is the case.

> https://lore.kernel.org/all/CAPTztWZpnX1j8-7yeppVUsxE=O9hbVeqricDjZt8_pnN7a-kBQ@mail.gmail.com/

I missed this important information. Thanks for pointing it out, Vlastimil.

Best regards,
Wandun

> 
>> ---
>>  include/linux/compaction.h | 6 ++++++
>>  mm/compaction.c            | 9 +++++++--
>>  mm/internal.h              | 1 +
>>  mm/page_alloc.c            | 2 ++
>>  4 files changed, 16 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
>> index f29ef0653546..04e60f65b976 100644
>> --- a/include/linux/compaction.h
>> +++ b/include/linux/compaction.h
>> @@ -106,6 +106,7 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
>>  extern void __meminit kcompactd_run(int nid);
>>  extern void __meminit kcompactd_stop(int nid);
>>  extern void wakeup_kcompactd(pg_data_t *pgdat, int order, int highest_zoneidx);
>> +extern bool compaction_allow_unevictable(void);
>>  
>>  #else
>>  static inline void reset_isolation_suitable(pg_data_t *pgdat)
>> @@ -131,6 +132,11 @@ static inline void wakeup_kcompactd(pg_data_t *pgdat,
>>  {
>>  }
>>  
>> +static inline bool compaction_allow_unevictable(void)
>> +{
>> +	return true;
>> +}
>> +
>>  #endif /* CONFIG_COMPACTION */
>>  
>>  struct node;
>> diff --git a/mm/compaction.c b/mm/compaction.c
>> index 007d5e00a8ae..a10acb273454 100644
>> --- a/mm/compaction.c
>> +++ b/mm/compaction.c
>> @@ -1341,6 +1341,7 @@ isolate_migratepages_range(struct compact_control *cc, unsigned long start_pfn,
>>  							unsigned long end_pfn)
>>  {
>>  	unsigned long pfn, block_start_pfn, block_end_pfn;
>> +	isolate_mode_t mode = cc->allow_unevictable ? ISOLATE_UNEVICTABLE : 0;
>>  	int ret = 0;
>>  
>>  	/* Scan block by block. First and last block may be incomplete */
>> @@ -1360,8 +1361,7 @@ isolate_migratepages_range(struct compact_control *cc, unsigned long start_pfn,
>>  					block_end_pfn, cc->zone))
>>  			continue;
>>  
>> -		ret = isolate_migratepages_block(cc, pfn, block_end_pfn,
>> -						 ISOLATE_UNEVICTABLE);
>> +		ret = isolate_migratepages_block(cc, pfn, block_end_pfn, mode);
>>  
>>  		if (ret)
>>  			break;
>> @@ -1902,6 +1902,11 @@ typedef enum {
>>   * compactable pages.
>>   */
>>  static int sysctl_compact_unevictable_allowed __read_mostly = CONFIG_COMPACT_UNEVICTABLE_DEFAULT;
>> +
>> +bool compaction_allow_unevictable(void)
>> +{
>> +	return sysctl_compact_unevictable_allowed;
>> +}
>>  /*
>>   * Tunable for proactive compaction. It determines how
>>   * aggressively the kernel should compact memory in the
>> diff --git a/mm/internal.h b/mm/internal.h
>> index 181e79f1d6a2..163f9d6b37f3 100644
>> --- a/mm/internal.h
>> +++ b/mm/internal.h
>> @@ -1052,6 +1052,7 @@ struct compact_control {
>>  					 * ensure forward progress.
>>  					 */
>>  	bool alloc_contig;		/* alloc_contig_range allocation */
>> +	bool allow_unevictable;		/* Allow isolation of unevictable folios */
>>  };
>>  
>>  /*
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 81a9d4d1e6c0..1cf9d4a3b14c 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -7118,6 +7118,8 @@ int alloc_contig_frozen_range_noprof(unsigned long start, unsigned long end,
>>  		.ignore_skip_hint = true,
>>  		.no_set_skip_hint = true,
>>  		.alloc_contig = true,
>> +		.allow_unevictable = !!(alloc_flags & ACR_FLAGS_CMA) ||
>> +					     compaction_allow_unevictable(),
>>  	};
>>  	INIT_LIST_HEAD(&cc.migratepages);
>>  	enum pb_isolate_mode mode = (alloc_flags & ACR_FLAGS_CMA) ?
> 


^ permalink raw reply

* Re: [RFC PATCH 1/3] mm/compaction: skip isolate mlocked folios when compact_unevictable_allowed=0
From: Wandun @ 2026-06-18 11:43 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE), linux-mm, linux-kernel,
	linux-trace-kernel, linux-rt-devel
  Cc: akpm, surenb, mhocko, jackmanb, hannes, ziy, rostedt, mhiramat,
	mathieu.desnoyers, david, ljs, liam, rppt, bigeasy, clrkwllms,
	Alexander.Krabler, Hugh Dickins
In-Reply-To: <969cb14b-5b8b-48e6-add6-4dd13101dd89@kernel.org>



On 6/18/26 02:52, Vlastimil Babka (SUSE) wrote:
> On 6/4/26 04:38, Wandun Chen wrote:
>> From: Wandun Chen <chenwandun@lixiang.com>
>>
>> compact_unevictable_allowed is default 0 under PREEMPT_RT,
>> isolate_migratepages_block() skips folios with PG_unevictable set.
>> However, mlock_folio() sets PG_mlocked immediately but defers
>> PG_unevictable to mlock_folio_batch(), result in a folio with
>> PG_mlocked=1 but PG_unevictable=0. Compaction will isolate such a
>> folio.
>>
>> Fix by checking folio_test_mlocked() together with the existing
>> folio_test_unevictable() check.
>>
>> A similar issue has been reported by Alexander Krabler on a 6.12-rt
>> aarch64 system. Vlastimil suggested to check the mlocked flag [1].
>>
>> Reported-by: Alexander Krabler <Alexander.Krabler@kuka.com>
>> Closes: https://lore.kernel.org/all/DU0PR01MB10385345F7153F334100981888259A@DU0PR01MB10385.eurprd01.prod.exchangelabs.com/
>> Suggested-by: Vlastimil Babka <vbabka@suse.cz>
>> Signed-off-by: Wandun Chen <chenwandun@lixiang.com>
>> Link: https://lore.kernel.org/all/33275585-f2db-4779-89f0-3ae24b455a67@suse.cz/ [1]
> 
> Well in that thread, Hugh doubted my suggestion and then it seems we didn't
> conclude anything. Did you actually in practice observe the issue that
> Alexander had, and that this patch fixed it, or is that theoretical?
> 
Yes, I wrote a test case that can reproduce it in a few seconds.

The test case contains 3 steps:
1. mlockall
2. mmap file(2GB) + trigger file write page fault;
3. during step 2, trigger compaction via /proc/sys/vm/compact_memory


My reproduction environment is qemu with 4GB ram, 8 cores, aarch64,
preempt_rt and includes the tracepoint in patch 02.
After running the reproduction program for a few seconds, the
following output appears.

repro-403     [004] ....1   101.270505: mm_compaction_isolate_folio: pfn=0x71e3a mode=0x0 flags=referenced|uptodate|mlocked
repro-403     [004] ....1   101.270507: mm_compaction_isolate_folio: pfn=0x71e3b mode=0x0 flags=referenced|uptodate|mlocked
repro-403     [004] ....1   101.270513: mm_compaction_isolate_folio: pfn=0x71e3c mode=0x0 flags=referenced|uptodate|mlocked
repro-403     [004] ....1   101.270515: mm_compaction_isolate_folio: pfn=0x71e3d mode=0x0 flags=uptodate|mlocked
repro-403     [004] ....1   101.270517: mm_compaction_isolate_folio: pfn=0x71e3e mode=0x0 flags=uptodate|mlocked
repro-403     [004] ....1   101.270520: mm_compaction_isolate_folio: pfn=0x71e3f mode=0x0 flags=uptodate|mlocked
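
The trace lines above show folios with mlocked set and unevictable clear being isolated with mode=0x0. As a minimal userspace sketch of the two forms of the isolation check (before and after patch 1), assuming a hypothetical toy_folio with two booleans in place of the real page flags and a stand-in constant for ISOLATE_UNEVICTABLE:

```c
#include <stdbool.h>

/* Stand-in for the kernel's ISOLATE_UNEVICTABLE mode bit (value is arbitrary). */
#define TOY_ISOLATE_UNEVICTABLE 0x1u

/* Hypothetical model of the two page flags involved. */
struct toy_folio {
	bool unevictable; /* models PG_unevictable */
	bool mlocked;     /* models PG_mlocked */
};

/* Check before patch 1: only the unevictable flag is consulted. */
static bool skip_before(unsigned int mode, const struct toy_folio *f)
{
	return !(mode & TOY_ISOLATE_UNEVICTABLE) && f->unevictable;
}

/* Check as proposed in patch 1: an mlocked folio is skipped as well. */
static bool skip_after(unsigned int mode, const struct toy_folio *f)
{
	return !(mode & TOY_ISOLATE_UNEVICTABLE) &&
	       (f->unevictable || f->mlocked);
}
```

For a folio in the window (mlocked=1, unevictable=0) and mode 0, skip_before() returns false and skip_after() returns true. The sketch models only the check made at isolation time; it says nothing about flags that change afterwards.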


Unfortunately, I recently found that there is still a bug in the
fix patch. Setting mlocked in the mlock_folio function could happen
even after the page is successfully isolated, so it still cannot
prevent migration. Because of this, I need to think more about how
to fix it.

Perhaps we should double-check whether the page is mlocked during
the actual migration phase.

What do you think of this best-effort approach?


Best regards,
Wandun





The full reproducer is below:

/* gcc repro.c -o repro -lpthread */

#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define PAGE_SIZE       4096
#define NR_PAGES        32
#define FILE_SIZE       (2ULL * 1024 * 1024 * 1024)

static void *worker_fn(void *arg)
{
	int fd = (long)arg;
	size_t len = (size_t)FILE_SIZE;
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return NULL;

	for (size_t off = 0; off + NR_PAGES * PAGE_SIZE <= len;
	     off += NR_PAGES * PAGE_SIZE) {
		for (int i = 0; i < NR_PAGES; i++)
			p[off + i * PAGE_SIZE] = 1;
		usleep(200);
	}

	munmap(p, len);
	return NULL;
}

static void *compact_fn(void *arg)
{
	(void)arg;
	int fd = open("/proc/sys/vm/compact_memory", O_WRONLY);
	if (fd < 0)
		return NULL;

	while (1) {
		if (write(fd, "1", 1) < 0) {}
		usleep(5000);
	}
}

int main(void)
{
	mlockall(MCL_CURRENT | MCL_FUTURE);

	int fd = open("./repro_largefile.dat", O_RDWR | O_CREAT, 0600);
	if (fd < 0)
		return 1;
	unlink("./repro_largefile.dat");
	if (ftruncate(fd, (off_t)FILE_SIZE) < 0)
		return 1;

	printf("repro_largefile: 1 worker, %d pages/batch, Ctrl-C to stop\n",
	       NR_PAGES);

	pthread_t compact, worker;
	pthread_create(&compact, NULL, compact_fn, NULL);
	pthread_create(&worker, NULL, worker_fn, (void *)(long)fd);

	pthread_join(worker, NULL);
	return 0;
}

>> ---
>>  mm/compaction.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/compaction.c b/mm/compaction.c
>> index b776f35ad020..7e07b792bcb5 100644
>> --- a/mm/compaction.c
>> +++ b/mm/compaction.c
>> @@ -1116,7 +1116,8 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>>  		is_unevictable = folio_test_unevictable(folio);
>>  
>>  		/* Compaction might skip unevictable pages but CMA takes them */
>> -		if (!(mode & ISOLATE_UNEVICTABLE) && is_unevictable)
>> +		if (!(mode & ISOLATE_UNEVICTABLE) &&
>> +		    (is_unevictable || folio_test_mlocked(folio)))
>>  			goto isolate_fail_put;
>>  
>>  		/*
> 


^ permalink raw reply

* Re: [PATCH] usb: typec: add trace point for typec_set_mode
From: Heikki Krogerus @ 2026-06-18 11:31 UTC (permalink / raw)
  To: Ahmad Fatoum
  Cc: Greg Kroah-Hartman, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, linux-kernel, linux-usb, linux-trace-kernel,
	kernel
In-Reply-To: <43e13854-a634-4706-bc12-723c871a5579@pengutronix.de>

Hi,

On Thu, Jun 18, 2026 at 01:00:58PM +0200, Ahmad Fatoum wrote:
> Hello Heikki,
> 
> On 6/18/26 12:56 PM, Heikki Krogerus wrote:
> > On Wed, Jun 17, 2026 at 10:03:04PM +0200, Ahmad Fatoum wrote:
> >> --- a/drivers/usb/typec/class.c
> >> +++ b/drivers/usb/typec/class.c
> >> @@ -20,6 +20,9 @@
> >>  #include "class.h"
> >>  #include "pd.h"
> >>  
> >> +#define CREATE_TRACE_POINTS
> >> +#include <trace/events/typec.h>
> > 
> > Those should probably go to drivers/usb/typec/trace.c and then you
> > need to add something like this to drivers/usb/typec/Makefile:
> > 
> >  obj-$(CONFIG_TYPEC)            += typec.o
> >  typec-y                                := class.o mux.o bus.o pd.o retimer.o mode_selection.o
> >  typec-$(CONFIG_ACPI)           += port-mapper.o
> > +typec-$(CONFIG_TRACING)                += trace.o
> 
> Thanks for the suggestion. I will do that for v2.
> 
> I also saw there is Sashiko AI feedback on this patch[1], but I am not
> familiar enough with how the event headers are used outside the kernel
> to determine if that's actionable advice or if it can be ignored.
> 
> Do you have an opinion on that?
> 
> [1]:
> https://sashiko.dev/#/patchset/20260617-typec_set_mode-tracepoint-v1-1-bdfbb39cfccd%40pengutronix.de

It's correct. You need to use a private trace.h in this case, so just
move it here: drivers/usb/typec/trace.h

And also make sure you include everything needed in that header like
it's telling you.

Thanks,

-- 
heikki

^ permalink raw reply

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: Gregory Price @ 2026-06-18 11:13 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE)
  Cc: David Hildenbrand (Arm), Balbir Singh, lsf-pc, linux-kernel,
	linux-cxl, cgroups, linux-mm, linux-trace-kernel, damon,
	kernel-team, gregkh, rafael, dakr, dave, jonathan.cameron,
	dave.jiang, alison.schofield, vishal.l.verma, ira.weiny,
	dan.j.williams, longman, akpm, lorenzo.stoakes, Liam.Howlett,
	vbabka, rppt, surenb, mhocko, osalvador, ziy, matthew.brost,
	joshua.hahnjy, rakie.kim, byungchul, ying.huang, apopple,
	axelrasmussen, yuanchu, weixugc, yury.norov, linux, mhiramat,
	mathieu.desnoyers, tj, hannes, mkoutny, jackmanb, sj, baolin.wang,
	npache, ryan.roberts, dev.jain, baohua, lance.yang, muchun.song,
	xu.xin16, chengming.zhou, jannh, linmiaohe, nao.horiguchi,
	pfalcato, rientjes, shakeel.butt, riel, harry.yoo, cl,
	roman.gushchin, chrisl, kasong, shikemeng, nphamcs, bhe,
	zhengqi.arch, terry.bowman, Matthew Wilcox
In-Reply-To: <90418cd3-751f-439d-83ed-a0c33517c3bd@kernel.org>

On Thu, Jun 18, 2026 at 10:21:30AM +0200, Vlastimil Babka (SUSE) wrote:
> On 6/15/26 17:37, Gregory Price wrote:
> > 
> > One thought would be a way to switch what fallback list is used, and
> > then have specific fallback lists for certain contexts.
> > 
> > Right now there is a single example of this: __GFP_THISNODE
> >   |= __GFP_THISNODE   =>  NOFALLBACK
> >   &= ~__GFP_THISNODE  =>  FALLBACK
> > 
> > We could add an interface with the desired fallback list based as an
> > argument, and let get_page_from_freelist to prefer that over the default
> > global lists.
> 
> Does it mean a new argument in a number of functions in the page allocator,
> or can it be mapped to alloc_flags (at least internally?), because the
> number of possible fallback lists is small enough?
>

What I ended up with was adding a single page_alloc.c external interface
that allows you to define the zonelist via an enum, and then an internal
selector resolution in prepare_alloc_pages() stored in alloc_context

e.g.:

static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
                int preferred_nid, nodemask_t *nodemask,
                struct alloc_context *ac, gfp_t *alloc_gfp,
                unsigned int *alloc_flags)
{       
        ac->highest_zoneidx = gfp_zone(gfp_mask);
        ac->zonelist = select_zonelist(preferred_nid, gfp_mask, ac->zlsel);
	... snip ...
}

struct folio *__folio_alloc_zonelist_noprof(gfp_t gfp, unsigned int order,
                int preferred_nid, nodemask_t *nodemask,
                enum alloc_zonelist zlsel);


The original __folio_alloc* functions just add a DEFAULT - which tells
select_zonelist() to base the decision on __GFP_THISNODE.


struct folio *__folio_alloc_noprof(gfp_t gfp, unsigned int order, int preferred_nid,
                nodemask_t *nodemask)
{
        return __folio_alloc_core(gfp, order, preferred_nid, nodemask,
                                  ALLOC_ZONELIST_DEFAULT);
}
EXPORT_SYMBOL(__folio_alloc_noprof);
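
The body of select_zonelist() is not shown in this mail. Purely as an illustration of the description above (DEFAULT defers to __GFP_THISNODE, PRIVATE picks the private list), a userspace sketch could look like the following. All toy_* and TOY_* names, the flag value, and the choice that PRIVATE ignores the THISNODE bit are assumptions, not the series' actual code.

```c
/* Hypothetical stand-in for the __GFP_THISNODE bit (value is arbitrary). */
#define TOY_GFP_THISNODE 0x1u

/* Selector passed in by the caller, mirroring enum alloc_zonelist. */
enum toy_alloc_zonelist {
	TOY_ALLOC_ZONELIST_DEFAULT,
	TOY_ALLOC_ZONELIST_PRIVATE,
};

/* Which per-node zonelist ends up being used. */
enum toy_zonelist {
	TOY_ZONELIST_FALLBACK,   /* may fall back to other nodes */
	TOY_ZONELIST_NOFALLBACK, /* this node only */
	TOY_ZONELIST_PRIVATE,    /* assumed list that includes private nodes */
};

/*
 * Resolve the selector. DEFAULT keeps the existing behaviour, where the
 * THISNODE bit alone decides between the two global lists. PRIVATE is
 * assumed here to win regardless of the gfp bits.
 */
static enum toy_zonelist toy_select_zonelist(unsigned int gfp,
					     enum toy_alloc_zonelist zlsel)
{
	if (zlsel == TOY_ALLOC_ZONELIST_PRIVATE)
		return TOY_ZONELIST_PRIVATE;

	return (gfp & TOY_GFP_THISNODE) ? TOY_ZONELIST_NOFALLBACK
					: TOY_ZONELIST_FALLBACK;
}
```

In this sketch a caller that never passes the PRIVATE selector can only ever reach the two existing lists, which is the structural isolation the mail describes.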


This does a few things:
  - The isolation is structural, there is no way to accidentally
    allocate private memory without passing ALLOC_ZONELIST_PRIVATE

  - The isolation forces folios - there are no non-folio interfaces
    which allow zonelist selection

  - The zonelist selection is confined to this allocation context,
    so no inheritance is possible.



I tried to avoid using an ALLOC_ flag so we can avoid yet another flag
crunch, but there certainly are few enough zonelists that we could
encode it there and expose it.  I know Brendan was looking at plumbing
alloc flags out to an interface, so I'm open to that.

Externally the way I determine what zonelist to use is a lookup based on
reason - letting the node filter.  This is really only needed in a
couple spots:

mm/khugepaged.c:  enum alloc_zonelist zlsel = alloc_zonelist_for_node(node, NODE_ALLOC_RECLAIM);
mm/vmscan.c:      mtc->zlsel = alloc_zonelist_for_nodemask(mtc->nmask, NODE_ALLOC_TIERING);
mm/migrate.c:     .zlsel = alloc_zonelist_for_node(node, NODE_ALLOC_USER_MIGRATE),

static inline enum alloc_zonelist
alloc_zonelist_for_node(int nid, enum node_alloc_reason reason)
{
        bool ok;

        if (!node_state(nid, N_MEMORY_PRIVATE))
                return ALLOC_ZONELIST_DEFAULT;
        switch (reason) {
        case NODE_ALLOC_RECLAIM:
                ok = node_is_reclaimable(nid);
                break;
        case NODE_ALLOC_TIERING:
                ok = node_allows_tiering(nid);
                break;
        case NODE_ALLOC_USER_MIGRATE:
                ok = node_allows_user_migrate(nid);
                break;
        default:
                ok = false;
        }
        return ok ? ALLOC_ZONELIST_PRIVATE : ALLOC_ZONELIST_DEFAULT;
}

Otherwise... everything is now a mempolicy w/ MPOL_F_BIND and all the
handling goes through the normal fault-paths :]

static struct page *__alloc_pages_mpol(gfp_t gfp, unsigned int order,
                struct mempolicy *pol, pgoff_t ilx, int nid)
{
        nodemask_t *nodemask;
        struct page *page;
        enum alloc_zonelist zlsel = (pol->flags & MPOL_F_PRIVATE) ?
                ALLOC_ZONELIST_PRIVATE : ALLOC_ZONELIST_DEFAULT;
...
        if (pol->mode == MPOL_PREFERRED_MANY)
                return alloc_pages_preferred_many(gfp, order, nid, nodemask,
                                                  zlsel);
...
}


Switching to an alloc_flag would probably be trivial if that's really
wanted

~Gregory

^ permalink raw reply

* Re: [PATCH] usb: typec: add trace point for typec_set_mode
From: Ahmad Fatoum @ 2026-06-18 11:00 UTC (permalink / raw)
  To: Heikki Krogerus
  Cc: Greg Kroah-Hartman, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, linux-kernel, linux-usb, linux-trace-kernel,
	kernel
In-Reply-To: <ajPO6roV4HRZYGNd@kuha>

Hello Heikki,

On 6/18/26 12:56 PM, Heikki Krogerus wrote:
> On Wed, Jun 17, 2026 at 10:03:04PM +0200, Ahmad Fatoum wrote:
>> --- a/drivers/usb/typec/class.c
>> +++ b/drivers/usb/typec/class.c
>> @@ -20,6 +20,9 @@
>>  #include "class.h"
>>  #include "pd.h"
>>  
>> +#define CREATE_TRACE_POINTS
>> +#include <trace/events/typec.h>
> 
> Those should probably go to drivers/usb/typec/trace.c and then you
> need to add something like this to drivers/usb/typec/Makefile:
> 
>  obj-$(CONFIG_TYPEC)            += typec.o
>  typec-y                                := class.o mux.o bus.o pd.o retimer.o mode_selection.o
>  typec-$(CONFIG_ACPI)           += port-mapper.o
> +typec-$(CONFIG_TRACING)                += trace.o

Thanks for the suggestion. I will do that for v2.

I also saw there is Sashiko AI feedback on this patch[1], but I am not
familiar enough with how the event headers are used outside the kernel
to determine if that's actionable advice or if it can be ignored.

Do you have an opinion on that?

[1]:
https://sashiko.dev/#/patchset/20260617-typec_set_mode-tracepoint-v1-1-bdfbb39cfccd%40pengutronix.de

Thanks,
Ahmad

> 
> 
> Thanks,
> 

-- 
Pengutronix e.K.                  |                             |
Steuerwalder Str. 21              | http://www.pengutronix.de/  |
31137 Hildesheim, Germany         | Phone: +49-5121-206917-0    |
Amtsgericht Hildesheim, HRA 2686  | Fax:   +49-5121-206917-5555 |


^ permalink raw reply

* Re: [PATCH] usb: typec: add trace point for typec_set_mode
From: Heikki Krogerus @ 2026-06-18 10:56 UTC (permalink / raw)
  To: Ahmad Fatoum
  Cc: Greg Kroah-Hartman, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, linux-kernel, linux-usb, linux-trace-kernel,
	kernel
In-Reply-To: <20260617-typec_set_mode-tracepoint-v1-1-bdfbb39cfccd@pengutronix.de>

Hi Ahmad,

On Wed, Jun 17, 2026 at 10:03:04PM +0200, Ahmad Fatoum wrote:
> Some Type-C controllers toggle muxes themselves. Other controllers like
> the TUSB320 report the mode to the host, so it can control the muxes.
> 
> To improve debuggability of both kinds of drivers, add a trace point that
> can be used to keep track of the mode being set inside the Type-C
> framework:
> 
>   echo 1 > /sys/kernel/debug/tracing/events/typec/typec_mode/enable
> 
> Signed-off-by: Ahmad Fatoum <a.fatoum@pengutronix.de>
> ---
>  MAINTAINERS                  |  1 +
>  drivers/usb/typec/class.c    |  9 ++++++++-
>  include/trace/events/typec.h | 36 ++++++++++++++++++++++++++++++++++++
>  3 files changed, 45 insertions(+), 1 deletion(-)
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index c8d4b913f26c..ddd59e5e6eaf 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -27753,6 +27753,7 @@ F:	Documentation/ABI/testing/sysfs-class-typec
>  F:	Documentation/driver-api/usb/typec.rst
>  F:	drivers/usb/typec/
>  F:	include/linux/usb/typec.h
> +F:	include/trace/events/typec*.h
>  
>  USB TYPEC INTEL PMC MUX DRIVER
>  M:	Heikki Krogerus <heikki.krogerus@linux.intel.com>
> diff --git a/drivers/usb/typec/class.c b/drivers/usb/typec/class.c
> index 0977581ad1b6..9316d067f19a 100644
> --- a/drivers/usb/typec/class.c
> +++ b/drivers/usb/typec/class.c
> @@ -20,6 +20,9 @@
>  #include "class.h"
>  #include "pd.h"
>  
> +#define CREATE_TRACE_POINTS
> +#include <trace/events/typec.h>

Those should probably go to drivers/usb/typec/trace.c and then you
need to add something like this to drivers/usb/typec/Makefile:

 obj-$(CONFIG_TYPEC)            += typec.o
 typec-y                                := class.o mux.o bus.o pd.o retimer.o mode_selection.o
 typec-$(CONFIG_ACPI)           += port-mapper.o
+typec-$(CONFIG_TRACING)                += trace.o
 obj-$(CONFIG_TYPEC)            += altmodes/
 obj-$(CONFIG_TYPEC_TCPM)       += tcpm/
 obj-$(CONFIG_TYPEC_UCSI)       += ucsi/


Thanks,

-- 
heikki

^ permalink raw reply

* Re: [PATCH v5 1/2] serial: qcom-geni: trace: Drop redundant len field from geni_serial_data
From: Konrad Dybcio @ 2026-06-18  8:55 UTC (permalink / raw)
  To: Praveen Talari, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Greg Kroah-Hartman, Jiri Slaby
  Cc: linux-kernel, linux-trace-kernel, linux-arm-msm, linux-serial,
	mukesh.savaliya, aniket.randive, chandana.chiluveru
In-Reply-To: <20260615-add-tracepoints-for-qcom-geni-serial-v5-1-2efa4c97e0e2@oss.qualcomm.com>

On 6/15/26 4:16 PM, Praveen Talari wrote:
> The dynamic array stored in the ring buffer already carries its own
> length in the array metadata. There is no need to also store it as a
> separate scalar field in the entry struct.
> 
> Drop __field(unsigned int, len) and the corresponding __entry->len
> assignment, and use __get_dynamic_array_len(data) in the TP_printk for
> both the len=%u format argument and the __print_hex() size argument.
> This saves 4 bytes per event on the ring buffer.
> 
> Signed-off-by: Praveen Talari <praveen.talari@oss.qualcomm.com>
> ---

Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com>

Konrad

^ permalink raw reply

