* [POC PATCH 2/6] KVM: selftests: Use guest_memfd memory contents in-place for SNP launch update
From: Ackerley Tng @ 2026-04-28 23:33 UTC (permalink / raw)
To: devnull+ackerleytng.google.com
Cc: ackerleytng, aik, akpm, andrew.jones, aneesh.kumar, axelrasmussen,
baohua, bhe, binbin.wu, bp, brauner, chao.p.peng, chrisl, corbet,
dave.hansen, david, forkloop, hpa, ira.weiny, jgg, jmattson,
jthoughton, kas, kasong, kvm, linux-coco, linux-doc, linux-kernel,
linux-kselftest, linux-mm, linux-trace-kernel, mathieu.desnoyers,
mhiramat, michael.roth, mingo, nphamcs, oupton, pankaj.gupta,
pbonzini, pratyush, qi.zheng, qperret, rick.p.edgecombe, rientjes,
rostedt, seanjc, shakeel.butt, shikemeng, shivankg, shuah, skhan,
steven.price, suzuki.poulose, tabba, tglx, vannapurve, vbabka,
weixugc, willy, wyihan, x86, yan.y.zhao, youngjun.park, yuanchu
In-Reply-To: <cover.1777418884.git.ackerleytng@google.com>
Update the SEV-SNP launch update flow to utilize guest_memfd in-place
conversion.
Include the KVM_SET_MEMORY_ATTRIBUTES2_PRESERVE flag when setting memory
attributes to private. This is permitted before the SNP VM is finalized.
In snp_launch_update_data, pass 0 as the host virtual address. This
instructs the kernel to perform the launch update using the guest_memfd
backing the guest physical address rather than a userspace-provided
buffer.
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
tools/testing/selftests/kvm/lib/x86/sev.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/tools/testing/selftests/kvm/lib/x86/sev.c b/tools/testing/selftests/kvm/lib/x86/sev.c
index d0205b3299e0b..72b2935871fe4 100644
--- a/tools/testing/selftests/kvm/lib/x86/sev.c
+++ b/tools/testing/selftests/kvm/lib/x86/sev.c
@@ -32,13 +32,14 @@ static void encrypt_region(struct kvm_vm *vm, struct userspace_mem_region *regio
const u64 size = (j - i + 1) * vm->page_size;
const u64 offset = (i - lowest_page_in_region) * vm->page_size;
- if (private)
- vm_mem_set_private(vm, gpa_base + offset, size, 0);
+ if (private) {
+ vm_mem_set_private(vm, gpa_base + offset, size,
+ KVM_SET_MEMORY_ATTRIBUTES2_PRESERVE);
+ }
if (is_sev_snp_vm(vm))
snp_launch_update_data(vm, gpa_base + offset,
- (u64)addr_gpa2hva(vm, gpa_base + offset),
- size, page_type);
+ 0, size, page_type);
else
sev_launch_update_data(vm, gpa_base + offset, size);
--
2.54.0.545.g6539524ca2-goog
^ permalink raw reply related
* [POC PATCH 3/6] KVM: selftests: Make guest_code_xsave more friendly
From: Ackerley Tng @ 2026-04-28 23:33 UTC (permalink / raw)
To: devnull+ackerleytng.google.com
Cc: ackerleytng, aik, akpm, andrew.jones, aneesh.kumar, axelrasmussen,
baohua, bhe, binbin.wu, bp, brauner, chao.p.peng, chrisl, corbet,
dave.hansen, david, forkloop, hpa, ira.weiny, jgg, jmattson,
jthoughton, kas, kasong, kvm, linux-coco, linux-doc, linux-kernel,
linux-kselftest, linux-mm, linux-trace-kernel, mathieu.desnoyers,
mhiramat, michael.roth, mingo, nphamcs, oupton, pankaj.gupta,
pbonzini, pratyush, qi.zheng, qperret, rick.p.edgecombe, rientjes,
rostedt, seanjc, shakeel.butt, shikemeng, shivankg, shuah, skhan,
steven.price, suzuki.poulose, tabba, tglx, vannapurve, vbabka,
weixugc, willy, wyihan, x86, yan.y.zhao, youngjun.park, yuanchu
In-Reply-To: <cover.1777418884.git.ackerleytng@google.com>
The original implementation of guest_code_xsave makes a jmp to
guest_sev_es_code in inline assembly. When code that uses guest_sev_es_code
is removed, guest_sev_es_code will be optimized out, leading to a linking
error since guest_code_xsave still tries to jmp to guest_sev_es_code.
Rewrite guest_code_xsave() to instead make a call, in C, to
guest_sev_es_code(), so that usage of guest_sev_es_code() is made known to
the compiler.
This rewriting also gives a name to the xsave inline assembly, improving
readability.
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
.../selftests/kvm/x86/sev_smoke_test.c | 24 +++++++++++++------
1 file changed, 17 insertions(+), 7 deletions(-)
diff --git a/tools/testing/selftests/kvm/x86/sev_smoke_test.c b/tools/testing/selftests/kvm/x86/sev_smoke_test.c
index 1a49ee3915864..8b859adf4cf6f 100644
--- a/tools/testing/selftests/kvm/x86/sev_smoke_test.c
+++ b/tools/testing/selftests/kvm/x86/sev_smoke_test.c
@@ -80,13 +80,23 @@ static void guest_sev_code(void)
GUEST_DONE();
}
-/* Stash state passed via VMSA before any compiled code runs. */
-extern void guest_code_xsave(void);
-asm("guest_code_xsave:\n"
- "mov $" __stringify(XFEATURE_MASK_X87_AVX) ", %eax\n"
- "xor %edx, %edx\n"
- "xsave (%rdi)\n"
- "jmp guest_sev_es_code");
+static void xsave_all_registers(void *addr)
+{
+ __asm__ __volatile__(
+ "mov $" __stringify(XFEATURE_MASK_X87_AVX) ", %eax\n"
+ "xor %edx, %edx\n"
+ "xsave (%0)"
+ :
+ : "r"(addr)
+ : "eax", "edx", "memory"
+ );
+}
+
+static void guest_code_xsave(void *vmsa_gva)
+{
+ xsave_all_registers(vmsa_gva);
+ guest_sev_es_code();
+}
static void compare_xsave(u8 *from_host, u8 *from_guest)
{
--
2.54.0.545.g6539524ca2-goog
^ permalink raw reply related
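Editorial note on patch 3/6: a minimal sketch of why routing the jump through C helps. A static function that is referenced only from inside a top-level asm() string is invisible to the compiler, so it may be discarded once its last C user goes away; a plain C call makes the reference explicit. The names below (save_state, guest_body, guest_entry) are hypothetical stand-ins, not the selftest's functions, and the save step is an ordinary store rather than xsave.

```c
#include <assert.h>
#include <stdint.h>

static int body_runs;
static uint64_t saved_state;

/* Stand-in for the guest body (guest_sev_es_code() in the patch). */
static void guest_body(void)
{
	body_runs++;
}

/* Stand-in for the named save helper; a plain store here, not xsave. */
static void save_state(uint64_t *area, uint64_t value)
{
	*area = value;
}

/*
 * The entry point calls the body from C, so the compiler sees the
 * reference and keeps guest_body() even if nothing else uses it.
 */
static void guest_entry(uint64_t value)
{
	save_state(&saved_state, value);
	guest_body();
}
```

Calling guest_entry(0x7) stores 0x7 and runs the body once. Marking the target with a "used" attribute would be another way to keep it alive under GCC or Clang, but the C call also documents the control flow.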
* [POC PATCH 4/6] KVM: selftests: Allow specifying CoCo-privateness while mapping a page
From: Ackerley Tng @ 2026-04-28 23:33 UTC (permalink / raw)
To: devnull+ackerleytng.google.com
Cc: ackerleytng, aik, akpm, andrew.jones, aneesh.kumar, axelrasmussen,
baohua, bhe, binbin.wu, bp, brauner, chao.p.peng, chrisl, corbet,
dave.hansen, david, forkloop, hpa, ira.weiny, jgg, jmattson,
jthoughton, kas, kasong, kvm, linux-coco, linux-doc, linux-kernel,
linux-kselftest, linux-mm, linux-trace-kernel, mathieu.desnoyers,
mhiramat, michael.roth, mingo, nphamcs, oupton, pankaj.gupta,
pbonzini, pratyush, qi.zheng, qperret, rick.p.edgecombe, rientjes,
rostedt, seanjc, shakeel.butt, shikemeng, shivankg, shuah, skhan,
steven.price, suzuki.poulose, tabba, tglx, vannapurve, vbabka,
weixugc, willy, wyihan, x86, yan.y.zhao, youngjun.park, yuanchu
In-Reply-To: <cover.1777418884.git.ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
tools/testing/selftests/kvm/include/x86/processor.h | 2 ++
tools/testing/selftests/kvm/lib/x86/processor.c | 13 ++++++++++---
2 files changed, 12 insertions(+), 3 deletions(-)
diff --git a/tools/testing/selftests/kvm/include/x86/processor.h b/tools/testing/selftests/kvm/include/x86/processor.h
index 77f576ee7789d..683f21452db58 100644
--- a/tools/testing/selftests/kvm/include/x86/processor.h
+++ b/tools/testing/selftests/kvm/include/x86/processor.h
@@ -1507,6 +1507,8 @@ enum pg_level {
void tdp_mmu_init(struct kvm_vm *vm, int pgtable_levels,
struct pte_masks *pte_masks);
+void ___virt_pg_map(struct kvm_vm *vm, struct kvm_mmu *mmu, gva_t gva,
+ gpa_t gpa, int level, bool private);
void __virt_pg_map(struct kvm_vm *vm, struct kvm_mmu *mmu, gva_t gva,
gpa_t gpa, int level);
void virt_map_level(struct kvm_vm *vm, gva_t gva, gpa_t gpa,
diff --git a/tools/testing/selftests/kvm/lib/x86/processor.c b/tools/testing/selftests/kvm/lib/x86/processor.c
index b51467d70f6e7..02781194f51a2 100644
--- a/tools/testing/selftests/kvm/lib/x86/processor.c
+++ b/tools/testing/selftests/kvm/lib/x86/processor.c
@@ -256,8 +256,8 @@ static u64 *virt_create_upper_pte(struct kvm_vm *vm,
return pte;
}
-void __virt_pg_map(struct kvm_vm *vm, struct kvm_mmu *mmu, gva_t gva,
- gpa_t gpa, int level)
+void ___virt_pg_map(struct kvm_vm *vm, struct kvm_mmu *mmu, gva_t gva,
+ gpa_t gpa, int level, bool private)
{
const u64 pg_size = PG_LEVEL_SIZE(level);
u64 *pte = &mmu->pgd;
@@ -309,12 +309,19 @@ void __virt_pg_map(struct kvm_vm *vm, struct kvm_mmu *mmu, gva_t gva,
* Neither SEV nor TDX supports shared page tables, so only the final
* leaf PTE needs manually set the C/S-bit.
*/
- if (vm_is_gpa_protected(vm, gpa))
+ if (private)
*pte |= PTE_C_BIT_MASK(mmu);
else
*pte |= PTE_S_BIT_MASK(mmu);
}
+void __virt_pg_map(struct kvm_vm *vm, struct kvm_mmu *mmu, gva_t gva,
+ gpa_t gpa, int level)
+{
+ ___virt_pg_map(vm, mmu, gva, gpa, level,
+ vm_is_gpa_protected(vm, gpa));
+}
+
void virt_arch_pg_map(struct kvm_vm *vm, gva_t gva, gpa_t gpa)
{
__virt_pg_map(vm, &vm->mmu, gva, gpa, PG_LEVEL_4K);
--
2.54.0.545.g6539524ca2-goog
^ permalink raw reply related
* [POC PATCH 5/6] KVM: selftests: Test conversions for SNP
From: Ackerley Tng @ 2026-04-28 23:33 UTC (permalink / raw)
To: devnull+ackerleytng.google.com
Cc: ackerleytng, aik, akpm, andrew.jones, aneesh.kumar, axelrasmussen,
baohua, bhe, binbin.wu, bp, brauner, chao.p.peng, chrisl, corbet,
dave.hansen, david, forkloop, hpa, ira.weiny, jgg, jmattson,
jthoughton, kas, kasong, kvm, linux-coco, linux-doc, linux-kernel,
linux-kselftest, linux-mm, linux-trace-kernel, mathieu.desnoyers,
mhiramat, michael.roth, mingo, nphamcs, oupton, pankaj.gupta,
pbonzini, pratyush, qi.zheng, qperret, rick.p.edgecombe, rientjes,
rostedt, seanjc, shakeel.butt, shikemeng, shivankg, shuah, skhan,
steven.price, suzuki.poulose, tabba, tglx, vannapurve, vbabka,
weixugc, willy, wyihan, x86, yan.y.zhao, youngjun.park, yuanchu
In-Reply-To: <cover.1777418884.git.ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
.../selftests/kvm/x86/sev_smoke_test.c | 190 +++++++++++++++++-
1 file changed, 185 insertions(+), 5 deletions(-)
diff --git a/tools/testing/selftests/kvm/x86/sev_smoke_test.c b/tools/testing/selftests/kvm/x86/sev_smoke_test.c
index 8b859adf4cf6f..86f17e59e9392 100644
--- a/tools/testing/selftests/kvm/x86/sev_smoke_test.c
+++ b/tools/testing/selftests/kvm/x86/sev_smoke_test.c
@@ -253,17 +253,197 @@ static void test_sev_smoke(void *guest, u32 type, u64 policy)
}
}
+#define GHCB_MSR_REG_GPA_REQ 0x012
+#define GHCB_MSR_REG_GPA_REQ_VAL(v) \
+ /* GHCBData[63:12] */ \
+ (((u64)((v) & GENMASK_ULL(51, 0)) << 12) | \
+ /* GHCBData[11:0] */ \
+ GHCB_MSR_REG_GPA_REQ)
+
+#define GHCB_MSR_REG_GPA_RESP 0x013
+#define GHCB_MSR_REG_GPA_RESP_VAL(v) \
+ /* GHCBData[63:12] */ \
+ (((u64)(v) & GENMASK_ULL(63, 12)) >> 12)
+
+#define GHCB_DATA_LOW 12
+#define GHCB_MSR_INFO_MASK (BIT_ULL(GHCB_DATA_LOW) - 1)
+#define GHCB_RESP_CODE(v) ((v) & GHCB_MSR_INFO_MASK)
+
+/*
+ * SNP Page State Change Operation
+ *
+ * GHCBData[55:52] - Page operation:
+ * 0x0001 Page assignment, Private
+ * 0x0002 Page assignment, Shared
+ */
+enum psc_op {
+ SNP_PAGE_STATE_PRIVATE = 1,
+ SNP_PAGE_STATE_SHARED,
+};
+
+#define GHCB_MSR_PSC_REQ 0x014
+#define GHCB_MSR_PSC_REQ_GFN(gfn, op) \
+ /* GHCBData[55:52] */ \
+ (((u64)((op) & 0xf) << 52) | \
+ /* GHCBData[51:12] */ \
+ ((u64)((gfn) & GENMASK_ULL(39, 0)) << 12) | \
+ /* GHCBData[11:0] */ \
+ GHCB_MSR_PSC_REQ)
+
+#define GHCB_MSR_PSC_RESP 0x015
+#define GHCB_MSR_PSC_RESP_VAL(val) \
+ /* GHCBData[63:32] */ \
+ (((u64)(val) & GENMASK_ULL(63, 32)) >> 32)
+
+static u64 ghcb_gpa;
+static void snp_register_ghcb(void)
+{
+ u64 ghcb_pfn = ghcb_gpa >> PAGE_SHIFT;
+ u64 val;
+
+ GUEST_ASSERT(ghcb_gpa);
+
+ wrmsr(MSR_AMD64_SEV_ES_GHCB, GHCB_MSR_REG_GPA_REQ_VAL(ghcb_gpa >> PAGE_SHIFT));
+ vmgexit();
+
+ val = rdmsr(MSR_AMD64_SEV_ES_GHCB);
+ GUEST_ASSERT_EQ(GHCB_RESP_CODE(val), GHCB_MSR_REG_GPA_RESP);
+ GUEST_ASSERT_EQ(GHCB_MSR_REG_GPA_RESP_VAL(val), ghcb_pfn);
+}
+
+static void snp_page_state_change(u64 gpa, enum psc_op op)
+{
+ u64 val;
+
+ wrmsr(MSR_AMD64_SEV_ES_GHCB, GHCB_MSR_PSC_REQ_GFN(gpa >> PAGE_SHIFT, op));
+ vmgexit();
+
+ val = rdmsr(MSR_AMD64_SEV_ES_GHCB);
+ GUEST_ASSERT_EQ(GHCB_RESP_CODE(val), GHCB_MSR_PSC_RESP);
+ GUEST_ASSERT_EQ(GHCB_MSR_PSC_RESP_VAL(val), 0);
+}
+
+#define RMP_PG_SIZE_4K 0
+static inline void pvalidate(void *vaddr, bool validate)
+{
+ bool no_rmpupdate;
+ int rc;
+
+ /* "pvalidate" mnemonic support in binutils 2.36 and newer */
+ asm volatile(".byte 0xF2, 0x0F, 0x01, 0xFF\n\t"
+ : "=@ccc"(no_rmpupdate), "=a"(rc)
+ : "a"(vaddr), "c"(RMP_PG_SIZE_4K), "d"(validate)
+ : "memory", "cc");
+
+ GUEST_ASSERT(!no_rmpupdate);
+ GUEST_ASSERT_EQ(rc, 0);
+}
+
+#define CONVERSION_TEST_VALUE_SHARED_1 0xab
+#define CONVERSION_TEST_VALUE_SHARED_2 0xcd
+#define CONVERSION_TEST_VALUE_PRIVATE 0xef
+#define CONVERSION_TEST_VALUE_SHARED_3 0xbc
+static void guest_code_conversion(u8 *test_shared_gva, u8 *test_private_gva, u64 test_gpa)
+{
+ snp_register_ghcb();
+
+ GUEST_ASSERT_EQ(READ_ONCE(*test_shared_gva), CONVERSION_TEST_VALUE_SHARED_1);
+ WRITE_ONCE(*test_shared_gva, CONVERSION_TEST_VALUE_SHARED_2);
+
+ snp_page_state_change(test_gpa, SNP_PAGE_STATE_PRIVATE);
+ pvalidate(test_private_gva, true);
+
+ WRITE_ONCE(*test_private_gva, CONVERSION_TEST_VALUE_PRIVATE);
+ GUEST_ASSERT_EQ(READ_ONCE(*test_private_gva), CONVERSION_TEST_VALUE_PRIVATE);
+
+ pvalidate(test_private_gva, false);
+ snp_page_state_change(test_gpa, SNP_PAGE_STATE_SHARED);
+
+ WRITE_ONCE(*test_shared_gva, CONVERSION_TEST_VALUE_SHARED_3);
+
+ wrmsr(MSR_AMD64_SEV_ES_GHCB, GHCB_MSR_TERM_REQ);
+ vmgexit();
+}
+
+static void test_conversion(u64 policy)
+{
+ gva_t test_private_gva;
+ gva_t test_shared_gva;
+ struct kvm_vcpu *vcpu;
+ gva_t ghcb_gva;
+ gpa_t test_gpa;
+ struct kvm_vm *vm;
+ void *ghcb_hva;
+ void *test_hva;
+
+ vm = vm_sev_create_with_one_vcpu(KVM_X86_SNP_VM, guest_code_conversion, &vcpu);
+
+ ghcb_gva = vm_alloc_shared(vm, PAGE_SIZE, KVM_UTIL_MIN_VADDR,
+ MEM_REGION_TEST_DATA);
+ ghcb_hva = addr_gva2hva(vm, ghcb_gva);
+ ghcb_gpa = addr_gva2gpa(vm, ghcb_gva);
+ sync_global_to_guest(vm, ghcb_gpa);
+
+ test_shared_gva = vm_alloc_shared(vm, PAGE_SIZE, KVM_UTIL_MIN_VADDR,
+ MEM_REGION_TEST_DATA);
+ test_hva = addr_gva2hva(vm, test_shared_gva);
+ test_gpa = addr_gva2gpa(vm, test_shared_gva);
+
+ test_private_gva = vm_unused_gva_gap(vm, PAGE_SIZE, KVM_UTIL_MIN_VADDR);
+ ___virt_pg_map(vm, &vm->mmu, test_private_gva, test_gpa, PG_LEVEL_4K, true);
+
+ vcpu_args_set(vcpu, 3, test_shared_gva, test_private_gva, test_gpa);
+
+ vm_sev_launch(vm, policy, NULL);
+
+ WRITE_ONCE(*(u8 *)test_hva, CONVERSION_TEST_VALUE_SHARED_1);
+
+ fprintf(stderr, "ghcb_hva=%p ghcb_gpa=%lx ghcb_gva=%lx\n", ghcb_hva, ghcb_gpa, ghcb_gva);
+ fprintf(stderr, "test_hva=%p test_gpa=%lx test_private_gva=%lx test_shared_gva=%lx\n", test_hva, test_gpa, test_private_gva, test_shared_gva);
+
+ vcpu_run(vcpu);
+
+ TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_HYPERCALL);
+ TEST_ASSERT_EQ(vcpu->run->hypercall.nr, KVM_HC_MAP_GPA_RANGE);
+ TEST_ASSERT_EQ(vcpu->run->hypercall.args[0], test_gpa);
+ TEST_ASSERT_EQ(vcpu->run->hypercall.args[1], 1);
+ TEST_ASSERT_EQ(vcpu->run->hypercall.args[2], KVM_MAP_GPA_RANGE_ENCRYPTED | KVM_MAP_GPA_RANGE_PAGE_SZ_4K);
+
+ vm_mem_set_private(vm, test_gpa, PAGE_SIZE, KVM_SET_MEMORY_ATTRIBUTES2_MODE_UNSPECIFIED);
+
+ vcpu_run(vcpu);
+
+ TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_HYPERCALL);
+ TEST_ASSERT_EQ(vcpu->run->hypercall.nr, KVM_HC_MAP_GPA_RANGE);
+ TEST_ASSERT_EQ(vcpu->run->hypercall.args[0], test_gpa);
+ TEST_ASSERT_EQ(vcpu->run->hypercall.args[1], 1);
+ TEST_ASSERT_EQ(vcpu->run->hypercall.args[2], KVM_MAP_GPA_RANGE_DECRYPTED | KVM_MAP_GPA_RANGE_PAGE_SZ_4K);
+
+ vm_mem_set_shared(vm, test_gpa, PAGE_SIZE, KVM_SET_MEMORY_ATTRIBUTES2_MODE_UNSPECIFIED);
+
+ vcpu_run(vcpu);
+
+ TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_SYSTEM_EVENT);
+ TEST_ASSERT_EQ(vcpu->run->system_event.type, KVM_SYSTEM_EVENT_SEV_TERM);
+ TEST_ASSERT_EQ(vcpu->run->system_event.ndata, 1);
+ TEST_ASSERT_EQ(vcpu->run->system_event.data[0], GHCB_MSR_TERM_REQ);
+
+ TEST_ASSERT_EQ(*(u8 *)test_hva, CONVERSION_TEST_VALUE_SHARED_3);
+}
+
int main(int argc, char *argv[])
{
TEST_REQUIRE(kvm_cpu_has(X86_FEATURE_SEV));
- test_sev_smoke(guest_sev_code, KVM_X86_SEV_VM, 0);
+ // test_sev_smoke(guest_sev_code, KVM_X86_SEV_VM, 0);
- if (kvm_cpu_has(X86_FEATURE_SEV_ES))
- test_sev_smoke(guest_sev_es_code, KVM_X86_SEV_ES_VM, SEV_POLICY_ES);
+ // if (kvm_cpu_has(X86_FEATURE_SEV_ES))
+ // test_sev_smoke(guest_sev_es_code, KVM_X86_SEV_ES_VM, SEV_POLICY_ES);
- if (kvm_cpu_has(X86_FEATURE_SEV_SNP))
- test_sev_smoke(guest_snp_code, KVM_X86_SNP_VM, snp_default_policy());
+ if (kvm_cpu_has(X86_FEATURE_SEV_SNP)) {
+ test_conversion(snp_default_policy());
+ // test_sev_smoke(guest_snp_code, KVM_X86_SNP_VM, snp_default_policy());
+ }
return 0;
}
--
2.54.0.545.g6539524ca2-goog
^ permalink raw reply related
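Editorial note on patch 5/6: the bit layout that the GHCB_MSR_PSC_* macros encode can be checked in isolation. The sketch below repacks the same fields as the patch comments describe (GHCBData[55:52] operation, [51:12] GFN, [11:0] request code, [63:32] error code in the response). The helper names (psc_request, ghcb_resp_code, psc_response_error) are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

#define GHCB_INFO_BITS		12
#define GHCB_INFO_MASK		((1ULL << GHCB_INFO_BITS) - 1)
#define GHCB_MSR_PSC_REQ	0x014ULL
#define GHCB_MSR_PSC_RESP	0x015ULL

enum psc_op {
	PSC_OP_PRIVATE = 1,
	PSC_OP_SHARED = 2,
};

/* Pack GHCBData[55:52] = op, [51:12] = GFN, [11:0] = request code. */
static uint64_t psc_request(uint64_t gfn, enum psc_op op)
{
	const uint64_t gfn_mask = (1ULL << 40) - 1;

	return ((uint64_t)(op & 0xf) << 52) |
	       ((gfn & gfn_mask) << GHCB_INFO_BITS) |
	       GHCB_MSR_PSC_REQ;
}

/* GHCBData[11:0] identifies which response the MSR holds. */
static uint64_t ghcb_resp_code(uint64_t msr)
{
	return msr & GHCB_INFO_MASK;
}

/* GHCBData[63:32] carries the error code; 0 means success. */
static uint32_t psc_response_error(uint64_t msr)
{
	return (uint32_t)(msr >> 32);
}
```

For example, a private conversion request for GFN 0x1234 packs to 0x0010000001234014, and a response of 0x15 with the upper 32 bits clear decodes as a successful page state change.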
* [POC PATCH 6/6] KVM: selftests: Test content modes ZERO and PRESERVE for SNP
From: Ackerley Tng @ 2026-04-28 23:33 UTC (permalink / raw)
To: devnull+ackerleytng.google.com
Cc: ackerleytng, aik, akpm, andrew.jones, aneesh.kumar, axelrasmussen,
baohua, bhe, binbin.wu, bp, brauner, chao.p.peng, chrisl, corbet,
dave.hansen, david, forkloop, hpa, ira.weiny, jgg, jmattson,
jthoughton, kas, kasong, kvm, linux-coco, linux-doc, linux-kernel,
linux-kselftest, linux-mm, linux-trace-kernel, mathieu.desnoyers,
mhiramat, michael.roth, mingo, nphamcs, oupton, pankaj.gupta,
pbonzini, pratyush, qi.zheng, qperret, rick.p.edgecombe, rientjes,
rostedt, seanjc, shakeel.butt, shikemeng, shivankg, shuah, skhan,
steven.price, suzuki.poulose, tabba, tglx, vannapurve, vbabka,
weixugc, willy, wyihan, x86, yan.y.zhao, youngjun.park, yuanchu
In-Reply-To: <cover.1777418884.git.ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
.../selftests/kvm/x86/sev_smoke_test.c | 47 +++++++++++++++++--
1 file changed, 44 insertions(+), 3 deletions(-)
diff --git a/tools/testing/selftests/kvm/x86/sev_smoke_test.c b/tools/testing/selftests/kvm/x86/sev_smoke_test.c
index 86f17e59e9392..7a91a113c4fb7 100644
--- a/tools/testing/selftests/kvm/x86/sev_smoke_test.c
+++ b/tools/testing/selftests/kvm/x86/sev_smoke_test.c
@@ -365,7 +365,26 @@ static void guest_code_conversion(u8 *test_shared_gva, u8 *test_private_gva, u64
vmgexit();
}
-static void test_conversion(u64 policy)
+static void vm_set_memory_attributes_expect_error(struct kvm_vm *vm, u64 gpa,
+ size_t size, u64 attributes,
+ u64 flags, int expected_errno)
+{
+ loff_t error_offset = -1;
+ size_t len_ignored;
+ loff_t offset;
+ int gmem_fd;
+ int ret;
+
+ gmem_fd = kvm_gpa_to_guest_memfd(vm, gpa, &offset, &len_ignored);
+ ret = __gmem_set_memory_attributes(gmem_fd, offset, size, attributes,
+ &error_offset, flags);
+
+ TEST_ASSERT_EQ(ret, -1);
+ TEST_ASSERT_EQ(offset, error_offset);
+ TEST_ASSERT_EQ(errno, expected_errno);
+}
+
+static void test_conversion(u64 policy, u64 content_mode)
{
gva_t test_private_gva;
gva_t test_shared_gva;
@@ -409,6 +428,21 @@ static void test_conversion(u64 policy)
TEST_ASSERT_EQ(vcpu->run->hypercall.args[1], 1);
TEST_ASSERT_EQ(vcpu->run->hypercall.args[2], KVM_MAP_GPA_RANGE_ENCRYPTED | KVM_MAP_GPA_RANGE_PAGE_SZ_4K);
+ /* ZERO is never supported when setting memory attributes to private. */
+ vm_set_memory_attributes_expect_error(vm, test_gpa, PAGE_SIZE,
+ KVM_MEMORY_ATTRIBUTE_PRIVATE,
+ KVM_SET_MEMORY_ATTRIBUTES2_ZERO,
+ EOPNOTSUPP);
+
+ /* PRESERVE is not supported for SNP. */
+ vm_set_memory_attributes_expect_error(vm, test_gpa, PAGE_SIZE, 0,
+ KVM_SET_MEMORY_ATTRIBUTES2_PRESERVE,
+ EOPNOTSUPP);
+ vm_set_memory_attributes_expect_error(vm, test_gpa, PAGE_SIZE,
+ KVM_MEMORY_ATTRIBUTE_PRIVATE,
+ KVM_SET_MEMORY_ATTRIBUTES2_PRESERVE,
+ EOPNOTSUPP);
+
vm_mem_set_private(vm, test_gpa, PAGE_SIZE, KVM_SET_MEMORY_ATTRIBUTES2_MODE_UNSPECIFIED);
vcpu_run(vcpu);
@@ -419,7 +453,12 @@ static void test_conversion(u64 policy)
TEST_ASSERT_EQ(vcpu->run->hypercall.args[1], 1);
TEST_ASSERT_EQ(vcpu->run->hypercall.args[2], KVM_MAP_GPA_RANGE_DECRYPTED | KVM_MAP_GPA_RANGE_PAGE_SZ_4K);
- vm_mem_set_shared(vm, test_gpa, PAGE_SIZE, KVM_SET_MEMORY_ATTRIBUTES2_MODE_UNSPECIFIED);
+ vm_mem_set_shared(vm, test_gpa, PAGE_SIZE, content_mode);
+
+ if (content_mode == KVM_SET_MEMORY_ATTRIBUTES2_ZERO)
+ TEST_ASSERT_EQ(READ_ONCE(*(u8 *)test_hva), 0);
+ else
+ fprintf(stderr, "test_hva contents = %x\n", READ_ONCE(*(u8 *)test_hva));
vcpu_run(vcpu);
@@ -441,7 +480,9 @@ int main(int argc, char *argv[])
// test_sev_smoke(guest_sev_es_code, KVM_X86_SEV_ES_VM, SEV_POLICY_ES);
if (kvm_cpu_has(X86_FEATURE_SEV_SNP)) {
- test_conversion(snp_default_policy());
+ test_conversion(snp_default_policy(), KVM_SET_MEMORY_ATTRIBUTES2_MODE_UNSPECIFIED);
+ test_conversion(snp_default_policy(), KVM_SET_MEMORY_ATTRIBUTES2_ZERO);
+
// test_sev_smoke(guest_snp_code, KVM_X86_SNP_VM, snp_default_policy());
}
--
2.54.0.545.g6539524ca2-goog
^ permalink raw reply related
* Re: [PATCH RFC v5 24/53] KVM: SEV: Make 'uaddr' parameter optional for KVM_SEV_SNP_LAUNCH_UPDATE
From: Ackerley Tng @ 2026-04-28 23:40 UTC (permalink / raw)
To: Ackerley Tng via B4 Relay, aik, andrew.jones, binbin.wu, brauner,
chao.p.peng, david, ira.weiny, jmattson, jthoughton, michael.roth,
oupton, pankaj.gupta, qperret, rick.p.edgecombe, rientjes,
shivankg, steven.price, tabba, willy, wyihan, yan.y.zhao,
forkloop, pratyush, suzuki.poulose, aneesh.kumar, Paolo Bonzini,
Sean Christopherson, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka
Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <20260428-gmem-inplace-conversion-v5-24-d8608ccfca22@google.com>
Ackerley Tng via B4 Relay <devnull+ackerleytng.google.com@kernel.org>
writes:
> From: Michael Roth <michael.roth@amd.com>
>
Thanks Michael!
>
> [...snip...]
>
>
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index c2126b3c30724..bf10d24907a00 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -2343,7 +2343,15 @@ static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> int level;
> int ret;
>
> - if (WARN_ON_ONCE(sev_populate_args->type != KVM_SEV_SNP_PAGE_TYPE_ZERO && !src_page))
> + /*
> + * For vm_memory_attributes=1, in-place conversion/population is not
> + * supported, so the initial contents necessarily need to come from a
> + * separate src address. For vm_memory_attributes=0, this isn't
> + * necessarily the case, since the pages may have been populated
> + * directly from userspace before calling KVM_SEV_SNP_LAUNCH_UPDATE.
> + */
I dropped the #ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES from [1] since
vm_memory_attributes is #define-d as false when
CONFIG_KVM_VM_MEMORY_ATTRIBUTES is not defined.
> + if (vm_memory_attributes &&
> + sev_populate_args->type != KVM_SEV_SNP_PAGE_TYPE_ZERO && !src_page)
> return -EINVAL;
>
> ret = snp_lookup_rmpentry((u64)pfn, &assigned, &level);
[1] https://github.com/AMDESE/linux/commit/7e7c29afdf3763822ced0b7007fc0f93b8fb993d
>
> [...snip...]
>
^ permalink raw reply
* Re: [RFC PATCH 16/19] mm/damon: trace probe_hits
From: SeongJae Park @ 2026-04-29 0:13 UTC (permalink / raw)
To: Steven Rostedt
Cc: SeongJae Park, Andrew Morton, Masami Hiramatsu, Mathieu Desnoyers,
damon, linux-kernel, linux-mm, linux-trace-kernel
In-Reply-To: <20260428141745.2768ac4e@gandalf.local.home>
On Tue, 28 Apr 2026 14:17:45 -0400 Steven Rostedt <rostedt@goodmis.org> wrote:
> On Sun, 26 Apr 2026 13:52:17 -0700
> SeongJae Park <sj@kernel.org> wrote:
>
> > Introduce a new tracepoint for exposing the per-region per-probe
> > positive sample count via tracefs.
> >
> > Signed-off-by: SeongJae Park <sj@kernel.org>
> > ---
> > include/trace/events/damon.h | 41 ++++++++++++++++++++++++++++++++++++
> > mm/damon/core.c | 1 +
> > 2 files changed, 42 insertions(+)
> >
> > diff --git a/include/trace/events/damon.h b/include/trace/events/damon.h
> > index 7e25f4469b81b..121d7bc3a2c27 100644
> > --- a/include/trace/events/damon.h
> > +++ b/include/trace/events/damon.h
> > @@ -130,6 +130,47 @@ TRACE_EVENT(damon_monitor_intervals_tune,
> > TP_printk("sample_us=%lu", __entry->sample_us)
> > );
> >
> > +TRACE_EVENT(damon_aggregated_v2,
> > +
> > + TP_PROTO(unsigned int target_id, struct damon_region *r,
> > + unsigned int nr_regions),
> > +
> > + TP_ARGS(target_id, r, nr_regions),
> > +
> > + TP_STRUCT__entry(
> > + __field(unsigned long, target_id)
> > + __field(unsigned int, nr_regions)
>
> Move the nr_regions to after "end" as on 64 bit machines, this creates a 4
> byte hole.
Thank you for the nice suggestion, Steven. Will do so in the next version.
Thanks,
SJ
[...]
^ permalink raw reply
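Editorial note on the padding remark above: the 4-byte hole can be seen with offsetof() on any LP64 target. The two structs below are hypothetical stand-ins for the trace entry (only target_id, nr_regions and "end" are named in the thread; start and nr_accesses are assumed for illustration), so treat this as a sketch of the layout rule rather than the real event layout.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layout with the 4-byte field between 8-byte fields. */
struct entry_with_hole {
	unsigned long target_id;
	unsigned int nr_regions;	/* padding may follow on LP64 */
	unsigned long start;
	unsigned long end;
	unsigned int nr_accesses;
};

/* Same fields, with the 4-byte fields grouped after "end". */
struct entry_packed_order {
	unsigned long target_id;
	unsigned long start;
	unsigned long end;
	unsigned int nr_regions;
	unsigned int nr_accesses;
};

/* Bytes of padding between nr_regions and start in the first layout. */
static size_t hole_after_nr_regions(void)
{
	return offsetof(struct entry_with_hole, start) -
	       (offsetof(struct entry_with_hole, nr_regions) +
		sizeof(unsigned int));
}
```

On a typical 64-bit ABI the first layout has a 4-byte hole and occupies 40 bytes, while the reordered one occupies 32. On 32-bit targets both layouts are usually the same size.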
* Re: [PATCH] tracing/probes: Limit size of event probe to 3K
From: Steven Rostedt @ 2026-04-29 0:32 UTC (permalink / raw)
To: LKML, Linux Trace Kernel; +Cc: Masami Hiramatsu, Mathieu Desnoyers
In-Reply-To: <20260428122302.706610ba@gandalf.local.home>
On Tue, 28 Apr 2026 12:23:02 -0400
Steven Rostedt <rostedt@kernel.org> wrote:
> From: Steven Rostedt <rostedt@goodmis.org>
>
> There is currently no maximum size for an event probe. One could make an
> event greater than PAGE_SIZE, which makes the event useless because if
> it's bigger than the max event that can be recorded into the ring buffer,
> then it will never be recorded.
>
> An event probe should never need to be greater than 3K, so make that the
> max size. As long as the max is less than the max that can be recorded
> onto the ring buffer, it should be fine.
>
> Cc: stable@vger.kernel.org
> Fixes: 93ccae7a22274 ("tracing/kprobes: Support basic types on dynamic events")
> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Hi Masami,
I ran this through my tests along with some other fixes I have. If you
ack it, I can push this to Linus along with my other changes.
-- Steve
^ permalink raw reply
* [PATCH] kprobes: Remove dead child probes from aggrprobe list on module unload
From: Shijia Hu @ 2026-04-29 3:29 UTC (permalink / raw)
To: mhiramat, naveen, davem
Cc: ananth, akpm, linux-kernel, linux-trace-kernel, hushijia1
When a kernel module that registered kprobes is unloaded without calling
unregister_kprobe(), kprobes_module_callback() calls kill_kprobe() to
mark the probe(s) GONE. If the probe is an aggrprobe, kill_kprobe()
also marks all child probes GONE, but it does not remove them from
the aggrprobe's list.
The problem is that child probes whose struct kprobe resides in the
unloading module's memory are freed along with the module, yet they
remain on the aggrprobe's list. Later, when another caller registers
a kprobe at the same address, __get_valid_kprobe() walks that list
and dereferences the freed child probe, causing a use-after-free.
Reproduction steps:
1) Load module A which registers two kprobes on the same kernel
function address (e.g., do_nanosleep), causing them to be
aggregated under one aggrprobe.
2) Unload module A without calling unregister_kprobe().
Module A's memory is freed, but its two child probes remain
on the aggrprobe's list as dangling pointers.
3) Load module B and register a kprobe on the same address
(e.g., do_nanosleep). register_kprobe() -> __get_valid_kprobe()
traverses the aggrprobe's list and dereferences the freed child
probe from module A, triggering a use-after-free and kernel panic.
The resulting crash looks like:
[ 464.950864] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 464.950872] #PF: supervisor read access in kernel mode
[ 464.950874] #PF: error_code(0x0000) - not-present page
...
[ 464.950915] Call Trace:
[ 464.950922] <TASK>
[ 464.950923] register_kprobe+0x65/0x2e0
[ 464.950928] ? __pfx_stage2_init+0x10/0x10 [kprobe_leak_stage2]
[ 464.950933] stage2_init+0x37/0xff0 [kprobe_leak_stage2]
[ 464.950938] ? __pfx_stage2_init+0x10/0x10 [kprobe_leak_stage2]
[ 464.950942] do_one_initcall+0x56/0x2e0
[ 464.950948] do_init_module+0x60/0x230
...
Fix this by adding selective cleanup in kprobes_module_callback():
after calling kill_kprobe() on the aggrprobe, iterate its child list
and remove any child probe whose struct kprobe is inside the going
module's memory range (within_module_init / within_module_core).
This is done in kprobes_module_callback() rather than kill_kprobe()
because kill_kprobe()'s semantic is "the probed code is going away,
mark probes GONE". The lifetime of a probe is bound to the probed
code, not to the module containing the struct kprobe. Child probes
owned by other still-loaded modules or by kmalloc (ftrace, perf,
kprobe-events) must stay on the list so they can be unregistered
later. Only child probes whose memory is about to be freed need to
be removed from the list to prevent dangling pointers.
Fixes: e8386a0cb22f4 ("kprobes: support probing module __exit function")
Signed-off-by: Shijia Hu <hushijia1@uniontech.com>
---
kernel/kprobes.c | 23 ++++++++++++++++++++++-
1 file changed, 22 insertions(+), 1 deletion(-)
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index bfc89083daa9..ff277314183c 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -2664,6 +2664,7 @@ static int kprobes_module_callback(struct notifier_block *nb,
unsigned long val, void *data)
{
struct module *mod = data;
+ struct hlist_node *tmp;
struct hlist_head *head;
struct kprobe *p;
unsigned int i;
@@ -2685,7 +2686,7 @@ static int kprobes_module_callback(struct notifier_block *nb,
*/
for (i = 0; i < KPROBE_TABLE_SIZE; i++) {
head = &kprobe_table[i];
- hlist_for_each_entry(p, head, hlist)
+ hlist_for_each_entry_safe(p, tmp, head, hlist) {
if (within_module_init((unsigned long)p->addr, mod) ||
(checkcore &&
within_module_core((unsigned long)p->addr, mod))) {
@@ -2702,6 +2703,26 @@ static int kprobes_module_callback(struct notifier_block *nb,
*/
kill_kprobe(p);
}
+
+ /*
+ * Child probes are not on the kprobe hash list, so
+ * the above loop can not find them. If a child probe
+ * is allocated in the module's memory, it will become
+ * a dangling pointer after the module is freed.
+ */
+ if (kprobe_aggrprobe(p)) {
+ struct kprobe *kp, *kptmp;
+
+ list_for_each_entry_safe(kp, kptmp, &p->list, list) {
+ if (within_module_init((unsigned long)kp, mod) ||
+ (checkcore &&
+ within_module_core((unsigned long)kp, mod))) {
+ kp->flags |= KPROBE_FLAG_GONE;
+ list_del_rcu(&kp->list);
+ }
+ }
+ }
+ }
}
if (val == MODULE_STATE_GOING)
remove_module_kprobe_blacklist(mod);
--
2.20.1
^ permalink raw reply related
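Editorial note on the fix above: the core idea is to unlink, while walking the list, every node whose storage lies inside a memory range that is about to be freed. The userspace sketch below uses a hypothetical singly linked struct probe and a within_range() helper in place of the kernel's list_head, within_module_init() and within_module_core(); it shows the pattern only and ignores RCU and locking.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct probe {
	struct probe *next;
	int gone;
};

/* True if p lies inside [base, base + size). */
static int within_range(const void *p, const void *base, size_t size)
{
	uintptr_t addr = (uintptr_t)p;
	uintptr_t start = (uintptr_t)base;

	return addr >= start && addr - start < size;
}

/*
 * Unlink every node whose storage lies in the range that is about to be
 * freed. Walking via a pointer to the link means that unlinking a node
 * never invalidates the cursor, which serves the same purpose as the
 * kernel's *_for_each_entry_safe() iterators.
 */
static size_t unlink_in_range(struct probe **head, const void *base, size_t size)
{
	struct probe **link = head;
	size_t removed = 0;

	while (*link) {
		struct probe *cur = *link;

		if (within_range(cur, base, size)) {
			cur->gone = 1;
			*link = cur->next;	/* cur is no longer reachable */
			removed++;
		} else {
			link = &cur->next;
		}
	}
	return removed;
}

/* Count the nodes still reachable from head. */
static size_t list_len(const struct probe *head)
{
	size_t n = 0;

	for (; head; head = head->next)
		n++;
	return n;
}
```

Nodes that live outside the range stay on the list, which mirrors the patch's point that probes owned by other modules or by kmalloc must remain reachable so they can be unregistered later.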
* Re: [PATCH v2] mm/page_alloc: trace PCP refills and PCP zone lock usage
From: SUVONOV BUNYOD @ 2026-04-29 3:31 UTC (permalink / raw)
To: Steven Rostedt
Cc: akpm, vbabka, linux-mm, mhiramat, mathieu desnoyers,
linux-trace-kernel, linux-kernel, surenb, mhocko, jackmanb,
hannes, ziy, david, vishal moola, corbet, skhan, linux-doc
In-Reply-To: <20260428142335.3bca0166@gandalf.local.home>
Thanks for reviewing Steven,
>Why this change? It makes it much harder to understand.
>
>The above is not a normal macro. Ignore any checkpatch warnings about it.
>The proper way to do the TP_STRUCT__entry() is to make it just like a struct:
>
>struct {
> unsigned long pfn;
> unsigned int order;
> int migratetype;
>};
>
>Thus, the macro should be:
>
> TP_STRUCT__entry(
> __field( unsigned long, pfn )
> __field( unsigned int, order )
> __field( int, migratetype )
> ),
Yeah sorry for the formatting issue, will fix in v3. Any other concerns?
What do you think about the introduction of those tracepoints themselves?
-- Bunyod
^ permalink raw reply
* Re: [PATCH] kprobes: skip non-symbol addresses in kprobe_add_ksym_blacklist()
From: Jianpeng Chang @ 2026-04-29 8:16 UTC (permalink / raw)
To: Masami Hiramatsu (Google)
Cc: naveen, davem, catalin.marinas, mark.rutland, linux-kernel,
linux-trace-kernel, stable
In-Reply-To: <20260428184321.309a48036892b8d23a08b566@kernel.org>
On 2026/4/28 at 5:43 PM, Masami Hiramatsu (Google) wrote:
> Hi,
>
> On Mon, 27 Apr 2026 15:35:44 +0800 Jianpeng Chang
> <jianpeng.chang.cn@windriver.com> wrote:
>
>> When kprobe_add_area_blacklist() iterates through a section like
>> .kprobes.text, the start address may not correspond to a named
>> symbol. On ARM64 with CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS=y
>> (introduced by commit baaf553d3bc3 ("arm64: Implement
>> HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS")), the compiler flag
>> -fpatchable-function-entry=4,2 inserts 2 NOPs before each function
>> entry point for ftrace call_ops. These pre-function NOPs sit at
>> the section base address, before the first named function symbol.
>> The compiler emits a $x mapping symbol at offset 0x00 to mark the
>> start of code, but find_kallsyms_symbol() ignores mapping symbols.
>>
>> Without CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS (e.g. defconfig), no
>> pre-function NOPs are inserted, the first function starts at
>> offset 0x00, and the bug does not trigger.
>>
>> This only affects modules that have a .kprobes.text section (i.e.
>> those using the __kprobes annotation). Modules using
>> NOKPROBE_SYMBOL() instead (like kretprobe_example.ko) blacklist
>> exact function addresses via the _kprobe_blacklist section and are
>> not affected.
>>
>> For kprobe_example.ko on ARM64 with
>> -fpatchable-function-entry=4,2, the .kprobes.text section layout is:
>>
>> offset 0x00: $x + 2 NOPs (mapping symbol + ftrace preamble)
>> offset 0x08: handler_post (64 bytes)
>> offset 0x50: handler_pre (68 bytes)
>
> Ah, OK. It is for __kprobes attribute. I recommend user to use
> NOKPROBE_SYMBOL() but I understand the situation.
>
>>
>> kprobe_add_area_blacklist() starts iterating from the section base
>> address (offset 0x00), which only has the $x mapping symbol.
>> kprobe_add_ksym_blacklist() then calls
>> kallsyms_lookup_size_offset() for this address, which goes
>> through:
>>
>> kallsyms_lookup_size_offset() -> module_address_lookup() ->
>> find_kallsyms_symbol()
>>
>> find_kallsyms_symbol() scans all module symbols to find the
>> closest preceding symbol.
>>
>> Since no named text symbol exists at offset 0x00,
>> find_kallsyms_symbol() picks __UNIQUE_ID_vermagic (a .modinfo
>> symbol whose address is in the temporary image) as the "best"
>> match. The computed "size" = next_text_symbol - modinfo_symbol
>> spans across these two unrelated memory regions, creating a
>> blacklist entry with a bogus range of tens of terabytes.
>>
>> Whether this causes a visible failure depends on address
>> randomization, here is what happens on Raspberry Pi 4/5:
>>
>> - On RPi5, the bogus size was ~35 TB. start + size stayed within
>> 64-bit range, so the blacklist entry covered the entire kernel
>> text. register_kprobe() in the module's own init function failed
>> with -EINVAL.
>>
>> - On RPi4, the bogus size was ~75 TB. start + size overflowed 64
>> bits and wrapped to a small address near zero. The range check
>> (addr >= start && addr < end) then failed because end wrapped
>> around, so the bogus entry was accidentally harmless and kprobes
>> worked by luck.
>>
>> The same bug exists on both machines, but randomization determines
>> whether the integer overflow masks it or not.
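[Editor's sketch] The two outcomes described above can be reproduced in plain userspace C. This is only a model of the quoted range check, not kernel code: blacklist_covers(), wraparound_demo() and both addresses are hypothetical, chosen so that a ~35 TB size stays below 2^64 while a ~75 TB size wraps.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TB (1ULL << 40)

/* Same shape as the quoted check: addr >= start && addr < end. */
static bool blacklist_covers(uint64_t addr, uint64_t start, uint64_t size)
{
	uint64_t end = start + size;	/* unsigned add wraps modulo 2^64 */

	return addr >= start && addr < end;
}

/* Hypothetical addresses; start is 48 TB below the top of the address space. */
static void wraparound_demo(void)
{
	const uint64_t start = 0xffffd00000000000ULL;
	const uint64_t probe = 0xffffe00000000000ULL;	/* start + 16 TB */

	/* ~35 TB: end does not wrap, so the bogus entry swallows the probe. */
	assert(blacklist_covers(probe, start, 35 * TB));
	/* ~75 TB: end wraps to a small value, so the entry matches nothing. */
	assert(!blacklist_covers(probe, start, 75 * TB));
}
```

With these values the first case corresponds to the RPi5 failure and the second to the RPi4 "works by luck" case.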
>>
>> Fix this by checking the offset returned by
>> kallsyms_lookup_size_offset(). A non-zero offset means the address
>> is not at a symbol boundary, so skip forward to the next symbol
>> instead of creating a blacklist entry with a wrong size.
>>
>> Fixes: baaf553d3bc3 ("arm64: Implement HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS")
>> Signed-off-by: Jianpeng Chang <jianpeng.chang.cn@windriver.com>
>> ---
>> Hi,
>>
>> This patch skips non-symbol addresses, fixes the bogus blacklist
>> entry, but leaves the NOP gap at the start of .kprobes.text
>> unblacklisted.
>
> That is OK because those NOPs are not executed in kprobe handler.
>
>>
>> We can continue alloc the ent without return to add the gap to
>> blacklist, or do some more works to add the gap to the first
>> symbol in blacklist. I'm not sure if is this necessary, or is
>> there a better way?
>
> Are there any compiler option or attribute to avoid inserting these
> NOPs to the specific section? (like notrace?)
>
> Also, as you can see there is an alias symbol whose size is 0. and
> in that case, we move the entry + 1 and call
> kprobe_add_ksym_blacklist() again. Thus, the offset becomes 1.
> Please make sure it is correctly handled.
>
Regarding the alias symbol concern: kallsyms_lookup_size_offset()
computes size as the distance to the next different-address symbol, not
from ELF st_size. I tested with a module containing alias symbols in
.kprobes.text (created via __attribute__((alias))), and the lookup
returned a correct size with offset=0 — the if (ret == 0) ret = 1 path
was never triggered.
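[Editor's sketch] A toy userspace model of the behaviour described here, where the size is the distance to the next symbol at a different address, so an alias still yields a non-zero size. It is not the kernel's implementation: struct sym, lookup_size_offset(), the alias name and the end marker at 0x94 are all hypothetical. The offsets reuse the .kprobes.text layout from the patch description.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct sym {
	const char *name;
	uint64_t addr;
};

/*
 * Lookup over a table sorted by address. Returns 1 and fills size/offset on
 * success, 0 if addr precedes every symbol or no later symbol bounds it.
 */
static int lookup_size_offset(const struct sym *tab, size_t n, uint64_t addr,
			      uint64_t *size, uint64_t *offset)
{
	size_t i, best = n;

	for (i = 0; i < n && tab[i].addr <= addr; i++)
		best = i;
	if (best == n)
		return 0;
	/* Skip aliases: entries that share the matched symbol's address. */
	for (i = best + 1; i < n && tab[i].addr == tab[best].addr; i++)
		;
	if (i == n)
		return 0;
	*size = tab[i].addr - tab[best].addr;
	*offset = addr - tab[best].addr;
	return 1;
}

static const struct sym demo_tab[] = {
	{ "handler_post",       0x08 },
	{ "handler_post_alias", 0x08 },	/* hypothetical alias */
	{ "handler_pre",        0x50 },
	{ "end_marker",         0x94 },	/* hypothetical upper bound */
};

static void alias_demo(void)
{
	uint64_t size = 0, offset = 0;

	/* On a symbol boundary: offset 0 and a non-zero size despite the alias. */
	assert(lookup_size_offset(demo_tab, 4, 0x08, &size, &offset) == 1);
	assert(size == 0x48 && offset == 0);
	/* One byte in: same symbol, offset 1. */
	assert(lookup_size_offset(demo_tab, 4, 0x09, &size, &offset) == 1);
	assert(size == 0x48 && offset == 1);
}
```

Note that this model simply fails for offset 0x00, whereas the reported bug is that the real lookup returns an unrelated symbol there.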
That said, #define __kprobes notrace __section(".kprobes.text") is a
cleaner fix. The NOPs in .kprobes.text are unnecessary since these
functions should never be traced by ftrace. I've tested this on RPi5 —
the bug is resolved and all .kprobes.text functions are correctly
blacklisted. I'll send the notrace approach in v2.
Thanks,
Jianpeng
> Thanks,
>
>>
>> Thanks, Jianpeng
>>
>> kernel/kprobes.c | 4 ++++
>> 1 file changed, 4 insertions(+)
>>
>> diff --git a/kernel/kprobes.c b/kernel/kprobes.c
>> index bfc89083daa9..be700fb03198 100644
>> --- a/kernel/kprobes.c
>> +++ b/kernel/kprobes.c
>> @@ -2503,6 +2503,10 @@ int kprobe_add_ksym_blacklist(unsigned long entry)
>> !kallsyms_lookup_size_offset(entry, &size, &offset))
>> return -EINVAL;
>>
>> + /* Not on a symbol boundary -- skip to the next symbol */
>> + if (offset)
>> + return (int)(size - offset);
>> +
>> ent = kmalloc_obj(*ent);
>> if (!ent)
>> return -ENOMEM;
>> --
>> 2.54.0
>>
>
>
> -- Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply
* Re: [PATCH 2/3] init: use static buffers for bootconfig extra command line
From: Masami Hiramatsu @ 2026-04-29 8:27 UTC (permalink / raw)
To: Breno Leitao
Cc: Andrew Morton, oss, paulmck, linux-trace-kernel, linux-kernel,
kernel-team
In-Reply-To: <aeJH8mhxqrwdsjxc@gmail.com>
On Fri, 17 Apr 2026 08:38:16 -0700
Breno Leitao <leitao@debian.org> wrote:
>
> On Fri, Apr 17, 2026 at 10:44:36AM +0900, Masami Hiramatsu wrote:
> > On Wed, 15 Apr 2026 03:51:11 -0700
> > Breno Leitao <leitao@debian.org> wrote:
> >
> > But if we can do it, should we continue using bootconfig? I mean
> > it is easy to make a tool (or add a feature in tools/bootconfig)
> > which converts bootconfig file to command line string and embeds
> > it in the kernel. Hmm.
>
> Sure, you are talking about a tool that embeds it in the kernel binary,
> something like:
>
>
> 0) Get a kernel and define CONFIG_BOOT_CONFIG_EMBED_FILE=".bootconfig"
>
> 1) Add an option in tools/bootconfig to convert bootconfig (.bootconfig)
> to a cmdline string ($ bootconfig -C kernel .bootconfig).
> Something like:
> # tools/bootconfig/bootconfig -C kernel .bootconfig
> mem=2G loglevel=7 debug nokaslr %
>
> 2) At kernel build time, run that tool on .bootconfig and embed the
> resulting string into the kernel image as a .init.rodata symbol
> (embedded_kernel_cmdline[]).
>
> # gdb -batch -ex 'x/s &embedded_kernel_cmdline' vmlinux
> 0xffffffff87e108f8: "mem=2G loglevel=7 debug nokaslr "
Yeah, I think this looks good to me.
>
> 3) At boot, the arch's setup_arch() prepends that symbol to
> boot_command_line right before parse_early_param() — so early_param()
> handlers (mem=, earlycon=, loglevel=, ...) actually see kernel.*
> keys from the embedded bootconfig.
Ah, I thought it was an arch-independent config, but it depends on the
architecture.... Hmm.
>
> This needs to be done architecture by architecture. Something like:
>
> @@ -924,6 +925,13 @@ void __init setup_arch(char **cmdline_p)
> builtin_cmdline_added = true;
> #endif
>
> + /*
> + * Prepend kernel.* keys from the embedded bootconfig (rendered at
> + * build time by tools/bootconfig) so parse_early_param() below sees
> + * them. No-op when CONFIG_BOOT_CONFIG_EMBED=n.
> + */
> + xbc_prepend_embedded_cmdline(boot_command_line, COMMAND_LINE_SIZE);
> +
> strscpy(command_line, boot_command_line, COMMAND_LINE_SIZE);
> *cmdline_p = command_line;
>
> Am I describing your suggestion accordingly?
I think we can start supporting this option for the architectures that
already support CONFIG_CMDLINE. Maybe we need a CONFIG_ARCH_SUPPORT_CMDLINE
option which indicates that the architecture supports an embedded cmdline.
Then all of this feature can depend on that Kconfig option.
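[Editor's sketch] The thread does not define what xbc_prepend_embedded_cmdline() does internally, so the following is only a userspace model of the prepend step from item 3 above. prepend_cmdline() and its "leave the buffer untouched on overflow" policy are assumptions, not the proposed kernel helper.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Prepend `embedded` plus one separating space to the NUL-terminated string
 * in `cmdline`, whose capacity is `size` bytes. Returns 0 on success, or -1
 * if the result would not fit, in which case `cmdline` is left untouched.
 */
static int prepend_cmdline(char *cmdline, size_t size, const char *embedded)
{
	size_t elen = strlen(embedded);
	size_t clen = strlen(cmdline);

	if (elen == 0)
		return 0;	/* nothing embedded: no-op */
	if (elen + 1 + clen + 1 > size)
		return -1;
	memmove(cmdline + elen + 1, cmdline, clen + 1);	/* shift, NUL included */
	memcpy(cmdline, embedded, elen);
	cmdline[elen] = ' ';
	return 0;
}

static void prepend_demo(void)
{
	char buf[64] = "root=/dev/sda1";

	assert(prepend_cmdline(buf, sizeof(buf), "mem=2G loglevel=7") == 0);
	assert(strcmp(buf, "mem=2G loglevel=7 root=/dev/sda1") == 0);
}
```

Because the embedded keys end up first, anything given on the real command line comes later in the string, which matters for parameters where the last occurrence takes effect.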
Thank you,
>
> Thanks!
> --breno
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply
* Re: [PATCH] kprobes: skip non-symbol addresses in kprobe_add_ksym_blacklist()
From: Masami Hiramatsu @ 2026-04-29 8:30 UTC (permalink / raw)
To: Jianpeng Chang
Cc: naveen, davem, catalin.marinas, mark.rutland, linux-kernel,
linux-trace-kernel, stable
In-Reply-To: <a298419e-c581-41c9-b6d5-9319c24b7995@windriver.com>
On Wed, 29 Apr 2026 16:16:44 +0800
Jianpeng Chang <jianpeng.chang.cn@windriver.com> wrote:
>
>
> On 2026/4/28 at 5:43 PM, Masami Hiramatsu (Google) wrote:
> >
> > Hi,
> >
> > On Mon, 27 Apr 2026 15:35:44 +0800 Jianpeng Chang
> > <jianpeng.chang.cn@windriver.com> wrote:
> >
> >> When kprobe_add_area_blacklist() iterates through a section like
> >> .kprobes.text, the start address may not correspond to a named
> >> symbol. On ARM64 with CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS=y
> >> (introduced by commit baaf553d3bc3 ("arm64: Implement
> >> HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS")), the compiler flag
> >> -fpatchable-function-entry=4,2 inserts 2 NOPs before each function
> >> entry point for ftrace call_ops. These pre-function NOPs sit at
> >> the section base address, before the first named function symbol.
> >> The compiler emits a $x mapping symbol at offset 0x00 to mark the
> >> start of code, but find_kallsyms_symbol() ignores mapping symbols.
> >>
> >> Without CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS (e.g. defconfig), no
> >> pre-function NOPs are inserted, the first function starts at
> >> offset 0x00, and the bug does not trigger.
> >>
> >> This only affects modules that have a .kprobes.text section (i.e.
> >> those using the __kprobes annotation). Modules using
> >> NOKPROBE_SYMBOL() instead (like kretprobe_example.ko) blacklist
> >> exact function addresses via the _kprobe_blacklist section and are
> >> not affected.
> >>
> >> For kprobe_example.ko on ARM64 with
> >> -fpatchable-function-entry=4,2, the .kprobes.text section layout is:
> >>
> >> offset 0x00: $x + 2 NOPs (mapping symbol + ftrace preamble)
> >> offset 0x08: handler_post (64 bytes)
> >> offset 0x50: handler_pre (68 bytes)
> >
> > Ah, OK. It is for __kprobes attribute. I recommend user to use
> > NOKPROBE_SYMBOL() but I understand the situation.
> >
> >>
> >> kprobe_add_area_blacklist() starts iterating from the section base
> >> address (offset 0x00), which only has the $x mapping symbol.
> >> kprobe_add_ksym_blacklist() then calls
> >> kallsyms_lookup_size_offset() for this address, which goes
> >> through:
> >>
> >> kallsyms_lookup_size_offset() -> module_address_lookup() ->
> >> find_kallsyms_symbol()
> >>
> >> find_kallsyms_symbol() scans all module symbols to find the
> >> closest preceding symbol.
> >>
> >> Since no named text symbol exists at offset 0x00,
> >> find_kallsyms_symbol() picks __UNIQUE_ID_vermagic (a .modinfo
> >> symbol whose address is in the temporary image) as the "best"
> >> match. The computed "size" = next_text_symbol - modinfo_symbol
> >> spans across these two unrelated memory regions, creating a
> >> blacklist entry with a bogus range of tens of terabytes.
> >>
> >> Whether this causes a visible failure depends on address
> >> randomization, here is what happens on Raspberry Pi 4/5:
> >>
> >> - On RPi5, the bogus size was ~35 TB. start + size stayed within
> >> 64-bit range, so the blacklist entry covered the entire kernel
> >> text. register_kprobe() in the module's own init function failed
> >> with -EINVAL.
> >>
> >> - On RPi4, the bogus size was ~75 TB. start + size overflowed 64
> >> bits and wrapped to a small address near zero. The range check
> >> (addr >= start && addr < end) then failed because end wrapped
> >> around, so the bogus entry was accidentally harmless and kprobes
> >> worked by luck.
> >>
> >> The same bug exists on both machines, but randomization determines
> >> whether the integer overflow masks it or not.
> >>
> >> Fix this by checking the offset returned by
> >> kallsyms_lookup_size_offset(). A non-zero offset means the address
> >> is not at a symbol boundary, so skip forward to the next symbol
> >> instead of creating a blacklist entry with a wrong size.
> >>
> >> Fixes: baaf553d3bc3 ("arm64: Implement HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS")
> >> Signed-off-by: Jianpeng Chang <jianpeng.chang.cn@windriver.com>
> >> ---
> >> Hi,
> >>
> >> This patch skips non-symbol addresses, fixes the bogus blacklist
> >> entry, but leaves the NOP gap at the start of .kprobes.text
> >> unblacklisted.
> >
> > That is OK because those NOPs are not executed in kprobe handler.
> >
> >>
> >> We can continue alloc the ent without return to add the gap to
> >> blacklist, or do some more works to add the gap to the first
> >> symbol in blacklist. I'm not sure if is this necessary, or is
> >> there a better way?
> >
> > Are there any compiler option or attribute to avoid inserting these
> > NOPs to the specific section? (like notrace?)
> >
> > Also, as you can see there is an alias symbol whose size is 0. and
> > in that case, we move the entry + 1 and call
> > kprobe_add_ksym_blacklist() again. Thus, the offset becomes 1.
> > Please make sure it is correctly handled.
> >
> Regarding the alias symbol concern: kallsyms_lookup_size_offset()
> computes size as the distance to the next different-address symbol, not
> from ELF st_size. I tested with a module containing alias symbols in
> .kprobes.text (created via __attribute__((alias))), and the lookup
> returned a correct size with offset=0 — the if (ret == 0) ret = 1 path
> was never triggered.
>
> That said, #define __kprobes notrace __section(".kprobes.text") is a
> cleaner fix. The NOPs in .kprobes.text are unnecessary since these
> functions should never be traced by ftrace. I've tested this on RPi5 —
> the bug is resolved and all .kprobes.text functions are correctly
> blacklisted. I'll send the notrace approach in v2.
Ah, great! thanks!
>
> Thanks,
> Jianpeng
> > Thanks,
> >
> >>
> >> Thanks, Jianpeng
> >>
> >> kernel/kprobes.c | 4 ++++
> >> 1 file changed, 4 insertions(+)
> >>
> >> diff --git a/kernel/kprobes.c b/kernel/kprobes.c
> >> index bfc89083daa9..be700fb03198 100644
> >> --- a/kernel/kprobes.c
> >> +++ b/kernel/kprobes.c
> >> @@ -2503,6 +2503,10 @@ int kprobe_add_ksym_blacklist(unsigned long entry)
> >> !kallsyms_lookup_size_offset(entry, &size, &offset))
> >> return -EINVAL;
> >>
> >> + /* Not on a symbol boundary -- skip to the next symbol */
> >> + if (offset)
> >> + return (int)(size - offset);
> >> +
> >> ent = kmalloc_obj(*ent);
> >> if (!ent)
> >> return -ENOMEM;
> >> --
> >> 2.54.0
> >>
> >
> >
> > -- Masami Hiramatsu (Google) <mhiramat@kernel.org>
>
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply
* Re: [PATCH] kprobes: Remove dead child probes from aggrprobe list on module unload
From: Masami Hiramatsu @ 2026-04-29 8:40 UTC (permalink / raw)
To: Shijia Hu; +Cc: naveen, davem, ananth, akpm, linux-kernel, linux-trace-kernel
In-Reply-To: <20260429032919.208790-1-hushijia1@uniontech.com>
On Wed, 29 Apr 2026 11:29:19 +0800
Shijia Hu <hushijia1@uniontech.com> wrote:
> When a kernel module that registered kprobes is unloaded without calling
> unregister_kprobe(), kprobes_module_callback() calls kill_kprobe() to
> mark the probe(s) GONE. If the probe is an aggrprobe, kill_kprobe()
> also marks all child probes GONE, but it does not remove them from
> the aggrprobe's list.
That sounds like a bug in the module.
>
> The problem is that child probes whose struct kprobe resides in the
> unloading module's memory are freed along with the module, yet they
> remain on the aggrprobe's list. Later, when another caller registers
> a kprobe at the same address, __get_valid_kprobe() walks that list
> and dereferences the freed child probe, causing a use-after-free.
>
> Reproduction steps:
>
> 1) Load module A which registers two kprobes on the same kernel
> function address (e.g., do_nanosleep), causing them to be
> aggregated under one aggrprobe.
>
> 2) Unload module A without calling unregister_kprobe().
> Module A's memory is freed, but its two child probes remain
> on the aggrprobe's list as dangling pointers.
Do you mean "load a buggy kernel module and unload it, and the kernel hits
a use-after-free"? For example:
----
struct kprobe my_probe = {...};
init_module() {
register_kprobe(&my_probe);
}
exit_module() {
/* do nothing */
}
----
Yes, this causes a UAF because that module has a bug. Please call
unregister_kprobe().
Thanks,
>
> 3) Load module B and register a kprobe on the same address
> (e.g., do_nanosleep). register_kprobe() -> __get_valid_kprobe()
> traverses the aggrprobe's list and dereferences the freed child
> probe from module A, triggering a use-after-free and kernel panic.
>
> The resulting crash looks like:
> [ 464.950864] BUG: kernel NULL pointer dereference, address: 0000000000000000
> [ 464.950872] #PF: supervisor read access in kernel mode
> [ 464.950874] #PF: error_code(0x0000) - not-present page
> ...
> [ 464.950915] Call Trace:
> [ 464.950922] <TASK>
> [ 464.950923] register_kprobe+0x65/0x2e0
> [ 464.950928] ? __pfx_stage2_init+0x10/0x10 [kprobe_leak_stage2]
> [ 464.950933] stage2_init+0x37/0xff0 [kprobe_leak_stage2]
> [ 464.950938] ? __pfx_stage2_init+0x10/0x10 [kprobe_leak_stage2]
> [ 464.950942] do_one_initcall+0x56/0x2e0
> [ 464.950948] do_init_module+0x60/0x230
> ...
>
> Fix this by adding selective cleanup in kprobes_module_callback():
> after calling kill_kprobe() on the aggrprobe, iterate its child list
> and remove any child probe whose struct kprobe is inside the going
> module's memory range (within_module_init / within_module_core).
>
> This is done in kprobes_module_callback() rather than kill_kprobe()
> because kill_kprobe()'s semantic is "the probed code is going away,
> mark probes GONE". The lifetime of a probe is bound to the probed
> code, not to the module containing the struct kprobe. Child probes
> owned by other still-loaded modules or by kmalloc (ftrace, perf,
> kprobe-events) must stay on the list so they can be unregistered
> later. Only child probes whose memory is about to be freed need to
> be removed from the list to prevent dangling pointers.
>
> Fixes: e8386a0cb22f4 ("kprobes: support probing module __exit function")
> Signed-off-by: Shijia Hu <hushijia1@uniontech.com>
> ---
> kernel/kprobes.c | 23 ++++++++++++++++++++++-
> 1 file changed, 22 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/kprobes.c b/kernel/kprobes.c
> index bfc89083daa9..ff277314183c 100644
> --- a/kernel/kprobes.c
> +++ b/kernel/kprobes.c
> @@ -2664,6 +2664,7 @@ static int kprobes_module_callback(struct notifier_block *nb,
> unsigned long val, void *data)
> {
> struct module *mod = data;
> + struct hlist_node *tmp;
> struct hlist_head *head;
> struct kprobe *p;
> unsigned int i;
> @@ -2685,7 +2686,7 @@ static int kprobes_module_callback(struct notifier_block *nb,
> */
> for (i = 0; i < KPROBE_TABLE_SIZE; i++) {
> head = &kprobe_table[i];
> - hlist_for_each_entry(p, head, hlist)
> + hlist_for_each_entry_safe(p, tmp, head, hlist) {
> if (within_module_init((unsigned long)p->addr, mod) ||
> (checkcore &&
> within_module_core((unsigned long)p->addr, mod))) {
> @@ -2702,6 +2703,26 @@ static int kprobes_module_callback(struct notifier_block *nb,
> */
> kill_kprobe(p);
> }
> +
> + /*
> + * Child probes are not on the kprobe hash list, so
> + * the above loop can not find them. If a child probe
> + * is allocated in the module's memory, it will become
> + * a dangling pointer after the module is freed.
> + */
> + if (kprobe_aggrprobe(p)) {
> + struct kprobe *kp, *kptmp;
> +
> + list_for_each_entry_safe(kp, kptmp, &p->list, list) {
> + if (within_module_init((unsigned long)kp, mod) ||
> + (checkcore &&
> + within_module_core((unsigned long)kp, mod))) {
> + kp->flags |= KPROBE_FLAG_GONE;
> + list_del_rcu(&kp->list);
> + }
> + }
> + }
> + }
> }
> if (val == MODULE_STATE_GOING)
> remove_module_kprobe_blacklist(mod);
> --
> 2.20.1
>
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply
* Re: [PATCH] tracing/probes: Limit size of event probe to 3K
From: Masami Hiramatsu @ 2026-04-29 8:42 UTC (permalink / raw)
To: Steven Rostedt
Cc: LKML, Linux Trace Kernel, Masami Hiramatsu, Mathieu Desnoyers
In-Reply-To: <20260428122302.706610ba@gandalf.local.home>
On Tue, 28 Apr 2026 12:23:02 -0400
Steven Rostedt <rostedt@kernel.org> wrote:
> From: Steven Rostedt <rostedt@goodmis.org>
>
> There currently isn't a max limit an event probe can be. One could make an
> event greater than PAGE_SIZE, which makes the event useless because if
> it's bigger than the max event that can be recorded into the ring buffer,
> then it will never be recorded.
>
> An event probe should never need to be greater than 3K, so make that the
> max size. As long as the max is less than the max that can be recorded
> onto the ring buffer, it should be fine.
This looks good to me.
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Thanks!
>
> Cc: stable@vger.kernel.org
> Fixes: 93ccae7a22274 ("tracing/kprobes: Support basic types on dynamic events")
> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
> ---
> kernel/trace/trace_probe.c | 6 ++++++
> kernel/trace/trace_probe.h | 4 +++-
> 2 files changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
> index e1c73065dae5..e0d3a0da26af 100644
> --- a/kernel/trace/trace_probe.c
> +++ b/kernel/trace/trace_probe.c
> @@ -1523,6 +1523,12 @@ static int traceprobe_parse_probe_arg_body(const char *argv, ssize_t *size,
> parg->offset = *size;
> *size += parg->type->size * (parg->count ?: 1);
>
> + if (*size > MAX_PROBE_EVENT_SIZE) {
> + ret = -E2BIG;
> + trace_probe_log_err(ctx->offset, EVENT_TOO_BIG);
> + goto fail;
> + }
> +
> if (parg->count) {
> len = strlen(parg->type->fmttype) + 6;
> parg->fmt = kmalloc(len, GFP_KERNEL);
> diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
> index 9fc56c937130..262d8707a3df 100644
> --- a/kernel/trace/trace_probe.h
> +++ b/kernel/trace/trace_probe.h
> @@ -38,6 +38,7 @@
> #define MAX_BTF_ARGS_LEN 128
> #define MAX_DENTRY_ARGS_LEN 256
> #define MAX_STRING_SIZE PATH_MAX
> +#define MAX_PROBE_EVENT_SIZE 3072
>
> /* Reserved field names */
> #define FIELD_STRING_IP "__probe_ip"
> @@ -561,7 +562,8 @@ extern int traceprobe_define_arg_fields(struct trace_event_call *event_call,
> C(BAD_TYPE4STR, "This type does not fit for string."),\
> C(NEED_STRING_TYPE, "$comm and immediate-string only accepts string type"),\
> C(TOO_MANY_ARGS, "Too many arguments are specified"), \
> - C(TOO_MANY_EARGS, "Too many entry arguments specified"),
> + C(TOO_MANY_EARGS, "Too many entry arguments specified"), \
> + C(EVENT_TOO_BIG, "Event too big (too many fields?)"),
>
> #undef C
> #define C(a, b) TP_ERR_##a
> --
> 2.53.0
>
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply
* Re: [PATCH] tracing/probes: Limit size of event probe to 3K
From: Masami Hiramatsu @ 2026-04-29 8:51 UTC (permalink / raw)
To: Steven Rostedt
Cc: LKML, Linux Trace Kernel, Masami Hiramatsu, Mathieu Desnoyers
In-Reply-To: <20260428122302.706610ba@gandalf.local.home>
Hi Steve,
BTW, to prevent regressions during future expansions, how about adding the following line?
diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index 2cabf8a23ec5..c5ee7920dec6 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -979,6 +979,7 @@ static struct uprobe_cpu_buffer *prepare_uprobe_buffer(struct trace_uprobe *tu,
ucb = uprobe_buffer_get();
ucb->dsize = tu->tp.size + dsize;
+ BUILD_BUG_ON(MAX_UCB_BUFFER_SIZE < MAX_PROBE_EVENT_SIZE);
if (WARN_ON_ONCE(ucb->dsize > MAX_UCB_BUFFER_SIZE)) {
ucb->dsize = MAX_UCB_BUFFER_SIZE;
dsize = MAX_UCB_BUFFER_SIZE - tu->tp.size;
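[Editor's sketch] A userspace analogue of the two checks discussed in this thread: the compile-time relation the BUILD_BUG_ON() would enforce, and the run-time size limit from the quoted patch. MAX_PROBE_EVENT_SIZE is taken from the patch. The value 4096 for MAX_UCB_BUFFER_SIZE is an assumption standing in for the kernel's definition, and add_arg() is a hypothetical stand-in for the size accounting in traceprobe_parse_probe_arg_body().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MAX_PROBE_EVENT_SIZE	3072	/* from the quoted patch */
#define MAX_UCB_BUFFER_SIZE	4096	/* assumption: one 4K page */

/* Compile-time analogue of the proposed BUILD_BUG_ON(). */
_Static_assert(MAX_UCB_BUFFER_SIZE >= MAX_PROBE_EVENT_SIZE,
	       "uprobe buffer must hold the largest allowed probe event");

/*
 * Account for one argument of elem_size bytes repeated count times (a count
 * of 0 means a scalar). Returns -E2BIG once the event exceeds the limit.
 */
static int add_arg(size_t *size, size_t elem_size, size_t count)
{
	*size += elem_size * (count ? count : 1);
	if (*size > MAX_PROBE_EVENT_SIZE)
		return -E2BIG;
	return 0;
}

static void limit_demo(void)
{
	size_t size = 0;

	assert(add_arg(&size, 8, 0) == 0);	/* one u64: 8 bytes */
	assert(add_arg(&size, 8, 383) == 0);	/* exactly 3072 bytes: allowed */
	assert(add_arg(&size, 1, 0) == -E2BIG);	/* 3073 bytes: rejected */
}
```

If someone later raises the event limit above the buffer size, the static assertion fails at build time instead of the WARN_ON_ONCE() firing at run time.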
Thanks,
On Tue, 28 Apr 2026 12:23:02 -0400
Steven Rostedt <rostedt@kernel.org> wrote:
> From: Steven Rostedt <rostedt@goodmis.org>
>
> There currently isn't a max limit an event probe can be. One could make an
> event greater than PAGE_SIZE, which makes the event useless because if
> it's bigger than the max event that can be recorded into the ring buffer,
> then it will never be recorded.
>
> An event probe should never need to be greater than 3K, so make that the
> max size. As long as the max is less than the max that can be recorded
> onto the ring buffer, it should be fine.
>
> Cc: stable@vger.kernel.org
> Fixes: 93ccae7a22274 ("tracing/kprobes: Support basic types on dynamic events")
> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
> ---
> kernel/trace/trace_probe.c | 6 ++++++
> kernel/trace/trace_probe.h | 4 +++-
> 2 files changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
> index e1c73065dae5..e0d3a0da26af 100644
> --- a/kernel/trace/trace_probe.c
> +++ b/kernel/trace/trace_probe.c
> @@ -1523,6 +1523,12 @@ static int traceprobe_parse_probe_arg_body(const char *argv, ssize_t *size,
> parg->offset = *size;
> *size += parg->type->size * (parg->count ?: 1);
>
> + if (*size > MAX_PROBE_EVENT_SIZE) {
> + ret = -E2BIG;
> + trace_probe_log_err(ctx->offset, EVENT_TOO_BIG);
> + goto fail;
> + }
> +
> if (parg->count) {
> len = strlen(parg->type->fmttype) + 6;
> parg->fmt = kmalloc(len, GFP_KERNEL);
> diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
> index 9fc56c937130..262d8707a3df 100644
> --- a/kernel/trace/trace_probe.h
> +++ b/kernel/trace/trace_probe.h
> @@ -38,6 +38,7 @@
> #define MAX_BTF_ARGS_LEN 128
> #define MAX_DENTRY_ARGS_LEN 256
> #define MAX_STRING_SIZE PATH_MAX
> +#define MAX_PROBE_EVENT_SIZE 3072
>
> /* Reserved field names */
> #define FIELD_STRING_IP "__probe_ip"
> @@ -561,7 +562,8 @@ extern int traceprobe_define_arg_fields(struct trace_event_call *event_call,
> C(BAD_TYPE4STR, "This type does not fit for string."),\
> C(NEED_STRING_TYPE, "$comm and immediate-string only accepts string type"),\
> C(TOO_MANY_ARGS, "Too many arguments are specified"), \
> - C(TOO_MANY_EARGS, "Too many entry arguments specified"),
> + C(TOO_MANY_EARGS, "Too many entry arguments specified"), \
> + C(EVENT_TOO_BIG, "Event too big (too many fields?)"),
>
> #undef C
> #define C(a, b) TP_ERR_##a
> --
> 2.53.0
>
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply related
* Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: Arun George/Arun George @ 2026-04-29 6:15 UTC (permalink / raw)
To: Gregory Price
Cc: lsf-pc, linux-kernel, linux-cxl, cgroups, linux-mm,
linux-trace-kernel, damon, kernel-team, gregkh, rafael, dakr,
dave, jonathan.cameron, dave.jiang, alison.schofield,
vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm, david,
lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
byungchul, ying.huang, apopple, axelrasmussen, yuanchu, weixugc,
yury.norov, linux, mhiramat, mathieu.desnoyers, tj, hannes,
mkoutny, jackmanb, sj, baolin.wang, npache, ryan.roberts,
dev.jain, baohua, lance.yang, muchun.song, xu.xin16,
chengming.zhou, jannh, linmiaohe, nao.horiguchi, pfalcato,
rientjes, shakeel.butt, riel, harry.yoo, cl, roman.gushchin,
chrisl, kasong, shikemeng, nphamcs, bhe, zhengqi.arch,
terry.bowman, gost.dev, arungeorge05, cpgs
In-Reply-To: <ae_i9IlIndumJWN3@gourry-fedora-PF4VCD3F>
On 28-04-2026 03:58 am, Gregory Price wrote:
> On Mon, Apr 27, 2026 at 06:02:57PM +0530, Arun George wrote:
>>
>> Any particular workload you are targeting with
>> this (which can tolerate this latency)?
>>
>> Any deployments you think of where the goal is a capacity expansion
>> with a compromise in performance?
>>
> Primary use cases for us are any workload that benefits from zswap -
> which is many, many (many, many [many, many]) workloads.
>
A curious question please. If the primary use case is swap, can't we
handle this problem statement by re-using the zsmalloc allocation classes?
A separate size class can be reserved for non-compressed pages in
zsmalloc. And this interface could be used by zswap, zram etc. (We have
been using this implementation for testing btw.). This does not require
additional book-keeping or buddy allocator.
But that approach would not give a generic solution and would not be
available to userland anyway!
>> And I believe the bear-proof cage might work in the normal scenarios,
>> but may not work for all.
>
> If it can't work for all workloads, then it's likely not general purpose
> enough to find core kernel support and should seek to use the existing
> interfaces (DAX and friends).
>
I agree. That is a good point.
>
> You need two controls over compressed RAM for it to be reliable:
>
> - Allocation control (acquiring new struct page to write to)
> - Write-control (preventing new writes to compressed pages)
>
> Private nodes provide the allocation control.
>
> A read-only mapping, and guarantee that only memory that can reach
> the device is userland memory - is the only way to control the cpu
> writes from the OS perspective.
>
So the write-control part needs to be handled in the specific back-end driver
for private pages, while the allocation control is a generic front end of
sorts, right? (Ex: the zswap cram back end for the compressed-device case.)
>
> In the next version of the RFC i'll demonstrate cram.c as a new swap
> backend that allows for read-only mappings to be soft-faulted in,
> migration on write, isolation to ANON memory, and some optional
> settings that allow a device or administrator a "writable budget"
> which allows some number of pages to be made writable without migration.
Great! I believe the "writable budget" could be an interesting idea, which
could solve the 'bus error' sort of scenarios caused by the device not being
able to take any more writes. The write budget could be replenished through
the control path, and writes would not go ahead without budget available,
right?
>
> ~Gregory
>
~Arun George
^ permalink raw reply
* [PATCH] sched: Use trace_call__<tp>() to save a static branch
From: Gabriele Monaco @ 2026-04-29 9:41 UTC (permalink / raw)
To: Steven Rostedt, Ingo Molnar, Peter Zijlstra, linux-kernel
Cc: linux-trace-kernel, Gabriele Monaco
The wrapper functions __trace_set_current_state() and
__trace_set_need_resched() allow the tracepoints to be called from code
outside sched/core.c. Those calls are already guarded by a
tracepoint_enabled(<tp>) check, so there is no need to repeat that check
inside the call by using trace_<tp>().
Use the new trace_call__<tp>() API to call the tracepoint directly,
without the check. The helper functions must therefore only be called
after the appropriate check.
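[Editor's sketch] The saving can be illustrated with a userspace model that counts how often the "enabled" key is tested. Real tracepoints use static branches (patched code) rather than a bool, and every name below is hypothetical. Only the call structure mirrors the description above.

```c
#include <assert.h>
#include <stdbool.h>

static bool tp_enabled;		/* stands in for the tracepoint's static key */
static unsigned int key_checks;	/* how often the key was tested */
static unsigned int events;	/* how many events were emitted */

static bool tracepoint_enabled_model(void)
{
	key_checks++;
	return tp_enabled;
}

/* trace_<tp>() style: tests the key itself before emitting. */
static void trace_tp_model(void)
{
	if (tracepoint_enabled_model())
		events++;
}

/* trace_call__<tp>() style: emits unconditionally, caller did the check. */
static void trace_call_tp_model(void)
{
	events++;
}

/* Exported wrapper before and after the patch. */
static void wrapper_old(void) { trace_tp_model(); }
static void wrapper_new(void) { trace_call_tp_model(); }

/* Call site outside sched/core.c: always guards the wrapper call. */
static void guarded_call(void (*wrapper)(void))
{
	if (tracepoint_enabled_model())
		wrapper();
}

static void branch_demo(void)
{
	tp_enabled = true;

	key_checks = events = 0;
	guarded_call(wrapper_old);
	assert(key_checks == 2 && events == 1);	/* checked twice */

	key_checks = events = 0;
	guarded_call(wrapper_new);
	assert(key_checks == 1 && events == 1);	/* checked once */
}
```

The model also shows why the new comments insist on the guard: wrapper_new() called without it would emit an event even when the tracepoint is disabled.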
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---
kernel/sched/core.c | 14 +++++++++++---
1 file changed, 11 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index da20fb6ea25a..c37562b02e24 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -537,10 +537,14 @@ sched_core_dequeue(struct rq *rq, struct task_struct *p, int flags) { }
/* need a wrapper since we may need to trace from modules */
EXPORT_TRACEPOINT_SYMBOL(sched_set_state_tp);
-/* Call via the helper macro trace_set_current_state. */
+/*
+ * Call via the helper macro trace_set_current_state.
+ * Calls to this function MUST be guarded by a
+ * tracepoint_enabled(sched_set_state_tp)
+ */
void __trace_set_current_state(int state_value)
{
- trace_sched_set_state_tp(current, state_value);
+ trace_call__sched_set_state_tp(current, state_value);
}
EXPORT_SYMBOL(__trace_set_current_state);
@@ -1203,9 +1207,13 @@ static void __resched_curr(struct rq *rq, int tif)
}
}
+/*
+ * Calls to this function MUST be guarded by a
+ * tracepoint_enabled(sched_set_need_resched_tp)
+ */
void __trace_set_need_resched(struct task_struct *curr, int tif)
{
- trace_sched_set_need_resched_tp(curr, smp_processor_id(), tif);
+ trace_call__sched_set_need_resched_tp(curr, smp_processor_id(), tif);
}
EXPORT_SYMBOL_GPL(__trace_set_need_resched);
base-commit: 254f49634ee16a731174d2ae34bc50bd5f45e731
--
2.54.0
^ permalink raw reply related
* [PATCH] mm/madvise: preserve uprobe breakpoints across MADV_DONTNEED
From: Darko Tominac @ 2026-04-29 13:15 UTC (permalink / raw)
To: Masami Hiramatsu, Oleg Nesterov, Peter Zijlstra, Ingo Molnar,
Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
James Clark, Andrew Morton, Liam R. Howlett, Lorenzo Stoakes,
David Hildenbrand, Vlastimil Babka, Jann Horn
Cc: xe-linux-external, danielwa, linux-kernel, linux-trace-kernel,
linux-perf-users, linux-mm
When uprobes are active, MADV_DONTNEED can discard file-backed pages
that contain uprobe software breakpoint instructions. Because the
uprobe infrastructure does not re-instrument pages on individual page
faults (uprobe_mmap() is only called during VMA creation, not on
page-in), the breakpoints are silently lost once the discarded pages are
re-read from the backing file. The probes stop firing with no error
indication, and the only recovery is to unregister and re-register the
affected uprobes.
Note that MADV_FREE is not affected: it only operates on anonymous VMAs
(madvise_free_single_vma() rejects non-anonymous VMAs with -EINVAL),
while uprobes only instrument file-backed mappings, so the two can never
overlap.
A concrete example is a userspace memory reclamation subsystem that
periodically calls madvise(MADV_DONTNEED) on file-backed text pages to
release memory. This silently clears uprobe breakpoints placed by
eBPF-based security and tracing tools that use uprobes to attach eBPF
programs to user-space functions, causing those tools to stop
functioning within seconds of the first reclamation pass.
Add a check to madvise_dontneed_free(), which handles MADV_DONTNEED,
MADV_DONTNEED_LOCKED and MADV_FREE. When CONFIG_UPROBES is enabled, the
check detects whether the target range contains active uprobes:
- Fast path: if no uprobes are registered system-wide, or the VMA is
not file-backed (uprobes only instrument file-backed mappings, so
anonymous VMAs -- including MADV_FREE targets -- can never contain
breakpoints), or no uprobes are present in the VMA range, proceed
with the discard as before.
- Slow path: when uprobes are detected in the range, use
vma_first_uprobe_addr() to jump directly to each uprobe page via
the rbtree, zapping the clean ranges between them. This is
O(M * log N) where M is the number of uprobes in the range and
N is the total uprobe count, rather than O(pages). madvise()
still returns success, consistent with the advisory nature of
MADV_DONTNEED.
When CONFIG_UPROBES is not configured, the original behaviour is
preserved with no overhead.
To support the above, export vma_has_uprobes() and add new helpers
any_uprobes_registered() and vma_first_uprobe_addr() in the uprobes
subsystem. vma_first_uprobe_addr() returns the page-aligned virtual
address of the lowest-offset uprobe in a given VMA range by leveraging
the (inode, offset)-sorted global rbtree.
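The slow-path loop described above can be sketched in userspace as follows. This is an illustration under stated assumptions, not the patch's code: demo_first_probe() is a hypothetical stand-in for vma_first_uprobe_addr() that does a linear scan over a sorted array of page-aligned probe addresses instead of an rbtree lookup, and "zapping" is modelled by counting bytes.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_PAGE_SIZE 4096UL	/* assumed page size for the sketch */

/*
 * Hypothetical stand-in for vma_first_uprobe_addr(): return the lowest
 * probe page address in [start, end), or 0 if there is none. 'probes'
 * must be sorted ascending and page-aligned.
 */
static unsigned long demo_first_probe(const unsigned long *probes, size_t n,
				      unsigned long start, unsigned long end)
{
	for (size_t i = 0; i < n; i++)
		if (probes[i] >= start && probes[i] < end)
			return probes[i];
	return 0;
}

/*
 * Walk [start, end), "zapping" every clean range and skipping each page
 * that holds a probe. Returns the number of bytes zapped.
 */
static unsigned long demo_zap_around_probes(const unsigned long *probes,
					    size_t n, unsigned long start,
					    unsigned long end)
{
	unsigned long cur = start;	/* start of the current clean range */
	unsigned long zapped = 0;

	while (cur < end) {
		unsigned long p = demo_first_probe(probes, n, cur, end);

		if (!p) {
			/* No more probes: zap the rest. */
			zapped += end - cur;
			break;
		}
		/* Zap the clean range before the probe page. */
		if (cur < p)
			zapped += p - cur;
		/* Skip past the probe page. */
		cur = p + DEMO_PAGE_SIZE;
	}
	return zapped;
}
```

For an eight-page range with probes on two of its pages, the sketch zaps the remaining six pages and leaves the two probe pages untouched.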
Cc: xe-linux-external@cisco.com
Cc: danielwa@cisco.com
Signed-off-by: Darko Tominac <dtominac@cisco.com>
---
include/linux/uprobes.h | 21 +++++++++++
kernel/events/uprobes.c | 79 +++++++++++++++++++++++++++++++++++++++--
mm/madvise.c | 73 +++++++++++++++++++++++++++++++++----
3 files changed, 164 insertions(+), 9 deletions(-)
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index f548fea2adec..9ce5c46fd2e9 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -212,6 +212,11 @@ extern void uprobe_unregister_nosync(struct uprobe *uprobe, struct uprobe_consum
extern void uprobe_unregister_sync(void);
extern int uprobe_mmap(struct vm_area_struct *vma);
extern void uprobe_munmap(struct vm_area_struct *vma, unsigned long start, unsigned long end);
+extern bool vma_has_uprobes(struct vm_area_struct *vma, unsigned long start, unsigned long end);
+extern unsigned long vma_first_uprobe_addr(struct vm_area_struct *vma,
+ unsigned long start,
+ unsigned long end);
+extern bool any_uprobes_registered(void);
extern void uprobe_start_dup_mmap(void);
extern void uprobe_end_dup_mmap(void);
extern void uprobe_dup_mmap(struct mm_struct *oldmm, struct mm_struct *newmm);
@@ -278,6 +283,22 @@ static inline void
uprobe_munmap(struct vm_area_struct *vma, unsigned long start, unsigned long end)
{
}
+static inline bool
+vma_has_uprobes(struct vm_area_struct *vma, unsigned long start,
+ unsigned long end)
+{
+ return false;
+}
+static inline unsigned long
+vma_first_uprobe_addr(struct vm_area_struct *vma, unsigned long start,
+ unsigned long end)
+{
+ return 0;
+}
+static inline bool any_uprobes_registered(void)
+{
+ return false;
+}
static inline void uprobe_start_dup_mmap(void)
{
}
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 4084e926e284..0f8aea99b96f 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -152,6 +152,19 @@ static loff_t vaddr_to_offset(struct vm_area_struct *vma, unsigned long vaddr)
return ((loff_t)vma->vm_pgoff << PAGE_SHIFT) + (vaddr - vma->vm_start);
}
+/**
+ * any_uprobes_registered - check if any uprobes are currently registered
+ *
+ * Check whether the global uprobe rbtree has any entries, indicating
+ * that at least one uprobe is currently active in the system.
+ *
+ * Return: true if one or more uprobes are registered, false otherwise.
+ */
+bool any_uprobes_registered(void)
+{
+ return !no_uprobe_events();
+}
+
/**
* is_swbp_insn - check if instruction is breakpoint instruction.
* @insn: instruction to be checked.
@@ -1635,8 +1648,16 @@ int uprobe_mmap(struct vm_area_struct *vma)
return 0;
}
-static bool
-vma_has_uprobes(struct vm_area_struct *vma, unsigned long start, unsigned long end)
+/**
+ * vma_has_uprobes - check whether a vma range contains any uprobes.
+ * @vma: the vma to search.
+ * @start: start address of the range (inclusive).
+ * @end: end address of the range (exclusive).
+ *
+ * Return: true if at least one uprobe is registered in [@start, @end),
+ * false otherwise.
+ */
+bool vma_has_uprobes(struct vm_area_struct *vma, unsigned long start, unsigned long end)
{
loff_t min, max;
struct inode *inode;
@@ -1654,6 +1675,60 @@ vma_has_uprobes(struct vm_area_struct *vma, unsigned long start, unsigned long e
return !!n;
}
+/**
+ * vma_first_uprobe_addr - find first uprobe in a vma range.
+ * @vma: the vma to search.
+ * @start: start address of the range (inclusive).
+ * @end: end address of the range (exclusive).
+ *
+ * Used by madvise to skip directly to uprobe pages.
+ *
+ * Return: the page-aligned virtual address of the first uprobe in
+ * [@start, @end), or 0 if none exists.
+ */
+unsigned long vma_first_uprobe_addr(struct vm_area_struct *vma,
+ unsigned long start, unsigned long end)
+{
+ loff_t min, max, first_offset;
+ struct inode *inode;
+ struct rb_node *n, *t;
+ struct uprobe *u;
+
+ /* No uprobes possible on anonymous mappings */
+ if (!vma->vm_file)
+ return 0;
+
+ /* Empty range -- nothing to search */
+ if (start >= end)
+ return 0;
+
+ inode = file_inode(vma->vm_file);
+
+ min = vaddr_to_offset(vma, start);
+ max = min + (end - start) - 1;
+
+ read_lock(&uprobes_treelock);
+ n = find_node_in_range(inode, min, max);
+ if (!n) {
+ read_unlock(&uprobes_treelock);
+ return 0;
+ }
+
+ /* Walk left to find the lowest offset in range */
+ u = rb_entry(n, struct uprobe, rb_node);
+ first_offset = u->offset;
+ for (t = rb_prev(n); t; t = rb_prev(t)) {
+ u = rb_entry(t, struct uprobe, rb_node);
+ if (u->inode != inode || u->offset < min)
+ break;
+ first_offset = u->offset;
+ }
+ read_unlock(&uprobes_treelock);
+
+ /* Return page-aligned vaddr containing this uprobe */
+ return PAGE_ALIGN_DOWN(offset_to_vaddr(vma, first_offset));
+}
+
/*
* Called in context of a munmap of a vma.
*/
diff --git a/mm/madvise.c b/mm/madvise.c
index 69708e953cf5..c73f1131224b 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -32,6 +32,7 @@
#include <linux/leafops.h>
#include <linux/shmem_fs.h>
#include <linux/mmu_notifier.h>
+#include <linux/uprobes.h>
#include <asm/tlb.h>
@@ -862,6 +863,30 @@ static long madvise_dontneed_single_vma(struct madvise_behavior *madv_behavior)
return 0;
}
+static long madvise_dontneed_free_range(struct madvise_behavior *madv_behavior,
+ unsigned long start, unsigned long end)
+{
+ struct madvise_behavior_range *range = &madv_behavior->range;
+ unsigned long saved_start = range->start;
+ unsigned long saved_end = range->end;
+ int behavior = madv_behavior->behavior;
+ long ret;
+
+ range->start = start;
+ range->end = end;
+
+ if (behavior == MADV_DONTNEED || behavior == MADV_DONTNEED_LOCKED)
+ ret = madvise_dontneed_single_vma(madv_behavior);
+ else if (behavior == MADV_FREE)
+ ret = madvise_free_single_vma(madv_behavior);
+ else
+ ret = -EINVAL;
+
+ range->start = saved_start;
+ range->end = saved_end;
+ return ret;
+}
+
static
bool madvise_dontneed_free_valid_vma(struct madvise_behavior *madv_behavior)
{
@@ -898,7 +923,7 @@ static long madvise_dontneed_free(struct madvise_behavior *madv_behavior)
{
struct mm_struct *mm = madv_behavior->mm;
struct madvise_behavior_range *range = &madv_behavior->range;
- int behavior = madv_behavior->behavior;
+ unsigned long cur, end, uprobe_addr;
if (!madvise_dontneed_free_valid_vma(madv_behavior))
return -EINVAL;
@@ -947,12 +972,46 @@ static long madvise_dontneed_free(struct madvise_behavior *madv_behavior)
VM_WARN_ON(range->start > range->end);
}
- if (behavior == MADV_DONTNEED || behavior == MADV_DONTNEED_LOCKED)
- return madvise_dontneed_single_vma(madv_behavior);
- else if (behavior == MADV_FREE)
- return madvise_free_single_vma(madv_behavior);
- else
- return -EINVAL;
+ /*
+ * Preserve uprobes: if any uprobes are active in this VMA range,
+ * avoid discarding pages that contain active breakpoints.
+ *
+ * Fast path: if no uprobes are registered system-wide, or the VMA
+ * is not file-backed (uprobes only instrument file-backed mappings,
+ * so anonymous VMAs can never contain breakpoints), or no uprobes
+ * are present in this VMA range, proceed with the full operation.
+ */
+ if (likely(!any_uprobes_registered()) ||
+ !madv_behavior->vma->vm_file ||
+ !vma_has_uprobes(madv_behavior->vma, range->start, range->end))
+ return madvise_dontneed_free_range(madv_behavior,
+ range->start, range->end);
+
+ /*
+ * Slow path: jump from uprobe to uprobe via rbtree lookup, zapping
+ * the clean range before each uprobe page. This is O(M * log N)
+ * where M is the number of uprobes in the range and N is the total
+ * uprobe count, versus O(pages) for a page-by-page scan. 'cur'
+ * tracks the beginning of the current clean range.
+ */
+ cur = range->start;
+ end = range->end;
+ while (cur < end) {
+ uprobe_addr = vma_first_uprobe_addr(madv_behavior->vma,
+ cur, end);
+ if (!uprobe_addr) {
+ /* No more uprobes - zap the rest */
+ madvise_dontneed_free_range(madv_behavior, cur, end);
+ break;
+ }
+ /* Zap the clean range before the uprobe page */
+ if (cur < uprobe_addr)
+ madvise_dontneed_free_range(madv_behavior, cur,
+ uprobe_addr);
+ /* Skip past the uprobe page */
+ cur = uprobe_addr + PAGE_SIZE;
+ }
+ return 0;
}
static long madvise_populate(struct madvise_behavior *madv_behavior)
--
2.35.6
^ permalink raw reply related
* Re: [PATCH] mm/madvise: preserve uprobe breakpoints across MADV_DONTNEED
From: David Hildenbrand (Arm) @ 2026-04-29 13:31 UTC (permalink / raw)
To: Darko Tominac, Masami Hiramatsu, Oleg Nesterov, Peter Zijlstra,
Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
James Clark, Andrew Morton, Liam R. Howlett, Lorenzo Stoakes,
Vlastimil Babka, Jann Horn
Cc: xe-linux-external, danielwa, linux-kernel, linux-trace-kernel,
linux-perf-users, linux-mm
In-Reply-To: <20260429131522.4049054-1-dtominac@cisco.com>
On 4/29/26 15:15, Darko Tominac wrote:
> When uprobes are active, MADV_DONTNEED can discard file-backed pages
> that contain uprobe software breakpoint instructions. Because the
If my memory serves me right, uprobes can only be installed in MAP_PRIVATE file
mappings. Installing a uprobe breaks CoW by installing an anonymous page,
not a file-backed page.
> uprobe infrastructure does not re-instrument pages on individual page
> faults (uprobe_mmap() is only called during VMA creation, not on
> page-in), the breakpoints are silently lost once the discarded pages are
> re-read from the backing file. The probes stop firing with no error
> indication, and the only recovery is to unregister and re-register the
> affected uprobes.
Right. Don't MADV_DONTNEED uprobes, just like you are not supposed to
MADV_DONTNEED debugger breakpoints/set data etc. :)
>
> Note that MADV_FREE is not affected: it only operates on anonymous VMAs
> (madvise_free_single_vma() rejects non-anonymous VMAs with -EINVAL),
> while uprobes only instrument file-backed mappings, so the two can never
> overlap.
>
> A concrete example is a userspace memory reclamation subsystem that
> periodically calls madvise(MADV_DONTNEED) on file-backed text pages to
> release memory.
It shouldn't do that on a MAP_PRIVATE file-backed VMA. It breaks the program,
including uprobes and anything else that breaks CoW in there.
If it's using MADV_DONTNEED, it is damaging the application.
MADV_DONTNEED is not for memory reclaim.
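To illustrate the MAP_PRIVATE point, here is a minimal userspace sketch (an illustration only, not part of the patch or of any existing test). It assumes Linux and a writable /tmp, and the helper name demo_private_write_lost() is hypothetical. It writes to a MAP_PRIVATE file mapping, which creates a private anonymous copy of the page, and then shows that MADV_DONTNEED throws that copy away so the next access sees the file contents again.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical demo helper. Returns 1 if a private write was discarded
 * by MADV_DONTNEED, 0 if it survived, -1 on a setup error.
 */
static int demo_private_write_lost(void)
{
	char path[] = "/tmp/dontneed-demo-XXXXXX";
	long psz = sysconf(_SC_PAGESIZE);
	volatile char *map;
	int fd, lost = -1;

	if (psz <= 0)
		return -1;
	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path);		/* the file lives on until fd is closed */

	/* Extend the file to one page of zero bytes. */
	if (ftruncate(fd, psz) != 0)
		goto out_close;

	map = mmap(NULL, (size_t)psz, PROT_READ | PROT_WRITE, MAP_PRIVATE,
		   fd, 0);
	if (map == MAP_FAILED)
		goto out_close;

	map[0] = 'X';		/* CoW: the page is now a private copy */
	assert(map[0] == 'X');

	/* Discard the private copy; the next access refaults from the file. */
	if (madvise((void *)map, (size_t)psz, MADV_DONTNEED) == 0)
		lost = (map[0] == 0);

	munmap((void *)map, (size_t)psz);
out_close:
	close(fd);
	return lost;
}
```

The same mechanism applies to any private modification in such a mapping, which is why a breakpoint written into a CoW'd page does not survive the call.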
--
Cheers,
David
^ permalink raw reply
* Re: [PATCH v3 00/28] vfs/nfsd: add support for CB_NOTIFY callbacks in directory delegations
From: Chuck Lever @ 2026-04-29 13:41 UTC (permalink / raw)
To: Jeff Layton, Alexander Viro, Christian Brauner, Jan Kara,
Chuck Lever, Alexander Aring, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, NeilBrown,
Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust,
Anna Schumaker, Amir Goldstein
Cc: Calum Mackay, linux-fsdevel, linux-kernel, linux-trace-kernel,
linux-doc, linux-nfs
In-Reply-To: <20260428-dir-deleg-v3-0-5a0780ba9def@kernel.org>
On Tue, Apr 28, 2026, at 3:09 AM, Jeff Layton wrote:
> Re-posting the set per Christian's request. The only difference in this
> version is a small error handling fix in alloc_init_dir_deleg(). The old
> version could crash since release_pages() can't handle an array with
> NULL pointers in it.
>
> ---------------------------------8<------------------------------------
>
> This patchset builds on the directory delegation work we did a few
> months ago, to add support for CB_NOTIFY callbacks for some events. In
> particular, creates, unlinks and renames. The server also sends updated
> directory attributes in the notifications. With this support, the client
> can register interest in a directory and get notifications about changes
> within it without losing its lease.
>
> The series starts with patches to allow the vfs to ignore certain types
> of events on directories. nfsd can then request these sorts of
> delegations on directories, and then set up inotify watches on the
> directory to trigger sending CB_NOTIFY events.
>
> This has mainly been tested with pynfs, with some new testcases that
> I'll be posting soon. They seem to work fine with those tests, but I
> don't think we'll want to merge these until we have a complete
> client-side implementation to test against.
>
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
> Changes in v3:
> - Fix error handling in alloc_init_dir_deleg()
> - Link to v2:
> https://lore.kernel.org/r/20260416-dir-deleg-v2-0-851426a550f6@kernel.org
>
> Changes in v2:
> - Fix __break_lease handling with different lease types on flc_lease
> list
> - Add FSNOTIFY_EVENT_RENAME data type to properly handle
> cross-directory rename events
> - Display fsnotify mask symbolically in tracepoints
> - New tracepoint in fsnotify()
> - Recalc fsnotify mask after unlocking lease instead of before
> - Don't notify client that is making the changes
> - After sending CB_NOTIFY, requeue if new events came in while running
> - Document removal of NFS4_VERIFIER_SIZE/NFS4_FHSIZE from UAPI headers
> - Properly release nfsd_dir_fsnotify_group on server shutdown
> - Link to v1:
> https://lore.kernel.org/r/20260407-dir-deleg-v1-0-aaf68c478abd@kernel.org
>
> ---
> Jeff Layton (28):
> filelock: pass current blocking lease to
> trace_break_lease_block() rather than "new_fl"
> filelock: add support for ignoring deleg breaks for dir change
> events
> filelock: add a tracepoint to start of break_lease()
> filelock: add an inode_lease_ignore_mask helper
> fsnotify: new tracepoint in fsnotify()
> fsnotify: add fsnotify_modify_mark_mask()
> fsnotify: add FSNOTIFY_EVENT_RENAME data type
> nfsd: check fl_lmops in nfsd_breaker_owns_lease()
> nfsd: add protocol support for CB_NOTIFY
> nfs_common: add new NOTIFY4_* flags proposed in RFC8881bis
> nfsd: allow nfsd to get a dir lease with an ignore mask
> nfsd: update the fsnotify mark when setting or removing a dir
> delegation
> nfsd: make nfsd4_callback_ops->prepare operation bool return
> nfsd: add callback encoding and decoding linkages for CB_NOTIFY
> nfsd: use RCU to protect fi_deleg_file
> nfsd: add data structures for handling CB_NOTIFY
> nfsd: add notification handlers for dir events
> nfsd: add tracepoint to dir_event handler
> nfsd: apply the notify mask to the delegation when requested
> nfsd: add helper to marshal a fattr4 from completed args
> nfsd: allow nfsd4_encode_fattr4_change() to work with no export
> nfsd: send basic file attributes in CB_NOTIFY
> nfsd: allow encoding a filehandle into fattr4 without a svc_fh
> nfsd: add a fi_connectable flag to struct nfs4_file
> nfsd: add the filehandle to returned attributes in CB_NOTIFY
> nfsd: properly track requested child attributes
> nfsd: track requested dir attributes
> nfsd: add support to CB_NOTIFY for dir attribute changes
>
> Documentation/sunrpc/xdr/nfs4_1.x | 264 ++++++++++++++-
> fs/attr.c | 2 +-
> fs/locks.c | 118 +++++--
> fs/namei.c | 31 +-
> fs/nfsd/filecache.c | 70 +++-
> fs/nfsd/nfs4callback.c | 60 +++-
> fs/nfsd/nfs4layouts.c | 5 +-
> fs/nfsd/nfs4proc.c | 17 +
> fs/nfsd/nfs4state.c | 551 ++++++++++++++++++++++++++++----
> fs/nfsd/nfs4xdr.c | 323 +++++++++++++++++--
> fs/nfsd/nfs4xdr_gen.c | 601 ++++++++++++++++++++++++++++++++++-
> fs/nfsd/nfs4xdr_gen.h | 20 +-
> fs/nfsd/state.h | 72 ++++-
> fs/nfsd/trace.h | 23 ++
> fs/nfsd/xdr4.h | 5 +
> fs/nfsd/xdr4cb.h | 12 +
> fs/notify/fsnotify.c | 5 +
> fs/notify/mark.c | 29 ++
> fs/posix_acl.c | 4 +-
> fs/xattr.c | 4 +-
> include/linux/filelock.h | 54 +++-
> include/linux/fsnotify.h | 8 +-
> include/linux/fsnotify_backend.h | 21 ++
> include/linux/nfs4.h | 127 --------
> include/linux/sunrpc/xdrgen/nfs4_1.h | 291 ++++++++++++++++-
> include/trace/events/filelock.h | 38 ++-
> include/trace/events/fsnotify.h | 51 +++
> include/trace/misc/fsnotify.h | 35 ++
> include/uapi/linux/nfs4.h | 2 -
> 29 files changed, 2519 insertions(+), 324 deletions(-)
> ---
> base-commit: f4d71dd7fd9cec357c32431fa55c107b96008312
> change-id: 20260325-dir-deleg-339066dd1017
>
> Best regards,
> --
> Jeff Layton <jlayton@kernel.org>
For the series:
Acked-by: Chuck Lever <chuck.lever@oracle.com>
--
Chuck Lever
^ permalink raw reply
* Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: Gregory Price @ 2026-04-29 13:42 UTC (permalink / raw)
To: Arun George/Arun George
Cc: lsf-pc, linux-kernel, linux-cxl, cgroups, linux-mm,
linux-trace-kernel, damon, kernel-team, gregkh, rafael, dakr,
dave, jonathan.cameron, dave.jiang, alison.schofield,
vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm, david,
lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
byungchul, ying.huang, apopple, axelrasmussen, yuanchu, weixugc,
yury.norov, linux, mhiramat, mathieu.desnoyers, tj, hannes,
mkoutny, jackmanb, sj, baolin.wang, npache, ryan.roberts,
dev.jain, baohua, lance.yang, muchun.song, xu.xin16,
chengming.zhou, jannh, linmiaohe, nao.horiguchi, pfalcato,
rientjes, shakeel.butt, riel, harry.yoo, cl, roman.gushchin,
chrisl, kasong, shikemeng, nphamcs, bhe, zhengqi.arch,
terry.bowman, gost.dev, arungeorge05, cpgs
In-Reply-To: <1891546521.01777455002601.JavaMail.epsvc@epcpadp1new>
On Wed, Apr 29, 2026 at 11:45:26AM +0530, Arun George/Arun George wrote:
> On 28-04-2026 03:58 am, Gregory Price wrote:
> > On Mon, Apr 27, 2026 at 06:02:57PM +0530, Arun George wrote:
> >>
> >> Any particular workload you are targeting with
> >> this (which can tolerate this latency)?
> >>
> >> Any deployments you think of where the goal is a capacity expansion
> >> with a compromise in performance?
> >>
> > Primary use cases for us are any workload that benefits from zswap -
> > which is many, many (many, many [many, many]) workloads.
> >
> A curious question please. If the primary use case is swap, can't we
> handle this problem statement by re-using the zsmalloc allocation classes?
>
I'm using swap semantics for allocation ("demote + leafent"), but on
fault, rather than removing the swap entry, we leave it cached and
replace the page table entry with a read-only mapping (on a read fault).
If there's a writable budget, and the node is under that budget, we may
also allow upgrading the read-only page to be writable (at which point
we would reap the swap entry).
This requires careful reverse-mapping in case there are multiple mappers
of the same folio.
Since otherwise the allocation is just alloc_pages_node(), and the fault
patterns differ from typical swap - I didn't see the need to overcomplicate
things by cramming the logic into zswap/zsmalloc instead of just making
it its own vswap[1] backend that sits in front of zswap.
vswap makes it easy to writeback a cram page to swap in the case where
the device is over-pressured and we need to make room (close the node,
disallow new cram entries, writeback existing cram entries to swap).
[1] vswap: https://lore.kernel.org/linux-mm/?t=20260320192741
> A separate size class can be reserved for non-compressed pages in
> zsmalloc. And this interface could be used by zswap, zram etc. (We have
> been using this implementation for testing btw.). This does not require
> additional book-keeping or buddy allocator.
>
The other reason not to overload an existing mechanism is that these
devices (that I've seen) cannot provide per-page compressibility stats,
so it would end up just looking like a bunch of either
incompressible capacity or unknown compressed capacity.
That makes it harder for those components to reason about what to do
with their normal software-compressed capacity (for which they do have
that data).
> So write-control part need to handled in the specific back end driver of
> private pages while the allocation control is a generic front-end sort
> of, right? (Ex: zswap cram back end for compressed devices case.)
Write control is handled by the OS in three ways:
1) No file memory (no page cache)
We get this for free using the swap semantics
This prevents buffered i/o from bypassing page table controls
2) User allocations only (or at least swap-eligible only)
This prevents catastrophic system failure if the device fails
3) Page table mapping control (disallow direct writes)
This prevents uncontended writes to compressed memory by the cpu
Allocation control is handled via private nodes - the driver which
hotplugs the private nodes hands that node to cram - and cram is now
aware of that capacity and will use __GFP_PRIVATE to allocate from that
node. Removal of the private node from the fallback zonelist and the
lack of __GFP_PRIVATE in all other paths prevent normal buddy allocator
users from accessing that memory.
>
> Great! I believe "writable budget" could be an interesting idea which
> can solve the 'bus error' sort of scenarios due to device not capable of
> taking any more writes. The write budget could be replenished using the
> control path and writes will not go ahead without the budget available,
> right?
>
Write budget is simple:
budget=1 (up to 1 page can be writable)
1) swap 1 page -> cram alloc 1 page, put VSWAP_CRAM in PTE
2) read-fault -> cram upgrades VSWAP_CRAM to R/O PTE
3) write-fault ->
a) if (writable_cnt < budget) { writable_cnt++; mkwrite(pte); }
b) else: normal swap semantic -> promote to normal memory
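The budget steps above can be written out as a tiny state machine. This is a sketch under assumptions, not code from the series: all demo_* names are hypothetical, and it assumes the counter that grows on a granted write fault is the writable count, with the budget itself staying fixed.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-page states for the sketch. */
enum demo_pte_state {
	DEMO_SWAPPED,		/* VSWAP_CRAM-style entry in the PTE */
	DEMO_READ_ONLY,		/* mapped read-only from the cram node */
	DEMO_WRITABLE,		/* mapped writable, counted against budget */
	DEMO_PROMOTED,		/* moved back to normal memory */
};

/* Hypothetical per-node accounting. */
struct demo_cram {
	unsigned int budget;		/* max pages allowed to be writable */
	unsigned int writable_cnt;	/* pages currently writable */
};

/* Read fault: upgrade a swapped entry to a read-only mapping. */
static enum demo_pte_state demo_read_fault(enum demo_pte_state s)
{
	return s == DEMO_SWAPPED ? DEMO_READ_ONLY : s;
}

/* Write fault: grant write access within budget, else promote. */
static enum demo_pte_state demo_write_fault(struct demo_cram *c,
					    enum demo_pte_state s)
{
	if (s == DEMO_WRITABLE || s == DEMO_PROMOTED)
		return s;	/* nothing to do */

	if (c->writable_cnt < c->budget) {
		c->writable_cnt++;
		return DEMO_WRITABLE;
	}
	/* Budget exhausted: normal swap semantic. */
	return DEMO_PROMOTED;
}
```

With budget=1, the first page that takes a write fault becomes writable in place and the second one is promoted to normal memory.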
The catch with the writable budget is we may not always be able to catch
all frees of the vswap pages - meaning we get zombie pages in the vswap
tables. But this is ok if we have a regular kthread scan the vswap entry
list to reap zombies.
This also gives us a great place to TRIM/FLUSH those pages to release
the capacity without zeroing them.
Meanwhile - use ballooning and a simple shrinker to dynamically size the
region to respond to real compression ratio.
All said and done - you get something close to zswap but with R/O
mappings for all entries, and optional R/W-mappings for administrators
who know something about their workload and can afford to take the risk
of some amount of capacity being written to uncontended in exchange for
performance.
The writable-budget is a risk-dial: How much do you trust your workload
to not spew un/poorly-compressible memory? The write-budget is a direct
measure of that. (so take P99.99999 compression ratios, and you can make
a good chunk of that writable).
~Gregory
^ permalink raw reply
* Re: [PATCH 7.2 v16 02/13] mm/khugepaged: generalize alloc_charge_folio()
From: Nico Pache @ 2026-04-29 14:36 UTC (permalink / raw)
To: David Hildenbrand (Arm), linux-doc, linux-kernel, linux-mm,
linux-trace-kernel
Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
joshua.hahnjy, kas, lance.yang, Liam.Howlett, ljs,
mathieu.desnoyers, matthew.brost, mhiramat, mhocko, peterx,
pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang, rientjes,
rostedt, rppt, ryan.roberts, shivankg, sunnanyong, surenb,
thomas.hellstrom, tiwai, usamaarif642, vbabka, vishal.moola,
wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <06c7e6a7-60af-480e-afd9-700e985ca2ba@kernel.org>
On 4/27/26 1:41 PM, David Hildenbrand (Arm) wrote:
> On 4/19/26 20:57, Nico Pache wrote:
>> From: Dev Jain <dev.jain@arm.com>
>>
>> Pass order to alloc_charge_folio() and update mTHP statistics.
>>
>> Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
>> Reviewed-by: Lance Yang <lance.yang@linux.dev>
>> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>
>> Reviewed-by: Zi Yan <ziy@nvidia.com>
>> Acked-by: David Hildenbrand (Arm) <david@kernel.org>
>> Co-developed-by: Nico Pache <npache@redhat.com>
>> Signed-off-by: Nico Pache <npache@redhat.com>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>
> Your SOB should come last, the order represents the history of this patch
Ah ok thank you, sorry about that.
>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> Co-developed-by: Nico Pache <npache@redhat.com>
> Signed-off-by: Nico Pache <npache@redhat.com>
>
^ permalink raw reply
* Re: [PATCH 7.2 v16 03/13] mm/khugepaged: rework max_ptes_* handling with helper functions
From: Nico Pache @ 2026-04-29 14:43 UTC (permalink / raw)
To: Usama Arif
Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, akpm,
anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
lance.yang, Liam.Howlett, ljs, mathieu.desnoyers, matthew.brost,
mhiramat, mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
zokeefe
In-Reply-To: <20260420131549.3673619-1-usama.arif@linux.dev>
On 4/20/26 7:15 AM, Usama Arif wrote:
> On Sun, 19 Apr 2026 12:57:40 -0600 Nico Pache <npache@redhat.com> wrote:
>
>> The following cleanup reworks all the max_ptes_* handling into helper
>> functions. This increases the code readability and will later be used to
>> implement the mTHP handling of these variables.
>>
>> With these changes we abstract all the madvise_collapse() special casing
>> (dont respect the sysctls) away from the functions that utilize them. And
>> will later in this series to cleanly restrict mTHP collapses behaviors.
>>
>> Suggested-by: David Hildenbrand <david@kernel.org>
>> Signed-off-by: Nico Pache <npache@redhat.com>
>> ---
>> mm/khugepaged.c | 114 +++++++++++++++++++++++++++++++++---------------
>> 1 file changed, 78 insertions(+), 36 deletions(-)
>>
>
> The old code re-read khugepaged_max_ptes_* on every loop iteration; the new
> code snapshots them once per scan call. If userspace writes the sysctl
> mid-scan, old behavior reacted within the scan, new behavior uses the value
> sampled at entry. This is completely ok IMO, but might be good to call out.
>
> Also might be good to write no functional change intended apart from
> above in the commit message?
Ah good point! I'll clear that up
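For reference, the behavioural difference Usama describes can be shown with a small userspace sketch. It is not the khugepaged code: demo_max_ptes_none and demo_scan() are hypothetical, every scanned entry is treated as "none", and the sysctl write from userspace is simulated at a fixed iteration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a tunable that userspace may change. */
static unsigned int demo_max_ptes_none;

/*
 * Scan 'nr' entries that all count as "none". At iteration 'change_at'
 * the tunable is rewritten to 'new_limit' (simulated sysctl write).
 * With snapshot=true the limit sampled at entry is used throughout;
 * with snapshot=false the tunable is re-read on every iteration.
 * Returns the index at which the scan aborted, or 'nr' if it finished.
 */
static unsigned int demo_scan(bool snapshot, unsigned int nr,
			      unsigned int change_at, unsigned int new_limit)
{
	unsigned int limit = demo_max_ptes_none;	/* sampled at entry */
	unsigned int none = 0;

	for (unsigned int i = 0; i < nr; i++) {
		if (i == change_at)
			demo_max_ptes_none = new_limit;
		if (!snapshot)
			limit = demo_max_ptes_none;	/* old: re-read */
		if (++none > limit)
			return i;
	}
	return nr;
}
```

Starting from a limit of 8, the snapshot variant finishes an eight-entry scan even if the tunable drops to 0 at entry 2, while the re-reading variant aborts at that entry.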
>
> Acked-by: Usama Arif <usama.arif@linux.dev>
Thank you :)
>
^ permalink raw reply
* Re: [PATCH 7.2 v16 03/13] mm/khugepaged: rework max_ptes_* handling with helper functions
From: Nico Pache @ 2026-04-29 14:48 UTC (permalink / raw)
To: David Hildenbrand (Arm), linux-doc, linux-kernel, linux-mm,
linux-trace-kernel
Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
joshua.hahnjy, kas, lance.yang, Liam.Howlett, ljs,
mathieu.desnoyers, matthew.brost, mhiramat, mhocko, peterx,
pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang, rientjes,
rostedt, rppt, ryan.roberts, shivankg, sunnanyong, surenb,
thomas.hellstrom, tiwai, usamaarif642, vbabka, vishal.moola,
wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <c82a0b73-67ff-45e6-a792-e610b35a5b2f@kernel.org>
On 4/27/26 1:52 PM, David Hildenbrand (Arm) wrote:
> On 4/19/26 20:57, Nico Pache wrote:
>> The following cleanup reworks all the max_ptes_* handling into helper
>> functions. This increases the code readability and will later be used to
>> implement the mTHP handling of these variables.
>>
>> With these changes we abstract all the madvise_collapse() special casing
>> (dont respect the sysctls) away from the functions that utilize them. And
>> will later in this series to cleanly restrict mTHP collapses behaviors.
>>
>> Suggested-by: David Hildenbrand <david@kernel.org>
>> Signed-off-by: Nico Pache <npache@redhat.com>
>> ---
>> mm/khugepaged.c | 114 +++++++++++++++++++++++++++++++++---------------
>> 1 file changed, 78 insertions(+), 36 deletions(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index afac6bc4e76d..f42b55421191 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -348,6 +348,58 @@ static bool pte_none_or_zero(pte_t pte)
>> return pte_present(pte) && is_zero_pfn(pte_pfn(pte));
>> }
>>
>> +/**
>> + * collapse_max_ptes_none - Calculate maximum allowed empty PTEs for collapse
>
> empty PTE or PTE mapping the shared zeropage ? That should be clarified also below.
Ah fair point, "empty" isn't the best representation of a "none"/zeropage.
>
>> + * @cc: The collapse control struct
>> + * @vma: The vma to check for userfaultfd
>> + *
>> + * If we are not in khugepaged mode use HPAGE_PMD_NR to allow any
>> + * empty page.
>
> Not completely accurate due to uffd. And it's not really "empty page".
Sorry I forgot to update this comment. I originally planned on skipping
the VMA passing, but then figured later that it would make the code even
more uniform (as you suggested)
>
> Is that information really necessary for the caller? I'd suggest you drop this
> here and instead add a comment inline above the "return HPAGE_PMD_NR;".
Yeah, I'm not really sure; I can shorten them. I was heeding Lorenzo's
request to add these with docstring headers
>
>> + *
>> + * Return: Maximum number of empty PTEs allowed for the collapse operation
>> + */
>> +static unsigned int collapse_max_ptes_none(struct collapse_control *cc,
>> + struct vm_area_struct *vma)
>> +{
>> + if (vma && userfaultfd_armed(vma))
>> + return 0;
>> + if (!cc->is_khugepaged)
>> + return HPAGE_PMD_NR;
>> + return khugepaged_max_ptes_none;
>> +}
>> +
>> +/**
>> + * collapse_max_ptes_shared - Calculate maximum allowed shared PTEs for collapse
>
> "shared PTE" is not quite clear.
>
> "PTEs that map shared anonymous pages" ?
That works for me, thank you.
>
>> + * @cc: The collapse control struct
>> + *
>> + * If we are not in khugepaged mode use HPAGE_PMD_NR to allow any
>> + * shared page.
>
> Same comment as above.
ack
>
>> + *
>> + * Return: Maximum number of shared PTEs allowed for the collapse operation
>> + */
>> +static unsigned int collapse_max_ptes_shared(struct collapse_control *cc)
>> +{
>> + if (!cc->is_khugepaged)
>> + return HPAGE_PMD_NR;
>> + return khugepaged_max_ptes_shared;
>> +}
>> +
>> +/**
>> + * collapse_max_ptes_swap - Calculate maximum allowed swap PTEs for collapse
>
> We're actually checking non-present page table entries (anonymous THP collapse)
> or non-present pagecache entries (file THP collapse).
>
> I wonder if there is an easy way to clarify that here, at least in the
> description (confusing name can stay unless we find something better).
I'll update the comment to include some form of this. In my mind the
name should probably stay reasonably consistent with the sysctl name.
>
>> + * @cc: The collapse control struct
>> + *
>> + * If we are not in khugepaged mode use HPAGE_PMD_NR to allow any
>> + * swap page.
>
> Ditto.
ack!
>
>> + *
>> + * Return: Maximum number of swap PTEs allowed for the collapse operation
>> + */
>> +static unsigned int collapse_max_ptes_swap(struct collapse_control *cc)
>> +{
>> + if (!cc->is_khugepaged)
>> + return HPAGE_PMD_NR;
>> + return khugepaged_max_ptes_swap;
>> +}
>> +
>> int hugepage_madvise(struct vm_area_struct *vma,
>> vm_flags_t *vm_flags, int advice)
>> {
>> @@ -546,21 +598,19 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
>> pte_t *_pte;
>> int none_or_zero = 0, shared = 0, referenced = 0;
>> enum scan_result result = SCAN_FAIL;
>> + unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma);
>> + unsigned int max_ptes_shared = collapse_max_ptes_shared(cc);
>
> These could be const, right? Or will that change in future patches?
Yes, I believe these can be const now! Thank you.
>
>>
>> for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
>> _pte++, addr += PAGE_SIZE) {
>> pte_t pteval = ptep_get(_pte);
>> if (pte_none_or_zero(pteval)) {
>> - ++none_or_zero;
>> - if (!userfaultfd_armed(vma) &&
>> - (!cc->is_khugepaged ||
>> - none_or_zero <= khugepaged_max_ptes_none)) {
>> - continue;
>> - } else {
>> + if (++none_or_zero > max_ptes_none) {
>> result = SCAN_EXCEED_NONE_PTE;
>> count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
>> goto out;
>> }
>> + continue;
>> }
>> if (!pte_present(pteval)) {
>> result = SCAN_PTE_NON_PRESENT;
>> @@ -591,9 +641,7 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
>>
>> /* See collapse_scan_pmd(). */
>> if (folio_maybe_mapped_shared(folio)) {
>> - ++shared;
>> - if (cc->is_khugepaged &&
>> - shared > khugepaged_max_ptes_shared) {
>> + if (++shared > max_ptes_shared) {
>> result = SCAN_EXCEED_SHARED_PTE;
>> count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
>> goto out;
>> @@ -1270,6 +1318,9 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>> unsigned long addr;
>> spinlock_t *ptl;
>> int node = NUMA_NO_NODE, unmapped = 0;
>> + unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma);
>> + unsigned int max_ptes_shared = collapse_max_ptes_shared(cc);
>> + unsigned int max_ptes_swap = collapse_max_ptes_swap(cc);
>
> Same question here.
ack! will adjust.
>
>>
>> VM_BUG_ON(start_addr & ~HPAGE_PMD_MASK);
>>
>
>
> In general, LGTM. With the doc fixed up
>
> Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Thank you, I'll get those updated.
>
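For reference, here is a minimal standalone sketch (userspace C, not kernel code) of why folding the userfaultfd and madvise_collapse() special cases into a single limit should preserve the old none/zero check in __collapse_huge_page_isolate(). The value 512 for HPAGE_PMD_NR, the reduced struct collapse_control, the plain-argument form of the limit helper, and the names old_none_exceeded(), new_none_exceeded() and count_mismatches() are all assumptions made for illustration. It relies on the counter never exceeding HPAGE_PMD_NR, which holds because the loop visits at most HPAGE_PMD_NR PTEs.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: 4 KiB base pages and 2 MiB PMDs, so 512 PTEs per PMD. */
#define HPAGE_PMD_NR 512u

/* Reduced stand-in for the kernel's struct collapse_control. */
struct collapse_control {
	bool is_khugepaged;
};

/*
 * Same decision order as the patch's collapse_max_ptes_none(), but the
 * userfaultfd state and the sysctl value are passed in as arguments.
 */
static unsigned int max_ptes_none(const struct collapse_control *cc,
				  bool uffd_armed, unsigned int sysctl_max)
{
	if (uffd_armed)
		return 0;
	if (!cc->is_khugepaged)
		return HPAGE_PMD_NR;
	return sysctl_max;
}

/* Old open-coded check; count is the value after ++none_or_zero. */
static bool old_none_exceeded(const struct collapse_control *cc,
			      bool uffd_armed, unsigned int sysctl_max,
			      unsigned int count)
{
	if (!uffd_armed && (!cc->is_khugepaged || count <= sysctl_max))
		return false;	/* old code hit "continue" */
	return true;		/* old code set SCAN_EXCEED_NONE_PTE */
}

/* New check: compare the counter against the precomputed limit. */
static bool new_none_exceeded(const struct collapse_control *cc,
			      bool uffd_armed, unsigned int sysctl_max,
			      unsigned int count)
{
	return count > max_ptes_none(cc, uffd_armed, sysctl_max);
}

/* Exhaustively compare both checks; returns the number of disagreements. */
static unsigned int count_mismatches(void)
{
	unsigned int mismatches = 0;

	for (int uffd = 0; uffd <= 1; uffd++) {
		for (int khugepaged = 0; khugepaged <= 1; khugepaged++) {
			struct collapse_control cc = { .is_khugepaged = khugepaged };

			/* The sysctl is assumed to range over 0..HPAGE_PMD_NR - 1. */
			for (unsigned int max = 0; max < HPAGE_PMD_NR; max++) {
				/* The counter is at least 1 and at most HPAGE_PMD_NR. */
				for (unsigned int n = 1; n <= HPAGE_PMD_NR; n++) {
					if (old_none_exceeded(&cc, uffd, max, n) !=
					    new_none_exceeded(&cc, uffd, max, n))
						mismatches++;
				}
			}
		}
	}
	return mismatches;
}
```

Calling count_mismatches() walks every combination of uffd state, collapse mode, sysctl value and counter value. The same argument applies to the shared check, where HPAGE_PMD_NR again acts as a limit the counter can never exceed.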