Linux Trace Kernel

* Re: mm/memory-failure tracepoint change breaks userspace rasdaemon
From: Steven Rostedt @ 2026-06-05 14:13 UTC
  To: David Hildenbrand (Arm)
  Cc: Borislav Petkov, Zhuo, Qiuxu, mchehab+huawei@kernel.org,
	Luck, Tony, akpm@linux-foundation.org, linmiaohe@huawei.com,
	xieyuanbin1@huawei.com, Lai, Yi1, linux-kernel@vger.kernel.org,
	linux-edac@vger.kernel.org, linux-mm@kvack.org,
	linux-trace-kernel@vger.kernel.org, Linus Torvalds
In-Reply-To: <39cbb38f-ed3b-4f17-b9b7-ed466957ee99@kernel.org>



On June 5, 2026 4:52:28 AM EDT, "David Hildenbrand (Arm)" <david@kernel.org> wrote:
>On 6/3/26 21:31, Steven Rostedt wrote:
>> On Wed, 3 Jun 2026 21:13:30 +0200
>> "David Hildenbrand (Arm)" <david@kernel.org> wrote:
>> 
>>> Thanks, that makes sense!
>>>
>>> So, would it be fair to say that, in general, what's exposed through
>>>
>>> 	/sys/kernel/tracing/events/
>>>
>>> is stable ABI?
>> 
>> It's only stable if something depends on it. It changes all the time.
>> It's only when someone complains about it that it becomes "stable"!
>
>Heh, so we only know that it's stable when we break it ...
>
>Let me figure out how to document that.
>

Yep. That's basically Linus's rule. He even said we break user space API all the time. What we don't allow is to break actual user space. The problem is that you can break user space by fixing an API without knowing something depended on the bug.


* [GIT PULL] RTLA additional fixes for v7.2
From: Tomas Glozar @ 2026-06-05 14:02 UTC
  To: Steven Rostedt; +Cc: LKML, linux-trace-kernel, Tomas Glozar

Steven,

The following changes since commit db956bcf8d681b5a01ebe04c79f6a7b29b9934f9:

  rtla: Document tests in README (2026-05-29 09:40:54 +0200)

are available in the Git repository at:

  https://git.kernel.org/pub/scm/linux/kernel/git/tglozar/linux.git/ tags/rtla-v7.2-fixups

for you to fetch changes up to 930645785902d5830f9c41862d21af651b8e5371:

  rtla/tests: Fix pgrep filter in get_workload_pids.sh (2026-06-05 14:11:31 +0200)

----------------------------------------------------------------
RTLA additional fixes for v7.2

- Fix and clean up .gitignore

Narrow match range of entries in .gitignore to only what is needed,
fixing "lib/" matching tools/tracing/rtla/tests/scripts/lib/*.

- Fix pgrep filter in runtime tests

Make the pgrep filter used by runtime tests to get workload PIDs work
on both older and newer versions of pgrep, regardless of whether
square brackets are counted as part of kthread comm or not.

Build, runtime tests, unit tests pass.

Signed-off-by: Tomas Glozar <tglozar@redhat.com>

----------------------------------------------------------------
Tomas Glozar (2):
      rtla: Fix and clean up .gitignore
      rtla/tests: Fix pgrep filter in get_workload_pids.sh

 tools/tracing/rtla/.gitignore                             | 13 ++++---------
 tools/tracing/rtla/tests/scripts/lib/get_workload_pids.sh |  2 +-
 2 files changed, 5 insertions(+), 10 deletions(-)



* Re: [PATCH] rethook: Use tsk->on_cpu to check task execution state
From: Masami Hiramatsu @ 2026-06-05 13:43 UTC
  To: Peter Zijlstra
  Cc: Tengda Wu, Steven Rostedt, Mathieu Desnoyers, Alexei Starovoitov,
	linux-trace-kernel, linux-kernel
In-Reply-To: <20260604093445.GF3126523@noisy.programming.kicks-ass.net>

On Thu, 4 Jun 2026 11:34:45 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Mon, Jun 01, 2026 at 08:40:01AM +0900, Masami Hiramatsu wrote:
> 
> > Peter, is it OK to drop @rq from task_on_cpu()? 
> 
> Sure.
> 
> > Then we can use it from rethook.
> 
> Well, it is in sched/sched.h, which is an internal header, and no you
> cannot use that header in rethook.

Ah, OK. Hmm, then we should not use it. Maybe ->on_cpu is also internal
state?

> 
> But lets step back first, what is the actual problem here, why are we
> looking at ->on_cpu at all?

Tengda, can you explain it?
I think you want to take a stack trace of a !current task, and
rethook_find_ret_addr() rejects the request if the task is in the
running state.

But if you can share the actual situation you need this for, it would
help us understand.

Thank you,


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>


* [POC] KVM: selftests: Verify conversion works with TDX
From: Ackerley Tng @ 2026-06-05 13:41 UTC
  To: devnull+ackerleytng.google.com
  Cc: ackerleytng, aik, akpm, andrew.jones, aneesh.kumar, axelrasmussen,
	baohua, bhe, binbin.wu, bp, brauner, chao.p.peng, chrisl, corbet,
	dave.hansen, david, forkloop, hpa, ira.weiny, jgg, jmattson,
	jthoughton, kas, kasong, kvm, liam, linux-coco, linux-doc,
	linux-kernel, linux-kselftest, linux-mm, linux-trace-kernel,
	mathieu.desnoyers, mhiramat, michael.roth, mingo, nphamcs, oupton,
	pankaj.gupta, pbonzini, pratyush, qi.zheng, qperret,
	rick.p.edgecombe, rientjes, rostedt, seanjc, shakeel.butt,
	shikemeng, shivankg, shuah, skhan, steven.price, suzuki.poulose,
	tabba, tglx, vannapurve, vbabka, weixugc, willy, wyihan, x86,
	yan.y.zhao, youngjun.park, yuanchu
In-Reply-To: <20260522-gmem-inplace-conversion-v7-0-2f0fae496530@google.com>

This POC shows that conversion works with TDX:

1. Find 2 pages in GVA space, map those twice, once as private and once as
   shared. This avoids having to manipulate page tables in the guest.
2. Use memory as private memory in the guest.
3. Request to convert memory to shared.
4. Write shared memory in the guest, check in the host.
5. Write shared memory in the host, check in the guest.
6. Request to convert memory to private.
7. Use memory as private memory in the guest.

I based this on Lisa's series at [1].

[1] https://lore.kernel.org/all/20260521-tdx-selftests-v13-v13-0-6983ae4c3a4d@google.com/

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 tools/testing/selftests/kvm/x86/tdx_vm_test.c | 154 ++++++++++++++++++
 1 file changed, 154 insertions(+)

diff --git a/tools/testing/selftests/kvm/x86/tdx_vm_test.c b/tools/testing/selftests/kvm/x86/tdx_vm_test.c
index 7cdcaf33b585b..093921af7d93e 100644
--- a/tools/testing/selftests/kvm/x86/tdx_vm_test.c
+++ b/tools/testing/selftests/kvm/x86/tdx_vm_test.c
@@ -26,6 +26,160 @@ TEST(verify_td_lifecycle)
 	kvm_vm_free(vm);
 }

+static gva_t conversions_private_gva;
+static gpa_t conversions_private_gpa;
+static gva_t conversions_shared_gva;
+static gpa_t conversions_shared_gpa;
+static size_t conversions_size;
+
+u64 tdx_map_gpa(u64 gpa, u64 size)
+{
+#define TDG_VP_VMCALL 0
+#define TDG_VP_VMCALL_MAP_GPA 0x10001
+#define TDVMCALL_EXPOSE_REGS_MASK 0xFC00
+	register u64 r10_reg asm("r10") = TDG_VP_VMCALL;
+	register u64 r11_reg asm("r11") = TDG_VP_VMCALL_MAP_GPA;
+	register u64 r12_reg asm("r12") = gpa;
+	register u64 r13_reg asm("r13") = size;
+	register u64 rax_reg asm("rax") = TDG_VP_VMCALL;
+	register u64 rcx_reg asm("rcx") = TDVMCALL_EXPOSE_REGS_MASK;
+
+	asm volatile(
+	 ".byte 0x66,0x0f,0x01,0xcc" /* tdcall */
+	 : "+r" (r10_reg), "+r" (r11_reg)
+	 : "r" (r12_reg), "r" (r13_reg), "r" (rax_reg), "r" (rcx_reg)
+	 : "cc", "memory"
+	);
+
+	return r10_reg;
+}
+
+enum accept_page_level {
+	PAGE_LEVEL_4K = 0,
+	PAGE_LEVEL_2M,
+};
+
+u64 tdx_accept_page(u64 gpa, enum accept_page_level level)
+{
+#define TDG_MEM_PAGE_ACCEPT 6
+	register u64 rax_reg asm("rax") = TDG_MEM_PAGE_ACCEPT;
+	register u64 rcx_reg asm("rcx") = gpa | level;
+
+	asm volatile(
+	 ".byte 0x66,0x0f,0x01,0xcc" /* tdcall */
+	 : "+r" (rax_reg)
+	 : "r" (rcx_reg)
+	 : "cc", "memory"
+	);
+
+	return rax_reg;
+}
+
+static void handle_hypercall_map_gpa(struct kvm_vcpu *vcpu)
+{
+	struct kvm_run *run = vcpu->run;
+	u64 attributes;
+	size_t size;
+	gpa_t gpa;
+
+	TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_HYPERCALL);
+	TEST_ASSERT_EQ(run->hypercall.nr, KVM_HC_MAP_GPA_RANGE);
+	TEST_ASSERT_EQ(run->hypercall.flags, KVM_EXIT_HYPERCALL_LONG_MODE);
+
+	gpa = run->hypercall.args[0];
+	size = run->hypercall.args[1] * PAGE_SIZE;
+	attributes = 0;
+	if (run->hypercall.args[2] & KVM_MAP_GPA_RANGE_ENCRYPTED)
+		attributes = KVM_MEMORY_ATTRIBUTE_PRIVATE;
+
+	vm_mem_set_memory_attributes(vcpu->vm, gpa, size, attributes);
+}
+
+#define CONVERSIONS_PRIVATE_VAL 'A'
+#define CONVERSIONS_GUEST_SHARED_VAL 'B'
+#define CONVERSIONS_HOST_SHARED_VAL 'C'
+#define CONVERSIONS_STAGE_WROTE_SHARED 0x99
+
+static void guest_code_conversions(void)
+{
+	char *addr;
+
+	addr = (void *)conversions_private_gva;
+	WRITE_ONCE(*addr, CONVERSIONS_PRIVATE_VAL);
+	GUEST_ASSERT_EQ(READ_ONCE(*addr), CONVERSIONS_PRIVATE_VAL);
+
+	GUEST_ASSERT_EQ(tdx_map_gpa(conversions_shared_gpa, conversions_size), 0);
+
+	addr = (void *)conversions_shared_gva;
+	WRITE_ONCE(*addr, CONVERSIONS_GUEST_SHARED_VAL);
+	GUEST_ASSERT_EQ(READ_ONCE(*addr), CONVERSIONS_GUEST_SHARED_VAL);
+
+	GUEST_SYNC(CONVERSIONS_STAGE_WROTE_SHARED);
+
+	GUEST_ASSERT_EQ(READ_ONCE(*addr), CONVERSIONS_HOST_SHARED_VAL);
+
+	GUEST_ASSERT_EQ(tdx_map_gpa(conversions_private_gpa, conversions_size), 0);
+	GUEST_ASSERT_EQ(tdx_accept_page(conversions_private_gpa, PAGE_LEVEL_4K), 0);
+
+	addr = (void *)conversions_private_gva;
+	WRITE_ONCE(*addr, CONVERSIONS_PRIVATE_VAL);
+	GUEST_ASSERT_EQ(READ_ONCE(*addr), CONVERSIONS_PRIVATE_VAL);
+
+	GUEST_DONE();
+}
+
+TEST(verify_conversions)
+{
+	struct kvm_vcpu *vcpu;
+	struct kvm_vm *vm;
+	struct ucall uc;
+	char *test_hva;
+
+	vm = __vm_create(VM_SHAPE_TDX, 1, 0);
+	vcpu = vm_vcpu_add(vm, 0, guest_code_conversions);
+
+	conversions_size = getpagesize();
+
+	conversions_private_gva = vm_alloc_page(vm);
+	conversions_shared_gva = vm_alloc_shared(vm, conversions_size,
+						 KVM_UTIL_MIN_VADDR,
+						 MEM_REGION_TEST_DATA);
+	conversions_private_gpa = addr_gva2gpa(vm, conversions_private_gva);
+	conversions_shared_gpa = conversions_private_gpa | BIT_ULL(vm->pa_bits - 1);
+
+	vm_enable_cap(vm, KVM_CAP_EXIT_HYPERCALL, (1 << KVM_HC_MAP_GPA_RANGE));
+
+	sync_global_to_guest(vm, conversions_size);
+	sync_global_to_guest(vm, conversions_private_gva);
+	sync_global_to_guest(vm, conversions_private_gpa);
+	sync_global_to_guest(vm, conversions_shared_gva);
+	sync_global_to_guest(vm, conversions_shared_gpa);
+
+	kvm_arch_vm_finalize_vcpus(vm);
+
+	test_hva = addr_gva2hva(vm, conversions_shared_gva);
+
+	vcpu_run(vcpu);
+	handle_hypercall_map_gpa(vcpu);
+
+	vcpu_run(vcpu);
+	TEST_ASSERT_EQ(get_ucall(vcpu, &uc), UCALL_SYNC);
+	TEST_ASSERT_EQ(uc.args[1], CONVERSIONS_STAGE_WROTE_SHARED);
+
+	TEST_ASSERT_EQ(READ_ONCE(*test_hva), CONVERSIONS_GUEST_SHARED_VAL);
+
+	WRITE_ONCE(*test_hva, CONVERSIONS_HOST_SHARED_VAL);
+	TEST_ASSERT_EQ(READ_ONCE(*test_hva), CONVERSIONS_HOST_SHARED_VAL);
+
+	vcpu_run(vcpu);
+	handle_hypercall_map_gpa(vcpu);
+
+	vcpu_run(vcpu);
+	TEST_ASSERT_EQ(get_ucall(vcpu, &uc), UCALL_DONE);
+
+	kvm_vm_free(vm);
+}
+
 int main(int argc, char **argv)
 {
 	TEST_REQUIRE(is_tdx_supported());
--
2.54.0.1032.g2f8565e1d1-goog


* RE: [PATCH] mm/memory-failure: trace: change memory_failure_event to ras subsystem
From: Zhuo, Qiuxu @ 2026-06-05 13:09 UTC
  To: Xie Yuanbin, david@kernel.org, bp@alien8.de,
	akpm@linux-foundation.org, rostedt@goodmis.org,
	linmiaohe@huawei.com, nao.horiguchi@gmail.com,
	mhiramat@kernel.org, mchehab+huawei@kernel.org, Luck, Tony,
	Lai, Yi1
  Cc: linux-edac@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, linux-trace-kernel@vger.kernel.org,
	torvalds@linux-foundation.org, lilinjie8@huawei.com,
	liaohua4@huawei.com
In-Reply-To: <20260605081213.154660-1-xieyuanbin1@huawei.com>

> From: Xie Yuanbin <xieyuanbin1@huawei.com>
> [...]
> Subject: [PATCH] mm/memory-failure: trace: change memory_failure_event to
> ras subsystem
> 
> Historically, commit 97f0b1345219 ("tracing: add trace event for
> memory-failure") introduced memory_failure_event in the ras subsystem.
> Commit 31807483d395 ("mm/memory-failure: remove the selection of RAS")
> moved memory_failure_event to the memory_failure subsystem. This breaks
> backward compatibility, because some user programs rely on the old
> subsystem name.
> 
> Move memory_failure_event back to the ras subsystem to restore backward
> compatibility.
> 
> Fixes: 31807483d395 ("mm/memory-failure: remove the selection of RAS")
> 
> Reported-by: Yi Lai <yi1.lai@intel.com>
> Reported-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
> Closes: https://lore.kernel.org/linux-mm/CY8PR11MB7134346A3E4BB28ECA28D6E989132@CY8PR11MB7134.namprd11.prod.outlook.com
> Cc: David Hildenbrand <david@kernel.org>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Miaohe Lin <linmiaohe@huawei.com>
> Signed-off-by: Xie Yuanbin <xieyuanbin1@huawei.com>

LGTM.

  Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>

Verified that rasdaemon can enable and receive memory_failure_event on
v7.1-rc3.

  Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
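
For readers wondering why the subsystem name matters to userspace: tracefs
exposes each event under events/<subsystem>/<event>/, so a consumer that
hard-codes "ras" stops finding the event once it moves to another
subsystem. A minimal sketch of that path construction, assuming a
hypothetical helper name (this is not rasdaemon's code):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: build the tracefs "enable" path for an event.
 * Returns 0 on success, -1 if the path does not fit in @buf.
 */
static int event_enable_path(char *buf, size_t size,
			     const char *subsystem, const char *event)
{
	int n;

	assert(buf && subsystem && event);
	n = snprintf(buf, size, "/sys/kernel/tracing/events/%s/%s/enable",
		     subsystem, event);
	return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

With subsystem "ras" this yields the path that exists again after the
patch; with "memory_failure" it yields the path the event had moved to.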

Thanks
-Qiuxu


* [PATCH v2 6/6] x86/setup: prepend embedded bootconfig cmdline before parse_early_param
From: Breno Leitao @ 2026-06-05 12:03 UTC
  To: Masami Hiramatsu, Andrew Morton, Nathan Chancellor, paulmck,
	Nicolas Schier
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, linux-kernel, linux-trace-kernel, linux-kbuild,
	bpf, Breno Leitao, kernel-team
In-Reply-To: <20260605-bootconfig_using_tools-v2-0-d309f544b5f7@debian.org>

Call xbc_prepend_embedded_cmdline() in setup_arch() right after the
CONFIG_CMDLINE merge and before strscpy(command_line, ...) so the
build-time-rendered embedded bootconfig "kernel" subtree is part of
boot_command_line by the time parse_early_param() runs. early_param()
handlers (mem=, earlycon=, loglevel=, ...) now see values supplied via
CONFIG_BOOT_CONFIG_EMBED_FILE without parsing bootconfig at runtime.

Gate the prepend on the bootconfig opt-in: only fold in the embedded
kernel.* keys when "bootconfig" is present on the command line, or
CONFIG_BOOT_CONFIG_FORCE is set. Applying the embedded cmdline
unconditionally would (a) diverge from how embedded init.* keys are
treated and (b) break fail-safe recovery: a malformed embedded
console=/mem= could panic the boot with no way for the admin to disable
it by dropping "bootconfig" from the bootloader cmdline.
cmdline_find_option_bool() runs before parse_early_param(), so the gate
is cheap and correctly ordered.
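
The gate can be illustrated in userspace roughly as follows. This is a
simplified stand-in, not the kernel's cmdline_find_option_bool(): the
helper names are hypothetical, only spaces are treated as separators, and
the "force" flag stands in for IS_ENABLED(CONFIG_BOOT_CONFIG_FORCE).

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: true if @opt appears as a whole space-delimited word. */
static bool has_bool_option(const char *cmdline, const char *opt)
{
	size_t len = strlen(opt);
	const char *p = cmdline;

	assert(len > 0);
	while ((p = strstr(p, opt)) != NULL) {
		bool starts_word = (p == cmdline) || (p[-1] == ' ');
		bool ends_word = (p[len] == '\0') || (p[len] == ' ');

		if (starts_word && ends_word)
			return true;
		p++;	/* keep scanning after a partial match */
	}
	return false;
}

/* Hypothetical gate: prepend only on explicit opt-in or when forced. */
static bool should_prepend_embedded(const char *cmdline, bool force)
{
	return force || has_bool_option(cmdline, "bootconfig");
}
```

In this sketch "bootconfig=1" does not count as the boolean option, which
is the kind of detection mismatch the last paragraph of this message
covers.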

Select ARCH_SUPPORTS_CMDLINE_FROM_BOOTCONFIG so the user-visible
CONFIG_BOOT_CONFIG_EMBED_CMDLINE option becomes selectable on x86.

With this select in place, setup_boot_config() in init/main.c would
otherwise render the embedded "kernel" subtree a second time via
xbc_make_cmdline("kernel") into extra_command_line, duplicating every
embedded kernel.* key in saved_command_line and making accumulating
handlers (console=, earlycon=, ...) register the same value twice. Skip
that render only when xbc_prepend_embedded_cmdline() actually prepended
the keys, reported by xbc_embedded_cmdline_applied().

Keying the skip on the prepend itself, rather than re-deriving the
opt-in, keeps the two paths consistent even when setup_arch() and the
runtime parser detect "bootconfig" differently (e.g. "bootconfig=1"):
the keys are then rendered at runtime instead of being dropped.
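
As a reading aid, the skip condition reduces to a two-input truth table.
The sketch below uses hypothetical function names and plain booleans in
place of the kernel state; it only mirrors the expression
"!from_embedded || !xbc_embedded_cmdline_applied()" from the diff.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical stand-in for the condition added to setup_boot_config().
 * @from_embedded:   the bootconfig data came from the embedded source
 * @prepend_applied: what xbc_embedded_cmdline_applied() would report
 */
static bool should_render_kernel_keys(bool from_embedded, bool prepend_applied)
{
	return !from_embedded || !prepend_applied;
}

/* Walk all four input combinations and count those that skip the render. */
static int count_skipped_renders(void)
{
	int skipped = 0;

	for (int e = 0; e <= 1; e++) {
		for (int a = 0; a <= 1; a++) {
			if (!should_render_kernel_keys(e, a)) {
				/* the skip needs both conditions to hold */
				assert(e && a);
				skipped++;
			}
		}
	}
	return skipped;
}
```

Only the combination "embedded source and prepend applied" skips the
runtime render; every other case still calls xbc_make_cmdline("kernel").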

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 arch/x86/Kconfig        |  1 +
 arch/x86/kernel/setup.c | 16 ++++++++++++++++
 init/main.c             | 18 +++++++++++++++---
 3 files changed, 32 insertions(+), 3 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index f24810015234..f839795692b4 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -126,6 +126,7 @@ config X86
 	select ARCH_SUPPORTS_NUMA_BALANCING	if X86_64
 	select ARCH_SUPPORTS_KMAP_LOCAL_FORCE_MAP	if NR_CPUS <= 4096
 	select ARCH_SUPPORTS_CFI		if X86_64
+	select ARCH_SUPPORTS_CMDLINE_FROM_BOOTCONFIG
 	select ARCH_USES_CFI_TRAPS		if X86_64 && CFI
 	select ARCH_SUPPORTS_LTO_CLANG
 	select ARCH_SUPPORTS_LTO_CLANG_THIN
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 46882ce79c3a..26a82a41f44c 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -6,6 +6,7 @@
  * parts of early kernel initialization.
  */
 #include <linux/acpi.h>
+#include <linux/bootconfig.h>
 #include <linux/console.h>
 #include <linux/cpu.h>
 #include <linux/crash_dump.h>
@@ -36,6 +37,7 @@
 #include <asm/bios_ebda.h>
 #include <asm/bugs.h>
 #include <asm/cacheinfo.h>
+#include <asm/cmdline.h>
 #include <asm/coco.h>
 #include <asm/cpu.h>
 #include <asm/efi.h>
@@ -924,6 +926,20 @@ void __init setup_arch(char **cmdline_p)
 	builtin_cmdline_added = true;
 #endif
 
+	/*
+	 * Honor the same opt-in as the runtime bootconfig parser: only fold
+	 * the embedded kernel.* keys into the cmdline when "bootconfig" is
+	 * present on the command line (or CONFIG_BOOT_CONFIG_FORCE is set).
+	 * This keeps fail-safe recovery working -- dropping "bootconfig" from
+	 * the bootloader cmdline disables the embedded keys -- so a malformed
+	 * embedded console=/mem= cannot brick a boot with no way out. It also
+	 * matches setup_boot_config(), which bails out under the same
+	 * condition before parsing the embedded bootconfig at runtime.
+	 */
+	if (IS_ENABLED(CONFIG_BOOT_CONFIG_FORCE) ||
+	    cmdline_find_option_bool(boot_command_line, "bootconfig"))
+		xbc_prepend_embedded_cmdline(boot_command_line, COMMAND_LINE_SIZE);
+
 	strscpy(command_line, boot_command_line, COMMAND_LINE_SIZE);
 	*cmdline_p = command_line;
 
diff --git a/init/main.c b/init/main.c
index e363232b428b..567f641a5731 100644
--- a/init/main.c
+++ b/init/main.c
@@ -378,12 +378,15 @@ static void __init setup_boot_config(void)
 	int pos, ret;
 	size_t size;
 	char *err;
+	bool from_embedded = false;
 
 	/* Cut out the bootconfig data even if we have no bootconfig option */
 	data = get_boot_config_from_initrd(&size);
 	/* If there is no bootconfig in initrd, try embedded one. */
-	if (!data)
+	if (!data) {
 		data = xbc_get_embedded_bootconfig(&size);
+		from_embedded = true;
+	}
 
 	strscpy(tmp_cmdline, boot_command_line, COMMAND_LINE_SIZE);
 	err = parse_args("bootconfig", tmp_cmdline, NULL, 0, 0, 0, NULL,
@@ -421,8 +424,17 @@ static void __init setup_boot_config(void)
 	} else {
 		xbc_get_info(&ret, NULL);
 		pr_info("Load bootconfig: %ld bytes %d nodes\n", (long)size, ret);
-		/* keys starting with "kernel." are passed via cmdline */
-		extra_command_line = xbc_make_cmdline("kernel");
+		/*
+		 * keys starting with "kernel." are passed via cmdline. When
+		 * this bootconfig came from the embedded source and
+		 * setup_arch() already prepended the rendered "kernel" subtree
+		 * to boot_command_line, rendering again here would duplicate
+		 * the keys in saved_command_line and make accumulating handlers
+		 * (console=, earlycon=, ...) re-register the same value. Skip
+		 * only when the prepend really happened.
+		 */
+		if (!from_embedded || !xbc_embedded_cmdline_applied())
+			extra_command_line = xbc_make_cmdline("kernel");
 		/* Also, "init." keys are init arguments */
 		extra_init_args = xbc_make_cmdline("init");
 	}

-- 
2.53.0-Meta



* [PATCH v2 5/6] bootconfig: add xbc_prepend_embedded_cmdline() helper
From: Breno Leitao @ 2026-06-05 12:03 UTC
  To: Masami Hiramatsu, Andrew Morton, Nathan Chancellor, paulmck,
	Nicolas Schier
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, linux-kernel, linux-trace-kernel, linux-kbuild,
	bpf, Breno Leitao, kernel-team
In-Reply-To: <20260605-bootconfig_using_tools-v2-0-d309f544b5f7@debian.org>

Add a helper that prepends the build-time-rendered embedded bootconfig
"kernel" subtree (embedded_kernel_cmdline[] from embedded-cmdline.S) to
a cmdline buffer with a separating space. Architectures call this from
setup_arch() before parse_early_param() so early_param() handlers
(mem=, earlycon=, loglevel=, ...) see values supplied via the embedded
bootconfig.

The in-place prepend (shift the existing string right, then drop the
embedded string in front) is factored into a small str_prepend() helper.

On overflow the helper logs an error and leaves the cmdline untouched
rather than panicking. Booting without the embedded values is better
than refusing to boot, and the error tells the user why their embedded
keys are missing.

The helper records whether it actually prepended, exposed via
xbc_embedded_cmdline_applied(). setup_boot_config() uses this to decide
whether the runtime "kernel" render would duplicate keys already folded
into boot_command_line.

When CONFIG_BOOT_CONFIG_EMBED_CMDLINE=n, the public declaration in
<linux/bootconfig.h> resolves to a no-op stub so callers compile
unchanged.
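
The prepend-or-leave-untouched behavior can be tried out in userspace with
a sketch like the one below. The function name and the boolean return value
are assumptions made for illustration; the kernel helper returns void,
logs with pr_err() on overflow, and records the result in a separate flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical userspace sketch of the in-place prepend.
 * Returns true if @src was prepended, false if @dst was left untouched.
 */
static bool prepend_in_place(char *dst, size_t size,
			     const char *src, size_t src_len)
{
	const char *nul;
	size_t dst_len;

	assert(dst && src);
	if (!size || !src_len)
		return false;

	/* @dst must hold a NUL-terminated string within @size bytes */
	nul = memchr(dst, '\0', size);
	if (!nul)
		return false;
	dst_len = (size_t)(nul - dst);

	/* Not enough room for src + dst + NUL: leave @dst alone */
	if (src_len + dst_len + 1 > size)
		return false;

	/* Shift the existing string (and its NUL) right, then fill the gap */
	memmove(dst + src_len, dst, dst_len + 1);
	memcpy(dst, src, src_len);
	return true;
}
```

Moving dst_len + 1 bytes carries the terminator along, so an empty
destination needs no special case, and memmove() is required because the
source and destination ranges overlap.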

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 include/linux/bootconfig.h |  9 ++++++
 lib/bootconfig.c           | 78 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 87 insertions(+)

diff --git a/include/linux/bootconfig.h b/include/linux/bootconfig.h
index 1c7f3b74ffcf..c186137f87ac 100644
--- a/include/linux/bootconfig.h
+++ b/include/linux/bootconfig.h
@@ -308,4 +308,13 @@ static inline const char *xbc_get_embedded_bootconfig(size_t *size)
 }
 #endif
 
+/* Build-time-rendered bootconfig cmdline prepended in setup_arch() */
+#ifdef CONFIG_BOOT_CONFIG_EMBED_CMDLINE
+void __init xbc_prepend_embedded_cmdline(char *dst, size_t size);
+bool __init xbc_embedded_cmdline_applied(void);
+#else
+static inline void xbc_prepend_embedded_cmdline(char *dst, size_t size) { }
+static inline bool xbc_embedded_cmdline_applied(void) { return false; }
+#endif
+
 #endif
diff --git a/lib/bootconfig.c b/lib/bootconfig.c
index 926094d97397..f66be0b2dc24 100644
--- a/lib/bootconfig.c
+++ b/lib/bootconfig.c
@@ -19,6 +19,7 @@
 #include <linux/errno.h>
 #include <linux/cache.h>
 #include <linux/compiler.h>
+#include <linux/printk.h>
 #include <linux/sprintf.h>
 #include <linux/memblock.h>
 #include <linux/string.h>
@@ -34,6 +35,83 @@ const char * __init xbc_get_embedded_bootconfig(size_t *size)
 	return (*size) ? embedded_bootconfig_data : NULL;
 }
 #endif
+
+#ifdef CONFIG_BOOT_CONFIG_EMBED_CMDLINE
+/* embedded_kernel_cmdline is defined in embedded-cmdline.S */
+extern __visible const char embedded_kernel_cmdline[];
+extern __visible const char embedded_kernel_cmdline_end[];
+
+/* Set once the embedded cmdline has actually been prepended. */
+static bool xbc_cmdline_applied __initdata;
+
+/*
+ * str_prepend() - Prepend @src in front of the string in @dst, in place
+ * @dst: NUL-terminated destination buffer, currently @dst_len bytes long
+ * @dst_len: length of the current @dst string (excluding its NUL)
+ * @src: bytes to prepend (not NUL-terminated)
+ * @src_len: number of bytes from @src to prepend
+ *
+ * The caller must guarantee @dst has room for src_len + dst_len + 1 bytes.
+ * Moving dst_len + 1 bytes carries @dst's NUL terminator too, so an empty
+ * @dst needs no special case.
+ */
+static void __init str_prepend(char *dst, size_t dst_len,
+			       const char *src, size_t src_len)
+{
+	memmove(dst + src_len, dst, dst_len + 1);
+	memcpy(dst, src, src_len);
+}
+
+/**
+ * xbc_prepend_embedded_cmdline() - Prepend embedded bootconfig cmdline
+ * @dst: cmdline buffer to prepend into (must already contain a NUL byte)
+ * @size: total capacity of @dst in bytes
+ *
+ * Prepend the build-time-rendered "kernel" subtree of the embedded
+ * bootconfig to @dst. The rendered string already ends with a single
+ * space (the xbc_snprint_cmdline() invariant), which serves as the
+ * separator between the embedded keys and any existing content of @dst.
+ * On overflow, log an error and leave @dst untouched rather than
+ * silently truncating: booting without the embedded values is better
+ * than refusing to boot, and the error message tells the user why
+ * their embedded keys are missing.
+ *
+ * Intended to be called from setup_arch() before parse_early_param() so
+ * that early_param() handlers see the embedded values.
+ */
+void __init xbc_prepend_embedded_cmdline(char *dst, size_t size)
+{
+	size_t embed_len = embedded_kernel_cmdline_end - embedded_kernel_cmdline;
+	size_t dst_len;
+
+	if (!size || embed_len <= 1)	/* trailing NUL only */
+		return;
+	embed_len--;			/* exclude trailing NUL byte */
+
+	dst_len = strnlen(dst, size);
+	if (embed_len + dst_len + 1 > size) {
+		pr_err("embedded bootconfig cmdline (%zu bytes) does not fit in COMMAND_LINE_SIZE with %zu bytes already used; ignoring embedded values\n",
+		       embed_len, dst_len);
+		return;
+	}
+
+	str_prepend(dst, dst_len, embedded_kernel_cmdline, embed_len);
+	xbc_cmdline_applied = true;
+}
+
+/**
+ * xbc_embedded_cmdline_applied() - Did the embedded cmdline get prepended?
+ *
+ * Return true if xbc_prepend_embedded_cmdline() actually prepended the
+ * embedded "kernel" subtree. setup_boot_config() uses this to avoid
+ * rendering the same keys a second time.
+ */
+bool __init xbc_embedded_cmdline_applied(void)
+{
+	return xbc_cmdline_applied;
+}
+#endif
+
 #endif
 
 /*

-- 
2.53.0-Meta



* [PATCH v2 4/6] bootconfig: clean build-time tools/bootconfig from make clean
From: Breno Leitao @ 2026-06-05 12:03 UTC
  To: Masami Hiramatsu, Andrew Morton, Nathan Chancellor, paulmck,
	Nicolas Schier
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, linux-kernel, linux-trace-kernel, linux-kbuild,
	bpf, Breno Leitao, kernel-team
In-Reply-To: <20260605-bootconfig_using_tools-v2-0-d309f544b5f7@debian.org>

The previous patch builds tools/bootconfig during 'make prepare' to
render the embedded bootconfig cmdline, but nothing removes it on
'make clean', leaving the compiled tool and its objects behind.

Wire a bootconfig_clean hook into the top-level clean target so the
compiled tool and its objects are removed by make clean, matching the
prepare-wired tools/objtool and tools/bpf/resolve_btfids.

The hook runs tools/bootconfig's Makefile via $(MAKE), which the kernel
build invokes with -rR (MAKEFLAGS += -rR). -rR drops the built-in $(RM)
variable, so the existing "$(RM) -f ..." clean recipe would expand to a
bare "-f ..." and fail. Spell the recipe with a literal "rm -f" so it
keeps working both standalone and when invoked from Kbuild.

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 Makefile                  | 13 ++++++++++++-
 tools/bootconfig/Makefile |  2 +-
 2 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/Makefile b/Makefile
index e95992959f44..4f31222ff1d1 100644
--- a/Makefile
+++ b/Makefile
@@ -1574,6 +1574,17 @@ ifneq ($(wildcard $(objtool_O)),)
 	$(Q)$(MAKE) -sC $(abs_srctree)/tools/objtool O=$(objtool_O) srctree=$(abs_srctree) $(patsubst objtool_%,%,$@)
 endif
 
+PHONY += bootconfig_clean
+
+bootconfig_O = $(abspath $(objtree))/tools/bootconfig
+
+# tools/bootconfig is only built (via the prepare hook above) when
+# CONFIG_BOOT_CONFIG_EMBED_CMDLINE is set; skip its clean otherwise.
+bootconfig_clean:
+ifneq ($(wildcard $(bootconfig_O)),)
+	$(Q)$(MAKE) -sC $(srctree)/tools/bootconfig O=$(bootconfig_O) clean
+endif
+
 tools/: FORCE
 	$(Q)mkdir -p $(objtree)/tools
 	$(Q)$(MAKE) O=$(abspath $(objtree)) subdir=tools -C $(srctree)/tools/
@@ -1743,7 +1754,7 @@ vmlinuxclean:
 	$(Q)$(CONFIG_SHELL) $(srctree)/scripts/link-vmlinux.sh clean
 	$(Q)$(if $(ARCH_POSTLINK), $(MAKE) -f $(ARCH_POSTLINK) clean)
 
-clean: archclean vmlinuxclean resolve_btfids_clean objtool_clean
+clean: archclean vmlinuxclean resolve_btfids_clean objtool_clean bootconfig_clean
 
 # mrproper - Delete all generated files, including .config
 #
diff --git a/tools/bootconfig/Makefile b/tools/bootconfig/Makefile
index aa75a7828685..86f1a4e64f04 100644
--- a/tools/bootconfig/Makefile
+++ b/tools/bootconfig/Makefile
@@ -31,4 +31,4 @@ install: $(ALL_PROGRAMS)
 	install $(OUTPUT)bootconfig $(DESTDIR)$(bindir)
 
 clean:
-	$(RM) -f $(OUTPUT)*.o $(ALL_PROGRAMS)
+	rm -f $(OUTPUT)*.o $(ALL_PROGRAMS)

-- 
2.53.0-Meta



* [PATCH v2 3/6] bootconfig: render embedded bootconfig as a kernel cmdline at build time
From: Breno Leitao @ 2026-06-05 12:03 UTC
  To: Masami Hiramatsu, Andrew Morton, Nathan Chancellor, paulmck,
	Nicolas Schier
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, linux-kernel, linux-trace-kernel, linux-kbuild,
	bpf, Breno Leitao, kernel-team
In-Reply-To: <20260605-bootconfig_using_tools-v2-0-d309f544b5f7@debian.org>

Add the build-time pipeline that renders the "kernel" subtree of
CONFIG_BOOT_CONFIG_EMBED_FILE into a flat cmdline string and stashes
it in .init.rodata as embedded_kernel_cmdline[]. A follow-up patch
adds the runtime helper that prepends this string to boot_command_line
during early architecture setup so parse_early_param() sees the values.

The build wires up:
  tools/bootconfig -C kernel - userspace tool already shared with
                               lib/bootconfig.c, used here in -C mode
                               to render a bootconfig file to a cmdline
  lib/embedded-cmdline.S     - .incbin's the rendered text plus a NUL
                               (listed under the EXTRA BOOT CONFIG
                               MAINTAINERS entry)
  lib/Makefile rule          - runs tools/bootconfig at build time
  Makefile prepare dep       - ensures tools/bootconfig is built first,
                               same pattern as tools/objtool and
                               tools/bpf/resolve_btfids

Drop the test target from tools/bootconfig/Makefile's default 'all'
recipe so that hooking the binary into the kernel build does not run
test-bootconfig.sh on every prepare. The tests stay available as
'make -C tools/bootconfig test', matching the convention of
tools/objtool and tools/bpf/resolve_btfids whose 'all' targets only
build the binary.

Require BOOT_CONFIG_EMBED_FILE to be non-empty before the new option
can be enabled, otherwise tools/bootconfig -C runs against an empty
file and prints a parse error on every kernel build.

The feature gates on CONFIG_ARCH_SUPPORTS_CMDLINE_FROM_BOOTCONFIG, a
silent symbol arches select once they've wired the prepend call into
setup_arch(). No arch selects it in this patch, so the user-visible
CONFIG_BOOT_CONFIG_EMBED_CMDLINE is not yet enableable; when an arch
later opts in, the runtime behavior is added by the follow-up patches.

tools/bootconfig is compiled with $(HOSTCC), not $(CC): Kbuild exports
CC as the target cross-compiler, but the prepare hook runs the tool on
the build host, so $(CC) would produce a binary that fails to exec
("Exec format error") under ARCH=... cross builds. This mirrors
tools/objtool and tools/bpf/resolve_btfids.

embedded-cmdline.S places the rendered string in .init.rodata with the
"a" (allocatable, read-only) flag and %progbits, not "aw": the data is
never written at runtime, so it must not land in a writable section.

A follow-up patch wires the build-time tools/bootconfig into the
top-level clean target.

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 MAINTAINERS               |  1 +
 Makefile                  |  5 +++++
 init/Kconfig              | 33 +++++++++++++++++++++++++++++++++
 lib/Makefile              | 16 ++++++++++++++++
 lib/embedded-cmdline.S    | 16 ++++++++++++++++
 tools/bootconfig/Makefile |  8 ++++++--
 6 files changed, 77 insertions(+), 2 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 4087b67bbc69..fb9314cbe344 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9845,6 +9845,7 @@ F:	fs/proc/bootconfig.c
 F:	include/linux/bootconfig.h
 F:	lib/bootconfig-data.S
 F:	lib/bootconfig.c
+F:	lib/embedded-cmdline.S
 F:	tools/bootconfig/*
 F:	tools/bootconfig/scripts/*
 
diff --git a/Makefile b/Makefile
index d59f703f9797..e95992959f44 100644
--- a/Makefile
+++ b/Makefile
@@ -1543,6 +1543,11 @@ prepare: tools/bpf/resolve_btfids
 endif
 endif
 
+# tools/bootconfig renders the embedded bootconfig into a cmdline at build time.
+ifdef CONFIG_BOOT_CONFIG_EMBED_CMDLINE
+prepare: tools/bootconfig
+endif
+
 # The tools build system is not a part of Kbuild and tends to introduce
 # its own unique issues. If you need to integrate a new tool into Kbuild,
 # please consider locating that tool outside the tools/ tree and using the
diff --git a/init/Kconfig b/init/Kconfig
index ca35184532dc..5f491a5ac4b8 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1569,6 +1569,39 @@ config BOOT_CONFIG_EMBED_FILE
 	  This bootconfig will be used if there is no initrd or no other
 	  bootconfig in the initrd.
 
+config ARCH_SUPPORTS_CMDLINE_FROM_BOOTCONFIG
+	bool
+	help
+	  Selected by architectures whose setup_arch() prepends the
+	  build-time-rendered embedded bootconfig cmdline to
+	  boot_command_line before parse_early_param() runs.
+
+config BOOT_CONFIG_EMBED_CMDLINE
+	bool "Render embedded bootconfig as kernel cmdline at build time"
+	depends on BOOT_CONFIG_EMBED
+	depends on BOOT_CONFIG_EMBED_FILE != ""
+	depends on ARCH_SUPPORTS_CMDLINE_FROM_BOOTCONFIG
+	default n
+	help
+	  Render the "kernel" subtree of the embedded bootconfig file into a
+	  flat cmdline string at kernel build time and prepend it to
+	  boot_command_line during early architecture setup. This makes
+	  early_param() handlers (e.g. mem=, earlycon=, loglevel=) see the
+	  values supplied via the embedded bootconfig.
+
+	  The runtime bootconfig parser is unaffected, so tree-structured
+	  consumers such as ftrace boot-time tracing keep working.
+
+	  Note: when an initrd also carries a bootconfig, its "kernel"
+	  subtree is still parsed at runtime, but the embedded "kernel"
+	  keys remain in boot_command_line for parse_early_param() and
+	  end up later than the initrd keys in saved_command_line, so
+	  parse_args() last-wins favors the embedded values. If you need
+	  initrd to override embedded kernel.* keys, leave this option
+	  off.
+
+	  If unsure, say N.
+
 config CMDLINE_LOG_WRAP_IDEAL_LEN
 	int "Length to try to wrap the cmdline when logged at boot"
 	default 1021
diff --git a/lib/Makefile b/lib/Makefile
index 6e72d2c1cce7..9de0ac7732a2 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -273,6 +273,22 @@ filechk_defbconf = cat $(or $(real-prereqs), /dev/null)
 $(obj)/default.bconf: $(CONFIG_BOOT_CONFIG_EMBED_FILE) FORCE
 	$(call filechk,defbconf)
 
+obj-$(CONFIG_BOOT_CONFIG_EMBED_CMDLINE) += embedded-cmdline.o
+$(obj)/embedded-cmdline.o: $(obj)/embedded_cmdline.bin
+
+# Render the bootconfig "kernel" subtree to a flat cmdline string using
+# the userspace tools/bootconfig parser (-C mode). The runtime prepend
+# helper enforces COMMAND_LINE_SIZE at boot, so no build-time size
+# check is performed here (COMMAND_LINE_SIZE is an arch header
+# constant, not a Kconfig value).
+quiet_cmd_render_cmdline = BCONF2C $@
+      cmd_render_cmdline = \
+	$(objtree)/tools/bootconfig/bootconfig -C $< > $@
+
+targets += embedded_cmdline.bin
+$(obj)/embedded_cmdline.bin: $(obj)/default.bconf $(objtree)/tools/bootconfig/bootconfig FORCE
+	$(call if_changed,render_cmdline)
+
 obj-$(CONFIG_RBTREE_TEST) += rbtree_test.o
 obj-$(CONFIG_INTERVAL_TREE_TEST) += interval_tree_test.o
 
diff --git a/lib/embedded-cmdline.S b/lib/embedded-cmdline.S
new file mode 100644
index 000000000000..740d7ad2dc01
--- /dev/null
+++ b/lib/embedded-cmdline.S
@@ -0,0 +1,16 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Embed the build-time-rendered bootconfig "kernel" subtree as a flat
+ * cmdline string. setup_arch() prepends this to boot_command_line on
+ * architectures that select ARCH_SUPPORTS_CMDLINE_FROM_BOOTCONFIG.
+ *
+ * Copyright (c) 2026 Meta Platforms, Inc. and affiliates
+ * Copyright (c) 2026 Breno Leitao <leitao@debian.org>
+ */
+	.section .init.rodata, "a", %progbits
+	.global embedded_kernel_cmdline
+embedded_kernel_cmdline:
+	.incbin "lib/embedded_cmdline.bin"
+	.byte 0
+	.global embedded_kernel_cmdline_end
+embedded_kernel_cmdline_end:
diff --git a/tools/bootconfig/Makefile b/tools/bootconfig/Makefile
index 90eb47c9d8de..aa75a7828685 100644
--- a/tools/bootconfig/Makefile
+++ b/tools/bootconfig/Makefile
@@ -15,10 +15,14 @@ override CFLAGS += -Wall -g -I$(CURDIR)/include
 ALL_TARGETS := bootconfig
 ALL_PROGRAMS := $(patsubst %,$(OUTPUT)%,$(ALL_TARGETS))
 
-all: $(ALL_PROGRAMS) test
+all: $(ALL_PROGRAMS)
 
+# bootconfig is a build host tool: Kbuild's prepare hook runs it on the
+# build machine to render the embedded cmdline, so always compile it with
+# $(HOSTCC). Using $(CC) would cross-compile it under ARCH=... builds and
+# fail to exec on the host ("Exec format error").
 $(OUTPUT)bootconfig: main.c include/linux/bootconfig.h $(LIBSRC)
-	$(CC) $(filter %.c,$^) $(CFLAGS) $(LDFLAGS) -o $@
+	$(HOSTCC) $(filter %.c,$^) $(CFLAGS) $(LDFLAGS) -o $@
 
 test: $(ALL_PROGRAMS) test-bootconfig.sh
 	./test-bootconfig.sh $(OUTPUT)

-- 
2.53.0-Meta



* [PATCH v2 2/6] bootconfig: render descendant keys when xbc_snprint_cmdline() root has a value
From: Breno Leitao @ 2026-06-05 12:03 UTC
  To: Masami Hiramatsu, Andrew Morton, Nathan Chancellor, paulmck,
	Nicolas Schier
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, linux-kernel, linux-trace-kernel, linux-kbuild,
	bpf, Breno Leitao, kernel-team
In-Reply-To: <20260605-bootconfig_using_tools-v2-0-d309f544b5f7@debian.org>

xbc_node_for_each_key_value() walks to the first leaf under @root, and
when @root is itself a leaf it yields @root. That happens not only for
an empty "kernel {}" subtree, but also when @root carries both a value
and subkeys, e.g.

	kernel = x
	kernel.foo = bar

Here @root ("kernel") is a leaf because its first child is the value
node "x", so the iterator returns @root first. Feeding @root back into
xbc_node_compose_key_after(root, root) returns -EINVAL, which the only
in-kernel caller papers over with a "len <= 0" check -- but the
follow-up tools/bootconfig -C user propagates the error and turns such
a bootconfig into a build failure. Worse, short-circuiting the whole
call on a leaf @root would silently drop the valid "kernel.foo = bar"
descendant that the pre-existing code rendered.

Skip @root inside the loop instead of bailing out: the value-only entry
is dropped (it is rendered through the "kernel" cmdline path, not here),
while real descendant keys are still emitted. An entirely empty subtree
now renders nothing and returns 0 rather than -EINVAL, matching the
"nothing to render is not an error" semantics expected by the new
build-time caller.
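
The skip can be illustrated with a toy userspace model. This is only a
sketch under stated assumptions: struct node, compose_key_after() and
render_subtree() below are hypothetical stand-ins, not the xbc API, and
the "yielded" array simply plays the role of what the iterator hands
back (including @root itself when @root is a leaf).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical node: a key name, its parent, and an optional value. */
struct node {
	const char *name;
	const struct node *parent;
	const char *value;	/* NULL when the key carries no value */
};

/* Build the dotted key from below @root down to @node; -EINVAL for @root. */
static int compose_key_after(const struct node *root, const struct node *node,
			     char *buf, size_t size)
{
	const struct node *chain[8];
	size_t len = 0;
	int depth = 0;

	if (node == root)
		return -EINVAL;
	for (; node && node != root; node = node->parent) {
		if (depth == 8)
			return -ERANGE;
		chain[depth++] = node;
	}
	if (!node)
		return -EINVAL;	/* @node is not below @root */
	while (depth--) {
		int ret = snprintf(buf + len, size - len, "%s%s",
				   chain[depth]->name, depth ? "." : "");

		if (ret < 0 || (size_t)ret >= size - len)
			return -ERANGE;
		len += (size_t)ret;
	}
	return (int)len;
}

/* Render every yielded node except @root; an empty subtree returns 0. */
int render_subtree(char *out, size_t size, const struct node *root,
		   const struct node *const *yielded, size_t n)
{
	char key[64];
	size_t len = 0;

	assert(out && size > 0);
	for (size_t i = 0; i < n; i++) {
		int fits = len < size;
		int ret;

		if (yielded[i] == root)
			continue;	/* skip instead of failing with -EINVAL */
		ret = compose_key_after(root, yielded[i], key, sizeof(key));
		if (ret < 0)
			return ret;
		if (yielded[i]->value)
			ret = snprintf(fits ? out + len : NULL,
				       fits ? size - len : 0, "%s=%s ",
				       key, yielded[i]->value);
		else
			ret = snprintf(fits ? out + len : NULL,
				       fits ? size - len : 0, "%s ", key);
		if (ret < 0)
			return ret;
		len += (size_t)ret;
	}
	return (int)len;
}
```

With a root "kernel = x" plus a child "foo = bar", the model renders
only "foo=bar "; with the root alone it renders nothing and returns 0.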

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 lib/bootconfig.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/lib/bootconfig.c b/lib/bootconfig.c
index 2ed9ee3dc81c..926094d97397 100644
--- a/lib/bootconfig.c
+++ b/lib/bootconfig.c
@@ -440,6 +440,17 @@ int __init xbc_snprint_cmdline(char *buf, size_t size, struct xbc_node *root)
 	 * itself is well defined and returns the would-be length.
 	 */
 	xbc_node_for_each_key_value(root, knode, val) {
+		/*
+		 * An empty or value-only @root (e.g. "kernel {}" or
+		 * "kernel = x", possibly alongside "kernel.foo = bar")
+		 * yields @root itself here. Skip it: composing a key for it
+		 * would fail with -EINVAL, yet any real descendant keys must
+		 * still be rendered. An entirely empty subtree then renders
+		 * nothing and returns 0 rather than an error.
+		 */
+		if (knode == root)
+			continue;
+
 		ret = xbc_node_compose_key_after(root, knode,
 					xbc_namebuf, XBC_KEYLEN_MAX);
 		if (ret < 0)

-- 
2.53.0-Meta


* [PATCH v2 1/6] bootconfig: fix NULL-pointer arithmetic in xbc_snprint_cmdline()
From: Breno Leitao @ 2026-06-05 12:03 UTC
  To: Masami Hiramatsu, Andrew Morton, Nathan Chancellor, paulmck,
	Nicolas Schier
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, linux-kernel, linux-trace-kernel, linux-kbuild,
	bpf, Breno Leitao, kernel-team
In-Reply-To: <20260605-bootconfig_using_tools-v2-0-d309f544b5f7@debian.org>

xbc_snprint_cmdline() is meant to be called twice: first with
buf=NULL, size=0 to probe the rendered length, then with a real
buffer to fill it (the standard snprintf() two-pass pattern). The
probe call makes the function compute "buf + size" (NULL + 0) and,
on every iteration, advance "buf += ret" from that NULL base and
pass the result back into snprintf().

Pointer arithmetic on a NULL pointer is undefined behavior. It is
harmless in the in-kernel callers today, but the follow-up patches
run this same code in the userspace tools/bootconfig parser at kernel
build time, where host UBSan / FORTIFY_SOURCE abort the build.

Track a running written length (size_t) instead of mutating @buf, and
only form "buf + len" when @buf is non-NULL. snprintf(NULL, 0, ...)
is itself well defined and returns the would-be length, so the
two-pass "probe then fill" usage returns identical byte counts.
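
A minimal userspace sketch of the same length-tracking idea, for
illustration only. render_kv() and struct kv are hypothetical names,
not the kernel's xbc API, and the extra "len < size" guard is an added
assumption so the sketch never forms a pointer past the end of the
buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical key/value pair; stands in for a bootconfig node. */
struct kv {
	const char *key;
	const char *val;
};

/*
 * Render "key=value " pairs into buf, snprintf()-style: return the
 * length the full output needs, even when buf is NULL or too small.
 */
int render_kv(char *buf, size_t size, const struct kv *kv, size_t n)
{
	size_t len = 0;	/* running would-be length; buf is never advanced */

	assert(kv || n == 0);
	for (size_t i = 0; i < n; i++) {
		/* Quote values containing whitespace. */
		const char *q = strpbrk(kv[i].val, " \t\r\n") ? "\"" : "";
		/* Only form "buf + len" when it points inside a real buffer. */
		int fits = buf && len < size;
		int ret = snprintf(fits ? buf + len : NULL,
				   fits ? size - len : 0,
				   "%s=%s%s%s ", kv[i].key, q, kv[i].val, q);

		if (ret < 0)
			return ret;
		len += (size_t)ret;
	}
	return (int)len;
}
```

Calling render_kv(NULL, 0, kv, n) first gives the size to allocate
(plus one byte for the NUL); a second call with the real buffer then
returns the same count.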

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 lib/bootconfig.c | 23 ++++++++++++++++-------
 1 file changed, 16 insertions(+), 7 deletions(-)

diff --git a/lib/bootconfig.c b/lib/bootconfig.c
index f445b7703fdd..2ed9ee3dc81c 100644
--- a/lib/bootconfig.c
+++ b/lib/bootconfig.c
@@ -427,10 +427,18 @@ static char xbc_namebuf[XBC_KEYLEN_MAX] __initdata;
 int __init xbc_snprint_cmdline(char *buf, size_t size, struct xbc_node *root)
 {
 	struct xbc_node *knode, *vnode;
-	char *end = buf + size;
 	const char *val, *q;
+	size_t len = 0;
 	int ret;
 
+	/*
+	 * Track the running written length rather than advancing @buf, so we
+	 * never form "buf + size" or "buf += ret" while @buf is NULL (the
+	 * size-probe call passes buf=NULL, size=0). NULL pointer arithmetic
+	 * is undefined behavior and trips host UBSan / FORTIFY_SOURCE when
+	 * this renderer runs at kernel build time. snprintf(NULL, 0, ...)
+	 * itself is well defined and returns the would-be length.
+	 */
 	xbc_node_for_each_key_value(root, knode, val) {
 		ret = xbc_node_compose_key_after(root, knode,
 					xbc_namebuf, XBC_KEYLEN_MAX);
@@ -439,10 +447,11 @@ int __init xbc_snprint_cmdline(char *buf, size_t size, struct xbc_node *root)
 
 		vnode = xbc_node_get_child(knode);
 		if (!vnode) {
-			ret = snprintf(buf, rest(buf, end), "%s ", xbc_namebuf);
+			ret = snprintf(buf ? buf + len : NULL, rest(len, size),
+				       "%s ", xbc_namebuf);
 			if (ret < 0)
 				return ret;
-			buf += ret;
+			len += ret;
 			continue;
 		}
 		xbc_array_for_each_value(vnode, val) {
@@ -452,15 +461,15 @@ int __init xbc_snprint_cmdline(char *buf, size_t size, struct xbc_node *root)
 			 * whitespace.
 			 */
 			q = strpbrk(val, " \t\r\n") ? "\"" : "";
-			ret = snprintf(buf, rest(buf, end), "%s=%s%s%s ",
-				       xbc_namebuf, q, val, q);
+			ret = snprintf(buf ? buf + len : NULL, rest(len, size),
+				       "%s=%s%s%s ", xbc_namebuf, q, val, q);
 			if (ret < 0)
 				return ret;
-			buf += ret;
+			len += ret;
 		}
 	}
 
-	return buf - (end - size);
+	return len;
 }
 #undef rest
 

-- 
2.53.0-Meta



* [PATCH v2 0/6] bootconfig: embed kernel.* cmdline at build time
From: Breno Leitao @ 2026-06-05 12:03 UTC
  To: Masami Hiramatsu, Andrew Morton, Nathan Chancellor, paulmck,
	Nicolas Schier
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, linux-kernel, linux-trace-kernel, linux-kbuild,
	bpf, Breno Leitao, kernel-team

The userspace pieces (xbc_snprint_cmdline() in lib/, tools/bootconfig -C)
already landed; this series wires the rendered cmdline into the kernel.

Motivation: today the embedded bootconfig is parsed at runtime, after
parse_early_param() has already run, so early_param() handlers can't
see embedded values. Folding the kernel.* subtree into the cmdline at
build time gives a CONFIG_CMDLINE-equivalent for embedded-bootconfig
users without forcing them to maintain two cmdline sources.

Behaviorally, the "kernel" subtree is rendered to a flat string at
build time and stashed in .init.rodata. setup_arch() prepends it to
boot_command_line before parse_early_param() runs. Overflow is a soft
error: the helper logs and leaves boot_command_line untouched rather
than panicking, so an oversized embedded bconf cannot brick a boot.
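
The soft-error behavior can be sketched in userspace as follows. This
is an assumption-laden illustration, not the series' helper: the real
xbc_prepend_embedded_cmdline() lives in a later patch, while
prepend_cmdline() here is a hypothetical name, logging is omitted, and
the separator handling is a guess at what a caller would want.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Prepend @embedded to the NUL-terminated @cmdline held in a buffer of
 * @cap bytes. On overflow, return -E2BIG and leave @cmdline untouched.
 */
int prepend_cmdline(char *cmdline, size_t cap, const char *embedded)
{
	size_t elen, clen, sep;

	assert(cmdline && embedded);
	elen = strlen(embedded);
	clen = strlen(cmdline);
	if (!elen)
		return 0;	/* nothing to prepend */

	/* Add a space only if the embedded string lacks a trailing one. */
	sep = (clen && embedded[elen - 1] != ' ') ? 1 : 0;

	/* Check the full size before modifying anything. */
	if (elen + sep + clen + 1 > cap)
		return -E2BIG;

	memmove(cmdline + elen + sep, cmdline, clen + 1);
	memcpy(cmdline, embedded, elen);
	if (sep)
		cmdline[elen] = ' ';
	return 0;
}
```

Checking the size before the first write is what makes the failure
soft: the caller can log the error and boot with the original cmdline.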

Signed-off-by: Breno Leitao <leitao@debian.org>
---
Changes in v2 (addressing review of v1):
- Split out a standalone fix for the NULL-pointer arithmetic in
  xbc_snprint_cmdline() so the build-time render cannot trip host
  UBSan/FORTIFY_SOURCE.
- Rework the leaf-root handling: instead of returning early, skip @root
  inside the loop so a root carrying both a value and subkeys
  (kernel = x together with kernel.foo = bar) still renders its
  descendant keys.
- Build tools/bootconfig with $(HOSTCC) so cross-compiled (ARCH=...)
  builds render the cmdline on the build host instead of failing with
  "Exec format error".
- Mark the embedded cmdline section read-only (drop the "w" flag from
  .init.rodata).
- Add a make-clean hook so tools/bootconfig artifacts are removed by
  make clean.
- Gate the x86 prepend on "bootconfig" being present on the command
  line (or CONFIG_BOOT_CONFIG_FORCE), matching the init.* opt-in
  semantics documented in bootconfig.rst and preserving fail-safe
  recovery: dropping "bootconfig" from the bootloader cmdline now also
  disables the embedded kernel.* keys.
- Link to v1: https://patch.msgid.link/20260527-bootconfig_using_tools-v1-0-b6906a86e7d5@debian.org

---
Breno Leitao (6):
      bootconfig: fix NULL-pointer arithmetic in xbc_snprint_cmdline()
      bootconfig: render descendant keys when xbc_snprint_cmdline() root has a value
      bootconfig: render embedded bootconfig as a kernel cmdline at build time
      bootconfig: clean build-time tools/bootconfig from make clean
      bootconfig: add xbc_prepend_embedded_cmdline() helper
      x86/setup: prepend embedded bootconfig cmdline before parse_early_param

 MAINTAINERS                |   1 +
 Makefile                   |  18 +++++++-
 arch/x86/Kconfig           |   1 +
 arch/x86/kernel/setup.c    |  16 +++++++
 include/linux/bootconfig.h |   9 ++++
 init/Kconfig               |  33 +++++++++++++
 init/main.c                |  18 ++++++--
 lib/Makefile               |  16 +++++++
 lib/bootconfig.c           | 112 ++++++++++++++++++++++++++++++++++++++++++---
 lib/embedded-cmdline.S     |  16 +++++++
 tools/bootconfig/Makefile  |  10 ++--
 11 files changed, 236 insertions(+), 14 deletions(-)
---
base-commit: e7e28506af98ce4e1059e5ec59334b335c00a246
change-id: 20260508-bootconfig_using_tools-cfa7aa9d6a5a

Best regards,
-- 
Breno Leitao <leitao@debian.org>



* Re: [PATCH mm-unstable v18 11/14] mm/khugepaged: Introduce mTHP collapse support
From: Nico Pache @ 2026-06-05 11:08 UTC
  To: Lorenzo Stoakes
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, liam, mathieu.desnoyers, matthew.brost, mhiramat,
	mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe
In-Reply-To: <CAA1CXcBeg5yFvG89xd49mD=LuSouDsXNCkfMF8Xmdgy7-h522g@mail.gmail.com>

On Fri, Jun 5, 2026 at 5:07 AM Nico Pache <npache@redhat.com> wrote:
>
> On Thu, Jun 4, 2026 at 8:45 AM Lorenzo Stoakes <ljs@kernel.org> wrote:
> >
> > On Fri, May 22, 2026 at 09:00:06AM -0600, Nico Pache wrote:
> > > Enable khugepaged to collapse to mTHP orders. This patch implements the
> > > main scanning logic using a bitmap to track occupied pages and a stack
> > > structure that allows us to find optimal collapse sizes.
> > >
> > > Prior to this patch, PMD collapse had three main phases: a lightweight
> > > scanning phase (mmap_read_lock) that determines a potential PMD
> > > collapse, an alloc phase (mmap unlocked), and finally a heavier collapse
> > > phase (mmap_write_lock).
> > >
> > > To enable mTHP collapse we make the following changes:
> > >
> > > During PMD scan phase, track occupied pages in a bitmap. When mTHP
> > > orders are enabled, we remove the restriction of max_ptes_none during the
> > > scan phase to avoid missing potential mTHP collapse candidates. Once we
> > > have scanned the full PMD range and updated the bitmap to track occupied
> > > pages, we use the bitmap to find the optimal mTHP size.
> > >
> > > Implement mthp_collapse() to perform a binary subdivision of the bitmap
> > > and determine the best eligible order for the collapse. An explicit stack
> > > is used instead of traditional recursion to manage the search, which
> > > avoids recursive calls on the kernel's limited stack. The algorithm
> > > repeatedly splits the bitmap into smaller chunks to
> > > find the highest order mTHPs that satisfy the collapse criteria. We start
> > > by attempting the PMD order, then move on to consecutively lower orders
> > > (mTHP collapse). The stack maintains a pair of variables (offset, order),
> > > indicating the number of PTEs from the start of the PMD, and the order of
> > > the potential collapse candidate.
> > >
> > > The algorithm for consuming the bitmap works as such:
> > >     1) push (0, HPAGE_PMD_ORDER) onto the stack
> > >     2) pop the stack
> > >     3) check if the number of set bits in that (offset,order) pair
> > >        satisfies the max_ptes_none threshold for that order
> > >     4) if yes, attempt collapse
> > >     5) if no (or collapse fails), push two new stack items representing
> > >        the left and right halves of the current bitmap range, at the
> > >        next lower order
> > >     6) repeat at step (2) until stack is empty.
> > >
> > > Below is a diagram representing the algorithm and stack items:
> > >
> > >                             offset   mid_offset
> > >                             |        |
> > >                             |        |
> > >                             v        v
> > >           ____________________________________
> > >          |          PTE Page Table            |
> > >          --------------------------------------
> > >                           <-------><------->
> > >                              order-1  order-1
> > >
> > > mTHP collapses reject regions containing swapped out or shared pages.
> > > This is because adding new entries can lead to new none pages, and these
> > > may lead to constant promotion into a higher order mTHP. A similar
> > > issue can occur with "max_ptes_none > HPAGE_PMD_NR/2": a collapse then
> > > introduces at least 2x the number of pages, which on a future scan
> > > satisfies the promotion condition once again. This issue is prevented via
> > > the collapse_max_ptes_none() function which imposes the max_ptes_none
> > > restrictions above.
> > >
> > > We currently only support mTHP collapse for max_ptes_none values of 0
> > > and HPAGE_PMD_NR - 1, resulting in the following behavior:
> > >
> > >     - max_ptes_none=0: Never introduce new empty pages during collapse
> > >     - max_ptes_none=HPAGE_PMD_NR-1: Always try collapse to the highest
> > >       available mTHP order
> > >
> > > Any other max_ptes_none value will emit a warning and default mTHP
> > > collapse to max_ptes_none=0. There should be no behavior change for PMD
> > > collapse.
> > >
> > > Once we determine which mTHP size fits best in that PMD range, a collapse
> > > is attempted. A minimum collapse order of 2 is used as this is the lowest
> > > order supported by anon memory as defined by THP_ORDERS_ALL_ANON.
> > >
> > > Currently madv_collapse is not supported and will only attempt PMD
> > > collapse.
> > >
> > > We can also remove the check for is_khugepaged inside the PMD scan as
> > > the collapse_max_ptes_none() function handles this logic now.
> > >
> > > Signed-off-by: Nico Pache <npache@redhat.com>
> > > ---
> > >  mm/khugepaged.c | 181 +++++++++++++++++++++++++++++++++++++++++++++---
> > >  1 file changed, 172 insertions(+), 9 deletions(-)
> > >
> > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > index 64ceebc9d8a7..d3d7db8be26c 100644
> > > --- a/mm/khugepaged.c
> > > +++ b/mm/khugepaged.c
> > > @@ -99,6 +99,30 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
> > >
> > >  static struct kmem_cache *mm_slot_cache __ro_after_init;
> > >
> > > +#define KHUGEPAGED_MIN_MTHP_ORDER    2
> > > +/*
> > > + * mthp_collapse() does an iterative DFS over a binary tree, from
> > > + * HPAGE_PMD_ORDER down to KHUGEPAGED_MIN_MTHP_ORDER. The max stack
> > > + * size needed for a DFS on a binary tree is height + 1, where
> > > + * height = HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER.
> > > + *
> > > + * ilog2 is used in place of HPAGE_PMD_ORDER because some architectures
> > > + * (e.g. ppc64le) do not define HPAGE_PMD_ORDER until after build time.
> > > + */
> > > +#define MTHP_STACK_SIZE      (ilog2(MAX_PTRS_PER_PTE) - KHUGEPAGED_MIN_MTHP_ORDER + 1)
> > > +
> > > +/*
> > > + * Defines a range of PTE entries in a PTE page table which are being
> > > + * considered for mTHP collapse.
> > > + *
> > > + * @offset: the offset of the first PTE entry in a PMD range.
> > > + * @order: the order of the PTE entries being considered for collapse.
> > > + */
> > > +struct mthp_range {
> > > +     u16 offset;
> > > +     u8 order;
> > > +};
> > > +
> > >  struct collapse_control {
> > >       bool is_khugepaged;
> > >
> > > @@ -110,6 +134,12 @@ struct collapse_control {
> > >
> > >       /* nodemask for allocation fallback */
> > >       nodemask_t alloc_nmask;
> > > +
> > > +     /* Each bit represents a single occupied (!none/zero) page. */
> > > +     DECLARE_BITMAP(mthp_bitmap, MAX_PTRS_PER_PTE);
> > > +     /* A mask of the current range being considered for mTHP collapse. */
> > > +     DECLARE_BITMAP(mthp_bitmap_mask, MAX_PTRS_PER_PTE);
> > > +     struct mthp_range mthp_bitmap_stack[MTHP_STACK_SIZE];
> > >  };
> > >
> > >  /**
> > > @@ -1411,20 +1441,137 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long s
> > >       return result;
> > >  }
> > >
> > > +static void collapse_mthp_stack_push(struct collapse_control *cc, int *stack_size,
> > > +                                  u16 offset, u8 order)
> > > +{
> > > +     const int size = *stack_size;
> > > +     struct mthp_range *stack = &cc->mthp_bitmap_stack[size];
> > > +
> > > +     VM_WARN_ON_ONCE(size >= MTHP_STACK_SIZE);
> > > +     stack->order = order;
> > > +     stack->offset = offset;
> > > +     (*stack_size)++;
> > > +}
> > > +
> > > +static struct mthp_range collapse_mthp_stack_pop(struct collapse_control *cc,
> > > +                                              int *stack_size)
> > > +{
> > > +     const int size = *stack_size;
> > > +
> > > +     VM_WARN_ON_ONCE(size <= 0);
> > > +     (*stack_size)--;
> > > +     return cc->mthp_bitmap_stack[size - 1];
> > > +}
> > > +
> > > +static unsigned int collapse_mthp_count_present(struct collapse_control *cc,
> > > +                                             u16 offset, unsigned int nr_ptes)
> > > +{
> > > +     bitmap_zero(cc->mthp_bitmap_mask, MAX_PTRS_PER_PTE);
> > > +     bitmap_set(cc->mthp_bitmap_mask, offset, nr_ptes);
> > > +     return bitmap_weight_and(cc->mthp_bitmap, cc->mthp_bitmap_mask, MAX_PTRS_PER_PTE);
> > > +}
> > > +
> > > +/*
> > > + * mthp_collapse() consumes the bitmap that is generated during
> > > + * collapse_scan_pmd() to determine what regions and mTHP orders fit best.
> > > + *
> > > + * Each bit in cc->mthp_bitmap represents a single occupied (!none/zero) page.
> > > + * A stack structure cc->mthp_bitmap_stack is used to check different regions
> > > + * of the bitmap for collapse eligibility. The stack maintains a pair of
> > > + * variables (offset, order), indicating the number of PTEs from the start of
> > > + * the PMD, and the order of the potential collapse candidate respectively. We
> > > + * start at the PMD order and check if it is eligible for collapse; if not, we
> > > + * add two entries to the stack at a lower order to represent the left and right
> > > + * halves of the PTE page table we are examining.
> > > + *
> > > + *                         offset       mid_offset
> > > + *                         |         |
> > > + *                         |         |
> > > + *                         v         v
> > > + *      --------------------------------------
> > > + *      |          cc->mthp_bitmap            |
> > > + *      --------------------------------------
> > > + *                         <-------><------->
> > > + *                          order-1  order-1
> > > + *
> > > + * For each of these, we determine how many PTE entries are occupied in the
> > > + * range of PTE entries we propose to collapse, then we compare this to a
> > > + * threshold number of PTE entries which would need to be occupied for a
> > > + * collapse to be permitted at that order (accounting for max_ptes_none).
> > > + *
> > > + * If a collapse is permitted, we attempt to collapse the PTE range into a
> > > + * mTHP.
> > > + */
> > > +static int mthp_collapse(struct mm_struct *mm, struct vm_area_struct *vma,
> > > +             unsigned long address, int referenced, int unmapped,
> > > +             struct collapse_control *cc, unsigned long enabled_orders)
> > > +{
> > > +     unsigned int nr_occupied_ptes, nr_ptes, max_ptes_none;
> > > +     int collapsed = 0, stack_size = 0;
> > > +     unsigned long collapse_address;
> > > +     struct mthp_range range;
> > > +     u16 offset;
> > > +     u8 order;
> > > +
> > > +     collapse_mthp_stack_push(cc, &stack_size, 0, HPAGE_PMD_ORDER);
> > > +
> > > +     while (stack_size) {
> > > +             range = collapse_mthp_stack_pop(cc, &stack_size);
> > > +             order = range.order;
> > > +             offset = range.offset;
> > > +             nr_ptes = 1UL << order;
> > > +
> > > +             if (!test_bit(order, &enabled_orders))
> > > +                     goto next_order;
> > > +
> > > +             max_ptes_none = collapse_max_ptes_none(cc, vma, order);
> > > +
> > > +             nr_occupied_ptes = collapse_mthp_count_present(cc, offset,
> > > +                                                            nr_ptes);
> > > +
> > > +             if (nr_occupied_ptes >= nr_ptes - max_ptes_none) {
> > > +                     int ret;
> > > +
> > > +                     collapse_address = address + offset * PAGE_SIZE;
> > > +                     ret = collapse_huge_page(mm, collapse_address, referenced,
> > > +                                              unmapped, cc, order);
> > > +                     if (ret == SCAN_SUCCEED) {
> > > +                             collapsed += nr_ptes;
> > > +                             continue;
> > > +                     }
> > > +             }
> > > +
> > > +next_order:
> > > +             if ((BIT(order) - 1) & enabled_orders) {
> > > +                     const u8 next_order = order - 1;
> > > +                     const u16 mid_offset = offset + (nr_ptes / 2);
> > > +
> > > +                     collapse_mthp_stack_push(cc, &stack_size, mid_offset,
> > > +                                              next_order);
> > > +                     collapse_mthp_stack_push(cc, &stack_size, offset,
> > > +                                              next_order);
> > > +             }
> > > +     }
> > > +     return collapsed;
> > > +}
> > > +
> > >  static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> > >               struct vm_area_struct *vma, unsigned long start_addr,
> > >               bool *lock_dropped, struct collapse_control *cc)
> > >  {
> > > -     const unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma, HPAGE_PMD_ORDER);
> > >       const unsigned int max_ptes_shared = collapse_max_ptes_shared(cc, HPAGE_PMD_ORDER);
> > >       const unsigned int max_ptes_swap = collapse_max_ptes_swap(cc, HPAGE_PMD_ORDER);
> > > +     unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma, HPAGE_PMD_ORDER);
> > > +     enum tva_type tva_flags = cc->is_khugepaged ? TVA_KHUGEPAGED : TVA_FORCED_COLLAPSE;
> > >       pmd_t *pmd;
> > > -     pte_t *pte, *_pte;
> > > -     int none_or_zero = 0, shared = 0, referenced = 0;
> > > +     pte_t *pte, *_pte, pteval;
> > > +     int i;
> > > +     int none_or_zero = 0, shared = 0, nr_collapsed = 0, referenced = 0;
> > >       enum scan_result result = SCAN_FAIL;
> > >       struct page *page = NULL;
> > >       struct folio *folio = NULL;
> > >       unsigned long addr;
> > > +     unsigned long enabled_orders;
> > >       spinlock_t *ptl;
> > >       int node = NUMA_NO_NODE, unmapped = 0;
> > >
> > > @@ -1436,8 +1583,19 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> > >               goto out;
> > >       }
> > >
> > > +     bitmap_zero(cc->mthp_bitmap, MAX_PTRS_PER_PTE);
> > >       memset(cc->node_load, 0, sizeof(cc->node_load));
> > >       nodes_clear(cc->alloc_nmask);
> > > +
> > > +     enabled_orders = collapse_allowable_orders(vma, vma->vm_flags, tva_flags);
> > > +
> > > +     /*
> > > +      * If PMD is the only enabled order, enforce max_ptes_none, otherwise
> > > +      * scan all pages to populate the bitmap for mTHP collapse.
> > > +      */
> > > +     if (enabled_orders != BIT(HPAGE_PMD_ORDER))
> > > +             max_ptes_none = KHUGEPAGED_MAX_PTES_LIMIT;
> >
> > Hmm, this is a bit odd, what if the user set max_ptes_none = 0?
>
> We'd still want to scan the full PMD to populate the bitmap. That way
> we can find the smaller orders that contain 0 none/zero PTEs.
>
> >
> > I assume we handle the 0/511 thing elsewhere?
>
> Yes in the bitmap weight check and in collapse_huge_page_isolate()
>
> >
> > > +
> > >       pte = pte_offset_map_lock(mm, pmd, start_addr, &ptl);
> > >       if (!pte) {
> > >               cc->progress++;
> > > @@ -1445,11 +1603,13 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> > >               goto out;
> > >       }
> > >
> > > -     for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
> > > -          _pte++, addr += PAGE_SIZE) {
> > > +     for (i = 0; i < HPAGE_PMD_NR; i++) {
> > > +             _pte = pte + i;
> > > +             addr = start_addr + i * PAGE_SIZE;
> > > +             pteval = ptep_get(_pte);
> > > +
> > >               cc->progress++;
> > >
> > > -             pte_t pteval = ptep_get(_pte);
> > >               if (pte_none_or_zero(pteval)) {
> > >                       if (++none_or_zero > max_ptes_none) {
> > >                               result = SCAN_EXCEED_NONE_PTE;
> > > @@ -1529,6 +1689,8 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> > >                       }
> > >               }
> > >
> > > +             /* Set bit for occupied pages */
> > > +             __set_bit(i, cc->mthp_bitmap);
> > >               /*
> > >                * Record which node the original page is from and save this
> > >                * information to cc->node_load[].
> > > @@ -1587,10 +1749,11 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> > >       if (result == SCAN_SUCCEED) {
> > >               /* collapse_huge_page expects the lock to be dropped before calling */
> > >               mmap_read_unlock(mm);
> > > -             result = collapse_huge_page(mm, start_addr, referenced,
> > > -                                         unmapped, cc, HPAGE_PMD_ORDER);
> > > -             /* collapse_huge_page will return with the mmap_lock released */
> > > +             nr_collapsed = mthp_collapse(mm, vma, start_addr, referenced,
> > > +                                          unmapped, cc, enabled_orders);
> >
> > I guess mthp_collapse() also does PMD collapse if only PMD is enabled?
> >
> > It feels like this name is a bit confusing then :)
> >
> > But I guess we can do a follow up to think of a better name possibly.

Yeah, ideally we can clean that up later!

Thank you so much for reviewing and verifying the new algorithm. The
diff I sent was just a draft; I have already added comments and
cleaned up the code further.

Cheers :)
-- Nico

> >
> > > +             /* mmap_lock was released above, set lock_dropped */
> > >               *lock_dropped = true;
> > > +             result = nr_collapsed ? SCAN_SUCCEED : SCAN_FAIL;
> > >       }
> > >  out:
> > >       trace_mm_khugepaged_scan_pmd(mm, folio, referenced,
> > > --
> > > 2.54.0
> > >
> >
> > Thanks, Lorenzo
> >


^ permalink raw reply

* Re: [PATCH mm-unstable v18 11/14] mm/khugepaged: Introduce mTHP collapse support
From: Nico Pache @ 2026-06-05 11:07 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, liam, mathieu.desnoyers, matthew.brost, mhiramat,
	mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe
In-Reply-To: <aiGOvIR9sfm91cvT@lucifer>

On Thu, Jun 4, 2026 at 8:45 AM Lorenzo Stoakes <ljs@kernel.org> wrote:
>
> On Fri, May 22, 2026 at 09:00:06AM -0600, Nico Pache wrote:
> > Enable khugepaged to collapse to mTHP orders. This patch implements the
> > main scanning logic using a bitmap to track occupied pages and a stack
> > structure that allows us to find optimal collapse sizes.
> >
> > Prior to this patch, PMD collapse had 3 main phases: a lightweight
> > scanning phase (mmap_read_lock) that determines a potential PMD
> > collapse, an alloc phase (mmap unlocked), and finally a heavier collapse
> > phase (mmap_write_lock).
> >
> > To enable mTHP collapse we make the following changes:
> >
> > During PMD scan phase, track occupied pages in a bitmap. When mTHP
> > orders are enabled, we remove the restriction of max_ptes_none during the
> > scan phase to avoid missing potential mTHP collapse candidates. Once we
> > have scanned the full PMD range and updated the bitmap to track occupied
> > pages, we use the bitmap to find the optimal mTHP size.
> >
> > Implement collapse_scan_bitmap() to perform binary recursion on the bitmap
> > and determine the best eligible order for the collapse. A stack structure
> > is used instead of traditional recursion to manage the search, which
> > avoids deep call chains on the limited kernel stack. The algorithm
> > repeatedly splits the bitmap into smaller chunks to find the highest order
> > mTHPs that satisfy the collapse criteria. We start by attempting the PMD
> > order, then move on to consecutively lower orders (mTHP collapse). The
> > stack maintains a pair of variables (offset, order), indicating the number
> > of PTEs from the start of the PMD, and the order of the potential collapse
> > candidate.
> >
> > The algorithm for consuming the bitmap works as such:
> >     1) push (0, HPAGE_PMD_ORDER) onto the stack
> >     2) pop the stack
> >     3) check if the number of set bits in that (offset,order) pair
> >        satisfies the max_ptes_none threshold for that order
> >     4) if yes, attempt collapse
> >     5) if no (or collapse fails), push two new stack items representing
> >        the left and right halves of the current bitmap range, at the
> >        next lower order
> >     6) repeat at step (2) until stack is empty.
> >
> > Below is a diagram representing the algorithm and stack items:
> >
> >                             offset   mid_offset
> >                             |        |
> >                             |        |
> >                             v        v
> >           ____________________________________
> >          |          PTE Page Table            |
> >          --------------------------------------
> >                           <-------><------->
> >                              order-1  order-1
> >
> > mTHP collapses reject regions containing swapped out or shared pages.
> > This is because adding new entries can lead to new none pages, and these
> > may lead to constant promotion into a higher order mTHP. A similar
> > issue can occur with "max_ptes_none > HPAGE_PMD_NR/2" due to a collapse
> > introducing at least 2x the number of pages, and on a future scan will
> > satisfy the promotion condition once again. This issue is prevented via
> > the collapse_max_ptes_none() function which imposes the max_ptes_none
> > restrictions above.
> >
> > We currently only support mTHP collapse for max_ptes_none values of 0
> > and HPAGE_PMD_NR - 1, resulting in the following behavior:
> >
> >     - max_ptes_none=0: Never introduce new empty pages during collapse
> >     - max_ptes_none=HPAGE_PMD_NR-1: Always try collapse to the highest
> >       available mTHP order
> >
> > Any other max_ptes_none value will emit a warning and default mTHP
> > collapse to max_ptes_none=0. There should be no behavior change for PMD
> > collapse.
> >
> > Once we determine which mTHP sizes fit best in that PMD range, a collapse
> > is attempted. A minimum collapse order of 2 is used as this is the lowest
> > order supported by anon memory as defined by THP_ORDERS_ALL_ANON.
> >
> > Currently madv_collapse is not supported and will only attempt PMD
> > collapse.
> >
> > We can also remove the check for is_khugepaged inside the PMD scan as
> > the collapse_max_ptes_none() function handles this logic now.
> >
> > Signed-off-by: Nico Pache <npache@redhat.com>
> > ---
> >  mm/khugepaged.c | 181 +++++++++++++++++++++++++++++++++++++++++++++---
> >  1 file changed, 172 insertions(+), 9 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 64ceebc9d8a7..d3d7db8be26c 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -99,6 +99,30 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
> >
> >  static struct kmem_cache *mm_slot_cache __ro_after_init;
> >
> > +#define KHUGEPAGED_MIN_MTHP_ORDER    2
> > +/*
> > + * mthp_collapse() does an iterative DFS over a binary tree, from
> > + * HPAGE_PMD_ORDER down to KHUGEPAGED_MIN_MTHP_ORDER. The max stack
> > + * size needed for a DFS on a binary tree is height + 1, where
> > + * height = HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER.
> > + *
> > + * ilog2 is used in place of HPAGE_PMD_ORDER because some architectures
> > + * (e.g. ppc64le) do not define HPAGE_PMD_ORDER until after build time.
> > + */
> > +#define MTHP_STACK_SIZE      (ilog2(MAX_PTRS_PER_PTE) - KHUGEPAGED_MIN_MTHP_ORDER + 1)
> > +
> > +/*
> > + * Defines a range of PTE entries in a PTE page table which are being
> > + * considered for mTHP collapse.
> > + *
> > + * @offset: the offset of the first PTE entry in a PMD range.
> > + * @order: the order of the PTE entries being considered for collapse.
> > + */
> > +struct mthp_range {
> > +     u16 offset;
> > +     u8 order;
> > +};
> > +
> >  struct collapse_control {
> >       bool is_khugepaged;
> >
> > @@ -110,6 +134,12 @@ struct collapse_control {
> >
> >       /* nodemask for allocation fallback */
> >       nodemask_t alloc_nmask;
> > +
> > +     /* Each bit represents a single occupied (!none/zero) page. */
> > +     DECLARE_BITMAP(mthp_bitmap, MAX_PTRS_PER_PTE);
> > +     /* A mask of the current range being considered for mTHP collapse. */
> > +     DECLARE_BITMAP(mthp_bitmap_mask, MAX_PTRS_PER_PTE);
> > +     struct mthp_range mthp_bitmap_stack[MTHP_STACK_SIZE];
> >  };
> >
> >  /**
> > @@ -1411,20 +1441,137 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long s
> >       return result;
> >  }
> >
> > +static void collapse_mthp_stack_push(struct collapse_control *cc, int *stack_size,
> > +                                  u16 offset, u8 order)
> > +{
> > +     const int size = *stack_size;
> > +     struct mthp_range *stack = &cc->mthp_bitmap_stack[size];
> > +
> > +     VM_WARN_ON_ONCE(size >= MTHP_STACK_SIZE);
> > +     stack->order = order;
> > +     stack->offset = offset;
> > +     (*stack_size)++;
> > +}
> > +
> > +static struct mthp_range collapse_mthp_stack_pop(struct collapse_control *cc,
> > +                                              int *stack_size)
> > +{
> > +     const int size = *stack_size;
> > +
> > +     VM_WARN_ON_ONCE(size <= 0);
> > +     (*stack_size)--;
> > +     return cc->mthp_bitmap_stack[size - 1];
> > +}
> > +
> > +static unsigned int collapse_mthp_count_present(struct collapse_control *cc,
> > +                                             u16 offset, unsigned int nr_ptes)
> > +{
> > +     bitmap_zero(cc->mthp_bitmap_mask, MAX_PTRS_PER_PTE);
> > +     bitmap_set(cc->mthp_bitmap_mask, offset, nr_ptes);
> > +     return bitmap_weight_and(cc->mthp_bitmap, cc->mthp_bitmap_mask, MAX_PTRS_PER_PTE);
> > +}
> > +
> > +/*
> > + * mthp_collapse() consumes the bitmap that is generated during
> > + * collapse_scan_pmd() to determine what regions and mTHP orders fit best.
> > + *
> > + * Each bit in cc->mthp_bitmap represents a single occupied (!none/zero) page.
> > + * A stack structure cc->mthp_bitmap_stack is used to check different regions
> > + * of the bitmap for collapse eligibility. The stack maintains a pair of
> > + * variables (offset, order), indicating the number of PTEs from the start of
> > + * the PMD, and the order of the potential collapse candidate respectively. We
> > + * start at the PMD order and check if it is eligible for collapse; if not, we
> > + * add two entries to the stack at a lower order to represent the left and right
> > + * halves of the PTE page table we are examining.
> > + *
> > + *                         offset       mid_offset
> > + *                         |         |
> > + *                         |         |
> > + *                         v         v
> > + *      --------------------------------------
> > + *      |          cc->mthp_bitmap            |
> > + *      --------------------------------------
> > + *                         <-------><------->
> > + *                          order-1  order-1
> > + *
> > + * For each of these, we determine how many PTE entries are occupied in the
> > + * range of PTE entries we propose to collapse, then we compare this to a
> > + * threshold number of PTE entries which would need to be occupied for a
> > + * collapse to be permitted at that order (accounting for max_ptes_none).
> > + *
> > + * If a collapse is permitted, we attempt to collapse the PTE range into a
> > + * mTHP.
> > + */
> > +static int mthp_collapse(struct mm_struct *mm, struct vm_area_struct *vma,
> > +             unsigned long address, int referenced, int unmapped,
> > +             struct collapse_control *cc, unsigned long enabled_orders)
> > +{
> > +     unsigned int nr_occupied_ptes, nr_ptes, max_ptes_none;
> > +     int collapsed = 0, stack_size = 0;
> > +     unsigned long collapse_address;
> > +     struct mthp_range range;
> > +     u16 offset;
> > +     u8 order;
> > +
> > +     collapse_mthp_stack_push(cc, &stack_size, 0, HPAGE_PMD_ORDER);
> > +
> > +     while (stack_size) {
> > +             range = collapse_mthp_stack_pop(cc, &stack_size);
> > +             order = range.order;
> > +             offset = range.offset;
> > +             nr_ptes = 1UL << order;
> > +
> > +             if (!test_bit(order, &enabled_orders))
> > +                     goto next_order;
> > +
> > +             max_ptes_none = collapse_max_ptes_none(cc, vma, order);
> > +
> > +             nr_occupied_ptes = collapse_mthp_count_present(cc, offset,
> > +                                                            nr_ptes);
> > +
> > +             if (nr_occupied_ptes >= nr_ptes - max_ptes_none) {
> > +                     int ret;
> > +
> > +                     collapse_address = address + offset * PAGE_SIZE;
> > +                     ret = collapse_huge_page(mm, collapse_address, referenced,
> > +                                              unmapped, cc, order);
> > +                     if (ret == SCAN_SUCCEED) {
> > +                             collapsed += nr_ptes;
> > +                             continue;
> > +                     }
> > +             }
> > +
> > +next_order:
> > +             if ((BIT(order) - 1) & enabled_orders) {
> > +                     const u8 next_order = order - 1;
> > +                     const u16 mid_offset = offset + (nr_ptes / 2);
> > +
> > +                     collapse_mthp_stack_push(cc, &stack_size, mid_offset,
> > +                                              next_order);
> > +                     collapse_mthp_stack_push(cc, &stack_size, offset,
> > +                                              next_order);
> > +             }
> > +     }
> > +     return collapsed;
> > +}
> > +
> >  static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> >               struct vm_area_struct *vma, unsigned long start_addr,
> >               bool *lock_dropped, struct collapse_control *cc)
> >  {
> > -     const unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma, HPAGE_PMD_ORDER);
> >       const unsigned int max_ptes_shared = collapse_max_ptes_shared(cc, HPAGE_PMD_ORDER);
> >       const unsigned int max_ptes_swap = collapse_max_ptes_swap(cc, HPAGE_PMD_ORDER);
> > +     unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma, HPAGE_PMD_ORDER);
> > +     enum tva_type tva_flags = cc->is_khugepaged ? TVA_KHUGEPAGED : TVA_FORCED_COLLAPSE;
> >       pmd_t *pmd;
> > -     pte_t *pte, *_pte;
> > -     int none_or_zero = 0, shared = 0, referenced = 0;
> > +     pte_t *pte, *_pte, pteval;
> > +     int i;
> > +     int none_or_zero = 0, shared = 0, nr_collapsed = 0, referenced = 0;
> >       enum scan_result result = SCAN_FAIL;
> >       struct page *page = NULL;
> >       struct folio *folio = NULL;
> >       unsigned long addr;
> > +     unsigned long enabled_orders;
> >       spinlock_t *ptl;
> >       int node = NUMA_NO_NODE, unmapped = 0;
> >
> > @@ -1436,8 +1583,19 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> >               goto out;
> >       }
> >
> > +     bitmap_zero(cc->mthp_bitmap, MAX_PTRS_PER_PTE);
> >       memset(cc->node_load, 0, sizeof(cc->node_load));
> >       nodes_clear(cc->alloc_nmask);
> > +
> > +     enabled_orders = collapse_allowable_orders(vma, vma->vm_flags, tva_flags);
> > +
> > +     /*
> > +      * If PMD is the only enabled order, enforce max_ptes_none, otherwise
> > +      * scan all pages to populate the bitmap for mTHP collapse.
> > +      */
> > +     if (enabled_orders != BIT(HPAGE_PMD_ORDER))
> > +             max_ptes_none = KHUGEPAGED_MAX_PTES_LIMIT;
>
> Hmm, this is a bit odd, what if the user set max_ptes_none = 0?

We'd still want to scan the full PMD to populate the bitmap. That way
we can find the smaller orders that contain 0 none/zero PTEs.
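
To make that concrete, here is a minimal userspace sketch of the walk
(not the kernel code). It assumes 512 PTEs per PMD, models only the
max_ptes_none=0 case, and assumes every accepted range collapses
successfully. The names walk(), push(), count_present() and
struct walk_result are made up for the example; a plain bool array
stands in for cc->mthp_bitmap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed for illustration: 4K pages, 2M PMD, so 512 PTEs per PMD. */
#define PMD_ORDER	9
#define PMD_NR		(1u << PMD_ORDER)
#define MIN_ORDER	2
#define STACK_SIZE	(PMD_ORDER - MIN_ORDER + 1)

struct range {
	uint16_t offset;
	uint8_t order;
};

/* Hypothetical result type, not part of the patch. */
struct walk_result {
	unsigned int collapsed;	/* PTEs covered by accepted ranges */
	int max_depth;		/* deepest stack usage seen */
};

/* Count occupied slots in [offset, offset + nr). */
static unsigned int count_present(const bool *occupied, unsigned int offset,
				  unsigned int nr)
{
	unsigned int i, n = 0;

	for (i = 0; i < nr; i++)
		n += occupied[offset + i];
	return n;
}

static void push(struct range *stack, int *size, struct walk_result *res,
		 uint16_t offset, uint8_t order)
{
	assert(*size < STACK_SIZE);
	stack[*size].offset = offset;
	stack[*size].order = order;
	(*size)++;
	if (*size > res->max_depth)
		res->max_depth = *size;
}

/*
 * Models max_ptes_none=0 only: a range is accepted when every slot in it
 * is occupied, and an accepted range is assumed to collapse successfully.
 */
static struct walk_result walk(const bool *occupied,
			       unsigned long enabled_orders)
{
	struct range stack[STACK_SIZE];
	struct walk_result res = { 0, 0 };
	int size = 0;

	push(stack, &size, &res, 0, PMD_ORDER);
	while (size) {
		struct range r = stack[--size];
		unsigned int nr = 1u << r.order;

		if ((enabled_orders & (1ul << r.order)) &&
		    count_present(occupied, r.offset, nr) == nr) {
			res.collapsed += nr;
			continue;
		}
		/* Split only if some lower order is still enabled. */
		if (((1ul << r.order) - 1) & enabled_orders) {
			/* Right half first so the left half is popped first. */
			push(stack, &size, &res, r.offset + nr / 2, r.order - 1);
			push(stack, &size, &res, r.offset, r.order - 1);
		}
	}
	return res;
}
```

With only the first 16 slots occupied and orders 2 through 9 enabled
(mask 0x3fc), the full-PMD scan lets the walk accept a single order-4
range at offset 0, which an early SCAN_EXCEED_NONE_PTE bail-out would
have missed.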

>
> I assume we handle the 0/511 thing elsewhere?

Yes, in the bitmap weight check and in collapse_huge_page_isolate().

>
> > +
> >       pte = pte_offset_map_lock(mm, pmd, start_addr, &ptl);
> >       if (!pte) {
> >               cc->progress++;
> > @@ -1445,11 +1603,13 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> >               goto out;
> >       }
> >
> > -     for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
> > -          _pte++, addr += PAGE_SIZE) {
> > +     for (i = 0; i < HPAGE_PMD_NR; i++) {
> > +             _pte = pte + i;
> > +             addr = start_addr + i * PAGE_SIZE;
> > +             pteval = ptep_get(_pte);
> > +
> >               cc->progress++;
> >
> > -             pte_t pteval = ptep_get(_pte);
> >               if (pte_none_or_zero(pteval)) {
> >                       if (++none_or_zero > max_ptes_none) {
> >                               result = SCAN_EXCEED_NONE_PTE;
> > @@ -1529,6 +1689,8 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> >                       }
> >               }
> >
> > +             /* Set bit for occupied pages */
> > +             __set_bit(i, cc->mthp_bitmap);
> >               /*
> >                * Record which node the original page is from and save this
> >                * information to cc->node_load[].
> > @@ -1587,10 +1749,11 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> >       if (result == SCAN_SUCCEED) {
> >               /* collapse_huge_page expects the lock to be dropped before calling */
> >               mmap_read_unlock(mm);
> > -             result = collapse_huge_page(mm, start_addr, referenced,
> > -                                         unmapped, cc, HPAGE_PMD_ORDER);
> > -             /* collapse_huge_page will return with the mmap_lock released */
> > +             nr_collapsed = mthp_collapse(mm, vma, start_addr, referenced,
> > +                                          unmapped, cc, enabled_orders);
>
> I guess mthp_collapse() also does PMD collapse if only PMD is enabled?
>
> It feels like this name is a bit confusing then :)
>
> But I guess we can do a follow up to think of a better name possibly.
>
> > +             /* mmap_lock was released above, set lock_dropped */
> >               *lock_dropped = true;
> > +             result = nr_collapsed ? SCAN_SUCCEED : SCAN_FAIL;
> >       }
> >  out:
> >       trace_mm_khugepaged_scan_pmd(mm, folio, referenced,
> > --
> > 2.54.0
> >
>
> Thanks, Lorenzo
>


^ permalink raw reply

* Re: [PATCH v8 2/6] mm/memory-failure: surface unhandlable kernel pages as -ENOTRECOVERABLE
From: David Hildenbrand (Arm) @ 2026-06-05  9:42 UTC (permalink / raw)
  To: Breno Leitao, Miaohe Lin
  Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest,
	linux-trace-kernel, kernel-team, Lance Yang, Andrew Morton,
	Lorenzo Stoakes, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Shuah Khan, Naoya Horiguchi,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Jonathan Corbet, Shuah Khan, Liam R. Howlett
In-Reply-To: <aiKXrovzrNN-gExm@gmail.com>

On 6/5/26 11:35, Breno Leitao wrote:
> On Wed, Jun 03, 2026 at 10:33:04AM +0800, Miaohe Lin wrote:
>> On 2026/6/2 17:41, David Hildenbrand (Arm) wrote:
>>>
>>> Races are fine. We might miss some pages, but that can happen on races either way.
>>>
>>>
>>> I'd just do something like
>>>
>>> if (PageReserved(page))
>>> 	return true;
>>>
>>> head = compound_head(page);
>>
>> If @head is split just after compound_head. And then @head is freed into buddy and re-allocated as slab
>> page while @page is still in the buddy. We would panic on this scene as @head is PageSlab. But we were
> >> supposed to successfully handle @page. Or am I missing something?
> 
> You're right that it is racy, but I think it is an acceptable race here.
> 

I mean, any such races can currently already happen one way or the other?

Really, the only way to not get races is to tryget the (compound)page,
revalidate that the page is still part of the compound page.

I'm not sure if that's really a good idea.

But my memory is a bit vague in which scenarios we already hold a page reference
here to prevent any concurrent freeing?

-- 
Cheers,

David

^ permalink raw reply

* Re: [PATCH v8 4/6] mm/memory-failure: add panic option for unrecoverable pages
From: Breno Leitao @ 2026-06-05  9:37 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest,
	linux-trace-kernel, kernel-team, Andrew Morton, David Hildenbrand,
	Lorenzo Stoakes, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Shuah Khan, Naoya Horiguchi,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Jonathan Corbet, Shuah Khan, Liam R. Howlett
In-Reply-To: <4d7b720a-7975-8a4d-a00e-e888d63812a0@huawei.com>

On Tue, Jun 02, 2026 at 03:05:32PM +0800, Miaohe Lin wrote:
> On 2026/5/27 22:06, Breno Leitao wrote:
> > Add a sysctl panic_on_unrecoverable_memory_failure (disabled by
> > default) that triggers a kernel panic when memory_failure()
> > encounters pages that cannot be recovered.  This provides a clean
> > crash with useful debug information rather than allowing silent
> > data corruption or a delayed crash at an unrelated code path.
> > 
> > Panic eligibility is intentionally narrow: only MF_MSG_KERNEL with
> > result == MF_IGNORED panics.  After the previous patch, MF_MSG_KERNEL
> > covers PG_reserved pages and the kernel-owned pages promoted from
> > get_hwpoison_page() via -ENOTRECOVERABLE (slab, page tables,
> > large-kmalloc).
> > 
> > All other action types are excluded:
> > 
> > - MF_MSG_GET_HWPOISON and MF_MSG_KERNEL_HIGH_ORDER can be reached by
> >   transient refcount races with the page allocator (an in-flight buddy
> >   allocation has refcount 0 and is no longer on the buddy free list,
> >   briefly), and panicking on them would risk killing the box for what
> >   is actually a recoverable userspace page.
> > 
> > - MF_MSG_UNKNOWN means identify_page_state() could not classify the
> >   page; that is precisely the wrong basis for a panic decision.
> > 
> > Signed-off-by: Breno Leitao <leitao@debian.org>
> > ---
> >  mm/memory-failure.c | 23 +++++++++++++++++++++++
> >  1 file changed, 23 insertions(+)
> > 
> > diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> > index 14c0a958638c..dcd53dbc6aec 100644
> > --- a/mm/memory-failure.c
> > +++ b/mm/memory-failure.c
> > @@ -74,6 +74,8 @@ static int sysctl_memory_failure_recovery __read_mostly = 1;
> >  
> >  static int sysctl_enable_soft_offline __read_mostly = 1;
> >  
> > +static int sysctl_panic_on_unrecoverable_mf __read_mostly;
> > +
> >  atomic_long_t num_poisoned_pages __read_mostly = ATOMIC_LONG_INIT(0);
> >  
> >  static bool hw_memory_failure __read_mostly = false;
> > @@ -155,6 +157,15 @@ static const struct ctl_table memory_failure_table[] = {
> >  		.proc_handler	= proc_dointvec_minmax,
> >  		.extra1		= SYSCTL_ZERO,
> >  		.extra2		= SYSCTL_ONE,
> > +	},
> > +	{
> > +		.procname	= "panic_on_unrecoverable_memory_failure",
> > +		.data		= &sysctl_panic_on_unrecoverable_mf,
> > +		.maxlen		= sizeof(sysctl_panic_on_unrecoverable_mf),
> > +		.mode		= 0644,
> > +		.proc_handler	= proc_dointvec_minmax,
> > +		.extra1		= SYSCTL_ZERO,
> > +		.extra2		= SYSCTL_ONE,
> >  	}
> >  };
> >  
> > @@ -1255,6 +1266,15 @@ static void update_per_node_mf_stats(unsigned long pfn,
> >  	++mf_stats->total;
> >  }
> >  
> > +static bool panic_on_unrecoverable_mf(enum mf_action_page_type type,
> > +				      enum mf_result result)
> > +{
> > +	if (!sysctl_panic_on_unrecoverable_mf || result != MF_IGNORED)
> > +		return false;
> > +
> > +	return type == MF_MSG_KERNEL;
> 
> Would it be more straightforward to write as something like:
> 
> if (!sysctl_panic_on_unrecoverable_mf)
> 	return false;
> 
> return (type == MF_MSG_KERNEL && result == MF_IGNORED);

Sure, that reads better.  I'll fold the MF_IGNORED check into the return for
the next revision. 


        static bool panic_on_unrecoverable_mf(enum mf_action_page_type type,
                                              enum mf_result result)
        {
                if (!sysctl_panic_on_unrecoverable_mf)
                        return false;

                return type == MF_MSG_KERNEL && result == MF_IGNORED;
        }
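
As a quick sanity check that the refactor is purely cosmetic, here is a
small userspace sketch comparing the two forms over every (type, result)
combination. The enums are trimmed stand-ins for the kernel ones (only a
few values, listed for illustration), and panic_v8(), panic_next() and
forms_agree() are hypothetical names for this example only.

```c
#include <assert.h>
#include <stdbool.h>

/* Trimmed stand-ins for the kernel enums, for illustration only. */
enum mf_result { MF_IGNORED, MF_FAILED, MF_DELAYED, MF_RECOVERED, MF_RESULT_NR };
enum mf_action_page_type {
	MF_MSG_KERNEL,
	MF_MSG_KERNEL_HIGH_ORDER,
	MF_MSG_GET_HWPOISON,
	MF_MSG_UNKNOWN,
	MF_MSG_NR,
};

static int sysctl_panic_on_unrecoverable_mf;

/* Form posted in v8. */
static bool panic_v8(enum mf_action_page_type type, enum mf_result result)
{
	if (!sysctl_panic_on_unrecoverable_mf || result != MF_IGNORED)
		return false;

	return type == MF_MSG_KERNEL;
}

/* Form planned for the next revision. */
static bool panic_next(enum mf_action_page_type type, enum mf_result result)
{
	if (!sysctl_panic_on_unrecoverable_mf)
		return false;

	return type == MF_MSG_KERNEL && result == MF_IGNORED;
}

/* Returns true if both forms agree for both sysctl settings. */
static bool forms_agree(void)
{
	int saved = sysctl_panic_on_unrecoverable_mf;
	bool same = true;
	int on, t, r;

	for (on = 0; on <= 1; on++) {
		sysctl_panic_on_unrecoverable_mf = on;
		for (t = 0; t < MF_MSG_NR; t++)
			for (r = 0; r < MF_RESULT_NR; r++)
				if (panic_v8(t, r) != panic_next(t, r))
					same = false;
	}
	sysctl_panic_on_unrecoverable_mf = saved;
	return same;
}
```

With the sysctl enabled, only the (MF_MSG_KERNEL, MF_IGNORED) pair
returns true in either form.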

^ permalink raw reply

* Re: [PATCH v8 2/6] mm/memory-failure: surface unhandlable kernel pages as -ENOTRECOVERABLE
From: Breno Leitao @ 2026-06-05  9:35 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: David Hildenbrand (Arm), linux-mm, linux-kernel, linux-doc,
	linux-kselftest, linux-trace-kernel, kernel-team, Lance Yang,
	Andrew Morton, Lorenzo Stoakes, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Shuah Khan, Naoya Horiguchi,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Jonathan Corbet, Shuah Khan, Liam R. Howlett
In-Reply-To: <4b27467e-935f-5587-2f48-5a794c30a592@huawei.com>

On Wed, Jun 03, 2026 at 10:33:04AM +0800, Miaohe Lin wrote:
> On 2026/6/2 17:41, David Hildenbrand (Arm) wrote:
> > On 6/2/26 05:08, Miaohe Lin wrote:
> >> On 2026/6/1 21:22, David Hildenbrand (Arm) wrote:
> >>> On 6/1/26 14:28, Miaohe Lin wrote:
> >>>>
> >>>> Thanks for your patch.
> >>>>
> >>>>
> >>>> Once shake_page finds a lightweight range-based way to shrink slab, slab pages could be freed
> >>>> into buddy and above PageSlab test should be removed then. Maybe add a TODO or XXX here?
> >>>>
> >>>>
> >>>> I'm not sure but is it safe or a common way to test PageReserved, PageSlab,
> >>>> PageTable and PageLargeKmalloc without extra page refcnt?
> >>>
> >>> Checking typed pages in a racy fashion is fine (PageSlab, PageTable,
> >>> PageLargeKmalloc).
> >>
> >> Got it. Thanks.
> >>
> >>> Checking PageReserved in a racy fashion is fine as well. TESTPAGEFLAG() will
> >>> allow checking it on compound pages.
> >>
> >> It seems PageReserved is not intended to be set on compound pages. I see there are PF_NO_COMPOUND
> >> in its definition: PAGEFLAG(Reserved, reserved, PF_NO_COMPOUND).
> >>
> >>>
> >>> For PageLargeKmalloc, we would want to check the head page, though. The page
> >>> type is only stored for the head page.
> >>
> >> Maybe we should check the head page for PageSlab and PageTable too? alloc_slab_page only
> >> set PageSlab on the head page and __pagetable_ctor uses __folio_set_pgtable to set PageTable
> >> on folio.
> >>
> >>>
> >>> So maybe we want to lookup the compound head (if any) and perform the type
> >>> checks against that?
> >>
> >> Maybe we should or we might miss some pages that could have been handled. And
> >> if compound head is required, should we hold an extra page refcnt to guard against
> >> possible folio split race?
> > 
> > Races are fine. We might miss some pages, but that can happen on races either way.
> > 
> > 
> > I'd just do something like
> > 
> > if (PageReserved(page))
> > 	return true;
> > 
> > head = compound_head(page);
> 
> If @head is split just after compound_head. And then @head is freed into buddy and re-allocated as slab
> page while @page is still in the buddy. We would panic on this scene as @head is PageSlab. But we were
> > supposed to successfully handle @page. Or am I missing something?

You're right that it is racy, but I think it is an acceptable race here.

For it to happen, the poisoned @page has to be a tail of a live compound page
at the time of the fault, and then -- in the few instructions between
compound_head() and the PageSlab(head) test -- that compound page has to be
split, the old head freed to buddy, and that head re-allocated as a slab page,
all while @page lands back in the buddy.  It cannot happen without concurrent
split/free/alloc activity in that exact window.

It is also worth noting the page in question genuinely took an unrecoverable ECC
error, and panic_on_unrecoverable_memory_failure is opt-in -- an operator who
enables it has explicitly chosen to crash rather than risk running on corrupted
memory.  Mis-attributing one such rare, genuinely-poisoned page as
unrecoverable is within that contract.

Thanks for the review and discussions,
--breno

^ permalink raw reply

* Re: [PATCH v7 20/42] KVM: SEV: Make 'uaddr' parameter optional for KVM_SEV_SNP_LAUNCH_UPDATE
From: Suzuki K Poulose @ 2026-06-05  9:06 UTC (permalink / raw)
  To: Michael Roth
  Cc: ackerleytng, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
	david, ira.weiny, jmattson, jthoughton, oupton, pankaj.gupta,
	qperret, rick.p.edgecombe, rientjes, shivankg, steven.price,
	tabba, willy, wyihan, yan.y.zhao, forkloop, pratyush,
	aneesh.kumar, liam, Paolo Bonzini, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
	Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka, kvm,
	linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <4muegrza5iyyhqx6wevdlssnb6wvlc4m4wmuz5hmd3xikkftc4@3e2lpuq6tjgr>

On 04/06/2026 21:11, Michael Roth wrote:
> On Thu, Jun 04, 2026 at 04:29:19PM +0100, Suzuki K Poulose wrote:
>> On 23/05/2026 01:18, Ackerley Tng via B4 Relay wrote:
>>> From: Michael Roth <michael.roth@amd.com>
>>>
>>> For vm_memory_attributes=1, in-place conversion/population is not
>>> supported, so the initial contents necessarily must need to come
>>> from a separate src address, which is enforced by the current
>>> implementation. However, for vm_memory_attributes=0, it is possible for
>>> guest memory to be initialized directly from userspace by mmap()'ing the
>>> guest_memfd and writing to it while the corresponding GPA ranges are in
>>> a 'shared' state before converting them to the 'private' state expected
>>> by KVM_SEV_SNP_LAUNCH_UPDATE.
>>>
>>> Update the handling/documentation for KVM_SEV_SNP_LAUNCH_UPDATE to allow
>>> for 'uaddr' to be set to NULL when vm_memory_attributes=0, which
>>> SNP_LAUNCH_UPDATE will then use to determine when it should/shouldn't
>>> copy in data from a separate memory location. Continue to enforce
>>> non-NULL for the original vm_memory_attributes=1 case.
>>>
>>> Signed-off-by: Michael Roth <michael.roth@amd.com>
>>> [Added src_page check in error handling path when the firmware command fails]
>>> [Dropped ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES]
>>> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
>>
>>
>>
>>
>>> ---
>>>    Documentation/virt/kvm/x86/amd-memory-encryption.rst | 15 +++++++++++----
>>>    arch/x86/kvm/svm/sev.c                               | 18 +++++++++++++-----
>>>    virt/kvm/kvm_main.c                                  |  1 +
>>>    3 files changed, 25 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/Documentation/virt/kvm/x86/amd-memory-encryption.rst b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
>>> index b2395dd4769de..43085f65b2d85 100644
>>> --- a/Documentation/virt/kvm/x86/amd-memory-encryption.rst
>>> +++ b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
>>> @@ -503,7 +503,8 @@ secrets.
>>>    It is required that the GPA ranges initialized by this command have had the
>>>    KVM_MEMORY_ATTRIBUTE_PRIVATE attribute set in advance. See the documentation
>>> -for KVM_SET_MEMORY_ATTRIBUTES for more details on this aspect.
>>> +for KVM_SET_MEMORY_ATTRIBUTES/KVM_SET_MEMORY_ATTRIBUTES2 for more details on
>>> +this aspect.
>>>    Upon success, this command is not guaranteed to have processed the entire
>>>    range requested. Instead, the ``gfn_start``, ``uaddr``, and ``len`` fields of
>>> @@ -511,9 +512,15 @@ range requested. Instead, the ``gfn_start``, ``uaddr``, and ``len`` fields of
>>>    remaining range that has yet to be processed. The caller should continue
>>>    calling this command until those fields indicate the entire range has been
>>>    processed, e.g. ``len`` is 0, ``gfn_start`` is equal to the last GFN in the
>>> -range plus 1, and ``uaddr`` is the last byte of the userspace-provided source
>>> -buffer address plus 1. In the case where ``type`` is KVM_SEV_SNP_PAGE_TYPE_ZERO,
>>> -``uaddr`` will be ignored completely.
>>> +range plus 1, and ``uaddr`` (if specified) is the last byte of the
>>> +userspace-provided source buffer address plus 1.
>>> +
>>> +In the case where ``type`` is KVM_SEV_SNP_PAGE_TYPE_ZERO, ``uaddr`` will be
>>> +ignored completely. Otherwise, ``uaddr`` is required if
>>> +kvm.vm_memory_attributes=1 and optional if kvm.vm_memory_attributes=0, since
>>> +in the latter case guest memory can be initialized directly from userspace
>>> +prior to converting it to private and passing the GPA range on to this
>>> +interface.
>>
>> Just to confirm, so the sev_gmem_prepare doesn't destroy the contents in the
>> process of making it "private" ? i.e., the contents of a SNP shared
>> page are preserved while transitioning to "SNP Private" (via RMP
>> update).
> 
> sev_gmem_prepare() does sort of destroy contents since it finalizes the
> shared->private conversion which puts the page in an unusable state
> until the guest 'accepts' it as private memory and re-initializes the
> contents.
> 
> But that's run-time, when the guest is doing conversions. The
> documentation here is relating to initialization time when we are
> setting up the initial pre-encrypted/pre-measured guest memory image,
> via SNP_LAUNCH_UPDATE. That path calls into kvm_gmem_populate(), and it
> is then sev_gmem_post_populate() callback that actually finalizes the
> shared->private conversion. The sev_gmem_prepare() hook doesn't get used
> in this flow (kvm_gmem_populate() calls __kvm_gmem_get_pfn() which skips
> preparation).

Thanks, that's the bit I was missing: the populate path skips preparation
by calling __kvm_gmem_get_pfn(). I had assumed that
kvm_arch_gmem_prepare() was called for all PFNs allocated from gmem, and
was wondering how SNP handled this populate case.
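
For reference, the "keep calling until the whole range is processed"
contract in the quoted documentation hunk can be sketched in plain C as
below. This is only an illustration under stated assumptions: the struct,
the mock command and the two-pages-per-call limit are hypothetical
stand-ins, not the real KVM uAPI, and only the three fields named in the
documentation (gfn_start, uaddr, len) are modelled.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the command argument; not the real uAPI struct. */
struct launch_update_args {
	uint64_t gfn_start;	/* first guest frame number still to process */
	uint64_t uaddr;		/* source buffer address, 0 if not specified */
	uint64_t len;		/* bytes still to process */
};

#define SKETCH_PAGE_SIZE		4096ULL
#define SKETCH_MAX_PAGES_PER_CALL	2ULL	/* arbitrary, forces partial progress */

/*
 * Mock of the command: handles a bounded number of pages per call and
 * rewrites the fields to describe the remaining range.
 */
static int mock_launch_update(struct launch_update_args *a)
{
	uint64_t pages = a->len / SKETCH_PAGE_SIZE;
	uint64_t bytes;

	if (pages > SKETCH_MAX_PAGES_PER_CALL)
		pages = SKETCH_MAX_PAGES_PER_CALL;
	bytes = pages * SKETCH_PAGE_SIZE;

	a->gfn_start += pages;
	if (a->uaddr)		/* uaddr only advances if it was specified */
		a->uaddr += bytes;
	a->len -= bytes;
	return 0;
}

/* Caller side: retry until len reaches 0. Returns the call count or -1. */
static int launch_update_all(struct launch_update_args *a)
{
	int calls = 0;

	while (a->len != 0) {
		uint64_t before = a->len;

		if (mock_launch_update(a) != 0)
			return -1;
		if (a->len == before)	/* no progress: bail out, don't spin */
			return -1;
		calls++;
	}
	return calls;
}
```

With a five-page range starting at GFN 0x100, the loop ends with len equal
to 0 and gfn_start equal to 0x105, i.e. the last GFN in the range plus 1,
which is the termination condition the documentation describes.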


Thanks
Suzuki


> 
> -Mike
> 
>>
>> Suzuki
>>
>>

^ permalink raw reply

* Re: [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
From: Lance Yang @ 2026-06-05  8:59 UTC (permalink / raw)
  To: ljs, david, npache
  Cc: lance.yang, linux-doc, linux-kernel, linux-mm, linux-trace-kernel,
	aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
	gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, liam, mathieu.desnoyers, matthew.brost,
	mhiramat, mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe, usama.arif
In-Reply-To: <aiJ90SWqXvwN9dNT@lucifer>


On Fri, Jun 05, 2026 at 09:07:23AM +0100, Lorenzo Stoakes wrote:
>On Fri, Jun 05, 2026 at 09:18:27AM +0200, David Hildenbrand (Arm) wrote:
>> On 6/4/26 19:04, Nico Pache wrote:
>> > On Mon, Jun 1, 2026 at 9:00 AM Nico Pache <npache@redhat.com> wrote:
>> >>
>> >> On Mon, Jun 1, 2026 at 5:14 AM David Hildenbrand (Arm) <david@kernel.org> wrote:
>> >>>
>> >>>
>> >>> Yeah. BTW, I think we'd need a spin_lock_nested(), so @Nico, treat my code as a
>> >>> draft.
>> >>
>> >> Okay, I read the above and did some investigating.
>> >>
>> >> I will try to implement and verify the changes you suggested :)
>> >
>> > I've implemented something slightly different actually and I *think* it's better!
>> >
>> > } else {
>> >        /* this is map_anon_folio_pte_nopf with no mmu update */
>> >         __map_anon_folio_pte_nopf(folio, pte, vma, start_addr,
>> >                       /*uffd_wp=*/ false);
>> >        smp_wmb();
>> >         pmd_populate(mm, pmd, pmd_pgtable(_pmd));
>> >         /*
>> >          * Some architectures (e.g. MIPS) walk the live page table in
>> >          * their implementation. update_mmu_cache_range() must be called
>> >          * with a valid page table hierarchy and the PTE lock held.
>> >          * Acquire it nested inside pmd_ptl when they are distinct locks.
>> >          */
>> >         if (pte_ptl != pmd_ptl)
>> >             spin_lock_nested(pte_ptl, SINGLE_DEPTH_NESTING);
>> >         update_mmu_cache_range(NULL, vma, start_addr, pte, nr_pages);
>> >         if (pte_ptl != pmd_ptl)
>> >             spin_unlock(pte_ptl);
>> >     }
>> > spin_unlock(pmd_ptl);
>> >
>> > The logic here is that when the PMD becomes visible, PTEs are already
>> > populated (no possibility of spurious faults on local CPU)
>> >
>> > the SMP_WMB makes sure of the above
>
>The locks prevent those 'spurious' (really: incorrect) faults anyway so I don't
>think this is necessary.
>
>> >
>> > And the pmd is installed with the pte and pmd lock both held through
>> > the mmu_cache update.
>> >
>> > This follows the conventions used in pmd_install() and clears the
>> > potential for local CPU faults hitting cleared PTE entries.
>>
>> After the pmdp_collapse_flush() we'd be getting CPU faults due to the cleared
>> PMD already? So the case here is rather different.

The issue I was worried about: update_mmu_cache_range() can re-walk
vma->vm_mm while the PTE page table is still not reachable through the
PMD. And, yeah, that assumption is ugly, but it is what it is, and there
may be similar code elsewhere ...

So the ordering we need is "the PMD points to the PTE page table from
_pmd before update_mmu_cache_range()", not "new PTEs before PMD".

Those PTEs are cleared, but we hold the PTL, so nobody else can install
anything there :)

So David's original suggestion looks sufficient to me:

if (pte_ptl != pmd_ptl)
        spin_lock_nested(pte_ptl, SINGLE_DEPTH_NESTING);

pmd_populate();
map_anon_folio_pte_nopf();

if (pte_ptl != pmd_ptl)
        spin_unlock(pte_ptl);
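
To spell out the two properties that matter in the snippet above (the PTE
lock is only taken when it is a different lock from pmd_ptl, and the PMD is
populated before the mapping/MMU-cache step runs), here is a user-space toy
model. Everything in it is a hypothetical stand-in: toy_lock is not a
spinlock, the state flags merely represent pmd_populate() and
map_anon_folio_pte_nopf(), and the model takes pmd_ptl itself whereas the
real code path already holds it.

```c
#include <assert.h>

/* Toy lock that asserts on double acquire or stray release. */
struct toy_lock {
	int held;
};

static void toy_lock_acquire(struct toy_lock *l)
{
	assert(!l->held);	/* taking the same lock twice would deadlock */
	l->held = 1;
}

static void toy_lock_release(struct toy_lock *l)
{
	assert(l->held);
	l->held = 0;
}

/* Flags standing in for page table state. */
struct toy_state {
	int pmd_populated;		/* PMD points at the PTE table */
	int ptes_mapped;		/* PTEs installed */
	int cache_update_saw_pmd;	/* was the PMD valid at cache update? */
};

static void install_pte_table(struct toy_lock *pmd_ptl,
			      struct toy_lock *pte_ptl,
			      struct toy_state *st)
{
	toy_lock_acquire(pmd_ptl);

	/* Only nest the PTE lock when it is a distinct lock. */
	if (pte_ptl != pmd_ptl)
		toy_lock_acquire(pte_ptl);

	/* Stand-in for pmd_populate(): hierarchy becomes walkable first. */
	st->pmd_populated = 1;

	/* Stand-in for map_anon_folio_pte_nopf() incl. its cache update. */
	st->ptes_mapped = 1;
	st->cache_update_saw_pmd = st->pmd_populated;

	if (pte_ptl != pmd_ptl)
		toy_lock_release(pte_ptl);

	toy_lock_release(pmd_ptl);
}
```

Calling install_pte_table() with two distinct locks and with the same lock
passed twice both complete without tripping the double-acquire assertion,
which is the point of the pte_ptl != pmd_ptl check.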

>Yeah conceptually the code above is problematic because you immediately make the
>PTE available right at the point you populate, so taking a PTE lock after that
>is rather shutting the stable door after the horse has bolted.
>
>Doing it this way is not a good idea in any case because we're adding
>complexity, an extra function and an open-coded cache maintenance call for
>really no benefit.
>
>I asked Nico to abstract the anon folio mapping stuff explicitly so we could
>avoid this sort of duplication so let's not roll that back :)
>
>So again, I think going with the original suggestion (with an updated comment)
>is the right thing to do.
>
>
>Anyway, an aside: in practice we can't have page faults here, right? The VMA is:
>
>- Ensured to span at least the PMD range (this isn't immediately obvious in the
>  code)
>- VMA write locked (mmap write lock held)
>
>And we hold the anon_vma lock so no rmap walkers can walk the page tables here
>either.
>
>So I actually wonder, given that, whether we need the PTE PTL at all.

I'd keep it. Cheap, and lets us sleep better at night :P

>But.
>
>At this stage it'll almost certainly be an owned exclusive cache line so it's
>very low cost to do it, and it means we honour the update_mmu_cache_range()
>contract.
>
>And it also makes it clear that we're gating changes on the PTE being
>untouchable so any future stuff that maybe changes some of these rules doesn't
>get caught out.
>
>So probably worth keeping.

Yes!

Cheers, Lance

>>
>> --
>> Cheers,
>>
>> David
>
>Thanks, Lorenzo
>

^ permalink raw reply

* Re: [PATCH v7 20/42] KVM: SEV: Make 'uaddr' parameter optional for KVM_SEV_SNP_LAUNCH_UPDATE
From: Suzuki K Poulose @ 2026-06-05  8:54 UTC (permalink / raw)
  To: Ackerley Tng, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
	david, ira.weiny, jmattson, jthoughton, michael.roth, oupton,
	pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
	steven.price, tabba, willy, wyihan, yan.y.zhao, forkloop,
	pratyush, aneesh.kumar, liam, Paolo Bonzini, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
	Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka
  Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <CAEvNRgF43RBv77RgM67kXRRHDnQw4L5uwQTuvkJHzkHJWB1mag@mail.gmail.com>

On 04/06/2026 20:05, Ackerley Tng wrote:
> Suzuki K Poulose <suzuki.poulose@arm.com> writes:
> 
>>
>> [...snip...]
>>
>>> +In the case where ``type`` is KVM_SEV_SNP_PAGE_TYPE_ZERO, ``uaddr`` will be
>>> +ignored completely. Otherwise, ``uaddr`` is required if
>>> +kvm.vm_memory_attributes=1 and optional if kvm.vm_memory_attributes=0, since
>>> +in the latter case guest memory can be initialized directly from userspace
>>> +prior to converting it to private and passing the GPA range on to this
>>> +interface.
>>
>> Just to confirm, so the sev_gmem_prepare doesn't destroy the contents in
>> the process of making it "private" ? i.e., the contents of a SNP shared
>> page are preserved while transitioning to "SNP Private" (via RMP
>> update).
>>
>> Suzuki
>>
> 
> The following is the guest_memfd perspective, I didn't look at the SNP
> spec:
> 
> Do you mean specifically for KVM_SEV_SNP_PAGE_TYPE_ZERO, or for any
> type?
> 
> guest_memfd has no plans to do any special zeroing based on type.
> 
> guest_memfd decoupled zeroing from preparation a while ago (Michael had
> some patches), so zeroing is supposed to be once during folio ownership
> by guest_memfd, tracked by the uptodate flag, and preparation is tracked
> outside of guest_memfd. So far only SNP does preparation.

I am talking about the SEV SNP conversions (specifically quoted in my 
response), I will follow up on Michael's response.

Suzuki


^ permalink raw reply

* Re: mm/memory-failure tracepoint change breaks userspace rasdaemon
From: David Hildenbrand (Arm) @ 2026-06-05  8:52 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Borislav Petkov, Zhuo, Qiuxu, mchehab+huawei@kernel.org,
	Luck, Tony, akpm@linux-foundation.org, linmiaohe@huawei.com,
	xieyuanbin1@huawei.com, Lai, Yi1, linux-kernel@vger.kernel.org,
	linux-edac@vger.kernel.org, linux-mm@kvack.org,
	linux-trace-kernel@vger.kernel.org, Linus Torvalds
In-Reply-To: <20260603153115.775a2e81@fedora>

On 6/3/26 21:31, Steven Rostedt wrote:
> On Wed, 3 Jun 2026 21:13:30 +0200
> "David Hildenbrand (Arm)" <david@kernel.org> wrote:
> 
>> Thanks, that makes sense!
>>
>> So, would it be fair to say that, in general, what's exposed through
>>
>> 	/sys/kernel/tracing/events/
>>
>> is stable ABI?
> 
> It's only stable if something depends on it. It changes all the time.
> It's only when someone complains about it that it becomes "stable"!

Heh, so we only know that it's stable when we break it ...

Let me figure out how to document that.

-- 
Cheers,

David

^ permalink raw reply

* Re: [PATCH] mm/memory-failure: trace: change memory_failure_event to ras subsystem
From: David Hildenbrand (Arm) @ 2026-06-05  8:51 UTC (permalink / raw)
  To: Xie Yuanbin, qiuxu.zhuo, bp, akpm, rostedt, linmiaohe,
	nao.horiguchi, mhiramat, mchehab+huawei, tony.luck, yi1.lai
  Cc: linux-edac, linux-kernel, linux-mm, linux-trace-kernel, torvalds,
	lilinjie8, liaohua4
In-Reply-To: <20260605081213.154660-1-xieyuanbin1@huawei.com>

On 6/5/26 10:12, Xie Yuanbin wrote:
> For historical version, commit 97f0b1345219 ("tracing: add trace event
> for memory-failure") introduced memory_failure_event in ras subsystem.
> commit 31807483d395 ("mm/memory-failure: remove the selection of RAS")
> changed memory_failure_event to memory_failure subsystem. This breaks
> the backward compatibility, some user programs rely on it.
> 
> Change memory_failure_event to ras subsystem to keep backward
> compatibility.
> 
> Fixes: 31807483d395 ("mm/memory-failure: remove the selection of RAS")
> 
> Reported-by: Yi Lai <yi1.lai@intel.com>
> Reported-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
> Closes: https://lore.kernel.org/linux-mm/CY8PR11MB7134346A3E4BB28ECA28D6E989132@CY8PR11MB7134.namprd11.prod.outlook.com
> Cc: David Hildenbrand <david@kernel.org>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Miaohe Lin <linmiaohe@huawei.com>
> Signed-off-by: Xie Yuanbin <xieyuanbin1@huawei.com>
> ---
>  include/trace/events/memory-failure.h | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/include/trace/events/memory-failure.h b/include/trace/events/memory-failure.h
> index aa57cc8f896b..7a8ee5d1a44e 100644
> --- a/include/trace/events/memory-failure.h
> +++ b/include/trace/events/memory-failure.h
> @@ -1,6 +1,10 @@
>  /* SPDX-License-Identifier: GPL-2.0 */
>  #undef TRACE_SYSTEM
> -#define TRACE_SYSTEM memory_failure
> +/*
> + * For historical versions, memory_failure_event is in ras subsystem,
> + * some user programs depend on it.
> + */
> +#define TRACE_SYSTEM ras
>  #define TRACE_INCLUDE_FILE memory-failure
>  
>  #if !defined(_TRACE_MEMORY_FAILURE_H) || defined(TRACE_HEADER_MULTI_READ)

We should

Cc: <stable@vger.kernel.org>

given that it's in v6.19 and nobody noticed :(

Acked-by: David Hildenbrand (Arm) <david@kernel.org>

Thanks, and fortunately now I learned about possible ABI stability of trace events.
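
As a rough illustration of why the TRACE_SYSTEM rename is visible to user
space: a consumer typically locates an event through its directory under
/sys/kernel/tracing/events/, and the subsystem name is one component of
that path. The helper below is a hypothetical sketch, not code taken from
rasdaemon; it only shows that the two subsystem names yield different
paths for the same event.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build the tracefs directory for an event. Returns 0 on success and -1
 * if the result does not fit in the buffer.
 */
static int event_dir(char *buf, size_t size,
		     const char *system, const char *event)
{
	int n;

	assert(buf && system && event);
	n = snprintf(buf, size, "/sys/kernel/tracing/events/%s/%s",
		     system, event);
	if (n < 0 || (size_t)n >= size)
		return -1;	/* encoding error or truncated output */
	return 0;
}
```

A tool that hardcodes the "ras" system looks under
/sys/kernel/tracing/events/ras/memory_failure_event, so it no longer finds
the event once the directory has moved to memory_failure/.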

-- 
Cheers,

David

^ permalink raw reply

* [PATCH] mm/memory-failure: trace: change memory_failure_event to ras subsystem
From: Xie Yuanbin @ 2026-06-05  8:12 UTC (permalink / raw)
  To: david, qiuxu.zhuo, bp, akpm, rostedt, linmiaohe, nao.horiguchi,
	mhiramat, mchehab+huawei, tony.luck, yi1.lai
  Cc: linux-edac, linux-kernel, linux-mm, linux-trace-kernel, torvalds,
	lilinjie8, liaohua4, Xie Yuanbin

For historical version, commit 97f0b1345219 ("tracing: add trace event
for memory-failure") introduced memory_failure_event in ras subsystem.
commit 31807483d395 ("mm/memory-failure: remove the selection of RAS")
changed memory_failure_event to memory_failure subsystem. This breaks
the backward compatibility, some user programs rely on it.

Change memory_failure_event to ras subsystem to keep backward
compatibility.

Fixes: 31807483d395 ("mm/memory-failure: remove the selection of RAS")

Reported-by: Yi Lai <yi1.lai@intel.com>
Reported-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Closes: https://lore.kernel.org/linux-mm/CY8PR11MB7134346A3E4BB28ECA28D6E989132@CY8PR11MB7134.namprd11.prod.outlook.com
Cc: David Hildenbrand <david@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Xie Yuanbin <xieyuanbin1@huawei.com>
---
 include/trace/events/memory-failure.h | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/include/trace/events/memory-failure.h b/include/trace/events/memory-failure.h
index aa57cc8f896b..7a8ee5d1a44e 100644
--- a/include/trace/events/memory-failure.h
+++ b/include/trace/events/memory-failure.h
@@ -1,6 +1,10 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 #undef TRACE_SYSTEM
-#define TRACE_SYSTEM memory_failure
+/*
+ * For historical versions, memory_failure_event is in ras subsystem,
+ * some user programs depend on it.
+ */
+#define TRACE_SYSTEM ras
 #define TRACE_INCLUDE_FILE memory-failure
 
 #if !defined(_TRACE_MEMORY_FAILURE_H) || defined(TRACE_HEADER_MULTI_READ)
-- 
2.53.0


^ permalink raw reply related

* Re: [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
From: Lorenzo Stoakes @ 2026-06-05  8:07 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Nico Pache, Lance Yang, linux-doc, linux-kernel, linux-mm,
	linux-trace-kernel, aarcange, akpm, anshuman.khandual, apopple,
	baohua, baolin.wang, byungchul, catalin.marinas, cl, corbet,
	dave.hansen, dev.jain, gourry, hannes, hughd, jack, jackmanb,
	jannh, jglisse, joshua.hahnjy, kas, liam, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe, usama.arif
In-Reply-To: <0ef96c28-9e6c-4d04-90ae-ac43c81d465d@kernel.org>

On Fri, Jun 05, 2026 at 09:18:27AM +0200, David Hildenbrand (Arm) wrote:
> On 6/4/26 19:04, Nico Pache wrote:
> > On Mon, Jun 1, 2026 at 9:00 AM Nico Pache <npache@redhat.com> wrote:
> >>
> >> On Mon, Jun 1, 2026 at 5:14 AM David Hildenbrand (Arm) <david@kernel.org> wrote:
> >>>
> >>>
> >>> Yeah. BTW, I think we'd need a spin_lock_nested(), so @Nico, treat my code as a
> >>> draft.
> >>
> >> Okay, I read the above and did some investigating.
> >>
> >> I will try to implement and verify the changes you suggested :)
> >
> > I've implemented something slightly different actually and I *think* it's better!
> >
> > } else {
> >        /* this is map_anon_folio_pte_nopf with no mmu update */
> >         __map_anon_folio_pte_nopf(folio, pte, vma, start_addr,
> >                       /*uffd_wp=*/ false);
> >        smp_wmb();
> >         pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> >         /*
> >          * Some architectures (e.g. MIPS) walk the live page table in
> >          * their implementation. update_mmu_cache_range() must be called
> >          * with a valid page table hierarchy and the PTE lock held.
> >          * Acquire it nested inside pmd_ptl when they are distinct locks.
> >          */
> >         if (pte_ptl != pmd_ptl)
> >             spin_lock_nested(pte_ptl, SINGLE_DEPTH_NESTING);
> >         update_mmu_cache_range(NULL, vma, start_addr, pte, nr_pages);
> >         if (pte_ptl != pmd_ptl)
> >             spin_unlock(pte_ptl);
> >     }
> > spin_unlock(pmd_ptl);
> >
> > The logic here is that when the PMD becomes visible, PTEs are already
> > populated (no possibility of spurious faults on local CPU)
> >
> > the SMP_WMB makes sure of the above

The locks prevent those 'spurious' (really: incorrect) faults anyway so I don't
think this is necessary.

> >
> > And the pmd is installed with the pte and pmd lock both held through
> > the mmu_cache update.
> >
> > This follows the conventions used in pmd_install() and clears the
> > potential for local CPU faults hitting cleared PTE entries.
>
> After the pmdp_collapse_flush() we'd be getting CPU faults due to the cleared
> PMD already? So the case here is rather different.

Yeah conceptually the code above is problematic because you immediately make the
PTE available right at the point you populate, so taking a PTE lock after that
is rather shutting the stable door after the horse has bolted.

Doing it this way is not a good idea in any case because we're adding
complexity, an extra function and an open-coded cache maintenance call for
really no benefit.

I asked Nico to abstract the anon folio mapping stuff explicitly so we could
avoid this sort of duplication so let's not roll that back :)

So again, I think going with the original suggestion (with an updated comment)
is the right thing to do.


Anyway, an aside: in practice we can't have page faults here, right? The VMA is:

- Ensured to span at least the PMD range (this isn't immediately obvious in the
  code)
- VMA write locked (mmap write lock held)

And we hold the anon_vma lock so no rmap walkers can walk the page tables here
either.

So I actually wonder, given that, whether we need the PTE PTL at all.

But.

At this stage it'll almost certainly be an owned exclusive cache line so it's
very low cost to do it, and it means we honour the update_mmu_cache_range()
contract.

And it also makes it clear that we're gating changes on the PTE being
untouchable so any future stuff that maybe changes some of these rules doesn't
get caught out.

So probably worth keeping.

>
> --
> Cheers,
>
> David

Thanks, Lorenzo

^ permalink raw reply

* Re: [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
From: David Hildenbrand (Arm) @ 2026-06-05  7:18 UTC (permalink / raw)
  To: Nico Pache
  Cc: Lance Yang, linux-doc, linux-kernel, linux-mm, linux-trace-kernel,
	aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
	gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, liam, ljs, mathieu.desnoyers, matthew.brost,
	mhiramat, mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe, usama.arif
In-Reply-To: <CAA1CXcBSPVG4CJFCBDvbuodcJ_7eXoDQTpK0ZN0HEhkDPi-DEw@mail.gmail.com>

On 6/4/26 19:04, Nico Pache wrote:
> On Mon, Jun 1, 2026 at 9:00 AM Nico Pache <npache@redhat.com> wrote:
>>
>> On Mon, Jun 1, 2026 at 5:14 AM David Hildenbrand (Arm) <david@kernel.org> wrote:
>>>
>>>
>>> Yeah. BTW, I think we'd need a spin_lock_nested(), so @Nico, treat my code as a
>>> draft.
>>
>> Okay, I read the above and did some investigating.
>>
>> I will try to implement and verify the changes you suggested :)
> 
> I've implemented something slightly different actually and I *think* it's better!
> 
> } else {
>        /* this is map_anon_folio_pte_nopf with no mmu update */
>         __map_anon_folio_pte_nopf(folio, pte, vma, start_addr,
>                       /*uffd_wp=*/ false);
>        smp_wmb();
>         pmd_populate(mm, pmd, pmd_pgtable(_pmd));
>         /*
>          * Some architectures (e.g. MIPS) walk the live page table in
>          * their implementation. update_mmu_cache_range() must be called
>          * with a valid page table hierarchy and the PTE lock held.
>          * Acquire it nested inside pmd_ptl when they are distinct locks.
>          */
>         if (pte_ptl != pmd_ptl)
>             spin_lock_nested(pte_ptl, SINGLE_DEPTH_NESTING);
>         update_mmu_cache_range(NULL, vma, start_addr, pte, nr_pages);
>         if (pte_ptl != pmd_ptl)
>             spin_unlock(pte_ptl);
>     }
> spin_unlock(pmd_ptl);
> 
> The logic here is that when the PMD becomes visible, PTEs are already
> populated (no possibility of spurious faults on local CPU)
> 
> the SMP_WMB makes sure of the above
> 
> And the pmd is installed with the pte and pmd lock both held through
> the mmu_cache update.
> 
> This follows the conventions used in pmd_install() and clears the
> potential for local CPU faults hitting cleared PTE entries.

After the pmdp_collapse_flush() we'd be getting CPU faults due to the cleared
PMD already? So the case here is rather different.

-- 
Cheers,

David

^ permalink raw reply
