* [PATCH v10 0/5] ring-buffer: Making persistent ring buffers robust
From: Masami Hiramatsu (Google) @ 2026-03-17 9:36 UTC (permalink / raw)
To: Steven Rostedt
Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel, Ian Rogers
Hi,
Here is the 10th version of improvement patches for making persistent
ring buffers robust to failures.
The previous version is here:
https://lore.kernel.org/all/177319273059.130641.10882692460536780093.stgit@mhiramat.tok.corp.google.com/
In this version, I added a new patch to skip invalid page in rewinding
process[4/5], add entry_bytes check in the test [5/5] and do not
compile test code when CONFIG_RING_BUFFER_PERSISTENT_SELFTEST=n[5/5].
Thank you,
---
Masami Hiramatsu (Google) (5):
ring-buffer: Fix to update per-subbuf entries of persistent ring buffer
ring-buffer: Flush and stop persistent ring buffer on panic
ring-buffer: Skip invalid sub-buffers when validating persistent ring buffer
ring-buffer: Skip invalid sub-buffers when rewinding persistent ring buffer
ring-buffer: Add persistent ring buffer selftest
arch/alpha/include/asm/Kbuild | 1
arch/arc/include/asm/Kbuild | 1
arch/arm/include/asm/Kbuild | 1
arch/arm64/include/asm/ring_buffer.h | 10 ++
arch/csky/include/asm/Kbuild | 1
arch/hexagon/include/asm/Kbuild | 1
arch/loongarch/include/asm/Kbuild | 1
arch/m68k/include/asm/Kbuild | 1
arch/microblaze/include/asm/Kbuild | 1
arch/mips/include/asm/Kbuild | 1
arch/nios2/include/asm/Kbuild | 1
arch/openrisc/include/asm/Kbuild | 1
arch/parisc/include/asm/Kbuild | 1
arch/powerpc/include/asm/Kbuild | 1
arch/riscv/include/asm/Kbuild | 1
arch/s390/include/asm/Kbuild | 1
arch/sh/include/asm/Kbuild | 1
arch/sparc/include/asm/Kbuild | 1
arch/um/include/asm/Kbuild | 1
arch/x86/include/asm/Kbuild | 1
arch/xtensa/include/asm/Kbuild | 1
include/asm-generic/ring_buffer.h | 13 ++
include/linux/ring_buffer.h | 1
kernel/trace/Kconfig | 15 ++
kernel/trace/ring_buffer.c | 226 ++++++++++++++++++++++++++--------
kernel/trace/trace.c | 4 +
26 files changed, 233 insertions(+), 56 deletions(-)
create mode 100644 arch/arm64/include/asm/ring_buffer.h
create mode 100644 include/asm-generic/ring_buffer.h
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply
* [PATCH v10 1/5] ring-buffer: Fix to update per-subbuf entries of persistent ring buffer
From: Masami Hiramatsu (Google) @ 2026-03-17 9:36 UTC (permalink / raw)
To: Steven Rostedt
Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel, Ian Rogers
In-Reply-To: <177374017536.2358053.12341235939816794384.stgit@mhiramat.tok.corp.google.com>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Since the validation loop in rb_meta_validate_events() updates
the same cpu_buffer->head_page->entries, the other subbuf entries
are not updated.
Fix to use head_page to update the entries field, since it is the
cursor in this loop.
Fixes: 5f3b6e839f3c ("ring-buffer: Validate boot range memory events")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
kernel/trace/ring_buffer.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 96e0d80d492b..d6bebb782efc 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -2024,7 +2024,7 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
entries += ret;
entry_bytes += local_read(&head_page->page->commit);
- local_set(&cpu_buffer->head_page->entries, ret);
+ local_set(&head_page->entries, ret);
if (head_page == cpu_buffer->commit_page)
break;
^ permalink raw reply related
* [PATCH v10 2/5] ring-buffer: Flush and stop persistent ring buffer on panic
From: Masami Hiramatsu (Google) @ 2026-03-17 9:36 UTC (permalink / raw)
To: Steven Rostedt
Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel, Ian Rogers
In-Reply-To: <177374017536.2358053.12341235939816794384.stgit@mhiramat.tok.corp.google.com>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
On real hardware, panic and machine reboot may not flush hardware cache
to memory. This means the persistent ring buffer, which relies on a
coherent state of memory, may not have its events written to the buffer
and they may be lost. Moreover, there may be inconsistency with the
counters which are used for validation of the integrity of the
persistent ring buffer which may cause all data to be discarded.
To avoid this issue, stop recording of the ring buffer on panic and
flush the cache of the ring buffer's memory.
Fixes: e645535a954a ("tracing: Add option to use memmapped memory for trace boot instance")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
Changes in v9:
- Fix typo of & to &&.
- Fix typo of "Generic"
Changes in v6:
- Introduce asm/ring_buffer.h for arch_ring_buffer_flush_range().
- Use flush_cache_vmap() instead of flush_cache_all().
Changes in v5:
- Use ring_buffer_record_off() instead of ring_buffer_record_disable().
- Use flush_cache_all() to ensure flush all cache.
Changes in v3:
- update patch description.
---
arch/alpha/include/asm/Kbuild | 1 +
arch/arc/include/asm/Kbuild | 1 +
arch/arm/include/asm/Kbuild | 1 +
arch/arm64/include/asm/ring_buffer.h | 10 ++++++++++
arch/csky/include/asm/Kbuild | 1 +
arch/hexagon/include/asm/Kbuild | 1 +
arch/loongarch/include/asm/Kbuild | 1 +
arch/m68k/include/asm/Kbuild | 1 +
arch/microblaze/include/asm/Kbuild | 1 +
arch/mips/include/asm/Kbuild | 1 +
arch/nios2/include/asm/Kbuild | 1 +
arch/openrisc/include/asm/Kbuild | 1 +
arch/parisc/include/asm/Kbuild | 1 +
arch/powerpc/include/asm/Kbuild | 1 +
arch/riscv/include/asm/Kbuild | 1 +
arch/s390/include/asm/Kbuild | 1 +
arch/sh/include/asm/Kbuild | 1 +
arch/sparc/include/asm/Kbuild | 1 +
arch/um/include/asm/Kbuild | 1 +
arch/x86/include/asm/Kbuild | 1 +
arch/xtensa/include/asm/Kbuild | 1 +
include/asm-generic/ring_buffer.h | 13 +++++++++++++
kernel/trace/ring_buffer.c | 22 ++++++++++++++++++++++
23 files changed, 65 insertions(+)
create mode 100644 arch/arm64/include/asm/ring_buffer.h
create mode 100644 include/asm-generic/ring_buffer.h
diff --git a/arch/alpha/include/asm/Kbuild b/arch/alpha/include/asm/Kbuild
index 483965c5a4de..b154b4e3dfa8 100644
--- a/arch/alpha/include/asm/Kbuild
+++ b/arch/alpha/include/asm/Kbuild
@@ -5,4 +5,5 @@ generic-y += agp.h
generic-y += asm-offsets.h
generic-y += kvm_para.h
generic-y += mcs_spinlock.h
+generic-y += ring_buffer.h
generic-y += text-patching.h
diff --git a/arch/arc/include/asm/Kbuild b/arch/arc/include/asm/Kbuild
index 4c69522e0328..483caacc6988 100644
--- a/arch/arc/include/asm/Kbuild
+++ b/arch/arc/include/asm/Kbuild
@@ -5,5 +5,6 @@ generic-y += extable.h
generic-y += kvm_para.h
generic-y += mcs_spinlock.h
generic-y += parport.h
+generic-y += ring_buffer.h
generic-y += user.h
generic-y += text-patching.h
diff --git a/arch/arm/include/asm/Kbuild b/arch/arm/include/asm/Kbuild
index 03657ff8fbe3..decad5f2c826 100644
--- a/arch/arm/include/asm/Kbuild
+++ b/arch/arm/include/asm/Kbuild
@@ -3,6 +3,7 @@ generic-y += early_ioremap.h
generic-y += extable.h
generic-y += flat.h
generic-y += parport.h
+generic-y += ring_buffer.h
generated-y += mach-types.h
generated-y += unistd-nr.h
diff --git a/arch/arm64/include/asm/ring_buffer.h b/arch/arm64/include/asm/ring_buffer.h
new file mode 100644
index 000000000000..62316c406888
--- /dev/null
+++ b/arch/arm64/include/asm/ring_buffer.h
@@ -0,0 +1,10 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef _ASM_ARM64_RING_BUFFER_H
+#define _ASM_ARM64_RING_BUFFER_H
+
+#include <asm/cacheflush.h>
+
+/* Flush D-cache on persistent ring buffer */
+#define arch_ring_buffer_flush_range(start, end) dcache_clean_pop(start, end)
+
+#endif /* _ASM_ARM64_RING_BUFFER_H */
diff --git a/arch/csky/include/asm/Kbuild b/arch/csky/include/asm/Kbuild
index 3a5c7f6e5aac..7dca0c6cdc84 100644
--- a/arch/csky/include/asm/Kbuild
+++ b/arch/csky/include/asm/Kbuild
@@ -9,6 +9,7 @@ generic-y += qrwlock.h
generic-y += qrwlock_types.h
generic-y += qspinlock.h
generic-y += parport.h
+generic-y += ring_buffer.h
generic-y += user.h
generic-y += vmlinux.lds.h
generic-y += text-patching.h
diff --git a/arch/hexagon/include/asm/Kbuild b/arch/hexagon/include/asm/Kbuild
index 1efa1e993d4b..0f887d4238ed 100644
--- a/arch/hexagon/include/asm/Kbuild
+++ b/arch/hexagon/include/asm/Kbuild
@@ -5,4 +5,5 @@ generic-y += extable.h
generic-y += iomap.h
generic-y += kvm_para.h
generic-y += mcs_spinlock.h
+generic-y += ring_buffer.h
generic-y += text-patching.h
diff --git a/arch/loongarch/include/asm/Kbuild b/arch/loongarch/include/asm/Kbuild
index 9034b583a88a..7e92957baf6a 100644
--- a/arch/loongarch/include/asm/Kbuild
+++ b/arch/loongarch/include/asm/Kbuild
@@ -10,5 +10,6 @@ generic-y += qrwlock.h
generic-y += user.h
generic-y += ioctl.h
generic-y += mmzone.h
+generic-y += ring_buffer.h
generic-y += statfs.h
generic-y += text-patching.h
diff --git a/arch/m68k/include/asm/Kbuild b/arch/m68k/include/asm/Kbuild
index b282e0dd8dc1..62543bf305ff 100644
--- a/arch/m68k/include/asm/Kbuild
+++ b/arch/m68k/include/asm/Kbuild
@@ -3,5 +3,6 @@ generated-y += syscall_table.h
generic-y += extable.h
generic-y += kvm_para.h
generic-y += mcs_spinlock.h
+generic-y += ring_buffer.h
generic-y += spinlock.h
generic-y += text-patching.h
diff --git a/arch/microblaze/include/asm/Kbuild b/arch/microblaze/include/asm/Kbuild
index 7178f990e8b3..0030309b47ad 100644
--- a/arch/microblaze/include/asm/Kbuild
+++ b/arch/microblaze/include/asm/Kbuild
@@ -5,6 +5,7 @@ generic-y += extable.h
generic-y += kvm_para.h
generic-y += mcs_spinlock.h
generic-y += parport.h
+generic-y += ring_buffer.h
generic-y += syscalls.h
generic-y += tlb.h
generic-y += user.h
diff --git a/arch/mips/include/asm/Kbuild b/arch/mips/include/asm/Kbuild
index 684569b2ecd6..9771c3d85074 100644
--- a/arch/mips/include/asm/Kbuild
+++ b/arch/mips/include/asm/Kbuild
@@ -12,5 +12,6 @@ generic-y += mcs_spinlock.h
generic-y += parport.h
generic-y += qrwlock.h
generic-y += qspinlock.h
+generic-y += ring_buffer.h
generic-y += user.h
generic-y += text-patching.h
diff --git a/arch/nios2/include/asm/Kbuild b/arch/nios2/include/asm/Kbuild
index 28004301c236..0a2530964413 100644
--- a/arch/nios2/include/asm/Kbuild
+++ b/arch/nios2/include/asm/Kbuild
@@ -5,6 +5,7 @@ generic-y += cmpxchg.h
generic-y += extable.h
generic-y += kvm_para.h
generic-y += mcs_spinlock.h
+generic-y += ring_buffer.h
generic-y += spinlock.h
generic-y += user.h
generic-y += text-patching.h
diff --git a/arch/openrisc/include/asm/Kbuild b/arch/openrisc/include/asm/Kbuild
index cef49d60d74c..8aa34621702d 100644
--- a/arch/openrisc/include/asm/Kbuild
+++ b/arch/openrisc/include/asm/Kbuild
@@ -8,4 +8,5 @@ generic-y += spinlock_types.h
generic-y += spinlock.h
generic-y += qrwlock_types.h
generic-y += qrwlock.h
+generic-y += ring_buffer.h
generic-y += user.h
diff --git a/arch/parisc/include/asm/Kbuild b/arch/parisc/include/asm/Kbuild
index 4fb596d94c89..d48d158f7241 100644
--- a/arch/parisc/include/asm/Kbuild
+++ b/arch/parisc/include/asm/Kbuild
@@ -4,4 +4,5 @@ generated-y += syscall_table_64.h
generic-y += agp.h
generic-y += kvm_para.h
generic-y += mcs_spinlock.h
+generic-y += ring_buffer.h
generic-y += user.h
diff --git a/arch/powerpc/include/asm/Kbuild b/arch/powerpc/include/asm/Kbuild
index 2e23533b67e3..805b5aeebb6f 100644
--- a/arch/powerpc/include/asm/Kbuild
+++ b/arch/powerpc/include/asm/Kbuild
@@ -5,4 +5,5 @@ generated-y += syscall_table_spu.h
generic-y += agp.h
generic-y += mcs_spinlock.h
generic-y += qrwlock.h
+generic-y += ring_buffer.h
generic-y += early_ioremap.h
diff --git a/arch/riscv/include/asm/Kbuild b/arch/riscv/include/asm/Kbuild
index bd5fc9403295..7721b63642f4 100644
--- a/arch/riscv/include/asm/Kbuild
+++ b/arch/riscv/include/asm/Kbuild
@@ -14,5 +14,6 @@ generic-y += ticket_spinlock.h
generic-y += qrwlock.h
generic-y += qrwlock_types.h
generic-y += qspinlock.h
+generic-y += ring_buffer.h
generic-y += user.h
generic-y += vmlinux.lds.h
diff --git a/arch/s390/include/asm/Kbuild b/arch/s390/include/asm/Kbuild
index 80bad7de7a04..0c1fc47c3ba0 100644
--- a/arch/s390/include/asm/Kbuild
+++ b/arch/s390/include/asm/Kbuild
@@ -7,3 +7,4 @@ generated-y += unistd_nr.h
generic-y += asm-offsets.h
generic-y += mcs_spinlock.h
generic-y += mmzone.h
+generic-y += ring_buffer.h
diff --git a/arch/sh/include/asm/Kbuild b/arch/sh/include/asm/Kbuild
index 4d3f10ed8275..f0403d3ee8ab 100644
--- a/arch/sh/include/asm/Kbuild
+++ b/arch/sh/include/asm/Kbuild
@@ -3,4 +3,5 @@ generated-y += syscall_table.h
generic-y += kvm_para.h
generic-y += mcs_spinlock.h
generic-y += parport.h
+generic-y += ring_buffer.h
generic-y += text-patching.h
diff --git a/arch/sparc/include/asm/Kbuild b/arch/sparc/include/asm/Kbuild
index 17ee8a273aa6..49c6bb326b75 100644
--- a/arch/sparc/include/asm/Kbuild
+++ b/arch/sparc/include/asm/Kbuild
@@ -4,4 +4,5 @@ generated-y += syscall_table_64.h
generic-y += agp.h
generic-y += kvm_para.h
generic-y += mcs_spinlock.h
+generic-y += ring_buffer.h
generic-y += text-patching.h
diff --git a/arch/um/include/asm/Kbuild b/arch/um/include/asm/Kbuild
index 1b9b82bbe322..2a1629ba8140 100644
--- a/arch/um/include/asm/Kbuild
+++ b/arch/um/include/asm/Kbuild
@@ -17,6 +17,7 @@ generic-y += module.lds.h
generic-y += parport.h
generic-y += percpu.h
generic-y += preempt.h
+generic-y += ring_buffer.h
generic-y += runtime-const.h
generic-y += softirq_stack.h
generic-y += switch_to.h
diff --git a/arch/x86/include/asm/Kbuild b/arch/x86/include/asm/Kbuild
index 4566000e15c4..078fd2c0d69d 100644
--- a/arch/x86/include/asm/Kbuild
+++ b/arch/x86/include/asm/Kbuild
@@ -14,3 +14,4 @@ generic-y += early_ioremap.h
generic-y += fprobe.h
generic-y += mcs_spinlock.h
generic-y += mmzone.h
+generic-y += ring_buffer.h
diff --git a/arch/xtensa/include/asm/Kbuild b/arch/xtensa/include/asm/Kbuild
index 13fe45dea296..e57af619263a 100644
--- a/arch/xtensa/include/asm/Kbuild
+++ b/arch/xtensa/include/asm/Kbuild
@@ -6,5 +6,6 @@ generic-y += mcs_spinlock.h
generic-y += parport.h
generic-y += qrwlock.h
generic-y += qspinlock.h
+generic-y += ring_buffer.h
generic-y += user.h
generic-y += text-patching.h
diff --git a/include/asm-generic/ring_buffer.h b/include/asm-generic/ring_buffer.h
new file mode 100644
index 000000000000..930d96571f23
--- /dev/null
+++ b/include/asm-generic/ring_buffer.h
@@ -0,0 +1,13 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Generic arch dependent ring_buffer macros.
+ */
+#ifndef __ASM_GENERIC_RING_BUFFER_H__
+#define __ASM_GENERIC_RING_BUFFER_H__
+
+#include <linux/cacheflush.h>
+
+/* Flush cache on ring buffer range if needed */
+#define arch_ring_buffer_flush_range(start, end) flush_cache_vmap(start, end)
+
+#endif /* __ASM_GENERIC_RING_BUFFER_H__ */
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index d6bebb782efc..3d2acaf75e79 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -7,6 +7,7 @@
#include <linux/ring_buffer_types.h>
#include <linux/sched/isolation.h>
#include <linux/trace_recursion.h>
+#include <linux/panic_notifier.h>
#include <linux/trace_events.h>
#include <linux/ring_buffer.h>
#include <linux/trace_clock.h>
@@ -31,6 +32,7 @@
#include <linux/oom.h>
#include <linux/mm.h>
+#include <asm/ring_buffer.h>
#include <asm/local64.h>
#include <asm/local.h>
#include <asm/setup.h>
@@ -559,6 +561,7 @@ struct trace_buffer {
unsigned long range_addr_start;
unsigned long range_addr_end;
+ struct notifier_block flush_nb;
struct ring_buffer_meta *meta;
@@ -2520,6 +2523,16 @@ static void rb_free_cpu_buffer(struct ring_buffer_per_cpu *cpu_buffer)
kfree(cpu_buffer);
}
+/* Stop recording on a persistent buffer and flush cache if needed. */
+static int rb_flush_buffer_cb(struct notifier_block *nb, unsigned long event, void *data)
+{
+ struct trace_buffer *buffer = container_of(nb, struct trace_buffer, flush_nb);
+
+ ring_buffer_record_off(buffer);
+ arch_ring_buffer_flush_range(buffer->range_addr_start, buffer->range_addr_end);
+ return NOTIFY_DONE;
+}
+
static struct trace_buffer *alloc_buffer(unsigned long size, unsigned flags,
int order, unsigned long start,
unsigned long end,
@@ -2650,6 +2663,12 @@ static struct trace_buffer *alloc_buffer(unsigned long size, unsigned flags,
mutex_init(&buffer->mutex);
+ /* Persistent ring buffer needs to flush cache before reboot. */
+ if (start && end) {
+ buffer->flush_nb.notifier_call = rb_flush_buffer_cb;
+ atomic_notifier_chain_register(&panic_notifier_list, &buffer->flush_nb);
+ }
+
return_ptr(buffer);
fail_free_buffers:
@@ -2748,6 +2767,9 @@ ring_buffer_free(struct trace_buffer *buffer)
{
int cpu;
+ if (buffer->range_addr_start && buffer->range_addr_end)
+ atomic_notifier_chain_unregister(&panic_notifier_list, &buffer->flush_nb);
+
cpuhp_state_remove_instance(CPUHP_TRACE_RB_PREPARE, &buffer->node);
irq_work_sync(&buffer->irq_work.work);
^ permalink raw reply related
* [PATCH v10 3/5] ring-buffer: Skip invalid sub-buffers when validating persistent ring buffer
From: Masami Hiramatsu (Google) @ 2026-03-17 9:36 UTC (permalink / raw)
To: Steven Rostedt
Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel, Ian Rogers
In-Reply-To: <177374017536.2358053.12341235939816794384.stgit@mhiramat.tok.corp.google.com>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Skip invalid sub-buffers when validating the persistent ring buffer
instead of discarding the entire ring buffer. Only skipped buffers
are invalidated (cleared).
If the cache data in memory fails to be synchronized during a reboot,
the persistent ring buffer may become partially corrupted, but other
sub-buffers may still contain readable event data. Only discard the
subbuffersa that ar found to be corrupted.
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
Changes in v9:
- Add meta->subbuf_size check.
- Fix a typo.
- Handle invalid reader_page case.
Changes in v8:
- Add comment in rb_valudate_buffer()
- Clear the RB_MISSED_* flags in rb_valudate_buffer() instead of
skipping subbuf.
- Remove unused subbuf local variable from rb_cpu_meta_valid().
Changes in v7:
- Combined with Handling RB_MISSED_* flags patch, focus on validation at boot.
- Remove checking subbuffer data when validating metadata, because it should be done
later.
- Do not mark the discarded sub buffer page but just reset it.
Changes in v6:
- Show invalid page detection message once per CPU.
Changes in v5:
- Instead of showing errors for each page, just show the number
of discarded pages at last.
Changes in v3:
- Record missed data event on commit.
---
kernel/trace/ring_buffer.c | 98 ++++++++++++++++++++++++++------------------
1 file changed, 58 insertions(+), 40 deletions(-)
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 3d2acaf75e79..67826021867b 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -370,6 +370,12 @@ static __always_inline unsigned int rb_page_commit(struct buffer_page *bpage)
return local_read(&bpage->page->commit);
}
+/* Size is determined by what has been committed */
+static __always_inline unsigned int rb_page_size(struct buffer_page *bpage)
+{
+ return rb_page_commit(bpage) & ~RB_MISSED_MASK;
+}
+
static void free_buffer_page(struct buffer_page *bpage)
{
/* Range pages are not to be freed */
@@ -1762,7 +1768,6 @@ static bool rb_cpu_meta_valid(struct ring_buffer_cpu_meta *meta, int cpu,
unsigned long *subbuf_mask)
{
int subbuf_size = PAGE_SIZE;
- struct buffer_data_page *subbuf;
unsigned long buffers_start;
unsigned long buffers_end;
int i;
@@ -1770,6 +1775,11 @@ static bool rb_cpu_meta_valid(struct ring_buffer_cpu_meta *meta, int cpu,
if (!subbuf_mask)
return false;
+ if (meta->subbuf_size != PAGE_SIZE) {
+ pr_info("Ring buffer boot meta [%d] invalid subbuf_size\n", cpu);
+ return false;
+ }
+
buffers_start = meta->first_buffer;
buffers_end = meta->first_buffer + (subbuf_size * meta->nr_subbufs);
@@ -1786,11 +1796,12 @@ static bool rb_cpu_meta_valid(struct ring_buffer_cpu_meta *meta, int cpu,
return false;
}
- subbuf = rb_subbufs_from_meta(meta);
-
bitmap_clear(subbuf_mask, 0, meta->nr_subbufs);
- /* Is the meta buffers and the subbufs themselves have correct data? */
+ /*
+ * Ensure the meta::buffers array has correct data. The data in each subbufs
+ * are checked later in rb_meta_validate_events().
+ */
for (i = 0; i < meta->nr_subbufs; i++) {
if (meta->buffers[i] < 0 ||
meta->buffers[i] >= meta->nr_subbufs) {
@@ -1798,18 +1809,12 @@ static bool rb_cpu_meta_valid(struct ring_buffer_cpu_meta *meta, int cpu,
return false;
}
- if ((unsigned)local_read(&subbuf->commit) > subbuf_size) {
- pr_info("Ring buffer boot meta [%d] buffer invalid commit\n", cpu);
- return false;
- }
-
if (test_bit(meta->buffers[i], subbuf_mask)) {
pr_info("Ring buffer boot meta [%d] array has duplicates\n", cpu);
return false;
}
set_bit(meta->buffers[i], subbuf_mask);
- subbuf = (void *)subbuf + subbuf_size;
}
return true;
@@ -1873,13 +1878,22 @@ static int rb_read_data_buffer(struct buffer_data_page *dpage, int tail, int cpu
return events;
}
-static int rb_validate_buffer(struct buffer_data_page *dpage, int cpu)
+static int rb_validate_buffer(struct buffer_data_page *dpage, int cpu,
+ struct ring_buffer_cpu_meta *meta)
{
unsigned long long ts;
+ unsigned long tail;
u64 delta;
- int tail;
- tail = local_read(&dpage->commit);
+ /*
+ * When a sub-buffer is recovered from a read, the commit value may
+ * have RB_MISSED_* bits set, as these bits are reset on reuse.
+ * Even after clearing these bits, a commit value greater than the
+ * subbuf_size is considered invalid.
+ */
+ tail = local_read(&dpage->commit) & ~RB_MISSED_MASK;
+ if (tail > meta->subbuf_size)
+ return -1;
return rb_read_data_buffer(dpage, tail, cpu, &ts, &delta);
}
@@ -1890,6 +1904,7 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
struct buffer_page *head_page, *orig_head;
unsigned long entry_bytes = 0;
unsigned long entries = 0;
+ int discarded = 0;
int ret;
u64 ts;
int i;
@@ -1900,14 +1915,19 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
orig_head = head_page = cpu_buffer->head_page;
/* Do the reader page first */
- ret = rb_validate_buffer(cpu_buffer->reader_page->page, cpu_buffer->cpu);
+ ret = rb_validate_buffer(cpu_buffer->reader_page->page, cpu_buffer->cpu, meta);
if (ret < 0) {
- pr_info("Ring buffer reader page is invalid\n");
- goto invalid;
+ pr_info("Ring buffer meta [%d] invalid reader page detected\n",
+ cpu_buffer->cpu);
+ discarded++;
+ /* Instead of discard whole ring buffer, discard only this sub-buffer. */
+ local_set(&cpu_buffer->reader_page->entries, 0);
+ local_set(&cpu_buffer->reader_page->page->commit, 0);
+ } else {
+ entries += ret;
+ entry_bytes += rb_page_size(cpu_buffer->reader_page);
+ local_set(&cpu_buffer->reader_page->entries, ret);
}
- entries += ret;
- entry_bytes += local_read(&cpu_buffer->reader_page->page->commit);
- local_set(&cpu_buffer->reader_page->entries, ret);
ts = head_page->page->time_stamp;
@@ -1935,7 +1955,7 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
break;
/* Stop rewind if the page is invalid. */
- ret = rb_validate_buffer(head_page->page, cpu_buffer->cpu);
+ ret = rb_validate_buffer(head_page->page, cpu_buffer->cpu, meta);
if (ret < 0)
break;
@@ -2014,21 +2034,24 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
if (head_page == cpu_buffer->reader_page)
continue;
- ret = rb_validate_buffer(head_page->page, cpu_buffer->cpu);
+ ret = rb_validate_buffer(head_page->page, cpu_buffer->cpu, meta);
if (ret < 0) {
- pr_info("Ring buffer meta [%d] invalid buffer page\n",
- cpu_buffer->cpu);
- goto invalid;
- }
-
- /* If the buffer has content, update pages_touched */
- if (ret)
- local_inc(&cpu_buffer->pages_touched);
-
- entries += ret;
- entry_bytes += local_read(&head_page->page->commit);
- local_set(&head_page->entries, ret);
+ if (!discarded)
+ pr_info("Ring buffer meta [%d] invalid buffer page detected\n",
+ cpu_buffer->cpu);
+ discarded++;
+ /* Instead of discard whole ring buffer, discard only this sub-buffer. */
+ local_set(&head_page->entries, 0);
+ local_set(&head_page->page->commit, 0);
+ } else {
+ /* If the buffer has content, update pages_touched */
+ if (ret)
+ local_inc(&cpu_buffer->pages_touched);
+ entries += ret;
+ entry_bytes += rb_page_size(head_page);
+ local_set(&head_page->entries, ret);
+ }
if (head_page == cpu_buffer->commit_page)
break;
}
@@ -2042,7 +2065,8 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
local_set(&cpu_buffer->entries, entries);
local_set(&cpu_buffer->entries_bytes, entry_bytes);
- pr_info("Ring buffer meta [%d] is from previous boot!\n", cpu_buffer->cpu);
+ pr_info("Ring buffer meta [%d] is from previous boot! (%d pages discarded)\n",
+ cpu_buffer->cpu, discarded);
return;
invalid:
@@ -3329,12 +3353,6 @@ rb_iter_head_event(struct ring_buffer_iter *iter)
return NULL;
}
-/* Size is determined by what has been committed */
-static __always_inline unsigned rb_page_size(struct buffer_page *bpage)
-{
- return rb_page_commit(bpage) & ~RB_MISSED_MASK;
-}
-
static __always_inline unsigned
rb_commit_index(struct ring_buffer_per_cpu *cpu_buffer)
{
^ permalink raw reply related
* [PATCH v10 4/5] ring-buffer: Skip invalid sub-buffers when rewinding persistent ring buffer
From: Masami Hiramatsu (Google) @ 2026-03-17 9:36 UTC (permalink / raw)
To: Steven Rostedt
Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel, Ian Rogers
In-Reply-To: <177374017536.2358053.12341235939816794384.stgit@mhiramat.tok.corp.google.com>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Skip invalid sub-buffers when rewinding the persistent ring buffer
instead of stopping the rewinding the ring buffer. The skipped
buffers are cleared.
To ensure the rewinding stops at the unused page, this also clears
buffer_data_page::time_stamp when tracing resets the buffer. This
allows us to identify unused pages and empty pages.
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
Changes in v10:
- Newly added.
---
kernel/trace/ring_buffer.c | 65 ++++++++++++++++++++++++--------------------
1 file changed, 35 insertions(+), 30 deletions(-)
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 67826021867b..c33ff60dea6f 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -363,6 +363,7 @@ struct buffer_page {
static void rb_init_page(struct buffer_data_page *bpage)
{
local_set(&bpage->commit, 0);
+ bpage->time_stamp = 0;
}
static __always_inline unsigned int rb_page_commit(struct buffer_page *bpage)
@@ -1878,12 +1879,14 @@ static int rb_read_data_buffer(struct buffer_data_page *dpage, int tail, int cpu
return events;
}
-static int rb_validate_buffer(struct buffer_data_page *dpage, int cpu,
+static int rb_validate_buffer(struct buffer_page *bpage, int cpu,
struct ring_buffer_cpu_meta *meta)
{
+ struct buffer_data_page *dpage = bpage->page;
unsigned long long ts;
unsigned long tail;
u64 delta;
+ int ret = -1;
/*
* When a sub-buffer is recovered from a read, the commit value may
@@ -1892,9 +1895,17 @@ static int rb_validate_buffer(struct buffer_data_page *dpage, int cpu,
* subbuf_size is considered invalid.
*/
tail = local_read(&dpage->commit) & ~RB_MISSED_MASK;
- if (tail > meta->subbuf_size)
- return -1;
- return rb_read_data_buffer(dpage, tail, cpu, &ts, &delta);
+ if (tail <= meta->subbuf_size)
+ ret = rb_read_data_buffer(dpage, tail, cpu, &ts, &delta);
+
+ if (ret < 0) {
+ local_set(&bpage->entries, 0);
+ local_set(&bpage->page->commit, 0);
+ } else {
+ local_set(&bpage->entries, ret);
+ }
+
+ return ret;
}
/* If the meta data has been validated, now validate the events */
@@ -1915,18 +1926,14 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
orig_head = head_page = cpu_buffer->head_page;
/* Do the reader page first */
- ret = rb_validate_buffer(cpu_buffer->reader_page->page, cpu_buffer->cpu, meta);
+ ret = rb_validate_buffer(cpu_buffer->reader_page, cpu_buffer->cpu, meta);
if (ret < 0) {
pr_info("Ring buffer meta [%d] invalid reader page detected\n",
cpu_buffer->cpu);
discarded++;
- /* Instead of discard whole ring buffer, discard only this sub-buffer. */
- local_set(&cpu_buffer->reader_page->entries, 0);
- local_set(&cpu_buffer->reader_page->page->commit, 0);
} else {
entries += ret;
entry_bytes += rb_page_size(cpu_buffer->reader_page);
- local_set(&cpu_buffer->reader_page->entries, ret);
}
ts = head_page->page->time_stamp;
@@ -1945,26 +1952,28 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
if (head_page == cpu_buffer->tail_page)
break;
- /* Ensure the page has older data than head. */
- if (ts < head_page->page->time_stamp)
- break;
-
- ts = head_page->page->time_stamp;
- /* Ensure the page has correct timestamp and some data. */
- if (!ts || rb_page_commit(head_page) == 0)
+ /* Rewind until unused page (no timestamp, no commit). */
+ if (!head_page->page->time_stamp && rb_page_commit(head_page) == 0)
break;
- /* Stop rewind if the page is invalid. */
- ret = rb_validate_buffer(head_page->page, cpu_buffer->cpu, meta);
- if (ret < 0)
+ /* Ensure the page has older data than the previous verified page. */
+ if (ts < head_page->page->time_stamp)
break;
- /* Recover the number of entries and update stats. */
- local_set(&head_page->entries, ret);
- if (ret)
- local_inc(&cpu_buffer->pages_touched);
- entries += ret;
- entry_bytes += rb_page_commit(head_page);
+ /* Skip if the page is invalid. */
+ ret = rb_validate_buffer(head_page, cpu_buffer->cpu, meta);
+ if (ret < 0) {
+ if (!discarded)
+ pr_info("Ring buffer meta [%d] invalid buffer page detected\n",
+ cpu_buffer->cpu);
+ discarded++;
+ } else {
+ entries += ret;
+ entry_bytes += rb_page_size(head_page);
+ if (ret > 0)
+ local_inc(&cpu_buffer->pages_touched);
+ ts = head_page->page->time_stamp;
+ }
}
if (i)
pr_info("Ring buffer [%d] rewound %d pages\n", cpu_buffer->cpu, i);
@@ -2034,15 +2043,12 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
if (head_page == cpu_buffer->reader_page)
continue;
- ret = rb_validate_buffer(head_page->page, cpu_buffer->cpu, meta);
+ ret = rb_validate_buffer(head_page, cpu_buffer->cpu, meta);
if (ret < 0) {
if (!discarded)
pr_info("Ring buffer meta [%d] invalid buffer page detected\n",
cpu_buffer->cpu);
discarded++;
- /* Instead of discard whole ring buffer, discard only this sub-buffer. */
- local_set(&head_page->entries, 0);
- local_set(&head_page->page->commit, 0);
} else {
/* If the buffer has content, update pages_touched */
if (ret)
@@ -2050,7 +2056,6 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
entries += ret;
entry_bytes += rb_page_size(head_page);
- local_set(&head_page->entries, ret);
}
if (head_page == cpu_buffer->commit_page)
break;
^ permalink raw reply related
* [PATCH v10 5/5] ring-buffer: Add persistent ring buffer selftest
From: Masami Hiramatsu (Google) @ 2026-03-17 9:36 UTC (permalink / raw)
To: Steven Rostedt
Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel, Ian Rogers
In-Reply-To: <177374017536.2358053.12341235939816794384.stgit@mhiramat.tok.corp.google.com>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Add a self-destractive test for the persistent ring buffer. This
will invalidate some sub-buffer pages in the persistent ring buffer
when kernel gets panic, and check whether the number of detected
invalid pages and the total entry_bytes are the same as record
after reboot.
This can ensure the kernel correctly recover partially corrupted
persistent ring buffer when boot.
The test only runs on the persistent ring buffer whose name is
"ptracingtest". And user has to fill it up with events before
kernel panics.
To run the test, enable CONFIG_RING_BUFFER_PERSISTENT_SELFTEST
and you have to setup the kernel cmdline;
reserve_mem=20M:2M:trace trace_instance=ptracingtest^traceoff@trace
panic=1
And run following commands after the 1st boot;
cd /sys/kernel/tracing/instances/ptracingtest
echo 1 > tracing_on
echo 1 > events/enable
sleep 3
echo c > /proc/sysrq-trigger
After panic message, the kernel will reboot and run the verification
on the persistent ring buffer, e.g.
Ring buffer meta [2] invalid buffer page detected
Ring buffer meta [2] is from previous boot! (318 pages discarded)
Ring buffer testing [2] invalid pages: PASSED (318/318)
Ring buffer testing [2] entry_bytes: PASSED (1300476/1300476)
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
Changes in v10:
- Add entry_bytes test.
- Do not compile test code if CONFIG_RING_BUFFER_PERSISTENT_SELFTEST=n.
Changes in v9:
- Test also reader pages.
---
include/linux/ring_buffer.h | 1 +
kernel/trace/Kconfig | 15 +++++++++
kernel/trace/ring_buffer.c | 69 +++++++++++++++++++++++++++++++++++++++++++
kernel/trace/trace.c | 4 ++
4 files changed, 89 insertions(+)
diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index 994f52b34344..0670742b2d60 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -238,6 +238,7 @@ int ring_buffer_subbuf_size_get(struct trace_buffer *buffer);
enum ring_buffer_flags {
RB_FL_OVERWRITE = 1 << 0,
+ RB_FL_TESTING = 1 << 1,
};
#ifdef CONFIG_RING_BUFFER
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index e130da35808f..094d5511bb17 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -1202,6 +1202,21 @@ config RING_BUFFER_VALIDATE_TIME_DELTAS
Only say Y if you understand what this does, and you
still want it enabled. Otherwise say N
+config RING_BUFFER_PERSISTENT_SELFTEST
+ bool "Enable persistent ring buffer selftest"
+ depends on RING_BUFFER
+ help
+ Run a selftest on the persistent ring buffer which names
+ "ptracingtest" (and its backup) when panic_on_reboot by
+ invalidating ring buffer pages.
+ Note that user has to enable events on the persistent ring
+ buffer manually to fill up ring buffers before rebooting.
+ Since this invalidates the data on test target ring buffer,
+ "ptracingtest" persistent ring buffer must not be used for
+ actual tracing, but only for testing.
+
+ If unsure, say N
+
config MMIOTRACE_TEST
tristate "Test module for mmiotrace"
depends on MMIOTRACE && m
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index c33ff60dea6f..82be6c3b9feb 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -64,6 +64,10 @@ struct ring_buffer_cpu_meta {
unsigned long commit_buffer;
__u32 subbuf_size;
__u32 nr_subbufs;
+#ifdef CONFIG_RING_BUFFER_PERSISTENT_SELFTEST
+ __u32 nr_invalid;
+ __u32 entry_bytes;
+#endif
int buffers[];
};
@@ -2072,6 +2076,19 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
pr_info("Ring buffer meta [%d] is from previous boot! (%d pages discarded)\n",
cpu_buffer->cpu, discarded);
+
+#ifdef CONFIG_RING_BUFFER_PERSISTENT_SELFTEST
+ if (meta->nr_invalid)
+ pr_info("Ring buffer testing [%d] invalid pages: %s (%d/%d)\n",
+ cpu_buffer->cpu,
+ (discarded == meta->nr_invalid) ? "PASSED" : "FAILED",
+ discarded, meta->nr_invalid);
+ if (meta->entry_bytes)
+ pr_info("Ring buffer testing [%d] entry_bytes: %s (%ld/%ld)\n",
+ cpu_buffer->cpu,
+ (entry_bytes == meta->entry_bytes) ? "PASSED" : "FAILED",
+ (long)entry_bytes, (long)meta->entry_bytes);
+#endif
return;
invalid:
@@ -2552,12 +2569,64 @@ static void rb_free_cpu_buffer(struct ring_buffer_per_cpu *cpu_buffer)
kfree(cpu_buffer);
}
+#ifdef CONFIG_RING_BUFFER_PERSISTENT_SELFTEST
+static void rb_test_inject_invalid_pages(struct trace_buffer *buffer)
+{
+ struct ring_buffer_per_cpu *cpu_buffer;
+ struct ring_buffer_cpu_meta *meta;
+ struct buffer_data_page *dpage;
+ u32 entry_bytes = 0;
+ unsigned long ptr;
+ int subbuf_size;
+ int invalid = 0;
+ int cpu;
+ int i;
+
+ if (!(buffer->flags & RB_FL_TESTING))
+ return;
+
+ guard(preempt)();
+ cpu = smp_processor_id();
+
+ cpu_buffer = buffer->buffers[cpu];
+ meta = cpu_buffer->ring_meta;
+ ptr = (unsigned long)rb_subbufs_from_meta(meta);
+ subbuf_size = meta->subbuf_size;
+
+ for (i = 0; i < meta->nr_subbufs; i++) {
+ int idx = meta->buffers[i];
+
+ dpage = (void *)(ptr + idx * subbuf_size);
+ /* Skip unused pages */
+ if (!local_read(&dpage->commit))
+ continue;
+
+ /* Invalidate even pages. */
+ if (!(i & 0x1)) {
+ local_add(subbuf_size + 1, &dpage->commit);
+ invalid++;
+ } else {
+ /* Count total commit bytes. */
+ entry_bytes += local_read(&dpage->commit);
+ }
+ }
+
+ pr_info("Inject invalidated %d pages on CPU%d, total size: %ld\n",
+ invalid, cpu, (long)entry_bytes);
+ meta->nr_invalid = invalid;
+ meta->entry_bytes = entry_bytes;
+}
+#else /* !CONFIG_RING_BUFFER_PERSISTENT_SELFTEST */
+#define rb_test_inject_invalid_pages(buffer) do { } while (0)
+#endif
+
/* Stop recording on a persistent buffer and flush cache if needed. */
static int rb_flush_buffer_cb(struct notifier_block *nb, unsigned long event, void *data)
{
struct trace_buffer *buffer = container_of(nb, struct trace_buffer, flush_nb);
ring_buffer_record_off(buffer);
+ rb_test_inject_invalid_pages(buffer);
arch_ring_buffer_flush_range(buffer->range_addr_start, buffer->range_addr_end);
return NOTIFY_DONE;
}
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 5e1129b011cb..dc23fa63c789 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -9349,6 +9349,8 @@ static void setup_trace_scratch(struct trace_array *tr,
memset(tscratch, 0, size);
}
+#define TRACE_TEST_PTRACING_NAME "ptracingtest"
+
static int
allocate_trace_buffer(struct trace_array *tr, struct array_buffer *buf, unsigned long size)
{
@@ -9361,6 +9363,8 @@ allocate_trace_buffer(struct trace_array *tr, struct array_buffer *buf, unsigned
buf->tr = tr;
if (tr->range_addr_start && tr->range_addr_size) {
+ if (!strcmp(tr->name, TRACE_TEST_PTRACING_NAME))
+ rb_flags |= RB_FL_TESTING;
/* Add scratch buffer to handle 128 modules */
buf->buffer = ring_buffer_alloc_range(size, rb_flags, 0,
tr->range_addr_start,
^ permalink raw reply related
* Re: [PATCH 33/53] ext4: use on-stack dentries in ext4_fc_replay_link_internal()
From: Jan Kara @ 2026-03-17 9:37 UTC (permalink / raw)
To: NeilBrown
Cc: Linus Torvalds, Alexander Viro, Christian Brauner, Jan Kara,
Jeff Layton, Trond Myklebust, Anna Schumaker, Carlos Maiolino,
Miklos Szeredi, Amir Goldstein, Jan Harkes, Hugh Dickins,
Baolin Wang, David Howells, Marc Dionne, Steve French,
Namjae Jeon, Sungjong Seo, Yuezhang Mo, Andreas Hindborg,
Breno Leitao, Theodore Ts'o, Andreas Dilger, Steven Rostedt,
Masami Hiramatsu, Ilya Dryomov, Alex Markuze, Viacheslav Dubeyko,
Tyler Hicks, Andreas Gruenbacher, Richard Weinberger,
Anton Ivanov, Johannes Berg, Jeremy Kerr, Ard Biesheuvel,
linux-fsdevel, linux-nfs, linux-xfs, linux-unionfs, coda,
linux-mm, linux-afs, linux-cifs, linux-ext4, linux-kernel,
linux-trace-kernel, ceph-devel, ecryptfs, gfs2, linux-um,
linux-efi
In-Reply-To: <20260312214330.3885211-34-neilb@ownmail.net>
On Fri 13-03-26 08:12:20, NeilBrown wrote:
> From: NeilBrown <neil@brown.name>
>
> ext4_fc_replay_link_internal() uses two dentries to simply code-reuse
> when replaying a "link" operation. It does not need to interact with
> the dcache and removes the dentries shortly after adding them.
>
> They are passed to __ext4_link() which only performs read accesses on
> these dentries and only uses the name and parent of dentry_inode (plus
> checking a flag is unset) and only uses the inode of the parent.
>
> So instead of allocating dentries and adding them to the dcache, allocat
> two dentries on the stack, set up the required fields, and pass these to
> __ext4_link().
>
> This substantially simplifies the code and removes on of the few uses of
> d_alloc() - preparing for its removal.
>
> Signed-off-by: NeilBrown <neil@brown.name>
Looks good to me. Feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
> ---
> fs/ext4/fast_commit.c | 40 ++++++++--------------------------------
> 1 file changed, 8 insertions(+), 32 deletions(-)
>
> diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
> index 2a5daf1d9667..e3593bb90a62 100644
> --- a/fs/ext4/fast_commit.c
> +++ b/fs/ext4/fast_commit.c
> @@ -1446,8 +1446,6 @@ static int ext4_fc_replay_link_internal(struct super_block *sb,
> struct inode *inode)
> {
> struct inode *dir = NULL;
> - struct dentry *dentry_dir = NULL, *dentry_inode = NULL;
> - struct qstr qstr_dname = QSTR_INIT(darg->dname, darg->dname_len);
> int ret = 0;
>
> dir = ext4_iget(sb, darg->parent_ino, EXT4_IGET_NORMAL);
> @@ -1457,28 +1455,14 @@ static int ext4_fc_replay_link_internal(struct super_block *sb,
> goto out;
> }
>
> - dentry_dir = d_obtain_alias(dir);
> - if (IS_ERR(dentry_dir)) {
> - ext4_debug("Failed to obtain dentry");
> - dentry_dir = NULL;
> - goto out;
> - }
> + {
> + struct dentry dentry_dir = { .d_inode = dir };
> + const struct dentry dentry_inode = {
> + .d_parent = &dentry_dir,
> + .d_name = QSTR_LEN(darg->dname, darg->dname_len),
> + };
>
> - dentry_inode = d_alloc(dentry_dir, &qstr_dname);
> - if (!dentry_inode) {
> - ext4_debug("Inode dentry not created.");
> - ret = -ENOMEM;
> - goto out;
> - }
> -
> - ihold(inode);
> - inc_nlink(inode);
> - ret = __ext4_link(dir, inode, dentry_inode);
> - if (ret) {
> - drop_nlink(inode);
> - iput(inode);
> - } else {
> - d_instantiate(dentry_inode, inode);
> + ret = __ext4_link(dir, inode, &dentry_inode);
> }
> /*
> * It's possible that link already existed since data blocks
> @@ -1493,16 +1477,8 @@ static int ext4_fc_replay_link_internal(struct super_block *sb,
>
> ret = 0;
> out:
> - if (dentry_dir) {
> - d_drop(dentry_dir);
> - dput(dentry_dir);
> - } else if (dir) {
> + if (dir)
> iput(dir);
> - }
> - if (dentry_inode) {
> - d_drop(dentry_inode);
> - dput(dentry_inode);
> - }
>
> return ret;
> }
> --
> 2.50.0.107.gf914562f5916.dirty
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply
* Re: [PATCH 32/53] ext4: move dcache modifying code out of __ext4_link()
From: Jan Kara @ 2026-03-17 10:00 UTC (permalink / raw)
To: NeilBrown
Cc: Linus Torvalds, Alexander Viro, Christian Brauner, Jan Kara,
Jeff Layton, Trond Myklebust, Anna Schumaker, Carlos Maiolino,
Miklos Szeredi, Amir Goldstein, Jan Harkes, Hugh Dickins,
Baolin Wang, David Howells, Marc Dionne, Steve French,
Namjae Jeon, Sungjong Seo, Yuezhang Mo, Andreas Hindborg,
Breno Leitao, Theodore Ts'o, Andreas Dilger, Steven Rostedt,
Masami Hiramatsu, Ilya Dryomov, Alex Markuze, Viacheslav Dubeyko,
Tyler Hicks, Andreas Gruenbacher, Richard Weinberger,
Anton Ivanov, Johannes Berg, Jeremy Kerr, Ard Biesheuvel,
linux-fsdevel, linux-nfs, linux-xfs, linux-unionfs, coda,
linux-mm, linux-afs, linux-cifs, linux-ext4, linux-kernel,
linux-trace-kernel, ceph-devel, ecryptfs, gfs2, linux-um,
linux-efi
In-Reply-To: <20260312214330.3885211-33-neilb@ownmail.net>
On Fri 13-03-26 08:12:19, NeilBrown wrote:
...
> diff --git a/fs/dcache.c b/fs/dcache.c
> index a1219b446b74..c48337d95f9a 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -358,7 +358,7 @@ static inline int dname_external(const struct dentry *dentry)
> return dentry->d_name.name != dentry->d_shortname.string;
> }
>
> -void take_dentry_name_snapshot(struct name_snapshot *name, struct dentry *dentry)
> +void take_dentry_name_snapshot(struct name_snapshot *name, const struct dentry *dentry)
> {
> unsigned seq;
> const unsigned char *s;
The constification of take_dentry_name_snapshot() should probably be a
separate patch? Also I'd note that this constification (and the
constification of __ext4_fc_track_link()) isn't really needed here because
ext4_fc_track_link() will immediately bail through ext4_fc_disabled() when
fast commit replay is happening so __ext4_fc_track_link() never gets called
in that case - more about that below.
> @@ -1471,7 +1471,15 @@ static int ext4_fc_replay_link_internal(struct super_block *sb,
> goto out;
> }
>
> + ihold(inode);
> + inc_nlink(inode);
> ret = __ext4_link(dir, inode, dentry_inode);
> + if (ret) {
> + drop_nlink(inode);
> + iput(inode);
> + } else {
> + d_instantiate(dentry_inode, inode);
> + }
> /*
> * It's possible that link already existed since data blocks
> * for the dir in question got persisted before we crashed OR
...
> @@ -3460,8 +3460,6 @@ int __ext4_link(struct inode *dir, struct inode *inode, struct dentry *dentry)
> ext4_handle_sync(handle);
>
> inode_set_ctime_current(inode);
> - ext4_inc_count(inode);
> - ihold(inode);
>
> err = ext4_add_entry(handle, dentry, inode);
> if (!err) {
> @@ -3471,11 +3469,7 @@ int __ext4_link(struct inode *dir, struct inode *inode, struct dentry *dentry)
> */
> if (inode->i_nlink == 1)
> ext4_orphan_del(handle, inode);
> - d_instantiate(dentry, inode);
> - ext4_fc_track_link(handle, dentry);
> - } else {
> - drop_nlink(inode);
> - iput(inode);
> + __ext4_fc_track_link(handle, inode, dentry);
This looks wrong. If fastcommit replay is running, we must skip calling
__ext4_fc_track_link(). Similarly if the filesystem is currently
inelligible for fastcommit (due to some complex unsupported operations
running in parallel). Why did you change ext4_fc_track_link() to
__ext4_fc_track_link()?
> @@ -3504,7 +3498,16 @@ static int ext4_link(struct dentry *old_dentry,
> err = dquot_initialize(dir);
> if (err)
> return err;
> - return __ext4_link(dir, inode, dentry);
> + ihold(inode);
> + ext4_inc_count(inode);
I'd put inc_nlink() here instead. We are guaranteed to have a regular file
anyway and it matches what we do in ext4_fc_replay_link_internal().
Alternatively we could consistently use ext4_inc_count() &
ext4_dec_count() in these functions.
> + err = __ext4_link(dir, inode, dentry);
> + if (err) {
> + drop_nlink(inode);
> + iput(inode);
> + } else {
> + d_instantiate(dentry, inode);
> + }
> + return err;
> }
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply
* Re: [PATCH 04/15] net: Use trace_invoke_##name() at guarded tracepoint call sites
From: Paolo Abeni @ 2026-03-17 10:34 UTC (permalink / raw)
To: Vineeth Pillai (Google)
Cc: Steven Rostedt, Peter Zijlstra, David S. Miller, Eric Dumazet,
Jakub Kicinski, Simon Horman, Alexei Starovoitov, Daniel Borkmann,
Jesper Dangaard Brouer, John Fastabend, Stanislav Fomichev,
Aaron Conole, Eelco Chaudron, Ilya Maximets,
Marcelo Ricardo Leitner, Xin Long, Jon Maloy, Kuniyuki Iwashima,
Samiullah Khawaja, Hangbin Liu, netdev, linux-kernel, bpf, dev,
linux-sctp, tipc-discussion, linux-trace-kernel
In-Reply-To: <20260312150523.2054552-5-vineeth@bitbyteword.org>
On 3/12/26 4:04 PM, Vineeth Pillai (Google) wrote:
> Replace trace_foo() with the new trace_invoke_foo() at sites already
> guarded by trace_foo_enabled(), avoiding a redundant
> static_branch_unlikely() re-evaluation inside the tracepoint.
> trace_invoke_foo() calls the tracepoint callbacks directly without
> utilizing the static branch again.
>
> Suggested-by: Steven Rostedt <rostedt@goodmis.org>
> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Vineeth Pillai (Google) <vineeth@bitbyteword.org>
> Assisted-by: Claude:claude-sonnet-4-6
Side question: which is the merge plan here? since this patch depends on
1/15 I guess it's via the trace tree, am I correct?
/P
^ permalink raw reply
* Re: [PATCH mm-unstable v15 11/13] mm/khugepaged: avoid unnecessary mTHP collapse attempts
From: Lorenzo Stoakes (Oracle) @ 2026-03-17 10:35 UTC (permalink / raw)
To: Nico Pache
Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
lance.yang, Liam.Howlett, lorenzo.stoakes, mathieu.desnoyers,
matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <20260226032631.234234-1-npache@redhat.com>
On Wed, Feb 25, 2026 at 08:26:31PM -0700, Nico Pache wrote:
> There are cases where, if an attempted collapse fails, all subsequent
> orders are guaranteed to also fail. Avoid these collapse attempts by
> bailing out early.
>
> Signed-off-by: Nico Pache <npache@redhat.com>
With David's concern addressed:
Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
> ---
> mm/khugepaged.c | 35 ++++++++++++++++++++++++++++++++++-
> 1 file changed, 34 insertions(+), 1 deletion(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 1c3711ed4513..388d3f2537e2 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1492,9 +1492,42 @@ static int mthp_collapse(struct mm_struct *mm, unsigned long address,
> ret = collapse_huge_page(mm, collapse_address, referenced,
> unmapped, cc, mmap_locked,
> order);
> - if (ret == SCAN_SUCCEED) {
> +
> + switch (ret) {
> + /* Cases were we continue to next collapse candidate */
> + case SCAN_SUCCEED:
> collapsed += nr_pte_entries;
> + fallthrough;
> + case SCAN_PTE_MAPPED_HUGEPAGE:
> continue;
> + /* Cases were lower orders might still succeed */
> + case SCAN_LACK_REFERENCED_PAGE:
> + case SCAN_EXCEED_NONE_PTE:
> + case SCAN_EXCEED_SWAP_PTE:
> + case SCAN_EXCEED_SHARED_PTE:
> + case SCAN_PAGE_LOCK:
> + case SCAN_PAGE_COUNT:
> + case SCAN_PAGE_LRU:
> + case SCAN_PAGE_NULL:
> + case SCAN_DEL_PAGE_LRU:
> + case SCAN_PTE_NON_PRESENT:
> + case SCAN_PTE_UFFD_WP:
> + case SCAN_ALLOC_HUGE_PAGE_FAIL:
> + goto next_order;
> + /* Cases were no further collapse is possible */
> + case SCAN_CGROUP_CHARGE_FAIL:
> + case SCAN_COPY_MC:
> + case SCAN_ADDRESS_RANGE:
> + case SCAN_NO_PTE_TABLE:
> + case SCAN_ANY_PROCESS:
> + case SCAN_VMA_NULL:
> + case SCAN_VMA_CHECK:
> + case SCAN_SCAN_ABORT:
> + case SCAN_PAGE_ANON:
> + case SCAN_PMD_MAPPED:
> + case SCAN_FAIL:
> + default:
Agree with david, let's spell them out please :)
> + return collapsed;
> }
> }
>
> --
> 2.53.0
>
^ permalink raw reply
* Re: [PATCH mm-unstable v15 12/13] mm/khugepaged: run khugepaged for all orders
From: Lorenzo Stoakes (Oracle) @ 2026-03-17 10:58 UTC (permalink / raw)
To: Nico Pache
Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
lance.yang, Liam.Howlett, lorenzo.stoakes, mathieu.desnoyers,
matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <20260226032650.234386-1-npache@redhat.com>
On Wed, Feb 25, 2026 at 08:26:50PM -0700, Nico Pache wrote:
> From: Baolin Wang <baolin.wang@linux.alibaba.com>
>
> If any order (m)THP is enabled we should allow running khugepaged to
> attempt scanning and collapsing mTHPs. In order for khugepaged to operate
> when only mTHP sizes are specified in sysfs, we must modify the predicate
> function that determines whether it ought to run to do so.
>
> This function is currently called hugepage_pmd_enabled(), this patch
> renames it to hugepage_enabled() and updates the logic to check to
> determine whether any valid orders may exist which would justify
> khugepaged running.
>
> We must also update collapse_allowable_orders() to check all orders if
> the vma is anonymous and the collapse is khugepaged.
>
> After this patch khugepaged mTHP collapse is fully enabled.
>
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Signed-off-by: Nico Pache <npache@redhat.com>
This looks good to me, so:
Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
> ---
> mm/khugepaged.c | 30 ++++++++++++++++++------------
> 1 file changed, 18 insertions(+), 12 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 388d3f2537e2..e8bfcc1d0c9a 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -434,23 +434,23 @@ static inline int collapse_test_exit_or_disable(struct mm_struct *mm)
> mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm);
> }
>
> -static bool hugepage_pmd_enabled(void)
> +static bool hugepage_enabled(void)
> {
> /*
> * We cover the anon, shmem and the file-backed case here; file-backed
> * hugepages, when configured in, are determined by the global control.
> - * Anon pmd-sized hugepages are determined by the pmd-size control.
> + * Anon hugepages are determined by its per-size mTHP control.
Well also PMD right? I mean this terminology sucks because in a sense mTHP
includes PMD... :)
> * Shmem pmd-sized hugepages are also determined by its pmd-size control,
> * except when the global shmem_huge is set to SHMEM_HUGE_DENY.
> */
> if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) &&
> hugepage_global_enabled())
> return true;
> - if (test_bit(PMD_ORDER, &huge_anon_orders_always))
> + if (READ_ONCE(huge_anon_orders_always))
> return true;
> - if (test_bit(PMD_ORDER, &huge_anon_orders_madvise))
> + if (READ_ONCE(huge_anon_orders_madvise))
> return true;
> - if (test_bit(PMD_ORDER, &huge_anon_orders_inherit) &&
> + if (READ_ONCE(huge_anon_orders_inherit) &&
> hugepage_global_enabled())
> return true;
> if (IS_ENABLED(CONFIG_SHMEM) && shmem_hpage_pmd_enabled())
> @@ -521,8 +521,14 @@ static unsigned int collapse_max_ptes_none(unsigned int order)
> static unsigned long collapse_allowable_orders(struct vm_area_struct *vma,
> vm_flags_t vm_flags, bool is_khugepaged)
> {
> + unsigned long orders;
> enum tva_type tva_flags = is_khugepaged ? TVA_KHUGEPAGED : TVA_FORCED_COLLAPSE;
> - unsigned long orders = BIT(HPAGE_PMD_ORDER);
> +
> + /* If khugepaged is scanning an anonymous vma, allow mTHP collapse */
> + if (is_khugepaged && vma_is_anonymous(vma))
> + orders = THP_ORDERS_ALL_ANON;
> + else
> + orders = BIT(HPAGE_PMD_ORDER);
>
> return thp_vma_allowable_orders(vma, vm_flags, tva_flags, orders);
> }
> @@ -531,7 +537,7 @@ void khugepaged_enter_vma(struct vm_area_struct *vma,
> vm_flags_t vm_flags)
> {
> if (!mm_flags_test(MMF_VM_HUGEPAGE, vma->vm_mm) &&
> - hugepage_pmd_enabled()) {
> + hugepage_enabled()) {
> if (collapse_allowable_orders(vma, vm_flags, /*is_khugepaged=*/true))
> __khugepaged_enter(vma->vm_mm);
> }
> @@ -2929,7 +2935,7 @@ static unsigned int collapse_scan_mm_slot(unsigned int pages, enum scan_result *
>
> static int khugepaged_has_work(void)
> {
> - return !list_empty(&khugepaged_scan.mm_head) && hugepage_pmd_enabled();
> + return !list_empty(&khugepaged_scan.mm_head) && hugepage_enabled();
> }
>
> static int khugepaged_wait_event(void)
> @@ -3002,7 +3008,7 @@ static void khugepaged_wait_work(void)
> return;
> }
>
> - if (hugepage_pmd_enabled())
> + if (hugepage_enabled())
> wait_event_freezable(khugepaged_wait, khugepaged_wait_event());
> }
>
> @@ -3033,7 +3039,7 @@ static void set_recommended_min_free_kbytes(void)
> int nr_zones = 0;
> unsigned long recommended_min;
>
> - if (!hugepage_pmd_enabled()) {
> + if (!hugepage_enabled()) {
> calculate_min_free_kbytes();
> goto update_wmarks;
> }
> @@ -3083,7 +3089,7 @@ int start_stop_khugepaged(void)
> int err = 0;
>
> mutex_lock(&khugepaged_mutex);
> - if (hugepage_pmd_enabled()) {
> + if (hugepage_enabled()) {
> if (!khugepaged_thread)
> khugepaged_thread = kthread_run(khugepaged, NULL,
> "khugepaged");
> @@ -3109,7 +3115,7 @@ int start_stop_khugepaged(void)
> void khugepaged_min_free_kbytes_update(void)
> {
> mutex_lock(&khugepaged_mutex);
> - if (hugepage_pmd_enabled() && khugepaged_thread)
> + if (hugepage_enabled() && khugepaged_thread)
> set_recommended_min_free_kbytes();
> mutex_unlock(&khugepaged_mutex);
> }
> --
> 2.53.0
>
^ permalink raw reply
* Re: [PATCH mm-unstable v15 13/13] Documentation: mm: update the admin guide for mTHP collapse
From: Lorenzo Stoakes (Oracle) @ 2026-03-17 11:02 UTC (permalink / raw)
To: Nico Pache
Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
lance.yang, Liam.Howlett, lorenzo.stoakes, mathieu.desnoyers,
matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
willy, yang, ying.huang, ziy, zokeefe, Bagas Sanjaya
In-Reply-To: <20260226032706.234519-1-npache@redhat.com>
On Wed, Feb 25, 2026 at 08:27:06PM -0700, Nico Pache wrote:
> Now that we can collapse to mTHPs lets update the admin guide to
> reflect these changes and provide proper guidance on how to utilize it.
>
> Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
> Signed-off-by: Nico Pache <npache@redhat.com>
LGTM, but maybe we should mention somewhere about mTHP's max_ptes_none
behaviour?
Anyway with that addressed:
Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
> ---
> Documentation/admin-guide/mm/transhuge.rst | 48 +++++++++++++---------
> 1 file changed, 28 insertions(+), 20 deletions(-)
>
> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
> index eebb1f6bbc6c..67836c683e8d 100644
> --- a/Documentation/admin-guide/mm/transhuge.rst
> +++ b/Documentation/admin-guide/mm/transhuge.rst
> @@ -63,7 +63,8 @@ often.
> THP can be enabled system wide or restricted to certain tasks or even
> memory ranges inside task's address space. Unless THP is completely
> disabled, there is ``khugepaged`` daemon that scans memory and
> -collapses sequences of basic pages into PMD-sized huge pages.
> +collapses sequences of basic pages into huge pages of either PMD size
> +or mTHP sizes, if the system is configured to do so.
>
> The THP behaviour is controlled via :ref:`sysfs <thp_sysfs>`
> interface and using madvise(2) and prctl(2) system calls.
> @@ -219,10 +220,10 @@ this behaviour by writing 0 to shrink_underused, and enable it by writing
> echo 0 > /sys/kernel/mm/transparent_hugepage/shrink_underused
> echo 1 > /sys/kernel/mm/transparent_hugepage/shrink_underused
>
> -khugepaged will be automatically started when PMD-sized THP is enabled
> +khugepaged will be automatically started when any THP size is enabled
> (either of the per-size anon control or the top-level control are set
> to "always" or "madvise"), and it'll be automatically shutdown when
> -PMD-sized THP is disabled (when both the per-size anon control and the
> +all THP sizes are disabled (when both the per-size anon control and the
> top-level control are "never")
>
> process THP controls
> @@ -264,11 +265,6 @@ support the following arguments::
> Khugepaged controls
> -------------------
>
> -.. note::
> - khugepaged currently only searches for opportunities to collapse to
> - PMD-sized THP and no attempt is made to collapse to other THP
> - sizes.
> -
> khugepaged runs usually at low frequency so while one may not want to
> invoke defrag algorithms synchronously during the page faults, it
> should be worth invoking defrag at least in khugepaged. However it's
> @@ -296,11 +292,11 @@ allocation failure to throttle the next allocation attempt::
> The khugepaged progress can be seen in the number of pages collapsed (note
> that this counter may not be an exact count of the number of pages
> collapsed, since "collapsed" could mean multiple things: (1) A PTE mapping
> -being replaced by a PMD mapping, or (2) All 4K physical pages replaced by
> -one 2M hugepage. Each may happen independently, or together, depending on
> -the type of memory and the failures that occur. As such, this value should
> -be interpreted roughly as a sign of progress, and counters in /proc/vmstat
> -consulted for more accurate accounting)::
> +being replaced by a PMD mapping, or (2) physical pages replaced by one
> +hugepage of various sizes (PMD-sized or mTHP). Each may happen independently,
> +or together, depending on the type of memory and the failures that occur.
> +As such, this value should be interpreted roughly as a sign of progress,
> +and counters in /proc/vmstat consulted for more accurate accounting)::
>
> /sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed
>
> @@ -308,16 +304,19 @@ for each pass::
>
> /sys/kernel/mm/transparent_hugepage/khugepaged/full_scans
>
> -``max_ptes_none`` specifies how many extra small pages (that are
> -not already mapped) can be allocated when collapsing a group
> -of small pages into one large page::
> +``max_ptes_none`` specifies how many empty (none/zero) pages are allowed
> +when collapsing a group of small pages into one large page::
>
> /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
>
> -A higher value leads to use additional memory for programs.
> -A lower value leads to gain less thp performance. Value of
> -max_ptes_none can waste cpu time very little, you can
> -ignore it.
> +For PMD-sized THP collapse, this directly limits the number of empty pages
> +allowed in the 2MB region. For mTHP collapse, only 0 or (HPAGE_PMD_NR - 1)
> +are supported. Any other value will emit a warning and no mTHP collapse
> +will be attempted.
> +
> +A higher value allows more empty pages, potentially leading to more memory
> +usage but better THP performance. A lower value is more conservative and
> +may result in fewer THP collapses.
>
> ``max_ptes_swap`` specifies how many pages can be brought in from
> swap when collapsing a group of pages into a transparent huge page::
> @@ -337,6 +336,15 @@ that THP is shared. Exceeding the number would block the collapse::
>
> A higher value may increase memory footprint for some workloads.
>
> +.. note::
> + For mTHP collapse, khugepaged does not support collapsing regions that
> + contain shared or swapped out pages, as this could lead to continuous
> + promotion to higher orders. The collapse will fail if any shared or
> + swapped PTEs are encountered during the scan.
> +
> + Currently, madvise_collapse only supports collapsing to PMD-sized THPs
> + and does not attempt mTHP collapses.
> +
> Boot parameters
> ===============
>
> --
> 2.53.0
>
^ permalink raw reply
* Re: [PATCH v6 09/17] lib/bootconfig: validate child node index in xbc_verify_tree()
From: Markus Elfring @ 2026-03-17 11:03 UTC (permalink / raw)
To: Josh Law, linux-trace-kernel, Andrew Morton, Masami Hiramatsu
Cc: LKML, Steven Rostedt
In-Reply-To: <20260315122015.55965-10-objecting@objecting.org>
…
> +++ b/lib/bootconfig.c
> @@ -823,6 +823,10 @@ static int __init xbc_verify_tree(void)
> return xbc_parse_error("No closing brace",
> xbc_node_get_data(xbc_nodes + i));
> }
> + if (xbc_nodes[i].child >= xbc_node_num) {
> + return xbc_parse_error("Broken child node",
> + xbc_node_get_data(xbc_nodes + i));
> + }
> }
…
How do you think about to omit curly brackets for this if branch?
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/coding-style.rst?h=v7.0-rc4#n197
Regards,
Markus
^ permalink raw reply
* [PATCH] module/kallsyms: sort function symbols and use binary search
From: Stanislaw Gruszka @ 2026-03-17 11:04 UTC (permalink / raw)
To: linux-modules, Sami Tolvanen, Luis Chamberlain
Cc: linux-kernel, linux-trace-kernel, live-patching, Petr Pavlu,
Daniel Gomez, Aaron Tomlin, Steven Rostedt, Masami Hiramatsu,
Jordan Rome, Viktor Malik
Module symbol lookup via find_kallsyms_symbol() performs a linear scan
over the entire symtab when resolving an address. The number of symbols
in module symtabs has grown over the years, largely due to additional
metadata in non-standard sections, making this lookup very slow.
Improve this by separating function symbols during module load, placing
them at the beginning of the symtab, sorting them by address, and using
binary search when resolving addresses in module text.
This also should improve times for linear symbol name lookups, as valid
function symbols are now located at the beginning of the symtab.
The cost of sorting is small relative to module load time. In repeated
module load tests [1], depending on .config options, this change
increases load time between 2% and 4%. With cold caches, the difference
is not measurable, as memory access latency dominates.
The sorting theoretically could be done in compile time, but much more
complicated as we would have to simulate kernel addresses resolution
for symbols, and then correct relocation entries. That would be risky
if get out of sync.
The improvement can be observed when listing ftrace filter functions:
root@nano:~# time cat /sys/kernel/tracing/available_filter_functions | wc -l
74908
real 0m1.315s
user 0m0.000s
sys 0m1.312s
After:
root@nano:~# time cat /sys/kernel/tracing/available_filter_functions | wc -l
74911
real 0m0.167s
user 0m0.004s
sys 0m0.175s
(there are three more symbols introduced by the patch)
For livepatch modules, the symtab layout is preserved and the existing
linear search is used. For this case, it should be possible to keep
the original ELF symtab instead of copying it 1:1, but that is outside
the scope of this patch.
Link: https://gist.github.com/sgruszka/09f3fb1dad53a97b1aad96e1927ab117 [1]
Signed-off-by: Stanislaw Gruszka <stf_xl@wp.pl>
---
include/linux/module.h | 6 ++
kernel/module/internal.h | 1 +
kernel/module/kallsyms.c | 169 ++++++++++++++++++++++++++++-----------
3 files changed, 131 insertions(+), 45 deletions(-)
diff --git a/include/linux/module.h b/include/linux/module.h
index ac254525014c..4da0289fba02 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -379,8 +379,14 @@ struct module_memory {
struct mod_kallsyms {
Elf_Sym *symtab;
unsigned int num_symtab;
+ unsigned int num_func_syms;
char *strtab;
char *typetab;
+
+ unsigned int (*search)(struct mod_kallsyms *kallsyms,
+ struct module_memory *mod_mem,
+ unsigned long addr, unsigned long *bestval,
+ unsigned long *nextval);
};
#ifdef CONFIG_LIVEPATCH
diff --git a/kernel/module/internal.h b/kernel/module/internal.h
index 618202578b42..6a4d498619b1 100644
--- a/kernel/module/internal.h
+++ b/kernel/module/internal.h
@@ -73,6 +73,7 @@ struct load_info {
bool sig_ok;
#ifdef CONFIG_KALLSYMS
unsigned long mod_kallsyms_init_off;
+ unsigned long num_func_syms;
#endif
#ifdef CONFIG_MODULE_DECOMPRESS
#ifdef CONFIG_MODULE_STATS
diff --git a/kernel/module/kallsyms.c b/kernel/module/kallsyms.c
index 0fc11e45df9b..9e131baae441 100644
--- a/kernel/module/kallsyms.c
+++ b/kernel/module/kallsyms.c
@@ -10,6 +10,7 @@
#include <linux/kallsyms.h>
#include <linux/buildid.h>
#include <linux/bsearch.h>
+#include <linux/sort.h>
#include "internal.h"
/* Lookup exported symbol in given range of kernel_symbols */
@@ -103,6 +104,95 @@ static bool is_core_symbol(const Elf_Sym *src, const Elf_Shdr *sechdrs,
return true;
}
+static inline bool is_func_symbol(const Elf_Sym *sym)
+{
+ return sym->st_shndx != SHN_UNDEF && sym->st_size != 0 &&
+ ELF_ST_TYPE(sym->st_info) == STT_FUNC;
+}
+
+static unsigned int bsearch_kallsyms_symbol(struct mod_kallsyms *kallsyms,
+ struct module_memory *mod_mem,
+ unsigned long addr,
+ unsigned long *bestval,
+ unsigned long *nextval)
+
+{
+ unsigned int mid, low = 1, high = kallsyms->num_func_syms + 1;
+ unsigned int best = 0;
+ unsigned long thisval;
+
+ while (low < high) {
+ mid = low + (high - low) / 2;
+ thisval = kallsyms_symbol_value(&kallsyms->symtab[mid]);
+
+ if (thisval <= addr) {
+ *bestval = thisval;
+ best = mid;
+ low = mid + 1;
+ } else {
+ *nextval = thisval;
+ high = mid;
+ }
+ }
+
+ return best;
+}
+
+static const char *kallsyms_symbol_name(struct mod_kallsyms *kallsyms,
+ unsigned int symnum)
+{
+ return kallsyms->strtab + kallsyms->symtab[symnum].st_name;
+}
+
+static unsigned int search_kallsyms_symbol(struct mod_kallsyms *kallsyms,
+ struct module_memory *mod_mem,
+ unsigned long addr,
+ unsigned long *bestval,
+ unsigned long *nextval)
+{
+ unsigned int i, best = 0;
+
+ /*
+ * Scan for closest preceding symbol, and next symbol. (ELF
+ * starts real symbols at 1).
+ */
+ for (i = 1; i < kallsyms->num_symtab; i++) {
+ const Elf_Sym *sym = &kallsyms->symtab[i];
+ unsigned long thisval = kallsyms_symbol_value(sym);
+
+ if (sym->st_shndx == SHN_UNDEF)
+ continue;
+
+ /*
+ * We ignore unnamed symbols: they're uninformative
+ * and inserted at a whim.
+ */
+ if (*kallsyms_symbol_name(kallsyms, i) == '\0' ||
+ is_mapping_symbol(kallsyms_symbol_name(kallsyms, i)))
+ continue;
+
+ if (thisval <= addr && thisval > *bestval) {
+ best = i;
+ *bestval = thisval;
+ }
+ if (thisval > addr && thisval < *nextval)
+ *nextval = thisval;
+ }
+
+ return best;
+}
+
+static int elf_sym_cmp(const void *a, const void *b)
+{
+ const Elf_Sym *sym_a = a;
+ const Elf_Sym *sym_b = b;
+
+ if (sym_a->st_value < sym_b->st_value)
+ return -1;
+
+ return sym_a->st_value > sym_b->st_value;
+}
+
/*
* We only allocate and copy the strings needed by the parts of symtab
* we keep. This is simple, but has the effect of making multiple
@@ -115,9 +205,10 @@ void layout_symtab(struct module *mod, struct load_info *info)
Elf_Shdr *symsect = info->sechdrs + info->index.sym;
Elf_Shdr *strsect = info->sechdrs + info->index.str;
const Elf_Sym *src;
- unsigned int i, nsrc, ndst, strtab_size = 0;
+ unsigned int i, nsrc, ndst, nfunc, strtab_size = 0;
struct module_memory *mod_mem_data = &mod->mem[MOD_DATA];
struct module_memory *mod_mem_init_data = &mod->mem[MOD_INIT_DATA];
+ bool is_lp_mod = is_livepatch_module(mod);
/* Put symbol section at end of init part of module. */
symsect->sh_flags |= SHF_ALLOC;
@@ -129,12 +220,14 @@ void layout_symtab(struct module *mod, struct load_info *info)
nsrc = symsect->sh_size / sizeof(*src);
/* Compute total space required for the core symbols' strtab. */
- for (ndst = i = 0; i < nsrc; i++) {
- if (i == 0 || is_livepatch_module(mod) ||
+ for (ndst = nfunc = i = 0; i < nsrc; i++) {
+ if (i == 0 || is_lp_mod ||
is_core_symbol(src + i, info->sechdrs, info->hdr->e_shnum,
info->index.pcpu)) {
strtab_size += strlen(&info->strtab[src[i].st_name]) + 1;
ndst++;
+ if (!is_lp_mod && is_func_symbol(src + i))
+ nfunc++;
}
}
@@ -156,6 +249,7 @@ void layout_symtab(struct module *mod, struct load_info *info)
mod_mem_init_data->size = ALIGN(mod_mem_init_data->size,
__alignof__(struct mod_kallsyms));
info->mod_kallsyms_init_off = mod_mem_init_data->size;
+ info->num_func_syms = nfunc;
mod_mem_init_data->size += sizeof(struct mod_kallsyms);
info->init_typeoffs = mod_mem_init_data->size;
@@ -169,7 +263,7 @@ void layout_symtab(struct module *mod, struct load_info *info)
*/
void add_kallsyms(struct module *mod, const struct load_info *info)
{
- unsigned int i, ndst;
+ unsigned int i, di, nfunc, ndst;
const Elf_Sym *src;
Elf_Sym *dst;
char *s;
@@ -178,6 +272,7 @@ void add_kallsyms(struct module *mod, const struct load_info *info)
void *data_base = mod->mem[MOD_DATA].base;
void *init_data_base = mod->mem[MOD_INIT_DATA].base;
struct mod_kallsyms *kallsyms;
+ bool is_lp_mod = is_livepatch_module(mod);
kallsyms = init_data_base + info->mod_kallsyms_init_off;
@@ -186,6 +281,7 @@ void add_kallsyms(struct module *mod, const struct load_info *info)
/* Make sure we get permanent strtab: don't use info->strtab. */
kallsyms->strtab = (void *)info->sechdrs[info->index.str].sh_addr;
kallsyms->typetab = init_data_base + info->init_typeoffs;
+ kallsyms->search = search_kallsyms_symbol;
/*
* Now populate the cut down core kallsyms for after init
@@ -194,19 +290,30 @@ void add_kallsyms(struct module *mod, const struct load_info *info)
mod->core_kallsyms.symtab = dst = data_base + info->symoffs;
mod->core_kallsyms.strtab = s = data_base + info->stroffs;
mod->core_kallsyms.typetab = data_base + info->core_typeoffs;
+ mod->core_kallsyms.search = is_lp_mod ? search_kallsyms_symbol :
+ bsearch_kallsyms_symbol;
+
strtab_size = info->core_typeoffs - info->stroffs;
src = kallsyms->symtab;
- for (ndst = i = 0; i < kallsyms->num_symtab; i++) {
+ ndst = info->num_func_syms + 1;
+
+ for (nfunc = i = 0; i < kallsyms->num_symtab; i++) {
kallsyms->typetab[i] = elf_type(src + i, info);
- if (i == 0 || is_livepatch_module(mod) ||
+ if (i == 0 || is_lp_mod ||
is_core_symbol(src + i, info->sechdrs, info->hdr->e_shnum,
info->index.pcpu)) {
ssize_t ret;
- mod->core_kallsyms.typetab[ndst] =
- kallsyms->typetab[i];
- dst[ndst] = src[i];
- dst[ndst++].st_name = s - mod->core_kallsyms.strtab;
+ if (i == 0)
+ di = 0;
+ else if (!is_lp_mod && is_func_symbol(src + i))
+ di = 1 + nfunc++;
+ else
+ di = ndst++;
+
+ mod->core_kallsyms.typetab[di] = kallsyms->typetab[i];
+ dst[di] = src[i];
+ dst[di].st_name = s - mod->core_kallsyms.strtab;
ret = strscpy(s, &kallsyms->strtab[src[i].st_name],
strtab_size);
if (ret < 0)
@@ -216,9 +323,13 @@ void add_kallsyms(struct module *mod, const struct load_info *info)
}
}
+ WARN_ON_ONCE(nfunc != info->num_func_syms);
+ sort(dst + 1, nfunc, sizeof(Elf_Sym), elf_sym_cmp, NULL);
+
/* Set up to point into init section. */
rcu_assign_pointer(mod->kallsyms, kallsyms);
mod->core_kallsyms.num_symtab = ndst;
+ mod->core_kallsyms.num_func_syms = nfunc;
}
#if IS_ENABLED(CONFIG_STACKTRACE_BUILD_ID)
@@ -241,11 +352,6 @@ void init_build_id(struct module *mod, const struct load_info *info)
}
#endif
-static const char *kallsyms_symbol_name(struct mod_kallsyms *kallsyms, unsigned int symnum)
-{
- return kallsyms->strtab + kallsyms->symtab[symnum].st_name;
-}
-
/*
* Given a module and address, find the corresponding symbol and return its name
* while providing its size and offset if needed.
@@ -255,7 +361,7 @@ static const char *find_kallsyms_symbol(struct module *mod,
unsigned long *size,
unsigned long *offset)
{
- unsigned int i, best = 0;
+ unsigned int best;
unsigned long nextval, bestval;
struct mod_kallsyms *kallsyms = rcu_dereference(mod->kallsyms);
struct module_memory *mod_mem;
@@ -267,36 +373,9 @@ static const char *find_kallsyms_symbol(struct module *mod,
mod_mem = &mod->mem[MOD_TEXT];
nextval = (unsigned long)mod_mem->base + mod_mem->size;
+ bestval = kallsyms_symbol_value(&kallsyms->symtab[0]);
- bestval = kallsyms_symbol_value(&kallsyms->symtab[best]);
-
- /*
- * Scan for closest preceding symbol, and next symbol. (ELF
- * starts real symbols at 1).
- */
- for (i = 1; i < kallsyms->num_symtab; i++) {
- const Elf_Sym *sym = &kallsyms->symtab[i];
- unsigned long thisval = kallsyms_symbol_value(sym);
-
- if (sym->st_shndx == SHN_UNDEF)
- continue;
-
- /*
- * We ignore unnamed symbols: they're uninformative
- * and inserted at a whim.
- */
- if (*kallsyms_symbol_name(kallsyms, i) == '\0' ||
- is_mapping_symbol(kallsyms_symbol_name(kallsyms, i)))
- continue;
-
- if (thisval <= addr && thisval > bestval) {
- best = i;
- bestval = thisval;
- }
- if (thisval > addr && thisval < nextval)
- nextval = thisval;
- }
-
+ best = kallsyms->search(kallsyms, mod_mem, addr, &bestval, &nextval);
if (!best)
return NULL;
--
2.50.1
^ permalink raw reply related
* [PATCH AUTOSEL 6.19-6.18] powerpc64/ftrace: fix OOL stub count with clang
From: Sasha Levin @ 2026-03-17 11:32 UTC (permalink / raw)
To: patches, stable
Cc: Hari Bathini, Venkat Rao Bagalkote, Madhavan Srinivasan,
Sasha Levin, rostedt, mhiramat, mpe, linux-kernel,
linux-trace-kernel, linuxppc-dev
In-Reply-To: <20260317113249.117771-1-sashal@kernel.org>
From: Hari Bathini <hbathini@linux.ibm.com>
[ Upstream commit 875612a7745013a43c67493cb0583ee3f7476344 ]
The total number of out-of-line (OOL) stubs required for function
tracing is determined using the following command:
$(OBJDUMP) -r -j __patchable_function_entries vmlinux.o
While this works correctly with GNU objdump, llvm-objdump does not
list the expected relocation records for this section. Fix this by
using the -d option and counting R_PPC64_ADDR64 relocation entries.
This works as desired with both objdump and llvm-objdump.
Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/20260127084926.34497-3-hbathini@linux.ibm.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
This is a **build fix**. When building with clang/LLVM toolchain:
1. `llvm-objdump -r -j __patchable_function_entries` produces no
relocation records
2. `grep -c "$RELOCATION"` finds 0 matches, returns exit code 1
3. With `set -e` at the top of the script, the script aborts
4. The kernel build fails
This is a clear build failure for the LLVM/clang toolchain on powerpc64
with ftrace OOL stubs enabled.
## Analysis
### What the commit fixes
This commit fixes a **build failure** when compiling the Linux kernel
with the LLVM/clang toolchain on powerpc64 with
`CONFIG_PPC_FTRACE_OUT_OF_LINE` enabled. The `llvm-objdump` tool does
not list relocation records the same way as GNU objdump when using `-r
-j __patchable_function_entries`. Adding the `-d` flag makes both GNU
objdump and llvm-objdump produce the expected output.
### Stable kernel criteria assessment
1. **Obviously correct and tested**: Yes. The fix adds `-d` to objdump
invocations. It has `Tested-by:` from Venkat Rao Bagalkote. The
commit message clearly explains the problem and solution.
2. **Fixes a real bug**: Yes. Build failure with LLVM/clang toolchain.
3. **Fixes an important issue**: Yes. Build fixes are explicitly listed
as acceptable for stable trees per stable kernel rules. Users
building with LLVM toolchain cannot build a working kernel without
this fix.
4. **Small and contained**: Yes. Only 2 lines changed - adding `-d` flag
to two objdump invocations in a shell script.
5. **No new features or APIs**: Correct. No new features.
### Scope and risk
- **2 lines changed**: Minimal scope, adding a single flag to objdump
commands.
- **Only affects build tooling**: A shell script used during build, not
runtime kernel code.
- **Cross-toolchain compatible**: The fix works with both GNU objdump
and llvm-objdump.
- **Risk**: Extremely low. The `-d` (disassemble) flag is standard for
both objdump implementations and the commit message confirms testing
with both.
### Affected versions
- The file was introduced in v6.13 (commit eec37961a56aa)
- The fix is in v7.0-rc4
- Stable trees v6.13.y and v6.14.y are affected
### Dependencies
No dependencies on other commits. The fix is self-contained.
### User impact
Anyone building the powerpc64 kernel with clang/LLVM and
CONFIG_PPC_FTRACE_OUT_OF_LINE enabled hits a build failure. The
clang/LLVM toolchain is increasingly used for kernel builds
(ClangBuiltLinux project), making this a real-world issue.
## Verification
- `git log --follow -- arch/powerpc/tools/ftrace-gen-ool-stubs.sh`
confirmed the file was introduced in commit eec37961a56aa (v6.13)
- `git merge-base --is-ancestor eec37961a56aa v6.13` confirmed the file
is in v6.13 but not v6.12
- `git tag --contains 875612a774501` confirmed the fix landed in
v7.0-rc4
- Read `arch/powerpc/kernel/trace/ftrace.c:190-239` confirmed that wrong
stub count leads to WARN_ON and -EINVAL (ftrace failure)
- Read `arch/powerpc/tools/ftrace-gen-ool-stubs.sh` confirmed `set -e`
at line 5, meaning grep returning exit code 1 (zero matches) would
abort the script and fail the build
- `git tag -l 'v6.13.*'` and `git tag -l 'v6.14.*'` confirmed both
stable trees exist and need this fix
- The diff is exactly 2 lines changed (adding `-d` flag) - verified from
the commit diff
**YES**
arch/powerpc/tools/ftrace-gen-ool-stubs.sh | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/powerpc/tools/ftrace-gen-ool-stubs.sh b/arch/powerpc/tools/ftrace-gen-ool-stubs.sh
index bac186bdf64a7..9218d43aeb548 100755
--- a/arch/powerpc/tools/ftrace-gen-ool-stubs.sh
+++ b/arch/powerpc/tools/ftrace-gen-ool-stubs.sh
@@ -15,9 +15,9 @@ if [ -z "$is_64bit" ]; then
RELOCATION=R_PPC_ADDR32
fi
-num_ool_stubs_total=$($objdump -r -j __patchable_function_entries "$vmlinux_o" |
+num_ool_stubs_total=$($objdump -r -j __patchable_function_entries -d "$vmlinux_o" |
grep -c "$RELOCATION")
-num_ool_stubs_inittext=$($objdump -r -j __patchable_function_entries "$vmlinux_o" |
+num_ool_stubs_inittext=$($objdump -r -j __patchable_function_entries -d "$vmlinux_o" |
grep -e ".init.text" -e ".text.startup" | grep -c "$RELOCATION")
num_ool_stubs_text=$((num_ool_stubs_total - num_ool_stubs_inittext))
--
2.51.0
^ permalink raw reply related
* Re: [PATCH mm-unstable v15 12/13] mm/khugepaged: run khugepaged for all orders
From: Lance Yang @ 2026-03-17 11:36 UTC (permalink / raw)
To: baolin.wang, npache
Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
akpm, anshuman.khandual, apopple, baohua, byungchul,
catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
lance.yang, Liam.Howlett, lorenzo.stoakes, mathieu.desnoyers,
matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <20260226032650.234386-1-npache@redhat.com>
On Wed, Feb 25, 2026 at 08:26:50PM -0700, Nico Pache wrote:
>From: Baolin Wang <baolin.wang@linux.alibaba.com>
>
>If any order (m)THP is enabled we should allow running khugepaged to
>attempt scanning and collapsing mTHPs. In order for khugepaged to operate
>when only mTHP sizes are specified in sysfs, we must modify the predicate
>function that determines whether it ought to run to do so.
>
>This function is currently called hugepage_pmd_enabled(), this patch
>renames it to hugepage_enabled() and updates the logic to check to
>determine whether any valid orders may exist which would justify
>khugepaged running.
>
>We must also update collapse_allowable_orders() to check all orders if
>the vma is anonymous and the collapse is khugepaged.
>
>After this patch khugepaged mTHP collapse is fully enabled.
>
>Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>Signed-off-by: Nico Pache <npache@redhat.com>
>---
> mm/khugepaged.c | 30 ++++++++++++++++++------------
> 1 file changed, 18 insertions(+), 12 deletions(-)
>
>diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>index 388d3f2537e2..e8bfcc1d0c9a 100644
>--- a/mm/khugepaged.c
>+++ b/mm/khugepaged.c
>@@ -434,23 +434,23 @@ static inline int collapse_test_exit_or_disable(struct mm_struct *mm)
> mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm);
> }
>
>-static bool hugepage_pmd_enabled(void)
>+static bool hugepage_enabled(void)
> {
> /*
> * We cover the anon, shmem and the file-backed case here; file-backed
> * hugepages, when configured in, are determined by the global control.
>- * Anon pmd-sized hugepages are determined by the pmd-size control.
>+ * Anon hugepages are determined by its per-size mTHP control.
> * Shmem pmd-sized hugepages are also determined by its pmd-size control,
> * except when the global shmem_huge is set to SHMEM_HUGE_DENY.
> */
> if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) &&
> hugepage_global_enabled())
> return true;
>- if (test_bit(PMD_ORDER, &huge_anon_orders_always))
>+ if (READ_ONCE(huge_anon_orders_always))
> return true;
>- if (test_bit(PMD_ORDER, &huge_anon_orders_madvise))
>+ if (READ_ONCE(huge_anon_orders_madvise))
> return true;
>- if (test_bit(PMD_ORDER, &huge_anon_orders_inherit) &&
>+ if (READ_ONCE(huge_anon_orders_inherit) &&
> hugepage_global_enabled())
> return true;
> if (IS_ENABLED(CONFIG_SHMEM) && shmem_hpage_pmd_enabled())
>@@ -521,8 +521,14 @@ static unsigned int collapse_max_ptes_none(unsigned int order)
> static unsigned long collapse_allowable_orders(struct vm_area_struct *vma,
> vm_flags_t vm_flags, bool is_khugepaged)
> {
>+ unsigned long orders;
> enum tva_type tva_flags = is_khugepaged ? TVA_KHUGEPAGED : TVA_FORCED_COLLAPSE;
>- unsigned long orders = BIT(HPAGE_PMD_ORDER);
>+
>+ /* If khugepaged is scanning an anonymous vma, allow mTHP collapse */
>+ if (is_khugepaged && vma_is_anonymous(vma))
>+ orders = THP_ORDERS_ALL_ANON;
>+ else
>+ orders = BIT(HPAGE_PMD_ORDER);
>
> return thp_vma_allowable_orders(vma, vm_flags, tva_flags, orders);
> }
IIUC, an anonymous VMA can pass collapse_allowable_orders() even if it
is smaller than 2MB ...
But collapse_scan_mm_slot() still scans only full PMD-sized windows:
hstart = round_up(vma->vm_start, HPAGE_PMD_SIZE);
hend = round_down(vma->vm_end, HPAGE_PMD_SIZE);
if (khugepaged_scan.address > hend) {
cc->progress++;
continue;
}
and hugepage_vma_revalidate() still requires PMD suitability:
/* Always check the PMD order to ensure its not shared by another VMA */
if (!thp_vma_suitable_order(vma, address, PMD_ORDER))
return SCAN_ADDRESS_RANGE;
>@@ -531,7 +537,7 @@ void khugepaged_enter_vma(struct vm_area_struct *vma,
> vm_flags_t vm_flags)
> {
> if (!mm_flags_test(MMF_VM_HUGEPAGE, vma->vm_mm) &&
>- hugepage_pmd_enabled()) {
>+ hugepage_enabled()) {
> if (collapse_allowable_orders(vma, vm_flags, /*is_khugepaged=*/true))
> __khugepaged_enter(vma->vm_mm);
I wonder if we should also require at least one PMD-sized scan window
here? Not a big deal, just might be good to tighten the gate a bit :)
Apart from that, LGTM!
Reviewed-by: Lance Yang <lance.yang@linux.dev>
^ permalink raw reply
* Re: [PATCH v6 06/17] lib/bootconfig: drop redundant memset of xbc_nodes
From: Markus Elfring @ 2026-03-17 11:46 UTC (permalink / raw)
To: Josh Law, linux-trace-kernel, Andrew Morton, Masami Hiramatsu
Cc: LKML, Steven Rostedt
In-Reply-To: <20260315122015.55965-7-objecting@objecting.org>
> memblock_alloc() already returns zeroed memory,
Interesting …
> so the explicit memset
> in xbc_init() is redundant. …
How is this conclusion appropriate for the mentioned function implementation?
https://elixir.bootlin.com/linux/v7.0-rc4/source/lib/bootconfig.c#L932-L998
Regards,
Markus
^ permalink raw reply
* Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: David Hildenbrand (Arm) @ 2026-03-17 13:05 UTC (permalink / raw)
To: Gregory Price
Cc: linux-kernel, linux-cxl, cgroups, linux-mm, linux-trace-kernel,
damon, kernel-team, gregkh, rafael, dakr, dave, jonathan.cameron,
dave.jiang, alison.schofield, vishal.l.verma, ira.weiny,
dan.j.williams, longman, akpm, lorenzo.stoakes, Liam.Howlett,
vbabka, rppt, surenb, mhocko, osalvador, ziy, matthew.brost,
joshua.hahnjy, rakie.kim, byungchul, ying.huang, apopple,
axelrasmussen, yuanchu, weixugc, yury.norov, linux, mhiramat,
mathieu.desnoyers, tj, hannes, mkoutny, jackmanb, sj, baolin.wang,
npache, ryan.roberts, dev.jain, baohua, lance.yang, muchun.song,
xu.xin16, chengming.zhou, jannh, linmiaohe, nao.horiguchi,
pfalcato, rientjes, shakeel.butt, riel, harry.yoo, cl,
roman.gushchin, chrisl, kasong, shikemeng, nphamcs, bhe,
zhengqi.arch, terry.bowman
In-Reply-To: <aZx7hsVNU0XOCCiG@gourry-fedora-PF4VCD3F>
On 2/23/26 17:08, Gregory Price wrote:
> On Mon, Feb 23, 2026 at 09:54:55AM -0500, Gregory Price wrote:
>> On Mon, Feb 23, 2026 at 02:07:15PM +0100, David Hildenbrand (Arm) wrote:
>>>
>>> I'm concerned about adding more special-casing (similar to what we already
>>> added for ZONE_DEVICE) all over the place.
>>>
>>> Like the whole folio_managed_() stuff in mprotect.c
>>>
>>> Having that said, sounds like a reasonable topic to discuss.
>>>
>>
>> Another option would be to add the hook to vma_wants_writenotify()
>> instead of the page table code - and mask MM_CP_TRY_CHANGE_WRITABLE.
>>
>
> scratch all this - existing hooks exist for exactly this purpose:
>
> can_change_[pte|pmd]_writable()
>
> Surprised I missed this.
>
> I can clean this up to remove it from the page table walks.
Sorry for the late reply -- sounds like we can handle this cleaner.
But I am wondering: why is this even required?
Is it just for "Services that intercept write faults (e.g., for
promotion tracking) need PTEs to stay read-only"
But that promotion tracking sounds like some orthogonal work to me. What
am I missing that this is required in this patch set? (is it just for
the special compressed RAM bits?)
--
Cheers,
David
^ permalink raw reply
* Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: David Hildenbrand (Arm) @ 2026-03-17 13:25 UTC (permalink / raw)
To: Gregory Price, lsf-pc
Cc: linux-kernel, linux-cxl, cgroups, linux-mm, linux-trace-kernel,
damon, kernel-team, gregkh, rafael, dakr, dave, jonathan.cameron,
dave.jiang, alison.schofield, vishal.l.verma, ira.weiny,
dan.j.williams, longman, akpm, lorenzo.stoakes, Liam.Howlett,
vbabka, rppt, surenb, mhocko, osalvador, ziy, matthew.brost,
joshua.hahnjy, rakie.kim, byungchul, ying.huang, apopple,
axelrasmussen, yuanchu, weixugc, yury.norov, linux, mhiramat,
mathieu.desnoyers, tj, hannes, mkoutny, jackmanb, sj, baolin.wang,
npache, ryan.roberts, dev.jain, baohua, lance.yang, muchun.song,
xu.xin16, chengming.zhou, jannh, linmiaohe, nao.horiguchi,
pfalcato, rientjes, shakeel.butt, riel, harry.yoo, cl,
roman.gushchin, chrisl, kasong, shikemeng, nphamcs, bhe,
zhengqi.arch, terry.bowman
In-Reply-To: <20260222084842.1824063-1-gourry@gourry.net>
On 2/22/26 09:48, Gregory Price wrote:
> Topic type: MM
Hi Gregory,
stumbling over this again, some questions whereby I'll just ignore the
compressed RAM bits for now and focus on use cases where promotion etc
are not relevant :)
[...]
>
> TL;DR
> ===
>
> N_MEMORY_PRIVATE is all about isolating NUMA nodes and then punching
> explicit holes in that isolation to do useful things we couldn't do
> before without re-implementing entire portions of mm/ in a driver.
Just to clarify: we don't currently have any mechanism to expose, say,
SPM/PMEM/whatsoever to the buddy allocator through the dax/kmem driver
and *not* have random allocations end up on it, correct?
Assume we online the memory to ZONE_MOVABLE, still other (fallback)
allocations might end up on that memory.
How would we currently handle something like that? (do we have drivers
for that? I'd assume that drivers would only migrate some user memory to
ZONE_DEVICE memory.)
Assuming we don't have such a mechanism, I assume that part of your
proposal would be very interesting: online the memory to a
"special"/"restricted" (you call it private) NUMA node, whereby all
memory of that NUMA node will only be consumable through
mbind() and friends.
Any other allocations (including automatic page migration etc) would not
end up on that memory.
Thinking of some "terribly slow" or "terribly fast" memory that we don't
want to involve in automatic memory tiering, being able to just let
selected workloads consume that memory sounds very helpful.
(wondering if there could be some way allocations might get migrated out
of the node, for example, during memory offlining etc, which might also
not be desirable)
I am not sure if __GFP_PRIVATE etc is really required for that. But some
mechanism to make that work seems extremely helpful.
Because ...
>
>
> /* This is my memory. There are many like it, but this one is mine. */
> rc = add_private_memory_driver_managed(nid, start, size, name, flags,
> online_type, private_context);
>
> page = alloc_pages_node(nid, __GFP_PRIVATE, 0);
>
> /* Ok but I want to do something useful with it */
> static const struct node_private_ops ops = {
> .migrate_to = my_migrate_to,
> .folio_migrate = my_folio_migrate,
> .flags = NP_OPS_MIGRATION | NP_OPS_MEMPOLICY,
> };
> node_private_set_ops(nid, &ops);
>
> /* And now I can use mempolicy with my memory */
> buf = mmap(...);
> mbind(buf, len, mode, private_node, ...);
> buf[0] = 0xdeadbeef; /* Faults onto private node */
... just being able to consume that memory through mbind() and having
guarantees sounds extremely helpful.
[...]
>
>
> Background
> ===
>
> Today, drivers that want mm-like services on non-general-purpose
> memory either use ZONE_DEVICE (self-managed memory) or hotplug into
> N_MEMORY and accept the risk of uncontrolled allocation.
>
> Neither option provides what we really want - the ability to:
> 1) selectively participate in mm/ subsystems, while
> 2) isolating that memory from general purpose use.
>
> Some device-attached memory cannot be managed as fully general-purpose
> system RAM. CXL devices with inline compression, for example, may
> corrupt data or crash the machine if the compression ratio drops
> below a threshold -- we simply run out of physical memory.
>
> This is a hard problem to solve: how does an operating system deal
> with a device that basically lies about how much capacity it has?
>
> (We'll discuss that in the CRAM section)
>
>
> Core Proposal: N_MEMORY_PRIVATE
> ===
>
> Introduce N_MEMORY_PRIVATE, a NUMA node state for memory managed by
> the buddy allocator, but excluded from normal allocation paths.
>
> Private nodes:
>
> - Are filtered from zonelist fallback: all existing callers to
> get_page_from_freelist cannot reach these nodes through any
> normal fallback mechanism.
Good.
>
> - Filter allocation requests on __GFP_PRIVATE
> numa_zone_allowed() excludes them otherwise.
I think we discussed that in the past, but why can't we find a way that
only people requesting __GFP_THISNODE could allocate that memory, for
example? I guess we'd have to remove it from all "default NUMA bitmaps"
somehow.
>
> Applies to systems with and without cpusets.
>
> GFP_PRIVATE is (__GFP_PRIVATE | __GFP_THISNODE).
>
> Services use it when they need to allocate specifically from
> a private node (e.g., CRAM allocating a destination folio).
>
> No existing allocator path sets __GFP_PRIVATE, so private nodes
> are unreachable by default.
>
> - Use standard struct page / folio. No ZONE_DEVICE, no pgmap,
> no struct page metadata limitations.
Good.
>
> - Use a node-scoped metadata structure to accomplish filtering
> and callback support.
>
> - May participate in the buddy allocator, reclaim, compaction,
> and LRU like normal memory, gated by an opt-in set of flags.
>
> The key abstraction is node_private_ops: a per-node callback table
> registered by a driver or service.
>
> Each callback is individually gated by an NP_OPS_* capability flag.
>
> A driver opts in only to the mm/ operations it needs.
>
> It is similar to ZONE_DEVICE's pgmap at a node granularity.
>
> In fact...
>
>
> Re-use of ZONE_DEVICE Hooks
> ===
I think all of that might not be required for the simplistic use case I
mentioned above (fast/slow memory only to be consumed by selected user
space that opts in through mbind() and friends).
Or are there other use cases for these callbacks
[...]
>
>
> Flag-gated behavior (NP_OPS_*) controls:
> ===
>
> We use OPS flags to denote what mm/ services we want to allow on our
> private node. I've plumbed these through so far:
>
> NP_OPS_MIGRATION - Node supports migration
> NP_OPS_MEMPOLICY - Node supports mempolicy actions
> NP_OPS_DEMOTION - Node appears in demotion target lists
> NP_OPS_PROTECT_WRITE - Node memory is read-only (wrprotect)
> NP_OPS_RECLAIM - Node supports reclaim
> NP_OPS_NUMA_BALANCING - Node supports numa balancing
> NP_OPS_COMPACTION - Node supports compaction
> NP_OPS_LONGTERM_PIN - Node supports longterm pinning
> NP_OPS_OOM_ELIGIBLE - (MIGRATION | DEMOTION), node is reachable
> as normal system ram storage, so it should
> be considered in OOM pressure calculations.
I have to think about all that, and whether that would be required as a
first step. I'd assume in a simplistic use case mentioned above we might
only forbid the memory to be used as a fallback for any oom etc.
Whether reclaim (e.g., swapout) makes sense is a good question.
--
Cheers,
David
^ permalink raw reply
* Re: [PATCH v6 09/17] lib/bootconfig: validate child node index in xbc_verify_tree()
From: Steven Rostedt @ 2026-03-17 15:10 UTC (permalink / raw)
To: Markus Elfring
Cc: Josh Law, linux-trace-kernel, Andrew Morton, Masami Hiramatsu,
LKML
In-Reply-To: <fbe3dbdd-61a4-47e7-90a1-1042caa71129@web.de>
On Tue, 17 Mar 2026 12:03:40 +0100
Markus Elfring <Markus.Elfring@web.de> wrote:
> …
> > +++ b/lib/bootconfig.c
> > @@ -823,6 +823,10 @@ static int __init xbc_verify_tree(void)
> > return xbc_parse_error("No closing brace",
> > xbc_node_get_data(xbc_nodes + i));
> > }
> > + if (xbc_nodes[i].child >= xbc_node_num) {
> > + return xbc_parse_error("Broken child node",
> > + xbc_node_get_data(xbc_nodes + i));
> > + }
> > }
> …
>
> How do you think about to omit curly brackets for this if branch?
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/coding-style.rst?h=v7.0-rc4#n197
>
Markus, please stop. If you don't understand the rules, do not suggest them.
The brackets *are* appropriate. The rule of omitting the brackets is for
*single line* statements. The above return statement is long and there's a
line break, which means, curly brackets *are* required for visibility reasons.
-- Steve
^ permalink raw reply
* Re: [PATCH 00/15] tracepoint: Avoid double static_branch evaluation at guarded call sites
From: Steven Rostedt @ 2026-03-17 16:00 UTC (permalink / raw)
To: Vineeth Remanan Pillai
Cc: Andrii Nakryiko, Mathieu Desnoyers, Peter Zijlstra,
Dmitry Ilvokhin, Masami Hiramatsu, Ingo Molnar, Jens Axboe,
io-uring, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
Marcelo Ricardo Leitner, Xin Long, Jon Maloy, Aaron Conole,
Eelco Chaudron, Ilya Maximets, netdev, bpf, linux-sctp,
tipc-discussion, dev, Oded Gabbay, Koby Elbaz, dri-devel,
Rafael J. Wysocki, Viresh Kumar, Gautham R. Shenoy, Huang Rui,
Mario Limonciello, Len Brown, Srinivas Pandruvada, linux-pm,
MyungJoo Ham, Kyungmin Park, Chanwoo Choi, Christian König,
Sumit Semwal, linaro-mm-sig, Eddie James, Andrew Jeffery,
Joel Stanley, linux-fsi, David Airlie, Simona Vetter,
Alex Deucher, Danilo Krummrich, Matthew Brost, Philipp Stanner,
Harry Wentland, Leo Li, amd-gfx, Jiri Kosina, Benjamin Tissoires,
linux-input, Wolfram Sang, linux-i2c, Mark Brown,
Michael Hennerich, Nuno Sá, linux-spi, James E.J. Bottomley,
Martin K. Petersen, linux-scsi, Chris Mason, David Sterba,
linux-btrfs, linux-trace-kernel, linux-kernel
In-Reply-To: <CAO7JXPgHYZ9zF1HFahb2447X85YRZCQQBHB6ihOwKSDtiZi8kQ@mail.gmail.com>
On Fri, 13 Mar 2026 10:02:32 -0400
Vineeth Remanan Pillai <vineeth@bitbyteword.org> wrote:
> >
> > Perhaps: call_trace_foo() ?
> >
> call_trace_foo has one collision with the tracepoint
> sched_update_nr_running and a function
> call_trace_sched_update_nr_running. I had considered this and later
> moved to trace_invoke_foo() because of the collision. But I can rename
> call_trace_sched_update_nr_running to something else if call_trace_foo
> is the general consensus.
OK, then lets go with: trace_call__foo()
The double underscore should prevent any name collisions.
Does anyone have an objections?
-- Steve
^ permalink raw reply
* Re: [PATCH 00/15] tracepoint: Avoid double static_branch evaluation at guarded call sites
From: Mathieu Desnoyers @ 2026-03-17 16:02 UTC (permalink / raw)
To: Steven Rostedt, Vineeth Remanan Pillai
Cc: Andrii Nakryiko, Peter Zijlstra, Dmitry Ilvokhin,
Masami Hiramatsu, Ingo Molnar, Jens Axboe, io-uring,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Alexei Starovoitov, Daniel Borkmann, Marcelo Ricardo Leitner,
Xin Long, Jon Maloy, Aaron Conole, Eelco Chaudron, Ilya Maximets,
netdev, bpf, linux-sctp, tipc-discussion, dev, Oded Gabbay,
Koby Elbaz, dri-devel, Rafael J. Wysocki, Viresh Kumar,
Gautham R. Shenoy, Huang Rui, Mario Limonciello, Len Brown,
Srinivas Pandruvada, linux-pm, MyungJoo Ham, Kyungmin Park,
Chanwoo Choi, Christian König, Sumit Semwal, linaro-mm-sig,
Eddie James, Andrew Jeffery, Joel Stanley, linux-fsi,
David Airlie, Simona Vetter, Alex Deucher, Danilo Krummrich,
Matthew Brost, Philipp Stanner, Harry Wentland, Leo Li, amd-gfx,
Jiri Kosina, Benjamin Tissoires, linux-input, Wolfram Sang,
linux-i2c, Mark Brown, Michael Hennerich, Nuno Sá, linux-spi,
James E.J. Bottomley, Martin K. Petersen, linux-scsi, Chris Mason,
David Sterba, linux-btrfs, linux-trace-kernel, linux-kernel
In-Reply-To: <20260317120049.6a60fa88@gandalf.local.home>
On 2026-03-17 12:00, Steven Rostedt wrote:
> On Fri, 13 Mar 2026 10:02:32 -0400
> Vineeth Remanan Pillai <vineeth@bitbyteword.org> wrote:
>
>>>
>>> Perhaps: call_trace_foo() ?
>>>
>> call_trace_foo has one collision with the tracepoint
>> sched_update_nr_running and a function
>> call_trace_sched_update_nr_running. I had considered this and later
>> moved to trace_invoke_foo() because of the collision. But I can rename
>> call_trace_sched_update_nr_running to something else if call_trace_foo
>> is the general consensus.
>
> OK, then lets go with: trace_call__foo()
>
> The double underscore should prevent any name collisions.
>
> Does anyone have an objections?
I'm OK with it.
Thanks!
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply
* [PATCH v7 12/15] lib/bootconfig: use size_t for strlen result in xbc_node_match_prefix()
From: Josh Law @ 2026-03-17 16:09 UTC (permalink / raw)
To: Masami Hiramatsu, Andrew Morton
Cc: Steven Rostedt, linux-kernel, linux-trace-kernel
In-Reply-To: <20260317160916.33576-1-objecting@objecting.org>
lib/bootconfig.c:198:19: warning: conversion from 'size_t' to 'int'
may change value [-Wconversion]
lib/bootconfig.c:200:33: warning: conversion to '__kernel_size_t'
from 'int' may change the sign of the result [-Wsign-conversion]
strlen() returns size_t but the result was stored in an int. The value
is then passed back to strncmp() which expects size_t, causing a second
sign-conversion warning on the round-trip. Use size_t throughout to
match the API types.
Signed-off-by: Josh Law <objecting@objecting.org>
---
lib/bootconfig.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/lib/bootconfig.c b/lib/bootconfig.c
index dd639046912c..3c5b6ad32a54 100644
--- a/lib/bootconfig.c
+++ b/lib/bootconfig.c
@@ -195,7 +195,7 @@ static bool __init
xbc_node_match_prefix(struct xbc_node *node, const char **prefix)
{
const char *p = xbc_node_get_data(node);
- int len = strlen(p);
+ size_t len = strlen(p);
if (strncmp(*prefix, p, len))
return false;
--
2.34.1
^ permalink raw reply related
* [PATCH v7 13/15] lib/bootconfig: use signed type for offset in xbc_init_node()
From: Josh Law @ 2026-03-17 16:09 UTC (permalink / raw)
To: Masami Hiramatsu, Andrew Morton
Cc: Steven Rostedt, linux-kernel, linux-trace-kernel
In-Reply-To: <20260317160916.33576-1-objecting@objecting.org>
lib/bootconfig.c:415:32: warning: conversion to 'long unsigned int'
from 'long int' may change the sign of the result [-Wsign-conversion]
Pointer subtraction yields ptrdiff_t (signed), which was stored in
unsigned long. The original unsigned type implicitly caught a negative
offset (data < xbc_data) because the wrapped value would exceed
XBC_DATA_MAX. Make this intent explicit by using a signed long and
adding an offset < 0 check to the WARN_ON condition.
Signed-off-by: Josh Law <objecting@objecting.org>
---
lib/bootconfig.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/lib/bootconfig.c b/lib/bootconfig.c
index 3c5b6ad32a54..119c3d429c1f 100644
--- a/lib/bootconfig.c
+++ b/lib/bootconfig.c
@@ -412,9 +412,9 @@ const char * __init xbc_node_find_next_key_value(struct xbc_node *root,
static int __init xbc_init_node(struct xbc_node *node, char *data, uint16_t flag)
{
- unsigned long offset = data - xbc_data;
+ long offset = data - xbc_data;
- if (WARN_ON(offset >= XBC_DATA_MAX))
+ if (WARN_ON(offset < 0 || offset >= XBC_DATA_MAX))
return -EINVAL;
node->data = (uint16_t)offset | flag;
--
2.34.1
^ permalink raw reply related
* [PATCH v7 11/15] lib/bootconfig: fix signed comparison in xbc_node_get_data()
From: Josh Law @ 2026-03-17 16:09 UTC (permalink / raw)
To: Masami Hiramatsu, Andrew Morton
Cc: Steven Rostedt, linux-kernel, linux-trace-kernel
In-Reply-To: <20260317160916.33576-1-objecting@objecting.org>
lib/bootconfig.c:188:28: warning: comparison of integer expressions
of different signedness: 'int' and 'size_t' [-Wsign-compare]
The local variable 'offset' is declared as int, but xbc_data_size is
size_t. Change the type to size_t to match and eliminate the warning.
Signed-off-by: Josh Law <objecting@objecting.org>
---
lib/bootconfig.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/lib/bootconfig.c b/lib/bootconfig.c
index ecc4e8d93547..dd639046912c 100644
--- a/lib/bootconfig.c
+++ b/lib/bootconfig.c
@@ -183,7 +183,7 @@ struct xbc_node * __init xbc_node_get_next(struct xbc_node *node)
*/
const char * __init xbc_node_get_data(struct xbc_node *node)
{
- int offset = node->data & ~XBC_VALUE;
+ size_t offset = node->data & ~XBC_VALUE;
if (WARN_ON(offset >= xbc_data_size))
return NULL;
--
2.34.1
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox