* [PATCH v2 0/2] Add trace events for Qualcomm GENI SPI drivers
From: Praveen Talari @ 2026-05-12 6:12 UTC
To: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Mark Brown
Cc: linux-kernel, linux-trace-kernel, linux-arm-msm, linux-spi,
mukesh.savaliya, aniket.randive, chandana.chiluveru,
jyothi.seerapu, Praveen Talari
Add tracepoints to the Qualcomm GENI (Generic Interface) SPI driver.
These trace events enable runtime debugging and performance analysis
of SPI operations.
The trace events capture SPI clock configuration, FIFO parameters,
transfer details, and interrupt status.
Usage examples:
Enable all SPI traces:
echo 1 > /sys/kernel/tracing/events/spi/enable
echo 1 > /sys/kernel/debug/tracing/events/qcom_geni_spi/enable
cat /sys/kernel/debug/tracing/trace_pipe
Example trace output:
1003.956560: spi_message_submit: spi16.0 000000001b20b93c
1003.956642: spi_controller_busy: spi16
1003.956643: spi_message_start: spi16.0 000000001b20b93c
1003.956646: geni_spi_fifo_params: 888000.spi: cs=0 mode=0x00000020
mode_changed=0x00000007 cs_changed=0
1003.956647: spi_set_cs: spi16.0 activate
1003.956648: spi_transfer_start: spi16.0 00000000ea1cf8b6 len=16
tx=[4c-80-e4-ca-68-4d-95-aa-ee-99-ae-d7-69-e9-5f-39]
rx=[00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00]
1003.956653: geni_spi_clk_cfg: 888000.spi: req_hz=20000000
sclk_hz=100000000 clk_idx=5 clk_div=5 bpw=8
1003.956691: geni_spi_transfer: 888000.spi: len=16 m_cmd=0x00000003
1003.956708: geni_spi_irq: 888000.spi: m_irq=0x08000081
dma_tx=0x00000000 dma_rx=0x00000000
1003.956717: spi_transfer_stop: spi16.0 00000000ea1cf8b6 len=16
tx=[4c-80-e4-ca-68-4d-95-aa-ee-99-ae-d7-69-e9-5f-39]
rx=[4c-80-e4-ca-68-4d-95-aa-ee-99-ae-d7-69-e9-5f-39]
1003.956717: spi_set_cs: spi16.0 deactivate
1003.956718: spi_message_done: spi16.0 000000001b20b93c len=16/16
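The geni_spi_clk_cfg line above can be sanity-checked by hand. Assuming the effective SPI clock is sclk_hz divided by clk_div (an assumption about how the traced fields relate, not something the patch states), 100000000 / 5 gives the requested 20000000 Hz. A minimal C sketch of that check, using hypothetical helper names that are not part of the driver:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of spi-geni-qcom.c: derive the effective
 * SPI clock from the sclk_hz and clk_div fields of a geni_spi_clk_cfg
 * trace line, ASSUMING effective = sclk_hz / clk_div.
 */
static uint32_t effective_sck_hz(uint32_t sclk_hz, uint32_t clk_div)
{
	assert(clk_div != 0);	/* a zero divider would be a malformed trace line */
	return sclk_hz / clk_div;
}

/* Returns 1 when the effective clock does not exceed the requested rate. */
static int clk_cfg_within_request(uint32_t req_hz, uint32_t sclk_hz,
				  uint32_t clk_div)
{
	return effective_sck_hz(sclk_hz, clk_div) <= req_hz;
}
```

With the values from the trace, effective_sck_hz(100000000, 5) matches req_hz exactly; a smaller divider would overshoot the request and be flagged.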
Signed-off-by: Praveen Talari <praveen.talari@oss.qualcomm.com>
---
Changes in v2:
- Removed tx/rx data capture since the SPI core already supports it.
- Updated commit text in patches and cover letter.
- Link to v1: https://lore.kernel.org/r/20260506-add-tracepoints-for-qcom-geni-spi-v1-0-c957cfe712d1@oss.qualcomm.com
---
Praveen Talari (2):
spi: qcom-geni: trace: Add trace events for Qualcomm GENI SPI
spi: qcom-geni: Add trace events for Qualcomm GENI SPI driver
drivers/spi/spi-geni-qcom.c | 13 +++++
include/trace/events/qcom_geni_spi.h | 103 +++++++++++++++++++++++++++++++++++
2 files changed, 116 insertions(+)
---
base-commit: 1f5ffc672165ff851063a5fd044b727ab2517ae3
change-id: 20260506-add-tracepoints-for-qcom-geni-spi-e31457c2267c
Best regards,
--
Praveen Talari <praveen.talari@oss.qualcomm.com>
* Re: [PATCH v1 1/2] spi: qcom-geni: trace: Add trace events for Qualcomm GENI SPI
From: Praveen Talari @ 2026-05-12 6:10 UTC
To: Trilok Soni, Mark Brown
Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel, linux-arm-msm, linux-spi,
mukesh.savaliya, aniket.randive,
chandana.chiluveru, jyothi.seerapu
In-Reply-To: <e37f2d5d-daa9-4302-8d34-9ce198e60a4a@oss.qualcomm.com>
Hi Trilok,
On 09-05-2026 04:44, Trilok Soni wrote:
> On 5/8/2026 7:01 AM, Mark Brown wrote:
>> On Thu, May 07, 2026 at 11:03:39PM +0530, Praveen Talari wrote:
>>> On 07-05-2026 13:43, Mark Brown wrote:
>>>> By generic I mean this should not be driver specific at all.
>>> I hope these changes are fine. Please let me know if you have any concerns
>>> or feedback.
>> The data tracepoints look plausible but I would expect them to be
>> generated by the core, they'll be there for everything so I'd expect
>> them to work for everything.
> I agree here. Praveen - this is similar to suggestion I had for the i2c
> internally.
Sure, I will review this for I2C as well.
Thanks,
Praveen Talari
>
>
> ---Trilok Soni
>
* Re: [PATCH bpf 1/2] uprobes/x86: Fix red zone clobbering in nop5 optimization
From: Masami Hiramatsu @ 2026-05-12 5:14 UTC
To: Jiri Olsa
Cc: Andrii Nakryiko, bpf, linux-trace-kernel, oleg, peterz, mingo,
mhiramat
In-Reply-To: <agD3xq2MUyskd7X-@krava>
On Sun, 10 May 2026 23:25:26 +0200
Jiri Olsa <olsajiri@gmail.com> wrote:
> On Fri, May 08, 2026 at 05:30:56PM -0700, Andrii Nakryiko wrote:
> > The x86 uprobe nop5 optimization currently replaces a 5-byte NOP at the
> > probe site with a CALL into a uprobe trampoline. CALL pushes a return
> > address to [rsp-8]. On x86-64 this is inside the 128-byte red zone, where
> > user code may keep temporary data without adjusting rsp.
> >
> > Use a 5-byte JMP instead. JMP does not write to the user stack, but it
> > also does not provide a return address. Replace the single trampoline
> > entry with a page of 16-byte slots. Each optimized probe jumps to its
> > assigned slot, the slot moves rsp below the red zone, saves the registers
> > clobbered by syscall, and invokes the uprobe syscall:
> >
> > Probe site: jmp slot_N (5B, replaces nop5)
> >
> > Slot N: lea -128(%rsp), %rsp (5B) skip red zone
> > push %rcx (1B) save (syscall clobbers)
> > push %r11 (2B) save (syscall clobbers)
> > push %rax (1B) save (syscall uses for nr)
> > mov $336, %eax (5B) uprobe syscall number
> > syscall (2B)
> >
> > All slots contain identical code at different offsets, so the trampoline
> > page is generated once at boot and mapped read-execute into each process.
> > The syscall handler identifies the slot from regs->ip, which points just
> > after the syscall instruction, and uses a per-mm slot table to recover the
> > original probe address.
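The slot lookup described above reduces to integer arithmetic on the post-syscall IP. A minimal user-space sketch, assuming 4 KiB pages and using hypothetical names (slot_from_ip, TRAMP_PAGE_SIZE, TRAMP_SLOT_SIZE) rather than the kernel's own:

```c
#include <assert.h>

#define TRAMP_PAGE_SIZE 4096UL	/* assumption: 4 KiB pages */
#define TRAMP_SLOT_SIZE 16UL	/* slot size given in the patch description */

/*
 * Map a post-syscall IP to its slot index, or return -1 when the IP
 * cannot belong to the trampoline page starting at tramp_vaddr.
 *
 * The IP points just past the syscall instruction, i.e. at the END of
 * its slot. Valid values therefore lie in (tramp_vaddr, tramp_vaddr +
 * PAGE_SIZE]: never at offset 0, but possibly one byte past the page.
 */
static long slot_from_ip(unsigned long tramp_vaddr, unsigned long ip)
{
	assert(tramp_vaddr % TRAMP_PAGE_SIZE == 0);	/* page-aligned mapping */

	if (ip <= tramp_vaddr || ip > tramp_vaddr + TRAMP_PAGE_SIZE)
		return -1;
	/* Subtract 1 so the end-of-slot address maps back into that slot. */
	return (long)((ip - tramp_vaddr - 1) / TRAMP_SLOT_SIZE);
}
```

The index returned here would then select an entry in the per-mm slot table to recover the original probe address.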
> >
> > The uprobe syscall does not return to the trampoline slot. The handler
> > restores the probe-site register state, runs the uprobe consumers, sets
> > pt_regs to continue at probe_addr + 5 unless a consumer redirected
> > execution, and returns directly through the IRET path. This preserves
> > general purpose registers, including rcx and r11, without requiring any
> > post-syscall cleanup code in the trampoline and avoids call/ret, RSB, and
> > shadow stack concerns.
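The stack-pointer bookkeeping implied by this paragraph can be checked in isolation: the slot moves rsp down by 128 bytes and pushes three 8-byte registers, and the handler undoes both steps. A small sketch, assuming x86-64 register width and using hypothetical names (saved_regs, sp_at_syscall, sp_restored):

```c
#include <assert.h>
#include <stdint.h>

#define REDZONE_SIZE 128u	/* x86-64 red zone size named in the description */

/* Stand-in for the three registers the slot pushes (rcx, r11, rax). */
struct saved_regs {
	uint64_t ax;	/* pushed last, so it sits at the lowest address */
	uint64_t r11;
	uint64_t cx;
};

/* SP seen by the syscall handler: after lea -128(%rsp) and three pushes. */
static uint64_t sp_at_syscall(uint64_t probe_sp)
{
	assert(probe_sp >= REDZONE_SIZE + sizeof(struct saved_regs));
	return probe_sp - REDZONE_SIZE - sizeof(struct saved_regs);
}

/* Inverse step: skip the saved registers and the red zone again. */
static uint64_t sp_restored(uint64_t syscall_sp)
{
	return syscall_sp + sizeof(struct saved_regs) + REDZONE_SIZE;
}
```

The two helpers are exact inverses, which is why no cleanup code is needed in the trampoline itself: the handler alone can reconstruct the probe-site stack pointer.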
> >
> > Protect the per-mm trampoline list with RCU and free trampoline metadata
> > with kfree_rcu(). This lets the syscall path resolve trampoline slots
> > without taking mmap_lock. The optimized-instruction detection path also
> > walks the trampoline list under an RCU read-side lock. Since that path
> > starts from the JMP target, it translates the slot start to the post-syscall
> > IP expected by the shared resolver before checking the trampoline mapping.
> >
> > Each trampoline page provides 256 slots. Slots stay permanently assigned
> > to their first probe address and are reused only when the same address is
> > probed again. Reassigning detached slots is deliberately avoided because a
> > thread can remain in a trampoline for an unbounded time due to ptrace,
> > interrupts, or scheduling delays. If a reachable trampoline page runs out
> > of slots, probes that cannot allocate a slot fall back to the slower INT3
> > path.
> >
> > Require the entire trampoline page to be reachable by a rel32 JMP before
> > reusing it for a probe. This keeps every slot in the page within the range
> > that can be encoded at the probe site.
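Checking only the first byte of the page is not enough, because a slot near the end of the page could fall just outside the rel32 range. A user-space sketch of the whole-page check, assuming 4 KiB pages and using hypothetical names (reachable_by_jmp, page_reachable):

```c
#include <assert.h>
#include <stdint.h>

#define JMP32_SIZE 5u		/* opcode byte + 32-bit displacement */
#define TRAMP_PAGE_SIZE 4096u	/* assumption: 4 KiB pages */

/*
 * A rel32 JMP at src can reach dst when the displacement, measured from
 * the end of the 5-byte instruction, fits in a signed 32-bit value.
 */
static int reachable_by_jmp(uint64_t dst, uint64_t src)
{
	int64_t delta = (int64_t)(dst - (src + JMP32_SIZE));

	return delta >= INT32_MIN && delta <= INT32_MAX;
}

/* Require both the first and the last byte of the page to be reachable. */
static int page_reachable(uint64_t tramp, uint64_t probe)
{
	assert(tramp % TRAMP_PAGE_SIZE == 0);	/* page-aligned mapping */

	return reachable_by_jmp(tramp, probe) &&
	       reachable_by_jmp(tramp + TRAMP_PAGE_SIZE - 1, probe);
}
```

A page whose start is barely in range but whose end is not gets rejected as a whole, so any slot later assigned from it is guaranteed to be encodable.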
> >
> > Change the error code returned when the uprobe syscall is invoked outside
> > a kernel-generated trampoline from -ENXIO to -EPROTO. This lets libbpf and
> > similar libraries distinguish fixed kernels from kernels with the
> > red-zone-clobbering implementation and enable nop5 optimization only on
> > fixed kernels.
> >
> > Performance (usdt single-thread, M/s):
> >
> > usdt-nop usdt-nop5-base usdt-nop5-fix nop5-change iret%
> > Skylake 3.149 6.422 4.865 -24.3% 39.1%
> > Milan 2.910 3.443 3.820 +11.0% 24.3%
> > Sapphire Rapids 1.896 4.023 3.693 -8.2% 24.9%
> > Bergamo 3.393 3.895 3.849 -1.2% 24.5%
> >
> > The fixed nop5 path remains faster than the non-optimized INT3 path on all
> > measured systems. The regression relative to the old CALL-based trampoline
> > comes from IRET being more expensive than SYSRET, most noticeably on older
> > Intel Skylake. Newer Intel CPUs and tested AMD CPUs have lower IRET cost,
> > and AMD Milan improves because removing mmap_lock from the hot path more
> > than offsets the IRET cost.
> >
> > Multi-threaded throughput scales nearly linearly with the number of CPUs, like
> > it used to, thanks to lockless RCU-protected uprobe trampoline lookup.
>
> hi,
> thanks a lot for the fix
>
> FWIW we discussed also an option to have 10-bytes nop and do:
> [rsp+0x80, call trampoline]
>
> we would not need the slots re-use logic, but not sure what other
> surprises there are with 10-bytes nop
Does this mean we have to update the USDT implementation?
>
> I tried that change [1], it seems to work, but it has other
> difficulties, like I think the unoptimized path needs to do:
> [rsp+0x80, call trampoline] -> [jmp end of 10-bytes nop]
> instead of patching back the 10-byte nop, because some thread
> could be inside the nop area already.
Yeah, but at that moment, we know where the modified code is.
A memory dump may show different code, but that is also true
while a uprobe is active. So I think it is OK.
Thanks,
>
> jirka
>
>
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git/commit/?h=redzone_fix&id=74b09240289dba8368c2783b771e678b2cc31574
>
> >
> > Fixes: ba2bfc97b462 ("uprobes/x86: Add support to optimize uprobes")
> > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > ---
> > arch/x86/include/asm/uprobes.h | 18 ++
> > arch/x86/kernel/uprobes.c | 262 ++++++++++--------
> > tools/lib/bpf/features.c | 8 +-
> > .../selftests/bpf/prog_tests/uprobe_syscall.c | 5 +-
> > tools/testing/selftests/bpf/prog_tests/usdt.c | 2 +-
> > 5 files changed, 181 insertions(+), 114 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/uprobes.h b/arch/x86/include/asm/uprobes.h
> > index 362210c79998..a7cf5c92d95a 100644
> > --- a/arch/x86/include/asm/uprobes.h
> > +++ b/arch/x86/include/asm/uprobes.h
> > @@ -25,6 +25,24 @@ enum {
> > ARCH_UPROBE_FLAG_OPTIMIZE_FAIL = 1,
> > };
> >
> > +/*
> > + * Trampoline page layout: identical 16-byte slots, each containing:
> > + * lea -128(%rsp), %rsp (5B) skip red zone
> > + * push %rcx (1B) save (syscall clobbers)
> > + * push %r11 (2B) save (syscall clobbers)
> > + * push %rax (1B) save (syscall uses for nr)
> > + * mov $336, %eax (5B) uprobe syscall number
> > + * syscall (2B)
> > + * = 16B, no padding needed
> > + *
> > + * The handler identifies which probe fired from regs->ip (each
> > + * slot is at a unique offset), looks up the probe address from a
> > + * per-process table, and returns directly to probe_addr+5 via iret
> > + * with all registers restored.
> > + */
> > +#define UPROBE_TRAMP_SLOT_SIZE 16
> > +#define UPROBE_TRAMP_MAX_SLOTS (PAGE_SIZE / UPROBE_TRAMP_SLOT_SIZE)
> > +
> > struct uprobe_xol_ops;
> >
> > struct arch_uprobe {
> > diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
> > index ebb1baf1eb1d..7e1f14200bbb 100644
> > --- a/arch/x86/kernel/uprobes.c
> > +++ b/arch/x86/kernel/uprobes.c
> > @@ -633,16 +633,25 @@ static struct vm_special_mapping tramp_mapping = {
> >
> > struct uprobe_trampoline {
> > struct hlist_node node;
> > + struct rcu_head rcu;
> > unsigned long vaddr;
> > + unsigned long probe_addrs[UPROBE_TRAMP_MAX_SLOTS];
> > };
> >
> > -static bool is_reachable_by_call(unsigned long vtramp, unsigned long vaddr)
> > +
> > +static bool is_reachable_by_jmp(unsigned long dst, unsigned long src)
> > {
> > - long delta = (long)(vaddr + 5 - vtramp);
> > + long delta = (long)(dst - (src + JMP32_INSN_SIZE));
> >
> > return delta >= INT_MIN && delta <= INT_MAX;
> > }
> >
> > +static bool is_reachable_by_trampoline(unsigned long vtramp, unsigned long vaddr)
> > +{
> > + return is_reachable_by_jmp(vtramp, vaddr) &&
> > + is_reachable_by_jmp(vtramp + PAGE_SIZE - 1, vaddr);
> > +}
> > +
> > static unsigned long find_nearest_trampoline(unsigned long vaddr)
> > {
> > struct vm_unmapped_area_info info = {
> > @@ -711,6 +720,21 @@ static struct uprobe_trampoline *create_uprobe_trampoline(unsigned long vaddr)
> > return tramp;
> > }
> >
> > +static int tramp_alloc_slot(struct uprobe_trampoline *tramp, unsigned long probe_addr)
> > +{
> > + int i;
> > +
> > + for (i = 0; i < UPROBE_TRAMP_MAX_SLOTS; i++) {
> > + if (tramp->probe_addrs[i] == probe_addr)
> > + return i;
> > + if (tramp->probe_addrs[i] == 0) {
> > + tramp->probe_addrs[i] = probe_addr;
> > + return i;
> > + }
> > + }
> > + return -ENOSPC;
> > +}
> > +
> > static struct uprobe_trampoline *get_uprobe_trampoline(unsigned long vaddr, bool *new)
> > {
> > struct uprobes_state *state = &current->mm->uprobes_state;
> > @@ -720,7 +744,7 @@ static struct uprobe_trampoline *get_uprobe_trampoline(unsigned long vaddr, bool
> > return NULL;
> >
> > hlist_for_each_entry(tramp, &state->head_tramps, node) {
> > - if (is_reachable_by_call(tramp->vaddr, vaddr)) {
> > + if (is_reachable_by_trampoline(tramp->vaddr, vaddr)) {
> > *new = false;
> > return tramp;
> > }
> > @@ -731,7 +755,7 @@ static struct uprobe_trampoline *get_uprobe_trampoline(unsigned long vaddr, bool
> > return NULL;
> >
> > *new = true;
> > - hlist_add_head(&tramp->node, &state->head_tramps);
> > + hlist_add_head_rcu(&tramp->node, &state->head_tramps);
> > return tramp;
> > }
> >
> > @@ -742,8 +766,8 @@ static void destroy_uprobe_trampoline(struct uprobe_trampoline *tramp)
> > * because there's no easy way to make sure none of the threads
> > * is still inside the trampoline.
> > */
> > - hlist_del(&tramp->node);
> > - kfree(tramp);
> > + hlist_del_rcu(&tramp->node);
> > + kfree_rcu(tramp, rcu);
> > }
> >
> > void arch_uprobe_init_state(struct mm_struct *mm)
> > @@ -761,147 +785,153 @@ void arch_uprobe_clear_state(struct mm_struct *mm)
> > destroy_uprobe_trampoline(tramp);
> > }
> >
> > -static bool __in_uprobe_trampoline(unsigned long ip)
> > +/*
> > + * Find the trampoline containing @ip. If @probe_addr is non-NULL, also
> > + * resolve the slot index from @ip and return the probe address.
> > + *
> > + * @ip is expected to point right after the syscall instruction, i.e.,
> > + * at the end of the slot (slot_start + UPROBE_TRAMP_SLOT_SIZE).
> > + */
> > +static bool resolve_uprobe_addr(unsigned long ip, unsigned long *probe_addr)
> > {
> > - struct vm_area_struct *vma = vma_lookup(current->mm, ip);
> > + struct uprobes_state *state = &current->mm->uprobes_state;
> > + struct uprobe_trampoline *tramp;
> >
> > - return vma && vma_is_special_mapping(vma, &tramp_mapping);
> > -}
> > + hlist_for_each_entry_rcu(tramp, &state->head_tramps, node) {
> > + /*
> > + * ip points to after syscall, so it's on 16 byte boundary,
> > + * which means that valid ip can point right after the page
> > + * and should never be at zero offset within the page
> > + */
> > + if (ip <= tramp->vaddr || ip > tramp->vaddr + PAGE_SIZE)
> > + continue;
> >
> > -static bool in_uprobe_trampoline(unsigned long ip)
> > -{
> > - struct mm_struct *mm = current->mm;
> > - bool found, retry = true;
> > - unsigned int seq;
> > + if (probe_addr) {
> > + /* we already validated ip is within expected range */
> > + unsigned int slot = (ip - tramp->vaddr - 1) / UPROBE_TRAMP_SLOT_SIZE;
> > + unsigned long addr = tramp->probe_addrs[slot];
> >
> > - rcu_read_lock();
> > - if (mmap_lock_speculate_try_begin(mm, &seq)) {
> > - found = __in_uprobe_trampoline(ip);
> > - retry = mmap_lock_speculate_retry(mm, seq);
> > - }
> > - rcu_read_unlock();
> > + *probe_addr = addr;
> > + if (addr == 0)
> > + return false;
> > + }
> >
> > - if (retry) {
> > - mmap_read_lock(mm);
> > - found = __in_uprobe_trampoline(ip);
> > - mmap_read_unlock(mm);
> > + return true;
> > }
> > - return found;
> > + return false;
> > +}
> > +
> > +static bool in_uprobe_trampoline(unsigned long ip, unsigned long *probe_addr)
> > +{
> > + guard(rcu)();
> > + return resolve_uprobe_addr(ip, probe_addr);
> > }
> >
> > /*
> > - * See uprobe syscall trampoline; the call to the trampoline will push
> > - * the return address on the stack, the trampoline itself then pushes
> > - * cx, r11 and ax.
> > + * The trampoline slot pushes cx, r11, ax (the registers syscall clobbers)
> > + * before doing the uprobe syscall. No return address is pushed — the
> > + * probe site uses jmp, not call.
> > */
> > struct uprobe_syscall_args {
> > unsigned long ax;
> > unsigned long r11;
> > unsigned long cx;
> > - unsigned long retaddr;
> > };
> >
> > +#define UPROBE_TRAMP_REDZONE 128
> > +
> > SYSCALL_DEFINE0(uprobe)
> > {
> > struct pt_regs *regs = task_pt_regs(current);
> > struct uprobe_syscall_args args;
> > - unsigned long ip, sp, sret;
> > + unsigned long probe_addr;
> > int err;
> >
> > /* Allow execution only from uprobe trampolines. */
> > - if (!in_uprobe_trampoline(regs->ip))
> > - return -ENXIO;
> > + if (!in_uprobe_trampoline(regs->ip, &probe_addr))
> > + return -EPROTO;
> >
> > err = copy_from_user(&args, (void __user *)regs->sp, sizeof(args));
> > if (err)
> > goto sigill;
> >
> > - ip = regs->ip;
> > -
> > /*
> > - * expose the "right" values of ax/r11/cx/ip/sp to uprobe_consumer/s, plus:
> > - * - adjust ip to the probe address, call saved next instruction address
> > - * - adjust sp to the probe's stack frame (check trampoline code)
> > + * Restore the register state as it was at the probe site:
> > + * - ax/r11/cx from the trampoline-saved copies on user stack
> > + * - adjust ip to the probe address based on matching slot
> > + * - adjust sp to skip red zone and pushed args
> > */
> > regs->ax = args.ax;
> > regs->r11 = args.r11;
> > regs->cx = args.cx;
> > - regs->ip = args.retaddr - 5;
> > - regs->sp += sizeof(args);
> > + regs->ip = probe_addr;
> > + regs->sp += sizeof(args) + UPROBE_TRAMP_REDZONE;
> > regs->orig_ax = -1;
> >
> > - sp = regs->sp;
> > -
> > - err = shstk_pop((u64 *)&sret);
> > - if (err == -EFAULT || (!err && sret != args.retaddr))
> > - goto sigill;
> > -
> > - handle_syscall_uprobe(regs, regs->ip);
> > + handle_syscall_uprobe(regs, probe_addr);
> >
> > /*
> > - * Some of the uprobe consumers has changed sp, we can do nothing,
> > - * just return via iret.
> > + * Skip the jmp instruction at the probe site (5 bytes) unless
> > + * a consumer redirected execution elsewhere.
> > */
> > - if (regs->sp != sp) {
> > - /* skip the trampoline call */
> > - if (args.retaddr - 5 == regs->ip)
> > - regs->ip += 5;
> > - return regs->ax;
> > - }
> > + if (regs->ip == probe_addr)
> > + regs->ip = probe_addr + 5;
> >
> > - regs->sp -= sizeof(args);
> > -
> > - /* for the case uprobe_consumer has changed ax/r11/cx */
> > - args.ax = regs->ax;
> > - args.r11 = regs->r11;
> > - args.cx = regs->cx;
> > -
> > - /* keep return address unless we are instructed otherwise */
> > - if (args.retaddr - 5 != regs->ip)
> > - args.retaddr = regs->ip;
> > -
> > - if (shstk_push(args.retaddr) == -EFAULT)
> > - goto sigill;
> > -
> > - regs->ip = ip;
> > -
> > - err = copy_to_user((void __user *)regs->sp, &args, sizeof(args));
> > - if (err)
> > - goto sigill;
> > -
> > - /* ensure sysret, see do_syscall_64() */
> > - regs->r11 = regs->flags;
> > - regs->cx = regs->ip;
> > - return 0;
> > + /*
> > + * Return via iret by returning regs->ax. This preserves all
> > + * GP registers (including cx and r11) without needing any
> > + * user-space cleanup code. The iret path is used because we
> > + * don't set up cx/r11 for sysret.
> > + */
> > + return regs->ax;
> >
> > sigill:
> > force_sig(SIGILL);
> > return -1;
> > }
> >
> > +/*
> > + * All uprobe trampoline slots are identical: skip the red zone,
> > + * save the three registers that syscall clobbers, then invoke
> > + * the uprobe syscall. The handler returns directly to the probe
> > + * caller via iret. Execution never returns to the trampoline.
> > + */
> > asm (
> > ".pushsection .rodata\n"
> > - ".balign " __stringify(PAGE_SIZE) "\n"
> > - "uprobe_trampoline_entry:\n"
> > + ".balign " __stringify(UPROBE_TRAMP_SLOT_SIZE) "\n"
> > + "uprobe_trampoline_slot:\n"
> > + "lea -128(%rsp), %rsp\n"
> > "push %rcx\n"
> > "push %r11\n"
> > "push %rax\n"
> > - "mov $" __stringify(__NR_uprobe) ", %rax\n"
> > + "mov $" __stringify(__NR_uprobe) ", %eax\n"
> > "syscall\n"
> > - "pop %rax\n"
> > - "pop %r11\n"
> > - "pop %rcx\n"
> > - "ret\n"
> > - "int3\n"
> > - ".balign " __stringify(PAGE_SIZE) "\n"
> > + "uprobe_trampoline_slot_end:\n"
> > ".popsection\n"
> > );
> >
> > -extern u8 uprobe_trampoline_entry[];
> > +extern u8 uprobe_trampoline_slot[];
> > +extern u8 uprobe_trampoline_slot_end[];
> >
> > static int __init arch_uprobes_init(void)
> > {
> > - tramp_mapping_pages[0] = virt_to_page(uprobe_trampoline_entry);
> > + unsigned int slot_size = uprobe_trampoline_slot_end - uprobe_trampoline_slot;
> > + struct page *page;
> > + u8 *page_addr;
> > + int i;
> > +
> > + BUILD_BUG_ON(UPROBE_TRAMP_SLOT_SIZE != 16);
> > + WARN_ON_ONCE(slot_size != UPROBE_TRAMP_SLOT_SIZE);
> > +
> > + page = alloc_page(GFP_KERNEL);
> > + if (!page)
> > + return -ENOMEM;
> > +
> > + page_addr = page_address(page);
> > + for (i = 0; i < UPROBE_TRAMP_MAX_SLOTS; i++)
> > + memcpy(page_addr + i * UPROBE_TRAMP_SLOT_SIZE, uprobe_trampoline_slot, slot_size);
> > +
> > + tramp_mapping_pages[0] = page;
> > return 0;
> > }
> >
> > @@ -909,7 +939,7 @@ late_initcall(arch_uprobes_init);
> >
> > enum {
> > EXPECT_SWBP,
> > - EXPECT_CALL,
> > + EXPECT_JMP,
> > };
> >
> > struct write_opcode_ctx {
> > @@ -917,14 +947,14 @@ struct write_opcode_ctx {
> > int expect;
> > };
> >
> > -static int is_call_insn(uprobe_opcode_t *insn)
> > +static int is_jmp_insn(uprobe_opcode_t *insn)
> > {
> > - return *insn == CALL_INSN_OPCODE;
> > + return *insn == JMP32_INSN_OPCODE;
> > }
> >
> > /*
> > * Verification callback used by int3_update uprobe_write calls to make sure
> > - * the underlying instruction is as expected - either int3 or call.
> > + * the underlying instruction is as expected - either int3 or jmp.
> > */
> > static int verify_insn(struct page *page, unsigned long vaddr, uprobe_opcode_t *new_opcode,
> > int nbytes, void *data)
> > @@ -939,8 +969,8 @@ static int verify_insn(struct page *page, unsigned long vaddr, uprobe_opcode_t *
> > if (is_swbp_insn(&old_opcode[0]))
> > return 1;
> > break;
> > - case EXPECT_CALL:
> > - if (is_call_insn(&old_opcode[0]))
> > + case EXPECT_JMP:
> > + if (is_jmp_insn(&old_opcode[0]))
> > return 1;
> > break;
> > }
> > @@ -978,7 +1008,7 @@ static int int3_update(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
> > * so we can skip this step for optimize == true.
> > */
> > if (!optimize) {
> > - ctx.expect = EXPECT_CALL;
> > + ctx.expect = EXPECT_JMP;
> > err = uprobe_write(auprobe, vma, vaddr, &int3, 1, verify_insn,
> > true /* is_register */, false /* do_update_ref_ctr */,
> > &ctx);
> > @@ -1015,13 +1045,13 @@ static int int3_update(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
> > }
> >
> > static int swbp_optimize(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
> > - unsigned long vaddr, unsigned long tramp)
> > + unsigned long vaddr, unsigned long slot_vaddr)
> > {
> > - u8 call[5];
> > + u8 jmp[5];
> >
> > - __text_gen_insn(call, CALL_INSN_OPCODE, (const void *) vaddr,
> > - (const void *) tramp, CALL_INSN_SIZE);
> > - return int3_update(auprobe, vma, vaddr, call, true /* optimize */);
> > + __text_gen_insn(jmp, JMP32_INSN_OPCODE, (const void *) vaddr,
> > + (const void *) slot_vaddr, JMP32_INSN_SIZE);
> > + return int3_update(auprobe, vma, vaddr, jmp, true /* optimize */);
> > }
> >
> > static int swbp_unoptimize(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
> > @@ -1049,11 +1079,17 @@ static bool __is_optimized(uprobe_opcode_t *insn, unsigned long vaddr)
> > struct __packed __arch_relative_insn {
> > u8 op;
> > s32 raddr;
> > - } *call = (struct __arch_relative_insn *) insn;
> > + } *jmp = (struct __arch_relative_insn *) insn;
> >
> > - if (!is_call_insn(insn))
> > + if (!is_jmp_insn(&jmp->op))
> > return false;
> > - return __in_uprobe_trampoline(vaddr + 5 + call->raddr);
> > +
> > + guard(rcu)();
> > + /*
> > + * resolve_uprobe_addr() expects IP pointing after syscall instruction
> > + * (after the slot, basically), so adjust jump target address accordingly
> > + */
> > + return resolve_uprobe_addr(vaddr + 5 + jmp->raddr + UPROBE_TRAMP_SLOT_SIZE, NULL);
> > }
> >
> > static int is_optimized(struct mm_struct *mm, unsigned long vaddr)
> > @@ -1113,8 +1149,9 @@ static int __arch_uprobe_optimize(struct arch_uprobe *auprobe, struct mm_struct
> > {
> > struct uprobe_trampoline *tramp;
> > struct vm_area_struct *vma;
> > + unsigned long slot_vaddr;
> > bool new = false;
> > - int err = 0;
> > + int slot, err;
> >
> > vma = find_vma(mm, vaddr);
> > if (!vma)
> > @@ -1122,8 +1159,17 @@ static int __arch_uprobe_optimize(struct arch_uprobe *auprobe, struct mm_struct
> > tramp = get_uprobe_trampoline(vaddr, &new);
> > if (!tramp)
> > return -EINVAL;
> > - err = swbp_optimize(auprobe, vma, vaddr, tramp->vaddr);
> > - if (WARN_ON_ONCE(err) && new)
> > +
> > + slot = tramp_alloc_slot(tramp, vaddr);
> > + if (slot < 0) {
> > + if (new)
> > + destroy_uprobe_trampoline(tramp);
> > + return slot;
> > + }
> > +
> > + slot_vaddr = tramp->vaddr + slot * UPROBE_TRAMP_SLOT_SIZE;
> > + err = swbp_optimize(auprobe, vma, vaddr, slot_vaddr);
> > + if (err && new)
> > destroy_uprobe_trampoline(tramp);
> > return err;
> > }
> > diff --git a/tools/lib/bpf/features.c b/tools/lib/bpf/features.c
> > index 4f19a0d79b0c..1b6c113357b2 100644
> > --- a/tools/lib/bpf/features.c
> > +++ b/tools/lib/bpf/features.c
> > @@ -577,10 +577,12 @@ static int probe_ldimm64_full_range_off(int token_fd)
> > static int probe_uprobe_syscall(int token_fd)
> > {
> > /*
> > - * If kernel supports uprobe() syscall, it will return -ENXIO when called
> > - * from the outside of a kernel-generated uprobe trampoline.
> > + * If kernel supports uprobe() syscall, it will return -EPROTO when
> > + * called from outside a kernel-generated uprobe trampoline.
> > + * Older kernels with the red-zone-clobbering bug return -ENXIO;
> > + * we only enable the nop5 optimization on fixed kernels.
> > */
> > - return syscall(__NR_uprobe) < 0 && errno == ENXIO;
> > + return syscall(__NR_uprobe) < 0 && errno == EPROTO;
> > }
> > #else
> > static int probe_uprobe_syscall(int token_fd)
> > diff --git a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
> > index 955a37751b52..0d5eb4cd1ddf 100644
> > --- a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
> > +++ b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
> > @@ -422,7 +422,8 @@ static void *check_attach(struct uprobe_syscall_executed *skel, trigger_t trigge
> > /* .. and check the trampoline is as expected. */
> > call = (struct __arch_relative_insn *) addr;
> > tramp = (void *) (call + 1) + call->raddr;
> > - ASSERT_EQ(call->op, 0xe8, "call");
> > + tramp = (void *)((unsigned long)tramp & ~(getpagesize() - 1UL));
> > + ASSERT_EQ(call->op, 0xe9, "jmp");
> > ASSERT_OK(find_uprobes_trampoline(tramp), "uprobes_trampoline");
> >
> > return tramp;
> > @@ -762,7 +763,7 @@ static void test_uprobe_error(void)
> > long err = syscall(__NR_uprobe);
> >
> > ASSERT_EQ(err, -1, "error");
> > - ASSERT_EQ(errno, ENXIO, "errno");
> > + ASSERT_EQ(errno, EPROTO, "errno");
> > }
> >
> > static void __test_uprobe_syscall(void)
> > diff --git a/tools/testing/selftests/bpf/prog_tests/usdt.c b/tools/testing/selftests/bpf/prog_tests/usdt.c
> > index 69759b27794d..9d3744d4e936 100644
> > --- a/tools/testing/selftests/bpf/prog_tests/usdt.c
> > +++ b/tools/testing/selftests/bpf/prog_tests/usdt.c
> > @@ -329,7 +329,7 @@ static void subtest_optimized_attach(void)
> > ASSERT_EQ(*addr_2, 0x90, "nop");
> >
> > /* call is on addr_2 + 1 address */
> > - ASSERT_EQ(*(addr_2 + 1), 0xe8, "call");
> > + ASSERT_EQ(*(addr_2 + 1), 0xe9, "jmp");
> > ASSERT_EQ(skel->bss->executed, 4, "executed");
> >
> > cleanup:
> > --
> > 2.53.0-Meta
> >
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
* Re: [PATCH mm-unstable v17 03/14] mm/khugepaged: rework max_ptes_* handling with helper functions
From: Lance Yang @ 2026-05-12 4:44 UTC
To: npache
Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
lance.yang, liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat,
mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
zokeefe, usama.arif
In-Reply-To: <20260511185817.686831-4-npache@redhat.com>
On Mon, May 11, 2026 at 12:58:03PM -0600, Nico Pache wrote:
>The following cleanup reworks all the max_ptes_* handling into helper
>functions. This increases the code readability and will later be used to
>implement the mTHP handling of these variables.
>
>With these changes we abstract all the madvise_collapse() special casing
>(dont respect the sysctls) away from the functions that utilize them. And
Nit: s/dont/do not/
>will be used later in this series to cleanly restrict the mTHP collapse
>behavior.
>
>No functional change is intended; however, we are now only reading the
>sysfs variables once per scan, whereas before these variables were being
>read on each loop iteration.
>
>Suggested-by: David Hildenbrand <david@kernel.org>
>Acked-by: David Hildenbrand (Arm) <david@kernel.org>
>Acked-by: Usama Arif <usama.arif@linux.dev>
>Signed-off-by: Nico Pache <npache@redhat.com>
>---
> mm/khugepaged.c | 118 +++++++++++++++++++++++++++++++++---------------
> 1 file changed, 82 insertions(+), 36 deletions(-)
>
>diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>index f0e29d5c7b1f..f68853b3caa7 100644
>--- a/mm/khugepaged.c
>+++ b/mm/khugepaged.c
>@@ -348,6 +348,62 @@ static bool pte_none_or_zero(pte_t pte)
> return pte_present(pte) && is_zero_pfn(pte_pfn(pte));
> }
>
>+/**
>+ * collapse_max_ptes_none - Calculate maximum allowed none-page or zero-page
>+ * PTEs for the given collapse operation.
>+ * @cc: The collapse control struct
>+ * @vma: The vma to check for userfaultfd
>+ *
>+ * Return: Maximum number of none-page or zero-page PTEs allowed for the
>+ * collapse operation.
>+ */
>+static unsigned int collapse_max_ptes_none(struct collapse_control *cc,
>+ struct vm_area_struct *vma)
>+{
>+ // If the vma is userfaultfd-armed, allow no none-page or zero-page PTEs.
>+ if (vma && userfaultfd_armed(vma))
>+ return 0;
>+ // for MADV_COLLAPSE, allow any none-page or zero-page PTEs.
>+ if (!cc->is_khugepaged)
>+ return HPAGE_PMD_NR;
>+ // For all other cases repect the user defined maximum.
>+ return khugepaged_max_ptes_none;
Nit: kernel code usually uses C-style comments. This could be:
/* For all other cases, respect the user-defined maximum. */
Also, s/repect/respect/.
>+}
>+
>+/**
>+ * collapse_max_ptes_shared - Calculate maximum allowed PTEs that map shared
>+ * anonymous pages for the given collapse operation.
>+ * @cc: The collapse control struct
>+ *
>+ * Return: Maximum number of PTEs that map shared anonymous pages for the
>+ * collapse operation
>+ */
>+static unsigned int collapse_max_ptes_shared(struct collapse_control *cc)
>+{
>+ // for MADV_COLLAPSE, do not restrict the number of PTEs that map shared
>+ // anonymous pages.
Ditto.
>+ if (!cc->is_khugepaged)
>+ return HPAGE_PMD_NR;
>+ return khugepaged_max_ptes_shared;
>+}
>+
>+/**
>+ * collapse_max_ptes_swap - Calculate the maximum allowed non-present PTEs or the
>+ * maximum allowed non-present pagecache entries for the given collapse operation.
>+ * @cc: The collapse control struct
>+ *
>+ * Return: Maximum number of non-present PTEs or the maximum allowed non-present
>+ * pagecache entries for the collapse operation.
>+ */
>+static unsigned int collapse_max_ptes_swap(struct collapse_control *cc)
>+{
>+ // for MADV_COLLAPSE, do not restrict the number PTEs entries or
>+ // pagecache entries that are non-present.
Same here.
>+ if (!cc->is_khugepaged)
>+ return HPAGE_PMD_NR;
>+ return khugepaged_max_ptes_swap;
>+}
>+
> int hugepage_madvise(struct vm_area_struct *vma,
> vm_flags_t *vm_flags, int advice)
> {
>@@ -546,21 +602,19 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
> pte_t *_pte;
> int none_or_zero = 0, shared = 0, referenced = 0;
> enum scan_result result = SCAN_FAIL;
>+ unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma);
>+ unsigned int max_ptes_shared = collapse_max_ptes_shared(cc);
Nit: could these be const, as David suggested earlier?
Nothing else jumped out at me. LGTM!
Reviewed-by: Lance Yang <lance.yang@linux.dev>
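For illustration, the three-way decision in collapse_max_ptes_none()
could read as follows with C-style comments. This is a userspace sketch
rather than the kernel code: struct collapse_ctl, the uffd_armed flag,
the tunable parameter, and the HPAGE_PMD_NR value of 512 (4K base pages
with a 2M PMD) are stand-ins assumed here so the block builds on its own.

```c
#include <stdbool.h>

/* Assumed value: 512 PTEs per PMD (4K base pages, 2M huge page). */
#define HPAGE_PMD_NR 512u

/* Hypothetical stand-in for the kernel's struct collapse_control. */
struct collapse_ctl {
	bool is_khugepaged;
};

/*
 * uffd_armed stands in for userfaultfd_armed(vma), and tunable for
 * khugepaged_max_ptes_none; both are passed in to keep this testable.
 */
static unsigned int max_ptes_none(const struct collapse_ctl *cc,
				  bool uffd_armed, unsigned int tunable)
{
	/* A userfaultfd-armed VMA allows no none-page or zero-page PTEs. */
	if (uffd_armed)
		return 0;
	/* MADV_COLLAPSE allows any number of none-page or zero-page PTEs. */
	if (!cc->is_khugepaged)
		return HPAGE_PMD_NR;
	/* For all other cases, respect the user-defined maximum. */
	return tunable;
}
```

The order of the checks matters: the userfaultfd test comes first, so it
wins even for MADV_COLLAPSE.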
* Re: [PATCH v1 1/2] spi: qcom-geni: trace: Add trace events for Qualcomm GENI SPI
From: Praveen Talari @ 2026-05-12 3:43 UTC (permalink / raw)
To: Mark Brown
Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel, linux-arm-msm, linux-spi,
mukesh.savaliya, aniket.randive,
chandana.chiluveru, jyothi.seerapu
In-Reply-To: <agB8AgF3qVqDw60Z@sirena.co.uk>
Hi Mark,
On 10-05-2026 18:07, Mark Brown wrote:
> On Sat, May 09, 2026 at 07:37:26AM +0530, Praveen Talari wrote:
>
>> Could you also please review the changes made in spi.c ?
>> I would appreciate any feedback or suggestions you may have.
> Please just submit normal patches instead of sending partial patches in
> reply to another thread unless something is really unclear.
>
>> @@ -1658,6 +1658,11 @@ static int spi_transfer_one_message(struct
>> spi_controller *ctlr,
>>
>> trace_spi_transfer_stop(msg, xfer);
>>
>> + if (spi_valid_txbuf(msg, xfer))
>> + trace_spi_tx_data(msg->spi, xfer->tx_buf,
>> xfer->len);
>> + if (spi_valid_rxbuf(msg, xfer))
>> + trace_spi_rx_data(msg->spi, xfer->rx_buf,
>> xfer->len);
> It feels like it'd be more helpful to log the transmit data before we do
> the send.
I can see that TX/RX data tracepoints are already present in the core
layer, so I will drop the TX/RX data tracepoints from this series. I
will update the patch set accordingly.
trace log from spi.c:
spi_transfer_start: spi16.0 00000000631b0da2 len=16
tx=[d8-07-1f-c4-b3-b3-07-d6-a8-7b-33-6f-7b-bb-ae-9b]
rx=[d8-07-1f-c4-b3-b3-07-d6-a8-7b-33-6f-7b-bb-ae-9b]
Thanks,
Praveen Talari
* Re: [PATCH v19 0/7] ring-buffer: Making persistent ring buffers robust
From: Masami Hiramatsu @ 2026-05-12 0:54 UTC (permalink / raw)
To: Steven Rostedt
Cc: Catalin Marinas, Will Deacon, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel, Ian Rogers, linux-arm-kernel
In-Reply-To: <20260511122943.41e204bc@gandalf.local.home>
On Mon, 11 May 2026 12:29:43 -0400
Steven Rostedt <rostedt@kernel.org> wrote:
> On Thu, 7 May 2026 13:14:16 +0900
> Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
>
> > > I'll test this some more, and make a proper patch.
> >
> > Ah, indeed. Thanks for fixing!
> >
> > BTW, shouldn't we unify common logic of those functions?
>
> Hmm, there's not much common between the two. One is a consuming read and
> the other is a non-consuming read that needs to test for a bunch of race
> conditions.
>
> If you see something that can be shared, I'm all for it.
Maybe we can introduce a common inline function to calculate
max_loop, or at least replace "3" with a common macro.
Thank you,
>
> -- Steve
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
* Re: [PATCHv2] uprobes: Use flexible array for xol_area bitmap
From: Masami Hiramatsu @ 2026-05-12 0:48 UTC (permalink / raw)
To: Rosen Penev
Cc: linux-kernel, Peter Zijlstra, Ingo Molnar,
Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
James Clark, Masami Hiramatsu, Oleg Nesterov,
open list:PERFORMANCE EVENTS SUBSYSTEM, open list:UPROBES
In-Reply-To: <20260511225648.27886-1-rosenp@gmail.com>
On Mon, 11 May 2026 15:56:48 -0700
Rosen Penev <rosenp@gmail.com> wrote:
> The XOL slot bitmap has the same lifetime as struct xol_area, but it
> is currently allocated separately. That adds another allocation
> failure path and a matching cleanup branch without buying any extra
> flexibility.
>
> Store the bitmap as a flexible array member and allocate it together
> with the xol_area using kzalloc_flex(). The bitmap remains
> zero-initialized, while the allocation and error handling become
> simpler.
>
> Assisted-by: Codex:GPT-5.5
> Signed-off-by: Rosen Penev <rosenp@gmail.com>
OK, this looks good to me.
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Thanks,
> ---
> v2: add missing kfree
> kernel/events/uprobes.c | 14 +++-----------
> 1 file changed, 3 insertions(+), 11 deletions(-)
>
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index 4084e926e284..eba71700667e 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -108,7 +108,6 @@ static LIST_HEAD(delayed_uprobe_list);
> */
> struct xol_area {
> wait_queue_head_t wq; /* if all slots are busy */
> - unsigned long *bitmap; /* 0 = free slot */
>
> struct page *page;
> /*
> @@ -117,6 +116,7 @@ struct xol_area {
> * the vma go away, and we must handle that reasonably gracefully.
> */
> unsigned long vaddr; /* Page(s) of instruction slots */
> + unsigned long bitmap[]; /* 0 = free slot */
> };
>
> static void uprobe_warn(struct task_struct *t, const char *msg)
> @@ -1755,18 +1755,13 @@ static struct xol_area *__create_xol_area(unsigned long vaddr)
> struct xol_area *area;
> void *insns;
>
> - area = kzalloc_obj(*area);
> + area = kzalloc_flex(*area, bitmap, BITS_TO_LONGS(UINSNS_PER_PAGE));
> if (unlikely(!area))
> goto out;
>
> - area->bitmap = kcalloc(BITS_TO_LONGS(UINSNS_PER_PAGE), sizeof(long),
> - GFP_KERNEL);
> - if (!area->bitmap)
> - goto free_area;
> -
> area->page = alloc_page(GFP_HIGHUSER | __GFP_ZERO);
> if (!area->page)
> - goto free_bitmap;
> + goto free_area;
>
> area->vaddr = vaddr;
> init_waitqueue_head(&area->wq);
> @@ -1779,8 +1774,6 @@ static struct xol_area *__create_xol_area(unsigned long vaddr)
> return area;
>
> __free_page(area->page);
> - free_bitmap:
> - kfree(area->bitmap);
> free_area:
> kfree(area);
> out:
> @@ -1831,7 +1824,6 @@ void uprobe_clear_state(struct mm_struct *mm)
> return;
>
> put_page(area->page);
> - kfree(area->bitmap);
> kfree(area);
> }
>
> --
> 2.54.0
>
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
* Re: [PATCH 0/2] tools/bootconfig: render kernel.* subtree as a cmdline string
From: Masami Hiramatsu @ 2026-05-12 0:43 UTC (permalink / raw)
To: Breno Leitao
Cc: Andrew Morton, Masami Hiramatsu, linux-kernel, linux-trace-kernel,
paulmck, oss, kernel-team
In-Reply-To: <agIFZ0vSfedX7foN@gmail.com>
On Mon, 11 May 2026 09:38:25 -0700
Breno Leitao <leitao@debian.org> wrote:
> On Fri, May 08, 2026 at 02:56:41PM -0700, Andrew Morton wrote:
> > On Fri, 08 May 2026 06:55:02 -0700 Breno Leitao <leitao@debian.org> wrote:
> >
> > > Add a bootconfig -> kernel cmdline rendering capability shared between
> > > the kernel parser library and the userspace tools/bootconfig binary.
> > >
> > > The new userspace mode "tools/bootconfig -C <file>" walks a bootconfig
> > > file's "kernel" subtree and prints it as a flat, space-separated
> > > cmdline string suitable for direct use as (or appending to) a kernel
> > > command line.
> > >
> > > This series prepares tools/bootconfig and lib/bootconfig.c for an
> > > upcoming feature that lets the kernel build render an embedded
> > > bootconfig file's "kernel" subtree to a flat cmdline string and embed
> > > it in the kernel image.
> > >
> > > The follow-up series (sent separately) wires this into setup_arch() so
> > > early_param() handlers see values supplied via CONFIG_BOOT_CONFIG_EMBED_FILE,
> > > following Masami's suggestion in [1]
> > >
> > > These two patches are pure groundwork. They add no kernel feature,
> > > change no runtime behavior, and are useful on their own (the new
> > > "tools/bootconfig -C" mode lets anyone render a .bootconfig file to
> > > a cmdline string from the shell).
> > >
> > > Landing them independently lets the follow-up series focus on the
> > > kernel-side plumbing without dragging the refactor and tool addition
> > > through the same review cycle.
> >
> > I'll assume that Masami will process this, although
> > `scripts/get_maintainer.pl lib/bootconfig.c' doesn't mention a git
> > tree.
> >
> > https://sashiko.dev/#/patchset/20260508-bootconfig_using_tools-v1-0-1132219aa773@debian.org
> > says a bunch of picky things which seem pretty ignorable to me. Your
> > call ;)
>
> Well, these are some warnings about not checking that the output was
> properly set. From my view, these are not new to this patch, it is just the
> pattern we see, which is fire-and-forget writes to stdout, which seems
> quite reasonable.
I confirmed there was no problem with UBSAN.
"NULL + 1" is undefined, yes, but "(char *)NULL + 1" is sane.
>
> If we decide to fix this, it would make more sense to do it file-wide
> instead of just in this patch code.
So no need to fix it. Let me pick this series.
Thanks!
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
* Re: [bug report] bootconfig: init: Allow admin to use bootconfig for kernel command line
From: Masami Hiramatsu @ 2026-05-12 0:16 UTC (permalink / raw)
To: Dan Carpenter
Cc: kernel-janitors, Linux Trace Kernel, linux-kernel, Breno Leitao
In-Reply-To: <af4YTUrDM-ciyoa-@stanley.mountain>
Hi Dan,
Thanks for reporting. A similar problem was pointed out by Sashiko [1].
[1] https://sashiko.dev/#/patchset/20260508-bootconfig_using_tools-v1-0-1132219aa773%40debian.org
On Fri, 8 May 2026 20:07:25 +0300
Dan Carpenter <error27@gmail.com> wrote:
> Hello Masami Hiramatsu,
>
> Commit 51887d03aca1 ("bootconfig: init: Allow admin to use bootconfig
> for kernel command line") from Jan 11, 2020 (linux-next), leads to
> the following Smatch static checker warning:
>
> init/main.c:368 xbc_snprint_cmdline()
> use scnprintf() instead of snprintf()
>
> init/main.c
> 331 static int __init xbc_snprint_cmdline(char *buf, size_t size,
> 332 struct xbc_node *root)
> 333 {
> 334 struct xbc_node *knode, *vnode;
> 335 char *end = buf + size;
> 336 const char *val, *q;
> 337 int ret;
> 338
> 339 xbc_node_for_each_key_value(root, knode, val) {
> 340 ret = xbc_node_compose_key_after(root, knode,
> 341 xbc_namebuf, XBC_KEYLEN_MAX);
> 342 if (ret < 0)
> 343 return ret;
> 344
> 345 vnode = xbc_node_get_child(knode);
> 346 if (!vnode) {
> 347 ret = snprintf(buf, rest(buf, end), "%s ", xbc_namebuf);
> 348 if (ret < 0)
> 349 return ret;
> 350 buf += ret;
>
> In user space snprintf() can return negative, but in the kernel, no.
> It returns the number of bytes (not counting the NUL terminator) which
> would have been copied if there were enough space. So maybe you want
> to do something like:
>
> remain = rest(buf, end);
> ret = snprintf(buf, rest(buf, end), "%s ", xbc_namebuf);
> if (ret >= remain)
> return -ENOSPC;
Actually, we need to return the required buffer length when buf == NULL
or when the buffer is too small.
But as Sashiko pointed out, I need to check it with UBSAN. (I think that
even if @buf is NULL, it is a char *, so it is safe to add some value
to it...)
>
> Or maybe you might want to use scnprintf() which returns the number of
> bytes actually copied. Otherwise buf ends up pointing beyond the end
> of the buffer.
No, I need to calculate the required length of the buffer.
Thank you,
>
> 351 continue;
> 352 }
> 353 xbc_array_for_each_value(vnode, val) {
> 354 /*
> 355 * For prettier and more readable /proc/cmdline, only
> 356 * quote the value when necessary, i.e. when it contains
> 357 * whitespace.
> 358 */
> 359 q = strpbrk(val, " \t\r\n") ? "\"" : "";
> 360 ret = snprintf(buf, rest(buf, end), "%s=%s%s%s ",
> ^^^^^^^^^^^^^^^
> Same.
>
> 361 xbc_namebuf, q, val, q);
> 362 if (ret < 0)
> 363 return ret;
> 364 buf += ret;
> 365 }
> 366 }
> 367
> --> 368 return buf - (end - size);
> 369 }
>
> This email is a free service from the Smatch-CI project [smatch.sf.net].
>
> regards,
> dan carpenter
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
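The length-query behaviour discussed above can be sketched in userspace
C. This is a minimal illustration, not the kernel function: struct kv and
render_cmdline() are hypothetical names, and the sketch tracks a byte
offset instead of advancing the buffer pointer, so no arithmetic is ever
done on a NULL pointer. It relies only on the C99 guarantee that
snprintf(NULL, 0, ...) returns the length that would have been written.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical key/value pair standing in for an xbc node. */
struct kv {
	const char *key;
	const char *val;
};

/*
 * Render "key=value " pairs into buf. Returns the length that would be
 * written given enough space (snprintf semantics), or -1 on an output
 * error. Passing buf == NULL with size == 0 only queries the length.
 */
static int render_cmdline(char *buf, size_t size,
			  const struct kv *kvs, size_t n)
{
	size_t pos = 0;

	for (size_t i = 0; i < n; i++) {
		/* Quote the value only when it contains whitespace. */
		const char *q = strpbrk(kvs[i].val, " \t\r\n") ? "\"" : "";
		/* Once the buffer is full, keep counting with NULL/0. */
		char *dst = (buf && pos < size) ? buf + pos : NULL;
		size_t room = dst ? size - pos : 0;
		int ret = snprintf(dst, room, "%s=%s%s%s ",
				   kvs[i].key, q, kvs[i].val, q);

		if (ret < 0)
			return -1;
		pos += (size_t)ret;
	}
	return (int)pos;
}
```

A caller would first call render_cmdline(NULL, 0, ...) to get the
length, allocate length + 1 bytes, and call it again. A scnprintf()-style
return value (bytes actually copied) would make that first sizing pass
impossible, which is the point made in the reply.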
* Re: [PATCH 1/2] bootconfig: move xbc_snprint_cmdline() to lib/bootconfig.c
From: Masami Hiramatsu @ 2026-05-12 0:00 UTC (permalink / raw)
To: Breno Leitao
Cc: Andrew Morton, linux-kernel, linux-trace-kernel, paulmck, oss,
kernel-team
In-Reply-To: <20260508-bootconfig_using_tools-v1-1-1132219aa773@debian.org>
On Fri, 08 May 2026 06:55:03 -0700
Breno Leitao <leitao@debian.org> wrote:
> Move xbc_snprint_cmdline() from init/main.c to lib/bootconfig.c so the
> function (and its xbc_namebuf scratch buffer) becomes part of the shared
> parser library. tools/bootconfig already compiles lib/bootconfig.c
> directly, which lets a follow-up patch reuse the same renderer in the
> userspace tool to convert a bootconfig file into a flat cmdline string
> at build time.
>
> No functional change.
Yeah, this should be under lib/bootconfig.c
Thanks,
>
> Signed-off-by: Breno Leitao <leitao@debian.org>
> ---
> include/linux/bootconfig.h | 3 +++
> init/main.c | 45 -------------------------------------
> lib/bootconfig.c | 56 ++++++++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 59 insertions(+), 45 deletions(-)
>
> diff --git a/include/linux/bootconfig.h b/include/linux/bootconfig.h
> index 692a5acc2ffc4..1c7f3b74ffcf3 100644
> --- a/include/linux/bootconfig.h
> +++ b/include/linux/bootconfig.h
> @@ -265,6 +265,9 @@ static inline struct xbc_node * __init xbc_node_get_subkey(struct xbc_node *node
> int __init xbc_node_compose_key_after(struct xbc_node *root,
> struct xbc_node *node, char *buf, size_t size);
>
> +/* Render key/value pairs under @root as a flat cmdline string */
> +int __init xbc_snprint_cmdline(char *buf, size_t size, struct xbc_node *root);
> +
> /**
> * xbc_node_compose_key() - Compose full key string of the XBC node
> * @node: An XBC node.
> diff --git a/init/main.c b/init/main.c
> index 96f93bb06c490..e363232b428b4 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -324,51 +324,6 @@ static void * __init get_boot_config_from_initrd(size_t *_size)
>
> #ifdef CONFIG_BOOT_CONFIG
>
> -static char xbc_namebuf[XBC_KEYLEN_MAX] __initdata;
> -
> -#define rest(dst, end) ((end) > (dst) ? (end) - (dst) : 0)
> -
> -static int __init xbc_snprint_cmdline(char *buf, size_t size,
> - struct xbc_node *root)
> -{
> - struct xbc_node *knode, *vnode;
> - char *end = buf + size;
> - const char *val, *q;
> - int ret;
> -
> - xbc_node_for_each_key_value(root, knode, val) {
> - ret = xbc_node_compose_key_after(root, knode,
> - xbc_namebuf, XBC_KEYLEN_MAX);
> - if (ret < 0)
> - return ret;
> -
> - vnode = xbc_node_get_child(knode);
> - if (!vnode) {
> - ret = snprintf(buf, rest(buf, end), "%s ", xbc_namebuf);
> - if (ret < 0)
> - return ret;
> - buf += ret;
> - continue;
> - }
> - xbc_array_for_each_value(vnode, val) {
> - /*
> - * For prettier and more readable /proc/cmdline, only
> - * quote the value when necessary, i.e. when it contains
> - * whitespace.
> - */
> - q = strpbrk(val, " \t\r\n") ? "\"" : "";
> - ret = snprintf(buf, rest(buf, end), "%s=%s%s%s ",
> - xbc_namebuf, q, val, q);
> - if (ret < 0)
> - return ret;
> - buf += ret;
> - }
> - }
> -
> - return buf - (end - size);
> -}
> -#undef rest
> -
> /* Make an extra command line under given key word */
> static char * __init xbc_make_cmdline(const char *key)
> {
> diff --git a/lib/bootconfig.c b/lib/bootconfig.c
> index c470b93d5dbc2..f445b7703fdd9 100644
> --- a/lib/bootconfig.c
> +++ b/lib/bootconfig.c
> @@ -408,6 +408,62 @@ const char * __init xbc_node_find_next_key_value(struct xbc_node *root,
> return ""; /* No value key */
> }
>
> +static char xbc_namebuf[XBC_KEYLEN_MAX] __initdata;
> +
> +#define rest(dst, end) ((end) > (dst) ? (end) - (dst) : 0)
> +
> +/**
> + * xbc_snprint_cmdline() - Render bootconfig keys under @root as a cmdline string
> + * @buf: Destination buffer (may be NULL when @size is 0 to query the length)
> + * @size: Size of @buf in bytes
> + * @root: Subtree root whose key=value pairs should be rendered
> + *
> + * Walk all key/value pairs under @root and emit them as a space-separated
> + * cmdline string into @buf. Values containing whitespace are quoted with
> + * double quotes. Returns the number of bytes that would be written if @buf
> + * were large enough (matching snprintf semantics), or a negative errno on
> + * failure.
> + */
> +int __init xbc_snprint_cmdline(char *buf, size_t size, struct xbc_node *root)
> +{
> + struct xbc_node *knode, *vnode;
> + char *end = buf + size;
> + const char *val, *q;
> + int ret;
> +
> + xbc_node_for_each_key_value(root, knode, val) {
> + ret = xbc_node_compose_key_after(root, knode,
> + xbc_namebuf, XBC_KEYLEN_MAX);
> + if (ret < 0)
> + return ret;
> +
> + vnode = xbc_node_get_child(knode);
> + if (!vnode) {
> + ret = snprintf(buf, rest(buf, end), "%s ", xbc_namebuf);
> + if (ret < 0)
> + return ret;
> + buf += ret;
> + continue;
> + }
> + xbc_array_for_each_value(vnode, val) {
> + /*
> + * For prettier and more readable /proc/cmdline, only
> + * quote the value when necessary, i.e. when it contains
> + * whitespace.
> + */
> + q = strpbrk(val, " \t\r\n") ? "\"" : "";
> + ret = snprintf(buf, rest(buf, end), "%s=%s%s%s ",
> + xbc_namebuf, q, val, q);
> + if (ret < 0)
> + return ret;
> + buf += ret;
> + }
> + }
> +
> + return buf - (end - size);
> +}
> +#undef rest
> +
> /* XBC parse and tree build */
>
> static int __init xbc_init_node(struct xbc_node *node, char *data, uint16_t flag)
>
> --
> 2.53.0-Meta
>
>
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
* Re: [PATCH 2/2] tools/bootconfig: render kernel.* subtree as cmdline string with -C
From: Masami Hiramatsu @ 2026-05-12 0:00 UTC (permalink / raw)
To: Breno Leitao
Cc: Andrew Morton, linux-kernel, linux-trace-kernel, paulmck, oss,
kernel-team
In-Reply-To: <20260508-bootconfig_using_tools-v1-2-1132219aa773@debian.org>
On Fri, 08 May 2026 06:55:04 -0700
Breno Leitao <leitao@debian.org> wrote:
> Add a -C option that finds the "kernel" subtree of a bootconfig file
> and prints it as a flat, space-separated cmdline string by calling the
> shared xbc_snprint_cmdline() renderer. An empty or absent kernel.*
> subtree produces empty output and exits successfully.
>
> This lets the kernel build embed a bootconfig file as a plain cmdline
> string at build time, so embedded bootconfig values can reach
> parse_early_param() during architecture setup without parsing the
> bootconfig at runtime.
>
> The renderer is intentionally limited to the kernel.* subtree: that is
> the only thing the kernel build needs to embed; init.* and other
> subtrees keep going through the runtime parser.
>
> Example of this new mode:
> # cat /tmp/test.bconf
> kernel {
> foo = bar
> baz = "hello world"
> arr = 1, 2
> }
> init.foo = nope
>
> # ./tools/bootconfig/bootconfig -C /tmp/test.bconf
> foo=bar baz="hello world" arr=1 arr=2 %
>
Nice! Looks good to me. Let me pick it.
Thanks,
> Signed-off-by: Breno Leitao <leitao@debian.org>
> ---
> tools/bootconfig/main.c | 60 ++++++++++++++++++++++++++++++++++++++++++-------
> 1 file changed, 52 insertions(+), 8 deletions(-)
>
> diff --git a/tools/bootconfig/main.c b/tools/bootconfig/main.c
> index 643f707b8f1da..e1bfab044fbcb 100644
> --- a/tools/bootconfig/main.c
> +++ b/tools/bootconfig/main.c
> @@ -286,7 +286,41 @@ static int init_xbc_with_error(char *buf, int len)
> return ret;
> }
>
> -static int show_xbc(const char *path, bool list)
> +static int show_xbc_kernel_cmdline(void)
> +{
> + struct xbc_node *root;
> + char *buf = NULL;
> + int len, ret;
> +
> + root = xbc_find_node("kernel");
> + if (!root)
> + return 0; /* no kernel.* keys: emit empty output */
> +
> + len = xbc_snprint_cmdline(NULL, 0, root);
> + if (len < 0) {
> + pr_err("Failed to size cmdline output: %d\n", len);
> + return len;
> + }
> + if (len == 0)
> + return 0;
> +
> + buf = malloc(len + 1);
> + if (!buf)
> + return -ENOMEM;
> +
> + ret = xbc_snprint_cmdline(buf, len + 1, root);
> + if (ret < 0) {
> + pr_err("Failed to render cmdline output: %d\n", ret);
> + free(buf);
> + return ret;
> + }
> +
> + fputs(buf, stdout);
> + free(buf);
> + return 0;
> +}
> +
> +static int show_xbc(const char *path, bool list, bool render_cmdline)
> {
> int ret, fd;
> char *buf = NULL;
> @@ -322,11 +356,14 @@ static int show_xbc(const char *path, bool list)
> if (init_xbc_with_error(buf, ret) < 0)
> goto out;
> }
> - if (list)
> + if (render_cmdline)
> + ret = show_xbc_kernel_cmdline();
> + else if (list)
> xbc_show_list();
> else
> xbc_show_compact_tree();
> - ret = 0;
> + if (ret > 0)
> + ret = 0;
> out:
> free(buf);
>
> @@ -486,7 +523,10 @@ static int usage(void)
> " Options:\n"
> " -a <config>: Apply boot config to initrd\n"
> " -d : Delete boot config file from initrd\n"
> - " -l : list boot config in initrd or file\n\n"
> + " -l : list boot config in initrd or file\n"
> + " -C : render the kernel.* subtree as a flat cmdline\n"
> + " string (suitable for embedding in a kernel image)\n"
> + " and print it to stdout\n\n"
> " If no option is given, show the bootconfig in the given file.\n");
> return -1;
> }
> @@ -495,10 +535,11 @@ int main(int argc, char **argv)
> {
> char *path = NULL;
> char *apply = NULL;
> + bool render_cmdline = false;
> bool delete = false, list = false;
> int opt;
>
> - while ((opt = getopt(argc, argv, "hda:l")) != -1) {
> + while ((opt = getopt(argc, argv, "hda:lC")) != -1) {
> switch (opt) {
> case 'd':
> delete = true;
> @@ -509,14 +550,17 @@ int main(int argc, char **argv)
> case 'l':
> list = true;
> break;
> + case 'C':
> + render_cmdline = true;
> + break;
> case 'h':
> default:
> return usage();
> }
> }
>
> - if ((apply && delete) || (delete && list) || (apply && list)) {
> - pr_err("Error: You can give one of -a, -d or -l at once.\n");
> + if ((!!apply + !!delete + !!list + !!render_cmdline) > 1) {
> + pr_err("Error: You can give one of -a, -d, -l or -C at once.\n");
> return usage();
> }
>
> @@ -532,5 +576,5 @@ int main(int argc, char **argv)
> else if (delete)
> return delete_xbc(path);
>
> - return show_xbc(path, list);
> + return show_xbc(path, list, render_cmdline);
> }
>
> --
> 2.53.0-Meta
>
>
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
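The mutual-exclusion check in the patch above counts how many mode
options were given. A small standalone sketch of that idiom follows;
modes_conflict() is a hypothetical helper name, not something in
tools/bootconfig.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Return true when more than one of the mode options is set.
 * apply is the -a argument (NULL when absent); the rest are flags.
 */
static bool modes_conflict(const char *apply, bool del, bool list,
			   bool render_cmdline)
{
	/* !! maps any non-NULL pointer or true flag to exactly 1. */
	return (!!apply + !!del + !!list + !!render_cmdline) > 1;
}
```

Compared with listing every pair, as the old three-option condition did,
the counting form grows by one term per new option instead of one term
per new pair.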
* Re: [RFC PATCH] trace: Introduce a new filter_pred "caller"
From: Masami Hiramatsu @ 2026-05-11 23:47 UTC (permalink / raw)
To: Chen Jun; +Cc: rostedt, mathieu.desnoyers, linux-kernel, linux-trace-kernel
In-Reply-To: <20260508122623.74290-1-chenjun102@huawei.com>
On Fri, 8 May 2026 20:26:23 +0800
Chen Jun <chenjun102@huawei.com> wrote:
> Low-level functions have many call paths, and sometimes
> we only care about the calls on a specific call path.
> Add a new filter to filter based on the call stack.
>
> Usage:
> 1. echo 'caller=="$function_name"' > events/../filter
Thanks for the interesting idea :)
BTW, we already have "stacktrace". Since this actually checks the
stacktrace, not the caller, I think we should reuse it.
Also, I think OP_GLOB is more suitable (and more useful) for this case.
Thank you,
>
> Only support OP_EQ and OP_NE
>
> Signed-off-by: Chen Jun <chenjun102@huawei.com>
> ---
> include/linux/trace_events.h | 1 +
> kernel/trace/trace.h | 3 ++-
> kernel/trace/trace_events.c | 1 +
> kernel/trace/trace_events_filter.c | 40 ++++++++++++++++++++++++++++--
> 4 files changed, 42 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
> index 40a43a4c7caf..1f109669a391 100644
> --- a/include/linux/trace_events.h
> +++ b/include/linux/trace_events.h
> @@ -851,6 +851,7 @@ enum {
> FILTER_COMM,
> FILTER_CPU,
> FILTER_STACKTRACE,
> + FILTER_CALLER,
> };
>
> extern int trace_event_raw_init(struct trace_event_call *call);
> diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
> index 80fe152af1dd..4e4b92ce264f 100644
> --- a/kernel/trace/trace.h
> +++ b/kernel/trace/trace.h
> @@ -1825,7 +1825,8 @@ static inline bool is_string_field(struct ftrace_event_field *field)
> field->filter_type == FILTER_RDYN_STRING ||
> field->filter_type == FILTER_STATIC_STRING ||
> field->filter_type == FILTER_PTR_STRING ||
> - field->filter_type == FILTER_COMM;
> + field->filter_type == FILTER_COMM ||
> + field->filter_type == FILTER_CALLER;
> }
>
> static inline bool is_function_field(struct ftrace_event_field *field)
> diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
> index c46e623e7e0d..6d220d7eec73 100644
> --- a/kernel/trace/trace_events.c
> +++ b/kernel/trace/trace_events.c
> @@ -199,6 +199,7 @@ static int trace_define_generic_fields(void)
> __generic_field(char *, comm, FILTER_COMM);
> __generic_field(char *, stacktrace, FILTER_STACKTRACE);
> __generic_field(char *, STACKTRACE, FILTER_STACKTRACE);
> + __generic_field(char *, caller, FILTER_CALLER);
>
> return ret;
> }
> diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
> index 609325f57942..1cf040065abe 100644
> --- a/kernel/trace/trace_events_filter.c
> +++ b/kernel/trace/trace_events_filter.c
> @@ -72,6 +72,7 @@ enum filter_pred_fn {
> FILTER_PRED_FN_CPUMASK,
> FILTER_PRED_FN_CPUMASK_CPU,
> FILTER_PRED_FN_FUNCTION,
> + FILTER_PRED_FN_CALLER,
> FILTER_PRED_FN_,
> FILTER_PRED_TEST_VISITED,
> };
> @@ -1009,6 +1010,21 @@ static int filter_pred_function(struct filter_pred *pred, void *event)
> return pred->op == OP_EQ ? ret : !ret;
> }
>
> +/* Filter predicate for caller. */
> +static int filter_pred_caller(struct filter_pred *pred, void *event)
> +{
> + unsigned long entries[32];
> + unsigned int nr_entries;
> + int i;
> +
> + nr_entries = stack_trace_save(entries, ARRAY_SIZE(entries), 0);
> + for (i = 0; i < nr_entries ; i++)
> + if (pred->val <= entries[i] && entries[i] < pred->val2)
> + return !pred->not;
> +
> + return pred->not;
> +}
> +
> /*
> * regex_match_foo - Basic regex callbacks
> *
> @@ -1617,6 +1633,8 @@ static int filter_pred_fn_call(struct filter_pred *pred, void *event)
> return filter_pred_cpumask_cpu(pred, event);
> case FILTER_PRED_FN_FUNCTION:
> return filter_pred_function(pred, event);
> + case FILTER_PRED_FN_CALLER:
> + return filter_pred_caller(pred, event);
> case FILTER_PRED_TEST_VISITED:
> return test_pred_visited_fn(pred, event);
> default:
> @@ -2002,10 +2020,28 @@ static int parse_pred(const char *str, void *data,
>
> } else if (field->filter_type == FILTER_DYN_STRING) {
> pred->fn_num = FILTER_PRED_FN_STRLOC;
> - } else if (field->filter_type == FILTER_RDYN_STRING)
> + } else if (field->filter_type == FILTER_RDYN_STRING) {
> pred->fn_num = FILTER_PRED_FN_STRRELLOC;
> - else {
> + } else if (field->filter_type == FILTER_CALLER) {
> + unsigned long caller;
> +
> + if (op == OP_GLOB)
> + goto err_free;
>
> + pred->fn_num = FILTER_PRED_FN_CALLER;
> + caller = kallsyms_lookup_name(pred->regex->pattern);
> + if (!caller) {
> + parse_error(pe, FILT_ERR_NO_FUNCTION, pos + i);
> + goto err_free;
> + }
> + /* Now find the function start and end address */
> + if (!kallsyms_lookup_size_offset(caller, &size, &offset)) {
> + parse_error(pe, FILT_ERR_NO_FUNCTION, pos + i);
> + goto err_free;
> + }
> + pred->val = caller - offset;
> + pred->val2 = pred->val + size;
> + } else {
> if (!ustring_per_cpu) {
> /* Once allocated, keep it around for good */
> ustring_per_cpu = alloc_percpu(struct ustring_buffer);
> --
> 2.22.0
>
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
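The predicate in the RFC above matches when any saved stack entry falls
inside the address range of the named function. A userspace sketch of
that range test, with hypothetical helper names and plain arrays in place
of stack_trace_save() and kallsyms lookups:

```c
#include <stdbool.h>
#include <stddef.h>

/* True when any entry lies in the half-open range [start, end). */
static bool stack_hits_range(const unsigned long *entries, size_t nr,
			     unsigned long start, unsigned long end)
{
	for (size_t i = 0; i < nr; i++)
		if (start <= entries[i] && entries[i] < end)
			return true;
	return false;
}

/* negate mirrors pred->not: it inverts the result for the != operator. */
static bool caller_pred(const unsigned long *entries, size_t nr,
			unsigned long start, unsigned long end, bool negate)
{
	return stack_hits_range(entries, nr, start, end) != negate;
}
```

Because every frame is examined rather than only the immediate caller,
the test behaves like a stack-trace filter, which is the reason given in
the reply for reusing the existing "stacktrace" field.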
* [PATCHv2] uprobes: Use flexible array for xol_area bitmap
From: Rosen Penev @ 2026-05-11 22:56 UTC (permalink / raw)
To: linux-kernel
Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa,
Ian Rogers, Adrian Hunter, James Clark, Masami Hiramatsu,
Oleg Nesterov, open list:PERFORMANCE EVENTS SUBSYSTEM,
open list:UPROBES
The XOL slot bitmap has the same lifetime as struct xol_area, but it
is currently allocated separately. That adds another allocation
failure path and a matching cleanup branch without buying any extra
flexibility.
Store the bitmap as a flexible array member and allocate it together
with the xol_area using kzalloc_flex(). The bitmap remains
zero-initialized, while the allocation and error handling become
simpler.
Assisted-by: Codex:GPT-5.5
Signed-off-by: Rosen Penev <rosenp@gmail.com>
---
v2: add missing kfree
kernel/events/uprobes.c | 14 +++-----------
1 file changed, 3 insertions(+), 11 deletions(-)
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 4084e926e284..eba71700667e 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -108,7 +108,6 @@ static LIST_HEAD(delayed_uprobe_list);
*/
struct xol_area {
wait_queue_head_t wq; /* if all slots are busy */
- unsigned long *bitmap; /* 0 = free slot */
struct page *page;
/*
@@ -117,6 +116,7 @@ struct xol_area {
* the vma go away, and we must handle that reasonably gracefully.
*/
unsigned long vaddr; /* Page(s) of instruction slots */
+ unsigned long bitmap[]; /* 0 = free slot */
};
static void uprobe_warn(struct task_struct *t, const char *msg)
@@ -1755,18 +1755,13 @@ static struct xol_area *__create_xol_area(unsigned long vaddr)
struct xol_area *area;
void *insns;
- area = kzalloc_obj(*area);
+ area = kzalloc_flex(*area, bitmap, BITS_TO_LONGS(UINSNS_PER_PAGE));
if (unlikely(!area))
goto out;
- area->bitmap = kcalloc(BITS_TO_LONGS(UINSNS_PER_PAGE), sizeof(long),
- GFP_KERNEL);
- if (!area->bitmap)
- goto free_area;
-
area->page = alloc_page(GFP_HIGHUSER | __GFP_ZERO);
if (!area->page)
- goto free_bitmap;
+ goto free_area;
area->vaddr = vaddr;
init_waitqueue_head(&area->wq);
@@ -1779,8 +1774,6 @@ static struct xol_area *__create_xol_area(unsigned long vaddr)
return area;
__free_page(area->page);
- free_bitmap:
- kfree(area->bitmap);
free_area:
kfree(area);
out:
@@ -1831,7 +1824,6 @@ void uprobe_clear_state(struct mm_struct *mm)
return;
put_page(area->page);
- kfree(area->bitmap);
kfree(area);
}
--
2.54.0
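The single-allocation layout used by this patch can be sketched in
userspace C. The names struct slot_area and slot_area_alloc() are
hypothetical, and calloc() stands in for the kernel's kzalloc_flex();
the BITS_TO_LONGS macro is redefined locally so the block is
self-contained.

```c
#include <limits.h>
#include <stdlib.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define BITS_TO_LONGS(n) (((n) + BITS_PER_LONG - 1) / BITS_PER_LONG)

/* Hypothetical stand-in for struct xol_area. */
struct slot_area {
	unsigned long vaddr;
	unsigned long bitmap[];	/* 0 = free slot; must be the last member */
};

/*
 * Allocate the struct and its bitmap in one zeroed block, so there is
 * one failure path on allocation and a single free() on teardown.
 */
static struct slot_area *slot_area_alloc(unsigned long vaddr, size_t nslots)
{
	struct slot_area *area;

	area = calloc(1, sizeof(*area) +
			 BITS_TO_LONGS(nslots) * sizeof(area->bitmap[0]));
	if (!area)
		return NULL;
	area->vaddr = vaddr;
	return area;
}
```

Because the bitmap lives in the same block, releasing the area with one
free() also releases the bitmap, which matches the removal of the
separate kfree(area->bitmap) calls in the diff.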
* [PATCH] rtla: Stop the record trace on interrupt
From: Crystal Wood @ 2026-05-11 22:35 UTC (permalink / raw)
To: Tomas Glozar
Cc: Steven Rostedt, linux-trace-kernel, John Kacur, Costa Shulyupin,
Wander Lairson Costa, Crystal Wood
Previously, when rtla received a signal, it stopped the main trace but
not the record trace. save_trace_to_file() could also fail to keep up on
a debug kernel -- and in any case, this added post-stoppage noise to the
trace file.
Signed-off-by: Crystal Wood <crwood@redhat.com>
---
tools/tracing/rtla/src/common.c | 19 +++++++++++--------
tools/tracing/rtla/src/common.h | 1 -
tools/tracing/rtla/src/timerlat.c | 2 +-
3 files changed, 12 insertions(+), 10 deletions(-)
diff --git a/tools/tracing/rtla/src/common.c b/tools/tracing/rtla/src/common.c
index 35e3d3aa922e..effad523e8cf 100644
--- a/tools/tracing/rtla/src/common.c
+++ b/tools/tracing/rtla/src/common.c
@@ -10,7 +10,7 @@
#include "common.h"
-struct trace_instance *trace_inst;
+struct osnoise_tool *trace_tool;
volatile int stop_tracing;
int nr_cpus;
@@ -21,12 +21,16 @@ static void stop_trace(int sig)
* Stop requested twice in a row; abort event processing and
* exit immediately
*/
- tracefs_iterate_stop(trace_inst->inst);
+ if (trace_tool)
+ tracefs_iterate_stop(trace_tool->trace.inst);
return;
}
stop_tracing = 1;
- if (trace_inst)
- trace_instance_stop(trace_inst);
+ if (trace_tool) {
+ trace_instance_stop(&trace_tool->trace);
+ if (trace_tool->record)
+ trace_instance_stop(&trace_tool->record->trace);
+ }
}
/*
@@ -273,11 +277,10 @@ int run_tool(struct tool_ops *ops, int argc, char *argv[])
tool->params = params;
/*
- * Save trace instance into global variable so that SIGINT can stop
- * the timerlat tracer.
+ * Expose the tool to signal handlers so they can stop the trace.
* Otherwise, rtla could loop indefinitely when overloaded.
*/
- trace_inst = &tool->trace;
+ trace_tool = tool;
retval = ops->apply_config(tool);
if (retval) {
@@ -285,7 +288,7 @@ int run_tool(struct tool_ops *ops, int argc, char *argv[])
goto out_free;
}
- retval = enable_tracer_by_name(trace_inst->inst, ops->tracer);
+ retval = enable_tracer_by_name(tool->trace.inst, ops->tracer);
if (retval) {
err_msg("Failed to enable %s tracer\n", ops->tracer);
goto out_free;
diff --git a/tools/tracing/rtla/src/common.h b/tools/tracing/rtla/src/common.h
index 51665db4ffce..eba40b6d9504 100644
--- a/tools/tracing/rtla/src/common.h
+++ b/tools/tracing/rtla/src/common.h
@@ -54,7 +54,6 @@ struct osnoise_context {
int opt_workload;
};
-extern struct trace_instance *trace_inst;
extern volatile int stop_tracing;
struct hist_params {
diff --git a/tools/tracing/rtla/src/timerlat.c b/tools/tracing/rtla/src/timerlat.c
index f8c057518d22..637f68d684f5 100644
--- a/tools/tracing/rtla/src/timerlat.c
+++ b/tools/tracing/rtla/src/timerlat.c
@@ -202,7 +202,7 @@ void timerlat_analyze(struct osnoise_tool *tool, bool stopped)
* If the trace did not stop with --aa-only, at least print
* the max known latency.
*/
- max_lat = tracefs_instance_file_read(trace_inst->inst, "tracing_max_latency", NULL);
+ max_lat = tracefs_instance_file_read(tool->trace.inst, "tracing_max_latency", NULL);
if (max_lat) {
printf(" Max latency was %s\n", max_lat);
free(max_lat);
--
2.54.0
* [PATCH RESEND] tracing/osnoise: Dump stack on timerlat uret threshold event
From: Crystal Wood @ 2026-05-11 22:31 UTC
To: Steven Rostedt
Cc: linux-trace-kernel, John Kacur, Tomas Glozar, Costa Shulyupin,
Wander Lairson Costa, Crystal Wood
Dump the saved IRQ stack trace regardless of whether the event was
THREAD_CONTEXT or THREAD_URET.
In the uret case, the latency presumably had not yet crossed the
threshold at IRQ time (or else it would have dumped the stack at thread
wakeup time, unless we're racing with a change to the threshold), but it
may have at least contributed -- and this is possible with THREAD_CONTEXT
as well.
In any case, it helps with writing reliable rtla tests if we always get
a stack trace on a threshold event.
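A minimal sketch of the resulting ordering, assuming hypothetical stand-ins
(dump_saved_stack(), stop_tracing_now() and the counters below are
illustration-only names, not kernel symbols): when the threshold is hit on the
user-return path, the saved stack is dumped first and tracing is stopped
second.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for tracer state, for illustration only. */
static uint64_t stop_tracing_total_us;	/* 0 disables the threshold */
static int stacks_dumped;
static bool tracing_stopped;

static void dump_saved_stack(uint64_t latency_us)
{
	(void)latency_us;
	stacks_dumped++;
}

static void stop_tracing_now(void)
{
	tracing_stopped = true;
}

/* Check the user-return latency against the configured threshold. */
static void check_uret_latency(uint64_t latency_us)
{
	if (!stop_tracing_total_us)
		return;
	if (latency_us >= stop_tracing_total_us) {
		dump_saved_stack(latency_us);	/* dump first ... */
		stop_tracing_now();		/* ... then stop */
	}
}
```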
Signed-off-by: Crystal Wood <crwood@redhat.com>
---
Original: https://lore.kernel.org/all/20251112152529.956778-3-crwood@redhat.com/
kernel/trace/trace_osnoise.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c
index 75678053b21c..62c2667d97fa 100644
--- a/kernel/trace/trace_osnoise.c
+++ b/kernel/trace/trace_osnoise.c
@@ -2544,9 +2544,12 @@ timerlat_fd_read(struct file *file, char __user *ubuf, size_t count,
notify_new_max_latency(diff);
tlat->tracing_thread = false;
- if (osnoise_data.stop_tracing_total)
- if (time_to_us(diff) >= osnoise_data.stop_tracing_total)
+ if (osnoise_data.stop_tracing_total) {
+ if (time_to_us(diff) >= osnoise_data.stop_tracing_total) {
+ timerlat_dump_stack(time_to_us(diff));
osnoise_stop_tracing();
+ }
+ }
} else {
tlat->tracing_thread = false;
tlat->kthread = current;
--
2.54.0
* [PATCH v2] tracing/osnoise: Array printk init and cleanup
From: Crystal Wood @ 2026-05-11 22:30 UTC
To: Steven Rostedt
Cc: linux-trace-kernel, John Kacur, Tomas Glozar, Costa Shulyupin,
Wander Lairson Costa, Crystal Wood
None of the calls to trace_array_printk_buf() will do anything
if we don't initialize the printk buffer on instance creation (unless
some other tracer has already initialized it), so call
trace_array_init_printk() when an osnoise instance is registered.
Add an osnoise_print() function to facilitate adding debug prints
(without tainting).
Use trace_array_printk() instead of trace_array_printk_buf(), as we're
only writing to the main buffer (of a non-main instance) anyway -- and
trace_array_printk_buf() skips the check to make sure we're not printing
to the global instance.
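A userspace sketch of the pattern osnoise_print() relies on: one format string
and argument list fanned out to several sinks, with the va_list restarted for
each sink. The buffers and the name print_to_all() are assumptions made for
illustration; the kernel function writes to trace arrays instead.

```c
#include <stdarg.h>
#include <stdio.h>

#define MAX_INSTANCES	4
#define BUF_LEN		64

/* Hypothetical per-instance buffers standing in for trace arrays. */
static char instance_buf[MAX_INSTANCES][BUF_LEN];
static int nr_instances;

/* Format the same message into every registered instance. */
static void print_to_all(const char *fmt, ...)
{
	va_list ap;
	int i;

	for (i = 0; i < nr_instances; i++) {
		/* A va_list is consumed by vsnprintf(), so restart it per instance. */
		va_start(ap, fmt);
		vsnprintf(instance_buf[i], BUF_LEN, fmt, ap);
		va_end(ap);
	}
}
```

Reusing a single va_list across iterations without va_end()/va_start() (or
va_copy()) would be undefined behavior, which is why the restart sits inside
the loop.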
Signed-off-by: Crystal Wood <crwood@redhat.com>
---
v2: s/macro/function/ in commit message
v1: https://lore.kernel.org/all/20251112152529.956778-4-crwood@redhat.com/
kernel/trace/trace_osnoise.c | 39 ++++++++++++++++++++++--------------
1 file changed, 24 insertions(+), 15 deletions(-)
diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c
index 62c2667d97fa..5e83c4f6f2b4 100644
--- a/kernel/trace/trace_osnoise.c
+++ b/kernel/trace/trace_osnoise.c
@@ -83,6 +83,22 @@ struct osnoise_instance {
static struct list_head osnoise_instances;
+static void osnoise_print(const char *fmt, ...)
+{
+ struct osnoise_instance *inst;
+ struct trace_array *tr;
+ va_list ap;
+
+ rcu_read_lock();
+ list_for_each_entry_rcu(inst, &osnoise_instances, list) {
+ tr = inst->tr;
+ va_start(ap, fmt);
+ trace_array_vprintk(tr, _RET_IP_, fmt, ap);
+ va_end(ap);
+ }
+ rcu_read_unlock();
+}
+
static bool osnoise_has_registered_instances(void)
{
return !!list_first_or_null_rcu(&osnoise_instances,
@@ -123,6 +139,7 @@ static int osnoise_register_instance(struct trace_array *tr)
* trace_types_lock.
*/
lockdep_assert_held(&trace_types_lock);
+ trace_array_init_printk(tr);
inst = kmalloc_obj(*inst);
if (!inst)
@@ -471,15 +488,7 @@ static void print_osnoise_headers(struct seq_file *s)
* osnoise_taint - report an osnoise error.
*/
#define osnoise_taint(msg) ({ \
- struct osnoise_instance *inst; \
- struct trace_buffer *buffer; \
- \
- rcu_read_lock(); \
- list_for_each_entry_rcu(inst, &osnoise_instances, list) { \
- buffer = inst->tr->array_buffer.buffer; \
- trace_array_printk_buf(buffer, _THIS_IP_, msg); \
- } \
- rcu_read_unlock(); \
+ osnoise_print(msg); \
osnoise_data.tainted = true; \
})
@@ -1189,10 +1198,10 @@ static __always_inline void osnoise_stop_exception(char *msg, int cpu)
rcu_read_lock();
list_for_each_entry_rcu(inst, &osnoise_instances, list) {
tr = inst->tr;
- trace_array_printk_buf(tr->array_buffer.buffer, _THIS_IP_,
- "stop tracing hit on cpu %d due to exception: %s\n",
- smp_processor_id(),
- msg);
+ trace_array_printk(tr, _THIS_IP_,
+ "stop tracing hit on cpu %d due to exception: %s\n",
+ smp_processor_id(),
+ msg);
if (test_bit(OSN_PANIC_ON_STOP, &osnoise_options))
panic("tracer hit on cpu %d due to exception: %s\n",
@@ -1362,8 +1371,8 @@ static __always_inline void osnoise_stop_tracing(void)
rcu_read_lock();
list_for_each_entry_rcu(inst, &osnoise_instances, list) {
tr = inst->tr;
- trace_array_printk_buf(tr->array_buffer.buffer, _THIS_IP_,
- "stop tracing hit on cpu %d\n", smp_processor_id());
+ trace_array_printk(tr, _THIS_IP_,
+ "stop tracing hit on cpu %d\n", smp_processor_id());
if (test_bit(OSN_PANIC_ON_STOP, &osnoise_options))
panic("tracer hit stop condition on CPU %d\n", smp_processor_id());
--
2.54.0
* Re: [PATCH mm-unstable v17 00/14] khugepaged: mTHP support
From: Andrew Morton @ 2026-05-11 21:04 UTC
To: Nico Pache
Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
lance.yang, liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat,
mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
zokeefe
In-Reply-To: <20260511185817.686831-1-npache@redhat.com>
On Mon, 11 May 2026 12:58:00 -0600 Nico Pache <npache@redhat.com> wrote:
> The following series provides khugepaged with the capability to collapse
> anonymous memory regions to mTHPs.
Thanks, I've updated mm.git's mm-new branch to this version.
> V17 Changes:
> - Added Acks/RB
> - New patch(5): split the mmap_read_unlock() locking contract change out of
> "generalize collapse_huge_page" into its own patch; add a comment
> documenting the enter/exit-with-lock-dropped contract (Usama, David)
> - [patch 03] Add const to max_ptes_none/shared/swap variables; improve the
> three helper docstrings; replace the paragraphs with inline comments;
> note that sysctl values are now snapshotted once per scan (Usama, David)
> - [patch 04] Add SCAN_INVALID_PTES_NONE result code and return it instead
> of SCAN_FAIL when collapse_max_ptes_none() returns -EINVAL (Usama);
> snapshot khugepaged_max_ptes_none into a local variable to fix race on
> the two comparisons (Usama); clean up mTHP docstring paragraphs into
> inline comments; fix commit message wording (David)
> - [patch 06] Remove /* PMD collapse */ and /* mTHP collapse */ comments
> (David); move const declarations to top of variable list (David); add
> comment explaining that map_anon_folio_pte_nopf() calls set_ptes under
> pmd_ptl and is safe because PMD is expected to be none (Usama)
> - [patch 08] Shorten sysfs counter documentation for
> collapse_exceed_swap/shared_pte to concise one-liners; trim
> collapse_exceed_none_pte description; fix "dont" → "do not" (David)
> - [patch 10] Keep vm_flags parameter in khugepaged_enter_vma() and
> collapse_allowable_orders() rather than dropping it and reading
> vma->vm_flags internally; pass vm_flags explicitly at all three
> collapse_allowable_orders() call sites (David, sashskio)
> - [patch 11] Fix MTHP_STACK_SIZE: was exponential (~128); correct formula
> is (height + 1) for a DFS on a binary tree; rewrite comment to explain
> the DFS sizing (sashskio)
> - [patch 12] Replace SCAN_PAGE_LRU with SCAN_PAGE_LAZYFREE in the
> "goto next_order" early-bail cases; non-LRU page failures cannot be
> recovered at any order and belong in the default (return) path
> - [patch 13] Use tva_flags == TVA_KHUGEPAGED (strict equality) instead of
> tva_flags & TVA_KHUGEPAGED; flatten nested if into single condition;
> retain vm_flags parameter; pass vm_flags to collapse_allowable_orders()
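The (height + 1) bound from the patch 11 note above can be checked with a small
standalone sketch. It assumes a PMD order of 9 (4K pages, 2M PMD) and models
the worst case in which every range is split into two halves; `struct range`
and max_stack_depth() are hypothetical names, not the khugepaged code.

```c
#include <assert.h>

#define PMD_ORDER	9	/* assumed: 4K pages, 2M PMD */
#define MIN_ORDER	2	/* mirrors KHUGEPAGED_MIN_MTHP_ORDER */
#define STACK_SIZE	(PMD_ORDER - MIN_ORDER + 1)

struct range {
	unsigned int order;
	unsigned int offset;	/* first PTE index covered */
};

/*
 * Iterative DFS from PMD_ORDER down to MIN_ORDER, always splitting
 * (worst case: no collapse ever succeeds). Returns the deepest stack use.
 */
static int max_stack_depth(void)
{
	struct range stack[STACK_SIZE];
	int top = 0, deepest = 1;

	stack[top++] = (struct range){ PMD_ORDER, 0 };
	while (top > 0) {
		struct range r = stack[--top];

		if (r.order == MIN_ORDER)
			continue;	/* leaf: nothing smaller to try */

		/* Push both halves; the pop above freed one slot. */
		assert(top + 2 <= STACK_SIZE);
		stack[top++] = (struct range){ r.order - 1,
					       r.offset + (1u << (r.order - 1)) };
		stack[top++] = (struct range){ r.order - 1, r.offset };
		if (top > deepest)
			deepest = top;
	}
	return deepest;
}
```

Each pop removes one entry before two are pushed, so the stack grows by at most
one per level of the tree; with a height of 7 that gives 8 slots rather than
the 128 the exponential formula reserved.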
Here's how v17 altered mm.git:
Documentation/admin-guide/mm/transhuge.rst | 24 ---
include/linux/khugepaged.h | 6
include/trace/events/huge_memory.h | 3
mm/huge_memory.c | 2
mm/khugepaged.c | 152 ++++++++++---------
mm/vma.c | 6
tools/testing/vma/include/stubs.h | 3
7 files changed, 99 insertions(+), 97 deletions(-)
--- a/Documentation/admin-guide/mm/transhuge.rst~b
+++ a/Documentation/admin-guide/mm/transhuge.rst
@@ -725,27 +725,17 @@ nr_anon_partially_mapped
collapse_exceed_none_pte
The number of collapse attempts that failed due to exceeding the
- max_ptes_none threshold. For mTHP collapse, Currently only max_ptes_none
- values of 0 and (HPAGE_PMD_NR - 1) are supported. Any other value will
- emit a warning and no mTHP collapse will be attempted. khugepaged will
- try to collapse to the largest enabled (m)THP size; if it fails, it will
- try the next lower enabled mTHP size. This counter records the number of
- times a collapse attempt was skipped for exceeding the max_ptes_none
- threshold, and khugepaged will move on to the next available mTHP size.
+ max_ptes_none threshold.
collapse_exceed_swap_pte
- The number of anonymous mTHP PTE ranges which were unable to collapse due
- to containing at least one swap PTE. Currently khugepaged does not
- support collapsing mTHP regions that contain a swap PTE. This counter can
- be used to monitor the number of khugepaged mTHP collapses that failed
- due to the presence of a swap PTE.
+ The number of collapse attempts that failed due to exceeding the
+ max_ptes_swap threshold. For non-PMD orders this occurs if a mTHP range
+ contains at least one swap PTE.
collapse_exceed_shared_pte
- The number of anonymous mTHP PTE ranges which were unable to collapse due
- to containing at least one shared PTE. Currently khugepaged does not
- support collapsing mTHP PTE ranges that contain a shared PTE. This
- counter can be used to monitor the number of khugepaged mTHP collapses
- that failed due to the presence of a shared PTE.
+ The number of collapse attempts that failed due to exceeding the
+ max_ptes_shared threshold. For non-PMD orders this occurs if a mTHP range
+ contains at least one shared PTE.
As the system ages, allocating huge pages may be expensive as the
system uses memory compaction to copy data around memory to free a
--- a/include/linux/khugepaged.h~b
+++ a/include/linux/khugepaged.h
@@ -13,7 +13,8 @@ extern void khugepaged_destroy(void);
extern int start_stop_khugepaged(void);
extern void __khugepaged_enter(struct mm_struct *mm);
extern void __khugepaged_exit(struct mm_struct *mm);
-extern void khugepaged_enter_vma(struct vm_area_struct *vma);
+extern void khugepaged_enter_vma(struct vm_area_struct *vma,
+ vm_flags_t vm_flags);
extern void khugepaged_min_free_kbytes_update(void);
extern bool current_is_khugepaged(void);
void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
@@ -37,7 +38,8 @@ static inline void khugepaged_fork(struc
static inline void khugepaged_exit(struct mm_struct *mm)
{
}
-static inline void khugepaged_enter_vma(struct vm_area_struct *vma)
+static inline void khugepaged_enter_vma(struct vm_area_struct *vma,
+ vm_flags_t vm_flags)
{
}
static inline void collapse_pte_mapped_thp(struct mm_struct *mm,
--- a/include/trace/events/huge_memory.h~b
+++ a/include/trace/events/huge_memory.h
@@ -39,7 +39,8 @@
EM( SCAN_STORE_FAILED, "store_failed") \
EM( SCAN_COPY_MC, "copy_poisoned_page") \
EM( SCAN_PAGE_FILLED, "page_filled") \
- EMe(SCAN_PAGE_DIRTY_OR_WRITEBACK, "page_dirty_or_writeback")
+ EM(SCAN_PAGE_DIRTY_OR_WRITEBACK, "page_dirty_or_writeback") \
+ EMe(SCAN_INVALID_PTES_NONE, "invalid_ptes_none")
#undef EM
#undef EMe
--- a/mm/huge_memory.c~b
+++ a/mm/huge_memory.c
@@ -1571,7 +1571,7 @@ vm_fault_t do_huge_pmd_anonymous_page(st
ret = vmf_anon_prepare(vmf);
if (ret)
return ret;
- khugepaged_enter_vma(vma);
+ khugepaged_enter_vma(vma, vma->vm_flags);
if (!(vmf->flags & FAULT_FLAG_WRITE) &&
!mm_forbids_zeropage(vma->vm_mm) &&
--- a/mm/khugepaged.c~b
+++ a/mm/khugepaged.c
@@ -61,6 +61,7 @@ enum scan_result {
SCAN_COPY_MC,
SCAN_PAGE_FILLED,
SCAN_PAGE_DIRTY_OR_WRITEBACK,
+ SCAN_INVALID_PTES_NONE,
};
#define CREATE_TRACE_POINTS
@@ -101,16 +102,15 @@ static struct kmem_cache *mm_slot_cache
#define KHUGEPAGED_MIN_MTHP_ORDER 2
/*
- * The maximum number of mTHP ranges that can be stored on the stack.
- * This is calculated based on the number of PTE entries in a PTE page table
- * and the minimum mTHP order.
+ * mthp_collapse() does an iterative DFS over a binary tree, from
+ * HPAGE_PMD_ORDER down to KHUGEPAGED_MIN_MTHP_ORDER. The max stack
+ * size needed for a DFS on a binary tree is height + 1, where
+ * height = HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER.
*
- * ilog2 is needed in place of HPAGE_PMD_ORDER due to some architectures
- * (ie ppc64le) not defining HPAGE_PMD_ORDER until after build time.
- *
- * At most there will be 1 << (PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER) mTHP ranges
+ * ilog2 is used in place of HPAGE_PMD_ORDER because some architectures
+ * (e.g. ppc64le) do not define HPAGE_PMD_ORDER until after build time.
*/
-#define MTHP_STACK_SIZE (1UL << (ilog2(MAX_PTRS_PER_PTE) - KHUGEPAGED_MIN_MTHP_ORDER))
+#define MTHP_STACK_SIZE (ilog2(MAX_PTRS_PER_PTE) - KHUGEPAGED_MIN_MTHP_ORDER + 1)
/*
* Defines a range of PTE entries in a PTE page table which are being
@@ -380,89 +380,87 @@ static bool pte_none_or_zero(pte_t pte)
}
/**
- * collapse_max_ptes_none - Calculate maximum allowed empty PTEs for collapse
+ * collapse_max_ptes_none - Calculate maximum allowed none-page or zero-page
+ * PTEs for the given collapse operation.
* @cc: The collapse control struct
* @vma: The vma to check for userfaultfd
* @order: The folio order being collapsed to
*
- * If we are not in khugepaged mode use HPAGE_PMD_NR to allow any
- * empty page. For PMD-sized collapses (order == HPAGE_PMD_ORDER), use the
- * configured khugepaged_max_ptes_none value.
- *
- * For mTHP collapses, we currently only support khugepaged_max_pte_none values
- * of 0 or (KHUGEPAGED_MAX_PTES_LIMIT). Any other value will emit a warning and
- * no mTHP collapse will be attempted
- *
- * Return: Maximum number of empty PTEs allowed for the collapse operation
+ * Return: Maximum number of none-page or zero-page PTEs allowed for the
+ * collapse operation.
*/
static int collapse_max_ptes_none(struct collapse_control *cc,
struct vm_area_struct *vma, unsigned int order)
{
+ unsigned int max_ptes_none = khugepaged_max_ptes_none;
+ // If the vma is userfaultfd-armed, allow no none-page or zero-page PTEs.
if (vma && userfaultfd_armed(vma))
return 0;
+ // for MADV_COLLAPSE, allow any none-page or zero-page PTEs.
if (!cc->is_khugepaged)
return HPAGE_PMD_NR;
+ // for PMD collapse, respect the user defined maximum.
if (is_pmd_order(order))
- return khugepaged_max_ptes_none;
+ return max_ptes_none;
/* Zero/non-present collapse disabled. */
- if (!khugepaged_max_ptes_none)
+ if (!max_ptes_none)
return 0;
- if (khugepaged_max_ptes_none == KHUGEPAGED_MAX_PTES_LIMIT)
+ // for mTHP collapse with the sysctl value set to KHUGEPAGED_MAX_PTES_LIMIT,
+ // scale the maximum number of PTEs to the order of the collapse.
+ if (max_ptes_none == KHUGEPAGED_MAX_PTES_LIMIT)
return (1 << order) - 1;
+ // We currently only support max_ptes_none values of 0 or KHUGEPAGED_MAX_PTES_LIMIT.
+ // Emit a warning and return -EINVAL.
pr_warn_once("mTHP collapse only supports max_ptes_none values of 0 or %u\n",
KHUGEPAGED_MAX_PTES_LIMIT);
return -EINVAL;
}
/**
- * collapse_max_ptes_shared - Calculate maximum allowed shared PTEs for collapse
+ * collapse_max_ptes_shared - Calculate maximum allowed PTEs that map shared
+ * anonymous pages for the given collapse operation.
* @cc: The collapse control struct
* @order: The folio order being collapsed to
*
- * If we are not in khugepaged mode use HPAGE_PMD_NR to allow any
- * shared page.
- *
- * For mTHP collapses, we currently dont support collapsing memory with
- * shared memory.
- *
- * Return: Maximum number of shared PTEs allowed for the collapse operation
+ * Return: Maximum number of PTEs that map shared anonymous pages for the
+ * collapse operation
*/
static unsigned int collapse_max_ptes_shared(struct collapse_control *cc,
unsigned int order)
{
+ // for MADV_COLLAPSE, do not restrict the number of PTEs that map shared
+ // anonymous pages.
if (!cc->is_khugepaged)
return HPAGE_PMD_NR;
+ // for mTHP collapse do not allow collapsing anonymous memory pages that
+ // are shared between processes.
if (!is_pmd_order(order))
return 0;
-
+ // for PMD collapse, respect the user defined maximum.
return khugepaged_max_ptes_shared;
}
/**
- * collapse_max_ptes_swap - Calculate maximum allowed swap PTEs for collapse
+ * collapse_max_ptes_swap - Calculate the maximum allowed non-present PTEs or the
+ * maximum allowed non-present pagecache entries for the given collapse operation.
* @cc: The collapse control struct
* @order: The folio order being collapsed to
*
- * If we are not in khugepaged mode use HPAGE_PMD_NR to allow any
- * swap page.
- *
- * For PMD-sized collapses (order == HPAGE_PMD_ORDER), use the configured
- * khugepaged_max_ptes_swap value.
- *
- * For mTHP collapses, we currently dont support collapsing memory with
- * swapped out memory.
- *
- * Return: Maximum number of swap PTEs allowed for the collapse operation
+ * Return: Maximum number of non-present PTEs or the maximum allowed non-present
+ * pagecache entries for the collapse operation.
*/
static unsigned int collapse_max_ptes_swap(struct collapse_control *cc,
unsigned int order)
{
+ // for MADV_COLLAPSE, do not restrict the number PTEs entries or
+ // pagecache entries that are non-present.
if (!cc->is_khugepaged)
return HPAGE_PMD_NR;
+ // for mTHP collapse do not allow any non-present PTEs or pagecache entries.
if (!is_pmd_order(order))
return 0;
-
+ // for PMD collapse, respect the user defined maximum.
return khugepaged_max_ptes_swap;
}
@@ -478,7 +476,7 @@ int hugepage_madvise(struct vm_area_stru
* register it here without waiting a page fault that
* may not happen any time soon.
*/
- khugepaged_enter_vma(vma);
+ khugepaged_enter_vma(vma, *vm_flags);
break;
case MADV_NOHUGEPAGE:
*vm_flags &= ~VM_HUGEPAGE;
@@ -579,26 +577,26 @@ void __khugepaged_enter(struct mm_struct
/* Check what orders are allowed based on the vma and collapse type */
static unsigned long collapse_allowable_orders(struct vm_area_struct *vma,
- enum tva_type tva_flags)
+ vm_flags_t vm_flags, enum tva_type tva_flags)
{
unsigned long orders;
/* If khugepaged is scanning an anonymous vma, allow mTHP collapse */
- if ((tva_flags & TVA_KHUGEPAGED) && vma_is_anonymous(vma))
+ if ((tva_flags == TVA_KHUGEPAGED) && vma_is_anonymous(vma))
orders = THP_ORDERS_ALL_ANON;
else
orders = BIT(HPAGE_PMD_ORDER);
- return thp_vma_allowable_orders(vma, vma->vm_flags, tva_flags, orders);
+ return thp_vma_allowable_orders(vma, vm_flags, tva_flags, orders);
}
-void khugepaged_enter_vma(struct vm_area_struct *vma)
+void khugepaged_enter_vma(struct vm_area_struct *vma,
+ vm_flags_t vm_flags)
{
if (!mm_flags_test(MMF_VM_HUGEPAGE, vma->vm_mm) &&
- hugepage_enabled()) {
- if (collapse_allowable_orders(vma, TVA_KHUGEPAGED))
- __khugepaged_enter(vma->vm_mm);
- }
+ collapse_allowable_orders(vma, vm_flags, TVA_KHUGEPAGED) &&
+ hugepage_enabled())
+ __khugepaged_enter(vma->vm_mm);
}
void __khugepaged_exit(struct mm_struct *mm)
@@ -683,7 +681,7 @@ static enum scan_result __collapse_huge_
unsigned int max_ptes_shared = collapse_max_ptes_shared(cc, order);
if (max_ptes_none < 0)
- return result;
+ return SCAN_INVALID_PTES_NONE;
for (_pte = pte; _pte < pte + nr_pages;
_pte++, addr += PAGE_SIZE) {
@@ -905,6 +903,7 @@ static void __collapse_huge_page_copy_fa
{
const unsigned long nr_pages = 1UL << order;
spinlock_t *pmd_ptl;
+
/*
* Re-establish the PMD to point to the original page table
* entry. Restoring PMD needs to be done prior to releasing
@@ -944,6 +943,7 @@ static enum scan_result __collapse_huge_
const unsigned long nr_pages = 1UL << order;
unsigned int i;
enum scan_result result = SCAN_SUCCEED;
+
/*
* Copying pages' contents is subject to memory poison at any iteration.
*/
@@ -1263,10 +1263,20 @@ static enum scan_result alloc_charge_fol
return SCAN_SUCCEED;
}
+/*
+ * collapse_huge_page expects the mmap_read_lock to be dropped before
+ * entering this function. The function will also always return with the lock
+ * dropped. The function starts by allocating a folio, which can potentially
+ * take a long time if it involves sync compaction, and we do not need to hold
+ * the mmap_lock during that. We must recheck the vma after taking it again in
+ * write mode.
+ */
static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long start_addr,
int referenced, int unmapped, struct collapse_control *cc,
unsigned int order)
{
+ const unsigned long pmd_addr = start_addr & HPAGE_PMD_MASK;
+ const unsigned long end_addr = start_addr + (PAGE_SIZE << order);
LIST_HEAD(compound_pagelist);
pmd_t *pmd, _pmd;
pte_t *pte = NULL;
@@ -1277,8 +1287,6 @@ static enum scan_result collapse_huge_pa
struct vm_area_struct *vma;
struct mmu_notifier_range range;
bool anon_vma_locked = false;
- const unsigned long pmd_addr = start_addr & HPAGE_PMD_MASK;
- const unsigned long end_addr = start_addr + (PAGE_SIZE << order);
result = alloc_charge_folio(&folio, mm, cc, order);
if (result != SCAN_SUCCEED)
@@ -1399,11 +1407,16 @@ static enum scan_result collapse_huge_pa
__folio_mark_uptodate(folio);
spin_lock(pmd_ptl);
WARN_ON_ONCE(!pmd_none(*pmd));
- if (is_pmd_order(order)) { /* PMD collapse */
+ if (is_pmd_order(order)) {
pgtable = pmd_pgtable(_pmd);
pgtable_trans_huge_deposit(mm, pmd, pgtable);
map_anon_folio_pmd_nopf(folio, pmd, vma, pmd_addr);
- } else { /* mTHP collapse */
+ } else {
+ /*
+ * set_ptes is called in map_anon_folio_pte_nopf with the
+ * pmd_ptl lock still held; this is safe as the PMD is expected
+ * to be none. The pmd entry is then repopulated below.
+ */
map_anon_folio_pte_nopf(folio, pte, vma, start_addr, /*uffd_wp=*/ false);
smp_wmb(); /* make PTEs visible before PMD. See pmd_install() */
pmd_populate(mm, pmd, pmd_pgtable(_pmd));
@@ -1538,12 +1551,12 @@ static int mthp_collapse(struct mm_struc
case SCAN_EXCEED_SHARED_PTE:
case SCAN_PAGE_LOCK:
case SCAN_PAGE_COUNT:
- case SCAN_PAGE_LRU:
case SCAN_PAGE_NULL:
case SCAN_DEL_PAGE_LRU:
case SCAN_PTE_NON_PRESENT:
case SCAN_PTE_UFFD_WP:
case SCAN_ALLOC_HUGE_PAGE_FAIL:
+ case SCAN_PAGE_LAZYFREE:
goto next_order;
/* Cases where no further collapse is possible */
default:
@@ -1569,6 +1582,10 @@ static enum scan_result collapse_scan_pm
struct vm_area_struct *vma, unsigned long start_addr,
bool *lock_dropped, struct collapse_control *cc)
{
+ int max_ptes_none = collapse_max_ptes_none(cc, vma, HPAGE_PMD_ORDER);
+ const unsigned int max_ptes_shared = collapse_max_ptes_shared(cc, HPAGE_PMD_ORDER);
+ const unsigned int max_ptes_swap = collapse_max_ptes_swap(cc, HPAGE_PMD_ORDER);
+ enum tva_type tva_flags = cc->is_khugepaged ? TVA_KHUGEPAGED : TVA_FORCED_COLLAPSE;
pmd_t *pmd;
pte_t *pte, *_pte, pteval;
int i;
@@ -1580,10 +1597,6 @@ static enum scan_result collapse_scan_pm
unsigned long enabled_orders;
spinlock_t *ptl;
int node = NUMA_NO_NODE, unmapped = 0;
- int max_ptes_none = collapse_max_ptes_none(cc, vma, HPAGE_PMD_ORDER);
- unsigned int max_ptes_shared = collapse_max_ptes_shared(cc, HPAGE_PMD_ORDER);
- unsigned int max_ptes_swap = collapse_max_ptes_swap(cc, HPAGE_PMD_ORDER);
- enum tva_type tva_flags = cc->is_khugepaged ? TVA_KHUGEPAGED : TVA_FORCED_COLLAPSE;
VM_BUG_ON(start_addr & ~HPAGE_PMD_MASK);
@@ -1597,7 +1610,7 @@ static enum scan_result collapse_scan_pm
memset(cc->node_load, 0, sizeof(cc->node_load));
nodes_clear(cc->alloc_nmask);
- enabled_orders = collapse_allowable_orders(vma, tva_flags);
+ enabled_orders = collapse_allowable_orders(vma, vma->vm_flags, tva_flags);
/*
* If PMD is the only enabled order, enforce max_ptes_none, otherwise
@@ -1757,12 +1770,7 @@ static enum scan_result collapse_scan_pm
out_unmap:
pte_unmap_unlock(pte, ptl);
if (result == SCAN_SUCCEED) {
- /*
- * Before allocating the hugepage, release the mmap_lock read lock.
- * The allocation can take potentially a long time if it involves
- * sync compaction, and we do not need to hold the mmap_lock during
- * that. We will recheck the vma after taking it again in write mode.
- */
+ /* collapse_huge_page expects the lock to be dropped before calling */
mmap_read_unlock(mm);
nr_collapsed = mthp_collapse(mm, start_addr, referenced, unmapped,
cc, enabled_orders);
@@ -2657,14 +2665,14 @@ static enum scan_result collapse_scan_fi
unsigned long addr, struct file *file, pgoff_t start,
struct collapse_control *cc)
{
+ const int max_ptes_none = collapse_max_ptes_none(cc, NULL, HPAGE_PMD_ORDER);
+ const unsigned int max_ptes_swap = collapse_max_ptes_swap(cc, HPAGE_PMD_ORDER);
struct folio *folio = NULL;
struct address_space *mapping = file->f_mapping;
XA_STATE(xas, &mapping->i_pages, start);
int present, swap;
int node = NUMA_NO_NODE;
enum scan_result result = SCAN_SUCCEED;
- int max_ptes_none = collapse_max_ptes_none(cc, NULL, HPAGE_PMD_ORDER);
- unsigned int max_ptes_swap = collapse_max_ptes_swap(cc, HPAGE_PMD_ORDER);
present = 0;
swap = 0;
@@ -2867,7 +2875,7 @@ static void collapse_scan_mm_slot(unsign
cc->progress++;
break;
}
- if (!collapse_allowable_orders(vma, TVA_KHUGEPAGED)) {
+ if (!collapse_allowable_orders(vma, vma->vm_flags, TVA_KHUGEPAGED)) {
cc->progress++;
continue;
}
@@ -3177,7 +3185,7 @@ int madvise_collapse(struct vm_area_stru
BUG_ON(vma->vm_start > start);
BUG_ON(vma->vm_end < end);
- if (!collapse_allowable_orders(vma, TVA_FORCED_COLLAPSE))
+ if (!collapse_allowable_orders(vma, vma->vm_flags, TVA_FORCED_COLLAPSE))
return -EINVAL;
cc = kmalloc_obj(*cc);
--- a/mm/vma.c~b
+++ a/mm/vma.c
@@ -989,7 +989,7 @@ static __must_check struct vm_area_struc
goto abort;
vma_set_flags_mask(vmg->target, sticky_flags);
- khugepaged_enter_vma(vmg->target);
+ khugepaged_enter_vma(vmg->target, vmg->vm_flags);
vmg->state = VMA_MERGE_SUCCESS;
return vmg->target;
@@ -1110,7 +1110,7 @@ struct vm_area_struct *vma_merge_new_ran
* following VMA if we have VMAs on both sides.
*/
if (vmg->target && !vma_expand(vmg)) {
- khugepaged_enter_vma(vmg->target);
+ khugepaged_enter_vma(vmg->target, vmg->vm_flags);
vmg->state = VMA_MERGE_SUCCESS;
return vmg->target;
}
@@ -2589,7 +2589,7 @@ static int __mmap_new_vma(struct mmap_st
* call covers the non-merge case.
*/
if (!vma_is_anonymous(vma))
- khugepaged_enter_vma(vma);
+ khugepaged_enter_vma(vma, map->vm_flags);
*vmap = vma;
return 0;
--- a/tools/testing/vma/include/stubs.h~b
+++ a/tools/testing/vma/include/stubs.h
@@ -183,7 +183,8 @@ static inline bool mpol_equal(struct mem
return true;
}
-static inline void khugepaged_enter_vma(struct vm_area_struct *vma)
+static inline void khugepaged_enter_vma(struct vm_area_struct *vma,
+ vm_flags_t vm_flags)
{
}
_
* [PATCH mm-unstable v17 14/14] Documentation: mm: update the admin guide for mTHP collapse
From: Nico Pache @ 2026-05-11 18:58 UTC
To: linux-doc, linux-kernel, linux-mm, linux-trace-kernel
Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
byungchul, catalin.marinas, cl, corbet, dave.hansen, david,
dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
joshua.hahnjy, kas, lance.yang, liam, ljs, mathieu.desnoyers,
matthew.brost, mhiramat, mhocko, npache, peterx, pfalcato,
rakie.kim, raquini, rdunlap, richard.weiyang, rientjes, rostedt,
rppt, ryan.roberts, shivankg, sunnanyong, surenb,
thomas.hellstrom, tiwai, usamaarif642, vbabka, vishal.moola,
wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe,
Bagas Sanjaya
In-Reply-To: <20260511185817.686831-1-npache@redhat.com>
Now that we can collapse to mTHPs, let's update the admin guide to
reflect these changes and provide proper guidance on how to use it.
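As a rough illustration of the max_ptes_none rule the updated guide describes
for khugepaged, here is a standalone sketch. It assumes a PMD order of 9 and
that the supported maximum equals HPAGE_PMD_NR - 1; max_ptes_none_for_order()
and the macros are hypothetical names, and the userfaultfd and MADV_COLLAPSE
cases are left out.

```c
#include <errno.h>

#define PMD_ORDER	9			/* assumed: 4K pages, 2M PMD */
#define PMD_NR		(1 << PMD_ORDER)	/* 512 PTEs per PMD */
#define MAX_PTES_LIMIT	(PMD_NR - 1)

/* Sketch: how many empty PTEs a khugepaged collapse at @order tolerates. */
static int max_ptes_none_for_order(unsigned int sysctl_value,
				   unsigned int order)
{
	if (order == PMD_ORDER)
		return (int)sysctl_value;	/* PMD collapse: use the value as is */
	if (sysctl_value == 0)
		return 0;			/* no empty PTEs allowed */
	if (sysctl_value == MAX_PTES_LIMIT)
		return (1 << order) - 1;	/* scale the limit to the order */
	return -EINVAL;				/* unsupported for mTHP */
}
```

For example, with the sysctl left at 511 an order-4 (64K) collapse would
tolerate up to 15 empty PTEs, while an intermediate value such as 100 only
applies to PMD-sized collapse.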
Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>
Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
Documentation/admin-guide/mm/transhuge.rst | 49 +++++++++++++---------
1 file changed, 29 insertions(+), 20 deletions(-)
diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 80a4d0bed70b..fc0127a36ef6 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -63,7 +63,8 @@ often.
THP can be enabled system wide or restricted to certain tasks or even
memory ranges inside task's address space. Unless THP is completely
disabled, there is ``khugepaged`` daemon that scans memory and
-collapses sequences of basic pages into PMD-sized huge pages.
+collapses sequences of basic pages into huge pages of either PMD size
+or mTHP sizes, if the system is configured to do so.
The THP behaviour is controlled via :ref:`sysfs <thp_sysfs>`
interface and using madvise(2) and prctl(2) system calls.
@@ -219,10 +220,10 @@ this behaviour by writing 0 to shrink_underused, and enable it by writing
echo 0 > /sys/kernel/mm/transparent_hugepage/shrink_underused
echo 1 > /sys/kernel/mm/transparent_hugepage/shrink_underused
-khugepaged will be automatically started when PMD-sized THP is enabled
+khugepaged will be automatically started when any THP size is enabled
(either of the per-size anon control or the top-level control are set
to "always" or "madvise"), and it'll be automatically shutdown when
-PMD-sized THP is disabled (when both the per-size anon control and the
+all THP sizes are disabled (when both the per-size anon control and the
top-level control are "never")
process THP controls
@@ -264,11 +265,6 @@ support the following arguments::
Khugepaged controls
-------------------
-.. note::
- khugepaged currently only searches for opportunities to collapse to
- PMD-sized THP and no attempt is made to collapse to other THP
- sizes.
-
khugepaged runs usually at low frequency so while one may not want to
invoke defrag algorithms synchronously during the page faults, it
should be worth invoking defrag at least in khugepaged. However it's
@@ -296,11 +292,11 @@ allocation failure to throttle the next allocation attempt::
The khugepaged progress can be seen in the number of pages collapsed (note
that this counter may not be an exact count of the number of pages
collapsed, since "collapsed" could mean multiple things: (1) A PTE mapping
-being replaced by a PMD mapping, or (2) All 4K physical pages replaced by
-one 2M hugepage. Each may happen independently, or together, depending on
-the type of memory and the failures that occur. As such, this value should
-be interpreted roughly as a sign of progress, and counters in /proc/vmstat
-consulted for more accurate accounting)::
+being replaced by a PMD mapping, or (2) physical pages replaced by one
+hugepage of various sizes (PMD-sized or mTHP). Each may happen independently,
+or together, depending on the type of memory and the failures that occur.
+As such, this value should be interpreted roughly as a sign of progress,
+and counters in /proc/vmstat consulted for more accurate accounting)::
/sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed
@@ -308,16 +304,20 @@ for each pass::
/sys/kernel/mm/transparent_hugepage/khugepaged/full_scans
-``max_ptes_none`` specifies how many extra small pages (that are
-not already mapped) can be allocated when collapsing a group
-of small pages into one large page::
+``max_ptes_none`` specifies how many empty (none/zero) pages are allowed
+when collapsing a group of small pages into one large page::
/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
-A higher value leads to use additional memory for programs.
-A lower value leads to gain less thp performance. Value of
-max_ptes_none can waste cpu time very little, you can
-ignore it.
+For PMD-sized THP collapse, this directly limits the number of empty pages
+allowed in the 2MB region.
+
+For mTHP collapse, only 0 or (HPAGE_PMD_NR - 1) are supported. Any other value
+will emit a warning and no mTHP collapse will be attempted.
+
+A higher value allows more empty pages, potentially leading to more memory
+usage but better THP performance. A lower value is more conservative and
+may result in fewer THP collapses.
``max_ptes_swap`` specifies how many pages can be brought in from
swap when collapsing a group of pages into a transparent huge page::
@@ -337,6 +337,15 @@ that THP is shared. Exceeding the number would block the collapse::
A higher value may increase memory footprint for some workloads.
+.. note::
+ For mTHP collapse, khugepaged does not support collapsing regions that
+ contain shared or swapped out pages, as this could lead to continuous
+ promotion to higher orders. The collapse will fail if any shared or
+ swapped PTEs are encountered during the scan.
+
+ Currently, madvise_collapse only supports collapsing to PMD-sized THPs
+ and does not attempt mTHP collapses.
+
Boot parameters
===============
--
2.54.0
* [PATCH mm-unstable v17 13/14] mm/khugepaged: run khugepaged for all orders
From: Nico Pache @ 2026-05-11 18:58 UTC (permalink / raw)
To: linux-doc, linux-kernel, linux-mm, linux-trace-kernel
Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
byungchul, catalin.marinas, cl, corbet, dave.hansen, david,
dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
joshua.hahnjy, kas, lance.yang, liam, ljs, mathieu.desnoyers,
matthew.brost, mhiramat, mhocko, npache, peterx, pfalcato,
rakie.kim, raquini, rdunlap, richard.weiyang, rientjes, rostedt,
rppt, ryan.roberts, shivankg, sunnanyong, surenb,
thomas.hellstrom, tiwai, usamaarif642, vbabka, vishal.moola,
wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe,
Usama Arif
In-Reply-To: <20260511185817.686831-1-npache@redhat.com>
From: Baolin Wang <baolin.wang@linux.alibaba.com>
If any (m)THP order is enabled, we should allow khugepaged to run so that
it can scan for and collapse mTHPs. For khugepaged to operate when only
mTHP sizes are specified in sysfs, we must modify the predicate function
that determines whether it ought to run.
This function is currently called hugepage_pmd_enabled(). This patch
renames it to hugepage_enabled() and updates its logic to check whether
any valid orders exist which would justify khugepaged running.
We must also update collapse_allowable_orders() to check all orders if
the vma is anonymous and the collapse is khugepaged.
After this patch khugepaged mTHP collapse is fully enabled.
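As a rough user-space illustration of the predicate change (not the kernel
code itself), the sketch below contrasts a PMD-only test with an any-order
test over the three anon order masks. The names sketch_pmd_enabled(),
sketch_any_enabled() and SKETCH_PMD_ORDER are hypothetical, the PMD order of
9 assumes 4K base pages, and the file-backed and shmem cases are left out.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_PMD_ORDER 9	/* assumption: 4K base pages, 2M PMD */

static_assert(SKETCH_PMD_ORDER < 8 * sizeof(unsigned long),
	      "order masks must fit in an unsigned long");

/* Old behaviour: only the PMD-order bit of each anon mask matters. */
static bool sketch_pmd_enabled(unsigned long always, unsigned long madvise,
			       unsigned long inherit, bool global_enabled)
{
	const unsigned long pmd = 1ul << SKETCH_PMD_ORDER;

	return (always & pmd) || (madvise & pmd) ||
	       ((inherit & pmd) && global_enabled);
}

/* New behaviour: any enabled order is enough to justify running. */
static bool sketch_any_enabled(unsigned long always, unsigned long madvise,
			       unsigned long inherit, bool global_enabled)
{
	return always || madvise || (inherit && global_enabled);
}
```

With only a 64K (order-4) size set to "always", the first predicate is false
and the second is true, which is the case this patch is meant to cover.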
Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>
Reviewed-by: Lance Yang <lance.yang@linux.dev>
Acked-by: Usama Arif <usama.arif@linux.dev>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
mm/khugepaged.c | 35 ++++++++++++++++++++---------------
1 file changed, 20 insertions(+), 15 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index f0ae02936638..5ba298d420b7 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -522,23 +522,23 @@ static inline int collapse_test_exit_or_disable(struct mm_struct *mm)
mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm);
}
-static bool hugepage_pmd_enabled(void)
+static bool hugepage_enabled(void)
{
/*
* We cover the anon, shmem and the file-backed case here; file-backed
* hugepages, when configured in, are determined by the global control.
- * Anon pmd-sized hugepages are determined by the pmd-size control.
+ * Anon hugepages are determined by its per-size mTHP control.
* Shmem pmd-sized hugepages are also determined by its pmd-size control,
* except when the global shmem_huge is set to SHMEM_HUGE_DENY.
*/
if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) &&
hugepage_global_enabled())
return true;
- if (test_bit(PMD_ORDER, &huge_anon_orders_always))
+ if (READ_ONCE(huge_anon_orders_always))
return true;
- if (test_bit(PMD_ORDER, &huge_anon_orders_madvise))
+ if (READ_ONCE(huge_anon_orders_madvise))
return true;
- if (test_bit(PMD_ORDER, &huge_anon_orders_inherit) &&
+ if (READ_ONCE(huge_anon_orders_inherit) &&
hugepage_global_enabled())
return true;
if (IS_ENABLED(CONFIG_SHMEM) && shmem_hpage_pmd_enabled())
@@ -579,7 +579,13 @@ void __khugepaged_enter(struct mm_struct *mm)
static unsigned long collapse_allowable_orders(struct vm_area_struct *vma,
vm_flags_t vm_flags, enum tva_type tva_flags)
{
- unsigned long orders = BIT(HPAGE_PMD_ORDER);
+ unsigned long orders;
+
+ /* If khugepaged is scanning an anonymous vma, allow mTHP collapse */
+ if ((tva_flags == TVA_KHUGEPAGED) && vma_is_anonymous(vma))
+ orders = THP_ORDERS_ALL_ANON;
+ else
+ orders = BIT(HPAGE_PMD_ORDER);
return thp_vma_allowable_orders(vma, vm_flags, tva_flags, orders);
}
@@ -588,10 +594,9 @@ void khugepaged_enter_vma(struct vm_area_struct *vma,
vm_flags_t vm_flags)
{
if (!mm_flags_test(MMF_VM_HUGEPAGE, vma->vm_mm) &&
- hugepage_pmd_enabled()) {
- if (collapse_allowable_orders(vma, vm_flags, TVA_KHUGEPAGED))
- __khugepaged_enter(vma->vm_mm);
- }
+ collapse_allowable_orders(vma, vm_flags, TVA_KHUGEPAGED) &&
+ hugepage_enabled())
+ __khugepaged_enter(vma->vm_mm);
}
void __khugepaged_exit(struct mm_struct *mm)
@@ -2945,7 +2950,7 @@ static void collapse_scan_mm_slot(unsigned int progress_max,
static int khugepaged_has_work(void)
{
- return !list_empty(&khugepaged_scan.mm_head) && hugepage_pmd_enabled();
+ return !list_empty(&khugepaged_scan.mm_head) && hugepage_enabled();
}
static int khugepaged_wait_event(void)
@@ -3018,7 +3023,7 @@ static void khugepaged_wait_work(void)
return;
}
- if (hugepage_pmd_enabled())
+ if (hugepage_enabled())
wait_event_freezable(khugepaged_wait, khugepaged_wait_event());
}
@@ -3049,7 +3054,7 @@ void set_recommended_min_free_kbytes(void)
int nr_zones = 0;
unsigned long recommended_min;
- if (!hugepage_pmd_enabled()) {
+ if (!hugepage_enabled()) {
calculate_min_free_kbytes();
goto update_wmarks;
}
@@ -3099,7 +3104,7 @@ int start_stop_khugepaged(void)
int err = 0;
mutex_lock(&khugepaged_mutex);
- if (hugepage_pmd_enabled()) {
+ if (hugepage_enabled()) {
if (!khugepaged_thread)
khugepaged_thread = kthread_run(khugepaged, NULL,
"khugepaged");
@@ -3125,7 +3130,7 @@ int start_stop_khugepaged(void)
void khugepaged_min_free_kbytes_update(void)
{
mutex_lock(&khugepaged_mutex);
- if (hugepage_pmd_enabled() && khugepaged_thread)
+ if (hugepage_enabled() && khugepaged_thread)
set_recommended_min_free_kbytes();
mutex_unlock(&khugepaged_mutex);
}
--
2.54.0
* [PATCH mm-unstable v17 12/14] mm/khugepaged: avoid unnecessary mTHP collapse attempts
From: Nico Pache @ 2026-05-11 18:58 UTC (permalink / raw)
To: linux-doc, linux-kernel, linux-mm, linux-trace-kernel
Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
byungchul, catalin.marinas, cl, corbet, dave.hansen, david,
dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
joshua.hahnjy, kas, lance.yang, liam, ljs, mathieu.desnoyers,
matthew.brost, mhiramat, mhocko, npache, peterx, pfalcato,
rakie.kim, raquini, rdunlap, richard.weiyang, rientjes, rostedt,
rppt, ryan.roberts, shivankg, sunnanyong, surenb,
thomas.hellstrom, tiwai, usamaarif642, vbabka, vishal.moola,
wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe,
Usama Arif
In-Reply-To: <20260511185817.686831-1-npache@redhat.com>
There are cases where, if an attempted collapse fails, all subsequent
orders are guaranteed to also fail. Avoid these collapse attempts by
bailing out early.
Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>
Acked-by: Usama Arif <usama.arif@linux.dev>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Signed-off-by: Nico Pache <npache@redhat.com>
---
mm/khugepaged.c | 24 +++++++++++++++++++++++-
1 file changed, 23 insertions(+), 1 deletion(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 39bf7ea8a6e8..f0ae02936638 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1531,9 +1531,31 @@ static int mthp_collapse(struct mm_struct *mm, unsigned long address,
collapse_address = address + offset * PAGE_SIZE;
ret = collapse_huge_page(mm, collapse_address, referenced,
unmapped, cc, order);
- if (ret == SCAN_SUCCEED) {
+
+ switch (ret) {
+ /* Cases where we continue to next collapse candidate */
+ case SCAN_SUCCEED:
collapsed += nr_ptes;
+ fallthrough;
+ case SCAN_PTE_MAPPED_HUGEPAGE:
continue;
+ /* Cases where lower orders might still succeed */
+ case SCAN_LACK_REFERENCED_PAGE:
+ case SCAN_EXCEED_NONE_PTE:
+ case SCAN_EXCEED_SWAP_PTE:
+ case SCAN_EXCEED_SHARED_PTE:
+ case SCAN_PAGE_LOCK:
+ case SCAN_PAGE_COUNT:
+ case SCAN_PAGE_NULL:
+ case SCAN_DEL_PAGE_LRU:
+ case SCAN_PTE_NON_PRESENT:
+ case SCAN_PTE_UFFD_WP:
+ case SCAN_ALLOC_HUGE_PAGE_FAIL:
+ case SCAN_PAGE_LAZYFREE:
+ goto next_order;
+ /* Cases where no further collapse is possible */
+ default:
+ return collapsed;
}
}
--
2.54.0
* [PATCH mm-unstable v17 11/14] mm/khugepaged: Introduce mTHP collapse support
From: Nico Pache @ 2026-05-11 18:58 UTC (permalink / raw)
To: linux-doc, linux-kernel, linux-mm, linux-trace-kernel
Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
byungchul, catalin.marinas, cl, corbet, dave.hansen, david,
dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
joshua.hahnjy, kas, lance.yang, liam, ljs, mathieu.desnoyers,
matthew.brost, mhiramat, mhocko, npache, peterx, pfalcato,
rakie.kim, raquini, rdunlap, richard.weiyang, rientjes, rostedt,
rppt, ryan.roberts, shivankg, sunnanyong, surenb,
thomas.hellstrom, tiwai, usamaarif642, vbabka, vishal.moola,
wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <20260511185817.686831-1-npache@redhat.com>
Enable khugepaged to collapse to mTHP orders. This patch implements the
main scanning logic using a bitmap to track occupied pages and a stack
structure that allows us to find optimal collapse sizes.
Prior to this patch, PMD collapse had three main phases: a lightweight
scanning phase (mmap_read_lock) that identifies a potential PMD
collapse, an alloc phase (mmap unlocked), and finally a heavier collapse
phase (mmap_write_lock).
To enable mTHP collapse we make the following changes:
During PMD scan phase, track occupied pages in a bitmap. When mTHP
orders are enabled, we remove the restriction of max_ptes_none during the
scan phase to avoid missing potential mTHP collapse candidates. Once we
have scanned the full PMD range and updated the bitmap to track occupied
pages, we use the bitmap to find the optimal mTHP size.
Implement mthp_collapse() to repeatedly halve the bitmap and determine
the best eligible order for the collapse. An explicit stack structure is
used to manage the search instead of traditional recursion, which is best
avoided given the limited size of the kernel stack. The algorithm splits
the bitmap into smaller and smaller chunks to find the highest order
mTHPs that satisfy the collapse criteria. We start by attempting the PMD
order, then move on to consecutively lower orders (mTHP collapse). The
stack maintains a pair of variables (offset, order), indicating the
number of PTEs from the start of the PMD, and the order of the potential
collapse candidate.
The algorithm for consuming the bitmap works as such:
1) push (0, HPAGE_PMD_ORDER) onto the stack
2) pop the stack
3) check if the number of set bits in that (offset,order) pair
satisfies the max_ptes_none threshold for that order
4) if yes, attempt collapse
5) if no (or collapse fails), push two new stack items representing
the left and right halves of the current bitmap range, at the
next lower order
6) repeat at step (2) until stack is empty.
Below is a diagram representing the algorithm and stack items:
offset mid_offset
| |
| |
v v
____________________________________
| PTE Page Table |
--------------------------------------
<-------><------->
order-1 order-1
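The steps above can be sketched in user space roughly as follows. This is
a simplified illustration, not the patch's implementation: the bitmap is a
plain bool array, a collapse is assumed to always succeed, and the
max_ptes_none policy is reduced to a hypothetical allow_empty flag (false
behaves like max_ptes_none=0, true like max_ptes_none=HPAGE_PMD_NR-1 scaled
to each order). All SKETCH_* and sketch_* names are made up for the example,
and a PMD order of 9 is assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PMD_ORDER  9	/* assumption: 512 PTEs per PMD */
#define SKETCH_PMD_NR     (1u << SKETCH_PMD_ORDER)
#define SKETCH_MIN_ORDER  2	/* lowest anon mTHP order */
#define SKETCH_STACK_SIZE (SKETCH_PMD_ORDER - SKETCH_MIN_ORDER + 1)

struct sketch_range {
	uint16_t offset;	/* first PTE index within the PMD range */
	uint8_t order;		/* candidate collapse order */
};

/* Count occupied pages in [offset, offset + nr). */
static unsigned int sketch_count_present(const bool *occupied,
					 unsigned int offset, unsigned int nr)
{
	unsigned int i, n = 0;

	for (i = 0; i < nr; i++)
		n += occupied[offset + i];
	return n;
}

/* Returns the number of PTEs covered by (simulated) collapses. */
static unsigned int sketch_mthp_collapse(const bool *occupied,
					 unsigned long enabled_orders,
					 bool allow_empty)
{
	struct sketch_range stack[SKETCH_STACK_SIZE];
	unsigned int collapsed = 0;
	int top = 0;

	/* Step 1: start with the whole PMD range. */
	stack[top++] = (struct sketch_range){ 0, SKETCH_PMD_ORDER };

	while (top > 0) {
		/* Step 2: pop the next candidate. */
		struct sketch_range r = stack[--top];
		unsigned int nr = 1u << r.order;
		unsigned int max_none = allow_empty ? nr - 1 : 0;

		/* Steps 3-4: collapse if the order is enabled and full enough. */
		if ((enabled_orders & (1ul << r.order)) &&
		    sketch_count_present(occupied, r.offset, nr) >= nr - max_none) {
			collapsed += nr;	/* stand-in for a successful collapse */
			continue;
		}

		/* Step 5: push right half first so the left half is popped first. */
		if (r.order > SKETCH_MIN_ORDER) {
			assert(top + 2 <= SKETCH_STACK_SIZE);
			stack[top++] = (struct sketch_range){
				(uint16_t)(r.offset + nr / 2), (uint8_t)(r.order - 1) };
			stack[top++] = (struct sketch_range){
				r.offset, (uint8_t)(r.order - 1) };
		}
	}
	return collapsed;
}
```

Because each iteration pops one entry and pushes at most two, the stack
never holds more than one entry per level plus one, which is why a fixed
array sized from the order range is sufficient.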
mTHP collapses reject regions containing swapped out or shared pages.
This is because adding new entries can lead to new none pages, and these
may lead to constant promotion into a higher order mTHP. A similar
issue can occur with "max_ptes_none > HPAGE_PMD_NR/2": a collapse then
introduces at least 2x the number of pages, so a future scan will
satisfy the promotion condition once again. This issue is prevented via
the collapse_max_ptes_none() function, which imposes the max_ptes_none
restrictions described below.
We currently only support mTHP collapse for max_ptes_none values of 0
and HPAGE_PMD_NR - 1, resulting in the following behavior:
- max_ptes_none=0: Never introduce new empty pages during collapse
- max_ptes_none=HPAGE_PMD_NR-1: Always try collapse to the highest
available mTHP order
Any other max_ptes_none value will emit a warning and skip mTHP collapse
attempts. There should be no behavior change for PMD collapse.
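A minimal sketch of this policy, under stated assumptions: the function
below is a hypothetical stand-in and not the kernel's
collapse_max_ptes_none(). In particular, scaling HPAGE_PMD_NR - 1 down to
"(1 << order) - 1" for lower orders is an assumption made for illustration,
as is the choice of -EINVAL as the "unsupported" return value (the patch
only shows that a negative return stops mTHP collapse).

```c
#include <assert.h>
#include <errno.h>

#define SKETCH_PMD_ORDER 9	/* assumption: 512 PTEs per PMD */
#define SKETCH_PMD_NR    (1 << SKETCH_PMD_ORDER)

/*
 * Per-order limit on empty (none/zero) PTEs for a collapse at @order,
 * given the sysfs max_ptes_none value. Negative means "do not attempt".
 */
static int sketch_max_ptes_none(int sysfs_max_ptes_none, unsigned int order)
{
	assert(order <= SKETCH_PMD_ORDER);

	/* PMD collapse keeps its existing behaviour: use the value as-is. */
	if (order == SKETCH_PMD_ORDER)
		return sysfs_max_ptes_none;

	/* Never introduce new empty pages. */
	if (sysfs_max_ptes_none == 0)
		return 0;

	/* Accept any range with at least one occupied page (assumed scaling). */
	if (sysfs_max_ptes_none == SKETCH_PMD_NR - 1)
		return (1 << order) - 1;

	/* Any other value: mTHP collapse is skipped. */
	return -EINVAL;
}
```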
Once we determine which mTHP sizes fit best in that PMD range, a collapse
is attempted. A minimum collapse order of 2 is used as this is the lowest
order supported by anon memory as defined by THP_ORDERS_ALL_ANON.
mTHP collapse is currently not supported for madvise_collapse(), which
will only attempt PMD collapse.
We can also remove the check for is_khugepaged inside the PMD scan as
the collapse_max_ptes_none() function handles this logic now.
Signed-off-by: Nico Pache <npache@redhat.com>
---
mm/khugepaged.c | 182 +++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 174 insertions(+), 8 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 3492b135d667..39bf7ea8a6e8 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -100,6 +100,30 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
static struct kmem_cache *mm_slot_cache __ro_after_init;
+#define KHUGEPAGED_MIN_MTHP_ORDER 2
+/*
+ * mthp_collapse() does an iterative DFS over a binary tree, from
+ * HPAGE_PMD_ORDER down to KHUGEPAGED_MIN_MTHP_ORDER. The max stack
+ * size needed for a DFS on a binary tree is height + 1, where
+ * height = HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER.
+ *
+ * ilog2 is used in place of HPAGE_PMD_ORDER because some architectures
+ * (e.g. ppc64le) do not define HPAGE_PMD_ORDER until after build time.
+ */
+#define MTHP_STACK_SIZE (ilog2(MAX_PTRS_PER_PTE) - KHUGEPAGED_MIN_MTHP_ORDER + 1)
+
+/*
+ * Defines a range of PTE entries in a PTE page table which are being
+ * considered for mTHP collapse.
+ *
+ * @offset: the offset of the first PTE entry in a PMD range.
+ * @order: the order of the PTE entries being considered for collapse.
+ */
+struct mthp_range {
+ u16 offset;
+ u8 order;
+};
+
struct collapse_control {
bool is_khugepaged;
@@ -111,6 +135,12 @@ struct collapse_control {
/* nodemask for allocation fallback */
nodemask_t alloc_nmask;
+
+ /* Each bit represents a single occupied (!none/zero) page. */
+ DECLARE_BITMAP(mthp_bitmap, MAX_PTRS_PER_PTE);
+ /* A mask of the current range being considered for mTHP collapse. */
+ DECLARE_BITMAP(mthp_bitmap_mask, MAX_PTRS_PER_PTE);
+ struct mthp_range mthp_bitmap_stack[MTHP_STACK_SIZE];
};
/**
@@ -1404,20 +1434,140 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long s
return result;
}
+static void collapse_mthp_stack_push(struct collapse_control *cc, int *stack_size,
+ u16 offset, u8 order)
+{
+ const int size = *stack_size;
+ struct mthp_range *stack = &cc->mthp_bitmap_stack[size];
+
+ VM_WARN_ON_ONCE(size >= MTHP_STACK_SIZE);
+ stack->order = order;
+ stack->offset = offset;
+ (*stack_size)++;
+}
+
+static struct mthp_range collapse_mthp_stack_pop(struct collapse_control *cc,
+ int *stack_size)
+{
+ const int size = *stack_size;
+
+ VM_WARN_ON_ONCE(size <= 0);
+ (*stack_size)--;
+ return cc->mthp_bitmap_stack[size - 1];
+}
+
+static unsigned int collapse_mthp_count_present(struct collapse_control *cc,
+ u16 offset, unsigned int nr_ptes)
+{
+ bitmap_zero(cc->mthp_bitmap_mask, MAX_PTRS_PER_PTE);
+ bitmap_set(cc->mthp_bitmap_mask, offset, nr_ptes);
+ return bitmap_weight_and(cc->mthp_bitmap, cc->mthp_bitmap_mask, MAX_PTRS_PER_PTE);
+}
+
+/*
+ * mthp_collapse() consumes the bitmap that is generated during
+ * collapse_scan_pmd() to determine what regions and mTHP orders fit best.
+ *
+ * Each bit in cc->mthp_bitmap represents a single occupied (!none/zero) page.
+ * A stack structure cc->mthp_bitmap_stack is used to check different regions
+ * of the bitmap for collapse eligibility. The stack maintains a pair of
+ * variables (offset, order), indicating the number of PTEs from the start of
+ * the PMD, and the order of the potential collapse candidate respectively. We
+ * start at the PMD order and check if it is eligible for collapse; if not, we
+ * add two entries to the stack at a lower order to represent the left and right
+ * halves of the PTE page table we are examining.
+ *
+ * offset mid_offset
+ * | |
+ * | |
+ * v v
+ * --------------------------------------
+ * | cc->mthp_bitmap |
+ * --------------------------------------
+ * <-------><------->
+ * order-1 order-1
+ *
+ * For each of these, we determine how many PTE entries are occupied in the
+ * range of PTE entries we propose to collapse, then we compare this to a
+ * threshold number of PTE entries which would need to be occupied for a
+ * collapse to be permitted at that order (accounting for max_ptes_none).
+ *
+ * If a collapse is permitted, we attempt to collapse the PTE range into a
+ * mTHP.
+ */
+static int mthp_collapse(struct mm_struct *mm, unsigned long address,
+ int referenced, int unmapped, struct collapse_control *cc,
+ unsigned long enabled_orders)
+{
+ unsigned int nr_occupied_ptes, nr_ptes;
+ int max_ptes_none, collapsed = 0, stack_size = 0;
+ unsigned long collapse_address;
+ struct mthp_range range;
+ u16 offset;
+ u8 order;
+
+ collapse_mthp_stack_push(cc, &stack_size, 0, HPAGE_PMD_ORDER);
+
+ while (stack_size) {
+ range = collapse_mthp_stack_pop(cc, &stack_size);
+ order = range.order;
+ offset = range.offset;
+ nr_ptes = 1UL << order;
+
+ if (!test_bit(order, &enabled_orders))
+ goto next_order;
+
+ max_ptes_none = collapse_max_ptes_none(cc, NULL, order);
+
+ if (max_ptes_none < 0)
+ return collapsed;
+
+ nr_occupied_ptes = collapse_mthp_count_present(cc, offset,
+ nr_ptes);
+
+ if (nr_occupied_ptes >= nr_ptes - max_ptes_none) {
+ int ret;
+
+ collapse_address = address + offset * PAGE_SIZE;
+ ret = collapse_huge_page(mm, collapse_address, referenced,
+ unmapped, cc, order);
+ if (ret == SCAN_SUCCEED) {
+ collapsed += nr_ptes;
+ continue;
+ }
+ }
+
+next_order:
+ if (order > KHUGEPAGED_MIN_MTHP_ORDER) {
+ const u8 next_order = order - 1;
+ const u16 mid_offset = offset + (nr_ptes / 2);
+
+ collapse_mthp_stack_push(cc, &stack_size, mid_offset,
+ next_order);
+ collapse_mthp_stack_push(cc, &stack_size, offset,
+ next_order);
+ }
+ }
+ return collapsed;
+}
+
static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
struct vm_area_struct *vma, unsigned long start_addr,
bool *lock_dropped, struct collapse_control *cc)
{
- const int max_ptes_none = collapse_max_ptes_none(cc, vma, HPAGE_PMD_ORDER);
+ int max_ptes_none = collapse_max_ptes_none(cc, vma, HPAGE_PMD_ORDER);
const unsigned int max_ptes_shared = collapse_max_ptes_shared(cc, HPAGE_PMD_ORDER);
const unsigned int max_ptes_swap = collapse_max_ptes_swap(cc, HPAGE_PMD_ORDER);
+ enum tva_type tva_flags = cc->is_khugepaged ? TVA_KHUGEPAGED : TVA_FORCED_COLLAPSE;
pmd_t *pmd;
- pte_t *pte, *_pte;
- int none_or_zero = 0, shared = 0, referenced = 0;
+ pte_t *pte, *_pte, pteval;
+ int i;
+ int none_or_zero = 0, shared = 0, nr_collapsed = 0, referenced = 0;
enum scan_result result = SCAN_FAIL;
struct page *page = NULL;
struct folio *folio = NULL;
unsigned long addr;
+ unsigned long enabled_orders;
spinlock_t *ptl;
int node = NUMA_NO_NODE, unmapped = 0;
@@ -1429,8 +1579,19 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
goto out;
}
+ bitmap_zero(cc->mthp_bitmap, MAX_PTRS_PER_PTE);
memset(cc->node_load, 0, sizeof(cc->node_load));
nodes_clear(cc->alloc_nmask);
+
+ enabled_orders = collapse_allowable_orders(vma, vma->vm_flags, tva_flags);
+
+ /*
+ * If PMD is the only enabled order, enforce max_ptes_none, otherwise
+ * scan all pages to populate the bitmap for mTHP collapse.
+ */
+ if (enabled_orders != BIT(HPAGE_PMD_ORDER))
+ max_ptes_none = KHUGEPAGED_MAX_PTES_LIMIT;
+
pte = pte_offset_map_lock(mm, pmd, start_addr, &ptl);
if (!pte) {
cc->progress++;
@@ -1438,11 +1599,13 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
goto out;
}
- for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
- _pte++, addr += PAGE_SIZE) {
+ for (i = 0; i < HPAGE_PMD_NR; i++) {
+ _pte = pte + i;
+ addr = start_addr + i * PAGE_SIZE;
+ pteval = ptep_get(_pte);
+
cc->progress++;
- pte_t pteval = ptep_get(_pte);
if (pte_none_or_zero(pteval)) {
if (++none_or_zero > max_ptes_none) {
result = SCAN_EXCEED_NONE_PTE;
@@ -1522,6 +1685,8 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
}
}
+ /* Set bit for occupied pages */
+ __set_bit(i, cc->mthp_bitmap);
/*
* Record which node the original page is from and save this
* information to cc->node_load[].
@@ -1580,10 +1745,11 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
if (result == SCAN_SUCCEED) {
/* collapse_huge_page expects the lock to be dropped before calling */
mmap_read_unlock(mm);
- result = collapse_huge_page(mm, start_addr, referenced,
- unmapped, cc, HPAGE_PMD_ORDER);
+ nr_collapsed = mthp_collapse(mm, start_addr, referenced, unmapped,
+ cc, enabled_orders);
/* collapse_huge_page will return with the mmap_lock released */
*lock_dropped = true;
+ result = nr_collapsed ? SCAN_SUCCEED : SCAN_FAIL;
}
out:
trace_mm_khugepaged_scan_pmd(mm, folio, referenced,
--
2.54.0
* [PATCH mm-unstable v17 10/14] mm/khugepaged: introduce collapse_allowable_orders helper function
From: Nico Pache @ 2026-05-11 18:58 UTC (permalink / raw)
To: linux-doc, linux-kernel, linux-mm, linux-trace-kernel
Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
byungchul, catalin.marinas, cl, corbet, dave.hansen, david,
dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
joshua.hahnjy, kas, lance.yang, liam, ljs, mathieu.desnoyers,
matthew.brost, mhiramat, mhocko, npache, peterx, pfalcato,
rakie.kim, raquini, rdunlap, richard.weiyang, rientjes, rostedt,
rppt, ryan.roberts, shivankg, sunnanyong, surenb,
thomas.hellstrom, tiwai, usamaarif642, vbabka, vishal.moola,
wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <20260511185817.686831-1-npache@redhat.com>
Add collapse_allowable_orders() to generalize THP order eligibility. The
function determines which THP orders are permitted based on collapse
context (khugepaged vs madv_collapse).
This consolidates collapse configuration logic and provides a clean
interface for future mTHP collapse support where the orders may be
different.
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
mm/khugepaged.c | 15 ++++++++++++---
1 file changed, 12 insertions(+), 3 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index f28066069437..3492b135d667 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -545,12 +545,21 @@ void __khugepaged_enter(struct mm_struct *mm)
wake_up_interruptible(&khugepaged_wait);
}
+/* Check what orders are allowed based on the vma and collapse type */
+static unsigned long collapse_allowable_orders(struct vm_area_struct *vma,
+ vm_flags_t vm_flags, enum tva_type tva_flags)
+{
+ unsigned long orders = BIT(HPAGE_PMD_ORDER);
+
+ return thp_vma_allowable_orders(vma, vm_flags, tva_flags, orders);
+}
+
void khugepaged_enter_vma(struct vm_area_struct *vma,
vm_flags_t vm_flags)
{
if (!mm_flags_test(MMF_VM_HUGEPAGE, vma->vm_mm) &&
hugepage_pmd_enabled()) {
- if (thp_vma_allowable_order(vma, vm_flags, TVA_KHUGEPAGED, PMD_ORDER))
+ if (collapse_allowable_orders(vma, vm_flags, TVA_KHUGEPAGED))
__khugepaged_enter(vma->vm_mm);
}
}
@@ -2673,7 +2682,7 @@ static void collapse_scan_mm_slot(unsigned int progress_max,
cc->progress++;
break;
}
- if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_KHUGEPAGED, PMD_ORDER)) {
+ if (!collapse_allowable_orders(vma, vma->vm_flags, TVA_KHUGEPAGED)) {
cc->progress++;
continue;
}
@@ -2983,7 +2992,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
BUG_ON(vma->vm_start > start);
BUG_ON(vma->vm_end < end);
- if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_FORCED_COLLAPSE, PMD_ORDER))
+ if (!collapse_allowable_orders(vma, vma->vm_flags, TVA_FORCED_COLLAPSE))
return -EINVAL;
cc = kmalloc_obj(*cc);
--
2.54.0
* [PATCH mm-unstable v17 09/14] mm/khugepaged: improve tracepoints for mTHP orders
From: Nico Pache @ 2026-05-11 18:58 UTC (permalink / raw)
To: linux-doc, linux-kernel, linux-mm, linux-trace-kernel
Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
byungchul, catalin.marinas, cl, corbet, dave.hansen, david,
dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
joshua.hahnjy, kas, lance.yang, liam, ljs, mathieu.desnoyers,
matthew.brost, mhiramat, mhocko, npache, peterx, pfalcato,
rakie.kim, raquini, rdunlap, richard.weiyang, rientjes, rostedt,
rppt, ryan.roberts, shivankg, sunnanyong, surenb,
thomas.hellstrom, tiwai, usamaarif642, vbabka, vishal.moola,
wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <20260511185817.686831-1-npache@redhat.com>
Add the order to the mm_collapse_huge_page<_swapin,_isolate> tracepoints to
give better insight into which order is being operated on.
Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Signed-off-by: Nico Pache <npache@redhat.com>
---
include/trace/events/huge_memory.h | 34 +++++++++++++++++++-----------
mm/khugepaged.c | 9 ++++----
2 files changed, 27 insertions(+), 16 deletions(-)
diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
index 443e0bd13fdb..70c25136e7e8 100644
--- a/include/trace/events/huge_memory.h
+++ b/include/trace/events/huge_memory.h
@@ -90,40 +90,44 @@ TRACE_EVENT(mm_khugepaged_scan_pmd,
TRACE_EVENT(mm_collapse_huge_page,
- TP_PROTO(struct mm_struct *mm, int isolated, int status),
+ TP_PROTO(struct mm_struct *mm, int isolated, int status, unsigned int order),
- TP_ARGS(mm, isolated, status),
+ TP_ARGS(mm, isolated, status, order),
TP_STRUCT__entry(
__field(struct mm_struct *, mm)
__field(int, isolated)
__field(int, status)
+ __field(unsigned int, order)
),
TP_fast_assign(
__entry->mm = mm;
__entry->isolated = isolated;
__entry->status = status;
+ __entry->order = order;
),
- TP_printk("mm=%p, isolated=%d, status=%s",
+ TP_printk("mm=%p, isolated=%d, status=%s, order=%u",
__entry->mm,
__entry->isolated,
- __print_symbolic(__entry->status, SCAN_STATUS))
+ __print_symbolic(__entry->status, SCAN_STATUS),
+ __entry->order)
);
TRACE_EVENT(mm_collapse_huge_page_isolate,
TP_PROTO(struct folio *folio, int none_or_zero,
- int referenced, int status),
+ int referenced, int status, unsigned int order),
- TP_ARGS(folio, none_or_zero, referenced, status),
+ TP_ARGS(folio, none_or_zero, referenced, status, order),
TP_STRUCT__entry(
__field(unsigned long, pfn)
__field(int, none_or_zero)
__field(int, referenced)
__field(int, status)
+ __field(unsigned int, order)
),
TP_fast_assign(
@@ -131,26 +135,30 @@ TRACE_EVENT(mm_collapse_huge_page_isolate,
__entry->none_or_zero = none_or_zero;
__entry->referenced = referenced;
__entry->status = status;
+ __entry->order = order;
),
- TP_printk("scan_pfn=0x%lx, none_or_zero=%d, referenced=%d, status=%s",
+ TP_printk("scan_pfn=0x%lx, none_or_zero=%d, referenced=%d, status=%s, order=%u",
__entry->pfn,
__entry->none_or_zero,
__entry->referenced,
- __print_symbolic(__entry->status, SCAN_STATUS))
+ __print_symbolic(__entry->status, SCAN_STATUS),
+ __entry->order)
);
TRACE_EVENT(mm_collapse_huge_page_swapin,
- TP_PROTO(struct mm_struct *mm, int swapped_in, int referenced, int ret),
+ TP_PROTO(struct mm_struct *mm, int swapped_in, int referenced, int ret,
+ unsigned int order),
- TP_ARGS(mm, swapped_in, referenced, ret),
+ TP_ARGS(mm, swapped_in, referenced, ret, order),
TP_STRUCT__entry(
__field(struct mm_struct *, mm)
__field(int, swapped_in)
__field(int, referenced)
__field(int, ret)
+ __field(unsigned int, order)
),
TP_fast_assign(
@@ -158,13 +166,15 @@ TRACE_EVENT(mm_collapse_huge_page_swapin,
__entry->swapped_in = swapped_in;
__entry->referenced = referenced;
__entry->ret = ret;
+ __entry->order = order;
),
- TP_printk("mm=%p, swapped_in=%d, referenced=%d, ret=%d",
+ TP_printk("mm=%p, swapped_in=%d, referenced=%d, ret=%d, order=%u",
__entry->mm,
__entry->swapped_in,
__entry->referenced,
- __entry->ret)
+ __entry->ret,
+ __entry->order)
);
TRACE_EVENT(mm_khugepaged_scan_file,
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 27654ea3f5ca..f28066069437 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -779,13 +779,13 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
} else {
result = SCAN_SUCCEED;
trace_mm_collapse_huge_page_isolate(folio, none_or_zero,
- referenced, result);
+ referenced, result, order);
return result;
}
out:
release_pte_pages(pte, _pte, compound_pagelist);
trace_mm_collapse_huge_page_isolate(folio, none_or_zero,
- referenced, result);
+ referenced, result, order);
return result;
}
@@ -1181,7 +1181,8 @@ static enum scan_result __collapse_huge_page_swapin(struct mm_struct *mm,
result = SCAN_SUCCEED;
out:
- trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, result);
+ trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, result,
+ order);
return result;
}
@@ -1390,7 +1391,7 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long s
out_nolock:
if (folio)
folio_put(folio);
- trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result);
+ trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result, order);
return result;
}
--
2.54.0
* [PATCH mm-unstable v17 08/14] mm/khugepaged: add per-order mTHP collapse failure statistics
From: Nico Pache @ 2026-05-11 18:58 UTC (permalink / raw)
To: linux-doc, linux-kernel, linux-mm, linux-trace-kernel
Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
byungchul, catalin.marinas, cl, corbet, dave.hansen, david,
dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
joshua.hahnjy, kas, lance.yang, liam, ljs, mathieu.desnoyers,
matthew.brost, mhiramat, mhocko, npache, peterx, pfalcato,
rakie.kim, raquini, rdunlap, richard.weiyang, rientjes, rostedt,
rppt, ryan.roberts, shivankg, sunnanyong, surenb,
thomas.hellstrom, tiwai, usamaarif642, vbabka, vishal.moola,
wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <20260511185817.686831-1-npache@redhat.com>
Add three new mTHP statistics to track collapse failures for different
orders when encountering swap PTEs, excessive none PTEs, and shared PTEs:
- collapse_exceed_swap_pte: Counts when mTHP collapse fails due to
encountering a swap PTE.
- collapse_exceed_none_pte: Counts when mTHP collapse fails due to
exceeding the none PTE threshold for the given order.
- collapse_exceed_shared_pte: Counts when mTHP collapse fails due to
encountering a shared PTE.
These statistics complement the existing THP_SCAN_EXCEED_* events by
providing per-order granularity for mTHP collapse attempts. The stats are
exposed via sysfs under
`/sys/kernel/mm/transparent_hugepage/hugepages-*/stats/` for each
supported hugepage size.
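As a rough illustration of how these counters could be consumed from user space, the sketch below globs the per-size stats directories and prints each counter. The helper names `read_counter()` and `dump_counters()` are hypothetical and exist only for this example; the only thing taken from the patch is the sysfs path layout.

```c
#include <glob.h>
#include <stdio.h>

/*
 * Read one unsigned counter from a sysfs-style text file.
 * Returns 0 on success, -1 if the file cannot be opened or parsed.
 */
static int read_counter(const char *path, unsigned long long *value)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;
	ok = (fscanf(f, "%llu", value) == 1);
	fclose(f);
	return ok ? 0 : -1;
}

/*
 * Print "<path>: <value>" for every readable file matching `pattern`.
 * Returns the number of counters printed, 0 if nothing matched,
 * or -1 if glob() itself failed.
 */
static int dump_counters(const char *pattern)
{
	glob_t g;
	int printed = 0;
	int rc = glob(pattern, 0, NULL, &g);

	if (rc == GLOB_NOMATCH)
		return 0;
	if (rc != 0)
		return -1;

	for (size_t i = 0; i < g.gl_pathc; i++) {
		unsigned long long v;

		if (read_counter(g.gl_pathv[i], &v) == 0) {
			printf("%s: %llu\n", g.gl_pathv[i], v);
			printed++;
		}
	}
	globfree(&g);
	return printed;
}
```

On a kernel carrying this patch, calling `dump_counters("/sys/kernel/mm/transparent_hugepage/hugepages-*/stats/collapse_exceed_*")` from `main()` would be expected to list the three new counters once per supported hugepage size.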
As we currently do not support collapsing mTHPs that contain a swap or
shared entry, those statistics keep track of how often we are
encountering failed mTHP collapses due to these restrictions.
We will add support for mTHP collapse for anonymous pages next; let's also
track these failures at the PMD level within the per-mTHP stats.
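The accounting rule used in the isolate path can be modelled in user space as follows. This is only a sketch: `struct collapse_counters`, `account_exceed_none()`, and `SKETCH_PMD_ORDER` are hypothetical names, and the PMD order of 9 assumes 4 KiB base pages with 2 MiB PMD mappings. It shows that the system-wide event is bumped only for PMD-order attempts, while the per-order stat is bumped for every order.

```c
#include <stdbool.h>

/* Assumption: 4 KiB base pages and 2 MiB PMDs, so the PMD order is 9. */
#define SKETCH_PMD_ORDER 9
#define SKETCH_NR_ORDERS (SKETCH_PMD_ORDER + 1)

/* User-space stand-ins for the kernel counters touched by the patch. */
struct collapse_counters {
	/* models the THP_SCAN_EXCEED_NONE_PTE vm event */
	unsigned long vm_exceed_none;
	/* models MTHP_STAT_COLLAPSE_EXCEED_NONE, one slot per order */
	unsigned long mthp_exceed_none[SKETCH_NR_ORDERS];
};

static bool sketch_is_pmd_order(unsigned int order)
{
	return order == SKETCH_PMD_ORDER;
}

/*
 * Account one "exceeded max_ptes_none" failure at the given order:
 * the global event only counts PMD-order attempts, the per-order
 * stat counts every attempt. Out-of-range orders are ignored.
 */
static void account_exceed_none(struct collapse_counters *c,
				unsigned int order)
{
	if (order >= SKETCH_NR_ORDERS)
		return;
	if (sketch_is_pmd_order(order))
		c->vm_exceed_none++;
	c->mthp_exceed_none[order]++;
}
```

With this split, the existing vm event keeps its PMD-only meaning, and the PMD-order slot of the per-order stat advances in step with it.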
Signed-off-by: Nico Pache <npache@redhat.com>
---
Documentation/admin-guide/mm/transhuge.rst | 14 ++++++++++++++
include/linux/huge_mm.h | 3 +++
mm/huge_memory.c | 7 +++++++
mm/khugepaged.c | 21 +++++++++++++++++++--
4 files changed, 43 insertions(+), 2 deletions(-)
diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index c51932e6275d..80a4d0bed70b 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -714,6 +714,20 @@ nr_anon_partially_mapped
an anonymous THP as "partially mapped" and count it here, even though it
is not actually partially mapped anymore.
+collapse_exceed_none_pte
+ The number of collapse attempts that failed due to exceeding the
+ max_ptes_none threshold.
+
+collapse_exceed_swap_pte
+ The number of collapse attempts that failed due to exceeding the
+ max_ptes_swap threshold. For non-PMD orders this occurs if a mTHP range
+ contains at least one swap PTE.
+
+collapse_exceed_shared_pte
+ The number of collapse attempts that failed due to exceeding the
+ max_ptes_shared threshold. For non-PMD orders this occurs if a mTHP range
+ contains at least one shared PTE.
+
As the system ages, allocating huge pages may be expensive as the
system uses memory compaction to copy data around memory to free a
huge page for use. There are some counters in ``/proc/vmstat`` to help
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index ba7ae6808544..48496f09909b 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -144,6 +144,9 @@ enum mthp_stat_item {
MTHP_STAT_SPLIT_DEFERRED,
MTHP_STAT_NR_ANON,
MTHP_STAT_NR_ANON_PARTIALLY_MAPPED,
+ MTHP_STAT_COLLAPSE_EXCEED_SWAP,
+ MTHP_STAT_COLLAPSE_EXCEED_NONE,
+ MTHP_STAT_COLLAPSE_EXCEED_SHARED,
__MTHP_STAT_COUNT
};
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 05f482a72a89..3e9eabc74c6c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -717,6 +717,10 @@ DEFINE_MTHP_STAT_ATTR(split_failed, MTHP_STAT_SPLIT_FAILED);
DEFINE_MTHP_STAT_ATTR(split_deferred, MTHP_STAT_SPLIT_DEFERRED);
DEFINE_MTHP_STAT_ATTR(nr_anon, MTHP_STAT_NR_ANON);
DEFINE_MTHP_STAT_ATTR(nr_anon_partially_mapped, MTHP_STAT_NR_ANON_PARTIALLY_MAPPED);
+DEFINE_MTHP_STAT_ATTR(collapse_exceed_swap_pte, MTHP_STAT_COLLAPSE_EXCEED_SWAP);
+DEFINE_MTHP_STAT_ATTR(collapse_exceed_none_pte, MTHP_STAT_COLLAPSE_EXCEED_NONE);
+DEFINE_MTHP_STAT_ATTR(collapse_exceed_shared_pte, MTHP_STAT_COLLAPSE_EXCEED_SHARED);
+
static struct attribute *anon_stats_attrs[] = {
&anon_fault_alloc_attr.attr,
@@ -733,6 +737,9 @@ static struct attribute *anon_stats_attrs[] = {
&split_deferred_attr.attr,
&nr_anon_attr.attr,
&nr_anon_partially_mapped_attr.attr,
+ &collapse_exceed_swap_pte_attr.attr,
+ &collapse_exceed_none_pte_attr.attr,
+ &collapse_exceed_shared_pte_attr.attr,
NULL,
};
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index ba21b134fc86..27654ea3f5ca 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -645,7 +645,9 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
if (pte_none_or_zero(pteval)) {
if (++none_or_zero > max_ptes_none) {
result = SCAN_EXCEED_NONE_PTE;
- count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
+ if (is_pmd_order(order))
+ count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
+ count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_NONE);
goto out;
}
continue;
@@ -679,9 +681,17 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
/* See collapse_scan_pmd(). */
if (folio_maybe_mapped_shared(folio)) {
+ /*
+ * TODO: Support shared pages without leading to further
+ * mTHP collapses. Currently bringing in new pages via
+ * shared may cause a future higher order collapse on a
+ * rescan of the same range.
+ */
if (++shared > max_ptes_shared) {
result = SCAN_EXCEED_SHARED_PTE;
- count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
+ if (is_pmd_order(order))
+ count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
+ count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_SHARED);
goto out;
}
}
@@ -1130,6 +1140,7 @@ static enum scan_result __collapse_huge_page_swapin(struct mm_struct *mm,
* range.
*/
if (!is_pmd_order(order)) {
+ count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_SWAP);
pte_unmap(pte);
mmap_read_unlock(mm);
result = SCAN_EXCEED_SWAP_PTE;
@@ -1426,6 +1437,8 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
if (++none_or_zero > max_ptes_none) {
result = SCAN_EXCEED_NONE_PTE;
count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
+ count_mthp_stat(HPAGE_PMD_ORDER,
+ MTHP_STAT_COLLAPSE_EXCEED_NONE);
goto out_unmap;
}
continue;
@@ -1434,6 +1447,8 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
if (++unmapped > max_ptes_swap) {
result = SCAN_EXCEED_SWAP_PTE;
count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
+ count_mthp_stat(HPAGE_PMD_ORDER,
+ MTHP_STAT_COLLAPSE_EXCEED_SWAP);
goto out_unmap;
}
/*
@@ -1491,6 +1506,8 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
if (++shared > max_ptes_shared) {
result = SCAN_EXCEED_SHARED_PTE;
count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
+ count_mthp_stat(HPAGE_PMD_ORDER,
+ MTHP_STAT_COLLAPSE_EXCEED_SHARED);
goto out_unmap;
}
}
--
2.54.0
* [PATCH mm-unstable v17 07/14] mm/khugepaged: skip collapsing mTHP to smaller orders
From: Nico Pache @ 2026-05-11 18:58 UTC (permalink / raw)
To: linux-doc, linux-kernel, linux-mm, linux-trace-kernel
Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
byungchul, catalin.marinas, cl, corbet, dave.hansen, david,
dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
joshua.hahnjy, kas, lance.yang, liam, ljs, mathieu.desnoyers,
matthew.brost, mhiramat, mhocko, npache, peterx, pfalcato,
rakie.kim, raquini, rdunlap, richard.weiyang, rientjes, rostedt,
rppt, ryan.roberts, shivankg, sunnanyong, surenb,
thomas.hellstrom, tiwai, usamaarif642, vbabka, vishal.moola,
wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe,
Usama Arif
In-Reply-To: <20260511185817.686831-1-npache@redhat.com>
khugepaged may try to collapse an mTHP to a smaller mTHP, resulting in
some pages being unmapped. Skip these cases until we have a way to check
whether it is OK to collapse to a smaller mTHP size (as in the case of a
partially mapped folio). This check is also not done during the scan phase,
as the current collapse order is unknown at that time.
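The skip condition can be expressed as a small stand-alone predicate. This is a user-space sketch, not the kernel code: `should_skip_collapse()` and `MODEL_PMD_ORDER` are hypothetical names, and the PMD order of 9 assumes 4 KiB base pages with 2 MiB PMD mappings.

```c
#include <stdbool.h>

/* Assumption: 4 KiB base pages and 2 MiB PMDs, so the PMD order is 9. */
#define MODEL_PMD_ORDER 9

/*
 * Return true when a folio found during isolation is already at least
 * as large as the order we are trying to collapse to, and the target
 * is not PMD order. Collapsing in that case would shrink or merely
 * re-create the existing mapping, so the attempt is skipped.
 */
static bool should_skip_collapse(unsigned int folio_order,
				 unsigned int target_order)
{
	return target_order != MODEL_PMD_ORDER &&
	       folio_order >= target_order;
}
```

For example, an order-5 folio encountered while attempting an order-4 collapse is skipped, while the same folio encountered during a PMD-order collapse is not.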
This patch is inspired by Dev Jain's work on khugepaged mTHP support [1].
[1] https://lore.kernel.org/lkml/20241216165105.56185-11-dev.jain@arm.com/
Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: Usama Arif <usama.arif@linux.dev>
Co-developed-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
mm/khugepaged.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index f49bef78cf51..ba21b134fc86 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -685,6 +685,14 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
goto out;
}
}
+ /*
+ * TODO: In some cases of partially-mapped folios, we'd actually
+ * want to collapse.
+ */
+ if (!is_pmd_order(order) && folio_order(folio) >= order) {
+ result = SCAN_PTE_MAPPED_HUGEPAGE;
+ goto out;
+ }
if (folio_test_large(folio)) {
struct folio *f;
--
2.54.0