Linux Trace Kernel
* Re: (subset) [PATCH v3 00/28] vfs/nfsd: add support for CB_NOTIFY callbacks in directory delegations
From: Chuck Lever @ 2026-05-18 16:05 UTC (permalink / raw)
  To: Christian Brauner, Jeff Layton, Chuck Lever
  Cc: Alexander Viro, Jan Kara, Alexander Aring, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	Trond Myklebust, Anna Schumaker, Amir Goldstein, Calum Mackay,
	linux-fsdevel, linux-kernel, linux-trace-kernel, linux-doc,
	linux-nfs
In-Reply-To: <20260515-weltschmerz-folgen-68ca0db1ef84@brauner>



On Fri, May 15, 2026, at 1:26 PM, Christian Brauner wrote:
> On Tue, 28 Apr 2026 08:09:44 +0100, Jeff Layton wrote:
>> Re-posting the set per Christian's request. The only difference in this
>> version is a small error handling fix in alloc_init_dir_deleg(). The old
>> version could crash since release_pages() can't handle an array with
>> NULL pointers in it.
>> 
>> ---------------------------------8<------------------------------------
>> 
>> [...]
>
> @Chuck, @Jeff, I've only merged the vfs-specific changes into a stable branch.
> You can pull it; I won't touch it again. You can pull the nfsd work in
> whatever form you like. Same procedure I use with io_uring et al.
>
> Let me know if that works for you.
>
> ---
>
> Applied to the vfs-7.2.directory.delegations branch of the vfs/vfs.git 
> tree.
> Patches in the vfs-7.2.directory.delegations branch should appear in 
> linux-next soon.
>
> Please report any outstanding bugs that were missed during review in a
> new review to the original patch series allowing us to drop it.
>
> It's encouraged to provide Acked-bys and Reviewed-bys even though the
> patch has now been applied. If possible patch trailers will be updated.
>
> Note that commit hashes shown below are subject to change due to rebase,
> trailer updates or similar. If in doubt, please check the listed branch.
>
> tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
> branch: vfs-7.2.directory.delegations
>
> [01/28] filelock: pass current blocking lease to 
> trace_break_lease_block() rather than "new_fl"
>         https://git.kernel.org/vfs/vfs/c/89330d3a60f7
> [02/28] filelock: add support for ignoring deleg breaks for dir change 
> events
>         https://git.kernel.org/vfs/vfs/c/24cbf43337f4
> [03/28] filelock: add a tracepoint to start of break_lease()
>         https://git.kernel.org/vfs/vfs/c/e39026a86b48
> [04/28] filelock: add an inode_lease_ignore_mask helper
>         https://git.kernel.org/vfs/vfs/c/95825fdcc0b0
> [05/28] fsnotify: new tracepoint in fsnotify()
>         https://git.kernel.org/vfs/vfs/c/ad4489dcd08d
> [06/28] fsnotify: add fsnotify_modify_mark_mask()
>         https://git.kernel.org/vfs/vfs/c/12ffbb117b64
> [07/28] fsnotify: add FSNOTIFY_EVENT_RENAME data type
>         https://git.kernel.org/vfs/vfs/c/010043003c0c

Looks good.

To make the NFSD pieces apply, I need v7.1-rc4 and
vfs-7.2.directory.delegations merged into vfs.all. Given your
regular merge cadence over the past few weeks, I expect that
will happen end of this week? Early next?


-- 
Chuck Lever


* Re: [PATCH 1/7] uprobes/x86: Move optimized uprobe from nop5 to nop10
From: Andrii Nakryiko @ 2026-05-18 16:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jiri Olsa, Oleg Nesterov, Ingo Molnar, Masami Hiramatsu,
	Andrii Nakryiko, bpf, linux-trace-kernel, x86, linux-kernel
In-Reply-To: <20260518104306.GU3102624@noisy.programming.kicks-ass.net>

On Mon, May 18, 2026 at 3:43 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
>
> You seem to have forgotten to Cc LKML and x86 :-(
>
> On Thu, May 14, 2026 at 03:53:36PM +0200, Jiri Olsa wrote:
>
> > @@ -1017,17 +1030,32 @@ static int int3_update(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
> >  static int swbp_optimize(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
> >                        unsigned long vaddr, unsigned long tramp)
> >  {
> > -     u8 call[5];
> > +     u8 insn[OPT_INSN_SIZE], *call = &insn[LEA_INSN_SIZE];
> >
> > -     __text_gen_insn(call, CALL_INSN_OPCODE, (const void *) vaddr,
> > +     /*
> > +      * We have nop10 instruction (with first byte overwritten to int3),
> > +      * changing it to:
> > +      *   lea -0x80(%rsp), %rsp
> > +      *   call tramp
> > +      */
> > +     memcpy(insn, lea_rsp, LEA_INSN_SIZE);
> > +     __text_gen_insn(call, CALL_INSN_OPCODE,
> > +                     (const void *) (vaddr + LEA_INSN_SIZE),
> >                       (const void *) tramp, CALL_INSN_SIZE);
> > -     return int3_update(auprobe, vma, vaddr, call, true /* optimize */);
> > +     return int3_update(auprobe, vma, vaddr, insn, OPT_INSN_SIZE, true /* optimize */);
> >  }
> >
> >  static int swbp_unoptimize(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
> >                          unsigned long vaddr)
> >  {
> > -     return int3_update(auprobe, vma, vaddr, auprobe->insn, false /* optimize */);
> > +     /*
> > +      * We have optimized nop10 (lea, call), changing it to 'jmp rel8' to
> > +      * end of the 10-byte slot instead of restoring the original nop10,
> > +      * because we could have thread already inside lea instruction.
>
> Inaccurate, RIP could be on CALL, not inside LEA. Writing NOP10 would
> make it inside NOP10 though, and that would cause havoc IF you use the
> normal NOP10.
>
> Thing is, the encoding of NOP{8,9,10} would actually allow you to
> preserve the CALL instruction :-)
>
> That is, observe:
>
>        PF1   PF2   ESC   NOPL  MOD   SIB   DISP32
>
> NOP10: 0x66, 0x2e, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00 -- cs nopw 0x00000000(%rax,%rax,1)
> NOP10: 0x66, 0x2e, 0x0f, 0x1f, 0x84, 0xe8, 0x78, 0x56, 0x34, 0x12 -- cs nopw 0x12345678(%rax,%rbp,8)
>
> Specifically the CALL opcode sits in the SIB byte and decodes like:
>
>   e8 := 11 101 000
>
>   scale = 11  (2^3 = 8)
>   index = 101 BP
>   base  = 000 AX
>
> And the displacement is just that, a displacement.
>
> So you *could* in fact, write back _A_ NOP10, just not the standard
> NOP10.
>
> > +      */
> > +     u8 jmp[OPT_INSN_SIZE] = { JMP8_INSN_OPCODE, OPT_JMP8_OFFSET };
> > +
> > +     return int3_update(auprobe, vma, vaddr, jmp, JMP8_INSN_SIZE, false /* optimize */);
> >  }
>
> Changelog wants significant update to explain this scheme.
>
> So we have:
>
>   NOP10 -+-> LEA -0x80(%rsp), %rsp, CALL foo -> JMP.d8 +8
>          |                                          |
>          `------------------------------------------'
>
> And you want to belabour the point of how you ensure re-writing the CALL
> instruction isn't a problem (because I'm not convinced).
>
> Note that the above results in:
>
> initial:
> 0: 0x66, 0x2e, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00 -- cs nopw 0x00000000(%rax,%rax,1)
>
> optimize-int3:
> 1: 0xcc, 0x2e, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00 -- int3
> optimize-tail:
> 2: 0xcc, 0x8d, 0x64, 0x24, 0x80, 0xe8, 0x12, 0x34, 0x56, 0x78 -- int3; call 0x78563412
> optimize-finish:
> 3: 0x48, 0x8d, 0x64, 0x24, 0x80, 0xe8, 0x12, 0x34, 0x56, 0x78 -- lea -0x80(%rsp),%rsp; call 0x78563412
>
> unoptimize-int3:
> 4: 0xcc, 0x8d, 0x64, 0x24, 0x80, 0xe8, 0x12, 0x34, 0x56, 0x78 -- int3; call 0x78563412
> unoptimize-tail:
> 5: 0xcc, 0x08, 0x64, 0x24, 0x80, 0xe8, 0x12, 0x34, 0x56, 0x78 -- int3; call 0x78563412
> unoptimize-finish:
> 6: 0xeb, 0x08, 0x64, 0x24, 0x80, 0xe8, 0x12, 0x34, 0x56, 0x78 -- jmp.d8 +8; call 0x78563412
>
> optimize-int3:
> 7: 0xcc, 0x08, 0x64, 0x24, 0x80, 0xe8, 0x12, 0x34, 0x56, 0x78 -- int3; call 0x78563412
> optimize-tail:
> 8: 0xcc, 0x8d, 0x64, 0x24, 0x80, 0xe8, 0x78, 0x56, 0x34, 0x12 -- int3; call 0x12345678
> optimize-finish:
> 9: 0x48, 0x8d, 0x64, 0x24, 0x80, 0xe8, 0x78, 0x56, 0x34, 0x12 -- lea -0x80(%rsp),%rsp; call 0x12345678
>
> Note that from step 7 to step 8, you re-write the CALL instruction
> without going through INT3. This means it is entirely possible for a
> concurrent execution to observe a composite instruction.
>
> This is NOT sound!

We shouldn't need to change call instruction ever, uprobe trampoline
is permanent within the given process and its address won't change.

>
> However, I think it can be salvaged, if instead of only writing INT3 at
> +0, you also write INT3 at +5. The sequence then becomes:
>
> initial:
> 0: 0x66, 0x2e, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00 -- cs nopw 0x00000000(%rax,%rax,1)
>
> optimize-int3:
> 1: 0xcc, 0x2e, 0x0f, 0x1f, 0x84, 0xcc, 0x00, 0x00, 0x00, 0x00 -- int3; int3
> optimize-tail(s):
> 2: 0xcc, 0x8d, 0x64, 0x24, 0x80, 0xcc, 0x12, 0x34, 0x56, 0x78 -- int3; int3
> optimize-finish-1:
> 3: 0xcc, 0x8d, 0x64, 0x24, 0x80, 0xe8, 0x12, 0x34, 0x56, 0x78 -- int3; call 0x78563412
> optimize-finish-2:
> 3: 0x48, 0x8d, 0x64, 0x24, 0x80, 0xe8, 0x12, 0x34, 0x56, 0x78 -- lea -0x80(%rsp),%rsp; call 0x78563412
>
> unoptimize-int3:
> 4: 0xcc, 0x8d, 0x64, 0x24, 0x80, 0xe8, 0x12, 0x34, 0x56, 0x78 -- int3; call 0x78563412
> unoptimize-tail:
> 5: 0xcc, 0x2e, 0x0f, 0x1f, 0x84, 0xe8, 0x12, 0x34, 0x56, 0x78 -- int3; call 0x78563412
> unoptimize-finish:
> 6: 0x66, 0x2e, 0x0f, 0x1f, 0x84, 0xe8, 0x12, 0x34, 0x56, 0x78 -- cs nopw 0x78563412(%rax,%rbp,8); call 0x78563412
>
> optimize-int3:
> 7: 0xcc, 0x2e, 0x0f, 0x1f, 0x84, 0xcc, 0x12, 0x34, 0x56, 0x78 -- int3; int3
> optimize-tail(s):
> 8: 0xcc, 0x8d, 0x64, 0x24, 0x80, 0xcc, 0x78, 0x56, 0x34, 0x12 -- int3; int3
> optimize-finish-1:
> 9: 0xcc, 0x8d, 0x64, 0x24, 0x80, 0xe8, 0x78, 0x56, 0x34, 0x12 -- int3; call 0x12345678
> optimize-finish-2:
> 9: 0x48, 0x8d, 0x64, 0x24, 0x80, 0xe8, 0x78, 0x56, 0x34, 0x12 -- lea -0x80(%rsp),%rsp; call 0x12345678
>

[...]
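
As an aside on the SIB decoding quoted above: a minimal userspace sketch,
assuming only the standard SIB layout (scale in bits 7:6, index in bits 5:3,
base in bits 2:0). The names sib_fields and decode_sib are hypothetical and
are not part of the patch or of the kernel.

```c
#include <stdint.h>

/* Fields of an x86 SIB byte (hypothetical helper type for illustration). */
struct sib_fields {
	unsigned int scale;	/* multiplier applied to the index: 1, 2, 4 or 8 */
	unsigned int index;	/* index register number, 5 = rBP without REX */
	unsigned int base;	/* base register number, 0 = rAX without REX */
};

/* Split a SIB byte into its three fields. */
struct sib_fields decode_sib(uint8_t sib)
{
	struct sib_fields f;

	f.scale = 1u << (sib >> 6);	/* bits 7:6 hold log2 of the scale */
	f.index = (sib >> 3) & 0x7;	/* bits 5:3 */
	f.base  = sib & 0x7;		/* bits 2:0 */
	return f;
}
```

With this, decode_sib(0xe8) gives scale 8, index 5 (rBP) and base 0 (rAX),
which is the (%rax,%rbp,8) operand of the second NOP10 above, while
decode_sib(0x00) gives the (%rax,%rax,1) operand of the standard NOP10.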


* Re: [PATCH] ftrace: fix race in __modify_ftrace_direct() between tmp_ops registration and direct_functions update
From: Steven Rostedt @ 2026-05-18 16:19 UTC (permalink / raw)
  To: Andrii Kuchmenko
  Cc: linux-trace-kernel, mhiramat, linux-kernel, stable, Jiri Olsa
In-Reply-To: <20260517110155.21706-1-capyenglishlite@gmail.com>

On Sun, 17 May 2026 14:01:53 +0300
Andrii Kuchmenko <capyenglishlite@gmail.com> wrote:

> In __modify_ftrace_direct(), register_ftrace_function_nolock() makes
> tmp_ops visible in ftrace_ops_list before entry->direct is updated
> under ftrace_lock. During this window any CPU entering the traced
> function calls call_direct_funcs(), reads the old address from
> direct_functions via RCU, and jumps to it via
> arch_ftrace_set_direct_caller(). If the caller freed or invalidated
> the old trampoline before calling modify_ftrace_direct(), this is a
> use-after-free in executable code context.
> 
> The race window:
> 
>   CPU 0 (__modify_ftrace_direct)       CPU 1 (executing traced func)
>   ──────────────────────────────       ──────────────────────────────
>   register_ftrace_function_nolock()
>     -> tmp_ops visible in ops_list  
>                                         call_direct_funcs()
>                                           ftrace_find_rec_direct() -> old_addr
>                                           arch_ftrace_set_direct_caller(old_addr)
>                                           jump to old_addr  <- UAF if freed

You do not state where old_addr is freed.

>   mutex_lock(&ftrace_lock)
>   entry->direct = addr   <- too late
>   mutex_unlock(&ftrace_lock)
> 
> Fix: update entry->direct under ftrace_lock BEFORE registering tmp_ops.
> Any CPU that observes tmp_ops in ftrace_ops_list after this point will
> already see the new address when it calls ftrace_find_rec_direct().
> Add smp_wmb() between the store and the registration to ensure the
> write is visible on weakly-ordered architectures before tmp_ops
> becomes observable via ftrace_ops_list.
> 
> On error from register_ftrace_function_nolock(), restore entry->direct
> to old_addr since tmp_ops never became visible to other CPUs.

The above statement is incorrect. The tmp_ops hash entries are also
*shared* with the ops that is being updated. That is, by changing the entry->direct, you 

> 
> This affects all callers of __modify_ftrace_direct(), including:
>   - modify_ftrace_direct() used by kernel modules and live patching
>   - modify_ftrace_direct_nolock() used by BPF trampolines
>     (kernel/bpf/trampoline.c) reachable with CAP_BPF + CAP_PERFMON
> 
> Fixes: 0567d6809440 ("ftrace: Add modify_ftrace_direct()")
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Masami Hiramatsu <mhiramat@kernel.org>
> Cc: stable@vger.kernel.org
> Signed-off-by: Andrii Kuchmenko <capyenglishlite@gmail.com>
> ---
>  kernel/trace/ftrace.c | 35 +++++++++++++++++++++++++----------
>  1 file changed, 25 insertions(+), 10 deletions(-)
> 
> diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
> index a1b2c3d4e5f6..b7c8d9e0f1a2 100644
> --- a/kernel/trace/ftrace.c
> +++ b/kernel/trace/ftrace.c
> @@ -5950,6 +5950,7 @@ static int __modify_ftrace_direct(struct ftrace_ops *ops, unsigned long addr)
>  	struct ftrace_func_entry *entry;
>  	struct ftrace_ops tmp_ops;
> +	unsigned long old_addr;
>  	int err;
>  
>  	lockdep_assert_held(&direct_mutex);
> @@ -5960,22 +5961,36 @@ static int __modify_ftrace_direct(struct ftrace_ops *ops, unsigned long addr)
>  	if (!entry)
>  		return -ENODEV;
>  
> -	/*
> -	 * tmp_ops is registered into ftrace_ops_list here, making it
> -	 * visible to all CPUs executing the traced function. However,
> -	 * entry->direct is not updated until after this call returns,
> -	 * leaving a window where CPUs read the stale (possibly freed)
> -	 * direct call address via ftrace_find_rec_direct().
> -	 */

Are you posting patches on top of your own patches that are not public?

> -	err = register_ftrace_function_nolock(&tmp_ops);
> -	if (err)
> -		return err;
> -
> +	/* Save old address in case we need to roll back on error. */
> +	old_addr = entry->direct;
> +
> +	/*
> +	 * Update entry->direct BEFORE registering tmp_ops into
> +	 * ftrace_ops_list. This closes the race window where a CPU
> +	 * executing the traced function could read the old (potentially
> +	 * freed) direct call address between tmp_ops becoming visible
> +	 * and entry->direct being updated.
> +	 *
> +	 * Any CPU that observes tmp_ops in ftrace_ops_list after the
> +	 * smp_wmb() below is guaranteed to see the new address when
> +	 * it calls ftrace_find_rec_direct().
> +	 */
>  	mutex_lock(&ftrace_lock);
>  	entry->direct = addr;
>  	mutex_unlock(&ftrace_lock);
>  
> +	/*
> +	 * Ensure entry->direct store is ordered before tmp_ops
> +	 * becomes visible via ftrace_ops_list on weakly-ordered archs.
> +	 */
> +	smp_wmb();

You do realize that register_ftrace_function_nolock() is itself a full
memory barrier? It's doing code modification which requires lots of
barriers to work.

Still, the only bug I see that is possible is that the caller may need to
do some synchronize RCU calls before freeing an old trampoline.

Can you show a path that doesn't do that?

-- Steve


> +
> +	err = register_ftrace_function_nolock(&tmp_ops);
> +	if (err) {
> +		/* tmp_ops never became visible; safe to restore old_addr. */
> +		mutex_lock(&ftrace_lock);
> +		entry->direct = old_addr;
> +		mutex_unlock(&ftrace_lock);
> +		return err;
> +	}
> +
>  	/*
>  	 * Now that tmp_ops is registered and entry->direct is updated,
>  	 * unregister the original ops and clean up.



* Re: [PATCH 1/7] uprobes/x86: Move optimized uprobe from nop5 to nop10
From: Jiri Olsa @ 2026-05-18 16:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Ingo Molnar, Masami Hiramatsu, Andrii Nakryiko,
	bpf, linux-trace-kernel, x86, linux-kernel
In-Reply-To: <20260518104306.GU3102624@noisy.programming.kicks-ass.net>

On Mon, May 18, 2026 at 12:43:06PM +0200, Peter Zijlstra wrote:
> 
> You seem to have forgotten to Cc LKML and x86 :-(
> 
> On Thu, May 14, 2026 at 03:53:36PM +0200, Jiri Olsa wrote:
> 
> > @@ -1017,17 +1030,32 @@ static int int3_update(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
> >  static int swbp_optimize(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
> >  			 unsigned long vaddr, unsigned long tramp)
> >  {
> > -	u8 call[5];
> > +	u8 insn[OPT_INSN_SIZE], *call = &insn[LEA_INSN_SIZE];
> >  
> > -	__text_gen_insn(call, CALL_INSN_OPCODE, (const void *) vaddr,
> > +	/*
> > +	 * We have nop10 instruction (with first byte overwritten to int3),
> > +	 * changing it to:
> > +	 *   lea -0x80(%rsp), %rsp
> > +	 *   call tramp
> > +	 */
> > +	memcpy(insn, lea_rsp, LEA_INSN_SIZE);
> > +	__text_gen_insn(call, CALL_INSN_OPCODE,
> > +			(const void *) (vaddr + LEA_INSN_SIZE),
> >  			(const void *) tramp, CALL_INSN_SIZE);
> > -	return int3_update(auprobe, vma, vaddr, call, true /* optimize */);
> > +	return int3_update(auprobe, vma, vaddr, insn, OPT_INSN_SIZE, true /* optimize */);
> >  }
> >  
> >  static int swbp_unoptimize(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
> >  			   unsigned long vaddr)
> >  {
> > -	return int3_update(auprobe, vma, vaddr, auprobe->insn, false /* optimize */);
> > +	/*
> > +	 * We have optimized nop10 (lea, call), changing it to 'jmp rel8' to
> > +	 * end of the 10-byte slot instead of restoring the original nop10,
> > +	 * because we could have thread already inside lea instruction.
> 
> Inaccurate, RIP could be on CALL, not inside LEA. Writing NOP10 would
> make it inside NOP10 though, and that would cause havoc IF you use the
> normal NOP10.
> 
> Thing is, the encoding of NOP{8,9,10} would actually allow you to
> preserve the CALL instruction :-)
> 
> That is, observe:
> 
>        PF1   PF2   ESC   NOPL  MOD   SIB   DISP32
> 
> NOP10: 0x66, 0x2e, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00 -- cs nopw 0x00000000(%rax,%rax,1)
> NOP10: 0x66, 0x2e, 0x0f, 0x1f, 0x84, 0xe8, 0x78, 0x56, 0x34, 0x12 -- cs nopw 0x12345678(%rax,%rbp,8)
> 
> Specifically the CALL opcode sits in the SIB byte and decodes like:
> 
>   e8 := 11 101 000
> 
>   scale = 11  (2^3 = 8)
>   index = 101 BP
>   base  = 000 AX
> 
> And the displacement is just that, a displacement.
> 
> So you *could* in fact, write back _A_ NOP10, just not the standard
> NOP10.
> 
> > +	 */
> > +	u8 jmp[OPT_INSN_SIZE] = { JMP8_INSN_OPCODE, OPT_JMP8_OFFSET };
> > +
> > +	return int3_update(auprobe, vma, vaddr, jmp, JMP8_INSN_SIZE, false /* optimize */);
> >  }
> 
> Changelog wants significant update to explain this scheme.
> 
> So we have:
> 
>   NOP10 -+-> LEA -0x80(%rsp), %rsp, CALL foo -> JMP.d8 +8
>          |                                          |
>          `------------------------------------------'
> 
> And you want to belabour the point of how you ensure re-writing the CALL
> instruction isn't a problem (because I'm not convinced).
> 
> Note that the above results in:
> 
> initial:
> 0: 0x66, 0x2e, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00 -- cs nopw 0x00000000(%rax,%rax,1)
> 
> optimize-int3:
> 1: 0xcc, 0x2e, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00 -- int3
> optimize-tail:
> 2: 0xcc, 0x8d, 0x64, 0x24, 0x80, 0xe8, 0x12, 0x34, 0x56, 0x78 -- int3; call 0x78563412
> optimize-finish:
> 3: 0x48, 0x8d, 0x64, 0x24, 0x80, 0xe8, 0x12, 0x34, 0x56, 0x78 -- lea -0x80(%rsp),%rsp; call 0x78563412
> 
> unoptimize-int3:
> 4: 0xcc, 0x8d, 0x64, 0x24, 0x80, 0xe8, 0x12, 0x34, 0x56, 0x78 -- int3; call 0x78563412
> unoptimize-tail:
> 5: 0xcc, 0x08, 0x64, 0x24, 0x80, 0xe8, 0x12, 0x34, 0x56, 0x78 -- int3; call 0x78563412
> unoptimize-finish:
> 6: 0xeb, 0x08, 0x64, 0x24, 0x80, 0xe8, 0x12, 0x34, 0x56, 0x78 -- jmp.d8 +8; call 0x78563412
> 
> optimize-int3:
> 7: 0xcc, 0x08, 0x64, 0x24, 0x80, 0xe8, 0x12, 0x34, 0x56, 0x78 -- int3; call 0x78563412
> optimize-tail:
> 8: 0xcc, 0x8d, 0x64, 0x24, 0x80, 0xe8, 0x78, 0x56, 0x34, 0x12 -- int3; call 0x12345678
> optimize-finish:
> 9: 0x48, 0x8d, 0x64, 0x24, 0x80, 0xe8, 0x78, 0x56, 0x34, 0x12 -- lea -0x80(%rsp),%rsp; call 0x12345678
> 
> Note that from step 7 to step 8, you re-write the CALL instruction
> without going through INT3. This means it is entirely possible for a
> concurrent execution to observe a composite instruction.
> 
> This is NOT sound!
> 
> However, I think it can be salvaged, if instead of only writing INT3 at
> +0, you also write INT3 at +5. The sequence then becomes:
> 
> initial:
> 0: 0x66, 0x2e, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00 -- cs nopw 0x00000000(%rax,%rax,1)
> 
> optimize-int3:
> 1: 0xcc, 0x2e, 0x0f, 0x1f, 0x84, 0xcc, 0x00, 0x00, 0x00, 0x00 -- int3; int3
> optimize-tail(s):
> 2: 0xcc, 0x8d, 0x64, 0x24, 0x80, 0xcc, 0x12, 0x34, 0x56, 0x78 -- int3; int3
> optimize-finish-1:
> 3: 0xcc, 0x8d, 0x64, 0x24, 0x80, 0xe8, 0x12, 0x34, 0x56, 0x78 -- int3; call 0x78563412
> optimize-finish-2:
> 3: 0x48, 0x8d, 0x64, 0x24, 0x80, 0xe8, 0x12, 0x34, 0x56, 0x78 -- lea -0x80(%rsp),%rsp; call 0x78563412
> 
> unoptimize-int3:
> 4: 0xcc, 0x8d, 0x64, 0x24, 0x80, 0xe8, 0x12, 0x34, 0x56, 0x78 -- int3; call 0x78563412
> unoptimize-tail:
> 5: 0xcc, 0x2e, 0x0f, 0x1f, 0x84, 0xe8, 0x12, 0x34, 0x56, 0x78 -- int3; call 0x78563412
> unoptimize-finish:
> 6: 0x66, 0x2e, 0x0f, 0x1f, 0x84, 0xe8, 0x12, 0x34, 0x56, 0x78 -- cs nopw 0x78563412(%rax,%rbp,8); call 0x78563412
> 
> optimize-int3:
> 7: 0xcc, 0x2e, 0x0f, 0x1f, 0x84, 0xcc, 0x12, 0x34, 0x56, 0x78 -- int3; int3
> optimize-tail(s):
> 8: 0xcc, 0x8d, 0x64, 0x24, 0x80, 0xcc, 0x78, 0x56, 0x34, 0x12 -- int3; int3
> optimize-finish-1:
> 9: 0xcc, 0x8d, 0x64, 0x24, 0x80, 0xe8, 0x78, 0x56, 0x34, 0x12 -- int3; call 0x12345678
> optimize-finish-2:
> 9: 0x48, 0x8d, 0x64, 0x24, 0x80, 0xe8, 0x78, 0x56, 0x34, 0x12 -- lea -0x80(%rsp),%rsp; call 0x12345678

sorry I missed this reply.. awesome, I'll check how to do this
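
For reference, a minimal userspace sketch of the write ordering in the
two-INT3 scheme quoted above (optimize steps 1 to 3). It only models the
order of byte stores into a 10-byte buffer. The names patch_slot, SLOT_SIZE
and CALL_OFF are hypothetical, and any cross-CPU serialization that real
text patching needs between phases is deliberately left out.

```c
#include <stdint.h>
#include <string.h>

#define SLOT_SIZE	10	/* size of the nop10 slot */
#define CALL_OFF	5	/* offset of the CALL opcode within the slot */
#define INT3		0xcc

/*
 * Rewrite slot[] into target[] in four phases. states[] receives a
 * snapshot of the slot after each phase so the intermediate byte
 * patterns can be inspected.
 */
void patch_slot(uint8_t slot[SLOT_SIZE], const uint8_t target[SLOT_SIZE],
		uint8_t states[4][SLOT_SIZE])
{
	int i;

	/* Phase 1: guard both instruction boundaries with INT3. */
	slot[0] = INT3;
	slot[CALL_OFF] = INT3;
	memcpy(states[0], slot, SLOT_SIZE);

	/* Phase 2: write the tails, i.e. every byte except +0 and +5. */
	for (i = 1; i < SLOT_SIZE; i++) {
		if (i != CALL_OFF)
			slot[i] = target[i];
	}
	memcpy(states[1], slot, SLOT_SIZE);

	/* Phase 3: replace the INT3 at +5 with the CALL opcode. */
	slot[CALL_OFF] = target[CALL_OFF];
	memcpy(states[2], slot, SLOT_SIZE);

	/* Phase 4: replace the INT3 at +0 with the first LEA byte. */
	slot[0] = target[0];
	memcpy(states[3], slot, SLOT_SIZE);
}
```

Starting from the standard NOP10 with the lea/call bytes as the target, the
four snapshots match the optimize-int3, optimize-tail(s), optimize-finish-1
and optimize-finish-2 lines in the quoted sequence. The point of the ordering
is that the CALL displacement is only ever stored while the byte at +5 is
INT3.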

> 
> > @@ -1095,14 +1125,25 @@ int set_orig_insn(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
> >  		  unsigned long vaddr)
> >  {
> >  	if (test_bit(ARCH_UPROBE_FLAG_CAN_OPTIMIZE, &auprobe->flags)) {
> > -		int ret = is_optimized(vma->vm_mm, vaddr);
> > -		if (ret < 0)
> > +		uprobe_opcode_t insn[OPT_INSN_SIZE];
> > +		int ret;
> > +
> > +		ret = copy_from_vaddr(vma->vm_mm, vaddr, &insn, OPT_INSN_SIZE);
> > +		if (ret)
> >  			return ret;
> > -		if (ret) {
> > +		if (__is_optimized((uprobe_opcode_t *)&insn, vaddr)) {
> >  			ret = swbp_unoptimize(auprobe, vma, vaddr);
> >  			WARN_ON_ONCE(ret);
> >  			return ret;
> >  		}
> > +		/*
> > +		 * We can have re-attached probe on top of jmp8 instruction,
> > +		 * which did not get optimized. We need to restore the jmp8
> > +		 * instruction, instead of the original instruction (nop10).
> > +		 */
> > +		if (is_swbp_insn(&insn[0]) && insn[1] == OPT_JMP8_OFFSET)
> > +			return uprobe_write_opcode(auprobe, vma, vaddr, JMP8_INSN_OPCODE,
> > +						   false /* is_register */);
> 
> Coding style wants { } on any multi-line statement, even if its only one
> statement.

will fix

thanks,
jirka


* Re: [PATCH mm-unstable v17 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
From: Usama Arif @ 2026-05-18 17:00 UTC (permalink / raw)
  To: Nico Pache
  Cc: Usama Arif, linux-doc, linux-kernel, linux-mm, linux-trace-kernel,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat,
	mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe
In-Reply-To: <20260511185817.686831-7-npache@redhat.com>

On Mon, 11 May 2026 12:58:06 -0600 Nico Pache <npache@redhat.com> wrote:

> Pass an order and offset to collapse_huge_page to support collapsing anon
> memory to arbitrary orders within a PMD. order indicates what mTHP size we
> are attempting to collapse to, and offset indicates where in the PMD to
> start the collapse attempt.
> 
> For non-PMD collapse we must leave the anon VMA write locked until after
> we collapse the mTHP-- in the PMD case all the pages are isolated, but in
> the mTHP case this is not true, and we must keep the lock to prevent
> access/changes to the page tables. This can happen if the rmap walkers hit
> a pmd_none while the PMD entry is currently unavailable due to being
> temporarily removed during the collapse phase.
> 
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>  mm/khugepaged.c | 93 +++++++++++++++++++++++++++++--------------------
>  1 file changed, 55 insertions(+), 38 deletions(-)
> 

The patch did 2 things:

Make it work with any order and not just PMD order.

Keeps anon_vma_write held across the copy and install for non-PMD orders,
as mTHP leaves the out-of-range PTEs mapped while the PMD is temporarily none.
rmap walkers cannot reach here until the PMD is installed.

Acked-by: Usama Arif <usama.arif@linux.dev>


* Re: [PATCH 1/2] mm/page_alloc: add tracepoints for zone->lock acquisitions
From: Dmitry Ilvokhin @ 2026-05-18 17:01 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Vlastimil Babka (SUSE), Andrew Morton, Matthew Wilcox, linux-mm,
	Steven Rostedt, Suren Baghdasaryan, Michal Hocko, Zi Yan,
	David Hildenbrand, Lorenzo Stoakes, Shuah Khan, linux-kernel,
	linux-trace-kernel, kernel-team
In-Reply-To: <fab7d27a-6c2b-47aa-abe8-a327f05fb5cd@kernel.org>

On Wed, May 13, 2026 at 05:32:41PM +0200, Jesper Dangaard Brouer wrote:
> 
> 
> On 08/05/2026 20.07, Dmitry Ilvokhin wrote:
> > On Fri, May 08, 2026 at 07:40:51PM +0200, Vlastimil Babka (SUSE) wrote:
> > > On 5/8/26 7:38 PM, Vlastimil Babka (SUSE) wrote:
> > > > On 5/8/26 7:29 PM, Andrew Morton wrote:
> > > > > > On Fri,  8 May 2026 18:22:06 +0200 hawk@kernel.org wrote:
> > > > > 
> > > > > > Add tracepoints to the page allocator fast paths that acquire
> > > > > > zone->lock, allowing diagnosis of lock contention in production.
> > > > > 
> > > > > Thanks, I'm surprised we haven't done this yet.
> > > > 
> > > > There was a recent attempt [1]. Not being a generic solution wasn't welcome.
> > > > 
> > > > [1] https://lore.kernel.org/all/cover.1772206930.git.d@ilvokhin.com/
> > > 
> > > And this is the generic solution I think?
> > > 
> > > https://lore.kernel.org/all/cover.1777999826.git.d@ilvokhin.com/
> > 
> > Thanks for cc'ing me, Vlastimil.
> > 
> > Yes, this is an attempt at a generic solution for tracing contended
> > locks, including spinlocks, so it should also cover the use case
> > proposed in this patchset.
> > 
> 
> I'm aware of the generic solution and often use `perf lock contention`
> and the tool libbpf-tools/klockstat. My experience is unfortunately that
> enabling these tracepoints is prohibitively expensive on production servers,
> and production suffers when I run these tools.

I think it depends on the workload: in particular how lock heavy it is.

At Meta we have a lock contention profiler (uses contention_begin and
contention_end tracepoints under the hood) running continuously in the
fleet. It is heavily sampled and each profiling session runs only for
a few seconds, but in practice it is usually enough to get a pretty good
understanding of what is going on.

That said, I understand the concern, and I can absolutely imagine
workloads where the overhead is still unacceptably high.

> 
> I'm very happy to see a patchset adding a contended case. But I worry
> that tracing all contended locks in the system is also too much to have
> enabled continuously for production.
> 
> This patch is carefully constructed to minimize overhead, such that I
> can enable this continuously on production to catch issues.  If I
> identify an issue I will use the generic tracepoints for further debugging.
> 
> 
> > In fact, zone->lock contention was one of the primary motivations for
> > this work.
> 
> In the generic solution I'm losing the "zone" and pages "count".  I
> need this information to get the answers I'm looking for.  Specifically
> I'm looking at reducing CONFIG_PCP_BATCH_SCALE_MAX, but I want this
> to be a data-driven decision (my first principle is: if you cannot
> measure it you cannot improve it).
> 
> I'm likely going to apply this patch to our production system, such that
> I can get my data-driven decision.  I need to deploy it widely enough to
> get enough servers experiencing direct-reclaim.  I'll report back if
> people are interested in these learnings?

I would definitely be interested in hearing about your findings.

> 
> --Jesper


* [PATCH v3 1/2] spi: qcom-geni: trace: Add trace events for Qualcomm GENI SPI
From: Praveen Talari @ 2026-05-18 17:00 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Mark Brown
  Cc: linux-kernel, linux-trace-kernel, linux-arm-msm, linux-spi,
	mukesh.savaliya, aniket.randive, chandana.chiluveru,
	jyothi.seerapu, Praveen Talari, Konrad Dybcio
In-Reply-To: <20260518-add-tracepoints-for-qcom-geni-spi-v3-0-7928f6810a79@oss.qualcomm.com>

Add tracepoint support to the Qualcomm GENI SPI driver to provide
runtime visibility into driver behavior without requiring invasive debug
patches.

The trace events cover clock and setup parameter configuration,
transfer metadata, and interrupt status, making it easier to diagnose
communication issues in the field.

Reviewed-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com>
Signed-off-by: Praveen Talari <praveen.talari@oss.qualcomm.com>
---
v2->v3:
- Renamed geni_spi_fifo_params to geni_spi_setup_params trace event.
- Updated commit text.

v1->v2:
- Removed TX/RX data tracepoints.
- Updated commit text.
---
 include/trace/events/qcom_geni_spi.h | 103 +++++++++++++++++++++++++++++++++++
 1 file changed, 103 insertions(+)

diff --git a/include/trace/events/qcom_geni_spi.h b/include/trace/events/qcom_geni_spi.h
new file mode 100644
index 000000000000..6d027adf2e1d
--- /dev/null
+++ b/include/trace/events/qcom_geni_spi.h
@@ -0,0 +1,103 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM qcom_geni_spi
+
+#if !defined(_TRACE_QCOM_GENI_SPI_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_QCOM_GENI_SPI_H
+
+#include <linux/tracepoint.h>
+
+TRACE_EVENT(geni_spi_setup_params,
+	    TP_PROTO(struct device *dev, u8 cs, u32 mode,
+		     u32 mode_changed, bool cs_changed),
+	    TP_ARGS(dev, cs, mode, mode_changed, cs_changed),
+
+	    TP_STRUCT__entry(__string(name, dev_name(dev))
+			     __field(u8, cs)
+			     __field(u32, mode)
+			     __field(u32, mode_changed)
+			     __field(bool, cs_changed)
+	    ),
+
+	    TP_fast_assign(__assign_str(name);
+			   __entry->cs = cs;
+			   __entry->mode = mode;
+			   __entry->mode_changed = mode_changed;
+			   __entry->cs_changed = cs_changed;
+	    ),
+
+	    TP_printk("%s: cs=%u mode=0x%08x mode_changed=0x%08x cs_changed=%d",
+		      __get_str(name), __entry->cs, __entry->mode,
+		      __entry->mode_changed, __entry->cs_changed)
+);
+
+TRACE_EVENT(geni_spi_clk_cfg,
+	    TP_PROTO(struct device *dev, unsigned long req_hz,
+		     unsigned long sclk_hz, unsigned int clk_idx,
+		     unsigned int clk_div, unsigned int bpw),
+	    TP_ARGS(dev, req_hz, sclk_hz, clk_idx, clk_div, bpw),
+
+	    TP_STRUCT__entry(__string(name, dev_name(dev))
+			     __field(unsigned long, req_hz)
+			     __field(unsigned long, sclk_hz)
+			     __field(unsigned int, clk_idx)
+			     __field(unsigned int, clk_div)
+			     __field(unsigned int, bpw)
+	    ),
+
+	    TP_fast_assign(__assign_str(name);
+			   __entry->req_hz = req_hz;
+			   __entry->sclk_hz = sclk_hz;
+			   __entry->clk_idx = clk_idx;
+			   __entry->clk_div = clk_div;
+			   __entry->bpw = bpw;
+	    ),
+
+	    TP_printk("%s: req_hz=%lu sclk_hz=%lu clk_idx=%u clk_div=%u bpw=%u",
+		      __get_str(name), __entry->req_hz, __entry->sclk_hz,
+		      __entry->clk_idx, __entry->clk_div, __entry->bpw)
+);
+
+TRACE_EVENT(geni_spi_transfer,
+	    TP_PROTO(struct device *dev, unsigned int len, u32 m_cmd),
+	    TP_ARGS(dev, len, m_cmd),
+
+	    TP_STRUCT__entry(__string(name, dev_name(dev))
+			     __field(unsigned int, len)
+			     __field(u32, m_cmd)
+	    ),
+
+	    TP_fast_assign(__assign_str(name);
+			   __entry->len = len;
+			   __entry->m_cmd = m_cmd;
+	    ),
+
+	    TP_printk("%s: len=%u m_cmd=0x%08x",
+		      __get_str(name), __entry->len, __entry->m_cmd)
+);
+
+TRACE_EVENT(geni_spi_irq,
+	    TP_PROTO(struct device *dev, u32 m_irq, u32 dma_tx, u32 dma_rx),
+	    TP_ARGS(dev, m_irq, dma_tx, dma_rx),
+
+	    TP_STRUCT__entry(__string(name, dev_name(dev))
+			     __field(u32, m_irq)
+			     __field(u32, dma_tx)
+			     __field(u32, dma_rx)
+	    ),
+
+	    TP_fast_assign(__assign_str(name);
+			   __entry->m_irq = m_irq;
+			   __entry->dma_tx = dma_tx;
+			   __entry->dma_rx = dma_rx;
+	    ),
+
+	    TP_printk("%s: m_irq=0x%08x dma_tx=0x%08x dma_rx=0x%08x",
+		      __get_str(name), __entry->m_irq, __entry->dma_tx,
+		      __entry->dma_rx)
+);
+
+#endif /* _TRACE_QCOM_GENI_SPI_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>

-- 
2.34.1


^ permalink raw reply related

* [PATCH v3 2/2] spi: qcom-geni: Add trace events for Qualcomm GENI SPI driver
From: Praveen Talari @ 2026-05-18 17:00 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Mark Brown
  Cc: linux-kernel, linux-trace-kernel, linux-arm-msm, linux-spi,
	mukesh.savaliya, aniket.randive, chandana.chiluveru,
	jyothi.seerapu, Praveen Talari, Konrad Dybcio
In-Reply-To: <20260518-add-tracepoints-for-qcom-geni-spi-v3-0-7928f6810a79@oss.qualcomm.com>

Add tracepoints to the Qualcomm GENI (Generic Interface) SPI driver.
These trace events enable runtime debugging and performance analysis
of SPI operations.

The trace events capture SPI clock configuration, setup parameters,
transfer details, and interrupt status.

Reviewed-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com>
Signed-off-by: Praveen Talari <praveen.talari@oss.qualcomm.com>
---
v2->v3:
- Replaced geni_spi_fifo_params with geni_spi_setup_params trace event.
- Updated commit text.

v1->v2:
- Removed tx/rx data capture since the SPI core already supports it.
- Updated commit text.
---
 drivers/spi/spi-geni-qcom.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/drivers/spi/spi-geni-qcom.c b/drivers/spi/spi-geni-qcom.c
index d5fb0edc8e0c..a04cdc1e5ad4 100644
--- a/drivers/spi/spi-geni-qcom.c
+++ b/drivers/spi/spi-geni-qcom.c
@@ -1,6 +1,9 @@
 // SPDX-License-Identifier: GPL-2.0
 // Copyright (c) 2017-2018, The Linux foundation. All rights reserved.
 
+#define CREATE_TRACE_POINTS
+#include <trace/events/qcom_geni_spi.h>
+
 #include <linux/clk.h>
 #include <linux/dmaengine.h>
 #include <linux/dma-mapping.h>
@@ -332,6 +335,9 @@ static int geni_spi_set_clock_and_bw(struct spi_geni_master *mas,
 	writel(clk_sel, se->base + SE_GENI_CLK_SEL);
 	writel(m_clk_cfg, se->base + GENI_SER_M_CLK_CFG);
 
+	trace_geni_spi_clk_cfg(mas->dev, clk_hz, mas->cur_sclk_hz, idx, div,
+			       mas->cur_bits_per_word);
+
 	/* Set BW quota for CPU as driver supports FIFO mode only. */
 	se->icc_paths[CPU_TO_GENI].avg_bw = Bps_to_icc(mas->cur_speed_hz);
 	ret = geni_icc_set_bw(se);
@@ -366,6 +372,9 @@ static int setup_fifo_params(struct spi_device *spi_slv,
 	if ((mode_changed & SPI_CS_HIGH) || (cs_changed && (spi_slv->mode & SPI_CS_HIGH)))
 		writel((spi_slv->mode & SPI_CS_HIGH) ? BIT(chipselect) : 0, se->base + SE_SPI_DEMUX_OUTPUT_INV);
 
+	trace_geni_spi_setup_params(mas->dev, chipselect, spi_slv->mode,
+				    mode_changed, cs_changed);
+
 	return 0;
 }
 
@@ -861,6 +870,8 @@ static int setup_se_xfer(struct spi_transfer *xfer,
 	spin_lock_irq(&mas->lock);
 	geni_se_setup_m_cmd(se, m_cmd, m_params);
 
+	trace_geni_spi_transfer(mas->dev, len, m_cmd);
+
 	if (mas->cur_xfer_mode == GENI_SE_DMA) {
 		if (m_cmd & SPI_RX_ONLY)
 			geni_se_rx_init_dma(se, sg_dma_address(xfer->rx_sg.sgl),
@@ -915,6 +926,8 @@ static irqreturn_t geni_spi_isr(int irq, void *data)
 	if (!m_irq && !dma_tx_status && !dma_rx_status)
 		return IRQ_NONE;
 
+	trace_geni_spi_irq(mas->dev, m_irq, dma_tx_status, dma_rx_status);
+
 	if (m_irq & (M_CMD_OVERRUN_EN | M_ILLEGAL_CMD_EN | M_CMD_FAILURE_EN |
 		     M_RX_FIFO_RD_ERR_EN | M_RX_FIFO_WR_ERR_EN |
 		     M_TX_FIFO_RD_ERR_EN | M_TX_FIFO_WR_ERR_EN))

-- 
2.34.1


^ permalink raw reply related

* [PATCH v3 0/2] Add trace events for Qualcomm GENI SPI drivers
From: Praveen Talari @ 2026-05-18 17:00 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Mark Brown
  Cc: linux-kernel, linux-trace-kernel, linux-arm-msm, linux-spi,
	mukesh.savaliya, aniket.randive, chandana.chiluveru,
	jyothi.seerapu, Praveen Talari, Konrad Dybcio

Add tracepoints to the Qualcomm GENI (Generic Interface) SPI driver.
These trace events enable runtime debugging and performance analysis
of SPI operations.

The trace events capture SPI clock configuration, setup parameters,
transfer details, and interrupt status.

Usage examples:

Enable all SPI traces:
  echo 1 > /sys/kernel/tracing/events/spi/enable
  echo 1 > /sys/kernel/debug/tracing/events/qcom_geni_spi/enable
  cat /sys/kernel/debug/tracing/trace_pipe

Example trace output:

1003.956560: spi_message_submit: spi16.0 000000001b20b93c
1003.956642: spi_controller_busy: spi16
1003.956643: spi_message_start: spi16.0 000000001b20b93c
1003.956646: geni_spi_setup_params: 888000.spi: cs=0 mode=0x00000020
     mode_changed=0x00000007 cs_changed=0
1003.956647: spi_set_cs: spi16.0 activate
1003.956648: spi_transfer_start: spi16.0 00000000ea1cf8b6 len=16
     tx=[4c-80-e4-ca-68-4d-95-aa-ee-99-ae-d7-69-e9-5f-39]
     rx=[00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00]
1003.956653: geni_spi_clk_cfg: 888000.spi: req_hz=20000000
     sclk_hz=100000000 clk_idx=5 clk_div=5 bpw=8
1003.956691: geni_spi_transfer: 888000.spi: len=16 m_cmd=0x00000003
1003.956708: geni_spi_irq: 888000.spi: m_irq=0x08000081
     dma_tx=0x00000000 dma_rx=0x00000000
1003.956717: spi_transfer_stop: spi16.0 00000000ea1cf8b6 len=16
     tx=[4c-80-e4-ca-68-4d-95-aa-ee-99-ae-d7-69-e9-5f-39]
     rx=[4c-80-e4-ca-68-4d-95-aa-ee-99-ae-d7-69-e9-5f-39]
1003.956717: spi_set_cs: spi16.0 deactivate
1003.956718: spi_message_done: spi16.0 000000001b20b93c len=16/16
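
The geni_spi_clk_cfg line above can be sanity-checked offline. The
following is a minimal userspace sketch, not driver code: the struct and
both helper names are hypothetical, and the relation "effective rate =
sclk_hz / clk_div" is an assumption that happens to match the sample
values (100000000 / 5 = 20000000). It may not hold for every
configuration.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical holder for the fields printed by geni_spi_clk_cfg. */
struct clk_cfg_sample {
	char dev[32];
	unsigned long req_hz;
	unsigned long sclk_hz;
	unsigned int clk_idx;
	unsigned int clk_div;
	unsigned int bpw;
};

/*
 * Parse the text that follows "geni_spi_clk_cfg: " in a trace line.
 * Whitespace in the format also matches the line wrap seen in the mail.
 * Returns 0 when all six fields were read, -1 otherwise.
 */
static int parse_clk_cfg(const char *payload, struct clk_cfg_sample *s)
{
	int n = sscanf(payload,
		       "%31[^:]: req_hz=%lu sclk_hz=%lu clk_idx=%u clk_div=%u bpw=%u",
		       s->dev, &s->req_hz, &s->sclk_hz, &s->clk_idx,
		       &s->clk_div, &s->bpw);

	return n == 6 ? 0 : -1;
}

/* Assumed relation (see lead-in): source clock divided by the divider. */
static unsigned long effective_hz(const struct clk_cfg_sample *s)
{
	return s->clk_div ? s->sclk_hz / s->clk_div : 0;
}
```

Feeding the payload of the sample line to parse_clk_cfg() and comparing
effective_hz() against req_hz is a quick way to spot a divider that does
not produce the requested rate.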

Signed-off-by: Praveen Talari <praveen.talari@oss.qualcomm.com>
---
Changes in v3:
- Replaced geni_spi_fifo_params with geni_spi_setup_params trace event.
- Updated commit text.
- Link to v2: https://lore.kernel.org/r/20260512-add-tracepoints-for-qcom-geni-spi-v2-0-3b184068ecf9@oss.qualcomm.com

Changes in v2:
- Removed tx/rx data capture since the SPI core already supports it.
- Updated commit text in patches and cover letter.
- Link to v1: https://lore.kernel.org/r/20260506-add-tracepoints-for-qcom-geni-spi-v1-0-c957cfe712d1@oss.qualcomm.com

---
Praveen Talari (2):
      spi: qcom-geni: trace: Add trace events for Qualcomm GENI SPI
      spi: qcom-geni: Add trace events for Qualcomm GENI SPI driver

 drivers/spi/spi-geni-qcom.c          |  13 +++++
 include/trace/events/qcom_geni_spi.h | 103 +++++++++++++++++++++++++++++++++++
 2 files changed, 116 insertions(+)
---
base-commit: 1f5ffc672165ff851063a5fd044b727ab2517ae3
change-id: 20260506-add-tracepoints-for-qcom-geni-spi-e31457c2267c

Best regards,
-- 
Praveen Talari <praveen.talari@oss.qualcomm.com>


^ permalink raw reply

* Re: [PATCHv2 05/11] libbpf: Detect uprobe syscall with new error
From: Andrii Nakryiko @ 2026-05-18 17:39 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Oleg Nesterov, Peter Zijlstra, Ingo Molnar, Masami Hiramatsu,
	Andrii Nakryiko, bpf, linux-trace-kernel
In-Reply-To: <20260518105957.123445-6-jolsa@kernel.org>

On Mon, May 18, 2026 at 4:00 AM Jiri Olsa <jolsa@kernel.org> wrote:
>
> In the previous optimized uprobe fix we changed the syscall
> error used for its detection from ENXIO to EPROTO.
>
> Change the related probe_uprobe_syscall detection check to match.
>
> Suggested-by: Andrii Nakryiko <andrii@kernel.org>
> Fixes: 05738da0efa1 ("libbpf: Add uprobe syscall feature detection")
> Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> ---
>  tools/lib/bpf/features.c                                | 4 ++--
>  tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c | 2 +-
>  2 files changed, 3 insertions(+), 3 deletions(-)
>

sashiko is wrong; this change is correct (we do not want to attempt the
sys_uprobe optimization if broken kernels return -ENXIO)

Acked-by: Andrii Nakryiko <andrii@kernel.org>

> diff --git a/tools/lib/bpf/features.c b/tools/lib/bpf/features.c
> index b7e388f99d0b..e5641fa60163 100644
> --- a/tools/lib/bpf/features.c
> +++ b/tools/lib/bpf/features.c
> @@ -577,10 +577,10 @@ static int probe_ldimm64_full_range_off(int token_fd)
>  static int probe_uprobe_syscall(int token_fd)
>  {
>         /*
> -        * If kernel supports uprobe() syscall, it will return -ENXIO when called
> +        * If kernel supports uprobe() syscall, it will return -EPROTO when called
>          * from the outside of a kernel-generated uprobe trampoline.
>          */
> -       return syscall(__NR_uprobe) < 0 && errno == ENXIO;
> +       return syscall(__NR_uprobe) < 0 && errno == EPROTO;
>  }
>  #else
>  static int probe_uprobe_syscall(int token_fd)
> diff --git a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
> index 955a37751b52..c944136252c6 100644
> --- a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
> +++ b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
> @@ -762,7 +762,7 @@ static void test_uprobe_error(void)
>         long err = syscall(__NR_uprobe);
>
>         ASSERT_EQ(err, -1, "error");
> -       ASSERT_EQ(errno, ENXIO, "errno");
> +       ASSERT_EQ(errno, EPROTO, "errno");
>  }
>
>  static void __test_uprobe_syscall(void)
> --
> 2.53.0
>
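
For readers following along, the detection rule in the quoted hunk can be
sketched as a small standalone program. This is only an illustration
under stated assumptions: uprobe_syscall_usable() and
probe_uprobe_syscall_live() are hypothetical names, not libbpf API, and
the live probe is compiled only when the toolchain headers define
__NR_uprobe.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#ifdef __linux__
#include <sys/syscall.h>
#include <unistd.h>
#endif

/*
 * Interpret the result of calling uprobe() from ordinary user code, that
 * is, from outside a kernel-generated uprobe trampoline. Only a failure
 * with EPROTO counts as "usable"; ENXIO (kernels without the fix) and
 * ENOSYS (no such syscall) both count as "not usable".
 */
static bool uprobe_syscall_usable(long ret, int err)
{
	return ret < 0 && err == EPROTO;
}

#if defined(__linux__) && defined(__NR_uprobe)
/* Live probe, mirroring the one-line check in the quoted hunk. */
static bool probe_uprobe_syscall_live(void)
{
	long ret;

	errno = 0;
	ret = syscall(__NR_uprobe);
	return uprobe_syscall_usable(ret, errno);
}
#endif
```

Keeping the classification separate from the syscall itself makes the
ENXIO versus EPROTO distinction checkable without a matching kernel.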

^ permalink raw reply

* [PATCH v3 0/2] Add tracepoints support for Qualcomm GENI Serial drivers
From: Praveen Talari @ 2026-05-18 17:56 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Greg Kroah-Hartman, Jiri Slaby, Konrad Dybcio
  Cc: linux-kernel, linux-trace-kernel, linux-arm-msm, linux-serial,
	Mukesh Kumar Savaliya, Aniket Randive, chandana.chiluveru,
	jyothi.seerapu, Praveen Talari

Add tracepoints to the Qualcomm GENI (Generic Interface) serial driver.
These trace events enable runtime debugging and performance analysis of
UART operations.

The trace events cover UART termios configuration, clock setup, modem
control state, interrupt status, and actual transmitted/received data in
hexadecimal format.

Usage examples:

Enable all serial traces:
  echo 1 > /sys/kernel/debug/tracing/events/qcom_geni_serial/enable
  cat /sys/kernel/debug/tracing/trace_pipe

Example trace output:

2517.938432: geni_serial_clk_cfg: a94000.serial: desired_rate=1843200
     clk_rate=7372800 clk_div=4 clk_idx=0
2517.938753: geni_serial_irq: a94000.serial: m_irq=0x88800000
     s_irq=0x08000111 dma_tx=0x00000000 dma_rx=0x00000000
2517.938803: geni_serial_set_termios: a94000.serial: baud=115200 bpc=8
     tx_trans=0x00000002 tx_par=0x00000000 rx_trans=0x00000000
     rx_par=0x00000000 stop=0
2517.938807: geni_serial_set_mctrl: a94000.serial: mctrl=0x8006
     uart_manual_rfr=0x00000000
2517.938818: geni_serial_get_mctrl: a94000.serial: mctrl=0x0160
     geni_ios=0x00000001
2517.939165: geni_serial_irq: a94000.serial: m_irq=0x00400000
     s_irq=0x00000000 dma_tx=0x00000000 dma_rx=0x00000000
2517.939592: geni_serial_tx_data: a94000.serial: tx_len=8 data=61 62 63
     64 65 66 67 68
2517.940610: geni_serial_irq: a94000.serial: m_irq=0x00000001
     s_irq=0x00000000 dma_tx=0x00000003 dma_rx=0x00000000
2517.942174: geni_serial_irq: a94000.serial: m_irq=0x08000000
     s_irq=0x08000100 dma_tx=0x00000000 dma_rx=0x00000003
2517.942323: geni_serial_rx_data: a94000.serial: rx_len=8 data=61 62 63
     64 65 66 67 68
2517.942680: geni_serial_set_mctrl: a94000.serial: mctrl=0x8000
     uart_manual_rfr=0x80000002
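
The data= field of geni_serial_tx_data and geni_serial_rx_data is a
space-separated hex dump. A minimal userspace sketch for turning it back
into bytes follows; decode_hex_dump() is a hypothetical helper written
for this illustration, not part of the series. For the sample above,
"61 62 63 64 65 66 67 68" decodes to the ASCII string "abcdefgh".

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Decode a whitespace-separated hex dump such as "61 62 63" into out[].
 * Line wraps inside the dump are tolerated.
 * Returns the number of bytes decoded, or -1 on malformed input or when
 * out[] (capacity cap) is too small.
 */
static int decode_hex_dump(const char *hex, unsigned char *out, size_t cap)
{
	size_t n = 0;

	while (*hex) {
		char *end;
		unsigned long v;

		/* Skip separators, including the mail's line wrap. */
		while (*hex == ' ' || *hex == '\n' || *hex == '\t')
			hex++;
		if (!*hex)
			break;

		v = strtoul(hex, &end, 16);
		if (end == hex || v > 0xff || n >= cap)
			return -1;

		out[n++] = (unsigned char)v;
		hex = end;
	}

	return (int)n;
}
```

Comparing the decoded TX and RX buffers with memcmp() is one way to
confirm a loopback test, as in the sample where both events carry the
same eight bytes.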

Signed-off-by: Praveen Talari <praveen.talari@oss.qualcomm.com>
---
Changes in v3:
- Removed \n from geni_serial_tx_data and geni_serial_rx_data events.
- Resolved alignment issues in geni_serial_data, geni_serial_tx_data and
  geni_serial_rx_data events.
- Link to v2: https://lore.kernel.org/r/20260512-add-tracepoints-for-qcom-geni-serial-v2-0-a5726421b3af@oss.qualcomm.com

Changes in v2:
- Removed multiple trace events for TX/RX events, instead used
  DECLARE_EVENT_CLASS and DEFINE_EVENT.
- Link to v1: https://lore.kernel.org/r/20260506-add-tracepoints-for-qcom-geni-serial-v1-0-544b22612e08@oss.qualcomm.com

---
Praveen Talari (2):
      serial: qcom-geni: trace: Add tracepoint support for Qualcomm GENI serial
      serial: qcom-geni: Add tracepoints for Qualcomm GENI serial driver

 drivers/tty/serial/qcom_geni_serial.c   |  27 +++++-
 include/trace/events/qcom_geni_serial.h | 164 ++++++++++++++++++++++++++++++++
 2 files changed, 187 insertions(+), 4 deletions(-)
---
base-commit: 1f5ffc672165ff851063a5fd044b727ab2517ae3
change-id: 20260427-add-tracepoints-for-qcom-geni-serial-948777218b7b

Best regards,
-- 
Praveen Talari <praveen.talari@oss.qualcomm.com>


^ permalink raw reply

* [PATCH v3 1/2] serial: qcom-geni: trace: Add tracepoint support for Qualcomm GENI serial
From: Praveen Talari @ 2026-05-18 17:56 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Greg Kroah-Hartman, Jiri Slaby, Konrad Dybcio
  Cc: linux-kernel, linux-trace-kernel, linux-arm-msm, linux-serial,
	Mukesh Kumar Savaliya, Aniket Randive, chandana.chiluveru,
	jyothi.seerapu, Praveen Talari
In-Reply-To: <20260518-add-tracepoints-for-qcom-geni-serial-v3-0-b4addb151376@oss.qualcomm.com>

Add tracepoint support to the Qualcomm GENI serial driver to provide
runtime visibility into driver behavior without requiring invasive debug
patches.

The trace events cover UART termios configuration, clock setup, modem
control state, interrupt status, and TX/RX data, making it easier to
diagnose communication issues in the field.

Reviewed-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com>
Signed-off-by: Praveen Talari <praveen.talari@oss.qualcomm.com>
---
v2->v3:
- Removed \n from geni_serial_tx_data and geni_serial_rx_data events.
- Resolved alignment issues in geni_serial_data, geni_serial_tx_data and
  geni_serial_rx_data events.

v1->v2:
- Removed multiple TX/RX trace events, instead used
  DECLARE_EVENT_CLASS and DEFINE_EVENT.
---
 include/trace/events/qcom_geni_serial.h | 164 ++++++++++++++++++++++++++++++++
 1 file changed, 164 insertions(+)

diff --git a/include/trace/events/qcom_geni_serial.h b/include/trace/events/qcom_geni_serial.h
new file mode 100644
index 000000000000..417ec01f9fc8
--- /dev/null
+++ b/include/trace/events/qcom_geni_serial.h
@@ -0,0 +1,164 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM qcom_geni_serial
+
+#if !defined(_TRACE_QCOM_GENI_SERIAL_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_QCOM_GENI_SERIAL_H
+
+#include <linux/device.h>
+#include <linux/tracepoint.h>
+
+TRACE_EVENT(geni_serial_set_termios,
+	    TP_PROTO(struct device *dev, unsigned int baud,
+		     unsigned int bits_per_char, u32 tx_trans_cfg,
+		     u32 tx_parity_cfg, u32 rx_trans_cfg,
+		     u32 rx_parity_cfg, u32 stop_bit_len),
+	    TP_ARGS(dev, baud, bits_per_char, tx_trans_cfg, tx_parity_cfg,
+		    rx_trans_cfg, rx_parity_cfg, stop_bit_len),
+
+	    TP_STRUCT__entry(__string(name, dev_name(dev))
+			     __field(unsigned int, baud)
+			     __field(unsigned int, bits_per_char)
+			     __field(u32, tx_trans_cfg)
+			     __field(u32, tx_parity_cfg)
+			     __field(u32, rx_trans_cfg)
+			     __field(u32, rx_parity_cfg)
+			     __field(u32, stop_bit_len)
+	    ),
+
+	    TP_fast_assign(__assign_str(name);
+			   __entry->baud = baud;
+			   __entry->bits_per_char = bits_per_char;
+			   __entry->tx_trans_cfg = tx_trans_cfg;
+			   __entry->tx_parity_cfg = tx_parity_cfg;
+			   __entry->rx_trans_cfg = rx_trans_cfg;
+			   __entry->rx_parity_cfg = rx_parity_cfg;
+			   __entry->stop_bit_len = stop_bit_len;
+	    ),
+
+	    TP_printk("%s: baud=%u bpc=%u tx_trans=0x%08x tx_par=0x%08x rx_trans=0x%08x rx_par=0x%08x stop=%u",
+		      __get_str(name), __entry->baud, __entry->bits_per_char,
+		      __entry->tx_trans_cfg, __entry->tx_parity_cfg,
+		      __entry->rx_trans_cfg, __entry->rx_parity_cfg,
+		      __entry->stop_bit_len)
+);
+
+TRACE_EVENT(geni_serial_clk_cfg,
+	    TP_PROTO(struct device *dev, unsigned int desired_rate,
+		     unsigned long clk_rate, unsigned int clk_div,
+		     unsigned int clk_idx),
+	    TP_ARGS(dev, desired_rate, clk_rate, clk_div, clk_idx),
+
+	    TP_STRUCT__entry(__string(name, dev_name(dev))
+			     __field(unsigned int, desired_rate)
+			     __field(unsigned long, clk_rate)
+			     __field(unsigned int, clk_div)
+			     __field(unsigned int, clk_idx)
+	    ),
+
+	    TP_fast_assign(__assign_str(name);
+			   __entry->desired_rate = desired_rate;
+			   __entry->clk_rate = clk_rate;
+			   __entry->clk_div = clk_div;
+			   __entry->clk_idx = clk_idx;
+	    ),
+
+	    TP_printk("%s: desired_rate=%u clk_rate=%lu clk_div=%u clk_idx=%u",
+		      __get_str(name), __entry->desired_rate, __entry->clk_rate,
+		      __entry->clk_div, __entry->clk_idx)
+);
+
+TRACE_EVENT(geni_serial_irq,
+	    TP_PROTO(struct device *dev, u32 m_irq, u32 s_irq,
+		     u32 dma_tx, u32 dma_rx),
+	    TP_ARGS(dev, m_irq, s_irq, dma_tx, dma_rx),
+
+	    TP_STRUCT__entry(__string(name, dev_name(dev))
+			     __field(u32, m_irq)
+			     __field(u32, s_irq)
+			     __field(u32, dma_tx)
+			     __field(u32, dma_rx)
+	    ),
+
+	    TP_fast_assign(__assign_str(name);
+			   __entry->m_irq = m_irq;
+			   __entry->s_irq = s_irq;
+			   __entry->dma_tx = dma_tx;
+			   __entry->dma_rx = dma_rx;
+	    ),
+
+	    TP_printk("%s: m_irq=0x%08x s_irq=0x%08x dma_tx=0x%08x dma_rx=0x%08x",
+		      __get_str(name), __entry->m_irq, __entry->s_irq,
+		      __entry->dma_tx, __entry->dma_rx)
+);
+
+DECLARE_EVENT_CLASS(geni_serial_data,
+		    TP_PROTO(struct device *dev, const u8 *buf, unsigned int len),
+		    TP_ARGS(dev, buf, len),
+
+		    TP_STRUCT__entry(__string(name, dev_name(dev))
+				     __field(unsigned int, len)
+				     __dynamic_array(u8, data, len)
+		    ),
+
+		    TP_fast_assign(__assign_str(name);
+				   __entry->len = len;
+				   memcpy(__get_dynamic_array(data), buf, len);
+		    ),
+
+		    TP_printk("%s: len=%u data=%s",
+			      __get_str(name), __entry->len,
+			      __print_hex(__get_dynamic_array(data), __entry->len))
+);
+
+DEFINE_EVENT(geni_serial_data, geni_serial_tx_data,
+	     TP_PROTO(struct device *dev, const u8 *buf, unsigned int len),
+	     TP_ARGS(dev, buf, len)
+);
+
+DEFINE_EVENT(geni_serial_data, geni_serial_rx_data,
+	     TP_PROTO(struct device *dev, const u8 *buf, unsigned int len),
+	     TP_ARGS(dev, buf, len)
+);
+
+TRACE_EVENT(geni_serial_set_mctrl,
+	    TP_PROTO(struct device *dev, unsigned int mctrl,
+		     u32 uart_manual_rfr),
+	    TP_ARGS(dev, mctrl, uart_manual_rfr),
+
+	    TP_STRUCT__entry(__string(name, dev_name(dev))
+			     __field(unsigned int, mctrl)
+			     __field(u32, uart_manual_rfr)
+	    ),
+
+	    TP_fast_assign(__assign_str(name);
+			   __entry->mctrl = mctrl;
+			   __entry->uart_manual_rfr = uart_manual_rfr;
+	    ),
+
+	    TP_printk("%s: mctrl=0x%04x uart_manual_rfr=0x%08x",
+		      __get_str(name), __entry->mctrl, __entry->uart_manual_rfr)
+);
+
+TRACE_EVENT(geni_serial_get_mctrl,
+	    TP_PROTO(struct device *dev, unsigned int mctrl, u32 geni_ios),
+	    TP_ARGS(dev, mctrl, geni_ios),
+
+	    TP_STRUCT__entry(__string(name, dev_name(dev))
+			     __field(unsigned int, mctrl)
+			     __field(u32, geni_ios)
+	    ),
+
+	    TP_fast_assign(__assign_str(name);
+			   __entry->mctrl = mctrl;
+			   __entry->geni_ios = geni_ios;
+	    ),
+
+	    TP_printk("%s: mctrl=0x%04x geni_ios=0x%08x",
+		      __get_str(name), __entry->mctrl, __entry->geni_ios)
+);
+
+#endif /* _TRACE_QCOM_GENI_SERIAL_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>

-- 
2.34.1


^ permalink raw reply related

* [PATCH v3 2/2] serial: qcom-geni: Add tracepoints for Qualcomm GENI serial driver
From: Praveen Talari @ 2026-05-18 17:56 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Greg Kroah-Hartman, Jiri Slaby, Konrad Dybcio
  Cc: linux-kernel, linux-trace-kernel, linux-arm-msm, linux-serial,
	Mukesh Kumar Savaliya, Aniket Randive, chandana.chiluveru,
	jyothi.seerapu, Praveen Talari
In-Reply-To: <20260518-add-tracepoints-for-qcom-geni-serial-v3-0-b4addb151376@oss.qualcomm.com>

Add tracing to the Qualcomm GENI serial driver to improve runtime
observability.

Trace hooks are added at key points including termios and clock
configuration, modem control get/set, interrupt handling, and data
TX/RX paths.

Reviewed-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com>
Signed-off-by: Praveen Talari <praveen.talari@oss.qualcomm.com>
---
v2->v3:
- Updated commit text (removed the example, as it is available in the
  cover letter).
---
 drivers/tty/serial/qcom_geni_serial.c | 27 +++++++++++++++++++++++----
 1 file changed, 23 insertions(+), 4 deletions(-)

diff --git a/drivers/tty/serial/qcom_geni_serial.c b/drivers/tty/serial/qcom_geni_serial.c
index e6b0a55f0cfb..9e2de074d799 100644
--- a/drivers/tty/serial/qcom_geni_serial.c
+++ b/drivers/tty/serial/qcom_geni_serial.c
@@ -7,6 +7,9 @@
 /* Disable MMIO tracing to prevent excessive logging of unwanted MMIO traces */
 #define __DISABLE_TRACE_MMIO__
 
+#define CREATE_TRACE_POINTS
+#include <trace/events/qcom_geni_serial.h>
+
 #include <linux/clk.h>
 #include <linux/console.h>
 #include <linux/io.h>
@@ -225,7 +228,7 @@ static void qcom_geni_serial_config_port(struct uart_port *uport, int cfg_flags)
 static unsigned int qcom_geni_serial_get_mctrl(struct uart_port *uport)
 {
 	unsigned int mctrl = TIOCM_DSR | TIOCM_CAR;
-	u32 geni_ios;
+	u32 geni_ios = 0;
 
 	if (uart_console(uport)) {
 		mctrl |= TIOCM_CTS;
@@ -235,6 +238,8 @@ static unsigned int qcom_geni_serial_get_mctrl(struct uart_port *uport)
 			mctrl |= TIOCM_CTS;
 	}
 
+	trace_geni_serial_get_mctrl(uport->dev, mctrl, geni_ios);
+
 	return mctrl;
 }
 
@@ -253,6 +258,8 @@ static void qcom_geni_serial_set_mctrl(struct uart_port *uport,
 	if (!(mctrl & TIOCM_RTS) && !uport->suspended)
 		uart_manual_rfr = UART_MANUAL_RFR_EN | UART_RFR_NOT_READY;
 	writel(uart_manual_rfr, uport->membase + SE_UART_MANUAL_RFR);
+
+	trace_geni_serial_set_mctrl(uport->dev, mctrl, uart_manual_rfr);
 }
 
 static const char *qcom_geni_serial_get_type(struct uart_port *uport)
@@ -683,6 +690,8 @@ static void qcom_geni_serial_start_tx_dma(struct uart_port *uport)
 	xmit_size = kfifo_out_linear_ptr(&tport->xmit_fifo, &tail,
 			UART_XMIT_SIZE);
 
+	trace_geni_serial_tx_data(uport->dev, tail, xmit_size);
+
 	qcom_geni_set_rs485_mode(uport, SER_RS485_RTS_ON_SEND);
 
 	qcom_geni_serial_setup_tx(uport, xmit_size);
@@ -909,8 +918,10 @@ static void qcom_geni_serial_handle_rx_dma(struct uart_port *uport, bool drop)
 		return;
 	}
 
-	if (!drop)
+	if (!drop) {
+		trace_geni_serial_rx_data(uport->dev, port->rx_buf, rx_in);
 		handle_rx_uart(uport, rx_in);
+	}
 
 	ret = geni_se_rx_dma_prep(&port->se, port->rx_buf,
 				  DMA_RX_BUF_SIZE,
@@ -1069,6 +1080,10 @@ static irqreturn_t qcom_geni_serial_isr(int isr, void *dev)
 	geni_status = readl(uport->membase + SE_GENI_STATUS);
 	dma = readl(uport->membase + SE_GENI_DMA_MODE_EN);
 	m_irq_en = readl(uport->membase + SE_GENI_M_IRQ_EN);
+
+	trace_geni_serial_irq(uport->dev, m_irq_status, s_irq_status,
+			      dma_tx_status, dma_rx_status);
+
 	writel(m_irq_status, uport->membase + SE_GENI_M_IRQ_CLEAR);
 	writel(s_irq_status, uport->membase + SE_GENI_S_IRQ_CLEAR);
 	writel(dma_tx_status, uport->membase + SE_DMA_TX_IRQ_CLR);
@@ -1281,8 +1296,8 @@ static int geni_serial_set_rate(struct uart_port *uport, unsigned int baud)
 		return -EINVAL;
 	}
 
-	dev_dbg(port->se.dev, "desired_rate = %u, clk_rate = %lu, clk_div = %u\n, clk_idx = %u\n",
-		baud * sampling_rate, clk_rate, clk_div, clk_idx);
+	trace_geni_serial_clk_cfg(uport->dev, baud * sampling_rate, clk_rate,
+				  clk_div, clk_idx);
 
 	uport->uartclk = clk_rate;
 	port->clk_rate = clk_rate;
@@ -1432,6 +1447,10 @@ static void qcom_geni_serial_set_termios(struct uart_port *uport,
 	writel(bits_per_char, uport->membase + SE_UART_TX_WORD_LEN);
 	writel(bits_per_char, uport->membase + SE_UART_RX_WORD_LEN);
 	writel(stop_bit_len, uport->membase + SE_UART_TX_STOP_BIT_LEN);
+
+	trace_geni_serial_set_termios(uport->dev, baud, bits_per_char,
+				      tx_trans_cfg, tx_parity_cfg, rx_trans_cfg,
+				      rx_parity_cfg, stop_bit_len);
 }
 
 #ifdef CONFIG_SERIAL_QCOM_GENI_CONSOLE

-- 
2.34.1


^ permalink raw reply related

* Re: [PATCH mm-unstable v17 04/14] mm/khugepaged: generalize __collapse_huge_page_* for mTHP support
From: Lorenzo Stoakes @ 2026-05-18 19:32 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Wei Yang, Lance Yang, npache, linux-doc, linux-kernel, linux-mm,
	linux-trace-kernel, aarcange, akpm, anshuman.khandual, apopple,
	baohua, baolin.wang, byungchul, catalin.marinas, cl, corbet,
	dave.hansen, dev.jain, gourry, hannes, hughd, jack, jackmanb,
	jannh, jglisse, joshua.hahnjy, kas, liam, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe
In-Reply-To: <9b33339e-157a-45b7-942e-3be3418a5142@kernel.org>

On Mon, May 18, 2026 at 03:16:11PM +0200, David Hildenbrand (Arm) wrote:
> On 5/14/26 05:10, Wei Yang wrote:
> > On Tue, May 12, 2026 at 03:42:02PM +0800, Lance Yang wrote:
> >>
> >> On Mon, May 11, 2026 at 12:58:04PM -0600, Nico Pache wrote:
> >>> Generalize the order of the __collapse_huge_page_* and collapse_max_*
> >>> functions to support future mTHP collapse.
> >>>
> >>> The current mechanism for determining collapse with the
> >>> khugepaged_max_ptes_none value is not designed with mTHP in mind. This
> >>> raises a key design issue: if we support user defined max_pte_none values
> >>> (even those scaled by order), a collapse of a lower order can introduce
> >>> a feedback loop, or "creep", when max_ptes_none is set to a value greater
> >>> than HPAGE_PMD_NR / 2. [1]
> >>>
> >>> With this configuration, a successful collapse to order N will populate
> >>> enough pages to satisfy the collapse condition on order N+1 on the next
> >>> scan. This leads to unnecessary work and memory churn.
> >>>
> >>> To fix this issue introduce a helper function that will limit mTHP
> >>> collapse support to two max_ptes_none values, 0 and HPAGE_PMD_NR - 1.
> >>> This effectively supports two modes: [2]
> >>>
> >>> - max_ptes_none=0: never collapses if it encounters an empty PTE or a PTE
> >>>  that maps the shared zeropage. Consequently, no memory bloat.
> >>> - max_ptes_none=511 (on 4k pagesz): Always collapse to the highest
> >>>  available mTHP order.
> >>>
> >>> This removes the possibility of "creep", while not modifying any uAPI
> >>> expectations. A warning will be emitted if any non-supported
> >>> max_ptes_none value is configured with mTHP enabled.
> >>>
> >>> mTHP collapse will not honor the khugepaged_max_ptes_shared or
> >>> khugepaged_max_ptes_swap parameters, and will fail if it encounters a
> >>> shared or swapped entry.
> >>>
> >>> No functional changes in this patch; however it defines future behavior
> >>> for mTHP collapse.
> >>>
> >>> [1] - https://lore.kernel.org/all/e46ab3ab-a3d7-4fb7-9970-d0704bd5d05a@arm.com
> >>> [2] - https://lore.kernel.org/all/37375ace-5601-4d6c-9dac-d1c8268698e9@redhat.com
> >>>
> >>> Co-developed-by: Dev Jain <dev.jain@arm.com>
> >>> Signed-off-by: Dev Jain <dev.jain@arm.com>
> >>> Signed-off-by: Nico Pache <npache@redhat.com>
> >>> ---
> >>> include/trace/events/huge_memory.h |   3 +-
> >>> mm/khugepaged.c                    | 117 ++++++++++++++++++++---------
> >>> 2 files changed, 85 insertions(+), 35 deletions(-)
> >>>
> >>> diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> >>> index bcdc57eea270..443e0bd13fdb 100644
> >>> --- a/include/trace/events/huge_memory.h
> >>> +++ b/include/trace/events/huge_memory.h
> >>> @@ -39,7 +39,8 @@
> >>> 	EM( SCAN_STORE_FAILED,		"store_failed")			\
> >>> 	EM( SCAN_COPY_MC,		"copy_poisoned_page")		\
> >>> 	EM( SCAN_PAGE_FILLED,		"page_filled")			\
> >>> -	EMe(SCAN_PAGE_DIRTY_OR_WRITEBACK, "page_dirty_or_writeback")
> >>> +	EM(SCAN_PAGE_DIRTY_OR_WRITEBACK, "page_dirty_or_writeback")	\
> >>> +	EMe(SCAN_INVALID_PTES_NONE,	"invalid_ptes_none")
> >>>
> >>> #undef EM
> >>> #undef EMe
> >>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> >>> index f68853b3caa7..27465161fa6d 100644
> >>> --- a/mm/khugepaged.c
> >>> +++ b/mm/khugepaged.c
> >>> @@ -61,6 +61,7 @@ enum scan_result {
> >>> 	SCAN_COPY_MC,
> >>> 	SCAN_PAGE_FILLED,
> >>> 	SCAN_PAGE_DIRTY_OR_WRITEBACK,
> >>> +	SCAN_INVALID_PTES_NONE,
> >>> };
> >>>
> >>> #define CREATE_TRACE_POINTS
> >>> @@ -353,37 +354,60 @@ static bool pte_none_or_zero(pte_t pte)
> >>>  * PTEs for the given collapse operation.
> >>>  * @cc: The collapse control struct
> >>>  * @vma: The vma to check for userfaultfd
> >>> + * @order: The folio order being collapsed to
> >>>  *
> >>>  * Return: Maximum number of none-page or zero-page PTEs allowed for the
> >>>  * collapse operation.
> >>>  */
> >>> -static unsigned int collapse_max_ptes_none(struct collapse_control *cc,
> >>> -		struct vm_area_struct *vma)
> >>> +static int collapse_max_ptes_none(struct collapse_control *cc,
> >>> +		struct vm_area_struct *vma, unsigned int order)
> >>> {
> >>> +	unsigned int max_ptes_none = khugepaged_max_ptes_none;
> >>> 	// If the vma is userfaultfd-armed, allow no none-page or zero-page PTEs.
> >>
> >> One thing I still want to call out: kernel code usually uses C-style
> >> comments :)
> >>
> >>> 	if (vma && userfaultfd_armed(vma))
> >>> 		return 0;
> >>> 	// for MADV_COLLAPSE, allow any none-page or zero-page PTEs.
> >>> 	if (!cc->is_khugepaged)
> >>> 		return HPAGE_PMD_NR;
> >>> -	// For all other cases repect the user defined maximum.
> >>> -	return khugepaged_max_ptes_none;
> >>> +	// for PMD collapse, respect the user defined maximum.
> >>> +	if (is_pmd_order(order))
> >>> +		return max_ptes_none;
> >>> +	/* Zero/non-present collapse disabled. */
> >>> +	if (!max_ptes_none)
> >>> +		return 0;
> >>> +	// for mTHP collapse with the sysctl value set to KHUGEPAGED_MAX_PTES_LIMIT,
> >>> +	// scale the maximum number of PTEs to the order of the collapse.
> >>> +	if (max_ptes_none == KHUGEPAGED_MAX_PTES_LIMIT)
> >>> +		return (1 << order) - 1;
> >>> +
> >>> +	// We currently only support max_ptes_none values of 0 or KHUGEPAGED_MAX_PTES_LIMIT.
> >>> +	// Emit a warning and return -EINVAL.
> >>> +	pr_warn_once("mTHP collapse only supports max_ptes_none values of 0 or %u\n",
> >>> +		      KHUGEPAGED_MAX_PTES_LIMIT);
> >>
> >> Maybe fall back to 0 instead, as David suggested earlier?
> >>
> >
> > It looks reasonable to fall back to 0.
> >
> > But as the updated documentation says in patch 14:
> >
> >   For mTHP collapse, only 0 or (HPAGE_PMD_NR - 1) are supported. Any other
> >   value will emit a warning and no mTHP collapse will be attempted.
> >
> > This is why it works like this now.
> >
> >     mthp_collapse()
> >         max_ptes_none = collapse_max_ptes_none();
> >         if (max_ptes_none < 0)
> >             return collapsed;
> >
> >> max_ptes_none is mostly legacy PMD THP behavior. mTHP is new, and any
> >> intermediate value in (0, KHUGEPAGED_MAX_PTES_LIMIT) would implicitly
> >> disable it :(
> >>
> >
> > So it depends on what we want to do here :-)
> >
> > For me, I would vote for fallback to 0.
>
> At this point I'd prefer to not return errors from collapse_max_ptes_none().
> It's just rather awkward to return an error deep down in collapse code for a
> configuration problem.
>
> For mthp collapse, we only support max_ptes_none==0 and
> max_ptes_none=="HPAGE_PMD_NR - 1" (default).
>
> If another value is specified while collapsing mTHP, print a warning and treat
> it as 0 (save value, no creep, no memory waste).
>
> In a sense, this is similar to how we handle max_ptes_shared + max_ptes_swap:
> for mTHP: we always treat them as being 0 for mTHP collapse (and don't issue a
> warning, because we would issue a warning with the default settings).
>
> @Lorenzo, fine with you?

Yes 100%, this sounds sensible both in terms of the error and the default. Let's
keep our lives simple(-ish) please :)
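
For reference, a minimal userspace sketch of the behaviour agreed on above:
no error return, and an unsupported value during mTHP collapse warns once
and is treated as 0. The function name model_max_ptes_none() and the
constants for a 4K-page/2M-PMD configuration are assumptions for
illustration, and the userfaultfd check is left out; this is not the actual
patch.

```c
#include <stdbool.h>
#include <stdio.h>

/* Stand-in values, assuming 4K pages and a 2M PMD (illustrative only). */
#define HPAGE_PMD_ORDER			9
#define HPAGE_PMD_NR			(1u << HPAGE_PMD_ORDER)
#define KHUGEPAGED_MAX_PTES_LIMIT	(HPAGE_PMD_NR - 1)

/*
 * Hypothetical model of collapse_max_ptes_none() with the "fall back to 0"
 * policy. It never returns an error.
 */
static unsigned int model_max_ptes_none(unsigned int order,
					unsigned int max_ptes_none,
					bool is_khugepaged)
{
	static bool warned;

	/* MADV_COLLAPSE: allow any none-page or zero-page PTEs. */
	if (!is_khugepaged)
		return HPAGE_PMD_NR;

	/* PMD collapse: respect the user defined maximum. */
	if (order == HPAGE_PMD_ORDER)
		return max_ptes_none;

	/* mTHP collapse with the maximum value: scale it to the order. */
	if (max_ptes_none == KHUGEPAGED_MAX_PTES_LIMIT)
		return (1u << order) - 1;

	/* mTHP collapse with any other non-zero value: warn once, use 0. */
	if (max_ptes_none && !warned) {
		warned = true;
		fprintf(stderr,
			"mTHP collapse only supports max_ptes_none values of 0 or %u\n",
			KHUGEPAGED_MAX_PTES_LIMIT);
	}
	return 0;
}
```

With this shape the caller no longer needs a "max_ptes_none < 0" check.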

>
> --
> Cheers,
>
> David

Cheers, Lorenzo

^ permalink raw reply

* Re: [PATCH v4 2/3] perf: enable unprivileged syscall tracing with perf trace
From: Peter Zijlstra @ 2026-05-18 21:41 UTC (permalink / raw)
  To: Anubhav Shelat
  Cc: mpetlan, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	James Clark, Thomas Falcon, linux-kernel, linux-trace-kernel,
	linux-perf-users
In-Reply-To: <20260515194010.93725-4-ashelat@redhat.com>

On Fri, May 15, 2026 at 03:40:06PM -0400, Anubhav Shelat wrote:
> Allow unprivileged users to trace their own processes' syscalls using
> perf trace, similar to strace without the intrusive overhead of ptrace().
> 
> Currently, perf trace requires CAP_PERFMON or paranoid level ≤ 1 even
> though the kernel has existing infrastructure (TRACE_EVENT_FL_CAP_ANY)
> specifically designed to mark syscall tracepoints as safe for
> unprivileged access. To fix this:
> 
> 1. Loosen the condition in perf_event_open() which requires privileges
>    for all events with exclude_kernel=0. This allows perf_event_open() to
>    bypass the paranoid check for task-attached tracepoint events. Ensure
>    that sample types which can expose kernel addresses to unprivileged
>    users are blocked. Ensure the PERF_SECURITY_KERNEL LSM hook is
>    preserved.
> 
> 2. Make the format and id tracefs files world-readable only for tracepoints
>    with TRACE_EVENT_FL_CAP_ANY, allowing unprivileged users to see syscall
>    tracepoint ids without exposing sensitive information.
> 
> 3. Add a check to perf_trace_event_perm() to block PERF_SAMPLE_IP on
>    kernel tracepoints for unprivileged users to prevent KASLR bypass. We do
>    this here rather than in kaddr_leak because perf_trace_event_perm() can
>    distinguish between kernel tracepoints and uprobe tracepoints, where the
>    IP is a safe user space address and is necessary for uprobe
>    functionality.
> 
> 4. Restrict pure counting events (no PERF_SAMPLE_RAW) to
>    TRACE_EVENT_FL_CAP_ANY tracepoints preventing unprivileged users from
>    counting internal kernel tracepoints while preserving current
>    behavior for exclude_kernel=1 events.

Typically patches are supposed to do a single thing; you're listing 4
things. What gives?

> Example usage after this change:
>   $ perf trace ls          # works as unprivileged user
>   $ perf trace             # system-wide, still requires privileges
>   $ perf trace -p 1234     # requires ptrace permission on pid 1234
> 
> Assisted-by: Claude:claude-sonnet-4.5
> Signed-off-by: Anubhav Shelat <ashelat@redhat.com>
> ---
>  kernel/events/core.c            | 28 +++++++++++++++++++++++++---
>  kernel/trace/trace_event_perf.c | 21 ++++++++++++++++++++-
>  kernel/trace/trace_events.c     | 16 ++++++++++++++--
>  3 files changed, 59 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 7935d5663944..ff2d1e9a0b79 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -13873,9 +13873,31 @@ SYSCALL_DEFINE5(perf_event_open,
>  		return err;
>  
>  	if (!attr.exclude_kernel) {
> -		err = perf_allow_kernel();
> -		if (err)
> -			return err;
> +		bool tp_bypass = false;
> +
> +		/* Check unprivileged tracepoints */
> +		if (attr.type == PERF_TYPE_TRACEPOINT && pid != -1) {
> +			/*
> +			 * Block sample types that expose kernel addresses to
> +			 * prevent KASLR bypass
> +			 */
> +			u64 kaddr_leak = PERF_SAMPLE_CALLCHAIN |
> +					 PERF_SAMPLE_BRANCH_STACK |
> +					 PERF_SAMPLE_ADDR |
> +					 PERF_SAMPLE_REGS_INTR;

PERF_SAMPLE_IP should be here too, no?

And I'm not sure if tracepoints can trigger it, but PHYS_ADDR also seems
like something we shouldn't allow.

And we're sure RAW doesn't include pointers?
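
A rough userspace sketch of the wider mask, assuming IP and PHYS_ADDR are
simply added to it. The MODEL_* constants mirror the bit positions of the
PERF_SAMPLE_* values in the perf uapi header, and model_tp_bypass() is a
made-up name. Note the commit message argues for handling IP elsewhere
because of uprobes, so this only illustrates the question, not a settled
answer.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local copies of the PERF_SAMPLE_* bits, for illustration only. */
#define MODEL_SAMPLE_IP			(UINT64_C(1) << 0)
#define MODEL_SAMPLE_ADDR		(UINT64_C(1) << 3)
#define MODEL_SAMPLE_CALLCHAIN		(UINT64_C(1) << 5)
#define MODEL_SAMPLE_RAW		(UINT64_C(1) << 10)
#define MODEL_SAMPLE_BRANCH_STACK	(UINT64_C(1) << 11)
#define MODEL_SAMPLE_REGS_INTR		(UINT64_C(1) << 18)
#define MODEL_SAMPLE_PHYS_ADDR		(UINT64_C(1) << 19)

/* Sample types that can expose kernel or physical addresses. */
#define MODEL_KADDR_LEAK	(MODEL_SAMPLE_IP |		\
				 MODEL_SAMPLE_ADDR |		\
				 MODEL_SAMPLE_CALLCHAIN |	\
				 MODEL_SAMPLE_BRANCH_STACK |	\
				 MODEL_SAMPLE_REGS_INTR |	\
				 MODEL_SAMPLE_PHYS_ADDR)

/* True if no leak-prone sample type is requested. */
static bool model_tp_bypass(uint64_t sample_type)
{
	return !(sample_type & MODEL_KADDR_LEAK);
}
```

Whether RAW belongs in the mask depends on what the tracepoint payloads
contain, which this sketch does not try to answer.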

> +
> +			tp_bypass = !(attr.sample_type & kaddr_leak);
> +		}
> +
> +		if (!tp_bypass) {
> +			err = perf_allow_kernel();
> +			if (err)
> +				return err;
> +		} else {
> +			err = security_perf_event_open(PERF_SECURITY_KERNEL);
> +			if (err)
> +				return err;
> +		}
>  	}
>  
>  	if (attr.namespaces) {
> diff --git a/kernel/trace/trace_event_perf.c b/kernel/trace/trace_event_perf.c
> index a6bb7577e8c5..466007ed2869 100644
> --- a/kernel/trace/trace_event_perf.c
> +++ b/kernel/trace/trace_event_perf.c
> @@ -72,9 +72,28 @@ static int perf_trace_event_perm(struct trace_event_call *tp_event,
>  			return -EINVAL;
>  	}
>  
> +	/*
> +	 * PERF_SAMPLE_IP on kernel tracepoints exposes a kernel text
> +	 * address, weakening KASLR. Block for unprivileged users unless
> +	 * the tracepoint is a uprobe (userspace IP, safe to expose).
> +	 */
> +	if ((p_event->attr.sample_type & PERF_SAMPLE_IP) &&
> +	    !p_event->attr.exclude_kernel &&
> +	    !(tp_event->flags & TRACE_EVENT_FL_UPROBE) &&
> +	    sysctl_perf_event_paranoid > 1 && !perfmon_capable())
> +		return -EACCES;
> +
>  	/* No tracing, just counting, so no obvious leak */
> -	if (!(p_event->attr.sample_type & PERF_SAMPLE_RAW))
> +	if (!(p_event->attr.sample_type & PERF_SAMPLE_RAW)) {
> +		/* Prevent unprivileged users from counting kernel tracepoints */
> +		if (!p_event->attr.exclude_kernel &&
> +		    sysctl_perf_event_paranoid > 1 && !perfmon_capable()) {
> +			if (!(p_event->attach_state == PERF_ATTACH_TASK &&
> +			      (tp_event->flags & TRACE_EVENT_FL_CAP_ANY)))
> +				return -EACCES;
> +		}
>  		return 0;
> +	}

Maybe use less AI and try and type this yourself. I think you'll find
that repeating the same clauses over and over gets tiresome. IIRC they
invented something for that in the 60s or so :/
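
Something along these lines, as a userspace model only. The struct and
function names are hypothetical stand-ins for the perf_event attr, the
trace_event_call flags and the sysctl/capability state; the real function
has further checks after this point that are not modelled here.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel state consulted by the checks. */
struct model_event {
	bool exclude_kernel;
	bool attach_task;	/* attach_state == PERF_ATTACH_TASK */
	bool sample_ip;		/* sample_type & PERF_SAMPLE_IP */
	bool sample_raw;	/* sample_type & PERF_SAMPLE_RAW */
};

struct model_tp {
	bool is_uprobe;		/* flags & TRACE_EVENT_FL_UPROBE */
	bool cap_any;		/* flags & TRACE_EVENT_FL_CAP_ANY */
};

struct model_caller {
	int paranoid;		/* sysctl_perf_event_paranoid */
	bool perfmon;		/* perfmon_capable() */
};

/* The clause that the patch spells out twice, written once. */
static bool model_unpriv_kernel(const struct model_event *ev,
				const struct model_caller *who)
{
	return !ev->exclude_kernel && who->paranoid > 1 && !who->perfmon;
}

static int model_tp_perm(const struct model_event *ev,
			 const struct model_tp *tp,
			 const struct model_caller *who)
{
	bool unpriv = model_unpriv_kernel(ev, who);

	/* Kernel IP would leak a text address; a uprobe IP is user space. */
	if (unpriv && ev->sample_ip && !tp->is_uprobe)
		return -EACCES;

	/* Pure counting: only task-attached CAP_ANY tracepoints. */
	if (!ev->sample_raw) {
		if (unpriv && !(ev->attach_task && tp->cap_any))
			return -EACCES;
		return 0;
	}

	return 0;	/* later checks of the real function not modelled */
}
```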

>  	/* Some events are ok to be traced by non-root users... */
>  	if (p_event->attach_state == PERF_ATTACH_TASK) {
> diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
> index c46e623e7e0d..cbd07e2ec528 100644
> --- a/kernel/trace/trace_events.c
> +++ b/kernel/trace/trace_events.c
> @@ -3050,7 +3050,13 @@ static int event_callback(const char *name, umode_t *mode, void **data,
>  	struct trace_event_call *call = file->event_call;
>  
>  	if (strcmp(name, "format") == 0) {
> -		*mode = TRACE_MODE_READ;
> +		/*
> +		 * Make format tracefs file world readable for tracepoints with
> +		 * TRACE_EVENT_FL_CAP_ANY
> +		 */
> +		*mode = (call->flags & TRACE_EVENT_FL_CAP_ANY) ?
> +			(TRACE_MODE_READ | 0004) :
> +			TRACE_MODE_READ;
>  		*fops = &ftrace_event_format_fops;
>  		return 1;
>  	}
> @@ -3086,7 +3092,13 @@ static int event_callback(const char *name, umode_t *mode, void **data,
>  #ifdef CONFIG_PERF_EVENTS
>  	if (call->event.type && call->class->reg &&
>  	    strcmp(name, "id") == 0) {
> -		*mode = TRACE_MODE_READ;
> +		/*
> +		 * Make id tracefs file world readable for tracepoints with
> +		 * TRACE_EVENT_FL_CAP_ANY
> +		 */
> +		*mode = (call->flags & TRACE_EVENT_FL_CAP_ANY) ?
> +			(TRACE_MODE_READ | 0004) :
> +			TRACE_MODE_READ;
>  		*data = (void *)(long)call->event.type;
>  		*fops = &ftrace_event_id_fops;
>  		return 1;

Again, you're doing the same thing in multiple places. If only there was
something to re-use a previous expression.
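
For instance, a single helper that both the "format" and "id" branches
could call. This is a userspace sketch: model_event_file_mode() is a
made-up name, and the two MODEL_* constants are stand-ins for
TRACE_MODE_READ and TRACE_EVENT_FL_CAP_ANY whose values here are
assumptions.

```c
/* Stand-in values; the kernel definitions may differ. */
#define MODEL_TRACE_MODE_READ	0440u
#define MODEL_FL_CAP_ANY	(1u << 1)

/*
 * Mode for a per-event tracefs file: the usual read mode, plus
 * world-readable when the event is marked safe for any user.
 */
static unsigned int model_event_file_mode(unsigned int call_flags)
{
	unsigned int mode = MODEL_TRACE_MODE_READ;

	if (call_flags & MODEL_FL_CAP_ANY)
		mode |= 0004;

	return mode;
}
```

Both call sites would then reduce to one assignment each.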

None of this gives me warm and fuzzy feelings.

^ permalink raw reply

* Re: [PATCH v3 10/11] kernel: time, trace: Use trace_call__##name() at guarded tracepoint call sites
From: Vineeth Remanan Pillai @ 2026-05-18 23:13 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Anna-Maria Behnsen, Frederic Weisbecker, Ingo Molnar,
	Steven Rostedt, Masami Hiramatsu, linux-kernel,
	linux-trace-kernel, Peter Zijlstra
In-Reply-To: <87jyt2xzj6.ffs@tglx>

On Sun, May 17, 2026 at 3:31 AM Thomas Gleixner <tglx@kernel.org> wrote:
>
> On Fri, May 15 2026 at 09:59, Vineeth Pillai wrote:
> > ---
> >  kernel/time/tick-sched.c       | 12 ++++++------
> >  kernel/trace/trace_benchmark.c |  2 +-
> >  2 files changed, 7 insertions(+), 7 deletions(-)
>
> Please split that into a tick/sched and trace patch so each can be picked
> up in the relevant subsystems.
>
Sorry about this, will split it and send it in the next iteration.

Thanks,
Vineeth

^ permalink raw reply

* Re: [PATCH v3 06/11] drm: Use trace_call__##name() at guarded tracepoint call sites
From: Vineeth Remanan Pillai @ 2026-05-18 23:20 UTC (permalink / raw)
  To: phasta
  Cc: Alex Deucher, Christian König, David Airlie, Simona Vetter,
	Harry Wentland, Leo Li, Matthew Brost, Danilo Krummrich,
	Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, amd-gfx,
	dri-devel, Steven Rostedt, linux-trace-kernel, Peter Zijlstra
In-Reply-To: <81783d0807a5ffac93f61eddba0d2f595d7f239f.camel@mailbox.org>

On Mon, May 18, 2026 at 11:01 AM Philipp Stanner <phasta@mailbox.org> wrote:
>
> On Fri, 2026-05-15 at 09:59 -0400, Vineeth Pillai (Google) wrote:
> > From: Vineeth Pillai <vineeth@bitbyteword.org>
> >
> > Replace trace_foo() with the new trace_call__foo() at sites already
> > guarded by trace_foo_enabled(), avoiding a redundant
> > static_branch_unlikely() re-evaluation inside the tracepoint.
> > trace_call__foo() calls the tracepoint callbacks directly without
> > utilizing the static branch again.
>
> The "foo" terminology is unusual I think? I always wrote it with regex,
> like "trace_*()".
>
Sorry about the terminology. Part of the patches got merged this way,
so is it okay to continue the terminology to have consistency?
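
To make the "foo" placeholder concrete, here is a small userspace model of
the pattern the commit message describes. The static key is modelled as a
plain bool and the two counters exist only to show how often the key is
evaluated; none of this is the kernel implementation, and emit_guarded() is
a made-up caller.

```c
#include <stdbool.h>

static bool model_foo_key;		/* stands in for the static key */
static unsigned int model_key_checks;	/* how often the key was evaluated */
static unsigned int model_callbacks_run;

static bool trace_foo_enabled(void)
{
	model_key_checks++;
	return model_foo_key;
}

/* Runs the callbacks directly, without looking at the key again. */
static void trace_call__foo(int arg)
{
	(void)arg;
	model_callbacks_run++;
}

/* The regular tracepoint: check the key, then run the callbacks. */
static void trace_foo(int arg)
{
	if (trace_foo_enabled())
		trace_call__foo(arg);
}

/* A call site that is already guarded, e.g. around a loop. */
static void emit_guarded(const int *vals, unsigned int nr)
{
	unsigned int i;

	if (!trace_foo_enabled())
		return;

	for (i = 0; i < nr; i++)
		trace_call__foo(vals[i]);
}
```

With trace_foo() in the loop instead, the key would be evaluated once for
the guard and once more per element.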

>
>
> >
> > Original v2 series:
> > https://lore.kernel.org/linux-trace-kernel/20260323160052.17528-1-vineeth@bitbyteword.org/
>
> I'd put this in a Link: tag section below.
>
Makes sense, will do. Steve also suggested to put this whole section
after "---" because it isn't relevant to the changes. Will fix this in
next iteration.

> >
> > Parts of the original v2 series have already been merged in mainline.
> > This patch is being reposted as a follow-up cleanup for the remaining
> > unmerged pieces.
>
> So this v3 series as a whole is a followup to that v2?
>
v3 is a follow-up covering the patches that were not merged in the
previous cycle. The core API and a couple of patches went in during the
previous cycle, so this is for the rest of them.

The intention was to send this v3 as standalone patches to the individual
subsystem maintainers, but I forgot to remove the numbering, which may
have caused confusion. I will remove the numbering and send it as a
standalone patch in the next iteration.

> >
> > Suggested-by: Steven Rostedt <rostedt@goodmis.org>
> > Suggested-by: Peter Zijlstra <peterz@infradead.org>
> > Signed-off-by: Vineeth Pillai (Google) <vineeth@bitbyteword.org>
> > Assisted-by: Claude:claude-sonnet-4-6
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c            |  2 +-
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c            |  4 ++--
> >  drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 10 +++++-----
> >  drivers/gpu/drm/scheduler/sched_entity.c          |  5 +++--
> >  4 files changed, 11 insertions(+), 10 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> > index b24d5d21be5f..cb0b5cb07d57 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> > @@ -1004,7 +1004,7 @@ static void trace_amdgpu_cs_ibs(struct amdgpu_cs_parser *p)
> >               struct amdgpu_job *job = p->jobs[i];
> >
> >               for (j = 0; j < job->num_ibs; ++j)
> > -                     trace_amdgpu_cs(p, job, &job->ibs[j]);
> > +                     trace_call__amdgpu_cs(p, job, &job->ibs[j]);
> >       }
> >  }
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > index 9ba9de16a27a..a36ae94c425f 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > @@ -1415,7 +1415,7 @@ int amdgpu_vm_bo_update(struct amdgpu_device *adev, struct amdgpu_bo_va *bo_va,
> >
> >       if (trace_amdgpu_vm_bo_mapping_enabled()) {
> >               list_for_each_entry(mapping, &bo_va->valids, list)
> > -                     trace_amdgpu_vm_bo_mapping(mapping);
> > +                     trace_call__amdgpu_vm_bo_mapping(mapping);
> >       }
> >
> >  error_free:
> > @@ -2183,7 +2183,7 @@ void amdgpu_vm_bo_trace_cs(struct amdgpu_vm *vm, struct ww_acquire_ctx *ticket)
> >                               continue;
> >               }
> >
> > -             trace_amdgpu_vm_bo_cs(mapping);
> > +             trace_call__amdgpu_vm_bo_cs(mapping);
> >       }
> >  }
> >
> > diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> > index 5fc5d5608506..fbdc12cdd6bb 100644
> > --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> > +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> > @@ -5263,11 +5263,11 @@ static void amdgpu_dm_backlight_set_level(struct amdgpu_display_manager *dm,
> >       }
> >
> >       if (trace_amdgpu_dm_brightness_enabled()) {
> > -             trace_amdgpu_dm_brightness(__builtin_return_address(0),
> > -                                        user_brightness,
> > -                                        brightness,
> > -                                        caps->aux_support,
> > -                                        power_supply_is_system_supplied() > 0);
> > +             trace_call__amdgpu_dm_brightness(__builtin_return_address(0),
> > +                                              user_brightness,
> > +                                              brightness,
> > +                                              caps->aux_support,
> > +                                              power_supply_is_system_supplied() > 0);
> >       }
> >
> >       if (caps->aux_support) {
> > diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
> > index fe174a4857be..185a2636b599 100644
> > --- a/drivers/gpu/drm/scheduler/sched_entity.c
> > +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> > @@ -429,7 +429,8 @@ static bool drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity,
> >
> >       if (trace_drm_sched_job_unschedulable_enabled() &&
> >           !test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &entity->dependency->flags))
> > -             trace_drm_sched_job_unschedulable(sched_job, entity->dependency);
> > +             trace_call__drm_sched_job_unschedulable(sched_job,
> > +                                                     entity->dependency);
>
> I would be more happy if you sacrifice a bit of space here and keep it
> a single line since the if condition is already quite convoluted and
> challenging to read.
>
I understand, will fix it in next iteration.

Thanks,
Vineeth

^ permalink raw reply

* Re: [PATCH v3 08/11] scsi: ufs: Use trace_call__##name() at guarded tracepoint call sites
From: Vineeth Remanan Pillai @ 2026-05-18 23:22 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Steven Rostedt, James E.J. Bottomley, Martin K. Petersen,
	linux-scsi, linux-trace-kernel, Peter Zijlstra
In-Reply-To: <ebdc020e-419d-458a-9211-36f22af0c1d9@acm.org>

On Fri, May 15, 2026 at 3:22 PM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 5/15/26 11:50 AM, Steven Rostedt wrote:
> > On Fri, 15 May 2026 08:27:27 -0700
> > Bart Van Assche <bvanassche@acm.org> wrote:
> >
> >> On 5/15/26 6:59 AM, Vineeth Pillai (Google) wrote:
> >>>    static void ufshcd_add_query_upiu_trace(struct ufs_hba *hba,
> >>> @@ -432,8 +432,8 @@ static void ufshcd_add_query_upiu_trace(struct ufs_hba *hba,
> >>>     if (!trace_ufshcd_upiu_enabled())
> >>>             return;
> >>>
> >>> -   trace_ufshcd_upiu(hba, str_t, &rq_rsp->header,
> >>> -                     &rq_rsp->qr, UFS_TSF_OSF);
> >>> +   trace_call__ufshcd_upiu(hba, str_t, &rq_rsp->header,
> >>> +                          &rq_rsp->qr, UFS_TSF_OSF);
> >>>    }
> >>
> >> Instead of making this change, please remove the
> >> trace_ufshcd_upiu_enabled() call because it is redundant.
> >
> > You mean to remove the ufshcd_add_query_upiu_trace() function and just use
> > a tracepoint where it is called?
>
> That would be even better.
>
Will do.

> >>>    static void ufshcd_add_tm_upiu_trace(struct ufs_hba *hba, unsigned int tag,
> >>> @@ -445,15 +445,15 @@ static void ufshcd_add_tm_upiu_trace(struct ufs_hba *hba, unsigned int tag,
> >>>             return;
> >>>
> >>>     if (str_t == UFS_TM_SEND)
> >>> -           trace_ufshcd_upiu(hba, str_t,
> >>> -                             &descp->upiu_req.req_header,
> >>> -                             &descp->upiu_req.input_param1,
> >>> -                             UFS_TSF_TM_INPUT);
> >>> +           trace_call__ufshcd_upiu(hba, str_t,
> >>> +                                   &descp->upiu_req.req_header,
> >>> +                                   &descp->upiu_req.input_param1,
> >>> +                                   UFS_TSF_TM_INPUT);
> >>>     else
> >>> -           trace_ufshcd_upiu(hba, str_t,
> >>> -                             &descp->upiu_rsp.rsp_header,
> >>> -                             &descp->upiu_rsp.output_param1,
> >>> -                             UFS_TSF_TM_OUTPUT);
> >>> +           trace_call__ufshcd_upiu(hba, str_t,
> >>> +                                   &descp->upiu_rsp.rsp_header,
> >>> +                                   &descp->upiu_rsp.output_param1,
> >>> +                                   UFS_TSF_TM_OUTPUT);
> >>>    }
> >>
> >> Same comment here: I think it would be better to remove the
> >> trace_ufshcd_upiu_enabled() call rather than
> >> changing trace_ufshcd_upiu() into trace_call__ufshcd_upiu().
> >
> > Well, removing it here would mean placing the if (str == UFS_TM_SEND) into
> > the code and processing it even when tracing is disabled. With the
> > trace_*_enabled() helper, it's all a nop.
>
> The ufshcd_add_tm_upiu_trace() function is only called from the UFS
> error handler and hence is not performance sensitive. The execution of
> an additional if-test in this function is not a concern at all.
>
Sure, I shall change this.

Thanks,
Vineeth

^ permalink raw reply

* [PATCH 00/28] mm/damon: introduce data attributes monitoring
From: SeongJae Park @ 2026-05-18 23:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: SeongJae Park, Liam R. Howlett, David Hildenbrand,
	Jonathan Corbet, Lorenzo Stoakes, Masami Hiramatsu,
	Mathieu Desnoyers, Michal Hocko, Mike Rapoport, Shuah Khan,
	Shuah Khan, Steven Rostedt, Suren Baghdasaryan, Vlastimil Babka,
	damon, linux-doc, linux-kernel, linux-kselftest, linux-mm,
	linux-trace-kernel

TL; DR
======

Extend DAMON for monitoring general data attributes other than accesses.
The short term motivation is lightweight page type (e.g., belonging
cgroup) aware monitoring.  In the long term, this will help extend DAMON
to multiple access event capture primitives (e.g., page faults and
PMU) and eventually pivot DAMON to a "Data Attributes Monitoring and
Operations eNgine".

Background: High Cost of Page Level Properties Monitoring
=========================================================

DAMON was initially introduced as a Data Access MONitor.  It has since
been extended beyond access monitoring to data access-aware system
operations (DAMOS).  But the monitoring part still covers only data
accesses.

Data access patterns are good information, but some users need more
holistic views.  Particularly, users want to see the access pattern
information together with the types of the memory.  For example, users
who work on making huge pages efficient want to know how much of the
DAMON-found hot/cold regions are backed by huge pages.  Users who run
multiple workloads with different cgroups want to know how much of the
DAMON-found hot/cold regions belong to specific cgroups.

To meet this demand, we developed a DAMOS extension for page level
properties based monitoring [1], which landed in 6.14.  Using the
feature, users can specify the page level data properties that they are
interested in, in a flexible format that uses DAMOS filters.  Then,
DAMON applies the filters to each folio of the entire DAMON region and
lets users know how many bytes of memory in each DAMON region passed the
given filters.

This gives page level detailed and deterministic information to users.
But, because the operation is done at page level, the overhead is
proportional to the memory size.  It was useful for test or debugging
purposes on a small number of machines.  But it was obviously too heavy
to be enabled always on all machines running the real user workloads.
For real world workloads, it was recommended to use the feature with
user-space controlled sampling approaches.  For example, users could do
the page level monitoring only once per hour, on a randomly selected one
percent of the machines in their fleet.  If the runtime is long enough
and the fleet is big enough, it should provide statistically meaningful
data.

But users are too busy to implement such controls on their own.

Data Attributes Monitoring
==========================

Extend DAMON to monitor not only data accesses, but also general data
attributes.  Do the extension while keeping the main promise of DAMON,
the bounded and best-effort minimum overhead.

Allow users to specify which data attributes, in addition to data
accesses, they want to monitor.  Users can install one 'data probe' per
data attribute of interest for this purpose.  A 'data probe' should be
applicable to any memory, and determine whether the given memory has
the data attribute in question.  E.g., whether the memory at physical
address 42 belongs to cgroup A.  Each 'data probe' is configured with
filters that are very similar to the DAMOS filters.

When DAMON checks whether the sampling address of each region has been
accessed since the last check, it also applies the data probes, if any
are registered.  In the same way that it accounts the number of access
check-positive samples (nr_accesses), it accounts the number of
positive samples of each data probe in another per-region counters
array, namely 'probe_hits'.  When DAMON resets nr_accesses every
aggregation interval, it resets 'probe_hits' together.
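
The accounting described above can be modelled in userspace roughly as
follows.  This is only an illustrative sketch: the model_* names, the
counter type and the two example probes are assumptions, not the data
structures or callbacks of this series.

```c
#include <stdbool.h>
#include <string.h>

/* The cover letter limits this version to four probes. */
#define MODEL_MAX_PROBES	4

/* Hypothetical per-region counters; not the real damon_region. */
struct model_region {
	unsigned int nr_accesses;
	unsigned int probe_hits[MODEL_MAX_PROBES];
};

/* A data probe: does the memory at @addr have the attribute? */
typedef bool (*model_probe_fn)(unsigned long addr);

/* Two made-up probes, only for demonstration. */
static bool model_probe_even_page(unsigned long addr)
{
	return ((addr >> 12) & 1) == 0;
}

static bool model_probe_high(unsigned long addr)
{
	return addr >= 0x3000;
}

/* One sampling check: account the access and every positive probe. */
static void model_sample(struct model_region *r, unsigned long addr,
			 bool accessed, const model_probe_fn *probes,
			 unsigned int nr_probes)
{
	unsigned int i;

	if (nr_probes > MODEL_MAX_PROBES)
		nr_probes = MODEL_MAX_PROBES;

	if (accessed)
		r->nr_accesses++;
	for (i = 0; i < nr_probes; i++) {
		if (probes[i](addr))
			r->probe_hits[i]++;
	}
}

/* End of an aggregation interval: reset both kinds of counters. */
static void model_reset_aggregated(struct model_region *r)
{
	r->nr_accesses = 0;
	memset(r->probe_hits, 0, sizeof(r->probe_hits));
}
```

The point of the shape is that the probes run on the address that is
already being sampled for the access check, so no extra memory walk is
needed.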

Users can read 'probe_hits' just before the values are reset.  In this
way, users can know how many hot/cold memory regions have the data
attributes of their interest.  E.g., 30 percent of this system's hot
memory belongs to cgroup A, and 80 percent of the hot memory belonging
to cgroup A is backed by huge pages.

Patches Sequence
================

The first eight patches implement the core feature, interface and the
working support.  Patch 1 introduces data probe data structure, namely
damon_probe.  Patch 2 extends damon_ctx for installing data probes.
Patch 3 introduces another data structure for filters of each data
probe, namely damon_filter.  Patch 4 updates damon_ctx commit function
to handle the probes.  Patch 5 extends damon_region for the per-region
per-probe positive samples counter, namely probe_hits.  Patch 6 extends
damon_operations for applying probes on the underlying DAMON operations
implementation.  Patch 7 updates kdamond_fn() to invoke the probes
applying callback.  Patch 8 finally implements the probes support on
paddr ops.

Ten changes for the user interface (patches 9-18) come next.  Patches
9-13 implement sysfs directories and files for setting data probes,
namely probes directory, probe directory, filters directory, filter
directory and filter directory internal files, respectively.  Patch 14
connects the user inputs that are made via the sysfs files to DAMON
core.  The following three patches (patches 15-17) implement sysfs
directories and files for showing the probe_hits to users, namely probes
directory, probe directory and hits files, respectively.  Patch 18
introduces a new tracepoint for showing the probe_hits via tracefs.

Patch 19 adds a selftest for the sysfs files.

Patches 20 and 21 document the design and usage of the new feature,
respectively.

Seven additional patches (patches 22-28) for monitoring belonging memory
cgroup follow.  Depending on the feedback, this part might be separated
into another series in the future.  Patch 22 defines the DAMON filter
type for the new attribute, namely DAMON_FILTER_TYPE_MEMCG.  Patch 23
adds the support on paddr ops.  Patch 24 updates the sysfs interface for
setup of the target memcg.  Patch 25 moves code for easy reuse of the
filter target memcg setup.  Patch 26 connects the user input to the core
layer.  Finally, patches 27 and 28 update the design and usage documents
for the memcg attribute monitoring support.

Discussions
===========

This allows page properties monitoring with overhead that is low enough
to be always enabled on real world workloads.  Because the sampling time
for the access check is reused for the data attributes check, the
upper-bounded and best-effort minimum overhead of DAMON is kept.
Because the sampling memory for the access check is reused for the data
attributes check, the additional overhead is minimal.

Still, DAMOS-based page level properties monitoring should be useful,
because it provides deterministic page level information.  When in
doubt about the sampling based information, running the DAMOS-based one
together and comparing the results would be useful for debugging and
tuning.

Plan for Dropping RFC tag
=========================

Making changes for feedback from myself, humans and Sashiko should be
the major remaining work.

I'm currently hoping to drop the RFC tag by 7.2-rc1.

Future Works: Mid Term
======================

This version of the implementation limits the maximum number of data
probes to four.  I will try to find a way to remove the limit in the
future.  I personally think it should be enough for common use cases,
though, and am therefore not giving it high priority at the moment.

Future Works: Long Term
=======================

There are user requests for extending DAMON with detailed access
information, for example, per-CPUs/threads/read/writes monitoring.  For
that, I was working [2] on extending DAMON to use page fault events as
another access check primitive, and making the infrastructure flexible
for future use of yet another access check primitive.  Actually there is
another ongoing work [3] for extending DAMON with PMU events.  The
motivation of that work is reducing the overhead, though.

In my work [2], I was introducing a new interface for access sampling
primitives control.  Now I think this data probe interface can be used
for that, too.  That is, data access becomes just one type of data
attribute.  Also, pg_idle-confirmed access, page fault-confirmed access,
and PMU event-confirmed access will be different types of data
attributes.

The regions adjustment mechanism is currently working based on the
access information.  That's because DAMON is designed for data access
monitoring.  That is, data access information is the primary interest,
and therefore DAMON adjusts regions in a way that can best-present the
information.

Once data access becomes just one of the data attributes, there is no
reason to treat data access as that special.  There might be some users
not
interested in access at all but want to know the location of memory of
specific type.  Data probes interface will allow doing that.  Further,
we could extend the interface to let users set any data attribute as the
'primary' attribute.  Then, DAMON will split and merge regions in a way
that can best-present the 'primary' attributes.

DAMOS will also be extended, to specify targets based on not only the
data access pattern, but all user-registered data attributes.  From this
stage, we may be able to call DAMON a "Data Attributes Monitoring and
Operations eNgine".

[1] https://lore.kernel.org/20250106193401.109161-1-sj@kernel.org
[2] https://lore.kernel.org/20251208062943.68824-1-sj@kernel.org/
[3] https://lore.kernel.org/20260423004211.7037-1-akinobu.mita@gmail.com

Changes from RFC v3
- rfc v3: https://lore.kernel.org/20260516183712.81393-1-sj@kernel.org
- Wordsmithing documentation.
- Drop RFC tag.
- Rebase to mm-new.
Changes from RFC v2.2
- rfc v2.2: https://lore.kernel.org/20260515004433.128933-1-sj@kernel.org
- Rename damon_aggregated_v2 trace event to damon_region_aggregated.
- Address Sashiko issues.
  - Enclose arguments on damon_for_each_{probe,filter}[_safe]() macros.
  - Fix typos in comments and documents.
  - Update probe_hits for region split and merge.
  - Add more documentation for damon_operation->apply_probes() callback.
  - Reduce unnecessary folio_{get,put}() in damon_pa_apply_probes().
  - Define damon_sysfs_probe_attrs as static.
  - Link scheme tried region sysfs dir and increase the count only after
    all internal dir population success.
  - Commit damon_filter->memcg_id for newly added filters.
Changes from RFC v2.1
- rfc v2.1: https://lore.kernel.org/20260514140904.119781-1-sj@kernel.org
- Rebase to mm-stable (7.1-rc3) to avoid Sashiko patch apply failure.
Changes from RFC v2
- rfc v2: https://lore.kernel.org/20260512143645.113201-1-sj@kernel.org
- Optimize nr_probes calculation for probe_hits tracepoint.
- Use TRACE_EVENT_CONDITION() for probe_hits tracepoint.
- Rebase to latest mm-new.
Changes from RFC
- rfc: https://lore.kernel.org/all/20260426205222.93895-1-sj@kernel.org/
- Support memcg DAMON filter.
- Use per-probe probe_hits sysfs file.
- Use dynamic_array for probe_hits tracing.
- Fix filter matching field.
- Fix folio leaking in damon_pa_filter_pass().
- Move nr_regions of damon_aggregated_v2 tracepoint after end.
- Rename DAMON_TEST_TYPE_ANON to DAMON_FILTER_TYPE_ANON.

SeongJae Park (28):
  mm/damon/core: introduce struct damon_probe
  mm/damon/core: embed damon_probe objects in damon_ctx
  mm/damon/core: introduce damon_filter
  mm/damon/core: commit probes
  mm/damon/core: introduce damon_region->probe_hits
  mm/damon/core: introduce damon_ops->apply_probes
  mm/damon/core: do data attributes monitoring
  mm/damon/paddr: support data attributes monitoring
  mm/damon/sysfs: implement probes dir
  mm/damon/sysfs: implement probe dir
  mm/damon/sysfs: implement filters directory
  mm/damon/sysfs: implement filter dir
  mm/damon/sysfs: implement filter dir files
  mm/damon/sysfs: setup probes on DAMON core API parameters
  mm/damon/sysfs-schemes: implement tried_regions/<r>/probes/
  mm/damon/sysfs-schemes: implement probe dir
  mm/damon/sysfs-schemes: implement probe/hits file
  mm/damon: trace probe_hits
  selftests/damon/sysfs.sh: test probes dir
  Docs/mm/damon/design: document data attributes monitoring
  Docs/admin-guide/mm/damon/usage: document data attributes monitoring
  mm/damon/core: introduce DAMON_FILTER_TYPE_MEMCG
  mm/damon/paddr: support DAMON_FILTER_TYPE_MEMCG
  mm/damon/sysfs: add filters/<F>/path file
  mm/damon/sysfs-schemes: move memcg_path_to_id() to sysfs-common
  mm/damon/sysfs: setup damon_filter->memcg_id from path
  Docs/mm/damon/design: update for memcg damon filter
  Docs/admin-guide/mm/damon/usage: update for memcg damon filter

 Documentation/admin-guide/mm/damon/usage.rst |  46 +-
 Documentation/mm/damon/design.rst            |  39 ++
 include/linux/damon.h                        |  69 +++
 include/trace/events/damon.h                 |  38 ++
 mm/damon/core.c                              | 211 +++++++
 mm/damon/paddr.c                             |  76 +++
 mm/damon/sysfs-common.c                      |  41 ++
 mm/damon/sysfs-common.h                      |   2 +
 mm/damon/sysfs-schemes.c                     | 224 ++++++--
 mm/damon/sysfs.c                             | 557 +++++++++++++++++++
 tools/testing/selftests/damon/sysfs.sh       |  48 ++
 11 files changed, 1303 insertions(+), 48 deletions(-)


base-commit: b491d3b062a367a23fdc98def7fe3a8cf21bb3b0
-- 
2.47.3

^ permalink raw reply

* [PATCH 18/28] mm/damon: trace probe_hits
From: SeongJae Park @ 2026-05-18 23:41 UTC (permalink / raw)
  To: Andrew Morton
  Cc: SeongJae Park, Masami Hiramatsu, Mathieu Desnoyers,
	Steven Rostedt, damon, linux-kernel, linux-mm, linux-trace-kernel
In-Reply-To: <20260518234119.97569-1-sj@kernel.org>

Introduce a new tracepoint for exposing the per-region per-probe
positive sample count via tracefs.

Signed-off-by: SeongJae Park <sj@kernel.org>
---
 include/trace/events/damon.h | 38 ++++++++++++++++++++++++++++++++++++
 mm/damon/core.c              |  9 +++++++++
 2 files changed, 47 insertions(+)

diff --git a/include/trace/events/damon.h b/include/trace/events/damon.h
index 7e25f4469b81b..78388538acf44 100644
--- a/include/trace/events/damon.h
+++ b/include/trace/events/damon.h
@@ -130,6 +130,44 @@ TRACE_EVENT(damon_monitor_intervals_tune,
 	TP_printk("sample_us=%lu", __entry->sample_us)
 );
 
+TRACE_EVENT_CONDITION(damon_region_aggregated,
+
+	TP_PROTO(unsigned int target_id, struct damon_region *r,
+		unsigned int nr_regions, unsigned int nr_probes),
+
+	TP_ARGS(target_id, r, nr_regions, nr_probes),
+
+	TP_CONDITION(nr_probes > 0),
+
+	TP_STRUCT__entry(
+		__field(unsigned long, target_id)
+		__field(unsigned long, start)
+		__field(unsigned long, end)
+		__field(unsigned int, nr_regions)
+		__field(unsigned int, nr_accesses)
+		__field(unsigned int, age)
+		__dynamic_array(unsigned char, probe_hits, nr_probes)
+	),
+
+	TP_fast_assign(
+		__entry->target_id = target_id;
+		__entry->start = r->ar.start;
+		__entry->end = r->ar.end;
+		__entry->nr_regions = nr_regions;
+		__entry->nr_accesses = r->nr_accesses;
+		__entry->age = r->age;
+		memcpy(__get_dynamic_array(probe_hits), r->probe_hits,
+			sizeof(*r->probe_hits) * nr_probes);
+	),
+
+	TP_printk("target_id=%lu nr_regions=%u %lu-%lu: %u %u probe_hits=%s",
+			__entry->target_id, __entry->nr_regions,
+			__entry->start, __entry->end,
+			__entry->nr_accesses, __entry->age,
+			__print_hex(__get_dynamic_array(probe_hits),
+				__get_dynamic_array_len(probe_hits)))
+);
+
 TRACE_EVENT(damon_aggregated,
 
 	TP_PROTO(unsigned int target_id, struct damon_region *r,
diff --git a/mm/damon/core.c b/mm/damon/core.c
index 433da8781e255..5ba7ad4df4351 100644
--- a/mm/damon/core.c
+++ b/mm/damon/core.c
@@ -1908,6 +1908,13 @@ static void kdamond_reset_aggregated(struct damon_ctx *c)
 {
 	struct damon_target *t;
 	unsigned int ti = 0;	/* target's index */
+	unsigned int nr_probes = 0;
+	struct damon_probe *probe;
+
+	if (trace_damon_region_aggregated_enabled()) {
+		damon_for_each_probe(probe, c)
+			nr_probes++;
+	}
 
 	damon_for_each_target(t, c) {
 		struct damon_region *r;
@@ -1916,6 +1923,8 @@ static void kdamond_reset_aggregated(struct damon_ctx *c)
 			int i;
 
 			trace_damon_aggregated(ti, r, damon_nr_regions(t));
+			trace_damon_region_aggregated(ti, r,
+					damon_nr_regions(t), nr_probes);
 			damon_warn_fix_nr_accesses_corruption(r);
 			r->last_nr_accesses = r->nr_accesses;
 			r->nr_accesses = 0;
-- 
2.47.3

^ permalink raw reply related

* Re: [PATCH 00/28] mm/damon: introduce data attributes monitoring
From: SeongJae Park @ 2026-05-18 23:53 UTC (permalink / raw)
  To: SeongJae Park
  Cc: Andrew Morton, Liam R. Howlett, David Hildenbrand,
	Jonathan Corbet, Lorenzo Stoakes, Masami Hiramatsu,
	Mathieu Desnoyers, Michal Hocko, Mike Rapoport, Shuah Khan,
	Shuah Khan, Steven Rostedt, Suren Baghdasaryan, Vlastimil Babka,
	damon, linux-doc, linux-kernel, linux-kselftest, linux-mm,
	linux-trace-kernel
In-Reply-To: <20260518234119.97569-1-sj@kernel.org>

On Mon, 18 May 2026 16:40:48 -0700 SeongJae Park <sj@kernel.org> wrote:

> TL; DR
> ======
> 
> Extend DAMON for monitoring general data attributes other than accesses.
> The short term motivation is lightweight page type (e.g., belonging
> cgroup) aware monitoring.  In the long term, this will help extend DAMON
> to multiple access event capture primitives (e.g., page faults and
> PMU) and eventually pivot DAMON to a "Data Attributes Monitoring and
> Operations eNgine".
[...]
> Changes from RFC v3
> - rfc v3: https://lore.kernel.org/20260516183712.81393-1-sj@kernel.org
> - Wordsmithing documentation.
> - Drop RFC tag.
> - Rebase to mm-new.

Sashiko failed [1] to review this series because it still has an old
version of mm-new, while this series is based on mm-new.  The same issue
occurred with the RFC versions, so I based those on mm-stable and got
Sashiko reviews.  On the last version (RFC v3), I confirmed [2] that Sashiko
found no more blockers.  So I believe this is good to go for more testing in
mm-new.  I will of course be happy to get different inputs.

[1] https://sashiko.dev/#/patchset/20260518234119.97569-1-sj%40kernel.org
[2] https://lore.kernel.org/20260516220317.4300-1-sj@kernel.org


Thanks,
SJ

[...]

^ permalink raw reply

* Re: [PATCH 00/28] mm/damon: introduce data attributes monitoring
From: Andrew Morton @ 2026-05-19  0:54 UTC (permalink / raw)
  To: SeongJae Park
  Cc: Liam R. Howlett, David Hildenbrand, Jonathan Corbet,
	Lorenzo Stoakes, Masami Hiramatsu, Mathieu Desnoyers,
	Michal Hocko, Mike Rapoport, Shuah Khan, Shuah Khan,
	Steven Rostedt, Suren Baghdasaryan, Vlastimil Babka, damon,
	linux-doc, linux-kernel, linux-kselftest, linux-mm,
	linux-trace-kernel
In-Reply-To: <20260518234119.97569-1-sj@kernel.org>

On Mon, 18 May 2026 16:40:48 -0700 SeongJae Park <sj@kernel.org> wrote:

> TL; DR
> ======
> 
> Extend DAMON for monitoring general data attributes other than accesses.
> The short term motivation is lightweight page type (e.g., belonging
> cgroup) aware monitoring.  In the long term, this will help extend DAMON
> to multiple access event capture primitives (e.g., page faults and
> PMU) and eventually pivot DAMON to a "Data Attributes Monitoring and
> Operations eNgine".

Added, thanks.

> Plan for Dropping RFC tag
> =========================
> 
> Making changes for feedback from myself, humans and Sashiko should be
> the major remaining work.
> 
> I'm currently hoping to drop the RFC tag by 7.2-rc1.
> 

I removed this section.



^ permalink raw reply

* [PATCH] tools/bootconfig: Fix buf leaks in apply_xbc
From: lihongtao @ 2026-05-19  3:12 UTC (permalink / raw)
  To: Masami Hiramatsu; +Cc: linux-kernel, linux-trace-kernel, lihongtao

If calloc() for data fails, free buf before returning.

Fixes: 950313ebf79c ("tools: bootconfig: Add bootconfig command")
Signed-off-by: lihongtao <lihongtao@kylinos.cn>
---
 tools/bootconfig/main.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/tools/bootconfig/main.c b/tools/bootconfig/main.c
index 643f707b8f1d..ddabde20585f 100644
--- a/tools/bootconfig/main.c
+++ b/tools/bootconfig/main.c
@@ -390,8 +390,10 @@ static int apply_xbc(const char *path, const char *xbc_path)
 
 	/* Backup the bootconfig data */
 	data = calloc(size + BOOTCONFIG_ALIGN + BOOTCONFIG_FOOTER_SIZE, 1);
-	if (!data)
+	if (!data) {
+		free(buf);
 		return -ENOMEM;
+	}
 	memcpy(data, buf, size);
 
 	/* Check the data format */
-- 
2.25.1


^ permalink raw reply related

* [PATCH] tools/bootconfig: Fix null pointer when free buf
From: lihongtao @ 2026-05-19  3:14 UTC (permalink / raw)
  To: Masami Hiramatsu; +Cc: linux-kernel, linux-trace-kernel, lihongtao

In show_xbc() and delete_xbc(), if load_xbc_from_initrd() fails,
buf may be NULL, so check it before calling free().

Fixes: 950313ebf79c ("tools: bootconfig: Add bootconfig command")
Signed-off-by: lihongtao <lihongtao@kylinos.cn>
---
 tools/bootconfig/main.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/tools/bootconfig/main.c b/tools/bootconfig/main.c
index ddabde20585f..417d07a46f92 100644
--- a/tools/bootconfig/main.c
+++ b/tools/bootconfig/main.c
@@ -328,7 +328,8 @@ static int show_xbc(const char *path, bool list)
 		xbc_show_compact_tree();
 	ret = 0;
 out:
-	free(buf);
+	if (buf)
+		free(buf);
 
 	return ret;
 }
@@ -360,7 +361,8 @@ static int delete_xbc(const char *path)
 	} /* Ignore if there is no boot config in initrd */
 
 	close(fd);
-	free(buf);
+	if (buf)
+		free(buf);
 
 	return ret;
 }
-- 
2.25.1


^ permalink raw reply related

* [PATCH v4] tracing/probes: Allow use of BTF names to dereference pointers
From: Steven Rostedt @ 2026-05-19  3:23 UTC (permalink / raw)
  To: LKML, Linux Trace Kernel, bpf
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Mark Rutland, Peter Zijlstra,
	Namhyung Kim, Takaya Saeki, Douglas Raillard, Tom Zanussi,
	Andrew Morton, Thomas Gleixner, Ian Rogers, Jiri Olsa

From: Steven Rostedt <rostedt@goodmis.org>

Add syntax to the FETCHARGS parsing of probes to be able to typecast a
value to a pointer to a structure.

Currently, a dereference offset must be a number, so the user has to work
out manually the offset of the structure member that they want to
dereference, unless the value is a function parameter for which BTF already
knows what structure the argument points to.

But event probes, or generic kprobes that record a register that
happens to be a pointer to a structure, cannot dereference these
values with BTF naming and must use numerical offsets.

For example, to find out what device a sk_buff is pointing to in the
net_dev_xmit trace event, one must first use gdb to find the offsets of the
members of the structures:

 (gdb) p &((struct sk_buff *)0)->dev
 $1 = (struct net_device **) 0x10
 (gdb) p &((struct net_device *)0)->name
 $2 = (char (*)[16]) 0x118

And then use the raw numbers to dereference:

  # echo 'e:xmit net.net_dev_xmit +0x118(+0x10($skbaddr)):string' >> dynamic_events
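
The gdb expressions above compute the same values that C's offsetof()
yields, and the nested +0x118(+0x10($skbaddr)) fetch amounts to one pointer
load followed by a second offset. A minimal userspace sketch of that idea
follows. The mock_* structures and fetch_name() are hypothetical names used
only for illustration; they do not match the real kernel layouts or the
probe fetch code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins; the real sk_buff and net_device are far larger. */
struct mock_net_device {
	long pad[3];
	char name[16];
};

struct mock_sk_buff {
	void *next;
	void *prev;
	struct mock_net_device *dev;
};

/*
 * Emulate +NAME_OFFS(+DEV_OFFS(addr)):string using raw offsets only:
 * load the pointer stored at addr + dev_offs, then return the address
 * of the string located name_offs bytes past that pointer.
 */
static const char *fetch_name(const void *addr, size_t dev_offs,
			      size_t name_offs)
{
	const char *dev;

	assert(addr != NULL);
	memcpy(&dev, (const char *)addr + dev_offs, sizeof(dev));
	return dev + name_offs;
}
```

Passing offsetof(struct mock_sk_buff, dev) and
offsetof(struct mock_net_device, name) to fetch_name() plays the role of the
0x10 and 0x118 constants that had to be looked up by hand.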

If BTF is available in the kernel, $skbaddr can instead be typecast to a
pointer to sk_buff and dereferenced with the normal field syntax.

  # echo 'e:xmit net.net_dev_xmit (sk_buff*)$skbaddr->dev->name:string' >> dynamic_events
  # echo 1 > events/eprobes/xmit/enable
  # cat trace
[..]
    sshd-session-1022    [000] b..2.   860.249343: xmit: (net.net_dev_xmit) arg1="enp7s0"
    sshd-session-1022    [000] b..2.   860.250061: xmit: (net.net_dev_xmit) arg1="enp7s0"
    sshd-session-1022    [000] b..2.   860.250142: xmit: (net.net_dev_xmit) arg1="enp7s0"
    sshd-session-1022    [000] b..2.   860.263553: xmit: (net.net_dev_xmit) arg1="enp7s0"
    sshd-session-1022    [000] b..2.   860.283820: xmit: (net.net_dev_xmit) arg1="enp7s0"
    sshd-session-1022    [000] b..2.   860.302716: xmit: (net.net_dev_xmit) arg1="enp7s0"
    sshd-session-1022    [000] b..2.   860.322905: xmit: (net.net_dev_xmit) arg1="enp7s0"
    sshd-session-1022    [000] b..2.   860.342828: xmit: (net.net_dev_xmit) arg1="enp7s0"
    sshd-session-1022    [000] b..2.   860.362268: xmit: (net.net_dev_xmit) arg1="enp7s0"
    sshd-session-1022    [000] b..2.   860.382335: xmit: (net.net_dev_xmit) arg1="enp7s0"
    sshd-session-1022    [000] b..2.   860.400856: xmit: (net.net_dev_xmit) arg1="enp7s0"
    sshd-session-1022    [000] b..2.   860.419893: xmit: (net.net_dev_xmit) arg1="enp7s0"

The syntax is simply: (STRUCT*)$VAR->FIELD[->FIELD..]
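
To make the shape of that syntax concrete, here is a minimal userspace
sketch that splits an argument such as (sk_buff*)$skbaddr->dev->name into
its structure name, variable and field chain. It is not the kernel parser
from this patch: struct cast_expr, copy_token() and parse_cast_expr() are
hypothetical names, and the sketch assumes the form used in the example
above (a "*)" closing the cast, a '$' variable, and at least one ->FIELD).

```c
#include <assert.h>
#include <string.h>

#define MAX_FIELDS 8

/* Hypothetical result type for this sketch; not a kernel structure. */
struct cast_expr {
	char sname[64];			/* STRUCT inside the parentheses */
	char var[64];			/* VAR, without the leading '$' */
	char fields[MAX_FIELDS][64];	/* each ->FIELD, in order */
	int nr_fields;
};

/* Copy len bytes of src into a 64-byte dst; fail if empty or too long. */
static int copy_token(char *dst, const char *src, size_t len)
{
	if (len == 0 || len >= 64)
		return -1;
	memcpy(dst, src, len);
	dst[len] = '\0';
	return 0;
}

/* Split "(STRUCT*)$VAR->FIELD[->FIELD..]"; 0 on success, -1 on error. */
static int parse_cast_expr(const char *arg, struct cast_expr *out)
{
	const char *close, *p, *next;

	assert(arg && out);
	memset(out, 0, sizeof(*out));

	if (arg[0] != '(')
		return -1;
	close = strchr(arg, ')');
	/* The cast must end in "*)" and be followed by a '$' variable. */
	if (!close || close - arg < 3 || close[-1] != '*' || close[1] != '$')
		return -1;
	if (copy_token(out->sname, arg + 1, (size_t)(close - arg - 2)))
		return -1;

	p = close + 2;
	next = strstr(p, "->");
	if (!next)	/* this sketch requires at least one ->FIELD */
		return -1;
	if (copy_token(out->var, p, (size_t)(next - p)))
		return -1;

	while (next) {
		p = next + 2;
		next = strstr(p, "->");
		if (out->nr_fields == MAX_FIELDS)
			return -1;
		if (copy_token(out->fields[out->nr_fields], p,
			       next ? (size_t)(next - p) : strlen(p)))
			return -1;
		out->nr_fields++;
	}
	return 0;
}
```

For the example argument, the sketch yields the structure name "sk_buff",
the variable "skbaddr" and the fields "dev" and "name"; a cast without the
'*' or a variable without the '$' is rejected.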

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
Changes since v3: https://patch.msgid.link/20260518095832.52659a3a@gandalf.local.home

 *** COMPLETE REWRITE FROM V3 ***

- Rewrote it to use typecasting instead of simply replacing BTF names with
  offsets.

 Documentation/trace/kprobetrace.rst |   3 +
 kernel/trace/trace_probe.c          | 110 ++++++++++++++++++++++++----
 kernel/trace/trace_probe.h          |   3 +
 3 files changed, 100 insertions(+), 16 deletions(-)

diff --git a/Documentation/trace/kprobetrace.rst b/Documentation/trace/kprobetrace.rst
index 3b6791c17e9b..450ac646fe4c 100644
--- a/Documentation/trace/kprobetrace.rst
+++ b/Documentation/trace/kprobetrace.rst
@@ -54,6 +54,9 @@ Synopsis of kprobe_events
   $retval	: Fetch return value.(\*2)
   $comm		: Fetch current task comm.
   +|-[u]OFFS(FETCHARG) : Fetch memory at FETCHARG +|- OFFS address.(\*3)(\*4)
+  (STRUCT*)FETCHARG->FIELD[->FIELD] : If BTF is supported, typecast FETCHARG to
+                  a pointer to STRUCT and then dereference the pointer defined by
+                  ->FIELD.
   \IMM		: Store an immediate value to the argument.
   NAME=FETCHARG : Set NAME as the argument name of FETCHARG.
   FETCHARG:TYPE : Set TYPE as the type of FETCHARG. Currently, basic types
diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index e0d3a0da26af..b0829eb1cb52 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -464,6 +464,26 @@ static const char *fetch_type_from_btf_type(struct btf *btf,
 	return NULL;
 }
 
+static int query_btf_struct(const char *sname, struct traceprobe_parse_context *ctx)
+{
+	int id;
+
+	if (!ctx->btf) {
+		struct btf *btf;
+		id = bpf_find_btf_id(sname, BTF_KIND_STRUCT, &btf);
+		if (id < 0)
+			return -EINVAL;
+		ctx->btf = btf;
+	} else {
+		id = btf_find_by_name_kind(ctx->btf, sname, BTF_KIND_STRUCT);
+		if (id < 0)
+			return -EINVAL;
+	}
+
+	ctx->last_struct = btf_type_by_id(ctx->btf, id);
+	return 0;
+}
+
 static int query_btf_context(struct traceprobe_parse_context *ctx)
 {
 	const struct btf_param *param;
@@ -471,12 +491,12 @@ static int query_btf_context(struct traceprobe_parse_context *ctx)
 	struct btf *btf;
 	s32 nr;
 
-	if (ctx->btf)
-		return 0;
-
 	if (!ctx->funcname)
 		return -EINVAL;
 
+	if (ctx->btf)
+		return 0;
+
 	type = btf_find_func_proto(ctx->funcname, &btf);
 	if (!type)
 		return -ENOENT;
@@ -514,6 +534,7 @@ static void clear_btf_context(struct traceprobe_parse_context *ctx)
 		ctx->proto = NULL;
 		ctx->params = NULL;
 		ctx->nr_params = 0;
+		ctx->last_struct = NULL;
 	}
 }
 
@@ -554,22 +575,28 @@ static int parse_btf_field(char *fieldname, const struct btf_type *type,
 	struct fetch_insn *code = *pcode;
 	const struct btf_member *field;
 	u32 bitoffs, anon_offs;
+	bool is_struct = ctx->flags & TPARG_FL_STRUCT;
 	char *next;
 	int is_ptr;
 	s32 tid;
 
 	do {
-		/* Outer loop for solving arrow operator ('->') */
-		if (BTF_INFO_KIND(type->info) != BTF_KIND_PTR) {
-			trace_probe_log_err(ctx->offset, NO_PTR_STRCT);
-			return -EINVAL;
-		}
-		/* Convert a struct pointer type to a struct type */
-		type = btf_type_skip_modifiers(ctx->btf, type->type, &tid);
-		if (!type) {
-			trace_probe_log_err(ctx->offset, BAD_BTF_TID);
-			return -EINVAL;
+		if (!is_struct) {
+			/* Outer loop for solving arrow operator ('->') */
+			if (BTF_INFO_KIND(type->info) != BTF_KIND_PTR) {
+				trace_probe_log_err(ctx->offset, NO_PTR_STRCT);
+				return -EINVAL;
+			}
+
+			/* Convert a struct pointer type to a struct type */
+			type = btf_type_skip_modifiers(ctx->btf, type->type, &tid);
+			if (!type) {
+				trace_probe_log_err(ctx->offset, BAD_BTF_TID);
+				return -EINVAL;
+			}
 		}
+		/* Only the first type can skip being a pointer */
+		is_struct = false;
 
 		bitoffs = 0;
 		do {
@@ -635,12 +662,12 @@ static int parse_btf_arg(char *varname,
 {
 	struct fetch_insn *code = *pcode;
 	const struct btf_param *params;
-	const struct btf_type *type;
+	const struct btf_type *type = NULL;
 	char *field = NULL;
 	int i, is_ptr, ret;
 	u32 tid;
 
-	if (WARN_ON_ONCE(!ctx->funcname))
+	if (WARN_ON_ONCE(!ctx->funcname && !(ctx->flags & TPARG_FL_STRUCT)))
 		return -EINVAL;
 
 	is_ptr = split_next_field(varname, &field, ctx);
@@ -704,11 +731,18 @@ static int parse_btf_arg(char *varname,
 			goto found;
 		}
 	}
+
+	if (ctx->flags & TPARG_FL_STRUCT) {
+		type = ctx->last_struct;
+		goto found;
+	}
+
 	trace_probe_log_err(ctx->offset, NO_BTFARG);
 	return -ENOENT;
 
 found:
-	type = btf_type_skip_modifiers(ctx->btf, tid, &tid);
+	if (!type)
+		type = btf_type_skip_modifiers(ctx->btf, tid, &tid);
 	if (!type) {
 		trace_probe_log_err(ctx->offset, BAD_BTF_TID);
 		return -EINVAL;
@@ -952,6 +986,12 @@ static int parse_probe_vars(char *orig_arg, const struct fetch_type *t,
 	int ret = 0;
 	int len;
 
+	if (ctx->flags & TPARG_FL_STRUCT) {
+		ret = parse_btf_arg(orig_arg, pcode, end, ctx);
+		if (ret < 0)
+			return ret;
+	}
+
 	if (ctx->flags & TPARG_FL_TEVENT) {
 		if (code->data)
 			return -EFAULT;
@@ -1231,6 +1271,43 @@ parse_probe_arg(char *arg, const struct fetch_type *type,
 				code->op = FETCH_OP_IMM;
 		}
 		break;
+	case '(':
+		tmp = strrchr(arg, ')');
+		if (!tmp) {
+			trace_probe_log_err(ctx->offset + strlen(arg),
+					    DEREF_OPEN_BRACE);
+			return -EINVAL;
+		}
+
+		tmp--;
+		if (*tmp != '*') {
+			trace_probe_log_err(ctx->offset + (tmp - arg),
+					    NO_PTR_STRCT);
+			return -EINVAL;
+		}
+		*tmp = '\0';
+		ret = query_btf_struct(arg + 1, ctx);
+		*tmp = '*';
+
+		if (ret < 0) {
+			trace_probe_log_err(ctx->offset + 1, NO_PTR_STRCT);
+			return -EINVAL;
+		}
+
+		ctx->flags |= TPARG_FL_STRUCT;
+		tmp += 2;
+
+		if (*tmp != '$') {
+			trace_probe_log_err(ctx->offset + (tmp - arg),
+					    BAD_VAR);
+			return -EINVAL;
+		}
+
+		ctx->offset += tmp - arg;
+		ret = parse_probe_vars(tmp, type, pcode, end, ctx);
+		ctx->flags &= ~TPARG_FL_STRUCT;
+		ctx->last_struct = NULL;
+		break;
 	default:
 		if (isalpha(arg[0]) || arg[0] == '_') {	/* BTF variable */
 			if (!tparg_is_function_entry(ctx->flags) &&
@@ -1504,6 +1581,7 @@ static int traceprobe_parse_probe_arg_body(const char *argv, ssize_t *size,
 	code[FETCH_INSN_MAX - 1].op = FETCH_OP_END;
 
 	ctx->last_type = NULL;
+	ctx->last_struct = NULL;
 	ret = parse_probe_arg(arg, parg->type, &code, &code[FETCH_INSN_MAX - 1],
 			      ctx);
 	if (ret < 0)
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
index 262d8707a3df..88ab9f6da591 100644
--- a/kernel/trace/trace_probe.h
+++ b/kernel/trace/trace_probe.h
@@ -394,6 +394,7 @@ static inline int traceprobe_get_entry_data_size(struct trace_probe *tp)
  * TPARG_FL_KERNEL and TPARG_FL_USER are also mutually exclusive.
  * TPARG_FL_FPROBE and TPARG_FL_TPOINT are optional but it should be with
  * TPARG_FL_KERNEL.
+ * TPARG_FL_STRUCT is set if an argument was typecast to a structure.
  */
 #define TPARG_FL_RETURN BIT(0)
 #define TPARG_FL_KERNEL BIT(1)
@@ -402,6 +403,7 @@ static inline int traceprobe_get_entry_data_size(struct trace_probe *tp)
 #define TPARG_FL_USER   BIT(4)
 #define TPARG_FL_FPROBE BIT(5)
 #define TPARG_FL_TPOINT BIT(6)
+#define TPARG_FL_STRUCT BIT(7)
 #define TPARG_FL_LOC_MASK	GENMASK(4, 0)
 
 static inline bool tparg_is_function_entry(unsigned int flags)
@@ -423,6 +425,7 @@ struct traceprobe_parse_context {
 	s32 nr_params;			/* The number of the parameters */
 	struct btf *btf;		/* The BTF to be used */
 	const struct btf_type *last_type;	/* Saved type */
+	const struct btf_type *last_struct;	/* Saved structure */
 	u32 last_bitoffs;		/* Saved bitoffs */
 	u32 last_bitsize;		/* Saved bitsize */
 	struct trace_probe *tp;
-- 
2.53.0


^ permalink raw reply related

