Linux Trace Kernel
 help / color / mirror / Atom feed
* [PATCHv2 00/11] uprobes/x86: Fix red zone issue for optimized uprobes
From: Jiri Olsa @ 2026-05-18 10:59 UTC (permalink / raw)
  To: Oleg Nesterov, Peter Zijlstra, Ingo Molnar, Masami Hiramatsu,
	Andrii Nakryiko
  Cc: bpf, linux-trace-kernel

hi,
Andrii reported an issue with optimized uprobes [1] that can clobber
redzone area with call instruction storing return address on stack
where user code may keep temporary data without adjusting rsp.

Fixing this by moving the optimized uprobes on top of 10-bytes nop
instruction, so we can squeeze another instruction to escape the
redzone area before doing the call.

Note we need upstream update first for patch 3 (github.com/libbpf/usdt),
if we decide to take this change.

thanks,
jirka

v1: https://lore.kernel.org/bpf/20260514135342.22130-1-jolsa@kernel.org/

v2 changes:
- several selftest fixes [sashiko]
- consolidate is_lea_insn and is_call_insn insto single check [Jakub Sitnicki]
- use proper mm_struct object in __in_uprobe_trampoline check [sashiko]
- allow to copy uprobe trampolines vma objects on fork [sashiko]
- change uprobe syscall detection error from -ENXIO to -EPROTO [Andrii]
- added fork/clone tests
- I kept the selftest changes and nop5->nop10 changes in separate
  commits for easier review, we can squash them later if we want to keep
  bisect working properly


[1] https://lore.kernel.org/bpf/20260509003146.976844-1-andrii@kernel.org/
---
Andrii Nakryiko (1):
      selftests/bpf: Add tests for uprobe nop10 red zone clobbering

Jiri Olsa (10):
      uprobes/x86: Use proper mm_struct in __in_uprobe_trampoline
      uprobes/x86: Allow to copy uprobe trampolines on fork
      uprobes/x86: Move optimized uprobe from nop5 to nop10
      libbpf: Change has_nop_combo to work on top of nop10
      libbpf: Detect uprobe syscall with new error
      selftests/bpf: Emit nop,nop10 instructions combo for x86_64 arch
      selftests/bpf: Change uprobe syscall tests to use nop10
      selftests/bpf: Change uprobe/usdt trigger bench code to use nop10
      selftests/bpf: Add reattach tests for uprobe syscall
      selftests/bpf: Add tests for forked/cloned optimized uprobes

 arch/x86/kernel/uprobes.c                               | 144 ++++++++++++++++++++++------------
 tools/lib/bpf/features.c                                |   4 +-
 tools/lib/bpf/usdt.c                                    |  16 ++--
 tools/testing/selftests/bpf/bench.c                     |  20 ++---
 tools/testing/selftests/bpf/benchs/bench_trigger.c      |  38 ++++-----
 tools/testing/selftests/bpf/benchs/run_bench_uprobes.sh |   2 +-
 tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c | 307 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++------
 tools/testing/selftests/bpf/prog_tests/usdt.c           |  74 +++++++++++++++---
 tools/testing/selftests/bpf/progs/test_usdt.c           |  25 ++++++
 tools/testing/selftests/bpf/usdt.h                      |   2 +-
 tools/testing/selftests/bpf/usdt_2.c                    |  15 +++-
 11 files changed, 524 insertions(+), 123 deletions(-)

^ permalink raw reply

* Re: [PATCH v2] tracing/probes: Allow use of BTF names to dereference pointers
From: Leon Hwang @ 2026-05-18 10:45 UTC (permalink / raw)
  To: Steven Rostedt, LKML, Linux trace kernel, bpf
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Mark Rutland, Peter Zijlstra,
	Namhyung Kim, Takaya Saeki, Douglas Raillard, Tom Zanussi,
	Andrew Morton, Thomas Gleixner, Ian Rogers, Jiri Olsa
In-Reply-To: <20260516173310.1dbad146@fedora>

On 17/5/26 05:33, Steven Rostedt wrote:
> From: Steven Rostedt <rostedt@goodmis.org>
> 
> Add syntax to the FETCHARGS parsing of probes to allow the use of
> structure and member names to get the offsets to dereference pointers.
> 
> Currently, a dereference must be a number, where the user has to figure
> out manually the offset of a member of a structure that they want to
> reference. For example, to get the size of a kmem_cache that was passed to
> the function kmem_cache_alloc_noprof, one would need to do:
> 
>  # cd /sys/kernel/tracing
>  # echo 'f:cache kmem_cache_alloc_noprof size=+0x18($arg1):u32' >> dynamic_events
> 
> This requires knowing that the offset of size is 0x18, which can be found
> with gdb:
> 
>   (gdb) p &((struct kmem_cache *)0)->size
>   $1 = (unsigned int *) 0x18
> 
> If BTF is in the kernel, it can be used to find this with names, where the
> user doesn't need to find the actual offset:
> 
>  # echo 'f:cache kmem_cache_alloc_noprof size=+kmem_cache.size($arg1):u32' >> dynamic_events
> 
> Instead of the "+0x18", it would have "+kmem_cache.size" where the format is:
> 
>   +STRUCT.MEMBER[.MEMBER[..]]
> 
> The delimiter is '.' and the first item is the structure name. Then the
> member of the structure to get the offset of. If that member is an
> embedded structure, another '.MEMBER' may be added to get the offset of
> its members with respect to the original value.
> 
>   "+kmem_cache.size($arg1)" is equivalent to:
> 
>   (*(struct kmem_cache *)$arg1).size
> 
> Anonymous structures are also handled:
> 
>   # echo 'e:xmit net.net_dev_xmit +net_device.name(+sk_buff.dev($skbaddr)):string' >> dynamic_events
> 
> Where "+net_device.name(+sk_buff.dev($skbaddr))" is equivalent to:
> 
>   (*(struct net_device *)((*(struct sk_buff *)($skbaddr)).dev)->name)
> 
> Note that "dev" of struct sk_buff is inside an anonymous structure:
> 
> struct sk_buff {
> 	union {
> 		struct {
> 			/* These two members must be first to match sk_buff_head. */
> 			struct sk_buff		*next;
> 			struct sk_buff		*prev;
> 
> 			union {
> 				struct net_device	*dev;
> 				[..]
> 			};
> 		};
> 		[..]
> 	};
> 
> This will allow up to three deep of anonymous structures before it will
> fail to find a member.
> 
> The above produces:
> 
>     sshd-session-1080    [000] b..5.  1526.337161: xmit: (net.net_dev_xmit) arg1="enp7s0"
> 
> And nested structures can be found by adding more members to the arg:
> 
>   # echo 'f:read filemap_readahead.isra.0 file=+0(+dentry.d_name.name(+file.f_path.dentry($arg2))):string' >> dynamic_events
> 
> The above is equivalent to:
> 
>   *((*(struct dentry *)(*(struct file *)$arg2).f_path.dentry)->d_name.name)
> 
> And produces:
> 
>        trace-cmd-1381    [002] ...1.  2082.676268: read: (filemap_readahead.isra.0+0x0/0x150) file="trace.dat"
> 
Hi Steve,

Great to see that BTF is going to be nested into trace.

I'm glad to share my BPF tool, bpfsnoop [1], that utilizes the similar
way to inspect argument's data.

Read device name:
bpfsnoop -t net_dev_xmit --output-arg 'str(skb->dev->name)'
--limit-events 20
- net_dev_xmit[tp] args=((struct sk_buff *)skb=0xffff88818821d4e8,
(int)rc=0, (struct net_device *)dev=0xffff88984ba64000, (unsigned
int)skb_len=0x1f2/498) cpu=2 process=(0:swapper/2)
timestamp=18:06:17.309492697
Arg attrs: (array(char[16]))'str(skb->dev->name)'="eth0"

Read dentry name:
bpfsnoop -k 'vfs_read' --output-arg
'str((file->f_path.dentry)->d_name.name)' --limit-events 20
← vfs_read args=((struct file *)file=0xffff888175e08400, (char
*)buf=0x55c7a1168400(0x0/0), (size_t)count=0x10000/65536, (loff_t
*)pos=0xffffc9000f707bb0(0)) retval=(long int)510 cpu=3
process=(339834:sudo) timestamp=18:24:16.22021166
Arg attrs: (unsigned char *)'str((file->f_path.dentry)->d_name.name)'="ptmx"

In bpfsnoop, it provides a friendly way to inspect argument's data using
C expressions. Under the hood, it compiles the C expressions, specified
by --filter-arg/--output-arg, into BPF byte code by parsing the
struct/union member access with BTF. (I'm too lazy to write documents to
explain its internal details. But you can study it with AI assistance.)

Insanely, after developing such feature for bpfsnoop, I wondered whether
to embed a light-weight C compiler into trace tool in order to compile C
expression into BPF byte code, and then load the BPF program to
filter/output argument. Finally, users are able to filter/output
arguments using C expressions. It seemed too crazy for me to post such
idea to trace mailing list at that time, as I wasn't familiar with trace
infrastructure.

[1] https://github.com/bpfsnoop/bpfsnoop/

Thanks,
Leon


^ permalink raw reply

* Re: [PATCH 1/7] uprobes/x86: Move optimized uprobe from nop5 to nop10
From: Peter Zijlstra @ 2026-05-18 10:43 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Oleg Nesterov, Ingo Molnar, Masami Hiramatsu, Andrii Nakryiko,
	bpf, linux-trace-kernel, x86, linux-kernel
In-Reply-To: <20260514135342.22130-2-jolsa@kernel.org>


You seem to have forgotten to Cc LKML and x86 :-(

On Thu, May 14, 2026 at 03:53:36PM +0200, Jiri Olsa wrote:

> @@ -1017,17 +1030,32 @@ static int int3_update(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
>  static int swbp_optimize(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
>  			 unsigned long vaddr, unsigned long tramp)
>  {
> -	u8 call[5];
> +	u8 insn[OPT_INSN_SIZE], *call = &insn[LEA_INSN_SIZE];
>  
> -	__text_gen_insn(call, CALL_INSN_OPCODE, (const void *) vaddr,
> +	/*
> +	 * We have nop10 instruction (with first byte overwritten to int3),
> +	 * changing it to:
> +	 *   lea -0x80(%rsp), %rsp
> +	 *   call tramp
> +	 */
> +	memcpy(insn, lea_rsp, LEA_INSN_SIZE);
> +	__text_gen_insn(call, CALL_INSN_OPCODE,
> +			(const void *) (vaddr + LEA_INSN_SIZE),
>  			(const void *) tramp, CALL_INSN_SIZE);
> -	return int3_update(auprobe, vma, vaddr, call, true /* optimize */);
> +	return int3_update(auprobe, vma, vaddr, insn, OPT_INSN_SIZE, true /* optimize */);
>  }
>  
>  static int swbp_unoptimize(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
>  			   unsigned long vaddr)
>  {
> -	return int3_update(auprobe, vma, vaddr, auprobe->insn, false /* optimize */);
> +	/*
> +	 * We have optimized nop10 (lea, call), changing it to 'jmp rel8' to
> +	 * end of the 10-byte slot instead of restoring the original nop10,
> +	 * because we could have thread already inside lea instruction.

Inaccurate, RIP could be on CALL, not inside LEA. Writing NOP10 would
make it inside NOP10 though, and that would cause havoc IF you use the
normal NOP10.

Thing is, the encoding of NOP{8,9,10} would actually allow you to
preserve the CALL instruction :-)

That is, observe:

       PF1   PF2   ESC   NOPL  MOD   SIB   DISP32

NOP10: 0x66, 0x2e, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00 -- cs nopw 0x00000000(%rax,%rax,1)
NOP10: 0x66, 0x2e, 0x0f, 0x1f, 0x84, 0xe8, 0x78, 0x56, 0x34, 0x12 -- cs nopw 0x12345678(%rax,%rbp,8)

Specifically the CALL opcode sits in the SIB byte and decodes like:

  e8 := 11 101 000

  scale = 11  (2^3 = 8)
  index = 101 BP
  base  = 000 AX

And the displacement is just that, a displacement.

So you *could* in fact, write back _A_ NOP10, just not the standard
NOP10.

> +	 */
> +	u8 jmp[OPT_INSN_SIZE] = { JMP8_INSN_OPCODE, OPT_JMP8_OFFSET };
> +
> +	return int3_update(auprobe, vma, vaddr, jmp, JMP8_INSN_SIZE, false /* optimize */);
>  }

Changelog wants significant update to explain this scheme.

So we have:

  NOP10 -+-> LEA -0x80(%rsp), %rsp, CALL foo -> JMP.d8 +8
         |                                          |
         `------------------------------------------'

And you want to belabour the point of how you ensure re-writing the CALL
instruction isn't a problem (because I'm not convinced).

Note that the above results in:

initial:
0: 0x66, 0x2e, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00 -- cs nopw 0x00000000(%rax,%rax,1)

optimize-int3:
1: 0xcc, 0x2e, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00 -- int3
optimize-tail:
2: 0xcc, 0x8d, 0x64, 0x24, 0x80, 0xe8, 0x12, 0x34, 0x56, 0x78 -- int3; call 0x78563412
optimize-finish:
3: 0x48, 0x8d, 0x64, 0x24, 0x80, 0xe8, 0x12, 0x34, 0x56, 0x78 -- lea -0x80(%rsp),%rsp; call 0x78563412

unoptimize-int3:
4: 0xcc, 0x8d, 0x64, 0x24, 0x80, 0xe8, 0x12, 0x34, 0x56, 0x78 -- int3; call 0x78563412
unoptimize-tail:
5: 0xcc, 0x08, 0x64, 0x24, 0x80, 0xe8, 0x12, 0x34, 0x56, 0x78 -- int3; call 0x78563412
unoptimize-finish:
6: 0xeb, 0x08, 0x64, 0x24, 0x80, 0xe8, 0x12, 0x34, 0x56, 0x78 -- jmp.d8 +8; call 0x78563412

optimize-int3:
7: 0xcc, 0x08, 0x64, 0x24, 0x80, 0xe8, 0x12, 0x34, 0x56, 0x78 -- int3; call 0x78563412
optimize-tail:
8: 0xcc, 0x8d, 0x64, 0x24, 0x80, 0xe8, 0x78, 0x56, 0x34, 0x12 -- int3; call 0x12345678
optimize-finish:
9: 0x48, 0x8d, 0x64, 0x24, 0x80, 0xe8, 0x78, 0x56, 0x34, 0x12 -- int3; call 0x12345678

Note that from step 7 to step 8, you re-write the CALL instruction
without going through INT3. This means it is entirely possible for a
concurrent execution to observe a composite instruction.

This is NOT sound!

However, I think it can be salvaged, if instead of only writing INT3 at
+0, you also write INT3 at +5. The sequence then becomes:

initial:
0: 0x66, 0x2e, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00 -- cs nopw 0x00000000(%rax,%rax,1)

optimize-int3:
1: 0xcc, 0x2e, 0x0f, 0x1f, 0x84, 0xcc, 0x00, 0x00, 0x00, 0x00 -- int3; int3
optimize-tail(s):
2: 0xcc, 0x8d, 0x64, 0x24, 0x80, 0xcc, 0x12, 0x34, 0x56, 0x78 -- int3; int3
optimize-finish-1:
3: 0xcc, 0x8d, 0x64, 0x24, 0x80, 0xe8, 0x12, 0x34, 0x56, 0x78 -- int3; call 0x78563412
optimize-finish-2:
3: 0x48, 0x8d, 0x64, 0x24, 0x80, 0xe8, 0x12, 0x34, 0x56, 0x78 -- lea -0x80(%rsp),%rsp; call 0x78563412

unoptimize-int3:
4: 0xcc, 0x8d, 0x64, 0x24, 0x80, 0xe8, 0x12, 0x34, 0x56, 0x78 -- int3; call 0x78563412
unoptimize-tail:
5: 0xcc, 0x2e, 0x0f, 0x1f, 0x84, 0xe8, 0x12, 0x34, 0x56, 0x78 -- int3; call 0x78563412
unoptimize-finish:
6: 0x66, 0x2e, 0x0f, 0x1f, 0x84, 0xe8, 0x12, 0x34, 0x56, 0x78 -- cs nopw 0x78563412(%rax,%rbp,8); call 0x78563412

optimize-int3:
7: 0xcc, 0x2e, 0x0f, 0x1f, 0x84, 0xcc, 0x12, 0x34, 0x56, 0x78 -- int3; int3
optimize-tail(s):
8: 0xcc, 0x8d, 0x64, 0x24, 0x80, 0xcc, 0x78, 0x56, 0x34, 0x12 -- int3; int3
optimize-finish-1:
9: 0xcc, 0x8d, 0x64, 0x24, 0x80, 0xe8, 0x78, 0x56, 0x34, 0x12 -- int3; call 0x12345678
optimize-finish-2:
9: 0x48, 0x8d, 0x64, 0x24, 0x80, 0xe8, 0x78, 0x56, 0x34, 0x12 -- lea -0x80(%rsp),%rsp; call 0x12345678

> @@ -1095,14 +1125,25 @@ int set_orig_insn(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
>  		  unsigned long vaddr)
>  {
>  	if (test_bit(ARCH_UPROBE_FLAG_CAN_OPTIMIZE, &auprobe->flags)) {
> -		int ret = is_optimized(vma->vm_mm, vaddr);
> -		if (ret < 0)
> +		uprobe_opcode_t insn[OPT_INSN_SIZE];
> +		int ret;
> +
> +		ret = copy_from_vaddr(vma->vm_mm, vaddr, &insn, OPT_INSN_SIZE);
> +		if (ret)
>  			return ret;
> -		if (ret) {
> +		if (__is_optimized((uprobe_opcode_t *)&insn, vaddr)) {
>  			ret = swbp_unoptimize(auprobe, vma, vaddr);
>  			WARN_ON_ONCE(ret);
>  			return ret;
>  		}
> +		/*
> +		 * We can have re-attached probe on top of jmp8 instruction,
> +		 * which did not get optimized. We need to restore the jmp8
> +		 * instruction, instead of the original instruction (nop10).
> +		 */
> +		if (is_swbp_insn(&insn[0]) && insn[1] == OPT_JMP8_OFFSET)
> +			return uprobe_write_opcode(auprobe, vma, vaddr, JMP8_INSN_OPCODE,
> +						   false /* is_register */);

Coding style wants { } on any multi-line statement, even if its only one
statement.

>  	}
>  	return uprobe_write_opcode(auprobe, vma, vaddr, *(uprobe_opcode_t *)&auprobe->insn,
>  				   false /* is_register */);

^ permalink raw reply

* Re: [PATCH v2 05/17] tracing: Add __print_untrusted_str()
From: Mickaël Salaün @ 2026-05-18 10:26 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers
  Cc: Christian Brauner, Günther Noack, Jann Horn, Jeff Xu,
	Justin Suess, Kees Cook, Mathieu Desnoyers, Matthieu Buffet,
	Mikhail Ivanov, Tingmao Wang, kernel-team, linux-fsdevel,
	linux-security-module, linux-trace-kernel, Andrii Nakryiko
In-Reply-To: <20260406143717.1815792-6-mic@digikod.net>

Steve, Masami, Mathieu, are you ok with this new helper?

On Mon, Apr 06, 2026 at 04:37:03PM +0200, Mickaël Salaün wrote:
> Landlock tracepoints expose filesystem paths and process names
> that may contain spaces, equal signs, or other characters that
> break ftrace field parsing.
> 
> Add a new __print_untrusted_str() helper to safely print strings after
> escaping all special characters, including common separators (space,
> equal sign), quotes, and backslashes.  This transforms a string from an
> untrusted source (e.g. user space) to make it:
> - safe to parse,
> - easy to read (for simple strings),
> - easy to get back the original.
> 
> Cc: Günther Noack <gnoack@google.com>
> Cc: Masami Hiramatsu <mhiramat@kernel.org>
> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Tingmao Wang <m@maowtm.org>
> Signed-off-by: Mickaël Salaün <mic@digikod.net>
> ---
> 
> Changes since v1:
> https://lore.kernel.org/r/20250523165741.693976-4-mic@digikod.net
> - Remove WARN_ON() (pointed out by Steven Rostedt).
> ---
>  include/linux/trace_events.h               |  2 ++
>  include/trace/stages/stage3_trace_output.h |  4 +++
>  include/trace/stages/stage7_class_define.h |  1 +
>  kernel/trace/trace_output.c                | 41 ++++++++++++++++++++++
>  4 files changed, 48 insertions(+)
> 
> diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
> index 37eb2f0f3dd8..7f4325d327ee 100644
> --- a/include/linux/trace_events.h
> +++ b/include/linux/trace_events.h
> @@ -57,6 +57,8 @@ trace_print_hex_dump_seq(struct trace_seq *p, const char *prefix_str,
>  			 int prefix_type, int rowsize, int groupsize,
>  			 const void *buf, size_t len, bool ascii);
>  
> +const char *trace_print_untrusted_str_seq(struct trace_seq *s, const char *str);
> +
>  int trace_raw_output_prep(struct trace_iterator *iter,
>  			  struct trace_event *event);
>  extern __printf(2, 3)
> diff --git a/include/trace/stages/stage3_trace_output.h b/include/trace/stages/stage3_trace_output.h
> index fce85ea2df1c..62e98babb969 100644
> --- a/include/trace/stages/stage3_trace_output.h
> +++ b/include/trace/stages/stage3_trace_output.h
> @@ -133,6 +133,10 @@
>  	trace_print_hex_dump_seq(p, prefix_str, prefix_type,		\
>  				 rowsize, groupsize, buf, len, ascii)
>  
> +#undef __print_untrusted_str
> +#define __print_untrusted_str(str)							\
> +		trace_print_untrusted_str_seq(p, __get_str(str))
> +
>  #undef __print_ns_to_secs
>  #define __print_ns_to_secs(value)			\
>  	({						\
> diff --git a/include/trace/stages/stage7_class_define.h b/include/trace/stages/stage7_class_define.h
> index fcd564a590f4..1164aacd550f 100644
> --- a/include/trace/stages/stage7_class_define.h
> +++ b/include/trace/stages/stage7_class_define.h
> @@ -24,6 +24,7 @@
>  #undef __print_array
>  #undef __print_dynamic_array
>  #undef __print_hex_dump
> +#undef __print_untrusted_str
>  #undef __get_buf
>  
>  /*
> diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
> index 1996d7aba038..9d14c7cc654d 100644
> --- a/kernel/trace/trace_output.c
> +++ b/kernel/trace/trace_output.c
> @@ -16,6 +16,7 @@
>  #include <linux/btf.h>
>  #include <linux/bpf.h>
>  #include <linux/hashtable.h>
> +#include <linux/string_helpers.h>
>  
>  #include "trace_output.h"
>  #include "trace_btf.h"
> @@ -321,6 +322,46 @@ trace_print_hex_dump_seq(struct trace_seq *p, const char *prefix_str,
>  }
>  EXPORT_SYMBOL(trace_print_hex_dump_seq);
>  
> +/**
> + * trace_print_untrusted_str_seq - print a string after escaping characters
> + * @s: trace seq struct to write to
> + * @src: The string to print
> + *
> + * Prints a string to a trace seq after escaping all special characters,
> + * including common separators (space, equal sign), quotes, and backslashes.
> + * This transforms a string from an untrusted source (e.g. user space) to make
> + * it:
> + * - safe to parse,
> + * - easy to read (for simple strings),
> + * - easy to get back the original.
> + */
> +const char *trace_print_untrusted_str_seq(struct trace_seq *s,
> +					   const char *src)
> +{
> +	int escaped_size;
> +	char *buf;
> +	size_t buf_size = seq_buf_get_buf(&s->seq, &buf);
> +	const char *ret = trace_seq_buffer_ptr(s);
> +
> +	/* Buffer exhaustion is normal when the trace buffer is full. */
> +	if (!src || buf_size == 0)
> +		return NULL;
> +
> +	escaped_size = string_escape_mem(src, strlen(src), buf, buf_size,
> +		ESCAPE_SPACE | ESCAPE_SPECIAL | ESCAPE_NAP | ESCAPE_APPEND |
> +		ESCAPE_OCTAL, " ='\"\\");
> +	if (unlikely(escaped_size >= buf_size)) {
> +		/* We need some room for the final '\0'. */
> +		seq_buf_set_overflow(&s->seq);
> +		s->full = 1;
> +		return NULL;
> +	}
> +	seq_buf_commit(&s->seq, escaped_size);
> +	trace_seq_putc(s, 0);
> +	return ret;
> +}
> +EXPORT_SYMBOL(trace_print_untrusted_str_seq);
> +
>  int trace_raw_output_prep(struct trace_iterator *iter,
>  			  struct trace_event *trace_event)
>  {
> -- 
> 2.53.0
> 
> 

^ permalink raw reply

* Re: [PATCH v2 08/14] verification/rvgen: Add golden and spec folders for tests
From: Nam Cao @ 2026-05-18  8:57 UTC (permalink / raw)
  To: Gabriele Monaco, linux-kernel, linux-trace-kernel, Steven Rostedt,
	Gabriele Monaco
  Cc: Thomas Weissschuh, Tomas Glozar, John Kacur, Wen Yang
In-Reply-To: <20260514152055.229162-9-gmonaco@redhat.com>

Gabriele Monaco <gmonaco@redhat.com> writes:
> Create reference models specifications and generated files in the golded
> folder. Those can be used as reference to validate rvgen still generates
> files as expected in automated tests.
>
> Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>

Didn't look at the "golden" files, I presume those are generated.

Reviewed-by: Nam Cao <namcao@linutronix.de>

^ permalink raw reply

* Re: [PATCH v2 07/14] verification/rvgen: Fix ltl2k writing True as a literal
From: Nam Cao @ 2026-05-18  8:52 UTC (permalink / raw)
  To: Gabriele Monaco, linux-kernel, linux-trace-kernel, Steven Rostedt,
	Gabriele Monaco
  Cc: Thomas Weissschuh, Tomas Glozar, John Kacur, Wen Yang
In-Reply-To: <20260514152055.229162-8-gmonaco@redhat.com>

Gabriele Monaco <gmonaco@redhat.com> writes:
> The rvgen parser for LTL stores literal true values in the python
> representation (capitalised True), this doesn't build in C.
> The Literal class should already handle this case but ASTNode skips its
> strigification method and converts the value (true/false) directly.
>
> Fix by delegating ASTNode stringification to the Literal and Variable
> classes instead of bypassing them.
>
> Fixes: 97ffa4ce6ab32 ("verification/rvgen: Add support for linear temporal logic")
> Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>

Reviewed-by: Nam Cao <namcao@linutronix.de>

^ permalink raw reply

* Re: [PATCH v2 06/14] verification/rvgen: Fix options shared among commands
From: Nam Cao @ 2026-05-18  8:49 UTC (permalink / raw)
  To: Gabriele Monaco, linux-kernel, linux-trace-kernel, Steven Rostedt,
	Gabriele Monaco
  Cc: Thomas Weissschuh, Tomas Glozar, John Kacur, Wen Yang
In-Reply-To: <20260514152055.229162-7-gmonaco@redhat.com>

Gabriele Monaco <gmonaco@redhat.com> writes:
> After rvgen was refactored to use subparsers, the common options (-a and
> -D) were left in the main parser. This meant that they needed to be
> called /before/ the subcommand and using them without subcommand was
> allowed. This is not the original intent.
>
>   rvgen -D "some description" container -n name
>
> Define the options as parent in the subparsers to allow them to be used
> from both subcommands together with other options.
>
>   rvgen container -n name -D "some description"
>
> Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>

I didn't know we can do this.

Reviewed-by: Nam Cao <namcao@linutronix.de>

^ permalink raw reply

* Re: [PATCH v2] tools/rtla: Fix --dump-tasks usage in timerlat
From: Tomas Glozar @ 2026-05-18  8:49 UTC (permalink / raw)
  To: Costa Shulyupin
  Cc: Steven Rostedt, Crystal Wood, Wander Lairson Costa, Ivan Pravdin,
	linux-trace-kernel, linux-kernel
In-Reply-To: <20260414185223.65353-1-costa.shul@redhat.com>

út 14. 4. 2026 v 20:52 odesílatel Costa Shulyupin
<costa.shul@redhat.com> napsal:
>
> Fix --dump-task to --dump-tasks in timerlat_hist usage string
> and getopt_long table for consistency with timerlat_top.
>
> Add missing --dump-tasks to timerlat_top usage synopsis.
>
> Assisted-by: Claude:claude-opus-4-6
> Signed-off-by: Costa Shulyupin <costa.shul@redhat.com>
> ---
> v2:
> - Address comments of Crystal Wood.

Please also add the link to v1 next time when sending a v2, it makes
it easier to check what is being addressed in the v2.

> ---
>  tools/tracing/rtla/src/timerlat_hist.c | 4 ++--
>  tools/tracing/rtla/src/timerlat_top.c  | 3 ++-
>  2 files changed, 4 insertions(+), 3 deletions(-)
>

The runtime test expanded to hist in [1] now passes, thanks! Adding:

Fixes: 2091336b9a8b ("rtla/timerlat_hist: Add auto-analysis support")

[1] https://lore.kernel.org/linux-trace-kernel/20260423130558.882022-2-tglozar@redhat.com/

Tomas


^ permalink raw reply

* Re: [PATCH v2 05/14] tools/rv: Add selftests
From: Nam Cao @ 2026-05-18  8:46 UTC (permalink / raw)
  To: Gabriele Monaco, linux-kernel, linux-trace-kernel, Steven Rostedt,
	Gabriele Monaco
  Cc: Thomas Weissschuh, Tomas Glozar, John Kacur, Wen Yang
In-Reply-To: <20260514152055.229162-6-gmonaco@redhat.com>

Gabriele Monaco <gmonaco@redhat.com> writes:
> The rv tool needs automated testing to catch regressions and verify
> correct functionality across different usage scenarios.
>
> Add selftests that validate monitor listing (including containers and
> nested monitors), monitor execution with different configurations
> (reactors, verbose output, tracing), and trace output format for both
> per-task and per-cpu monitors. Error handling paths are also tested.
> Tests use a shared engine for common patterns.
>
> Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>

I am not good enough at bash script to review this. But test scripts can
never hurt, so:

Acked-by: Nam Cao <namcao@linutronix.de>

^ permalink raw reply

* Re: [PATCH v2 04/14] tools/rv: Fix cleanup after failed trace setup
From: Nam Cao @ 2026-05-18  8:42 UTC (permalink / raw)
  To: Gabriele Monaco, linux-kernel, linux-trace-kernel, Steven Rostedt,
	Gabriele Monaco
  Cc: Thomas Weissschuh, Tomas Glozar, John Kacur, Wen Yang
In-Reply-To: <20260514152055.229162-5-gmonaco@redhat.com>

Gabriele Monaco <gmonaco@redhat.com> writes:

> Currently if ikm_setup_trace_instance() fails, the tool returns without
> any cleanup, if rv was called with both -t and -r, this means the
> reactor is not going to be cleared.
>
> Jump to the cleanup label to restore the reactor if necessary.
>
> Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>

Reviewed-by: Nam Cao <namcao@linutronix.de>

^ permalink raw reply

* Re: [PATCH v2 03/14] tools/rv: Fix exit status when monitor execution fails
From: Nam Cao @ 2026-05-18  8:32 UTC (permalink / raw)
  To: Gabriele Monaco, linux-kernel, linux-trace-kernel, Steven Rostedt,
	Gabriele Monaco
  Cc: Thomas Weissschuh, Tomas Glozar, John Kacur, Wen Yang
In-Reply-To: <20260514152055.229162-4-gmonaco@redhat.com>

Gabriele Monaco <gmonaco@redhat.com> writes:
> +	exit(run <= 0);

Probably better to stick to the C standard:

    exit(run > 0 ? EXIT_SUCCESS : EXIT_FAILURE)

but whatever.

Reviewed-by: Nam Cao <namcao@linutronix.de>

^ permalink raw reply

* Re: [PATCH v2 02/14] tools/rv: Fix substring match when listing container monitors
From: Nam Cao @ 2026-05-18  8:21 UTC (permalink / raw)
  To: Gabriele Monaco, linux-kernel, linux-trace-kernel, Steven Rostedt,
	Gabriele Monaco
  Cc: Thomas Weissschuh, Tomas Glozar, John Kacur, Wen Yang
In-Reply-To: <20260514152055.229162-3-gmonaco@redhat.com>

Gabriele Monaco <gmonaco@redhat.com> writes:
> When listing monitors within a specific container (rv list <container>),
> the tool incorrectly matched monitors if the requested container name
> was only a prefix of the actual container (e.g., 'rv list sche' would
> incorrectly list monitors from 'sched:').
>
> Fix this by ensuring the container name is an exact match and is
> immediately followed by the ':' separator.
>
> Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>

Reviewed-by: Nam Cao <namcao@linutronix.de>

^ permalink raw reply

* Re: [PATCH v6 2/2] blk-mq: expose tag starvation counts via debugfs
From: John Garry @ 2026-05-18  8:14 UTC (permalink / raw)
  To: Aaron Tomlin, axboe, rostedt, mhiramat, mathieu.desnoyers
  Cc: bvanassche, johannes.thumshirn, kch, dlemoal, ritesh.list,
	loberman, neelx, sean, mproche, chjohnst, linux-block,
	linux-kernel, linux-trace-kernel
In-Reply-To: <20260517213614.350367-3-atomlin@atomlin.com>

On 17/05/2026 22:36, Aaron Tomlin wrote:
> In high-performance storage environments, particularly when utilising
> RAID controllers with shared tag sets (BLK_MQ_F_TAG_HCTX_SHARED), severe
> latency spikes can occur when fast devices are starved of available
> tags.
> 
> This patch introduces two new debugfs attributes for each block
> hardware queue:
>    - /sys/kernel/debug/block/[device]/hctxN/wait_on_hw_tag
>    - /sys/kernel/debug/block/[device]/hctxN/wait_on_sched_tag

How would these counters be used? You are just saying that we may have 
performance latency spikes and so here are two new counters.

> 
> These files expose atomic counters that increment each time a submitting
> context is forced into an uninterruptible sleep via io_schedule() due to
> the complete exhaustion of physical driver tags or software scheduler
> tags, respectively.
> 
> To ensure negligible performance overhead even in production
> environments where CONFIG_BLK_DEBUG_FS is actively enabled, this
> tracking logic utilises dynamically allocated per-CPU counters. When
> this configuration is disabled, the tracking logic compiles down to a
> safe no-op.

How does one normalise the values which are measured? I mean, during a 
period of high contention, we may get a bunch of threads waiting for a 
driver tag and the value in wait_on_hw_tag may jump considerably - how 
do you normalize this value in wait_on_hw_tag for meaningful analysis?

> 
> Signed-off-by: Aaron Tomlin <atomlin@atomlin.com>
> ---
>   block/blk-mq-debugfs.c | 109 +++++++++++++++++++++++++++++++++++++++++
>   block/blk-mq-debugfs.h |  19 +++++++
>   block/blk-mq-tag.c     |   4 ++
>   block/blk-mq.c         |   5 ++
>   include/linux/blk-mq.h |  12 +++++
>   5 files changed, 149 insertions(+)
> 
> diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> index 047ec887456b..a94ffc2eacdf 100644
> --- a/block/blk-mq-debugfs.c
> +++ b/block/blk-mq-debugfs.c
> @@ -7,6 +7,7 @@
>   #include <linux/blkdev.h>
>   #include <linux/build_bug.h>
>   #include <linux/debugfs.h>
> +#include <linux/percpu.h>
>   
>   #include "blk.h"
>   #include "blk-mq.h"
> @@ -484,6 +485,54 @@ static int hctx_dispatch_busy_show(void *data, struct seq_file *m)
>   	return 0;
>   }
>   
> +/**
> + * hctx_wait_on_hw_tag_show - display hardware tag starvation count
> + * @data: generic pointer to the associated hardware context (hctx)
> + * @m: seq_file pointer for debugfs output formatting
> + *
> + * Prints the cumulative number of times a submitting context was forced
> + * to block due to the exhaustion of physical hardware driver tags.
> + *
> + * Return: 0 on success.
> + */
> +static int hctx_wait_on_hw_tag_show(void *data, struct seq_file *m)
> +{
> +	struct blk_mq_hw_ctx *hctx = data;
> +	unsigned long count = 0;
> +	int cpu;
> +
> +	if (hctx->wait_on_hw_tag) {
> +		for_each_possible_cpu(cpu)
> +			count += *per_cpu_ptr(hctx->wait_on_hw_tag, cpu);
> +	}
> +	seq_printf(m, "%lu\n", count);
> +	return 0;
> +}
> +
> +/**
> + * hctx_wait_on_sched_tag_show - display scheduler tag starvation count
> + * @data: generic pointer to the associated hardware context (hctx)
> + * @m: seq_file pointer for debugfs output formatting
> + *
> + * Prints the cumulative number of times a submitting context was forced
> + * to block due to the exhaustion of software scheduler tags.
> + *
> + * Return: 0 on success.
> + */
> +static int hctx_wait_on_sched_tag_show(void *data, struct seq_file *m)
> +{
> +	struct blk_mq_hw_ctx *hctx = data;
> +	unsigned long count = 0;
> +	int cpu;
> +
> +	if (hctx->wait_on_sched_tag) {
> +		for_each_possible_cpu(cpu)
> +			count += *per_cpu_ptr(hctx->wait_on_sched_tag, cpu);
> +	}
> +	seq_printf(m, "%lu\n", count);
> +	return 0;
> +}
> +
>   #define CTX_RQ_SEQ_OPS(name, type)					\
>   static void *ctx_##name##_rq_list_start(struct seq_file *m, loff_t *pos) \
>   	__acquires(&ctx->lock)						\
> @@ -599,6 +648,8 @@ static const struct blk_mq_debugfs_attr blk_mq_debugfs_hctx_attrs[] = {
>   	{"active", 0400, hctx_active_show},
>   	{"dispatch_busy", 0400, hctx_dispatch_busy_show},
>   	{"type", 0400, hctx_type_show},
> +	{"wait_on_hw_tag", 0400, hctx_wait_on_hw_tag_show},
> +	{"wait_on_sched_tag", 0400, hctx_wait_on_sched_tag_show},
>   	{},
>   };
>   
> @@ -815,3 +866,61 @@ void blk_mq_debugfs_unregister_sched_hctx(struct blk_mq_hw_ctx *hctx)
>   	debugfs_remove_recursive(hctx->sched_debugfs_dir);
>   	hctx->sched_debugfs_dir = NULL;
>   }
> +
> +/**
> + * blk_mq_debugfs_alloc_hctx_stats - Allocate per-cpu starvation statistics
> + * @hctx: hardware context associated with the tag allocation
> + * @gfp: memory allocation flags
> + *
> + * Allocates the per-cpu memory for tracking hardware and scheduler tag
> + * starvation.
> + */
> +void blk_mq_debugfs_alloc_hctx_stats(struct blk_mq_hw_ctx *hctx, gfp_t gfp)
> +{
> +	if (!hctx->wait_on_hw_tag)
> +		hctx->wait_on_hw_tag = alloc_percpu_gfp(unsigned long,
> +							gfp);
> +	if (!hctx->wait_on_sched_tag)
> +		hctx->wait_on_sched_tag = alloc_percpu_gfp(unsigned long,
> +							   gfp);
> +}
> +
> +/**
> + * blk_mq_debugfs_free_hctx_stats - Free per-cpu starvation statistics
> + * @hctx: hardware context associated with the tag allocation
> + *
> + * Frees the per-cpu memory used for tracking hardware and scheduler tag
> + * starvation. This must only be called during hardware queue teardown when
> + * the queue is safely frozen and no active I/O submissions can race to
> + * increment the statistics.
> + */
> +void blk_mq_debugfs_free_hctx_stats(struct blk_mq_hw_ctx *hctx)
> +{
> +	free_percpu(hctx->wait_on_hw_tag);
> +	hctx->wait_on_hw_tag = NULL;
> +	free_percpu(hctx->wait_on_sched_tag);
> +	hctx->wait_on_sched_tag = NULL;
> +}
> +
> +/**
> + * blk_mq_debugfs_inc_wait_tags - increment the tag starvation counters
> + * @hctx: hardware context associated with the tag allocation
> + * @is_sched: true if the starved pool is the software scheduler
> + *
> + * Evaluates the exhausted tag pool and safely increments the appropriate
> + * per-cpu debugfs starvation counter.
> + *
> + * Note: The per-cpu pointers are explicitly checked to prevent a NULL
> + * pointer dereference in the event that the system was under heavy memory
> + * pressure and the initial per-cpu allocation failed.
> + */
> +void blk_mq_debugfs_inc_wait_tags(struct blk_mq_hw_ctx *hctx,
> +				  bool is_sched)
> +{
> +	unsigned long __percpu *tags = is_sched ?
> +			READ_ONCE(hctx->wait_on_sched_tag) :
> +			READ_ONCE(hctx->wait_on_hw_tag);
> +
> +	if (likely(tags))
> +		raw_cpu_inc(*tags);
> +}
> diff --git a/block/blk-mq-debugfs.h b/block/blk-mq-debugfs.h
> index 49bb1aaa83dc..7a7c0f376a2b 100644
> --- a/block/blk-mq-debugfs.h
> +++ b/block/blk-mq-debugfs.h
> @@ -17,6 +17,8 @@ struct blk_mq_debugfs_attr {
>   	const struct seq_operations *seq_ops;
>   };
>   
> +void blk_mq_debugfs_inc_wait_tags(struct blk_mq_hw_ctx *hctx,
> +				  bool is_sched);
>   int __blk_mq_debugfs_rq_show(struct seq_file *m, struct request *rq);
>   int blk_mq_debugfs_rq_show(struct seq_file *m, void *v);
>   
> @@ -26,6 +28,9 @@ void blk_mq_debugfs_register_hctx(struct request_queue *q,
>   void blk_mq_debugfs_unregister_hctx(struct blk_mq_hw_ctx *hctx);
>   void blk_mq_debugfs_register_hctxs(struct request_queue *q);
>   void blk_mq_debugfs_unregister_hctxs(struct request_queue *q);
> +void blk_mq_debugfs_alloc_hctx_stats(struct blk_mq_hw_ctx *hctx,
> +				     gfp_t gfp);
> +void blk_mq_debugfs_free_hctx_stats(struct blk_mq_hw_ctx *hctx);
>   
>   void blk_mq_debugfs_register_sched(struct request_queue *q);
>   void blk_mq_debugfs_unregister_sched(struct request_queue *q);
> @@ -35,6 +40,11 @@ void blk_mq_debugfs_unregister_sched_hctx(struct blk_mq_hw_ctx *hctx);
>   
>   void blk_mq_debugfs_register_rq_qos(struct request_queue *q);
>   #else
> +static inline void blk_mq_debugfs_inc_wait_tags(struct blk_mq_hw_ctx *hctx,
> +						bool is_sched)
> +{
> +}
> +
>   static inline void blk_mq_debugfs_register(struct request_queue *q)
>   {
>   }
> @@ -56,6 +66,15 @@ static inline void blk_mq_debugfs_unregister_hctxs(struct request_queue *q)
>   {
>   }
>   
> +static inline void blk_mq_debugfs_alloc_hctx_stats(struct blk_mq_hw_ctx *hctx,
> +						   gfp_t gfp)
> +{
> +}
> +
> +static inline void blk_mq_debugfs_free_hctx_stats(struct blk_mq_hw_ctx *hctx)
> +{
> +}
> +
>   static inline void blk_mq_debugfs_register_sched(struct request_queue *q)
>   {
>   }
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index 66138dd043d4..3cc6a97a87a0 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -17,6 +17,7 @@
>   #include "blk.h"
>   #include "blk-mq.h"
>   #include "blk-mq-sched.h"
> +#include "blk-mq-debugfs.h"
>   
>   /*
>    * Recalculate wakeup batch when tag is shared by hctx.
> @@ -191,6 +192,9 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
>   		trace_block_rq_tag_wait(data->q, data->hctx,
>   					data->rq_flags & RQF_SCHED_TAGS);
>   
> +		blk_mq_debugfs_inc_wait_tags(data->hctx,
> +					     data->rq_flags & RQF_SCHED_TAGS);
> +
>   		bt_prev = bt;
>   		io_schedule();
>   
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 4c5c16cce4f8..cd52bf6f82ce 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -3991,6 +3991,8 @@ static void blk_mq_exit_hctx(struct request_queue *q,
>   			blk_free_flush_queue_callback);
>   	hctx->fq = NULL;
>   
> +	blk_mq_debugfs_free_hctx_stats(hctx);
> +
>   	spin_lock(&q->unused_hctx_lock);
>   	list_add(&hctx->hctx_list, &q->unused_hctx_list);
>   	spin_unlock(&q->unused_hctx_lock);
> @@ -4016,6 +4018,8 @@ static int blk_mq_init_hctx(struct request_queue *q,
>   {
>   	gfp_t gfp = GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY;
>   
> +	blk_mq_debugfs_alloc_hctx_stats(hctx, gfp);
> +
>   	hctx->fq = blk_alloc_flush_queue(hctx->numa_node, set->cmd_size, gfp);
>   	if (!hctx->fq)
>   		goto fail;
> @@ -4041,6 +4045,7 @@ static int blk_mq_init_hctx(struct request_queue *q,
>   	blk_free_flush_queue(hctx->fq);
>   	hctx->fq = NULL;
>    fail:
> +	blk_mq_debugfs_free_hctx_stats(hctx);
>   	return -1;
>   }
>   
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index 18a2388ba581..41d61488d683 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -453,6 +453,18 @@ struct blk_mq_hw_ctx {
>   	struct dentry		*debugfs_dir;
>   	/** @sched_debugfs_dir:	debugfs directory for the scheduler. */
>   	struct dentry		*sched_debugfs_dir;
> +	/**
> +	 * @wait_on_hw_tag: Cumulative per-cpu counter incremented each
> +	 * time a submitting context is forced to block due to physical
> +	 * hardware tag exhaustion.
> +	 */
> +	unsigned long __percpu	*wait_on_hw_tag;
> +	/**
> +	 * @wait_on_sched_tag: Cumulative per-cpu counter incremented each
> +	 * time a submitting context is forced to block due to software
> +	 * scheduler tag exhaustion.
> +	 */
> +	unsigned long __percpu	*wait_on_sched_tag;
>   #endif
>   
>   	/**


^ permalink raw reply

* Re: [PATCH v2 01/14] tools/rv: Fix substring match bug in monitor name search
From: Nam Cao @ 2026-05-18  8:15 UTC (permalink / raw)
  To: Gabriele Monaco, linux-kernel, linux-trace-kernel, Steven Rostedt,
	Gabriele Monaco
  Cc: Thomas Weissschuh, Tomas Glozar, John Kacur, Wen Yang
In-Reply-To: <87ecj9879f.fsf@yellow.woof>

Nam Cao <namcao@linutronix.de> writes:

> Gabriele Monaco <gmonaco@redhat.com> writes:
>>  static int __ikm_find_monitor_name(char *monitor_name, char *out_name)
>>  {
>> -	char *available_monitors, container[MAX_DA_NAME_LEN+1], *cursor, *end;
>> -	int retval = 1;
>> +	char *available_monitors, *cursor, *line;
>> +	int len = strlen(monitor_name);
>> +	int found = 0;
>>  
>>  	available_monitors = tracefs_instance_file_read(NULL, "rv/available_monitors", NULL);
>>  	if (!available_monitors)
>>  		return -1;
>>  
>> -	cursor = strstr(available_monitors, monitor_name);
>> -	if (!cursor) {
>> -		retval = 0;
>> -		goto out_free;
>> -	}
>> +	config_is_container = 0;
>
> Isn't config_is_container unused?
>
> Perhaps it is used in a follow-up patch? Let me keep reading..

Never mind, I'm stupid.

Reviewed-by: Nam Cao <namcao@linutronix.de>

^ permalink raw reply

* Re: [PATCH v2 01/14] tools/rv: Fix substring match bug in monitor name search
From: Nam Cao @ 2026-05-18  8:13 UTC (permalink / raw)
  To: Gabriele Monaco, linux-kernel, linux-trace-kernel, Steven Rostedt,
	Gabriele Monaco
  Cc: Thomas Weissschuh, Tomas Glozar, John Kacur, Wen Yang
In-Reply-To: <20260514152055.229162-2-gmonaco@redhat.com>

Gabriele Monaco <gmonaco@redhat.com> writes:
>  static int __ikm_find_monitor_name(char *monitor_name, char *out_name)
>  {
> -	char *available_monitors, container[MAX_DA_NAME_LEN+1], *cursor, *end;
> -	int retval = 1;
> +	char *available_monitors, *cursor, *line;
> +	int len = strlen(monitor_name);
> +	int found = 0;
>  
>  	available_monitors = tracefs_instance_file_read(NULL, "rv/available_monitors", NULL);
>  	if (!available_monitors)
>  		return -1;
>  
> -	cursor = strstr(available_monitors, monitor_name);
> -	if (!cursor) {
> -		retval = 0;
> -		goto out_free;
> -	}
> +	config_is_container = 0;

Isn't config_is_container unused?

Perhaps it is used in a follow-up patch? Let me keep reading..

Nam

^ permalink raw reply

* Re: [PATCH 07/13] rv: Simply hybrid automata monitors's clock variables
From: Nam Cao @ 2026-05-18  7:44 UTC (permalink / raw)
  To: Gabriele Monaco
  Cc: Steven Rostedt, Wander Lairson Costa, linux-trace-kernel,
	linux-kernel
In-Reply-To: <ad9ca4916604d3f5ffe7a6683f9b82008784fa0e.camel@redhat.com>

Gabriele Monaco <gmonaco@redhat.com> writes:
> On Mon, 2026-05-11 at 13:55 +0200, Nam Cao wrote:
>> That can work, but not ideal, because hrtimer will not be usable.
>
> Why not? If we have HA_TIMER_WHEEL , we'd use timer and expire, if we have
> HA_TIMER_HRTIMER we'd only need hrtimer with it's hrtimer_get_expires():
>
>  union {
>  struct hrtimer hrtimer;
>  struct {
>  struct timer_list timer;
>  u64 expire; /* Explicitly store the armed budget */
>  };
>
> we already can't use timer and hrtimer interchangeably.
> What am I missing here?

Ah, now I understand the trick, thanks.

We already have an "expires" field in struct timer_list. But I am not
sure if we are supposed to touch that field. Your proposal looks safer.

>> Looking at the throttle monitor again, is it possible to rewrite
>> runtime_left_ns() to read .dl_runtime instead of .runtime? I don't know
>> the deadline schedule very well, but I think .dl_runtime is not changing
>> like .runtime?
>
> In theory yes, but since the runtime is consumed only when running, we cannot
> just set the timeout once. We either save how much was consumed somewhere or do
> some start/pause mechanism.
> Neither looks simpler to me.

Understood.

Nam

^ permalink raw reply

* Re: [PATCH 7/7] selftests/bpf: Add tests for uprobe nop10 red zone clobbering
From: Jiri Olsa @ 2026-05-18  7:30 UTC (permalink / raw)
  To: bot+bpf-ci
  Cc: oleg, peterz, mingo, mhiramat, andrii, bpf, linux-trace-kernel,
	ast, daniel, martin.lau, eddyz87, yonghong.song, clm,
	ihor.solodrai
In-Reply-To: <e91d31e0c1378fe970db6385c7af419e97af0c2acd37b3b8775a2fac02ea2361@mail.kernel.org>

On Thu, May 14, 2026 at 02:55:27PM +0000, bot+bpf-ci@kernel.org wrote:
> > diff --git a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
> > index 82b3c0ce9253..d553485e7db5 100644
> > --- a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
> > +++ b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
> 
> [ ... ]
> 
> > diff --git a/tools/testing/selftests/bpf/prog_tests/usdt.c b/tools/testing/selftests/bpf/prog_tests/usdt.c
> > index be34c4087ff5..606601ccdc42 100644
> > --- a/tools/testing/selftests/bpf/prog_tests/usdt.c
> > +++ b/tools/testing/selftests/bpf/prog_tests/usdt.c
> 
> [ ... ]
> 
> > diff --git a/tools/testing/selftests/bpf/progs/test_usdt.c b/tools/testing/selftests/bpf/progs/test_usdt.c
> > index f00cb52874e0..0ee78fb050a1 100644
> > --- a/tools/testing/selftests/bpf/progs/test_usdt.c
> > +++ b/tools/testing/selftests/bpf/progs/test_usdt.c
> 
> [ ... ]
> 
> > diff --git a/tools/testing/selftests/bpf/usdt_2.c b/tools/testing/selftests/bpf/usdt_2.c
> > index b359b389f6c0..5e38f8605b02 100644
> > --- a/tools/testing/selftests/bpf/usdt_2.c
> > +++ b/tools/testing/selftests/bpf/usdt_2.c
> > @@ -13,4 +13,17 @@ void usdt_2(void)
> >  	USDT(optimized_attach, usdt_2);
> >  }
> >
> > +static volatile unsigned long usdt_red_zone_arg1 = 0xDEADBEEF;
> > +static volatile unsigned long usdt_red_zone_arg2 = 0xCAFEBABE;
> > +static volatile unsigned long usdt_red_zone_arg3 = 0xFEEDFACE;
> > +
> > +void __attribute__((noinline)) usdt_red_zone_trigger(void)
> > +{
> > +	unsigned long a1 = usdt_red_zone_arg1;
> > +	unsigned long a2 = usdt_red_zone_arg2;
> > +	unsigned long a3 = usdt_red_zone_arg3;
> > +
> > +	USDT(optimized_attach, usdt_red_zone, a1, a2, a3);
> > +}
> > +
> >  #endif
> 
> Is this test reliable across different compiler versions and optimizations?
> 
> The USDT macro uses the "nor" inline assembly constraint, which allows the
> compiler to allocate these operands to CPU registers. Because there is no
> register pressure in usdt_red_zone_trigger(), modern compilers might allocate
> a1, a2, and a3 directly to registers rather than the stack's red zone.
> 
> If the operands are placed in registers, the uprobe CALL optimization
> clobbering [rsp-8] will not corrupt them. This could cause the test to pass
> even on buggy kernels, creating a false positive.
> 
> Would it be safer to explicitly force these operands into the red zone using
> inline assembly constraints, rather than depending on the compiler's register
> allocator?

answered by Andrii in:
  https://lore.kernel.org/bpf/CAEf4BzZ-nASpoF+-vLYZ1D8eWRVd5JWAuhrwx8XZNrX7ZPY5_Q@mail.gmail.com/

jirka

^ permalink raw reply

* Re: [PATCH 06/13] verification/rvgen: Convert __fill_verify_guards_func() to Lark
From: Nam Cao @ 2026-05-18  7:21 UTC (permalink / raw)
  To: Wander Lairson Costa
  Cc: Gabriele Monaco, Steven Rostedt, linux-trace-kernel, linux-kernel
In-Reply-To: <agd1Tb8RDUqfjaGn@wcosta-defaultstring.rmtbr.csb>

Wander Lairson Costa <wander@redhat.com> writes:
>> +        if not self.has_guard:
>> +            return
>
> The signature of function says this function return a list, instead of
> None.

Can you share the tools you are using to catch these? Or did you notice
that yourself?

Nam

^ permalink raw reply

* Re: [PATCH 03/13] verification/rvgen: Implement state and transition parser based on Lark
From: Nam Cao @ 2026-05-18  7:19 UTC (permalink / raw)
  To: Wander Lairson Costa
  Cc: Gabriele Monaco, Steven Rostedt, linux-trace-kernel, linux-kernel
In-Reply-To: <agdtxdk-JSyGLZ6o@wcosta-defaultstring.rmtbr.csb>

Wander Lairson Costa <wander@redhat.com> writes:
> On Tue, May 05, 2026 at 08:59:24AM +0200, Nam Cao wrote:
>> +        self.rules = [[c, None]]
>
> Here self.rules is a list of lists...
>
>> +
>> +    def chain(self, op: str, c: ConstraintCondition):
>> +        self.rules[-1][1] = op
>> +        self.rules.append((c, None))
>
> ... but here it is a list of tuples.

Thanks. I will switch it entirely to list of tuples.

Nam

^ permalink raw reply

* Re: [PATCH 02/13] verification/rvgen: Introduce a parse tree for automata using Lark
From: Nam Cao @ 2026-05-18  7:18 UTC (permalink / raw)
  To: Wander Lairson Costa
  Cc: Gabriele Monaco, Steven Rostedt, linux-trace-kernel, linux-kernel
In-Reply-To: <agdnvQPLLiLBIxw7@wcosta-defaultstring.rmtbr.csb>

Wander Lairson Costa <wander@redhat.com> writes:
> On Tue, May 05, 2026 at 08:59:23AM +0200, Nam Cao wrote:
>> +    ID: /[_a-zA-Z][_a-zA-Z0-9]+/
>
> This regex rejects symbol character symbol. Is that intentional?

It wasn't intentional. This is blindly copied from the existing regex.

Let me switch to Lark's CNAME.

Nam

^ permalink raw reply

* Re: [PATCH 01/13] verification/rvgen: Switch LTL parser to Lark
From: Nam Cao @ 2026-05-18  7:15 UTC (permalink / raw)
  To: Wander Lairson Costa
  Cc: Gabriele Monaco, Steven Rostedt, linux-trace-kernel, linux-kernel
In-Reply-To: <agdBcW8_GTP3YuNs@wcosta-defaultstring.rmtbr.csb>

Wander Lairson Costa <wander@redhat.com> writes:
>> +VARIABLE: /[A-Z_0-9]+/
>
> Is it ok to start variable names with a number? (unless I am reading the
> regex wrong).

Thanks for pointing that out. Let me switch to Lark's built-in CNAME.

>> +    def LITERAL(self, args):
>> +        return ASTNode(Variable(args))
>
> shouldn't this be `return ASTNode(Literal(args) == "true")`?

Yeah that's broken, should be
    return ASTNode(Literal(args == "true"))

Nam

^ permalink raw reply

* Re: [PATCH 9/9] rv: Mandate deallocation for per-obj monitors
From: Gabriele Monaco @ 2026-05-18  6:36 UTC (permalink / raw)
  To: Wen Yang
  Cc: Nam Cao, linux-kernel, Steven Rostedt, Masami Hiramatsu,
	linux-trace-kernel
In-Reply-To: <c4a3c4fc-7efb-4eb0-81d6-cc7a0d51f8fa@linux.dev>

On Sun, 2026-05-17 at 17:52 +0800, Wen Yang wrote:
> 
> One gap: tools/verification/rvgen/rvgen/templates/dot2k/main.c uses 
> RV_MON_%%MONITOR_TYPE%% but generates no deallocation code, may fail
> to build with a -Wunused-function warning.
> 

Thanks for the review!

That's technically the purpose of this patch, we don't know exactly how is the
per-obj monitor going to deallocate, so we make sure build fails if they don't
set up a way.

This combined with the fact per-obj monitors aren't really documented (yet),
makes it quite confusing, doesn't it?

Would you prefer we always generate a dummy hook calling da_destroy_storage()
and let the user decide what to do with it without forcing (obscure) compiler
warnings?

Thanks,
Gabriele

> 
> --
> Best wishes,
> Wen
> 
> On 5/12/26 22:02, Gabriele Monaco wrote:
> > The per-object monitors use a hash tables and dynamic allocation of the
> > monitor storage, functions to clean a monitor that is no longer needed
> > are provided but nothing ensures the monitor actually uses them.
> > 
> > Remove the inline specifier on the deallocation function to let the
> > compiler warn in case it isn't referenced. If the monitor really doesn't
> > need one (for instance because instances will never cease to exist
> > before disabling the monitor), the da_skip_deallocation() helper macro
> > can be used to silence the warning.
> > 
> > Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
> > ---
> >   include/rv/da_monitor.h                      | 14 +++++++++++++-
> >   kernel/trace/rv/monitors/deadline/deadline.h |  5 ++++-
> >   2 files changed, 17 insertions(+), 2 deletions(-)
> > 
> > diff --git a/include/rv/da_monitor.h b/include/rv/da_monitor.h
> > index 402d3b935c08..378d23ab7dfb 100644
> > --- a/include/rv/da_monitor.h
> > +++ b/include/rv/da_monitor.h
> > @@ -489,8 +489,11 @@ static inline monitor_target
> > da_get_target_by_id(da_id_type id)
> >    * locks.
> >    * This function includes an RCU read-side critical section to synchronise
> >    * against da_monitor_destroy().
> > + * NOTE: inline is omitted on purpose to let the compiler warn if this
> > function
> > + * is never referenced. For monitors that don't require a deallocation
> > hook,
> > + * da_skip_deallocation() can be used.
> >    */
> > -static inline void da_destroy_storage(da_id_type id)
> > +static void da_destroy_storage(da_id_type id)
> >   {
> >   	struct da_monitor_storage *mon_storage;
> >   
> > @@ -504,6 +507,15 @@ static inline void da_destroy_storage(da_id_type id)
> >   	kfree_rcu(mon_storage, rcu);
> >   }
> >   
> > +/*
> > + * da_skip_deallocation - explicitly mark a deallocation function as not
> > required
> > + *
> > + * Only use when you are absolutely sure the monitor doesn't require a
> > + * deallocation hook (i.e. it's not possible for an object to finish
> > existing
> > + * when the monitor is still running).
> > + */
> > +#define da_skip_deallocation(hook) ((void)hook)
> > +
> >   static void da_monitor_reset_all(void)
> >   {
> >   	struct da_monitor_storage *mon_storage;
> > diff --git a/kernel/trace/rv/monitors/deadline/deadline.h
> > b/kernel/trace/rv/monitors/deadline/deadline.h
> > index 78fca873d61e..c39fd79148c2 100644
> > --- a/kernel/trace/rv/monitors/deadline/deadline.h
> > +++ b/kernel/trace/rv/monitors/deadline/deadline.h
> > @@ -194,7 +194,10 @@ static void __maybe_unused handle_newtask(void *data,
> > struct task_struct *task,
> >   		da_create_storage(EXPAND_ID_TASK(task), NULL);
> >   }
> >   
> > -static void __maybe_unused handle_exit(void *data, struct task_struct *p,
> > bool group_dead)
> > +/*
> > + * Deallocation hook, use da_skip_deallocation() when not necessary
> > + */
> > +static void handle_exit(void *data, struct task_struct *p, bool group_dead)
> >   {
> >   	if (p->policy == SCHED_DEADLINE)
> >   		da_destroy_storage(get_entity_id(&p->dl, DL_TASK,
> > DL_TASK));


^ permalink raw reply

* Re: [PATCH v2] tracing/probes: Allow use of BTF names to dereference pointers
From: Masami Hiramatsu @ 2026-05-18  6:17 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Linux trace kernel, bpf, Masami Hiramatsu,
	Mathieu Desnoyers, Mark Rutland, Peter Zijlstra, Namhyung Kim,
	Takaya Saeki, Douglas Raillard, Tom Zanussi, Andrew Morton,
	Thomas Gleixner, Ian Rogers, Jiri Olsa
In-Reply-To: <20260516173310.1dbad146@fedora>

On Sat, 16 May 2026 17:33:09 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

>  Function arguments at kretprobe
>  -------------------------------
> diff --git a/kernel/trace/trace_btf.c b/kernel/trace/trace_btf.c
> index 00172f301f25..be88cc4d97dd 100644
> --- a/kernel/trace/trace_btf.c
> +++ b/kernel/trace/trace_btf.c
> @@ -120,3 +120,114 @@ const struct btf_member *btf_find_struct_member(struct btf *btf,
>  	return member;
>  }
>  
> +#define BITS_ROUNDDOWN_BYTES(bits) ((bits) >> 3)
> +
> +static int find_member(const char *ptr, struct btf *btf,
> +		       const struct btf_type **type, int level)

We already have a similar function.

/*
 * Find a member of data structure/union by name and return it.
 * Return NULL if not found, or -EINVAL if parameter is invalid.
 * If the member is an member of anonymous union/structure, the offset
 * of that anonymous union/structure is stored into @anon_offset. Caller
 * can calculate the correct offset from the root data structure by
 * adding anon_offset to the member's offset.
 */
const struct btf_member *btf_find_struct_member(struct btf *btf,
						const struct btf_type *type,
						const char *member_name,
						u32 *anon_offset)

And this seems useful for you,


> +{
> +	const struct btf_member *member;
> +	const struct btf_type *t = *type;
> +	int i;
> +
> +	/* Max of 3 depth of anonymous structures */
> +	if (level > 3)
> +		return -E2BIG;
> +
> +	for_each_member(i, t, member) {
> +		const char *tname = btf_name_by_offset(btf, member->name_off);
> +
> +		if (strcmp(ptr, tname) == 0) {
> +			int offset = __btf_member_bit_offset(t, member);
> +			*type = btf_type_by_id(btf, member->type);
> +			return BITS_ROUNDDOWN_BYTES(offset);
> +		}
> +
> +		/* Handle anonymous structures */
> +		if (strlen(tname))
> +			continue;
> +
> +		*type = btf_type_by_id(btf, member->type);
> +		if (btf_type_is_struct(*type)) {
> +			int offset = find_member(ptr, btf, type, level + 1);
> +
> +			if (offset < 0)
> +				continue;
> +
> +			return offset + BITS_ROUNDDOWN_BYTES(member->offset);
> +		}
> +	}
> +
> +	return -ENOENT;
> +}
> +
> +/**
> + * btf_find_offset - Find an offset of a member for a structure
> + * @arg: A structure name followed by one or more members
> + * @offset_p: A pointer to where to store the offset
> + *
> + * Will parse @arg with the expected format of: struct.member[[.member]..]
> + * It is delimited by '.'. The first item must be a structure type.
> + * The next are its members. If the member is also of a structure type it
> + * another member may follow ".member".
> + *
> + * Note, @arg is modified but will be put back to what it was on return.
> + *
> + * Returns: 0 on success and -EINVAL if no '.' is present
> + *    or -ENXIO if the structure or member is not found.
> + *    Returns -EINVAL if BTF is not defined.
> + *  On success, @offset_p will contain the offset of the member specified
> + *    by @arg.
> + */
> +int btf_find_offset(char *arg, long *offset_p)

And this can share the inner loop of parse_btf_field().
(BTW, parse_btf_field() supports bitfield, but this does not.)

> +{
> +	const struct btf_type *t;
> +	struct btf *btf;

As Sashiko pointed, btf = NULL here.
But I think we would better to have DEFINE_FREE(btf_put).

> +	long offset = 0;
> +	char *ptr;
> +	int ret;
> +	s32 id;
> +
> +	ptr = strchr(arg, '.');
> +	if (!ptr)
> +		return -EINVAL;
> +
> +	*ptr = '\0';
> +
> +	ret = -ENXIO;
> +	id = bpf_find_btf_id(arg, BTF_KIND_STRUCT, &btf);
> +	if (id < 0)
> +		goto error;
> +
> +	/* Get BTF_KIND_FUNC type */
> +	t = btf_type_by_id(btf, id);
> +
> +	/* May allow more than one member, as long as they are structures */
> +	do {
> +		ret = -ENXIO;
> +		if (!t || !btf_type_is_struct(t))
> +			goto error;
> +
> +		*ptr++ = '.';
> +		arg = ptr;
> +		ptr = strchr(ptr, '.');
> +		if (ptr)
> +			*ptr = '\0';
> +

So, if you don't share the inner loop, you can just reuse
btf_field_struct_member() as below (this should be equivalent
to find_member()):

		field = btf_find_struct_member(btf, t, arg, &anon_offs);
		if (IS_ERR_OR_NULL(field)) {
			ret = field ? PTR_ERR(field) : -ENOENT;
			goto error;
		}

		offset += field->offset + anon_offs;

> +
> +	} while (ptr);
> +
> +	btf_put(btf);
> +	*offset_p = offset;
> +	return 0;
> +
> +error:
> +	btf_put(btf);
> +	if (ptr)
> +		*ptr = '.';
> +	return ret;
> +}
> diff --git a/kernel/trace/trace_btf.h b/kernel/trace/trace_btf.h
> index 4bc44bc261e6..7b0797a6050b 100644
> --- a/kernel/trace/trace_btf.h
> +++ b/kernel/trace/trace_btf.h
> @@ -9,3 +9,13 @@ const struct btf_member *btf_find_struct_member(struct btf *btf,
>  						const struct btf_type *type,
>  						const char *member_name,
>  						u32 *anon_offset);
> +
> +#ifdef CONFIG_PROBE_EVENTS_BTF_ARGS
> +/* Will modify arg, but will put it back before returning. */
> +int btf_find_offset(char *arg, long *offset);
> +#else
> +static inline int btf_find_offset(char *arg, long *offset)
> +{
> +	return -EINVAL;
> +}
> +#endif
> diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
> index e0d3a0da26af..6fcede2de1a5 100644
> --- a/kernel/trace/trace_probe.c
> +++ b/kernel/trace/trace_probe.c
> @@ -1165,7 +1165,7 @@ parse_probe_arg(char *arg, const struct fetch_type *type,
>  
>  	case '+':	/* deref memory */
>  	case '-':
> -		if (arg[1] == 'u') {
> +		if (arg[1] == 'u' && isdigit(arg[2])) {

As Sashiko pointed, This can be broken if STRUCT name starts
with "u[0-9]". What about using "()" for this case?

e.g. 

+u(user_unreg.disable_addr)(uarg)


>  			deref = FETCH_OP_UDEREF;
>  			arg[1] = arg[0];
>  			arg++;
> @@ -1178,7 +1178,22 @@ parse_probe_arg(char *arg, const struct fetch_type *type,
>  			return -EINVAL;
>  		}
>  		*tmp = '\0';
> -		ret = kstrtol(arg, 0, &offset);
> +		if (arg[0] != '-' && !isdigit(*arg)) {
> +			int err = 0;
> +			ret = btf_find_offset(arg, &offset);
> +			switch (ret) {
> +			case -ENODEV: err = TP_ERR_NOSUP_BTFARG; break;
> +			case -E2BIG: err = TP_ERR_MEMBER_TOO_DEEP; break;
> +			case -EINVAL: err = TP_ERR_BAD_STRUCT_FMT; break;
> +			case -ENXIO: err = TP_ERR_BAD_BTF_TID; break;

This should have -ENOENT and default case for unknown error.
(There is TP_ERR_NO_BTF_FIELD already)

> +			}
> +			if (err)
> +				__trace_probe_log_err(ctx->offset, err);
> +			if (ret < 0)
> +				return ret;
> +		} else {
> +			ret = kstrtol(arg, 0, &offset);
> +		}
>  		if (ret) {
>  			trace_probe_log_err(ctx->offset, BAD_DEREF_OFFS);
>  			break;
> diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
> index 262d8707a3df..d649bb9f5b7c 100644
> --- a/kernel/trace/trace_probe.h
> +++ b/kernel/trace/trace_probe.h
> @@ -563,7 +563,9 @@ extern int traceprobe_define_arg_fields(struct trace_event_call *event_call,
>  	C(NEED_STRING_TYPE,	"$comm and immediate-string only accepts string type"),\
>  	C(TOO_MANY_ARGS,	"Too many arguments are specified"),	\
>  	C(TOO_MANY_EARGS,	"Too many entry arguments specified"),	\
> -	C(EVENT_TOO_BIG,	"Event too big (too many fields?)"),
> +	C(EVENT_TOO_BIG,	"Event too big (too many fields?)"), \
> +	C(MEMBER_TOO_DEEP,	"Too many indirections of anonymous structure"), \
> +	C(BAD_STRUCT_FMT,	"Unknown BTF structure"),
>  
>  #undef C
>  #define C(a, b)		TP_ERR_##a
> -- 
> 2.53.0
> 

Thanks,

-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply

* [PATCH v6 2/2] blk-mq: expose tag starvation counts via debugfs
From: Aaron Tomlin @ 2026-05-17 21:36 UTC (permalink / raw)
  To: axboe, rostedt, mhiramat, mathieu.desnoyers
  Cc: bvanassche, johannes.thumshirn, kch, dlemoal, ritesh.list,
	loberman, neelx, sean, mproche, chjohnst, linux-block,
	linux-kernel, linux-trace-kernel
In-Reply-To: <20260517213614.350367-1-atomlin@atomlin.com>

In high-performance storage environments, particularly when utilising
RAID controllers with shared tag sets (BLK_MQ_F_TAG_HCTX_SHARED), severe
latency spikes can occur when fast devices are starved of available
tags.

This patch introduces two new debugfs attributes for each block
hardware queue:
  - /sys/kernel/debug/block/[device]/hctxN/wait_on_hw_tag
  - /sys/kernel/debug/block/[device]/hctxN/wait_on_sched_tag

These files expose atomic counters that increment each time a submitting
context is forced into an uninterruptible sleep via io_schedule() due to
the complete exhaustion of physical driver tags or software scheduler
tags, respectively.

To ensure negligible performance overhead even in production
environments where CONFIG_BLK_DEBUG_FS is actively enabled, this
tracking logic utilises dynamically allocated per-CPU counters. When
this configuration is disabled, the tracking logic compiles down to a
safe no-op.

Signed-off-by: Aaron Tomlin <atomlin@atomlin.com>
---
 block/blk-mq-debugfs.c | 109 +++++++++++++++++++++++++++++++++++++++++
 block/blk-mq-debugfs.h |  19 +++++++
 block/blk-mq-tag.c     |   4 ++
 block/blk-mq.c         |   5 ++
 include/linux/blk-mq.h |  12 +++++
 5 files changed, 149 insertions(+)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 047ec887456b..a94ffc2eacdf 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -7,6 +7,7 @@
 #include <linux/blkdev.h>
 #include <linux/build_bug.h>
 #include <linux/debugfs.h>
+#include <linux/percpu.h>
 
 #include "blk.h"
 #include "blk-mq.h"
@@ -484,6 +485,54 @@ static int hctx_dispatch_busy_show(void *data, struct seq_file *m)
 	return 0;
 }
 
+/**
+ * hctx_wait_on_hw_tag_show - display hardware tag starvation count
+ * @data: generic pointer to the associated hardware context (hctx)
+ * @m: seq_file pointer for debugfs output formatting
+ *
+ * Prints the cumulative number of times a submitting context was forced
+ * to block due to the exhaustion of physical hardware driver tags.
+ *
+ * Return: 0 on success.
+ */
+static int hctx_wait_on_hw_tag_show(void *data, struct seq_file *m)
+{
+	struct blk_mq_hw_ctx *hctx = data;
+	unsigned long count = 0;
+	int cpu;
+
+	if (hctx->wait_on_hw_tag) {
+		for_each_possible_cpu(cpu)
+			count += *per_cpu_ptr(hctx->wait_on_hw_tag, cpu);
+	}
+	seq_printf(m, "%lu\n", count);
+	return 0;
+}
+
+/**
+ * hctx_wait_on_sched_tag_show - display scheduler tag starvation count
+ * @data: generic pointer to the associated hardware context (hctx)
+ * @m: seq_file pointer for debugfs output formatting
+ *
+ * Prints the cumulative number of times a submitting context was forced
+ * to block due to the exhaustion of software scheduler tags.
+ *
+ * Return: 0 on success.
+ */
+static int hctx_wait_on_sched_tag_show(void *data, struct seq_file *m)
+{
+	struct blk_mq_hw_ctx *hctx = data;
+	unsigned long count = 0;
+	int cpu;
+
+	if (hctx->wait_on_sched_tag) {
+		for_each_possible_cpu(cpu)
+			count += *per_cpu_ptr(hctx->wait_on_sched_tag, cpu);
+	}
+	seq_printf(m, "%lu\n", count);
+	return 0;
+}
+
 #define CTX_RQ_SEQ_OPS(name, type)					\
 static void *ctx_##name##_rq_list_start(struct seq_file *m, loff_t *pos) \
 	__acquires(&ctx->lock)						\
@@ -599,6 +648,8 @@ static const struct blk_mq_debugfs_attr blk_mq_debugfs_hctx_attrs[] = {
 	{"active", 0400, hctx_active_show},
 	{"dispatch_busy", 0400, hctx_dispatch_busy_show},
 	{"type", 0400, hctx_type_show},
+	{"wait_on_hw_tag", 0400, hctx_wait_on_hw_tag_show},
+	{"wait_on_sched_tag", 0400, hctx_wait_on_sched_tag_show},
 	{},
 };
 
@@ -815,3 +866,61 @@ void blk_mq_debugfs_unregister_sched_hctx(struct blk_mq_hw_ctx *hctx)
 	debugfs_remove_recursive(hctx->sched_debugfs_dir);
 	hctx->sched_debugfs_dir = NULL;
 }
+
+/**
+ * blk_mq_debugfs_alloc_hctx_stats - Allocate per-cpu starvation statistics
+ * @hctx: hardware context associated with the tag allocation
+ * @gfp: memory allocation flags
+ *
+ * Allocates the per-cpu memory for tracking hardware and scheduler tag
+ * starvation.
+ */
+void blk_mq_debugfs_alloc_hctx_stats(struct blk_mq_hw_ctx *hctx, gfp_t gfp)
+{
+	if (!hctx->wait_on_hw_tag)
+		hctx->wait_on_hw_tag = alloc_percpu_gfp(unsigned long,
+							gfp);
+	if (!hctx->wait_on_sched_tag)
+		hctx->wait_on_sched_tag = alloc_percpu_gfp(unsigned long,
+							   gfp);
+}
+
+/**
+ * blk_mq_debugfs_free_hctx_stats - Free per-cpu starvation statistics
+ * @hctx: hardware context associated with the tag allocation
+ *
+ * Frees the per-cpu memory used for tracking hardware and scheduler tag
+ * starvation. This must only be called during hardware queue teardown when
+ * the queue is safely frozen and no active I/O submissions can race to
+ * increment the statistics.
+ */
+void blk_mq_debugfs_free_hctx_stats(struct blk_mq_hw_ctx *hctx)
+{
+	free_percpu(hctx->wait_on_hw_tag);
+	hctx->wait_on_hw_tag = NULL;
+	free_percpu(hctx->wait_on_sched_tag);
+	hctx->wait_on_sched_tag = NULL;
+}
+
+/**
+ * blk_mq_debugfs_inc_wait_tags - increment the tag starvation counters
+ * @hctx: hardware context associated with the tag allocation
+ * @is_sched: true if the starved pool is the software scheduler
+ *
+ * Evaluates the exhausted tag pool and safely increments the appropriate
+ * per-cpu debugfs starvation counter.
+ *
+ * Note: The per-cpu pointers are explicitly checked to prevent a NULL
+ * pointer dereference in the event that the system was under heavy memory
+ * pressure and the initial per-cpu allocation failed.
+ */
+void blk_mq_debugfs_inc_wait_tags(struct blk_mq_hw_ctx *hctx,
+				  bool is_sched)
+{
+	unsigned long __percpu *tags = is_sched ?
+			READ_ONCE(hctx->wait_on_sched_tag) :
+			READ_ONCE(hctx->wait_on_hw_tag);
+
+	if (likely(tags))
+		raw_cpu_inc(*tags);
+}
diff --git a/block/blk-mq-debugfs.h b/block/blk-mq-debugfs.h
index 49bb1aaa83dc..7a7c0f376a2b 100644
--- a/block/blk-mq-debugfs.h
+++ b/block/blk-mq-debugfs.h
@@ -17,6 +17,8 @@ struct blk_mq_debugfs_attr {
 	const struct seq_operations *seq_ops;
 };
 
+void blk_mq_debugfs_inc_wait_tags(struct blk_mq_hw_ctx *hctx,
+				  bool is_sched);
 int __blk_mq_debugfs_rq_show(struct seq_file *m, struct request *rq);
 int blk_mq_debugfs_rq_show(struct seq_file *m, void *v);
 
@@ -26,6 +28,9 @@ void blk_mq_debugfs_register_hctx(struct request_queue *q,
 void blk_mq_debugfs_unregister_hctx(struct blk_mq_hw_ctx *hctx);
 void blk_mq_debugfs_register_hctxs(struct request_queue *q);
 void blk_mq_debugfs_unregister_hctxs(struct request_queue *q);
+void blk_mq_debugfs_alloc_hctx_stats(struct blk_mq_hw_ctx *hctx,
+				     gfp_t gfp);
+void blk_mq_debugfs_free_hctx_stats(struct blk_mq_hw_ctx *hctx);
 
 void blk_mq_debugfs_register_sched(struct request_queue *q);
 void blk_mq_debugfs_unregister_sched(struct request_queue *q);
@@ -35,6 +40,11 @@ void blk_mq_debugfs_unregister_sched_hctx(struct blk_mq_hw_ctx *hctx);
 
 void blk_mq_debugfs_register_rq_qos(struct request_queue *q);
 #else
+static inline void blk_mq_debugfs_inc_wait_tags(struct blk_mq_hw_ctx *hctx,
+						bool is_sched)
+{
+}
+
 static inline void blk_mq_debugfs_register(struct request_queue *q)
 {
 }
@@ -56,6 +66,15 @@ static inline void blk_mq_debugfs_unregister_hctxs(struct request_queue *q)
 {
 }
 
+static inline void blk_mq_debugfs_alloc_hctx_stats(struct blk_mq_hw_ctx *hctx,
+						   gfp_t gfp)
+{
+}
+
+static inline void blk_mq_debugfs_free_hctx_stats(struct blk_mq_hw_ctx *hctx)
+{
+}
+
 static inline void blk_mq_debugfs_register_sched(struct request_queue *q)
 {
 }
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 66138dd043d4..3cc6a97a87a0 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -17,6 +17,7 @@
 #include "blk.h"
 #include "blk-mq.h"
 #include "blk-mq-sched.h"
+#include "blk-mq-debugfs.h"
 
 /*
  * Recalculate wakeup batch when tag is shared by hctx.
@@ -191,6 +192,9 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 		trace_block_rq_tag_wait(data->q, data->hctx,
 					data->rq_flags & RQF_SCHED_TAGS);
 
+		blk_mq_debugfs_inc_wait_tags(data->hctx,
+					     data->rq_flags & RQF_SCHED_TAGS);
+
 		bt_prev = bt;
 		io_schedule();
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4c5c16cce4f8..cd52bf6f82ce 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3991,6 +3991,8 @@ static void blk_mq_exit_hctx(struct request_queue *q,
 			blk_free_flush_queue_callback);
 	hctx->fq = NULL;
 
+	blk_mq_debugfs_free_hctx_stats(hctx);
+
 	spin_lock(&q->unused_hctx_lock);
 	list_add(&hctx->hctx_list, &q->unused_hctx_list);
 	spin_unlock(&q->unused_hctx_lock);
@@ -4016,6 +4018,8 @@ static int blk_mq_init_hctx(struct request_queue *q,
 {
 	gfp_t gfp = GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY;
 
+	blk_mq_debugfs_alloc_hctx_stats(hctx, gfp);
+
 	hctx->fq = blk_alloc_flush_queue(hctx->numa_node, set->cmd_size, gfp);
 	if (!hctx->fq)
 		goto fail;
@@ -4041,6 +4045,7 @@ static int blk_mq_init_hctx(struct request_queue *q,
 	blk_free_flush_queue(hctx->fq);
 	hctx->fq = NULL;
  fail:
+	blk_mq_debugfs_free_hctx_stats(hctx);
 	return -1;
 }
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 18a2388ba581..41d61488d683 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -453,6 +453,18 @@ struct blk_mq_hw_ctx {
 	struct dentry		*debugfs_dir;
 	/** @sched_debugfs_dir:	debugfs directory for the scheduler. */
 	struct dentry		*sched_debugfs_dir;
+	/**
+	 * @wait_on_hw_tag: Cumulative per-cpu counter incremented each
+	 * time a submitting context is forced to block due to physical
+	 * hardware tag exhaustion.
+	 */
+	unsigned long __percpu	*wait_on_hw_tag;
+	/**
+	 * @wait_on_sched_tag: Cumulative per-cpu counter incremented each
+	 * time a submitting context is forced to block due to software
+	 * scheduler tag exhaustion.
+	 */
+	unsigned long __percpu	*wait_on_sched_tag;
 #endif
 
 	/**
-- 
2.51.0


^ permalink raw reply related

* [PATCH v6 1/2] blk-mq: add tracepoint block_rq_tag_wait
From: Aaron Tomlin @ 2026-05-17 21:36 UTC (permalink / raw)
  To: axboe, rostedt, mhiramat, mathieu.desnoyers
  Cc: bvanassche, johannes.thumshirn, kch, dlemoal, ritesh.list,
	loberman, neelx, sean, mproche, chjohnst, linux-block,
	linux-kernel, linux-trace-kernel
In-Reply-To: <20260517213614.350367-1-atomlin@atomlin.com>

In high-performance storage environments, particularly when utilising
RAID controllers with shared tag sets (BLK_MQ_F_TAG_HCTX_SHARED), severe
latency spikes can occur when fast devices (SSDs) are starved of hardware
tags when sharing the same blk_mq_tag_set.

Currently, diagnosing this specific hardware queue contention is
difficult. When a CPU thread exhausts the tag pool, blk_mq_get_tag()
forces the current thread to block uninterruptible via io_schedule().
While this can be inferred via sched:sched_switch or dynamically
traced by attaching a kprobe to blk_mq_mark_tag_wait(), there is no
dedicated, out-of-the-box observability for this event.

This patch introduces the block_rq_tag_wait trace point in the tag
allocation slow-path. It triggers immediately before the thread yields
the CPU, exposing the exact hardware context (hctx) that is starved, the
specific pool experiencing starvation (hardware or software scheduler),
and the total pool depth.

This provides storage engineers and performance monitoring agents
with a zero-configuration, low-overhead mechanism to definitively
identify shared-tag bottlenecks and tune I/O schedulers or cgroup
throttling accordingly.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Laurence Oberman <loberman@redhat.com>
Tested-by: Laurence Oberman <loberman@redhat.com>
Signed-off-by: Aaron Tomlin <atomlin@atomlin.com>
---
 block/blk-mq-tag.c           |  4 ++++
 include/trace/events/block.h | 43 ++++++++++++++++++++++++++++++++++++
 2 files changed, 47 insertions(+)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 33946cdb5716..66138dd043d4 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -13,6 +13,7 @@
 #include <linux/kmemleak.h>
 
 #include <linux/delay.h>
+#include <trace/events/block.h>
 #include "blk.h"
 #include "blk-mq.h"
 #include "blk-mq-sched.h"
@@ -187,6 +188,9 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 		if (tag != BLK_MQ_NO_TAG)
 			break;
 
+		trace_block_rq_tag_wait(data->q, data->hctx,
+					data->rq_flags & RQF_SCHED_TAGS);
+
 		bt_prev = bt;
 		io_schedule();
 
diff --git a/include/trace/events/block.h b/include/trace/events/block.h
index 6aa79e2d799c..d6b9a6bc3c63 100644
--- a/include/trace/events/block.h
+++ b/include/trace/events/block.h
@@ -226,6 +226,49 @@ DECLARE_EVENT_CLASS(block_rq,
 		  IOPRIO_PRIO_LEVEL(__entry->ioprio), __entry->comm)
 );
 
+/**
+ * block_rq_tag_wait - triggered when a request is starved of a tag
+ * @q: request queue of the target device
+ * @hctx: hardware context of the request experiencing starvation
+ * @is_sched_tag: indicates whether the starved pool is the software scheduler
+ *
+ * Called immediately before the submitting context is forced to block due
+ * to the exhaustion of available tags (i.e., physical hardware driver tags
+ * or software scheduler tags). This trace point indicates that the context
+ * will be placed into an uninterruptible state via io_schedule() until an
+ * active request completes and relinquishes its assigned tag.
+ */
+TRACE_EVENT(block_rq_tag_wait,
+
+	TP_PROTO(struct request_queue *q, struct blk_mq_hw_ctx *hctx, bool is_sched_tag),
+
+	TP_ARGS(q, hctx, is_sched_tag),
+
+	TP_STRUCT__entry(
+		__field( dev_t,		dev			)
+		__field( u32,		hctx_id			)
+		__field( u32,		nr_tags			)
+		__field( bool,		is_sched_tag		)
+	),
+
+	TP_fast_assign(
+		__entry->dev		= q->disk ? disk_devt(q->disk) : 0;
+		__entry->hctx_id	= hctx->queue_num;
+		__entry->is_sched_tag	= is_sched_tag;
+
+		if (is_sched_tag)
+			__entry->nr_tags = hctx->sched_tags->nr_tags;
+		else
+			__entry->nr_tags = hctx->tags->nr_tags;
+	),
+
+	TP_printk("%d,%d hctx=%u starved on %s tags (depth=%u)",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->hctx_id,
+		  __entry->is_sched_tag ? "scheduler" : "hardware",
+		  __entry->nr_tags)
+);
+
 /**
  * block_rq_insert - insert block operation request into queue
  * @rq: block IO operation request
-- 
2.51.0


^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox