Linux Trace Kernel

Linux Trace Kernel
 help / color / mirror / Atom feed

* [PATCH v4 0/2] blk-mq: introduce tag starvation observability
From: Aaron Tomlin @ 2026-04-19  2:30 UTC (permalink / raw)
  To: axboe, rostedt, mhiramat, mathieu.desnoyers
  Cc: bvanassche, johannes.thumshirn, kch, dlemoal, ritesh.list,
	loberman, neelx, sean, mproche, chjohnst, linux-block,
	linux-kernel, linux-trace-kernel

Hi Jens, Steve, Masami,

In high-performance storage environments, particularly when utilising RAID
controllers with shared tag sets (BLK_MQ_F_TAG_HCTX_SHARED), severe latency
spikes can occur when fast devices are starved of available tags.
Currently, diagnosing this specific queue contention requires deploying
dynamic kprobes or inferring sleep states, which lacks a simple,
out-of-the-box diagnostic path.

This short series introduces dedicated, low-overhead observability for tag
exhaustion events in the block layer:

  - Patch 1 introduces the "block_rq_tag_wait" tracepoint in the tag
    allocation slow-path to capture precise, event-based starvation.

  - Patch 2 complements this by exposing "wait_on_hw_tag" and
    "wait_on_sched_tag" per-CPU counters via debugfs for quick,
    point-in-time cumulative polling.

Together, these provide storage engineers with zero-configuration
mechanisms to definitively identify shared-tag bottlenecks.

Please let me know your thoughts.


Changes since v3 [1]:
 - Transitioned tracking architecture from shared atomic_t variables to
   dynamically allocated per-CPU counters to resolve cache line bouncing
   (Bart Van Assche)

Changes since v2 [2]:
 - Added "Reviewed-by:" and "Tested-by:" tags for patch 1
 - Evaluate is_sched_tag directly within TP_fast_assign (Steven Rostedt)
 - Introduced atomic counters via debugfs 

Changes since v1 [3]:
 - Improved the description of the trace point (Damien Le Moal)
 - Removed the redundant "active requests" (Laurence Oberman)
 - Introduced pool-specific starvation tracking

[1]: https://lore.kernel.org/lkml/20260319221956.332770-1-atomlin@atomlin.com/
[2]: https://lore.kernel.org/lkml/20260319015300.287653-1-atomlin@atomlin.com/
[3]: https://lore.kernel.org/lkml/20260317182835.258183-1-atomlin@atomlin.com/

Aaron Tomlin (2):
  blk-mq: add tracepoint block_rq_tag_wait
  blk-mq: expose tag starvation counts via debugfs

 block/blk-mq-debugfs.c       | 84 ++++++++++++++++++++++++++++++++++++
 block/blk-mq-debugfs.h       |  7 +++
 block/blk-mq-tag.c           |  8 ++++
 include/linux/blk-mq.h       | 12 ++++++
 include/trace/events/block.h | 43 ++++++++++++++++++
 5 files changed, 154 insertions(+)

-- 
2.51.0


^ permalink raw reply

* Re: [RFC PATCH 1/2] kernel/notifier: replace single-linked list with double-linked list for reverse traversal
From: Song Chen @ 2026-04-19  0:21 UTC (permalink / raw)
  To: David Laight
  Cc: rafael, lenb, mturquette, sboyd, viresh.kumar, agk, snitzer,
	mpatocka, bmarzins, song, yukuai, linan122, jason.wessel, danielt,
	dianders, horms, davem, edumazet, kuba, pabeni, paulmck, frederic,
	mcgrof, petr.pavlu, da.gomez, samitolvanen, atomlin, jpoimboe,
	jikos, mbenes, pmladek, joe.lawrence, rostedt, mhiramat,
	mark.rutland, mathieu.desnoyers, linux-modules, linux-kernel,
	linux-trace-kernel, linux-acpi, linux-clk, linux-pm,
	live-patching, dm-devel, linux-raid, kgdb-bugreport, netdev
In-Reply-To: <20260416133004.07bd2886@pumpkin>

Hi,

On 4/16/26 20:30, David Laight wrote:
> On Wed, 15 Apr 2026 15:01:37 +0800
> chensong_2000@189.cn wrote:
> 
>> From: Song Chen <chensong_2000@189.cn>
>>
>> The current notifier chain implementation uses a single-linked list
>> (struct notifier_block *next), which only supports forward traversal
>> in priority order. This makes it difficult to handle cleanup/teardown
>> scenarios that require notifiers to be called in reverse priority order.
> 
> If it is only cleanup/teardown then the list can be order-reversed
> as part of that process at the same time as the list is deleted.
> 
> 	David
> 
> 
> 

Sorry, i don't follow, the notifiers in the list are deleted when 
calling notifier_chain_unregister, other than that, they are traversed 
forward and backward.

Song


^ permalink raw reply

* Re: [RFC PATCH 1/2] kernel/notifier: replace single-linked list with double-linked list for reverse traversal
From: Song Chen @ 2026-04-19  0:07 UTC (permalink / raw)
  To: Petr Mladek
  Cc: rafael, lenb, mturquette, sboyd, viresh.kumar, agk, snitzer,
	mpatocka, bmarzins, song, yukuai, linan122, jason.wessel, danielt,
	dianders, horms, davem, edumazet, kuba, pabeni, paulmck, frederic,
	mcgrof, petr.pavlu, da.gomez, samitolvanen, atomlin, jpoimboe,
	jikos, mbenes, joe.lawrence, rostedt, mhiramat, mark.rutland,
	mathieu.desnoyers, linux-modules, linux-kernel,
	linux-trace-kernel, linux-acpi, linux-clk, linux-pm,
	live-patching, dm-devel, linux-raid, kgdb-bugreport, netdev
In-Reply-To: <aeC7ByGA5MHBcGQR@pathway.suse.cz>

Hi,


On 4/16/26 18:33, Petr Mladek wrote:
> On Wed 2026-04-15 15:01:37, chensong_2000@189.cn wrote:
>> From: Song Chen <chensong_2000@189.cn>
>>
>> The current notifier chain implementation uses a single-linked list
>> (struct notifier_block *next), which only supports forward traversal
>> in priority order. This makes it difficult to handle cleanup/teardown
>> scenarios that require notifiers to be called in reverse priority order.
>>
>> A concrete example is the ordering dependency between ftrace and
>> livepatch during module load/unload. see the detail here [1].
>>
>> This patch replaces the single-linked list in struct notifier_block
>> with a struct list_head, converting the notifier chain into a
>> doubly-linked list sorted in descending priority order. Based on
>> this, a new function notifier_call_chain_reverse() is introduced,
>> which traverses the chain in reverse (ascending priority order).
>> The corresponding blocking_notifier_call_chain_reverse() is also
>> added as the locking wrapper for blocking notifier chains.
>>
>> The internal notifier_call_chain_robust() is updated to use
>> notifier_call_chain_reverse() for rollback: on error, it records
>> the failing notifier (last_nb) and the count of successfully called
>> notifiers (nr), then rolls back exactly those nr-1 notifiers in
>> reverse order starting from last_nb's predecessor, without needing
>> to know the total length of the chain.
>>
>> With this change, subsystems with symmetric setup/teardown ordering
>> requirements can register a single notifier_block with one priority
>> value, and rely on blocking_notifier_call_chain() for forward
>> traversal and blocking_notifier_call_chain_reverse() for reverse
>> traversal, without needing hard-coded call sequences or separate
>> notifier registrations for each direction.
>>
>> [1]:https://lore.kernel.org/all
>> 	/alpine.LNX.2.00.1602172216491.22700@cbobk.fhfr.pm/
>>
>> --- a/include/linux/notifier.h
>> +++ b/include/linux/notifier.h
>> @@ -53,41 +53,41 @@ typedef	int (*notifier_fn_t)(struct notifier_block *nb,
> [...]
>>   struct notifier_block {
>>   	notifier_fn_t notifier_call;
>> -	struct notifier_block __rcu *next;
>> +	struct list_head __rcu entry;
>>   	int priority;
>>   };
> [...]
>>   #define ATOMIC_INIT_NOTIFIER_HEAD(name) do {	\
>>   		spin_lock_init(&(name)->lock);	\
>> -		(name)->head = NULL;		\
>> +		INIT_LIST_HEAD(&(name)->head);		\
> 
> I would expect the RCU variant here, aka INIT_LIST_HEAD_RCU().

I'm not familiar with list rcu, but i will look into it and give it a try.
> 
>> --- a/kernel/notifier.c
>> +++ b/kernel/notifier.c
>> @@ -14,39 +14,47 @@
>>    *	are layered on top of these, with appropriate locking added.
>>    */
>>   
>> -static int notifier_chain_register(struct notifier_block **nl,
>> +static int notifier_chain_register(struct list_head *nl,
>>   				   struct notifier_block *n,
>>   				   bool unique_priority)
>>   {
>> -	while ((*nl) != NULL) {
>> -		if (unlikely((*nl) == n)) {
>> +	struct notifier_block *cur;
>> +
>> +	list_for_each_entry(cur, nl, entry) {
>> +		if (unlikely(cur == n)) {
>>   			WARN(1, "notifier callback %ps already registered",
>>   			     n->notifier_call);
>>   			return -EEXIST;
>>   		}
>> -		if (n->priority > (*nl)->priority)
>> -			break;
>> -		if (n->priority == (*nl)->priority && unique_priority)
>> +
>> +		if (n->priority == cur->priority && unique_priority)
>>   			return -EBUSY;
>> -		nl = &((*nl)->next);
>> +
>> +		if (n->priority > cur->priority) {
>> +			list_add_tail(&n->entry, &cur->entry);
>> +			goto out;
>> +		}
>>   	}
>> -	n->next = *nl;
>> -	rcu_assign_pointer(*nl, n);
>> +
>> +	list_add_tail(&n->entry, nl);
> 
> I would expect list_add_tail_rcu() here.
> 
>> @@ -59,25 +67,25 @@ static int notifier_chain_unregister(struct notifier_block **nl,
>>    *			value of this parameter is -1.
>>    *	@nr_calls:	Records the number of notifications sent. Don't care
>>    *			value of this field is NULL.
>> + *	@last_nb:  Records the last called notifier block for rolling back
>>    *	Return:		notifier_call_chain returns the value returned by the
>>    *			last notifier function called.
>>    */
>> -static int notifier_call_chain(struct notifier_block **nl,
>> +static int notifier_call_chain(struct list_head *nl,
>>   			       unsigned long val, void *v,
>> -			       int nr_to_call, int *nr_calls)
>> +			       int nr_to_call, int *nr_calls,
>> +				   struct notifier_block **last_nb)
>>   {
>>   	int ret = NOTIFY_DONE;
>> -	struct notifier_block *nb, *next_nb;
>> -
>> -	nb = rcu_dereference_raw(*nl);
>> +	struct notifier_block *nb;
>>   
>> -	while (nb && nr_to_call) {
>> -		next_nb = rcu_dereference_raw(nb->next);
>> +	if (!nr_to_call)
>> +		return ret;
>>   
>> +	list_for_each_entry(nb, nl, entry) {
> 
> I would expect the RCU variant here, aka list_for_each_rcu()
> 
> These are just two random examples which I found by a quick look.
> 
> I guess that the notifier API is very old and it does not use all
> the RCU API features which allow to track safety when
> CONFIG_PROVE_RCU and CONFIG_PROVE_RCU_LIST are enabled.
> 
> It actually might be worth to audit the code and make it right.
> 
>>   #ifdef CONFIG_DEBUG_NOTIFIERS
>>   		if (unlikely(!func_ptr_is_kernel_text(nb->notifier_call))) {
>>   			WARN(1, "Invalid notifier called!");
>> -			nb = next_nb;
>>   			continue;
>>   		}
>>   #endif
> 
> That said, I am not sure if the ftrace/livepatching handlers are
> the right motivation for this. Especially when I see the
> complexity of the 2nd patch [*]
> 
> After thinking more about it. I am not even sure that the ftrace and
> livepatching callbacks are good candidates for generic notifiers.
> They are too special. It is not only about ordering them against
> each other. But it is also about ordering them against other
> notifiers. The ftrace/livepatching callbacks must be the first/last
> during module load/release.
> 
> [*] The 2nd patch is not archived by lore for some reason.
>      I have found only a review but it gives a good picture, see
>      https://lore.kernel.org/all/1191caf5-6a61-4622-a15e-854d3701f4fc@suse.com/
> 
> Best Regards,
> Petr
> 

Thanks.

BR

Song

^ permalink raw reply

* Re: [PATCH v2 00/19] tracepoint: Avoid double static_branch evaluation at guarded call sites
From: Steven Rostedt @ 2026-04-18 23:04 UTC (permalink / raw)
  To: Vineeth Pillai (Google)
  Cc: Peter Zijlstra, Dmitry Ilvokhin, Masami Hiramatsu,
	Mathieu Desnoyers, Ingo Molnar, Jens Axboe, io-uring,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Alexei Starovoitov, Daniel Borkmann, Marcelo Ricardo Leitner,
	Xin Long, Jon Maloy, Aaron Conole, Eelco Chaudron, Ilya Maximets,
	netdev, bpf, linux-sctp, tipc-discussion, dev, Jiri Pirko,
	Oded Gabbay, Koby Elbaz, dri-devel, Rafael J. Wysocki,
	Viresh Kumar, Gautham R. Shenoy, Huang Rui, Mario Limonciello,
	Len Brown, Srinivas Pandruvada, linux-pm, MyungJoo Ham,
	Kyungmin Park, Chanwoo Choi, Christian König, Sumit Semwal,
	linaro-mm-sig, Eddie James, Andrew Jeffery, Joel Stanley,
	linux-fsi, David Airlie, Simona Vetter, Alex Deucher,
	Danilo Krummrich, Matthew Brost, Philipp Stanner, Harry Wentland,
	Leo Li, amd-gfx, Jiri Kosina, Benjamin Tissoires, linux-input,
	Wolfram Sang, linux-i2c, Mark Brown, Michael Hennerich,
	Nuno Sá, linux-spi, James E.J. Bottomley, Martin K. Petersen,
	linux-scsi, Chris Mason, David Sterba, linux-btrfs,
	Thomas Gleixner, Andrew Morton, SeongJae Park, linux-mm,
	Borislav Petkov, Dave Hansen, x86, linux-trace-kernel,
	linux-kernel
In-Reply-To: <20260323160052.17528-1-vineeth@bitbyteword.org>

On Mon, 23 Mar 2026 12:00:19 -0400
"Vineeth Pillai (Google)" <vineeth@bitbyteword.org> wrote:

>   if (trace_foo_enabled() && cond)
>       trace_call__foo(args);   /* calls __do_trace_foo() directly */

Hi Vineeth,

Could you rebase this series on top of 7.1-rc1 when it comes out?
Several of these patches were accepted already. Obviously drop those.
They were the patches that added the feature, and any where the
maintainer acked the patch.

Now that the feature has been accepted, if you post the patch series
again after 7.1-rc1 with all the patches that haven't been accepted
yet, then the maintainers can simply take them directly. As the feature
is now accepted, there's no dependency on it, and they don't need to go
through the tracing tree.

Thanks,

-- Steve

^ permalink raw reply

* Re: [PATCH] trace: propagate registration failure from tracing_start_*_record()
From: Steven Rostedt @ 2026-04-18 19:52 UTC (permalink / raw)
  To: Yash Suthar
  Cc: mhiramat, mathieu.desnoyers, linux-kernel, linux-trace-kernel,
	skhan, me, syzbot+a1d25e53cd4a10f7f2d3
In-Reply-To: <CAPfzD4kQg94qA0oRd+=JOju=+a76872WCsBSo8=aNG5QduNQbg@mail.gmail.com>

On Sat, 18 Apr 2026 11:08:42 +0530
Yash Suthar <yashsuthar983@gmail.com> wrote:

> Hello Steven,
> Thank you for taking a look and really sorry.
> 
> I did use ai assistance for commit message, but I reviewed, modified
> and tested(with syzbot locally) the code myself. I should have
> disclosed really sorry.
> 
> One thing I want to know (or I am still missing something):
> sched_cmdline_ref is incremented before tracing_sched_register() and
> register fails,  but sched_cmdline_ref stays at 1 and on disable
> tracepoint_remove_func() sees NULL and return error (as syzbot
> reported and reproduce also locally).
> your suggestion WARN_ONCE correctly flags the upstream failure, but
> the secondary WARN at  tracepoint.c:358 will still fire on the next
> disable, since the refcount desync isn't addressed. Was that
> intentional ?

Yes.

This is why I'm not too thrilled about syzbot injecting kmalloc
failures. These injections are for one page or less, in which case the
system is in pretty much a failed state anyway.

I don't care about warnings being triggering due to kmalloc failures.
I'll fix UAF or NULL pointer dereferences, but warnings? No!

If you fail to alloc a single page, expect a lot of warnings to happen.
That is intentional. syzbot should not flag an error for a warning that
was triggered due to a single page memory failure, unless that memory
was allocated by in atomic or something where it can't do reclaim.

-- Steve

^ permalink raw reply

* [PATCH] eventfs: Hold eventfs_mutex and SRCU when remount walks events
From: David Carlier @ 2026-04-18 19:17 UTC (permalink / raw)
  To: rostedt, mhiramat
  Cc: mathieu.desnoyers, linux-trace-kernel, linux-kernel,
	David Carlier, stable

Commit 340f0c7067a9 ("eventfs: Update all the eventfs_inodes from the
events descriptor") had eventfs_set_attrs() recurse through ei->children
on remount.  The walk only holds the rcu_read_lock() taken by
tracefs_apply_options() over tracefs_inodes, which is wrong:

  - list_for_each_entry over ei->children races with the list_del_rcu()
    in eventfs_remove_rec() -- LIST_POISON1 deref, same shape as
    d2603279c7d6.
  - eventfs_inodes are freed via call_srcu(&eventfs_srcu, ...).
    rcu_read_lock() does not extend an SRCU grace period, so ti->private
    can be reclaimed under the walk.
  - The writes to ei->attr race with eventfs_set_attr(), which holds
    eventfs_mutex.

Reproducer:

  while :; do mount -o remount,uid=$((RANDOM%1000)) /sys/kernel/tracing; done &
  while :; do
      echo "p:kp submit_bio" > /sys/kernel/tracing/kprobe_events
      echo > /sys/kernel/tracing/kprobe_events
  done

Wrap the events portion of tracefs_apply_options() in
eventfs_remount_lock()/_unlock() that take eventfs_mutex and
srcu_read_lock(&eventfs_srcu).  eventfs_set_attrs() doesn't sleep so the
nested rcu_read_lock() is fine; lockdep_assert_held() pins the contract.

Comment in tracefs_drop_inode() said "RCU cycle" -- it is SRCU.

Fixes: 340f0c7067a9 ("eventfs: Update all the eventfs_inodes from the events descriptor")
Cc: stable@vger.kernel.org
Signed-off-by: David Carlier <devnexen@gmail.com>
---
 fs/tracefs/event_inode.c | 14 ++++++++++++++
 fs/tracefs/inode.c       |  5 ++++-
 fs/tracefs/internal.h    |  3 +++
 3 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 81df94038f2e..79193021c6b0 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -244,6 +244,8 @@ static void eventfs_set_attrs(struct eventfs_inode *ei, bool update_uid, kuid_t
 {
 	struct eventfs_inode *ei_child;
 
+	lockdep_assert_held(&eventfs_mutex);
+
 	/* Update events/<system>/<event> */
 	if (WARN_ON_ONCE(level > 3))
 		return;
@@ -886,3 +888,15 @@ void eventfs_remove_events_dir(struct eventfs_inode *ei)
 	d_invalidate(dentry);
 	d_make_discardable(dentry);
 }
+
+int eventfs_remount_lock(void)
+{
+	mutex_lock(&eventfs_mutex);
+	return srcu_read_lock(&eventfs_srcu);
+}
+
+void eventfs_remount_unlock(int srcu_idx)
+{
+	srcu_read_unlock(&eventfs_srcu, srcu_idx);
+	mutex_unlock(&eventfs_mutex);
+}
diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
index 03f768536fd5..f3d6188a3b7b 100644
--- a/fs/tracefs/inode.c
+++ b/fs/tracefs/inode.c
@@ -313,6 +313,7 @@ static int tracefs_apply_options(struct super_block *sb, bool remount)
 	struct inode *inode = d_inode(sb->s_root);
 	struct tracefs_inode *ti;
 	bool update_uid, update_gid;
+	int srcu_idx;
 	umode_t tmp_mode;
 
 	/*
@@ -337,6 +338,7 @@ static int tracefs_apply_options(struct super_block *sb, bool remount)
 		update_uid = fsi->opts & BIT(Opt_uid);
 		update_gid = fsi->opts & BIT(Opt_gid);
 
+		srcu_idx = eventfs_remount_lock();
 		rcu_read_lock();
 		list_for_each_entry_rcu(ti, &tracefs_inodes, list) {
 			if (update_uid) {
@@ -358,6 +360,7 @@ static int tracefs_apply_options(struct super_block *sb, bool remount)
 				eventfs_remount(ti, update_uid, update_gid);
 		}
 		rcu_read_unlock();
+		eventfs_remount_unlock(srcu_idx);
 	}
 
 	return 0;
@@ -403,7 +406,7 @@ static int tracefs_drop_inode(struct inode *inode)
 	 * This inode is being freed and cannot be used for
 	 * eventfs. Clear the flag so that it doesn't call into
 	 * eventfs during the remount flag updates. The eventfs_inode
-	 * gets freed after an RCU cycle, so the content will still
+	 * gets freed after an SRCU cycle, so the content will still
 	 * be safe if the iteration is going on now.
 	 */
 	ti->flags &= ~TRACEFS_EVENT_INODE;
diff --git a/fs/tracefs/internal.h b/fs/tracefs/internal.h
index d83c2a25f288..a4a7f8431aff 100644
--- a/fs/tracefs/internal.h
+++ b/fs/tracefs/internal.h
@@ -76,4 +76,7 @@ struct inode *tracefs_get_inode(struct super_block *sb);
 void eventfs_remount(struct tracefs_inode *ti, bool update_uid, bool update_gid);
 void eventfs_d_release(struct dentry *dentry);
 
+int eventfs_remount_lock(void);
+void eventfs_remount_unlock(int srcu_idx);
+
 #endif /* _TRACEFS_INTERNAL_H */
-- 
2.53.0


^ permalink raw reply related

* [PATCH] eventfs: Use list_add_tail_rcu() for SRCU-protected children list
From: David Carlier @ 2026-04-18 15:22 UTC (permalink / raw)
  To: rostedt, mhiramat
  Cc: mathieu.desnoyers, linux-trace-kernel, linux-kernel,
	David Carlier, stable

Commit d2603279c7d6 ("eventfs: Use list_del_rcu() for SRCU protected
list variable") converted the removal side to pair with the
list_for_each_entry_srcu() walker in eventfs_iterate(). The insertion
in eventfs_create_dir() was left as a plain list_add_tail(), which on
weakly-ordered architectures can expose a new entry to the SRCU reader
before its list pointers and fields are observable.

Use list_add_tail_rcu() so the publication pairs with the existing
list_del_rcu() and list_for_each_entry_srcu().

Fixes: 43aa6f97c2d0 ("eventfs: Get rid of dentry pointers without refcounts")
Cc: stable@vger.kernel.org
Signed-off-by: David Carlier <devnexen@gmail.com>
---
 fs/tracefs/event_inode.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 81df94038f2e..8dd554508828 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -706,7 +706,7 @@ struct eventfs_inode *eventfs_create_dir(const char *name, struct eventfs_inode

 	scoped_guard(mutex, &eventfs_mutex) {
 		if (!parent->is_freed)
-			list_add_tail(&ei->list, &parent->children);
+			list_add_tail_rcu(&ei->list, &parent->children);
 	}
 	/* Was the parent freed? */
 	if (list_empty(&ei->list)) {
-- 
2.53.0

^ permalink raw reply related

* Re: [PATCHv5 bpf-next 10/28] bpf: Add multi tracing attach types
From: bot+bpf-ci @ 2026-04-18  6:49 UTC (permalink / raw)
  To: jolsa, ast, daniel, andrii
  Cc: bpf, linux-trace-kernel, kafai, eddyz87, songliubraving, yhs,
	menglong8.dong, rostedt, ast, andrii, daniel, martin.lau, eddyz87,
	yonghong.song, clm, ihor.solodrai
In-Reply-To: <20260417192502.194548-11-jolsa@kernel.org>

[-- Attachment #1: Type: text/plain, Size: 2812 bytes --]

> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -1154,6 +1154,8 @@ enum bpf_attach_type {
>  	BPF_TRACE_KPROBE_SESSION,
>  	BPF_TRACE_UPROBE_SESSION,
>  	BPF_TRACE_FSESSION,
> +	BPF_TRACE_FENTRY_MULTI,
> +	BPF_TRACE_FEXIT_MULTI,
>  	__MAX_BPF_ATTACH_TYPE
>  };

In RFC v2 (Feb 4, 2026), Leon Hwang asked whether BPF_TRACE_FSESSION_MULTI
should be added alongside BPF_TRACE_FENTRY_MULTI and BPF_TRACE_FEXIT_MULTI
for API consistency, since BPF_TRACE_FSESSION exists and is handled
similarly to FENTRY/FEXIT. You acknowledged this with 'good catch, will add
it' but BPF_TRACE_FSESSION_MULTI was never added in any subsequent version
(v1, v2, v3, v4, or v5).

Is this attach type intentionally omitted, or should it be included to
complete the multi tracing API?

Reference: https://lore.kernel.org/bpf/3015991d-2e9b-4bf4-9c17-26e00b5d5926@linux.dev/

[ ... ]

> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c

[ ... ]

> @@ -19368,7 +19379,17 @@
>  		if (ret < 0)
>  			return ret;
>
> -		if (tgt_prog) {
> +		/* *.multi programs don't need an address during program
> +		 * verification, we just take the module ref if needed.
> +		 */
> +		if (is_tracing_multi_id(prog, btf_id)) {
> +			if (btf_is_module(btf)) {
> +				mod = btf_try_get_module(btf);
> +				if (!mod)
> +					return -ENOENT;
> +			}
> +			addr = 0;
> +		} else if (tgt_prog) {

This isn't a bug, but the new multi-line comment here uses the old kernel
comment style with text on the same line as the opening '/*'. The BPF
subsystem guide mandates the modern multi-line style with '/*' on its own
line for new code under kernel/bpf/, even if surrounding code still uses
the old style.

Should the comment be reformatted as:

	/*
	 * *.multi programs don't need an address during program
	 * verification, we just take the module ref if needed.
	 */

> @@ -19396,6 +19417,12 @@
>  			ret = -EINVAL;
>  			switch (prog->type) {
>  			case BPF_PROG_TYPE_TRACING:
> +				/* *.multi sleepable programs will pass initial sleepable check,
> +				 * the actual attached btf ids are checked later during the link
> +				 * attachment.
> +				 */
> +				if (is_tracing_multi_id(prog, btf_id))
> +					ret = 0;

The same comment style issue appears here. Should this also use the modern
style with '/*' on its own line?

Note that another comment you added in check_attach_btf_id uses the correct
modern style, making this inconsistency clearly fixable.

---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/24598000047

^ permalink raw reply

* Re: [PATCHv5 bpf-next 04/28] bpf: Use mutex lock pool for bpf trampolines
From: bot+bpf-ci @ 2026-04-18  6:49 UTC (permalink / raw)
  To: jolsa, ast, daniel, andrii
  Cc: bpf, linux-trace-kernel, kafai, eddyz87, songliubraving, yhs,
	menglong8.dong, rostedt, ast, andrii, daniel, martin.lau, eddyz87,
	yonghong.song, clm, ihor.solodrai
In-Reply-To: <20260417192502.194548-5-jolsa@kernel.org>

[-- Attachment #1: Type: text/plain, Size: 2035 bytes --]

> diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c
> index f02254a21585..eb4ea78ff77f 100644
> --- a/kernel/bpf/trampoline.c
> +++ b/kernel/bpf/trampoline.c
> @@ -30,6 +30,34 @@ static struct hlist_head trampoline_ip_table[TRAMPOLINE_TABLE_SIZE];
>  /* serializes access to trampoline tables */
>  static DEFINE_MUTEX(trampoline_mutex);
>
> +/*
> + * We keep 32 trampoline locks (5 bits) in the pool, because there is
> + * 48 (MAX_LOCK_DEPTH) locks limit allowed to be simultaneously held
> + * by task. Each lock has its own lockdep key to keep it simple.
> + */

The comment explains why the pool has 32 locks (MAX_LOCK_DEPTH limit), but
could it clarify why each lock needs its own lock_class_key?

Without this detail, it's not obvious that distinct keys are needed to avoid
lockdep "recursive locking" warnings when trampoline_lock_all() acquires all
32 pool mutexes simultaneously (presumably in following changes).

This was raised by bot+bpf-ci@kernel.org in v3 review:

  "Comment explains 32 count but not why per-lock class keys needed; should
   mention avoiding lockdep recursive locking warnings"
  https://lore.kernel.org/bpf/31ae46274a3157f2b9840e1a09b2698d1ec0cfd461737ff460c2d3349a9f0054@mail.kernel.org/

The concern was acknowledged with "will add", but the v5 comment still says
only "to keep it simple" without addressing the lockdep recursive locking
concern.

The same issue was raised again in v5:

  https://lore.kernel.org/bpf/53b3e6cb10e12555d3b1b321eeb4e04a447c707c387568d1e9ef759fb6459ef4@mail.kernel.org/

> +#define TRAMPOLINE_LOCKS_BITS 5
> +#define TRAMPOLINE_LOCKS_TABLE_SIZE (1 << TRAMPOLINE_LOCKS_BITS)
> +
> +static struct {
> +	struct mutex mutex;
> +	struct lock_class_key key;
> +} trampoline_locks[TRAMPOLINE_LOCKS_TABLE_SIZE];

[ ... ]


---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/24598000047

^ permalink raw reply

* Re: [PATCHv5 bpf-next 20/28] libbpf: Add support to create tracing multi link
From: bot+bpf-ci @ 2026-04-18  6:10 UTC (permalink / raw)
  To: jolsa, ast, daniel, andrii
  Cc: bpf, linux-trace-kernel, kafai, eddyz87, songliubraving, yhs,
	menglong8.dong, rostedt, ast, andrii, daniel, martin.lau, eddyz87,
	yonghong.song, clm, ihor.solodrai
In-Reply-To: <20260417192502.194548-21-jolsa@kernel.org>

[-- Attachment #1: Type: text/plain, Size: 2170 bytes --]

> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index 49368ee161b1..9bf7e8754c7b 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c

[ ... ]

> @@ -12510,6 +12592,197 @@ bool btf_type_is_traceable_func(const struct btf *btf, const struct btf_type *t
>  	return true;
>  }
>
> +static int collect_func_ids_by_glob(const struct bpf_program *prog, const char *pattern, __u32 **ids)
> +{
> +	struct bpf_object *obj = prog->obj;
> +	const struct module_btf *mod;
> +	struct btf *btf = NULL;
> +	const char *sep;
> +	int err;
> +
> +	err = bpf_object__load_vmlinux_btf(obj, true);
> +	if (err)
> +		return err;
> +
> +	/* In case we have module specified, we will find its btf and use that. */
> +	sep = strchr(pattern, ':');
> +	if (sep) {
> +		mod = find_attach_module(obj, pattern);
> +		if (!mod) {
> +			err = -EINVAL;
> +			goto cleanup;
> +		}
> +		btf = mod->btf;
> +		pattern = sep + 1;
> +	} else {
> +		/* Program is loaded for kernel module. */
> +		if (prog->attach_btf_obj_fd)
> +			return -EINVAL;
> +		btf = obj->btf_vmlinux;
> +	}
> +
> +	err = collect_btf_func_ids_by_glob(btf, pattern, ids);
> +
> +cleanup:
> +	bpf_object_cleanup_btf(obj);
> +	return err;
> +}

Does this function leak the vmlinux BTF when prog->attach_btf_obj_fd is
set? The early return bypasses the cleanup label that calls
bpf_object_cleanup_btf(obj). Every other error path goes through
cleanup, which suggests the BTF loaded by bpf_object__load_vmlinux_btf()
needs to be cleaned up before returning.

The design pattern here is that bpf_object__load_vmlinux_btf(obj, true)
is called specifically at attach-time (after the object has already been
post-load-cleaned), so obj->btf_vmlinux is freshly loaded. Skipping the
cleanup leaves that freshly-loaded btf_vmlinux attached to the bpf_object
until bpf_object__close().

Should this be 'err = -EINVAL; goto cleanup;' to match the other error
paths?

[ ... ]


---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/24598000047

^ permalink raw reply

* Re: [PATCHv5 bpf-next 12/28] bpf: Add bpf_trampoline_multi_attach/detach functions
From: bot+bpf-ci @ 2026-04-18  6:10 UTC (permalink / raw)
  To: jolsa, ast, daniel, andrii
  Cc: bpf, linux-trace-kernel, kafai, eddyz87, songliubraving, yhs,
	menglong8.dong, rostedt, ast, andrii, daniel, martin.lau, eddyz87,
	yonghong.song, clm, ihor.solodrai
In-Reply-To: <20260417192502.194548-13-jolsa@kernel.org>

[-- Attachment #1: Type: text/plain, Size: 14299 bytes --]

> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h

[ ... ]

> diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c
> --- a/kernel/bpf/trampoline.c
> +++ b/kernel/bpf/trampoline.c
> @@ -1447,6 +1447,276 @@ int __weak arch_bpf_trampoline_size(const struct btf_func_model *m, u32 flags,
>  	return -ENOTSUPP;
>  }
>
> +#if defined(CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS) && \
> +    defined(CONFIG_HAVE_SINGLE_FTRACE_DIRECT_OPS) && \
> +    defined(CONFIG_BPF_SYSCALL)
> +
> +static void trampoline_lock_all(void)
> +{
> +	int i;
> +
> +	for (i = 0; i < TRAMPOLINE_LOCKS_TABLE_SIZE; i++)
> +		mutex_lock(&trampoline_locks[i].mutex);
> +}
> +
> +static void trampoline_unlock_all(void)
> +{
> +	int i;
> +
> +	for (i = 0; i < TRAMPOLINE_LOCKS_TABLE_SIZE; i++)
> +		mutex_unlock(&trampoline_locks[i].mutex);
> +}
> +
> +static void remove_tracing_multi_data(struct bpf_tracing_multi_data *data)
> +{
> +	ftrace_hash_remove(data->reg);
> +	ftrace_hash_remove(data->unreg);
> +	ftrace_hash_remove(data->modify);
> +}
> +
> +static void clear_tracing_multi_data(struct bpf_tracing_multi_data *data)
> +{
> +	remove_tracing_multi_data(data);
> +
> +	free_ftrace_hash(data->reg);
> +	free_ftrace_hash(data->unreg);
> +	free_ftrace_hash(data->modify);
> +}
> +
> +static int init_tracing_multi_data(struct bpf_tracing_multi_data *data)
> +{
> +	data->reg    = alloc_ftrace_hash(FTRACE_HASH_DEFAULT_BITS);
> +	data->unreg  = alloc_ftrace_hash(FTRACE_HASH_DEFAULT_BITS);
> +	data->modify = alloc_ftrace_hash(FTRACE_HASH_DEFAULT_BITS);
> +
> +	if (!data->reg || !data->unreg || !data->modify) {
> +		clear_tracing_multi_data(data);
> +		return -ENOMEM;
> +	}
> +	return 0;
> +}
> +
> +static void ftrace_hash_add(struct ftrace_hash *hash, struct ftrace_func_entry *entry,
> +			    unsigned long ip, unsigned long direct)
> +{
> +	entry->ip = ip;
> +	entry->direct = direct;
> +	add_ftrace_hash_entry(hash, entry);
> +}
> +
> +static int register_fentry_multi(struct bpf_trampoline *tr, struct bpf_tramp_image *im, void *ptr)
> +{
> +	unsigned long addr = (unsigned long) im->image;
> +	unsigned long ip = ftrace_location(tr->ip);
> +	struct bpf_tracing_multi_data *data = ptr;
> +
> +	if (bpf_trampoline_use_jmp(tr->flags))
> +		addr = ftrace_jmp_set(addr);
> +
> +	ftrace_hash_add(data->reg, data->entry, ip, addr);
> +	tr->cur_image = im;
> +	return 0;
> +}
> +
> +static int unregister_fentry_multi(struct bpf_trampoline *tr, u32 orig_flags, void *ptr)
> +{
> +	unsigned long addr = (unsigned long) tr->cur_image->image;
> +	unsigned long ip = ftrace_location(tr->ip);
> +	struct bpf_tracing_multi_data *data = ptr;
> +
> +	if (bpf_trampoline_use_jmp(tr->flags))
> +		addr = ftrace_jmp_set(addr);
> +
> +	ftrace_hash_add(data->unreg, data->entry, ip, addr);
> +	tr->cur_image = NULL;
> +	return 0;
> +}
> +
> +static int modify_fentry_multi(struct bpf_trampoline *tr, u32 orig_flags, struct bpf_tramp_image *im,
> +			       bool lock_direct_mutex, void *ptr)
> +{
> +	unsigned long addr = (unsigned long) im->image;
> +	unsigned long ip = ftrace_location(tr->ip);
> +	struct bpf_tracing_multi_data *data = ptr;
> +
> +	if (bpf_trampoline_use_jmp(tr->flags))
> +		addr = ftrace_jmp_set(addr);
> +
> +	ftrace_hash_add(data->modify, data->entry, ip, addr);
> +	tr->cur_image = im;
> +	return 0;
> +}
> +
> +static const struct bpf_trampoline_ops trampoline_multi_ops = {
> +	.register_fentry   = register_fentry_multi,
> +	.unregister_fentry = unregister_fentry_multi,
> +	.modify_fentry     = modify_fentry_multi,
> +};
> +
> +static void bpf_trampoline_multi_attach_init(struct bpf_trampoline *tr)
> +{
> +	tr->multi_attach.old_image = tr->cur_image;
> +	tr->multi_attach.old_flags = tr->flags;
> +}
> +
> +static void bpf_trampoline_multi_attach_free(struct bpf_trampoline *tr)
> +{
> +	if (tr->multi_attach.old_image)
> +		bpf_tramp_image_put(tr->multi_attach.old_image);
> +
> +	tr->multi_attach.old_image = NULL;
> +	tr->multi_attach.old_flags = 0;
> +}
> +
> +static void bpf_trampoline_multi_attach_rollback(struct bpf_trampoline *tr)
> +{
> +	if (tr->cur_image)
> +		bpf_tramp_image_put(tr->cur_image);
> +	tr->cur_image = tr->multi_attach.old_image;
> +	tr->flags = tr->multi_attach.old_flags;
> +
> +	tr->multi_attach.old_image = NULL;
> +	tr->multi_attach.old_flags = 0;
> +}
> +
> +#define for_each_mnode_cnt(mnode, link, cnt) \
> +	for (i = 0, mnode = &link->nodes[i]; i < cnt; i++, mnode = &link->nodes[i])
> +
> +#define for_each_mnode(mnode, link) \
> +	for_each_mnode_cnt(mnode, link, link->nodes_cnt)
> +
> +int bpf_trampoline_multi_attach(struct bpf_prog *prog, u32 *ids,
> +				struct bpf_tracing_multi_link *link)
> +{
> +	struct bpf_tracing_multi_data *data = &link->data;
> +	struct bpf_attach_target_info tgt_info = {};
> +	struct btf *btf = prog->aux->attach_btf;
> +	struct bpf_tracing_multi_node *mnode;
> +	struct bpf_trampoline *tr;
> +	int i, err, rollback_cnt;
> +	u64 key;
> +
> +	for_each_mnode(mnode, link) {
> +		rollback_cnt = i;
> +
> +		err = bpf_check_attach_btf_id_multi(btf, prog, ids[i], &tgt_info);
> +		if (err)
> +			goto rollback_put;
> +
> +		key = bpf_trampoline_compute_key(NULL, btf, ids[i]);
> +
> +		tr = bpf_trampoline_get(key, &tgt_info);
> +		if (!tr) {
> +			err = -ENOMEM;
> +			goto rollback_put;
> +		}
> +
> +		mnode->trampoline = tr;
> +		mnode->node.link = &link->link;
> +
> +		cond_resched();
> +	}
> +
> +	err = init_tracing_multi_data(data);
> +	if (err) {
> +		rollback_cnt = link->nodes_cnt;
> +		goto rollback_put;
> +	}
> +
> +	trampoline_lock_all();
> +
> +	for_each_mnode(mnode, link) {
> +		bpf_trampoline_multi_attach_init(mnode->trampoline);
> +
> +		data->entry = &mnode->entry;
> +		err = __bpf_trampoline_link_prog(&mnode->node, mnode->trampoline, NULL,
> +						 &trampoline_multi_ops, data);
> +		if (err) {
> +			rollback_cnt = i;
> +			goto rollback_unlink;
> +		}
> +	}

When user-provided ids[] contains duplicate BTF IDs (or distinct IDs
that resolve to the same trampoline key), multiple nodes point to the
same struct bpf_trampoline. The link loop above then calls
bpf_trampoline_multi_attach_init() more than once on that trampoline,
overwriting the saved old_image with the newly assigned cur_image from
the previous iteration.

Scenario with ids[0] == ids[1] and trampoline X starting with OLD_X:

  i=0: attach_init(X) saves old_image=OLD_X
       __bpf_trampoline_link_prog() -> modify_fentry_multi() sets
       X->cur_image=NEW_X (OLD_X refcount not dropped, intent is for
       multi_attach_free() to release it later).

  i=1: attach_init(X) re-runs on the same trampoline and overwrites
       old_image=NEW_X (the only saved reference to OLD_X is lost).
       __bpf_trampoline_link_prog() returns -EBUSY (duplicate prog).
       rollback_cnt=i=1; goto rollback_unlink.

  rollback_unlink calls bpf_trampoline_multi_attach_rollback(X) once:

    static void bpf_trampoline_multi_attach_rollback(struct bpf_trampoline *tr)
    {
        if (tr->cur_image)
            bpf_tramp_image_put(tr->cur_image);    /* puts NEW_X */
        tr->cur_image = tr->multi_attach.old_image;  /* = NEW_X (stale!) */
        ...
    }

Result: OLD_X leaks (its refcount is never decremented). cur_image
points at NEW_X which was just released via bpf_tramp_image_put().

The rollback_put loop calls bpf_trampoline_put() which frees the
trampoline (prog was removed, progs_hlist is empty), so the dangling
NEW_X pointer is cleaned up. However OLD_X remains allocated via
module_alloc forever.

Reachability: commit dbf2afe2f603 ("bpf: Add support for tracing multi
link") copies ids[] from userspace without deduplication. The series'
selftests (commit 411fb40d4b2a "selftests/bpf: Add tracing multi attach
fails test", 'fail#7 (kernel) attach with duplicate id') exercise this
path explicitly, expecting -EBUSY.

The trigger requires the shared trampoline to already have cur_image !=
NULL when the first iteration runs (i.e., another program attached to
the same function), so modify_fentry_multi() is used and there is a
real OLD_X to leak. The first-time-attach case (register_fentry_multi,
OLD_X=NULL) only leaves cur_image dangling briefly until the trampoline
is freed.

Can the code save old_image/old_flags once per trampoline instead of
once per node, or store them per-mnode to avoid overwriting the saved
state?

> +
> +	rollback_cnt = link->nodes_cnt;
> +	if (ftrace_hash_count(data->reg)) {
> +		err = update_ftrace_direct_add(&direct_ops, data->reg);
> +		if (err)
> +			goto rollback_unlink;
> +	}
> +
> +	if (ftrace_hash_count(data->modify)) {
> +		err = update_ftrace_direct_mod(&direct_ops, data->modify, true);
> +		if (err) {
> +			WARN_ON_ONCE(update_ftrace_direct_del(&direct_ops, data->reg));
> +			                                                       ^^^^^^^^^^
> +			goto rollback_unlink;
> +		}
> +	}

The recovery path unconditionally calls update_ftrace_direct_del with
data->reg, but update_ftrace_direct_add is only invoked when data->reg
is non-empty:

    if (ftrace_hash_count(data->reg)) {
        err = update_ftrace_direct_add(&direct_ops, data->reg);
        ...
    }

When update_ftrace_direct_mod fails and all trampolines already had
cur_image != NULL (so nothing was added to data->reg),
update_ftrace_direct_del is called with an empty hash.
update_ftrace_direct_del at kernel/trace/ftrace.c:6511 checks:

    if (!hash_count(hash))
        return -EINVAL;

so it returns -EINVAL and the WARN_ON_ONCE fires spuriously. Should
this guard match the add path?

    if (ftrace_hash_count(data->reg))
        WARN_ON_ONCE(update_ftrace_direct_del(&direct_ops, data->reg));

> +
> +	for_each_mnode(mnode, link)
> +		bpf_trampoline_multi_attach_free(mnode->trampoline);
> +
> +	trampoline_unlock_all();
> +
> +	remove_tracing_multi_data(data);
> +	return 0;
> +
> +rollback_unlink:
> +	for_each_mnode_cnt(mnode, link, rollback_cnt) {
> +		bpf_trampoline_remove_prog(mnode->trampoline, &mnode->node);
> +		bpf_trampoline_multi_attach_rollback(mnode->trampoline);
> +	}
> +
> +	trampoline_unlock_all();
> +
> +	clear_tracing_multi_data(data);
> +	rollback_cnt = link->nodes_cnt;
> +
> +rollback_put:
> +	for_each_mnode_cnt(mnode, link, rollback_cnt)
> +		bpf_trampoline_put(mnode->trampoline);
> +
> +	return err;
> +}

[ ... ]

> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -19658,6 +19658,58 @@ struct bpf_attach_target_info tgt_info = {};
>  	return 0;
>  }
>
> +int bpf_check_attach_btf_id_multi(struct btf *btf, struct bpf_prog *prog, u32 btf_id,
> +				  struct bpf_attach_target_info *tgt_info)
> +{
> +	const struct btf_type *t;
> +	unsigned long addr;
> +	const char *tname;
> +	int err;
> +
> +	if (!btf_id || !btf)
> +		return -EINVAL;
> +
> +	/* Check noreturn attachment. */
> +	if (prog->expected_attach_type == BPF_TRACE_FEXIT_MULTI ||
> +	     btf_id_set_contains(&noreturn_deny, btf_id))
> +		return -EINVAL;

Operator precedence issue in the noreturn check. The two subexpressions
are joined by ||, so the function returns -EINVAL whenever EITHER:

  (a) the program's expected_attach_type is BPF_TRACE_FEXIT_MULTI
      (regardless of whether btf_id targets a noreturn function), OR

  (b) the btf_id is in the noreturn_deny set (regardless of program
      attach type, rejecting FENTRY_MULTI attachments to noreturn
      functions as well).

The intent is to reject only FEXIT-like multi attachments to noreturn
functions. The non-multi path in check_attach_btf_id() correctly uses
&&:

    } else if ((prog->expected_attach_type == BPF_TRACE_FEXIT ||
               prog->expected_attach_type == BPF_TRACE_FSESSION ||
               prog->expected_attach_type == BPF_MODIFY_RETURN) &&
               btf_id_set_contains(&noreturn_deny, btf_id)) {
        ...
        return -EINVAL;
    }

Consequences once bpf_trampoline_multi_attach() is wired up from
user-space (commit dbf2afe2f603 "bpf: Add support for tracing multi
link"):

  1. Every BPF_TRACE_FEXIT_MULTI program attach request is rejected
     with -EINVAL, even when the target function returns normally. This
     disables BPF_TRACE_FEXIT_MULTI entirely.

  2. Any tracing_multi program (including BPF_TRACE_FENTRY_MULTI) is
     rejected when its btf_id is in noreturn_deny (do_exit,
     __x64_sys_exit, make_task_dead, etc.). Rejecting FENTRY_MULTI on
     noreturn functions is an unintended behavioral restriction.

Forward search in the git range shows commit bfcad202ce99 ("bpf: Add
support for tracing_multi link session") rewrites this hunk to:

    if ((prog->expected_attach_type == BPF_TRACE_FEXIT_MULTI ||
         prog->expected_attach_type == BPF_TRACE_FSESSION_MULTI) &&
         btf_id_set_contains(&noreturn_deny, btf_id))
        return -EINVAL;

The parenthesization + && in the later commit confirms this is a
precedence bug. Should this use && instead?

> +
> +	/* Check and get function target data. */
> +	t = btf_type_by_id(btf, btf_id);
> +	if (!t)
> +		return -EINVAL;
> +	tname = btf_name_by_offset(btf, t->name_off);
> +	if (!tname)
> +		return -EINVAL;
> +	if (!btf_type_is_func(t))
> +		return -EINVAL;
> +	t = btf_type_by_id(btf, t->type);
> +	if (!btf_type_is_func_proto(t))
> +		return -EINVAL;
> +	err = btf_distill_func_proto(NULL, btf, t, tname, &tgt_info->fmodel);
> +	if (err < 0)
> +		return err;
> +	if (btf_is_module(btf)) {
> +		/* The bpf program already holds reference to module. */
> +		if (WARN_ON_ONCE(!prog->aux->mod))
> +			return -EINVAL;
> +		addr = find_kallsyms_symbol_value(prog->aux->mod, tname);
> +	} else {
> +		addr = kallsyms_lookup_name(tname);
> +	}
> +	if (!addr || !ftrace_location(addr))
> +		return -ENOENT;
> +
> +	/* Check sleepable program attachment. */
> +	if (prog->sleepable) {
> +		err = btf_id_allow_sleepable(btf_id, addr, prog, btf);
> +		if (err)
> +			return err;
> +	}
> +	tgt_info->tgt_addr = addr;
> +	return 0;
> +}
> +
>  struct btf *bpf_get_btf_vmlinux(void)
>  {
>  	if (!btf_vmlinux && IS_ENABLED(CONFIG_DEBUG_INFO_BTF)) {


---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/24598000047

^ permalink raw reply

* Re: [PATCHv5 bpf-next 08/28] bpf: Add struct bpf_tramp_node object
From: bot+bpf-ci @ 2026-04-18  6:10 UTC (permalink / raw)
  To: jolsa, ast, daniel, andrii
  Cc: hengqi.chen, bpf, linux-trace-kernel, kafai, eddyz87,
	songliubraving, yhs, menglong8.dong, rostedt, ast, andrii, daniel,
	martin.lau, eddyz87, yonghong.song, clm, ihor.solodrai
In-Reply-To: <20260417192502.194548-9-jolsa@kernel.org>

[-- Attachment #1: Type: text/plain, Size: 3917 bytes --]

> diff --git a/arch/powerpc/net/bpf_jit_comp.c b/arch/powerpc/net/bpf_jit_comp.c
> index 50103b3794fb..b97d9f68fc40 100644
> --- a/arch/powerpc/net/bpf_jit_comp.c
> +++ b/arch/powerpc/net/bpf_jit_comp.c

[ ... ]

> @@ -827,14 +827,14 @@ static void bpf_trampoline_restore_args_stack(u32 *image, struct codegen_context
>  static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *rw_image,
>  					 void *rw_image_end, void *ro_image,
>  					 const struct btf_func_model *m, u32 flags,
> -					 struct bpf_tramp_links *tlinks,
> +					 struct bpf_tramp_nodes *tnodes,
>  					 void *func_addr)
>  {
>  	int regs_off, func_meta_off, ip_off, run_ctx_off, retval_off;
>  	int nvr_off, alt_lr_off, r4_off = 0;
> -	struct bpf_tramp_links *fmod_ret = &tlinks[BPF_TRAMP_MODIFY_RETURN];
> -	struct bpf_tramp_links *fentry = &tlinks[BPF_TRAMP_FENTRY];
> -	struct bpf_tramp_links *fexit = &tlinks[BPF_TRAMP_FEXIT];
> +	struct bpf_tramp_nodes *fmod_ret = &tnodes[BPF_TRAMP_MODIFY_RETURN];
> +	struct bpf_tramp_nodes *fentry = &tnodes[BPF_TRAMP_FENTRY];
> +	struct bpf_tramp_nodes *fexit = &tnodes[BPF_TRAMP_FEXIT];
>  	int i, ret, nr_regs, retaddr_off, bpf_frame_size = 0;
>  	struct codegen_context codegen_ctx, *ctx;
>  	int cookie_off, cookie_cnt, cookie_ctx_off;
> -	int fsession_cnt = bpf_fsession_cnt(tlinks);
> +	int fsession_cnt = bpf_fsession_cnt(tnodes);
                                            ^^^^^

This line wasn't actually updated in the diff. The parameter was renamed from
tlinks to tnodes, but this call site still references the old name. Does this
compile?

The same issue appears later in the function:

>  	cookie_off = bpf_frame_size;
> -	cookie_cnt = bpf_fsession_cookie_cnt(tlinks);
> +	cookie_cnt = bpf_fsession_cookie_cnt(tnodes);
                                              ^^^^^

This line also wasn't updated and still references tlinks.

> @@ -1050,21 +1050,21 @@ static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *rw_im
>
>  	cookie_ctx_off = (regs_off - cookie_off) / 8;
>
> -	for (i = 0; i < fentry->nr_links; i++) {
> -		if (bpf_prog_calls_session_cookie(fentry->links[i])) {
> +	for (i = 0; i < fentry->nr_nodes; i++)
> +		if (bpf_prog_calls_session_cookie(fentry->nodes[i])) {
>  			u64 meta = func_meta | (cookie_ctx_off << BPF_TRAMP_COOKIE_INDEX_SHIFT);
>
>  			store_func_meta(image, ctx, meta, func_meta_off);
>  			cookie_ctx_off--;
>  		}
>
> -		if (invoke_bpf_prog(image, ro_image, ctx, fentry->links[i], regs_off, retval_off,
> +		if (invoke_bpf_prog(image, ro_image, ctx, fentry->nodes[i], regs_off, retval_off,
>  				    run_ctx_off, flags & BPF_TRAMP_F_RET_FENTRY_RET))
>  			return -EINVAL;
>  	}

The opening brace after the for-loop was removed but the closing brace at the
end remains. This creates two problems:

First, the for-loop now only controls the inner if statement checking for
session cookies. The invoke_bpf_prog call is no longer inside the loop and
will only execute once (with whatever value i holds after the loop exits),
instead of being called for each fentry node.

Second, the unmatched closing brace at the end creates a syntax error.

The code structure should be:

    for (i = 0; i < fentry->nr_nodes; i++) {
        if (bpf_prog_calls_session_cookie(fentry->nodes[i])) {
            u64 meta = func_meta | (cookie_ctx_off << BPF_TRAMP_COOKIE_INDEX_SHIFT);

            store_func_meta(image, ctx, meta, func_meta_off);
            cookie_ctx_off--;
        }

        if (invoke_bpf_prog(image, ro_image, ctx, fentry->nodes[i], regs_off, retval_off,
                    run_ctx_off, flags & BPF_TRAMP_F_RET_FENTRY_RET))
            return -EINVAL;
    }

[ ... ]

---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/24598000047

^ permalink raw reply

* Re: [PATCHv5 bpf-next 01/28] ftrace: Add ftrace_hash_count function
From: bot+bpf-ci @ 2026-04-18  6:10 UTC (permalink / raw)
  To: jolsa, ast, daniel, andrii
  Cc: bpf, linux-trace-kernel, kafai, eddyz87, songliubraving, yhs,
	menglong8.dong, rostedt, ast, andrii, daniel, martin.lau, eddyz87,
	yonghong.song, clm, ihor.solodrai
In-Reply-To: <20260417192502.194548-2-jolsa@kernel.org>

[-- Attachment #1: Type: text/plain, Size: 2302 bytes --]

> diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
> index c242fe49af4c..401f8dfd05d3 100644
> --- a/include/linux/ftrace.h
> +++ b/include/linux/ftrace.h
> @@ -415,6 +415,7 @@ struct ftrace_hash *alloc_ftrace_hash(int size_bits);
>  void free_ftrace_hash(struct ftrace_hash *hash);
>  struct ftrace_func_entry *add_ftrace_hash_entry_direct(struct ftrace_hash *hash,
>  						       unsigned long ip, unsigned long direct);
> +unsigned long ftrace_hash_count(struct ftrace_hash *hash);
>
>  /* The hash used to know what functions callbacks trace */
>  struct ftrace_ops_hash {

[ ... ]

> diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
> index 413310912609..68a071e80f32 100644
> --- a/kernel/trace/ftrace.c
> +++ b/kernel/trace/ftrace.c
> @@ -6288,11 +6288,16 @@ int modify_ftrace_direct(struct ftrace_ops *ops, unsigned long addr)
>  }
>  EXPORT_SYMBOL_GPL(modify_ftrace_direct);
>
> -static unsigned long hash_count(struct ftrace_hash *hash)
> +static inline unsigned long hash_count(struct ftrace_hash *hash)
>  {
>  	return hash ? hash->count : 0;
>  }
>
> +unsigned long ftrace_hash_count(struct ftrace_hash *hash)
> +{
> +	return hash_count(hash);
> +}
                     ^^^^

Will this cause a linker error on some kernel configurations?

The function ftrace_hash_count() is declared in include/linux/ftrace.h under
CONFIG_DYNAMIC_FTRACE, but this definition appears in kernel/trace/ftrace.c
inside a section guarded by CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS (which
extends from around modify_ftrace_direct() to several hundred lines below).

On architectures where CONFIG_DYNAMIC_FTRACE=y but
CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS=n, code that calls
ftrace_hash_count() would compile successfully but fail to link since no
definition would be present in the built kernel.

This was raised by sashiko-bot@kernel.org on the v5 review:
https://lore.kernel.org/all/20260418005821.4F156C19425@smtp.kernel.org/

> +
>  /**
>   * hash_add - adds two struct ftrace_hash and returns the result
>   * @a: struct ftrace_hash object


---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/24598000047

^ permalink raw reply

* Re: [PATCHv5 bpf-next 02/28] ftrace: Add ftrace_hash_remove function
From: bot+bpf-ci @ 2026-04-18  6:10 UTC (permalink / raw)
  To: jolsa, ast, daniel, andrii
  Cc: bpf, linux-trace-kernel, kafai, eddyz87, songliubraving, yhs,
	menglong8.dong, rostedt, ast, andrii, daniel, martin.lau, eddyz87,
	yonghong.song, clm, ihor.solodrai
In-Reply-To: <20260417192502.194548-3-jolsa@kernel.org>

[-- Attachment #1: Type: text/plain, Size: 2705 bytes --]

> diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
> --- a/include/linux/ftrace.h
> +++ b/include/linux/ftrace.h
> @@ -416,6 +416,7 @@ void free_ftrace_hash(struct ftrace_hash *hash);
>  struct ftrace_func_entry *add_ftrace_hash_entry_direct(struct ftrace_hash *hash,
>  						       unsigned long ip, unsigned long direct);
>  unsigned long ftrace_hash_count(struct ftrace_hash *hash);
> +void ftrace_hash_remove(struct ftrace_hash *hash);
>
>  /* The hash used to know what functions callbacks trace */
>  struct ftrace_ops_hash {
> diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
> --- a/kernel/trace/ftrace.c
> +++ b/kernel/trace/ftrace.c
> @@ -1249,6 +1249,25 @@ remove_hash_entry(struct ftrace_hash *hash,
>  	hash->count--;
>  }
>
> +void ftrace_hash_remove(struct ftrace_hash *hash)
> +{
> +	struct hlist_head *hhd;
> +	struct hlist_node *tn;
> +	struct ftrace_func_entry *entry;
> +	int size = 1 << hash->size_bits;
                          ^^^^^^^^^^^^^^^^

Can this dereference a NULL pointer? The code computes the size using
hash->size_bits before checking if hash is NULL:

    int size = 1 << hash->size_bits;
    ...
    if (!hash || !hash->count)
        return;

If a caller passes NULL, the function will crash on the size computation.

Looking at how this is used in the later commit e77bc5ff90dd
('bpf: Add bpf_trampoline_multi_attach/detach functions'),
init_tracing_multi_data() in kernel/bpf/trampoline.c allocates three hashes:

    data->reg    = alloc_ftrace_hash(FTRACE_HASH_DEFAULT_BITS);
    data->unreg  = alloc_ftrace_hash(FTRACE_HASH_DEFAULT_BITS);
    data->modify = alloc_ftrace_hash(FTRACE_HASH_DEFAULT_BITS);
    if (!data->reg || !data->unreg || !data->modify) {
        clear_tracing_multi_data(data);
        return -ENOMEM;
    }

On allocation failure, clear_tracing_multi_data() calls
remove_tracing_multi_data(), which calls ftrace_hash_remove() on data->reg,
data->unreg, and data->modify. Any of these might be NULL if the
corresponding alloc_ftrace_hash() failed.

Would it be safer to move the size calculation below the NULL check?

> +	int i;
> +
> +	if (!hash || !hash->count)
> +		return;
> +
> +	for (i = 0; i < size; i++) {
> +		hhd = &hash->buckets[i];
> +		hlist_for_each_entry_safe(entry, tn, hhd, hlist)
> +			remove_hash_entry(hash, entry);
> +	}
> +	FTRACE_WARN_ON(hash->count);
> +}
> +
>  static void ftrace_hash_clear(struct ftrace_hash *hash)
>  {
>  	struct hlist_head *hhd;


---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/24598000047

^ permalink raw reply

* Re: [PATCHv5 bpf-next 06/28] bpf: Move trampoline image setup into bpf_trampoline_ops callbacks
From: bot+bpf-ci @ 2026-04-18  6:10 UTC (permalink / raw)
  To: jolsa, ast, daniel, andrii
  Cc: bpf, linux-trace-kernel, kafai, eddyz87, songliubraving, yhs,
	menglong8.dong, rostedt, ast, andrii, daniel, martin.lau, eddyz87,
	yonghong.song, clm, ihor.solodrai
In-Reply-To: <20260417192502.194548-7-jolsa@kernel.org>

[-- Attachment #1: Type: text/plain, Size: 1276 bytes --]

commit 405d6ece5a34824f3bc154895a3415605c07b216
Author: Jiri Olsa <jolsa@kernel.org>

bpf: Move trampoline image setup into bpf_trampoline_ops callbacks

Moving trampoline image setup into bpf_trampoline_ops callbacks,
so we can have different image handling for multi attachment which
is coming in following changes.

There's slight functional change for the unregister path, where we
currently free the image unconditionally even if the detach fails.
The new code keeps the image in place, possibly preventing the crash.

The commit message mentions a functional change where "we currently free
the image unconditionally even if the detach fails." This appears to fix
a bug introduced in commit e21aa341785c ("bpf: Fix fexit trampoline.").
Should this include a Fixes tag?

The suggested tag would be:

Fixes: e21aa341785c ("bpf: Fix fexit trampoline.")

This was also noted by bot+bpf-ci@kernel.org during the v5 review:
https://lore.kernel.org/bpf/8a315de2e20505875006c4e32af33c0838000a79484f7a2e44b1081187a01379@mail.kernel.org/

---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/24598000047

^ permalink raw reply

* Re: [PATCHv5 bpf-next 21/28] selftests/bpf: Add tracing multi skel/pattern/ids attach tests
From: bot+bpf-ci @ 2026-04-18  6:10 UTC (permalink / raw)
  To: jolsa, ast, daniel, andrii
  Cc: bpf, linux-trace-kernel, kafai, eddyz87, songliubraving, yhs,
	menglong8.dong, rostedt, ast, andrii, daniel, martin.lau, eddyz87,
	yonghong.song, clm, ihor.solodrai
In-Reply-To: <20260417192502.194548-22-jolsa@kernel.org>

[-- Attachment #1: Type: text/plain, Size: 2713 bytes --]

> diff --git a/tools/testing/selftests/bpf/prog_tests/tracing_multi.c b/tools/testing/selftests/bpf/prog_tests/tracing_multi.c
> new file mode 100644
> index 000000000000..52d3339b430a
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/tracing_multi.c
> @@ -0,0 +1,256 @@

[ ... ]

> +static __u32 *get_ids(const char * const funcs[], int funcs_cnt, const char *mod)
> +{
> +	struct btf *btf, *vmlinux_btf = NULL;
> +	__u32 nr, type_id, cnt = 0;
> +	void *root = NULL;
> +	__u32 *ids = NULL;
> +	int i, err = 0;
> +
> +	btf = btf__load_vmlinux_btf();
> +	if (!ASSERT_OK_PTR(btf, "btf__load_vmlinux_btf"))
> +		return NULL;
> +
> +	if (mod) {
> +		vmlinux_btf = btf;
> +		btf = btf__load_module_btf(mod, vmlinux_btf);
> +		if (!ASSERT_OK_PTR(btf, "btf__load_module_btf")) {
> +			btf__free(vmlinux_btf);
                        ^^^^

Does this code double-free vmlinux_btf when btf__load_module_btf() fails?

When btf__load_module_btf(mod, vmlinux_btf) fails, the error path at
ASSERT_OK_PTR calls btf__free(vmlinux_btf) and then jumps to the out:
label:

> +			goto out;
> +		}
> +	}

[ ... ]

> +out:
> +	tdestroy(root, tdestroy_free_nop);
> +	btf__free(vmlinux_btf);
        ^^^^

which calls btf__free(vmlinux_btf) again on the same pointer.

Multiple reviewers (bot+bpf-ci@kernel.org and sashiko-bot@kernel.org) raised
this concern in v5. The fix should either set vmlinux_btf=NULL after the
first free or return directly without going to the out label.

Reference: https://lore.kernel.org/bpf/1ab81cf709ce7d536a898695ffad30c15c8a6a7bf48dcb7271239952dd270315@mail.kernel.org/

> +	ids = calloc(funcs_cnt, sizeof(ids[0]));
> +	if (!ids)
> +		goto out;
> +
> +	/*
> +	 * We sort function names by name and search them
> +	 * below for each function.
> +	 */
> +	for (i = 0; i < funcs_cnt; i++) {
> +		if (!tsearch(&funcs[i], &root, compare)) {
> +			ASSERT_FAIL("tsearch failed");
                        ^^^^

When tsearch() fails, should err be set to -1 before the goto?

Currently when tsearch() fails, we jump to out, but err is still 0.
Because err is 0, the cleanup block below won't free ids and will
instead return the partially or fully zero-initialized array:

> +			goto out;
> +		}
> +	}

[ ... ]

> +	if (err) {
> +		free(ids);
> +		ids = NULL;
> +	}

This can lead to incorrect test behavior. The fix should set err=-1
before the goto out statement.

Reference: https://lore.kernel.org/bpf/20260418033438.03847C19424@smtp.kernel.org/

[ ... ]


---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/24598000047

^ permalink raw reply

* Re: [PATCHv5 bpf-next 19/28] libbpf: Add btf_type_is_traceable_func function
From: bot+bpf-ci @ 2026-04-18  5:59 UTC (permalink / raw)
  To: jolsa, ast, daniel, andrii
  Cc: bpf, linux-trace-kernel, kafai, eddyz87, songliubraving, yhs,
	menglong8.dong, rostedt, ast, andrii, daniel, martin.lau, eddyz87,
	yonghong.song, clm, ihor.solodrai
In-Reply-To: <20260417192502.194548-20-jolsa@kernel.org>

[-- Attachment #1: Type: text/plain, Size: 3140 bytes --]

> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index fc08ac6bcdde..49368ee161b1 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c

[ ... ]

> +static int btf_get_type_size(const struct btf *btf, __u32 type_id,
> +			     const struct btf_type **ret_type)
> +{
> +	const struct btf_type *t;
> +
> +	*ret_type = btf__type_by_id(btf, 0);
> +	if (!type_id)
> +		return 0;
> +	t = btf__type_by_id(btf, type_id);
> +	while (t && btf_type_is_modifier(t))
> +		t = btf__type_by_id(btf, t->type);

Does this loop need a maximum resolution depth limit?

If a malformed BTF object contains a circular modifier reference
(e.g., typedef A -> const B -> typedef A), could this loop spin
indefinitely and cause a hang?

> +	if (!t)
> +		return -EINVAL;
> +	*ret_type = t;
> +	if (btf_is_ptr(t))
> +		return sizeof(void *);

Does this correctly calculate the pointer size for cross-architecture
analysis?

Since libbpf is a user-space library that may process BTF blobs from
different architectures (like a 64-bit tool processing a 32-bit BPF
object), will hardcoding sizeof(void *) return the host's pointer size
instead of the target's pointer size?

Would using btf__pointer_size(btf) be more appropriate here?

> +	if (btf_is_int(t) || btf_is_any_enum(t) || btf_is_struct(t) || btf_is_union(t))
> +		return t->size;
> +	return -EINVAL;
> +}
> +
> +bool btf_type_is_traceable_func(const struct btf *btf, const struct btf_type *t)
> +{
> +	const struct btf_param *args;
> +	const struct btf_type *proto;
> +	__u32 i, nargs;
> +	int ret;
> +
> +	proto = btf_type_by_id(btf, t->type);

Could this cause an out-of-bounds read if t is not a function type?

Because the type and size fields share a union in struct btf_type,
passing a type like BTF_KIND_STRUCT causes its size to be misinterpreted
as a type ID.

The internal btf_type_by_id() helper skips bounds checking. If the
interpreted ID exceeds the BTF type array bounds, it might read past the
end of the array and return a garbage pointer, which would then be
dereferenced in the next line.

Should this verify !btf_is_func(t) first, and use the bounds-checked
public API btf__type_by_id() along with a NULL check for proto?

> +	if (BTF_INFO_KIND(proto->info) != BTF_KIND_FUNC_PROTO)
> +		return false;

[ ... ]

> diff --git a/tools/lib/bpf/libbpf_internal.h b/tools/lib/bpf/libbpf_internal.h
> index 3781c45b46d3..a9845f57e56c 100644
> --- a/tools/lib/bpf/libbpf_internal.h
> +++ b/tools/lib/bpf/libbpf_internal.h
> @@ -250,6 +250,7 @@ const struct btf_type *skip_mods_and_typedefs(const struct btf *btf, __u32 id, _
>  const struct btf_header *btf_header(const struct btf *btf);
>  void btf_set_base_btf(struct btf *btf, const struct btf *base_btf);
>  int btf_relocate(struct btf *btf, const struct btf *base_btf, __u32 **id_map);
> +bool btf_type_is_traceable_func(const struct btf *btf, const struct btf_type *t);

---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/24598000047

^ permalink raw reply

* Re: [PATCH] trace: propagate registration failure from tracing_start_*_record()
From: Yash Suthar @ 2026-04-18  5:38 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: mhiramat, mathieu.desnoyers, linux-kernel, linux-trace-kernel,
	skhan, me, syzbot+a1d25e53cd4a10f7f2d3
In-Reply-To: <20260417115258.65e216be@fedora>

Hello Steven,
Thank you for taking a look and really sorry.

I did use ai assistance for commit message, but I reviewed, modified
and tested(with syzbot locally) the code myself. I should have
disclosed really sorry.

One thing I want to know (or I am still missing something):
sched_cmdline_ref is incremented before tracing_sched_register() and
register fails,  but sched_cmdline_ref stays at 1 and on disable
tracepoint_remove_func() sees NULL and return error (as syzbot
reported and reproduce also locally).
your suggestion WARN_ONCE correctly flags the upstream failure, but
the secondary WARN at  tracepoint.c:358 will still fire on the next
disable, since the refcount desync isn't addressed. Was that
intentional ?

in my earlier approach, I tried to propagate the error and handle this
desync under fault injection. I understand this may not align with the
preferred direction, but that was the motivation behind the change.

If you think it makes sense, I can prepare a v2 with your suggestion
while also addressing the refcount consistency.

Thank you

On Fri, Apr 17, 2026 at 9:23 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Fri, 17 Apr 2026 12:08:27 +0530
> Yash Suthar <yashsuthar983@gmail.com> wrote:
>
> > syzbot reported a WARN in tracepoint_probe_unregister():
> >
> > tracing_start_sched_switch() increments sched_cmdline_ref /
> > sched_tgid_ref before calling tracing_sched_register(), and its
> > return value is discarded because the API is void. When the first
> > register_trace_sched_*() fails (e.g. kmalloc under memory pressure
> > or failslab), the function's fail_deprobe* labels roll back any
> > partial probe registration, but the caller's refcount has already
> > been bumped. The state is now desynced: refs > 0 but no probes in
> > tp->funcs.
> >
> > Later, when the caller pairs the start with a stop, the refcount
> > walks back to 0 and tracing_sched_unregister() calls
> > unregister_trace_sched_*() against an empty tp->funcs.
> > func_remove() returns -ENOENT and the
> > WARN_ON_ONCE(IS_ERR(old)) in tracepoint_remove_func() fires.
> >
> > Fix: make tracing_start_sched_switch() and the two exported
> > wrappers, tracing_start_cmdline_record() and
> > tracing_start_tgid_record(), return int; register the probes
> > before bumping the refcount; and propagate the error to callers
> > so refs are only held on behalf of a caller whose registration
> > actually succeeded.
> >
> > Fixes: d914ba37d714 ("tracing: Add support for recording tgid of tasks")
> > Reported-by: syzbot+a1d25e53cd4a10f7f2d3@syzkaller.appspotmail.com
> > Closes: https://syzkaller.appspot.com/bug?id=f93e97cd824071a2577a40cde9ecd957f59f87eb
>
> Did you use AI to create any of this? If so you must disclose it. This
> reads very much like an AI patch.
>
> >
> > Signed-off-by: Yash Suthar <yashsuthar983@gmail.com>
> > ---
> >  kernel/trace/trace.c                 |  6 +++---
> >  kernel/trace/trace.h                 |  4 ++--
> >  kernel/trace/trace_events.c          | 28 +++++++++++++++++++--------
> >  kernel/trace/trace_functions.c       |  8 +++++++-
> >  kernel/trace/trace_functions_graph.c |  6 +++++-
> >  kernel/trace/trace_sched_switch.c    | 29 ++++++++++++++++++----------
> >  kernel/trace/trace_selftest.c        |  7 ++++++-
> >  7 files changed, 62 insertions(+), 26 deletions(-)
>
> NAK on all this. If you are under severe memory constraints that causes
> this to fail, then you'll be hitting a bunch more errors.
>
> >
> > diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> > index 8bd4ec08fb36..e936eed99b27 100644
> > --- a/kernel/trace/trace.c
> > +++ b/kernel/trace/trace.c
> > @@ -3320,7 +3320,7 @@ void trace_printk_init_buffers(void)
> >        * allocated here, then this was called by module code.
> >        */
> >       if (global_trace.array_buffer.buffer)
> > -             tracing_start_cmdline_record();
> > +             (void)tracing_start_cmdline_record();
>
> WTF??? Why are you adding the typecast of (void) here? Don't do that!
>
>
> >  }
> >  EXPORT_SYMBOL_GPL(trace_printk_init_buffers);
> >
> > @@ -3329,7 +3329,7 @@ void trace_printk_start_comm(void)
> >       /* Start tracing comms if trace printk is set */
> >       if (!buffers_allocated)
> >               return;
> > -     tracing_start_cmdline_record();
> > +     (void)tracing_start_cmdline_record();
> >  }
> >
> >  static void trace_printk_start_stop_comm(int enabled)
> > @@ -3338,7 +3338,7 @@ static void trace_printk_start_stop_comm(int enabled)
> >               return;
> >
> >       if (enabled)
> > -             tracing_start_cmdline_record();
> > +             (void)tracing_start_cmdline_record();
> >       else
> >               tracing_stop_cmdline_record();
> >  }
> > diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
> > index b6d42fe06115..6fe2c8429560 100644
> > --- a/kernel/trace/trace.h
> > +++ b/kernel/trace/trace.h
> > @@ -751,9 +751,9 @@ void trace_graph_return(struct ftrace_graph_ret *trace, struct fgraph_ops *gops,
> >  int trace_graph_entry(struct ftrace_graph_ent *trace, struct fgraph_ops *gops,
> >                     struct ftrace_regs *fregs);
> >
> > -void tracing_start_cmdline_record(void);
> > +int tracing_start_cmdline_record(void);
> >  void tracing_stop_cmdline_record(void);
> > -void tracing_start_tgid_record(void);
> > +int tracing_start_tgid_record(void);
> >  void tracing_stop_tgid_record(void);
> >
> >  int register_tracer(struct tracer *type);
> > diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
> > index 137b4d9bb116..e6713aa80a03 100644
> > --- a/kernel/trace/trace_events.c
> > +++ b/kernel/trace/trace_events.c
> > @@ -734,9 +734,9 @@ void trace_event_enable_cmd_record(bool enable)
> >                       continue;
> >
> >               if (enable) {
> > -                     tracing_start_cmdline_record();
> > -                     set_bit(EVENT_FILE_FL_RECORDED_CMD_BIT, &file->flags);
> > -             } else {
> > +                     if (!tracing_start_cmdline_record())
> > +                             set_bit(EVENT_FILE_FL_RECORDED_CMD_BIT, &file->flags);
> > +             } else if (file->flags & EVENT_FILE_FL_RECORDED_CMD) {
> >                       tracing_stop_cmdline_record();
> >                       clear_bit(EVENT_FILE_FL_RECORDED_CMD_BIT, &file->flags);
> >               }
> > @@ -755,9 +755,9 @@ void trace_event_enable_tgid_record(bool enable)
> >                       continue;
> >
> >               if (enable) {
> > -                     tracing_start_tgid_record();
> > -                     set_bit(EVENT_FILE_FL_RECORDED_TGID_BIT, &file->flags);
> > -             } else {
> > +                     if (!tracing_start_tgid_record())
> > +                             set_bit(EVENT_FILE_FL_RECORDED_TGID_BIT, &file->flags);
> > +             } else if (file->flags & EVENT_FILE_FL_RECORDED_TGID) {
> >                       tracing_stop_tgid_record();
> >                       clear_bit(EVENT_FILE_FL_RECORDED_TGID_BIT,
> >                                 &file->flags);
> > @@ -847,14 +847,26 @@ static int __ftrace_event_enable_disable(struct trace_event_file *file,
> >                               set_bit(EVENT_FILE_FL_SOFT_DISABLED_BIT, &file->flags);
> >
> >                       if (tr->trace_flags & TRACE_ITER(RECORD_CMD)) {
> > +                             ret = tracing_start_cmdline_record();
> > +                             if (ret) {
> > +                                     pr_info("event trace: Could not enable event %s\n",
> > +                                             trace_event_name(call));
> > +                                     break;
> > +                             }
> >                               cmd = true;
> > -                             tracing_start_cmdline_record();
> >                               set_bit(EVENT_FILE_FL_RECORDED_CMD_BIT, &file->flags);
> >                       }
> >
> >                       if (tr->trace_flags & TRACE_ITER(RECORD_TGID)) {
> > +                             ret = tracing_start_tgid_record();
> > +                             if (ret) {
> > +                                     if (cmd)
> > +                                             tracing_stop_cmdline_record();
> > +                                     pr_info("event trace: Could not enable event %s\n",
> > +                                             trace_event_name(call));
> > +                                     break;
> > +                             }
> >                               tgid = true;
> > -                             tracing_start_tgid_record();
> >                               set_bit(EVENT_FILE_FL_RECORDED_TGID_BIT, &file->flags);
> >                       }
> >
> > diff --git a/kernel/trace/trace_functions.c b/kernel/trace/trace_functions.c
> > index c12795c2fb39..14d099734345 100644
> > --- a/kernel/trace/trace_functions.c
> > +++ b/kernel/trace/trace_functions.c
> > @@ -146,6 +146,8 @@ static bool handle_func_repeats(struct trace_array *tr, u32 flags_val)
> >  static int function_trace_init(struct trace_array *tr)
> >  {
> >       ftrace_func_t func;
> > +     int ret;
> > +
> >       /*
> >        * Instance trace_arrays get their ops allocated
> >        * at instance creation. Unless it failed
> > @@ -165,7 +167,11 @@ static int function_trace_init(struct trace_array *tr)
> >
> >       tr->array_buffer.cpu = raw_smp_processor_id();
> >
> > -     tracing_start_cmdline_record();
> > +     ret = tracing_start_cmdline_record();
> > +     if (ret) {
> > +             ftrace_reset_array_ops(tr);
> > +             return ret;
> > +     }
> >       tracing_start_function_trace(tr);
> >       return 0;
> >  }
> > diff --git a/kernel/trace/trace_functions_graph.c b/kernel/trace/trace_functions_graph.c
> > index 1de6f1573621..6b27ed62fee8 100644
> > --- a/kernel/trace/trace_functions_graph.c
> > +++ b/kernel/trace/trace_functions_graph.c
> > @@ -487,7 +487,11 @@ static int graph_trace_init(struct trace_array *tr)
> >       ret = register_ftrace_graph(tr->gops);
> >       if (ret)
> >               return ret;
> > -     tracing_start_cmdline_record();
> > +     ret = tracing_start_cmdline_record();
> > +     if (ret) {
> > +             unregister_ftrace_graph(tr->gops);
> > +             return ret;
> > +     }
> >
> >       return 0;
> >  }
> > diff --git a/kernel/trace/trace_sched_switch.c b/kernel/trace/trace_sched_switch.c
> > index c46d584ded3b..683ea4ca1498 100644
> > --- a/kernel/trace/trace_sched_switch.c
> > +++ b/kernel/trace/trace_sched_switch.c
> > @@ -89,12 +89,22 @@ static void tracing_sched_unregister(void)
> >       unregister_trace_sched_wakeup(probe_sched_wakeup, NULL);
> >  }
> >
> > -static void tracing_start_sched_switch(int ops)
> > +static int tracing_start_sched_switch(int ops)
> >  {
> > -     bool sched_register;
> > +     int ret = 0;
> >
> >       mutex_lock(&sched_register_mutex);
> > -     sched_register = (!sched_cmdline_ref && !sched_tgid_ref);
> > +
> > +     /*
> > +      * If the registration fails, do not bump the reference count : the
> > +      * caller must observe the failure so it can avoid a later matching
> > +      * stop that would otherwise unregister probes that were never added.
> > +      */
> > +     if (!sched_cmdline_ref && !sched_tgid_ref) {
> > +             ret = tracing_sched_register();
> > +             if (ret)
> > +                     goto out;
> > +     }
> >
> >       switch (ops) {
> >       case RECORD_CMDLINE:
> > @@ -105,10 +115,9 @@ static void tracing_start_sched_switch(int ops)
> >               sched_tgid_ref++;
> >               break;
> >       }
> > -
> > -     if (sched_register && (sched_cmdline_ref || sched_tgid_ref))
> > -             tracing_sched_register();
>
> The only change that should deal with this would be:
>
>         if (sched_register && (sched_cmdline_ref || sched_tgid_ref)) {
>                 WARN_ONCE(tracing_sched_register() < 0,
>                         "Failed to register trace command line caching. Requires reboot to fix");
>         }
>
> -- Steve
>
>
>
> > +out:
> >       mutex_unlock(&sched_register_mutex);
> > +     return ret;
> >  }
> >
> >  static void tracing_stop_sched_switch(int ops)
> > @@ -130,9 +139,9 @@ static void tracing_stop_sched_switch(int ops)
> >       mutex_unlock(&sched_register_mutex);
> >  }
> >
> > -void tracing_start_cmdline_record(void)
> > +int tracing_start_cmdline_record(void)
> >  {
> > -     tracing_start_sched_switch(RECORD_CMDLINE);
> > +     return tracing_start_sched_switch(RECORD_CMDLINE);
> >  }
> >
> >  void tracing_stop_cmdline_record(void)
> > @@ -140,9 +149,9 @@ void tracing_stop_cmdline_record(void)
> >       tracing_stop_sched_switch(RECORD_CMDLINE);
> >  }
> >
> > -void tracing_start_tgid_record(void)
> > +int tracing_start_tgid_record(void)
> >  {
> > -     tracing_start_sched_switch(RECORD_TGID);
> > +     return tracing_start_sched_switch(RECORD_TGID);
> >  }
> >
> >  void tracing_stop_tgid_record(void)
> > diff --git a/kernel/trace/trace_selftest.c b/kernel/trace/trace_selftest.c
> > index d88c44f1dfa5..238e7451f8e4 100644
> > --- a/kernel/trace/trace_selftest.c
> > +++ b/kernel/trace/trace_selftest.c
> > @@ -1084,7 +1084,12 @@ trace_selftest_startup_function_graph(struct tracer *trace,
> >               warn_failed_init_tracer(trace, ret);
> >               goto out;
> >       }
> > -     tracing_start_cmdline_record();
> > +     ret = tracing_start_cmdline_record();
> > +     if (ret) {
> > +             unregister_ftrace_graph(&fgraph_ops);
> > +             warn_failed_init_tracer(trace, ret);
> > +             goto out;
> > +     }
> >
> >       /* Sleep for a 1/10 of a second */
> >       msleep(100);
>

^ permalink raw reply

* Re: [PATCH v3 2/2] blk-mq: expose tag starvation counts via debugfs
From: Aaron Tomlin @ 2026-04-18  2:29 UTC (permalink / raw)
  To: Jens Axboe, bvanassche
  Cc: rostedt, mhiramat, mathieu.desnoyers, johannes.thumshirn, kch,
	dlemoal, ritesh.list, loberman, neelx, sean, mproche, chjohnst,
	linux-block, linux-kernel, linux-trace-kernel
In-Reply-To: <424110a4-4e14-4e40-af0d-59ca7a30d904@kernel.dk>

[-- Attachment #1: Type: text/plain, Size: 1521 bytes --]

On Fri, Apr 17, 2026 at 02:16:07PM -0600, Jens Axboe wrote:
> On 4/17/26 12:28 PM, Bart Van Assche wrote:
> > On 3/19/26 3:19 PM, Aaron Tomlin wrote:
> >> To guarantee zero performance overhead for production kernels compiled
> >> without debugfs, the underlying atomic_t variables and their associated
> >> increment routines are strictly guarded behind CONFIG_BLK_DEBUG_FS.
> >> When this configuration is disabled, the tracking logic compiles down
> >> to a safe no-op.
> > 
> > I don't think that's sufficient. Please use per-cpu counters to
> > minimize the overhead for kernels in which debugfs is enabled.
> 
> Agree, this is the usual nonsense of thinking you can hide any overhead
> behind a config option, when in practice production kernels very much DO
> have CONFIG_DEBUGFS enabled.

Hi Bart, Jens,

Thank you both for the candid feedback.

I concede that relying on CONFIG_BLK_DEBUG_FS to mitigate the performance
impact was a short-sighted approach, as you are absolutely right that
debugfs is routinely enabled in modern production environments. The use of
an atomic_t variable in this case would indeed introduce unacceptable cache
line contention under heavy workloads.

I shall address this in the next iteration by transitioning the tracking
logic entirely to per-CPU counters. This will ensure the overhead is
strictly minimised, irrespective of the build configuration.

Thank you again for your time and guidance on this matter.

Kind regards,
-- 
Aaron Tomlin

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* [RFC PATCH 10/10] rcuscale: Add concurrent expedited GP threads for callback scaling tests
From: Puranjay Mohan @ 2026-04-17 23:11 UTC (permalink / raw)
  To: rcu, linux-kernel, linux-trace-kernel
  Cc: Puranjay Mohan, Paul E. McKenney, Frederic Weisbecker,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Masami Hiramatsu, Davidlohr Bueso
In-Reply-To: <20260417231203.785172-1-puranjay@kernel.org>

Add nexp and exp_interval parameters to rcuscale that spawn kthreads
running synchronize_rcu_expedited() in a loop. This generates concurrent
expedited GP load while the normal writers measure GP or callback
latency.

When combined with gp_async=1 (which uses call_rcu() for writers), this
tests how effectively callbacks benefit from expedited grace periods.
With RCU callback expedited GP tracking, the async callbacks should
complete faster because they piggyback on the expedited GPs rather than
waiting for normal GPs.

Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
 kernel/rcu/rcuscale.c | 84 +++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 82 insertions(+), 2 deletions(-)

diff --git a/kernel/rcu/rcuscale.c b/kernel/rcu/rcuscale.c
index ac0b1c6b7dae..1097ec15879c 100644
--- a/kernel/rcu/rcuscale.c
+++ b/kernel/rcu/rcuscale.c
@@ -91,6 +91,8 @@ torture_param(int, shutdown_secs, !IS_MODULE(CONFIG_RCU_SCALE_TEST) * 300,
 torture_param(int, verbose, 1, "Enable verbose debugging printk()s");
 torture_param(int, writer_holdoff, 0, "Holdoff (us) between GPs, zero to disable");
 torture_param(int, writer_holdoff_jiffies, 0, "Holdoff (jiffies) between GPs, zero to disable");
+torture_param(int, nexp, 0, "Number of expedited GP threads to run concurrently");
+torture_param(int, exp_interval, 0, "Interval (us) between expedited GPs, zero to disable");
 torture_param(int, kfree_rcu_test, 0, "Do we run a kfree_rcu() scale test?");
 torture_param(int, kfree_mult, 1, "Multiple of kfree_obj size to allocate.");
 torture_param(int, kfree_by_call_rcu, 0, "Use call_rcu() to emulate kfree_rcu()?");
@@ -115,8 +117,10 @@ struct writer_freelist {
 
 static int nrealreaders;
 static int nrealwriters;
+static int nrealexp;
 static struct task_struct **writer_tasks;
 static struct task_struct **reader_tasks;
+static struct task_struct **exp_tasks;
 
 static u64 **writer_durations;
 static bool *writer_done;
@@ -462,6 +466,34 @@ rcu_scale_reader(void *arg)
 	return 0;
 }
 
+/*
+ * RCU expedited GP kthread.  Repeatedly invokes expedited grace periods
+ * to generate concurrent expedited GP load while the normal-GP writers
+ * are being measured.  This allows measuring the benefit of callbacks
+ * that can piggyback on expedited grace periods.
+ */
+static int
+rcu_scale_exp(void *arg)
+{
+	long me = (long)arg;
+
+	VERBOSE_SCALEOUT_STRING("rcu_scale_exp task started");
+	set_cpus_allowed_ptr(current, cpumask_of(me % nr_cpu_ids));
+	set_user_nice(current, MIN_NICE);
+
+	if (holdoff)
+		schedule_timeout_idle(holdoff * HZ);
+
+	do {
+		if (exp_interval)
+			udelay(exp_interval);
+		cur_ops->exp_sync();
+		rcu_scale_wait_shutdown();
+	} while (!torture_must_stop());
+	torture_kthread_stopping("rcu_scale_exp");
+	return 0;
+}
+
 /*
  * Allocate a writer_mblock structure for the specified rcu_scale_writer
  * task.
@@ -664,8 +696,10 @@ static void
 rcu_scale_print_module_parms(struct rcu_scale_ops *cur_ops, const char *tag)
 {
 	pr_alert("%s" SCALE_FLAG
-		 "--- %s: gp_async=%d gp_async_max=%d gp_exp=%d holdoff=%d minruntime=%d nreaders=%d nwriters=%d writer_holdoff=%d writer_holdoff_jiffies=%d verbose=%d shutdown_secs=%d\n",
-		 scale_type, tag, gp_async, gp_async_max, gp_exp, holdoff, minruntime, nrealreaders, nrealwriters, writer_holdoff, writer_holdoff_jiffies, verbose, shutdown_secs);
+		 "--- %s: gp_async=%d gp_async_max=%d gp_exp=%d holdoff=%d minruntime=%d nreaders=%d nwriters=%d nexp=%d exp_interval=%d writer_holdoff=%d writer_holdoff_jiffies=%d verbose=%d shutdown_secs=%d\n",
+		 scale_type, tag, gp_async, gp_async_max, gp_exp, holdoff,
+		 minruntime, nrealreaders, nrealwriters, nrealexp, exp_interval,
+		 writer_holdoff, writer_holdoff_jiffies, verbose, shutdown_secs);
 }
 
 /*
@@ -809,6 +843,13 @@ kfree_scale_cleanup(void)
 	if (torture_cleanup_begin())
 		return;
 
+	if (exp_tasks) {
+		for (i = 0; i < nrealexp; i++)
+			torture_stop_kthread(rcu_scale_exp, exp_tasks[i]);
+		kfree(exp_tasks);
+		exp_tasks = NULL;
+	}
+
 	if (kfree_reader_tasks) {
 		for (i = 0; i < kfree_nrealthreads; i++)
 			torture_stop_kthread(kfree_scale_thread,
@@ -903,6 +944,22 @@ kfree_scale_init(void)
 			goto unwind;
 	}
 
+	if (nrealexp > 0 && cur_ops->exp_sync) {
+		exp_tasks = kzalloc_objs(exp_tasks[0], nrealexp);
+		if (!exp_tasks) {
+			SCALEOUT_ERRSTRING("out of memory");
+			firsterr = -ENOMEM;
+			goto unwind;
+		}
+		for (i = 0; i < nrealexp; i++) {
+			firsterr = torture_create_kthread(rcu_scale_exp,
+							  (void *)i,
+							  exp_tasks[i]);
+			if (torture_init_error(firsterr))
+				goto unwind;
+		}
+	}
+
 	while (atomic_read(&n_kfree_scale_thread_started) < kfree_nrealthreads)
 		schedule_timeout_uninterruptible(1);
 
@@ -959,6 +1016,13 @@ rcu_scale_cleanup(void)
 		return;
 	}
 
+	if (exp_tasks) {
+		for (i = 0; i < nrealexp; i++)
+			torture_stop_kthread(rcu_scale_exp, exp_tasks[i]);
+		kfree(exp_tasks);
+		exp_tasks = NULL;
+	}
+
 	if (reader_tasks) {
 		for (i = 0; i < nrealreaders; i++)
 			torture_stop_kthread(rcu_scale_reader,
@@ -1076,6 +1140,7 @@ rcu_scale_init(void)
 		if (kthread_tp)
 			kthread_stime = kthread_tp->stime;
 	}
+	nrealexp = nexp;
 	if (kfree_rcu_test)
 		return kfree_scale_init();
 
@@ -1107,6 +1172,21 @@ rcu_scale_init(void)
 	}
 	while (atomic_read(&n_rcu_scale_reader_started) < nrealreaders)
 		schedule_timeout_uninterruptible(1);
+	if (nrealexp > 0 && cur_ops->exp_sync) {
+		exp_tasks = kzalloc_objs(exp_tasks[0], nrealexp);
+		if (!exp_tasks) {
+			SCALEOUT_ERRSTRING("out of memory");
+			firsterr = -ENOMEM;
+			goto unwind;
+		}
+		for (i = 0; i < nrealexp; i++) {
+			firsterr = torture_create_kthread(rcu_scale_exp,
+							  (void *)i,
+							  exp_tasks[i]);
+			if (torture_init_error(firsterr))
+				goto unwind;
+		}
+	}
 	writer_tasks = kzalloc_objs(writer_tasks[0], nrealwriters);
 	writer_durations = kcalloc(nrealwriters, sizeof(*writer_durations), GFP_KERNEL);
 	writer_n_durations = kzalloc_objs(*writer_n_durations, nrealwriters);
-- 
2.52.0


^ permalink raw reply related

* [RFC PATCH 09/10] rcu: Advance callbacks for expedited GP completion in rcu_core()
From: Puranjay Mohan @ 2026-04-17 23:11 UTC (permalink / raw)
  To: rcu, linux-kernel, linux-trace-kernel
  Cc: Puranjay Mohan, Paul E. McKenney, Frederic Weisbecker,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Masami Hiramatsu, Davidlohr Bueso
In-Reply-To: <20260417231203.785172-1-puranjay@kernel.org>

Even when rcu_pending() triggers rcu_core(), the normal callback
advancement path through note_gp_changes() -> __note_gp_changes() bails
out when rdp->gp_seq == rnp->gp_seq (no normal GP change). Since
expedited GPs do not update rnp->gp_seq, rcu_advance_cbs() is never
called and callbacks remain stuck in RCU_WAIT_TAIL.

Add a direct callback advancement block in rcu_core() that checks for GP
completion via rcu_segcblist_nextgp() combined with
poll_state_synchronize_rcu_full(). When detected, trylock rnp and call
rcu_advance_cbs() to move completed callbacks to RCU_DONE_TAIL. Wake the
GP kthread if rcu_advance_cbs() requests a new grace period.

Uses trylock to avoid adding contention on rnp->lock. If the lock is
contended, callbacks will be advanced on the next tick.

Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
 kernel/rcu/tree.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 309273a37b0a..1a92b6105de5 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2853,6 +2853,22 @@ static __latent_entropy void rcu_core(void)
 	/* Update RCU state based on any recent quiescent states. */
 	rcu_check_quiescent_state(rdp);

+	/* Advance callbacks if an expedited GP has completed. */
+	if (!rcu_rdp_is_offloaded(rdp) &&
+	    rcu_segcblist_is_enabled(&rdp->cblist)) {
+		struct rcu_gp_oldstate gp_state;
+
+		if (rcu_segcblist_nextgp(&rdp->cblist, &gp_state) &&
+		    poll_state_synchronize_rcu_full(&gp_state)) {
+			guard(irqsave)();
+			if (raw_spin_trylock_rcu_node(rnp)) {
+				if (rcu_advance_cbs(rnp, rdp))
+					rcu_gp_kthread_wake();
+				raw_spin_unlock_rcu_node(rnp);
+			}
+		}
+	}
+
 	/* No grace period and unregistered callbacks? */
 	if (!rcu_gp_in_progress() &&
 	    rcu_segcblist_is_enabled(&rdp->cblist) && !rcu_rdp_is_offloaded(rdp)) {
-- 
2.52.0

^ permalink raw reply related

* [RFC PATCH 08/10] rcu: Detect expedited grace period completion in rcu_pending()
From: Puranjay Mohan @ 2026-04-17 23:11 UTC (permalink / raw)
  To: rcu, linux-kernel, linux-trace-kernel
  Cc: Puranjay Mohan, Paul E. McKenney, Frederic Weisbecker,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Masami Hiramatsu, Davidlohr Bueso
In-Reply-To: <20260417231203.785172-1-puranjay@kernel.org>

rcu_pending() is the gatekeeper that decides whether rcu_core() should
run on the current CPU's timer tick. Currently it checks if the CPU has
callbacks ready to invoke or a grace period has completed or started.

It does not check that an expedited GP has completed. After an expedited
GP, callbacks remain in RCU_WAIT_TAIL (not yet advanced to
RCU_DONE_TAIL) and So rcu_core() never runs to advance them.

Add a check using rcu_segcblist_nextgp() combined with
poll_state_synchronize_rcu_full() to detect when any pending callbacks'
grace period has completed.

Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
 kernel/rcu/tree.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 0e43866dc4cd..309273a37b0a 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3671,6 +3671,7 @@ EXPORT_SYMBOL_GPL(cond_synchronize_rcu_full);
 static int rcu_pending(int user)
 {
 	bool gp_in_progress;
+	struct rcu_gp_oldstate gp_state;
 	struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
 	struct rcu_node *rnp = rdp->mynode;
 
@@ -3701,6 +3702,12 @@ static int rcu_pending(int user)
 	    rcu_segcblist_ready_cbs(&rdp->cblist))
 		return 1;
 
+	/* Has a GP (normal or expedited) completed for pending callbacks? */
+	if (!rcu_rdp_is_offloaded(rdp) &&
+	    rcu_segcblist_nextgp(&rdp->cblist, &gp_state) &&
+	    poll_state_synchronize_rcu_full(&gp_state))
+		return 1;
+
 	/* Has RCU gone idle with this CPU needing another grace period? */
 	if (!gp_in_progress && rcu_segcblist_is_enabled(&rdp->cblist) &&
 	    !rcu_rdp_is_offloaded(rdp) &&
-- 
2.52.0


^ permalink raw reply related

* [RFC PATCH 07/10] rcu: Wake NOCB rcuog kthreads on expedited grace period completion
From: Puranjay Mohan @ 2026-04-17 23:11 UTC (permalink / raw)
  To: rcu, linux-kernel, linux-trace-kernel
  Cc: Puranjay Mohan, Paul E. McKenney, Frederic Weisbecker,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Masami Hiramatsu, Davidlohr Bueso
In-Reply-To: <20260417231203.785172-1-puranjay@kernel.org>

When an expedited grace period completes, rcu_exp_wait_wake() wakes
waiters on rnp->exp_wq[] but does not notify NOCB rcuog kthreads.  These
kthreads may be sleeping waiting for a grace period to complete.
Without this wakeup, callbacks on offloaded CPUs that could benefit from
the expedited GP must wait until the rcuog kthread wakes for some other
reason (e.g., next normal GP completion or a timer).

Add rcu_exp_wake_nocb() which wakes rcuog kthreads for leaf-node CPUs,
deduplicating via rdp->nocb_gp_rdp since multiple CPUs share one rcuog
kthread.  Uses for_each_leaf_node_possible_cpu() because offline CPUs
can have pending callbacks. The function is defined in tree_nocb.h with
an empty stub for CONFIG_RCU_NOCB_CPU=n builds.

Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
 kernel/rcu/tree.h      |  1 +
 kernel/rcu/tree_exp.h  |  1 +
 kernel/rcu/tree_nocb.h | 29 +++++++++++++++++++++++++++++
 3 files changed, 31 insertions(+)

diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 7dfc57e9adb1..40f778453591 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -500,6 +500,7 @@ static struct swait_queue_head *rcu_nocb_gp_get(struct rcu_node *rnp);
 static void rcu_nocb_gp_cleanup(struct swait_queue_head *sq);
 static void rcu_init_one_nocb(struct rcu_node *rnp);
 static bool wake_nocb_gp(struct rcu_data *rdp);
+static void rcu_exp_wake_nocb(struct rcu_node *rnp);
 static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
 				  unsigned long j, bool lazy);
 static void call_rcu_nocb(struct rcu_data *rdp, struct rcu_head *head,
diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index 82cada459e5d..0df1009c6e97 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -708,6 +708,7 @@ static void rcu_exp_wait_wake(unsigned long s)
 		}
 		smp_mb(); /* All above changes before wakeup. */
 		wake_up_all(&rnp->exp_wq[rcu_seq_ctr(s) & 0x3]);
+		rcu_exp_wake_nocb(rnp);
 	}
 	trace_rcu_exp_grace_period(rcu_state.name, s, TPS("endwake"));
 	mutex_unlock(&rcu_state.exp_wake_mutex);
diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
index 7462cd5e2507..f37ee56d62a9 100644
--- a/kernel/rcu/tree_nocb.h
+++ b/kernel/rcu/tree_nocb.h
@@ -190,6 +190,31 @@ static void rcu_init_one_nocb(struct rcu_node *rnp)
 	init_swait_queue_head(&rnp->nocb_gp_wq[1]);
 }
 
+/*
+ * Wake NOCB rcuog kthreads for leaf-node CPUs so that they can advance
+ * callbacks that were waiting for the just-completed expedited GP.
+ * Deduplicate via nocb_gp_rdp since multiple CPUs share one rcuog
+ * kthread.  Use for_each_leaf_node_possible_cpu() because offline CPUs
+ * may have pending callbacks.
+ */
+static void rcu_exp_wake_nocb(struct rcu_node *rnp)
+{
+	struct rcu_data *last_rdp_gp = NULL;
+	int cpu;
+
+	if (!rcu_is_leaf_node(rnp))
+		return;
+
+	for_each_leaf_node_possible_cpu(rnp, cpu) {
+		struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);
+
+		if (rdp->nocb_gp_rdp == last_rdp_gp)
+			continue;
+		last_rdp_gp = rdp->nocb_gp_rdp;
+		wake_nocb_gp(rdp);
+	}
+}
+
 /* Clear any pending deferred wakeup timer (nocb_gp_lock must be held). */
 static void nocb_defer_wakeup_cancel(struct rcu_data *rdp_gp)
 {
@@ -1668,6 +1693,10 @@ static void rcu_init_one_nocb(struct rcu_node *rnp)
 {
 }
 
+static void rcu_exp_wake_nocb(struct rcu_node *rnp)
+{
+}
+
 static bool wake_nocb_gp(struct rcu_data *rdp)
 {
 	return false;
-- 
2.52.0


^ permalink raw reply related

* [RFC PATCH 06/10] rcu: Update comments for gp_seq_full and expedited GP tracking
From: Puranjay Mohan @ 2026-04-17 23:11 UTC (permalink / raw)
  To: rcu, linux-kernel, linux-trace-kernel
  Cc: Puranjay Mohan, Paul E. McKenney, Frederic Weisbecker,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Masami Hiramatsu, Davidlohr Bueso
In-Reply-To: <20260417231203.785172-1-puranjay@kernel.org>

This commit updates documentation comments throughout the RCU callback
infrastructure to reflect the transition from single grace-period
sequence numbers to the full rcu_gp_oldstate structure that tracks both
normal and expedited grace periods.

The ->gp_seq_full[] array documentation in rcu_segcblist.h is updated
to describe dual GP tracking. The rcu_segcblist_advance(),
rcu_segcblist_accelerate(), and rcu_advance_cbs() comments are updated
to reference ->gp_seq_full[] and rgosp instead of the old ->gp_seq[]
and seq names.

Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
 include/linux/rcu_segcblist.h | 14 +++++++-----
 kernel/rcu/rcu_segcblist.c    | 43 +++++++++++++++++++++++------------
 kernel/rcu/tree.c             |  6 ++---
 3 files changed, 39 insertions(+), 24 deletions(-)

diff --git a/include/linux/rcu_segcblist.h b/include/linux/rcu_segcblist.h
index 59c68f2ba113..6052c5750ad3 100644
--- a/include/linux/rcu_segcblist.h
+++ b/include/linux/rcu_segcblist.h
@@ -50,12 +50,14 @@ struct rcu_cblist {
  * Note that RCU_WAIT_TAIL cannot be empty unless RCU_NEXT_READY_TAIL is also
  * empty.
  *
- * The ->gp_seq[] array contains the grace-period number at which the
- * corresponding segment of callbacks will be ready to invoke.  A given
- * element of this array is meaningful only when the corresponding segment
- * is non-empty, and it is never valid for RCU_DONE_TAIL (whose callbacks
- * are already ready to invoke) or for RCU_NEXT_TAIL (whose callbacks have
- * not yet been assigned a grace-period number).
+ * The ->gp_seq_full[] array contains the grace-period state at which the
+ * corresponding segment of callbacks will be ready to invoke.  This tracks
+ * both normal and expedited grace periods, allowing callbacks to complete
+ * when either type of GP finishes.  A given element of this array is
+ * meaningful only when the corresponding segment is non-empty, and it is
+ * never valid for RCU_DONE_TAIL (whose callbacks are already ready to
+ * invoke) or for RCU_NEXT_TAIL (whose callbacks have not yet been assigned
+ * a grace-period state).
  */
 #define RCU_DONE_TAIL		0	/* Also RCU_WAIT head. */
 #define RCU_WAIT_TAIL		1	/* Also RCU_NEXT_READY head. */
diff --git a/kernel/rcu/rcu_segcblist.c b/kernel/rcu/rcu_segcblist.c
index 11174e2be3c2..0f13016c0a2c 100644
--- a/kernel/rcu/rcu_segcblist.c
+++ b/kernel/rcu/rcu_segcblist.c
@@ -495,7 +495,8 @@ static void rcu_segcblist_advance_compact(struct rcu_segcblist *rsclp, int i)
 
 /*
  * Advance the callbacks in the specified rcu_segcblist structure based
- * on the current value of the grace-period counter.
+ * on the current grace-period state.  Checks both normal and expedited
+ * grace periods, advancing callbacks when either GP type completes.
  */
 void rcu_segcblist_advance(struct rcu_segcblist *rsclp)
 {
@@ -506,8 +507,10 @@ void rcu_segcblist_advance(struct rcu_segcblist *rsclp)
 		return;
 
 	/*
-	 * Find all callbacks whose ->gp_seq numbers indicate that they
-	 * are ready to invoke, and put them into the RCU_DONE_TAIL segment.
+	 * Find all callbacks whose grace periods have completed (either
+	 * normal or expedited) and put them into the RCU_DONE_TAIL segment.
+	 * We check against the current global GP state, which includes
+	 * proper memory barriers and handles special completion values.
 	 */
 	for (i = RCU_WAIT_TAIL; i < RCU_NEXT_TAIL; i++) {
 		if (!poll_state_synchronize_rcu_full(&rsclp->gp_seq_full[i]))
@@ -534,9 +537,9 @@ void rcu_segcblist_advance(struct rcu_segcblist *rsclp)
  * them to complete at the end of the earlier grace period.
  *
  * This function operates on an rcu_segcblist structure, and also the
- * grace-period sequence number seq at which new callbacks would become
+ * grace-period state rgosp at which new callbacks would become
  * ready to invoke.  Returns true if there are callbacks that won't be
- * ready to invoke until seq, false otherwise.
+ * ready to invoke until the grace period represented by rgosp, false otherwise.
  */
 bool rcu_segcblist_accelerate(struct rcu_segcblist *rsclp, struct rcu_gp_oldstate *rgosp)
 {
@@ -548,11 +551,11 @@ bool rcu_segcblist_accelerate(struct rcu_segcblist *rsclp, struct rcu_gp_oldstat
 
 	/*
 	 * Find the segment preceding the oldest segment of callbacks
-	 * whose ->gp_seq[] completion is at or after that passed in via
-	 * "seq", skipping any empty segments.  This oldest segment, along
+	 * whose grace period completion is at or after that passed in via
+	 * "rgosp", skipping any empty segments.  This oldest segment, along
 	 * with any later segments, can be merged in with any newly arrived
-	 * callbacks in the RCU_NEXT_TAIL segment, and assigned "seq"
-	 * as their ->gp_seq[] grace-period completion sequence number.
+	 * callbacks in the RCU_NEXT_TAIL segment, and assigned "rgosp"
+	 * as their grace-period completion state.
 	 */
 	for (i = RCU_NEXT_READY_TAIL; i > RCU_DONE_TAIL; i--)
 		if (!rcu_segcblist_segempty(rsclp, i) &&
@@ -561,7 +564,7 @@ bool rcu_segcblist_accelerate(struct rcu_segcblist *rsclp, struct rcu_gp_oldstat
 
 	/*
 	 * If all the segments contain callbacks that correspond to
-	 * earlier grace-period sequence numbers than "seq", leave.
+	 * earlier grace-period sequence numbers than "rgosp", leave.
 	 * Assuming that the rcu_segcblist structure has enough
 	 * segments in its arrays, this can only happen if some of
 	 * the non-done segments contain callbacks that really are
@@ -569,15 +572,15 @@ bool rcu_segcblist_accelerate(struct rcu_segcblist *rsclp, struct rcu_gp_oldstat
 	 * out by the next call to rcu_segcblist_advance().
 	 *
 	 * Also advance to the oldest segment of callbacks whose
-	 * ->gp_seq[] completion is at or after that passed in via "seq",
+	 * ->gp_seq_full[] completion is at or after that passed in via "rgosp",
 	 * skipping any empty segments.
 	 *
 	 * Note that segment "i" (and any lower-numbered segments
 	 * containing older callbacks) will be unaffected, and their
-	 * grace-period numbers remain unchanged.  For example, if i ==
+	 * grace-period states remain unchanged.  For example, if i ==
 	 * WAIT_TAIL, then neither WAIT_TAIL nor DONE_TAIL will be touched.
 	 * Instead, the CBs in NEXT_TAIL will be merged with those in
-	 * NEXT_READY_TAIL and the grace-period number of NEXT_READY_TAIL
+	 * NEXT_READY_TAIL and the grace-period state of NEXT_READY_TAIL
 	 * would be updated.  NEXT_TAIL would then be empty.
 	 */
 	if (rcu_segcblist_restempty(rsclp, i) || ++i >= RCU_NEXT_TAIL)
@@ -589,8 +592,8 @@ bool rcu_segcblist_accelerate(struct rcu_segcblist *rsclp, struct rcu_gp_oldstat
 
 	/*
 	 * Merge all later callbacks, including newly arrived callbacks,
-	 * into the segment located by the for-loop above.  Assign "seq"
-	 * as the ->gp_seq[] value in order to correctly handle the case
+	 * into the segment located by the for-loop above.  Assign "rgosp"
+	 * as the grace-period state in order to correctly handle the case
 	 * where there were no pending callbacks in the rcu_segcblist
 	 * structure other than in the RCU_NEXT_TAIL segment.
 	 */
@@ -644,6 +647,10 @@ void srcu_segcblist_advance(struct rcu_segcblist *rsclp, unsigned long seq)
 	if (rcu_segcblist_restempty(rsclp, RCU_DONE_TAIL))
 		return;
 
+	/*
+	 * Find all callbacks whose normal GP sequence numbers indicate
+	 * that they are ready to invoke.  For SRCU, we only check rgos_norm.
+	 */
 	for (i = RCU_WAIT_TAIL; i < RCU_NEXT_TAIL; i++) {
 		if (ULONG_CMP_LT(seq, rsclp->gp_seq_full[i].rgos_norm))
 			break;
@@ -658,6 +665,12 @@ void srcu_segcblist_advance(struct rcu_segcblist *rsclp, unsigned long seq)
 	rcu_segcblist_advance_compact(rsclp, i);
 }
 
+/*
+ * SRCU wrapper for rcu_segcblist_accelerate() - converts SRCU's unsigned
+ * long GP sequence to rcu_gp_oldstate format with rgos_exp set to
+ * RCU_GET_STATE_NOT_TRACKED (since SRCU does not use expedited GPs)
+ * and calls the core rcu_segcblist_accelerate().
+ */
 bool srcu_segcblist_accelerate(struct rcu_segcblist *rsclp, unsigned long seq)
 {
 	struct rcu_gp_oldstate rgos = { .rgos_norm = seq };
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 35076092f754..0e43866dc4cd 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1209,7 +1209,7 @@ static void rcu_accelerate_cbs_unlocked(struct rcu_node *rnp,
 /*
  * Move any callbacks whose grace period has completed to the
  * RCU_DONE_TAIL sublist, then compact the remaining sublists and
- * assign ->gp_seq numbers to any callbacks in the RCU_NEXT_TAIL
+ * assign ->gp_seq_full[] state to any callbacks in the RCU_NEXT_TAIL
  * sublist.  This function is idempotent, so it does not hurt to
  * invoke it repeatedly.  As long as it is not invoked -too- often...
  * Returns true if the RCU grace-period kthread needs to be awakened.
@@ -1226,8 +1226,8 @@ static bool rcu_advance_cbs(struct rcu_node *rnp, struct rcu_data *rdp)
 		return false;
 
 	/*
-	 * Find all callbacks whose ->gp_seq numbers indicate that they
-	 * are ready to invoke, and put them into the RCU_DONE_TAIL sublist.
+	 * Find all callbacks whose grace periods have completed (either
+	 * normal or expedited) and put them into the RCU_DONE_TAIL sublist.
 	 */
 	rcu_segcblist_advance(&rdp->cblist);
 
-- 
2.52.0


^ permalink raw reply related

* [RFC PATCH 05/10] rcu: Enable RCU callbacks to benefit from expedited grace periods
From: Puranjay Mohan @ 2026-04-17 23:11 UTC (permalink / raw)
  To: rcu, linux-kernel, linux-trace-kernel
  Cc: Puranjay Mohan, Paul E. McKenney, Frederic Weisbecker,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Masami Hiramatsu, Davidlohr Bueso
In-Reply-To: <20260417231203.785172-1-puranjay@kernel.org>

Currently, RCU callbacks only track normal grace period sequence
numbers.  This means callbacks must wait for normal grace periods to
complete even when expedited grace periods have already elapsed.

This commit uses the full rcu_gp_oldstate structure (which tracks both
normal and expedited GP sequences) throughout the callback
infrastructure.

The rcu_segcblist_advance() function now checks both normal and
expedited GP completion via poll_state_synchronize_rcu_full(), becoming
parameterless since it reads the GP state internally.
rcu_segcblist_accelerate() stores the full GP state (both normal and
expedited sequences) instead of just the normal sequence.

The rcu_accelerate_cbs() and rcu_accelerate_cbs_unlocked() functions use
get_state_synchronize_rcu_full() to capture both GP sequences. The NOCB
code uses poll_state_synchronize_rcu_full() for advance checks instead
of comparing only the normal GP sequence.

srcu_segcblist_advance() become standalone implementations because
compares SRCU sequences directly (it cannot use
poll_state_synchronize_rcu_full(), which reads RCU-specific globals).
srcu_segcblist_accelerate() sets rgos_exp to RCU_GET_STATE_NOT_TRACKED
so poll_state_synchronize_rcu_full() only compares the rgosp->rgos_norm
and ignores rgos_exp.

Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
 kernel/rcu/rcu_segcblist.c | 30 ++++++++++++++++++++++++------
 kernel/rcu/rcu_segcblist.h |  2 +-
 kernel/rcu/tree.c          |  9 +++------
 kernel/rcu/tree_nocb.h     | 33 +++++++++++++++++++++++----------
 4 files changed, 51 insertions(+), 23 deletions(-)

diff --git a/kernel/rcu/rcu_segcblist.c b/kernel/rcu/rcu_segcblist.c
index 00e164db8b74..11174e2be3c2 100644
--- a/kernel/rcu/rcu_segcblist.c
+++ b/kernel/rcu/rcu_segcblist.c
@@ -12,6 +12,7 @@
 #include <linux/kernel.h>
 #include <linux/types.h>
 
+#include "rcu.h"
 #include "rcu_segcblist.h"
 
 /* Initialize simple callback list. */
@@ -494,9 +495,9 @@ static void rcu_segcblist_advance_compact(struct rcu_segcblist *rsclp, int i)
 
 /*
  * Advance the callbacks in the specified rcu_segcblist structure based
- * on the current value passed in for the grace-period counter.
+ * on the current value of the grace-period counter.
  */
-void rcu_segcblist_advance(struct rcu_segcblist *rsclp, struct rcu_gp_oldstate *rgosp)
+void rcu_segcblist_advance(struct rcu_segcblist *rsclp)
 {
 	int i;
 
@@ -509,7 +510,7 @@ void rcu_segcblist_advance(struct rcu_segcblist *rsclp, struct rcu_gp_oldstate *
 	 * are ready to invoke, and put them into the RCU_DONE_TAIL segment.
 	 */
 	for (i = RCU_WAIT_TAIL; i < RCU_NEXT_TAIL; i++) {
-		if (ULONG_CMP_LT(rgosp->rgos_norm, rsclp->gp_seq_full[i].rgos_norm))
+		if (!poll_state_synchronize_rcu_full(&rsclp->gp_seq_full[i]))
 			break;
 		WRITE_ONCE(rsclp->tails[RCU_DONE_TAIL], rsclp->tails[i]);
 		rcu_segcblist_move_seglen(rsclp, i, RCU_DONE_TAIL);
@@ -595,7 +596,7 @@ bool rcu_segcblist_accelerate(struct rcu_segcblist *rsclp, struct rcu_gp_oldstat
 	 */
 	for (; i < RCU_NEXT_TAIL; i++) {
 		WRITE_ONCE(rsclp->tails[i], rsclp->tails[RCU_NEXT_TAIL]);
-		rsclp->gp_seq_full[i].rgos_norm = rgosp->rgos_norm;
+		rsclp->gp_seq_full[i] = *rgosp;
 	}
 	return true;
 }
@@ -637,14 +638,31 @@ void rcu_segcblist_merge(struct rcu_segcblist *dst_rsclp,
 
 void srcu_segcblist_advance(struct rcu_segcblist *rsclp, unsigned long seq)
 {
-	struct rcu_gp_oldstate rgos = { .rgos_norm = seq };
+	int i;
 
-	rcu_segcblist_advance(rsclp, &rgos);
+	WARN_ON_ONCE(!rcu_segcblist_is_enabled(rsclp));
+	if (rcu_segcblist_restempty(rsclp, RCU_DONE_TAIL))
+		return;
+
+	for (i = RCU_WAIT_TAIL; i < RCU_NEXT_TAIL; i++) {
+		if (ULONG_CMP_LT(seq, rsclp->gp_seq_full[i].rgos_norm))
+			break;
+		WRITE_ONCE(rsclp->tails[RCU_DONE_TAIL], rsclp->tails[i]);
+		rcu_segcblist_move_seglen(rsclp, i, RCU_DONE_TAIL);
+	}
+
+	/* If no callbacks moved, nothing more need be done. */
+	if (i == RCU_WAIT_TAIL)
+		return;
+
+	rcu_segcblist_advance_compact(rsclp, i);
 }
 
 bool srcu_segcblist_accelerate(struct rcu_segcblist *rsclp, unsigned long seq)
 {
 	struct rcu_gp_oldstate rgos = { .rgos_norm = seq };
 
+	if (IS_ENABLED(CONFIG_SMP))
+		rgos.rgos_exp = RCU_GET_STATE_NOT_TRACKED;
 	return rcu_segcblist_accelerate(rsclp, &rgos);
 }
diff --git a/kernel/rcu/rcu_segcblist.h b/kernel/rcu/rcu_segcblist.h
index 2c06ab830a3d..6e05fdf93e7b 100644
--- a/kernel/rcu/rcu_segcblist.h
+++ b/kernel/rcu/rcu_segcblist.h
@@ -139,7 +139,7 @@ void rcu_segcblist_insert_done_cbs(struct rcu_segcblist *rsclp,
 				   struct rcu_cblist *rclp);
 void rcu_segcblist_insert_pend_cbs(struct rcu_segcblist *rsclp,
 				   struct rcu_cblist *rclp);
-void rcu_segcblist_advance(struct rcu_segcblist *rsclp, struct rcu_gp_oldstate *rgosp);
+void rcu_segcblist_advance(struct rcu_segcblist *rsclp);
 bool rcu_segcblist_accelerate(struct rcu_segcblist *rsclp, struct rcu_gp_oldstate *rgosp);
 void rcu_segcblist_merge(struct rcu_segcblist *dst_rsclp,
 			 struct rcu_segcblist *src_rsclp);
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 607fc5715cd1..35076092f754 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1164,7 +1164,7 @@ static bool rcu_accelerate_cbs(struct rcu_node *rnp, struct rcu_data *rdp)
 	 * accelerating callback invocation to an earlier grace-period
 	 * number.
 	 */
-	rgos.rgos_norm = rcu_seq_snap(&rcu_state.gp_seq);
+	get_state_synchronize_rcu_full(&rgos);
 	if (rcu_segcblist_accelerate(&rdp->cblist, &rgos))
 		ret = rcu_start_this_gp(rnp, rdp, rgos.rgos_norm);
 
@@ -1193,7 +1193,7 @@ static void rcu_accelerate_cbs_unlocked(struct rcu_node *rnp,
 	bool needwake;
 
 	rcu_lockdep_assert_cblist_protected(rdp);
-	rgos.rgos_norm = rcu_seq_snap(&rcu_state.gp_seq);
+	get_state_synchronize_rcu_full(&rgos);
 	if (!READ_ONCE(rdp->gpwrap) && ULONG_CMP_GE(rdp->gp_seq_needed, rgos.rgos_norm)) {
 		/* Old request still live, so mark recent callbacks. */
 		(void)rcu_segcblist_accelerate(&rdp->cblist, &rgos);
@@ -1218,8 +1218,6 @@ static void rcu_accelerate_cbs_unlocked(struct rcu_node *rnp,
  */
 static bool rcu_advance_cbs(struct rcu_node *rnp, struct rcu_data *rdp)
 {
-	struct rcu_gp_oldstate rgos;
-
 	rcu_lockdep_assert_cblist_protected(rdp);
 	raw_lockdep_assert_held_rcu_node(rnp);
 
@@ -1231,8 +1229,7 @@ static bool rcu_advance_cbs(struct rcu_node *rnp, struct rcu_data *rdp)
 	 * Find all callbacks whose ->gp_seq numbers indicate that they
 	 * are ready to invoke, and put them into the RCU_DONE_TAIL sublist.
 	 */
-	rgos.rgos_norm = rnp->gp_seq;
-	rcu_segcblist_advance(&rdp->cblist, &rgos);
+	rcu_segcblist_advance(&rdp->cblist);
 
 	/* Classify any remaining callbacks. */
 	return rcu_accelerate_cbs(rnp, rdp);
diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
index 1837eedfb8c2..7462cd5e2507 100644
--- a/kernel/rcu/tree_nocb.h
+++ b/kernel/rcu/tree_nocb.h
@@ -502,7 +502,7 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
 		}
 		if (j != rdp->nocb_gp_adv_time &&
 		    rcu_segcblist_nextgp(&rdp->cblist, &cur_gp_seq_full) &&
-		    rcu_seq_done(&rdp->mynode->gp_seq, cur_gp_seq_full.rgos_norm)) {
+		    poll_state_synchronize_rcu_full(&cur_gp_seq_full)) {
 			rcu_advance_cbs_nowake(rdp->mynode, rdp);
 			rdp->nocb_gp_adv_time = j;
 		}
@@ -731,7 +731,7 @@ static void nocb_gp_wait(struct rcu_data *my_rdp)
 		if (!rcu_segcblist_restempty(&rdp->cblist,
 					     RCU_NEXT_READY_TAIL) ||
 		    (rcu_segcblist_nextgp(&rdp->cblist, &cur_gp_seq_full) &&
-		     rcu_seq_done(&rnp->gp_seq, cur_gp_seq_full.rgos_norm))) {
+		     poll_state_synchronize_rcu_full(&cur_gp_seq_full))) {
 			raw_spin_lock_rcu_node(rnp); /* irqs disabled. */
 			needwake_gp = rcu_advance_cbs(rnp, rdp);
 			wasempty = rcu_segcblist_restempty(&rdp->cblist,
@@ -742,7 +742,18 @@ static void nocb_gp_wait(struct rcu_data *my_rdp)
 		WARN_ON_ONCE(wasempty &&
 			     !rcu_segcblist_restempty(&rdp->cblist,
 						      RCU_NEXT_READY_TAIL));
-		if (rcu_segcblist_nextgp(&rdp->cblist, &cur_gp_seq_full)) {
+		/*
+		 * Only request a GP wait if the next pending callback's
+		 * GP has not already completed (normal or expedited).
+		 * If poll_state_synchronize_rcu_full() says it completed,
+		 * then rcu_advance_cbs() above already moved those
+		 * callbacks to RCU_DONE_TAIL, so there is no GP to wait
+		 * for.  Any remaining callbacks got new (future) GP
+		 * numbers from rcu_accelerate_cbs() inside
+		 * rcu_advance_cbs() and will be handled on the next pass.
+		 */
+		if (rcu_segcblist_nextgp(&rdp->cblist, &cur_gp_seq_full) &&
+		    !poll_state_synchronize_rcu_full(&cur_gp_seq_full)) {
 			if (!needwait_gp ||
 			    ULONG_CMP_LT(cur_gp_seq_full.rgos_norm, wait_gp_seq))
 				wait_gp_seq = cur_gp_seq_full.rgos_norm;
@@ -919,7 +930,7 @@ static void nocb_cb_wait(struct rcu_data *rdp)
 	lockdep_assert_irqs_enabled();
 	rcu_nocb_lock_irqsave(rdp, flags);
 	if (rcu_segcblist_nextgp(cblist, &cur_gp_seq_full) &&
-	    rcu_seq_done(&rnp->gp_seq, cur_gp_seq_full.rgos_norm) &&
+	    poll_state_synchronize_rcu_full(&cur_gp_seq_full) &&
 	    raw_spin_trylock_rcu_node(rnp)) { /* irqs already disabled. */
 		needwake_gp = rcu_advance_cbs(rdp->mynode, rdp);
 		raw_spin_unlock_rcu_node(rnp); /* irqs remain disabled. */
@@ -1548,8 +1559,8 @@ static void show_rcu_nocb_gp_state(struct rcu_data *rdp)
 static void show_rcu_nocb_state(struct rcu_data *rdp)
 {
 	char bufd[22];
-	char bufw[45];
-	char bufr[45];
+	char bufw[64];
+	char bufr[64];
 	char bufn[22];
 	char bufb[22];
 	struct rcu_data *nocb_next_rdp;
@@ -1569,10 +1580,12 @@ static void show_rcu_nocb_state(struct rcu_data *rdp)
 					      nocb_entry_rdp);
 
 	sprintf(bufd, "%ld", rsclp->seglen[RCU_DONE_TAIL]);
-	sprintf(bufw, "%ld(%ld)", rsclp->seglen[RCU_WAIT_TAIL],
-		rsclp->gp_seq_full[RCU_WAIT_TAIL].rgos_norm);
-	sprintf(bufr, "%ld(%ld)", rsclp->seglen[RCU_NEXT_READY_TAIL],
-		rsclp->gp_seq_full[RCU_NEXT_READY_TAIL].rgos_norm);
+	sprintf(bufw, "%ld(%ld/%ld)", rsclp->seglen[RCU_WAIT_TAIL],
+		rsclp->gp_seq_full[RCU_WAIT_TAIL].rgos_norm,
+		rsclp->gp_seq_full[RCU_WAIT_TAIL].rgos_exp);
+	sprintf(bufr, "%ld(%ld/%ld)", rsclp->seglen[RCU_NEXT_READY_TAIL],
+		rsclp->gp_seq_full[RCU_NEXT_READY_TAIL].rgos_norm,
+		rsclp->gp_seq_full[RCU_NEXT_READY_TAIL].rgos_exp);
 	sprintf(bufn, "%ld", rsclp->seglen[RCU_NEXT_TAIL]);
 	sprintf(bufb, "%ld", rcu_cblist_n_cbs(&rdp->nocb_bypass));
 	pr_info("   CB %d^%d->%d %c%c%c%c%c F%ld L%ld C%d %c%s%c%s%c%s%c%s%c%s q%ld %c CPU %d%s\n",
-- 
2.52.0


^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox