Linux Trace Kernel

Linux Trace Kernel
 help / color / mirror / Atom feed

* Re: [PATCHv5 bpf-next 06/28] bpf: Move trampoline image setup into bpf_trampoline_ops callbacks
From: bot+bpf-ci @ 2026-04-17 20:10 UTC (permalink / raw)
  To: jolsa, ast, daniel, andrii
  Cc: bpf, linux-trace-kernel, kafai, eddyz87, songliubraving, yhs,
	menglong8.dong, rostedt, ast, andrii, daniel, martin.lau, eddyz87,
	yonghong.song, clm, ihor.solodrai
In-Reply-To: <20260417192502.194548-7-jolsa@kernel.org>

[-- Attachment #1: Type: text/plain, Size: 783 bytes --]

> There's slight functional change for the unregister path, where we
> currently free the image unconditionally even if the detach fails.
> The new code keeps the image in place, possibly preventing the crash.

This appears to be fixing a bug where the image was freed even when detach
failed. The bug was introduced in commit e21aa341785c ("bpf: Fix fexit
trampoline.") which unconditionally freed the trampoline image.

Should this commit include a Fixes tag? The suggested tag would be:

Fixes: e21aa341785c ("bpf: Fix fexit trampoline.")


---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/24583317711

^ permalink raw reply

* Re: [PATCH v3 2/2] blk-mq: expose tag starvation counts via debugfs
From: Jens Axboe @ 2026-04-17 20:16 UTC (permalink / raw)
  To: Bart Van Assche, Aaron Tomlin, rostedt, mhiramat,
	mathieu.desnoyers
  Cc: johannes.thumshirn, kch, dlemoal, ritesh.list, loberman, neelx,
	sean, mproche, chjohnst, linux-block, linux-kernel,
	linux-trace-kernel
In-Reply-To: <ac8d9fd1-15c6-44cf-ba5a-85bd4aefce9b@acm.org>

On 4/17/26 12:28 PM, Bart Van Assche wrote:
> On 3/19/26 3:19 PM, Aaron Tomlin wrote:
>> To guarantee zero performance overhead for production kernels compiled
>> without debugfs, the underlying atomic_t variables and their associated
>> increment routines are strictly guarded behind CONFIG_BLK_DEBUG_FS.
>> When this configuration is disabled, the tracking logic compiles down
>> to a safe no-op.
> 
> I don't think that's sufficient. Please use per-cpu counters to
> minimize the overhead for kernels in which debugfs is enabled.

Agree, this is the usual nonsense of thinking you can hide any overhead
behind a config option, when in practice production kernels very much DO
have CONFIG_DEBUGFS enabled.

-- 
Jens Axboe

^ permalink raw reply

* Re: [PATCHv5 bpf-next 12/28] bpf: Add bpf_trampoline_multi_attach/detach functions
From: bot+bpf-ci @ 2026-04-17 20:22 UTC (permalink / raw)
  To: jolsa, ast, daniel, andrii
  Cc: bpf, linux-trace-kernel, kafai, eddyz87, songliubraving, yhs,
	menglong8.dong, rostedt, ast, andrii, daniel, martin.lau, eddyz87,
	yonghong.song, clm, ihor.solodrai
In-Reply-To: <20260417192502.194548-13-jolsa@kernel.org>

[-- Attachment #1: Type: text/plain, Size: 7889 bytes --]

> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index c815227f3bc6..37c96daaae9a 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h

[ ... ]

> diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c
> index 23856bd28d59..f6180f15d5a4 100644
> --- a/kernel/bpf/trampoline.c
> +++ b/kernel/bpf/trampoline.c
> @@ -1447,6 +1447,276 @@ int __weak arch_bpf_trampoline_size(const struct btf_func_model *m, u32 flags,

[ ... ]

> +int bpf_trampoline_multi_attach(struct bpf_prog *prog, u32 *ids,
> +				struct bpf_tracing_multi_link *link)
> +{
> +	struct bpf_tracing_multi_data *data = &link->data;
> +	struct bpf_attach_target_info tgt_info = {};
> +	struct btf *btf = prog->aux->attach_btf;
> +	struct bpf_tracing_multi_node *mnode;
> +	struct bpf_trampoline *tr;
> +	int i, err, rollback_cnt;
> +	u64 key;
> +
> +	for_each_mnode(mnode, link) {
> +		rollback_cnt = i;
> +
> +		err = bpf_check_attach_btf_id_multi(btf, prog, ids[i], &tgt_info);
> +		if (err)
> +			goto rollback_put;
> +
> +		key = bpf_trampoline_compute_key(NULL, btf, ids[i]);
> +
> +		tr = bpf_trampoline_get(key, &tgt_info);
> +		if (!tr) {
> +			err = -ENOMEM;
> +			goto rollback_put;
> +		}
> +
> +		mnode->trampoline = tr;
> +		mnode->node.link = &link->link;
> +
> +		cond_resched();
> +	}
> +
> +	err = init_tracing_multi_data(data);
> +	if (err) {
> +		rollback_cnt = link->nodes_cnt;
> +		goto rollback_put;
> +	}
> +
> +	trampoline_lock_all();
> +
> +	for_each_mnode(mnode, link) {
> +		bpf_trampoline_multi_attach_init(mnode->trampoline);
> +
> +		data->entry = &mnode->entry;
> +		err = __bpf_trampoline_link_prog(&mnode->node, mnode->trampoline, NULL,
> +						 &trampoline_multi_ops, data);
> +		if (err) {
> +			rollback_cnt = i;
> +			goto rollback_unlink;
> +		}
> +	}
> +
> +	rollback_cnt = link->nodes_cnt;
> +	if (ftrace_hash_count(data->reg)) {
> +		err = update_ftrace_direct_add(&direct_ops, data->reg);
> +		if (err)
> +			goto rollback_unlink;
> +	}
> +
> +	if (ftrace_hash_count(data->modify)) {
> +		err = update_ftrace_direct_mod(&direct_ops, data->modify, true);
> +		if (err) {
> +			WARN_ON_ONCE(update_ftrace_direct_del(&direct_ops, data->reg));
> +			goto rollback_unlink;
> +		}
> +	}

Does update_ftrace_direct_del need a guard here?

When update_ftrace_direct_mod fails, the recovery code unconditionally
calls update_ftrace_direct_del(&direct_ops, data->reg). However,
update_ftrace_direct_add is itself only invoked when data->reg is
non-empty:

    if (ftrace_hash_count(data->reg)) {
        err = update_ftrace_direct_add(&direct_ops, data->reg);
        ...
    }

If all trampolines being multi-attached already had cur_image != NULL,
nothing is ever added to data->reg, and the add call is skipped. Yet
the recovery path still calls update_ftrace_direct_del with the empty
data->reg.

update_ftrace_direct_del() at kernel/trace/ftrace.c:6511 does:

    if (!hash_count(hash))
        return -EINVAL;

so it returns -EINVAL, and WARN_ON_ONCE fires. No further damage is
done since the rollback_unlink path still unwinds correctly, but this
is a spurious warning that becomes visible whenever
update_ftrace_direct_mod fails and data->reg happens to be empty.

The undo should match the guard used for the add:

    if (err) {
        if (ftrace_hash_count(data->reg))
            WARN_ON_ONCE(update_ftrace_direct_del(&direct_ops, data->reg));
        goto rollback_unlink;
    }

> +
> +	for_each_mnode(mnode, link)
> +		bpf_trampoline_multi_attach_free(mnode->trampoline);
> +
> +	trampoline_unlock_all();
> +
> +	remove_tracing_multi_data(data);
> +	return 0;
> +
> +rollback_unlink:
> +	for_each_mnode_cnt(mnode, link, rollback_cnt) {
> +		bpf_trampoline_remove_prog(mnode->trampoline, &mnode->node);
> +		bpf_trampoline_multi_attach_rollback(mnode->trampoline);
> +	}
> +
> +	trampoline_unlock_all();
> +
> +	clear_tracing_multi_data(data);
> +	rollback_cnt = link->nodes_cnt;
> +
> +rollback_put:
> +	for_each_mnode_cnt(mnode, link, rollback_cnt)
> +		bpf_trampoline_put(mnode->trampoline);
> +
> +	return err;
> +}

[ ... ]

> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 739e730add96..4f5a21a4eddc 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -19658,6 +19658,58 @@ static int check_attach_btf_id(struct bpf_verifier_env *env)
> +int bpf_check_attach_btf_id_multi(struct btf *btf, struct bpf_prog *prog, u32 btf_id,
> +				  struct bpf_attach_target_info *tgt_info)
> +{
> +	const struct btf_type *t;
> +	unsigned long addr;
> +	const char *tname;
> +	int err;
> +
> +	if (!btf_id || !btf)
> +		return -EINVAL;
> +
> +	/* Check noreturn attachment. */
> +	if (prog->expected_attach_type == BPF_TRACE_FEXIT_MULTI ||
> +	     btf_id_set_contains(&noreturn_deny, btf_id))
> +		return -EINVAL;

Should this be && instead of ||?

This condition rejects all BPF_TRACE_FEXIT_MULTI programs regardless
of whether the target is on the noreturn_deny list. The C operator
precedence makes this `A || B`, so if the program's expected_attach_type
is BPF_TRACE_FEXIT_MULTI, the entire condition is TRUE and the
function returns -EINVAL.

This breaks FEXIT_MULTI attachment support. The caller
bpf_trampoline_multi_attach is the multi-attach path that supports both
BPF_TRACE_FENTRY_MULTI and BPF_TRACE_FEXIT_MULTI (see is_tracing_multi()
in include/linux/bpf.h).

The equivalent check in the single-target path, check_attach_btf_id,
correctly uses &&:

    } else if ((prog->expected_attach_type == BPF_TRACE_FEXIT ||
               prog->expected_attach_type == BPF_TRACE_FSESSION ||
               prog->expected_attach_type == BPF_MODIFY_RETURN) &&
               btf_id_set_contains(&noreturn_deny, btf_id)) {
            ...
    }

The intended semantics are to reject FEXIT_MULTI attachments only when
the target is on noreturn_deny. The correct operator is &&:

    if (prog->expected_attach_type == BPF_TRACE_FEXIT_MULTI &&
        btf_id_set_contains(&noreturn_deny, btf_id))
        return -EINVAL;

Additional evidence:
1. The comment says "Check noreturn attachment" but the current code
   rejects all FEXIT_MULTI regardless of noreturn status.
2. The indentation of the second line matches the alignment pattern
   used when lines are joined by && rather than ||.
3. bpf_check_attach_target handles case BPF_TRACE_FEXIT_MULTI as a
   valid attach type in its switch.

Note: In this commit, bpf_trampoline_multi_attach has no callers yet;
it is infrastructure wired up by a later commit. Once the link creation
path invokes bpf_trampoline_multi_attach for a FEXIT_MULTI program,
every such attach will fail with -EINVAL.

> +
> +	/* Check and get function target data. */
> +	t = btf_type_by_id(btf, btf_id);
> +	if (!t)
> +		return -EINVAL;
> +	tname = btf_name_by_offset(btf, t->name_off);
> +	if (!tname)
> +		return -EINVAL;
> +	if (!btf_type_is_func(t))
> +		return -EINVAL;
> +	t = btf_type_by_id(btf, t->type);
> +	if (!btf_type_is_func_proto(t))
> +		return -EINVAL;
> +	err = btf_distill_func_proto(NULL, btf, t, tname, &tgt_info->fmodel);
> +	if (err < 0)
> +		return err;
> +	if (btf_is_module(btf)) {
> +		/* The bpf program already holds reference to module. */
> +		if (WARN_ON_ONCE(!prog->aux->mod))
> +			return -EINVAL;
> +		addr = find_kallsyms_symbol_value(prog->aux->mod, tname);
> +	} else {
> +		addr = kallsyms_lookup_name(tname);
> +	}
> +	if (!addr || !ftrace_location(addr))
> +		return -ENOENT;
> +
> +	/* Check sleepable program attachment. */
> +	if (prog->sleepable) {
> +		err = btf_id_allow_sleepable(btf_id, addr, prog, btf);
> +		if (err)
> +			return err;
> +	}
> +	tgt_info->tgt_addr = addr;
> +	return 0;
> +}

[ ... ]


---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/24583317711

^ permalink raw reply

* Re: [PATCHv5 bpf-next 08/28] bpf: Add struct bpf_tramp_node object
From: bot+bpf-ci @ 2026-04-17 20:22 UTC (permalink / raw)
  To: jolsa, ast, daniel, andrii
  Cc: hengqi.chen, bpf, linux-trace-kernel, kafai, eddyz87,
	songliubraving, yhs, menglong8.dong, rostedt, ast, andrii, daniel,
	martin.lau, eddyz87, yonghong.song, clm, ihor.solodrai
In-Reply-To: <20260417192502.194548-9-jolsa@kernel.org>

[-- Attachment #1: Type: text/plain, Size: 4040 bytes --]

> diff --git a/arch/powerpc/net/bpf_jit_comp.c b/arch/powerpc/net/bpf_jit_comp.c
> index 50103b3794fb..b97d9f68fc40 100644
> --- a/arch/powerpc/net/bpf_jit_comp.c
> +++ b/arch/powerpc/net/bpf_jit_comp.c

[ ... ]

> @@ -827,14 +827,14 @@ static void bpf_trampoline_restore_args_stack(u32 *image, struct codegen_context
>  static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *rw_image,
>  					 void *rw_image_end, void *ro_image,
>  					 const struct btf_func_model *m, u32 flags,
> -					 struct bpf_tramp_links *tlinks,
> +					 struct bpf_tramp_nodes *tnodes,
>  					 void *func_addr)
>  {
>  	int regs_off, func_meta_off, ip_off, run_ctx_off, retval_off;
>  	int nvr_off, alt_lr_off, r4_off = 0;
> -	struct bpf_tramp_links *fmod_ret = &tlinks[BPF_TRAMP_MODIFY_RETURN];
> -	struct bpf_tramp_links *fentry = &tlinks[BPF_TRAMP_FENTRY];
> -	struct bpf_tramp_links *fexit = &tlinks[BPF_TRAMP_FEXIT];
> +	struct bpf_tramp_nodes *fmod_ret = &tnodes[BPF_TRAMP_MODIFY_RETURN];
> +	struct bpf_tramp_nodes *fentry = &tnodes[BPF_TRAMP_FENTRY];
> +	struct bpf_tramp_nodes *fexit = &tnodes[BPF_TRAMP_FEXIT];
>  	int i, ret, nr_regs, retaddr_off, bpf_frame_size = 0;
>  	struct codegen_context codegen_ctx, *ctx;
>  	int cookie_off, cookie_cnt, cookie_ctx_off;
> -	int fsession_cnt = bpf_fsession_cnt(tlinks);
> +	int fsession_cnt = bpf_fsession_cnt(tnodes);
>  	u64 func_meta;

[ ... ]

> @@ -906,7 +906,7 @@ static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *rw_im
>  	}
>
>  	cookie_off = bpf_frame_size;
> -	cookie_cnt = bpf_fsession_cookie_cnt(tlinks);
> +	cookie_cnt = bpf_fsession_cookie_cnt(tnodes);
>  	bpf_frame_size += cookie_cnt * 8;
                                             ^^^^^^

Does this code compile? The refactoring renamed the parameter from 'tlinks'
to 'tnodes', but this call to bpf_fsession_cookie_cnt still passes 'tlinks',
which is no longer in scope. Should this be bpf_fsession_cookie_cnt(tnodes)?

>
>  	retval_off = bpf_frame_size;

[ ... ]

> @@ -1050,21 +1050,21 @@ static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *rw_im
>
>  	cookie_ctx_off = (regs_off - cookie_off) / 8;
>
> -	for (i = 0; i < fentry->nr_links; i++) {
> -		if (bpf_prog_calls_session_cookie(fentry->links[i])) {
> +	for (i = 0; i < fentry->nr_nodes; i++)
> +		if (bpf_prog_calls_session_cookie(fentry->nodes[i])) {
>  			u64 meta = func_meta | (cookie_ctx_off << BPF_TRAMP_COOKIE_INDEX_SHIFT);
>
>  			store_func_meta(image, ctx, meta, func_meta_off);
>  			cookie_ctx_off--;
>  		}
>
> -		if (invoke_bpf_prog(image, ro_image, ctx, fentry->links[i], regs_off, retval_off,
> +		if (invoke_bpf_prog(image, ro_image, ctx, fentry->nodes[i], regs_off, retval_off,
>  				    run_ctx_off, flags & BPF_TRAMP_F_RET_FENTRY_RET))
>  			return -EINVAL;
>  	}

Did the for-loop's opening brace get dropped by mistake? The loop header
changed from:

    for (i = 0; i < fentry->nr_links; i++) {

to:

    for (i = 0; i < fentry->nr_nodes; i++)

The closing brace at line 1064 is still there, but now there's no matching
opening brace. This creates two problems:

1) The braces are unmatched, which should cause a compilation error.

2) If the compiler somehow accepts this structure, the loop body would
consist only of the first if-statement (the cookie handling), and the
invoke_bpf_prog() call at line 1061 would execute AFTER the loop completes.
At that point, i == fentry->nr_nodes, so fentry->nodes[i] would be an
out-of-bounds access.

The parallel fexit loop later in the file was correctly updated with both
braces preserved.

>
> -	if (fmod_ret->nr_links) {
> -		branches = kcalloc(fmod_ret->nr_links, sizeof(u32), GFP_KERNEL);
> +	if (fmod_ret->nr_nodes) {
> +		branches = kcalloc(fmod_ret->nr_nodes, sizeof(u32), GFP_KERNEL);

[ ... ]


---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/24583317711

^ permalink raw reply

* Re: [PATCHv5 bpf-next 10/28] bpf: Add multi tracing attach types
From: bot+bpf-ci @ 2026-04-17 20:22 UTC (permalink / raw)
  To: jolsa, ast, daniel, andrii
  Cc: bpf, linux-trace-kernel, kafai, eddyz87, songliubraving, yhs,
	menglong8.dong, rostedt, ast, andrii, daniel, martin.lau, eddyz87,
	yonghong.song, clm, ihor.solodrai
In-Reply-To: <20260417192502.194548-11-jolsa@kernel.org>

[-- Attachment #1: Type: text/plain, Size: 1269 bytes --]

> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 552bc5d9afbd..5f814e7101cf 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -1154,6 +1154,8 @@ enum bpf_attach_type {
>  	BPF_TRACE_KPROBE_SESSION,
>  	BPF_TRACE_UPROBE_SESSION,
>  	BPF_TRACE_FSESSION,
> +	BPF_TRACE_FENTRY_MULTI,
> +	BPF_TRACE_FEXIT_MULTI,
>  	__MAX_BPF_ATTACH_TYPE
>  };

In the RFC version posted on February 4, 2026, Leon Hwang asked whether
BPF_TRACE_FSESSION_MULTI should be added alongside BPF_TRACE_FENTRY_MULTI
and BPF_TRACE_FEXIT_MULTI for consistency, since BPF_TRACE_FSESSION exists
and is handled similarly to FENTRY/FEXIT.

Reference:
https://lore.kernel.org/bpf/3015991d-2e9b-4bf4-9c17-26e00b5d5926@linux.dev/

You acknowledged this with "good catch, will add it" in the same thread:
https://lore.kernel.org/bpf/aYM-jeV50WaVik1b@krava/

However, BPF_TRACE_FSESSION_MULTI was not added in any subsequent version
through v5. Should BPF_TRACE_FSESSION_MULTI be included here for API
consistency?


---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/24583317711

^ permalink raw reply

* [RFC PATCH 00/10] RCU: Enable callbacks to benefit from expedited grace periods
From: Puranjay Mohan @ 2026-04-17 23:11 UTC (permalink / raw)
  To: rcu, linux-kernel, linux-trace-kernel
  Cc: Puranjay Mohan, Paul E. McKenney, Frederic Weisbecker,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Masami Hiramatsu, Davidlohr Bueso

RCU callbacks only track normal grace period sequence numbers. This
means callbacks must wait for normal GPs even when expedited GPs have
already elapsed.

This series tracks both normal and expedited GP sequences in the
callback infrastructure using struct rcu_gp_oldstate, so callbacks
advance when either GP type completes.

Commits 1-4 prepare the callback infrastructure: SRCU and Tasks RCU are
isolated behind wrapper functions, the segment compaction logic is
factored out for reuse, gp_seq is widened to struct rcu_gp_oldstate, and
RCU_GET_STATE_NOT_TRACKED is added for subsystems that do not track
expedited GPs.

Commit 5 is the core change: rcu_segcblist_advance() checks both normal
and expedited GP completion via poll_state_synchronize_rcu_full(), and
rcu_accelerate_cbs() captures both GP sequences via
get_state_synchronize_rcu_full(). SRCU wrappers become standalone
implementations since they cannot use poll_state_synchronize_rcu_full()
which reads RCU-specific globals.

Commits 7-9 fix three notification paths so that expedited GP completion
triggers callback processing: NOCB rcuog kthreads are woken,
rcu_pending() detects the completion, and rcu_core() advances the
callbacks.

Commit 10 adds nexp and exp_interval parameters to rcuscale for testing
callback drain under concurrent expedited GP load.

Puranjay Mohan (10):
  rcu/segcblist: Add SRCU and Tasks RCU wrapper functions
  rcu/segcblist: Factor out rcu_segcblist_advance_compact() helper
  rcu/segcblist: Change gp_seq to struct rcu_gp_oldstate gp_seq_full
  rcu: Add RCU_GET_STATE_NOT_TRACKED for subsystems without expedited
    GPs
  rcu: Enable RCU callbacks to benefit from expedited grace periods
  rcu: Update comments for gp_seq_full and expedited GP tracking
  rcu: Wake NOCB rcuog kthreads on expedited grace period completion
  rcu: Detect expedited grace period completion in rcu_pending()
  rcu: Advance callbacks for expedited GP completion in rcu_core()
  rcuscale: Add concurrent expedited GP threads for callback scaling
    tests

 include/linux/rcu_segcblist.h |  16 ++--
 include/trace/events/rcu.h    |   5 +-
 kernel/rcu/rcu.h              |  13 ++-
 kernel/rcu/rcu_segcblist.c    | 145 ++++++++++++++++++++++++----------
 kernel/rcu/rcu_segcblist.h    |   8 +-
 kernel/rcu/rcuscale.c         |  84 +++++++++++++++++++-
 kernel/rcu/srcutree.c         |  14 ++--
 kernel/rcu/tasks.h            |   8 +-
 kernel/rcu/tree.c             |  54 +++++++++----
 kernel/rcu/tree.h             |   1 +
 kernel/rcu/tree_exp.h         |   1 +
 kernel/rcu/tree_nocb.h        |  77 ++++++++++++++----
 12 files changed, 324 insertions(+), 102 deletions(-)

base-commit: 51289896149317e0b6a4abf5d300a569c3353dca
-- 
2.52.0

^ permalink raw reply

* [RFC PATCH 01/10] rcu/segcblist: Add SRCU and Tasks RCU wrapper functions
From: Puranjay Mohan @ 2026-04-17 23:11 UTC (permalink / raw)
  To: rcu, linux-kernel, linux-trace-kernel
  Cc: Puranjay Mohan, Paul E. McKenney, Frederic Weisbecker,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Masami Hiramatsu, Davidlohr Bueso
In-Reply-To: <20260417231203.785172-1-puranjay@kernel.org>

Add srcu_segcblist_advance() and srcu_segcblist_accelerate() wrappers
that forward to the core rcu_segcblist_advance() and
rcu_segcblist_accelerate() functions, and switch all SRCU (srcutree.c)
and Tasks RCU (tasks.h) callers to use these wrappers.

This isolates SRCU and Tasks RCU from upcoming changes to the core
advance/accelerate functions, which will switch to struct
rcu_gp_oldstate for dual normal/expedited GP tracking. Because SRCU and
Tasks RCU use only normal GP sequences, their wrappers will maintain the
existing unsigned long interface.

No functional change.

Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
 kernel/rcu/rcu_segcblist.c | 10 ++++++++++
 kernel/rcu/rcu_segcblist.h |  2 ++
 kernel/rcu/srcutree.c      | 14 +++++++-------
 kernel/rcu/tasks.h         |  8 ++++----
 4 files changed, 23 insertions(+), 11 deletions(-)

diff --git a/kernel/rcu/rcu_segcblist.c b/kernel/rcu/rcu_segcblist.c
index 298a2c573f02..da39d818b01b 100644
--- a/kernel/rcu/rcu_segcblist.c
+++ b/kernel/rcu/rcu_segcblist.c
@@ -620,3 +620,13 @@ void rcu_segcblist_merge(struct rcu_segcblist *dst_rsclp,
 
 	rcu_segcblist_init(src_rsclp);
 }
+
+void srcu_segcblist_advance(struct rcu_segcblist *rsclp, unsigned long seq)
+{
+	rcu_segcblist_advance(rsclp, seq);
+}
+
+bool srcu_segcblist_accelerate(struct rcu_segcblist *rsclp, unsigned long seq)
+{
+	return rcu_segcblist_accelerate(rsclp, seq);
+}
diff --git a/kernel/rcu/rcu_segcblist.h b/kernel/rcu/rcu_segcblist.h
index fadc08ad4b7b..956f2967d9d2 100644
--- a/kernel/rcu/rcu_segcblist.h
+++ b/kernel/rcu/rcu_segcblist.h
@@ -143,3 +143,5 @@ void rcu_segcblist_advance(struct rcu_segcblist *rsclp, unsigned long seq);
 bool rcu_segcblist_accelerate(struct rcu_segcblist *rsclp, unsigned long seq);
 void rcu_segcblist_merge(struct rcu_segcblist *dst_rsclp,
 			 struct rcu_segcblist *src_rsclp);
+void srcu_segcblist_advance(struct rcu_segcblist *rsclp, unsigned long seq);
+bool srcu_segcblist_accelerate(struct rcu_segcblist *rsclp, unsigned long seq);
diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index 7c2f7cc131f7..519a35719c89 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -1351,7 +1351,7 @@ static unsigned long srcu_gp_start_if_needed(struct srcu_struct *ssp,
 	 *  2) The grace period for RCU_WAIT_TAIL is seen as started but not
 	 *     completed so rcu_seq_current() returns X + SRCU_STATE_SCAN1.
 	 *
-	 *  3) This value is passed to rcu_segcblist_advance() which can't move
+	 *  3) This value is passed to srcu_segcblist_advance() which can't move
 	 *     any segment forward and fails.
 	 *
 	 *  4) srcu_gp_start_if_needed() still proceeds with callback acceleration.
@@ -1360,15 +1360,15 @@ static unsigned long srcu_gp_start_if_needed(struct srcu_struct *ssp,
 	 *     RCU_NEXT_READY_TAIL segment as started (ie: X + 4 + SRCU_STATE_SCAN1)
 	 *     so it returns a snapshot of the next grace period, which is X + 12.
 	 *
-	 *  5) The value of X + 12 is passed to rcu_segcblist_accelerate() but the
+	 *  5) The value of X + 12 is passed to srcu_segcblist_accelerate() but the
 	 *     freshly enqueued callback in RCU_NEXT_TAIL can't move to
 	 *     RCU_NEXT_READY_TAIL which already has callbacks for a previous grace
 	 *     period (gp_num = X + 8). So acceleration fails.
 	 */
 	s = rcu_seq_snap(&ssp->srcu_sup->srcu_gp_seq);
 	if (rhp) {
-		rcu_segcblist_advance(&sdp->srcu_cblist,
-				      rcu_seq_current(&ssp->srcu_sup->srcu_gp_seq));
+		srcu_segcblist_advance(&sdp->srcu_cblist,
+				       rcu_seq_current(&ssp->srcu_sup->srcu_gp_seq));
 		/*
 		 * Acceleration can never fail because the base current gp_seq
 		 * used for acceleration is <= the value of gp_seq used for
@@ -1376,7 +1376,7 @@ static unsigned long srcu_gp_start_if_needed(struct srcu_struct *ssp,
 		 * always be able to be emptied by the acceleration into the
 		 * RCU_NEXT_READY_TAIL or RCU_WAIT_TAIL segments.
 		 */
-		WARN_ON_ONCE(!rcu_segcblist_accelerate(&sdp->srcu_cblist, s));
+		WARN_ON_ONCE(!srcu_segcblist_accelerate(&sdp->srcu_cblist, s));
 	}
 	if (ULONG_CMP_LT(sdp->srcu_gp_seq_needed, s)) {
 		sdp->srcu_gp_seq_needed = s;
@@ -1891,8 +1891,8 @@ static void srcu_invoke_callbacks(struct work_struct *work)
 	rcu_cblist_init(&ready_cbs);
 	raw_spin_lock_irq_rcu_node(sdp);
 	WARN_ON_ONCE(!rcu_segcblist_segempty(&sdp->srcu_cblist, RCU_NEXT_TAIL));
-	rcu_segcblist_advance(&sdp->srcu_cblist,
-			      rcu_seq_current(&ssp->srcu_sup->srcu_gp_seq));
+	srcu_segcblist_advance(&sdp->srcu_cblist,
+			       rcu_seq_current(&ssp->srcu_sup->srcu_gp_seq));
 	/*
 	 * Although this function is theoretically re-entrant, concurrent
 	 * callbacks invocation is disallowed to avoid executing an SRCU barrier
diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index 48f0d803c8e2..137eb6c48b2c 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -480,8 +480,8 @@ static int rcu_tasks_need_gpcb(struct rcu_tasks *rtp)
 			if (cpu > 0)
 				ncbsnz += n;
 		}
-		rcu_segcblist_advance(&rtpcp->cblist, rcu_seq_current(&rtp->tasks_gp_seq));
-		(void)rcu_segcblist_accelerate(&rtpcp->cblist, rcu_seq_snap(&rtp->tasks_gp_seq));
+		srcu_segcblist_advance(&rtpcp->cblist, rcu_seq_current(&rtp->tasks_gp_seq));
+		(void)srcu_segcblist_accelerate(&rtpcp->cblist, rcu_seq_snap(&rtp->tasks_gp_seq));
 		if (rtpcp->urgent_gp > 0 && rcu_segcblist_pend_cbs(&rtpcp->cblist)) {
 			if (rtp->lazy_jiffies)
 				rtpcp->urgent_gp--;
@@ -564,7 +564,7 @@ static void rcu_tasks_invoke_cbs(struct rcu_tasks *rtp, struct rcu_tasks_percpu
 	if (rcu_segcblist_empty(&rtpcp->cblist))
 		return;
 	raw_spin_lock_irqsave_rcu_node(rtpcp, flags);
-	rcu_segcblist_advance(&rtpcp->cblist, rcu_seq_current(&rtp->tasks_gp_seq));
+	srcu_segcblist_advance(&rtpcp->cblist, rcu_seq_current(&rtp->tasks_gp_seq));
 	rcu_segcblist_extract_done_cbs(&rtpcp->cblist, &rcl);
 	raw_spin_unlock_irqrestore_rcu_node(rtpcp, flags);
 	len = rcl.len;
@@ -577,7 +577,7 @@ static void rcu_tasks_invoke_cbs(struct rcu_tasks *rtp, struct rcu_tasks_percpu
 	}
 	raw_spin_lock_irqsave_rcu_node(rtpcp, flags);
 	rcu_segcblist_add_len(&rtpcp->cblist, -len);
-	(void)rcu_segcblist_accelerate(&rtpcp->cblist, rcu_seq_snap(&rtp->tasks_gp_seq));
+	(void)srcu_segcblist_accelerate(&rtpcp->cblist, rcu_seq_snap(&rtp->tasks_gp_seq));
 	raw_spin_unlock_irqrestore_rcu_node(rtpcp, flags);
 }
 
-- 
2.52.0


^ permalink raw reply related

* [RFC PATCH 02/10] rcu/segcblist: Factor out rcu_segcblist_advance_compact() helper
From: Puranjay Mohan @ 2026-04-17 23:11 UTC (permalink / raw)
  To: rcu, linux-kernel, linux-trace-kernel
  Cc: Puranjay Mohan, Paul E. McKenney, Frederic Weisbecker,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Masami Hiramatsu, Davidlohr Bueso
In-Reply-To: <20260417231203.785172-1-puranjay@kernel.org>

This commit extracts the tail-pointer cleanup and segment compaction
logic from rcu_segcblist_advance() into a new static helper function,
rcu_segcblist_advance_compact(). This shared logic will be reused by the
upcoming srcu_segcblist_advance() standalone implementation, which
cannot call the core rcu_segcblist_advance() because that function will
use RCU-specific globals.

No functional change.

Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
 kernel/rcu/rcu_segcblist.c | 50 ++++++++++++++++++++++++--------------
 1 file changed, 32 insertions(+), 18 deletions(-)

diff --git a/kernel/rcu/rcu_segcblist.c b/kernel/rcu/rcu_segcblist.c
index da39d818b01b..421f1dadb5e5 100644
--- a/kernel/rcu/rcu_segcblist.c
+++ b/kernel/rcu/rcu_segcblist.c
@@ -462,13 +462,43 @@ void rcu_segcblist_insert_pend_cbs(struct rcu_segcblist *rsclp,
 	WRITE_ONCE(rsclp->tails[RCU_NEXT_TAIL], rclp->tail);
 }
 
+/*
+ * Clean up and compact the segmented callback list after callbacks have been
+ * advanced to the RCU_DONE_TAIL segment.  The @i parameter is the index of the
+ * first segment that was NOT advanced (i.e., the segment after the last one
+ * moved to RCU_DONE_TAIL). This function fixes up tail pointers and compacts
+ * any gaps left by the moved segments.
+ */
+static void rcu_segcblist_advance_compact(struct rcu_segcblist *rsclp, int i)
+{
+	int j;
+
+	/* Clean up tail pointers that might have been misordered above. */
+	for (j = RCU_WAIT_TAIL; j < i; j++)
+		WRITE_ONCE(rsclp->tails[j], rsclp->tails[RCU_DONE_TAIL]);
+
+	/*
+	 * Callbacks moved, so there might be an empty RCU_WAIT_TAIL
+	 * and a non-empty RCU_NEXT_READY_TAIL.  If so, copy the
+	 * RCU_NEXT_READY_TAIL segment to fill the RCU_WAIT_TAIL gap
+	 * created by the now-ready-to-invoke segments.
+	 */
+	for (j = RCU_WAIT_TAIL; i < RCU_NEXT_TAIL; i++, j++) {
+		if (rsclp->tails[j] == rsclp->tails[RCU_NEXT_TAIL])
+			break;  /* No more callbacks. */
+		WRITE_ONCE(rsclp->tails[j], rsclp->tails[i]);
+		rcu_segcblist_move_seglen(rsclp, i, j);
+		rsclp->gp_seq[j] = rsclp->gp_seq[i];
+	}
+}
+
 /*
  * Advance the callbacks in the specified rcu_segcblist structure based
  * on the current value passed in for the grace-period counter.
  */
 void rcu_segcblist_advance(struct rcu_segcblist *rsclp, unsigned long seq)
 {
-	int i, j;
+	int i;
 
 	WARN_ON_ONCE(!rcu_segcblist_is_enabled(rsclp));
 	if (rcu_segcblist_restempty(rsclp, RCU_DONE_TAIL))
@@ -489,23 +519,7 @@ void rcu_segcblist_advance(struct rcu_segcblist *rsclp, unsigned long seq)
 	if (i == RCU_WAIT_TAIL)
 		return;
 
-	/* Clean up tail pointers that might have been misordered above. */
-	for (j = RCU_WAIT_TAIL; j < i; j++)
-		WRITE_ONCE(rsclp->tails[j], rsclp->tails[RCU_DONE_TAIL]);
-
-	/*
-	 * Callbacks moved, so there might be an empty RCU_WAIT_TAIL
-	 * and a non-empty RCU_NEXT_READY_TAIL.  If so, copy the
-	 * RCU_NEXT_READY_TAIL segment to fill the RCU_WAIT_TAIL gap
-	 * created by the now-ready-to-invoke segments.
-	 */
-	for (j = RCU_WAIT_TAIL; i < RCU_NEXT_TAIL; i++, j++) {
-		if (rsclp->tails[j] == rsclp->tails[RCU_NEXT_TAIL])
-			break;  /* No more callbacks. */
-		WRITE_ONCE(rsclp->tails[j], rsclp->tails[i]);
-		rcu_segcblist_move_seglen(rsclp, i, j);
-		rsclp->gp_seq[j] = rsclp->gp_seq[i];
-	}
+	rcu_segcblist_advance_compact(rsclp, i);
 }
 
 /*
-- 
2.52.0


^ permalink raw reply related

* [RFC PATCH 03/10] rcu/segcblist: Change gp_seq to struct rcu_gp_oldstate gp_seq_full
From: Puranjay Mohan @ 2026-04-17 23:11 UTC (permalink / raw)
  To: rcu, linux-kernel, linux-trace-kernel
  Cc: Puranjay Mohan, Paul E. McKenney, Frederic Weisbecker,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Masami Hiramatsu, Davidlohr Bueso
In-Reply-To: <20260417231203.785172-1-puranjay@kernel.org>

This commit renames the ->gp_seq[] field in struct rcu_segcblist to
->gp_seq_full[] and changes its type from unsigned long to struct
rcu_gp_oldstate. This prepares the callback tracking infrastructure to
support both normal and expedited grace periods.

All function signatures are updated to pass struct rcu_gp_oldstate
pointers: rcu_segcblist_nextgp(), rcu_segcblist_advance(), and
rcu_segcblist_accelerate() now take struct rcu_gp_oldstate * instead of
unsigned long. All callers are updated to use the .rgos_norm field for
comparisons and assignments.

The SRCU and Tasks RCU wrappers now construct an rcu_gp_oldstate with
just .rgos_norm set and forward to the core functions.

No functional change: only the .rgos_norm field is used in place of
gp_seq.

Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
 include/linux/rcu_segcblist.h |  2 +-
 include/trace/events/rcu.h    |  5 +++--
 kernel/rcu/rcu_segcblist.c    | 30 +++++++++++++++++-------------
 kernel/rcu/rcu_segcblist.h    |  6 +++---
 kernel/rcu/tree.c             | 25 ++++++++++++++-----------
 kernel/rcu/tree_nocb.h        | 29 +++++++++++++++--------------
 6 files changed, 53 insertions(+), 44 deletions(-)

diff --git a/include/linux/rcu_segcblist.h b/include/linux/rcu_segcblist.h
index 2fdc2208f1ca..59c68f2ba113 100644
--- a/include/linux/rcu_segcblist.h
+++ b/include/linux/rcu_segcblist.h
@@ -190,7 +190,7 @@ struct rcu_cblist {
 struct rcu_segcblist {
 	struct rcu_head *head;
 	struct rcu_head **tails[RCU_CBLIST_NSEGS];
-	unsigned long gp_seq[RCU_CBLIST_NSEGS];
+	struct rcu_gp_oldstate gp_seq_full[RCU_CBLIST_NSEGS];
 #ifdef CONFIG_RCU_NOCB_CPU
 	atomic_long_t len;
 #else
diff --git a/include/trace/events/rcu.h b/include/trace/events/rcu.h
index 5fbdabe3faea..2b859b274592 100644
--- a/include/trace/events/rcu.h
+++ b/include/trace/events/rcu.h
@@ -547,10 +547,11 @@ TRACE_EVENT_RCU(rcu_segcb_stats,
 		),
 
 		TP_fast_assign(
+			int i;
 			__entry->ctx = ctx;
 			memcpy(__entry->seglen, rs->seglen, RCU_CBLIST_NSEGS * sizeof(long));
-			memcpy(__entry->gp_seq, rs->gp_seq, RCU_CBLIST_NSEGS * sizeof(unsigned long));
-
+			for (i = 0; i < RCU_CBLIST_NSEGS; i++)
+				__entry->gp_seq[i] = rs->gp_seq_full[i].rgos_norm;
 		),
 
 		TP_printk("%s seglen: (DONE=%ld, WAIT=%ld, NEXT_READY=%ld, NEXT=%ld) "
diff --git a/kernel/rcu/rcu_segcblist.c b/kernel/rcu/rcu_segcblist.c
index 421f1dadb5e5..00e164db8b74 100644
--- a/kernel/rcu/rcu_segcblist.c
+++ b/kernel/rcu/rcu_segcblist.c
@@ -238,8 +238,8 @@ void rcu_segcblist_init(struct rcu_segcblist *rsclp)
 {
 	int i;
 
-	BUILD_BUG_ON(RCU_NEXT_TAIL + 1 != ARRAY_SIZE(rsclp->gp_seq));
-	BUILD_BUG_ON(ARRAY_SIZE(rsclp->tails) != ARRAY_SIZE(rsclp->gp_seq));
+	BUILD_BUG_ON(RCU_NEXT_TAIL + 1 != ARRAY_SIZE(rsclp->gp_seq_full));
+	BUILD_BUG_ON(ARRAY_SIZE(rsclp->tails) != ARRAY_SIZE(rsclp->gp_seq_full));
 	rsclp->head = NULL;
 	for (i = 0; i < RCU_CBLIST_NSEGS; i++) {
 		rsclp->tails[i] = &rsclp->head;
@@ -307,13 +307,13 @@ struct rcu_head *rcu_segcblist_first_pend_cb(struct rcu_segcblist *rsclp)
 
 /*
  * Return false if there are no CBs awaiting grace periods, otherwise,
- * return true and store the nearest waited-upon grace period into *lp.
+ * return true and store the nearest waited-upon grace period state into *rgosp.
  */
-bool rcu_segcblist_nextgp(struct rcu_segcblist *rsclp, unsigned long *lp)
+bool rcu_segcblist_nextgp(struct rcu_segcblist *rsclp, struct rcu_gp_oldstate *rgosp)
 {
 	if (!rcu_segcblist_pend_cbs(rsclp))
 		return false;
-	*lp = rsclp->gp_seq[RCU_WAIT_TAIL];
+	*rgosp = rsclp->gp_seq_full[RCU_WAIT_TAIL];
 	return true;
 }
 
@@ -488,7 +488,7 @@ static void rcu_segcblist_advance_compact(struct rcu_segcblist *rsclp, int i)
 			break;  /* No more callbacks. */
 		WRITE_ONCE(rsclp->tails[j], rsclp->tails[i]);
 		rcu_segcblist_move_seglen(rsclp, i, j);
-		rsclp->gp_seq[j] = rsclp->gp_seq[i];
+		rsclp->gp_seq_full[j] = rsclp->gp_seq_full[i];
 	}
 }
 
@@ -496,7 +496,7 @@ static void rcu_segcblist_advance_compact(struct rcu_segcblist *rsclp, int i)
  * Advance the callbacks in the specified rcu_segcblist structure based
  * on the current value passed in for the grace-period counter.
  */
-void rcu_segcblist_advance(struct rcu_segcblist *rsclp, unsigned long seq)
+void rcu_segcblist_advance(struct rcu_segcblist *rsclp, struct rcu_gp_oldstate *rgosp)
 {
 	int i;
 
@@ -509,7 +509,7 @@ void rcu_segcblist_advance(struct rcu_segcblist *rsclp, unsigned long seq)
 	 * are ready to invoke, and put them into the RCU_DONE_TAIL segment.
 	 */
 	for (i = RCU_WAIT_TAIL; i < RCU_NEXT_TAIL; i++) {
-		if (ULONG_CMP_LT(seq, rsclp->gp_seq[i]))
+		if (ULONG_CMP_LT(rgosp->rgos_norm, rsclp->gp_seq_full[i].rgos_norm))
 			break;
 		WRITE_ONCE(rsclp->tails[RCU_DONE_TAIL], rsclp->tails[i]);
 		rcu_segcblist_move_seglen(rsclp, i, RCU_DONE_TAIL);
@@ -537,7 +537,7 @@ void rcu_segcblist_advance(struct rcu_segcblist *rsclp, unsigned long seq)
  * ready to invoke.  Returns true if there are callbacks that won't be
  * ready to invoke until seq, false otherwise.
  */
-bool rcu_segcblist_accelerate(struct rcu_segcblist *rsclp, unsigned long seq)
+bool rcu_segcblist_accelerate(struct rcu_segcblist *rsclp, struct rcu_gp_oldstate *rgosp)
 {
 	int i, j;
 
@@ -555,7 +555,7 @@ bool rcu_segcblist_accelerate(struct rcu_segcblist *rsclp, unsigned long seq)
 	 */
 	for (i = RCU_NEXT_READY_TAIL; i > RCU_DONE_TAIL; i--)
 		if (!rcu_segcblist_segempty(rsclp, i) &&
-		    ULONG_CMP_LT(rsclp->gp_seq[i], seq))
+		    ULONG_CMP_LT(rsclp->gp_seq_full[i].rgos_norm, rgosp->rgos_norm))
 			break;
 
 	/*
@@ -595,7 +595,7 @@ bool rcu_segcblist_accelerate(struct rcu_segcblist *rsclp, unsigned long seq)
 	 */
 	for (; i < RCU_NEXT_TAIL; i++) {
 		WRITE_ONCE(rsclp->tails[i], rsclp->tails[RCU_NEXT_TAIL]);
-		rsclp->gp_seq[i] = seq;
+		rsclp->gp_seq_full[i].rgos_norm = rgosp->rgos_norm;
 	}
 	return true;
 }
@@ -637,10 +637,14 @@ void rcu_segcblist_merge(struct rcu_segcblist *dst_rsclp,
 
 void srcu_segcblist_advance(struct rcu_segcblist *rsclp, unsigned long seq)
 {
-	rcu_segcblist_advance(rsclp, seq);
+	struct rcu_gp_oldstate rgos = { .rgos_norm = seq };
+
+	rcu_segcblist_advance(rsclp, &rgos);
 }
 
 bool srcu_segcblist_accelerate(struct rcu_segcblist *rsclp, unsigned long seq)
 {
-	return rcu_segcblist_accelerate(rsclp, seq);
+	struct rcu_gp_oldstate rgos = { .rgos_norm = seq };
+
+	return rcu_segcblist_accelerate(rsclp, &rgos);
 }
diff --git a/kernel/rcu/rcu_segcblist.h b/kernel/rcu/rcu_segcblist.h
index 956f2967d9d2..2c06ab830a3d 100644
--- a/kernel/rcu/rcu_segcblist.h
+++ b/kernel/rcu/rcu_segcblist.h
@@ -124,7 +124,7 @@ bool rcu_segcblist_ready_cbs(struct rcu_segcblist *rsclp);
 bool rcu_segcblist_pend_cbs(struct rcu_segcblist *rsclp);
 struct rcu_head *rcu_segcblist_first_cb(struct rcu_segcblist *rsclp);
 struct rcu_head *rcu_segcblist_first_pend_cb(struct rcu_segcblist *rsclp);
-bool rcu_segcblist_nextgp(struct rcu_segcblist *rsclp, unsigned long *lp);
+bool rcu_segcblist_nextgp(struct rcu_segcblist *rsclp, struct rcu_gp_oldstate *rgosp);
 void rcu_segcblist_enqueue(struct rcu_segcblist *rsclp,
 			   struct rcu_head *rhp);
 bool rcu_segcblist_entrain(struct rcu_segcblist *rsclp,
@@ -139,8 +139,8 @@ void rcu_segcblist_insert_done_cbs(struct rcu_segcblist *rsclp,
 				   struct rcu_cblist *rclp);
 void rcu_segcblist_insert_pend_cbs(struct rcu_segcblist *rsclp,
 				   struct rcu_cblist *rclp);
-void rcu_segcblist_advance(struct rcu_segcblist *rsclp, unsigned long seq);
-bool rcu_segcblist_accelerate(struct rcu_segcblist *rsclp, unsigned long seq);
+void rcu_segcblist_advance(struct rcu_segcblist *rsclp, struct rcu_gp_oldstate *rgosp);
+bool rcu_segcblist_accelerate(struct rcu_segcblist *rsclp, struct rcu_gp_oldstate *rgosp);
 void rcu_segcblist_merge(struct rcu_segcblist *dst_rsclp,
 			 struct rcu_segcblist *src_rsclp);
 void srcu_segcblist_advance(struct rcu_segcblist *rsclp, unsigned long seq);
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 55df6d37145e..cbc170dc3f72 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1142,7 +1142,7 @@ static void rcu_gp_kthread_wake(void)
  */
 static bool rcu_accelerate_cbs(struct rcu_node *rnp, struct rcu_data *rdp)
 {
-	unsigned long gp_seq_req;
+	struct rcu_gp_oldstate rgos;
 	bool ret = false;
 
 	rcu_lockdep_assert_cblist_protected(rdp);
@@ -1164,15 +1164,15 @@ static bool rcu_accelerate_cbs(struct rcu_node *rnp, struct rcu_data *rdp)
 	 * accelerating callback invocation to an earlier grace-period
 	 * number.
 	 */
-	gp_seq_req = rcu_seq_snap(&rcu_state.gp_seq);
-	if (rcu_segcblist_accelerate(&rdp->cblist, gp_seq_req))
-		ret = rcu_start_this_gp(rnp, rdp, gp_seq_req);
+	rgos.rgos_norm = rcu_seq_snap(&rcu_state.gp_seq);
+	if (rcu_segcblist_accelerate(&rdp->cblist, &rgos))
+		ret = rcu_start_this_gp(rnp, rdp, rgos.rgos_norm);
 
 	/* Trace depending on how much we were able to accelerate. */
 	if (rcu_segcblist_restempty(&rdp->cblist, RCU_WAIT_TAIL))
-		trace_rcu_grace_period(rcu_state.name, gp_seq_req, TPS("AccWaitCB"));
+		trace_rcu_grace_period(rcu_state.name, rgos.rgos_norm, TPS("AccWaitCB"));
 	else
-		trace_rcu_grace_period(rcu_state.name, gp_seq_req, TPS("AccReadyCB"));
+		trace_rcu_grace_period(rcu_state.name, rgos.rgos_norm, TPS("AccReadyCB"));
 
 	trace_rcu_segcb_stats(&rdp->cblist, TPS("SegCbPostAcc"));
 
@@ -1189,14 +1189,14 @@ static bool rcu_accelerate_cbs(struct rcu_node *rnp, struct rcu_data *rdp)
 static void rcu_accelerate_cbs_unlocked(struct rcu_node *rnp,
 					struct rcu_data *rdp)
 {
-	unsigned long c;
+	struct rcu_gp_oldstate rgos;
 	bool needwake;
 
 	rcu_lockdep_assert_cblist_protected(rdp);
-	c = rcu_seq_snap(&rcu_state.gp_seq);
-	if (!READ_ONCE(rdp->gpwrap) && ULONG_CMP_GE(rdp->gp_seq_needed, c)) {
+	rgos.rgos_norm = rcu_seq_snap(&rcu_state.gp_seq);
+	if (!READ_ONCE(rdp->gpwrap) && ULONG_CMP_GE(rdp->gp_seq_needed, rgos.rgos_norm)) {
 		/* Old request still live, so mark recent callbacks. */
-		(void)rcu_segcblist_accelerate(&rdp->cblist, c);
+		(void)rcu_segcblist_accelerate(&rdp->cblist, &rgos);
 		return;
 	}
 	raw_spin_lock_rcu_node(rnp); /* irqs already disabled. */
@@ -1218,6 +1218,8 @@ static void rcu_accelerate_cbs_unlocked(struct rcu_node *rnp,
  */
 static bool rcu_advance_cbs(struct rcu_node *rnp, struct rcu_data *rdp)
 {
+	struct rcu_gp_oldstate rgos;
+
 	rcu_lockdep_assert_cblist_protected(rdp);
 	raw_lockdep_assert_held_rcu_node(rnp);
 
@@ -1229,7 +1231,8 @@ static bool rcu_advance_cbs(struct rcu_node *rnp, struct rcu_data *rdp)
 	 * Find all callbacks whose ->gp_seq numbers indicate that they
 	 * are ready to invoke, and put them into the RCU_DONE_TAIL sublist.
 	 */
-	rcu_segcblist_advance(&rdp->cblist, rnp->gp_seq);
+	rgos.rgos_norm = rnp->gp_seq;
+	rcu_segcblist_advance(&rdp->cblist, &rgos);
 
 	/* Classify any remaining callbacks. */
 	return rcu_accelerate_cbs(rnp, rdp);
diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
index 1047b30cd46b..1837eedfb8c2 100644
--- a/kernel/rcu/tree_nocb.h
+++ b/kernel/rcu/tree_nocb.h
@@ -433,7 +433,7 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
 				bool lazy)
 {
 	unsigned long c;
-	unsigned long cur_gp_seq;
+	struct rcu_gp_oldstate cur_gp_seq_full;
 	unsigned long j = jiffies;
 	long ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
 	long lazy_len = READ_ONCE(rdp->lazy_len);
@@ -501,8 +501,8 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
 			return false; // Caller must enqueue the callback.
 		}
 		if (j != rdp->nocb_gp_adv_time &&
-		    rcu_segcblist_nextgp(&rdp->cblist, &cur_gp_seq) &&
-		    rcu_seq_done(&rdp->mynode->gp_seq, cur_gp_seq)) {
+		    rcu_segcblist_nextgp(&rdp->cblist, &cur_gp_seq_full) &&
+		    rcu_seq_done(&rdp->mynode->gp_seq, cur_gp_seq_full.rgos_norm)) {
 			rcu_advance_cbs_nowake(rdp->mynode, rdp);
 			rdp->nocb_gp_adv_time = j;
 		}
@@ -659,7 +659,7 @@ static void nocb_gp_wait(struct rcu_data *my_rdp)
 {
 	bool bypass = false;
 	int __maybe_unused cpu = my_rdp->cpu;
-	unsigned long cur_gp_seq;
+	struct rcu_gp_oldstate cur_gp_seq_full;
 	unsigned long flags;
 	bool gotcbs = false;
 	unsigned long j = jiffies;
@@ -730,8 +730,8 @@ static void nocb_gp_wait(struct rcu_data *my_rdp)
 		needwake_gp = false;
 		if (!rcu_segcblist_restempty(&rdp->cblist,
 					     RCU_NEXT_READY_TAIL) ||
-		    (rcu_segcblist_nextgp(&rdp->cblist, &cur_gp_seq) &&
-		     rcu_seq_done(&rnp->gp_seq, cur_gp_seq))) {
+		    (rcu_segcblist_nextgp(&rdp->cblist, &cur_gp_seq_full) &&
+		     rcu_seq_done(&rnp->gp_seq, cur_gp_seq_full.rgos_norm))) {
 			raw_spin_lock_rcu_node(rnp); /* irqs disabled. */
 			needwake_gp = rcu_advance_cbs(rnp, rdp);
 			wasempty = rcu_segcblist_restempty(&rdp->cblist,
@@ -742,10 +742,10 @@ static void nocb_gp_wait(struct rcu_data *my_rdp)
 		WARN_ON_ONCE(wasempty &&
 			     !rcu_segcblist_restempty(&rdp->cblist,
 						      RCU_NEXT_READY_TAIL));
-		if (rcu_segcblist_nextgp(&rdp->cblist, &cur_gp_seq)) {
+		if (rcu_segcblist_nextgp(&rdp->cblist, &cur_gp_seq_full)) {
 			if (!needwait_gp ||
-			    ULONG_CMP_LT(cur_gp_seq, wait_gp_seq))
-				wait_gp_seq = cur_gp_seq;
+			    ULONG_CMP_LT(cur_gp_seq_full.rgos_norm, wait_gp_seq))
+				wait_gp_seq = cur_gp_seq_full.rgos_norm;
 			needwait_gp = true;
 			trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
 					    TPS("NeedWaitGP"));
@@ -877,7 +877,7 @@ static inline bool nocb_cb_wait_cond(struct rcu_data *rdp)
 static void nocb_cb_wait(struct rcu_data *rdp)
 {
 	struct rcu_segcblist *cblist = &rdp->cblist;
-	unsigned long cur_gp_seq;
+	struct rcu_gp_oldstate cur_gp_seq_full;
 	unsigned long flags;
 	bool needwake_gp = false;
 	struct rcu_node *rnp = rdp->mynode;
@@ -918,8 +918,8 @@ static void nocb_cb_wait(struct rcu_data *rdp)
 	local_bh_enable();
 	lockdep_assert_irqs_enabled();
 	rcu_nocb_lock_irqsave(rdp, flags);
-	if (rcu_segcblist_nextgp(cblist, &cur_gp_seq) &&
-	    rcu_seq_done(&rnp->gp_seq, cur_gp_seq) &&
+	if (rcu_segcblist_nextgp(cblist, &cur_gp_seq_full) &&
+	    rcu_seq_done(&rnp->gp_seq, cur_gp_seq_full.rgos_norm) &&
 	    raw_spin_trylock_rcu_node(rnp)) { /* irqs already disabled. */
 		needwake_gp = rcu_advance_cbs(rdp->mynode, rdp);
 		raw_spin_unlock_rcu_node(rnp); /* irqs remain disabled. */
@@ -1569,9 +1569,10 @@ static void show_rcu_nocb_state(struct rcu_data *rdp)
 					      nocb_entry_rdp);
 
 	sprintf(bufd, "%ld", rsclp->seglen[RCU_DONE_TAIL]);
-	sprintf(bufw, "%ld(%ld)", rsclp->seglen[RCU_WAIT_TAIL], rsclp->gp_seq[RCU_WAIT_TAIL]);
+	sprintf(bufw, "%ld(%ld)", rsclp->seglen[RCU_WAIT_TAIL],
+		rsclp->gp_seq_full[RCU_WAIT_TAIL].rgos_norm);
 	sprintf(bufr, "%ld(%ld)", rsclp->seglen[RCU_NEXT_READY_TAIL],
-		      rsclp->gp_seq[RCU_NEXT_READY_TAIL]);
+		rsclp->gp_seq_full[RCU_NEXT_READY_TAIL].rgos_norm);
 	sprintf(bufn, "%ld", rsclp->seglen[RCU_NEXT_TAIL]);
 	sprintf(bufb, "%ld", rcu_cblist_n_cbs(&rdp->nocb_bypass));
 	pr_info("   CB %d^%d->%d %c%c%c%c%c F%ld L%ld C%d %c%s%c%s%c%s%c%s%c%s q%ld %c CPU %d%s\n",
-- 
2.52.0


^ permalink raw reply related

* [RFC PATCH 04/10] rcu: Add RCU_GET_STATE_NOT_TRACKED for subsystems without expedited GPs
From: Puranjay Mohan @ 2026-04-17 23:11 UTC (permalink / raw)
  To: rcu, linux-kernel, linux-trace-kernel
  Cc: Puranjay Mohan, Paul E. McKenney, Frederic Weisbecker,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Masami Hiramatsu, Davidlohr Bueso
In-Reply-To: <20260417231203.785172-1-puranjay@kernel.org>

SRCU and Tasks RCU do not track expedited grace periods. When their
callback state is checked via poll_state_synchronize_rcu_full(), the
uninitialized or zeroed rgos_exp field could cause false-positive
completion detection.

This commit adds an RCU_GET_STATE_NOT_TRACKED sentinel value (0x2) that
these subsystems can place into rgos_exp to indicate that expedited GP
tracking is not applicable. The expedited sequence check in
poll_state_synchronize_rcu_full() is guarded to skip entries marked with
this sentinel.

This is needed to allow rcu_segcblist_advance() and rcu_accelerate_cbs()
to work with both normal and expedited grace periods via
get_state_synchronize_rcu_full() and poll_state_synchronize_rcu_full().

Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
 kernel/rcu/rcu.h  | 13 +++++++++++--
 kernel/rcu/tree.c |  3 ++-
 2 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
index fa6d30ce73d1..b450febac823 100644
--- a/kernel/rcu/rcu.h
+++ b/kernel/rcu/rcu.h
@@ -46,16 +46,25 @@
  *					the number of pending readers that will use
  *					this inactive index is bounded).
  *
- * RCU polled GP special control value:
+ * RCU polled GP special control values:
  *
  *	RCU_GET_STATE_COMPLETED :	State value indicating an already-completed
  *					polled GP has completed.  This value covers
  *					both the state and the counter of the
  *					grace-period sequence number.
+ *
+ *	RCU_GET_STATE_NOT_TRACKED :	State value indicating that a GP component
+ *					is not tracked by this subsystem and should
+ *					not be checked.  Used by SRCU and RCU Tasks
+ *					which do not track expedited GPs, to prevent
+ *					false-positive completion when their
+ *					gp_seq_full entries are checked via
+ *					poll_state_synchronize_rcu_full().
  */
 
-/* Low-order bit definition for polled grace-period APIs. */
+/* Low-order bit definitions for polled grace-period APIs. */
 #define RCU_GET_STATE_COMPLETED	0x1
+#define RCU_GET_STATE_NOT_TRACKED	0x2
 
 /* A complete grace period count */
 #define RCU_SEQ_GP (RCU_SEQ_STATE_MASK + 1)
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index cbc170dc3f72..607fc5715cd1 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3603,7 +3603,8 @@ bool poll_state_synchronize_rcu_full(struct rcu_gp_oldstate *rgosp)
 	if (rgosp->rgos_norm == RCU_GET_STATE_COMPLETED ||
 	    rcu_seq_done_exact(&rnp->gp_seq, rgosp->rgos_norm) ||
 	    rgosp->rgos_exp == RCU_GET_STATE_COMPLETED ||
-	    rcu_seq_done_exact(&rcu_state.expedited_sequence, rgosp->rgos_exp)) {
+	    (rgosp->rgos_exp != RCU_GET_STATE_NOT_TRACKED &&
+	     rcu_seq_done_exact(&rcu_state.expedited_sequence, rgosp->rgos_exp))) {
 		smp_mb(); /* Ensure GP ends before subsequent accesses. */
 		return true;
 	}
-- 
2.52.0


^ permalink raw reply related

* [RFC PATCH 05/10] rcu: Enable RCU callbacks to benefit from expedited grace periods
From: Puranjay Mohan @ 2026-04-17 23:11 UTC (permalink / raw)
  To: rcu, linux-kernel, linux-trace-kernel
  Cc: Puranjay Mohan, Paul E. McKenney, Frederic Weisbecker,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Masami Hiramatsu, Davidlohr Bueso
In-Reply-To: <20260417231203.785172-1-puranjay@kernel.org>

Currently, RCU callbacks only track normal grace period sequence
numbers.  This means callbacks must wait for normal grace periods to
complete even when expedited grace periods have already elapsed.

This commit uses the full rcu_gp_oldstate structure (which tracks both
normal and expedited GP sequences) throughout the callback
infrastructure.

The rcu_segcblist_advance() function now checks both normal and
expedited GP completion via poll_state_synchronize_rcu_full(), becoming
parameterless since it reads the GP state internally.
rcu_segcblist_accelerate() stores the full GP state (both normal and
expedited sequences) instead of just the normal sequence.

The rcu_accelerate_cbs() and rcu_accelerate_cbs_unlocked() functions use
get_state_synchronize_rcu_full() to capture both GP sequences. The NOCB
code uses poll_state_synchronize_rcu_full() for advance checks instead
of comparing only the normal GP sequence.

srcu_segcblist_advance() become standalone implementations because
compares SRCU sequences directly (it cannot use
poll_state_synchronize_rcu_full(), which reads RCU-specific globals).
srcu_segcblist_accelerate() sets rgos_exp to RCU_GET_STATE_NOT_TRACKED
so poll_state_synchronize_rcu_full() only compares the rgosp->rgos_norm
and ignores rgos_exp.

Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
 kernel/rcu/rcu_segcblist.c | 30 ++++++++++++++++++++++++------
 kernel/rcu/rcu_segcblist.h |  2 +-
 kernel/rcu/tree.c          |  9 +++------
 kernel/rcu/tree_nocb.h     | 33 +++++++++++++++++++++++----------
 4 files changed, 51 insertions(+), 23 deletions(-)

diff --git a/kernel/rcu/rcu_segcblist.c b/kernel/rcu/rcu_segcblist.c
index 00e164db8b74..11174e2be3c2 100644
--- a/kernel/rcu/rcu_segcblist.c
+++ b/kernel/rcu/rcu_segcblist.c
@@ -12,6 +12,7 @@
 #include <linux/kernel.h>
 #include <linux/types.h>
 
+#include "rcu.h"
 #include "rcu_segcblist.h"
 
 /* Initialize simple callback list. */
@@ -494,9 +495,9 @@ static void rcu_segcblist_advance_compact(struct rcu_segcblist *rsclp, int i)
 
 /*
  * Advance the callbacks in the specified rcu_segcblist structure based
- * on the current value passed in for the grace-period counter.
+ * on the current value of the grace-period counter.
  */
-void rcu_segcblist_advance(struct rcu_segcblist *rsclp, struct rcu_gp_oldstate *rgosp)
+void rcu_segcblist_advance(struct rcu_segcblist *rsclp)
 {
 	int i;
 
@@ -509,7 +510,7 @@ void rcu_segcblist_advance(struct rcu_segcblist *rsclp, struct rcu_gp_oldstate *
 	 * are ready to invoke, and put them into the RCU_DONE_TAIL segment.
 	 */
 	for (i = RCU_WAIT_TAIL; i < RCU_NEXT_TAIL; i++) {
-		if (ULONG_CMP_LT(rgosp->rgos_norm, rsclp->gp_seq_full[i].rgos_norm))
+		if (!poll_state_synchronize_rcu_full(&rsclp->gp_seq_full[i]))
 			break;
 		WRITE_ONCE(rsclp->tails[RCU_DONE_TAIL], rsclp->tails[i]);
 		rcu_segcblist_move_seglen(rsclp, i, RCU_DONE_TAIL);
@@ -595,7 +596,7 @@ bool rcu_segcblist_accelerate(struct rcu_segcblist *rsclp, struct rcu_gp_oldstat
 	 */
 	for (; i < RCU_NEXT_TAIL; i++) {
 		WRITE_ONCE(rsclp->tails[i], rsclp->tails[RCU_NEXT_TAIL]);
-		rsclp->gp_seq_full[i].rgos_norm = rgosp->rgos_norm;
+		rsclp->gp_seq_full[i] = *rgosp;
 	}
 	return true;
 }
@@ -637,14 +638,31 @@ void rcu_segcblist_merge(struct rcu_segcblist *dst_rsclp,
 
 void srcu_segcblist_advance(struct rcu_segcblist *rsclp, unsigned long seq)
 {
-	struct rcu_gp_oldstate rgos = { .rgos_norm = seq };
+	int i;
 
-	rcu_segcblist_advance(rsclp, &rgos);
+	WARN_ON_ONCE(!rcu_segcblist_is_enabled(rsclp));
+	if (rcu_segcblist_restempty(rsclp, RCU_DONE_TAIL))
+		return;
+
+	for (i = RCU_WAIT_TAIL; i < RCU_NEXT_TAIL; i++) {
+		if (ULONG_CMP_LT(seq, rsclp->gp_seq_full[i].rgos_norm))
+			break;
+		WRITE_ONCE(rsclp->tails[RCU_DONE_TAIL], rsclp->tails[i]);
+		rcu_segcblist_move_seglen(rsclp, i, RCU_DONE_TAIL);
+	}
+
+	/* If no callbacks moved, nothing more need be done. */
+	if (i == RCU_WAIT_TAIL)
+		return;
+
+	rcu_segcblist_advance_compact(rsclp, i);
 }
 
 bool srcu_segcblist_accelerate(struct rcu_segcblist *rsclp, unsigned long seq)
 {
 	struct rcu_gp_oldstate rgos = { .rgos_norm = seq };
 
+	if (IS_ENABLED(CONFIG_SMP))
+		rgos.rgos_exp = RCU_GET_STATE_NOT_TRACKED;
 	return rcu_segcblist_accelerate(rsclp, &rgos);
 }
diff --git a/kernel/rcu/rcu_segcblist.h b/kernel/rcu/rcu_segcblist.h
index 2c06ab830a3d..6e05fdf93e7b 100644
--- a/kernel/rcu/rcu_segcblist.h
+++ b/kernel/rcu/rcu_segcblist.h
@@ -139,7 +139,7 @@ void rcu_segcblist_insert_done_cbs(struct rcu_segcblist *rsclp,
 				   struct rcu_cblist *rclp);
 void rcu_segcblist_insert_pend_cbs(struct rcu_segcblist *rsclp,
 				   struct rcu_cblist *rclp);
-void rcu_segcblist_advance(struct rcu_segcblist *rsclp, struct rcu_gp_oldstate *rgosp);
+void rcu_segcblist_advance(struct rcu_segcblist *rsclp);
 bool rcu_segcblist_accelerate(struct rcu_segcblist *rsclp, struct rcu_gp_oldstate *rgosp);
 void rcu_segcblist_merge(struct rcu_segcblist *dst_rsclp,
 			 struct rcu_segcblist *src_rsclp);
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 607fc5715cd1..35076092f754 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1164,7 +1164,7 @@ static bool rcu_accelerate_cbs(struct rcu_node *rnp, struct rcu_data *rdp)
 	 * accelerating callback invocation to an earlier grace-period
 	 * number.
 	 */
-	rgos.rgos_norm = rcu_seq_snap(&rcu_state.gp_seq);
+	get_state_synchronize_rcu_full(&rgos);
 	if (rcu_segcblist_accelerate(&rdp->cblist, &rgos))
 		ret = rcu_start_this_gp(rnp, rdp, rgos.rgos_norm);
 
@@ -1193,7 +1193,7 @@ static void rcu_accelerate_cbs_unlocked(struct rcu_node *rnp,
 	bool needwake;
 
 	rcu_lockdep_assert_cblist_protected(rdp);
-	rgos.rgos_norm = rcu_seq_snap(&rcu_state.gp_seq);
+	get_state_synchronize_rcu_full(&rgos);
 	if (!READ_ONCE(rdp->gpwrap) && ULONG_CMP_GE(rdp->gp_seq_needed, rgos.rgos_norm)) {
 		/* Old request still live, so mark recent callbacks. */
 		(void)rcu_segcblist_accelerate(&rdp->cblist, &rgos);
@@ -1218,8 +1218,6 @@ static void rcu_accelerate_cbs_unlocked(struct rcu_node *rnp,
  */
 static bool rcu_advance_cbs(struct rcu_node *rnp, struct rcu_data *rdp)
 {
-	struct rcu_gp_oldstate rgos;
-
 	rcu_lockdep_assert_cblist_protected(rdp);
 	raw_lockdep_assert_held_rcu_node(rnp);
 
@@ -1231,8 +1229,7 @@ static bool rcu_advance_cbs(struct rcu_node *rnp, struct rcu_data *rdp)
 	 * Find all callbacks whose ->gp_seq numbers indicate that they
 	 * are ready to invoke, and put them into the RCU_DONE_TAIL sublist.
 	 */
-	rgos.rgos_norm = rnp->gp_seq;
-	rcu_segcblist_advance(&rdp->cblist, &rgos);
+	rcu_segcblist_advance(&rdp->cblist);
 
 	/* Classify any remaining callbacks. */
 	return rcu_accelerate_cbs(rnp, rdp);
diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
index 1837eedfb8c2..7462cd5e2507 100644
--- a/kernel/rcu/tree_nocb.h
+++ b/kernel/rcu/tree_nocb.h
@@ -502,7 +502,7 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
 		}
 		if (j != rdp->nocb_gp_adv_time &&
 		    rcu_segcblist_nextgp(&rdp->cblist, &cur_gp_seq_full) &&
-		    rcu_seq_done(&rdp->mynode->gp_seq, cur_gp_seq_full.rgos_norm)) {
+		    poll_state_synchronize_rcu_full(&cur_gp_seq_full)) {
 			rcu_advance_cbs_nowake(rdp->mynode, rdp);
 			rdp->nocb_gp_adv_time = j;
 		}
@@ -731,7 +731,7 @@ static void nocb_gp_wait(struct rcu_data *my_rdp)
 		if (!rcu_segcblist_restempty(&rdp->cblist,
 					     RCU_NEXT_READY_TAIL) ||
 		    (rcu_segcblist_nextgp(&rdp->cblist, &cur_gp_seq_full) &&
-		     rcu_seq_done(&rnp->gp_seq, cur_gp_seq_full.rgos_norm))) {
+		     poll_state_synchronize_rcu_full(&cur_gp_seq_full))) {
 			raw_spin_lock_rcu_node(rnp); /* irqs disabled. */
 			needwake_gp = rcu_advance_cbs(rnp, rdp);
 			wasempty = rcu_segcblist_restempty(&rdp->cblist,
@@ -742,7 +742,18 @@ static void nocb_gp_wait(struct rcu_data *my_rdp)
 		WARN_ON_ONCE(wasempty &&
 			     !rcu_segcblist_restempty(&rdp->cblist,
 						      RCU_NEXT_READY_TAIL));
-		if (rcu_segcblist_nextgp(&rdp->cblist, &cur_gp_seq_full)) {
+		/*
+		 * Only request a GP wait if the next pending callback's
+		 * GP has not already completed (normal or expedited).
+		 * If poll_state_synchronize_rcu_full() says it completed,
+		 * then rcu_advance_cbs() above already moved those
+		 * callbacks to RCU_DONE_TAIL, so there is no GP to wait
+		 * for.  Any remaining callbacks got new (future) GP
+		 * numbers from rcu_accelerate_cbs() inside
+		 * rcu_advance_cbs() and will be handled on the next pass.
+		 */
+		if (rcu_segcblist_nextgp(&rdp->cblist, &cur_gp_seq_full) &&
+		    !poll_state_synchronize_rcu_full(&cur_gp_seq_full)) {
 			if (!needwait_gp ||
 			    ULONG_CMP_LT(cur_gp_seq_full.rgos_norm, wait_gp_seq))
 				wait_gp_seq = cur_gp_seq_full.rgos_norm;
@@ -919,7 +930,7 @@ static void nocb_cb_wait(struct rcu_data *rdp)
 	lockdep_assert_irqs_enabled();
 	rcu_nocb_lock_irqsave(rdp, flags);
 	if (rcu_segcblist_nextgp(cblist, &cur_gp_seq_full) &&
-	    rcu_seq_done(&rnp->gp_seq, cur_gp_seq_full.rgos_norm) &&
+	    poll_state_synchronize_rcu_full(&cur_gp_seq_full) &&
 	    raw_spin_trylock_rcu_node(rnp)) { /* irqs already disabled. */
 		needwake_gp = rcu_advance_cbs(rdp->mynode, rdp);
 		raw_spin_unlock_rcu_node(rnp); /* irqs remain disabled. */
@@ -1548,8 +1559,8 @@ static void show_rcu_nocb_gp_state(struct rcu_data *rdp)
 static void show_rcu_nocb_state(struct rcu_data *rdp)
 {
 	char bufd[22];
-	char bufw[45];
-	char bufr[45];
+	char bufw[64];
+	char bufr[64];
 	char bufn[22];
 	char bufb[22];
 	struct rcu_data *nocb_next_rdp;
@@ -1569,10 +1580,12 @@ static void show_rcu_nocb_state(struct rcu_data *rdp)
 					      nocb_entry_rdp);
 
 	sprintf(bufd, "%ld", rsclp->seglen[RCU_DONE_TAIL]);
-	sprintf(bufw, "%ld(%ld)", rsclp->seglen[RCU_WAIT_TAIL],
-		rsclp->gp_seq_full[RCU_WAIT_TAIL].rgos_norm);
-	sprintf(bufr, "%ld(%ld)", rsclp->seglen[RCU_NEXT_READY_TAIL],
-		rsclp->gp_seq_full[RCU_NEXT_READY_TAIL].rgos_norm);
+	sprintf(bufw, "%ld(%ld/%ld)", rsclp->seglen[RCU_WAIT_TAIL],
+		rsclp->gp_seq_full[RCU_WAIT_TAIL].rgos_norm,
+		rsclp->gp_seq_full[RCU_WAIT_TAIL].rgos_exp);
+	sprintf(bufr, "%ld(%ld/%ld)", rsclp->seglen[RCU_NEXT_READY_TAIL],
+		rsclp->gp_seq_full[RCU_NEXT_READY_TAIL].rgos_norm,
+		rsclp->gp_seq_full[RCU_NEXT_READY_TAIL].rgos_exp);
 	sprintf(bufn, "%ld", rsclp->seglen[RCU_NEXT_TAIL]);
 	sprintf(bufb, "%ld", rcu_cblist_n_cbs(&rdp->nocb_bypass));
 	pr_info("   CB %d^%d->%d %c%c%c%c%c F%ld L%ld C%d %c%s%c%s%c%s%c%s%c%s q%ld %c CPU %d%s\n",
-- 
2.52.0


^ permalink raw reply related

* [RFC PATCH 06/10] rcu: Update comments for gp_seq_full and expedited GP tracking
From: Puranjay Mohan @ 2026-04-17 23:11 UTC (permalink / raw)
  To: rcu, linux-kernel, linux-trace-kernel
  Cc: Puranjay Mohan, Paul E. McKenney, Frederic Weisbecker,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Masami Hiramatsu, Davidlohr Bueso
In-Reply-To: <20260417231203.785172-1-puranjay@kernel.org>

This commit updates documentation comments throughout the RCU callback
infrastructure to reflect the transition from single grace-period
sequence numbers to the full rcu_gp_oldstate structure that tracks both
normal and expedited grace periods.

The ->gp_seq_full[] array documentation in rcu_segcblist.h is updated
to describe dual GP tracking. The rcu_segcblist_advance(),
rcu_segcblist_accelerate(), and rcu_advance_cbs() comments are updated
to reference ->gp_seq_full[] and rgosp instead of the old ->gp_seq[]
and seq names.

Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
 include/linux/rcu_segcblist.h | 14 +++++++-----
 kernel/rcu/rcu_segcblist.c    | 43 +++++++++++++++++++++++------------
 kernel/rcu/tree.c             |  6 ++---
 3 files changed, 39 insertions(+), 24 deletions(-)

diff --git a/include/linux/rcu_segcblist.h b/include/linux/rcu_segcblist.h
index 59c68f2ba113..6052c5750ad3 100644
--- a/include/linux/rcu_segcblist.h
+++ b/include/linux/rcu_segcblist.h
@@ -50,12 +50,14 @@ struct rcu_cblist {
  * Note that RCU_WAIT_TAIL cannot be empty unless RCU_NEXT_READY_TAIL is also
  * empty.
  *
- * The ->gp_seq[] array contains the grace-period number at which the
- * corresponding segment of callbacks will be ready to invoke.  A given
- * element of this array is meaningful only when the corresponding segment
- * is non-empty, and it is never valid for RCU_DONE_TAIL (whose callbacks
- * are already ready to invoke) or for RCU_NEXT_TAIL (whose callbacks have
- * not yet been assigned a grace-period number).
+ * The ->gp_seq_full[] array contains the grace-period state at which the
+ * corresponding segment of callbacks will be ready to invoke.  This tracks
+ * both normal and expedited grace periods, allowing callbacks to complete
+ * when either type of GP finishes.  A given element of this array is
+ * meaningful only when the corresponding segment is non-empty, and it is
+ * never valid for RCU_DONE_TAIL (whose callbacks are already ready to
+ * invoke) or for RCU_NEXT_TAIL (whose callbacks have not yet been assigned
+ * a grace-period state).
  */
 #define RCU_DONE_TAIL		0	/* Also RCU_WAIT head. */
 #define RCU_WAIT_TAIL		1	/* Also RCU_NEXT_READY head. */
diff --git a/kernel/rcu/rcu_segcblist.c b/kernel/rcu/rcu_segcblist.c
index 11174e2be3c2..0f13016c0a2c 100644
--- a/kernel/rcu/rcu_segcblist.c
+++ b/kernel/rcu/rcu_segcblist.c
@@ -495,7 +495,8 @@ static void rcu_segcblist_advance_compact(struct rcu_segcblist *rsclp, int i)
 
 /*
  * Advance the callbacks in the specified rcu_segcblist structure based
- * on the current value of the grace-period counter.
+ * on the current grace-period state.  Checks both normal and expedited
+ * grace periods, advancing callbacks when either GP type completes.
  */
 void rcu_segcblist_advance(struct rcu_segcblist *rsclp)
 {
@@ -506,8 +507,10 @@ void rcu_segcblist_advance(struct rcu_segcblist *rsclp)
 		return;
 
 	/*
-	 * Find all callbacks whose ->gp_seq numbers indicate that they
-	 * are ready to invoke, and put them into the RCU_DONE_TAIL segment.
+	 * Find all callbacks whose grace periods have completed (either
+	 * normal or expedited) and put them into the RCU_DONE_TAIL segment.
+	 * We check against the current global GP state, which includes
+	 * proper memory barriers and handles special completion values.
 	 */
 	for (i = RCU_WAIT_TAIL; i < RCU_NEXT_TAIL; i++) {
 		if (!poll_state_synchronize_rcu_full(&rsclp->gp_seq_full[i]))
@@ -534,9 +537,9 @@ void rcu_segcblist_advance(struct rcu_segcblist *rsclp)
  * them to complete at the end of the earlier grace period.
  *
  * This function operates on an rcu_segcblist structure, and also the
- * grace-period sequence number seq at which new callbacks would become
+ * grace-period state rgosp at which new callbacks would become
  * ready to invoke.  Returns true if there are callbacks that won't be
- * ready to invoke until seq, false otherwise.
+ * ready to invoke until the grace period represented by rgosp, false otherwise.
  */
 bool rcu_segcblist_accelerate(struct rcu_segcblist *rsclp, struct rcu_gp_oldstate *rgosp)
 {
@@ -548,11 +551,11 @@ bool rcu_segcblist_accelerate(struct rcu_segcblist *rsclp, struct rcu_gp_oldstat
 
 	/*
 	 * Find the segment preceding the oldest segment of callbacks
-	 * whose ->gp_seq[] completion is at or after that passed in via
-	 * "seq", skipping any empty segments.  This oldest segment, along
+	 * whose grace period completion is at or after that passed in via
+	 * "rgosp", skipping any empty segments.  This oldest segment, along
 	 * with any later segments, can be merged in with any newly arrived
-	 * callbacks in the RCU_NEXT_TAIL segment, and assigned "seq"
-	 * as their ->gp_seq[] grace-period completion sequence number.
+	 * callbacks in the RCU_NEXT_TAIL segment, and assigned "rgosp"
+	 * as their grace-period completion state.
 	 */
 	for (i = RCU_NEXT_READY_TAIL; i > RCU_DONE_TAIL; i--)
 		if (!rcu_segcblist_segempty(rsclp, i) &&
@@ -561,7 +564,7 @@ bool rcu_segcblist_accelerate(struct rcu_segcblist *rsclp, struct rcu_gp_oldstat
 
 	/*
 	 * If all the segments contain callbacks that correspond to
-	 * earlier grace-period sequence numbers than "seq", leave.
+	 * earlier grace-period sequence numbers than "rgosp", leave.
 	 * Assuming that the rcu_segcblist structure has enough
 	 * segments in its arrays, this can only happen if some of
 	 * the non-done segments contain callbacks that really are
@@ -569,15 +572,15 @@ bool rcu_segcblist_accelerate(struct rcu_segcblist *rsclp, struct rcu_gp_oldstat
 	 * out by the next call to rcu_segcblist_advance().
 	 *
 	 * Also advance to the oldest segment of callbacks whose
-	 * ->gp_seq[] completion is at or after that passed in via "seq",
+	 * ->gp_seq_full[] completion is at or after that passed in via "rgosp",
 	 * skipping any empty segments.
 	 *
 	 * Note that segment "i" (and any lower-numbered segments
 	 * containing older callbacks) will be unaffected, and their
-	 * grace-period numbers remain unchanged.  For example, if i ==
+	 * grace-period states remain unchanged.  For example, if i ==
 	 * WAIT_TAIL, then neither WAIT_TAIL nor DONE_TAIL will be touched.
 	 * Instead, the CBs in NEXT_TAIL will be merged with those in
-	 * NEXT_READY_TAIL and the grace-period number of NEXT_READY_TAIL
+	 * NEXT_READY_TAIL and the grace-period state of NEXT_READY_TAIL
 	 * would be updated.  NEXT_TAIL would then be empty.
 	 */
 	if (rcu_segcblist_restempty(rsclp, i) || ++i >= RCU_NEXT_TAIL)
@@ -589,8 +592,8 @@ bool rcu_segcblist_accelerate(struct rcu_segcblist *rsclp, struct rcu_gp_oldstat
 
 	/*
 	 * Merge all later callbacks, including newly arrived callbacks,
-	 * into the segment located by the for-loop above.  Assign "seq"
-	 * as the ->gp_seq[] value in order to correctly handle the case
+	 * into the segment located by the for-loop above.  Assign "rgosp"
+	 * as the grace-period state in order to correctly handle the case
 	 * where there were no pending callbacks in the rcu_segcblist
 	 * structure other than in the RCU_NEXT_TAIL segment.
 	 */
@@ -644,6 +647,10 @@ void srcu_segcblist_advance(struct rcu_segcblist *rsclp, unsigned long seq)
 	if (rcu_segcblist_restempty(rsclp, RCU_DONE_TAIL))
 		return;
 
+	/*
+	 * Find all callbacks whose normal GP sequence numbers indicate
+	 * that they are ready to invoke.  For SRCU, we only check rgos_norm.
+	 */
 	for (i = RCU_WAIT_TAIL; i < RCU_NEXT_TAIL; i++) {
 		if (ULONG_CMP_LT(seq, rsclp->gp_seq_full[i].rgos_norm))
 			break;
@@ -658,6 +665,12 @@ void srcu_segcblist_advance(struct rcu_segcblist *rsclp, unsigned long seq)
 	rcu_segcblist_advance_compact(rsclp, i);
 }
 
+/*
+ * SRCU wrapper for rcu_segcblist_accelerate() - converts SRCU's unsigned
+ * long GP sequence to rcu_gp_oldstate format with rgos_exp set to
+ * RCU_GET_STATE_NOT_TRACKED (since SRCU does not use expedited GPs)
+ * and calls the core rcu_segcblist_accelerate().
+ */
 bool srcu_segcblist_accelerate(struct rcu_segcblist *rsclp, unsigned long seq)
 {
 	struct rcu_gp_oldstate rgos = { .rgos_norm = seq };
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 35076092f754..0e43866dc4cd 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1209,7 +1209,7 @@ static void rcu_accelerate_cbs_unlocked(struct rcu_node *rnp,
 /*
  * Move any callbacks whose grace period has completed to the
  * RCU_DONE_TAIL sublist, then compact the remaining sublists and
- * assign ->gp_seq numbers to any callbacks in the RCU_NEXT_TAIL
+ * assign ->gp_seq_full[] state to any callbacks in the RCU_NEXT_TAIL
  * sublist.  This function is idempotent, so it does not hurt to
  * invoke it repeatedly.  As long as it is not invoked -too- often...
  * Returns true if the RCU grace-period kthread needs to be awakened.
@@ -1226,8 +1226,8 @@ static bool rcu_advance_cbs(struct rcu_node *rnp, struct rcu_data *rdp)
 		return false;
 
 	/*
-	 * Find all callbacks whose ->gp_seq numbers indicate that they
-	 * are ready to invoke, and put them into the RCU_DONE_TAIL sublist.
+	 * Find all callbacks whose grace periods have completed (either
+	 * normal or expedited) and put them into the RCU_DONE_TAIL sublist.
 	 */
 	rcu_segcblist_advance(&rdp->cblist);
 
-- 
2.52.0


^ permalink raw reply related

* [RFC PATCH 07/10] rcu: Wake NOCB rcuog kthreads on expedited grace period completion
From: Puranjay Mohan @ 2026-04-17 23:11 UTC (permalink / raw)
  To: rcu, linux-kernel, linux-trace-kernel
  Cc: Puranjay Mohan, Paul E. McKenney, Frederic Weisbecker,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Masami Hiramatsu, Davidlohr Bueso
In-Reply-To: <20260417231203.785172-1-puranjay@kernel.org>

When an expedited grace period completes, rcu_exp_wait_wake() wakes
waiters on rnp->exp_wq[] but does not notify NOCB rcuog kthreads.  These
kthreads may be sleeping waiting for a grace period to complete.
Without this wakeup, callbacks on offloaded CPUs that could benefit from
the expedited GP must wait until the rcuog kthread wakes for some other
reason (e.g., next normal GP completion or a timer).

Add rcu_exp_wake_nocb() which wakes rcuog kthreads for leaf-node CPUs,
deduplicating via rdp->nocb_gp_rdp since multiple CPUs share one rcuog
kthread.  Uses for_each_leaf_node_possible_cpu() because offline CPUs
can have pending callbacks. The function is defined in tree_nocb.h with
an empty stub for CONFIG_RCU_NOCB_CPU=n builds.

Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
 kernel/rcu/tree.h      |  1 +
 kernel/rcu/tree_exp.h  |  1 +
 kernel/rcu/tree_nocb.h | 29 +++++++++++++++++++++++++++++
 3 files changed, 31 insertions(+)

diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 7dfc57e9adb1..40f778453591 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -500,6 +500,7 @@ static struct swait_queue_head *rcu_nocb_gp_get(struct rcu_node *rnp);
 static void rcu_nocb_gp_cleanup(struct swait_queue_head *sq);
 static void rcu_init_one_nocb(struct rcu_node *rnp);
 static bool wake_nocb_gp(struct rcu_data *rdp);
+static void rcu_exp_wake_nocb(struct rcu_node *rnp);
 static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
 				  unsigned long j, bool lazy);
 static void call_rcu_nocb(struct rcu_data *rdp, struct rcu_head *head,
diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index 82cada459e5d..0df1009c6e97 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -708,6 +708,7 @@ static void rcu_exp_wait_wake(unsigned long s)
 		}
 		smp_mb(); /* All above changes before wakeup. */
 		wake_up_all(&rnp->exp_wq[rcu_seq_ctr(s) & 0x3]);
+		rcu_exp_wake_nocb(rnp);
 	}
 	trace_rcu_exp_grace_period(rcu_state.name, s, TPS("endwake"));
 	mutex_unlock(&rcu_state.exp_wake_mutex);
diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
index 7462cd5e2507..f37ee56d62a9 100644
--- a/kernel/rcu/tree_nocb.h
+++ b/kernel/rcu/tree_nocb.h
@@ -190,6 +190,31 @@ static void rcu_init_one_nocb(struct rcu_node *rnp)
 	init_swait_queue_head(&rnp->nocb_gp_wq[1]);
 }
 
+/*
+ * Wake NOCB rcuog kthreads for leaf-node CPUs so that they can advance
+ * callbacks that were waiting for the just-completed expedited GP.
+ * Deduplicate via nocb_gp_rdp since multiple CPUs share one rcuog
+ * kthread.  Use for_each_leaf_node_possible_cpu() because offline CPUs
+ * may have pending callbacks.
+ */
+static void rcu_exp_wake_nocb(struct rcu_node *rnp)
+{
+	struct rcu_data *last_rdp_gp = NULL;
+	int cpu;
+
+	if (!rcu_is_leaf_node(rnp))
+		return;
+
+	for_each_leaf_node_possible_cpu(rnp, cpu) {
+		struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);
+
+		if (rdp->nocb_gp_rdp == last_rdp_gp)
+			continue;
+		last_rdp_gp = rdp->nocb_gp_rdp;
+		wake_nocb_gp(rdp);
+	}
+}
+
 /* Clear any pending deferred wakeup timer (nocb_gp_lock must be held). */
 static void nocb_defer_wakeup_cancel(struct rcu_data *rdp_gp)
 {
@@ -1668,6 +1693,10 @@ static void rcu_init_one_nocb(struct rcu_node *rnp)
 {
 }
 
+static void rcu_exp_wake_nocb(struct rcu_node *rnp)
+{
+}
+
 static bool wake_nocb_gp(struct rcu_data *rdp)
 {
 	return false;
-- 
2.52.0


^ permalink raw reply related

* [RFC PATCH 08/10] rcu: Detect expedited grace period completion in rcu_pending()
From: Puranjay Mohan @ 2026-04-17 23:11 UTC (permalink / raw)
  To: rcu, linux-kernel, linux-trace-kernel
  Cc: Puranjay Mohan, Paul E. McKenney, Frederic Weisbecker,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Masami Hiramatsu, Davidlohr Bueso
In-Reply-To: <20260417231203.785172-1-puranjay@kernel.org>

rcu_pending() is the gatekeeper that decides whether rcu_core() should
run on the current CPU's timer tick. Currently it checks if the CPU has
callbacks ready to invoke or a grace period has completed or started.

It does not check that an expedited GP has completed. After an expedited
GP, callbacks remain in RCU_WAIT_TAIL (not yet advanced to
RCU_DONE_TAIL) and So rcu_core() never runs to advance them.

Add a check using rcu_segcblist_nextgp() combined with
poll_state_synchronize_rcu_full() to detect when any pending callbacks'
grace period has completed.

Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
 kernel/rcu/tree.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 0e43866dc4cd..309273a37b0a 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3671,6 +3671,7 @@ EXPORT_SYMBOL_GPL(cond_synchronize_rcu_full);
 static int rcu_pending(int user)
 {
 	bool gp_in_progress;
+	struct rcu_gp_oldstate gp_state;
 	struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
 	struct rcu_node *rnp = rdp->mynode;
 
@@ -3701,6 +3702,12 @@ static int rcu_pending(int user)
 	    rcu_segcblist_ready_cbs(&rdp->cblist))
 		return 1;
 
+	/* Has a GP (normal or expedited) completed for pending callbacks? */
+	if (!rcu_rdp_is_offloaded(rdp) &&
+	    rcu_segcblist_nextgp(&rdp->cblist, &gp_state) &&
+	    poll_state_synchronize_rcu_full(&gp_state))
+		return 1;
+
 	/* Has RCU gone idle with this CPU needing another grace period? */
 	if (!gp_in_progress && rcu_segcblist_is_enabled(&rdp->cblist) &&
 	    !rcu_rdp_is_offloaded(rdp) &&
-- 
2.52.0


^ permalink raw reply related

* [RFC PATCH 09/10] rcu: Advance callbacks for expedited GP completion in rcu_core()
From: Puranjay Mohan @ 2026-04-17 23:11 UTC (permalink / raw)
  To: rcu, linux-kernel, linux-trace-kernel
  Cc: Puranjay Mohan, Paul E. McKenney, Frederic Weisbecker,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Masami Hiramatsu, Davidlohr Bueso
In-Reply-To: <20260417231203.785172-1-puranjay@kernel.org>

Even when rcu_pending() triggers rcu_core(), the normal callback
advancement path through note_gp_changes() -> __note_gp_changes() bails
out when rdp->gp_seq == rnp->gp_seq (no normal GP change). Since
expedited GPs do not update rnp->gp_seq, rcu_advance_cbs() is never
called and callbacks remain stuck in RCU_WAIT_TAIL.

Add a direct callback advancement block in rcu_core() that checks for GP
completion via rcu_segcblist_nextgp() combined with
poll_state_synchronize_rcu_full(). When detected, trylock rnp and call
rcu_advance_cbs() to move completed callbacks to RCU_DONE_TAIL. Wake the
GP kthread if rcu_advance_cbs() requests a new grace period.

Uses trylock to avoid adding contention on rnp->lock. If the lock is
contended, callbacks will be advanced on the next tick.

Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
 kernel/rcu/tree.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 309273a37b0a..1a92b6105de5 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2853,6 +2853,22 @@ static __latent_entropy void rcu_core(void)
 	/* Update RCU state based on any recent quiescent states. */
 	rcu_check_quiescent_state(rdp);

+	/* Advance callbacks if an expedited GP has completed. */
+	if (!rcu_rdp_is_offloaded(rdp) &&
+	    rcu_segcblist_is_enabled(&rdp->cblist)) {
+		struct rcu_gp_oldstate gp_state;
+
+		if (rcu_segcblist_nextgp(&rdp->cblist, &gp_state) &&
+		    poll_state_synchronize_rcu_full(&gp_state)) {
+			guard(irqsave)();
+			if (raw_spin_trylock_rcu_node(rnp)) {
+				if (rcu_advance_cbs(rnp, rdp))
+					rcu_gp_kthread_wake();
+				raw_spin_unlock_rcu_node(rnp);
+			}
+		}
+	}
+
 	/* No grace period and unregistered callbacks? */
 	if (!rcu_gp_in_progress() &&
 	    rcu_segcblist_is_enabled(&rdp->cblist) && !rcu_rdp_is_offloaded(rdp)) {
-- 
2.52.0

^ permalink raw reply related

* [RFC PATCH 10/10] rcuscale: Add concurrent expedited GP threads for callback scaling tests
From: Puranjay Mohan @ 2026-04-17 23:11 UTC (permalink / raw)
  To: rcu, linux-kernel, linux-trace-kernel
  Cc: Puranjay Mohan, Paul E. McKenney, Frederic Weisbecker,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Masami Hiramatsu, Davidlohr Bueso
In-Reply-To: <20260417231203.785172-1-puranjay@kernel.org>

Add nexp and exp_interval parameters to rcuscale that spawn kthreads
running synchronize_rcu_expedited() in a loop. This generates concurrent
expedited GP load while the normal writers measure GP or callback
latency.

When combined with gp_async=1 (which uses call_rcu() for writers), this
tests how effectively callbacks benefit from expedited grace periods.
With RCU callback expedited GP tracking, the async callbacks should
complete faster because they piggyback on the expedited GPs rather than
waiting for normal GPs.

Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
 kernel/rcu/rcuscale.c | 84 +++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 82 insertions(+), 2 deletions(-)

diff --git a/kernel/rcu/rcuscale.c b/kernel/rcu/rcuscale.c
index ac0b1c6b7dae..1097ec15879c 100644
--- a/kernel/rcu/rcuscale.c
+++ b/kernel/rcu/rcuscale.c
@@ -91,6 +91,8 @@ torture_param(int, shutdown_secs, !IS_MODULE(CONFIG_RCU_SCALE_TEST) * 300,
 torture_param(int, verbose, 1, "Enable verbose debugging printk()s");
 torture_param(int, writer_holdoff, 0, "Holdoff (us) between GPs, zero to disable");
 torture_param(int, writer_holdoff_jiffies, 0, "Holdoff (jiffies) between GPs, zero to disable");
+torture_param(int, nexp, 0, "Number of expedited GP threads to run concurrently");
+torture_param(int, exp_interval, 0, "Interval (us) between expedited GPs, zero to disable");
 torture_param(int, kfree_rcu_test, 0, "Do we run a kfree_rcu() scale test?");
 torture_param(int, kfree_mult, 1, "Multiple of kfree_obj size to allocate.");
 torture_param(int, kfree_by_call_rcu, 0, "Use call_rcu() to emulate kfree_rcu()?");
@@ -115,8 +117,10 @@ struct writer_freelist {
 
 static int nrealreaders;
 static int nrealwriters;
+static int nrealexp;
 static struct task_struct **writer_tasks;
 static struct task_struct **reader_tasks;
+static struct task_struct **exp_tasks;
 
 static u64 **writer_durations;
 static bool *writer_done;
@@ -462,6 +466,34 @@ rcu_scale_reader(void *arg)
 	return 0;
 }
 
+/*
+ * RCU expedited GP kthread.  Repeatedly invokes expedited grace periods
+ * to generate concurrent expedited GP load while the normal-GP writers
+ * are being measured.  This allows measuring the benefit of callbacks
+ * that can piggyback on expedited grace periods.
+ */
+static int
+rcu_scale_exp(void *arg)
+{
+	long me = (long)arg;
+
+	VERBOSE_SCALEOUT_STRING("rcu_scale_exp task started");
+	set_cpus_allowed_ptr(current, cpumask_of(me % nr_cpu_ids));
+	set_user_nice(current, MIN_NICE);
+
+	if (holdoff)
+		schedule_timeout_idle(holdoff * HZ);
+
+	do {
+		if (exp_interval)
+			udelay(exp_interval);
+		cur_ops->exp_sync();
+		rcu_scale_wait_shutdown();
+	} while (!torture_must_stop());
+	torture_kthread_stopping("rcu_scale_exp");
+	return 0;
+}
+
 /*
  * Allocate a writer_mblock structure for the specified rcu_scale_writer
  * task.
@@ -664,8 +696,10 @@ static void
 rcu_scale_print_module_parms(struct rcu_scale_ops *cur_ops, const char *tag)
 {
 	pr_alert("%s" SCALE_FLAG
-		 "--- %s: gp_async=%d gp_async_max=%d gp_exp=%d holdoff=%d minruntime=%d nreaders=%d nwriters=%d writer_holdoff=%d writer_holdoff_jiffies=%d verbose=%d shutdown_secs=%d\n",
-		 scale_type, tag, gp_async, gp_async_max, gp_exp, holdoff, minruntime, nrealreaders, nrealwriters, writer_holdoff, writer_holdoff_jiffies, verbose, shutdown_secs);
+		 "--- %s: gp_async=%d gp_async_max=%d gp_exp=%d holdoff=%d minruntime=%d nreaders=%d nwriters=%d nexp=%d exp_interval=%d writer_holdoff=%d writer_holdoff_jiffies=%d verbose=%d shutdown_secs=%d\n",
+		 scale_type, tag, gp_async, gp_async_max, gp_exp, holdoff,
+		 minruntime, nrealreaders, nrealwriters, nrealexp, exp_interval,
+		 writer_holdoff, writer_holdoff_jiffies, verbose, shutdown_secs);
 }
 
 /*
@@ -809,6 +843,13 @@ kfree_scale_cleanup(void)
 	if (torture_cleanup_begin())
 		return;
 
+	if (exp_tasks) {
+		for (i = 0; i < nrealexp; i++)
+			torture_stop_kthread(rcu_scale_exp, exp_tasks[i]);
+		kfree(exp_tasks);
+		exp_tasks = NULL;
+	}
+
 	if (kfree_reader_tasks) {
 		for (i = 0; i < kfree_nrealthreads; i++)
 			torture_stop_kthread(kfree_scale_thread,
@@ -903,6 +944,22 @@ kfree_scale_init(void)
 			goto unwind;
 	}
 
+	if (nrealexp > 0 && cur_ops->exp_sync) {
+		exp_tasks = kzalloc_objs(exp_tasks[0], nrealexp);
+		if (!exp_tasks) {
+			SCALEOUT_ERRSTRING("out of memory");
+			firsterr = -ENOMEM;
+			goto unwind;
+		}
+		for (i = 0; i < nrealexp; i++) {
+			firsterr = torture_create_kthread(rcu_scale_exp,
+							  (void *)i,
+							  exp_tasks[i]);
+			if (torture_init_error(firsterr))
+				goto unwind;
+		}
+	}
+
 	while (atomic_read(&n_kfree_scale_thread_started) < kfree_nrealthreads)
 		schedule_timeout_uninterruptible(1);
 
@@ -959,6 +1016,13 @@ rcu_scale_cleanup(void)
 		return;
 	}
 
+	if (exp_tasks) {
+		for (i = 0; i < nrealexp; i++)
+			torture_stop_kthread(rcu_scale_exp, exp_tasks[i]);
+		kfree(exp_tasks);
+		exp_tasks = NULL;
+	}
+
 	if (reader_tasks) {
 		for (i = 0; i < nrealreaders; i++)
 			torture_stop_kthread(rcu_scale_reader,
@@ -1076,6 +1140,7 @@ rcu_scale_init(void)
 		if (kthread_tp)
 			kthread_stime = kthread_tp->stime;
 	}
+	nrealexp = nexp;
 	if (kfree_rcu_test)
 		return kfree_scale_init();
 
@@ -1107,6 +1172,21 @@ rcu_scale_init(void)
 	}
 	while (atomic_read(&n_rcu_scale_reader_started) < nrealreaders)
 		schedule_timeout_uninterruptible(1);
+	if (nrealexp > 0 && cur_ops->exp_sync) {
+		exp_tasks = kzalloc_objs(exp_tasks[0], nrealexp);
+		if (!exp_tasks) {
+			SCALEOUT_ERRSTRING("out of memory");
+			firsterr = -ENOMEM;
+			goto unwind;
+		}
+		for (i = 0; i < nrealexp; i++) {
+			firsterr = torture_create_kthread(rcu_scale_exp,
+							  (void *)i,
+							  exp_tasks[i]);
+			if (torture_init_error(firsterr))
+				goto unwind;
+		}
+	}
 	writer_tasks = kzalloc_objs(writer_tasks[0], nrealwriters);
 	writer_durations = kcalloc(nrealwriters, sizeof(*writer_durations), GFP_KERNEL);
 	writer_n_durations = kzalloc_objs(*writer_n_durations, nrealwriters);
-- 
2.52.0


^ permalink raw reply related

* Re: [PATCH v3 2/2] blk-mq: expose tag starvation counts via debugfs
From: Aaron Tomlin @ 2026-04-18  2:29 UTC (permalink / raw)
  To: Jens Axboe, bvanassche
  Cc: rostedt, mhiramat, mathieu.desnoyers, johannes.thumshirn, kch,
	dlemoal, ritesh.list, loberman, neelx, sean, mproche, chjohnst,
	linux-block, linux-kernel, linux-trace-kernel
In-Reply-To: <424110a4-4e14-4e40-af0d-59ca7a30d904@kernel.dk>

[-- Attachment #1: Type: text/plain, Size: 1521 bytes --]

On Fri, Apr 17, 2026 at 02:16:07PM -0600, Jens Axboe wrote:
> On 4/17/26 12:28 PM, Bart Van Assche wrote:
> > On 3/19/26 3:19 PM, Aaron Tomlin wrote:
> >> To guarantee zero performance overhead for production kernels compiled
> >> without debugfs, the underlying atomic_t variables and their associated
> >> increment routines are strictly guarded behind CONFIG_BLK_DEBUG_FS.
> >> When this configuration is disabled, the tracking logic compiles down
> >> to a safe no-op.
> > 
> > I don't think that's sufficient. Please use per-cpu counters to
> > minimize the overhead for kernels in which debugfs is enabled.
> 
> Agree, this is the usual nonsense of thinking you can hide any overhead
> behind a config option, when in practice production kernels very much DO
> have CONFIG_DEBUGFS enabled.

Hi Bart, Jens,

Thank you both for the candid feedback.

I concede that relying on CONFIG_BLK_DEBUG_FS to mitigate the performance
impact was a short-sighted approach, as you are absolutely right that
debugfs is routinely enabled in modern production environments. The use of
an atomic_t variable in this case would indeed introduce unacceptable cache
line contention under heavy workloads.

I shall address this in the next iteration by transitioning the tracking
logic entirely to per-CPU counters. This will ensure the overhead is
strictly minimised, irrespective of the build configuration.

Thank you again for your time and guidance on this matter.

Kind regards,
-- 
Aaron Tomlin

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH] trace: propagate registration failure from tracing_start_*_record()
From: Yash Suthar @ 2026-04-18  5:38 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: mhiramat, mathieu.desnoyers, linux-kernel, linux-trace-kernel,
	skhan, me, syzbot+a1d25e53cd4a10f7f2d3
In-Reply-To: <20260417115258.65e216be@fedora>

Hello Steven,
Thank you for taking a look and really sorry.

I did use ai assistance for commit message, but I reviewed, modified
and tested(with syzbot locally) the code myself. I should have
disclosed really sorry.

One thing I want to know (or I am still missing something):
sched_cmdline_ref is incremented before tracing_sched_register() and
register fails,  but sched_cmdline_ref stays at 1 and on disable
tracepoint_remove_func() sees NULL and return error (as syzbot
reported and reproduce also locally).
your suggestion WARN_ONCE correctly flags the upstream failure, but
the secondary WARN at  tracepoint.c:358 will still fire on the next
disable, since the refcount desync isn't addressed. Was that
intentional ?

in my earlier approach, I tried to propagate the error and handle this
desync under fault injection. I understand this may not align with the
preferred direction, but that was the motivation behind the change.

If you think it makes sense, I can prepare a v2 with your suggestion
while also addressing the refcount consistency.

Thank you

On Fri, Apr 17, 2026 at 9:23 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Fri, 17 Apr 2026 12:08:27 +0530
> Yash Suthar <yashsuthar983@gmail.com> wrote:
>
> > syzbot reported a WARN in tracepoint_probe_unregister():
> >
> > tracing_start_sched_switch() increments sched_cmdline_ref /
> > sched_tgid_ref before calling tracing_sched_register(), and its
> > return value is discarded because the API is void. When the first
> > register_trace_sched_*() fails (e.g. kmalloc under memory pressure
> > or failslab), the function's fail_deprobe* labels roll back any
> > partial probe registration, but the caller's refcount has already
> > been bumped. The state is now desynced: refs > 0 but no probes in
> > tp->funcs.
> >
> > Later, when the caller pairs the start with a stop, the refcount
> > walks back to 0 and tracing_sched_unregister() calls
> > unregister_trace_sched_*() against an empty tp->funcs.
> > func_remove() returns -ENOENT and the
> > WARN_ON_ONCE(IS_ERR(old)) in tracepoint_remove_func() fires.
> >
> > Fix: make tracing_start_sched_switch() and the two exported
> > wrappers, tracing_start_cmdline_record() and
> > tracing_start_tgid_record(), return int; register the probes
> > before bumping the refcount; and propagate the error to callers
> > so refs are only held on behalf of a caller whose registration
> > actually succeeded.
> >
> > Fixes: d914ba37d714 ("tracing: Add support for recording tgid of tasks")
> > Reported-by: syzbot+a1d25e53cd4a10f7f2d3@syzkaller.appspotmail.com
> > Closes: https://syzkaller.appspot.com/bug?id=f93e97cd824071a2577a40cde9ecd957f59f87eb
>
> Did you use AI to create any of this? If so you must disclose it. This
> reads very much like an AI patch.
>
> >
> > Signed-off-by: Yash Suthar <yashsuthar983@gmail.com>
> > ---
> >  kernel/trace/trace.c                 |  6 +++---
> >  kernel/trace/trace.h                 |  4 ++--
> >  kernel/trace/trace_events.c          | 28 +++++++++++++++++++--------
> >  kernel/trace/trace_functions.c       |  8 +++++++-
> >  kernel/trace/trace_functions_graph.c |  6 +++++-
> >  kernel/trace/trace_sched_switch.c    | 29 ++++++++++++++++++----------
> >  kernel/trace/trace_selftest.c        |  7 ++++++-
> >  7 files changed, 62 insertions(+), 26 deletions(-)
>
> NAK on all this. If you are under severe memory constraints that causes
> this to fail, then you'll be hitting a bunch more errors.
>
> >
> > diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> > index 8bd4ec08fb36..e936eed99b27 100644
> > --- a/kernel/trace/trace.c
> > +++ b/kernel/trace/trace.c
> > @@ -3320,7 +3320,7 @@ void trace_printk_init_buffers(void)
> >        * allocated here, then this was called by module code.
> >        */
> >       if (global_trace.array_buffer.buffer)
> > -             tracing_start_cmdline_record();
> > +             (void)tracing_start_cmdline_record();
>
> WTF??? Why are you adding the typecast of (void) here? Don't do that!
>
>
> >  }
> >  EXPORT_SYMBOL_GPL(trace_printk_init_buffers);
> >
> > @@ -3329,7 +3329,7 @@ void trace_printk_start_comm(void)
> >       /* Start tracing comms if trace printk is set */
> >       if (!buffers_allocated)
> >               return;
> > -     tracing_start_cmdline_record();
> > +     (void)tracing_start_cmdline_record();
> >  }
> >
> >  static void trace_printk_start_stop_comm(int enabled)
> > @@ -3338,7 +3338,7 @@ static void trace_printk_start_stop_comm(int enabled)
> >               return;
> >
> >       if (enabled)
> > -             tracing_start_cmdline_record();
> > +             (void)tracing_start_cmdline_record();
> >       else
> >               tracing_stop_cmdline_record();
> >  }
> > diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
> > index b6d42fe06115..6fe2c8429560 100644
> > --- a/kernel/trace/trace.h
> > +++ b/kernel/trace/trace.h
> > @@ -751,9 +751,9 @@ void trace_graph_return(struct ftrace_graph_ret *trace, struct fgraph_ops *gops,
> >  int trace_graph_entry(struct ftrace_graph_ent *trace, struct fgraph_ops *gops,
> >                     struct ftrace_regs *fregs);
> >
> > -void tracing_start_cmdline_record(void);
> > +int tracing_start_cmdline_record(void);
> >  void tracing_stop_cmdline_record(void);
> > -void tracing_start_tgid_record(void);
> > +int tracing_start_tgid_record(void);
> >  void tracing_stop_tgid_record(void);
> >
> >  int register_tracer(struct tracer *type);
> > diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
> > index 137b4d9bb116..e6713aa80a03 100644
> > --- a/kernel/trace/trace_events.c
> > +++ b/kernel/trace/trace_events.c
> > @@ -734,9 +734,9 @@ void trace_event_enable_cmd_record(bool enable)
> >                       continue;
> >
> >               if (enable) {
> > -                     tracing_start_cmdline_record();
> > -                     set_bit(EVENT_FILE_FL_RECORDED_CMD_BIT, &file->flags);
> > -             } else {
> > +                     if (!tracing_start_cmdline_record())
> > +                             set_bit(EVENT_FILE_FL_RECORDED_CMD_BIT, &file->flags);
> > +             } else if (file->flags & EVENT_FILE_FL_RECORDED_CMD) {
> >                       tracing_stop_cmdline_record();
> >                       clear_bit(EVENT_FILE_FL_RECORDED_CMD_BIT, &file->flags);
> >               }
> > @@ -755,9 +755,9 @@ void trace_event_enable_tgid_record(bool enable)
> >                       continue;
> >
> >               if (enable) {
> > -                     tracing_start_tgid_record();
> > -                     set_bit(EVENT_FILE_FL_RECORDED_TGID_BIT, &file->flags);
> > -             } else {
> > +                     if (!tracing_start_tgid_record())
> > +                             set_bit(EVENT_FILE_FL_RECORDED_TGID_BIT, &file->flags);
> > +             } else if (file->flags & EVENT_FILE_FL_RECORDED_TGID) {
> >                       tracing_stop_tgid_record();
> >                       clear_bit(EVENT_FILE_FL_RECORDED_TGID_BIT,
> >                                 &file->flags);
> > @@ -847,14 +847,26 @@ static int __ftrace_event_enable_disable(struct trace_event_file *file,
> >                               set_bit(EVENT_FILE_FL_SOFT_DISABLED_BIT, &file->flags);
> >
> >                       if (tr->trace_flags & TRACE_ITER(RECORD_CMD)) {
> > +                             ret = tracing_start_cmdline_record();
> > +                             if (ret) {
> > +                                     pr_info("event trace: Could not enable event %s\n",
> > +                                             trace_event_name(call));
> > +                                     break;
> > +                             }
> >                               cmd = true;
> > -                             tracing_start_cmdline_record();
> >                               set_bit(EVENT_FILE_FL_RECORDED_CMD_BIT, &file->flags);
> >                       }
> >
> >                       if (tr->trace_flags & TRACE_ITER(RECORD_TGID)) {
> > +                             ret = tracing_start_tgid_record();
> > +                             if (ret) {
> > +                                     if (cmd)
> > +                                             tracing_stop_cmdline_record();
> > +                                     pr_info("event trace: Could not enable event %s\n",
> > +                                             trace_event_name(call));
> > +                                     break;
> > +                             }
> >                               tgid = true;
> > -                             tracing_start_tgid_record();
> >                               set_bit(EVENT_FILE_FL_RECORDED_TGID_BIT, &file->flags);
> >                       }
> >
> > diff --git a/kernel/trace/trace_functions.c b/kernel/trace/trace_functions.c
> > index c12795c2fb39..14d099734345 100644
> > --- a/kernel/trace/trace_functions.c
> > +++ b/kernel/trace/trace_functions.c
> > @@ -146,6 +146,8 @@ static bool handle_func_repeats(struct trace_array *tr, u32 flags_val)
> >  static int function_trace_init(struct trace_array *tr)
> >  {
> >       ftrace_func_t func;
> > +     int ret;
> > +
> >       /*
> >        * Instance trace_arrays get their ops allocated
> >        * at instance creation. Unless it failed
> > @@ -165,7 +167,11 @@ static int function_trace_init(struct trace_array *tr)
> >
> >       tr->array_buffer.cpu = raw_smp_processor_id();
> >
> > -     tracing_start_cmdline_record();
> > +     ret = tracing_start_cmdline_record();
> > +     if (ret) {
> > +             ftrace_reset_array_ops(tr);
> > +             return ret;
> > +     }
> >       tracing_start_function_trace(tr);
> >       return 0;
> >  }
> > diff --git a/kernel/trace/trace_functions_graph.c b/kernel/trace/trace_functions_graph.c
> > index 1de6f1573621..6b27ed62fee8 100644
> > --- a/kernel/trace/trace_functions_graph.c
> > +++ b/kernel/trace/trace_functions_graph.c
> > @@ -487,7 +487,11 @@ static int graph_trace_init(struct trace_array *tr)
> >       ret = register_ftrace_graph(tr->gops);
> >       if (ret)
> >               return ret;
> > -     tracing_start_cmdline_record();
> > +     ret = tracing_start_cmdline_record();
> > +     if (ret) {
> > +             unregister_ftrace_graph(tr->gops);
> > +             return ret;
> > +     }
> >
> >       return 0;
> >  }
> > diff --git a/kernel/trace/trace_sched_switch.c b/kernel/trace/trace_sched_switch.c
> > index c46d584ded3b..683ea4ca1498 100644
> > --- a/kernel/trace/trace_sched_switch.c
> > +++ b/kernel/trace/trace_sched_switch.c
> > @@ -89,12 +89,22 @@ static void tracing_sched_unregister(void)
> >       unregister_trace_sched_wakeup(probe_sched_wakeup, NULL);
> >  }
> >
> > -static void tracing_start_sched_switch(int ops)
> > +static int tracing_start_sched_switch(int ops)
> >  {
> > -     bool sched_register;
> > +     int ret = 0;
> >
> >       mutex_lock(&sched_register_mutex);
> > -     sched_register = (!sched_cmdline_ref && !sched_tgid_ref);
> > +
> > +     /*
> > +      * If the registration fails, do not bump the reference count : the
> > +      * caller must observe the failure so it can avoid a later matching
> > +      * stop that would otherwise unregister probes that were never added.
> > +      */
> > +     if (!sched_cmdline_ref && !sched_tgid_ref) {
> > +             ret = tracing_sched_register();
> > +             if (ret)
> > +                     goto out;
> > +     }
> >
> >       switch (ops) {
> >       case RECORD_CMDLINE:
> > @@ -105,10 +115,9 @@ static void tracing_start_sched_switch(int ops)
> >               sched_tgid_ref++;
> >               break;
> >       }
> > -
> > -     if (sched_register && (sched_cmdline_ref || sched_tgid_ref))
> > -             tracing_sched_register();
>
> The only change that should deal with this would be:
>
>         if (sched_register && (sched_cmdline_ref || sched_tgid_ref)) {
>                 WARN_ONCE(tracing_sched_register() < 0,
>                         "Failed to register trace command line caching. Requires reboot to fix");
>         }
>
> -- Steve
>
>
>
> > +out:
> >       mutex_unlock(&sched_register_mutex);
> > +     return ret;
> >  }
> >
> >  static void tracing_stop_sched_switch(int ops)
> > @@ -130,9 +139,9 @@ static void tracing_stop_sched_switch(int ops)
> >       mutex_unlock(&sched_register_mutex);
> >  }
> >
> > -void tracing_start_cmdline_record(void)
> > +int tracing_start_cmdline_record(void)
> >  {
> > -     tracing_start_sched_switch(RECORD_CMDLINE);
> > +     return tracing_start_sched_switch(RECORD_CMDLINE);
> >  }
> >
> >  void tracing_stop_cmdline_record(void)
> > @@ -140,9 +149,9 @@ void tracing_stop_cmdline_record(void)
> >       tracing_stop_sched_switch(RECORD_CMDLINE);
> >  }
> >
> > -void tracing_start_tgid_record(void)
> > +int tracing_start_tgid_record(void)
> >  {
> > -     tracing_start_sched_switch(RECORD_TGID);
> > +     return tracing_start_sched_switch(RECORD_TGID);
> >  }
> >
> >  void tracing_stop_tgid_record(void)
> > diff --git a/kernel/trace/trace_selftest.c b/kernel/trace/trace_selftest.c
> > index d88c44f1dfa5..238e7451f8e4 100644
> > --- a/kernel/trace/trace_selftest.c
> > +++ b/kernel/trace/trace_selftest.c
> > @@ -1084,7 +1084,12 @@ trace_selftest_startup_function_graph(struct tracer *trace,
> >               warn_failed_init_tracer(trace, ret);
> >               goto out;
> >       }
> > -     tracing_start_cmdline_record();
> > +     ret = tracing_start_cmdline_record();
> > +     if (ret) {
> > +             unregister_ftrace_graph(&fgraph_ops);
> > +             warn_failed_init_tracer(trace, ret);
> > +             goto out;
> > +     }
> >
> >       /* Sleep for a 1/10 of a second */
> >       msleep(100);
>

^ permalink raw reply

* Re: [PATCHv5 bpf-next 19/28] libbpf: Add btf_type_is_traceable_func function
From: bot+bpf-ci @ 2026-04-18  5:59 UTC (permalink / raw)
  To: jolsa, ast, daniel, andrii
  Cc: bpf, linux-trace-kernel, kafai, eddyz87, songliubraving, yhs,
	menglong8.dong, rostedt, ast, andrii, daniel, martin.lau, eddyz87,
	yonghong.song, clm, ihor.solodrai
In-Reply-To: <20260417192502.194548-20-jolsa@kernel.org>

[-- Attachment #1: Type: text/plain, Size: 3140 bytes --]

> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index fc08ac6bcdde..49368ee161b1 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c

[ ... ]

> +static int btf_get_type_size(const struct btf *btf, __u32 type_id,
> +			     const struct btf_type **ret_type)
> +{
> +	const struct btf_type *t;
> +
> +	*ret_type = btf__type_by_id(btf, 0);
> +	if (!type_id)
> +		return 0;
> +	t = btf__type_by_id(btf, type_id);
> +	while (t && btf_type_is_modifier(t))
> +		t = btf__type_by_id(btf, t->type);

Does this loop need a maximum resolution depth limit?

If a malformed BTF object contains a circular modifier reference
(e.g., typedef A -> const B -> typedef A), could this loop spin
indefinitely and cause a hang?

> +	if (!t)
> +		return -EINVAL;
> +	*ret_type = t;
> +	if (btf_is_ptr(t))
> +		return sizeof(void *);

Does this correctly calculate the pointer size for cross-architecture
analysis?

Since libbpf is a user-space library that may process BTF blobs from
different architectures (like a 64-bit tool processing a 32-bit BPF
object), will hardcoding sizeof(void *) return the host's pointer size
instead of the target's pointer size?

Would using btf__pointer_size(btf) be more appropriate here?

> +	if (btf_is_int(t) || btf_is_any_enum(t) || btf_is_struct(t) || btf_is_union(t))
> +		return t->size;
> +	return -EINVAL;
> +}
> +
> +bool btf_type_is_traceable_func(const struct btf *btf, const struct btf_type *t)
> +{
> +	const struct btf_param *args;
> +	const struct btf_type *proto;
> +	__u32 i, nargs;
> +	int ret;
> +
> +	proto = btf_type_by_id(btf, t->type);

Could this cause an out-of-bounds read if t is not a function type?

Because the type and size fields share a union in struct btf_type,
passing a type like BTF_KIND_STRUCT causes its size to be misinterpreted
as a type ID.

The internal btf_type_by_id() helper skips bounds checking. If the
interpreted ID exceeds the BTF type array bounds, it might read past the
end of the array and return a garbage pointer, which would then be
dereferenced in the next line.

Should this verify !btf_is_func(t) first, and use the bounds-checked
public API btf__type_by_id() along with a NULL check for proto?

> +	if (BTF_INFO_KIND(proto->info) != BTF_KIND_FUNC_PROTO)
> +		return false;

[ ... ]

> diff --git a/tools/lib/bpf/libbpf_internal.h b/tools/lib/bpf/libbpf_internal.h
> index 3781c45b46d3..a9845f57e56c 100644
> --- a/tools/lib/bpf/libbpf_internal.h
> +++ b/tools/lib/bpf/libbpf_internal.h
> @@ -250,6 +250,7 @@ const struct btf_type *skip_mods_and_typedefs(const struct btf *btf, __u32 id, _
>  const struct btf_header *btf_header(const struct btf *btf);
>  void btf_set_base_btf(struct btf *btf, const struct btf *base_btf);
>  int btf_relocate(struct btf *btf, const struct btf *base_btf, __u32 **id_map);
> +bool btf_type_is_traceable_func(const struct btf *btf, const struct btf_type *t);

---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/24598000047

^ permalink raw reply

* Re: [PATCHv5 bpf-next 21/28] selftests/bpf: Add tracing multi skel/pattern/ids attach tests
From: bot+bpf-ci @ 2026-04-18  6:10 UTC (permalink / raw)
  To: jolsa, ast, daniel, andrii
  Cc: bpf, linux-trace-kernel, kafai, eddyz87, songliubraving, yhs,
	menglong8.dong, rostedt, ast, andrii, daniel, martin.lau, eddyz87,
	yonghong.song, clm, ihor.solodrai
In-Reply-To: <20260417192502.194548-22-jolsa@kernel.org>

[-- Attachment #1: Type: text/plain, Size: 2713 bytes --]

> diff --git a/tools/testing/selftests/bpf/prog_tests/tracing_multi.c b/tools/testing/selftests/bpf/prog_tests/tracing_multi.c
> new file mode 100644
> index 000000000000..52d3339b430a
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/tracing_multi.c
> @@ -0,0 +1,256 @@

[ ... ]

> +static __u32 *get_ids(const char * const funcs[], int funcs_cnt, const char *mod)
> +{
> +	struct btf *btf, *vmlinux_btf = NULL;
> +	__u32 nr, type_id, cnt = 0;
> +	void *root = NULL;
> +	__u32 *ids = NULL;
> +	int i, err = 0;
> +
> +	btf = btf__load_vmlinux_btf();
> +	if (!ASSERT_OK_PTR(btf, "btf__load_vmlinux_btf"))
> +		return NULL;
> +
> +	if (mod) {
> +		vmlinux_btf = btf;
> +		btf = btf__load_module_btf(mod, vmlinux_btf);
> +		if (!ASSERT_OK_PTR(btf, "btf__load_module_btf")) {
> +			btf__free(vmlinux_btf);
                        ^^^^

Does this code double-free vmlinux_btf when btf__load_module_btf() fails?

When btf__load_module_btf(mod, vmlinux_btf) fails, the error path at
ASSERT_OK_PTR calls btf__free(vmlinux_btf) and then jumps to the out:
label:

> +			goto out;
> +		}
> +	}

[ ... ]

> +out:
> +	tdestroy(root, tdestroy_free_nop);
> +	btf__free(vmlinux_btf);
        ^^^^

which calls btf__free(vmlinux_btf) again on the same pointer.

Multiple reviewers (bot+bpf-ci@kernel.org and sashiko-bot@kernel.org) raised
this concern in v5. The fix should either set vmlinux_btf=NULL after the
first free or return directly without going to the out label.

Reference: https://lore.kernel.org/bpf/1ab81cf709ce7d536a898695ffad30c15c8a6a7bf48dcb7271239952dd270315@mail.kernel.org/

> +	ids = calloc(funcs_cnt, sizeof(ids[0]));
> +	if (!ids)
> +		goto out;
> +
> +	/*
> +	 * We sort function names by name and search them
> +	 * below for each function.
> +	 */
> +	for (i = 0; i < funcs_cnt; i++) {
> +		if (!tsearch(&funcs[i], &root, compare)) {
> +			ASSERT_FAIL("tsearch failed");
                        ^^^^

When tsearch() fails, should err be set to -1 before the goto?

Currently when tsearch() fails, we jump to out, but err is still 0.
Because err is 0, the cleanup block below won't free ids and will
instead return the partially or fully zero-initialized array:

> +			goto out;
> +		}
> +	}

[ ... ]

> +	if (err) {
> +		free(ids);
> +		ids = NULL;
> +	}

This can lead to incorrect test behavior. The fix should set err=-1
before the goto out statement.

Reference: https://lore.kernel.org/bpf/20260418033438.03847C19424@smtp.kernel.org/

[ ... ]


---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/24598000047

^ permalink raw reply

* Re: [PATCHv5 bpf-next 06/28] bpf: Move trampoline image setup into bpf_trampoline_ops callbacks
From: bot+bpf-ci @ 2026-04-18  6:10 UTC (permalink / raw)
  To: jolsa, ast, daniel, andrii
  Cc: bpf, linux-trace-kernel, kafai, eddyz87, songliubraving, yhs,
	menglong8.dong, rostedt, ast, andrii, daniel, martin.lau, eddyz87,
	yonghong.song, clm, ihor.solodrai
In-Reply-To: <20260417192502.194548-7-jolsa@kernel.org>

[-- Attachment #1: Type: text/plain, Size: 1276 bytes --]

commit 405d6ece5a34824f3bc154895a3415605c07b216
Author: Jiri Olsa <jolsa@kernel.org>

bpf: Move trampoline image setup into bpf_trampoline_ops callbacks

Moving trampoline image setup into bpf_trampoline_ops callbacks,
so we can have different image handling for multi attachment which
is coming in following changes.

There's slight functional change for the unregister path, where we
currently free the image unconditionally even if the detach fails.
The new code keeps the image in place, possibly preventing the crash.

The commit message mentions a functional change where "we currently free
the image unconditionally even if the detach fails." This appears to fix
a bug introduced in commit e21aa341785c ("bpf: Fix fexit trampoline.").
Should this include a Fixes tag?

The suggested tag would be:

Fixes: e21aa341785c ("bpf: Fix fexit trampoline.")

This was also noted by bot+bpf-ci@kernel.org during the v5 review:
https://lore.kernel.org/bpf/8a315de2e20505875006c4e32af33c0838000a79484f7a2e44b1081187a01379@mail.kernel.org/

---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/24598000047

^ permalink raw reply

* Re: [PATCHv5 bpf-next 02/28] ftrace: Add ftrace_hash_remove function
From: bot+bpf-ci @ 2026-04-18  6:10 UTC (permalink / raw)
  To: jolsa, ast, daniel, andrii
  Cc: bpf, linux-trace-kernel, kafai, eddyz87, songliubraving, yhs,
	menglong8.dong, rostedt, ast, andrii, daniel, martin.lau, eddyz87,
	yonghong.song, clm, ihor.solodrai
In-Reply-To: <20260417192502.194548-3-jolsa@kernel.org>

[-- Attachment #1: Type: text/plain, Size: 2705 bytes --]

> diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
> --- a/include/linux/ftrace.h
> +++ b/include/linux/ftrace.h
> @@ -416,6 +416,7 @@ void free_ftrace_hash(struct ftrace_hash *hash);
>  struct ftrace_func_entry *add_ftrace_hash_entry_direct(struct ftrace_hash *hash,
>  						       unsigned long ip, unsigned long direct);
>  unsigned long ftrace_hash_count(struct ftrace_hash *hash);
> +void ftrace_hash_remove(struct ftrace_hash *hash);
>
>  /* The hash used to know what functions callbacks trace */
>  struct ftrace_ops_hash {
> diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
> --- a/kernel/trace/ftrace.c
> +++ b/kernel/trace/ftrace.c
> @@ -1249,6 +1249,25 @@ remove_hash_entry(struct ftrace_hash *hash,
>  	hash->count--;
>  }
>
> +void ftrace_hash_remove(struct ftrace_hash *hash)
> +{
> +	struct hlist_head *hhd;
> +	struct hlist_node *tn;
> +	struct ftrace_func_entry *entry;
> +	int size = 1 << hash->size_bits;
                          ^^^^^^^^^^^^^^^^

Can this dereference a NULL pointer? The code computes the size using
hash->size_bits before checking if hash is NULL:

    int size = 1 << hash->size_bits;
    ...
    if (!hash || !hash->count)
        return;

If a caller passes NULL, the function will crash on the size computation.

Looking at how this is used in the later commit e77bc5ff90dd
('bpf: Add bpf_trampoline_multi_attach/detach functions'),
init_tracing_multi_data() in kernel/bpf/trampoline.c allocates three hashes:

    data->reg    = alloc_ftrace_hash(FTRACE_HASH_DEFAULT_BITS);
    data->unreg  = alloc_ftrace_hash(FTRACE_HASH_DEFAULT_BITS);
    data->modify = alloc_ftrace_hash(FTRACE_HASH_DEFAULT_BITS);
    if (!data->reg || !data->unreg || !data->modify) {
        clear_tracing_multi_data(data);
        return -ENOMEM;
    }

On allocation failure, clear_tracing_multi_data() calls
remove_tracing_multi_data(), which calls ftrace_hash_remove() on data->reg,
data->unreg, and data->modify. Any of these might be NULL if the
corresponding alloc_ftrace_hash() failed.

Would it be safer to move the size calculation below the NULL check?

> +	int i;
> +
> +	if (!hash || !hash->count)
> +		return;
> +
> +	for (i = 0; i < size; i++) {
> +		hhd = &hash->buckets[i];
> +		hlist_for_each_entry_safe(entry, tn, hhd, hlist)
> +			remove_hash_entry(hash, entry);
> +	}
> +	FTRACE_WARN_ON(hash->count);
> +}
> +
>  static void ftrace_hash_clear(struct ftrace_hash *hash)
>  {
>  	struct hlist_head *hhd;


---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/24598000047

^ permalink raw reply

* Re: [PATCHv5 bpf-next 01/28] ftrace: Add ftrace_hash_count function
From: bot+bpf-ci @ 2026-04-18  6:10 UTC (permalink / raw)
  To: jolsa, ast, daniel, andrii
  Cc: bpf, linux-trace-kernel, kafai, eddyz87, songliubraving, yhs,
	menglong8.dong, rostedt, ast, andrii, daniel, martin.lau, eddyz87,
	yonghong.song, clm, ihor.solodrai
In-Reply-To: <20260417192502.194548-2-jolsa@kernel.org>

[-- Attachment #1: Type: text/plain, Size: 2302 bytes --]

> diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
> index c242fe49af4c..401f8dfd05d3 100644
> --- a/include/linux/ftrace.h
> +++ b/include/linux/ftrace.h
> @@ -415,6 +415,7 @@ struct ftrace_hash *alloc_ftrace_hash(int size_bits);
>  void free_ftrace_hash(struct ftrace_hash *hash);
>  struct ftrace_func_entry *add_ftrace_hash_entry_direct(struct ftrace_hash *hash,
>  						       unsigned long ip, unsigned long direct);
> +unsigned long ftrace_hash_count(struct ftrace_hash *hash);
>
>  /* The hash used to know what functions callbacks trace */
>  struct ftrace_ops_hash {

[ ... ]

> diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
> index 413310912609..68a071e80f32 100644
> --- a/kernel/trace/ftrace.c
> +++ b/kernel/trace/ftrace.c
> @@ -6288,11 +6288,16 @@ int modify_ftrace_direct(struct ftrace_ops *ops, unsigned long addr)
>  }
>  EXPORT_SYMBOL_GPL(modify_ftrace_direct);
>
> -static unsigned long hash_count(struct ftrace_hash *hash)
> +static inline unsigned long hash_count(struct ftrace_hash *hash)
>  {
>  	return hash ? hash->count : 0;
>  }
>
> +unsigned long ftrace_hash_count(struct ftrace_hash *hash)
> +{
> +	return hash_count(hash);
> +}
                     ^^^^

Will this cause a linker error on some kernel configurations?

The function ftrace_hash_count() is declared in include/linux/ftrace.h under
CONFIG_DYNAMIC_FTRACE, but this definition appears in kernel/trace/ftrace.c
inside a section guarded by CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS (which
extends from around modify_ftrace_direct() to several hundred lines below).

On architectures where CONFIG_DYNAMIC_FTRACE=y but
CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS=n, code that calls
ftrace_hash_count() would compile successfully but fail to link since no
definition would be present in the built kernel.

This was raised by sashiko-bot@kernel.org on the v5 review:
https://lore.kernel.org/all/20260418005821.4F156C19425@smtp.kernel.org/

> +
>  /**
>   * hash_add - adds two struct ftrace_hash and returns the result
>   * @a: struct ftrace_hash object


---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/24598000047

^ permalink raw reply

* Re: [PATCHv5 bpf-next 08/28] bpf: Add struct bpf_tramp_node object
From: bot+bpf-ci @ 2026-04-18  6:10 UTC (permalink / raw)
  To: jolsa, ast, daniel, andrii
  Cc: hengqi.chen, bpf, linux-trace-kernel, kafai, eddyz87,
	songliubraving, yhs, menglong8.dong, rostedt, ast, andrii, daniel,
	martin.lau, eddyz87, yonghong.song, clm, ihor.solodrai
In-Reply-To: <20260417192502.194548-9-jolsa@kernel.org>

[-- Attachment #1: Type: text/plain, Size: 3917 bytes --]

> diff --git a/arch/powerpc/net/bpf_jit_comp.c b/arch/powerpc/net/bpf_jit_comp.c
> index 50103b3794fb..b97d9f68fc40 100644
> --- a/arch/powerpc/net/bpf_jit_comp.c
> +++ b/arch/powerpc/net/bpf_jit_comp.c

[ ... ]

> @@ -827,14 +827,14 @@ static void bpf_trampoline_restore_args_stack(u32 *image, struct codegen_context
>  static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *rw_image,
>  					 void *rw_image_end, void *ro_image,
>  					 const struct btf_func_model *m, u32 flags,
> -					 struct bpf_tramp_links *tlinks,
> +					 struct bpf_tramp_nodes *tnodes,
>  					 void *func_addr)
>  {
>  	int regs_off, func_meta_off, ip_off, run_ctx_off, retval_off;
>  	int nvr_off, alt_lr_off, r4_off = 0;
> -	struct bpf_tramp_links *fmod_ret = &tlinks[BPF_TRAMP_MODIFY_RETURN];
> -	struct bpf_tramp_links *fentry = &tlinks[BPF_TRAMP_FENTRY];
> -	struct bpf_tramp_links *fexit = &tlinks[BPF_TRAMP_FEXIT];
> +	struct bpf_tramp_nodes *fmod_ret = &tnodes[BPF_TRAMP_MODIFY_RETURN];
> +	struct bpf_tramp_nodes *fentry = &tnodes[BPF_TRAMP_FENTRY];
> +	struct bpf_tramp_nodes *fexit = &tnodes[BPF_TRAMP_FEXIT];
>  	int i, ret, nr_regs, retaddr_off, bpf_frame_size = 0;
>  	struct codegen_context codegen_ctx, *ctx;
>  	int cookie_off, cookie_cnt, cookie_ctx_off;
> -	int fsession_cnt = bpf_fsession_cnt(tlinks);
> +	int fsession_cnt = bpf_fsession_cnt(tnodes);
                                            ^^^^^

This line wasn't actually updated in the diff. The parameter was renamed from
tlinks to tnodes, but this call site still references the old name. Does this
compile?

The same issue appears later in the function:

>  	cookie_off = bpf_frame_size;
> -	cookie_cnt = bpf_fsession_cookie_cnt(tlinks);
> +	cookie_cnt = bpf_fsession_cookie_cnt(tnodes);
                                              ^^^^^

This line also wasn't updated and still references tlinks.

> @@ -1050,21 +1050,21 @@ static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *rw_im
>
>  	cookie_ctx_off = (regs_off - cookie_off) / 8;
>
> -	for (i = 0; i < fentry->nr_links; i++) {
> -		if (bpf_prog_calls_session_cookie(fentry->links[i])) {
> +	for (i = 0; i < fentry->nr_nodes; i++)
> +		if (bpf_prog_calls_session_cookie(fentry->nodes[i])) {
>  			u64 meta = func_meta | (cookie_ctx_off << BPF_TRAMP_COOKIE_INDEX_SHIFT);
>
>  			store_func_meta(image, ctx, meta, func_meta_off);
>  			cookie_ctx_off--;
>  		}
>
> -		if (invoke_bpf_prog(image, ro_image, ctx, fentry->links[i], regs_off, retval_off,
> +		if (invoke_bpf_prog(image, ro_image, ctx, fentry->nodes[i], regs_off, retval_off,
>  				    run_ctx_off, flags & BPF_TRAMP_F_RET_FENTRY_RET))
>  			return -EINVAL;
>  	}

The opening brace after the for-loop was removed but the closing brace at the
end remains. This creates two problems:

First, the for-loop now only controls the inner if statement checking for
session cookies. The invoke_bpf_prog call is no longer inside the loop and
will only execute once (with whatever value i holds after the loop exits),
instead of being called for each fentry node.

Second, the unmatched closing brace at the end creates a syntax error.

The code structure should be:

    for (i = 0; i < fentry->nr_nodes; i++) {
        if (bpf_prog_calls_session_cookie(fentry->nodes[i])) {
            u64 meta = func_meta | (cookie_ctx_off << BPF_TRAMP_COOKIE_INDEX_SHIFT);

            store_func_meta(image, ctx, meta, func_meta_off);
            cookie_ctx_off--;
        }

        if (invoke_bpf_prog(image, ro_image, ctx, fentry->nodes[i], regs_off, retval_off,
                    run_ctx_off, flags & BPF_TRAMP_F_RET_FENTRY_RET))
            return -EINVAL;
    }

[ ... ]

---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/24598000047

^ permalink raw reply

* Re: [PATCHv5 bpf-next 12/28] bpf: Add bpf_trampoline_multi_attach/detach functions
From: bot+bpf-ci @ 2026-04-18  6:10 UTC (permalink / raw)
  To: jolsa, ast, daniel, andrii
  Cc: bpf, linux-trace-kernel, kafai, eddyz87, songliubraving, yhs,
	menglong8.dong, rostedt, ast, andrii, daniel, martin.lau, eddyz87,
	yonghong.song, clm, ihor.solodrai
In-Reply-To: <20260417192502.194548-13-jolsa@kernel.org>

[-- Attachment #1: Type: text/plain, Size: 14299 bytes --]

> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h

[ ... ]

> diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c
> --- a/kernel/bpf/trampoline.c
> +++ b/kernel/bpf/trampoline.c
> @@ -1447,6 +1447,276 @@ int __weak arch_bpf_trampoline_size(const struct btf_func_model *m, u32 flags,
>  	return -ENOTSUPP;
>  }
>
> +#if defined(CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS) && \
> +    defined(CONFIG_HAVE_SINGLE_FTRACE_DIRECT_OPS) && \
> +    defined(CONFIG_BPF_SYSCALL)
> +
> +static void trampoline_lock_all(void)
> +{
> +	int i;
> +
> +	for (i = 0; i < TRAMPOLINE_LOCKS_TABLE_SIZE; i++)
> +		mutex_lock(&trampoline_locks[i].mutex);
> +}
> +
> +static void trampoline_unlock_all(void)
> +{
> +	int i;
> +
> +	for (i = 0; i < TRAMPOLINE_LOCKS_TABLE_SIZE; i++)
> +		mutex_unlock(&trampoline_locks[i].mutex);
> +}
> +
> +static void remove_tracing_multi_data(struct bpf_tracing_multi_data *data)
> +{
> +	ftrace_hash_remove(data->reg);
> +	ftrace_hash_remove(data->unreg);
> +	ftrace_hash_remove(data->modify);
> +}
> +
> +static void clear_tracing_multi_data(struct bpf_tracing_multi_data *data)
> +{
> +	remove_tracing_multi_data(data);
> +
> +	free_ftrace_hash(data->reg);
> +	free_ftrace_hash(data->unreg);
> +	free_ftrace_hash(data->modify);
> +}
> +
> +static int init_tracing_multi_data(struct bpf_tracing_multi_data *data)
> +{
> +	data->reg    = alloc_ftrace_hash(FTRACE_HASH_DEFAULT_BITS);
> +	data->unreg  = alloc_ftrace_hash(FTRACE_HASH_DEFAULT_BITS);
> +	data->modify = alloc_ftrace_hash(FTRACE_HASH_DEFAULT_BITS);
> +
> +	if (!data->reg || !data->unreg || !data->modify) {
> +		clear_tracing_multi_data(data);
> +		return -ENOMEM;
> +	}
> +	return 0;
> +}
> +
> +static void ftrace_hash_add(struct ftrace_hash *hash, struct ftrace_func_entry *entry,
> +			    unsigned long ip, unsigned long direct)
> +{
> +	entry->ip = ip;
> +	entry->direct = direct;
> +	add_ftrace_hash_entry(hash, entry);
> +}
> +
> +static int register_fentry_multi(struct bpf_trampoline *tr, struct bpf_tramp_image *im, void *ptr)
> +{
> +	unsigned long addr = (unsigned long) im->image;
> +	unsigned long ip = ftrace_location(tr->ip);
> +	struct bpf_tracing_multi_data *data = ptr;
> +
> +	if (bpf_trampoline_use_jmp(tr->flags))
> +		addr = ftrace_jmp_set(addr);
> +
> +	ftrace_hash_add(data->reg, data->entry, ip, addr);
> +	tr->cur_image = im;
> +	return 0;
> +}
> +
> +static int unregister_fentry_multi(struct bpf_trampoline *tr, u32 orig_flags, void *ptr)
> +{
> +	unsigned long addr = (unsigned long) tr->cur_image->image;
> +	unsigned long ip = ftrace_location(tr->ip);
> +	struct bpf_tracing_multi_data *data = ptr;
> +
> +	if (bpf_trampoline_use_jmp(tr->flags))
> +		addr = ftrace_jmp_set(addr);
> +
> +	ftrace_hash_add(data->unreg, data->entry, ip, addr);
> +	tr->cur_image = NULL;
> +	return 0;
> +}
> +
> +static int modify_fentry_multi(struct bpf_trampoline *tr, u32 orig_flags, struct bpf_tramp_image *im,
> +			       bool lock_direct_mutex, void *ptr)
> +{
> +	unsigned long addr = (unsigned long) im->image;
> +	unsigned long ip = ftrace_location(tr->ip);
> +	struct bpf_tracing_multi_data *data = ptr;
> +
> +	if (bpf_trampoline_use_jmp(tr->flags))
> +		addr = ftrace_jmp_set(addr);
> +
> +	ftrace_hash_add(data->modify, data->entry, ip, addr);
> +	tr->cur_image = im;
> +	return 0;
> +}
> +
> +static const struct bpf_trampoline_ops trampoline_multi_ops = {
> +	.register_fentry   = register_fentry_multi,
> +	.unregister_fentry = unregister_fentry_multi,
> +	.modify_fentry     = modify_fentry_multi,
> +};
> +
> +static void bpf_trampoline_multi_attach_init(struct bpf_trampoline *tr)
> +{
> +	tr->multi_attach.old_image = tr->cur_image;
> +	tr->multi_attach.old_flags = tr->flags;
> +}
> +
> +static void bpf_trampoline_multi_attach_free(struct bpf_trampoline *tr)
> +{
> +	if (tr->multi_attach.old_image)
> +		bpf_tramp_image_put(tr->multi_attach.old_image);
> +
> +	tr->multi_attach.old_image = NULL;
> +	tr->multi_attach.old_flags = 0;
> +}
> +
> +static void bpf_trampoline_multi_attach_rollback(struct bpf_trampoline *tr)
> +{
> +	if (tr->cur_image)
> +		bpf_tramp_image_put(tr->cur_image);
> +	tr->cur_image = tr->multi_attach.old_image;
> +	tr->flags = tr->multi_attach.old_flags;
> +
> +	tr->multi_attach.old_image = NULL;
> +	tr->multi_attach.old_flags = 0;
> +}
> +
> +#define for_each_mnode_cnt(mnode, link, cnt) \
> +	for (i = 0, mnode = &link->nodes[i]; i < cnt; i++, mnode = &link->nodes[i])
> +
> +#define for_each_mnode(mnode, link) \
> +	for_each_mnode_cnt(mnode, link, link->nodes_cnt)
> +
> +int bpf_trampoline_multi_attach(struct bpf_prog *prog, u32 *ids,
> +				struct bpf_tracing_multi_link *link)
> +{
> +	struct bpf_tracing_multi_data *data = &link->data;
> +	struct bpf_attach_target_info tgt_info = {};
> +	struct btf *btf = prog->aux->attach_btf;
> +	struct bpf_tracing_multi_node *mnode;
> +	struct bpf_trampoline *tr;
> +	int i, err, rollback_cnt;
> +	u64 key;
> +
> +	for_each_mnode(mnode, link) {
> +		rollback_cnt = i;
> +
> +		err = bpf_check_attach_btf_id_multi(btf, prog, ids[i], &tgt_info);
> +		if (err)
> +			goto rollback_put;
> +
> +		key = bpf_trampoline_compute_key(NULL, btf, ids[i]);
> +
> +		tr = bpf_trampoline_get(key, &tgt_info);
> +		if (!tr) {
> +			err = -ENOMEM;
> +			goto rollback_put;
> +		}
> +
> +		mnode->trampoline = tr;
> +		mnode->node.link = &link->link;
> +
> +		cond_resched();
> +	}
> +
> +	err = init_tracing_multi_data(data);
> +	if (err) {
> +		rollback_cnt = link->nodes_cnt;
> +		goto rollback_put;
> +	}
> +
> +	trampoline_lock_all();
> +
> +	for_each_mnode(mnode, link) {
> +		bpf_trampoline_multi_attach_init(mnode->trampoline);
> +
> +		data->entry = &mnode->entry;
> +		err = __bpf_trampoline_link_prog(&mnode->node, mnode->trampoline, NULL,
> +						 &trampoline_multi_ops, data);
> +		if (err) {
> +			rollback_cnt = i;
> +			goto rollback_unlink;
> +		}
> +	}

When user-provided ids[] contains duplicate BTF IDs (or distinct IDs
that resolve to the same trampoline key), multiple nodes point to the
same struct bpf_trampoline. The link loop above then calls
bpf_trampoline_multi_attach_init() more than once on that trampoline,
overwriting the saved old_image with the newly assigned cur_image from
the previous iteration.

Scenario with ids[0] == ids[1] and trampoline X starting with OLD_X:

  i=0: attach_init(X) saves old_image=OLD_X
       __bpf_trampoline_link_prog() -> modify_fentry_multi() sets
       X->cur_image=NEW_X (OLD_X refcount not dropped, intent is for
       multi_attach_free() to release it later).

  i=1: attach_init(X) re-runs on the same trampoline and overwrites
       old_image=NEW_X (the only saved reference to OLD_X is lost).
       __bpf_trampoline_link_prog() returns -EBUSY (duplicate prog).
       rollback_cnt=i=1; goto rollback_unlink.

  rollback_unlink calls bpf_trampoline_multi_attach_rollback(X) once:

    static void bpf_trampoline_multi_attach_rollback(struct bpf_trampoline *tr)
    {
        if (tr->cur_image)
            bpf_tramp_image_put(tr->cur_image);    /* puts NEW_X */
        tr->cur_image = tr->multi_attach.old_image;  /* = NEW_X (stale!) */
        ...
    }

Result: OLD_X leaks (its refcount is never decremented). cur_image
points at NEW_X which was just released via bpf_tramp_image_put().

The rollback_put loop calls bpf_trampoline_put() which frees the
trampoline (prog was removed, progs_hlist is empty), so the dangling
NEW_X pointer is cleaned up. However OLD_X remains allocated via
module_alloc forever.

Reachability: commit dbf2afe2f603 ("bpf: Add support for tracing multi
link") copies ids[] from userspace without deduplication. The series'
selftests (commit 411fb40d4b2a "selftests/bpf: Add tracing multi attach
fails test", 'fail#7 (kernel) attach with duplicate id') exercise this
path explicitly, expecting -EBUSY.

The trigger requires the shared trampoline to already have cur_image !=
NULL when the first iteration runs (i.e., another program attached to
the same function), so modify_fentry_multi() is used and there is a
real OLD_X to leak. The first-time-attach case (register_fentry_multi,
OLD_X=NULL) only leaves cur_image dangling briefly until the trampoline
is freed.

Can the code save old_image/old_flags once per trampoline instead of
once per node, or store them per-mnode to avoid overwriting the saved
state?

> +
> +	rollback_cnt = link->nodes_cnt;
> +	if (ftrace_hash_count(data->reg)) {
> +		err = update_ftrace_direct_add(&direct_ops, data->reg);
> +		if (err)
> +			goto rollback_unlink;
> +	}
> +
> +	if (ftrace_hash_count(data->modify)) {
> +		err = update_ftrace_direct_mod(&direct_ops, data->modify, true);
> +		if (err) {
> +			WARN_ON_ONCE(update_ftrace_direct_del(&direct_ops, data->reg));
> +			                                                       ^^^^^^^^^^
> +			goto rollback_unlink;
> +		}
> +	}

The recovery path unconditionally calls update_ftrace_direct_del with
data->reg, but update_ftrace_direct_add is only invoked when data->reg
is non-empty:

    if (ftrace_hash_count(data->reg)) {
        err = update_ftrace_direct_add(&direct_ops, data->reg);
        ...
    }

When update_ftrace_direct_mod fails and all trampolines already had
cur_image != NULL (so nothing was added to data->reg),
update_ftrace_direct_del is called with an empty hash.
update_ftrace_direct_del at kernel/trace/ftrace.c:6511 checks:

    if (!hash_count(hash))
        return -EINVAL;

so it returns -EINVAL and the WARN_ON_ONCE fires spuriously. Should
this guard match the add path?

    if (ftrace_hash_count(data->reg))
        WARN_ON_ONCE(update_ftrace_direct_del(&direct_ops, data->reg));

> +
> +	for_each_mnode(mnode, link)
> +		bpf_trampoline_multi_attach_free(mnode->trampoline);
> +
> +	trampoline_unlock_all();
> +
> +	remove_tracing_multi_data(data);
> +	return 0;
> +
> +rollback_unlink:
> +	for_each_mnode_cnt(mnode, link, rollback_cnt) {
> +		bpf_trampoline_remove_prog(mnode->trampoline, &mnode->node);
> +		bpf_trampoline_multi_attach_rollback(mnode->trampoline);
> +	}
> +
> +	trampoline_unlock_all();
> +
> +	clear_tracing_multi_data(data);
> +	rollback_cnt = link->nodes_cnt;
> +
> +rollback_put:
> +	for_each_mnode_cnt(mnode, link, rollback_cnt)
> +		bpf_trampoline_put(mnode->trampoline);
> +
> +	return err;
> +}

[ ... ]

> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -19658,6 +19658,58 @@ struct bpf_attach_target_info tgt_info = {};
>  	return 0;
>  }
>
> +int bpf_check_attach_btf_id_multi(struct btf *btf, struct bpf_prog *prog, u32 btf_id,
> +				  struct bpf_attach_target_info *tgt_info)
> +{
> +	const struct btf_type *t;
> +	unsigned long addr;
> +	const char *tname;
> +	int err;
> +
> +	if (!btf_id || !btf)
> +		return -EINVAL;
> +
> +	/* Check noreturn attachment. */
> +	if (prog->expected_attach_type == BPF_TRACE_FEXIT_MULTI ||
> +	     btf_id_set_contains(&noreturn_deny, btf_id))
> +		return -EINVAL;

Operator precedence issue in the noreturn check. The two subexpressions
are joined by ||, so the function returns -EINVAL whenever EITHER:

  (a) the program's expected_attach_type is BPF_TRACE_FEXIT_MULTI
      (regardless of whether btf_id targets a noreturn function), OR

  (b) the btf_id is in the noreturn_deny set (regardless of program
      attach type, rejecting FENTRY_MULTI attachments to noreturn
      functions as well).

The intent is to reject only FEXIT-like multi attachments to noreturn
functions. The non-multi path in check_attach_btf_id() correctly uses
&&:

    } else if ((prog->expected_attach_type == BPF_TRACE_FEXIT ||
               prog->expected_attach_type == BPF_TRACE_FSESSION ||
               prog->expected_attach_type == BPF_MODIFY_RETURN) &&
               btf_id_set_contains(&noreturn_deny, btf_id)) {
        ...
        return -EINVAL;
    }

Consequences once bpf_trampoline_multi_attach() is wired up from
user-space (commit dbf2afe2f603 "bpf: Add support for tracing multi
link"):

  1. Every BPF_TRACE_FEXIT_MULTI program attach request is rejected
     with -EINVAL, even when the target function returns normally. This
     disables BPF_TRACE_FEXIT_MULTI entirely.

  2. Any tracing_multi program (including BPF_TRACE_FENTRY_MULTI) is
     rejected when its btf_id is in noreturn_deny (do_exit,
     __x64_sys_exit, make_task_dead, etc.). Rejecting FENTRY_MULTI on
     noreturn functions is an unintended behavioral restriction.

Forward search in the git range shows commit bfcad202ce99 ("bpf: Add
support for tracing_multi link session") rewrites this hunk to:

    if ((prog->expected_attach_type == BPF_TRACE_FEXIT_MULTI ||
         prog->expected_attach_type == BPF_TRACE_FSESSION_MULTI) &&
         btf_id_set_contains(&noreturn_deny, btf_id))
        return -EINVAL;

The parenthesization + && in the later commit confirms this is a
precedence bug. Should this use && instead?

> +
> +	/* Check and get function target data. */
> +	t = btf_type_by_id(btf, btf_id);
> +	if (!t)
> +		return -EINVAL;
> +	tname = btf_name_by_offset(btf, t->name_off);
> +	if (!tname)
> +		return -EINVAL;
> +	if (!btf_type_is_func(t))
> +		return -EINVAL;
> +	t = btf_type_by_id(btf, t->type);
> +	if (!btf_type_is_func_proto(t))
> +		return -EINVAL;
> +	err = btf_distill_func_proto(NULL, btf, t, tname, &tgt_info->fmodel);
> +	if (err < 0)
> +		return err;
> +	if (btf_is_module(btf)) {
> +		/* The bpf program already holds reference to module. */
> +		if (WARN_ON_ONCE(!prog->aux->mod))
> +			return -EINVAL;
> +		addr = find_kallsyms_symbol_value(prog->aux->mod, tname);
> +	} else {
> +		addr = kallsyms_lookup_name(tname);
> +	}
> +	if (!addr || !ftrace_location(addr))
> +		return -ENOENT;
> +
> +	/* Check sleepable program attachment. */
> +	if (prog->sleepable) {
> +		err = btf_id_allow_sleepable(btf_id, addr, prog, btf);
> +		if (err)
> +			return err;
> +	}
> +	tgt_info->tgt_addr = addr;
> +	return 0;
> +}
> +
>  struct btf *bpf_get_btf_vmlinux(void)
>  {
>  	if (!btf_vmlinux && IS_ENABLED(CONFIG_DEBUG_INFO_BTF)) {


---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/24598000047

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox