* Re: [PATCHv5 bpf-next 12/28] bpf: Add bpf_trampoline_multi_attach/detach functions
From: Jiri Olsa @ 2026-04-21 8:56 UTC (permalink / raw)
To: bot+bpf-ci
Cc: ast, daniel, andrii, bpf, linux-trace-kernel, kafai, eddyz87,
songliubraving, yhs, menglong8.dong, rostedt, martin.lau,
yonghong.song, clm, ihor.solodrai
In-Reply-To: <11459a06ee7d070bdbdc3699d59adcdca12d5178eaa6ebd15286e0f51cc27141@mail.kernel.org>
On Fri, Apr 17, 2026 at 08:22:36PM +0000, bot+bpf-ci@kernel.org wrote:
SNIP
> > + if (ftrace_hash_count(data->modify)) {
> > + err = update_ftrace_direct_mod(&direct_ops, data->modify, true);
> > + if (err) {
> > + WARN_ON_ONCE(update_ftrace_direct_del(&direct_ops, data->reg));
> > + goto rollback_unlink;
> > + }
> > + }
>
> Does update_ftrace_direct_del need a guard here?
>
> When update_ftrace_direct_mod fails, the recovery code unconditionally
> calls update_ftrace_direct_del(&direct_ops, data->reg). However,
> update_ftrace_direct_add is itself only invoked when data->reg is
> non-empty:
>
> if (ftrace_hash_count(data->reg)) {
> err = update_ftrace_direct_add(&direct_ops, data->reg);
> ...
> }
>
> If all trampolines being multi-attached already had cur_image != NULL,
> nothing is ever added to data->reg, and the add call is skipped. Yet
> the recovery path still calls update_ftrace_direct_del with the empty
> data->reg.
>
> update_ftrace_direct_del() at kernel/trace/ftrace.c:6511 does:
>
> if (!hash_count(hash))
> return -EINVAL;
>
> so it returns -EINVAL, and WARN_ON_ONCE fires. No further damage is
> done since the rollback_unlink path still unwinds correctly, but this
> is a spurious warning that becomes visible whenever
> update_ftrace_direct_mod fails and data->reg happens to be empty.
>
> The undo should match the guard used for the add:
>
> if (err) {
> if (ftrace_hash_count(data->reg))
> WARN_ON_ONCE(update_ftrace_direct_del(&direct_ops, data->reg));
> goto rollback_unlink;
> }
right, that's the fix
>
> > +
> > + for_each_mnode(mnode, link)
> > + bpf_trampoline_multi_attach_free(mnode->trampoline);
> > +
> > + trampoline_unlock_all();
> > +
> > + remove_tracing_multi_data(data);
> > + return 0;
> > +
> > +rollback_unlink:
> > + for_each_mnode_cnt(mnode, link, rollback_cnt) {
> > + bpf_trampoline_remove_prog(mnode->trampoline, &mnode->node);
> > + bpf_trampoline_multi_attach_rollback(mnode->trampoline);
> > + }
> > +
> > + trampoline_unlock_all();
> > +
> > + clear_tracing_multi_data(data);
> > + rollback_cnt = link->nodes_cnt;
> > +
> > +rollback_put:
> > + for_each_mnode_cnt(mnode, link, rollback_cnt)
> > + bpf_trampoline_put(mnode->trampoline);
> > +
> > + return err;
> > +}
SNIP
> > + /* Check noreturn attachment. */
> > + if (prog->expected_attach_type == BPF_TRACE_FEXIT_MULTI ||
> > + btf_id_set_contains(&noreturn_deny, btf_id))
> > + return -EINVAL;
>
> Should this be && instead of ||?
>
> This condition rejects all BPF_TRACE_FEXIT_MULTI programs regardless
> of whether the target is on the noreturn_deny list. The C operator
> precedence makes this `A || B`, so if the program's expected_attach_type
> is BPF_TRACE_FEXIT_MULTI, the entire condition is TRUE and the
> function returns -EINVAL.
>
> This breaks FEXIT_MULTI attachment support. The caller
> bpf_trampoline_multi_attach is the multi-attach path that supports both
> BPF_TRACE_FENTRY_MULTI and BPF_TRACE_FEXIT_MULTI (see is_tracing_multi()
> in include/linux/bpf.h).
>
> The equivalent check in the single-target path, check_attach_btf_id,
> correctly uses &&:
>
> } else if ((prog->expected_attach_type == BPF_TRACE_FEXIT ||
> prog->expected_attach_type == BPF_TRACE_FSESSION ||
> prog->expected_attach_type == BPF_MODIFY_RETURN) &&
> btf_id_set_contains(&noreturn_deny, btf_id)) {
> ...
> }
>
> The intended semantics are to reject FEXIT_MULTI attachments only when
> the target is on noreturn_deny. The correct operator is &&:
>
> if (prog->expected_attach_type == BPF_TRACE_FEXIT_MULTI &&
> btf_id_set_contains(&noreturn_deny, btf_id))
> return -EINVAL;
>
> Additional evidence:
> 1. The comment says "Check noreturn attachment" but the current code
> rejects all FEXIT_MULTI regardless of noreturn status.
> 2. The indentation of the second line matches the alignment pattern
> used when lines are joined by && rather than ||.
> 3. bpf_check_attach_target handles case BPF_TRACE_FEXIT_MULTI as a
> valid attach type in its switch.
>
> Note: In this commit, bpf_trampoline_multi_attach has no callers yet;
> it is infrastructure wired up by a later commit. Once the link creation
> path invokes bpf_trampoline_multi_attach for a FEXIT_MULTI program,
> every such attach will fail with -EINVAL.
guilty as charged, should be &&, will fix
jirka
^ permalink raw reply
* Re: [PATCHv5 bpf-next 10/28] bpf: Add multi tracing attach types
From: Jiri Olsa @ 2026-04-21 8:56 UTC (permalink / raw)
To: bot+bpf-ci
Cc: ast, daniel, andrii, bpf, linux-trace-kernel, kafai, eddyz87,
songliubraving, yhs, menglong8.dong, rostedt, martin.lau,
yonghong.song, clm, ihor.solodrai
In-Reply-To: <23dc9c8f0305da291027f64f583beda22a65d77849bc00c08eb88f32e4651d95@mail.kernel.org>
On Sat, Apr 18, 2026 at 06:49:09AM +0000, bot+bpf-ci@kernel.org wrote:
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -1154,6 +1154,8 @@ enum bpf_attach_type {
> > BPF_TRACE_KPROBE_SESSION,
> > BPF_TRACE_UPROBE_SESSION,
> > BPF_TRACE_FSESSION,
> > + BPF_TRACE_FENTRY_MULTI,
> > + BPF_TRACE_FEXIT_MULTI,
> > __MAX_BPF_ATTACH_TYPE
> > };
>
> In RFC v2 (Feb 4, 2026), Leon Hwang asked whether BPF_TRACE_FSESSION_MULTI
> should be added alongside BPF_TRACE_FENTRY_MULTI and BPF_TRACE_FEXIT_MULTI
> for API consistency, since BPF_TRACE_FSESSION exists and is handled
> similarly to FENTRY/FEXIT. You acknowledged this with 'good catch, will add
> it' but BPF_TRACE_FSESSION_MULTI was never added in any subsequent version
> (v1, v2, v3, v4, or v5).
>
> Is this attach type intentionally omitted, or should it be included to
> complete the multi tracing API?
ansrewed in the other reply
>
> Reference: https://lore.kernel.org/bpf/3015991d-2e9b-4bf4-9c17-26e00b5d5926@linux.dev/
>
> [ ... ]
>
> > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > --- a/kernel/bpf/verifier.c
> > +++ b/kernel/bpf/verifier.c
>
> [ ... ]
>
> > @@ -19368,7 +19379,17 @@
> > if (ret < 0)
> > return ret;
> >
> > - if (tgt_prog) {
> > + /* *.multi programs don't need an address during program
> > + * verification, we just take the module ref if needed.
> > + */
> > + if (is_tracing_multi_id(prog, btf_id)) {
> > + if (btf_is_module(btf)) {
> > + mod = btf_try_get_module(btf);
> > + if (!mod)
> > + return -ENOENT;
> > + }
> > + addr = 0;
> > + } else if (tgt_prog) {
>
> This isn't a bug, but the new multi-line comment here uses the old kernel
> comment style with text on the same line as the opening '/*'. The BPF
> subsystem guide mandates the modern multi-line style with '/*' on its own
> line for new code under kernel/bpf/, even if surrounding code still uses
> the old style.
>
> Should the comment be reformatted as:
>
> /*
> * *.multi programs don't need an address during program
> * verification, we just take the module ref if needed.
> */
sure
jirka
>
> > @@ -19396,6 +19417,12 @@
> > ret = -EINVAL;
> > switch (prog->type) {
> > case BPF_PROG_TYPE_TRACING:
> > + /* *.multi sleepable programs will pass initial sleepable check,
> > + * the actual attached btf ids are checked later during the link
> > + * attachment.
> > + */
> > + if (is_tracing_multi_id(prog, btf_id))
> > + ret = 0;
>
> The same comment style issue appears here. Should this also use the modern
> style with '/*' on its own line?
>
> Note that another comment you added in check_attach_btf_id uses the correct
> modern style, making this inconsistency clearly fixable.
>
>
> ---
> AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
> See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
>
> CI run summary: https://github.com/kernel-patches/bpf/actions/runs/24598000047
^ permalink raw reply
* Re: [PATCHv5 bpf-next 10/28] bpf: Add multi tracing attach types
From: Jiri Olsa @ 2026-04-21 8:55 UTC (permalink / raw)
To: bot+bpf-ci
Cc: ast, daniel, andrii, bpf, linux-trace-kernel, kafai, eddyz87,
songliubraving, yhs, menglong8.dong, rostedt, martin.lau,
yonghong.song, clm, ihor.solodrai
In-Reply-To: <1ef3a598cb11de0fac8dfe1ac94287d3588ca36257f27b0a2407e85f5037667a@mail.kernel.org>
On Fri, Apr 17, 2026 at 08:22:39PM +0000, bot+bpf-ci@kernel.org wrote:
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index 552bc5d9afbd..5f814e7101cf 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -1154,6 +1154,8 @@ enum bpf_attach_type {
> > BPF_TRACE_KPROBE_SESSION,
> > BPF_TRACE_UPROBE_SESSION,
> > BPF_TRACE_FSESSION,
> > + BPF_TRACE_FENTRY_MULTI,
> > + BPF_TRACE_FEXIT_MULTI,
> > __MAX_BPF_ATTACH_TYPE
> > };
>
> In the RFC version posted on February 4, 2026, Leon Hwang asked whether
> BPF_TRACE_FSESSION_MULTI should be added alongside BPF_TRACE_FENTRY_MULTI
> and BPF_TRACE_FEXIT_MULTI for consistency, since BPF_TRACE_FSESSION exists
> and is handled similarly to FENTRY/FEXIT.
>
> Reference:
> https://lore.kernel.org/bpf/3015991d-2e9b-4bf4-9c17-26e00b5d5926@linux.dev/
>
> You acknowledged this with "good catch, will add it" in the same thread:
> https://lore.kernel.org/bpf/aYM-jeV50WaVik1b@krava/
>
> However, BPF_TRACE_FSESSION_MULTI was not added in any subsequent version
> through v5. Should BPF_TRACE_FSESSION_MULTI be included here for API
> consistency?
>
I think the comment was to introduce the support for fsession,
not to add it to this specific patch
and BPF_TRACE_FSESSION_MULTI is introduced in:
bpf: Add support for tracing_multi link session
jirka
>
> ---
> AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
> See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
>
> CI run summary: https://github.com/kernel-patches/bpf/actions/runs/24583317711
^ permalink raw reply
* Re: [PATCHv5 bpf-next 08/28] bpf: Add struct bpf_tramp_node object
From: Jiri Olsa @ 2026-04-21 8:55 UTC (permalink / raw)
To: bot+bpf-ci
Cc: ast, daniel, andrii, hengqi.chen, bpf, linux-trace-kernel, kafai,
eddyz87, songliubraving, yhs, menglong8.dong, rostedt, martin.lau,
yonghong.song, clm, ihor.solodrai
In-Reply-To: <fb38d0f3976f2225ae9781d2292bb79f4d787183360f050c08de3b4490790974@mail.kernel.org>
On Sat, Apr 18, 2026 at 06:10:55AM +0000, bot+bpf-ci@kernel.org wrote:
> > diff --git a/arch/powerpc/net/bpf_jit_comp.c b/arch/powerpc/net/bpf_jit_comp.c
> > index 50103b3794fb..b97d9f68fc40 100644
> > --- a/arch/powerpc/net/bpf_jit_comp.c
> > +++ b/arch/powerpc/net/bpf_jit_comp.c
>
> [ ... ]
>
> > @@ -827,14 +827,14 @@ static void bpf_trampoline_restore_args_stack(u32 *image, struct codegen_context
> > static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *rw_image,
> > void *rw_image_end, void *ro_image,
> > const struct btf_func_model *m, u32 flags,
> > - struct bpf_tramp_links *tlinks,
> > + struct bpf_tramp_nodes *tnodes,
> > void *func_addr)
> > {
> > int regs_off, func_meta_off, ip_off, run_ctx_off, retval_off;
> > int nvr_off, alt_lr_off, r4_off = 0;
> > - struct bpf_tramp_links *fmod_ret = &tlinks[BPF_TRAMP_MODIFY_RETURN];
> > - struct bpf_tramp_links *fentry = &tlinks[BPF_TRAMP_FENTRY];
> > - struct bpf_tramp_links *fexit = &tlinks[BPF_TRAMP_FEXIT];
> > + struct bpf_tramp_nodes *fmod_ret = &tnodes[BPF_TRAMP_MODIFY_RETURN];
> > + struct bpf_tramp_nodes *fentry = &tnodes[BPF_TRAMP_FENTRY];
> > + struct bpf_tramp_nodes *fexit = &tnodes[BPF_TRAMP_FEXIT];
> > int i, ret, nr_regs, retaddr_off, bpf_frame_size = 0;
> > struct codegen_context codegen_ctx, *ctx;
> > int cookie_off, cookie_cnt, cookie_ctx_off;
> > - int fsession_cnt = bpf_fsession_cnt(tlinks);
> > + int fsession_cnt = bpf_fsession_cnt(tnodes);
> ^^^^^
>
> This line wasn't actually updated in the diff. The parameter was renamed from
> tlinks to tnodes, but this call site still references the old name. Does this
> compile?
>
> The same issue appears later in the function:
>
> > cookie_off = bpf_frame_size;
> > - cookie_cnt = bpf_fsession_cookie_cnt(tlinks);
> > + cookie_cnt = bpf_fsession_cookie_cnt(tnodes);
ugh, it clashed with latest update for fsession,
and I skip the build test for some reason, will fix
jirka
^ permalink raw reply
* Re: [PATCHv5 bpf-next 06/28] bpf: Move trampoline image setup into bpf_trampoline_ops callbacks
From: Jiri Olsa @ 2026-04-21 8:55 UTC (permalink / raw)
To: bot+bpf-ci
Cc: ast, daniel, andrii, bpf, linux-trace-kernel, kafai, eddyz87,
songliubraving, yhs, menglong8.dong, rostedt, martin.lau,
yonghong.song, clm, ihor.solodrai
In-Reply-To: <8a315de2e20505875006c4e32af33c0838000a79484f7a2e44b1081187a01379@mail.kernel.org>
On Fri, Apr 17, 2026 at 08:10:25PM +0000, bot+bpf-ci@kernel.org wrote:
> > There's slight functional change for the unregister path, where we
> > currently free the image unconditionally even if the detach fails.
> > The new code keeps the image in place, possibly preventing the crash.
>
> This appears to be fixing a bug where the image was freed even when detach
> failed. The bug was introduced in commit e21aa341785c ("bpf: Fix fexit
> trampoline.") which unconditionally freed the trampoline image.
>
> Should this commit include a Fixes tag? The suggested tag would be:
>
> Fixes: e21aa341785c ("bpf: Fix fexit trampoline.")
right, I think I'll send separate fix for current code,
to have this fixed earlier
jirka
>
>
> ---
> AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
> See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
>
> CI run summary: https://github.com/kernel-patches/bpf/actions/runs/24583317711
^ permalink raw reply
* Re: [PATCHv5 bpf-next 04/28] bpf: Use mutex lock pool for bpf trampolines
From: Jiri Olsa @ 2026-04-21 8:54 UTC (permalink / raw)
To: bot+bpf-ci
Cc: ast, daniel, andrii, bpf, linux-trace-kernel, kafai, eddyz87,
songliubraving, yhs, menglong8.dong, rostedt, martin.lau,
yonghong.song, clm, ihor.solodrai
In-Reply-To: <53b3e6cb10e12555d3b1b321eeb4e04a447c707c387568d1e9ef759fb6459ef4@mail.kernel.org>
On Fri, Apr 17, 2026 at 08:10:24PM +0000, bot+bpf-ci@kernel.org wrote:
> > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > --- a/include/linux/bpf.h
> > +++ b/include/linux/bpf.h
>
> [ ... ]
>
> > diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c
> > --- a/kernel/bpf/trampoline.c
> > +++ b/kernel/bpf/trampoline.c
> > @@ -30,6 +30,34 @@ #define TRAMPOLINE_TABLE_SIZE (1 << TRAMPOLINE_HASH_BITS)
> > /* serializes access to trampoline tables */
> > static DEFINE_MUTEX(trampoline_mutex);
> >
> > +/*
> > + * We keep 32 trampoline locks (5 bits) in the pool, because there is
> > + * 48 (MAX_LOCK_DEPTH) locks limit allowed to be simultaneously held
> > + * by task. Each lock has its own lockdep key to keep it simple.
> > + */
>
> The comment explains the 32 count (MAX_LOCK_DEPTH limit), but should it
> also explain why each lock has its own lock_class_key?
>
> Without that explanation, it is not obvious that distinct keys are required
> to avoid lockdep "recursive locking" warnings when trampoline_lock_all()
> acquires all 32 pool mutexes simultaneously.
>
> This was raised by bot+bpf-ci@kernel.org in v3 review:
> https://lore.kernel.org/bpf/31ae46274a3157f2b9840e1a09b2698d1ec0cfd461737ff460c2d3349a9f0054@mail.kernel.org/
>
> The author acknowledged it with "will add", but the current comment only
> says "to keep it simple" without addressing the recursive locking concern.
will add more details
jirka
>
> > +#define TRAMPOLINE_LOCKS_BITS 5
> > +#define TRAMPOLINE_LOCKS_TABLE_SIZE (1 << TRAMPOLINE_LOCKS_BITS)
> > +
> > +static struct {
> > + struct mutex mutex;
> > + struct lock_class_key key;
> > +} trampoline_locks[TRAMPOLINE_LOCKS_TABLE_SIZE];
>
> [ ... ]
>
>
> ---
> AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
> See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
>
> CI run summary: https://github.com/kernel-patches/bpf/actions/runs/24583317711
^ permalink raw reply
* Re: [PATCHv5 bpf-next 02/28] ftrace: Add ftrace_hash_remove function
From: Jiri Olsa @ 2026-04-21 8:54 UTC (permalink / raw)
To: bot+bpf-ci
Cc: ast, daniel, andrii, bpf, linux-trace-kernel, kafai, eddyz87,
songliubraving, yhs, menglong8.dong, rostedt, martin.lau,
yonghong.song, clm, ihor.solodrai
In-Reply-To: <14d32b2f6aacddfce59d73dc5e1c0ea423f84e373ca130f6830c3e26154e70b3@mail.kernel.org>
On Sat, Apr 18, 2026 at 06:10:52AM +0000, bot+bpf-ci@kernel.org wrote:
> > diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
> > --- a/include/linux/ftrace.h
> > +++ b/include/linux/ftrace.h
> > @@ -416,6 +416,7 @@ void free_ftrace_hash(struct ftrace_hash *hash);
> > struct ftrace_func_entry *add_ftrace_hash_entry_direct(struct ftrace_hash *hash,
> > unsigned long ip, unsigned long direct);
> > unsigned long ftrace_hash_count(struct ftrace_hash *hash);
> > +void ftrace_hash_remove(struct ftrace_hash *hash);
> >
> > /* The hash used to know what functions callbacks trace */
> > struct ftrace_ops_hash {
> > diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
> > --- a/kernel/trace/ftrace.c
> > +++ b/kernel/trace/ftrace.c
> > @@ -1249,6 +1249,25 @@ remove_hash_entry(struct ftrace_hash *hash,
> > hash->count--;
> > }
> >
> > +void ftrace_hash_remove(struct ftrace_hash *hash)
> > +{
> > + struct hlist_head *hhd;
> > + struct hlist_node *tn;
> > + struct ftrace_func_entry *entry;
> > + int size = 1 << hash->size_bits;
> ^^^^^^^^^^^^^^^^
>
> Can this dereference a NULL pointer? The code computes the size using
> hash->size_bits before checking if hash is NULL:
>
> int size = 1 << hash->size_bits;
> ...
> if (!hash || !hash->count)
> return;
yep, we should deref after the check, will fix
jirka
^ permalink raw reply
* Re: [PATCHv5 bpf-next 21/28] selftests/bpf: Add tracing multi skel/pattern/ids attach tests
From: Jiri Olsa @ 2026-04-21 8:54 UTC (permalink / raw)
To: bot+bpf-ci
Cc: ast, daniel, andrii, bpf, linux-trace-kernel, kafai, eddyz87,
songliubraving, yhs, menglong8.dong, rostedt, martin.lau,
yonghong.song, clm, ihor.solodrai
In-Reply-To: <1ab81cf709ce7d536a898695ffad30c15c8a6a7bf48dcb7271239952dd270315@mail.kernel.org>
On Fri, Apr 17, 2026 at 08:10:22PM +0000, bot+bpf-ci@kernel.org wrote:
> > diff --git a/tools/testing/selftests/bpf/prog_tests/tracing_multi.c b/tools/testing/selftests/bpf/prog_tests/tracing_multi.c
> > --- /dev/null
> > +++ b/tools/testing/selftests/bpf/prog_tests/tracing_multi.c
>
> [ ... ]
>
> > @@ -0,0 +1,256 @@
> > +static __u32 *get_ids(const char * const funcs[], int funcs_cnt, const char *mod)
> > +{
> > + struct btf *btf, *vmlinux_btf = NULL;
> > + __u32 nr, type_id, cnt = 0;
> > + void *root = NULL;
> > + __u32 *ids = NULL;
> > + int i, err = 0;
> > +
> > + btf = btf__load_vmlinux_btf();
> > + if (!ASSERT_OK_PTR(btf, "btf__load_vmlinux_btf"))
> > + return NULL;
> > +
> > + if (mod) {
> > + vmlinux_btf = btf;
> > + btf = btf__load_module_btf(mod, vmlinux_btf);
> > + if (!ASSERT_OK_PTR(btf, "btf__load_module_btf")) {
> > + btf__free(vmlinux_btf);
> > + goto out;
> > + }
> > + }
>
> [ ... ]
>
> > +out:
> > + tdestroy(root, tdestroy_free_nop);
> > + btf__free(vmlinux_btf);
> > + btf__free(btf);
> > + return ids;
> > +}
>
> Does this code double-free vmlinux_btf when btf__load_module_btf() fails?
> If btf__load_module_btf(mod, vmlinux_btf) fails, the error path at
> ASSERT_OK_PTR calls btf__free(vmlinux_btf) and then jumps to the out:
> label, which calls btf__free(vmlinux_btf) again on the same pointer.
>
> The first free happens here:
>
> if (mod) {
> vmlinux_btf = btf;
> btf = btf__load_module_btf(mod, vmlinux_btf);
> if (!ASSERT_OK_PTR(btf, "btf__load_module_btf")) {
> btf__free(vmlinux_btf); <-- FIRST FREE
> goto out;
> }
> }
>
> Then at the out: label:
>
> out:
> tdestroy(root, tdestroy_free_nop);
> btf__free(vmlinux_btf); <-- SECOND FREE
> btf__free(btf);
>
> The btf__free() implementation in tools/lib/bpf/btf.c checks only
> IS_ERR_OR_NULL(btf) and does not guard against already-freed memory,
> so this would trigger a double-free that glibc would detect and abort.
>
> In the current commit, get_ids() is only called with mod=NULL from
> test_link_api_ids(), so this path is latent. However, a subsequent commit
> (4dda98b9e4d0, "selftests/bpf: Add tracing multi skel/pattern/ids module
> attach tests") adds a caller get_ids(bpf_testmod_fentry_test, cnt,
> "bpf_testmod") that activates the buggy path.
yep, double free, will fix
jirka
^ permalink raw reply
* Re: [PATCH v13 04/18] unwind_user/sframe: Add support for reading .sframe contents
From: Jens Remus @ 2026-04-21 8:27 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, bpf, x86, linux-mm,
Steven Rostedt
Cc: Josh Poimboeuf, Masami Hiramatsu, Mathieu Desnoyers,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Jose E. Marchesi,
Beau Belgrave, Linus Torvalds, Andrew Morton, Florian Weimer,
Kees Cook, Carlos O'Donell, Sam James, Dylan Hatch,
Borislav Petkov, Dave Hansen, David Hildenbrand, H. Peter Anvin,
Liam R. Howlett, Lorenzo Stoakes, Michal Hocko, Mike Rapoport,
Suren Baghdasaryan, Vlastimil Babka, Heiko Carstens,
Vasily Gorbik, Steven Rostedt (Google), Indu Bhagat
In-Reply-To: <20260127150554.2760964-5-jremus@linux.ibm.com>
On 1/27/2026 4:05 PM, Jens Remus wrote:
> From: Josh Poimboeuf <jpoimboe@kernel.org>
>
> In preparation for using sframe to unwind user space stacks, add an
> sframe_find() interface for finding the sframe information associated
> with a given text address.
>
> For performance, use user_read_access_begin() and the corresponding
> unsafe_*() accessors. Note that use of pr_debug() in uaccess-enabled
> regions would break noinstr validation, so there aren't any debug
> messages yet. That will be added in a subsequent commit.
>
> Link: https://lore.kernel.org/all/77c0d1ec143bf2a53d66c4ecb190e7e0a576fbfd.1737511963.git.jpoimboe@kernel.org/
> Link: https://lore.kernel.org/all/b35ca3a3-8de5-4d32-8d30-d4e562f6b0de@linux.ibm.com/
>
> [ Jens Remus: Add initial support for SFrame V3 (limited to regular
> FDEs). Add support for PC-relative FDE function start offset. Simplify
> logic by using an internal FDE representation. Rename struct sframe_fre
> to sframe_fre_internal to align with struct sframe_fde_internal.
> Cleanup includes. Fix checkpatch errors "spaces required around that
> ':'". ]
> diff --git a/kernel/unwind/sframe.c b/kernel/unwind/sframe.c
> +static __always_inline int __read_fde(struct sframe_section *sec,
> + unsigned int fde_num,
> + struct sframe_fde_internal *fde)
> +{
> + unsigned long fde_addr, fda_addr, func_addr;
> + struct sframe_fde_v3 _fde;
> + struct sframe_fda_v3 _fda;
> +
> + fde_addr = sec->fdes_start + (fde_num * sizeof(struct sframe_fde_v3));
> + unsafe_copy_from_user(&_fde, (void __user *)fde_addr,
> + sizeof(struct sframe_fde_v3), Efault);
> +
> + func_addr = fde_addr + _fde.func_start_off;
> + if (func_addr < sec->text_start || func_addr > sec->text_end)
sec->text_end points after the last byte of the .text section.
Therefore the check for whether FDE function start address is within the
section must be:
if (func_addr < sec->text_start || func_addr >= sec->text_end)
> + return -EINVAL;
Regards,
Jens
--
Jens Remus
Linux on Z Development (D3303)
jremus@de.ibm.com / jremus@linux.ibm.com
IBM Deutschland Research & Development GmbH; Vorsitzender des Aufsichtsrats: Wolfgang Wendt; Geschäftsführung: David Faller; Sitz der Gesellschaft: Ehningen; Registergericht: Amtsgericht Stuttgart, HRB 243294
IBM Data Privacy Statement: https://www.ibm.com/privacy/
^ permalink raw reply
* Re: [PATCH 0/3] mm: split the file's i_mmap tree for NUMA
From: Huang Shijie @ 2026-04-21 3:06 UTC (permalink / raw)
To: Pedro Falcato
Cc: Mateusz Guzik, akpm, viro, brauner, linux-mm, linux-kernel,
linux-arm-kernel, linux-fsdevel, muchun.song, osalvador,
linux-trace-kernel, linux-perf-users, linux-parisc, nvdimm,
zhongyuan, fangbaoshun, yingzhiwei
In-Reply-To: <hshxzebq5y4gavo7mbrgn7qitz5j5wyun73wy7ooiiehzzpcui@hlknbp34sgja>
On Mon, Apr 20, 2026 at 02:48:49PM +0100, Pedro Falcato wrote:
> BTW you're missing _a lot_ of CC's here, including the whole of mm/rmap.c
> maintainership.
Thanks, my fault.
>
> On Mon, Apr 20, 2026 at 10:10:19AM +0800, Huang Shijie wrote:
> > On Mon, Apr 13, 2026 at 05:33:21PM +0200, Mateusz Guzik wrote:
> > > On Mon, Apr 13, 2026 at 02:20:39PM +0800, Huang Shijie wrote:
> > > > In NUMA, there are maybe many NUMA nodes and many CPUs.
> > > > For example, a Hygon's server has 12 NUMA nodes, and 384 CPUs.
> > > > In the UnixBench tests, there is a test "execl" which tests
> > > > the execve system call.
> > > >
> > > > When we test our server with "./Run -c 384 execl",
> > > > the test result is not good enough. The i_mmap locks contended heavily on
> > > > "libc.so" and "ld.so". For example, the i_mmap tree for "libc.so" can have
> > > > over 6000 VMAs, all the VMAs can be in different NUMA mode.
> > > > The insert/remove operations do not run quickly enough.
> > > >
> > > > patch 1 & patch 2 are try to hide the direct access of i_mmap.
> > > > patch 3 splits the i_mmap into sibling trees, and we can get better
> > > > performance with this patch set:
> > > > we can get 77% performance improvement(10 times average)
> > > >
> > >
> > > To my reading you kept the lock as-is and only distributed the protected
> > > state.
> > >
> > > While I don't doubt the improvement, I'm confident should you take a
> > > look at the profile you are going to find this still does not scale with
> > > rwsem being one of the problems (there are other global locks, some of
> > > which have experimental patches for).
> > >
> > > Apart from that this does nothing to help high core systems which are
> > > all one node, which imo puts another question mark on this specific
> > > proposal.
> > >
> > > Of course one may question whether a RB tree is the right choice here,
> > > it may be the lock-protected cost can go way down with merely a better
> > > data structure.
> > >
> > > Regardless of that, for actual scalability, there will be no way around
> > > decentralazing locking around this and partitioning per some core count
> > > (not just by numa awareness).
> > >
> > > Decentralizing locking is definitely possible, but I have not looked
> > > into specifics of how problematic it is. Best case scenario it will
> > > merely with separate locks. Worst case scenario something needs a fully
> > > stabilized state for traversal, in that case another rw lock can be
> > > slapped around this, creating locking order read lock -> per-subset
> > > write lock -- this will suffer scalability due to the read locking, but
> > > it will still scale drastically better as apart from that there will be
> > > no serialization. In this setting the problematic consumer will write
> > > lock the new thing to stabilize the state.
> > >
> > I thought over again.
> > I can change this patch set to support the non-NUMA case by:
> > 1.) Still use one rw lock.
>
> No. This doesn't help anything.
>
> > 2.) For NUMA, keep the patch set as it is.
>
> Please no. No NUMA vs non-NUMA case.
>
> > 3.) For non-NUMA case, split the i_mmap tree to several subtrees.
> > For example, if a machine has 192 CPUs, split the 32 CPUs as a tree.
>
> If lock contention is the problem, I don't see how splitting the tree helps,
> unless it helps reduce lock hold time in a way that randomly helps your workload.
> But that's entirely random.
We actually face two issues:
1.) the lock contention
2.) the lock hold time.
IMHO, if we can reduce the lock hold time, we can ease the lock contention too.
So this patch set is to reduce the lock hold time, which is much helpful in our
NUMA server in UnixBench test.
If we split the lock into small locks, we can also benefit from it.
If you or Mateusz create the patch in future, I can test it on our server.
I wonder if it can give us better performance then current patch set.
Thanks
Huang Shijie
^ permalink raw reply
* Re: [PATCH] tracing: branch: Fix inverted check on stat tracer registration
From: Masami Hiramatsu @ 2026-04-20 23:57 UTC (permalink / raw)
To: Breno Leitao
Cc: Steven Rostedt, Mathieu Desnoyers, Ingo Molnar,
Frederic Weisbecker, Steven Rostedt, linux-kernel,
linux-trace-kernel, paulmck, kernel-team
In-Reply-To: <20260420-tracing-v1-1-d8f4cd0d6af1@debian.org>
On Mon, 20 Apr 2026 06:25:09 -0700
Breno Leitao <leitao@debian.org> wrote:
> init_annotated_branch_stats() and all_annotated_branch_stats() check the
> return value of register_stat_tracer() with "if (!ret)", but
> register_stat_tracer() returns 0 on success and a negative errno on
> failure. The inverted check causes the warning to be printed on every
> successful registration, e.g.:
>
> Warning: could not register annotated branches stats
>
> while leaving real failures silent. The initcall also returned a
> hard-coded 1 instead of the actual error.
>
> Invert the check and propagate ret so that the warning fires on real
> errors and the initcall reports the correct status.
>
Good catch!
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Thanks,
> Fixes: 002bb86d8d42 ("tracing/ftrace: separate events tracing and stats tracing engine")
> Signed-off-by: Breno Leitao <leitao@debian.org>
> ---
> kernel/trace/trace_branch.c | 8 ++++----
> 1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/trace/trace_branch.c b/kernel/trace/trace_branch.c
> index 6809b370e991d..d1564db95a8f5 100644
> --- a/kernel/trace/trace_branch.c
> +++ b/kernel/trace/trace_branch.c
> @@ -373,10 +373,10 @@ __init static int init_annotated_branch_stats(void)
> int ret;
>
> ret = register_stat_tracer(&annotated_branch_stats);
> - if (!ret) {
> + if (ret) {
> printk(KERN_WARNING "Warning: could not register "
> "annotated branches stats\n");
> - return 1;
> + return ret;
> }
> return 0;
> }
> @@ -438,10 +438,10 @@ __init static int all_annotated_branch_stats(void)
> int ret;
>
> ret = register_stat_tracer(&all_branch_stats);
> - if (!ret) {
> + if (ret) {
> printk(KERN_WARNING "Warning: could not register "
> "all branches stats\n");
> - return 1;
> + return ret;
> }
> return 0;
> }
>
> ---
> base-commit: c7275b05bc428c7373d97aa2da02d3a7fa6b9f66
> change-id: 20260420-tracing-3f1367ee4b93
>
> Best regards,
> --
> Breno Leitao <leitao@debian.org>
>
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply
* Re: [PATCH 7.2 v16 06/13] mm/khugepaged: skip collapsing mTHP to smaller orders
From: Usama Arif @ 2026-04-20 15:36 UTC (permalink / raw)
To: Nico Pache
Cc: Usama Arif, linux-doc, linux-kernel, linux-mm, linux-trace-kernel,
akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
lance.yang, Liam.Howlett, ljs, mathieu.desnoyers, matthew.brost,
mhiramat, mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
zokeefe
In-Reply-To: <20260419185750.260784-7-npache@redhat.com>
On Sun, 19 Apr 2026 12:57:43 -0600 Nico Pache <npache@redhat.com> wrote:
> khugepaged may try to collapse a mTHP to a smaller mTHP, resulting in
> some pages being unmapped. Skip these cases until we have a way to check
> if its ok to collapse to a smaller mTHP size (like in the case of a
> partially mapped folio). This check is also not done during the scan phase
> as the current collapse order is unknown at that time.
>
> This patch is inspired by Dev Jain's work on khugepaged mTHP support [1].
>
> [1] https://lore.kernel.org/lkml/20241216165105.56185-11-dev.jain@arm.com/
>
> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Co-developed-by: Dev Jain <dev.jain@arm.com>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
> mm/khugepaged.c | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
Acked-by: Usama Arif <usama.arif@linux.dev>
^ permalink raw reply
* Re: [PATCH 7.2 v16 05/13] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
From: Usama Arif @ 2026-04-20 14:20 UTC (permalink / raw)
To: Nico Pache
Cc: Usama Arif, linux-doc, linux-kernel, linux-mm, linux-trace-kernel,
akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
lance.yang, Liam.Howlett, ljs, mathieu.desnoyers, matthew.brost,
mhiramat, mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
zokeefe
In-Reply-To: <20260419185750.260784-6-npache@redhat.com>
On Sun, 19 Apr 2026 12:57:42 -0600 Nico Pache <npache@redhat.com> wrote:
> Pass an order and offset to collapse_huge_page to support collapsing anon
> memory to arbitrary orders within a PMD. order indicates what mTHP size we
> are attempting to collapse to, and offset indicates were in the PMD to
> start the collapse attempt.
>
> For non-PMD collapse we must leave the anon VMA write locked until after
> we collapse the mTHP-- in the PMD case all the pages are isolated, but in
> the mTHP case this is not true, and we must keep the lock to prevent
> access/changes to the page tables. This can happen if the rmap walkers hit
> a pmd_none while the PMD entry is currently unavailable due to being
> temporarily removed during the collapse phase.
>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
> mm/khugepaged.c | 103 +++++++++++++++++++++++++++---------------------
> 1 file changed, 57 insertions(+), 46 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 283bb63854a5..ff6f9f1883ed 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1198,42 +1198,36 @@ static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_stru
> return SCAN_SUCCEED;
> }
>
> -static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long address,
> - int referenced, int unmapped, struct collapse_control *cc)
> +static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long start_addr,
> + int referenced, int unmapped, struct collapse_control *cc,
> + unsigned int order)
> {
> LIST_HEAD(compound_pagelist);
> pmd_t *pmd, _pmd;
> - pte_t *pte;
> + pte_t *pte = NULL;
> pgtable_t pgtable;
> struct folio *folio;
> spinlock_t *pmd_ptl, *pte_ptl;
> enum scan_result result = SCAN_FAIL;
> struct vm_area_struct *vma;
> struct mmu_notifier_range range;
> + bool anon_vma_locked = false;
> + const unsigned long pmd_addr = start_addr & HPAGE_PMD_MASK;
> + const unsigned long end_addr = start_addr + (PAGE_SIZE << order);
>
> - VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> -
> - /*
> - * Before allocating the hugepage, release the mmap_lock read lock.
> - * The allocation can take potentially a long time if it involves
> - * sync compaction, and we do not need to hold the mmap_lock during
> - * that. We will recheck the vma after taking it again in write mode.
> - */
> - mmap_read_unlock(mm);
> -
My understanding now is that the caller will need to drop the mmap_read lock?
This needs explicit documentation at the start of the function IMO. I think
there are callers in later patches and its very easy to miss this.
> - result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
> + result = alloc_charge_folio(&folio, mm, cc, order);
> if (result != SCAN_SUCCEED)
> goto out_nolock;
>
> mmap_read_lock(mm);
> - result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
> - HPAGE_PMD_ORDER);
> + result = hugepage_vma_revalidate(mm, pmd_addr, /*expect_anon=*/ true,
> + &vma, cc, order);
> if (result != SCAN_SUCCEED) {
> mmap_read_unlock(mm);
> goto out_nolock;
> }
>
> - result = find_pmd_or_thp_or_none(mm, address, &pmd);
> + result = find_pmd_or_thp_or_none(mm, pmd_addr, &pmd);
> if (result != SCAN_SUCCEED) {
> mmap_read_unlock(mm);
> goto out_nolock;
> @@ -1245,8 +1239,8 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> * released when it fails. So we jump out_nolock directly in
> * that case. Continuing to collapse causes inconsistency.
> */
> - result = __collapse_huge_page_swapin(mm, vma, address, pmd,
> - referenced, HPAGE_PMD_ORDER);
> + result = __collapse_huge_page_swapin(mm, vma, start_addr, pmd,
> + referenced, order);
> if (result != SCAN_SUCCEED)
> goto out_nolock;
> }
> @@ -1261,20 +1255,21 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> * mmap_lock.
> */
> mmap_write_lock(mm);
> - result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
> - HPAGE_PMD_ORDER);
> + result = hugepage_vma_revalidate(mm, pmd_addr, /*expect_anon=*/ true,
> + &vma, cc, order);
> if (result != SCAN_SUCCEED)
> goto out_up_write;
> /* check if the pmd is still valid */
> vma_start_write(vma);
> - result = check_pmd_still_valid(mm, address, pmd);
> + result = check_pmd_still_valid(mm, pmd_addr, pmd);
> if (result != SCAN_SUCCEED)
> goto out_up_write;
>
> anon_vma_lock_write(vma->anon_vma);
> + anon_vma_locked = true;
>
> - mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
> - address + HPAGE_PMD_SIZE);
> + mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, start_addr,
> + end_addr);
> mmu_notifier_invalidate_range_start(&range);
>
> pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
> @@ -1286,26 +1281,23 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> * Parallel GUP-fast is fine since GUP-fast will back off when
> * it detects PMD is changed.
> */
> - _pmd = pmdp_collapse_flush(vma, address, pmd);
> + _pmd = pmdp_collapse_flush(vma, pmd_addr, pmd);
For an mTHP collapse covering, say, 64KiB of a 2MiB PMD, the patch still
flushes the entire PMD via pmdp_collapse_flush and tlb_remove_table_sync_one.
That triggers cross-CPU TLB shootdowns for ~1.94MiB of unrelated mappings on
every successful sub-PMD collapse. Probably acceptable as a first cut?
> spin_unlock(pmd_ptl);
> mmu_notifier_invalidate_range_end(&range);
> tlb_remove_table_sync_one();
>
> - pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
> + pte = pte_offset_map_lock(mm, &_pmd, start_addr, &pte_ptl);
> if (pte) {
> - result = __collapse_huge_page_isolate(vma, address, pte, cc,
> - HPAGE_PMD_ORDER,
> - &compound_pagelist);
> + result = __collapse_huge_page_isolate(vma, start_addr, pte, cc,
> + order, &compound_pagelist);
> spin_unlock(pte_ptl);
> } else {
> result = SCAN_NO_PTE_TABLE;
> }
>
> if (unlikely(result != SCAN_SUCCEED)) {
> - if (pte)
> - pte_unmap(pte);
> spin_lock(pmd_ptl);
> - BUG_ON(!pmd_none(*pmd));
> + WARN_ON_ONCE(!pmd_none(*pmd));
Why was this turned into WARN_ON_ONCE? Would be good to add to commit message
what the reason is if it has been discussed earlier.
The next line writes a PMD entry over an existing one — that leaks the
previous page table or PMD-mapped folio and can corrupt VA mappings. BUG_ON
failed loudly and safely; WARN_ON_ONCE continues into the corruption and falls
silent after the first hit.
> /*
> * We can only use set_pmd_at when establishing
> * hugepmds and never for establishing regular pmds that
> @@ -1313,21 +1305,24 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> */
> pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> spin_unlock(pmd_ptl);
> - anon_vma_unlock_write(vma->anon_vma);
> goto out_up_write;
> }
>
> /*
> - * All pages are isolated and locked so anon_vma rmap
> - * can't run anymore.
> + * For PMD collapse all pages are isolated and locked so anon_vma
> + * rmap can't run anymore. For mTHP collapse the PMD entry has been
> + * removed and not all pages are isolated and locked, so we must hold
> + * the lock to prevent neighboring folios from attempting to access
> + * this PMD until its reinstalled.
> */
> - anon_vma_unlock_write(vma->anon_vma);
> + if (is_pmd_order(order)) {
> + anon_vma_unlock_write(vma->anon_vma);
> + anon_vma_locked = false;
> + }
>
> result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
> - vma, address, pte_ptl,
> - HPAGE_PMD_ORDER,
> - &compound_pagelist);
> - pte_unmap(pte);
> + vma, start_addr, pte_ptl,
> + order, &compound_pagelist);
> if (unlikely(result != SCAN_SUCCEED))
> goto out_up_write;
>
> @@ -1337,18 +1332,27 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> * write.
> */
> __folio_mark_uptodate(folio);
> - pgtable = pmd_pgtable(_pmd);
> -
> spin_lock(pmd_ptl);
> - BUG_ON(!pmd_none(*pmd));
> - pgtable_trans_huge_deposit(mm, pmd, pgtable);
> - map_anon_folio_pmd_nopf(folio, pmd, vma, address);
> + WARN_ON_ONCE(!pmd_none(*pmd));
> + if (is_pmd_order(order)) { /* PMD collapse */
> + pgtable = pmd_pgtable(_pmd);
> + pgtable_trans_huge_deposit(mm, pmd, pgtable);
> + map_anon_folio_pmd_nopf(folio, pmd, vma, pmd_addr);
> + } else { /* mTHP collapse */
> + map_anon_folio_pte_nopf(folio, pte, vma, start_addr, /*uffd_wp=*/ false);
map_anon_folio_pte_nopf calls set_ptes and modifies pagetable, while holding
pmd_ptl only. It should be safe as we expect pmd_none. But I think you should
put a comment about this?
> + smp_wmb(); /* make PTEs visible before PMD. See pmd_install() */
> + pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> + }
> spin_unlock(pmd_ptl);
>
> folio = NULL;
>
> result = SCAN_SUCCEED;
> out_up_write:
> + if (anon_vma_locked)
> + anon_vma_unlock_write(vma->anon_vma);
> + if (pte)
> + pte_unmap(pte);
> mmap_write_unlock(mm);
> out_nolock:
> if (folio)
> @@ -1525,8 +1529,15 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> out_unmap:
> pte_unmap_unlock(pte, ptl);
> if (result == SCAN_SUCCEED) {
> + /*
> + * Before allocating the hugepage, release the mmap_lock read lock.
> + * The allocation can take potentially a long time if it involves
> + * sync compaction, and we do not need to hold the mmap_lock during
> + * that. We will recheck the vma after taking it again in write mode.
> + */
> + mmap_read_unlock(mm);
> result = collapse_huge_page(mm, start_addr, referenced,
> - unmapped, cc);
> + unmapped, cc, HPAGE_PMD_ORDER);
> /* collapse_huge_page will return with the mmap_lock released */
> *lock_dropped = true;
> }
> --
> 2.53.0
>
>
^ permalink raw reply
* [PATCH v10 8/8] selftests/ftrace: Add a testcase for multiple fprobe events
From: Masami Hiramatsu (Google) @ 2026-04-20 14:01 UTC (permalink / raw)
To: Steven Rostedt, Masami Hiramatsu
Cc: Menglong Dong, Mathieu Desnoyers, jiang.biao, linux-kernel,
linux-trace-kernel
In-Reply-To: <177669363667.132053.12454670015890859277.stgit@mhiramat.tok.corp.google.com>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Add a testcase for multiple fprobe events on the same function
so that it clears ftrace hash map correctly when removing the
events.
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
.../test.d/dynevent/add_remove_multiple_fprobe.tc | 69 ++++++++++++++++++++
1 file changed, 69 insertions(+)
create mode 100644 tools/testing/selftests/ftrace/test.d/dynevent/add_remove_multiple_fprobe.tc
diff --git a/tools/testing/selftests/ftrace/test.d/dynevent/add_remove_multiple_fprobe.tc b/tools/testing/selftests/ftrace/test.d/dynevent/add_remove_multiple_fprobe.tc
new file mode 100644
index 000000000000..f2cbf2ffd29b
--- /dev/null
+++ b/tools/testing/selftests/ftrace/test.d/dynevent/add_remove_multiple_fprobe.tc
@@ -0,0 +1,69 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: Generic dynamic event - add/remove multiple fprobe events on the same function
+# requires: dynamic_events "f[:[<group>/][<event>]] <func-name>[%return] [<args>]":README enabled_functions
+
+echo 0 > events/enable
+echo > dynamic_events
+
+PLACE=vfs_read
+PLACE2=vfs_open
+
+:;: 'Ensure no other ftrace user' ;:
+test `cat enabled_functions | wc -l` -eq 0 || exit_unresolved
+
+:;: 'Test case 1: leave entry event' ;:
+:;: 'Add entry and exit events on the same place' ;:
+echo "f:event1 ${PLACE}" >> dynamic_events
+echo "f:event2 ${PLACE}%return" >> dynamic_events
+
+:;: 'Enable both of them' ;:
+echo 1 > events/fprobes/enable
+test `cat enabled_functions | wc -l` -eq 1
+
+:;: 'Disable and remove exit event' ;:
+echo 0 > events/fprobes/event2/enable
+echo -:event2 >> dynamic_events
+
+:;: 'Disable and remove all events' ;:
+echo 0 > events/fprobes/enable
+echo > dynamic_events
+
+:;: 'Add another event' ;:
+echo "f:event3 ${PLACE2}%return" > dynamic_events
+echo 1 > events/fprobes/enable
+test `cat enabled_functions | wc -l` -eq 1
+
+:;: 'No other ftrace user' ;:
+echo 0 > events/fprobes/enable
+echo > dynamic_events
+test `cat enabled_functions | wc -l` -eq 0
+
+:;: 'Test case 2: leave exit event' ;:
+:;: 'Add entry and exit events on the same place' ;:
+echo "f:event1 ${PLACE}" >> dynamic_events
+echo "f:event2 ${PLACE}%return" >> dynamic_events
+
+:;: 'Enable both of them' ;:
+echo 1 > events/fprobes/enable
+test `cat enabled_functions | wc -l` -eq 1
+
+:;: 'Disable and remove entry event' ;:
+echo 0 > events/fprobes/event1/enable
+echo -:event1 >> dynamic_events
+
+:;: 'Disable and remove all events' ;:
+echo 0 > events/fprobes/enable
+echo > dynamic_events
+
+:;: 'Add another event' ;:
+echo "f:event3 ${PLACE2}" > dynamic_events
+echo 1 > events/fprobes/enable
+test `cat enabled_functions | wc -l` -eq 1
+
+:;: 'No other ftrace user' ;:
+echo 0 > events/fprobes/enable
+echo > dynamic_events
+test `cat enabled_functions | wc -l` -eq 0
+
+clear_trace
^ permalink raw reply related
* [PATCH v10 7/8] selftests/ftrace: Add a testcase for fprobe events on module
From: Masami Hiramatsu (Google) @ 2026-04-20 14:01 UTC (permalink / raw)
To: Steven Rostedt, Masami Hiramatsu
Cc: Menglong Dong, Mathieu Desnoyers, jiang.biao, linux-kernel,
linux-trace-kernel
In-Reply-To: <177669363667.132053.12454670015890859277.stgit@mhiramat.tok.corp.google.com>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Add a testcase for fprobe events on module, which unloads a kernel
module on which fprobe events are probing and ensure the ftrace
hash map is cleared correctly.
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
Changes in v10:
- Fix module name typo in error case trap.
Changes in v9:
- Use "trace-events-sample" instead of "trace_events_sample"
- Add checking unload module and remove core-kernel event case.
- Check test module exists when unloading it in EXIT.
Changes in v8:
- Newly added.
---
.../test.d/dynevent/add_remove_fprobe_module.tc | 87 ++++++++++++++++++++
1 file changed, 87 insertions(+)
create mode 100644 tools/testing/selftests/ftrace/test.d/dynevent/add_remove_fprobe_module.tc
diff --git a/tools/testing/selftests/ftrace/test.d/dynevent/add_remove_fprobe_module.tc b/tools/testing/selftests/ftrace/test.d/dynevent/add_remove_fprobe_module.tc
new file mode 100644
index 000000000000..2915206777b6
--- /dev/null
+++ b/tools/testing/selftests/ftrace/test.d/dynevent/add_remove_fprobe_module.tc
@@ -0,0 +1,87 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: Generic dynamic event - add/remove fprobe events on module
+# requires: dynamic_events "f[:[<group>/][<event>]] <func-name>[%return] [<args>]":README enabled_functions
+
+rmmod trace-events-sample ||:
+if ! modprobe trace-events-sample ; then
+ echo "No trace-events sample module - please make CONFIG_SAMPLE_TRACE_EVENTS=m"
+ exit_unresolved;
+fi
+trap "lsmod | grep -q trace_events_sample && rmmod trace-events-sample" EXIT
+
+echo 0 > events/enable
+echo > dynamic_events
+
+FUNC1='foo_bar*'
+FUNC2='vfs_read'
+
+:;: "Add an event on the test module" ;:
+echo "f:test1 $FUNC1" >> dynamic_events
+echo 1 > events/fprobes/test1/enable
+
+:;: "Ensure it is enabled" ;:
+funcs=`cat enabled_functions | wc -l`
+test $funcs -ne 0
+
+:;: "Check the enabled_functions is cleared on unloading" ;:
+rmmod trace-events-sample
+funcs=`cat enabled_functions | wc -l`
+test $funcs -eq 0
+
+:;: "Check it is kept clean" ;:
+modprobe trace-events-sample
+echo 1 > events/fprobes/test1/enable || echo "OK"
+funcs=`cat enabled_functions | wc -l`
+test $funcs -eq 0
+
+:;: "Add another event not on the test module" ;:
+echo "f:test2 $FUNC2" >> dynamic_events
+echo 1 > events/fprobes/test2/enable
+
+:;: "Ensure it is enabled" ;:
+ofuncs=`cat enabled_functions | wc -l`
+test $ofuncs -ne 0
+
+:;: "Disable and remove the first event"
+echo 0 > events/fprobes/test1/enable
+echo "-:fprobes/test1" >> dynamic_events
+funcs=`cat enabled_functions | wc -l`
+test $ofuncs -eq $funcs
+
+:;: "Disable and remove other events" ;:
+echo 0 > events/fprobes/enable
+echo > dynamic_events
+funcs=`cat enabled_functions | wc -l`
+test $funcs -eq 0
+
+rmmod trace-events-sample
+
+:;: "Add events on kernel and test module" ;:
+modprobe trace-events-sample
+echo "f:test1 $FUNC1" >> dynamic_events
+echo 1 > events/fprobes/test1/enable
+echo "f:test2 $FUNC2" >> dynamic_events
+echo 1 > events/fprobes/test2/enable
+ofuncs=`cat enabled_functions | wc -l`
+test $ofuncs -ne 0
+
+:;: "Unload module (ftrace entry should be removed)" ;:
+rmmod trace-events-sample
+funcs=`cat enabled_functions | wc -l`
+test $funcs -ne 0
+test $ofuncs -ne $funcs
+
+:;: "Disable and remove core-kernel fprobe event" ;:
+echo 0 > events/fprobes/test2/enable
+echo "-:fprobes/test2" >> dynamic_events
+
+:;: "Ensure ftrace is disabled." ;:
+funcs=`cat enabled_functions | wc -l`
+test $funcs -eq 0
+
+echo 0 > events/fprobes/enable
+echo > dynamic_events
+
+trap "" EXIT
+clear_trace
^ permalink raw reply related
* [PATCH v10 6/8] tracing/fprobe: Fix to unregister ftrace_ops if it is empty on module unloading
From: Masami Hiramatsu (Google) @ 2026-04-20 14:01 UTC (permalink / raw)
To: Steven Rostedt, Masami Hiramatsu
Cc: Menglong Dong, Mathieu Desnoyers, jiang.biao, linux-kernel,
linux-trace-kernel
In-Reply-To: <177669363667.132053.12454670015890859277.stgit@mhiramat.tok.corp.google.com>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Fix fprobe to unregister ftrace_ops if corresponding type of fprobe
does not exist on the fprobe_ip_table and it is expected to be empty
when unloading modules.
Since ftrace thinks that the empty hash means everything to be traced,
if we set fprobes only on the unloaded module, all functions are traced
unexpectedly after unloading module.
e.g.
# modprobe xt_LOG.ko
# echo 'f:test log_tg*' > dynamic_events
# echo 1 > events/fprobes/test/enable
# cat enabled_functions
log_tg [xt_LOG] (1) tramp: 0xffffffffa0004000 (fprobe_ftrace_entry+0x0/0x490) ->fprobe_ftrace_entry+0x0/0x490
log_tg_check [xt_LOG] (1) tramp: 0xffffffffa0004000 (fprobe_ftrace_entry+0x0/0x490) ->fprobe_ftrace_entry+0x0/0x490
log_tg_destroy [xt_LOG] (1) tramp: 0xffffffffa0004000 (fprobe_ftrace_entry+0x0/0x490) ->fprobe_ftrace_entry+0x0/0x490
# rmmod xt_LOG
# wc -l enabled_functions
34085 enabled_functions
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
Changes in v10:
- Even if the removed node count is 0 because of the memory allocation
fails, call __fprobe_f*_unregister() if the number of ftrace_hash_node
is 0.
Changes in v9:
- Remove fprobe_graph_active and fprobe_ftrace_active to fix
remove fprobe after unload module case.
Changes in v8:
- Fix to check fprobe_graph/ftrace_registered flag directly
when registering ftrace_ops.
Changes in v7:
- Fix to split checking whether ftrace_ops is registered from
the number of registered fprobes, because ftrace_ops can be
unregistered in module unloading.
Changes in v6:
- Newly added.
---
kernel/trace/fprobe.c | 224 +++++++++++++++++++++++++++++++++++--------------
1 file changed, 159 insertions(+), 65 deletions(-)
diff --git a/kernel/trace/fprobe.c b/kernel/trace/fprobe.c
index d0a68a2c5eaf..cc49ebd2a773 100644
--- a/kernel/trace/fprobe.c
+++ b/kernel/trace/fprobe.c
@@ -79,7 +79,7 @@ static const struct rhashtable_params fprobe_rht_params = {
};
/* Node insertion and deletion requires the fprobe_mutex */
-static int insert_fprobe_node(struct fprobe_hlist_node *node, struct fprobe *fp)
+static int __insert_fprobe_node(struct fprobe_hlist_node *node, struct fprobe *fp)
{
int ret;
@@ -92,7 +92,7 @@ static int insert_fprobe_node(struct fprobe_hlist_node *node, struct fprobe *fp)
return ret;
}
-static void delete_fprobe_node(struct fprobe_hlist_node *node)
+static void __delete_fprobe_node(struct fprobe_hlist_node *node)
{
lockdep_assert_held(&fprobe_mutex);
@@ -250,7 +250,65 @@ static inline int __fprobe_kprobe_handler(unsigned long ip, unsigned long parent
return ret;
}
+static int fprobe_fgraph_entry(struct ftrace_graph_ent *trace, struct fgraph_ops *gops,
+ struct ftrace_regs *fregs);
+static void fprobe_return(struct ftrace_graph_ret *trace,
+ struct fgraph_ops *gops,
+ struct ftrace_regs *fregs);
+
+static struct fgraph_ops fprobe_graph_ops = {
+ .entryfunc = fprobe_fgraph_entry,
+ .retfunc = fprobe_return,
+};
+/* Number of fgraph fprobe nodes */
+static int nr_fgraph_fprobes;
+/* Is fprobe_graph_ops registered? */
+static bool fprobe_graph_registered;
+
+/* Add @addrs to the ftrace filter and register fgraph if needed. */
+static int fprobe_graph_add_ips(unsigned long *addrs, int num)
+{
+ int ret;
+
+ lockdep_assert_held(&fprobe_mutex);
+
+ ret = ftrace_set_filter_ips(&fprobe_graph_ops.ops, addrs, num, 0, 0);
+ if (ret)
+ return ret;
+
+ if (!fprobe_graph_registered) {
+ ret = register_ftrace_graph(&fprobe_graph_ops);
+ if (WARN_ON_ONCE(ret)) {
+ ftrace_free_filter(&fprobe_graph_ops.ops);
+ return ret;
+ }
+ fprobe_graph_registered = true;
+ }
+ return 0;
+}
+
+static void __fprobe_graph_unregister(void)
+{
+ if (fprobe_graph_registered) {
+ unregister_ftrace_graph(&fprobe_graph_ops);
+ ftrace_free_filter(&fprobe_graph_ops.ops);
+ fprobe_graph_registered = false;
+ }
+}
+
+/* Remove @addrs from the ftrace filter and unregister fgraph if possible. */
+static void fprobe_graph_remove_ips(unsigned long *addrs, int num)
+{
+ lockdep_assert_held(&fprobe_mutex);
+
+ if (!nr_fgraph_fprobes)
+ __fprobe_graph_unregister();
+ else if (num)
+ ftrace_set_filter_ips(&fprobe_graph_ops.ops, addrs, num, 1, 0);
+}
+
#if defined(CONFIG_DYNAMIC_FTRACE_WITH_ARGS) || defined(CONFIG_DYNAMIC_FTRACE_WITH_REGS)
+
/* ftrace_ops callback, this processes fprobes which have only entry_handler. */
static void fprobe_ftrace_entry(unsigned long ip, unsigned long parent_ip,
struct ftrace_ops *ops, struct ftrace_regs *fregs)
@@ -293,7 +351,10 @@ static struct ftrace_ops fprobe_ftrace_ops = {
.func = fprobe_ftrace_entry,
.flags = FTRACE_OPS_FL_SAVE_ARGS,
};
-static int fprobe_ftrace_active;
+/* Number of ftrace fprobe nodes */
+static int nr_ftrace_fprobes;
+/* Is fprobe_ftrace_ops registered? */
+static bool fprobe_ftrace_registered;
static int fprobe_ftrace_add_ips(unsigned long *addrs, int num)
{
@@ -305,26 +366,33 @@ static int fprobe_ftrace_add_ips(unsigned long *addrs, int num)
if (ret)
return ret;
- if (!fprobe_ftrace_active) {
+ if (!fprobe_ftrace_registered) {
ret = register_ftrace_function(&fprobe_ftrace_ops);
if (ret) {
ftrace_free_filter(&fprobe_ftrace_ops);
return ret;
}
+ fprobe_ftrace_registered = true;
}
- fprobe_ftrace_active++;
return 0;
}
+static void __fprobe_ftrace_unregister(void)
+{
+ if (fprobe_ftrace_registered) {
+ unregister_ftrace_function(&fprobe_ftrace_ops);
+ ftrace_free_filter(&fprobe_ftrace_ops);
+ fprobe_ftrace_registered = false;
+ }
+}
+
static void fprobe_ftrace_remove_ips(unsigned long *addrs, int num)
{
lockdep_assert_held(&fprobe_mutex);
- fprobe_ftrace_active--;
- if (!fprobe_ftrace_active) {
- unregister_ftrace_function(&fprobe_ftrace_ops);
- ftrace_free_filter(&fprobe_ftrace_ops);
- } else if (num)
+ if (!nr_ftrace_fprobes)
+ __fprobe_ftrace_unregister();
+ else if (num)
ftrace_set_filter_ips(&fprobe_ftrace_ops, addrs, num, 1, 0);
}
@@ -333,6 +401,40 @@ static bool fprobe_is_ftrace(struct fprobe *fp)
return !fp->exit_handler;
}
+/* Node insertion and deletion requires the fprobe_mutex */
+static int insert_fprobe_node(struct fprobe_hlist_node *node, struct fprobe *fp)
+{
+ int ret;
+
+ lockdep_assert_held(&fprobe_mutex);
+
+ ret = __insert_fprobe_node(node, fp);
+ if (!ret) {
+ if (fprobe_is_ftrace(fp))
+ nr_ftrace_fprobes++;
+ else
+ nr_fgraph_fprobes++;
+ }
+
+ return ret;
+}
+
+static void delete_fprobe_node(struct fprobe_hlist_node *node)
+{
+ struct fprobe *fp;
+
+ lockdep_assert_held(&fprobe_mutex);
+
+ fp = READ_ONCE(node->fp);
+ if (fp) {
+ if (fprobe_is_ftrace(fp))
+ nr_ftrace_fprobes--;
+ else
+ nr_fgraph_fprobes--;
+ }
+ __delete_fprobe_node(node);
+}
+
static bool fprobe_exists_on_hash(unsigned long ip, bool ftrace)
{
struct rhlist_head *head, *pos;
@@ -362,8 +464,15 @@ static bool fprobe_exists_on_hash(unsigned long ip, bool ftrace)
#ifdef CONFIG_MODULES
static void fprobe_remove_ips(unsigned long *ips, unsigned int cnt)
{
- ftrace_set_filter_ips(&fprobe_graph_ops.ops, ips, cnt, 1, 0);
- ftrace_set_filter_ips(&fprobe_ftrace_ops, ips, cnt, 1, 0);
+ if (!nr_fgraph_fprobes)
+ __fprobe_graph_unregister();
+ else if (cnt)
+ ftrace_set_filter_ips(&fprobe_graph_ops.ops, ips, cnt, 1, 0);
+
+ if (!nr_ftrace_fprobes)
+ __fprobe_ftrace_unregister();
+ else if (cnt)
+ ftrace_set_filter_ips(&fprobe_ftrace_ops, ips, cnt, 1, 0);
}
#endif
#else
@@ -381,6 +490,32 @@ static bool fprobe_is_ftrace(struct fprobe *fp)
return false;
}
+/* Node insertion and deletion requires the fprobe_mutex */
+static int insert_fprobe_node(struct fprobe_hlist_node *node, struct fprobe *fp)
+{
+ int ret;
+
+ lockdep_assert_held(&fprobe_mutex);
+
+ ret = __insert_fprobe_node(node, fp);
+ if (!ret)
+ nr_fgraph_fprobes++;
+
+ return ret;
+}
+
+static void delete_fprobe_node(struct fprobe_hlist_node *node)
+{
+ struct fprobe *fp;
+
+ lockdep_assert_held(&fprobe_mutex);
+
+ fp = READ_ONCE(node->fp);
+ if (fp)
+ nr_fgraph_fprobes--;
+ __delete_fprobe_node(node);
+}
+
static bool fprobe_exists_on_hash(unsigned long ip, bool ftrace __maybe_unused)
{
struct rhlist_head *head, *pos;
@@ -407,7 +542,10 @@ static bool fprobe_exists_on_hash(unsigned long ip, bool ftrace __maybe_unused)
#ifdef CONFIG_MODULES
static void fprobe_remove_ips(unsigned long *ips, unsigned int cnt)
{
- ftrace_set_filter_ips(&fprobe_graph_ops.ops, ips, cnt, 1, 0);
+ if (!nr_fgraph_fprobes)
+ __fprobe_graph_unregister();
+ else if (cnt)
+ ftrace_set_filter_ips(&fprobe_graph_ops.ops, ips, cnt, 1, 0);
}
#endif
#endif /* !CONFIG_DYNAMIC_FTRACE_WITH_ARGS && !CONFIG_DYNAMIC_FTRACE_WITH_REGS */
@@ -535,48 +673,6 @@ static void fprobe_return(struct ftrace_graph_ret *trace,
}
NOKPROBE_SYMBOL(fprobe_return);
-static struct fgraph_ops fprobe_graph_ops = {
- .entryfunc = fprobe_fgraph_entry,
- .retfunc = fprobe_return,
-};
-static int fprobe_graph_active;
-
-/* Add @addrs to the ftrace filter and register fgraph if needed. */
-static int fprobe_graph_add_ips(unsigned long *addrs, int num)
-{
- int ret;
-
- lockdep_assert_held(&fprobe_mutex);
-
- ret = ftrace_set_filter_ips(&fprobe_graph_ops.ops, addrs, num, 0, 0);
- if (ret)
- return ret;
-
- if (!fprobe_graph_active) {
- ret = register_ftrace_graph(&fprobe_graph_ops);
- if (WARN_ON_ONCE(ret)) {
- ftrace_free_filter(&fprobe_graph_ops.ops);
- return ret;
- }
- }
- fprobe_graph_active++;
- return 0;
-}
-
-/* Remove @addrs from the ftrace filter and unregister fgraph if possible. */
-static void fprobe_graph_remove_ips(unsigned long *addrs, int num)
-{
- lockdep_assert_held(&fprobe_mutex);
-
- fprobe_graph_active--;
- /* Q: should we unregister it ? */
- if (!fprobe_graph_active) {
- unregister_ftrace_graph(&fprobe_graph_ops);
- ftrace_free_filter(&fprobe_graph_ops.ops);
- } else if (num)
- ftrace_set_filter_ips(&fprobe_graph_ops.ops, addrs, num, 1, 0);
-}
-
#ifdef CONFIG_MODULES
#define FPROBE_IPS_BATCH_INIT 128
@@ -653,16 +749,14 @@ static int fprobe_module_callback(struct notifier_block *nb,
} while (node == ERR_PTR(-EAGAIN) && !retry);
rhashtable_walk_exit(&iter);
/* Remove any ips from hash table(s) */
- if (alist.index > 0) {
- fprobe_remove_ips(alist.addrs, alist.index);
- /*
- * If we break rhashtable walk loop except for -EAGAIN, we need
- * to restart looping from start for safety. Anyway, this is
- * not a hotpath.
- */
- if (retry)
- goto again;
- }
+ fprobe_remove_ips(alist.addrs, alist.index);
+ /*
+ * If we break rhashtable walk loop except for -EAGAIN, we need
+ * to restart looping from start for safety. Anyway, this is
+ * not a hotpath.
+ */
+ if (retry)
+ goto again;
mutex_unlock(&fprobe_mutex);
^ permalink raw reply related
* [PATCH v10 5/8] tracing/fprobe: Check the same type fprobe on table as the unregistered one
From: Masami Hiramatsu (Google) @ 2026-04-20 14:01 UTC (permalink / raw)
To: Steven Rostedt, Masami Hiramatsu
Cc: Menglong Dong, Mathieu Desnoyers, jiang.biao, linux-kernel,
linux-trace-kernel
In-Reply-To: <177669363667.132053.12454670015890859277.stgit@mhiramat.tok.corp.google.com>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Commit 2c67dc457bc6 ("tracing: fprobe: optimization for entry only case")
introduced a different ftrace_ops for entry-only fprobes.
However, when unregistering an fprobe, the kernel only checks if another
fprobe exists at the same address, without checking which type of fprobe
it is.
If different fprobes are registered at the same address, the same address
will be registered in both fgraph_ops and ftrace_ops, but only one of
them will be deleted when unregistering. (the one removed first will not
be deleted from the ops).
This results in junk entries remaining in either fgraph_ops or ftrace_ops.
For example:
=======
cd /sys/kernel/tracing
# 'Add entry and exit events on the same place'
echo 'f:event1 vfs_read' >> dynamic_events
echo 'f:event2 vfs_read%return' >> dynamic_events
# 'Enable both of them'
echo 1 > events/fprobes/enable
cat enabled_functions
vfs_read (2) ->arch_ftrace_ops_list_func+0x0/0x210
# 'Disable and remove exit event'
echo 0 > events/fprobes/event2/enable
echo -:event2 >> dynamic_events
# 'Disable and remove all events'
echo 0 > events/fprobes/enable
echo > dynamic_events
# 'Add another event'
echo 'f:event3 vfs_open%return' > dynamic_events
cat dynamic_events
f:fprobes/event3 vfs_open%return
echo 1 > events/fprobes/enable
cat enabled_functions
vfs_open (1) tramp: 0xffffffffa0001000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60 subops: {ent:fprobe_fgraph_entry+0x0/0x620 ret:fprobe_return+0x0/0x150}
vfs_read (1) tramp: 0xffffffffa0001000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60 subops: {ent:fprobe_fgraph_entry+0x0/0x620 ret:fprobe_return+0x0/0x150}
=======
As you can see, an entry for the vfs_read remains.
To fix this issue, when unregistering, the kernel should also check if
there is the same type of fprobes still exist at the same address, and
if not, delete its entry from either fgraph_ops or ftrace_ops.
Fixes: 2c67dc457bc6 ("tracing: fprobe: optimization for entry only case")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
kernel/trace/fprobe.c | 82 +++++++++++++++++++++++++++++++++++++++----------
1 file changed, 65 insertions(+), 17 deletions(-)
diff --git a/kernel/trace/fprobe.c b/kernel/trace/fprobe.c
index 9b913facfd36..d0a68a2c5eaf 100644
--- a/kernel/trace/fprobe.c
+++ b/kernel/trace/fprobe.c
@@ -92,11 +92,8 @@ static int insert_fprobe_node(struct fprobe_hlist_node *node, struct fprobe *fp)
return ret;
}
-/* Return true if there are synonims */
-static bool delete_fprobe_node(struct fprobe_hlist_node *node)
+static void delete_fprobe_node(struct fprobe_hlist_node *node)
{
- bool ret;
-
lockdep_assert_held(&fprobe_mutex);
/* Avoid double deleting and non-inserted nodes */
@@ -105,13 +102,6 @@ static bool delete_fprobe_node(struct fprobe_hlist_node *node)
rhltable_remove(&fprobe_ip_table, &node->hlist,
fprobe_rht_params);
}
-
- rcu_read_lock();
- ret = !!rhltable_lookup(&fprobe_ip_table, &node->addr,
- fprobe_rht_params);
- rcu_read_unlock();
-
- return ret;
}
/* Check existence of the fprobe */
@@ -343,6 +333,32 @@ static bool fprobe_is_ftrace(struct fprobe *fp)
return !fp->exit_handler;
}
+static bool fprobe_exists_on_hash(unsigned long ip, bool ftrace)
+{
+ struct rhlist_head *head, *pos;
+ struct fprobe_hlist_node *node;
+ struct fprobe *fp;
+
+ guard(rcu)();
+ head = rhltable_lookup(&fprobe_ip_table, &ip,
+ fprobe_rht_params);
+ if (!head)
+ return false;
+ /* We have to check the same type on the list. */
+ rhl_for_each_entry_rcu(node, pos, head, hlist) {
+ if (node->addr != ip)
+ break;
+ fp = READ_ONCE(node->fp);
+ if (likely(fp)) {
+ if ((!ftrace && fp->exit_handler) ||
+ (ftrace && !fp->exit_handler))
+ return true;
+ }
+ }
+
+ return false;
+}
+
#ifdef CONFIG_MODULES
static void fprobe_remove_ips(unsigned long *ips, unsigned int cnt)
{
@@ -365,6 +381,29 @@ static bool fprobe_is_ftrace(struct fprobe *fp)
return false;
}
+static bool fprobe_exists_on_hash(unsigned long ip, bool ftrace __maybe_unused)
+{
+ struct rhlist_head *head, *pos;
+ struct fprobe_hlist_node *node;
+ struct fprobe *fp;
+
+ guard(rcu)();
+ head = rhltable_lookup(&fprobe_ip_table, &ip,
+ fprobe_rht_params);
+ if (!head)
+ return false;
+ /* We only need to check fp is there. */
+ rhl_for_each_entry_rcu(node, pos, head, hlist) {
+ if (node->addr != ip)
+ break;
+ fp = READ_ONCE(node->fp);
+ if (likely(fp))
+ return true;
+ }
+
+ return false;
+}
+
#ifdef CONFIG_MODULES
static void fprobe_remove_ips(unsigned long *ips, unsigned int cnt)
{
@@ -551,18 +590,25 @@ struct fprobe_addr_list {
static int fprobe_remove_node_in_module(struct module *mod, struct fprobe_hlist_node *node,
struct fprobe_addr_list *alist)
{
+ lockdep_assert_in_rcu_read_lock();
+
if (!within_module(node->addr, mod))
return 0;
- if (delete_fprobe_node(node))
- return 0;
+ delete_fprobe_node(node);
/* If no address list is available, we can't track this address. */
if (!alist->addrs)
return 0;
+ /*
+ * Don't care the type here, because all fprobes on the same
+ * address must be removed eventually.
+ */
+ if (!rhltable_lookup(&fprobe_ip_table, &node->addr, fprobe_rht_params)) {
+ alist->addrs[alist->index++] = node->addr;
+ if (alist->index == alist->size)
+ return -ENOSPC;
+ }
- alist->addrs[alist->index++] = node->addr;
- if (alist->index == alist->size)
- return -ENOSPC;
return 0;
}
@@ -933,7 +979,9 @@ static int unregister_fprobe_nolock(struct fprobe *fp)
/* Remove non-synonim ips from table and hash */
count = 0;
for (i = 0; i < hlist_array->size; i++) {
- if (!delete_fprobe_node(&hlist_array->array[i]) && addrs)
+ delete_fprobe_node(&hlist_array->array[i]);
+ if (addrs && !fprobe_exists_on_hash(hlist_array->array[i].addr,
+ fprobe_is_ftrace(fp)))
addrs[count++] = hlist_array->array[i].addr;
}
del_fprobe_hash(fp);
^ permalink raw reply related
* [PATCH v10 4/8] tracing/fprobe: Avoid kcalloc() in rcu_read_lock section
From: Masami Hiramatsu (Google) @ 2026-04-20 14:01 UTC (permalink / raw)
To: Steven Rostedt, Masami Hiramatsu
Cc: Menglong Dong, Mathieu Desnoyers, jiang.biao, linux-kernel,
linux-trace-kernel
In-Reply-To: <177669363667.132053.12454670015890859277.stgit@mhiramat.tok.corp.google.com>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
fprobe_remove_node_in_module() is called under RCU read locked, but
this invokes kcalloc() if there are more than 8 fprobes installed
on the module. Sashiko warns it because kcalloc() can sleep [1].
[1] https://sashiko.dev/#/patchset/177552432201.853249.5125045538812833325.stgit%40mhiramat.tok.corp.google.com
To fix this issue, expand the batch size to 128 and do not expand
the fprobe_addr_list, but just cancel walking on fprobe_ip_table,
update fgraph/ftrace_ops and retry the loop again.
Fixes: 0de4c70d04a4 ("tracing: fprobe: use rhltable for fprobe_ip_table")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
Changes in v6:
- Retry outside rhltable_walk_enter/exit() again.
Changes in v5:
- Skip updating ftrace_ops when fails to allocate memory in module
unloading.
Changes in v4:
- fix a build error typo in case of CONFIG_DYNAMIC_FTRACE=n.
Changes in v3:
- Retry inside rhltable_walk_enter/exit().
- Rename fprobe_set_ips() to fprobe_remove_ips().
- Rename 'retry' label to 'again'.
---
kernel/trace/fprobe.c | 92 ++++++++++++++++++++++++-------------------------
1 file changed, 45 insertions(+), 47 deletions(-)
diff --git a/kernel/trace/fprobe.c b/kernel/trace/fprobe.c
index 621477ad0947..9b913facfd36 100644
--- a/kernel/trace/fprobe.c
+++ b/kernel/trace/fprobe.c
@@ -344,11 +344,10 @@ static bool fprobe_is_ftrace(struct fprobe *fp)
}
#ifdef CONFIG_MODULES
-static void fprobe_set_ips(unsigned long *ips, unsigned int cnt, int remove,
- int reset)
+static void fprobe_remove_ips(unsigned long *ips, unsigned int cnt)
{
- ftrace_set_filter_ips(&fprobe_graph_ops.ops, ips, cnt, remove, reset);
- ftrace_set_filter_ips(&fprobe_ftrace_ops, ips, cnt, remove, reset);
+ ftrace_set_filter_ips(&fprobe_graph_ops.ops, ips, cnt, 1, 0);
+ ftrace_set_filter_ips(&fprobe_ftrace_ops, ips, cnt, 1, 0);
}
#endif
#else
@@ -367,10 +366,9 @@ static bool fprobe_is_ftrace(struct fprobe *fp)
}
#ifdef CONFIG_MODULES
-static void fprobe_set_ips(unsigned long *ips, unsigned int cnt, int remove,
- int reset)
+static void fprobe_remove_ips(unsigned long *ips, unsigned int cnt)
{
- ftrace_set_filter_ips(&fprobe_graph_ops.ops, ips, cnt, remove, reset);
+ ftrace_set_filter_ips(&fprobe_graph_ops.ops, ips, cnt, 1, 0);
}
#endif
#endif /* !CONFIG_DYNAMIC_FTRACE_WITH_ARGS && !CONFIG_DYNAMIC_FTRACE_WITH_REGS */
@@ -542,7 +540,7 @@ static void fprobe_graph_remove_ips(unsigned long *addrs, int num)
#ifdef CONFIG_MODULES
-#define FPROBE_IPS_BATCH_INIT 8
+#define FPROBE_IPS_BATCH_INIT 128
/* instruction pointer address list */
struct fprobe_addr_list {
int index;
@@ -550,45 +548,24 @@ struct fprobe_addr_list {
unsigned long *addrs;
};
-static int fprobe_addr_list_add(struct fprobe_addr_list *alist, unsigned long addr)
+static int fprobe_remove_node_in_module(struct module *mod, struct fprobe_hlist_node *node,
+ struct fprobe_addr_list *alist)
{
- unsigned long *addrs;
-
- /* Previously we failed to expand the list. */
- if (alist->index == alist->size)
- return -ENOSPC;
-
- alist->addrs[alist->index++] = addr;
- if (alist->index < alist->size)
+ if (!within_module(node->addr, mod))
return 0;
- /* Expand the address list */
- addrs = kcalloc(alist->size * 2, sizeof(*addrs), GFP_KERNEL);
- if (!addrs)
- return -ENOMEM;
-
- memcpy(addrs, alist->addrs, alist->size * sizeof(*addrs));
- alist->size *= 2;
- kfree(alist->addrs);
- alist->addrs = addrs;
+ if (delete_fprobe_node(node))
+ return 0;
+ /* If no address list is available, we can't track this address. */
+ if (!alist->addrs)
+ return 0;
+ alist->addrs[alist->index++] = node->addr;
+ if (alist->index == alist->size)
+ return -ENOSPC;
return 0;
}
-static void fprobe_remove_node_in_module(struct module *mod, struct fprobe_hlist_node *node,
- struct fprobe_addr_list *alist)
-{
- if (!within_module(node->addr, mod))
- return;
- if (delete_fprobe_node(node))
- return;
- /*
- * If failed to update alist, just continue to update hlist.
- * Therefore, at list user handler will not hit anymore.
- */
- fprobe_addr_list_add(alist, node->addr);
-}
-
/* Handle module unloading to manage fprobe_ip_table. */
static int fprobe_module_callback(struct notifier_block *nb,
unsigned long val, void *data)
@@ -597,29 +574,50 @@ static int fprobe_module_callback(struct notifier_block *nb,
struct fprobe_hlist_node *node;
struct rhashtable_iter iter;
struct module *mod = data;
+ bool retry;
if (val != MODULE_STATE_GOING)
return NOTIFY_DONE;
alist.addrs = kcalloc(alist.size, sizeof(*alist.addrs), GFP_KERNEL);
- /* If failed to alloc memory, we can not remove ips from hash. */
- if (!alist.addrs)
- return NOTIFY_DONE;
+ /*
+ * If failed to alloc memory, ftrace_ops will not be able to remove ips from
+ * hash, but we can still remove nodes from fprobe_ip_table, so we can avoid
+ * the potential wrong callback. So just print a warning here and try to
+ * continue without address list.
+ */
+ WARN_ONCE(!alist.addrs,
+ "Failed to allocate memory for fprobe_addr_list, ftrace_ops will not be updated");
mutex_lock(&fprobe_mutex);
+again:
+ retry = false;
+ alist.index = 0;
rhltable_walk_enter(&fprobe_ip_table, &iter);
do {
rhashtable_walk_start(&iter);
while ((node = rhashtable_walk_next(&iter)) && !IS_ERR(node))
- fprobe_remove_node_in_module(mod, node, &alist);
+ if (fprobe_remove_node_in_module(mod, node, &alist) < 0) {
+ retry = true;
+ break;
+ }
rhashtable_walk_stop(&iter);
- } while (node == ERR_PTR(-EAGAIN));
+ } while (node == ERR_PTR(-EAGAIN) && !retry);
rhashtable_walk_exit(&iter);
+ /* Remove any ips from hash table(s) */
+ if (alist.index > 0) {
+ fprobe_remove_ips(alist.addrs, alist.index);
+ /*
+ * If we break rhashtable walk loop except for -EAGAIN, we need
+ * to restart looping from start for safety. Anyway, this is
+ * not a hotpath.
+ */
+ if (retry)
+ goto again;
+ }
- if (alist.index > 0)
- fprobe_set_ips(alist.addrs, alist.index, 1, 0);
mutex_unlock(&fprobe_mutex);
kfree(alist.addrs);
^ permalink raw reply related
* [PATCH v10 3/8] tracing/fprobe: Remove fprobe from hash in failure path
From: Masami Hiramatsu (Google) @ 2026-04-20 14:01 UTC (permalink / raw)
To: Steven Rostedt, Masami Hiramatsu
Cc: Menglong Dong, Mathieu Desnoyers, jiang.biao, linux-kernel,
linux-trace-kernel
In-Reply-To: <177669363667.132053.12454670015890859277.stgit@mhiramat.tok.corp.google.com>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
When register_fprobe_ips() fails, it tries to remove a list of
fprobe_hash_node from fprobe_ip_table, but it missed to remove
fprobe itself from fprobe_table. Moreover, when removing
the fprobe_hash_node which is added to rhltable once, it must
use kfree_rcu() after removing from rhltable.
To fix these issues, this reuses unregister_fprobe() internal
code to rollback the half-way registered fprobe.
Fixes: 4346ba160409 ("fprobe: Rewrite fprobe on function-graph tracer")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
Changes in v10:
- Add RCU sync for the registration failure.
Changes in v8:
- Fix to check return value of add_fprobe_hash() and break
loop if insert_fprobe_node() is failed.
Changes in v7:
- Remove RCU grace period wait, since fprobe itself is not
that is not needed.
Changes in v6:
- Wait for an RCU grace period before returning error in
unregister_fprobe_nolock().
Changes in v5:
- When rolling back an fprobe that failed to register, the
fprobe_hash_node are forcibly removed and warn if failure.
Changes in v4:
- Remove short-cut case because we always need to upadte ftrace_ops.
- Use guard(mutex) in register_fprobe_ips() to unlock it correctly.
- Remove redundant !ret check in register_fprobe_ips().
- Do not set hlist_array->size in failure case, instead,
hlist_array->array[i].fp is set only when insertion is succeeded.
Changes in v3:
- Newly added.
---
kernel/trace/fprobe.c | 90 ++++++++++++++++++++++++++-----------------------
1 file changed, 47 insertions(+), 43 deletions(-)
diff --git a/kernel/trace/fprobe.c b/kernel/trace/fprobe.c
index a2b659006e0e..621477ad0947 100644
--- a/kernel/trace/fprobe.c
+++ b/kernel/trace/fprobe.c
@@ -79,20 +79,27 @@ static const struct rhashtable_params fprobe_rht_params = {
};
/* Node insertion and deletion requires the fprobe_mutex */
-static int insert_fprobe_node(struct fprobe_hlist_node *node)
+static int insert_fprobe_node(struct fprobe_hlist_node *node, struct fprobe *fp)
{
+ int ret;
+
lockdep_assert_held(&fprobe_mutex);
- return rhltable_insert(&fprobe_ip_table, &node->hlist, fprobe_rht_params);
+ ret = rhltable_insert(&fprobe_ip_table, &node->hlist, fprobe_rht_params);
+ /* Set the fprobe pointer if insertion was successful. */
+ if (!ret)
+ WRITE_ONCE(node->fp, fp);
+ return ret;
}
/* Return true if there are synonims */
static bool delete_fprobe_node(struct fprobe_hlist_node *node)
{
- lockdep_assert_held(&fprobe_mutex);
bool ret;
- /* Avoid double deleting */
+ lockdep_assert_held(&fprobe_mutex);
+
+ /* Avoid double deleting and non-inserted nodes */
if (READ_ONCE(node->fp) != NULL) {
WRITE_ONCE(node->fp, NULL);
rhltable_remove(&fprobe_ip_table, &node->hlist,
@@ -756,7 +763,6 @@ static int fprobe_init(struct fprobe *fp, unsigned long *addrs, int num)
fp->hlist_array = hlist_array;
hlist_array->fp = fp;
for (i = 0; i < num; i++) {
- hlist_array->array[i].fp = fp;
addr = ftrace_location(addrs[i]);
if (!addr) {
fprobe_fail_cleanup(fp);
@@ -820,6 +826,8 @@ int register_fprobe(struct fprobe *fp, const char *filter, const char *notfilter
}
EXPORT_SYMBOL_GPL(register_fprobe);
+static int unregister_fprobe_nolock(struct fprobe *fp);
+
/**
* register_fprobe_ips() - Register fprobe to ftrace by address.
* @fp: A fprobe data structure to be registered.
@@ -846,28 +854,25 @@ int register_fprobe_ips(struct fprobe *fp, unsigned long *addrs, int num)
if (ret)
return ret;
- hlist_array = fp->hlist_array;
if (fprobe_is_ftrace(fp))
ret = fprobe_ftrace_add_ips(addrs, num);
else
ret = fprobe_graph_add_ips(addrs, num);
-
- if (!ret) {
- add_fprobe_hash(fp);
- for (i = 0; i < hlist_array->size; i++) {
- ret = insert_fprobe_node(&hlist_array->array[i]);
- if (ret)
- break;
- }
- /* fallback on insert error */
- if (ret) {
- for (i--; i >= 0; i--)
- delete_fprobe_node(&hlist_array->array[i]);
- }
+ if (ret) {
+ fprobe_fail_cleanup(fp);
+ return ret;
}
- if (ret)
- fprobe_fail_cleanup(fp);
+ hlist_array = fp->hlist_array;
+ ret = add_fprobe_hash(fp);
+ for (i = 0; i < hlist_array->size && !ret; i++)
+ ret = insert_fprobe_node(&hlist_array->array[i], fp);
+
+ if (ret) {
+ unregister_fprobe_nolock(fp);
+ /* In error case, wait for clean up safely. */
+ synchronize_rcu();
+ }
return ret;
}
@@ -911,27 +916,12 @@ bool fprobe_is_registered(struct fprobe *fp)
return true;
}
-/**
- * unregister_fprobe() - Unregister fprobe.
- * @fp: A fprobe data structure to be unregistered.
- *
- * Unregister fprobe (and remove ftrace hooks from the function entries).
- *
- * Return 0 if @fp is unregistered successfully, -errno if not.
- */
-int unregister_fprobe(struct fprobe *fp)
+static int unregister_fprobe_nolock(struct fprobe *fp)
{
- struct fprobe_hlist *hlist_array;
+ struct fprobe_hlist *hlist_array = fp->hlist_array;
unsigned long *addrs = NULL;
- int ret = 0, i, count;
+ int i, count;
- mutex_lock(&fprobe_mutex);
- if (!fp || !fprobe_registered(fp)) {
- ret = -EINVAL;
- goto out;
- }
-
- hlist_array = fp->hlist_array;
addrs = kcalloc(hlist_array->size, sizeof(unsigned long), GFP_KERNEL);
/*
* This will remove fprobe_hash_node from the hash table even if
@@ -957,12 +947,26 @@ int unregister_fprobe(struct fprobe *fp)
kfree_rcu(hlist_array, rcu);
fp->hlist_array = NULL;
+ kfree(addrs);
-out:
- mutex_unlock(&fprobe_mutex);
+ return 0;
+}
- kfree(addrs);
- return ret;
+/**
+ * unregister_fprobe() - Unregister fprobe.
+ * @fp: A fprobe data structure to be unregistered.
+ *
+ * Unregister fprobe (and remove ftrace hooks from the function entries).
+ *
+ * Return 0 if @fp is unregistered successfully, -errno if not.
+ */
+int unregister_fprobe(struct fprobe *fp)
+{
+ guard(mutex)(&fprobe_mutex);
+ if (!fp || !fprobe_registered(fp))
+ return -EINVAL;
+
+ return unregister_fprobe_nolock(fp);
}
EXPORT_SYMBOL_GPL(unregister_fprobe);
^ permalink raw reply related
* [PATCH v10 2/8] tracing/fprobe: Unregister fprobe even if memory allocation fails
From: Masami Hiramatsu (Google) @ 2026-04-20 14:00 UTC (permalink / raw)
To: Steven Rostedt, Masami Hiramatsu
Cc: Menglong Dong, Mathieu Desnoyers, jiang.biao, linux-kernel,
linux-trace-kernel
In-Reply-To: <177669363667.132053.12454670015890859277.stgit@mhiramat.tok.corp.google.com>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
unregister_fprobe() can fail under memory pressure because of memory
allocation failure, but this maybe called from module unloading, and
usually there is no way to retry it. Moreover. trace_fprobe does not
check the return value.
To fix this problem, unregister fprobe and fprobe_hash_node even if
working memory allocation fails.
Anyway, if the last fprobe is removed, the filter will be freed.
Fixes: 4346ba160409 ("fprobe: Rewrite fprobe on function-graph tracer")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
Changes in v9:
- Clear ftrace_ops filter when unregister it.
Changes in v7:
- Newly added.
---
kernel/trace/fprobe.c | 25 +++++++++++++++----------
1 file changed, 15 insertions(+), 10 deletions(-)
diff --git a/kernel/trace/fprobe.c b/kernel/trace/fprobe.c
index af9ba7250874..a2b659006e0e 100644
--- a/kernel/trace/fprobe.c
+++ b/kernel/trace/fprobe.c
@@ -324,9 +324,10 @@ static void fprobe_ftrace_remove_ips(unsigned long *addrs, int num)
lockdep_assert_held(&fprobe_mutex);
fprobe_ftrace_active--;
- if (!fprobe_ftrace_active)
+ if (!fprobe_ftrace_active) {
unregister_ftrace_function(&fprobe_ftrace_ops);
- if (num)
+ ftrace_free_filter(&fprobe_ftrace_ops);
+ } else if (num)
ftrace_set_filter_ips(&fprobe_ftrace_ops, addrs, num, 1, 0);
}
@@ -525,10 +526,10 @@ static void fprobe_graph_remove_ips(unsigned long *addrs, int num)
fprobe_graph_active--;
/* Q: should we unregister it ? */
- if (!fprobe_graph_active)
+ if (!fprobe_graph_active) {
unregister_ftrace_graph(&fprobe_graph_ops);
-
- if (num)
+ ftrace_free_filter(&fprobe_graph_ops.ops);
+ } else if (num)
ftrace_set_filter_ips(&fprobe_graph_ops.ops, addrs, num, 1, 0);
}
@@ -932,15 +933,19 @@ int unregister_fprobe(struct fprobe *fp)
hlist_array = fp->hlist_array;
addrs = kcalloc(hlist_array->size, sizeof(unsigned long), GFP_KERNEL);
- if (!addrs) {
- ret = -ENOMEM; /* TODO: Fallback to one-by-one loop */
- goto out;
- }
+ /*
+ * This will remove fprobe_hash_node from the hash table even if
+ * memory allocation fails. However, ftrace_ops will not be updated.
+ * Anyway, when the last fprobe is unregistered, ftrace_ops is also
+ * unregistered.
+ */
+ if (!addrs)
+ pr_warn("Failed to allocate working array. ftrace_ops may not sync.\n");
/* Remove non-synonim ips from table and hash */
count = 0;
for (i = 0; i < hlist_array->size; i++) {
- if (!delete_fprobe_node(&hlist_array->array[i]))
+ if (!delete_fprobe_node(&hlist_array->array[i]) && addrs)
addrs[count++] = hlist_array->array[i].addr;
}
del_fprobe_hash(fp);
^ permalink raw reply related
* [PATCH v10 1/8] tracing/fprobe: Reject registration of a registered fprobe before init
From: Masami Hiramatsu (Google) @ 2026-04-20 14:00 UTC (permalink / raw)
To: Steven Rostedt, Masami Hiramatsu
Cc: Menglong Dong, Mathieu Desnoyers, jiang.biao, linux-kernel,
linux-trace-kernel
In-Reply-To: <177669363667.132053.12454670015890859277.stgit@mhiramat.tok.corp.google.com>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reject registration of a registered fprobe which is on the fprobe
hash table before initializing fprobe.
The add_fprobe_hash() checks this re-register fprobe, but since
fprobe_init() clears hlist_array field, it is too late to check it.
It has to check the re-registration before touncing fprobe.
Fixes: 4346ba160409 ("fprobe: Rewrite fprobe on function-graph tracer")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
Changes in v6:
- Newly added.
---
kernel/trace/fprobe.c | 21 ++++++++++-----------
1 file changed, 10 insertions(+), 11 deletions(-)
diff --git a/kernel/trace/fprobe.c b/kernel/trace/fprobe.c
index 56d145017902..af9ba7250874 100644
--- a/kernel/trace/fprobe.c
+++ b/kernel/trace/fprobe.c
@@ -4,6 +4,7 @@
*/
#define pr_fmt(fmt) "fprobe: " fmt
+#include <linux/cleanup.h>
#include <linux/err.h>
#include <linux/fprobe.h>
#include <linux/kallsyms.h>
@@ -107,7 +108,7 @@ static bool delete_fprobe_node(struct fprobe_hlist_node *node)
}
/* Check existence of the fprobe */
-static bool is_fprobe_still_exist(struct fprobe *fp)
+static bool fprobe_registered(struct fprobe *fp)
{
struct hlist_head *head;
struct fprobe_hlist *fph;
@@ -120,7 +121,7 @@ static bool is_fprobe_still_exist(struct fprobe *fp)
}
return false;
}
-NOKPROBE_SYMBOL(is_fprobe_still_exist);
+NOKPROBE_SYMBOL(fprobe_registered);
static int add_fprobe_hash(struct fprobe *fp)
{
@@ -132,9 +133,6 @@ static int add_fprobe_hash(struct fprobe *fp)
if (WARN_ON_ONCE(!fph))
return -EINVAL;
- if (is_fprobe_still_exist(fp))
- return -EEXIST;
-
head = &fprobe_table[hash_ptr(fp, FPROBE_HASH_BITS)];
hlist_add_head_rcu(&fp->hlist_array->hlist, head);
return 0;
@@ -149,7 +147,7 @@ static int del_fprobe_hash(struct fprobe *fp)
if (WARN_ON_ONCE(!fph))
return -EINVAL;
- if (!is_fprobe_still_exist(fp))
+ if (!fprobe_registered(fp))
return -ENOENT;
fph->fp = NULL;
@@ -480,7 +478,7 @@ static void fprobe_return(struct ftrace_graph_ret *trace,
if (!fp)
break;
curr += FPROBE_HEADER_SIZE_IN_LONG;
- if (is_fprobe_still_exist(fp) && !fprobe_disabled(fp)) {
+ if (fprobe_registered(fp) && !fprobe_disabled(fp)) {
if (WARN_ON_ONCE(curr + size > size_words))
break;
fp->exit_handler(fp, trace->func, ret_ip, fregs,
@@ -839,12 +837,14 @@ int register_fprobe_ips(struct fprobe *fp, unsigned long *addrs, int num)
struct fprobe_hlist *hlist_array;
int ret, i;
+ guard(mutex)(&fprobe_mutex);
+ if (fprobe_registered(fp))
+ return -EEXIST;
+
ret = fprobe_init(fp, addrs, num);
if (ret)
return ret;
- mutex_lock(&fprobe_mutex);
-
hlist_array = fp->hlist_array;
if (fprobe_is_ftrace(fp))
ret = fprobe_ftrace_add_ips(addrs, num);
@@ -864,7 +864,6 @@ int register_fprobe_ips(struct fprobe *fp, unsigned long *addrs, int num)
delete_fprobe_node(&hlist_array->array[i]);
}
}
- mutex_unlock(&fprobe_mutex);
if (ret)
fprobe_fail_cleanup(fp);
@@ -926,7 +925,7 @@ int unregister_fprobe(struct fprobe *fp)
int ret = 0, i, count;
mutex_lock(&fprobe_mutex);
- if (!fp || !is_fprobe_still_exist(fp)) {
+ if (!fp || !fprobe_registered(fp)) {
ret = -EINVAL;
goto out;
}
^ permalink raw reply related
* [PATCH v10 0/8] tracing/fprobe: Fix fprobe_ip_table related bugs
From: Masami Hiramatsu (Google) @ 2026-04-20 14:00 UTC (permalink / raw)
To: Steven Rostedt, Masami Hiramatsu
Cc: Menglong Dong, Mathieu Desnoyers, jiang.biao, linux-kernel,
linux-trace-kernel
Hi,
Here is the 10th version of fprobe bugfix series.
The previous version is here.
https://lore.kernel.org/all/177644266147.584467.8179035927318998910.stgit@mhiramat.tok.corp.google.com/
This version fixes minor bugs. Add an RCU sync when fprobe
registration failed [3/8], call __fprobe_f*_unregister()
if there is no node on hash table even if it fails to allocate
working memory [6/8]. And fix to check "normalized" module
name in error path [7/8].
Sashiko pointed other issues[1],
[1] https://sashiko.dev/#/patchset/177644266147.584467.8179035927318998910.stgit%40mhiramat.tok.corp.google.com
- Is there a missing RCU read-side critical section here?
-> No, it is under preemption disabled. It seems a preempt
disabling is stronger restriction for RCU read-side critical
section.
- Does the error fallback path leave dangling pointers in the global hash
table?
Yes, but it is not introduced by this, and is fixed by [3/8].
Thank you!
Masami Hiramatsu (Google) (8):
tracing/fprobe: Reject registration of a registered fprobe before init
tracing/fprobe: Unregister fprobe even if memory allocation fails
tracing/fprobe: Remove fprobe from hash in failure path
tracing/fprobe: Avoid kcalloc() in rcu_read_lock section
tracing/fprobe: Check the same type fprobe on table as the unregistered one
tracing/fprobe: Fix to unregister ftrace_ops if it is empty on module unloading
selftests/ftrace: Add a testcase for fprobe events on module
selftests/ftrace: Add a testcase for multiple fprobe events
kernel/trace/fprobe.c | 472 +++++++++++++-------
.../test.d/dynevent/add_remove_fprobe_module.tc | 87 ++++
.../test.d/dynevent/add_remove_multiple_fprobe.tc | 69 +++
3 files changed, 466 insertions(+), 162 deletions(-)
create mode 100644 tools/testing/selftests/ftrace/test.d/dynevent/add_remove_fprobe_module.tc
create mode 100644 tools/testing/selftests/ftrace/test.d/dynevent/add_remove_multiple_fprobe.tc
base-commit: e0a384434ae1bdfb03954c46c464e3dbd3223ad6
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply
* Re: [PATCH 7.2 v16 04/13] mm/khugepaged: generalize __collapse_huge_page_* for mTHP support
From: Usama Arif @ 2026-04-20 13:55 UTC (permalink / raw)
To: Nico Pache
Cc: Usama Arif, linux-doc, linux-kernel, linux-mm, linux-trace-kernel,
akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
lance.yang, Liam.Howlett, ljs, mathieu.desnoyers, matthew.brost,
mhiramat, mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
zokeefe
In-Reply-To: <20260419185750.260784-5-npache@redhat.com>
On Sun, 19 Apr 2026 12:57:41 -0600 Nico Pache <npache@redhat.com> wrote:
> generalize the order of the __collapse_huge_page_* and collapse_max_*
> functions to support future mTHP collapse.
>
> The current mechanism for determining collapse with the
> khugepaged_max_ptes_none value is not designed with mTHP in mind. This
> raises a key design issue: if we support user defined max_pte_none values
> (even those scaled by order), a collapse of a lower order can introduces
> an feedback loop, or "creep", when max_ptes_none is set to a value greater
> than HPAGE_PMD_NR / 2.
>
> With this configuration, a successful collapse to order N will populate
> enough pages to satisfy the collapse condition on order N+1 on the next
> scan. This leads to unnecessary work and memory churn.
>
> To fix this issue introduce a helper function that will limit mTHP
> collapse support to two max_ptes_none values, 0 and HPAGE_PMD_NR - 1.
> This effectively supports two modes:
>
> - max_ptes_none=0: never introduce new none-pages for mTHP collapse.
> - max_ptes_none=511 (on 4k pagesz): Always collapse to the highest
> available mTHP order.
>
> This removes the possiblilty of "creep", while not modifying any uAPI
> expectations. A warning will be emitted if any non-supported
> max_ptes_none value is configured with mTHP enabled.
>
> mTHP collapse will not honor the khugepaged_max_ptes_shared or
> khugepaged_max_ptes_swap parameters, and will fail if it encounters a
> shared or swapped entry.
>
> No functional changes in this patch; however it defines future behavior
> for mTHP collapse.
>
> Co-developed-by: Dev Jain <dev.jain@arm.com>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
> mm/khugepaged.c | 124 ++++++++++++++++++++++++++++++++++--------------
> 1 file changed, 88 insertions(+), 36 deletions(-)
>
Small nits. Most might not need change.
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index f42b55421191..283bb63854a5 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -352,51 +352,86 @@ static bool pte_none_or_zero(pte_t pte)
> * collapse_max_ptes_none - Calculate maximum allowed empty PTEs for collapse
> * @cc: The collapse control struct
> * @vma: The vma to check for userfaultfd
> + * @order: The folio order being collapsed to
> *
> * If we are not in khugepaged mode use HPAGE_PMD_NR to allow any
> - * empty page.
> + * empty page. For PMD-sized collapses (order == HPAGE_PMD_ORDER), use the
> + * configured khugepaged_max_ptes_none value.
> + *
> + * For mTHP collapses, we currently only support khugepaged_max_pte_none values
> + * of 0 or (KHUGEPAGED_MAX_PTES_LIMIT). Any other value will emit a warning and
> + * no mTHP collapse will be attempted
> *
> * Return: Maximum number of empty PTEs allowed for the collapse operation
> */
> -static unsigned int collapse_max_ptes_none(struct collapse_control *cc,
> - struct vm_area_struct *vma)
> +static int collapse_max_ptes_none(struct collapse_control *cc,
> + struct vm_area_struct *vma, unsigned int order)
> {
> if (vma && userfaultfd_armed(vma))
> return 0;
> if (!cc->is_khugepaged)
> return HPAGE_PMD_NR;
> - return khugepaged_max_ptes_none;
> + if (is_pmd_order(order))
> + return khugepaged_max_ptes_none;
> + /* Zero/non-present collapse disabled. */
> + if (!khugepaged_max_ptes_none)
> + return 0;
> + if (khugepaged_max_ptes_none == KHUGEPAGED_MAX_PTES_LIMIT)
> + return (1 << order) - 1;
> +
There are 2 reads of khugepaged_max_ptes_none here.
A concurrent sysctl write between reads can yield "0 then non-zero" or "LIMIT
then mid-value".
Would be good to just snapshot once at the start of the function and use that
value?
> + pr_warn_once("mTHP collapse only supports max_ptes_none values of 0 or %u\n",
> + KHUGEPAGED_MAX_PTES_LIMIT);
IMO, warn_once can get lost quickly in dmesg. Maybe pr_warn_ratelimited?
Not sure what others opinions are..
> + return -EINVAL;
> }
>
> /**
> * collapse_max_ptes_shared - Calculate maximum allowed shared PTEs for collapse
> * @cc: The collapse control struct
> + * @order: The folio order being collapsed to
> *
> * If we are not in khugepaged mode use HPAGE_PMD_NR to allow any
> * shared page.
> *
> + * For mTHP collapses, we currently dont support collapsing memory with
> + * shared memory.
> + *
> * Return: Maximum number of shared PTEs allowed for the collapse operation
> */
> -static unsigned int collapse_max_ptes_shared(struct collapse_control *cc)
> +static unsigned int collapse_max_ptes_shared(struct collapse_control *cc,
> + unsigned int order)
> {
> if (!cc->is_khugepaged)
> return HPAGE_PMD_NR;
> + if (!is_pmd_order(order))
> + return 0;
> +
> return khugepaged_max_ptes_shared;
> }
>
> /**
> * collapse_max_ptes_swap - Calculate maximum allowed swap PTEs for collapse
> * @cc: The collapse control struct
> + * @order: The folio order being collapsed to
> *
> * If we are not in khugepaged mode use HPAGE_PMD_NR to allow any
> * swap page.
> *
> + * For PMD-sized collapses (order == HPAGE_PMD_ORDER), use the configured
> + * khugepaged_max_ptes_swap value.
> + *
> + * For mTHP collapses, we currently dont support collapsing memory with
> + * swapped out memory.
> + *
> * Return: Maximum number of swap PTEs allowed for the collapse operation
> */
> -static unsigned int collapse_max_ptes_swap(struct collapse_control *cc)
> +static unsigned int collapse_max_ptes_swap(struct collapse_control *cc,
> + unsigned int order)
> {
> if (!cc->is_khugepaged)
> return HPAGE_PMD_NR;
> + if (!is_pmd_order(order))
> + return 0;
> +
> return khugepaged_max_ptes_swap;
> }
>
> @@ -590,18 +625,22 @@ static void release_pte_pages(pte_t *pte, pte_t *_pte,
>
> static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
> unsigned long start_addr, pte_t *pte, struct collapse_control *cc,
> - struct list_head *compound_pagelist)
> + unsigned int order, struct list_head *compound_pagelist)
> {
> + const unsigned long nr_pages = 1UL << order;
> struct page *page = NULL;
> struct folio *folio = NULL;
> unsigned long addr = start_addr;
> pte_t *_pte;
> int none_or_zero = 0, shared = 0, referenced = 0;
> enum scan_result result = SCAN_FAIL;
> - unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma);
> - unsigned int max_ptes_shared = collapse_max_ptes_shared(cc);
> + int max_ptes_none = collapse_max_ptes_none(cc, vma, order);
> + unsigned int max_ptes_shared = collapse_max_ptes_shared(cc, order);
> +
> + if (max_ptes_none < 0)
> + return result;
Would a dedicated SCAN_INVALID_PTES_NONE make more sense here instead
of SCAN_FAIL?
>
> - for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
> + for (_pte = pte; _pte < pte + nr_pages;
> _pte++, addr += PAGE_SIZE) {
> pte_t pteval = ptep_get(_pte);
> if (pte_none_or_zero(pteval)) {
> @@ -734,18 +773,18 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
> }
>
> static void __collapse_huge_page_copy_succeeded(pte_t *pte,
> - struct vm_area_struct *vma,
> - unsigned long address,
> - spinlock_t *ptl,
> - struct list_head *compound_pagelist)
> + struct vm_area_struct *vma, unsigned long address,
> + spinlock_t *ptl, unsigned int order,
> + struct list_head *compound_pagelist)
> {
> - unsigned long end = address + HPAGE_PMD_SIZE;
> + const unsigned long nr_pages = 1UL << order;
> + unsigned long end = address + (PAGE_SIZE << order);
> struct folio *src, *tmp;
> pte_t pteval;
> pte_t *_pte;
> unsigned int nr_ptes;
>
> - for (_pte = pte; _pte < pte + HPAGE_PMD_NR; _pte += nr_ptes,
> + for (_pte = pte; _pte < pte + nr_pages; _pte += nr_ptes,
> address += nr_ptes * PAGE_SIZE) {
> nr_ptes = 1;
> pteval = ptep_get(_pte);
> @@ -798,13 +837,11 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
> }
>
> static void __collapse_huge_page_copy_failed(pte_t *pte,
> - pmd_t *pmd,
> - pmd_t orig_pmd,
> - struct vm_area_struct *vma,
> - struct list_head *compound_pagelist)
> + pmd_t *pmd, pmd_t orig_pmd, struct vm_area_struct *vma,
> + unsigned int order, struct list_head *compound_pagelist)
> {
> + const unsigned long nr_pages = 1UL << order;
> spinlock_t *pmd_ptl;
> -
Shouldn't remove the newline above?
> /*
> * Re-establish the PMD to point to the original page table
> * entry. Restoring PMD needs to be done prior to releasing
> @@ -818,7 +855,7 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
> * Release both raw and compound pages isolated
> * in __collapse_huge_page_isolate.
> */
> - release_pte_pages(pte, pte + HPAGE_PMD_NR, compound_pagelist);
> + release_pte_pages(pte, pte + nr_pages, compound_pagelist);
> }
>
> /*
> @@ -838,16 +875,16 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
> */
> static enum scan_result __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
> pmd_t *pmd, pmd_t orig_pmd, struct vm_area_struct *vma,
> - unsigned long address, spinlock_t *ptl,
> + unsigned long address, spinlock_t *ptl, unsigned int order,
> struct list_head *compound_pagelist)
> {
> + const unsigned long nr_pages = 1UL << order;
> unsigned int i;
> enum scan_result result = SCAN_SUCCEED;
> -
Same here?
> /*
> * Copying pages' contents is subject to memory poison at any iteration.
> */
> - for (i = 0; i < HPAGE_PMD_NR; i++) {
> + for (i = 0; i < nr_pages; i++) {
> pte_t pteval = ptep_get(pte + i);
> struct page *page = folio_page(folio, i);
> unsigned long src_addr = address + i * PAGE_SIZE;
> @@ -866,10 +903,10 @@ static enum scan_result __collapse_huge_page_copy(pte_t *pte, struct folio *foli
>
> if (likely(result == SCAN_SUCCEED))
> __collapse_huge_page_copy_succeeded(pte, vma, address, ptl,
> - compound_pagelist);
> + order, compound_pagelist);
> else
> __collapse_huge_page_copy_failed(pte, pmd, orig_pmd, vma,
> - compound_pagelist);
> + order, compound_pagelist);
>
> return result;
> }
> @@ -1040,12 +1077,12 @@ static enum scan_result check_pmd_still_valid(struct mm_struct *mm,
> * Returns result: if not SCAN_SUCCEED, mmap_lock has been released.
> */
> static enum scan_result __collapse_huge_page_swapin(struct mm_struct *mm,
> - struct vm_area_struct *vma, unsigned long start_addr, pmd_t *pmd,
> - int referenced)
> + struct vm_area_struct *vma, unsigned long start_addr,
> + pmd_t *pmd, int referenced, unsigned int order)
Will probably find out in later reviews, but there is tracepoint in __collapse_huge_page_swapin.
Would be good to add order in that tracepoint if you are adding order here?
> {
> int swapped_in = 0;
> vm_fault_t ret = 0;
> - unsigned long addr, end = start_addr + (HPAGE_PMD_NR * PAGE_SIZE);
> + unsigned long addr, end = start_addr + (PAGE_SIZE << order);
> enum scan_result result;
> pte_t *pte = NULL;
> spinlock_t *ptl;
> @@ -1077,6 +1114,19 @@ static enum scan_result __collapse_huge_page_swapin(struct mm_struct *mm,
> pte_present(vmf.orig_pte))
> continue;
>
> + /*
> + * TODO: Support swapin without leading to further mTHP
> + * collapses. Currently bringing in new pages via swapin may
> + * cause a future higher order collapse on a rescan of the same
> + * range.
> + */
> + if (!is_pmd_order(order)) {
Would it be good to introduce this in the patch that activates it? No strong
preference btw. Just that its dead code in this patch itself.
> + pte_unmap(pte);
> + mmap_read_unlock(mm);
> + result = SCAN_EXCEED_SWAP_PTE;
> + goto out;
> + }
> +
> vmf.pte = pte;
> vmf.ptl = ptl;
> ret = do_swap_page(&vmf);
> @@ -1196,7 +1246,7 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> * that case. Continuing to collapse causes inconsistency.
> */
> result = __collapse_huge_page_swapin(mm, vma, address, pmd,
> - referenced);
> + referenced, HPAGE_PMD_ORDER);
> if (result != SCAN_SUCCEED)
> goto out_nolock;
> }
> @@ -1244,6 +1294,7 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
> if (pte) {
> result = __collapse_huge_page_isolate(vma, address, pte, cc,
> + HPAGE_PMD_ORDER,
> &compound_pagelist);
> spin_unlock(pte_ptl);
> } else {
> @@ -1274,6 +1325,7 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>
> result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
> vma, address, pte_ptl,
> + HPAGE_PMD_ORDER,
> &compound_pagelist);
> pte_unmap(pte);
> if (unlikely(result != SCAN_SUCCEED))
> @@ -1318,9 +1370,9 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> unsigned long addr;
> spinlock_t *ptl;
> int node = NUMA_NO_NODE, unmapped = 0;
> - unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma);
> - unsigned int max_ptes_shared = collapse_max_ptes_shared(cc);
> - unsigned int max_ptes_swap = collapse_max_ptes_swap(cc);
> + int max_ptes_none = collapse_max_ptes_none(cc, vma, HPAGE_PMD_ORDER);
> + unsigned int max_ptes_shared = collapse_max_ptes_shared(cc, HPAGE_PMD_ORDER);
> + unsigned int max_ptes_swap = collapse_max_ptes_swap(cc, HPAGE_PMD_ORDER);
>
> VM_BUG_ON(start_addr & ~HPAGE_PMD_MASK);
>
> @@ -2371,8 +2423,8 @@ static enum scan_result collapse_scan_file(struct mm_struct *mm,
> int present, swap;
> int node = NUMA_NO_NODE;
> enum scan_result result = SCAN_SUCCEED;
> - unsigned int max_ptes_none = collapse_max_ptes_none(cc, NULL);
> - unsigned int max_ptes_swap = collapse_max_ptes_swap(cc);
> + int max_ptes_none = collapse_max_ptes_none(cc, NULL, HPAGE_PMD_ORDER);
> + unsigned int max_ptes_swap = collapse_max_ptes_swap(cc, HPAGE_PMD_ORDER);
>
> present = 0;
> swap = 0;
> --
> 2.53.0
>
>
^ permalink raw reply
* Re: [PATCH 0/3] mm: split the file's i_mmap tree for NUMA
From: Pedro Falcato @ 2026-04-20 13:48 UTC (permalink / raw)
To: Huang Shijie
Cc: Mateusz Guzik, akpm, viro, brauner, linux-mm, linux-kernel,
linux-arm-kernel, linux-fsdevel, muchun.song, osalvador,
linux-trace-kernel, linux-perf-users, linux-parisc, nvdimm,
zhongyuan, fangbaoshun, yingzhiwei
In-Reply-To: <aeWLCxru6cLWsxvQ@SH-HV00110.Hygon.cn>
BTW you're missing _a lot_ of CC's here, including the whole of mm/rmap.c
maintainership.
On Mon, Apr 20, 2026 at 10:10:19AM +0800, Huang Shijie wrote:
> On Mon, Apr 13, 2026 at 05:33:21PM +0200, Mateusz Guzik wrote:
> > On Mon, Apr 13, 2026 at 02:20:39PM +0800, Huang Shijie wrote:
> > > In NUMA, there are maybe many NUMA nodes and many CPUs.
> > > For example, a Hygon's server has 12 NUMA nodes, and 384 CPUs.
> > > In the UnixBench tests, there is a test "execl" which tests
> > > the execve system call.
> > >
> > > When we test our server with "./Run -c 384 execl",
> > > the test result is not good enough. The i_mmap locks contended heavily on
> > > "libc.so" and "ld.so". For example, the i_mmap tree for "libc.so" can have
> > > over 6000 VMAs, all the VMAs can be in different NUMA mode.
> > > The insert/remove operations do not run quickly enough.
> > >
> > > patch 1 & patch 2 are try to hide the direct access of i_mmap.
> > > patch 3 splits the i_mmap into sibling trees, and we can get better
> > > performance with this patch set:
> > > we can get 77% performance improvement(10 times average)
> > >
> >
> > To my reading you kept the lock as-is and only distributed the protected
> > state.
> >
> > While I don't doubt the improvement, I'm confident should you take a
> > look at the profile you are going to find this still does not scale with
> > rwsem being one of the problems (there are other global locks, some of
> > which have experimental patches for).
> >
> > Apart from that this does nothing to help high core systems which are
> > all one node, which imo puts another question mark on this specific
> > proposal.
> >
> > Of course one may question whether a RB tree is the right choice here,
> > it may be the lock-protected cost can go way down with merely a better
> > data structure.
> >
> > Regardless of that, for actual scalability, there will be no way around
> > decentralazing locking around this and partitioning per some core count
> > (not just by numa awareness).
> >
> > Decentralizing locking is definitely possible, but I have not looked
> > into specifics of how problematic it is. Best case scenario it will
> > merely with separate locks. Worst case scenario something needs a fully
> > stabilized state for traversal, in that case another rw lock can be
> > slapped around this, creating locking order read lock -> per-subset
> > write lock -- this will suffer scalability due to the read locking, but
> > it will still scale drastically better as apart from that there will be
> > no serialization. In this setting the problematic consumer will write
> > lock the new thing to stabilize the state.
> >
> I thought over again.
> I can change this patch set to support the non-NUMA case by:
> 1.) Still use one rw lock.
No. This doesn't help anything.
> 2.) For NUMA, keep the patch set as it is.
Please no. No NUMA vs non-NUMA case.
> 3.) For non-NUMA case, split the i_mmap tree to several subtrees.
> For example, if a machine has 192 CPUs, split the 32 CPUs as a tree.
If lock contention is the problem, I don't see how splitting the tree helps,
unless it helps reduce lock hold time in a way that randomly helps your workload.
But that's entirely random.
>
> So extend the patch set to support both the NUMA and non-NUMA machines.
FYI I've discussed some concrete ideas for reworking file rmap with Mateusz.
I'll be giving them a shot. Note that this needs to be done _carefully_,
particularly as there are some hidden assumptions wrt forking that aren't
very clear as to how they work[1].
[1] https://lore.kernel.org/all/bnukmnuxxuhdfeasjz33miemgr7w35c4aa6pqdmgupx7oxmeeb@gozgc3yxhcdd/
--
Pedro
^ permalink raw reply
* [PATCH AUTOSEL 7.0-6.18] rtla: Handle pthread_create() failure properly
From: Sasha Levin @ 2026-04-20 13:19 UTC (permalink / raw)
To: patches, stable
Cc: Wander Lairson Costa, Tomas Glozar, Sasha Levin, rostedt,
linux-trace-kernel, linux-kernel
In-Reply-To: <20260420132314.1023554-1-sashal@kernel.org>
From: Wander Lairson Costa <wander@redhat.com>
[ Upstream commit d847188bb92b14518a04d7542e44928a22060847 ]
Add proper error handling when pthread_create() fails to create the
timerlat user-space dispatcher thread. Previously, the code only logged
an error message but continued execution, which could lead to undefined
behavior when the tool later expects the thread to be running.
When pthread_create() returns an error, the function now jumps to the
out_trace error path to properly clean up resources and exit. This
ensures consistent error handling and prevents the tool from running
in an invalid state without the required user-space thread.
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
Link: https://lore.kernel.org/r/20260309195040.1019085-10-wander@redhat.com
Signed-off-by: Tomas Glozar <tglozar@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
Now let me compile the full analysis.
---
## Phase 1: Commit Message Forensics
**Step 1.1: Subject Line**
- Subsystem: `rtla` (Real-Time Linux Analysis tool, under
`tools/tracing/rtla`)
- Action verb: "Handle" (strongly suggests fixing incorrect handling - a
bug fix pattern)
- Summary: Add proper error handling when pthread_create() fails for the
timerlat user-space dispatcher thread.
Record: [rtla] [Handle] [Add missing error path when pthread_create
fails]
**Step 1.2: Tags**
- `Signed-off-by: Wander Lairson Costa <wander@redhat.com>` - author,
Red Hat engineer
- `Link:
https://lore.kernel.org/r/20260309195040.1019085-10-wander@redhat.com`
- part of a larger series (patch 10 of a series)
- `Signed-off-by: Tomas Glozar <tglozar@redhat.com>` - maintainer sign-
off
- No Fixes: tag (expected for autosel candidates)
- No Reported-by: (found by code review, not user report)
- No Cc: stable (expected)
Record: Author is Wander (Red Hat, active rtla contributor). Part of a
larger series (patch 10). Accepted by Tomas Glozar (rtla maintainer).
**Step 1.3: Body Text**
The commit message clearly describes: when `pthread_create()` fails, the
code only logged an error but continued execution. This leads to the
tool running in an invalid state where it expects user-space threads
that don't exist.
Record: Bug = missing error exit on pthread_create failure. Symptom =
tool runs without required user-space thread. Root cause = missing `goto
out_trace` on error path.
**Step 1.4: Hidden Bug Fix Detection**
"Handle ... properly" is a classic bug-fix pattern. This IS a bug fix -
it adds a missing error exit path.
Record: Yes, this is a clear bug fix despite not using the word "fix" in
the subject.
## Phase 2: Diff Analysis
**Step 2.1: Changes Inventory**
- 1 file modified: `tools/tracing/rtla/src/common.c`
- Net change: +3 lines / -1 line (added braces + `goto out_trace;`)
- Function modified: `run_tool()`
- Scope: Single-file, surgical fix
**Step 2.2: Code Flow Change**
Before: `pthread_create()` failure logged an error message but execution
continued to `ops->enable(tool)`, `ops->main(tool)`, etc.
After: `pthread_create()` failure logs error and jumps to `out_trace`
for proper cleanup and exit.
**Step 2.3: Bug Mechanism**
Category: (a) Error path fix. The code was missing a `goto` to the error
cleanup path when `pthread_create()` failed. Without it, the tool runs
without the user-space timerlat threads, producing incorrect/misleading
measurements.
**Step 2.4: Fix Quality**
- Obviously correct: follows the identical pattern used by all other
error checks in the same function (lines 247, 253, 280, 287)
- Minimal/surgical: only adds braces and a `goto`
- Regression risk: extremely low - only changes behavior when
`pthread_create()` fails (which is already an error condition)
Record: Fix is obviously correct, minimal, and consistent with
surrounding code patterns. No regression risk.
## Phase 3: Git History Investigation
**Step 3.1: Blame**
The buggy code (lines 257-276) was introduced by commit `2f3172f9dd58cc`
("tools/rtla: Consolidate code between osnoise/timerlat and hist/top")
by Crystal Wood, September 2025. However, tracing further back, the
original missing error handling existed since commit `cdca4f4e5e8ea`
("rtla/timerlat_top: Add timerlat user-space support") by Daniel Bristot
de Oliveira, June 2023 (v6.5-rc1).
Record: Bug introduced in v6.5-rc1, present in all stable trees from
6.6.y onward. The consolidation commit just carried the bug forward into
`common.c`.
**Step 3.2: Fixes Tag**
No Fixes: tag present (expected for autosel candidates). The bug
logically traces to `cdca4f4e5e8ea` (v6.5-rc1).
**Step 3.3: File History**
The file has been actively developed. Recent commits include
consolidations of option parsing, volatile fix for stop_tracing, and
other improvements. The author (Wander Lairson Costa) is a prolific
contributor to rtla.
**Step 3.4: Author**
Wander has at least 17 commits in rtla (including multiple fixes like
NULL pointer dereference fix, parse return value doc fix, volatile fix).
He is a regular contributor and maintainer-level contributor for rtla.
Record: Author is a regular, trusted contributor to this subsystem.
**Step 3.5: Dependencies**
The `run_tool()` function and the `out_trace` label already exist in the
7.0 tree. No dependencies needed. However, the `run_tool()` function
only exists since the consolidation commit `2f3172f9dd58cc` (~v6.18
cycle). In older stable trees (6.6.y, 6.12.y), the same fix would need
to target `timerlat_top.c` and `timerlat_hist.c` instead.
Record: For 7.0.y, applies standalone with no dependencies. For older
trees, would need different patches.
## Phase 4: Mailing List and External Research
**Step 4.1-4.2: Patch Discussion**
The commit's Link tag shows it's patch 10 of a series (Message-ID
`20260309195040.1019085-10-wander@redhat.com`). Lore.kernel.org was
blocked by anti-bot protection, but b4 dig confirmed the author's other
patches in the same series (e.g., `20260106133655.249887-16` for the
volatile fix). The patch was accepted and signed off by maintainer Tomas
Glozar.
Record: Part of a larger cleanup/fix series. Accepted by rtla
maintainer.
**Step 4.3-4.5: Bug Report / Stable Discussion**
No explicit bug report found. This appears to be found by code
review/audit, not by a user hitting it in practice.
Record: No user reports. Found by code inspection.
## Phase 5: Code Semantic Analysis
**Step 5.1: Functions Modified**
Only `run_tool()` in `common.c`.
**Step 5.2: Callers**
`run_tool()` is the unified entry point for all rtla tool modes (osnoise
top/hist, timerlat top/hist). It's called from each tool's main
function.
**Step 5.3-5.4: Call Chain**
When `pthread_create()` fails and execution continues:
1. `ops->enable(tool)` - enables tracing infrastructure
2. `ops->main(tool)` - runs main measurement loop (top_main_loop or
hist_main_loop)
3. Both main loops check `params->user.stopped_running` to detect if
user threads died
4. Since threads were never created, `stopped_running` stays at 0, so
the tool thinks threads are still running
5. The tool produces measurements and statistics without user-space
thread contributions
**Step 5.5: Similar Patterns**
The original code in `timerlat_top.c` and `timerlat_hist.c` (pre-
consolidation) had the identical missing error handling pattern,
confirming this is a systematic bug.
## Phase 6: Cross-Referencing and Stable Tree Analysis
**Step 6.1: Buggy Code in Stable**
The `run_tool()` function in `common.c` only exists since ~v6.18 cycle.
In 7.0.y, the code exists as-is and the patch applies cleanly. For older
stable trees, different patches targeting `timerlat_top.c` and
`timerlat_hist.c` would be needed.
**Step 6.2: Backport Complications**
For 7.0.y: clean apply expected - no conflicts.
**Step 6.3: Related Fixes**
No other fix for this specific issue found in stable.
## Phase 7: Subsystem and Maintainer Context
**Step 7.1: Subsystem**
`tools/tracing/rtla` - userspace real-time latency analysis tool.
Criticality: PERIPHERAL (userspace tool, not kernel code), but important
for real-time system validation.
**Step 7.2: Activity**
Very actively developed - 14+ commits since the consolidation.
## Phase 8: Impact and Risk Assessment
**Step 8.1: Who Is Affected**
Users of the rtla timerlat tool with `--user-threads` option,
specifically when `pthread_create()` fails.
**Step 8.2: Trigger Conditions**
Rare - requires `--user-threads` mode AND `pthread_create()` failure
(typically due to resource exhaustion or system limits).
**Step 8.3: Failure Mode Severity**
When triggered: tool continues running in invalid state, producing
measurements without user-space thread contributions. For a real-time
analysis tool, this means **silently incorrect results** (the error
message is printed but could be missed). Severity: MEDIUM (incorrect
tool output, not kernel crash/corruption).
**Step 8.4: Risk-Benefit Ratio**
- BENEFIT: Moderate - prevents misleading latency measurements
- RISK: Very low - 2-line change to an error path only triggered on
failure
- Ratio: Favorable
## Phase 9: Final Synthesis
**Step 9.1: Evidence**
FOR backporting:
- Fixes a real bug (missing error exit leads to invalid tool state)
- Small and surgical (3 lines changed)
- Obviously correct (follows identical pattern in same function)
- Very low regression risk
- Author is trusted contributor; accepted by maintainer
- Bug existed since v6.5
AGAINST backporting:
- Userspace tool, not kernel code
- Trigger condition is rare (pthread_create failure)
- No user reports of actually hitting this bug
- MEDIUM severity (incorrect tool output, not crash/corruption)
**Step 9.2: Stable Rules Checklist**
1. Obviously correct and tested? **YES** - trivially correct
2. Fixes a real bug? **YES** - tool runs without required threads
3. Important issue? **MEDIUM** - incorrect measurements, not crash
4. Small and contained? **YES** - 3 lines in 1 file
5. No new features? **YES** - pure error handling fix
6. Can apply to stable? **YES** for 7.0.y
**Step 9.3: Exception Categories**
Not an exception category - standard bug fix.
**Step 9.4: Decision**
The fix is small, obviously correct, and prevents the rtla tool from
operating in an invalid state. While it's a userspace tool and the
trigger is rare, the fix has essentially zero regression risk and
improves correctness. The pattern of missing error handling on
`pthread_create()` failure is a clear bug.
## Verification
- [Phase 1] Parsed tags: Link to patch 10 of a series, SOBs from author
and maintainer
- [Phase 2] Diff analysis: +3/-1 lines, adds `goto out_trace` to error
path in `run_tool()`
- [Phase 3] git blame: buggy code carried from `cdca4f4e5e8ea`
(v6.5-rc1) through consolidation `2f3172f9dd58cc`
- [Phase 3] git show 2f3172f9dd58cc: confirmed consolidation commit
created `run_tool()` carrying the bug
- [Phase 3] Checked pre-consolidation files: both `timerlat_top.c` and
`timerlat_hist.c` had identical missing error handling
- [Phase 4] b4 dig: confirmed author's series via `af2962d68b970` match
- [Phase 4] Lore blocked by anti-bot; could not read full thread
discussion
- [Phase 5] Traced `run_tool()` flow: after failed pthread_create, tool
continues to enable/main/stats without user threads
- [Phase 5] Verified `out_trace` cleanup path exists and is used by
other error checks in same function
- [Phase 6] Code exists in 7.0.y (run_tool in common.c); older trees
have equivalent code in different files
- [Phase 8] Failure mode: tool produces results without user-space
threads, severity MEDIUM
- UNVERIFIED: Could not read full mailing list thread due to lore anti-
bot protection
**YES**
tools/tracing/rtla/src/common.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/tools/tracing/rtla/src/common.c b/tools/tracing/rtla/src/common.c
index ceff76a62a30b..68426ce6f9971 100644
--- a/tools/tracing/rtla/src/common.c
+++ b/tools/tracing/rtla/src/common.c
@@ -271,8 +271,10 @@ int run_tool(struct tool_ops *ops, int argc, char *argv[])
params->user.cgroup_name = params->cgroup_name;
retval = pthread_create(&user_thread, NULL, timerlat_u_dispatcher, ¶ms->user);
- if (retval)
+ if (retval) {
err_msg("Error creating timerlat user-space threads\n");
+ goto out_trace;
+ }
}
retval = ops->enable(tool);
--
2.53.0
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox