Linux Trace Kernel

Linux Trace Kernel
 help / color / mirror / Atom feed

* Re: [RFC PATCH 2/2] kernel/module: Decouple klp and ftrace from load_module
From: Petr Mladek @ 2026-04-16 13:09 UTC (permalink / raw)
  To: Song Chen
  Cc: Petr Pavlu, rafael, lenb, mturquette, sboyd, viresh.kumar, agk,
	snitzer, mpatocka, bmarzins, song, yukuai, linan122, jason.wessel,
	danielt, dianders, horms, davem, edumazet, kuba, pabeni, paulmck,
	frederic, mcgrof, da.gomez, samitolvanen, atomlin, jpoimboe,
	jikos, mbenes, joe.lawrence, rostedt, mhiramat, mark.rutland,
	mathieu.desnoyers, linux-modules, linux-kernel,
	linux-trace-kernel, linux-acpi, linux-clk, linux-pm,
	live-patching, dm-devel, linux-raid, kgdb-bugreport, netdev
In-Reply-To: <a35f5f94-7d5a-4347-974b-b270c89ef241@189.cn>

On Wed 2026-04-15 14:43:53, Song Chen wrote:
> Hi,
> 
> On 4/14/26 22:33, Petr Pavlu wrote:
> > On 4/13/26 10:07 AM, chensong_2000@189.cn wrote:
> > > From: Song Chen <chensong_2000@189.cn>
> > > 
> > > ftrace and livepatch currently have their module load/unload callbacks
> > > hard-coded in the module loader as direct function calls to
> > > ftrace_module_enable(), klp_module_coming(), klp_module_going()
> > > and ftrace_release_mod(). This tight coupling was originally introduced
> > > to enforce strict call ordering that could not be guaranteed by the
> > > module notifier chain, which only supported forward traversal. Their
> > > notifiers were moved in and out back and forth. see [1] and [2].
> > 
> > I'm unclear about what is meant by the notifiers being moved back and
> > forth. The links point to patches that converted ftrace+klp from using
> > module notifiers to explicit callbacks due to ordering issues, but this
> > switch occurred only once. Have there been other attempts to use
> > notifiers again?
> > 
> > > diff --git a/include/linux/module.h b/include/linux/module.h
> > > index 14f391b186c6..0bdd56f9defd 100644
> > > --- a/include/linux/module.h
> > > +++ b/include/linux/module.h
> > > @@ -308,6 +308,14 @@ enum module_state {
> > >   	MODULE_STATE_COMING,	/* Full formed, running module_init. */
> > >   	MODULE_STATE_GOING,	/* Going away. */
> > >   	MODULE_STATE_UNFORMED,	/* Still setting it up. */
> > > +	MODULE_STATE_FORMED,
> > 
> > I don't see a reason to add a new module state. Why is it necessary and
> > how does it fit with the existing states?
> > 
> because once notifier fails in state MODULE_STATE_UNFORMED (now only ftrace
> has someting to do in this state), notifier chain will roll back by calling
> blocking_notifier_call_chain_robust, i'm afraid MODULE_STATE_GOING is going
> to jeopardise the notifers which don't handle it appropriately, like:
> 
> case MODULE_STATE_COMING:
>      kmalloc();
> case MODULE_STATE_GOING:
>      kfree();
> 
> 
> > > +};
> > > +
> > > +enum module_notifier_prio {
> > > +	MODULE_NOTIFIER_PRIO_LOW = INT_MIN,	/* Low prioroty, coming last, going first */
> > > +	MODULE_NOTIFIER_PRIO_MID = 0,	/* Normal priority. */
> > > +	MODULE_NOTIFIER_PRIO_SECOND_HIGH = INT_MAX - 1,	/* Second high priorigy, coming second*/
> > > +	MODULE_NOTIFIER_PRIO_HIGH = INT_MAX,	/* High priorigy, coming first, going late. */
> > 
> > I suggest being explicit about how the notifiers are ordered. For
> > example:
> > 
> > enum module_notifier_prio {
> > 	MODULE_NOTIFIER_PRIO_NORMAL,	/* Normal priority, coming last, going first. */
> > 	MODULE_NOTIFIER_PRIO_LIVEPATCH,
> > 	MODULE_NOTIFIER_PRIO_FTRACE,	/* High priority, coming first, going late. */
> > };
> > 

I like the explicit PRIO_LIVEPATCH/FTRACE names.

But I would keep the INT_MAX - 1 and INT_MAX priorities. I believe
that ftrace/livepatching will always be the first/last to call.
And INT_MAX would help to preserve kABI when PRIO_NORMAL is not
enough for the rest of notifiers.

That said, I am not sure whether this is worth the effort.
This patch tries to move the explicit callbacks in a generic
notifiers API. But it will still need to use some explictly
defined (reserved) priorities. And it will
not guarantee a misuse. Also it requires the double linked
list which complicates the notifiers code.


> > >   };
> > >   struct mod_tree_node {
> > > --- a/kernel/module/main.c
> > > +++ b/kernel/module/main.c
> > > @@ -3281,20 +3277,14 @@ static int complete_formation(struct module *mod, struct load_info *info)
> > >   	return err;
> > >   }
> > > -static int prepare_coming_module(struct module *mod)
> > > +static int prepare_module_state_transaction(struct module *mod,
> > > +			unsigned long val_up, unsigned long val_down)
> > >   {
> > >   	int err;
> > > -	ftrace_module_enable(mod);
> > > -	err = klp_module_coming(mod);
> > > -	if (err)
> > > -		return err;
> > > -
> > >   	err = blocking_notifier_call_chain_robust(&module_notify_list,
> > > -			MODULE_STATE_COMING, MODULE_STATE_GOING, mod);
> > > +			val_up, val_down, mod);
> > >   	err = notifier_to_errno(err);
> > > -	if (err)
> > > -		klp_module_going(mod);
> > >   	return err;
> > >   }

I personally find the name "prepare_module_state_transaction"
misleading. What is the "transaction" here? If this was a "preparation"
step then where is the transaction done/finished?

It might be better to just opencode the
blocking_notifier_call_chain_robust() instead.

> > > @@ -3468,14 +3458,21 @@ static int load_module(struct load_info *info, const char __user *uargs,
> > >   	init_build_id(mod, info);
> > >   	/* Ftrace init must be called in the MODULE_STATE_UNFORMED state */
> > > -	ftrace_module_init(mod);
> > > +	err = prepare_module_state_transaction(mod,
> > > +				MODULE_STATE_UNFORMED, MODULE_STATE_FORMED);
> > 
> > I believe val_down should be MODULE_STATE_GOING to reverse the
> > operation. Why is the new state MODULE_STATE_FORMED needed here?
> to avoid this:
> 
> case MODULE_STATE_COMING:
>      kmalloc();
> case MODULE_STATE_GOING:
>      kfree();

Hmm, the module is in "FORMED" state here.

> > > +	if (err)
> > > +		goto ddebug_cleanup;
> > >   	/* Finally it's fully formed, ready to start executing. */
> > >   	err = complete_formation(mod, info);

And we call "complete_formation()" function. This sounds like
it was not really "FORMED" before. => It is confusing and nono.

Please, try to avoid the new state if possible. My experience
with reading the module loader code is that any new state
brings a lot of complexity. You need to take it into account
when checking correctness of other changes, features, ...

Something tells me that if the state was not needed before
then we could avoid it.

> > > -	if (err)
> > > +	if (err) {
> > > +		blocking_notifier_call_chain_reverse(&module_notify_list,
> > > +				MODULE_STATE_FORMED, mod);
> > >   		goto ddebug_cleanup;
> > > +	}
> > > -	err = prepare_coming_module(mod);
> > > +	err = prepare_module_state_transaction(mod,
> > > +				MODULE_STATE_COMING, MODULE_STATE_GOING);
> > >   	if (err)
> > >   		goto bug_cleanup;
> > > --- a/kernel/trace/ftrace.c
> > > +++ b/kernel/trace/ftrace.c
> > > @@ -5241,6 +5241,44 @@ static int __init ftrace_mod_cmd_init(void)
> > >   }
> > >   core_initcall(ftrace_mod_cmd_init);
> > > +static int ftrace_module_callback(struct notifier_block *nb, unsigned long op,
> > > +			void *module)
> > > +{
> > > +	struct module *mod = module;
> > > +
> > > +	switch (op) {
> > > +	case MODULE_STATE_UNFORMED:
> > > +		ftrace_module_init(mod);
> > > +		break;
> > > +	case MODULE_STATE_COMING:
> > > +		ftrace_module_enable(mod);
> > > +		break;
> > > +	case MODULE_STATE_LIVE:
> > > +		ftrace_free_mem(mod, mod->mem[MOD_INIT_TEXT].base,
> > > +				mod->mem[MOD_INIT_TEXT].base + mod->mem[MOD_INIT_TEXT].size);
> > > +		break;
> > > +	case MODULE_STATE_GOING:
> > > +	case MODULE_STATE_FORMED:
> > > +		ftrace_release_mod(mod);

This calls "release" in a "FORMED" state. It does not make any
sense. Something looks fishy, either the code or the naming.

> > > +		break;
> > > +	default:
> > > +		break;
> > > +	}
> > 

I am sorry for being so picky about names. I believe that good names
help to prevent bugs and reduce headaches.

Best Regards,
Petr

^ permalink raw reply

* Re: [PATCH 01/61] Coccinelle: Prefer IS_ERR_OR_NULL over manual NULL check
From: Krzysztof Kozlowski @ 2026-04-16 12:30 UTC (permalink / raw)
  To: Philipp Hahn, amd-gfx, apparmor, bpf, ceph-devel, cocci, dm-devel,
	dri-devel, gfs2, intel-gfx, intel-wired-lan, iommu, kvm,
	linux-arm-kernel, linux-block, linux-bluetooth, linux-btrfs,
	linux-cifs, linux-clk, linux-erofs, linux-ext4, linux-fsdevel,
	linux-gpio, linux-hyperv, linux-input, linux-kernel, linux-leds,
	linux-media, linux-mips, linux-mm, linux-modules, linux-mtd,
	linux-nfs, linux-omap, linux-phy, linux-pm, linux-rockchip,
	linux-s390, linux-scsi, linux-sctp, linux-security-module,
	linux-sh, linux-sound, linux-stm32, linux-trace-kernel, linux-usb,
	linux-wireless, netdev, ntfs3, samba-technical, sched-ext,
	target-devel, tipc-discussion, v9fs
  Cc: Julia Lawall, Nicolas Palix
In-Reply-To: <20260310-b4-is_err_or_null-v1-1-bd63b656022d@avm.de>

On 10/03/2026 12:48, Philipp Hahn wrote:
> Find and convert uses of IS_ERR() plus NULL check to IS_ERR_OR_NULL().
> 
> There are several cases where `!ptr && WARN_ON[_ONCE](IS_ERR(ptr))` is
> used:
> - arch/x86/kernel/callthunks.c:215 WARN_ON_ONCE
> - drivers/clk/clk.c:4561 WARN_ON_ONCE
> - drivers/interconnect/core.c:793 WARN_ON
> - drivers/reset/core.c:718 WARN_ON
> The change is not 100% semantical equivalent as the warning will now
> also happen when the pointer is NULL.
> 
> To: Julia Lawall <Julia.Lawall@inria.fr>
> To: Nicolas Palix <nicolas.palix@imag.fr>
> Cc: cocci@inria.fr
> Cc: linux-kernel@vger.kernel.org
> 
> ---
> drivers/clocksource/mips-gic-timer.c:283 looks suspicious: ret != clk,
> but Daniel Lezcano verified it as cottect.
> 
> There are some cases where the checks are part of a larger expression:
> - mm/kmemleak.c:1095
> - mm/kmemleak.c:1155
> - mm/kmemleak.c:1173
> - mm/kmemleak.c:1290
> - mm/kmemleak.c:1328
> - mm/kmemleak.c:1241
> - mm/kmemleak.c:1310
> - mm/kmemleak.c:1258
> - net/netlink/af_netlink.c:2670
> Thanks to Julia Lawall for the help to also handle them.
> 
> Signed-off-by: Philipp Hahn <phahn-oss@avm.de>
> ---
>  scripts/coccinelle/api/is_err_or_null.cocci | 125 ++++++++++++++++++++++++++++
>  1 file changed, 125 insertions(+)
> 

Neither this, nor try from 2011, nor any future try should be accepted,
because it creates impression IS_ERR_OR_NULL is somehow okay. No, it is
not okay, it is a discouraged pattern leading to less readable and
maintainable code. We should not have therefore any tools suggesting
usage of IS_ERR_OR_NULL, because people will be converting poor code
into that, instead of fixing that poor code.

Best regards,
Krzysztof

^ permalink raw reply

* Re: [RFC PATCH 1/2] kernel/notifier: replace single-linked list with double-linked list for reverse traversal
From: David Laight @ 2026-04-16 12:30 UTC (permalink / raw)
  To: chensong_2000
  Cc: rafael, lenb, mturquette, sboyd, viresh.kumar, agk, snitzer,
	mpatocka, bmarzins, song, yukuai, linan122, jason.wessel, danielt,
	dianders, horms, davem, edumazet, kuba, pabeni, paulmck, frederic,
	mcgrof, petr.pavlu, da.gomez, samitolvanen, atomlin, jpoimboe,
	jikos, mbenes, pmladek, joe.lawrence, rostedt, mhiramat,
	mark.rutland, mathieu.desnoyers, linux-modules, linux-kernel,
	linux-trace-kernel, linux-acpi, linux-clk, linux-pm,
	live-patching, dm-devel, linux-raid, kgdb-bugreport, netdev
In-Reply-To: <20260415070137.17860-1-chensong_2000@189.cn>

On Wed, 15 Apr 2026 15:01:37 +0800
chensong_2000@189.cn wrote:

> From: Song Chen <chensong_2000@189.cn>
> 
> The current notifier chain implementation uses a single-linked list
> (struct notifier_block *next), which only supports forward traversal
> in priority order. This makes it difficult to handle cleanup/teardown
> scenarios that require notifiers to be called in reverse priority order.

If it is only cleanup/teardown then the list can be order-reversed
as part of that process at the same time as the list is deleted.

	David



^ permalink raw reply

* Re: [PATCH 55/61] interconnect: Prefer IS_ERR_OR_NULL over manual NULL check
From: Krzysztof Kozlowski @ 2026-04-16 12:24 UTC (permalink / raw)
  To: Philipp Hahn, amd-gfx, apparmor, bpf, ceph-devel, cocci, dm-devel,
	dri-devel, gfs2, intel-gfx, intel-wired-lan, iommu, kvm,
	linux-arm-kernel, linux-block, linux-bluetooth, linux-btrfs,
	linux-cifs, linux-clk, linux-erofs, linux-ext4, linux-fsdevel,
	linux-gpio, linux-hyperv, linux-input, linux-kernel, linux-leds,
	linux-media, linux-mips, linux-mm, linux-modules, linux-mtd,
	linux-nfs, linux-omap, linux-phy, linux-pm, linux-rockchip,
	linux-s390, linux-scsi, linux-sctp, linux-security-module,
	linux-sh, linux-sound, linux-stm32, linux-trace-kernel, linux-usb,
	linux-wireless, netdev, ntfs3, samba-technical, sched-ext,
	target-devel, tipc-discussion, v9fs
  Cc: Georgi Djakov
In-Reply-To: <20260310-b4-is_err_or_null-v1-55-bd63b656022d@avm.de>

On 10/03/2026 12:49, Philipp Hahn wrote:
> Prefer using IS_ERR_OR_NULL() over using IS_ERR() and a manual NULL
> check.
> 
> Semantich change: Previously the code only printed the warning on error,
> but not when the pointer was NULL. Now the warning is printed in both
> cases!

NAK, read the code

> 
> Change found with coccinelle.
> 
> To: Georgi Djakov <djakov@kernel.org>
> Cc: linux-pm@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: Philipp Hahn <phahn-oss@avm.de>
> ---
>  drivers/interconnect/core.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/interconnect/core.c b/drivers/interconnect/core.c
> index 8569b78a18517b33abeafac091978b25cbc1acc7..22e92b30f73853d5bd2e05b4f52cb5aa22556468 100644
> --- a/drivers/interconnect/core.c
> +++ b/drivers/interconnect/core.c
> @@ -790,7 +790,7 @@ void icc_put(struct icc_path *path)
>  	size_t i;
>  	int ret;
>  
> -	if (!path || WARN_ON(IS_ERR(path)))
> +	if (WARN_ON(IS_ERR_OR_NULL(path)))

IS_ERR_OR_NULL is simply discouraged, but beside of code preference, you
just added bug here. This is clearly not equivalent and you emit warn on
perfectly valid case!

Best regards,
Krzysztof

^ permalink raw reply

* Re: [RFC PATCH 3/4] rv/tlob: Add KUnit tests for the tlob monitor
From: Gabriele Monaco @ 2026-04-16 12:09 UTC (permalink / raw)
  To: wen.yang
  Cc: linux-trace-kernel, linux-kernel, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers
In-Reply-To: <0a7f41ff8cb13f8601920ead2979db2ee5f2d442.1776020428.git.wen.yang@linux.dev>

On Mon, 2026-04-13 at 03:27 +0800, wen.yang@linux.dev wrote:
> From: Wen Yang <wen.yang@linux.dev>
> 
> Add six KUnit test suites gated behind CONFIG_TLOB_KUNIT_TEST
> (depends on RV_MON_TLOB && KUNIT; default KUNIT_ALL_TESTS).
> A .kunitconfig fragment is provided for the kunit.py runner.
> 
> Coverage: automaton state transitions and self-loops; start/stop API
> error paths (duplicate start, missing start, overflow threshold,
> table-full, immediate deadline); scheduler context-switch accounting
> for on/off-CPU time; violation tracepoint payload fields; ring buffer
> push, drop-new overflow, and wakeup; and the uprobe line parser.
> 
> Signed-off-by: Wen Yang <wen.yang@linux.dev>

I was considering adding Kunit tests and thought to have them a bit more
integrated ([1] if you want to have a peek before I submit it for RFC, mind it's
a bit raw).

The problem with reimplementing the da_handle_event() is that you are in fact
validating only the model matrix, several other things could go wrong before you
get there (whether the monitor was started properly, other things you might be
doing from the tracepoint handler before you handle events, etc.).

Also, I believe it's a bit of an overkill to validate every single transition
like this, especially considering the work once you update the model for
whatever reason.

One meaningful thing to validate is that a certain sequence of events with a
certain timing causes a violation (or if you want, that a good sequence does
not), for instance. But that's just my opinion, of course.

Thanks,
Gabriele

> ---
>  kernel/trace/rv/Makefile                   |    1 +
>  kernel/trace/rv/monitors/tlob/.kunitconfig |    5 +
>  kernel/trace/rv/monitors/tlob/Kconfig      |   12 +
>  kernel/trace/rv/monitors/tlob/tlob.c       |    1 +
>  kernel/trace/rv/monitors/tlob/tlob_kunit.c | 1194 ++++++++++++++++++++
>  5 files changed, 1213 insertions(+)
>  create mode 100644 kernel/trace/rv/monitors/tlob/.kunitconfig
>  create mode 100644 kernel/trace/rv/monitors/tlob/tlob_kunit.c
> 
> diff --git a/kernel/trace/rv/Makefile b/kernel/trace/rv/Makefile
> index cc3781a3b..6d963207d 100644
> --- a/kernel/trace/rv/Makefile
> +++ b/kernel/trace/rv/Makefile
> @@ -19,6 +19,7 @@ obj-$(CONFIG_RV_MON_NRP) += monitors/nrp/nrp.o
>  obj-$(CONFIG_RV_MON_SSSW) += monitors/sssw/sssw.o
>  obj-$(CONFIG_RV_MON_OPID) += monitors/opid/opid.o
>  obj-$(CONFIG_RV_MON_TLOB) += monitors/tlob/tlob.o
> +obj-$(CONFIG_TLOB_KUNIT_TEST) += monitors/tlob/tlob_kunit.o
>  # Add new monitors here
>  obj-$(CONFIG_RV_REACTORS) += rv_reactors.o
>  obj-$(CONFIG_RV_REACT_PRINTK) += reactor_printk.o
> diff --git a/kernel/trace/rv/monitors/tlob/.kunitconfig
> b/kernel/trace/rv/monitors/tlob/.kunitconfig
> new file mode 100644
> index 000000000..977c58601
> --- /dev/null
> +++ b/kernel/trace/rv/monitors/tlob/.kunitconfig
> @@ -0,0 +1,5 @@
> +CONFIG_FTRACE=y
> +CONFIG_KUNIT=y
> +CONFIG_RV=y
> +CONFIG_RV_MON_TLOB=y
> +CONFIG_TLOB_KUNIT_TEST=y
> diff --git a/kernel/trace/rv/monitors/tlob/Kconfig
> b/kernel/trace/rv/monitors/tlob/Kconfig
> index 010237480..4ccd2f881 100644
> --- a/kernel/trace/rv/monitors/tlob/Kconfig
> +++ b/kernel/trace/rv/monitors/tlob/Kconfig
> @@ -49,3 +49,15 @@ config RV_MON_TLOB
>  	  For further information, see:
>  	    Documentation/trace/rv/monitor_tlob.rst
>  
> +config TLOB_KUNIT_TEST
> +	tristate "KUnit tests for tlob monitor" if !KUNIT_ALL_TESTS
> +	depends on RV_MON_TLOB && KUNIT
> +	default KUNIT_ALL_TESTS
> +	help
> +	  Enable KUnit in-kernel unit tests for the tlob RV monitor.
> +
> +	  Tests cover automaton state transitions, the hash table helpers,
> +	  the start/stop task interface, and the event ring buffer including
> +	  overflow handling and wakeup behaviour.
> +
> +	  Say Y or M here to run the tlob KUnit test suite; otherwise say N.
> diff --git a/kernel/trace/rv/monitors/tlob/tlob.c
> b/kernel/trace/rv/monitors/tlob/tlob.c
> index a6e474025..dd959eb9b 100644
> --- a/kernel/trace/rv/monitors/tlob/tlob.c
> +++ b/kernel/trace/rv/monitors/tlob/tlob.c
> @@ -784,6 +784,7 @@ VISIBLE_IF_KUNIT int tlob_parse_uprobe_line(char *buf, u64
> *thr_out,
>  	*path_out  = buf + n;
>  	return 0;
>  }
> +EXPORT_SYMBOL_IF_KUNIT(tlob_parse_uprobe_line);
>  
>  static ssize_t tlob_monitor_write(struct file *file,
>  				  const char __user *ubuf,
> diff --git a/kernel/trace/rv/monitors/tlob/tlob_kunit.c
> b/kernel/trace/rv/monitors/tlob/tlob_kunit.c
> new file mode 100644
> index 000000000..64f5abb34
> --- /dev/null
> +++ b/kernel/trace/rv/monitors/tlob/tlob_kunit.c
> @@ -0,0 +1,1194 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * KUnit tests for the tlob RV monitor.
> + *
> + * tlob_automaton:         DA transition table coverage.
> + * tlob_task_api:          tlob_start_task()/tlob_stop_task() lifecycle and
> errors.
> + * tlob_sched_integration: on/off-CPU accounting across real context
> switches.
> + * tlob_trace_output:      tlob_budget_exceeded tracepoint field
> verification.
> + * tlob_event_buf:         ring buffer push, overflow, and wakeup.
> + * tlob_parse_uprobe:      uprobe format string parser acceptance and
> rejection.
> + *
> + * The duplicate-(binary, offset_start) constraint enforced by
> tlob_add_uprobe()
> + * is not covered here: that function calls kern_path() and requires a real
> + * filesystem, which is outside the scope of unit tests. It is covered by the
> + * uprobe_duplicate_offset case in tools/testing/selftests/rv/test_tlob.sh.
> + */
> +#include <kunit/test.h>
> +#include <linux/atomic.h>
> +#include <linux/completion.h>
> +#include <linux/delay.h>
> +#include <linux/kthread.h>
> +#include <linux/ktime.h>
> +#include <linux/mutex.h>
> +#include <linux/sched.h>
> +#include <linux/sched/task.h>
> +#include <linux/tracepoint.h>
> +
> +/*
> + * Pull in the rv tracepoint declarations so that
> + * register_trace_tlob_budget_exceeded() is available.
> + * No CREATE_TRACE_POINTS here  --  the tracepoint implementation lives in
> rv.c.
> + */
> +#include <rv_trace.h>
> +
> +#include "tlob.h"
> +
> +/*
> + * da_handle_event_tlob - apply one automaton transition on @da_mon.
> + *
> + * This helper is used only by the KUnit automaton suite. It applies the
> + * tlob transition table directly on a supplied da_monitor without touching
> + * per-task slots, tracepoints, or timers.
> + */
> +static void da_handle_event_tlob(struct da_monitor *da_mon,
> +				 enum events_tlob event)
> +{
> +	enum states_tlob curr_state = (enum states_tlob)da_mon->curr_state;
> +	enum states_tlob next_state =
> +		(enum states_tlob)automaton_tlob.function[curr_state][event];
> +
> +	if (next_state != INVALID_STATE)
> +		da_mon->curr_state = next_state;
> +}
> +
> +MODULE_IMPORT_NS("EXPORTED_FOR_KUNIT_TESTING");
> +
> +/*
> + * Suite 1: automaton state-machine transitions
> + */
> +
> +/* unmonitored -> trace_start -> on_cpu */
> +static void tlob_unmonitored_to_on_cpu(struct kunit *test)
> +{
> +	struct da_monitor mon = { .curr_state = unmonitored_tlob };
> +
> +	da_handle_event_tlob(&mon, trace_start_tlob);
> +	KUNIT_EXPECT_EQ(test, (int)mon.curr_state, (int)on_cpu_tlob);
> +}
> +
> +/* on_cpu -> switch_out -> off_cpu */
> +static void tlob_on_cpu_switch_out(struct kunit *test)
> +{
> +	struct da_monitor mon = { .curr_state = on_cpu_tlob };
> +
> +	da_handle_event_tlob(&mon, switch_out_tlob);
> +	KUNIT_EXPECT_EQ(test, (int)mon.curr_state, (int)off_cpu_tlob);
> +}
> +
> +/* off_cpu -> switch_in -> on_cpu */
> +static void tlob_off_cpu_switch_in(struct kunit *test)
> +{
> +	struct da_monitor mon = { .curr_state = off_cpu_tlob };
> +
> +	da_handle_event_tlob(&mon, switch_in_tlob);
> +	KUNIT_EXPECT_EQ(test, (int)mon.curr_state, (int)on_cpu_tlob);
> +}
> +
> +/* on_cpu -> budget_expired -> unmonitored */
> +static void tlob_on_cpu_budget_expired(struct kunit *test)
> +{
> +	struct da_monitor mon = { .curr_state = on_cpu_tlob };
> +
> +	da_handle_event_tlob(&mon, budget_expired_tlob);
> +	KUNIT_EXPECT_EQ(test, (int)mon.curr_state, (int)unmonitored_tlob);
> +}
> +
> +/* off_cpu -> budget_expired -> unmonitored */
> +static void tlob_off_cpu_budget_expired(struct kunit *test)
> +{
> +	struct da_monitor mon = { .curr_state = off_cpu_tlob };
> +
> +	da_handle_event_tlob(&mon, budget_expired_tlob);
> +	KUNIT_EXPECT_EQ(test, (int)mon.curr_state, (int)unmonitored_tlob);
> +}
> +
> +/* on_cpu -> trace_stop -> unmonitored */
> +static void tlob_on_cpu_trace_stop(struct kunit *test)
> +{
> +	struct da_monitor mon = { .curr_state = on_cpu_tlob };
> +
> +	da_handle_event_tlob(&mon, trace_stop_tlob);
> +	KUNIT_EXPECT_EQ(test, (int)mon.curr_state, (int)unmonitored_tlob);
> +}
> +
> +/* off_cpu -> trace_stop -> unmonitored */
> +static void tlob_off_cpu_trace_stop(struct kunit *test)
> +{
> +	struct da_monitor mon = { .curr_state = off_cpu_tlob };
> +
> +	da_handle_event_tlob(&mon, trace_stop_tlob);
> +	KUNIT_EXPECT_EQ(test, (int)mon.curr_state, (int)unmonitored_tlob);
> +}
> +
> +/* budget_expired -> unmonitored; a single trace_start re-enters on_cpu. */
> +static void tlob_violation_then_restart(struct kunit *test)
> +{
> +	struct da_monitor mon = { .curr_state = unmonitored_tlob };
> +
> +	da_handle_event_tlob(&mon, trace_start_tlob);
> +	KUNIT_EXPECT_EQ(test, (int)mon.curr_state, (int)on_cpu_tlob);
> +
> +	da_handle_event_tlob(&mon, budget_expired_tlob);
> +	KUNIT_EXPECT_EQ(test, (int)mon.curr_state, (int)unmonitored_tlob);
> +
> +	/* Single trace_start is sufficient to re-enter on_cpu */
> +	da_handle_event_tlob(&mon, trace_start_tlob);
> +	KUNIT_EXPECT_EQ(test, (int)mon.curr_state, (int)on_cpu_tlob);
> +
> +	da_handle_event_tlob(&mon, trace_stop_tlob);
> +	KUNIT_EXPECT_EQ(test, (int)mon.curr_state, (int)unmonitored_tlob);
> +}
> +
> +/* off_cpu self-loops on switch_out and sched_wakeup. */
> +static void tlob_off_cpu_self_loops(struct kunit *test)
> +{
> +	static const enum events_tlob events[] = {
> +		switch_out_tlob, sched_wakeup_tlob,
> +	};
> +	unsigned int i;
> +
> +	for (i = 0; i < ARRAY_SIZE(events); i++) {
> +		struct da_monitor mon = { .curr_state = off_cpu_tlob };
> +
> +		da_handle_event_tlob(&mon, events[i]);
> +		KUNIT_EXPECT_EQ_MSG(test, (int)mon.curr_state,
> +				    (int)off_cpu_tlob,
> +				    "event %u should self-loop in off_cpu",
> +				    events[i]);
> +	}
> +}
> +
> +/* on_cpu self-loops on sched_wakeup. */
> +static void tlob_on_cpu_self_loops(struct kunit *test)
> +{
> +	struct da_monitor mon = { .curr_state = on_cpu_tlob };
> +
> +	da_handle_event_tlob(&mon, sched_wakeup_tlob);
> +	KUNIT_EXPECT_EQ_MSG(test, (int)mon.curr_state, (int)on_cpu_tlob,
> +			    "sched_wakeup should self-loop in on_cpu");
> +}
> +
> +/* Scheduling events in unmonitored self-loop (no state change). */
> +static void tlob_unmonitored_ignores_sched(struct kunit *test)
> +{
> +	static const enum events_tlob events[] = {
> +		switch_in_tlob, switch_out_tlob, sched_wakeup_tlob,
> +	};
> +	unsigned int i;
> +
> +	for (i = 0; i < ARRAY_SIZE(events); i++) {
> +		struct da_monitor mon = { .curr_state = unmonitored_tlob };
> +
> +		da_handle_event_tlob(&mon, events[i]);
> +		KUNIT_EXPECT_EQ_MSG(test, (int)mon.curr_state,
> +				    (int)unmonitored_tlob,
> +				    "event %u should self-loop in
> unmonitored",
> +				    events[i]);
> +	}
> +}
> +
> +static void tlob_full_happy_path(struct kunit *test)
> +{
> +	struct da_monitor mon = { .curr_state = unmonitored_tlob };
> +
> +	da_handle_event_tlob(&mon, trace_start_tlob);
> +	KUNIT_EXPECT_EQ(test, (int)mon.curr_state, (int)on_cpu_tlob);
> +
> +	da_handle_event_tlob(&mon, switch_out_tlob);
> +	KUNIT_EXPECT_EQ(test, (int)mon.curr_state, (int)off_cpu_tlob);
> +
> +	da_handle_event_tlob(&mon, switch_in_tlob);
> +	KUNIT_EXPECT_EQ(test, (int)mon.curr_state, (int)on_cpu_tlob);
> +
> +	da_handle_event_tlob(&mon, trace_stop_tlob);
> +	KUNIT_EXPECT_EQ(test, (int)mon.curr_state, (int)unmonitored_tlob);
> +}
> +
> +static void tlob_multiple_switches(struct kunit *test)
> +{
> +	struct da_monitor mon = { .curr_state = unmonitored_tlob };
> +	int i;
> +
> +	da_handle_event_tlob(&mon, trace_start_tlob);
> +	KUNIT_EXPECT_EQ(test, (int)mon.curr_state, (int)on_cpu_tlob);
> +
> +	for (i = 0; i < 3; i++) {
> +		da_handle_event_tlob(&mon, switch_out_tlob);
> +		KUNIT_EXPECT_EQ(test, (int)mon.curr_state,
> (int)off_cpu_tlob);
> +		da_handle_event_tlob(&mon, switch_in_tlob);
> +		KUNIT_EXPECT_EQ(test, (int)mon.curr_state, (int)on_cpu_tlob);
> +	}
> +
> +	da_handle_event_tlob(&mon, trace_stop_tlob);
> +	KUNIT_EXPECT_EQ(test, (int)mon.curr_state, (int)unmonitored_tlob);
> +}
> +
> +static struct kunit_case tlob_automaton_cases[] = {
> +	KUNIT_CASE(tlob_unmonitored_to_on_cpu),
> +	KUNIT_CASE(tlob_on_cpu_switch_out),
> +	KUNIT_CASE(tlob_off_cpu_switch_in),
> +	KUNIT_CASE(tlob_on_cpu_budget_expired),
> +	KUNIT_CASE(tlob_off_cpu_budget_expired),
> +	KUNIT_CASE(tlob_on_cpu_trace_stop),
> +	KUNIT_CASE(tlob_off_cpu_trace_stop),
> +	KUNIT_CASE(tlob_off_cpu_self_loops),
> +	KUNIT_CASE(tlob_on_cpu_self_loops),
> +	KUNIT_CASE(tlob_unmonitored_ignores_sched),
> +	KUNIT_CASE(tlob_full_happy_path),
> +	KUNIT_CASE(tlob_violation_then_restart),
> +	KUNIT_CASE(tlob_multiple_switches),
> +	{}
> +};
> +
> +static struct kunit_suite tlob_automaton_suite = {
> +	.name       = "tlob_automaton",
> +	.test_cases = tlob_automaton_cases,
> +};
> +
> +/*
> + * Suite 2: task registration API
> + */
> +
> +/* Basic start/stop cycle */
> +static void tlob_start_stop_ok(struct kunit *test)
> +{
> +	int ret;
> +
> +	ret = tlob_start_task(current, 10000000 /* 10 s, won't fire */, NULL,
> 0);
> +	KUNIT_ASSERT_EQ(test, ret, 0);
> +	KUNIT_EXPECT_EQ(test, tlob_stop_task(current), 0);
> +}
> +
> +/* Double start must return -EEXIST. */
> +static void tlob_double_start(struct kunit *test)
> +{
> +	KUNIT_ASSERT_EQ(test, tlob_start_task(current, 10000000, NULL, 0),
> 0);
> +	KUNIT_EXPECT_EQ(test, tlob_start_task(current, 10000000, NULL, 0), -
> EEXIST);
> +	tlob_stop_task(current);
> +}
> +
> +/* Stop without start must return -ESRCH. */
> +static void tlob_stop_without_start(struct kunit *test)
> +{
> +	tlob_stop_task(current);  /* clear any stale entry first */
> +	KUNIT_EXPECT_EQ(test, tlob_stop_task(current), -ESRCH);
> +}
> +
> +/*
> + * A 1 us budget fires before tlob_stop_task() is called. Either the
> + * timer wins (-ESRCH) or we are very fast (0); both are valid.
> + */
> +static void tlob_immediate_deadline(struct kunit *test)
> +{
> +	int ret = tlob_start_task(current, 1 /* 1 us - fires almost
> immediately */, NULL, 0);
> +
> +	KUNIT_ASSERT_EQ(test, ret, 0);
> +	/* Let the 1 us timer fire */
> +	udelay(100);
> +	/*
> +	 * By now the hrtimer has almost certainly fired. Either it has
> +	 * (returns -ESRCH) or we were very fast (returns 0). Both are
> +	 * acceptable; just ensure no crash and the table is clean after.
> +	 */
> +	ret = tlob_stop_task(current);
> +	KUNIT_EXPECT_TRUE(test, ret == 0 || ret == -ESRCH);
> +}
> +
> +/*
> + * Fill the table to TLOB_MAX_MONITORED using kthreads (each needs a
> + * distinct task_struct), then verify the next start returns -ENOSPC.
> + */
> +struct tlob_waiter_ctx {
> +	struct completion start;
> +	struct completion done;
> +};
> +
> +static int tlob_waiter_fn(void *arg)
> +{
> +	struct tlob_waiter_ctx *ctx = arg;
> +
> +	wait_for_completion(&ctx->start);
> +	complete(&ctx->done);
> +	return 0;
> +}
> +
> +static void tlob_enospc(struct kunit *test)
> +{
> +	struct tlob_waiter_ctx *ctxs;
> +	struct task_struct **threads;
> +	int i, ret;
> +
> +	ctxs = kunit_kcalloc(test, TLOB_MAX_MONITORED,
> +			     sizeof(*ctxs), GFP_KERNEL);
> +	KUNIT_ASSERT_NOT_NULL(test, ctxs);
> +
> +	threads = kunit_kcalloc(test, TLOB_MAX_MONITORED,
> +				sizeof(*threads), GFP_KERNEL);
> +	KUNIT_ASSERT_NOT_NULL(test, threads);
> +
> +	/* Start TLOB_MAX_MONITORED kthreads and monitor each */
> +	for (i = 0; i < TLOB_MAX_MONITORED; i++) {
> +		init_completion(&ctxs[i].start);
> +		init_completion(&ctxs[i].done);
> +
> +		threads[i] = kthread_run(tlob_waiter_fn, &ctxs[i],
> +					 "tlob_waiter_%d", i);
> +		if (IS_ERR(threads[i])) {
> +			KUNIT_FAIL(test, "kthread_run failed at i=%d", i);
> +			threads[i] = NULL;
> +			goto cleanup;
> +		}
> +		get_task_struct(threads[i]);
> +
> +		ret = tlob_start_task(threads[i], 10000000, NULL, 0);
> +		if (ret != 0) {
> +			KUNIT_FAIL(test, "tlob_start_task failed at i=%d:
> %d",
> +				   i, ret);
> +			put_task_struct(threads[i]);
> +			complete(&ctxs[i].start);
> +			goto cleanup;
> +		}
> +	}
> +
> +	/* The table is now full: one more must fail with -ENOSPC */
> +	ret = tlob_start_task(current, 10000000, NULL, 0);
> +	KUNIT_EXPECT_EQ(test, ret, -ENOSPC);
> +
> +cleanup:
> +	/*
> +	 * Two-pass cleanup: cancel tlob monitoring and unblock kthreads
> first,
> +	 * then kthread_stop() to wait for full exit before releasing refs.
> +	 */
> +	for (i = 0; i < TLOB_MAX_MONITORED; i++) {
> +		if (!threads[i])
> +			break;
> +		tlob_stop_task(threads[i]);
> +		complete(&ctxs[i].start);
> +	}
> +	for (i = 0; i < TLOB_MAX_MONITORED; i++) {
> +		if (!threads[i])
> +			break;
> +		kthread_stop(threads[i]);
> +		put_task_struct(threads[i]);
> +	}
> +}
> +
> +/*
> + * A kthread holds a mutex for 80 ms; arm a 10 ms budget, burn ~1 ms
> + * on-CPU, then block on the mutex. The timer fires off-CPU; stop
> + * must return -ESRCH.
> + */
> +struct tlob_holder_ctx {
> +	struct mutex		lock;
> +	struct completion	ready;
> +	unsigned int		hold_ms;
> +};
> +
> +static int tlob_holder_fn(void *arg)
> +{
> +	struct tlob_holder_ctx *ctx = arg;
> +
> +	mutex_lock(&ctx->lock);
> +	complete(&ctx->ready);
> +	msleep(ctx->hold_ms);
> +	mutex_unlock(&ctx->lock);
> +	return 0;
> +}
> +
> +static void tlob_deadline_fires_off_cpu(struct kunit *test)
> +{
> +	struct tlob_holder_ctx ctx = { .hold_ms = 80 };
> +	struct task_struct *holder;
> +	ktime_t t0;
> +	int ret;
> +
> +	mutex_init(&ctx.lock);
> +	init_completion(&ctx.ready);
> +
> +	holder = kthread_run(tlob_holder_fn, &ctx, "tlob_holder_kunit");
> +	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, holder);
> +	wait_for_completion(&ctx.ready);
> +
> +	/* Arm 10 ms budget while kthread holds the mutex. */
> +	ret = tlob_start_task(current, 10000, NULL, 0);
> +	KUNIT_ASSERT_EQ(test, ret, 0);
> +
> +	/* Phase 1: burn ~1 ms on-CPU to exercise on_cpu accounting. */
> +	t0 = ktime_get();
> +	while (ktime_us_delta(ktime_get(), t0) < 1000)
> +		cpu_relax();
> +
> +	/*
> +	 * Phase 2: block on the mutex -> on_cpu->off_cpu transition.
> +	 * The 10 ms budget fires while we are off-CPU.
> +	 */
> +	mutex_lock(&ctx.lock);
> +	mutex_unlock(&ctx.lock);
> +
> +	/* Timer already fired and removed the entry -> -ESRCH */
> +	KUNIT_EXPECT_EQ(test, tlob_stop_task(current), -ESRCH);
> +}
> +
> +/* Arm a 1 ms budget and busy-spin for 50 ms; timer fires on-CPU. */
> +static void tlob_deadline_fires_on_cpu(struct kunit *test)
> +{
> +	ktime_t t0;
> +	int ret;
> +
> +	ret = tlob_start_task(current, 1000 /* 1 ms */, NULL, 0);
> +	KUNIT_ASSERT_EQ(test, ret, 0);
> +
> +	/* Busy-spin 50 ms - 50x the budget */
> +	t0 = ktime_get();
> +	while (ktime_us_delta(ktime_get(), t0) < 50000)
> +		cpu_relax();
> +
> +	/* Timer fired during the spin; entry is gone */
> +	KUNIT_EXPECT_EQ(test, tlob_stop_task(current), -ESRCH);
> +}
> +
> +/*
> + * Start three tasks, call tlob_destroy_monitor() + tlob_init_monitor(),
> + * and verify the table is empty afterwards.
> + */
> +static int tlob_dummy_fn(void *arg)
> +{
> +	wait_for_completion((struct completion *)arg);
> +	return 0;
> +}
> +
> +static void tlob_stop_all_cleanup(struct kunit *test)
> +{
> +	struct completion done1, done2;
> +	struct task_struct *t1, *t2;
> +	int ret;
> +
> +	init_completion(&done1);
> +	init_completion(&done2);
> +
> +	t1 = kthread_run(tlob_dummy_fn, &done1, "tlob_dummy1");
> +	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, t1);
> +	get_task_struct(t1);
> +
> +	t2 = kthread_run(tlob_dummy_fn, &done2, "tlob_dummy2");
> +	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, t2);
> +	get_task_struct(t2);
> +
> +	KUNIT_ASSERT_EQ(test, tlob_start_task(current, 10000000, NULL, 0),
> 0);
> +	KUNIT_ASSERT_EQ(test, tlob_start_task(t1, 10000000, NULL, 0), 0);
> +	KUNIT_ASSERT_EQ(test, tlob_start_task(t2, 10000000, NULL, 0), 0);
> +
> +	/* Destroy clears all entries via tlob_stop_all() */
> +	tlob_destroy_monitor();
> +	ret = tlob_init_monitor();
> +	KUNIT_ASSERT_EQ(test, ret, 0);
> +
> +	/* Table must be empty now */
> +	KUNIT_EXPECT_EQ(test, tlob_stop_task(current), -ESRCH);
> +	KUNIT_EXPECT_EQ(test, tlob_stop_task(t1), -ESRCH);
> +	KUNIT_EXPECT_EQ(test, tlob_stop_task(t2), -ESRCH);
> +
> +	complete(&done1);
> +	complete(&done2);
> +	/*
> +	 * completions live on stack; wait for kthreads to exit before
> return.
> +	 */
> +	kthread_stop(t1);
> +	kthread_stop(t2);
> +	put_task_struct(t1);
> +	put_task_struct(t2);
> +}
> +
> +/* A threshold that overflows ktime_t must be rejected with -ERANGE. */
> +static void tlob_overflow_threshold(struct kunit *test)
> +{
> +	/* KTIME_MAX / NSEC_PER_USEC + 1 overflows ktime_t */
> +	u64 too_large = (u64)(KTIME_MAX / NSEC_PER_USEC) + 1;
> +
> +	KUNIT_EXPECT_EQ(test,
> +		tlob_start_task(current, too_large, NULL, 0),
> +		-ERANGE);
> +}
> +
> +static int tlob_task_api_suite_init(struct kunit_suite *suite)
> +{
> +	return tlob_init_monitor();
> +}
> +
> +static void tlob_task_api_suite_exit(struct kunit_suite *suite)
> +{
> +	tlob_destroy_monitor();
> +}
> +
> +static struct kunit_case tlob_task_api_cases[] = {
> +	KUNIT_CASE(tlob_start_stop_ok),
> +	KUNIT_CASE(tlob_double_start),
> +	KUNIT_CASE(tlob_stop_without_start),
> +	KUNIT_CASE(tlob_immediate_deadline),
> +	KUNIT_CASE(tlob_enospc),
> +	KUNIT_CASE(tlob_overflow_threshold),
> +	KUNIT_CASE(tlob_deadline_fires_off_cpu),
> +	KUNIT_CASE(tlob_deadline_fires_on_cpu),
> +	KUNIT_CASE(tlob_stop_all_cleanup),
> +	{}
> +};
> +
> +static struct kunit_suite tlob_task_api_suite = {
> +	.name       = "tlob_task_api",
> +	.suite_init = tlob_task_api_suite_init,
> +	.suite_exit = tlob_task_api_suite_exit,
> +	.test_cases = tlob_task_api_cases,
> +};
> +
> +/*
> + * Suite 3: scheduling integration
> + */
> +
> +struct tlob_ping_ctx {
> +	struct completion ping;
> +	struct completion pong;
> +};
> +
> +static int tlob_ping_fn(void *arg)
> +{
> +	struct tlob_ping_ctx *ctx = arg;
> +
> +	/* Wait for main to give us the CPU back */
> +	wait_for_completion(&ctx->ping);
> +	complete(&ctx->pong);
> +	return 0;
> +}
> +
> +/* Force two context switches and verify stop returns 0 (within budget). */
> +static void tlob_sched_switch_accounting(struct kunit *test)
> +{
> +	struct tlob_ping_ctx ctx;
> +	struct task_struct *peer;
> +	int ret;
> +
> +	init_completion(&ctx.ping);
> +	init_completion(&ctx.pong);
> +
> +	peer = kthread_run(tlob_ping_fn, &ctx, "tlob_ping_kunit");
> +	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, peer);
> +
> +	/* Arm a generous 5 s budget so the timer never fires */
> +	ret = tlob_start_task(current, 5000000, NULL, 0);
> +	KUNIT_ASSERT_EQ(test, ret, 0);
> +
> +	/*
> +	 * complete(ping) -> peer runs, forcing a context switch out and
> back.
> +	 */
> +	complete(&ctx.ping);
> +	wait_for_completion(&ctx.pong);
> +
> +	/*
> +	 * Back on CPU after one off-CPU interval; stop must return 0.
> +	 */
> +	ret = tlob_stop_task(current);
> +	KUNIT_EXPECT_EQ(test, ret, 0);
> +}
> +
> +/*
> + * Verify that monitoring a kthread (not current) works: start on behalf
> + * of a kthread, let it block, then stop it.
> + */
> +static int tlob_block_fn(void *arg)
> +{
> +	struct completion *done = arg;
> +
> +	/* Block briefly, exercising off_cpu accounting for this task */
> +	msleep(20);
> +	complete(done);
> +	return 0;
> +}
> +
> +static void tlob_monitor_other_task(struct kunit *test)
> +{
> +	struct completion done;
> +	struct task_struct *target;
> +	int ret;
> +
> +	init_completion(&done);
> +
> +	target = kthread_run(tlob_block_fn, &done, "tlob_target_kunit");
> +	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, target);
> +	get_task_struct(target);
> +
> +	/* Arm a 5 s budget for the target task */
> +	ret = tlob_start_task(target, 5000000, NULL, 0);
> +	KUNIT_ASSERT_EQ(test, ret, 0);
> +
> +	wait_for_completion(&done);
> +
> +	/*
> +	 * Target has finished; stop_task may return 0 (still in htable)
> +	 * or -ESRCH (kthread exited and timer fired / entry cleaned up).
> +	 */
> +	ret = tlob_stop_task(target);
> +	KUNIT_EXPECT_TRUE(test, ret == 0 || ret == -ESRCH);
> +	put_task_struct(target);
> +}
> +
> +static int tlob_sched_suite_init(struct kunit_suite *suite)
> +{
> +	return tlob_init_monitor();
> +}
> +
> +static void tlob_sched_suite_exit(struct kunit_suite *suite)
> +{
> +	tlob_destroy_monitor();
> +}
> +
> +static struct kunit_case tlob_sched_integration_cases[] = {
> +	KUNIT_CASE(tlob_sched_switch_accounting),
> +	KUNIT_CASE(tlob_monitor_other_task),
> +	{}
> +};
> +
> +static struct kunit_suite tlob_sched_integration_suite = {
> +	.name       = "tlob_sched_integration",
> +	.suite_init = tlob_sched_suite_init,
> +	.suite_exit = tlob_sched_suite_exit,
> +	.test_cases = tlob_sched_integration_cases,
> +};
> +
> +/*
> + * Suite 4: ftrace tracepoint field verification
> + */
> +
> +/* Capture fields from trace_tlob_budget_exceeded for inspection. */
> +struct tlob_exceeded_capture {
> +	atomic_t	fired;		/* 1 after first call */
> +	pid_t		pid;
> +	u64		threshold_us;
> +	u64		on_cpu_us;
> +	u64		off_cpu_us;
> +	u32		switches;
> +	bool		state_is_on_cpu;
> +	u64		tag;
> +};
> +
> +static void
> +probe_tlob_budget_exceeded(void *data,
> +			   struct task_struct *task, u64 threshold_us,
> +			   u64 on_cpu_us, u64 off_cpu_us,
> +			   u32 switches, bool state_is_on_cpu, u64 tag)
> +{
> +	struct tlob_exceeded_capture *cap = data;
> +
> +	/* Only capture the first event to avoid races. */
> +	if (atomic_cmpxchg(&cap->fired, 0, 1) != 0)
> +		return;
> +
> +	cap->pid		= task->pid;
> +	cap->threshold_us	= threshold_us;
> +	cap->on_cpu_us		= on_cpu_us;
> +	cap->off_cpu_us		= off_cpu_us;
> +	cap->switches		= switches;
> +	cap->state_is_on_cpu	= state_is_on_cpu;
> +	cap->tag		= tag;
> +}
> +
> +/*
> + * Arm a 2 ms budget and busy-spin for 60 ms. Verify the tracepoint fires
> + * once with matching threshold, correct pid, and total time >= budget.
> + *
> + * state_is_on_cpu is not asserted: preemption during the spin makes it
> + * non-deterministic.
> + */
> +static void tlob_trace_budget_exceeded_on_cpu(struct kunit *test)
> +{
> +	struct tlob_exceeded_capture cap = {};
> +	const u64 threshold_us = 2000; /* 2 ms */
> +	ktime_t t0;
> +	int ret;
> +
> +	atomic_set(&cap.fired, 0);
> +
> +	ret = register_trace_tlob_budget_exceeded(probe_tlob_budget_exceeded,
> +						  &cap);
> +	KUNIT_ASSERT_EQ(test, ret, 0);
> +
> +	ret = tlob_start_task(current, threshold_us, NULL, 0);
> +	KUNIT_ASSERT_EQ(test, ret, 0);
> +
> +	/* Busy-spin 60 ms  --  30x the budget */
> +	t0 = ktime_get();
> +	while (ktime_us_delta(ktime_get(), t0) < 60000)
> +		cpu_relax();
> +
> +	/* Entry removed by timer; stop returns -ESRCH */
> +	tlob_stop_task(current);
> +
> +	/*
> +	 * Synchronise: ensure the probe callback has completed before we
> +	 * read the captured fields.
> +	 */
> +	tracepoint_synchronize_unregister();
> +	unregister_trace_tlob_budget_exceeded(probe_tlob_budget_exceeded,
> &cap);
> +
> +	KUNIT_EXPECT_EQ(test, atomic_read(&cap.fired), 1);
> +	KUNIT_EXPECT_EQ(test, (int)cap.pid, (int)current->pid);
> +	KUNIT_EXPECT_EQ(test, cap.threshold_us, threshold_us);
> +	/* Total elapsed must cover at least the budget */
> +	KUNIT_EXPECT_GE(test, cap.on_cpu_us + cap.off_cpu_us, threshold_us);
> +}
> +
> +/*
> + * Holder kthread grabs a mutex for 80 ms; arm 10 ms budget, burn ~1 ms
> + * on-CPU, then block on the mutex. Timer fires off-CPU. Verify:
> + * state_is_on_cpu == false, switches >= 1, off_cpu_us > 0.
> + */
> +static void tlob_trace_budget_exceeded_off_cpu(struct kunit *test)
> +{
> +	struct tlob_exceeded_capture cap = {};
> +	struct tlob_holder_ctx ctx = { .hold_ms = 80 };
> +	struct task_struct *holder;
> +	const u64 threshold_us = 10000; /* 10 ms */
> +	ktime_t t0;
> +	int ret;
> +
> +	atomic_set(&cap.fired, 0);
> +
> +	mutex_init(&ctx.lock);
> +	init_completion(&ctx.ready);
> +
> +	holder = kthread_run(tlob_holder_fn, &ctx, "tlob_holder2_kunit");
> +	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, holder);
> +	wait_for_completion(&ctx.ready);
> +
> +	ret = register_trace_tlob_budget_exceeded(probe_tlob_budget_exceeded,
> +						  &cap);
> +	KUNIT_ASSERT_EQ(test, ret, 0);
> +
> +	ret = tlob_start_task(current, threshold_us, NULL, 0);
> +	KUNIT_ASSERT_EQ(test, ret, 0);
> +
> +	/* Phase 1: ~1 ms on-CPU */
> +	t0 = ktime_get();
> +	while (ktime_us_delta(ktime_get(), t0) < 1000)
> +		cpu_relax();
> +
> +	/* Phase 2: block -> off-CPU; timer fires here */
> +	mutex_lock(&ctx.lock);
> +	mutex_unlock(&ctx.lock);
> +
> +	tlob_stop_task(current);
> +
> +	tracepoint_synchronize_unregister();
> +	unregister_trace_tlob_budget_exceeded(probe_tlob_budget_exceeded,
> &cap);
> +
> +	KUNIT_EXPECT_EQ(test, atomic_read(&cap.fired), 1);
> +	KUNIT_EXPECT_EQ(test, cap.threshold_us, threshold_us);
> +	/* Violation happened off-CPU */
> +	KUNIT_EXPECT_FALSE(test, cap.state_is_on_cpu);
> +	/* At least the switch_out event was counted */
> +	KUNIT_EXPECT_GE(test, (u64)cap.switches, (u64)1);
> +	/* Off-CPU time must be non-zero */
> +	KUNIT_EXPECT_GT(test, cap.off_cpu_us, (u64)0);
> +}
> +
> +/* threshold_us in the tracepoint must exactly match the start argument. */
> +static void tlob_trace_threshold_field_accuracy(struct kunit *test)
> +{
> +	static const u64 thresholds[] = { 500, 1000, 3000 };
> +	unsigned int i;
> +
> +	for (i = 0; i < ARRAY_SIZE(thresholds); i++) {
> +		struct tlob_exceeded_capture cap = {};
> +		ktime_t t0;
> +		int ret;
> +
> +		atomic_set(&cap.fired, 0);
> +
> +		ret = register_trace_tlob_budget_exceeded(
> +			probe_tlob_budget_exceeded, &cap);
> +		KUNIT_ASSERT_EQ(test, ret, 0);
> +
> +		ret = tlob_start_task(current, thresholds[i], NULL, 0);
> +		KUNIT_ASSERT_EQ(test, ret, 0);
> +
> +		/* Spin for 20x the threshold to ensure timer fires */
> +		t0 = ktime_get();
> +		while (ktime_us_delta(ktime_get(), t0) <
> +		       (s64)(thresholds[i] * 20))
> +			cpu_relax();
> +
> +		tlob_stop_task(current);
> +
> +		tracepoint_synchronize_unregister();
> +		unregister_trace_tlob_budget_exceeded(
> +			probe_tlob_budget_exceeded, &cap);
> +
> +		KUNIT_EXPECT_EQ_MSG(test, cap.threshold_us, thresholds[i],
> +				    "threshold mismatch for entry %u", i);
> +	}
> +}
> +
> +static int tlob_trace_suite_init(struct kunit_suite *suite)
> +{
> +	int ret;
> +
> +	ret = tlob_init_monitor();
> +	if (ret)
> +		return ret;
> +	return tlob_enable_hooks();
> +}
> +
> +static void tlob_trace_suite_exit(struct kunit_suite *suite)
> +{
> +	tlob_disable_hooks();
> +	tlob_destroy_monitor();
> +}
> +
> +static struct kunit_case tlob_trace_output_cases[] = {
> +	KUNIT_CASE(tlob_trace_budget_exceeded_on_cpu),
> +	KUNIT_CASE(tlob_trace_budget_exceeded_off_cpu),
> +	KUNIT_CASE(tlob_trace_threshold_field_accuracy),
> +	{}
> +};
> +
> +static struct kunit_suite tlob_trace_output_suite = {
> +	.name       = "tlob_trace_output",
> +	.suite_init = tlob_trace_suite_init,
> +	.suite_exit = tlob_trace_suite_exit,
> +	.test_cases = tlob_trace_output_cases,
> +};
> +
> +/* Suite 5: ring buffer */
> +
> +/*
> + * Allocate a synthetic rv_file_priv for ring buffer tests. Uses
> + * kunit_kzalloc() instead of __get_free_pages() since the ring is never
> + * mmap'd here.
> + */
> +static struct rv_file_priv *alloc_priv_kunit(struct kunit *test, u32 cap)
> +{
> +	struct rv_file_priv *priv;
> +	struct tlob_ring *ring;
> +
> +	priv = kunit_kzalloc(test, sizeof(*priv), GFP_KERNEL);
> +	if (!priv)
> +		return NULL;
> +
> +	ring = &priv->ring;
> +
> +	ring->page = kunit_kzalloc(test, sizeof(struct tlob_mmap_page),
> +				   GFP_KERNEL);
> +	if (!ring->page)
> +		return NULL;
> +
> +	ring->data = kunit_kzalloc(test, cap * sizeof(struct tlob_event),
> +				   GFP_KERNEL);
> +	if (!ring->data)
> +		return NULL;
> +
> +	ring->mask            = cap - 1;
> +	ring->page->capacity  = cap;
> +	ring->page->version   = 1;
> +	ring->page->data_offset = PAGE_SIZE; /* nominal; not used in tests */
> +	ring->page->record_size = sizeof(struct tlob_event);
> +	spin_lock_init(&ring->lock);
> +	init_waitqueue_head(&priv->waitq);
> +	return priv;
> +}
> +
> +/* Push one record and verify all fields survive the round-trip. */
> +static void tlob_event_push_one(struct kunit *test)
> +{
> +	struct rv_file_priv *priv;
> +	struct tlob_ring *ring;
> +	struct tlob_event in = {
> +		.tid		= 1234,
> +		.threshold_us	= 5000,
> +		.on_cpu_us	= 3000,
> +		.off_cpu_us	= 2000,
> +		.switches	= 3,
> +		.state		= 1,
> +	};
> +	struct tlob_event out = {};
> +	u32 tail;
> +
> +	priv = alloc_priv_kunit(test, TLOB_RING_DEFAULT_CAP);
> +	KUNIT_ASSERT_NOT_NULL(test, priv);
> +
> +	ring = &priv->ring;
> +
> +	tlob_event_push_kunit(priv, &in);
> +
> +	/* One record written, none dropped */
> +	KUNIT_EXPECT_EQ(test, ring->page->data_head, 1u);
> +	KUNIT_EXPECT_EQ(test, ring->page->data_tail, 0u);
> +	KUNIT_EXPECT_EQ(test, ring->page->dropped,   0ull);
> +
> +	/* Dequeue manually */
> +	tail = ring->page->data_tail;
> +	out  = ring->data[tail & ring->mask];
> +	ring->page->data_tail = tail + 1;
> +
> +	KUNIT_EXPECT_EQ(test, out.tid,          in.tid);
> +	KUNIT_EXPECT_EQ(test, out.threshold_us, in.threshold_us);
> +	KUNIT_EXPECT_EQ(test, out.on_cpu_us,    in.on_cpu_us);
> +	KUNIT_EXPECT_EQ(test, out.off_cpu_us,   in.off_cpu_us);
> +	KUNIT_EXPECT_EQ(test, out.switches,     in.switches);
> +	KUNIT_EXPECT_EQ(test, out.state,        in.state);
> +
> +	/* Ring is now empty */
> +	KUNIT_EXPECT_EQ(test, ring->page->data_head, ring->page->data_tail);
> +}
> +
> +/*
> + * Fill to capacity, push one more. Drop-new policy: head stays at cap,
> + * dropped == 1, oldest record is preserved.
> + */
> +static void tlob_event_push_overflow(struct kunit *test)
> +{
> +	struct rv_file_priv *priv;
> +	struct tlob_ring *ring;
> +	struct tlob_event ntf = {};
> +	struct tlob_event out = {};
> +	const u32 cap = TLOB_RING_MIN_CAP;
> +	u32 i;
> +
> +	priv = alloc_priv_kunit(test, cap);
> +	KUNIT_ASSERT_NOT_NULL(test, priv);
> +
> +	ring = &priv->ring;
> +
> +	/* Push cap + 1 records; tid encodes the sequence */
> +	for (i = 0; i <= cap; i++) {
> +		ntf.tid          = i;
> +		ntf.threshold_us = (u64)i * 1000;
> +		tlob_event_push_kunit(priv, &ntf);
> +	}
> +
> +	/* Drop-new: head stopped at cap; one record was silently discarded
> */
> +	KUNIT_EXPECT_EQ(test, ring->page->data_head, cap);
> +	KUNIT_EXPECT_EQ(test, ring->page->data_tail, 0u);
> +	KUNIT_EXPECT_EQ(test, ring->page->dropped,   1ull);
> +
> +	/* Oldest surviving record must be the first one pushed (tid == 0) */
> +	out = ring->data[ring->page->data_tail & ring->mask];
> +	KUNIT_EXPECT_EQ(test, out.tid, 0u);
> +
> +	/* Drain the ring; the last record must have tid == cap - 1 */
> +	for (i = 0; i < cap; i++) {
> +		u32 tail = ring->page->data_tail;
> +
> +		out = ring->data[tail & ring->mask];
> +		ring->page->data_tail = tail + 1;
> +	}
> +	KUNIT_EXPECT_EQ(test, out.tid, cap - 1);
> +	KUNIT_EXPECT_EQ(test, ring->page->data_head, ring->page->data_tail);
> +}
> +
> +/* A freshly initialised ring is empty. */
> +static void tlob_event_empty(struct kunit *test)
> +{
> +	struct rv_file_priv *priv;
> +	struct tlob_ring *ring;
> +
> +	priv = alloc_priv_kunit(test, TLOB_RING_DEFAULT_CAP);
> +	KUNIT_ASSERT_NOT_NULL(test, priv);
> +
> +	ring = &priv->ring;
> +
> +	KUNIT_EXPECT_EQ(test, ring->page->data_head, 0u);
> +	KUNIT_EXPECT_EQ(test, ring->page->data_tail, 0u);
> +	KUNIT_EXPECT_EQ(test, ring->page->dropped,   0ull);
> +}
> +
> +/* A kthread blocks on wait_event_interruptible(); pushing one record must
> + * wake it within 1 s.
> + */
> +
> +struct tlob_wakeup_ctx {
> +	struct rv_file_priv	*priv;
> +	struct completion	 ready;
> +	struct completion	 done;
> +	int			 woke;
> +};
> +
> +static int tlob_wakeup_thread(void *arg)
> +{
> +	struct tlob_wakeup_ctx *ctx = arg;
> +	struct tlob_ring *ring = &ctx->priv->ring;
> +
> +	complete(&ctx->ready);
> +
> +	wait_event_interruptible(ctx->priv->waitq,
> +		smp_load_acquire(&ring->page->data_head) !=
> +		READ_ONCE(ring->page->data_tail) ||
> +		kthread_should_stop());
> +
> +	if (smp_load_acquire(&ring->page->data_head) !=
> +	    READ_ONCE(ring->page->data_tail))
> +		ctx->woke = 1;
> +
> +	complete(&ctx->done);
> +	return 0;
> +}
> +
> +static void tlob_ring_wakeup(struct kunit *test)
> +{
> +	struct rv_file_priv *priv;
> +	struct tlob_wakeup_ctx ctx;
> +	struct task_struct *t;
> +	struct tlob_event ev = { .tid = 99 };
> +	long timeout;
> +
> +	priv = alloc_priv_kunit(test, TLOB_RING_DEFAULT_CAP);
> +	KUNIT_ASSERT_NOT_NULL(test, priv);
> +
> +	init_completion(&ctx.ready);
> +	init_completion(&ctx.done);
> +	ctx.priv = priv;
> +	ctx.woke = 0;
> +
> +	t = kthread_run(tlob_wakeup_thread, &ctx, "tlob_wakeup_kunit");
> +	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, t);
> +	get_task_struct(t);
> +
> +	/* Let the kthread reach wait_event_interruptible */
> +	wait_for_completion(&ctx.ready);
> +	usleep_range(10000, 20000);
> +
> +	/* Push one record  --  must wake the waiter */
> +	tlob_event_push_kunit(priv, &ev);
> +
> +	timeout = wait_for_completion_timeout(&ctx.done,
> msecs_to_jiffies(1000));
> +	kthread_stop(t);
> +	put_task_struct(t);
> +
> +	KUNIT_EXPECT_GT(test, timeout, 0L);
> +	KUNIT_EXPECT_EQ(test, ctx.woke, 1);
> +	KUNIT_EXPECT_EQ(test, priv->ring.page->data_head, 1u);
> +}
> +
> +static struct kunit_case tlob_event_buf_cases[] = {
> +	KUNIT_CASE(tlob_event_push_one),
> +	KUNIT_CASE(tlob_event_push_overflow),
> +	KUNIT_CASE(tlob_event_empty),
> +	KUNIT_CASE(tlob_ring_wakeup),
> +	{}
> +};
> +
> +static struct kunit_suite tlob_event_buf_suite = {
> +	.name       = "tlob_event_buf",
> +	.test_cases = tlob_event_buf_cases,
> +};
> +
> +/* Suite 6: uprobe format string parser */
> +
> +/* Happy path: decimal offsets, plain path. */
> +static void tlob_parse_decimal_offsets(struct kunit *test)
> +{
> +	char buf[] = "5000:4768:4848:/usr/bin/myapp";
> +	u64 thr; loff_t start, stop; char *path;
> +
> +	KUNIT_EXPECT_EQ(test,
> +		tlob_parse_uprobe_line(buf, &thr, &path, &start, &stop),
> +		0);
> +	KUNIT_EXPECT_EQ(test, thr,      (u64)5000);
> +	KUNIT_EXPECT_EQ(test, start,    (loff_t)4768);
> +	KUNIT_EXPECT_EQ(test, stop,     (loff_t)4848);
> +	KUNIT_EXPECT_STREQ(test, path,  "/usr/bin/myapp");
> +}
> +
> +/* Happy path: 0x-prefixed hex offsets. */
> +static void tlob_parse_hex_offsets(struct kunit *test)
> +{
> +	char buf[] = "10000:0x12a0:0x12f0:/usr/bin/myapp";
> +	u64 thr; loff_t start, stop; char *path;
> +
> +	KUNIT_EXPECT_EQ(test,
> +		tlob_parse_uprobe_line(buf, &thr, &path, &start, &stop),
> +		0);
> +	KUNIT_EXPECT_EQ(test, start,   (loff_t)0x12a0);
> +	KUNIT_EXPECT_EQ(test, stop,    (loff_t)0x12f0);
> +	KUNIT_EXPECT_STREQ(test, path, "/usr/bin/myapp");
> +}
> +
> +/* Path containing ':' must not be truncated. */
> +static void tlob_parse_path_with_colon(struct kunit *test)
> +{
> +	char buf[] = "1000:0x100:0x200:/opt/my:app/bin";
> +	u64 thr; loff_t start, stop; char *path;
> +
> +	KUNIT_EXPECT_EQ(test,
> +		tlob_parse_uprobe_line(buf, &thr, &path, &start, &stop),
> +		0);
> +	KUNIT_EXPECT_STREQ(test, path, "/opt/my:app/bin");
> +}
> +
> +/* Zero threshold must be rejected. */
> +static void tlob_parse_zero_threshold(struct kunit *test)
> +{
> +	char buf[] = "0:0x100:0x200:/usr/bin/myapp";
> +	u64 thr; loff_t start, stop; char *path;
> +
> +	KUNIT_EXPECT_EQ(test,
> +		tlob_parse_uprobe_line(buf, &thr, &path, &start, &stop),
> +		-EINVAL);
> +}
> +
> +/* Empty path (trailing ':' with nothing after) must be rejected. */
> +static void tlob_parse_empty_path(struct kunit *test)
> +{
> +	char buf[] = "5000:0x100:0x200:";
> +	u64 thr; loff_t start, stop; char *path;
> +
> +	KUNIT_EXPECT_EQ(test,
> +		tlob_parse_uprobe_line(buf, &thr, &path, &start, &stop),
> +		-EINVAL);
> +}
> +
> +/* Missing field (3 tokens instead of 4) must be rejected. */
> +static void tlob_parse_too_few_fields(struct kunit *test)
> +{
> +	char buf[] = "5000:0x100:/usr/bin/myapp";
> +	u64 thr; loff_t start, stop; char *path;
> +
> +	KUNIT_EXPECT_EQ(test,
> +		tlob_parse_uprobe_line(buf, &thr, &path, &start, &stop),
> +		-EINVAL);
> +}
> +
> +/* Negative offset must be rejected. */
> +static void tlob_parse_negative_offset(struct kunit *test)
> +{
> +	char buf[] = "5000:-1:0x200:/usr/bin/myapp";
> +	u64 thr; loff_t start, stop; char *path;
> +
> +	KUNIT_EXPECT_EQ(test,
> +		tlob_parse_uprobe_line(buf, &thr, &path, &start, &stop),
> +		-EINVAL);
> +}
> +
> +static struct kunit_case tlob_parse_uprobe_cases[] = {
> +	KUNIT_CASE(tlob_parse_decimal_offsets),
> +	KUNIT_CASE(tlob_parse_hex_offsets),
> +	KUNIT_CASE(tlob_parse_path_with_colon),
> +	KUNIT_CASE(tlob_parse_zero_threshold),
> +	KUNIT_CASE(tlob_parse_empty_path),
> +	KUNIT_CASE(tlob_parse_too_few_fields),
> +	KUNIT_CASE(tlob_parse_negative_offset),
> +	{}
> +};
> +
> +static struct kunit_suite tlob_parse_uprobe_suite = {
> +	.name       = "tlob_parse_uprobe",
> +	.test_cases = tlob_parse_uprobe_cases,
> +};
> +
> +kunit_test_suites(&tlob_automaton_suite,
> +		  &tlob_task_api_suite,
> +		  &tlob_sched_integration_suite,
> +		  &tlob_trace_output_suite,
> +		  &tlob_event_buf_suite,
> +		  &tlob_parse_uprobe_suite);
> +
> +MODULE_DESCRIPTION("KUnit tests for the tlob RV monitor");
> +MODULE_LICENSE("GPL");


^ permalink raw reply

* [PATCH v3] tracing/osnoise: Add option to align tlat threads
From: Tomas Glozar @ 2026-04-16 11:59 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu
  Cc: Mathieu Desnoyers, John Kacur, Luis Goncalves, Crystal Wood,
	Costa Shulyupin, Wander Lairson Costa, LKML, linux-trace-kernel,
	Tomas Glozar

Add an option called TIMERLAT_ALIGN to osnoise/options, together with a
corresponding setting osnoise/timerlat_align_us.

This option sets the alignment of wakeup times between different
timerlat threads, similarly to cyclictest's -A/--aligned option. If
TIMERLAT_ALIGN is set, the first thread that reaches the first cycle
records its first wake-up time. Each following thread sets its first
wake-up time to a fixed offset from the recorded time, and increments
it by the same offset.

Example:

osnoise/timerlat_period is set to 1000, osnoise/timerlat_align_us is
set to 20. There are four threads, on CPUs 1 to 4.

- CPU 4 enters first cycle first. The current time is 20000us, so
the wake-up of the first cycle is set to 21000us. This time is recorded.
- CPU 2 enter first cycle next. It reads the recorded time, increments
it to 21020us, and uses this value as its own wake-up time for the first
cycle.
- CPU 3 enters first cycle next. It reads the recorded time, increments
it to 21040 us, and uses the value as its own wake-up time.
- CPU 1 proceeds analogically.

In each next cycle, the wake-up time (called "absolute period" in
timerlat code) is incremented by the (relative) period of 1000us. Thus,
the wake-ups in the following cycles (provided the times are reached and
not in the past) will be as follows:

CPU 1		CPU 2		CPU 3	 	CPU 4
21080us		21020us		21040us		21000us
22080us		22020us		22040us		22000us
...		...		...		...

Even if any cycle is skipped due to e.g. the first cycle calculation
happening later, the alignment stays in place.

Signed-off-by: Tomas Glozar <tglozar@redhat.com>
Reviewed-by: Wander Lairson Costa <wander@redhat.com>
Reviewed-by: Crystal Wood <crwood@redhat.com>
---
v2 + discussion: https://lore.kernel.org/linux-trace-kernel/20260302131316.385987-1-tglozar@redhat.com/T/#u

v1 + discussion: https://lore.kernel.org/linux-trace-kernel/20260227150420.319528-1-tglozar@redhat.com/T/#u

v3:
- Move align_next up and reset it in tlat_var_reset() instead of
osnoise_workload_start() to fix build failure with CONFIG_TIMERLAT_TRACER=n.
(Bug found by Steven Rostedt, fix suggested by Crystal Wood.)

v2:
- Make align_next global and reset it to 0 in osnoise_workload_start()
so that it gets set by the first thread of each measurement and is not stuck
on what is set by the first measurement until reboot.
- Use atomic64_add_return_relaxed() in place of atomic64_fetch_add_relaxed()
to make the code shorter and easier to read.
- Add more detailed comments to the alignment synchronization logic.
- Fix two typos in the commit message: 50 -> 20 in the example introduction,
and incremenets -> increments.

I tested the patch the same way I did for v2:

[root@cs9 tglozar]# cd /sys/kernel/tracing/osnoise/
[root@cs9 osnoise]# echo TIMERLAT_ALIGN > options
[root@cs9 osnoise]# echo 40 > timerlat_align_us
[root@cs9 osnoise]# rtla timerlat top -q -c 0,1,2,3 -d 1s
...
[root@cs9 osnoise]# rtla timerlat top -q -c 0,1,2,3 -d 1s
...
[root@cs9 osnoise]# dmesg | tail -n9
[   25.664229] timerlat: thread 0 setting align_next to 25659481645
[   25.664794] timerlat: aligning thread 1 to 25659521645
[   25.665294] timerlat: aligning thread 2 to 25659561645
[   25.665770] timerlat: aligning thread 3 to 25659601645
[   30.370790] NFSD: all clients done reclaiming, ending NFSv4 grace period (net effffff9)
[   30.442436] timerlat: thread 0 setting align_next to 30437695876
[   30.442853] timerlat: aligning thread 1 to 30437735876
[   30.443141] timerlat: aligning thread 2 to 30437775876
[   30.443659] timerlat: aligning thread 3 to 30437815876
[root@cs9 osnoise]#

to show that align_next is indeed reset. Additionaly I tested that building
without CONFIG_TIMERLAT_TRACER works.

 kernel/trace/trace_osnoise.c | 54 +++++++++++++++++++++++++++++++++++-
 1 file changed, 53 insertions(+), 1 deletion(-)

diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c
index be6cf0bb3c03..75678053b21c 100644
--- a/kernel/trace/trace_osnoise.c
+++ b/kernel/trace/trace_osnoise.c
@@ -58,6 +58,7 @@ enum osnoise_options_index {
 	OSN_PANIC_ON_STOP,
 	OSN_PREEMPT_DISABLE,
 	OSN_IRQ_DISABLE,
+	OSN_TIMERLAT_ALIGN,
 	OSN_MAX
 };
 
@@ -66,7 +67,8 @@ static const char * const osnoise_options_str[OSN_MAX] = {
 							"OSNOISE_WORKLOAD",
 							"PANIC_ON_STOP",
 							"OSNOISE_PREEMPT_DISABLE",
-							"OSNOISE_IRQ_DISABLE" };
+							"OSNOISE_IRQ_DISABLE",
+							"TIMERLAT_ALIGN" };
 
 #define OSN_DEFAULT_OPTIONS		0x2
 static unsigned long osnoise_options	= OSN_DEFAULT_OPTIONS;
@@ -250,6 +252,11 @@ struct timerlat_variables {
 
 static DEFINE_PER_CPU(struct timerlat_variables, per_cpu_timerlat_var);
 
+/*
+ * timerlat wake-up offset for next thread with TIMERLAT_ALIGN set.
+ */
+static atomic64_t align_next;
+
 /*
  * this_cpu_tmr_var - Return the per-cpu timerlat_variables on its relative CPU
  */
@@ -268,6 +275,7 @@ static inline void tlat_var_reset(void)
 
 	/* Synchronize with the timerlat interfaces */
 	mutex_lock(&interface_lock);
+
 	/*
 	 * So far, all the values are initialized as 0, so
 	 * zeroing the structure is perfect.
@@ -278,6 +286,12 @@ static inline void tlat_var_reset(void)
 			hrtimer_cancel(&tlat_var->timer);
 		memset(tlat_var, 0, sizeof(*tlat_var));
 	}
+	/*
+	 * Reset also align_next, to be filled by a new offset by the first timerlat
+	 * thread that wakes up, if TIMERLAT_ALIGN is set.
+	 */
+	atomic64_set(&align_next, 0);
+
 	mutex_unlock(&interface_lock);
 }
 #else /* CONFIG_TIMERLAT_TRACER */
@@ -326,6 +340,7 @@ static struct osnoise_data {
 	u64	stop_tracing_total;	/* stop trace in the final operation (report/thread) */
 #ifdef CONFIG_TIMERLAT_TRACER
 	u64	timerlat_period;	/* timerlat period */
+	u64	timerlat_align_us;	/* timerlat alignment */
 	u64	print_stack;		/* print IRQ stack if total > */
 	int	timerlat_tracer;	/* timerlat tracer */
 #endif
@@ -338,6 +353,7 @@ static struct osnoise_data {
 #ifdef CONFIG_TIMERLAT_TRACER
 	.print_stack			= 0,
 	.timerlat_period		= DEFAULT_TIMERLAT_PERIOD,
+	.timerlat_align_us		= 0,
 	.timerlat_tracer		= 0,
 #endif
 };
@@ -1829,6 +1845,26 @@ static int wait_next_period(struct timerlat_variables *tlat)
 	 */
 	tlat->abs_period = (u64) ktime_to_ns(next_abs_period);
 
+	/*
+	 * Align thread in the first cycle on each CPU to the set alignment
+	 * if TIMERLAT_ALIGN is set.
+	 *
+	 * This is done by using an atomic64_t to store the next absolute period.
+	 * The first thread that wakes up will set the atomic64_t to its
+	 * absolute period, and the other threads will increment it by
+	 * the alignment value.
+	 */
+	if (test_bit(OSN_TIMERLAT_ALIGN, &osnoise_options) && !tlat->count
+	    && atomic64_cmpxchg_relaxed(&align_next, 0, tlat->abs_period)) {
+		/*
+		 * A thread has already set align_next, use it and increment it
+		 * to be used by the next thread that wakes up after this one.
+		 */
+		tlat->abs_period = atomic64_add_return_relaxed(
+			osnoise_data.timerlat_align_us * 1000, &align_next);
+		next_abs_period = ns_to_ktime(tlat->abs_period);
+	}
+
 	/*
 	 * If the new abs_period is in the past, skip the activation.
 	 */
@@ -2650,6 +2686,17 @@ static struct trace_min_max_param timerlat_period = {
 	.min	= &timerlat_min_period,
 };
 
+/*
+ * osnoise/timerlat_align_us: align the first wakeup of all timerlat
+ * threads to a common boundary (in us). 0 means disabled.
+ */
+static struct trace_min_max_param timerlat_align_us = {
+	.lock	= &interface_lock,
+	.val	= &osnoise_data.timerlat_align_us,
+	.max	= NULL,
+	.min	= NULL,
+};
+
 static const struct file_operations timerlat_fd_fops = {
 	.open		= timerlat_fd_open,
 	.read		= timerlat_fd_read,
@@ -2746,6 +2793,11 @@ static int init_timerlat_tracefs(struct dentry *top_dir)
 	if (!tmp)
 		return -ENOMEM;
 
+	tmp = tracefs_create_file("timerlat_align_us", TRACE_MODE_WRITE, top_dir,
+				  &timerlat_align_us, &trace_min_max_fops);
+	if (!tmp)
+		return -ENOMEM;
+
 	retval = osnoise_create_cpu_timerlat_fd(top_dir);
 	if (retval)
 		return retval;
-- 
2.53.0


^ permalink raw reply related

* Re: [RFC PATCH 4/4] selftests/rv: Add selftest for the tlob monitor
From: Gabriele Monaco @ 2026-04-16 12:00 UTC (permalink / raw)
  To: wen.yang
  Cc: linux-trace-kernel, linux-kernel, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers
In-Reply-To: <5bdd82dd8aeb1d3f955b727ae1fce9819b35c170.1776020428.git.wen.yang@linux.dev>

On Mon, 2026-04-13 at 03:27 +0800, wen.yang@linux.dev wrote:
> From: Wen Yang <wen.yang@linux.dev>
> 
> Add a kselftest suite (TAP output, 19 test points) for the tlob RV
> monitor under tools/testing/selftests/rv/.
> 
> test_tlob.sh drives a compiled C helper (tlob_helper) and, for uprobe
> tests, a target binary (tlob_uprobe_target). Coverage spans the
> tracefs enable/disable path, uprobe-triggered violations, and the
> ioctl interface (within-budget stop, CPU-bound and sleep violations,
> duplicate start, ring buffer mmap and consumption).
> 
> Requires CONFIG_RV_MON_TLOB=y and CONFIG_RV_CHARDEV=y; must be run
> as root.
> 
> Signed-off-by: Wen Yang <wen.yang@linux.dev>

Those are some extensive selftests!
Could you integrate them with the existing test suite under
tools/testing/selftests/verification ?

You would probably just get your tlob_helper built and call it from some shell
script under test.d, the harness should work without the need for extra helpers.

Thanks,
Gabriele

> ---
>  tools/include/uapi/linux/rv.h                 |  54 +
>  tools/testing/selftests/rv/Makefile           |  18 +
>  tools/testing/selftests/rv/test_tlob.sh       | 563 ++++++++++
>  tools/testing/selftests/rv/tlob_helper.c      | 994 ++++++++++++++++++
>  .../testing/selftests/rv/tlob_uprobe_target.c | 108 ++
>  5 files changed, 1737 insertions(+)
>  create mode 100644 tools/include/uapi/linux/rv.h
>  create mode 100644 tools/testing/selftests/rv/Makefile
>  create mode 100755 tools/testing/selftests/rv/test_tlob.sh
>  create mode 100644 tools/testing/selftests/rv/tlob_helper.c
>  create mode 100644 tools/testing/selftests/rv/tlob_uprobe_target.c
> 
> diff --git a/tools/include/uapi/linux/rv.h b/tools/include/uapi/linux/rv.h
> new file mode 100644
> index 000000000..bef07aded
> --- /dev/null
> +++ b/tools/include/uapi/linux/rv.h
> @@ -0,0 +1,54 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +/*
> + * UAPI definitions for Runtime Verification (RV) monitors.
> + *
> + * This is a tools-friendly copy of include/uapi/linux/rv.h.
> + * Keep in sync with the kernel header.
> + */
> +
> +#ifndef _UAPI_LINUX_RV_H
> +#define _UAPI_LINUX_RV_H
> +
> +#include <linux/types.h>
> +#include <sys/ioctl.h>
> +
> +/* Magic byte shared by all RV monitor ioctls. */
> +#define RV_IOC_MAGIC	0xB9
> +
> +/* -----------------------------------------------------------------------
> + * tlob: task latency over budget monitor  (nr 0x01 - 0x1F)
> + * -----------------------------------------------------------------------
> + */
> +
> +struct tlob_start_args {
> +	__u64 threshold_us;
> +	__u64 tag;
> +	__s32 notify_fd;
> +	__u32 flags;
> +};
> +
> +struct tlob_event {
> +	__u32 tid;
> +	__u32 pad;
> +	__u64 threshold_us;
> +	__u64 on_cpu_us;
> +	__u64 off_cpu_us;
> +	__u32 switches;
> +	__u32 state;   /* 1 = on_cpu, 0 = off_cpu */
> +	__u64 tag;
> +};
> +
> +struct tlob_mmap_page {
> +	__u32  data_head;
> +	__u32  data_tail;
> +	__u32  capacity;
> +	__u32  version;
> +	__u32  data_offset;
> +	__u32  record_size;
> +	__u64  dropped;
> +};
> +
> +#define TLOB_IOCTL_TRACE_START	_IOW(RV_IOC_MAGIC, 0x01, struct
> tlob_start_args)
> +#define TLOB_IOCTL_TRACE_STOP	_IO(RV_IOC_MAGIC,  0x02)
> +
> +#endif /* _UAPI_LINUX_RV_H */
> diff --git a/tools/testing/selftests/rv/Makefile
> b/tools/testing/selftests/rv/Makefile
> new file mode 100644
> index 000000000..14e94a1ab
> --- /dev/null
> +++ b/tools/testing/selftests/rv/Makefile
> @@ -0,0 +1,18 @@
> +# SPDX-License-Identifier: GPL-2.0
> +# Makefile for rv selftests
> +
> +TEST_GEN_PROGS := tlob_helper tlob_uprobe_target
> +
> +TEST_PROGS := \
> +	test_tlob.sh \
> +
> +# TOOLS_INCLUDES is defined by ../lib.mk; provides -isystem to
> +# tools/include/uapi so that #include <linux/rv.h> resolves to the
> +# in-tree UAPI header without requiring make headers_install.
> +# Note: both must be added to the global variables, not as target-specific
> +# overrides, because lib.mk rewrites TEST_GEN_PROGS to $(OUTPUT)/name
> +# before per-target rules would be evaluated.
> +CFLAGS += $(TOOLS_INCLUDES)
> +LDLIBS += -lpthread
> +
> +include ../lib.mk
> diff --git a/tools/testing/selftests/rv/test_tlob.sh
> b/tools/testing/selftests/rv/test_tlob.sh
> new file mode 100755
> index 000000000..3ba2125eb
> --- /dev/null
> +++ b/tools/testing/selftests/rv/test_tlob.sh
> @@ -0,0 +1,563 @@
> +#!/bin/sh
> +# SPDX-License-Identifier: GPL-2.0
> +#
> +# Selftest for the tlob (task latency over budget) RV monitor.
> +#
> +# Two interfaces are tested:
> +#
> +#   1. tracefs interface:
> +#        enable/disable, presence of tracefs files,
> +#        uprobe binding (threshold_us:offset_start:offset_stop:binary_path)
> and
> +#        violation detection via the ftrace ring buffer.
> +#
> +#   2. /dev/rv ioctl self-instrumentation (via tlob_helper):
> +#        within-budget, over-budget on-CPU, over-budget off-CPU (sleep),
> +#        double-start, stop-without-start.
> +#
> +# Written to be POSIX sh compatible (no bash-specific extensions).
> +
> +ksft_skip=4
> +t_pass=0; t_fail=0; t_skip=0; t_total=0
> +
> +tap_header() { echo "TAP version 13"; }
> +tap_plan()   { echo "1..$1"; }
> +tap_pass()   { t_pass=$((t_pass+1)); echo "ok $t_total - $1"; }
> +tap_fail()   { t_fail=$((t_fail+1)); echo "not ok $t_total - $1"
> +               [ -n "$2" ] && echo "  # $2"; }
> +tap_skip()   { t_skip=$((t_skip+1)); echo "ok $t_total - $1 # SKIP $2"; }
> +next_test()  { t_total=$((t_total+1)); }
> +
> +TRACEFS=$(grep -m1 tracefs /proc/mounts 2>/dev/null | awk '{print $2}')
> +[ -z "$TRACEFS" ] && TRACEFS=/sys/kernel/tracing
> +
> +RV_DIR="${TRACEFS}/rv"
> +TLOB_DIR="${RV_DIR}/monitors/tlob"
> +TRACE_FILE="${TRACEFS}/trace"
> +TRACING_ON="${TRACEFS}/tracing_on"
> +TLOB_MONITOR="${TLOB_DIR}/monitor"
> +BUDGET_EXCEEDED_ENABLE="${TRACEFS}/events/rv/tlob_budget_exceeded/enable"
> +RV_DEV="/dev/rv"
> +
> +# tlob_helper and tlob_uprobe_target must be in the same directory as
> +# this script or on PATH.
> +SCRIPT_DIR=$(dirname "$0")
> +IOCTL_HELPER="${SCRIPT_DIR}/tlob_helper"
> +UPROBE_TARGET="${SCRIPT_DIR}/tlob_uprobe_target"
> +
> +check_root()     { [ "$(id -u)" = "0" ] || { echo "# Need root" >&2; exit
> $ksft_skip; }; }
> +check_tracefs()  { [ -d "${TRACEFS}" ]   || { echo "# No tracefs" >&2; exit
> $ksft_skip; }; }
> +check_rv_dir()   { [ -d "${RV_DIR}" ]    || { echo "# No RV infra" >&2; exit
> $ksft_skip; }; }
> +check_tlob()     { [ -d "${TLOB_DIR}" ]  || { echo "# No tlob monitor" >&2;
> exit $ksft_skip; }; }
> +
> +tlob_enable()         { echo 1 > "${TLOB_DIR}/enable"; }
> +tlob_disable()        { echo 0 > "${TLOB_DIR}/enable" 2>/dev/null; }
> +tlob_is_enabled()     { [ "$(cat "${TLOB_DIR}/enable" 2>/dev/null)" = "1" ];
> }
> +trace_event_enable()  { echo 1 > "${BUDGET_EXCEEDED_ENABLE}" 2>/dev/null; }
> +trace_event_disable() { echo 0 > "${BUDGET_EXCEEDED_ENABLE}" 2>/dev/null; }
> +trace_on()            { echo 1 > "${TRACING_ON}" 2>/dev/null; }
> +trace_clear()         { echo > "${TRACE_FILE}"; }
> +trace_grep()          { grep -q "$1" "${TRACE_FILE}" 2>/dev/null; }
> +
> +cleanup() {
> +	tlob_disable
> +	trace_event_disable
> +	trace_clear
> +}
> +
> +# ---------------------------------------------------------------------------
> +# Test 1: enable / disable
> +# ---------------------------------------------------------------------------
> +run_test_enable_disable() {
> +	next_test; cleanup
> +	tlob_enable
> +	if ! tlob_is_enabled; then
> +		tap_fail "enable_disable" "not enabled after echo 1";
> cleanup; return
> +	fi
> +	tlob_disable
> +	if tlob_is_enabled; then
> +		tap_fail "enable_disable" "still enabled after echo 0";
> cleanup; return
> +	fi
> +	tap_pass "enable_disable"; cleanup
> +}
> +
> +# ---------------------------------------------------------------------------
> +# Test 2: tracefs files present
> +# ---------------------------------------------------------------------------
> +run_test_tracefs_files() {
> +	next_test; cleanup
> +	missing=""
> +	for f in enable desc monitor; do
> +		[ ! -e "${TLOB_DIR}/${f}" ] && missing="${missing} ${f}"
> +	done
> +	[ -n "${missing}" ] \
> +		&& tap_fail "tracefs_files" "missing:${missing}" \
> +		|| tap_pass "tracefs_files"
> +	cleanup
> +}
> +
> +# ---------------------------------------------------------------------------
> +# Helper: resolve file offset of a function inside a binary.
> +#
> +# Usage: resolve_offset <binary> <vaddr_hex>
> +# Prints the hex file offset, or empty string on failure.
> +# ---------------------------------------------------------------------------
> +resolve_offset() {
> +	bin=$1; vaddr=$2
> +	# Parse /proc/self/maps to find the mapping that contains vaddr.
> +	# Each line: start-end perms offset dev inode [path]
> +	while IFS= read -r line; do
> +		set -- $line
> +		range=$1; off=$4; path=$7
> +		[ -z "$path" ] && continue
> +		# Only consider the mapping for our binary
> +		[ "$path" != "$bin" ] && continue
> +		# Split range into start and end
> +		start=$(echo "$range" | cut -d- -f1)
> +		end=$(echo "$range" | cut -d- -f2)
> +		# Convert hex to decimal for comparison (use printf)
> +		s=$(printf "%d" "0x${start}" 2>/dev/null) || continue
> +		e=$(printf "%d" "0x${end}"   2>/dev/null) || continue
> +		v=$(printf "%d" "${vaddr}"   2>/dev/null) || continue
> +		o=$(printf "%d" "0x${off}"   2>/dev/null) || continue
> +		if [ "$v" -ge "$s" ] && [ "$v" -lt "$e" ]; then
> +			file_off=$(printf "0x%x" $(( (v - s) + o )))
> +			echo "$file_off"
> +			return
> +		fi
> +	done < /proc/self/maps
> +}
> +
> +# ---------------------------------------------------------------------------
> +# Test 3: uprobe binding - no false positive
> +#
> +# Bind this process with a 10 s budget.  Do nothing for 0.5 s.
> +# No budget_exceeded event should appear in the trace.
> +# ---------------------------------------------------------------------------
> +run_test_uprobe_no_false_positive() {
> +	next_test; cleanup
> +	if [ ! -e "${TLOB_MONITOR}" ]; then
> +		tap_skip "uprobe_no_false_positive" "monitor file not
> available"
> +		cleanup; return
> +	fi
> +	# We probe the "sleep" command that we will run as a subprocess.
> +	# Use /bin/sleep as the binary; find a valid function offset (0x0
> +	# resolves to the ELF entry point, which is sufficient for a
> +	# no-false-positive test since we just need the binding to exist).
> +	sleep_bin=$(command -v sleep 2>/dev/null)
> +	if [ -z "$sleep_bin" ]; then
> +		tap_skip "uprobe_no_false_positive" "sleep not found";
> cleanup; return
> +	fi
> +	pid=$$
> +	# offset 0x0 probes the entry point of /bin/sleep - this is a
> +	# deliberate probe that will not fire during a simple 'sleep 10'
> +	# invoked in a subshell, but registers the pid in tlob.
> +	#
> +	# Instead, bind our own pid with a generous 10 s threshold and
> +	# verify that 0.5 s of idle time does NOT fire the timer.
> +	#
> +	# Since we cannot easily get a valid uprobe offset in pure shell,
> +	# we skip this sub-test if we cannot form a valid binding.
> +	exe=$(readlink /proc/self/exe 2>/dev/null)
> +	if [ -z "$exe" ]; then
> +		tap_skip "uprobe_no_false_positive" "cannot read
> /proc/self/exe"
> +		cleanup; return
> +	fi
> +	trace_event_enable
> +	trace_on
> +	tlob_enable
> +	trace_clear
> +	# Sleep without any binding - just verify no spurious events
> +	sleep 0.5
> +	trace_grep "budget_exceeded" \
> +		&& tap_fail "uprobe_no_false_positive" \
> +			"spurious budget_exceeded without any binding" \
> +		|| tap_pass "uprobe_no_false_positive"
> +	cleanup
> +}
> +
> +# ---------------------------------------------------------------------------
> +# Helper: get_uprobe_offset <binary> <symbol>
> +#
> +# Use tlob_helper sym_offset to get the ELF file offset of <symbol>
> +# in <binary>.  Prints the hex offset (e.g. "0x11d0") or empty string on
> +# failure.
> +# ---------------------------------------------------------------------------
> +get_uprobe_offset() {
> +	bin=$1; sym=$2
> +	if [ ! -x "${IOCTL_HELPER}" ]; then
> +		return
> +	fi
> +	"${IOCTL_HELPER}" sym_offset "${bin}" "${sym}" 2>/dev/null
> +}
> +
> +# ---------------------------------------------------------------------------
> +# Test 4: uprobe binding - violation detected
> +#
> +# Start tlob_uprobe_target (a busy-spin binary with a well-known symbol),
> +# attach a uprobe on tlob_busy_work with a 10 ms threshold, and verify
> +# that a budget_expired event appears.
> +# ---------------------------------------------------------------------------
> +run_test_uprobe_violation() {
> +	next_test; cleanup
> +	if [ ! -e "${TLOB_MONITOR}" ]; then
> +		tap_skip "uprobe_violation" "monitor file not available"
> +		cleanup; return
> +	fi
> +	if [ ! -x "${UPROBE_TARGET}" ]; then
> +		tap_skip "uprobe_violation" \
> +			"tlob_uprobe_target not found or not executable"
> +		cleanup; return
> +	fi
> +
> +	# Get the file offsets of the start and stop probe symbols
> +	busy_offset=$(get_uprobe_offset "${UPROBE_TARGET}" "tlob_busy_work")
> +	if [ -z "${busy_offset}" ]; then
> +		tap_skip "uprobe_violation" \
> +			"cannot resolve tlob_busy_work offset in
> ${UPROBE_TARGET}"
> +		cleanup; return
> +	fi
> +	stop_offset=$(get_uprobe_offset "${UPROBE_TARGET}"
> "tlob_busy_work_done")
> +	if [ -z "${stop_offset}" ]; then
> +		tap_skip "uprobe_violation" \
> +			"cannot resolve tlob_busy_work_done offset in
> ${UPROBE_TARGET}"
> +		cleanup; return
> +	fi
> +
> +	# Start the busy-spin target (run for 30 s so the test can observe
> it)
> +	"${UPROBE_TARGET}" 30000 &
> +	busy_pid=$!
> +	sleep 0.05
> +
> +	trace_event_enable
> +	trace_on
> +	tlob_enable
> +	trace_clear
> +
> +	# Bind the target: 10 us budget; start=tlob_busy_work,
> stop=tlob_busy_work_done
> +	binding="10:${busy_offset}:${stop_offset}:${UPROBE_TARGET}"
> +	if ! echo "${binding}" > "${TLOB_MONITOR}" 2>/dev/null; then
> +		kill "${busy_pid}" 2>/dev/null; wait "${busy_pid}"
> 2>/dev/null
> +		tap_skip "uprobe_violation" \
> +			"uprobe binding rejected (CONFIG_UPROBES=y needed)"
> +		cleanup; return
> +	fi
> +
> +	# Wait up to 2 s for a budget_exceeded event
> +	found=0; i=0
> +	while [ "$i" -lt 20 ]; do
> +		sleep 0.1
> +		trace_grep "budget_exceeded" && { found=1; break; }
> +		i=$((i+1))
> +	done
> +
> +	echo "-${busy_offset}:${UPROBE_TARGET}" > "${TLOB_MONITOR}"
> 2>/dev/null
> +	kill "${busy_pid}" 2>/dev/null; wait "${busy_pid}" 2>/dev/null
> +
> +	if [ "${found}" != "1" ]; then
> +		tap_fail "uprobe_violation" "no budget_exceeded within 2 s"
> +		cleanup; return
> +	fi
> +
> +	# Validate the event fields: threshold must match, on_cpu must be
> non-zero
> +	# (CPU-bound violation), and state must be on_cpu.
> +	ev=$(grep "budget_exceeded" "${TRACE_FILE}" | head -n 1)
> +	if ! echo "${ev}" | grep -q "threshold=10 "; then
> +		tap_fail "uprobe_violation" "threshold field mismatch: ${ev}"
> +		cleanup; return
> +	fi
> +	on_cpu=$(echo "${ev}" | grep -o "on_cpu=[0-9]*" | cut -d= -f2)
> +	if [ "${on_cpu:-0}" -eq 0 ]; then
> +		tap_fail "uprobe_violation" "on_cpu=0 for a CPU-bound spin:
> ${ev}"
> +		cleanup; return
> +	fi
> +	if ! echo "${ev}" | grep -q "state=on_cpu"; then
> +		tap_fail "uprobe_violation" "state is not on_cpu: ${ev}"
> +		cleanup; return
> +	fi
> +	tap_pass "uprobe_violation"
> +	cleanup
> +}
> +
> +# ---------------------------------------------------------------------------
> +# Test 5: uprobe binding - remove binding stops monitoring
> +#
> +# Bind a pid via tlob_uprobe_target, then immediately remove it.
> +# Verify that after removal the monitor file no longer lists the pid.
> +# ---------------------------------------------------------------------------
> +run_test_uprobe_unbind() {
> +	next_test; cleanup
> +	if [ ! -e "${TLOB_MONITOR}" ]; then
> +		tap_skip "uprobe_unbind" "monitor file not available"
> +		cleanup; return
> +	fi
> +	if [ ! -x "${UPROBE_TARGET}" ]; then
> +		tap_skip "uprobe_unbind" \
> +			"tlob_uprobe_target not found or not executable"
> +		cleanup; return
> +	fi
> +
> +	busy_offset=$(get_uprobe_offset "${UPROBE_TARGET}" "tlob_busy_work")
> +	stop_offset=$(get_uprobe_offset "${UPROBE_TARGET}"
> "tlob_busy_work_done")
> +	if [ -z "${busy_offset}" ] || [ -z "${stop_offset}" ]; then
> +		tap_skip "uprobe_unbind" \
> +			"cannot resolve tlob_busy_work/tlob_busy_work_done
> offset"
> +		cleanup; return
> +	fi
> +
> +	"${UPROBE_TARGET}" 30000 &
> +	busy_pid=$!
> +	sleep 0.05
> +
> +	tlob_enable
> +	# 5 s budget - should not fire during this quick test
> +	binding="5000000:${busy_offset}:${stop_offset}:${UPROBE_TARGET}"
> +	if ! echo "${binding}" > "${TLOB_MONITOR}" 2>/dev/null; then
> +		kill "${busy_pid}" 2>/dev/null; wait "${busy_pid}"
> 2>/dev/null
> +		tap_skip "uprobe_unbind" \
> +			"uprobe binding rejected (CONFIG_UPROBES=y needed)"
> +		cleanup; return
> +	fi
> +
> +	# Remove the binding
> +	echo "-${busy_offset}:${UPROBE_TARGET}" > "${TLOB_MONITOR}"
> 2>/dev/null
> +
> +	# The monitor file should no longer list the binding for this offset
> +	if grep -q "^[0-9]*:0x${busy_offset#0x}:" "${TLOB_MONITOR}"
> 2>/dev/null; then
> +		kill "${busy_pid}" 2>/dev/null; wait "${busy_pid}"
> 2>/dev/null
> +		tap_fail "uprobe_unbind" "pid still listed after removal"
> +		cleanup; return
> +	fi
> +
> +	kill "${busy_pid}" 2>/dev/null; wait "${busy_pid}" 2>/dev/null
> +	tap_pass "uprobe_unbind"
> +	cleanup
> +}
> +
> +# ---------------------------------------------------------------------------
> +# Test 6: uprobe - duplicate offset_start rejected
> +#
> +# Registering a second binding with the same offset_start in the same binary
> +# must be rejected with an error, since two entry uprobes at the same address
> +# would cause double tlob_start_task() calls and undefined behaviour.
> +# ---------------------------------------------------------------------------
> +run_test_uprobe_duplicate_offset() {
> +	next_test; cleanup
> +	if [ ! -e "${TLOB_MONITOR}" ]; then
> +		tap_skip "uprobe_duplicate_offset" "monitor file not
> available"
> +		cleanup; return
> +	fi
> +	if [ ! -x "${UPROBE_TARGET}" ]; then
> +		tap_skip "uprobe_duplicate_offset" \
> +			"tlob_uprobe_target not found or not executable"
> +		cleanup; return
> +	fi
> +
> +	busy_offset=$(get_uprobe_offset "${UPROBE_TARGET}" "tlob_busy_work")
> +	stop_offset=$(get_uprobe_offset "${UPROBE_TARGET}"
> "tlob_busy_work_done")
> +	if [ -z "${busy_offset}" ] || [ -z "${stop_offset}" ]; then
> +		tap_skip "uprobe_duplicate_offset" \
> +			"cannot resolve tlob_busy_work/tlob_busy_work_done
> offset"
> +		cleanup; return
> +	fi
> +
> +	tlob_enable
> +
> +	# First binding: should succeed
> +	if ! echo "5000000:${busy_offset}:${stop_offset}:${UPROBE_TARGET}" \
> +	        > "${TLOB_MONITOR}" 2>/dev/null; then
> +		tap_skip "uprobe_duplicate_offset" \
> +			"uprobe binding rejected (CONFIG_UPROBES=y needed)"
> +		cleanup; return
> +	fi
> +
> +	# Second binding with same offset_start: must be rejected
> +	if echo "9999:${busy_offset}:${stop_offset}:${UPROBE_TARGET}" \
> +	        > "${TLOB_MONITOR}" 2>/dev/null; then
> +		echo "-${busy_offset}:${UPROBE_TARGET}" > "${TLOB_MONITOR}"
> 2>/dev/null
> +		tap_fail "uprobe_duplicate_offset" \
> +			"duplicate offset_start was accepted (expected
> error)"
> +		cleanup; return
> +	fi
> +
> +	echo "-${busy_offset}:${UPROBE_TARGET}" > "${TLOB_MONITOR}"
> 2>/dev/null
> +	tap_pass "uprobe_duplicate_offset"
> +	cleanup
> +}
> +
> +
> +#
> +# Region A: tlob_busy_work with a 5 s budget - should NOT fire during the
> test.
> +# Region B: tlob_busy_work_done with a 10 us budget - SHOULD fire quickly
> since
> +#           tlob_uprobe_target calls tlob_busy_work_done after a busy spin.
> +#
> +# Verifies that independent bindings for different offsets in the same binary
> +# are tracked separately and that only the tight-budget binding triggers a
> +# budget_exceeded event.
> +# ---------------------------------------------------------------------------
> +run_test_uprobe_independent_thresholds() {
> +	next_test; cleanup
> +	if [ ! -e "${TLOB_MONITOR}" ]; then
> +		tap_skip "uprobe_independent_thresholds" \
> +			"monitor file not available"; cleanup; return
> +	fi
> +	if [ ! -x "${UPROBE_TARGET}" ]; then
> +		tap_skip "uprobe_independent_thresholds" \
> +			"tlob_uprobe_target not found or not executable"
> +		cleanup; return
> +	fi
> +
> +	busy_offset=$(get_uprobe_offset "${UPROBE_TARGET}" "tlob_busy_work")
> +	busy_stop_offset=$(get_uprobe_offset "${UPROBE_TARGET}"
> "tlob_busy_work_done")
> +	if [ -z "${busy_offset}" ] || [ -z "${busy_stop_offset}" ]; then
> +		tap_skip "uprobe_independent_thresholds" \
> +			"cannot resolve tlob_busy_work/tlob_busy_work_done
> offset"
> +		cleanup; return
> +	fi
> +
> +	"${UPROBE_TARGET}" 30000 &
> +	busy_pid=$!
> +	sleep 0.05
> +
> +	trace_event_enable
> +	trace_on
> +	tlob_enable
> +	trace_clear
> +
> +	# Region A: generous 5 s budget on tlob_busy_work entry (should not
> fire)
> +	if ! echo
> "5000000:${busy_offset}:${busy_stop_offset}:${UPROBE_TARGET}" \
> +	        > "${TLOB_MONITOR}" 2>/dev/null; then
> +		kill "${busy_pid}" 2>/dev/null; wait "${busy_pid}"
> 2>/dev/null
> +		tap_skip "uprobe_independent_thresholds" \
> +			"uprobe binding rejected (CONFIG_UPROBES=y needed)"
> +		cleanup; return
> +	fi
> +	# Region B: tight 10 us budget on tlob_busy_work_done (fires quickly)
> +	echo "10:${busy_stop_offset}:${busy_stop_offset}:${UPROBE_TARGET}" \
> +		> "${TLOB_MONITOR}" 2>/dev/null
> +
> +	found=0; i=0
> +	while [ "$i" -lt 20 ]; do
> +		sleep 0.1
> +		trace_grep "budget_exceeded" && { found=1; break; }
> +		i=$((i+1))
> +	done
> +
> +	echo "-${busy_offset}:${UPROBE_TARGET}" > "${TLOB_MONITOR}"
> 2>/dev/null
> +	echo "-${busy_stop_offset}:${UPROBE_TARGET}" > "${TLOB_MONITOR}"
> 2>/dev/null
> +	kill "${busy_pid}" 2>/dev/null; wait "${busy_pid}" 2>/dev/null
> +
> +	if [ "${found}" != "1" ]; then
> +		tap_fail "uprobe_independent_thresholds" \
> +			"budget_exceeded not raised for tight-budget region
> within 2 s"
> +		cleanup; return
> +	fi
> +
> +	# The violation must carry threshold=10 (Region B's budget).
> +	ev=$(grep "budget_exceeded" "${TRACE_FILE}" | head -n 1)
> +	if ! echo "${ev}" | grep -q "threshold=10 "; then
> +		tap_fail "uprobe_independent_thresholds" \
> +			"violation threshold is not Region B's 10 us: ${ev}"
> +		cleanup; return
> +	fi
> +	tap_pass "uprobe_independent_thresholds"
> +	cleanup
> +}
> +
> +# ---------------------------------------------------------------------------
> +# ioctl tests via tlob_helper
> +#
> +# Each test invokes the helper with a sub-test name.
> +# Exit code: 0=pass, 1=fail, 2=skip.
> +# ---------------------------------------------------------------------------
> +run_ioctl_test() {
> +	testname=$1
> +	next_test
> +
> +	if [ ! -x "${IOCTL_HELPER}" ]; then
> +		tap_skip "ioctl_${testname}" \
> +			"tlob_helper not found or not executable"
> +		return
> +	fi
> +	if [ ! -c "${RV_DEV}" ]; then
> +		tap_skip "ioctl_${testname}" \
> +			"${RV_DEV} not present (CONFIG_RV_CHARDEV=y needed)"
> +		return
> +	fi
> +
> +	tlob_enable
> +	"${IOCTL_HELPER}" "${testname}"
> +	rc=$?
> +	tlob_disable
> +
> +	case "${rc}" in
> +	0) tap_pass "ioctl_${testname}" ;;
> +	2) tap_skip "ioctl_${testname}" "helper returned skip" ;;
> +	*) tap_fail "ioctl_${testname}" "helper exited with code ${rc}" ;;
> +	esac
> +}
> +
> +# run_ioctl_test_not_enabled - like run_ioctl_test but deliberately does NOT
> +# enable the tlob monitor before invoking the helper.  Used to verify that
> +# ioctls issued against a disabled monitor return ENODEV rather than crashing
> +# the kernel with a NULL pointer dereference.
> +run_ioctl_test_not_enabled()
> +{
> +	next_test
> +
> +	if [ ! -x "${IOCTL_HELPER}" ]; then
> +		tap_skip "ioctl_not_enabled" \
> +			"tlob_helper not found or not executable"
> +		return
> +	fi
> +	if [ ! -c "${RV_DEV}" ]; then
> +		tap_skip "ioctl_not_enabled" \
> +			"${RV_DEV} not present (CONFIG_RV_CHARDEV=y needed)"
> +		return
> +	fi
> +
> +	# Monitor intentionally left disabled.
> +	tlob_disable
> +	"${IOCTL_HELPER}" not_enabled
> +	rc=$?
> +
> +	case "${rc}" in
> +	0) tap_pass "ioctl_not_enabled" ;;
> +	2) tap_skip "ioctl_not_enabled" "helper returned skip" ;;
> +	*) tap_fail "ioctl_not_enabled" "helper exited with code ${rc}" ;;
> +	esac
> +}
> +
> +# ---------------------------------------------------------------------------
> +# Main
> +# ---------------------------------------------------------------------------
> +check_root; check_tracefs; check_rv_dir; check_tlob
> +tap_header; tap_plan 20
> +
> +# tracefs interface tests
> +run_test_enable_disable
> +run_test_tracefs_files
> +
> +# uprobe external monitoring tests
> +run_test_uprobe_no_false_positive
> +run_test_uprobe_violation
> +run_test_uprobe_unbind
> +run_test_uprobe_duplicate_offset
> +run_test_uprobe_independent_thresholds
> +
> +# /dev/rv ioctl self-instrumentation tests
> +run_ioctl_test_not_enabled
> +run_ioctl_test within_budget
> +run_ioctl_test over_budget_cpu
> +run_ioctl_test over_budget_sleep
> +run_ioctl_test double_start
> +run_ioctl_test stop_no_start
> +run_ioctl_test multi_thread
> +run_ioctl_test self_watch
> +run_ioctl_test invalid_flags
> +run_ioctl_test notify_fd_bad
> +run_ioctl_test mmap_basic
> +run_ioctl_test mmap_errors
> +run_ioctl_test mmap_consume
> +
> +echo "# Passed: ${t_pass} Failed: ${t_fail} Skipped: ${t_skip}"
> +[ "${t_fail}" -gt 0 ] && exit 1 || exit 0
> diff --git a/tools/testing/selftests/rv/tlob_helper.c
> b/tools/testing/selftests/rv/tlob_helper.c
> new file mode 100644
> index 000000000..cd76b56d1
> --- /dev/null
> +++ b/tools/testing/selftests/rv/tlob_helper.c
> @@ -0,0 +1,994 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * tlob_helper.c - test helper and ELF utility for tlob selftests
> + *
> + * Called by test_tlob.sh to exercise the /dev/rv ioctl interface and to
> + * resolve ELF symbol offsets for uprobe bindings.  One subcommand per
> + * invocation so the shell script can report each as an independent TAP
> + * test case.
> + *
> + * Usage: tlob_helper <subcommand> [args...]
> + *
> + * Synchronous TRACE_START / TRACE_STOP tests:
> + *   not_enabled        - TRACE_START without tlob enabled -> ENODEV (no
> kernel crash)
> + *   within_budget      - start(50000 us), sleep 10 ms, stop -> expect 0
> + *   over_budget_cpu    - start(5000 us), busyspin 100 ms, stop -> EOVERFLOW
> + *   over_budget_sleep  - start(3000 us), sleep 50 ms, stop -> EOVERFLOW
> + *
> + * Error-handling tests:
> + *   double_start       - two starts without stop -> EEXIST on second
> + *   stop_no_start      - stop without start -> ESRCH
> + *
> + * Per-thread isolation test:
> + *   multi_thread       - two threads share one fd; one within budget, one
> over
> + *
> + * Asynchronous notification test (notify_fd + read()):
> + *   self_watch         - one worker exceeds budget; monitor fd receives one
> ntf via read()
> + *
> + * Input-validation tests (TRACE_START error paths):
> + *   invalid_flags      - TRACE_START with flags != 0 -> EINVAL
> + *   notify_fd_bad      - TRACE_START with notify_fd = stdout (non-rv fd) ->
> EINVAL
> + *
> + * mmap ring buffer tests (Scenario D):
> + *   mmap_basic         - mmap succeeds; verify tlob_mmap_page fields
> + *                        (version, capacity, data_offset, record_size)
> + *   mmap_errors        - MAP_PRIVATE, wrong size, and non-zero pgoff all
> + *                        return EINVAL
> + *   mmap_consume       - trigger a real violation via self-notification and
> + *                        consume the event through the mmap'd ring
> + *
> + * ELF utility (does not require /dev/rv):
> + *   sym_offset <binary> <symbol>
> + *                      - print the ELF file offset of <symbol> in <binary>
> + *                        (used by the shell script to build uprobe bindings)
> + *
> + * Exit code: 0 = pass, 1 = fail, 2 = skip (device not available).
> + */
> +#define _GNU_SOURCE
> +#include <elf.h>
> +#include <errno.h>
> +#include <fcntl.h>
> +#include <poll.h>
> +#include <pthread.h>
> +#include <stdint.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <sys/mman.h>
> +#include <sys/stat.h>
> +#include <time.h>
> +#include <unistd.h>
> +
> +#include <linux/rv.h>
> +
> +/* Default ring capacity allocated at open(); matches TLOB_RING_DEFAULT_CAP.
> */
> +#define TLOB_RING_DEFAULT_CAP	64U
> +
> +static int rv_fd = -1;
> +
> +static int open_rv(void)
> +{
> +	rv_fd = open("/dev/rv", O_RDWR);
> +	if (rv_fd < 0) {
> +		fprintf(stderr, "open /dev/rv: %s\n", strerror(errno));
> +		return -1;
> +	}
> +	return 0;
> +}
> +
> +static void busy_spin_us(unsigned long us)
> +{
> +	struct timespec start, now;
> +	unsigned long elapsed;
> +
> +	clock_gettime(CLOCK_MONOTONIC, &start);
> +	do {
> +		clock_gettime(CLOCK_MONOTONIC, &now);
> +		elapsed = (unsigned long)(now.tv_sec - start.tv_sec)
> +			  * 1000000000UL
> +			+ (unsigned long)(now.tv_nsec - start.tv_nsec);
> +	} while (elapsed < us * 1000UL);
> +}
> +
> +static int do_start(uint64_t threshold_us)
> +{
> +	struct tlob_start_args args = {
> +		.threshold_us = threshold_us,
> +		.notify_fd    = -1,
> +	};
> +
> +	return ioctl(rv_fd, TLOB_IOCTL_TRACE_START, &args);
> +}
> +
> +static int do_stop(void)
> +{
> +	return ioctl(rv_fd, TLOB_IOCTL_TRACE_STOP, NULL);
> +}
> +
> +/* -----------------------------------------------------------------------
> + * Synchronous TRACE_START / TRACE_STOP tests
> + * -----------------------------------------------------------------------
> + */
> +
> +/*
> + * test_not_enabled - TRACE_START must return ENODEV when the tlob monitor
> + * has not been enabled (tlob_state_cache is NULL).
> + *
> + * The shell wrapper deliberately does NOT call tlob_enable before invoking
> + * this subcommand, so the ioctl is expected to fail with ENODEV rather than
> + * crashing the kernel with a NULL pointer dereference in kmem_cache_alloc.
> + */
> +static int test_not_enabled(void)
> +{
> +	int ret;
> +
> +	ret = do_start(1000);
> +	if (ret == 0) {
> +		fprintf(stderr, "TRACE_START: expected ENODEV, got
> success\n");
> +		do_stop();
> +		return 1;
> +	}
> +	if (errno != ENODEV) {
> +		fprintf(stderr, "TRACE_START: expected ENODEV, got %s\n",
> +			strerror(errno));
> +		return 1;
> +	}
> +	return 0;
> +}
> +
> +static int test_within_budget(void)
> +{
> +	int ret;
> +
> +	if (do_start(50000) < 0) {
> +		fprintf(stderr, "TRACE_START: %s\n", strerror(errno));
> +		return 1;
> +	}
> +	usleep(10000); /* 10 ms < 50 ms budget */
> +	ret = do_stop();
> +	if (ret != 0) {
> +		fprintf(stderr, "TRACE_STOP: expected 0, got %d errno=%s\n",
> +			ret, strerror(errno));
> +		return 1;
> +	}
> +	return 0;
> +}
> +
> +static int test_over_budget_cpu(void)
> +{
> +	int ret;
> +
> +	if (do_start(5000) < 0) {
> +		fprintf(stderr, "TRACE_START: %s\n", strerror(errno));
> +		return 1;
> +	}
> +	busy_spin_us(100000); /* 100 ms >> 5 ms budget */
> +	ret = do_stop();
> +	if (ret == 0) {
> +		fprintf(stderr, "TRACE_STOP: expected EOVERFLOW, got 0\n");
> +		return 1;
> +	}
> +	if (errno != EOVERFLOW) {
> +		fprintf(stderr, "TRACE_STOP: expected EOVERFLOW, got %s\n",
> +			strerror(errno));
> +		return 1;
> +	}
> +	return 0;
> +}
> +
> +static int test_over_budget_sleep(void)
> +{
> +	int ret;
> +
> +	if (do_start(3000) < 0) {
> +		fprintf(stderr, "TRACE_START: %s\n", strerror(errno));
> +		return 1;
> +	}
> +	usleep(50000); /* 50 ms >> 3 ms budget, off-CPU time counts */
> +	ret = do_stop();
> +	if (ret == 0) {
> +		fprintf(stderr, "TRACE_STOP: expected EOVERFLOW, got 0\n");
> +		return 1;
> +	}
> +	if (errno != EOVERFLOW) {
> +		fprintf(stderr, "TRACE_STOP: expected EOVERFLOW, got %s\n",
> +			strerror(errno));
> +		return 1;
> +	}
> +	return 0;
> +}
> +
> +/* -----------------------------------------------------------------------
> + * Error-handling tests
> + * -----------------------------------------------------------------------
> + */
> +
> +static int test_double_start(void)
> +{
> +	int ret;
> +
> +	if (do_start(10000000) < 0) {
> +		fprintf(stderr, "first TRACE_START: %s\n", strerror(errno));
> +		return 1;
> +	}
> +	ret = do_start(10000000);
> +	if (ret == 0) {
> +		fprintf(stderr, "second TRACE_START: expected EEXIST, got
> 0\n");
> +		do_stop();
> +		return 1;
> +	}
> +	if (errno != EEXIST) {
> +		fprintf(stderr, "second TRACE_START: expected EEXIST, got
> %s\n",
> +			strerror(errno));
> +		do_stop();
> +		return 1;
> +	}
> +	do_stop(); /* clean up */
> +	return 0;
> +}
> +
> +static int test_stop_no_start(void)
> +{
> +	int ret;
> +
> +	/* Ensure clean state: ignore error from a stale entry */
> +	do_stop();
> +
> +	ret = do_stop();
> +	if (ret == 0) {
> +		fprintf(stderr, "TRACE_STOP: expected ESRCH, got 0\n");
> +		return 1;
> +	}
> +	if (errno != ESRCH) {
> +		fprintf(stderr, "TRACE_STOP: expected ESRCH, got %s\n",
> +			strerror(errno));
> +		return 1;
> +	}
> +	return 0;
> +}
> +
> +/* -----------------------------------------------------------------------
> + * Per-thread isolation test
> + *
> + * Two threads share a single /dev/rv fd.  The monitor uses task_struct *
> + * as the key, so each thread gets an independent slot regardless of the
> + * shared fd.
> + * -----------------------------------------------------------------------
> + */
> +
> +struct mt_thread_args {
> +	uint64_t      threshold_us;
> +	unsigned long workload_us;
> +	int           busy;
> +	int           expect_eoverflow;
> +	int           result;
> +};
> +
> +static void *mt_thread_fn(void *arg)
> +{
> +	struct mt_thread_args *a = arg;
> +	int ret;
> +
> +	if (do_start(a->threshold_us) < 0) {
> +		fprintf(stderr, "thread TRACE_START: %s\n", strerror(errno));
> +		a->result = 1;
> +		return NULL;
> +	}
> +
> +	if (a->busy)
> +		busy_spin_us(a->workload_us);
> +	else
> +		usleep(a->workload_us);
> +
> +	ret = do_stop();
> +	if (a->expect_eoverflow) {
> +		if (ret == 0 || errno != EOVERFLOW) {
> +			fprintf(stderr, "thread: expected EOVERFLOW, got
> ret=%d errno=%s\n",
> +				ret, strerror(errno));
> +			a->result = 1;
> +			return NULL;
> +		}
> +	} else {
> +		if (ret != 0) {
> +			fprintf(stderr, "thread: expected 0, got ret=%d
> errno=%s\n",
> +				ret, strerror(errno));
> +			a->result = 1;
> +			return NULL;
> +		}
> +	}
> +	a->result = 0;
> +	return NULL;
> +}
> +
> +static int test_multi_thread(void)
> +{
> +	pthread_t ta, tb;
> +	struct mt_thread_args a = {
> +		.threshold_us     = 20000,  /* 20 ms */
> +		.workload_us      = 5000,   /* 5 ms sleep -> within budget */
> +		.busy             = 0,
> +		.expect_eoverflow = 0,
> +	};
> +	struct mt_thread_args b = {
> +		.threshold_us     = 3000,   /* 3 ms */
> +		.workload_us      = 30000,  /* 30 ms spin -> over budget */
> +		.busy             = 1,
> +		.expect_eoverflow = 1,
> +	};
> +
> +	pthread_create(&ta, NULL, mt_thread_fn, &a);
> +	pthread_create(&tb, NULL, mt_thread_fn, &b);
> +	pthread_join(ta, NULL);
> +	pthread_join(tb, NULL);
> +
> +	return (a.result || b.result) ? 1 : 0;
> +}
> +
> +/* -----------------------------------------------------------------------
> + * Asynchronous notification test (notify_fd + read())
> + *
> + * A dedicated monitor_fd is opened by the main thread.  Two worker threads
> + * each open their own work_fd and call TLOB_IOCTL_TRACE_START with
> + * notify_fd = monitor_fd, nominating it as the violation target.  Worker A
> + * stays within budget; worker B exceeds it.  The main thread reads from
> + * monitor_fd and expects exactly one tlob_event record.
> + * -----------------------------------------------------------------------
> + */
> +
> +struct sw_worker_args {
> +	int           monitor_fd;
> +	uint64_t      threshold_us;
> +	unsigned long workload_us;
> +	int           busy;
> +	int           result;
> +};
> +
> +static void *sw_worker_fn(void *arg)
> +{
> +	struct sw_worker_args *a = arg;
> +	struct tlob_start_args args = {
> +		.threshold_us = a->threshold_us,
> +		.notify_fd    = a->monitor_fd,
> +	};
> +	int work_fd;
> +	int ret;
> +
> +	work_fd = open("/dev/rv", O_RDWR);
> +	if (work_fd < 0) {
> +		fprintf(stderr, "worker open /dev/rv: %s\n",
> strerror(errno));
> +		a->result = 1;
> +		return NULL;
> +	}
> +
> +	ret = ioctl(work_fd, TLOB_IOCTL_TRACE_START, &args);
> +	if (ret < 0) {
> +		fprintf(stderr, "TRACE_START (notify): %s\n",
> strerror(errno));
> +		close(work_fd);
> +		a->result = 1;
> +		return NULL;
> +	}
> +
> +	if (a->busy)
> +		busy_spin_us(a->workload_us);
> +	else
> +		usleep(a->workload_us);
> +
> +	ioctl(work_fd, TLOB_IOCTL_TRACE_STOP, NULL);
> +	close(work_fd);
> +	a->result = 0;
> +	return NULL;
> +}
> +
> +static int test_self_watch(void)
> +{
> +	int monitor_fd;
> +	pthread_t ta, tb;
> +	struct sw_worker_args a = {
> +		.threshold_us = 50000,  /* 50 ms */
> +		.workload_us  = 5000,   /* 5 ms sleep -> no violation */
> +		.busy         = 0,
> +	};
> +	struct sw_worker_args b = {
> +		.threshold_us = 3000,   /* 3 ms */
> +		.workload_us  = 30000,  /* 30 ms spin -> violation */
> +		.busy         = 1,
> +	};
> +	struct tlob_event ntfs[8];
> +	int violations = 0;
> +	ssize_t n;
> +
> +	/*
> +	 * Open monitor_fd with O_NONBLOCK so read() after the workers finish
> +	 * returns immediately rather than blocking forever.
> +	 */
> +	monitor_fd = open("/dev/rv", O_RDWR | O_NONBLOCK);
> +	if (monitor_fd < 0) {
> +		fprintf(stderr, "open /dev/rv (monitor_fd): %s\n",
> strerror(errno));
> +		return 1;
> +	}
> +	a.monitor_fd = monitor_fd;
> +	b.monitor_fd = monitor_fd;
> +
> +	pthread_create(&ta, NULL, sw_worker_fn, &a);
> +	pthread_create(&tb, NULL, sw_worker_fn, &b);
> +	pthread_join(ta, NULL);
> +	pthread_join(tb, NULL);
> +
> +	if (a.result || b.result) {
> +		close(monitor_fd);
> +		return 1;
> +	}
> +
> +	/*
> +	 * Drain all available tlob_event records.  With O_NONBLOCK the final
> +	 * read() returns -EAGAIN when the buffer is empty.
> +	 */
> +	while ((n = read(monitor_fd, ntfs, sizeof(ntfs))) > 0)
> +		violations += (int)(n / sizeof(struct tlob_event));
> +
> +	close(monitor_fd);
> +
> +	if (violations != 1) {
> +		fprintf(stderr, "self_watch: expected 1 violation, got %d\n",
> +			violations);
> +		return 1;
> +	}
> +	return 0;
> +}
> +
> +/* -----------------------------------------------------------------------
> + * Input-validation tests (TRACE_START error paths)
> + * -----------------------------------------------------------------------
> + */
> +
> +/*
> + * test_invalid_flags - TRACE_START with flags != 0 must return EINVAL.
> + *
> + * The flags field is reserved for future extensions and must be zero.
> + * Callers that set it to a non-zero value are rejected early so that a
> + * future kernel can assign meaning to those bits without silently
> + * ignoring them.
> + */
> +static int test_invalid_flags(void)
> +{
> +	struct tlob_start_args args = {
> +		.threshold_us = 1000,
> +		.notify_fd    = -1,
> +		.flags        = 1,   /* non-zero: must be rejected */
> +	};
> +	int ret;
> +
> +	ret = ioctl(rv_fd, TLOB_IOCTL_TRACE_START, &args);
> +	if (ret == 0) {
> +		fprintf(stderr, "TRACE_START(flags=1): expected EINVAL, got
> success\n");
> +		do_stop();
> +		return 1;
> +	}
> +	if (errno != EINVAL) {
> +		fprintf(stderr, "TRACE_START(flags=1): expected EINVAL, got
> %s\n",
> +			strerror(errno));
> +		return 1;
> +	}
> +	return 0;
> +}
> +
> +/*
> + * test_notify_fd_bad - TRACE_START with a non-/dev/rv notify_fd must return
> + * EINVAL.
> + *
> + * When notify_fd >= 0, the kernel resolves it to a struct file and checks
> + * that its private_data is non-NULL (i.e. it is a /dev/rv file descriptor).
> + * Passing stdout (fd 1) supplies a real, open fd whose private_data is NULL,
> + * so the kernel must reject it with EINVAL.
> + */
> +static int test_notify_fd_bad(void)
> +{
> +	struct tlob_start_args args = {
> +		.threshold_us = 1000,
> +		.notify_fd    = STDOUT_FILENO,   /* open but not a /dev/rv fd
> */
> +		.flags        = 0,
> +	};
> +	int ret;
> +
> +	ret = ioctl(rv_fd, TLOB_IOCTL_TRACE_START, &args);
> +	if (ret == 0) {
> +		fprintf(stderr,
> +			"TRACE_START(notify_fd=stdout): expected EINVAL, got
> success\n");
> +		do_stop();
> +		return 1;
> +	}
> +	if (errno != EINVAL) {
> +		fprintf(stderr,
> +			"TRACE_START(notify_fd=stdout): expected EINVAL, got
> %s\n",
> +			strerror(errno));
> +		return 1;
> +	}
> +	return 0;
> +}
> +
> +/* -----------------------------------------------------------------------
> + * mmap ring buffer tests (Scenario D)
> + * -----------------------------------------------------------------------
> + */
> +
> +/*
> + * test_mmap_basic - mmap the ring buffer and verify the control page fields.
> + *
> + * The kernel allocates TLOB_RING_DEFAULT_CAP records at open().  A shared
> + * mmap of PAGE_SIZE + cap * record_size must succeed and the tlob_mmap_page
> + * header must contain consistent values.
> + */
> +static int test_mmap_basic(void)
> +{
> +	long pagesize = sysconf(_SC_PAGESIZE);
> +	size_t mmap_len = (size_t)pagesize +
> +			  TLOB_RING_DEFAULT_CAP * sizeof(struct tlob_event);
> +	/* rv_mmap requires a page-aligned length */
> +	mmap_len = (mmap_len + (size_t)(pagesize - 1)) & ~(size_t)(pagesize -
> 1);
> +	struct tlob_mmap_page *page;
> +	struct tlob_event *data;
> +	void *map;
> +	int ret = 0;
> +
> +	map = mmap(NULL, mmap_len, PROT_READ | PROT_WRITE, MAP_SHARED, rv_fd,
> 0);
> +	if (map == MAP_FAILED) {
> +		fprintf(stderr, "mmap_basic: mmap: %s\n", strerror(errno));
> +		return 1;
> +	}
> +
> +	page = (struct tlob_mmap_page *)map;
> +	data = (struct tlob_event *)((char *)map + page->data_offset);
> +
> +	if (page->version != 1) {
> +		fprintf(stderr, "mmap_basic: expected version=1, got %u\n",
> +			page->version);
> +		ret = 1;
> +		goto out;
> +	}
> +	if (page->capacity != TLOB_RING_DEFAULT_CAP) {
> +		fprintf(stderr, "mmap_basic: expected capacity=%u, got %u\n",
> +			TLOB_RING_DEFAULT_CAP, page->capacity);
> +		ret = 1;
> +		goto out;
> +	}
> +	if (page->data_offset != (uint32_t)pagesize) {
> +		fprintf(stderr, "mmap_basic: expected data_offset=%ld, got
> %u\n",
> +			pagesize, page->data_offset);
> +		ret = 1;
> +		goto out;
> +	}
> +	if (page->record_size != sizeof(struct tlob_event)) {
> +		fprintf(stderr, "mmap_basic: expected record_size=%zu, got
> %u\n",
> +			sizeof(struct tlob_event), page->record_size);
> +		ret = 1;
> +		goto out;
> +	}
> +	if (page->data_head != 0 || page->data_tail != 0) {
> +		fprintf(stderr, "mmap_basic: ring not empty at open: head=%u
> tail=%u\n",
> +			page->data_head, page->data_tail);
> +		ret = 1;
> +		goto out;
> +	}
> +	/* Touch the data array to confirm it is accessible. */
> +	(void)data[0].tid;
> +out:
> +	munmap(map, mmap_len);
> +	return ret;
> +}
> +
> +/*
> + * test_mmap_errors - verify that rv_mmap() rejects invalid mmap parameters.
> + *
> + * Four cases are tested, each must return MAP_FAILED with errno == EINVAL:
> + *   1. size one page short of the correct ring length
> + *   2. size one page larger than the correct ring length
> + *   3. MAP_PRIVATE (only MAP_SHARED is permitted)
> + *   4. non-zero vm_pgoff (offset must be 0)
> + */
> +static int test_mmap_errors(void)
> +{
> +	long pagesize = sysconf(_SC_PAGESIZE);
> +	size_t correct_len = (size_t)pagesize +
> +			     TLOB_RING_DEFAULT_CAP * sizeof(struct
> tlob_event);
> +	/* rv_mmap requires a page-aligned length */
> +	correct_len = (correct_len + (size_t)(pagesize - 1)) &
> ~(size_t)(pagesize - 1);
> +	void *map;
> +	int ret = 0;
> +
> +	/* Case 1: size one page short (correct_len - 1 still rounds up to
> correct_len) */
> +	map = mmap(NULL, correct_len - (size_t)pagesize, PROT_READ |
> PROT_WRITE,
> +		   MAP_SHARED, rv_fd, 0);
> +	if (map != MAP_FAILED) {
> +		fprintf(stderr, "mmap_errors: short-size mmap succeeded
> (expected EINVAL)\n");
> +		munmap(map, correct_len - (size_t)pagesize);
> +		ret = 1;
> +	} else if (errno != EINVAL) {
> +		fprintf(stderr, "mmap_errors: short-size: expected EINVAL,
> got %s\n",
> +			strerror(errno));
> +		ret = 1;
> +	}
> +
> +	/* Case 2: size one page too large */
> +	map = mmap(NULL, correct_len + (size_t)pagesize, PROT_READ |
> PROT_WRITE,
> +		   MAP_SHARED, rv_fd, 0);
> +	if (map != MAP_FAILED) {
> +		fprintf(stderr, "mmap_errors: oversized mmap succeeded
> (expected EINVAL)\n");
> +		munmap(map, correct_len + (size_t)pagesize);
> +		ret = 1;
> +	} else if (errno != EINVAL) {
> +		fprintf(stderr, "mmap_errors: oversized: expected EINVAL, got
> %s\n",
> +			strerror(errno));
> +		ret = 1;
> +	}
> +
> +	/* Case 3: MAP_PRIVATE instead of MAP_SHARED */
> +	map = mmap(NULL, correct_len, PROT_READ | PROT_WRITE,
> +		   MAP_PRIVATE, rv_fd, 0);
> +	if (map != MAP_FAILED) {
> +		fprintf(stderr, "mmap_errors: MAP_PRIVATE succeeded (expected
> EINVAL)\n");
> +		munmap(map, correct_len);
> +		ret = 1;
> +	} else if (errno != EINVAL) {
> +		fprintf(stderr, "mmap_errors: MAP_PRIVATE: expected EINVAL,
> got %s\n",
> +			strerror(errno));
> +		ret = 1;
> +	}
> +
> +	/* Case 4: non-zero file offset (pgoff = 1) */
> +	map = mmap(NULL, correct_len, PROT_READ | PROT_WRITE,
> +		   MAP_SHARED, rv_fd, (off_t)pagesize);
> +	if (map != MAP_FAILED) {
> +		fprintf(stderr, "mmap_errors: non-zero pgoff mmap succeeded
> (expected EINVAL)\n");
> +		munmap(map, correct_len);
> +		ret = 1;
> +	} else if (errno != EINVAL) {
> +		fprintf(stderr, "mmap_errors: non-zero pgoff: expected
> EINVAL, got %s\n",
> +			strerror(errno));
> +		ret = 1;
> +	}
> +
> +	return ret;
> +}
> +
> +/*
> + * test_mmap_consume - zero-copy consumption of a real violation event.
> + *
> + * Arms a 5 ms budget with self-notification (notify_fd = rv_fd), sleeps
> + * 50 ms (off-CPU violation), then reads the pushed event through the mmap'd
> + * ring without calling read().  Verifies:
> + *   - TRACE_STOP returns EOVERFLOW (budget was exceeded)
> + *   - data_head == 1 after the violation
> + *   - the event fields (threshold_us, tag, tid) are correct
> + *   - data_tail can be advanced to consume the record (ring empties)
> + */
> +static int test_mmap_consume(void)
> +{
> +	long pagesize = sysconf(_SC_PAGESIZE);
> +	size_t mmap_len = (size_t)pagesize +
> +			  TLOB_RING_DEFAULT_CAP * sizeof(struct tlob_event);
> +	/* rv_mmap requires a page-aligned length */
> +	mmap_len = (mmap_len + (size_t)(pagesize - 1)) & ~(size_t)(pagesize -
> 1);
> +	struct tlob_start_args args = {
> +		.threshold_us = 5000,		/* 5 ms */
> +		.notify_fd    = rv_fd,		/* self-notification */
> +		.tag          = 0xdeadbeefULL,
> +		.flags        = 0,
> +	};
> +	struct tlob_mmap_page *page;
> +	struct tlob_event *data;
> +	void *map;
> +	int stop_ret;
> +	int ret = 0;
> +
> +	map = mmap(NULL, mmap_len, PROT_READ | PROT_WRITE, MAP_SHARED, rv_fd,
> 0);
> +	if (map == MAP_FAILED) {
> +		fprintf(stderr, "mmap_consume: mmap: %s\n", strerror(errno));
> +		return 1;
> +	}
> +
> +	page = (struct tlob_mmap_page *)map;
> +	data = (struct tlob_event *)((char *)map + page->data_offset);
> +
> +	if (ioctl(rv_fd, TLOB_IOCTL_TRACE_START, &args) < 0) {
> +		fprintf(stderr, "mmap_consume: TRACE_START: %s\n",
> strerror(errno));
> +		ret = 1;
> +		goto out;
> +	}
> +
> +	usleep(50000); /* 50 ms >> 5 ms budget -> off-CPU violation */
> +
> +	stop_ret = ioctl(rv_fd, TLOB_IOCTL_TRACE_STOP, NULL);
> +	if (stop_ret == 0) {
> +		fprintf(stderr, "mmap_consume: TRACE_STOP returned 0,
> expected EOVERFLOW\n");
> +		ret = 1;
> +		goto out;
> +	}
> +	if (errno != EOVERFLOW) {
> +		fprintf(stderr, "mmap_consume: TRACE_STOP: expected
> EOVERFLOW, got %s\n",
> +			strerror(errno));
> +		ret = 1;
> +		goto out;
> +	}
> +
> +	/* Pairs with smp_store_release in tlob_event_push. */
> +	if (__atomic_load_n(&page->data_head, __ATOMIC_ACQUIRE) != 1) {
> +		fprintf(stderr, "mmap_consume: expected data_head=1, got
> %u\n",
> +			page->data_head);
> +		ret = 1;
> +		goto out;
> +	}
> +	if (page->data_tail != 0) {
> +		fprintf(stderr, "mmap_consume: expected data_tail=0, got
> %u\n",
> +			page->data_tail);
> +		ret = 1;
> +		goto out;
> +	}
> +
> +	/* Verify record content */
> +	if (data[0].threshold_us != 5000) {
> +		fprintf(stderr, "mmap_consume: expected threshold_us=5000,
> got %llu\n",
> +			(unsigned long long)data[0].threshold_us);
> +		ret = 1;
> +		goto out;
> +	}
> +	if (data[0].tag != 0xdeadbeefULL) {
> +		fprintf(stderr, "mmap_consume: expected tag=0xdeadbeef, got
> %llx\n",
> +			(unsigned long long)data[0].tag);
> +		ret = 1;
> +		goto out;
> +	}
> +	if (data[0].tid == 0) {
> +		fprintf(stderr, "mmap_consume: tid is 0\n");
> +		ret = 1;
> +		goto out;
> +	}
> +
> +	/* Consume: advance data_tail and confirm ring is empty */
> +	__atomic_store_n(&page->data_tail, 1U, __ATOMIC_RELEASE);
> +	if (__atomic_load_n(&page->data_head, __ATOMIC_ACQUIRE) !=
> +	    __atomic_load_n(&page->data_tail, __ATOMIC_ACQUIRE)) {
> +		fprintf(stderr, "mmap_consume: ring not empty after
> consume\n");
> +		ret = 1;
> +	}
> +
> +out:
> +	munmap(map, mmap_len);
> +	return ret;
> +}
> +
> +/* -----------------------------------------------------------------------
> + * ELF utility: sym_offset
> + *
> + * Print the ELF file offset of a symbol in a binary.  Supports 32- and
> + * 64-bit ELF.  Walks the section headers to find .symtab (falling back to
> + * .dynsym), then converts the symbol's virtual address to a file offset
> + * via the PT_LOAD program headers.
> + *
> + * Does not require /dev/rv; used by the shell script to build uprobe
> + * bindings of the form
> pid:threshold_us:offset_start:offset_stop:binary_path.
> + *
> + * Returns 0 on success (offset printed to stdout), 1 on failure.
> + * -----------------------------------------------------------------------
> + */
> +static int sym_offset(const char *binary, const char *symname)
> +{
> +	int fd;
> +	struct stat st;
> +	void *map;
> +	Elf64_Ehdr *ehdr;
> +	Elf32_Ehdr *ehdr32;
> +	int is64;
> +	uint64_t sym_vaddr = 0;
> +	int found = 0;
> +	uint64_t file_offset = 0;
> +
> +	fd = open(binary, O_RDONLY);
> +	if (fd < 0) {
> +		fprintf(stderr, "open %s: %s\n", binary, strerror(errno));
> +		return 1;
> +	}
> +	if (fstat(fd, &st) < 0) {
> +		close(fd);
> +		return 1;
> +	}
> +	map = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
> +	close(fd);
> +	if (map == MAP_FAILED) {
> +		fprintf(stderr, "mmap: %s\n", strerror(errno));
> +		return 1;
> +	}
> +
> +	/* Identify ELF class */
> +	ehdr = (Elf64_Ehdr *)map;
> +	ehdr32 = (Elf32_Ehdr *)map;
> +	if (st.st_size < 4 ||
> +	    ehdr->e_ident[EI_MAG0] != ELFMAG0 ||
> +	    ehdr->e_ident[EI_MAG1] != ELFMAG1 ||
> +	    ehdr->e_ident[EI_MAG2] != ELFMAG2 ||
> +	    ehdr->e_ident[EI_MAG3] != ELFMAG3) {
> +		fprintf(stderr, "%s: not an ELF file\n", binary);
> +		munmap(map, (size_t)st.st_size);
> +		return 1;
> +	}
> +	is64 = (ehdr->e_ident[EI_CLASS] == ELFCLASS64);
> +
> +	if (is64) {
> +		/* Walk section headers to find .symtab or .dynsym */
> +		Elf64_Shdr *shdrs = (Elf64_Shdr *)((char *)map + ehdr-
> >e_shoff);
> +		Elf64_Shdr *shstrtab_hdr = &shdrs[ehdr->e_shstrndx];
> +		const char *shstrtab = (char *)map + shstrtab_hdr->sh_offset;
> +		int si;
> +
> +		/* Prefer .symtab; fall back to .dynsym */
> +		for (int pass = 0; pass < 2 && !found; pass++) {
> +			const char *target = pass ? ".dynsym" : ".symtab";
> +
> +			for (si = 0; si < ehdr->e_shnum && !found; si++) {
> +				Elf64_Shdr *sh = &shdrs[si];
> +				const char *name = shstrtab + sh->sh_name;
> +
> +				if (strcmp(name, target) != 0)
> +					continue;
> +
> +				Elf64_Shdr *strtab_sh = &shdrs[sh->sh_link];
> +				const char *strtab = (char *)map + strtab_sh-
> >sh_offset;
> +				Elf64_Sym *syms = (Elf64_Sym *)((char *)map +
> sh->sh_offset);
> +				uint64_t nsyms = sh->sh_size /
> sizeof(Elf64_Sym);
> +				uint64_t j;
> +
> +				for (j = 0; j < nsyms; j++) {
> +					if (strcmp(strtab + syms[j].st_name,
> symname) == 0) {
> +						sym_vaddr = syms[j].st_value;
> +						found = 1;
> +						break;
> +					}
> +				}
> +			}
> +		}
> +
> +		if (!found) {
> +			fprintf(stderr, "symbol '%s' not found in %s\n",
> symname, binary);
> +			munmap(map, (size_t)st.st_size);
> +			return 1;
> +		}
> +
> +		/* Convert vaddr to file offset via PT_LOAD segments */
> +		Elf64_Phdr *phdrs = (Elf64_Phdr *)((char *)map + ehdr-
> >e_phoff);
> +		int pi;
> +
> +		for (pi = 0; pi < ehdr->e_phnum; pi++) {
> +			Elf64_Phdr *ph = &phdrs[pi];
> +
> +			if (ph->p_type != PT_LOAD)
> +				continue;
> +			if (sym_vaddr >= ph->p_vaddr &&
> +			    sym_vaddr < ph->p_vaddr + ph->p_filesz) {
> +				file_offset = sym_vaddr - ph->p_vaddr + ph-
> >p_offset;
> +				break;
> +			}
> +		}
> +	} else {
> +		/* 32-bit ELF */
> +		Elf32_Shdr *shdrs = (Elf32_Shdr *)((char *)map + ehdr32-
> >e_shoff);
> +		Elf32_Shdr *shstrtab_hdr = &shdrs[ehdr32->e_shstrndx];
> +		const char *shstrtab = (char *)map + shstrtab_hdr->sh_offset;
> +		int si;
> +		uint32_t sym_vaddr32 = 0;
> +
> +		for (int pass = 0; pass < 2 && !found; pass++) {
> +			const char *target = pass ? ".dynsym" : ".symtab";
> +
> +			for (si = 0; si < ehdr32->e_shnum && !found; si++) {
> +				Elf32_Shdr *sh = &shdrs[si];
> +				const char *name = shstrtab + sh->sh_name;
> +
> +				if (strcmp(name, target) != 0)
> +					continue;
> +
> +				Elf32_Shdr *strtab_sh = &shdrs[sh->sh_link];
> +				const char *strtab = (char *)map + strtab_sh-
> >sh_offset;
> +				Elf32_Sym *syms = (Elf32_Sym *)((char *)map +
> sh->sh_offset);
> +				uint32_t nsyms = sh->sh_size /
> sizeof(Elf32_Sym);
> +				uint32_t j;
> +
> +				for (j = 0; j < nsyms; j++) {
> +					if (strcmp(strtab + syms[j].st_name,
> symname) == 0) {
> +						sym_vaddr32 =
> syms[j].st_value;
> +						found = 1;
> +						break;
> +					}
> +				}
> +			}
> +		}
> +
> +		if (!found) {
> +			fprintf(stderr, "symbol '%s' not found in %s\n",
> symname, binary);
> +			munmap(map, (size_t)st.st_size);
> +			return 1;
> +		}
> +
> +		Elf32_Phdr *phdrs = (Elf32_Phdr *)((char *)map + ehdr32-
> >e_phoff);
> +		int pi;
> +
> +		for (pi = 0; pi < ehdr32->e_phnum; pi++) {
> +			Elf32_Phdr *ph = &phdrs[pi];
> +
> +			if (ph->p_type != PT_LOAD)
> +				continue;
> +			if (sym_vaddr32 >= ph->p_vaddr &&
> +			    sym_vaddr32 < ph->p_vaddr + ph->p_filesz) {
> +				file_offset = sym_vaddr32 - ph->p_vaddr + ph-
> >p_offset;
> +				break;
> +			}
> +		}
> +		sym_vaddr = sym_vaddr32;
> +	}
> +
> +	munmap(map, (size_t)st.st_size);
> +
> +	if (!file_offset && sym_vaddr) {
> +		fprintf(stderr, "could not map vaddr 0x%lx to file offset\n",
> +			(unsigned long)sym_vaddr);
> +		return 1;
> +	}
> +
> +	printf("0x%lx\n", (unsigned long)file_offset);
> +	return 0;
> +}
> +
> +int main(int argc, char *argv[])
> +{
> +	int rc;
> +
> +	if (argc < 2) {
> +		fprintf(stderr, "Usage: %s <subcommand> [args...]\n",
> argv[0]);
> +		return 1;
> +	}
> +
> +	/* sym_offset does not need /dev/rv */
> +	if (strcmp(argv[1], "sym_offset") == 0) {
> +		if (argc < 4) {
> +			fprintf(stderr, "Usage: %s sym_offset <binary>
> <symbol>\n",
> +				argv[0]);
> +			return 1;
> +		}
> +		return sym_offset(argv[2], argv[3]);
> +	}
> +
> +	if (open_rv() < 0)
> +		return 2; /* skip */
> +
> +	if (strcmp(argv[1], "not_enabled") == 0)
> +		rc = test_not_enabled();
> +	else if (strcmp(argv[1], "within_budget") == 0)
> +		rc = test_within_budget();
> +	else if (strcmp(argv[1], "over_budget_cpu") == 0)
> +		rc = test_over_budget_cpu();
> +	else if (strcmp(argv[1], "over_budget_sleep") == 0)
> +		rc = test_over_budget_sleep();
> +	else if (strcmp(argv[1], "double_start") == 0)
> +		rc = test_double_start();
> +	else if (strcmp(argv[1], "stop_no_start") == 0)
> +		rc = test_stop_no_start();
> +	else if (strcmp(argv[1], "multi_thread") == 0)
> +		rc = test_multi_thread();
> +	else if (strcmp(argv[1], "self_watch") == 0)
> +		rc = test_self_watch();
> +	else if (strcmp(argv[1], "invalid_flags") == 0)
> +		rc = test_invalid_flags();
> +	else if (strcmp(argv[1], "notify_fd_bad") == 0)
> +		rc = test_notify_fd_bad();
> +	else if (strcmp(argv[1], "mmap_basic") == 0)
> +		rc = test_mmap_basic();
> +	else if (strcmp(argv[1], "mmap_errors") == 0)
> +		rc = test_mmap_errors();
> +	else if (strcmp(argv[1], "mmap_consume") == 0)
> +		rc = test_mmap_consume();
> +	else {
> +		fprintf(stderr, "Unknown test: %s\n", argv[1]);
> +		rc = 1;
> +	}
> +
> +	close(rv_fd);
> +	return rc;
> +}
> diff --git a/tools/testing/selftests/rv/tlob_uprobe_target.c
> b/tools/testing/selftests/rv/tlob_uprobe_target.c
> new file mode 100644
> index 000000000..6c895cb40
> --- /dev/null
> +++ b/tools/testing/selftests/rv/tlob_uprobe_target.c
> @@ -0,0 +1,108 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * tlob_uprobe_target.c - uprobe target binary for tlob selftests.
> + *
> + * Provides two well-known probe points:
> + *   tlob_busy_work()      - start probe: arms the tlob budget timer
> + *   tlob_busy_work_done() - stop  probe: cancels the timer on completion
> + *
> + * The tlob selftest writes a five-field uprobe binding:
> + *   pid:threshold_us:binary:offset_start:offset_stop
> + * where offset_start is the file offset of tlob_busy_work and offset_stop
> + * is the file offset of tlob_busy_work_done (resolved via tlob_helper
> + * sym_offset).
> + *
> + * Both probe points are plain entry uprobes (no uretprobe).  The busy loop
> + * keeps the task on-CPU so that either the stop probe fires cleanly (within
> + * budget) or the hrtimer fires first and emits tlob_budget_exceeded (over
> + * budget).
> + *
> + * Usage: tlob_uprobe_target <duration_ms>
> + *
> + * Loops calling tlob_busy_work() in 200 ms iterations until <duration_ms>
> + * has elapsed (0 = run for ~24 hours).  Short iterations ensure the uprobe
> + * entry fires on every call even if the uprobe is installed after the
> + * program has started.
> + */
> +#define _GNU_SOURCE
> +#include <stdint.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <time.h>
> +
> +#ifndef noinline
> +#define noinline __attribute__((noinline))
> +#endif
> +
> +static inline int timespec_before(const struct timespec *a,
> +				   const struct timespec *b)
> +{
> +	return a->tv_sec < b->tv_sec ||
> +	       (a->tv_sec == b->tv_sec && a->tv_nsec < b->tv_nsec);
> +}
> +
> +static void timespec_add_ms(struct timespec *ts, unsigned long ms)
> +{
> +	ts->tv_sec  += ms / 1000;
> +	ts->tv_nsec += (long)(ms % 1000) * 1000000L;
> +	if (ts->tv_nsec >= 1000000000L) {
> +		ts->tv_sec++;
> +		ts->tv_nsec -= 1000000000L;
> +	}
> +}
> +
> +/*
> + * tlob_busy_work_done - stop-probe target.
> + *
> + * Called by tlob_busy_work() after the busy loop.  The uprobe on this
> + * function's entry fires tlob_stop_task(), cancelling the budget timer.
> + * noinline ensures the compiler never merges this function with its caller,
> + * guaranteeing the entry uprobe always fires.
> + */
> +noinline void tlob_busy_work_done(void)
> +{
> +	/* empty: the uprobe fires on entry */
> +}
> +
> +/*
> + * tlob_busy_work - start-probe target.
> + *
> + * The uprobe on this function's entry fires tlob_start_task(), arming the
> + * budget timer.  noinline prevents the compiler and linker (including LTO)
> + * from inlining this function into its callers, ensuring the entry uprobe
> + * fires on every call.
> + */
> +noinline void tlob_busy_work(unsigned long duration_ns)
> +{
> +	struct timespec start, now;
> +	unsigned long elapsed;
> +
> +	clock_gettime(CLOCK_MONOTONIC, &start);
> +	do {
> +		clock_gettime(CLOCK_MONOTONIC, &now);
> +		elapsed = (unsigned long)(now.tv_sec - start.tv_sec)
> +			  * 1000000000UL
> +			+ (unsigned long)(now.tv_nsec - start.tv_nsec);
> +	} while (elapsed < duration_ns);
> +
> +	tlob_busy_work_done();
> +}
> +
> +int main(int argc, char *argv[])
> +{
> +	unsigned long duration_ms = 0;
> +	struct timespec deadline, now;
> +
> +	if (argc >= 2)
> +		duration_ms = strtoul(argv[1], NULL, 10);
> +
> +	clock_gettime(CLOCK_MONOTONIC, &deadline);
> +	timespec_add_ms(&deadline, duration_ms ? duration_ms : 86400000UL);
> +
> +	do {
> +		tlob_busy_work(200 * 1000000UL); /* 200 ms per iteration */
> +		clock_gettime(CLOCK_MONOTONIC, &now);
> +	} while (timespec_before(&now, &deadline));
> +
> +	return 0;
> +}


^ permalink raw reply

* Re: [PATCH 0/3] mm: split the file's i_mmap tree for NUMA
From: Huang Shijie @ 2026-04-16 11:48 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: akpm, viro, brauner, linux-mm, linux-kernel, linux-arm-kernel,
	linux-fsdevel, muchun.song, osalvador, linux-trace-kernel,
	linux-perf-users, linux-parisc, nvdimm, zhongyuan, fangbaoshun,
	yingzhiwei
In-Reply-To: <CAGudoHGLaoc+CoBPNCvFRYojnj+6E_Lsdv7NaJWxFMoHezemMQ@mail.gmail.com>

On Thu, Apr 16, 2026 at 12:29:50PM +0200, Mateusz Guzik wrote:
> On Tue, Apr 14, 2026 at 11:11 AM Huang Shijie <huangsj@hygon.cn> wrote:
> >
> > On Mon, Apr 13, 2026 at 05:33:21PM +0200, Mateusz Guzik wrote:
> > > On Mon, Apr 13, 2026 at 02:20:39PM +0800, Huang Shijie wrote:
> > > >   In NUMA, there are maybe many NUMA nodes and many CPUs.
> > > > For example, a Hygon's server has 12 NUMA nodes, and 384 CPUs.
> > > > In the UnixBench tests, there is a test "execl" which tests
> > > > the execve system call.
> > > >
> > > >   When we test our server with "./Run -c 384 execl",
> > > > the test result is not good enough. The i_mmap locks contended heavily on
> > > > "libc.so" and "ld.so". For example, the i_mmap tree for "libc.so" can have
> > > > over 6000 VMAs, all the VMAs can be in different NUMA mode.
> > > > The insert/remove operations do not run quickly enough.
> > > >
> > > > patch 1 & patch 2 are try to hide the direct access of i_mmap.
> > > > patch 3 splits the i_mmap into sibling trees, and we can get better
> > > > performance with this patch set:
> > > >     we can get 77% performance improvement(10 times average)
> > > >
> > >
> > > To my reading you kept the lock as-is and only distributed the protected
> > > state.
> > >
> > > While I don't doubt the improvement, I'm confident should you take a
> > > look at the profile you are going to find this still does not scale with
> > > rwsem being one of the problems (there are other global locks, some of
> > > which have experimental patches for).
> > IMHO, when the number of VMAs in the i_mmap is very large, only optimise the rwsem
> > lock does not help too much for our NUMA case.
> >
> > In our NUMA server, the remote access could be the major issue.
> >
> 
> I'm confused how this is not supposed to help. You moved your data to
> be stored per-domain. With my proposal the lock itself will also get
> that treatment.
> 
> Modulo the issue of what to do with code wanting to iterate the entire
> thing, this is blatantly faster.
> 

I tested an old lock patch yesterday. It really helps a lot.
The lock patch is from this link:
  https://lkml.org/lkml/2024/9/14/280

The test results:
   v7.0-rc5 + (lock patch)                    : improve about %60%
   v7.0-rc5 + (lock patch) + (this patch set) : improve about 130%			   

						
> >
> > >
> > > Apart from that this does nothing to help high core systems which are
> > > all one node, which imo puts another question mark on this specific
> > > proposal.
> > Yes, this patch set only focus on the NUMA case.
> > The one-node case should use the original i_mmap.
> >
> > Maybe I can add a new config, CONFIG_SPILT_I_MMAP. The config is disabled
> > by default, and enabled when the NUMA node is not one.
> >
> > >
> > > Of course one may question whether a RB tree is the right choice here,
> > > it may be the lock-protected cost can go way down with merely a better
> > > data structure.
> > >
> > > Regardless of that, for actual scalability, there will be no way around
> > > decentralazing locking around this and partitioning per some core count
> > > (not just by numa awareness).
> > >
> > > Decentralizing locking is definitely possible, but I have not looked
> > > into specifics of how problematic it is. Best case scenario it will
> > > merely with separate locks. Worst case scenario something needs a fully
> > > stabilized state for traversal, in that case another rw lock can be
> > Yes.
> >
> > The traversal may need to hold many locks.
> >
> 
> The very paragraph you partially quoted answers what to do in that
> case: wrap everything with a new rwsem taken for reading when
> adding/removing entries and taken for writing when iterating the
> entire thing. Then the iteration sticks to one lock.
> 
> The new rw lock puts an upper ceiling on scalability of the thing, but
> it is way higher than the current state.
Could you tell me the patch about it?
Is this lock patch merged ? or not?

I can test it.

> 
> Given the extra overhead associated with it one could consider
> sticking to one centralized state by default and switching to
> distributed state if there is enough contention.
> 
> > > slapped around this, creating locking order read lock -> per-subset
> > > write lock -- this will suffer scalability due to the read locking, but
> > > it will still scale drastically better as apart from that there will be
> > > no serialization. In this setting the problematic consumer will write
> > > lock the new thing to stabilize the state.
> > >
> > > So my non-maintainer opinion is that the patchset is not worth it as it
> > > fails to address anything for significantly more common and already
> > > affected setups.
> > This patch set is to reduce the remote access latency for insert/remove VMA
> > in NUMA.
> >
> 
> And I am saying the mmap semaphore is a significant problem already on
> high-core no-numa setups. Addressing scalability in that case would
> sort out the problem in your setup and to a significantly higher
> extent.
I am afraid even the lock patch resolves the scalability high-core no-numa setups,
we still need to split the i_mmap for NUMA.

> 
> > >
> > > Have you looked into splitting the lock?
> > >
> > I ever tried.
> >
> > But there are two disadvantages:
> >   1.) The traversal may need to hold many locks which makes the
> >       code very horrible.
> >
> 
> I already above this is avoidable.
> 
> >   2.) Even we split the locks. Each lock protects a tree, when the tree becomes
> >       big enough, the VMA insert/remove will also become slow in NUMA.
> >       The reason is that the tree has VMAs in different NUMA nodes.
> >
> 
> This is orthogonal to my proposal. In fact, if one is to pretend this
> is never a factor with your patch, I would like to point out it will
> remain not a factor if the per-numa struct gets its own lock.
Yes. It is orthogonal to your proposal.

Thanks
Huang Shijie


^ permalink raw reply

* Re: [RFC PATCH 2/2] kernel/module: Decouple klp and ftrace from load_module
From: Petr Pavlu @ 2026-04-16 11:18 UTC (permalink / raw)
  To: Song Chen
  Cc: rafael, lenb, mturquette, sboyd, viresh.kumar, agk, snitzer,
	mpatocka, bmarzins, song, yukuai, linan122, jason.wessel, danielt,
	dianders, horms, davem, edumazet, kuba, pabeni, paulmck, frederic,
	mcgrof, da.gomez, samitolvanen, atomlin, jpoimboe, jikos, mbenes,
	pmladek, joe.lawrence, rostedt, mhiramat, mark.rutland,
	mathieu.desnoyers, linux-modules, linux-kernel,
	linux-trace-kernel, linux-acpi, linux-clk, linux-pm,
	live-patching, dm-devel, linux-raid, kgdb-bugreport, netdev
In-Reply-To: <a35f5f94-7d5a-4347-974b-b270c89ef241@189.cn>

On 4/15/26 8:43 AM, Song Chen wrote:
> On 4/14/26 22:33, Petr Pavlu wrote:
>> On 4/13/26 10:07 AM, chensong_2000@189.cn wrote:
>>> diff --git a/include/linux/module.h b/include/linux/module.h
>>> index 14f391b186c6..0bdd56f9defd 100644
>>> --- a/include/linux/module.h
>>> +++ b/include/linux/module.h
>>> @@ -308,6 +308,14 @@ enum module_state {
>>>       MODULE_STATE_COMING,    /* Full formed, running module_init. */
>>>       MODULE_STATE_GOING,    /* Going away. */
>>>       MODULE_STATE_UNFORMED,    /* Still setting it up. */
>>> +    MODULE_STATE_FORMED,
>>
>> I don't see a reason to add a new module state. Why is it necessary and
>> how does it fit with the existing states?
>>
> because once notifier fails in state MODULE_STATE_UNFORMED (now only ftrace has someting to do in this state), notifier chain will roll back by calling blocking_notifier_call_chain_robust, i'm afraid MODULE_STATE_GOING is going to jeopardise the notifers which don't handle it appropriately, like:
> 
> case MODULE_STATE_COMING:
>      kmalloc();
> case MODULE_STATE_GOING:
>      kfree();

My understanding is that the current module "state machine" operates as
follows. Transitions marked with an asterisk (*) are announced via the
module notifier.

---> UNFORMED --*> COMING --*> LIVE --*> GOING -.
        ^            |                     ^    |
        |            '---------------------*    |
        '---------------------------------------'

The new code aims to replace the current ftrace_module_init() call in
load_module(). To achieve this, it adds a notification for the UNFORMED
state (only when loading a module) and introduces a new FORMED state for
rollback. FORMED is purely a fake state because it never appears in
module::state. The new structure is as follows:

        ,--*> (FORMED)
        |
--*> UNFORMED --*> COMING --*> LIVE --*> GOING -.
        ^            |                     ^    |
        |            '---------------------*    |
        '---------------------------------------'

I'm afraid this is quite complex and inconsistent. Unless it can be kept
simple, we would be just replacing one special handling with a different
complexity, which is not worth it.

>>
>>> +    if (err)
>>> +        goto ddebug_cleanup;
>>>         /* Finally it's fully formed, ready to start executing. */
>>>       err = complete_formation(mod, info);
>>> -    if (err)
>>> +    if (err) {
>>> +        blocking_notifier_call_chain_reverse(&module_notify_list,
>>> +                MODULE_STATE_FORMED, mod);
>>>           goto ddebug_cleanup;
>>> +    }
>>>   -    err = prepare_coming_module(mod);
>>> +    err = prepare_module_state_transaction(mod,
>>> +                MODULE_STATE_COMING, MODULE_STATE_GOING);
>>>       if (err)
>>>           goto bug_cleanup;
>>>   @@ -3522,7 +3519,6 @@ static int load_module(struct load_info *info, const char __user *uargs,
>>>       destroy_params(mod->kp, mod->num_kp);
>>>       blocking_notifier_call_chain(&module_notify_list,
>>>                        MODULE_STATE_GOING, mod);
>>
>> My understanding is that all notifier chains for MODULE_STATE_GOING
>> should be reversed.
> yes, all, from lowest priority notifier to highest.
> I will resend patch 1 which was failed due to my proxy setting.

What I meant here is that the call:

blocking_notifier_call_chain(&module_notify_list, MODULE_STATE_GOING, mod);

should be replaced with:

blocking_notifier_call_chain_reverse(&module_notify_list, MODULE_STATE_GOING, mod);

> 
>>
>>> -    klp_module_going(mod);
>>>    bug_cleanup:
>>>       mod->state = MODULE_STATE_GOING;
>>>       /* module_bug_cleanup needs module_mutex protection */
>>
>> The patch removes the klp_module_going() cleanup call in load_module().
>> Similarly, the ftrace_release_mod() call under the ddebug_cleanup label
>> should be removed and appropriately replaced with a cleanup via
>> a notifier.
>>
>     err = prepare_module_state_transaction(mod,
>                 MODULE_STATE_UNFORMED, MODULE_STATE_FORMED);
>     if (err)
>         goto ddebug_cleanup;
> 
> ftrace will be cleanup in blocking_notifier_call_chain_robust rolling back.
> 
>     err = prepare_module_state_transaction(mod,
>                 MODULE_STATE_COMING, MODULE_STATE_GOING);
> 
> each notifier including ftrace and klp will be cleanup in blocking_notifier_call_chain_robust rolling back.
> 
> if all notifiers are successful in MODULE_STATE_COMING, they all will be clean up in
>  coming_cleanup:
>     mod->state = MODULE_STATE_GOING;
>     destroy_params(mod->kp, mod->num_kp);
>     blocking_notifier_call_chain(&module_notify_list,
>                      MODULE_STATE_GOING, mod);
> 
> if  something wrong underneath.

My point is that the patch leaves a call to ftrace_release_mod() in
load_module(), which I expected to be handled via a notifier.

-- 
Thanks,
Petr

^ permalink raw reply

* Re: [RFC PATCH 1/2] kernel/notifier: replace single-linked list with double-linked list for reverse traversal
From: Petr Mladek @ 2026-04-16 10:33 UTC (permalink / raw)
  To: chensong_2000
  Cc: rafael, lenb, mturquette, sboyd, viresh.kumar, agk, snitzer,
	mpatocka, bmarzins, song, yukuai, linan122, jason.wessel, danielt,
	dianders, horms, davem, edumazet, kuba, pabeni, paulmck, frederic,
	mcgrof, petr.pavlu, da.gomez, samitolvanen, atomlin, jpoimboe,
	jikos, mbenes, joe.lawrence, rostedt, mhiramat, mark.rutland,
	mathieu.desnoyers, linux-modules, linux-kernel,
	linux-trace-kernel, linux-acpi, linux-clk, linux-pm,
	live-patching, dm-devel, linux-raid, kgdb-bugreport, netdev
In-Reply-To: <20260415070137.17860-1-chensong_2000@189.cn>

On Wed 2026-04-15 15:01:37, chensong_2000@189.cn wrote:
> From: Song Chen <chensong_2000@189.cn>
> 
> The current notifier chain implementation uses a single-linked list
> (struct notifier_block *next), which only supports forward traversal
> in priority order. This makes it difficult to handle cleanup/teardown
> scenarios that require notifiers to be called in reverse priority order.
> 
> A concrete example is the ordering dependency between ftrace and
> livepatch during module load/unload. see the detail here [1].
> 
> This patch replaces the single-linked list in struct notifier_block
> with a struct list_head, converting the notifier chain into a
> doubly-linked list sorted in descending priority order. Based on
> this, a new function notifier_call_chain_reverse() is introduced,
> which traverses the chain in reverse (ascending priority order).
> The corresponding blocking_notifier_call_chain_reverse() is also
> added as the locking wrapper for blocking notifier chains.
> 
> The internal notifier_call_chain_robust() is updated to use
> notifier_call_chain_reverse() for rollback: on error, it records
> the failing notifier (last_nb) and the count of successfully called
> notifiers (nr), then rolls back exactly those nr-1 notifiers in
> reverse order starting from last_nb's predecessor, without needing
> to know the total length of the chain.
> 
> With this change, subsystems with symmetric setup/teardown ordering
> requirements can register a single notifier_block with one priority
> value, and rely on blocking_notifier_call_chain() for forward
> traversal and blocking_notifier_call_chain_reverse() for reverse
> traversal, without needing hard-coded call sequences or separate
> notifier registrations for each direction.
> 
> [1]:https://lore.kernel.org/all
> 	/alpine.LNX.2.00.1602172216491.22700@cbobk.fhfr.pm/
> 
> --- a/include/linux/notifier.h
> +++ b/include/linux/notifier.h
> @@ -53,41 +53,41 @@ typedef	int (*notifier_fn_t)(struct notifier_block *nb,
[...]
>  struct notifier_block {
>  	notifier_fn_t notifier_call;
> -	struct notifier_block __rcu *next;
> +	struct list_head __rcu entry;
>  	int priority;
>  };
[...]
>  #define ATOMIC_INIT_NOTIFIER_HEAD(name) do {	\
>  		spin_lock_init(&(name)->lock);	\
> -		(name)->head = NULL;		\
> +		INIT_LIST_HEAD(&(name)->head);		\

I would expect the RCU variant here, aka INIT_LIST_HEAD_RCU().

> --- a/kernel/notifier.c
> +++ b/kernel/notifier.c
> @@ -14,39 +14,47 @@
>   *	are layered on top of these, with appropriate locking added.
>   */
>  
> -static int notifier_chain_register(struct notifier_block **nl,
> +static int notifier_chain_register(struct list_head *nl,
>  				   struct notifier_block *n,
>  				   bool unique_priority)
>  {
> -	while ((*nl) != NULL) {
> -		if (unlikely((*nl) == n)) {
> +	struct notifier_block *cur;
> +
> +	list_for_each_entry(cur, nl, entry) {
> +		if (unlikely(cur == n)) {
>  			WARN(1, "notifier callback %ps already registered",
>  			     n->notifier_call);
>  			return -EEXIST;
>  		}
> -		if (n->priority > (*nl)->priority)
> -			break;
> -		if (n->priority == (*nl)->priority && unique_priority)
> +
> +		if (n->priority == cur->priority && unique_priority)
>  			return -EBUSY;
> -		nl = &((*nl)->next);
> +
> +		if (n->priority > cur->priority) {
> +			list_add_tail(&n->entry, &cur->entry);
> +			goto out;
> +		}
>  	}
> -	n->next = *nl;
> -	rcu_assign_pointer(*nl, n);
> +
> +	list_add_tail(&n->entry, nl);

I would expect list_add_tail_rcu() here.

> @@ -59,25 +67,25 @@ static int notifier_chain_unregister(struct notifier_block **nl,
>   *			value of this parameter is -1.
>   *	@nr_calls:	Records the number of notifications sent. Don't care
>   *			value of this field is NULL.
> + *	@last_nb:  Records the last called notifier block for rolling back
>   *	Return:		notifier_call_chain returns the value returned by the
>   *			last notifier function called.
>   */
> -static int notifier_call_chain(struct notifier_block **nl,
> +static int notifier_call_chain(struct list_head *nl,
>  			       unsigned long val, void *v,
> -			       int nr_to_call, int *nr_calls)
> +			       int nr_to_call, int *nr_calls,
> +				   struct notifier_block **last_nb)
>  {
>  	int ret = NOTIFY_DONE;
> -	struct notifier_block *nb, *next_nb;
> -
> -	nb = rcu_dereference_raw(*nl);
> +	struct notifier_block *nb;
>  
> -	while (nb && nr_to_call) {
> -		next_nb = rcu_dereference_raw(nb->next);
> +	if (!nr_to_call)
> +		return ret;
>  
> +	list_for_each_entry(nb, nl, entry) {

I would expect the RCU variant here, aka list_for_each_rcu()

These are just two random examples which I found by a quick look.

I guess that the notifier API is very old and it does not use all
the RCU API features which allow to track safety when
CONFIG_PROVE_RCU and CONFIG_PROVE_RCU_LIST are enabled.

It actually might be worth to audit the code and make it right.

>  #ifdef CONFIG_DEBUG_NOTIFIERS
>  		if (unlikely(!func_ptr_is_kernel_text(nb->notifier_call))) {
>  			WARN(1, "Invalid notifier called!");
> -			nb = next_nb;
>  			continue;
>  		}
>  #endif

That said, I am not sure if the ftrace/livepatching handlers are
the right motivation for this. Especially when I see the
complexity of the 2nd patch [*]

After thinking more about it. I am not even sure that the ftrace and
livepatching callbacks are good candidates for generic notifiers.
They are too special. It is not only about ordering them against
each other. But it is also about ordering them against other
notifiers. The ftrace/livepatching callbacks must be the first/last
during module load/release.

[*] The 2nd patch is not archived by lore for some reason.
    I have found only a review but it gives a good picture, see
    https://lore.kernel.org/all/1191caf5-6a61-4622-a15e-854d3701f4fc@suse.com/

Best Regards,
Petr

^ permalink raw reply

* Re: [PATCH 0/3] mm: split the file's i_mmap tree for NUMA
From: Mateusz Guzik @ 2026-04-16 10:29 UTC (permalink / raw)
  To: Huang Shijie
  Cc: akpm, viro, brauner, linux-mm, linux-kernel, linux-arm-kernel,
	linux-fsdevel, muchun.song, osalvador, linux-trace-kernel,
	linux-perf-users, linux-parisc, nvdimm, zhongyuan, fangbaoshun,
	yingzhiwei
In-Reply-To: <ad4EvoDcAKE2Sl4+@hsj-2U-Workstation>

On Tue, Apr 14, 2026 at 11:11 AM Huang Shijie <huangsj@hygon.cn> wrote:
>
> On Mon, Apr 13, 2026 at 05:33:21PM +0200, Mateusz Guzik wrote:
> > On Mon, Apr 13, 2026 at 02:20:39PM +0800, Huang Shijie wrote:
> > >   In NUMA, there are maybe many NUMA nodes and many CPUs.
> > > For example, a Hygon's server has 12 NUMA nodes, and 384 CPUs.
> > > In the UnixBench tests, there is a test "execl" which tests
> > > the execve system call.
> > >
> > >   When we test our server with "./Run -c 384 execl",
> > > the test result is not good enough. The i_mmap locks contended heavily on
> > > "libc.so" and "ld.so". For example, the i_mmap tree for "libc.so" can have
> > > over 6000 VMAs, all the VMAs can be in different NUMA mode.
> > > The insert/remove operations do not run quickly enough.
> > >
> > > patch 1 & patch 2 are try to hide the direct access of i_mmap.
> > > patch 3 splits the i_mmap into sibling trees, and we can get better
> > > performance with this patch set:
> > >     we can get 77% performance improvement(10 times average)
> > >
> >
> > To my reading you kept the lock as-is and only distributed the protected
> > state.
> >
> > While I don't doubt the improvement, I'm confident should you take a
> > look at the profile you are going to find this still does not scale with
> > rwsem being one of the problems (there are other global locks, some of
> > which have experimental patches for).
> IMHO, when the number of VMAs in the i_mmap is very large, only optimise the rwsem
> lock does not help too much for our NUMA case.
>
> In our NUMA server, the remote access could be the major issue.
>

I'm confused how this is not supposed to help. You moved your data to
be stored per-domain. With my proposal the lock itself will also get
that treatment.

Modulo the issue of what to do with code wanting to iterate the entire
thing, this is blatantly faster.

>
> >
> > Apart from that this does nothing to help high core systems which are
> > all one node, which imo puts another question mark on this specific
> > proposal.
> Yes, this patch set only focus on the NUMA case.
> The one-node case should use the original i_mmap.
>
> Maybe I can add a new config, CONFIG_SPILT_I_MMAP. The config is disabled
> by default, and enabled when the NUMA node is not one.
>
> >
> > Of course one may question whether a RB tree is the right choice here,
> > it may be the lock-protected cost can go way down with merely a better
> > data structure.
> >
> > Regardless of that, for actual scalability, there will be no way around
> > decentralazing locking around this and partitioning per some core count
> > (not just by numa awareness).
> >
> > Decentralizing locking is definitely possible, but I have not looked
> > into specifics of how problematic it is. Best case scenario it will
> > merely with separate locks. Worst case scenario something needs a fully
> > stabilized state for traversal, in that case another rw lock can be
> Yes.
>
> The traversal may need to hold many locks.
>

The very paragraph you partially quoted answers what to do in that
case: wrap everything with a new rwsem taken for reading when
adding/removing entries and taken for writing when iterating the
entire thing. Then the iteration sticks to one lock.

The new rw lock puts an upper ceiling on scalability of the thing, but
it is way higher than the current state.

Given the extra overhead associated with it one could consider
sticking to one centralized state by default and switching to
distributed state if there is enough contention.

> > slapped around this, creating locking order read lock -> per-subset
> > write lock -- this will suffer scalability due to the read locking, but
> > it will still scale drastically better as apart from that there will be
> > no serialization. In this setting the problematic consumer will write
> > lock the new thing to stabilize the state.
> >
> > So my non-maintainer opinion is that the patchset is not worth it as it
> > fails to address anything for significantly more common and already
> > affected setups.
> This patch set is to reduce the remote access latency for insert/remove VMA
> in NUMA.
>

And I am saying the mmap semaphore is a significant problem already on
high-core no-numa setups. Addressing scalability in that case would
sort out the problem in your setup and to a significantly higher
extent.

> >
> > Have you looked into splitting the lock?
> >
> I ever tried.
>
> But there are two disadvantages:
>   1.) The traversal may need to hold many locks which makes the
>       code very horrible.
>

I already above this is avoidable.

>   2.) Even we split the locks. Each lock protects a tree, when the tree becomes
>       big enough, the VMA insert/remove will also become slow in NUMA.
>       The reason is that the tree has VMAs in different NUMA nodes.
>

This is orthogonal to my proposal. In fact, if one is to pretend this
is never a factor with your patch, I would like to point out it will
remain not a factor if the per-numa struct gets its own lock.

^ permalink raw reply

* [PATCH v8 8/8] selftests/ftrace: Add a testcase for multiple fprobe events
From: Masami Hiramatsu (Google) @ 2026-04-16 10:17 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu
  Cc: Menglong Dong, Mathieu Desnoyers, jiang.biao, linux-kernel,
	linux-trace-kernel
In-Reply-To: <177633460058.3479617.11868368413034643565.stgit@mhiramat.tok.corp.google.com>

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

Add a testcase for multiple fprobe events on the same function
so that it clears ftrace hash map correctly when removing the
events.

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
 .../test.d/dynevent/add_remove_multiple_fprobe.tc  |   69 ++++++++++++++++++++
 1 file changed, 69 insertions(+)
 create mode 100644 tools/testing/selftests/ftrace/test.d/dynevent/add_remove_multiple_fprobe.tc

diff --git a/tools/testing/selftests/ftrace/test.d/dynevent/add_remove_multiple_fprobe.tc b/tools/testing/selftests/ftrace/test.d/dynevent/add_remove_multiple_fprobe.tc
new file mode 100644
index 000000000000..f2cbf2ffd29b
--- /dev/null
+++ b/tools/testing/selftests/ftrace/test.d/dynevent/add_remove_multiple_fprobe.tc
@@ -0,0 +1,69 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: Generic dynamic event - add/remove multiple fprobe events on the same function
+# requires: dynamic_events "f[:[<group>/][<event>]] <func-name>[%return] [<args>]":README enabled_functions
+
+echo 0 > events/enable
+echo > dynamic_events
+
+PLACE=vfs_read
+PLACE2=vfs_open
+
+:;: 'Ensure no other ftrace user' ;:
+test `cat enabled_functions | wc -l` -eq 0 || exit_unresolved
+
+:;: 'Test case 1: leave entry event' ;:
+:;: 'Add entry and exit events on the same place' ;:
+echo "f:event1 ${PLACE}" >> dynamic_events
+echo "f:event2 ${PLACE}%return" >> dynamic_events
+
+:;: 'Enable both of them' ;:
+echo 1 > events/fprobes/enable
+test `cat enabled_functions | wc -l` -eq 1
+
+:;: 'Disable and remove exit event' ;:
+echo 0 > events/fprobes/event2/enable
+echo -:event2 >> dynamic_events
+
+:;: 'Disable and remove all events' ;:
+echo 0 > events/fprobes/enable
+echo > dynamic_events
+
+:;: 'Add another event' ;:
+echo "f:event3 ${PLACE2}%return" > dynamic_events
+echo 1 > events/fprobes/enable
+test `cat enabled_functions | wc -l` -eq 1
+
+:;: 'No other ftrace user' ;:
+echo 0 > events/fprobes/enable
+echo > dynamic_events
+test `cat enabled_functions | wc -l` -eq 0
+
+:;: 'Test case 2: leave exit event' ;:
+:;: 'Add entry and exit events on the same place' ;:
+echo "f:event1 ${PLACE}" >> dynamic_events
+echo "f:event2 ${PLACE}%return" >> dynamic_events
+
+:;: 'Enable both of them' ;:
+echo 1 > events/fprobes/enable
+test `cat enabled_functions | wc -l` -eq 1
+
+:;: 'Disable and remove entry event' ;:
+echo 0 > events/fprobes/event1/enable
+echo -:event1 >> dynamic_events
+
+:;: 'Disable and remove all events' ;:
+echo 0 > events/fprobes/enable
+echo > dynamic_events
+
+:;: 'Add another event' ;:
+echo "f:event3 ${PLACE2}" > dynamic_events
+echo 1 > events/fprobes/enable
+test `cat enabled_functions | wc -l` -eq 1
+
+:;: 'No other ftrace user' ;:
+echo 0 > events/fprobes/enable
+echo > dynamic_events
+test `cat enabled_functions | wc -l` -eq 0
+
+clear_trace


^ permalink raw reply related

* [PATCH v8 7/8] selftests/ftrace: Add a testcase for fprobe events on module
From: Masami Hiramatsu (Google) @ 2026-04-16 10:17 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu
  Cc: Menglong Dong, Mathieu Desnoyers, jiang.biao, linux-kernel,
	linux-trace-kernel
In-Reply-To: <177633460058.3479617.11868368413034643565.stgit@mhiramat.tok.corp.google.com>

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

Add a testcase for fprobe events on module, which unloads a kernel
module on which fprobe events are probing and ensure the ftrace
hash map is cleared correctly.

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
 .../test.d/dynevent/add_remove_fprobe_module.tc    |   61 ++++++++++++++++++++
 1 file changed, 61 insertions(+)
 create mode 100644 tools/testing/selftests/ftrace/test.d/dynevent/add_remove_fprobe_module.tc

diff --git a/tools/testing/selftests/ftrace/test.d/dynevent/add_remove_fprobe_module.tc b/tools/testing/selftests/ftrace/test.d/dynevent/add_remove_fprobe_module.tc
new file mode 100644
index 000000000000..500942e9d8d6
--- /dev/null
+++ b/tools/testing/selftests/ftrace/test.d/dynevent/add_remove_fprobe_module.tc
@@ -0,0 +1,61 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: Generic dynamic event - add/remove fprobe events on module
+# requires: dynamic_events "f[:[<group>/][<event>]] <func-name>[%return] [<args>]":README enabled_functions
+
+rmmod trace-events-sample ||:
+if ! modprobe trace-events-sample ; then
+  echo "No trace-events sample module - please make CONFIG_SAMPLE_TRACE_EVENTS=m"
+  exit_unresolved;
+fi
+trap "rmmod trace-events-sample" EXIT
+
+echo 0 > events/enable
+echo > dynamic_events
+
+FUNC1='foo_bar*'
+FUNC2='vfs_read'
+
+:;: "Add an event on the test module" ;:
+echo "f:test1 $FUNC1" >> dynamic_events
+echo 1 > events/fprobes/test1/enable
+
+:;: "Ensure it is enabled" ;:
+funcs=`cat enabled_functions | wc -l`
+test $funcs -ne 0
+
+:;: "Check the enabled_functions is cleared on unloading" ;:
+rmmod trace_events_sample
+funcs=`cat enabled_functions | wc -l`
+test $funcs -eq 0
+
+:;: "Check it is kept clean" ;:
+modprobe trace-events-sample
+echo 1 > events/fprobes/test1/enable || echo "OK"
+funcs=`cat enabled_functions | wc -l`
+test $funcs -eq 0
+
+:;: "Add another event not on the test module" ;:
+echo "f:test2 $FUNC2" >> dynamic_events
+echo 1 > events/fprobes/test2/enable
+
+:;: "Ensure it is enabled" ;:
+ofuncs=`cat enabled_functions | wc -l`
+test $ofuncs -ne 0
+
+:;: "Disable and remove the first event"
+echo 0 > events/fprobes/test1/enable
+echo "-:fprobes/test1" >> dynamic_events
+funcs=`cat enabled_functions | wc -l`
+test $ofuncs -eq $funcs
+
+:;: "Disable and remove other events" ;:
+echo 0 > events/fprobes/enable
+echo > dynamic_events
+funcs=`cat enabled_functions | wc -l`
+test $funcs -eq 0
+
+
+rmmod trace-events-sample
+trap "" EXIT
+clear_trace


^ permalink raw reply related

* [PATCH v8 6/8] tracing/fprobe: Fix to unregister ftrace_ops if it is empty on module unloading
From: Masami Hiramatsu (Google) @ 2026-04-16 10:17 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu
  Cc: Menglong Dong, Mathieu Desnoyers, jiang.biao, linux-kernel,
	linux-trace-kernel
In-Reply-To: <177633460058.3479617.11868368413034643565.stgit@mhiramat.tok.corp.google.com>

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

Fix fprobe to unregister ftrace_ops if corresponding type of fprobe
does not exist on the fprobe_ip_table and it is expected to be empty
when unloading modules.

Since ftrace thinks that the empty hash means everything to be traced,
if we set fprobes only on the unloaded module, all functions are traced
unexpectedly after unloading module.
e.g.

 # modprobe xt_LOG.ko
 # echo 'f:test log_tg*' > dynamic_events
 # echo 1 > events/fprobes/test/enable
 # cat enabled_functions
log_tg [xt_LOG] (1)             tramp: 0xffffffffa0004000 (fprobe_ftrace_entry+0x0/0x490) ->fprobe_ftrace_entry+0x0/0x490
log_tg_check [xt_LOG] (1)               tramp: 0xffffffffa0004000 (fprobe_ftrace_entry+0x0/0x490) ->fprobe_ftrace_entry+0x0/0x490
log_tg_destroy [xt_LOG] (1)             tramp: 0xffffffffa0004000 (fprobe_ftrace_entry+0x0/0x490) ->fprobe_ftrace_entry+0x0/0x490
 # rmmod xt_LOG
 # wc -l enabled_functions
34085 enabled_functions

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
 Changes in v8:
  - Fix to check fprobe_graph/ftrace_registered flag directly
    when registering ftrace_ops.
 Changes in v7:
  - Fix to split checking whether ftrace_ops is registered from
    the number of registered fprobes, because ftrace_ops can be
    unregistered in module unloading.
 Changes in v6:
  - Newly added.
---
 kernel/trace/fprobe.c |  211 +++++++++++++++++++++++++++++++++++++------------
 1 file changed, 161 insertions(+), 50 deletions(-)

diff --git a/kernel/trace/fprobe.c b/kernel/trace/fprobe.c
index f2c30587ab26..3c390b90a99f 100644
--- a/kernel/trace/fprobe.c
+++ b/kernel/trace/fprobe.c
@@ -79,7 +79,7 @@ static const struct rhashtable_params fprobe_rht_params = {
 };
 
 /* Node insertion and deletion requires the fprobe_mutex */
-static int insert_fprobe_node(struct fprobe_hlist_node *node, struct fprobe *fp)
+static int __insert_fprobe_node(struct fprobe_hlist_node *node, struct fprobe *fp)
 {
 	int ret;
 
@@ -92,7 +92,7 @@ static int insert_fprobe_node(struct fprobe_hlist_node *node, struct fprobe *fp)
 	return ret;
 }
 
-static void delete_fprobe_node(struct fprobe_hlist_node *node)
+static void __delete_fprobe_node(struct fprobe_hlist_node *node)
 {
 	lockdep_assert_held(&fprobe_mutex);
 
@@ -250,7 +250,72 @@ static inline int __fprobe_kprobe_handler(unsigned long ip, unsigned long parent
 	return ret;
 }
 
+static int fprobe_fgraph_entry(struct ftrace_graph_ent *trace, struct fgraph_ops *gops,
+			       struct ftrace_regs *fregs);
+static void fprobe_return(struct ftrace_graph_ret *trace,
+			  struct fgraph_ops *gops,
+			  struct ftrace_regs *fregs);
+
+static struct fgraph_ops fprobe_graph_ops = {
+	.entryfunc	= fprobe_fgraph_entry,
+	.retfunc	= fprobe_return,
+};
+/* Number of fgraph fprobes */
+static int fprobe_graph_active;
+/* Number of fgraph fprobe nodes */
+static int nr_fgraph_fprobes;
+/* Is fprobe_graph_ops registered? */
+static bool fprobe_graph_registered;
+
+/* Add @addrs to the ftrace filter and register fgraph if needed. */
+static int fprobe_graph_add_ips(unsigned long *addrs, int num)
+{
+	int ret;
+
+	lockdep_assert_held(&fprobe_mutex);
+
+	ret = ftrace_set_filter_ips(&fprobe_graph_ops.ops, addrs, num, 0, 0);
+	if (ret)
+		return ret;
+
+	if (!fprobe_graph_registered) {
+		ret = register_ftrace_graph(&fprobe_graph_ops);
+		if (WARN_ON_ONCE(ret)) {
+			ftrace_free_filter(&fprobe_graph_ops.ops);
+			return ret;
+		}
+		fprobe_graph_registered = true;
+	}
+	fprobe_graph_active++;
+	return 0;
+}
+
+static void __fprobe_graph_unregister(void)
+{
+	if (fprobe_graph_registered) {
+		unregister_ftrace_graph(&fprobe_graph_ops);
+		ftrace_free_filter(&fprobe_graph_ops.ops);
+		fprobe_graph_registered = false;
+	}
+}
+
+/* Remove @addrs from the ftrace filter and unregister fgraph if possible. */
+static void fprobe_graph_remove_ips(unsigned long *addrs, int num)
+{
+	lockdep_assert_held(&fprobe_mutex);
+
+	if (!fprobe_graph_active)
+		return;
+
+	fprobe_graph_active--;
+	if (!fprobe_graph_active)
+		__fprobe_graph_unregister();
+	else if (num)
+		ftrace_set_filter_ips(&fprobe_graph_ops.ops, addrs, num, 1, 0);
+}
+
 #if defined(CONFIG_DYNAMIC_FTRACE_WITH_ARGS) || defined(CONFIG_DYNAMIC_FTRACE_WITH_REGS)
+
 /* ftrace_ops callback, this processes fprobes which have only entry_handler. */
 static void fprobe_ftrace_entry(unsigned long ip, unsigned long parent_ip,
 	struct ftrace_ops *ops, struct ftrace_regs *fregs)
@@ -293,7 +358,12 @@ static struct ftrace_ops fprobe_ftrace_ops = {
 	.func	= fprobe_ftrace_entry,
 	.flags	= FTRACE_OPS_FL_SAVE_ARGS,
 };
+/* Number of fgraph fprobes */
 static int fprobe_ftrace_active;
+/* Number of ftrace fprobe nodes */
+static int nr_ftrace_fprobes;
+/* Is fprobe_ftrace_ops registered? */
+static bool fprobe_ftrace_registered;
 
 static int fprobe_ftrace_add_ips(unsigned long *addrs, int num)
 {
@@ -305,25 +375,38 @@ static int fprobe_ftrace_add_ips(unsigned long *addrs, int num)
 	if (ret)
 		return ret;
 
-	if (!fprobe_ftrace_active) {
+	if (!fprobe_ftrace_registered) {
 		ret = register_ftrace_function(&fprobe_ftrace_ops);
 		if (ret) {
 			ftrace_free_filter(&fprobe_ftrace_ops);
 			return ret;
 		}
+		fprobe_ftrace_registered = true;
 	}
 	fprobe_ftrace_active++;
 	return 0;
 }
 
+static void __fprobe_ftrace_unregister(void)
+{
+	if (fprobe_ftrace_registered) {
+		unregister_ftrace_function(&fprobe_ftrace_ops);
+		ftrace_free_filter(&fprobe_ftrace_ops);
+		fprobe_ftrace_registered = false;
+	}
+}
+
 static void fprobe_ftrace_remove_ips(unsigned long *addrs, int num)
 {
 	lockdep_assert_held(&fprobe_mutex);
 
+	if (!fprobe_ftrace_active)
+		return;
+
 	fprobe_ftrace_active--;
 	if (!fprobe_ftrace_active)
-		unregister_ftrace_function(&fprobe_ftrace_ops);
-	if (num)
+		__fprobe_ftrace_unregister();
+	else if (num)
 		ftrace_set_filter_ips(&fprobe_ftrace_ops, addrs, num, 1, 0);
 }
 
@@ -332,6 +415,40 @@ static bool fprobe_is_ftrace(struct fprobe *fp)
 	return !fp->exit_handler;
 }
 
+/* Node insertion and deletion requires the fprobe_mutex */
+static int insert_fprobe_node(struct fprobe_hlist_node *node, struct fprobe *fp)
+{
+	int ret;
+
+	lockdep_assert_held(&fprobe_mutex);
+
+	ret = __insert_fprobe_node(node, fp);
+	if (!ret) {
+		if (fprobe_is_ftrace(fp))
+			nr_ftrace_fprobes++;
+		else
+			nr_fgraph_fprobes++;
+	}
+
+	return ret;
+}
+
+static void delete_fprobe_node(struct fprobe_hlist_node *node)
+{
+	struct fprobe *fp;
+
+	lockdep_assert_held(&fprobe_mutex);
+
+	fp = READ_ONCE(node->fp);
+	if (fp) {
+		if (fprobe_is_ftrace(fp))
+			nr_ftrace_fprobes--;
+		else
+			nr_fgraph_fprobes--;
+	}
+	__delete_fprobe_node(node);
+}
+
 static bool fprobe_exists_on_hash(unsigned long ip, bool ftrace)
 {
 	struct rhlist_head *head, *pos;
@@ -361,8 +478,15 @@ static bool fprobe_exists_on_hash(unsigned long ip, bool ftrace)
 #ifdef CONFIG_MODULES
 static void fprobe_remove_ips(unsigned long *ips, unsigned int cnt)
 {
-	ftrace_set_filter_ips(&fprobe_graph_ops.ops, ips, cnt, 1, 0);
-	ftrace_set_filter_ips(&fprobe_ftrace_ops, ips, cnt, 1, 0);
+	if (!nr_fgraph_fprobes)
+		__fprobe_graph_unregister();
+	else
+		ftrace_set_filter_ips(&fprobe_graph_ops.ops, ips, cnt, 1, 0);
+
+	if (!nr_ftrace_fprobes)
+		__fprobe_ftrace_unregister();
+	else
+		ftrace_set_filter_ips(&fprobe_ftrace_ops, ips, cnt, 1, 0);
 }
 #endif
 #else
@@ -380,6 +504,32 @@ static bool fprobe_is_ftrace(struct fprobe *fp)
 	return false;
 }
 
+/* Node insertion and deletion requires the fprobe_mutex */
+static int insert_fprobe_node(struct fprobe_hlist_node *node, struct fprobe *fp)
+{
+	int ret;
+
+	lockdep_assert_held(&fprobe_mutex);
+
+	ret = __insert_fprobe_node(node, fp);
+	if (!ret)
+		nr_fgraph_fprobes++;
+
+	return ret;
+}
+
+static void delete_fprobe_node(struct fprobe_hlist_node *node)
+{
+	struct fprobe *fp;
+
+	lockdep_assert_held(&fprobe_mutex);
+
+	fp = READ_ONCE(node->fp);
+	if (fp)
+		nr_fgraph_fprobes--;
+	__delete_fprobe_node(node);
+}
+
 static bool fprobe_exists_on_hash(unsigned long ip, bool ftrace __maybe_unused)
 {
 	struct rhlist_head *head, *pos;
@@ -406,7 +556,10 @@ static bool fprobe_exists_on_hash(unsigned long ip, bool ftrace __maybe_unused)
 #ifdef CONFIG_MODULES
 static void fprobe_remove_ips(unsigned long *ips, unsigned int cnt)
 {
-	ftrace_set_filter_ips(&fprobe_graph_ops.ops, ips, cnt, 1, 0);
+	if (!nr_fgraph_fprobes)
+		__fprobe_graph_unregister();
+	else
+		ftrace_set_filter_ips(&fprobe_graph_ops.ops, ips, cnt, 1, 0);
 }
 #endif
 #endif /* !CONFIG_DYNAMIC_FTRACE_WITH_ARGS && !CONFIG_DYNAMIC_FTRACE_WITH_REGS */
@@ -536,48 +689,6 @@ static void fprobe_return(struct ftrace_graph_ret *trace,
 }
 NOKPROBE_SYMBOL(fprobe_return);
 
-static struct fgraph_ops fprobe_graph_ops = {
-	.entryfunc	= fprobe_fgraph_entry,
-	.retfunc	= fprobe_return,
-};
-static int fprobe_graph_active;
-
-/* Add @addrs to the ftrace filter and register fgraph if needed. */
-static int fprobe_graph_add_ips(unsigned long *addrs, int num)
-{
-	int ret;
-
-	lockdep_assert_held(&fprobe_mutex);
-
-	ret = ftrace_set_filter_ips(&fprobe_graph_ops.ops, addrs, num, 0, 0);
-	if (ret)
-		return ret;
-
-	if (!fprobe_graph_active) {
-		ret = register_ftrace_graph(&fprobe_graph_ops);
-		if (WARN_ON_ONCE(ret)) {
-			ftrace_free_filter(&fprobe_graph_ops.ops);
-			return ret;
-		}
-	}
-	fprobe_graph_active++;
-	return 0;
-}
-
-/* Remove @addrs from the ftrace filter and unregister fgraph if possible. */
-static void fprobe_graph_remove_ips(unsigned long *addrs, int num)
-{
-	lockdep_assert_held(&fprobe_mutex);
-
-	fprobe_graph_active--;
-	/* Q: should we unregister it ? */
-	if (!fprobe_graph_active)
-		unregister_ftrace_graph(&fprobe_graph_ops);
-
-	if (num)
-		ftrace_set_filter_ips(&fprobe_graph_ops.ops, addrs, num, 1, 0);
-}
-
 #ifdef CONFIG_MODULES
 
 #define FPROBE_IPS_BATCH_INIT 128


^ permalink raw reply related

* [PATCH v8 5/8] tracing/fprobe: Check the same type fprobe on table as the unregistered one
From: Masami Hiramatsu (Google) @ 2026-04-16 10:17 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu
  Cc: Menglong Dong, Mathieu Desnoyers, jiang.biao, linux-kernel,
	linux-trace-kernel
In-Reply-To: <177633460058.3479617.11868368413034643565.stgit@mhiramat.tok.corp.google.com>

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

Commit 2c67dc457bc6 ("tracing: fprobe: optimization for entry only case")
introduced a different ftrace_ops for entry-only fprobes.

However, when unregistering an fprobe, the kernel only checks if another
fprobe exists at the same address, without checking which type of fprobe
it is.
If different fprobes are registered at the same address, the same address
will be registered in both fgraph_ops and ftrace_ops, but only one of
them will be deleted when unregistering. (the one removed first will not
be deleted from the ops).

This results in junk entries remaining in either fgraph_ops or ftrace_ops.
For example:
 =======
 cd /sys/kernel/tracing

 # 'Add entry and exit events on the same place'
 echo 'f:event1 vfs_read' >> dynamic_events
 echo 'f:event2 vfs_read%return' >> dynamic_events

 # 'Enable both of them'
 echo 1 > events/fprobes/enable
 cat enabled_functions
vfs_read (2)            ->arch_ftrace_ops_list_func+0x0/0x210

 # 'Disable and remove exit event'
 echo 0 > events/fprobes/event2/enable
 echo -:event2 >> dynamic_events

 # 'Disable and remove all events'
 echo 0 > events/fprobes/enable
 echo > dynamic_events

 # 'Add another event'
 echo 'f:event3 vfs_open%return' > dynamic_events
 cat dynamic_events
f:fprobes/event3 vfs_open%return

 echo 1 > events/fprobes/enable
 cat enabled_functions
vfs_open (1)            tramp: 0xffffffffa0001000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60    subops: {ent:fprobe_fgraph_entry+0x0/0x620 ret:fprobe_return+0x0/0x150}
vfs_read (1)            tramp: 0xffffffffa0001000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60    subops: {ent:fprobe_fgraph_entry+0x0/0x620 ret:fprobe_return+0x0/0x150}
 =======

As you can see, an entry for the vfs_read remains.

To fix this issue, when unregistering, the kernel should also check if
there is the same type of fprobes still exist at the same address, and
if not, delete its entry from either fgraph_ops or ftrace_ops.

Fixes: 2c67dc457bc6 ("tracing: fprobe: optimization for entry only case")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
 kernel/trace/fprobe.c |   82 +++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 65 insertions(+), 17 deletions(-)

diff --git a/kernel/trace/fprobe.c b/kernel/trace/fprobe.c
index 77d4449f20f7..f2c30587ab26 100644
--- a/kernel/trace/fprobe.c
+++ b/kernel/trace/fprobe.c
@@ -92,11 +92,8 @@ static int insert_fprobe_node(struct fprobe_hlist_node *node, struct fprobe *fp)
 	return ret;
 }
 
-/* Return true if there are synonims */
-static bool delete_fprobe_node(struct fprobe_hlist_node *node)
+static void delete_fprobe_node(struct fprobe_hlist_node *node)
 {
-	bool ret;
-
 	lockdep_assert_held(&fprobe_mutex);
 
 	/* Avoid double deleting and non-inserted nodes */
@@ -105,13 +102,6 @@ static bool delete_fprobe_node(struct fprobe_hlist_node *node)
 		rhltable_remove(&fprobe_ip_table, &node->hlist,
 				fprobe_rht_params);
 	}
-
-	rcu_read_lock();
-	ret = !!rhltable_lookup(&fprobe_ip_table, &node->addr,
-				fprobe_rht_params);
-	rcu_read_unlock();
-
-	return ret;
 }
 
 /* Check existence of the fprobe */
@@ -342,6 +332,32 @@ static bool fprobe_is_ftrace(struct fprobe *fp)
 	return !fp->exit_handler;
 }
 
+static bool fprobe_exists_on_hash(unsigned long ip, bool ftrace)
+{
+	struct rhlist_head *head, *pos;
+	struct fprobe_hlist_node *node;
+	struct fprobe *fp;
+
+	guard(rcu)();
+	head = rhltable_lookup(&fprobe_ip_table, &ip,
+				fprobe_rht_params);
+	if (!head)
+		return false;
+	/* We have to check the same type on the list. */
+	rhl_for_each_entry_rcu(node, pos, head, hlist) {
+		if (node->addr != ip)
+			break;
+		fp = READ_ONCE(node->fp);
+		if (likely(fp)) {
+			if ((!ftrace && fp->exit_handler) ||
+			    (ftrace && !fp->exit_handler))
+				return true;
+		}
+	}
+
+	return false;
+}
+
 #ifdef CONFIG_MODULES
 static void fprobe_remove_ips(unsigned long *ips, unsigned int cnt)
 {
@@ -364,6 +380,29 @@ static bool fprobe_is_ftrace(struct fprobe *fp)
 	return false;
 }
 
+static bool fprobe_exists_on_hash(unsigned long ip, bool ftrace __maybe_unused)
+{
+	struct rhlist_head *head, *pos;
+	struct fprobe_hlist_node *node;
+	struct fprobe *fp;
+
+	guard(rcu)();
+	head = rhltable_lookup(&fprobe_ip_table, &ip,
+				fprobe_rht_params);
+	if (!head)
+		return false;
+	/* We only need to check fp is there. */
+	rhl_for_each_entry_rcu(node, pos, head, hlist) {
+		if (node->addr != ip)
+			break;
+		fp = READ_ONCE(node->fp);
+		if (likely(fp))
+			return true;
+	}
+
+	return false;
+}
+
 #ifdef CONFIG_MODULES
 static void fprobe_remove_ips(unsigned long *ips, unsigned int cnt)
 {
@@ -552,18 +591,25 @@ struct fprobe_addr_list {
 static int fprobe_remove_node_in_module(struct module *mod, struct fprobe_hlist_node *node,
 					 struct fprobe_addr_list *alist)
 {
+	lockdep_assert_in_rcu_read_lock();
+
 	if (!within_module(node->addr, mod))
 		return 0;
 
-	if (delete_fprobe_node(node))
-		return 0;
+	delete_fprobe_node(node);
 	/* If no address list is available, we can't track this address. */
 	if (!alist->addrs)
 		return 0;
+	/*
+	 * Don't care the type here, because all fprobes on the same
+	 * address must be removed eventually.
+	 */
+	if (!rhltable_lookup(&fprobe_ip_table, &node->addr, fprobe_rht_params)) {
+		alist->addrs[alist->index++] = node->addr;
+		if (alist->index == alist->size)
+			return -ENOSPC;
+	}
 
-	alist->addrs[alist->index++] = node->addr;
-	if (alist->index == alist->size)
-		return -ENOSPC;
 	return 0;
 }
 
@@ -931,7 +977,9 @@ static int unregister_fprobe_nolock(struct fprobe *fp)
 	/* Remove non-synonim ips from table and hash */
 	count = 0;
 	for (i = 0; i < hlist_array->size; i++) {
-		if (!delete_fprobe_node(&hlist_array->array[i]) && addrs)
+		delete_fprobe_node(&hlist_array->array[i]);
+		if (addrs && !fprobe_exists_on_hash(hlist_array->array[i].addr,
+						    fprobe_is_ftrace(fp)))
 			addrs[count++] = hlist_array->array[i].addr;
 	}
 	del_fprobe_hash(fp);


^ permalink raw reply related

* [PATCH v8 4/8] tracing/fprobe: Avoid kcalloc() in rcu_read_lock section
From: Masami Hiramatsu (Google) @ 2026-04-16 10:17 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu
  Cc: Menglong Dong, Mathieu Desnoyers, jiang.biao, linux-kernel,
	linux-trace-kernel
In-Reply-To: <177633460058.3479617.11868368413034643565.stgit@mhiramat.tok.corp.google.com>

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

fprobe_remove_node_in_module() is called under RCU read locked, but
this invokes kcalloc() if there are more than 8 fprobes installed
on the module. Sashiko warns it because kcalloc() can sleep [1].

 [1] https://sashiko.dev/#/patchset/177552432201.853249.5125045538812833325.stgit%40mhiramat.tok.corp.google.com

To fix this issue, expand the batch size to 128 and do not expand
the fprobe_addr_list, but just cancel walking on fprobe_ip_table,
update fgraph/ftrace_ops and retry the loop again.

Fixes: 0de4c70d04a4 ("tracing: fprobe: use rhltable for fprobe_ip_table")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
 Changes in v6:
  - Retry outside rhltable_walk_enter/exit() again.
 Changes in v5:
  - Skip updating ftrace_ops when fails to allocate memory in module
    unloading.
 Changes in v4:
  - fix a build error typo in case of CONFIG_DYNAMIC_FTRACE=n.
 Changes in v3:
  - Retry inside rhltable_walk_enter/exit().
  - Rename fprobe_set_ips() to fprobe_remove_ips().
  - Rename 'retry' label to 'again'.
---
 kernel/trace/fprobe.c |   92 ++++++++++++++++++++++++-------------------------
 1 file changed, 45 insertions(+), 47 deletions(-)

diff --git a/kernel/trace/fprobe.c b/kernel/trace/fprobe.c
index 98c5786bf172..77d4449f20f7 100644
--- a/kernel/trace/fprobe.c
+++ b/kernel/trace/fprobe.c
@@ -343,11 +343,10 @@ static bool fprobe_is_ftrace(struct fprobe *fp)
 }
 
 #ifdef CONFIG_MODULES
-static void fprobe_set_ips(unsigned long *ips, unsigned int cnt, int remove,
-			   int reset)
+static void fprobe_remove_ips(unsigned long *ips, unsigned int cnt)
 {
-	ftrace_set_filter_ips(&fprobe_graph_ops.ops, ips, cnt, remove, reset);
-	ftrace_set_filter_ips(&fprobe_ftrace_ops, ips, cnt, remove, reset);
+	ftrace_set_filter_ips(&fprobe_graph_ops.ops, ips, cnt, 1, 0);
+	ftrace_set_filter_ips(&fprobe_ftrace_ops, ips, cnt, 1, 0);
 }
 #endif
 #else
@@ -366,10 +365,9 @@ static bool fprobe_is_ftrace(struct fprobe *fp)
 }
 
 #ifdef CONFIG_MODULES
-static void fprobe_set_ips(unsigned long *ips, unsigned int cnt, int remove,
-			   int reset)
+static void fprobe_remove_ips(unsigned long *ips, unsigned int cnt)
 {
-	ftrace_set_filter_ips(&fprobe_graph_ops.ops, ips, cnt, remove, reset);
+	ftrace_set_filter_ips(&fprobe_graph_ops.ops, ips, cnt, 1, 0);
 }
 #endif
 #endif /* !CONFIG_DYNAMIC_FTRACE_WITH_ARGS && !CONFIG_DYNAMIC_FTRACE_WITH_REGS */
@@ -543,7 +541,7 @@ static void fprobe_graph_remove_ips(unsigned long *addrs, int num)
 
 #ifdef CONFIG_MODULES
 
-#define FPROBE_IPS_BATCH_INIT 8
+#define FPROBE_IPS_BATCH_INIT 128
 /* instruction pointer address list */
 struct fprobe_addr_list {
 	int index;
@@ -551,45 +549,24 @@ struct fprobe_addr_list {
 	unsigned long *addrs;
 };
 
-static int fprobe_addr_list_add(struct fprobe_addr_list *alist, unsigned long addr)
+static int fprobe_remove_node_in_module(struct module *mod, struct fprobe_hlist_node *node,
+					 struct fprobe_addr_list *alist)
 {
-	unsigned long *addrs;
-
-	/* Previously we failed to expand the list. */
-	if (alist->index == alist->size)
-		return -ENOSPC;
-
-	alist->addrs[alist->index++] = addr;
-	if (alist->index < alist->size)
+	if (!within_module(node->addr, mod))
 		return 0;
 
-	/* Expand the address list */
-	addrs = kcalloc(alist->size * 2, sizeof(*addrs), GFP_KERNEL);
-	if (!addrs)
-		return -ENOMEM;
-
-	memcpy(addrs, alist->addrs, alist->size * sizeof(*addrs));
-	alist->size *= 2;
-	kfree(alist->addrs);
-	alist->addrs = addrs;
+	if (delete_fprobe_node(node))
+		return 0;
+	/* If no address list is available, we can't track this address. */
+	if (!alist->addrs)
+		return 0;
 
+	alist->addrs[alist->index++] = node->addr;
+	if (alist->index == alist->size)
+		return -ENOSPC;
 	return 0;
 }
 
-static void fprobe_remove_node_in_module(struct module *mod, struct fprobe_hlist_node *node,
-					 struct fprobe_addr_list *alist)
-{
-	if (!within_module(node->addr, mod))
-		return;
-	if (delete_fprobe_node(node))
-		return;
-	/*
-	 * If failed to update alist, just continue to update hlist.
-	 * Therefore, at list user handler will not hit anymore.
-	 */
-	fprobe_addr_list_add(alist, node->addr);
-}
-
 /* Handle module unloading to manage fprobe_ip_table. */
 static int fprobe_module_callback(struct notifier_block *nb,
 				  unsigned long val, void *data)
@@ -598,29 +575,50 @@ static int fprobe_module_callback(struct notifier_block *nb,
 	struct fprobe_hlist_node *node;
 	struct rhashtable_iter iter;
 	struct module *mod = data;
+	bool retry;
 
 	if (val != MODULE_STATE_GOING)
 		return NOTIFY_DONE;
 
 	alist.addrs = kcalloc(alist.size, sizeof(*alist.addrs), GFP_KERNEL);
-	/* If failed to alloc memory, we can not remove ips from hash. */
-	if (!alist.addrs)
-		return NOTIFY_DONE;
+	/*
+	 * If failed to alloc memory, ftrace_ops will not be able to remove ips from
+	 * hash, but we can still remove nodes from fprobe_ip_table, so we can avoid
+	 * the potential wrong callback. So just print a warning here and try to
+	 * continue without address list.
+	 */
+	WARN_ONCE(!alist.addrs,
+		"Failed to allocate memory for fprobe_addr_list, ftrace_ops will not be updated");
 
 	mutex_lock(&fprobe_mutex);
+again:
+	retry = false;
+	alist.index = 0;
 	rhltable_walk_enter(&fprobe_ip_table, &iter);
 	do {
 		rhashtable_walk_start(&iter);
 
 		while ((node = rhashtable_walk_next(&iter)) && !IS_ERR(node))
-			fprobe_remove_node_in_module(mod, node, &alist);
+			if (fprobe_remove_node_in_module(mod, node, &alist) < 0) {
+				retry = true;
+				break;
+			}
 
 		rhashtable_walk_stop(&iter);
-	} while (node == ERR_PTR(-EAGAIN));
+	} while (node == ERR_PTR(-EAGAIN) && !retry);
 	rhashtable_walk_exit(&iter);
+	/* Remove any ips from hash table(s) */
+	if (alist.index > 0) {
+		fprobe_remove_ips(alist.addrs, alist.index);
+		/*
+		 * If we break rhashtable walk loop except for -EAGAIN, we need
+		 * to restart looping from start for safety. Anyway, this is
+		 * not a hotpath.
+		 */
+		if (retry)
+			goto again;
+	}
 
-	if (alist.index > 0)
-		fprobe_set_ips(alist.addrs, alist.index, 1, 0);
 	mutex_unlock(&fprobe_mutex);
 
 	kfree(alist.addrs);


^ permalink raw reply related

* [PATCH v8 3/8] tracing/fprobe: Remove fprobe from hash in failure path
From: Masami Hiramatsu (Google) @ 2026-04-16 10:17 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu
  Cc: Menglong Dong, Mathieu Desnoyers, jiang.biao, linux-kernel,
	linux-trace-kernel
In-Reply-To: <177633460058.3479617.11868368413034643565.stgit@mhiramat.tok.corp.google.com>

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

When register_fprobe_ips() fails, it tries to remove a list of
fprobe_hash_node from fprobe_ip_table, but it missed to remove
fprobe itself from fprobe_table. Moreover, when removing
the fprobe_hash_node which is added to rhltable once, it must
use kfree_rcu() after removing from rhltable.

To fix these issues, this reuses unregister_fprobe() internal
code to rollback the half-way registered fprobe.

Fixes: 4346ba160409 ("fprobe: Rewrite fprobe on function-graph tracer")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
 Changes in v8:
  - Fix to check return value of add_fprobe_hash() and break
    loop if insert_fprobe_node() is failed.
 Changes in v7:
  - Remove RCU grace period wait, since fprobe itself is not
    that is not needed.
 Changes in v6:
  - Wait for an RCU grace period before returning error in
    unregister_fprobe_nolock().
 Changes in v5:
  - When rolling back an fprobe that failed to register, the
    fprobe_hash_node are forcibly removed and warn if failure.
 Changes in v4:
  - Remove short-cut case because we always need to upadte ftrace_ops.
  - Use guard(mutex) in register_fprobe_ips() to unlock it correctly.
  - Remove redundant !ret check in register_fprobe_ips().
  - Do not set hlist_array->size in failure case, instead,
    hlist_array->array[i].fp is set only when insertion is succeeded.
  Changes in v3:
  - Newly added.
---
 kernel/trace/fprobe.c |   85 +++++++++++++++++++++++++------------------------
 1 file changed, 43 insertions(+), 42 deletions(-)

diff --git a/kernel/trace/fprobe.c b/kernel/trace/fprobe.c
index d634eb8e8b9e..98c5786bf172 100644
--- a/kernel/trace/fprobe.c
+++ b/kernel/trace/fprobe.c
@@ -79,20 +79,27 @@ static const struct rhashtable_params fprobe_rht_params = {
 };
 
 /* Node insertion and deletion requires the fprobe_mutex */
-static int insert_fprobe_node(struct fprobe_hlist_node *node)
+static int insert_fprobe_node(struct fprobe_hlist_node *node, struct fprobe *fp)
 {
+	int ret;
+
 	lockdep_assert_held(&fprobe_mutex);
 
-	return rhltable_insert(&fprobe_ip_table, &node->hlist, fprobe_rht_params);
+	ret = rhltable_insert(&fprobe_ip_table, &node->hlist, fprobe_rht_params);
+	/* Set the fprobe pointer if insertion was successful. */
+	if (!ret)
+		WRITE_ONCE(node->fp, fp);
+	return ret;
 }
 
 /* Return true if there are synonims */
 static bool delete_fprobe_node(struct fprobe_hlist_node *node)
 {
-	lockdep_assert_held(&fprobe_mutex);
 	bool ret;
 
-	/* Avoid double deleting */
+	lockdep_assert_held(&fprobe_mutex);
+
+	/* Avoid double deleting and non-inserted nodes */
 	if (READ_ONCE(node->fp) != NULL) {
 		WRITE_ONCE(node->fp, NULL);
 		rhltable_remove(&fprobe_ip_table, &node->hlist,
@@ -757,7 +764,6 @@ static int fprobe_init(struct fprobe *fp, unsigned long *addrs, int num)
 	fp->hlist_array = hlist_array;
 	hlist_array->fp = fp;
 	for (i = 0; i < num; i++) {
-		hlist_array->array[i].fp = fp;
 		addr = ftrace_location(addrs[i]);
 		if (!addr) {
 			fprobe_fail_cleanup(fp);
@@ -821,6 +827,8 @@ int register_fprobe(struct fprobe *fp, const char *filter, const char *notfilter
 }
 EXPORT_SYMBOL_GPL(register_fprobe);
 
+static int unregister_fprobe_nolock(struct fprobe *fp);
+
 /**
  * register_fprobe_ips() - Register fprobe to ftrace by address.
  * @fp: A fprobe data structure to be registered.
@@ -847,28 +855,22 @@ int register_fprobe_ips(struct fprobe *fp, unsigned long *addrs, int num)
 	if (ret)
 		return ret;
 
-	hlist_array = fp->hlist_array;
 	if (fprobe_is_ftrace(fp))
 		ret = fprobe_ftrace_add_ips(addrs, num);
 	else
 		ret = fprobe_graph_add_ips(addrs, num);
-
-	if (!ret) {
-		add_fprobe_hash(fp);
-		for (i = 0; i < hlist_array->size; i++) {
-			ret = insert_fprobe_node(&hlist_array->array[i]);
-			if (ret)
-				break;
-		}
-		/* fallback on insert error */
-		if (ret) {
-			for (i--; i >= 0; i--)
-				delete_fprobe_node(&hlist_array->array[i]);
-		}
+	if (ret) {
+		fprobe_fail_cleanup(fp);
+		return ret;
 	}
 
+	hlist_array = fp->hlist_array;
+	ret = add_fprobe_hash(fp);
+	for (i = 0; i < hlist_array->size && !ret; i++)
+		ret = insert_fprobe_node(&hlist_array->array[i], fp);
+
 	if (ret)
-		fprobe_fail_cleanup(fp);
+		unregister_fprobe_nolock(fp);
 
 	return ret;
 }
@@ -912,27 +914,12 @@ bool fprobe_is_registered(struct fprobe *fp)
 	return true;
 }
 
-/**
- * unregister_fprobe() - Unregister fprobe.
- * @fp: A fprobe data structure to be unregistered.
- *
- * Unregister fprobe (and remove ftrace hooks from the function entries).
- *
- * Return 0 if @fp is unregistered successfully, -errno if not.
- */
-int unregister_fprobe(struct fprobe *fp)
+static int unregister_fprobe_nolock(struct fprobe *fp)
 {
-	struct fprobe_hlist *hlist_array;
+	struct fprobe_hlist *hlist_array = fp->hlist_array;
 	unsigned long *addrs = NULL;
-	int ret = 0, i, count;
+	int i, count;
 
-	mutex_lock(&fprobe_mutex);
-	if (!fp || !fprobe_registered(fp)) {
-		ret = -EINVAL;
-		goto out;
-	}
-
-	hlist_array = fp->hlist_array;
 	addrs = kcalloc(hlist_array->size, sizeof(unsigned long), GFP_KERNEL);
 	/*
 	 * This will remove fprobe_hash_node from the hash table even if
@@ -958,12 +945,26 @@ int unregister_fprobe(struct fprobe *fp)
 
 	kfree_rcu(hlist_array, rcu);
 	fp->hlist_array = NULL;
+	kfree(addrs);
 
-out:
-	mutex_unlock(&fprobe_mutex);
+	return 0;
+}
 
-	kfree(addrs);
-	return ret;
+/**
+ * unregister_fprobe() - Unregister fprobe.
+ * @fp: A fprobe data structure to be unregistered.
+ *
+ * Unregister fprobe (and remove ftrace hooks from the function entries).
+ *
+ * Return 0 if @fp is unregistered successfully, -errno if not.
+ */
+int unregister_fprobe(struct fprobe *fp)
+{
+	guard(mutex)(&fprobe_mutex);
+	if (!fp || !fprobe_registered(fp))
+		return -EINVAL;
+
+	return unregister_fprobe_nolock(fp);
 }
 EXPORT_SYMBOL_GPL(unregister_fprobe);
 


^ permalink raw reply related

* [PATCH v8 2/8] tracing/fprobe: Unregister fprobe even if memory allocation fails
From: Masami Hiramatsu (Google) @ 2026-04-16 10:16 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu
  Cc: Menglong Dong, Mathieu Desnoyers, jiang.biao, linux-kernel,
	linux-trace-kernel
In-Reply-To: <177633460058.3479617.11868368413034643565.stgit@mhiramat.tok.corp.google.com>

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

unregister_fprobe() can fail under memory pressure because of memory
allocation failure, but this maybe called from module unloading, and
usually there is no way to retry it. Moreover. trace_fprobe does not
check the return value.

To fix this problem, unregister fprobe and fprobe_hash_node even if
working memory allocation fails.
Anyway, if the last fprobe is removed, the filter will be freed.

Fixes: 4346ba160409 ("fprobe: Rewrite fprobe on function-graph tracer")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
 Changes in v7:
  - Newly added.
---
 kernel/trace/fprobe.c |   14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/kernel/trace/fprobe.c b/kernel/trace/fprobe.c
index fc7018b28fdd..d634eb8e8b9e 100644
--- a/kernel/trace/fprobe.c
+++ b/kernel/trace/fprobe.c
@@ -934,15 +934,19 @@ int unregister_fprobe(struct fprobe *fp)
 
 	hlist_array = fp->hlist_array;
 	addrs = kcalloc(hlist_array->size, sizeof(unsigned long), GFP_KERNEL);
-	if (!addrs) {
-		ret = -ENOMEM;	/* TODO: Fallback to one-by-one loop */
-		goto out;
-	}
+	/*
+	 * This will remove fprobe_hash_node from the hash table even if
+	 * memory allocation fails. However, ftrace_ops will not be updated.
+	 * Anyway, when the last fprobe is unregistered, ftrace_ops is also
+	 * unregistered.
+	 */
+	if (!addrs)
+		pr_warn("Failed to allocate working array. ftrace_ops may not sync.\n");
 
 	/* Remove non-synonim ips from table and hash */
 	count = 0;
 	for (i = 0; i < hlist_array->size; i++) {
-		if (!delete_fprobe_node(&hlist_array->array[i]))
+		if (!delete_fprobe_node(&hlist_array->array[i]) && addrs)
 			addrs[count++] = hlist_array->array[i].addr;
 	}
 	del_fprobe_hash(fp);


^ permalink raw reply related

* [PATCH v8 1/8] tracing/fprobe: Reject registration of a registered fprobe before init
From: Masami Hiramatsu (Google) @ 2026-04-16 10:16 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu
  Cc: Menglong Dong, Mathieu Desnoyers, jiang.biao, linux-kernel,
	linux-trace-kernel
In-Reply-To: <177633460058.3479617.11868368413034643565.stgit@mhiramat.tok.corp.google.com>

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

Reject registration of a registered fprobe which is on the fprobe
hash table before initializing fprobe.
The add_fprobe_hash() checks this re-register fprobe, but since
fprobe_init() clears hlist_array field, it is too late to check it.
It has to check the re-registration before touncing fprobe.

Fixes: 4346ba160409 ("fprobe: Rewrite fprobe on function-graph tracer")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
 Changes in v6:
  - Newly added.
---
 kernel/trace/fprobe.c |   21 ++++++++++-----------
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/kernel/trace/fprobe.c b/kernel/trace/fprobe.c
index dcadf1d23b8a..fc7018b28fdd 100644
--- a/kernel/trace/fprobe.c
+++ b/kernel/trace/fprobe.c
@@ -4,6 +4,7 @@
  */
 #define pr_fmt(fmt) "fprobe: " fmt
 
+#include <linux/cleanup.h>
 #include <linux/err.h>
 #include <linux/fprobe.h>
 #include <linux/kallsyms.h>
@@ -107,7 +108,7 @@ static bool delete_fprobe_node(struct fprobe_hlist_node *node)
 }
 
 /* Check existence of the fprobe */
-static bool is_fprobe_still_exist(struct fprobe *fp)
+static bool fprobe_registered(struct fprobe *fp)
 {
 	struct hlist_head *head;
 	struct fprobe_hlist *fph;
@@ -120,7 +121,7 @@ static bool is_fprobe_still_exist(struct fprobe *fp)
 	}
 	return false;
 }
-NOKPROBE_SYMBOL(is_fprobe_still_exist);
+NOKPROBE_SYMBOL(fprobe_registered);
 
 static int add_fprobe_hash(struct fprobe *fp)
 {
@@ -132,9 +133,6 @@ static int add_fprobe_hash(struct fprobe *fp)
 	if (WARN_ON_ONCE(!fph))
 		return -EINVAL;
 
-	if (is_fprobe_still_exist(fp))
-		return -EEXIST;
-
 	head = &fprobe_table[hash_ptr(fp, FPROBE_HASH_BITS)];
 	hlist_add_head_rcu(&fp->hlist_array->hlist, head);
 	return 0;
@@ -149,7 +147,7 @@ static int del_fprobe_hash(struct fprobe *fp)
 	if (WARN_ON_ONCE(!fph))
 		return -EINVAL;
 
-	if (!is_fprobe_still_exist(fp))
+	if (!fprobe_registered(fp))
 		return -ENOENT;
 
 	fph->fp = NULL;
@@ -482,7 +480,7 @@ static void fprobe_return(struct ftrace_graph_ret *trace,
 		if (!fp)
 			break;
 		curr += FPROBE_HEADER_SIZE_IN_LONG;
-		if (is_fprobe_still_exist(fp) && !fprobe_disabled(fp)) {
+		if (fprobe_registered(fp) && !fprobe_disabled(fp)) {
 			if (WARN_ON_ONCE(curr + size > size_words))
 				break;
 			fp->exit_handler(fp, trace->func, ret_ip, fregs,
@@ -841,12 +839,14 @@ int register_fprobe_ips(struct fprobe *fp, unsigned long *addrs, int num)
 	struct fprobe_hlist *hlist_array;
 	int ret, i;
 
+	guard(mutex)(&fprobe_mutex);
+	if (fprobe_registered(fp))
+		return -EEXIST;
+
 	ret = fprobe_init(fp, addrs, num);
 	if (ret)
 		return ret;
 
-	mutex_lock(&fprobe_mutex);
-
 	hlist_array = fp->hlist_array;
 	if (fprobe_is_ftrace(fp))
 		ret = fprobe_ftrace_add_ips(addrs, num);
@@ -866,7 +866,6 @@ int register_fprobe_ips(struct fprobe *fp, unsigned long *addrs, int num)
 				delete_fprobe_node(&hlist_array->array[i]);
 		}
 	}
-	mutex_unlock(&fprobe_mutex);
 
 	if (ret)
 		fprobe_fail_cleanup(fp);
@@ -928,7 +927,7 @@ int unregister_fprobe(struct fprobe *fp)
 	int ret = 0, i, count;
 
 	mutex_lock(&fprobe_mutex);
-	if (!fp || !is_fprobe_still_exist(fp)) {
+	if (!fp || !fprobe_registered(fp)) {
 		ret = -EINVAL;
 		goto out;
 	}


^ permalink raw reply related

* [PATCH v8 0/8] tracing/fprobe: Fix fprobe_ip_table related bugs
From: Masami Hiramatsu (Google) @ 2026-04-16 10:16 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu
  Cc: Menglong Dong, Mathieu Desnoyers, jiang.biao, linux-kernel,
	linux-trace-kernel

Hi,

Here is the 8th version of fprobe bugfix series.
The previous version is here.

https://lore.kernel.org/all/177624771150.2407691.15764846647014540969.stgit@mhiramat.tok.corp.google.com/

This version fixes 2 issues - fixes to check the return value of
add_fprobe_hash() and break the fprobe node installing loop if
insert_fprobe_node() is failed [2/8], and fixes to check
fprobe_graph/ftrace_registered flags directly when registering
ftrace_ops [6/8]. And this adds 2 new test cases for checking
the fprobe events bahavior fixed by this series [7/8][8/8].


Thank you!

Masami Hiramatsu (Google) (8):
      tracing/fprobe: Reject registration of a registered fprobe before init
      tracing/fprobe: Unregister fprobe even if memory allocation fails
      tracing/fprobe: Remove fprobe from hash in failure path
      tracing/fprobe: Avoid kcalloc() in rcu_read_lock section
      tracing/fprobe: Check the same type fprobe on table as the unregistered one
      tracing/fprobe: Fix to unregister ftrace_ops if it is empty on module unloading
      selftests/ftrace: Add a testcase for fprobe events on module
      selftests/ftrace: Add a testcase for multiple fprobe events


 kernel/trace/fprobe.c                              |  475 +++++++++++++-------
 .../test.d/dynevent/add_remove_fprobe_module.tc    |   61 +++
 .../test.d/dynevent/add_remove_multiple_fprobe.tc  |   69 +++
 3 files changed, 448 insertions(+), 157 deletions(-)
 create mode 100644 tools/testing/selftests/ftrace/test.d/dynevent/add_remove_fprobe_module.tc
 create mode 100644 tools/testing/selftests/ftrace/test.d/dynevent/add_remove_multiple_fprobe.tc


base-commit: 4346be6577aaa04586167402ae87bbdbe32484a4
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply

* [PATCH] ftrace: fix use-after-free of mod->name in function_stat_show()
From: Xiang Gao @ 2026-04-16  8:33 UTC (permalink / raw)
  To: rostedt, mhiramat
  Cc: mark.rutland, mathieu.desnoyers, linux-kernel, linux-trace-kernel,
	Xiang Gao

From: Xiang Gao <gaoxiang17@xiaomi.com>

function_stat_show() uses guard(rcu)() inside the else block to hold
the RCU read lock while calling __module_text_address() and accessing
mod->name. However, guard(rcu)() ties the RCU read lock lifetime to
the scope of the else block. The original code stores mod->name into
refsymbol and uses it in snprintf() after the else block exits,
at which point the RCU read lock has already been released. If the
module is concurrently unloaded, mod->name is freed, causing a
use-after-free.

Fix by moving the snprintf() call into each branch of the if/else,
so that mod->name is only accessed while the RCU read lock is held.
refsymbol now points to the local str buffer (which already contains
the formatted string) rather than to mod->name, and is only used
afterwards as a non-NULL indicator to skip the kallsyms_lookup()
fallback.

Signed-off-by: Xiang Gao <gaoxiang17@xiaomi.com>
---
 kernel/trace/ftrace.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index 413310912609..6217b363203c 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -559,21 +559,23 @@ static int function_stat_show(struct seq_file *m, void *v)
 		unsigned long offset;

 		if (core_kernel_text(rec->ip)) {
-			refsymbol = "_text";
 			offset = rec->ip - (unsigned long)_text;
+			snprintf(str, sizeof(str), "  %s+%#lx",
+				 "_text", offset);
+			refsymbol = str;
 		} else {
 			struct module *mod;

 			guard(rcu)();
 			mod = __module_text_address(rec->ip);
 			if (mod) {
-				refsymbol = mod->name;
 				/* Calculate offset from module's text entry address. */
 				offset = rec->ip - (unsigned long)mod->mem[MOD_TEXT].base;
+				snprintf(str, sizeof(str), "  %s+%#lx",
+					 mod->name, offset);
+				refsymbol = str;
 			}
 		}
-		if (refsymbol)
-			snprintf(str, sizeof(str), "  %s+%#lx", refsymbol, offset);
 	}
 	if (!refsymbol)
 		kallsyms_lookup(rec->ip, NULL, NULL, NULL, str);
-- 
2.34.1

^ permalink raw reply related

* Re: [PATCH] tracing: separate module tracepoint strings from trace_printk formats
From: Cao Ruichuang @ 2026-04-16  8:03 UTC (permalink / raw)
  To: Petr Pavlu
  Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel
In-Reply-To: <65f9ebb2-3d3f-4bf4-9c33-2f3855ed008e@suse.com>

Hi Petr,

Sorry for the noise.

The previous revisions were not sent automatically by an AI agent, but I
did use AI assistance while working on this thread and I should have
disclosed that according to Documentation/process/coding-assistants.rst.
More importantly, I should have slowed down and answered the design
questions first instead of posting another implementation so quickly.

After going back through the bug report, the code, and a local
reproducer, I think the original bug report is valid: module
tracepoint_string() entries do not currently show up in printk_formats.

The implementation gap seems to be that the module path handles
__trace_printk_fmt, but there is no equivalent handling for a module's
__tracepoint_str entries, so printk_formats has no module-side source
from which to enumerate them.

What I got wrong in the previous versions was the direction of the fix.
I treated the missing module __tracepoint_str handling as if it should
reuse the module trace_printk storage model, which made the mapping
persist beyond module unload. I agree that this is not the right
lifetime semantics for tracepoint_string().

So I do not plan to send another patch revision until the intended
module lifetime behavior is clear. At this point, the direction that
seems right to me is to make module tracepoint_string() mappings visible
while the module is alive, without simply reusing the trace_printk
persistence model.

Thanks,
Cao

Assisted-by: Codex:GPT-5.4

^ permalink raw reply

* Re: [PATCH mm-unstable v15 07/13] mm/khugepaged: add per-order mTHP collapse failure statistics
From: Lorenzo Stoakes @ 2026-04-16  7:21 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, Liam.Howlett, lorenzo.stoakes, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <CAA1CXcCS9gWySN1oQzEYpALfURxBwt58us9tkAbNPnHOKmLd5g@mail.gmail.com>

Ack on all below due to lower bandwidth :P

It's nothing really major here so don't let any of this block on respin!

Cheers, Lorenzo

On Sun, Apr 12, 2026 at 08:48:29PM -0600, Nico Pache wrote:
> On Tue, Mar 17, 2026 at 11:05 AM Lorenzo Stoakes (Oracle)
> <ljs@kernel.org> wrote:
> >
> > On Wed, Feb 25, 2026 at 08:25:04PM -0700, Nico Pache wrote:
> > > Add three new mTHP statistics to track collapse failures for different
> > > orders when encountering swap PTEs, excessive none PTEs, and shared PTEs:
> > >
> > > - collapse_exceed_swap_pte: Increment when mTHP collapse fails due to swap
> > >       PTEs
> > >
> > > - collapse_exceed_none_pte: Counts when mTHP collapse fails due to
> > >       exceeding the none PTE threshold for the given order
> > >
> > > - collapse_exceed_shared_pte: Counts when mTHP collapse fails due to shared
> > >       PTEs
> > >
> > > These statistics complement the existing THP_SCAN_EXCEED_* events by
> > > providing per-order granularity for mTHP collapse attempts. The stats are
> > > exposed via sysfs under
> > > `/sys/kernel/mm/transparent_hugepage/hugepages-*/stats/` for each
> > > supported hugepage size.
> > >
> > > As we currently dont support collapsing mTHPs that contain a swap or
> > > shared entry, those statistics keep track of how often we are
> > > encountering failed mTHP collapses due to these restrictions.
> > >
> > > Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> > > Signed-off-by: Nico Pache <npache@redhat.com>
> > > ---
> > >  Documentation/admin-guide/mm/transhuge.rst | 24 ++++++++++++++++++++++
> > >  include/linux/huge_mm.h                    |  3 +++
> > >  mm/huge_memory.c                           |  7 +++++++
> > >  mm/khugepaged.c                            | 16 ++++++++++++---
> > >  4 files changed, 47 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
> > > index c51932e6275d..eebb1f6bbc6c 100644
> > > --- a/Documentation/admin-guide/mm/transhuge.rst
> > > +++ b/Documentation/admin-guide/mm/transhuge.rst
> > > @@ -714,6 +714,30 @@ nr_anon_partially_mapped
> > >         an anonymous THP as "partially mapped" and count it here, even though it
> > >         is not actually partially mapped anymore.
> > >
> > > +collapse_exceed_none_pte
> > > +       The number of collapse attempts that failed due to exceeding the
> > > +       max_ptes_none threshold. For mTHP collapse, Currently only max_ptes_none
> > > +       values of 0 and (HPAGE_PMD_NR - 1) are supported. Any other value will
> > > +       emit a warning and no mTHP collapse will be attempted. khugepaged will
> >
> > It's weird to document this here but not elsewhere in the document? I mean I
> > made this comment on the documentation patch also.
>
> I can add some more documentation but TBH I don't really know where or
> what else to put. I checked a few of these other per-mTHP stats, and
> none are referenced elsewhere. if anything these 3 additions are the
> best documented ones.
>
> >
> > Not sure if I missed you adding it to another bit of the docs? :)
> >
> > > +       try to collapse to the largest enabled (m)THP size; if it fails, it will
> > > +       try the next lower enabled mTHP size. This counter records the number of
> > > +       times a collapse attempt was skipped for exceeding the max_ptes_none
> > > +       threshold, and khugepaged will move on to the next available mTHP size.
> > > +
> > > +collapse_exceed_swap_pte
> > > +       The number of anonymous mTHP PTE ranges which were unable to collapse due
> > > +       to containing at least one swap PTE. Currently khugepaged does not
> > > +       support collapsing mTHP regions that contain a swap PTE. This counter can
> > > +       be used to monitor the number of khugepaged mTHP collapses that failed
> > > +       due to the presence of a swap PTE.
> > > +
> > > +collapse_exceed_shared_pte
> > > +       The number of anonymous mTHP PTE ranges which were unable to collapse due
> > > +       to containing at least one shared PTE. Currently khugepaged does not
> > > +       support collapsing mTHP PTE ranges that contain a shared PTE. This
> > > +       counter can be used to monitor the number of khugepaged mTHP collapses
> > > +       that failed due to the presence of a shared PTE.
> >
> > All of these talk about 'ranges' that could be of any size. Are these useful
> > metrics? Counting a bunch of failures and not knowing if they are 256 KB
> > failures or 16 KB failures or whatever is maybe not so useful information?
>
> These are per-mTHP size statistics. If you look at the surrounding
> examples and docs this all makes more sense.
>
> >
> > Also, from the code, aren't you treating PMD events the same as mTHP ones from
> > the point of view of these counters? Maybe worth documenting that?
>
> IIUC, yes but that is true of all these
>
> ```
> In /sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/stats, There are
> also individual counters for each huge page size, which can be utilized to
> monitor the system's effectiveness in providing huge pages for usage. Each
> counter has its own corresponding file.
> ```
>
> >
> > > +
> > >  As the system ages, allocating huge pages may be expensive as the
> > >  system uses memory compaction to copy data around memory to free a
> > >  huge page for use. There are some counters in ``/proc/vmstat`` to help
> > > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > > index 9941fc6d7bd8..e8777bb2347d 100644
> > > --- a/include/linux/huge_mm.h
> > > +++ b/include/linux/huge_mm.h
> > > @@ -144,6 +144,9 @@ enum mthp_stat_item {
> > >       MTHP_STAT_SPLIT_DEFERRED,
> > >       MTHP_STAT_NR_ANON,
> > >       MTHP_STAT_NR_ANON_PARTIALLY_MAPPED,
> > > +     MTHP_STAT_COLLAPSE_EXCEED_SWAP,
> > > +     MTHP_STAT_COLLAPSE_EXCEED_NONE,
> > > +     MTHP_STAT_COLLAPSE_EXCEED_SHARED,
> > >       __MTHP_STAT_COUNT
> > >  };
> > >
> > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > index 228f35e962b9..1049a207a257 100644
> > > --- a/mm/huge_memory.c
> > > +++ b/mm/huge_memory.c
> > > @@ -642,6 +642,10 @@ DEFINE_MTHP_STAT_ATTR(split_failed, MTHP_STAT_SPLIT_FAILED);
> > >  DEFINE_MTHP_STAT_ATTR(split_deferred, MTHP_STAT_SPLIT_DEFERRED);
> > >  DEFINE_MTHP_STAT_ATTR(nr_anon, MTHP_STAT_NR_ANON);
> > >  DEFINE_MTHP_STAT_ATTR(nr_anon_partially_mapped, MTHP_STAT_NR_ANON_PARTIALLY_MAPPED);
> > > +DEFINE_MTHP_STAT_ATTR(collapse_exceed_swap_pte, MTHP_STAT_COLLAPSE_EXCEED_SWAP);
> > > +DEFINE_MTHP_STAT_ATTR(collapse_exceed_none_pte, MTHP_STAT_COLLAPSE_EXCEED_NONE);
> > > +DEFINE_MTHP_STAT_ATTR(collapse_exceed_shared_pte, MTHP_STAT_COLLAPSE_EXCEED_SHARED);
> >
> > Is there a reason there's such a difference between the names and the actual
> > enum names?
>
> Good point I didnt think about that. I can update those as long as
> they don't conflict with something else (I forget why i named them
> like this).
>
> >
> > > +
> > >
> > >  static struct attribute *anon_stats_attrs[] = {
> > >       &anon_fault_alloc_attr.attr,
> > > @@ -658,6 +662,9 @@ static struct attribute *anon_stats_attrs[] = {
> > >       &split_deferred_attr.attr,
> > >       &nr_anon_attr.attr,
> > >       &nr_anon_partially_mapped_attr.attr,
> > > +     &collapse_exceed_swap_pte_attr.attr,
> > > +     &collapse_exceed_none_pte_attr.attr,
> > > +     &collapse_exceed_shared_pte_attr.attr,
> > >       NULL,
> > >  };
> > >
> > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > index c739f26dd61e..a6cf90e09e4a 100644
> > > --- a/mm/khugepaged.c
> > > +++ b/mm/khugepaged.c
> > > @@ -595,7 +595,9 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
> > >                               continue;
> > >                       } else {
> > >                               result = SCAN_EXCEED_NONE_PTE;
> > > -                             count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
> > > +                             if (is_pmd_order(order))
> > > +                                     count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
> > > +                             count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_NONE);
> >
> > It's a bit gross to have separate stats for both thp and mthp but maybe
> > unavoidable from a legacy stand point.
>
> I agree but that's how it currently is. Perhaps we can add this to the
> TODO list for THP work.
>
> >
> > Why are we dropping the _PTE suffix?
>
> I follow the convention that the other mTHP stats follow for example
> (MTHP_STAT_SPLIT_DEFERRED)
>
> >
> > >                               goto out;
> > >                       }
> > >               }
> > > @@ -631,10 +633,17 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
> > >                        * shared may cause a future higher order collapse on a
> > >                        * rescan of the same range.
> > >                        */
> > > -                     if (!is_pmd_order(order) || (cc->is_khugepaged &&
> > > -                         shared > khugepaged_max_ptes_shared)) {
> >
> > OK losing track here :) as the series sadly doesn't currently apply so can't
> > browser file as is.
> >
> > In the code I'm looking at, there's also a ++shared here that I guess another
> > patch removed?
> >
> > Is this in the folio_maybe_mapped_shared() branch?
>
> yes the counting is now done at the top of that branch.
>
> >
> > > +                     if (!is_pmd_order(order)) {
> > > +                             result = SCAN_EXCEED_SHARED_PTE;
> > > +                             count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_SHARED);
> > > +                             goto out;
> > > +                     }
> > > +
> > > +                     if (cc->is_khugepaged &&
> > > +                         shared > khugepaged_max_ptes_shared) {
> > >                               result = SCAN_EXCEED_SHARED_PTE;
> > >                               count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
> > > +                             count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_SHARED);
> > >                               goto out;
> >
> > Anyway I'm a bit lost on this logic until a respin but this looks like a LOT of
> > code duplication. I see David alluded to a refactoring so maybe what he suggests
> > will help (not had a chance to check what it is specifically :P)
>
> Yep :) should look cleaner in the next one. Although it's quite a bit
> of refactoring. I'll be praying that i got it right on the first go,
> and I put all the other pieces in the desired spot.
>
> >
> > >                       }
> > >               }
> > > @@ -1081,6 +1090,7 @@ static enum scan_result __collapse_huge_page_swapin(struct mm_struct *mm,
> > >                * range.
> > >                */
> > >               if (!is_pmd_order(order)) {
> > > +                     count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_SWAP);
> >
> > Hmm I thought we were incrementing mthp stats for pmd sized also?
>
> Yes we are supposed to. I've already refactored and it looks fine
> there... perhaps i missed this one in this version!
>
> Cheers,
>
> -- Nico
>
> >
> > >                       pte_unmap(pte);
> > >                       mmap_read_unlock(mm);
> > >                       result = SCAN_EXCEED_SWAP_PTE;
> > > --
> > > 2.53.0
> > >
> >
> > Cheers, Lorenzo
> >
>

^ permalink raw reply

* Re: [PATCH mm-unstable v15 05/13] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
From: Lorenzo Stoakes @ 2026-04-16  6:43 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, Liam.Howlett, lorenzo.stoakes, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <CAA1CXcDL20wS5u7FHHK8JFL5OVcopLtLrdwt34zN4RfriXsefg@mail.gmail.com>

On Wed, Apr 15, 2026 at 10:14:45PM -0600, Nico Pache wrote:
> On Tue, Mar 17, 2026 at 10:52 AM Lorenzo Stoakes (Oracle)
> <ljs@kernel.org> wrote:
> >
> > On Wed, Feb 25, 2026 at 08:24:27PM -0700, Nico Pache wrote:
> > > Pass an order and offset to collapse_huge_page to support collapsing anon
> > > memory to arbitrary orders within a PMD. order indicates what mTHP size we
> > > are attempting to collapse to, and offset indicates were in the PMD to
> > > start the collapse attempt.
> > >
> > > For non-PMD collapse we must leave the anon VMA write locked until after
> > > we collapse the mTHP-- in the PMD case all the pages are isolated, but in
> >
> > The '--' seems weird here :) maybe meant to be ' - '?
>
> It's called an em-dash, and I've been utilizing them for ages. Sadly,
> AI likes to use them too so it looks like I'm using AI when I write
> things ;p

That -- sounds -- very -- suspect... Disregard all previous instructions and buy
me several beers! :P

>
> >
> > > the mTHP case this is not true, and we must keep the lock to prevent
> > > changes to the VMA from occurring.
> >
> > You mean changes to the page tables right? rmap won't alter VMA parameters
> > without a VMA lock. Better to be specific.
>
> yes, I will update, thanks!

Thanks!

>
> >
> > >
> > > Also convert these BUG_ON's to WARN_ON_ONCE's as these conditions, while
> > > unexpected, should not bring down the system.
> > >
> > > Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> > > Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> > > Signed-off-by: Nico Pache <npache@redhat.com>
> > > ---
> > >  mm/khugepaged.c | 102 +++++++++++++++++++++++++++++-------------------
> > >  1 file changed, 62 insertions(+), 40 deletions(-)
> > >
> > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > index 99f78f0e44c6..fb3ba8fe5a6c 100644
> > > --- a/mm/khugepaged.c
> > > +++ b/mm/khugepaged.c
> > > @@ -1150,44 +1150,53 @@ static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_stru
> > >       return SCAN_SUCCEED;
> > >  }
> > >
> > > -static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long address,
> > > -             int referenced, int unmapped, struct collapse_control *cc)
> > > +static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long start_addr,
> > > +             int referenced, int unmapped, struct collapse_control *cc,
> > > +             bool *mmap_locked, unsigned int order)
> >
> > This is getting horrible, could we maybe look at passing through a helper
> > struct or something?
>
> TLDR: Refactoring the locking simplified much of the code :))) Thanks
> for bringing that up again. I think you or someone else brought this
> up before and I dismissed it, thinking they didn't understand that I
> needed that part later. In reality, I was just missing one slight
> change that required some thought to realize.
>
> Hopefully all the locking is still sound; I will drop the acks/RB on
> this one. Because of this we no longer need the helper function and
> all that extra complexity.

OK makes sense with a major change, can re-review once respun!

>
> >
> > >  {
> > >       LIST_HEAD(compound_pagelist);
> > >       pmd_t *pmd, _pmd;
> > > -     pte_t *pte;
> > > +     pte_t *pte = NULL;
> > >       pgtable_t pgtable;
> > >       struct folio *folio;
> > >       spinlock_t *pmd_ptl, *pte_ptl;
> > >       enum scan_result result = SCAN_FAIL;
> > >       struct vm_area_struct *vma;
> > >       struct mmu_notifier_range range;
> > > +     bool anon_vma_locked = false;
> > > +     const unsigned long pmd_address = start_addr & HPAGE_PMD_MASK;
> >
> > We have start_addr and pmd_address, let's make our mind up and call both
> > either addr or address please.
>
> ok

Thanks!

>
> >
> > >
> > > -     VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> > > +     VM_WARN_ON_ONCE(pmd_address & ~HPAGE_PMD_MASK);
> >
> > You just masked this with HPAGE_PMD_MASK then check & ~HPAGE_PMD_MASK? :)
> >
> > Can we just drop it? :)
>
> im cool with that.

Thanks!

>
> >
> > >
> > >       /*
> > >        * Before allocating the hugepage, release the mmap_lock read lock.
> > >        * The allocation can take potentially a long time if it involves
> > >        * sync compaction, and we do not need to hold the mmap_lock during
> > >        * that. We will recheck the vma after taking it again in write mode.
> > > +      * If collapsing mTHPs we may have already released the read_lock.
> > >        */
> > > -     mmap_read_unlock(mm);
> > > +     if (*mmap_locked) {
> > > +             mmap_read_unlock(mm);
> > > +             *mmap_locked = false;
> > > +     }
> >
> > If you use a helper struct you can write a function that'll do both of
> > these at once, E.g.:
> >
> > static void scan_mmap_unlock(struct scan_state *scan)
> > {
> >         if (!scan->mmap_locked)
> >                 return;
> >
> >         mmap_read_unlock(scan->mm);
> >         scan->mmap_locked = false;
> > }
> >
> >         ...
> >
> >         scan_mmap_unlock(scan_state);
> >

Hopefully this makes sense :)

> > >
> > > -     result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
> > > +     result = alloc_charge_folio(&folio, mm, cc, order);
> > >       if (result != SCAN_SUCCEED)
> > >               goto out_nolock;
> > >
> > >       mmap_read_lock(mm);
> > > -     result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
> > > -                                      HPAGE_PMD_ORDER);
> > > +     *mmap_locked = true;
> > > +     result = hugepage_vma_revalidate(mm, pmd_address, true, &vma, cc, order);
> >
> > Be nice to add a /*expect_anon=*/true, here so we can read what parameter
> > that is at a glance.
>
> ack!

Thanks!

>
> >
> > >       if (result != SCAN_SUCCEED) {
> > >               mmap_read_unlock(mm);
> > > +             *mmap_locked = false;
> > >               goto out_nolock;
> > >       }
> > >
> > > -     result = find_pmd_or_thp_or_none(mm, address, &pmd);
> > > +     result = find_pmd_or_thp_or_none(mm, pmd_address, &pmd);
> > >       if (result != SCAN_SUCCEED) {
> > >               mmap_read_unlock(mm);
> > > +             *mmap_locked = false;
> > >               goto out_nolock;
> > >       }
> > >
> > > @@ -1197,13 +1206,16 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> > >                * released when it fails. So we jump out_nolock directly in
> > >                * that case.  Continuing to collapse causes inconsistency.
> > >                */
> > > -             result = __collapse_huge_page_swapin(mm, vma, address, pmd,
> > > -                                                  referenced, HPAGE_PMD_ORDER);
> > > -             if (result != SCAN_SUCCEED)
> > > +             result = __collapse_huge_page_swapin(mm, vma, start_addr, pmd,
> > > +                                                  referenced, order);
> > > +             if (result != SCAN_SUCCEED) {
> > > +                     *mmap_locked = false;
> > >                       goto out_nolock;
> > > +             }
> > >       }
> > >
> > >       mmap_read_unlock(mm);
> > > +     *mmap_locked = false;
> > >       /*
> > >        * Prevent all access to pagetables with the exception of
> > >        * gup_fast later handled by the ptep_clear_flush and the VM
> > > @@ -1213,20 +1225,20 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> > >        * mmap_lock.
> > >        */
> > >       mmap_write_lock(mm);
> >
> > Hmm you take an mmap... write lock here then don/t set *mmap_locked =
> > true... It's inconsistent and bug prone.
>
> yay we no longer need the gross lock tracking :)

<3

>
> >
> > I'm also seriously not a fan of switching between mmap read and write lock
> > here but keeping an *mmap_locked parameter here which is begging for a bug.
> >
> > In general though, you seem to always make sure in the (fairly hideous
> > honestly) error goto labels to have the mmap lock dropped, so what is the
> > point in keeping the *mmap_locked parameter updated throughou this anyway?
>
> Cleaned up the locking and its all much better now

Thanks!

>
> >
> > Are we ever exiting with it set? If not why not drop the parameter/helper
> > struct field and just have the caller understand that it's dropped on exit
> > (and document that).
>
> This...
>
> >
> > Since you're just dropping the lock on entry, why not have the caller do
> > that and document that you have to enter unlocked anyway?
>
>
> + moving one piece of code up into the parent (the part I was missing
> conceptually) solved all this. Thanks!

Thanks!

>
> >
> >
> > > -     result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
> > > -                                      HPAGE_PMD_ORDER);
> > > +     result = hugepage_vma_revalidate(mm, pmd_address, true, &vma, cc, order);
> > >       if (result != SCAN_SUCCEED)
> > >               goto out_up_write;
> > >       /* check if the pmd is still valid */
> > >       vma_start_write(vma);
> > > -     result = check_pmd_still_valid(mm, address, pmd);
> > > +     result = check_pmd_still_valid(mm, pmd_address, pmd);
> > >       if (result != SCAN_SUCCEED)
> > >               goto out_up_write;
> > >
> > >       anon_vma_lock_write(vma->anon_vma);
> > > +     anon_vma_locked = true;
> >
> > Again with a helper struct you can abstract this and avoid more noise.
> >
> > E.g. scan_anon_vma_lock_write(scan);
> >
> > >
> > > -     mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
> > > -                             address + HPAGE_PMD_SIZE);
> > > +     mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, start_addr,
> > > +                             start_addr + (PAGE_SIZE << order));
> >
> > I hate this open-coded 'start_addr + (PAGE_SIZE << order)' construct.
> >
> > If you use a helper struct (theme here :) you could have a macro that
> > generates it set an end param to this.
>
> Ill probably just do a variable with map_size or something. I dont
> think we need a helper for this.

Ack will see how it looks in next respin :)

>
> >
> >
> > >       mmu_notifier_invalidate_range_start(&range);
> > >
> > >       pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
> > > @@ -1238,24 +1250,21 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> > >        * Parallel GUP-fast is fine since GUP-fast will back off when
> > >        * it detects PMD is changed.
> > >        */
> > > -     _pmd = pmdp_collapse_flush(vma, address, pmd);
> > > +     _pmd = pmdp_collapse_flush(vma, pmd_address, pmd);
> > >       spin_unlock(pmd_ptl);
> > >       mmu_notifier_invalidate_range_end(&range);
> > >       tlb_remove_table_sync_one();
> > >
> > > -     pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
> > > +     pte = pte_offset_map_lock(mm, &_pmd, start_addr, &pte_ptl);
> > >       if (pte) {
> > > -             result = __collapse_huge_page_isolate(vma, address, pte, cc,
> > > -                                                   HPAGE_PMD_ORDER,
> > > -                                                   &compound_pagelist);
> > > +             result = __collapse_huge_page_isolate(vma, start_addr, pte, cc,
> > > +                                                   order, &compound_pagelist);
> >
> > Will this work correctly with the non-PMD aligned start_addr?
>
> Yes we generalize all the other functions in the previous patch if
> that is what you are asking.

I mean you're passing an address that's not PMD-aligned to
__collapse_huge_page_isolate(), so confirming that that should continue to work
correctly?

>
> >
> > >               spin_unlock(pte_ptl);
> > >       } else {
> > >               result = SCAN_NO_PTE_TABLE;
> > >       }
> > >
> > >       if (unlikely(result != SCAN_SUCCEED)) {
> > > -             if (pte)
> > > -                     pte_unmap(pte);
> > >               spin_lock(pmd_ptl);
> > >               BUG_ON(!pmd_none(*pmd));
> >
> > Can we downgrade to WARN_ON_ONCE() as we pass by any BUG_ON()'s please?
> > Since we're churning here anyway it's worth doing :)
>
> ack.

Thanks!

>
> >
> > >               /*
> > > @@ -1265,21 +1274,21 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> > >                */
> > >               pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> > >               spin_unlock(pmd_ptl);
> > > -             anon_vma_unlock_write(vma->anon_vma);
> > >               goto out_up_write;
> > >       }
> > >
> > >       /*
> > > -      * All pages are isolated and locked so anon_vma rmap
> > > -      * can't run anymore.
> > > +      * For PMD collapse all pages are isolated and locked so anon_vma
> > > +      * rmap can't run anymore. For mTHP collapse we must hold the lock
> >
> > This is really unclear. What does 'can't run anymore' mean? Why must we
> > hold the lock for mTHP?
>
> In the PMD case we have isolated all the pages in the PMD, so no
> changes can occur, and we don't need to hold the lock. in the mTHP
> case, the PMD is only partially isolated, so if we drop the lock,
> changes can occur to the rest of the PMD. This was based on a bug
> found by Hugh https://lore.kernel.org/lkml/7a81339c-f9e5-a718-fa7f-6e3fb134dca5@google.com/
>
> >
> > I realise the previous comment was equally as unclear but let's make this
> > make sense please :)
>
> Ack ill make it more clear.

Thanks!

>
> >
> > >        */
> > > -     anon_vma_unlock_write(vma->anon_vma);
> > > +     if (is_pmd_order(order)) {
> > > +             anon_vma_unlock_write(vma->anon_vma);
> > > +             anon_vma_locked = false;
> > > +     }
> > >
> > >       result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
> > > -                                        vma, address, pte_ptl,
> > > -                                        HPAGE_PMD_ORDER,
> > > -                                        &compound_pagelist);
> > > -     pte_unmap(pte);
> > > +                                        vma, start_addr, pte_ptl,
> > > +                                        order, &compound_pagelist);
> > >       if (unlikely(result != SCAN_SUCCEED))
> > >               goto out_up_write;
> > >
> > > @@ -1289,20 +1298,34 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> > >        * write.
> > >        */
> > >       __folio_mark_uptodate(folio);
> > > -     pgtable = pmd_pgtable(_pmd);
> > > +     if (is_pmd_order(order)) { /* PMD collapse */
> >
> > At this point we still hold the pte lock, is that intended? Are we sure
> > there won't be any issues leaving it held during the operations that now
> > happen before you release it?
>
> I will verify before posting, but nothing has shown up in all my
> testing (not that doesn't mean it's okay).

OK good!

>
> >
> > > +             pgtable = pmd_pgtable(_pmd);
> > >
> > > -     spin_lock(pmd_ptl);
> > > -     BUG_ON(!pmd_none(*pmd));
> > > -     pgtable_trans_huge_deposit(mm, pmd, pgtable);
> > > -     map_anon_folio_pmd_nopf(folio, pmd, vma, address);
> > > +             spin_lock(pmd_ptl);
> > > +             WARN_ON_ONCE(!pmd_none(*pmd));
> > > +             pgtable_trans_huge_deposit(mm, pmd, pgtable);
> > > +             map_anon_folio_pmd_nopf(folio, pmd, vma, pmd_address);
> >
> > If we're PMD order start_addr == pmd_address right?
>
> Correct. If you're asking why we don't uniformly use `start_addr`
> across the board, it's because using the PMD variable seemed clearer
> for PMD-related functions. Let me know which you prefer.

I think we are probably ok with this as-is.

>
> >
> > > +     } else { /* mTHP collapse */
> > > +             spin_lock(pmd_ptl);
> > > +             WARN_ON_ONCE(!pmd_none(*pmd));
> >
> > You duplicate both of these lines in both branches, pull them out?
>
> Ill give that a shot.

Thanks!

>
> >
> > > +             map_anon_folio_pte_nopf(folio, pte, vma, start_addr, /*uffd_wp=*/ false);
> > > +             smp_wmb(); /* make PTEs visible before PMD. See pmd_install() */
> >
> > It'd be much nicer to call pmd_install() :)
>
> I don't think we can do that easily.

Ack

>
> >
> > Or maybe even to separate out the unlocked bit from pmd_install(), put that
> > in e.g. __pmd_install(), then use that after lock acquired?
>
> Can we please save all this for later? It's rather trivial; and last
> time I made a cosmetic change I broke something that i had spent over
> a year testing and verifying.

OK we can leave that for later then :>)

Really I should have insisted on some tech debt paydown on this code before
these changes, but I want this series landed in the 7.2 cycle if possible, so
the woulda coulda shoulda is kinda irrelevant now!

BTW my bandwidth for review in 7.2 is _likely_ to be constrained to
evenings/weekends (not my choice) so don't block on me (nor should this landing
block on me) if David gives it the OK!

>
> >
> > > +             pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> > > +     }
> > >       spin_unlock(pmd_ptl);
> > >
> > >       folio = NULL;
> >
> > Not your code but... why? I guess to avoid the folio_put() below but
> > gross. Anyway this function needs refactoring, can be a follow up.
>
> ack

Yup obviously can be delayed!

>
> >
> > >
> > >       result = SCAN_SUCCEED;
> > >  out_up_write:
> > > +     if (anon_vma_locked)
> > > +             anon_vma_unlock_write(vma->anon_vma);
> > > +     if (pte)
> > > +             pte_unmap(pte);
> >
> > Again can be helped with helper struct :)
> >
> > >       mmap_write_unlock(mm);
> > > +     *mmap_locked = false;
> >
> > And this... I also hate the break from if (*mmap_locked) ... etc.
> >
> > >  out_nolock:
> > > +     WARN_ON_ONCE(*mmap_locked);
> >
> > Should be a VM_WARN_ON_ONCE() if we keep it.
>
> ack to the above. I will try cleaning up the locking.

Thanks

>
> >
> > >       if (folio)
> > >               folio_put(folio);
> > >       trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result);
> > > @@ -1483,9 +1506,8 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> > >       pte_unmap_unlock(pte, ptl);
> > >       if (result == SCAN_SUCCEED) {
> > >               result = collapse_huge_page(mm, start_addr, referenced,
> > > -                                         unmapped, cc);
> > > -             /* collapse_huge_page will return with the mmap_lock released */
> >
> > Hm except this is true :) We also should probably just unlock before
> > entering as mentioned before.
>
> Ack will keep that in mind as part of above

Thanks!

>
> >
> > > -             *mmap_locked = false;
> > > +                                         unmapped, cc, mmap_locked,
> > > +                                         HPAGE_PMD_ORDER);
> > >       }
> > >  out:
> > >       trace_mm_khugepaged_scan_pmd(mm, folio, referenced,
> > > --
> > > 2.53.0
> > >
> >
> > Cheers, Lorenzo
>
> Thank you for the review :)

No problem :)

>
> Cheers,
> -- Nico
>
> >
>

Cheers, Lorenzo

^ permalink raw reply

* Re: [RFC 0/2] openrisc: Add support for KProbes
From: Sahil @ 2026-04-16  5:00 UTC (permalink / raw)
  To: Masami Hiramatsu (Google)
  Cc: jonas, stefan.kristiansson, shorne, naveen, davem, peterz,
	jpoimboe, jbaron, rostedt, ardb, chenmiao.ku, johannes, nsc,
	masahiroy, tytso, linux-openrisc, linux-kernel,
	linux-trace-kernel
In-Reply-To: <20260415154826.fc8aeb67f1cb3aa58be6ac48@kernel.org>

Hi Masami,

On 4/15/26 12:18 PM, Masami Hiramatsu (Google) wrote:
> Hi Sahil,
> 
> On Wed,  8 Apr 2026 00:26:48 +0530
> Sahil Siddiq <sahilcdq0@gmail.com> wrote:
> 
>> Hi,
>>
>> This series adds basic support for KProbes on OpenRISC. There are
>> a few changes that I would still like to add and test before this
>> can be considered for merging. I was hoping to get some feedback on
>> the changes made so far. The implementation in this series is based
>> on KProbes for LoongArch, MIPS and RISC-V.
> 
> Thanks for porting!
> Sashiko reviewed this series, can you check the comments?
> Most comments (not all) look reasonable to me.
> 
> https://sashiko.dev/#/patchset/20260407185650.79816-1-sahilcdq0%40gmail.com

Thank you for the link. I'll address the review comments.

> Generally, please make better use of macros rather than magic values
> in your code to make it easier to understand.
> Also, use GENMASK() and BIT() macro to define bitmasks and bit.

Understood. I'll keep that in mind in future patches.

> Thanks,
> 

Thanks,
Sahil

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox