* Re: [PATCH 0/3] mm: split the file's i_mmap tree for NUMA
From: Mateusz Guzik @ 2026-04-16 10:29 UTC
To: Huang Shijie
Cc: akpm, viro, brauner, linux-mm, linux-kernel, linux-arm-kernel,
linux-fsdevel, muchun.song, osalvador, linux-trace-kernel,
linux-perf-users, linux-parisc, nvdimm, zhongyuan, fangbaoshun,
yingzhiwei
In-Reply-To: <ad4EvoDcAKE2Sl4+@hsj-2U-Workstation>
On Tue, Apr 14, 2026 at 11:11 AM Huang Shijie <huangsj@hygon.cn> wrote:
>
> On Mon, Apr 13, 2026 at 05:33:21PM +0200, Mateusz Guzik wrote:
> > On Mon, Apr 13, 2026 at 02:20:39PM +0800, Huang Shijie wrote:
> > > In NUMA, there are maybe many NUMA nodes and many CPUs.
> > > For example, a Hygon's server has 12 NUMA nodes, and 384 CPUs.
> > > In the UnixBench tests, there is a test "execl" which tests
> > > the execve system call.
> > >
> > > When we test our server with "./Run -c 384 execl",
> > > the test result is not good enough. The i_mmap locks contended heavily on
> > > "libc.so" and "ld.so". For example, the i_mmap tree for "libc.so" can have
> > > over 6000 VMAs, all the VMAs can be in different NUMA mode.
> > > The insert/remove operations do not run quickly enough.
> > >
> > > patch 1 & patch 2 are try to hide the direct access of i_mmap.
> > > patch 3 splits the i_mmap into sibling trees, and we can get better
> > > performance with this patch set:
> > > we can get 77% performance improvement(10 times average)
> > >
> >
> > To my reading you kept the lock as-is and only distributed the protected
> > state.
> >
> > While I don't doubt the improvement, I'm confident should you take a
> > look at the profile you are going to find this still does not scale with
> > rwsem being one of the problems (there are other global locks, some of
> > which have experimental patches for).
> IMHO, when the number of VMAs in the i_mmap is very large, only optimise the rwsem
> lock does not help too much for our NUMA case.
>
> In our NUMA server, the remote access could be the major issue.
>
I'm confused how this is not supposed to help. You moved your data to
be stored per-domain. With my proposal the lock itself will also get
that treatment.

Modulo the issue of what to do with code wanting to iterate the entire
thing, this is blatantly faster.
>
> >
> > Apart from that this does nothing to help high core systems which are
> > all one node, which imo puts another question mark on this specific
> > proposal.
> Yes, this patch set only focus on the NUMA case.
> The one-node case should use the original i_mmap.
>
> Maybe I can add a new config, CONFIG_SPILT_I_MMAP. The config is disabled
> by default, and enabled when the NUMA node is not one.
>
> >
> > Of course one may question whether a RB tree is the right choice here,
> > it may be the lock-protected cost can go way down with merely a better
> > data structure.
> >
> > Regardless of that, for actual scalability, there will be no way around
> > decentralazing locking around this and partitioning per some core count
> > (not just by numa awareness).
> >
> > Decentralizing locking is definitely possible, but I have not looked
> > into specifics of how problematic it is. Best case scenario it will
> > merely with separate locks. Worst case scenario something needs a fully
> > stabilized state for traversal, in that case another rw lock can be
> Yes.
>
> The traversal may need to hold many locks.
>
The very paragraph you partially quoted answers what to do in that
case: wrap everything with a new rwsem taken for reading when
adding/removing entries and taken for writing when iterating the
entire thing. Then the iteration sticks to one lock.

The new rw lock puts an upper ceiling on scalability of the thing, but
it is way higher than the current state.

Given the extra overhead associated with it one could consider
sticking to one centralized state by default and switching to
distributed state if there is enough contention.
> > slapped around this, creating locking order read lock -> per-subset
> > write lock -- this will suffer scalability due to the read locking, but
> > it will still scale drastically better as apart from that there will be
> > no serialization. In this setting the problematic consumer will write
> > lock the new thing to stabilize the state.
> >
> > So my non-maintainer opinion is that the patchset is not worth it as it
> > fails to address anything for significantly more common and already
> > affected setups.
> This patch set is to reduce the remote access latency for insert/remove VMA
> in NUMA.
>
And I am saying the mmap semaphore is a significant problem already on
high-core no-numa setups. Addressing scalability in that case would
sort out the problem in your setup and to a significantly higher
extent.
> >
> > Have you looked into splitting the lock?
> >
> I ever tried.
>
> But there are two disadvantages:
> 1.) The traversal may need to hold many locks which makes the
> code very horrible.
>
I already explained above that this is avoidable.
> 2.) Even we split the locks. Each lock protects a tree, when the tree becomes
> big enough, the VMA insert/remove will also become slow in NUMA.
> The reason is that the tree has VMAs in different NUMA nodes.
>
This is orthogonal to my proposal. In fact, if one is to pretend this
is never a factor with your patch, I would like to point out it will
remain not a factor if the per-numa struct gets its own lock.
* Re: [RFC PATCH 1/2] kernel/notifier: replace single-linked list with double-linked list for reverse traversal
From: Petr Mladek @ 2026-04-16 10:33 UTC
To: chensong_2000
Cc: rafael, lenb, mturquette, sboyd, viresh.kumar, agk, snitzer,
mpatocka, bmarzins, song, yukuai, linan122, jason.wessel, danielt,
dianders, horms, davem, edumazet, kuba, pabeni, paulmck, frederic,
mcgrof, petr.pavlu, da.gomez, samitolvanen, atomlin, jpoimboe,
jikos, mbenes, joe.lawrence, rostedt, mhiramat, mark.rutland,
mathieu.desnoyers, linux-modules, linux-kernel,
linux-trace-kernel, linux-acpi, linux-clk, linux-pm,
live-patching, dm-devel, linux-raid, kgdb-bugreport, netdev
In-Reply-To: <20260415070137.17860-1-chensong_2000@189.cn>
On Wed 2026-04-15 15:01:37, chensong_2000@189.cn wrote:
> From: Song Chen <chensong_2000@189.cn>
>
> The current notifier chain implementation uses a single-linked list
> (struct notifier_block *next), which only supports forward traversal
> in priority order. This makes it difficult to handle cleanup/teardown
> scenarios that require notifiers to be called in reverse priority order.
>
> A concrete example is the ordering dependency between ftrace and
> livepatch during module load/unload. see the detail here [1].
>
> This patch replaces the single-linked list in struct notifier_block
> with a struct list_head, converting the notifier chain into a
> doubly-linked list sorted in descending priority order. Based on
> this, a new function notifier_call_chain_reverse() is introduced,
> which traverses the chain in reverse (ascending priority order).
> The corresponding blocking_notifier_call_chain_reverse() is also
> added as the locking wrapper for blocking notifier chains.
>
> The internal notifier_call_chain_robust() is updated to use
> notifier_call_chain_reverse() for rollback: on error, it records
> the failing notifier (last_nb) and the count of successfully called
> notifiers (nr), then rolls back exactly those nr-1 notifiers in
> reverse order starting from last_nb's predecessor, without needing
> to know the total length of the chain.
>
> With this change, subsystems with symmetric setup/teardown ordering
> requirements can register a single notifier_block with one priority
> value, and rely on blocking_notifier_call_chain() for forward
> traversal and blocking_notifier_call_chain_reverse() for reverse
> traversal, without needing hard-coded call sequences or separate
> notifier registrations for each direction.
>
> [1]: https://lore.kernel.org/all/alpine.LNX.2.00.1602172216491.22700@cbobk.fhfr.pm/
>
> --- a/include/linux/notifier.h
> +++ b/include/linux/notifier.h
> @@ -53,41 +53,41 @@ typedef int (*notifier_fn_t)(struct notifier_block *nb,
[...]
> struct notifier_block {
> notifier_fn_t notifier_call;
> - struct notifier_block __rcu *next;
> + struct list_head __rcu entry;
> int priority;
> };
[...]
> #define ATOMIC_INIT_NOTIFIER_HEAD(name) do { \
> spin_lock_init(&(name)->lock); \
> - (name)->head = NULL; \
> + INIT_LIST_HEAD(&(name)->head); \
I would expect the RCU variant here, aka INIT_LIST_HEAD_RCU().
> --- a/kernel/notifier.c
> +++ b/kernel/notifier.c
> @@ -14,39 +14,47 @@
> * are layered on top of these, with appropriate locking added.
> */
>
> -static int notifier_chain_register(struct notifier_block **nl,
> +static int notifier_chain_register(struct list_head *nl,
> struct notifier_block *n,
> bool unique_priority)
> {
> - while ((*nl) != NULL) {
> - if (unlikely((*nl) == n)) {
> + struct notifier_block *cur;
> +
> + list_for_each_entry(cur, nl, entry) {
> + if (unlikely(cur == n)) {
> WARN(1, "notifier callback %ps already registered",
> n->notifier_call);
> return -EEXIST;
> }
> - if (n->priority > (*nl)->priority)
> - break;
> - if (n->priority == (*nl)->priority && unique_priority)
> +
> + if (n->priority == cur->priority && unique_priority)
> return -EBUSY;
> - nl = &((*nl)->next);
> +
> + if (n->priority > cur->priority) {
> + list_add_tail(&n->entry, &cur->entry);
> + goto out;
> + }
> }
> - n->next = *nl;
> - rcu_assign_pointer(*nl, n);
> +
> + list_add_tail(&n->entry, nl);
I would expect list_add_tail_rcu() here.
> @@ -59,25 +67,25 @@ static int notifier_chain_unregister(struct notifier_block **nl,
> * value of this parameter is -1.
> * @nr_calls: Records the number of notifications sent. Don't care
> * value of this field is NULL.
> + * @last_nb: Records the last called notifier block for rolling back
> * Return: notifier_call_chain returns the value returned by the
> * last notifier function called.
> */
> -static int notifier_call_chain(struct notifier_block **nl,
> +static int notifier_call_chain(struct list_head *nl,
> unsigned long val, void *v,
> - int nr_to_call, int *nr_calls)
> + int nr_to_call, int *nr_calls,
> + struct notifier_block **last_nb)
> {
> int ret = NOTIFY_DONE;
> - struct notifier_block *nb, *next_nb;
> -
> - nb = rcu_dereference_raw(*nl);
> + struct notifier_block *nb;
>
> - while (nb && nr_to_call) {
> - next_nb = rcu_dereference_raw(nb->next);
> + if (!nr_to_call)
> + return ret;
>
> + list_for_each_entry(nb, nl, entry) {
I would expect the RCU variant here, i.e. list_for_each_entry_rcu().

These are just two random examples which I found by a quick look.

I guess that the notifier API is very old and does not use all
the RCU API features that allow tracking safety when
CONFIG_PROVE_RCU and CONFIG_PROVE_RCU_LIST are enabled.
It actually might be worth auditing the code and making it right.
> #ifdef CONFIG_DEBUG_NOTIFIERS
> if (unlikely(!func_ptr_is_kernel_text(nb->notifier_call))) {
> WARN(1, "Invalid notifier called!");
> - nb = next_nb;
> continue;
> }
> #endif
That said, I am not sure if the ftrace/livepatching handlers are
the right motivation for this, especially when I see the
complexity of the 2nd patch [*].

After thinking more about it, I am not even sure that the ftrace and
livepatching callbacks are good candidates for generic notifiers.
They are too special. It is not only about ordering them against
each other but also about ordering them against other
notifiers. The ftrace/livepatching callbacks must be the first/last
during module load/release.
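
For readers following the ordering argument, here is a minimal userspace sketch of the symmetric traversal the patch is after: a doubly-linked chain kept in descending priority order, walked forward for setup and backward for teardown. It is an illustration only, with no locking or RCU, and all names (struct notifier, struct chain, chain_register, chain_walk) are hypothetical rather than taken from kernel/notifier.c.

```c
#include <assert.h>
#include <stddef.h>

struct notifier {
	int priority;
	struct notifier *prev, *next;
};

struct chain {
	struct notifier head;	/* sentinel: head.next has the highest priority */
};

void chain_init(struct chain *c)
{
	c->head.prev = c->head.next = &c->head;
}

/*
 * Insert in front of the first entry with a strictly lower priority,
 * so the chain stays in descending order and entries with equal
 * priority keep their registration order.
 */
void chain_register(struct chain *c, struct notifier *n)
{
	struct notifier *cur = c->head.next;

	assert(c && n);
	while (cur != &c->head && cur->priority >= n->priority)
		cur = cur->next;

	n->next = cur;
	n->prev = cur->prev;
	cur->prev->next = n;
	cur->prev = n;
}

/*
 * Record the priorities in call order. Forward walks from highest to
 * lowest priority; reverse walks the same chain from lowest to highest.
 */
size_t chain_walk(const struct chain *c, int reverse, int *out, size_t max)
{
	const struct notifier *cur = reverse ? c->head.prev : c->head.next;
	size_t n = 0;

	while (cur != &c->head && n < max) {
		out[n++] = cur->priority;
		cur = reverse ? cur->prev : cur->next;
	}
	return n;
}
```

A single priority value then fixes a callback's position in both directions, but only relative to whatever else is registered on the same chain, which is where the first/last requirement above becomes hard to guarantee.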
[*] The 2nd patch is not archived by lore for some reason.
I have found only a review but it gives a good picture, see
https://lore.kernel.org/all/1191caf5-6a61-4622-a15e-854d3701f4fc@suse.com/
Best Regards,
Petr
* Re: [RFC PATCH 2/2] kernel/module: Decouple klp and ftrace from load_module
From: Petr Pavlu @ 2026-04-16 11:18 UTC
To: Song Chen
Cc: rafael, lenb, mturquette, sboyd, viresh.kumar, agk, snitzer,
mpatocka, bmarzins, song, yukuai, linan122, jason.wessel, danielt,
dianders, horms, davem, edumazet, kuba, pabeni, paulmck, frederic,
mcgrof, da.gomez, samitolvanen, atomlin, jpoimboe, jikos, mbenes,
pmladek, joe.lawrence, rostedt, mhiramat, mark.rutland,
mathieu.desnoyers, linux-modules, linux-kernel,
linux-trace-kernel, linux-acpi, linux-clk, linux-pm,
live-patching, dm-devel, linux-raid, kgdb-bugreport, netdev
In-Reply-To: <a35f5f94-7d5a-4347-974b-b270c89ef241@189.cn>
On 4/15/26 8:43 AM, Song Chen wrote:
> On 4/14/26 22:33, Petr Pavlu wrote:
>> On 4/13/26 10:07 AM, chensong_2000@189.cn wrote:
>>> diff --git a/include/linux/module.h b/include/linux/module.h
>>> index 14f391b186c6..0bdd56f9defd 100644
>>> --- a/include/linux/module.h
>>> +++ b/include/linux/module.h
>>> @@ -308,6 +308,14 @@ enum module_state {
>>> MODULE_STATE_COMING, /* Full formed, running module_init. */
>>> MODULE_STATE_GOING, /* Going away. */
>>> MODULE_STATE_UNFORMED, /* Still setting it up. */
>>> + MODULE_STATE_FORMED,
>>
>> I don't see a reason to add a new module state. Why is it necessary and
>> how does it fit with the existing states?
>>
> because once notifier fails in state MODULE_STATE_UNFORMED (now only ftrace has someting to do in this state), notifier chain will roll back by calling blocking_notifier_call_chain_robust, i'm afraid MODULE_STATE_GOING is going to jeopardise the notifers which don't handle it appropriately, like:
>
> case MODULE_STATE_COMING:
> kmalloc();
> case MODULE_STATE_GOING:
> kfree();
My understanding is that the current module "state machine" operates as
follows. Transitions marked with an asterisk (*) are announced via the
module notifier.
---> UNFORMED --*> COMING --*> LIVE --*> GOING -.
        ^            |                     ^    |
        |            '---------------------*    |
        '---------------------------------------'
The new code aims to replace the current ftrace_module_init() call in
load_module(). To achieve this, it adds a notification for the UNFORMED
state (only when loading a module) and introduces a new FORMED state for
rollback. FORMED is purely a fake state because it never appears in
module::state. The new structure is as follows:
           ,--*> (FORMED)
           |
--*> UNFORMED --*> COMING --*> LIVE --*> GOING -.
        ^            |                     ^    |
        |            '---------------------*    |
        '---------------------------------------'
I'm afraid this is quite complex and inconsistent. Unless it can be kept
simple, we would be just replacing one special handling with a different
complexity, which is not worth it.
>>
>>> + if (err)
>>> + goto ddebug_cleanup;
>>> /* Finally it's fully formed, ready to start executing. */
>>> err = complete_formation(mod, info);
>>> - if (err)
>>> + if (err) {
>>> + blocking_notifier_call_chain_reverse(&module_notify_list,
>>> + MODULE_STATE_FORMED, mod);
>>> goto ddebug_cleanup;
>>> + }
>>> - err = prepare_coming_module(mod);
>>> + err = prepare_module_state_transaction(mod,
>>> + MODULE_STATE_COMING, MODULE_STATE_GOING);
>>> if (err)
>>> goto bug_cleanup;
>>> @@ -3522,7 +3519,6 @@ static int load_module(struct load_info *info, const char __user *uargs,
>>> destroy_params(mod->kp, mod->num_kp);
>>> blocking_notifier_call_chain(&module_notify_list,
>>> MODULE_STATE_GOING, mod);
>>
>> My understanding is that all notifier chains for MODULE_STATE_GOING
>> should be reversed.
> yes, all, from lowest priority notifier to highest.
> I will resend patch 1 which was failed due to my proxy setting.
What I meant here is that the call:
blocking_notifier_call_chain(&module_notify_list, MODULE_STATE_GOING, mod);
should be replaced with:
blocking_notifier_call_chain_reverse(&module_notify_list, MODULE_STATE_GOING, mod);
>
>>
>>> - klp_module_going(mod);
>>> bug_cleanup:
>>> mod->state = MODULE_STATE_GOING;
>>> /* module_bug_cleanup needs module_mutex protection */
>>
>> The patch removes the klp_module_going() cleanup call in load_module().
>> Similarly, the ftrace_release_mod() call under the ddebug_cleanup label
>> should be removed and appropriately replaced with a cleanup via
>> a notifier.
>>
> err = prepare_module_state_transaction(mod,
> MODULE_STATE_UNFORMED, MODULE_STATE_FORMED);
> if (err)
> goto ddebug_cleanup;
>
> ftrace will be cleanup in blocking_notifier_call_chain_robust rolling back.
>
> err = prepare_module_state_transaction(mod,
> MODULE_STATE_COMING, MODULE_STATE_GOING);
>
> each notifier including ftrace and klp will be cleanup in blocking_notifier_call_chain_robust rolling back.
>
> if all notifiers are successful in MODULE_STATE_COMING, they all will be clean up in
> coming_cleanup:
> mod->state = MODULE_STATE_GOING;
> destroy_params(mod->kp, mod->num_kp);
> blocking_notifier_call_chain(&module_notify_list,
> MODULE_STATE_GOING, mod);
>
> if something wrong underneath.
My point is that the patch leaves a call to ftrace_release_mod() in
load_module(), which I expected to be handled via a notifier.
--
Thanks,
Petr
* Re: [PATCH 0/3] mm: split the file's i_mmap tree for NUMA
From: Huang Shijie @ 2026-04-16 11:48 UTC
To: Mateusz Guzik
Cc: akpm, viro, brauner, linux-mm, linux-kernel, linux-arm-kernel,
linux-fsdevel, muchun.song, osalvador, linux-trace-kernel,
linux-perf-users, linux-parisc, nvdimm, zhongyuan, fangbaoshun,
yingzhiwei
In-Reply-To: <CAGudoHGLaoc+CoBPNCvFRYojnj+6E_Lsdv7NaJWxFMoHezemMQ@mail.gmail.com>
On Thu, Apr 16, 2026 at 12:29:50PM +0200, Mateusz Guzik wrote:
> On Tue, Apr 14, 2026 at 11:11 AM Huang Shijie <huangsj@hygon.cn> wrote:
> >
> > On Mon, Apr 13, 2026 at 05:33:21PM +0200, Mateusz Guzik wrote:
> > > On Mon, Apr 13, 2026 at 02:20:39PM +0800, Huang Shijie wrote:
> > > > In NUMA, there are maybe many NUMA nodes and many CPUs.
> > > > For example, a Hygon's server has 12 NUMA nodes, and 384 CPUs.
> > > > In the UnixBench tests, there is a test "execl" which tests
> > > > the execve system call.
> > > >
> > > > When we test our server with "./Run -c 384 execl",
> > > > the test result is not good enough. The i_mmap locks contended heavily on
> > > > "libc.so" and "ld.so". For example, the i_mmap tree for "libc.so" can have
> > > > over 6000 VMAs, all the VMAs can be in different NUMA mode.
> > > > The insert/remove operations do not run quickly enough.
> > > >
> > > > patch 1 & patch 2 are try to hide the direct access of i_mmap.
> > > > patch 3 splits the i_mmap into sibling trees, and we can get better
> > > > performance with this patch set:
> > > > we can get 77% performance improvement(10 times average)
> > > >
> > >
> > > To my reading you kept the lock as-is and only distributed the protected
> > > state.
> > >
> > > While I don't doubt the improvement, I'm confident should you take a
> > > look at the profile you are going to find this still does not scale with
> > > rwsem being one of the problems (there are other global locks, some of
> > > which have experimental patches for).
> > IMHO, when the number of VMAs in the i_mmap is very large, only optimise the rwsem
> > lock does not help too much for our NUMA case.
> >
> > In our NUMA server, the remote access could be the major issue.
> >
>
> I'm confused how this is not supposed to help. You moved your data to
> be stored per-domain. With my proposal the lock itself will also get
> that treatment.
>
> Modulo the issue of what to do with code wanting to iterate the entire
> thing, this is blatantly faster.
>
I tested an old lock patch yesterday. It really helps a lot.
The lock patch is from this link:
https://lkml.org/lkml/2024/9/14/280

The test results:
v7.0-rc5 + (lock patch) : improves by about 60%
v7.0-rc5 + (lock patch) + (this patch set) : improves by about 130%
> >
> > >
> > > Apart from that this does nothing to help high core systems which are
> > > all one node, which imo puts another question mark on this specific
> > > proposal.
> > Yes, this patch set only focus on the NUMA case.
> > The one-node case should use the original i_mmap.
> >
> > Maybe I can add a new config, CONFIG_SPILT_I_MMAP. The config is disabled
> > by default, and enabled when the NUMA node is not one.
> >
> > >
> > > Of course one may question whether a RB tree is the right choice here,
> > > it may be the lock-protected cost can go way down with merely a better
> > > data structure.
> > >
> > > Regardless of that, for actual scalability, there will be no way around
> > > decentralazing locking around this and partitioning per some core count
> > > (not just by numa awareness).
> > >
> > > Decentralizing locking is definitely possible, but I have not looked
> > > into specifics of how problematic it is. Best case scenario it will
> > > merely with separate locks. Worst case scenario something needs a fully
> > > stabilized state for traversal, in that case another rw lock can be
> > Yes.
> >
> > The traversal may need to hold many locks.
> >
>
> The very paragraph you partially quoted answers what to do in that
> case: wrap everything with a new rwsem taken for reading when
> adding/removing entries and taken for writing when iterating the
> entire thing. Then the iteration sticks to one lock.
>
> The new rw lock puts an upper ceiling on scalability of the thing, but
> it is way higher than the current state.
Could you point me to the patch for this?
Has this lock patch been merged or not?
I can test it.
>
> Given the extra overhead associated with it one could consider
> sticking to one centralized state by default and switching to
> distributed state if there is enough contention.
>
> > > slapped around this, creating locking order read lock -> per-subset
> > > write lock -- this will suffer scalability due to the read locking, but
> > > it will still scale drastically better as apart from that there will be
> > > no serialization. In this setting the problematic consumer will write
> > > lock the new thing to stabilize the state.
> > >
> > > So my non-maintainer opinion is that the patchset is not worth it as it
> > > fails to address anything for significantly more common and already
> > > affected setups.
> > This patch set is to reduce the remote access latency for insert/remove VMA
> > in NUMA.
> >
>
> And I am saying the mmap semaphore is a significant problem already on
> high-core no-numa setups. Addressing scalability in that case would
> sort out the problem in your setup and to a significantly higher
> extent.
I am afraid that even if the lock patch resolves the scalability of high-core
no-numa setups, we still need to split the i_mmap for NUMA.
>
> > >
> > > Have you looked into splitting the lock?
> > >
> > I ever tried.
> >
> > But there are two disadvantages:
> > 1.) The traversal may need to hold many locks which makes the
> > code very horrible.
> >
>
> I already above this is avoidable.
>
> > 2.) Even we split the locks. Each lock protects a tree, when the tree becomes
> > big enough, the VMA insert/remove will also become slow in NUMA.
> > The reason is that the tree has VMAs in different NUMA nodes.
> >
>
> This is orthogonal to my proposal. In fact, if one is to pretend this
> is never a factor with your patch, I would like to point out it will
> remain not a factor if the per-numa struct gets its own lock.
Yes. It is orthogonal to your proposal.
Thanks
Huang Shijie
* Re: [RFC PATCH 4/4] selftests/rv: Add selftest for the tlob monitor
From: Gabriele Monaco @ 2026-04-16 12:00 UTC
To: wen.yang
Cc: linux-trace-kernel, linux-kernel, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers
In-Reply-To: <5bdd82dd8aeb1d3f955b727ae1fce9819b35c170.1776020428.git.wen.yang@linux.dev>
On Mon, 2026-04-13 at 03:27 +0800, wen.yang@linux.dev wrote:
> From: Wen Yang <wen.yang@linux.dev>
>
> Add a kselftest suite (TAP output, 19 test points) for the tlob RV
> monitor under tools/testing/selftests/rv/.
>
> test_tlob.sh drives a compiled C helper (tlob_helper) and, for uprobe
> tests, a target binary (tlob_uprobe_target). Coverage spans the
> tracefs enable/disable path, uprobe-triggered violations, and the
> ioctl interface (within-budget stop, CPU-bound and sleep violations,
> duplicate start, ring buffer mmap and consumption).
>
> Requires CONFIG_RV_MON_TLOB=y and CONFIG_RV_CHARDEV=y; must be run
> as root.
>
> Signed-off-by: Wen Yang <wen.yang@linux.dev>
Those are some extensive selftests!

Could you integrate them with the existing test suite under
tools/testing/selftests/verification?
You would probably just get your tlob_helper built and call it from some shell
script under test.d; the harness should work without the need for extra helpers.
Thanks,
Gabriele
> ---
> tools/include/uapi/linux/rv.h | 54 +
> tools/testing/selftests/rv/Makefile | 18 +
> tools/testing/selftests/rv/test_tlob.sh | 563 ++++++++++
> tools/testing/selftests/rv/tlob_helper.c | 994 ++++++++++++++++++
> .../testing/selftests/rv/tlob_uprobe_target.c | 108 ++
> 5 files changed, 1737 insertions(+)
> create mode 100644 tools/include/uapi/linux/rv.h
> create mode 100644 tools/testing/selftests/rv/Makefile
> create mode 100755 tools/testing/selftests/rv/test_tlob.sh
> create mode 100644 tools/testing/selftests/rv/tlob_helper.c
> create mode 100644 tools/testing/selftests/rv/tlob_uprobe_target.c
>
> diff --git a/tools/include/uapi/linux/rv.h b/tools/include/uapi/linux/rv.h
> new file mode 100644
> index 000000000..bef07aded
> --- /dev/null
> +++ b/tools/include/uapi/linux/rv.h
> @@ -0,0 +1,54 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +/*
> + * UAPI definitions for Runtime Verification (RV) monitors.
> + *
> + * This is a tools-friendly copy of include/uapi/linux/rv.h.
> + * Keep in sync with the kernel header.
> + */
> +
> +#ifndef _UAPI_LINUX_RV_H
> +#define _UAPI_LINUX_RV_H
> +
> +#include <linux/types.h>
> +#include <sys/ioctl.h>
> +
> +/* Magic byte shared by all RV monitor ioctls. */
> +#define RV_IOC_MAGIC 0xB9
> +
> +/* -----------------------------------------------------------------------
> + * tlob: task latency over budget monitor (nr 0x01 - 0x1F)
> + * -----------------------------------------------------------------------
> + */
> +
> +struct tlob_start_args {
> + __u64 threshold_us;
> + __u64 tag;
> + __s32 notify_fd;
> + __u32 flags;
> +};
> +
> +struct tlob_event {
> + __u32 tid;
> + __u32 pad;
> + __u64 threshold_us;
> + __u64 on_cpu_us;
> + __u64 off_cpu_us;
> + __u32 switches;
> + __u32 state; /* 1 = on_cpu, 0 = off_cpu */
> + __u64 tag;
> +};
> +
> +struct tlob_mmap_page {
> + __u32 data_head;
> + __u32 data_tail;
> + __u32 capacity;
> + __u32 version;
> + __u32 data_offset;
> + __u32 record_size;
> + __u64 dropped;
> +};
> +
> +#define TLOB_IOCTL_TRACE_START _IOW(RV_IOC_MAGIC, 0x01, struct tlob_start_args)
> +#define TLOB_IOCTL_TRACE_STOP _IO(RV_IOC_MAGIC, 0x02)
> +
> +#endif /* _UAPI_LINUX_RV_H */
> diff --git a/tools/testing/selftests/rv/Makefile b/tools/testing/selftests/rv/Makefile
> new file mode 100644
> index 000000000..14e94a1ab
> --- /dev/null
> +++ b/tools/testing/selftests/rv/Makefile
> @@ -0,0 +1,18 @@
> +# SPDX-License-Identifier: GPL-2.0
> +# Makefile for rv selftests
> +
> +TEST_GEN_PROGS := tlob_helper tlob_uprobe_target
> +
> +TEST_PROGS := \
> + test_tlob.sh \
> +
> +# TOOLS_INCLUDES is defined by ../lib.mk; provides -isystem to
> +# tools/include/uapi so that #include <linux/rv.h> resolves to the
> +# in-tree UAPI header without requiring make headers_install.
> +# Note: both must be added to the global variables, not as target-specific
> +# overrides, because lib.mk rewrites TEST_GEN_PROGS to $(OUTPUT)/name
> +# before per-target rules would be evaluated.
> +CFLAGS += $(TOOLS_INCLUDES)
> +LDLIBS += -lpthread
> +
> +include ../lib.mk
> diff --git a/tools/testing/selftests/rv/test_tlob.sh b/tools/testing/selftests/rv/test_tlob.sh
> new file mode 100755
> index 000000000..3ba2125eb
> --- /dev/null
> +++ b/tools/testing/selftests/rv/test_tlob.sh
> @@ -0,0 +1,563 @@
> +#!/bin/sh
> +# SPDX-License-Identifier: GPL-2.0
> +#
> +# Selftest for the tlob (task latency over budget) RV monitor.
> +#
> +# Two interfaces are tested:
> +#
> +# 1. tracefs interface:
> +# enable/disable, presence of tracefs files,
> +# uprobe binding (threshold_us:offset_start:offset_stop:binary_path) and
> +# violation detection via the ftrace ring buffer.
> +#
> +# 2. /dev/rv ioctl self-instrumentation (via tlob_helper):
> +# within-budget, over-budget on-CPU, over-budget off-CPU (sleep),
> +# double-start, stop-without-start.
> +#
> +# Written to be POSIX sh compatible (no bash-specific extensions).
> +
> +ksft_skip=4
> +t_pass=0; t_fail=0; t_skip=0; t_total=0
> +
> +tap_header() { echo "TAP version 13"; }
> +tap_plan() { echo "1..$1"; }
> +tap_pass() { t_pass=$((t_pass+1)); echo "ok $t_total - $1"; }
> +tap_fail() { t_fail=$((t_fail+1)); echo "not ok $t_total - $1"
> + [ -n "$2" ] && echo " # $2"; }
> +tap_skip() { t_skip=$((t_skip+1)); echo "ok $t_total - $1 # SKIP $2"; }
> +next_test() { t_total=$((t_total+1)); }
> +
> +TRACEFS=$(grep -m1 tracefs /proc/mounts 2>/dev/null | awk '{print $2}')
> +[ -z "$TRACEFS" ] && TRACEFS=/sys/kernel/tracing
> +
> +RV_DIR="${TRACEFS}/rv"
> +TLOB_DIR="${RV_DIR}/monitors/tlob"
> +TRACE_FILE="${TRACEFS}/trace"
> +TRACING_ON="${TRACEFS}/tracing_on"
> +TLOB_MONITOR="${TLOB_DIR}/monitor"
> +BUDGET_EXCEEDED_ENABLE="${TRACEFS}/events/rv/tlob_budget_exceeded/enable"
> +RV_DEV="/dev/rv"
> +
> +# tlob_helper and tlob_uprobe_target must be in the same directory as
> +# this script or on PATH.
> +SCRIPT_DIR=$(dirname "$0")
> +IOCTL_HELPER="${SCRIPT_DIR}/tlob_helper"
> +UPROBE_TARGET="${SCRIPT_DIR}/tlob_uprobe_target"
> +
> +check_root() { [ "$(id -u)" = "0" ] || { echo "# Need root" >&2; exit $ksft_skip; }; }
> +check_tracefs() { [ -d "${TRACEFS}" ] || { echo "# No tracefs" >&2; exit $ksft_skip; }; }
> +check_rv_dir() { [ -d "${RV_DIR}" ] || { echo "# No RV infra" >&2; exit $ksft_skip; }; }
> +check_tlob() { [ -d "${TLOB_DIR}" ] || { echo "# No tlob monitor" >&2; exit $ksft_skip; }; }
> +
> +tlob_enable() { echo 1 > "${TLOB_DIR}/enable"; }
> +tlob_disable() { echo 0 > "${TLOB_DIR}/enable" 2>/dev/null; }
> +tlob_is_enabled() { [ "$(cat "${TLOB_DIR}/enable" 2>/dev/null)" = "1" ]; }
> +trace_event_enable() { echo 1 > "${BUDGET_EXCEEDED_ENABLE}" 2>/dev/null; }
> +trace_event_disable() { echo 0 > "${BUDGET_EXCEEDED_ENABLE}" 2>/dev/null; }
> +trace_on() { echo 1 > "${TRACING_ON}" 2>/dev/null; }
> +trace_clear() { echo > "${TRACE_FILE}"; }
> +trace_grep() { grep -q "$1" "${TRACE_FILE}" 2>/dev/null; }
> +
> +cleanup() {
> + tlob_disable
> + trace_event_disable
> + trace_clear
> +}
> +
> +# ---------------------------------------------------------------------------
> +# Test 1: enable / disable
> +# ---------------------------------------------------------------------------
> +run_test_enable_disable() {
> + next_test; cleanup
> + tlob_enable
> + if ! tlob_is_enabled; then
> + tap_fail "enable_disable" "not enabled after echo 1"; cleanup; return
> + fi
> + tlob_disable
> + if tlob_is_enabled; then
> + tap_fail "enable_disable" "still enabled after echo 0"; cleanup; return
> + fi
> + tap_pass "enable_disable"; cleanup
> +}
> +
> +# ---------------------------------------------------------------------------
> +# Test 2: tracefs files present
> +# ---------------------------------------------------------------------------
> +run_test_tracefs_files() {
> + next_test; cleanup
> + missing=""
> + for f in enable desc monitor; do
> + [ ! -e "${TLOB_DIR}/${f}" ] && missing="${missing} ${f}"
> + done
> + [ -n "${missing}" ] \
> + && tap_fail "tracefs_files" "missing:${missing}" \
> + || tap_pass "tracefs_files"
> + cleanup
> +}
> +
> +# ---------------------------------------------------------------------------
> +# Helper: resolve file offset of a function inside a binary.
> +#
> +# Usage: resolve_offset <binary> <vaddr_hex>
> +# Prints the hex file offset, or empty string on failure.
> +# ---------------------------------------------------------------------------
> +resolve_offset() {
> + bin=$1; vaddr=$2
> + # Parse /proc/self/maps to find the mapping that contains vaddr.
> + # Each line: start-end perms offset dev inode [path]
> + while IFS= read -r line; do
> + set -- $line
> + range=$1; off=$3; path=$6
> + [ -z "$path" ] && continue
> + # Only consider the mapping for our binary
> + [ "$path" != "$bin" ] && continue
> + # Split range into start and end
> + start=$(echo "$range" | cut -d- -f1)
> + end=$(echo "$range" | cut -d- -f2)
> + # Convert hex to decimal for comparison (use printf)
> + s=$(printf "%d" "0x${start}" 2>/dev/null) || continue
> + e=$(printf "%d" "0x${end}" 2>/dev/null) || continue
> + v=$(printf "%d" "${vaddr}" 2>/dev/null) || continue
> + o=$(printf "%d" "0x${off}" 2>/dev/null) || continue
> + if [ "$v" -ge "$s" ] && [ "$v" -lt "$e" ]; then
> + file_off=$(printf "0x%x" $(( (v - s) + o )))
> + echo "$file_off"
> + return
> + fi
> + done < /proc/self/maps
> +}
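A side note on the arithmetic in this helper: the lookup boils down to file_off = (vaddr - start) + map_offset, where the mapping offset is the third whitespace-separated field of a maps line. Below is a minimal standalone sketch, assuming a single maps-style line is passed in as a string. The name maps_line_to_offset is made up for illustration and is not part of the patch.

```shell
# Hypothetical helper (not part of the patch): compute the file offset of a
# virtual address from one /proc/<pid>/maps-style line.
# Fields of a maps line: start-end perms offset dev inode [path]
maps_line_to_offset() {
	line=$1; vaddr=$2
	# Split the line into positional parameters (arguments were saved above).
	set -- $line
	range=$1; off=$3
	start=${range%-*}; end=${range#*-}
	# Convert hex strings to decimal for the range comparison.
	s=$(printf "%d" "0x${start}") || return 1
	e=$(printf "%d" "0x${end}") || return 1
	o=$(printf "%d" "0x${off}") || return 1
	v=$(printf "%d" "${vaddr}") || return 1
	if [ "$v" -ge "$s" ] && [ "$v" -lt "$e" ]; then
		printf "0x%x\n" $(( (v - s) + o ))
		return 0
	fi
	# vaddr is outside this mapping: print nothing and report failure.
	return 1
}
```

For a mapping that starts at 0x400000 with file offset 0x1000, the address 0x4011d0 lands 0x11d0 bytes into the mapping, so the sketch prints 0x21d0.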
> +
> +# ---------------------------------------------------------------------------
> +# Test 3: uprobe binding - no false positive
> +#
> +# Enable the monitor without registering any binding and idle for 0.5 s.
> +# No budget_exceeded event should appear in the trace.
> +# ---------------------------------------------------------------------------
> +run_test_uprobe_no_false_positive() {
> + next_test; cleanup
> + if [ ! -e "${TLOB_MONITOR}" ]; then
> + tap_skip "uprobe_no_false_positive" "monitor file not available"
> + cleanup; return
> + fi
> + # We probe the "sleep" command that we will run as a subprocess.
> + # Use /bin/sleep as the binary; find a valid function offset (0x0
> + # resolves to the ELF entry point, which is sufficient for a
> + # no-false-positive test since we just need the binding to exist).
> + sleep_bin=$(command -v sleep 2>/dev/null)
> + if [ -z "$sleep_bin" ]; then
> + tap_skip "uprobe_no_false_positive" "sleep not found"; cleanup; return
> + fi
> + pid=$$
> + # offset 0x0 probes the entry point of /bin/sleep - this is a
> + # deliberate probe that will not fire during a simple 'sleep 10'
> + # invoked in a subshell, but registers the pid in tlob.
> + #
> + # Instead, bind our own pid with a generous 10 s threshold and
> + # verify that 0.5 s of idle time does NOT fire the timer.
> + #
> + # Since we cannot easily get a valid uprobe offset in pure shell,
> + # we skip this sub-test if we cannot form a valid binding.
> + exe=$(readlink /proc/self/exe 2>/dev/null)
> + if [ -z "$exe" ]; then
> + tap_skip "uprobe_no_false_positive" "cannot read /proc/self/exe"
> + cleanup; return
> + fi
> + trace_event_enable
> + trace_on
> + tlob_enable
> + trace_clear
> + # Sleep without any binding - just verify no spurious events
> + sleep 0.5
> + trace_grep "budget_exceeded" \
> + && tap_fail "uprobe_no_false_positive" \
> + "spurious budget_exceeded without any binding" \
> + || tap_pass "uprobe_no_false_positive"
> + cleanup
> +}
> +
> +# ---------------------------------------------------------------------------
> +# Helper: get_uprobe_offset <binary> <symbol>
> +#
> +# Use tlob_helper sym_offset to get the ELF file offset of <symbol>
> +# in <binary>. Prints the hex offset (e.g. "0x11d0") or empty string on
> +# failure.
> +# ---------------------------------------------------------------------------
> +get_uprobe_offset() {
> + bin=$1; sym=$2
> + if [ ! -x "${IOCTL_HELPER}" ]; then
> + return
> + fi
> + "${IOCTL_HELPER}" sym_offset "${bin}" "${sym}" 2>/dev/null
> +}
> +
> +# ---------------------------------------------------------------------------
> +# Test 4: uprobe binding - violation detected
> +#
> +# Start tlob_uprobe_target (a busy-spin binary with a well-known symbol),
> +# attach a uprobe on tlob_busy_work with a 10 us threshold, and verify
> +# that a budget_exceeded event appears.
> +# ---------------------------------------------------------------------------
> +run_test_uprobe_violation() {
> + next_test; cleanup
> + if [ ! -e "${TLOB_MONITOR}" ]; then
> + tap_skip "uprobe_violation" "monitor file not available"
> + cleanup; return
> + fi
> + if [ ! -x "${UPROBE_TARGET}" ]; then
> + tap_skip "uprobe_violation" \
> + "tlob_uprobe_target not found or not executable"
> + cleanup; return
> + fi
> +
> + # Get the file offsets of the start and stop probe symbols
> + busy_offset=$(get_uprobe_offset "${UPROBE_TARGET}" "tlob_busy_work")
> + if [ -z "${busy_offset}" ]; then
> + tap_skip "uprobe_violation" \
> + "cannot resolve tlob_busy_work offset in ${UPROBE_TARGET}"
> + cleanup; return
> + fi
> + stop_offset=$(get_uprobe_offset "${UPROBE_TARGET}" "tlob_busy_work_done")
> + if [ -z "${stop_offset}" ]; then
> + tap_skip "uprobe_violation" \
> + "cannot resolve tlob_busy_work_done offset in ${UPROBE_TARGET}"
> + cleanup; return
> + fi
> +
> + # Start the busy-spin target (run for 30 s so the test can observe it)
> + "${UPROBE_TARGET}" 30000 &
> + busy_pid=$!
> + sleep 0.05
> +
> + trace_event_enable
> + trace_on
> + tlob_enable
> + trace_clear
> +
> + # Bind the target: 10 us budget; start=tlob_busy_work, stop=tlob_busy_work_done
> + binding="10:${busy_offset}:${stop_offset}:${UPROBE_TARGET}"
> + if ! echo "${binding}" > "${TLOB_MONITOR}" 2>/dev/null; then
> + kill "${busy_pid}" 2>/dev/null; wait "${busy_pid}" 2>/dev/null
> + tap_skip "uprobe_violation" \
> + "uprobe binding rejected (CONFIG_UPROBES=y needed)"
> + cleanup; return
> + fi
> +
> + # Wait up to 2 s for a budget_exceeded event
> + found=0; i=0
> + while [ "$i" -lt 20 ]; do
> + sleep 0.1
> + trace_grep "budget_exceeded" && { found=1; break; }
> + i=$((i+1))
> + done
> +
> + echo "-${busy_offset}:${UPROBE_TARGET}" > "${TLOB_MONITOR}" 2>/dev/null
> + kill "${busy_pid}" 2>/dev/null; wait "${busy_pid}" 2>/dev/null
> +
> + if [ "${found}" != "1" ]; then
> + tap_fail "uprobe_violation" "no budget_exceeded within 2 s"
> + cleanup; return
> + fi
> +
> + # Validate the event fields: threshold must match, on_cpu must be non-zero
> + # (CPU-bound violation), and state must be on_cpu.
> + ev=$(grep "budget_exceeded" "${TRACE_FILE}" | head -n 1)
> + if ! echo "${ev}" | grep -q "threshold=10 "; then
> + tap_fail "uprobe_violation" "threshold field mismatch: ${ev}"
> + cleanup; return
> + fi
> + on_cpu=$(echo "${ev}" | grep -o "on_cpu=[0-9]*" | cut -d= -f2)
> + if [ "${on_cpu:-0}" -eq 0 ]; then
> + tap_fail "uprobe_violation" "on_cpu=0 for a CPU-bound spin: ${ev}"
> + cleanup; return
> + fi
> + if ! echo "${ev}" | grep -q "state=on_cpu"; then
> + tap_fail "uprobe_violation" "state is not on_cpu: ${ev}"
> + cleanup; return
> + fi
> + tap_pass "uprobe_violation"
> + cleanup
> +}
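The "wait up to 2 s" loop above polls the trace file every 0.1 s for at most 20 tries. A rough sketch of the same polling pattern as a reusable function follows. The name wait_for_pattern and its argument order are assumptions for illustration only, and like the quoted script it relies on a sleep that accepts fractional seconds.

```shell
# Hypothetical helper (not part of the patch): poll a file for a pattern.
# Usage: wait_for_pattern <pattern> <file> [tries]
# Each try is followed by a 0.1 s sleep; the default of 20 tries gives
# roughly a 2 s timeout.
wait_for_pattern() {
	pattern=$1; file=$2; tries=${3:-20}
	i=0
	while [ "$i" -lt "$tries" ]; do
		# Succeed as soon as the pattern shows up.
		grep -q "$pattern" "$file" 2>/dev/null && return 0
		sleep 0.1
		i=$((i+1))
	done
	# Timed out without seeing the pattern.
	return 1
}
```

With something like this, a test body could call wait_for_pattern "budget_exceeded" "${TRACE_FILE}" and branch on the return status instead of tracking a found flag.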
> +
> +# ---------------------------------------------------------------------------
> +# Test 5: uprobe binding - remove binding stops monitoring
> +#
> +# Bind tlob_uprobe_target, then immediately remove the binding.
> +# Verify that after removal the monitor file no longer lists it.
> +# ---------------------------------------------------------------------------
> +run_test_uprobe_unbind() {
> + next_test; cleanup
> + if [ ! -e "${TLOB_MONITOR}" ]; then
> + tap_skip "uprobe_unbind" "monitor file not available"
> + cleanup; return
> + fi
> + if [ ! -x "${UPROBE_TARGET}" ]; then
> + tap_skip "uprobe_unbind" \
> + "tlob_uprobe_target not found or not executable"
> + cleanup; return
> + fi
> +
> + busy_offset=$(get_uprobe_offset "${UPROBE_TARGET}" "tlob_busy_work")
> + stop_offset=$(get_uprobe_offset "${UPROBE_TARGET}" "tlob_busy_work_done")
> + if [ -z "${busy_offset}" ] || [ -z "${stop_offset}" ]; then
> + tap_skip "uprobe_unbind" \
> + "cannot resolve tlob_busy_work/tlob_busy_work_done offset"
> + cleanup; return
> + fi
> +
> + "${UPROBE_TARGET}" 30000 &
> + busy_pid=$!
> + sleep 0.05
> +
> + tlob_enable
> + # 5 s budget - should not fire during this quick test
> + binding="5000000:${busy_offset}:${stop_offset}:${UPROBE_TARGET}"
> + if ! echo "${binding}" > "${TLOB_MONITOR}" 2>/dev/null; then
> + kill "${busy_pid}" 2>/dev/null; wait "${busy_pid}" 2>/dev/null
> + tap_skip "uprobe_unbind" \
> + "uprobe binding rejected (CONFIG_UPROBES=y needed)"
> + cleanup; return
> + fi
> +
> + # Remove the binding
> + echo "-${busy_offset}:${UPROBE_TARGET}" > "${TLOB_MONITOR}" 2>/dev/null
> +
> + # The monitor file should no longer list the binding for this offset
> + if grep -q "^[0-9]*:0x${busy_offset#0x}:" "${TLOB_MONITOR}" 2>/dev/null; then
> + kill "${busy_pid}" 2>/dev/null; wait "${busy_pid}" 2>/dev/null
> + tap_fail "uprobe_unbind" "pid still listed after removal"
> + cleanup; return
> + fi
> +
> + kill "${busy_pid}" 2>/dev/null; wait "${busy_pid}" 2>/dev/null
> + tap_pass "uprobe_unbind"
> + cleanup
> +}
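Going by the writes in these tests, a binding is added with "threshold_us:offset_start:offset_stop:binary" and removed with "-offset_start:binary". A small sketch of building both strings is below. The function names are hypothetical, and the field meanings are inferred from the test comments rather than from the monitor's documentation.

```shell
# Hypothetical helpers (not part of the patch), named for illustration only.
# Add format:    <threshold_us>:<offset_start>:<offset_stop>:<binary>
# Remove format: -<offset_start>:<binary>
tlob_bind_str() {
	printf '%s:%s:%s:%s\n' "$1" "$2" "$3" "$4"
}

tlob_unbind_str() {
	# The leading '-' is passed as an argument so that printf never sees a
	# format string that starts with a dash.
	printf '%s%s:%s\n' '-' "$1" "$2"
}
```

A test could then write "$(tlob_bind_str 5000000 "${busy_offset}" "${stop_offset}" "${UPROBE_TARGET}")" to the monitor file, which keeps the format in one place if it changes in a later revision.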
> +
> +# ---------------------------------------------------------------------------
> +# Test 6: uprobe - duplicate offset_start rejected
> +#
> +# Registering a second binding with the same offset_start in the same binary
> +# must be rejected with an error, since two entry uprobes at the same address
> +# would cause double tlob_start_task() calls and undefined behaviour.
> +# ---------------------------------------------------------------------------
> +run_test_uprobe_duplicate_offset() {
> + next_test; cleanup
> + if [ ! -e "${TLOB_MONITOR}" ]; then
> + tap_skip "uprobe_duplicate_offset" "monitor file not available"
> + cleanup; return
> + fi
> + if [ ! -x "${UPROBE_TARGET}" ]; then
> + tap_skip "uprobe_duplicate_offset" \
> + "tlob_uprobe_target not found or not executable"
> + cleanup; return
> + fi
> +
> + busy_offset=$(get_uprobe_offset "${UPROBE_TARGET}" "tlob_busy_work")
> + stop_offset=$(get_uprobe_offset "${UPROBE_TARGET}" "tlob_busy_work_done")
> + if [ -z "${busy_offset}" ] || [ -z "${stop_offset}" ]; then
> + tap_skip "uprobe_duplicate_offset" \
> + "cannot resolve tlob_busy_work/tlob_busy_work_done offset"
> + cleanup; return
> + fi
> +
> + tlob_enable
> +
> + # First binding: should succeed
> + if ! echo "5000000:${busy_offset}:${stop_offset}:${UPROBE_TARGET}" \
> + > "${TLOB_MONITOR}" 2>/dev/null; then
> + tap_skip "uprobe_duplicate_offset" \
> + "uprobe binding rejected (CONFIG_UPROBES=y needed)"
> + cleanup; return
> + fi
> +
> + # Second binding with same offset_start: must be rejected
> + if echo "9999:${busy_offset}:${stop_offset}:${UPROBE_TARGET}" \
> + > "${TLOB_MONITOR}" 2>/dev/null; then
> + echo "-${busy_offset}:${UPROBE_TARGET}" > "${TLOB_MONITOR}" 2>/dev/null
> + tap_fail "uprobe_duplicate_offset" \
> + "duplicate offset_start was accepted (expected error)"
> + cleanup; return
> + fi
> +
> + echo "-${busy_offset}:${UPROBE_TARGET}" > "${TLOB_MONITOR}" 2>/dev/null
> + tap_pass "uprobe_duplicate_offset"
> + cleanup
> +}
> +
> +
> +#
> +# Region A: tlob_busy_work with a 5 s budget - should NOT fire during the test.
> +# Region B: tlob_busy_work_done with a 10 us budget - SHOULD fire quickly since
> +# tlob_uprobe_target calls tlob_busy_work_done after a busy spin.
> +#
> +# Verifies that independent bindings for different offsets in the same binary
> +# are tracked separately and that only the tight-budget binding triggers a
> +# budget_exceeded event.
> +# ---------------------------------------------------------------------------
> +run_test_uprobe_independent_thresholds() {
> + next_test; cleanup
> + if [ ! -e "${TLOB_MONITOR}" ]; then
> + tap_skip "uprobe_independent_thresholds" \
> + "monitor file not available"; cleanup; return
> + fi
> + if [ ! -x "${UPROBE_TARGET}" ]; then
> + tap_skip "uprobe_independent_thresholds" \
> + "tlob_uprobe_target not found or not executable"
> + cleanup; return
> + fi
> +
> + busy_offset=$(get_uprobe_offset "${UPROBE_TARGET}" "tlob_busy_work")
> + busy_stop_offset=$(get_uprobe_offset "${UPROBE_TARGET}" "tlob_busy_work_done")
> + if [ -z "${busy_offset}" ] || [ -z "${busy_stop_offset}" ]; then
> + tap_skip "uprobe_independent_thresholds" \
> + "cannot resolve tlob_busy_work/tlob_busy_work_done offset"
> + cleanup; return
> + fi
> +
> + "${UPROBE_TARGET}" 30000 &
> + busy_pid=$!
> + sleep 0.05
> +
> + trace_event_enable
> + trace_on
> + tlob_enable
> + trace_clear
> +
> + # Region A: generous 5 s budget on tlob_busy_work entry (should not fire)
> + if ! echo "5000000:${busy_offset}:${busy_stop_offset}:${UPROBE_TARGET}" \
> + > "${TLOB_MONITOR}" 2>/dev/null; then
> + kill "${busy_pid}" 2>/dev/null; wait "${busy_pid}" 2>/dev/null
> + tap_skip "uprobe_independent_thresholds" \
> + "uprobe binding rejected (CONFIG_UPROBES=y needed)"
> + cleanup; return
> + fi
> + # Region B: tight 10 us budget on tlob_busy_work_done (fires quickly)
> + echo "10:${busy_stop_offset}:${busy_stop_offset}:${UPROBE_TARGET}" \
> + > "${TLOB_MONITOR}" 2>/dev/null
> +
> + found=0; i=0
> + while [ "$i" -lt 20 ]; do
> + sleep 0.1
> + trace_grep "budget_exceeded" && { found=1; break; }
> + i=$((i+1))
> + done
> +
> + echo "-${busy_offset}:${UPROBE_TARGET}" > "${TLOB_MONITOR}" 2>/dev/null
> + echo "-${busy_stop_offset}:${UPROBE_TARGET}" > "${TLOB_MONITOR}" 2>/dev/null
> + kill "${busy_pid}" 2>/dev/null; wait "${busy_pid}" 2>/dev/null
> +
> + if [ "${found}" != "1" ]; then
> + tap_fail "uprobe_independent_thresholds" \
> + "budget_exceeded not raised for tight-budget region within 2 s"
> + cleanup; return
> + fi
> +
> + # The violation must carry threshold=10 (Region B's budget).
> + ev=$(grep "budget_exceeded" "${TRACE_FILE}" | head -n 1)
> + if ! echo "${ev}" | grep -q "threshold=10 "; then
> + tap_fail "uprobe_independent_thresholds" \
> + "violation threshold is not Region B's 10 us: ${ev}"
> + cleanup; return
> + fi
> + tap_pass "uprobe_independent_thresholds"
> + cleanup
> +}
> +
> +# ---------------------------------------------------------------------------
> +# ioctl tests via tlob_helper
> +#
> +# Each test invokes the helper with a sub-test name.
> +# Exit code: 0=pass, 1=fail, 2=skip.
> +# ---------------------------------------------------------------------------
> +run_ioctl_test() {
> + testname=$1
> + next_test
> +
> + if [ ! -x "${IOCTL_HELPER}" ]; then
> + tap_skip "ioctl_${testname}" \
> + "tlob_helper not found or not executable"
> + return
> + fi
> + if [ ! -c "${RV_DEV}" ]; then
> + tap_skip "ioctl_${testname}" \
> + "${RV_DEV} not present (CONFIG_RV_CHARDEV=y needed)"
> + return
> + fi
> +
> + tlob_enable
> + "${IOCTL_HELPER}" "${testname}"
> + rc=$?
> + tlob_disable
> +
> + case "${rc}" in
> + 0) tap_pass "ioctl_${testname}" ;;
> + 2) tap_skip "ioctl_${testname}" "helper returned skip" ;;
> + *) tap_fail "ioctl_${testname}" "helper exited with code ${rc}" ;;
> + esac
> +}
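The helper's exit-code convention (0=pass, 1=fail, 2=skip) maps onto TAP result lines roughly as sketched below. helper_rc_to_tap is a hypothetical name, and the strings actually printed by tap_pass, tap_skip and tap_fail in this script may well differ. The sketch only illustrates the mapping, which run_ioctl_test and run_ioctl_test_not_enabled could share.

```shell
# Hypothetical helper (not part of the patch): translate a helper exit code
# into a TAP result line. Test numbers are omitted, which TAP permits.
# Usage: helper_rc_to_tap <test-name> <exit-code>
helper_rc_to_tap() {
	name=$1; rc=$2
	case "$rc" in
	0) echo "ok - ${name}" ;;
	2) echo "ok - ${name} # SKIP helper returned skip" ;;
	*) echo "not ok - ${name} # helper exited with code ${rc}" ;;
	esac
}
```

Any exit code other than 0 or 2 is treated as a failure, so a helper killed by a signal (status above 128) is also reported as "not ok".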
> +
> +# run_ioctl_test_not_enabled - like run_ioctl_test but deliberately does NOT
> +# enable the tlob monitor before invoking the helper. Used to verify that
> +# ioctls issued against a disabled monitor return ENODEV rather than crashing
> +# the kernel with a NULL pointer dereference.
> +run_ioctl_test_not_enabled()
> +{
> + next_test
> +
> + if [ ! -x "${IOCTL_HELPER}" ]; then
> + tap_skip "ioctl_not_enabled" \
> + "tlob_helper not found or not executable"
> + return
> + fi
> + if [ ! -c "${RV_DEV}" ]; then
> + tap_skip "ioctl_not_enabled" \
> + "${RV_DEV} not present (CONFIG_RV_CHARDEV=y needed)"
> + return
> + fi
> +
> + # Monitor intentionally left disabled.
> + tlob_disable
> + "${IOCTL_HELPER}" not_enabled
> + rc=$?
> +
> + case "${rc}" in
> + 0) tap_pass "ioctl_not_enabled" ;;
> + 2) tap_skip "ioctl_not_enabled" "helper returned skip" ;;
> + *) tap_fail "ioctl_not_enabled" "helper exited with code ${rc}" ;;
> + esac
> +}
> +
> +# ---------------------------------------------------------------------------
> +# Main
> +# ---------------------------------------------------------------------------
> +check_root; check_tracefs; check_rv_dir; check_tlob
> +tap_header; tap_plan 20
> +
> +# tracefs interface tests
> +run_test_enable_disable
> +run_test_tracefs_files
> +
> +# uprobe external monitoring tests
> +run_test_uprobe_no_false_positive
> +run_test_uprobe_violation
> +run_test_uprobe_unbind
> +run_test_uprobe_duplicate_offset
> +run_test_uprobe_independent_thresholds
> +
> +# /dev/rv ioctl self-instrumentation tests
> +run_ioctl_test_not_enabled
> +run_ioctl_test within_budget
> +run_ioctl_test over_budget_cpu
> +run_ioctl_test over_budget_sleep
> +run_ioctl_test double_start
> +run_ioctl_test stop_no_start
> +run_ioctl_test multi_thread
> +run_ioctl_test self_watch
> +run_ioctl_test invalid_flags
> +run_ioctl_test notify_fd_bad
> +run_ioctl_test mmap_basic
> +run_ioctl_test mmap_errors
> +run_ioctl_test mmap_consume
> +
> +echo "# Passed: ${t_pass} Failed: ${t_fail} Skipped: ${t_skip}"
> +[ "${t_fail}" -gt 0 ] && exit 1 || exit 0
> diff --git a/tools/testing/selftests/rv/tlob_helper.c b/tools/testing/selftests/rv/tlob_helper.c
> new file mode 100644
> index 000000000..cd76b56d1
> --- /dev/null
> +++ b/tools/testing/selftests/rv/tlob_helper.c
> @@ -0,0 +1,994 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * tlob_helper.c - test helper and ELF utility for tlob selftests
> + *
> + * Called by test_tlob.sh to exercise the /dev/rv ioctl interface and to
> + * resolve ELF symbol offsets for uprobe bindings. One subcommand per
> + * invocation so the shell script can report each as an independent TAP
> + * test case.
> + *
> + * Usage: tlob_helper <subcommand> [args...]
> + *
> + * Synchronous TRACE_START / TRACE_STOP tests:
> + * not_enabled - TRACE_START without tlob enabled -> ENODEV (no kernel crash)
> + * within_budget - start(50000 us), sleep 10 ms, stop -> expect 0
> + * over_budget_cpu - start(5000 us), busyspin 100 ms, stop -> EOVERFLOW
> + * over_budget_sleep - start(3000 us), sleep 50 ms, stop -> EOVERFLOW
> + *
> + * Error-handling tests:
> + * double_start - two starts without stop -> EEXIST on second
> + * stop_no_start - stop without start -> ESRCH
> + *
> + * Per-thread isolation test:
> + * multi_thread - two threads share one fd; one within budget, one over
> + *
> + * Asynchronous notification test (notify_fd + read()):
> + * self_watch - one worker exceeds budget; monitor fd receives one ntf via read()
> + *
> + * Input-validation tests (TRACE_START error paths):
> + * invalid_flags - TRACE_START with flags != 0 -> EINVAL
> + * notify_fd_bad - TRACE_START with notify_fd = stdout (non-rv fd) -> EINVAL
> + *
> + * mmap ring buffer tests (Scenario D):
> + * mmap_basic - mmap succeeds; verify tlob_mmap_page fields
> + * (version, capacity, data_offset, record_size)
> + * mmap_errors - MAP_PRIVATE, wrong size, and non-zero pgoff all
> + * return EINVAL
> + * mmap_consume - trigger a real violation via self-notification and
> + * consume the event through the mmap'd ring
> + *
> + * ELF utility (does not require /dev/rv):
> + * sym_offset <binary> <symbol>
> + * - print the ELF file offset of <symbol> in <binary>
> + * (used by the shell script to build uprobe bindings)
> + *
> + * Exit code: 0 = pass, 1 = fail, 2 = skip (device not available).
> + */
> +#define _GNU_SOURCE
> +#include <elf.h>
> +#include <errno.h>
> +#include <fcntl.h>
> +#include <poll.h>
> +#include <pthread.h>
> +#include <stdint.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <sys/mman.h>
> +#include <sys/stat.h>
> +#include <time.h>
> +#include <unistd.h>
> +
> +#include <linux/rv.h>
> +
> +/* Default ring capacity allocated at open(); must match the kernel's TLOB_RING_DEFAULT_CAP. */
> +#define TLOB_RING_DEFAULT_CAP 64U
> +
> +static int rv_fd = -1;
> +
> +static int open_rv(void)
> +{
> + rv_fd = open("/dev/rv", O_RDWR);
> + if (rv_fd < 0) {
> + fprintf(stderr, "open /dev/rv: %s\n", strerror(errno));
> + return -1;
> + }
> + return 0;
> +}
> +
> +static void busy_spin_us(unsigned long us)
> +{
> + struct timespec start, now;
> + unsigned long elapsed;
> +
> + clock_gettime(CLOCK_MONOTONIC, &start);
> + do {
> + clock_gettime(CLOCK_MONOTONIC, &now);
> + elapsed = (unsigned long)(now.tv_sec - start.tv_sec)
> + * 1000000000UL
> + + (unsigned long)(now.tv_nsec - start.tv_nsec);
> + } while (elapsed < us * 1000UL);
> +}
> +
> +static int do_start(uint64_t threshold_us)
> +{
> + struct tlob_start_args args = {
> + .threshold_us = threshold_us,
> + .notify_fd = -1,
> + };
> +
> + return ioctl(rv_fd, TLOB_IOCTL_TRACE_START, &args);
> +}
> +
> +static int do_stop(void)
> +{
> + return ioctl(rv_fd, TLOB_IOCTL_TRACE_STOP, NULL);
> +}
> +
> +/* -----------------------------------------------------------------------
> + * Synchronous TRACE_START / TRACE_STOP tests
> + * -----------------------------------------------------------------------
> + */
> +
> +/*
> + * test_not_enabled - TRACE_START must return ENODEV when the tlob monitor
> + * has not been enabled (tlob_state_cache is NULL).
> + *
> + * The shell wrapper deliberately does NOT call tlob_enable before invoking
> + * this subcommand, so the ioctl is expected to fail with ENODEV rather than
> + * crashing the kernel with a NULL pointer dereference in kmem_cache_alloc.
> + */
> +static int test_not_enabled(void)
> +{
> + int ret;
> +
> + ret = do_start(1000);
> + if (ret == 0) {
> + fprintf(stderr, "TRACE_START: expected ENODEV, got success\n");
> + do_stop();
> + return 1;
> + }
> + if (errno != ENODEV) {
> + fprintf(stderr, "TRACE_START: expected ENODEV, got %s\n",
> + strerror(errno));
> + return 1;
> + }
> + return 0;
> +}
> +
> +static int test_within_budget(void)
> +{
> + int ret;
> +
> + if (do_start(50000) < 0) {
> + fprintf(stderr, "TRACE_START: %s\n", strerror(errno));
> + return 1;
> + }
> + usleep(10000); /* 10 ms < 50 ms budget */
> + ret = do_stop();
> + if (ret != 0) {
> + fprintf(stderr, "TRACE_STOP: expected 0, got %d errno=%s\n",
> + ret, strerror(errno));
> + return 1;
> + }
> + return 0;
> +}
> +
> +static int test_over_budget_cpu(void)
> +{
> + int ret;
> +
> + if (do_start(5000) < 0) {
> + fprintf(stderr, "TRACE_START: %s\n", strerror(errno));
> + return 1;
> + }
> + busy_spin_us(100000); /* 100 ms >> 5 ms budget */
> + ret = do_stop();
> + if (ret == 0) {
> + fprintf(stderr, "TRACE_STOP: expected EOVERFLOW, got 0\n");
> + return 1;
> + }
> + if (errno != EOVERFLOW) {
> + fprintf(stderr, "TRACE_STOP: expected EOVERFLOW, got %s\n",
> + strerror(errno));
> + return 1;
> + }
> + return 0;
> +}
> +
> +static int test_over_budget_sleep(void)
> +{
> + int ret;
> +
> + if (do_start(3000) < 0) {
> + fprintf(stderr, "TRACE_START: %s\n", strerror(errno));
> + return 1;
> + }
> + usleep(50000); /* 50 ms >> 3 ms budget, off-CPU time counts */
> + ret = do_stop();
> + if (ret == 0) {
> + fprintf(stderr, "TRACE_STOP: expected EOVERFLOW, got 0\n");
> + return 1;
> + }
> + if (errno != EOVERFLOW) {
> + fprintf(stderr, "TRACE_STOP: expected EOVERFLOW, got %s\n",
> + strerror(errno));
> + return 1;
> + }
> + return 0;
> +}
> +
> +/* -----------------------------------------------------------------------
> + * Error-handling tests
> + * -----------------------------------------------------------------------
> + */
> +
> +static int test_double_start(void)
> +{
> + int ret;
> +
> + if (do_start(10000000) < 0) {
> + fprintf(stderr, "first TRACE_START: %s\n", strerror(errno));
> + return 1;
> + }
> + ret = do_start(10000000);
> + if (ret == 0) {
> + fprintf(stderr, "second TRACE_START: expected EEXIST, got 0\n");
> + do_stop();
> + return 1;
> + }
> + if (errno != EEXIST) {
> + fprintf(stderr, "second TRACE_START: expected EEXIST, got %s\n",
> + strerror(errno));
> + do_stop();
> + return 1;
> + }
> + do_stop(); /* clean up */
> + return 0;
> +}
> +
> +static int test_stop_no_start(void)
> +{
> + int ret;
> +
> + /* Ensure clean state: ignore error from a stale entry */
> + do_stop();
> +
> + ret = do_stop();
> + if (ret == 0) {
> + fprintf(stderr, "TRACE_STOP: expected ESRCH, got 0\n");
> + return 1;
> + }
> + if (errno != ESRCH) {
> + fprintf(stderr, "TRACE_STOP: expected ESRCH, got %s\n",
> + strerror(errno));
> + return 1;
> + }
> + return 0;
> +}
> +
> +/* -----------------------------------------------------------------------
> + * Per-thread isolation test
> + *
> + * Two threads share a single /dev/rv fd. The monitor uses task_struct *
> + * as the key, so each thread gets an independent slot regardless of the
> + * shared fd.
> + * -----------------------------------------------------------------------
> + */
> +
> +struct mt_thread_args {
> + uint64_t threshold_us;
> + unsigned long workload_us;
> + int busy;
> + int expect_eoverflow;
> + int result;
> +};
> +
> +static void *mt_thread_fn(void *arg)
> +{
> + struct mt_thread_args *a = arg;
> + int ret;
> +
> + if (do_start(a->threshold_us) < 0) {
> + fprintf(stderr, "thread TRACE_START: %s\n", strerror(errno));
> + a->result = 1;
> + return NULL;
> + }
> +
> + if (a->busy)
> + busy_spin_us(a->workload_us);
> + else
> + usleep(a->workload_us);
> +
> + ret = do_stop();
> + if (a->expect_eoverflow) {
> + if (ret == 0 || errno != EOVERFLOW) {
> + fprintf(stderr, "thread: expected EOVERFLOW, got ret=%d errno=%s\n",
> + ret, strerror(errno));
> + a->result = 1;
> + return NULL;
> + }
> + } else {
> + if (ret != 0) {
> + fprintf(stderr, "thread: expected 0, got ret=%d errno=%s\n",
> + ret, strerror(errno));
> + a->result = 1;
> + return NULL;
> + }
> + }
> + a->result = 0;
> + return NULL;
> +}
> +
> +static int test_multi_thread(void)
> +{
> + pthread_t ta, tb;
> + struct mt_thread_args a = {
> + .threshold_us = 20000, /* 20 ms */
> + .workload_us = 5000, /* 5 ms sleep -> within budget */
> + .busy = 0,
> + .expect_eoverflow = 0,
> + };
> + struct mt_thread_args b = {
> + .threshold_us = 3000, /* 3 ms */
> + .workload_us = 30000, /* 30 ms spin -> over budget */
> + .busy = 1,
> + .expect_eoverflow = 1,
> + };
> +
> + pthread_create(&ta, NULL, mt_thread_fn, &a);
> + pthread_create(&tb, NULL, mt_thread_fn, &b);
> + pthread_join(ta, NULL);
> + pthread_join(tb, NULL);
> +
> + return (a.result || b.result) ? 1 : 0;
> +}
> +
> +/* -----------------------------------------------------------------------
> + * Asynchronous notification test (notify_fd + read())
> + *
> + * A dedicated monitor_fd is opened by the main thread. Two worker threads
> + * each open their own work_fd and call TLOB_IOCTL_TRACE_START with
> + * notify_fd = monitor_fd, nominating it as the violation target. Worker A
> + * stays within budget; worker B exceeds it. The main thread reads from
> + * monitor_fd and expects exactly one tlob_event record.
> + * -----------------------------------------------------------------------
> + */
> +
> +struct sw_worker_args {
> + int monitor_fd;
> + uint64_t threshold_us;
> + unsigned long workload_us;
> + int busy;
> + int result;
> +};
> +
> +static void *sw_worker_fn(void *arg)
> +{
> + struct sw_worker_args *a = arg;
> + struct tlob_start_args args = {
> + .threshold_us = a->threshold_us,
> + .notify_fd = a->monitor_fd,
> + };
> + int work_fd;
> + int ret;
> +
> + work_fd = open("/dev/rv", O_RDWR);
> + if (work_fd < 0) {
> + fprintf(stderr, "worker open /dev/rv: %s\n", strerror(errno));
> + a->result = 1;
> + return NULL;
> + }
> +
> + ret = ioctl(work_fd, TLOB_IOCTL_TRACE_START, &args);
> + if (ret < 0) {
> + fprintf(stderr, "TRACE_START (notify): %s\n", strerror(errno));
> + close(work_fd);
> + a->result = 1;
> + return NULL;
> + }
> +
> + if (a->busy)
> + busy_spin_us(a->workload_us);
> + else
> + usleep(a->workload_us);
> +
> + ioctl(work_fd, TLOB_IOCTL_TRACE_STOP, NULL);
> + close(work_fd);
> + a->result = 0;
> + return NULL;
> +}
> +
> +static int test_self_watch(void)
> +{
> + int monitor_fd;
> + pthread_t ta, tb;
> + struct sw_worker_args a = {
> + .threshold_us = 50000, /* 50 ms */
> + .workload_us = 5000, /* 5 ms sleep -> no violation */
> + .busy = 0,
> + };
> + struct sw_worker_args b = {
> + .threshold_us = 3000, /* 3 ms */
> + .workload_us = 30000, /* 30 ms spin -> violation */
> + .busy = 1,
> + };
> + struct tlob_event ntfs[8];
> + int violations = 0;
> + ssize_t n;
> +
> + /*
> + * Open monitor_fd with O_NONBLOCK so read() after the workers finish
> + * returns immediately rather than blocking forever.
> + */
> + monitor_fd = open("/dev/rv", O_RDWR | O_NONBLOCK);
> + if (monitor_fd < 0) {
> + fprintf(stderr, "open /dev/rv (monitor_fd): %s\n", strerror(errno));
> + return 1;
> + }
> + a.monitor_fd = monitor_fd;
> + b.monitor_fd = monitor_fd;
> +
> + pthread_create(&ta, NULL, sw_worker_fn, &a);
> + pthread_create(&tb, NULL, sw_worker_fn, &b);
> + pthread_join(ta, NULL);
> + pthread_join(tb, NULL);
> +
> + if (a.result || b.result) {
> + close(monitor_fd);
> + return 1;
> + }
> +
> + /*
> + * Drain all available tlob_event records. With O_NONBLOCK the final
> + * read() fails with EAGAIN when the buffer is empty.
> + */
> + while ((n = read(monitor_fd, ntfs, sizeof(ntfs))) > 0)
> + violations += (int)(n / sizeof(struct tlob_event));
> +
> + close(monitor_fd);
> +
> + if (violations != 1) {
> + fprintf(stderr, "self_watch: expected 1 violation, got %d\n",
> + violations);
> + return 1;
> + }
> + return 0;
> +}
> +
> +/* -----------------------------------------------------------------------
> + * Input-validation tests (TRACE_START error paths)
> + * -----------------------------------------------------------------------
> + */
> +
> +/*
> + * test_invalid_flags - TRACE_START with flags != 0 must return EINVAL.
> + *
> + * The flags field is reserved for future extensions and must be zero.
> + * Callers that set it to a non-zero value are rejected early so that a
> + * future kernel can assign meaning to those bits without silently
> + * ignoring them.
> + */
> +static int test_invalid_flags(void)
> +{
> + struct tlob_start_args args = {
> + .threshold_us = 1000,
> + .notify_fd = -1,
> + .flags = 1, /* non-zero: must be rejected */
> + };
> + int ret;
> +
> + ret = ioctl(rv_fd, TLOB_IOCTL_TRACE_START, &args);
> + if (ret == 0) {
> + fprintf(stderr, "TRACE_START(flags=1): expected EINVAL, got success\n");
> + do_stop();
> + return 1;
> + }
> + if (errno != EINVAL) {
> + fprintf(stderr, "TRACE_START(flags=1): expected EINVAL, got %s\n",
> + strerror(errno));
> + return 1;
> + }
> + return 0;
> +}
> +
> +/*
> + * test_notify_fd_bad - TRACE_START with a non-/dev/rv notify_fd must return
> + * EINVAL.
> + *
> + * When notify_fd >= 0, the kernel resolves it to a struct file and checks
> + * that its private_data is non-NULL (i.e. it is a /dev/rv file descriptor).
> + * Passing stdout (fd 1) supplies a real, open fd whose private_data is NULL,
> + * so the kernel must reject it with EINVAL.
> + */
> +static int test_notify_fd_bad(void)
> +{
> + struct tlob_start_args args = {
> + .threshold_us = 1000,
> + .notify_fd = STDOUT_FILENO, /* open but not a /dev/rv fd */
> + .flags = 0,
> + };
> + int ret;
> +
> + ret = ioctl(rv_fd, TLOB_IOCTL_TRACE_START, &args);
> + if (ret == 0) {
> + fprintf(stderr,
> + "TRACE_START(notify_fd=stdout): expected EINVAL, got success\n");
> + do_stop();
> + return 1;
> + }
> + if (errno != EINVAL) {
> + fprintf(stderr,
> + "TRACE_START(notify_fd=stdout): expected EINVAL, got %s\n",
> + strerror(errno));
> + return 1;
> + }
> + return 0;
> +}
> +
> +/* -----------------------------------------------------------------------
> + * mmap ring buffer tests (Scenario D)
> + * -----------------------------------------------------------------------
> + */
> +
> +/*
> + * test_mmap_basic - mmap the ring buffer and verify the control page fields.
> + *
> + * The kernel allocates TLOB_RING_DEFAULT_CAP records at open(). A shared
> + * mmap of PAGE_SIZE + cap * record_size must succeed and the tlob_mmap_page
> + * header must contain consistent values.
> + */
> +static int test_mmap_basic(void)
> +{
> + long pagesize = sysconf(_SC_PAGESIZE);
> + size_t mmap_len = (size_t)pagesize +
> + TLOB_RING_DEFAULT_CAP * sizeof(struct tlob_event);
> + /* rv_mmap requires a page-aligned length */
> + mmap_len = (mmap_len + (size_t)(pagesize - 1)) & ~(size_t)(pagesize - 1);
> + struct tlob_mmap_page *page;
> + struct tlob_event *data;
> + void *map;
> + int ret = 0;
> +
> + map = mmap(NULL, mmap_len, PROT_READ | PROT_WRITE, MAP_SHARED, rv_fd, 0);
> + if (map == MAP_FAILED) {
> + fprintf(stderr, "mmap_basic: mmap: %s\n", strerror(errno));
> + return 1;
> + }
> +
> + page = (struct tlob_mmap_page *)map;
> + data = (struct tlob_event *)((char *)map + page->data_offset);
> +
> + if (page->version != 1) {
> + fprintf(stderr, "mmap_basic: expected version=1, got %u\n",
> + page->version);
> + ret = 1;
> + goto out;
> + }
> + if (page->capacity != TLOB_RING_DEFAULT_CAP) {
> + fprintf(stderr, "mmap_basic: expected capacity=%u, got %u\n",
> + TLOB_RING_DEFAULT_CAP, page->capacity);
> + ret = 1;
> + goto out;
> + }
> + if (page->data_offset != (uint32_t)pagesize) {
> + fprintf(stderr, "mmap_basic: expected data_offset=%ld, got %u\n",
> + pagesize, page->data_offset);
> + ret = 1;
> + goto out;
> + }
> + if (page->record_size != sizeof(struct tlob_event)) {
> + fprintf(stderr, "mmap_basic: expected record_size=%zu, got %u\n",
> + sizeof(struct tlob_event), page->record_size);
> + ret = 1;
> + goto out;
> + }
> + if (page->data_head != 0 || page->data_tail != 0) {
> + fprintf(stderr, "mmap_basic: ring not empty at open: head=%u tail=%u\n",
> + page->data_head, page->data_tail);
> + ret = 1;
> + goto out;
> + }
> + /* Touch the data array to confirm it is accessible. */
> + (void)data[0].tid;
> +out:
> + munmap(map, mmap_len);
> + return ret;
> +}
> +
> +/*
> + * test_mmap_errors - verify that rv_mmap() rejects invalid mmap parameters.
> + *
> + * Four cases are tested, each must return MAP_FAILED with errno == EINVAL:
> + * 1. size one page short of the correct ring length
> + * 2. size one page larger than the correct ring length
> + * 3. MAP_PRIVATE (only MAP_SHARED is permitted)
> + * 4. non-zero vm_pgoff (offset must be 0)
> + */
> +static int test_mmap_errors(void)
> +{
> + long pagesize = sysconf(_SC_PAGESIZE);
> + size_t correct_len = (size_t)pagesize +
> + TLOB_RING_DEFAULT_CAP * sizeof(struct tlob_event);
> + /* rv_mmap requires a page-aligned length */
> + correct_len = (correct_len + (size_t)(pagesize - 1)) & ~(size_t)(pagesize - 1);
> + void *map;
> + int ret = 0;
> +
> + /*
> + * Case 1: size one page short. A whole page is subtracted because
> + * correct_len - 1 would still round up to correct_len.
> + */
> + map = mmap(NULL, correct_len - (size_t)pagesize, PROT_READ | PROT_WRITE,
> + MAP_SHARED, rv_fd, 0);
> + if (map != MAP_FAILED) {
> + fprintf(stderr, "mmap_errors: short-size mmap succeeded (expected EINVAL)\n");
> + munmap(map, correct_len - (size_t)pagesize);
> + ret = 1;
> + } else if (errno != EINVAL) {
> + fprintf(stderr, "mmap_errors: short-size: expected EINVAL, got %s\n",
> + strerror(errno));
> + ret = 1;
> + }
> +
> + /* Case 2: size one page too large */
> + map = mmap(NULL, correct_len + (size_t)pagesize, PROT_READ | PROT_WRITE,
> + MAP_SHARED, rv_fd, 0);
> + if (map != MAP_FAILED) {
> + fprintf(stderr, "mmap_errors: oversized mmap succeeded (expected EINVAL)\n");
> + munmap(map, correct_len + (size_t)pagesize);
> + ret = 1;
> + } else if (errno != EINVAL) {
> + fprintf(stderr, "mmap_errors: oversized: expected EINVAL, got %s\n",
> + strerror(errno));
> + ret = 1;
> + }
> +
> + /* Case 3: MAP_PRIVATE instead of MAP_SHARED */
> + map = mmap(NULL, correct_len, PROT_READ | PROT_WRITE,
> + MAP_PRIVATE, rv_fd, 0);
> + if (map != MAP_FAILED) {
> + fprintf(stderr, "mmap_errors: MAP_PRIVATE succeeded (expected EINVAL)\n");
> + munmap(map, correct_len);
> + ret = 1;
> + } else if (errno != EINVAL) {
> + fprintf(stderr, "mmap_errors: MAP_PRIVATE: expected EINVAL, got %s\n",
> + strerror(errno));
> + ret = 1;
> + }
> +
> + /* Case 4: non-zero file offset (pgoff = 1) */
> + map = mmap(NULL, correct_len, PROT_READ | PROT_WRITE,
> + MAP_SHARED, rv_fd, (off_t)pagesize);
> + if (map != MAP_FAILED) {
> + fprintf(stderr, "mmap_errors: non-zero pgoff mmap succeeded (expected EINVAL)\n");
> + munmap(map, correct_len);
> + ret = 1;
> + } else if (errno != EINVAL) {
> + fprintf(stderr, "mmap_errors: non-zero pgoff: expected EINVAL, got %s\n",
> + strerror(errno));
> + ret = 1;
> + }
> +
> + return ret;
> +}
> +
> +/*
> + * test_mmap_consume - zero-copy consumption of a real violation event.
> + *
> + * Arms a 5 ms budget with self-notification (notify_fd = rv_fd), sleeps
> + * 50 ms (off-CPU violation), then reads the pushed event through the mmap'd
> + * ring without calling read(). Verifies:
> + * - TRACE_STOP returns EOVERFLOW (budget was exceeded)
> + * - data_head == 1 after the violation
> + * - the event fields (threshold_us, tag, tid) are correct
> + * - data_tail can be advanced to consume the record (ring empties)
> + */
> +static int test_mmap_consume(void)
> +{
> + long pagesize = sysconf(_SC_PAGESIZE);
> + size_t mmap_len = (size_t)pagesize +
> + TLOB_RING_DEFAULT_CAP * sizeof(struct tlob_event);
> + /* rv_mmap requires a page-aligned length */
> + mmap_len = (mmap_len + (size_t)(pagesize - 1)) & ~(size_t)(pagesize - 1);
> + struct tlob_start_args args = {
> + .threshold_us = 5000, /* 5 ms */
> + .notify_fd = rv_fd, /* self-notification */
> + .tag = 0xdeadbeefULL,
> + .flags = 0,
> + };
> + struct tlob_mmap_page *page;
> + struct tlob_event *data;
> + void *map;
> + int stop_ret;
> + int ret = 0;
> +
> + map = mmap(NULL, mmap_len, PROT_READ | PROT_WRITE, MAP_SHARED, rv_fd, 0);
> + if (map == MAP_FAILED) {
> + fprintf(stderr, "mmap_consume: mmap: %s\n", strerror(errno));
> + return 1;
> + }
> +
> + page = (struct tlob_mmap_page *)map;
> + data = (struct tlob_event *)((char *)map + page->data_offset);
> +
> + if (ioctl(rv_fd, TLOB_IOCTL_TRACE_START, &args) < 0) {
> + fprintf(stderr, "mmap_consume: TRACE_START: %s\n", strerror(errno));
> + ret = 1;
> + goto out;
> + }
> +
> + usleep(50000); /* 50 ms >> 5 ms budget -> off-CPU violation */
> +
> + stop_ret = ioctl(rv_fd, TLOB_IOCTL_TRACE_STOP, NULL);
> + if (stop_ret == 0) {
> + fprintf(stderr, "mmap_consume: TRACE_STOP returned 0, expected EOVERFLOW\n");
> + ret = 1;
> + goto out;
> + }
> + if (errno != EOVERFLOW) {
> + fprintf(stderr, "mmap_consume: TRACE_STOP: expected EOVERFLOW, got %s\n",
> + strerror(errno));
> + ret = 1;
> + goto out;
> + }
> +
> + /* Pairs with smp_store_release in tlob_event_push. */
> + if (__atomic_load_n(&page->data_head, __ATOMIC_ACQUIRE) != 1) {
> + fprintf(stderr, "mmap_consume: expected data_head=1, got %u\n",
> + page->data_head);
> + ret = 1;
> + goto out;
> + }
> + if (page->data_tail != 0) {
> + fprintf(stderr, "mmap_consume: expected data_tail=0, got %u\n",
> + page->data_tail);
> + ret = 1;
> + goto out;
> + }
> +
> + /* Verify record content */
> + if (data[0].threshold_us != 5000) {
> + fprintf(stderr, "mmap_consume: expected threshold_us=5000, got %llu\n",
> + (unsigned long long)data[0].threshold_us);
> + ret = 1;
> + goto out;
> + }
> + if (data[0].tag != 0xdeadbeefULL) {
> + fprintf(stderr, "mmap_consume: expected tag=0xdeadbeef, got %llx\n",
> + (unsigned long long)data[0].tag);
> + ret = 1;
> + goto out;
> + }
> + if (data[0].tid == 0) {
> + fprintf(stderr, "mmap_consume: tid is 0\n");
> + ret = 1;
> + goto out;
> + }
> +
> + /* Consume: advance data_tail and confirm ring is empty */
> + __atomic_store_n(&page->data_tail, 1U, __ATOMIC_RELEASE);
> + if (__atomic_load_n(&page->data_head, __ATOMIC_ACQUIRE) !=
> + __atomic_load_n(&page->data_tail, __ATOMIC_ACQUIRE)) {
> + fprintf(stderr, "mmap_consume: ring not empty after consume\n");
> + ret = 1;
> + }
> +
> +out:
> + munmap(map, mmap_len);
> + return ret;
> +}
> +
> +/* -----------------------------------------------------------------------
> + * ELF utility: sym_offset
> + *
> + * Print the ELF file offset of a symbol in a binary. Supports 32- and
> + * 64-bit ELF. Walks the section headers to find .symtab (falling back to
> + * .dynsym), then converts the symbol's virtual address to a file offset
> + * via the PT_LOAD program headers.
> + *
> + * Does not require /dev/rv; used by the shell script to build uprobe
> + * bindings of the form pid:threshold_us:offset_start:offset_stop:binary_path.
> + *
> + * Returns 0 on success (offset printed to stdout), 1 on failure.
> + * -----------------------------------------------------------------------
> + */
> +static int sym_offset(const char *binary, const char *symname)
> +{
> + int fd;
> + struct stat st;
> + void *map;
> + Elf64_Ehdr *ehdr;
> + Elf32_Ehdr *ehdr32;
> + int is64;
> + uint64_t sym_vaddr = 0;
> + int found = 0;
> + uint64_t file_offset = 0;
> +
> + fd = open(binary, O_RDONLY);
> + if (fd < 0) {
> + fprintf(stderr, "open %s: %s\n", binary, strerror(errno));
> + return 1;
> + }
> + if (fstat(fd, &st) < 0) {
> + close(fd);
> + return 1;
> + }
> + map = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
> + close(fd);
> + if (map == MAP_FAILED) {
> + fprintf(stderr, "mmap: %s\n", strerror(errno));
> + return 1;
> + }
> +
> + /* Identify ELF class */
> + ehdr = (Elf64_Ehdr *)map;
> + ehdr32 = (Elf32_Ehdr *)map;
> + if (st.st_size < 4 ||
> + ehdr->e_ident[EI_MAG0] != ELFMAG0 ||
> + ehdr->e_ident[EI_MAG1] != ELFMAG1 ||
> + ehdr->e_ident[EI_MAG2] != ELFMAG2 ||
> + ehdr->e_ident[EI_MAG3] != ELFMAG3) {
> + fprintf(stderr, "%s: not an ELF file\n", binary);
> + munmap(map, (size_t)st.st_size);
> + return 1;
> + }
> + is64 = (ehdr->e_ident[EI_CLASS] == ELFCLASS64);
> +
> + if (is64) {
> + /* Walk section headers to find .symtab or .dynsym */
> + Elf64_Shdr *shdrs = (Elf64_Shdr *)((char *)map + ehdr->e_shoff);
> + Elf64_Shdr *shstrtab_hdr = &shdrs[ehdr->e_shstrndx];
> + const char *shstrtab = (char *)map + shstrtab_hdr->sh_offset;
> + int si;
> +
> + /* Prefer .symtab; fall back to .dynsym */
> + for (int pass = 0; pass < 2 && !found; pass++) {
> + const char *target = pass ? ".dynsym" : ".symtab";
> +
> + for (si = 0; si < ehdr->e_shnum && !found; si++) {
> + Elf64_Shdr *sh = &shdrs[si];
> + const char *name = shstrtab + sh->sh_name;
> +
> + if (strcmp(name, target) != 0)
> + continue;
> +
> + Elf64_Shdr *strtab_sh = &shdrs[sh->sh_link];
> + const char *strtab = (char *)map + strtab_sh->sh_offset;
> + Elf64_Sym *syms = (Elf64_Sym *)((char *)map + sh->sh_offset);
> + uint64_t nsyms = sh->sh_size / sizeof(Elf64_Sym);
> + uint64_t j;
> +
> + for (j = 0; j < nsyms; j++) {
> + if (strcmp(strtab + syms[j].st_name, symname) == 0) {
> + sym_vaddr = syms[j].st_value;
> + found = 1;
> + break;
> + }
> + }
> + }
> + }
> +
> + if (!found) {
> + fprintf(stderr, "symbol '%s' not found in %s\n", symname, binary);
> + munmap(map, (size_t)st.st_size);
> + return 1;
> + }
> +
> + /* Convert vaddr to file offset via PT_LOAD segments */
> + Elf64_Phdr *phdrs = (Elf64_Phdr *)((char *)map + ehdr->e_phoff);
> + int pi;
> +
> + for (pi = 0; pi < ehdr->e_phnum; pi++) {
> + Elf64_Phdr *ph = &phdrs[pi];
> +
> + if (ph->p_type != PT_LOAD)
> + continue;
> + if (sym_vaddr >= ph->p_vaddr &&
> + sym_vaddr < ph->p_vaddr + ph->p_filesz) {
> + file_offset = sym_vaddr - ph->p_vaddr + ph->p_offset;
> + break;
> + }
> + }
> + } else {
> + /* 32-bit ELF */
> + Elf32_Shdr *shdrs = (Elf32_Shdr *)((char *)map + ehdr32->e_shoff);
> + Elf32_Shdr *shstrtab_hdr = &shdrs[ehdr32->e_shstrndx];
> + const char *shstrtab = (char *)map + shstrtab_hdr->sh_offset;
> + int si;
> + uint32_t sym_vaddr32 = 0;
> +
> + for (int pass = 0; pass < 2 && !found; pass++) {
> + const char *target = pass ? ".dynsym" : ".symtab";
> +
> + for (si = 0; si < ehdr32->e_shnum && !found; si++) {
> + Elf32_Shdr *sh = &shdrs[si];
> + const char *name = shstrtab + sh->sh_name;
> +
> + if (strcmp(name, target) != 0)
> + continue;
> +
> + Elf32_Shdr *strtab_sh = &shdrs[sh->sh_link];
> + const char *strtab = (char *)map + strtab_sh->sh_offset;
> + Elf32_Sym *syms = (Elf32_Sym *)((char *)map + sh->sh_offset);
> + uint32_t nsyms = sh->sh_size / sizeof(Elf32_Sym);
> + uint32_t j;
> +
> + for (j = 0; j < nsyms; j++) {
> + if (strcmp(strtab + syms[j].st_name, symname) == 0) {
> + sym_vaddr32 = syms[j].st_value;
> + found = 1;
> + break;
> + }
> + }
> + }
> + }
> +
> + if (!found) {
> + fprintf(stderr, "symbol '%s' not found in %s\n", symname, binary);
> + munmap(map, (size_t)st.st_size);
> + return 1;
> + }
> +
> + Elf32_Phdr *phdrs = (Elf32_Phdr *)((char *)map + ehdr32->e_phoff);
> + int pi;
> +
> + for (pi = 0; pi < ehdr32->e_phnum; pi++) {
> + Elf32_Phdr *ph = &phdrs[pi];
> +
> + if (ph->p_type != PT_LOAD)
> + continue;
> + if (sym_vaddr32 >= ph->p_vaddr &&
> + sym_vaddr32 < ph->p_vaddr + ph->p_filesz) {
> + file_offset = sym_vaddr32 - ph->p_vaddr + ph->p_offset;
> + break;
> + }
> + }
> + sym_vaddr = sym_vaddr32;
> + }
> +
> + munmap(map, (size_t)st.st_size);
> +
> + if (!file_offset && sym_vaddr) {
> + fprintf(stderr, "could not map vaddr 0x%lx to file offset\n",
> + (unsigned long)sym_vaddr);
> + return 1;
> + }
> +
> + printf("0x%lx\n", (unsigned long)file_offset);
> + return 0;
> +}
> +
> +int main(int argc, char *argv[])
> +{
> + int rc;
> +
> + if (argc < 2) {
> + fprintf(stderr, "Usage: %s <subcommand> [args...]\n", argv[0]);
> + return 1;
> + }
> +
> + /* sym_offset does not need /dev/rv */
> + if (strcmp(argv[1], "sym_offset") == 0) {
> + if (argc < 4) {
> + fprintf(stderr, "Usage: %s sym_offset <binary> <symbol>\n",
> + argv[0]);
> + return 1;
> + }
> + return sym_offset(argv[2], argv[3]);
> + }
> +
> + if (open_rv() < 0)
> + return 2; /* skip */
> +
> + if (strcmp(argv[1], "not_enabled") == 0)
> + rc = test_not_enabled();
> + else if (strcmp(argv[1], "within_budget") == 0)
> + rc = test_within_budget();
> + else if (strcmp(argv[1], "over_budget_cpu") == 0)
> + rc = test_over_budget_cpu();
> + else if (strcmp(argv[1], "over_budget_sleep") == 0)
> + rc = test_over_budget_sleep();
> + else if (strcmp(argv[1], "double_start") == 0)
> + rc = test_double_start();
> + else if (strcmp(argv[1], "stop_no_start") == 0)
> + rc = test_stop_no_start();
> + else if (strcmp(argv[1], "multi_thread") == 0)
> + rc = test_multi_thread();
> + else if (strcmp(argv[1], "self_watch") == 0)
> + rc = test_self_watch();
> + else if (strcmp(argv[1], "invalid_flags") == 0)
> + rc = test_invalid_flags();
> + else if (strcmp(argv[1], "notify_fd_bad") == 0)
> + rc = test_notify_fd_bad();
> + else if (strcmp(argv[1], "mmap_basic") == 0)
> + rc = test_mmap_basic();
> + else if (strcmp(argv[1], "mmap_errors") == 0)
> + rc = test_mmap_errors();
> + else if (strcmp(argv[1], "mmap_consume") == 0)
> + rc = test_mmap_consume();
> + else {
> + fprintf(stderr, "Unknown test: %s\n", argv[1]);
> + rc = 1;
> + }
> +
> + close(rv_fd);
> + return rc;
> +}
> diff --git a/tools/testing/selftests/rv/tlob_uprobe_target.c b/tools/testing/selftests/rv/tlob_uprobe_target.c
> new file mode 100644
> index 000000000..6c895cb40
> --- /dev/null
> +++ b/tools/testing/selftests/rv/tlob_uprobe_target.c
> @@ -0,0 +1,108 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * tlob_uprobe_target.c - uprobe target binary for tlob selftests.
> + *
> + * Provides two well-known probe points:
> + * tlob_busy_work() - start probe: arms the tlob budget timer
> + * tlob_busy_work_done() - stop probe: cancels the timer on completion
> + *
> + * The tlob selftest writes a five-field uprobe binding:
> + * pid:threshold_us:offset_start:offset_stop:binary_path
> + * where offset_start is the file offset of tlob_busy_work and offset_stop
> + * is the file offset of tlob_busy_work_done (resolved via tlob_helper
> + * sym_offset).
> + *
> + * Both probe points are plain entry uprobes (no uretprobe). The busy loop
> + * keeps the task on-CPU so that either the stop probe fires cleanly (within
> + * budget) or the hrtimer fires first and emits tlob_budget_exceeded (over
> + * budget).
> + *
> + * Usage: tlob_uprobe_target <duration_ms>
> + *
> + * Loops calling tlob_busy_work() in 200 ms iterations until <duration_ms>
> + * has elapsed (0 = run for ~24 hours). Short iterations ensure the uprobe
> + * entry fires on every call even if the uprobe is installed after the
> + * program has started.
> + */
> +#define _GNU_SOURCE
> +#include <stdint.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <time.h>
> +
> +#ifndef noinline
> +#define noinline __attribute__((noinline))
> +#endif
> +
> +static inline int timespec_before(const struct timespec *a,
> + const struct timespec *b)
> +{
> + return a->tv_sec < b->tv_sec ||
> + (a->tv_sec == b->tv_sec && a->tv_nsec < b->tv_nsec);
> +}
> +
> +static void timespec_add_ms(struct timespec *ts, unsigned long ms)
> +{
> + ts->tv_sec += ms / 1000;
> + ts->tv_nsec += (long)(ms % 1000) * 1000000L;
> + if (ts->tv_nsec >= 1000000000L) {
> + ts->tv_sec++;
> + ts->tv_nsec -= 1000000000L;
> + }
> +}
> +
> +/*
> + * tlob_busy_work_done - stop-probe target.
> + *
> + * Called by tlob_busy_work() after the busy loop. The uprobe on this
> + * function's entry fires tlob_stop_task(), cancelling the budget timer.
> + * noinline ensures the compiler never merges this function with its caller,
> + * guaranteeing the entry uprobe always fires.
> + */
> +noinline void tlob_busy_work_done(void)
> +{
> + /* empty: the uprobe fires on entry */
> +}
> +
> +/*
> + * tlob_busy_work - start-probe target.
> + *
> + * The uprobe on this function's entry fires tlob_start_task(), arming the
> + * budget timer. noinline prevents the compiler and linker (including LTO)
> + * from inlining this function into its callers, ensuring the entry uprobe
> + * fires on every call.
> + */
> +noinline void tlob_busy_work(unsigned long duration_ns)
> +{
> + struct timespec start, now;
> + unsigned long elapsed;
> +
> + clock_gettime(CLOCK_MONOTONIC, &start);
> + do {
> + clock_gettime(CLOCK_MONOTONIC, &now);
> + elapsed = (unsigned long)(now.tv_sec - start.tv_sec)
> + * 1000000000UL
> + + (unsigned long)(now.tv_nsec - start.tv_nsec);
> + } while (elapsed < duration_ns);
> +
> + tlob_busy_work_done();
> +}
> +
> +int main(int argc, char *argv[])
> +{
> + unsigned long duration_ms = 0;
> + struct timespec deadline, now;
> +
> + if (argc >= 2)
> + duration_ms = strtoul(argv[1], NULL, 10);
> +
> + clock_gettime(CLOCK_MONOTONIC, &deadline);
> + timespec_add_ms(&deadline, duration_ms ? duration_ms : 86400000UL);
> +
> + do {
> + tlob_busy_work(200 * 1000000UL); /* 200 ms per iteration */
> + clock_gettime(CLOCK_MONOTONIC, &now);
> + } while (timespec_before(&now, &deadline));
> +
> + return 0;
> +}
* [PATCH v3] tracing/osnoise: Add option to align tlat threads
From: Tomas Glozar @ 2026-04-16 11:59 UTC (permalink / raw)
To: Steven Rostedt, Masami Hiramatsu
Cc: Mathieu Desnoyers, John Kacur, Luis Goncalves, Crystal Wood,
Costa Shulyupin, Wander Lairson Costa, LKML, linux-trace-kernel,
Tomas Glozar
Add an option called TIMERLAT_ALIGN to osnoise/options, together with a
corresponding setting osnoise/timerlat_align_us.
This option sets the alignment of wakeup times between different
timerlat threads, similarly to cyclictest's -A/--aligned option. If
TIMERLAT_ALIGN is set, the first thread that reaches the first cycle
records its first wake-up time. Each following thread sets its first
wake-up time to a fixed offset from the recorded time, and increments
it by the same offset.
Example:
osnoise/timerlat_period is set to 1000, osnoise/timerlat_align_us is
set to 20. There are four threads, on CPUs 1 to 4.
- CPU 4 enters first cycle first. The current time is 20000us, so
the wake-up of the first cycle is set to 21000us. This time is recorded.
- CPU 2 enters first cycle next. It reads the recorded time, increments
it to 21020us, and uses this value as its own wake-up time for the first
cycle.
- CPU 3 enters first cycle next. It reads the recorded time, increments
it to 21040 us, and uses the value as its own wake-up time.
- CPU 1 proceeds analogously.
In each next cycle, the wake-up time (called "absolute period" in
timerlat code) is incremented by the (relative) period of 1000us. Thus,
the wake-ups in the following cycles (provided the times are reached and
not in the past) will be as follows:
  CPU 1      CPU 2      CPU 3      CPU 4
  21060us    21020us    21040us    21000us
  22060us    22020us    22040us    22000us
  ...        ...        ...        ...
Even if any cycle is skipped due to e.g. the first cycle calculation
happening later, the alignment stays in place.
Signed-off-by: Tomas Glozar <tglozar@redhat.com>
Reviewed-by: Wander Lairson Costa <wander@redhat.com>
Reviewed-by: Crystal Wood <crwood@redhat.com>
---
v2 + discussion: https://lore.kernel.org/linux-trace-kernel/20260302131316.385987-1-tglozar@redhat.com/T/#u
v1 + discussion: https://lore.kernel.org/linux-trace-kernel/20260227150420.319528-1-tglozar@redhat.com/T/#u
v3:
- Move align_next up and reset it in tlat_var_reset() instead of
osnoise_workload_start() to fix build failure with CONFIG_TIMERLAT_TRACER=n.
(Bug found by Steven Rostedt, fix suggested by Crystal Wood.)
v2:
- Make align_next global and reset it to 0 in osnoise_workload_start()
so that it gets set by the first thread of each measurement and is not stuck
on what is set by the first measurement until reboot.
- Use atomic64_add_return_relaxed() in place of atomic64_fetch_add_relaxed()
to make the code shorter and easier to read.
- Add more detailed comments to the alignment synchronization logic.
- Fix two typos in the commit message: 50 -> 20 in the example introduction,
and incremenets -> increments.
I tested the patch the same way I did for v2:
[root@cs9 tglozar]# cd /sys/kernel/tracing/osnoise/
[root@cs9 osnoise]# echo TIMERLAT_ALIGN > options
[root@cs9 osnoise]# echo 40 > timerlat_align_us
[root@cs9 osnoise]# rtla timerlat top -q -c 0,1,2,3 -d 1s
...
[root@cs9 osnoise]# rtla timerlat top -q -c 0,1,2,3 -d 1s
...
[root@cs9 osnoise]# dmesg | tail -n9
[ 25.664229] timerlat: thread 0 setting align_next to 25659481645
[ 25.664794] timerlat: aligning thread 1 to 25659521645
[ 25.665294] timerlat: aligning thread 2 to 25659561645
[ 25.665770] timerlat: aligning thread 3 to 25659601645
[ 30.370790] NFSD: all clients done reclaiming, ending NFSv4 grace period (net effffff9)
[ 30.442436] timerlat: thread 0 setting align_next to 30437695876
[ 30.442853] timerlat: aligning thread 1 to 30437735876
[ 30.443141] timerlat: aligning thread 2 to 30437775876
[ 30.443659] timerlat: aligning thread 3 to 30437815876
[root@cs9 osnoise]#
to show that align_next is indeed reset. Additionally, I tested that building
without CONFIG_TIMERLAT_TRACER works.
kernel/trace/trace_osnoise.c | 54 +++++++++++++++++++++++++++++++++++-
1 file changed, 53 insertions(+), 1 deletion(-)
diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c
index be6cf0bb3c03..75678053b21c 100644
--- a/kernel/trace/trace_osnoise.c
+++ b/kernel/trace/trace_osnoise.c
@@ -58,6 +58,7 @@ enum osnoise_options_index {
OSN_PANIC_ON_STOP,
OSN_PREEMPT_DISABLE,
OSN_IRQ_DISABLE,
+ OSN_TIMERLAT_ALIGN,
OSN_MAX
};
@@ -66,7 +67,8 @@ static const char * const osnoise_options_str[OSN_MAX] = {
"OSNOISE_WORKLOAD",
"PANIC_ON_STOP",
"OSNOISE_PREEMPT_DISABLE",
- "OSNOISE_IRQ_DISABLE" };
+ "OSNOISE_IRQ_DISABLE",
+ "TIMERLAT_ALIGN" };
#define OSN_DEFAULT_OPTIONS 0x2
static unsigned long osnoise_options = OSN_DEFAULT_OPTIONS;
@@ -250,6 +252,11 @@ struct timerlat_variables {
static DEFINE_PER_CPU(struct timerlat_variables, per_cpu_timerlat_var);
+/*
+ * timerlat wake-up offset for next thread with TIMERLAT_ALIGN set.
+ */
+static atomic64_t align_next;
+
/*
* this_cpu_tmr_var - Return the per-cpu timerlat_variables on its relative CPU
*/
@@ -268,6 +275,7 @@ static inline void tlat_var_reset(void)
/* Synchronize with the timerlat interfaces */
mutex_lock(&interface_lock);
+
/*
* So far, all the values are initialized as 0, so
* zeroing the structure is perfect.
@@ -278,6 +286,12 @@ static inline void tlat_var_reset(void)
hrtimer_cancel(&tlat_var->timer);
memset(tlat_var, 0, sizeof(*tlat_var));
}
+ /*
+ * Reset also align_next, to be filled by a new offset by the first timerlat
+ * thread that wakes up, if TIMERLAT_ALIGN is set.
+ */
+ atomic64_set(&align_next, 0);
+
mutex_unlock(&interface_lock);
}
#else /* CONFIG_TIMERLAT_TRACER */
@@ -326,6 +340,7 @@ static struct osnoise_data {
u64 stop_tracing_total; /* stop trace in the final operation (report/thread) */
#ifdef CONFIG_TIMERLAT_TRACER
u64 timerlat_period; /* timerlat period */
+ u64 timerlat_align_us; /* timerlat alignment */
u64 print_stack; /* print IRQ stack if total > */
int timerlat_tracer; /* timerlat tracer */
#endif
@@ -338,6 +353,7 @@ static struct osnoise_data {
#ifdef CONFIG_TIMERLAT_TRACER
.print_stack = 0,
.timerlat_period = DEFAULT_TIMERLAT_PERIOD,
+ .timerlat_align_us = 0,
.timerlat_tracer = 0,
#endif
};
@@ -1829,6 +1845,26 @@ static int wait_next_period(struct timerlat_variables *tlat)
*/
tlat->abs_period = (u64) ktime_to_ns(next_abs_period);
+ /*
+ * Align thread in the first cycle on each CPU to the set alignment
+ * if TIMERLAT_ALIGN is set.
+ *
+ * This is done by using an atomic64_t to store the next absolute period.
+ * The first thread that wakes up will set the atomic64_t to its
+ * absolute period, and the other threads will increment it by
+ * the alignment value.
+ */
+ if (test_bit(OSN_TIMERLAT_ALIGN, &osnoise_options) && !tlat->count
+ && atomic64_cmpxchg_relaxed(&align_next, 0, tlat->abs_period)) {
+ /*
+ * A thread has already set align_next, use it and increment it
+ * to be used by the next thread that wakes up after this one.
+ */
+ tlat->abs_period = atomic64_add_return_relaxed(
+ osnoise_data.timerlat_align_us * 1000, &align_next);
+ next_abs_period = ns_to_ktime(tlat->abs_period);
+ }
+
/*
* If the new abs_period is in the past, skip the activation.
*/
@@ -2650,6 +2686,17 @@ static struct trace_min_max_param timerlat_period = {
.min = &timerlat_min_period,
};
+/*
+ * osnoise/timerlat_align_us: align the first wakeup of all timerlat
+ * threads to a common boundary (in us). 0 means disabled.
+ */
+static struct trace_min_max_param timerlat_align_us = {
+ .lock = &interface_lock,
+ .val = &osnoise_data.timerlat_align_us,
+ .max = NULL,
+ .min = NULL,
+};
+
static const struct file_operations timerlat_fd_fops = {
.open = timerlat_fd_open,
.read = timerlat_fd_read,
@@ -2746,6 +2793,11 @@ static int init_timerlat_tracefs(struct dentry *top_dir)
if (!tmp)
return -ENOMEM;
+ tmp = tracefs_create_file("timerlat_align_us", TRACE_MODE_WRITE, top_dir,
+ &timerlat_align_us, &trace_min_max_fops);
+ if (!tmp)
+ return -ENOMEM;
+
retval = osnoise_create_cpu_timerlat_fd(top_dir);
if (retval)
return retval;
--
2.53.0
* Re: [RFC PATCH 3/4] rv/tlob: Add KUnit tests for the tlob monitor
From: Gabriele Monaco @ 2026-04-16 12:09 UTC (permalink / raw)
To: wen.yang
Cc: linux-trace-kernel, linux-kernel, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers
In-Reply-To: <0a7f41ff8cb13f8601920ead2979db2ee5f2d442.1776020428.git.wen.yang@linux.dev>
On Mon, 2026-04-13 at 03:27 +0800, wen.yang@linux.dev wrote:
> From: Wen Yang <wen.yang@linux.dev>
>
> Add six KUnit test suites gated behind CONFIG_TLOB_KUNIT_TEST
> (depends on RV_MON_TLOB && KUNIT; default KUNIT_ALL_TESTS).
> A .kunitconfig fragment is provided for the kunit.py runner.
>
> Coverage: automaton state transitions and self-loops; start/stop API
> error paths (duplicate start, missing start, overflow threshold,
> table-full, immediate deadline); scheduler context-switch accounting
> for on/off-CPU time; violation tracepoint payload fields; ring buffer
> push, drop-new overflow, and wakeup; and the uprobe line parser.
>
> Signed-off-by: Wen Yang <wen.yang@linux.dev>
I was considering adding KUnit tests and thought of having them a bit more
integrated ([1] if you want to have a peek before I submit it for RFC; mind, it's
a bit raw).
The problem with reimplementing da_handle_event() is that you are in fact
validating only the model matrix. Several other things could go wrong before you
get there (whether the monitor was started properly, other things you might be
doing from the tracepoint handler before you handle events, etc.).
Also, I believe it's a bit of an overkill to validate every single transition
like this, especially considering the work once you update the model for
whatever reason.
One meaningful thing to validate is that a certain sequence of events with a
certain timing causes a violation (or if you want, that a good sequence does
not), for instance. But that's just my opinion, of course.
Thanks,
Gabriele
> ---
> kernel/trace/rv/Makefile | 1 +
> kernel/trace/rv/monitors/tlob/.kunitconfig | 5 +
> kernel/trace/rv/monitors/tlob/Kconfig | 12 +
> kernel/trace/rv/monitors/tlob/tlob.c | 1 +
> kernel/trace/rv/monitors/tlob/tlob_kunit.c | 1194 ++++++++++++++++++++
> 5 files changed, 1213 insertions(+)
> create mode 100644 kernel/trace/rv/monitors/tlob/.kunitconfig
> create mode 100644 kernel/trace/rv/monitors/tlob/tlob_kunit.c
>
> diff --git a/kernel/trace/rv/Makefile b/kernel/trace/rv/Makefile
> index cc3781a3b..6d963207d 100644
> --- a/kernel/trace/rv/Makefile
> +++ b/kernel/trace/rv/Makefile
> @@ -19,6 +19,7 @@ obj-$(CONFIG_RV_MON_NRP) += monitors/nrp/nrp.o
> obj-$(CONFIG_RV_MON_SSSW) += monitors/sssw/sssw.o
> obj-$(CONFIG_RV_MON_OPID) += monitors/opid/opid.o
> obj-$(CONFIG_RV_MON_TLOB) += monitors/tlob/tlob.o
> +obj-$(CONFIG_TLOB_KUNIT_TEST) += monitors/tlob/tlob_kunit.o
> # Add new monitors here
> obj-$(CONFIG_RV_REACTORS) += rv_reactors.o
> obj-$(CONFIG_RV_REACT_PRINTK) += reactor_printk.o
> diff --git a/kernel/trace/rv/monitors/tlob/.kunitconfig b/kernel/trace/rv/monitors/tlob/.kunitconfig
> new file mode 100644
> index 000000000..977c58601
> --- /dev/null
> +++ b/kernel/trace/rv/monitors/tlob/.kunitconfig
> @@ -0,0 +1,5 @@
> +CONFIG_FTRACE=y
> +CONFIG_KUNIT=y
> +CONFIG_RV=y
> +CONFIG_RV_MON_TLOB=y
> +CONFIG_TLOB_KUNIT_TEST=y
> diff --git a/kernel/trace/rv/monitors/tlob/Kconfig b/kernel/trace/rv/monitors/tlob/Kconfig
> index 010237480..4ccd2f881 100644
> --- a/kernel/trace/rv/monitors/tlob/Kconfig
> +++ b/kernel/trace/rv/monitors/tlob/Kconfig
> @@ -49,3 +49,15 @@ config RV_MON_TLOB
> For further information, see:
> Documentation/trace/rv/monitor_tlob.rst
>
> +config TLOB_KUNIT_TEST
> + tristate "KUnit tests for tlob monitor" if !KUNIT_ALL_TESTS
> + depends on RV_MON_TLOB && KUNIT
> + default KUNIT_ALL_TESTS
> + help
> + Enable KUnit in-kernel unit tests for the tlob RV monitor.
> +
> + Tests cover automaton state transitions, the hash table helpers,
> + the start/stop task interface, and the event ring buffer including
> + overflow handling and wakeup behaviour.
> +
> + Say Y or M here to run the tlob KUnit test suite; otherwise say N.
> diff --git a/kernel/trace/rv/monitors/tlob/tlob.c b/kernel/trace/rv/monitors/tlob/tlob.c
> index a6e474025..dd959eb9b 100644
> --- a/kernel/trace/rv/monitors/tlob/tlob.c
> +++ b/kernel/trace/rv/monitors/tlob/tlob.c
> @@ -784,6 +784,7 @@ VISIBLE_IF_KUNIT int tlob_parse_uprobe_line(char *buf, u64 *thr_out,
> *path_out = buf + n;
> return 0;
> }
> +EXPORT_SYMBOL_IF_KUNIT(tlob_parse_uprobe_line);
>
> static ssize_t tlob_monitor_write(struct file *file,
> const char __user *ubuf,
> diff --git a/kernel/trace/rv/monitors/tlob/tlob_kunit.c b/kernel/trace/rv/monitors/tlob/tlob_kunit.c
> new file mode 100644
> index 000000000..64f5abb34
> --- /dev/null
> +++ b/kernel/trace/rv/monitors/tlob/tlob_kunit.c
> @@ -0,0 +1,1194 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * KUnit tests for the tlob RV monitor.
> + *
> + * tlob_automaton: DA transition table coverage.
> + * tlob_task_api: tlob_start_task()/tlob_stop_task() lifecycle and errors.
> + * tlob_sched_integration: on/off-CPU accounting across real context switches.
> + * tlob_trace_output: tlob_budget_exceeded tracepoint field verification.
> + * tlob_event_buf: ring buffer push, overflow, and wakeup.
> + * tlob_parse_uprobe: uprobe format string parser acceptance and rejection.
> + *
> + * The duplicate-(binary, offset_start) constraint enforced by tlob_add_uprobe()
> + * is not covered here: that function calls kern_path() and requires a real
> + * filesystem, which is outside the scope of unit tests. It is covered by the
> + * uprobe_duplicate_offset case in tools/testing/selftests/rv/test_tlob.sh.
> + */
> +#include <kunit/test.h>
> +#include <linux/atomic.h>
> +#include <linux/completion.h>
> +#include <linux/delay.h>
> +#include <linux/kthread.h>
> +#include <linux/ktime.h>
> +#include <linux/mutex.h>
> +#include <linux/sched.h>
> +#include <linux/sched/task.h>
> +#include <linux/tracepoint.h>
> +
> +/*
> + * Pull in the rv tracepoint declarations so that
> + * register_trace_tlob_budget_exceeded() is available.
> + * No CREATE_TRACE_POINTS here -- the tracepoint implementation lives in rv.c.
> + */
> +#include <rv_trace.h>
> +
> +#include "tlob.h"
> +
> +/*
> + * da_handle_event_tlob - apply one automaton transition on @da_mon.
> + *
> + * This helper is used only by the KUnit automaton suite. It applies the
> + * tlob transition table directly on a supplied da_monitor without touching
> + * per-task slots, tracepoints, or timers.
> + */
> +static void da_handle_event_tlob(struct da_monitor *da_mon,
> + enum events_tlob event)
> +{
> + enum states_tlob curr_state = (enum states_tlob)da_mon->curr_state;
> + enum states_tlob next_state =
> + (enum states_tlob)automaton_tlob.function[curr_state][event];
> +
> + if (next_state != INVALID_STATE)
> + da_mon->curr_state = next_state;
> +}
> +
> +MODULE_IMPORT_NS("EXPORTED_FOR_KUNIT_TESTING");
> +
> +/*
> + * Suite 1: automaton state-machine transitions
> + */
> +
> +/* unmonitored -> trace_start -> on_cpu */
> +static void tlob_unmonitored_to_on_cpu(struct kunit *test)
> +{
> + struct da_monitor mon = { .curr_state = unmonitored_tlob };
> +
> + da_handle_event_tlob(&mon, trace_start_tlob);
> + KUNIT_EXPECT_EQ(test, (int)mon.curr_state, (int)on_cpu_tlob);
> +}
> +
> +/* on_cpu -> switch_out -> off_cpu */
> +static void tlob_on_cpu_switch_out(struct kunit *test)
> +{
> + struct da_monitor mon = { .curr_state = on_cpu_tlob };
> +
> + da_handle_event_tlob(&mon, switch_out_tlob);
> + KUNIT_EXPECT_EQ(test, (int)mon.curr_state, (int)off_cpu_tlob);
> +}
> +
> +/* off_cpu -> switch_in -> on_cpu */
> +static void tlob_off_cpu_switch_in(struct kunit *test)
> +{
> + struct da_monitor mon = { .curr_state = off_cpu_tlob };
> +
> + da_handle_event_tlob(&mon, switch_in_tlob);
> + KUNIT_EXPECT_EQ(test, (int)mon.curr_state, (int)on_cpu_tlob);
> +}
> +
> +/* on_cpu -> budget_expired -> unmonitored */
> +static void tlob_on_cpu_budget_expired(struct kunit *test)
> +{
> + struct da_monitor mon = { .curr_state = on_cpu_tlob };
> +
> + da_handle_event_tlob(&mon, budget_expired_tlob);
> + KUNIT_EXPECT_EQ(test, (int)mon.curr_state, (int)unmonitored_tlob);
> +}
> +
> +/* off_cpu -> budget_expired -> unmonitored */
> +static void tlob_off_cpu_budget_expired(struct kunit *test)
> +{
> + struct da_monitor mon = { .curr_state = off_cpu_tlob };
> +
> + da_handle_event_tlob(&mon, budget_expired_tlob);
> + KUNIT_EXPECT_EQ(test, (int)mon.curr_state, (int)unmonitored_tlob);
> +}
> +
> +/* on_cpu -> trace_stop -> unmonitored */
> +static void tlob_on_cpu_trace_stop(struct kunit *test)
> +{
> + struct da_monitor mon = { .curr_state = on_cpu_tlob };
> +
> + da_handle_event_tlob(&mon, trace_stop_tlob);
> + KUNIT_EXPECT_EQ(test, (int)mon.curr_state, (int)unmonitored_tlob);
> +}
> +
> +/* off_cpu -> trace_stop -> unmonitored */
> +static void tlob_off_cpu_trace_stop(struct kunit *test)
> +{
> + struct da_monitor mon = { .curr_state = off_cpu_tlob };
> +
> + da_handle_event_tlob(&mon, trace_stop_tlob);
> + KUNIT_EXPECT_EQ(test, (int)mon.curr_state, (int)unmonitored_tlob);
> +}
> +
> +/* budget_expired -> unmonitored; a single trace_start re-enters on_cpu. */
> +static void tlob_violation_then_restart(struct kunit *test)
> +{
> + struct da_monitor mon = { .curr_state = unmonitored_tlob };
> +
> + da_handle_event_tlob(&mon, trace_start_tlob);
> + KUNIT_EXPECT_EQ(test, (int)mon.curr_state, (int)on_cpu_tlob);
> +
> + da_handle_event_tlob(&mon, budget_expired_tlob);
> + KUNIT_EXPECT_EQ(test, (int)mon.curr_state, (int)unmonitored_tlob);
> +
> + /* Single trace_start is sufficient to re-enter on_cpu */
> + da_handle_event_tlob(&mon, trace_start_tlob);
> + KUNIT_EXPECT_EQ(test, (int)mon.curr_state, (int)on_cpu_tlob);
> +
> + da_handle_event_tlob(&mon, trace_stop_tlob);
> + KUNIT_EXPECT_EQ(test, (int)mon.curr_state, (int)unmonitored_tlob);
> +}
> +
> +/* off_cpu self-loops on switch_out and sched_wakeup. */
> +static void tlob_off_cpu_self_loops(struct kunit *test)
> +{
> + static const enum events_tlob events[] = {
> + switch_out_tlob, sched_wakeup_tlob,
> + };
> + unsigned int i;
> +
> + for (i = 0; i < ARRAY_SIZE(events); i++) {
> + struct da_monitor mon = { .curr_state = off_cpu_tlob };
> +
> + da_handle_event_tlob(&mon, events[i]);
> + KUNIT_EXPECT_EQ_MSG(test, (int)mon.curr_state,
> + (int)off_cpu_tlob,
> + "event %u should self-loop in off_cpu",
> + events[i]);
> + }
> +}
> +
> +/* on_cpu self-loops on sched_wakeup. */
> +static void tlob_on_cpu_self_loops(struct kunit *test)
> +{
> + struct da_monitor mon = { .curr_state = on_cpu_tlob };
> +
> + da_handle_event_tlob(&mon, sched_wakeup_tlob);
> + KUNIT_EXPECT_EQ_MSG(test, (int)mon.curr_state, (int)on_cpu_tlob,
> + "sched_wakeup should self-loop in on_cpu");
> +}
> +
> +/* Scheduling events self-loop in unmonitored (no state change). */
> +static void tlob_unmonitored_ignores_sched(struct kunit *test)
> +{
> + static const enum events_tlob events[] = {
> + switch_in_tlob, switch_out_tlob, sched_wakeup_tlob,
> + };
> + unsigned int i;
> +
> + for (i = 0; i < ARRAY_SIZE(events); i++) {
> + struct da_monitor mon = { .curr_state = unmonitored_tlob };
> +
> + da_handle_event_tlob(&mon, events[i]);
> + KUNIT_EXPECT_EQ_MSG(test, (int)mon.curr_state,
> + (int)unmonitored_tlob,
> + "event %u should self-loop in unmonitored",
> + events[i]);
> + }
> +}
> +
> +static void tlob_full_happy_path(struct kunit *test)
> +{
> + struct da_monitor mon = { .curr_state = unmonitored_tlob };
> +
> + da_handle_event_tlob(&mon, trace_start_tlob);
> + KUNIT_EXPECT_EQ(test, (int)mon.curr_state, (int)on_cpu_tlob);
> +
> + da_handle_event_tlob(&mon, switch_out_tlob);
> + KUNIT_EXPECT_EQ(test, (int)mon.curr_state, (int)off_cpu_tlob);
> +
> + da_handle_event_tlob(&mon, switch_in_tlob);
> + KUNIT_EXPECT_EQ(test, (int)mon.curr_state, (int)on_cpu_tlob);
> +
> + da_handle_event_tlob(&mon, trace_stop_tlob);
> + KUNIT_EXPECT_EQ(test, (int)mon.curr_state, (int)unmonitored_tlob);
> +}
> +
> +static void tlob_multiple_switches(struct kunit *test)
> +{
> + struct da_monitor mon = { .curr_state = unmonitored_tlob };
> + int i;
> +
> + da_handle_event_tlob(&mon, trace_start_tlob);
> + KUNIT_EXPECT_EQ(test, (int)mon.curr_state, (int)on_cpu_tlob);
> +
> + for (i = 0; i < 3; i++) {
> + da_handle_event_tlob(&mon, switch_out_tlob);
> + KUNIT_EXPECT_EQ(test, (int)mon.curr_state, (int)off_cpu_tlob);
> + da_handle_event_tlob(&mon, switch_in_tlob);
> + KUNIT_EXPECT_EQ(test, (int)mon.curr_state, (int)on_cpu_tlob);
> + }
> +
> + da_handle_event_tlob(&mon, trace_stop_tlob);
> + KUNIT_EXPECT_EQ(test, (int)mon.curr_state, (int)unmonitored_tlob);
> +}
> +
> +static struct kunit_case tlob_automaton_cases[] = {
> + KUNIT_CASE(tlob_unmonitored_to_on_cpu),
> + KUNIT_CASE(tlob_on_cpu_switch_out),
> + KUNIT_CASE(tlob_off_cpu_switch_in),
> + KUNIT_CASE(tlob_on_cpu_budget_expired),
> + KUNIT_CASE(tlob_off_cpu_budget_expired),
> + KUNIT_CASE(tlob_on_cpu_trace_stop),
> + KUNIT_CASE(tlob_off_cpu_trace_stop),
> + KUNIT_CASE(tlob_off_cpu_self_loops),
> + KUNIT_CASE(tlob_on_cpu_self_loops),
> + KUNIT_CASE(tlob_unmonitored_ignores_sched),
> + KUNIT_CASE(tlob_full_happy_path),
> + KUNIT_CASE(tlob_violation_then_restart),
> + KUNIT_CASE(tlob_multiple_switches),
> + {}
> +};
> +
> +static struct kunit_suite tlob_automaton_suite = {
> + .name = "tlob_automaton",
> + .test_cases = tlob_automaton_cases,
> +};
> +
> +/*
> + * Suite 2: task registration API
> + */
> +
> +/* Basic start/stop cycle */
> +static void tlob_start_stop_ok(struct kunit *test)
> +{
> + int ret;
> +
> + ret = tlob_start_task(current, 10000000 /* 10 s, won't fire */, NULL, 0);
> + KUNIT_ASSERT_EQ(test, ret, 0);
> + KUNIT_EXPECT_EQ(test, tlob_stop_task(current), 0);
> +}
> +
> +/* Double start must return -EEXIST. */
> +static void tlob_double_start(struct kunit *test)
> +{
> + KUNIT_ASSERT_EQ(test, tlob_start_task(current, 10000000, NULL, 0), 0);
> + KUNIT_EXPECT_EQ(test, tlob_start_task(current, 10000000, NULL, 0), -EEXIST);
> + tlob_stop_task(current);
> +}
> +
> +/* Stop without start must return -ESRCH. */
> +static void tlob_stop_without_start(struct kunit *test)
> +{
> + tlob_stop_task(current); /* clear any stale entry first */
> + KUNIT_EXPECT_EQ(test, tlob_stop_task(current), -ESRCH);
> +}
> +
> +/*
> + * A 1 us budget almost always fires before tlob_stop_task() is called.
> + * Either the timer wins (-ESRCH) or we are very fast (0); both are valid.
> + */
> +static void tlob_immediate_deadline(struct kunit *test)
> +{
> + int ret = tlob_start_task(current, 1 /* 1 us - fires almost immediately */, NULL, 0);
> +
> + KUNIT_ASSERT_EQ(test, ret, 0);
> + /* Let the 1 us timer fire */
> + udelay(100);
> + /*
> + * By now the hrtimer has almost certainly fired. Either it has
> + * (returns -ESRCH) or we were very fast (returns 0). Both are
> + * acceptable; just ensure no crash and the table is clean after.
> + */
> + ret = tlob_stop_task(current);
> + KUNIT_EXPECT_TRUE(test, ret == 0 || ret == -ESRCH);
> +}
> +
> +/*
> + * Fill the table to TLOB_MAX_MONITORED using kthreads (each needs a
> + * distinct task_struct), then verify the next start returns -ENOSPC.
> + */
> +struct tlob_waiter_ctx {
> + struct completion start;
> + struct completion done;
> +};
> +
> +static int tlob_waiter_fn(void *arg)
> +{
> + struct tlob_waiter_ctx *ctx = arg;
> +
> + wait_for_completion(&ctx->start);
> + complete(&ctx->done);
> + return 0;
> +}
> +
> +static void tlob_enospc(struct kunit *test)
> +{
> + struct tlob_waiter_ctx *ctxs;
> + struct task_struct **threads;
> + int i, ret;
> +
> + ctxs = kunit_kcalloc(test, TLOB_MAX_MONITORED,
> + sizeof(*ctxs), GFP_KERNEL);
> + KUNIT_ASSERT_NOT_NULL(test, ctxs);
> +
> + threads = kunit_kcalloc(test, TLOB_MAX_MONITORED,
> + sizeof(*threads), GFP_KERNEL);
> + KUNIT_ASSERT_NOT_NULL(test, threads);
> +
> + /* Start TLOB_MAX_MONITORED kthreads and monitor each */
> + for (i = 0; i < TLOB_MAX_MONITORED; i++) {
> + init_completion(&ctxs[i].start);
> + init_completion(&ctxs[i].done);
> +
> + threads[i] = kthread_run(tlob_waiter_fn, &ctxs[i],
> + "tlob_waiter_%d", i);
> + if (IS_ERR(threads[i])) {
> + KUNIT_FAIL(test, "kthread_run failed at i=%d", i);
> + threads[i] = NULL;
> + goto cleanup;
> + }
> + get_task_struct(threads[i]);
> +
> + ret = tlob_start_task(threads[i], 10000000, NULL, 0);
> + if (ret != 0) {
> + KUNIT_FAIL(test, "tlob_start_task failed at i=%d: %d",
> + i, ret);
> + /* threads[i] stays set: cleanup drops its reference exactly once. */
> + complete(&ctxs[i].start);
> + goto cleanup;
> + }
> + }
> +
> + /* The table is now full: one more must fail with -ENOSPC */
> + ret = tlob_start_task(current, 10000000, NULL, 0);
> + KUNIT_EXPECT_EQ(test, ret, -ENOSPC);
> +
> +cleanup:
> + /*
> + * Two-pass cleanup: cancel tlob monitoring and unblock kthreads first,
> + * then kthread_stop() to wait for full exit before releasing refs.
> + */
> + for (i = 0; i < TLOB_MAX_MONITORED; i++) {
> + if (!threads[i])
> + break;
> + tlob_stop_task(threads[i]);
> + complete(&ctxs[i].start);
> + }
> + for (i = 0; i < TLOB_MAX_MONITORED; i++) {
> + if (!threads[i])
> + break;
> + kthread_stop(threads[i]);
> + put_task_struct(threads[i]);
> + }
> +}
> +
> +/*
> + * A kthread holds a mutex for 80 ms; arm a 10 ms budget, burn ~1 ms
> + * on-CPU, then block on the mutex. The timer fires off-CPU; stop
> + * must return -ESRCH.
> + */
> +struct tlob_holder_ctx {
> + struct mutex lock;
> + struct completion ready;
> + unsigned int hold_ms;
> +};
> +
> +static int tlob_holder_fn(void *arg)
> +{
> + struct tlob_holder_ctx *ctx = arg;
> +
> + mutex_lock(&ctx->lock);
> + complete(&ctx->ready);
> + msleep(ctx->hold_ms);
> + mutex_unlock(&ctx->lock);
> + return 0;
> +}
> +
> +static void tlob_deadline_fires_off_cpu(struct kunit *test)
> +{
> + struct tlob_holder_ctx ctx = { .hold_ms = 80 };
> + struct task_struct *holder;
> + ktime_t t0;
> + int ret;
> +
> + mutex_init(&ctx.lock);
> + init_completion(&ctx.ready);
> +
> + holder = kthread_run(tlob_holder_fn, &ctx, "tlob_holder_kunit");
> + KUNIT_ASSERT_NOT_ERR_OR_NULL(test, holder);
> + wait_for_completion(&ctx.ready);
> +
> + /* Arm 10 ms budget while kthread holds the mutex. */
> + ret = tlob_start_task(current, 10000, NULL, 0);
> + KUNIT_ASSERT_EQ(test, ret, 0);
> +
> + /* Phase 1: burn ~1 ms on-CPU to exercise on_cpu accounting. */
> + t0 = ktime_get();
> + while (ktime_us_delta(ktime_get(), t0) < 1000)
> + cpu_relax();
> +
> + /*
> + * Phase 2: block on the mutex -> on_cpu->off_cpu transition.
> + * The 10 ms budget fires while we are off-CPU.
> + */
> + mutex_lock(&ctx.lock);
> + mutex_unlock(&ctx.lock);
> +
> + /* Timer already fired and removed the entry -> -ESRCH */
> + KUNIT_EXPECT_EQ(test, tlob_stop_task(current), -ESRCH);
> +}
> +
> +/* Arm a 1 ms budget and busy-spin for 50 ms; timer fires on-CPU. */
> +static void tlob_deadline_fires_on_cpu(struct kunit *test)
> +{
> + ktime_t t0;
> + int ret;
> +
> + ret = tlob_start_task(current, 1000 /* 1 ms */, NULL, 0);
> + KUNIT_ASSERT_EQ(test, ret, 0);
> +
> + /* Busy-spin 50 ms - 50x the budget */
> + t0 = ktime_get();
> + while (ktime_us_delta(ktime_get(), t0) < 50000)
> + cpu_relax();
> +
> + /* Timer fired during the spin; entry is gone */
> + KUNIT_EXPECT_EQ(test, tlob_stop_task(current), -ESRCH);
> +}
> +
> +/*
> + * Start three tasks, call tlob_destroy_monitor() + tlob_init_monitor(),
> + * and verify the table is empty afterwards.
> + */
> +static int tlob_dummy_fn(void *arg)
> +{
> + wait_for_completion((struct completion *)arg);
> + return 0;
> +}
> +
> +static void tlob_stop_all_cleanup(struct kunit *test)
> +{
> + struct completion done1, done2;
> + struct task_struct *t1, *t2;
> + int ret;
> +
> + init_completion(&done1);
> + init_completion(&done2);
> +
> + t1 = kthread_run(tlob_dummy_fn, &done1, "tlob_dummy1");
> + KUNIT_ASSERT_NOT_ERR_OR_NULL(test, t1);
> + get_task_struct(t1);
> +
> + t2 = kthread_run(tlob_dummy_fn, &done2, "tlob_dummy2");
> + KUNIT_ASSERT_NOT_ERR_OR_NULL(test, t2);
> + get_task_struct(t2);
> +
> + KUNIT_ASSERT_EQ(test, tlob_start_task(current, 10000000, NULL, 0), 0);
> + KUNIT_ASSERT_EQ(test, tlob_start_task(t1, 10000000, NULL, 0), 0);
> + KUNIT_ASSERT_EQ(test, tlob_start_task(t2, 10000000, NULL, 0), 0);
> +
> + /* Destroy clears all entries via tlob_stop_all() */
> + tlob_destroy_monitor();
> + ret = tlob_init_monitor();
> + KUNIT_ASSERT_EQ(test, ret, 0);
> +
> + /* Table must be empty now */
> + KUNIT_EXPECT_EQ(test, tlob_stop_task(current), -ESRCH);
> + KUNIT_EXPECT_EQ(test, tlob_stop_task(t1), -ESRCH);
> + KUNIT_EXPECT_EQ(test, tlob_stop_task(t2), -ESRCH);
> +
> + complete(&done1);
> + complete(&done2);
> + /*
> + * completions live on stack; wait for kthreads to exit before return.
> + */
> + kthread_stop(t1);
> + kthread_stop(t2);
> + put_task_struct(t1);
> + put_task_struct(t2);
> +}
> +
> +/* A threshold that overflows ktime_t must be rejected with -ERANGE. */
> +static void tlob_overflow_threshold(struct kunit *test)
> +{
> + /* KTIME_MAX / NSEC_PER_USEC + 1 overflows ktime_t */
> + u64 too_large = (u64)(KTIME_MAX / NSEC_PER_USEC) + 1;
> +
> + KUNIT_EXPECT_EQ(test,
> + tlob_start_task(current, too_large, NULL, 0),
> + -ERANGE);
> +}
> +
> +static int tlob_task_api_suite_init(struct kunit_suite *suite)
> +{
> + return tlob_init_monitor();
> +}
> +
> +static void tlob_task_api_suite_exit(struct kunit_suite *suite)
> +{
> + tlob_destroy_monitor();
> +}
> +
> +static struct kunit_case tlob_task_api_cases[] = {
> + KUNIT_CASE(tlob_start_stop_ok),
> + KUNIT_CASE(tlob_double_start),
> + KUNIT_CASE(tlob_stop_without_start),
> + KUNIT_CASE(tlob_immediate_deadline),
> + KUNIT_CASE(tlob_enospc),
> + KUNIT_CASE(tlob_overflow_threshold),
> + KUNIT_CASE(tlob_deadline_fires_off_cpu),
> + KUNIT_CASE(tlob_deadline_fires_on_cpu),
> + KUNIT_CASE(tlob_stop_all_cleanup),
> + {}
> +};
> +
> +static struct kunit_suite tlob_task_api_suite = {
> + .name = "tlob_task_api",
> + .suite_init = tlob_task_api_suite_init,
> + .suite_exit = tlob_task_api_suite_exit,
> + .test_cases = tlob_task_api_cases,
> +};
> +
> +/*
> + * Suite 3: scheduling integration
> + */
> +
> +struct tlob_ping_ctx {
> + struct completion ping;
> + struct completion pong;
> +};
> +
> +static int tlob_ping_fn(void *arg)
> +{
> + struct tlob_ping_ctx *ctx = arg;
> +
> + /* Wait for main to give us the CPU back */
> + wait_for_completion(&ctx->ping);
> + complete(&ctx->pong);
> + return 0;
> +}
> +
> +/* Force two context switches and verify stop returns 0 (within budget). */
> +static void tlob_sched_switch_accounting(struct kunit *test)
> +{
> + struct tlob_ping_ctx ctx;
> + struct task_struct *peer;
> + int ret;
> +
> + init_completion(&ctx.ping);
> + init_completion(&ctx.pong);
> +
> + peer = kthread_run(tlob_ping_fn, &ctx, "tlob_ping_kunit");
> + KUNIT_ASSERT_NOT_ERR_OR_NULL(test, peer);
> +
> + /* Arm a generous 5 s budget so the timer never fires */
> + ret = tlob_start_task(current, 5000000, NULL, 0);
> + KUNIT_ASSERT_EQ(test, ret, 0);
> +
> + /*
> + * complete(ping) -> peer runs, forcing a context switch out and back.
> + */
> + complete(&ctx.ping);
> + wait_for_completion(&ctx.pong);
> +
> + /*
> + * Back on CPU after one off-CPU interval; stop must return 0.
> + */
> + ret = tlob_stop_task(current);
> + KUNIT_EXPECT_EQ(test, ret, 0);
> +}
> +
> +/*
> + * Verify that monitoring a kthread (not current) works: start on behalf
> + * of a kthread, let it block, then stop it.
> + */
> +static int tlob_block_fn(void *arg)
> +{
> + struct completion *done = arg;
> +
> + /* Block briefly, exercising off_cpu accounting for this task */
> + msleep(20);
> + complete(done);
> + return 0;
> +}
> +
> +static void tlob_monitor_other_task(struct kunit *test)
> +{
> + struct completion done;
> + struct task_struct *target;
> + int ret;
> +
> + init_completion(&done);
> +
> + target = kthread_run(tlob_block_fn, &done, "tlob_target_kunit");
> + KUNIT_ASSERT_NOT_ERR_OR_NULL(test, target);
> + get_task_struct(target);
> +
> + /* Arm a 5 s budget for the target task */
> + ret = tlob_start_task(target, 5000000, NULL, 0);
> + KUNIT_ASSERT_EQ(test, ret, 0);
> +
> + wait_for_completion(&done);
> +
> + /*
> + * Target has finished; stop_task may return 0 (still in htable)
> + * or -ESRCH (kthread exited and timer fired / entry cleaned up).
> + */
> + ret = tlob_stop_task(target);
> + KUNIT_EXPECT_TRUE(test, ret == 0 || ret == -ESRCH);
> + put_task_struct(target);
> +}
> +
> +static int tlob_sched_suite_init(struct kunit_suite *suite)
> +{
> + return tlob_init_monitor();
> +}
> +
> +static void tlob_sched_suite_exit(struct kunit_suite *suite)
> +{
> + tlob_destroy_monitor();
> +}
> +
> +static struct kunit_case tlob_sched_integration_cases[] = {
> + KUNIT_CASE(tlob_sched_switch_accounting),
> + KUNIT_CASE(tlob_monitor_other_task),
> + {}
> +};
> +
> +static struct kunit_suite tlob_sched_integration_suite = {
> + .name = "tlob_sched_integration",
> + .suite_init = tlob_sched_suite_init,
> + .suite_exit = tlob_sched_suite_exit,
> + .test_cases = tlob_sched_integration_cases,
> +};
> +
> +/*
> + * Suite 4: ftrace tracepoint field verification
> + */
> +
> +/* Capture fields from trace_tlob_budget_exceeded for inspection. */
> +struct tlob_exceeded_capture {
> + atomic_t fired; /* 1 after first call */
> + pid_t pid;
> + u64 threshold_us;
> + u64 on_cpu_us;
> + u64 off_cpu_us;
> + u32 switches;
> + bool state_is_on_cpu;
> + u64 tag;
> +};
> +
> +static void
> +probe_tlob_budget_exceeded(void *data,
> + struct task_struct *task, u64 threshold_us,
> + u64 on_cpu_us, u64 off_cpu_us,
> + u32 switches, bool state_is_on_cpu, u64 tag)
> +{
> + struct tlob_exceeded_capture *cap = data;
> +
> + /* Only capture the first event to avoid races. */
> + if (atomic_cmpxchg(&cap->fired, 0, 1) != 0)
> + return;
> +
> + cap->pid = task->pid;
> + cap->threshold_us = threshold_us;
> + cap->on_cpu_us = on_cpu_us;
> + cap->off_cpu_us = off_cpu_us;
> + cap->switches = switches;
> + cap->state_is_on_cpu = state_is_on_cpu;
> + cap->tag = tag;
> +}
> +
> +/*
> + * Arm a 2 ms budget and busy-spin for 60 ms. Verify the tracepoint fires
> + * once with matching threshold, correct pid, and total time >= budget.
> + *
> + * state_is_on_cpu is not asserted: preemption during the spin makes it
> + * non-deterministic.
> + */
> +static void tlob_trace_budget_exceeded_on_cpu(struct kunit *test)
> +{
> + struct tlob_exceeded_capture cap = {};
> + const u64 threshold_us = 2000; /* 2 ms */
> + ktime_t t0;
> + int ret;
> +
> + atomic_set(&cap.fired, 0);
> +
> + ret = register_trace_tlob_budget_exceeded(probe_tlob_budget_exceeded,
> + &cap);
> + KUNIT_ASSERT_EQ(test, ret, 0);
> +
> + ret = tlob_start_task(current, threshold_us, NULL, 0);
> + KUNIT_ASSERT_EQ(test, ret, 0);
> +
> + /* Busy-spin 60 ms -- 30x the budget */
> + t0 = ktime_get();
> + while (ktime_us_delta(ktime_get(), t0) < 60000)
> + cpu_relax();
> +
> + /* Entry removed by timer; stop returns -ESRCH */
> + tlob_stop_task(current);
> +
> + /*
> + * Synchronise: ensure the probe callback has completed before we
> + * read the captured fields.
> + */
> + unregister_trace_tlob_budget_exceeded(probe_tlob_budget_exceeded, &cap);
> + tracepoint_synchronize_unregister();
> +
> + KUNIT_EXPECT_EQ(test, atomic_read(&cap.fired), 1);
> + KUNIT_EXPECT_EQ(test, (int)cap.pid, (int)current->pid);
> + KUNIT_EXPECT_EQ(test, cap.threshold_us, threshold_us);
> + /* Total elapsed must cover at least the budget */
> + KUNIT_EXPECT_GE(test, cap.on_cpu_us + cap.off_cpu_us, threshold_us);
> +}
> +
> +/*
> + * Holder kthread grabs a mutex for 80 ms; arm 10 ms budget, burn ~1 ms
> + * on-CPU, then block on the mutex. Timer fires off-CPU. Verify:
> + * state_is_on_cpu == false, switches >= 1, off_cpu_us > 0.
> + */
> +static void tlob_trace_budget_exceeded_off_cpu(struct kunit *test)
> +{
> + struct tlob_exceeded_capture cap = {};
> + struct tlob_holder_ctx ctx = { .hold_ms = 80 };
> + struct task_struct *holder;
> + const u64 threshold_us = 10000; /* 10 ms */
> + ktime_t t0;
> + int ret;
> +
> + atomic_set(&cap.fired, 0);
> +
> + mutex_init(&ctx.lock);
> + init_completion(&ctx.ready);
> +
> + holder = kthread_run(tlob_holder_fn, &ctx, "tlob_holder2_kunit");
> + KUNIT_ASSERT_NOT_ERR_OR_NULL(test, holder);
> + wait_for_completion(&ctx.ready);
> +
> + ret = register_trace_tlob_budget_exceeded(probe_tlob_budget_exceeded,
> + &cap);
> + KUNIT_ASSERT_EQ(test, ret, 0);
> +
> + ret = tlob_start_task(current, threshold_us, NULL, 0);
> + KUNIT_ASSERT_EQ(test, ret, 0);
> +
> + /* Phase 1: ~1 ms on-CPU */
> + t0 = ktime_get();
> + while (ktime_us_delta(ktime_get(), t0) < 1000)
> + cpu_relax();
> +
> + /* Phase 2: block -> off-CPU; timer fires here */
> + mutex_lock(&ctx.lock);
> + mutex_unlock(&ctx.lock);
> +
> + tlob_stop_task(current);
> +
> + unregister_trace_tlob_budget_exceeded(probe_tlob_budget_exceeded, &cap);
> + tracepoint_synchronize_unregister();
> +
> + KUNIT_EXPECT_EQ(test, atomic_read(&cap.fired), 1);
> + KUNIT_EXPECT_EQ(test, cap.threshold_us, threshold_us);
> + /* Violation happened off-CPU */
> + KUNIT_EXPECT_FALSE(test, cap.state_is_on_cpu);
> + /* At least the switch_out event was counted */
> + KUNIT_EXPECT_GE(test, (u64)cap.switches, (u64)1);
> + /* Off-CPU time must be non-zero */
> + KUNIT_EXPECT_GT(test, cap.off_cpu_us, (u64)0);
> +}
> +
> +/* threshold_us in the tracepoint must exactly match the start argument. */
> +static void tlob_trace_threshold_field_accuracy(struct kunit *test)
> +{
> + static const u64 thresholds[] = { 500, 1000, 3000 };
> + unsigned int i;
> +
> + for (i = 0; i < ARRAY_SIZE(thresholds); i++) {
> + struct tlob_exceeded_capture cap = {};
> + ktime_t t0;
> + int ret;
> +
> + atomic_set(&cap.fired, 0);
> +
> + ret = register_trace_tlob_budget_exceeded(
> + probe_tlob_budget_exceeded, &cap);
> + KUNIT_ASSERT_EQ(test, ret, 0);
> +
> + ret = tlob_start_task(current, thresholds[i], NULL, 0);
> + KUNIT_ASSERT_EQ(test, ret, 0);
> +
> + /* Spin for 20x the threshold to ensure timer fires */
> + t0 = ktime_get();
> + while (ktime_us_delta(ktime_get(), t0) <
> + (s64)(thresholds[i] * 20))
> + cpu_relax();
> +
> + tlob_stop_task(current);
> +
> + unregister_trace_tlob_budget_exceeded(
> + probe_tlob_budget_exceeded, &cap);
> + tracepoint_synchronize_unregister();
> +
> + KUNIT_EXPECT_EQ_MSG(test, cap.threshold_us, thresholds[i],
> + "threshold mismatch for entry %u", i);
> + }
> +}
> +
> +static int tlob_trace_suite_init(struct kunit_suite *suite)
> +{
> + int ret;
> +
> + ret = tlob_init_monitor();
> + if (ret)
> + return ret;
> + return tlob_enable_hooks();
> +}
> +
> +static void tlob_trace_suite_exit(struct kunit_suite *suite)
> +{
> + tlob_disable_hooks();
> + tlob_destroy_monitor();
> +}
> +
> +static struct kunit_case tlob_trace_output_cases[] = {
> + KUNIT_CASE(tlob_trace_budget_exceeded_on_cpu),
> + KUNIT_CASE(tlob_trace_budget_exceeded_off_cpu),
> + KUNIT_CASE(tlob_trace_threshold_field_accuracy),
> + {}
> +};
> +
> +static struct kunit_suite tlob_trace_output_suite = {
> + .name = "tlob_trace_output",
> + .suite_init = tlob_trace_suite_init,
> + .suite_exit = tlob_trace_suite_exit,
> + .test_cases = tlob_trace_output_cases,
> +};
> +
> +/* Suite 5: ring buffer */
> +
> +/*
> + * Allocate a synthetic rv_file_priv for ring buffer tests. Uses
> + * kunit_kzalloc() instead of __get_free_pages() since the ring is never
> + * mmap'd here.
> + */
> +static struct rv_file_priv *alloc_priv_kunit(struct kunit *test, u32 cap)
> +{
> + struct rv_file_priv *priv;
> + struct tlob_ring *ring;
> +
> + priv = kunit_kzalloc(test, sizeof(*priv), GFP_KERNEL);
> + if (!priv)
> + return NULL;
> +
> + ring = &priv->ring;
> +
> + ring->page = kunit_kzalloc(test, sizeof(struct tlob_mmap_page),
> + GFP_KERNEL);
> + if (!ring->page)
> + return NULL;
> +
> + ring->data = kunit_kzalloc(test, cap * sizeof(struct tlob_event),
> + GFP_KERNEL);
> + if (!ring->data)
> + return NULL;
> +
> + ring->mask = cap - 1;
> + ring->page->capacity = cap;
> + ring->page->version = 1;
> + ring->page->data_offset = PAGE_SIZE; /* nominal; not used in tests */
> + ring->page->record_size = sizeof(struct tlob_event);
> + spin_lock_init(&ring->lock);
> + init_waitqueue_head(&priv->waitq);
> + return priv;
> +}
> +
> +/* Push one record and verify all fields survive the round-trip. */
> +static void tlob_event_push_one(struct kunit *test)
> +{
> + struct rv_file_priv *priv;
> + struct tlob_ring *ring;
> + struct tlob_event in = {
> + .tid = 1234,
> + .threshold_us = 5000,
> + .on_cpu_us = 3000,
> + .off_cpu_us = 2000,
> + .switches = 3,
> + .state = 1,
> + };
> + struct tlob_event out = {};
> + u32 tail;
> +
> + priv = alloc_priv_kunit(test, TLOB_RING_DEFAULT_CAP);
> + KUNIT_ASSERT_NOT_NULL(test, priv);
> +
> + ring = &priv->ring;
> +
> + tlob_event_push_kunit(priv, &in);
> +
> + /* One record written, none dropped */
> + KUNIT_EXPECT_EQ(test, ring->page->data_head, 1u);
> + KUNIT_EXPECT_EQ(test, ring->page->data_tail, 0u);
> + KUNIT_EXPECT_EQ(test, ring->page->dropped, 0ull);
> +
> + /* Dequeue manually */
> + tail = ring->page->data_tail;
> + out = ring->data[tail & ring->mask];
> + ring->page->data_tail = tail + 1;
> +
> + KUNIT_EXPECT_EQ(test, out.tid, in.tid);
> + KUNIT_EXPECT_EQ(test, out.threshold_us, in.threshold_us);
> + KUNIT_EXPECT_EQ(test, out.on_cpu_us, in.on_cpu_us);
> + KUNIT_EXPECT_EQ(test, out.off_cpu_us, in.off_cpu_us);
> + KUNIT_EXPECT_EQ(test, out.switches, in.switches);
> + KUNIT_EXPECT_EQ(test, out.state, in.state);
> +
> + /* Ring is now empty */
> + KUNIT_EXPECT_EQ(test, ring->page->data_head, ring->page->data_tail);
> +}
> +
> +/*
> + * Fill to capacity, push one more. Drop-new policy: head stays at cap,
> + * dropped == 1, oldest record is preserved.
> + */
> +static void tlob_event_push_overflow(struct kunit *test)
> +{
> + struct rv_file_priv *priv;
> + struct tlob_ring *ring;
> + struct tlob_event ntf = {};
> + struct tlob_event out = {};
> + const u32 cap = TLOB_RING_MIN_CAP;
> + u32 i;
> +
> + priv = alloc_priv_kunit(test, cap);
> + KUNIT_ASSERT_NOT_NULL(test, priv);
> +
> + ring = &priv->ring;
> +
> + /* Push cap + 1 records; tid encodes the sequence */
> + for (i = 0; i <= cap; i++) {
> + ntf.tid = i;
> + ntf.threshold_us = (u64)i * 1000;
> + tlob_event_push_kunit(priv, &ntf);
> + }
> +
> + /* Drop-new: head stopped at cap; one record was silently discarded */
> + KUNIT_EXPECT_EQ(test, ring->page->data_head, cap);
> + KUNIT_EXPECT_EQ(test, ring->page->data_tail, 0u);
> + KUNIT_EXPECT_EQ(test, ring->page->dropped, 1ull);
> +
> + /* Oldest surviving record must be the first one pushed (tid == 0) */
> + out = ring->data[ring->page->data_tail & ring->mask];
> + KUNIT_EXPECT_EQ(test, out.tid, 0u);
> +
> + /* Drain the ring; the last record must have tid == cap - 1 */
> + for (i = 0; i < cap; i++) {
> + u32 tail = ring->page->data_tail;
> +
> + out = ring->data[tail & ring->mask];
> + ring->page->data_tail = tail + 1;
> + }
> + KUNIT_EXPECT_EQ(test, out.tid, cap - 1);
> + KUNIT_EXPECT_EQ(test, ring->page->data_head, ring->page->data_tail);
> +}
> +
> +/* A freshly initialised ring is empty. */
> +static void tlob_event_empty(struct kunit *test)
> +{
> + struct rv_file_priv *priv;
> + struct tlob_ring *ring;
> +
> + priv = alloc_priv_kunit(test, TLOB_RING_DEFAULT_CAP);
> + KUNIT_ASSERT_NOT_NULL(test, priv);
> +
> + ring = &priv->ring;
> +
> + KUNIT_EXPECT_EQ(test, ring->page->data_head, 0u);
> + KUNIT_EXPECT_EQ(test, ring->page->data_tail, 0u);
> + KUNIT_EXPECT_EQ(test, ring->page->dropped, 0ull);
> +}
> +
> +/* A kthread blocks on wait_event_interruptible(); pushing one record must
> + * wake it within 1 s.
> + */
> +
> +struct tlob_wakeup_ctx {
> + struct rv_file_priv *priv;
> + struct completion ready;
> + struct completion done;
> + int woke;
> +};
> +
> +static int tlob_wakeup_thread(void *arg)
> +{
> + struct tlob_wakeup_ctx *ctx = arg;
> + struct tlob_ring *ring = &ctx->priv->ring;
> +
> + complete(&ctx->ready);
> +
> + wait_event_interruptible(ctx->priv->waitq,
> + smp_load_acquire(&ring->page->data_head) !=
> + READ_ONCE(ring->page->data_tail) ||
> + kthread_should_stop());
> +
> + if (smp_load_acquire(&ring->page->data_head) !=
> + READ_ONCE(ring->page->data_tail))
> + ctx->woke = 1;
> +
> + complete(&ctx->done);
> + return 0;
> +}
> +
> +static void tlob_ring_wakeup(struct kunit *test)
> +{
> + struct rv_file_priv *priv;
> + struct tlob_wakeup_ctx ctx;
> + struct task_struct *t;
> + struct tlob_event ev = { .tid = 99 };
> + long timeout;
> +
> + priv = alloc_priv_kunit(test, TLOB_RING_DEFAULT_CAP);
> + KUNIT_ASSERT_NOT_NULL(test, priv);
> +
> + init_completion(&ctx.ready);
> + init_completion(&ctx.done);
> + ctx.priv = priv;
> + ctx.woke = 0;
> +
> + t = kthread_run(tlob_wakeup_thread, &ctx, "tlob_wakeup_kunit");
> + KUNIT_ASSERT_NOT_ERR_OR_NULL(test, t);
> + get_task_struct(t);
> +
> + /* Let the kthread reach wait_event_interruptible */
> + wait_for_completion(&ctx.ready);
> + usleep_range(10000, 20000);
> +
> + /* Push one record -- must wake the waiter */
> + tlob_event_push_kunit(priv, &ev);
> +
> + timeout = wait_for_completion_timeout(&ctx.done, msecs_to_jiffies(1000));
> + kthread_stop(t);
> + put_task_struct(t);
> +
> + KUNIT_EXPECT_GT(test, timeout, 0L);
> + KUNIT_EXPECT_EQ(test, ctx.woke, 1);
> + KUNIT_EXPECT_EQ(test, priv->ring.page->data_head, 1u);
> +}
> +
> +static struct kunit_case tlob_event_buf_cases[] = {
> + KUNIT_CASE(tlob_event_push_one),
> + KUNIT_CASE(tlob_event_push_overflow),
> + KUNIT_CASE(tlob_event_empty),
> + KUNIT_CASE(tlob_ring_wakeup),
> + {}
> +};
> +
> +static struct kunit_suite tlob_event_buf_suite = {
> + .name = "tlob_event_buf",
> + .test_cases = tlob_event_buf_cases,
> +};
> +
> +/* Suite 6: uprobe format string parser */
> +
> +/* Happy path: decimal offsets, plain path. */
> +static void tlob_parse_decimal_offsets(struct kunit *test)
> +{
> + char buf[] = "5000:4768:4848:/usr/bin/myapp";
> + u64 thr; loff_t start, stop; char *path;
> +
> + KUNIT_EXPECT_EQ(test,
> + tlob_parse_uprobe_line(buf, &thr, &path, &start, &stop),
> + 0);
> + KUNIT_EXPECT_EQ(test, thr, (u64)5000);
> + KUNIT_EXPECT_EQ(test, start, (loff_t)4768);
> + KUNIT_EXPECT_EQ(test, stop, (loff_t)4848);
> + KUNIT_EXPECT_STREQ(test, path, "/usr/bin/myapp");
> +}
> +
> +/* Happy path: 0x-prefixed hex offsets. */
> +static void tlob_parse_hex_offsets(struct kunit *test)
> +{
> + char buf[] = "10000:0x12a0:0x12f0:/usr/bin/myapp";
> + u64 thr; loff_t start, stop; char *path;
> +
> + KUNIT_EXPECT_EQ(test,
> + tlob_parse_uprobe_line(buf, &thr, &path, &start, &stop),
> + 0);
> + KUNIT_EXPECT_EQ(test, start, (loff_t)0x12a0);
> + KUNIT_EXPECT_EQ(test, stop, (loff_t)0x12f0);
> + KUNIT_EXPECT_STREQ(test, path, "/usr/bin/myapp");
> +}
> +
> +/* Path containing ':' must not be truncated. */
> +static void tlob_parse_path_with_colon(struct kunit *test)
> +{
> + char buf[] = "1000:0x100:0x200:/opt/my:app/bin";
> + u64 thr; loff_t start, stop; char *path;
> +
> + KUNIT_EXPECT_EQ(test,
> + tlob_parse_uprobe_line(buf, &thr, &path, &start, &stop),
> + 0);
> + KUNIT_EXPECT_STREQ(test, path, "/opt/my:app/bin");
> +}
> +
> +/* Zero threshold must be rejected. */
> +static void tlob_parse_zero_threshold(struct kunit *test)
> +{
> + char buf[] = "0:0x100:0x200:/usr/bin/myapp";
> + u64 thr; loff_t start, stop; char *path;
> +
> + KUNIT_EXPECT_EQ(test,
> + tlob_parse_uprobe_line(buf, &thr, &path, &start, &stop),
> + -EINVAL);
> +}
> +
> +/* Empty path (trailing ':' with nothing after) must be rejected. */
> +static void tlob_parse_empty_path(struct kunit *test)
> +{
> + char buf[] = "5000:0x100:0x200:";
> + u64 thr; loff_t start, stop; char *path;
> +
> + KUNIT_EXPECT_EQ(test,
> + tlob_parse_uprobe_line(buf, &thr, &path, &start, &stop),
> + -EINVAL);
> +}
> +
> +/* Missing field (3 tokens instead of 4) must be rejected. */
> +static void tlob_parse_too_few_fields(struct kunit *test)
> +{
> + char buf[] = "5000:0x100:/usr/bin/myapp";
> + u64 thr; loff_t start, stop; char *path;
> +
> + KUNIT_EXPECT_EQ(test,
> + tlob_parse_uprobe_line(buf, &thr, &path, &start, &stop),
> + -EINVAL);
> +}
> +
> +/* Negative offset must be rejected. */
> +static void tlob_parse_negative_offset(struct kunit *test)
> +{
> + char buf[] = "5000:-1:0x200:/usr/bin/myapp";
> + u64 thr; loff_t start, stop; char *path;
> +
> + KUNIT_EXPECT_EQ(test,
> + tlob_parse_uprobe_line(buf, &thr, &path, &start, &stop),
> + -EINVAL);
> +}
> +
> +static struct kunit_case tlob_parse_uprobe_cases[] = {
> + KUNIT_CASE(tlob_parse_decimal_offsets),
> + KUNIT_CASE(tlob_parse_hex_offsets),
> + KUNIT_CASE(tlob_parse_path_with_colon),
> + KUNIT_CASE(tlob_parse_zero_threshold),
> + KUNIT_CASE(tlob_parse_empty_path),
> + KUNIT_CASE(tlob_parse_too_few_fields),
> + KUNIT_CASE(tlob_parse_negative_offset),
> + {}
> +};
> +
> +static struct kunit_suite tlob_parse_uprobe_suite = {
> + .name = "tlob_parse_uprobe",
> + .test_cases = tlob_parse_uprobe_cases,
> +};
> +
> +kunit_test_suites(&tlob_automaton_suite,
> + &tlob_task_api_suite,
> + &tlob_sched_integration_suite,
> + &tlob_trace_output_suite,
> + &tlob_event_buf_suite,
> + &tlob_parse_uprobe_suite);
> +
> +MODULE_DESCRIPTION("KUnit tests for the tlob RV monitor");
> +MODULE_LICENSE("GPL");
^ permalink raw reply
* Re: [PATCH 55/61] interconnect: Prefer IS_ERR_OR_NULL over manual NULL check
From: Krzysztof Kozlowski @ 2026-04-16 12:24 UTC (permalink / raw)
To: Philipp Hahn, amd-gfx, apparmor, bpf, ceph-devel, cocci, dm-devel,
dri-devel, gfs2, intel-gfx, intel-wired-lan, iommu, kvm,
linux-arm-kernel, linux-block, linux-bluetooth, linux-btrfs,
linux-cifs, linux-clk, linux-erofs, linux-ext4, linux-fsdevel,
linux-gpio, linux-hyperv, linux-input, linux-kernel, linux-leds,
linux-media, linux-mips, linux-mm, linux-modules, linux-mtd,
linux-nfs, linux-omap, linux-phy, linux-pm, linux-rockchip,
linux-s390, linux-scsi, linux-sctp, linux-security-module,
linux-sh, linux-sound, linux-stm32, linux-trace-kernel, linux-usb,
linux-wireless, netdev, ntfs3, samba-technical, sched-ext,
target-devel, tipc-discussion, v9fs
Cc: Georgi Djakov
In-Reply-To: <20260310-b4-is_err_or_null-v1-55-bd63b656022d@avm.de>
On 10/03/2026 12:49, Philipp Hahn wrote:
> Prefer using IS_ERR_OR_NULL() over using IS_ERR() and a manual NULL
> check.
>
> Semantic change: Previously the code only printed the warning on error,
> but not when the pointer was NULL. Now the warning is printed in both
> cases!
NAK, read the code
>
> Change found with coccinelle.
>
> To: Georgi Djakov <djakov@kernel.org>
> Cc: linux-pm@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: Philipp Hahn <phahn-oss@avm.de>
> ---
> drivers/interconnect/core.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/interconnect/core.c b/drivers/interconnect/core.c
> index 8569b78a18517b33abeafac091978b25cbc1acc7..22e92b30f73853d5bd2e05b4f52cb5aa22556468 100644
> --- a/drivers/interconnect/core.c
> +++ b/drivers/interconnect/core.c
> @@ -790,7 +790,7 @@ void icc_put(struct icc_path *path)
> size_t i;
> int ret;
>
> - if (!path || WARN_ON(IS_ERR(path)))
> + if (WARN_ON(IS_ERR_OR_NULL(path)))
IS_ERR_OR_NULL is simply discouraged, but code preference aside, you
just added a bug here. This is clearly not equivalent: you now emit a
warning on a perfectly valid case!
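To make the difference concrete, here is a rough user-space model of the
two checks. It is only a sketch: is_err(), warn_on() and the two
*_check_bails() helpers are stand-ins invented for illustration, not the
kernel's definitions, and the only thing modelled is whether a warning
fires.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* User-space stand-ins for the kernel helpers; not the real definitions. */
#define MAX_ERRNO 4095

static int warn_count; /* how many times the modelled WARN_ON() fired */

/* Error pointers live in the top MAX_ERRNO bytes of the address space. */
static bool is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

static bool is_err_or_null(const void *p)
{
	return !p || is_err(p);
}

/* Models WARN_ON(): counts the warning and returns the condition. */
static bool warn_on(bool cond)
{
	if (cond)
		warn_count++;
	return cond;
}

/* Old check: NULL bails out silently, only an error pointer warns. */
static bool old_check_bails(const void *path)
{
	return !path || warn_on(is_err(path));
}

/* Proposed check: NULL and error pointers both warn. */
static bool new_check_bails(const void *path)
{
	return warn_on(is_err_or_null(path));
}

/* Number of warnings a given check emits when handed a NULL path. */
static int warnings_for_null(bool (*check)(const void *))
{
	warn_count = 0;
	check(NULL);
	return warn_count;
}
```

Both variants return early for NULL, but only the proposed one warns
while doing so, which is the behaviour change objected to here.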
Best regards,
Krzysztof
^ permalink raw reply
* Re: [RFC PATCH 1/2] kernel/notifier: replace single-linked list with double-linked list for reverse traversal
From: David Laight @ 2026-04-16 12:30 UTC (permalink / raw)
To: chensong_2000
Cc: rafael, lenb, mturquette, sboyd, viresh.kumar, agk, snitzer,
mpatocka, bmarzins, song, yukuai, linan122, jason.wessel, danielt,
dianders, horms, davem, edumazet, kuba, pabeni, paulmck, frederic,
mcgrof, petr.pavlu, da.gomez, samitolvanen, atomlin, jpoimboe,
jikos, mbenes, pmladek, joe.lawrence, rostedt, mhiramat,
mark.rutland, mathieu.desnoyers, linux-modules, linux-kernel,
linux-trace-kernel, linux-acpi, linux-clk, linux-pm,
live-patching, dm-devel, linux-raid, kgdb-bugreport, netdev
In-Reply-To: <20260415070137.17860-1-chensong_2000@189.cn>
On Wed, 15 Apr 2026 15:01:37 +0800
chensong_2000@189.cn wrote:
> From: Song Chen <chensong_2000@189.cn>
>
> The current notifier chain implementation uses a single-linked list
> (struct notifier_block *next), which only supports forward traversal
> in priority order. This makes it difficult to handle cleanup/teardown
> scenarios that require notifiers to be called in reverse priority order.
If it is only cleanup/teardown then the list can be order-reversed
as part of that process at the same time as the list is deleted.
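A minimal sketch of that idea, assuming a simplified chain: struct nb is
a hypothetical stand-in for struct notifier_block that keeps only the
link and an id, and locking/RCU is ignored entirely. The chain is
reversed in place while it is being detached, so the entries are visited
in the opposite order without needing a prev pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct notifier_block: only the link and an id. */
struct nb {
	struct nb *next;
	int id;
};

/* Reverse a singly linked chain in place and return the new head. */
static struct nb *chain_reverse(struct nb *head)
{
	struct nb *prev = NULL;

	while (head) {
		struct nb *next = head->next;

		head->next = prev;
		prev = head;
		head = next;
	}
	return prev;
}

/*
 * Tear the chain down: detach it, reverse it, then visit each entry.
 * The visit order is written to out[]; returns the number of entries visited.
 */
static size_t chain_teardown(struct nb **headp, int *out, size_t max)
{
	struct nb *nb = chain_reverse(*headp);
	size_t n = 0;

	*headp = NULL;	/* the chain no longer exists after teardown */
	while (nb && n < max) {
		struct nb *next = nb->next;

		nb->next = NULL;
		out[n++] = nb->id;
		nb = next;
	}
	return n;
}

/* Usage: the chain 1 -> 2 -> 3 is visited as 3, 2, 1 (encoded as 321). */
static int chain_teardown_demo(void)
{
	struct nb c = { NULL, 3 }, b = { &c, 2 }, a = { &b, 1 };
	struct nb *head = &a;
	int order[3] = { 0 };
	size_t n = chain_teardown(&head, order, 3);

	if (n != 3 || head)
		return -1;
	return order[0] * 100 + order[1] * 10 + order[2];
}
```

This only works when the chain is destroyed afterwards; a chain that is
walked in reverse repeatedly would have to be reversed twice per walk.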
David
^ permalink raw reply
* Re: [PATCH 01/61] Coccinelle: Prefer IS_ERR_OR_NULL over manual NULL check
From: Krzysztof Kozlowski @ 2026-04-16 12:30 UTC (permalink / raw)
To: Philipp Hahn, amd-gfx, apparmor, bpf, ceph-devel, cocci, dm-devel,
dri-devel, gfs2, intel-gfx, intel-wired-lan, iommu, kvm,
linux-arm-kernel, linux-block, linux-bluetooth, linux-btrfs,
linux-cifs, linux-clk, linux-erofs, linux-ext4, linux-fsdevel,
linux-gpio, linux-hyperv, linux-input, linux-kernel, linux-leds,
linux-media, linux-mips, linux-mm, linux-modules, linux-mtd,
linux-nfs, linux-omap, linux-phy, linux-pm, linux-rockchip,
linux-s390, linux-scsi, linux-sctp, linux-security-module,
linux-sh, linux-sound, linux-stm32, linux-trace-kernel, linux-usb,
linux-wireless, netdev, ntfs3, samba-technical, sched-ext,
target-devel, tipc-discussion, v9fs
Cc: Julia Lawall, Nicolas Palix
In-Reply-To: <20260310-b4-is_err_or_null-v1-1-bd63b656022d@avm.de>
On 10/03/2026 12:48, Philipp Hahn wrote:
> Find and convert uses of IS_ERR() plus NULL check to IS_ERR_OR_NULL().
>
> There are several cases where `!ptr && WARN_ON[_ONCE](IS_ERR(ptr))` is
> used:
> - arch/x86/kernel/callthunks.c:215 WARN_ON_ONCE
> - drivers/clk/clk.c:4561 WARN_ON_ONCE
> - drivers/interconnect/core.c:793 WARN_ON
> - drivers/reset/core.c:718 WARN_ON
> The change is not 100% semantically equivalent as the warning will now
> also happen when the pointer is NULL.
>
> To: Julia Lawall <Julia.Lawall@inria.fr>
> To: Nicolas Palix <nicolas.palix@imag.fr>
> Cc: cocci@inria.fr
> Cc: linux-kernel@vger.kernel.org
>
> ---
> drivers/clocksource/mips-gic-timer.c:283 looks suspicious: ret != clk,
> but Daniel Lezcano verified it as correct.
>
> There are some cases where the checks are part of a larger expression:
> - mm/kmemleak.c:1095
> - mm/kmemleak.c:1155
> - mm/kmemleak.c:1173
> - mm/kmemleak.c:1290
> - mm/kmemleak.c:1328
> - mm/kmemleak.c:1241
> - mm/kmemleak.c:1310
> - mm/kmemleak.c:1258
> - net/netlink/af_netlink.c:2670
> Thanks to Julia Lawall for the help to also handle them.
>
> Signed-off-by: Philipp Hahn <phahn-oss@avm.de>
> ---
> scripts/coccinelle/api/is_err_or_null.cocci | 125 ++++++++++++++++++++++++++++
> 1 file changed, 125 insertions(+)
>
Neither this, nor the attempt from 2011, nor any future attempt should be
accepted, because it creates the impression that IS_ERR_OR_NULL is somehow
okay. No, it is not okay; it is a discouraged pattern leading to less
readable and less maintainable code. We should therefore not have any tools
suggesting usage of IS_ERR_OR_NULL, because people will convert poor code
into that instead of fixing the poor code.
Best regards,
Krzysztof
^ permalink raw reply
* Re: [RFC PATCH 2/2] kernel/module: Decouple klp and ftrace from load_module
From: Petr Mladek @ 2026-04-16 13:09 UTC (permalink / raw)
To: Song Chen
Cc: Petr Pavlu, rafael, lenb, mturquette, sboyd, viresh.kumar, agk,
snitzer, mpatocka, bmarzins, song, yukuai, linan122, jason.wessel,
danielt, dianders, horms, davem, edumazet, kuba, pabeni, paulmck,
frederic, mcgrof, da.gomez, samitolvanen, atomlin, jpoimboe,
jikos, mbenes, joe.lawrence, rostedt, mhiramat, mark.rutland,
mathieu.desnoyers, linux-modules, linux-kernel,
linux-trace-kernel, linux-acpi, linux-clk, linux-pm,
live-patching, dm-devel, linux-raid, kgdb-bugreport, netdev
In-Reply-To: <a35f5f94-7d5a-4347-974b-b270c89ef241@189.cn>
On Wed 2026-04-15 14:43:53, Song Chen wrote:
> Hi,
>
> On 4/14/26 22:33, Petr Pavlu wrote:
> > On 4/13/26 10:07 AM, chensong_2000@189.cn wrote:
> > > From: Song Chen <chensong_2000@189.cn>
> > >
> > > ftrace and livepatch currently have their module load/unload callbacks
> > > hard-coded in the module loader as direct function calls to
> > > ftrace_module_enable(), klp_module_coming(), klp_module_going()
> > > and ftrace_release_mod(). This tight coupling was originally introduced
> > > to enforce strict call ordering that could not be guaranteed by the
> > > module notifier chain, which only supported forward traversal. Their
> > > notifiers were moved in and out back and forth. see [1] and [2].
> >
> > I'm unclear about what is meant by the notifiers being moved back and
> > forth. The links point to patches that converted ftrace+klp from using
> > module notifiers to explicit callbacks due to ordering issues, but this
> > switch occurred only once. Have there been other attempts to use
> > notifiers again?
> >
> > > diff --git a/include/linux/module.h b/include/linux/module.h
> > > index 14f391b186c6..0bdd56f9defd 100644
> > > --- a/include/linux/module.h
> > > +++ b/include/linux/module.h
> > > @@ -308,6 +308,14 @@ enum module_state {
> > > MODULE_STATE_COMING, /* Full formed, running module_init. */
> > > MODULE_STATE_GOING, /* Going away. */
> > > MODULE_STATE_UNFORMED, /* Still setting it up. */
> > > + MODULE_STATE_FORMED,
> >
> > I don't see a reason to add a new module state. Why is it necessary and
> > how does it fit with the existing states?
> >
> because once notifier fails in state MODULE_STATE_UNFORMED (now only ftrace
> has something to do in this state), notifier chain will roll back by calling
> blocking_notifier_call_chain_robust, i'm afraid MODULE_STATE_GOING is going
> to jeopardise the notifiers which don't handle it appropriately, like:
>
> case MODULE_STATE_COMING:
> kmalloc();
> case MODULE_STATE_GOING:
> kfree();
>
>
> > > +};
> > > +
> > > +enum module_notifier_prio {
> > > + MODULE_NOTIFIER_PRIO_LOW = INT_MIN, /* Low priority, coming last, going first */
> > > + MODULE_NOTIFIER_PRIO_MID = 0, /* Normal priority. */
> > > + MODULE_NOTIFIER_PRIO_SECOND_HIGH = INT_MAX - 1, /* Second high priority, coming second */
> > > + MODULE_NOTIFIER_PRIO_HIGH = INT_MAX, /* High priority, coming first, going late. */
> >
> > I suggest being explicit about how the notifiers are ordered. For
> > example:
> >
> > enum module_notifier_prio {
> > MODULE_NOTIFIER_PRIO_NORMAL, /* Normal priority, coming last, going first. */
> > MODULE_NOTIFIER_PRIO_LIVEPATCH,
> > MODULE_NOTIFIER_PRIO_FTRACE, /* High priority, coming first, going late. */
> > };
> >
I like the explicit PRIO_LIVEPATCH/FTRACE names.
But I would keep the INT_MAX - 1 and INT_MAX priorities. I believe
that ftrace/livepatching will always be the first/last to call.
And INT_MAX would help to preserve kABI when PRIO_NORMAL is not
enough for the rest of notifiers.
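Roughly what I have in mind, as a sketch only: the names follow the
suggestion above, the values are the ones from the patch, and
called_before() is a made-up helper that just spells out the ordering
rule (higher priority first on forward traversal, last on reverse).

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Sketch only: explicit names, but pinned to the top of the int range. */
enum module_notifier_prio {
	MODULE_NOTIFIER_PRIO_NORMAL	= 0,		/* everybody else */
	MODULE_NOTIFIER_PRIO_LIVEPATCH	= INT_MAX - 1,	/* coming second, going second to last */
	MODULE_NOTIFIER_PRIO_FTRACE	= INT_MAX,	/* coming first, going last */
};

/*
 * Hypothetical helper: forward traversal calls higher priorities first,
 * reverse traversal flips the order.
 */
static bool called_before(int prio_a, int prio_b, bool reverse)
{
	return reverse ? prio_a < prio_b : prio_a > prio_b;
}
```

Any priority added later between PRIO_NORMAL and PRIO_LIVEPATCH leaves
the two reserved values untouched.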
That said, I am not sure whether this is worth the effort.
This patch tries to move the explicit callbacks into the generic
notifier API. But it will still need to use some explicitly
defined (reserved) priorities. And it will not prevent misuse.
Also it requires a doubly linked list, which complicates the
notifier code.
> > > };
> > > struct mod_tree_node {
> > > --- a/kernel/module/main.c
> > > +++ b/kernel/module/main.c
> > > @@ -3281,20 +3277,14 @@ static int complete_formation(struct module *mod, struct load_info *info)
> > > return err;
> > > }
> > > -static int prepare_coming_module(struct module *mod)
> > > +static int prepare_module_state_transaction(struct module *mod,
> > > + unsigned long val_up, unsigned long val_down)
> > > {
> > > int err;
> > > - ftrace_module_enable(mod);
> > > - err = klp_module_coming(mod);
> > > - if (err)
> > > - return err;
> > > -
> > > err = blocking_notifier_call_chain_robust(&module_notify_list,
> > > - MODULE_STATE_COMING, MODULE_STATE_GOING, mod);
> > > + val_up, val_down, mod);
> > > err = notifier_to_errno(err);
> > > - if (err)
> > > - klp_module_going(mod);
> > > return err;
> > > }
I personally find the name "prepare_module_state_transaction"
misleading. What is the "transaction" here? If this was a "preparation"
step then where is the transaction done/finished?
It might be better to just open-code the
blocking_notifier_call_chain_robust() call instead.
> > > @@ -3468,14 +3458,21 @@ static int load_module(struct load_info *info, const char __user *uargs,
> > > init_build_id(mod, info);
> > > /* Ftrace init must be called in the MODULE_STATE_UNFORMED state */
> > > - ftrace_module_init(mod);
> > > + err = prepare_module_state_transaction(mod,
> > > + MODULE_STATE_UNFORMED, MODULE_STATE_FORMED);
> >
> > I believe val_down should be MODULE_STATE_GOING to reverse the
> > operation. Why is the new state MODULE_STATE_FORMED needed here?
> to avoid this:
>
> case MODULE_STATE_COMING:
> kmalloc();
> case MODULE_STATE_GOING:
> kfree();
Hmm, the module is in "FORMED" state here.
> > > + if (err)
> > > + goto ddebug_cleanup;
> > > /* Finally it's fully formed, ready to start executing. */
> > > err = complete_formation(mod, info);
And we call the "complete_formation()" function. This sounds like
it was not really "FORMED" before. => It is confusing and a no-no.
Please, try to avoid the new state if possible. My experience
with reading the module loader code is that any new state
brings a lot of complexity. You need to take it into account
when checking correctness of other changes, features, ...
Something tells me that if the state was not needed before
then we could avoid it.
> > > - if (err)
> > > + if (err) {
> > > + blocking_notifier_call_chain_reverse(&module_notify_list,
> > > + MODULE_STATE_FORMED, mod);
> > > goto ddebug_cleanup;
> > > + }
> > > - err = prepare_coming_module(mod);
> > > + err = prepare_module_state_transaction(mod,
> > > + MODULE_STATE_COMING, MODULE_STATE_GOING);
> > > if (err)
> > > goto bug_cleanup;
> > > --- a/kernel/trace/ftrace.c
> > > +++ b/kernel/trace/ftrace.c
> > > @@ -5241,6 +5241,44 @@ static int __init ftrace_mod_cmd_init(void)
> > > }
> > > core_initcall(ftrace_mod_cmd_init);
> > > +static int ftrace_module_callback(struct notifier_block *nb, unsigned long op,
> > > + void *module)
> > > +{
> > > + struct module *mod = module;
> > > +
> > > + switch (op) {
> > > + case MODULE_STATE_UNFORMED:
> > > + ftrace_module_init(mod);
> > > + break;
> > > + case MODULE_STATE_COMING:
> > > + ftrace_module_enable(mod);
> > > + break;
> > > + case MODULE_STATE_LIVE:
> > > + ftrace_free_mem(mod, mod->mem[MOD_INIT_TEXT].base,
> > > + mod->mem[MOD_INIT_TEXT].base + mod->mem[MOD_INIT_TEXT].size);
> > > + break;
> > > + case MODULE_STATE_GOING:
> > > + case MODULE_STATE_FORMED:
> > > + ftrace_release_mod(mod);
This calls "release" in a "FORMED" state. It does not make any
sense. Something looks fishy, either the code or the naming.
> > > + break;
> > > + default:
> > > + break;
> > > + }
> >
I am sorry for being so picky about names. I believe that good names
help to prevent bugs and reduce headaches.
Best Regards,
Petr
^ permalink raw reply
* Re: [RFC PATCH 2/2] kernel/module: Decouple klp and ftrace from load_module
From: Petr Mladek @ 2026-04-16 14:49 UTC (permalink / raw)
To: Petr Pavlu
Cc: Song Chen, rafael, lenb, mturquette, sboyd, viresh.kumar, agk,
snitzer, mpatocka, bmarzins, song, yukuai, linan122, jason.wessel,
danielt, dianders, horms, davem, edumazet, kuba, pabeni, paulmck,
frederic, mcgrof, da.gomez, samitolvanen, atomlin, jpoimboe,
jikos, mbenes, joe.lawrence, rostedt, mhiramat, mark.rutland,
mathieu.desnoyers, linux-modules, linux-kernel,
linux-trace-kernel, linux-acpi, linux-clk, linux-pm,
live-patching, dm-devel, linux-raid, kgdb-bugreport, netdev
In-Reply-To: <1db425bf-58a9-4768-8c38-3ae25d7662a5@suse.com>
On Thu 2026-04-16 13:18:30, Petr Pavlu wrote:
> On 4/15/26 8:43 AM, Song Chen wrote:
> > On 4/14/26 22:33, Petr Pavlu wrote:
> >> On 4/13/26 10:07 AM, chensong_2000@189.cn wrote:
> >>> diff --git a/include/linux/module.h b/include/linux/module.h
> >>> index 14f391b186c6..0bdd56f9defd 100644
> >>> --- a/include/linux/module.h
> >>> +++ b/include/linux/module.h
> >>> @@ -308,6 +308,14 @@ enum module_state {
> >>> MODULE_STATE_COMING, /* Full formed, running module_init. */
> >>> MODULE_STATE_GOING, /* Going away. */
> >>> MODULE_STATE_UNFORMED, /* Still setting it up. */
> >>> + MODULE_STATE_FORMED,
> >>
> >> I don't see a reason to add a new module state. Why is it necessary and
> >> how does it fit with the existing states?
> >>
> > > because once notifier fails in state MODULE_STATE_UNFORMED (now only ftrace has something to do in this state), notifier chain will roll back by calling blocking_notifier_call_chain_robust, i'm afraid MODULE_STATE_GOING is going to jeopardise the notifiers which don't handle it appropriately, like:
> >
> > case MODULE_STATE_COMING:
> > kmalloc();
> > case MODULE_STATE_GOING:
> > kfree();
>
> My understanding is that the current module "state machine" operates as
> follows. Transitions marked with an asterisk (*) are announced via the
> module notifier.
>
> ---> UNFORMED --*> COMING --*> LIVE --*> GOING -.
> ^ | ^ |
> | '---------------------* |
> '---------------------------------------'
>
> The new code aims to replace the current ftrace_module_init() call in
> load_module(). To achieve this, it adds a notification for the UNFORMED
> state (only when loading a module) and introduces a new FORMED state for
> rollback. FORMED is purely a fake state because it never appears in
> module::state. The new structure is as follows:
>
> ,--*> (FORMED)
> |
> --*> UNFORMED --*> COMING --*> LIVE --*> GOING -.
> ^ | ^ |
> | '---------------------* |
> '---------------------------------------'
>
> I'm afraid this is quite complex and inconsistent. Unless it can be kept
> simple, we would be just replacing one special handling with a different
> complexity, which is not worth it.
> >>
> >>> + if (err)
> >>> + goto ddebug_cleanup;
> >>> /* Finally it's fully formed, ready to start executing. */
> >>> err = complete_formation(mod, info);
> >>> - if (err)
> >>> + if (err) {
> >>> + blocking_notifier_call_chain_reverse(&module_notify_list,
> >>> + MODULE_STATE_FORMED, mod);
> >>> goto ddebug_cleanup;
> >>> + }
> >>> - err = prepare_coming_module(mod);
> >>> + err = prepare_module_state_transaction(mod,
> >>> + MODULE_STATE_COMING, MODULE_STATE_GOING);
> >>> if (err)
> >>> goto bug_cleanup;
> >>> @@ -3522,7 +3519,6 @@ static int load_module(struct load_info *info, const char __user *uargs,
> >>> destroy_params(mod->kp, mod->num_kp);
> >>> blocking_notifier_call_chain(&module_notify_list,
> >>> MODULE_STATE_GOING, mod);
> >>
> >> My understanding is that all notifier chains for MODULE_STATE_GOING
> >> should be reversed.
> > yes, all, from lowest priority notifier to highest.
> > I will resend patch 1 which was failed due to my proxy setting.
>
> What I meant here is that the call:
>
> blocking_notifier_call_chain(&module_notify_list, MODULE_STATE_GOING, mod);
>
> should be replaced with:
>
> blocking_notifier_call_chain_reverse(&module_notify_list, MODULE_STATE_GOING, mod);
>
> >
> >>
> >>> - klp_module_going(mod);
> >>> bug_cleanup:
> >>> mod->state = MODULE_STATE_GOING;
> >>> /* module_bug_cleanup needs module_mutex protection */
> >>
> >> The patch removes the klp_module_going() cleanup call in load_module().
> >> Similarly, the ftrace_release_mod() call under the ddebug_cleanup label
> >> should be removed and appropriately replaced with a cleanup via
> >> a notifier.
> >>
> > err = prepare_module_state_transaction(mod,
> > MODULE_STATE_UNFORMED, MODULE_STATE_FORMED);
> > if (err)
> > goto ddebug_cleanup;
> >
> > ftrace will be cleanup in blocking_notifier_call_chain_robust rolling back.
> >
> > err = prepare_module_state_transaction(mod,
> > MODULE_STATE_COMING, MODULE_STATE_GOING);
> >
> > each notifier including ftrace and klp will be cleanup in blocking_notifier_call_chain_robust rolling back.
> >
> > if all notifiers are successful in MODULE_STATE_COMING, they all will be clean up in
> > coming_cleanup:
> > mod->state = MODULE_STATE_GOING;
> > destroy_params(mod->kp, mod->num_kp);
> > blocking_notifier_call_chain(&module_notify_list,
> > MODULE_STATE_GOING, mod);
> >
> > if something wrong underneath.
>
> My point is that the patch leaves a call to ftrace_release_mod() in
> load_module(), which I expected to be handled via a notifier.
I think that I have got it. The ftrace code needs two notifiers when
the module is being loaded and two when it is going.
This is why Song added the new state. But I think that we would
need two new states to call:
+ ftrace_module_init() in MODULE_STATE_UNFORMED
+ ftrace_module_enable() in MODULE_STATE_FORMED
and
+ ftrace_free_mem() in MODULE_STATE_PRE_GOING
+ ftrace_free_mem() in MODULE_STATE_GOING
By using the ascii art:
-*> UNFORMED -*> FORMED -> COMING -*> LIVE -*> PRE_GOING -*> GOING -.
| | | ^ ^ ^
| | '----------------' | |
| '--------------------------------------' |
'------------------------------------------------------'
But I think that this is not worth it.
Best Regards,
Petr
^ permalink raw reply
* Re: [RFC PATCH 1/2] kernel/notifier: replace single-linked list with double-linked list for reverse traversal
From: Petr Mladek @ 2026-04-16 14:54 UTC (permalink / raw)
To: David Laight
Cc: chensong_2000, rafael, lenb, mturquette, sboyd, viresh.kumar, agk,
snitzer, mpatocka, bmarzins, song, yukuai, linan122, jason.wessel,
danielt, dianders, horms, davem, edumazet, kuba, pabeni, paulmck,
frederic, mcgrof, petr.pavlu, da.gomez, samitolvanen, atomlin,
jpoimboe, jikos, mbenes, joe.lawrence, rostedt, mhiramat,
mark.rutland, mathieu.desnoyers, linux-modules, linux-kernel,
linux-trace-kernel, linux-acpi, linux-clk, linux-pm,
live-patching, dm-devel, linux-raid, kgdb-bugreport, netdev
In-Reply-To: <20260416133004.07bd2886@pumpkin>
On Thu 2026-04-16 13:30:04, David Laight wrote:
> On Wed, 15 Apr 2026 15:01:37 +0800
> chensong_2000@189.cn wrote:
>
> > From: Song Chen <chensong_2000@189.cn>
> >
> > The current notifier chain implementation uses a single-linked list
> > (struct notifier_block *next), which only supports forward traversal
> > in priority order. This makes it difficult to handle cleanup/teardown
> > scenarios that require notifiers to be called in reverse priority order.
>
> If it is only cleanup/teardown then the list can be order-reversed
> as part of that process at the same time as the list is deleted.
Interesting idea. But it won't work in all situations.
Note that the motivation for this update is the module loader
notifiers, which are called several times for each loaded/removed module.
Best Regards,
Petr
^ permalink raw reply
* [PATCH v5 1/7] tracing/lock: Remove unnecessary linux/sched.h include
From: Dmitry Ilvokhin @ 2026-04-16 15:05 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long,
Thomas Bogendoerfer, Juergen Gross, Ajay Kaher, Alexey Makhalov,
Broadcom internal kernel review list, Thomas Gleixner,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Arnd Bergmann,
Dennis Zhou, Tejun Heo, Christoph Lameter, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers
Cc: linux-kernel, linux-mips, virtualization, linux-arch, linux-mm,
linux-trace-kernel, kernel-team, Paul E. McKenney,
Dmitry Ilvokhin, Usama Arif
In-Reply-To: <cover.1776350944.git.d@ilvokhin.com>
None of the trace events in lock.h reference anything from
linux/sched.h. Remove the unnecessary include.
Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
Acked-by: Usama Arif <usama.arif@linux.dev>
---
include/trace/events/lock.h | 1 -
1 file changed, 1 deletion(-)
diff --git a/include/trace/events/lock.h b/include/trace/events/lock.h
index 8e89baa3775f..da978f2afb45 100644
--- a/include/trace/events/lock.h
+++ b/include/trace/events/lock.h
@@ -5,7 +5,6 @@
#if !defined(_TRACE_LOCK_H) || defined(TRACE_HEADER_MULTI_READ)
#define _TRACE_LOCK_H
-#include <linux/sched.h>
#include <linux/tracepoint.h>
/* flags for lock:contention_begin */
--
2.52.0
^ permalink raw reply related
* [PATCH v5 0/7] locking: contended_release tracepoint instrumentation
From: Dmitry Ilvokhin @ 2026-04-16 15:05 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long,
Thomas Bogendoerfer, Juergen Gross, Ajay Kaher, Alexey Makhalov,
Broadcom internal kernel review list, Thomas Gleixner,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Arnd Bergmann,
Dennis Zhou, Tejun Heo, Christoph Lameter, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers
Cc: linux-kernel, linux-mips, virtualization, linux-arch, linux-mm,
linux-trace-kernel, kernel-team, Paul E. McKenney,
Dmitry Ilvokhin
The existing contention_begin/contention_end tracepoints fire on the
waiter side. The lock holder's identity and stack can be captured at
contention_begin time (e.g. perf lock contention --lock-owner), but
this reflects the holder's state when a waiter arrives, not when the
lock is actually released.
This series adds a contended_release tracepoint that fires on the
holder side when a lock with waiters is released. This provides:
- Hold time estimation: when the holder's own acquisition was
contended, its contention_end (acquisition) and contended_release
can be correlated to measure how long the lock was held under
contention.
- The holder's stack at release time, which may differ from what perf lock
contention --lock-owner captures if the holder does significant work between
the waiter's arrival and the unlock.
Note: for reader/writer locks, the tracepoint fires for every reader
releasing while a writer is waiting, not only for the last reader.
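As a rough illustration of the hold time estimation, the pairing could be
done in post-processing along these lines. This is a user-space sketch
under assumptions: struct lock_ev and its fields are a hypothetical,
simplified trace record (real records carry more data), and
hold_time_ns() is an invented helper, not part of this series or of
perf.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum ev_kind { EV_CONTENTION_END, EV_CONTENDED_RELEASE };

/* Hypothetical, simplified trace record. */
struct lock_ev {
	enum ev_kind kind;
	uint64_t ts_ns;		/* event timestamp */
	uint64_t lock_addr;	/* lock the event refers to */
	int tid;		/* task that emitted the event */
};

/*
 * Return the time between a contention_end and the next contended_release
 * emitted by the same task for the same lock, or 0 when no such pair exists
 * (for example because the holder acquired the lock uncontended).
 */
static uint64_t hold_time_ns(const struct lock_ev *ev, size_t n,
			     uint64_t lock_addr, int tid)
{
	uint64_t acquired = 0;
	int have_acquire = 0;

	for (size_t i = 0; i < n; i++) {
		if (ev[i].lock_addr != lock_addr || ev[i].tid != tid)
			continue;
		if (ev[i].kind == EV_CONTENTION_END) {
			acquired = ev[i].ts_ns;
			have_acquire = 1;
		} else if (have_acquire) {
			return ev[i].ts_ns - acquired;
		}
	}
	return 0;
}

/* Usage: task 42 acquires at t=1000 after waiting and releases at t=1500. */
static uint64_t hold_time_demo(void)
{
	const struct lock_ev trace[] = {
		{ EV_CONTENTION_END,    1000, 0xabc0, 42 },
		{ EV_CONTENTION_END,    1200, 0xdef0, 7 },	/* unrelated lock */
		{ EV_CONTENDED_RELEASE, 1500, 0xabc0, 42 },
	};

	return hold_time_ns(trace, sizeof(trace) / sizeof(trace[0]), 0xabc0, 42);
}
```

A holder that took the lock without contention has no contention_end to
pair with, so the sketch reports 0 for it rather than guessing.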
v4 -> v5:
- Split the combined spinning locks patch into separate qspinlock and
qrwlock patches (Paul E. McKenney).
- Factor out __queued_read_unlock()/__queued_write_unlock() as a
separate preparatory commit, mirroring the queued_spin_release()
split (Paul E. McKenney).
- Updated binary size numbers for qspinlock-only change.
- Added Acked-by and Reviewed-by tags where appropriate.
v3 -> v4:
- Fix spurious events in __percpu_up_read(): guard with
rcuwait_active(&sem->writer) to avoid tracing during the RCU grace
period after a writer releases (Sashiko).
- Fix possible use-after-free in semaphore up(): move
trace_contended_release() inside the sem->lock critical section
(Sashiko).
- Fix build failure with CONFIG_PARAVIRT_SPINLOCKS=y: introduce
queued_spin_release() as the arch-overridable unlock primitive,
so queued_spin_unlock() can be a generic tracing wrapper. Convert
x86 (paravirt) and MIPS overrides (Sashiko).
- Add EXPORT_TRACEPOINT_SYMBOL_GPL(contended_release) for module
support (Sashiko).
- Split spinning locks patch: factor out queued_spin_release() as a
separate preparatory commit (Sashiko).
- Make read unlock tracepoint behavior consistent across all
reader/writer lock types: fire for every reader releasing while
a writer is waiting (rwsem, rwbase_rt were previously last-reader
only).
v2 -> v3:
- Added new patch: extend contended_release tracepoint to queued spinlocks
and queued rwlocks (marked as RFC, requesting feedback). This is prompted by
Matthew Wilcox's suggestion to try to come up with generic instrumentation,
instead of instrumenting each "special" lock manually. See [1] for the
discussion.
- Reworked tracepoint placement to fire before the lock is released and
before the waiter is woken where possible, for consistency with
spinning locks where there is no explicit wake (inspired by Usama Arif's
suggestion).
- Remove unnecessary linux/sched.h include from trace/events/lock.h.
RFC -> v2:
- Add trace_contended_release_enabled() guard before waiter checks that
exist only for the tracepoint (Steven Rostedt).
- Rename __percpu_up_read_slowpath() to __percpu_up_read() (Peter
Zijlstra).
- Add extern for __percpu_up_read() (Peter Zijlstra).
- Squashed tracepoint introduction and usage commits (Masami Hiramatsu).
v4: https://lore.kernel.org/all/cover.1774536681.git.d@ilvokhin.com/
v3: https://lore.kernel.org/all/cover.1773858853.git.d@ilvokhin.com/
v2: https://lore.kernel.org/all/cover.1773164180.git.d@ilvokhin.com/
RFC: https://lore.kernel.org/all/cover.1772642407.git.d@ilvokhin.com/
[1]: https://lore.kernel.org/all/aa7G1nD7Rd9F4eBH@casper.infradead.org/
Dmitry Ilvokhin (7):
tracing/lock: Remove unnecessary linux/sched.h include
locking/percpu-rwsem: Extract __percpu_up_read()
locking: Add contended_release tracepoint to sleepable locks
locking: Factor out queued_spin_release()
locking: Add contended_release tracepoint to qspinlock
locking: Factor out __queued_read_unlock()/__queued_write_unlock()
locking: Add contended_release tracepoint to qrwlock
arch/mips/include/asm/spinlock.h | 6 ++--
arch/x86/include/asm/paravirt-spinlock.h | 6 ++--
include/asm-generic/qrwlock.h | 38 ++++++++++++++++++++++--
include/asm-generic/qspinlock.h | 33 ++++++++++++++++++--
include/linux/percpu-rwsem.h | 15 ++--------
include/trace/events/lock.h | 18 ++++++++++-
kernel/locking/mutex.c | 4 +++
kernel/locking/percpu-rwsem.c | 29 ++++++++++++++++++
kernel/locking/qrwlock.c | 16 ++++++++++
kernel/locking/qspinlock.c | 8 +++++
kernel/locking/rtmutex.c | 1 +
kernel/locking/rwbase_rt.c | 6 ++++
kernel/locking/rwsem.c | 10 +++++--
kernel/locking/semaphore.c | 4 +++
14 files changed, 167 insertions(+), 27 deletions(-)
--
2.52.0
^ permalink raw reply
* [PATCH v5 2/7] locking/percpu-rwsem: Extract __percpu_up_read()
From: Dmitry Ilvokhin @ 2026-04-16 15:05 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long,
Thomas Bogendoerfer, Juergen Gross, Ajay Kaher, Alexey Makhalov,
Broadcom internal kernel review list, Thomas Gleixner,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Arnd Bergmann,
Dennis Zhou, Tejun Heo, Christoph Lameter, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers
Cc: linux-kernel, linux-mips, virtualization, linux-arch, linux-mm,
linux-trace-kernel, kernel-team, Paul E. McKenney,
Dmitry Ilvokhin, Usama Arif
In-Reply-To: <cover.1776350944.git.d@ilvokhin.com>
Move the percpu_up_read() slowpath out of the inline function into a new
__percpu_up_read() to avoid binary size increase from adding a
tracepoint to an inlined function.
Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
Acked-by: Usama Arif <usama.arif@linux.dev>
---
include/linux/percpu-rwsem.h | 15 +++------------
kernel/locking/percpu-rwsem.c | 18 ++++++++++++++++++
2 files changed, 21 insertions(+), 12 deletions(-)
diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h
index c8cb010d655e..39d5bf8e6562 100644
--- a/include/linux/percpu-rwsem.h
+++ b/include/linux/percpu-rwsem.h
@@ -107,6 +107,8 @@ static inline bool percpu_down_read_trylock(struct percpu_rw_semaphore *sem)
return ret;
}
+extern void __percpu_up_read(struct percpu_rw_semaphore *sem);
+
static inline void percpu_up_read(struct percpu_rw_semaphore *sem)
{
rwsem_release(&sem->dep_map, _RET_IP_);
@@ -118,18 +120,7 @@ static inline void percpu_up_read(struct percpu_rw_semaphore *sem)
if (likely(rcu_sync_is_idle(&sem->rss))) {
this_cpu_dec(*sem->read_count);
} else {
- /*
- * slowpath; reader will only ever wake a single blocked
- * writer.
- */
- smp_mb(); /* B matches C */
- /*
- * In other words, if they see our decrement (presumably to
- * aggregate zero, as that is the only time it matters) they
- * will also see our critical section.
- */
- this_cpu_dec(*sem->read_count);
- rcuwait_wake_up(&sem->writer);
+ __percpu_up_read(sem);
}
preempt_enable();
}
diff --git a/kernel/locking/percpu-rwsem.c b/kernel/locking/percpu-rwsem.c
index ef234469baac..f3ee7a0d6047 100644
--- a/kernel/locking/percpu-rwsem.c
+++ b/kernel/locking/percpu-rwsem.c
@@ -288,3 +288,21 @@ void percpu_up_write(struct percpu_rw_semaphore *sem)
rcu_sync_exit(&sem->rss);
}
EXPORT_SYMBOL_GPL(percpu_up_write);
+
+void __percpu_up_read(struct percpu_rw_semaphore *sem)
+{
+ lockdep_assert_preemption_disabled();
+ /*
+ * slowpath; reader will only ever wake a single blocked
+ * writer.
+ */
+ smp_mb(); /* B matches C */
+ /*
+ * In other words, if they see our decrement (presumably to
+ * aggregate zero, as that is the only time it matters) they
+ * will also see our critical section.
+ */
+ this_cpu_dec(*sem->read_count);
+ rcuwait_wake_up(&sem->writer);
+}
+EXPORT_SYMBOL_GPL(__percpu_up_read);
--
2.52.0
* [PATCH v5 3/7] locking: Add contended_release tracepoint to sleepable locks
From: Dmitry Ilvokhin @ 2026-04-16 15:05 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long,
Thomas Bogendoerfer, Juergen Gross, Ajay Kaher, Alexey Makhalov,
Broadcom internal kernel review list, Thomas Gleixner,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Arnd Bergmann,
Dennis Zhou, Tejun Heo, Christoph Lameter, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers
Cc: linux-kernel, linux-mips, virtualization, linux-arch, linux-mm,
linux-trace-kernel, kernel-team, Paul E. McKenney,
Dmitry Ilvokhin
In-Reply-To: <cover.1776350944.git.d@ilvokhin.com>
Add the contended_release trace event. This tracepoint fires on the
holder side when a contended lock is released, complementing the
existing contention_begin/contention_end tracepoints which fire on the
waiter side.
This enables correlating lock hold time under contention with waiter
events by lock address.
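As an illustration of that correlation (not part of this series), here is
a minimal userspace sketch in C. It assumes the three events have already
been parsed from the trace into records carrying a timestamp and the lock
address; struct lock_event, the EV_* names and held_under_contention_ns()
are hypothetical and exist only for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical parsed form of the three lock tracepoints. */
enum lock_event_type {
	EV_CONTENTION_BEGIN,	/* waiter side: started waiting */
	EV_CONTENTION_END,	/* waiter side: stopped waiting */
	EV_CONTENDED_RELEASE,	/* holder side: released a contended lock */
};

struct lock_event {
	uint64_t ts_ns;			/* event timestamp */
	uintptr_t lock_addr;		/* lock address reported by the event */
	enum lock_event_type type;
};

/*
 * For one lock address, return the time between the first
 * contention_begin and the next contended_release on that lock, i.e.
 * how long the holder kept the lock after a waiter showed up.
 * Returns 0 if no such pair exists. Events must be sorted by time.
 */
static uint64_t held_under_contention_ns(const struct lock_event *ev,
					 size_t n, uintptr_t lock)
{
	int waiting = 0;
	uint64_t begin = 0;

	for (size_t i = 0; i < n; i++) {
		if (ev[i].lock_addr != lock)
			continue;
		if (ev[i].type == EV_CONTENTION_BEGIN && !waiting) {
			waiting = 1;
			begin = ev[i].ts_ns;
		} else if (ev[i].type == EV_CONTENDED_RELEASE && waiting) {
			return ev[i].ts_ns - begin;
		}
	}
	return 0;
}
```

Matching on lock_addr alone is what makes this work: the waiter-side and
holder-side events carry no other shared key.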
Add trace_contended_release() calls to the slowpath unlock paths of
sleepable locks: mutex, rtmutex, semaphore, rwsem, percpu-rwsem, and
RT-specific rwbase locks.
Where possible, trace_contended_release() fires before the lock is
released and before the waiter is woken. For some lock types, the
tracepoint fires after the release but before the wake. Making the
placement consistent across all lock types is not worth the added
complexity.
For reader/writer locks, the tracepoint fires for every reader releasing
while a writer is waiting, not only for the last reader.
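The per-reader behaviour can be seen in a single-threaded model of the
rwsem read-unlock path. This is a sketch under assumptions, not kernel
code: the TOY_* bit layout only mirrors the idea of a waiters flag plus a
reader count, the counter is a plain (non-atomic) variable, and the
tracepoint is assumed to be enabled.

```c
#include <stdint.h>

/* Toy bit layout: one writer bit, one waiters flag, readers above bit 8. */
#define TOY_WRITER_LOCKED	(1ul << 0)
#define TOY_FLAG_WAITERS	(1ul << 1)
#define TOY_READER_SHIFT	8
#define TOY_READER_BIAS		(1ul << TOY_READER_SHIFT)
#define TOY_READER_MASK		(~(TOY_READER_BIAS - 1))
#define TOY_LOCK_MASK		(TOY_WRITER_LOCKED | TOY_READER_MASK)

struct toy_rwsem {
	unsigned long count;	/* lock word */
	unsigned int traced;	/* number of contended_release hits */
	unsigned int wakeups;	/* number of waiter wakeups */
};

/* Model of a reader unlock: trace on every contended release. */
static void toy_up_read(struct toy_rwsem *sem)
{
	unsigned long tmp = sem->count - TOY_READER_BIAS;

	sem->count = tmp;

	/* Fires whenever a waiter is queued, even if readers remain. */
	if (tmp & TOY_FLAG_WAITERS)
		sem->traced++;

	/* The wakeup happens only once the last reader has left. */
	if ((tmp & (TOY_LOCK_MASK | TOY_FLAG_WAITERS)) == TOY_FLAG_WAITERS)
		sem->wakeups++;
}
```

With three readers and one queued writer, the model records three trace
hits but a single wakeup, which is the asymmetry described above.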
Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
---
include/trace/events/lock.h | 17 +++++++++++++++++
kernel/locking/mutex.c | 4 ++++
kernel/locking/percpu-rwsem.c | 11 +++++++++++
kernel/locking/rtmutex.c | 1 +
kernel/locking/rwbase_rt.c | 6 ++++++
kernel/locking/rwsem.c | 10 ++++++++--
kernel/locking/semaphore.c | 4 ++++
7 files changed, 51 insertions(+), 2 deletions(-)
diff --git a/include/trace/events/lock.h b/include/trace/events/lock.h
index da978f2afb45..1ded869cd619 100644
--- a/include/trace/events/lock.h
+++ b/include/trace/events/lock.h
@@ -137,6 +137,23 @@ TRACE_EVENT(contention_end,
TP_printk("%p (ret=%d)", __entry->lock_addr, __entry->ret)
);
+TRACE_EVENT(contended_release,
+
+ TP_PROTO(void *lock),
+
+ TP_ARGS(lock),
+
+ TP_STRUCT__entry(
+ __field(void *, lock_addr)
+ ),
+
+ TP_fast_assign(
+ __entry->lock_addr = lock;
+ ),
+
+ TP_printk("%p", __entry->lock_addr)
+);
+
#endif /* _TRACE_LOCK_H */
/* This part must be outside protection */
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 427187ff02db..6c2c9312eb8f 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -997,6 +997,9 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
wake_q_add(&wake_q, next);
}
+ if (trace_contended_release_enabled() && waiter)
+ trace_contended_release(lock);
+
if (owner & MUTEX_FLAG_HANDOFF)
__mutex_handoff(lock, next);
@@ -1194,6 +1197,7 @@ EXPORT_SYMBOL(ww_mutex_lock_interruptible);
EXPORT_TRACEPOINT_SYMBOL_GPL(contention_begin);
EXPORT_TRACEPOINT_SYMBOL_GPL(contention_end);
+EXPORT_TRACEPOINT_SYMBOL_GPL(contended_release);
/**
* atomic_dec_and_mutex_lock - return holding mutex if we dec to 0
diff --git a/kernel/locking/percpu-rwsem.c b/kernel/locking/percpu-rwsem.c
index f3ee7a0d6047..46b5903989b8 100644
--- a/kernel/locking/percpu-rwsem.c
+++ b/kernel/locking/percpu-rwsem.c
@@ -263,6 +263,9 @@ void percpu_up_write(struct percpu_rw_semaphore *sem)
{
rwsem_release(&sem->dep_map, _RET_IP_);
+ if (trace_contended_release_enabled() && wq_has_sleeper(&sem->waiters))
+ trace_contended_release(sem);
+
/*
* Signal the writer is done, no fast path yet.
*
@@ -292,6 +295,14 @@ EXPORT_SYMBOL_GPL(percpu_up_write);
void __percpu_up_read(struct percpu_rw_semaphore *sem)
{
lockdep_assert_preemption_disabled();
+ /*
+ * After percpu_up_write() completes, rcu_sync_is_idle() can still
+ * return false during the grace period, forcing readers into this
+ * slowpath. Only trace when a writer is actually waiting for
+ * readers to drain.
+ */
+ if (trace_contended_release_enabled() && rcuwait_active(&sem->writer))
+ trace_contended_release(sem);
/*
* slowpath; reader will only ever wake a single blocked
* writer.
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index ccaba6148b61..3db8a840b4e8 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -1466,6 +1466,7 @@ static void __sched rt_mutex_slowunlock(struct rt_mutex_base *lock)
raw_spin_lock_irqsave(&lock->wait_lock, flags);
}
+ trace_contended_release(lock);
/*
* The wakeup next waiter path does not suffer from the above
* race. See the comments there.
diff --git a/kernel/locking/rwbase_rt.c b/kernel/locking/rwbase_rt.c
index 82e078c0665a..74da5601018f 100644
--- a/kernel/locking/rwbase_rt.c
+++ b/kernel/locking/rwbase_rt.c
@@ -174,6 +174,8 @@ static void __sched __rwbase_read_unlock(struct rwbase_rt *rwb,
static __always_inline void rwbase_read_unlock(struct rwbase_rt *rwb,
unsigned int state)
{
+ if (trace_contended_release_enabled() && rt_mutex_owner(&rwb->rtmutex))
+ trace_contended_release(rwb);
/*
* rwb->readers can only hit 0 when a writer is waiting for the
* active readers to leave the critical section.
@@ -205,6 +207,8 @@ static inline void rwbase_write_unlock(struct rwbase_rt *rwb)
unsigned long flags;
raw_spin_lock_irqsave(&rtm->wait_lock, flags);
+ if (trace_contended_release_enabled() && rt_mutex_has_waiters(rtm))
+ trace_contended_release(rwb);
__rwbase_write_unlock(rwb, WRITER_BIAS, flags);
}
@@ -214,6 +218,8 @@ static inline void rwbase_write_downgrade(struct rwbase_rt *rwb)
unsigned long flags;
raw_spin_lock_irqsave(&rtm->wait_lock, flags);
+ if (trace_contended_release_enabled() && rt_mutex_has_waiters(rtm))
+ trace_contended_release(rwb);
/* Release it and account current as reader */
__rwbase_write_unlock(rwb, WRITER_BIAS - 1, flags);
}
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index bf647097369c..602d5fd3c91a 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -1387,6 +1387,8 @@ static inline void __up_read(struct rw_semaphore *sem)
rwsem_clear_reader_owned(sem);
tmp = atomic_long_add_return_release(-RWSEM_READER_BIAS, &sem->count);
DEBUG_RWSEMS_WARN_ON(tmp < 0, sem);
+ if (trace_contended_release_enabled() && (tmp & RWSEM_FLAG_WAITERS))
+ trace_contended_release(sem);
if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS)) ==
RWSEM_FLAG_WAITERS)) {
clear_nonspinnable(sem);
@@ -1413,8 +1415,10 @@ static inline void __up_write(struct rw_semaphore *sem)
preempt_disable();
rwsem_clear_owner(sem);
tmp = atomic_long_fetch_add_release(-RWSEM_WRITER_LOCKED, &sem->count);
- if (unlikely(tmp & RWSEM_FLAG_WAITERS))
+ if (unlikely(tmp & RWSEM_FLAG_WAITERS)) {
+ trace_contended_release(sem);
rwsem_wake(sem);
+ }
preempt_enable();
}
@@ -1437,8 +1441,10 @@ static inline void __downgrade_write(struct rw_semaphore *sem)
tmp = atomic_long_fetch_add_release(
-RWSEM_WRITER_LOCKED+RWSEM_READER_BIAS, &sem->count);
rwsem_set_reader_owned(sem);
- if (tmp & RWSEM_FLAG_WAITERS)
+ if (tmp & RWSEM_FLAG_WAITERS) {
+ trace_contended_release(sem);
rwsem_downgrade_wake(sem);
+ }
preempt_enable();
}
diff --git a/kernel/locking/semaphore.c b/kernel/locking/semaphore.c
index 74d41433ba13..35ac3498dca5 100644
--- a/kernel/locking/semaphore.c
+++ b/kernel/locking/semaphore.c
@@ -230,6 +230,10 @@ void __sched up(struct semaphore *sem)
sem->count++;
else
__up(sem, &wake_q);
+
+ if (trace_contended_release_enabled() && !wake_q_empty(&wake_q))
+ trace_contended_release(sem);
+
raw_spin_unlock_irqrestore(&sem->lock, flags);
if (!wake_q_empty(&wake_q))
wake_up_q(&wake_q);
--
2.52.0
* [PATCH v5 4/7] locking: Factor out queued_spin_release()
From: Dmitry Ilvokhin @ 2026-04-16 15:05 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long,
Thomas Bogendoerfer, Juergen Gross, Ajay Kaher, Alexey Makhalov,
Broadcom internal kernel review list, Thomas Gleixner,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Arnd Bergmann,
Dennis Zhou, Tejun Heo, Christoph Lameter, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers
Cc: linux-kernel, linux-mips, virtualization, linux-arch, linux-mm,
linux-trace-kernel, kernel-team, Paul E. McKenney,
Dmitry Ilvokhin
In-Reply-To: <cover.1776350944.git.d@ilvokhin.com>
Introduce queued_spin_release() as an arch-overridable unlock primitive,
and make queued_spin_unlock() a generic wrapper around it.
This is a preparatory refactoring for the next commit, which adds
contended_release tracepoint instrumentation to queued_spin_unlock().
Rename the existing arch-specific queued_spin_unlock() overrides on
x86 (paravirt) and MIPS to queued_spin_release().
No functional change.
Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
---
arch/mips/include/asm/spinlock.h | 6 +++---
arch/x86/include/asm/paravirt-spinlock.h | 6 +++---
include/asm-generic/qspinlock.h | 15 ++++++++++++---
3 files changed, 18 insertions(+), 9 deletions(-)
diff --git a/arch/mips/include/asm/spinlock.h b/arch/mips/include/asm/spinlock.h
index 6ce2117e49f6..c349162f15eb 100644
--- a/arch/mips/include/asm/spinlock.h
+++ b/arch/mips/include/asm/spinlock.h
@@ -13,12 +13,12 @@
#include <asm-generic/qspinlock_types.h>
-#define queued_spin_unlock queued_spin_unlock
+#define queued_spin_release queued_spin_release
/**
- * queued_spin_unlock - release a queued spinlock
+ * queued_spin_release - release a queued spinlock
* @lock : Pointer to queued spinlock structure
*/
-static inline void queued_spin_unlock(struct qspinlock *lock)
+static inline void queued_spin_release(struct qspinlock *lock)
{
/* This could be optimised with ARCH_HAS_MMIOWB */
mmiowb();
diff --git a/arch/x86/include/asm/paravirt-spinlock.h b/arch/x86/include/asm/paravirt-spinlock.h
index 7beffcb08ed6..ac75e0736198 100644
--- a/arch/x86/include/asm/paravirt-spinlock.h
+++ b/arch/x86/include/asm/paravirt-spinlock.h
@@ -49,9 +49,9 @@ static __always_inline bool pv_vcpu_is_preempted(long cpu)
ALT_NOT(X86_FEATURE_VCPUPREEMPT));
}
-#define queued_spin_unlock queued_spin_unlock
+#define queued_spin_release queued_spin_release
/**
- * queued_spin_unlock - release a queued spinlock
+ * queued_spin_release - release a queued spinlock
* @lock : Pointer to queued spinlock structure
*
* A smp_store_release() on the least-significant byte.
@@ -66,7 +66,7 @@ static inline void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
pv_queued_spin_lock_slowpath(lock, val);
}
-static inline void queued_spin_unlock(struct qspinlock *lock)
+static inline void queued_spin_release(struct qspinlock *lock)
{
kcsan_release();
pv_queued_spin_unlock(lock);
diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h
index bf47cca2c375..df76f34645a0 100644
--- a/include/asm-generic/qspinlock.h
+++ b/include/asm-generic/qspinlock.h
@@ -115,12 +115,12 @@ static __always_inline void queued_spin_lock(struct qspinlock *lock)
}
#endif
-#ifndef queued_spin_unlock
+#ifndef queued_spin_release
/**
- * queued_spin_unlock - release a queued spinlock
+ * queued_spin_release - release a queued spinlock
* @lock : Pointer to queued spinlock structure
*/
-static __always_inline void queued_spin_unlock(struct qspinlock *lock)
+static __always_inline void queued_spin_release(struct qspinlock *lock)
{
/*
* unlock() needs release semantics:
@@ -129,6 +129,15 @@ static __always_inline void queued_spin_unlock(struct qspinlock *lock)
}
#endif
+/**
+ * queued_spin_unlock - unlock a queued spinlock
+ * @lock : Pointer to queued spinlock structure
+ */
+static __always_inline void queued_spin_unlock(struct qspinlock *lock)
+{
+ queued_spin_release(lock);
+}
+
#ifndef virt_spin_lock
static __always_inline bool virt_spin_lock(struct qspinlock *lock)
{
--
2.52.0
* [PATCH v5 6/7] locking: Factor out __queued_read_unlock()/__queued_write_unlock()
From: Dmitry Ilvokhin @ 2026-04-16 15:05 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long,
Thomas Bogendoerfer, Juergen Gross, Ajay Kaher, Alexey Makhalov,
Broadcom internal kernel review list, Thomas Gleixner,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Arnd Bergmann,
Dennis Zhou, Tejun Heo, Christoph Lameter, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers
Cc: linux-kernel, linux-mips, virtualization, linux-arch, linux-mm,
linux-trace-kernel, kernel-team, Paul E. McKenney,
Dmitry Ilvokhin
In-Reply-To: <cover.1776350944.git.d@ilvokhin.com>
This is a preparatory refactoring for the next commit, which adds
contended_release tracepoint instrumentation and needs to call the
unlock from both traced and non-traced paths.
No functional change.
Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
---
include/asm-generic/qrwlock.h | 20 +++++++++++++++-----
1 file changed, 15 insertions(+), 5 deletions(-)
diff --git a/include/asm-generic/qrwlock.h b/include/asm-generic/qrwlock.h
index 75b8f4601b28..4b627bafba8b 100644
--- a/include/asm-generic/qrwlock.h
+++ b/include/asm-generic/qrwlock.h
@@ -101,16 +101,26 @@ static inline void queued_write_lock(struct qrwlock *lock)
queued_write_lock_slowpath(lock);
}
+static __always_inline void __queued_read_unlock(struct qrwlock *lock)
+{
+ /*
+ * Atomically decrement the reader count
+ */
+ (void)atomic_sub_return_release(_QR_BIAS, &lock->cnts);
+}
+
/**
* queued_read_unlock - release read lock of a queued rwlock
* @lock : Pointer to queued rwlock structure
*/
static inline void queued_read_unlock(struct qrwlock *lock)
{
- /*
- * Atomically decrement the reader count
- */
- (void)atomic_sub_return_release(_QR_BIAS, &lock->cnts);
+ __queued_read_unlock(lock);
+}
+
+static __always_inline void __queued_write_unlock(struct qrwlock *lock)
+{
+ smp_store_release(&lock->wlocked, 0);
}
/**
@@ -119,7 +129,7 @@ static inline void queued_read_unlock(struct qrwlock *lock)
*/
static inline void queued_write_unlock(struct qrwlock *lock)
{
- smp_store_release(&lock->wlocked, 0);
+ __queued_write_unlock(lock);
}
/**
--
2.52.0
* [PATCH v5 5/7] locking: Add contended_release tracepoint to qspinlock
From: Dmitry Ilvokhin @ 2026-04-16 15:05 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long,
Thomas Bogendoerfer, Juergen Gross, Ajay Kaher, Alexey Makhalov,
Broadcom internal kernel review list, Thomas Gleixner,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Arnd Bergmann,
Dennis Zhou, Tejun Heo, Christoph Lameter, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers
Cc: linux-kernel, linux-mips, virtualization, linux-arch, linux-mm,
linux-trace-kernel, kernel-team, Paul E. McKenney,
Dmitry Ilvokhin
In-Reply-To: <cover.1776350944.git.d@ilvokhin.com>
Use the arch-overridable queued_spin_release(), introduced in the
previous commit, to ensure the tracepoint works correctly across all
architectures, including those with custom unlock implementations (e.g.
x86 paravirt).
When the tracepoint is disabled, the only addition to the hot path is a
single NOP instruction (the static branch). When enabled, the contention
check, trace call, and unlock are combined in an out-of-line function to
minimize hot path impact, avoiding the compiler needing to preserve the
lock pointer in a callee-saved register across the trace call.
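The shape of that fast path can be modelled in userspace. The sketch below
is an assumption-laden illustration, not the kernel implementation: a
plain bool stands in for the static branch, a counter stands in for the
trace call, and the toy_* names and TOY_LOCKED_MASK are made up for this
example. "Contended" is modelled as any bit set outside the locked byte.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_LOCKED_MASK	0xffu	/* low byte holds the "locked" state */

struct toy_qspinlock {
	_Atomic uint32_t val;
};

static bool toy_trace_enabled;		/* stand-in for the static branch */
static unsigned int toy_trace_hits;	/* stand-in for the trace call */

/* Plain release: clear the locked byte with release ordering. */
static inline void toy_release(struct toy_qspinlock *lock)
{
	atomic_fetch_and_explicit(&lock->val, ~TOY_LOCKED_MASK,
				  memory_order_release);
}

/* Contended if any pending/tail bit outside the locked byte is set. */
static inline bool toy_is_contended(struct toy_qspinlock *lock)
{
	return atomic_load_explicit(&lock->val, memory_order_relaxed) &
	       ~TOY_LOCKED_MASK;
}

/*
 * Out-of-line slow path: check, trace and release in one call, so the
 * inline caller has nothing left to do with the lock pointer afterwards.
 */
static __attribute__((noinline))
void toy_release_traced(struct toy_qspinlock *lock)
{
	if (toy_is_contended(lock))
		toy_trace_hits++;
	toy_release(lock);
}

/* Inline fast path: one branch, then either variant finishes the unlock. */
static inline void toy_unlock(struct toy_qspinlock *lock)
{
	if (toy_trace_enabled) {
		toy_release_traced(lock);
		return;
	}
	toy_release(lock);
}
```

Because the traced variant also performs the release, the inline caller
returns right after the call and needs no state preserved across it.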
Binary size impact (x86_64, defconfig):
uninlined unlock (common case): +680 bytes (+0.00%)
inlined unlock (worst case): +83659 bytes (+0.21%)
The inlined unlock case could not be achieved through Kconfig options on
x86_64 as PREEMPT_BUILD unconditionally selects UNINLINE_SPIN_UNLOCK on
x86_64. The UNINLINE_SPIN_UNLOCK guards were manually inverted to force
inline the unlock path and estimate the worst case binary size increase.
In practice, configurations with UNINLINE_SPIN_UNLOCK=n have already
opted against binary size optimization, so the inlined worst case is
unlikely to be a concern.
Architectures with fully custom qspinlock implementations (e.g.
PowerPC) are not covered by this change.
Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
---
include/asm-generic/qspinlock.h | 18 ++++++++++++++++++
kernel/locking/qspinlock.c | 8 ++++++++
2 files changed, 26 insertions(+)
diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h
index df76f34645a0..915a4c2777f6 100644
--- a/include/asm-generic/qspinlock.h
+++ b/include/asm-generic/qspinlock.h
@@ -41,6 +41,7 @@
#include <asm-generic/qspinlock_types.h>
#include <linux/atomic.h>
+#include <linux/tracepoint-defs.h>
#ifndef queued_spin_is_locked
/**
@@ -129,12 +130,29 @@ static __always_inline void queued_spin_release(struct qspinlock *lock)
}
#endif
+DECLARE_TRACEPOINT(contended_release);
+
+extern void queued_spin_release_traced(struct qspinlock *lock);
+
/**
* queued_spin_unlock - unlock a queued spinlock
* @lock : Pointer to queued spinlock structure
+ *
+ * Generic tracing wrapper around the arch-overridable
+ * queued_spin_release().
*/
static __always_inline void queued_spin_unlock(struct qspinlock *lock)
{
+ /*
+ * Trace and release are combined in queued_spin_release_traced() so
+ * the compiler does not need to preserve the lock pointer across the
+ * function call, avoiding callee-saved register save/restore on the
+ * hot path.
+ */
+ if (tracepoint_enabled(contended_release)) {
+ queued_spin_release_traced(lock);
+ return;
+ }
queued_spin_release(lock);
}
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index af8d122bb649..c72610980ec7 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -104,6 +104,14 @@ static __always_inline u32 __pv_wait_head_or_lock(struct qspinlock *lock,
#define queued_spin_lock_slowpath native_queued_spin_lock_slowpath
#endif
+void __lockfunc queued_spin_release_traced(struct qspinlock *lock)
+{
+ if (queued_spin_is_contended(lock))
+ trace_contended_release(lock);
+ queued_spin_release(lock);
+}
+EXPORT_SYMBOL(queued_spin_release_traced);
+
#endif /* _GEN_PV_LOCK_SLOWPATH */
/**
--
2.52.0
* [PATCH v5 7/7] locking: Add contended_release tracepoint to qrwlock
From: Dmitry Ilvokhin @ 2026-04-16 15:05 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long,
Thomas Bogendoerfer, Juergen Gross, Ajay Kaher, Alexey Makhalov,
Broadcom internal kernel review list, Thomas Gleixner,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Arnd Bergmann,
Dennis Zhou, Tejun Heo, Christoph Lameter, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers
Cc: linux-kernel, linux-mips, virtualization, linux-arch, linux-mm,
linux-trace-kernel, kernel-team, Paul E. McKenney,
Dmitry Ilvokhin
In-Reply-To: <cover.1776350944.git.d@ilvokhin.com>
Extend the contended_release tracepoint to queued rwlocks, using the
same out-of-line traced unlock approach as queued spinlocks.
Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
---
include/asm-generic/qrwlock.h | 22 ++++++++++++++++++++++
kernel/locking/qrwlock.c | 16 ++++++++++++++++
2 files changed, 38 insertions(+)
diff --git a/include/asm-generic/qrwlock.h b/include/asm-generic/qrwlock.h
index 4b627bafba8b..274c19006125 100644
--- a/include/asm-generic/qrwlock.h
+++ b/include/asm-generic/qrwlock.h
@@ -14,6 +14,7 @@
#define __ASM_GENERIC_QRWLOCK_H
#include <linux/atomic.h>
+#include <linux/tracepoint-defs.h>
#include <asm/barrier.h>
#include <asm/processor.h>
@@ -35,6 +36,10 @@
*/
extern void queued_read_lock_slowpath(struct qrwlock *lock);
extern void queued_write_lock_slowpath(struct qrwlock *lock);
+extern void queued_read_unlock_traced(struct qrwlock *lock);
+extern void queued_write_unlock_traced(struct qrwlock *lock);
+
+DECLARE_TRACEPOINT(contended_release);
/**
* queued_read_trylock - try to acquire read lock of a queued rwlock
@@ -115,6 +120,17 @@ static __always_inline void __queued_read_unlock(struct qrwlock *lock)
*/
static inline void queued_read_unlock(struct qrwlock *lock)
{
+ /*
+ * Trace and unlock are combined in the traced unlock variant so
+ * the compiler does not need to preserve the lock pointer across
+ * the function call, avoiding callee-saved register save/restore
+ * on the hot path.
+ */
+ if (tracepoint_enabled(contended_release)) {
+ queued_read_unlock_traced(lock);
+ return;
+ }
+
__queued_read_unlock(lock);
}
@@ -129,6 +145,12 @@ static __always_inline void __queued_write_unlock(struct qrwlock *lock)
*/
static inline void queued_write_unlock(struct qrwlock *lock)
{
+ /* See comment in queued_read_unlock(). */
+ if (tracepoint_enabled(contended_release)) {
+ queued_write_unlock_traced(lock);
+ return;
+ }
+
__queued_write_unlock(lock);
}
diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c
index d2ef312a8611..5f7a0fc2b27a 100644
--- a/kernel/locking/qrwlock.c
+++ b/kernel/locking/qrwlock.c
@@ -90,3 +90,19 @@ void __lockfunc queued_write_lock_slowpath(struct qrwlock *lock)
trace_contention_end(lock, 0);
}
EXPORT_SYMBOL(queued_write_lock_slowpath);
+
+void __lockfunc queued_read_unlock_traced(struct qrwlock *lock)
+{
+ if (queued_rwlock_is_contended(lock))
+ trace_contended_release(lock);
+ __queued_read_unlock(lock);
+}
+EXPORT_SYMBOL(queued_read_unlock_traced);
+
+void __lockfunc queued_write_unlock_traced(struct qrwlock *lock)
+{
+ if (queued_rwlock_is_contended(lock))
+ trace_contended_release(lock);
+ __queued_write_unlock(lock);
+}
+EXPORT_SYMBOL(queued_write_unlock_traced);
--
2.52.0
* Re: [RFC PATCH 2/4] rv/tlob: Add tlob deterministic automaton monitor
From: Wen Yang @ 2026-04-16 15:09 UTC (permalink / raw)
To: Gabriele Monaco
Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
linux-trace-kernel, linux-kernel
In-Reply-To: <74a624434b59c00f9407909b8696f041536d9418.camel@redhat.com>
On 4/13/26 16:19, Gabriele Monaco wrote:
> On Mon, 2026-04-13 at 03:27 +0800, wen.yang@linux.dev wrote:
>> From: Wen Yang <wen.yang@linux.dev>
>>
>> Add the tlob (task latency over budget) RV monitor. tlob tracks the
>> monotonic elapsed time (CLOCK_MONOTONIC) of a marked per-task code
>> path, including time off-CPU, and fires a per-task hrtimer when the
>> elapsed time exceeds a configurable budget.
>>
>> Three-state DA (unmonitored/on_cpu/off_cpu) driven by trace_start,
>> switch_in/out, and budget_expired events. Per-task state lives in a
>> fixed-size hash table (TLOB_MAX_MONITORED slots) with RCU-deferred
>> free.
>>
>> Two userspace interfaces:
>> - tracefs: uprobe pair registration via the monitor file using the
>> format "pid:threshold_us:offset_start:offset_stop:binary_path"
>> - /dev/rv ioctls (CONFIG_RV_CHARDEV): TLOB_IOCTL_TRACE_START /
>> TRACE_STOP; TRACE_STOP returns -EOVERFLOW on violation
>>
>> Each /dev/rv fd has a per-fd mmap ring buffer (physically contiguous
>> pages). A control page (struct tlob_mmap_page) at offset 0 exposes
>> head/tail/dropped for lockless userspace reads; struct tlob_event
>> records follow at data_offset. Drop-new policy on overflow.
>>
>> UAPI: include/uapi/linux/rv.h (tlob_start_args, tlob_event,
>> tlob_mmap_page, ioctl numbers), monitor_tlob.rst,
>> ioctl-number.rst (RV_IOC_MAGIC=0xB9).
>>
>
> I'm not fully grasping all the requirements for the monitors yet, but I see you
> are reimplementing a lot of functionality in the monitor itself rather than
> within RV, let's see if we can consolidate some of them:
>
> * you're using timer expirations, can we do it with timed automata? [1]
> * RV automata usually don't have an /unmonitored/ state, your trace_start event
> would be the start condition (da_event_start) and the monitor will get non-
> running at each violation (it calls da_monitor_reset() automatically), all
> setup/cleanup logic should be handled implicitly within RV. I believe that would
> also save you that ugly trace_event_tlob() redefinition.
> * you're maintaining a local hash table for each task_struct, that could use
> the per-object monitors [2] where your "object" is in fact your struct,
> allocated when you start the monitor with all appropriate fields and indexed by
> pid
> * you are handling violations manually, considering timed automata trigger a
> full fledged violation on timeouts, can you use the RV-way (error tracepoints or
> reactors only)? Do you need the additional reporting within the
> tracepoint/ioctl? Cannot the userspace consumer deduce all those from other
> events and let RV do just the monitoring?
> * I like the uprobe thing, we could probably move all that to a common helper
> once we figure out how to make it generic.
>
> Note: [1] and [2] didn't reach upstream yet, but should reach linux-next soon.
>
Thanks for the review. Here's my plan for each point -- let me know if
the direction looks right.
- Timed automata
The HA framework [1] is a good match when the timeout threshold is
global or state-determined, but tlob needs a per-invocation threshold
supplied at TRACE_START time -- fitting that into HA would require
framework changes.
My plan is to use da_monitor_init_hook() -- the same mechanism HA
monitors use internally -- to arm the per-invocation hrtimer once
da_create_storage() has stored the monitor_target. This gives the same
"timer fires => violation" semantics without touching the HA infrastructure.
If you see a cleaner way to pass per-invocation data through HA I'm
happy to go that route.
- Unmonitored state / da_handle_start_event
Fair point. I'll drop the explicit unmonitored state and the
trace_event_tlob() redefinition. tlob_start_task() will use
da_handle_start_event() to allocate storage, set initial state to on_cpu,
and fire the init hook to arm the timer in one shot. tlob_stop_task()
calls da_monitor_reset() directly.
- Per-object monitors
Will do. The custom hash table goes away; I'll switch to RV_MON_PER_OBJ
with:
typedef struct tlob_task_state *monitor_target;
da_get_target_by_id() handles the sched_switch hot path lookup.
- RV-way violations
Agreed. budget_expired will be declared INVALID in all states so the
framework calls react() (error_tlob tracepoint + any registered reactor)
and da_monitor_reset() automatically. tlob won't emit any tracepoint of
its own.
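To make the intended behaviour concrete, here is a standalone C model of
the two remaining states. It is only a sketch: the toy_* names and
TOY_INVALID are invented for illustration and are not the RV framework's
generated tables, and treating switch_in as a self-loop in on_cpu is my
assumption. A counter and a flag stand in for react() and
da_monitor_reset().

```c
#include <stdbool.h>

enum toy_state { TOY_ON_CPU, TOY_OFF_CPU, TOY_STATE_MAX };

enum toy_event {
	TOY_SWITCH_IN,
	TOY_SWITCH_OUT,
	TOY_SCHED_WAKEUP,
	TOY_BUDGET_EXPIRED,
	TOY_EVENT_MAX,
};

#define TOY_INVALID	(-1)

/* Transition table: budget_expired has no valid target in any state. */
static const signed char toy_next[TOY_STATE_MAX][TOY_EVENT_MAX] = {
	[TOY_ON_CPU] = {
		[TOY_SWITCH_IN]      = TOY_ON_CPU,	/* assumed self-loop */
		[TOY_SWITCH_OUT]     = TOY_OFF_CPU,
		[TOY_SCHED_WAKEUP]   = TOY_ON_CPU,
		[TOY_BUDGET_EXPIRED] = TOY_INVALID,
	},
	[TOY_OFF_CPU] = {
		[TOY_SWITCH_IN]      = TOY_ON_CPU,
		[TOY_SWITCH_OUT]     = TOY_OFF_CPU,
		[TOY_SCHED_WAKEUP]   = TOY_OFF_CPU,
		[TOY_BUDGET_EXPIRED] = TOY_INVALID,
	},
};

struct toy_monitor {
	bool monitoring;		/* false after a reset */
	enum toy_state state;
	unsigned int violations;	/* stand-in for react() */
};

/* Returns false when the event was a violation; the monitor then stops. */
static bool toy_handle_event(struct toy_monitor *m, enum toy_event ev)
{
	int next;

	if (!m->monitoring)
		return true;	/* not started: events are ignored */

	next = toy_next[m->state][ev];
	if (next == TOY_INVALID) {
		m->violations++;	/* react() */
		m->monitoring = false;	/* da_monitor_reset() */
		return false;
	}
	m->state = (enum toy_state)next;
	return true;
}
```

The point of the table form is that the monitor itself contains no
violation handling; an invalid transition is the violation.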
One note on the /dev/rv ioctl: TLOB_IOCTL_TRACE_STOP returns -EOVERFLOW
to the caller when the budget was exceeded. This is just a syscall
return code -- not a second reporting path -- to let in-process
instrumentation react inline without polling the trace buffer.
Let me know if you have concerns about keeping this.
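From the caller's side, the inline reaction would look roughly like the
sketch below. It is illustrative only: toy_classify_stop() and the
TOY_* outcome names are hypothetical, and the only assumption about the
interface is the usual syscall convention that a kernel-side -EOVERFLOW
reaches userspace as a -1 return with errno set to EOVERFLOW.

```c
#include <errno.h>

enum toy_stop_outcome {
	TOY_WITHIN_BUDGET,	/* stop succeeded, budget was respected */
	TOY_OVER_BUDGET,	/* stop reported the budget was exceeded */
	TOY_STOP_FAILED,	/* any other error */
};

/*
 * Classify the result of the stop ioctl. @ioctl_ret is the ioctl()
 * return value and @saved_errno is errno captured right after the call.
 */
static enum toy_stop_outcome toy_classify_stop(int ioctl_ret, int saved_errno)
{
	if (ioctl_ret == 0)
		return TOY_WITHIN_BUDGET;
	if (saved_errno == EOVERFLOW)
		return TOY_OVER_BUDGET;
	return TOY_STOP_FAILED;
}
```

In-process instrumentation could then, for example, log or degrade
service on TOY_OVER_BUDGET without ever reading the trace buffer.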
- Generic uprobe helper
Proposed interface:
  struct rv_uprobe *rv_uprobe_attach_path(
          struct path *path, loff_t offset,
          int (*entry_fn)(struct rv_uprobe *, struct pt_regs *, __u64 *),
          int (*ret_fn)(struct rv_uprobe *, unsigned long func,
                        struct pt_regs *, __u64 *),
          void *priv);
  struct rv_uprobe *rv_uprobe_attach(
          const char *binpath, loff_t offset,
          int (*entry_fn)(struct rv_uprobe *, struct pt_regs *, __u64 *),
          int (*ret_fn)(struct rv_uprobe *, unsigned long func,
                        struct pt_regs *, __u64 *),
          void *priv);
  void rv_uprobe_detach(struct rv_uprobe *p);
struct rv_uprobe exposes three read-only fields to monitors (offset,
priv, path); the uprobe_consumer and callbacks would be kept private to
the implementation, so monitors need not include <linux/uprobes.h>.
rv_uprobe_attach() resolves the path and delegates to
rv_uprobe_attach_path(); the latter avoids a redundant kern_path() when
registering multiple probes on the same binary:
  kern_path(binpath, LOOKUP_FOLLOW, &path);
  b->start = rv_uprobe_attach_path(&path, offset_start, entry_fn,
                                   NULL, b);
  b->stop = rv_uprobe_attach_path(&path, offset_stop, stop_fn,
                                  NULL, b);
  path_put(&path);
Does the interface look reasonable, or did you have a different shape in
mind?
--
Best wishes,
Wen
>
> [1] -
> https://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git/commit/?h=rv/for-next&id=f5587d1b6ec938afb2f74fe399a68020d66923e4
> [2] -
> https://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git/commit/?h=rv/for-next&id=da282bf7fadb095ee0a40c32ff0126429c769b45
>
>> Signed-off-by: Wen Yang <wen.yang@linux.dev>
>> ---
>> Documentation/trace/rv/index.rst | 1 +
>> Documentation/trace/rv/monitor_tlob.rst | 381 +++++++
>> .../userspace-api/ioctl/ioctl-number.rst | 1 +
>> include/uapi/linux/rv.h | 181 ++++
>> kernel/trace/rv/Kconfig | 17 +
>> kernel/trace/rv/Makefile | 2 +
>> kernel/trace/rv/monitors/tlob/Kconfig | 51 +
>> kernel/trace/rv/monitors/tlob/tlob.c | 986 ++++++++++++++++++
>> kernel/trace/rv/monitors/tlob/tlob.h | 145 +++
>> kernel/trace/rv/monitors/tlob/tlob_trace.h | 42 +
>> kernel/trace/rv/rv.c | 4 +
>> kernel/trace/rv/rv_dev.c | 602 +++++++++++
>> kernel/trace/rv/rv_trace.h | 50 +
>> 13 files changed, 2463 insertions(+)
>> create mode 100644 Documentation/trace/rv/monitor_tlob.rst
>> create mode 100644 include/uapi/linux/rv.h
>> create mode 100644 kernel/trace/rv/monitors/tlob/Kconfig
>> create mode 100644 kernel/trace/rv/monitors/tlob/tlob.c
>> create mode 100644 kernel/trace/rv/monitors/tlob/tlob.h
>> create mode 100644 kernel/trace/rv/monitors/tlob/tlob_trace.h
>> create mode 100644 kernel/trace/rv/rv_dev.c
>>
>> diff --git a/Documentation/trace/rv/index.rst b/Documentation/trace/rv/index.rst
>> index a2812ac5c..4f2bfaf38 100644
>> --- a/Documentation/trace/rv/index.rst
>> +++ b/Documentation/trace/rv/index.rst
>> @@ -15,3 +15,4 @@ Runtime Verification
>> monitor_wwnr.rst
>> monitor_sched.rst
>> monitor_rtapp.rst
>> + monitor_tlob.rst
>> diff --git a/Documentation/trace/rv/monitor_tlob.rst b/Documentation/trace/rv/monitor_tlob.rst
>> new file mode 100644
>> index 000000000..d498e9894
>> --- /dev/null
>> +++ b/Documentation/trace/rv/monitor_tlob.rst
>> @@ -0,0 +1,381 @@
>> +.. SPDX-License-Identifier: GPL-2.0
>> +
>> +Monitor tlob
>> +============
>> +
>> +- Name: tlob - task latency over budget
>> +- Type: per-task deterministic automaton
>> +- Author: Wen Yang <wen.yang@linux.dev>
>> +
>> +Description
>> +-----------
>> +
>> +The tlob monitor tracks per-task elapsed time (CLOCK_MONOTONIC, including
>> +both on-CPU and off-CPU time) and reports a violation when the monitored
>> +task exceeds a configurable latency budget threshold.
>> +
>> +The monitor implements a three-state deterministic automaton::
>> +
>> + |
>> + | (initial)
>> + v
>> + +--------------+
>> + +-------> | unmonitored |
>> + | +--------------+
>> + | |
>> + | trace_start
>> + | v
>> + | +--------------+
>> + | | on_cpu |
>> + | +--------------+
>> + | | |
>> + | switch_out| | trace_stop / budget_expired
>> + | v v
>> + | +--------------+ (unmonitored)
>> + | | off_cpu |
>> + | +--------------+
>> + | | |
>> + | | switch_in| trace_stop / budget_expired
>> + | v v
>> + | (on_cpu) (unmonitored)
>> + |
>> + +-- trace_stop (from on_cpu or off_cpu)
>> +
>> + Key transitions:
>> + unmonitored --(trace_start)--> on_cpu
>> + on_cpu --(switch_out)--> off_cpu
>> + off_cpu --(switch_in)--> on_cpu
>> + on_cpu --(trace_stop)--> unmonitored
>> + off_cpu --(trace_stop)--> unmonitored
>> + on_cpu --(budget_expired)-> unmonitored [violation]
>> + off_cpu --(budget_expired)-> unmonitored [violation]
>> +
>> + sched_wakeup self-loops in on_cpu and unmonitored; switch_out and
>> + sched_wakeup self-loop in off_cpu. budget_expired is fired by the
>> + one-shot hrtimer; it always transitions to unmonitored regardless of
>> + whether the task is on-CPU or off-CPU when the timer fires.
>> +
>> +State Descriptions
>> +------------------
>> +
>> +- **unmonitored**: Task is not being traced. Scheduling events
>> + (``switch_in``, ``switch_out``, ``sched_wakeup``) are silently
>> + ignored (self-loop). The monitor waits for a ``trace_start`` event
>> + to begin a new observation window.
>> +
>> +- **on_cpu**: Task is running on the CPU with the deadline timer armed.
>> + A one-shot hrtimer was set for ``threshold_us`` microseconds at
>> + ``trace_start`` time. A ``switch_out`` event transitions to
>> + ``off_cpu``; the hrtimer keeps running (off-CPU time counts toward
>> + the budget). A ``trace_stop`` cancels the timer and returns to
>> + ``unmonitored`` (normal completion). If the hrtimer fires
>> + (``budget_expired``) the violation is recorded and the automaton
>> + transitions to ``unmonitored``.
>> +
>> +- **off_cpu**: Task was preempted or blocked. The one-shot hrtimer
>> + continues to run. A ``switch_in`` event returns to ``on_cpu``.
>> + A ``trace_stop`` cancels the timer and returns to ``unmonitored``.
>> + If the hrtimer fires (``budget_expired``) while the task is off-CPU,
>> + the violation is recorded and the automaton transitions to
>> + ``unmonitored``.
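
The state descriptions above can be read as a small transition table. A
minimal userspace sketch follows, assuming the self-loops listed in the
documentation are the only ones. The enum values, TLOB_INVALID and the
tlob_next() helper are made up for illustration and are not taken from
tlob.h:

```c
#include <assert.h>

/* Names follow the documentation; numeric values are assumptions. */
enum tlob_state { UNMONITORED, ON_CPU, OFF_CPU, TLOB_INVALID };
enum tlob_event_id {
	TRACE_START, TRACE_STOP, SWITCH_IN, SWITCH_OUT,
	SCHED_WAKEUP, BUDGET_EXPIRED
};

/* Return the next state, or TLOB_INVALID for an undocumented transition. */
enum tlob_state tlob_next(enum tlob_state s, enum tlob_event_id e)
{
	switch (s) {
	case UNMONITORED:
		if (e == TRACE_START)
			return ON_CPU;
		/* scheduling events are ignored while unmonitored */
		if (e == SWITCH_IN || e == SWITCH_OUT || e == SCHED_WAKEUP)
			return UNMONITORED;
		break;
	case ON_CPU:
		if (e == SWITCH_OUT)
			return OFF_CPU;
		if (e == SCHED_WAKEUP)
			return ON_CPU;
		if (e == TRACE_STOP || e == BUDGET_EXPIRED)
			return UNMONITORED;
		break;
	case OFF_CPU:
		if (e == SWITCH_IN)
			return ON_CPU;
		if (e == SWITCH_OUT || e == SCHED_WAKEUP)
			return OFF_CPU;
		if (e == TRACE_STOP || e == BUDGET_EXPIRED)
			return UNMONITORED;
		break;
	default:
		break;
	}
	return TLOB_INVALID;
}
```

In this reading, budget_expired and trace_stop differ only in whether a
violation is recorded; both end in unmonitored.
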
>> +
>> +Rationale
>> +---------
>> +
>> +The per-task latency budget threshold allows operators to express timing
>> +requirements in microseconds and receive an immediate ftrace event when a
>> +task exceeds its budget. This is useful for real-time tasks
>> +(``SCHED_FIFO`` / ``SCHED_DEADLINE``) where total elapsed time must
>> +remain within a known bound.
>> +
>> +Each task has an independent threshold, so up to ``TLOB_MAX_MONITORED``
>> +(64) tasks with different timing requirements can be monitored
>> +simultaneously.
>> +
>> +On threshold violation the automaton records a ``tlob_budget_exceeded``
>> +ftrace event carrying the final on-CPU / off-CPU time breakdown, but does
>> +not kill or throttle the task. Monitoring can be restarted by issuing a
>> +new ``trace_start`` event (or a new ``TLOB_IOCTL_TRACE_START`` ioctl).
>> +
>> +A per-task one-shot hrtimer is armed at ``trace_start`` for exactly
>> +``threshold_us`` microseconds. It fires at most once per monitoring
>> +window, performs an O(1) hash lookup, records the violation, and injects
>> +the ``budget_expired`` event into the DA. When ``CONFIG_RV_MON_TLOB``
>> +is not set there is zero runtime cost.
>> +
>> +Usage
>> +-----
>> +
>> +tracefs interface (uprobe-based external monitoring)
>> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> +
>> +The ``monitor`` tracefs file allows any privileged user to instrument an
>> +unmodified binary via uprobes, without changing its source code. Write a
>> +four-field record to attach two plain entry uprobes: one at
>> +``offset_start`` fires ``tlob_start_task()`` and one at ``offset_stop``
>> +fires ``tlob_stop_task()``, so the latency budget covers exactly the code
>> +region between the two offsets::
>> +
>> + threshold_us:offset_start:offset_stop:binary_path
>> +
>> +``binary_path`` comes last so it may freely contain ``:`` (e.g. paths
>> +inside a container namespace).
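
To illustrate why putting binary_path last keeps the format unambiguous,
here is a userspace sketch of a parser for the four-field record. It is
not the kernel parser from this patch: struct tlob_record,
tlob_parse_record(), the -EINVAL return value and the requirement that
the path be absolute are all assumptions made for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct tlob_record {		/* hypothetical userspace mirror */
	uint64_t threshold_us;
	uint64_t offset_start;
	uint64_t offset_stop;
	const char *binary_path;	/* points into the input string */
};

/* Parse one numeric field terminated by ':' and step past the colon. */
static int parse_field(const char **p, int base, uint64_t *out)
{
	char *end;

	errno = 0;
	*out = strtoull(*p, &end, base);
	if (errno || end == *p || *end != ':')
		return -EINVAL;
	*p = end + 1;
	return 0;
}

int tlob_parse_record(const char *s, struct tlob_record *r)
{
	assert(s && r);

	/* Only the first three colons are separators. */
	if (parse_field(&s, 10, &r->threshold_us) ||
	    parse_field(&s, 0, &r->offset_start) ||
	    parse_field(&s, 0, &r->offset_stop))
		return -EINVAL;
	if (r->threshold_us == 0 || *s != '/')
		return -EINVAL;
	r->binary_path = s;	/* remainder, any further ':' included */
	return 0;
}
```

Because the three numeric fields are consumed left to right, whatever
follows the third colon is taken verbatim as the path.
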
>> +
>> +The uprobes fire for every task that executes the probed instruction in
>> +the binary, consistent with the native uprobe semantics. All tasks that
>> +execute the code region get independent per-task monitoring slots.
>> +
>> +Using two plain entry uprobes (rather than a uretprobe for the stop) means
>> +that a mistyped offset can never corrupt the call stack; the worst outcome
>> +of a bad ``offset_stop`` is a missed stop that causes the hrtimer to fire
>> +and report a budget violation.
>> +
>> +Example -- monitor a code region in ``/usr/bin/myapp`` with a 5 ms
>> +budget, where the region starts at offset 0x12a0 and ends at 0x12f0::
>> +
>> + echo 1 > /sys/kernel/tracing/rv/monitors/tlob/enable
>> +
>> + # Bind uprobes: start probe starts the clock, stop probe stops it
>> + echo "5000:0x12a0:0x12f0:/usr/bin/myapp" \
>> + > /sys/kernel/tracing/rv/monitors/tlob/monitor
>> +
>> + # Remove the uprobe binding for this code region
>> + echo "-0x12a0:/usr/bin/myapp" \
>> + > /sys/kernel/tracing/rv/monitors/tlob/monitor
>> +
>> + # List registered uprobe bindings (mirrors the write format)
>> + cat /sys/kernel/tracing/rv/monitors/tlob/monitor
>> + # -> 5000:0x12a0:0x12f0:/usr/bin/myapp
>> +
>> + # Read violations from the trace buffer
>> + cat /sys/kernel/tracing/trace
>> +
>> +Up to ``TLOB_MAX_MONITORED`` tasks may be monitored simultaneously.
>> +
>> +The offsets can be obtained with ``nm`` or ``readelf``::
>> +
>> + nm -n /usr/bin/myapp | grep my_function
>> + # -> 00000000000012a0 T my_function
>> +
>> + readelf -s /usr/bin/myapp | grep my_function
>> + # -> 42: 00000000000012a0 336 FUNC GLOBAL DEFAULT 13 my_function
>> +
>> + # offset_start = 0x12a0 (function entry)
>> + # offset_stop = 0x12a0 + 0x50 = 0x12f0 (or any instruction before return)
>> +
>> +Notes:
>> +
>> +- The uprobes fire for every task that executes the probed instruction,
>> + so concurrent calls from different threads each get independent
>> + monitoring slots.
>> +- ``offset_stop`` need not be a function return; it can be any instruction
>> + within the region. If the stop probe is never reached (e.g. early exit
>> + path bypasses it), the hrtimer fires and a budget violation is reported.
>> +- Each ``(binary_path, offset_start)`` pair may only be registered once.
>> + A second write with the same ``offset_start`` for the same binary is
>> + rejected with ``-EEXIST``. Two entry uprobes at the same address would
>> + both fire for every task, causing ``tlob_start_task()`` to be called
>> + twice; the second call would silently fail with ``-EEXIST`` and the
>> + second binding's threshold would never take effect. Different code
>> + regions that share the same ``offset_stop`` (common exit point) are
>> + explicitly allowed.
>> +- The uprobe binding is removed when ``-offset_start:binary_path`` is
>> + written to ``monitor``, or when the monitor is disabled.
>> +- The ``tag`` field in every ``tlob_budget_exceeded`` event is
>> + automatically set to ``offset_start`` for the tracefs path, so
>> + violation events for different code regions are immediately
>> + distinguishable even when ``threshold_us`` values are identical.
>> +
>> +ftrace ring buffer (budget violation events)
>> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> +
>> +When a monitored task exceeds its latency budget the hrtimer fires,
>> +records the violation, and emits a single ``tlob_budget_exceeded`` event
>> +into the ftrace ring buffer. **Nothing is written to the ftrace ring
>> +buffer while the task is within budget.**
>> +
>> +The event carries the on-CPU / off-CPU time breakdown so that root-cause
>> +analysis (CPU-bound vs. scheduling / I/O overrun) is immediate::
>> +
>> + cat /sys/kernel/tracing/trace
>> +
>> +Example output::
>> +
>> + myapp-1234 [003] .... 12345.678: tlob_budget_exceeded: \
>> + myapp[1234]: budget exceeded threshold=5000 \
>> + on_cpu=820 off_cpu=4500 switches=3 state=off_cpu tag=0x00000000000012a0
>> +
>> +Field descriptions:
>> +
>> +``threshold``
>> + Configured latency budget in microseconds.
>> +
>> +``on_cpu``
>> + Cumulative on-CPU time since ``trace_start``, in microseconds.
>> +
>> +``off_cpu``
>> + Cumulative off-CPU (scheduling + I/O wait) time since ``trace_start``,
>> + in microseconds.
>> +
>> +``switches``
>> + Number of times the task was scheduled out during this window.
>> +
>> +``state``
>> + DA state when the hrtimer fired: ``on_cpu`` means the task was executing
>> + when the budget expired (CPU-bound overrun); ``off_cpu`` means the task
>> + was preempted or blocked (scheduling / I/O overrun).
>> +
>> +``tag``
>> + Opaque 64-bit cookie supplied by the caller via ``tlob_start_args.tag``
>> + (ioctl path) or automatically set to ``offset_start`` (tracefs uprobe
>> + path). Use it to distinguish violations from different code regions
>> + monitored by the same thread. Zero when not set.
>> +
>> +To capture violations in a file::
>> +
>> + trace-cmd record -e tlob_budget_exceeded &
>> + # ... run workload ...
>> + trace-cmd report
>> +
>> +/dev/rv ioctl interface (self-instrumentation)
>> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> +
>> +Tasks can self-instrument their own code paths via the ``/dev/rv`` misc
>> +device (requires ``CONFIG_RV_CHARDEV``). The kernel key is
>> +``task_struct``; multiple threads sharing a single fd each get their own
>> +independent monitoring slot.
>> +
>> +**Synchronous mode** -- the calling thread checks its own result::
>> +
>> + int fd = open("/dev/rv", O_RDWR);
>> +
>> + struct tlob_start_args args = {
>> + .threshold_us = 50000, /* 50 ms */
>> + .tag = 0, /* optional; 0 = don't care */
>> + .notify_fd = -1, /* no fd notification */
>> + };
>> + ioctl(fd, TLOB_IOCTL_TRACE_START, &args);
>> +
>> + /* ... code path under observation ... */
>> +
>> + int ret = ioctl(fd, TLOB_IOCTL_TRACE_STOP, NULL);
>> + /* ret == 0: within budget */
>> + /* ret == -EOVERFLOW: budget exceeded */
>> +
>> + close(fd);
>> +
>> +**Asynchronous mode** -- a dedicated monitor thread receives violation
>> +records via ``read()`` on a shared fd, decoupling the observation from
>> +the critical path::
>> +
>> + /* Monitor thread: open a dedicated fd. */
>> + int monitor_fd = open("/dev/rv", O_RDWR);
>> +
>> + /* Worker thread: set notify_fd = monitor_fd in TRACE_START args. */
>> + int work_fd = open("/dev/rv", O_RDWR);
>> + struct tlob_start_args args = {
>> + .threshold_us = 10000, /* 10 ms */
>> + .tag = REGION_A,
>> + .notify_fd = monitor_fd,
>> + };
>> + ioctl(work_fd, TLOB_IOCTL_TRACE_START, &args);
>> + /* ... critical section ... */
>> + ioctl(work_fd, TLOB_IOCTL_TRACE_STOP, NULL);
>> +
>> + /* Monitor thread: blocking read() returns one or more tlob_event records. */
>> + struct tlob_event ntfs[8];
>> + ssize_t n = read(monitor_fd, ntfs, sizeof(ntfs));
>> + for (ssize_t i = 0; i < n / (ssize_t)sizeof(struct tlob_event); i++) {
>> + struct tlob_event *ntf = &ntfs[i];
>> + printf("tid=%u tag=0x%llx exceeded budget=%llu us "
>> + "(on_cpu=%llu off_cpu=%llu switches=%u state=%s)\n",
>> + ntf->tid, ntf->tag, ntf->threshold_us,
>> + ntf->on_cpu_us, ntf->off_cpu_us, ntf->switches,
>> + ntf->state ? "on_cpu" : "off_cpu");
>> + }
>> +
>> +**mmap ring buffer** -- zero-copy consumption of violation events::
>> +
>> + int fd = open("/dev/rv", O_RDWR);
>> + struct tlob_start_args args = {
>> + .threshold_us = 1000, /* 1 ms */
>> + .notify_fd = fd, /* push violations to own ring buffer */
>> + };
>> + ioctl(fd, TLOB_IOCTL_TRACE_START, &args);
>> +
>> + /* Map the ring: one control page + capacity data records. */
>> + size_t pagesize = sysconf(_SC_PAGESIZE);
>> + size_t cap = 64; /* read from page->capacity after mmap */
>> + size_t len = pagesize + cap * sizeof(struct tlob_event);
>> + void *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
>> +
>> + struct tlob_mmap_page *page = map;
>> + struct tlob_event *data =
>> + (struct tlob_event *)((char *)map + page->data_offset);
>> +
>> + /* Consumer loop: poll for events, read without copying. */
>> + while (1) {
>> + poll(&(struct pollfd){fd, POLLIN, 0}, 1, -1);
>> +
>> + uint32_t head = __atomic_load_n(&page->data_head, __ATOMIC_ACQUIRE);
>> + uint32_t tail = page->data_tail;
>> + while (tail != head) {
>> + handle(&data[tail & (page->capacity - 1)]);
>> + tail++;
>> + }
>> + __atomic_store_n(&page->data_tail, tail, __ATOMIC_RELEASE);
>> + }
>> +
>> +Note: ``read()`` and ``mmap()`` share the same ring and ``data_tail``
>> +cursor. Do not use both simultaneously on the same fd.
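
The mmap example above passes pagesize + cap * sizeof(struct tlob_event)
directly as the length, while the kerneldoc for struct tlob_mmap_page
asks for that value to be rounded up to a page boundary. A sketch of the
rounding follows. tlob_mmap_len() is a hypothetical helper, and the
48-byte record size used in the note below is what the struct tlob_event
layout in this patch adds up to with natural alignment; the ring's
record_size field remains the authoritative value.

```c
#include <assert.h>
#include <stddef.h>

/* Round the control page plus `capacity` records up to a page multiple. */
size_t tlob_mmap_len(size_t pagesize, size_t capacity, size_t record_size)
{
	size_t raw;

	/* Page sizes are powers of two; the mask below relies on that. */
	assert(pagesize && (pagesize & (pagesize - 1)) == 0);

	raw = pagesize + capacity * record_size;
	return (raw + pagesize - 1) & ~(pagesize - 1);
}
```

With 4 KiB pages, 64 records of 48 bytes give a raw size of 7168 bytes,
which rounds up to 8192.
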
>> +
>> +``tlob_event`` fields:
>> +
>> +``tid``
>> + Thread ID (``task_pid_vnr``) of the violating task.
>> +
>> +``threshold_us``
>> + Budget that was exceeded, in microseconds.
>> +
>> +``on_cpu_us``
>> + Cumulative on-CPU time at violation time, in microseconds.
>> +
>> +``off_cpu_us``
>> + Cumulative off-CPU time at violation time, in microseconds.
>> +
>> +``switches``
>> + Number of context switches since ``TRACE_START``.
>> +
>> +``state``
>> + 1 = timer fired while task was on-CPU; 0 = timer fired while off-CPU.
>> +
>> +``tag``
>> + Cookie from ``tlob_start_args.tag``; for the tracefs uprobe path this
>> + equals ``offset_start``. Zero when not set.
>> +
>> +tracefs files
>> +-------------
>> +
>> +The following files are created under
>> +``/sys/kernel/tracing/rv/monitors/tlob/``:
>> +
>> +``enable`` (rw)
>> + Write ``1`` to enable the monitor; write ``0`` to disable it and
>> + stop all currently monitored tasks.
>> +
>> +``desc`` (ro)
>> + Human-readable description of the monitor.
>> +
>> +``monitor`` (rw)
>> + Write ``threshold_us:offset_start:offset_stop:binary_path`` to bind two
>> + plain entry uprobes in *binary_path*. The uprobe at *offset_start* fires
>> + ``tlob_start_task()``; the uprobe at *offset_stop* fires
>> + ``tlob_stop_task()``. Returns ``-EEXIST`` if a binding with the same
>> + *offset_start* already exists for *binary_path*. Write
>> + ``-offset_start:binary_path`` to remove the binding. Read to list
>> + registered bindings, one
>> + ``threshold_us:0xoffset_start:0xoffset_stop:binary_path`` entry per line.
>> +
>> +Specification
>> +-------------
>> +
>> +Graphviz DOT file in tools/verification/models/tlob.dot
>> diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
>> index 331223761..8d3af68db 100644
>> --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
>> +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
>> @@ -385,6 +385,7 @@ Code Seq# Include File Comments
>> 0xB8 01-02 uapi/misc/mrvl_cn10k_dpi.h Marvell CN10K DPI driver
>> 0xB8 all uapi/linux/mshv.h Microsoft Hyper-V /dev/mshv driver
>> <mailto:linux-hyperv@vger.kernel.org>
>> +0xB9 00-3F linux/rv.h Runtime Verification (RV) monitors
>> 0xBA 00-0F uapi/linux/liveupdate.h Pasha Tatashin
>> <mailto:pasha.tatashin@soleen.com>
>> 0xC0 00-0F linux/usb/iowarrior.h
>> diff --git a/include/uapi/linux/rv.h b/include/uapi/linux/rv.h
>> new file mode 100644
>> index 000000000..d1b96d8cd
>> --- /dev/null
>> +++ b/include/uapi/linux/rv.h
>> @@ -0,0 +1,181 @@
>> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
>> +/*
>> + * UAPI definitions for Runtime Verification (RV) monitors.
>> + *
>> + * All RV monitors that expose an ioctl self-instrumentation interface
>> + * share the magic byte RV_IOC_MAGIC (0xB9), registered in
>> + * Documentation/userspace-api/ioctl/ioctl-number.rst.
>> + *
>> + * A single /dev/rv misc device serves as the entry point. ioctl numbers
>> + * encode both the monitor identity and the operation:
>> + *
>> + * 0x01 - 0x1F tlob (task latency over budget)
>> + * 0x20 - 0x3F reserved for future RV monitors
>> + *
>> + * Usage examples and design rationale are in:
>> + * Documentation/trace/rv/monitor_tlob.rst
>> + */
>> +
>> +#ifndef _UAPI_LINUX_RV_H
>> +#define _UAPI_LINUX_RV_H
>> +
>> +#include <linux/ioctl.h>
>> +#include <linux/types.h>
>> +
>> +/* Magic byte shared by all RV monitor ioctls. */
>> +#define RV_IOC_MAGIC 0xB9
>> +
>> +/* -----------------------------------------------------------------------
>> + * tlob: task latency over budget monitor (nr 0x01 - 0x1F)
>> + * -----------------------------------------------------------------------
>> + */
>> +
>> +/**
>> + * struct tlob_start_args - arguments for TLOB_IOCTL_TRACE_START
>> + * @threshold_us: Latency budget for this critical section, in microseconds.
>> + * Must be greater than zero.
>> + * @tag: Opaque 64-bit cookie supplied by the caller. Echoed back
>> + * verbatim in the tlob_budget_exceeded ftrace event and in any
>> + * tlob_event record delivered via @notify_fd. Use it to identify
>> + * which code region triggered a violation when the same thread
>> + * monitors multiple regions sequentially. Set to 0 if not
>> + * needed.
>> + * @notify_fd: File descriptor that will receive a tlob_event record on
>> + * violation. Must refer to an open /dev/rv fd. May equal
>> + * the calling fd (self-notification, useful for retrieving the
>> + * on_cpu_us / off_cpu_us breakdown after TRACE_STOP returns
>> + * -EOVERFLOW). Set to -1 to disable fd notification; in that
>> + * case violations are only signalled via the TRACE_STOP return
>> + * value and the tlob_budget_exceeded ftrace event.
>> + * @flags: Must be 0. Reserved for future extensions.
>> + */
>> +struct tlob_start_args {
>> + __u64 threshold_us;
>> + __u64 tag;
>> + __s32 notify_fd;
>> + __u32 flags;
>> +};
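
The constraints stated in the kerneldoc (threshold_us > 0, flags == 0,
notify_fd either -1 or a valid fd) can be sketched as a check. This is
an illustration of the documented contract, not the validation code in
rv_dev.c, which is not visible in this hunk. The struct name, the helper
name and the -EINVAL return value are assumptions.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

struct tlob_start_args_sketch {	/* same field order as the uapi struct */
	uint64_t threshold_us;
	uint64_t tag;
	int32_t  notify_fd;
	uint32_t flags;
};

/* 8 + 8 + 4 + 4 bytes: the layout needs no implicit padding. */
_Static_assert(sizeof(struct tlob_start_args_sketch) == 24,
	       "unexpected padding in start args");

int tlob_check_start_args(const struct tlob_start_args_sketch *a)
{
	assert(a);

	if (a->threshold_us == 0)	/* budget must be non-zero */
		return -EINVAL;
	if (a->flags != 0)		/* reserved for future extensions */
		return -EINVAL;
	if (a->notify_fd < -1)		/* -1 disables fd notification */
		return -EINVAL;
	return 0;
}
```

Rejecting non-zero flags from the start is what lets the field be given
a meaning later without breaking existing callers.
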
>> +
>> +/**
>> + * struct tlob_event - one budget-exceeded event
>> + *
>> + * Consumed by read() on the notify_fd registered at TLOB_IOCTL_TRACE_START.
>> + * Each record describes a single budget exceedance for one task.
>> + *
>> + * @tid: Thread ID (task_pid_vnr) of the violating task.
>> + * @threshold_us: Budget that was exceeded, in microseconds.
>> + * @on_cpu_us: Cumulative on-CPU time at violation time, in microseconds.
>> + * @off_cpu_us: Cumulative off-CPU (scheduling + I/O wait) time at
>> + * violation time, in microseconds.
>> + * @switches: Number of context switches since TRACE_START.
>> + * @state: DA state at violation: 1 = on_cpu, 0 = off_cpu.
>> + * @tag: Cookie from tlob_start_args.tag; for the tracefs uprobe path
>> + * this is the offset_start value. Zero when not set.
>> + */
>> +struct tlob_event {
>> + __u32 tid;
>> + __u32 pad;
>> + __u64 threshold_us;
>> + __u64 on_cpu_us;
>> + __u64 off_cpu_us;
>> + __u32 switches;
>> + __u32 state; /* 1 = on_cpu, 0 = off_cpu */
>> + __u64 tag;
>> +};
>> +
>> +/**
>> + * struct tlob_mmap_page - control page for the mmap'd violation ring buffer
>> + *
>> + * Mapped at offset 0 of the mmap region returned by mmap(2) on a /dev/rv fd.
>> + * The data array of struct tlob_event records begins at offset @data_offset
>> + * (always one page from the mmap base; use this field rather than
>> + * hard-coding PAGE_SIZE so the code remains correct across architectures).
>> + *
>> + * Ring layout:
>> + *
>> + * mmap base + 0 : struct tlob_mmap_page (one page)
>> + * mmap base + data_offset : struct tlob_event[capacity]
>> + *
>> + * The mmap length determines the ring capacity. Compute it as:
>> + *
>> + * raw = sysconf(_SC_PAGESIZE) + capacity * sizeof(struct tlob_event)
>> + * length = (raw + sysconf(_SC_PAGESIZE) - 1) & ~(sysconf(_SC_PAGESIZE) - 1)
>> + *
>> + * i.e. round the raw byte count up to the next page boundary before
>> + * passing it to mmap(2). The kernel requires a page-aligned length.
>> + * capacity must be a power of 2. Read @capacity after a successful
>> + * mmap(2) for the actual value.
>> + *
>> + * Producer/consumer ordering contract:
>> + *
>> + * Kernel (producer):
>> + * data[data_head & (capacity - 1)] = event;
>> + * // pairs with load-acquire in userspace:
>> + * smp_store_release(&page->data_head, data_head + 1);
>> + *
>> + * Userspace (consumer):
>> + * // pairs with store-release in kernel:
>> + * head = __atomic_load_n(&page->data_head, __ATOMIC_ACQUIRE);
>> + * for (tail = page->data_tail; tail != head; tail++)
>> + * handle(&data[tail & (capacity - 1)]);
>> + * __atomic_store_n(&page->data_tail, tail, __ATOMIC_RELEASE);
>> + *
>> + * @data_head and @data_tail are monotonically increasing __u32 counters
>> + * in units of records. Unsigned 32-bit wrap-around is handled correctly
>> + * by modular arithmetic; the ring is full when
>> + * (data_head - data_tail) == capacity.
>> + *
>> + * When the ring is full the kernel drops the incoming record and increments
>> + * @dropped. The consumer should check @dropped periodically to detect loss.
>> + *
>> + * read() and mmap() share the same ring buffer. Do not use both
>> + * simultaneously on the same fd.
>> + *
>> + * @data_head: Next write slot index. Updated by the kernel with
>> + * store-release ordering. Read by userspace with load-acquire.
>> + * @data_tail: Next read slot index. Updated by userspace. Read by the
>> + * kernel to detect overflow.
>> + * @capacity: Actual ring capacity in records (power of 2). Written once
>> + * by the kernel at mmap time; read-only for userspace thereafter.
>> + * @version: Ring buffer ABI version; currently 1.
>> + * @data_offset: Byte offset from the mmap base to the data array.
>> + * Always equal to sysconf(_SC_PAGESIZE) on the running kernel.
>> + * @record_size: sizeof(struct tlob_event) as seen by the kernel. Verify
>> + * this matches userspace's sizeof before indexing the array.
>> + * @dropped: Number of events dropped because the ring was full.
>> + * Monotonically increasing; read with __ATOMIC_RELAXED.
>> + */
>> +struct tlob_mmap_page {
>> + __u32 data_head;
>> + __u32 data_tail;
>> + __u32 capacity;
>> + __u32 version;
>> + __u32 data_offset;
>> + __u32 record_size;
>> + __u64 dropped;
>> +};
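
The head/tail contract described in the comment (free-running __u32
counters, full when head - tail == capacity, drop-new on overflow) can
be exercised in isolation. The sketch below is single-threaded and
leaves out the acquire/release ordering on purpose; struct ring_idx,
ring_produce() and ring_pending() are hypothetical names, not part of
the patch.

```c
#include <assert.h>
#include <stdint.h>

struct ring_idx {		/* stand-in for the control page fields */
	uint32_t head, tail, capacity;
	uint64_t dropped;
};

/* Producer, drop-new policy: returns the slot written, or -1 when full. */
int ring_produce(struct ring_idx *r)
{
	/* Masking with capacity - 1 requires a power-of-two capacity. */
	assert(r->capacity && (r->capacity & (r->capacity - 1)) == 0);

	if ((uint32_t)(r->head - r->tail) == r->capacity) {
		r->dropped++;
		return -1;
	}
	return (int)(r->head++ & (r->capacity - 1));
}

/* Consumer: number of records ready to be read. */
uint32_t ring_pending(const struct ring_idx *r)
{
	return r->head - r->tail;
}
```

Starting both counters at 0xFFFFFFFE with a capacity of 4 shows that
the unsigned subtraction still yields the fill level after head wraps
past zero.
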
>> +
>> +/*
>> + * TLOB_IOCTL_TRACE_START - begin monitoring the calling task.
>> + *
>> + * Arms a per-task hrtimer for threshold_us microseconds. If args.notify_fd
>> + * is >= 0, a tlob_event record is pushed into that fd's ring buffer on
>> + * violation in addition to the tlob_budget_exceeded ftrace event.
>> + * args.notify_fd == -1 disables fd notification.
>> + *
>> + * Violation records are consumed by read() on the notify_fd (blocking or
>> + * non-blocking depending on O_NONBLOCK). On violation,
>> + * TLOB_IOCTL_TRACE_STOP also returns -EOVERFLOW regardless of whether
>> + * notify_fd is set.
>> + *
>> + * args.flags must be 0.
>> + */
>> +#define TLOB_IOCTL_TRACE_START _IOW(RV_IOC_MAGIC, 0x01, struct tlob_start_args)
>> +
>> +/*
>> + * TLOB_IOCTL_TRACE_STOP - end monitoring the calling task.
>> + *
>> + * Returns 0 if within budget, -EOVERFLOW if the budget was exceeded.
>> + */
>> +#define TLOB_IOCTL_TRACE_STOP _IO(RV_IOC_MAGIC, 0x02)
>> +
>> +#endif /* _UAPI_LINUX_RV_H */
>> diff --git a/kernel/trace/rv/Kconfig b/kernel/trace/rv/Kconfig
>> index 5b4be87ba..227573cda 100644
>> --- a/kernel/trace/rv/Kconfig
>> +++ b/kernel/trace/rv/Kconfig
>> @@ -65,6 +65,7 @@ source "kernel/trace/rv/monitors/pagefault/Kconfig"
>> source "kernel/trace/rv/monitors/sleep/Kconfig"
>> # Add new rtapp monitors here
>>
>> +source "kernel/trace/rv/monitors/tlob/Kconfig"
>> # Add new monitors here
>>
>> config RV_REACTORS
>> @@ -93,3 +94,19 @@ config RV_REACT_PANIC
>> help
>> Enables the panic reactor. The panic reactor emits a printk()
>> message if an exception is found and panic()s the system.
>> +
>> +config RV_CHARDEV
>> + bool "RV ioctl interface via /dev/rv"
>> + depends on RV
>> + default n
>> + help
>> + Register a /dev/rv misc device that exposes an ioctl interface
>> + for RV monitor self-instrumentation. All RV monitors share the
>> + single device node; ioctl numbers encode the monitor identity.
>> +
>> + When enabled, user-space programs can open /dev/rv and use
>> + monitor-specific ioctl commands to bracket code regions they
>> + want the kernel RV subsystem to observe.
>> +
>> + Say Y here if you want to use the tlob self-instrumentation
>> + ioctl interface; otherwise say N.
>> diff --git a/kernel/trace/rv/Makefile b/kernel/trace/rv/Makefile
>> index 750e4ad6f..cc3781a3b 100644
>> --- a/kernel/trace/rv/Makefile
>> +++ b/kernel/trace/rv/Makefile
>> @@ -3,6 +3,7 @@
>> ccflags-y += -I $(src) # needed for trace events
>>
>> obj-$(CONFIG_RV) += rv.o
>> +obj-$(CONFIG_RV_CHARDEV) += rv_dev.o
>> obj-$(CONFIG_RV_MON_WIP) += monitors/wip/wip.o
>> obj-$(CONFIG_RV_MON_WWNR) += monitors/wwnr/wwnr.o
>> obj-$(CONFIG_RV_MON_SCHED) += monitors/sched/sched.o
>> @@ -17,6 +18,7 @@ obj-$(CONFIG_RV_MON_STS) += monitors/sts/sts.o
>> obj-$(CONFIG_RV_MON_NRP) += monitors/nrp/nrp.o
>> obj-$(CONFIG_RV_MON_SSSW) += monitors/sssw/sssw.o
>> obj-$(CONFIG_RV_MON_OPID) += monitors/opid/opid.o
>> +obj-$(CONFIG_RV_MON_TLOB) += monitors/tlob/tlob.o
>> # Add new monitors here
>> obj-$(CONFIG_RV_REACTORS) += rv_reactors.o
>> obj-$(CONFIG_RV_REACT_PRINTK) += reactor_printk.o
>> diff --git a/kernel/trace/rv/monitors/tlob/Kconfig b/kernel/trace/rv/monitors/tlob/Kconfig
>> new file mode 100644
>> index 000000000..010237480
>> --- /dev/null
>> +++ b/kernel/trace/rv/monitors/tlob/Kconfig
>> @@ -0,0 +1,51 @@
>> +# SPDX-License-Identifier: GPL-2.0-only
>> +#
>> +config RV_MON_TLOB
>> + depends on RV
>> + depends on UPROBES
>> + select DA_MON_EVENTS_ID
>> + bool "tlob monitor"
>> + help
>> + Enable the tlob (task latency over budget) monitor. This monitor
>> + tracks the elapsed time (CLOCK_MONOTONIC) of a marked code path
>> + within a task (including both on-CPU and off-CPU time) and reports a
>> + violation when the elapsed time exceeds a configurable budget
>> + threshold.
>> +
>> + The monitor implements a three-state deterministic automaton.
>> + States: unmonitored, on_cpu, off_cpu.
>> + Key transitions:
>> + unmonitored --(trace_start)--> on_cpu
>> + on_cpu --(switch_out)--> off_cpu
>> + off_cpu --(switch_in)--> on_cpu
>> + on_cpu --(trace_stop)--> unmonitored
>> + off_cpu --(trace_stop)--> unmonitored
>> + on_cpu --(budget_expired)--> unmonitored
>> + off_cpu --(budget_expired)--> unmonitored
>> +
>> + External configuration is done via the tracefs "monitor" file:
>> + echo threshold_us:offset_start:offset_stop:binary_path > .../rv/monitors/tlob/monitor
>> + echo -offset_start:binary_path > .../rv/monitors/tlob/monitor (remove binding)
>> + cat .../rv/monitors/tlob/monitor (list bindings)
>> +
>> + The uprobe binding places two plain entry uprobes at offset_start and
>> + offset_stop in the binary; these trigger tlob_start_task() and
>> + tlob_stop_task() respectively. Using two entry uprobes (rather than a
>> + uretprobe) means that a mistyped offset can never corrupt the call
>> + stack; the worst outcome is a missed stop, which causes the hrtimer to
>> + fire and report a budget violation.
>> +
>> + Violation events are delivered via a lock-free mmap ring buffer on
>> + /dev/rv (enabled by CONFIG_RV_CHARDEV). The consumer mmap()s the
>> + device, reads records from the data array using the head/tail indices
>> + in the control page, and advances data_tail when done.
>> +
>> + For self-instrumentation, use TLOB_IOCTL_TRACE_START /
>> + TLOB_IOCTL_TRACE_STOP via the /dev/rv misc device (enabled by
>> + CONFIG_RV_CHARDEV).
>> +
>> + Up to TLOB_MAX_MONITORED tasks may be monitored simultaneously.
>> +
>> + For further information, see:
>> + Documentation/trace/rv/monitor_tlob.rst
>> +
>> diff --git a/kernel/trace/rv/monitors/tlob/tlob.c b/kernel/trace/rv/monitors/tlob/tlob.c
>> new file mode 100644
>> index 000000000..a6e474025
>> --- /dev/null
>> +++ b/kernel/trace/rv/monitors/tlob/tlob.c
>> @@ -0,0 +1,986 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * tlob: task latency over budget monitor
>> + *
>> + * Track the elapsed wall-clock time of a marked code path and detect when
>> + * a monitored task exceeds its per-task latency budget. CLOCK_MONOTONIC
>> + * is used so both on-CPU and off-CPU time count toward the budget.
>> + *
>> + * Per-task state is maintained in a spinlock-protected hash table. A
>> + * one-shot hrtimer fires at the deadline; if the task has not called
>> + * trace_stop by then, a violation is recorded.
>> + *
>> + * Up to TLOB_MAX_MONITORED tasks may be tracked simultaneously.
>> + *
>> + * Copyright (C) 2026 Wen Yang <wen.yang@linux.dev>
>> + */
>> +#include <linux/file.h>
>> +#include <linux/fs.h>
>> +#include <linux/ftrace.h>
>> +#include <linux/hash.h>
>> +#include <linux/hrtimer.h>
>> +#include <linux/kernel.h>
>> +#include <linux/ktime.h>
>> +#include <linux/module.h>
>> +#include <linux/init.h>
>> +#include <linux/namei.h>
>> +#include <linux/poll.h>
>> +#include <linux/rv.h>
>> +#include <linux/sched.h>
>> +#include <linux/slab.h>
>> +#include <linux/atomic.h>
>> +#include <linux/rcupdate.h>
>> +#include <linux/spinlock.h>
>> +#include <linux/tracefs.h>
>> +#include <linux/uaccess.h>
>> +#include <linux/uprobes.h>
>> +#include <kunit/visibility.h>
>> +#include <rv/instrumentation.h>
>> +
>> +/* rv_interface_lock is defined in kernel/trace/rv/rv.c */
>> +extern struct mutex rv_interface_lock;
>> +
>> +#define MODULE_NAME "tlob"
>> +
>> +#include <rv_trace.h>
>> +#include <trace/events/sched.h>
>> +
>> +#define RV_MON_TYPE RV_MON_PER_TASK
>> +#include "tlob.h"
>> +#include <rv/da_monitor.h>
>> +
>> +/* Hash table size; must be a power of two. */
>> +#define TLOB_HTABLE_BITS 6
>> +#define TLOB_HTABLE_SIZE (1 << TLOB_HTABLE_BITS)
>> +
>> +/* Maximum binary path length for uprobe binding. */
>> +#define TLOB_MAX_PATH 256
>> +
>> +/* Per-task latency monitoring state. */
>> +struct tlob_task_state {
>> + struct hlist_node hlist;
>> + struct task_struct *task;
>> + u64 threshold_us;
>> + u64 tag;
>> + struct hrtimer deadline_timer;
>> + int canceled; /* protected by entry_lock */
>> + struct file *notify_file; /* NULL or held reference */
>> +
>> + /*
>> + * entry_lock serialises the mutable accounting fields below.
>> + * Lock order: tlob_table_lock -> entry_lock (never reverse).
>> + */
>> + raw_spinlock_t entry_lock;
>> + u64 on_cpu_us;
>> + u64 off_cpu_us;
>> + ktime_t last_ts;
>> + u32 switches;
>> + u8 da_state;
>> +
>> + struct rcu_head rcu; /* for call_rcu() teardown */
>> +};
>> +
>> +/* Per-uprobe-binding state: a start + stop probe pair for one binary region. */
>> +struct tlob_uprobe_binding {
>> + struct list_head list;
>> + u64 threshold_us;
>> + struct path path;
>> + char binpath[TLOB_MAX_PATH]; /* canonical path for read/remove */
>> + loff_t offset_start;
>> + loff_t offset_stop;
>> + struct uprobe_consumer entry_uc;
>> + struct uprobe_consumer stop_uc;
>> + struct uprobe *entry_uprobe;
>> + struct uprobe *stop_uprobe;
>> +};
>> +
>> +/* Object pool for tlob_task_state. */
>> +static struct kmem_cache *tlob_state_cache;
>> +
>> +/* Hash table and lock protecting table structure (insert/delete/canceled). */
>> +static struct hlist_head tlob_htable[TLOB_HTABLE_SIZE];
>> +static DEFINE_RAW_SPINLOCK(tlob_table_lock);
>> +static atomic_t tlob_num_monitored = ATOMIC_INIT(0);
>> +
>> +/* Uprobe binding list; protected by tlob_uprobe_mutex. */
>> +static LIST_HEAD(tlob_uprobe_list);
>> +static DEFINE_MUTEX(tlob_uprobe_mutex);
>> +
>> +/* Forward declaration */
>> +static enum hrtimer_restart tlob_deadline_timer_fn(struct hrtimer *timer);
>> +
>> +/* Hash table helpers */
>> +
>> +static unsigned int tlob_hash_task(const struct task_struct *task)
>> +{
>> + return hash_ptr((void *)task, TLOB_HTABLE_BITS);
>> +}
>> +
>> +/*
>> + * tlob_find_rcu - look up per-task state.
>> + * Must be called under rcu_read_lock() or with tlob_table_lock held.
>> + */
>> +static struct tlob_task_state *tlob_find_rcu(struct task_struct *task)
>> +{
>> + struct tlob_task_state *ws;
>> + unsigned int h = tlob_hash_task(task);
>> +
>> + hlist_for_each_entry_rcu(ws, &tlob_htable[h], hlist,
>> + lockdep_is_held(&tlob_table_lock))
>> + if (ws->task == task)
>> + return ws;
>> + return NULL;
>> +}
>> +
>> +/* Allocate and initialise a new per-task state entry. */
>> +static struct tlob_task_state *tlob_alloc(struct task_struct *task,
>> + u64 threshold_us, u64 tag)
>> +{
>> + struct tlob_task_state *ws;
>> +
>> + ws = kmem_cache_zalloc(tlob_state_cache, GFP_ATOMIC);
>> + if (!ws)
>> + return NULL;
>> +
>> + ws->task = task;
>> + get_task_struct(task);
>> + ws->threshold_us = threshold_us;
>> + ws->tag = tag;
>> + ws->last_ts = ktime_get();
>> + ws->da_state = on_cpu_tlob;
>> + raw_spin_lock_init(&ws->entry_lock);
>> + hrtimer_setup(&ws->deadline_timer, tlob_deadline_timer_fn,
>> + CLOCK_MONOTONIC, HRTIMER_MODE_REL);
>> + return ws;
>> +}
>> +
>> +/* RCU callback: free the slab once no readers remain. */
>> +static void tlob_free_rcu_slab(struct rcu_head *head)
>> +{
>> + struct tlob_task_state *ws =
>> + container_of(head, struct tlob_task_state, rcu);
>> + kmem_cache_free(tlob_state_cache, ws);
>> +}
>> +
>> +/* Arm the one-shot deadline timer for threshold_us microseconds. */
>> +static void tlob_arm_deadline(struct tlob_task_state *ws)
>> +{
>> + hrtimer_start(&ws->deadline_timer,
>> + ns_to_ktime(ws->threshold_us * NSEC_PER_USEC),
>> + HRTIMER_MODE_REL);
>> +}
>> +
>> +/*
>> + * Push a violation record into a monitor fd's ring buffer (softirq context).
>> + * Drop-new policy: discard incoming record when full. smp_store_release on
>> + * data_head pairs with smp_load_acquire in the consumer.
>> + */
>> +static void tlob_event_push(struct rv_file_priv *priv,
>> + const struct tlob_event *info)
>> +{
>> + struct tlob_ring *ring = &priv->ring;
>> + unsigned long flags;
>> + u32 head, tail;
>> +
>> + spin_lock_irqsave(&ring->lock, flags);
>> +
>> + head = ring->page->data_head;
>> + tail = READ_ONCE(ring->page->data_tail);
>> +
>> + if (head - tail > ring->mask) {
>> + /* Ring full: drop incoming record. */
>> + ring->page->dropped++;
>> + spin_unlock_irqrestore(&ring->lock, flags);
>> + return;
>> + }
>> +
>> + ring->data[head & ring->mask] = *info;
>> + /* pairs with smp_load_acquire() in the consumer */
>> + smp_store_release(&ring->page->data_head, head + 1);
>> +
>> + spin_unlock_irqrestore(&ring->lock, flags);
>> +
>> + wake_up_interruptible_poll(&priv->waitq, EPOLLIN | EPOLLRDNORM);
>> +}
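
To make the drop-new policy and the release/acquire pairing described above concrete, here is a minimal userspace sketch. It is not the patch's code: the demo_* names and the fixed capacity are hypothetical, C11 atomics stand in for smp_store_release()/smp_load_acquire(), and the producer-side spinlock is left out because the sketch assumes a single producer.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_RING_CAP  8u                    /* hypothetical capacity */
#define DEMO_RING_MASK (DEMO_RING_CAP - 1u)

static_assert((DEMO_RING_CAP & DEMO_RING_MASK) == 0,
	      "capacity must be a power of two");

struct demo_record {
	uint64_t tag;
};

struct demo_ring {
	_Atomic uint32_t head;    /* advanced by the producer only */
	_Atomic uint32_t tail;    /* advanced by the consumer only */
	uint32_t dropped;         /* records discarded while full */
	struct demo_record data[DEMO_RING_CAP];
};

/* Producer side: discard the incoming record when the ring is full. */
static bool demo_push(struct demo_ring *r, const struct demo_record *rec)
{
	uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
	uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

	/* Indices run free; unsigned subtraction yields the fill level. */
	if (head - tail > DEMO_RING_MASK) {
		r->dropped++;
		return false;
	}

	r->data[head & DEMO_RING_MASK] = *rec;
	/* Release: the record is visible before the new head is. */
	atomic_store_explicit(&r->head, head + 1, memory_order_release);
	return true;
}

/* Consumer side: the acquire load of head pairs with the release above. */
static bool demo_pop(struct demo_ring *r, struct demo_record *out)
{
	uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
	uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);

	if (head == tail)
		return false;

	*out = r->data[tail & DEMO_RING_MASK];
	atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
	return true;
}
```

With a capacity of 8, the ninth push without an intervening pop returns false and bumps dropped; one pop frees a slot again. The full test is the same `head - tail > mask` comparison the quoted function uses.
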
>> +
>> +#if IS_ENABLED(CONFIG_KUNIT)
>> +void tlob_event_push_kunit(struct rv_file_priv *priv,
>> + const struct tlob_event *info)
>> +{
>> + tlob_event_push(priv, info);
>> +}
>> +EXPORT_SYMBOL_IF_KUNIT(tlob_event_push_kunit);
>> +#endif /* CONFIG_KUNIT */
>> +
>> +/*
>> + * Budget exceeded: remove the entry, record the violation, and inject
>> + * budget_expired into the DA.
>> + *
>> + * Lock order: tlob_table_lock -> entry_lock. tlob_stop_task() sets
>> + * ws->canceled under both locks; if we see it here the stop path owns cleanup.
>> + * fput/put_task_struct are done before call_rcu(); the RCU callback only
>> + * reclaims the slab.
>> + */
>> +static enum hrtimer_restart tlob_deadline_timer_fn(struct hrtimer *timer)
>> +{
>> + struct tlob_task_state *ws =
>> + container_of(timer, struct tlob_task_state, deadline_timer);
>> + struct tlob_event info = {};
>> + struct file *notify_file;
>> + struct task_struct *task;
>> + unsigned long flags;
>> + /* snapshots taken under entry_lock */
>> + u64 on_cpu_us, off_cpu_us, threshold_us, tag;
>> + u32 switches;
>> + bool on_cpu;
>> + bool push_event = false;
>> +
>> + raw_spin_lock_irqsave(&tlob_table_lock, flags);
>> + /* stop path sets canceled under both locks; if set it owns cleanup */
>> + if (ws->canceled) {
>> + raw_spin_unlock_irqrestore(&tlob_table_lock, flags);
>> + return HRTIMER_NORESTART;
>> + }
>> +
>> + /* Finalize accounting and snapshot all fields under entry_lock. */
>> + raw_spin_lock(&ws->entry_lock);
>> +
>> + {
>> + ktime_t now = ktime_get();
>> + u64 delta_us = ktime_to_us(ktime_sub(now, ws->last_ts));
>> +
>> + if (ws->da_state == on_cpu_tlob)
>> + ws->on_cpu_us += delta_us;
>> + else
>> + ws->off_cpu_us += delta_us;
>> + }
>> +
>> + ws->canceled = 1;
>> + on_cpu_us = ws->on_cpu_us;
>> + off_cpu_us = ws->off_cpu_us;
>> + threshold_us = ws->threshold_us;
>> + tag = ws->tag;
>> + switches = ws->switches;
>> + on_cpu = (ws->da_state == on_cpu_tlob);
>> + notify_file = ws->notify_file;
>> + if (notify_file) {
>> + info.tid = task_pid_vnr(ws->task);
>> + info.threshold_us = threshold_us;
>> + info.on_cpu_us = on_cpu_us;
>> + info.off_cpu_us = off_cpu_us;
>> + info.switches = switches;
>> + info.state = on_cpu ? 1 : 0;
>> + info.tag = tag;
>> + push_event = true;
>> + }
>> +
>> + raw_spin_unlock(&ws->entry_lock);
>> +
>> + hlist_del_rcu(&ws->hlist);
>> + atomic_dec(&tlob_num_monitored);
>> + /*
>> + * Hold a reference so task remains valid across da_handle_event()
>> + * after we drop tlob_table_lock.
>> + */
>> + task = ws->task;
>> + get_task_struct(task);
>> + raw_spin_unlock_irqrestore(&tlob_table_lock, flags);
>> +
>> + /*
>> + * Both locks are now released; ws is exclusively owned (removed from
>> + * the hash table with canceled=1). Emit the tracepoint and push the
>> + * violation record.
>> + */
>> + trace_tlob_budget_exceeded(ws->task, threshold_us, on_cpu_us,
>> + off_cpu_us, switches, on_cpu, tag);
>> +
>> + if (push_event) {
>> + struct rv_file_priv *priv = notify_file->private_data;
>> +
>> + if (priv)
>> + tlob_event_push(priv, &info);
>> + }
>> +
>> + da_handle_event(task, budget_expired_tlob);
>> +
>> + if (notify_file)
>> + fput(notify_file); /* ref from fget() at TRACE_START */
>> + put_task_struct(ws->task); /* ref from tlob_alloc() */
>> + put_task_struct(task); /* extra ref from get_task_struct() above */
>> + call_rcu(&ws->rcu, tlob_free_rcu_slab);
>> + return HRTIMER_NORESTART;
>> +}
>> +
>> +/* Tracepoint handlers */
>> +
>> +/*
>> + * handle_sched_switch - advance the DA and accumulate on/off-CPU time.
>> + *
>> + * RCU read-side for lock-free lookup; entry_lock for per-task accounting.
>> + * da_handle_event() is called after rcu_read_unlock() to avoid holding the
>> + * read-side critical section across the RV framework.
>> + */
>> +static void handle_sched_switch(void *data, bool preempt,
>> + struct task_struct *prev,
>> + struct task_struct *next,
>> + unsigned int prev_state)
>> +{
>> + struct tlob_task_state *ws;
>> + unsigned long flags;
>> + bool do_prev = false, do_next = false;
>> + ktime_t now;
>> +
>> + rcu_read_lock();
>> +
>> + ws = tlob_find_rcu(prev);
>> + if (ws) {
>> + raw_spin_lock_irqsave(&ws->entry_lock, flags);
>> + if (!ws->canceled) {
>> + now = ktime_get();
>> + ws->on_cpu_us += ktime_to_us(ktime_sub(now, ws->last_ts));
>> + ws->last_ts = now;
>> + ws->switches++;
>> + ws->da_state = off_cpu_tlob;
>> + do_prev = true;
>> + }
>> + raw_spin_unlock_irqrestore(&ws->entry_lock, flags);
>> + }
>> +
>> + ws = tlob_find_rcu(next);
>> + if (ws) {
>> + raw_spin_lock_irqsave(&ws->entry_lock, flags);
>> + if (!ws->canceled) {
>> + now = ktime_get();
>> + ws->off_cpu_us += ktime_to_us(ktime_sub(now, ws->last_ts));
>> + ws->last_ts = now;
>> + ws->da_state = on_cpu_tlob;
>> + do_next = true;
>> + }
>> + raw_spin_unlock_irqrestore(&ws->entry_lock, flags);
>> + }
>> +
>> + rcu_read_unlock();
>> +
>> + if (do_prev)
>> + da_handle_event(prev, switch_out_tlob);
>> + if (do_next)
>> + da_handle_event(next, switch_in_tlob);
>> +}
>> +
>> +static void handle_sched_wakeup(void *data, struct task_struct *p)
>> +{
>> + struct tlob_task_state *ws;
>> + unsigned long flags;
>> + bool found = false;
>> +
>> + rcu_read_lock();
>> + ws = tlob_find_rcu(p);
>> + if (ws) {
>> + raw_spin_lock_irqsave(&ws->entry_lock, flags);
>> + found = !ws->canceled;
>> + raw_spin_unlock_irqrestore(&ws->entry_lock, flags);
>> + }
>> + rcu_read_unlock();
>> +
>> + if (found)
>> + da_handle_event(p, sched_wakeup_tlob);
>> +}
>> +
>> +/* -----------------------------------------------------------------------
>> + * Core start/stop helpers (also called from rv_dev.c)
>> + * -----------------------------------------------------------------------
>> + */
>> +
>> +/*
>> + * __tlob_insert - insert @ws into the hash table and arm its deadline timer.
>> + *
>> + * Re-checks for duplicates and capacity under tlob_table_lock; the caller
>> + * may have done a lock-free pre-check before allocating @ws. On failure @ws
>> + * is freed directly (never in table, so no call_rcu needed).
>> + */
>> +static int __tlob_insert(struct task_struct *task, struct tlob_task_state *ws)
>> +{
>> + unsigned int h;
>> + unsigned long flags;
>> +
>> + raw_spin_lock_irqsave(&tlob_table_lock, flags);
>> + if (tlob_find_rcu(task)) {
>> + raw_spin_unlock_irqrestore(&tlob_table_lock, flags);
>> + if (ws->notify_file)
>> + fput(ws->notify_file);
>> + put_task_struct(ws->task);
>> + kmem_cache_free(tlob_state_cache, ws);
>> + return -EEXIST;
>> + }
>> + if (atomic_read(&tlob_num_monitored) >= TLOB_MAX_MONITORED) {
>> + raw_spin_unlock_irqrestore(&tlob_table_lock, flags);
>> + if (ws->notify_file)
>> + fput(ws->notify_file);
>> + put_task_struct(ws->task);
>> + kmem_cache_free(tlob_state_cache, ws);
>> + return -ENOSPC;
>> + }
>> + h = tlob_hash_task(task);
>> + hlist_add_head_rcu(&ws->hlist, &tlob_htable[h]);
>> + atomic_inc(&tlob_num_monitored);
>> + raw_spin_unlock_irqrestore(&tlob_table_lock, flags);
>> +
>> + da_handle_start_run_event(task, trace_start_tlob);
>> + tlob_arm_deadline(ws);
>> + return 0;
>> +}
>> +
>> +/**
>> + * tlob_start_task - begin monitoring @task with latency budget @threshold_us.
>> + *
>> + * @notify_file: /dev/rv fd whose ring buffer receives a tlob_event on
>> + * violation; caller transfers the fget() reference to tlob.c.
>> + * Pass NULL for synchronous mode (violations only via
>> + * TRACE_STOP return value and the tlob_budget_exceeded event).
>> + *
>> + * Returns 0, -ENODEV, -EEXIST, -ENOSPC, or -ENOMEM. On failure the caller
>> + * retains responsibility for any @notify_file reference.
>> + */
>> +int tlob_start_task(struct task_struct *task, u64 threshold_us,
>> + struct file *notify_file, u64 tag)
>> +{
>> + struct tlob_task_state *ws;
>> + unsigned long flags;
>> +
>> + if (!tlob_state_cache)
>> + return -ENODEV;
>> +
>> + if (threshold_us > (u64)KTIME_MAX / NSEC_PER_USEC)
>> + return -ERANGE;
>> +
>> + /* Quick pre-check before allocation. */
>> + raw_spin_lock_irqsave(&tlob_table_lock, flags);
>> + if (tlob_find_rcu(task)) {
>> + raw_spin_unlock_irqrestore(&tlob_table_lock, flags);
>> + return -EEXIST;
>> + }
>> + if (atomic_read(&tlob_num_monitored) >= TLOB_MAX_MONITORED) {
>> + raw_spin_unlock_irqrestore(&tlob_table_lock, flags);
>> + return -ENOSPC;
>> + }
>> + raw_spin_unlock_irqrestore(&tlob_table_lock, flags);
>> +
>> + ws = tlob_alloc(task, threshold_us, tag);
>> + if (!ws)
>> + return -ENOMEM;
>> +
>> + ws->notify_file = notify_file;
>> + return __tlob_insert(task, ws);
>> +}
>> +EXPORT_SYMBOL_GPL(tlob_start_task);
>> +
>> +/**
>> + * tlob_stop_task - stop monitoring @task before the deadline fires.
>> + *
>> + * Sets canceled under entry_lock (inside tlob_table_lock) before calling
>> + * hrtimer_cancel(), racing safely with the timer callback.
>> + *
>> + * Returns 0 if within budget, -ESRCH if the entry is gone (deadline already
>> + * fired, or TRACE_START was never called).
>> + */
>> +int tlob_stop_task(struct task_struct *task)
>> +{
>> + struct tlob_task_state *ws;
>> + struct file *notify_file;
>> + unsigned long flags;
>> +
>> + raw_spin_lock_irqsave(&tlob_table_lock, flags);
>> + ws = tlob_find_rcu(task);
>> + if (!ws) {
>> + raw_spin_unlock_irqrestore(&tlob_table_lock, flags);
>> + return -ESRCH;
>> + }
>> +
>> + /* Prevent handle_sched_switch from updating accounting after removal. */
>> + raw_spin_lock(&ws->entry_lock);
>> + ws->canceled = 1;
>> + raw_spin_unlock(&ws->entry_lock);
>> +
>> + hlist_del_rcu(&ws->hlist);
>> + atomic_dec(&tlob_num_monitored);
>> + raw_spin_unlock_irqrestore(&tlob_table_lock, flags);
>> +
>> + hrtimer_cancel(&ws->deadline_timer);
>> +
>> + da_handle_event(task, trace_stop_tlob);
>> +
>> + notify_file = ws->notify_file;
>> + if (notify_file)
>> + fput(notify_file);
>> + put_task_struct(ws->task);
>> + call_rcu(&ws->rcu, tlob_free_rcu_slab);
>> +
>> + return 0;
>> +}
>> +EXPORT_SYMBOL_GPL(tlob_stop_task);
>> +
>> +/* Stop monitoring all tracked tasks; called on monitor disable. */
>> +static void tlob_stop_all(void)
>> +{
>> + struct tlob_task_state *batch[TLOB_MAX_MONITORED];
>> + struct tlob_task_state *ws;
>> + struct hlist_node *tmp;
>> + unsigned long flags;
>> + int n = 0, i;
>> +
>> + raw_spin_lock_irqsave(&tlob_table_lock, flags);
>> + for (i = 0; i < TLOB_HTABLE_SIZE; i++) {
>> + hlist_for_each_entry_safe(ws, tmp, &tlob_htable[i], hlist) {
>> + raw_spin_lock(&ws->entry_lock);
>> + ws->canceled = 1;
>> + raw_spin_unlock(&ws->entry_lock);
>> + hlist_del_rcu(&ws->hlist);
>> + atomic_dec(&tlob_num_monitored);
>> + if (n < TLOB_MAX_MONITORED)
>> + batch[n++] = ws;
>> + }
>> + }
>> + raw_spin_unlock_irqrestore(&tlob_table_lock, flags);
>> +
>> + for (i = 0; i < n; i++) {
>> + ws = batch[i];
>> + hrtimer_cancel(&ws->deadline_timer);
>> + da_handle_event(ws->task, trace_stop_tlob);
>> + if (ws->notify_file)
>> + fput(ws->notify_file);
>> + put_task_struct(ws->task);
>> + call_rcu(&ws->rcu, tlob_free_rcu_slab);
>> + }
>> +}
>> +
>> +/* uprobe binding helpers */
>> +
>> +static int tlob_uprobe_entry_handler(struct uprobe_consumer *uc,
>> + struct pt_regs *regs, __u64 *data)
>> +{
>> + struct tlob_uprobe_binding *b =
>> + container_of(uc, struct tlob_uprobe_binding, entry_uc);
>> +
>> + tlob_start_task(current, b->threshold_us, NULL, (u64)b->offset_start);
>> + return 0;
>> +}
>> +
>> +static int tlob_uprobe_stop_handler(struct uprobe_consumer *uc,
>> + struct pt_regs *regs, __u64 *data)
>> +{
>> + tlob_stop_task(current);
>> + return 0;
>> +}
>> +
>> +/*
>> + * Register start + stop entry uprobes for a binding.
>> + * Both are plain entry uprobes (no uretprobe), so a wrong offset never
>> + * corrupts the call stack; the worst outcome is a missed stop (hrtimer
>> + * fires and reports a budget violation).
>> + * Called with tlob_uprobe_mutex held.
>> + */
>> +static int tlob_add_uprobe(u64 threshold_us, const char *binpath,
>> + loff_t offset_start, loff_t offset_stop)
>> +{
>> + struct tlob_uprobe_binding *b, *tmp_b;
>> + char pathbuf[TLOB_MAX_PATH];
>> + struct inode *inode;
>> + char *canon;
>> + int ret;
>> +
>> + b = kzalloc(sizeof(*b), GFP_KERNEL);
>> + if (!b)
>> + return -ENOMEM;
>> +
>> + if (binpath[0] != '/') {
>> + kfree(b);
>> + return -EINVAL;
>> + }
>> +
>> + b->threshold_us = threshold_us;
>> + b->offset_start = offset_start;
>> + b->offset_stop = offset_stop;
>> +
>> + ret = kern_path(binpath, LOOKUP_FOLLOW, &b->path);
>> + if (ret)
>> + goto err_free;
>> +
>> + if (!d_is_reg(b->path.dentry)) {
>> + ret = -EINVAL;
>> + goto err_path;
>> + }
>> +
>> + /* Reject duplicate start offset for the same binary. */
>> + list_for_each_entry(tmp_b, &tlob_uprobe_list, list) {
>> + if (tmp_b->offset_start == offset_start &&
>> + tmp_b->path.dentry == b->path.dentry) {
>> + ret = -EEXIST;
>> + goto err_path;
>> + }
>> + }
>> +
>> + /* Store canonical path for read-back and removal matching. */
>> + canon = d_path(&b->path, pathbuf, sizeof(pathbuf));
>> + if (IS_ERR(canon)) {
>> + ret = PTR_ERR(canon);
>> + goto err_path;
>> + }
>> + strscpy(b->binpath, canon, sizeof(b->binpath));
>> +
>> + b->entry_uc.handler = tlob_uprobe_entry_handler;
>> + b->stop_uc.handler = tlob_uprobe_stop_handler;
>> +
>> + inode = d_real_inode(b->path.dentry);
>> +
>> + b->entry_uprobe = uprobe_register(inode, offset_start, 0, &b->entry_uc);
>> + if (IS_ERR(b->entry_uprobe)) {
>> + ret = PTR_ERR(b->entry_uprobe);
>> + b->entry_uprobe = NULL;
>> + goto err_path;
>> + }
>> +
>> + b->stop_uprobe = uprobe_register(inode, offset_stop, 0, &b->stop_uc);
>> + if (IS_ERR(b->stop_uprobe)) {
>> + ret = PTR_ERR(b->stop_uprobe);
>> + b->stop_uprobe = NULL;
>> + goto err_entry;
>> + }
>> +
>> + list_add_tail(&b->list, &tlob_uprobe_list);
>> + return 0;
>> +
>> +err_entry:
>> + uprobe_unregister_nosync(b->entry_uprobe, &b->entry_uc);
>> + uprobe_unregister_sync();
>> +err_path:
>> + path_put(&b->path);
>> +err_free:
>> + kfree(b);
>> + return ret;
>> +}
>> +
>> +/*
>> + * Remove the uprobe binding for (offset_start, binpath).
>> + * binpath is resolved to a dentry for comparison so symlinks are handled
>> + * correctly. Called with tlob_uprobe_mutex held.
>> + */
>> +static void tlob_remove_uprobe_by_key(loff_t offset_start, const char *binpath)
>> +{
>> + struct tlob_uprobe_binding *b, *tmp;
>> + struct path remove_path;
>> +
>> + if (kern_path(binpath, LOOKUP_FOLLOW, &remove_path))
>> + return;
>> +
>> + list_for_each_entry_safe(b, tmp, &tlob_uprobe_list, list) {
>> + if (b->offset_start != offset_start)
>> + continue;
>> + if (b->path.dentry != remove_path.dentry)
>> + continue;
>> + uprobe_unregister_nosync(b->entry_uprobe, &b->entry_uc);
>> + uprobe_unregister_nosync(b->stop_uprobe, &b->stop_uc);
>> + list_del(&b->list);
>> + uprobe_unregister_sync();
>> + path_put(&b->path);
>> + kfree(b);
>> + break;
>> + }
>> +
>> + path_put(&remove_path);
>> +}
>> +
>> +/* Unregister all uprobe bindings; called from disable_tlob(). */
>> +static void tlob_remove_all_uprobes(void)
>> +{
>> + struct tlob_uprobe_binding *b, *tmp;
>> +
>> + mutex_lock(&tlob_uprobe_mutex);
>> + list_for_each_entry_safe(b, tmp, &tlob_uprobe_list, list) {
>> + uprobe_unregister_nosync(b->entry_uprobe, &b->entry_uc);
>> + uprobe_unregister_nosync(b->stop_uprobe, &b->stop_uc);
>> + list_del(&b->list);
>> + path_put(&b->path);
>> + kfree(b);
>> + }
>> + mutex_unlock(&tlob_uprobe_mutex);
>> + uprobe_unregister_sync();
>> +}
>> +
>> +/*
>> + * tracefs "monitor" file
>> + *
>> + * Read: one "threshold_us:0xoffset_start:0xoffset_stop:binary_path\n"
>> + * line per registered uprobe binding.
>> + * Write: "threshold_us:offset_start:offset_stop:binary_path" - add uprobe binding
>> + * "-offset_start:binary_path" - remove uprobe binding
>> + */
>> +
>> +static ssize_t tlob_monitor_read(struct file *file,
>> + char __user *ubuf,
>> + size_t count, loff_t *ppos)
>> +{
>> + /* pid(10) + threshold(20) + 2 offsets(2*18) + path(256) + delimiters */
>> + const int line_sz = TLOB_MAX_PATH + 72;
>> + struct tlob_uprobe_binding *b;
>> + char *buf, *p;
>> + int n = 0, buf_sz, pos = 0;
>> + ssize_t ret;
>> +
>> + mutex_lock(&tlob_uprobe_mutex);
>> + list_for_each_entry(b, &tlob_uprobe_list, list)
>> + n++;
>> + mutex_unlock(&tlob_uprobe_mutex);
>> +
>> + buf_sz = (n ? n : 1) * line_sz + 1;
>> + buf = kmalloc(buf_sz, GFP_KERNEL);
>> + if (!buf)
>> + return -ENOMEM;
>> +
>> + mutex_lock(&tlob_uprobe_mutex);
>> + list_for_each_entry(b, &tlob_uprobe_list, list) {
>> + p = b->binpath;
>> + pos += scnprintf(buf + pos, buf_sz - pos,
>> + "%llu:0x%llx:0x%llx:%s\n",
>> + b->threshold_us,
>> + (unsigned long long)b->offset_start,
>> + (unsigned long long)b->offset_stop,
>> + p);
>> + }
>> + mutex_unlock(&tlob_uprobe_mutex);
>> +
>> + ret = simple_read_from_buffer(ubuf, count, ppos, buf, pos);
>> + kfree(buf);
>> + return ret;
>> +}
>> +
>> +/*
>> + * Parse "threshold_us:offset_start:offset_stop:binary_path".
>> + * binary_path comes last so it may freely contain ':'.
>> + * Returns 0 on success.
>> + */
>> +VISIBLE_IF_KUNIT int tlob_parse_uprobe_line(char *buf, u64 *thr_out,
>> + char **path_out,
>> + loff_t *start_out, loff_t *stop_out)
>> +{
>> + unsigned long long thr;
>> + long long start, stop;
>> + int n = 0;
>> +
>> + /*
>> + * %llu : decimal-only (microseconds)
>> + * %lli : auto-base, accepts 0x-prefixed hex for offsets
>> + * %n : records the byte offset of the first path character
>> + */
>> + if (sscanf(buf, "%llu:%lli:%lli:%n", &thr, &start, &stop, &n) != 3)
>> + return -EINVAL;
>> + if (thr == 0 || n == 0 || buf[n] == '\0')
>> + return -EINVAL;
>> + if (start < 0 || stop < 0)
>> + return -EINVAL;
>> +
>> + *thr_out = thr;
>> + *start_out = start;
>> + *stop_out = stop;
>> + *path_out = buf + n;
>> + return 0;
>> +}
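
As an illustration of the parsing approach described in the comment above, here is a small userspace sketch built on the same format string. It assumes the C library's sscanf() behaves like the kernel's for this input (in particular that %n does not count toward the return value, which ISO C guarantees); demo_parse_line is a hypothetical name, not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical userspace mirror of the quoted parser:
 *   "threshold_us:offset_start:offset_stop:binary_path"
 * Returns 0 on success, -1 on malformed input.
 */
static int demo_parse_line(const char *buf, unsigned long long *thr,
			   long long *start, long long *stop,
			   const char **path)
{
	int n = 0;

	assert(buf != NULL);

	/*
	 * %llu reads the decimal threshold, %lli auto-detects the base so
	 * 0x-prefixed offsets work, and %n records where the path begins.
	 * If the third ':' is missing, %n is never reached and n stays 0.
	 */
	if (sscanf(buf, "%llu:%lli:%lli:%n", thr, start, stop, &n) != 3)
		return -1;
	if (*thr == 0 || n == 0 || buf[n] == '\0')
		return -1;
	if (*start < 0 || *stop < 0)
		return -1;

	*path = buf + n;
	return 0;
}
```

Because the path is taken as "everything after the third colon", a path such as /opt/a:b/bin survives intact, which is the reason given for placing it last.
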
>> +
>> +static ssize_t tlob_monitor_write(struct file *file,
>> + const char __user *ubuf,
>> + size_t count, loff_t *ppos)
>> +{
>> + char buf[TLOB_MAX_PATH + 64];
>> + loff_t offset_start, offset_stop;
>> + u64 threshold_us;
>> + char *binpath;
>> + int ret;
>> +
>> + if (count >= sizeof(buf))
>> + return -EINVAL;
>> + if (copy_from_user(buf, ubuf, count))
>> + return -EFAULT;
>> + buf[count] = '\0';
>> +
>> + if (count > 0 && buf[count - 1] == '\n')
>> + buf[count - 1] = '\0';
>> +
>> + /* Remove request: "-offset_start:binary_path" */
>> + if (buf[0] == '-') {
>> + long long off;
>> + int n = 0;
>> +
>> + if (sscanf(buf + 1, "%lli:%n", &off, &n) != 1 || n == 0)
>> + return -EINVAL;
>> + binpath = buf + 1 + n;
>> + if (binpath[0] != '/')
>> + return -EINVAL;
>> +
>> + mutex_lock(&tlob_uprobe_mutex);
>> + tlob_remove_uprobe_by_key((loff_t)off, binpath);
>> + mutex_unlock(&tlob_uprobe_mutex);
>> +
>> + return (ssize_t)count;
>> + }
>> +
>> + /*
>> + * Uprobe binding: "threshold_us:offset_start:offset_stop:binary_path"
>> + * binpath points into buf at the start of the path field.
>> + */
>> + ret = tlob_parse_uprobe_line(buf, &threshold_us,
>> + &binpath, &offset_start, &offset_stop);
>> + if (ret)
>> + return ret;
>> +
>> + mutex_lock(&tlob_uprobe_mutex);
>> + ret = tlob_add_uprobe(threshold_us, binpath, offset_start, offset_stop);
>> + mutex_unlock(&tlob_uprobe_mutex);
>> + return ret ? ret : (ssize_t)count;
>> +}
>> +
>> +static const struct file_operations tlob_monitor_fops = {
>> + .open = simple_open,
>> + .read = tlob_monitor_read,
>> + .write = tlob_monitor_write,
>> + .llseek = noop_llseek,
>> +};
>> +
>> +/*
>> + * __tlob_init_monitor / __tlob_destroy_monitor - called with rv_interface_lock
>> + * held (required by da_monitor_init/destroy via rv_get/put_task_monitor_slot).
>> + */
>> +static int __tlob_init_monitor(void)
>> +{
>> + int i, retval;
>> +
>> + tlob_state_cache = kmem_cache_create("tlob_task_state",
>> + sizeof(struct tlob_task_state),
>> + 0, 0, NULL);
>> + if (!tlob_state_cache)
>> + return -ENOMEM;
>> +
>> + for (i = 0; i < TLOB_HTABLE_SIZE; i++)
>> + INIT_HLIST_HEAD(&tlob_htable[i]);
>> + atomic_set(&tlob_num_monitored, 0);
>> +
>> + retval = da_monitor_init();
>> + if (retval) {
>> + kmem_cache_destroy(tlob_state_cache);
>> + tlob_state_cache = NULL;
>> + return retval;
>> + }
>> +
>> + rv_this.enabled = 1;
>> + return 0;
>> +}
>> +
>> +static void __tlob_destroy_monitor(void)
>> +{
>> + rv_this.enabled = 0;
>> + tlob_stop_all();
>> + tlob_remove_all_uprobes();
>> + /*
>> + * Drain pending call_rcu() callbacks from tlob_stop_all() before
>> + * destroying the kmem_cache.
>> + */
>> + synchronize_rcu();
>> + da_monitor_destroy();
>> + kmem_cache_destroy(tlob_state_cache);
>> + tlob_state_cache = NULL;
>> +}
>> +
>> +/*
>> + * tlob_init_monitor / tlob_destroy_monitor - KUnit wrappers that acquire
>> + * rv_interface_lock, satisfying the lockdep_assert_held() inside
>> + * rv_get/put_task_monitor_slot().
>> + */
>> +VISIBLE_IF_KUNIT int tlob_init_monitor(void)
>> +{
>> + int ret;
>> +
>> + mutex_lock(&rv_interface_lock);
>> + ret = __tlob_init_monitor();
>> + mutex_unlock(&rv_interface_lock);
>> + return ret;
>> +}
>> +EXPORT_SYMBOL_IF_KUNIT(tlob_init_monitor);
>> +
>> +VISIBLE_IF_KUNIT void tlob_destroy_monitor(void)
>> +{
>> + mutex_lock(&rv_interface_lock);
>> + __tlob_destroy_monitor();
>> + mutex_unlock(&rv_interface_lock);
>> +}
>> +EXPORT_SYMBOL_IF_KUNIT(tlob_destroy_monitor);
>> +
>> +VISIBLE_IF_KUNIT int tlob_enable_hooks(void)
>> +{
>> + rv_attach_trace_probe("tlob", sched_switch, handle_sched_switch);
>> + rv_attach_trace_probe("tlob", sched_wakeup, handle_sched_wakeup);
>> + return 0;
>> +}
>> +EXPORT_SYMBOL_IF_KUNIT(tlob_enable_hooks);
>> +
>> +VISIBLE_IF_KUNIT void tlob_disable_hooks(void)
>> +{
>> + rv_detach_trace_probe("tlob", sched_switch, handle_sched_switch);
>> + rv_detach_trace_probe("tlob", sched_wakeup, handle_sched_wakeup);
>> +}
>> +EXPORT_SYMBOL_IF_KUNIT(tlob_disable_hooks);
>> +
>> +/*
>> + * enable_tlob / disable_tlob - called by rv_enable/disable_monitor() which
>> + * already holds rv_interface_lock; call the __ variants directly.
>> + */
>> +static int enable_tlob(void)
>> +{
>> + int retval;
>> +
>> + retval = __tlob_init_monitor();
>> + if (retval)
>> + return retval;
>> +
>> + return tlob_enable_hooks();
>> +}
>> +
>> +static void disable_tlob(void)
>> +{
>> + tlob_disable_hooks();
>> + __tlob_destroy_monitor();
>> +}
>> +
>> +static struct rv_monitor rv_this = {
>> + .name = "tlob",
>> + .description = "Per-task latency-over-budget monitor.",
>> + .enable = enable_tlob,
>> + .disable = disable_tlob,
>> + .reset = da_monitor_reset_all,
>> + .enabled = 0,
>> +};
>> +
>> +static int __init register_tlob(void)
>> +{
>> + int ret;
>> +
>> + ret = rv_register_monitor(&rv_this, NULL);
>> + if (ret)
>> + return ret;
>> +
>> + if (rv_this.root_d) {
>> + tracefs_create_file("monitor", 0644, rv_this.root_d, NULL,
>> + &tlob_monitor_fops);
>> + }
>> +
>> + return 0;
>> +}
>> +
>> +static void __exit unregister_tlob(void)
>> +{
>> + rv_unregister_monitor(&rv_this);
>> +}
>> +
>> +module_init(register_tlob);
>> +module_exit(unregister_tlob);
>> +
>> +MODULE_LICENSE("GPL");
>> +MODULE_AUTHOR("Wen Yang <wen.yang@linux.dev>");
>> +MODULE_DESCRIPTION("tlob: task latency over budget per-task monitor.");
>> diff --git a/kernel/trace/rv/monitors/tlob/tlob.h b/kernel/trace/rv/monitors/tlob/tlob.h
>> new file mode 100644
>> index 000000000..3438a6175
>> --- /dev/null
>> +++ b/kernel/trace/rv/monitors/tlob/tlob.h
>> @@ -0,0 +1,145 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +#ifndef _RV_TLOB_H
>> +#define _RV_TLOB_H
>> +
>> +/*
>> + * C representation of the tlob automaton, generated from tlob.dot via rvgen
>> + * and extended with tlob_start_task()/tlob_stop_task() declarations.
>> + * For the format description see Documentation/trace/rv/deterministic_automata.rst
>> + */
>> +
>> +#include <linux/rv.h>
>> +#include <uapi/linux/rv.h>
>> +
>> +#define MONITOR_NAME tlob
>> +
>> +enum states_tlob {
>> + unmonitored_tlob,
>> + on_cpu_tlob,
>> + off_cpu_tlob,
>> + state_max_tlob,
>> +};
>> +
>> +#define INVALID_STATE state_max_tlob
>> +
>> +enum events_tlob {
>> + trace_start_tlob,
>> + switch_in_tlob,
>> + switch_out_tlob,
>> + sched_wakeup_tlob,
>> + trace_stop_tlob,
>> + budget_expired_tlob,
>> + event_max_tlob,
>> +};
>> +
>> +struct automaton_tlob {
>> + char *state_names[state_max_tlob];
>> + char *event_names[event_max_tlob];
>> + unsigned char function[state_max_tlob][event_max_tlob];
>> + unsigned char initial_state;
>> + bool final_states[state_max_tlob];
>> +};
>> +
>> +static const struct automaton_tlob automaton_tlob = {
>> + .state_names = {
>> + "unmonitored",
>> + "on_cpu",
>> + "off_cpu",
>> + },
>> + .event_names = {
>> + "trace_start",
>> + "switch_in",
>> + "switch_out",
>> + "sched_wakeup",
>> + "trace_stop",
>> + "budget_expired",
>> + },
>> + .function = {
>> + /* unmonitored */
>> + {
>> + on_cpu_tlob, /* trace_start */
>> + unmonitored_tlob, /* switch_in */
>> + unmonitored_tlob, /* switch_out */
>> + unmonitored_tlob, /* sched_wakeup */
>> + INVALID_STATE, /* trace_stop */
>> + INVALID_STATE, /* budget_expired */
>> + },
>> + /* on_cpu */
>> + {
>> + INVALID_STATE, /* trace_start */
>> + INVALID_STATE, /* switch_in */
>> + off_cpu_tlob, /* switch_out */
>> + on_cpu_tlob, /* sched_wakeup */
>> + unmonitored_tlob, /* trace_stop */
>> + unmonitored_tlob, /* budget_expired */
>> + },
>> + /* off_cpu */
>> + {
>> + INVALID_STATE, /* trace_start */
>> + on_cpu_tlob, /* switch_in */
>> + off_cpu_tlob, /* switch_out */
>> + off_cpu_tlob, /* sched_wakeup */
>> + unmonitored_tlob, /* trace_stop */
>> + unmonitored_tlob, /* budget_expired */
>> + },
>> + },
>> + /*
>> + * final_states: unmonitored is the sole accepting state.
>> + * Violations are recorded via ntf_push and tlob_budget_exceeded.
>> + */
>> + .initial_state = unmonitored_tlob,
>> + .final_states = { 1, 0, 0 },
>> +};
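
For readers following the transition table above, a standalone sketch of how such a table is walked may help. The table contents are copied from the quoted .function initialiser; the DEMO_* enums and demo_run() are hypothetical names for illustration and are not how the RV framework's da_handle_event() is implemented.

```c
#include <assert.h>
#include <stddef.h>

enum demo_da_state { DEMO_UNMONITORED, DEMO_ON_CPU, DEMO_OFF_CPU, DEMO_STATE_MAX };

enum demo_da_event {
	DEMO_TRACE_START,
	DEMO_SWITCH_IN,
	DEMO_SWITCH_OUT,
	DEMO_SCHED_WAKEUP,
	DEMO_TRACE_STOP,
	DEMO_BUDGET_EXPIRED,
	DEMO_EVENT_MAX,
};

#define DEMO_INVALID DEMO_STATE_MAX

/* Rows are current states, columns are events, cells are next states. */
static const unsigned char demo_function[DEMO_STATE_MAX][DEMO_EVENT_MAX] = {
	[DEMO_UNMONITORED] = { DEMO_ON_CPU, DEMO_UNMONITORED, DEMO_UNMONITORED,
			       DEMO_UNMONITORED, DEMO_INVALID, DEMO_INVALID },
	[DEMO_ON_CPU]      = { DEMO_INVALID, DEMO_INVALID, DEMO_OFF_CPU,
			       DEMO_ON_CPU, DEMO_UNMONITORED, DEMO_UNMONITORED },
	[DEMO_OFF_CPU]     = { DEMO_INVALID, DEMO_ON_CPU, DEMO_OFF_CPU,
			       DEMO_OFF_CPU, DEMO_UNMONITORED, DEMO_UNMONITORED },
};

/*
 * Feed @n events starting from @s. Returns the final state, or
 * DEMO_INVALID as soon as the table rejects a transition.
 */
static enum demo_da_state demo_run(enum demo_da_state s,
				   const enum demo_da_event *ev, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		assert(s < DEMO_STATE_MAX && ev[i] < DEMO_EVENT_MAX);
		s = demo_function[s][ev[i]];
		if (s == DEMO_INVALID)
			return DEMO_INVALID;
	}
	return s;
}
```

A start, switch-out, wakeup, switch-in, stop sequence ends back in the unmonitored (accepting) state, while a second trace_start on an already monitored task is rejected.
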
>> +
>> +/* Exported for use by the RV ioctl layer (rv_dev.c) */
>> +int tlob_start_task(struct task_struct *task, u64 threshold_us,
>> + struct file *notify_file, u64 tag);
>> +int tlob_stop_task(struct task_struct *task);
>> +
>> +/* Maximum number of concurrently monitored tasks (also used by KUnit). */
>> +#define TLOB_MAX_MONITORED 64U
>> +
>> +/*
>> + * Ring buffer constants (also published in UAPI for mmap size calculation).
>> + */
>> +#define TLOB_RING_DEFAULT_CAP 64U /* records allocated at open() */
>> +#define TLOB_RING_MIN_CAP 8U /* minimum accepted by mmap() */
>> +#define TLOB_RING_MAX_CAP 4096U /* maximum accepted by mmap() */
>> +
>> +/**
>> + * struct tlob_ring - per-fd mmap-capable violation ring buffer.
>> + *
>> + * Allocated as a contiguous page range at rv_open() time:
>> + * page 0: struct tlob_mmap_page (shared with userspace)
>> + * pages 1-N: struct tlob_event[capacity]
>> + */
>> +struct tlob_ring {
>> + struct tlob_mmap_page *page;
>> + struct tlob_event *data;
>> + u32 mask;
>> + spinlock_t lock;
>> + unsigned long base;
>> + unsigned int order;
>> +};
>> +
>> +/**
>> + * struct rv_file_priv - per-fd private data for /dev/rv.
>> + */
>> +struct rv_file_priv {
>> + struct tlob_ring ring;
>> + wait_queue_head_t waitq;
>> +};
>> +
>> +#if IS_ENABLED(CONFIG_KUNIT)
>> +int tlob_init_monitor(void);
>> +void tlob_destroy_monitor(void);
>> +int tlob_enable_hooks(void);
>> +void tlob_disable_hooks(void);
>> +void tlob_event_push_kunit(struct rv_file_priv *priv,
>> + const struct tlob_event *info);
>> +int tlob_parse_uprobe_line(char *buf, u64 *thr_out,
>> + char **path_out,
>> + loff_t *start_out, loff_t *stop_out);
>> +#endif /* CONFIG_KUNIT */
>> +
>> +#endif /* _RV_TLOB_H */
>> diff --git a/kernel/trace/rv/monitors/tlob/tlob_trace.h b/kernel/trace/rv/monitors/tlob/tlob_trace.h
>> new file mode 100644
>> index 000000000..b08d67776
>> --- /dev/null
>> +++ b/kernel/trace/rv/monitors/tlob/tlob_trace.h
>> @@ -0,0 +1,42 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +
>> +/*
>> + * Snippet to be included in rv_trace.h
>> + */
>> +
>> +#ifdef CONFIG_RV_MON_TLOB
>> +/*
>> + * tlob uses the generic event_da_monitor_id and error_da_monitor_id event
>> + * classes so that both event classes are instantiated. This avoids a
>> + * -Werror=unused-variable warning that the compiler emits when a
>> + * DECLARE_EVENT_CLASS has no corresponding DEFINE_EVENT instance.
>> + *
>> + * The event_tlob tracepoint is defined here but the call-site in
>> + * da_handle_event() is overridden with a no-op macro below so that no
>> + * trace record is emitted on every scheduler context switch. Budget
>> + * violations are reported via the dedicated tlob_budget_exceeded event.
>> + *
>> + * error_tlob IS kept active so that invalid DA transitions (programming
>> + * errors) are still visible in the ftrace ring buffer for debugging.
>> + */
>> +DEFINE_EVENT(event_da_monitor_id, event_tlob,
>> + TP_PROTO(int id, char *state, char *event, char *next_state,
>> + bool final_state),
>> + TP_ARGS(id, state, event, next_state, final_state));
>> +
>> +DEFINE_EVENT(error_da_monitor_id, error_tlob,
>> + TP_PROTO(int id, char *state, char *event),
>> + TP_ARGS(id, state, event));
>> +
>> +/*
>> + * Override the trace_event_tlob() call-site with a no-op after the
>> + * DEFINE_EVENT above has satisfied the event class instantiation
>> + * requirement. The tracepoint symbol itself exists (and can be enabled
>> + * via tracefs) but the automatic call from da_handle_event() is silenced
>> + * to avoid per-context-switch ftrace noise during normal operation.
>> + */
>> +#undef trace_event_tlob
>> +#define trace_event_tlob(id, state, event, next_state, final_state) \
>> + do { (void)(id); (void)(state); (void)(event); \
>> + (void)(next_state); (void)(final_state); } while (0)
>> +#endif /* CONFIG_RV_MON_TLOB */
>> diff --git a/kernel/trace/rv/rv.c b/kernel/trace/rv/rv.c
>> index ee4e68102..e754e76d5 100644
>> --- a/kernel/trace/rv/rv.c
>> +++ b/kernel/trace/rv/rv.c
>> @@ -148,6 +148,10 @@
>> #include <rv_trace.h>
>> #endif
>>
>> +#ifdef CONFIG_RV_MON_TLOB
>> +EXPORT_TRACEPOINT_SYMBOL_GPL(tlob_budget_exceeded);
>> +#endif
>> +
>> #include "rv.h"
>>
>> DEFINE_MUTEX(rv_interface_lock);
>> diff --git a/kernel/trace/rv/rv_dev.c b/kernel/trace/rv/rv_dev.c
>> new file mode 100644
>> index 000000000..a052f3203
>> --- /dev/null
>> +++ b/kernel/trace/rv/rv_dev.c
>> @@ -0,0 +1,602 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * rv_dev.c - /dev/rv misc device for RV monitor self-instrumentation
>> + *
>> + * A single misc device (MISC_DYNAMIC_MINOR) serves all RV monitors.
>> + * ioctl numbers encode the monitor identity:
>> + *
>> + * 0x01 - 0x1F tlob (task latency over budget)
>> + * 0x20 - 0x3F reserved
>> + *
>> + * Each monitor exports tlob_start_task() / tlob_stop_task() which are
>> + * called here. The calling task is identified by current.
>> + *
>> + * Magic: RV_IOC_MAGIC (0xB9), defined in include/uapi/linux/rv.h
>> + *
>> + * Per-fd private data (rv_file_priv)
>> + * ------------------------------------
>> + * Every open() of /dev/rv allocates an rv_file_priv (defined in tlob.h).
>> + * When TLOB_IOCTL_TRACE_START is called with args.notify_fd >= 0, violations
>> + * are pushed as tlob_event records into that fd's per-fd ring buffer (tlob_ring)
>> + * and its poll/epoll waitqueue is woken.
>> + *
>> + * Consumers drain records with read() on the notify_fd; read() blocks until
>> + * at least one record is available (unless O_NONBLOCK is set).
>> + *
>> + * Per-thread "started" tracking (tlob_task_handle)
>> + * -------------------------------------------------
>> + * tlob_stop_task() returns -ESRCH in two distinct situations:
>> + *
>> + * (a) The deadline timer already fired and removed the tlob hash-table
>> + * entry before TRACE_STOP arrived -> budget was exceeded -> -EOVERFLOW
>> + *
>> + * (b) TRACE_START was never called for this thread -> programming error
>> + * -> -ESRCH
>> + *
>> + * To distinguish them, rv_dev.c maintains a lightweight hash table
>> + * (tlob_handles) that records a tlob_task_handle for every task_struct *
>> + * for which a successful TLOB_IOCTL_TRACE_START has been
>> + * issued but the corresponding TLOB_IOCTL_TRACE_STOP has not yet arrived.
>> + *
>> + * tlob_task_handle is a thin "session ticket" -- it carries only the
>> + * task pointer and the owning file descriptor. The heavy per-task state
>> + * (hrtimer, DA state, threshold) lives in tlob_task_state inside tlob.c.
>> + *
>> + * The table is keyed on task_struct * (same key as tlob.c), protected
>> + * by tlob_handles_lock (spinlock, irq-safe). No get_task_struct()
>> + * refcount is needed here because tlob.c already holds a reference for
>> + * each live entry.
>> + *
>> + * Multiple threads may share the same fd. Each thread has its own
>> + * tlob_task_handle in the table, so concurrent TRACE_START / TRACE_STOP
>> + * calls from different threads do not interfere.
>> + *
>> + * The fd release path (rv_release) calls tlob_stop_task() for every
>> + * handle in tlob_handles that belongs to the closing fd, ensuring cleanup
>> + * even if the user forgets to call TRACE_STOP.
>> + */
>> +#include <linux/file.h>
>> +#include <linux/fs.h>
>> +#include <linux/gfp.h>
>> +#include <linux/hash.h>
>> +#include <linux/mm.h>
>> +#include <linux/miscdevice.h>
>> +#include <linux/module.h>
>> +#include <linux/poll.h>
>> +#include <linux/sched.h>
>> +#include <linux/slab.h>
>> +#include <linux/spinlock.h>
>> +#include <linux/uaccess.h>
>> +#include <uapi/linux/rv.h>
>> +
>> +#ifdef CONFIG_RV_MON_TLOB
>> +#include "monitors/tlob/tlob.h"
>> +#endif
>> +
>> +/* -----------------------------------------------------------------------
>> + * tlob_task_handle - per-thread session ticket for the ioctl interface
>> + *
>> + * One handle is allocated by TLOB_IOCTL_TRACE_START and freed by
>> + * TLOB_IOCTL_TRACE_STOP (or by rv_release if the fd is closed).
>> + *
>> + * @hlist: Hash-table linkage in tlob_handles (keyed on task pointer).
>> + * @task: The monitored thread. Plain pointer; no refcount held here
>> + * because tlob.c holds one for the lifetime of the monitoring
>> + * window, which encompasses the lifetime of this handle.
>> + * @file: The /dev/rv file descriptor that issued TRACE_START.
>> + * Used by rv_release() to sweep orphaned handles on close().
>> + * -----------------------------------------------------------------------
>> + */
>> +#define TLOB_HANDLES_BITS 5
>> +#define TLOB_HANDLES_SIZE (1 << TLOB_HANDLES_BITS)
>> +
>> +struct tlob_task_handle {
>> + struct hlist_node hlist;
>> + struct task_struct *task;
>> + struct file *file;
>> +};
>> +
>> +static struct hlist_head tlob_handles[TLOB_HANDLES_SIZE];
>> +static DEFINE_SPINLOCK(tlob_handles_lock);
>> +
>> +static unsigned int tlob_handle_hash(const struct task_struct *task)
>> +{
>> + return hash_ptr((void *)task, TLOB_HANDLES_BITS);
>> +}
>> +
>> +/* Must be called with tlob_handles_lock held. */
>> +static struct tlob_task_handle *
>> +tlob_handle_find_locked(struct task_struct *task)
>> +{
>> + struct tlob_task_handle *h;
>> + unsigned int slot = tlob_handle_hash(task);
>> +
>> + hlist_for_each_entry(h, &tlob_handles[slot], hlist) {
>> + if (h->task == task)
>> + return h;
>> + }
>> + return NULL;
>> +}
>> +
>> +/*
>> + * tlob_handle_alloc - record that @task has an active monitoring session
>> + * opened via @file.
>> + *
>> + * Returns 0 on success, -EEXIST if @task already has a handle (double
>> + * TRACE_START without TRACE_STOP), -ENOMEM on allocation failure.
>> + */
>> +static int tlob_handle_alloc(struct task_struct *task, struct file *file)
>> +{
>> + struct tlob_task_handle *h;
>> + unsigned long flags;
>> + unsigned int slot;
>> +
>> + h = kmalloc(sizeof(*h), GFP_KERNEL);
>> + if (!h)
>> + return -ENOMEM;
>> + h->task = task;
>> + h->file = file;
>> +
>> + spin_lock_irqsave(&tlob_handles_lock, flags);
>> + if (tlob_handle_find_locked(task)) {
>> + spin_unlock_irqrestore(&tlob_handles_lock, flags);
>> + kfree(h);
>> + return -EEXIST;
>> + }
>> + slot = tlob_handle_hash(task);
>> + hlist_add_head(&h->hlist, &tlob_handles[slot]);
>> + spin_unlock_irqrestore(&tlob_handles_lock, flags);
>> + return 0;
>> +}
>> +
>> +/*
>> + * tlob_handle_free - remove the handle for @task and free it.
>> + *
>> + * Returns 1 if a handle existed (TRACE_START was called), 0 if not found
>> + * (TRACE_START was never called for this thread).
>> + */
>> +static int tlob_handle_free(struct task_struct *task)
>> +{
>> + struct tlob_task_handle *h;
>> + unsigned long flags;
>> +
>> + spin_lock_irqsave(&tlob_handles_lock, flags);
>> + h = tlob_handle_find_locked(task);
>> + if (h) {
>> + hlist_del_init(&h->hlist);
>> + spin_unlock_irqrestore(&tlob_handles_lock, flags);
>> + kfree(h);
>> + return 1;
>> + }
>> + spin_unlock_irqrestore(&tlob_handles_lock, flags);
>> + return 0;
>> +}
>> +
>> +/*
>> + * tlob_handle_sweep_file - release all handles owned by @file.
>> + *
>> + * Called from rv_release() when the fd is closed without TRACE_STOP.
>> + * Calls tlob_stop_task() for each orphaned handle to drain the tlob
>> + * monitoring entries and prevent resource leaks in tlob.c.
>> + *
>> + * Handles are collected under the lock (short critical section), then
>> + * processed outside it (tlob_stop_task() may sleep/spin internally).
>> + */
>> +#ifdef CONFIG_RV_MON_TLOB
>> +static void tlob_handle_sweep_file(struct file *file)
>> +{
>> + struct tlob_task_handle *batch[TLOB_HANDLES_SIZE];
>> + struct tlob_task_handle *h;
>> + struct hlist_node *tmp;
>> + unsigned long flags;
>> + int i, n = 0;
>> +
>> + spin_lock_irqsave(&tlob_handles_lock, flags);
>> + for (i = 0; i < TLOB_HANDLES_SIZE; i++) {
>> + hlist_for_each_entry_safe(h, tmp, &tlob_handles[i], hlist) {
>> + if (h->file == file) {
>> + hlist_del_init(&h->hlist);
>> + batch[n++] = h;
>> + }
>> + }
>> + }
>> + spin_unlock_irqrestore(&tlob_handles_lock, flags);
>> +
>> + for (i = 0; i < n; i++) {
>> + /*
>> + * Ignore -ESRCH: the deadline timer may have already fired
>> + * and cleaned up the tlob entry.
>> + */
>> + tlob_stop_task(batch[i]->task);
>> + kfree(batch[i]);
>> + }
>> +}
>> +#else
>> +static inline void tlob_handle_sweep_file(struct file *file) {}
>> +#endif /* CONFIG_RV_MON_TLOB */
>> +
>> +/* -----------------------------------------------------------------------
>> + * Ring buffer lifecycle
>> + * -----------------------------------------------------------------------
>> + */
>> +
>> +/*
>> + * tlob_ring_alloc - allocate a ring of @cap records (must be a power of 2).
>> + *
>> + * Allocates a physically contiguous block of pages:
>> + * page 0 : struct tlob_mmap_page (control page, shared with userspace)
>> + * pages 1..N : struct tlob_event[cap] (data pages)
>> + *
>> + * Each page is marked reserved so it can be mapped to userspace via mmap().
>> + */
>> +static int tlob_ring_alloc(struct tlob_ring *ring, u32 cap)
>> +{
>> + unsigned int total = PAGE_SIZE + cap * sizeof(struct tlob_event);
>> + unsigned int order = get_order(total);
>> + unsigned long base;
>> + unsigned int i;
>> +
>> + base = __get_free_pages(GFP_KERNEL | __GFP_ZERO, order);
>> + if (!base)
>> + return -ENOMEM;
>> +
>> + for (i = 0; i < (1u << order); i++)
>> + SetPageReserved(virt_to_page((void *)(base + i * PAGE_SIZE)));
>> +
>> + ring->base = base;
>> + ring->order = order;
>> + ring->page = (struct tlob_mmap_page *)base;
>> + ring->data = (struct tlob_event *)(base + PAGE_SIZE);
>> + ring->mask = cap - 1;
>> + spin_lock_init(&ring->lock);
>> +
>> + ring->page->capacity = cap;
>> + ring->page->version = 1;
>> + ring->page->data_offset = PAGE_SIZE;
>> + ring->page->record_size = sizeof(struct tlob_event);
>> + return 0;
>> +}
>> +
>> +static void tlob_ring_free(struct tlob_ring *ring)
>> +{
>> + unsigned int i;
>> +
>> + if (!ring->base)
>> + return;
>> +
>> + for (i = 0; i < (1u << ring->order); i++)
>> + ClearPageReserved(virt_to_page((void *)(ring->base + i * PAGE_SIZE)));
>> +
>> + free_pages(ring->base, ring->order);
>> + ring->base = 0;
>> + ring->page = NULL;
>> + ring->data = NULL;
>> +}
>> +
>> +/* -----------------------------------------------------------------------
>> + * File operations
>> + * -----------------------------------------------------------------------
>> + */
>> +
>> +static int rv_open(struct inode *inode, struct file *file)
>> +{
>> + struct rv_file_priv *priv;
>> + int ret;
>> +
>> + priv = kzalloc(sizeof(*priv), GFP_KERNEL);
>> + if (!priv)
>> + return -ENOMEM;
>> +
>> + ret = tlob_ring_alloc(&priv->ring, TLOB_RING_DEFAULT_CAP);
>> + if (ret) {
>> + kfree(priv);
>> + return ret;
>> + }
>> +
>> + init_waitqueue_head(&priv->waitq);
>> + file->private_data = priv;
>> + return 0;
>> +}
>> +
>> +static int rv_release(struct inode *inode, struct file *file)
>> +{
>> + struct rv_file_priv *priv = file->private_data;
>> +
>> + tlob_handle_sweep_file(file);
>> + tlob_ring_free(&priv->ring);
>> + kfree(priv);
>> + file->private_data = NULL;
>> + return 0;
>> +}
>> +
>> +static __poll_t rv_poll(struct file *file, poll_table *wait)
>> +{
>> + struct rv_file_priv *priv = file->private_data;
>> +
>> + if (!priv)
>> + return EPOLLERR;
>> +
>> + poll_wait(file, &priv->waitq, wait);
>> +
>> + /*
>> + * Pairs with smp_store_release(&ring->page->data_head, ...) in
>> + * tlob_event_push(). No lock needed: head is written by the kernel
>> + * producer and read here; tail is written by the consumer and we only
>> + * need an approximate check for the poll fast path.
>> + */
>> + if (smp_load_acquire(&priv->ring.page->data_head) !=
>> + READ_ONCE(priv->ring.page->data_tail))
>> + return EPOLLIN | EPOLLRDNORM;
>> +
>> + return 0;
>> +}
>> +
>> +/*
>> + * rv_read - consume tlob_event violation records from this fd's ring buffer.
>> + *
>> + * Each read() returns a whole number of struct tlob_event records. @count must
>> + * be at least sizeof(struct tlob_event); partial-record sizes are rejected with
>> + * -EINVAL.
>> + *
>> + * Blocking behaviour follows O_NONBLOCK on the fd:
>> + * O_NONBLOCK clear: blocks until at least one record is available.
>> + * O_NONBLOCK set: returns -EAGAIN immediately if the ring is empty.
>> + *
>> + * Returns the number of bytes copied (always a multiple of sizeof tlob_event),
>> + * -EAGAIN if non-blocking and empty, or a negative error code.
>> + *
>> + * read() and mmap() share the same ring and data_tail cursor; do not use
>> + * both simultaneously on the same fd.
>> + */
>> +static ssize_t rv_read(struct file *file, char __user *buf, size_t count,
>> + loff_t *ppos)
>> +{
>> + struct rv_file_priv *priv = file->private_data;
>> + struct tlob_ring *ring;
>> + size_t rec = sizeof(struct tlob_event);
>> + unsigned long irqflags;
>> + ssize_t done = 0;
>> + int ret;
>> +
>> + if (!priv)
>> + return -ENODEV;
>> +
>> + ring = &priv->ring;
>> +
>> + if (count < rec)
>> + return -EINVAL;
>> +
>> + /* Blocking path: sleep until the producer advances data_head. */
>> + if (!(file->f_flags & O_NONBLOCK)) {
>> + ret = wait_event_interruptible(priv->waitq,
>> + /* pairs with smp_store_release() in the producer */
>> + smp_load_acquire(&ring->page->data_head) !=
>> + READ_ONCE(ring->page->data_tail));
>> + if (ret)
>> + return ret;
>> + }
>> +
>> + /*
>> + * Drain records into the caller's buffer. ring->lock serialises
>> + * concurrent read() callers and the softirq producer.
>> + */
>> + while (done + rec <= count) {
>> + struct tlob_event record;
>> + u32 head, tail;
>> +
>> + spin_lock_irqsave(&ring->lock, irqflags);
>> + /* pairs with smp_store_release() in the producer */
>> + head = smp_load_acquire(&ring->page->data_head);
>> + tail = ring->page->data_tail;
>> + if (head == tail) {
>> + spin_unlock_irqrestore(&ring->lock, irqflags);
>> + break;
>> + }
>> + record = ring->data[tail & ring->mask];
>> + WRITE_ONCE(ring->page->data_tail, tail + 1);
>> + spin_unlock_irqrestore(&ring->lock, irqflags);
>> +
>> + if (copy_to_user(buf + done, &record, rec))
>> + return done ? done : -EFAULT;
>> + done += rec;
>> + }
>> +
>> + return done ? done : -EAGAIN;
>> +}
>> +
>> +/*
>> + * rv_mmap - map the per-fd violation ring buffer into userspace.
>> + *
>> + * The mmap region covers the full ring allocation:
>> + *
>> + * offset 0 : struct tlob_mmap_page (control page)
>> + * offset PAGE_SIZE : struct tlob_event[capacity] (data pages)
>> + *
>> + * The caller must map exactly PAGE_SIZE + capacity * sizeof(struct tlob_event)
>> + * bytes starting at offset 0 (vm_pgoff must be 0). The actual capacity is
>> + * read from tlob_mmap_page.capacity after a successful mmap(2).
>> + *
>> + * Private mappings (MAP_PRIVATE) are rejected: the shared data_tail field
>> + * written by userspace must be visible to the kernel producer.
>> + */
>> +static int rv_mmap(struct file *file, struct vm_area_struct *vma)
>> +{
>> + struct rv_file_priv *priv = file->private_data;
>> + struct tlob_ring *ring;
>> + unsigned long size = vma->vm_end - vma->vm_start;
>> + unsigned long ring_size;
>> +
>> + if (!priv)
>> + return -ENODEV;
>> +
>> + ring = &priv->ring;
>> +
>> + if (vma->vm_pgoff != 0)
>> + return -EINVAL;
>> +
>> + ring_size = PAGE_ALIGN(PAGE_SIZE + ((unsigned long)(ring->mask + 1) *
>> + sizeof(struct tlob_event)));
>> + if (size != ring_size)
>> + return -EINVAL;
>> +
>> + if (!(vma->vm_flags & VM_SHARED))
>> + return -EINVAL;
>> +
>> + return remap_pfn_range(vma, vma->vm_start,
>> + page_to_pfn(virt_to_page((void *)ring->base)),
>> + ring_size, vma->vm_page_prot);
>> +}
>> +
>> +/* -----------------------------------------------------------------------
>> + * ioctl dispatcher
>> + * -----------------------------------------------------------------------
>> + */
>> +
>> +static long rv_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
>> +{
>> + unsigned int nr = _IOC_NR(cmd);
>> +
>> + /*
>> + * Verify the magic byte so we don't accidentally handle ioctls
>> + * intended for a different device.
>> + */
>> + if (_IOC_TYPE(cmd) != RV_IOC_MAGIC)
>> + return -ENOTTY;
>> +
>> +#ifdef CONFIG_RV_MON_TLOB
>> + /* tlob: ioctl numbers 0x01 - 0x1F */
>> + switch (cmd) {
>> + case TLOB_IOCTL_TRACE_START: {
>> + struct tlob_start_args args;
>> + struct file *notify_file = NULL;
>> + int ret, hret;
>> +
>> + if (copy_from_user(&args,
>> + (struct tlob_start_args __user *)arg,
>> + sizeof(args)))
>> + return -EFAULT;
>> + if (args.threshold_us == 0)
>> + return -EINVAL;
>> + if (args.flags != 0)
>> + return -EINVAL;
>> +
>> + /*
>> + * If notify_fd >= 0, resolve it to a file pointer.
>> + * fget() bumps the reference count; tlob.c drops it
>> + * via fput() when the monitoring window ends.
>> + * Reject non-/dev/rv fds to prevent type confusion.
>> + */
>> + if (args.notify_fd >= 0) {
>> + notify_file = fget(args.notify_fd);
>> + if (!notify_file)
>> + return -EBADF;
>> + if (notify_file->f_op != file->f_op) {
>> + fput(notify_file);
>> + return -EINVAL;
>> + }
>> + }
>> +
>> + ret = tlob_start_task(current, args.threshold_us,
>> + notify_file, args.tag);
>> + if (ret != 0) {
>> + /* tlob.c did not take ownership; drop ref. */
>> + if (notify_file)
>> + fput(notify_file);
>> + return ret;
>> + }
>> +
>> + /*
>> + * Record session handle. Free any stale handle left by
>> + * a previous window whose deadline timer fired (timer
>> + * removes tlob_task_state but cannot touch tlob_handles).
>> + */
>> + tlob_handle_free(current);
>> + hret = tlob_handle_alloc(current, file);
>> + if (hret < 0) {
>> + tlob_stop_task(current);
>> + return hret;
>> + }
>> + return 0;
>> + }
>> + case TLOB_IOCTL_TRACE_STOP: {
>> + int had_handle;
>> + int ret;
>> +
>> + /*
>> + * Atomically remove the session handle for current.
>> + *
>> + * had_handle == 0: TRACE_START was never called for
>> + * this thread -> caller bug -> -ESRCH
>> + *
>> + * had_handle == 1: TRACE_START was called. If
>> + * tlob_stop_task() now returns
>> + * -ESRCH, the deadline timer already
>> + * fired -> budget exceeded -> -EOVERFLOW
>> + */
>> + had_handle = tlob_handle_free(current);
>> + if (!had_handle)
>> + return -ESRCH;
>> +
>> + ret = tlob_stop_task(current);
>> + return (ret == -ESRCH) ? -EOVERFLOW : ret;
>> + }
>> + default:
>> + break;
>> + }
>> +#endif /* CONFIG_RV_MON_TLOB */
>> +
>> + return -ENOTTY;
>> +}
>> +
>> +/* -----------------------------------------------------------------------
>> + * Module init / exit
>> + * -----------------------------------------------------------------------
>> + */
>> +
>> +static const struct file_operations rv_fops = {
>> + .owner = THIS_MODULE,
>> + .open = rv_open,
>> + .release = rv_release,
>> + .read = rv_read,
>> + .poll = rv_poll,
>> + .mmap = rv_mmap,
>> + .unlocked_ioctl = rv_ioctl,
>> +#ifdef CONFIG_COMPAT
>> + .compat_ioctl = rv_ioctl,
>> +#endif
>> + .llseek = noop_llseek,
>> +};
>> +
>> +/*
>> + * 0666: /dev/rv is a self-instrumentation device. All ioctls operate
>> + * exclusively on the calling task (current); no task can monitor another
>> + * via this interface. Opening the device does not grant any privilege
>> + * beyond observing one's own latency, so world-read/write is appropriate.
>> + */
>> +static struct miscdevice rv_miscdev = {
>> + .minor = MISC_DYNAMIC_MINOR,
>> + .name = "rv",
>> + .fops = &rv_fops,
>> + .mode = 0666,
>> +};
>> +
>> +static int __init rv_ioctl_init(void)
>> +{
>> + int i;
>> +
>> + for (i = 0; i < TLOB_HANDLES_SIZE; i++)
>> + INIT_HLIST_HEAD(&tlob_handles[i]);
>> +
>> + return misc_register(&rv_miscdev);
>> +}
>> +
>> +static void __exit rv_ioctl_exit(void)
>> +{
>> + misc_deregister(&rv_miscdev);
>> +}
>> +
>> +module_init(rv_ioctl_init);
>> +module_exit(rv_ioctl_exit);
>> +
>> +MODULE_LICENSE("GPL");
>> +MODULE_DESCRIPTION("RV ioctl interface via /dev/rv");
>> diff --git a/kernel/trace/rv/rv_trace.h b/kernel/trace/rv/rv_trace.h
>> index 4a6faddac..65d6c6485 100644
>> --- a/kernel/trace/rv/rv_trace.h
>> +++ b/kernel/trace/rv/rv_trace.h
>> @@ -126,6 +126,7 @@ DECLARE_EVENT_CLASS(error_da_monitor_id,
>> #include <monitors/snroc/snroc_trace.h>
>> #include <monitors/nrp/nrp_trace.h>
>> #include <monitors/sssw/sssw_trace.h>
>> +#include <monitors/tlob/tlob_trace.h>
>> // Add new monitors based on CONFIG_DA_MON_EVENTS_ID here
>>
>> #endif /* CONFIG_DA_MON_EVENTS_ID */
>> @@ -202,6 +203,55 @@ TRACE_EVENT(rv_retries_error,
>> __get_str(event), __get_str(name))
>> );
>> #endif /* CONFIG_RV_MON_MAINTENANCE_EVENTS */
>> +
>> +#ifdef CONFIG_RV_MON_TLOB
>> +/*
>> + * tlob_budget_exceeded - emitted when a monitored task exceeds its latency
>> + * budget. Carries the on-CPU / off-CPU time breakdown so that the cause
>> + * of the overrun (CPU-bound vs. scheduling/I/O latency) is immediately
>> + * visible in the ftrace ring buffer without post-processing.
>> + */
>> +TRACE_EVENT(tlob_budget_exceeded,
>> +
>> + TP_PROTO(struct task_struct *task, u64 threshold_us,
>> + u64 on_cpu_us, u64 off_cpu_us, u32 switches,
>> + bool state_is_on_cpu, u64 tag),
>> +
>> + TP_ARGS(task, threshold_us, on_cpu_us, off_cpu_us, switches,
>> + state_is_on_cpu, tag),
>> +
>> + TP_STRUCT__entry(
>> + __string(comm, task->comm)
>> + __field(pid_t, pid)
>> + __field(u64, threshold_us)
>> + __field(u64, on_cpu_us)
>> + __field(u64, off_cpu_us)
>> + __field(u32, switches)
>> + __field(bool, state_is_on_cpu)
>> + __field(u64, tag)
>> + ),
>> +
>> + TP_fast_assign(
>> + __assign_str(comm);
>> + __entry->pid = task->pid;
>> + __entry->threshold_us = threshold_us;
>> + __entry->on_cpu_us = on_cpu_us;
>> + __entry->off_cpu_us = off_cpu_us;
>> + __entry->switches = switches;
>> + __entry->state_is_on_cpu = state_is_on_cpu;
>> + __entry->tag = tag;
>> + ),
>> +
>> + TP_printk("%s[%d]: budget exceeded threshold=%llu on_cpu=%llu off_cpu=%llu switches=%u state=%s tag=0x%016llx",
>> + __get_str(comm), __entry->pid,
>> + __entry->threshold_us,
>> + __entry->on_cpu_us, __entry->off_cpu_us,
>> + __entry->switches,
>> + __entry->state_is_on_cpu ? "on_cpu" : "off_cpu",
>> + __entry->tag)
>> +);
>> +#endif /* CONFIG_RV_MON_TLOB */
>> +
>> #endif /* _TRACE_RV_H */
>>
>> /* This part must be outside protection */
>
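The head/tail/mask indexing used by the per-fd ring in rv_dev.c above can be
modelled in a few lines of user-space C. This is only a sketch under
assumptions: the demo_* names and the capacity are made up for illustration,
the drop-when-full policy is a guess (tlob_event_push() is not shown in this
hunk), and the locking and the smp_store_release()/smp_load_acquire() pairing
of the real code are left out.

```c
#include <stdint.h>

#define DEMO_RING_CAP 4u	/* hypothetical capacity; must be a power of 2 */

struct demo_event {
	uint64_t tag;
};

struct demo_ring {
	uint32_t head;		/* free-running; advanced by the producer only */
	uint32_t tail;		/* free-running; advanced by the consumer only */
	uint32_t mask;		/* DEMO_RING_CAP - 1 */
	struct demo_event data[DEMO_RING_CAP];
};

static void demo_ring_init(struct demo_ring *r)
{
	r->head = 0;
	r->tail = 0;
	r->mask = DEMO_RING_CAP - 1;
}

/* Returns 0 on success, -1 if the ring is full (assumed policy: drop). */
static int demo_ring_push(struct demo_ring *r, uint64_t tag)
{
	/* Unsigned subtraction stays correct when the counters wrap. */
	if (r->head - r->tail > r->mask)
		return -1;
	r->data[r->head & r->mask].tag = tag;
	r->head++;
	return 0;
}

/* Returns 0 and fills *out, or -1 if the ring is empty. */
static int demo_ring_pop(struct demo_ring *r, struct demo_event *out)
{
	if (r->head == r->tail)
		return -1;
	*out = r->data[r->tail & r->mask];
	r->tail++;
	return 0;
}
```

Because head and tail are never reduced modulo the capacity, "head == tail"
means empty and "head - tail == capacity" means full, so no slot has to be
sacrificed to tell the two cases apart.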
^ permalink raw reply
* Re: [RFC PATCH 2/4] rv/tlob: Add tlob deterministic automaton monitor
From: Gabriele Monaco @ 2026-04-16 15:35 UTC (permalink / raw)
To: Wen Yang
Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
linux-trace-kernel, linux-kernel
In-Reply-To: <228deda8-3685-4f07-afd5-d3f3ca531154@linux.dev>
Hello,
On Thu, 2026-04-16 at 23:09 +0800, Wen Yang wrote:
>
> Thanks for the review. Here's my plan for each point -- let me know if
> the direction looks right.
>
>
> - Timed automata
>
> The HA framework [1] is a good match when the timeout threshold is
> global or state-determined, but tlob needs a per-invocation threshold
> supplied at TRACE_START time -- fitting that into HA would require
> framework changes.
Not quite: look at the nomiss monitor, where the deadline comes directly from
the deadline entity.
What I meant by using a per-object monitor is that you can use your custom
struct as the monitor target; it carries your per-invocation threshold because
you instantiate it on start.
Now you can simply do ha_get_target(ha_mon)->threshold and you get your value.
You can define "clk < THRESHOLD_NS()" in the dot representation and rvgen will
do most of the work for you. It's probably better to use nanoseconds so you
avoid conversions when dealing with hrtimers. You can convert transparently when
initialising, so the user still passes microseconds.
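A rough user-space model of that idea, to make it concrete. Everything named
demo_* is hypothetical and this is not what rvgen would generate; it only
illustrates a target struct that takes the threshold in microseconds at start,
stores it in nanoseconds, and is then checked against a guard of the form
"clk < threshold".

```c
#include <stdint.h>

#define DEMO_NSEC_PER_USEC 1000ULL

/* Hypothetical per-invocation monitor target, created at start time. */
struct demo_target {
	uint64_t threshold_ns;	/* budget, stored in ns for timer use */
	uint64_t start_ns;	/* clock value when monitoring started */
};

/* The caller still passes microseconds; the conversion happens once here. */
static void demo_target_init(struct demo_target *t, uint64_t threshold_us,
			     uint64_t now_ns)
{
	t->threshold_ns = threshold_us * DEMO_NSEC_PER_USEC;
	t->start_ns = now_ns;
}

/* Models the guard "clk < THRESHOLD_NS()": 1 while within budget, else 0. */
static int demo_within_budget(const struct demo_target *t, uint64_t now_ns)
{
	return (now_ns - t->start_ns) < t->threshold_ns;
}
```

Keeping the stored value in nanoseconds means the comparison and any timer
arming use the same unit, and the microsecond interface stays unchanged for the
user.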
> My plan is to use da_monitor_init_hook() -- the same mechanism HA
> monitors use internally -- to arm the per-invocation hrtimer once
> da_create_storage() has stored the monitor_target. This gives the same
> "timer fires => violation" semantics without touching the HA infrastructure.
>
> If you see a cleaner way to pass per-invocation data through HA I'm
> happy to go that route.
The above looks cleaner to me, what do you think?
da_monitor_init_hook() isn't really meant to be used by monitors; it's more for
the infrastructure to extend da_monitor.h easily. You can use it if there's no
other way, though.
> - Unmonitored state / da_handle_start_event
>
> Fair point. I'll drop the explicit unmonitored state and the
> trace_event_tlob() redefinition. tlob_start_task() will use
> da_handle_start_event() to allocate storage, set initial state to on_cpu,
> and fire the init hook to arm the timer in one shot. tlob_stop_task()
> calls da_monitor_reset() directly.
>
> - Per-object monitors
>
> Will do. The custom hash table goes away; I'll switch to RV_MON_PER_OBJ
> with:
>
> typedef struct tlob_task_state *monitor_target;
>
> da_get_target_by_id() handles the sched_switch hot path lookup.
>
Exactly! That should do.
> - RV-way violations
>
> Agreed. budget_expired will be declared INVALID in all states so the
> framework calls react() (error_tlob tracepoint + any registered reactor)
> and da_monitor_reset() automatically. tlob won't emit any tracepoint of
> its own.
>
> One note on the /dev/tlob ioctl: TLOB_IOCTL_TRACE_STOP returns -EOVERFLOW
> to the caller when the budget was exceeded. This is just a syscall
> return code -- not a second reporting path -- to let in-process
> instrumentation react inline without polling the trace buffer.
> Let me know if you have concerns about keeping this.
>
I'm not sure how much faster it can be compared to attaching to tracefs, which
should be quite light if you just listen to error events. Sure, you'd need a few
more libraries.
I'm a bit concerned about adding new interfaces (ioctl) when we already have
tracepoints and reactors. The reactors themselves are not as flexible as they
should be, though; if required, we could definitely create an ioctl reactor just
for this.
For now ignore all this and continue with TLOB_IOCTL_TRACE_STOP, then we can
think about the details.
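For reference, the return-code convention being discussed can be written down
as a small pure function. This is a sketch that mirrors the TRACE_STOP branch
in the quoted rv_dev.c; demo_trace_stop_result() is a made-up name, and
stop_ret stands in for whatever tlob_stop_task() returned.

```c
#include <errno.h>

/*
 * had_handle: non-zero if TRACE_START was recorded for this thread.
 * stop_ret:   value returned by the underlying stop operation.
 */
static int demo_trace_stop_result(int had_handle, int stop_ret)
{
	/* No session was ever started: caller bug. */
	if (!had_handle)
		return -ESRCH;

	/*
	 * A session existed but the monitor entry is already gone: the
	 * deadline timer fired first, so the budget was exceeded.
	 */
	if (stop_ret == -ESRCH)
		return -EOVERFLOW;

	return stop_ret;
}
```

With this mapping an instrumented caller can tell "budget exceeded"
(-EOVERFLOW) from "never started" (-ESRCH) using the syscall return value
alone.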
> - Generic uprobe helper
>
> Proposed interface:
>
> struct rv_uprobe *rv_uprobe_attach_path(
> struct path *path, loff_t offset,
> int (*entry_fn)(struct rv_uprobe *, struct pt_regs *, __u64 *),
> int (*ret_fn) (struct rv_uprobe *, unsigned long func,
> struct pt_regs *, __u64 *),
> void *priv);
>
> struct rv_uprobe *rv_uprobe_attach(
> const char *binpath, loff_t offset,
> int (*entry_fn)(struct rv_uprobe *, struct pt_regs *, __u64 *),
> int (*ret_fn) (struct rv_uprobe *, unsigned long func,
> struct pt_regs *, __u64 *),
> void *priv);
>
> void rv_uprobe_detach(struct rv_uprobe *p);
>
> struct rv_uprobe exposes three read-only fields to monitors (offset,
> priv, path); the uprobe_consumer and callbacks would be kept private to
> the implementation, so monitors need not include <linux/uprobes.h>.
>
> rv_uprobe_attach() resolves the path and delegates to
> rv_uprobe_attach_path(); the latter avoids a redundant kern_path() when
> registering multiple probes on the same binary:
>
> kern_path(binpath, LOOKUP_FOLLOW, &path);
> b->start = rv_uprobe_attach_path(&path, offset_start, entry_fn,
> NULL, b);
> b->stop = rv_uprobe_attach_path(&path, offset_stop, stop_fn,
> NULL, b);
> path_put(&path);
>
> Does the interface look reasonable, or did you have a different shape in
> mind?
>
Yeah, seems reasonable. Then we'd need to keep the uprobe around for
deinitialisation, but having it global is probably the best way without
overengineering anything.
Thanks,
Gabriele
^ permalink raw reply
* [PATCH v2 00/28] vfs/nfsd: add support for CB_NOTIFY callbacks in directory delegations
From: Jeff Layton @ 2026-04-16 17:35 UTC (permalink / raw)
To: Alexander Viro, Christian Brauner, Jan Kara, Chuck Lever,
Alexander Aring, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, NeilBrown,
Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust,
Anna Schumaker, Amir Goldstein
Cc: Calum Mackay, linux-fsdevel, linux-kernel, linux-trace-kernel,
linux-doc, linux-nfs, Jeff Layton
This version has a number of significant changes from the last. I
dropped some of the R-b's for this reason.
Of particular interest to the fsnotify maintainers will be the
FSNOTIFY_EVENT_RENAME data type. This combines the FSNOTIFY_EVENT_DENTRY
and FSNOTIFY_EVENT_INODE event types so that the fsnotify event can
additionally send information about a file that was unlinked as a result
of being replaced via rename().
There are also a host of other bugfixes, and a new tracepoint. Please
consider this for v7.2.
Original cover letter follows:
---------------------------------8<------------------------------------
This patchset builds on the directory delegation work we did a few
months ago, to add support for CB_NOTIFY callbacks for some events. In
particular, creates, unlinks and renames. The server also sends updated
directory attributes in the notifications. With this support, the client
can register interest in a directory and get notifications about changes
within it without losing its lease.
The series starts with patches to allow the vfs to ignore certain types
of events on directories. nfsd can then request these sorts of
delegations on directories, and then set up inotify watches on the
directory to trigger sending CB_NOTIFY events.
This has mainly been tested with pynfs, with some new testcases that
I'll be posting soon. The patches seem to work fine with those tests, but I
don't think we'll want to merge them until we have a complete
client-side implementation to test against.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
Changes in v2:
- Fix __break_lease handling with different lease types on flc_lease list
- Add FSNOTIFY_EVENT_RENAME data type to properly handle cross-directory rename events
- Display fsnotify mask symbolically in tracepoints
- New tracepoint in fsnotify()
- Recalc fsnotify mask after unlocking lease instead of before
- Don't notify client that is making the changes
- After sending CB_NOTIFY, requeue if new events came in while running
- Document removal of NFS4_VERIFIER_SIZE/NFS4_FHSIZE from UAPI headers
- Properly release nfsd_dir_fsnotify_group on server shutdown
- Link to v1: https://lore.kernel.org/r/20260407-dir-deleg-v1-0-aaf68c478abd@kernel.org
---
Jeff Layton (28):
filelock: pass current blocking lease to trace_break_lease_block() rather than "new_fl"
filelock: add support for ignoring deleg breaks for dir change events
filelock: add a tracepoint to start of break_lease()
filelock: add an inode_lease_ignore_mask helper
fsnotify: new tracepoint in fsnotify()
fsnotify: add fsnotify_modify_mark_mask()
fsnotify: add FSNOTIFY_EVENT_RENAME data type
nfsd: check fl_lmops in nfsd_breaker_owns_lease()
nfsd: add protocol support for CB_NOTIFY
nfs_common: add new NOTIFY4_* flags proposed in RFC8881bis
nfsd: allow nfsd to get a dir lease with an ignore mask
nfsd: update the fsnotify mark when setting or removing a dir delegation
nfsd: make nfsd4_callback_ops->prepare operation bool return
nfsd: add callback encoding and decoding linkages for CB_NOTIFY
nfsd: use RCU to protect fi_deleg_file
nfsd: add data structures for handling CB_NOTIFY
nfsd: add notification handlers for dir events
nfsd: add tracepoint to dir_event handler
nfsd: apply the notify mask to the delegation when requested
nfsd: add helper to marshal a fattr4 from completed args
nfsd: allow nfsd4_encode_fattr4_change() to work with no export
nfsd: send basic file attributes in CB_NOTIFY
nfsd: allow encoding a filehandle into fattr4 without a svc_fh
nfsd: add a fi_connectable flag to struct nfs4_file
nfsd: add the filehandle to returned attributes in CB_NOTIFY
nfsd: properly track requested child attributes
nfsd: track requested dir attributes
nfsd: add support to CB_NOTIFY for dir attribute changes
 Documentation/sunrpc/xdr/nfs4_1.x    | 264 ++++++++++++++-
 fs/attr.c                            |   2 +-
 fs/locks.c                           | 118 +++++--
 fs/namei.c                           |  31 +-
 fs/nfsd/filecache.c                  |  70 +++-
 fs/nfsd/nfs4callback.c               |  60 +++-
 fs/nfsd/nfs4layouts.c                |   5 +-
 fs/nfsd/nfs4proc.c                   |  17 +
 fs/nfsd/nfs4state.c                  | 550 ++++++++++++++++++++++++++++----
 fs/nfsd/nfs4xdr.c                    | 323 +++++++++++++++++--
 fs/nfsd/nfs4xdr_gen.c                | 601 ++++++++++++++++++++++++++++++++++-
 fs/nfsd/nfs4xdr_gen.h                |  20 +-
 fs/nfsd/state.h                      |  72 ++++-
 fs/nfsd/trace.h                      |  23 ++
 fs/nfsd/xdr4.h                       |   5 +
 fs/nfsd/xdr4cb.h                     |  12 +
 fs/notify/fsnotify.c                 |   5 +
 fs/notify/mark.c                     |  29 ++
 fs/posix_acl.c                       |   4 +-
 fs/xattr.c                           |   4 +-
 include/linux/filelock.h             |  54 +++-
 include/linux/fsnotify.h             |   8 +-
 include/linux/fsnotify_backend.h     |  21 ++
 include/linux/nfs4.h                 | 127 --------
 include/linux/sunrpc/xdrgen/nfs4_1.h | 291 ++++++++++++++++-
 include/trace/events/filelock.h      |  38 ++-
 include/trace/events/fsnotify.h      |  51 +++
 include/trace/misc/fsnotify.h        |  35 ++
 include/uapi/linux/nfs4.h            |   2 -
29 files changed, 2518 insertions(+), 324 deletions(-)
---
base-commit: f4d71dd7fd9cec357c32431fa55c107b96008312
change-id: 20260325-dir-deleg-339066dd1017
Best regards,
--
Jeff Layton <jlayton@kernel.org>
* [PATCH v2 01/28] filelock: pass current blocking lease to trace_break_lease_block() rather than "new_fl"
From: Jeff Layton @ 2026-04-16 17:35 UTC (permalink / raw)
To: Alexander Viro, Christian Brauner, Jan Kara, Chuck Lever,
Alexander Aring, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, NeilBrown,
Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust,
Anna Schumaker, Amir Goldstein
Cc: Calum Mackay, linux-fsdevel, linux-kernel, linux-trace-kernel,
linux-doc, linux-nfs, Jeff Layton
In-Reply-To: <20260416-dir-deleg-v2-0-851426a550f6@kernel.org>
The break_lease_block tracepoint currently just shows the type of
"new_fl", which we can predict from the "flags" value. Switch it to
display info about "fl" instead, as that's the file_lease on which the
code is blocking.
For trace_break_lease_unblock(), pass it a NULL pointer. "fl" may have
been freed by that point, and passing it the info in new_fl is
deceptive.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
fs/locks.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/fs/locks.c b/fs/locks.c
index 8e44b1f6c15a..d82c5be7aa5b 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -1691,7 +1691,7 @@ int __break_lease(struct inode *inode, unsigned int flags)
} else
break_time++;
locks_insert_block(&fl->c, &new_fl->c, leases_conflict);
- trace_break_lease_block(inode, new_fl);
+ trace_break_lease_block(inode, fl);
spin_unlock(&ctx->flc_lock);
percpu_up_read(&file_rwsem);
@@ -1702,7 +1702,7 @@ int __break_lease(struct inode *inode, unsigned int flags)
percpu_down_read(&file_rwsem);
spin_lock(&ctx->flc_lock);
- trace_break_lease_unblock(inode, new_fl);
+ trace_break_lease_unblock(inode, NULL);
__locks_delete_block(&new_fl->c);
if (error >= 0) {
/*
--
2.53.0