* Re: [RFC PATCH 2/2] kernel/module: Decouple klp and ftrace from load_module
From: Petr Mladek @ 2026-04-16 13:09 UTC
To: Song Chen
Cc: Petr Pavlu, rafael, lenb, mturquette, sboyd, viresh.kumar, agk,
snitzer, mpatocka, bmarzins, song, yukuai, linan122, jason.wessel,
danielt, dianders, horms, davem, edumazet, kuba, pabeni, paulmck,
frederic, mcgrof, da.gomez, samitolvanen, atomlin, jpoimboe,
jikos, mbenes, joe.lawrence, rostedt, mhiramat, mark.rutland,
mathieu.desnoyers, linux-modules, linux-kernel,
linux-trace-kernel, linux-acpi, linux-clk, linux-pm,
live-patching, dm-devel, linux-raid, kgdb-bugreport, netdev
In-Reply-To: <a35f5f94-7d5a-4347-974b-b270c89ef241@189.cn>
On Wed 2026-04-15 14:43:53, Song Chen wrote:
> Hi,
>
> On 4/14/26 22:33, Petr Pavlu wrote:
> > On 4/13/26 10:07 AM, chensong_2000@189.cn wrote:
> > > From: Song Chen <chensong_2000@189.cn>
> > >
> > > ftrace and livepatch currently have their module load/unload callbacks
> > > hard-coded in the module loader as direct function calls to
> > > ftrace_module_enable(), klp_module_coming(), klp_module_going()
> > > and ftrace_release_mod(). This tight coupling was originally introduced
> > > to enforce strict call ordering that could not be guaranteed by the
> > > module notifier chain, which only supported forward traversal. Their
> > > notifiers were moved in and out back and forth. see [1] and [2].
> >
> > I'm unclear about what is meant by the notifiers being moved back and
> > forth. The links point to patches that converted ftrace+klp from using
> > module notifiers to explicit callbacks due to ordering issues, but this
> > switch occurred only once. Have there been other attempts to use
> > notifiers again?
> >
> > > diff --git a/include/linux/module.h b/include/linux/module.h
> > > index 14f391b186c6..0bdd56f9defd 100644
> > > --- a/include/linux/module.h
> > > +++ b/include/linux/module.h
> > > @@ -308,6 +308,14 @@ enum module_state {
> > > MODULE_STATE_COMING, /* Full formed, running module_init. */
> > > MODULE_STATE_GOING, /* Going away. */
> > > MODULE_STATE_UNFORMED, /* Still setting it up. */
> > > + MODULE_STATE_FORMED,
> >
> > I don't see a reason to add a new module state. Why is it necessary and
> > how does it fit with the existing states?
> >
> because once a notifier fails in state MODULE_STATE_UNFORMED (now only ftrace
> has something to do in this state), the notifier chain will roll back by calling
> blocking_notifier_call_chain_robust. I'm afraid MODULE_STATE_GOING is going
> to jeopardise the notifiers which don't handle it appropriately, like:
>
> case MODULE_STATE_COMING:
> kmalloc();
> case MODULE_STATE_GOING:
> kfree();
>
>
> > > +};
> > > +
> > > +enum module_notifier_prio {
> > > + MODULE_NOTIFIER_PRIO_LOW = INT_MIN, /* Low priority, coming last, going first */
> > > + MODULE_NOTIFIER_PRIO_MID = 0, /* Normal priority. */
> > > + MODULE_NOTIFIER_PRIO_SECOND_HIGH = INT_MAX - 1, /* Second high priority, coming second */
> > > + MODULE_NOTIFIER_PRIO_HIGH = INT_MAX, /* High priority, coming first, going late. */
> >
> > I suggest being explicit about how the notifiers are ordered. For
> > example:
> >
> > enum module_notifier_prio {
> > MODULE_NOTIFIER_PRIO_NORMAL, /* Normal priority, coming last, going first. */
> > MODULE_NOTIFIER_PRIO_LIVEPATCH,
> > MODULE_NOTIFIER_PRIO_FTRACE, /* High priority, coming first, going late. */
> > };
> >
I like the explicit PRIO_LIVEPATCH/FTRACE names.
But I would keep the INT_MAX - 1 and INT_MAX priorities. I believe
that ftrace/livepatching will always need to be called first/last.
And INT_MAX would help to preserve kABI when PRIO_NORMAL is not
enough for the rest of the notifiers.
That said, I am not sure whether this is worth the effort.
This patch tries to move the explicit callbacks into the generic
notifier API. But it will still need to use some explicitly
defined (reserved) priorities. And it will
not prevent misuse. Also, it requires a doubly linked
list, which complicates the notifier code.
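To make the ordering discussion concrete, here is a minimal user-space
sketch of a priority-ordered chain that can be walked in both directions.
It is not the kernel notifier code: struct nb, struct chain,
chain_register() and the PRIO_* names are hypothetical, and the
INT_MAX - 1 / INT_MAX values only mirror the reserved priorities
discussed above.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical priorities; the two top values are reserved. */
enum {
	PRIO_NORMAL    = 0,
	PRIO_LIVEPATCH = INT_MAX - 1,
	PRIO_FTRACE    = INT_MAX,
};

/* Hypothetical user-space model of a notifier block. */
struct nb {
	int priority;
	int id;
	struct nb *next, *prev;
};

struct chain {
	struct nb *head, *tail;
};

/* Insert while keeping the list sorted by descending priority. */
static void chain_register(struct chain *c, struct nb *n)
{
	struct nb *pos = c->head;

	/* Find the first entry with a strictly lower priority. */
	while (pos && pos->priority >= n->priority)
		pos = pos->next;

	n->next = pos;
	n->prev = pos ? pos->prev : c->tail;
	if (n->prev)
		n->prev->next = n;
	else
		c->head = n;
	if (pos)
		pos->prev = n;
	else
		c->tail = n;
}

/* Record the ids in "coming" order (highest priority first). */
static int chain_walk_forward(const struct chain *c, int *out, int max)
{
	int i = 0;

	for (const struct nb *n = c->head; n && i < max; n = n->next)
		out[i++] = n->id;
	return i;
}

/* Record the ids in "going" order (lowest priority first). */
static int chain_walk_reverse(const struct chain *c, int *out, int max)
{
	int i = 0;

	for (const struct nb *n = c->tail; n && i < max; n = n->prev)
		out[i++] = n->id;
	return i;
}
```

The forward walk visits the ftrace entry, then livepatch, then the normal
notifiers. The reverse walk visits them in the opposite order, which is
the only thing the prev pointer is needed for.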
> > > };
> > > struct mod_tree_node {
> > > --- a/kernel/module/main.c
> > > +++ b/kernel/module/main.c
> > > @@ -3281,20 +3277,14 @@ static int complete_formation(struct module *mod, struct load_info *info)
> > > return err;
> > > }
> > > -static int prepare_coming_module(struct module *mod)
> > > +static int prepare_module_state_transaction(struct module *mod,
> > > + unsigned long val_up, unsigned long val_down)
> > > {
> > > int err;
> > > - ftrace_module_enable(mod);
> > > - err = klp_module_coming(mod);
> > > - if (err)
> > > - return err;
> > > -
> > > err = blocking_notifier_call_chain_robust(&module_notify_list,
> > > - MODULE_STATE_COMING, MODULE_STATE_GOING, mod);
> > > + val_up, val_down, mod);
> > > err = notifier_to_errno(err);
> > > - if (err)
> > > - klp_module_going(mod);
> > > return err;
> > > }
I personally find the name "prepare_module_state_transaction"
misleading. What is the "transaction" here? If this was a "preparation"
step then where is the transaction done/finished?
It might be better to just open-code the
blocking_notifier_call_chain_robust() call instead.
> > > @@ -3468,14 +3458,21 @@ static int load_module(struct load_info *info, const char __user *uargs,
> > > init_build_id(mod, info);
> > > /* Ftrace init must be called in the MODULE_STATE_UNFORMED state */
> > > - ftrace_module_init(mod);
> > > + err = prepare_module_state_transaction(mod,
> > > + MODULE_STATE_UNFORMED, MODULE_STATE_FORMED);
> >
> > I believe val_down should be MODULE_STATE_GOING to reverse the
> > operation. Why is the new state MODULE_STATE_FORMED needed here?
> to avoid this:
>
> case MODULE_STATE_COMING:
> kmalloc();
> case MODULE_STATE_GOING:
> kfree();
Hmm, the module is in "FORMED" state here.
> > > + if (err)
> > > + goto ddebug_cleanup;
> > > /* Finally it's fully formed, ready to start executing. */
> > > err = complete_formation(mod, info);
And we call "complete_formation()" function. This sounds like
it was not really "FORMED" before. => It is confusing and a no-no.
Please, try to avoid the new state if possible. My experience
with reading the module loader code is that any new state
brings a lot of complexity. You need to take it into account
when checking correctness of other changes, features, ...
Something tells me that if the state was not needed before
then we could avoid it.
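For context, the rollback concern quoted above can be modelled in user
space. This is a simplified sketch, not the kernel implementation: the
ST_* values, struct notifier and call_chain_robust() are hypothetical
stand-ins, and the model only assumes that a failed "up" notification
causes the "down" value to be sent to the notifiers that already ran.

```c
#include <stddef.h>

/* Hypothetical states; ST_FORMED models the rollback-only value. */
enum state { ST_UNFORMED, ST_FORMED, ST_COMING, ST_GOING };

struct notifier {
	int (*cb)(struct notifier *self, enum state st);
	int allocated;	/* models a resource: +1 on alloc, -1 on free */
};

/* A typical notifier: allocate on COMING, free on GOING, ignore the rest. */
static int typical_cb(struct notifier *self, enum state st)
{
	switch (st) {
	case ST_COMING:
		self->allocated++;
		break;
	case ST_GOING:
		self->allocated--;
		break;
	default:
		break;
	}
	return 0;
}

/* A notifier that fails when it sees UNFORMED. */
static int failing_cb(struct notifier *self, enum state st)
{
	(void)self;
	return st == ST_UNFORMED ? -1 : 0;
}

/*
 * Send "up" to each notifier in turn. On the first failure, send "down"
 * to the notifiers that were already called and return the error.
 */
static int call_chain_robust(struct notifier **chain, size_t n,
			     enum state up, enum state down)
{
	for (size_t i = 0; i < n; i++) {
		int err = chain[i]->cb(chain[i], up);

		if (err) {
			for (size_t j = 0; j < i; j++)
				chain[j]->cb(chain[j], down);
			return err;
		}
	}
	return 0;
}
```

With GOING as the rollback value, the typical notifier frees something it
never allocated. With a value that it ignores, nothing happens. That is
the motivation for the extra value, but it says nothing about whether a
new module state is the right way to get it.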
> > > - if (err)
> > > + if (err) {
> > > + blocking_notifier_call_chain_reverse(&module_notify_list,
> > > + MODULE_STATE_FORMED, mod);
> > > goto ddebug_cleanup;
> > > + }
> > > - err = prepare_coming_module(mod);
> > > + err = prepare_module_state_transaction(mod,
> > > + MODULE_STATE_COMING, MODULE_STATE_GOING);
> > > if (err)
> > > goto bug_cleanup;
> > > --- a/kernel/trace/ftrace.c
> > > +++ b/kernel/trace/ftrace.c
> > > @@ -5241,6 +5241,44 @@ static int __init ftrace_mod_cmd_init(void)
> > > }
> > > core_initcall(ftrace_mod_cmd_init);
> > > +static int ftrace_module_callback(struct notifier_block *nb, unsigned long op,
> > > + void *module)
> > > +{
> > > + struct module *mod = module;
> > > +
> > > + switch (op) {
> > > + case MODULE_STATE_UNFORMED:
> > > + ftrace_module_init(mod);
> > > + break;
> > > + case MODULE_STATE_COMING:
> > > + ftrace_module_enable(mod);
> > > + break;
> > > + case MODULE_STATE_LIVE:
> > > + ftrace_free_mem(mod, mod->mem[MOD_INIT_TEXT].base,
> > > + mod->mem[MOD_INIT_TEXT].base + mod->mem[MOD_INIT_TEXT].size);
> > > + break;
> > > + case MODULE_STATE_GOING:
> > > + case MODULE_STATE_FORMED:
> > > + ftrace_release_mod(mod);
This calls "release" in a "FORMED" state. It does not make any
sense. Something looks fishy, either the code or the naming.
> > > + break;
> > > + default:
> > > + break;
> > > + }
> >
I am sorry for being so picky about names. I believe that good names
help to prevent bugs and reduce headaches.
Best Regards,
Petr
* Re: [RFC PATCH 2/2] kernel/module: Decouple klp and ftrace from load_module
From: Petr Mladek @ 2026-04-16 14:49 UTC
To: Petr Pavlu
Cc: Song Chen, rafael, lenb, mturquette, sboyd, viresh.kumar, agk,
snitzer, mpatocka, bmarzins, song, yukuai, linan122, jason.wessel,
danielt, dianders, horms, davem, edumazet, kuba, pabeni, paulmck,
frederic, mcgrof, da.gomez, samitolvanen, atomlin, jpoimboe,
jikos, mbenes, joe.lawrence, rostedt, mhiramat, mark.rutland,
mathieu.desnoyers, linux-modules, linux-kernel,
linux-trace-kernel, linux-acpi, linux-clk, linux-pm,
live-patching, dm-devel, linux-raid, kgdb-bugreport, netdev
In-Reply-To: <1db425bf-58a9-4768-8c38-3ae25d7662a5@suse.com>
On Thu 2026-04-16 13:18:30, Petr Pavlu wrote:
> On 4/15/26 8:43 AM, Song Chen wrote:
> > On 4/14/26 22:33, Petr Pavlu wrote:
> >> On 4/13/26 10:07 AM, chensong_2000@189.cn wrote:
> >>> diff --git a/include/linux/module.h b/include/linux/module.h
> >>> index 14f391b186c6..0bdd56f9defd 100644
> >>> --- a/include/linux/module.h
> >>> +++ b/include/linux/module.h
> >>> @@ -308,6 +308,14 @@ enum module_state {
> >>> MODULE_STATE_COMING, /* Full formed, running module_init. */
> >>> MODULE_STATE_GOING, /* Going away. */
> >>> MODULE_STATE_UNFORMED, /* Still setting it up. */
> >>> + MODULE_STATE_FORMED,
> >>
> >> I don't see a reason to add a new module state. Why is it necessary and
> >> how does it fit with the existing states?
> >>
> > because once a notifier fails in state MODULE_STATE_UNFORMED (now only ftrace has something to do in this state), the notifier chain will roll back by calling blocking_notifier_call_chain_robust. I'm afraid MODULE_STATE_GOING is going to jeopardise the notifiers which don't handle it appropriately, like:
> >
> > case MODULE_STATE_COMING:
> > kmalloc();
> > case MODULE_STATE_GOING:
> > kfree();
>
> My understanding is that the current module "state machine" operates as
> follows. Transitions marked with an asterisk (*) are announced via the
> module notifier.
>
> ---> UNFORMED --*> COMING --*> LIVE --*> GOING -.
>         ^            |                     ^    |
>         |            '---------------------*    |
>         '---------------------------------------'
>
> The new code aims to replace the current ftrace_module_init() call in
> load_module(). To achieve this, it adds a notification for the UNFORMED
> state (only when loading a module) and introduces a new FORMED state for
> rollback. FORMED is purely a fake state because it never appears in
> module::state. The new structure is as follows:
>
>         ,--*> (FORMED)
>         |
> --*> UNFORMED --*> COMING --*> LIVE --*> GOING -.
>         ^            |                     ^    |
>         |            '---------------------*    |
>         '---------------------------------------'
>
> I'm afraid this is quite complex and inconsistent. Unless it can be kept
> simple, we would be just replacing one special handling with a different
> complexity, which is not worth it.
> >>
> >>> + if (err)
> >>> + goto ddebug_cleanup;
> >>> /* Finally it's fully formed, ready to start executing. */
> >>> err = complete_formation(mod, info);
> >>> - if (err)
> >>> + if (err) {
> >>> + blocking_notifier_call_chain_reverse(&module_notify_list,
> >>> + MODULE_STATE_FORMED, mod);
> >>> goto ddebug_cleanup;
> >>> + }
> >>> - err = prepare_coming_module(mod);
> >>> + err = prepare_module_state_transaction(mod,
> >>> + MODULE_STATE_COMING, MODULE_STATE_GOING);
> >>> if (err)
> >>> goto bug_cleanup;
> >>> @@ -3522,7 +3519,6 @@ static int load_module(struct load_info *info, const char __user *uargs,
> >>> destroy_params(mod->kp, mod->num_kp);
> >>> blocking_notifier_call_chain(&module_notify_list,
> >>> MODULE_STATE_GOING, mod);
> >>
> >> My understanding is that all notifier chains for MODULE_STATE_GOING
> >> should be reversed.
> > yes, all, from lowest priority notifier to highest.
> > I will resend patch 1 which was failed due to my proxy setting.
>
> What I meant here is that the call:
>
> blocking_notifier_call_chain(&module_notify_list, MODULE_STATE_GOING, mod);
>
> should be replaced with:
>
> blocking_notifier_call_chain_reverse(&module_notify_list, MODULE_STATE_GOING, mod);
>
> >
> >>
> >>> - klp_module_going(mod);
> >>> bug_cleanup:
> >>> mod->state = MODULE_STATE_GOING;
> >>> /* module_bug_cleanup needs module_mutex protection */
> >>
> >> The patch removes the klp_module_going() cleanup call in load_module().
> >> Similarly, the ftrace_release_mod() call under the ddebug_cleanup label
> >> should be removed and appropriately replaced with a cleanup via
> >> a notifier.
> >>
> > err = prepare_module_state_transaction(mod,
> > MODULE_STATE_UNFORMED, MODULE_STATE_FORMED);
> > if (err)
> > goto ddebug_cleanup;
> >
> > ftrace will be cleanup in blocking_notifier_call_chain_robust rolling back.
> >
> > err = prepare_module_state_transaction(mod,
> > MODULE_STATE_COMING, MODULE_STATE_GOING);
> >
> > each notifier including ftrace and klp will be cleanup in blocking_notifier_call_chain_robust rolling back.
> >
> > if all notifiers are successful in MODULE_STATE_COMING, they all will be clean up in
> > coming_cleanup:
> > mod->state = MODULE_STATE_GOING;
> > destroy_params(mod->kp, mod->num_kp);
> > blocking_notifier_call_chain(&module_notify_list,
> > MODULE_STATE_GOING, mod);
> >
> > if something wrong underneath.
>
> My point is that the patch leaves a call to ftrace_release_mod() in
> load_module(), which I expected to be handled via a notifier.
I think that I have got it. The ftrace code needs two notifiers when
the module is being loaded and two when it is going.
This is why Song added the new state. But I think that we would
need two new states to call:
+ ftrace_module_init() in MODULE_STATE_UNFORMED
+ ftrace_module_enable() in MODULE_STATE_FORMED
and
+ ftrace_free_mem() in MODULE_STATE_PRE_GOING
+ ftrace_free_mem() in MODULE_STATE_GOING
By using the ascii art:
-*> UNFORMED -*> FORMED -> COMING -*> LIVE -*> PRE_GOING -*> GOING -.
| | | ^ ^ ^
| | '----------------' | |
| '--------------------------------------' |
'------------------------------------------------------'
But I think that this is not worth it.
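For comparison, the existing state machine described in the quoted text
(without any extra states) is small enough to write down as a transition
table. This is only a user-space sketch: the ST_* names and the helper
functions are hypothetical, and "announced" marks the transitions that
are reported via the module notifier.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mirror of the four existing module states. */
enum mod_state { ST_LIVE, ST_COMING, ST_GOING, ST_UNFORMED };

struct transition {
	enum mod_state from, to;
	bool announced;	/* reported through the module notifier chain */
};

static const struct transition transitions[] = {
	{ ST_UNFORMED, ST_COMING,   true  },	/* module fully formed */
	{ ST_COMING,   ST_LIVE,     true  },	/* init finished */
	{ ST_LIVE,     ST_GOING,    true  },	/* module removal */
	{ ST_COMING,   ST_GOING,    true  },	/* load failed after COMING */
	{ ST_GOING,    ST_UNFORMED, false },	/* teardown, not announced */
};

static const struct transition *find_transition(enum mod_state from,
						 enum mod_state to)
{
	for (size_t i = 0; i < sizeof(transitions) / sizeof(transitions[0]); i++)
		if (transitions[i].from == from && transitions[i].to == to)
			return &transitions[i];
	return NULL;
}

static bool transition_allowed(enum mod_state from, enum mod_state to)
{
	return find_transition(from, to) != NULL;
}

static bool transition_announced(enum mod_state from, enum mod_state to)
{
	const struct transition *t = find_transition(from, to);

	return t && t->announced;
}
```

Every additional state adds rows to this table and, more importantly,
cases that each notifier and each reviewer has to think about.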
Best Regards,
Petr
* Re: [RFC PATCH 1/2] kernel/notifier: replace single-linked list with double-linked list for reverse traversal
From: Petr Mladek @ 2026-04-16 14:54 UTC
To: David Laight
Cc: chensong_2000, rafael, lenb, mturquette, sboyd, viresh.kumar, agk,
snitzer, mpatocka, bmarzins, song, yukuai, linan122, jason.wessel,
danielt, dianders, horms, davem, edumazet, kuba, pabeni, paulmck,
frederic, mcgrof, petr.pavlu, da.gomez, samitolvanen, atomlin,
jpoimboe, jikos, mbenes, joe.lawrence, rostedt, mhiramat,
mark.rutland, mathieu.desnoyers, linux-modules, linux-kernel,
linux-trace-kernel, linux-acpi, linux-clk, linux-pm,
live-patching, dm-devel, linux-raid, kgdb-bugreport, netdev
In-Reply-To: <20260416133004.07bd2886@pumpkin>
On Thu 2026-04-16 13:30:04, David Laight wrote:
> On Wed, 15 Apr 2026 15:01:37 +0800
> chensong_2000@189.cn wrote:
>
> > From: Song Chen <chensong_2000@189.cn>
> >
> > The current notifier chain implementation uses a single-linked list
> > (struct notifier_block *next), which only supports forward traversal
> > in priority order. This makes it difficult to handle cleanup/teardown
> > scenarios that require notifiers to be called in reverse priority order.
>
> If it is only cleanup/teardown then the list can be order-reversed
> as part of that process at the same time as the list is deleted.
Interesting idea. But it won't work in all situations.
Note that the motivation for this update is the module loader
notifier chain, which is called several times for each loaded/removed
module and is not deleted afterwards.
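A minimal sketch of the in-place reversal idea, to show why it fits a
one-shot teardown better than a long-lived chain. The struct node type
and reverse_list() are hypothetical and have nothing to do with the
actual notifier implementation.

```c
#include <stddef.h>

/* Hypothetical singly linked list node. */
struct node {
	int id;
	struct node *next;
};

/*
 * Reverse the list in place and return the new head. The original
 * order is gone afterwards: a later forward walk needs another
 * reversal, and nothing else may traverse the list in the meantime.
 */
static struct node *reverse_list(struct node *head)
{
	struct node *prev = NULL;

	while (head) {
		struct node *next = head->next;

		head->next = prev;
		prev = head;
		head = next;
	}
	return prev;
}
```

A chain that is walked forward for COMING and backward for GOING, again
and again for every module, would have to be reversed and restored
around each reverse walk, with the chain excluded from other users while
that happens.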
Best Regards,
Petr
* [PATCH v5 1/7] tracing/lock: Remove unnecessary linux/sched.h include
From: Dmitry Ilvokhin @ 2026-04-16 15:05 UTC
To: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long,
Thomas Bogendoerfer, Juergen Gross, Ajay Kaher, Alexey Makhalov,
Broadcom internal kernel review list, Thomas Gleixner,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Arnd Bergmann,
Dennis Zhou, Tejun Heo, Christoph Lameter, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers
Cc: linux-kernel, linux-mips, virtualization, linux-arch, linux-mm,
linux-trace-kernel, kernel-team, Paul E. McKenney,
Dmitry Ilvokhin, Usama Arif
In-Reply-To: <cover.1776350944.git.d@ilvokhin.com>
None of the trace events in lock.h reference anything from
linux/sched.h. Remove the unnecessary include.
Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
Acked-by: Usama Arif <usama.arif@linux.dev>
---
include/trace/events/lock.h | 1 -
1 file changed, 1 deletion(-)
diff --git a/include/trace/events/lock.h b/include/trace/events/lock.h
index 8e89baa3775f..da978f2afb45 100644
--- a/include/trace/events/lock.h
+++ b/include/trace/events/lock.h
@@ -5,7 +5,6 @@
#if !defined(_TRACE_LOCK_H) || defined(TRACE_HEADER_MULTI_READ)
#define _TRACE_LOCK_H
-#include <linux/sched.h>
#include <linux/tracepoint.h>
/* flags for lock:contention_begin */
--
2.52.0
* [PATCH v5 0/7] locking: contended_release tracepoint instrumentation
From: Dmitry Ilvokhin @ 2026-04-16 15:05 UTC
To: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long,
Thomas Bogendoerfer, Juergen Gross, Ajay Kaher, Alexey Makhalov,
Broadcom internal kernel review list, Thomas Gleixner,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Arnd Bergmann,
Dennis Zhou, Tejun Heo, Christoph Lameter, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers
Cc: linux-kernel, linux-mips, virtualization, linux-arch, linux-mm,
linux-trace-kernel, kernel-team, Paul E. McKenney,
Dmitry Ilvokhin
The existing contention_begin/contention_end tracepoints fire on the
waiter side. The lock holder's identity and stack can be captured at
contention_begin time (e.g. perf lock contention --lock-owner), but
this reflects the holder's state when a waiter arrives, not when the
lock is actually released.
This series adds a contended_release tracepoint that fires on the
holder side when a lock with waiters is released. This provides:
- Hold time estimation: when the holder's own acquisition was
contended, its contention_end (acquisition) and contended_release
can be correlated to measure how long the lock was held under
contention.
- The holder's stack at release time, which may differ from what perf lock
contention --lock-owner captures if the holder does significant work between
the waiter's arrival and the unlock.
Note: for reader/writer locks, the tracepoint fires for every reader
releasing while a writer is waiting, not only for the last reader.
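As an illustration of the hold time estimation, here is a user-space
sketch of how a post-processing tool might pair the two events. It is
only a sketch under assumptions: struct lock_event and hold_time_ns()
are hypothetical, the events are assumed to be sorted by timestamp, and
the pid is assumed to come from the common trace fields since the
tracepoint itself records only the lock address.

```c
#include <stddef.h>
#include <stdint.h>

enum ev_type { EV_CONTENTION_END, EV_CONTENDED_RELEASE };

/* Hypothetical decoded trace record. */
struct lock_event {
	enum ev_type type;
	uint64_t ts_ns;		/* event timestamp */
	uintptr_t lock_addr;	/* lock address from the event payload */
	int pid;		/* task that emitted the event */
};

/*
 * Estimate the hold time for the contended_release event at index rel.
 * Walk backwards to the most recent event from the same task on the
 * same lock. If that is a contention_end, the difference of the two
 * timestamps is the estimate. Return -1 when there is no estimate,
 * e.g. when the holder's own acquisition was not contended.
 */
static int64_t hold_time_ns(const struct lock_event *ev, size_t rel)
{
	if (ev[rel].type != EV_CONTENDED_RELEASE)
		return -1;

	for (size_t i = rel; i-- > 0; ) {
		if (ev[i].lock_addr != ev[rel].lock_addr ||
		    ev[i].pid != ev[rel].pid)
			continue;
		if (ev[i].type == EV_CONTENTION_END)
			return (int64_t)(ev[rel].ts_ns - ev[i].ts_ns);
		/* An earlier release: the acquisition in between was not traced. */
		return -1;
	}
	return -1;
}
```

The result is only an estimate of the time the lock was held under
contention, and only for acquisitions that went through the contended
path.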
v4 -> v5:
- Split the combined spinning locks patch into separate qspinlock and
qrwlock patches (Paul E. McKenney).
- Factor out __queued_read_unlock()/__queued_write_unlock() as a
separate preparatory commit, mirroring the queued_spin_release()
split (Paul E. McKenney).
- Updated binary size numbers for qspinlock-only change.
- Added Acked-by and Reviewed-by tags where appropriate.
v3 -> v4:
- Fix spurious events in __percpu_up_read(): guard with
rcuwait_active(&sem->writer) to avoid tracing during the RCU grace
period after a writer releases (Sashiko).
- Fix possible use-after-free in semaphore up(): move
trace_contended_release() inside the sem->lock critical section
(Sashiko).
- Fix build failure with CONFIG_PARAVIRT_SPINLOCKS=y: introduce
queued_spin_release() as the arch-overridable unlock primitive,
so queued_spin_unlock() can be a generic tracing wrapper. Convert
x86 (paravirt) and MIPS overrides (Sashiko).
- Add EXPORT_TRACEPOINT_SYMBOL_GPL(contended_release) for module
support (Sashiko).
- Split spinning locks patch: factor out queued_spin_release() as a
separate preparatory commit (Sashiko).
- Make read unlock tracepoint behavior consistent across all
reader/writer lock types: fire for every reader releasing while
a writer is waiting (rwsem, rwbase_rt were previously last-reader
only).
v2 -> v3:
- Added new patch: extend contended_release tracepoint to queued spinlocks
and queued rwlocks (marked as RFC, requesting feedback). This is prompted by
Matthew Wilcox's suggestion to try to come up with generic instrumentation,
instead of instrumenting each "special" lock manually. See [1] for the
discussion.
- Reworked tracepoint placement to fire before the lock is released and
before the waiter is woken where possible, for consistency with
spinning locks where there is no explicit wake (inspired by Usama Arif's
suggestion).
- Remove unnecessary linux/sched.h include from trace/events/lock.h.
RFC -> v2:
- Add trace_contended_release_enabled() guard before waiter checks that
exist only for the tracepoint (Steven Rostedt).
- Rename __percpu_up_read_slowpath() to __percpu_up_read() (Peter
Zijlstra).
- Add extern for __percpu_up_read() (Peter Zijlstra).
- Squashed tracepoint introduction and usage commits (Masami Hiramatsu).
v4: https://lore.kernel.org/all/cover.1774536681.git.d@ilvokhin.com/
v3: https://lore.kernel.org/all/cover.1773858853.git.d@ilvokhin.com/
v2: https://lore.kernel.org/all/cover.1773164180.git.d@ilvokhin.com/
RFC: https://lore.kernel.org/all/cover.1772642407.git.d@ilvokhin.com/
[1]: https://lore.kernel.org/all/aa7G1nD7Rd9F4eBH@casper.infradead.org/
Dmitry Ilvokhin (7):
tracing/lock: Remove unnecessary linux/sched.h include
locking/percpu-rwsem: Extract __percpu_up_read()
locking: Add contended_release tracepoint to sleepable locks
locking: Factor out queued_spin_release()
locking: Add contended_release tracepoint to qspinlock
locking: Factor out __queued_read_unlock()/__queued_write_unlock()
locking: Add contended_release tracepoint to qrwlock
arch/mips/include/asm/spinlock.h | 6 ++--
arch/x86/include/asm/paravirt-spinlock.h | 6 ++--
include/asm-generic/qrwlock.h | 38 ++++++++++++++++++++++--
include/asm-generic/qspinlock.h | 33 ++++++++++++++++++--
include/linux/percpu-rwsem.h | 15 ++--------
include/trace/events/lock.h | 18 ++++++++++-
kernel/locking/mutex.c | 4 +++
kernel/locking/percpu-rwsem.c | 29 ++++++++++++++++++
kernel/locking/qrwlock.c | 16 ++++++++++
kernel/locking/qspinlock.c | 8 +++++
kernel/locking/rtmutex.c | 1 +
kernel/locking/rwbase_rt.c | 6 ++++
kernel/locking/rwsem.c | 10 +++++--
kernel/locking/semaphore.c | 4 +++
14 files changed, 167 insertions(+), 27 deletions(-)
--
2.52.0
* [PATCH v5 2/7] locking/percpu-rwsem: Extract __percpu_up_read()
From: Dmitry Ilvokhin @ 2026-04-16 15:05 UTC
To: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long,
Thomas Bogendoerfer, Juergen Gross, Ajay Kaher, Alexey Makhalov,
Broadcom internal kernel review list, Thomas Gleixner,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Arnd Bergmann,
Dennis Zhou, Tejun Heo, Christoph Lameter, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers
Cc: linux-kernel, linux-mips, virtualization, linux-arch, linux-mm,
linux-trace-kernel, kernel-team, Paul E. McKenney,
Dmitry Ilvokhin, Usama Arif
In-Reply-To: <cover.1776350944.git.d@ilvokhin.com>
Move the percpu_up_read() slowpath out of the inline function into a new
__percpu_up_read() to avoid binary size increase from adding a
tracepoint to an inlined function.
Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
Acked-by: Usama Arif <usama.arif@linux.dev>
---
include/linux/percpu-rwsem.h | 15 +++------------
kernel/locking/percpu-rwsem.c | 18 ++++++++++++++++++
2 files changed, 21 insertions(+), 12 deletions(-)
diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h
index c8cb010d655e..39d5bf8e6562 100644
--- a/include/linux/percpu-rwsem.h
+++ b/include/linux/percpu-rwsem.h
@@ -107,6 +107,8 @@ static inline bool percpu_down_read_trylock(struct percpu_rw_semaphore *sem)
return ret;
}
+extern void __percpu_up_read(struct percpu_rw_semaphore *sem);
+
static inline void percpu_up_read(struct percpu_rw_semaphore *sem)
{
rwsem_release(&sem->dep_map, _RET_IP_);
@@ -118,18 +120,7 @@ static inline void percpu_up_read(struct percpu_rw_semaphore *sem)
if (likely(rcu_sync_is_idle(&sem->rss))) {
this_cpu_dec(*sem->read_count);
} else {
- /*
- * slowpath; reader will only ever wake a single blocked
- * writer.
- */
- smp_mb(); /* B matches C */
- /*
- * In other words, if they see our decrement (presumably to
- * aggregate zero, as that is the only time it matters) they
- * will also see our critical section.
- */
- this_cpu_dec(*sem->read_count);
- rcuwait_wake_up(&sem->writer);
+ __percpu_up_read(sem);
}
preempt_enable();
}
diff --git a/kernel/locking/percpu-rwsem.c b/kernel/locking/percpu-rwsem.c
index ef234469baac..f3ee7a0d6047 100644
--- a/kernel/locking/percpu-rwsem.c
+++ b/kernel/locking/percpu-rwsem.c
@@ -288,3 +288,21 @@ void percpu_up_write(struct percpu_rw_semaphore *sem)
rcu_sync_exit(&sem->rss);
}
EXPORT_SYMBOL_GPL(percpu_up_write);
+
+void __percpu_up_read(struct percpu_rw_semaphore *sem)
+{
+ lockdep_assert_preemption_disabled();
+ /*
+ * slowpath; reader will only ever wake a single blocked
+ * writer.
+ */
+ smp_mb(); /* B matches C */
+ /*
+ * In other words, if they see our decrement (presumably to
+ * aggregate zero, as that is the only time it matters) they
+ * will also see our critical section.
+ */
+ this_cpu_dec(*sem->read_count);
+ rcuwait_wake_up(&sem->writer);
+}
+EXPORT_SYMBOL_GPL(__percpu_up_read);
--
2.52.0
* [PATCH v5 3/7] locking: Add contended_release tracepoint to sleepable locks
From: Dmitry Ilvokhin @ 2026-04-16 15:05 UTC
To: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long,
Thomas Bogendoerfer, Juergen Gross, Ajay Kaher, Alexey Makhalov,
Broadcom internal kernel review list, Thomas Gleixner,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Arnd Bergmann,
Dennis Zhou, Tejun Heo, Christoph Lameter, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers
Cc: linux-kernel, linux-mips, virtualization, linux-arch, linux-mm,
linux-trace-kernel, kernel-team, Paul E. McKenney,
Dmitry Ilvokhin
In-Reply-To: <cover.1776350944.git.d@ilvokhin.com>
Add the contended_release trace event. This tracepoint fires on the
holder side when a contended lock is released, complementing the
existing contention_begin/contention_end tracepoints which fire on the
waiter side.
This enables correlating lock hold time under contention with waiter
events by lock address.
Add trace_contended_release() calls to the slowpath unlock paths of
sleepable locks: mutex, rtmutex, semaphore, rwsem, percpu-rwsem, and
RT-specific rwbase locks.
Where possible, trace_contended_release() fires before the lock is
released and before the waiter is woken. For some lock types, the
tracepoint fires after the release but before the wake. Making the
placement consistent across all lock types is not worth the added
complexity.
For reader/writer locks, the tracepoint fires for every reader releasing
while a writer is waiting, not only for the last reader.
Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
---
include/trace/events/lock.h | 17 +++++++++++++++++
kernel/locking/mutex.c | 4 ++++
kernel/locking/percpu-rwsem.c | 11 +++++++++++
kernel/locking/rtmutex.c | 1 +
kernel/locking/rwbase_rt.c | 6 ++++++
kernel/locking/rwsem.c | 10 ++++++++--
kernel/locking/semaphore.c | 4 ++++
7 files changed, 51 insertions(+), 2 deletions(-)
diff --git a/include/trace/events/lock.h b/include/trace/events/lock.h
index da978f2afb45..1ded869cd619 100644
--- a/include/trace/events/lock.h
+++ b/include/trace/events/lock.h
@@ -137,6 +137,23 @@ TRACE_EVENT(contention_end,
TP_printk("%p (ret=%d)", __entry->lock_addr, __entry->ret)
);
+TRACE_EVENT(contended_release,
+
+ TP_PROTO(void *lock),
+
+ TP_ARGS(lock),
+
+ TP_STRUCT__entry(
+ __field(void *, lock_addr)
+ ),
+
+ TP_fast_assign(
+ __entry->lock_addr = lock;
+ ),
+
+ TP_printk("%p", __entry->lock_addr)
+);
+
#endif /* _TRACE_LOCK_H */
/* This part must be outside protection */
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 427187ff02db..6c2c9312eb8f 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -997,6 +997,9 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
wake_q_add(&wake_q, next);
}
+ if (trace_contended_release_enabled() && waiter)
+ trace_contended_release(lock);
+
if (owner & MUTEX_FLAG_HANDOFF)
__mutex_handoff(lock, next);
@@ -1194,6 +1197,7 @@ EXPORT_SYMBOL(ww_mutex_lock_interruptible);
EXPORT_TRACEPOINT_SYMBOL_GPL(contention_begin);
EXPORT_TRACEPOINT_SYMBOL_GPL(contention_end);
+EXPORT_TRACEPOINT_SYMBOL_GPL(contended_release);
/**
* atomic_dec_and_mutex_lock - return holding mutex if we dec to 0
diff --git a/kernel/locking/percpu-rwsem.c b/kernel/locking/percpu-rwsem.c
index f3ee7a0d6047..46b5903989b8 100644
--- a/kernel/locking/percpu-rwsem.c
+++ b/kernel/locking/percpu-rwsem.c
@@ -263,6 +263,9 @@ void percpu_up_write(struct percpu_rw_semaphore *sem)
{
rwsem_release(&sem->dep_map, _RET_IP_);
+ if (trace_contended_release_enabled() && wq_has_sleeper(&sem->waiters))
+ trace_contended_release(sem);
+
/*
* Signal the writer is done, no fast path yet.
*
@@ -292,6 +295,14 @@ EXPORT_SYMBOL_GPL(percpu_up_write);
void __percpu_up_read(struct percpu_rw_semaphore *sem)
{
lockdep_assert_preemption_disabled();
+ /*
+ * After percpu_up_write() completes, rcu_sync_is_idle() can still
+ * return false during the grace period, forcing readers into this
+ * slowpath. Only trace when a writer is actually waiting for
+ * readers to drain.
+ */
+ if (trace_contended_release_enabled() && rcuwait_active(&sem->writer))
+ trace_contended_release(sem);
/*
* slowpath; reader will only ever wake a single blocked
* writer.
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index ccaba6148b61..3db8a840b4e8 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -1466,6 +1466,7 @@ static void __sched rt_mutex_slowunlock(struct rt_mutex_base *lock)
raw_spin_lock_irqsave(&lock->wait_lock, flags);
}
+ trace_contended_release(lock);
/*
* The wakeup next waiter path does not suffer from the above
* race. See the comments there.
diff --git a/kernel/locking/rwbase_rt.c b/kernel/locking/rwbase_rt.c
index 82e078c0665a..74da5601018f 100644
--- a/kernel/locking/rwbase_rt.c
+++ b/kernel/locking/rwbase_rt.c
@@ -174,6 +174,8 @@ static void __sched __rwbase_read_unlock(struct rwbase_rt *rwb,
static __always_inline void rwbase_read_unlock(struct rwbase_rt *rwb,
unsigned int state)
{
+ if (trace_contended_release_enabled() && rt_mutex_owner(&rwb->rtmutex))
+ trace_contended_release(rwb);
/*
* rwb->readers can only hit 0 when a writer is waiting for the
* active readers to leave the critical section.
@@ -205,6 +207,8 @@ static inline void rwbase_write_unlock(struct rwbase_rt *rwb)
unsigned long flags;
raw_spin_lock_irqsave(&rtm->wait_lock, flags);
+ if (trace_contended_release_enabled() && rt_mutex_has_waiters(rtm))
+ trace_contended_release(rwb);
__rwbase_write_unlock(rwb, WRITER_BIAS, flags);
}
@@ -214,6 +218,8 @@ static inline void rwbase_write_downgrade(struct rwbase_rt *rwb)
unsigned long flags;
raw_spin_lock_irqsave(&rtm->wait_lock, flags);
+ if (trace_contended_release_enabled() && rt_mutex_has_waiters(rtm))
+ trace_contended_release(rwb);
/* Release it and account current as reader */
__rwbase_write_unlock(rwb, WRITER_BIAS - 1, flags);
}
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index bf647097369c..602d5fd3c91a 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -1387,6 +1387,8 @@ static inline void __up_read(struct rw_semaphore *sem)
rwsem_clear_reader_owned(sem);
tmp = atomic_long_add_return_release(-RWSEM_READER_BIAS, &sem->count);
DEBUG_RWSEMS_WARN_ON(tmp < 0, sem);
+ if (trace_contended_release_enabled() && (tmp & RWSEM_FLAG_WAITERS))
+ trace_contended_release(sem);
if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS)) ==
RWSEM_FLAG_WAITERS)) {
clear_nonspinnable(sem);
@@ -1413,8 +1415,10 @@ static inline void __up_write(struct rw_semaphore *sem)
preempt_disable();
rwsem_clear_owner(sem);
tmp = atomic_long_fetch_add_release(-RWSEM_WRITER_LOCKED, &sem->count);
- if (unlikely(tmp & RWSEM_FLAG_WAITERS))
+ if (unlikely(tmp & RWSEM_FLAG_WAITERS)) {
+ trace_contended_release(sem);
rwsem_wake(sem);
+ }
preempt_enable();
}
@@ -1437,8 +1441,10 @@ static inline void __downgrade_write(struct rw_semaphore *sem)
tmp = atomic_long_fetch_add_release(
-RWSEM_WRITER_LOCKED+RWSEM_READER_BIAS, &sem->count);
rwsem_set_reader_owned(sem);
- if (tmp & RWSEM_FLAG_WAITERS)
+ if (tmp & RWSEM_FLAG_WAITERS) {
+ trace_contended_release(sem);
rwsem_downgrade_wake(sem);
+ }
preempt_enable();
}
diff --git a/kernel/locking/semaphore.c b/kernel/locking/semaphore.c
index 74d41433ba13..35ac3498dca5 100644
--- a/kernel/locking/semaphore.c
+++ b/kernel/locking/semaphore.c
@@ -230,6 +230,10 @@ void __sched up(struct semaphore *sem)
sem->count++;
else
__up(sem, &wake_q);
+
+ if (trace_contended_release_enabled() && !wake_q_empty(&wake_q))
+ trace_contended_release(sem);
+
raw_spin_unlock_irqrestore(&sem->lock, flags);
if (!wake_q_empty(&wake_q))
wake_up_q(&wake_q);
--
2.52.0
^ permalink raw reply related
* [PATCH v5 4/7] locking: Factor out queued_spin_release()
From: Dmitry Ilvokhin @ 2026-04-16 15:05 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long,
Thomas Bogendoerfer, Juergen Gross, Ajay Kaher, Alexey Makhalov,
Broadcom internal kernel review list, Thomas Gleixner,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Arnd Bergmann,
Dennis Zhou, Tejun Heo, Christoph Lameter, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers
Cc: linux-kernel, linux-mips, virtualization, linux-arch, linux-mm,
linux-trace-kernel, kernel-team, Paul E. McKenney,
Dmitry Ilvokhin
In-Reply-To: <cover.1776350944.git.d@ilvokhin.com>
Introduce queued_spin_release() as an arch-overridable unlock primitive,
and make queued_spin_unlock() a generic wrapper around it.
This is a preparatory refactoring for the next commit, which adds
contended_release tracepoint instrumentation to queued_spin_unlock().
Rename the existing arch-specific queued_spin_unlock() overrides on
x86 (paravirt) and MIPS to queued_spin_release().
No functional change.
Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
---
arch/mips/include/asm/spinlock.h | 6 +++---
arch/x86/include/asm/paravirt-spinlock.h | 6 +++---
include/asm-generic/qspinlock.h | 15 ++++++++++++---
3 files changed, 18 insertions(+), 9 deletions(-)
diff --git a/arch/mips/include/asm/spinlock.h b/arch/mips/include/asm/spinlock.h
index 6ce2117e49f6..c349162f15eb 100644
--- a/arch/mips/include/asm/spinlock.h
+++ b/arch/mips/include/asm/spinlock.h
@@ -13,12 +13,12 @@
#include <asm-generic/qspinlock_types.h>
-#define queued_spin_unlock queued_spin_unlock
+#define queued_spin_release queued_spin_release
/**
- * queued_spin_unlock - release a queued spinlock
+ * queued_spin_release - release a queued spinlock
* @lock : Pointer to queued spinlock structure
*/
-static inline void queued_spin_unlock(struct qspinlock *lock)
+static inline void queued_spin_release(struct qspinlock *lock)
{
/* This could be optimised with ARCH_HAS_MMIOWB */
mmiowb();
diff --git a/arch/x86/include/asm/paravirt-spinlock.h b/arch/x86/include/asm/paravirt-spinlock.h
index 7beffcb08ed6..ac75e0736198 100644
--- a/arch/x86/include/asm/paravirt-spinlock.h
+++ b/arch/x86/include/asm/paravirt-spinlock.h
@@ -49,9 +49,9 @@ static __always_inline bool pv_vcpu_is_preempted(long cpu)
ALT_NOT(X86_FEATURE_VCPUPREEMPT));
}
-#define queued_spin_unlock queued_spin_unlock
+#define queued_spin_release queued_spin_release
/**
- * queued_spin_unlock - release a queued spinlock
+ * queued_spin_release - release a queued spinlock
* @lock : Pointer to queued spinlock structure
*
* A smp_store_release() on the least-significant byte.
@@ -66,7 +66,7 @@ static inline void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
pv_queued_spin_lock_slowpath(lock, val);
}
-static inline void queued_spin_unlock(struct qspinlock *lock)
+static inline void queued_spin_release(struct qspinlock *lock)
{
kcsan_release();
pv_queued_spin_unlock(lock);
diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h
index bf47cca2c375..df76f34645a0 100644
--- a/include/asm-generic/qspinlock.h
+++ b/include/asm-generic/qspinlock.h
@@ -115,12 +115,12 @@ static __always_inline void queued_spin_lock(struct qspinlock *lock)
}
#endif
-#ifndef queued_spin_unlock
+#ifndef queued_spin_release
/**
- * queued_spin_unlock - release a queued spinlock
+ * queued_spin_release - release a queued spinlock
* @lock : Pointer to queued spinlock structure
*/
-static __always_inline void queued_spin_unlock(struct qspinlock *lock)
+static __always_inline void queued_spin_release(struct qspinlock *lock)
{
/*
* unlock() needs release semantics:
@@ -129,6 +129,15 @@ static __always_inline void queued_spin_unlock(struct qspinlock *lock)
}
#endif
+/**
+ * queued_spin_unlock - unlock a queued spinlock
+ * @lock : Pointer to queued spinlock structure
+ */
+static __always_inline void queued_spin_unlock(struct qspinlock *lock)
+{
+ queued_spin_release(lock);
+}
+
#ifndef virt_spin_lock
static __always_inline bool virt_spin_lock(struct qspinlock *lock)
{
--
2.52.0
^ permalink raw reply related
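Editor's sketch for the patch above: the "#define name name" plus "#ifndef name" idiom is what lets an architecture replace queued_spin_release() while the generic queued_spin_unlock() wrapper is defined exactly once. It can be tried in userspace as below. Every arch_demo_* name is hypothetical, the "arch" and "generic" halves are collapsed into one file, and a C11 release store stands in for smp_store_release(); this is an illustration of the idiom, not the kernel implementation.

```c
#include <assert.h>
#include <stdatomic.h>

struct arch_demo_lock {
	atomic_int locked;
	int arch_hook_calls;	/* counts how often the override ran */
};

/*
 * "Arch" header: supplies its own release primitive and announces it
 * with a macro of the same name, so the generic header can detect it.
 */
#define arch_demo_release arch_demo_release
static inline void arch_demo_release(struct arch_demo_lock *lock)
{
	lock->arch_hook_calls++;	/* stand-in for arch-specific work */
	atomic_store_explicit(&lock->locked, 0, memory_order_release);
}

/* "Generic" header: fallback, compiled only when no override exists. */
#ifndef arch_demo_release
static inline void arch_demo_release(struct arch_demo_lock *lock)
{
	atomic_store_explicit(&lock->locked, 0, memory_order_release);
}
#endif

/* Generic wrapper: never overridden, always calls the selected release. */
static inline void arch_demo_unlock(struct arch_demo_lock *lock)
{
	assert(lock);
	arch_demo_release(lock);
}
```

Removing the #define and the first definition makes the fallback take over without touching arch_demo_unlock(), which is why later instrumentation can be added to the wrapper alone.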
* [PATCH v5 6/7] locking: Factor out __queued_read_unlock()/__queued_write_unlock()
From: Dmitry Ilvokhin @ 2026-04-16 15:05 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long,
Thomas Bogendoerfer, Juergen Gross, Ajay Kaher, Alexey Makhalov,
Broadcom internal kernel review list, Thomas Gleixner,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Arnd Bergmann,
Dennis Zhou, Tejun Heo, Christoph Lameter, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers
Cc: linux-kernel, linux-mips, virtualization, linux-arch, linux-mm,
linux-trace-kernel, kernel-team, Paul E. McKenney,
Dmitry Ilvokhin
In-Reply-To: <cover.1776350944.git.d@ilvokhin.com>
This is a preparatory refactoring for the next commit, which adds
contended_release tracepoint instrumentation and needs to call the
unlock from both traced and non-traced paths.
No functional change.
Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
---
include/asm-generic/qrwlock.h | 20 +++++++++++++++-----
1 file changed, 15 insertions(+), 5 deletions(-)
diff --git a/include/asm-generic/qrwlock.h b/include/asm-generic/qrwlock.h
index 75b8f4601b28..4b627bafba8b 100644
--- a/include/asm-generic/qrwlock.h
+++ b/include/asm-generic/qrwlock.h
@@ -101,16 +101,26 @@ static inline void queued_write_lock(struct qrwlock *lock)
queued_write_lock_slowpath(lock);
}
+static __always_inline void __queued_read_unlock(struct qrwlock *lock)
+{
+ /*
+ * Atomically decrement the reader count
+ */
+ (void)atomic_sub_return_release(_QR_BIAS, &lock->cnts);
+}
+
/**
* queued_read_unlock - release read lock of a queued rwlock
* @lock : Pointer to queued rwlock structure
*/
static inline void queued_read_unlock(struct qrwlock *lock)
{
- /*
- * Atomically decrement the reader count
- */
- (void)atomic_sub_return_release(_QR_BIAS, &lock->cnts);
+ __queued_read_unlock(lock);
+}
+
+static __always_inline void __queued_write_unlock(struct qrwlock *lock)
+{
+ smp_store_release(&lock->wlocked, 0);
}
/**
@@ -119,7 +129,7 @@ static inline void queued_read_unlock(struct qrwlock *lock)
*/
static inline void queued_write_unlock(struct qrwlock *lock)
{
- smp_store_release(&lock->wlocked, 0);
+ __queued_write_unlock(lock);
}
/**
--
2.52.0
^ permalink raw reply related
* [PATCH v5 5/7] locking: Add contended_release tracepoint to qspinlock
From: Dmitry Ilvokhin @ 2026-04-16 15:05 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long,
Thomas Bogendoerfer, Juergen Gross, Ajay Kaher, Alexey Makhalov,
Broadcom internal kernel review list, Thomas Gleixner,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Arnd Bergmann,
Dennis Zhou, Tejun Heo, Christoph Lameter, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers
Cc: linux-kernel, linux-mips, virtualization, linux-arch, linux-mm,
linux-trace-kernel, kernel-team, Paul E. McKenney,
Dmitry Ilvokhin
In-Reply-To: <cover.1776350944.git.d@ilvokhin.com>
Use the arch-overridable queued_spin_release(), introduced in the
previous commit, to ensure the tracepoint works correctly across all
architectures, including those with custom unlock implementations (e.g.
x86 paravirt).
When the tracepoint is disabled, the only addition to the hot path is a
single NOP instruction (the static branch). When enabled, the contention
check, trace call, and unlock are combined in an out-of-line function to
minimize hot path impact, avoiding the compiler needing to preserve the
lock pointer in a callee-saved register across the trace call.
Binary size impact (x86_64, defconfig):
uninlined unlock (common case): +680 bytes (+0.00%)
inlined unlock (worst case): +83659 bytes (+0.21%)
The inlined unlock case could not be achieved through Kconfig options on
x86_64 as PREEMPT_BUILD unconditionally selects UNINLINE_SPIN_UNLOCK on
x86_64. The UNINLINE_SPIN_UNLOCK guards were manually inverted to force
inline the unlock path and estimate the worst case binary size increase.
In practice, configurations with UNINLINE_SPIN_UNLOCK=n have already
opted against binary size optimization, so the inlined worst case is
unlikely to be a concern.
Architectures with fully custom qspinlock implementations (e.g.
PowerPC) are not covered by this change.
Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
---
include/asm-generic/qspinlock.h | 18 ++++++++++++++++++
kernel/locking/qspinlock.c | 8 ++++++++
2 files changed, 26 insertions(+)
diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h
index df76f34645a0..915a4c2777f6 100644
--- a/include/asm-generic/qspinlock.h
+++ b/include/asm-generic/qspinlock.h
@@ -41,6 +41,7 @@
#include <asm-generic/qspinlock_types.h>
#include <linux/atomic.h>
+#include <linux/tracepoint-defs.h>
#ifndef queued_spin_is_locked
/**
@@ -129,12 +130,29 @@ static __always_inline void queued_spin_release(struct qspinlock *lock)
}
#endif
+DECLARE_TRACEPOINT(contended_release);
+
+extern void queued_spin_release_traced(struct qspinlock *lock);
+
/**
* queued_spin_unlock - unlock a queued spinlock
* @lock : Pointer to queued spinlock structure
+ *
+ * Generic tracing wrapper around the arch-overridable
+ * queued_spin_release().
*/
static __always_inline void queued_spin_unlock(struct qspinlock *lock)
{
+ /*
+ * Trace and release are combined in queued_spin_release_traced() so
+ * the compiler does not need to preserve the lock pointer across the
+ * function call, avoiding callee-saved register save/restore on the
+ * hot path.
+ */
+ if (tracepoint_enabled(contended_release)) {
+ queued_spin_release_traced(lock);
+ return;
+ }
queued_spin_release(lock);
}
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index af8d122bb649..c72610980ec7 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -104,6 +104,14 @@ static __always_inline u32 __pv_wait_head_or_lock(struct qspinlock *lock,
#define queued_spin_lock_slowpath native_queued_spin_lock_slowpath
#endif
+void __lockfunc queued_spin_release_traced(struct qspinlock *lock)
+{
+ if (queued_spin_is_contended(lock))
+ trace_contended_release(lock);
+ queued_spin_release(lock);
+}
+EXPORT_SYMBOL(queued_spin_release_traced);
+
#endif /* _GEN_PV_LOCK_SLOWPATH */
/**
--
2.52.0
^ permalink raw reply related
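Editor's sketch for the patch above: the commit message describes an inline unlock that only tests whether tracing is on, and an out-of-line function that does the contention check, the trace call and the release together. A rough userspace analogue follows. All traced_demo_* names are hypothetical, a plain bool replaces the static key (so the disabled path here is a load and a branch, not a patched NOP), a counter replaces the tracepoint, and the noinline attribute and __builtin_expect() assume GCC or Clang.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct traced_demo_lock {
	atomic_int locked;
	atomic_int waiters;	/* stand-in for "is contended" */
};

static bool traced_demo_enabled;	/* stand-in for the static key */
static int traced_demo_hits;		/* counts emitted "events" */

static inline void traced_demo_release(struct traced_demo_lock *lock)
{
	atomic_store_explicit(&lock->locked, 0, memory_order_release);
}

/*
 * Out of line: contention check, trace and release are kept together,
 * so the inline caller has nothing left to do after the call.
 */
static __attribute__((noinline))
void traced_demo_release_traced(struct traced_demo_lock *lock)
{
	if (atomic_load_explicit(&lock->waiters, memory_order_relaxed) > 0)
		traced_demo_hits++;
	traced_demo_release(lock);
}

static inline void traced_demo_unlock(struct traced_demo_lock *lock)
{
	assert(lock);
	if (__builtin_expect(traced_demo_enabled, 0)) {
		traced_demo_release_traced(lock);
		return;
	}
	traced_demo_release(lock);
}
```

Because the lock pointer is not used after the call in the enabled branch, the caller does not have to keep it live across the call, which is the register-pressure argument the commit message makes.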
* [PATCH v5 7/7] locking: Add contended_release tracepoint to qrwlock
From: Dmitry Ilvokhin @ 2026-04-16 15:05 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long,
Thomas Bogendoerfer, Juergen Gross, Ajay Kaher, Alexey Makhalov,
Broadcom internal kernel review list, Thomas Gleixner,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Arnd Bergmann,
Dennis Zhou, Tejun Heo, Christoph Lameter, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers
Cc: linux-kernel, linux-mips, virtualization, linux-arch, linux-mm,
linux-trace-kernel, kernel-team, Paul E. McKenney,
Dmitry Ilvokhin
In-Reply-To: <cover.1776350944.git.d@ilvokhin.com>
Extend the contended_release tracepoint to queued rwlocks, using the
same out-of-line traced unlock approach as queued spinlocks.
Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
---
include/asm-generic/qrwlock.h | 22 ++++++++++++++++++++++
kernel/locking/qrwlock.c | 16 ++++++++++++++++
2 files changed, 38 insertions(+)
diff --git a/include/asm-generic/qrwlock.h b/include/asm-generic/qrwlock.h
index 4b627bafba8b..274c19006125 100644
--- a/include/asm-generic/qrwlock.h
+++ b/include/asm-generic/qrwlock.h
@@ -14,6 +14,7 @@
#define __ASM_GENERIC_QRWLOCK_H
#include <linux/atomic.h>
+#include <linux/tracepoint-defs.h>
#include <asm/barrier.h>
#include <asm/processor.h>
@@ -35,6 +36,10 @@
*/
extern void queued_read_lock_slowpath(struct qrwlock *lock);
extern void queued_write_lock_slowpath(struct qrwlock *lock);
+extern void queued_read_unlock_traced(struct qrwlock *lock);
+extern void queued_write_unlock_traced(struct qrwlock *lock);
+
+DECLARE_TRACEPOINT(contended_release);
/**
* queued_read_trylock - try to acquire read lock of a queued rwlock
@@ -115,6 +120,17 @@ static __always_inline void __queued_read_unlock(struct qrwlock *lock)
*/
static inline void queued_read_unlock(struct qrwlock *lock)
{
+ /*
+ * Trace and unlock are combined in the traced unlock variant so
+ * the compiler does not need to preserve the lock pointer across
+ * the function call, avoiding callee-saved register save/restore
+ * on the hot path.
+ */
+ if (tracepoint_enabled(contended_release)) {
+ queued_read_unlock_traced(lock);
+ return;
+ }
+
__queued_read_unlock(lock);
}
@@ -129,6 +145,12 @@ static __always_inline void __queued_write_unlock(struct qrwlock *lock)
*/
static inline void queued_write_unlock(struct qrwlock *lock)
{
+ /* See comment in queued_read_unlock(). */
+ if (tracepoint_enabled(contended_release)) {
+ queued_write_unlock_traced(lock);
+ return;
+ }
+
__queued_write_unlock(lock);
}
diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c
index d2ef312a8611..5f7a0fc2b27a 100644
--- a/kernel/locking/qrwlock.c
+++ b/kernel/locking/qrwlock.c
@@ -90,3 +90,19 @@ void __lockfunc queued_write_lock_slowpath(struct qrwlock *lock)
trace_contention_end(lock, 0);
}
EXPORT_SYMBOL(queued_write_lock_slowpath);
+
+void __lockfunc queued_read_unlock_traced(struct qrwlock *lock)
+{
+ if (queued_rwlock_is_contended(lock))
+ trace_contended_release(lock);
+ __queued_read_unlock(lock);
+}
+EXPORT_SYMBOL(queued_read_unlock_traced);
+
+void __lockfunc queued_write_unlock_traced(struct qrwlock *lock)
+{
+ if (queued_rwlock_is_contended(lock))
+ trace_contended_release(lock);
+ __queued_write_unlock(lock);
+}
+EXPORT_SYMBOL(queued_write_unlock_traced);
--
2.52.0
^ permalink raw reply related
* Re: [RFC PATCH 2/4] rv/tlob: Add tlob deterministic automaton monitor
From: Wen Yang @ 2026-04-16 15:09 UTC (permalink / raw)
To: Gabriele Monaco
Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
linux-trace-kernel, linux-kernel
In-Reply-To: <74a624434b59c00f9407909b8696f041536d9418.camel@redhat.com>
On 4/13/26 16:19, Gabriele Monaco wrote:
> On Mon, 2026-04-13 at 03:27 +0800, wen.yang@linux.dev wrote:
>> From: Wen Yang <wen.yang@linux.dev>
>>
>> Add the tlob (task latency over budget) RV monitor. tlob tracks the
>> monotonic elapsed time (CLOCK_MONOTONIC) of a marked per-task code
>> path, including time off-CPU, and fires a per-task hrtimer when the
>> elapsed time exceeds a configurable budget.
>>
>> Three-state DA (unmonitored/on_cpu/off_cpu) driven by trace_start,
>> switch_in/out, and budget_expired events. Per-task state lives in a
>> fixed-size hash table (TLOB_MAX_MONITORED slots) with RCU-deferred
>> free.
>>
>> Two userspace interfaces:
>> - tracefs: uprobe pair registration via the monitor file using the
>> format "pid:threshold_us:offset_start:offset_stop:binary_path"
>> - /dev/rv ioctls (CONFIG_RV_CHARDEV): TLOB_IOCTL_TRACE_START /
>> TRACE_STOP; TRACE_STOP returns -EOVERFLOW on violation
>>
>> Each /dev/rv fd has a per-fd mmap ring buffer (physically contiguous
>> pages). A control page (struct tlob_mmap_page) at offset 0 exposes
>> head/tail/dropped for lockless userspace reads; struct tlob_event
>> records follow at data_offset. Drop-new policy on overflow.
>>
>> UAPI: include/uapi/linux/rv.h (tlob_start_args, tlob_event,
>> tlob_mmap_page, ioctl numbers), monitor_tlob.rst,
>> ioctl-number.rst (RV_IOC_MAGIC=0xB9).
>>
>
> I'm not fully grasping all the requirements for the monitors yet, but I see you
> are reimplementing a lot of functionality in the monitor itself rather than
> within RV, let's see if we can consolidate some of them:
>
> * you're using timer expirations, can we do it with timed automata? [1]
> * RV automata usually don't have an /unmonitored/ state, your trace_start event
> would be the start condition (da_event_start) and the monitor will get non-
> running at each violation (it calls da_monitor_reset() automatically), all
> setup/cleanup logic should be handled implicitly within RV. I believe that would
> also save you that ugly trace_event_tlob() redefinition.
> * you're maintaining a local hash table for each task_struct, that could use
> the per-object monitors [2] where your "object" is in fact your struct,
> allocated when you start the monitor with all appropriate fields and indexed by
> pid
> * you are handling violations manually, considering timed automata trigger a
> full fledged violation on timeouts, can you use the RV-way (error tracepoints or
> reactors only)? Do you need the additional reporting within the
> tracepoint/ioctl? Cannot the userspace consumer deduce all those from other
> events and let RV do just the monitoring?
> * I like the uprobe thing, we could probably move all that to a common helper
> once we figure out how to make it generic.
>
> Note: [1] and [2] didn't reach upstream yet, but should reach linux-next soon.
>
Thanks for the review. Here's my plan for each point -- let me know if
the direction looks right.
- Timed automata
The HA framework [1] is a good match when the timeout threshold is
global or state-determined, but tlob needs a per-invocation threshold
supplied at TRACE_START time -- fitting that into HA would require
framework changes.
My plan is to use da_monitor_init_hook() -- the same mechanism HA
monitors use internally -- to arm the per-invocation hrtimer once
da_create_storage() has stored the monitor_target. This gives the same
"timer fires => violation" semantics without touching the HA infrastructure.
If you see a cleaner way to pass per-invocation data through HA I'm
happy to go that route.
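Editor's sketch of what "per-invocation threshold" means in the paragraph above: the budget is an argument of each start call rather than a property of the monitor or of a state. The userspace analogue below stores a CLOCK_MONOTONIC deadline per window. It deliberately swaps the technique: instead of arming a one-shot hrtimer it checks the deadline on demand, and every demo_window_* name is hypothetical rather than part of RV or tlob.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical window: the budget travels with each start call. */
struct demo_window {
	uint64_t deadline_ns;
};

static uint64_t demo_now_ns(void)
{
	struct timespec ts;
	int ret = clock_gettime(CLOCK_MONOTONIC, &ts);

	assert(ret == 0);
	(void)ret;
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Start a window with its own threshold, given in microseconds. */
static void demo_window_start(struct demo_window *w, uint64_t threshold_us)
{
	w->deadline_ns = demo_now_ns() + threshold_us * 1000ull;
}

/* True once the monotonic clock has passed this window's deadline. */
static bool demo_window_expired(const struct demo_window *w)
{
	return demo_now_ns() > w->deadline_ns;
}
```

Two windows started back to back with different thresholds expire independently, which is the property a global or state-determined timeout cannot express.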
- Unmonitored state / da_handle_start_event
Fair point. I'll drop the explicit unmonitored state and the
trace_event_tlob() redefinition. tlob_start_task() will use
da_handle_start_event() to allocate storage, set initial state to on_cpu,
and fire the init hook to arm the timer in one shot. tlob_stop_task()
calls da_monitor_reset() directly.
- Per-object monitors
Will do. The custom hash table goes away; I'll switch to RV_MON_PER_OBJ
with:
typedef struct tlob_task_state *monitor_target;
da_get_target_by_id() handles the sched_switch hot path lookup.
- RV-way violations
Agreed. budget_expired will be declared INVALID in all states so the
framework calls react() (error_tlob tracepoint + any registered reactor)
and da_monitor_reset() automatically. tlob won't emit any tracepoint of
its own.
One note on the /dev/rv ioctl: TLOB_IOCTL_TRACE_STOP returns -EOVERFLOW
to the caller when the budget was exceeded. This is just a syscall
return code -- not a second reporting path -- to let in-process
instrumentation react inline without polling the trace buffer.
Let me know if you have concerns about keeping this.
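Editor's sketch of the plan in this item: a two-state table-driven automaton in which budget_expired has no valid transition from any state, so the generic event handler reports the violation and resets the monitor, and the stop call only translates that into a return code. All demo_* names are hypothetical, the table is hand-written rather than generated by the RV tooling, and the switch_in self-loop in on_cpu is an assumption that the thread does not specify.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

enum demo_state {
	DEMO_ON_CPU,
	DEMO_OFF_CPU,
	DEMO_STATE_MAX,
	DEMO_INVALID = DEMO_STATE_MAX,
};

enum demo_event {
	DEMO_SWITCH_OUT,
	DEMO_SWITCH_IN,
	DEMO_SCHED_WAKEUP,
	DEMO_BUDGET_EXPIRED,
	DEMO_EVENT_MAX,
};

/* budget_expired is invalid everywhere, so it always means a violation. */
static const enum demo_state demo_next[DEMO_STATE_MAX][DEMO_EVENT_MAX] = {
	[DEMO_ON_CPU] = {
		[DEMO_SWITCH_OUT]     = DEMO_OFF_CPU,
		[DEMO_SWITCH_IN]      = DEMO_ON_CPU,	/* assumption */
		[DEMO_SCHED_WAKEUP]   = DEMO_ON_CPU,
		[DEMO_BUDGET_EXPIRED] = DEMO_INVALID,
	},
	[DEMO_OFF_CPU] = {
		[DEMO_SWITCH_OUT]     = DEMO_OFF_CPU,
		[DEMO_SWITCH_IN]      = DEMO_ON_CPU,
		[DEMO_SCHED_WAKEUP]   = DEMO_OFF_CPU,
		[DEMO_BUDGET_EXPIRED] = DEMO_INVALID,
	},
};

struct demo_monitor {
	bool running;
	bool violated;
	enum demo_state state;
};

/* Start event: begin monitoring directly in on_cpu. */
static void demo_start(struct demo_monitor *m)
{
	m->running = true;
	m->violated = false;
	m->state = DEMO_ON_CPU;
}

/* Generic handler: an invalid transition records a violation and resets. */
static void demo_handle_event(struct demo_monitor *m, enum demo_event ev)
{
	enum demo_state next;

	assert(ev < DEMO_EVENT_MAX);
	if (!m->running)
		return;

	next = demo_next[m->state][ev];
	if (next == DEMO_INVALID) {
		m->violated = true;	/* where a reactor would run */
		m->running = false;	/* monitor stops until restarted */
		return;
	}
	m->state = next;
}

/* Stop: report an earlier violation as a plain return code. */
static int demo_stop(struct demo_monitor *m)
{
	bool violated = m->violated;

	m->running = false;
	m->violated = false;
	return violated ? -EOVERFLOW : 0;
}
```

The stop path adds no reporting of its own: it only reads the flag the generic handler already set, which matches the description of -EOVERFLOW as a return code rather than a second reporting path.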
- Generic uprobe helper
Proposed interface:
struct rv_uprobe *rv_uprobe_attach_path(
struct path *path, loff_t offset,
int (*entry_fn)(struct rv_uprobe *, struct pt_regs *, __u64 *),
int (*ret_fn) (struct rv_uprobe *, unsigned long func,
struct pt_regs *, __u64 *),
void *priv);
struct rv_uprobe *rv_uprobe_attach(
const char *binpath, loff_t offset,
int (*entry_fn)(struct rv_uprobe *, struct pt_regs *, __u64 *),
int (*ret_fn) (struct rv_uprobe *, unsigned long func,
struct pt_regs *, __u64 *),
void *priv);
void rv_uprobe_detach(struct rv_uprobe *p);
struct rv_uprobe exposes three read-only fields to monitors (offset,
priv, path); the uprobe_consumer and callbacks would be kept private to
the implementation, so monitors need not include <linux/uprobes.h>.
rv_uprobe_attach() resolves the path and delegates to
rv_uprobe_attach_path(); the latter avoids a redundant kern_path() when
registering multiple probes on the same binary:
kern_path(binpath, LOOKUP_FOLLOW, &path);
b->start = rv_uprobe_attach_path(&path, offset_start, entry_fn,
NULL, b);
b->stop = rv_uprobe_attach_path(&path, offset_stop, stop_fn,
NULL, b);
path_put(&path);
Does the interface look reasonable, or did you have a different shape in
mind?
--
Best wishes,
Wen
>
> [1] -
> https://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git/commit/?h=rv/for-next&id=f5587d1b6ec938afb2f74fe399a68020d66923e4
> [2] -
> https://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git/commit/?h=rv/for-next&id=da282bf7fadb095ee0a40c32ff0126429c769b45
>
>> Signed-off-by: Wen Yang <wen.yang@linux.dev>
>> ---
>> Documentation/trace/rv/index.rst | 1 +
>> Documentation/trace/rv/monitor_tlob.rst | 381 +++++++
>> .../userspace-api/ioctl/ioctl-number.rst | 1 +
>> include/uapi/linux/rv.h | 181 ++++
>> kernel/trace/rv/Kconfig | 17 +
>> kernel/trace/rv/Makefile | 2 +
>> kernel/trace/rv/monitors/tlob/Kconfig | 51 +
>> kernel/trace/rv/monitors/tlob/tlob.c | 986 ++++++++++++++++++
>> kernel/trace/rv/monitors/tlob/tlob.h | 145 +++
>> kernel/trace/rv/monitors/tlob/tlob_trace.h | 42 +
>> kernel/trace/rv/rv.c | 4 +
>> kernel/trace/rv/rv_dev.c | 602 +++++++++++
>> kernel/trace/rv/rv_trace.h | 50 +
>> 13 files changed, 2463 insertions(+)
>> create mode 100644 Documentation/trace/rv/monitor_tlob.rst
>> create mode 100644 include/uapi/linux/rv.h
>> create mode 100644 kernel/trace/rv/monitors/tlob/Kconfig
>> create mode 100644 kernel/trace/rv/monitors/tlob/tlob.c
>> create mode 100644 kernel/trace/rv/monitors/tlob/tlob.h
>> create mode 100644 kernel/trace/rv/monitors/tlob/tlob_trace.h
>> create mode 100644 kernel/trace/rv/rv_dev.c
>>
>> diff --git a/Documentation/trace/rv/index.rst b/Documentation/trace/rv/index.rst
>> index a2812ac5c..4f2bfaf38 100644
>> --- a/Documentation/trace/rv/index.rst
>> +++ b/Documentation/trace/rv/index.rst
>> @@ -15,3 +15,4 @@ Runtime Verification
>> monitor_wwnr.rst
>> monitor_sched.rst
>> monitor_rtapp.rst
>> + monitor_tlob.rst
>> diff --git a/Documentation/trace/rv/monitor_tlob.rst b/Documentation/trace/rv/monitor_tlob.rst
>> new file mode 100644
>> index 000000000..d498e9894
>> --- /dev/null
>> +++ b/Documentation/trace/rv/monitor_tlob.rst
>> @@ -0,0 +1,381 @@
>> +.. SPDX-License-Identifier: GPL-2.0
>> +
>> +Monitor tlob
>> +============
>> +
>> +- Name: tlob - task latency over budget
>> +- Type: per-task deterministic automaton
>> +- Author: Wen Yang <wen.yang@linux.dev>
>> +
>> +Description
>> +-----------
>> +
>> +The tlob monitor tracks per-task elapsed time (CLOCK_MONOTONIC, including
>> +both on-CPU and off-CPU time) and reports a violation when the monitored
>> +task exceeds a configurable latency budget threshold.
>> +
>> +The monitor implements a three-state deterministic automaton::
>> +
>> + |
>> + | (initial)
>> + v
>> + +--------------+
>> + +-------> | unmonitored |
>> + | +--------------+
>> + | |
>> + | trace_start
>> + | v
>> + | +--------------+
>> + | | on_cpu |
>> + | +--------------+
>> + | | |
>> + | switch_out| | trace_stop / budget_expired
>> + | v v
>> + | +--------------+ (unmonitored)
>> + | | off_cpu |
>> + | +--------------+
>> + | | |
>> + | | switch_in| trace_stop / budget_expired
>> + | v v
>> + | (on_cpu) (unmonitored)
>> + |
>> + +-- trace_stop (from on_cpu or off_cpu)
>> +
>> + Key transitions:
>> + unmonitored --(trace_start)--> on_cpu
>> + on_cpu --(switch_out)--> off_cpu
>> + off_cpu --(switch_in)--> on_cpu
>> + on_cpu --(trace_stop)--> unmonitored
>> + off_cpu --(trace_stop)--> unmonitored
>> + on_cpu --(budget_expired)-> unmonitored [violation]
>> + off_cpu --(budget_expired)-> unmonitored [violation]
>> +
>> + sched_wakeup self-loops in on_cpu and unmonitored; switch_out and
>> + sched_wakeup self-loop in off_cpu. budget_expired is fired by the one-shot hrtimer; it always
>> + transitions to unmonitored regardless of whether the task is on-CPU
>> + or off-CPU when the timer fires.
>> +
>> +State Descriptions
>> +------------------
>> +
>> +- **unmonitored**: Task is not being traced. Scheduling events
>> + (``switch_in``, ``switch_out``, ``sched_wakeup``) are silently
>> + ignored (self-loop). The monitor waits for a ``trace_start`` event
>> + to begin a new observation window.
>> +
>> +- **on_cpu**: Task is running on the CPU with the deadline timer armed.
>> + A one-shot hrtimer was set for ``threshold_us`` microseconds at
>> + ``trace_start`` time. A ``switch_out`` event transitions to
>> + ``off_cpu``; the hrtimer keeps running (off-CPU time counts toward
>> + the budget). A ``trace_stop`` cancels the timer and returns to
>> + ``unmonitored`` (normal completion). If the hrtimer fires
>> + (``budget_expired``) the violation is recorded and the automaton
>> + transitions to ``unmonitored``.
>> +
>> +- **off_cpu**: Task was preempted or blocked. The one-shot hrtimer
>> + continues to run. A ``switch_in`` event returns to ``on_cpu``.
>> + A ``trace_stop`` cancels the timer and returns to ``unmonitored``.
>> + If the hrtimer fires (``budget_expired``) while the task is off-CPU,
>> + the violation is recorded and the automaton transitions to
>> + ``unmonitored``.
>> +
>> +Rationale
>> +---------
>> +
>> +The per-task latency budget threshold allows operators to express timing
>> +requirements in microseconds and receive an immediate ftrace event when a
>> +task exceeds its budget. This is useful for real-time tasks
>> +(``SCHED_FIFO`` / ``SCHED_DEADLINE``) where total elapsed time must
>> +remain within a known bound.
>> +
>> +Each task has an independent threshold, so up to ``TLOB_MAX_MONITORED``
>> +(64) tasks with different timing requirements can be monitored
>> +simultaneously.
>> +
>> +On threshold violation the automaton records a ``tlob_budget_exceeded``
>> +ftrace event carrying the final on-CPU / off-CPU time breakdown, but does
>> +not kill or throttle the task. Monitoring can be restarted by issuing a
>> +new ``trace_start`` event (or a new ``TLOB_IOCTL_TRACE_START`` ioctl).
>> +
>> +A per-task one-shot hrtimer is armed at ``trace_start`` for exactly
>> +``threshold_us`` microseconds. It fires at most once per monitoring
>> +window, performs an O(1) hash lookup, records the violation, and injects
>> +the ``budget_expired`` event into the DA. When ``CONFIG_RV_MON_TLOB``
>> +is not set there is zero runtime cost.
>> +
>> +Usage
>> +-----
>> +
>> +tracefs interface (uprobe-based external monitoring)
>> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> +
>> +The ``monitor`` tracefs file allows any privileged user to instrument an
>> +unmodified binary via uprobes, without changing its source code. Write a
>> +four-field record to attach two plain entry uprobes: one at
>> +``offset_start`` fires ``tlob_start_task()`` and one at ``offset_stop``
>> +fires ``tlob_stop_task()``, so the latency budget covers exactly the code
>> +region between the two offsets::
>> +
>> + threshold_us:offset_start:offset_stop:binary_path
>> +
>> +``binary_path`` comes last so it may freely contain ``:`` (e.g. paths
>> +inside a container namespace).
>> +
>> +The uprobes fire for every task that executes the probed instruction in
>> +the binary, consistent with the native uprobe semantics. All tasks that
>> +execute the code region get independent per-task monitoring slots.
>> +
>> +Using two plain entry uprobes (rather than a uretprobe for the stop) means
>> +that a mistyped offset can never corrupt the call stack; the worst outcome
>> +of a bad ``offset_stop`` is a missed stop that causes the hrtimer to fire
>> +and report a budget violation.
>> +
>> +Example -- monitor a code region in ``/usr/bin/myapp`` with a 5 ms
>> +budget, where the region starts at offset 0x12a0 and ends at 0x12f0::
>> +
>> + echo 1 > /sys/kernel/tracing/rv/monitors/tlob/enable
>> +
>> + # Bind uprobes: start probe starts the clock, stop probe stops it
>> + echo "5000:0x12a0:0x12f0:/usr/bin/myapp" \
>> + > /sys/kernel/tracing/rv/monitors/tlob/monitor
>> +
>> + # Remove the uprobe binding for this code region
>> + echo "-0x12a0:/usr/bin/myapp" > /sys/kernel/tracing/rv/monitors/tlob/monitor
>> +
>> + # List registered uprobe bindings (mirrors the write format)
>> + cat /sys/kernel/tracing/rv/monitors/tlob/monitor
>> + # -> 5000:0x12a0:0x12f0:/usr/bin/myapp
>> +
>> + # Read violations from the trace buffer
>> + cat /sys/kernel/tracing/trace
>> +
>> +Up to ``TLOB_MAX_MONITORED`` tasks may be monitored simultaneously.
>> +
>> +The offsets can be obtained with ``nm`` or ``readelf``::
>> +
>> + nm -n /usr/bin/myapp | grep my_function
>> + # -> 0000000000012a0 T my_function
>> +
>> + readelf -s /usr/bin/myapp | grep my_function
>> + # -> 42: 0000000000012a0 336 FUNC GLOBAL DEFAULT 13 my_function
>> +
>> + # offset_start = 0x12a0 (function entry)
>> + # offset_stop = 0x12a0 + 0x50 = 0x12f0 (or any instruction before return)
>> +
>> +Notes:
>> +
>> +- The uprobes fire for every task that executes the probed instruction,
>> + so concurrent calls from different threads each get independent
>> + monitoring slots.
>> +- ``offset_stop`` need not be a function return; it can be any instruction
>> + within the region. If the stop probe is never reached (e.g. early exit
>> + path bypasses it), the hrtimer fires and a budget violation is reported.
>> +- Each ``(binary_path, offset_start)`` pair may only be registered once.
>> + A second write with the same ``offset_start`` for the same binary is
>> + rejected with ``-EEXIST``. Two entry uprobes at the same address would
>> + both fire for every task, causing ``tlob_start_task()`` to be called
>> + twice; the second call would silently fail with ``-EEXIST`` and the
>> + second binding's threshold would never take effect. Different code
>> + regions that share the same ``offset_stop`` (common exit point) are
>> + explicitly allowed.
>> +- The uprobe binding is removed when ``-offset_start:binary_path`` is
>> + written to ``monitor``, or when the monitor is disabled.
>> +- The ``tag`` field in every ``tlob_budget_exceeded`` event is
>> + automatically set to ``offset_start`` for the tracefs path, so
>> + violation events for different code regions are immediately
>> + distinguishable even when ``threshold_us`` values are identical.
>> +
>> +ftrace ring buffer (budget violation events)
>> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> +
>> +When a monitored task exceeds its latency budget the hrtimer fires,
>> +records the violation, and emits a single ``tlob_budget_exceeded`` event
>> +into the ftrace ring buffer. **Nothing is written to the ftrace ring
>> +buffer while the task is within budget.**
>> +
>> +The event carries the on-CPU / off-CPU time breakdown so that root-cause
>> +analysis (CPU-bound vs. scheduling / I/O overrun) is immediate::
>> +
>> + cat /sys/kernel/tracing/trace
>> +
>> +Example output::
>> +
>> + myapp-1234 [003] .... 12345.678: tlob_budget_exceeded: \
>> + myapp[1234]: budget exceeded threshold=5000 \
>> + on_cpu=820 off_cpu=4500 switches=3 state=off_cpu tag=0x00000000000012a0
>> +
>> +Field descriptions:
>> +
>> +``threshold``
>> + Configured latency budget in microseconds.
>> +
>> +``on_cpu``
>> + Cumulative on-CPU time since ``trace_start``, in microseconds.
>> +
>> +``off_cpu``
>> + Cumulative off-CPU (scheduling + I/O wait) time since ``trace_start``,
>> + in microseconds.
>> +
>> +``switches``
>> + Number of times the task was scheduled out during this window.
>> +
>> +``state``
>> + DA state when the hrtimer fired: ``on_cpu`` means the task was executing
>> + when the budget expired (CPU-bound overrun); ``off_cpu`` means the task
>> + was preempted or blocked (scheduling / I/O overrun).
>> +
>> +``tag``
>> + Opaque 64-bit cookie supplied by the caller via ``tlob_start_args.tag``
>> + (ioctl path) or automatically set to ``offset_start`` (tracefs uprobe
>> + path). Use it to distinguish violations from different code regions
>> + monitored by the same thread. Zero when not set.
>> +
>> +To capture violations in a file::
>> +
>> + trace-cmd record -e tlob_budget_exceeded &
>> + # ... run workload ...
>> + trace-cmd report
>> +
>> +/dev/rv ioctl interface (self-instrumentation)
>> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> +
>> +Tasks can self-instrument their own code paths via the ``/dev/rv`` misc
>> +device (requires ``CONFIG_RV_CHARDEV``). The kernel key is
>> +``task_struct``; multiple threads sharing a single fd each get their own
>> +independent monitoring slot.
>> +
>> +**Synchronous mode** -- the calling thread checks its own result::
>> +
>> + int fd = open("/dev/rv", O_RDWR);
>> +
>> + struct tlob_start_args args = {
>> + .threshold_us = 50000, /* 50 ms */
>> + .tag = 0, /* optional; 0 = don't care */
>> + .notify_fd = -1, /* no fd notification */
>> + };
>> + ioctl(fd, TLOB_IOCTL_TRACE_START, &args);
>> +
>> + /* ... code path under observation ... */
>> +
>> + int ret = ioctl(fd, TLOB_IOCTL_TRACE_STOP, NULL);
>> + /* ret == 0: within budget */
>> + /* ret == -1 && errno == EOVERFLOW: budget exceeded */
>> +
>> + close(fd);
>> +
>> +**Asynchronous mode** -- a dedicated monitor thread receives violation
>> +records via ``read()`` on a shared fd, decoupling the observation from
>> +the critical path::
>> +
>> + /* Monitor thread: open a dedicated fd. */
>> + int monitor_fd = open("/dev/rv", O_RDWR);
>> +
>> + /* Worker thread: set notify_fd = monitor_fd in TRACE_START args. */
>> + int work_fd = open("/dev/rv", O_RDWR);
>> + struct tlob_start_args args = {
>> + .threshold_us = 10000, /* 10 ms */
>> + .tag = REGION_A,
>> + .notify_fd = monitor_fd,
>> + };
>> + ioctl(work_fd, TLOB_IOCTL_TRACE_START, &args);
>> + /* ... critical section ... */
>> + ioctl(work_fd, TLOB_IOCTL_TRACE_STOP, NULL);
>> +
>> + /* Monitor thread: blocking read() returns one or more tlob_event records. */
>> + struct tlob_event ntfs[8];
>> + ssize_t n = read(monitor_fd, ntfs, sizeof(ntfs));
>> + size_t count = n > 0 ? (size_t)n / sizeof(struct tlob_event) : 0;
>> + for (size_t i = 0; i < count; i++) {
>> + struct tlob_event *ntf = &ntfs[i];
>> + printf("tid=%u tag=0x%llx exceeded budget=%llu us "
>> + "(on_cpu=%llu off_cpu=%llu switches=%u state=%s)\n",
>> + ntf->tid, ntf->tag, ntf->threshold_us,
>> + ntf->on_cpu_us, ntf->off_cpu_us, ntf->switches,
>> + ntf->state ? "on_cpu" : "off_cpu");
>> + }
>> +
>> +**mmap ring buffer** -- zero-copy consumption of violation events::
>> +
>> + int fd = open("/dev/rv", O_RDWR);
>> + struct tlob_start_args args = {
>> + .threshold_us = 1000, /* 1 ms */
>> + .notify_fd = fd, /* push violations to own ring buffer */
>> + };
>> + ioctl(fd, TLOB_IOCTL_TRACE_START, &args);
>> +
>> + /* Map the ring: one control page + capacity data records. */
>> + size_t pagesize = sysconf(_SC_PAGESIZE);
>> + size_t cap = 64; /* read from page->capacity after mmap */
>> + size_t raw = pagesize + cap * sizeof(struct tlob_event);
>> + size_t len = (raw + pagesize - 1) & ~(pagesize - 1); /* page-align */
>> + void *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
>> +
>> + struct tlob_mmap_page *page = map;
>> + struct tlob_event *data =
>> + (struct tlob_event *)((char *)map + page->data_offset);
>> +
>> + /* Consumer loop: poll for events, read without copying. */
>> + while (1) {
>> + poll(&(struct pollfd){fd, POLLIN, 0}, 1, -1);
>> +
>> + uint32_t head = __atomic_load_n(&page->data_head, __ATOMIC_ACQUIRE);
>> + uint32_t tail = page->data_tail;
>> + while (tail != head) {
>> + handle(&data[tail & (page->capacity - 1)]);
>> + tail++;
>> + }
>> + __atomic_store_n(&page->data_tail, tail, __ATOMIC_RELEASE);
>> + }
>> +
>> +Note: ``read()`` and ``mmap()`` share the same ring and ``data_tail``
>> +cursor. Do not use both simultaneously on the same fd.
>> +
>> +``tlob_event`` fields:
>> +
>> +``tid``
>> + Thread ID (``task_pid_vnr``) of the violating task.
>> +
>> +``threshold_us``
>> + Budget that was exceeded, in microseconds.
>> +
>> +``on_cpu_us``
>> + Cumulative on-CPU time at violation time, in microseconds.
>> +
>> +``off_cpu_us``
>> + Cumulative off-CPU time at violation time, in microseconds.
>> +
>> +``switches``
>> + Number of context switches since ``TRACE_START``.
>> +
>> +``state``
>> + 1 = timer fired while task was on-CPU; 0 = timer fired while off-CPU.
>> +
>> +``tag``
>> + Cookie from ``tlob_start_args.tag``; for the tracefs uprobe path this
>> + equals ``offset_start``. Zero when not set.
>> +
>> +tracefs files
>> +-------------
>> +
>> +The following files are created under
>> +``/sys/kernel/tracing/rv/monitors/tlob/``:
>> +
>> +``enable`` (rw)
>> + Write ``1`` to enable the monitor; write ``0`` to disable it and
>> + stop all currently monitored tasks.
>> +
>> +``desc`` (ro)
>> + Human-readable description of the monitor.
>> +
>> +``monitor`` (rw)
>> + Write ``threshold_us:offset_start:offset_stop:binary_path`` to bind two
>> + plain entry uprobes in *binary_path*. The uprobe at *offset_start* fires
>> + ``tlob_start_task()``; the uprobe at *offset_stop* fires
>> + ``tlob_stop_task()``. Returns ``-EEXIST`` if a binding with the same
>> + *offset_start* already exists for *binary_path*. Write
>> + ``-offset_start:binary_path`` to remove the binding. Read to list
>> + registered bindings, one
>> + ``threshold_us:0xoffset_start:0xoffset_stop:binary_path`` entry per line.
>> +
>> +Specification
>> +-------------
>> +
>> +Graphviz DOT file in tools/verification/models/tlob.dot
>> diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
>> index 331223761..8d3af68db 100644
>> --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
>> +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
>> @@ -385,6 +385,7 @@ Code Seq# Include File Comments
>> 0xB8 01-02 uapi/misc/mrvl_cn10k_dpi.h Marvell CN10K DPI driver
>> 0xB8 all uapi/linux/mshv.h Microsoft Hyper-V /dev/mshv driver
>> <mailto:linux-hyperv@vger.kernel.org>
>> +0xB9 00-3F linux/rv.h Runtime Verification (RV) monitors
>> 0xBA 00-0F uapi/linux/liveupdate.h Pasha Tatashin
>> <mailto:pasha.tatashin@soleen.com>
>> 0xC0 00-0F linux/usb/iowarrior.h
>> diff --git a/include/uapi/linux/rv.h b/include/uapi/linux/rv.h
>> new file mode 100644
>> index 000000000..d1b96d8cd
>> --- /dev/null
>> +++ b/include/uapi/linux/rv.h
>> @@ -0,0 +1,181 @@
>> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
>> +/*
>> + * UAPI definitions for Runtime Verification (RV) monitors.
>> + *
>> + * All RV monitors that expose an ioctl self-instrumentation interface
>> + * share the magic byte RV_IOC_MAGIC (0xB9), registered in
>> + * Documentation/userspace-api/ioctl/ioctl-number.rst.
>> + *
>> + * A single /dev/rv misc device serves as the entry point. ioctl numbers
>> + * encode both the monitor identity and the operation:
>> + *
>> + * 0x01 - 0x1F tlob (task latency over budget)
>> + * 0x20 - 0x3F reserved for future RV monitors
>> + *
>> + * Usage examples and design rationale are in:
>> + * Documentation/trace/rv/monitor_tlob.rst
>> + */
>> +
>> +#ifndef _UAPI_LINUX_RV_H
>> +#define _UAPI_LINUX_RV_H
>> +
>> +#include <linux/ioctl.h>
>> +#include <linux/types.h>
>> +
>> +/* Magic byte shared by all RV monitor ioctls. */
>> +#define RV_IOC_MAGIC 0xB9
>> +
>> +/* -----------------------------------------------------------------------
>> + * tlob: task latency over budget monitor (nr 0x01 - 0x1F)
>> + * -----------------------------------------------------------------------
>> + */
>> +
>> +/**
>> + * struct tlob_start_args - arguments for TLOB_IOCTL_TRACE_START
>> + * @threshold_us: Latency budget for this critical section, in microseconds.
>> + * Must be greater than zero.
>> + * @tag: Opaque 64-bit cookie supplied by the caller. Echoed back
>> + * verbatim in the tlob_budget_exceeded ftrace event and in any
>> + * tlob_event record delivered via @notify_fd. Use it to identify
>> + * which code region triggered a violation when the same thread
>> + * monitors multiple regions sequentially. Set to 0 if not
>> + * needed.
>> + * @notify_fd: File descriptor that will receive a tlob_event record on
>> + * violation. Must refer to an open /dev/rv fd. May equal
>> + * the calling fd (self-notification, useful for retrieving the
>> + * on_cpu_us / off_cpu_us breakdown after TRACE_STOP returns
>> + * -EOVERFLOW). Set to -1 to disable fd notification; in that
>> + * case violations are only signalled via the TRACE_STOP return
>> + * value and the tlob_budget_exceeded ftrace event.
>> + * @flags: Must be 0. Reserved for future extensions.
>> + */
>> +struct tlob_start_args {
>> + __u64 threshold_us;
>> + __u64 tag;
>> + __s32 notify_fd;
>> + __u32 flags;
>> +};
>> +
>> +/**
>> + * struct tlob_event - one budget-exceeded event
>> + *
>> + * Consumed by read() on the notify_fd registered at TLOB_IOCTL_TRACE_START.
>> + * Each record describes a single budget exceedance for one task.
>> + *
>> + * @tid: Thread ID (task_pid_vnr) of the violating task.
>> + * @threshold_us: Budget that was exceeded, in microseconds.
>> + * @on_cpu_us: Cumulative on-CPU time at violation time, in microseconds.
>> + * @off_cpu_us: Cumulative off-CPU (scheduling + I/O wait) time at
>> + * violation time, in microseconds.
>> + * @switches: Number of context switches since TRACE_START.
>> + * @state: DA state at violation: 1 = on_cpu, 0 = off_cpu.
>> + * @tag: Cookie from tlob_start_args.tag; for the tracefs uprobe path
>> + * this is the offset_start value. Zero when not set.
>> + */
>> +struct tlob_event {
>> + __u32 tid;
>> + __u32 pad;
>> + __u64 threshold_us;
>> + __u64 on_cpu_us;
>> + __u64 off_cpu_us;
>> + __u32 switches;
>> + __u32 state; /* 1 = on_cpu, 0 = off_cpu */
>> + __u64 tag;
>> +};
>> +
>> +/**
>> + * struct tlob_mmap_page - control page for the mmap'd violation ring buffer
>> + *
>> + * Mapped at offset 0 of the mmap region returned by mmap(2) on a /dev/rv fd.
>> + * The data array of struct tlob_event records begins at offset @data_offset
>> + * (always one page from the mmap base; use this field rather than hard-coding
>> + * PAGE_SIZE so the code remains correct across architectures).
>> + *
>> + * Ring layout:
>> + *
>> + * mmap base + 0 : struct tlob_mmap_page (one page)
>> + * mmap base + data_offset : struct tlob_event[capacity]
>> + *
>> + * The mmap length determines the ring capacity. Compute it as:
>> + *
>> + * raw = sysconf(_SC_PAGESIZE) + capacity * sizeof(struct tlob_event)
>> + * length = (raw + sysconf(_SC_PAGESIZE) - 1) & ~(sysconf(_SC_PAGESIZE) - 1)
>> + *
>> + * i.e. round the raw byte count up to the next page boundary before
>> + * passing it to mmap(2). The kernel requires a page-aligned length.
>> + * capacity must be a power of 2. Read @capacity after a successful
>> + * mmap(2) for the actual value.
>> + *
>> + * Producer/consumer ordering contract:
>> + *
>> + * Kernel (producer):
>> + * data[data_head & (capacity - 1)] = event;
>> + * // pairs with load-acquire in userspace:
>> + * smp_store_release(&page->data_head, data_head + 1);
>> + *
>> + * Userspace (consumer):
>> + * // pairs with store-release in kernel:
>> + * head = __atomic_load_n(&page->data_head, __ATOMIC_ACQUIRE);
>> + * for (tail = page->data_tail; tail != head; tail++)
>> + * handle(&data[tail & (capacity - 1)]);
>> + * __atomic_store_n(&page->data_tail, tail, __ATOMIC_RELEASE);
>> + *
>> + * @data_head and @data_tail are monotonically increasing __u32 counters
>> + * in units of records. Unsigned 32-bit wrap-around is handled correctly
>> + * by modular arithmetic; the ring is full when
>> + * (data_head - data_tail) == capacity.
>> + *
>> + * When the ring is full the kernel drops the incoming record and increments
>> + * @dropped. The consumer should check @dropped periodically to detect loss.
>> + *
>> + * read() and mmap() share the same ring buffer. Do not use both
>> + * simultaneously on the same fd.
>> + *
>> + * @data_head: Next write slot index. Updated by the kernel with
>> + * store-release ordering. Read by userspace with load-acquire.
>> + * @data_tail: Next read slot index. Updated by userspace. Read by the
>> + * kernel to detect overflow.
>> + * @capacity: Actual ring capacity in records (power of 2). Written once
>> + * by the kernel at mmap time; read-only for userspace thereafter.
>> + * @version: Ring buffer ABI version; currently 1.
>> + * @data_offset: Byte offset from the mmap base to the data array.
>> + * Always equal to sysconf(_SC_PAGESIZE) on the running kernel.
>> + * @record_size: sizeof(struct tlob_event) as seen by the kernel. Verify
>> + * this matches userspace's sizeof before indexing the array.
>> + * @dropped: Number of events dropped because the ring was full.
>> + * Monotonically increasing; read with __ATOMIC_RELAXED.
>> + */
>> +struct tlob_mmap_page {
>> + __u32 data_head;
>> + __u32 data_tail;
>> + __u32 capacity;
>> + __u32 version;
>> + __u32 data_offset;
>> + __u32 record_size;
>> + __u64 dropped;
>> +};
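To make the length and validation rules in the comment above concrete, here is a small userspace sketch. It assumes `sizeof(struct tlob_event)` is 48 bytes, which is what the quoted layout adds up to (two `__u32`, three `__u64`, two `__u32`, one `__u64`). The helper names `tlob_round_up_page()`, `tlob_mmap_len()` and `tlob_ring_header_ok()` are hypothetical and not part of the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Round raw up to a multiple of page_size (page_size must be a power of 2). */
static size_t tlob_round_up_page(size_t raw, size_t page_size)
{
	return (raw + page_size - 1) & ~(page_size - 1);
}

/* mmap length: one control page plus capacity records, page-aligned. */
static size_t tlob_mmap_len(size_t page_size, size_t capacity,
			    size_t record_size)
{
	return tlob_round_up_page(page_size + capacity * record_size,
				  page_size);
}

/*
 * Sanity-check the control page fields before indexing the data array:
 * capacity is a non-zero power of 2, the data array starts one page in,
 * and the kernel's record size matches what userspace was compiled with.
 * Returns 1 if the header looks usable, 0 otherwise.
 */
static int tlob_ring_header_ok(uint32_t capacity, uint32_t data_offset,
			       uint32_t record_size, size_t page_size,
			       size_t expected_record_size)
{
	if (capacity == 0 || (capacity & (capacity - 1)) != 0)
		return 0;
	if (data_offset != page_size)
		return 0;
	return record_size == expected_record_size;
}
```

With 4 KiB pages and 64 records the raw size is 4096 + 64 * 48 = 7168 bytes, so the page-aligned length passed to mmap(2) would be 8192.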
>> +
>> +/*
>> + * TLOB_IOCTL_TRACE_START - begin monitoring the calling task.
>> + *
>> + * Arms a per-task hrtimer for threshold_us microseconds. If args.notify_fd
>> + * is >= 0, a tlob_event record is pushed into that fd's ring buffer on
>> + * violation in addition to the tlob_budget_exceeded ftrace event.
>> + * args.notify_fd == -1 disables fd notification.
>> + *
>> + * Violation records are consumed by read() on the notify_fd (blocking or
>> + * non-blocking depending on O_NONBLOCK). On violation, TLOB_IOCTL_TRACE_STOP
>> + * also returns -EOVERFLOW regardless of whether notify_fd is set.
>> + *
>> + * args.flags must be 0.
>> + */
>> +#define TLOB_IOCTL_TRACE_START _IOW(RV_IOC_MAGIC, 0x01, struct tlob_start_args)
>> +
>> +/*
>> + * TLOB_IOCTL_TRACE_STOP - end monitoring the calling task.
>> + *
>> + * Returns 0 if within budget, -EOVERFLOW if the budget was exceeded.
>> + */
>> +#define TLOB_IOCTL_TRACE_STOP _IO(RV_IOC_MAGIC, 0x02)
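For illustration, the number-space split described in the header comment (0x01 - 0x1F for tlob under magic 0xB9) could be checked from the decoded command as sketched below. This assumes a Linux userspace where `<linux/ioctl.h>` provides `_IOW`, `_IO`, `_IOC_TYPE`, `_IOC_NR` and `_IOC_SIZE`. `struct tlob_start_args_mirror`, the `DEMO_*` macros and `rv_cmd_is_tlob()` are hypothetical stand-ins so the snippet builds without the new UAPI header.

```c
#include <stdint.h>
#include <linux/ioctl.h>

#define RV_IOC_MAGIC 0xB9

/* Local mirror of the quoted struct tlob_start_args, for size checks only. */
struct tlob_start_args_mirror {
	uint64_t threshold_us;
	uint64_t tag;
	int32_t notify_fd;
	uint32_t flags;
};

/* Same encoding as the quoted TLOB_IOCTL_TRACE_START / TRACE_STOP. */
#define DEMO_TRACE_START _IOW(RV_IOC_MAGIC, 0x01, struct tlob_start_args_mirror)
#define DEMO_TRACE_STOP  _IO(RV_IOC_MAGIC, 0x02)

/* Returns 1 if cmd carries the RV magic and falls in the tlob nr range. */
static int rv_cmd_is_tlob(unsigned int cmd)
{
	return _IOC_TYPE(cmd) == RV_IOC_MAGIC &&
	       _IOC_NR(cmd) >= 0x01 && _IOC_NR(cmd) <= 0x1F;
}
```

The size field encoded by `_IOW` is 24 bytes for this layout, so any later change to the struct would change the command number as well.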
>> +
>> +#endif /* _UAPI_LINUX_RV_H */
>> diff --git a/kernel/trace/rv/Kconfig b/kernel/trace/rv/Kconfig
>> index 5b4be87ba..227573cda 100644
>> --- a/kernel/trace/rv/Kconfig
>> +++ b/kernel/trace/rv/Kconfig
>> @@ -65,6 +65,7 @@ source "kernel/trace/rv/monitors/pagefault/Kconfig"
>> source "kernel/trace/rv/monitors/sleep/Kconfig"
>> # Add new rtapp monitors here
>>
>> +source "kernel/trace/rv/monitors/tlob/Kconfig"
>> # Add new monitors here
>>
>> config RV_REACTORS
>> @@ -93,3 +94,19 @@ config RV_REACT_PANIC
>> help
>> Enables the panic reactor. The panic reactor emits a printk()
>> message if an exception is found and panic()s the system.
>> +
>> +config RV_CHARDEV
>> + bool "RV ioctl interface via /dev/rv"
>> + depends on RV
>> + default n
>> + help
>> + Register a /dev/rv misc device that exposes an ioctl interface
>> + for RV monitor self-instrumentation. All RV monitors share the
>> + single device node; ioctl numbers encode the monitor identity.
>> +
>> + When enabled, user-space programs can open /dev/rv and use
>> + monitor-specific ioctl commands to bracket code regions they
>> + want the kernel RV subsystem to observe.
>> +
>> + Say Y here if you want to use the tlob self-instrumentation
>> + ioctl interface; otherwise say N.
>> diff --git a/kernel/trace/rv/Makefile b/kernel/trace/rv/Makefile
>> index 750e4ad6f..cc3781a3b 100644
>> --- a/kernel/trace/rv/Makefile
>> +++ b/kernel/trace/rv/Makefile
>> @@ -3,6 +3,7 @@
>> ccflags-y += -I $(src) # needed for trace events
>>
>> obj-$(CONFIG_RV) += rv.o
>> +obj-$(CONFIG_RV_CHARDEV) += rv_dev.o
>> obj-$(CONFIG_RV_MON_WIP) += monitors/wip/wip.o
>> obj-$(CONFIG_RV_MON_WWNR) += monitors/wwnr/wwnr.o
>> obj-$(CONFIG_RV_MON_SCHED) += monitors/sched/sched.o
>> @@ -17,6 +18,7 @@ obj-$(CONFIG_RV_MON_STS) += monitors/sts/sts.o
>> obj-$(CONFIG_RV_MON_NRP) += monitors/nrp/nrp.o
>> obj-$(CONFIG_RV_MON_SSSW) += monitors/sssw/sssw.o
>> obj-$(CONFIG_RV_MON_OPID) += monitors/opid/opid.o
>> +obj-$(CONFIG_RV_MON_TLOB) += monitors/tlob/tlob.o
>> # Add new monitors here
>> obj-$(CONFIG_RV_REACTORS) += rv_reactors.o
>> obj-$(CONFIG_RV_REACT_PRINTK) += reactor_printk.o
>> diff --git a/kernel/trace/rv/monitors/tlob/Kconfig b/kernel/trace/rv/monitors/tlob/Kconfig
>> new file mode 100644
>> index 000000000..010237480
>> --- /dev/null
>> +++ b/kernel/trace/rv/monitors/tlob/Kconfig
>> @@ -0,0 +1,51 @@
>> +# SPDX-License-Identifier: GPL-2.0-only
>> +#
>> +config RV_MON_TLOB
>> + depends on RV
>> + depends on UPROBES
>> + select DA_MON_EVENTS_ID
>> + bool "tlob monitor"
>> + help
>> + Enable the tlob (task latency over budget) monitor. This monitor
>> + tracks the elapsed time (CLOCK_MONOTONIC) of a marked code path within a
>> + task (including both on-CPU and off-CPU time) and reports a
>> + violation when the elapsed time exceeds a configurable budget
>> + threshold.
>> +
>> + The monitor implements a three-state deterministic automaton.
>> + States: unmonitored, on_cpu, off_cpu.
>> + Key transitions:
>> + unmonitored --(trace_start)--> on_cpu
>> + on_cpu --(switch_out)--> off_cpu
>> + off_cpu --(switch_in)--> on_cpu
>> + on_cpu --(trace_stop)--> unmonitored
>> + off_cpu --(trace_stop)--> unmonitored
>> + on_cpu --(budget_expired)--> unmonitored
>> + off_cpu --(budget_expired)--> unmonitored
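The transition list above can be read as a small table-driven automaton. A minimal sketch of just these seven transitions follows; the enum and function names are hypothetical, and the real automaton is generated from tlob.dot and may carry further events (the quoted tlob.c also feeds a sched_wakeup event, which is not listed here).

```c
/* Hypothetical encoding of the three DA states plus an "invalid" marker. */
enum demo_tlob_state {
	DEMO_INVALID = -1,
	DEMO_UNMONITORED,
	DEMO_ON_CPU,
	DEMO_OFF_CPU,
};

/* Hypothetical encoding of the events named in the help text. */
enum demo_tlob_ev {
	DEMO_EV_TRACE_START,
	DEMO_EV_SWITCH_OUT,
	DEMO_EV_SWITCH_IN,
	DEMO_EV_TRACE_STOP,
	DEMO_EV_BUDGET_EXPIRED,
};

/* Return the next state, or DEMO_INVALID if the transition is not listed. */
static int demo_tlob_step(int state, int ev)
{
	switch (state) {
	case DEMO_UNMONITORED:
		return ev == DEMO_EV_TRACE_START ? DEMO_ON_CPU : DEMO_INVALID;
	case DEMO_ON_CPU:
		if (ev == DEMO_EV_SWITCH_OUT)
			return DEMO_OFF_CPU;
		if (ev == DEMO_EV_TRACE_STOP || ev == DEMO_EV_BUDGET_EXPIRED)
			return DEMO_UNMONITORED;
		return DEMO_INVALID;
	case DEMO_OFF_CPU:
		if (ev == DEMO_EV_SWITCH_IN)
			return DEMO_ON_CPU;
		if (ev == DEMO_EV_TRACE_STOP || ev == DEMO_EV_BUDGET_EXPIRED)
			return DEMO_UNMONITORED;
		return DEMO_INVALID;
	default:
		return DEMO_INVALID;
	}
}
```

Both stop and budget expiry lead back to unmonitored from either running state, so a missed stop probe ends in a reported violation rather than a stuck entry.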
>> +
>> + External configuration is done via the tracefs "monitor" file:
>> + echo threshold_us:offset_start:offset_stop:binary_path > .../rv/monitors/tlob/monitor
>> + echo -offset_start:binary_path > .../rv/monitors/tlob/monitor (remove binding)
>> + cat .../rv/monitors/tlob/monitor (list bindings)
>> +
>> + The uprobe binding places two plain entry uprobes at offset_start and
>> + offset_stop in the binary; these trigger tlob_start_task() and
>> + tlob_stop_task() respectively. Using two entry uprobes (rather than a
>> + uretprobe) means that a mistyped offset can never corrupt the call
>> + stack; the worst outcome is a missed stop, which causes the hrtimer to
>> + fire and report a budget violation.
>> +
>> + Violation events are delivered via a lock-free mmap ring buffer on
>> + /dev/rv (enabled by CONFIG_RV_CHARDEV). The consumer mmap()s the
>> + device, reads records from the data array using the head/tail indices
>> + in the control page, and advances data_tail when done.
>> +
>> + For self-instrumentation, use TLOB_IOCTL_TRACE_START /
>> + TLOB_IOCTL_TRACE_STOP via the /dev/rv misc device (enabled by
>> + CONFIG_RV_CHARDEV).
>> +
>> + Up to TLOB_MAX_MONITORED tasks may be monitored simultaneously.
>> +
>> + For further information, see:
>> + Documentation/trace/rv/monitor_tlob.rst
>> +
>> diff --git a/kernel/trace/rv/monitors/tlob/tlob.c b/kernel/trace/rv/monitors/tlob/tlob.c
>> new file mode 100644
>> index 000000000..a6e474025
>> --- /dev/null
>> +++ b/kernel/trace/rv/monitors/tlob/tlob.c
>> @@ -0,0 +1,986 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * tlob: task latency over budget monitor
>> + *
>> + * Track the elapsed wall-clock time of a marked code path and detect when
>> + * a monitored task exceeds its per-task latency budget. CLOCK_MONOTONIC
>> + * is used so both on-CPU and off-CPU time count toward the budget.
>> + *
>> + * Per-task state is maintained in a spinlock-protected hash table. A
>> + * one-shot hrtimer fires at the deadline; if the task has not called
>> + * trace_stop by then, a violation is recorded.
>> + *
>> + * Up to TLOB_MAX_MONITORED tasks may be tracked simultaneously.
>> + *
>> + * Copyright (C) 2026 Wen Yang <wen.yang@linux.dev>
>> + */
>> +#include <linux/file.h>
>> +#include <linux/fs.h>
>> +#include <linux/ftrace.h>
>> +#include <linux/hash.h>
>> +#include <linux/hrtimer.h>
>> +#include <linux/kernel.h>
>> +#include <linux/ktime.h>
>> +#include <linux/module.h>
>> +#include <linux/init.h>
>> +#include <linux/namei.h>
>> +#include <linux/poll.h>
>> +#include <linux/rv.h>
>> +#include <linux/sched.h>
>> +#include <linux/slab.h>
>> +#include <linux/atomic.h>
>> +#include <linux/rcupdate.h>
>> +#include <linux/spinlock.h>
>> +#include <linux/tracefs.h>
>> +#include <linux/uaccess.h>
>> +#include <linux/uprobes.h>
>> +#include <kunit/visibility.h>
>> +#include <rv/instrumentation.h>
>> +
>> +/* rv_interface_lock is defined in kernel/trace/rv/rv.c */
>> +extern struct mutex rv_interface_lock;
>> +
>> +#define MODULE_NAME "tlob"
>> +
>> +#include <rv_trace.h>
>> +#include <trace/events/sched.h>
>> +
>> +#define RV_MON_TYPE RV_MON_PER_TASK
>> +#include "tlob.h"
>> +#include <rv/da_monitor.h>
>> +
>> +/* Hash table size; must be a power of two. */
>> +#define TLOB_HTABLE_BITS 6
>> +#define TLOB_HTABLE_SIZE (1 << TLOB_HTABLE_BITS)
>> +
>> +/* Maximum binary path length for uprobe binding. */
>> +#define TLOB_MAX_PATH 256
>> +
>> +/* Per-task latency monitoring state. */
>> +struct tlob_task_state {
>> + struct hlist_node hlist;
>> + struct task_struct *task;
>> + u64 threshold_us;
>> + u64 tag;
>> + struct hrtimer deadline_timer;
>> + int canceled; /* protected by entry_lock */
>> + struct file *notify_file; /* NULL or held reference */
>> +
>> + /*
>> + * entry_lock serialises the mutable accounting fields below.
>> + * Lock order: tlob_table_lock -> entry_lock (never reverse).
>> + */
>> + raw_spinlock_t entry_lock;
>> + u64 on_cpu_us;
>> + u64 off_cpu_us;
>> + ktime_t last_ts;
>> + u32 switches;
>> + u8 da_state;
>> +
>> + struct rcu_head rcu; /* for call_rcu() teardown */
>> +};
>> +
>> +/* Per-uprobe-binding state: a start + stop probe pair for one binary region. */
>> +struct tlob_uprobe_binding {
>> + struct list_head list;
>> + u64 threshold_us;
>> + struct path path;
>> + char binpath[TLOB_MAX_PATH]; /* canonical path for read/remove */
>> + loff_t offset_start;
>> + loff_t offset_stop;
>> + struct uprobe_consumer entry_uc;
>> + struct uprobe_consumer stop_uc;
>> + struct uprobe *entry_uprobe;
>> + struct uprobe *stop_uprobe;
>> +};
>> +
>> +/* Object pool for tlob_task_state. */
>> +static struct kmem_cache *tlob_state_cache;
>> +
>> +/* Hash table and lock protecting table structure (insert/delete/canceled). */
>> +static struct hlist_head tlob_htable[TLOB_HTABLE_SIZE];
>> +static DEFINE_RAW_SPINLOCK(tlob_table_lock);
>> +static atomic_t tlob_num_monitored = ATOMIC_INIT(0);
>> +
>> +/* Uprobe binding list; protected by tlob_uprobe_mutex. */
>> +static LIST_HEAD(tlob_uprobe_list);
>> +static DEFINE_MUTEX(tlob_uprobe_mutex);
>> +
>> +/* Forward declaration */
>> +static enum hrtimer_restart tlob_deadline_timer_fn(struct hrtimer *timer);
>> +
>> +/* Hash table helpers */
>> +
>> +static unsigned int tlob_hash_task(const struct task_struct *task)
>> +{
>> + return hash_ptr((void *)task, TLOB_HTABLE_BITS);
>> +}
>> +
>> +/*
>> + * tlob_find_rcu - look up per-task state.
>> + * Must be called under rcu_read_lock() or with tlob_table_lock held.
>> + */
>> +static struct tlob_task_state *tlob_find_rcu(struct task_struct *task)
>> +{
>> + struct tlob_task_state *ws;
>> + unsigned int h = tlob_hash_task(task);
>> +
>> + hlist_for_each_entry_rcu(ws, &tlob_htable[h], hlist,
>> + lockdep_is_held(&tlob_table_lock))
>> + if (ws->task == task)
>> + return ws;
>> + return NULL;
>> +}
>> +
>> +/* Allocate and initialise a new per-task state entry. */
>> +static struct tlob_task_state *tlob_alloc(struct task_struct *task,
>> + u64 threshold_us, u64 tag)
>> +{
>> + struct tlob_task_state *ws;
>> +
>> + ws = kmem_cache_zalloc(tlob_state_cache, GFP_ATOMIC);
>> + if (!ws)
>> + return NULL;
>> +
>> + ws->task = task;
>> + get_task_struct(task);
>> + ws->threshold_us = threshold_us;
>> + ws->tag = tag;
>> + ws->last_ts = ktime_get();
>> + ws->da_state = on_cpu_tlob;
>> + raw_spin_lock_init(&ws->entry_lock);
>> + hrtimer_setup(&ws->deadline_timer, tlob_deadline_timer_fn,
>> + CLOCK_MONOTONIC, HRTIMER_MODE_REL);
>> + return ws;
>> +}
>> +
>> +/* RCU callback: free the slab once no readers remain. */
>> +static void tlob_free_rcu_slab(struct rcu_head *head)
>> +{
>> + struct tlob_task_state *ws =
>> + container_of(head, struct tlob_task_state, rcu);
>> + kmem_cache_free(tlob_state_cache, ws);
>> +}
>> +
>> +/* Arm the one-shot deadline timer for threshold_us microseconds. */
>> +static void tlob_arm_deadline(struct tlob_task_state *ws)
>> +{
>> + hrtimer_start(&ws->deadline_timer,
>> + ns_to_ktime(ws->threshold_us * NSEC_PER_USEC),
>> + HRTIMER_MODE_REL);
>> +}
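The multiplication in tlob_arm_deadline() is only safe because tlob_start_task() rejects thresholds above KTIME_MAX / NSEC_PER_USEC before a timer is ever armed. A userspace sketch of that guard follows, assuming KTIME_MAX equals INT64_MAX and NSEC_PER_USEC equals 1000 as in the kernel; the `DEMO_*` macros and `demo_us_to_ns()` are hypothetical names.

```c
#include <stdint.h>

#define DEMO_NSEC_PER_USEC 1000ULL
#define DEMO_KTIME_MAX     INT64_MAX

/*
 * Convert microseconds to nanoseconds for a signed 64-bit time value.
 * Returns 0 and stores the result in *ns, or -1 if the product would
 * not fit (mirrors the range check done before arming the timer).
 */
static int demo_us_to_ns(uint64_t us, int64_t *ns)
{
	if (us > (uint64_t)DEMO_KTIME_MAX / DEMO_NSEC_PER_USEC)
		return -1;
	*ns = (int64_t)(us * DEMO_NSEC_PER_USEC);
	return 0;
}
```

Checking against the quotient rather than the product avoids performing the overflowing multiplication in the first place.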
>> +
>> +/*
>> + * Push a violation record into a monitor fd's ring buffer (softirq context).
>> + * Drop-new policy: discard incoming record when full. smp_store_release on
>> + * data_head pairs with smp_load_acquire in the consumer.
>> + */
>> +static void tlob_event_push(struct rv_file_priv *priv,
>> + const struct tlob_event *info)
>> +{
>> + struct tlob_ring *ring = &priv->ring;
>> + unsigned long flags;
>> + u32 head, tail;
>> +
>> + spin_lock_irqsave(&ring->lock, flags);
>> +
>> + head = ring->page->data_head;
>> + tail = READ_ONCE(ring->page->data_tail);
>> +
>> + if (head - tail > ring->mask) {
>> + /* Ring full: drop incoming record. */
>> + ring->page->dropped++;
>> + spin_unlock_irqrestore(&ring->lock, flags);
>> + return;
>> + }
>> +
>> + ring->data[head & ring->mask] = *info;
>> + /* pairs with smp_load_acquire() in the consumer */
>> + smp_store_release(&ring->page->data_head, head + 1);
>> +
>> + spin_unlock_irqrestore(&ring->lock, flags);
>> +
>> + wake_up_interruptible_poll(&priv->waitq, EPOLLIN | EPOLLRDNORM);
>> +}
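The full test `head - tail > ring->mask` relies on unsigned 32-bit wrap-around, as the UAPI comment notes. A standalone sketch of the same drop-new policy is below, without the locking or the data copy; `struct demo_ring`, `demo_ring_used()` and `demo_ring_push()` are hypothetical names, not the patch's types.

```c
#include <stdint.h>

/* Hypothetical index-only model of the ring: no payload, no locking. */
struct demo_ring {
	uint32_t head;    /* next write slot, free-running counter */
	uint32_t tail;    /* next read slot, free-running counter */
	uint32_t mask;    /* capacity - 1, capacity a power of 2 */
	uint64_t dropped; /* records discarded because the ring was full */
};

/* Number of unread records; correct across 32-bit wrap-around. */
static uint32_t demo_ring_used(const struct demo_ring *r)
{
	return (uint32_t)(r->head - r->tail);
}

/* Drop-new policy: returns 1 if a slot was taken, 0 if the record was dropped. */
static int demo_ring_push(struct demo_ring *r)
{
	if (demo_ring_used(r) > r->mask) {
		r->dropped++;
		return 0;
	}
	r->head++;
	return 1;
}
```

Because both counters are free-running, the slot index is always `counter & mask` and the fill level stays correct even when `head` has wrapped past zero and `tail` has not.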
>> +
>> +#if IS_ENABLED(CONFIG_KUNIT)
>> +void tlob_event_push_kunit(struct rv_file_priv *priv,
>> + const struct tlob_event *info)
>> +{
>> + tlob_event_push(priv, info);
>> +}
>> +EXPORT_SYMBOL_IF_KUNIT(tlob_event_push_kunit);
>> +#endif /* CONFIG_KUNIT */
>> +
>> +/*
>> + * Budget exceeded: remove the entry, record the violation, and inject
>> + * budget_expired into the DA.
>> + *
>> + * Lock order: tlob_table_lock -> entry_lock. tlob_stop_task() sets
>> + * ws->canceled under both locks; if we see it here the stop path owns cleanup.
>> + * fput/put_task_struct are done before call_rcu(); the RCU callback only
>> + * reclaims the slab.
>> + */
>> +static enum hrtimer_restart tlob_deadline_timer_fn(struct hrtimer *timer)
>> +{
>> + struct tlob_task_state *ws =
>> + container_of(timer, struct tlob_task_state, deadline_timer);
>> + struct tlob_event info = {};
>> + struct file *notify_file;
>> + struct task_struct *task;
>> + unsigned long flags;
>> + /* snapshots taken under entry_lock */
>> + u64 on_cpu_us, off_cpu_us, threshold_us, tag;
>> + u32 switches;
>> + bool on_cpu;
>> + bool push_event = false;
>> +
>> + raw_spin_lock_irqsave(&tlob_table_lock, flags);
>> + /* stop path sets canceled under both locks; if set it owns cleanup */
>> + if (ws->canceled) {
>> + raw_spin_unlock_irqrestore(&tlob_table_lock, flags);
>> + return HRTIMER_NORESTART;
>> + }
>> +
>> + /* Finalize accounting and snapshot all fields under entry_lock. */
>> + raw_spin_lock(&ws->entry_lock);
>> +
>> + {
>> + ktime_t now = ktime_get();
>> + u64 delta_us = ktime_to_us(ktime_sub(now, ws->last_ts));
>> +
>> + if (ws->da_state == on_cpu_tlob)
>> + ws->on_cpu_us += delta_us;
>> + else
>> + ws->off_cpu_us += delta_us;
>> + }
>> +
>> + ws->canceled = 1;
>> + on_cpu_us = ws->on_cpu_us;
>> + off_cpu_us = ws->off_cpu_us;
>> + threshold_us = ws->threshold_us;
>> + tag = ws->tag;
>> + switches = ws->switches;
>> + on_cpu = (ws->da_state == on_cpu_tlob);
>> + notify_file = ws->notify_file;
>> + if (notify_file) {
>> + info.tid = task_pid_vnr(ws->task);
>> + info.threshold_us = threshold_us;
>> + info.on_cpu_us = on_cpu_us;
>> + info.off_cpu_us = off_cpu_us;
>> + info.switches = switches;
>> + info.state = on_cpu ? 1 : 0;
>> + info.tag = tag;
>> + push_event = true;
>> + }
>> +
>> + raw_spin_unlock(&ws->entry_lock);
>> +
>> + hlist_del_rcu(&ws->hlist);
>> + atomic_dec(&tlob_num_monitored);
>> + /*
>> + * Hold a reference so task remains valid across da_handle_event()
>> + * after we drop tlob_table_lock.
>> + */
>> + task = ws->task;
>> + get_task_struct(task);
>> + raw_spin_unlock_irqrestore(&tlob_table_lock, flags);
>> +
>> + /*
>> + * Both locks are now released; ws is exclusively owned (removed from
>> + * the hash table with canceled=1). Emit the tracepoint and push the
>> + * violation record.
>> + */
>> + trace_tlob_budget_exceeded(ws->task, threshold_us, on_cpu_us,
>> + off_cpu_us, switches, on_cpu, tag);
>> +
>> + if (push_event) {
>> + struct rv_file_priv *priv = notify_file->private_data;
>> +
>> + if (priv)
>> + tlob_event_push(priv, &info);
>> + }
>> +
>> + da_handle_event(task, budget_expired_tlob);
>> +
>> + if (notify_file)
>> + fput(notify_file); /* ref from fget() at TRACE_START */
>> + put_task_struct(ws->task); /* ref from tlob_alloc() */
>> + put_task_struct(task); /* extra ref from get_task_struct() above */
>> + call_rcu(&ws->rcu, tlob_free_rcu_slab);
>> + return HRTIMER_NORESTART;
>> +}
>> +
>> +/* Tracepoint handlers */
>> +
>> +/*
>> + * handle_sched_switch - advance the DA and accumulate on/off-CPU time.
>> + *
>> + * RCU read-side for lock-free lookup; entry_lock for per-task accounting.
>> + * da_handle_event() is called after rcu_read_unlock() to avoid holding the
>> + * read-side critical section across the RV framework.
>> + */
>> +static void handle_sched_switch(void *data, bool preempt,
>> + struct task_struct *prev,
>> + struct task_struct *next,
>> + unsigned int prev_state)
>> +{
>> + struct tlob_task_state *ws;
>> + unsigned long flags;
>> + bool do_prev = false, do_next = false;
>> + ktime_t now;
>> +
>> + rcu_read_lock();
>> +
>> + ws = tlob_find_rcu(prev);
>> + if (ws) {
>> + raw_spin_lock_irqsave(&ws->entry_lock, flags);
>> + if (!ws->canceled) {
>> + now = ktime_get();
>> + ws->on_cpu_us += ktime_to_us(ktime_sub(now, ws->last_ts));
>> + ws->last_ts = now;
>> + ws->switches++;
>> + ws->da_state = off_cpu_tlob;
>> + do_prev = true;
>> + }
>> + raw_spin_unlock_irqrestore(&ws->entry_lock, flags);
>> + }
>> +
>> + ws = tlob_find_rcu(next);
>> + if (ws) {
>> + raw_spin_lock_irqsave(&ws->entry_lock, flags);
>> + if (!ws->canceled) {
>> + now = ktime_get();
>> + ws->off_cpu_us += ktime_to_us(ktime_sub(now, ws->last_ts));
>> + ws->last_ts = now;
>> + ws->da_state = on_cpu_tlob;
>> + do_next = true;
>> + }
>> + raw_spin_unlock_irqrestore(&ws->entry_lock, flags);
>> + }
>> +
>> + rcu_read_unlock();
>> +
>> + if (do_prev)
>> + da_handle_event(prev, switch_out_tlob);
>> + if (do_next)
>> + da_handle_event(next, switch_in_tlob);
>> +}
>> +
>> +static void handle_sched_wakeup(void *data, struct task_struct *p)
>> +{
>> + struct tlob_task_state *ws;
>> + unsigned long flags;
>> + bool found = false;
>> +
>> + rcu_read_lock();
>> + ws = tlob_find_rcu(p);
>> + if (ws) {
>> + raw_spin_lock_irqsave(&ws->entry_lock, flags);
>> + found = !ws->canceled;
>> + raw_spin_unlock_irqrestore(&ws->entry_lock, flags);
>> + }
>> + rcu_read_unlock();
>> +
>> + if (found)
>> + da_handle_event(p, sched_wakeup_tlob);
>> +}
>> +
>> +/* -----------------------------------------------------------------------
>> + * Core start/stop helpers (also called from rv_dev.c)
>> + * -----------------------------------------------------------------------
>> + */
>> +
>> +/*
>> + * __tlob_insert - insert @ws into the hash table and arm its deadline timer.
>> + *
>> + * Re-checks for duplicates and capacity under tlob_table_lock; the caller
>> + * may have done a lock-free pre-check before allocating @ws. On failure @ws
>> + * is freed directly (never in table, so no call_rcu needed).
>> + */
>> +static int __tlob_insert(struct task_struct *task, struct tlob_task_state *ws)
>> +{
>> + unsigned int h;
>> + unsigned long flags;
>> +
>> + raw_spin_lock_irqsave(&tlob_table_lock, flags);
>> + if (tlob_find_rcu(task)) {
>> + raw_spin_unlock_irqrestore(&tlob_table_lock, flags);
>> + if (ws->notify_file)
>> + fput(ws->notify_file);
>> + put_task_struct(ws->task);
>> + kmem_cache_free(tlob_state_cache, ws);
>> + return -EEXIST;
>> + }
>> + if (atomic_read(&tlob_num_monitored) >= TLOB_MAX_MONITORED) {
>> + raw_spin_unlock_irqrestore(&tlob_table_lock, flags);
>> + if (ws->notify_file)
>> + fput(ws->notify_file);
>> + put_task_struct(ws->task);
>> + kmem_cache_free(tlob_state_cache, ws);
>> + return -ENOSPC;
>> + }
>> + h = tlob_hash_task(task);
>> + hlist_add_head_rcu(&ws->hlist, &tlob_htable[h]);
>> + atomic_inc(&tlob_num_monitored);
>> + raw_spin_unlock_irqrestore(&tlob_table_lock, flags);
>> +
>> + da_handle_start_run_event(task, trace_start_tlob);
>> + tlob_arm_deadline(ws);
>> + return 0;
>> +}
>> +
>> +/**
>> + * tlob_start_task - begin monitoring @task with latency budget @threshold_us.
>> + *
>> + * @notify_file: /dev/rv fd whose ring buffer receives a tlob_event on
>> + * violation; caller transfers the fget() reference to tlob.c.
>> + * Pass NULL for synchronous mode (violations only via
>> + * TRACE_STOP return value and the tlob_budget_exceeded event).
>> + *
>> + * Returns 0, -ENODEV, -ERANGE, -EEXIST, -ENOSPC, or -ENOMEM. On failure the
>> + * caller retains responsibility for any @notify_file reference.
>> + */
>> +int tlob_start_task(struct task_struct *task, u64 threshold_us,
>> + struct file *notify_file, u64 tag)
>> +{
>> + struct tlob_task_state *ws;
>> + unsigned long flags;
>> +
>> + if (!tlob_state_cache)
>> + return -ENODEV;
>> +
>> + if (threshold_us > (u64)KTIME_MAX / NSEC_PER_USEC)
>> + return -ERANGE;
>> +
>> + /* Quick pre-check before allocation. */
>> + raw_spin_lock_irqsave(&tlob_table_lock, flags);
>> + if (tlob_find_rcu(task)) {
>> + raw_spin_unlock_irqrestore(&tlob_table_lock, flags);
>> + return -EEXIST;
>> + }
>> + if (atomic_read(&tlob_num_monitored) >= TLOB_MAX_MONITORED) {
>> + raw_spin_unlock_irqrestore(&tlob_table_lock, flags);
>> + return -ENOSPC;
>> + }
>> + raw_spin_unlock_irqrestore(&tlob_table_lock, flags);
>> +
>> + ws = tlob_alloc(task, threshold_us, tag);
>> + if (!ws)
>> + return -ENOMEM;
>> +
>> + ws->notify_file = notify_file;
>> + return __tlob_insert(task, ws);
>> +}
>> +EXPORT_SYMBOL_GPL(tlob_start_task);
>> +
>> +/**
>> + * tlob_stop_task - stop monitoring @task before the deadline fires.
>> + *
>> + * Sets canceled under entry_lock (inside tlob_table_lock) before calling
>> + * hrtimer_cancel(), racing safely with the timer callback.
>> + *
>> + * Returns 0 if within budget, -ESRCH if the entry is gone (deadline already
>> + * fired, or TRACE_START was never called).
>> + */
>> +int tlob_stop_task(struct task_struct *task)
>> +{
>> + struct tlob_task_state *ws;
>> + struct file *notify_file;
>> + unsigned long flags;
>> +
>> + raw_spin_lock_irqsave(&tlob_table_lock, flags);
>> + ws = tlob_find_rcu(task);
>> + if (!ws) {
>> + raw_spin_unlock_irqrestore(&tlob_table_lock, flags);
>> + return -ESRCH;
>> + }
>> +
>> + /* Prevent handle_sched_switch from updating accounting after removal. */
>> + raw_spin_lock(&ws->entry_lock);
>> + ws->canceled = 1;
>> + raw_spin_unlock(&ws->entry_lock);
>> +
>> + hlist_del_rcu(&ws->hlist);
>> + atomic_dec(&tlob_num_monitored);
>> + raw_spin_unlock_irqrestore(&tlob_table_lock, flags);
>> +
>> + hrtimer_cancel(&ws->deadline_timer);
>> +
>> + da_handle_event(task, trace_stop_tlob);
>> +
>> + notify_file = ws->notify_file;
>> + if (notify_file)
>> + fput(notify_file);
>> + put_task_struct(ws->task);
>> + call_rcu(&ws->rcu, tlob_free_rcu_slab);
>> +
>> + return 0;
>> +}
>> +EXPORT_SYMBOL_GPL(tlob_stop_task);
>> +
>> +/* Stop monitoring all tracked tasks; called on monitor disable. */
>> +static void tlob_stop_all(void)
>> +{
>> + struct tlob_task_state *batch[TLOB_MAX_MONITORED];
>> + struct tlob_task_state *ws;
>> + struct hlist_node *tmp;
>> + unsigned long flags;
>> + int n = 0, i;
>> +
>> + raw_spin_lock_irqsave(&tlob_table_lock, flags);
>> + for (i = 0; i < TLOB_HTABLE_SIZE; i++) {
>> + hlist_for_each_entry_safe(ws, tmp, &tlob_htable[i], hlist) {
>> + raw_spin_lock(&ws->entry_lock);
>> + ws->canceled = 1;
>> + raw_spin_unlock(&ws->entry_lock);
>> + hlist_del_rcu(&ws->hlist);
>> + atomic_dec(&tlob_num_monitored);
>> + if (n < TLOB_MAX_MONITORED)
>> + batch[n++] = ws;
>> + }
>> + }
>> + raw_spin_unlock_irqrestore(&tlob_table_lock, flags);
>> +
>> + for (i = 0; i < n; i++) {
>> + ws = batch[i];
>> + hrtimer_cancel(&ws->deadline_timer);
>> + da_handle_event(ws->task, trace_stop_tlob);
>> + if (ws->notify_file)
>> + fput(ws->notify_file);
>> + put_task_struct(ws->task);
>> + call_rcu(&ws->rcu, tlob_free_rcu_slab);
>> + }
>> +}
>> +
>> +/* uprobe binding helpers */
>> +
>> +static int tlob_uprobe_entry_handler(struct uprobe_consumer *uc,
>> + struct pt_regs *regs, __u64 *data)
>> +{
>> + struct tlob_uprobe_binding *b =
>> + container_of(uc, struct tlob_uprobe_binding, entry_uc);
>> +
>> + tlob_start_task(current, b->threshold_us, NULL, (u64)b->offset_start);
>> + return 0;
>> +}
>> +
>> +static int tlob_uprobe_stop_handler(struct uprobe_consumer *uc,
>> + struct pt_regs *regs, __u64 *data)
>> +{
>> + tlob_stop_task(current);
>> + return 0;
>> +}
>> +
>> +/*
>> + * Register start + stop entry uprobes for a binding.
>> + * Both are plain entry uprobes (no uretprobe), so a wrong offset never
>> + * corrupts the call stack; the worst outcome is a missed stop (hrtimer
>> + * fires and reports a budget violation).
>> + * Called with tlob_uprobe_mutex held.
>> + */
>> +static int tlob_add_uprobe(u64 threshold_us, const char *binpath,
>> + loff_t offset_start, loff_t offset_stop)
>> +{
>> + struct tlob_uprobe_binding *b, *tmp_b;
>> + char pathbuf[TLOB_MAX_PATH];
>> + struct inode *inode;
>> + char *canon;
>> + int ret;
>> +
>> + b = kzalloc(sizeof(*b), GFP_KERNEL);
>> + if (!b)
>> + return -ENOMEM;
>> +
>> + if (binpath[0] != '/') {
>> + kfree(b);
>> + return -EINVAL;
>> + }
>> +
>> + b->threshold_us = threshold_us;
>> + b->offset_start = offset_start;
>> + b->offset_stop = offset_stop;
>> +
>> + ret = kern_path(binpath, LOOKUP_FOLLOW, &b->path);
>> + if (ret)
>> + goto err_free;
>> +
>> + if (!d_is_reg(b->path.dentry)) {
>> + ret = -EINVAL;
>> + goto err_path;
>> + }
>> +
>> + /* Reject duplicate start offset for the same binary. */
>> + list_for_each_entry(tmp_b, &tlob_uprobe_list, list) {
>> + if (tmp_b->offset_start == offset_start &&
>> + tmp_b->path.dentry == b->path.dentry) {
>> + ret = -EEXIST;
>> + goto err_path;
>> + }
>> + }
>> +
>> + /* Store canonical path for read-back and removal matching. */
>> + canon = d_path(&b->path, pathbuf, sizeof(pathbuf));
>> + if (IS_ERR(canon)) {
>> + ret = PTR_ERR(canon);
>> + goto err_path;
>> + }
>> + strscpy(b->binpath, canon, sizeof(b->binpath));
>> +
>> + b->entry_uc.handler = tlob_uprobe_entry_handler;
>> + b->stop_uc.handler = tlob_uprobe_stop_handler;
>> +
>> + inode = d_real_inode(b->path.dentry);
>> +
>> + b->entry_uprobe = uprobe_register(inode, offset_start, 0, &b->entry_uc);
>> + if (IS_ERR(b->entry_uprobe)) {
>> + ret = PTR_ERR(b->entry_uprobe);
>> + b->entry_uprobe = NULL;
>> + goto err_path;
>> + }
>> +
>> + b->stop_uprobe = uprobe_register(inode, offset_stop, 0, &b->stop_uc);
>> + if (IS_ERR(b->stop_uprobe)) {
>> + ret = PTR_ERR(b->stop_uprobe);
>> + b->stop_uprobe = NULL;
>> + goto err_entry;
>> + }
>> +
>> + list_add_tail(&b->list, &tlob_uprobe_list);
>> + return 0;
>> +
>> +err_entry:
>> + uprobe_unregister_nosync(b->entry_uprobe, &b->entry_uc);
>> + uprobe_unregister_sync();
>> +err_path:
>> + path_put(&b->path);
>> +err_free:
>> + kfree(b);
>> + return ret;
>> +}
>> +
>> +/*
>> + * Remove the uprobe binding for (offset_start, binpath).
>> + * binpath is resolved to a dentry for comparison so symlinks are handled
>> + * correctly. Called with tlob_uprobe_mutex held.
>> + */
>> +static void tlob_remove_uprobe_by_key(loff_t offset_start, const char *binpath)
>> +{
>> + struct tlob_uprobe_binding *b, *tmp;
>> + struct path remove_path;
>> +
>> + if (kern_path(binpath, LOOKUP_FOLLOW, &remove_path))
>> + return;
>> +
>> + list_for_each_entry_safe(b, tmp, &tlob_uprobe_list, list) {
>> + if (b->offset_start != offset_start)
>> + continue;
>> + if (b->path.dentry != remove_path.dentry)
>> + continue;
>> + uprobe_unregister_nosync(b->entry_uprobe, &b->entry_uc);
>> + uprobe_unregister_nosync(b->stop_uprobe, &b->stop_uc);
>> + list_del(&b->list);
>> + uprobe_unregister_sync();
>> + path_put(&b->path);
>> + kfree(b);
>> + break;
>> + }
>> +
>> + path_put(&remove_path);
>> +}
>> +
>> +/* Unregister all uprobe bindings; called from disable_tlob(). */
>> +static void tlob_remove_all_uprobes(void)
>> +{
>> + struct tlob_uprobe_binding *b, *tmp;
>> +
>> + mutex_lock(&tlob_uprobe_mutex);
>> + list_for_each_entry_safe(b, tmp, &tlob_uprobe_list, list) {
>> + uprobe_unregister_nosync(b->entry_uprobe, &b->entry_uc);
>> + uprobe_unregister_nosync(b->stop_uprobe, &b->stop_uc);
>> + list_del(&b->list);
>> + path_put(&b->path);
>> + kfree(b);
>> + }
>> + mutex_unlock(&tlob_uprobe_mutex);
>> + uprobe_unregister_sync();
>> +}
>> +
>> +/*
>> + * tracefs "monitor" file
>> + *
>> + * Read: one "threshold_us:0xoffset_start:0xoffset_stop:binary_path\n"
>> + * line per registered uprobe binding.
>> + * Write: "threshold_us:offset_start:offset_stop:binary_path" - add uprobe binding
>> + * "-offset_start:binary_path" - remove uprobe binding
>> + */
>> +
>> +static ssize_t tlob_monitor_read(struct file *file,
>> + char __user *ubuf,
>> + size_t count, loff_t *ppos)
>> +{
>> + /* pid(10) + threshold(20) + 2 offsets(2*18) + path(256) + delimiters */
>> + const int line_sz = TLOB_MAX_PATH + 72;
>> + struct tlob_uprobe_binding *b;
>> + char *buf, *p;
>> + int n = 0, buf_sz, pos = 0;
>> + ssize_t ret;
>> +
>> + mutex_lock(&tlob_uprobe_mutex);
>> + list_for_each_entry(b, &tlob_uprobe_list, list)
>> + n++;
>> + mutex_unlock(&tlob_uprobe_mutex);
>> +
>> + buf_sz = (n ? n : 1) * line_sz + 1;
>> + buf = kmalloc(buf_sz, GFP_KERNEL);
>> + if (!buf)
>> + return -ENOMEM;
>> +
>> + mutex_lock(&tlob_uprobe_mutex);
>> + list_for_each_entry(b, &tlob_uprobe_list, list) {
>> + p = b->binpath;
>> + pos += scnprintf(buf + pos, buf_sz - pos,
>> + "%llu:0x%llx:0x%llx:%s\n",
>> + b->threshold_us,
>> + (unsigned long long)b->offset_start,
>> + (unsigned long long)b->offset_stop,
>> + p);
>> + }
>> + mutex_unlock(&tlob_uprobe_mutex);
>> +
>> + ret = simple_read_from_buffer(ubuf, count, ppos, buf, pos);
>> + kfree(buf);
>> + return ret;
>> +}
>> +
>> +/*
>> + * Parse "threshold_us:offset_start:offset_stop:binary_path".
>> + * binary_path comes last so it may freely contain ':'.
>> + * Returns 0 on success.
>> + */
>> +VISIBLE_IF_KUNIT int tlob_parse_uprobe_line(char *buf, u64 *thr_out,
>> + char **path_out,
>> + loff_t *start_out, loff_t *stop_out)
>> +{
>> + unsigned long long thr;
>> + long long start, stop;
>> + int n = 0;
>> +
>> + /*
>> + * %llu : decimal-only (microseconds)
>> + * %lli : auto-base, accepts 0x-prefixed hex for offsets
>> + * %n : records the byte offset of the first path character
>> + */
>> + if (sscanf(buf, "%llu:%lli:%lli:%n", &thr, &start, &stop, &n) != 3)
>> + return -EINVAL;
>> + if (thr == 0 || n == 0 || buf[n] == '\0')
>> + return -EINVAL;
>> + if (start < 0 || stop < 0)
>> + return -EINVAL;
>> +
>> + *thr_out = thr;
>> + *start_out = start;
>> + *stop_out = stop;
>> + *path_out = buf + n;
>> + return 0;
>> +}
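
The parsing rules in the quoted comment (decimal threshold, auto-base offsets, `%n` marking where the path begins) can be tried out in userspace. The following is a minimal sketch, not the kernel code: `parse_binding_line()` is a hypothetical name, it relies on the C library's `sscanf()` rather than the kernel's, and it returns -1 instead of -EINVAL.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical userspace stand-in for the quoted tlob_parse_uprobe_line().
 * Parses "threshold_us:offset_start:offset_stop:binary_path".
 * Returns 0 on success, -1 on malformed input.
 */
static int parse_binding_line(const char *buf, unsigned long long *thr_out,
			      long long *start_out, long long *stop_out,
			      const char **path_out)
{
	unsigned long long thr;
	long long start, stop;
	int n = 0;

	assert(buf != NULL);

	/* %n stores a byte count and is not counted in the return value. */
	if (sscanf(buf, "%llu:%lli:%lli:%n", &thr, &start, &stop, &n) != 3)
		return -1;
	/* n stays 0 when the third ':' is missing; an empty path is rejected. */
	if (thr == 0 || n == 0 || buf[n] == '\0')
		return -1;
	if (start < 0 || stop < 0)
		return -1;

	*thr_out = thr;
	*start_out = start;
	*stop_out = stop;
	*path_out = buf + n;	/* everything after the third ':' is the path */
	return 0;
}
```

Because the path is taken as the remainder of the buffer, a line such as "100:1:2:/a:b" yields the path "/a:b", which matches the stated intent that binary_path may contain ':'.
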
>> +
>> +static ssize_t tlob_monitor_write(struct file *file,
>> + const char __user *ubuf,
>> + size_t count, loff_t *ppos)
>> +{
>> + char buf[TLOB_MAX_PATH + 64];
>> + loff_t offset_start, offset_stop;
>> + u64 threshold_us;
>> + char *binpath;
>> + int ret;
>> +
>> + if (count >= sizeof(buf))
>> + return -EINVAL;
>> + if (copy_from_user(buf, ubuf, count))
>> + return -EFAULT;
>> + buf[count] = '\0';
>> +
>> + if (count > 0 && buf[count - 1] == '\n')
>> + buf[count - 1] = '\0';
>> +
>> + /* Remove request: "-offset_start:binary_path" */
>> + if (buf[0] == '-') {
>> + long long off;
>> + int n = 0;
>> +
>> + if (sscanf(buf + 1, "%lli:%n", &off, &n) != 1 || n == 0)
>> + return -EINVAL;
>> + binpath = buf + 1 + n;
>> + if (binpath[0] != '/')
>> + return -EINVAL;
>> +
>> + mutex_lock(&tlob_uprobe_mutex);
>> + tlob_remove_uprobe_by_key((loff_t)off, binpath);
>> + mutex_unlock(&tlob_uprobe_mutex);
>> +
>> + return (ssize_t)count;
>> + }
>> +
>> + /*
>> + * Uprobe binding: "threshold_us:offset_start:offset_stop:binary_path"
>> + * binpath points into buf at the start of the path field.
>> + */
>> + ret = tlob_parse_uprobe_line(buf, &threshold_us,
>> + &binpath, &offset_start, &offset_stop);
>> + if (ret)
>> + return ret;
>> +
>> + mutex_lock(&tlob_uprobe_mutex);
>> + ret = tlob_add_uprobe(threshold_us, binpath, offset_start, offset_stop);
>> + mutex_unlock(&tlob_uprobe_mutex);
>> + return ret ? ret : (ssize_t)count;
>> +}
>> +
>> +static const struct file_operations tlob_monitor_fops = {
>> + .open = simple_open,
>> + .read = tlob_monitor_read,
>> + .write = tlob_monitor_write,
>> + .llseek = noop_llseek,
>> +};
>> +
>> +/*
>> + * __tlob_init_monitor / __tlob_destroy_monitor - called with rv_interface_lock
>> + * held (required by da_monitor_init/destroy via rv_get/put_task_monitor_slot).
>> + */
>> +static int __tlob_init_monitor(void)
>> +{
>> + int i, retval;
>> +
>> + tlob_state_cache = kmem_cache_create("tlob_task_state",
>> + sizeof(struct tlob_task_state),
>> + 0, 0, NULL);
>> + if (!tlob_state_cache)
>> + return -ENOMEM;
>> +
>> + for (i = 0; i < TLOB_HTABLE_SIZE; i++)
>> + INIT_HLIST_HEAD(&tlob_htable[i]);
>> + atomic_set(&tlob_num_monitored, 0);
>> +
>> + retval = da_monitor_init();
>> + if (retval) {
>> + kmem_cache_destroy(tlob_state_cache);
>> + tlob_state_cache = NULL;
>> + return retval;
>> + }
>> +
>> + rv_this.enabled = 1;
>> + return 0;
>> +}
>> +
>> +static void __tlob_destroy_monitor(void)
>> +{
>> + rv_this.enabled = 0;
>> + tlob_stop_all();
>> + tlob_remove_all_uprobes();
>> + /*
>> + * Drain pending call_rcu() callbacks from tlob_stop_all() before
>> + * destroying the kmem_cache.
>> + */
>> + synchronize_rcu();
>> + da_monitor_destroy();
>> + kmem_cache_destroy(tlob_state_cache);
>> + tlob_state_cache = NULL;
>> +}
>> +
>> +/*
>> + * tlob_init_monitor / tlob_destroy_monitor - KUnit wrappers that acquire
>> + * rv_interface_lock, satisfying the lockdep_assert_held() inside
>> + * rv_get/put_task_monitor_slot().
>> + */
>> +VISIBLE_IF_KUNIT int tlob_init_monitor(void)
>> +{
>> + int ret;
>> +
>> + mutex_lock(&rv_interface_lock);
>> + ret = __tlob_init_monitor();
>> + mutex_unlock(&rv_interface_lock);
>> + return ret;
>> +}
>> +EXPORT_SYMBOL_IF_KUNIT(tlob_init_monitor);
>> +
>> +VISIBLE_IF_KUNIT void tlob_destroy_monitor(void)
>> +{
>> + mutex_lock(&rv_interface_lock);
>> + __tlob_destroy_monitor();
>> + mutex_unlock(&rv_interface_lock);
>> +}
>> +EXPORT_SYMBOL_IF_KUNIT(tlob_destroy_monitor);
>> +
>> +VISIBLE_IF_KUNIT int tlob_enable_hooks(void)
>> +{
>> + rv_attach_trace_probe("tlob", sched_switch, handle_sched_switch);
>> + rv_attach_trace_probe("tlob", sched_wakeup, handle_sched_wakeup);
>> + return 0;
>> +}
>> +EXPORT_SYMBOL_IF_KUNIT(tlob_enable_hooks);
>> +
>> +VISIBLE_IF_KUNIT void tlob_disable_hooks(void)
>> +{
>> + rv_detach_trace_probe("tlob", sched_switch, handle_sched_switch);
>> + rv_detach_trace_probe("tlob", sched_wakeup, handle_sched_wakeup);
>> +}
>> +EXPORT_SYMBOL_IF_KUNIT(tlob_disable_hooks);
>> +
>> +/*
>> + * enable_tlob / disable_tlob - called by rv_enable/disable_monitor() which
>> + * already holds rv_interface_lock; call the __ variants directly.
>> + */
>> +static int enable_tlob(void)
>> +{
>> + int retval;
>> +
>> + retval = __tlob_init_monitor();
>> + if (retval)
>> + return retval;
>> +
>> + return tlob_enable_hooks();
>> +}
>> +
>> +static void disable_tlob(void)
>> +{
>> + tlob_disable_hooks();
>> + __tlob_destroy_monitor();
>> +}
>> +
>> +static struct rv_monitor rv_this = {
>> + .name = "tlob",
>> + .description = "Per-task latency-over-budget monitor.",
>> + .enable = enable_tlob,
>> + .disable = disable_tlob,
>> + .reset = da_monitor_reset_all,
>> + .enabled = 0,
>> +};
>> +
>> +static int __init register_tlob(void)
>> +{
>> + int ret;
>> +
>> + ret = rv_register_monitor(&rv_this, NULL);
>> + if (ret)
>> + return ret;
>> +
>> + if (rv_this.root_d) {
>> + tracefs_create_file("monitor", 0644, rv_this.root_d, NULL,
>> + &tlob_monitor_fops);
>> + }
>> +
>> + return 0;
>> +}
>> +
>> +static void __exit unregister_tlob(void)
>> +{
>> + rv_unregister_monitor(&rv_this);
>> +}
>> +
>> +module_init(register_tlob);
>> +module_exit(unregister_tlob);
>> +
>> +MODULE_LICENSE("GPL");
>> +MODULE_AUTHOR("Wen Yang <wen.yang@linux.dev>");
>> +MODULE_DESCRIPTION("tlob: task latency over budget per-task monitor.");
>> diff --git a/kernel/trace/rv/monitors/tlob/tlob.h b/kernel/trace/rv/monitors/tlob/tlob.h
>> new file mode 100644
>> index 000000000..3438a6175
>> --- /dev/null
>> +++ b/kernel/trace/rv/monitors/tlob/tlob.h
>> @@ -0,0 +1,145 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +#ifndef _RV_TLOB_H
>> +#define _RV_TLOB_H
>> +
>> +/*
>> + * C representation of the tlob automaton, generated from tlob.dot via rvgen
>> + * and extended with tlob_start_task()/tlob_stop_task() declarations.
>> + * For the format description see Documentation/trace/rv/deterministic_automata.rst
>> + */
>> +
>> +#include <linux/rv.h>
>> +#include <uapi/linux/rv.h>
>> +
>> +#define MONITOR_NAME tlob
>> +
>> +enum states_tlob {
>> + unmonitored_tlob,
>> + on_cpu_tlob,
>> + off_cpu_tlob,
>> + state_max_tlob,
>> +};
>> +
>> +#define INVALID_STATE state_max_tlob
>> +
>> +enum events_tlob {
>> + trace_start_tlob,
>> + switch_in_tlob,
>> + switch_out_tlob,
>> + sched_wakeup_tlob,
>> + trace_stop_tlob,
>> + budget_expired_tlob,
>> + event_max_tlob,
>> +};
>> +
>> +struct automaton_tlob {
>> + char *state_names[state_max_tlob];
>> + char *event_names[event_max_tlob];
>> + unsigned char function[state_max_tlob][event_max_tlob];
>> + unsigned char initial_state;
>> + bool final_states[state_max_tlob];
>> +};
>> +
>> +static const struct automaton_tlob automaton_tlob = {
>> + .state_names = {
>> + "unmonitored",
>> + "on_cpu",
>> + "off_cpu",
>> + },
>> + .event_names = {
>> + "trace_start",
>> + "switch_in",
>> + "switch_out",
>> + "sched_wakeup",
>> + "trace_stop",
>> + "budget_expired",
>> + },
>> + .function = {
>> + /* unmonitored */
>> + {
>> + on_cpu_tlob, /* trace_start */
>> + unmonitored_tlob, /* switch_in */
>> + unmonitored_tlob, /* switch_out */
>> + unmonitored_tlob, /* sched_wakeup */
>> + INVALID_STATE, /* trace_stop */
>> + INVALID_STATE, /* budget_expired */
>> + },
>> + /* on_cpu */
>> + {
>> + INVALID_STATE, /* trace_start */
>> + INVALID_STATE, /* switch_in */
>> + off_cpu_tlob, /* switch_out */
>> + on_cpu_tlob, /* sched_wakeup */
>> + unmonitored_tlob, /* trace_stop */
>> + unmonitored_tlob, /* budget_expired */
>> + },
>> + /* off_cpu */
>> + {
>> + INVALID_STATE, /* trace_start */
>> + on_cpu_tlob, /* switch_in */
>> + off_cpu_tlob, /* switch_out */
>> + off_cpu_tlob, /* sched_wakeup */
>> + unmonitored_tlob, /* trace_stop */
>> + unmonitored_tlob, /* budget_expired */
>> + },
>> + },
>> + /*
>> + * final_states: unmonitored is the sole accepting state.
>> + * Violations are recorded via ntf_push and tlob_budget_exceeded.
>> + */
>> + .initial_state = unmonitored_tlob,
>> + .final_states = { 1, 0, 0 },
>> +};
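
To make the transition table above easier to follow, here is a small userspace sketch that replays it. The enum and function names (`tlob_next`, `tlob_step()`) are hypothetical and chosen for this illustration only; the table entries are copied from the quoted `.function` array, and an invalid transition is reported as -1 rather than through the RV error tracepoint.

```c
#include <assert.h>

enum tlob_state { UNMONITORED, ON_CPU, OFF_CPU, STATE_MAX };
enum tlob_event_id {
	TRACE_START, SWITCH_IN, SWITCH_OUT, SCHED_WAKEUP,
	TRACE_STOP, BUDGET_EXPIRED, EVENT_MAX
};

#define INVALID STATE_MAX

/* Rows are current states, columns are events, as in the quoted table. */
static const unsigned char tlob_next[STATE_MAX][EVENT_MAX] = {
	[UNMONITORED] = { ON_CPU, UNMONITORED, UNMONITORED, UNMONITORED,
			  INVALID, INVALID },
	[ON_CPU]      = { INVALID, INVALID, OFF_CPU, ON_CPU,
			  UNMONITORED, UNMONITORED },
	[OFF_CPU]     = { INVALID, ON_CPU, OFF_CPU, OFF_CPU,
			  UNMONITORED, UNMONITORED },
};

/*
 * Apply one event. Returns 0 and updates *state on a valid transition,
 * or -1 and leaves *state untouched on an invalid one.
 */
static int tlob_step(enum tlob_state *state, enum tlob_event_id ev)
{
	unsigned char next;

	assert(*state < STATE_MAX && ev < EVENT_MAX);
	next = tlob_next[*state][ev];
	if (next == INVALID)
		return -1;
	*state = (enum tlob_state)next;
	return 0;
}
```

Walking a typical window (trace_start, switch_out, sched_wakeup, switch_in, budget_expired) ends back in unmonitored, the sole accepting state. A second trace_start while on_cpu, or a trace_stop while unmonitored, is rejected.
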
>> +
>> +/* Exported for use by the RV ioctl layer (rv_dev.c) */
>> +int tlob_start_task(struct task_struct *task, u64 threshold_us,
>> + struct file *notify_file, u64 tag);
>> +int tlob_stop_task(struct task_struct *task);
>> +
>> +/* Maximum number of concurrently monitored tasks (also used by KUnit). */
>> +#define TLOB_MAX_MONITORED 64U
>> +
>> +/*
>> + * Ring buffer constants (also published in UAPI for mmap size calculation).
>> + */
>> +#define TLOB_RING_DEFAULT_CAP 64U /* records allocated at open() */
>> +#define TLOB_RING_MIN_CAP 8U /* minimum accepted by mmap() */
>> +#define TLOB_RING_MAX_CAP 4096U /* maximum accepted by mmap() */
>> +
>> +/**
>> + * struct tlob_ring - per-fd mmap-capable violation ring buffer.
>> + *
>> + * Allocated as a contiguous page range at rv_open() time:
>> + * page 0: struct tlob_mmap_page (shared with userspace)
>> + * pages 1-N: struct tlob_event[capacity]
>> + */
>> +struct tlob_ring {
>> + struct tlob_mmap_page *page;
>> + struct tlob_event *data;
>> + u32 mask;
>> + spinlock_t lock;
>> + unsigned long base;
>> + unsigned int order;
>> +};
>> +
>> +/**
>> + * struct rv_file_priv - per-fd private data for /dev/rv.
>> + */
>> +struct rv_file_priv {
>> + struct tlob_ring ring;
>> + wait_queue_head_t waitq;
>> +};
>> +
>> +#if IS_ENABLED(CONFIG_KUNIT)
>> +int tlob_init_monitor(void);
>> +void tlob_destroy_monitor(void);
>> +int tlob_enable_hooks(void);
>> +void tlob_disable_hooks(void);
>> +void tlob_event_push_kunit(struct rv_file_priv *priv,
>> + const struct tlob_event *info);
>> +int tlob_parse_uprobe_line(char *buf, u64 *thr_out,
>> + char **path_out,
>> + loff_t *start_out, loff_t *stop_out);
>> +#endif /* CONFIG_KUNIT */
>> +
>> +#endif /* _RV_TLOB_H */
>> diff --git a/kernel/trace/rv/monitors/tlob/tlob_trace.h b/kernel/trace/rv/monitors/tlob/tlob_trace.h
>> new file mode 100644
>> index 000000000..b08d67776
>> --- /dev/null
>> +++ b/kernel/trace/rv/monitors/tlob/tlob_trace.h
>> @@ -0,0 +1,42 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +
>> +/*
>> + * Snippet to be included in rv_trace.h
>> + */
>> +
>> +#ifdef CONFIG_RV_MON_TLOB
>> +/*
>> + * tlob uses the generic event_da_monitor_id and error_da_monitor_id event
>> + * classes so that both event classes are instantiated. This avoids a
>> + * -Werror=unused-variable warning that the compiler emits when a
>> + * DECLARE_EVENT_CLASS has no corresponding DEFINE_EVENT instance.
>> + *
>> + * The event_tlob tracepoint is defined here but the call-site in
>> + * da_handle_event() is overridden with a no-op macro below so that no
>> + * trace record is emitted on every scheduler context switch. Budget
>> + * violations are reported via the dedicated tlob_budget_exceeded event.
>> + *
>> + * error_tlob IS kept active so that invalid DA transitions (programming
>> + * errors) are still visible in the ftrace ring buffer for debugging.
>> + */
>> +DEFINE_EVENT(event_da_monitor_id, event_tlob,
>> + TP_PROTO(int id, char *state, char *event, char *next_state,
>> + bool final_state),
>> + TP_ARGS(id, state, event, next_state, final_state));
>> +
>> +DEFINE_EVENT(error_da_monitor_id, error_tlob,
>> + TP_PROTO(int id, char *state, char *event),
>> + TP_ARGS(id, state, event));
>> +
>> +/*
>> + * Override the trace_event_tlob() call-site with a no-op after the
>> + * DEFINE_EVENT above has satisfied the event class instantiation
>> + * requirement. The tracepoint symbol itself exists (and can be enabled
>> + * via tracefs) but the automatic call from da_handle_event() is silenced
>> + * to avoid per-context-switch ftrace noise during normal operation.
>> + */
>> +#undef trace_event_tlob
>> +#define trace_event_tlob(id, state, event, next_state, final_state) \
>> + do { (void)(id); (void)(state); (void)(event); \
>> + (void)(next_state); (void)(final_state); } while (0)
>> +#endif /* CONFIG_RV_MON_TLOB */
>> diff --git a/kernel/trace/rv/rv.c b/kernel/trace/rv/rv.c
>> index ee4e68102..e754e76d5 100644
>> --- a/kernel/trace/rv/rv.c
>> +++ b/kernel/trace/rv/rv.c
>> @@ -148,6 +148,10 @@
>> #include <rv_trace.h>
>> #endif
>>
>> +#ifdef CONFIG_RV_MON_TLOB
>> +EXPORT_TRACEPOINT_SYMBOL_GPL(tlob_budget_exceeded);
>> +#endif
>> +
>> #include "rv.h"
>>
>> DEFINE_MUTEX(rv_interface_lock);
>> diff --git a/kernel/trace/rv/rv_dev.c b/kernel/trace/rv/rv_dev.c
>> new file mode 100644
>> index 000000000..a052f3203
>> --- /dev/null
>> +++ b/kernel/trace/rv/rv_dev.c
>> @@ -0,0 +1,602 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * rv_dev.c - /dev/rv misc device for RV monitor self-instrumentation
>> + *
>> + * A single misc device (MISC_DYNAMIC_MINOR) serves all RV monitors.
>> + * ioctl numbers encode the monitor identity:
>> + *
>> + * 0x01 - 0x1F tlob (task latency over budget)
>> + * 0x20 - 0x3F reserved
>> + *
>> + * Each monitor exports tlob_start_task() / tlob_stop_task() which are
>> + * called here. The calling task is identified by current.
>> + *
>> + * Magic: RV_IOC_MAGIC (0xB9), defined in include/uapi/linux/rv.h
>> + *
>> + * Per-fd private data (rv_file_priv)
>> + * ------------------------------------
>> + * Every open() of /dev/rv allocates an rv_file_priv (defined in tlob.h).
>> + * When TLOB_IOCTL_TRACE_START is called with args.notify_fd >= 0, violations
>> + * are pushed as tlob_event records into that fd's per-fd ring buffer (tlob_ring)
>> + * and its poll/epoll waitqueue is woken.
>> + *
>> + * Consumers drain records with read() on the notify_fd; read() blocks until
>> + * at least one record is available (unless O_NONBLOCK is set).
>> + *
>> + * Per-thread "started" tracking (tlob_task_handle)
>> + * -------------------------------------------------
>> + * tlob_stop_task() returns -ESRCH in two distinct situations:
>> + *
>> + * (a) The deadline timer already fired and removed the tlob hash-table
>> + * entry before TRACE_STOP arrived -> budget was exceeded -> -EOVERFLOW
>> + *
>> + * (b) TRACE_START was never called for this thread -> programming error
>> + * -> -ESRCH
>> + *
>> + * To distinguish them, rv_dev.c maintains a lightweight hash table
>> + * (tlob_handles) that records a tlob_task_handle for every task_struct *
>> + * for which a successful TLOB_IOCTL_TRACE_START has been
>> + * issued but the corresponding TLOB_IOCTL_TRACE_STOP has not yet arrived.
>> + *
>> + * tlob_task_handle is a thin "session ticket" -- it carries only the
>> + * task pointer and the owning file descriptor. The heavy per-task state
>> + * (hrtimer, DA state, threshold) lives in tlob_task_state inside tlob.c.
>> + *
>> + * The table is keyed on task_struct * (same key as tlob.c), protected
>> + * by tlob_handles_lock (spinlock, irq-safe). No get_task_struct()
>> + * refcount is needed here because tlob.c already holds a reference for
>> + * each live entry.
>> + *
>> + * Multiple threads may share the same fd. Each thread has its own
>> + * tlob_task_handle in the table, so concurrent TRACE_START / TRACE_STOP
>> + * calls from different threads do not interfere.
>> + *
>> + * The fd release path (rv_release) calls tlob_stop_task() for every
>> + * handle in tlob_handles that belongs to the closing fd, ensuring cleanup
>> + * even if the user forgets to call TRACE_STOP.
>> + */
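
The header comment describes how -ESRCH from tlob_stop_task() is split into two user-visible results. The TRACE_STOP ioctl path itself is not part of this excerpt, so the following is only a sketch of the decision as described: `trace_stop_result()` and its parameters are hypothetical, with `had_handle` standing for the return value of tlob_handle_free() and `stop_ret` for the return value of tlob_stop_task().

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical helper mirroring cases (a) and (b) from the comment above.
 *
 * had_handle: 1 if a session ticket existed for the calling thread.
 * stop_ret:   0 if the entry was still present, -ESRCH if it was gone.
 */
static int trace_stop_result(int had_handle, int stop_ret)
{
	assert(stop_ret == 0 || stop_ret == -ESRCH);

	if (!had_handle)
		return -ESRCH;		/* (b) TRACE_START was never called */
	if (stop_ret == -ESRCH)
		return -EOVERFLOW;	/* (a) deadline fired before TRACE_STOP */
	return 0;			/* stopped within budget */
}
```

How the real ioctl orders the handle lookup relative to the stop call, and what it returns when no handle exists but the entry is still present, cannot be told from the quoted hunk. The sketch assumes the handle check takes precedence.
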
>> +#include <linux/file.h>
>> +#include <linux/fs.h>
>> +#include <linux/gfp.h>
>> +#include <linux/hash.h>
>> +#include <linux/mm.h>
>> +#include <linux/miscdevice.h>
>> +#include <linux/module.h>
>> +#include <linux/poll.h>
>> +#include <linux/sched.h>
>> +#include <linux/slab.h>
>> +#include <linux/spinlock.h>
>> +#include <linux/uaccess.h>
>> +#include <uapi/linux/rv.h>
>> +
>> +#ifdef CONFIG_RV_MON_TLOB
>> +#include "monitors/tlob/tlob.h"
>> +#endif
>> +
>> +/* -----------------------------------------------------------------------
>> + * tlob_task_handle - per-thread session ticket for the ioctl interface
>> + *
>> + * One handle is allocated by TLOB_IOCTL_TRACE_START and freed by
>> + * TLOB_IOCTL_TRACE_STOP (or by rv_release if the fd is closed).
>> + *
>> + * @hlist: Hash-table linkage in tlob_handles (keyed on task pointer).
>> + * @task: The monitored thread. Plain pointer; no refcount held here
>> + * because tlob.c holds one for the lifetime of the monitoring
>> + * window, which encompasses the lifetime of this handle.
>> + * @file: The /dev/rv file descriptor that issued TRACE_START.
>> + * Used by rv_release() to sweep orphaned handles on close().
>> + * -----------------------------------------------------------------------
>> + */
>> +#define TLOB_HANDLES_BITS 5
>> +#define TLOB_HANDLES_SIZE (1 << TLOB_HANDLES_BITS)
>> +
>> +struct tlob_task_handle {
>> + struct hlist_node hlist;
>> + struct task_struct *task;
>> + struct file *file;
>> +};
>> +
>> +static struct hlist_head tlob_handles[TLOB_HANDLES_SIZE];
>> +static DEFINE_SPINLOCK(tlob_handles_lock);
>> +
>> +static unsigned int tlob_handle_hash(const struct task_struct *task)
>> +{
>> + return hash_ptr((void *)task, TLOB_HANDLES_BITS);
>> +}
>> +
>> +/* Must be called with tlob_handles_lock held. */
>> +static struct tlob_task_handle *
>> +tlob_handle_find_locked(struct task_struct *task)
>> +{
>> + struct tlob_task_handle *h;
>> + unsigned int slot = tlob_handle_hash(task);
>> +
>> + hlist_for_each_entry(h, &tlob_handles[slot], hlist) {
>> + if (h->task == task)
>> + return h;
>> + }
>> + return NULL;
>> +}
>> +
>> +/*
>> + * tlob_handle_alloc - record that @task has an active monitoring session
>> + * opened via @file.
>> + *
>> + * Returns 0 on success, -EEXIST if @task already has a handle (double
>> + * TRACE_START without TRACE_STOP), -ENOMEM on allocation failure.
>> + */
>> +static int tlob_handle_alloc(struct task_struct *task, struct file *file)
>> +{
>> + struct tlob_task_handle *h;
>> + unsigned long flags;
>> + unsigned int slot;
>> +
>> + h = kmalloc(sizeof(*h), GFP_KERNEL);
>> + if (!h)
>> + return -ENOMEM;
>> + h->task = task;
>> + h->file = file;
>> +
>> + spin_lock_irqsave(&tlob_handles_lock, flags);
>> + if (tlob_handle_find_locked(task)) {
>> + spin_unlock_irqrestore(&tlob_handles_lock, flags);
>> + kfree(h);
>> + return -EEXIST;
>> + }
>> + slot = tlob_handle_hash(task);
>> + hlist_add_head(&h->hlist, &tlob_handles[slot]);
>> + spin_unlock_irqrestore(&tlob_handles_lock, flags);
>> + return 0;
>> +}
>> +
>> +/*
>> + * tlob_handle_free - remove the handle for @task and free it.
>> + *
>> + * Returns 1 if a handle existed (TRACE_START was called), 0 if not found
>> + * (TRACE_START was never called for this thread).
>> + */
>> +static int tlob_handle_free(struct task_struct *task)
>> +{
>> + struct tlob_task_handle *h;
>> + unsigned long flags;
>> +
>> + spin_lock_irqsave(&tlob_handles_lock, flags);
>> + h = tlob_handle_find_locked(task);
>> + if (h) {
>> + hlist_del_init(&h->hlist);
>> + spin_unlock_irqrestore(&tlob_handles_lock, flags);
>> + kfree(h);
>> + return 1;
>> + }
>> + spin_unlock_irqrestore(&tlob_handles_lock, flags);
>> + return 0;
>> +}
>> +
>> +/*
>> + * tlob_handle_sweep_file - release all handles owned by @file.
>> + *
>> + * Called from rv_release() when the fd is closed without TRACE_STOP.
>> + * Calls tlob_stop_task() for each orphaned handle to drain the tlob
>> + * monitoring entries and prevent resource leaks in tlob.c.
>> + *
>> + * Handles are collected under the lock (short critical section), then
>> + * processed outside it (tlob_stop_task() may sleep/spin internally).
>> + */
>> +#ifdef CONFIG_RV_MON_TLOB
>> +static void tlob_handle_sweep_file(struct file *file)
>> +{
>> + struct tlob_task_handle *batch[TLOB_HANDLES_SIZE];
>> + struct tlob_task_handle *h;
>> + struct hlist_node *tmp;
>> + unsigned long flags;
>> + int i, n = 0;
>> +
>> + spin_lock_irqsave(&tlob_handles_lock, flags);
>> + for (i = 0; i < TLOB_HANDLES_SIZE; i++) {
>> + hlist_for_each_entry_safe(h, tmp, &tlob_handles[i], hlist) {
>> + if (h->file == file) {
>> + hlist_del_init(&h->hlist);
>> + batch[n++] = h;
>> + }
>> + }
>> + }
>> + spin_unlock_irqrestore(&tlob_handles_lock, flags);
>> +
>> + for (i = 0; i < n; i++) {
>> + /*
>> + * Ignore -ESRCH: the deadline timer may have already fired
>> + * and cleaned up the tlob entry.
>> + */
>> + tlob_stop_task(batch[i]->task);
>> + kfree(batch[i]);
>> + }
>> +}
>> +#else
>> +static inline void tlob_handle_sweep_file(struct file *file) {}
>> +#endif /* CONFIG_RV_MON_TLOB */
>> +
>> +/* -----------------------------------------------------------------------
>> + * Ring buffer lifecycle
>> + * -----------------------------------------------------------------------
>> + */
>> +
>> +/*
>> + * tlob_ring_alloc - allocate a ring of @cap records (must be a power of 2).
>> + *
>> + * Allocates a physically contiguous block of pages:
>> + * page 0 : struct tlob_mmap_page (control page, shared with userspace)
>> + * pages 1..N : struct tlob_event[cap] (data pages)
>> + *
>> + * Each page is marked reserved so it can be mapped to userspace via mmap().
>> + */
>> +static int tlob_ring_alloc(struct tlob_ring *ring, u32 cap)
>> +{
>> + unsigned int total = PAGE_SIZE + cap * sizeof(struct tlob_event);
>> + unsigned int order = get_order(total);
>> + unsigned long base;
>> + unsigned int i;
>> +
>> + base = __get_free_pages(GFP_KERNEL | __GFP_ZERO, order);
>> + if (!base)
>> + return -ENOMEM;
>> +
>> + for (i = 0; i < (1u << order); i++)
>> + SetPageReserved(virt_to_page((void *)(base + i * PAGE_SIZE)));
>> +
>> + ring->base = base;
>> + ring->order = order;
>> + ring->page = (struct tlob_mmap_page *)base;
>> + ring->data = (struct tlob_event *)(base + PAGE_SIZE);
>> + ring->mask = cap - 1;
>> + spin_lock_init(&ring->lock);
>> +
>> + ring->page->capacity = cap;
>> + ring->page->version = 1;
>> + ring->page->data_offset = PAGE_SIZE;
>> + ring->page->record_size = sizeof(struct tlob_event);
>> + return 0;
>> +}
>> +
>> +static void tlob_ring_free(struct tlob_ring *ring)
>> +{
>> + unsigned int i;
>> +
>> + if (!ring->base)
>> + return;
>> +
>> + for (i = 0; i < (1u << ring->order); i++)
>> + ClearPageReserved(virt_to_page((void *)(ring->base + i * PAGE_SIZE)));
>> +
>> + free_pages(ring->base, ring->order);
>> + ring->base = 0;
>> + ring->page = NULL;
>> + ring->data = NULL;
>> +}
>> +
>> +/* -----------------------------------------------------------------------
>> + * File operations
>> + * -----------------------------------------------------------------------
>> + */
>> +
>> +static int rv_open(struct inode *inode, struct file *file)
>> +{
>> + struct rv_file_priv *priv;
>> + int ret;
>> +
>> + priv = kzalloc(sizeof(*priv), GFP_KERNEL);
>> + if (!priv)
>> + return -ENOMEM;
>> +
>> + ret = tlob_ring_alloc(&priv->ring, TLOB_RING_DEFAULT_CAP);
>> + if (ret) {
>> + kfree(priv);
>> + return ret;
>> + }
>> +
>> + init_waitqueue_head(&priv->waitq);
>> + file->private_data = priv;
>> + return 0;
>> +}
>> +
>> +static int rv_release(struct inode *inode, struct file *file)
>> +{
>> + struct rv_file_priv *priv = file->private_data;
>> +
>> + tlob_handle_sweep_file(file);
>> + tlob_ring_free(&priv->ring);
>> + kfree(priv);
>> + file->private_data = NULL;
>> + return 0;
>> +}
>> +
>> +static __poll_t rv_poll(struct file *file, poll_table *wait)
>> +{
>> + struct rv_file_priv *priv = file->private_data;
>> +
>> + if (!priv)
>> + return EPOLLERR;
>> +
>> + poll_wait(file, &priv->waitq, wait);
>> +
>> + /*
>> + * Pairs with smp_store_release(&ring->page->data_head, ...) in
>> + * tlob_event_push(). No lock needed: head is written by the kernel
>> + * producer and read here; tail is written by the consumer and we only
>> + * need an approximate check for the poll fast path.
>> + */
>> + if (smp_load_acquire(&priv->ring.page->data_head) !=
>> + READ_ONCE(priv->ring.page->data_tail))
>> + return EPOLLIN | EPOLLRDNORM;
>> +
>> + return 0;
>> +}
>> +
>> +/*
>> + * rv_read - consume tlob_event violation records from this fd's ring buffer.
>> + *
>> + * Each read() returns a whole number of struct tlob_event records. @count must
>> + * be at least sizeof(struct tlob_event); partial-record sizes are rejected with
>> + * -EINVAL.
>> + *
>> + * Blocking behaviour follows O_NONBLOCK on the fd:
>> + * O_NONBLOCK clear: blocks until at least one record is available.
>> + * O_NONBLOCK set: returns -EAGAIN immediately if the ring is empty.
>> + *
>> + * Returns the number of bytes copied (always a multiple of sizeof tlob_event),
>> + * -EAGAIN if non-blocking and empty, or a negative error code.
>> + *
>> + * read() and mmap() share the same ring and data_tail cursor; do not use
>> + * both simultaneously on the same fd.
>> + */
>> +static ssize_t rv_read(struct file *file, char __user *buf, size_t count,
>> + loff_t *ppos)
>> +{
>> + struct rv_file_priv *priv = file->private_data;
>> + struct tlob_ring *ring;
>> + size_t rec = sizeof(struct tlob_event);
>> + unsigned long irqflags;
>> + ssize_t done = 0;
>> + int ret;
>> +
>> + if (!priv)
>> + return -ENODEV;
>> +
>> + ring = &priv->ring;
>> +
>> + if (count < rec)
>> + return -EINVAL;
>> +
>> + /* Blocking path: sleep until the producer advances data_head. */
>> + if (!(file->f_flags & O_NONBLOCK)) {
>> + ret = wait_event_interruptible(priv->waitq,
>> + /* pairs with smp_store_release() in the producer */
>> + smp_load_acquire(&ring->page->data_head) !=
>> + READ_ONCE(ring->page->data_tail));
>> + if (ret)
>> + return ret;
>> + }
>> +
>> + /*
>> + * Drain records into the caller's buffer. ring->lock serialises
>> + * concurrent read() callers and the softirq producer.
>> + */
>> + while (done + rec <= count) {
>> + struct tlob_event record;
>> + u32 head, tail;
>> +
>> + spin_lock_irqsave(&ring->lock, irqflags);
>> + /* pairs with smp_store_release() in the producer */
>> + head = smp_load_acquire(&ring->page->data_head);
>> + tail = ring->page->data_tail;
>> + if (head == tail) {
>> + spin_unlock_irqrestore(&ring->lock, irqflags);
>> + break;
>> + }
>> + record = ring->data[tail & ring->mask];
>> + WRITE_ONCE(ring->page->data_tail, tail + 1);
>> + spin_unlock_irqrestore(&ring->lock, irqflags);
>> +
>> + if (copy_to_user(buf + done, &record, rec))
>> + return done ? done : -EFAULT;
>> + done += rec;
>> + }
>> +
>> + return done ? done : -EAGAIN;
>> +}
>> +
>> +/*
>> + * rv_mmap - map the per-fd violation ring buffer into userspace.
>> + *
>> + * The mmap region covers the full ring allocation:
>> + *
>> + * offset 0 : struct tlob_mmap_page (control page)
>> + * offset PAGE_SIZE : struct tlob_event[capacity] (data pages)
>> + *
>> + * The caller must map exactly PAGE_SIZE + capacity * sizeof(struct tlob_event)
>> + * bytes starting at offset 0 (vm_pgoff must be 0). The actual capacity is
>> + * read from tlob_mmap_page.capacity after a successful mmap(2).
>> + *
>> + * Private mappings (MAP_PRIVATE) are rejected: the shared data_tail field
>> + * written by userspace must be visible to the kernel producer.
>> + */
>> +static int rv_mmap(struct file *file, struct vm_area_struct *vma)
>> +{
>> + struct rv_file_priv *priv = file->private_data;
>> + struct tlob_ring *ring;
>> + unsigned long size = vma->vm_end - vma->vm_start;
>> + unsigned long ring_size;
>> +
>> + if (!priv)
>> + return -ENODEV;
>> +
>> + ring = &priv->ring;
>> +
>> + if (vma->vm_pgoff != 0)
>> + return -EINVAL;
>> +
>> + ring_size = PAGE_ALIGN(PAGE_SIZE + ((unsigned long)(ring->mask + 1) *
>> + sizeof(struct tlob_event)));
>> + if (size != ring_size)
>> + return -EINVAL;
>> +
>> + if (!(vma->vm_flags & VM_SHARED))
>> + return -EINVAL;
>> +
>> + return remap_pfn_range(vma, vma->vm_start,
>> + page_to_pfn(virt_to_page((void *)ring->base)),
>> + ring_size, vma->vm_page_prot);
>> +}
>> +
>> +/* -----------------------------------------------------------------------
>> + * ioctl dispatcher
>> + * -----------------------------------------------------------------------
>> + */
>> +
>> +static long rv_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
>> +{
>> + unsigned int nr = _IOC_NR(cmd);
>> +
>> + /*
>> + * Verify the magic byte so we don't accidentally handle ioctls
>> + * intended for a different device.
>> + */
>> + if (_IOC_TYPE(cmd) != RV_IOC_MAGIC)
>> + return -ENOTTY;
>> +
>> +#ifdef CONFIG_RV_MON_TLOB
>> + /* tlob: ioctl numbers 0x01 - 0x1F */
>> + switch (cmd) {
>> + case TLOB_IOCTL_TRACE_START: {
>> + struct tlob_start_args args;
>> + struct file *notify_file = NULL;
>> + int ret, hret;
>> +
>> + if (copy_from_user(&args,
>> + (struct tlob_start_args __user *)arg,
>> + sizeof(args)))
>> + return -EFAULT;
>> + if (args.threshold_us == 0)
>> + return -EINVAL;
>> + if (args.flags != 0)
>> + return -EINVAL;
>> +
>> + /*
>> + * If notify_fd >= 0, resolve it to a file pointer.
>> + * fget() bumps the reference count; tlob.c drops it
>> + * via fput() when the monitoring window ends.
>> + * Reject non-/dev/rv fds to prevent type confusion.
>> + */
>> + if (args.notify_fd >= 0) {
>> + notify_file = fget(args.notify_fd);
>> + if (!notify_file)
>> + return -EBADF;
>> + if (notify_file->f_op != file->f_op) {
>> + fput(notify_file);
>> + return -EINVAL;
>> + }
>> + }
>> +
>> + ret = tlob_start_task(current, args.threshold_us,
>> + notify_file, args.tag);
>> + if (ret != 0) {
>> + /* tlob.c did not take ownership; drop ref. */
>> + if (notify_file)
>> + fput(notify_file);
>> + return ret;
>> + }
>> +
>> + /*
>> + * Record session handle. Free any stale handle left by
>> + * a previous window whose deadline timer fired (timer
>> + * removes tlob_task_state but cannot touch tlob_handles).
>> + */
>> + tlob_handle_free(current);
>> + hret = tlob_handle_alloc(current, file);
>> + if (hret < 0) {
>> + tlob_stop_task(current);
>> + return hret;
>> + }
>> + return 0;
>> + }
>> + case TLOB_IOCTL_TRACE_STOP: {
>> + int had_handle;
>> + int ret;
>> +
>> + /*
>> + * Atomically remove the session handle for current.
>> + *
>> + * had_handle == 0: TRACE_START was never called for
>> + * this thread -> caller bug -> -ESRCH
>> + *
>> + * had_handle == 1: TRACE_START was called. If
>> + * tlob_stop_task() now returns
>> + * -ESRCH, the deadline timer already
>> + * fired -> budget exceeded -> -EOVERFLOW
>> + */
>> + had_handle = tlob_handle_free(current);
>> + if (!had_handle)
>> + return -ESRCH;
>> +
>> + ret = tlob_stop_task(current);
>> + return (ret == -ESRCH) ? -EOVERFLOW : ret;
>> + }
>> + default:
>> + break;
>> + }
>> +#endif /* CONFIG_RV_MON_TLOB */
>> +
>> + return -ENOTTY;
>> +}
>> +
>> +/* -----------------------------------------------------------------------
>> + * Module init / exit
>> + * -----------------------------------------------------------------------
>> + */
>> +
>> +static const struct file_operations rv_fops = {
>> + .owner = THIS_MODULE,
>> + .open = rv_open,
>> + .release = rv_release,
>> + .read = rv_read,
>> + .poll = rv_poll,
>> + .mmap = rv_mmap,
>> + .unlocked_ioctl = rv_ioctl,
>> +#ifdef CONFIG_COMPAT
>> + .compat_ioctl = rv_ioctl,
>> +#endif
>> + .llseek = noop_llseek,
>> +};
>> +
>> +/*
>> + * 0666: /dev/rv is a self-instrumentation device. All ioctls operate
>> + * exclusively on the calling task (current); no task can monitor another
>> + * via this interface. Opening the device does not grant any privilege
>> + * beyond observing one's own latency, so world-read/write is appropriate.
>> + */
>> +static struct miscdevice rv_miscdev = {
>> + .minor = MISC_DYNAMIC_MINOR,
>> + .name = "rv",
>> + .fops = &rv_fops,
>> + .mode = 0666,
>> +};
>> +
>> +static int __init rv_ioctl_init(void)
>> +{
>> + int i;
>> +
>> + for (i = 0; i < TLOB_HANDLES_SIZE; i++)
>> + INIT_HLIST_HEAD(&tlob_handles[i]);
>> +
>> + return misc_register(&rv_miscdev);
>> +}
>> +
>> +static void __exit rv_ioctl_exit(void)
>> +{
>> + misc_deregister(&rv_miscdev);
>> +}
>> +
>> +module_init(rv_ioctl_init);
>> +module_exit(rv_ioctl_exit);
>> +
>> +MODULE_LICENSE("GPL");
>> +MODULE_DESCRIPTION("RV ioctl interface via /dev/rv");
>> diff --git a/kernel/trace/rv/rv_trace.h b/kernel/trace/rv/rv_trace.h
>> index 4a6faddac..65d6c6485 100644
>> --- a/kernel/trace/rv/rv_trace.h
>> +++ b/kernel/trace/rv/rv_trace.h
>> @@ -126,6 +126,7 @@ DECLARE_EVENT_CLASS(error_da_monitor_id,
>> #include <monitors/snroc/snroc_trace.h>
>> #include <monitors/nrp/nrp_trace.h>
>> #include <monitors/sssw/sssw_trace.h>
>> +#include <monitors/tlob/tlob_trace.h>
>> // Add new monitors based on CONFIG_DA_MON_EVENTS_ID here
>>
>> #endif /* CONFIG_DA_MON_EVENTS_ID */
>> @@ -202,6 +203,55 @@ TRACE_EVENT(rv_retries_error,
>> __get_str(event), __get_str(name))
>> );
>> #endif /* CONFIG_RV_MON_MAINTENANCE_EVENTS */
>> +
>> +#ifdef CONFIG_RV_MON_TLOB
>> +/*
>> + * tlob_budget_exceeded - emitted when a monitored task exceeds its latency
>> + * budget. Carries the on-CPU / off-CPU time breakdown so that the cause
>> + * of the overrun (CPU-bound vs. scheduling/I/O latency) is immediately
>> + * visible in the ftrace ring buffer without post-processing.
>> + */
>> +TRACE_EVENT(tlob_budget_exceeded,
>> +
>> + TP_PROTO(struct task_struct *task, u64 threshold_us,
>> + u64 on_cpu_us, u64 off_cpu_us, u32 switches,
>> + bool state_is_on_cpu, u64 tag),
>> +
>> + TP_ARGS(task, threshold_us, on_cpu_us, off_cpu_us, switches,
>> + state_is_on_cpu, tag),
>> +
>> + TP_STRUCT__entry(
>> + __string(comm, task->comm)
>> + __field(pid_t, pid)
>> + __field(u64, threshold_us)
>> + __field(u64, on_cpu_us)
>> + __field(u64, off_cpu_us)
>> + __field(u32, switches)
>> + __field(bool, state_is_on_cpu)
>> + __field(u64, tag)
>> + ),
>> +
>> + TP_fast_assign(
>> + __assign_str(comm);
>> + __entry->pid = task->pid;
>> + __entry->threshold_us = threshold_us;
>> + __entry->on_cpu_us = on_cpu_us;
>> + __entry->off_cpu_us = off_cpu_us;
>> + __entry->switches = switches;
>> + __entry->state_is_on_cpu = state_is_on_cpu;
>> + __entry->tag = tag;
>> + ),
>> +
>> + TP_printk("%s[%d]: budget exceeded threshold=%llu on_cpu=%llu off_cpu=%llu switches=%u state=%s tag=0x%016llx",
>> + __get_str(comm), __entry->pid,
>> + __entry->threshold_us,
>> + __entry->on_cpu_us, __entry->off_cpu_us,
>> + __entry->switches,
>> + __entry->state_is_on_cpu ? "on_cpu" : "off_cpu",
>> + __entry->tag)
>> +);
>> +#endif /* CONFIG_RV_MON_TLOB */
>> +
>> #endif /* _TRACE_RV_H */
>>
>> /* This part must be outside protection */
>
^ permalink raw reply
* Re: [RFC PATCH 2/4] rv/tlob: Add tlob deterministic automaton monitor
From: Gabriele Monaco @ 2026-04-16 15:35 UTC (permalink / raw)
To: Wen Yang
Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
linux-trace-kernel, linux-kernel
In-Reply-To: <228deda8-3685-4f07-afd5-d3f3ca531154@linux.dev>
Hello,
On Thu, 2026-04-16 at 23:09 +0800, Wen Yang wrote:
>
> Thanks for the review. Here's my plan for each point -- let me know if
> the direction looks right.
>
>
> - Timed automata
>
> The HA framework [1] is a good match when the timeout threshold is
> global or state-determined, but tlob needs a per-invocation threshold
> supplied at TRACE_START time -- fitting that into HA would require
> framework changes.
Not quite, look at the nomiss monitor, the deadline comes directly from the
deadline entity.
What I meant by using a per-object monitor is that you can use your custom
struct as the monitor target. It carries your per-invocation threshold because
you instantiate it on start.
Now you can simply do ha_get_target(ha_mon)->threshold and you get your value.
You can define "clk < THRESHOLD_NS()" in the dot representation and rvgen will
do most of the work for you. It's probably better to use nanoseconds so you
avoid conversions when dealing with hrtimers. You can do the conversion
transparently when initialising, so the user still passes microseconds.
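A minimal userspace sketch of the idea above, assuming a per-invocation target
struct that stores the budget in nanoseconds. The names struct tlob_target,
tlob_target_init() and tlob_within_budget() are hypothetical and are not part
of the RV or HA API; the guard only mirrors the shape of "clk < THRESHOLD_NS()".

```c
#include <stdint.h>

#define NSEC_PER_USEC 1000ULL

/* Hypothetical monitor target: one instance per monitored invocation. */
struct tlob_target {
	uint64_t threshold_ns;	/* budget, kept in ns to match hrtimer units */
	uint64_t start_ns;	/* timestamp taken when monitoring starts */
};

/* Convert once at init so the caller can keep passing microseconds. */
static void tlob_target_init(struct tlob_target *t, uint64_t threshold_us,
			     uint64_t now_ns)
{
	t->threshold_ns = threshold_us * NSEC_PER_USEC;
	t->start_ns = now_ns;
}

/* Models the guard "clk < THRESHOLD_NS()" for this target. */
static int tlob_within_budget(const struct tlob_target *t, uint64_t now_ns)
{
	return (now_ns - t->start_ns) < t->threshold_ns;
}
```

Because the conversion happens only in the init helper, the timer path never
has to scale units, and the userspace-facing argument can stay in microseconds.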
> My plan is to use da_monitor_init_hook() -- the same mechanism HA
> monitors use internally -- to arm the per-invocation hrtimer once
> da_create_storage() has stored the monitor_target. This gives the same
> "timer fires => violation" semantics without touching the HA infrastructure.
>
> If you see a cleaner way to pass per-invocation data through HA I'm
> happy to go that route.
The above looks cleaner to me, what do you think?
da_monitor_init_hook() isn't really meant to be used by monitors; it's more for
the infrastructure to extend da_monitor.h easily. You can still use it if
there's no other way, though.
> - Unmonitored state / da_handle_start_event
>
> Fair point. I'll drop the explicit unmonitored state and the
> trace_event_tlob() redefinition. tlob_start_task() will use
> da_handle_start_event() to allocate storage, set initial state to on_cpu,
> and fire the init hook to arm the timer in one shot. tlob_stop_task()
> calls da_monitor_reset() directly.
>
> - Per-object monitors
>
> Will do. The custom hash table goes away; I'll switch to RV_MON_PER_OBJ
> with:
>
> typedef struct tlob_task_state *monitor_target;
>
> da_get_target_by_id() handles the sched_switch hot path lookup.
>
Exactly! That should do.
> - RV-way violations
>
> Agreed. budget_expired will be declared INVALID in all states so the
> framework calls react() (error_tlob tracepoint + any registered reactor)
> and da_monitor_reset() automatically. tlob won't emit any tracepoint of
> its own.
>
> One note on the /dev/tlob ioctl: TLOB_IOCTL_TRACE_STOP returns -EOVERFLOW
> to the caller when the budget was exceeded. This is just a syscall
> return code -- not a second reporting path -- to let in-process
> instrumentation react inline without polling the trace buffer.
> Let me know if you have concerns about keeping this.
>
I'm not sure how much faster it can be compared to attaching to tracefs, which
should be quite light if you just listen to error events. Sure, you'd need a few
more libraries.
I'm a bit concerned about adding new interfaces (ioctl) when we already have
tracepoints and reactors. The reactors themselves are not as flexible as they
should be, though; if required, we could definitely create an ioctl reactor just
for this.
For now ignore all this and continue with TLOB_IOCTL_TRACE_STOP, then we can
think about the details.
> - Generic uprobe helper
>
> Proposed interface:
>
> struct rv_uprobe *rv_uprobe_attach_path(
> struct path *path, loff_t offset,
> int (*entry_fn)(struct rv_uprobe *, struct pt_regs *, __u64 *),
> int (*ret_fn) (struct rv_uprobe *, unsigned long func,
> struct pt_regs *, __u64 *),
> void *priv);
>
> struct rv_uprobe *rv_uprobe_attach(
> const char *binpath, loff_t offset,
> int (*entry_fn)(struct rv_uprobe *, struct pt_regs *, __u64 *),
> int (*ret_fn) (struct rv_uprobe *, unsigned long func,
> struct pt_regs *, __u64 *),
> void *priv);
>
> void rv_uprobe_detach(struct rv_uprobe *p);
>
> struct rv_uprobe exposes three read-only fields to monitors (offset,
> priv, path); the uprobe_consumer and callbacks would be kept private to
> the implementation, so monitors need not include <linux/uprobes.h>.
>
> rv_uprobe_attach() resolves the path and delegates to
> rv_uprobe_attach_path(); the latter avoids a redundant kern_path() when
> registering multiple probes on the same binary:
>
> kern_path(binpath, LOOKUP_FOLLOW, &path);
> b->start = rv_uprobe_attach_path(&path, offset_start, entry_fn,
> NULL, b);
> b->stop = rv_uprobe_attach_path(&path, offset_stop, stop_fn,
> NULL, b);
> path_put(&path);
>
> Does the interface look reasonable, or did you have a different shape in
> mind?
>
Yeah, seems reasonable. Then we'd need to keep the uprobe around for
deinitialisation, but having it global is probably the best way to do that
without overengineering anything.
Thanks,
Gabriele
^ permalink raw reply
* [PATCH v2 00/28] vfs/nfsd: add support for CB_NOTIFY callbacks in directory delegations
From: Jeff Layton @ 2026-04-16 17:35 UTC (permalink / raw)
To: Alexander Viro, Christian Brauner, Jan Kara, Chuck Lever,
Alexander Aring, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, NeilBrown,
Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust,
Anna Schumaker, Amir Goldstein
Cc: Calum Mackay, linux-fsdevel, linux-kernel, linux-trace-kernel,
linux-doc, linux-nfs, Jeff Layton
This version has a number of significant changes from the last. I
dropped some of the R-b's for this reason.
Of particular interest to the fsnotify maintainers will be the
FSNOTIFY_EVENT_RENAME data type. This combines the FSNOTIFY_EVENT_DENTRY
and FSNOTIFY_EVENT_INODE event types so that the fsnotify event can
additionally send information about a file that was unlinked as a result
of being replaced via rename().
There are also a host of other bugfixes, and a new tracepoint. Please
consider this for v7.2.
Original cover letter follows:
---------------------------------8<------------------------------------
This patchset builds on the directory delegation work we did a few
months ago, to add support for CB_NOTIFY callbacks for some events. In
particular, creates, unlinks and renames. The server also sends updated
directory attributes in the notifications. With this support, the client
can register interest in a directory and get notifications about changes
within it without losing its lease.
The series starts with patches to allow the vfs to ignore certain types
of events on directories. nfsd can then request these sorts of
delegations on directories, and then set up inotify watches on the
directory to trigger sending CB_NOTIFY events.
This has mainly been tested with pynfs, with some new testcases that
I'll be posting soon. They seem to work fine with those tests, but I
don't think we'll want to merge these until we have a complete
client-side implementation to test against.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
Changes in v2:
- Fix __break_lease handling with different lease types on flc_lease list
- Add FSNOTIFY_EVENT_RENAME data type to properly handle cross-directory rename events
- Display fsnotify mask symbolically in tracepoints
- New tracepoint in fsnotify()
- Recalc fsnotify mask after unlocking lease instead of before
- Don't notify client that is making the changes
- After sending CB_NOTIFY, requeue if new events came in while running
- Document removal of NFS4_VERIFIER_SIZE/NFS4_FHSIZE from UAPI headers
- Properly release nfsd_dir_fsnotify_group on server shutdown
- Link to v1: https://lore.kernel.org/r/20260407-dir-deleg-v1-0-aaf68c478abd@kernel.org
---
Jeff Layton (28):
filelock: pass current blocking lease to trace_break_lease_block() rather than "new_fl"
filelock: add support for ignoring deleg breaks for dir change events
filelock: add a tracepoint to start of break_lease()
filelock: add an inode_lease_ignore_mask helper
fsnotify: new tracepoint in fsnotify()
fsnotify: add fsnotify_modify_mark_mask()
fsnotify: add FSNOTIFY_EVENT_RENAME data type
nfsd: check fl_lmops in nfsd_breaker_owns_lease()
nfsd: add protocol support for CB_NOTIFY
nfs_common: add new NOTIFY4_* flags proposed in RFC8881bis
nfsd: allow nfsd to get a dir lease with an ignore mask
nfsd: update the fsnotify mark when setting or removing a dir delegation
nfsd: make nfsd4_callback_ops->prepare operation bool return
nfsd: add callback encoding and decoding linkages for CB_NOTIFY
nfsd: use RCU to protect fi_deleg_file
nfsd: add data structures for handling CB_NOTIFY
nfsd: add notification handlers for dir events
nfsd: add tracepoint to dir_event handler
nfsd: apply the notify mask to the delegation when requested
nfsd: add helper to marshal a fattr4 from completed args
nfsd: allow nfsd4_encode_fattr4_change() to work with no export
nfsd: send basic file attributes in CB_NOTIFY
nfsd: allow encoding a filehandle into fattr4 without a svc_fh
nfsd: add a fi_connectable flag to struct nfs4_file
nfsd: add the filehandle to returned attributes in CB_NOTIFY
nfsd: properly track requested child attributes
nfsd: track requested dir attributes
nfsd: add support to CB_NOTIFY for dir attribute changes
Documentation/sunrpc/xdr/nfs4_1.x | 264 ++++++++++++++-
fs/attr.c | 2 +-
fs/locks.c | 118 +++++--
fs/namei.c | 31 +-
fs/nfsd/filecache.c | 70 +++-
fs/nfsd/nfs4callback.c | 60 +++-
fs/nfsd/nfs4layouts.c | 5 +-
fs/nfsd/nfs4proc.c | 17 +
fs/nfsd/nfs4state.c | 550 ++++++++++++++++++++++++++++----
fs/nfsd/nfs4xdr.c | 323 +++++++++++++++++--
fs/nfsd/nfs4xdr_gen.c | 601 ++++++++++++++++++++++++++++++++++-
fs/nfsd/nfs4xdr_gen.h | 20 +-
fs/nfsd/state.h | 72 ++++-
fs/nfsd/trace.h | 23 ++
fs/nfsd/xdr4.h | 5 +
fs/nfsd/xdr4cb.h | 12 +
fs/notify/fsnotify.c | 5 +
fs/notify/mark.c | 29 ++
fs/posix_acl.c | 4 +-
fs/xattr.c | 4 +-
include/linux/filelock.h | 54 +++-
include/linux/fsnotify.h | 8 +-
include/linux/fsnotify_backend.h | 21 ++
include/linux/nfs4.h | 127 --------
include/linux/sunrpc/xdrgen/nfs4_1.h | 291 ++++++++++++++++-
include/trace/events/filelock.h | 38 ++-
include/trace/events/fsnotify.h | 51 +++
include/trace/misc/fsnotify.h | 35 ++
include/uapi/linux/nfs4.h | 2 -
29 files changed, 2518 insertions(+), 324 deletions(-)
---
base-commit: f4d71dd7fd9cec357c32431fa55c107b96008312
change-id: 20260325-dir-deleg-339066dd1017
Best regards,
--
Jeff Layton <jlayton@kernel.org>
^ permalink raw reply
* [PATCH v2 01/28] filelock: pass current blocking lease to trace_break_lease_block() rather than "new_fl"
From: Jeff Layton @ 2026-04-16 17:35 UTC (permalink / raw)
To: Alexander Viro, Christian Brauner, Jan Kara, Chuck Lever,
Alexander Aring, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, NeilBrown,
Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust,
Anna Schumaker, Amir Goldstein
Cc: Calum Mackay, linux-fsdevel, linux-kernel, linux-trace-kernel,
linux-doc, linux-nfs, Jeff Layton
In-Reply-To: <20260416-dir-deleg-v2-0-851426a550f6@kernel.org>
The break_lease_block tracepoint currently just shows the type of
"new_fl", which we can predict from the "flags" value. Switch it to
display info about "fl" instead, as that's the file_lease on which the
code is blocking.
For trace_break_lease_unblock(), pass it a NULL pointer. "fl" may have
been freed by that point, and passing it the info in new_fl is
deceptive.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
fs/locks.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/fs/locks.c b/fs/locks.c
index 8e44b1f6c15a..d82c5be7aa5b 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -1691,7 +1691,7 @@ int __break_lease(struct inode *inode, unsigned int flags)
} else
break_time++;
locks_insert_block(&fl->c, &new_fl->c, leases_conflict);
- trace_break_lease_block(inode, new_fl);
+ trace_break_lease_block(inode, fl);
spin_unlock(&ctx->flc_lock);
percpu_up_read(&file_rwsem);
@@ -1702,7 +1702,7 @@ int __break_lease(struct inode *inode, unsigned int flags)
percpu_down_read(&file_rwsem);
spin_lock(&ctx->flc_lock);
- trace_break_lease_unblock(inode, new_fl);
+ trace_break_lease_unblock(inode, NULL);
__locks_delete_block(&new_fl->c);
if (error >= 0) {
/*
--
2.53.0
^ permalink raw reply related
* [PATCH v2 02/28] filelock: add support for ignoring deleg breaks for dir change events
From: Jeff Layton @ 2026-04-16 17:35 UTC (permalink / raw)
To: Alexander Viro, Christian Brauner, Jan Kara, Chuck Lever,
Alexander Aring, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, NeilBrown,
Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust,
Anna Schumaker, Amir Goldstein
Cc: Calum Mackay, linux-fsdevel, linux-kernel, linux-trace-kernel,
linux-doc, linux-nfs, Jeff Layton
In-Reply-To: <20260416-dir-deleg-v2-0-851426a550f6@kernel.org>
If an NFS client requests a directory delegation with a notification
bitmask covering directory change events, the server shouldn't recall
the delegation. Instead the client will be notified of the change after
the fact.
Add support for ignoring lease breaks on directory changes. Add a new
flags parameter to try_break_deleg() and teach __break_lease how to
ignore certain types of delegation break events.
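As a rough userspace model of the matching rule described above (the kernel
helper in this patch is ignore_dir_deleg_break()): a break is skipped only when
the break event and the lease's ignore flag refer to the same kind of directory
change. The bit values below are hypothetical, chosen only for this sketch, and
ignore_dir_break() is an illustrative name, not a kernel function.

```c
#include <stdbool.h>

/* Hypothetical bit values; the real definitions live in the kernel headers. */
#define LEASE_BREAK_DIR_CREATE	(1u << 8)
#define LEASE_BREAK_DIR_DELETE	(1u << 9)
#define LEASE_BREAK_DIR_RENAME	(1u << 10)

#define FL_IGN_DIR_CREATE	(1u << 0)
#define FL_IGN_DIR_DELETE	(1u << 1)
#define FL_IGN_DIR_RENAME	(1u << 2)

/*
 * Return true when the lease holder asked to ignore this kind of
 * directory change, i.e. the delegation should not be recalled.
 */
static bool ignore_dir_break(unsigned int lease_flags, unsigned int break_flags)
{
	if ((break_flags & LEASE_BREAK_DIR_CREATE) &&
	    (lease_flags & FL_IGN_DIR_CREATE))
		return true;
	if ((break_flags & LEASE_BREAK_DIR_DELETE) &&
	    (lease_flags & FL_IGN_DIR_DELETE))
		return true;
	if ((break_flags & LEASE_BREAK_DIR_RENAME) &&
	    (lease_flags & FL_IGN_DIR_RENAME))
		return true;

	return false;
}
```

A break with no LEASE_BREAK_DIR_* bit set (for example a setattr, which passes
0) never matches, so such events still recall the delegation.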
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
fs/attr.c | 2 +-
fs/locks.c | 82 ++++++++++++++++++++++++++++-------------
fs/namei.c | 31 +++++++++-------
fs/posix_acl.c | 4 +-
fs/xattr.c | 4 +-
include/linux/filelock.h | 53 ++++++++++++++++++--------
include/trace/events/filelock.h | 5 ++-
7 files changed, 120 insertions(+), 61 deletions(-)
diff --git a/fs/attr.c b/fs/attr.c
index e7d7c6d19fe9..28744f0e9ff4 100644
--- a/fs/attr.c
+++ b/fs/attr.c
@@ -547,7 +547,7 @@ int notify_change(struct mnt_idmap *idmap, struct dentry *dentry,
* breaking the delegation in this case.
*/
if (!(ia_valid & ATTR_DELEG)) {
- error = try_break_deleg(inode, delegated_inode);
+ error = try_break_deleg(inode, 0, delegated_inode);
if (error)
return error;
}
diff --git a/fs/locks.c b/fs/locks.c
index d82c5be7aa5b..8b5958f34b61 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -1583,29 +1583,63 @@ static bool leases_conflict(struct file_lock_core *lc, struct file_lock_core *bc
}
static bool
-any_leases_conflict(struct inode *inode, struct file_lease *breaker)
+ignore_dir_deleg_break(struct file_lease *fl, unsigned int flags)
{
- struct file_lock_context *ctx = inode->i_flctx;
- struct file_lock_core *flc;
+ if ((flags & LEASE_BREAK_DIR_CREATE) && (fl->c.flc_flags & FL_IGN_DIR_CREATE))
+ return true;
+ if ((flags & LEASE_BREAK_DIR_DELETE) && (fl->c.flc_flags & FL_IGN_DIR_DELETE))
+ return true;
+ if ((flags & LEASE_BREAK_DIR_RENAME) && (fl->c.flc_flags & FL_IGN_DIR_RENAME))
+ return true;
+
+ return false;
+}
+
+static unsigned int
+break_lease_flags_to_type(unsigned int flags)
+{
+ if (flags & LEASE_BREAK_LEASE)
+ return FL_LEASE;
+ else if (flags & LEASE_BREAK_DELEG)
+ return FL_DELEG;
+ else if (flags & LEASE_BREAK_LAYOUT)
+ return FL_LAYOUT;
+ else
+ return 0;
+
+}
+
+static struct file_lease *
+first_visible_lease(struct inode *inode, struct file_lease *new_fl, unsigned int flags)
+{
+ struct file_lock_context *ctx = locks_inode_context(inode);
+ struct file_lease *fl;
lockdep_assert_held(&ctx->flc_lock);
- list_for_each_entry(flc, &ctx->flc_lease, flc_list) {
- if (leases_conflict(flc, &breaker->c))
- return true;
+ list_for_each_entry(fl, &ctx->flc_lease, c.flc_list) {
+ if (!leases_conflict(&fl->c, &new_fl->c))
+ continue;
+ if (S_ISDIR(inode->i_mode) && ignore_dir_deleg_break(fl, flags))
+ continue;
+ return fl;
}
- return false;
+ return NULL;
}
+
/**
- * __break_lease - revoke all outstanding leases on file
- * @inode: the inode of the file to return
- * @flags: LEASE_BREAK_* flags
+ * __break_lease - revoke all outstanding leases on file
+ * @inode: the inode of the file to return
+ * @flags: LEASE_BREAK_* flags
*
- * break_lease (inlined for speed) has checked there already is at least
- * some kind of lock (maybe a lease) on this file. Leases are broken on
- * a call to open() or truncate(). This function can block waiting for the
- * lease break unless you specify LEASE_BREAK_NONBLOCK.
+ * break_lease (inlined for speed) has checked there already is at least
+ * some kind of lock (maybe a lease) on this file. Leases and Delegations
+ * are broken on a call to open() or truncate(). Delegations are also
+ * broken on any event that would change the ctime. Directory delegations
+ * are broken whenever the directory changes (unless the delegation is set
+ * up to ignore the event). This function can block waiting for the lease
+ * break unless you specify LEASE_BREAK_NONBLOCK.
*/
int __break_lease(struct inode *inode, unsigned int flags)
{
@@ -1617,13 +1651,8 @@ int __break_lease(struct inode *inode, unsigned int flags)
bool want_write = !(flags & LEASE_BREAK_OPEN_RDONLY);
int error = 0;
- if (flags & LEASE_BREAK_LEASE)
- type = FL_LEASE;
- else if (flags & LEASE_BREAK_DELEG)
- type = FL_DELEG;
- else if (flags & LEASE_BREAK_LAYOUT)
- type = FL_LAYOUT;
- else
+ type = break_lease_flags_to_type(flags);
+ if (!type)
return -EINVAL;
new_fl = lease_alloc(NULL, type, want_write ? F_WRLCK : F_RDLCK);
@@ -1642,7 +1671,7 @@ int __break_lease(struct inode *inode, unsigned int flags)
time_out_leases(inode, &dispose);
- if (!any_leases_conflict(inode, new_fl))
+ if (!first_visible_lease(inode, new_fl, flags))
goto out;
break_time = 0;
@@ -1655,6 +1684,8 @@ int __break_lease(struct inode *inode, unsigned int flags)
list_for_each_entry_safe(fl, tmp, &ctx->flc_lease, c.flc_list) {
if (!leases_conflict(&fl->c, &new_fl->c))
continue;
+ if (S_ISDIR(inode->i_mode) && ignore_dir_deleg_break(fl, flags))
+ continue;
if (want_write) {
if (fl->c.flc_flags & FL_UNLOCK_PENDING)
continue;
@@ -1670,7 +1701,8 @@ int __break_lease(struct inode *inode, unsigned int flags)
locks_delete_lock_ctx(&fl->c, &dispose);
}
- if (list_empty(&ctx->flc_lease))
+ fl = first_visible_lease(inode, new_fl, flags);
+ if (!fl)
goto out;
if (flags & LEASE_BREAK_NONBLOCK) {
@@ -1680,7 +1712,6 @@ int __break_lease(struct inode *inode, unsigned int flags)
}
restart:
- fl = list_first_entry(&ctx->flc_lease, struct file_lease, c.flc_list);
break_time = fl->fl_break_time;
if (break_time != 0) {
if (time_after(jiffies, break_time)) {
@@ -1711,7 +1742,8 @@ int __break_lease(struct inode *inode, unsigned int flags)
*/
if (error == 0)
time_out_leases(inode, &dispose);
- if (any_leases_conflict(inode, new_fl))
+ fl = first_visible_lease(inode, new_fl, flags);
+ if (fl)
goto restart;
error = 0;
}
diff --git a/fs/namei.c b/fs/namei.c
index 9e5500dad14f..e3cbd9f877bd 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -4176,7 +4176,7 @@ int vfs_create(struct mnt_idmap *idmap, struct dentry *dentry, umode_t mode,
error = security_inode_create(dir, dentry, mode);
if (error)
return error;
- error = try_break_deleg(dir, di);
+ error = try_break_deleg(dir, LEASE_BREAK_DIR_CREATE, di);
if (error)
return error;
error = dir->i_op->create(idmap, dir, dentry, mode, true);
@@ -4475,7 +4475,7 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
/* Negative dentry, just create the file */
if (!dentry->d_inode && (open_flag & O_CREAT)) {
/* but break the directory lease first! */
- error = try_break_deleg(dir_inode, delegated_inode);
+ error = try_break_deleg(dir_inode, LEASE_BREAK_DIR_CREATE, delegated_inode);
if (error)
goto out_dput;
@@ -5091,7 +5091,7 @@ int vfs_mknod(struct mnt_idmap *idmap, struct inode *dir,
if (error)
return error;
- error = try_break_deleg(dir, delegated_inode);
+ error = try_break_deleg(dir, LEASE_BREAK_DIR_CREATE, delegated_inode);
if (error)
return error;
@@ -5232,7 +5232,7 @@ struct dentry *vfs_mkdir(struct mnt_idmap *idmap, struct inode *dir,
if (max_links && dir->i_nlink >= max_links)
goto err;
- error = try_break_deleg(dir, delegated_inode);
+ error = try_break_deleg(dir, LEASE_BREAK_DIR_CREATE, delegated_inode);
if (error)
goto err;
@@ -5337,7 +5337,7 @@ int vfs_rmdir(struct mnt_idmap *idmap, struct inode *dir,
if (error)
goto out;
- error = try_break_deleg(dir, delegated_inode);
+ error = try_break_deleg(dir, LEASE_BREAK_DIR_DELETE, delegated_inode);
if (error)
goto out;
@@ -5467,10 +5467,10 @@ int vfs_unlink(struct mnt_idmap *idmap, struct inode *dir,
else {
error = security_inode_unlink(dir, dentry);
if (!error) {
- error = try_break_deleg(dir, delegated_inode);
+ error = try_break_deleg(dir, LEASE_BREAK_DIR_DELETE, delegated_inode);
if (error)
goto out;
- error = try_break_deleg(target, delegated_inode);
+ error = try_break_deleg(target, 0, delegated_inode);
if (error)
goto out;
error = dir->i_op->unlink(dir, dentry);
@@ -5614,7 +5614,7 @@ int vfs_symlink(struct mnt_idmap *idmap, struct inode *dir,
if (error)
return error;
- error = try_break_deleg(dir, delegated_inode);
+ error = try_break_deleg(dir, LEASE_BREAK_DIR_CREATE, delegated_inode);
if (error)
return error;
@@ -5745,9 +5745,9 @@ int vfs_link(struct dentry *old_dentry, struct mnt_idmap *idmap,
else if (max_links && inode->i_nlink >= max_links)
error = -EMLINK;
else {
- error = try_break_deleg(dir, delegated_inode);
+ error = try_break_deleg(dir, LEASE_BREAK_DIR_CREATE, delegated_inode);
if (!error)
- error = try_break_deleg(inode, delegated_inode);
+ error = try_break_deleg(inode, 0, delegated_inode);
if (!error)
error = dir->i_op->link(old_dentry, dir, new_dentry);
}
@@ -6011,21 +6011,24 @@ int vfs_rename(struct renamedata *rd)
old_dir->i_nlink >= max_links)
goto out;
}
- error = try_break_deleg(old_dir, delegated_inode);
+ error = try_break_deleg(old_dir,
+ old_dir == new_dir ? LEASE_BREAK_DIR_RENAME :
+ LEASE_BREAK_DIR_DELETE,
+ delegated_inode);
if (error)
goto out;
if (new_dir != old_dir) {
- error = try_break_deleg(new_dir, delegated_inode);
+ error = try_break_deleg(new_dir, LEASE_BREAK_DIR_CREATE, delegated_inode);
if (error)
goto out;
}
if (!is_dir) {
- error = try_break_deleg(source, delegated_inode);
+ error = try_break_deleg(source, 0, delegated_inode);
if (error)
goto out;
}
if (target && !new_is_dir) {
- error = try_break_deleg(target, delegated_inode);
+ error = try_break_deleg(target, 0, delegated_inode);
if (error)
goto out;
}
diff --git a/fs/posix_acl.c b/fs/posix_acl.c
index 12591c95c925..b4bfe4ddf64e 100644
--- a/fs/posix_acl.c
+++ b/fs/posix_acl.c
@@ -1126,7 +1126,7 @@ int vfs_set_acl(struct mnt_idmap *idmap, struct dentry *dentry,
if (error)
goto out_inode_unlock;
- error = try_break_deleg(inode, &delegated_inode);
+ error = try_break_deleg(inode, 0, &delegated_inode);
if (error)
goto out_inode_unlock;
@@ -1234,7 +1234,7 @@ int vfs_remove_acl(struct mnt_idmap *idmap, struct dentry *dentry,
if (error)
goto out_inode_unlock;
- error = try_break_deleg(inode, &delegated_inode);
+ error = try_break_deleg(inode, 0, &delegated_inode);
if (error)
goto out_inode_unlock;
diff --git a/fs/xattr.c b/fs/xattr.c
index 3e49e612e1ba..6b67a6e76eeb 100644
--- a/fs/xattr.c
+++ b/fs/xattr.c
@@ -288,7 +288,7 @@ __vfs_setxattr_locked(struct mnt_idmap *idmap, struct dentry *dentry,
if (error)
goto out;
- error = try_break_deleg(inode, delegated_inode);
+ error = try_break_deleg(inode, 0, delegated_inode);
if (error)
goto out;
@@ -546,7 +546,7 @@ __vfs_removexattr_locked(struct mnt_idmap *idmap,
if (error)
goto out;
- error = try_break_deleg(inode, delegated_inode);
+ error = try_break_deleg(inode, 0, delegated_inode);
if (error)
goto out;
diff --git a/include/linux/filelock.h b/include/linux/filelock.h
index 5f0a2fb31450..9dd4e67a6f30 100644
--- a/include/linux/filelock.h
+++ b/include/linux/filelock.h
@@ -4,19 +4,22 @@
#include <linux/fs.h>
-#define FL_POSIX 1
-#define FL_FLOCK 2
-#define FL_DELEG 4 /* NFSv4 delegation */
-#define FL_ACCESS 8 /* not trying to lock, just looking */
-#define FL_EXISTS 16 /* when unlocking, test for existence */
-#define FL_LEASE 32 /* lease held on this file */
-#define FL_CLOSE 64 /* unlock on close */
-#define FL_SLEEP 128 /* A blocking lock */
-#define FL_DOWNGRADE_PENDING 256 /* Lease is being downgraded */
-#define FL_UNLOCK_PENDING 512 /* Lease is being broken */
-#define FL_OFDLCK 1024 /* lock is "owned" by struct file */
-#define FL_LAYOUT 2048 /* outstanding pNFS layout */
-#define FL_RECLAIM 4096 /* reclaiming from a reboot server */
+#define FL_POSIX BIT(0) /* POSIX lock */
+#define FL_FLOCK BIT(1) /* BSD lock */
+#define FL_DELEG BIT(2) /* NFSv4 delegation */
+#define FL_ACCESS BIT(3) /* not trying to lock, just looking */
+#define FL_EXISTS BIT(4) /* when unlocking, test for existence */
+#define FL_LEASE BIT(5) /* file lease */
+#define FL_CLOSE BIT(6) /* unlock on close */
+#define FL_SLEEP BIT(7) /* A blocking lock */
+#define FL_DOWNGRADE_PENDING BIT(8) /* Lease is being downgraded */
+#define FL_UNLOCK_PENDING BIT(9) /* Lease is being broken */
+#define FL_OFDLCK BIT(10) /* POSIX lock "owned" by struct file */
+#define FL_LAYOUT BIT(11) /* outstanding pNFS layout */
+#define FL_RECLAIM BIT(12) /* reclaiming from a reboot server */
+#define FL_IGN_DIR_CREATE BIT(13) /* ignore DIR_CREATE events */
+#define FL_IGN_DIR_DELETE BIT(14) /* ignore DIR_DELETE events */
+#define FL_IGN_DIR_RENAME BIT(15) /* ignore DIR_RENAME events */
#define FL_CLOSE_POSIX (FL_POSIX | FL_CLOSE)
@@ -222,6 +225,10 @@ struct file_lease *locks_alloc_lease(void);
#define LEASE_BREAK_LAYOUT BIT(2) // break layouts only
#define LEASE_BREAK_NONBLOCK BIT(3) // non-blocking break
#define LEASE_BREAK_OPEN_RDONLY BIT(4) // readonly open event
+#define LEASE_BREAK_DIR_CREATE BIT(5) // dir deleg create event
+#define LEASE_BREAK_DIR_DELETE BIT(6) // dir deleg delete event
+#define LEASE_BREAK_DIR_RENAME BIT(7) // dir deleg rename event
+
int __break_lease(struct inode *inode, unsigned int flags);
void lease_get_mtime(struct inode *, struct timespec64 *time);
@@ -516,12 +523,26 @@ static inline bool is_delegated(struct delegated_inode *di)
return di->di_inode;
}
-static inline int try_break_deleg(struct inode *inode,
+/**
+ * try_break_deleg - do a non-blocking delegation break
+ * @inode: inode that should have its delegations broken
+ * @flags: extra LEASE_BREAK_* flags to pass to break_deleg()
+ * @di: returns pointer to delegated inode (may be NULL)
+ *
+ * Break delegations in a non-blocking fashion. If there are
+ * outstanding delegations and @di is set, then an extra reference
+ * will be taken on @inode and @di->di_inode will be populated so
+ * that it may be waited upon.
+ *
+ * Returns 0 if there is no need to wait or an error. If -EWOULDBLOCK
+ * is returned, then @di will be populated (if non-NULL).
+ */
+static inline int try_break_deleg(struct inode *inode, unsigned int flags,
struct delegated_inode *di)
{
int ret;
- ret = break_deleg(inode, LEASE_BREAK_NONBLOCK);
+ ret = break_deleg(inode, flags | LEASE_BREAK_NONBLOCK);
if (ret == -EWOULDBLOCK && di) {
di->di_inode = inode;
ihold(inode);
@@ -574,7 +595,7 @@ static inline int break_deleg(struct inode *inode, unsigned int flags)
return 0;
}
-static inline int try_break_deleg(struct inode *inode,
+static inline int try_break_deleg(struct inode *inode, unsigned int flags,
struct delegated_inode *delegated_inode)
{
return 0;
diff --git a/include/trace/events/filelock.h b/include/trace/events/filelock.h
index 370016c38a5b..ef4bb0afb86a 100644
--- a/include/trace/events/filelock.h
+++ b/include/trace/events/filelock.h
@@ -28,7 +28,10 @@
{ FL_DOWNGRADE_PENDING, "FL_DOWNGRADE_PENDING" }, \
{ FL_UNLOCK_PENDING, "FL_UNLOCK_PENDING" }, \
{ FL_OFDLCK, "FL_OFDLCK" }, \
- { FL_RECLAIM, "FL_RECLAIM"})
+ { FL_RECLAIM, "FL_RECLAIM" }, \
+ { FL_IGN_DIR_CREATE, "FL_IGN_DIR_CREATE" }, \
+ { FL_IGN_DIR_DELETE, "FL_IGN_DIR_DELETE" }, \
+ { FL_IGN_DIR_RENAME, "FL_IGN_DIR_RENAME" })
#define show_fl_type(val) \
__print_symbolic(val, \
--
2.53.0
* [PATCH v2 03/28] filelock: add a tracepoint to start of break_lease()
From: Jeff Layton @ 2026-04-16 17:35 UTC
To: Alexander Viro, Christian Brauner, Jan Kara, Chuck Lever,
Alexander Aring, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, NeilBrown,
Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust,
Anna Schumaker, Amir Goldstein
Cc: Calum Mackay, linux-fsdevel, linux-kernel, linux-trace-kernel,
linux-doc, linux-nfs, Jeff Layton
In-Reply-To: <20260416-dir-deleg-v2-0-851426a550f6@kernel.org>
Add a tracepoint to the start of __break_lease(), mostly to show the
LEASE_BREAK_* flags.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
fs/locks.c | 2 ++
include/trace/events/filelock.h | 33 +++++++++++++++++++++++++++++++++
2 files changed, 35 insertions(+)
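As an illustration of the flag decoding this tracepoint is meant to
expose, here is a minimal userspace sketch. It is not the tracing
macro itself: demo_show_flags() and demo_names[] are hypothetical
names, the bit values are copied from the LEASE_BREAK_* definitions
visible in the filelock.h hunk earlier in this thread, and unnamed
bits are simply skipped.

```c
#include <stdio.h>

/* Hypothetical table pairing a LEASE_BREAK_* bit with a short name. */
struct demo_flag_name {
	unsigned int bit;
	const char *name;
};

/* Bit positions follow the filelock.h definitions shown in this series. */
static const struct demo_flag_name demo_names[] = {
	{ 1u << 2, "LAYOUT" },
	{ 1u << 3, "NONBLOCK" },
	{ 1u << 4, "OPEN_RDONLY" },
	{ 1u << 5, "DIR_CREATE" },
	{ 1u << 6, "DIR_DELETE" },
	{ 1u << 7, "DIR_RENAME" },
};

/* Write the names of all set bits into buf, separated by '|'. */
static void demo_show_flags(unsigned int flags, char *buf, size_t len)
{
	size_t i, used = 0;

	if (len == 0)
		return;
	buf[0] = '\0';

	for (i = 0; i < sizeof(demo_names) / sizeof(demo_names[0]); i++) {
		int n;

		if (!(flags & demo_names[i].bit))
			continue;
		n = snprintf(buf + used, len - used, "%s%s",
			     used ? "|" : "", demo_names[i].name);
		if (n < 0 || (size_t)n >= len - used)
			break;	/* buffer too small, stop here */
		used += (size_t)n;
	}
}
```

A caller passes the flag word and a buffer, for example
demo_show_flags(flags, buf, sizeof(buf)), and then prints buf.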
diff --git a/fs/locks.c b/fs/locks.c
index 8b5958f34b61..792c3920b33a 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -1651,6 +1651,8 @@ int __break_lease(struct inode *inode, unsigned int flags)
bool want_write = !(flags & LEASE_BREAK_OPEN_RDONLY);
int error = 0;
+ trace_break_lease(inode, flags);
+
type = break_lease_flags_to_type(flags);
if (!type)
return -EINVAL;
diff --git a/include/trace/events/filelock.h b/include/trace/events/filelock.h
index ef4bb0afb86a..fff0ee2d452d 100644
--- a/include/trace/events/filelock.h
+++ b/include/trace/events/filelock.h
@@ -120,6 +120,39 @@ DEFINE_EVENT(filelock_lock, flock_lock_inode,
TP_PROTO(struct inode *inode, struct file_lock *fl, int ret),
TP_ARGS(inode, fl, ret));
+#define show_lease_break_flags(val) \
+ __print_flags(val, "|", \
+ { LEASE_BREAK_LEASE, "LEASE" }, \
+ { LEASE_BREAK_DELEG, "DELEG" }, \
+ { LEASE_BREAK_LAYOUT, "LAYOUT" }, \
+ { LEASE_BREAK_NONBLOCK, "NONBLOCK" }, \
+ { LEASE_BREAK_OPEN_RDONLY, "OPEN_RDONLY" }, \
+ { LEASE_BREAK_DIR_CREATE, "DIR_CREATE" }, \
+ { LEASE_BREAK_DIR_DELETE, "DIR_DELETE" }, \
+ { LEASE_BREAK_DIR_RENAME, "DIR_RENAME" })
+
+TRACE_EVENT(break_lease,
+ TP_PROTO(struct inode *inode, unsigned int flags),
+
+ TP_ARGS(inode, flags),
+
+ TP_STRUCT__entry(
+ __field(unsigned long, i_ino)
+ __field(dev_t, s_dev)
+ __field(unsigned int, flags)
+ ),
+
+ TP_fast_assign(
+ __entry->s_dev = inode->i_sb->s_dev;
+ __entry->i_ino = inode->i_ino;
+ __entry->flags = flags;
+ ),
+
+ TP_printk("dev=0x%x:0x%x ino=0x%lx flags=%s",
+ MAJOR(__entry->s_dev), MINOR(__entry->s_dev),
+ __entry->i_ino, show_lease_break_flags(__entry->flags))
+);
+
DECLARE_EVENT_CLASS(filelock_lease,
TP_PROTO(struct inode *inode, struct file_lease *fl),
--
2.53.0
* [PATCH v2 04/28] filelock: add an inode_lease_ignore_mask helper
From: Jeff Layton @ 2026-04-16 17:35 UTC
To: Alexander Viro, Christian Brauner, Jan Kara, Chuck Lever,
Alexander Aring, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, NeilBrown,
Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust,
Anna Schumaker, Amir Goldstein
Cc: Calum Mackay, linux-fsdevel, linux-kernel, linux-trace-kernel,
linux-doc, linux-nfs, Jeff Layton
In-Reply-To: <20260416-dir-deleg-v2-0-851426a550f6@kernel.org>
Add a new routine that returns a mask of all dir change events that are
currently ignored by any lease on an inode. nfsd will use this to
determine how to configure the fsnotify_mark mask.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
fs/locks.c | 32 ++++++++++++++++++++++++++++++++
include/linux/filelock.h | 1 +
2 files changed, 33 insertions(+)
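The core of the helper is an OR over the per-lease ignore bits with an
early exit. A minimal userspace sketch of that idea follows, assuming a
plain array instead of the locked flc_lease list. struct demo_lease and
demo_ignore_mask() are hypothetical names; the bit values mirror the
FL_IGN_DIR_* definitions added earlier in this series.

```c
#include <stddef.h>
#include <stdint.h>

/* Same bit positions as the FL_IGN_DIR_* flags in filelock.h. */
#define FL_IGN_DIR_CREATE	(1u << 13)
#define FL_IGN_DIR_DELETE	(1u << 14)
#define FL_IGN_DIR_RENAME	(1u << 15)
#define IGNORE_MASK (FL_IGN_DIR_CREATE | FL_IGN_DIR_DELETE | FL_IGN_DIR_RENAME)

/* Hypothetical stand-in for a lease's flc_flags; not a kernel type. */
struct demo_lease {
	uint32_t flags;
};

/* OR together the ignore bits of every lease, stopping once all are set. */
static uint32_t demo_ignore_mask(const struct demo_lease *leases, size_t n)
{
	uint32_t mask = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		mask |= leases[i].flags & IGNORE_MASK;
		if (mask == IGNORE_MASK)
			break;
	}
	return mask;
}
```

Bits outside IGNORE_MASK are filtered out, so unrelated lease flags
never leak into the result.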
diff --git a/fs/locks.c b/fs/locks.c
index 792c3920b33a..61f64b261282 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -1582,6 +1582,38 @@ static bool leases_conflict(struct file_lock_core *lc, struct file_lock_core *bc
return rc;
}
+#define IGNORE_MASK (FL_IGN_DIR_CREATE | FL_IGN_DIR_DELETE | FL_IGN_DIR_RENAME)
+
+/**
+ * inode_lease_ignore_mask - return union of all ignored inode events for this inode
+ * @inode: inode of which to get ignore mask
+ *
+ * Walk the list of leases, and return the result of all of
+ * their FL_IGN_DIR_* bits or'ed together.
+ */
+u32
+inode_lease_ignore_mask(struct inode *inode)
+{
+ struct file_lock_context *ctx;
+ struct file_lock_core *flc;
+ u32 mask = 0;
+
+ ctx = locks_inode_context(inode);
+ if (!ctx)
+ return 0;
+
+ spin_lock(&ctx->flc_lock);
+ list_for_each_entry(flc, &ctx->flc_lease, flc_list) {
+ mask |= flc->flc_flags & IGNORE_MASK;
+ /* If we already have everything, we can stop */
+ if (mask == IGNORE_MASK)
+ break;
+ }
+ spin_unlock(&ctx->flc_lock);
+ return mask;
+}
+EXPORT_SYMBOL_GPL(inode_lease_ignore_mask);
+
static bool
ignore_dir_deleg_break(struct file_lease *fl, unsigned int flags)
{
diff --git a/include/linux/filelock.h b/include/linux/filelock.h
index 9dd4e67a6f30..6e125902c58a 100644
--- a/include/linux/filelock.h
+++ b/include/linux/filelock.h
@@ -236,6 +236,7 @@ int generic_setlease(struct file *, int, struct file_lease **, void **priv);
int kernel_setlease(struct file *, int, struct file_lease **, void **);
int vfs_setlease(struct file *, int, struct file_lease **, void **);
int lease_modify(struct file_lease *, int, struct list_head *);
+u32 inode_lease_ignore_mask(struct inode *inode);
struct notifier_block;
int lease_register_notifier(struct notifier_block *);
--
2.53.0
* [PATCH v2 05/28] fsnotify: new tracepoint in fsnotify()
From: Jeff Layton @ 2026-04-16 17:35 UTC
To: Alexander Viro, Christian Brauner, Jan Kara, Chuck Lever,
Alexander Aring, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, NeilBrown,
Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust,
Anna Schumaker, Amir Goldstein
Cc: Calum Mackay, linux-fsdevel, linux-kernel, linux-trace-kernel,
linux-doc, linux-nfs, Jeff Layton
In-Reply-To: <20260416-dir-deleg-v2-0-851426a550f6@kernel.org>
Add a tracepoint at the start of fsnotify() so we can see exactly how
it is being called.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
fs/notify/fsnotify.c | 5 ++++
include/trace/events/fsnotify.h | 51 +++++++++++++++++++++++++++++++++++++++++
include/trace/misc/fsnotify.h | 35 ++++++++++++++++++++++++++++
3 files changed, 91 insertions(+)
diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
index 9995de1710e5..5448738635f6 100644
--- a/fs/notify/fsnotify.c
+++ b/fs/notify/fsnotify.c
@@ -14,6 +14,9 @@
#include <linux/fsnotify_backend.h>
#include "fsnotify.h"
+#define CREATE_TRACE_POINTS
+#include <trace/events/fsnotify.h>
+
/*
* Clear all of the marks on an inode when it is being evicted from core
*/
@@ -504,6 +507,8 @@ int fsnotify(__u32 mask, const void *data, int data_type, struct inode *dir,
int ret = 0;
__u32 test_mask, marks_mask = 0;
+ trace_fsnotify(mask, data, data_type, dir, file_name, inode, cookie);
+
if (path)
mnt = real_mount(path->mnt);
diff --git a/include/trace/events/fsnotify.h b/include/trace/events/fsnotify.h
new file mode 100644
index 000000000000..341bbd57a39b
--- /dev/null
+++ b/include/trace/events/fsnotify.h
@@ -0,0 +1,51 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM fsnotify
+
+#if !defined(_TRACE_FSNOTIFY_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_FSNOTIFY_H
+
+#include <linux/tracepoint.h>
+
+#include <trace/misc/fsnotify.h>
+
+TRACE_EVENT(fsnotify,
+ TP_PROTO(__u32 mask, const void *data, int data_type,
+ struct inode *dir, const struct qstr *file_name,
+ struct inode *inode, u32 cookie),
+
+ TP_ARGS(mask, data, data_type, dir, file_name, inode, cookie),
+
+ TP_STRUCT__entry(
+ __field(__u32, mask)
+ __field(unsigned long, dir_ino)
+ __field(unsigned long, ino)
+ __field(dev_t, s_dev)
+ __field(int, data_type)
+ __field(u32, cookie)
+ __string(file_name, file_name ? (const char *)file_name->name : "")
+ ),
+
+ TP_fast_assign(
+ __entry->mask = mask;
+ __entry->dir_ino = dir ? dir->i_ino : 0;
+ __entry->ino = inode ? inode->i_ino : 0;
+ __entry->s_dev = dir ? dir->i_sb->s_dev :
+ inode ? inode->i_sb->s_dev : 0;
+ __entry->data_type = data_type;
+ __entry->cookie = cookie;
+ __assign_str(file_name);
+ ),
+
+ TP_printk("dev=%d:%d dir=%lu ino=%lu data_type=%d cookie=0x%x mask=0x%x %s name=%s",
+ MAJOR(__entry->s_dev), MINOR(__entry->s_dev),
+ __entry->dir_ino, __entry->ino,
+ __entry->data_type, __entry->cookie,
+ __entry->mask, show_fsnotify_mask(__entry->mask),
+ __get_str(file_name))
+);
+
+#endif /* _TRACE_FSNOTIFY_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/include/trace/misc/fsnotify.h b/include/trace/misc/fsnotify.h
new file mode 100644
index 000000000000..a201e1bd6d8c
--- /dev/null
+++ b/include/trace/misc/fsnotify.h
@@ -0,0 +1,35 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Display helpers for fsnotify events
+ */
+
+#include <linux/fsnotify_backend.h>
+
+#define show_fsnotify_mask(mask) \
+ __print_flags(mask, "|", \
+ { FS_ACCESS, "ACCESS" }, \
+ { FS_MODIFY, "MODIFY" }, \
+ { FS_ATTRIB, "ATTRIB" }, \
+ { FS_CLOSE_WRITE, "CLOSE_WRITE" }, \
+ { FS_CLOSE_NOWRITE, "CLOSE_NOWRITE" }, \
+ { FS_OPEN, "OPEN" }, \
+ { FS_MOVED_FROM, "MOVED_FROM" }, \
+ { FS_MOVED_TO, "MOVED_TO" }, \
+ { FS_CREATE, "CREATE" }, \
+ { FS_DELETE, "DELETE" }, \
+ { FS_DELETE_SELF, "DELETE_SELF" }, \
+ { FS_MOVE_SELF, "MOVE_SELF" }, \
+ { FS_OPEN_EXEC, "OPEN_EXEC" }, \
+ { FS_UNMOUNT, "UNMOUNT" }, \
+ { FS_Q_OVERFLOW, "Q_OVERFLOW" }, \
+ { FS_ERROR, "ERROR" }, \
+ { FS_OPEN_PERM, "OPEN_PERM" }, \
+ { FS_ACCESS_PERM, "ACCESS_PERM" }, \
+ { FS_OPEN_EXEC_PERM, "OPEN_EXEC_PERM" }, \
+ { FS_PRE_ACCESS, "PRE_ACCESS" }, \
+ { FS_MNT_ATTACH, "MNT_ATTACH" }, \
+ { FS_MNT_DETACH, "MNT_DETACH" }, \
+ { FS_EVENT_ON_CHILD, "EVENT_ON_CHILD" }, \
+ { FS_RENAME, "RENAME" }, \
+ { FS_DN_MULTISHOT, "DN_MULTISHOT" }, \
+ { FS_ISDIR, "ISDIR" })
--
2.53.0
* [PATCH v2 06/28] fsnotify: add fsnotify_modify_mark_mask()
From: Jeff Layton @ 2026-04-16 17:35 UTC
To: Alexander Viro, Christian Brauner, Jan Kara, Chuck Lever,
Alexander Aring, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, NeilBrown,
Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust,
Anna Schumaker, Amir Goldstein
Cc: Calum Mackay, linux-fsdevel, linux-kernel, linux-trace-kernel,
linux-doc, linux-nfs, Jeff Layton
In-Reply-To: <20260416-dir-deleg-v2-0-851426a550f6@kernel.org>
nfsd needs to be able to modify the mask on an existing mark when new
directory delegations are set or unset. Add an exported function that
allows the caller to set and clear bits in mark->mask, and that
recalculates the connector's mask if anything changed.
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
fs/notify/mark.c | 29 +++++++++++++++++++++++++++++
include/linux/fsnotify_backend.h | 1 +
2 files changed, 30 insertions(+)
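The set/clear/compare step can be shown in isolation. This is a minimal
userspace sketch only: struct demo_mark and demo_modify_mask() are
hypothetical names, there is no locking, and the return value stands in
for the decision to call fsnotify_recalc_mask().

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the mask field of a mark; not a kernel type. */
struct demo_mark {
	uint32_t mask;
};

/*
 * Apply @set, then @clear, to the mask. Return true if the value
 * changed, i.e. if the caller would need to recalculate.
 */
static bool demo_modify_mask(struct demo_mark *mark, uint32_t set,
			     uint32_t clear)
{
	uint32_t old = mark->mask;

	mark->mask |= set;
	mark->mask &= ~clear;
	return mark->mask != old;
}
```

Because the comparison is against the value before both operations, a
call that sets bits already set and clears bits already clear reports
no change and skips the recalculation.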
diff --git a/fs/notify/mark.c b/fs/notify/mark.c
index c2ed5b11b0fe..b1e73c6fd382 100644
--- a/fs/notify/mark.c
+++ b/fs/notify/mark.c
@@ -310,6 +310,35 @@ void fsnotify_recalc_mask(struct fsnotify_mark_connector *conn)
fsnotify_conn_set_children_dentry_flags(conn);
}
+/**
+ * fsnotify_modify_mark_mask - set and/or clear flags in a mark's mask
+ * @mark: mark to be modified
+ * @set: bits to be set in mask
+ * @clear: bits to be cleared in mask
+ *
+ * Modify a fsnotify_mark mask as directed, and update its associated conn.
+ * The caller is expected to hold a reference to the mark.
+ */
+void fsnotify_modify_mark_mask(struct fsnotify_mark *mark, u32 set, u32 clear)
+{
+ bool recalc = false;
+ u32 mask;
+
+ WARN_ON_ONCE(clear & set);
+
+ spin_lock(&mark->lock);
+ mask = mark->mask;
+ mark->mask |= set;
+ mark->mask &= ~clear;
+ if (mark->mask != mask)
+ recalc = true;
+ spin_unlock(&mark->lock);
+
+ if (recalc)
+ fsnotify_recalc_mask(mark->connector);
+}
+EXPORT_SYMBOL_GPL(fsnotify_modify_mark_mask);
+
/* Free all connectors queued for freeing once SRCU period ends */
static void fsnotify_connector_destroy_workfn(struct work_struct *work)
{
diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
index 95985400d3d8..66e185bd1b1b 100644
--- a/include/linux/fsnotify_backend.h
+++ b/include/linux/fsnotify_backend.h
@@ -917,6 +917,7 @@ extern void fsnotify_get_mark(struct fsnotify_mark *mark);
extern void fsnotify_put_mark(struct fsnotify_mark *mark);
extern void fsnotify_finish_user_wait(struct fsnotify_iter_info *iter_info);
extern bool fsnotify_prepare_user_wait(struct fsnotify_iter_info *iter_info);
+extern void fsnotify_modify_mark_mask(struct fsnotify_mark *mark, u32 set, u32 clear);
static inline void fsnotify_init_event(struct fsnotify_event *event)
{
--
2.53.0
* [PATCH v2 07/28] fsnotify: add FSNOTIFY_EVENT_RENAME data type
From: Jeff Layton @ 2026-04-16 17:35 UTC
To: Alexander Viro, Christian Brauner, Jan Kara, Chuck Lever,
Alexander Aring, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, NeilBrown,
Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust,
Anna Schumaker, Amir Goldstein
Cc: Calum Mackay, linux-fsdevel, linux-kernel, linux-trace-kernel,
linux-doc, linux-nfs, Jeff Layton
In-Reply-To: <20260416-dir-deleg-v2-0-851426a550f6@kernel.org>
Add a new fsnotify_rename_data struct and FSNOTIFY_EVENT_RENAME data
type that carries both the moved dentry and the inode that was
overwritten by the rename (if any).

Update fsnotify_data_inode(), fsnotify_data_dentry(), and
fsnotify_data_sb() to handle the new type, and add a new
fsnotify_data_rename_target() helper for extracting the overwritten
target inode.

Update fsnotify_move() to use the new data type for FS_RENAME and
FS_MOVED_TO events, passing the overwritten target inode through the
event data. FS_MOVED_FROM is unchanged since the source directory
doesn't need overwrite information.

This is done so that fsnotify consumers like nfsd can atomically
observe the overwritten file when a rename replaces an existing entry,
without needing a separate FS_DELETE event.
Assisted-by: Claude (Anthropic Claude Code)
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
include/linux/fsnotify.h | 8 ++++++--
include/linux/fsnotify_backend.h | 20 ++++++++++++++++++++
2 files changed, 26 insertions(+), 2 deletions(-)
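The accessor pattern described above can be sketched in userspace as
follows. All demo_* types and functions are hypothetical, simplified
stand-ins for the kernel ones; the point is only that the data type tag
selects how the opaque pointer is interpreted, and that the target
accessor returns NULL for every other payload type.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins; not the kernel definitions. */
enum demo_data_type { DEMO_EVENT_INODE, DEMO_EVENT_RENAME };

struct demo_inode { unsigned long ino; };
struct demo_dentry { struct demo_inode *inode; };

struct demo_rename_data {
	struct demo_dentry *moved;	/* the dentry that was renamed */
	struct demo_inode *target;	/* inode overwritten, or NULL */
};

/* Return the inode the event is about, whatever the payload type. */
static struct demo_inode *demo_data_inode(const void *data, int type)
{
	switch (type) {
	case DEMO_EVENT_INODE:
		return (struct demo_inode *)data;
	case DEMO_EVENT_RENAME:
		return ((const struct demo_rename_data *)data)->moved->inode;
	default:
		return NULL;
	}
}

/* Return the overwritten inode; NULL for non-rename payloads. */
static struct demo_inode *demo_data_rename_target(const void *data, int type)
{
	if (type == DEMO_EVENT_RENAME)
		return ((const struct demo_rename_data *)data)->target;
	return NULL;
}
```

A consumer that only cares about the moved object keeps calling the
generic accessor; one that wants to learn about an overwritten entry
additionally checks the target accessor for a non-NULL result.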
diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
index 079c18bcdbde..bda798bc67bc 100644
--- a/include/linux/fsnotify.h
+++ b/include/linux/fsnotify.h
@@ -257,6 +257,10 @@ static inline void fsnotify_move(struct inode *old_dir, struct inode *new_dir,
__u32 new_dir_mask = FS_MOVED_TO;
__u32 rename_mask = FS_RENAME;
const struct qstr *new_name = &moved->d_name;
+ struct fsnotify_rename_data rd = {
+ .moved = moved,
+ .target = target,
+ };
if (isdir) {
old_dir_mask |= FS_ISDIR;
@@ -265,12 +269,12 @@ static inline void fsnotify_move(struct inode *old_dir, struct inode *new_dir,
}
/* Event with information about both old and new parent+name */
- fsnotify_name(rename_mask, moved, FSNOTIFY_EVENT_DENTRY,
+ fsnotify_name(rename_mask, &rd, FSNOTIFY_EVENT_RENAME,
old_dir, old_name, 0);
fsnotify_name(old_dir_mask, source, FSNOTIFY_EVENT_INODE,
old_dir, old_name, fs_cookie);
- fsnotify_name(new_dir_mask, source, FSNOTIFY_EVENT_INODE,
+ fsnotify_name(new_dir_mask, &rd, FSNOTIFY_EVENT_RENAME,
new_dir, new_name, fs_cookie);
if (target)
diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
index 66e185bd1b1b..f8c8fb7f34ae 100644
--- a/include/linux/fsnotify_backend.h
+++ b/include/linux/fsnotify_backend.h
@@ -311,6 +311,7 @@ enum fsnotify_data_type {
FSNOTIFY_EVENT_DENTRY,
FSNOTIFY_EVENT_MNT,
FSNOTIFY_EVENT_ERROR,
+ FSNOTIFY_EVENT_RENAME,
};
struct fs_error_report {
@@ -335,6 +336,11 @@ struct fsnotify_mnt {
u64 mnt_id;
};
+struct fsnotify_rename_data {
+ struct dentry *moved; /* the dentry that was renamed */
+ struct inode *target; /* inode overwritten by rename, or NULL */
+};
+
static inline struct inode *fsnotify_data_inode(const void *data, int data_type)
{
switch (data_type) {
@@ -348,6 +354,8 @@ static inline struct inode *fsnotify_data_inode(const void *data, int data_type)
return d_inode(file_range_path(data)->dentry);
case FSNOTIFY_EVENT_ERROR:
return ((struct fs_error_report *)data)->inode;
+ case FSNOTIFY_EVENT_RENAME:
+ return d_inode(((const struct fsnotify_rename_data *)data)->moved);
default:
return NULL;
}
@@ -363,6 +371,8 @@ static inline struct dentry *fsnotify_data_dentry(const void *data, int data_typ
return ((const struct path *)data)->dentry;
case FSNOTIFY_EVENT_FILE_RANGE:
return file_range_path(data)->dentry;
+ case FSNOTIFY_EVENT_RENAME:
+ return ((struct fsnotify_rename_data *)data)->moved;
default:
return NULL;
}
@@ -395,6 +405,8 @@ static inline struct super_block *fsnotify_data_sb(const void *data,
return file_range_path(data)->dentry->d_sb;
case FSNOTIFY_EVENT_ERROR:
return ((struct fs_error_report *) data)->sb;
+ case FSNOTIFY_EVENT_RENAME:
+ return ((const struct fsnotify_rename_data *)data)->moved->d_sb;
default:
return NULL;
}
@@ -430,6 +442,14 @@ static inline struct fs_error_report *fsnotify_data_error_report(
}
}
+static inline struct inode *fsnotify_data_rename_target(const void *data,
+ int data_type)
+{
+ if (data_type == FSNOTIFY_EVENT_RENAME)
+ return ((const struct fsnotify_rename_data *)data)->target;
+ return NULL;
+}
+
static inline const struct file_range *fsnotify_data_file_range(
const void *data,
int data_type)
--
2.53.0
* [PATCH v2 08/28] nfsd: check fl_lmops in nfsd_breaker_owns_lease()
From: Jeff Layton @ 2026-04-16 17:35 UTC
To: Alexander Viro, Christian Brauner, Jan Kara, Chuck Lever,
Alexander Aring, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, NeilBrown,
Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust,
Anna Schumaker, Amir Goldstein
Cc: Calum Mackay, linux-fsdevel, linux-kernel, linux-trace-kernel,
linux-doc, linux-nfs, Jeff Layton
In-Reply-To: <20260416-dir-deleg-v2-0-851426a550f6@kernel.org>
Any lease created by nfsd will have its fl_lmops set to
nfsd_lease_mng_ops. Do a quick check for that first when testing whether
the lease breaker owns the lease.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
fs/nfsd/nfs4state.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index c75d3940188c..35f5c098717e 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -91,6 +91,8 @@ static void _free_cpntf_state_locked(struct nfsd_net *nn, struct nfs4_cpntf_stat
static void nfsd4_file_hash_remove(struct nfs4_file *fi);
static void deleg_reaper(struct nfsd_net *nn);
+static const struct lease_manager_operations nfsd_lease_mng_ops;
+
/* Locking: */
enum nfsd4_st_mutex_lock_subclass {
@@ -5655,6 +5657,10 @@ static bool nfsd_breaker_owns_lease(struct file_lease *fl)
struct svc_rqst *rqst;
struct nfs4_client *clp;
+ /* Only nfsd leases */
+ if (fl->fl_lmops != &nfsd_lease_mng_ops)
+ return false;
+
rqst = nfsd_current_rqst();
if (!nfsd_v4client(rqst))
return false;
--
2.53.0
* [PATCH v2 09/28] nfsd: add protocol support for CB_NOTIFY
From: Jeff Layton @ 2026-04-16 17:35 UTC
To: Alexander Viro, Christian Brauner, Jan Kara, Chuck Lever,
Alexander Aring, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, NeilBrown,
Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust,
Anna Schumaker, Amir Goldstein
Cc: Calum Mackay, linux-fsdevel, linux-kernel, linux-trace-kernel,
linux-doc, linux-nfs, Jeff Layton
In-Reply-To: <20260416-dir-deleg-v2-0-851426a550f6@kernel.org>
Add the XDR definitions needed for CB_NOTIFY to nfs4_1.x and remove the
duplicate definitions from nfs4.h and the uapi nfs4 header. Regenerate
the xdr files.

Note that regenerating these files caused conflicts with the definitions
of NFS4_VERIFIER_SIZE and NFS4_FHSIZE in include/uapi/linux/nfs4.h.
These constants are defined by the RFC, and are not part of the kernel
API. They have been removed from the uapi header. Userspace consumers
who require those constants should plan to get them from more
authoritative sources.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
Documentation/sunrpc/xdr/nfs4_1.x | 250 ++++++++++++++-
fs/nfsd/nfs4xdr_gen.c | 590 ++++++++++++++++++++++++++++++++++-
fs/nfsd/nfs4xdr_gen.h | 20 +-
fs/nfsd/trace.h | 1 +
include/linux/nfs4.h | 127 --------
include/linux/sunrpc/xdrgen/nfs4_1.h | 280 ++++++++++++++++-
include/uapi/linux/nfs4.h | 2 -
7 files changed, 1129 insertions(+), 141 deletions(-)
diff --git a/Documentation/sunrpc/xdr/nfs4_1.x b/Documentation/sunrpc/xdr/nfs4_1.x
index 5b45547b2ebc..632f5b579c39 100644
--- a/Documentation/sunrpc/xdr/nfs4_1.x
+++ b/Documentation/sunrpc/xdr/nfs4_1.x
@@ -45,19 +45,165 @@ pragma header nfs4;
/*
* Basic typedefs for RFC 1832 data type definitions
*/
-typedef hyper int64_t;
-typedef unsigned int uint32_t;
+typedef int int32_t;
+typedef unsigned int uint32_t;
+typedef hyper int64_t;
+typedef unsigned hyper uint64_t;
+
+const NFS4_VERIFIER_SIZE = 8;
+const NFS4_FHSIZE = 128;
+
+enum nfsstat4 {
+ NFS4_OK = 0, /* everything is okay */
+ NFS4ERR_PERM = 1, /* caller not privileged */
+ NFS4ERR_NOENT = 2, /* no such file/directory */
+ NFS4ERR_IO = 5, /* hard I/O error */
+ NFS4ERR_NXIO = 6, /* no such device */
+ NFS4ERR_ACCESS = 13, /* access denied */
+ NFS4ERR_EXIST = 17, /* file already exists */
+ NFS4ERR_XDEV = 18, /* different filesystems */
+
+ /*
+ * Please do not allocate value 19; it was used in NFSv3
+ * and we do not want a value in NFSv3 to have a different
+ * meaning in NFSv4.x.
+ */
+
+ NFS4ERR_NOTDIR = 20, /* should be a directory */
+ NFS4ERR_ISDIR = 21, /* should not be directory */
+ NFS4ERR_INVAL = 22, /* invalid argument */
+ NFS4ERR_FBIG = 27, /* file exceeds server max */
+ NFS4ERR_NOSPC = 28, /* no space on filesystem */
+ NFS4ERR_ROFS = 30, /* read-only filesystem */
+ NFS4ERR_MLINK = 31, /* too many hard links */
+ NFS4ERR_NAMETOOLONG = 63, /* name exceeds server max */
+ NFS4ERR_NOTEMPTY = 66, /* directory not empty */
+ NFS4ERR_DQUOT = 69, /* hard quota limit reached*/
+ NFS4ERR_STALE = 70, /* file no longer exists */
+ NFS4ERR_BADHANDLE = 10001,/* Illegal filehandle */
+ NFS4ERR_BAD_COOKIE = 10003,/* READDIR cookie is stale */
+ NFS4ERR_NOTSUPP = 10004,/* operation not supported */
+ NFS4ERR_TOOSMALL = 10005,/* response limit exceeded */
+ NFS4ERR_SERVERFAULT = 10006,/* undefined server error */
+ NFS4ERR_BADTYPE = 10007,/* type invalid for CREATE */
+ NFS4ERR_DELAY = 10008,/* file "busy" - retry */
+ NFS4ERR_SAME = 10009,/* nverify says attrs same */
+ NFS4ERR_DENIED = 10010,/* lock unavailable */
+ NFS4ERR_EXPIRED = 10011,/* lock lease expired */
+ NFS4ERR_LOCKED = 10012,/* I/O failed due to lock */
+ NFS4ERR_GRACE = 10013,/* in grace period */
+ NFS4ERR_FHEXPIRED = 10014,/* filehandle expired */
+ NFS4ERR_SHARE_DENIED = 10015,/* share reserve denied */
+ NFS4ERR_WRONGSEC = 10016,/* wrong security flavor */
+ NFS4ERR_CLID_INUSE = 10017,/* clientid in use */
+
+ /* NFS4ERR_RESOURCE is not a valid error in NFSv4.1 */
+ NFS4ERR_RESOURCE = 10018,/* resource exhaustion */
+
+ NFS4ERR_MOVED = 10019,/* filesystem relocated */
+ NFS4ERR_NOFILEHANDLE = 10020,/* current FH is not set */
+ NFS4ERR_MINOR_VERS_MISMATCH= 10021,/* minor vers not supp */
+ NFS4ERR_STALE_CLIENTID = 10022,/* server has rebooted */
+ NFS4ERR_STALE_STATEID = 10023,/* server has rebooted */
+ NFS4ERR_OLD_STATEID = 10024,/* state is out of sync */
+ NFS4ERR_BAD_STATEID = 10025,/* incorrect stateid */
+ NFS4ERR_BAD_SEQID = 10026,/* request is out of seq. */
+ NFS4ERR_NOT_SAME = 10027,/* verify - attrs not same */
+ NFS4ERR_LOCK_RANGE = 10028,/* overlapping lock range */
+ NFS4ERR_SYMLINK = 10029,/* should be file/directory*/
+ NFS4ERR_RESTOREFH = 10030,/* no saved filehandle */
+ NFS4ERR_LEASE_MOVED = 10031,/* some filesystem moved */
+ NFS4ERR_ATTRNOTSUPP = 10032,/* recommended attr not sup*/
+ NFS4ERR_NO_GRACE = 10033,/* reclaim outside of grace*/
+ NFS4ERR_RECLAIM_BAD = 10034,/* reclaim error at server */
+ NFS4ERR_RECLAIM_CONFLICT= 10035,/* conflict on reclaim */
+ NFS4ERR_BADXDR = 10036,/* XDR decode failed */
+ NFS4ERR_LOCKS_HELD = 10037,/* file locks held at CLOSE*/
+ NFS4ERR_OPENMODE = 10038,/* conflict in OPEN and I/O*/
+ NFS4ERR_BADOWNER = 10039,/* owner translation bad */
+ NFS4ERR_BADCHAR = 10040,/* utf-8 char not supported*/
+ NFS4ERR_BADNAME = 10041,/* name not supported */
+ NFS4ERR_BAD_RANGE = 10042,/* lock range not supported*/
+ NFS4ERR_LOCK_NOTSUPP = 10043,/* no atomic up/downgrade */
+ NFS4ERR_OP_ILLEGAL = 10044,/* undefined operation */
+ NFS4ERR_DEADLOCK = 10045,/* file locking deadlock */
+ NFS4ERR_FILE_OPEN = 10046,/* open file blocks op. */
+ NFS4ERR_ADMIN_REVOKED = 10047,/* lockowner state revoked */
+ NFS4ERR_CB_PATH_DOWN = 10048,/* callback path down */
+
+ /* NFSv4.1 errors start here. */
+
+ NFS4ERR_BADIOMODE = 10049,
+ NFS4ERR_BADLAYOUT = 10050,
+ NFS4ERR_BAD_SESSION_DIGEST = 10051,
+ NFS4ERR_BADSESSION = 10052,
+ NFS4ERR_BADSLOT = 10053,
+ NFS4ERR_COMPLETE_ALREADY = 10054,
+ NFS4ERR_CONN_NOT_BOUND_TO_SESSION = 10055,
+ NFS4ERR_DELEG_ALREADY_WANTED = 10056,
+ NFS4ERR_BACK_CHAN_BUSY = 10057,/*backchan reqs outstanding*/
+ NFS4ERR_LAYOUTTRYLATER = 10058,
+ NFS4ERR_LAYOUTUNAVAILABLE = 10059,
+ NFS4ERR_NOMATCHING_LAYOUT = 10060,
+ NFS4ERR_RECALLCONFLICT = 10061,
+ NFS4ERR_UNKNOWN_LAYOUTTYPE = 10062,
+ NFS4ERR_SEQ_MISORDERED = 10063,/* unexpected seq.ID in req*/
+ NFS4ERR_SEQUENCE_POS = 10064,/* [CB_]SEQ. op not 1st op */
+ NFS4ERR_REQ_TOO_BIG = 10065,/* request too big */
+ NFS4ERR_REP_TOO_BIG = 10066,/* reply too big */
+ NFS4ERR_REP_TOO_BIG_TO_CACHE =10067,/* rep. not all cached*/
+ NFS4ERR_RETRY_UNCACHED_REP =10068,/* retry & rep. uncached*/
+ NFS4ERR_UNSAFE_COMPOUND =10069,/* retry/recovery too hard */
+ NFS4ERR_TOO_MANY_OPS = 10070,/*too many ops in [CB_]COMP*/
+ NFS4ERR_OP_NOT_IN_SESSION =10071,/* op needs [CB_]SEQ. op */
+ NFS4ERR_HASH_ALG_UNSUPP = 10072, /* hash alg. not supp. */
+ /* Error 10073 is unused. */
+ NFS4ERR_CLIENTID_BUSY = 10074,/* clientid has state */
+ NFS4ERR_PNFS_IO_HOLE = 10075,/* IO to _SPARSE file hole */
+ NFS4ERR_SEQ_FALSE_RETRY= 10076,/* Retry != original req. */
+ NFS4ERR_BAD_HIGH_SLOT = 10077,/* req has bad highest_slot*/
+ NFS4ERR_DEADSESSION = 10078,/*new req sent to dead sess*/
+ NFS4ERR_ENCR_ALG_UNSUPP= 10079,/* encr alg. not supp. */
+ NFS4ERR_PNFS_NO_LAYOUT = 10080,/* I/O without a layout */
+ NFS4ERR_NOT_ONLY_OP = 10081,/* addl ops not allowed */
+ NFS4ERR_WRONG_CRED = 10082,/* op done by wrong cred */
+ NFS4ERR_WRONG_TYPE = 10083,/* op on wrong type object */
+ NFS4ERR_DIRDELEG_UNAVAIL=10084,/* delegation not avail. */
+ NFS4ERR_REJECT_DELEG = 10085,/* cb rejected delegation */
+ NFS4ERR_RETURNCONFLICT = 10086,/* layout get before return*/
+ NFS4ERR_DELEG_REVOKED = 10087, /* deleg./layout revoked */
+ NFS4ERR_PARTNER_NOTSUPP = 10088,
+ NFS4ERR_PARTNER_NO_AUTH = 10089,
+ NFS4ERR_UNION_NOTSUPP = 10090,
+ NFS4ERR_OFFLOAD_DENIED = 10091,
+ NFS4ERR_WRONG_LFS = 10092,
+ NFS4ERR_BADLABEL = 10093,
+ NFS4ERR_OFFLOAD_NO_REQS = 10094,
+ NFS4ERR_NOXATTR = 10095,
+ NFS4ERR_XATTR2BIG = 10096,
+
+ /* always set this to one more than the last one in the enum */
+ NFS4ERR_FIRST_FREE = 10097
+};
/*
* Basic data types
*/
+typedef opaque attrlist4<>;
typedef uint32_t bitmap4<>;
+typedef opaque verifier4[NFS4_VERIFIER_SIZE];
+typedef uint64_t nfs_cookie4;
+typedef opaque nfs_fh4<NFS4_FHSIZE>;
typedef opaque utf8string<>;
typedef utf8string utf8str_cis;
typedef utf8string utf8str_cs;
typedef utf8string utf8str_mixed;
+typedef utf8str_cs component4;
+typedef utf8str_cs linktext4;
+typedef component4 pathname4<>;
+
/*
* Timeval
*/
@@ -66,6 +212,21 @@ struct nfstime4 {
uint32_t nseconds;
};
+/*
+ * File attribute container
+ */
+struct fattr4 {
+ bitmap4 attrmask;
+ attrlist4 attr_vals;
+};
+
+/*
+ * Stateid
+ */
+struct stateid4 {
+ uint32_t seqid;
+ opaque other[12];
+};
/*
* The following content was extracted from draft-ietf-nfsv4-delstid
@@ -245,3 +406,88 @@ const FATTR4_ACL_TRUEFORM = 89;
const FATTR4_ACL_TRUEFORM_SCOPE = 90;
const FATTR4_POSIX_DEFAULT_ACL = 91;
const FATTR4_POSIX_ACCESS_ACL = 92;
+
+/*
+ * Directory notification types.
+ */
+enum notify_type4 {
+ NOTIFY4_CHANGE_CHILD_ATTRS = 0,
+ NOTIFY4_CHANGE_DIR_ATTRS = 1,
+ NOTIFY4_REMOVE_ENTRY = 2,
+ NOTIFY4_ADD_ENTRY = 3,
+ NOTIFY4_RENAME_ENTRY = 4,
+ NOTIFY4_CHANGE_COOKIE_VERIFIER = 5
+};
+
+/* Changed entry information. */
+struct notify_entry4 {
+ component4 ne_file;
+ fattr4 ne_attrs;
+};
+
+/* Previous entry information */
+struct prev_entry4 {
+ notify_entry4 pe_prev_entry;
+ /* what READDIR returned for this entry */
+ nfs_cookie4 pe_prev_entry_cookie;
+};
+
+struct notify_remove4 {
+ notify_entry4 nrm_old_entry;
+ nfs_cookie4 nrm_old_entry_cookie;
+};
+pragma public notify_remove4;
+
+struct notify_add4 {
+ /*
+ * Information on object
+ * possibly renamed over.
+ */
+ notify_remove4 nad_old_entry<1>;
+ notify_entry4 nad_new_entry;
+ /* what READDIR would have returned for this entry */
+ nfs_cookie4 nad_new_entry_cookie<1>;
+ prev_entry4 nad_prev_entry<1>;
+ bool nad_last_entry;
+};
+pragma public notify_add4;
+
+struct notify_attr4 {
+ notify_entry4 na_changed_entry;
+};
+pragma public notify_attr4;
+
+struct notify_rename4 {
+ notify_remove4 nrn_old_entry;
+ notify_add4 nrn_new_entry;
+};
+pragma public notify_rename4;
+
+struct notify_verifier4 {
+ verifier4 nv_old_cookieverf;
+ verifier4 nv_new_cookieverf;
+};
+
+/*
+ * Objects of type notify_<>4 and
+ * notify_device_<>4 are encoded in this.
+ */
+typedef opaque notifylist4<>;
+
+struct notify4 {
+ /* composed from notify_type4 or notify_deviceid_type4 */
+ bitmap4 notify_mask;
+ notifylist4 notify_vals;
+};
+
+struct CB_NOTIFY4args {
+ stateid4 cna_stateid;
+ nfs_fh4 cna_fh;
+ notify4 cna_changes<>;
+};
+pragma public CB_NOTIFY4args;
+
+struct CB_NOTIFY4res {
+ nfsstat4 cnr_status;
+};
+pragma public CB_NOTIFY4res;
diff --git a/fs/nfsd/nfs4xdr_gen.c b/fs/nfsd/nfs4xdr_gen.c
index 824497051b87..5e656d6bbb8e 100644
--- a/fs/nfsd/nfs4xdr_gen.c
+++ b/fs/nfsd/nfs4xdr_gen.c
@@ -1,16 +1,16 @@
// SPDX-License-Identifier: GPL-2.0
// Generated by xdrgen. Manual edits will be lost.
// XDR specification file: ../../Documentation/sunrpc/xdr/nfs4_1.x
-// XDR specification modification time: Thu Jan 8 23:12:07 2026
+// XDR specification modification time: Wed Mar 25 11:39:22 2026
#include <linux/sunrpc/svc.h>
#include "nfs4xdr_gen.h"
static bool __maybe_unused
-xdrgen_decode_int64_t(struct xdr_stream *xdr, int64_t *ptr)
+xdrgen_decode_int32_t(struct xdr_stream *xdr, int32_t *ptr)
{
- return xdrgen_decode_hyper(xdr, ptr);
+ return xdrgen_decode_int(xdr, ptr);
}
static bool __maybe_unused
@@ -19,6 +19,155 @@ xdrgen_decode_uint32_t(struct xdr_stream *xdr, uint32_t *ptr)
return xdrgen_decode_unsigned_int(xdr, ptr);
}
+static bool __maybe_unused
+xdrgen_decode_int64_t(struct xdr_stream *xdr, int64_t *ptr)
+{
+ return xdrgen_decode_hyper(xdr, ptr);
+}
+
+static bool __maybe_unused
+xdrgen_decode_uint64_t(struct xdr_stream *xdr, uint64_t *ptr)
+{
+ return xdrgen_decode_unsigned_hyper(xdr, ptr);
+}
+
+static bool __maybe_unused
+xdrgen_decode_nfsstat4(struct xdr_stream *xdr, nfsstat4 *ptr)
+{
+ u32 val;
+
+ if (xdr_stream_decode_u32(xdr, &val) < 0)
+ return false;
+ /* Compiler may optimize to a range check for dense enums */
+ switch (val) {
+ case NFS4_OK:
+ case NFS4ERR_PERM:
+ case NFS4ERR_NOENT:
+ case NFS4ERR_IO:
+ case NFS4ERR_NXIO:
+ case NFS4ERR_ACCESS:
+ case NFS4ERR_EXIST:
+ case NFS4ERR_XDEV:
+ case NFS4ERR_NOTDIR:
+ case NFS4ERR_ISDIR:
+ case NFS4ERR_INVAL:
+ case NFS4ERR_FBIG:
+ case NFS4ERR_NOSPC:
+ case NFS4ERR_ROFS:
+ case NFS4ERR_MLINK:
+ case NFS4ERR_NAMETOOLONG:
+ case NFS4ERR_NOTEMPTY:
+ case NFS4ERR_DQUOT:
+ case NFS4ERR_STALE:
+ case NFS4ERR_BADHANDLE:
+ case NFS4ERR_BAD_COOKIE:
+ case NFS4ERR_NOTSUPP:
+ case NFS4ERR_TOOSMALL:
+ case NFS4ERR_SERVERFAULT:
+ case NFS4ERR_BADTYPE:
+ case NFS4ERR_DELAY:
+ case NFS4ERR_SAME:
+ case NFS4ERR_DENIED:
+ case NFS4ERR_EXPIRED:
+ case NFS4ERR_LOCKED:
+ case NFS4ERR_GRACE:
+ case NFS4ERR_FHEXPIRED:
+ case NFS4ERR_SHARE_DENIED:
+ case NFS4ERR_WRONGSEC:
+ case NFS4ERR_CLID_INUSE:
+ case NFS4ERR_RESOURCE:
+ case NFS4ERR_MOVED:
+ case NFS4ERR_NOFILEHANDLE:
+ case NFS4ERR_MINOR_VERS_MISMATCH:
+ case NFS4ERR_STALE_CLIENTID:
+ case NFS4ERR_STALE_STATEID:
+ case NFS4ERR_OLD_STATEID:
+ case NFS4ERR_BAD_STATEID:
+ case NFS4ERR_BAD_SEQID:
+ case NFS4ERR_NOT_SAME:
+ case NFS4ERR_LOCK_RANGE:
+ case NFS4ERR_SYMLINK:
+ case NFS4ERR_RESTOREFH:
+ case NFS4ERR_LEASE_MOVED:
+ case NFS4ERR_ATTRNOTSUPP:
+ case NFS4ERR_NO_GRACE:
+ case NFS4ERR_RECLAIM_BAD:
+ case NFS4ERR_RECLAIM_CONFLICT:
+ case NFS4ERR_BADXDR:
+ case NFS4ERR_LOCKS_HELD:
+ case NFS4ERR_OPENMODE:
+ case NFS4ERR_BADOWNER:
+ case NFS4ERR_BADCHAR:
+ case NFS4ERR_BADNAME:
+ case NFS4ERR_BAD_RANGE:
+ case NFS4ERR_LOCK_NOTSUPP:
+ case NFS4ERR_OP_ILLEGAL:
+ case NFS4ERR_DEADLOCK:
+ case NFS4ERR_FILE_OPEN:
+ case NFS4ERR_ADMIN_REVOKED:
+ case NFS4ERR_CB_PATH_DOWN:
+ case NFS4ERR_BADIOMODE:
+ case NFS4ERR_BADLAYOUT:
+ case NFS4ERR_BAD_SESSION_DIGEST:
+ case NFS4ERR_BADSESSION:
+ case NFS4ERR_BADSLOT:
+ case NFS4ERR_COMPLETE_ALREADY:
+ case NFS4ERR_CONN_NOT_BOUND_TO_SESSION:
+ case NFS4ERR_DELEG_ALREADY_WANTED:
+ case NFS4ERR_BACK_CHAN_BUSY:
+ case NFS4ERR_LAYOUTTRYLATER:
+ case NFS4ERR_LAYOUTUNAVAILABLE:
+ case NFS4ERR_NOMATCHING_LAYOUT:
+ case NFS4ERR_RECALLCONFLICT:
+ case NFS4ERR_UNKNOWN_LAYOUTTYPE:
+ case NFS4ERR_SEQ_MISORDERED:
+ case NFS4ERR_SEQUENCE_POS:
+ case NFS4ERR_REQ_TOO_BIG:
+ case NFS4ERR_REP_TOO_BIG:
+ case NFS4ERR_REP_TOO_BIG_TO_CACHE:
+ case NFS4ERR_RETRY_UNCACHED_REP:
+ case NFS4ERR_UNSAFE_COMPOUND:
+ case NFS4ERR_TOO_MANY_OPS:
+ case NFS4ERR_OP_NOT_IN_SESSION:
+ case NFS4ERR_HASH_ALG_UNSUPP:
+ case NFS4ERR_CLIENTID_BUSY:
+ case NFS4ERR_PNFS_IO_HOLE:
+ case NFS4ERR_SEQ_FALSE_RETRY:
+ case NFS4ERR_BAD_HIGH_SLOT:
+ case NFS4ERR_DEADSESSION:
+ case NFS4ERR_ENCR_ALG_UNSUPP:
+ case NFS4ERR_PNFS_NO_LAYOUT:
+ case NFS4ERR_NOT_ONLY_OP:
+ case NFS4ERR_WRONG_CRED:
+ case NFS4ERR_WRONG_TYPE:
+ case NFS4ERR_DIRDELEG_UNAVAIL:
+ case NFS4ERR_REJECT_DELEG:
+ case NFS4ERR_RETURNCONFLICT:
+ case NFS4ERR_DELEG_REVOKED:
+ case NFS4ERR_PARTNER_NOTSUPP:
+ case NFS4ERR_PARTNER_NO_AUTH:
+ case NFS4ERR_UNION_NOTSUPP:
+ case NFS4ERR_OFFLOAD_DENIED:
+ case NFS4ERR_WRONG_LFS:
+ case NFS4ERR_BADLABEL:
+ case NFS4ERR_OFFLOAD_NO_REQS:
+ case NFS4ERR_NOXATTR:
+ case NFS4ERR_XATTR2BIG:
+ case NFS4ERR_FIRST_FREE:
+ break;
+ default:
+ return false;
+ }
+ *ptr = val;
+ return true;
+}
+
+static bool __maybe_unused
+xdrgen_decode_attrlist4(struct xdr_stream *xdr, attrlist4 *ptr)
+{
+ return xdrgen_decode_opaque(xdr, ptr, 0);
+}
+
static bool __maybe_unused
xdrgen_decode_bitmap4(struct xdr_stream *xdr, bitmap4 *ptr)
{
@@ -30,6 +179,24 @@ xdrgen_decode_bitmap4(struct xdr_stream *xdr, bitmap4 *ptr)
return true;
}
+static bool __maybe_unused
+xdrgen_decode_verifier4(struct xdr_stream *xdr, verifier4 *ptr)
+{
+ return xdr_stream_decode_opaque_fixed(xdr, ptr, NFS4_VERIFIER_SIZE) == 0;
+}
+
+static bool __maybe_unused
+xdrgen_decode_nfs_cookie4(struct xdr_stream *xdr, nfs_cookie4 *ptr)
+{
+ return xdrgen_decode_uint64_t(xdr, ptr);
+}
+
+static bool __maybe_unused
+xdrgen_decode_nfs_fh4(struct xdr_stream *xdr, nfs_fh4 *ptr)
+{
+ return xdrgen_decode_opaque(xdr, ptr, NFS4_FHSIZE);
+}
+
static bool __maybe_unused
xdrgen_decode_utf8string(struct xdr_stream *xdr, utf8string *ptr)
{
@@ -54,6 +221,29 @@ xdrgen_decode_utf8str_mixed(struct xdr_stream *xdr, utf8str_mixed *ptr)
return xdrgen_decode_utf8string(xdr, ptr);
}
+static bool __maybe_unused
+xdrgen_decode_component4(struct xdr_stream *xdr, component4 *ptr)
+{
+ return xdrgen_decode_utf8str_cs(xdr, ptr);
+}
+
+static bool __maybe_unused
+xdrgen_decode_linktext4(struct xdr_stream *xdr, linktext4 *ptr)
+{
+ return xdrgen_decode_utf8str_cs(xdr, ptr);
+}
+
+static bool __maybe_unused
+xdrgen_decode_pathname4(struct xdr_stream *xdr, pathname4 *ptr)
+{
+ if (xdr_stream_decode_u32(xdr, &ptr->count) < 0)
+ return false;
+ for (u32 i = 0; i < ptr->count; i++)
+ if (!xdrgen_decode_component4(xdr, &ptr->element[i]))
+ return false;
+ return true;
+}
+
static bool __maybe_unused
xdrgen_decode_nfstime4(struct xdr_stream *xdr, struct nfstime4 *ptr)
{
@@ -64,6 +254,26 @@ xdrgen_decode_nfstime4(struct xdr_stream *xdr, struct nfstime4 *ptr)
return true;
}
+static bool __maybe_unused
+xdrgen_decode_fattr4(struct xdr_stream *xdr, struct fattr4 *ptr)
+{
+ if (!xdrgen_decode_bitmap4(xdr, &ptr->attrmask))
+ return false;
+ if (!xdrgen_decode_attrlist4(xdr, &ptr->attr_vals))
+ return false;
+ return true;
+}
+
+static bool __maybe_unused
+xdrgen_decode_stateid4(struct xdr_stream *xdr, struct stateid4 *ptr)
+{
+ if (!xdrgen_decode_uint32_t(xdr, &ptr->seqid))
+ return false;
+ if (xdr_stream_decode_opaque_fixed(xdr, ptr->other, 12) < 0)
+ return false;
+ return true;
+}
+
static bool __maybe_unused
xdrgen_decode_fattr4_offline(struct xdr_stream *xdr, fattr4_offline *ptr)
{
@@ -366,9 +576,160 @@ xdrgen_decode_fattr4_posix_access_acl(struct xdr_stream *xdr, fattr4_posix_acces
*/
static bool __maybe_unused
-xdrgen_encode_int64_t(struct xdr_stream *xdr, const int64_t value)
+xdrgen_decode_notify_type4(struct xdr_stream *xdr, notify_type4 *ptr)
{
- return xdrgen_encode_hyper(xdr, value);
+ u32 val;
+
+ if (xdr_stream_decode_u32(xdr, &val) < 0)
+ return false;
+ /* Compiler may optimize to a range check for dense enums */
+ switch (val) {
+ case NOTIFY4_CHANGE_CHILD_ATTRS:
+ case NOTIFY4_CHANGE_DIR_ATTRS:
+ case NOTIFY4_REMOVE_ENTRY:
+ case NOTIFY4_ADD_ENTRY:
+ case NOTIFY4_RENAME_ENTRY:
+ case NOTIFY4_CHANGE_COOKIE_VERIFIER:
+ break;
+ default:
+ return false;
+ }
+ *ptr = val;
+ return true;
+}
+
+static bool __maybe_unused
+xdrgen_decode_notify_entry4(struct xdr_stream *xdr, struct notify_entry4 *ptr)
+{
+ if (!xdrgen_decode_component4(xdr, &ptr->ne_file))
+ return false;
+ if (!xdrgen_decode_fattr4(xdr, &ptr->ne_attrs))
+ return false;
+ return true;
+}
+
+static bool __maybe_unused
+xdrgen_decode_prev_entry4(struct xdr_stream *xdr, struct prev_entry4 *ptr)
+{
+ if (!xdrgen_decode_notify_entry4(xdr, &ptr->pe_prev_entry))
+ return false;
+ if (!xdrgen_decode_nfs_cookie4(xdr, &ptr->pe_prev_entry_cookie))
+ return false;
+ return true;
+}
+
+bool
+xdrgen_decode_notify_remove4(struct xdr_stream *xdr, struct notify_remove4 *ptr)
+{
+ if (!xdrgen_decode_notify_entry4(xdr, &ptr->nrm_old_entry))
+ return false;
+ if (!xdrgen_decode_nfs_cookie4(xdr, &ptr->nrm_old_entry_cookie))
+ return false;
+ return true;
+}
+
+bool
+xdrgen_decode_notify_add4(struct xdr_stream *xdr, struct notify_add4 *ptr)
+{
+ if (xdr_stream_decode_u32(xdr, &ptr->nad_old_entry.count) < 0)
+ return false;
+ if (ptr->nad_old_entry.count > 1)
+ return false;
+ for (u32 i = 0; i < ptr->nad_old_entry.count; i++)
+ if (!xdrgen_decode_notify_remove4(xdr, &ptr->nad_old_entry.element[i]))
+ return false;
+ if (!xdrgen_decode_notify_entry4(xdr, &ptr->nad_new_entry))
+ return false;
+ if (xdr_stream_decode_u32(xdr, &ptr->nad_new_entry_cookie.count) < 0)
+ return false;
+ if (ptr->nad_new_entry_cookie.count > 1)
+ return false;
+ for (u32 i = 0; i < ptr->nad_new_entry_cookie.count; i++)
+ if (!xdrgen_decode_nfs_cookie4(xdr, &ptr->nad_new_entry_cookie.element[i]))
+ return false;
+ if (xdr_stream_decode_u32(xdr, &ptr->nad_prev_entry.count) < 0)
+ return false;
+ if (ptr->nad_prev_entry.count > 1)
+ return false;
+ for (u32 i = 0; i < ptr->nad_prev_entry.count; i++)
+ if (!xdrgen_decode_prev_entry4(xdr, &ptr->nad_prev_entry.element[i]))
+ return false;
+ if (!xdrgen_decode_bool(xdr, &ptr->nad_last_entry))
+ return false;
+ return true;
+}
+
+bool
+xdrgen_decode_notify_attr4(struct xdr_stream *xdr, struct notify_attr4 *ptr)
+{
+ if (!xdrgen_decode_notify_entry4(xdr, &ptr->na_changed_entry))
+ return false;
+ return true;
+}
+
+bool
+xdrgen_decode_notify_rename4(struct xdr_stream *xdr, struct notify_rename4 *ptr)
+{
+ if (!xdrgen_decode_notify_remove4(xdr, &ptr->nrn_old_entry))
+ return false;
+ if (!xdrgen_decode_notify_add4(xdr, &ptr->nrn_new_entry))
+ return false;
+ return true;
+}
+
+static bool __maybe_unused
+xdrgen_decode_notify_verifier4(struct xdr_stream *xdr, struct notify_verifier4 *ptr)
+{
+ if (!xdrgen_decode_verifier4(xdr, &ptr->nv_old_cookieverf))
+ return false;
+ if (!xdrgen_decode_verifier4(xdr, &ptr->nv_new_cookieverf))
+ return false;
+ return true;
+}
+
+static bool __maybe_unused
+xdrgen_decode_notifylist4(struct xdr_stream *xdr, notifylist4 *ptr)
+{
+ return xdrgen_decode_opaque(xdr, ptr, 0);
+}
+
+static bool __maybe_unused
+xdrgen_decode_notify4(struct xdr_stream *xdr, struct notify4 *ptr)
+{
+ if (!xdrgen_decode_bitmap4(xdr, &ptr->notify_mask))
+ return false;
+ if (!xdrgen_decode_notifylist4(xdr, &ptr->notify_vals))
+ return false;
+ return true;
+}
+
+bool
+xdrgen_decode_CB_NOTIFY4args(struct xdr_stream *xdr, struct CB_NOTIFY4args *ptr)
+{
+ if (!xdrgen_decode_stateid4(xdr, &ptr->cna_stateid))
+ return false;
+ if (!xdrgen_decode_nfs_fh4(xdr, &ptr->cna_fh))
+ return false;
+ if (xdr_stream_decode_u32(xdr, &ptr->cna_changes.count) < 0)
+ return false;
+ for (u32 i = 0; i < ptr->cna_changes.count; i++)
+ if (!xdrgen_decode_notify4(xdr, &ptr->cna_changes.element[i]))
+ return false;
+ return true;
+}
+
+bool
+xdrgen_decode_CB_NOTIFY4res(struct xdr_stream *xdr, struct CB_NOTIFY4res *ptr)
+{
+ if (!xdrgen_decode_nfsstat4(xdr, &ptr->cnr_status))
+ return false;
+ return true;
+}
+
+static bool __maybe_unused
+xdrgen_encode_int32_t(struct xdr_stream *xdr, const int32_t value)
+{
+ return xdrgen_encode_int(xdr, value);
}
static bool __maybe_unused
@@ -377,6 +738,30 @@ xdrgen_encode_uint32_t(struct xdr_stream *xdr, const uint32_t value)
return xdrgen_encode_unsigned_int(xdr, value);
}
+static bool __maybe_unused
+xdrgen_encode_int64_t(struct xdr_stream *xdr, const int64_t value)
+{
+ return xdrgen_encode_hyper(xdr, value);
+}
+
+static bool __maybe_unused
+xdrgen_encode_uint64_t(struct xdr_stream *xdr, const uint64_t value)
+{
+ return xdrgen_encode_unsigned_hyper(xdr, value);
+}
+
+static bool __maybe_unused
+xdrgen_encode_nfsstat4(struct xdr_stream *xdr, nfsstat4 value)
+{
+ return xdr_stream_encode_u32(xdr, value) == XDR_UNIT;
+}
+
+static bool __maybe_unused
+xdrgen_encode_attrlist4(struct xdr_stream *xdr, const attrlist4 value)
+{
+ return xdr_stream_encode_opaque(xdr, value.data, value.len) >= 0;
+}
+
static bool __maybe_unused
xdrgen_encode_bitmap4(struct xdr_stream *xdr, const bitmap4 value)
{
@@ -388,6 +773,24 @@ xdrgen_encode_bitmap4(struct xdr_stream *xdr, const bitmap4 value)
return true;
}
+static bool __maybe_unused
+xdrgen_encode_verifier4(struct xdr_stream *xdr, const verifier4 value)
+{
+ return xdr_stream_encode_opaque_fixed(xdr, value, NFS4_VERIFIER_SIZE) >= 0;
+}
+
+static bool __maybe_unused
+xdrgen_encode_nfs_cookie4(struct xdr_stream *xdr, const nfs_cookie4 value)
+{
+ return xdrgen_encode_uint64_t(xdr, value);
+}
+
+static bool __maybe_unused
+xdrgen_encode_nfs_fh4(struct xdr_stream *xdr, const nfs_fh4 value)
+{
+ return xdr_stream_encode_opaque(xdr, value.data, value.len) >= 0;
+}
+
static bool __maybe_unused
xdrgen_encode_utf8string(struct xdr_stream *xdr, const utf8string value)
{
@@ -412,6 +815,29 @@ xdrgen_encode_utf8str_mixed(struct xdr_stream *xdr, const utf8str_mixed value)
return xdrgen_encode_utf8string(xdr, value);
}
+static bool __maybe_unused
+xdrgen_encode_component4(struct xdr_stream *xdr, const component4 value)
+{
+ return xdrgen_encode_utf8str_cs(xdr, value);
+}
+
+static bool __maybe_unused
+xdrgen_encode_linktext4(struct xdr_stream *xdr, const linktext4 value)
+{
+ return xdrgen_encode_utf8str_cs(xdr, value);
+}
+
+static bool __maybe_unused
+xdrgen_encode_pathname4(struct xdr_stream *xdr, const pathname4 value)
+{
+ if (xdr_stream_encode_u32(xdr, value.count) != XDR_UNIT)
+ return false;
+ for (u32 i = 0; i < value.count; i++)
+ if (!xdrgen_encode_component4(xdr, value.element[i]))
+ return false;
+ return true;
+}
+
static bool __maybe_unused
xdrgen_encode_nfstime4(struct xdr_stream *xdr, const struct nfstime4 *value)
{
@@ -422,6 +848,26 @@ xdrgen_encode_nfstime4(struct xdr_stream *xdr, const struct nfstime4 *value)
return true;
}
+static bool __maybe_unused
+xdrgen_encode_fattr4(struct xdr_stream *xdr, const struct fattr4 *value)
+{
+ if (!xdrgen_encode_bitmap4(xdr, value->attrmask))
+ return false;
+ if (!xdrgen_encode_attrlist4(xdr, value->attr_vals))
+ return false;
+ return true;
+}
+
+static bool __maybe_unused
+xdrgen_encode_stateid4(struct xdr_stream *xdr, const struct stateid4 *value)
+{
+ if (!xdrgen_encode_uint32_t(xdr, value->seqid))
+ return false;
+ if (xdr_stream_encode_opaque_fixed(xdr, value->other, 12) < 0)
+ return false;
+ return true;
+}
+
static bool __maybe_unused
xdrgen_encode_fattr4_offline(struct xdr_stream *xdr, const fattr4_offline value)
{
@@ -567,3 +1013,137 @@ xdrgen_encode_fattr4_posix_access_acl(struct xdr_stream *xdr, const fattr4_posix
return false;
return true;
}
+
+static bool __maybe_unused
+xdrgen_encode_notify_type4(struct xdr_stream *xdr, notify_type4 value)
+{
+ return xdr_stream_encode_u32(xdr, value) == XDR_UNIT;
+}
+
+static bool __maybe_unused
+xdrgen_encode_notify_entry4(struct xdr_stream *xdr, const struct notify_entry4 *value)
+{
+ if (!xdrgen_encode_component4(xdr, value->ne_file))
+ return false;
+ if (!xdrgen_encode_fattr4(xdr, &value->ne_attrs))
+ return false;
+ return true;
+}
+
+static bool __maybe_unused
+xdrgen_encode_prev_entry4(struct xdr_stream *xdr, const struct prev_entry4 *value)
+{
+ if (!xdrgen_encode_notify_entry4(xdr, &value->pe_prev_entry))
+ return false;
+ if (!xdrgen_encode_nfs_cookie4(xdr, value->pe_prev_entry_cookie))
+ return false;
+ return true;
+}
+
+bool
+xdrgen_encode_notify_remove4(struct xdr_stream *xdr, const struct notify_remove4 *value)
+{
+ if (!xdrgen_encode_notify_entry4(xdr, &value->nrm_old_entry))
+ return false;
+ if (!xdrgen_encode_nfs_cookie4(xdr, value->nrm_old_entry_cookie))
+ return false;
+ return true;
+}
+
+bool
+xdrgen_encode_notify_add4(struct xdr_stream *xdr, const struct notify_add4 *value)
+{
+ if (value->nad_old_entry.count > 1)
+ return false;
+ if (xdr_stream_encode_u32(xdr, value->nad_old_entry.count) != XDR_UNIT)
+ return false;
+ for (u32 i = 0; i < value->nad_old_entry.count; i++)
+ if (!xdrgen_encode_notify_remove4(xdr, &value->nad_old_entry.element[i]))
+ return false;
+ if (!xdrgen_encode_notify_entry4(xdr, &value->nad_new_entry))
+ return false;
+ if (value->nad_new_entry_cookie.count > 1)
+ return false;
+ if (xdr_stream_encode_u32(xdr, value->nad_new_entry_cookie.count) != XDR_UNIT)
+ return false;
+ for (u32 i = 0; i < value->nad_new_entry_cookie.count; i++)
+ if (!xdrgen_encode_nfs_cookie4(xdr, value->nad_new_entry_cookie.element[i]))
+ return false;
+ if (value->nad_prev_entry.count > 1)
+ return false;
+ if (xdr_stream_encode_u32(xdr, value->nad_prev_entry.count) != XDR_UNIT)
+ return false;
+ for (u32 i = 0; i < value->nad_prev_entry.count; i++)
+ if (!xdrgen_encode_prev_entry4(xdr, &value->nad_prev_entry.element[i]))
+ return false;
+ if (!xdrgen_encode_bool(xdr, value->nad_last_entry))
+ return false;
+ return true;
+}
+
+bool
+xdrgen_encode_notify_attr4(struct xdr_stream *xdr, const struct notify_attr4 *value)
+{
+ if (!xdrgen_encode_notify_entry4(xdr, &value->na_changed_entry))
+ return false;
+ return true;
+}
+
+bool
+xdrgen_encode_notify_rename4(struct xdr_stream *xdr, const struct notify_rename4 *value)
+{
+ if (!xdrgen_encode_notify_remove4(xdr, &value->nrn_old_entry))
+ return false;
+ if (!xdrgen_encode_notify_add4(xdr, &value->nrn_new_entry))
+ return false;
+ return true;
+}
+
+static bool __maybe_unused
+xdrgen_encode_notify_verifier4(struct xdr_stream *xdr, const struct notify_verifier4 *value)
+{
+ if (!xdrgen_encode_verifier4(xdr, value->nv_old_cookieverf))
+ return false;
+ if (!xdrgen_encode_verifier4(xdr, value->nv_new_cookieverf))
+ return false;
+ return true;
+}
+
+static bool __maybe_unused
+xdrgen_encode_notifylist4(struct xdr_stream *xdr, const notifylist4 value)
+{
+ return xdr_stream_encode_opaque(xdr, value.data, value.len) >= 0;
+}
+
+static bool __maybe_unused
+xdrgen_encode_notify4(struct xdr_stream *xdr, const struct notify4 *value)
+{
+ if (!xdrgen_encode_bitmap4(xdr, value->notify_mask))
+ return false;
+ if (!xdrgen_encode_notifylist4(xdr, value->notify_vals))
+ return false;
+ return true;
+}
+
+bool
+xdrgen_encode_CB_NOTIFY4args(struct xdr_stream *xdr, const struct CB_NOTIFY4args *value)
+{
+ if (!xdrgen_encode_stateid4(xdr, &value->cna_stateid))
+ return false;
+ if (!xdrgen_encode_nfs_fh4(xdr, value->cna_fh))
+ return false;
+ if (xdr_stream_encode_u32(xdr, value->cna_changes.count) != XDR_UNIT)
+ return false;
+ for (u32 i = 0; i < value->cna_changes.count; i++)
+ if (!xdrgen_encode_notify4(xdr, &value->cna_changes.element[i]))
+ return false;
+ return true;
+}
+
+bool
+xdrgen_encode_CB_NOTIFY4res(struct xdr_stream *xdr, const struct CB_NOTIFY4res *value)
+{
+ if (!xdrgen_encode_nfsstat4(xdr, value->cnr_status))
+ return false;
+ return true;
+}
diff --git a/fs/nfsd/nfs4xdr_gen.h b/fs/nfsd/nfs4xdr_gen.h
index 1c487f1a11ab..503fe2ccba51 100644
--- a/fs/nfsd/nfs4xdr_gen.h
+++ b/fs/nfsd/nfs4xdr_gen.h
@@ -1,7 +1,7 @@
/* SPDX-License-Identifier: GPL-2.0 */
/* Generated by xdrgen. Manual edits will be lost. */
/* XDR specification file: ../../Documentation/sunrpc/xdr/nfs4_1.x */
-/* XDR specification modification time: Thu Jan 8 23:12:07 2026 */
+/* XDR specification modification time: Wed Mar 25 11:39:22 2026 */
#ifndef _LINUX_XDRGEN_NFS4_1_DECL_H
#define _LINUX_XDRGEN_NFS4_1_DECL_H
@@ -32,4 +32,22 @@ bool xdrgen_decode_posixaceperm4(struct xdr_stream *xdr, posixaceperm4 *ptr);
bool xdrgen_encode_posixaceperm4(struct xdr_stream *xdr, const posixaceperm4 value);
+bool xdrgen_decode_notify_remove4(struct xdr_stream *xdr, struct notify_remove4 *ptr);
+bool xdrgen_encode_notify_remove4(struct xdr_stream *xdr, const struct notify_remove4 *value);
+
+bool xdrgen_decode_notify_add4(struct xdr_stream *xdr, struct notify_add4 *ptr);
+bool xdrgen_encode_notify_add4(struct xdr_stream *xdr, const struct notify_add4 *value);
+
+bool xdrgen_decode_notify_attr4(struct xdr_stream *xdr, struct notify_attr4 *ptr);
+bool xdrgen_encode_notify_attr4(struct xdr_stream *xdr, const struct notify_attr4 *value);
+
+bool xdrgen_decode_notify_rename4(struct xdr_stream *xdr, struct notify_rename4 *ptr);
+bool xdrgen_encode_notify_rename4(struct xdr_stream *xdr, const struct notify_rename4 *value);
+
+bool xdrgen_decode_CB_NOTIFY4args(struct xdr_stream *xdr, struct CB_NOTIFY4args *ptr);
+bool xdrgen_encode_CB_NOTIFY4args(struct xdr_stream *xdr, const struct CB_NOTIFY4args *value);
+
+bool xdrgen_decode_CB_NOTIFY4res(struct xdr_stream *xdr, struct CB_NOTIFY4res *ptr);
+bool xdrgen_encode_CB_NOTIFY4res(struct xdr_stream *xdr, const struct CB_NOTIFY4res *value);
+
#endif /* _LINUX_XDRGEN_NFS4_1_DECL_H */
diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
index a13d18447324..60cacf64181c 100644
--- a/fs/nfsd/trace.h
+++ b/fs/nfsd/trace.h
@@ -1677,6 +1677,7 @@ TRACE_EVENT(nfsd_cb_setup_err,
{ OP_CB_RECALL, "CB_RECALL" }, \
{ OP_CB_LAYOUTRECALL, "CB_LAYOUTRECALL" }, \
{ OP_CB_RECALL_ANY, "CB_RECALL_ANY" }, \
+ { OP_CB_NOTIFY, "CB_NOTIFY" }, \
{ OP_CB_NOTIFY_LOCK, "CB_NOTIFY_LOCK" }, \
{ OP_CB_OFFLOAD, "CB_OFFLOAD" })
diff --git a/include/linux/nfs4.h b/include/linux/nfs4.h
index d87be1f25273..44e5e9fa12e1 100644
--- a/include/linux/nfs4.h
+++ b/include/linux/nfs4.h
@@ -171,133 +171,6 @@ Needs to be updated if more operations are defined in future.*/
#define LAST_NFS42_OP OP_REMOVEXATTR
#define LAST_NFS4_OP LAST_NFS42_OP
-enum nfsstat4 {
- NFS4_OK = 0,
- NFS4ERR_PERM = 1,
- NFS4ERR_NOENT = 2,
- NFS4ERR_IO = 5,
- NFS4ERR_NXIO = 6,
- NFS4ERR_ACCESS = 13,
- NFS4ERR_EXIST = 17,
- NFS4ERR_XDEV = 18,
- /* Unused/reserved 19 */
- NFS4ERR_NOTDIR = 20,
- NFS4ERR_ISDIR = 21,
- NFS4ERR_INVAL = 22,
- NFS4ERR_FBIG = 27,
- NFS4ERR_NOSPC = 28,
- NFS4ERR_ROFS = 30,
- NFS4ERR_MLINK = 31,
- NFS4ERR_NAMETOOLONG = 63,
- NFS4ERR_NOTEMPTY = 66,
- NFS4ERR_DQUOT = 69,
- NFS4ERR_STALE = 70,
- NFS4ERR_BADHANDLE = 10001,
- NFS4ERR_BAD_COOKIE = 10003,
- NFS4ERR_NOTSUPP = 10004,
- NFS4ERR_TOOSMALL = 10005,
- NFS4ERR_SERVERFAULT = 10006,
- NFS4ERR_BADTYPE = 10007,
- NFS4ERR_DELAY = 10008,
- NFS4ERR_SAME = 10009,
- NFS4ERR_DENIED = 10010,
- NFS4ERR_EXPIRED = 10011,
- NFS4ERR_LOCKED = 10012,
- NFS4ERR_GRACE = 10013,
- NFS4ERR_FHEXPIRED = 10014,
- NFS4ERR_SHARE_DENIED = 10015,
- NFS4ERR_WRONGSEC = 10016,
- NFS4ERR_CLID_INUSE = 10017,
- NFS4ERR_RESOURCE = 10018,
- NFS4ERR_MOVED = 10019,
- NFS4ERR_NOFILEHANDLE = 10020,
- NFS4ERR_MINOR_VERS_MISMATCH = 10021,
- NFS4ERR_STALE_CLIENTID = 10022,
- NFS4ERR_STALE_STATEID = 10023,
- NFS4ERR_OLD_STATEID = 10024,
- NFS4ERR_BAD_STATEID = 10025,
- NFS4ERR_BAD_SEQID = 10026,
- NFS4ERR_NOT_SAME = 10027,
- NFS4ERR_LOCK_RANGE = 10028,
- NFS4ERR_SYMLINK = 10029,
- NFS4ERR_RESTOREFH = 10030,
- NFS4ERR_LEASE_MOVED = 10031,
- NFS4ERR_ATTRNOTSUPP = 10032,
- NFS4ERR_NO_GRACE = 10033,
- NFS4ERR_RECLAIM_BAD = 10034,
- NFS4ERR_RECLAIM_CONFLICT = 10035,
- NFS4ERR_BADXDR = 10036,
- NFS4ERR_LOCKS_HELD = 10037,
- NFS4ERR_OPENMODE = 10038,
- NFS4ERR_BADOWNER = 10039,
- NFS4ERR_BADCHAR = 10040,
- NFS4ERR_BADNAME = 10041,
- NFS4ERR_BAD_RANGE = 10042,
- NFS4ERR_LOCK_NOTSUPP = 10043,
- NFS4ERR_OP_ILLEGAL = 10044,
- NFS4ERR_DEADLOCK = 10045,
- NFS4ERR_FILE_OPEN = 10046,
- NFS4ERR_ADMIN_REVOKED = 10047,
- NFS4ERR_CB_PATH_DOWN = 10048,
-
- /* nfs41 */
- NFS4ERR_BADIOMODE = 10049,
- NFS4ERR_BADLAYOUT = 10050,
- NFS4ERR_BAD_SESSION_DIGEST = 10051,
- NFS4ERR_BADSESSION = 10052,
- NFS4ERR_BADSLOT = 10053,
- NFS4ERR_COMPLETE_ALREADY = 10054,
- NFS4ERR_CONN_NOT_BOUND_TO_SESSION = 10055,
- NFS4ERR_DELEG_ALREADY_WANTED = 10056,
- NFS4ERR_BACK_CHAN_BUSY = 10057, /* backchan reqs outstanding */
- NFS4ERR_LAYOUTTRYLATER = 10058,
- NFS4ERR_LAYOUTUNAVAILABLE = 10059,
- NFS4ERR_NOMATCHING_LAYOUT = 10060,
- NFS4ERR_RECALLCONFLICT = 10061,
- NFS4ERR_UNKNOWN_LAYOUTTYPE = 10062,
- NFS4ERR_SEQ_MISORDERED = 10063, /* unexpected seq.id in req */
- NFS4ERR_SEQUENCE_POS = 10064, /* [CB_]SEQ. op not 1st op */
- NFS4ERR_REQ_TOO_BIG = 10065, /* request too big */
- NFS4ERR_REP_TOO_BIG = 10066, /* reply too big */
- NFS4ERR_REP_TOO_BIG_TO_CACHE = 10067, /* rep. not all cached */
- NFS4ERR_RETRY_UNCACHED_REP = 10068, /* retry & rep. uncached */
- NFS4ERR_UNSAFE_COMPOUND = 10069, /* retry/recovery too hard */
- NFS4ERR_TOO_MANY_OPS = 10070, /* too many ops in [CB_]COMP */
- NFS4ERR_OP_NOT_IN_SESSION = 10071, /* op needs [CB_]SEQ. op */
- NFS4ERR_HASH_ALG_UNSUPP = 10072, /* hash alg. not supp. */
- /* Error 10073 is unused. */
- NFS4ERR_CLIENTID_BUSY = 10074, /* clientid has state */
- NFS4ERR_PNFS_IO_HOLE = 10075, /* IO to _SPARSE file hole */
- NFS4ERR_SEQ_FALSE_RETRY = 10076, /* retry not original */
- NFS4ERR_BAD_HIGH_SLOT = 10077, /* sequence arg bad */
- NFS4ERR_DEADSESSION = 10078, /* persistent session dead */
- NFS4ERR_ENCR_ALG_UNSUPP = 10079, /* SSV alg mismatch */
- NFS4ERR_PNFS_NO_LAYOUT = 10080, /* direct I/O with no layout */
- NFS4ERR_NOT_ONLY_OP = 10081, /* bad compound */
- NFS4ERR_WRONG_CRED = 10082, /* permissions:state change */
- NFS4ERR_WRONG_TYPE = 10083, /* current operation mismatch */
- NFS4ERR_DIRDELEG_UNAVAIL = 10084, /* no directory delegation */
- NFS4ERR_REJECT_DELEG = 10085, /* on callback */
- NFS4ERR_RETURNCONFLICT = 10086, /* outstanding layoutreturn */
- NFS4ERR_DELEG_REVOKED = 10087, /* deleg./layout revoked */
-
- /* nfs42 */
- NFS4ERR_PARTNER_NOTSUPP = 10088,
- NFS4ERR_PARTNER_NO_AUTH = 10089,
- NFS4ERR_UNION_NOTSUPP = 10090,
- NFS4ERR_OFFLOAD_DENIED = 10091,
- NFS4ERR_WRONG_LFS = 10092,
- NFS4ERR_BADLABEL = 10093,
- NFS4ERR_OFFLOAD_NO_REQS = 10094,
-
- /* xattr (RFC8276) */
- NFS4ERR_NOXATTR = 10095,
- NFS4ERR_XATTR2BIG = 10096,
-
- /* can be used for internal errors */
- NFS4ERR_FIRST_FREE
-};
-
/* error codes for internal client use */
#define NFS4ERR_RESET_TO_MDS 12001
#define NFS4ERR_RESET_TO_PNFS 12002
diff --git a/include/linux/sunrpc/xdrgen/nfs4_1.h b/include/linux/sunrpc/xdrgen/nfs4_1.h
index 4ac54bdbd335..f761c3ddb4c7 100644
--- a/include/linux/sunrpc/xdrgen/nfs4_1.h
+++ b/include/linux/sunrpc/xdrgen/nfs4_1.h
@@ -1,7 +1,7 @@
/* SPDX-License-Identifier: GPL-2.0 */
/* Generated by xdrgen. Manual edits will be lost. */
/* XDR specification file: ../../Documentation/sunrpc/xdr/nfs4_1.x */
-/* XDR specification modification time: Thu Jan 8 23:12:07 2026 */
+/* XDR specification modification time: Wed Mar 25 11:39:22 2026 */
#ifndef _LINUX_XDRGEN_NFS4_1_DEF_H
#define _LINUX_XDRGEN_NFS4_1_DEF_H
@@ -9,15 +9,150 @@
#include <linux/types.h>
#include <linux/sunrpc/xdrgen/_defs.h>
-typedef s64 int64_t;
+typedef s32 int32_t;
typedef u32 uint32_t;
+typedef s64 int64_t;
+
+typedef u64 uint64_t;
+
+enum { NFS4_VERIFIER_SIZE = 8 };
+
+enum { NFS4_FHSIZE = 128 };
+
+enum nfsstat4 {
+ NFS4_OK = 0,
+ NFS4ERR_PERM = 1,
+ NFS4ERR_NOENT = 2,
+ NFS4ERR_IO = 5,
+ NFS4ERR_NXIO = 6,
+ NFS4ERR_ACCESS = 13,
+ NFS4ERR_EXIST = 17,
+ NFS4ERR_XDEV = 18,
+ NFS4ERR_NOTDIR = 20,
+ NFS4ERR_ISDIR = 21,
+ NFS4ERR_INVAL = 22,
+ NFS4ERR_FBIG = 27,
+ NFS4ERR_NOSPC = 28,
+ NFS4ERR_ROFS = 30,
+ NFS4ERR_MLINK = 31,
+ NFS4ERR_NAMETOOLONG = 63,
+ NFS4ERR_NOTEMPTY = 66,
+ NFS4ERR_DQUOT = 69,
+ NFS4ERR_STALE = 70,
+ NFS4ERR_BADHANDLE = 10001,
+ NFS4ERR_BAD_COOKIE = 10003,
+ NFS4ERR_NOTSUPP = 10004,
+ NFS4ERR_TOOSMALL = 10005,
+ NFS4ERR_SERVERFAULT = 10006,
+ NFS4ERR_BADTYPE = 10007,
+ NFS4ERR_DELAY = 10008,
+ NFS4ERR_SAME = 10009,
+ NFS4ERR_DENIED = 10010,
+ NFS4ERR_EXPIRED = 10011,
+ NFS4ERR_LOCKED = 10012,
+ NFS4ERR_GRACE = 10013,
+ NFS4ERR_FHEXPIRED = 10014,
+ NFS4ERR_SHARE_DENIED = 10015,
+ NFS4ERR_WRONGSEC = 10016,
+ NFS4ERR_CLID_INUSE = 10017,
+ NFS4ERR_RESOURCE = 10018,
+ NFS4ERR_MOVED = 10019,
+ NFS4ERR_NOFILEHANDLE = 10020,
+ NFS4ERR_MINOR_VERS_MISMATCH = 10021,
+ NFS4ERR_STALE_CLIENTID = 10022,
+ NFS4ERR_STALE_STATEID = 10023,
+ NFS4ERR_OLD_STATEID = 10024,
+ NFS4ERR_BAD_STATEID = 10025,
+ NFS4ERR_BAD_SEQID = 10026,
+ NFS4ERR_NOT_SAME = 10027,
+ NFS4ERR_LOCK_RANGE = 10028,
+ NFS4ERR_SYMLINK = 10029,
+ NFS4ERR_RESTOREFH = 10030,
+ NFS4ERR_LEASE_MOVED = 10031,
+ NFS4ERR_ATTRNOTSUPP = 10032,
+ NFS4ERR_NO_GRACE = 10033,
+ NFS4ERR_RECLAIM_BAD = 10034,
+ NFS4ERR_RECLAIM_CONFLICT = 10035,
+ NFS4ERR_BADXDR = 10036,
+ NFS4ERR_LOCKS_HELD = 10037,
+ NFS4ERR_OPENMODE = 10038,
+ NFS4ERR_BADOWNER = 10039,
+ NFS4ERR_BADCHAR = 10040,
+ NFS4ERR_BADNAME = 10041,
+ NFS4ERR_BAD_RANGE = 10042,
+ NFS4ERR_LOCK_NOTSUPP = 10043,
+ NFS4ERR_OP_ILLEGAL = 10044,
+ NFS4ERR_DEADLOCK = 10045,
+ NFS4ERR_FILE_OPEN = 10046,
+ NFS4ERR_ADMIN_REVOKED = 10047,
+ NFS4ERR_CB_PATH_DOWN = 10048,
+ NFS4ERR_BADIOMODE = 10049,
+ NFS4ERR_BADLAYOUT = 10050,
+ NFS4ERR_BAD_SESSION_DIGEST = 10051,
+ NFS4ERR_BADSESSION = 10052,
+ NFS4ERR_BADSLOT = 10053,
+ NFS4ERR_COMPLETE_ALREADY = 10054,
+ NFS4ERR_CONN_NOT_BOUND_TO_SESSION = 10055,
+ NFS4ERR_DELEG_ALREADY_WANTED = 10056,
+ NFS4ERR_BACK_CHAN_BUSY = 10057,
+ NFS4ERR_LAYOUTTRYLATER = 10058,
+ NFS4ERR_LAYOUTUNAVAILABLE = 10059,
+ NFS4ERR_NOMATCHING_LAYOUT = 10060,
+ NFS4ERR_RECALLCONFLICT = 10061,
+ NFS4ERR_UNKNOWN_LAYOUTTYPE = 10062,
+ NFS4ERR_SEQ_MISORDERED = 10063,
+ NFS4ERR_SEQUENCE_POS = 10064,
+ NFS4ERR_REQ_TOO_BIG = 10065,
+ NFS4ERR_REP_TOO_BIG = 10066,
+ NFS4ERR_REP_TOO_BIG_TO_CACHE = 10067,
+ NFS4ERR_RETRY_UNCACHED_REP = 10068,
+ NFS4ERR_UNSAFE_COMPOUND = 10069,
+ NFS4ERR_TOO_MANY_OPS = 10070,
+ NFS4ERR_OP_NOT_IN_SESSION = 10071,
+ NFS4ERR_HASH_ALG_UNSUPP = 10072,
+ NFS4ERR_CLIENTID_BUSY = 10074,
+ NFS4ERR_PNFS_IO_HOLE = 10075,
+ NFS4ERR_SEQ_FALSE_RETRY = 10076,
+ NFS4ERR_BAD_HIGH_SLOT = 10077,
+ NFS4ERR_DEADSESSION = 10078,
+ NFS4ERR_ENCR_ALG_UNSUPP = 10079,
+ NFS4ERR_PNFS_NO_LAYOUT = 10080,
+ NFS4ERR_NOT_ONLY_OP = 10081,
+ NFS4ERR_WRONG_CRED = 10082,
+ NFS4ERR_WRONG_TYPE = 10083,
+ NFS4ERR_DIRDELEG_UNAVAIL = 10084,
+ NFS4ERR_REJECT_DELEG = 10085,
+ NFS4ERR_RETURNCONFLICT = 10086,
+ NFS4ERR_DELEG_REVOKED = 10087,
+ NFS4ERR_PARTNER_NOTSUPP = 10088,
+ NFS4ERR_PARTNER_NO_AUTH = 10089,
+ NFS4ERR_UNION_NOTSUPP = 10090,
+ NFS4ERR_OFFLOAD_DENIED = 10091,
+ NFS4ERR_WRONG_LFS = 10092,
+ NFS4ERR_BADLABEL = 10093,
+ NFS4ERR_OFFLOAD_NO_REQS = 10094,
+ NFS4ERR_NOXATTR = 10095,
+ NFS4ERR_XATTR2BIG = 10096,
+ NFS4ERR_FIRST_FREE = 10097,
+};
+
+typedef enum nfsstat4 nfsstat4;
+
+typedef opaque attrlist4;
+
typedef struct {
u32 count;
uint32_t *element;
} bitmap4;
+typedef u8 verifier4[NFS4_VERIFIER_SIZE];
+
+typedef uint64_t nfs_cookie4;
+
+typedef opaque nfs_fh4;
+
typedef opaque utf8string;
typedef utf8string utf8str_cis;
@@ -26,11 +161,30 @@ typedef utf8string utf8str_cs;
typedef utf8string utf8str_mixed;
+typedef utf8str_cs component4;
+
+typedef utf8str_cs linktext4;
+
+typedef struct {
+ u32 count;
+ component4 *element;
+} pathname4;
+
struct nfstime4 {
int64_t seconds;
uint32_t nseconds;
};
+struct fattr4 {
+ bitmap4 attrmask;
+ attrlist4 attr_vals;
+};
+
+struct stateid4 {
+ uint32_t seqid;
+ u8 other[12];
+};
+
typedef bool fattr4_offline;
enum { FATTR4_OFFLINE = 83 };
@@ -216,11 +370,98 @@ enum { FATTR4_POSIX_DEFAULT_ACL = 91 };
enum { FATTR4_POSIX_ACCESS_ACL = 92 };
-#define NFS4_int64_t_sz \
- (XDR_hyper)
+enum notify_type4 {
+ NOTIFY4_CHANGE_CHILD_ATTRS = 0,
+ NOTIFY4_CHANGE_DIR_ATTRS = 1,
+ NOTIFY4_REMOVE_ENTRY = 2,
+ NOTIFY4_ADD_ENTRY = 3,
+ NOTIFY4_RENAME_ENTRY = 4,
+ NOTIFY4_CHANGE_COOKIE_VERIFIER = 5,
+};
+
+typedef enum notify_type4 notify_type4;
+
+struct notify_entry4 {
+ component4 ne_file;
+ struct fattr4 ne_attrs;
+};
+
+struct prev_entry4 {
+ struct notify_entry4 pe_prev_entry;
+ nfs_cookie4 pe_prev_entry_cookie;
+};
+
+struct notify_remove4 {
+ struct notify_entry4 nrm_old_entry;
+ nfs_cookie4 nrm_old_entry_cookie;
+};
+
+struct notify_add4 {
+ struct {
+ u32 count;
+ struct notify_remove4 *element;
+ } nad_old_entry;
+ struct notify_entry4 nad_new_entry;
+ struct {
+ u32 count;
+ nfs_cookie4 *element;
+ } nad_new_entry_cookie;
+ struct {
+ u32 count;
+ struct prev_entry4 *element;
+ } nad_prev_entry;
+ bool nad_last_entry;
+};
+
+struct notify_attr4 {
+ struct notify_entry4 na_changed_entry;
+};
+
+struct notify_rename4 {
+ struct notify_remove4 nrn_old_entry;
+ struct notify_add4 nrn_new_entry;
+};
+
+struct notify_verifier4 {
+ verifier4 nv_old_cookieverf;
+ verifier4 nv_new_cookieverf;
+};
+
+typedef opaque notifylist4;
+
+struct notify4 {
+ bitmap4 notify_mask;
+ notifylist4 notify_vals;
+};
+
+struct CB_NOTIFY4args {
+ struct stateid4 cna_stateid;
+ nfs_fh4 cna_fh;
+ struct {
+ u32 count;
+ struct notify4 *element;
+ } cna_changes;
+};
+
+struct CB_NOTIFY4res {
+ nfsstat4 cnr_status;
+};
+
+#define NFS4_int32_t_sz \
+ (XDR_int)
#define NFS4_uint32_t_sz \
(XDR_unsigned_int)
+#define NFS4_int64_t_sz \
+ (XDR_hyper)
+#define NFS4_uint64_t_sz \
+ (XDR_unsigned_hyper)
+#define NFS4_nfsstat4_sz (XDR_int)
+#define NFS4_attrlist4_sz (XDR_unsigned_int)
#define NFS4_bitmap4_sz (XDR_unsigned_int)
+#define NFS4_verifier4_sz (XDR_QUADLEN(NFS4_VERIFIER_SIZE))
+#define NFS4_nfs_cookie4_sz \
+ (NFS4_uint64_t_sz)
+#define NFS4_nfs_fh4_sz (XDR_unsigned_int + XDR_QUADLEN(NFS4_FHSIZE))
#define NFS4_utf8string_sz (XDR_unsigned_int)
#define NFS4_utf8str_cis_sz \
(NFS4_utf8string_sz)
@@ -228,8 +469,17 @@ enum { FATTR4_POSIX_ACCESS_ACL = 92 };
(NFS4_utf8string_sz)
#define NFS4_utf8str_mixed_sz \
(NFS4_utf8string_sz)
+#define NFS4_component4_sz \
+ (NFS4_utf8str_cs_sz)
+#define NFS4_linktext4_sz \
+ (NFS4_utf8str_cs_sz)
+#define NFS4_pathname4_sz (XDR_unsigned_int)
#define NFS4_nfstime4_sz \
(NFS4_int64_t_sz + NFS4_uint32_t_sz)
+#define NFS4_fattr4_sz \
+ (NFS4_bitmap4_sz + NFS4_attrlist4_sz)
+#define NFS4_stateid4_sz \
+ (NFS4_uint32_t_sz + XDR_QUADLEN(12))
#define NFS4_fattr4_offline_sz \
(XDR_bool)
#define NFS4_open_arguments4_sz \
@@ -259,5 +509,27 @@ enum { FATTR4_POSIX_ACCESS_ACL = 92 };
(NFS4_aclscope4_sz)
#define NFS4_fattr4_posix_default_acl_sz (XDR_unsigned_int)
#define NFS4_fattr4_posix_access_acl_sz (XDR_unsigned_int)
+#define NFS4_notify_type4_sz (XDR_int)
+#define NFS4_notify_entry4_sz \
+ (NFS4_component4_sz + NFS4_fattr4_sz)
+#define NFS4_prev_entry4_sz \
+ (NFS4_notify_entry4_sz + NFS4_nfs_cookie4_sz)
+#define NFS4_notify_remove4_sz \
+ (NFS4_notify_entry4_sz + NFS4_nfs_cookie4_sz)
+#define NFS4_notify_add4_sz \
+ (XDR_unsigned_int + (1 * (NFS4_notify_remove4_sz)) + NFS4_notify_entry4_sz + XDR_unsigned_int + (1 * (NFS4_nfs_cookie4_sz)) + XDR_unsigned_int + (1 * (NFS4_prev_entry4_sz)) + XDR_bool)
+#define NFS4_notify_attr4_sz \
+ (NFS4_notify_entry4_sz)
+#define NFS4_notify_rename4_sz \
+ (NFS4_notify_remove4_sz + NFS4_notify_add4_sz)
+#define NFS4_notify_verifier4_sz \
+ (NFS4_verifier4_sz + NFS4_verifier4_sz)
+#define NFS4_notifylist4_sz (XDR_unsigned_int)
+#define NFS4_notify4_sz \
+ (NFS4_bitmap4_sz + NFS4_notifylist4_sz)
+#define NFS4_CB_NOTIFY4args_sz \
+ (NFS4_stateid4_sz + NFS4_nfs_fh4_sz + XDR_unsigned_int)
+#define NFS4_CB_NOTIFY4res_sz \
+ (NFS4_nfsstat4_sz)
#endif /* _LINUX_XDRGEN_NFS4_1_DEF_H */
diff --git a/include/uapi/linux/nfs4.h b/include/uapi/linux/nfs4.h
index 4273e0249fcb..289205b53a08 100644
--- a/include/uapi/linux/nfs4.h
+++ b/include/uapi/linux/nfs4.h
@@ -17,11 +17,9 @@
#include <linux/types.h>
#define NFS4_BITMAP_SIZE 3
-#define NFS4_VERIFIER_SIZE 8
#define NFS4_STATEID_SEQID_SIZE 4
#define NFS4_STATEID_OTHER_SIZE 12
#define NFS4_STATEID_SIZE (NFS4_STATEID_SEQID_SIZE + NFS4_STATEID_OTHER_SIZE)
-#define NFS4_FHSIZE 128
#define NFS4_MAXPATHLEN PATH_MAX
#define NFS4_MAXNAMLEN NAME_MAX
#define NFS4_OPAQUE_LIMIT 1024
--
2.53.0
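
The generated `NFS4_*_sz` macros above count 4-byte XDR words, not bytes. The sketch below works through three of them outside the kernel. It is illustrative only: the `XDR_unsigned_int` and `XDR_QUADLEN` definitions are assumptions reproduced from memory of the kernel headers, and `sketch_words_to_bytes()` is a hypothetical helper that does not exist in the tree.

```c
/* ASSUMPTION: one XDR word per unsigned int, as xdrgen's _defs.h is believed to define it. */
#define XDR_unsigned_int	(1)
/* ASSUMPTION: round a byte count up to whole 4-byte XDR words, mirroring the kernel's XDR_QUADLEN. */
#define XDR_QUADLEN(l)		(((l) + 3) >> 2)

/* Constants copied from the generated header in the patch. */
enum { NFS4_VERIFIER_SIZE = 8 };
enum { NFS4_FHSIZE = 128 };

/* Size macros copied from the generated header in the patch. */
#define NFS4_uint32_t_sz	(XDR_unsigned_int)
#define NFS4_verifier4_sz	(XDR_QUADLEN(NFS4_VERIFIER_SIZE))
#define NFS4_nfs_fh4_sz		(XDR_unsigned_int + XDR_QUADLEN(NFS4_FHSIZE))
#define NFS4_stateid4_sz	(NFS4_uint32_t_sz + XDR_QUADLEN(12))

/* Hypothetical helper: convert a size in XDR words to a size in bytes. */
static inline unsigned int sketch_words_to_bytes(unsigned int words)
{
	return words * 4;
}
```

Under these assumptions a stateid4 occupies 4 words (16 bytes), which agrees with NFS4_STATEID_SIZE (4 + 12) in include/uapi/linux/nfs4.h, and a maximum-length nfs_fh4 occupies 33 words: one length word plus 32 words of opaque data.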
* [PATCH v2 10/28] nfs_common: add new NOTIFY4_* flags proposed in RFC8881bis
From: Jeff Layton @ 2026-04-16 17:35 UTC (permalink / raw)
To: Alexander Viro, Christian Brauner, Jan Kara, Chuck Lever,
Alexander Aring, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, NeilBrown,
Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust,
Anna Schumaker, Amir Goldstein
Cc: Calum Mackay, linux-fsdevel, linux-kernel, linux-trace-kernel,
linux-doc, linux-nfs, Jeff Layton
In-Reply-To: <20260416-dir-deleg-v2-0-851426a550f6@kernel.org>
RFC8881bis adds several new notification flags for GET_DIR_DELEGATION
that nfsd needs to support. Add them to enum notify_type4 in the XDR
specification and regenerate the xdrgen output.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
Documentation/sunrpc/xdr/nfs4_1.x | 16 +++++++++++++++-
fs/nfsd/nfs4xdr_gen.c | 13 ++++++++++++-
fs/nfsd/nfs4xdr_gen.h | 2 +-
include/linux/sunrpc/xdrgen/nfs4_1.h | 13 ++++++++++++-
4 files changed, 40 insertions(+), 4 deletions(-)
diff --git a/Documentation/sunrpc/xdr/nfs4_1.x b/Documentation/sunrpc/xdr/nfs4_1.x
index 632f5b579c39..aa14b590b524 100644
--- a/Documentation/sunrpc/xdr/nfs4_1.x
+++ b/Documentation/sunrpc/xdr/nfs4_1.x
@@ -416,7 +416,21 @@ enum notify_type4 {
NOTIFY4_REMOVE_ENTRY = 2,
NOTIFY4_ADD_ENTRY = 3,
NOTIFY4_RENAME_ENTRY = 4,
- NOTIFY4_CHANGE_COOKIE_VERIFIER = 5
+ NOTIFY4_CHANGE_COOKIE_VERIFIER = 5,
+ /*
+ * Added in NFSv4.1 bis document
+ */
+ NOTIFY4_GFLAG_EXTEND = 6,
+ NOTIFY4_AUFLAG_VALID = 7,
+ NOTIFY4_AUFLAG_USER = 8,
+ NOTIFY4_AUFLAG_GROUP = 9,
+ NOTIFY4_AUFLAG_OTHER = 10,
+ NOTIFY4_CHANGE_AUTH = 11,
+ NOTIFY4_CFLAG_ORDER = 12,
+ NOTIFY4_AUFLAG_GANOW = 13,
+ NOTIFY4_AUFLAG_GALATER = 14,
+ NOTIFY4_CHANGE_GA = 15,
+ NOTIFY4_CHANGE_AMASK = 16
};
/* Changed entry information. */
diff --git a/fs/nfsd/nfs4xdr_gen.c b/fs/nfsd/nfs4xdr_gen.c
index 5e656d6bbb8e..80369139ef7e 100644
--- a/fs/nfsd/nfs4xdr_gen.c
+++ b/fs/nfsd/nfs4xdr_gen.c
@@ -1,7 +1,7 @@
// SPDX-License-Identifier: GPL-2.0
// Generated by xdrgen. Manual edits will be lost.
// XDR specification file: ../../Documentation/sunrpc/xdr/nfs4_1.x
-// XDR specification modification time: Wed Mar 25 11:39:22 2026
+// XDR specification modification time: Wed Mar 25 11:40:02 2026
#include <linux/sunrpc/svc.h>
@@ -590,6 +590,17 @@ xdrgen_decode_notify_type4(struct xdr_stream *xdr, notify_type4 *ptr)
case NOTIFY4_ADD_ENTRY:
case NOTIFY4_RENAME_ENTRY:
case NOTIFY4_CHANGE_COOKIE_VERIFIER:
+ case NOTIFY4_GFLAG_EXTEND:
+ case NOTIFY4_AUFLAG_VALID:
+ case NOTIFY4_AUFLAG_USER:
+ case NOTIFY4_AUFLAG_GROUP:
+ case NOTIFY4_AUFLAG_OTHER:
+ case NOTIFY4_CHANGE_AUTH:
+ case NOTIFY4_CFLAG_ORDER:
+ case NOTIFY4_AUFLAG_GANOW:
+ case NOTIFY4_AUFLAG_GALATER:
+ case NOTIFY4_CHANGE_GA:
+ case NOTIFY4_CHANGE_AMASK:
break;
default:
return false;
diff --git a/fs/nfsd/nfs4xdr_gen.h b/fs/nfsd/nfs4xdr_gen.h
index 503fe2ccba51..092a1ed399c7 100644
--- a/fs/nfsd/nfs4xdr_gen.h
+++ b/fs/nfsd/nfs4xdr_gen.h
@@ -1,7 +1,7 @@
/* SPDX-License-Identifier: GPL-2.0 */
/* Generated by xdrgen. Manual edits will be lost. */
/* XDR specification file: ../../Documentation/sunrpc/xdr/nfs4_1.x */
-/* XDR specification modification time: Wed Mar 25 11:39:22 2026 */
+/* XDR specification modification time: Wed Mar 25 11:40:02 2026 */
#ifndef _LINUX_XDRGEN_NFS4_1_DECL_H
#define _LINUX_XDRGEN_NFS4_1_DECL_H
diff --git a/include/linux/sunrpc/xdrgen/nfs4_1.h b/include/linux/sunrpc/xdrgen/nfs4_1.h
index f761c3ddb4c7..537504069f24 100644
--- a/include/linux/sunrpc/xdrgen/nfs4_1.h
+++ b/include/linux/sunrpc/xdrgen/nfs4_1.h
@@ -1,7 +1,7 @@
/* SPDX-License-Identifier: GPL-2.0 */
/* Generated by xdrgen. Manual edits will be lost. */
/* XDR specification file: ../../Documentation/sunrpc/xdr/nfs4_1.x */
-/* XDR specification modification time: Wed Mar 25 11:39:22 2026 */
+/* XDR specification modification time: Wed Mar 25 11:40:02 2026 */
#ifndef _LINUX_XDRGEN_NFS4_1_DEF_H
#define _LINUX_XDRGEN_NFS4_1_DEF_H
@@ -377,6 +377,17 @@ enum notify_type4 {
NOTIFY4_ADD_ENTRY = 3,
NOTIFY4_RENAME_ENTRY = 4,
NOTIFY4_CHANGE_COOKIE_VERIFIER = 5,
+ NOTIFY4_GFLAG_EXTEND = 6,
+ NOTIFY4_AUFLAG_VALID = 7,
+ NOTIFY4_AUFLAG_USER = 8,
+ NOTIFY4_AUFLAG_GROUP = 9,
+ NOTIFY4_AUFLAG_OTHER = 10,
+ NOTIFY4_CHANGE_AUTH = 11,
+ NOTIFY4_CFLAG_ORDER = 12,
+ NOTIFY4_AUFLAG_GANOW = 13,
+ NOTIFY4_AUFLAG_GALATER = 14,
+ NOTIFY4_CHANGE_GA = 15,
+ NOTIFY4_CHANGE_AMASK = 16,
};
typedef enum notify_type4 notify_type4;
--
2.53.0
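
The notify_type4 values are used as bit positions in a bitmap4 (the next patch in the series tests them with BIT() against word 0 of the notification bitmap). A minimal sketch of such a lookup follows, assuming the usual 32-bits-per-word bitmap4 layout; `sketch_bitmap_test()` is a hypothetical helper, not a kernel function, and only a subset of the enum is reproduced.

```c
#include <stdbool.h>
#include <stdint.h>

/* Subset of enum notify_type4, values copied from the patch above. */
enum notify_type4 {
	NOTIFY4_REMOVE_ENTRY = 2,
	NOTIFY4_ADD_ENTRY = 3,
	NOTIFY4_RENAME_ENTRY = 4,
	NOTIFY4_CHANGE_AMASK = 16,
};

/*
 * Hypothetical helper: report whether notification type 'type' is set in a
 * bitmap4-style array of 'count' 32-bit words. Types beyond the end of the
 * array are treated as not set.
 */
static bool sketch_bitmap_test(const uint32_t *words, uint32_t count,
			       unsigned int type)
{
	uint32_t word = type / 32;

	if (word >= count)
		return false;
	return (words[word] >> (type % 32)) & 1u;
}
```

Every value defined so far, including the highest new one (NOTIFY4_CHANGE_AMASK = 16), falls in word 0, so a caller that only inspects the first bitmap word still sees all of them.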
* [PATCH v2 11/28] nfsd: allow nfsd to get a dir lease with an ignore mask
From: Jeff Layton @ 2026-04-16 17:35 UTC (permalink / raw)
To: Alexander Viro, Christian Brauner, Jan Kara, Chuck Lever,
Alexander Aring, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, NeilBrown,
Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust,
Anna Schumaker, Amir Goldstein
Cc: Calum Mackay, linux-fsdevel, linux-kernel, linux-trace-kernel,
linux-doc, linux-nfs, Jeff Layton
In-Reply-To: <20260416-dir-deleg-v2-0-851426a550f6@kernel.org>
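When requesting a directory lease, enable the FL_IGN_DIR_* bits that
correspond to the requested notification types: REMOVE_ENTRY maps to
FL_IGN_DIR_DELETE, ADD_ENTRY to FL_IGN_DIR_CREATE and RENAME_ENTRY to
FL_IGN_DIR_RENAME. The OPEN delegation path passes 0, so file
delegations set no ignore bits.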
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
fs/nfsd/nfs4state.c | 26 ++++++++++++++++++++------
1 file changed, 20 insertions(+), 6 deletions(-)
diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index 35f5c098717e..bd7e4f9cdaa5 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -6040,7 +6040,22 @@ static bool nfsd4_cb_channel_good(struct nfs4_client *clp)
return clp->cl_minorversion && clp->cl_cb_state == NFSD4_CB_UNKNOWN;
}
-static struct file_lease *nfs4_alloc_init_lease(struct nfs4_delegation *dp)
+static unsigned int
+nfsd_notify_to_ignore(u32 notify)
+{
+ unsigned int mask = 0;
+
+ if (notify & BIT(NOTIFY4_REMOVE_ENTRY))
+ mask |= FL_IGN_DIR_DELETE;
+ if (notify & BIT(NOTIFY4_ADD_ENTRY))
+ mask |= FL_IGN_DIR_CREATE;
+ if (notify & BIT(NOTIFY4_RENAME_ENTRY))
+ mask |= FL_IGN_DIR_RENAME;
+
+ return mask;
+}
+
+static struct file_lease *nfs4_alloc_init_lease(struct nfs4_delegation *dp, u32 notify)
{
struct file_lease *fl;
@@ -6048,7 +6063,7 @@ static struct file_lease *nfs4_alloc_init_lease(struct nfs4_delegation *dp)
if (!fl)
return NULL;
fl->fl_lmops = &nfsd_lease_mng_ops;
- fl->c.flc_flags = FL_DELEG;
+ fl->c.flc_flags = FL_DELEG | nfsd_notify_to_ignore(notify);
fl->c.flc_type = deleg_is_read(dp->dl_type) ? F_RDLCK : F_WRLCK;
fl->c.flc_owner = (fl_owner_t)dp;
fl->c.flc_pid = current->tgid;
@@ -6265,7 +6280,7 @@ nfs4_set_delegation(struct nfsd4_open *open, struct nfs4_ol_stateid *stp,
if (stp->st_stid.sc_export)
dp->dl_stid.sc_export = exp_get(stp->st_stid.sc_export);
- fl = nfs4_alloc_init_lease(dp);
+ fl = nfs4_alloc_init_lease(dp, 0);
if (!fl)
goto out_clnt_odstate;
@@ -9634,12 +9649,11 @@ nfsd_get_dir_deleg(struct nfsd4_compound_state *cstate,
dp->dl_stid.sc_export =
exp_get(cstate->current_fh.fh_export);
- fl = nfs4_alloc_init_lease(dp);
+ fl = nfs4_alloc_init_lease(dp, gdd->gddr_notification[0]);
if (!fl)
goto out_put_stid;
- status = kernel_setlease(nf->nf_file,
- fl->c.flc_type, &fl, NULL);
+ status = kernel_setlease(nf->nf_file, fl->c.flc_type, &fl, NULL);
if (fl)
locks_free_lease(fl);
if (status)
--
2.53.0
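
The mapping done by nfsd_notify_to_ignore() can be tried outside the kernel with the sketch below. It is not the kernel code: the `SKETCH_IGN_DIR_*` values are placeholders chosen for illustration (the real FL_IGN_DIR_* definitions are not shown in this patch), and `SKETCH_BIT()` stands in for the kernel's BIT() macro.

```c
#include <stdint.h>

/* Subset of enum notify_type4, values copied from the series. */
enum notify_type4 {
	NOTIFY4_CHANGE_DIR_ATTRS = 1,
	NOTIFY4_REMOVE_ENTRY = 2,
	NOTIFY4_ADD_ENTRY = 3,
	NOTIFY4_RENAME_ENTRY = 4,
};

/* ASSUMPTION: placeholder flag values, not the kernel's FL_IGN_DIR_* bits. */
#define SKETCH_IGN_DIR_CREATE	0x1u
#define SKETCH_IGN_DIR_DELETE	0x2u
#define SKETCH_IGN_DIR_RENAME	0x4u

/* Stand-in for the kernel's BIT() macro. */
#define SKETCH_BIT(n)		(UINT32_C(1) << (n))

/*
 * Translate the first word of the requested-notification bitmap into a mask
 * of directory events the lease should ignore. Notification types with no
 * matching ignore flag contribute nothing.
 */
static unsigned int sketch_notify_to_ignore(uint32_t notify)
{
	unsigned int mask = 0;

	if (notify & SKETCH_BIT(NOTIFY4_REMOVE_ENTRY))
		mask |= SKETCH_IGN_DIR_DELETE;
	if (notify & SKETCH_BIT(NOTIFY4_ADD_ENTRY))
		mask |= SKETCH_IGN_DIR_CREATE;
	if (notify & SKETCH_BIT(NOTIFY4_RENAME_ENTRY))
		mask |= SKETCH_IGN_DIR_RENAME;

	return mask;
}
```

Passing 0, as nfs4_set_delegation() does in the patch, yields an empty mask, so the lease flags for file delegations remain plain FL_DELEG.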