From: Gabriele Monaco <gmonaco@redhat.com>
To: wen.yang@linux.dev
Cc: linux-trace-kernel@vger.kernel.org, linux-kernel@vger.kernel.org,
Steven Rostedt <rostedt@goodmis.org>
Subject: Re: [RFC PATCH v2 08/10] rv/tlob: add tlob hybrid automaton monitor
Date: Fri, 15 May 2026 11:53:03 +0200 [thread overview]
Message-ID: <16edc9bc32425af44152892d5d7df50ee32fdb22.camel@redhat.com> (raw)
In-Reply-To: <fe5ed6a9a0a911e6ec74dc06c453786a2c4fb6d1.1778522945.git.wen.yang@linux.dev>
On Tue, 2026-05-12 at 02:24 +0800, wen.yang@linux.dev wrote:
> From: Wen Yang <wen.yang@linux.dev>
>
> Introduce tlob (task latency over budget), a per-task hybrid-automaton
> RV monitor that measures elapsed time (CLOCK_MONOTONIC) across
> a user-delimited code section and fires an error_env_tlob tracepoint
> when the elapsed time exceeds a configurable per-invocation budget.
>
> The monitor is built on RV_MON_PER_OBJ with HA_TIMER_HRTIMER. Three
> states track the scheduler status of the monitored task:
>
> running --(sleep)-------> sleeping
> running --(preempt)-----> waiting
> sleeping --(wakeup)------> waiting
> waiting --(switch_in)--> running
>
> A single clock invariant clk_elapsed < BUDGET_NS() is active in all
> three states. The budget hrtimer is rearmed on each DA transition for
> the remaining budget, keeping the absolute deadline fixed at
> start_time + BUDGET_NS.
>
> Per-task state is stored in the DA framework's hash table keyed by
> task->pid. Storage is pre-allocated by tlob_start_task() with
> GFP_KERNEL via da_create_or_get() before the scheduler tracepoints
> can fire, using DA_SKIP_AUTO_ALLOC so that no kmalloc occurs on the
> tracepoint hot path. This avoids both the kmalloc_nolock() restriction
> (requires HAVE_ALIGNED_STRUCT_PAGE) and latency issues under PREEMPT_RT.
>
> Nested monitoring is handled by nest_depth: tlob_start_task() on an
> already-monitored pid returns -EEXIST and increments nest_depth without
> disturbing the outer window; only the outermost tlob_stop_task()
> performs real cleanup.
>
> Two userspace interfaces are provided. The ioctl interface exposes
> in-process self-instrumentation via /dev/rv with TLOB_IOCTL_TRACE_START
> and TLOB_IOCTL_TRACE_STOP. The uprobe interface enables external
> monitoring of unmodified binaries via tracefs:
>
> echo "p PATH:OFFSET_START OFFSET_STOP threshold=NS" \
> > /sys/kernel/tracing/rv/monitors/tlob/monitor
>
> Violations are reported via error_env_tlob (HA clock-invariant)
> regardless of which interface triggered them.
>
> Suggested-by: Gabriele Monaco <gmonaco@redhat.com>
> Signed-off-by: Wen Yang <wen.yang@linux.dev>
> ---
[...]
> diff --git a/include/linux/rv.h b/include/linux/rv.h
> index 541ba404926a..1ea91bb3f1c2 100644
> --- a/include/linux/rv.h
> +++ b/include/linux/rv.h
> @@ -21,6 +21,13 @@
> #include <linux/list.h>
> #include <linux/types.h>
>
> +/* Forward declaration: poll_table is only needed by rv_chardev_ops::poll.
> + * Avoid pulling in <linux/poll.h> from rv.h — that header is included by
> + * sched.h, and poll.h → fs.h → rcupdate.h creates a header-ordering cycle
> + * with migrate_disable() on UML/non-SMP targets.
> + */
> +struct poll_table_struct;
> +
> /*
> * Deterministic automaton per-object variables.
> */
> @@ -158,6 +165,44 @@ int rv_register_monitor(struct rv_monitor *monitor,
> struct rv_monitor *parent);
> int rv_get_task_monitor_slot(void);
> void rv_put_task_monitor_slot(int slot);
Could you move everything that isn't strictly tlob-related into another
patch? This adds the ioctl functionality; can it stay on its own until
you wire it up with tlob?
[...]
> diff --git a/include/rv/automata.h b/include/rv/automata.h
> index 4a4eb40cf09a..ae819638d85a 100644
> --- a/include/rv/automata.h
> +++ b/include/rv/automata.h
> @@ -41,6 +41,21 @@ static char *model_get_event_name(enum events event)
> return RV_AUTOMATON_NAME.event_names[event];
> }
>
> +/*
> + * model_get_timer_event_name - label used when the HA timer fires (no event).
> + *
> + * Monitors may define MONITOR_TIMER_EVENT_NAME before including the model
> + * header to give the timer-fired violation a semantically meaningful label
> + * (e.g. "budget_exceeded" for tlob). Defaults to "none".
> + */
> +#ifndef MONITOR_TIMER_EVENT_NAME
> +#define MONITOR_TIMER_EVENT_NAME "none"
> +#endif
Why don't you just override EVENT_NONE_LBL (and, if you prefer, call it
MONITOR_TIMER_EVENT_NAME) without the need for another function?
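Something like this is what I mean (untested sketch, assuming the model header
consumes EVENT_NONE_LBL where it prints the label today):

	/* include/rv/automata.h: keep a default the monitor can override */
	#ifndef EVENT_NONE_LBL
	#define EVENT_NONE_LBL "none"
	#endif

	/* tlob.c, before including the model header */
	#define EVENT_NONE_LBL "budget_exceeded"

That way the existing accessor keeps working and no extra function is needed.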
> +static inline char *model_get_timer_event_name(void)
> +{
> + return MONITOR_TIMER_EVENT_NAME;
> +}
> +
[...]
> diff --git a/include/rv/rv_uprobe.h b/include/rv/rv_uprobe.h
> index 084cdb36a2ff..9106c5c9275e 100644
> --- a/include/rv/rv_uprobe.h
> +++ b/include/rv/rv_uprobe.h
> @@ -79,9 +79,41 @@ struct rv_uprobe *rv_uprobe_attach(const char *binpath, loff_t offset,
> * for any in-progress handler to finish, then releases the path reference
> * and frees the rv_uprobe struct. The caller's priv data is NOT freed.
> *
> + * When removing a single probe, prefer this over the three-phase API.
> * Safe to call from process context only (uprobe_unregister_sync() may
> * schedule).
> */
> void rv_uprobe_detach(struct rv_uprobe *p);
Why don't you put all this in the patch about uprobes?
>
> +/**
> + * rv_uprobe_unregister_nosync - dequeue an uprobe without waiting
> + * @p: probe to dequeue; may be NULL (no-op)
> + *
> + * Removes the uprobe from the uprobe subsystem but does NOT wait for
> + * in-flight handlers to complete. The caller must call rv_uprobe_sync()
> + * before calling rv_uprobe_free() on the same probe.
> + *
> + * Use this to batch multiple deregistrations before a single rv_uprobe_sync().
> + */
> +void rv_uprobe_unregister_nosync(struct rv_uprobe *p);
> +
> +/**
> + * rv_uprobe_sync - wait for all in-flight uprobe handlers to complete
> + *
> + * Global barrier: waits for every in-flight uprobe handler across the system
> + * to finish. Call once after a batch of rv_uprobe_unregister_nosync() calls
> + * and before any rv_uprobe_free() call.
> + */
> +void rv_uprobe_sync(void);
> +
> +/**
> + * rv_uprobe_free - release resources of a previously deregistered probe
> + * @p: probe to free; may be NULL (no-op)
> + *
> + * Releases the path reference and frees the rv_uprobe struct. Must only
> + * be called after rv_uprobe_sync() has returned. The caller's priv data
> + * is NOT freed.
> + */
> +void rv_uprobe_free(struct rv_uprobe *p);
> +
> #endif /* _RV_UPROBE_H */
> diff --git a/include/uapi/linux/rv.h b/include/uapi/linux/rv.h
> new file mode 100644
> index 000000000000..a34e5426393b
> --- /dev/null
> +++ b/include/uapi/linux/rv.h
> @@ -0,0 +1,86 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +/*
> + * UAPI definitions for Runtime Verification (RV) monitors.
> + *
> + * All RV monitors that expose an ioctl self-instrumentation interface
> + * share the magic byte RV_IOC_MAGIC ('r').
> + *
> + * Usage examples and design rationale are in:
> + * Documentation/trace/rv/monitor_tlob.rst
> + */
Same as above, this could be in a separate patch.
> +
> +#ifndef _UAPI_LINUX_RV_H
> +#define _UAPI_LINUX_RV_H
> +
> +#include <linux/ioctl.h>
> +#include <linux/types.h>
> +
[...]
> diff --git a/kernel/trace/rv/Makefile b/kernel/trace/rv/Makefile
> index f139b904bea3..8a5b5c84aff9 100644
> --- a/kernel/trace/rv/Makefile
> +++ b/kernel/trace/rv/Makefile
> @@ -2,7 +2,7 @@
>
> ccflags-y += -I $(src) # needed for trace events
>
> -obj-$(CONFIG_RV) += rv.o
> +obj-$(CONFIG_RV) += rv.o rv_chardev.o
Same here.
> obj-$(CONFIG_RV_MON_WIP) += monitors/wip/wip.o
> obj-$(CONFIG_RV_MON_WWNR) += monitors/wwnr/wwnr.o
> obj-$(CONFIG_RV_MON_SCHED) += monitors/sched/sched.o
> --- /dev/null
> +++ b/kernel/trace/rv/monitors/tlob/Kconfig
> @@ -0,0 +1,69 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +#
> +config RV_MON_TLOB
> + depends on RV
> + select RV_UPROBE
> + select HA_MON_EVENTS_ID
> + bool "tlob monitor"
> + help
> + Enable the tlob (task latency over budget) monitor. This monitor
> + tracks the elapsed time (CLOCK_MONOTONIC) of a marked code path
> + within a task (including both on-CPU and off-CPU time) and reports
> + a violation when the elapsed time exceeds a configurable budget.
> +
> + The monitor uses a three-state hybrid automaton (running, waiting,
> + sleeping) stored per object using RV_MON_PER_OBJ. A single HA
> + clock invariant (clk_elapsed < BUDGET_NS) is enforced in all three
> + states via a per-task hrtimer.
> +
> + States: running (initial, on-CPU), waiting (in runqueue, off-CPU),
> + sleeping (blocked on resource, off-CPU).
> + Key transitions:
> + running --(sleep)------> sleeping
> + running --(preempt)----> waiting
> + sleeping --(wakeup)-----> waiting
> + waiting --(switch_in)--> running
> + task_start calls da_handle_start_event() to set the initial state,
> + then arms the budget timer directly via ha_reset_clk_ns() +
> + ha_start_timer_ns(). task_stop cancels the timer synchronously via
> + ha_cancel_timer_sync() then calls da_monitor_reset().
> +
> + Two userspace interfaces are provided:
> +
> + tracefs uprobe binding (external, unmodified binaries):
> + echo "p PATH:OFFSET_START OFFSET_STOP threshold=NS" \
> + > /sys/kernel/tracing/rv/monitors/tlob/monitor
> + The uprobe at offset_start fires tlob_start_task(); the uprobe at
> + offset_stop fires tlob_stop_task(). Both are plain entry uprobes
> + so a mistyped offset cannot corrupt the call stack.
> +
> + /dev/rv ioctl (in-process self-instrumentation):
> + ioctl(fd, TLOB_IOCTL_TRACE_START, &args);
> + do_critical_work();
> + ret = ioctl(fd, TLOB_IOCTL_TRACE_STOP, NULL);
> + /* ret == -EOVERFLOW when budget exceeded */
> + Allows conditional monitoring, sub-function granularity, and
> + inline reaction to violations without polling the trace buffer.
> +
> + Up to TLOB_MAX_MONITORED tasks may be monitored simultaneously.
> +
> + Violations are always reported via the standard error_env_tlob RV
> + tracepoint regardless of which interface triggered them. The
> + tracefs interface requires only tracefs write permissions, avoiding
> + the CAP_BPF privilege needed for equivalent eBPF-based approaches.
> +
> + For further information, see:
> + Documentation/trace/rv/monitor_tlob.rst
> +
> +config TLOB_KUNIT_TEST
Do you need to add this here? Since you have a patch adding KUnit tests
to tlob, can't you put everything KUnit-related there?
That's also going to simplify things, since the RV KUnit tests aren't
stable right now.
> + tristate "KUnit tests for tlob monitor" if !KUNIT_ALL_TESTS
I couldn't build it as a module; do we need it that way?
ERROR: modpost: "sched_setscheduler_nocheck" [kernel/trace/rv/monitors/tlob/tlob_kunit.ko] undefined!
> + depends on RV_MON_TLOB && KUNIT
> + default KUNIT_ALL_TESTS
> + help
> + Enable KUnit in-kernel unit tests for the tlob RV monitor.
> +
> + Tests cover automaton state transitions, the start/stop task
> + interface, scheduler context-switch accounting, and the uprobe
> + format string parser.
> +
> + Say Y or M here to run the tlob KUnit test suite; otherwise say N.
> diff --git a/kernel/trace/rv/monitors/tlob/tlob.c b/kernel/trace/rv/monitors/tlob/tlob.c
> new file mode 100644
> index 000000000000..475e972ae9aa
> --- /dev/null
> +++ b/kernel/trace/rv/monitors/tlob/tlob.c
> @@ -0,0 +1,1307 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * tlob: task latency over budget monitor
> + *
> + * Track the elapsed wall-clock time of a marked code path and detect when
> + * a monitored task exceeds its per-task latency budget. CLOCK_MONOTONIC
> + * is used so both on-CPU and off-CPU time count toward the budget.
> + *
> + * On a budget violation, two tracepoints are emitted from the hrtimer
> + * callback: error_env_tlob signals the violation, and detail_env_tlob
> + * provides a per-state time breakdown (running_ns, waiting_ns, sleeping_ns)
> + * that pinpoints whether the overrun occurred in running, waiting, or sleeping state.
> + *
> + * The monitor uses RV_MON_PER_OBJ: per-task state (struct tlob_task_state)
> + * is stored as monitor_target in the framework's hash table.
> + *
> + * One HA clock invariant is enforced:
> + * clk_elapsed < BUDGET_NS() (active in all states)
> + *
> + * task_start uses da_handle_start_event() to set the initial state, then
> + * calls ha_reset_clk_ns() + ha_start_timer_ns() directly to initialise the
> + * clock and arm the budget timer. No synthetic event is needed.
> + * The HA timer is cancelled synchronously by ha_cancel_timer_sync() in
> + * tlob_stop_task().
> + *
> + * Copyright (C) 2026 Wen Yang <wen.yang@linux.dev>
> + */
> +#include <linux/completion.h>
> +#include <linux/hrtimer.h>
> +#include <linux/kernel.h>
> +#include <linux/ktime.h>
> +#include <linux/module.h>
> +#include <linux/init.h>
> +#include <linux/namei.h>
> +#include <linux/refcount.h>
> +#include <linux/rv.h>
> +#include <linux/sched.h>
> +#include <linux/slab.h>
> +#include <linux/tracefs.h>
> +#include <linux/uaccess.h>
> +#include <kunit/visibility.h>
> +#include <rv/instrumentation.h>
> +#include <rv/rv_uprobe.h>
> +#include <uapi/linux/rv.h>
> +#include "../../rv.h"
> +
> +#define MODULE_NAME "tlob"
> +
> +#include <trace/events/sched.h>
> +#include <rv_trace.h>
> +
> +/*
> + * Per-fd private data; one instance per open /dev/rv fd.
> + * monitoring: set while TRACE_START is active; cleared at TRACE_STOP.
> + * budget_exceeded: set by hrtimer callback; read at TRACE_STOP to report
> + * -EOVERFLOW even when cleanup was claimed by a concurrent stop_all or
> + * a task-exit handler.
> + */
> +struct tlob_fpriv {
> + struct task_struct *task;
> + bool monitoring;
> + bool budget_exceeded;
> +};
> +
> +/*
> + * Per-task latency monitoring state. One instance per monitoring window.
> + * Stored as monitor_target in da_monitor_storage; freed via call_rcu.
> + */
> +struct tlob_task_state {
> + struct task_struct *task; /* via get_task_struct */
> + u64 threshold_us; /* budget in microseconds */
> +
> +        /* 1 = cleanup claimed; ha_setup_invariants won't restart the timer. */
> + atomic_t stopping;
> +
> + /* Serialises the ns accumulators; held briefly (hardirq-safe). */
> + raw_spinlock_t entry_lock;
> + u64 running_ns; /* time in running state */
> + u64 waiting_ns; /* time in waiting state */
> + u64 sleeping_ns; /* time in sleeping state */
> + ktime_t last_ts;
> +
> +        /* store-release in TRACE_START ioctl, load-acquire in reset_notify. */
> + struct tlob_fpriv *fpriv;
> +
> +        struct rcu_head rcu;                    /* for call_rcu() teardown */
> +};
> +
> +#define RV_MON_TYPE RV_MON_PER_OBJ
> +#define HA_TIMER_TYPE HA_TIMER_HRTIMER
> +/* Pool mode: da_handle_start_event uses da_fill_empty_storage, not kmalloc. */
> +#define DA_SKIP_AUTO_ALLOC
> +
> +/* Type for da_monitor_storage.target; must be defined before the includes. */
> +typedef struct tlob_task_state *monitor_target;
> +
> +/* Forward-declared so da_monitor_reset_hook works before ha_monitor.h. */
> +static inline void tlob_reset_notify(struct da_monitor *da_mon);
> +#define da_monitor_reset_hook tlob_reset_notify
> +
> +/*
> + * When the hrtimer fires (budget elapsed), the HA framework emits
> + * error_env_tlob with this label instead of the generic "none".
> + */
> +#define MONITOR_TIMER_EVENT_NAME "budget_exceeded"
> +
> +#include "tlob.h"
> +#include <rv/ha_monitor.h>
> +
> +/*
> + * Called from da_monitor_reset() on both normal stop and hrtimer expiry.
> + * On violation (stopping==0), emits detail_env_tlob.
> + */
> +static inline void tlob_reset_notify(struct da_monitor *da_mon)
> +{
> + struct ha_monitor *ha_mon = to_ha_monitor(da_mon);
> + struct tlob_task_state *ws;
> +
> + ha_monitor_reset_env(da_mon);
> +
> + ws = ha_get_target(ha_mon);
> + if (!ws)
> + return;
> +
> + /*
> + * Emit per-state breakdown on budget violation only.
> + * stopping==0: timer callback owns this path (genuine overrun).
> + * stopping==1: normal stop claimed ownership first; skip.
> + */
> + if (!atomic_read(&ws->stopping)) {
> + unsigned int curr_state = READ_ONCE(da_mon->curr_state);
> + u64 running_ns, waiting_ns, sleeping_ns, partial_ns;
> + struct tlob_fpriv *fp;
> + unsigned long flags;
> +
> + /*
> + * Snapshot accumulators; partial_ns covers curr_state time
> + * not yet folded in (transition-out pending).
> + */
> + raw_spin_lock_irqsave(&ws->entry_lock, flags);
> + partial_ns = ktime_get_ns() - ktime_to_ns(ws->last_ts);
> +                running_ns = ws->running_ns +
> +                             (curr_state == running_tlob ? partial_ns : 0);
> +                waiting_ns = ws->waiting_ns +
> +                             (curr_state == waiting_tlob ? partial_ns : 0);
> +                sleeping_ns = ws->sleeping_ns +
> +                             (curr_state == sleeping_tlob ? partial_ns : 0);
> + raw_spin_unlock_irqrestore(&ws->entry_lock, flags);
> +
> + trace_detail_env_tlob(da_get_id(da_mon), ws->threshold_us,
> + running_ns, waiting_ns, sleeping_ns);
> +
> + /*
> +                 * Latch violation in the fd so TRACE_STOP can return -EOVERFLOW
> +                 * even if a concurrent stop_all or task-exit handler claims
> +                 * cleanup first. Pairs with smp_store_release in TRACE_START.
> + */
> + fp = smp_load_acquire(&ws->fpriv);
> + if (fp)
> + WRITE_ONCE(fp->budget_exceeded, true);
> + }
> +}
> +
> +#define BUDGET_US(ha_mon) (ha_get_target(ha_mon)->threshold_us)
> +#define BUDGET_NS(ha_mon) (BUDGET_US(ha_mon) * 1000ULL)
> +
> +/* HA constraint functions (called by ha_monitor_handle_constraint) */
> +
> +static u64 ha_get_env(struct ha_monitor *ha_mon, enum envs_tlob env, u64 time_ns)
> +{
> + if (env == clk_elapsed_tlob)
> + return ha_get_clk_ns(ha_mon, env, time_ns);
> + return ENV_INVALID_VALUE;
> +}
> +
> +static void ha_reset_env(struct ha_monitor *ha_mon, enum envs_tlob env, u64 time_ns)
> +{
> + if (env == clk_elapsed_tlob)
> + ha_reset_clk_ns(ha_mon, env, time_ns);
> +}
> +
> +/*
> + * ha_verify_invariants - clk_elapsed < BUDGET_NS must hold in all states.
> + */
> +static inline bool ha_verify_invariants(struct ha_monitor *ha_mon,
> +                                        enum states curr_state, enum events event,
> +                                        enum states next_state, u64 time_ns)
> +{
> +        if (curr_state == running_tlob)
> +                return ha_check_invariant_ns(ha_mon, clk_elapsed_tlob, time_ns);
> +        else if (curr_state == sleeping_tlob)
> +                return ha_check_invariant_ns(ha_mon, clk_elapsed_tlob, time_ns);
> +        else if (curr_state == waiting_tlob)
> +                return ha_check_invariant_ns(ha_mon, clk_elapsed_tlob, time_ns);
> +        return true;
> +}
> +
> +/*
> + * Convert invariant (deadline) to guard (reset anchor) on state transitions.
> + * Skip if uninitialised (ENV_INVALID_VALUE): the race between
> + * da_handle_start_event() and ha_reset_clk_ns() would give U64_MAX - BUDGET_NS.
> + */
> +static inline void ha_convert_inv_guard(struct ha_monitor *ha_mon,
> +                                        enum states curr_state, enum events event,
> +                                        enum states next_state, u64 time_ns)
> +{
> +        if (curr_state == next_state)
> +                return;
> +        if (curr_state == running_tlob &&
> +            !ha_monitor_env_invalid(ha_mon, clk_elapsed_tlob))
> +                ha_inv_to_guard(ha_mon, clk_elapsed_tlob, BUDGET_NS(ha_mon), time_ns);
> +        else if (curr_state == sleeping_tlob &&
> +                 !ha_monitor_env_invalid(ha_mon, clk_elapsed_tlob))
> +                ha_inv_to_guard(ha_mon, clk_elapsed_tlob, BUDGET_NS(ha_mon), time_ns);
> +        else if (curr_state == waiting_tlob &&
> +                 !ha_monitor_env_invalid(ha_mon, clk_elapsed_tlob))
> +                ha_inv_to_guard(ha_mon, clk_elapsed_tlob, BUDGET_NS(ha_mon), time_ns);
> +}
> +
> +/* No per-event guard conditions for tlob; invariants suffice. */
> +static inline bool ha_verify_guards(struct ha_monitor *ha_mon,
> +                                    enum states curr_state, enum events event,
> +                                    enum states next_state, u64 time_ns)
> +{
> + return true;
> +}
> +
> +/*
> + * Arm or cancel the HA budget timer on state transitions.
> + * Guard on stopping: sched_switch events can arrive after ha_cancel_timer_sync,
> + * restarting the timer and triggering an ODEBUG "activate active" splat.
> + */
> +static inline void ha_setup_invariants(struct ha_monitor *ha_mon,
> +                                       enum states curr_state, enum events event,
> +                                       enum states next_state, u64 time_ns)
> +{
> +        if (next_state == curr_state)
> +                return;
> +        if (next_state == running_tlob) {
> +                if (!atomic_read_acquire(&ha_get_target(ha_mon)->stopping))
> +                        ha_start_timer_ns(ha_mon, clk_elapsed_tlob, BUDGET_NS(ha_mon), time_ns);
> +        } else if (next_state == sleeping_tlob) {
> +                if (!atomic_read_acquire(&ha_get_target(ha_mon)->stopping))
> +                        ha_start_timer_ns(ha_mon, clk_elapsed_tlob, BUDGET_NS(ha_mon), time_ns);
> +        } else if (next_state == waiting_tlob) {
> +                if (!atomic_read_acquire(&ha_get_target(ha_mon)->stopping))
> +                        ha_start_timer_ns(ha_mon, clk_elapsed_tlob, BUDGET_NS(ha_mon), time_ns);
> +        } else if (curr_state == running_tlob)
> +                ha_cancel_timer(ha_mon);
> +        else if (curr_state == waiting_tlob)
> +                ha_cancel_timer(ha_mon);
> +        else if (curr_state == sleeping_tlob)
> +                ha_cancel_timer(ha_mon);
> +}
> +
> +static bool ha_verify_constraint(struct ha_monitor *ha_mon,
> + enum states curr_state, enum events event,
> + enum states next_state, u64 time_ns)
> +{
> +        if (!ha_verify_invariants(ha_mon, curr_state, event, next_state, time_ns))
> + return false;
> +
> + ha_convert_inv_guard(ha_mon, curr_state, event, next_state, time_ns);
> +
> +        if (!ha_verify_guards(ha_mon, curr_state, event, next_state, time_ns))
> + return false;
> +
> + ha_setup_invariants(ha_mon, curr_state, event, next_state, time_ns);
> +
> + return true;
> +}
> +
> +static struct kmem_cache *tlob_state_cache;
> +
> +static atomic_t tlob_num_monitored = ATOMIC_INIT(0);
> +
> +/* Uprobe binding list; protected by tlob_uprobe_mutex. */
> +static LIST_HEAD(tlob_uprobe_list);
> +static DEFINE_MUTEX(tlob_uprobe_mutex);
> +
> +/*
> + * Serialises duplicate-check + da_create_or_get() to prevent two concurrent
> + * callers for the same pid from both inserting into the hash table.
> + */
> +static DEFINE_MUTEX(tlob_start_mutex);
> +
> +/*
> + * Counts open /dev/rv fds plus one synthetic ref held while enabled.
> + * __tlob_destroy_monitor() drops the synthetic ref and waits for zero
> + * before teardown, preventing kmem_cache_zalloc() on a destroyed cache.
> + */
> +static refcount_t tlob_fd_refcount = REFCOUNT_INIT(0);
> +static DECLARE_COMPLETION(tlob_fd_released);
> +
> +/* Per-uprobe-binding state: a start + stop probe pair for one binary region. */
> +struct tlob_uprobe_binding {
> + struct list_head list;
> + u64 threshold_us;
> + char binpath[TLOB_MAX_PATH];
> + loff_t offset_start;
> + loff_t offset_stop;
> + struct rv_uprobe *start_probe;
> + struct rv_uprobe *stop_probe;
> +};
> +
> +/* RCU callback: free the slab once no readers remain. */
> +static void tlob_free_rcu(struct rcu_head *head)
> +{
> + struct tlob_task_state *ws =
> + container_of(head, struct tlob_task_state, rcu);
> + kmem_cache_free(tlob_state_cache, ws);
> +}
> +
> +/*
> + * handle_sched_switch - advance the DA on every context switch.
> + *
> + * Generates three DA events:
> + * prev, prev_state != 0 -> sleep_tlob (running -> sleeping)
> + * prev, prev_state == 0 -> preempt_tlob (running -> waiting)
> + * next -> switch_in_tlob (waiting -> running)
> + */
> +static void handle_sched_switch(void *data, bool preempt_unused,
> + struct task_struct *prev,
> + struct task_struct *next,
> + unsigned int prev_state)
> +{
> + struct tlob_task_state *ws;
> + unsigned long flags;
> + bool do_prev = false, do_next = false;
> + bool prev_preempted;
> + ktime_t now;
> +
Perhaps keep the handler simpler by moving this reporting to a helper
function and using guard(rcu)() there.
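Something along these lines, just as a sketch (names are mine, untested):

	/* Fold the time spent in the current state into prev's accumulator;
	 * returns false when prev is not monitored.
	 */
	static bool tlob_account_prev(struct task_struct *prev, bool *preempted,
				      unsigned int prev_state)
	{
		struct tlob_task_state *ws;
		unsigned long flags;
		ktime_t now;

		guard(rcu)();

		ws = da_get_target_by_id(prev->pid);
		if (!ws)
			return false;

		raw_spin_lock_irqsave(&ws->entry_lock, flags);
		now = ktime_get();
		ws->running_ns += ktime_to_ns(ktime_sub(now, ws->last_ts));
		ws->last_ts = now;
		raw_spin_unlock_irqrestore(&ws->entry_lock, flags);

		*preempted = (prev_state == 0);
		return true;
	}

With a matching helper for next, handle_sched_switch() shrinks to two
helper calls plus the da_handle_event() calls, and no explicit
rcu_read_unlock() on any path.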
> + rcu_read_lock();
> +
> + ws = da_get_target_by_id(prev->pid);
> + if (ws) {
> + raw_spin_lock_irqsave(&ws->entry_lock, flags);
> + now = ktime_get();
> + ws->running_ns += ktime_to_ns(ktime_sub(now, ws->last_ts));
> + ws->last_ts = now;
> +                /* prev_state == 0: TASK_RUNNING (preempted); != 0: sleeping. */
> + prev_preempted = (prev_state == 0);
> + do_prev = true;
> + raw_spin_unlock_irqrestore(&ws->entry_lock, flags);
> + }
> +
> + ws = da_get_target_by_id(next->pid);
> + if (ws) {
> + raw_spin_lock_irqsave(&ws->entry_lock, flags);
> + now = ktime_get();
> + ws->waiting_ns += ktime_to_ns(ktime_sub(now, ws->last_ts));
> + ws->last_ts = now;
> + do_next = true;
> + raw_spin_unlock_irqrestore(&ws->entry_lock, flags);
> + }
> +
> + rcu_read_unlock();
> +
You probably don't need these. da_handle_event should skip tasks without
a monitor.
> + if (do_prev)
> + da_handle_event(prev->pid, NULL,
> + prev_preempted ? preempt_tlob : sleep_tlob);
> + if (do_next)
> + da_handle_event(next->pid, NULL, switch_in_tlob);
> +}
> +
> +/*
> + * handle_sched_wakeup - sleeping -> waiting transition.
> + *
> + * try_to_wake_up() skips TASK_RUNNING tasks, so this never fires for a
> + * task already in running or waiting state.
> + */
> +static void handle_sched_wakeup(void *data, struct task_struct *p)
> +{
> + struct tlob_task_state *ws;
> + unsigned long flags;
> + bool found = false;
> +
Same as above to keep the handler simple.
> + rcu_read_lock();
> + ws = da_get_target_by_id(p->pid);
> + if (ws) {
> + ktime_t now = ktime_get();
> +
> + raw_spin_lock_irqsave(&ws->entry_lock, flags);
> + ws->sleeping_ns += ktime_to_ns(ktime_sub(now, ws->last_ts));
> + ws->last_ts = now;
> + raw_spin_unlock_irqrestore(&ws->entry_lock, flags);
> + found = true;
> + }
> + rcu_read_unlock();
> +
> + if (found)
You probably don't need this. da_handle_event should skip tasks without
a monitor.
> + da_handle_event(p->pid, NULL, wakeup_tlob);
> +}
> +
> +/*
> + * handle_sched_process_exit - clean up if a task exits without TRACE_STOP.
> + *
> + * Called in do_exit() context; the task still has a valid pid here.
> + */
> +static void handle_sched_process_exit(void *data, struct task_struct *p,
> + bool group_dead)
> +{
> + struct tlob_task_state *ws;
> + bool found = false;
> +
> + rcu_read_lock();
> + ws = da_get_target_by_id(p->pid);
> + found = !!ws;
> + rcu_read_unlock();
> +
> + if (found)
You can skip all this here.
> + tlob_stop_task(p);
> +}
> +
> +
> +
> +/**
> + * tlob_start_task - begin monitoring @task with budget @threshold_us us.
> + * @task: Task to monitor; may be current or another task.
> + * @threshold_us: Latency budget in microseconds (wall-clock; running + waiting + sleeping). > 0.
> + *
> + * Returns 0, -ENODEV, -EALREADY, -ENOSPC, or -ENOMEM.
> + */
> +int tlob_start_task(struct task_struct *task, u64 threshold_us)
> +{
> + struct tlob_task_state *ws_existing;
> + struct tlob_task_state *ws;
> + struct da_monitor *da_mon;
> + struct ha_monitor *ha_mon;
> + u64 now_ns;
> + int ret;
> +
> + if (!da_monitor_enabled())
> + return -ENODEV;
> +
> + if (threshold_us == 0)
> + return -ERANGE;
> +
> + /* Serialise duplicate-check + da_create_or_get for the same pid. */
> + guard(mutex)(&tlob_start_mutex);
> +
> + rcu_read_lock();
That should be a scoped_guard(rcu); definitely use guards if you have
return paths, since the compiler is going to clean up (unlock) for you.
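I.e. the duplicate check becomes (a return from inside the scope still
runs the unlock):

	scoped_guard(rcu) {
		if (da_get_target_by_id(task->pid))
			return -EALREADY;
	}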
> + ws_existing = da_get_target_by_id(task->pid);
> + if (ws_existing) {
> + rcu_read_unlock();
> + return -EALREADY;
> + }
> + rcu_read_unlock();
> +
> + ws = kmem_cache_zalloc(tlob_state_cache, GFP_KERNEL);
> + if (!ws)
> + return -ENOMEM;
> +
> + ws->task = task;
> + get_task_struct(task);
> + ws->threshold_us = threshold_us;
> + ws->last_ts = ktime_get();
> + raw_spin_lock_init(&ws->entry_lock);
> +
> + /* Claim a pool slot (no kmalloc; DA_SKIP_AUTO_ALLOC + prealloc). */
> + ret = da_create_or_get(task->pid, ws);
> + if (ret) {
> + put_task_struct(task);
> + kmem_cache_free(tlob_state_cache, ws);
> + return ret;
> + }
> +
> + atomic_inc(&tlob_num_monitored);
> +
> + /* Hold RCU across handle + timer setup to keep da_mon valid. */
> + rcu_read_lock();
Same here about guards.
Sadly there doesn't seem to be a cleanup helper for kmem_cache_free; it
would be worth adding one. You also have a lot of other things to do
here, so it isn't a big deal.
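For reference, a hypothetical helper built on cleanup.h could look like
this (untested, the name is just a suggestion):

	/* Pair kmem_cache_zalloc() with an automatic kmem_cache_free(). */
	DEFINE_FREE(tlob_state, struct tlob_task_state *,
		    if (_T) kmem_cache_free(tlob_state_cache, _T))

	/* in tlob_start_task(): error returns free ws automatically */
	struct tlob_task_state *ws __free(tlob_state) =
		kmem_cache_zalloc(tlob_state_cache, GFP_KERNEL);
	if (!ws)
		return -ENOMEM;

	/* on success, take the pointer back out of the guard's hands */
	struct tlob_task_state *kept = no_free_ptr(ws);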
> + da_handle_start_event(task->pid, ws, switch_in_tlob);
> + da_mon = da_get_monitor(task->pid, NULL);
> + if (unlikely(!da_mon)) {
> +                /* Slot registered; missing da_mon means concurrent destroy. */
> + rcu_read_unlock();
> + da_destroy_storage(task->pid);
> + atomic_dec(&tlob_num_monitored);
> + put_task_struct(task);
> + kmem_cache_free(tlob_state_cache, ws);
> + return -ENOMEM;
> + }
> + ha_mon = to_ha_monitor(da_mon);
> + now_ns = ktime_get_ns();
> + ha_reset_env(ha_mon, clk_elapsed_tlob, now_ns);
> +        ha_start_timer_ns(ha_mon, clk_elapsed_tlob, BUDGET_NS(ha_mon), now_ns);
> + rcu_read_unlock();
> +
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(tlob_start_task);
> +
> +/**
> + * tlob_stop_task - stop monitoring @task.
> + * @task: Task to stop.
> + *
> + * CAS on ws->stopping (0->1) under RCU claims cleanup ownership;
> + * the winner cancels the timer synchronously and frees all resources.
> + *
> + * Returns 0, -EOVERFLOW (budget exceeded), -ESRCH (not monitored),
> + * or -EAGAIN (concurrent caller claimed cleanup).
> + */
> +int tlob_stop_task(struct task_struct *task)
> +{
> + struct da_monitor *da_mon;
> + struct ha_monitor *ha_mon;
> + struct tlob_task_state *ws;
> + bool budget_exceeded;
> +
> + rcu_read_lock();
> + ws = da_get_target_by_id(task->pid);
> + if (!ws) {
> + rcu_read_unlock();
> + return -ESRCH;
> + }
> +
> + da_mon = da_get_monitor(task->pid, NULL);
> + if (unlikely(!da_mon)) {
> + /* ws in hash but da_mon gone; internal inconsistency. */
> + rcu_read_unlock();
> + WARN_ON_ONCE(1);
> + return -ESRCH;
> + }
> +
> + ha_mon = to_ha_monitor(da_mon);
> +
> + /*
> +         * CAS (0->1) claims cleanup ownership under RCU (ws guaranteed valid).
> + * _release pairs with atomic_read_acquire in ha_setup_invariants.
> + */
> + if (atomic_cmpxchg_release(&ws->stopping, 0, 1) != 0) {
> + rcu_read_unlock();
> + return -EAGAIN;
> + }
> +
> + rcu_read_unlock();
> +
> + /* Wait for in-flight timer callback before reading da_monitoring. */
> + ha_cancel_timer_sync(ha_mon);
> +
> + /* Timer fired first -> budget exceeded; otherwise reset normally. */
> + rcu_read_lock();
> + budget_exceeded = !da_monitoring(da_mon);
> + if (!budget_exceeded)
> + da_monitor_reset(da_mon);
> + rcu_read_unlock();
> + da_destroy_storage(task->pid);
> + atomic_dec(&tlob_num_monitored);
> +
> + put_task_struct(ws->task);
> + call_rcu(&ws->rcu, tlob_free_rcu);
> + return budget_exceeded ? -EOVERFLOW : 0;
> +}
> +EXPORT_SYMBOL_GPL(tlob_stop_task);
> +
> +static void tlob_stop_all(void)
> +{
All this function does should be done by da_monitor_destroy(). It does
have some concurrency issues I'm trying to fix, but there's no reason
not to use it.
We could add a way to pass some additional deallocation for all the
other cleanup you're doing on each storage.
Something like a da_extra_cleanup() you can define as whatever you
need, which gets called in all per-obj destruction paths.
In general, let's try to use/extend as much as possible in the RV API
rather than re-implementing things.
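Roughly what I'm thinking of (da_extra_cleanup() doesn't exist yet, the
names are just a proposal):

	/* tlob.c, defined before including the DA headers */
	static void tlob_extra_cleanup(struct da_monitor *da_mon)
	{
		struct tlob_task_state *ws = da_get_target(da_mon);

		if (!ws)
			return;
		put_task_struct(ws->task);
		call_rcu(&ws->rcu, tlob_free_rcu);
	}
	#define da_extra_cleanup tlob_extra_cleanup

The framework would invoke it from every per-obj destruction path, so
tlob_stop_all() would collapse into little more than a
da_monitor_destroy() call.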
> + struct da_monitor_storage *ms;
> + pid_t pids[TLOB_MAX_MONITORED];
> + int bkt, n = 0;
> +
> + /* Snapshot pids under RCU; re-derive ws under a fresh lock below. */
> + rcu_read_lock();
> + hash_for_each_rcu(da_monitor_ht, bkt, ms, node) {
> + if (ms->target && n < TLOB_MAX_MONITORED)
> + pids[n++] = ms->id;
> + }
> + rcu_read_unlock();
> +
> + for (int i = 0; i < n; i++) {
> + pid_t pid = pids[i];
> + struct da_monitor *da_mon;
> + struct ha_monitor *ha_mon;
> + struct tlob_task_state *ws;
> +
> + rcu_read_lock();
> + da_mon = da_get_monitor(pid, NULL);
> + if (!da_mon) {
> + /* Cleaned up by tlob_stop_task or exit handler. */
> + rcu_read_unlock();
> + continue;
> + }
> +
> + ws = da_get_target(da_mon);
> + ha_mon = to_ha_monitor(da_mon);
> +
> +                /* CAS (0->1) claims ownership; skip if another caller won. */
> + if (atomic_cmpxchg_release(&ws->stopping, 0, 1) != 0) {
> + rcu_read_unlock();
> + continue;
> + }
> + rcu_read_unlock();
> +
> + ha_cancel_timer_sync(ha_mon);
> +
> + scoped_guard(rcu) {
> + da_monitor_reset(da_mon);
> + }
> + da_destroy_storage(pid);
> + atomic_dec(&tlob_num_monitored);
> + put_task_struct(ws->task);
> + call_rcu(&ws->rcu, tlob_free_rcu);
> + }
> +}
> +
> +static int tlob_uprobe_entry_handler(struct rv_uprobe *p, struct pt_regs *regs,
> + __u64 *data)
> +{
> + struct tlob_uprobe_binding *b = p->priv;
> +
> + tlob_start_task(current, b->threshold_us);
> + return 0;
> +}
> +
> +static int tlob_uprobe_stop_handler(struct rv_uprobe *p, struct pt_regs *regs,
> + __u64 *data)
> +{
> + tlob_stop_task(current);
> + return 0;
> +}
> +
> +/*
> + * Register start + stop entry uprobes for a binding.
> + * Called with tlob_uprobe_mutex held.
> + */
> +static int tlob_add_uprobe(u64 threshold_us, const char *binpath,
> + loff_t offset_start, loff_t offset_stop)
> +{
> + struct tlob_uprobe_binding *b, *tmp_b;
> + char pathbuf[TLOB_MAX_PATH];
> + struct path path;
> + char *canon;
> + int ret;
> +
> + if (binpath[0] != '/')
> + return -EINVAL;
> +
> + b = kzalloc_obj(*b, GFP_KERNEL);
> + if (!b)
> + return -ENOMEM;
> +
> + b->threshold_us = threshold_us;
> + b->offset_start = offset_start;
> + b->offset_stop = offset_stop;
> +
> + ret = kern_path(binpath, LOOKUP_FOLLOW, &path);
> + if (ret)
> + goto err_free;
> +
> + if (!d_is_reg(path.dentry)) {
> + ret = -EINVAL;
> + goto err_path;
> + }
> +
> + /* Reject duplicate start offset for the same binary. */
> + list_for_each_entry(tmp_b, &tlob_uprobe_list, list) {
> + if (tmp_b->offset_start == offset_start &&
> + tmp_b->start_probe->path.dentry == path.dentry) {
> + ret = -EEXIST;
> + goto err_path;
> + }
> + }
> +
> + canon = d_path(&path, pathbuf, sizeof(pathbuf));
> + if (IS_ERR(canon)) {
> + ret = PTR_ERR(canon);
> + goto err_path;
> + }
> + strscpy(b->binpath, canon, sizeof(b->binpath));
> +
> + /* Both probes share b (priv) and path; attach_path refs path itself. */
> + b->start_probe = rv_uprobe_attach_path(&path, offset_start,
> + tlob_uprobe_entry_handler, NULL, b);
> + if (IS_ERR(b->start_probe)) {
> + ret = PTR_ERR(b->start_probe);
> + b->start_probe = NULL;
> + goto err_path;
> + }
> +
> + b->stop_probe = rv_uprobe_attach_path(&path, offset_stop,
> + tlob_uprobe_stop_handler, NULL, b);
> + if (IS_ERR(b->stop_probe)) {
> + ret = PTR_ERR(b->stop_probe);
> + b->stop_probe = NULL;
> + goto err_start;
> + }
> +
> + path_put(&path);
> + list_add_tail(&b->list, &tlob_uprobe_list);
> + return 0;
> +
> +err_start:
> + rv_uprobe_detach(b->start_probe);
> +err_path:
> + path_put(&path);
> +err_free:
> + kfree(b);
> + return ret;
> +}
> +
> +static int tlob_remove_uprobe_by_key(loff_t offset_start, const char *binpath)
> +{
> + struct tlob_uprobe_binding *b, *tmp;
> + struct path remove_path;
> + int ret;
> +
> + ret = kern_path(binpath, LOOKUP_FOLLOW, &remove_path);
> + if (ret)
> + return ret;
> +
> + ret = -ENOENT;
> + list_for_each_entry_safe(b, tmp, &tlob_uprobe_list, list) {
> + if (b->offset_start != offset_start)
> + continue;
> + if (b->start_probe->path.dentry != remove_path.dentry)
> + continue;
> + list_del(&b->list);
> + rv_uprobe_detach(b->start_probe);
> + rv_uprobe_detach(b->stop_probe);
> + kfree(b);
> + ret = 0;
> + break;
> + }
> +
> + path_put(&remove_path);
> + return ret;
> +}
> +
> +static void tlob_remove_all_uprobes(void)
> +{
> + struct tlob_uprobe_binding *b, *tmp;
> + LIST_HEAD(pending);
> +
> + mutex_lock(&tlob_uprobe_mutex);
> + list_for_each_entry_safe(b, tmp, &tlob_uprobe_list, list) {
> + list_move(&b->list, &pending);
> + rv_uprobe_unregister_nosync(b->start_probe);
> + rv_uprobe_unregister_nosync(b->stop_probe);
> + }
> + mutex_unlock(&tlob_uprobe_mutex);
> +
> + if (list_empty(&pending))
> + return;
> +
> + /*
> + * One global barrier for all probes dequeued above; no new handlers
> + * for any of them can fire after this returns.
> + */
> + rv_uprobe_sync();
> +
> + list_for_each_entry_safe(b, tmp, &pending, list) {
> + rv_uprobe_free(b->start_probe);
> + rv_uprobe_free(b->stop_probe);
> + kfree(b);
> + }
> +}
> +
> +static ssize_t tlob_monitor_read(struct file *file,
> + char __user *ubuf,
> + size_t count, loff_t *ppos)
> +{
> + const int line_sz = TLOB_MAX_PATH + 128;
> + struct tlob_uprobe_binding *b;
> + char *buf, *p;
> + int n = 0, buf_sz, pos = 0;
> + ssize_t ret;
> +
> + mutex_lock(&tlob_uprobe_mutex);
> + list_for_each_entry(b, &tlob_uprobe_list, list)
> + n++;
> +
> + buf_sz = (n ? n : 1) * line_sz + 1;
> + buf = kmalloc(buf_sz, GFP_KERNEL);
> + if (!buf) {
> + mutex_unlock(&tlob_uprobe_mutex);
> + return -ENOMEM;
> + }
> +
> + list_for_each_entry(b, &tlob_uprobe_list, list) {
> + p = b->binpath;
> + pos += scnprintf(buf + pos, buf_sz - pos,
> + "p %s:0x%llx 0x%llx threshold=%llu\n",
> + p,
> + (unsigned long long)b->offset_start,
> + (unsigned long long)b->offset_stop,
> + b->threshold_us);
> + }
> + mutex_unlock(&tlob_uprobe_mutex);
> +
> + ret = simple_read_from_buffer(ubuf, count, ppos, buf, pos);
> + kfree(buf);
> + return ret;
> +}
> +
> +/*
> + * Parse "p PATH:OFFSET_START OFFSET_STOP threshold=US".
> + * PATH may contain ':'; the last ':' separates path from offset.
> + * Returns 0 or -EINVAL.
> + */
> +static int tlob_parse_uprobe_line(char *buf, u64 *thr_out,
> + char **path_out,
> + loff_t *start_out, loff_t *stop_out)
> +{
> + unsigned long long thr = 0, stop_val = 0;
> + long long start_val;
> + char *p, *path_token, *token, *colon;
> + bool got_stop = false, got_thr = false;
> + int n;
> +
> + /* Must start with "p " */
> + if (buf[0] != 'p' || buf[1] != ' ')
> + return -EINVAL;
> +
> + p = buf + 2;
> + while (*p == ' ')
> + p++;
> +
> + /* First space-delimited token is PATH:OFFSET_START */
> + path_token = strsep(&p, " \t");
> + if (!path_token || !*path_token)
> + return -EINVAL;
> +
> + /* Split at last ':' to handle paths that contain ':'. */
> + colon = strrchr(path_token, ':');
> + if (!colon || colon - path_token < 2)
> + return -EINVAL;
> + *colon = '\0';
> +
> + if (path_token[0] != '/')
> + return -EINVAL;
> +
> + n = 0;
> + if (sscanf(colon + 1, "%lli%n", &start_val, &n) != 1 || n == 0)
> + return -EINVAL;
> + if (start_val < 0)
> + return -EINVAL;
> +
> + /* Remaining tokens: OFFSET_STOP threshold=US */
> + while (p && (token = strsep(&p, " \t")) != NULL) {
> + if (!*token)
> + continue;
> + if (strncmp(token, "threshold=", 10) == 0) {
> + if (kstrtoull(token + 10, 0, &thr))
> + return -EINVAL;
> + got_thr = true;
> + } else if (!got_stop) {
> + long long sv;
> +
> + n = 0;
> + if (sscanf(token, "%lli%n", &sv, &n) != 1 || n == 0)
> + return -EINVAL;
> + if (sv < 0)
> + return -EINVAL;
> + stop_val = (unsigned long long)sv;
> + got_stop = true;
> + } else {
> + return -EINVAL;
> + }
> + }
> +
> + if (!got_stop || !got_thr || thr == 0)
> + return -EINVAL;
> + if (start_val == (long long)stop_val)
> + return -EINVAL;
> +
> + *thr_out = thr;
> + *path_out = path_token;
> + *start_out = (loff_t)start_val;
> + *stop_out = (loff_t)stop_val;
> + return 0;
> +}
> +
> +/* Parse "-PATH:OFFSET_START" (ftrace uprobe_events removal convention). */
> +static int tlob_parse_remove_line(char *buf, char **path_out, loff_t *start_out)
> +{
> + char *binpath, *colon;
> + long long off;
> + int n = 0;
> +
> + if (buf[0] != '-')
> + return -EINVAL;
> + binpath = buf + 1;
> + if (binpath[0] != '/')
> + return -EINVAL;
> + colon = strrchr(binpath, ':');
> + if (!colon || colon - binpath < 2)
> + return -EINVAL;
> + *colon = '\0';
> + if (sscanf(colon + 1, "%lli%n", &off, &n) != 1 || n == 0)
> + return -EINVAL;
> + *path_out = binpath;
> + *start_out = (loff_t)off;
> + return 0;
> +}
> +
> +VISIBLE_IF_KUNIT int tlob_create_or_delete_uprobe(char *buf)
> +{
> + loff_t offset_start, offset_stop;
> + u64 threshold_us;
> + char *binpath;
> + int ret;
> +
> + if (buf[0] == '-') {
> + ret = tlob_parse_remove_line(buf, &binpath, &offset_start);
> + if (ret)
> + return ret;
> + mutex_lock(&tlob_uprobe_mutex);
> + ret = tlob_remove_uprobe_by_key(offset_start, binpath);
> + mutex_unlock(&tlob_uprobe_mutex);
> + return ret;
> + }
> + ret = tlob_parse_uprobe_line(buf, &threshold_us, &binpath,
> + &offset_start, &offset_stop);
> + if (ret)
> + return ret;
> + mutex_lock(&tlob_uprobe_mutex);
> + ret = tlob_add_uprobe(threshold_us, binpath, offset_start,
> + offset_stop);
> + mutex_unlock(&tlob_uprobe_mutex);
> + return ret;
> +}
> +EXPORT_SYMBOL_IF_KUNIT(tlob_create_or_delete_uprobe);
> +
> +static ssize_t tlob_monitor_write(struct file *file,
> + const char __user *ubuf,
> + size_t count, loff_t *ppos)
> +{
> + char buf[TLOB_MAX_PATH + 128];
> +
> + if (count >= sizeof(buf))
> + return -EINVAL;
> + if (copy_from_user(buf, ubuf, count))
> + return -EFAULT;
> + buf[count] = '\0';
> + if (count > 0 && buf[count - 1] == '\n')
> + buf[count - 1] = '\0';
> + return tlob_create_or_delete_uprobe(buf) ?: (ssize_t)count;
> +}
> +
> +static const struct file_operations tlob_monitor_fops = {
> + .open = simple_open,
> + .read = tlob_monitor_read,
> + .write = tlob_monitor_write,
> + .llseek = noop_llseek,
> +};
> +
> +static int __tlob_init_monitor(void)
> +{
> + int retval;
> +
> + tlob_state_cache = kmem_cache_create("tlob_task_state",
> + sizeof(struct tlob_task_state),
> + 0, 0, NULL);
> + if (!tlob_state_cache)
> + return -ENOMEM;
> +
> + atomic_set(&tlob_num_monitored, 0);
> +
> + retval = da_monitor_init_prealloc(TLOB_MAX_MONITORED);
> + if (retval) {
> + kmem_cache_destroy(tlob_state_cache);
> + tlob_state_cache = NULL;
> + return retval;
> + }
> +
> + /* Synthetic reference: held while the monitor is enabled. */
> + reinit_completion(&tlob_fd_released);
> + refcount_set(&tlob_fd_refcount, 1);
> +
> + rv_this.enabled = 1;
> + return 0;
> +}
> +
> +static void __tlob_destroy_monitor(void)
> +{
> + rv_this.enabled = 0;
> + /*
> + * Remove uprobes first so stop_task can't race with tlob_stop_all().
> + * rv_uprobe_sync() inside ensures all in-flight handlers have finished.
> + */
> + tlob_remove_all_uprobes();
> + tlob_stop_all();
> + /* Wait for tlob_free_rcu and da_pool_return_cb before pool teardown. */
> + synchronize_rcu();
> +
> + /*
> + * Drop the synthetic ref and wait for all open fds to close before
> + * teardown; prevents kmem_cache_zalloc() on the destroyed cache.
> + */
> + if (!refcount_dec_and_test(&tlob_fd_refcount))
> + wait_for_completion(&tlob_fd_released);
> +
> + da_monitor_destroy();
> + kmem_cache_destroy(tlob_state_cache);
> + tlob_state_cache = NULL;
> +}
> +
> +/* KUnit wrappers that acquire rv_interface_lock around monitor init/destroy. */
> +#if IS_ENABLED(CONFIG_KUNIT)
> +int tlob_init_monitor(void)
> +{
> + int ret;
> +
> + mutex_lock(&rv_interface_lock);
> + ret = __tlob_init_monitor();
> + mutex_unlock(&rv_interface_lock);
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(tlob_init_monitor);
> +
> +void tlob_destroy_monitor(void)
> +{
> + mutex_lock(&rv_interface_lock);
> + __tlob_destroy_monitor();
> + mutex_unlock(&rv_interface_lock);
> +}
> +EXPORT_SYMBOL_GPL(tlob_destroy_monitor);
> +
> +int tlob_num_monitored_read(void)
> +{
> + return atomic_read(&tlob_num_monitored);
> +}
> +EXPORT_SYMBOL_IF_KUNIT(tlob_num_monitored_read);
> +
> +/* Tracepoint probes for KUnit; rv_trace.h is only included here. */
> +static struct tlob_captured_event tlob_kunit_last_event;
> +static struct tlob_captured_error_env tlob_kunit_last_error_env;
> +static atomic_t tlob_kunit_event_cnt = ATOMIC_INIT(0);
> +static atomic_t tlob_kunit_error_env_cnt = ATOMIC_INIT(0);
> +
> +static void tlob_kunit_event_probe(void *data, int id, char *state, char *event,
> + char *next_state, bool final_state)
> +{
> + tlob_kunit_last_event.id = id;
> + strscpy(tlob_kunit_last_event.state, state,
> + sizeof(tlob_kunit_last_event.state));
> + strscpy(tlob_kunit_last_event.event, event,
> + sizeof(tlob_kunit_last_event.event));
> + strscpy(tlob_kunit_last_event.next_state, next_state,
> + sizeof(tlob_kunit_last_event.next_state));
> + tlob_kunit_last_event.final_state = final_state;
> + atomic_inc(&tlob_kunit_event_cnt);
> +}
> +
> +static void tlob_kunit_error_env_probe(void *data, int id, char *state,
> + char *event, char *env)
> +{
> + tlob_kunit_last_error_env.id = id;
> + strscpy(tlob_kunit_last_error_env.state, state,
> + sizeof(tlob_kunit_last_error_env.state));
> + strscpy(tlob_kunit_last_error_env.event, event,
> + sizeof(tlob_kunit_last_error_env.event));
> + strscpy(tlob_kunit_last_error_env.env, env,
> + sizeof(tlob_kunit_last_error_env.env));
> + atomic_inc(&tlob_kunit_error_env_cnt);
> +}
> +
> +int tlob_register_kunit_probes(void)
> +{
> + int ret;
> +
> + atomic_set(&tlob_kunit_event_cnt, 0);
> + atomic_set(&tlob_kunit_error_env_cnt, 0);
> +
> + ret = register_trace_event_tlob(tlob_kunit_event_probe, NULL);
> + if (ret)
> + return ret;
> + ret = register_trace_error_env_tlob(tlob_kunit_error_env_probe, NULL);
> + if (ret) {
> + unregister_trace_event_tlob(tlob_kunit_event_probe, NULL);
> + return ret;
> + }
> + return 0;
> +}
> +EXPORT_SYMBOL_IF_KUNIT(tlob_register_kunit_probes);
> +
> +void tlob_unregister_kunit_probes(void)
> +{
> + unregister_trace_event_tlob(tlob_kunit_event_probe, NULL);
> + unregister_trace_error_env_tlob(tlob_kunit_error_env_probe, NULL);
> + tracepoint_synchronize_unregister();
> +}
> +EXPORT_SYMBOL_IF_KUNIT(tlob_unregister_kunit_probes);
> +
> +int tlob_event_count_read(void)
> +{
> + return atomic_read(&tlob_kunit_event_cnt);
> +}
> +EXPORT_SYMBOL_IF_KUNIT(tlob_event_count_read);
> +
> +void tlob_event_count_reset(void)
> +{
> + atomic_set(&tlob_kunit_event_cnt, 0);
> +}
> +EXPORT_SYMBOL_IF_KUNIT(tlob_event_count_reset);
> +
> +int tlob_error_env_count_read(void)
> +{
> + return atomic_read(&tlob_kunit_error_env_cnt);
> +}
> +EXPORT_SYMBOL_IF_KUNIT(tlob_error_env_count_read);
> +
> +void tlob_error_env_count_reset(void)
> +{
> + atomic_set(&tlob_kunit_error_env_cnt, 0);
> +}
> +EXPORT_SYMBOL_IF_KUNIT(tlob_error_env_count_reset);
> +
> +const struct tlob_captured_event *tlob_last_event_read(void)
> +{
> + return &tlob_kunit_last_event;
> +}
> +EXPORT_SYMBOL_IF_KUNIT(tlob_last_event_read);
> +
> +const struct tlob_captured_error_env *tlob_last_error_env_read(void)
> +{
> + return &tlob_kunit_last_error_env;
> +}
> +EXPORT_SYMBOL_IF_KUNIT(tlob_last_error_env_read);
> +
> +#endif /* CONFIG_KUNIT */
> +
> +VISIBLE_IF_KUNIT int tlob_enable_hooks(void)
> +{
> + rv_attach_trace_probe("tlob", sched_switch, handle_sched_switch);
> + rv_attach_trace_probe("tlob", sched_wakeup, handle_sched_wakeup);
> + rv_attach_trace_probe("tlob", sched_process_exit, handle_sched_process_exit);
> + return 0;
> +}
> +EXPORT_SYMBOL_IF_KUNIT(tlob_enable_hooks);
> +
> +VISIBLE_IF_KUNIT void tlob_disable_hooks(void)
> +{
> + rv_detach_trace_probe("tlob", sched_switch, handle_sched_switch);
> + rv_detach_trace_probe("tlob", sched_wakeup, handle_sched_wakeup);
> + rv_detach_trace_probe("tlob", sched_process_exit, handle_sched_process_exit);
> +}
> +EXPORT_SYMBOL_IF_KUNIT(tlob_disable_hooks);
> +
> +static int enable_tlob(void)
> +{
> + int retval;
> +
> + retval = __tlob_init_monitor();
> + if (retval)
> + return retval;
> +
> + return tlob_enable_hooks();
> +}
> +
> +static void disable_tlob(void)
> +{
> + tlob_disable_hooks();
> + __tlob_destroy_monitor();
> +}
> +
> +static struct rv_monitor rv_this = {
> + .name = "tlob",
> + .description = "Per-task latency-over-budget monitor.",
> + .enable = enable_tlob,
> + .disable = disable_tlob,
> + .reset = da_monitor_reset_all,
> + .enabled = 0,
> +};
> +
> +static void *tlob_chardev_bind(void)
> +{
> + struct tlob_fpriv *fp;
> +
> + fp = kzalloc_obj(*fp, GFP_KERNEL);
> + if (!fp)
> + return ERR_PTR(-ENOMEM);
> +
> + /* Pin cache/pool for fd lifetime; balanced in tlob_chardev_release.
> + * If the synthetic ref has already been dropped (__tlob_destroy_monitor
> + * ran to completion), reject the bind so the caller gets ENODEV instead
> + * of corrupting a zero refcount.
> + */
> + if (!refcount_inc_not_zero(&tlob_fd_refcount)) {
> + kfree(fp);
> + return ERR_PTR(-ENODEV);
> + }
> + return fp;
> +}
> +
> +static void tlob_chardev_release(void *priv)
> +{
> + struct tlob_fpriv *fp = priv;
> +
> + if (fp->monitoring) {
> + /* All return values are safe on close. */
> + (void)tlob_stop_task(fp->task);
> + put_task_struct(fp->task);
> + }
> +
> + kfree(fp);
> +
> + /* Release fd's pin; if last, wake __tlob_destroy_monitor. */
> + if (refcount_dec_and_test(&tlob_fd_refcount))
> + complete(&tlob_fd_released);
> +}
> +
> +static long tlob_chardev_ioctl(void *priv, unsigned int cmd, unsigned long arg)
> +{
> + struct tlob_fpriv *fp = priv;
> + struct tlob_start_args args;
> + struct task_struct *task;
> + int ret;
> +
> + switch (cmd) {
> + case TLOB_IOCTL_TRACE_START:
> + if (fp->monitoring)
> + return -EALREADY;
> +
> + if (copy_from_user(&args, (void __user *)arg, sizeof(args)))
> + return -EFAULT;
> +
> + ret = tlob_start_task(current, args.threshold_us);
> + if (ret)
> + return ret;
> +
> + fp->task = current;
> + get_task_struct(current);
> + fp->budget_exceeded = false;
> +
> + /* Link fd so hrtimer callback can latch budget_exceeded. */
> + scoped_guard(rcu) {
> + struct tlob_task_state *ws = da_get_target_by_id(current->pid);
> +
> + if (ws)
> + smp_store_release(&ws->fpriv, fp);
> + }
> +
> + fp->monitoring = true;
> + return 0;
> +
> + case TLOB_IOCTL_TRACE_STOP:
> + if (!fp->monitoring)
> + return -EINVAL;
> +
> + task = fp->task;
> + fp->monitoring = false;
> + fp->task = NULL;
> +
> + ret = tlob_stop_task(task);
> + put_task_struct(task);
> +
> + /*
> + * -EOVERFLOW: budget exceeded; propagate to caller.
> + * -EAGAIN: concurrent stop_all claimed cleanup; fall through to
> + * budget_exceeded latch set by the hrtimer callback.
> + * -ESRCH: task exited before TRACE_STOP (process-exit handler
> + * claimed cleanup); same latch applies. Not an internal error.
> + */
> + if (ret == -EAGAIN || ret == -ESRCH)
> + return READ_ONCE(fp->budget_exceeded) ? -EOVERFLOW : 0;
> + return ret;
> +
> + default:
> + return -ENOTTY;
> + }
> +}
> +
> +static const struct rv_chardev_ops tlob_chardev_ops = {
> + .owner = THIS_MODULE,
> + .bind = tlob_chardev_bind,
> + .ioctl = tlob_chardev_ioctl,
> + .release = tlob_chardev_release,
> +};
> +
> +static int __init register_tlob(void)
> +{
> + int ret;
> +
> + ret = rv_chardev_register_monitor("tlob", &tlob_chardev_ops);
> + if (ret)
> + return ret;
> +
> + ret = rv_register_monitor(&rv_this, NULL);
> + if (ret) {
> + rv_chardev_unregister_monitor("tlob");
> + return ret;
> + }
> +
> + if (rv_this.root_d) {
> + if (!tracefs_create_file("monitor", 0644, rv_this.root_d, NULL,
> + &tlob_monitor_fops)) {
> + rv_unregister_monitor(&rv_this);
> + rv_chardev_unregister_monitor("tlob");
> + return -ENOMEM;
> + }
> + }
> +
> + return 0;
> +}
> +
> +static void __exit unregister_tlob(void)
> +{
> + rv_chardev_unregister_monitor("tlob");
> + rv_unregister_monitor(&rv_this);
> +}
> +
> +module_init(register_tlob);
> +module_exit(unregister_tlob);
> +
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR("Wen Yang <wen.yang@linux.dev>");
> +MODULE_DESCRIPTION("tlob: task latency over budget per-task monitor.");
> diff --git a/kernel/trace/rv/monitors/tlob/tlob.h b/kernel/trace/rv/monitors/tlob/tlob.h
> new file mode 100644
> index 000000000000..71c1735d27d2
> --- /dev/null
[...]
> diff --git a/kernel/trace/rv/rv.c b/kernel/trace/rv/rv.c
> index ee4e68102f17..a45c4763dbe5 100644
> --- a/kernel/trace/rv/rv.c
> +++ b/kernel/trace/rv/rv.c
> @@ -142,10 +142,17 @@
> #include <linux/module.h>
> #include <linux/init.h>
> #include <linux/slab.h>
> +#include <kunit/visibility.h>
>
> #ifdef CONFIG_RV_MON_EVENTS
> #define CREATE_TRACE_POINTS
> #include <rv_trace.h>
> +
> +#ifdef CONFIG_RV_MON_TLOB
> +EXPORT_TRACEPOINT_SYMBOL_GPL(error_tlob);
> +EXPORT_TRACEPOINT_SYMBOL_GPL(event_tlob);
> +EXPORT_TRACEPOINT_SYMBOL_GPL(error_env_tlob);
> +#endif
Can't this stay in tlob.c? That way the shared file stays clean and you
skip the ifdeffery.
> #endif
>
> #include "rv.h"
> @@ -696,6 +703,33 @@ static void turn_monitoring_on(void)
> WRITE_ONCE(monitoring_on, true);
> }
>
> +#if IS_ENABLED(CONFIG_KUNIT)
> +/**
> + * rv_kunit_monitoring_on - enable the global monitoring_on flag for KUnit tests.
> + *
> + * KUnit test suite_init functions must call this before initialising any
> + * monitor, mirroring the turn_monitoring_on() call in rv_init_interface().
> + * The matching rv_kunit_monitoring_off() must be called in suite_exit to
> + * restore the flag so that test suites do not interfere with each other.
> + */
> +void rv_kunit_monitoring_on(void)
> +{
> + turn_monitoring_on();
> +}
> +EXPORT_SYMBOL_IF_KUNIT(rv_kunit_monitoring_on);
> +
> +/**
> + * rv_kunit_monitoring_off - disable the global monitoring_on flag for KUnit tests.
> + *
> + * Must be called in suite_exit to restore global state after
> + * rv_kunit_monitoring_on().
> + */
> +void rv_kunit_monitoring_off(void)
> +{
> + turn_monitoring_off();
> +}
> +EXPORT_SYMBOL_IF_KUNIT(rv_kunit_monitoring_off);
> +#endif /* CONFIG_KUNIT */
> +
> static void turn_monitoring_on_with_reset(void)
> {
> lockdep_assert_held(&rv_interface_lock);
> @@ -846,6 +880,10 @@ int __init rv_init_interface(void)
> if (retval)
> return 1;
>
> + retval = rv_chardev_init();
> + if (retval)
> + return 1;
> +
Both of those can stay in separate patches as mentioned above.
> turn_monitoring_on();
>
> rv_root.root_dir = no_free_ptr(root_dir);
> diff --git a/kernel/trace/rv/rv.h b/kernel/trace/rv/rv.h
> index 2c0f51ff9d5c..82c9a2b57596 100644
> --- a/kernel/trace/rv/rv.h
> +++ b/kernel/trace/rv/rv.h
> @@ -31,6 +31,8 @@ int rv_enable_monitor(struct rv_monitor *mon);
> bool rv_is_container_monitor(struct rv_monitor *mon);
> bool rv_is_nested_monitor(struct rv_monitor *mon);
>
> +int rv_chardev_init(void);
> +
Same here.
> #ifdef CONFIG_RV_REACTORS
> int reactor_populate_monitor(struct rv_monitor *mon, struct dentry *root);
> int init_rv_reactors(struct dentry *root_dir);
> diff --git a/kernel/trace/rv/rv_chardev.c b/kernel/trace/rv/rv_chardev.c
> new file mode 100644
> index 000000000000..1fba1642ebc1
> --- /dev/null
> +++ b/kernel/trace/rv/rv_chardev.c
> @@ -0,0 +1,201 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
And here.
> diff --git a/kernel/trace/rv/rv_uprobe.c b/kernel/trace/rv/rv_uprobe.c
> index bc28399cfd4b..1ba7b80c1d87 100644
> --- a/kernel/trace/rv/rv_uprobe.c
> +++ b/kernel/trace/rv/rv_uprobe.c
Also this probably belongs in the uprobes patch.
> @@ -132,13 +132,10 @@ EXPORT_SYMBOL_GPL(rv_uprobe_attach);
> */
> void rv_uprobe_detach(struct rv_uprobe *p)
> {
> - struct rv_uprobe_impl *impl;
> -
> if (!p)
> return;
>
> - impl = container_of(p, struct rv_uprobe_impl, pub);
> - uprobe_unregister_nosync(impl->uprobe, &impl->uc);
> + rv_uprobe_unregister_nosync(p);
> /*
> * uprobe_unregister_sync() is a global barrier: it waits for all
> * in-flight uprobe handlers across the entire system to complete,
> @@ -146,8 +143,47 @@ void rv_uprobe_detach(struct rv_uprobe *p)
> * guarantees that no handler touching impl->pub.priv is running by
> * the time we return, even if the caller immediately frees priv.
> */
> + rv_uprobe_sync();
> + rv_uprobe_free(p);
> +}
> +EXPORT_SYMBOL_GPL(rv_uprobe_detach);
[...]
> diff --git a/tools/include/uapi/linux/rv.h b/tools/include/uapi/linux/rv.h
> new file mode 100644
> index 000000000000..a34e5426393b
> --- /dev/null
> +++ b/tools/include/uapi/linux/rv.h
> @@ -0,0 +1,86 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +/*
> + * UAPI definitions for Runtime Verification (RV) monitors.
> + *
> + * All RV monitors that expose an ioctl self-instrumentation interface
> + * share the magic byte RV_IOC_MAGIC ('r').
> + *
> + * Usage examples and design rationale are in:
> + * Documentation/trace/rv/monitor_tlob.rst
> + */
And this in a new ioctl patch.
> +
> +#ifndef _UAPI_LINUX_RV_H
> +#define _UAPI_LINUX_RV_H
Thanks,
Gabriele