From: wen.yang@linux.dev
To: Gabriele Monaco <gmonaco@redhat.com>,
Steven Rostedt <rostedt@goodmis.org>
Cc: linux-trace-kernel@vger.kernel.org, linux-kernel@vger.kernel.org,
Wen Yang <wen.yang@linux.dev>
Subject: [RFC PATCH v2 08/10] rv/tlob: add tlob hybrid automaton monitor
Date: Tue, 12 May 2026 02:24:54 +0800 [thread overview]
Message-ID: <fe5ed6a9a0a911e6ec74dc06c453786a2c4fb6d1.1778522945.git.wen.yang@linux.dev> (raw)
In-Reply-To: <cover.1778522945.git.wen.yang@linux.dev>
From: Wen Yang <wen.yang@linux.dev>
Introduce tlob (task latency over budget), a per-task hybrid-automaton
RV monitor that measures elapsed time (CLOCK_MONOTONIC) across
a user-delimited code section and fires an error_env_tlob tracepoint
when the elapsed time exceeds a configurable per-invocation budget.
The monitor is built on RV_MON_PER_OBJ with HA_TIMER_HRTIMER. Three
states track the scheduler status of the monitored task:
running --(sleep)-------> sleeping
running --(preempt)-----> waiting
sleeping --(wakeup)------> waiting
waiting --(switch_in)--> running
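The four edges above can be encoded as a pure transition function. The sketch below is illustrative only: the real monitor uses the DA framework's generated transition tables, and the enum and function names here are hypothetical:

```c
#include <assert.h>

/* Illustrative encoding of the three states and four edges; any other
 * (state, event) pair is a violation the DA framework would reject. */
enum tlob_state { RUNNING, WAITING, SLEEPING, INVALID };
enum tlob_event { SLEEP, PREEMPT, WAKEUP, SWITCH_IN };

static enum tlob_state tlob_next(enum tlob_state s, enum tlob_event e)
{
	if (s == RUNNING && e == SLEEP)
		return SLEEPING;	/* task blocks on a resource */
	if (s == RUNNING && e == PREEMPT)
		return WAITING;		/* preempted, still runnable */
	if (s == SLEEPING && e == WAKEUP)
		return WAITING;		/* back in the runqueue */
	if (s == WAITING && e == SWITCH_IN)
		return RUNNING;		/* scheduler picks the task */
	return INVALID;			/* not a legal edge */
}
```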
A single clock invariant clk_elapsed < BUDGET_NS() is active in all
three states. The budget hrtimer is rearmed on each DA transition for
the remaining budget, keeping the absolute deadline fixed at
start_time + BUDGET_NS.
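The rearm arithmetic amounts to programming the timer for whatever budget is left, so the absolute deadline never drifts no matter how many transitions occur. A minimal sketch (variable and function names are illustrative, not the monitor's):

```c
#include <assert.h>
#include <stdint.h>

/* Fixed-deadline rearm: each DA transition re-programs the relative
 * timer for deadline - now, clamped to zero once the budget is spent. */
static uint64_t rearm_remaining_ns(uint64_t start_ns, uint64_t budget_ns,
				   uint64_t now_ns)
{
	uint64_t deadline = start_ns + budget_ns;

	return now_ns < deadline ? deadline - now_ns : 0;
}
```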
Per-task state is stored in the DA framework's hash table keyed by
task->pid. Storage is pre-allocated by tlob_start_task() with
GFP_KERNEL via da_create_or_get() before the scheduler tracepoints
can fire, using DA_SKIP_AUTO_ALLOC so that no kmalloc occurs on the
tracepoint hot path. This avoids both the kmalloc_nolock() restriction
(requires HAVE_ALIGNED_STRUCT_PAGE) and latency issues under PREEMPT_RT.
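The allocate-up-front/lookup-only-on-hot-path split can be sketched as below. This is a simplified illustration of the pattern, not the DA framework's actual hash-table storage; names and the pool size are hypothetical:

```c
#include <assert.h>
#include <stdatomic.h>

/* Slots are claimed in sleepable context (start_task), so the
 * tracepoint hot path only does a lock-free lookup, never kmalloc. */
#define POOL_SIZE 4

static atomic_int slot_pid[POOL_SIZE];	/* 0 = free */

static int pool_claim(int pid)		/* sleepable context */
{
	for (int i = 0; i < POOL_SIZE; i++) {
		int expected = 0;

		if (atomic_compare_exchange_strong(&slot_pid[i],
						   &expected, pid))
			return i;
	}
	return -1;			/* -ENOSPC in the real monitor */
}

static int pool_lookup(int pid)		/* tracepoint hot path */
{
	for (int i = 0; i < POOL_SIZE; i++)
		if (atomic_load(&slot_pid[i]) == pid)
			return i;
	return -1;
}
```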
Nested monitoring is handled by nest_depth: tlob_start_task() on an
already-monitored pid returns -EEXIST and increments nest_depth without
disturbing the outer window; only the outermost tlob_stop_task()
performs real cleanup.
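The nesting convention reduces to a depth counter, sketched below with simplified, hypothetical helper names (the real code stores nest_depth in the per-task state and returns -EEXIST from tlob_start_task()):

```c
#include <assert.h>

#define EEXIST_ERR (-17)	/* stand-in for -EEXIST */

struct win {
	int nest_depth;
	int active;
};

static int win_start(struct win *w)
{
	if (w->active) {
		w->nest_depth++;	/* outer window untouched */
		return EEXIST_ERR;
	}
	w->active = 1;
	w->nest_depth = 0;
	return 0;
}

static int win_stop(struct win *w)	/* returns 1 on real cleanup */
{
	if (w->nest_depth > 0) {
		w->nest_depth--;	/* inner stop: no-op */
		return 0;
	}
	w->active = 0;			/* outermost stop */
	return 1;
}
```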
Two userspace interfaces are provided. The ioctl interface exposes
in-process self-instrumentation via /dev/rv with TLOB_IOCTL_TRACE_START
and TLOB_IOCTL_TRACE_STOP. The uprobe interface enables external
monitoring of unmodified binaries via tracefs:
echo "p PATH:OFFSET_START OFFSET_STOP threshold=US" \
> /sys/kernel/tracing/rv/monitors/tlob/monitor
Violations are reported via error_env_tlob (HA clock-invariant)
regardless of which interface triggered them.
Suggested-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Wen Yang <wen.yang@linux.dev>
---
Documentation/trace/rv/index.rst | 1 +
Documentation/trace/rv/monitor_tlob.rst | 213 ++++
include/linux/rv.h | 45 +
include/rv/automata.h | 15 +
include/rv/ha_monitor.h | 33 +-
include/rv/rv_uprobe.h | 32 +
include/uapi/linux/rv.h | 86 ++
kernel/trace/rv/Kconfig | 2 +
kernel/trace/rv/Makefile | 4 +-
kernel/trace/rv/monitors/tlob/Kconfig | 69 ++
kernel/trace/rv/monitors/tlob/tlob.c | 1307 ++++++++++++++++++++
kernel/trace/rv/monitors/tlob/tlob.h | 171 +++
kernel/trace/rv/monitors/tlob/tlob_trace.h | 58 +
kernel/trace/rv/rv.c | 38 +
kernel/trace/rv/rv.h | 2 +
kernel/trace/rv/rv_chardev.c | 201 +++
kernel/trace/rv/rv_trace.h | 1 +
kernel/trace/rv/rv_uprobe.c | 46 +-
tools/include/uapi/linux/rv.h | 86 ++
19 files changed, 2400 insertions(+), 10 deletions(-)
create mode 100644 Documentation/trace/rv/monitor_tlob.rst
create mode 100644 include/uapi/linux/rv.h
create mode 100644 kernel/trace/rv/monitors/tlob/Kconfig
create mode 100644 kernel/trace/rv/monitors/tlob/tlob.c
create mode 100644 kernel/trace/rv/monitors/tlob/tlob.h
create mode 100644 kernel/trace/rv/monitors/tlob/tlob_trace.h
create mode 100644 kernel/trace/rv/rv_chardev.c
create mode 100644 tools/include/uapi/linux/rv.h
diff --git a/Documentation/trace/rv/index.rst b/Documentation/trace/rv/index.rst
index 29769f06bb0f..1501545b5f08 100644
--- a/Documentation/trace/rv/index.rst
+++ b/Documentation/trace/rv/index.rst
@@ -16,5 +16,6 @@ Runtime Verification
monitor_wwnr.rst
monitor_sched.rst
monitor_rtapp.rst
+ monitor_tlob.rst
monitor_stall.rst
monitor_deadline.rst
diff --git a/Documentation/trace/rv/monitor_tlob.rst b/Documentation/trace/rv/monitor_tlob.rst
new file mode 100644
index 000000000000..91b592630b3f
--- /dev/null
+++ b/Documentation/trace/rv/monitor_tlob.rst
@@ -0,0 +1,213 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Monitor tlob
+============
+
+- Name: tlob - task latency over budget
+- Type: per-object hybrid automaton (RV_MON_PER_OBJ)
+- Author: Wen Yang <wen.yang@linux.dev>
+
+Description
+-----------
+
+The tlob monitor tracks per-task elapsed wall-clock time (CLOCK_MONOTONIC,
+spanning running, waiting, and sleeping states) and reports a violation when
+the monitored task exceeds a configurable per-invocation budget threshold.
+
+The monitor implements a three-state hybrid automaton with a single clock
+environment variable ``clk_elapsed``. The clock invariant
+``clk_elapsed < BUDGET_NS()`` is active in all three states; when it is
+violated, the HA timer fires and the framework emits ``error_env_tlob``,
+then calls ``da_monitor_reset()`` automatically::
+
+ | (initial, via task_start)
+ v
+ +--------------+
+ | running | <-----------+
+ +--------------+ |
+ | | |
+ sleep preempt switch_in
+ | | |
+ v v |
+ +---------+ +---------+ |
+ | sleeping| | waiting | -------+
+ +---------+ +---------+
+ | ^
+ +---wakeup---+
+
+ Key transitions:
+ running --(sleep)------> sleeping (task blocks waiting for a resource)
+ running --(preempt)----> waiting (task preempted, back in runqueue)
+ sleeping --(wakeup)-----> waiting (resource available, enters runqueue)
+ waiting --(switch_in)--> running (scheduler picks task, back on CPU)
+
+ ``task_start`` calls ``da_handle_start_event()`` with the synthetic event
+ ``switch_in_tlob`` to force the initial DA state to ``running`` (since
+ ``switch_in`` transitions waiting→running), then resets ``clk_elapsed`` and
+ arms the budget timer directly via ``ha_reset_clk_ns()`` + ``ha_start_timer_ns()``.
+ ``task_stop`` cancels the HA timer synchronously via
+ ``ha_cancel_timer_sync()`` then calls ``da_monitor_reset()`` directly.
+
+The non-running condition (monitor not yet started or reset after a
+stop/violation) is handled implicitly by the RV framework
+(``da_mon->monitoring == 0``) — it is not an explicit DA state.
+
+Per-task state lives in ``struct tlob_task_state`` which is stored as
+``monitor_target`` in the framework's ``da_monitor_storage``, indexed by
+pid. The per-invocation ``threshold_us`` is read via
+``ha_get_target(ha_mon)->threshold_us`` inside the HA constraint functions,
+following the same pattern as the ``nomiss`` monitor.
+
+Usage
+-----
+
+tracefs interface (uprobe-based external monitoring)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The ``monitor`` tracefs file instruments an unmodified binary via uprobes.
+The format follows the ftrace ``uprobe_events`` convention (``PATH:OFFSET``
+for the probe location, ``key=value`` for configuration parameters)::
+
+ p PATH:OFFSET_START OFFSET_STOP threshold=US
+
+The uprobe at ``OFFSET_START`` fires ``tlob_start_task()``; the uprobe at
+``OFFSET_STOP`` fires ``tlob_stop_task()``. Both offsets are ELF file
+offsets of entry points in ``PATH``. ``PATH`` may contain ``:``; the last
+``:`` in the ``PATH:OFFSET_START`` token is the separator.
+
+To remove a binding, use ``-PATH:OFFSET_START``::
+
+ echo 1 > /sys/kernel/tracing/rv/monitors/tlob/enable
+
+ echo "p /usr/bin/myapp:0x12a0 0x12f0 threshold=5000" \
+ > /sys/kernel/tracing/rv/monitors/tlob/monitor
+
+ # Remove a binding
+ echo "-/usr/bin/myapp:0x12a0" > /sys/kernel/tracing/rv/monitors/tlob/monitor
+
+ # List registered bindings
+ cat /sys/kernel/tracing/rv/monitors/tlob/monitor
+
+ # Read violations from the trace buffer
+ cat /sys/kernel/tracing/trace
+
+ioctl self-instrumentation (/dev/rv)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+``/dev/rv`` is a shared RV character device. Before using any monitor-specific
+ioctl, the fd must be bound to a monitor via ``RV_IOCTL_BIND_MONITOR``. Each
+open fd has independent per-fd monitoring state::
+
+ int fd = open("/dev/rv", O_RDWR);
+
+ /* Bind this fd to the tlob monitor. */
+ struct rv_bind_args bind = { .monitor_name = "tlob" };
+ ioctl(fd, RV_IOCTL_BIND_MONITOR, &bind);
+
+ struct tlob_start_args args = {
+ .threshold_us = 50000, /* 50 ms in microseconds */
+ };
+ ioctl(fd, TLOB_IOCTL_TRACE_START, &args);
+
+ /* ... code path under observation ... */
+
+ int ret = ioctl(fd, TLOB_IOCTL_TRACE_STOP, NULL);
+ /* ret == 0: within budget */
+ /* ret == -EOVERFLOW: budget exceeded */
+
+ close(fd);
+
+``TRACE_STOP`` returns ``-EOVERFLOW`` whenever the budget was exceeded.
+The HA timer calls ``da_monitor_reset()`` (storage remains); the
+synchronous ``ha_cancel_timer_sync()`` in ``tlob_stop_task()`` ensures the
+callback has completed before checking ``da_monitoring()``.
+
+Violation events
+~~~~~~~~~~~~~~~~
+
+Budget violations are always reported via the ``error_env_tlob`` RV
+tracepoint (HA clock-invariant violation), regardless of which interface
+triggered them::
+
+ cat /sys/kernel/tracing/trace
+
+To capture violations in a file::
+
+ trace-cmd record -e error_env_tlob &
+ # ... run workload ...
+ trace-cmd report
+
+tracefs files
+-------------
+
+The following files are created under
+``/sys/kernel/tracing/rv/monitors/tlob/``:
+
+``enable`` (rw)
+ Write ``1`` to enable the monitor; write ``0`` to disable it.
+
+``desc`` (ro)
+ Human-readable description of the monitor.
+
+``monitor`` (rw)
+ Write ``p PATH:OFFSET_START OFFSET_STOP threshold=US``
+ to bind two entry uprobes. Write ``-PATH:OFFSET_START`` to remove a
+ binding. Read to list registered bindings in the same format.
+
+Kernel API
+----------
+
+.. kernel-doc:: kernel/trace/rv/monitors/tlob/tlob.c
+ :functions: tlob_start_task tlob_stop_task
+
+``tlob_start_task(task, threshold_us)``
+ Begin monitoring *task* with a total latency budget of *threshold_us*
+ microseconds. Allocates per-task state, sets initial DA state to
+ ``running``, resets ``clk_elapsed``, and arms the HA budget timer.
+ Returns 0, -ENODEV (monitor disabled), -ERANGE (zero threshold),
+  -EEXIST (already monitoring), -ENOSPC (at capacity), or -ENOMEM.
+
+``tlob_stop_task(task)``
+ Stop monitoring *task*. Synchronously cancels the HA timer via
+ ``ha_cancel_timer_sync()``, checks ``da_monitoring()`` to determine outcome.
+ Returns 0 (clean stop, within budget), -EOVERFLOW (budget was exceeded),
+ -ESRCH (not monitored), or -EAGAIN (concurrent stop racing).
+
+Design notes
+------------
+
+State transitions are driven by two tracepoints:
+
+- ``sched_switch``: ``prev_state == 0`` (``TASK_RUNNING``, preempted,
+ stays on runqueue) → running→waiting; ``prev_state != 0`` (voluntarily
+ blocked, leaves runqueue) → running→sleeping; ``next`` pointer →
+ waiting→running.
+- ``sched_wakeup``: task moves back onto the runqueue → sleeping→waiting.
+
+No ``waiting → sleeping`` edge exists because a task can only block
+itself while executing on CPU. ``try_to_wake_up()`` is also a no-op
+when ``__state == TASK_RUNNING``, so ``sched_wakeup`` never fires while
+the task is in ``waiting`` state.
+
+Limitations:
+
+- The initial DA state is always ``running``, set by feeding the synthetic
+ event ``switch_in_tlob`` to ``da_handle_start_event()``. Monitoring a non-current
+ task that is already in waiting or sleeping state at call time misclassifies
+ the first interval as ``running_ns``.
+- ``TASK_STOPPED`` and ``TASK_TRACED`` carry ``prev_state != 0`` and are
+ therefore counted as ``sleeping_ns``, indistinguishable from
+ I/O-blocked time.
+- ``sched_wakeup_new`` is not hooked. In practice this is not an issue
+ because ``tlob_start_task`` is always called from a running context.
+
+Specification
+-------------
+
+Graphviz DOT file in ``tools/verification/models/tlob.dot``.
+
+KUnit tests under ``kernel/trace/rv/monitors/tlob/tlob_kunit.c``
+(CONFIG_TLOB_KUNIT_TEST).
+
+User-space integration tests under ``tools/testing/selftests/verification/``
+(requires CONFIG_RV_MON_TLOB=y and root).
diff --git a/include/linux/rv.h b/include/linux/rv.h
index 541ba404926a..1ea91bb3f1c2 100644
--- a/include/linux/rv.h
+++ b/include/linux/rv.h
@@ -21,6 +21,13 @@
#include <linux/list.h>
#include <linux/types.h>
+/* Forward declaration: poll_table is only needed by rv_chardev_ops::poll.
+ * Avoid pulling in <linux/poll.h> from rv.h — that header is included by
+ * sched.h, and poll.h → fs.h → rcupdate.h creates a header-ordering cycle
+ * with migrate_disable() on UML/non-SMP targets.
+ */
+struct poll_table_struct;
+
/*
* Deterministic automaton per-object variables.
*/
@@ -158,6 +165,44 @@ int rv_register_monitor(struct rv_monitor *monitor, struct rv_monitor *parent);
int rv_get_task_monitor_slot(void);
void rv_put_task_monitor_slot(int slot);
+/**
+ * struct rv_chardev_ops - per-monitor callbacks for the /dev/rv chardev
+ *
+ * Monitors that want to expose an ioctl self-instrumentation interface
+ * register an instance of this struct with rv_chardev_register_monitor().
+ *
+ * @owner: Module that owns this ops struct. Set to THIS_MODULE.
+ * The chardev holds a module reference for every bound fd so
+ * the module cannot be unloaded while any fd remains open.
+ * @bind: Called when userspace issues RV_IOCTL_BIND_MONITOR. Should
+ * allocate and return per-fd private data (opaque pointer), or
+ * ERR_PTR(errno) on failure.
+ * @ioctl: Called for every monitor-specific ioctl after binding. @priv
+ * is the pointer returned by @bind.
+ * @poll: Optional. Called from the fd's poll() / epoll_wait() path.
+ * Should call poll_wait(@file, wq, @wait) on the monitor's internal
+ * wait queue and return the current event mask (EPOLLIN | EPOLLRDNORM
+ * when an event is pending, 0 otherwise). If NULL, poll() always
+ * returns 0 (no events).
+ * @release: Called when the fd is closed. Must free @priv.
+ */
+struct rv_chardev_ops {
+ struct module *owner;
+ void *(*bind)(void);
+ long (*ioctl)(void *priv, unsigned int cmd, unsigned long arg);
+ __poll_t (*poll)(void *priv, struct file *file, struct poll_table_struct *wait);
+ void (*release)(void *priv);
+};
+
+int rv_chardev_register_monitor(const char *name,
+ const struct rv_chardev_ops *ops);
+void rv_chardev_unregister_monitor(const char *name);
+
+#if IS_ENABLED(CONFIG_KUNIT)
+void rv_kunit_monitoring_on(void);
+void rv_kunit_monitoring_off(void);
+#endif
+
#ifdef CONFIG_RV_REACTORS
int rv_unregister_reactor(struct rv_reactor *reactor);
int rv_register_reactor(struct rv_reactor *reactor);
diff --git a/include/rv/automata.h b/include/rv/automata.h
index 4a4eb40cf09a..ae819638d85a 100644
--- a/include/rv/automata.h
+++ b/include/rv/automata.h
@@ -41,6 +41,21 @@ static char *model_get_event_name(enum events event)
return RV_AUTOMATON_NAME.event_names[event];
}
+/*
+ * model_get_timer_event_name - label used when the HA timer fires (no event).
+ *
+ * Monitors may define MONITOR_TIMER_EVENT_NAME before including the model
+ * header to give the timer-fired violation a semantically meaningful label
+ * (e.g. "budget_exceeded" for tlob). Defaults to "none".
+ */
+#ifndef MONITOR_TIMER_EVENT_NAME
+#define MONITOR_TIMER_EVENT_NAME "none"
+#endif
+static inline char *model_get_timer_event_name(void)
+{
+ return MONITOR_TIMER_EVENT_NAME;
+}
+
/*
* model_get_initial_state - return the automaton's initial state
*/
diff --git a/include/rv/ha_monitor.h b/include/rv/ha_monitor.h
index d59507e8cb30..dfc993774089 100644
--- a/include/rv/ha_monitor.h
+++ b/include/rv/ha_monitor.h
@@ -28,6 +28,7 @@ static inline void ha_monitor_init_env(struct da_monitor *da_mon);
static inline void ha_monitor_reset_env(struct da_monitor *da_mon);
static inline void ha_setup_timer(struct ha_monitor *ha_mon);
static inline bool ha_cancel_timer(struct ha_monitor *ha_mon);
+static inline void ha_cancel_timer_sync(struct ha_monitor *ha_mon);
static bool ha_monitor_handle_constraint(struct da_monitor *da_mon,
enum states curr_state,
enum events event,
@@ -35,7 +36,10 @@ static bool ha_monitor_handle_constraint(struct da_monitor *da_mon,
da_id_type id);
#define da_monitor_event_hook ha_monitor_handle_constraint
#define da_monitor_init_hook ha_monitor_init_env
+/* Allow monitors to override da_monitor_reset_hook before including this header. */
+#ifndef da_monitor_reset_hook
#define da_monitor_reset_hook ha_monitor_reset_env
+#endif
#include <rv/da_monitor.h>
#include <linux/seq_buf.h>
@@ -70,7 +74,7 @@ static void ha_react(enum states curr_state, enum events event, char *env)
rv_react(&rv_this,
"rv: monitor %s does not allow event %s on state %s with env %s\n",
__stringify(MONITOR_NAME),
- event == EVENT_NONE ? EVENT_NONE_LBL : model_get_event_name(event),
+ event == EVENT_NONE ? model_get_timer_event_name() : model_get_event_name(event),
model_get_state_name(curr_state), env);
}
@@ -246,7 +250,7 @@ static inline void __ha_monitor_timer_callback(struct ha_monitor *ha_mon)
ha_get_env_string(&env_string, ha_mon, time_ns);
ha_react(curr_state, EVENT_NONE, env_string.buffer);
ha_trace_error_env(ha_mon, model_get_state_name(curr_state),
- EVENT_NONE_LBL, env_string.buffer,
+ model_get_timer_event_name(), env_string.buffer,
da_get_id(&ha_mon->da_mon));
da_monitor_reset(&ha_mon->da_mon);
@@ -412,6 +416,14 @@ static inline bool ha_cancel_timer(struct ha_monitor *ha_mon)
{
return timer_delete(&ha_mon->timer);
}
+/*
+ * ha_cancel_timer_sync - Cancel the timer, blocking until any running
+ * callback has completed.
+ */
+static inline void ha_cancel_timer_sync(struct ha_monitor *ha_mon)
+{
+ timer_delete_sync(&ha_mon->timer);
+}
#elif HA_TIMER_TYPE == HA_TIMER_HRTIMER
/*
* Helper functions to handle the monitor timer.
@@ -432,12 +444,12 @@ static enum hrtimer_restart ha_monitor_timer_callback(struct hrtimer *hrtimer)
static inline void ha_setup_timer(struct ha_monitor *ha_mon)
{
hrtimer_setup(&ha_mon->hrtimer, ha_monitor_timer_callback,
- CLOCK_MONOTONIC, HRTIMER_MODE_REL_HARD);
+ CLOCK_MONOTONIC, HRTIMER_MODE_REL_SOFT);
}
static inline void ha_start_timer_ns(struct ha_monitor *ha_mon, enum envs env,
u64 expire, u64 time_ns)
{
- int mode = HRTIMER_MODE_REL_HARD;
+ int mode = HRTIMER_MODE_REL_SOFT;
u64 passed = ha_invariant_passed_ns(ha_mon, env, expire, time_ns);
if (RV_MON_TYPE == RV_MON_PER_CPU)
@@ -463,6 +475,18 @@ static inline bool ha_cancel_timer(struct ha_monitor *ha_mon)
{
return hrtimer_try_to_cancel(&ha_mon->hrtimer) == 1;
}
+/*
+ * ha_cancel_timer_sync - Cancel the timer, blocking until any running
+ * callback has completed.
+ *
+ * Use in teardown paths (e.g. stop_task) where the caller must know the
+ * callback has finished before inspecting or freeing monitor state.
+ * Must not be called from atomic context or within the timer callback.
+ */
+static inline void ha_cancel_timer_sync(struct ha_monitor *ha_mon)
+{
+ hrtimer_cancel(&ha_mon->hrtimer);
+}
#else /* HA_TIMER_NONE */
/*
* Start function is intentionally not defined, monitors using timers must
@@ -473,6 +497,7 @@ static inline bool ha_cancel_timer(struct ha_monitor *ha_mon)
{
return false;
}
+static inline void ha_cancel_timer_sync(struct ha_monitor *ha_mon) { }
#endif
#endif
diff --git a/include/rv/rv_uprobe.h b/include/rv/rv_uprobe.h
index 084cdb36a2ff..9106c5c9275e 100644
--- a/include/rv/rv_uprobe.h
+++ b/include/rv/rv_uprobe.h
@@ -79,9 +79,41 @@ struct rv_uprobe *rv_uprobe_attach(const char *binpath, loff_t offset,
* for any in-progress handler to finish, then releases the path reference
* and frees the rv_uprobe struct. The caller's priv data is NOT freed.
*
+ * When removing a single probe, prefer this over the three-phase API.
* Safe to call from process context only (uprobe_unregister_sync() may
* schedule).
*/
void rv_uprobe_detach(struct rv_uprobe *p);
+/**
+ * rv_uprobe_unregister_nosync - dequeue an uprobe without waiting
+ * @p: probe to dequeue; may be NULL (no-op)
+ *
+ * Removes the uprobe from the uprobe subsystem but does NOT wait for
+ * in-flight handlers to complete. The caller must call rv_uprobe_sync()
+ * before calling rv_uprobe_free() on the same probe.
+ *
+ * Use this to batch multiple deregistrations before a single rv_uprobe_sync().
+ */
+void rv_uprobe_unregister_nosync(struct rv_uprobe *p);
+
+/**
+ * rv_uprobe_sync - wait for all in-flight uprobe handlers to complete
+ *
+ * Global barrier: waits for every in-flight uprobe handler across the system
+ * to finish. Call once after a batch of rv_uprobe_unregister_nosync() calls
+ * and before any rv_uprobe_free() call.
+ */
+void rv_uprobe_sync(void);
+
+/**
+ * rv_uprobe_free - release resources of a previously deregistered probe
+ * @p: probe to free; may be NULL (no-op)
+ *
+ * Releases the path reference and frees the rv_uprobe struct. Must only
+ * be called after rv_uprobe_sync() has returned. The caller's priv data
+ * is NOT freed.
+ */
+void rv_uprobe_free(struct rv_uprobe *p);
+
#endif /* _RV_UPROBE_H */
diff --git a/include/uapi/linux/rv.h b/include/uapi/linux/rv.h
new file mode 100644
index 000000000000..a34e5426393b
--- /dev/null
+++ b/include/uapi/linux/rv.h
@@ -0,0 +1,86 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/*
+ * UAPI definitions for Runtime Verification (RV) monitors.
+ *
+ * All RV monitors that expose an ioctl self-instrumentation interface
+ * share the magic byte RV_IOC_MAGIC ('r').
+ *
+ * Usage examples and design rationale are in:
+ * Documentation/trace/rv/monitor_tlob.rst
+ */
+
+#ifndef _UAPI_LINUX_RV_H
+#define _UAPI_LINUX_RV_H
+
+#include <linux/ioctl.h>
+#include <linux/types.h>
+
+/* Magic byte shared by all RV monitor ioctls. */
+#define RV_IOC_MAGIC 'r'
+
+/* Maximum monitor name length (including NUL terminator). */
+#define RV_MONITOR_NAME_MAX 32
+
+/* Generic /dev/rv ioctls (ioctl number 0 is reserved for the core) */
+
+/**
+ * struct rv_bind_args - arguments for RV_IOCTL_BIND_MONITOR
+ * @monitor_name: NUL-terminated name of the monitor to bind (e.g. "tlob").
+ */
+struct rv_bind_args {
+ char monitor_name[RV_MONITOR_NAME_MAX];
+};
+
+/*
+ * RV_IOCTL_BIND_MONITOR - associate this fd with a specific RV monitor.
+ *
+ * Must be called once after open() and before any monitor-specific ioctl.
+ *
+ * Returns 0 on success.
+ * Returns -EBUSY if this fd is already bound to a monitor.
+ * Returns -ENOENT if the requested monitor is not registered.
+ * Returns -ENOMEM on allocation failure.
+ */
+#define RV_IOCTL_BIND_MONITOR _IOW(RV_IOC_MAGIC, 0, struct rv_bind_args)
+
+/* tlob: task latency over budget monitor (ioctl numbers 1–15) */
+
+/**
+ * struct tlob_start_args - arguments for TLOB_IOCTL_TRACE_START
+ * @threshold_us: Total latency budget for this window, in microseconds.
+ * Must be greater than zero. Both on-CPU and off-CPU time
+ * (including runqueue wait) count toward this budget.
+ */
+struct tlob_start_args {
+ __u64 threshold_us;
+};
+
+/*
+ * TLOB_IOCTL_TRACE_START - begin monitoring the calling task.
+ *
+ * Arms a per-task hrtimer for threshold_us microseconds (CLOCK_MONOTONIC,
+ * so both on-CPU and off-CPU time count toward the budget).
+ *
+ * Returns 0 on success.
+ * Returns -EEXIST if TRACE_START was already called on this fd.
+ * Returns -ENOSPC if TLOB_MAX_MONITORED tasks are already being tracked.
+ * Returns -ENOMEM on allocation failure.
+ * Returns -ENODEV if the tlob monitor is not enabled.
+ * Returns -ERANGE if threshold_us is 0.
+ */
+#define TLOB_IOCTL_TRACE_START _IOW(RV_IOC_MAGIC, 1, struct tlob_start_args)
+
+/*
+ * TLOB_IOCTL_TRACE_STOP - end monitoring the calling task.
+ *
+ * Returns 0 if within budget.
+ * Returns -EOVERFLOW if the latency budget was exceeded.
+ * Returns -EINVAL if TLOB_IOCTL_TRACE_START was not called on this fd.
+ *
+ * poll/epoll: after TRACE_START the fd becomes readable (EPOLLIN) when the
+ * budget is exceeded. The caller may then issue TRACE_STOP to retrieve the
+ * result, or simply close the fd to clean up.
+ */
+#define TLOB_IOCTL_TRACE_STOP _IO(RV_IOC_MAGIC, 2)
+
+#endif /* _UAPI_LINUX_RV_H */
diff --git a/kernel/trace/rv/Kconfig b/kernel/trace/rv/Kconfig
index e2e0033a00b9..1c36939db8e5 100644
--- a/kernel/trace/rv/Kconfig
+++ b/kernel/trace/rv/Kconfig
@@ -87,6 +87,8 @@ source "kernel/trace/rv/monitors/deadline/Kconfig"
source "kernel/trace/rv/monitors/nomiss/Kconfig"
# Add new deadline monitors here
+source "kernel/trace/rv/monitors/tlob/Kconfig"
+
# Add new monitors here
config RV_REACTORS
diff --git a/kernel/trace/rv/Makefile b/kernel/trace/rv/Makefile
index f139b904bea3..8a5b5c84aff9 100644
--- a/kernel/trace/rv/Makefile
+++ b/kernel/trace/rv/Makefile
@@ -2,7 +2,7 @@
ccflags-y += -I $(src) # needed for trace events
-obj-$(CONFIG_RV) += rv.o
+obj-$(CONFIG_RV) += rv.o rv_chardev.o
obj-$(CONFIG_RV_MON_WIP) += monitors/wip/wip.o
obj-$(CONFIG_RV_MON_WWNR) += monitors/wwnr/wwnr.o
obj-$(CONFIG_RV_MON_SCHED) += monitors/sched/sched.o
@@ -17,6 +17,8 @@ obj-$(CONFIG_RV_MON_STS) += monitors/sts/sts.o
obj-$(CONFIG_RV_MON_NRP) += monitors/nrp/nrp.o
obj-$(CONFIG_RV_MON_SSSW) += monitors/sssw/sssw.o
obj-$(CONFIG_RV_MON_OPID) += monitors/opid/opid.o
+obj-$(CONFIG_RV_MON_TLOB) += monitors/tlob/tlob.o
+obj-$(CONFIG_TLOB_KUNIT_TEST) += monitors/tlob/tlob_kunit.o
obj-$(CONFIG_RV_MON_STALL) += monitors/stall/stall.o
obj-$(CONFIG_RV_MON_DEADLINE) += monitors/deadline/deadline.o
obj-$(CONFIG_RV_MON_NOMISS) += monitors/nomiss/nomiss.o
diff --git a/kernel/trace/rv/monitors/tlob/Kconfig b/kernel/trace/rv/monitors/tlob/Kconfig
new file mode 100644
index 000000000000..82e521891496
--- /dev/null
+++ b/kernel/trace/rv/monitors/tlob/Kconfig
@@ -0,0 +1,69 @@
+# SPDX-License-Identifier: GPL-2.0-only
+#
+config RV_MON_TLOB
+ depends on RV
+ select RV_UPROBE
+ select HA_MON_EVENTS_ID
+ bool "tlob monitor"
+ help
+ Enable the tlob (task latency over budget) monitor. This monitor
+ tracks the elapsed time (CLOCK_MONOTONIC) of a marked code path
+ within a task (including both on-CPU and off-CPU time) and reports
+ a violation when the elapsed time exceeds a configurable budget.
+
+ The monitor uses a three-state hybrid automaton (running, waiting,
+ sleeping) stored per object using RV_MON_PER_OBJ. A single HA
+ clock invariant (clk_elapsed < BUDGET_NS) is enforced in all three
+ states via a per-task hrtimer.
+
+ States: running (initial, on-CPU), waiting (in runqueue, off-CPU),
+ sleeping (blocked on resource, off-CPU).
+ Key transitions:
+ running --(sleep)------> sleeping
+ running --(preempt)----> waiting
+ sleeping --(wakeup)-----> waiting
+ waiting --(switch_in)--> running
+ task_start calls da_handle_start_event() to set the initial state,
+ then arms the budget timer directly via ha_reset_clk_ns() +
+ ha_start_timer_ns(). task_stop cancels the timer synchronously via
+ ha_cancel_timer_sync() then calls da_monitor_reset().
+
+ Two userspace interfaces are provided:
+
+ tracefs uprobe binding (external, unmodified binaries):
+	  echo "p PATH:OFFSET_START OFFSET_STOP threshold=US" \
+ > /sys/kernel/tracing/rv/monitors/tlob/monitor
+ The uprobe at offset_start fires tlob_start_task(); the uprobe at
+ offset_stop fires tlob_stop_task(). Both are plain entry uprobes
+ so a mistyped offset cannot corrupt the call stack.
+
+ /dev/rv ioctl (in-process self-instrumentation):
+ ioctl(fd, TLOB_IOCTL_TRACE_START, &args);
+ do_critical_work();
+ ret = ioctl(fd, TLOB_IOCTL_TRACE_STOP, NULL);
+ /* ret == -EOVERFLOW when budget exceeded */
+ Allows conditional monitoring, sub-function granularity, and
+ inline reaction to violations without polling the trace buffer.
+
+ Up to TLOB_MAX_MONITORED tasks may be monitored simultaneously.
+
+ Violations are always reported via the standard error_env_tlob RV
+ tracepoint regardless of which interface triggered them. The
+ tracefs interface requires only tracefs write permissions, avoiding
+ the CAP_BPF privilege needed for equivalent eBPF-based approaches.
+
+ For further information, see:
+ Documentation/trace/rv/monitor_tlob.rst
+
+config TLOB_KUNIT_TEST
+ tristate "KUnit tests for tlob monitor" if !KUNIT_ALL_TESTS
+ depends on RV_MON_TLOB && KUNIT
+ default KUNIT_ALL_TESTS
+ help
+ Enable KUnit in-kernel unit tests for the tlob RV monitor.
+
+ Tests cover automaton state transitions, the start/stop task
+ interface, scheduler context-switch accounting, and the uprobe
+ format string parser.
+
+ Say Y or M here to run the tlob KUnit test suite; otherwise say N.
diff --git a/kernel/trace/rv/monitors/tlob/tlob.c b/kernel/trace/rv/monitors/tlob/tlob.c
new file mode 100644
index 000000000000..475e972ae9aa
--- /dev/null
+++ b/kernel/trace/rv/monitors/tlob/tlob.c
@@ -0,0 +1,1307 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * tlob: task latency over budget monitor
+ *
+ * Track the elapsed wall-clock time of a marked code path and detect when
+ * a monitored task exceeds its per-task latency budget. CLOCK_MONOTONIC
+ * is used so both on-CPU and off-CPU time count toward the budget.
+ *
+ * On a budget violation, two tracepoints are emitted from the hrtimer
+ * callback: error_env_tlob signals the violation, and detail_env_tlob
+ * provides a per-state time breakdown (running_ns, waiting_ns, sleeping_ns)
+ * that pinpoints whether the overrun occurred in running, waiting, or sleeping state.
+ *
+ * The monitor uses RV_MON_PER_OBJ: per-task state (struct tlob_task_state)
+ * is stored as monitor_target in the framework's hash table.
+ *
+ * One HA clock invariant is enforced:
+ * clk_elapsed < BUDGET_NS() (active in all states)
+ *
+ * task_start feeds the synthetic event switch_in_tlob to
+ * da_handle_start_event() to set the initial state to running, then calls
+ * ha_reset_clk_ns() + ha_start_timer_ns() directly to initialise the
+ * clock and arm the budget timer.
+ * The HA timer is cancelled synchronously by ha_cancel_timer_sync() in
+ * tlob_stop_task().
+ *
+ * Copyright (C) 2026 Wen Yang <wen.yang@linux.dev>
+ */
+#include <linux/completion.h>
+#include <linux/hrtimer.h>
+#include <linux/kernel.h>
+#include <linux/ktime.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/namei.h>
+#include <linux/refcount.h>
+#include <linux/rv.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/tracefs.h>
+#include <linux/uaccess.h>
+#include <kunit/visibility.h>
+#include <rv/instrumentation.h>
+#include <rv/rv_uprobe.h>
+#include <uapi/linux/rv.h>
+#include "../../rv.h"
+
+#define MODULE_NAME "tlob"
+
+#include <trace/events/sched.h>
+#include <rv_trace.h>
+
+/*
+ * Per-fd private data; one instance per open /dev/rv fd.
+ * monitoring: set while TRACE_START is active; cleared at TRACE_STOP.
+ * budget_exceeded: set by hrtimer callback; read at TRACE_STOP to report
+ * -EOVERFLOW even when cleanup was claimed by a concurrent stop_all or
+ * a task-exit handler.
+ */
+struct tlob_fpriv {
+ struct task_struct *task;
+ bool monitoring;
+ bool budget_exceeded;
+};
+
+/*
+ * Per-task latency monitoring state. One instance per monitoring window.
+ * Stored as monitor_target in da_monitor_storage; freed via call_rcu.
+ */
+struct tlob_task_state {
+ struct task_struct *task; /* via get_task_struct */
+ u64 threshold_us; /* budget in microseconds */
+
+ /* 1 = cleanup claimed; ha_setup_invariants won't restart the timer. */
+ atomic_t stopping;
+
+ /* Serialises the ns accumulators; held briefly (hardirq-safe). */
+ raw_spinlock_t entry_lock;
+ u64 running_ns; /* time in running state */
+ u64 waiting_ns; /* time in waiting state */
+ u64 sleeping_ns; /* time in sleeping state */
+ ktime_t last_ts;
+
+ /* store-release in TRACE_START ioctl, load-acquire in reset_notify. */
+ struct tlob_fpriv *fpriv;
+
+ struct rcu_head rcu; /* for call_rcu() teardown */
+};
+
+#define RV_MON_TYPE RV_MON_PER_OBJ
+#define HA_TIMER_TYPE HA_TIMER_HRTIMER
+/* Pool mode: da_handle_start_event uses da_fill_empty_storage, not kmalloc. */
+#define DA_SKIP_AUTO_ALLOC
+
+/* Type for da_monitor_storage.target; must be defined before the includes. */
+typedef struct tlob_task_state *monitor_target;
+
+/* Forward-declared so da_monitor_reset_hook works before ha_monitor.h. */
+static inline void tlob_reset_notify(struct da_monitor *da_mon);
+#define da_monitor_reset_hook tlob_reset_notify
+
+/*
+ * When the hrtimer fires (budget elapsed), the HA framework emits
+ * error_env_tlob with this label instead of the generic "none".
+ */
+#define MONITOR_TIMER_EVENT_NAME "budget_exceeded"
+
+#include "tlob.h"
+#include <rv/ha_monitor.h>
+
+/*
+ * Called from da_monitor_reset() on both normal stop and hrtimer expiry.
+ * On violation (stopping==0), emits detail_env_tlob.
+ */
+static inline void tlob_reset_notify(struct da_monitor *da_mon)
+{
+ struct ha_monitor *ha_mon = to_ha_monitor(da_mon);
+ struct tlob_task_state *ws;
+
+ ha_monitor_reset_env(da_mon);
+
+ ws = ha_get_target(ha_mon);
+ if (!ws)
+ return;
+
+ /*
+ * Emit per-state breakdown on budget violation only.
+ * stopping==0: timer callback owns this path (genuine overrun).
+ * stopping==1: normal stop claimed ownership first; skip.
+ */
+ if (!atomic_read(&ws->stopping)) {
+ unsigned int curr_state = READ_ONCE(da_mon->curr_state);
+ u64 running_ns, waiting_ns, sleeping_ns, partial_ns;
+ struct tlob_fpriv *fp;
+ unsigned long flags;
+
+ /*
+ * Snapshot accumulators; partial_ns covers curr_state time
+ * not yet folded in (transition-out pending).
+ */
+ raw_spin_lock_irqsave(&ws->entry_lock, flags);
+ partial_ns = ktime_get_ns() - ktime_to_ns(ws->last_ts);
+ running_ns = ws->running_ns +
+ (curr_state == running_tlob ? partial_ns : 0);
+ waiting_ns = ws->waiting_ns +
+ (curr_state == waiting_tlob ? partial_ns : 0);
+ sleeping_ns = ws->sleeping_ns +
+ (curr_state == sleeping_tlob ? partial_ns : 0);
+ raw_spin_unlock_irqrestore(&ws->entry_lock, flags);
+
+ trace_detail_env_tlob(da_get_id(da_mon), ws->threshold_us,
+ running_ns, waiting_ns, sleeping_ns);
+
+ /*
+ * Latch violation in the fd so TRACE_STOP can return -EOVERFLOW
+ * even if a concurrent stop_all or task-exit handler claims
+ * cleanup first. Pairs with smp_store_release in TRACE_START.
+ */
+ fp = smp_load_acquire(&ws->fpriv);
+ if (fp)
+ WRITE_ONCE(fp->budget_exceeded, true);
+ }
+}
+
+#define BUDGET_US(ha_mon) (ha_get_target(ha_mon)->threshold_us)
+#define BUDGET_NS(ha_mon) (BUDGET_US(ha_mon) * 1000ULL)
+
+/* HA constraint functions (called by ha_monitor_handle_constraint) */
+
+static u64 ha_get_env(struct ha_monitor *ha_mon, enum envs_tlob env, u64 time_ns)
+{
+ if (env == clk_elapsed_tlob)
+ return ha_get_clk_ns(ha_mon, env, time_ns);
+ return ENV_INVALID_VALUE;
+}
+
+static void ha_reset_env(struct ha_monitor *ha_mon, enum envs_tlob env, u64 time_ns)
+{
+ if (env == clk_elapsed_tlob)
+ ha_reset_clk_ns(ha_mon, env, time_ns);
+}
+
+/*
+ * ha_verify_invariants - clk_elapsed < BUDGET_NS must hold in all states.
+ */
+static inline bool ha_verify_invariants(struct ha_monitor *ha_mon,
+ enum states curr_state, enum events event,
+ enum states next_state, u64 time_ns)
+{
+ if (curr_state == running_tlob)
+ return ha_check_invariant_ns(ha_mon, clk_elapsed_tlob, time_ns);
+ else if (curr_state == sleeping_tlob)
+ return ha_check_invariant_ns(ha_mon, clk_elapsed_tlob, time_ns);
+ else if (curr_state == waiting_tlob)
+ return ha_check_invariant_ns(ha_mon, clk_elapsed_tlob, time_ns);
+ return true;
+}
+
+/*
+ * Convert the invariant (deadline) into a guard (reset anchor) on state
+ * transitions. Skip while the clock is uninitialised (ENV_INVALID_VALUE):
+ * racing between da_handle_start_event() and ha_reset_clk_ns() would
+ * otherwise anchor the guard at U64_MAX - BUDGET_NS.
+ */
+static inline void ha_convert_inv_guard(struct ha_monitor *ha_mon,
+ enum states curr_state, enum events event,
+ enum states next_state, u64 time_ns)
+{
+ if (curr_state == next_state)
+ return;
+ if (curr_state == running_tlob &&
+ !ha_monitor_env_invalid(ha_mon, clk_elapsed_tlob))
+ ha_inv_to_guard(ha_mon, clk_elapsed_tlob, BUDGET_NS(ha_mon), time_ns);
+ else if (curr_state == sleeping_tlob &&
+ !ha_monitor_env_invalid(ha_mon, clk_elapsed_tlob))
+ ha_inv_to_guard(ha_mon, clk_elapsed_tlob, BUDGET_NS(ha_mon), time_ns);
+ else if (curr_state == waiting_tlob &&
+ !ha_monitor_env_invalid(ha_mon, clk_elapsed_tlob))
+ ha_inv_to_guard(ha_mon, clk_elapsed_tlob, BUDGET_NS(ha_mon), time_ns);
+}
+
+/* No per-event guard conditions for tlob; invariants suffice. */
+static inline bool ha_verify_guards(struct ha_monitor *ha_mon,
+ enum states curr_state, enum events event,
+ enum states next_state, u64 time_ns)
+{
+ return true;
+}
+
+/*
+ * Arm or cancel the HA budget timer on state transitions.
+ * Guard on stopping: sched_switch events can arrive after ha_cancel_timer_sync,
+ * restarting the timer and triggering an ODEBUG "activate active" splat.
+ */
+static inline void ha_setup_invariants(struct ha_monitor *ha_mon,
+ enum states curr_state, enum events event,
+ enum states next_state, u64 time_ns)
+{
+ if (next_state == curr_state)
+ return;
+ if (next_state == running_tlob) {
+ if (!atomic_read_acquire(&ha_get_target(ha_mon)->stopping))
+ ha_start_timer_ns(ha_mon, clk_elapsed_tlob, BUDGET_NS(ha_mon), time_ns);
+ } else if (next_state == sleeping_tlob) {
+ if (!atomic_read_acquire(&ha_get_target(ha_mon)->stopping))
+ ha_start_timer_ns(ha_mon, clk_elapsed_tlob, BUDGET_NS(ha_mon), time_ns);
+ } else if (next_state == waiting_tlob) {
+ if (!atomic_read_acquire(&ha_get_target(ha_mon)->stopping))
+ ha_start_timer_ns(ha_mon, clk_elapsed_tlob, BUDGET_NS(ha_mon), time_ns);
+ } else if (curr_state == running_tlob)
+ ha_cancel_timer(ha_mon);
+ else if (curr_state == waiting_tlob)
+ ha_cancel_timer(ha_mon);
+ else if (curr_state == sleeping_tlob)
+ ha_cancel_timer(ha_mon);
+}
+
+static bool ha_verify_constraint(struct ha_monitor *ha_mon,
+ enum states curr_state, enum events event,
+ enum states next_state, u64 time_ns)
+{
+ if (!ha_verify_invariants(ha_mon, curr_state, event, next_state, time_ns))
+ return false;
+
+ ha_convert_inv_guard(ha_mon, curr_state, event, next_state, time_ns);
+
+ if (!ha_verify_guards(ha_mon, curr_state, event, next_state, time_ns))
+ return false;
+
+ ha_setup_invariants(ha_mon, curr_state, event, next_state, time_ns);
+
+ return true;
+}
+
+static struct kmem_cache *tlob_state_cache;
+
+static atomic_t tlob_num_monitored = ATOMIC_INIT(0);
+
+/* Uprobe binding list; protected by tlob_uprobe_mutex. */
+static LIST_HEAD(tlob_uprobe_list);
+static DEFINE_MUTEX(tlob_uprobe_mutex);
+
+/*
+ * Serialises duplicate-check + da_create_or_get() to prevent two concurrent
+ * callers for the same pid from both inserting into the hash table.
+ */
+static DEFINE_MUTEX(tlob_start_mutex);
+
+/*
+ * Counts open /dev/rv fds plus one synthetic ref held while enabled.
+ * __tlob_destroy_monitor() drops the synthetic ref and waits for zero
+ * before teardown, preventing kmem_cache_zalloc() on a destroyed cache.
+ */
+static refcount_t tlob_fd_refcount = REFCOUNT_INIT(0);
+static DECLARE_COMPLETION(tlob_fd_released);
+
+/* Per-uprobe-binding state: a start + stop probe pair for one binary region. */
+struct tlob_uprobe_binding {
+ struct list_head list;
+ u64 threshold_us;
+ char binpath[TLOB_MAX_PATH];
+ loff_t offset_start;
+ loff_t offset_stop;
+ struct rv_uprobe *start_probe;
+ struct rv_uprobe *stop_probe;
+};
+
+/* RCU callback: free the slab once no readers remain. */
+static void tlob_free_rcu(struct rcu_head *head)
+{
+ struct tlob_task_state *ws =
+ container_of(head, struct tlob_task_state, rcu);
+ kmem_cache_free(tlob_state_cache, ws);
+}
+
+/*
+ * handle_sched_switch - advance the DA on every context switch.
+ *
+ * Generates three DA events:
+ * prev, prev_state != 0 -> sleep_tlob (running -> sleeping)
+ * prev, prev_state == 0 -> preempt_tlob (running -> waiting)
+ * next -> switch_in_tlob (waiting -> running)
+ */
+static void handle_sched_switch(void *data, bool preempt_unused,
+ struct task_struct *prev,
+ struct task_struct *next,
+ unsigned int prev_state)
+{
+ struct tlob_task_state *ws;
+ unsigned long flags;
+ bool do_prev = false, do_next = false;
+ bool prev_preempted;
+ ktime_t now;
+
+ rcu_read_lock();
+
+ ws = da_get_target_by_id(prev->pid);
+ if (ws) {
+ raw_spin_lock_irqsave(&ws->entry_lock, flags);
+ now = ktime_get();
+ ws->running_ns += ktime_to_ns(ktime_sub(now, ws->last_ts));
+ ws->last_ts = now;
+ /* prev_state == 0: TASK_RUNNING (preempted); != 0: sleeping. */
+ prev_preempted = (prev_state == 0);
+ do_prev = true;
+ raw_spin_unlock_irqrestore(&ws->entry_lock, flags);
+ }
+
+ ws = da_get_target_by_id(next->pid);
+ if (ws) {
+ raw_spin_lock_irqsave(&ws->entry_lock, flags);
+ now = ktime_get();
+ ws->waiting_ns += ktime_to_ns(ktime_sub(now, ws->last_ts));
+ ws->last_ts = now;
+ do_next = true;
+ raw_spin_unlock_irqrestore(&ws->entry_lock, flags);
+ }
+
+ rcu_read_unlock();
+
+ if (do_prev)
+ da_handle_event(prev->pid, NULL,
+ prev_preempted ? preempt_tlob : sleep_tlob);
+ if (do_next)
+ da_handle_event(next->pid, NULL, switch_in_tlob);
+}
+
+/*
+ * handle_sched_wakeup - sleeping -> waiting transition.
+ *
+ * try_to_wake_up() skips TASK_RUNNING tasks, so this never fires for a
+ * task already in running or waiting state.
+ */
+static void handle_sched_wakeup(void *data, struct task_struct *p)
+{
+ struct tlob_task_state *ws;
+ unsigned long flags;
+ bool found = false;
+
+ rcu_read_lock();
+ ws = da_get_target_by_id(p->pid);
+ if (ws) {
+ ktime_t now = ktime_get();
+
+ raw_spin_lock_irqsave(&ws->entry_lock, flags);
+ ws->sleeping_ns += ktime_to_ns(ktime_sub(now, ws->last_ts));
+ ws->last_ts = now;
+ raw_spin_unlock_irqrestore(&ws->entry_lock, flags);
+ found = true;
+ }
+ rcu_read_unlock();
+
+ if (found)
+ da_handle_event(p->pid, NULL, wakeup_tlob);
+}
+
+/*
+ * handle_sched_process_exit - clean up if a task exits without TRACE_STOP.
+ *
+ * Called in do_exit() context; the task still has a valid pid here.
+ */
+static void handle_sched_process_exit(void *data, struct task_struct *p,
+ bool group_dead)
+{
+ struct tlob_task_state *ws;
+ bool found = false;
+
+ rcu_read_lock();
+ ws = da_get_target_by_id(p->pid);
+ found = !!ws;
+ rcu_read_unlock();
+
+ if (found)
+ tlob_stop_task(p);
+}
+
+/**
+ * tlob_start_task - begin monitoring @task with a latency budget.
+ * @task: Task to monitor; may be current or another task.
+ * @threshold_us: Latency budget in microseconds (wall clock:
+ *	running + waiting + sleeping); must be > 0.
+ *
+ * Returns 0, -ENODEV, -ERANGE, -EALREADY, -ENOSPC, or -ENOMEM.
+ */
+int tlob_start_task(struct task_struct *task, u64 threshold_us)
+{
+ struct tlob_task_state *ws_existing;
+ struct tlob_task_state *ws;
+ struct da_monitor *da_mon;
+ struct ha_monitor *ha_mon;
+ u64 now_ns;
+ int ret;
+
+ if (!da_monitor_enabled())
+ return -ENODEV;
+
+ if (threshold_us == 0)
+ return -ERANGE;
+
+ /* Serialise duplicate-check + da_create_or_get for the same pid. */
+ guard(mutex)(&tlob_start_mutex);
+
+ rcu_read_lock();
+ ws_existing = da_get_target_by_id(task->pid);
+ if (ws_existing) {
+ rcu_read_unlock();
+ return -EALREADY;
+ }
+ rcu_read_unlock();
+
+ ws = kmem_cache_zalloc(tlob_state_cache, GFP_KERNEL);
+ if (!ws)
+ return -ENOMEM;
+
+ ws->task = task;
+ get_task_struct(task);
+ ws->threshold_us = threshold_us;
+ ws->last_ts = ktime_get();
+ raw_spin_lock_init(&ws->entry_lock);
+
+ /* Claim a pool slot (no kmalloc; DA_SKIP_AUTO_ALLOC + prealloc). */
+ ret = da_create_or_get(task->pid, ws);
+ if (ret) {
+ put_task_struct(task);
+ kmem_cache_free(tlob_state_cache, ws);
+ return ret;
+ }
+
+ atomic_inc(&tlob_num_monitored);
+
+ /* Hold RCU across handle + timer setup to keep da_mon valid. */
+ rcu_read_lock();
+ da_handle_start_event(task->pid, ws, switch_in_tlob);
+ da_mon = da_get_monitor(task->pid, NULL);
+ if (unlikely(!da_mon)) {
+ /* Slot registered; missing da_mon means concurrent destroy. */
+ rcu_read_unlock();
+ da_destroy_storage(task->pid);
+ atomic_dec(&tlob_num_monitored);
+ put_task_struct(task);
+ kmem_cache_free(tlob_state_cache, ws);
+ return -ENOMEM;
+ }
+ ha_mon = to_ha_monitor(da_mon);
+ now_ns = ktime_get_ns();
+ ha_reset_env(ha_mon, clk_elapsed_tlob, now_ns);
+ ha_start_timer_ns(ha_mon, clk_elapsed_tlob, BUDGET_NS(ha_mon), now_ns);
+ rcu_read_unlock();
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(tlob_start_task);
+
+/**
+ * tlob_stop_task - stop monitoring @task.
+ * @task: Task to stop.
+ *
+ * CAS on ws->stopping (0->1) under RCU claims cleanup ownership;
+ * the winner cancels the timer synchronously and frees all resources.
+ *
+ * Returns 0, -EOVERFLOW (budget exceeded), -ESRCH (not monitored),
+ * or -EAGAIN (concurrent caller claimed cleanup).
+ */
+int tlob_stop_task(struct task_struct *task)
+{
+ struct da_monitor *da_mon;
+ struct ha_monitor *ha_mon;
+ struct tlob_task_state *ws;
+ bool budget_exceeded;
+
+ rcu_read_lock();
+ ws = da_get_target_by_id(task->pid);
+ if (!ws) {
+ rcu_read_unlock();
+ return -ESRCH;
+ }
+
+ da_mon = da_get_monitor(task->pid, NULL);
+ if (unlikely(!da_mon)) {
+ /* ws in hash but da_mon gone; internal inconsistency. */
+ rcu_read_unlock();
+ WARN_ON_ONCE(1);
+ return -ESRCH;
+ }
+
+ ha_mon = to_ha_monitor(da_mon);
+
+ /*
+ * CAS (0->1) claims cleanup ownership under RCU (ws guaranteed valid).
+ * _release pairs with atomic_read_acquire in ha_setup_invariants.
+ */
+ if (atomic_cmpxchg_release(&ws->stopping, 0, 1) != 0) {
+ rcu_read_unlock();
+ return -EAGAIN;
+ }
+
+ rcu_read_unlock();
+
+ /* Wait for in-flight timer callback before reading da_monitoring. */
+ ha_cancel_timer_sync(ha_mon);
+
+ /* Timer fired first -> budget exceeded; otherwise reset normally. */
+ rcu_read_lock();
+ budget_exceeded = !da_monitoring(da_mon);
+ if (!budget_exceeded)
+ da_monitor_reset(da_mon);
+ rcu_read_unlock();
+ da_destroy_storage(task->pid);
+ atomic_dec(&tlob_num_monitored);
+
+ put_task_struct(ws->task);
+ call_rcu(&ws->rcu, tlob_free_rcu);
+ return budget_exceeded ? -EOVERFLOW : 0;
+}
+EXPORT_SYMBOL_GPL(tlob_stop_task);
+
+static void tlob_stop_all(void)
+{
+ struct da_monitor_storage *ms;
+ pid_t pids[TLOB_MAX_MONITORED];
+ int bkt, n = 0;
+
+ /* Snapshot pids under RCU; re-derive ws under a fresh lock below. */
+ rcu_read_lock();
+ hash_for_each_rcu(da_monitor_ht, bkt, ms, node) {
+ if (ms->target && n < TLOB_MAX_MONITORED)
+ pids[n++] = ms->id;
+ }
+ rcu_read_unlock();
+
+ for (int i = 0; i < n; i++) {
+ pid_t pid = pids[i];
+ struct da_monitor *da_mon;
+ struct ha_monitor *ha_mon;
+ struct tlob_task_state *ws;
+
+ rcu_read_lock();
+ da_mon = da_get_monitor(pid, NULL);
+ if (!da_mon) {
+ /* Cleaned up by tlob_stop_task or exit handler. */
+ rcu_read_unlock();
+ continue;
+ }
+
+ ws = da_get_target(da_mon);
+ ha_mon = to_ha_monitor(da_mon);
+
+ /* CAS (0->1) claims ownership; skip if another caller won. */
+ if (atomic_cmpxchg_release(&ws->stopping, 0, 1) != 0) {
+ rcu_read_unlock();
+ continue;
+ }
+ rcu_read_unlock();
+
+ ha_cancel_timer_sync(ha_mon);
+
+ scoped_guard(rcu) {
+ da_monitor_reset(da_mon);
+ }
+ da_destroy_storage(pid);
+ atomic_dec(&tlob_num_monitored);
+ put_task_struct(ws->task);
+ call_rcu(&ws->rcu, tlob_free_rcu);
+ }
+}
+
+static int tlob_uprobe_entry_handler(struct rv_uprobe *p, struct pt_regs *regs,
+ __u64 *data)
+{
+ struct tlob_uprobe_binding *b = p->priv;
+
+ tlob_start_task(current, b->threshold_us);
+ return 0;
+}
+
+static int tlob_uprobe_stop_handler(struct rv_uprobe *p, struct pt_regs *regs,
+ __u64 *data)
+{
+ tlob_stop_task(current);
+ return 0;
+}
+
+/*
+ * Register start + stop entry uprobes for a binding.
+ * Called with tlob_uprobe_mutex held.
+ */
+static int tlob_add_uprobe(u64 threshold_us, const char *binpath,
+ loff_t offset_start, loff_t offset_stop)
+{
+ struct tlob_uprobe_binding *b, *tmp_b;
+ char pathbuf[TLOB_MAX_PATH];
+ struct path path;
+ char *canon;
+ int ret;
+
+ if (binpath[0] != '/')
+ return -EINVAL;
+
+ b = kzalloc_obj(*b, GFP_KERNEL);
+ if (!b)
+ return -ENOMEM;
+
+ b->threshold_us = threshold_us;
+ b->offset_start = offset_start;
+ b->offset_stop = offset_stop;
+
+ ret = kern_path(binpath, LOOKUP_FOLLOW, &path);
+ if (ret)
+ goto err_free;
+
+ if (!d_is_reg(path.dentry)) {
+ ret = -EINVAL;
+ goto err_path;
+ }
+
+ /* Reject duplicate start offset for the same binary. */
+ list_for_each_entry(tmp_b, &tlob_uprobe_list, list) {
+ if (tmp_b->offset_start == offset_start &&
+ tmp_b->start_probe->path.dentry == path.dentry) {
+ ret = -EEXIST;
+ goto err_path;
+ }
+ }
+
+ canon = d_path(&path, pathbuf, sizeof(pathbuf));
+ if (IS_ERR(canon)) {
+ ret = PTR_ERR(canon);
+ goto err_path;
+ }
+ strscpy(b->binpath, canon, sizeof(b->binpath));
+
+ /* Both probes share b (priv) and path; attach_path refs path itself. */
+ b->start_probe = rv_uprobe_attach_path(&path, offset_start,
+ tlob_uprobe_entry_handler, NULL, b);
+ if (IS_ERR(b->start_probe)) {
+ ret = PTR_ERR(b->start_probe);
+ b->start_probe = NULL;
+ goto err_path;
+ }
+
+ b->stop_probe = rv_uprobe_attach_path(&path, offset_stop,
+ tlob_uprobe_stop_handler, NULL, b);
+ if (IS_ERR(b->stop_probe)) {
+ ret = PTR_ERR(b->stop_probe);
+ b->stop_probe = NULL;
+ goto err_start;
+ }
+
+ path_put(&path);
+ list_add_tail(&b->list, &tlob_uprobe_list);
+ return 0;
+
+err_start:
+ rv_uprobe_detach(b->start_probe);
+err_path:
+ path_put(&path);
+err_free:
+ kfree(b);
+ return ret;
+}
+
+static int tlob_remove_uprobe_by_key(loff_t offset_start, const char *binpath)
+{
+ struct tlob_uprobe_binding *b, *tmp;
+ struct path remove_path;
+ int ret;
+
+ ret = kern_path(binpath, LOOKUP_FOLLOW, &remove_path);
+ if (ret)
+ return ret;
+
+ ret = -ENOENT;
+ list_for_each_entry_safe(b, tmp, &tlob_uprobe_list, list) {
+ if (b->offset_start != offset_start)
+ continue;
+ if (b->start_probe->path.dentry != remove_path.dentry)
+ continue;
+ list_del(&b->list);
+ rv_uprobe_detach(b->start_probe);
+ rv_uprobe_detach(b->stop_probe);
+ kfree(b);
+ ret = 0;
+ break;
+ }
+
+ path_put(&remove_path);
+ return ret;
+}
+
+static void tlob_remove_all_uprobes(void)
+{
+ struct tlob_uprobe_binding *b, *tmp;
+ LIST_HEAD(pending);
+
+ mutex_lock(&tlob_uprobe_mutex);
+ list_for_each_entry_safe(b, tmp, &tlob_uprobe_list, list) {
+ list_move(&b->list, &pending);
+ rv_uprobe_unregister_nosync(b->start_probe);
+ rv_uprobe_unregister_nosync(b->stop_probe);
+ }
+ mutex_unlock(&tlob_uprobe_mutex);
+
+ if (list_empty(&pending))
+ return;
+
+ /*
+ * One global barrier for all probes dequeued above; no new handlers
+ * for any of them can fire after this returns.
+ */
+ rv_uprobe_sync();
+
+ list_for_each_entry_safe(b, tmp, &pending, list) {
+ rv_uprobe_free(b->start_probe);
+ rv_uprobe_free(b->stop_probe);
+ kfree(b);
+ }
+}
+
+static ssize_t tlob_monitor_read(struct file *file,
+ char __user *ubuf,
+ size_t count, loff_t *ppos)
+{
+ const int line_sz = TLOB_MAX_PATH + 128;
+ struct tlob_uprobe_binding *b;
+ char *buf, *p;
+ int n = 0, buf_sz, pos = 0;
+ ssize_t ret;
+
+ mutex_lock(&tlob_uprobe_mutex);
+ list_for_each_entry(b, &tlob_uprobe_list, list)
+ n++;
+
+ buf_sz = (n ? n : 1) * line_sz + 1;
+ buf = kmalloc(buf_sz, GFP_KERNEL);
+ if (!buf) {
+ mutex_unlock(&tlob_uprobe_mutex);
+ return -ENOMEM;
+ }
+
+ list_for_each_entry(b, &tlob_uprobe_list, list) {
+ p = b->binpath;
+ pos += scnprintf(buf + pos, buf_sz - pos,
+ "p %s:0x%llx 0x%llx threshold=%llu\n",
+ p,
+ (unsigned long long)b->offset_start,
+ (unsigned long long)b->offset_stop,
+ b->threshold_us);
+ }
+ mutex_unlock(&tlob_uprobe_mutex);
+
+ ret = simple_read_from_buffer(ubuf, count, ppos, buf, pos);
+ kfree(buf);
+ return ret;
+}
+
+/*
+ * Parse "p PATH:OFFSET_START OFFSET_STOP threshold=US".
+ * PATH may contain ':'; the last ':' separates path from offset.
+ * Returns 0 or -EINVAL.
+ */
+static int tlob_parse_uprobe_line(char *buf, u64 *thr_out,
+ char **path_out,
+ loff_t *start_out, loff_t *stop_out)
+{
+ unsigned long long thr = 0, stop_val = 0;
+ long long start_val;
+ char *p, *path_token, *token, *colon;
+ bool got_stop = false, got_thr = false;
+ int n;
+
+ /* Must start with "p " */
+ if (buf[0] != 'p' || buf[1] != ' ')
+ return -EINVAL;
+
+ p = buf + 2;
+ while (*p == ' ')
+ p++;
+
+ /* First space-delimited token is PATH:OFFSET_START */
+ path_token = strsep(&p, " \t");
+ if (!path_token || !*path_token)
+ return -EINVAL;
+
+ /* Split at last ':' to handle paths that contain ':'. */
+ colon = strrchr(path_token, ':');
+ if (!colon || colon - path_token < 2)
+ return -EINVAL;
+ *colon = '\0';
+
+ if (path_token[0] != '/')
+ return -EINVAL;
+
+ n = 0;
+ if (sscanf(colon + 1, "%lli%n", &start_val, &n) != 1 || n == 0)
+ return -EINVAL;
+ if (start_val < 0)
+ return -EINVAL;
+
+ /* Remaining tokens: OFFSET_STOP threshold=US */
+ while (p && (token = strsep(&p, " \t")) != NULL) {
+ if (!*token)
+ continue;
+ if (strncmp(token, "threshold=", 10) == 0) {
+ if (kstrtoull(token + 10, 0, &thr))
+ return -EINVAL;
+ got_thr = true;
+ } else if (!got_stop) {
+ long long sv;
+
+ n = 0;
+ if (sscanf(token, "%lli%n", &sv, &n) != 1 || n == 0)
+ return -EINVAL;
+ if (sv < 0)
+ return -EINVAL;
+ stop_val = (unsigned long long)sv;
+ got_stop = true;
+ } else {
+ return -EINVAL;
+ }
+ }
+
+ if (!got_stop || !got_thr || thr == 0)
+ return -EINVAL;
+ if (start_val == (long long)stop_val)
+ return -EINVAL;
+
+ *thr_out = thr;
+ *path_out = path_token;
+ *start_out = (loff_t)start_val;
+ *stop_out = (loff_t)stop_val;
+ return 0;
+}
+
+/* Parse "-PATH:OFFSET_START" (ftrace uprobe_events removal convention). */
+static int tlob_parse_remove_line(char *buf, char **path_out, loff_t *start_out)
+{
+ char *binpath, *colon;
+ long long off;
+ int n = 0;
+
+ if (buf[0] != '-')
+ return -EINVAL;
+ binpath = buf + 1;
+ if (binpath[0] != '/')
+ return -EINVAL;
+ colon = strrchr(binpath, ':');
+ if (!colon || colon - binpath < 2)
+ return -EINVAL;
+ *colon = '\0';
+ if (sscanf(colon + 1, "%lli%n", &off, &n) != 1 || n == 0)
+ return -EINVAL;
+ *path_out = binpath;
+ *start_out = (loff_t)off;
+ return 0;
+}
+
+VISIBLE_IF_KUNIT int tlob_create_or_delete_uprobe(char *buf)
+{
+ loff_t offset_start, offset_stop;
+ u64 threshold_us;
+ char *binpath;
+ int ret;
+
+ if (buf[0] == '-') {
+ ret = tlob_parse_remove_line(buf, &binpath, &offset_start);
+ if (ret)
+ return ret;
+ mutex_lock(&tlob_uprobe_mutex);
+ ret = tlob_remove_uprobe_by_key(offset_start, binpath);
+ mutex_unlock(&tlob_uprobe_mutex);
+ return ret;
+ }
+ ret = tlob_parse_uprobe_line(buf, &threshold_us, &binpath,
+ &offset_start, &offset_stop);
+ if (ret)
+ return ret;
+ mutex_lock(&tlob_uprobe_mutex);
+ ret = tlob_add_uprobe(threshold_us, binpath, offset_start, offset_stop);
+ mutex_unlock(&tlob_uprobe_mutex);
+ return ret;
+}
+EXPORT_SYMBOL_IF_KUNIT(tlob_create_or_delete_uprobe);
+
+static ssize_t tlob_monitor_write(struct file *file,
+ const char __user *ubuf,
+ size_t count, loff_t *ppos)
+{
+ char buf[TLOB_MAX_PATH + 128];
+
+ if (count >= sizeof(buf))
+ return -EINVAL;
+ if (copy_from_user(buf, ubuf, count))
+ return -EFAULT;
+ buf[count] = '\0';
+ if (count > 0 && buf[count - 1] == '\n')
+ buf[count - 1] = '\0';
+ return tlob_create_or_delete_uprobe(buf) ?: (ssize_t)count;
+}
+
+static const struct file_operations tlob_monitor_fops = {
+ .open = simple_open,
+ .read = tlob_monitor_read,
+ .write = tlob_monitor_write,
+ .llseek = noop_llseek,
+};
+
+static int __tlob_init_monitor(void)
+{
+ int retval;
+
+ tlob_state_cache = kmem_cache_create("tlob_task_state",
+ sizeof(struct tlob_task_state),
+ 0, 0, NULL);
+ if (!tlob_state_cache)
+ return -ENOMEM;
+
+ atomic_set(&tlob_num_monitored, 0);
+
+ retval = da_monitor_init_prealloc(TLOB_MAX_MONITORED);
+ if (retval) {
+ kmem_cache_destroy(tlob_state_cache);
+ tlob_state_cache = NULL;
+ return retval;
+ }
+
+ /* Synthetic reference: held while the monitor is enabled. */
+ reinit_completion(&tlob_fd_released);
+ refcount_set(&tlob_fd_refcount, 1);
+
+ rv_this.enabled = 1;
+ return 0;
+}
+
+static void __tlob_destroy_monitor(void)
+{
+ rv_this.enabled = 0;
+ /*
+ * Remove uprobes first so stop_task can't race with tlob_stop_all().
+ * rv_uprobe_sync() inside ensures all in-flight handlers have finished.
+ */
+ tlob_remove_all_uprobes();
+ tlob_stop_all();
+ /* Wait for tlob_free_rcu and da_pool_return_cb before pool teardown. */
+ synchronize_rcu();
+
+ /*
+ * Drop the synthetic ref and wait for all open fds to close before
+ * teardown; prevents kmem_cache_zalloc() on the destroyed cache.
+ */
+ if (!refcount_dec_and_test(&tlob_fd_refcount))
+ wait_for_completion(&tlob_fd_released);
+
+ da_monitor_destroy();
+ kmem_cache_destroy(tlob_state_cache);
+ tlob_state_cache = NULL;
+}
+
+/* KUnit wrappers that acquire rv_interface_lock around monitor init/destroy. */
+#if IS_ENABLED(CONFIG_KUNIT)
+int tlob_init_monitor(void)
+{
+ int ret;
+
+ mutex_lock(&rv_interface_lock);
+ ret = __tlob_init_monitor();
+ mutex_unlock(&rv_interface_lock);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(tlob_init_monitor);
+
+void tlob_destroy_monitor(void)
+{
+ mutex_lock(&rv_interface_lock);
+ __tlob_destroy_monitor();
+ mutex_unlock(&rv_interface_lock);
+}
+EXPORT_SYMBOL_GPL(tlob_destroy_monitor);
+
+int tlob_num_monitored_read(void)
+{
+ return atomic_read(&tlob_num_monitored);
+}
+EXPORT_SYMBOL_IF_KUNIT(tlob_num_monitored_read);
+
+/* Tracepoint probes for KUnit: capture the most recent event and error_env. */
+static struct tlob_captured_event tlob_kunit_last_event;
+static struct tlob_captured_error_env tlob_kunit_last_error_env;
+static atomic_t tlob_kunit_event_cnt = ATOMIC_INIT(0);
+static atomic_t tlob_kunit_error_env_cnt = ATOMIC_INIT(0);
+
+static void tlob_kunit_event_probe(void *data, int id, char *state, char *event,
+ char *next_state, bool final_state)
+{
+ tlob_kunit_last_event.id = id;
+ strscpy(tlob_kunit_last_event.state, state,
+ sizeof(tlob_kunit_last_event.state));
+ strscpy(tlob_kunit_last_event.event, event,
+ sizeof(tlob_kunit_last_event.event));
+ strscpy(tlob_kunit_last_event.next_state, next_state,
+ sizeof(tlob_kunit_last_event.next_state));
+ tlob_kunit_last_event.final_state = final_state;
+ atomic_inc(&tlob_kunit_event_cnt);
+}
+
+static void tlob_kunit_error_env_probe(void *data, int id, char *state,
+ char *event, char *env)
+{
+ tlob_kunit_last_error_env.id = id;
+ strscpy(tlob_kunit_last_error_env.state, state,
+ sizeof(tlob_kunit_last_error_env.state));
+ strscpy(tlob_kunit_last_error_env.event, event,
+ sizeof(tlob_kunit_last_error_env.event));
+ strscpy(tlob_kunit_last_error_env.env, env,
+ sizeof(tlob_kunit_last_error_env.env));
+ atomic_inc(&tlob_kunit_error_env_cnt);
+}
+
+int tlob_register_kunit_probes(void)
+{
+ int ret;
+
+ atomic_set(&tlob_kunit_event_cnt, 0);
+ atomic_set(&tlob_kunit_error_env_cnt, 0);
+
+ ret = register_trace_event_tlob(tlob_kunit_event_probe, NULL);
+ if (ret)
+ return ret;
+ ret = register_trace_error_env_tlob(tlob_kunit_error_env_probe, NULL);
+ if (ret) {
+ unregister_trace_event_tlob(tlob_kunit_event_probe, NULL);
+ return ret;
+ }
+ return 0;
+}
+EXPORT_SYMBOL_IF_KUNIT(tlob_register_kunit_probes);
+
+void tlob_unregister_kunit_probes(void)
+{
+ unregister_trace_event_tlob(tlob_kunit_event_probe, NULL);
+ unregister_trace_error_env_tlob(tlob_kunit_error_env_probe, NULL);
+ tracepoint_synchronize_unregister();
+}
+EXPORT_SYMBOL_IF_KUNIT(tlob_unregister_kunit_probes);
+
+int tlob_event_count_read(void)
+{
+ return atomic_read(&tlob_kunit_event_cnt);
+}
+EXPORT_SYMBOL_IF_KUNIT(tlob_event_count_read);
+
+void tlob_event_count_reset(void)
+{
+ atomic_set(&tlob_kunit_event_cnt, 0);
+}
+EXPORT_SYMBOL_IF_KUNIT(tlob_event_count_reset);
+
+int tlob_error_env_count_read(void)
+{
+ return atomic_read(&tlob_kunit_error_env_cnt);
+}
+EXPORT_SYMBOL_IF_KUNIT(tlob_error_env_count_read);
+
+void tlob_error_env_count_reset(void)
+{
+ atomic_set(&tlob_kunit_error_env_cnt, 0);
+}
+EXPORT_SYMBOL_IF_KUNIT(tlob_error_env_count_reset);
+
+const struct tlob_captured_event *tlob_last_event_read(void)
+{
+ return &tlob_kunit_last_event;
+}
+EXPORT_SYMBOL_IF_KUNIT(tlob_last_event_read);
+
+const struct tlob_captured_error_env *tlob_last_error_env_read(void)
+{
+ return &tlob_kunit_last_error_env;
+}
+EXPORT_SYMBOL_IF_KUNIT(tlob_last_error_env_read);
+
+#endif /* CONFIG_KUNIT */
+
+VISIBLE_IF_KUNIT int tlob_enable_hooks(void)
+{
+ rv_attach_trace_probe("tlob", sched_switch, handle_sched_switch);
+ rv_attach_trace_probe("tlob", sched_wakeup, handle_sched_wakeup);
+ rv_attach_trace_probe("tlob", sched_process_exit, handle_sched_process_exit);
+ return 0;
+}
+EXPORT_SYMBOL_IF_KUNIT(tlob_enable_hooks);
+
+VISIBLE_IF_KUNIT void tlob_disable_hooks(void)
+{
+ rv_detach_trace_probe("tlob", sched_switch, handle_sched_switch);
+ rv_detach_trace_probe("tlob", sched_wakeup, handle_sched_wakeup);
+ rv_detach_trace_probe("tlob", sched_process_exit, handle_sched_process_exit);
+}
+EXPORT_SYMBOL_IF_KUNIT(tlob_disable_hooks);
+
+static int enable_tlob(void)
+{
+ int retval;
+
+ retval = __tlob_init_monitor();
+ if (retval)
+ return retval;
+
+ return tlob_enable_hooks();
+}
+
+static void disable_tlob(void)
+{
+ tlob_disable_hooks();
+ __tlob_destroy_monitor();
+}
+
+static struct rv_monitor rv_this = {
+ .name = "tlob",
+ .description = "Per-task latency-over-budget monitor.",
+ .enable = enable_tlob,
+ .disable = disable_tlob,
+ .reset = da_monitor_reset_all,
+ .enabled = 0,
+};
+
+static void *tlob_chardev_bind(void)
+{
+ struct tlob_fpriv *fp;
+
+ fp = kzalloc_obj(*fp, GFP_KERNEL);
+ if (!fp)
+ return ERR_PTR(-ENOMEM);
+
+ /*
+ * Pin the cache/pool for the fd's lifetime; balanced in
+ * tlob_chardev_release(). If the synthetic ref has already been dropped
+ * (__tlob_destroy_monitor() ran to completion), reject the bind so the
+ * caller gets -ENODEV instead of corrupting a zero refcount.
+ */
+ if (!refcount_inc_not_zero(&tlob_fd_refcount)) {
+ kfree(fp);
+ return ERR_PTR(-ENODEV);
+ }
+ return fp;
+}
+
+static void tlob_chardev_release(void *priv)
+{
+ struct tlob_fpriv *fp = priv;
+
+ if (fp->monitoring) {
+ /* All return values are safe on close. */
+ (void)tlob_stop_task(fp->task);
+ put_task_struct(fp->task);
+ }
+
+ kfree(fp);
+
+ /* Release fd's pin; if last, wake __tlob_destroy_monitor. */
+ if (refcount_dec_and_test(&tlob_fd_refcount))
+ complete(&tlob_fd_released);
+}
+
+static long tlob_chardev_ioctl(void *priv, unsigned int cmd, unsigned long arg)
+{
+ struct tlob_fpriv *fp = priv;
+ struct tlob_start_args args;
+ struct task_struct *task;
+ int ret;
+
+ switch (cmd) {
+ case TLOB_IOCTL_TRACE_START:
+ if (fp->monitoring)
+ return -EALREADY;
+
+ if (copy_from_user(&args, (void __user *)arg, sizeof(args)))
+ return -EFAULT;
+
+ ret = tlob_start_task(current, args.threshold_us);
+ if (ret)
+ return ret;
+
+ fp->task = current;
+ get_task_struct(current);
+ fp->budget_exceeded = false;
+
+ /* Link fd so hrtimer callback can latch budget_exceeded. */
+ scoped_guard(rcu) {
+ struct tlob_task_state *ws = da_get_target_by_id(current->pid);
+
+ if (ws)
+ smp_store_release(&ws->fpriv, fp);
+ }
+
+ fp->monitoring = true;
+ return 0;
+
+ case TLOB_IOCTL_TRACE_STOP:
+ if (!fp->monitoring)
+ return -EINVAL;
+
+ task = fp->task;
+ fp->monitoring = false;
+ fp->task = NULL;
+
+ ret = tlob_stop_task(task);
+ put_task_struct(task);
+
+ /*
+ * -EOVERFLOW: budget exceeded; propagate to caller.
+ * -EAGAIN: concurrent stop_all claimed cleanup; fall through to
+ * budget_exceeded latch set by the hrtimer callback.
+ * -ESRCH: task exited before TRACE_STOP (process-exit handler
+ * claimed cleanup); same latch applies. Not an internal error.
+ */
+ if (ret == -EAGAIN || ret == -ESRCH)
+ return READ_ONCE(fp->budget_exceeded) ? -EOVERFLOW : 0;
+ return ret;
+
+ default:
+ return -ENOTTY;
+ }
+}
+
+static const struct rv_chardev_ops tlob_chardev_ops = {
+ .owner = THIS_MODULE,
+ .bind = tlob_chardev_bind,
+ .ioctl = tlob_chardev_ioctl,
+ .release = tlob_chardev_release,
+};
+
+static int __init register_tlob(void)
+{
+ int ret;
+
+ ret = rv_chardev_register_monitor("tlob", &tlob_chardev_ops);
+ if (ret)
+ return ret;
+
+ ret = rv_register_monitor(&rv_this, NULL);
+ if (ret) {
+ rv_chardev_unregister_monitor("tlob");
+ return ret;
+ }
+
+ if (rv_this.root_d) {
+ if (!tracefs_create_file("monitor", 0644, rv_this.root_d, NULL,
+ &tlob_monitor_fops)) {
+ rv_unregister_monitor(&rv_this);
+ rv_chardev_unregister_monitor("tlob");
+ return -ENOMEM;
+ }
+ }
+
+ return 0;
+}
+
+static void __exit unregister_tlob(void)
+{
+ rv_chardev_unregister_monitor("tlob");
+ rv_unregister_monitor(&rv_this);
+}
+
+module_init(register_tlob);
+module_exit(unregister_tlob);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Wen Yang <wen.yang@linux.dev>");
+MODULE_DESCRIPTION("tlob: task latency over budget per-task monitor.");
diff --git a/kernel/trace/rv/monitors/tlob/tlob.h b/kernel/trace/rv/monitors/tlob/tlob.h
new file mode 100644
index 000000000000..71c1735d27d2
--- /dev/null
+++ b/kernel/trace/rv/monitors/tlob/tlob.h
@@ -0,0 +1,171 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _RV_TLOB_H
+#define _RV_TLOB_H
+
+/*
+ * C representation of the tlob hybrid automaton.
+ *
+ * Three-state HA following sched_stat / wwnr monitor naming conventions:
+ *
+ * running (initial) - task is executing on CPU [sched_stat: runtime]
+ * waiting - task is in runqueue, awaiting CPU [sched_stat: wait ]
+ * sleeping - task is blocked, awaiting resource [sched_stat: sleep ]
+ *
+ * Events (derived from sched_switch / sched_wakeup tracepoints):
+ * sleep - sched_switch, prev_state != 0 running → sleeping
+ * preempt - sched_switch, prev_state == 0 running → waiting
+ * wakeup - sched_wakeup sleeping → waiting
+ * switch_in - sched_switch, next == task waiting → running
+ *
+ * One HA clock invariant:
+ * clk_elapsed < BUDGET_NS() active in all states (total latency budget)
+ *
+ * task_start and task_stop are NOT DA events:
+ * task_start calls da_handle_start_event() to set initial state, then
+ * ha_reset_clk_ns() + ha_start_timer_ns() to initialise the clock and arm
+ * the timer directly.
+ * task_stop calls hrtimer_cancel() + da_monitor_reset() directly.
+ *
+ * For the format description see:
+ * Documentation/trace/rv/deterministic_automata.rst
+ */
+
+#include <linux/rv.h>
+#include <linux/sched.h>
+
+#define MONITOR_NAME tlob
+
+enum states_tlob {
+ running_tlob,
+ waiting_tlob,
+ sleeping_tlob,
+ state_max_tlob,
+};
+
+#define INVALID_STATE state_max_tlob
+
+enum events_tlob {
+ sleep_tlob,
+ preempt_tlob,
+ wakeup_tlob,
+ switch_in_tlob,
+ event_max_tlob,
+};
+
+/*
+ * HA environment variable: clk_elapsed is the only clock.
+ * It measures wall-clock time since task_start and is active in all states.
+ */
+enum envs_tlob {
+ clk_elapsed_tlob,
+ env_max_tlob,
+ env_max_stored_tlob = env_max_tlob,
+};
+
+_Static_assert(env_max_stored_tlob <= MAX_HA_ENV_LEN, "Not enough slots");
+#define HA_CLK_NS
+
+struct automaton_tlob {
+ char *state_names[state_max_tlob];
+ char *event_names[event_max_tlob];
+ char *env_names[env_max_tlob];
+ unsigned char function[state_max_tlob][event_max_tlob];
+ unsigned char initial_state;
+ bool final_states[state_max_tlob];
+};
+
+static const struct automaton_tlob automaton_tlob = {
+ .state_names = {
+ "running",
+ "waiting",
+ "sleeping",
+ },
+ .event_names = {
+ "sleep",
+ "preempt",
+ "wakeup",
+ "switch_in",
+ },
+ .env_names = {
+ "clk_elapsed",
+ },
+ .function = {
+ /* running */
+ {
+ sleeping_tlob, /* sleep (sched_switch, prev_state != 0) */
+ waiting_tlob, /* preempt (sched_switch, prev_state == 0) */
+ INVALID_STATE, /* wakeup (TASK_RUNNING can't be woken) */
+ INVALID_STATE, /* switch_in (already on CPU) */
+ },
+ /* waiting */
+ {
+ INVALID_STATE, /* sleep (not on CPU) */
+ INVALID_STATE, /* preempt (not on CPU) */
+ INVALID_STATE, /* wakeup (already TASK_RUNNING) */
+ running_tlob, /* switch_in */
+ },
+ /* sleeping */
+ {
+ INVALID_STATE, /* sleep (already sleeping) */
+ INVALID_STATE, /* preempt (not on CPU) */
+ waiting_tlob, /* wakeup */
+ INVALID_STATE, /* switch_in (must go through waiting first) */
+ },
+ },
+ .initial_state = running_tlob,
+ .final_states = { 1, 0, 0 },
+};
+
+/* Maximum number of concurrently monitored tasks. */
+#define TLOB_MAX_MONITORED 64U
+
+/* Maximum binary path length for uprobe binding. */
+#define TLOB_MAX_PATH 256
+
+/* Exported to ioctl/uprobe layers and KUnit */
+int tlob_start_task(struct task_struct *task, u64 threshold_us);
+int tlob_stop_task(struct task_struct *task);
+
+#if IS_ENABLED(CONFIG_KUNIT)
+int tlob_init_monitor(void);
+void tlob_destroy_monitor(void);
+int tlob_enable_hooks(void);
+void tlob_disable_hooks(void);
+int tlob_create_or_delete_uprobe(char *buf);
+int tlob_num_monitored_read(void);
+
+struct tlob_captured_event {
+ int id;
+ char state[16];
+ char event[16];
+ char next_state[16];
+ bool final_state;
+};
+
+struct tlob_captured_error_env {
+ int id;
+ char state[16];
+ char event[16];
+ char env[64];
+};
+
+struct tlob_captured_detail {
+ int pid;
+ u64 threshold_us;
+ u64 running_ns;
+ u64 waiting_ns;
+ u64 sleeping_ns;
+};
+
+int tlob_register_kunit_probes(void);
+void tlob_unregister_kunit_probes(void);
+int tlob_event_count_read(void);
+void tlob_event_count_reset(void);
+int tlob_error_env_count_read(void);
+void tlob_error_env_count_reset(void);
+const struct tlob_captured_event *tlob_last_event_read(void);
+const struct tlob_captured_error_env *tlob_last_error_env_read(void);
+const struct tlob_captured_detail *tlob_last_detail_read(void);
+#endif /* CONFIG_KUNIT */
+
+#endif /* _RV_TLOB_H */
diff --git a/kernel/trace/rv/monitors/tlob/tlob_trace.h b/kernel/trace/rv/monitors/tlob/tlob_trace.h
new file mode 100644
index 000000000000..08d34e1b0ab8
--- /dev/null
+++ b/kernel/trace/rv/monitors/tlob/tlob_trace.h
@@ -0,0 +1,58 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+/*
+ * Snippet to be included in rv_trace.h for tlob tracepoints.
+ *
+ * event_tlob and error_tlob are defined on the event_da_monitor_id and
+ * error_da_monitor_id classes, following the same pattern as nomiss.
+ * error_env_tlob carries the environment variable name that caused the
+ * clock-invariant violation (budget exceeded).
+ * The id field carries the pid of the monitored task.
+ */
+
+#ifdef CONFIG_RV_MON_TLOB
+/* id is the pid of the monitored task */
+DEFINE_EVENT(event_da_monitor_id, event_tlob,
+ TP_PROTO(int id, char *state, char *event, char *next_state, bool final_state),
+ TP_ARGS(id, state, event, next_state, final_state));
+
+DEFINE_EVENT(error_da_monitor_id, error_tlob,
+ TP_PROTO(int id, char *state, char *event),
+ TP_ARGS(id, state, event));
+
+DEFINE_EVENT(error_env_da_monitor_id, error_env_tlob,
+ TP_PROTO(int id, char *state, char *event, char *env),
+ TP_ARGS(id, state, event, env));
+
+/*
+ * detail_env_tlob - per-state time breakdown emitted alongside error_env_tlob.
+ *
+ * Fired once per budget violation, immediately after error_env_tlob, from
+ * the hrtimer callback (hardirq context). The three _ns fields sum to
+ * approximately threshold_us * 1000; any rounding comes from the partial
+ * time accumulated in the current state since the last transition.
+ */
+TRACE_EVENT(detail_env_tlob,
+ TP_PROTO(int pid, u64 threshold_us,
+ u64 running_ns, u64 waiting_ns, u64 sleeping_ns),
+ TP_ARGS(pid, threshold_us, running_ns, waiting_ns, sleeping_ns),
+ TP_STRUCT__entry(
+ __field(int, pid)
+ __field(u64, threshold_us)
+ __field(u64, running_ns)
+ __field(u64, waiting_ns)
+ __field(u64, sleeping_ns)
+ ),
+ TP_fast_assign(
+ __entry->pid = pid;
+ __entry->threshold_us = threshold_us;
+ __entry->running_ns = running_ns;
+ __entry->waiting_ns = waiting_ns;
+ __entry->sleeping_ns = sleeping_ns;
+ ),
+ TP_printk("pid=%d threshold_us=%llu running_ns=%llu waiting_ns=%llu sleeping_ns=%llu",
+ __entry->pid, __entry->threshold_us,
+ __entry->running_ns, __entry->waiting_ns,
+ __entry->sleeping_ns)
+);
+#endif /* CONFIG_RV_MON_TLOB */
diff --git a/kernel/trace/rv/rv.c b/kernel/trace/rv/rv.c
index ee4e68102f17..a45c4763dbe5 100644
--- a/kernel/trace/rv/rv.c
+++ b/kernel/trace/rv/rv.c
@@ -142,10 +142,17 @@
#include <linux/module.h>
#include <linux/init.h>
#include <linux/slab.h>
+#include <kunit/visibility.h>
#ifdef CONFIG_RV_MON_EVENTS
#define CREATE_TRACE_POINTS
#include <rv_trace.h>
+
+#ifdef CONFIG_RV_MON_TLOB
+EXPORT_TRACEPOINT_SYMBOL_GPL(error_tlob);
+EXPORT_TRACEPOINT_SYMBOL_GPL(event_tlob);
+EXPORT_TRACEPOINT_SYMBOL_GPL(error_env_tlob);
+#endif
#endif
#include "rv.h"
@@ -696,6 +703,33 @@ static void turn_monitoring_on(void)
WRITE_ONCE(monitoring_on, true);
}
+#if IS_ENABLED(CONFIG_KUNIT)
+/**
+ * rv_kunit_monitoring_on - enable the global monitoring_on flag for KUnit tests.
+ *
+ * KUnit test suite_init functions must call this before initialising any
+ * monitor, mirroring the turn_monitoring_on() call in rv_init_interface().
+ * The matching rv_kunit_monitoring_off() must be called in suite_exit to
+ * restore the flag so that test suites do not interfere with each other.
+ */
+void rv_kunit_monitoring_on(void)
+{
+ turn_monitoring_on();
+}
+EXPORT_SYMBOL_IF_KUNIT(rv_kunit_monitoring_on);
+
+/**
+ * rv_kunit_monitoring_off - disable the global monitoring_on flag for KUnit tests.
+ *
+ * Must be called in suite_exit to restore global state after rv_kunit_monitoring_on().
+ */
+void rv_kunit_monitoring_off(void)
+{
+ turn_monitoring_off();
+}
+EXPORT_SYMBOL_IF_KUNIT(rv_kunit_monitoring_off);
+#endif /* CONFIG_KUNIT */
+
static void turn_monitoring_on_with_reset(void)
{
lockdep_assert_held(&rv_interface_lock);
@@ -846,6 +880,10 @@ int __init rv_init_interface(void)
if (retval)
return 1;
+ retval = rv_chardev_init();
+ if (retval)
+ return 1;
+
turn_monitoring_on();
rv_root.root_dir = no_free_ptr(root_dir);
diff --git a/kernel/trace/rv/rv.h b/kernel/trace/rv/rv.h
index 2c0f51ff9d5c..82c9a2b57596 100644
--- a/kernel/trace/rv/rv.h
+++ b/kernel/trace/rv/rv.h
@@ -31,6 +31,8 @@ int rv_enable_monitor(struct rv_monitor *mon);
bool rv_is_container_monitor(struct rv_monitor *mon);
bool rv_is_nested_monitor(struct rv_monitor *mon);
+int rv_chardev_init(void);
+
#ifdef CONFIG_RV_REACTORS
int reactor_populate_monitor(struct rv_monitor *mon, struct dentry *root);
int init_rv_reactors(struct dentry *root_dir);
diff --git a/kernel/trace/rv/rv_chardev.c b/kernel/trace/rv/rv_chardev.c
new file mode 100644
index 000000000000..1fba1642ebc1
--- /dev/null
+++ b/kernel/trace/rv/rv_chardev.c
@@ -0,0 +1,201 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/miscdevice.h>
+#include <linux/module.h>
+#include <linux/poll.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+#include <linux/rv.h>
+#include <uapi/linux/rv.h>
+
+#include "rv.h"
+
+static_assert(MAX_RV_MONITOR_NAME_SIZE == RV_MONITOR_NAME_MAX,
+ "RV internal and UAPI monitor name size constants must match");
+
+struct rv_fd_priv {
+ const struct rv_chardev_ops *ops;
+ void *monitor_priv;
+};
+
+struct rv_chardev_entry {
+ char name[MAX_RV_MONITOR_NAME_SIZE];
+ const struct rv_chardev_ops *ops;
+ struct list_head list;
+};
+
+/* Protected by rv_interface_lock (from rv.h / rv.c). */
+static LIST_HEAD(rv_chardev_list);
+
+/**
+ * rv_chardev_register_monitor - expose a monitor via /dev/rv
+ * @name: Monitor name, must match the rv_monitor .name field.
+ * @ops: Callbacks providing bind / ioctl / release.
+ *
+ * Returns 0 on success, -EINVAL if @name is too long, -EEXIST if @name is
+ * already registered, -ENOMEM on OOM.
+ */
+int rv_chardev_register_monitor(const char *name,
+ const struct rv_chardev_ops *ops)
+{
+ struct rv_chardev_entry *e, *existing;
+
+ if (strlen(name) >= MAX_RV_MONITOR_NAME_SIZE)
+ return -EINVAL;
+
+ e = kmalloc_obj(*e, GFP_KERNEL);
+ if (!e)
+ return -ENOMEM;
+
+ strscpy(e->name, name, sizeof(e->name));
+ e->ops = ops;
+
+ guard(mutex)(&rv_interface_lock);
+ list_for_each_entry(existing, &rv_chardev_list, list) {
+ if (strcmp(existing->name, name) == 0) {
+ kfree(e);
+ return -EEXIST;
+ }
+ }
+ list_add_tail(&e->list, &rv_chardev_list);
+ return 0;
+}
+EXPORT_SYMBOL_GPL(rv_chardev_register_monitor);
+
+/**
+ * rv_chardev_unregister_monitor - remove a monitor from the /dev/rv registry
+ * @name: Monitor name previously passed to rv_chardev_register_monitor().
+ *
+ * Existing bound fds remain valid; their ops pointer is stable until the
+ * fd is closed. The caller must ensure no new binds to this monitor can
+ * succeed after unregistration — typically by unregistering before unloading
+ * the module that provides the ops.
+ */
+void rv_chardev_unregister_monitor(const char *name)
+{
+ struct rv_chardev_entry *e, *tmp;
+
+ guard(mutex)(&rv_interface_lock);
+ list_for_each_entry_safe(e, tmp, &rv_chardev_list, list) {
+ if (strcmp(e->name, name) == 0) {
+ list_del(&e->list);
+ kfree(e);
+ return;
+ }
+ }
+}
+EXPORT_SYMBOL_GPL(rv_chardev_unregister_monitor);
+
+static int rv_dev_open(struct inode *inode, struct file *file)
+{
+ struct rv_fd_priv *fp;
+
+ fp = kzalloc_obj(*fp, GFP_KERNEL);
+ if (!fp)
+ return -ENOMEM;
+
+ file->private_data = fp;
+ return 0;
+}
+
+static int rv_dev_release(struct inode *inode, struct file *file)
+{
+ struct rv_fd_priv *fp = file->private_data;
+
+ if (fp->ops) {
+ fp->ops->release(fp->monitor_priv);
+ module_put(fp->ops->owner);
+ }
+ kfree(fp);
+ return 0;
+}
+
+static int rv_bind_monitor(struct rv_fd_priv *fp, const char __user *uarg)
+{
+ const struct rv_chardev_ops *ops = NULL;
+ struct rv_bind_args args;
+ void *priv;
+
+ if (fp->ops)
+ return -EBUSY;
+
+ if (copy_from_user(&args, uarg, sizeof(args)))
+ return -EFAULT;
+
+ args.monitor_name[RV_MONITOR_NAME_MAX - 1] = '\0';
+
+	/*
+	 * Pin the owning module while the list entry is still valid under
+	 * rv_interface_lock, preventing a concurrent rmmod from completing
+	 * between lookup and reference acquisition. bind() may block and
+	 * take monitor-internal locks, so it runs after rv_interface_lock
+	 * is dropped to keep the critical section short.
+	 */
+ scoped_guard(mutex, &rv_interface_lock) {
+ struct rv_chardev_entry *e;
+
+ list_for_each_entry(e, &rv_chardev_list, list) {
+ if (strcmp(e->name, args.monitor_name) != 0)
+ continue;
+ if (!try_module_get(e->ops->owner))
+ return -ENODEV;
+ ops = e->ops;
+ break;
+ }
+ }
+
+ if (!ops)
+ return -ENOENT;
+
+ priv = ops->bind();
+ if (IS_ERR(priv)) {
+ module_put(ops->owner);
+ return PTR_ERR(priv);
+ }
+
+ fp->ops = ops;
+ fp->monitor_priv = priv;
+ return 0;
+}
+
+static long rv_dev_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
+{
+ struct rv_fd_priv *fp = file->private_data;
+
+ if (cmd == RV_IOCTL_BIND_MONITOR)
+ return rv_bind_monitor(fp, (const char __user *)arg);
+
+ if (!fp->ops)
+ return -ENXIO;
+
+ return fp->ops->ioctl(fp->monitor_priv, cmd, arg);
+}
+
+static __poll_t rv_dev_poll(struct file *file, poll_table *wait)
+{
+ struct rv_fd_priv *fp = file->private_data;
+
+ if (!fp->ops || !fp->ops->poll)
+ return 0;
+
+ return fp->ops->poll(fp->monitor_priv, file, wait);
+}
+
+static const struct file_operations rv_dev_fops = {
+ .owner = THIS_MODULE,
+ .open = rv_dev_open,
+ .release = rv_dev_release,
+ .unlocked_ioctl = rv_dev_ioctl,
+ .compat_ioctl = rv_dev_ioctl,
+ .poll = rv_dev_poll,
+};
+
+static struct miscdevice rv_miscdev = {
+ .minor = MISC_DYNAMIC_MINOR,
+ .name = "rv",
+ .fops = &rv_dev_fops,
+};
+
+int __init rv_chardev_init(void)
+{
+ return misc_register(&rv_miscdev);
+}
diff --git a/kernel/trace/rv/rv_trace.h b/kernel/trace/rv/rv_trace.h
index 9622c269789c..a4bc215c1f15 100644
--- a/kernel/trace/rv/rv_trace.h
+++ b/kernel/trace/rv/rv_trace.h
@@ -189,6 +189,7 @@ DECLARE_EVENT_CLASS(error_env_da_monitor_id,
#include <monitors/stall/stall_trace.h>
#include <monitors/nomiss/nomiss_trace.h>
+#include <monitors/tlob/tlob_trace.h>
// Add new monitors based on CONFIG_HA_MON_EVENTS_ID here
#endif
diff --git a/kernel/trace/rv/rv_uprobe.c b/kernel/trace/rv/rv_uprobe.c
index bc28399cfd4b..1ba7b80c1d87 100644
--- a/kernel/trace/rv/rv_uprobe.c
+++ b/kernel/trace/rv/rv_uprobe.c
@@ -132,13 +132,10 @@ EXPORT_SYMBOL_GPL(rv_uprobe_attach);
*/
void rv_uprobe_detach(struct rv_uprobe *p)
{
- struct rv_uprobe_impl *impl;
-
if (!p)
return;
- impl = container_of(p, struct rv_uprobe_impl, pub);
- uprobe_unregister_nosync(impl->uprobe, &impl->uc);
+ rv_uprobe_unregister_nosync(p);
/*
* uprobe_unregister_sync() is a global barrier: it waits for all
* in-flight uprobe handlers across the entire system to complete,
@@ -146,8 +143,47 @@ void rv_uprobe_detach(struct rv_uprobe *p)
* guarantees that no handler touching impl->pub.priv is running by
* the time we return, even if the caller immediately frees priv.
*/
+ rv_uprobe_sync();
+ rv_uprobe_free(p);
+}
+EXPORT_SYMBOL_GPL(rv_uprobe_detach);
+
+/**
+ * rv_uprobe_unregister_nosync - dequeue an uprobe without waiting
+ * @p: probe handle returned by rv_uprobe_attach(); may be NULL.
+ */
+void rv_uprobe_unregister_nosync(struct rv_uprobe *p)
+{
+ struct rv_uprobe_impl *impl;
+
+ if (!p)
+ return;
+
+ impl = container_of(p, struct rv_uprobe_impl, pub);
+ uprobe_unregister_nosync(impl->uprobe, &impl->uc);
+}
+EXPORT_SYMBOL_GPL(rv_uprobe_unregister_nosync);
+
+/**
+ * rv_uprobe_sync - wait for all in-flight uprobe handlers to complete
+ */
+void rv_uprobe_sync(void)
+{
uprobe_unregister_sync();
+}
+EXPORT_SYMBOL_GPL(rv_uprobe_sync);
+
+/**
+ * rv_uprobe_free - release resources of a previously deregistered probe
+ * @p: probe handle returned by rv_uprobe_attach(); may be NULL.
+ */
+void rv_uprobe_free(struct rv_uprobe *p)
+{
+ struct rv_uprobe_impl *impl;
+
+ if (!p)
+ return;
+
+ impl = container_of(p, struct rv_uprobe_impl, pub);
path_put(&p->path);
kfree(impl);
}
-EXPORT_SYMBOL_GPL(rv_uprobe_detach);
+EXPORT_SYMBOL_GPL(rv_uprobe_free);
diff --git a/tools/include/uapi/linux/rv.h b/tools/include/uapi/linux/rv.h
new file mode 100644
index 000000000000..a34e5426393b
--- /dev/null
+++ b/tools/include/uapi/linux/rv.h
@@ -0,0 +1,86 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/*
+ * UAPI definitions for Runtime Verification (RV) monitors.
+ *
+ * All RV monitors that expose an ioctl self-instrumentation interface
+ * share the magic byte RV_IOC_MAGIC ('r').
+ *
+ * Usage examples and design rationale are in:
+ * Documentation/trace/rv/monitor_tlob.rst
+ */
+
+#ifndef _UAPI_LINUX_RV_H
+#define _UAPI_LINUX_RV_H
+
+#include <linux/ioctl.h>
+#include <linux/types.h>
+
+/* Magic byte shared by all RV monitor ioctls. */
+#define RV_IOC_MAGIC 'r'
+
+/* Maximum monitor name length (including NUL terminator). */
+#define RV_MONITOR_NAME_MAX 32
+
+/* Generic /dev/rv ioctls (ioctl number 0 is reserved for the core) */
+
+/**
+ * struct rv_bind_args - arguments for RV_IOCTL_BIND_MONITOR
+ * @monitor_name: NUL-terminated name of the monitor to bind (e.g. "tlob").
+ */
+struct rv_bind_args {
+ char monitor_name[RV_MONITOR_NAME_MAX];
+};
+
+/*
+ * RV_IOCTL_BIND_MONITOR - associate this fd with a specific RV monitor.
+ *
+ * Must be called once after open() and before any monitor-specific ioctl.
+ *
+ * Returns 0 on success.
+ * Returns -EBUSY if this fd is already bound to a monitor.
+ * Returns -ENOENT if the requested monitor is not registered.
+ * Returns -ENOMEM on allocation failure.
+ */
+#define RV_IOCTL_BIND_MONITOR _IOW(RV_IOC_MAGIC, 0, struct rv_bind_args)
+
+/* tlob: task latency over budget monitor (ioctl numbers 1–15) */
+
+/**
+ * struct tlob_start_args - arguments for TLOB_IOCTL_TRACE_START
+ * @threshold_us: Total latency budget for this window, in microseconds.
+ * Must be greater than zero. Both on-CPU and off-CPU time
+ * (including runqueue wait) count toward this budget.
+ */
+struct tlob_start_args {
+ __u64 threshold_us;
+};
+
+/*
+ * TLOB_IOCTL_TRACE_START - begin monitoring the calling task.
+ *
+ * Arms a per-task hrtimer for threshold_us microseconds (CLOCK_MONOTONIC,
+ * so both on-CPU and off-CPU time count toward the budget).
+ *
+ * Returns 0 on success.
+ * Returns -EEXIST if TRACE_START was already called on this fd.
+ * Returns -ENOSPC if TLOB_MAX_MONITORED tasks are already being tracked.
+ * Returns -ENOMEM on allocation failure.
+ * Returns -ENODEV if the tlob monitor is not enabled.
+ * Returns -ERANGE if threshold_us is 0.
+ */
+#define TLOB_IOCTL_TRACE_START _IOW(RV_IOC_MAGIC, 1, struct tlob_start_args)
+
+/*
+ * TLOB_IOCTL_TRACE_STOP - end monitoring the calling task.
+ *
+ * Returns 0 if within budget.
+ * Returns -EOVERFLOW if the latency budget was exceeded.
+ * Returns -EINVAL if TLOB_IOCTL_TRACE_START was not called on this fd.
+ *
+ * poll/epoll: after TRACE_START the fd becomes readable (EPOLLIN) when the
+ * budget is exceeded. The caller may then issue TRACE_STOP to retrieve the
+ * result, or simply close the fd to clean up.
+ */
+#define TLOB_IOCTL_TRACE_STOP _IO(RV_IOC_MAGIC, 2)
+
+#endif /* _UAPI_LINUX_RV_H */
--
2.25.1
2026-05-11 18:24 [RFC PATCH v2 00/10] rv/tlob: Add task latency over budget RV monitor wen.yang
2026-05-11 18:24 ` [RFC PATCH v2 01/10] rv/da: fix monitor start ordering and memory ordering for monitoring flag wen.yang
2026-05-11 18:24 ` [RFC PATCH v2 02/10] rv/da: fix per-task da_monitor_destroy() ordering and sync wen.yang
2026-05-11 18:24 ` [RFC PATCH v2 03/10] selftests/verification: fix verificationtest-ktap for out-of-tree execution wen.yang
2026-05-11 18:24 ` [RFC PATCH v2 04/10] rv/da: add pre-allocated storage pool for per-object monitors wen.yang
2026-05-11 18:24 ` [RFC PATCH v2 05/10] rv: add generic uprobe infrastructure for RV monitors wen.yang
2026-05-11 18:24 ` [RFC PATCH v2 06/10] rvgen: support reset() on the __init arrow for global-window HA clocks wen.yang
2026-05-11 18:24 ` [RFC PATCH v2 07/10] rv/tlob: add tlob model DOT file wen.yang
2026-05-11 18:24 ` wen.yang [this message]
2026-05-11 18:24 ` [RFC PATCH v2 09/10] rv/tlob: add KUnit tests for the tlob monitor wen.yang
2026-05-11 18:24 ` [RFC PATCH v2 10/10] selftests/verification: add tlob selftests wen.yang