public inbox for linux-kernel@vger.kernel.org
* [RFC PATCH 0/4] rv/tlob: Add task latency over budget RV monitor
@ 2026-04-12 19:27 wen.yang
  2026-04-12 19:27 ` [RFC PATCH 1/4] rv/tlob: Add tlob model DOT file wen.yang
                   ` (3 more replies)
  0 siblings, 4 replies; 11+ messages in thread
From: wen.yang @ 2026-04-12 19:27 UTC (permalink / raw)
  To: Steven Rostedt, Gabriele Monaco, Masami Hiramatsu,
	Mathieu Desnoyers
  Cc: linux-trace-kernel, linux-kernel, Wen Yang

From: Wen Yang <wen.yang@linux.dev>

This series introduces tlob (task latency over budget), a new per-task
Runtime Verification monitor.

Background
----------
The RV framework formalises kernel behavioural properties as
deterministic automata. Existing monitors (wwnr, sssw, opid, etc.) cover
scheduling and locking invariants; none tracks wall-clock latency of
a per-task code path, including off-CPU time. This property is needed
in ADAS perception/planning pipelines, industrial real-time
controllers, and similar mixed-criticality deployments.

tlob adds this capability. A caller demarcates a code path via a
start/stop pair; the kernel arms a per-task hrtimer for the requested
budget. If the task has not called TRACE_STOP before the timer fires,
a violation is recorded, the eventual stop call returns -EOVERFLOW,
and an event is pushed to the mmap ring of the registered notify fd,
if any.

The tracefs interface requires only tracefs write permissions, avoiding
the CAP_BPF privilege needed for equivalent eBPF-based approaches. The
DA model (patch 1) can be independently verified with standard model-
checking tools.

Design
------
The monitor is a three-state deterministic automaton (DA):

  unmonitored --trace_start--> on_cpu
  on_cpu      --switch_out---> off_cpu
  off_cpu     --switch_in----> on_cpu
  {on_cpu, off_cpu} --{trace_stop, budget_expired}--> unmonitored

Per-task state lives in a fixed-size hash table (TLOB_MAX_MONITORED
slots) with RCU-deferred free. Timing is based on CLOCK_MONOTONIC
(ktime_get()), so budgets account for off-CPU time.

Two userspace interfaces are provided:

  tracefs: uprobe pair registration via the monitor/enable files; no
           new UAPI required.

  /dev/rv ioctls (CONFIG_RV_CHARDEV):
    TLOB_IOCTL_TRACE_START  — arm the budget for a target task
    TLOB_IOCTL_TRACE_STOP   — disarm; returns -EOVERFLOW on violation

  Each /dev/rv file descriptor has a per-fd mmap ring (a physically
  contiguous control page struct tlob_mmap_page followed by an array of
  struct tlob_event records). Head/tail/dropped are userspace-readable
  without locking; overflow uses a drop-new policy.

New UAPI (include/uapi/linux/rv.h): tlob_start_args, tlob_event,
tlob_mmap_page, ioctl numbers (RV_IOC_MAGIC=0xB9, registered in
Documentation/userspace-api/ioctl/ioctl-number.rst).

Testing
-------
KUnit (patch 3): six suites (38 cases) gated on CONFIG_TLOB_KUNIT_TEST.

  ./tools/testing/kunit/kunit.py run \
    --kunitconfig kernel/trace/rv/monitors/tlob/.kunitconfig

  Coverage: automaton state transitions, start/stop API error paths,
  scheduler context-switch accounting, tracepoint payload fields,
  ring-buffer push/overflow/wakeup, and the uprobe line parser.

kselftest (patch 4): 19 TAP test points under
tools/testing/selftests/rv/. Requires CONFIG_RV_MON_TLOB=y,
CONFIG_RV_CHARDEV=y, and root.

  make -C tools/testing/selftests/rv
  sudo ./test_tlob.sh

Patch overview
--------------
Patch 1 — DOT model: formal automaton specification for verification.
Patch 2 — monitor implementation, UAPI, and documentation.
Patch 3 — KUnit in-kernel unit tests.
Patch 4 — kselftest user-space integration tests.

Wen Yang (4):
  rv/tlob: Add tlob model DOT file
  rv/tlob: Add tlob deterministic automaton monitor
  rv/tlob: Add KUnit tests for the tlob monitor
  selftests/rv: Add selftest for the tlob monitor

 Documentation/trace/rv/index.rst              |    1 +
 Documentation/trace/rv/monitor_tlob.rst       |  381 ++++++
 .../userspace-api/ioctl/ioctl-number.rst      |    1 +
 MAINTAINERS                                   |    3 +
 include/uapi/linux/rv.h                       |  181 +++
 kernel/trace/rv/Kconfig                       |   17 +
 kernel/trace/rv/Makefile                      |    3 +
 kernel/trace/rv/monitors/tlob/.kunitconfig    |    5 +
 kernel/trace/rv/monitors/tlob/Kconfig         |   63 +
 kernel/trace/rv/monitors/tlob/tlob.c          |  987 ++++++++++++++
 kernel/trace/rv/monitors/tlob/tlob.h          |  145 ++
 kernel/trace/rv/monitors/tlob/tlob_kunit.c    | 1194 +++++++++++++++++
 kernel/trace/rv/monitors/tlob/tlob_trace.h    |   42 +
 kernel/trace/rv/rv.c                          |    4 +
 kernel/trace/rv/rv_dev.c                      |  602 +++++++++
 kernel/trace/rv/rv_trace.h                    |   50 +
 tools/include/uapi/linux/rv.h                 |   54 +
 tools/testing/selftests/rv/Makefile           |   18 +
 tools/testing/selftests/rv/test_tlob.sh       |  563 ++++++++
 tools/testing/selftests/rv/tlob_helper.c      |  994 ++++++++++++++
 .../testing/selftests/rv/tlob_uprobe_target.c |  108 ++
 tools/verification/models/tlob.dot            |   25 +
 22 files changed, 5441 insertions(+)
 create mode 100644 Documentation/trace/rv/monitor_tlob.rst
 create mode 100644 include/uapi/linux/rv.h
 create mode 100644 kernel/trace/rv/monitors/tlob/.kunitconfig
 create mode 100644 kernel/trace/rv/monitors/tlob/Kconfig
 create mode 100644 kernel/trace/rv/monitors/tlob/tlob.c
 create mode 100644 kernel/trace/rv/monitors/tlob/tlob.h
 create mode 100644 kernel/trace/rv/monitors/tlob/tlob_kunit.c
 create mode 100644 kernel/trace/rv/monitors/tlob/tlob_trace.h
 create mode 100644 kernel/trace/rv/rv_dev.c
 create mode 100644 tools/include/uapi/linux/rv.h
 create mode 100644 tools/testing/selftests/rv/Makefile
 create mode 100755 tools/testing/selftests/rv/test_tlob.sh
 create mode 100644 tools/testing/selftests/rv/tlob_helper.c
 create mode 100644 tools/testing/selftests/rv/tlob_uprobe_target.c
 create mode 100644 tools/verification/models/tlob.dot

-- 
2.43.0


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [RFC PATCH 1/4] rv/tlob: Add tlob model DOT file
  2026-04-12 19:27 [RFC PATCH 0/4] rv/tlob: Add task latency over budget RV monitor wen.yang
@ 2026-04-12 19:27 ` wen.yang
  2026-04-13  8:19   ` Gabriele Monaco
  2026-04-12 19:27 ` [RFC PATCH 2/4] rv/tlob: Add tlob deterministic automaton monitor wen.yang
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 11+ messages in thread
From: wen.yang @ 2026-04-12 19:27 UTC (permalink / raw)
  To: Steven Rostedt, Gabriele Monaco, Masami Hiramatsu,
	Mathieu Desnoyers
  Cc: linux-trace-kernel, linux-kernel, Wen Yang

From: Wen Yang <wen.yang@linux.dev>

Add the Graphviz DOT specification for the tlob (task latency over
budget) deterministic automaton.

The model has three states: unmonitored, on_cpu, and off_cpu.
trace_start transitions from unmonitored to on_cpu; switch_out and
switch_in cycle between on_cpu and off_cpu; trace_stop and
budget_expired return to unmonitored from either active state.
unmonitored is the sole accepting state.

switch_in, switch_out, and sched_wakeup self-loop in unmonitored;
sched_wakeup self-loops in on_cpu; switch_out and sched_wakeup
self-loop in off_cpu.

Signed-off-by: Wen Yang <wen.yang@linux.dev>
---
 MAINTAINERS                        |  3 +++
 tools/verification/models/tlob.dot | 25 +++++++++++++++++++++++++
 2 files changed, 28 insertions(+)
 create mode 100644 tools/verification/models/tlob.dot

diff --git a/MAINTAINERS b/MAINTAINERS
index 9fbb619c6..c2c56236c 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -23242,7 +23242,10 @@ S:	Maintained
 F:	Documentation/trace/rv/
 F:	include/linux/rv.h
 F:	include/rv/
+F:	include/uapi/linux/rv.h
 F:	kernel/trace/rv/
+F:	samples/rv/
+F:	tools/testing/selftests/rv/
 F:	tools/testing/selftests/verification/
 F:	tools/verification/
 
diff --git a/tools/verification/models/tlob.dot b/tools/verification/models/tlob.dot
new file mode 100644
index 000000000..df34a14b8
--- /dev/null
+++ b/tools/verification/models/tlob.dot
@@ -0,0 +1,25 @@
+digraph state_automaton {
+	center = true;
+	size = "7,11";
+	{node [shape = plaintext, style=invis, label=""] "__init_unmonitored"};
+	{node [shape = ellipse] "unmonitored"};
+	{node [shape = plaintext] "unmonitored"};
+	{node [shape = plaintext] "on_cpu"};
+	{node [shape = plaintext] "off_cpu"};
+	"__init_unmonitored" -> "unmonitored";
+	"unmonitored" [label = "unmonitored", color = green3];
+	"unmonitored" -> "on_cpu" [ label = "trace_start" ];
+	"unmonitored" -> "unmonitored" [ label = "switch_in\nswitch_out\nsched_wakeup" ];
+	"on_cpu" [label = "on_cpu"];
+	"on_cpu" -> "off_cpu" [ label = "switch_out" ];
+	"on_cpu" -> "unmonitored" [ label = "trace_stop\nbudget_expired" ];
+	"on_cpu" -> "on_cpu" [ label = "sched_wakeup" ];
+	"off_cpu" [label = "off_cpu"];
+	"off_cpu" -> "on_cpu" [ label = "switch_in" ];
+	"off_cpu" -> "unmonitored" [ label = "trace_stop\nbudget_expired" ];
+	"off_cpu" -> "off_cpu" [ label = "switch_out\nsched_wakeup" ];
+	{ rank = min ;
+		"__init_unmonitored";
+		"unmonitored";
+	}
+}
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [RFC PATCH 2/4] rv/tlob: Add tlob deterministic automaton monitor
  2026-04-12 19:27 [RFC PATCH 0/4] rv/tlob: Add task latency over budget RV monitor wen.yang
  2026-04-12 19:27 ` [RFC PATCH 1/4] rv/tlob: Add tlob model DOT file wen.yang
@ 2026-04-12 19:27 ` wen.yang
  2026-04-13  8:19   ` Gabriele Monaco
  2026-04-12 19:27 ` [RFC PATCH 3/4] rv/tlob: Add KUnit tests for the tlob monitor wen.yang
  2026-04-12 19:27 ` [RFC PATCH 4/4] selftests/rv: Add selftest " wen.yang
  3 siblings, 1 reply; 11+ messages in thread
From: wen.yang @ 2026-04-12 19:27 UTC (permalink / raw)
  To: Steven Rostedt, Gabriele Monaco, Masami Hiramatsu,
	Mathieu Desnoyers
  Cc: linux-trace-kernel, linux-kernel, Wen Yang

From: Wen Yang <wen.yang@linux.dev>

Add the tlob (task latency over budget) RV monitor. tlob tracks the
monotonic elapsed time (CLOCK_MONOTONIC) of a marked per-task code
path, including time off-CPU, and fires a per-task hrtimer when the
elapsed time exceeds a configurable budget.

Three-state DA (unmonitored/on_cpu/off_cpu) driven by trace_start,
switch_in/out, and budget_expired events. Per-task state lives in a
fixed-size hash table (TLOB_MAX_MONITORED slots) with RCU-deferred
free.

Two userspace interfaces:
 - tracefs: uprobe pair registration via the monitor file using the
   format "threshold_us:offset_start:offset_stop:binary_path"
 - /dev/rv ioctls (CONFIG_RV_CHARDEV): TLOB_IOCTL_TRACE_START /
   TRACE_STOP; TRACE_STOP returns -EOVERFLOW on violation

Each /dev/rv fd has a per-fd mmap ring buffer (physically contiguous
pages). A control page (struct tlob_mmap_page) at offset 0 exposes
head/tail/dropped for lockless userspace reads; struct tlob_event
records follow at data_offset. Drop-new policy on overflow.

UAPI: include/uapi/linux/rv.h (tlob_start_args, tlob_event,
      tlob_mmap_page, ioctl numbers), monitor_tlob.rst,
      ioctl-number.rst (RV_IOC_MAGIC=0xB9).

Signed-off-by: Wen Yang <wen.yang@linux.dev>
---
 Documentation/trace/rv/index.rst              |   1 +
 Documentation/trace/rv/monitor_tlob.rst       | 381 +++++++
 .../userspace-api/ioctl/ioctl-number.rst      |   1 +
 include/uapi/linux/rv.h                       | 181 ++++
 kernel/trace/rv/Kconfig                       |  17 +
 kernel/trace/rv/Makefile                      |   2 +
 kernel/trace/rv/monitors/tlob/Kconfig         |  51 +
 kernel/trace/rv/monitors/tlob/tlob.c          | 986 ++++++++++++++++++
 kernel/trace/rv/monitors/tlob/tlob.h          | 145 +++
 kernel/trace/rv/monitors/tlob/tlob_trace.h    |  42 +
 kernel/trace/rv/rv.c                          |   4 +
 kernel/trace/rv/rv_dev.c                      | 602 +++++++++++
 kernel/trace/rv/rv_trace.h                    |  50 +
 13 files changed, 2463 insertions(+)
 create mode 100644 Documentation/trace/rv/monitor_tlob.rst
 create mode 100644 include/uapi/linux/rv.h
 create mode 100644 kernel/trace/rv/monitors/tlob/Kconfig
 create mode 100644 kernel/trace/rv/monitors/tlob/tlob.c
 create mode 100644 kernel/trace/rv/monitors/tlob/tlob.h
 create mode 100644 kernel/trace/rv/monitors/tlob/tlob_trace.h
 create mode 100644 kernel/trace/rv/rv_dev.c

diff --git a/Documentation/trace/rv/index.rst b/Documentation/trace/rv/index.rst
index a2812ac5c..4f2bfaf38 100644
--- a/Documentation/trace/rv/index.rst
+++ b/Documentation/trace/rv/index.rst
@@ -15,3 +15,4 @@ Runtime Verification
    monitor_wwnr.rst
    monitor_sched.rst
    monitor_rtapp.rst
+   monitor_tlob.rst
diff --git a/Documentation/trace/rv/monitor_tlob.rst b/Documentation/trace/rv/monitor_tlob.rst
new file mode 100644
index 000000000..d498e9894
--- /dev/null
+++ b/Documentation/trace/rv/monitor_tlob.rst
@@ -0,0 +1,381 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Monitor tlob
+============
+
+- Name: tlob - task latency over budget
+- Type: per-task deterministic automaton
+- Author: Wen Yang <wen.yang@linux.dev>
+
+Description
+-----------
+
+The tlob monitor tracks per-task elapsed time (CLOCK_MONOTONIC, including
+both on-CPU and off-CPU time) and reports a violation when the monitored
+task exceeds a configurable latency budget threshold.
+
+The monitor implements a three-state deterministic automaton::
+
+                             |
+                             | (initial)
+                             v
+                     +--------------+
+                     | unmonitored  | <-------------------+
+                     +--------------+                     |
+                            |                             |
+                            |                             |
+                       trace_start                        |
+                            |                             |
+                            |                             |
+                            v                             |
+                     +--------------+   trace_stop /      |
+                     |    on_cpu    | ------------------->+
+                     +--------------+   budget_expired    |
+                        |        ^                        |
+                        |        |                        |
+             switch_out |        | switch_in              |
+                        |        |                        |
+                        v        |                        |
+                     +--------------+   trace_stop /      |
+                     |   off_cpu    | ------------------->+
+                     +--------------+   budget_expired
+
+
+  Key transitions:
+    unmonitored   --(trace_start)-->   on_cpu
+    on_cpu        --(switch_out)-->    off_cpu
+    off_cpu       --(switch_in)-->     on_cpu
+    on_cpu        --(trace_stop)-->    unmonitored
+    off_cpu       --(trace_stop)-->    unmonitored
+    on_cpu        --(budget_expired)-> unmonitored   [violation]
+    off_cpu       --(budget_expired)-> unmonitored   [violation]
+
+  switch_in, switch_out, and sched_wakeup self-loop in unmonitored;
+  sched_wakeup self-loops in on_cpu; switch_out and sched_wakeup
+  self-loop in off_cpu.  budget_expired is fired by the one-shot hrtimer
+  and always transitions to unmonitored, whether the task is on- or off-CPU.
+
+State Descriptions
+------------------
+
+- **unmonitored**: Task is not being traced.  Scheduling events
+  (``switch_in``, ``switch_out``, ``sched_wakeup``) are silently
+  ignored (self-loop).  The monitor waits for a ``trace_start`` event
+  to begin a new observation window.
+
+- **on_cpu**: Task is running on the CPU with the deadline timer armed.
+  A one-shot hrtimer was set for ``threshold_us`` microseconds at
+  ``trace_start`` time.  A ``switch_out`` event transitions to
+  ``off_cpu``; the hrtimer keeps running (off-CPU time counts toward
+  the budget).  A ``trace_stop`` cancels the timer and returns to
+  ``unmonitored`` (normal completion).  If the hrtimer fires
+  (``budget_expired``) the violation is recorded and the automaton
+  transitions to ``unmonitored``.
+
+- **off_cpu**: Task was preempted or blocked.  The one-shot hrtimer
+  continues to run.  A ``switch_in`` event returns to ``on_cpu``.
+  A ``trace_stop`` cancels the timer and returns to ``unmonitored``.
+  If the hrtimer fires (``budget_expired``) while the task is off-CPU,
+  the violation is recorded and the automaton transitions to
+  ``unmonitored``.
+
+Rationale
+---------
+
+The per-task latency budget threshold allows operators to express timing
+requirements in microseconds and receive an immediate ftrace event when a
+task exceeds its budget.  This is useful for real-time tasks
+(``SCHED_FIFO`` / ``SCHED_DEADLINE``) where total elapsed time must
+remain within a known bound.
+
+Each task has an independent threshold, so up to ``TLOB_MAX_MONITORED``
+(64) tasks with different timing requirements can be monitored
+simultaneously.
+
+On threshold violation the automaton records a ``tlob_budget_exceeded``
+ftrace event carrying the final on-CPU / off-CPU time breakdown, but does
+not kill or throttle the task.  Monitoring can be restarted by issuing a
+new ``trace_start`` event (or a new ``TLOB_IOCTL_TRACE_START`` ioctl).
+
+A per-task one-shot hrtimer is armed at ``trace_start`` for exactly
+``threshold_us`` microseconds.  It fires at most once per monitoring
+window, performs an O(1) hash lookup, records the violation, and injects
+the ``budget_expired`` event into the DA.  When ``CONFIG_RV_MON_TLOB``
+is not set there is zero runtime cost.
+
+Usage
+-----
+
+tracefs interface (uprobe-based external monitoring)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The ``monitor`` tracefs file allows any privileged user to instrument an
+unmodified binary via uprobes, without changing its source code.  Write a
+four-field record to attach two plain entry uprobes: one at
+``offset_start`` fires ``tlob_start_task()`` and one at ``offset_stop``
+fires ``tlob_stop_task()``, so the latency budget covers exactly the code
+region between the two offsets::
+
+  threshold_us:offset_start:offset_stop:binary_path
+
+``binary_path`` comes last so it may freely contain ``:`` (e.g. paths
+inside a container namespace).
+
+The uprobes fire for every task that executes the probed instruction in
+the binary, consistent with the native uprobe semantics.  All tasks that
+execute the code region get independent per-task monitoring slots.
+
+Using two plain entry uprobes (rather than a uretprobe for the stop) means
+that a mistyped offset can never corrupt the call stack; the worst outcome
+of a bad ``offset_stop`` is a missed stop that causes the hrtimer to fire
+and report a budget violation.
+
+Example  --  monitor a code region in ``/usr/bin/myapp`` with a 5 ms
+budget, where the region starts at offset 0x12a0 and ends at 0x12f0::
+
+  echo 1 > /sys/kernel/tracing/rv/monitors/tlob/enable
+
+  # Bind uprobes: start probe starts the clock, stop probe stops it
+  echo "5000:0x12a0:0x12f0:/usr/bin/myapp" \
+      > /sys/kernel/tracing/rv/monitors/tlob/monitor
+
+  # Remove the uprobe binding for this code region
+  echo "-0x12a0:/usr/bin/myapp" > /sys/kernel/tracing/rv/monitors/tlob/monitor
+
+  # List registered uprobe bindings (mirrors the write format)
+  cat /sys/kernel/tracing/rv/monitors/tlob/monitor
+  # -> 5000:0x12a0:0x12f0:/usr/bin/myapp
+
+  # Read violations from the trace buffer
+  cat /sys/kernel/tracing/trace
+
+Up to ``TLOB_MAX_MONITORED`` tasks may be monitored simultaneously.
+
+The offsets can be obtained with ``nm`` or ``readelf``::
+
+  nm -n /usr/bin/myapp | grep my_function
+  # -> 00000000000012a0 T my_function
+
+  readelf -s /usr/bin/myapp | grep my_function
+  # -> 42: 00000000000012a0  336 FUNC GLOBAL DEFAULT  13 my_function
+
+  # offset_start = 0x12a0 (function entry)
+  # offset_stop  = 0x12a0 + 0x50 = 0x12f0 (or any instruction before return)
+
+Notes:
+
+- The uprobes fire for every task that executes the probed instruction,
+  so concurrent calls from different threads each get independent
+  monitoring slots.
+- ``offset_stop`` need not be a function return; it can be any instruction
+  within the region.  If the stop probe is never reached (e.g. early exit
+  path bypasses it), the hrtimer fires and a budget violation is reported.
+- Each ``(binary_path, offset_start)`` pair may only be registered once.
+  A second write with the same ``offset_start`` for the same binary is
+  rejected with ``-EEXIST``.  Two entry uprobes at the same address would
+  both fire for every task, causing ``tlob_start_task()`` to be called
+  twice; the second call would silently fail with ``-EEXIST`` and the
+  second binding's threshold would never take effect.  Different code
+  regions that share the same ``offset_stop`` (common exit point) are
+  explicitly allowed.
+- The uprobe binding is removed when ``-offset_start:binary_path`` is
+  written to ``monitor``, or when the monitor is disabled.
+- The ``tag`` field in every ``tlob_budget_exceeded`` event is
+  automatically set to ``offset_start`` for the tracefs path, so
+  violation events for different code regions are immediately
+  distinguishable even when ``threshold_us`` values are identical.
+
+ftrace ring buffer (budget violation events)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When a monitored task exceeds its latency budget the hrtimer fires,
+records the violation, and emits a single ``tlob_budget_exceeded`` event
+into the ftrace ring buffer.  **Nothing is written to the ftrace ring
+buffer while the task is within budget.**
+
+The event carries the on-CPU / off-CPU time breakdown so that root-cause
+analysis (CPU-bound vs. scheduling / I/O overrun) is immediate::
+
+  cat /sys/kernel/tracing/trace
+
+Example output::
+
+  myapp-1234 [003] .... 12345.678: tlob_budget_exceeded: \
+    myapp[1234]: budget exceeded threshold=5000 \
+    on_cpu=820 off_cpu=4500 switches=3 state=off_cpu tag=0x00000000000012a0
+
+Field descriptions:
+
+``threshold``
+  Configured latency budget in microseconds.
+
+``on_cpu``
+  Cumulative on-CPU time since ``trace_start``, in microseconds.
+
+``off_cpu``
+  Cumulative off-CPU (scheduling + I/O wait) time since ``trace_start``,
+  in microseconds.
+
+``switches``
+  Number of times the task was scheduled out during this window.
+
+``state``
+  DA state when the hrtimer fired: ``on_cpu`` means the task was executing
+  when the budget expired (CPU-bound overrun); ``off_cpu`` means the task
+  was preempted or blocked (scheduling / I/O overrun).
+
+``tag``
+  Opaque 64-bit cookie supplied by the caller via ``tlob_start_args.tag``
+  (ioctl path) or automatically set to ``offset_start`` (tracefs uprobe
+  path).  Use it to distinguish violations from different code regions
+  monitored by the same thread.  Zero when not set.
+
+To capture violations in a file::
+
+  trace-cmd record -e tlob_budget_exceeded &
+  # ... run workload ...
+  trace-cmd report
+
+/dev/rv ioctl interface (self-instrumentation)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Tasks can self-instrument their own code paths via the ``/dev/rv`` misc
+device (requires ``CONFIG_RV_CHARDEV``).  The kernel key is
+``task_struct``; multiple threads sharing a single fd each get their own
+independent monitoring slot.
+
+**Synchronous mode**  --  the calling thread checks its own result::
+
+  int fd = open("/dev/rv", O_RDWR);
+
+  struct tlob_start_args args = {
+      .threshold_us = 50000,   /* 50 ms */
+      .tag          = 0,       /* optional; 0 = don't care */
+      .notify_fd    = -1,      /* no fd notification */
+  };
+  ioctl(fd, TLOB_IOCTL_TRACE_START, &args);
+
+  /* ... code path under observation ... */
+
+  int ret = ioctl(fd, TLOB_IOCTL_TRACE_STOP, NULL);
+  /* ret == 0:                      within budget   */
+  /* ret == -1, errno == EOVERFLOW: budget exceeded */
+
+  close(fd);
+
+**Asynchronous mode**  --  a dedicated monitor thread receives violation
+records via ``read()`` on a shared fd, decoupling the observation from
+the critical path::
+
+  /* Monitor thread: open a dedicated fd. */
+  int monitor_fd = open("/dev/rv", O_RDWR);
+
+  /* Worker thread: set notify_fd = monitor_fd in TRACE_START args. */
+  int work_fd = open("/dev/rv", O_RDWR);
+  struct tlob_start_args args = {
+      .threshold_us = 10000,   /* 10 ms */
+      .tag          = REGION_A,
+      .notify_fd    = monitor_fd,
+  };
+  ioctl(work_fd, TLOB_IOCTL_TRACE_START, &args);
+  /* ... critical section ... */
+  ioctl(work_fd, TLOB_IOCTL_TRACE_STOP, NULL);
+
+  /* Monitor thread: blocking read() returns one or more tlob_event records. */
+  struct tlob_event ntfs[8];
+  ssize_t n = read(monitor_fd, ntfs, sizeof(ntfs));
+  for (int i = 0; i < n / sizeof(struct tlob_event); i++) {
+      struct tlob_event *ntf = &ntfs[i];
+      printf("tid=%u tag=0x%llx exceeded budget=%llu us "
+             "(on_cpu=%llu off_cpu=%llu switches=%u state=%s)\n",
+             ntf->tid, ntf->tag, ntf->threshold_us,
+             ntf->on_cpu_us, ntf->off_cpu_us, ntf->switches,
+             ntf->state ? "on_cpu" : "off_cpu");
+  }
+
+**mmap ring buffer**  --  zero-copy consumption of violation events::
+
+  int fd = open("/dev/rv", O_RDWR);
+  struct tlob_start_args args = {
+      .threshold_us = 1000,   /* 1 ms */
+      .notify_fd    = fd,     /* push violations to own ring buffer */
+  };
+  ioctl(fd, TLOB_IOCTL_TRACE_START, &args);
+
+  /* Map the ring: one control page + capacity data records. */
+  size_t pagesize = sysconf(_SC_PAGESIZE);
+  size_t cap = 64, pgmask = pagesize - 1;  /* real cap: page->capacity */
+  size_t len = (pagesize + cap * sizeof(struct tlob_event) + pgmask) & ~pgmask;
+  void *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+
+  struct tlob_mmap_page *page = map;
+  struct tlob_event *data =
+      (struct tlob_event *)((char *)map + page->data_offset);
+
+  /* Consumer loop: poll for events, read without copying. */
+  while (1) {
+      poll(&(struct pollfd){fd, POLLIN, 0}, 1, -1);
+
+      uint32_t head = __atomic_load_n(&page->data_head, __ATOMIC_ACQUIRE);
+      uint32_t tail = page->data_tail;
+      while (tail != head) {
+          handle(&data[tail & (page->capacity - 1)]);
+          tail++;
+      }
+      __atomic_store_n(&page->data_tail, tail, __ATOMIC_RELEASE);
+  }
+
+Note: ``read()`` and ``mmap()`` share the same ring and ``data_tail``
+cursor.  Do not use both simultaneously on the same fd.
+
+``tlob_event`` fields:
+
+``tid``
+  Thread ID (``task_pid_vnr``) of the violating task.
+
+``threshold_us``
+  Budget that was exceeded, in microseconds.
+
+``on_cpu_us``
+  Cumulative on-CPU time at violation time, in microseconds.
+
+``off_cpu_us``
+  Cumulative off-CPU time at violation time, in microseconds.
+
+``switches``
+  Number of context switches since ``TRACE_START``.
+
+``state``
+  1 = timer fired while task was on-CPU; 0 = timer fired while off-CPU.
+
+``tag``
+  Cookie from ``tlob_start_args.tag``; for the tracefs uprobe path this
+  equals ``offset_start``.  Zero when not set.
+
+tracefs files
+-------------
+
+The following files are created under
+``/sys/kernel/tracing/rv/monitors/tlob/``:
+
+``enable`` (rw)
+  Write ``1`` to enable the monitor; write ``0`` to disable it and
+  stop all currently monitored tasks.
+
+``desc`` (ro)
+  Human-readable description of the monitor.
+
+``monitor`` (rw)
+  Write ``threshold_us:offset_start:offset_stop:binary_path`` to bind two
+  plain entry uprobes in *binary_path*.  The uprobe at *offset_start* fires
+  ``tlob_start_task()``; the uprobe at *offset_stop* fires
+  ``tlob_stop_task()``.  Returns ``-EEXIST`` if a binding with the same
+  *offset_start* already exists for *binary_path*.  Write
+  ``-offset_start:binary_path`` to remove the binding.  Read to list
+  registered bindings, one
+  ``threshold_us:0xoffset_start:0xoffset_stop:binary_path`` entry per line.
+
+Specification
+-------------
+
+Graphviz DOT file in tools/verification/models/tlob.dot
diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
index 331223761..8d3af68db 100644
--- a/Documentation/userspace-api/ioctl/ioctl-number.rst
+++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
@@ -385,6 +385,7 @@ Code  Seq#    Include File                                             Comments
 0xB8  01-02  uapi/misc/mrvl_cn10k_dpi.h                                Marvell CN10K DPI driver
 0xB8  all    uapi/linux/mshv.h                                         Microsoft Hyper-V /dev/mshv driver
                                                                        <mailto:linux-hyperv@vger.kernel.org>
+0xB9  00-3F  linux/rv.h                                                Runtime Verification (RV) monitors
 0xBA  00-0F  uapi/linux/liveupdate.h                                   Pasha Tatashin
                                                                        <mailto:pasha.tatashin@soleen.com>
 0xC0  00-0F  linux/usb/iowarrior.h
diff --git a/include/uapi/linux/rv.h b/include/uapi/linux/rv.h
new file mode 100644
index 000000000..d1b96d8cd
--- /dev/null
+++ b/include/uapi/linux/rv.h
@@ -0,0 +1,181 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/*
+ * UAPI definitions for Runtime Verification (RV) monitors.
+ *
+ * All RV monitors that expose an ioctl self-instrumentation interface
+ * share the magic byte RV_IOC_MAGIC (0xB9), registered in
+ * Documentation/userspace-api/ioctl/ioctl-number.rst.
+ *
+ * A single /dev/rv misc device serves as the entry point.  ioctl numbers
+ * encode both the monitor identity and the operation:
+ *
+ *   0x01 - 0x1F  tlob (task latency over budget)
+ *   0x20 - 0x3F  reserved for future RV monitors
+ *
+ * Usage examples and design rationale are in:
+ *   Documentation/trace/rv/monitor_tlob.rst
+ */
+
+#ifndef _UAPI_LINUX_RV_H
+#define _UAPI_LINUX_RV_H
+
+#include <linux/ioctl.h>
+#include <linux/types.h>
+
+/* Magic byte shared by all RV monitor ioctls. */
+#define RV_IOC_MAGIC	0xB9
+
+/* -----------------------------------------------------------------------
+ * tlob: task latency over budget monitor  (nr 0x01 - 0x1F)
+ * -----------------------------------------------------------------------
+ */
+
+/**
+ * struct tlob_start_args - arguments for TLOB_IOCTL_TRACE_START
+ * @threshold_us: Latency budget for this critical section, in microseconds.
+ *               Must be greater than zero.
+ * @tag:         Opaque 64-bit cookie supplied by the caller.  Echoed back
+ *               verbatim in the tlob_budget_exceeded ftrace event and in any
+ *               tlob_event record delivered via @notify_fd.  Use it to identify
+ *               which code region triggered a violation when the same thread
+ *               monitors multiple regions sequentially.  Set to 0 if not
+ *               needed.
+ * @notify_fd:   File descriptor that will receive a tlob_event record on
+ *               violation.  Must refer to an open /dev/rv fd.  May equal
+ *               the calling fd (self-notification, useful for retrieving the
+ *               on_cpu_us / off_cpu_us breakdown after TRACE_STOP returns
+ *               -EOVERFLOW).  Set to -1 to disable fd notification; in that
+ *               case violations are only signalled via the TRACE_STOP return
+ *               value and the tlob_budget_exceeded ftrace event.
+ * @flags:       Must be 0.  Reserved for future extensions.
+ */
+struct tlob_start_args {
+	__u64 threshold_us;
+	__u64 tag;
+	__s32 notify_fd;
+	__u32 flags;
+};
+
+/**
+ * struct tlob_event - one budget-exceeded event
+ *
+ * Consumed by read() on the notify_fd registered at TLOB_IOCTL_TRACE_START.
+ * Each record describes a single budget exceedance for one task.
+ *
+ * @tid:          Thread ID (task_pid_vnr) of the violating task.
+ * @threshold_us: Budget that was exceeded, in microseconds.
+ * @on_cpu_us:    Cumulative on-CPU time at violation time, in microseconds.
+ * @off_cpu_us:   Cumulative off-CPU (scheduling + I/O wait) time at
+ *               violation time, in microseconds.
+ * @switches:     Number of context switches since TRACE_START.
+ * @state:        DA state at violation: 1 = on_cpu, 0 = off_cpu.
+ * @tag:          Cookie from tlob_start_args.tag; for the tracefs uprobe path
+ *               this is the offset_start value.  Zero when not set.
+ */
+struct tlob_event {
+	__u32 tid;
+	__u32 pad;
+	__u64 threshold_us;
+	__u64 on_cpu_us;
+	__u64 off_cpu_us;
+	__u32 switches;
+	__u32 state;   /* 1 = on_cpu, 0 = off_cpu */
+	__u64 tag;
+};
+
+/**
+ * struct tlob_mmap_page - control page for the mmap'd violation ring buffer
+ *
+ * Mapped at offset 0 of the mmap region returned by mmap(2) on a /dev/rv fd.
+ * The data array of struct tlob_event records begins at offset @data_offset
+ * (always one page from the mmap base; use this field rather than hard-coding
+ * PAGE_SIZE so the code remains correct across architectures).
+ *
+ * Ring layout:
+ *
+ *   mmap base + 0             : struct tlob_mmap_page  (one page)
+ *   mmap base + data_offset   : struct tlob_event[capacity]
+ *
+ * The mmap length determines the ring capacity.  Compute it as:
+ *
+ *   raw    = sysconf(_SC_PAGESIZE) + capacity * sizeof(struct tlob_event)
+ *   length = (raw + sysconf(_SC_PAGESIZE) - 1) & ~(sysconf(_SC_PAGESIZE) - 1)
+ *
+ * i.e. round the raw byte count up to the next page boundary before
+ * passing it to mmap(2).  The kernel requires a page-aligned length.
+ * capacity must be a power of 2.  Read @capacity after a successful
+ * mmap(2) for the actual value.
+ *
+ * Producer/consumer ordering contract:
+ *
+ *   Kernel (producer):
+ *     data[data_head & (capacity - 1)] = event;
+ *     // pairs with load-acquire in userspace:
+ *     smp_store_release(&page->data_head, data_head + 1);
+ *
+ *   Userspace (consumer):
+ *     // pairs with store-release in kernel:
+ *     head = __atomic_load_n(&page->data_head, __ATOMIC_ACQUIRE);
+ *     for (tail = page->data_tail; tail != head; tail++)
+ *         handle(&data[tail & (capacity - 1)]);
+ *     __atomic_store_n(&page->data_tail, tail, __ATOMIC_RELEASE);
+ *
+ * @data_head and @data_tail are monotonically increasing __u32 counters
+ * in units of records.  Unsigned 32-bit wrap-around is handled correctly
+ * by modular arithmetic; the ring is full when
+ * (data_head - data_tail) == capacity.
+ *
+ * When the ring is full the kernel drops the incoming record and increments
+ * @dropped.  The consumer should check @dropped periodically to detect loss.
+ *
+ * read() and mmap() share the same ring buffer.  Do not use both
+ * simultaneously on the same fd.
+ *
+ * @data_head:   Next write slot index.  Updated by the kernel with
+ *               store-release ordering.  Read by userspace with load-acquire.
+ * @data_tail:   Next read slot index.  Updated by userspace.  Read by the
+ *               kernel to detect overflow.
+ * @capacity:    Actual ring capacity in records (power of 2).  Written once
+ *               by the kernel at mmap time; read-only for userspace thereafter.
+ * @version:     Ring buffer ABI version; currently 1.
+ * @data_offset: Byte offset from the mmap base to the data array.
+ *               Always equal to sysconf(_SC_PAGESIZE) on the running kernel.
+ * @record_size: sizeof(struct tlob_event) as seen by the kernel.  Verify
+ *               this matches userspace's sizeof before indexing the array.
+ * @dropped:     Number of events dropped because the ring was full.
+ *               Monotonically increasing; read with __ATOMIC_RELAXED.
+ */
+struct tlob_mmap_page {
+	__u32  data_head;
+	__u32  data_tail;
+	__u32  capacity;
+	__u32  version;
+	__u32  data_offset;
+	__u32  record_size;
+	__u64  dropped;
+};
+
+/*
+ * TLOB_IOCTL_TRACE_START - begin monitoring the calling task.
+ *
+ * Arms a per-task hrtimer for threshold_us microseconds.  If args.notify_fd
+ * is >= 0, a tlob_event record is pushed into that fd's ring buffer on
+ * violation in addition to the tlob_budget_exceeded ftrace event.
+ * args.notify_fd == -1 disables fd notification.
+ *
+ * Violation records are consumed by read() on the notify_fd (blocking or
+ * non-blocking depending on O_NONBLOCK).  On violation, TLOB_IOCTL_TRACE_STOP
+ * also returns -EOVERFLOW regardless of whether notify_fd is set.
+ *
+ * args.flags must be 0.
+ */
+#define TLOB_IOCTL_TRACE_START		_IOW(RV_IOC_MAGIC, 0x01, struct tlob_start_args)
+
+/*
+ * TLOB_IOCTL_TRACE_STOP - end monitoring the calling task.
+ *
+ * Returns 0 if within budget, -EOVERFLOW if the budget was exceeded.
+ */
+#define TLOB_IOCTL_TRACE_STOP		_IO(RV_IOC_MAGIC,  0x02)
+
+#endif /* _UAPI_LINUX_RV_H */
diff --git a/kernel/trace/rv/Kconfig b/kernel/trace/rv/Kconfig
index 5b4be87ba..227573cda 100644
--- a/kernel/trace/rv/Kconfig
+++ b/kernel/trace/rv/Kconfig
@@ -65,6 +65,7 @@ source "kernel/trace/rv/monitors/pagefault/Kconfig"
 source "kernel/trace/rv/monitors/sleep/Kconfig"
 # Add new rtapp monitors here
 
+source "kernel/trace/rv/monitors/tlob/Kconfig"
 # Add new monitors here
 
 config RV_REACTORS
@@ -93,3 +94,19 @@ config RV_REACT_PANIC
 	help
 	  Enables the panic reactor. The panic reactor emits a printk()
 	  message if an exception is found and panic()s the system.
+
+config RV_CHARDEV
+	bool "RV ioctl interface via /dev/rv"
+	depends on RV
+	default n
+	help
+	  Register a /dev/rv misc device that exposes an ioctl interface
+	  for RV monitor self-instrumentation.  All RV monitors share the
+	  single device node; ioctl numbers encode the monitor identity.
+
+	  When enabled, user-space programs can open /dev/rv and use
+	  monitor-specific ioctl commands to bracket code regions they
+	  want the kernel RV subsystem to observe.
+
+	  Say Y here if you want to use the tlob self-instrumentation
+	  ioctl interface; otherwise say N.
diff --git a/kernel/trace/rv/Makefile b/kernel/trace/rv/Makefile
index 750e4ad6f..cc3781a3b 100644
--- a/kernel/trace/rv/Makefile
+++ b/kernel/trace/rv/Makefile
@@ -3,6 +3,7 @@
 ccflags-y += -I $(src)		# needed for trace events
 
 obj-$(CONFIG_RV) += rv.o
+obj-$(CONFIG_RV_CHARDEV) += rv_dev.o
 obj-$(CONFIG_RV_MON_WIP) += monitors/wip/wip.o
 obj-$(CONFIG_RV_MON_WWNR) += monitors/wwnr/wwnr.o
 obj-$(CONFIG_RV_MON_SCHED) += monitors/sched/sched.o
@@ -17,6 +18,7 @@ obj-$(CONFIG_RV_MON_STS) += monitors/sts/sts.o
 obj-$(CONFIG_RV_MON_NRP) += monitors/nrp/nrp.o
 obj-$(CONFIG_RV_MON_SSSW) += monitors/sssw/sssw.o
 obj-$(CONFIG_RV_MON_OPID) += monitors/opid/opid.o
+obj-$(CONFIG_RV_MON_TLOB) += monitors/tlob/tlob.o
 # Add new monitors here
 obj-$(CONFIG_RV_REACTORS) += rv_reactors.o
 obj-$(CONFIG_RV_REACT_PRINTK) += reactor_printk.o
diff --git a/kernel/trace/rv/monitors/tlob/Kconfig b/kernel/trace/rv/monitors/tlob/Kconfig
new file mode 100644
index 000000000..010237480
--- /dev/null
+++ b/kernel/trace/rv/monitors/tlob/Kconfig
@@ -0,0 +1,51 @@
+# SPDX-License-Identifier: GPL-2.0-only
+#
+config RV_MON_TLOB
+	depends on RV
+	depends on UPROBES
+	select DA_MON_EVENTS_ID
+	bool "tlob monitor"
+	help
+	  Enable the tlob (task latency over budget) monitor. This monitor
+	  tracks the elapsed time (CLOCK_MONOTONIC) of a marked code path within a
+	  task (including both on-CPU and off-CPU time) and reports a
+	  violation when the elapsed time exceeds a configurable budget
+	  threshold.
+
+	  The monitor implements a three-state deterministic automaton.
+	  States: unmonitored, on_cpu, off_cpu.
+	  Key transitions:
+	    unmonitored    --(trace_start)-->    on_cpu
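A possible end-to-end tracefs session, assuming a hypothetical binary /usr/bin/pipeline exporting frame_begin/frame_end symbols (binary, symbol names, and budget are all placeholders), might look like the fragment below. For a typical ELF whose text segment has p_offset == p_vaddr, the symbol value printed by nm can serve directly as the uprobe file offset; verify with readelf for anything unusual:

```shell
# Hypothetical binary and symbols; adjust to your application.
BIN=/usr/bin/pipeline
MON=/sys/kernel/tracing/rv/monitors/tlob

# Symbol values from the dynamic symbol table; for common ELF layouts
# these equal the file offsets the uprobe binding expects.
START=$(nm -D "$BIN" | awk '$3 == "frame_begin" { print "0x"$1 }')
STOP=$(nm -D "$BIN"  | awk '$3 == "frame_end"   { print "0x"$1 }')

# Bind a 5 ms budget to the [frame_begin, frame_end) region.
echo "5000:$START:$STOP:$BIN" > "$MON/monitor"

# List active bindings / remove the binding again.
cat "$MON/monitor"
echo "-$START:$BIN" > "$MON/monitor"
```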
+	    on_cpu   --(switch_out)-->     off_cpu
+	    off_cpu  --(switch_in)-->      on_cpu
+	    on_cpu   --(trace_stop)-->    unmonitored
+	    off_cpu  --(trace_stop)-->    unmonitored
+	    on_cpu   --(budget_expired)--> unmonitored
+	    off_cpu  --(budget_expired)--> unmonitored
+
+	  External configuration is done via the tracefs "monitor" file:
+	    echo pid:threshold_us:binary:offset_start:offset_stop > .../rv/monitors/tlob/monitor
+	    echo -pid             > .../rv/monitors/tlob/monitor  (remove task)
+	    cat                     .../rv/monitors/tlob/monitor  (list tasks)
+
+	  The uprobe binding places two plain entry uprobes at offset_start and
+	  offset_stop in the binary; these trigger tlob_start_task() and
+	  tlob_stop_task() respectively.  Using two entry uprobes (rather than a
+	  uretprobe) means that a mistyped offset can never corrupt the call
+	  stack; the worst outcome is a missed stop, which causes the hrtimer to
+	  fire and report a budget violation.
+
+	  Violation events are delivered via a lock-free mmap ring buffer on
+	  /dev/rv (enabled by CONFIG_RV_CHARDEV).  The consumer mmap()s the
+	  device, reads records from the data array using the head/tail indices
+	  in the control page, and advances data_tail when done.
+
+	  For self-instrumentation, use TLOB_IOCTL_TRACE_START /
+	  TLOB_IOCTL_TRACE_STOP via the /dev/rv misc device (enabled by
+	  CONFIG_RV_CHARDEV).
+
+	  Up to TLOB_MAX_MONITORED tasks may be monitored simultaneously.
+
+	  For further information, see:
+	    Documentation/trace/rv/monitor_tlob.rst
+
diff --git a/kernel/trace/rv/monitors/tlob/tlob.c b/kernel/trace/rv/monitors/tlob/tlob.c
new file mode 100644
index 000000000..a6e474025
--- /dev/null
+++ b/kernel/trace/rv/monitors/tlob/tlob.c
@@ -0,0 +1,986 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * tlob: task latency over budget monitor
+ *
+ * Track the elapsed wall-clock time of a marked code path and detect when
+ * a monitored task exceeds its per-task latency budget.  CLOCK_MONOTONIC
+ * is used so both on-CPU and off-CPU time count toward the budget.
+ *
+ * Per-task state is maintained in an RCU-protected hash table whose
+ * updates are serialised by a raw spinlock.  A one-shot hrtimer fires
+ * at the deadline; if the task has not called trace_stop by then, a
+ * violation is recorded.
+ *
+ * Up to TLOB_MAX_MONITORED tasks may be tracked simultaneously.
+ *
+ * Copyright (C) 2026 Wen Yang <wen.yang@linux.dev>
+ */
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/ftrace.h>
+#include <linux/hash.h>
+#include <linux/hrtimer.h>
+#include <linux/kernel.h>
+#include <linux/ktime.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/namei.h>
+#include <linux/poll.h>
+#include <linux/rv.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/atomic.h>
+#include <linux/rcupdate.h>
+#include <linux/spinlock.h>
+#include <linux/tracefs.h>
+#include <linux/uaccess.h>
+#include <linux/uprobes.h>
+#include <kunit/visibility.h>
+#include <rv/instrumentation.h>
+
+/* rv_interface_lock is defined in kernel/trace/rv/rv.c */
+extern struct mutex rv_interface_lock;
+
+#define MODULE_NAME "tlob"
+
+#include <rv_trace.h>
+#include <trace/events/sched.h>
+
+#define RV_MON_TYPE RV_MON_PER_TASK
+#include "tlob.h"
+#include <rv/da_monitor.h>
+
+/* Hash table size; must be a power of two. */
+#define TLOB_HTABLE_BITS		6
+#define TLOB_HTABLE_SIZE		(1 << TLOB_HTABLE_BITS)
+
+/* Maximum binary path length for uprobe binding. */
+#define TLOB_MAX_PATH			256
+
+/* Per-task latency monitoring state. */
+struct tlob_task_state {
+	struct hlist_node	hlist;
+	struct task_struct	*task;
+	u64			threshold_us;
+	u64			tag;
+	struct hrtimer		deadline_timer;
+	int			canceled;	/* set under both locks; read under either */
+	struct file		*notify_file;	/* NULL or held reference */
+
+	/*
+	 * entry_lock serialises the mutable accounting fields below.
+	 * Lock order: tlob_table_lock -> entry_lock (never reverse).
+	 */
+	raw_spinlock_t		entry_lock;
+	u64			on_cpu_us;
+	u64			off_cpu_us;
+	ktime_t			last_ts;
+	u32			switches;
+	u8			da_state;
+
+	struct rcu_head		rcu;	/* for call_rcu() teardown */
+};
+
+/* Per-uprobe-binding state: a start + stop probe pair for one binary region. */
+struct tlob_uprobe_binding {
+	struct list_head	list;
+	u64			threshold_us;
+	struct path		path;
+	char			binpath[TLOB_MAX_PATH];	/* canonical path for read/remove */
+	loff_t			offset_start;
+	loff_t			offset_stop;
+	struct uprobe_consumer	entry_uc;
+	struct uprobe_consumer	stop_uc;
+	struct uprobe		*entry_uprobe;
+	struct uprobe		*stop_uprobe;
+};
+
+/* Object pool for tlob_task_state. */
+static struct kmem_cache *tlob_state_cache;
+
+/* Hash table and lock protecting table structure (insert/delete/canceled). */
+static struct hlist_head tlob_htable[TLOB_HTABLE_SIZE];
+static DEFINE_RAW_SPINLOCK(tlob_table_lock);
+static atomic_t tlob_num_monitored = ATOMIC_INIT(0);
+
+/* Uprobe binding list; protected by tlob_uprobe_mutex. */
+static LIST_HEAD(tlob_uprobe_list);
+static DEFINE_MUTEX(tlob_uprobe_mutex);
+
+/* Forward declaration */
+static enum hrtimer_restart tlob_deadline_timer_fn(struct hrtimer *timer);
+
+/* Hash table helpers */
+
+static unsigned int tlob_hash_task(const struct task_struct *task)
+{
+	return hash_ptr((void *)task, TLOB_HTABLE_BITS);
+}
+
+/*
+ * tlob_find_rcu - look up per-task state.
+ * Must be called under rcu_read_lock() or with tlob_table_lock held.
+ */
+static struct tlob_task_state *tlob_find_rcu(struct task_struct *task)
+{
+	struct tlob_task_state *ws;
+	unsigned int h = tlob_hash_task(task);
+
+	hlist_for_each_entry_rcu(ws, &tlob_htable[h], hlist,
+				 lockdep_is_held(&tlob_table_lock))
+		if (ws->task == task)
+			return ws;
+	return NULL;
+}
+
+/* Allocate and initialise a new per-task state entry. */
+static struct tlob_task_state *tlob_alloc(struct task_struct *task,
+					  u64 threshold_us, u64 tag)
+{
+	struct tlob_task_state *ws;
+
+	ws = kmem_cache_zalloc(tlob_state_cache, GFP_ATOMIC);
+	if (!ws)
+		return NULL;
+
+	ws->task = task;
+	get_task_struct(task);
+	ws->threshold_us = threshold_us;
+	ws->tag = tag;
+	ws->last_ts = ktime_get();
+	ws->da_state = on_cpu_tlob;
+	raw_spin_lock_init(&ws->entry_lock);
+	hrtimer_setup(&ws->deadline_timer, tlob_deadline_timer_fn,
+		      CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	return ws;
+}
+
+/* RCU callback: free the slab once no readers remain. */
+static void tlob_free_rcu_slab(struct rcu_head *head)
+{
+	struct tlob_task_state *ws =
+		container_of(head, struct tlob_task_state, rcu);
+	kmem_cache_free(tlob_state_cache, ws);
+}
+
+/* Arm the one-shot deadline timer for threshold_us microseconds. */
+static void tlob_arm_deadline(struct tlob_task_state *ws)
+{
+	hrtimer_start(&ws->deadline_timer,
+		      ns_to_ktime(ws->threshold_us * NSEC_PER_USEC),
+		      HRTIMER_MODE_REL);
+}
+
+/*
+ * Push a violation record into a monitor fd's ring buffer; called from
+ * hrtimer expiry context.  Drop-new policy: discard the incoming record
+ * when the ring is full.  smp_store_release() on data_head pairs with
+ * smp_load_acquire() in the consumer.
+ */
+static void tlob_event_push(struct rv_file_priv *priv,
+			    const struct tlob_event *info)
+{
+	struct tlob_ring *ring = &priv->ring;
+	unsigned long flags;
+	u32 head, tail;
+
+	spin_lock_irqsave(&ring->lock, flags);
+
+	head = ring->page->data_head;
+	tail = READ_ONCE(ring->page->data_tail);
+
+	if (head - tail > ring->mask) {
+		/* Ring full: drop incoming record. */
+		ring->page->dropped++;
+		spin_unlock_irqrestore(&ring->lock, flags);
+		return;
+	}
+
+	ring->data[head & ring->mask] = *info;
+	/* pairs with smp_load_acquire() in the consumer */
+	smp_store_release(&ring->page->data_head, head + 1);
+
+	spin_unlock_irqrestore(&ring->lock, flags);
+
+	wake_up_interruptible_poll(&priv->waitq, EPOLLIN | EPOLLRDNORM);
+}
+
+#if IS_ENABLED(CONFIG_KUNIT)
+void tlob_event_push_kunit(struct rv_file_priv *priv,
+			  const struct tlob_event *info)
+{
+	tlob_event_push(priv, info);
+}
+EXPORT_SYMBOL_IF_KUNIT(tlob_event_push_kunit);
+#endif /* CONFIG_KUNIT */
+
+/*
+ * Budget exceeded: remove the entry, record the violation, and inject
+ * budget_expired into the DA.
+ *
+ * Lock order: tlob_table_lock -> entry_lock.  tlob_stop_task() sets
+ * ws->canceled under both locks; if we see it here the stop path owns cleanup.
+ * fput/put_task_struct are done before call_rcu(); the RCU callback only
+ * reclaims the slab.
+ */
+static enum hrtimer_restart tlob_deadline_timer_fn(struct hrtimer *timer)
+{
+	struct tlob_task_state *ws =
+		container_of(timer, struct tlob_task_state, deadline_timer);
+	struct tlob_event info = {};
+	struct file *notify_file;
+	struct task_struct *task;
+	unsigned long flags;
+	/* snapshots taken under entry_lock */
+	u64 on_cpu_us, off_cpu_us, threshold_us, tag;
+	u32 switches;
+	bool on_cpu;
+	bool push_event = false;
+
+	raw_spin_lock_irqsave(&tlob_table_lock, flags);
+	/* stop path sets canceled under both locks; if set it owns cleanup */
+	if (ws->canceled) {
+		raw_spin_unlock_irqrestore(&tlob_table_lock, flags);
+		return HRTIMER_NORESTART;
+	}
+
+	/* Finalize accounting and snapshot all fields under entry_lock. */
+	raw_spin_lock(&ws->entry_lock);
+
+	{
+		ktime_t now = ktime_get();
+		u64 delta_us = ktime_to_us(ktime_sub(now, ws->last_ts));
+
+		if (ws->da_state == on_cpu_tlob)
+			ws->on_cpu_us += delta_us;
+		else
+			ws->off_cpu_us += delta_us;
+	}
+
+	ws->canceled  = 1;
+	on_cpu_us     = ws->on_cpu_us;
+	off_cpu_us    = ws->off_cpu_us;
+	threshold_us  = ws->threshold_us;
+	tag           = ws->tag;
+	switches      = ws->switches;
+	on_cpu        = (ws->da_state == on_cpu_tlob);
+	notify_file   = ws->notify_file;
+	if (notify_file) {
+		info.tid          = task_pid_vnr(ws->task);
+		info.threshold_us = threshold_us;
+		info.on_cpu_us    = on_cpu_us;
+		info.off_cpu_us   = off_cpu_us;
+		info.switches     = switches;
+		info.state        = on_cpu ? 1 : 0;
+		info.tag          = tag;
+		push_event        = true;
+	}
+
+	raw_spin_unlock(&ws->entry_lock);
+
+	hlist_del_rcu(&ws->hlist);
+	atomic_dec(&tlob_num_monitored);
+	/*
+	 * Hold a reference so task remains valid across da_handle_event()
+	 * after we drop tlob_table_lock.
+	 */
+	task = ws->task;
+	get_task_struct(task);
+	raw_spin_unlock_irqrestore(&tlob_table_lock, flags);
+
+	/*
+	 * Both locks are now released; ws is exclusively owned (removed from
+	 * the hash table with canceled=1).  Emit the tracepoint and push the
+	 * violation record.
+	 */
+	trace_tlob_budget_exceeded(ws->task, threshold_us, on_cpu_us,
+				   off_cpu_us, switches, on_cpu, tag);
+
+	if (push_event) {
+		struct rv_file_priv *priv = notify_file->private_data;
+
+		if (priv)
+			tlob_event_push(priv, &info);
+	}
+
+	da_handle_event(task, budget_expired_tlob);
+
+	if (notify_file)
+		fput(notify_file);		/* ref from fget() at TRACE_START */
+	put_task_struct(ws->task);		/* ref from tlob_alloc() */
+	put_task_struct(task);			/* extra ref from get_task_struct() above */
+	call_rcu(&ws->rcu, tlob_free_rcu_slab);
+	return HRTIMER_NORESTART;
+}
+
+/* Tracepoint handlers */
+
+/*
+ * handle_sched_switch - advance the DA and accumulate on/off-CPU time.
+ *
+ * RCU read-side for lock-free lookup; entry_lock for per-task accounting.
+ * da_handle_event() is called after rcu_read_unlock() to avoid holding the
+ * read-side critical section across the RV framework.
+ */
+static void handle_sched_switch(void *data, bool preempt,
+				struct task_struct *prev,
+				struct task_struct *next,
+				unsigned int prev_state)
+{
+	struct tlob_task_state *ws;
+	unsigned long flags;
+	bool do_prev = false, do_next = false;
+	ktime_t now;
+
+	rcu_read_lock();
+
+	ws = tlob_find_rcu(prev);
+	if (ws) {
+		raw_spin_lock_irqsave(&ws->entry_lock, flags);
+		if (!ws->canceled) {
+			now = ktime_get();
+			ws->on_cpu_us += ktime_to_us(ktime_sub(now, ws->last_ts));
+			ws->last_ts = now;
+			ws->switches++;
+			ws->da_state = off_cpu_tlob;
+			do_prev = true;
+		}
+		raw_spin_unlock_irqrestore(&ws->entry_lock, flags);
+	}
+
+	ws = tlob_find_rcu(next);
+	if (ws) {
+		raw_spin_lock_irqsave(&ws->entry_lock, flags);
+		if (!ws->canceled) {
+			now = ktime_get();
+			ws->off_cpu_us += ktime_to_us(ktime_sub(now, ws->last_ts));
+			ws->last_ts = now;
+			ws->da_state = on_cpu_tlob;
+			do_next = true;
+		}
+		raw_spin_unlock_irqrestore(&ws->entry_lock, flags);
+	}
+
+	rcu_read_unlock();
+
+	if (do_prev)
+		da_handle_event(prev, switch_out_tlob);
+	if (do_next)
+		da_handle_event(next, switch_in_tlob);
+}
+
+static void handle_sched_wakeup(void *data, struct task_struct *p)
+{
+	struct tlob_task_state *ws;
+	unsigned long flags;
+	bool found = false;
+
+	rcu_read_lock();
+	ws = tlob_find_rcu(p);
+	if (ws) {
+		raw_spin_lock_irqsave(&ws->entry_lock, flags);
+		found = !ws->canceled;
+		raw_spin_unlock_irqrestore(&ws->entry_lock, flags);
+	}
+	rcu_read_unlock();
+
+	if (found)
+		da_handle_event(p, sched_wakeup_tlob);
+}
+
+/* -----------------------------------------------------------------------
+ * Core start/stop helpers (also called from rv_dev.c)
+ * -----------------------------------------------------------------------
+ */
+
+/*
+ * __tlob_insert - insert @ws into the hash table and arm its deadline timer.
+ *
+ * Re-checks for duplicates and capacity under tlob_table_lock; the caller
+ * may have done a lock-free pre-check before allocating @ws.  On failure @ws
+ * is freed directly (never in table, so no call_rcu needed).
+ */
+static int __tlob_insert(struct task_struct *task, struct tlob_task_state *ws)
+{
+	unsigned int h;
+	unsigned long flags;
+
+	raw_spin_lock_irqsave(&tlob_table_lock, flags);
+	if (tlob_find_rcu(task)) {
+		raw_spin_unlock_irqrestore(&tlob_table_lock, flags);
+		put_task_struct(ws->task);
+		kmem_cache_free(tlob_state_cache, ws);
+		return -EEXIST;
+	}
+	if (atomic_read(&tlob_num_monitored) >= TLOB_MAX_MONITORED) {
+		raw_spin_unlock_irqrestore(&tlob_table_lock, flags);
+		put_task_struct(ws->task);
+		kmem_cache_free(tlob_state_cache, ws);
+		return -ENOSPC;
+	}
+	h = tlob_hash_task(task);
+	hlist_add_head_rcu(&ws->hlist, &tlob_htable[h]);
+	atomic_inc(&tlob_num_monitored);
+	raw_spin_unlock_irqrestore(&tlob_table_lock, flags);
+
+	da_handle_start_run_event(task, trace_start_tlob);
+	tlob_arm_deadline(ws);
+	return 0;
+}
+
+/**
+ * tlob_start_task - begin monitoring @task with latency budget @threshold_us.
+ *
+ * @notify_file: /dev/rv fd whose ring buffer receives a tlob_event on
+ *               violation; caller transfers the fget() reference to tlob.c.
+ *               Pass NULL for synchronous mode (violations only via
+ *               TRACE_STOP return value and the tlob_budget_exceeded event).
+ *
+ * Returns 0, -ENODEV, -ERANGE, -EEXIST, -ENOSPC, or -ENOMEM.  On failure
+ * the caller retains responsibility for any @notify_file reference.
+ */
+int tlob_start_task(struct task_struct *task, u64 threshold_us,
+		    struct file *notify_file, u64 tag)
+{
+	struct tlob_task_state *ws;
+	unsigned long flags;
+
+	if (!tlob_state_cache)
+		return -ENODEV;
+
+	if (!threshold_us || threshold_us > (u64)KTIME_MAX / NSEC_PER_USEC)
+		return -ERANGE;
+
+	/* Quick pre-check before allocation. */
+	raw_spin_lock_irqsave(&tlob_table_lock, flags);
+	if (tlob_find_rcu(task)) {
+		raw_spin_unlock_irqrestore(&tlob_table_lock, flags);
+		return -EEXIST;
+	}
+	if (atomic_read(&tlob_num_monitored) >= TLOB_MAX_MONITORED) {
+		raw_spin_unlock_irqrestore(&tlob_table_lock, flags);
+		return -ENOSPC;
+	}
+	raw_spin_unlock_irqrestore(&tlob_table_lock, flags);
+
+	ws = tlob_alloc(task, threshold_us, tag);
+	if (!ws)
+		return -ENOMEM;
+
+	ws->notify_file = notify_file;
+	return __tlob_insert(task, ws);
+}
+EXPORT_SYMBOL_GPL(tlob_start_task);
+
+/**
+ * tlob_stop_task - stop monitoring @task before the deadline fires.
+ *
+ * Sets canceled under entry_lock (inside tlob_table_lock) before calling
+ * hrtimer_cancel(), racing safely with the timer callback.
+ *
+ * Returns 0 if within budget, -ESRCH if the entry is gone (deadline already
+ * fired, or TRACE_START was never called).
+ */
+int tlob_stop_task(struct task_struct *task)
+{
+	struct tlob_task_state *ws;
+	struct file *notify_file;
+	unsigned long flags;
+
+	raw_spin_lock_irqsave(&tlob_table_lock, flags);
+	ws = tlob_find_rcu(task);
+	if (!ws) {
+		raw_spin_unlock_irqrestore(&tlob_table_lock, flags);
+		return -ESRCH;
+	}
+
+	/* Prevent handle_sched_switch from updating accounting after removal. */
+	raw_spin_lock(&ws->entry_lock);
+	ws->canceled = 1;
+	raw_spin_unlock(&ws->entry_lock);
+
+	hlist_del_rcu(&ws->hlist);
+	atomic_dec(&tlob_num_monitored);
+	raw_spin_unlock_irqrestore(&tlob_table_lock, flags);
+
+	hrtimer_cancel(&ws->deadline_timer);
+
+	da_handle_event(task, trace_stop_tlob);
+
+	notify_file = ws->notify_file;
+	if (notify_file)
+		fput(notify_file);
+	put_task_struct(ws->task);
+	call_rcu(&ws->rcu, tlob_free_rcu_slab);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(tlob_stop_task);
+
+/* Stop monitoring all tracked tasks; called on monitor disable. */
+static void tlob_stop_all(void)
+{
+	struct tlob_task_state *batch[TLOB_MAX_MONITORED];
+	struct tlob_task_state *ws;
+	struct hlist_node *tmp;
+	unsigned long flags;
+	int n = 0, i;
+
+	raw_spin_lock_irqsave(&tlob_table_lock, flags);
+	for (i = 0; i < TLOB_HTABLE_SIZE; i++) {
+		hlist_for_each_entry_safe(ws, tmp, &tlob_htable[i], hlist) {
+			raw_spin_lock(&ws->entry_lock);
+			ws->canceled = 1;
+			raw_spin_unlock(&ws->entry_lock);
+			hlist_del_rcu(&ws->hlist);
+			atomic_dec(&tlob_num_monitored);
+			if (n < TLOB_MAX_MONITORED)
+				batch[n++] = ws;
+		}
+	}
+	raw_spin_unlock_irqrestore(&tlob_table_lock, flags);
+
+	for (i = 0; i < n; i++) {
+		ws = batch[i];
+		hrtimer_cancel(&ws->deadline_timer);
+		da_handle_event(ws->task, trace_stop_tlob);
+		if (ws->notify_file)
+			fput(ws->notify_file);
+		put_task_struct(ws->task);
+		call_rcu(&ws->rcu, tlob_free_rcu_slab);
+	}
+}
+
+/* uprobe binding helpers */
+
+static int tlob_uprobe_entry_handler(struct uprobe_consumer *uc,
+				     struct pt_regs *regs, __u64 *data)
+{
+	struct tlob_uprobe_binding *b =
+		container_of(uc, struct tlob_uprobe_binding, entry_uc);
+
+	tlob_start_task(current, b->threshold_us, NULL, (u64)b->offset_start);
+	return 0;
+}
+
+static int tlob_uprobe_stop_handler(struct uprobe_consumer *uc,
+				    struct pt_regs *regs, __u64 *data)
+{
+	tlob_stop_task(current);
+	return 0;
+}
+
+/*
+ * Register start + stop entry uprobes for a binding.
+ * Both are plain entry uprobes (no uretprobe), so a wrong offset never
+ * corrupts the call stack; the worst outcome is a missed stop (hrtimer
+ * fires and reports a budget violation).
+ * Called with tlob_uprobe_mutex held.
+ */
+static int tlob_add_uprobe(u64 threshold_us, const char *binpath,
+			   loff_t offset_start, loff_t offset_stop)
+{
+	struct tlob_uprobe_binding *b, *tmp_b;
+	char pathbuf[TLOB_MAX_PATH];
+	struct inode *inode;
+	char *canon;
+	int ret;
+
+	if (binpath[0] != '/')
+		return -EINVAL;
+
+	b = kzalloc(sizeof(*b), GFP_KERNEL);
+	if (!b)
+		return -ENOMEM;
+
+	b->threshold_us = threshold_us;
+	b->offset_start = offset_start;
+	b->offset_stop  = offset_stop;
+
+	ret = kern_path(binpath, LOOKUP_FOLLOW, &b->path);
+	if (ret)
+		goto err_free;
+
+	if (!d_is_reg(b->path.dentry)) {
+		ret = -EINVAL;
+		goto err_path;
+	}
+
+	/* Reject duplicate start offset for the same binary. */
+	list_for_each_entry(tmp_b, &tlob_uprobe_list, list) {
+		if (tmp_b->offset_start == offset_start &&
+		    tmp_b->path.dentry == b->path.dentry) {
+			ret = -EEXIST;
+			goto err_path;
+		}
+	}
+
+	/* Store canonical path for read-back and removal matching. */
+	canon = d_path(&b->path, pathbuf, sizeof(pathbuf));
+	if (IS_ERR(canon)) {
+		ret = PTR_ERR(canon);
+		goto err_path;
+	}
+	strscpy(b->binpath, canon, sizeof(b->binpath));
+
+	b->entry_uc.handler = tlob_uprobe_entry_handler;
+	b->stop_uc.handler  = tlob_uprobe_stop_handler;
+
+	inode = d_real_inode(b->path.dentry);
+
+	b->entry_uprobe = uprobe_register(inode, offset_start, 0, &b->entry_uc);
+	if (IS_ERR(b->entry_uprobe)) {
+		ret = PTR_ERR(b->entry_uprobe);
+		b->entry_uprobe = NULL;
+		goto err_path;
+	}
+
+	b->stop_uprobe = uprobe_register(inode, offset_stop, 0, &b->stop_uc);
+	if (IS_ERR(b->stop_uprobe)) {
+		ret = PTR_ERR(b->stop_uprobe);
+		b->stop_uprobe = NULL;
+		goto err_entry;
+	}
+
+	list_add_tail(&b->list, &tlob_uprobe_list);
+	return 0;
+
+err_entry:
+	uprobe_unregister_nosync(b->entry_uprobe, &b->entry_uc);
+	uprobe_unregister_sync();
+err_path:
+	path_put(&b->path);
+err_free:
+	kfree(b);
+	return ret;
+}
+
+/*
+ * Remove the uprobe binding for (offset_start, binpath).
+ * binpath is resolved to a dentry for comparison so symlinks are handled
+ * correctly.  Called with tlob_uprobe_mutex held.
+ */
+static void tlob_remove_uprobe_by_key(loff_t offset_start, const char *binpath)
+{
+	struct tlob_uprobe_binding *b, *tmp;
+	struct path remove_path;
+
+	if (kern_path(binpath, LOOKUP_FOLLOW, &remove_path))
+		return;
+
+	list_for_each_entry_safe(b, tmp, &tlob_uprobe_list, list) {
+		if (b->offset_start != offset_start)
+			continue;
+		if (b->path.dentry != remove_path.dentry)
+			continue;
+		uprobe_unregister_nosync(b->entry_uprobe, &b->entry_uc);
+		uprobe_unregister_nosync(b->stop_uprobe,  &b->stop_uc);
+		list_del(&b->list);
+		uprobe_unregister_sync();
+		path_put(&b->path);
+		kfree(b);
+		break;
+	}
+
+	path_put(&remove_path);
+}
+
+/* Unregister all uprobe bindings; called from disable_tlob(). */
+static void tlob_remove_all_uprobes(void)
+{
+	struct tlob_uprobe_binding *b, *tmp;
+
+	mutex_lock(&tlob_uprobe_mutex);
+	list_for_each_entry_safe(b, tmp, &tlob_uprobe_list, list) {
+		uprobe_unregister_nosync(b->entry_uprobe, &b->entry_uc);
+		uprobe_unregister_nosync(b->stop_uprobe,  &b->stop_uc);
+		list_del(&b->list);
+		path_put(&b->path);
+		kfree(b);
+	}
+	mutex_unlock(&tlob_uprobe_mutex);
+	uprobe_unregister_sync();
+}
+
+/*
+ * tracefs "monitor" file
+ *
+ * Read:  one "threshold_us:0xoffset_start:0xoffset_stop:binary_path\n"
+ *        line per registered uprobe binding.
+ * Write: "threshold_us:offset_start:offset_stop:binary_path" - add uprobe binding
+ *        "-offset_start:binary_path"                         - remove uprobe binding
+ */
+
+static ssize_t tlob_monitor_read(struct file *file,
+				 char __user *ubuf,
+				 size_t count, loff_t *ppos)
+{
+	/* threshold(20) + two 0x offsets(2*18) + path(256) + delimiters */
+	const int line_sz = TLOB_MAX_PATH + 72;
+	struct tlob_uprobe_binding *b;
+	char *buf;
+	int n = 0, buf_sz, pos = 0;
+	ssize_t ret;
+
+	/*
+	 * Count and format under one critical section so the list cannot
+	 * grow between sizing the buffer and filling it.
+	 */
+	mutex_lock(&tlob_uprobe_mutex);
+	list_for_each_entry(b, &tlob_uprobe_list, list)
+		n++;
+
+	buf_sz = (n ? n : 1) * line_sz + 1;
+	buf = kmalloc(buf_sz, GFP_KERNEL);
+	if (!buf) {
+		mutex_unlock(&tlob_uprobe_mutex);
+		return -ENOMEM;
+	}
+
+	list_for_each_entry(b, &tlob_uprobe_list, list) {
+		pos += scnprintf(buf + pos, buf_sz - pos,
+				 "%llu:0x%llx:0x%llx:%s\n",
+				 b->threshold_us,
+				 (unsigned long long)b->offset_start,
+				 (unsigned long long)b->offset_stop,
+				 b->binpath);
+	}
+	mutex_unlock(&tlob_uprobe_mutex);
+
+	ret = simple_read_from_buffer(ubuf, count, ppos, buf, pos);
+	kfree(buf);
+	return ret;
+}
+
+/*
+ * Parse "threshold_us:offset_start:offset_stop:binary_path".
+ * binary_path comes last so it may freely contain ':'.
+ * Returns 0 on success.
+ */
+VISIBLE_IF_KUNIT int tlob_parse_uprobe_line(char *buf, u64 *thr_out,
+					    char **path_out,
+					    loff_t *start_out, loff_t *stop_out)
+{
+	unsigned long long thr;
+	long long start, stop;
+	int n = 0;
+
+	/*
+	 * %llu : decimal-only (microseconds)
+	 * %lli : auto-base, accepts 0x-prefixed hex for offsets
+	 * %n   : records the byte offset of the first path character
+	 */
+	if (sscanf(buf, "%llu:%lli:%lli:%n", &thr, &start, &stop, &n) != 3)
+		return -EINVAL;
+	if (thr == 0 || n == 0 || buf[n] == '\0')
+		return -EINVAL;
+	if (start < 0 || stop < 0)
+		return -EINVAL;
+
+	*thr_out   = thr;
+	*start_out = start;
+	*stop_out  = stop;
+	*path_out  = buf + n;
+	return 0;
+}
+
+static ssize_t tlob_monitor_write(struct file *file,
+				  const char __user *ubuf,
+				  size_t count, loff_t *ppos)
+{
+	char buf[TLOB_MAX_PATH + 64];
+	loff_t offset_start, offset_stop;
+	u64 threshold_us;
+	char *binpath;
+	int ret;
+
+	if (count >= sizeof(buf))
+		return -EINVAL;
+	if (copy_from_user(buf, ubuf, count))
+		return -EFAULT;
+	buf[count] = '\0';
+
+	if (count > 0 && buf[count - 1] == '\n')
+		buf[count - 1] = '\0';
+
+	/* Remove request: "-offset_start:binary_path" */
+	if (buf[0] == '-') {
+		long long off;
+		int n = 0;
+
+		if (sscanf(buf + 1, "%lli:%n", &off, &n) != 1 || n == 0)
+			return -EINVAL;
+		binpath = buf + 1 + n;
+		if (binpath[0] != '/')
+			return -EINVAL;
+
+		mutex_lock(&tlob_uprobe_mutex);
+		tlob_remove_uprobe_by_key((loff_t)off, binpath);
+		mutex_unlock(&tlob_uprobe_mutex);
+
+		return (ssize_t)count;
+	}
+
+	/*
+	 * Uprobe binding: "threshold_us:offset_start:offset_stop:binary_path"
+	 * binpath points into buf at the start of the path field.
+	 */
+	ret = tlob_parse_uprobe_line(buf, &threshold_us,
+				     &binpath, &offset_start, &offset_stop);
+	if (ret)
+		return ret;
+
+	mutex_lock(&tlob_uprobe_mutex);
+	ret = tlob_add_uprobe(threshold_us, binpath, offset_start, offset_stop);
+	mutex_unlock(&tlob_uprobe_mutex);
+	return ret ? ret : (ssize_t)count;
+}
+
+static const struct file_operations tlob_monitor_fops = {
+	.open	= simple_open,
+	.read	= tlob_monitor_read,
+	.write	= tlob_monitor_write,
+	.llseek	= noop_llseek,
+};
+
+/*
+ * __tlob_init_monitor / __tlob_destroy_monitor - called with rv_interface_lock
+ * held (required by da_monitor_init/destroy via rv_get/put_task_monitor_slot).
+ */
+static int __tlob_init_monitor(void)
+{
+	int i, retval;
+
+	tlob_state_cache = kmem_cache_create("tlob_task_state",
+					     sizeof(struct tlob_task_state),
+					     0, 0, NULL);
+	if (!tlob_state_cache)
+		return -ENOMEM;
+
+	for (i = 0; i < TLOB_HTABLE_SIZE; i++)
+		INIT_HLIST_HEAD(&tlob_htable[i]);
+	atomic_set(&tlob_num_monitored, 0);
+
+	retval = da_monitor_init();
+	if (retval) {
+		kmem_cache_destroy(tlob_state_cache);
+		tlob_state_cache = NULL;
+		return retval;
+	}
+
+	rv_this.enabled = 1;
+	return 0;
+}
+
+static void __tlob_destroy_monitor(void)
+{
+	rv_this.enabled = 0;
+	tlob_stop_all();
+	tlob_remove_all_uprobes();
+	/*
+	 * Drain pending call_rcu() callbacks from tlob_stop_all() before
+	 * destroying the kmem_cache.
+	 */
+	synchronize_rcu();
+	da_monitor_destroy();
+	kmem_cache_destroy(tlob_state_cache);
+	tlob_state_cache = NULL;
+}
+
+/*
+ * tlob_init_monitor / tlob_destroy_monitor - KUnit wrappers that acquire
+ * rv_interface_lock, satisfying the lockdep_assert_held() inside
+ * rv_get/put_task_monitor_slot().
+ */
+VISIBLE_IF_KUNIT int tlob_init_monitor(void)
+{
+	int ret;
+
+	mutex_lock(&rv_interface_lock);
+	ret = __tlob_init_monitor();
+	mutex_unlock(&rv_interface_lock);
+	return ret;
+}
+EXPORT_SYMBOL_IF_KUNIT(tlob_init_monitor);
+
+VISIBLE_IF_KUNIT void tlob_destroy_monitor(void)
+{
+	mutex_lock(&rv_interface_lock);
+	__tlob_destroy_monitor();
+	mutex_unlock(&rv_interface_lock);
+}
+EXPORT_SYMBOL_IF_KUNIT(tlob_destroy_monitor);
+
+VISIBLE_IF_KUNIT int tlob_enable_hooks(void)
+{
+	rv_attach_trace_probe("tlob", sched_switch, handle_sched_switch);
+	rv_attach_trace_probe("tlob", sched_wakeup, handle_sched_wakeup);
+	return 0;
+}
+EXPORT_SYMBOL_IF_KUNIT(tlob_enable_hooks);
+
+VISIBLE_IF_KUNIT void tlob_disable_hooks(void)
+{
+	rv_detach_trace_probe("tlob", sched_switch, handle_sched_switch);
+	rv_detach_trace_probe("tlob", sched_wakeup, handle_sched_wakeup);
+}
+EXPORT_SYMBOL_IF_KUNIT(tlob_disable_hooks);
+
+/*
+ * enable_tlob / disable_tlob - called by rv_enable/disable_monitor() which
+ * already holds rv_interface_lock; call the __ variants directly.
+ */
+static int enable_tlob(void)
+{
+	int retval;
+
+	retval = __tlob_init_monitor();
+	if (retval)
+		return retval;
+
+	return tlob_enable_hooks();
+}
+
+static void disable_tlob(void)
+{
+	tlob_disable_hooks();
+	__tlob_destroy_monitor();
+}
+
+static struct rv_monitor rv_this = {
+	.name		= "tlob",
+	.description	= "Per-task latency-over-budget monitor.",
+	.enable		= enable_tlob,
+	.disable	= disable_tlob,
+	.reset		= da_monitor_reset_all,
+	.enabled	= 0,
+};
+
+static int __init register_tlob(void)
+{
+	int ret;
+
+	ret = rv_register_monitor(&rv_this, NULL);
+	if (ret)
+		return ret;
+
+	if (rv_this.root_d) {
+		tracefs_create_file("monitor", 0644, rv_this.root_d, NULL,
+				    &tlob_monitor_fops);
+	}
+
+	return 0;
+}
+
+static void __exit unregister_tlob(void)
+{
+	rv_unregister_monitor(&rv_this);
+}
+
+module_init(register_tlob);
+module_exit(unregister_tlob);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Wen Yang <wen.yang@linux.dev>");
+MODULE_DESCRIPTION("tlob: task latency over budget per-task monitor.");
diff --git a/kernel/trace/rv/monitors/tlob/tlob.h b/kernel/trace/rv/monitors/tlob/tlob.h
new file mode 100644
index 000000000..3438a6175
--- /dev/null
+++ b/kernel/trace/rv/monitors/tlob/tlob.h
@@ -0,0 +1,145 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _RV_TLOB_H
+#define _RV_TLOB_H
+
+/*
+ * C representation of the tlob automaton, generated from tlob.dot via rvgen
+ * and extended with tlob_start_task()/tlob_stop_task() declarations.
+ * For the format description see Documentation/trace/rv/deterministic_automata.rst
+ */
+
+#include <linux/rv.h>
+#include <uapi/linux/rv.h>
+
+#define MONITOR_NAME tlob
+
+enum states_tlob {
+	unmonitored_tlob,
+	on_cpu_tlob,
+	off_cpu_tlob,
+	state_max_tlob,
+};
+
+#define INVALID_STATE state_max_tlob
+
+enum events_tlob {
+	trace_start_tlob,
+	switch_in_tlob,
+	switch_out_tlob,
+	sched_wakeup_tlob,
+	trace_stop_tlob,
+	budget_expired_tlob,
+	event_max_tlob,
+};
+
+struct automaton_tlob {
+	char *state_names[state_max_tlob];
+	char *event_names[event_max_tlob];
+	unsigned char function[state_max_tlob][event_max_tlob];
+	unsigned char initial_state;
+	bool final_states[state_max_tlob];
+};
+
+static const struct automaton_tlob automaton_tlob = {
+	.state_names = {
+		"unmonitored",
+		"on_cpu",
+		"off_cpu",
+	},
+	.event_names = {
+		"trace_start",
+		"switch_in",
+		"switch_out",
+		"sched_wakeup",
+		"trace_stop",
+		"budget_expired",
+	},
+	.function = {
+		/* unmonitored */
+		{
+			on_cpu_tlob,		/* trace_start    */
+			unmonitored_tlob,	/* switch_in      */
+			unmonitored_tlob,	/* switch_out     */
+			unmonitored_tlob,	/* sched_wakeup   */
+			INVALID_STATE,		/* trace_stop     */
+			INVALID_STATE,		/* budget_expired */
+		},
+		/* on_cpu */
+		{
+			INVALID_STATE,		/* trace_start    */
+			INVALID_STATE,		/* switch_in      */
+			off_cpu_tlob,		/* switch_out     */
+			on_cpu_tlob,		/* sched_wakeup   */
+			unmonitored_tlob,	/* trace_stop     */
+			unmonitored_tlob,	/* budget_expired */
+		},
+		/* off_cpu */
+		{
+			INVALID_STATE,		/* trace_start    */
+			on_cpu_tlob,		/* switch_in      */
+			off_cpu_tlob,		/* switch_out     */
+			off_cpu_tlob,		/* sched_wakeup   */
+			unmonitored_tlob,	/* trace_stop     */
+			unmonitored_tlob,	/* budget_expired */
+		},
+	},
+	/*
+	 * final_states: unmonitored is the sole accepting state.
+	 * Violations are recorded via tlob_event_push() and the
+	 * tlob_budget_exceeded tracepoint.
+	 */
+	.initial_state = unmonitored_tlob,
+	.final_states = { 1, 0, 0 },
+};
+
+/* Exported for use by the RV ioctl layer (rv_dev.c) */
+int tlob_start_task(struct task_struct *task, u64 threshold_us,
+		    struct file *notify_file, u64 tag);
+int tlob_stop_task(struct task_struct *task);
+
+/* Maximum number of concurrently monitored tasks (also used by KUnit). */
+#define TLOB_MAX_MONITORED	64U
+
+/*
+ * Ring buffer constants (also published in UAPI for mmap size calculation).
+ */
+#define TLOB_RING_DEFAULT_CAP	64U	/* records allocated at open()  */
+#define TLOB_RING_MIN_CAP	 8U	/* minimum accepted by mmap()   */
+#define TLOB_RING_MAX_CAP	4096U	/* maximum accepted by mmap()   */
+
+/**
+ * struct tlob_ring - per-fd mmap-capable violation ring buffer.
+ *
+ * Allocated as a contiguous page range at rv_open() time:
+ *   page 0:    struct tlob_mmap_page  (shared with userspace)
+ *   pages 1-N: struct tlob_event[capacity]
+ */
+struct tlob_ring {
+	struct tlob_mmap_page	*page;
+	struct tlob_event	*data;
+	u32			 mask;
+	spinlock_t		 lock;
+	unsigned long		 base;
+	unsigned int		 order;
+};
+
+/**
+ * struct rv_file_priv - per-fd private data for /dev/rv.
+ */
+struct rv_file_priv {
+	struct tlob_ring	ring;
+	wait_queue_head_t	waitq;
+};
+
+#if IS_ENABLED(CONFIG_KUNIT)
+int tlob_init_monitor(void);
+void tlob_destroy_monitor(void);
+int tlob_enable_hooks(void);
+void tlob_disable_hooks(void);
+void tlob_event_push_kunit(struct rv_file_priv *priv,
+			  const struct tlob_event *info);
+int tlob_parse_uprobe_line(char *buf, u64 *thr_out,
+			   char **path_out,
+			   loff_t *start_out, loff_t *stop_out);
+#endif /* CONFIG_KUNIT */
+
+#endif /* _RV_TLOB_H */
diff --git a/kernel/trace/rv/monitors/tlob/tlob_trace.h b/kernel/trace/rv/monitors/tlob/tlob_trace.h
new file mode 100644
index 000000000..b08d67776
--- /dev/null
+++ b/kernel/trace/rv/monitors/tlob/tlob_trace.h
@@ -0,0 +1,42 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+/*
+ * Snippet to be included in rv_trace.h
+ */
+
+#ifdef CONFIG_RV_MON_TLOB
+/*
+ * tlob uses the generic event_da_monitor_id and error_da_monitor_id event
+ * classes so that both event classes are instantiated.  This avoids a
+ * -Werror=unused-variable warning that the compiler emits when a
+ * DECLARE_EVENT_CLASS has no corresponding DEFINE_EVENT instance.
+ *
+ * The event_tlob tracepoint is defined here but the call-site in
+ * da_handle_event() is overridden with a no-op macro below so that no
+ * trace record is emitted on every scheduler context switch.  Budget
+ * violations are reported via the dedicated tlob_budget_exceeded event.
+ *
+ * error_tlob IS kept active so that invalid DA transitions (programming
+ * errors) are still visible in the ftrace ring buffer for debugging.
+ */
+DEFINE_EVENT(event_da_monitor_id, event_tlob,
+	     TP_PROTO(int id, char *state, char *event, char *next_state,
+		      bool final_state),
+	     TP_ARGS(id, state, event, next_state, final_state));
+
+DEFINE_EVENT(error_da_monitor_id, error_tlob,
+	     TP_PROTO(int id, char *state, char *event),
+	     TP_ARGS(id, state, event));
+
+/*
+ * Override the trace_event_tlob() call-site with a no-op after the
+ * DEFINE_EVENT above has satisfied the event class instantiation
+ * requirement.  The tracepoint symbol itself exists (and can be enabled
+ * via tracefs) but the automatic call from da_handle_event() is silenced
+ * to avoid per-context-switch ftrace noise during normal operation.
+ */
+#undef trace_event_tlob
+#define trace_event_tlob(id, state, event, next_state, final_state)	\
+	do { (void)(id); (void)(state); (void)(event);			\
+	     (void)(next_state); (void)(final_state); } while (0)
+#endif /* CONFIG_RV_MON_TLOB */
diff --git a/kernel/trace/rv/rv.c b/kernel/trace/rv/rv.c
index ee4e68102..e754e76d5 100644
--- a/kernel/trace/rv/rv.c
+++ b/kernel/trace/rv/rv.c
@@ -148,6 +148,10 @@
 #include <rv_trace.h>
 #endif
 
+#ifdef CONFIG_RV_MON_TLOB
+EXPORT_TRACEPOINT_SYMBOL_GPL(tlob_budget_exceeded);
+#endif
+
 #include "rv.h"
 
 DEFINE_MUTEX(rv_interface_lock);
diff --git a/kernel/trace/rv/rv_dev.c b/kernel/trace/rv/rv_dev.c
new file mode 100644
index 000000000..a052f3203
--- /dev/null
+++ b/kernel/trace/rv/rv_dev.c
@@ -0,0 +1,602 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * rv_dev.c - /dev/rv misc device for RV monitor self-instrumentation
+ *
+ * A single misc device (MISC_DYNAMIC_MINOR) serves all RV monitors.
+ * ioctl numbers encode the monitor identity:
+ *
+ *   0x01 - 0x1F  tlob (task latency over budget)
+ *   0x20 - 0x3F  reserved
+ *
+ * Each monitor exports tlob_start_task() / tlob_stop_task() which are
+ * called here.  The calling task is identified by current.
+ *
+ * Magic: RV_IOC_MAGIC (0xB9), defined in include/uapi/linux/rv.h
+ *
+ * Per-fd private data (rv_file_priv)
+ * ------------------------------------
+ * Every open() of /dev/rv allocates an rv_file_priv (defined in tlob.h).
+ * When TLOB_IOCTL_TRACE_START is called with args.notify_fd >= 0, violations
+ * are pushed as tlob_event records into that fd's per-fd ring buffer (tlob_ring)
+ * and its poll/epoll waitqueue is woken.
+ *
+ * Consumers drain records with read() on the notify_fd; read() blocks until
+ * at least one record is available (unless O_NONBLOCK is set).
+ *
+ * Per-thread "started" tracking (tlob_task_handle)
+ * -------------------------------------------------
+ * tlob_stop_task() returns -ESRCH in two distinct situations:
+ *
+ *   (a) The deadline timer already fired and removed the tlob hash-table
+ *       entry before TRACE_STOP arrived -> budget was exceeded -> -EOVERFLOW
+ *
+ *   (b) TRACE_START was never called for this thread -> programming error
+ *       -> -ESRCH
+ *
+ * To distinguish them, rv_dev.c maintains a lightweight hash table
+ * (tlob_handles) that records a tlob_task_handle for every task_struct *
+ * for which a successful TLOB_IOCTL_TRACE_START has been
+ * issued but the corresponding TLOB_IOCTL_TRACE_STOP has not yet arrived.
+ *
+ * tlob_task_handle is a thin "session ticket"  --  it carries only the
+ * task pointer and the owning file descriptor.  The heavy per-task state
+ * (hrtimer, DA state, threshold) lives in tlob_task_state inside tlob.c.
+ *
+ * The table is keyed on task_struct * (same key as tlob.c), protected
+ * by tlob_handles_lock (spinlock, irq-safe).  No get_task_struct()
+ * refcount is needed here because tlob.c already holds a reference for
+ * each live entry.
+ *
+ * Multiple threads may share the same fd.  Each thread has its own
+ * tlob_task_handle in the table, so concurrent TRACE_START / TRACE_STOP
+ * calls from different threads do not interfere.
+ *
+ * The fd release path (rv_release) calls tlob_stop_task() for every
+ * handle in tlob_handles that belongs to the closing fd, ensuring cleanup
+ * even if the user forgets to call TRACE_STOP.
+ */
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/gfp.h>
+#include <linux/hash.h>
+#include <linux/mm.h>
+#include <linux/miscdevice.h>
+#include <linux/module.h>
+#include <linux/poll.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/uaccess.h>
+#include <uapi/linux/rv.h>
+
+#ifdef CONFIG_RV_MON_TLOB
+#include "monitors/tlob/tlob.h"
+#endif
+
+/* -----------------------------------------------------------------------
+ * tlob_task_handle - per-thread session ticket for the ioctl interface
+ *
+ * One handle is allocated by TLOB_IOCTL_TRACE_START and freed by
+ * TLOB_IOCTL_TRACE_STOP (or by rv_release if the fd is closed).
+ *
+ * @hlist:  Hash-table linkage in tlob_handles (keyed on task pointer).
+ * @task:   The monitored thread.  Plain pointer; no refcount is held
+ *          here: tlob.c holds one while the monitoring window is open,
+ *          and a stale handle left behind after the deadline timer
+ *          fires uses @task only as a hash-table lookup key.
+ * @file:   The /dev/rv file descriptor that issued TRACE_START.
+ *          Used by rv_release() to sweep orphaned handles on close().
+ * -----------------------------------------------------------------------
+ */
+#define TLOB_HANDLES_BITS	5
+#define TLOB_HANDLES_SIZE	(1 << TLOB_HANDLES_BITS)
+
+struct tlob_task_handle {
+	struct hlist_node	hlist;
+	struct task_struct	*task;
+	struct file		*file;
+};
+
+static struct hlist_head tlob_handles[TLOB_HANDLES_SIZE];
+static DEFINE_SPINLOCK(tlob_handles_lock);
+
+static unsigned int tlob_handle_hash(const struct task_struct *task)
+{
+	return hash_ptr((void *)task, TLOB_HANDLES_BITS);
+}
+
+/* Must be called with tlob_handles_lock held. */
+static struct tlob_task_handle *
+tlob_handle_find_locked(struct task_struct *task)
+{
+	struct tlob_task_handle *h;
+	unsigned int slot = tlob_handle_hash(task);
+
+	hlist_for_each_entry(h, &tlob_handles[slot], hlist) {
+		if (h->task == task)
+			return h;
+	}
+	return NULL;
+}
+
+/*
+ * tlob_handle_alloc - record that @task has an active monitoring session
+ *                     opened via @file.
+ *
+ * Returns 0 on success, -EEXIST if @task already has a handle (double
+ * TRACE_START without TRACE_STOP), -ENOMEM on allocation failure.
+ */
+static int tlob_handle_alloc(struct task_struct *task, struct file *file)
+{
+	struct tlob_task_handle *h;
+	unsigned long flags;
+	unsigned int slot;
+
+	h = kmalloc(sizeof(*h), GFP_KERNEL);
+	if (!h)
+		return -ENOMEM;
+	h->task = task;
+	h->file = file;
+
+	spin_lock_irqsave(&tlob_handles_lock, flags);
+	if (tlob_handle_find_locked(task)) {
+		spin_unlock_irqrestore(&tlob_handles_lock, flags);
+		kfree(h);
+		return -EEXIST;
+	}
+	slot = tlob_handle_hash(task);
+	hlist_add_head(&h->hlist, &tlob_handles[slot]);
+	spin_unlock_irqrestore(&tlob_handles_lock, flags);
+	return 0;
+}
+
+/*
+ * tlob_handle_free - remove the handle for @task and free it.
+ *
+ * Returns 1 if a handle existed (TRACE_START was called), 0 if not found
+ * (TRACE_START was never called for this thread).
+ */
+static int tlob_handle_free(struct task_struct *task)
+{
+	struct tlob_task_handle *h;
+	unsigned long flags;
+
+	spin_lock_irqsave(&tlob_handles_lock, flags);
+	h = tlob_handle_find_locked(task);
+	if (h) {
+		hlist_del_init(&h->hlist);
+		spin_unlock_irqrestore(&tlob_handles_lock, flags);
+		kfree(h);
+		return 1;
+	}
+	spin_unlock_irqrestore(&tlob_handles_lock, flags);
+	return 0;
+}
+
+/*
+ * tlob_handle_sweep_file - release all handles owned by @file.
+ *
+ * Called from rv_release() when the fd is closed without TRACE_STOP.
+ * Calls tlob_stop_task() for each orphaned handle to drain the tlob
+ * monitoring entries and prevent resource leaks in tlob.c.
+ *
+ * The number of live handles is not bounded by TLOB_HANDLES_SIZE, so
+ * they cannot safely be batched in a fixed-size array on the stack.
+ * Instead the lock is dropped around each tlob_stop_task() call (which
+ * may sleep/spin internally) and the scan restarts from the top;
+ * hlist_del_init() before the unlock guarantees forward progress
+ * because a swept handle is never found again.
+ */
+#ifdef CONFIG_RV_MON_TLOB
+static void tlob_handle_sweep_file(struct file *file)
+{
+	struct tlob_task_handle *h;
+	unsigned long flags;
+	int i;
+
+restart:
+	spin_lock_irqsave(&tlob_handles_lock, flags);
+	for (i = 0; i < TLOB_HANDLES_SIZE; i++) {
+		hlist_for_each_entry(h, &tlob_handles[i], hlist) {
+			if (h->file != file)
+				continue;
+			hlist_del_init(&h->hlist);
+			spin_unlock_irqrestore(&tlob_handles_lock, flags);
+			/*
+			 * Ignore -ESRCH: the deadline timer may have
+			 * already fired and cleaned up the tlob entry.
+			 */
+			tlob_stop_task(h->task);
+			kfree(h);
+			goto restart;
+		}
+	}
+	spin_unlock_irqrestore(&tlob_handles_lock, flags);
+}
+#else
+static inline void tlob_handle_sweep_file(struct file *file) {}
+#endif /* CONFIG_RV_MON_TLOB */
+
+/* -----------------------------------------------------------------------
+ * Ring buffer lifecycle
+ * -----------------------------------------------------------------------
+ */
+
+/*
+ * tlob_ring_alloc - allocate a ring of @cap records (must be a power of 2).
+ *
+ * Allocates a physically contiguous block of pages:
+ *   page 0     : struct tlob_mmap_page  (control page, shared with userspace)
+ *   pages 1..N : struct tlob_event[cap] (data pages)
+ *
+ * Each page is marked reserved so it can be mapped to userspace via mmap().
+ */
+static int tlob_ring_alloc(struct tlob_ring *ring, u32 cap)
+{
+	unsigned int total = PAGE_SIZE + cap * sizeof(struct tlob_event);
+	unsigned int order = get_order(total);
+	unsigned long base;
+	unsigned int i;
+
+	base = __get_free_pages(GFP_KERNEL | __GFP_ZERO, order);
+	if (!base)
+		return -ENOMEM;
+
+	for (i = 0; i < (1u << order); i++)
+		SetPageReserved(virt_to_page((void *)(base + i * PAGE_SIZE)));
+
+	ring->base  = base;
+	ring->order = order;
+	ring->page  = (struct tlob_mmap_page *)base;
+	ring->data  = (struct tlob_event *)(base + PAGE_SIZE);
+	ring->mask  = cap - 1;
+	spin_lock_init(&ring->lock);
+
+	ring->page->capacity    = cap;
+	ring->page->version     = 1;
+	ring->page->data_offset = PAGE_SIZE;
+	ring->page->record_size = sizeof(struct tlob_event);
+	return 0;
+}
+
+static void tlob_ring_free(struct tlob_ring *ring)
+{
+	unsigned int i;
+
+	if (!ring->base)
+		return;
+
+	for (i = 0; i < (1u << ring->order); i++)
+		ClearPageReserved(virt_to_page((void *)(ring->base + i * PAGE_SIZE)));
+
+	free_pages(ring->base, ring->order);
+	ring->base = 0;
+	ring->page = NULL;
+	ring->data = NULL;
+}
+
+/* -----------------------------------------------------------------------
+ * File operations
+ * -----------------------------------------------------------------------
+ */
+
+static int rv_open(struct inode *inode, struct file *file)
+{
+	struct rv_file_priv *priv;
+	int ret;
+
+	priv = kzalloc(sizeof(*priv), GFP_KERNEL);
+	if (!priv)
+		return -ENOMEM;
+
+	ret = tlob_ring_alloc(&priv->ring, TLOB_RING_DEFAULT_CAP);
+	if (ret) {
+		kfree(priv);
+		return ret;
+	}
+
+	init_waitqueue_head(&priv->waitq);
+	file->private_data = priv;
+	return 0;
+}
+
+static int rv_release(struct inode *inode, struct file *file)
+{
+	struct rv_file_priv *priv = file->private_data;
+
+	tlob_handle_sweep_file(file);
+	tlob_ring_free(&priv->ring);
+	kfree(priv);
+	file->private_data = NULL;
+	return 0;
+}
+
+static __poll_t rv_poll(struct file *file, poll_table *wait)
+{
+	struct rv_file_priv *priv = file->private_data;
+
+	if (!priv)
+		return EPOLLERR;
+
+	poll_wait(file, &priv->waitq, wait);
+
+	/*
+	 * Pairs with smp_store_release(&ring->page->data_head, ...) in
+	 * tlob_event_push().  No lock needed: head is written by the kernel
+	 * producer and read here; tail is written by the consumer and we only
+	 * need an approximate check for the poll fast path.
+	 */
+	if (smp_load_acquire(&priv->ring.page->data_head) !=
+	    READ_ONCE(priv->ring.page->data_tail))
+		return EPOLLIN | EPOLLRDNORM;
+
+	return 0;
+}
+
+/*
+ * rv_read - consume tlob_event violation records from this fd's ring buffer.
+ *
+ * Each read() returns a whole number of struct tlob_event records.  @count must
+ * be at least sizeof(struct tlob_event); partial-record sizes are rejected with
+ * -EINVAL.
+ *
+ * Blocking behaviour follows O_NONBLOCK on the fd:
+ *   O_NONBLOCK clear: blocks until at least one record is available.
+ *   O_NONBLOCK set:   returns -EAGAIN immediately if the ring is empty.
+ *
+ * Returns the number of bytes copied (always a multiple of sizeof tlob_event),
+ * -EAGAIN if non-blocking and empty, or a negative error code.
+ *
+ * read() and mmap() share the same ring and data_tail cursor; do not use
+ * both simultaneously on the same fd.
+ */
+static ssize_t rv_read(struct file *file, char __user *buf, size_t count,
+		       loff_t *ppos)
+{
+	struct rv_file_priv *priv = file->private_data;
+	struct tlob_ring *ring;
+	size_t rec = sizeof(struct tlob_event);
+	unsigned long irqflags;
+	ssize_t done = 0;
+	int ret;
+
+	if (!priv)
+		return -ENODEV;
+
+	ring = &priv->ring;
+
+	if (count < rec)
+		return -EINVAL;
+
+	/* Blocking path: sleep until the producer advances data_head. */
+	if (!(file->f_flags & O_NONBLOCK)) {
+		ret = wait_event_interruptible(priv->waitq,
+			/* pairs with smp_store_release() in the producer */
+			smp_load_acquire(&ring->page->data_head) !=
+			READ_ONCE(ring->page->data_tail));
+		if (ret)
+			return ret;
+	}
+
+	/*
+	 * Drain records into the caller's buffer.  ring->lock serialises
+	 * concurrent read() callers and the softirq producer.
+	 */
+	while (done + rec <= count) {
+		struct tlob_event record;
+		u32 head, tail;
+
+		spin_lock_irqsave(&ring->lock, irqflags);
+		/* pairs with smp_store_release() in the producer */
+		head = smp_load_acquire(&ring->page->data_head);
+		tail = ring->page->data_tail;
+		if (head == tail) {
+			spin_unlock_irqrestore(&ring->lock, irqflags);
+			break;
+		}
+		record = ring->data[tail & ring->mask];
+		WRITE_ONCE(ring->page->data_tail, tail + 1);
+		spin_unlock_irqrestore(&ring->lock, irqflags);
+
+		if (copy_to_user(buf + done, &record, rec))
+			return done ? done : -EFAULT;
+		done += rec;
+	}
+
+	return done ? done : -EAGAIN;
+}
+
+/*
+ * rv_mmap - map the per-fd violation ring buffer into userspace.
+ *
+ * The mmap region covers the full ring allocation:
+ *
+ *   offset 0          : struct tlob_mmap_page  (control page)
+ *   offset PAGE_SIZE  : struct tlob_event[capacity]  (data pages)
+ *
+ * The caller must map exactly PAGE_SIZE + capacity * sizeof(struct tlob_event)
+ * bytes starting at offset 0 (vm_pgoff must be 0).  The actual capacity is
+ * read from tlob_mmap_page.capacity after a successful mmap(2).
+ *
+ * Private mappings (MAP_PRIVATE) are rejected: the shared data_tail field
+ * written by userspace must be visible to the kernel producer.
+ */
+static int rv_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct rv_file_priv *priv = file->private_data;
+	struct tlob_ring    *ring;
+	unsigned long        size = vma->vm_end - vma->vm_start;
+	unsigned long        ring_size;
+
+	if (!priv)
+		return -ENODEV;
+
+	ring = &priv->ring;
+
+	if (vma->vm_pgoff != 0)
+		return -EINVAL;
+
+	ring_size = PAGE_ALIGN(PAGE_SIZE + ((unsigned long)(ring->mask + 1) *
+					    sizeof(struct tlob_event)));
+	if (size != ring_size)
+		return -EINVAL;
+
+	if (!(vma->vm_flags & VM_SHARED))
+		return -EINVAL;
+
+	return remap_pfn_range(vma, vma->vm_start,
+			       page_to_pfn(virt_to_page((void *)ring->base)),
+			       ring_size, vma->vm_page_prot);
+}
+
+/* -----------------------------------------------------------------------
+ * ioctl dispatcher
+ * -----------------------------------------------------------------------
+ */
+
+static long rv_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
+{
+	unsigned int nr = _IOC_NR(cmd);
+
+	/*
+	 * Verify the magic byte so we don't accidentally handle ioctls
+	 * intended for a different device.
+	 */
+	if (_IOC_TYPE(cmd) != RV_IOC_MAGIC)
+		return -ENOTTY;
+
+#ifdef CONFIG_RV_MON_TLOB
+	/* tlob: ioctl numbers 0x01 - 0x1F */
+	switch (cmd) {
+	case TLOB_IOCTL_TRACE_START: {
+		struct tlob_start_args args;
+		struct file *notify_file = NULL;
+		int ret, hret;
+
+		if (copy_from_user(&args,
+				   (struct tlob_start_args __user *)arg,
+				   sizeof(args)))
+			return -EFAULT;
+		if (args.threshold_us == 0)
+			return -EINVAL;
+		if (args.flags != 0)
+			return -EINVAL;
+
+		/*
+		 * If notify_fd >= 0, resolve it to a file pointer.
+		 * fget() bumps the reference count; tlob.c drops it
+		 * via fput() when the monitoring window ends.
+		 * Reject non-/dev/rv fds to prevent type confusion.
+		 */
+		if (args.notify_fd >= 0) {
+			notify_file = fget(args.notify_fd);
+			if (!notify_file)
+				return -EBADF;
+			if (notify_file->f_op != file->f_op) {
+				fput(notify_file);
+				return -EINVAL;
+			}
+		}
+
+		ret = tlob_start_task(current, args.threshold_us,
+				      notify_file, args.tag);
+		if (ret != 0) {
+			/* tlob.c did not take ownership; drop ref. */
+			if (notify_file)
+				fput(notify_file);
+			return ret;
+		}
+
+		/*
+		 * Record session handle.  Free any stale handle left by
+		 * a previous window whose deadline timer fired (timer
+		 * removes tlob_task_state but cannot touch tlob_handles).
+		 */
+		tlob_handle_free(current);
+		hret = tlob_handle_alloc(current, file);
+		if (hret < 0) {
+			tlob_stop_task(current);
+			return hret;
+		}
+		return 0;
+	}
+	case TLOB_IOCTL_TRACE_STOP: {
+		int had_handle;
+		int ret;
+
+		/*
+		 * Atomically remove the session handle for current.
+		 *
+		 *   had_handle == 0: TRACE_START was never called for
+		 *                    this thread -> caller bug -> -ESRCH
+		 *
+		 *   had_handle == 1: TRACE_START was called.  If
+		 *                    tlob_stop_task() now returns
+		 *                    -ESRCH, the deadline timer already
+		 *                    fired -> budget exceeded -> -EOVERFLOW
+		 */
+		had_handle = tlob_handle_free(current);
+		if (!had_handle)
+			return -ESRCH;
+
+		ret = tlob_stop_task(current);
+		return (ret == -ESRCH) ? -EOVERFLOW : ret;
+	}
+	default:
+		break;
+	}
+#endif /* CONFIG_RV_MON_TLOB */
+
+	return -ENOTTY;
+}
+
+/* -----------------------------------------------------------------------
+ * Module init / exit
+ * -----------------------------------------------------------------------
+ */
+
+static const struct file_operations rv_fops = {
+	.owner		= THIS_MODULE,
+	.open		= rv_open,
+	.release	= rv_release,
+	.read		= rv_read,
+	.poll		= rv_poll,
+	.mmap		= rv_mmap,
+	.unlocked_ioctl	= rv_ioctl,
+#ifdef CONFIG_COMPAT
+	.compat_ioctl	= compat_ptr_ioctl,
+#endif
+	.llseek		= noop_llseek,
+};
+
+/*
+ * 0666: /dev/rv is a self-instrumentation device.  All ioctls operate
+ * exclusively on the calling task (current); no task can monitor another
+ * via this interface.  Opening the device does not grant any privilege
+ * beyond observing one's own latency, so world-read/write is appropriate.
+ */
+static struct miscdevice rv_miscdev = {
+	.minor	= MISC_DYNAMIC_MINOR,
+	.name	= "rv",
+	.fops	= &rv_fops,
+	.mode	= 0666,
+};
+
+static int __init rv_ioctl_init(void)
+{
+	int i;
+
+	for (i = 0; i < TLOB_HANDLES_SIZE; i++)
+		INIT_HLIST_HEAD(&tlob_handles[i]);
+
+	return misc_register(&rv_miscdev);
+}
+
+static void __exit rv_ioctl_exit(void)
+{
+	misc_deregister(&rv_miscdev);
+}
+
+module_init(rv_ioctl_init);
+module_exit(rv_ioctl_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("RV ioctl interface via /dev/rv");
diff --git a/kernel/trace/rv/rv_trace.h b/kernel/trace/rv/rv_trace.h
index 4a6faddac..65d6c6485 100644
--- a/kernel/trace/rv/rv_trace.h
+++ b/kernel/trace/rv/rv_trace.h
@@ -126,6 +126,7 @@ DECLARE_EVENT_CLASS(error_da_monitor_id,
 #include <monitors/snroc/snroc_trace.h>
 #include <monitors/nrp/nrp_trace.h>
 #include <monitors/sssw/sssw_trace.h>
+#include <monitors/tlob/tlob_trace.h>
 // Add new monitors based on CONFIG_DA_MON_EVENTS_ID here
 
 #endif /* CONFIG_DA_MON_EVENTS_ID */
@@ -202,6 +203,55 @@ TRACE_EVENT(rv_retries_error,
 		__get_str(event), __get_str(name))
 );
 #endif /* CONFIG_RV_MON_MAINTENANCE_EVENTS */
+
+#ifdef CONFIG_RV_MON_TLOB
+/*
+ * tlob_budget_exceeded - emitted when a monitored task exceeds its latency
+ * budget.  Carries the on-CPU / off-CPU time breakdown so that the cause
+ * of the overrun (CPU-bound vs. scheduling/I/O latency) is immediately
+ * visible in the ftrace ring buffer without post-processing.
+ */
+TRACE_EVENT(tlob_budget_exceeded,
+
+	TP_PROTO(struct task_struct *task, u64 threshold_us,
+		 u64 on_cpu_us, u64 off_cpu_us, u32 switches,
+		 bool state_is_on_cpu, u64 tag),
+
+	TP_ARGS(task, threshold_us, on_cpu_us, off_cpu_us, switches,
+		state_is_on_cpu, tag),
+
+	TP_STRUCT__entry(
+		__string(comm,		task->comm)
+		__field(pid_t,		pid)
+		__field(u64,		threshold_us)
+		__field(u64,		on_cpu_us)
+		__field(u64,		off_cpu_us)
+		__field(u32,		switches)
+		__field(bool,		state_is_on_cpu)
+		__field(u64,		tag)
+	),
+
+	TP_fast_assign(
+		__assign_str(comm);
+		__entry->pid		= task->pid;
+		__entry->threshold_us	= threshold_us;
+		__entry->on_cpu_us	= on_cpu_us;
+		__entry->off_cpu_us	= off_cpu_us;
+		__entry->switches	= switches;
+		__entry->state_is_on_cpu = state_is_on_cpu;
+		__entry->tag		= tag;
+	),
+
+	TP_printk("%s[%d]: budget exceeded threshold=%llu on_cpu=%llu off_cpu=%llu switches=%u state=%s tag=0x%016llx",
+		__get_str(comm), __entry->pid,
+		__entry->threshold_us,
+		__entry->on_cpu_us, __entry->off_cpu_us,
+		__entry->switches,
+		__entry->state_is_on_cpu ? "on_cpu" : "off_cpu",
+		__entry->tag)
+);
+#endif /* CONFIG_RV_MON_TLOB */
+
 #endif /* _TRACE_RV_H */
 
 /* This part must be outside protection */
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [RFC PATCH 3/4] rv/tlob: Add KUnit tests for the tlob monitor
  2026-04-12 19:27 [RFC PATCH 0/4] rv/tlob: Add task latency over budget RV monitor wen.yang
  2026-04-12 19:27 ` [RFC PATCH 1/4] rv/tlob: Add tlob model DOT file wen.yang
  2026-04-12 19:27 ` [RFC PATCH 2/4] rv/tlob: Add tlob deterministic automaton monitor wen.yang
@ 2026-04-12 19:27 ` wen.yang
  2026-04-16 12:09   ` Gabriele Monaco
  2026-04-12 19:27 ` [RFC PATCH 4/4] selftests/rv: Add selftest " wen.yang
  3 siblings, 1 reply; 11+ messages in thread
From: wen.yang @ 2026-04-12 19:27 UTC (permalink / raw)
  To: Steven Rostedt, Gabriele Monaco, Masami Hiramatsu,
	Mathieu Desnoyers
  Cc: linux-trace-kernel, linux-kernel, Wen Yang

From: Wen Yang <wen.yang@linux.dev>

Add six KUnit test suites gated behind CONFIG_TLOB_KUNIT_TEST
(depends on RV_MON_TLOB && KUNIT; default KUNIT_ALL_TESTS).
A .kunitconfig fragment is provided for the kunit.py runner.

Coverage: automaton state transitions and self-loops; start/stop API
error paths (duplicate start, missing start, overflow threshold,
table-full, immediate deadline); scheduler context-switch accounting
for on/off-CPU time; violation tracepoint payload fields; ring buffer
push, drop-new overflow, and wakeup; and the uprobe line parser.
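For reference, the suites can be exercised with the in-tree kunit.py
runner and the .kunitconfig fragment added by this patch (invocation
shown for illustration; run from the top of a kernel source tree):

```shell
# Build and boot a UML kernel with only the tlob KUnit config enabled,
# then run and report the tlob test suites.
./tools/testing/kunit/kunit.py run \
	--kunitconfig=kernel/trace/rv/monitors/tlob/.kunitconfig
```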

Signed-off-by: Wen Yang <wen.yang@linux.dev>
---
 kernel/trace/rv/Makefile                   |    1 +
 kernel/trace/rv/monitors/tlob/.kunitconfig |    5 +
 kernel/trace/rv/monitors/tlob/Kconfig      |   12 +
 kernel/trace/rv/monitors/tlob/tlob.c       |    1 +
 kernel/trace/rv/monitors/tlob/tlob_kunit.c | 1194 ++++++++++++++++++++
 5 files changed, 1213 insertions(+)
 create mode 100644 kernel/trace/rv/monitors/tlob/.kunitconfig
 create mode 100644 kernel/trace/rv/monitors/tlob/tlob_kunit.c

diff --git a/kernel/trace/rv/Makefile b/kernel/trace/rv/Makefile
index cc3781a3b..6d963207d 100644
--- a/kernel/trace/rv/Makefile
+++ b/kernel/trace/rv/Makefile
@@ -19,6 +19,7 @@ obj-$(CONFIG_RV_MON_NRP) += monitors/nrp/nrp.o
 obj-$(CONFIG_RV_MON_SSSW) += monitors/sssw/sssw.o
 obj-$(CONFIG_RV_MON_OPID) += monitors/opid/opid.o
 obj-$(CONFIG_RV_MON_TLOB) += monitors/tlob/tlob.o
+obj-$(CONFIG_TLOB_KUNIT_TEST) += monitors/tlob/tlob_kunit.o
 # Add new monitors here
 obj-$(CONFIG_RV_REACTORS) += rv_reactors.o
 obj-$(CONFIG_RV_REACT_PRINTK) += reactor_printk.o
diff --git a/kernel/trace/rv/monitors/tlob/.kunitconfig b/kernel/trace/rv/monitors/tlob/.kunitconfig
new file mode 100644
index 000000000..977c58601
--- /dev/null
+++ b/kernel/trace/rv/monitors/tlob/.kunitconfig
@@ -0,0 +1,5 @@
+CONFIG_FTRACE=y
+CONFIG_KUNIT=y
+CONFIG_RV=y
+CONFIG_RV_MON_TLOB=y
+CONFIG_TLOB_KUNIT_TEST=y
diff --git a/kernel/trace/rv/monitors/tlob/Kconfig b/kernel/trace/rv/monitors/tlob/Kconfig
index 010237480..4ccd2f881 100644
--- a/kernel/trace/rv/monitors/tlob/Kconfig
+++ b/kernel/trace/rv/monitors/tlob/Kconfig
@@ -49,3 +49,15 @@ config RV_MON_TLOB
 	  For further information, see:
 	    Documentation/trace/rv/monitor_tlob.rst
 
+config TLOB_KUNIT_TEST
+	tristate "KUnit tests for tlob monitor" if !KUNIT_ALL_TESTS
+	depends on RV_MON_TLOB && KUNIT
+	default KUNIT_ALL_TESTS
+	help
+	  Enable KUnit in-kernel unit tests for the tlob RV monitor.
+
+	  Tests cover automaton state transitions, the hash table helpers,
+	  the start/stop task interface, and the event ring buffer including
+	  overflow handling and wakeup behaviour.
+
+	  Say Y or M here to run the tlob KUnit test suite; otherwise say N.
diff --git a/kernel/trace/rv/monitors/tlob/tlob.c b/kernel/trace/rv/monitors/tlob/tlob.c
index a6e474025..dd959eb9b 100644
--- a/kernel/trace/rv/monitors/tlob/tlob.c
+++ b/kernel/trace/rv/monitors/tlob/tlob.c
@@ -784,6 +784,7 @@ VISIBLE_IF_KUNIT int tlob_parse_uprobe_line(char *buf, u64 *thr_out,
 	*path_out  = buf + n;
 	return 0;
 }
+EXPORT_SYMBOL_IF_KUNIT(tlob_parse_uprobe_line);
 
 static ssize_t tlob_monitor_write(struct file *file,
 				  const char __user *ubuf,
diff --git a/kernel/trace/rv/monitors/tlob/tlob_kunit.c b/kernel/trace/rv/monitors/tlob/tlob_kunit.c
new file mode 100644
index 000000000..64f5abb34
--- /dev/null
+++ b/kernel/trace/rv/monitors/tlob/tlob_kunit.c
@@ -0,0 +1,1194 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * KUnit tests for the tlob RV monitor.
+ *
+ * tlob_automaton:         DA transition table coverage.
+ * tlob_task_api:          tlob_start_task()/tlob_stop_task() lifecycle and errors.
+ * tlob_sched_integration: on/off-CPU accounting across real context switches.
+ * tlob_trace_output:      tlob_budget_exceeded tracepoint field verification.
+ * tlob_event_buf:         ring buffer push, overflow, and wakeup.
+ * tlob_parse_uprobe:      uprobe format string parser acceptance and rejection.
+ *
+ * The duplicate-(binary, offset_start) constraint enforced by tlob_add_uprobe()
+ * is not covered here: that function calls kern_path() and requires a real
+ * filesystem, which is outside the scope of unit tests. It is covered by the
+ * uprobe_duplicate_offset case in tools/testing/selftests/rv/test_tlob.sh.
+ */
+#include <kunit/test.h>
+#include <linux/atomic.h>
+#include <linux/completion.h>
+#include <linux/delay.h>
+#include <linux/kthread.h>
+#include <linux/ktime.h>
+#include <linux/mutex.h>
+#include <linux/sched.h>
+#include <linux/sched/task.h>
+#include <linux/tracepoint.h>
+
+/*
+ * Pull in the rv tracepoint declarations so that
+ * register_trace_tlob_budget_exceeded() is available.
+ * No CREATE_TRACE_POINTS here; the tracepoint implementation lives in rv.c.
+ */
+#include <rv_trace.h>
+
+#include "tlob.h"
+
+/*
+ * da_handle_event_tlob - apply one automaton transition on @da_mon.
+ *
+ * This helper is used only by the KUnit automaton suite. It applies the
+ * tlob transition table directly on a supplied da_monitor without touching
+ * per-task slots, tracepoints, or timers.
+ */
+static void da_handle_event_tlob(struct da_monitor *da_mon,
+				 enum events_tlob event)
+{
+	enum states_tlob curr_state = (enum states_tlob)da_mon->curr_state;
+	enum states_tlob next_state =
+		(enum states_tlob)automaton_tlob.function[curr_state][event];
+
+	if (next_state != INVALID_STATE)
+		da_mon->curr_state = next_state;
+}
+
+MODULE_IMPORT_NS("EXPORTED_FOR_KUNIT_TESTING");
+
+/*
+ * Suite 1: automaton state-machine transitions
+ */
+
+/* unmonitored -> trace_start -> on_cpu */
+static void tlob_unmonitored_to_on_cpu(struct kunit *test)
+{
+	struct da_monitor mon = { .curr_state = unmonitored_tlob };
+
+	da_handle_event_tlob(&mon, trace_start_tlob);
+	KUNIT_EXPECT_EQ(test, (int)mon.curr_state, (int)on_cpu_tlob);
+}
+
+/* on_cpu -> switch_out -> off_cpu */
+static void tlob_on_cpu_switch_out(struct kunit *test)
+{
+	struct da_monitor mon = { .curr_state = on_cpu_tlob };
+
+	da_handle_event_tlob(&mon, switch_out_tlob);
+	KUNIT_EXPECT_EQ(test, (int)mon.curr_state, (int)off_cpu_tlob);
+}
+
+/* off_cpu -> switch_in -> on_cpu */
+static void tlob_off_cpu_switch_in(struct kunit *test)
+{
+	struct da_monitor mon = { .curr_state = off_cpu_tlob };
+
+	da_handle_event_tlob(&mon, switch_in_tlob);
+	KUNIT_EXPECT_EQ(test, (int)mon.curr_state, (int)on_cpu_tlob);
+}
+
+/* on_cpu -> budget_expired -> unmonitored */
+static void tlob_on_cpu_budget_expired(struct kunit *test)
+{
+	struct da_monitor mon = { .curr_state = on_cpu_tlob };
+
+	da_handle_event_tlob(&mon, budget_expired_tlob);
+	KUNIT_EXPECT_EQ(test, (int)mon.curr_state, (int)unmonitored_tlob);
+}
+
+/* off_cpu -> budget_expired -> unmonitored */
+static void tlob_off_cpu_budget_expired(struct kunit *test)
+{
+	struct da_monitor mon = { .curr_state = off_cpu_tlob };
+
+	da_handle_event_tlob(&mon, budget_expired_tlob);
+	KUNIT_EXPECT_EQ(test, (int)mon.curr_state, (int)unmonitored_tlob);
+}
+
+/* on_cpu -> trace_stop -> unmonitored */
+static void tlob_on_cpu_trace_stop(struct kunit *test)
+{
+	struct da_monitor mon = { .curr_state = on_cpu_tlob };
+
+	da_handle_event_tlob(&mon, trace_stop_tlob);
+	KUNIT_EXPECT_EQ(test, (int)mon.curr_state, (int)unmonitored_tlob);
+}
+
+/* off_cpu -> trace_stop -> unmonitored */
+static void tlob_off_cpu_trace_stop(struct kunit *test)
+{
+	struct da_monitor mon = { .curr_state = off_cpu_tlob };
+
+	da_handle_event_tlob(&mon, trace_stop_tlob);
+	KUNIT_EXPECT_EQ(test, (int)mon.curr_state, (int)unmonitored_tlob);
+}
+
+/* budget_expired -> unmonitored; a single trace_start re-enters on_cpu. */
+static void tlob_violation_then_restart(struct kunit *test)
+{
+	struct da_monitor mon = { .curr_state = unmonitored_tlob };
+
+	da_handle_event_tlob(&mon, trace_start_tlob);
+	KUNIT_EXPECT_EQ(test, (int)mon.curr_state, (int)on_cpu_tlob);
+
+	da_handle_event_tlob(&mon, budget_expired_tlob);
+	KUNIT_EXPECT_EQ(test, (int)mon.curr_state, (int)unmonitored_tlob);
+
+	/* Single trace_start is sufficient to re-enter on_cpu */
+	da_handle_event_tlob(&mon, trace_start_tlob);
+	KUNIT_EXPECT_EQ(test, (int)mon.curr_state, (int)on_cpu_tlob);
+
+	da_handle_event_tlob(&mon, trace_stop_tlob);
+	KUNIT_EXPECT_EQ(test, (int)mon.curr_state, (int)unmonitored_tlob);
+}
+
+/* off_cpu self-loops on switch_out and sched_wakeup. */
+static void tlob_off_cpu_self_loops(struct kunit *test)
+{
+	static const enum events_tlob events[] = {
+		switch_out_tlob, sched_wakeup_tlob,
+	};
+	unsigned int i;
+
+	for (i = 0; i < ARRAY_SIZE(events); i++) {
+		struct da_monitor mon = { .curr_state = off_cpu_tlob };
+
+		da_handle_event_tlob(&mon, events[i]);
+		KUNIT_EXPECT_EQ_MSG(test, (int)mon.curr_state,
+				    (int)off_cpu_tlob,
+				    "event %u should self-loop in off_cpu",
+				    events[i]);
+	}
+}
+
+/* on_cpu self-loops on sched_wakeup. */
+static void tlob_on_cpu_self_loops(struct kunit *test)
+{
+	struct da_monitor mon = { .curr_state = on_cpu_tlob };
+
+	da_handle_event_tlob(&mon, sched_wakeup_tlob);
+	KUNIT_EXPECT_EQ_MSG(test, (int)mon.curr_state, (int)on_cpu_tlob,
+			    "sched_wakeup should self-loop in on_cpu");
+}
+
+/* Scheduling events in unmonitored self-loop (no state change). */
+static void tlob_unmonitored_ignores_sched(struct kunit *test)
+{
+	static const enum events_tlob events[] = {
+		switch_in_tlob, switch_out_tlob, sched_wakeup_tlob,
+	};
+	unsigned int i;
+
+	for (i = 0; i < ARRAY_SIZE(events); i++) {
+		struct da_monitor mon = { .curr_state = unmonitored_tlob };
+
+		da_handle_event_tlob(&mon, events[i]);
+		KUNIT_EXPECT_EQ_MSG(test, (int)mon.curr_state,
+				    (int)unmonitored_tlob,
+				    "event %u should self-loop in unmonitored",
+				    events[i]);
+	}
+}
+
+static void tlob_full_happy_path(struct kunit *test)
+{
+	struct da_monitor mon = { .curr_state = unmonitored_tlob };
+
+	da_handle_event_tlob(&mon, trace_start_tlob);
+	KUNIT_EXPECT_EQ(test, (int)mon.curr_state, (int)on_cpu_tlob);
+
+	da_handle_event_tlob(&mon, switch_out_tlob);
+	KUNIT_EXPECT_EQ(test, (int)mon.curr_state, (int)off_cpu_tlob);
+
+	da_handle_event_tlob(&mon, switch_in_tlob);
+	KUNIT_EXPECT_EQ(test, (int)mon.curr_state, (int)on_cpu_tlob);
+
+	da_handle_event_tlob(&mon, trace_stop_tlob);
+	KUNIT_EXPECT_EQ(test, (int)mon.curr_state, (int)unmonitored_tlob);
+}
+
+static void tlob_multiple_switches(struct kunit *test)
+{
+	struct da_monitor mon = { .curr_state = unmonitored_tlob };
+	int i;
+
+	da_handle_event_tlob(&mon, trace_start_tlob);
+	KUNIT_EXPECT_EQ(test, (int)mon.curr_state, (int)on_cpu_tlob);
+
+	for (i = 0; i < 3; i++) {
+		da_handle_event_tlob(&mon, switch_out_tlob);
+		KUNIT_EXPECT_EQ(test, (int)mon.curr_state, (int)off_cpu_tlob);
+		da_handle_event_tlob(&mon, switch_in_tlob);
+		KUNIT_EXPECT_EQ(test, (int)mon.curr_state, (int)on_cpu_tlob);
+	}
+
+	da_handle_event_tlob(&mon, trace_stop_tlob);
+	KUNIT_EXPECT_EQ(test, (int)mon.curr_state, (int)unmonitored_tlob);
+}
+
+static struct kunit_case tlob_automaton_cases[] = {
+	KUNIT_CASE(tlob_unmonitored_to_on_cpu),
+	KUNIT_CASE(tlob_on_cpu_switch_out),
+	KUNIT_CASE(tlob_off_cpu_switch_in),
+	KUNIT_CASE(tlob_on_cpu_budget_expired),
+	KUNIT_CASE(tlob_off_cpu_budget_expired),
+	KUNIT_CASE(tlob_on_cpu_trace_stop),
+	KUNIT_CASE(tlob_off_cpu_trace_stop),
+	KUNIT_CASE(tlob_off_cpu_self_loops),
+	KUNIT_CASE(tlob_on_cpu_self_loops),
+	KUNIT_CASE(tlob_unmonitored_ignores_sched),
+	KUNIT_CASE(tlob_full_happy_path),
+	KUNIT_CASE(tlob_violation_then_restart),
+	KUNIT_CASE(tlob_multiple_switches),
+	{}
+};
+
+static struct kunit_suite tlob_automaton_suite = {
+	.name       = "tlob_automaton",
+	.test_cases = tlob_automaton_cases,
+};
+
+/*
+ * Suite 2: task registration API
+ */
+
+/* Basic start/stop cycle */
+static void tlob_start_stop_ok(struct kunit *test)
+{
+	int ret;
+
+	ret = tlob_start_task(current, 10000000 /* 10 s, won't fire */, NULL, 0);
+	KUNIT_ASSERT_EQ(test, ret, 0);
+	KUNIT_EXPECT_EQ(test, tlob_stop_task(current), 0);
+}
+
+/* Double start must return -EEXIST. */
+static void tlob_double_start(struct kunit *test)
+{
+	KUNIT_ASSERT_EQ(test, tlob_start_task(current, 10000000, NULL, 0), 0);
+	KUNIT_EXPECT_EQ(test, tlob_start_task(current, 10000000, NULL, 0), -EEXIST);
+	tlob_stop_task(current);
+}
+
+/* Stop without start must return -ESRCH. */
+static void tlob_stop_without_start(struct kunit *test)
+{
+	tlob_stop_task(current);  /* clear any stale entry first */
+	KUNIT_EXPECT_EQ(test, tlob_stop_task(current), -ESRCH);
+}
+
+/*
+ * A 1 us budget fires before tlob_stop_task() is called. Either the
+ * timer wins (-ESRCH) or we are very fast (0); both are valid.
+ */
+static void tlob_immediate_deadline(struct kunit *test)
+{
+	int ret = tlob_start_task(current, 1 /* 1 us - fires almost immediately */, NULL, 0);
+
+	KUNIT_ASSERT_EQ(test, ret, 0);
+	/* Let the 1 us timer fire */
+	udelay(100);
+	/*
+	 * By now the hrtimer has almost certainly fired. Either it has
+	 * (returns -ESRCH) or we were very fast (returns 0). Both are
+	 * acceptable; just ensure no crash and the table is clean after.
+	 */
+	ret = tlob_stop_task(current);
+	KUNIT_EXPECT_TRUE(test, ret == 0 || ret == -ESRCH);
+}
+
+/*
+ * Fill the table to TLOB_MAX_MONITORED using kthreads (each needs a
+ * distinct task_struct), then verify the next start returns -ENOSPC.
+ */
+struct tlob_waiter_ctx {
+	struct completion start;
+	struct completion done;
+};
+
+static int tlob_waiter_fn(void *arg)
+{
+	struct tlob_waiter_ctx *ctx = arg;
+
+	wait_for_completion(&ctx->start);
+	complete(&ctx->done);
+	return 0;
+}
+
+static void tlob_enospc(struct kunit *test)
+{
+	struct tlob_waiter_ctx *ctxs;
+	struct task_struct **threads;
+	int i, ret;
+
+	ctxs = kunit_kcalloc(test, TLOB_MAX_MONITORED,
+			     sizeof(*ctxs), GFP_KERNEL);
+	KUNIT_ASSERT_NOT_NULL(test, ctxs);
+
+	threads = kunit_kcalloc(test, TLOB_MAX_MONITORED,
+				sizeof(*threads), GFP_KERNEL);
+	KUNIT_ASSERT_NOT_NULL(test, threads);
+
+	/* Start TLOB_MAX_MONITORED kthreads and monitor each */
+	for (i = 0; i < TLOB_MAX_MONITORED; i++) {
+		init_completion(&ctxs[i].start);
+		init_completion(&ctxs[i].done);
+
+		threads[i] = kthread_run(tlob_waiter_fn, &ctxs[i],
+					 "tlob_waiter_%d", i);
+		if (IS_ERR(threads[i])) {
+			KUNIT_FAIL(test, "kthread_run failed at i=%d", i);
+			threads[i] = NULL;
+			goto cleanup;
+		}
+		get_task_struct(threads[i]);
+
+		ret = tlob_start_task(threads[i], 10000000, NULL, 0);
+		if (ret != 0) {
+			KUNIT_FAIL(test, "tlob_start_task failed at i=%d: %d",
+				   i, ret);
+			/*
+			 * Leave threads[i] set: the cleanup path below
+			 * unblocks, stops, and releases it exactly once.
+			 */
+			goto cleanup;
+		}
+	}
+
+	/* The table is now full: one more must fail with -ENOSPC */
+	ret = tlob_start_task(current, 10000000, NULL, 0);
+	KUNIT_EXPECT_EQ(test, ret, -ENOSPC);
+
+cleanup:
+	/*
+	 * Two-pass cleanup: cancel tlob monitoring and unblock kthreads first,
+	 * then kthread_stop() to wait for full exit before releasing refs.
+	 */
+	for (i = 0; i < TLOB_MAX_MONITORED; i++) {
+		if (!threads[i])
+			break;
+		tlob_stop_task(threads[i]);
+		complete(&ctxs[i].start);
+	}
+	for (i = 0; i < TLOB_MAX_MONITORED; i++) {
+		if (!threads[i])
+			break;
+		kthread_stop(threads[i]);
+		put_task_struct(threads[i]);
+	}
+}
+
+/*
+ * A kthread holds a mutex for 80 ms; arm a 10 ms budget, burn ~1 ms
+ * on-CPU, then block on the mutex. The timer fires off-CPU; stop
+ * must return -ESRCH.
+ */
+struct tlob_holder_ctx {
+	struct mutex		lock;
+	struct completion	ready;
+	unsigned int		hold_ms;
+};
+
+static int tlob_holder_fn(void *arg)
+{
+	struct tlob_holder_ctx *ctx = arg;
+
+	mutex_lock(&ctx->lock);
+	complete(&ctx->ready);
+	msleep(ctx->hold_ms);
+	mutex_unlock(&ctx->lock);
+	return 0;
+}
+
+static void tlob_deadline_fires_off_cpu(struct kunit *test)
+{
+	struct tlob_holder_ctx ctx = { .hold_ms = 80 };
+	struct task_struct *holder;
+	ktime_t t0;
+	int ret;
+
+	mutex_init(&ctx.lock);
+	init_completion(&ctx.ready);
+
+	holder = kthread_run(tlob_holder_fn, &ctx, "tlob_holder_kunit");
+	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, holder);
+	wait_for_completion(&ctx.ready);
+
+	/* Arm 10 ms budget while kthread holds the mutex. */
+	ret = tlob_start_task(current, 10000, NULL, 0);
+	KUNIT_ASSERT_EQ(test, ret, 0);
+
+	/* Phase 1: burn ~1 ms on-CPU to exercise on_cpu accounting. */
+	t0 = ktime_get();
+	while (ktime_us_delta(ktime_get(), t0) < 1000)
+		cpu_relax();
+
+	/*
+	 * Phase 2: block on the mutex -> on_cpu->off_cpu transition.
+	 * The 10 ms budget fires while we are off-CPU.
+	 */
+	mutex_lock(&ctx.lock);
+	mutex_unlock(&ctx.lock);
+
+	/* Timer already fired and removed the entry -> -ESRCH */
+	KUNIT_EXPECT_EQ(test, tlob_stop_task(current), -ESRCH);
+}
+
+/* Arm a 1 ms budget and busy-spin for 50 ms; timer fires on-CPU. */
+static void tlob_deadline_fires_on_cpu(struct kunit *test)
+{
+	ktime_t t0;
+	int ret;
+
+	ret = tlob_start_task(current, 1000 /* 1 ms */, NULL, 0);
+	KUNIT_ASSERT_EQ(test, ret, 0);
+
+	/* Busy-spin 50 ms - 50x the budget */
+	t0 = ktime_get();
+	while (ktime_us_delta(ktime_get(), t0) < 50000)
+		cpu_relax();
+
+	/* Timer fired during the spin; entry is gone */
+	KUNIT_EXPECT_EQ(test, tlob_stop_task(current), -ESRCH);
+}
+
+/*
+ * Start three tasks, call tlob_destroy_monitor() + tlob_init_monitor(),
+ * and verify the table is empty afterwards.
+ */
+static int tlob_dummy_fn(void *arg)
+{
+	wait_for_completion((struct completion *)arg);
+	return 0;
+}
+
+static void tlob_stop_all_cleanup(struct kunit *test)
+{
+	struct completion done1, done2;
+	struct task_struct *t1, *t2;
+	int ret;
+
+	init_completion(&done1);
+	init_completion(&done2);
+
+	t1 = kthread_run(tlob_dummy_fn, &done1, "tlob_dummy1");
+	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, t1);
+	get_task_struct(t1);
+
+	t2 = kthread_run(tlob_dummy_fn, &done2, "tlob_dummy2");
+	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, t2);
+	get_task_struct(t2);
+
+	KUNIT_ASSERT_EQ(test, tlob_start_task(current, 10000000, NULL, 0), 0);
+	KUNIT_ASSERT_EQ(test, tlob_start_task(t1, 10000000, NULL, 0), 0);
+	KUNIT_ASSERT_EQ(test, tlob_start_task(t2, 10000000, NULL, 0), 0);
+
+	/* Destroy clears all entries via tlob_stop_all() */
+	tlob_destroy_monitor();
+	ret = tlob_init_monitor();
+	KUNIT_ASSERT_EQ(test, ret, 0);
+
+	/* Table must be empty now */
+	KUNIT_EXPECT_EQ(test, tlob_stop_task(current), -ESRCH);
+	KUNIT_EXPECT_EQ(test, tlob_stop_task(t1), -ESRCH);
+	KUNIT_EXPECT_EQ(test, tlob_stop_task(t2), -ESRCH);
+
+	complete(&done1);
+	complete(&done2);
+	/*
+	 * completions live on stack; wait for kthreads to exit before return.
+	 */
+	kthread_stop(t1);
+	kthread_stop(t2);
+	put_task_struct(t1);
+	put_task_struct(t2);
+}
+
+/* A threshold that overflows ktime_t must be rejected with -ERANGE. */
+static void tlob_overflow_threshold(struct kunit *test)
+{
+	/* KTIME_MAX / NSEC_PER_USEC + 1 overflows ktime_t */
+	u64 too_large = (u64)(KTIME_MAX / NSEC_PER_USEC) + 1;
+
+	KUNIT_EXPECT_EQ(test,
+		tlob_start_task(current, too_large, NULL, 0),
+		-ERANGE);
+}
+
+static int tlob_task_api_suite_init(struct kunit_suite *suite)
+{
+	return tlob_init_monitor();
+}
+
+static void tlob_task_api_suite_exit(struct kunit_suite *suite)
+{
+	tlob_destroy_monitor();
+}
+
+static struct kunit_case tlob_task_api_cases[] = {
+	KUNIT_CASE(tlob_start_stop_ok),
+	KUNIT_CASE(tlob_double_start),
+	KUNIT_CASE(tlob_stop_without_start),
+	KUNIT_CASE(tlob_immediate_deadline),
+	KUNIT_CASE(tlob_enospc),
+	KUNIT_CASE(tlob_overflow_threshold),
+	KUNIT_CASE(tlob_deadline_fires_off_cpu),
+	KUNIT_CASE(tlob_deadline_fires_on_cpu),
+	KUNIT_CASE(tlob_stop_all_cleanup),
+	{}
+};
+
+static struct kunit_suite tlob_task_api_suite = {
+	.name       = "tlob_task_api",
+	.suite_init = tlob_task_api_suite_init,
+	.suite_exit = tlob_task_api_suite_exit,
+	.test_cases = tlob_task_api_cases,
+};
+
+/*
+ * Suite 3: scheduling integration
+ */
+
+struct tlob_ping_ctx {
+	struct completion ping;
+	struct completion pong;
+};
+
+static int tlob_ping_fn(void *arg)
+{
+	struct tlob_ping_ctx *ctx = arg;
+
+	/* Wait for main to give us the CPU back */
+	wait_for_completion(&ctx->ping);
+	complete(&ctx->pong);
+	return 0;
+}
+
+/* Force two context switches and verify stop returns 0 (within budget). */
+static void tlob_sched_switch_accounting(struct kunit *test)
+{
+	struct tlob_ping_ctx ctx;
+	struct task_struct *peer;
+	int ret;
+
+	init_completion(&ctx.ping);
+	init_completion(&ctx.pong);
+
+	peer = kthread_run(tlob_ping_fn, &ctx, "tlob_ping_kunit");
+	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, peer);
+
+	/* Arm a generous 5 s budget so the timer never fires */
+	ret = tlob_start_task(current, 5000000, NULL, 0);
+	KUNIT_ASSERT_EQ(test, ret, 0);
+
+	/*
+	 * complete(ping) -> peer runs, forcing a context switch out and back.
+	 */
+	complete(&ctx.ping);
+	wait_for_completion(&ctx.pong);
+
+	/*
+	 * Back on CPU after one off-CPU interval; stop must return 0.
+	 */
+	ret = tlob_stop_task(current);
+	KUNIT_EXPECT_EQ(test, ret, 0);
+}
+
+/*
+ * Verify that monitoring a kthread (not current) works: start on behalf
+ * of a kthread, let it block, then stop it.
+ */
+static int tlob_block_fn(void *arg)
+{
+	struct completion *done = arg;
+
+	/* Block briefly, exercising off_cpu accounting for this task */
+	msleep(20);
+	complete(done);
+	return 0;
+}
+
+static void tlob_monitor_other_task(struct kunit *test)
+{
+	struct completion done;
+	struct task_struct *target;
+	int ret;
+
+	init_completion(&done);
+
+	target = kthread_run(tlob_block_fn, &done, "tlob_target_kunit");
+	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, target);
+	get_task_struct(target);
+
+	/* Arm a 5 s budget for the target task */
+	ret = tlob_start_task(target, 5000000, NULL, 0);
+	KUNIT_ASSERT_EQ(test, ret, 0);
+
+	wait_for_completion(&done);
+
+	/*
+	 * Target has finished; stop_task may return 0 (still in htable)
+	 * or -ESRCH (kthread exited and timer fired / entry cleaned up).
+	 */
+	ret = tlob_stop_task(target);
+	KUNIT_EXPECT_TRUE(test, ret == 0 || ret == -ESRCH);
+	put_task_struct(target);
+}
+
+static int tlob_sched_suite_init(struct kunit_suite *suite)
+{
+	return tlob_init_monitor();
+}
+
+static void tlob_sched_suite_exit(struct kunit_suite *suite)
+{
+	tlob_destroy_monitor();
+}
+
+static struct kunit_case tlob_sched_integration_cases[] = {
+	KUNIT_CASE(tlob_sched_switch_accounting),
+	KUNIT_CASE(tlob_monitor_other_task),
+	{}
+};
+
+static struct kunit_suite tlob_sched_integration_suite = {
+	.name       = "tlob_sched_integration",
+	.suite_init = tlob_sched_suite_init,
+	.suite_exit = tlob_sched_suite_exit,
+	.test_cases = tlob_sched_integration_cases,
+};
+
+/*
+ * Suite 4: ftrace tracepoint field verification
+ */
+
+/* Capture fields from trace_tlob_budget_exceeded for inspection. */
+struct tlob_exceeded_capture {
+	atomic_t	fired;		/* 1 after first call */
+	pid_t		pid;
+	u64		threshold_us;
+	u64		on_cpu_us;
+	u64		off_cpu_us;
+	u32		switches;
+	bool		state_is_on_cpu;
+	u64		tag;
+};
+
+static void
+probe_tlob_budget_exceeded(void *data,
+			   struct task_struct *task, u64 threshold_us,
+			   u64 on_cpu_us, u64 off_cpu_us,
+			   u32 switches, bool state_is_on_cpu, u64 tag)
+{
+	struct tlob_exceeded_capture *cap = data;
+
+	/* Only capture the first event to avoid races. */
+	if (atomic_cmpxchg(&cap->fired, 0, 1) != 0)
+		return;
+
+	cap->pid		= task->pid;
+	cap->threshold_us	= threshold_us;
+	cap->on_cpu_us		= on_cpu_us;
+	cap->off_cpu_us		= off_cpu_us;
+	cap->switches		= switches;
+	cap->state_is_on_cpu	= state_is_on_cpu;
+	cap->tag		= tag;
+}
+
+/*
+ * Arm a 2 ms budget and busy-spin for 60 ms. Verify the tracepoint fires
+ * once with matching threshold, correct pid, and total time >= budget.
+ *
+ * state_is_on_cpu is not asserted: preemption during the spin makes it
+ * non-deterministic.
+ */
+static void tlob_trace_budget_exceeded_on_cpu(struct kunit *test)
+{
+	struct tlob_exceeded_capture cap = {};
+	const u64 threshold_us = 2000; /* 2 ms */
+	ktime_t t0;
+	int ret;
+
+	atomic_set(&cap.fired, 0);
+
+	ret = register_trace_tlob_budget_exceeded(probe_tlob_budget_exceeded,
+						  &cap);
+	KUNIT_ASSERT_EQ(test, ret, 0);
+
+	ret = tlob_start_task(current, threshold_us, NULL, 0);
+	KUNIT_ASSERT_EQ(test, ret, 0);
+
+	/* Busy-spin 60 ms (30x the budget) */
+	t0 = ktime_get();
+	while (ktime_us_delta(ktime_get(), t0) < 60000)
+		cpu_relax();
+
+	/* Entry removed by timer; stop returns -ESRCH */
+	tlob_stop_task(current);
+
+	/*
+	 * Unregister first, then synchronise, so that any in-flight probe
+	 * callback has completed before we read the captured fields.
+	 */
+	unregister_trace_tlob_budget_exceeded(probe_tlob_budget_exceeded, &cap);
+	tracepoint_synchronize_unregister();
+
+	KUNIT_EXPECT_EQ(test, atomic_read(&cap.fired), 1);
+	KUNIT_EXPECT_EQ(test, (int)cap.pid, (int)current->pid);
+	KUNIT_EXPECT_EQ(test, cap.threshold_us, threshold_us);
+	/* Total elapsed must cover at least the budget */
+	KUNIT_EXPECT_GE(test, cap.on_cpu_us + cap.off_cpu_us, threshold_us);
+}
+
+/*
+ * Holder kthread grabs a mutex for 80 ms; arm 10 ms budget, burn ~1 ms
+ * on-CPU, then block on the mutex. Timer fires off-CPU. Verify:
+ * state_is_on_cpu == false, switches >= 1, off_cpu_us > 0.
+ */
+static void tlob_trace_budget_exceeded_off_cpu(struct kunit *test)
+{
+	struct tlob_exceeded_capture cap = {};
+	struct tlob_holder_ctx ctx = { .hold_ms = 80 };
+	struct task_struct *holder;
+	const u64 threshold_us = 10000; /* 10 ms */
+	ktime_t t0;
+	int ret;
+
+	atomic_set(&cap.fired, 0);
+
+	mutex_init(&ctx.lock);
+	init_completion(&ctx.ready);
+
+	holder = kthread_run(tlob_holder_fn, &ctx, "tlob_holder2_kunit");
+	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, holder);
+	wait_for_completion(&ctx.ready);
+
+	ret = register_trace_tlob_budget_exceeded(probe_tlob_budget_exceeded,
+						  &cap);
+	KUNIT_ASSERT_EQ(test, ret, 0);
+
+	ret = tlob_start_task(current, threshold_us, NULL, 0);
+	KUNIT_ASSERT_EQ(test, ret, 0);
+
+	/* Phase 1: ~1 ms on-CPU */
+	t0 = ktime_get();
+	while (ktime_us_delta(ktime_get(), t0) < 1000)
+		cpu_relax();
+
+	/* Phase 2: block -> off-CPU; timer fires here */
+	mutex_lock(&ctx.lock);
+	mutex_unlock(&ctx.lock);
+
+	tlob_stop_task(current);
+
+	unregister_trace_tlob_budget_exceeded(probe_tlob_budget_exceeded, &cap);
+	tracepoint_synchronize_unregister();
+
+	KUNIT_EXPECT_EQ(test, atomic_read(&cap.fired), 1);
+	KUNIT_EXPECT_EQ(test, cap.threshold_us, threshold_us);
+	/* Violation happened off-CPU */
+	KUNIT_EXPECT_FALSE(test, cap.state_is_on_cpu);
+	/* At least the switch_out event was counted */
+	KUNIT_EXPECT_GE(test, (u64)cap.switches, (u64)1);
+	/* Off-CPU time must be non-zero */
+	KUNIT_EXPECT_GT(test, cap.off_cpu_us, (u64)0);
+}
+
+/* threshold_us in the tracepoint must exactly match the start argument. */
+static void tlob_trace_threshold_field_accuracy(struct kunit *test)
+{
+	static const u64 thresholds[] = { 500, 1000, 3000 };
+	unsigned int i;
+
+	for (i = 0; i < ARRAY_SIZE(thresholds); i++) {
+		struct tlob_exceeded_capture cap = {};
+		ktime_t t0;
+		int ret;
+
+		atomic_set(&cap.fired, 0);
+
+		ret = register_trace_tlob_budget_exceeded(
+			probe_tlob_budget_exceeded, &cap);
+		KUNIT_ASSERT_EQ(test, ret, 0);
+
+		ret = tlob_start_task(current, thresholds[i], NULL, 0);
+		KUNIT_ASSERT_EQ(test, ret, 0);
+
+		/* Spin for 20x the threshold to ensure timer fires */
+		t0 = ktime_get();
+		while (ktime_us_delta(ktime_get(), t0) <
+		       (s64)(thresholds[i] * 20))
+			cpu_relax();
+
+		tlob_stop_task(current);
+
+		unregister_trace_tlob_budget_exceeded(
+			probe_tlob_budget_exceeded, &cap);
+		tracepoint_synchronize_unregister();
+
+		KUNIT_EXPECT_EQ_MSG(test, cap.threshold_us, thresholds[i],
+				    "threshold mismatch for entry %u", i);
+	}
+}
+
+static int tlob_trace_suite_init(struct kunit_suite *suite)
+{
+	int ret;
+
+	ret = tlob_init_monitor();
+	if (ret)
+		return ret;
+	return tlob_enable_hooks();
+}
+
+static void tlob_trace_suite_exit(struct kunit_suite *suite)
+{
+	tlob_disable_hooks();
+	tlob_destroy_monitor();
+}
+
+static struct kunit_case tlob_trace_output_cases[] = {
+	KUNIT_CASE(tlob_trace_budget_exceeded_on_cpu),
+	KUNIT_CASE(tlob_trace_budget_exceeded_off_cpu),
+	KUNIT_CASE(tlob_trace_threshold_field_accuracy),
+	{}
+};
+
+static struct kunit_suite tlob_trace_output_suite = {
+	.name       = "tlob_trace_output",
+	.suite_init = tlob_trace_suite_init,
+	.suite_exit = tlob_trace_suite_exit,
+	.test_cases = tlob_trace_output_cases,
+};
+
+/* Suite 5: ring buffer */
+
+/*
+ * Allocate a synthetic rv_file_priv for ring buffer tests. Uses
+ * kunit_kzalloc() instead of __get_free_pages() since the ring is never
+ * mmap'd here.
+ */
+static struct rv_file_priv *alloc_priv_kunit(struct kunit *test, u32 cap)
+{
+	struct rv_file_priv *priv;
+	struct tlob_ring *ring;
+
+	priv = kunit_kzalloc(test, sizeof(*priv), GFP_KERNEL);
+	if (!priv)
+		return NULL;
+
+	ring = &priv->ring;
+
+	ring->page = kunit_kzalloc(test, sizeof(struct tlob_mmap_page),
+				   GFP_KERNEL);
+	if (!ring->page)
+		return NULL;
+
+	ring->data = kunit_kzalloc(test, cap * sizeof(struct tlob_event),
+				   GFP_KERNEL);
+	if (!ring->data)
+		return NULL;
+
+	ring->mask            = cap - 1;
+	ring->page->capacity  = cap;
+	ring->page->version   = 1;
+	ring->page->data_offset = PAGE_SIZE; /* nominal; not used in tests */
+	ring->page->record_size = sizeof(struct tlob_event);
+	spin_lock_init(&ring->lock);
+	init_waitqueue_head(&priv->waitq);
+	return priv;
+}
+
+/* Push one record and verify all fields survive the round-trip. */
+static void tlob_event_push_one(struct kunit *test)
+{
+	struct rv_file_priv *priv;
+	struct tlob_ring *ring;
+	struct tlob_event in = {
+		.tid		= 1234,
+		.threshold_us	= 5000,
+		.on_cpu_us	= 3000,
+		.off_cpu_us	= 2000,
+		.switches	= 3,
+		.state		= 1,
+	};
+	struct tlob_event out = {};
+	u32 tail;
+
+	priv = alloc_priv_kunit(test, TLOB_RING_DEFAULT_CAP);
+	KUNIT_ASSERT_NOT_NULL(test, priv);
+
+	ring = &priv->ring;
+
+	tlob_event_push_kunit(priv, &in);
+
+	/* One record written, none dropped */
+	KUNIT_EXPECT_EQ(test, ring->page->data_head, 1u);
+	KUNIT_EXPECT_EQ(test, ring->page->data_tail, 0u);
+	KUNIT_EXPECT_EQ(test, ring->page->dropped,   0ull);
+
+	/* Dequeue manually */
+	tail = ring->page->data_tail;
+	out  = ring->data[tail & ring->mask];
+	ring->page->data_tail = tail + 1;
+
+	KUNIT_EXPECT_EQ(test, out.tid,          in.tid);
+	KUNIT_EXPECT_EQ(test, out.threshold_us, in.threshold_us);
+	KUNIT_EXPECT_EQ(test, out.on_cpu_us,    in.on_cpu_us);
+	KUNIT_EXPECT_EQ(test, out.off_cpu_us,   in.off_cpu_us);
+	KUNIT_EXPECT_EQ(test, out.switches,     in.switches);
+	KUNIT_EXPECT_EQ(test, out.state,        in.state);
+
+	/* Ring is now empty */
+	KUNIT_EXPECT_EQ(test, ring->page->data_head, ring->page->data_tail);
+}
+
+/*
+ * Fill to capacity, push one more. Drop-new policy: head stays at cap,
+ * dropped == 1, oldest record is preserved.
+ */
+static void tlob_event_push_overflow(struct kunit *test)
+{
+	struct rv_file_priv *priv;
+	struct tlob_ring *ring;
+	struct tlob_event ntf = {};
+	struct tlob_event out = {};
+	const u32 cap = TLOB_RING_MIN_CAP;
+	u32 i;
+
+	priv = alloc_priv_kunit(test, cap);
+	KUNIT_ASSERT_NOT_NULL(test, priv);
+
+	ring = &priv->ring;
+
+	/* Push cap + 1 records; tid encodes the sequence */
+	for (i = 0; i <= cap; i++) {
+		ntf.tid          = i;
+		ntf.threshold_us = (u64)i * 1000;
+		tlob_event_push_kunit(priv, &ntf);
+	}
+
+	/* Drop-new: head stopped at cap; one record was silently discarded */
+	KUNIT_EXPECT_EQ(test, ring->page->data_head, cap);
+	KUNIT_EXPECT_EQ(test, ring->page->data_tail, 0u);
+	KUNIT_EXPECT_EQ(test, ring->page->dropped,   1ull);
+
+	/* Oldest surviving record must be the first one pushed (tid == 0) */
+	out = ring->data[ring->page->data_tail & ring->mask];
+	KUNIT_EXPECT_EQ(test, out.tid, 0u);
+
+	/* Drain the ring; the last record must have tid == cap - 1 */
+	for (i = 0; i < cap; i++) {
+		u32 tail = ring->page->data_tail;
+
+		out = ring->data[tail & ring->mask];
+		ring->page->data_tail = tail + 1;
+	}
+	KUNIT_EXPECT_EQ(test, out.tid, cap - 1);
+	KUNIT_EXPECT_EQ(test, ring->page->data_head, ring->page->data_tail);
+}
+
+/* A freshly initialised ring is empty. */
+static void tlob_event_empty(struct kunit *test)
+{
+	struct rv_file_priv *priv;
+	struct tlob_ring *ring;
+
+	priv = alloc_priv_kunit(test, TLOB_RING_DEFAULT_CAP);
+	KUNIT_ASSERT_NOT_NULL(test, priv);
+
+	ring = &priv->ring;
+
+	KUNIT_EXPECT_EQ(test, ring->page->data_head, 0u);
+	KUNIT_EXPECT_EQ(test, ring->page->data_tail, 0u);
+	KUNIT_EXPECT_EQ(test, ring->page->dropped,   0ull);
+}
+
+/*
+ * A kthread blocks on wait_event_interruptible(); pushing one record must
+ * wake it within 1 s.
+ */
+
+struct tlob_wakeup_ctx {
+	struct rv_file_priv	*priv;
+	struct completion	 ready;
+	struct completion	 done;
+	int			 woke;
+};
+
+static int tlob_wakeup_thread(void *arg)
+{
+	struct tlob_wakeup_ctx *ctx = arg;
+	struct tlob_ring *ring = &ctx->priv->ring;
+
+	complete(&ctx->ready);
+
+	wait_event_interruptible(ctx->priv->waitq,
+		smp_load_acquire(&ring->page->data_head) !=
+		READ_ONCE(ring->page->data_tail) ||
+		kthread_should_stop());
+
+	if (smp_load_acquire(&ring->page->data_head) !=
+	    READ_ONCE(ring->page->data_tail))
+		ctx->woke = 1;
+
+	complete(&ctx->done);
+	return 0;
+}
+
+static void tlob_ring_wakeup(struct kunit *test)
+{
+	struct rv_file_priv *priv;
+	struct tlob_wakeup_ctx ctx;
+	struct task_struct *t;
+	struct tlob_event ev = { .tid = 99 };
+	long timeout;
+
+	priv = alloc_priv_kunit(test, TLOB_RING_DEFAULT_CAP);
+	KUNIT_ASSERT_NOT_NULL(test, priv);
+
+	init_completion(&ctx.ready);
+	init_completion(&ctx.done);
+	ctx.priv = priv;
+	ctx.woke = 0;
+
+	t = kthread_run(tlob_wakeup_thread, &ctx, "tlob_wakeup_kunit");
+	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, t);
+	get_task_struct(t);
+
+	/* Let the kthread reach wait_event_interruptible */
+	wait_for_completion(&ctx.ready);
+	usleep_range(10000, 20000);
+
+	/* Push one record  --  must wake the waiter */
+	tlob_event_push_kunit(priv, &ev);
+
+	timeout = wait_for_completion_timeout(&ctx.done, msecs_to_jiffies(1000));
+	kthread_stop(t);
+	put_task_struct(t);
+
+	KUNIT_EXPECT_GT(test, timeout, 0L);
+	KUNIT_EXPECT_EQ(test, ctx.woke, 1);
+	KUNIT_EXPECT_EQ(test, priv->ring.page->data_head, 1u);
+}
+
+static struct kunit_case tlob_event_buf_cases[] = {
+	KUNIT_CASE(tlob_event_push_one),
+	KUNIT_CASE(tlob_event_push_overflow),
+	KUNIT_CASE(tlob_event_empty),
+	KUNIT_CASE(tlob_ring_wakeup),
+	{}
+};
+
+static struct kunit_suite tlob_event_buf_suite = {
+	.name       = "tlob_event_buf",
+	.test_cases = tlob_event_buf_cases,
+};
+
+/* Suite 6: uprobe format string parser */
+
+/* Happy path: decimal offsets, plain path. */
+static void tlob_parse_decimal_offsets(struct kunit *test)
+{
+	char buf[] = "5000:4768:4848:/usr/bin/myapp";
+	u64 thr;
+	loff_t start, stop;
+	char *path;
+
+	KUNIT_EXPECT_EQ(test,
+		tlob_parse_uprobe_line(buf, &thr, &path, &start, &stop),
+		0);
+	KUNIT_EXPECT_EQ(test, thr,      (u64)5000);
+	KUNIT_EXPECT_EQ(test, start,    (loff_t)4768);
+	KUNIT_EXPECT_EQ(test, stop,     (loff_t)4848);
+	KUNIT_EXPECT_STREQ(test, path,  "/usr/bin/myapp");
+}
+
+/* Happy path: 0x-prefixed hex offsets. */
+static void tlob_parse_hex_offsets(struct kunit *test)
+{
+	char buf[] = "10000:0x12a0:0x12f0:/usr/bin/myapp";
+	u64 thr;
+	loff_t start, stop;
+	char *path;
+
+	KUNIT_EXPECT_EQ(test,
+		tlob_parse_uprobe_line(buf, &thr, &path, &start, &stop),
+		0);
+	KUNIT_EXPECT_EQ(test, start,   (loff_t)0x12a0);
+	KUNIT_EXPECT_EQ(test, stop,    (loff_t)0x12f0);
+	KUNIT_EXPECT_STREQ(test, path, "/usr/bin/myapp");
+}
+
+/* Path containing ':' must not be truncated. */
+static void tlob_parse_path_with_colon(struct kunit *test)
+{
+	char buf[] = "1000:0x100:0x200:/opt/my:app/bin";
+	u64 thr;
+	loff_t start, stop;
+	char *path;
+
+	KUNIT_EXPECT_EQ(test,
+		tlob_parse_uprobe_line(buf, &thr, &path, &start, &stop),
+		0);
+	KUNIT_EXPECT_STREQ(test, path, "/opt/my:app/bin");
+}
+
+/* Zero threshold must be rejected. */
+static void tlob_parse_zero_threshold(struct kunit *test)
+{
+	char buf[] = "0:0x100:0x200:/usr/bin/myapp";
+	u64 thr;
+	loff_t start, stop;
+	char *path;
+
+	KUNIT_EXPECT_EQ(test,
+		tlob_parse_uprobe_line(buf, &thr, &path, &start, &stop),
+		-EINVAL);
+}
+
+/* Empty path (trailing ':' with nothing after) must be rejected. */
+static void tlob_parse_empty_path(struct kunit *test)
+{
+	char buf[] = "5000:0x100:0x200:";
+	u64 thr;
+	loff_t start, stop;
+	char *path;
+
+	KUNIT_EXPECT_EQ(test,
+		tlob_parse_uprobe_line(buf, &thr, &path, &start, &stop),
+		-EINVAL);
+}
+
+/* Missing field (3 tokens instead of 4) must be rejected. */
+static void tlob_parse_too_few_fields(struct kunit *test)
+{
+	char buf[] = "5000:0x100:/usr/bin/myapp";
+	u64 thr;
+	loff_t start, stop;
+	char *path;
+
+	KUNIT_EXPECT_EQ(test,
+		tlob_parse_uprobe_line(buf, &thr, &path, &start, &stop),
+		-EINVAL);
+}
+
+/* Negative offset must be rejected. */
+static void tlob_parse_negative_offset(struct kunit *test)
+{
+	char buf[] = "5000:-1:0x200:/usr/bin/myapp";
+	u64 thr;
+	loff_t start, stop;
+	char *path;
+
+	KUNIT_EXPECT_EQ(test,
+		tlob_parse_uprobe_line(buf, &thr, &path, &start, &stop),
+		-EINVAL);
+}
+
+static struct kunit_case tlob_parse_uprobe_cases[] = {
+	KUNIT_CASE(tlob_parse_decimal_offsets),
+	KUNIT_CASE(tlob_parse_hex_offsets),
+	KUNIT_CASE(tlob_parse_path_with_colon),
+	KUNIT_CASE(tlob_parse_zero_threshold),
+	KUNIT_CASE(tlob_parse_empty_path),
+	KUNIT_CASE(tlob_parse_too_few_fields),
+	KUNIT_CASE(tlob_parse_negative_offset),
+	{}
+};
+
+static struct kunit_suite tlob_parse_uprobe_suite = {
+	.name       = "tlob_parse_uprobe",
+	.test_cases = tlob_parse_uprobe_cases,
+};
+
+kunit_test_suites(&tlob_automaton_suite,
+		  &tlob_task_api_suite,
+		  &tlob_sched_integration_suite,
+		  &tlob_trace_output_suite,
+		  &tlob_event_buf_suite,
+		  &tlob_parse_uprobe_suite);
+
+MODULE_DESCRIPTION("KUnit tests for the tlob RV monitor");
+MODULE_LICENSE("GPL");
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [RFC PATCH 4/4] selftests/rv: Add selftest for the tlob monitor
  2026-04-12 19:27 [RFC PATCH 0/4] rv/tlob: Add task latency over budget RV monitor wen.yang
                   ` (2 preceding siblings ...)
  2026-04-12 19:27 ` [RFC PATCH 3/4] rv/tlob: Add KUnit tests for the tlob monitor wen.yang
@ 2026-04-12 19:27 ` wen.yang
  2026-04-16 12:00   ` Gabriele Monaco
  3 siblings, 1 reply; 11+ messages in thread
From: wen.yang @ 2026-04-12 19:27 UTC (permalink / raw)
  To: Steven Rostedt, Gabriele Monaco, Masami Hiramatsu,
	Mathieu Desnoyers
  Cc: linux-trace-kernel, linux-kernel, Wen Yang

From: Wen Yang <wen.yang@linux.dev>

Add a kselftest suite (TAP output, 20 test points) for the tlob RV
monitor under tools/testing/selftests/rv/.

test_tlob.sh drives a compiled C helper (tlob_helper) and, for uprobe
tests, a target binary (tlob_uprobe_target). Coverage spans the
tracefs enable/disable path, uprobe-triggered violations, and the
ioctl interface (within-budget stop, CPU-bound and sleep violations,
duplicate start, ring buffer mmap and consumption).

Requires CONFIG_RV_MON_TLOB=y and CONFIG_RV_CHARDEV=y; must be run
as root.
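
For reference, the binding line the script writes to the tlob monitor
file has the format threshold_us:offset_start:offset_stop:binary_path.
The sketch below only assembles such a line; the offsets and path are
illustrative, not taken from a real binary:

```shell
# Assemble a tlob uprobe binding line (illustrative values only):
#   threshold_us:offset_start:offset_stop:binary_path
threshold_us=5000        # budget in microseconds
start_off=0x12a0         # file offset of the region-start probe
stop_off=0x12f0          # file offset of the region-stop probe
bin=/usr/bin/myapp       # hypothetical target binary
binding="${threshold_us}:${start_off}:${stop_off}:${bin}"
echo "$binding"
# → 5000:0x12a0:0x12f0:/usr/bin/myapp
```

Writing such a line to rv/monitors/tlob/monitor in tracefs arms the
binding; writing "-<offset_start>:<binary_path>" removes it, as the
test cases in the script do.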

Signed-off-by: Wen Yang <wen.yang@linux.dev>
---
 tools/include/uapi/linux/rv.h                 |  54 +
 tools/testing/selftests/rv/Makefile           |  18 +
 tools/testing/selftests/rv/test_tlob.sh       | 563 ++++++++++
 tools/testing/selftests/rv/tlob_helper.c      | 994 ++++++++++++++++++
 .../testing/selftests/rv/tlob_uprobe_target.c | 108 ++
 5 files changed, 1737 insertions(+)
 create mode 100644 tools/include/uapi/linux/rv.h
 create mode 100644 tools/testing/selftests/rv/Makefile
 create mode 100755 tools/testing/selftests/rv/test_tlob.sh
 create mode 100644 tools/testing/selftests/rv/tlob_helper.c
 create mode 100644 tools/testing/selftests/rv/tlob_uprobe_target.c

diff --git a/tools/include/uapi/linux/rv.h b/tools/include/uapi/linux/rv.h
new file mode 100644
index 000000000..bef07aded
--- /dev/null
+++ b/tools/include/uapi/linux/rv.h
@@ -0,0 +1,54 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/*
+ * UAPI definitions for Runtime Verification (RV) monitors.
+ *
+ * This is a tools-friendly copy of include/uapi/linux/rv.h.
+ * Keep in sync with the kernel header.
+ */
+
+#ifndef _UAPI_LINUX_RV_H
+#define _UAPI_LINUX_RV_H
+
+#include <linux/types.h>
+#include <sys/ioctl.h>
+
+/* Magic byte shared by all RV monitor ioctls. */
+#define RV_IOC_MAGIC	0xB9
+
+/* -----------------------------------------------------------------------
+ * tlob: task latency over budget monitor  (nr 0x01 - 0x1F)
+ * -----------------------------------------------------------------------
+ */
+
+struct tlob_start_args {
+	__u64 threshold_us;
+	__u64 tag;
+	__s32 notify_fd;
+	__u32 flags;
+};
+
+struct tlob_event {
+	__u32 tid;
+	__u32 pad;
+	__u64 threshold_us;
+	__u64 on_cpu_us;
+	__u64 off_cpu_us;
+	__u32 switches;
+	__u32 state;   /* 1 = on_cpu, 0 = off_cpu */
+	__u64 tag;
+};
+
+struct tlob_mmap_page {
+	__u32  data_head;
+	__u32  data_tail;
+	__u32  capacity;
+	__u32  version;
+	__u32  data_offset;
+	__u32  record_size;
+	__u64  dropped;
+};
+
+#define TLOB_IOCTL_TRACE_START	_IOW(RV_IOC_MAGIC, 0x01, struct tlob_start_args)
+#define TLOB_IOCTL_TRACE_STOP	_IO(RV_IOC_MAGIC,  0x02)
+
+#endif /* _UAPI_LINUX_RV_H */
diff --git a/tools/testing/selftests/rv/Makefile b/tools/testing/selftests/rv/Makefile
new file mode 100644
index 000000000..14e94a1ab
--- /dev/null
+++ b/tools/testing/selftests/rv/Makefile
@@ -0,0 +1,18 @@
+# SPDX-License-Identifier: GPL-2.0
+# Makefile for rv selftests
+
+TEST_GEN_PROGS := tlob_helper tlob_uprobe_target
+
+TEST_PROGS := test_tlob.sh
+
+# TOOLS_INCLUDES is defined by ../lib.mk; provides -isystem to
+# tools/include/uapi so that #include <linux/rv.h> resolves to the
+# in-tree UAPI header without requiring make headers_install.
+# Note: both must be added to the global variables, not as target-specific
+# overrides, because lib.mk rewrites TEST_GEN_PROGS to $(OUTPUT)/name
+# before per-target rules would be evaluated.
+CFLAGS += $(TOOLS_INCLUDES)
+LDLIBS += -lpthread
+
+include ../lib.mk
diff --git a/tools/testing/selftests/rv/test_tlob.sh b/tools/testing/selftests/rv/test_tlob.sh
new file mode 100755
index 000000000..3ba2125eb
--- /dev/null
+++ b/tools/testing/selftests/rv/test_tlob.sh
@@ -0,0 +1,563 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+#
+# Selftest for the tlob (task latency over budget) RV monitor.
+#
+# Two interfaces are tested:
+#
+#   1. tracefs interface:
+#        enable/disable, presence of tracefs files,
+#        uprobe binding (threshold_us:offset_start:offset_stop:binary_path) and
+#        violation detection via the ftrace ring buffer.
+#
+#   2. /dev/rv ioctl self-instrumentation (via tlob_helper):
+#        within-budget, over-budget on-CPU, over-budget off-CPU (sleep),
+#        double-start, stop-without-start.
+#
+# Written to be POSIX sh compatible (no bash-specific extensions).
+
+ksft_skip=4
+t_pass=0; t_fail=0; t_skip=0; t_total=0
+
+tap_header() { echo "TAP version 13"; }
+tap_plan()   { echo "1..$1"; }
+tap_pass()   { t_pass=$((t_pass+1)); echo "ok $t_total - $1"; }
+tap_fail()   { t_fail=$((t_fail+1)); echo "not ok $t_total - $1"
+               [ -n "$2" ] && echo "  # $2"; }
+tap_skip()   { t_skip=$((t_skip+1)); echo "ok $t_total - $1 # SKIP $2"; }
+next_test()  { t_total=$((t_total+1)); }
+
+TRACEFS=$(awk '$3 == "tracefs" { print $2; exit }' /proc/mounts 2>/dev/null)
+[ -z "$TRACEFS" ] && TRACEFS=/sys/kernel/tracing
+
+RV_DIR="${TRACEFS}/rv"
+TLOB_DIR="${RV_DIR}/monitors/tlob"
+TRACE_FILE="${TRACEFS}/trace"
+TRACING_ON="${TRACEFS}/tracing_on"
+TLOB_MONITOR="${TLOB_DIR}/monitor"
+BUDGET_EXCEEDED_ENABLE="${TRACEFS}/events/rv/tlob_budget_exceeded/enable"
+RV_DEV="/dev/rv"
+
+# tlob_helper and tlob_uprobe_target must be in the same directory as
+# this script or on PATH.
+SCRIPT_DIR=$(dirname "$0")
+IOCTL_HELPER="${SCRIPT_DIR}/tlob_helper"
+UPROBE_TARGET="${SCRIPT_DIR}/tlob_uprobe_target"
+
+check_root()     { [ "$(id -u)" = "0" ] || { echo "# Need root" >&2; exit $ksft_skip; }; }
+check_tracefs()  { [ -d "${TRACEFS}" ]   || { echo "# No tracefs" >&2; exit $ksft_skip; }; }
+check_rv_dir()   { [ -d "${RV_DIR}" ]    || { echo "# No RV infra" >&2; exit $ksft_skip; }; }
+check_tlob()     { [ -d "${TLOB_DIR}" ]  || { echo "# No tlob monitor" >&2; exit $ksft_skip; }; }
+
+tlob_enable()         { echo 1 > "${TLOB_DIR}/enable"; }
+tlob_disable()        { echo 0 > "${TLOB_DIR}/enable" 2>/dev/null; }
+tlob_is_enabled()     { [ "$(cat "${TLOB_DIR}/enable" 2>/dev/null)" = "1" ]; }
+trace_event_enable()  { echo 1 > "${BUDGET_EXCEEDED_ENABLE}" 2>/dev/null; }
+trace_event_disable() { echo 0 > "${BUDGET_EXCEEDED_ENABLE}" 2>/dev/null; }
+trace_on()            { echo 1 > "${TRACING_ON}" 2>/dev/null; }
+trace_clear()         { echo > "${TRACE_FILE}"; }
+trace_grep()          { grep -q "$1" "${TRACE_FILE}" 2>/dev/null; }
+
+cleanup() {
+	tlob_disable
+	trace_event_disable
+	trace_clear
+}
+
+# ---------------------------------------------------------------------------
+# Test 1: enable / disable
+# ---------------------------------------------------------------------------
+run_test_enable_disable() {
+	next_test; cleanup
+	tlob_enable
+	if ! tlob_is_enabled; then
+		tap_fail "enable_disable" "not enabled after echo 1"; cleanup; return
+	fi
+	tlob_disable
+	if tlob_is_enabled; then
+		tap_fail "enable_disable" "still enabled after echo 0"; cleanup; return
+	fi
+	tap_pass "enable_disable"; cleanup
+}
+
+# ---------------------------------------------------------------------------
+# Test 2: tracefs files present
+# ---------------------------------------------------------------------------
+run_test_tracefs_files() {
+	next_test; cleanup
+	missing=""
+	for f in enable desc monitor; do
+		[ ! -e "${TLOB_DIR}/${f}" ] && missing="${missing} ${f}"
+	done
+	[ -n "${missing}" ] \
+		&& tap_fail "tracefs_files" "missing:${missing}" \
+		|| tap_pass "tracefs_files"
+	cleanup
+}
+
+# ---------------------------------------------------------------------------
+# Helper: resolve the file offset backing a virtual address in this process.
+#
+# Usage: resolve_offset <binary> <vaddr_hex>
+# Scans /proc/self/maps for the mapping of <binary> that contains vaddr and
+# prints the corresponding hex file offset, or nothing on failure.  Currently
+# unused; get_uprobe_offset (below) is the primary offset source.
+# ---------------------------------------------------------------------------
+resolve_offset() {
+	bin=$1; vaddr=$2
+	# Parse /proc/self/maps to find the mapping that contains vaddr.
+	# Each line: start-end perms offset dev inode [path]
+	while IFS= read -r line; do
+		set -- $line
+		range=$1; off=$3; path=$6
+		[ -z "$path" ] && continue
+		# Only consider the mapping for our binary
+		[ "$path" != "$bin" ] && continue
+		# Split range into start and end
+		start=$(echo "$range" | cut -d- -f1)
+		end=$(echo "$range" | cut -d- -f2)
+		# Convert hex to decimal for comparison (use printf)
+		s=$(printf "%d" "0x${start}" 2>/dev/null) || continue
+		e=$(printf "%d" "0x${end}"   2>/dev/null) || continue
+		v=$(printf "%d" "${vaddr}"   2>/dev/null) || continue
+		o=$(printf "%d" "0x${off}"   2>/dev/null) || continue
+		if [ "$v" -ge "$s" ] && [ "$v" -lt "$e" ]; then
+			file_off=$(printf "0x%x" $(( (v - s) + o )))
+			echo "$file_off"
+			return
+		fi
+	done < /proc/self/maps
+}
+
+# ---------------------------------------------------------------------------
+# Test 3: uprobe binding - no false positive
+#
+# Bind this process with a 10 s budget.  Do nothing for 0.5 s.
+# No budget_exceeded event should appear in the trace.
+# ---------------------------------------------------------------------------
+run_test_uprobe_no_false_positive() {
+	next_test; cleanup
+	if [ ! -e "${TLOB_MONITOR}" ]; then
+		tap_skip "uprobe_no_false_positive" "monitor file not available"
+		cleanup; return
+	fi
+	# Deliberately create no binding: forming a valid uprobe binding
+	# needs an ELF symbol offset, which pure shell cannot resolve, so
+	# this test only checks the no-binding baseline (test 4 exercises
+	# a real binding via tlob_helper).  With the monitor enabled,
+	# 0.5 s of idle time must not fire any budget_exceeded event.
+	trace_event_enable
+	trace_on
+	tlob_enable
+	trace_clear
+	# Sleep without any binding - just verify no spurious events
+	sleep 0.5
+	trace_grep "budget_exceeded" \
+		&& tap_fail "uprobe_no_false_positive" \
+			"spurious budget_exceeded without any binding" \
+		|| tap_pass "uprobe_no_false_positive"
+	cleanup
+}
+
+# ---------------------------------------------------------------------------
+# Helper: get_uprobe_offset <binary> <symbol>
+#
+# Use tlob_helper sym_offset to get the ELF file offset of <symbol>
+# in <binary>.  Prints the hex offset (e.g. "0x11d0") or empty string on
+# failure.
+# ---------------------------------------------------------------------------
+get_uprobe_offset() {
+	bin=$1; sym=$2
+	if [ ! -x "${IOCTL_HELPER}" ]; then
+		return
+	fi
+	"${IOCTL_HELPER}" sym_offset "${bin}" "${sym}" 2>/dev/null
+}
+
+# ---------------------------------------------------------------------------
+# Test 4: uprobe binding - violation detected
+#
+# Start tlob_uprobe_target (a busy-spin binary with a well-known symbol),
+# attach a uprobe on tlob_busy_work with a 10 us threshold, and verify
+# that a budget_exceeded event appears.
+# ---------------------------------------------------------------------------
+run_test_uprobe_violation() {
+	next_test; cleanup
+	if [ ! -e "${TLOB_MONITOR}" ]; then
+		tap_skip "uprobe_violation" "monitor file not available"
+		cleanup; return
+	fi
+	if [ ! -x "${UPROBE_TARGET}" ]; then
+		tap_skip "uprobe_violation" \
+			"tlob_uprobe_target not found or not executable"
+		cleanup; return
+	fi
+
+	# Get the file offsets of the start and stop probe symbols
+	busy_offset=$(get_uprobe_offset "${UPROBE_TARGET}" "tlob_busy_work")
+	if [ -z "${busy_offset}" ]; then
+		tap_skip "uprobe_violation" \
+			"cannot resolve tlob_busy_work offset in ${UPROBE_TARGET}"
+		cleanup; return
+	fi
+	stop_offset=$(get_uprobe_offset "${UPROBE_TARGET}" "tlob_busy_work_done")
+	if [ -z "${stop_offset}" ]; then
+		tap_skip "uprobe_violation" \
+			"cannot resolve tlob_busy_work_done offset in ${UPROBE_TARGET}"
+		cleanup; return
+	fi
+
+	# Start the busy-spin target (run for 30 s so the test can observe it)
+	"${UPROBE_TARGET}" 30000 &
+	busy_pid=$!
+	sleep 0.05
+
+	trace_event_enable
+	trace_on
+	tlob_enable
+	trace_clear
+
+	# Bind the target: 10 us budget; start=tlob_busy_work, stop=tlob_busy_work_done
+	binding="10:${busy_offset}:${stop_offset}:${UPROBE_TARGET}"
+	if ! echo "${binding}" > "${TLOB_MONITOR}" 2>/dev/null; then
+		kill "${busy_pid}" 2>/dev/null; wait "${busy_pid}" 2>/dev/null
+		tap_skip "uprobe_violation" \
+			"uprobe binding rejected (CONFIG_UPROBES=y needed)"
+		cleanup; return
+	fi
+
+	# Wait up to 2 s for a budget_exceeded event
+	found=0; i=0
+	while [ "$i" -lt 20 ]; do
+		sleep 0.1
+		trace_grep "budget_exceeded" && { found=1; break; }
+		i=$((i+1))
+	done
+
+	echo "-${busy_offset}:${UPROBE_TARGET}" > "${TLOB_MONITOR}" 2>/dev/null
+	kill "${busy_pid}" 2>/dev/null; wait "${busy_pid}" 2>/dev/null
+
+	if [ "${found}" != "1" ]; then
+		tap_fail "uprobe_violation" "no budget_exceeded within 2 s"
+		cleanup; return
+	fi
+
+	# Validate the event fields: threshold must match, on_cpu must be non-zero
+	# (CPU-bound violation), and state must be on_cpu.
+	ev=$(grep "budget_exceeded" "${TRACE_FILE}" | head -n 1)
+	if ! echo "${ev}" | grep -q "threshold=10 "; then
+		tap_fail "uprobe_violation" "threshold field mismatch: ${ev}"
+		cleanup; return
+	fi
+	on_cpu=$(echo "${ev}" | grep -o "on_cpu=[0-9]*" | cut -d= -f2)
+	if [ "${on_cpu:-0}" -eq 0 ]; then
+		tap_fail "uprobe_violation" "on_cpu=0 for a CPU-bound spin: ${ev}"
+		cleanup; return
+	fi
+	if ! echo "${ev}" | grep -q "state=on_cpu"; then
+		tap_fail "uprobe_violation" "state is not on_cpu: ${ev}"
+		cleanup; return
+	fi
+	tap_pass "uprobe_violation"
+	cleanup
+}
+
+# ---------------------------------------------------------------------------
+# Test 5: uprobe binding - remove binding stops monitoring
+#
+# Bind a pid via tlob_uprobe_target, then immediately remove it.
+# Verify that after removal the monitor file no longer lists the pid.
+# ---------------------------------------------------------------------------
+run_test_uprobe_unbind() {
+	next_test; cleanup
+	if [ ! -e "${TLOB_MONITOR}" ]; then
+		tap_skip "uprobe_unbind" "monitor file not available"
+		cleanup; return
+	fi
+	if [ ! -x "${UPROBE_TARGET}" ]; then
+		tap_skip "uprobe_unbind" \
+			"tlob_uprobe_target not found or not executable"
+		cleanup; return
+	fi
+
+	busy_offset=$(get_uprobe_offset "${UPROBE_TARGET}" "tlob_busy_work")
+	stop_offset=$(get_uprobe_offset "${UPROBE_TARGET}" "tlob_busy_work_done")
+	if [ -z "${busy_offset}" ] || [ -z "${stop_offset}" ]; then
+		tap_skip "uprobe_unbind" \
+			"cannot resolve tlob_busy_work/tlob_busy_work_done offset"
+		cleanup; return
+	fi
+
+	"${UPROBE_TARGET}" 30000 &
+	busy_pid=$!
+	sleep 0.05
+
+	tlob_enable
+	# 5 s budget - should not fire during this quick test
+	binding="5000000:${busy_offset}:${stop_offset}:${UPROBE_TARGET}"
+	if ! echo "${binding}" > "${TLOB_MONITOR}" 2>/dev/null; then
+		kill "${busy_pid}" 2>/dev/null; wait "${busy_pid}" 2>/dev/null
+		tap_skip "uprobe_unbind" \
+			"uprobe binding rejected (CONFIG_UPROBES=y needed)"
+		cleanup; return
+	fi
+
+	# Remove the binding
+	echo "-${busy_offset}:${UPROBE_TARGET}" > "${TLOB_MONITOR}" 2>/dev/null
+
+	# The monitor file should no longer list the binding for this offset
+	if grep -q "^[0-9]*:0x${busy_offset#0x}:" "${TLOB_MONITOR}" 2>/dev/null; then
+		kill "${busy_pid}" 2>/dev/null; wait "${busy_pid}" 2>/dev/null
+		tap_fail "uprobe_unbind" "pid still listed after removal"
+		cleanup; return
+	fi
+
+	kill "${busy_pid}" 2>/dev/null; wait "${busy_pid}" 2>/dev/null
+	tap_pass "uprobe_unbind"
+	cleanup
+}
+
+# ---------------------------------------------------------------------------
+# Test 6: uprobe - duplicate offset_start rejected
+#
+# Registering a second binding with the same offset_start in the same binary
+# must be rejected with an error, since two entry uprobes at the same address
+# would cause double tlob_start_task() calls and undefined behaviour.
+# ---------------------------------------------------------------------------
+run_test_uprobe_duplicate_offset() {
+	next_test; cleanup
+	if [ ! -e "${TLOB_MONITOR}" ]; then
+		tap_skip "uprobe_duplicate_offset" "monitor file not available"
+		cleanup; return
+	fi
+	if [ ! -x "${UPROBE_TARGET}" ]; then
+		tap_skip "uprobe_duplicate_offset" \
+			"tlob_uprobe_target not found or not executable"
+		cleanup; return
+	fi
+
+	busy_offset=$(get_uprobe_offset "${UPROBE_TARGET}" "tlob_busy_work")
+	stop_offset=$(get_uprobe_offset "${UPROBE_TARGET}" "tlob_busy_work_done")
+	if [ -z "${busy_offset}" ] || [ -z "${stop_offset}" ]; then
+		tap_skip "uprobe_duplicate_offset" \
+			"cannot resolve tlob_busy_work/tlob_busy_work_done offset"
+		cleanup; return
+	fi
+
+	tlob_enable
+
+	# First binding: should succeed
+	if ! echo "5000000:${busy_offset}:${stop_offset}:${UPROBE_TARGET}" \
+	        > "${TLOB_MONITOR}" 2>/dev/null; then
+		tap_skip "uprobe_duplicate_offset" \
+			"uprobe binding rejected (CONFIG_UPROBES=y needed)"
+		cleanup; return
+	fi
+
+	# Second binding with same offset_start: must be rejected
+	if echo "9999:${busy_offset}:${stop_offset}:${UPROBE_TARGET}" \
+	        > "${TLOB_MONITOR}" 2>/dev/null; then
+		echo "-${busy_offset}:${UPROBE_TARGET}" > "${TLOB_MONITOR}" 2>/dev/null
+		tap_fail "uprobe_duplicate_offset" \
+			"duplicate offset_start was accepted (expected error)"
+		cleanup; return
+	fi
+
+	echo "-${busy_offset}:${UPROBE_TARGET}" > "${TLOB_MONITOR}" 2>/dev/null
+	tap_pass "uprobe_duplicate_offset"
+	cleanup
+}
+
+# ---------------------------------------------------------------------------
+# Test 7: uprobe - independent thresholds for two regions in one binary
+#
+# Region A: tlob_busy_work with a 5 s budget - should NOT fire during the test.
+# Region B: tlob_busy_work_done with a 10 us budget - SHOULD fire quickly since
+#           tlob_uprobe_target calls tlob_busy_work_done after a busy spin.
+#
+# Verifies that independent bindings for different offsets in the same binary
+# are tracked separately and that only the tight-budget binding triggers a
+# budget_exceeded event.
+# ---------------------------------------------------------------------------
+run_test_uprobe_independent_thresholds() {
+	next_test; cleanup
+	if [ ! -e "${TLOB_MONITOR}" ]; then
+		tap_skip "uprobe_independent_thresholds" \
+			"monitor file not available"; cleanup; return
+	fi
+	if [ ! -x "${UPROBE_TARGET}" ]; then
+		tap_skip "uprobe_independent_thresholds" \
+			"tlob_uprobe_target not found or not executable"
+		cleanup; return
+	fi
+
+	busy_offset=$(get_uprobe_offset "${UPROBE_TARGET}" "tlob_busy_work")
+	busy_stop_offset=$(get_uprobe_offset "${UPROBE_TARGET}" "tlob_busy_work_done")
+	if [ -z "${busy_offset}" ] || [ -z "${busy_stop_offset}" ]; then
+		tap_skip "uprobe_independent_thresholds" \
+			"cannot resolve tlob_busy_work/tlob_busy_work_done offset"
+		cleanup; return
+	fi
+
+	"${UPROBE_TARGET}" 30000 &
+	busy_pid=$!
+	sleep 0.05
+
+	trace_event_enable
+	trace_on
+	tlob_enable
+	trace_clear
+
+	# Region A: generous 5 s budget on tlob_busy_work entry (should not fire)
+	if ! echo "5000000:${busy_offset}:${busy_stop_offset}:${UPROBE_TARGET}" \
+	        > "${TLOB_MONITOR}" 2>/dev/null; then
+		kill "${busy_pid}" 2>/dev/null; wait "${busy_pid}" 2>/dev/null
+		tap_skip "uprobe_independent_thresholds" \
+			"uprobe binding rejected (CONFIG_UPROBES=y needed)"
+		cleanup; return
+	fi
+	# Region B: tight 10 us budget on tlob_busy_work_done (fires quickly)
+	echo "10:${busy_stop_offset}:${busy_stop_offset}:${UPROBE_TARGET}" \
+		> "${TLOB_MONITOR}" 2>/dev/null
+
+	found=0; i=0
+	while [ "$i" -lt 20 ]; do
+		sleep 0.1
+		trace_grep "budget_exceeded" && { found=1; break; }
+		i=$((i+1))
+	done
+
+	echo "-${busy_offset}:${UPROBE_TARGET}" > "${TLOB_MONITOR}" 2>/dev/null
+	echo "-${busy_stop_offset}:${UPROBE_TARGET}" > "${TLOB_MONITOR}" 2>/dev/null
+	kill "${busy_pid}" 2>/dev/null; wait "${busy_pid}" 2>/dev/null
+
+	if [ "${found}" != "1" ]; then
+		tap_fail "uprobe_independent_thresholds" \
+			"budget_exceeded not raised for tight-budget region within 2 s"
+		cleanup; return
+	fi
+
+	# The violation must carry threshold=10 (Region B's budget).
+	ev=$(grep "budget_exceeded" "${TRACE_FILE}" | head -n 1)
+	if ! echo "${ev}" | grep -q "threshold=10 "; then
+		tap_fail "uprobe_independent_thresholds" \
+			"violation threshold is not Region B's 10 us: ${ev}"
+		cleanup; return
+	fi
+	tap_pass "uprobe_independent_thresholds"
+	cleanup
+}
+
+# ---------------------------------------------------------------------------
+# ioctl tests via tlob_helper
+#
+# Each test invokes the helper with a sub-test name.
+# Exit code: 0=pass, 1=fail, 2=skip.
+# ---------------------------------------------------------------------------
+run_ioctl_test() {
+	testname=$1
+	next_test
+
+	if [ ! -x "${IOCTL_HELPER}" ]; then
+		tap_skip "ioctl_${testname}" \
+			"tlob_helper not found or not executable"
+		return
+	fi
+	if [ ! -c "${RV_DEV}" ]; then
+		tap_skip "ioctl_${testname}" \
+			"${RV_DEV} not present (CONFIG_RV_CHARDEV=y needed)"
+		return
+	fi
+
+	tlob_enable
+	"${IOCTL_HELPER}" "${testname}"
+	rc=$?
+	tlob_disable
+
+	case "${rc}" in
+	0) tap_pass "ioctl_${testname}" ;;
+	2) tap_skip "ioctl_${testname}" "helper returned skip" ;;
+	*) tap_fail "ioctl_${testname}" "helper exited with code ${rc}" ;;
+	esac
+}
+
+# run_ioctl_test_not_enabled - like run_ioctl_test but deliberately does NOT
+# enable the tlob monitor before invoking the helper.  Used to verify that
+# ioctls issued against a disabled monitor return ENODEV rather than crashing
+# the kernel with a NULL pointer dereference.
+run_ioctl_test_not_enabled() {
+	next_test
+
+	if [ ! -x "${IOCTL_HELPER}" ]; then
+		tap_skip "ioctl_not_enabled" \
+			"tlob_helper not found or not executable"
+		return
+	fi
+	if [ ! -c "${RV_DEV}" ]; then
+		tap_skip "ioctl_not_enabled" \
+			"${RV_DEV} not present (CONFIG_RV_CHARDEV=y needed)"
+		return
+	fi
+
+	# Monitor intentionally left disabled.
+	tlob_disable
+	"${IOCTL_HELPER}" not_enabled
+	rc=$?
+
+	case "${rc}" in
+	0) tap_pass "ioctl_not_enabled" ;;
+	2) tap_skip "ioctl_not_enabled" "helper returned skip" ;;
+	*) tap_fail "ioctl_not_enabled" "helper exited with code ${rc}" ;;
+	esac
+}
+
+# ---------------------------------------------------------------------------
+# Main
+# ---------------------------------------------------------------------------
+check_root; check_tracefs; check_rv_dir; check_tlob
+tap_header; tap_plan 20
+
+# tracefs interface tests
+run_test_enable_disable
+run_test_tracefs_files
+
+# uprobe external monitoring tests
+run_test_uprobe_no_false_positive
+run_test_uprobe_violation
+run_test_uprobe_unbind
+run_test_uprobe_duplicate_offset
+run_test_uprobe_independent_thresholds
+
+# /dev/rv ioctl self-instrumentation tests
+run_ioctl_test_not_enabled
+run_ioctl_test within_budget
+run_ioctl_test over_budget_cpu
+run_ioctl_test over_budget_sleep
+run_ioctl_test double_start
+run_ioctl_test stop_no_start
+run_ioctl_test multi_thread
+run_ioctl_test self_watch
+run_ioctl_test invalid_flags
+run_ioctl_test notify_fd_bad
+run_ioctl_test mmap_basic
+run_ioctl_test mmap_errors
+run_ioctl_test mmap_consume
+
+echo "# Passed: ${t_pass} Failed: ${t_fail} Skipped: ${t_skip}"
+[ "${t_fail}" -gt 0 ] && exit 1 || exit 0
diff --git a/tools/testing/selftests/rv/tlob_helper.c b/tools/testing/selftests/rv/tlob_helper.c
new file mode 100644
index 000000000..cd76b56d1
--- /dev/null
+++ b/tools/testing/selftests/rv/tlob_helper.c
@@ -0,0 +1,994 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * tlob_helper.c - test helper and ELF utility for tlob selftests
+ *
+ * Called by test_tlob.sh to exercise the /dev/rv ioctl interface and to
+ * resolve ELF symbol offsets for uprobe bindings.  One subcommand per
+ * invocation so the shell script can report each as an independent TAP
+ * test case.
+ *
+ * Usage: tlob_helper <subcommand> [args...]
+ *
+ * Synchronous TRACE_START / TRACE_STOP tests:
+ *   not_enabled        - TRACE_START without tlob enabled -> ENODEV (no kernel crash)
+ *   within_budget      - start(50000 us), sleep 10 ms, stop -> expect 0
+ *   over_budget_cpu    - start(5000 us), busyspin 100 ms, stop -> EOVERFLOW
+ *   over_budget_sleep  - start(3000 us), sleep 50 ms, stop -> EOVERFLOW
+ *
+ * Error-handling tests:
+ *   double_start       - two starts without stop -> EEXIST on second
+ *   stop_no_start      - stop without start -> ESRCH
+ *
+ * Per-thread isolation test:
+ *   multi_thread       - two threads share one fd; one within budget, one over
+ *
+ * Asynchronous notification test (notify_fd + read()):
+ *   self_watch         - one worker exceeds budget; monitor fd receives one ntf via read()
+ *
+ * Input-validation tests (TRACE_START error paths):
+ *   invalid_flags      - TRACE_START with flags != 0 -> EINVAL
+ *   notify_fd_bad      - TRACE_START with notify_fd = stdout (non-rv fd) -> EINVAL
+ *
+ * mmap ring buffer tests (Scenario D):
+ *   mmap_basic         - mmap succeeds; verify tlob_mmap_page fields
+ *                        (version, capacity, data_offset, record_size)
+ *   mmap_errors        - MAP_PRIVATE, wrong size, and non-zero pgoff all
+ *                        return EINVAL
+ *   mmap_consume       - trigger a real violation via self-notification and
+ *                        consume the event through the mmap'd ring
+ *
+ * ELF utility (does not require /dev/rv):
+ *   sym_offset <binary> <symbol>
+ *                      - print the ELF file offset of <symbol> in <binary>
+ *                        (used by the shell script to build uprobe bindings)
+ *
+ * Exit code: 0 = pass, 1 = fail, 2 = skip (device not available).
+ */
+#define _GNU_SOURCE
+#include <elf.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <poll.h>
+#include <pthread.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <sys/stat.h>
+#include <time.h>
+#include <unistd.h>
+
+#include <linux/rv.h>
+
+/* Default ring capacity allocated at open(); matches TLOB_RING_DEFAULT_CAP. */
+#define TLOB_RING_DEFAULT_CAP	64U
+
+static int rv_fd = -1;
+
+static int open_rv(void)
+{
+	rv_fd = open("/dev/rv", O_RDWR);
+	if (rv_fd < 0) {
+		fprintf(stderr, "open /dev/rv: %s\n", strerror(errno));
+		return -1;
+	}
+	return 0;
+}
+
+static void busy_spin_us(unsigned long us)
+{
+	struct timespec start, now;
+	uint64_t elapsed;
+
+	clock_gettime(CLOCK_MONOTONIC, &start);
+	do {
+		clock_gettime(CLOCK_MONOTONIC, &now);
+		elapsed = (uint64_t)(now.tv_sec - start.tv_sec)
+			  * 1000000000ULL
+			+ (uint64_t)(now.tv_nsec - start.tv_nsec);
+	} while (elapsed < (uint64_t)us * 1000ULL);
+}
+
+static int do_start(uint64_t threshold_us)
+{
+	struct tlob_start_args args = {
+		.threshold_us = threshold_us,
+		.notify_fd    = -1,
+	};
+
+	return ioctl(rv_fd, TLOB_IOCTL_TRACE_START, &args);
+}
+
+static int do_stop(void)
+{
+	return ioctl(rv_fd, TLOB_IOCTL_TRACE_STOP, NULL);
+}
+
+/* -----------------------------------------------------------------------
+ * Synchronous TRACE_START / TRACE_STOP tests
+ * -----------------------------------------------------------------------
+ */
+
+/*
+ * test_not_enabled - TRACE_START must return ENODEV when the tlob monitor
+ * has not been enabled (tlob_state_cache is NULL).
+ *
+ * The shell wrapper deliberately does NOT call tlob_enable before invoking
+ * this subcommand, so the ioctl is expected to fail with ENODEV rather than
+ * crashing the kernel with a NULL pointer dereference in kmem_cache_alloc.
+ */
+static int test_not_enabled(void)
+{
+	int ret;
+
+	ret = do_start(1000);
+	if (ret == 0) {
+		fprintf(stderr, "TRACE_START: expected ENODEV, got success\n");
+		do_stop();
+		return 1;
+	}
+	if (errno != ENODEV) {
+		fprintf(stderr, "TRACE_START: expected ENODEV, got %s\n",
+			strerror(errno));
+		return 1;
+	}
+	return 0;
+}
+
+static int test_within_budget(void)
+{
+	int ret;
+
+	if (do_start(50000) < 0) {
+		fprintf(stderr, "TRACE_START: %s\n", strerror(errno));
+		return 1;
+	}
+	usleep(10000); /* 10 ms < 50 ms budget */
+	ret = do_stop();
+	if (ret != 0) {
+		fprintf(stderr, "TRACE_STOP: expected 0, got %d errno=%s\n",
+			ret, strerror(errno));
+		return 1;
+	}
+	return 0;
+}
+
+static int test_over_budget_cpu(void)
+{
+	int ret;
+
+	if (do_start(5000) < 0) {
+		fprintf(stderr, "TRACE_START: %s\n", strerror(errno));
+		return 1;
+	}
+	busy_spin_us(100000); /* 100 ms >> 5 ms budget */
+	ret = do_stop();
+	if (ret == 0) {
+		fprintf(stderr, "TRACE_STOP: expected EOVERFLOW, got 0\n");
+		return 1;
+	}
+	if (errno != EOVERFLOW) {
+		fprintf(stderr, "TRACE_STOP: expected EOVERFLOW, got %s\n",
+			strerror(errno));
+		return 1;
+	}
+	return 0;
+}
+
+static int test_over_budget_sleep(void)
+{
+	int ret;
+
+	if (do_start(3000) < 0) {
+		fprintf(stderr, "TRACE_START: %s\n", strerror(errno));
+		return 1;
+	}
+	usleep(50000); /* 50 ms >> 3 ms budget, off-CPU time counts */
+	ret = do_stop();
+	if (ret == 0) {
+		fprintf(stderr, "TRACE_STOP: expected EOVERFLOW, got 0\n");
+		return 1;
+	}
+	if (errno != EOVERFLOW) {
+		fprintf(stderr, "TRACE_STOP: expected EOVERFLOW, got %s\n",
+			strerror(errno));
+		return 1;
+	}
+	return 0;
+}
+
+/* -----------------------------------------------------------------------
+ * Error-handling tests
+ * -----------------------------------------------------------------------
+ */
+
+static int test_double_start(void)
+{
+	int ret;
+
+	if (do_start(10000000) < 0) {
+		fprintf(stderr, "first TRACE_START: %s\n", strerror(errno));
+		return 1;
+	}
+	ret = do_start(10000000);
+	if (ret == 0) {
+		fprintf(stderr, "second TRACE_START: expected EEXIST, got 0\n");
+		do_stop();
+		return 1;
+	}
+	if (errno != EEXIST) {
+		fprintf(stderr, "second TRACE_START: expected EEXIST, got %s\n",
+			strerror(errno));
+		do_stop();
+		return 1;
+	}
+	do_stop(); /* clean up */
+	return 0;
+}
+
+static int test_stop_no_start(void)
+{
+	int ret;
+
+	/* Ensure clean state: ignore error from a stale entry */
+	do_stop();
+
+	ret = do_stop();
+	if (ret == 0) {
+		fprintf(stderr, "TRACE_STOP: expected ESRCH, got 0\n");
+		return 1;
+	}
+	if (errno != ESRCH) {
+		fprintf(stderr, "TRACE_STOP: expected ESRCH, got %s\n",
+			strerror(errno));
+		return 1;
+	}
+	return 0;
+}
+
+/* -----------------------------------------------------------------------
+ * Per-thread isolation test
+ *
+ * Two threads share a single /dev/rv fd.  The monitor uses task_struct *
+ * as the key, so each thread gets an independent slot regardless of the
+ * shared fd.
+ * -----------------------------------------------------------------------
+ */
+
+struct mt_thread_args {
+	uint64_t      threshold_us;
+	unsigned long workload_us;
+	int           busy;
+	int           expect_eoverflow;
+	int           result;
+};
+
+static void *mt_thread_fn(void *arg)
+{
+	struct mt_thread_args *a = arg;
+	int ret;
+
+	if (do_start(a->threshold_us) < 0) {
+		fprintf(stderr, "thread TRACE_START: %s\n", strerror(errno));
+		a->result = 1;
+		return NULL;
+	}
+
+	if (a->busy)
+		busy_spin_us(a->workload_us);
+	else
+		usleep(a->workload_us);
+
+	ret = do_stop();
+	if (a->expect_eoverflow) {
+		if (ret == 0 || errno != EOVERFLOW) {
+			fprintf(stderr, "thread: expected EOVERFLOW, got ret=%d errno=%s\n",
+				ret, strerror(errno));
+			a->result = 1;
+			return NULL;
+		}
+	} else {
+		if (ret != 0) {
+			fprintf(stderr, "thread: expected 0, got ret=%d errno=%s\n",
+				ret, strerror(errno));
+			a->result = 1;
+			return NULL;
+		}
+	}
+	a->result = 0;
+	return NULL;
+}
+
+static int test_multi_thread(void)
+{
+	pthread_t ta, tb;
+	struct mt_thread_args a = {
+		.threshold_us     = 20000,  /* 20 ms */
+		.workload_us      = 5000,   /* 5 ms sleep -> within budget */
+		.busy             = 0,
+		.expect_eoverflow = 0,
+	};
+	struct mt_thread_args b = {
+		.threshold_us     = 3000,   /* 3 ms */
+		.workload_us      = 30000,  /* 30 ms spin -> over budget */
+		.busy             = 1,
+		.expect_eoverflow = 1,
+	};
+
+	pthread_create(&ta, NULL, mt_thread_fn, &a);
+	pthread_create(&tb, NULL, mt_thread_fn, &b);
+	pthread_join(ta, NULL);
+	pthread_join(tb, NULL);
+
+	return (a.result || b.result) ? 1 : 0;
+}
+
+/* -----------------------------------------------------------------------
+ * Asynchronous notification test (notify_fd + read())
+ *
+ * A dedicated monitor_fd is opened by the main thread.  Two worker threads
+ * each open their own work_fd and call TLOB_IOCTL_TRACE_START with
+ * notify_fd = monitor_fd, nominating it as the violation target.  Worker A
+ * stays within budget; worker B exceeds it.  The main thread reads from
+ * monitor_fd and expects exactly one tlob_event record.
+ * -----------------------------------------------------------------------
+ */
+
+struct sw_worker_args {
+	int           monitor_fd;
+	uint64_t      threshold_us;
+	unsigned long workload_us;
+	int           busy;
+	int           result;
+};
+
+static void *sw_worker_fn(void *arg)
+{
+	struct sw_worker_args *a = arg;
+	struct tlob_start_args args = {
+		.threshold_us = a->threshold_us,
+		.notify_fd    = a->monitor_fd,
+	};
+	int work_fd;
+	int ret;
+
+	work_fd = open("/dev/rv", O_RDWR);
+	if (work_fd < 0) {
+		fprintf(stderr, "worker open /dev/rv: %s\n", strerror(errno));
+		a->result = 1;
+		return NULL;
+	}
+
+	ret = ioctl(work_fd, TLOB_IOCTL_TRACE_START, &args);
+	if (ret < 0) {
+		fprintf(stderr, "TRACE_START (notify): %s\n", strerror(errno));
+		close(work_fd);
+		a->result = 1;
+		return NULL;
+	}
+
+	if (a->busy)
+		busy_spin_us(a->workload_us);
+	else
+		usleep(a->workload_us);
+
+	ioctl(work_fd, TLOB_IOCTL_TRACE_STOP, NULL);
+	close(work_fd);
+	a->result = 0;
+	return NULL;
+}
+
+static int test_self_watch(void)
+{
+	int monitor_fd;
+	pthread_t ta, tb;
+	struct sw_worker_args a = {
+		.threshold_us = 50000,  /* 50 ms */
+		.workload_us  = 5000,   /* 5 ms sleep -> no violation */
+		.busy         = 0,
+	};
+	struct sw_worker_args b = {
+		.threshold_us = 3000,   /* 3 ms */
+		.workload_us  = 30000,  /* 30 ms spin -> violation */
+		.busy         = 1,
+	};
+	struct tlob_event ntfs[8];
+	int violations = 0;
+	ssize_t n;
+
+	/*
+	 * Open monitor_fd with O_NONBLOCK so read() after the workers finish
+	 * returns immediately rather than blocking forever.
+	 */
+	monitor_fd = open("/dev/rv", O_RDWR | O_NONBLOCK);
+	if (monitor_fd < 0) {
+		fprintf(stderr, "open /dev/rv (monitor_fd): %s\n", strerror(errno));
+		return 1;
+	}
+	a.monitor_fd = monitor_fd;
+	b.monitor_fd = monitor_fd;
+
+	pthread_create(&ta, NULL, sw_worker_fn, &a);
+	pthread_create(&tb, NULL, sw_worker_fn, &b);
+	pthread_join(ta, NULL);
+	pthread_join(tb, NULL);
+
+	if (a.result || b.result) {
+		close(monitor_fd);
+		return 1;
+	}
+
+	/*
+	 * Drain all available tlob_event records.  With O_NONBLOCK the final
+	 * read() returns -1 with errno set to EAGAIN once the buffer is empty.
+	 */
+	while ((n = read(monitor_fd, ntfs, sizeof(ntfs))) > 0)
+		violations += (int)(n / sizeof(struct tlob_event));
+
+	close(monitor_fd);
+
+	if (violations != 1) {
+		fprintf(stderr, "self_watch: expected 1 violation, got %d\n",
+			violations);
+		return 1;
+	}
+	return 0;
+}
+
+/* -----------------------------------------------------------------------
+ * Input-validation tests (TRACE_START error paths)
+ * -----------------------------------------------------------------------
+ */
+
+/*
+ * test_invalid_flags - TRACE_START with flags != 0 must return EINVAL.
+ *
+ * The flags field is reserved for future extensions and must be zero.
+ * Callers that set it to a non-zero value are rejected early so that a
+ * future kernel can assign meaning to those bits without silently
+ * ignoring them.
+ */
+static int test_invalid_flags(void)
+{
+	struct tlob_start_args args = {
+		.threshold_us = 1000,
+		.notify_fd    = -1,
+		.flags        = 1,   /* non-zero: must be rejected */
+	};
+	int ret;
+
+	ret = ioctl(rv_fd, TLOB_IOCTL_TRACE_START, &args);
+	if (ret == 0) {
+		fprintf(stderr, "TRACE_START(flags=1): expected EINVAL, got success\n");
+		do_stop();
+		return 1;
+	}
+	if (errno != EINVAL) {
+		fprintf(stderr, "TRACE_START(flags=1): expected EINVAL, got %s\n",
+			strerror(errno));
+		return 1;
+	}
+	return 0;
+}
+
+/*
+ * test_notify_fd_bad - TRACE_START with a non-/dev/rv notify_fd must return
+ * EINVAL.
+ *
+ * When notify_fd >= 0, the kernel resolves it to a struct file and checks
+ * that its private_data is non-NULL (i.e. it is a /dev/rv file descriptor).
+ * Passing stdout (fd 1) supplies a real, open fd whose private_data is NULL,
+ * so the kernel must reject it with EINVAL.
+ */
+static int test_notify_fd_bad(void)
+{
+	struct tlob_start_args args = {
+		.threshold_us = 1000,
+		.notify_fd    = STDOUT_FILENO,   /* open but not a /dev/rv fd */
+		.flags        = 0,
+	};
+	int ret;
+
+	ret = ioctl(rv_fd, TLOB_IOCTL_TRACE_START, &args);
+	if (ret == 0) {
+		fprintf(stderr,
+			"TRACE_START(notify_fd=stdout): expected EINVAL, got success\n");
+		do_stop();
+		return 1;
+	}
+	if (errno != EINVAL) {
+		fprintf(stderr,
+			"TRACE_START(notify_fd=stdout): expected EINVAL, got %s\n",
+			strerror(errno));
+		return 1;
+	}
+	return 0;
+}
+
+/* -----------------------------------------------------------------------
+ * mmap ring buffer tests (Scenario D)
+ * -----------------------------------------------------------------------
+ */
+
+/*
+ * test_mmap_basic - mmap the ring buffer and verify the control page fields.
+ *
+ * The kernel allocates TLOB_RING_DEFAULT_CAP records at open().  A shared
+ * mmap of PAGE_SIZE + cap * record_size must succeed and the tlob_mmap_page
+ * header must contain consistent values.
+ */
+static int test_mmap_basic(void)
+{
+	long pagesize = sysconf(_SC_PAGESIZE);
+	size_t mmap_len = (size_t)pagesize +
+			  TLOB_RING_DEFAULT_CAP * sizeof(struct tlob_event);
+	/* rv_mmap requires a page-aligned length */
+	mmap_len = (mmap_len + (size_t)(pagesize - 1)) & ~(size_t)(pagesize - 1);
+	struct tlob_mmap_page *page;
+	struct tlob_event *data;
+	void *map;
+	int ret = 0;
+
+	map = mmap(NULL, mmap_len, PROT_READ | PROT_WRITE, MAP_SHARED, rv_fd, 0);
+	if (map == MAP_FAILED) {
+		fprintf(stderr, "mmap_basic: mmap: %s\n", strerror(errno));
+		return 1;
+	}
+
+	page = (struct tlob_mmap_page *)map;
+	data = (struct tlob_event *)((char *)map + page->data_offset);
+
+	if (page->version != 1) {
+		fprintf(stderr, "mmap_basic: expected version=1, got %u\n",
+			page->version);
+		ret = 1;
+		goto out;
+	}
+	if (page->capacity != TLOB_RING_DEFAULT_CAP) {
+		fprintf(stderr, "mmap_basic: expected capacity=%u, got %u\n",
+			TLOB_RING_DEFAULT_CAP, page->capacity);
+		ret = 1;
+		goto out;
+	}
+	if (page->data_offset != (uint32_t)pagesize) {
+		fprintf(stderr, "mmap_basic: expected data_offset=%ld, got %u\n",
+			pagesize, page->data_offset);
+		ret = 1;
+		goto out;
+	}
+	if (page->record_size != sizeof(struct tlob_event)) {
+		fprintf(stderr, "mmap_basic: expected record_size=%zu, got %u\n",
+			sizeof(struct tlob_event), page->record_size);
+		ret = 1;
+		goto out;
+	}
+	if (page->data_head != 0 || page->data_tail != 0) {
+		fprintf(stderr, "mmap_basic: ring not empty at open: head=%u tail=%u\n",
+			page->data_head, page->data_tail);
+		ret = 1;
+		goto out;
+	}
+	/* Touch the data array to confirm it is accessible. */
+	(void)data[0].tid;
+out:
+	munmap(map, mmap_len);
+	return ret;
+}
+
+/*
+ * test_mmap_errors - verify that rv_mmap() rejects invalid mmap parameters.
+ *
+ * Four cases are tested; each must return MAP_FAILED with errno == EINVAL:
+ *   1. size one page short of the correct ring length
+ *   2. size one page larger than the correct ring length
+ *   3. MAP_PRIVATE (only MAP_SHARED is permitted)
+ *   4. non-zero vm_pgoff (offset must be 0)
+ */
+static int test_mmap_errors(void)
+{
+	long pagesize = sysconf(_SC_PAGESIZE);
+	size_t correct_len = (size_t)pagesize +
+			     TLOB_RING_DEFAULT_CAP * sizeof(struct tlob_event);
+	/* rv_mmap requires a page-aligned length */
+	correct_len = (correct_len + (size_t)(pagesize - 1)) & ~(size_t)(pagesize - 1);
+	void *map;
+	int ret = 0;
+
+	/* Case 1: size one page short of the correct ring length */
+	map = mmap(NULL, correct_len - (size_t)pagesize, PROT_READ | PROT_WRITE,
+		   MAP_SHARED, rv_fd, 0);
+	if (map != MAP_FAILED) {
+		fprintf(stderr, "mmap_errors: short-size mmap succeeded (expected EINVAL)\n");
+		munmap(map, correct_len - (size_t)pagesize);
+		ret = 1;
+	} else if (errno != EINVAL) {
+		fprintf(stderr, "mmap_errors: short-size: expected EINVAL, got %s\n",
+			strerror(errno));
+		ret = 1;
+	}
+
+	/* Case 2: size one page too large */
+	map = mmap(NULL, correct_len + (size_t)pagesize, PROT_READ | PROT_WRITE,
+		   MAP_SHARED, rv_fd, 0);
+	if (map != MAP_FAILED) {
+		fprintf(stderr, "mmap_errors: oversized mmap succeeded (expected EINVAL)\n");
+		munmap(map, correct_len + (size_t)pagesize);
+		ret = 1;
+	} else if (errno != EINVAL) {
+		fprintf(stderr, "mmap_errors: oversized: expected EINVAL, got %s\n",
+			strerror(errno));
+		ret = 1;
+	}
+
+	/* Case 3: MAP_PRIVATE instead of MAP_SHARED */
+	map = mmap(NULL, correct_len, PROT_READ | PROT_WRITE,
+		   MAP_PRIVATE, rv_fd, 0);
+	if (map != MAP_FAILED) {
+		fprintf(stderr, "mmap_errors: MAP_PRIVATE succeeded (expected EINVAL)\n");
+		munmap(map, correct_len);
+		ret = 1;
+	} else if (errno != EINVAL) {
+		fprintf(stderr, "mmap_errors: MAP_PRIVATE: expected EINVAL, got %s\n",
+			strerror(errno));
+		ret = 1;
+	}
+
+	/* Case 4: non-zero file offset (pgoff = 1) */
+	map = mmap(NULL, correct_len, PROT_READ | PROT_WRITE,
+		   MAP_SHARED, rv_fd, (off_t)pagesize);
+	if (map != MAP_FAILED) {
+		fprintf(stderr, "mmap_errors: non-zero pgoff mmap succeeded (expected EINVAL)\n");
+		munmap(map, correct_len);
+		ret = 1;
+	} else if (errno != EINVAL) {
+		fprintf(stderr, "mmap_errors: non-zero pgoff: expected EINVAL, got %s\n",
+			strerror(errno));
+		ret = 1;
+	}
+
+	return ret;
+}
+
+/*
+ * test_mmap_consume - zero-copy consumption of a real violation event.
+ *
+ * Arms a 5 ms budget with self-notification (notify_fd = rv_fd), sleeps
+ * 50 ms (off-CPU violation), then reads the pushed event through the mmap'd
+ * ring without calling read().  Verifies:
+ *   - TRACE_STOP returns EOVERFLOW (budget was exceeded)
+ *   - data_head == 1 after the violation
+ *   - the event fields (threshold_us, tag, tid) are correct
+ *   - data_tail can be advanced to consume the record (ring empties)
+ */
+static int test_mmap_consume(void)
+{
+	long pagesize = sysconf(_SC_PAGESIZE);
+	size_t mmap_len = (size_t)pagesize +
+			  TLOB_RING_DEFAULT_CAP * sizeof(struct tlob_event);
+	/* rv_mmap requires a page-aligned length */
+	mmap_len = (mmap_len + (size_t)(pagesize - 1)) & ~(size_t)(pagesize - 1);
+	struct tlob_start_args args = {
+		.threshold_us = 5000,		/* 5 ms */
+		.notify_fd    = rv_fd,		/* self-notification */
+		.tag          = 0xdeadbeefULL,
+		.flags        = 0,
+	};
+	struct tlob_mmap_page *page;
+	struct tlob_event *data;
+	void *map;
+	int stop_ret;
+	int ret = 0;
+
+	map = mmap(NULL, mmap_len, PROT_READ | PROT_WRITE, MAP_SHARED, rv_fd, 0);
+	if (map == MAP_FAILED) {
+		fprintf(stderr, "mmap_consume: mmap: %s\n", strerror(errno));
+		return 1;
+	}
+
+	page = (struct tlob_mmap_page *)map;
+	data = (struct tlob_event *)((char *)map + page->data_offset);
+
+	if (ioctl(rv_fd, TLOB_IOCTL_TRACE_START, &args) < 0) {
+		fprintf(stderr, "mmap_consume: TRACE_START: %s\n", strerror(errno));
+		ret = 1;
+		goto out;
+	}
+
+	usleep(50000); /* 50 ms >> 5 ms budget -> off-CPU violation */
+
+	stop_ret = ioctl(rv_fd, TLOB_IOCTL_TRACE_STOP, NULL);
+	if (stop_ret == 0) {
+		fprintf(stderr, "mmap_consume: TRACE_STOP returned 0, expected EOVERFLOW\n");
+		ret = 1;
+		goto out;
+	}
+	if (errno != EOVERFLOW) {
+		fprintf(stderr, "mmap_consume: TRACE_STOP: expected EOVERFLOW, got %s\n",
+			strerror(errno));
+		ret = 1;
+		goto out;
+	}
+
+	/* Pairs with smp_store_release in tlob_event_push. */
+	if (__atomic_load_n(&page->data_head, __ATOMIC_ACQUIRE) != 1) {
+		fprintf(stderr, "mmap_consume: expected data_head=1, got %u\n",
+			page->data_head);
+		ret = 1;
+		goto out;
+	}
+	if (page->data_tail != 0) {
+		fprintf(stderr, "mmap_consume: expected data_tail=0, got %u\n",
+			page->data_tail);
+		ret = 1;
+		goto out;
+	}
+
+	/* Verify record content */
+	if (data[0].threshold_us != 5000) {
+		fprintf(stderr, "mmap_consume: expected threshold_us=5000, got %llu\n",
+			(unsigned long long)data[0].threshold_us);
+		ret = 1;
+		goto out;
+	}
+	if (data[0].tag != 0xdeadbeefULL) {
+		fprintf(stderr, "mmap_consume: expected tag=0xdeadbeef, got %llx\n",
+			(unsigned long long)data[0].tag);
+		ret = 1;
+		goto out;
+	}
+	if (data[0].tid == 0) {
+		fprintf(stderr, "mmap_consume: tid is 0\n");
+		ret = 1;
+		goto out;
+	}
+
+	/* Consume: advance data_tail and confirm ring is empty */
+	__atomic_store_n(&page->data_tail, 1U, __ATOMIC_RELEASE);
+	if (__atomic_load_n(&page->data_head, __ATOMIC_ACQUIRE) !=
+	    __atomic_load_n(&page->data_tail, __ATOMIC_ACQUIRE)) {
+		fprintf(stderr, "mmap_consume: ring not empty after consume\n");
+		ret = 1;
+	}
+
+out:
+	munmap(map, mmap_len);
+	return ret;
+}
+
+/* -----------------------------------------------------------------------
+ * ELF utility: sym_offset
+ *
+ * Print the ELF file offset of a symbol in a binary.  Supports 32- and
+ * 64-bit ELF.  Walks the section headers to find .symtab (falling back to
+ * .dynsym), then converts the symbol's virtual address to a file offset
+ * via the PT_LOAD program headers.
+ *
+ * Does not require /dev/rv; used by the shell script to build uprobe
+ * bindings of the form pid:threshold_us:offset_start:offset_stop:binary_path.
+ *
+ * Returns 0 on success (offset printed to stdout), 1 on failure.
+ * -----------------------------------------------------------------------
+ */
+static int sym_offset(const char *binary, const char *symname)
+{
+	int fd;
+	struct stat st;
+	void *map;
+	Elf64_Ehdr *ehdr;
+	Elf32_Ehdr *ehdr32;
+	int is64;
+	uint64_t sym_vaddr = 0;
+	int found = 0;
+	uint64_t file_offset = 0;
+
+	fd = open(binary, O_RDONLY);
+	if (fd < 0) {
+		fprintf(stderr, "open %s: %s\n", binary, strerror(errno));
+		return 1;
+	}
+	if (fstat(fd, &st) < 0) {
+		close(fd);
+		return 1;
+	}
+	map = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
+	close(fd);
+	if (map == MAP_FAILED) {
+		fprintf(stderr, "mmap: %s\n", strerror(errno));
+		return 1;
+	}
+
+	/* Identify ELF class */
+	ehdr = (Elf64_Ehdr *)map;
+	ehdr32 = (Elf32_Ehdr *)map;
+	if (st.st_size < EI_NIDENT ||
+	    ehdr->e_ident[EI_MAG0] != ELFMAG0 ||
+	    ehdr->e_ident[EI_MAG1] != ELFMAG1 ||
+	    ehdr->e_ident[EI_MAG2] != ELFMAG2 ||
+	    ehdr->e_ident[EI_MAG3] != ELFMAG3) {
+		fprintf(stderr, "%s: not an ELF file\n", binary);
+		munmap(map, (size_t)st.st_size);
+		return 1;
+	}
+	is64 = (ehdr->e_ident[EI_CLASS] == ELFCLASS64);
+
+	if (is64) {
+		/* Walk section headers to find .symtab or .dynsym */
+		Elf64_Shdr *shdrs = (Elf64_Shdr *)((char *)map + ehdr->e_shoff);
+		Elf64_Shdr *shstrtab_hdr = &shdrs[ehdr->e_shstrndx];
+		const char *shstrtab = (char *)map + shstrtab_hdr->sh_offset;
+		int si;
+
+		/* Prefer .symtab; fall back to .dynsym */
+		for (int pass = 0; pass < 2 && !found; pass++) {
+			const char *target = pass ? ".dynsym" : ".symtab";
+
+			for (si = 0; si < ehdr->e_shnum && !found; si++) {
+				Elf64_Shdr *sh = &shdrs[si];
+				const char *name = shstrtab + sh->sh_name;
+
+				if (strcmp(name, target) != 0)
+					continue;
+
+				Elf64_Shdr *strtab_sh = &shdrs[sh->sh_link];
+				const char *strtab = (char *)map + strtab_sh->sh_offset;
+				Elf64_Sym *syms = (Elf64_Sym *)((char *)map + sh->sh_offset);
+				uint64_t nsyms = sh->sh_size / sizeof(Elf64_Sym);
+				uint64_t j;
+
+				for (j = 0; j < nsyms; j++) {
+					if (strcmp(strtab + syms[j].st_name, symname) == 0) {
+						sym_vaddr = syms[j].st_value;
+						found = 1;
+						break;
+					}
+				}
+			}
+		}
+
+		if (!found) {
+			fprintf(stderr, "symbol '%s' not found in %s\n", symname, binary);
+			munmap(map, (size_t)st.st_size);
+			return 1;
+		}
+
+		/* Convert vaddr to file offset via PT_LOAD segments */
+		Elf64_Phdr *phdrs = (Elf64_Phdr *)((char *)map + ehdr->e_phoff);
+		int pi;
+
+		for (pi = 0; pi < ehdr->e_phnum; pi++) {
+			Elf64_Phdr *ph = &phdrs[pi];
+
+			if (ph->p_type != PT_LOAD)
+				continue;
+			if (sym_vaddr >= ph->p_vaddr &&
+			    sym_vaddr < ph->p_vaddr + ph->p_filesz) {
+				file_offset = sym_vaddr - ph->p_vaddr + ph->p_offset;
+				break;
+			}
+		}
+	} else {
+		/* 32-bit ELF */
+		Elf32_Shdr *shdrs = (Elf32_Shdr *)((char *)map + ehdr32->e_shoff);
+		Elf32_Shdr *shstrtab_hdr = &shdrs[ehdr32->e_shstrndx];
+		const char *shstrtab = (char *)map + shstrtab_hdr->sh_offset;
+		int si;
+		uint32_t sym_vaddr32 = 0;
+
+		for (int pass = 0; pass < 2 && !found; pass++) {
+			const char *target = pass ? ".dynsym" : ".symtab";
+
+			for (si = 0; si < ehdr32->e_shnum && !found; si++) {
+				Elf32_Shdr *sh = &shdrs[si];
+				const char *name = shstrtab + sh->sh_name;
+
+				if (strcmp(name, target) != 0)
+					continue;
+
+				Elf32_Shdr *strtab_sh = &shdrs[sh->sh_link];
+				const char *strtab = (char *)map + strtab_sh->sh_offset;
+				Elf32_Sym *syms = (Elf32_Sym *)((char *)map + sh->sh_offset);
+				uint32_t nsyms = sh->sh_size / sizeof(Elf32_Sym);
+				uint32_t j;
+
+				for (j = 0; j < nsyms; j++) {
+					if (strcmp(strtab + syms[j].st_name, symname) == 0) {
+						sym_vaddr32 = syms[j].st_value;
+						found = 1;
+						break;
+					}
+				}
+			}
+		}
+
+		if (!found) {
+			fprintf(stderr, "symbol '%s' not found in %s\n", symname, binary);
+			munmap(map, (size_t)st.st_size);
+			return 1;
+		}
+
+		Elf32_Phdr *phdrs = (Elf32_Phdr *)((char *)map + ehdr32->e_phoff);
+		int pi;
+
+		for (pi = 0; pi < ehdr32->e_phnum; pi++) {
+			Elf32_Phdr *ph = &phdrs[pi];
+
+			if (ph->p_type != PT_LOAD)
+				continue;
+			if (sym_vaddr32 >= ph->p_vaddr &&
+			    sym_vaddr32 < ph->p_vaddr + ph->p_filesz) {
+				file_offset = sym_vaddr32 - ph->p_vaddr + ph->p_offset;
+				break;
+			}
+		}
+		sym_vaddr = sym_vaddr32;
+	}
+
+	munmap(map, (size_t)st.st_size);
+
+	if (!file_offset && sym_vaddr) {
+		fprintf(stderr, "could not map vaddr 0x%lx to file offset\n",
+			(unsigned long)sym_vaddr);
+		return 1;
+	}
+
+	printf("0x%lx\n", (unsigned long)file_offset);
+	return 0;
+}
+
+int main(int argc, char *argv[])
+{
+	int rc;
+
+	if (argc < 2) {
+		fprintf(stderr, "Usage: %s <subcommand> [args...]\n", argv[0]);
+		return 1;
+	}
+
+	/* sym_offset does not need /dev/rv */
+	if (strcmp(argv[1], "sym_offset") == 0) {
+		if (argc < 4) {
+			fprintf(stderr, "Usage: %s sym_offset <binary> <symbol>\n",
+				argv[0]);
+			return 1;
+		}
+		return sym_offset(argv[2], argv[3]);
+	}
+
+	if (open_rv() < 0)
+		return 2; /* skip */
+
+	if (strcmp(argv[1], "not_enabled") == 0)
+		rc = test_not_enabled();
+	else if (strcmp(argv[1], "within_budget") == 0)
+		rc = test_within_budget();
+	else if (strcmp(argv[1], "over_budget_cpu") == 0)
+		rc = test_over_budget_cpu();
+	else if (strcmp(argv[1], "over_budget_sleep") == 0)
+		rc = test_over_budget_sleep();
+	else if (strcmp(argv[1], "double_start") == 0)
+		rc = test_double_start();
+	else if (strcmp(argv[1], "stop_no_start") == 0)
+		rc = test_stop_no_start();
+	else if (strcmp(argv[1], "multi_thread") == 0)
+		rc = test_multi_thread();
+	else if (strcmp(argv[1], "self_watch") == 0)
+		rc = test_self_watch();
+	else if (strcmp(argv[1], "invalid_flags") == 0)
+		rc = test_invalid_flags();
+	else if (strcmp(argv[1], "notify_fd_bad") == 0)
+		rc = test_notify_fd_bad();
+	else if (strcmp(argv[1], "mmap_basic") == 0)
+		rc = test_mmap_basic();
+	else if (strcmp(argv[1], "mmap_errors") == 0)
+		rc = test_mmap_errors();
+	else if (strcmp(argv[1], "mmap_consume") == 0)
+		rc = test_mmap_consume();
+	else {
+		fprintf(stderr, "Unknown test: %s\n", argv[1]);
+		rc = 1;
+	}
+
+	close(rv_fd);
+	return rc;
+}
diff --git a/tools/testing/selftests/rv/tlob_uprobe_target.c b/tools/testing/selftests/rv/tlob_uprobe_target.c
new file mode 100644
index 000000000..6c895cb40
--- /dev/null
+++ b/tools/testing/selftests/rv/tlob_uprobe_target.c
@@ -0,0 +1,108 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * tlob_uprobe_target.c - uprobe target binary for tlob selftests.
+ *
+ * Provides two well-known probe points:
+ *   tlob_busy_work()      - start probe: arms the tlob budget timer
+ *   tlob_busy_work_done() - stop  probe: cancels the timer on completion
+ *
+ * The tlob selftest writes a five-field uprobe binding:
+ *   pid:threshold_us:offset_start:offset_stop:binary_path
+ * where offset_start is the file offset of tlob_busy_work and offset_stop
+ * is the file offset of tlob_busy_work_done (resolved via tlob_helper
+ * sym_offset).
+ *
+ * Both probe points are plain entry uprobes (no uretprobe).  The busy loop
+ * keeps the task on-CPU so that either the stop probe fires cleanly (within
+ * budget) or the hrtimer fires first and emits tlob_budget_exceeded (over
+ * budget).
+ *
+ * Usage: tlob_uprobe_target <duration_ms>
+ *
+ * Loops calling tlob_busy_work() in 200 ms iterations until <duration_ms>
+ * has elapsed (0 = run for ~24 hours).  Short iterations ensure the uprobe
+ * entry fires on every call even if the uprobe is installed after the
+ * program has started.
+ */
+#define _GNU_SOURCE
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <time.h>
+
+#ifndef noinline
+#define noinline __attribute__((noinline))
+#endif
+
+static inline int timespec_before(const struct timespec *a,
+				   const struct timespec *b)
+{
+	return a->tv_sec < b->tv_sec ||
+	       (a->tv_sec == b->tv_sec && a->tv_nsec < b->tv_nsec);
+}
+
+static void timespec_add_ms(struct timespec *ts, unsigned long ms)
+{
+	ts->tv_sec  += ms / 1000;
+	ts->tv_nsec += (long)(ms % 1000) * 1000000L;
+	if (ts->tv_nsec >= 1000000000L) {
+		ts->tv_sec++;
+		ts->tv_nsec -= 1000000000L;
+	}
+}
+
+/*
+ * tlob_busy_work_done - stop-probe target.
+ *
+ * Called by tlob_busy_work() after the busy loop.  The uprobe on this
+ * function's entry fires tlob_stop_task(), cancelling the budget timer.
+ * noinline ensures the compiler never merges this function with its caller,
+ * guaranteeing the entry uprobe always fires.
+ */
+noinline void tlob_busy_work_done(void)
+{
+	/* empty: the uprobe fires on entry */
+}
+
+/*
+ * tlob_busy_work - start-probe target.
+ *
+ * The uprobe on this function's entry fires tlob_start_task(), arming the
+ * budget timer.  noinline prevents the compiler and linker (including LTO)
+ * from inlining this function into its callers, ensuring the entry uprobe
+ * fires on every call.
+ */
+noinline void tlob_busy_work(unsigned long duration_ns)
+{
+	struct timespec start, now;
+	uint64_t elapsed;
+
+	clock_gettime(CLOCK_MONOTONIC, &start);
+	do {
+		clock_gettime(CLOCK_MONOTONIC, &now);
+		elapsed = (uint64_t)(now.tv_sec - start.tv_sec)
+			  * 1000000000ULL
+			+ (uint64_t)(now.tv_nsec - start.tv_nsec);
+	} while (elapsed < duration_ns);
+
+	tlob_busy_work_done();
+}
+
+int main(int argc, char *argv[])
+{
+	unsigned long duration_ms = 0;
+	struct timespec deadline, now;
+
+	if (argc >= 2)
+		duration_ms = strtoul(argv[1], NULL, 10);
+
+	clock_gettime(CLOCK_MONOTONIC, &deadline);
+	timespec_add_ms(&deadline, duration_ms ? duration_ms : 86400000UL);
+
+	do {
+		tlob_busy_work(200 * 1000000UL); /* 200 ms per iteration */
+		clock_gettime(CLOCK_MONOTONIC, &now);
+	} while (timespec_before(&now, &deadline));
+
+	return 0;
+}
-- 
2.43.0



* Re: [RFC PATCH 2/4] rv/tlob: Add tlob deterministic automaton monitor
  2026-04-12 19:27 ` [RFC PATCH 2/4] rv/tlob: Add tlob deterministic automaton monitor wen.yang
@ 2026-04-13  8:19   ` Gabriele Monaco
  2026-04-16 15:09     ` Wen Yang
  0 siblings, 1 reply; 11+ messages in thread
From: Gabriele Monaco @ 2026-04-13  8:19 UTC (permalink / raw)
  To: wen.yang
  Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	linux-trace-kernel, linux-kernel

On Mon, 2026-04-13 at 03:27 +0800, wen.yang@linux.dev wrote:
> From: Wen Yang <wen.yang@linux.dev>
> 
> Add the tlob (task latency over budget) RV monitor. tlob tracks the
> monotonic elapsed time (CLOCK_MONOTONIC) of a marked per-task code
> path, including time off-CPU, and fires a per-task hrtimer when the
> elapsed time exceeds a configurable budget.
> 
> Three-state DA (unmonitored/on_cpu/off_cpu) driven by trace_start,
> switch_in/out, and budget_expired events. Per-task state lives in a
> fixed-size hash table (TLOB_MAX_MONITORED slots) with RCU-deferred
> free.
> 
> Two userspace interfaces:
>  - tracefs: uprobe pair registration via the monitor file using the
>    format "pid:threshold_us:offset_start:offset_stop:binary_path"
>  - /dev/rv ioctls (CONFIG_RV_CHARDEV): TLOB_IOCTL_TRACE_START /
>    TRACE_STOP; TRACE_STOP returns -EOVERFLOW on violation
> 
> Each /dev/rv fd has a per-fd mmap ring buffer (physically contiguous
> pages). A control page (struct tlob_mmap_page) at offset 0 exposes
> head/tail/dropped for lockless userspace reads; struct tlob_event
> records follow at data_offset. Drop-new policy on overflow.
> 
> UAPI: include/uapi/linux/rv.h (tlob_start_args, tlob_event,
>       tlob_mmap_page, ioctl numbers), monitor_tlob.rst,
>       ioctl-number.rst (RV_IOC_MAGIC=0xB9).
> 

I'm not fully grasping all the requirements for the monitors yet, but I see you
are reimplementing a lot of functionality in the monitor itself rather than
within RV; let's see if we can consolidate some of it:

 * you're using timer expirations, can we do it with timed automata? [1]
 * RV automata usually don't have an /unmonitored/ state: your trace_start event
would be the start condition (da_event_start) and the monitor becomes
non-running at each violation (it calls da_monitor_reset() automatically), so
all setup/cleanup logic should be handled implicitly within RV. I believe that
would also save you that ugly trace_event_tlob() redefinition.
 * you're maintaining a local hash table of per-task entries; that could use
the per-object monitors [2], where your "object" is in fact your struct,
allocated when you start the monitor with all appropriate fields and indexed
by pid
 * you are handling violations manually; considering timed automata trigger a
full-fledged violation on timeouts, can you use the RV way (error tracepoints
or reactors only)? Do you need the additional reporting within the
tracepoint/ioctl? Cannot the userspace consumer infer all those from other
events and let RV do just the monitoring?
 * I like the uprobe thing, we could probably move all that to a common helper
once we figure out how to make it generic.

Note: [1] and [2] haven't reached upstream yet, but should reach linux-next soon.

Thanks,
Gabriele

[1] -
https://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git/commit/?h=rv/for-next&id=f5587d1b6ec938afb2f74fe399a68020d66923e4
[2] -
https://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git/commit/?h=rv/for-next&id=da282bf7fadb095ee0a40c32ff0126429c769b45

> Signed-off-by: Wen Yang <wen.yang@linux.dev>
> ---
>  Documentation/trace/rv/index.rst              |   1 +
>  Documentation/trace/rv/monitor_tlob.rst       | 381 +++++++
>  .../userspace-api/ioctl/ioctl-number.rst      |   1 +
>  include/uapi/linux/rv.h                       | 181 ++++
>  kernel/trace/rv/Kconfig                       |  17 +
>  kernel/trace/rv/Makefile                      |   2 +
>  kernel/trace/rv/monitors/tlob/Kconfig         |  51 +
>  kernel/trace/rv/monitors/tlob/tlob.c          | 986 ++++++++++++++++++
>  kernel/trace/rv/monitors/tlob/tlob.h          | 145 +++
>  kernel/trace/rv/monitors/tlob/tlob_trace.h    |  42 +
>  kernel/trace/rv/rv.c                          |   4 +
>  kernel/trace/rv/rv_dev.c                      | 602 +++++++++++
>  kernel/trace/rv/rv_trace.h                    |  50 +
>  13 files changed, 2463 insertions(+)
>  create mode 100644 Documentation/trace/rv/monitor_tlob.rst
>  create mode 100644 include/uapi/linux/rv.h
>  create mode 100644 kernel/trace/rv/monitors/tlob/Kconfig
>  create mode 100644 kernel/trace/rv/monitors/tlob/tlob.c
>  create mode 100644 kernel/trace/rv/monitors/tlob/tlob.h
>  create mode 100644 kernel/trace/rv/monitors/tlob/tlob_trace.h
>  create mode 100644 kernel/trace/rv/rv_dev.c
> 
> diff --git a/Documentation/trace/rv/index.rst b/Documentation/trace/rv/index.rst
> index a2812ac5c..4f2bfaf38 100644
> --- a/Documentation/trace/rv/index.rst
> +++ b/Documentation/trace/rv/index.rst
> @@ -15,3 +15,4 @@ Runtime Verification
>     monitor_wwnr.rst
>     monitor_sched.rst
>     monitor_rtapp.rst
> +   monitor_tlob.rst
> diff --git a/Documentation/trace/rv/monitor_tlob.rst b/Documentation/trace/rv/monitor_tlob.rst
> new file mode 100644
> index 000000000..d498e9894
> --- /dev/null
> +++ b/Documentation/trace/rv/monitor_tlob.rst
> @@ -0,0 +1,381 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +Monitor tlob
> +============
> +
> +- Name: tlob - task latency over budget
> +- Type: per-task deterministic automaton
> +- Author: Wen Yang <wen.yang@linux.dev>
> +
> +Description
> +-----------
> +
> +The tlob monitor tracks per-task elapsed time (CLOCK_MONOTONIC, including
> +both on-CPU and off-CPU time) and reports a violation when the monitored
> +task exceeds a configurable latency budget threshold.
> +
> +The monitor implements a three-state deterministic automaton::
> +
> +                              |
> +                              | (initial)
> +                              v
> +                    +--------------+
> +          +-------> | unmonitored  |
> +          |         +--------------+
> +          |                |
> +          |          trace_start
> +          |                v
> +          |         +--------------+
> +          |         |   on_cpu     |
> +          |         +--------------+
> +          |           |         |
> +          |  switch_out|         | trace_stop / budget_expired
> +          |            v         v
> +          |  +--------------+  (unmonitored)
> +          |  |   off_cpu    |
> +          |  +--------------+
> +          |     |         |
> +          |     | switch_in| trace_stop / budget_expired
> +          |     v         v
> +          |  (on_cpu)  (unmonitored)
> +          |
> +          +-- trace_stop (from on_cpu or off_cpu)
> +
> +  Key transitions:
> +    unmonitored   --(trace_start)-->   on_cpu
> +    on_cpu        --(switch_out)-->    off_cpu
> +    off_cpu       --(switch_in)-->     on_cpu
> +    on_cpu        --(trace_stop)-->    unmonitored
> +    off_cpu       --(trace_stop)-->    unmonitored
> +    on_cpu        --(budget_expired)-> unmonitored   [violation]
> +    off_cpu       --(budget_expired)-> unmonitored   [violation]
> +
> +  sched_wakeup self-loops in on_cpu and unmonitored; switch_out and
> +  sched_wakeup self-loop in off_cpu.  budget_expired is fired by the
> +  one-shot hrtimer; it always transitions to unmonitored regardless of
> +  whether the task is on-CPU or off-CPU when the timer fires.
> +
> +State Descriptions
> +------------------
> +
> +- **unmonitored**: Task is not being traced.  Scheduling events
> +  (``switch_in``, ``switch_out``, ``sched_wakeup``) are silently
> +  ignored (self-loop).  The monitor waits for a ``trace_start`` event
> +  to begin a new observation window.
> +
> +- **on_cpu**: Task is running on the CPU with the deadline timer armed.
> +  A one-shot hrtimer was set for ``threshold_us`` microseconds at
> +  ``trace_start`` time.  A ``switch_out`` event transitions to
> +  ``off_cpu``; the hrtimer keeps running (off-CPU time counts toward
> +  the budget).  A ``trace_stop`` cancels the timer and returns to
> +  ``unmonitored`` (normal completion).  If the hrtimer fires
> +  (``budget_expired``) the violation is recorded and the automaton
> +  transitions to ``unmonitored``.
> +
> +- **off_cpu**: Task was preempted or blocked.  The one-shot hrtimer
> +  continues to run.  A ``switch_in`` event returns to ``on_cpu``.
> +  A ``trace_stop`` cancels the timer and returns to ``unmonitored``.
> +  If the hrtimer fires (``budget_expired``) while the task is off-CPU,
> +  the violation is recorded and the automaton transitions to
> +  ``unmonitored``.
> +
> +Rationale
> +---------
> +
> +The per-task latency budget threshold allows operators to express timing
> +requirements in microseconds and receive an immediate ftrace event when a
> +task exceeds its budget.  This is useful for real-time tasks
> +(``SCHED_FIFO`` / ``SCHED_DEADLINE``) where total elapsed time must
> +remain within a known bound.
> +
> +Each task has an independent threshold, so up to ``TLOB_MAX_MONITORED``
> +(64) tasks with different timing requirements can be monitored
> +simultaneously.
> +
> +On threshold violation the automaton records a ``tlob_budget_exceeded``
> +ftrace event carrying the final on-CPU / off-CPU time breakdown, but does
> +not kill or throttle the task.  Monitoring can be restarted by issuing a
> +new ``trace_start`` event (or a new ``TLOB_IOCTL_TRACE_START`` ioctl).
> +
> +A per-task one-shot hrtimer is armed at ``trace_start`` for exactly
> +``threshold_us`` microseconds.  It fires at most once per monitoring
> +window, performs an O(1) hash lookup, records the violation, and injects
> +the ``budget_expired`` event into the DA.  When ``CONFIG_RV_MON_TLOB``
> +is not set there is zero runtime cost.
> +
> +Usage
> +-----
> +
> +tracefs interface (uprobe-based external monitoring)
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +The ``monitor`` tracefs file allows any privileged user to instrument an
> +unmodified binary via uprobes, without changing its source code.  Write a
> +four-field record to attach two plain entry uprobes: one at
> +``offset_start`` fires ``tlob_start_task()`` and one at ``offset_stop``
> +fires ``tlob_stop_task()``, so the latency budget covers exactly the code
> +region between the two offsets::
> +
> +  threshold_us:offset_start:offset_stop:binary_path
> +
> +``binary_path`` comes last so it may freely contain ``:`` (e.g. paths
> +inside a container namespace).
> +
> +The uprobes fire for every task that executes the probed instruction in
> +the binary, consistent with the native uprobe semantics.  All tasks that
> +execute the code region get independent per-task monitoring slots.
> +
> +Using two plain entry uprobes (rather than a uretprobe for the stop) means
> +that a mistyped offset can never corrupt the call stack; the worst outcome
> +of a bad ``offset_stop`` is a missed stop that causes the hrtimer to fire
> +and report a budget violation.
> +
> +Example  --  monitor a code region in ``/usr/bin/myapp`` with a 5 ms
> +budget, where the region starts at offset 0x12a0 and ends at 0x12f0::
> +
> +  echo 1 > /sys/kernel/tracing/rv/monitors/tlob/enable
> +
> +  # Bind uprobes: start probe starts the clock, stop probe stops it
> +  echo "5000:0x12a0:0x12f0:/usr/bin/myapp" \
> +      > /sys/kernel/tracing/rv/monitors/tlob/monitor
> +
> +  # Remove the uprobe binding for this code region
> +  echo "-0x12a0:/usr/bin/myapp" \
> +      > /sys/kernel/tracing/rv/monitors/tlob/monitor
> +
> +  # List registered uprobe bindings (mirrors the write format)
> +  cat /sys/kernel/tracing/rv/monitors/tlob/monitor
> +  # -> 5000:0x12a0:0x12f0:/usr/bin/myapp
> +
> +  # Read violations from the trace buffer
> +  cat /sys/kernel/tracing/trace
> +
> +Up to ``TLOB_MAX_MONITORED`` tasks may be monitored simultaneously.
> +
> +The offsets can be obtained with ``nm`` or ``readelf``::
> +
> +  nm -n /usr/bin/myapp | grep my_function
> +  # -> 00000000000012a0 T my_function
> +
> +  readelf -s /usr/bin/myapp | grep my_function
> +  # -> 42: 00000000000012a0  336 FUNC GLOBAL DEFAULT  13 my_function
> +
> +  # offset_start = 0x12a0 (function entry)
> +  # offset_stop  = 0x12a0 + 0x50 = 0x12f0 (or any instruction before return)
> +
> +Notes:
> +
> +- The uprobes fire for every task that executes the probed instruction,
> +  so concurrent calls from different threads each get independent
> +  monitoring slots.
> +- ``offset_stop`` need not be a function return; it can be any instruction
> +  within the region.  If the stop probe is never reached (e.g. early exit
> +  path bypasses it), the hrtimer fires and a budget violation is reported.
> +- Each ``(binary_path, offset_start)`` pair may only be registered once.
> +  A second write with the same ``offset_start`` for the same binary is
> +  rejected with ``-EEXIST``.  Two entry uprobes at the same address would
> +  both fire for every task, causing ``tlob_start_task()`` to be called
> +  twice; the second call would silently fail with ``-EEXIST`` and the
> +  second binding's threshold would never take effect.  Different code
> +  regions that share the same ``offset_stop`` (common exit point) are
> +  explicitly allowed.
> +- The uprobe binding is removed when ``-offset_start:binary_path`` is
> +  written to ``monitor``, or when the monitor is disabled.
> +- The ``tag`` field in every ``tlob_budget_exceeded`` event is
> +  automatically set to ``offset_start`` for the tracefs path, so
> +  violation events for different code regions are immediately
> +  distinguishable even when ``threshold_us`` values are identical.
> +
> +ftrace ring buffer (budget violation events)
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +When a monitored task exceeds its latency budget the hrtimer fires,
> +records the violation, and emits a single ``tlob_budget_exceeded`` event
> +into the ftrace ring buffer.  **Nothing is written to the ftrace ring
> +buffer while the task is within budget.**
> +
> +The event carries the on-CPU / off-CPU time breakdown so that root-cause
> +analysis (CPU-bound vs. scheduling / I/O overrun) is immediate::
> +
> +  cat /sys/kernel/tracing/trace
> +
> +Example output::
> +
> +  myapp-1234 [003] .... 12345.678: tlob_budget_exceeded: \
> +    myapp[1234]: budget exceeded threshold=5000 \
> +    on_cpu=820 off_cpu=4500 switches=3 state=off_cpu tag=0x00000000000012a0
> +
> +Field descriptions:
> +
> +``threshold``
> +  Configured latency budget in microseconds.
> +
> +``on_cpu``
> +  Cumulative on-CPU time since ``trace_start``, in microseconds.
> +
> +``off_cpu``
> +  Cumulative off-CPU (scheduling + I/O wait) time since ``trace_start``,
> +  in microseconds.
> +
> +``switches``
> +  Number of times the task was scheduled out during this window.
> +
> +``state``
> +  DA state when the hrtimer fired: ``on_cpu`` means the task was executing
> +  when the budget expired (CPU-bound overrun); ``off_cpu`` means the task
> +  was preempted or blocked (scheduling / I/O overrun).
> +
> +``tag``
> +  Opaque 64-bit cookie supplied by the caller via ``tlob_start_args.tag``
> +  (ioctl path) or automatically set to ``offset_start`` (tracefs uprobe
> +  path).  Use it to distinguish violations from different code regions
> +  monitored by the same thread.  Zero when not set.
> +
> +To capture violations in a file::
> +
> +  trace-cmd record -e tlob_budget_exceeded &
> +  # ... run workload ...
> +  trace-cmd report
> +
> +/dev/rv ioctl interface (self-instrumentation)
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Tasks can self-instrument their own code paths via the ``/dev/rv`` misc
> +device (requires ``CONFIG_RV_CHARDEV``).  The kernel key is
> +``task_struct``; multiple threads sharing a single fd each get their own
> +independent monitoring slot.
> +
> +**Synchronous mode**  --  the calling thread checks its own result::
> +
> +  int fd = open("/dev/rv", O_RDWR);
> +
> +  struct tlob_start_args args = {
> +      .threshold_us = 50000,   /* 50 ms */
> +      .tag          = 0,       /* optional; 0 = don't care */
> +      .notify_fd    = -1,      /* no fd notification */
> +  };
> +  ioctl(fd, TLOB_IOCTL_TRACE_START, &args);
> +
> +  /* ... code path under observation ... */
> +
> +  int ret = ioctl(fd, TLOB_IOCTL_TRACE_STOP, NULL);
> +  /* ret == 0:          within budget  */
> +  /* ret == -EOVERFLOW: budget exceeded */
> +
> +  close(fd);
> +
> +**Asynchronous mode**  --  a dedicated monitor thread receives violation
> +records via ``read()`` on a shared fd, decoupling the observation from
> +the critical path::
> +
> +  /* Monitor thread: open a dedicated fd. */
> +  int monitor_fd = open("/dev/rv", O_RDWR);
> +
> +  /* Worker thread: set notify_fd = monitor_fd in TRACE_START args. */
> +  int work_fd = open("/dev/rv", O_RDWR);
> +  struct tlob_start_args args = {
> +      .threshold_us = 10000,   /* 10 ms */
> +      .tag          = REGION_A,
> +      .notify_fd    = monitor_fd,
> +  };
> +  ioctl(work_fd, TLOB_IOCTL_TRACE_START, &args);
> +  /* ... critical section ... */
> +  ioctl(work_fd, TLOB_IOCTL_TRACE_STOP, NULL);
> +
> +  /* Monitor thread: blocking read() returns one or more tlob_event
> +   * records. */
> +  struct tlob_event ntfs[8];
> +  ssize_t n = read(monitor_fd, ntfs, sizeof(ntfs));
> +  for (int i = 0; i < n / sizeof(struct tlob_event); i++) {
> +      struct tlob_event *ntf = &ntfs[i];
> +      printf("tid=%u tag=0x%llx exceeded budget=%llu us "
> +             "(on_cpu=%llu off_cpu=%llu switches=%u state=%s)\n",
> +             ntf->tid, ntf->tag, ntf->threshold_us,
> +             ntf->on_cpu_us, ntf->off_cpu_us, ntf->switches,
> +             ntf->state ? "on_cpu" : "off_cpu");
> +  }
> +
> +**mmap ring buffer**  --  zero-copy consumption of violation events::
> +
> +  int fd = open("/dev/rv", O_RDWR);
> +  struct tlob_start_args args = {
> +      .threshold_us = 1000,   /* 1 ms */
> +      .notify_fd    = fd,     /* push violations to own ring buffer */
> +  };
> +  ioctl(fd, TLOB_IOCTL_TRACE_START, &args);
> +
> +  /* Map the ring: one control page + capacity data records.  Round
> +   * the total up to a page boundary; the kernel requires a
> +   * page-aligned length. */
> +  size_t pagesize = sysconf(_SC_PAGESIZE);
> +  size_t cap = 64;   /* read from page->capacity after mmap */
> +  size_t raw = pagesize + cap * sizeof(struct tlob_event);
> +  size_t len = (raw + pagesize - 1) & ~(pagesize - 1);
> +  void *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
> +
> +  struct tlob_mmap_page *page = map;
> +  struct tlob_event *data =
> +      (struct tlob_event *)((char *)map + page->data_offset);
> +
> +  /* Consumer loop: poll for events, read without copying. */
> +  while (1) {
> +      poll(&(struct pollfd){fd, POLLIN, 0}, 1, -1);
> +
> +      uint32_t head = __atomic_load_n(&page->data_head, __ATOMIC_ACQUIRE);
> +      uint32_t tail = page->data_tail;
> +      while (tail != head) {
> +          handle(&data[tail & (page->capacity - 1)]);
> +          tail++;
> +      }
> +      __atomic_store_n(&page->data_tail, tail, __ATOMIC_RELEASE);
> +  }
> +
> +Note: ``read()`` and ``mmap()`` share the same ring and ``data_tail``
> +cursor.  Do not use both simultaneously on the same fd.
> +
> +``tlob_event`` fields:
> +
> +``tid``
> +  Thread ID (``task_pid_vnr``) of the violating task.
> +
> +``threshold_us``
> +  Budget that was exceeded, in microseconds.
> +
> +``on_cpu_us``
> +  Cumulative on-CPU time at violation time, in microseconds.
> +
> +``off_cpu_us``
> +  Cumulative off-CPU time at violation time, in microseconds.
> +
> +``switches``
> +  Number of context switches since ``TRACE_START``.
> +
> +``state``
> +  1 = timer fired while task was on-CPU; 0 = timer fired while off-CPU.
> +
> +``tag``
> +  Cookie from ``tlob_start_args.tag``; for the tracefs uprobe path this
> +  equals ``offset_start``.  Zero when not set.
> +
> +tracefs files
> +-------------
> +
> +The following files are created under
> +``/sys/kernel/tracing/rv/monitors/tlob/``:
> +
> +``enable`` (rw)
> +  Write ``1`` to enable the monitor; write ``0`` to disable it and
> +  stop all currently monitored tasks.
> +
> +``desc`` (ro)
> +  Human-readable description of the monitor.
> +
> +``monitor`` (rw)
> +  Write ``threshold_us:offset_start:offset_stop:binary_path`` to bind two
> +  plain entry uprobes in *binary_path*.  The uprobe at *offset_start* fires
> +  ``tlob_start_task()``; the uprobe at *offset_stop* fires
> +  ``tlob_stop_task()``.  Returns ``-EEXIST`` if a binding with the same
> +  *offset_start* already exists for *binary_path*.  Write
> +  ``-offset_start:binary_path`` to remove the binding.  Read to list
> +  registered bindings, one
> +  ``threshold_us:0xoffset_start:0xoffset_stop:binary_path`` entry per line.
> +
> +Specification
> +-------------
> +
> +Graphviz DOT file in tools/verification/models/tlob.dot
> diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
> index 331223761..8d3af68db 100644
> --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
> +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
> @@ -385,6 +385,7 @@ Code  Seq#    Include File                                             Comments
>  0xB8  01-02  uapi/misc/mrvl_cn10k_dpi.h                                Marvell CN10K DPI driver
>  0xB8  all    uapi/linux/mshv.h                                         Microsoft Hyper-V /dev/mshv driver
>                                                                         <mailto:linux-hyperv@vger.kernel.org>
> +0xB9  00-3F  linux/rv.h                                                Runtime Verification (RV) monitors
>  0xBA  00-0F  uapi/linux/liveupdate.h                                   Pasha Tatashin
>                                                                         <mailto:pasha.tatashin@soleen.com>
>  0xC0  00-0F  linux/usb/iowarrior.h
> diff --git a/include/uapi/linux/rv.h b/include/uapi/linux/rv.h
> new file mode 100644
> index 000000000..d1b96d8cd
> --- /dev/null
> +++ b/include/uapi/linux/rv.h
> @@ -0,0 +1,181 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +/*
> + * UAPI definitions for Runtime Verification (RV) monitors.
> + *
> + * All RV monitors that expose an ioctl self-instrumentation interface
> + * share the magic byte RV_IOC_MAGIC (0xB9), registered in
> + * Documentation/userspace-api/ioctl/ioctl-number.rst.
> + *
> + * A single /dev/rv misc device serves as the entry point.  ioctl numbers
> + * encode both the monitor identity and the operation:
> + *
> + *   0x01 - 0x1F  tlob (task latency over budget)
> + *   0x20 - 0x3F  reserved for future RV monitors
> + *
> + * Usage examples and design rationale are in:
> + *   Documentation/trace/rv/monitor_tlob.rst
> + */
> +
> +#ifndef _UAPI_LINUX_RV_H
> +#define _UAPI_LINUX_RV_H
> +
> +#include <linux/ioctl.h>
> +#include <linux/types.h>
> +
> +/* Magic byte shared by all RV monitor ioctls. */
> +#define RV_IOC_MAGIC	0xB9
> +
> +/* -----------------------------------------------------------------------
> + * tlob: task latency over budget monitor  (nr 0x01 - 0x1F)
> + * -----------------------------------------------------------------------
> + */
> +
> +/**
> + * struct tlob_start_args - arguments for TLOB_IOCTL_TRACE_START
> + * @threshold_us: Latency budget for this critical section, in microseconds.
> + *               Must be greater than zero.
> + * @tag:         Opaque 64-bit cookie supplied by the caller.  Echoed back
> + *               verbatim in the tlob_budget_exceeded ftrace event and in any
> + *               tlob_event record delivered via @notify_fd.  Use it to
> identify
> + *               which code region triggered a violation when the same thread
> + *               monitors multiple regions sequentially.  Set to 0 if not
> + *               needed.
> + * @notify_fd:   File descriptor that will receive a tlob_event record on
> + *               violation.  Must refer to an open /dev/rv fd.  May equal
> + *               the calling fd (self-notification, useful for retrieving the
> + *               on_cpu_us / off_cpu_us breakdown after TRACE_STOP returns
> + *               -EOVERFLOW).  Set to -1 to disable fd notification; in that
> + *               case violations are only signalled via the TRACE_STOP return
> + *               value and the tlob_budget_exceeded ftrace event.
> + * @flags:       Must be 0.  Reserved for future extensions.
> + */
> +struct tlob_start_args {
> +	__u64 threshold_us;
> +	__u64 tag;
> +	__s32 notify_fd;
> +	__u32 flags;
> +};
> +
> +/**
> + * struct tlob_event - one budget-exceeded event
> + *
> + * Consumed by read() on the notify_fd registered at TLOB_IOCTL_TRACE_START.
> + * Each record describes a single budget exceedance for one task.
> + *
> + * @tid:          Thread ID (task_pid_vnr) of the violating task.
> + * @threshold_us: Budget that was exceeded, in microseconds.
> + * @on_cpu_us:    Cumulative on-CPU time at violation time, in microseconds.
> + * @off_cpu_us:   Cumulative off-CPU (scheduling + I/O wait) time at
> + *               violation time, in microseconds.
> + * @switches:     Number of context switches since TRACE_START.
> + * @state:        DA state at violation: 1 = on_cpu, 0 = off_cpu.
> + * @tag:          Cookie from tlob_start_args.tag; for the tracefs
> + *               uprobe path this is the offset_start value.  Zero when
> + *               not set.
> + */
> +struct tlob_event {
> +	__u32 tid;
> +	__u32 pad;
> +	__u64 threshold_us;
> +	__u64 on_cpu_us;
> +	__u64 off_cpu_us;
> +	__u32 switches;
> +	__u32 state;   /* 1 = on_cpu, 0 = off_cpu */
> +	__u64 tag;
> +};
> +
> +/**
> + * struct tlob_mmap_page - control page for the mmap'd violation ring buffer
> + *
> + * Mapped at offset 0 of the mmap region returned by mmap(2) on a /dev/rv fd.
> + * The data array of struct tlob_event records begins at offset @data_offset
> + * (always one page from the mmap base; use this field rather than
> + * hard-coding PAGE_SIZE so the code remains correct across
> + * architectures).
> + *
> + * Ring layout:
> + *
> + *   mmap base + 0             : struct tlob_mmap_page  (one page)
> + *   mmap base + data_offset   : struct tlob_event[capacity]
> + *
> + * The mmap length determines the ring capacity.  Compute it as:
> + *
> + *   raw    = sysconf(_SC_PAGESIZE) + capacity * sizeof(struct tlob_event)
> + *   length = (raw + sysconf(_SC_PAGESIZE) - 1) &
> + *            ~(sysconf(_SC_PAGESIZE) - 1)
> + *
> + * i.e. round the raw byte count up to the next page boundary before
> + * passing it to mmap(2).  The kernel requires a page-aligned length.
> + * capacity must be a power of 2.  Read @capacity after a successful
> + * mmap(2) for the actual value.
> + *
> + * Producer/consumer ordering contract:
> + *
> + *   Kernel (producer):
> + *     data[data_head & (capacity - 1)] = event;
> + *     // pairs with load-acquire in userspace:
> + *     smp_store_release(&page->data_head, data_head + 1);
> + *
> + *   Userspace (consumer):
> + *     // pairs with store-release in kernel:
> + *     head = __atomic_load_n(&page->data_head, __ATOMIC_ACQUIRE);
> + *     for (tail = page->data_tail; tail != head; tail++)
> + *         handle(&data[tail & (capacity - 1)]);
> + *     __atomic_store_n(&page->data_tail, tail, __ATOMIC_RELEASE);
> + *
> + * @data_head and @data_tail are monotonically increasing __u32 counters
> + * in units of records.  Unsigned 32-bit wrap-around is handled correctly
> + * by modular arithmetic; the ring is full when
> + * (data_head - data_tail) == capacity.
> + *
> + * When the ring is full the kernel drops the incoming record and increments
> + * @dropped.  The consumer should check @dropped periodically to detect loss.
> + *
> + * read() and mmap() share the same ring buffer.  Do not use both
> + * simultaneously on the same fd.
> + *
> + * @data_head:   Next write slot index.  Updated by the kernel with
> + *               store-release ordering.  Read by userspace with
> + *               load-acquire.
> + * @data_tail:   Next read slot index.  Updated by userspace.  Read by the
> + *               kernel to detect overflow.
> + * @capacity:    Actual ring capacity in records (power of 2).  Written once
> + *               by the kernel at mmap time; read-only for userspace
> + *               thereafter.
> + * @version:     Ring buffer ABI version; currently 1.
> + * @data_offset: Byte offset from the mmap base to the data array.
> + *               Always equal to sysconf(_SC_PAGESIZE) on the running kernel.
> + * @record_size: sizeof(struct tlob_event) as seen by the kernel.  Verify
> + *               this matches userspace's sizeof before indexing the array.
> + * @dropped:     Number of events dropped because the ring was full.
> + *               Monotonically increasing; read with __ATOMIC_RELAXED.
> + */
> +struct tlob_mmap_page {
> +	__u32  data_head;
> +	__u32  data_tail;
> +	__u32  capacity;
> +	__u32  version;
> +	__u32  data_offset;
> +	__u32  record_size;
> +	__u64  dropped;
> +};
> +
> +/*
> + * TLOB_IOCTL_TRACE_START - begin monitoring the calling task.
> + *
> + * Arms a per-task hrtimer for threshold_us microseconds.  If args.notify_fd
> + * is >= 0, a tlob_event record is pushed into that fd's ring buffer on
> + * violation in addition to the tlob_budget_exceeded ftrace event.
> + * args.notify_fd == -1 disables fd notification.
> + *
> + * Violation records are consumed by read() on the notify_fd (blocking or
> + * non-blocking depending on O_NONBLOCK).  On violation,
> TLOB_IOCTL_TRACE_STOP
> + * also returns -EOVERFLOW regardless of whether notify_fd is set.
> + *
> + * args.flags must be 0.
> + */
> +#define TLOB_IOCTL_TRACE_START		_IOW(RV_IOC_MAGIC, 0x01, struct tlob_start_args)
> +
> +/*
> + * TLOB_IOCTL_TRACE_STOP - end monitoring the calling task.
> + *
> + * Returns 0 if within budget, -EOVERFLOW if the budget was exceeded.
> + */
> +#define TLOB_IOCTL_TRACE_STOP		_IO(RV_IOC_MAGIC,  0x02)
> +
> +#endif /* _UAPI_LINUX_RV_H */
> diff --git a/kernel/trace/rv/Kconfig b/kernel/trace/rv/Kconfig
> index 5b4be87ba..227573cda 100644
> --- a/kernel/trace/rv/Kconfig
> +++ b/kernel/trace/rv/Kconfig
> @@ -65,6 +65,7 @@ source "kernel/trace/rv/monitors/pagefault/Kconfig"
>  source "kernel/trace/rv/monitors/sleep/Kconfig"
>  # Add new rtapp monitors here
>  
> +source "kernel/trace/rv/monitors/tlob/Kconfig"
>  # Add new monitors here
>  
>  config RV_REACTORS
> @@ -93,3 +94,19 @@ config RV_REACT_PANIC
>  	help
>  	  Enables the panic reactor. The panic reactor emits a printk()
>  	  message if an exception is found and panic()s the system.
> +
> +config RV_CHARDEV
> +	bool "RV ioctl interface via /dev/rv"
> +	depends on RV
> +	default n
> +	help
> +	  Register a /dev/rv misc device that exposes an ioctl interface
> +	  for RV monitor self-instrumentation.  All RV monitors share the
> +	  single device node; ioctl numbers encode the monitor identity.
> +
> +	  When enabled, user-space programs can open /dev/rv and use
> +	  monitor-specific ioctl commands to bracket code regions they
> +	  want the kernel RV subsystem to observe.
> +
> +	  Say Y here if you want to use the tlob self-instrumentation
> +	  ioctl interface; otherwise say N.
> diff --git a/kernel/trace/rv/Makefile b/kernel/trace/rv/Makefile
> index 750e4ad6f..cc3781a3b 100644
> --- a/kernel/trace/rv/Makefile
> +++ b/kernel/trace/rv/Makefile
> @@ -3,6 +3,7 @@
>  ccflags-y += -I $(src)		# needed for trace events
>  
>  obj-$(CONFIG_RV) += rv.o
> +obj-$(CONFIG_RV_CHARDEV) += rv_dev.o
>  obj-$(CONFIG_RV_MON_WIP) += monitors/wip/wip.o
>  obj-$(CONFIG_RV_MON_WWNR) += monitors/wwnr/wwnr.o
>  obj-$(CONFIG_RV_MON_SCHED) += monitors/sched/sched.o
> @@ -17,6 +18,7 @@ obj-$(CONFIG_RV_MON_STS) += monitors/sts/sts.o
>  obj-$(CONFIG_RV_MON_NRP) += monitors/nrp/nrp.o
>  obj-$(CONFIG_RV_MON_SSSW) += monitors/sssw/sssw.o
>  obj-$(CONFIG_RV_MON_OPID) += monitors/opid/opid.o
> +obj-$(CONFIG_RV_MON_TLOB) += monitors/tlob/tlob.o
>  # Add new monitors here
>  obj-$(CONFIG_RV_REACTORS) += rv_reactors.o
>  obj-$(CONFIG_RV_REACT_PRINTK) += reactor_printk.o
> diff --git a/kernel/trace/rv/monitors/tlob/Kconfig b/kernel/trace/rv/monitors/tlob/Kconfig
> new file mode 100644
> index 000000000..010237480
> --- /dev/null
> +++ b/kernel/trace/rv/monitors/tlob/Kconfig
> @@ -0,0 +1,51 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +#
> +config RV_MON_TLOB
> +	bool "tlob monitor"
> +	depends on RV
> +	depends on UPROBES
> +	select DA_MON_EVENTS_ID
> +	help
> +	  Enable the tlob (task latency over budget) monitor. This monitor
> +	  tracks the elapsed time (CLOCK_MONOTONIC) of a marked code path
> +	  within a task (including both on-CPU and off-CPU time) and
> +	  reports a violation when the elapsed time exceeds a configurable
> +	  budget threshold.
> +
> +	  The monitor implements a three-state deterministic automaton.
> +	  States: unmonitored, on_cpu, off_cpu.
> +	  Key transitions:
> +	    unmonitored  --(trace_start)-->    on_cpu
> +	    on_cpu       --(switch_out)-->     off_cpu
> +	    off_cpu      --(switch_in)-->      on_cpu
> +	    on_cpu       --(trace_stop)-->     unmonitored
> +	    off_cpu      --(trace_stop)-->     unmonitored
> +	    on_cpu       --(budget_expired)--> unmonitored
> +	    off_cpu      --(budget_expired)--> unmonitored
> +
> +	  External configuration is done via the tracefs "monitor" file:
> +	    echo threshold_us:offset_start:offset_stop:binary_path \
> +	        > .../rv/monitors/tlob/monitor
> +	    echo -offset_start:binary_path \
> +	        > .../rv/monitors/tlob/monitor    (remove a binding)
> +	    cat   .../rv/monitors/tlob/monitor    (list bindings)
> +
> +	  The uprobe binding places two plain entry uprobes at offset_start
> +	  and offset_stop in the binary; these trigger tlob_start_task() and
> +	  tlob_stop_task() respectively.  Using two entry uprobes (rather
> +	  than a uretprobe) means that a mistyped offset can never corrupt
> +	  the call stack; the worst outcome is a missed stop, which causes
> +	  the hrtimer to fire and report a budget violation.
> +
> +	  Violation events are delivered via a lock-free mmap ring buffer on
> +	  /dev/rv (enabled by CONFIG_RV_CHARDEV).  The consumer mmap()s the
> +	  device, reads records from the data array using the head/tail
> +	  indices in the control page, and advances data_tail when done.
> +
> +	  For self-instrumentation, use TLOB_IOCTL_TRACE_START /
> +	  TLOB_IOCTL_TRACE_STOP via the /dev/rv misc device (enabled by
> +	  CONFIG_RV_CHARDEV).
> +
> +	  Up to TLOB_MAX_MONITORED tasks may be monitored simultaneously.
> +
> +	  For further information, see:
> +	    Documentation/trace/rv/monitor_tlob.rst
> +
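A quick aside for anyone testing this: the add/remove grammar above is the whole control surface, so it may help to show how a userspace helper would build those lines. A minimal sketch, compiled against nothing kernel-side; the path and offsets are illustrative values, not from this patch:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build the two control strings accepted by the tlob "monitor" file:
 *   add:    "threshold_us:offset_start:offset_stop:binary_path"
 *   remove: "-offset_start:binary_path"
 * Offsets are printed 0x-prefixed; the kernel parser uses %lli, which
 * accepts auto-base (decimal or 0x-hex) input.
 */
static int tlob_fmt_add(char *buf, size_t len, unsigned long long thr_us,
			unsigned long long start, unsigned long long stop,
			const char *binpath)
{
	return snprintf(buf, len, "%llu:0x%llx:0x%llx:%s",
			thr_us, start, stop, binpath);
}

static int tlob_fmt_remove(char *buf, size_t len,
			   unsigned long long start, const char *binpath)
{
	return snprintf(buf, len, "-0x%llx:%s", start, binpath);
}
```

Writing the formatted line to rv/monitors/tlob/monitor with an ordinary write(2) then installs or removes the binding; as the cover letter notes, only tracefs write permission is needed.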
> diff --git a/kernel/trace/rv/monitors/tlob/tlob.c b/kernel/trace/rv/monitors/tlob/tlob.c
> new file mode 100644
> index 000000000..a6e474025
> --- /dev/null
> +++ b/kernel/trace/rv/monitors/tlob/tlob.c
> @@ -0,0 +1,986 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * tlob: task latency over budget monitor
> + *
> + * Track the elapsed wall-clock time of a marked code path and detect when
> + * a monitored task exceeds its per-task latency budget.  CLOCK_MONOTONIC
> + * is used so both on-CPU and off-CPU time count toward the budget.
> + *
> + * Per-task state is maintained in a spinlock-protected hash table.  A
> + * one-shot hrtimer fires at the deadline; if the task has not called
> + * trace_stop by then, a violation is recorded.
> + *
> + * Up to TLOB_MAX_MONITORED tasks may be tracked simultaneously.
> + *
> + * Copyright (C) 2026 Wen Yang <wen.yang@linux.dev>
> + */
> +#include <linux/file.h>
> +#include <linux/fs.h>
> +#include <linux/ftrace.h>
> +#include <linux/hash.h>
> +#include <linux/hrtimer.h>
> +#include <linux/kernel.h>
> +#include <linux/ktime.h>
> +#include <linux/module.h>
> +#include <linux/init.h>
> +#include <linux/namei.h>
> +#include <linux/poll.h>
> +#include <linux/rv.h>
> +#include <linux/sched.h>
> +#include <linux/slab.h>
> +#include <linux/atomic.h>
> +#include <linux/rcupdate.h>
> +#include <linux/spinlock.h>
> +#include <linux/tracefs.h>
> +#include <linux/uaccess.h>
> +#include <linux/uprobes.h>
> +#include <kunit/visibility.h>
> +#include <rv/instrumentation.h>
> +
> +/* rv_interface_lock is defined in kernel/trace/rv/rv.c */
> +extern struct mutex rv_interface_lock;
> +
> +#define MODULE_NAME "tlob"
> +
> +#include <rv_trace.h>
> +#include <trace/events/sched.h>
> +
> +#define RV_MON_TYPE RV_MON_PER_TASK
> +#include "tlob.h"
> +#include <rv/da_monitor.h>
> +
> +/* Hash table size; must be a power of two. */
> +#define TLOB_HTABLE_BITS		6
> +#define TLOB_HTABLE_SIZE		(1 << TLOB_HTABLE_BITS)
> +
> +/* Maximum binary path length for uprobe binding. */
> +#define TLOB_MAX_PATH			256
> +
> +/* Per-task latency monitoring state. */
> +struct tlob_task_state {
> +	struct hlist_node	hlist;
> +	struct task_struct	*task;
> +	u64			threshold_us;
> +	u64			tag;
> +	struct hrtimer		deadline_timer;
> +	int			canceled;	/* set under both locks; read under either */
> +	struct file		*notify_file;	/* NULL or held reference */
> +
> +	/*
> +	 * entry_lock serialises the mutable accounting fields below.
> +	 * Lock order: tlob_table_lock -> entry_lock (never reverse).
> +	 */
> +	raw_spinlock_t		entry_lock;
> +	u64			on_cpu_us;
> +	u64			off_cpu_us;
> +	ktime_t			last_ts;
> +	u32			switches;
> +	u8			da_state;
> +
> +	struct rcu_head		rcu;	/* for call_rcu() teardown */
> +};
> +
> +/* Per-uprobe-binding state: a start + stop probe pair for one binary region. */
> +struct tlob_uprobe_binding {
> +	struct list_head	list;
> +	u64			threshold_us;
> +	struct path		path;
> +	char			binpath[TLOB_MAX_PATH];	/* canonical path for read/remove */
> +	loff_t			offset_start;
> +	loff_t			offset_stop;
> +	struct uprobe_consumer	entry_uc;
> +	struct uprobe_consumer	stop_uc;
> +	struct uprobe		*entry_uprobe;
> +	struct uprobe		*stop_uprobe;
> +};
> +
> +/* Object pool for tlob_task_state. */
> +static struct kmem_cache *tlob_state_cache;
> +
> +/* Hash table and lock protecting table structure (insert/delete/canceled). */
> +static struct hlist_head tlob_htable[TLOB_HTABLE_SIZE];
> +static DEFINE_RAW_SPINLOCK(tlob_table_lock);
> +static atomic_t tlob_num_monitored = ATOMIC_INIT(0);
> +
> +/* Uprobe binding list; protected by tlob_uprobe_mutex. */
> +static LIST_HEAD(tlob_uprobe_list);
> +static DEFINE_MUTEX(tlob_uprobe_mutex);
> +
> +/* Forward declarations */
> +static enum hrtimer_restart tlob_deadline_timer_fn(struct hrtimer *timer);
> +static struct rv_monitor rv_this;
> +
> +/* Hash table helpers */
> +
> +static unsigned int tlob_hash_task(const struct task_struct *task)
> +{
> +	return hash_ptr((void *)task, TLOB_HTABLE_BITS);
> +}
> +
> +/*
> + * tlob_find_rcu - look up per-task state.
> + * Must be called under rcu_read_lock() or with tlob_table_lock held.
> + */
> +static struct tlob_task_state *tlob_find_rcu(struct task_struct *task)
> +{
> +	struct tlob_task_state *ws;
> +	unsigned int h = tlob_hash_task(task);
> +
> +	hlist_for_each_entry_rcu(ws, &tlob_htable[h], hlist,
> +				 lockdep_is_held(&tlob_table_lock))
> +		if (ws->task == task)
> +			return ws;
> +	return NULL;
> +}
> +
> +/* Allocate and initialise a new per-task state entry. */
> +static struct tlob_task_state *tlob_alloc(struct task_struct *task,
> +					  u64 threshold_us, u64 tag)
> +{
> +	struct tlob_task_state *ws;
> +
> +	ws = kmem_cache_zalloc(tlob_state_cache, GFP_ATOMIC);
> +	if (!ws)
> +		return NULL;
> +
> +	ws->task = task;
> +	get_task_struct(task);
> +	ws->threshold_us = threshold_us;
> +	ws->tag = tag;
> +	ws->last_ts = ktime_get();
> +	ws->da_state = on_cpu_tlob;
> +	raw_spin_lock_init(&ws->entry_lock);
> +	hrtimer_setup(&ws->deadline_timer, tlob_deadline_timer_fn,
> +		      CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> +	return ws;
> +}
> +
> +/* RCU callback: free the slab once no readers remain. */
> +static void tlob_free_rcu_slab(struct rcu_head *head)
> +{
> +	struct tlob_task_state *ws =
> +		container_of(head, struct tlob_task_state, rcu);
> +	kmem_cache_free(tlob_state_cache, ws);
> +}
> +
> +/* Arm the one-shot deadline timer for threshold_us microseconds. */
> +static void tlob_arm_deadline(struct tlob_task_state *ws)
> +{
> +	hrtimer_start(&ws->deadline_timer,
> +		      ns_to_ktime(ws->threshold_us * NSEC_PER_USEC),
> +		      HRTIMER_MODE_REL);
> +}
> +
> +/*
> + * Push a violation record into a monitor fd's ring buffer (softirq context).
> + * Drop-new policy: discard incoming record when full.  smp_store_release on
> + * data_head pairs with smp_load_acquire in the consumer.
> + */
> +static void tlob_event_push(struct rv_file_priv *priv,
> +			    const struct tlob_event *info)
> +{
> +	struct tlob_ring *ring = &priv->ring;
> +	unsigned long flags;
> +	u32 head, tail;
> +
> +	spin_lock_irqsave(&ring->lock, flags);
> +
> +	head = ring->page->data_head;
> +	tail = READ_ONCE(ring->page->data_tail);
> +
> +	if (head - tail > ring->mask) {
> +		/* Ring full: drop incoming record. */
> +		ring->page->dropped++;
> +		spin_unlock_irqrestore(&ring->lock, flags);
> +		return;
> +	}
> +
> +	ring->data[head & ring->mask] = *info;
> +	/* pairs with smp_load_acquire() in the consumer */
> +	smp_store_release(&ring->page->data_head, head + 1);
> +
> +	spin_unlock_irqrestore(&ring->lock, flags);
> +
> +	wake_up_interruptible_poll(&priv->waitq, EPOLLIN | EPOLLRDNORM);
> +}
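The release store on data_head here pairs with an acquire load on the consumer side. A userspace-testable sketch of that consumer half, using GCC/Clang atomic builtins; the struct names mirror the patch, but the field layout below is a mock for illustration, not the UAPI:

```c
#include <assert.h>
#include <stdint.h>

/* Mock of the shared control page and record array; layout illustrative. */
struct mock_page { uint32_t data_head, data_tail, dropped; };

struct mock_ring {
	struct mock_page *page;
	uint64_t *data;		/* stand-in for struct tlob_event[] */
	uint32_t mask;		/* capacity - 1, capacity a power of two */
};

/* Consumer: drain all published records, then publish the new data_tail. */
static unsigned int ring_drain(struct mock_ring *r, uint64_t *out,
			       unsigned int max)
{
	/* pairs with the producer's release store on data_head */
	uint32_t head = __atomic_load_n(&r->page->data_head, __ATOMIC_ACQUIRE);
	uint32_t tail = r->page->data_tail;
	unsigned int n = 0;

	while (tail != head && n < max)
		out[n++] = r->data[tail++ & r->mask];

	/* release so the producer's full-check sees the consumed slots */
	__atomic_store_n(&r->page->data_tail, tail, __ATOMIC_RELEASE);
	return n;
}
```

Note the indices are free-running and only masked at array access, so the producer's `head - tail > mask` full-check stays correct across u32 wraparound.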
> +
> +#if IS_ENABLED(CONFIG_KUNIT)
> +void tlob_event_push_kunit(struct rv_file_priv *priv,
> +			  const struct tlob_event *info)
> +{
> +	tlob_event_push(priv, info);
> +}
> +EXPORT_SYMBOL_IF_KUNIT(tlob_event_push_kunit);
> +#endif /* CONFIG_KUNIT */
> +
> +/*
> + * Budget exceeded: remove the entry, record the violation, and inject
> + * budget_expired into the DA.
> + *
> + * Lock order: tlob_table_lock -> entry_lock.  tlob_stop_task() sets
> + * ws->canceled under both locks; if we see it here, the stop path owns
> + * cleanup.
> + * fput/put_task_struct are done before call_rcu(); the RCU callback only
> + * reclaims the slab.
> + */
> +static enum hrtimer_restart tlob_deadline_timer_fn(struct hrtimer *timer)
> +{
> +	struct tlob_task_state *ws =
> +		container_of(timer, struct tlob_task_state, deadline_timer);
> +	struct tlob_event info = {};
> +	struct file *notify_file;
> +	struct task_struct *task;
> +	unsigned long flags;
> +	/* snapshots taken under entry_lock */
> +	u64 on_cpu_us, off_cpu_us, threshold_us, tag;
> +	u32 switches;
> +	bool on_cpu;
> +	bool push_event = false;
> +
> +	raw_spin_lock_irqsave(&tlob_table_lock, flags);
> +	/* stop path sets canceled under both locks; if set, it owns cleanup */
> +	if (ws->canceled) {
> +		raw_spin_unlock_irqrestore(&tlob_table_lock, flags);
> +		return HRTIMER_NORESTART;
> +	}
> +
> +	/* Finalize accounting and snapshot all fields under entry_lock. */
> +	raw_spin_lock(&ws->entry_lock);
> +
> +	{
> +		ktime_t now = ktime_get();
> +		u64 delta_us = ktime_to_us(ktime_sub(now, ws->last_ts));
> +
> +		if (ws->da_state == on_cpu_tlob)
> +			ws->on_cpu_us += delta_us;
> +		else
> +			ws->off_cpu_us += delta_us;
> +	}
> +
> +	ws->canceled  = 1;
> +	on_cpu_us     = ws->on_cpu_us;
> +	off_cpu_us    = ws->off_cpu_us;
> +	threshold_us  = ws->threshold_us;
> +	tag           = ws->tag;
> +	switches      = ws->switches;
> +	on_cpu        = (ws->da_state == on_cpu_tlob);
> +	notify_file   = ws->notify_file;
> +	if (notify_file) {
> +		info.tid          = task_pid_vnr(ws->task);
> +		info.threshold_us = threshold_us;
> +		info.on_cpu_us    = on_cpu_us;
> +		info.off_cpu_us   = off_cpu_us;
> +		info.switches     = switches;
> +		info.state        = on_cpu ? 1 : 0;
> +		info.tag          = tag;
> +		push_event        = true;
> +	}
> +
> +	raw_spin_unlock(&ws->entry_lock);
> +
> +	hlist_del_rcu(&ws->hlist);
> +	atomic_dec(&tlob_num_monitored);
> +	/*
> +	 * Hold a reference so task remains valid across da_handle_event()
> +	 * after we drop tlob_table_lock.
> +	 */
> +	task = ws->task;
> +	get_task_struct(task);
> +	raw_spin_unlock_irqrestore(&tlob_table_lock, flags);
> +
> +	/*
> +	 * Both locks are now released; ws is exclusively owned (removed from
> +	 * the hash table with canceled=1).  Emit the tracepoint and push the
> +	 * violation record.
> +	 */
> +	trace_tlob_budget_exceeded(ws->task, threshold_us, on_cpu_us,
> +				   off_cpu_us, switches, on_cpu, tag);
> +
> +	if (push_event) {
> +		struct rv_file_priv *priv = notify_file->private_data;
> +
> +		if (priv)
> +			tlob_event_push(priv, &info);
> +	}
> +
> +	da_handle_event(task, budget_expired_tlob);
> +
> +	if (notify_file)
> +		fput(notify_file);	/* ref from fget() at TRACE_START */
> +	put_task_struct(ws->task);	/* ref from tlob_alloc() */
> +	put_task_struct(task);		/* extra ref from get_task_struct() above */
> +	call_rcu(&ws->rcu, tlob_free_rcu_slab);
> +	return HRTIMER_NORESTART;
> +}
> +
> +/* Tracepoint handlers */
> +
> +/*
> + * handle_sched_switch - advance the DA and accumulate on/off-CPU time.
> + *
> + * RCU read-side for lock-free lookup; entry_lock for per-task accounting.
> + * da_handle_event() is called after rcu_read_unlock() to avoid holding the
> + * read-side critical section across the RV framework.
> + */
> +static void handle_sched_switch(void *data, bool preempt,
> +				struct task_struct *prev,
> +				struct task_struct *next,
> +				unsigned int prev_state)
> +{
> +	struct tlob_task_state *ws;
> +	unsigned long flags;
> +	bool do_prev = false, do_next = false;
> +	ktime_t now;
> +
> +	rcu_read_lock();
> +
> +	ws = tlob_find_rcu(prev);
> +	if (ws) {
> +		raw_spin_lock_irqsave(&ws->entry_lock, flags);
> +		if (!ws->canceled) {
> +			now = ktime_get();
> +			ws->on_cpu_us += ktime_to_us(ktime_sub(now, ws->last_ts));
> +			ws->last_ts = now;
> +			ws->switches++;
> +			ws->da_state = off_cpu_tlob;
> +			do_prev = true;
> +		}
> +		raw_spin_unlock_irqrestore(&ws->entry_lock, flags);
> +	}
> +
> +	ws = tlob_find_rcu(next);
> +	if (ws) {
> +		raw_spin_lock_irqsave(&ws->entry_lock, flags);
> +		if (!ws->canceled) {
> +			now = ktime_get();
> +			ws->off_cpu_us += ktime_to_us(ktime_sub(now, ws->last_ts));
> +			ws->last_ts = now;
> +			ws->da_state = on_cpu_tlob;
> +			do_next = true;
> +		}
> +		raw_spin_unlock_irqrestore(&ws->entry_lock, flags);
> +	}
> +
> +	rcu_read_unlock();
> +
> +	if (do_prev)
> +		da_handle_event(prev, switch_out_tlob);
> +	if (do_next)
> +		da_handle_event(next, switch_in_tlob);
> +}
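The single last_ts plus DA-state check above means each elapsed span is charged to exactly one bucket. That invariant is easy to model in userspace; timestamps below are microseconds and the values are illustrative:

```c
#include <assert.h>
#include <stdint.h>

/* Minimal model of the per-task accounting in handle_sched_switch(). */
struct acct {
	uint64_t on_cpu_us, off_cpu_us, last_ts;
	int on_cpu;		/* 1: on_cpu state, 0: off_cpu state */
};

/* Charge [last_ts, now) to the current side, then flip on a switch. */
static void acct_switch(struct acct *a, uint64_t now)
{
	if (a->on_cpu)
		a->on_cpu_us += now - a->last_ts;
	else
		a->off_cpu_us += now - a->last_ts;
	a->last_ts = now;
	a->on_cpu = !a->on_cpu;
}
```

Because every span is charged exactly once, on_cpu_us + off_cpu_us always equals now - start, i.e. the wall-clock total the budget is checked against.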
> +
> +static void handle_sched_wakeup(void *data, struct task_struct *p)
> +{
> +	struct tlob_task_state *ws;
> +	unsigned long flags;
> +	bool found = false;
> +
> +	rcu_read_lock();
> +	ws = tlob_find_rcu(p);
> +	if (ws) {
> +		raw_spin_lock_irqsave(&ws->entry_lock, flags);
> +		found = !ws->canceled;
> +		raw_spin_unlock_irqrestore(&ws->entry_lock, flags);
> +	}
> +	rcu_read_unlock();
> +
> +	if (found)
> +		da_handle_event(p, sched_wakeup_tlob);
> +}
> +
> +/* -----------------------------------------------------------------------
> + * Core start/stop helpers (also called from rv_dev.c)
> + * -----------------------------------------------------------------------
> + */
> +
> +/*
> + * __tlob_insert - insert @ws into the hash table and arm its deadline timer.
> + *
> + * Re-checks for duplicates and capacity under tlob_table_lock; the caller
> + * may have done a lock-free pre-check before allocating @ws.  On failure @ws
> + * is freed directly (never in table, so no call_rcu needed); any
> + * @notify_file reference stays with the caller.
> + */
> +static int __tlob_insert(struct task_struct *task, struct tlob_task_state *ws)
> +{
> +	unsigned int h;
> +	unsigned long flags;
> +
> +	raw_spin_lock_irqsave(&tlob_table_lock, flags);
> +	if (tlob_find_rcu(task)) {
> +		raw_spin_unlock_irqrestore(&tlob_table_lock, flags);
> +		put_task_struct(ws->task);
> +		kmem_cache_free(tlob_state_cache, ws);
> +		return -EEXIST;
> +	}
> +	if (atomic_read(&tlob_num_monitored) >= TLOB_MAX_MONITORED) {
> +		raw_spin_unlock_irqrestore(&tlob_table_lock, flags);
> +		put_task_struct(ws->task);
> +		kmem_cache_free(tlob_state_cache, ws);
> +		return -ENOSPC;
> +	}
> +	h = tlob_hash_task(task);
> +	hlist_add_head_rcu(&ws->hlist, &tlob_htable[h]);
> +	atomic_inc(&tlob_num_monitored);
> +	raw_spin_unlock_irqrestore(&tlob_table_lock, flags);
> +
> +	da_handle_start_run_event(task, trace_start_tlob);
> +	tlob_arm_deadline(ws);
> +	return 0;
> +}
> +
> +/**
> + * tlob_start_task - begin monitoring @task with latency budget @threshold_us.
> + *
> + * @notify_file: /dev/rv fd whose ring buffer receives a tlob_event on
> + *               violation; caller transfers the fget() reference to tlob.c.
> + *               Pass NULL for synchronous mode (violations only via
> + *               TRACE_STOP return value and the tlob_budget_exceeded event).
> + *
> + * Returns 0, -ENODEV, -ERANGE, -EEXIST, -ENOSPC, or -ENOMEM.  On failure
> + * the caller retains responsibility for any @notify_file reference.
> + */
> +int tlob_start_task(struct task_struct *task, u64 threshold_us,
> +		    struct file *notify_file, u64 tag)
> +{
> +	struct tlob_task_state *ws;
> +	unsigned long flags;
> +
> +	if (!tlob_state_cache)
> +		return -ENODEV;
> +
> +	if (threshold_us > (u64)KTIME_MAX / NSEC_PER_USEC)
> +		return -ERANGE;
> +
> +	/* Quick pre-check before allocation. */
> +	raw_spin_lock_irqsave(&tlob_table_lock, flags);
> +	if (tlob_find_rcu(task)) {
> +		raw_spin_unlock_irqrestore(&tlob_table_lock, flags);
> +		return -EEXIST;
> +	}
> +	if (atomic_read(&tlob_num_monitored) >= TLOB_MAX_MONITORED) {
> +		raw_spin_unlock_irqrestore(&tlob_table_lock, flags);
> +		return -ENOSPC;
> +	}
> +	raw_spin_unlock_irqrestore(&tlob_table_lock, flags);
> +
> +	ws = tlob_alloc(task, threshold_us, tag);
> +	if (!ws)
> +		return -ENOMEM;
> +
> +	ws->notify_file = notify_file;
> +	return __tlob_insert(task, ws);
> +}
> +EXPORT_SYMBOL_GPL(tlob_start_task);
> +
> +/**
> + * tlob_stop_task - stop monitoring @task before the deadline fires.
> + *
> + * Sets canceled under entry_lock (inside tlob_table_lock) before calling
> + * hrtimer_cancel(), racing safely with the timer callback.
> + *
> + * Returns 0 if within budget, -ESRCH if the entry is gone (deadline already
> + * fired, or TRACE_START was never called).
> + */
> +int tlob_stop_task(struct task_struct *task)
> +{
> +	struct tlob_task_state *ws;
> +	struct file *notify_file;
> +	unsigned long flags;
> +
> +	raw_spin_lock_irqsave(&tlob_table_lock, flags);
> +	ws = tlob_find_rcu(task);
> +	if (!ws) {
> +		raw_spin_unlock_irqrestore(&tlob_table_lock, flags);
> +		return -ESRCH;
> +	}
> +
> +	/* Prevent handle_sched_switch from updating accounting after removal. */
> +	raw_spin_lock(&ws->entry_lock);
> +	ws->canceled = 1;
> +	raw_spin_unlock(&ws->entry_lock);
> +
> +	hlist_del_rcu(&ws->hlist);
> +	atomic_dec(&tlob_num_monitored);
> +	raw_spin_unlock_irqrestore(&tlob_table_lock, flags);
> +
> +	hrtimer_cancel(&ws->deadline_timer);
> +
> +	da_handle_event(task, trace_stop_tlob);
> +
> +	notify_file = ws->notify_file;
> +	if (notify_file)
> +		fput(notify_file);
> +	put_task_struct(ws->task);
> +	call_rcu(&ws->rcu, tlob_free_rcu_slab);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(tlob_stop_task);
> +
> +/* Stop monitoring all tracked tasks; called on monitor disable. */
> +static void tlob_stop_all(void)
> +{
> +	struct tlob_task_state *batch[TLOB_MAX_MONITORED];
> +	struct tlob_task_state *ws;
> +	struct hlist_node *tmp;
> +	unsigned long flags;
> +	int n = 0, i;
> +
> +	raw_spin_lock_irqsave(&tlob_table_lock, flags);
> +	for (i = 0; i < TLOB_HTABLE_SIZE; i++) {
> +		hlist_for_each_entry_safe(ws, tmp, &tlob_htable[i], hlist) {
> +			raw_spin_lock(&ws->entry_lock);
> +			ws->canceled = 1;
> +			raw_spin_unlock(&ws->entry_lock);
> +			hlist_del_rcu(&ws->hlist);
> +			atomic_dec(&tlob_num_monitored);
> +			if (n < TLOB_MAX_MONITORED)
> +				batch[n++] = ws;
> +		}
> +	}
> +	raw_spin_unlock_irqrestore(&tlob_table_lock, flags);
> +
> +	for (i = 0; i < n; i++) {
> +		ws = batch[i];
> +		hrtimer_cancel(&ws->deadline_timer);
> +		da_handle_event(ws->task, trace_stop_tlob);
> +		if (ws->notify_file)
> +			fput(ws->notify_file);
> +		put_task_struct(ws->task);
> +		call_rcu(&ws->rcu, tlob_free_rcu_slab);
> +	}
> +}
> +
> +/* uprobe binding helpers */
> +
> +static int tlob_uprobe_entry_handler(struct uprobe_consumer *uc,
> +				     struct pt_regs *regs, __u64 *data)
> +{
> +	struct tlob_uprobe_binding *b =
> +		container_of(uc, struct tlob_uprobe_binding, entry_uc);
> +
> +	tlob_start_task(current, b->threshold_us, NULL, (u64)b->offset_start);
> +	return 0;
> +}
> +
> +static int tlob_uprobe_stop_handler(struct uprobe_consumer *uc,
> +				    struct pt_regs *regs, __u64 *data)
> +{
> +	tlob_stop_task(current);
> +	return 0;
> +}
> +
> +/*
> + * Register start + stop entry uprobes for a binding.
> + * Both are plain entry uprobes (no uretprobe), so a wrong offset never
> + * corrupts the call stack; the worst outcome is a missed stop (hrtimer
> + * fires and reports a budget violation).
> + * Called with tlob_uprobe_mutex held.
> + */
> +static int tlob_add_uprobe(u64 threshold_us, const char *binpath,
> +			   loff_t offset_start, loff_t offset_stop)
> +{
> +	struct tlob_uprobe_binding *b, *tmp_b;
> +	char pathbuf[TLOB_MAX_PATH];
> +	struct inode *inode;
> +	char *canon;
> +	int ret;
> +
> +	b = kzalloc(sizeof(*b), GFP_KERNEL);
> +	if (!b)
> +		return -ENOMEM;
> +
> +	if (binpath[0] != '/') {
> +		kfree(b);
> +		return -EINVAL;
> +	}
> +
> +	b->threshold_us = threshold_us;
> +	b->offset_start = offset_start;
> +	b->offset_stop  = offset_stop;
> +
> +	ret = kern_path(binpath, LOOKUP_FOLLOW, &b->path);
> +	if (ret)
> +		goto err_free;
> +
> +	if (!d_is_reg(b->path.dentry)) {
> +		ret = -EINVAL;
> +		goto err_path;
> +	}
> +
> +	/* Reject duplicate start offset for the same binary. */
> +	list_for_each_entry(tmp_b, &tlob_uprobe_list, list) {
> +		if (tmp_b->offset_start == offset_start &&
> +		    tmp_b->path.dentry == b->path.dentry) {
> +			ret = -EEXIST;
> +			goto err_path;
> +		}
> +	}
> +
> +	/* Store canonical path for read-back and removal matching. */
> +	canon = d_path(&b->path, pathbuf, sizeof(pathbuf));
> +	if (IS_ERR(canon)) {
> +		ret = PTR_ERR(canon);
> +		goto err_path;
> +	}
> +	strscpy(b->binpath, canon, sizeof(b->binpath));
> +
> +	b->entry_uc.handler = tlob_uprobe_entry_handler;
> +	b->stop_uc.handler  = tlob_uprobe_stop_handler;
> +
> +	inode = d_real_inode(b->path.dentry);
> +
> +	b->entry_uprobe = uprobe_register(inode, offset_start, 0, &b->entry_uc);
> +	if (IS_ERR(b->entry_uprobe)) {
> +		ret = PTR_ERR(b->entry_uprobe);
> +		b->entry_uprobe = NULL;
> +		goto err_path;
> +	}
> +
> +	b->stop_uprobe = uprobe_register(inode, offset_stop, 0, &b->stop_uc);
> +	if (IS_ERR(b->stop_uprobe)) {
> +		ret = PTR_ERR(b->stop_uprobe);
> +		b->stop_uprobe = NULL;
> +		goto err_entry;
> +	}
> +
> +	list_add_tail(&b->list, &tlob_uprobe_list);
> +	return 0;
> +
> +err_entry:
> +	uprobe_unregister_nosync(b->entry_uprobe, &b->entry_uc);
> +	uprobe_unregister_sync();
> +err_path:
> +	path_put(&b->path);
> +err_free:
> +	kfree(b);
> +	return ret;
> +}
> +
> +/*
> + * Remove the uprobe binding for (offset_start, binpath).
> + * binpath is resolved to a dentry for comparison so symlinks are handled
> + * correctly.  Called with tlob_uprobe_mutex held.
> + */
> +static void tlob_remove_uprobe_by_key(loff_t offset_start, const char *binpath)
> +{
> +	struct tlob_uprobe_binding *b, *tmp;
> +	struct path remove_path;
> +
> +	if (kern_path(binpath, LOOKUP_FOLLOW, &remove_path))
> +		return;
> +
> +	list_for_each_entry_safe(b, tmp, &tlob_uprobe_list, list) {
> +		if (b->offset_start != offset_start)
> +			continue;
> +		if (b->path.dentry != remove_path.dentry)
> +			continue;
> +		uprobe_unregister_nosync(b->entry_uprobe, &b->entry_uc);
> +		uprobe_unregister_nosync(b->stop_uprobe,  &b->stop_uc);
> +		list_del(&b->list);
> +		uprobe_unregister_sync();
> +		path_put(&b->path);
> +		kfree(b);
> +		break;
> +	}
> +
> +	path_put(&remove_path);
> +}
> +
> +/* Unregister all uprobe bindings; called from disable_tlob(). */
> +static void tlob_remove_all_uprobes(void)
> +{
> +	struct tlob_uprobe_binding *b, *tmp;
> +
> +	mutex_lock(&tlob_uprobe_mutex);
> +	list_for_each_entry_safe(b, tmp, &tlob_uprobe_list, list) {
> +		uprobe_unregister_nosync(b->entry_uprobe, &b->entry_uc);
> +		uprobe_unregister_nosync(b->stop_uprobe,  &b->stop_uc);
> +		list_del(&b->list);
> +		path_put(&b->path);
> +		kfree(b);
> +	}
> +	mutex_unlock(&tlob_uprobe_mutex);
> +	uprobe_unregister_sync();
> +}
> +
> +/*
> + * tracefs "monitor" file
> + *
> + * Read:  one "threshold_us:0xoffset_start:0xoffset_stop:binary_path\n"
> + *        line per registered uprobe binding.
> + * Write: "threshold_us:offset_start:offset_stop:binary_path" - add binding
> + *        "-offset_start:binary_path"                         - remove binding
> + */
> +
> +static ssize_t tlob_monitor_read(struct file *file,
> +				 char __user *ubuf,
> +				 size_t count, loff_t *ppos)
> +{
> +	/* threshold(20) + 2 offsets(2*18) + path(256) + delimiters */
> +	const int line_sz = TLOB_MAX_PATH + 72;
> +	struct tlob_uprobe_binding *b;
> +	char *buf, *p;
> +	int n = 0, buf_sz, pos = 0;
> +	ssize_t ret;
> +
> +	mutex_lock(&tlob_uprobe_mutex);
> +	list_for_each_entry(b, &tlob_uprobe_list, list)
> +		n++;
> +	mutex_unlock(&tlob_uprobe_mutex);
> +
> +	buf_sz = (n ? n : 1) * line_sz + 1;
> +	buf = kmalloc(buf_sz, GFP_KERNEL);
> +	if (!buf)
> +		return -ENOMEM;
> +
> +	mutex_lock(&tlob_uprobe_mutex);
> +	list_for_each_entry(b, &tlob_uprobe_list, list) {
> +		p = b->binpath;
> +		pos += scnprintf(buf + pos, buf_sz - pos,
> +				 "%llu:0x%llx:0x%llx:%s\n",
> +				 b->threshold_us,
> +				 (unsigned long long)b->offset_start,
> +				 (unsigned long long)b->offset_stop,
> +				 p);
> +	}
> +	mutex_unlock(&tlob_uprobe_mutex);
> +
> +	ret = simple_read_from_buffer(ubuf, count, ppos, buf, pos);
> +	kfree(buf);
> +	return ret;
> +}
> +
> +/*
> + * Parse "threshold_us:offset_start:offset_stop:binary_path".
> + * binary_path comes last so it may freely contain ':'.
> + * Returns 0 on success.
> + */
> +VISIBLE_IF_KUNIT int tlob_parse_uprobe_line(char *buf, u64 *thr_out,
> +					    char **path_out,
> +					    loff_t *start_out, loff_t *stop_out)
> +{
> +	unsigned long long thr;
> +	long long start, stop;
> +	int n = 0;
> +
> +	/*
> +	 * %llu : decimal-only (microseconds)
> +	 * %lli : auto-base, accepts 0x-prefixed hex for offsets
> +	 * %n   : records the byte offset of the first path character
> +	 */
> +	if (sscanf(buf, "%llu:%lli:%lli:%n", &thr, &start, &stop, &n) != 3)
> +		return -EINVAL;
> +	if (thr == 0 || n == 0 || buf[n] == '\0')
> +		return -EINVAL;
> +	if (start < 0 || stop < 0)
> +		return -EINVAL;
> +
> +	*thr_out   = thr;
> +	*start_out = start;
> +	*stop_out  = stop;
> +	*path_out  = buf + n;
> +	return 0;
> +}
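tlob_parse_uprobe_line() leans on three scanf subtleties at once (%llu decimal-only, %lli auto-base, %n capturing where the path starts so the path itself may contain ':'). A userspace replica for experimenting with inputs; semantics are meant to match the kernel parser, but this copy is for illustration only:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Userspace copy of the tlob line parser; returns 0 on success, -1 on error. */
static int parse_line(const char *buf, unsigned long long *thr,
		      long long *start, long long *stop, const char **path)
{
	int n = 0;

	/* %n records the byte offset of the first path character */
	if (sscanf(buf, "%llu:%lli:%lli:%n", thr, start, stop, &n) != 3)
		return -1;
	if (*thr == 0 || n == 0 || buf[n] == '\0')
		return -1;
	if (*start < 0 || *stop < 0)
		return -1;
	*path = buf + n;
	return 0;
}
```

Note that a missing path is caught two ways: with no trailing ':' the literal fails to match and n stays 0; with a trailing ':' but nothing after it, buf[n] is the terminating NUL.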
> +
> +static ssize_t tlob_monitor_write(struct file *file,
> +				  const char __user *ubuf,
> +				  size_t count, loff_t *ppos)
> +{
> +	char buf[TLOB_MAX_PATH + 64];
> +	loff_t offset_start, offset_stop;
> +	u64 threshold_us;
> +	char *binpath;
> +	int ret;
> +
> +	if (count >= sizeof(buf))
> +		return -EINVAL;
> +	if (copy_from_user(buf, ubuf, count))
> +		return -EFAULT;
> +	buf[count] = '\0';
> +
> +	if (count > 0 && buf[count - 1] == '\n')
> +		buf[count - 1] = '\0';
> +
> +	/* Remove request: "-offset_start:binary_path" */
> +	if (buf[0] == '-') {
> +		long long off;
> +		int n = 0;
> +
> +		if (sscanf(buf + 1, "%lli:%n", &off, &n) != 1 || n == 0)
> +			return -EINVAL;
> +		binpath = buf + 1 + n;
> +		if (binpath[0] != '/')
> +			return -EINVAL;
> +
> +		mutex_lock(&tlob_uprobe_mutex);
> +		tlob_remove_uprobe_by_key((loff_t)off, binpath);
> +		mutex_unlock(&tlob_uprobe_mutex);
> +
> +		return (ssize_t)count;
> +	}
> +
> +	/*
> +	 * Uprobe binding: "threshold_us:offset_start:offset_stop:binary_path"
> +	 * binpath points into buf at the start of the path field.
> +	 */
> +	ret = tlob_parse_uprobe_line(buf, &threshold_us,
> +				     &binpath, &offset_start, &offset_stop);
> +	if (ret)
> +		return ret;
> +
> +	mutex_lock(&tlob_uprobe_mutex);
> +	ret = tlob_add_uprobe(threshold_us, binpath, offset_start, offset_stop);
> +	mutex_unlock(&tlob_uprobe_mutex);
> +	return ret ? ret : (ssize_t)count;
> +}
> +
> +static const struct file_operations tlob_monitor_fops = {
> +	.open	= simple_open,
> +	.read	= tlob_monitor_read,
> +	.write	= tlob_monitor_write,
> +	.llseek	= noop_llseek,
> +};
> +
> +/*
> + * __tlob_init_monitor / __tlob_destroy_monitor - called with
> + * rv_interface_lock held (required by da_monitor_init/destroy via
> + * rv_get/put_task_monitor_slot).
> + */
> +static int __tlob_init_monitor(void)
> +{
> +	int i, retval;
> +
> +	tlob_state_cache = kmem_cache_create("tlob_task_state",
> +					     sizeof(struct tlob_task_state),
> +					     0, 0, NULL);
> +	if (!tlob_state_cache)
> +		return -ENOMEM;
> +
> +	for (i = 0; i < TLOB_HTABLE_SIZE; i++)
> +		INIT_HLIST_HEAD(&tlob_htable[i]);
> +	atomic_set(&tlob_num_monitored, 0);
> +
> +	retval = da_monitor_init();
> +	if (retval) {
> +		kmem_cache_destroy(tlob_state_cache);
> +		tlob_state_cache = NULL;
> +		return retval;
> +	}
> +
> +	rv_this.enabled = 1;
> +	return 0;
> +}
> +
> +static void __tlob_destroy_monitor(void)
> +{
> +	rv_this.enabled = 0;
> +	tlob_stop_all();
> +	tlob_remove_all_uprobes();
> +	/*
> +	 * Drain pending call_rcu() callbacks from tlob_stop_all() before
> +	 * destroying the kmem_cache.
> +	 */
> +	synchronize_rcu();
> +	da_monitor_destroy();
> +	kmem_cache_destroy(tlob_state_cache);
> +	tlob_state_cache = NULL;
> +}
> +
> +/*
> + * tlob_init_monitor / tlob_destroy_monitor - KUnit wrappers that acquire
> + * rv_interface_lock, satisfying the lockdep_assert_held() inside
> + * rv_get/put_task_monitor_slot().
> + */
> +VISIBLE_IF_KUNIT int tlob_init_monitor(void)
> +{
> +	int ret;
> +
> +	mutex_lock(&rv_interface_lock);
> +	ret = __tlob_init_monitor();
> +	mutex_unlock(&rv_interface_lock);
> +	return ret;
> +}
> +EXPORT_SYMBOL_IF_KUNIT(tlob_init_monitor);
> +
> +VISIBLE_IF_KUNIT void tlob_destroy_monitor(void)
> +{
> +	mutex_lock(&rv_interface_lock);
> +	__tlob_destroy_monitor();
> +	mutex_unlock(&rv_interface_lock);
> +}
> +EXPORT_SYMBOL_IF_KUNIT(tlob_destroy_monitor);
> +
> +VISIBLE_IF_KUNIT int tlob_enable_hooks(void)
> +{
> +	rv_attach_trace_probe("tlob", sched_switch, handle_sched_switch);
> +	rv_attach_trace_probe("tlob", sched_wakeup, handle_sched_wakeup);
> +	return 0;
> +}
> +EXPORT_SYMBOL_IF_KUNIT(tlob_enable_hooks);
> +
> +VISIBLE_IF_KUNIT void tlob_disable_hooks(void)
> +{
> +	rv_detach_trace_probe("tlob", sched_switch, handle_sched_switch);
> +	rv_detach_trace_probe("tlob", sched_wakeup, handle_sched_wakeup);
> +}
> +EXPORT_SYMBOL_IF_KUNIT(tlob_disable_hooks);
> +
> +/*
> + * enable_tlob / disable_tlob - called by rv_enable/disable_monitor() which
> + * already holds rv_interface_lock; call the __ variants directly.
> + */
> +static int enable_tlob(void)
> +{
> +	int retval;
> +
> +	retval = __tlob_init_monitor();
> +	if (retval)
> +		return retval;
> +
> +	return tlob_enable_hooks();
> +}
> +
> +static void disable_tlob(void)
> +{
> +	tlob_disable_hooks();
> +	__tlob_destroy_monitor();
> +}
> +
> +static struct rv_monitor rv_this = {
> +	.name		= "tlob",
> +	.description	= "Per-task latency-over-budget monitor.",
> +	.enable		= enable_tlob,
> +	.disable	= disable_tlob,
> +	.reset		= da_monitor_reset_all,
> +	.enabled	= 0,
> +};
> +
> +static int __init register_tlob(void)
> +{
> +	int ret;
> +
> +	ret = rv_register_monitor(&rv_this, NULL);
> +	if (ret)
> +		return ret;
> +
> +	if (rv_this.root_d) {
> +		tracefs_create_file("monitor", 0644, rv_this.root_d, NULL,
> +				    &tlob_monitor_fops);
> +	}
> +
> +	return 0;
> +}
> +
> +static void __exit unregister_tlob(void)
> +{
> +	rv_unregister_monitor(&rv_this);
> +}
> +
> +module_init(register_tlob);
> +module_exit(unregister_tlob);
> +
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR("Wen Yang <wen.yang@linux.dev>");
> +MODULE_DESCRIPTION("tlob: task latency over budget per-task monitor.");
> diff --git a/kernel/trace/rv/monitors/tlob/tlob.h b/kernel/trace/rv/monitors/tlob/tlob.h
> new file mode 100644
> index 000000000..3438a6175
> --- /dev/null
> +++ b/kernel/trace/rv/monitors/tlob/tlob.h
> @@ -0,0 +1,145 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _RV_TLOB_H
> +#define _RV_TLOB_H
> +
> +/*
> + * C representation of the tlob automaton, generated from tlob.dot via rvgen
> + * and extended with tlob_start_task()/tlob_stop_task() declarations.
> + * For the format description see
> + * Documentation/trace/rv/deterministic_automata.rst.
> + */
> +
> +#include <linux/rv.h>
> +#include <uapi/linux/rv.h>
> +
> +#define MONITOR_NAME tlob
> +
> +enum states_tlob {
> +	unmonitored_tlob,
> +	on_cpu_tlob,
> +	off_cpu_tlob,
> +	state_max_tlob,
> +};
> +
> +#define INVALID_STATE state_max_tlob
> +
> +enum events_tlob {
> +	trace_start_tlob,
> +	switch_in_tlob,
> +	switch_out_tlob,
> +	sched_wakeup_tlob,
> +	trace_stop_tlob,
> +	budget_expired_tlob,
> +	event_max_tlob,
> +};
> +
> +struct automaton_tlob {
> +	char *state_names[state_max_tlob];
> +	char *event_names[event_max_tlob];
> +	unsigned char function[state_max_tlob][event_max_tlob];
> +	unsigned char initial_state;
> +	bool final_states[state_max_tlob];
> +};
> +
> +static const struct automaton_tlob automaton_tlob = {
> +	.state_names = {
> +		"unmonitored",
> +		"on_cpu",
> +		"off_cpu",
> +	},
> +	.event_names = {
> +		"trace_start",
> +		"switch_in",
> +		"switch_out",
> +		"sched_wakeup",
> +		"trace_stop",
> +		"budget_expired",
> +	},
> +	.function = {
> +		/* unmonitored */
> +		{
> +			on_cpu_tlob,		/* trace_start    */
> +			unmonitored_tlob,	/* switch_in      */
> +			unmonitored_tlob,	/* switch_out     */
> +			unmonitored_tlob,	/* sched_wakeup   */
> +			INVALID_STATE,		/* trace_stop     */
> +			INVALID_STATE,		/* budget_expired */
> +		},
> +		/* on_cpu */
> +		{
> +			INVALID_STATE,		/* trace_start    */
> +			INVALID_STATE,		/* switch_in      */
> +			off_cpu_tlob,		/* switch_out     */
> +			on_cpu_tlob,		/* sched_wakeup   */
> +			unmonitored_tlob,	/* trace_stop     */
> +			unmonitored_tlob,	/* budget_expired */
> +		},
> +		/* off_cpu */
> +		{
> +			INVALID_STATE,		/* trace_start    */
> +			on_cpu_tlob,		/* switch_in      */
> +			off_cpu_tlob,		/* switch_out     */
> +			off_cpu_tlob,		/* sched_wakeup   */
> +			unmonitored_tlob,	/* trace_stop     */
> +			unmonitored_tlob,	/* budget_expired */
> +		},
> +	},
> +	/*
> +	 * final_states: unmonitored is the sole accepting state.
> +	 * Violations are recorded via ntf_push and tlob_budget_exceeded.
> +	 */
> +	.initial_state = unmonitored_tlob,
> +	.final_states = { 1, 0, 0 },
> +};
> +
> +/* Exported for use by the RV ioctl layer (rv_dev.c) */
> +int tlob_start_task(struct task_struct *task, u64 threshold_us,
> +		    struct file *notify_file, u64 tag);
> +int tlob_stop_task(struct task_struct *task);
> +
> +/* Maximum number of concurrently monitored tasks (also used by KUnit). */
> +#define TLOB_MAX_MONITORED	64U
> +
> +/*
> + * Ring buffer constants (also published in UAPI for mmap size calculation).
> + */
> +#define TLOB_RING_DEFAULT_CAP	64U	/* records allocated at open()  */
> +#define TLOB_RING_MIN_CAP	 8U	/* minimum accepted by mmap()   */
> +#define TLOB_RING_MAX_CAP	4096U	/* maximum accepted by mmap()   */
> +
> +/**
> + * struct tlob_ring - per-fd mmap-capable violation ring buffer.
> + *
> + * Allocated as a contiguous page range at rv_open() time:
> + *   page 0:    struct tlob_mmap_page  (shared with userspace)
> + *   pages 1-N: struct tlob_event[capacity]
> + */
> +struct tlob_ring {
> +	struct tlob_mmap_page	*page;
> +	struct tlob_event	*data;
> +	u32			 mask;
> +	spinlock_t		 lock;
> +	unsigned long		 base;
> +	unsigned int		 order;
> +};
> +
> +/**
> + * struct rv_file_priv - per-fd private data for /dev/rv.
> + */
> +struct rv_file_priv {
> +	struct tlob_ring	ring;
> +	wait_queue_head_t	waitq;
> +};
> +
> +#if IS_ENABLED(CONFIG_KUNIT)
> +int tlob_init_monitor(void);
> +void tlob_destroy_monitor(void);
> +int tlob_enable_hooks(void);
> +void tlob_disable_hooks(void);
> +void tlob_event_push_kunit(struct rv_file_priv *priv,
> +			  const struct tlob_event *info);
> +int tlob_parse_uprobe_line(char *buf, u64 *thr_out,
> +			   char **path_out,
> +			   loff_t *start_out, loff_t *stop_out);
> +#endif /* CONFIG_KUNIT */
> +
> +#endif /* _RV_TLOB_H */
> diff --git a/kernel/trace/rv/monitors/tlob/tlob_trace.h b/kernel/trace/rv/monitors/tlob/tlob_trace.h
> new file mode 100644
> index 000000000..b08d67776
> --- /dev/null
> +++ b/kernel/trace/rv/monitors/tlob/tlob_trace.h
> @@ -0,0 +1,42 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +/*
> + * Snippet to be included in rv_trace.h
> + */
> +
> +#ifdef CONFIG_RV_MON_TLOB
> +/*
> + * tlob uses the generic event_da_monitor_id and error_da_monitor_id event
> + * classes so that both event classes are instantiated.  This avoids a
> + * -Werror=unused-variable warning that the compiler emits when a
> + * DECLARE_EVENT_CLASS has no corresponding DEFINE_EVENT instance.
> + *
> + * The event_tlob tracepoint is defined here but the call-site in
> + * da_handle_event() is overridden with a no-op macro below so that no
> + * trace record is emitted on every scheduler context switch.  Budget
> + * violations are reported via the dedicated tlob_budget_exceeded event.
> + *
> + * error_tlob IS kept active so that invalid DA transitions (programming
> + * errors) are still visible in the ftrace ring buffer for debugging.
> + */
> +DEFINE_EVENT(event_da_monitor_id, event_tlob,
> +	     TP_PROTO(int id, char *state, char *event, char *next_state,
> +		      bool final_state),
> +	     TP_ARGS(id, state, event, next_state, final_state));
> +
> +DEFINE_EVENT(error_da_monitor_id, error_tlob,
> +	     TP_PROTO(int id, char *state, char *event),
> +	     TP_ARGS(id, state, event));
> +
> +/*
> + * Override the trace_event_tlob() call-site with a no-op after the
> + * DEFINE_EVENT above has satisfied the event class instantiation
> + * requirement.  The tracepoint symbol itself exists (and can be enabled
> + * via tracefs) but the automatic call from da_handle_event() is silenced
> + * to avoid per-context-switch ftrace noise during normal operation.
> + */
> +#undef trace_event_tlob
> +#define trace_event_tlob(id, state, event, next_state, final_state)	\
> +	do { (void)(id); (void)(state); (void)(event);			\
> +	     (void)(next_state); (void)(final_state); } while (0)
> +#endif /* CONFIG_RV_MON_TLOB */
> diff --git a/kernel/trace/rv/rv.c b/kernel/trace/rv/rv.c
> index ee4e68102..e754e76d5 100644
> --- a/kernel/trace/rv/rv.c
> +++ b/kernel/trace/rv/rv.c
> @@ -148,6 +148,10 @@
>  #include <rv_trace.h>
>  #endif
>  
> +#ifdef CONFIG_RV_MON_TLOB
> +EXPORT_TRACEPOINT_SYMBOL_GPL(tlob_budget_exceeded);
> +#endif
> +
>  #include "rv.h"
>  
>  DEFINE_MUTEX(rv_interface_lock);
> diff --git a/kernel/trace/rv/rv_dev.c b/kernel/trace/rv/rv_dev.c
> new file mode 100644
> index 000000000..a052f3203
> --- /dev/null
> +++ b/kernel/trace/rv/rv_dev.c
> @@ -0,0 +1,602 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * rv_dev.c - /dev/rv misc device for RV monitor self-instrumentation
> + *
> + * A single misc device (MISC_DYNAMIC_MINOR) serves all RV monitors.
> + * ioctl numbers encode the monitor identity:
> + *
> + *   0x01 - 0x1F  tlob (task latency over budget)
> + *   0x20 - 0x3F  reserved
> + *
> + * Each monitor exports tlob_start_task() / tlob_stop_task() which are
> + * called here.  The calling task is identified by current.
> + *
> + * Magic: RV_IOC_MAGIC (0xB9), defined in include/uapi/linux/rv.h
> + *
> + * Per-fd private data (rv_file_priv)
> + * ------------------------------------
> + * Every open() of /dev/rv allocates an rv_file_priv (defined in tlob.h).
> + * When TLOB_IOCTL_TRACE_START is called with args.notify_fd >= 0, violations
> + * are pushed as tlob_event records into that fd's per-fd ring buffer (tlob_ring)
> + * and its poll/epoll waitqueue is woken.
> + *
> + * Consumers drain records with read() on the notify_fd; read() blocks until
> + * at least one record is available (unless O_NONBLOCK is set).
> + *
> + * Per-thread "started" tracking (tlob_task_handle)
> + * -------------------------------------------------
> + * tlob_stop_task() returns -ESRCH in two distinct situations:
> + *
> + *   (a) The deadline timer already fired and removed the tlob hash-table
> + *       entry before TRACE_STOP arrived -> budget was exceeded -> -EOVERFLOW
> + *
> + *   (b) TRACE_START was never called for this thread -> programming error
> + *       -> -ESRCH
> + *
> + * To distinguish them, rv_dev.c maintains a lightweight hash table
> + * (tlob_handles) that records a tlob_task_handle for every task_struct *
> + * for which a successful TLOB_IOCTL_TRACE_START has been
> + * issued but the corresponding TLOB_IOCTL_TRACE_STOP has not yet arrived.
> + *
> + * tlob_task_handle is a thin "session ticket" -- it carries only the
> + * task pointer and the owning file descriptor.  The heavy per-task state
> + * (hrtimer, DA state, threshold) lives in tlob_task_state inside tlob.c.
> + *
> + * The table is keyed on task_struct * (same key as tlob.c), protected
> + * by tlob_handles_lock (spinlock, irq-safe).  No get_task_struct()
> + * refcount is needed here because tlob.c already holds a reference for
> + * each live entry.
> + *
> + * Multiple threads may share the same fd.  Each thread has its own
> + * tlob_task_handle in the table, so concurrent TRACE_START / TRACE_STOP
> + * calls from different threads do not interfere.
> + *
> + * The fd release path (rv_release) calls tlob_stop_task() for every
> + * handle in tlob_handles that belongs to the closing fd, ensuring cleanup
> + * even if the user forgets to call TRACE_STOP.
> + */
> +#include <linux/file.h>
> +#include <linux/fs.h>
> +#include <linux/gfp.h>
> +#include <linux/hash.h>
> +#include <linux/mm.h>
> +#include <linux/miscdevice.h>
> +#include <linux/module.h>
> +#include <linux/poll.h>
> +#include <linux/sched.h>
> +#include <linux/slab.h>
> +#include <linux/spinlock.h>
> +#include <linux/uaccess.h>
> +#include <uapi/linux/rv.h>
> +
> +#ifdef CONFIG_RV_MON_TLOB
> +#include "monitors/tlob/tlob.h"
> +#endif
> +
> +/* -----------------------------------------------------------------------
> + * tlob_task_handle - per-thread session ticket for the ioctl interface
> + *
> + * One handle is allocated by TLOB_IOCTL_TRACE_START and freed by
> + * TLOB_IOCTL_TRACE_STOP (or by rv_release if the fd is closed).
> + *
> + * @hlist:  Hash-table linkage in tlob_handles (keyed on task pointer).
> + * @task:   The monitored thread.  Plain pointer; no refcount held here
> + *          because tlob.c holds one for the lifetime of the monitoring
> + *          window, which encompasses the lifetime of this handle.
> + * @file:   The /dev/rv file descriptor that issued TRACE_START.
> + *          Used by rv_release() to sweep orphaned handles on close().
> + * -----------------------------------------------------------------------
> + */
> +#define TLOB_HANDLES_BITS	5
> +#define TLOB_HANDLES_SIZE	(1 << TLOB_HANDLES_BITS)
> +
> +struct tlob_task_handle {
> +	struct hlist_node	hlist;
> +	struct task_struct	*task;
> +	struct file		*file;
> +};
> +
> +static struct hlist_head tlob_handles[TLOB_HANDLES_SIZE];
> +static DEFINE_SPINLOCK(tlob_handles_lock);
> +
> +static unsigned int tlob_handle_hash(const struct task_struct *task)
> +{
> +	return hash_ptr((void *)task, TLOB_HANDLES_BITS);
> +}
> +
> +/* Must be called with tlob_handles_lock held. */
> +static struct tlob_task_handle *
> +tlob_handle_find_locked(struct task_struct *task)
> +{
> +	struct tlob_task_handle *h;
> +	unsigned int slot = tlob_handle_hash(task);
> +
> +	hlist_for_each_entry(h, &tlob_handles[slot], hlist) {
> +		if (h->task == task)
> +			return h;
> +	}
> +	return NULL;
> +}
> +
> +/*
> + * tlob_handle_alloc - record that @task has an active monitoring session
> + *                     opened via @file.
> + *
> + * Returns 0 on success, -EEXIST if @task already has a handle (double
> + * TRACE_START without TRACE_STOP), -ENOMEM on allocation failure.
> + */
> +static int tlob_handle_alloc(struct task_struct *task, struct file *file)
> +{
> +	struct tlob_task_handle *h;
> +	unsigned long flags;
> +	unsigned int slot;
> +
> +	h = kmalloc(sizeof(*h), GFP_KERNEL);
> +	if (!h)
> +		return -ENOMEM;
> +	h->task = task;
> +	h->file = file;
> +
> +	spin_lock_irqsave(&tlob_handles_lock, flags);
> +	if (tlob_handle_find_locked(task)) {
> +		spin_unlock_irqrestore(&tlob_handles_lock, flags);
> +		kfree(h);
> +		return -EEXIST;
> +	}
> +	slot = tlob_handle_hash(task);
> +	hlist_add_head(&h->hlist, &tlob_handles[slot]);
> +	spin_unlock_irqrestore(&tlob_handles_lock, flags);
> +	return 0;
> +}
> +
> +/*
> + * tlob_handle_free - remove the handle for @task and free it.
> + *
> + * Returns 1 if a handle existed (TRACE_START was called), 0 if not found
> + * (TRACE_START was never called for this thread).
> + */
> +static int tlob_handle_free(struct task_struct *task)
> +{
> +	struct tlob_task_handle *h;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&tlob_handles_lock, flags);
> +	h = tlob_handle_find_locked(task);
> +	if (h) {
> +		hlist_del_init(&h->hlist);
> +		spin_unlock_irqrestore(&tlob_handles_lock, flags);
> +		kfree(h);
> +		return 1;
> +	}
> +	spin_unlock_irqrestore(&tlob_handles_lock, flags);
> +	return 0;
> +}
> +
> +/*
> + * tlob_handle_sweep_file - release all handles owned by @file.
> + *
> + * Called from rv_release() when the fd is closed without TRACE_STOP.
> + * Calls tlob_stop_task() for each orphaned handle to drain the tlob
> + * monitoring entries and prevent resource leaks in tlob.c.
> + *
> + * Handles are collected under the lock (short critical section), then
> + * processed outside it (tlob_stop_task() may sleep/spin internally).
> + */
> +#ifdef CONFIG_RV_MON_TLOB
> +static void tlob_handle_sweep_file(struct file *file)
> +{
> +	struct tlob_task_handle *batch[TLOB_HANDLES_SIZE];
> +	struct tlob_task_handle *h;
> +	struct hlist_node *tmp;
> +	unsigned long flags;
> +	int i, n = 0;
> +
> +	spin_lock_irqsave(&tlob_handles_lock, flags);
> +	for (i = 0; i < TLOB_HANDLES_SIZE; i++) {
> +		hlist_for_each_entry_safe(h, tmp, &tlob_handles[i], hlist) {
> +			if (h->file == file) {
> +				hlist_del_init(&h->hlist);
> +				batch[n++] = h;
> +			}
> +		}
> +	}
> +	spin_unlock_irqrestore(&tlob_handles_lock, flags);
> +
> +	for (i = 0; i < n; i++) {
> +		/*
> +		 * Ignore -ESRCH: the deadline timer may have already fired
> +		 * and cleaned up the tlob entry.
> +		 */
> +		tlob_stop_task(batch[i]->task);
> +		kfree(batch[i]);
> +	}
> +}
> +#else
> +static inline void tlob_handle_sweep_file(struct file *file) {}
> +#endif /* CONFIG_RV_MON_TLOB */
> +
> +/* -----------------------------------------------------------------------
> + * Ring buffer lifecycle
> + * -----------------------------------------------------------------------
> + */
> +
> +/*
> + * tlob_ring_alloc - allocate a ring of @cap records (must be a power of 2).
> + *
> + * Allocates a physically contiguous block of pages:
> + *   page 0     : struct tlob_mmap_page  (control page, shared with userspace)
> + *   pages 1..N : struct tlob_event[cap] (data pages)
> + *
> + * Each page is marked reserved so it can be mapped to userspace via mmap().
> + */
> +static int tlob_ring_alloc(struct tlob_ring *ring, u32 cap)
> +{
> +	unsigned int total = PAGE_SIZE + cap * sizeof(struct tlob_event);
> +	unsigned int order = get_order(total);
> +	unsigned long base;
> +	unsigned int i;
> +
> +	base = __get_free_pages(GFP_KERNEL | __GFP_ZERO, order);
> +	if (!base)
> +		return -ENOMEM;
> +
> +	for (i = 0; i < (1u << order); i++)
> +		SetPageReserved(virt_to_page((void *)(base + i * PAGE_SIZE)));
> +
> +	ring->base  = base;
> +	ring->order = order;
> +	ring->page  = (struct tlob_mmap_page *)base;
> +	ring->data  = (struct tlob_event *)(base + PAGE_SIZE);
> +	ring->mask  = cap - 1;
> +	spin_lock_init(&ring->lock);
> +
> +	ring->page->capacity    = cap;
> +	ring->page->version     = 1;
> +	ring->page->data_offset = PAGE_SIZE;
> +	ring->page->record_size = sizeof(struct tlob_event);
> +	return 0;
> +}
> +
> +static void tlob_ring_free(struct tlob_ring *ring)
> +{
> +	unsigned int i;
> +
> +	if (!ring->base)
> +		return;
> +
> +	for (i = 0; i < (1u << ring->order); i++)
> +		ClearPageReserved(virt_to_page((void *)(ring->base + i * PAGE_SIZE)));
> +
> +	free_pages(ring->base, ring->order);
> +	ring->base = 0;
> +	ring->page = NULL;
> +	ring->data = NULL;
> +}
> +
> +/* -----------------------------------------------------------------------
> + * File operations
> + * -----------------------------------------------------------------------
> + */
> +
> +static int rv_open(struct inode *inode, struct file *file)
> +{
> +	struct rv_file_priv *priv;
> +	int ret;
> +
> +	priv = kzalloc(sizeof(*priv), GFP_KERNEL);
> +	if (!priv)
> +		return -ENOMEM;
> +
> +	ret = tlob_ring_alloc(&priv->ring, TLOB_RING_DEFAULT_CAP);
> +	if (ret) {
> +		kfree(priv);
> +		return ret;
> +	}
> +
> +	init_waitqueue_head(&priv->waitq);
> +	file->private_data = priv;
> +	return 0;
> +}
> +
> +static int rv_release(struct inode *inode, struct file *file)
> +{
> +	struct rv_file_priv *priv = file->private_data;
> +
> +	tlob_handle_sweep_file(file);
> +	tlob_ring_free(&priv->ring);
> +	kfree(priv);
> +	file->private_data = NULL;
> +	return 0;
> +}
> +
> +static __poll_t rv_poll(struct file *file, poll_table *wait)
> +{
> +	struct rv_file_priv *priv = file->private_data;
> +
> +	if (!priv)
> +		return EPOLLERR;
> +
> +	poll_wait(file, &priv->waitq, wait);
> +
> +	/*
> +	 * Pairs with smp_store_release(&ring->page->data_head, ...) in
> +	 * tlob_event_push().  No lock needed: head is written by the kernel
> +	 * producer and read here; tail is written by the consumer and we only
> +	 * need an approximate check for the poll fast path.
> +	 */
> +	if (smp_load_acquire(&priv->ring.page->data_head) !=
> +	    READ_ONCE(priv->ring.page->data_tail))
> +		return EPOLLIN | EPOLLRDNORM;
> +
> +	return 0;
> +}
> +
> +/*
> + * rv_read - consume tlob_event violation records from this fd's ring buffer.
> + *
> + * Each read() returns a whole number of struct tlob_event records.  @count must
> + * be at least sizeof(struct tlob_event); partial-record sizes are rejected with
> + * -EINVAL.
> + *
> + * Blocking behaviour follows O_NONBLOCK on the fd:
> + *   O_NONBLOCK clear: blocks until at least one record is available.
> + *   O_NONBLOCK set:   returns -EAGAIN immediately if the ring is empty.
> + *
> + * Returns the number of bytes copied (always a multiple of sizeof tlob_event),
> + * -EAGAIN if non-blocking and empty, or a negative error code.
> + *
> + * read() and mmap() share the same ring and data_tail cursor; do not use
> + * both simultaneously on the same fd.
> + */
> +static ssize_t rv_read(struct file *file, char __user *buf, size_t count,
> +		       loff_t *ppos)
> +{
> +	struct rv_file_priv *priv = file->private_data;
> +	struct tlob_ring *ring;
> +	size_t rec = sizeof(struct tlob_event);
> +	unsigned long irqflags;
> +	ssize_t done = 0;
> +	int ret;
> +
> +	if (!priv)
> +		return -ENODEV;
> +
> +	ring = &priv->ring;
> +
> +	if (count < rec)
> +		return -EINVAL;
> +
> +	/* Blocking path: sleep until the producer advances data_head. */
> +	if (!(file->f_flags & O_NONBLOCK)) {
> +		ret = wait_event_interruptible(priv->waitq,
> +			/* pairs with smp_store_release() in the producer */
> +			smp_load_acquire(&ring->page->data_head) !=
> +			READ_ONCE(ring->page->data_tail));
> +		if (ret)
> +			return ret;
> +	}
> +
> +	/*
> +	 * Drain records into the caller's buffer.  ring->lock serialises
> +	 * concurrent read() callers and the softirq producer.
> +	 */
> +	while (done + rec <= count) {
> +		struct tlob_event record;
> +		u32 head, tail;
> +
> +		spin_lock_irqsave(&ring->lock, irqflags);
> +		/* pairs with smp_store_release() in the producer */
> +		head = smp_load_acquire(&ring->page->data_head);
> +		tail = ring->page->data_tail;
> +		if (head == tail) {
> +			spin_unlock_irqrestore(&ring->lock, irqflags);
> +			break;
> +		}
> +		record = ring->data[tail & ring->mask];
> +		WRITE_ONCE(ring->page->data_tail, tail + 1);
> +		spin_unlock_irqrestore(&ring->lock, irqflags);
> +
> +		if (copy_to_user(buf + done, &record, rec))
> +			return done ? done : -EFAULT;
> +		done += rec;
> +	}
> +
> +	return done ? done : -EAGAIN;
> +}
> +
> +/*
> + * rv_mmap - map the per-fd violation ring buffer into userspace.
> + *
> + * The mmap region covers the full ring allocation:
> + *
> + *   offset 0          : struct tlob_mmap_page  (control page)
> + *   offset PAGE_SIZE  : struct tlob_event[capacity]  (data pages)
> + *
> + * The caller must map exactly PAGE_SIZE + capacity * sizeof(struct tlob_event)
> + * bytes starting at offset 0 (vm_pgoff must be 0).  The actual capacity is
> + * read from tlob_mmap_page.capacity after a successful mmap(2).
> + *
> + * Private mappings (MAP_PRIVATE) are rejected: the shared data_tail field
> + * written by userspace must be visible to the kernel producer.
> + */
> +static int rv_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> +	struct rv_file_priv *priv = file->private_data;
> +	struct tlob_ring    *ring;
> +	unsigned long        size = vma->vm_end - vma->vm_start;
> +	unsigned long        ring_size;
> +
> +	if (!priv)
> +		return -ENODEV;
> +
> +	ring = &priv->ring;
> +
> +	if (vma->vm_pgoff != 0)
> +		return -EINVAL;
> +
> +	ring_size = PAGE_ALIGN(PAGE_SIZE + ((unsigned long)(ring->mask + 1) *
> +					    sizeof(struct tlob_event)));
> +	if (size != ring_size)
> +		return -EINVAL;
> +
> +	if (!(vma->vm_flags & VM_SHARED))
> +		return -EINVAL;
> +
> +	return remap_pfn_range(vma, vma->vm_start,
> +			       page_to_pfn(virt_to_page((void *)ring->base)),
> +			       ring_size, vma->vm_page_prot);
> +}
> +
> +/* -----------------------------------------------------------------------
> + * ioctl dispatcher
> + * -----------------------------------------------------------------------
> + */
> +
> +static long rv_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
> +{
> +	unsigned int nr = _IOC_NR(cmd);
> +
> +	/*
> +	 * Verify the magic byte so we don't accidentally handle ioctls
> +	 * intended for a different device.
> +	 */
> +	if (_IOC_TYPE(cmd) != RV_IOC_MAGIC)
> +		return -ENOTTY;
> +
> +#ifdef CONFIG_RV_MON_TLOB
> +	/* tlob: ioctl numbers 0x01 - 0x1F */
> +	switch (cmd) {
> +	case TLOB_IOCTL_TRACE_START: {
> +		struct tlob_start_args args;
> +		struct file *notify_file = NULL;
> +		int ret, hret;
> +
> +		if (copy_from_user(&args,
> +				   (struct tlob_start_args __user *)arg,
> +				   sizeof(args)))
> +			return -EFAULT;
> +		if (args.threshold_us == 0)
> +			return -EINVAL;
> +		if (args.flags != 0)
> +			return -EINVAL;
> +
> +		/*
> +		 * If notify_fd >= 0, resolve it to a file pointer.
> +		 * fget() bumps the reference count; tlob.c drops it
> +		 * via fput() when the monitoring window ends.
> +		 * Reject non-/dev/rv fds to prevent type confusion.
> +		 */
> +		if (args.notify_fd >= 0) {
> +			notify_file = fget(args.notify_fd);
> +			if (!notify_file)
> +				return -EBADF;
> +			if (notify_file->f_op != file->f_op) {
> +				fput(notify_file);
> +				return -EINVAL;
> +			}
> +		}
> +
> +		ret = tlob_start_task(current, args.threshold_us,
> +				      notify_file, args.tag);
> +		if (ret != 0) {
> +			/* tlob.c did not take ownership; drop ref. */
> +			if (notify_file)
> +				fput(notify_file);
> +			return ret;
> +		}
> +
> +		/*
> +		 * Record session handle.  Free any stale handle left by
> +		 * a previous window whose deadline timer fired (timer
> +		 * removes tlob_task_state but cannot touch tlob_handles).
> +		 */
> +		tlob_handle_free(current);
> +		hret = tlob_handle_alloc(current, file);
> +		if (hret < 0) {
> +			tlob_stop_task(current);
> +			return hret;
> +		}
> +		return 0;
> +	}
> +	case TLOB_IOCTL_TRACE_STOP: {
> +		int had_handle;
> +		int ret;
> +
> +		/*
> +		 * Atomically remove the session handle for current.
> +		 *
> +		 *   had_handle == 0: TRACE_START was never called for
> +		 *                    this thread -> caller bug -> -ESRCH
> +		 *
> +		 *   had_handle == 1: TRACE_START was called.  If
> +		 *                    tlob_stop_task() now returns
> +		 *                    -ESRCH, the deadline timer already
> +		 *                    fired -> budget exceeded -> -EOVERFLOW
> +		 */
> +		had_handle = tlob_handle_free(current);
> +		if (!had_handle)
> +			return -ESRCH;
> +
> +		ret = tlob_stop_task(current);
> +		return (ret == -ESRCH) ? -EOVERFLOW : ret;
> +	}
> +	default:
> +		break;
> +	}
> +#endif /* CONFIG_RV_MON_TLOB */
> +
> +	return -ENOTTY;
> +}
> +
> +/* -----------------------------------------------------------------------
> + * Module init / exit
> + * -----------------------------------------------------------------------
> + */
> +
> +static const struct file_operations rv_fops = {
> +	.owner		= THIS_MODULE,
> +	.open		= rv_open,
> +	.release	= rv_release,
> +	.read		= rv_read,
> +	.poll		= rv_poll,
> +	.mmap		= rv_mmap,
> +	.unlocked_ioctl	= rv_ioctl,
> +#ifdef CONFIG_COMPAT
> +	.compat_ioctl	= rv_ioctl,
> +#endif
> +	.llseek		= noop_llseek,
> +};
> +
> +/*
> + * 0666: /dev/rv is a self-instrumentation device.  All ioctls operate
> + * exclusively on the calling task (current); no task can monitor another
> + * via this interface.  Opening the device does not grant any privilege
> + * beyond observing one's own latency, so world-read/write is appropriate.
> + */
> +static struct miscdevice rv_miscdev = {
> +	.minor	= MISC_DYNAMIC_MINOR,
> +	.name	= "rv",
> +	.fops	= &rv_fops,
> +	.mode	= 0666,
> +};
> +
> +static int __init rv_ioctl_init(void)
> +{
> +	int i;
> +
> +	for (i = 0; i < TLOB_HANDLES_SIZE; i++)
> +		INIT_HLIST_HEAD(&tlob_handles[i]);
> +
> +	return misc_register(&rv_miscdev);
> +}
> +
> +static void __exit rv_ioctl_exit(void)
> +{
> +	misc_deregister(&rv_miscdev);
> +}
> +
> +module_init(rv_ioctl_init);
> +module_exit(rv_ioctl_exit);
> +
> +MODULE_LICENSE("GPL");
> +MODULE_DESCRIPTION("RV ioctl interface via /dev/rv");
> diff --git a/kernel/trace/rv/rv_trace.h b/kernel/trace/rv/rv_trace.h
> index 4a6faddac..65d6c6485 100644
> --- a/kernel/trace/rv/rv_trace.h
> +++ b/kernel/trace/rv/rv_trace.h
> @@ -126,6 +126,7 @@ DECLARE_EVENT_CLASS(error_da_monitor_id,
>  #include <monitors/snroc/snroc_trace.h>
>  #include <monitors/nrp/nrp_trace.h>
>  #include <monitors/sssw/sssw_trace.h>
> +#include <monitors/tlob/tlob_trace.h>
>  // Add new monitors based on CONFIG_DA_MON_EVENTS_ID here
>  
>  #endif /* CONFIG_DA_MON_EVENTS_ID */
> @@ -202,6 +203,55 @@ TRACE_EVENT(rv_retries_error,
>  		__get_str(event), __get_str(name))
>  );
>  #endif /* CONFIG_RV_MON_MAINTENANCE_EVENTS */
> +
> +#ifdef CONFIG_RV_MON_TLOB
> +/*
> + * tlob_budget_exceeded - emitted when a monitored task exceeds its latency
> + * budget.  Carries the on-CPU / off-CPU time breakdown so that the cause
> + * of the overrun (CPU-bound vs. scheduling/I/O latency) is immediately
> + * visible in the ftrace ring buffer without post-processing.
> + */
> +TRACE_EVENT(tlob_budget_exceeded,
> +
> +	TP_PROTO(struct task_struct *task, u64 threshold_us,
> +		 u64 on_cpu_us, u64 off_cpu_us, u32 switches,
> +		 bool state_is_on_cpu, u64 tag),
> +
> +	TP_ARGS(task, threshold_us, on_cpu_us, off_cpu_us, switches,
> +		state_is_on_cpu, tag),
> +
> +	TP_STRUCT__entry(
> +		__string(comm,		task->comm)
> +		__field(pid_t,		pid)
> +		__field(u64,		threshold_us)
> +		__field(u64,		on_cpu_us)
> +		__field(u64,		off_cpu_us)
> +		__field(u32,		switches)
> +		__field(bool,		state_is_on_cpu)
> +		__field(u64,		tag)
> +	),
> +
> +	TP_fast_assign(
> +		__assign_str(comm);
> +		__entry->pid		= task->pid;
> +		__entry->threshold_us	= threshold_us;
> +		__entry->on_cpu_us	= on_cpu_us;
> +		__entry->off_cpu_us	= off_cpu_us;
> +		__entry->switches	= switches;
> +		__entry->state_is_on_cpu = state_is_on_cpu;
> +		__entry->tag		= tag;
> +	),
> +
> +	TP_printk("%s[%d]: budget exceeded threshold=%llu on_cpu=%llu off_cpu=%llu switches=%u state=%s tag=0x%016llx",
> +		__get_str(comm), __entry->pid,
> +		__entry->threshold_us,
> +		__entry->on_cpu_us, __entry->off_cpu_us,
> +		__entry->switches,
> +		__entry->state_is_on_cpu ? "on_cpu" : "off_cpu",
> +		__entry->tag)
> +);
> +#endif /* CONFIG_RV_MON_TLOB */
> +
>  #endif /* _TRACE_RV_H */
>  
>  /* This part must be outside protection */


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH 1/4] rv/tlob: Add tlob model DOT file
  2026-04-12 19:27 ` [RFC PATCH 1/4] rv/tlob: Add tlob model DOT file wen.yang
@ 2026-04-13  8:19   ` Gabriele Monaco
  0 siblings, 0 replies; 11+ messages in thread
From: Gabriele Monaco @ 2026-04-13  8:19 UTC (permalink / raw)
  To: wen.yang, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers
  Cc: linux-trace-kernel, linux-kernel

On Mon, 2026-04-13 at 03:27 +0800, wen.yang@linux.dev wrote:
> From: Wen Yang <wen.yang@linux.dev>
> 
> Add the Graphviz DOT specification for the tlob (task latency over
> budget) deterministic automaton.
> 
> The model has three states: unmonitored, on_cpu, and off_cpu.
> trace_start transitions from unmonitored to on_cpu; switch_out and
> switch_in cycle between on_cpu and off_cpu; trace_stop and
> budget_expired return to unmonitored from either active state.
> unmonitored is the sole accepting state.
> 
> switch_in, switch_out, and sched_wakeup self-loop in unmonitored;
> sched_wakeup self-loops in on_cpu; switch_out and sched_wakeup
> self-loop in off_cpu.
> 
> Signed-off-by: Wen Yang <wen.yang@linux.dev>
> ---

Interesting monitor! Thanks.
I'm going to go through it in more detail later, but let me share some initial
comments.

>  MAINTAINERS                        |  3 +++
>  tools/verification/models/tlob.dot | 25 +++++++++++++++++++++++++
>  2 files changed, 28 insertions(+)
>  create mode 100644 tools/verification/models/tlob.dot
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 9fbb619c6..c2c56236c 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -23242,7 +23242,10 @@ S:	Maintained
>  F:	Documentation/trace/rv/
>  F:	include/linux/rv.h
>  F:	include/rv/
> +F:	include/uapi/linux/rv.h
>  F:	kernel/trace/rv/
> +F:	samples/rv/
> +F:	tools/testing/selftests/rv/
>  F:	tools/testing/selftests/verification/
>  F:	tools/verification/

This change doesn't belong here: this patch itself is not adding those files, so
you should probably move it to a later patch.

>  
> diff --git a/tools/verification/models/tlob.dot b/tools/verification/models/tlob.dot
> new file mode 100644
> index 000000000..df34a14b8
> --- /dev/null
> +++ b/tools/verification/models/tlob.dot
> @@ -0,0 +1,25 @@
> +digraph state_automaton {
> +	center = true;
> +	size = "7,11";
> +	{node [shape = plaintext, style=invis, label=""] "__init_unmonitored"};
> +	{node [shape = ellipse] "unmonitored"};
> +	{node [shape = plaintext] "unmonitored"};
> +	{node [shape = plaintext] "on_cpu"};
> +	{node [shape = plaintext] "off_cpu"};
> +	"__init_unmonitored" -> "unmonitored";
> +	"unmonitored" [label = "unmonitored", color = green3];
> +	"unmonitored" -> "on_cpu" [ label = "trace_start" ];
> +	"unmonitored" -> "unmonitored" [ label = "switch_in\nswitch_out\nsched_wakeup" ];
> +	"on_cpu" [label = "on_cpu"];
> +	"on_cpu" -> "off_cpu" [ label = "switch_out" ];
> +	"on_cpu" -> "unmonitored" [ label = "trace_stop\nbudget_expired" ];
> +	"on_cpu" -> "on_cpu" [ label = "sched_wakeup" ];
> +	"off_cpu" [label = "off_cpu"];
> +	"off_cpu" -> "on_cpu" [ label = "switch_in" ];
> +	"off_cpu" -> "unmonitored" [ label = "trace_stop\nbudget_expired" ];
> +	"off_cpu" -> "off_cpu" [ label = "switch_out\nsched_wakeup" ];
> +	{ rank = min ;
> +		"__init_unmonitored";
> +		"unmonitored";
> +	}
> +}



* Re: [RFC PATCH 4/4] selftests/rv: Add selftest for the tlob monitor
  2026-04-12 19:27 ` [RFC PATCH 4/4] selftests/rv: Add selftest " wen.yang
@ 2026-04-16 12:00   ` Gabriele Monaco
  0 siblings, 0 replies; 11+ messages in thread
From: Gabriele Monaco @ 2026-04-16 12:00 UTC (permalink / raw)
  To: wen.yang
  Cc: linux-trace-kernel, linux-kernel, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers

On Mon, 2026-04-13 at 03:27 +0800, wen.yang@linux.dev wrote:
> From: Wen Yang <wen.yang@linux.dev>
> 
> Add a kselftest suite (TAP output, 19 test points) for the tlob RV
> monitor under tools/testing/selftests/rv/.
> 
> test_tlob.sh drives a compiled C helper (tlob_helper) and, for uprobe
> tests, a target binary (tlob_uprobe_target). Coverage spans the
> tracefs enable/disable path, uprobe-triggered violations, and the
> ioctl interface (within-budget stop, CPU-bound and sleep violations,
> duplicate start, ring buffer mmap and consumption).
> 
> Requires CONFIG_RV_MON_TLOB=y and CONFIG_RV_CHARDEV=y; must be run
> as root.
> 
> Signed-off-by: Wen Yang <wen.yang@linux.dev>

Those are some extensive selftests!
Could you integrate them with the existing test suite under
tools/testing/selftests/verification?

You would probably just get your tlob_helper built and call it from some shell
script under test.d; the harness should work without the need for extra helpers.

Thanks,
Gabriele

> ---
>  tools/include/uapi/linux/rv.h                 |  54 +
>  tools/testing/selftests/rv/Makefile           |  18 +
>  tools/testing/selftests/rv/test_tlob.sh       | 563 ++++++++++
>  tools/testing/selftests/rv/tlob_helper.c      | 994 ++++++++++++++++++
>  .../testing/selftests/rv/tlob_uprobe_target.c | 108 ++
>  5 files changed, 1737 insertions(+)
>  create mode 100644 tools/include/uapi/linux/rv.h
>  create mode 100644 tools/testing/selftests/rv/Makefile
>  create mode 100755 tools/testing/selftests/rv/test_tlob.sh
>  create mode 100644 tools/testing/selftests/rv/tlob_helper.c
>  create mode 100644 tools/testing/selftests/rv/tlob_uprobe_target.c
> 
> diff --git a/tools/include/uapi/linux/rv.h b/tools/include/uapi/linux/rv.h
> new file mode 100644
> index 000000000..bef07aded
> --- /dev/null
> +++ b/tools/include/uapi/linux/rv.h
> @@ -0,0 +1,54 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +/*
> + * UAPI definitions for Runtime Verification (RV) monitors.
> + *
> + * This is a tools-friendly copy of include/uapi/linux/rv.h.
> + * Keep in sync with the kernel header.
> + */
> +
> +#ifndef _UAPI_LINUX_RV_H
> +#define _UAPI_LINUX_RV_H
> +
> +#include <linux/types.h>
> +#include <sys/ioctl.h>
> +
> +/* Magic byte shared by all RV monitor ioctls. */
> +#define RV_IOC_MAGIC	0xB9
> +
> +/* -----------------------------------------------------------------------
> + * tlob: task latency over budget monitor  (nr 0x01 - 0x1F)
> + * -----------------------------------------------------------------------
> + */
> +
> +struct tlob_start_args {
> +	__u64 threshold_us;
> +	__u64 tag;
> +	__s32 notify_fd;
> +	__u32 flags;
> +};
> +
> +struct tlob_event {
> +	__u32 tid;
> +	__u32 pad;
> +	__u64 threshold_us;
> +	__u64 on_cpu_us;
> +	__u64 off_cpu_us;
> +	__u32 switches;
> +	__u32 state;   /* 1 = on_cpu, 0 = off_cpu */
> +	__u64 tag;
> +};
> +
> +struct tlob_mmap_page {
> +	__u32  data_head;
> +	__u32  data_tail;
> +	__u32  capacity;
> +	__u32  version;
> +	__u32  data_offset;
> +	__u32  record_size;
> +	__u64  dropped;
> +};
> +
> +#define TLOB_IOCTL_TRACE_START	_IOW(RV_IOC_MAGIC, 0x01, struct tlob_start_args)
> +#define TLOB_IOCTL_TRACE_STOP	_IO(RV_IOC_MAGIC,  0x02)
> +
> +#endif /* _UAPI_LINUX_RV_H */
> diff --git a/tools/testing/selftests/rv/Makefile b/tools/testing/selftests/rv/Makefile
> new file mode 100644
> index 000000000..14e94a1ab
> --- /dev/null
> +++ b/tools/testing/selftests/rv/Makefile
> @@ -0,0 +1,18 @@
> +# SPDX-License-Identifier: GPL-2.0
> +# Makefile for rv selftests
> +
> +TEST_GEN_PROGS := tlob_helper tlob_uprobe_target
> +
> +TEST_PROGS := \
> +	test_tlob.sh \
> +
> +# TOOLS_INCLUDES is defined by ../lib.mk; provides -isystem to
> +# tools/include/uapi so that #include <linux/rv.h> resolves to the
> +# in-tree UAPI header without requiring make headers_install.
> +# Note: both must be added to the global variables, not as target-specific
> +# overrides, because lib.mk rewrites TEST_GEN_PROGS to $(OUTPUT)/name
> +# before per-target rules would be evaluated.
> +CFLAGS += $(TOOLS_INCLUDES)
> +LDLIBS += -lpthread
> +
> +include ../lib.mk
> diff --git a/tools/testing/selftests/rv/test_tlob.sh b/tools/testing/selftests/rv/test_tlob.sh
> new file mode 100755
> index 000000000..3ba2125eb
> --- /dev/null
> +++ b/tools/testing/selftests/rv/test_tlob.sh
> @@ -0,0 +1,563 @@
> +#!/bin/sh
> +# SPDX-License-Identifier: GPL-2.0
> +#
> +# Selftest for the tlob (task latency over budget) RV monitor.
> +#
> +# Two interfaces are tested:
> +#
> +#   1. tracefs interface:
> +#        enable/disable, presence of tracefs files,
> +#        uprobe binding (threshold_us:offset_start:offset_stop:binary_path) and
> +#        violation detection via the ftrace ring buffer.
> +#
> +#   2. /dev/rv ioctl self-instrumentation (via tlob_helper):
> +#        within-budget, over-budget on-CPU, over-budget off-CPU (sleep),
> +#        double-start, stop-without-start.
> +#
> +# Written to be POSIX sh compatible (no bash-specific extensions).
> +
> +ksft_skip=4
> +t_pass=0; t_fail=0; t_skip=0; t_total=0
> +
> +tap_header() { echo "TAP version 13"; }
> +tap_plan()   { echo "1..$1"; }
> +tap_pass()   { t_pass=$((t_pass+1)); echo "ok $t_total - $1"; }
> +tap_fail()   { t_fail=$((t_fail+1)); echo "not ok $t_total - $1"
> +               [ -n "$2" ] && echo "  # $2"; }
> +tap_skip()   { t_skip=$((t_skip+1)); echo "ok $t_total - $1 # SKIP $2"; }
> +next_test()  { t_total=$((t_total+1)); }
> +
> +TRACEFS=$(grep -m1 tracefs /proc/mounts 2>/dev/null | awk '{print $2}')
> +[ -z "$TRACEFS" ] && TRACEFS=/sys/kernel/tracing
> +
> +RV_DIR="${TRACEFS}/rv"
> +TLOB_DIR="${RV_DIR}/monitors/tlob"
> +TRACE_FILE="${TRACEFS}/trace"
> +TRACING_ON="${TRACEFS}/tracing_on"
> +TLOB_MONITOR="${TLOB_DIR}/monitor"
> +BUDGET_EXCEEDED_ENABLE="${TRACEFS}/events/rv/tlob_budget_exceeded/enable"
> +RV_DEV="/dev/rv"
> +
> +# tlob_helper and tlob_uprobe_target must be in the same directory as
> +# this script or on PATH.
> +SCRIPT_DIR=$(dirname "$0")
> +IOCTL_HELPER="${SCRIPT_DIR}/tlob_helper"
> +UPROBE_TARGET="${SCRIPT_DIR}/tlob_uprobe_target"
> +
> +check_root()     { [ "$(id -u)" = "0" ] || { echo "# Need root" >&2; exit $ksft_skip; }; }
> +check_tracefs()  { [ -d "${TRACEFS}" ]   || { echo "# No tracefs" >&2; exit $ksft_skip; }; }
> +check_rv_dir()   { [ -d "${RV_DIR}" ]    || { echo "# No RV infra" >&2; exit $ksft_skip; }; }
> +check_tlob()     { [ -d "${TLOB_DIR}" ]  || { echo "# No tlob monitor" >&2; exit $ksft_skip; }; }
> +
> +tlob_enable()         { echo 1 > "${TLOB_DIR}/enable"; }
> +tlob_disable()        { echo 0 > "${TLOB_DIR}/enable" 2>/dev/null; }
> +tlob_is_enabled()     { [ "$(cat "${TLOB_DIR}/enable" 2>/dev/null)" = "1" ]; }
> +trace_event_enable()  { echo 1 > "${BUDGET_EXCEEDED_ENABLE}" 2>/dev/null; }
> +trace_event_disable() { echo 0 > "${BUDGET_EXCEEDED_ENABLE}" 2>/dev/null; }
> +trace_on()            { echo 1 > "${TRACING_ON}" 2>/dev/null; }
> +trace_clear()         { echo > "${TRACE_FILE}"; }
> +trace_grep()          { grep -q "$1" "${TRACE_FILE}" 2>/dev/null; }
> +
> +cleanup() {
> +	tlob_disable
> +	trace_event_disable
> +	trace_clear
> +}
> +
> +# ---------------------------------------------------------------------------
> +# Test 1: enable / disable
> +# ---------------------------------------------------------------------------
> +run_test_enable_disable() {
> +	next_test; cleanup
> +	tlob_enable
> +	if ! tlob_is_enabled; then
> +		tap_fail "enable_disable" "not enabled after echo 1"; cleanup; return
> +	fi
> +	tlob_disable
> +	if tlob_is_enabled; then
> +		tap_fail "enable_disable" "still enabled after echo 0"; cleanup; return
> +	fi
> +	tap_pass "enable_disable"; cleanup
> +}
> +
> +# ---------------------------------------------------------------------------
> +# Test 2: tracefs files present
> +# ---------------------------------------------------------------------------
> +run_test_tracefs_files() {
> +	next_test; cleanup
> +	missing=""
> +	for f in enable desc monitor; do
> +		[ ! -e "${TLOB_DIR}/${f}" ] && missing="${missing} ${f}"
> +	done
> +	[ -n "${missing}" ] \
> +		&& tap_fail "tracefs_files" "missing:${missing}" \
> +		|| tap_pass "tracefs_files"
> +	cleanup
> +}
> +
> +# ---------------------------------------------------------------------------
> +# Helper: resolve file offset of a function inside a binary.
> +#
> +# Usage: resolve_offset <binary> <vaddr_hex>
> +# Prints the hex file offset, or empty string on failure.
> +# ---------------------------------------------------------------------------
> +resolve_offset() {
> +	bin=$1; vaddr=$2
> +	# Parse /proc/self/maps to find the mapping that contains vaddr.
> +	# Each line: start-end perms offset dev inode [path]
> +	while IFS= read -r line; do
> +		set -- $line
> +		range=$1; off=$3; path=$6
> +		[ -z "$path" ] && continue
> +		# Only consider the mapping for our binary
> +		[ "$path" != "$bin" ] && continue
> +		# Split range into start and end
> +		start=$(echo "$range" | cut -d- -f1)
> +		end=$(echo "$range" | cut -d- -f2)
> +		# Convert hex to decimal for comparison (use printf)
> +		s=$(printf "%d" "0x${start}" 2>/dev/null) || continue
> +		e=$(printf "%d" "0x${end}"   2>/dev/null) || continue
> +		v=$(printf "%d" "0x${vaddr#0x}" 2>/dev/null) || continue
> +		o=$(printf "%d" "0x${off}"   2>/dev/null) || continue
> +		if [ "$v" -ge "$s" ] && [ "$v" -lt "$e" ]; then
> +			file_off=$(printf "0x%x" $(( (v - s) + o )))
> +			echo "$file_off"
> +			return
> +		fi
> +	done < /proc/self/maps
> +}
> +
> +# ---------------------------------------------------------------------------
> +# Test 3: uprobe binding - no false positive
> +#
> +# Bind this process with a 10 s budget.  Do nothing for 0.5 s.
> +# No budget_exceeded event should appear in the trace.
> +# ---------------------------------------------------------------------------
> +run_test_uprobe_no_false_positive() {
> +	next_test; cleanup
> +	if [ ! -e "${TLOB_MONITOR}" ]; then
> +		tap_skip "uprobe_no_false_positive" "monitor file not available"
> +		cleanup; return
> +	fi
> +	# A valid uprobe offset cannot easily be resolved in pure shell, so
> +	# this test binds nothing: it only verifies that an enabled monitor
> +	# with no binding emits no spurious budget_exceeded events during
> +	# 0.5 s of idle time.
> +	trace_event_enable
> +	trace_on
> +	tlob_enable
> +	trace_clear
> +	# Sleep without any binding - just verify no spurious events
> +	sleep 0.5
> +	trace_grep "budget_exceeded" \
> +		&& tap_fail "uprobe_no_false_positive" \
> +			"spurious budget_exceeded without any binding" \
> +		|| tap_pass "uprobe_no_false_positive"
> +	cleanup
> +}
> +
> +# ---------------------------------------------------------------------------
> +# Helper: get_uprobe_offset <binary> <symbol>
> +#
> +# Use tlob_helper sym_offset to get the ELF file offset of <symbol>
> +# in <binary>.  Prints the hex offset (e.g. "0x11d0") or empty string on
> +# failure.
> +# ---------------------------------------------------------------------------
> +get_uprobe_offset() {
> +	bin=$1; sym=$2
> +	if [ ! -x "${IOCTL_HELPER}" ]; then
> +		return
> +	fi
> +	"${IOCTL_HELPER}" sym_offset "${bin}" "${sym}" 2>/dev/null
> +}
> +
> +# ---------------------------------------------------------------------------
> +# Test 4: uprobe binding - violation detected
> +#
> +# Start tlob_uprobe_target (a busy-spin binary with a well-known symbol),
> +# attach a uprobe on tlob_busy_work with a 10 ms threshold, and verify
> +# that a budget_expired event appears.
> +# ---------------------------------------------------------------------------
> +run_test_uprobe_violation() {
> +	next_test; cleanup
> +	if [ ! -e "${TLOB_MONITOR}" ]; then
> +		tap_skip "uprobe_violation" "monitor file not available"
> +		cleanup; return
> +	fi
> +	if [ ! -x "${UPROBE_TARGET}" ]; then
> +		tap_skip "uprobe_violation" \
> +			"tlob_uprobe_target not found or not executable"
> +		cleanup; return
> +	fi
> +
> +	# Get the file offsets of the start and stop probe symbols
> +	busy_offset=$(get_uprobe_offset "${UPROBE_TARGET}" "tlob_busy_work")
> +	if [ -z "${busy_offset}" ]; then
> +		tap_skip "uprobe_violation" \
> +			"cannot resolve tlob_busy_work offset in ${UPROBE_TARGET}"
> +		cleanup; return
> +	fi
> +	stop_offset=$(get_uprobe_offset "${UPROBE_TARGET}" "tlob_busy_work_done")
> +	if [ -z "${stop_offset}" ]; then
> +		tap_skip "uprobe_violation" \
> +			"cannot resolve tlob_busy_work_done offset in ${UPROBE_TARGET}"
> +		cleanup; return
> +	fi
> +
> +	# Start the busy-spin target (run for 30 s so the test can observe it)
> +	"${UPROBE_TARGET}" 30000 &
> +	busy_pid=$!
> +	sleep 0.05
> +
> +	trace_event_enable
> +	trace_on
> +	tlob_enable
> +	trace_clear
> +
> +	# Bind the target: 10 us budget; start=tlob_busy_work, stop=tlob_busy_work_done
> +	binding="10:${busy_offset}:${stop_offset}:${UPROBE_TARGET}"
> +	if ! echo "${binding}" > "${TLOB_MONITOR}" 2>/dev/null; then
> +		kill "${busy_pid}" 2>/dev/null; wait "${busy_pid}" 2>/dev/null
> +		tap_skip "uprobe_violation" \
> +			"uprobe binding rejected (CONFIG_UPROBES=y needed)"
> +		cleanup; return
> +	fi
> +
> +	# Wait up to 2 s for a budget_exceeded event
> +	found=0; i=0
> +	while [ "$i" -lt 20 ]; do
> +		sleep 0.1
> +		trace_grep "budget_exceeded" && { found=1; break; }
> +		i=$((i+1))
> +	done
> +
> +	echo "-${busy_offset}:${UPROBE_TARGET}" > "${TLOB_MONITOR}" 2>/dev/null
> +	kill "${busy_pid}" 2>/dev/null; wait "${busy_pid}" 2>/dev/null
> +
> +	if [ "${found}" != "1" ]; then
> +		tap_fail "uprobe_violation" "no budget_exceeded within 2 s"
> +		cleanup; return
> +	fi
> +
> +	# Validate the event fields: threshold must match, on_cpu must be non-zero
> +	# (CPU-bound violation), and state must be on_cpu.
> +	ev=$(grep "budget_exceeded" "${TRACE_FILE}" | head -n 1)
> +	if ! echo "${ev}" | grep -q "threshold=10 "; then
> +		tap_fail "uprobe_violation" "threshold field mismatch: ${ev}"
> +		cleanup; return
> +	fi
> +	on_cpu=$(echo "${ev}" | grep -o "on_cpu=[0-9]*" | cut -d= -f2)
> +	if [ "${on_cpu:-0}" -eq 0 ]; then
> +		tap_fail "uprobe_violation" "on_cpu=0 for a CPU-bound spin: ${ev}"
> +		cleanup; return
> +	fi
> +	if ! echo "${ev}" | grep -q "state=on_cpu"; then
> +		tap_fail "uprobe_violation" "state is not on_cpu: ${ev}"
> +		cleanup; return
> +	fi
> +	tap_pass "uprobe_violation"
> +	cleanup
> +}
> +
> +# ---------------------------------------------------------------------------
> +# Test 5: uprobe binding - remove binding stops monitoring
> +#
> +# Bind tlob_uprobe_target, then immediately remove the binding.
> +# Verify that after removal the monitor file no longer lists it.
> +# ---------------------------------------------------------------------------
> +run_test_uprobe_unbind() {
> +	next_test; cleanup
> +	if [ ! -e "${TLOB_MONITOR}" ]; then
> +		tap_skip "uprobe_unbind" "monitor file not available"
> +		cleanup; return
> +	fi
> +	if [ ! -x "${UPROBE_TARGET}" ]; then
> +		tap_skip "uprobe_unbind" \
> +			"tlob_uprobe_target not found or not executable"
> +		cleanup; return
> +	fi
> +
> +	busy_offset=$(get_uprobe_offset "${UPROBE_TARGET}" "tlob_busy_work")
> +	stop_offset=$(get_uprobe_offset "${UPROBE_TARGET}" "tlob_busy_work_done")
> +	if [ -z "${busy_offset}" ] || [ -z "${stop_offset}" ]; then
> +		tap_skip "uprobe_unbind" \
> +			"cannot resolve tlob_busy_work/tlob_busy_work_done offset"
> +		cleanup; return
> +	fi
> +
> +	"${UPROBE_TARGET}" 30000 &
> +	busy_pid=$!
> +	sleep 0.05
> +
> +	tlob_enable
> +	# 5 s budget - should not fire during this quick test
> +	binding="5000000:${busy_offset}:${stop_offset}:${UPROBE_TARGET}"
> +	if ! echo "${binding}" > "${TLOB_MONITOR}" 2>/dev/null; then
> +		kill "${busy_pid}" 2>/dev/null; wait "${busy_pid}" 2>/dev/null
> +		tap_skip "uprobe_unbind" \
> +			"uprobe binding rejected (CONFIG_UPROBES=y needed)"
> +		cleanup; return
> +	fi
> +
> +	# Remove the binding
> +	echo "-${busy_offset}:${UPROBE_TARGET}" > "${TLOB_MONITOR}" 2>/dev/null
> +
> +	# The monitor file should no longer list the binding for this offset
> +	if grep -q "^[0-9]*:0x${busy_offset#0x}:" "${TLOB_MONITOR}" 2>/dev/null; then
> +		kill "${busy_pid}" 2>/dev/null; wait "${busy_pid}" 2>/dev/null
> +		tap_fail "uprobe_unbind" "binding still listed after removal"
> +		cleanup; return
> +	fi
> +
> +	kill "${busy_pid}" 2>/dev/null; wait "${busy_pid}" 2>/dev/null
> +	tap_pass "uprobe_unbind"
> +	cleanup
> +}
> +
> +# ---------------------------------------------------------------------------
> +# Test 6: uprobe - duplicate offset_start rejected
> +#
> +# Registering a second binding with the same offset_start in the same binary
> +# must be rejected with an error, since two entry uprobes at the same address
> +# would cause double tlob_start_task() calls and undefined behaviour.
> +# ---------------------------------------------------------------------------
> +run_test_uprobe_duplicate_offset() {
> +	next_test; cleanup
> +	if [ ! -e "${TLOB_MONITOR}" ]; then
> +		tap_skip "uprobe_duplicate_offset" "monitor file not available"
> +		cleanup; return
> +	fi
> +	if [ ! -x "${UPROBE_TARGET}" ]; then
> +		tap_skip "uprobe_duplicate_offset" \
> +			"tlob_uprobe_target not found or not executable"
> +		cleanup; return
> +	fi
> +
> +	busy_offset=$(get_uprobe_offset "${UPROBE_TARGET}" "tlob_busy_work")
> +	stop_offset=$(get_uprobe_offset "${UPROBE_TARGET}" "tlob_busy_work_done")
> +	if [ -z "${busy_offset}" ] || [ -z "${stop_offset}" ]; then
> +		tap_skip "uprobe_duplicate_offset" \
> +			"cannot resolve tlob_busy_work/tlob_busy_work_done offset"
> +		cleanup; return
> +	fi
> +
> +	tlob_enable
> +
> +	# First binding: should succeed
> +	if ! echo "5000000:${busy_offset}:${stop_offset}:${UPROBE_TARGET}" \
> +	        > "${TLOB_MONITOR}" 2>/dev/null; then
> +		tap_skip "uprobe_duplicate_offset" \
> +			"uprobe binding rejected (CONFIG_UPROBES=y needed)"
> +		cleanup; return
> +	fi
> +
> +	# Second binding with same offset_start: must be rejected
> +	if echo "9999:${busy_offset}:${stop_offset}:${UPROBE_TARGET}" \
> +	        > "${TLOB_MONITOR}" 2>/dev/null; then
> +		echo "-${busy_offset}:${UPROBE_TARGET}" > "${TLOB_MONITOR}" 2>/dev/null
> +		tap_fail "uprobe_duplicate_offset" \
> +			"duplicate offset_start was accepted (expected error)"
> +		cleanup; return
> +	fi
> +
> +	echo "-${busy_offset}:${UPROBE_TARGET}" > "${TLOB_MONITOR}" 2>/dev/null
> +	tap_pass "uprobe_duplicate_offset"
> +	cleanup
> +}
> +
> +
> +# ---------------------------------------------------------------------------
> +# Test 7: uprobe - independent thresholds in the same binary
> +#
> +# Region A: tlob_busy_work with a 5 s budget - should NOT fire during the test.
> +# Region B: tlob_busy_work_done with a 10 us budget - SHOULD fire quickly since
> +#           tlob_uprobe_target calls tlob_busy_work_done after a busy spin.
> +#
> +# Verifies that independent bindings for different offsets in the same binary
> +# are tracked separately and that only the tight-budget binding triggers a
> +# budget_exceeded event.
> +# ---------------------------------------------------------------------------
> +run_test_uprobe_independent_thresholds() {
> +	next_test; cleanup
> +	if [ ! -e "${TLOB_MONITOR}" ]; then
> +		tap_skip "uprobe_independent_thresholds" \
> +			"monitor file not available"; cleanup; return
> +	fi
> +	if [ ! -x "${UPROBE_TARGET}" ]; then
> +		tap_skip "uprobe_independent_thresholds" \
> +			"tlob_uprobe_target not found or not executable"
> +		cleanup; return
> +	fi
> +
> +	busy_offset=$(get_uprobe_offset "${UPROBE_TARGET}" "tlob_busy_work")
> +	busy_stop_offset=$(get_uprobe_offset "${UPROBE_TARGET}" "tlob_busy_work_done")
> +	if [ -z "${busy_offset}" ] || [ -z "${busy_stop_offset}" ]; then
> +		tap_skip "uprobe_independent_thresholds" \
> +			"cannot resolve tlob_busy_work/tlob_busy_work_done offset"
> +		cleanup; return
> +	fi
> +
> +	"${UPROBE_TARGET}" 30000 &
> +	busy_pid=$!
> +	sleep 0.05
> +
> +	trace_event_enable
> +	trace_on
> +	tlob_enable
> +	trace_clear
> +
> +	# Region A: generous 5 s budget on tlob_busy_work entry (should not
> fire)
> +	if ! echo "5000000:${busy_offset}:${busy_stop_offset}:${UPROBE_TARGET}" \
> +	        > "${TLOB_MONITOR}" 2>/dev/null; then
> +		kill "${busy_pid}" 2>/dev/null; wait "${busy_pid}" 2>/dev/null
> +		tap_skip "uprobe_independent_thresholds" \
> +			"uprobe binding rejected (CONFIG_UPROBES=y needed)"
> +		cleanup; return
> +	fi
> +	# Region B: tight 10 us budget on tlob_busy_work_done (fires quickly)
> +	echo "10:${busy_stop_offset}:${busy_stop_offset}:${UPROBE_TARGET}" \
> +		> "${TLOB_MONITOR}" 2>/dev/null
> +
> +	found=0; i=0
> +	while [ "$i" -lt 20 ]; do
> +		sleep 0.1
> +		trace_grep "budget_exceeded" && { found=1; break; }
> +		i=$((i+1))
> +	done
> +
> +	echo "-${busy_offset}:${UPROBE_TARGET}" > "${TLOB_MONITOR}" 2>/dev/null
> +	echo "-${busy_stop_offset}:${UPROBE_TARGET}" > "${TLOB_MONITOR}" 2>/dev/null
> +	kill "${busy_pid}" 2>/dev/null; wait "${busy_pid}" 2>/dev/null
> +
> +	if [ "${found}" != "1" ]; then
> +		tap_fail "uprobe_independent_thresholds" \
> +			"budget_exceeded not raised for tight-budget region within 2 s"
> +		cleanup; return
> +	fi
> +
> +	# The violation must carry threshold=10 (Region B's budget).
> +	ev=$(grep "budget_exceeded" "${TRACE_FILE}" | head -n 1)
> +	if ! echo "${ev}" | grep -q "threshold=10 "; then
> +		tap_fail "uprobe_independent_thresholds" \
> +			"violation threshold is not Region B's 10 us: ${ev}"
> +		cleanup; return
> +	fi
> +	tap_pass "uprobe_independent_thresholds"
> +	cleanup
> +}
> +
> +# ---------------------------------------------------------------------------
> +# ioctl tests via tlob_helper
> +#
> +# Each test invokes the helper with a sub-test name.
> +# Exit code: 0=pass, 1=fail, 2=skip.
> +# ---------------------------------------------------------------------------
> +run_ioctl_test() {
> +	testname=$1
> +	next_test
> +
> +	if [ ! -x "${IOCTL_HELPER}" ]; then
> +		tap_skip "ioctl_${testname}" \
> +			"tlob_helper not found or not executable"
> +		return
> +	fi
> +	if [ ! -c "${RV_DEV}" ]; then
> +		tap_skip "ioctl_${testname}" \
> +			"${RV_DEV} not present (CONFIG_RV_CHARDEV=y needed)"
> +		return
> +	fi
> +
> +	tlob_enable
> +	"${IOCTL_HELPER}" "${testname}"
> +	rc=$?
> +	tlob_disable
> +
> +	case "${rc}" in
> +	0) tap_pass "ioctl_${testname}" ;;
> +	2) tap_skip "ioctl_${testname}" "helper returned skip" ;;
> +	*) tap_fail "ioctl_${testname}" "helper exited with code ${rc}" ;;
> +	esac
> +}
> +
> +# run_ioctl_test_not_enabled - like run_ioctl_test but deliberately does NOT
> +# enable the tlob monitor before invoking the helper.  Used to verify that
> +# ioctls issued against a disabled monitor return ENODEV rather than crashing
> +# the kernel with a NULL pointer dereference.
> +run_ioctl_test_not_enabled()
> +{
> +	next_test
> +
> +	if [ ! -x "${IOCTL_HELPER}" ]; then
> +		tap_skip "ioctl_not_enabled" \
> +			"tlob_helper not found or not executable"
> +		return
> +	fi
> +	if [ ! -c "${RV_DEV}" ]; then
> +		tap_skip "ioctl_not_enabled" \
> +			"${RV_DEV} not present (CONFIG_RV_CHARDEV=y needed)"
> +		return
> +	fi
> +
> +	# Monitor intentionally left disabled.
> +	tlob_disable
> +	"${IOCTL_HELPER}" not_enabled
> +	rc=$?
> +
> +	case "${rc}" in
> +	0) tap_pass "ioctl_not_enabled" ;;
> +	2) tap_skip "ioctl_not_enabled" "helper returned skip" ;;
> +	*) tap_fail "ioctl_not_enabled" "helper exited with code ${rc}" ;;
> +	esac
> +}
> +
> +# ---------------------------------------------------------------------------
> +# Main
> +# ---------------------------------------------------------------------------
> +check_root; check_tracefs; check_rv_dir; check_tlob
> +tap_header; tap_plan 20
> +
> +# tracefs interface tests
> +run_test_enable_disable
> +run_test_tracefs_files
> +
> +# uprobe external monitoring tests
> +run_test_uprobe_no_false_positive
> +run_test_uprobe_violation
> +run_test_uprobe_unbind
> +run_test_uprobe_duplicate_offset
> +run_test_uprobe_independent_thresholds
> +
> +# /dev/rv ioctl self-instrumentation tests
> +run_ioctl_test_not_enabled
> +run_ioctl_test within_budget
> +run_ioctl_test over_budget_cpu
> +run_ioctl_test over_budget_sleep
> +run_ioctl_test double_start
> +run_ioctl_test stop_no_start
> +run_ioctl_test multi_thread
> +run_ioctl_test self_watch
> +run_ioctl_test invalid_flags
> +run_ioctl_test notify_fd_bad
> +run_ioctl_test mmap_basic
> +run_ioctl_test mmap_errors
> +run_ioctl_test mmap_consume
> +
> +echo "# Passed: ${t_pass} Failed: ${t_fail} Skipped: ${t_skip}"
> +[ "${t_fail}" -gt 0 ] && exit 1 || exit 0
> diff --git a/tools/testing/selftests/rv/tlob_helper.c
> b/tools/testing/selftests/rv/tlob_helper.c
> new file mode 100644
> index 000000000..cd76b56d1
> --- /dev/null
> +++ b/tools/testing/selftests/rv/tlob_helper.c
> @@ -0,0 +1,994 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * tlob_helper.c - test helper and ELF utility for tlob selftests
> + *
> + * Called by test_tlob.sh to exercise the /dev/rv ioctl interface and to
> + * resolve ELF symbol offsets for uprobe bindings.  One subcommand per
> + * invocation so the shell script can report each as an independent TAP
> + * test case.
> + *
> + * Usage: tlob_helper <subcommand> [args...]
> + *
> + * Synchronous TRACE_START / TRACE_STOP tests:
> + *   not_enabled        - TRACE_START without tlob enabled -> ENODEV (no kernel crash)
> + *   within_budget      - start(50000 us), sleep 10 ms, stop -> expect 0
> + *   over_budget_cpu    - start(5000 us), busyspin 100 ms, stop -> EOVERFLOW
> + *   over_budget_sleep  - start(3000 us), sleep 50 ms, stop -> EOVERFLOW
> + *
> + * Error-handling tests:
> + *   double_start       - two starts without stop -> EEXIST on second
> + *   stop_no_start      - stop without start -> ESRCH
> + *
> + * Per-thread isolation test:
> + *   multi_thread       - two threads share one fd; one within budget, one over
> + *
> + * Asynchronous notification test (notify_fd + read()):
> + *   self_watch         - one worker exceeds budget; monitor fd receives one notification via read()
> + *
> + * Input-validation tests (TRACE_START error paths):
> + *   invalid_flags      - TRACE_START with flags != 0 -> EINVAL
> + *   notify_fd_bad      - TRACE_START with notify_fd = stdout (non-rv fd) -> EINVAL
> + *
> + * mmap ring buffer tests (Scenario D):
> + *   mmap_basic         - mmap succeeds; verify tlob_mmap_page fields
> + *                        (version, capacity, data_offset, record_size)
> + *   mmap_errors        - MAP_PRIVATE, wrong size, and non-zero pgoff all
> + *                        return EINVAL
> + *   mmap_consume       - trigger a real violation via self-notification and
> + *                        consume the event through the mmap'd ring
> + *
> + * ELF utility (does not require /dev/rv):
> + *   sym_offset <binary> <symbol>
> + *                      - print the ELF file offset of <symbol> in <binary>
> + *                        (used by the shell script to build uprobe bindings)
> + *
> + * Exit code: 0 = pass, 1 = fail, 2 = skip (device not available).
> + */
> +#define _GNU_SOURCE
> +#include <elf.h>
> +#include <errno.h>
> +#include <fcntl.h>
> +#include <poll.h>
> +#include <pthread.h>
> +#include <stdint.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <sys/mman.h>
> +#include <sys/stat.h>
> +#include <time.h>
> +#include <unistd.h>
> +
> +#include <linux/rv.h>
> +
> +/* Default ring capacity allocated at open(); matches TLOB_RING_DEFAULT_CAP. */
> +#define TLOB_RING_DEFAULT_CAP	64U
> +
> +static int rv_fd = -1;
> +
> +static int open_rv(void)
> +{
> +	rv_fd = open("/dev/rv", O_RDWR);
> +	if (rv_fd < 0) {
> +		fprintf(stderr, "open /dev/rv: %s\n", strerror(errno));
> +		return -1;
> +	}
> +	return 0;
> +}
> +
> +static void busy_spin_us(unsigned long us)
> +{
> +	struct timespec start, now;
> +	unsigned long elapsed;
> +
> +	clock_gettime(CLOCK_MONOTONIC, &start);
> +	do {
> +		clock_gettime(CLOCK_MONOTONIC, &now);
> +		elapsed = (unsigned long)(now.tv_sec - start.tv_sec)
> +			  * 1000000000UL
> +			+ (unsigned long)(now.tv_nsec - start.tv_nsec);
> +	} while (elapsed < us * 1000UL);
> +}
> +
> +static int do_start(uint64_t threshold_us)
> +{
> +	struct tlob_start_args args = {
> +		.threshold_us = threshold_us,
> +		.notify_fd    = -1,
> +	};
> +
> +	return ioctl(rv_fd, TLOB_IOCTL_TRACE_START, &args);
> +}
> +
> +static int do_stop(void)
> +{
> +	return ioctl(rv_fd, TLOB_IOCTL_TRACE_STOP, NULL);
> +}
> +
> +/* -----------------------------------------------------------------------
> + * Synchronous TRACE_START / TRACE_STOP tests
> + * -----------------------------------------------------------------------
> + */
> +
> +/*
> + * test_not_enabled - TRACE_START must return ENODEV when the tlob monitor
> + * has not been enabled (tlob_state_cache is NULL).
> + *
> + * The shell wrapper deliberately does NOT call tlob_enable before invoking
> + * this subcommand, so the ioctl is expected to fail with ENODEV rather than
> + * crashing the kernel with a NULL pointer dereference in kmem_cache_alloc.
> + */
> +static int test_not_enabled(void)
> +{
> +	int ret;
> +
> +	ret = do_start(1000);
> +	if (ret == 0) {
> +		fprintf(stderr, "TRACE_START: expected ENODEV, got success\n");
> +		do_stop();
> +		return 1;
> +	}
> +	if (errno != ENODEV) {
> +		fprintf(stderr, "TRACE_START: expected ENODEV, got %s\n",
> +			strerror(errno));
> +		return 1;
> +	}
> +	return 0;
> +}
> +
> +static int test_within_budget(void)
> +{
> +	int ret;
> +
> +	if (do_start(50000) < 0) {
> +		fprintf(stderr, "TRACE_START: %s\n", strerror(errno));
> +		return 1;
> +	}
> +	usleep(10000); /* 10 ms < 50 ms budget */
> +	ret = do_stop();
> +	if (ret != 0) {
> +		fprintf(stderr, "TRACE_STOP: expected 0, got %d errno=%s\n",
> +			ret, strerror(errno));
> +		return 1;
> +	}
> +	return 0;
> +}
> +
> +static int test_over_budget_cpu(void)
> +{
> +	int ret;
> +
> +	if (do_start(5000) < 0) {
> +		fprintf(stderr, "TRACE_START: %s\n", strerror(errno));
> +		return 1;
> +	}
> +	busy_spin_us(100000); /* 100 ms >> 5 ms budget */
> +	ret = do_stop();
> +	if (ret == 0) {
> +		fprintf(stderr, "TRACE_STOP: expected EOVERFLOW, got 0\n");
> +		return 1;
> +	}
> +	if (errno != EOVERFLOW) {
> +		fprintf(stderr, "TRACE_STOP: expected EOVERFLOW, got %s\n",
> +			strerror(errno));
> +		return 1;
> +	}
> +	return 0;
> +}
> +
> +static int test_over_budget_sleep(void)
> +{
> +	int ret;
> +
> +	if (do_start(3000) < 0) {
> +		fprintf(stderr, "TRACE_START: %s\n", strerror(errno));
> +		return 1;
> +	}
> +	usleep(50000); /* 50 ms >> 3 ms budget, off-CPU time counts */
> +	ret = do_stop();
> +	if (ret == 0) {
> +		fprintf(stderr, "TRACE_STOP: expected EOVERFLOW, got 0\n");
> +		return 1;
> +	}
> +	if (errno != EOVERFLOW) {
> +		fprintf(stderr, "TRACE_STOP: expected EOVERFLOW, got %s\n",
> +			strerror(errno));
> +		return 1;
> +	}
> +	return 0;
> +}
> +
> +/* -----------------------------------------------------------------------
> + * Error-handling tests
> + * -----------------------------------------------------------------------
> + */
> +
> +static int test_double_start(void)
> +{
> +	int ret;
> +
> +	if (do_start(10000000) < 0) {
> +		fprintf(stderr, "first TRACE_START: %s\n", strerror(errno));
> +		return 1;
> +	}
> +	ret = do_start(10000000);
> +	if (ret == 0) {
> +		fprintf(stderr, "second TRACE_START: expected EEXIST, got 0\n");
> +		do_stop();
> +		return 1;
> +	}
> +	if (errno != EEXIST) {
> +		fprintf(stderr, "second TRACE_START: expected EEXIST, got %s\n",
> +			strerror(errno));
> +		do_stop();
> +		return 1;
> +	}
> +	do_stop(); /* clean up */
> +	return 0;
> +}
> +
> +static int test_stop_no_start(void)
> +{
> +	int ret;
> +
> +	/* Ensure clean state: ignore error from a stale entry */
> +	do_stop();
> +
> +	ret = do_stop();
> +	if (ret == 0) {
> +		fprintf(stderr, "TRACE_STOP: expected ESRCH, got 0\n");
> +		return 1;
> +	}
> +	if (errno != ESRCH) {
> +		fprintf(stderr, "TRACE_STOP: expected ESRCH, got %s\n",
> +			strerror(errno));
> +		return 1;
> +	}
> +	return 0;
> +}
> +
> +/* -----------------------------------------------------------------------
> + * Per-thread isolation test
> + *
> + * Two threads share a single /dev/rv fd.  The monitor uses task_struct *
> + * as the key, so each thread gets an independent slot regardless of the
> + * shared fd.
> + * -----------------------------------------------------------------------
> + */
> +
> +struct mt_thread_args {
> +	uint64_t      threshold_us;
> +	unsigned long workload_us;
> +	int           busy;
> +	int           expect_eoverflow;
> +	int           result;
> +};
> +
> +static void *mt_thread_fn(void *arg)
> +{
> +	struct mt_thread_args *a = arg;
> +	int ret;
> +
> +	if (do_start(a->threshold_us) < 0) {
> +		fprintf(stderr, "thread TRACE_START: %s\n", strerror(errno));
> +		a->result = 1;
> +		return NULL;
> +	}
> +
> +	if (a->busy)
> +		busy_spin_us(a->workload_us);
> +	else
> +		usleep(a->workload_us);
> +
> +	ret = do_stop();
> +	if (a->expect_eoverflow) {
> +		if (ret == 0 || errno != EOVERFLOW) {
> +			fprintf(stderr, "thread: expected EOVERFLOW, got ret=%d errno=%s\n",
> +				ret, strerror(errno));
> +			a->result = 1;
> +			return NULL;
> +		}
> +	} else {
> +		if (ret != 0) {
> +			fprintf(stderr, "thread: expected 0, got ret=%d errno=%s\n",
> +				ret, strerror(errno));
> +			a->result = 1;
> +			return NULL;
> +		}
> +	}
> +	a->result = 0;
> +	return NULL;
> +}
> +
> +static int test_multi_thread(void)
> +{
> +	pthread_t ta, tb;
> +	struct mt_thread_args a = {
> +		.threshold_us     = 20000,  /* 20 ms */
> +		.workload_us      = 5000,   /* 5 ms sleep -> within budget */
> +		.busy             = 0,
> +		.expect_eoverflow = 0,
> +	};
> +	struct mt_thread_args b = {
> +		.threshold_us     = 3000,   /* 3 ms */
> +		.workload_us      = 30000,  /* 30 ms spin -> over budget */
> +		.busy             = 1,
> +		.expect_eoverflow = 1,
> +	};
> +
> +	pthread_create(&ta, NULL, mt_thread_fn, &a);
> +	pthread_create(&tb, NULL, mt_thread_fn, &b);
> +	pthread_join(ta, NULL);
> +	pthread_join(tb, NULL);
> +
> +	return (a.result || b.result) ? 1 : 0;
> +}
> +
> +/* -----------------------------------------------------------------------
> + * Asynchronous notification test (notify_fd + read())
> + *
> + * A dedicated monitor_fd is opened by the main thread.  Two worker threads
> + * each open their own work_fd and call TLOB_IOCTL_TRACE_START with
> + * notify_fd = monitor_fd, nominating it as the violation target.  Worker A
> + * stays within budget; worker B exceeds it.  The main thread reads from
> + * monitor_fd and expects exactly one tlob_event record.
> + * -----------------------------------------------------------------------
> + */
> +
> +struct sw_worker_args {
> +	int           monitor_fd;
> +	uint64_t      threshold_us;
> +	unsigned long workload_us;
> +	int           busy;
> +	int           result;
> +};
> +
> +static void *sw_worker_fn(void *arg)
> +{
> +	struct sw_worker_args *a = arg;
> +	struct tlob_start_args args = {
> +		.threshold_us = a->threshold_us,
> +		.notify_fd    = a->monitor_fd,
> +	};
> +	int work_fd;
> +	int ret;
> +
> +	work_fd = open("/dev/rv", O_RDWR);
> +	if (work_fd < 0) {
> +		fprintf(stderr, "worker open /dev/rv: %s\n", strerror(errno));
> +		a->result = 1;
> +		return NULL;
> +	}
> +
> +	ret = ioctl(work_fd, TLOB_IOCTL_TRACE_START, &args);
> +	if (ret < 0) {
> +		fprintf(stderr, "TRACE_START (notify): %s\n", strerror(errno));
> +		close(work_fd);
> +		a->result = 1;
> +		return NULL;
> +	}
> +
> +	if (a->busy)
> +		busy_spin_us(a->workload_us);
> +	else
> +		usleep(a->workload_us);
> +
> +	ioctl(work_fd, TLOB_IOCTL_TRACE_STOP, NULL);
> +	close(work_fd);
> +	a->result = 0;
> +	return NULL;
> +}
> +
> +static int test_self_watch(void)
> +{
> +	int monitor_fd;
> +	pthread_t ta, tb;
> +	struct sw_worker_args a = {
> +		.threshold_us = 50000,  /* 50 ms */
> +		.workload_us  = 5000,   /* 5 ms sleep -> no violation */
> +		.busy         = 0,
> +	};
> +	struct sw_worker_args b = {
> +		.threshold_us = 3000,   /* 3 ms */
> +		.workload_us  = 30000,  /* 30 ms spin -> violation */
> +		.busy         = 1,
> +	};
> +	struct tlob_event ntfs[8];
> +	int violations = 0;
> +	ssize_t n;
> +
> +	/*
> +	 * Open monitor_fd with O_NONBLOCK so read() after the workers finish
> +	 * returns immediately rather than blocking forever.
> +	 */
> +	monitor_fd = open("/dev/rv", O_RDWR | O_NONBLOCK);
> +	if (monitor_fd < 0) {
> +		fprintf(stderr, "open /dev/rv (monitor_fd): %s\n", strerror(errno));
> +		return 1;
> +	}
> +	a.monitor_fd = monitor_fd;
> +	b.monitor_fd = monitor_fd;
> +
> +	pthread_create(&ta, NULL, sw_worker_fn, &a);
> +	pthread_create(&tb, NULL, sw_worker_fn, &b);
> +	pthread_join(ta, NULL);
> +	pthread_join(tb, NULL);
> +
> +	if (a.result || b.result) {
> +		close(monitor_fd);
> +		return 1;
> +	}
> +
> +	/*
> +	 * Drain all available tlob_event records.  With O_NONBLOCK the final
> +	 * read() returns -1 with errno == EAGAIN when the buffer is empty.
> +	 */
> +	while ((n = read(monitor_fd, ntfs, sizeof(ntfs))) > 0)
> +		violations += (int)(n / sizeof(struct tlob_event));
> +
> +	close(monitor_fd);
> +
> +	if (violations != 1) {
> +		fprintf(stderr, "self_watch: expected 1 violation, got %d\n",
> +			violations);
> +		return 1;
> +	}
> +	return 0;
> +}
> +
> +/* -----------------------------------------------------------------------
> + * Input-validation tests (TRACE_START error paths)
> + * -----------------------------------------------------------------------
> + */
> +
> +/*
> + * test_invalid_flags - TRACE_START with flags != 0 must return EINVAL.
> + *
> + * The flags field is reserved for future extensions and must be zero.
> + * Callers that set it to a non-zero value are rejected early so that a
> + * future kernel can assign meaning to those bits without silently
> + * ignoring them.
> + */
> +static int test_invalid_flags(void)
> +{
> +	struct tlob_start_args args = {
> +		.threshold_us = 1000,
> +		.notify_fd    = -1,
> +		.flags        = 1,   /* non-zero: must be rejected */
> +	};
> +	int ret;
> +
> +	ret = ioctl(rv_fd, TLOB_IOCTL_TRACE_START, &args);
> +	if (ret == 0) {
> +		fprintf(stderr, "TRACE_START(flags=1): expected EINVAL, got success\n");
> +		do_stop();
> +		return 1;
> +	}
> +	if (errno != EINVAL) {
> +		fprintf(stderr, "TRACE_START(flags=1): expected EINVAL, got %s\n",
> +			strerror(errno));
> +		return 1;
> +	}
> +	return 0;
> +}
> +
> +/*
> + * test_notify_fd_bad - TRACE_START with a non-/dev/rv notify_fd must return
> + * EINVAL.
> + *
> + * When notify_fd >= 0, the kernel resolves it to a struct file and checks
> + * that its private_data is non-NULL (i.e. it is a /dev/rv file descriptor).
> + * Passing stdout (fd 1) supplies a real, open fd whose private_data is NULL,
> + * so the kernel must reject it with EINVAL.
> + */
> +static int test_notify_fd_bad(void)
> +{
> +	struct tlob_start_args args = {
> +		.threshold_us = 1000,
> +		.notify_fd    = STDOUT_FILENO,   /* open but not a /dev/rv fd */
> +		.flags        = 0,
> +	};
> +	int ret;
> +
> +	ret = ioctl(rv_fd, TLOB_IOCTL_TRACE_START, &args);
> +	if (ret == 0) {
> +		fprintf(stderr,
> +			"TRACE_START(notify_fd=stdout): expected EINVAL, got success\n");
> +		do_stop();
> +		return 1;
> +	}
> +	if (errno != EINVAL) {
> +		fprintf(stderr,
> +			"TRACE_START(notify_fd=stdout): expected EINVAL, got %s\n",
> +			strerror(errno));
> +		return 1;
> +	}
> +	return 0;
> +}
> +
> +/* -----------------------------------------------------------------------
> + * mmap ring buffer tests (Scenario D)
> + * -----------------------------------------------------------------------
> + */
> +
> +/*
> + * test_mmap_basic - mmap the ring buffer and verify the control page fields.
> + *
> + * The kernel allocates TLOB_RING_DEFAULT_CAP records at open().  A shared
> + * mmap of PAGE_SIZE + cap * record_size must succeed and the tlob_mmap_page
> + * header must contain consistent values.
> + */
> +static int test_mmap_basic(void)
> +{
> +	long pagesize = sysconf(_SC_PAGESIZE);
> +	size_t mmap_len = (size_t)pagesize +
> +			  TLOB_RING_DEFAULT_CAP * sizeof(struct tlob_event);
> +	/* rv_mmap requires a page-aligned length */
> +	mmap_len = (mmap_len + (size_t)(pagesize - 1)) & ~(size_t)(pagesize - 1);
> +	struct tlob_mmap_page *page;
> +	struct tlob_event *data;
> +	void *map;
> +	int ret = 0;
> +
> +	map = mmap(NULL, mmap_len, PROT_READ | PROT_WRITE, MAP_SHARED, rv_fd, 0);
> +	if (map == MAP_FAILED) {
> +		fprintf(stderr, "mmap_basic: mmap: %s\n", strerror(errno));
> +		return 1;
> +	}
> +
> +	page = (struct tlob_mmap_page *)map;
> +	data = (struct tlob_event *)((char *)map + page->data_offset);
> +
> +	if (page->version != 1) {
> +		fprintf(stderr, "mmap_basic: expected version=1, got %u\n",
> +			page->version);
> +		ret = 1;
> +		goto out;
> +	}
> +	if (page->capacity != TLOB_RING_DEFAULT_CAP) {
> +		fprintf(stderr, "mmap_basic: expected capacity=%u, got %u\n",
> +			TLOB_RING_DEFAULT_CAP, page->capacity);
> +		ret = 1;
> +		goto out;
> +	}
> +	if (page->data_offset != (uint32_t)pagesize) {
> +		fprintf(stderr, "mmap_basic: expected data_offset=%ld, got %u\n",
> +			pagesize, page->data_offset);
> +		ret = 1;
> +		goto out;
> +	}
> +	if (page->record_size != sizeof(struct tlob_event)) {
> +		fprintf(stderr, "mmap_basic: expected record_size=%zu, got %u\n",
> +			sizeof(struct tlob_event), page->record_size);
> +		ret = 1;
> +		goto out;
> +	}
> +	if (page->data_head != 0 || page->data_tail != 0) {
> +		fprintf(stderr, "mmap_basic: ring not empty at open: head=%u tail=%u\n",
> +			page->data_head, page->data_tail);
> +		ret = 1;
> +		goto out;
> +	}
> +	/* Touch the data array to confirm it is accessible. */
> +	(void)data[0].tid;
> +out:
> +	munmap(map, mmap_len);
> +	return ret;
> +}
> +
> +/*
> + * test_mmap_errors - verify that rv_mmap() rejects invalid mmap parameters.
> + *
> + * Four cases are tested; each must return MAP_FAILED with errno == EINVAL:
> + *   1. size one page short of the correct ring length
> + *   2. size one page larger than the correct ring length
> + *   3. MAP_PRIVATE (only MAP_SHARED is permitted)
> + *   4. non-zero vm_pgoff (offset must be 0)
> + */
> +static int test_mmap_errors(void)
> +{
> +	long pagesize = sysconf(_SC_PAGESIZE);
> +	size_t correct_len = (size_t)pagesize +
> +			     TLOB_RING_DEFAULT_CAP * sizeof(struct tlob_event);
> +	/* rv_mmap requires a page-aligned length */
> +	correct_len = (correct_len + (size_t)(pagesize - 1)) & ~(size_t)(pagesize - 1);
> +	void *map;
> +	int ret = 0;
> +
> +	/* Case 1: size one page short (correct_len - 1 would just round back up) */
> +	map = mmap(NULL, correct_len - (size_t)pagesize, PROT_READ | PROT_WRITE,
> +		   MAP_SHARED, rv_fd, 0);
> +	if (map != MAP_FAILED) {
> +		fprintf(stderr, "mmap_errors: short-size mmap succeeded (expected EINVAL)\n");
> +		munmap(map, correct_len - (size_t)pagesize);
> +		ret = 1;
> +	} else if (errno != EINVAL) {
> +		fprintf(stderr, "mmap_errors: short-size: expected EINVAL, got %s\n",
> +			strerror(errno));
> +		ret = 1;
> +	}
> +
> +	/* Case 2: size one page too large */
> +	map = mmap(NULL, correct_len + (size_t)pagesize, PROT_READ | PROT_WRITE,
> +		   MAP_SHARED, rv_fd, 0);
> +	if (map != MAP_FAILED) {
> +		fprintf(stderr, "mmap_errors: oversized mmap succeeded (expected EINVAL)\n");
> +		munmap(map, correct_len + (size_t)pagesize);
> +		ret = 1;
> +	} else if (errno != EINVAL) {
> +		fprintf(stderr, "mmap_errors: oversized: expected EINVAL, got %s\n",
> +			strerror(errno));
> +		ret = 1;
> +	}
> +
> +	/* Case 3: MAP_PRIVATE instead of MAP_SHARED */
> +	map = mmap(NULL, correct_len, PROT_READ | PROT_WRITE,
> +		   MAP_PRIVATE, rv_fd, 0);
> +	if (map != MAP_FAILED) {
> +		fprintf(stderr, "mmap_errors: MAP_PRIVATE succeeded (expected EINVAL)\n");
> +		munmap(map, correct_len);
> +		ret = 1;
> +	} else if (errno != EINVAL) {
> +		fprintf(stderr, "mmap_errors: MAP_PRIVATE: expected EINVAL, got %s\n",
> +			strerror(errno));
> +		ret = 1;
> +	}
> +
> +	/* Case 4: non-zero file offset (pgoff = 1) */
> +	map = mmap(NULL, correct_len, PROT_READ | PROT_WRITE,
> +		   MAP_SHARED, rv_fd, (off_t)pagesize);
> +	if (map != MAP_FAILED) {
> +		fprintf(stderr, "mmap_errors: non-zero pgoff mmap succeeded (expected EINVAL)\n");
> +		munmap(map, correct_len);
> +		ret = 1;
> +	} else if (errno != EINVAL) {
> +		fprintf(stderr, "mmap_errors: non-zero pgoff: expected EINVAL, got %s\n",
> +			strerror(errno));
> +		ret = 1;
> +	}
> +
> +	return ret;
> +}
> +
> +/*
> + * test_mmap_consume - zero-copy consumption of a real violation event.
> + *
> + * Arms a 5 ms budget with self-notification (notify_fd = rv_fd), sleeps
> + * 50 ms (off-CPU violation), then reads the pushed event through the mmap'd
> + * ring without calling read().  Verifies:
> + *   - TRACE_STOP returns EOVERFLOW (budget was exceeded)
> + *   - data_head == 1 after the violation
> + *   - the event fields (threshold_us, tag, tid) are correct
> + *   - data_tail can be advanced to consume the record (ring empties)
> + */
> +static int test_mmap_consume(void)
> +{
> +	long pagesize = sysconf(_SC_PAGESIZE);
> +	size_t mmap_len = (size_t)pagesize +
> +			  TLOB_RING_DEFAULT_CAP * sizeof(struct tlob_event);
> +	/* rv_mmap requires a page-aligned length */
> +	mmap_len = (mmap_len + (size_t)(pagesize - 1)) & ~(size_t)(pagesize - 1);
> +	struct tlob_start_args args = {
> +		.threshold_us = 5000,		/* 5 ms */
> +		.notify_fd    = rv_fd,		/* self-notification */
> +		.tag          = 0xdeadbeefULL,
> +		.flags        = 0,
> +	};
> +	struct tlob_mmap_page *page;
> +	struct tlob_event *data;
> +	void *map;
> +	int stop_ret;
> +	int ret = 0;
> +
> +	map = mmap(NULL, mmap_len, PROT_READ | PROT_WRITE, MAP_SHARED, rv_fd, 0);
> +	if (map == MAP_FAILED) {
> +		fprintf(stderr, "mmap_consume: mmap: %s\n", strerror(errno));
> +		return 1;
> +	}
> +
> +	page = (struct tlob_mmap_page *)map;
> +	data = (struct tlob_event *)((char *)map + page->data_offset);
> +
> +	if (ioctl(rv_fd, TLOB_IOCTL_TRACE_START, &args) < 0) {
> +		fprintf(stderr, "mmap_consume: TRACE_START: %s\n", strerror(errno));
> +		ret = 1;
> +		goto out;
> +	}
> +
> +	usleep(50000); /* 50 ms >> 5 ms budget -> off-CPU violation */
> +
> +	stop_ret = ioctl(rv_fd, TLOB_IOCTL_TRACE_STOP, NULL);
> +	if (stop_ret == 0) {
> +		fprintf(stderr, "mmap_consume: TRACE_STOP returned 0, expected EOVERFLOW\n");
> +		ret = 1;
> +		goto out;
> +	}
> +	if (errno != EOVERFLOW) {
> +		fprintf(stderr, "mmap_consume: TRACE_STOP: expected EOVERFLOW, got %s\n",
> +			strerror(errno));
> +		ret = 1;
> +		goto out;
> +	}
> +
> +	/* Pairs with smp_store_release in tlob_event_push. */
> +	if (__atomic_load_n(&page->data_head, __ATOMIC_ACQUIRE) != 1) {
> +		fprintf(stderr, "mmap_consume: expected data_head=1, got %u\n",
> +			page->data_head);
> +		ret = 1;
> +		goto out;
> +	}
> +	if (page->data_tail != 0) {
> +		fprintf(stderr, "mmap_consume: expected data_tail=0, got %u\n",
> +			page->data_tail);
> +		ret = 1;
> +		goto out;
> +	}
> +
> +	/* Verify record content */
> +	if (data[0].threshold_us != 5000) {
> +		fprintf(stderr, "mmap_consume: expected threshold_us=5000, got %llu\n",
> +			(unsigned long long)data[0].threshold_us);
> +		ret = 1;
> +		goto out;
> +	}
> +	if (data[0].tag != 0xdeadbeefULL) {
> +		fprintf(stderr, "mmap_consume: expected tag=0xdeadbeef, got %llx\n",
> +			(unsigned long long)data[0].tag);
> +		ret = 1;
> +		goto out;
> +	}
> +	if (data[0].tid == 0) {
> +		fprintf(stderr, "mmap_consume: tid is 0\n");
> +		ret = 1;
> +		goto out;
> +	}
> +
> +	/* Consume: advance data_tail and confirm ring is empty */
> +	__atomic_store_n(&page->data_tail, 1U, __ATOMIC_RELEASE);
> +	if (__atomic_load_n(&page->data_head, __ATOMIC_ACQUIRE) !=
> +	    __atomic_load_n(&page->data_tail, __ATOMIC_ACQUIRE)) {
> +		fprintf(stderr, "mmap_consume: ring not empty after consume\n");
> +		ret = 1;
> +	}
> +
> +out:
> +	munmap(map, mmap_len);
> +	return ret;
> +}
> +
> +/* -----------------------------------------------------------------------
> + * ELF utility: sym_offset
> + *
> + * Print the ELF file offset of a symbol in a binary.  Supports 32- and
> + * 64-bit ELF.  Walks the section headers to find .symtab (falling back to
> + * .dynsym), then converts the symbol's virtual address to a file offset
> + * via the PT_LOAD program headers.
> + *
> + * Does not require /dev/rv; used by the shell script to build uprobe
> + * bindings of the form pid:threshold_us:offset_start:offset_stop:binary_path.
> + *
> + * Returns 0 on success (offset printed to stdout), 1 on failure.
> + * -----------------------------------------------------------------------
> + */
> +static int sym_offset(const char *binary, const char *symname)
> +{
> +	int fd;
> +	struct stat st;
> +	void *map;
> +	Elf64_Ehdr *ehdr;
> +	Elf32_Ehdr *ehdr32;
> +	int is64;
> +	uint64_t sym_vaddr = 0;
> +	int found = 0;
> +	uint64_t file_offset = 0;
> +
> +	fd = open(binary, O_RDONLY);
> +	if (fd < 0) {
> +		fprintf(stderr, "open %s: %s\n", binary, strerror(errno));
> +		return 1;
> +	}
> +	if (fstat(fd, &st) < 0) {
> +		close(fd);
> +		return 1;
> +	}
> +	map = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
> +	close(fd);
> +	if (map == MAP_FAILED) {
> +		fprintf(stderr, "mmap: %s\n", strerror(errno));
> +		return 1;
> +	}
> +
> +	/* Identify ELF class */
> +	ehdr = (Elf64_Ehdr *)map;
> +	ehdr32 = (Elf32_Ehdr *)map;
> +	if (st.st_size < EI_NIDENT ||
> +	    ehdr->e_ident[EI_MAG0] != ELFMAG0 ||
> +	    ehdr->e_ident[EI_MAG1] != ELFMAG1 ||
> +	    ehdr->e_ident[EI_MAG2] != ELFMAG2 ||
> +	    ehdr->e_ident[EI_MAG3] != ELFMAG3) {
> +		fprintf(stderr, "%s: not an ELF file\n", binary);
> +		munmap(map, (size_t)st.st_size);
> +		return 1;
> +	}
> +	is64 = (ehdr->e_ident[EI_CLASS] == ELFCLASS64);
> +
> +	if (is64) {
> +		/* Walk section headers to find .symtab or .dynsym */
> +		Elf64_Shdr *shdrs = (Elf64_Shdr *)((char *)map + ehdr->e_shoff);
> +		Elf64_Shdr *shstrtab_hdr = &shdrs[ehdr->e_shstrndx];
> +		const char *shstrtab = (char *)map + shstrtab_hdr->sh_offset;
> +		int si;
> +
> +		/* Prefer .symtab; fall back to .dynsym */
> +		for (int pass = 0; pass < 2 && !found; pass++) {
> +			const char *target = pass ? ".dynsym" : ".symtab";
> +
> +			for (si = 0; si < ehdr->e_shnum && !found; si++) {
> +				Elf64_Shdr *sh = &shdrs[si];
> +				const char *name = shstrtab + sh->sh_name;
> +
> +				if (strcmp(name, target) != 0)
> +					continue;
> +
> +				Elf64_Shdr *strtab_sh = &shdrs[sh->sh_link];
> +				const char *strtab = (char *)map + strtab_sh->sh_offset;
> +				Elf64_Sym *syms = (Elf64_Sym *)((char *)map + sh->sh_offset);
> +				uint64_t nsyms = sh->sh_size / sizeof(Elf64_Sym);
> +				uint64_t j;
> +
> +				for (j = 0; j < nsyms; j++) {
> +					if (strcmp(strtab + syms[j].st_name, symname) == 0) {
> +						sym_vaddr = syms[j].st_value;
> +						found = 1;
> +						break;
> +					}
> +				}
> +			}
> +		}
> +
> +		if (!found) {
> +			fprintf(stderr, "symbol '%s' not found in %s\n", symname, binary);
> +			munmap(map, (size_t)st.st_size);
> +			return 1;
> +		}
> +
> +		/* Convert vaddr to file offset via PT_LOAD segments */
> +		Elf64_Phdr *phdrs = (Elf64_Phdr *)((char *)map + ehdr->e_phoff);
> +		int pi;
> +
> +		for (pi = 0; pi < ehdr->e_phnum; pi++) {
> +			Elf64_Phdr *ph = &phdrs[pi];
> +
> +			if (ph->p_type != PT_LOAD)
> +				continue;
> +			if (sym_vaddr >= ph->p_vaddr &&
> +			    sym_vaddr < ph->p_vaddr + ph->p_filesz) {
> +				file_offset = sym_vaddr - ph->p_vaddr + ph->p_offset;
> +				break;
> +			}
> +		}
> +	} else {
> +		/* 32-bit ELF */
> +		Elf32_Shdr *shdrs = (Elf32_Shdr *)((char *)map + ehdr32->e_shoff);
> +		Elf32_Shdr *shstrtab_hdr = &shdrs[ehdr32->e_shstrndx];
> +		const char *shstrtab = (char *)map + shstrtab_hdr->sh_offset;
> +		int si;
> +		uint32_t sym_vaddr32 = 0;
> +
> +		for (int pass = 0; pass < 2 && !found; pass++) {
> +			const char *target = pass ? ".dynsym" : ".symtab";
> +
> +			for (si = 0; si < ehdr32->e_shnum && !found; si++) {
> +				Elf32_Shdr *sh = &shdrs[si];
> +				const char *name = shstrtab + sh->sh_name;
> +
> +				if (strcmp(name, target) != 0)
> +					continue;
> +
> +				Elf32_Shdr *strtab_sh = &shdrs[sh->sh_link];
> +				const char *strtab = (char *)map + strtab_sh->sh_offset;
> +				Elf32_Sym *syms = (Elf32_Sym *)((char *)map + sh->sh_offset);
> +				uint32_t nsyms = sh->sh_size / sizeof(Elf32_Sym);
> +				uint32_t j;
> +
> +				for (j = 0; j < nsyms; j++) {
> +					if (strcmp(strtab + syms[j].st_name, symname) == 0) {
> +						sym_vaddr32 = syms[j].st_value;
> +						found = 1;
> +						break;
> +					}
> +				}
> +			}
> +		}
> +
> +		if (!found) {
> +			fprintf(stderr, "symbol '%s' not found in %s\n", symname, binary);
> +			munmap(map, (size_t)st.st_size);
> +			return 1;
> +		}
> +
> +		Elf32_Phdr *phdrs = (Elf32_Phdr *)((char *)map + ehdr32->e_phoff);
> +		int pi;
> +
> +		for (pi = 0; pi < ehdr32->e_phnum; pi++) {
> +			Elf32_Phdr *ph = &phdrs[pi];
> +
> +			if (ph->p_type != PT_LOAD)
> +				continue;
> +			if (sym_vaddr32 >= ph->p_vaddr &&
> +			    sym_vaddr32 < ph->p_vaddr + ph->p_filesz) {
> +				file_offset = sym_vaddr32 - ph->p_vaddr + ph->p_offset;
> +				break;
> +			}
> +		}
> +		sym_vaddr = sym_vaddr32;
> +	}
> +
> +	munmap(map, (size_t)st.st_size);
> +
> +	if (!file_offset && sym_vaddr) {
> +		fprintf(stderr, "could not map vaddr 0x%lx to file offset\n",
> +			(unsigned long)sym_vaddr);
> +		return 1;
> +	}
> +
> +	printf("0x%lx\n", (unsigned long)file_offset);
> +	return 0;
> +}
> +
> +int main(int argc, char *argv[])
> +{
> +	int rc;
> +
> +	if (argc < 2) {
> +		fprintf(stderr, "Usage: %s <subcommand> [args...]\n", argv[0]);
> +		return 1;
> +	}
> +
> +	/* sym_offset does not need /dev/rv */
> +	if (strcmp(argv[1], "sym_offset") == 0) {
> +		if (argc < 4) {
> +			fprintf(stderr, "Usage: %s sym_offset <binary> <symbol>\n",
> +				argv[0]);
> +			return 1;
> +		}
> +		return sym_offset(argv[2], argv[3]);
> +	}
> +
> +	if (open_rv() < 0)
> +		return 2; /* skip */
> +
> +	if (strcmp(argv[1], "not_enabled") == 0)
> +		rc = test_not_enabled();
> +	else if (strcmp(argv[1], "within_budget") == 0)
> +		rc = test_within_budget();
> +	else if (strcmp(argv[1], "over_budget_cpu") == 0)
> +		rc = test_over_budget_cpu();
> +	else if (strcmp(argv[1], "over_budget_sleep") == 0)
> +		rc = test_over_budget_sleep();
> +	else if (strcmp(argv[1], "double_start") == 0)
> +		rc = test_double_start();
> +	else if (strcmp(argv[1], "stop_no_start") == 0)
> +		rc = test_stop_no_start();
> +	else if (strcmp(argv[1], "multi_thread") == 0)
> +		rc = test_multi_thread();
> +	else if (strcmp(argv[1], "self_watch") == 0)
> +		rc = test_self_watch();
> +	else if (strcmp(argv[1], "invalid_flags") == 0)
> +		rc = test_invalid_flags();
> +	else if (strcmp(argv[1], "notify_fd_bad") == 0)
> +		rc = test_notify_fd_bad();
> +	else if (strcmp(argv[1], "mmap_basic") == 0)
> +		rc = test_mmap_basic();
> +	else if (strcmp(argv[1], "mmap_errors") == 0)
> +		rc = test_mmap_errors();
> +	else if (strcmp(argv[1], "mmap_consume") == 0)
> +		rc = test_mmap_consume();
> +	else {
> +		fprintf(stderr, "Unknown test: %s\n", argv[1]);
> +		rc = 1;
> +	}
> +
> +	close(rv_fd);
> +	return rc;
> +}
> diff --git a/tools/testing/selftests/rv/tlob_uprobe_target.c b/tools/testing/selftests/rv/tlob_uprobe_target.c
> new file mode 100644
> index 000000000..6c895cb40
> --- /dev/null
> +++ b/tools/testing/selftests/rv/tlob_uprobe_target.c
> @@ -0,0 +1,108 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * tlob_uprobe_target.c - uprobe target binary for tlob selftests.
> + *
> + * Provides two well-known probe points:
> + *   tlob_busy_work()      - start probe: arms the tlob budget timer
> + *   tlob_busy_work_done() - stop  probe: cancels the timer on completion
> + *
> + * The tlob selftest writes a five-field uprobe binding:
> + *   pid:threshold_us:offset_start:offset_stop:binary_path
> + * where offset_start is the file offset of tlob_busy_work and offset_stop
> + * is the file offset of tlob_busy_work_done (resolved via tlob_helper
> + * sym_offset).
> + *
> + * Both probe points are plain entry uprobes (no uretprobe).  The busy loop
> + * keeps the task on-CPU so that either the stop probe fires cleanly (within
> + * budget) or the hrtimer fires first and emits tlob_budget_exceeded (over
> + * budget).
> + *
> + * Usage: tlob_uprobe_target <duration_ms>
> + *
> + * Loops calling tlob_busy_work() in 200 ms iterations until <duration_ms>
> + * has elapsed (0 = run for ~24 hours).  Short iterations ensure the uprobe
> + * entry fires on every call even if the uprobe is installed after the
> + * program has started.
> + */
> +#define _GNU_SOURCE
> +#include <stdint.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <time.h>
> +
> +#ifndef noinline
> +#define noinline __attribute__((noinline))
> +#endif
> +
> +static inline int timespec_before(const struct timespec *a,
> +				   const struct timespec *b)
> +{
> +	return a->tv_sec < b->tv_sec ||
> +	       (a->tv_sec == b->tv_sec && a->tv_nsec < b->tv_nsec);
> +}
> +
> +static void timespec_add_ms(struct timespec *ts, unsigned long ms)
> +{
> +	ts->tv_sec  += ms / 1000;
> +	ts->tv_nsec += (long)(ms % 1000) * 1000000L;
> +	if (ts->tv_nsec >= 1000000000L) {
> +		ts->tv_sec++;
> +		ts->tv_nsec -= 1000000000L;
> +	}
> +}
> +
> +/*
> + * tlob_busy_work_done - stop-probe target.
> + *
> + * Called by tlob_busy_work() after the busy loop.  The uprobe on this
> + * function's entry fires tlob_stop_task(), cancelling the budget timer.
> + * noinline ensures the compiler never merges this function with its caller,
> + * guaranteeing the entry uprobe always fires.
> + */
> +noinline void tlob_busy_work_done(void)
> +{
> +	/* empty: the uprobe fires on entry */
> +}
> +
> +/*
> + * tlob_busy_work - start-probe target.
> + *
> + * The uprobe on this function's entry fires tlob_start_task(), arming the
> + * budget timer.  noinline prevents the compiler and linker (including LTO)
> + * from inlining this function into its callers, ensuring the entry uprobe
> + * fires on every call.
> + */
> +noinline void tlob_busy_work(unsigned long duration_ns)
> +{
> +	struct timespec start, now;
> +	unsigned long elapsed;
> +
> +	clock_gettime(CLOCK_MONOTONIC, &start);
> +	do {
> +		clock_gettime(CLOCK_MONOTONIC, &now);
> +		elapsed = (unsigned long)(now.tv_sec - start.tv_sec)
> +			  * 1000000000UL
> +			+ (unsigned long)(now.tv_nsec - start.tv_nsec);
> +	} while (elapsed < duration_ns);
> +
> +	tlob_busy_work_done();
> +}
> +
> +int main(int argc, char *argv[])
> +{
> +	unsigned long duration_ms = 0;
> +	struct timespec deadline, now;
> +
> +	if (argc >= 2)
> +		duration_ms = strtoul(argv[1], NULL, 10);
> +
> +	clock_gettime(CLOCK_MONOTONIC, &deadline);
> +	timespec_add_ms(&deadline, duration_ms ? duration_ms : 86400000UL);
> +
> +	do {
> +		tlob_busy_work(200 * 1000000UL); /* 200 ms per iteration */
> +		clock_gettime(CLOCK_MONOTONIC, &now);
> +	} while (timespec_before(&now, &deadline));
> +
> +	return 0;
> +}


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH 3/4] rv/tlob: Add KUnit tests for the tlob monitor
  2026-04-12 19:27 ` [RFC PATCH 3/4] rv/tlob: Add KUnit tests for the tlob monitor wen.yang
@ 2026-04-16 12:09   ` Gabriele Monaco
  0 siblings, 0 replies; 11+ messages in thread
From: Gabriele Monaco @ 2026-04-16 12:09 UTC (permalink / raw)
  To: wen.yang
  Cc: linux-trace-kernel, linux-kernel, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers

On Mon, 2026-04-13 at 03:27 +0800, wen.yang@linux.dev wrote:
> From: Wen Yang <wen.yang@linux.dev>
> 
> Add six KUnit test suites gated behind CONFIG_TLOB_KUNIT_TEST
> (depends on RV_MON_TLOB && KUNIT; default KUNIT_ALL_TESTS).
> A .kunitconfig fragment is provided for the kunit.py runner.
> 
> Coverage: automaton state transitions and self-loops; start/stop API
> error paths (duplicate start, missing start, overflow threshold,
> table-full, immediate deadline); scheduler context-switch accounting
> for on/off-CPU time; violation tracepoint payload fields; ring buffer
> push, drop-new overflow, and wakeup; and the uprobe line parser.
> 
> Signed-off-by: Wen Yang <wen.yang@linux.dev>

I was considering adding Kunit tests and thought to have them a bit more
integrated ([1] if you want to have a peek before I submit it for RFC, mind it's
a bit raw).

The problem with reimplementing the da_handle_event() is that you are in fact
validating only the model matrix, several other things could go wrong before you
get there (whether the monitor was started properly, other things you might be
doing from the tracepoint handler before you handle events, etc.).

Also, I believe it's a bit of an overkill to validate every single transition
like this, especially considering the work once you update the model for
whatever reason.

One meaningful thing to validate is that a certain sequence of events with a
certain timing causes a violation (or if you want, that a good sequence does
not), for instance. But that's just my opinion, of course.

Thanks,
Gabriele

> ---
>  kernel/trace/rv/Makefile                   |    1 +
>  kernel/trace/rv/monitors/tlob/.kunitconfig |    5 +
>  kernel/trace/rv/monitors/tlob/Kconfig      |   12 +
>  kernel/trace/rv/monitors/tlob/tlob.c       |    1 +
>  kernel/trace/rv/monitors/tlob/tlob_kunit.c | 1194 ++++++++++++++++++++
>  5 files changed, 1213 insertions(+)
>  create mode 100644 kernel/trace/rv/monitors/tlob/.kunitconfig
>  create mode 100644 kernel/trace/rv/monitors/tlob/tlob_kunit.c
> 
> diff --git a/kernel/trace/rv/Makefile b/kernel/trace/rv/Makefile
> index cc3781a3b..6d963207d 100644
> --- a/kernel/trace/rv/Makefile
> +++ b/kernel/trace/rv/Makefile
> @@ -19,6 +19,7 @@ obj-$(CONFIG_RV_MON_NRP) += monitors/nrp/nrp.o
>  obj-$(CONFIG_RV_MON_SSSW) += monitors/sssw/sssw.o
>  obj-$(CONFIG_RV_MON_OPID) += monitors/opid/opid.o
>  obj-$(CONFIG_RV_MON_TLOB) += monitors/tlob/tlob.o
> +obj-$(CONFIG_TLOB_KUNIT_TEST) += monitors/tlob/tlob_kunit.o
>  # Add new monitors here
>  obj-$(CONFIG_RV_REACTORS) += rv_reactors.o
>  obj-$(CONFIG_RV_REACT_PRINTK) += reactor_printk.o
> diff --git a/kernel/trace/rv/monitors/tlob/.kunitconfig b/kernel/trace/rv/monitors/tlob/.kunitconfig
> new file mode 100644
> index 000000000..977c58601
> --- /dev/null
> +++ b/kernel/trace/rv/monitors/tlob/.kunitconfig
> @@ -0,0 +1,5 @@
> +CONFIG_FTRACE=y
> +CONFIG_KUNIT=y
> +CONFIG_RV=y
> +CONFIG_RV_MON_TLOB=y
> +CONFIG_TLOB_KUNIT_TEST=y
> diff --git a/kernel/trace/rv/monitors/tlob/Kconfig b/kernel/trace/rv/monitors/tlob/Kconfig
> index 010237480..4ccd2f881 100644
> --- a/kernel/trace/rv/monitors/tlob/Kconfig
> +++ b/kernel/trace/rv/monitors/tlob/Kconfig
> @@ -49,3 +49,15 @@ config RV_MON_TLOB
>  	  For further information, see:
>  	    Documentation/trace/rv/monitor_tlob.rst
>  
> +config TLOB_KUNIT_TEST
> +	tristate "KUnit tests for tlob monitor" if !KUNIT_ALL_TESTS
> +	depends on RV_MON_TLOB && KUNIT
> +	default KUNIT_ALL_TESTS
> +	help
> +	  Enable KUnit in-kernel unit tests for the tlob RV monitor.
> +
> +	  Tests cover automaton state transitions, the hash table helpers,
> +	  the start/stop task interface, and the event ring buffer including
> +	  overflow handling and wakeup behaviour.
> +
> +	  Say Y or M here to run the tlob KUnit test suite; otherwise say N.
> diff --git a/kernel/trace/rv/monitors/tlob/tlob.c b/kernel/trace/rv/monitors/tlob/tlob.c
> index a6e474025..dd959eb9b 100644
> --- a/kernel/trace/rv/monitors/tlob/tlob.c
> +++ b/kernel/trace/rv/monitors/tlob/tlob.c
> @@ -784,6 +784,7 @@ VISIBLE_IF_KUNIT int tlob_parse_uprobe_line(char *buf, u64 *thr_out,
>  	*path_out  = buf + n;
>  	return 0;
>  }
> +EXPORT_SYMBOL_IF_KUNIT(tlob_parse_uprobe_line);
>  
>  static ssize_t tlob_monitor_write(struct file *file,
>  				  const char __user *ubuf,
> diff --git a/kernel/trace/rv/monitors/tlob/tlob_kunit.c b/kernel/trace/rv/monitors/tlob/tlob_kunit.c
> new file mode 100644
> index 000000000..64f5abb34
> --- /dev/null
> +++ b/kernel/trace/rv/monitors/tlob/tlob_kunit.c
> @@ -0,0 +1,1194 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * KUnit tests for the tlob RV monitor.
> + *
> + * tlob_automaton:         DA transition table coverage.
> + * tlob_task_api:          tlob_start_task()/tlob_stop_task() lifecycle and errors.
> + * tlob_sched_integration: on/off-CPU accounting across real context switches.
> + * tlob_trace_output:      tlob_budget_exceeded tracepoint field verification.
> + * tlob_event_buf:         ring buffer push, overflow, and wakeup.
> + * tlob_parse_uprobe:      uprobe format string parser acceptance and rejection.
> + *
> + * The duplicate-(binary, offset_start) constraint enforced by tlob_add_uprobe()
> + * is not covered here: that function calls kern_path() and requires a real
> + * filesystem, which is outside the scope of unit tests. It is covered by the
> + * uprobe_duplicate_offset case in tools/testing/selftests/rv/test_tlob.sh.
> + */
> +#include <kunit/test.h>
> +#include <linux/atomic.h>
> +#include <linux/completion.h>
> +#include <linux/delay.h>
> +#include <linux/kthread.h>
> +#include <linux/ktime.h>
> +#include <linux/mutex.h>
> +#include <linux/sched.h>
> +#include <linux/sched/task.h>
> +#include <linux/tracepoint.h>
> +
> +/*
> + * Pull in the rv tracepoint declarations so that
> + * register_trace_tlob_budget_exceeded() is available.
> + * No CREATE_TRACE_POINTS here -- the tracepoint implementation lives in rv.c.
> + */
> +#include <rv_trace.h>
> +
> +#include "tlob.h"
> +
> +/*
> + * da_handle_event_tlob - apply one automaton transition on @da_mon.
> + *
> + * This helper is used only by the KUnit automaton suite. It applies the
> + * tlob transition table directly on a supplied da_monitor without touching
> + * per-task slots, tracepoints, or timers.
> + */
> +static void da_handle_event_tlob(struct da_monitor *da_mon,
> +				 enum events_tlob event)
> +{
> +	enum states_tlob curr_state = (enum states_tlob)da_mon->curr_state;
> +	enum states_tlob next_state =
> +		(enum states_tlob)automaton_tlob.function[curr_state][event];
> +
> +	if (next_state != INVALID_STATE)
> +		da_mon->curr_state = next_state;
> +}
> +
> +MODULE_IMPORT_NS("EXPORTED_FOR_KUNIT_TESTING");
> +
> +/*
> + * Suite 1: automaton state-machine transitions
> + */
> +
> +/* unmonitored -> trace_start -> on_cpu */
> +static void tlob_unmonitored_to_on_cpu(struct kunit *test)
> +{
> +	struct da_monitor mon = { .curr_state = unmonitored_tlob };
> +
> +	da_handle_event_tlob(&mon, trace_start_tlob);
> +	KUNIT_EXPECT_EQ(test, (int)mon.curr_state, (int)on_cpu_tlob);
> +}
> +
> +/* on_cpu -> switch_out -> off_cpu */
> +static void tlob_on_cpu_switch_out(struct kunit *test)
> +{
> +	struct da_monitor mon = { .curr_state = on_cpu_tlob };
> +
> +	da_handle_event_tlob(&mon, switch_out_tlob);
> +	KUNIT_EXPECT_EQ(test, (int)mon.curr_state, (int)off_cpu_tlob);
> +}
> +
> +/* off_cpu -> switch_in -> on_cpu */
> +static void tlob_off_cpu_switch_in(struct kunit *test)
> +{
> +	struct da_monitor mon = { .curr_state = off_cpu_tlob };
> +
> +	da_handle_event_tlob(&mon, switch_in_tlob);
> +	KUNIT_EXPECT_EQ(test, (int)mon.curr_state, (int)on_cpu_tlob);
> +}
> +
> +/* on_cpu -> budget_expired -> unmonitored */
> +static void tlob_on_cpu_budget_expired(struct kunit *test)
> +{
> +	struct da_monitor mon = { .curr_state = on_cpu_tlob };
> +
> +	da_handle_event_tlob(&mon, budget_expired_tlob);
> +	KUNIT_EXPECT_EQ(test, (int)mon.curr_state, (int)unmonitored_tlob);
> +}
> +
> +/* off_cpu -> budget_expired -> unmonitored */
> +static void tlob_off_cpu_budget_expired(struct kunit *test)
> +{
> +	struct da_monitor mon = { .curr_state = off_cpu_tlob };
> +
> +	da_handle_event_tlob(&mon, budget_expired_tlob);
> +	KUNIT_EXPECT_EQ(test, (int)mon.curr_state, (int)unmonitored_tlob);
> +}
> +
> +/* on_cpu -> trace_stop -> unmonitored */
> +static void tlob_on_cpu_trace_stop(struct kunit *test)
> +{
> +	struct da_monitor mon = { .curr_state = on_cpu_tlob };
> +
> +	da_handle_event_tlob(&mon, trace_stop_tlob);
> +	KUNIT_EXPECT_EQ(test, (int)mon.curr_state, (int)unmonitored_tlob);
> +}
> +
> +/* off_cpu -> trace_stop -> unmonitored */
> +static void tlob_off_cpu_trace_stop(struct kunit *test)
> +{
> +	struct da_monitor mon = { .curr_state = off_cpu_tlob };
> +
> +	da_handle_event_tlob(&mon, trace_stop_tlob);
> +	KUNIT_EXPECT_EQ(test, (int)mon.curr_state, (int)unmonitored_tlob);
> +}
> +
> +/* budget_expired -> unmonitored; a single trace_start re-enters on_cpu. */
> +static void tlob_violation_then_restart(struct kunit *test)
> +{
> +	struct da_monitor mon = { .curr_state = unmonitored_tlob };
> +
> +	da_handle_event_tlob(&mon, trace_start_tlob);
> +	KUNIT_EXPECT_EQ(test, (int)mon.curr_state, (int)on_cpu_tlob);
> +
> +	da_handle_event_tlob(&mon, budget_expired_tlob);
> +	KUNIT_EXPECT_EQ(test, (int)mon.curr_state, (int)unmonitored_tlob);
> +
> +	/* Single trace_start is sufficient to re-enter on_cpu */
> +	da_handle_event_tlob(&mon, trace_start_tlob);
> +	KUNIT_EXPECT_EQ(test, (int)mon.curr_state, (int)on_cpu_tlob);
> +
> +	da_handle_event_tlob(&mon, trace_stop_tlob);
> +	KUNIT_EXPECT_EQ(test, (int)mon.curr_state, (int)unmonitored_tlob);
> +}
> +
> +/* off_cpu self-loops on switch_out and sched_wakeup. */
> +static void tlob_off_cpu_self_loops(struct kunit *test)
> +{
> +	static const enum events_tlob events[] = {
> +		switch_out_tlob, sched_wakeup_tlob,
> +	};
> +	unsigned int i;
> +
> +	for (i = 0; i < ARRAY_SIZE(events); i++) {
> +		struct da_monitor mon = { .curr_state = off_cpu_tlob };
> +
> +		da_handle_event_tlob(&mon, events[i]);
> +		KUNIT_EXPECT_EQ_MSG(test, (int)mon.curr_state,
> +				    (int)off_cpu_tlob,
> +				    "event %u should self-loop in off_cpu",
> +				    events[i]);
> +	}
> +}
> +
> +/* on_cpu self-loops on sched_wakeup. */
> +static void tlob_on_cpu_self_loops(struct kunit *test)
> +{
> +	struct da_monitor mon = { .curr_state = on_cpu_tlob };
> +
> +	da_handle_event_tlob(&mon, sched_wakeup_tlob);
> +	KUNIT_EXPECT_EQ_MSG(test, (int)mon.curr_state, (int)on_cpu_tlob,
> +			    "sched_wakeup should self-loop in on_cpu");
> +}
> +
> +/* Scheduling events in unmonitored self-loop (no state change). */
> +static void tlob_unmonitored_ignores_sched(struct kunit *test)
> +{
> +	static const enum events_tlob events[] = {
> +		switch_in_tlob, switch_out_tlob, sched_wakeup_tlob,
> +	};
> +	unsigned int i;
> +
> +	for (i = 0; i < ARRAY_SIZE(events); i++) {
> +		struct da_monitor mon = { .curr_state = unmonitored_tlob };
> +
> +		da_handle_event_tlob(&mon, events[i]);
> +		KUNIT_EXPECT_EQ_MSG(test, (int)mon.curr_state,
> +				    (int)unmonitored_tlob,
> +				    "event %u should self-loop in unmonitored",
> +				    events[i]);
> +	}
> +}
> +
> +static void tlob_full_happy_path(struct kunit *test)
> +{
> +	struct da_monitor mon = { .curr_state = unmonitored_tlob };
> +
> +	da_handle_event_tlob(&mon, trace_start_tlob);
> +	KUNIT_EXPECT_EQ(test, (int)mon.curr_state, (int)on_cpu_tlob);
> +
> +	da_handle_event_tlob(&mon, switch_out_tlob);
> +	KUNIT_EXPECT_EQ(test, (int)mon.curr_state, (int)off_cpu_tlob);
> +
> +	da_handle_event_tlob(&mon, switch_in_tlob);
> +	KUNIT_EXPECT_EQ(test, (int)mon.curr_state, (int)on_cpu_tlob);
> +
> +	da_handle_event_tlob(&mon, trace_stop_tlob);
> +	KUNIT_EXPECT_EQ(test, (int)mon.curr_state, (int)unmonitored_tlob);
> +}
> +
> +static void tlob_multiple_switches(struct kunit *test)
> +{
> +	struct da_monitor mon = { .curr_state = unmonitored_tlob };
> +	int i;
> +
> +	da_handle_event_tlob(&mon, trace_start_tlob);
> +	KUNIT_EXPECT_EQ(test, (int)mon.curr_state, (int)on_cpu_tlob);
> +
> +	for (i = 0; i < 3; i++) {
> +		da_handle_event_tlob(&mon, switch_out_tlob);
> +		KUNIT_EXPECT_EQ(test, (int)mon.curr_state, (int)off_cpu_tlob);
> +		da_handle_event_tlob(&mon, switch_in_tlob);
> +		KUNIT_EXPECT_EQ(test, (int)mon.curr_state, (int)on_cpu_tlob);
> +	}
> +
> +	da_handle_event_tlob(&mon, trace_stop_tlob);
> +	KUNIT_EXPECT_EQ(test, (int)mon.curr_state, (int)unmonitored_tlob);
> +}
> +
> +static struct kunit_case tlob_automaton_cases[] = {
> +	KUNIT_CASE(tlob_unmonitored_to_on_cpu),
> +	KUNIT_CASE(tlob_on_cpu_switch_out),
> +	KUNIT_CASE(tlob_off_cpu_switch_in),
> +	KUNIT_CASE(tlob_on_cpu_budget_expired),
> +	KUNIT_CASE(tlob_off_cpu_budget_expired),
> +	KUNIT_CASE(tlob_on_cpu_trace_stop),
> +	KUNIT_CASE(tlob_off_cpu_trace_stop),
> +	KUNIT_CASE(tlob_off_cpu_self_loops),
> +	KUNIT_CASE(tlob_on_cpu_self_loops),
> +	KUNIT_CASE(tlob_unmonitored_ignores_sched),
> +	KUNIT_CASE(tlob_full_happy_path),
> +	KUNIT_CASE(tlob_violation_then_restart),
> +	KUNIT_CASE(tlob_multiple_switches),
> +	{}
> +};
> +
> +static struct kunit_suite tlob_automaton_suite = {
> +	.name       = "tlob_automaton",
> +	.test_cases = tlob_automaton_cases,
> +};
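For review convenience, the transition relation the thirteen cases above walk can be modeled as a plain C table. This is an illustrative sketch only: the state/event names, the `INVALID` marker, and the `tlob_next_state()` helper are hypothetical stand-ins for the generated automata header, and a blocked (INVALID) event leaves the state unchanged, as in the `da_handle_event_tlob()` helper in the test file.

```c
#include <assert.h>

/* Illustrative mirrors of the tlob DOT model, not the generated names. */
enum tlob_state { UNMONITORED, ON_CPU, OFF_CPU, NR_STATES };
enum tlob_event {
	EV_TRACE_START, EV_TRACE_STOP, EV_SWITCH_IN,
	EV_SWITCH_OUT, EV_SCHED_WAKEUP, EV_BUDGET_EXPIRED, NR_EVENTS
};

#define INVALID (-1)

/* function[state][event]: next state, or INVALID for a blocked event.
 * Sched events self-loop in unmonitored; sched_wakeup self-loops everywhere. */
const int function[NR_STATES][NR_EVENTS] = {
	/*              start    stop         in           out          wakeup       expired */
	[UNMONITORED] = { ON_CPU,  INVALID,     UNMONITORED, UNMONITORED, UNMONITORED, INVALID },
	[ON_CPU]      = { INVALID, UNMONITORED, INVALID,     OFF_CPU,     ON_CPU,      UNMONITORED },
	[OFF_CPU]     = { INVALID, UNMONITORED, ON_CPU,      OFF_CPU,     OFF_CPU,     UNMONITORED },
};

/* Apply one event; a blocked event keeps the current state, matching the
 * test helper's behaviour. */
int tlob_next_state(int state, int event)
{
	int next = function[state][event];

	return next == INVALID ? state : next;
}
```

Under this model the happy path (start, switch_out, switch_in, stop) and both violation exits land back in unmonitored, which is exactly what the suite asserts.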
> +
> +/*
> + * Suite 2: task registration API
> + */
> +
> +/* Basic start/stop cycle */
> +static void tlob_start_stop_ok(struct kunit *test)
> +{
> +	int ret;
> +
> +	ret = tlob_start_task(current, 10000000 /* 10 s, won't fire */, NULL, 0);
> +	KUNIT_ASSERT_EQ(test, ret, 0);
> +	KUNIT_EXPECT_EQ(test, tlob_stop_task(current), 0);
> +}
> +
> +/* Double start must return -EEXIST. */
> +static void tlob_double_start(struct kunit *test)
> +{
> +	KUNIT_ASSERT_EQ(test, tlob_start_task(current, 10000000, NULL, 0), 0);
> +	KUNIT_EXPECT_EQ(test, tlob_start_task(current, 10000000, NULL, 0), -EEXIST);
> +	tlob_stop_task(current);
> +}
> +
> +/* Stop without start must return -ESRCH. */
> +static void tlob_stop_without_start(struct kunit *test)
> +{
> +	tlob_stop_task(current);  /* clear any stale entry first */
> +	KUNIT_EXPECT_EQ(test, tlob_stop_task(current), -ESRCH);
> +}
> +
> +/*
> + * A 1 us budget fires before tlob_stop_task() is called. Either the
> + * timer wins (-ESRCH) or we are very fast (0); both are valid.
> + */
> +static void tlob_immediate_deadline(struct kunit *test)
> +{
> +	int ret = tlob_start_task(current, 1 /* 1 us - fires almost immediately */,
> +				  NULL, 0);
> +
> +	KUNIT_ASSERT_EQ(test, ret, 0);
> +	/* Let the 1 us timer fire */
> +	udelay(100);
> +	/*
> +	 * By now the hrtimer has almost certainly fired. Either it has
> +	 * (returns -ESRCH) or we were very fast (returns 0). Both are
> +	 * acceptable; just ensure no crash and the table is clean after.
> +	 */
> +	ret = tlob_stop_task(current);
> +	KUNIT_EXPECT_TRUE(test, ret == 0 || ret == -ESRCH);
> +}
> +
> +/*
> + * Fill the table to TLOB_MAX_MONITORED using kthreads (each needs a
> + * distinct task_struct), then verify the next start returns -ENOSPC.
> + */
> +struct tlob_waiter_ctx {
> +	struct completion start;
> +	struct completion done;
> +};
> +
> +static int tlob_waiter_fn(void *arg)
> +{
> +	struct tlob_waiter_ctx *ctx = arg;
> +
> +	wait_for_completion(&ctx->start);
> +	complete(&ctx->done);
> +	return 0;
> +}
> +
> +static void tlob_enospc(struct kunit *test)
> +{
> +	struct tlob_waiter_ctx *ctxs;
> +	struct task_struct **threads;
> +	int i, ret;
> +
> +	ctxs = kunit_kcalloc(test, TLOB_MAX_MONITORED,
> +			     sizeof(*ctxs), GFP_KERNEL);
> +	KUNIT_ASSERT_NOT_NULL(test, ctxs);
> +
> +	threads = kunit_kcalloc(test, TLOB_MAX_MONITORED,
> +				sizeof(*threads), GFP_KERNEL);
> +	KUNIT_ASSERT_NOT_NULL(test, threads);
> +
> +	/* Start TLOB_MAX_MONITORED kthreads and monitor each */
> +	for (i = 0; i < TLOB_MAX_MONITORED; i++) {
> +		init_completion(&ctxs[i].start);
> +		init_completion(&ctxs[i].done);
> +
> +		threads[i] = kthread_run(tlob_waiter_fn, &ctxs[i],
> +					 "tlob_waiter_%d", i);
> +		if (IS_ERR(threads[i])) {
> +			KUNIT_FAIL(test, "kthread_run failed at i=%d", i);
> +			threads[i] = NULL;
> +			goto cleanup;
> +		}
> +		get_task_struct(threads[i]);
> +
> +		ret = tlob_start_task(threads[i], 10000000, NULL, 0);
> +		if (ret != 0) {
> +			KUNIT_FAIL(test, "tlob_start_task failed at i=%d: %d",
> +				   i, ret);
> +			put_task_struct(threads[i]);
> +			complete(&ctxs[i].start);
> +			goto cleanup;
> +		}
> +	}
> +
> +	/* The table is now full: one more must fail with -ENOSPC */
> +	ret = tlob_start_task(current, 10000000, NULL, 0);
> +	KUNIT_EXPECT_EQ(test, ret, -ENOSPC);
> +
> +cleanup:
> +	/*
> +	 * Two-pass cleanup: cancel tlob monitoring and unblock kthreads first,
> +	 * then kthread_stop() to wait for full exit before releasing refs.
> +	 */
> +	for (i = 0; i < TLOB_MAX_MONITORED; i++) {
> +		if (!threads[i])
> +			break;
> +		tlob_stop_task(threads[i]);
> +		complete(&ctxs[i].start);
> +	}
> +	for (i = 0; i < TLOB_MAX_MONITORED; i++) {
> +		if (!threads[i])
> +			break;
> +		kthread_stop(threads[i]);
> +		put_task_struct(threads[i]);
> +	}
> +}
> +
> +/*
> + * A kthread holds a mutex for 80 ms; arm a 10 ms budget, burn ~1 ms
> + * on-CPU, then block on the mutex. The timer fires off-CPU; stop
> + * must return -ESRCH.
> + */
> +struct tlob_holder_ctx {
> +	struct mutex		lock;
> +	struct completion	ready;
> +	unsigned int		hold_ms;
> +};
> +
> +static int tlob_holder_fn(void *arg)
> +{
> +	struct tlob_holder_ctx *ctx = arg;
> +
> +	mutex_lock(&ctx->lock);
> +	complete(&ctx->ready);
> +	msleep(ctx->hold_ms);
> +	mutex_unlock(&ctx->lock);
> +	return 0;
> +}
> +
> +static void tlob_deadline_fires_off_cpu(struct kunit *test)
> +{
> +	struct tlob_holder_ctx ctx = { .hold_ms = 80 };
> +	struct task_struct *holder;
> +	ktime_t t0;
> +	int ret;
> +
> +	mutex_init(&ctx.lock);
> +	init_completion(&ctx.ready);
> +
> +	holder = kthread_run(tlob_holder_fn, &ctx, "tlob_holder_kunit");
> +	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, holder);
> +	wait_for_completion(&ctx.ready);
> +
> +	/* Arm 10 ms budget while kthread holds the mutex. */
> +	ret = tlob_start_task(current, 10000, NULL, 0);
> +	KUNIT_ASSERT_EQ(test, ret, 0);
> +
> +	/* Phase 1: burn ~1 ms on-CPU to exercise on_cpu accounting. */
> +	t0 = ktime_get();
> +	while (ktime_us_delta(ktime_get(), t0) < 1000)
> +		cpu_relax();
> +
> +	/*
> +	 * Phase 2: block on the mutex -> on_cpu->off_cpu transition.
> +	 * The 10 ms budget fires while we are off-CPU.
> +	 */
> +	mutex_lock(&ctx.lock);
> +	mutex_unlock(&ctx.lock);
> +
> +	/* Timer already fired and removed the entry -> -ESRCH */
> +	KUNIT_EXPECT_EQ(test, tlob_stop_task(current), -ESRCH);
> +}
> +
> +/* Arm a 1 ms budget and busy-spin for 50 ms; timer fires on-CPU. */
> +static void tlob_deadline_fires_on_cpu(struct kunit *test)
> +{
> +	ktime_t t0;
> +	int ret;
> +
> +	ret = tlob_start_task(current, 1000 /* 1 ms */, NULL, 0);
> +	KUNIT_ASSERT_EQ(test, ret, 0);
> +
> +	/* Busy-spin 50 ms - 50x the budget */
> +	t0 = ktime_get();
> +	while (ktime_us_delta(ktime_get(), t0) < 50000)
> +		cpu_relax();
> +
> +	/* Timer fired during the spin; entry is gone */
> +	KUNIT_EXPECT_EQ(test, tlob_stop_task(current), -ESRCH);
> +}
> +
> +/*
> + * Start three tasks, call tlob_destroy_monitor() + tlob_init_monitor(),
> + * and verify the table is empty afterwards.
> + */
> +static int tlob_dummy_fn(void *arg)
> +{
> +	wait_for_completion((struct completion *)arg);
> +	return 0;
> +}
> +
> +static void tlob_stop_all_cleanup(struct kunit *test)
> +{
> +	struct completion done1, done2;
> +	struct task_struct *t1, *t2;
> +	int ret;
> +
> +	init_completion(&done1);
> +	init_completion(&done2);
> +
> +	t1 = kthread_run(tlob_dummy_fn, &done1, "tlob_dummy1");
> +	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, t1);
> +	get_task_struct(t1);
> +
> +	t2 = kthread_run(tlob_dummy_fn, &done2, "tlob_dummy2");
> +	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, t2);
> +	get_task_struct(t2);
> +
> +	KUNIT_ASSERT_EQ(test, tlob_start_task(current, 10000000, NULL, 0), 0);
> +	KUNIT_ASSERT_EQ(test, tlob_start_task(t1, 10000000, NULL, 0), 0);
> +	KUNIT_ASSERT_EQ(test, tlob_start_task(t2, 10000000, NULL, 0), 0);
> +
> +	/* Destroy clears all entries via tlob_stop_all() */
> +	tlob_destroy_monitor();
> +	ret = tlob_init_monitor();
> +	KUNIT_ASSERT_EQ(test, ret, 0);
> +
> +	/* Table must be empty now */
> +	KUNIT_EXPECT_EQ(test, tlob_stop_task(current), -ESRCH);
> +	KUNIT_EXPECT_EQ(test, tlob_stop_task(t1), -ESRCH);
> +	KUNIT_EXPECT_EQ(test, tlob_stop_task(t2), -ESRCH);
> +
> +	complete(&done1);
> +	complete(&done2);
> +	/*
> +	 * Completions live on the stack; wait for the kthreads to exit
> +	 * before returning.
> +	 */
> +	kthread_stop(t1);
> +	kthread_stop(t2);
> +	put_task_struct(t1);
> +	put_task_struct(t2);
> +}
> +
> +/* A threshold that overflows ktime_t must be rejected with -ERANGE. */
> +static void tlob_overflow_threshold(struct kunit *test)
> +{
> +	/* KTIME_MAX / NSEC_PER_USEC + 1 overflows ktime_t */
> +	u64 too_large = (u64)(KTIME_MAX / NSEC_PER_USEC) + 1;
> +
> +	KUNIT_EXPECT_EQ(test,
> +		tlob_start_task(current, too_large, NULL, 0),
> +		-ERANGE);
> +}
> +
> +static int tlob_task_api_suite_init(struct kunit_suite *suite)
> +{
> +	return tlob_init_monitor();
> +}
> +
> +static void tlob_task_api_suite_exit(struct kunit_suite *suite)
> +{
> +	tlob_destroy_monitor();
> +}
> +
> +static struct kunit_case tlob_task_api_cases[] = {
> +	KUNIT_CASE(tlob_start_stop_ok),
> +	KUNIT_CASE(tlob_double_start),
> +	KUNIT_CASE(tlob_stop_without_start),
> +	KUNIT_CASE(tlob_immediate_deadline),
> +	KUNIT_CASE(tlob_enospc),
> +	KUNIT_CASE(tlob_overflow_threshold),
> +	KUNIT_CASE(tlob_deadline_fires_off_cpu),
> +	KUNIT_CASE(tlob_deadline_fires_on_cpu),
> +	KUNIT_CASE(tlob_stop_all_cleanup),
> +	{}
> +};
> +
> +static struct kunit_suite tlob_task_api_suite = {
> +	.name       = "tlob_task_api",
> +	.suite_init = tlob_task_api_suite_init,
> +	.suite_exit = tlob_task_api_suite_exit,
> +	.test_cases = tlob_task_api_cases,
> +};
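The error contract this suite pins down (0 on success, -EEXIST on double start, -ESRCH once the entry is gone) can be summarized with a minimal single-slot model. This is a hypothetical sketch, not the monitor's implementation: one flag stands in for the per-task hash table and `model_budget_expired()` for the hrtimer expiry path that removes the entry.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* One slot stands in for the per-task hash table entry. */
bool monitored;

int model_start_task(void)
{
	if (monitored)
		return -EEXIST;	/* double start */
	monitored = true;
	return 0;
}

int model_stop_task(void)
{
	if (!monitored)
		return -ESRCH;	/* never started, or the timer already fired */
	monitored = false;
	return 0;
}

/* Timer expiry removes the entry, so a later stop observes -ESRCH. */
void model_budget_expired(void)
{
	monitored = false;
}
```

This also explains why tlob_immediate_deadline() must accept either return value: stop and expiry race, and both orderings are legal.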
> +
> +/*
> + * Suite 3: scheduling integration
> + */
> +
> +struct tlob_ping_ctx {
> +	struct completion ping;
> +	struct completion pong;
> +};
> +
> +static int tlob_ping_fn(void *arg)
> +{
> +	struct tlob_ping_ctx *ctx = arg;
> +
> +	/* Wait for main to give us the CPU back */
> +	wait_for_completion(&ctx->ping);
> +	complete(&ctx->pong);
> +	return 0;
> +}
> +
> +/* Force two context switches and verify stop returns 0 (within budget). */
> +static void tlob_sched_switch_accounting(struct kunit *test)
> +{
> +	struct tlob_ping_ctx ctx;
> +	struct task_struct *peer;
> +	int ret;
> +
> +	init_completion(&ctx.ping);
> +	init_completion(&ctx.pong);
> +
> +	peer = kthread_run(tlob_ping_fn, &ctx, "tlob_ping_kunit");
> +	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, peer);
> +
> +	/* Arm a generous 5 s budget so the timer never fires */
> +	ret = tlob_start_task(current, 5000000, NULL, 0);
> +	KUNIT_ASSERT_EQ(test, ret, 0);
> +
> +	/*
> +	 * complete(ping) -> peer runs, forcing a context switch out and back.
> +	 */
> +	complete(&ctx.ping);
> +	wait_for_completion(&ctx.pong);
> +
> +	/*
> +	 * Back on CPU after one off-CPU interval; stop must return 0.
> +	 */
> +	ret = tlob_stop_task(current);
> +	KUNIT_EXPECT_EQ(test, ret, 0);
> +}
> +
> +/*
> + * Verify that monitoring a kthread (not current) works: start on behalf
> + * of a kthread, let it block, then stop it.
> + */
> +static int tlob_block_fn(void *arg)
> +{
> +	struct completion *done = arg;
> +
> +	/* Block briefly, exercising off_cpu accounting for this task */
> +	msleep(20);
> +	complete(done);
> +	return 0;
> +}
> +
> +static void tlob_monitor_other_task(struct kunit *test)
> +{
> +	struct completion done;
> +	struct task_struct *target;
> +	int ret;
> +
> +	init_completion(&done);
> +
> +	target = kthread_run(tlob_block_fn, &done, "tlob_target_kunit");
> +	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, target);
> +	get_task_struct(target);
> +
> +	/* Arm a 5 s budget for the target task */
> +	ret = tlob_start_task(target, 5000000, NULL, 0);
> +	KUNIT_ASSERT_EQ(test, ret, 0);
> +
> +	wait_for_completion(&done);
> +
> +	/*
> +	 * Target has finished; stop_task may return 0 (still in htable)
> +	 * or -ESRCH (kthread exited and timer fired / entry cleaned up).
> +	 */
> +	ret = tlob_stop_task(target);
> +	KUNIT_EXPECT_TRUE(test, ret == 0 || ret == -ESRCH);
> +	put_task_struct(target);
> +}
> +
> +static int tlob_sched_suite_init(struct kunit_suite *suite)
> +{
> +	return tlob_init_monitor();
> +}
> +
> +static void tlob_sched_suite_exit(struct kunit_suite *suite)
> +{
> +	tlob_destroy_monitor();
> +}
> +
> +static struct kunit_case tlob_sched_integration_cases[] = {
> +	KUNIT_CASE(tlob_sched_switch_accounting),
> +	KUNIT_CASE(tlob_monitor_other_task),
> +	{}
> +};
> +
> +static struct kunit_suite tlob_sched_integration_suite = {
> +	.name       = "tlob_sched_integration",
> +	.suite_init = tlob_sched_suite_init,
> +	.suite_exit = tlob_sched_suite_exit,
> +	.test_cases = tlob_sched_integration_cases,
> +};
> +
> +/*
> + * Suite 4: ftrace tracepoint field verification
> + */
> +
> +/* Capture fields from trace_tlob_budget_exceeded for inspection. */
> +struct tlob_exceeded_capture {
> +	atomic_t	fired;		/* 1 after first call */
> +	pid_t		pid;
> +	u64		threshold_us;
> +	u64		on_cpu_us;
> +	u64		off_cpu_us;
> +	u32		switches;
> +	bool		state_is_on_cpu;
> +	u64		tag;
> +};
> +
> +static void
> +probe_tlob_budget_exceeded(void *data,
> +			   struct task_struct *task, u64 threshold_us,
> +			   u64 on_cpu_us, u64 off_cpu_us,
> +			   u32 switches, bool state_is_on_cpu, u64 tag)
> +{
> +	struct tlob_exceeded_capture *cap = data;
> +
> +	/* Only capture the first event to avoid races. */
> +	if (atomic_cmpxchg(&cap->fired, 0, 1) != 0)
> +		return;
> +
> +	cap->pid		= task->pid;
> +	cap->threshold_us	= threshold_us;
> +	cap->on_cpu_us		= on_cpu_us;
> +	cap->off_cpu_us		= off_cpu_us;
> +	cap->switches		= switches;
> +	cap->state_is_on_cpu	= state_is_on_cpu;
> +	cap->tag		= tag;
> +}
> +
> +/*
> + * Arm a 2 ms budget and busy-spin for 60 ms. Verify the tracepoint fires
> + * once with matching threshold, correct pid, and total time >= budget.
> + *
> + * state_is_on_cpu is not asserted: preemption during the spin makes it
> + * non-deterministic.
> + */
> +static void tlob_trace_budget_exceeded_on_cpu(struct kunit *test)
> +{
> +	struct tlob_exceeded_capture cap = {};
> +	const u64 threshold_us = 2000; /* 2 ms */
> +	ktime_t t0;
> +	int ret;
> +
> +	atomic_set(&cap.fired, 0);
> +
> +	ret = register_trace_tlob_budget_exceeded(probe_tlob_budget_exceeded,
> +						  &cap);
> +	KUNIT_ASSERT_EQ(test, ret, 0);
> +
> +	ret = tlob_start_task(current, threshold_us, NULL, 0);
> +	KUNIT_ASSERT_EQ(test, ret, 0);
> +
> +	/* Busy-spin 60 ms -- 30x the budget */
> +	t0 = ktime_get();
> +	while (ktime_us_delta(ktime_get(), t0) < 60000)
> +		cpu_relax();
> +
> +	/* Entry removed by timer; stop returns -ESRCH */
> +	tlob_stop_task(current);
> +
> +	/*
> +	 * Synchronise: ensure the probe callback has completed before we
> +	 * read the captured fields.
> +	 */
> +	tracepoint_synchronize_unregister();
> +	unregister_trace_tlob_budget_exceeded(probe_tlob_budget_exceeded, &cap);
> +
> +	KUNIT_EXPECT_EQ(test, atomic_read(&cap.fired), 1);
> +	KUNIT_EXPECT_EQ(test, (int)cap.pid, (int)current->pid);
> +	KUNIT_EXPECT_EQ(test, cap.threshold_us, threshold_us);
> +	/* Total elapsed must cover at least the budget */
> +	KUNIT_EXPECT_GE(test, cap.on_cpu_us + cap.off_cpu_us, threshold_us);
> +}
> +
> +/*
> + * Holder kthread grabs a mutex for 80 ms; arm 10 ms budget, burn ~1 ms
> + * on-CPU, then block on the mutex. Timer fires off-CPU. Verify:
> + * state_is_on_cpu == false, switches >= 1, off_cpu_us > 0.
> + */
> +static void tlob_trace_budget_exceeded_off_cpu(struct kunit *test)
> +{
> +	struct tlob_exceeded_capture cap = {};
> +	struct tlob_holder_ctx ctx = { .hold_ms = 80 };
> +	struct task_struct *holder;
> +	const u64 threshold_us = 10000; /* 10 ms */
> +	ktime_t t0;
> +	int ret;
> +
> +	atomic_set(&cap.fired, 0);
> +
> +	mutex_init(&ctx.lock);
> +	init_completion(&ctx.ready);
> +
> +	holder = kthread_run(tlob_holder_fn, &ctx, "tlob_holder2_kunit");
> +	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, holder);
> +	wait_for_completion(&ctx.ready);
> +
> +	ret = register_trace_tlob_budget_exceeded(probe_tlob_budget_exceeded,
> +						  &cap);
> +	KUNIT_ASSERT_EQ(test, ret, 0);
> +
> +	ret = tlob_start_task(current, threshold_us, NULL, 0);
> +	KUNIT_ASSERT_EQ(test, ret, 0);
> +
> +	/* Phase 1: ~1 ms on-CPU */
> +	t0 = ktime_get();
> +	while (ktime_us_delta(ktime_get(), t0) < 1000)
> +		cpu_relax();
> +
> +	/* Phase 2: block -> off-CPU; timer fires here */
> +	mutex_lock(&ctx.lock);
> +	mutex_unlock(&ctx.lock);
> +
> +	tlob_stop_task(current);
> +
> +	tracepoint_synchronize_unregister();
> +	unregister_trace_tlob_budget_exceeded(probe_tlob_budget_exceeded, &cap);
> +
> +	KUNIT_EXPECT_EQ(test, atomic_read(&cap.fired), 1);
> +	KUNIT_EXPECT_EQ(test, cap.threshold_us, threshold_us);
> +	/* Violation happened off-CPU */
> +	KUNIT_EXPECT_FALSE(test, cap.state_is_on_cpu);
> +	/* At least the switch_out event was counted */
> +	KUNIT_EXPECT_GE(test, (u64)cap.switches, (u64)1);
> +	/* Off-CPU time must be non-zero */
> +	KUNIT_EXPECT_GT(test, cap.off_cpu_us, (u64)0);
> +}
> +
> +/* threshold_us in the tracepoint must exactly match the start argument. */
> +static void tlob_trace_threshold_field_accuracy(struct kunit *test)
> +{
> +	static const u64 thresholds[] = { 500, 1000, 3000 };
> +	unsigned int i;
> +
> +	for (i = 0; i < ARRAY_SIZE(thresholds); i++) {
> +		struct tlob_exceeded_capture cap = {};
> +		ktime_t t0;
> +		int ret;
> +
> +		atomic_set(&cap.fired, 0);
> +
> +		ret = register_trace_tlob_budget_exceeded(
> +			probe_tlob_budget_exceeded, &cap);
> +		KUNIT_ASSERT_EQ(test, ret, 0);
> +
> +		ret = tlob_start_task(current, thresholds[i], NULL, 0);
> +		KUNIT_ASSERT_EQ(test, ret, 0);
> +
> +		/* Spin for 20x the threshold to ensure timer fires */
> +		t0 = ktime_get();
> +		while (ktime_us_delta(ktime_get(), t0) <
> +		       (s64)(thresholds[i] * 20))
> +			cpu_relax();
> +
> +		tlob_stop_task(current);
> +
> +		tracepoint_synchronize_unregister();
> +		unregister_trace_tlob_budget_exceeded(
> +			probe_tlob_budget_exceeded, &cap);
> +
> +		KUNIT_EXPECT_EQ_MSG(test, cap.threshold_us, thresholds[i],
> +				    "threshold mismatch for entry %u", i);
> +	}
> +}
> +
> +static int tlob_trace_suite_init(struct kunit_suite *suite)
> +{
> +	int ret;
> +
> +	ret = tlob_init_monitor();
> +	if (ret)
> +		return ret;
> +	return tlob_enable_hooks();
> +}
> +
> +static void tlob_trace_suite_exit(struct kunit_suite *suite)
> +{
> +	tlob_disable_hooks();
> +	tlob_destroy_monitor();
> +}
> +
> +static struct kunit_case tlob_trace_output_cases[] = {
> +	KUNIT_CASE(tlob_trace_budget_exceeded_on_cpu),
> +	KUNIT_CASE(tlob_trace_budget_exceeded_off_cpu),
> +	KUNIT_CASE(tlob_trace_threshold_field_accuracy),
> +	{}
> +};
> +
> +static struct kunit_suite tlob_trace_output_suite = {
> +	.name       = "tlob_trace_output",
> +	.suite_init = tlob_trace_suite_init,
> +	.suite_exit = tlob_trace_suite_exit,
> +	.test_cases = tlob_trace_output_cases,
> +};
> +
> +/* Suite 5: ring buffer */
> +
> +/*
> + * Allocate a synthetic rv_file_priv for ring buffer tests. Uses
> + * kunit_kzalloc() instead of __get_free_pages() since the ring is never
> + * mmap'd here.
> + */
> +static struct rv_file_priv *alloc_priv_kunit(struct kunit *test, u32 cap)
> +{
> +	struct rv_file_priv *priv;
> +	struct tlob_ring *ring;
> +
> +	priv = kunit_kzalloc(test, sizeof(*priv), GFP_KERNEL);
> +	if (!priv)
> +		return NULL;
> +
> +	ring = &priv->ring;
> +
> +	ring->page = kunit_kzalloc(test, sizeof(struct tlob_mmap_page),
> +				   GFP_KERNEL);
> +	if (!ring->page)
> +		return NULL;
> +
> +	ring->data = kunit_kzalloc(test, cap * sizeof(struct tlob_event),
> +				   GFP_KERNEL);
> +	if (!ring->data)
> +		return NULL;
> +
> +	ring->mask            = cap - 1;
> +	ring->page->capacity  = cap;
> +	ring->page->version   = 1;
> +	ring->page->data_offset = PAGE_SIZE; /* nominal; not used in tests */
> +	ring->page->record_size = sizeof(struct tlob_event);
> +	spin_lock_init(&ring->lock);
> +	init_waitqueue_head(&priv->waitq);
> +	return priv;
> +}
> +
> +/* Push one record and verify all fields survive the round-trip. */
> +static void tlob_event_push_one(struct kunit *test)
> +{
> +	struct rv_file_priv *priv;
> +	struct tlob_ring *ring;
> +	struct tlob_event in = {
> +		.tid		= 1234,
> +		.threshold_us	= 5000,
> +		.on_cpu_us	= 3000,
> +		.off_cpu_us	= 2000,
> +		.switches	= 3,
> +		.state		= 1,
> +	};
> +	struct tlob_event out = {};
> +	u32 tail;
> +
> +	priv = alloc_priv_kunit(test, TLOB_RING_DEFAULT_CAP);
> +	KUNIT_ASSERT_NOT_NULL(test, priv);
> +
> +	ring = &priv->ring;
> +
> +	tlob_event_push_kunit(priv, &in);
> +
> +	/* One record written, none dropped */
> +	KUNIT_EXPECT_EQ(test, ring->page->data_head, 1u);
> +	KUNIT_EXPECT_EQ(test, ring->page->data_tail, 0u);
> +	KUNIT_EXPECT_EQ(test, ring->page->dropped,   0ull);
> +
> +	/* Dequeue manually */
> +	tail = ring->page->data_tail;
> +	out  = ring->data[tail & ring->mask];
> +	ring->page->data_tail = tail + 1;
> +
> +	KUNIT_EXPECT_EQ(test, out.tid,          in.tid);
> +	KUNIT_EXPECT_EQ(test, out.threshold_us, in.threshold_us);
> +	KUNIT_EXPECT_EQ(test, out.on_cpu_us,    in.on_cpu_us);
> +	KUNIT_EXPECT_EQ(test, out.off_cpu_us,   in.off_cpu_us);
> +	KUNIT_EXPECT_EQ(test, out.switches,     in.switches);
> +	KUNIT_EXPECT_EQ(test, out.state,        in.state);
> +
> +	/* Ring is now empty */
> +	KUNIT_EXPECT_EQ(test, ring->page->data_head, ring->page->data_tail);
> +}
> +
> +/*
> + * Fill to capacity, push one more. Drop-new policy: head stays at cap,
> + * dropped == 1, oldest record is preserved.
> + */
> +static void tlob_event_push_overflow(struct kunit *test)
> +{
> +	struct rv_file_priv *priv;
> +	struct tlob_ring *ring;
> +	struct tlob_event ntf = {};
> +	struct tlob_event out = {};
> +	const u32 cap = TLOB_RING_MIN_CAP;
> +	u32 i;
> +
> +	priv = alloc_priv_kunit(test, cap);
> +	KUNIT_ASSERT_NOT_NULL(test, priv);
> +
> +	ring = &priv->ring;
> +
> +	/* Push cap + 1 records; tid encodes the sequence */
> +	for (i = 0; i <= cap; i++) {
> +		ntf.tid          = i;
> +		ntf.threshold_us = (u64)i * 1000;
> +		tlob_event_push_kunit(priv, &ntf);
> +	}
> +
> +	/* Drop-new: head stopped at cap; one record was silently discarded */
> +	KUNIT_EXPECT_EQ(test, ring->page->data_tail, 0u);
> +	KUNIT_EXPECT_EQ(test, ring->page->dropped,   1ull);
> +
> +	/* Oldest surviving record must be the first one pushed (tid == 0) */
> +	out = ring->data[ring->page->data_tail & ring->mask];
> +	KUNIT_EXPECT_EQ(test, out.tid, 0u);
> +
> +	/* Drain the ring; the last record must have tid == cap - 1 */
> +	for (i = 0; i < cap; i++) {
> +		u32 tail = ring->page->data_tail;
> +
> +		out = ring->data[tail & ring->mask];
> +		ring->page->data_tail = tail + 1;
> +	}
> +	KUNIT_EXPECT_EQ(test, out.tid, cap - 1);
> +	KUNIT_EXPECT_EQ(test, ring->page->data_head, ring->page->data_tail);
> +}
> +
> +/* A freshly initialised ring is empty. */
> +static void tlob_event_empty(struct kunit *test)
> +{
> +	struct rv_file_priv *priv;
> +	struct tlob_ring *ring;
> +
> +	priv = alloc_priv_kunit(test, TLOB_RING_DEFAULT_CAP);
> +	KUNIT_ASSERT_NOT_NULL(test, priv);
> +
> +	ring = &priv->ring;
> +
> +	KUNIT_EXPECT_EQ(test, ring->page->data_head, 0u);
> +	KUNIT_EXPECT_EQ(test, ring->page->data_tail, 0u);
> +	KUNIT_EXPECT_EQ(test, ring->page->dropped,   0ull);
> +}
> +
> +/*
> + * A kthread blocks on wait_event_interruptible(); pushing one record must
> + * wake it within 1 s.
> + */
> +
> +struct tlob_wakeup_ctx {
> +	struct rv_file_priv	*priv;
> +	struct completion	 ready;
> +	struct completion	 done;
> +	int			 woke;
> +};
> +
> +static int tlob_wakeup_thread(void *arg)
> +{
> +	struct tlob_wakeup_ctx *ctx = arg;
> +	struct tlob_ring *ring = &ctx->priv->ring;
> +
> +	complete(&ctx->ready);
> +
> +	wait_event_interruptible(ctx->priv->waitq,
> +		smp_load_acquire(&ring->page->data_head) !=
> +		READ_ONCE(ring->page->data_tail) ||
> +		kthread_should_stop());
> +
> +	if (smp_load_acquire(&ring->page->data_head) !=
> +	    READ_ONCE(ring->page->data_tail))
> +		ctx->woke = 1;
> +
> +	complete(&ctx->done);
> +	return 0;
> +}
> +
> +static void tlob_ring_wakeup(struct kunit *test)
> +{
> +	struct rv_file_priv *priv;
> +	struct tlob_wakeup_ctx ctx;
> +	struct task_struct *t;
> +	struct tlob_event ev = { .tid = 99 };
> +	long timeout;
> +
> +	priv = alloc_priv_kunit(test, TLOB_RING_DEFAULT_CAP);
> +	KUNIT_ASSERT_NOT_NULL(test, priv);
> +
> +	init_completion(&ctx.ready);
> +	init_completion(&ctx.done);
> +	ctx.priv = priv;
> +	ctx.woke = 0;
> +
> +	t = kthread_run(tlob_wakeup_thread, &ctx, "tlob_wakeup_kunit");
> +	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, t);
> +	get_task_struct(t);
> +
> +	/* Let the kthread reach wait_event_interruptible */
> +	wait_for_completion(&ctx.ready);
> +	usleep_range(10000, 20000);
> +
> +	/* Push one record -- must wake the waiter */
> +	tlob_event_push_kunit(priv, &ev);
> +
> +	timeout = wait_for_completion_timeout(&ctx.done, msecs_to_jiffies(1000));
> +	kthread_stop(t);
> +	put_task_struct(t);
> +
> +	KUNIT_EXPECT_GT(test, timeout, 0L);
> +	KUNIT_EXPECT_EQ(test, ctx.woke, 1);
> +	KUNIT_EXPECT_EQ(test, priv->ring.page->data_head, 1u);
> +}
> +
> +static struct kunit_case tlob_event_buf_cases[] = {
> +	KUNIT_CASE(tlob_event_push_one),
> +	KUNIT_CASE(tlob_event_push_overflow),
> +	KUNIT_CASE(tlob_event_empty),
> +	KUNIT_CASE(tlob_ring_wakeup),
> +	{}
> +};
> +
> +static struct kunit_suite tlob_event_buf_suite = {
> +	.name       = "tlob_event_buf",
> +	.test_cases = tlob_event_buf_cases,
> +};
> +
> +/* Suite 6: uprobe format string parser */
> +
> +/* Happy path: decimal offsets, plain path. */
> +static void tlob_parse_decimal_offsets(struct kunit *test)
> +{
> +	char buf[] = "5000:4768:4848:/usr/bin/myapp";
> +	u64 thr; loff_t start, stop; char *path;
> +
> +	KUNIT_EXPECT_EQ(test,
> +		tlob_parse_uprobe_line(buf, &thr, &path, &start, &stop),
> +		0);
> +	KUNIT_EXPECT_EQ(test, thr,      (u64)5000);
> +	KUNIT_EXPECT_EQ(test, start,    (loff_t)4768);
> +	KUNIT_EXPECT_EQ(test, stop,     (loff_t)4848);
> +	KUNIT_EXPECT_STREQ(test, path,  "/usr/bin/myapp");
> +}
> +
> +/* Happy path: 0x-prefixed hex offsets. */
> +static void tlob_parse_hex_offsets(struct kunit *test)
> +{
> +	char buf[] = "10000:0x12a0:0x12f0:/usr/bin/myapp";
> +	u64 thr; loff_t start, stop; char *path;
> +
> +	KUNIT_EXPECT_EQ(test,
> +		tlob_parse_uprobe_line(buf, &thr, &path, &start, &stop),
> +		0);
> +	KUNIT_EXPECT_EQ(test, start,   (loff_t)0x12a0);
> +	KUNIT_EXPECT_EQ(test, stop,    (loff_t)0x12f0);
> +	KUNIT_EXPECT_STREQ(test, path, "/usr/bin/myapp");
> +}
> +
> +/* Path containing ':' must not be truncated. */
> +static void tlob_parse_path_with_colon(struct kunit *test)
> +{
> +	char buf[] = "1000:0x100:0x200:/opt/my:app/bin";
> +	u64 thr; loff_t start, stop; char *path;
> +
> +	KUNIT_EXPECT_EQ(test,
> +		tlob_parse_uprobe_line(buf, &thr, &path, &start, &stop),
> +		0);
> +	KUNIT_EXPECT_STREQ(test, path, "/opt/my:app/bin");
> +}
> +
> +/* Zero threshold must be rejected. */
> +static void tlob_parse_zero_threshold(struct kunit *test)
> +{
> +	char buf[] = "0:0x100:0x200:/usr/bin/myapp";
> +	u64 thr; loff_t start, stop; char *path;
> +
> +	KUNIT_EXPECT_EQ(test,
> +		tlob_parse_uprobe_line(buf, &thr, &path, &start, &stop),
> +		-EINVAL);
> +}
> +
> +/* Empty path (trailing ':' with nothing after) must be rejected. */
> +static void tlob_parse_empty_path(struct kunit *test)
> +{
> +	char buf[] = "5000:0x100:0x200:";
> +	u64 thr; loff_t start, stop; char *path;
> +
> +	KUNIT_EXPECT_EQ(test,
> +		tlob_parse_uprobe_line(buf, &thr, &path, &start, &stop),
> +		-EINVAL);
> +}
> +
> +/* Missing field (3 tokens instead of 4) must be rejected. */
> +static void tlob_parse_too_few_fields(struct kunit *test)
> +{
> +	char buf[] = "5000:0x100:/usr/bin/myapp";
> +	u64 thr; loff_t start, stop; char *path;
> +
> +	KUNIT_EXPECT_EQ(test,
> +		tlob_parse_uprobe_line(buf, &thr, &path, &start, &stop),
> +		-EINVAL);
> +}
> +
> +/* Negative offset must be rejected. */
> +static void tlob_parse_negative_offset(struct kunit *test)
> +{
> +	char buf[] = "5000:-1:0x200:/usr/bin/myapp";
> +	u64 thr; loff_t start, stop; char *path;
> +
> +	KUNIT_EXPECT_EQ(test,
> +		tlob_parse_uprobe_line(buf, &thr, &path, &start, &stop),
> +		-EINVAL);
> +}
> +
> +static struct kunit_case tlob_parse_uprobe_cases[] = {
> +	KUNIT_CASE(tlob_parse_decimal_offsets),
> +	KUNIT_CASE(tlob_parse_hex_offsets),
> +	KUNIT_CASE(tlob_parse_path_with_colon),
> +	KUNIT_CASE(tlob_parse_zero_threshold),
> +	KUNIT_CASE(tlob_parse_empty_path),
> +	KUNIT_CASE(tlob_parse_too_few_fields),
> +	KUNIT_CASE(tlob_parse_negative_offset),
> +	{}
> +};
> +
> +static struct kunit_suite tlob_parse_uprobe_suite = {
> +	.name       = "tlob_parse_uprobe",
> +	.test_cases = tlob_parse_uprobe_cases,
> +};
> +
> +kunit_test_suites(&tlob_automaton_suite,
> +		  &tlob_task_api_suite,
> +		  &tlob_sched_integration_suite,
> +		  &tlob_trace_output_suite,
> +		  &tlob_event_buf_suite,
> +		  &tlob_parse_uprobe_suite);
> +
> +MODULE_DESCRIPTION("KUnit tests for the tlob RV monitor");
> +MODULE_LICENSE("GPL");


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH 2/4] rv/tlob: Add tlob deterministic automaton monitor
  2026-04-13  8:19   ` Gabriele Monaco
@ 2026-04-16 15:09     ` Wen Yang
  2026-04-16 15:35       ` Gabriele Monaco
  0 siblings, 1 reply; 11+ messages in thread
From: Wen Yang @ 2026-04-16 15:09 UTC (permalink / raw)
  To: Gabriele Monaco
  Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	linux-trace-kernel, linux-kernel



On 4/13/26 16:19, Gabriele Monaco wrote:
> On Mon, 2026-04-13 at 03:27 +0800, wen.yang@linux.dev wrote:
>> From: Wen Yang <wen.yang@linux.dev>
>>
>> Add the tlob (task latency over budget) RV monitor. tlob tracks the
>> monotonic elapsed time (CLOCK_MONOTONIC) of a marked per-task code
>> path, including time off-CPU, and fires a per-task hrtimer when the
>> elapsed time exceeds a configurable budget.
>>
>> Three-state DA (unmonitored/on_cpu/off_cpu) driven by trace_start,
>> switch_in/out, and budget_expired events. Per-task state lives in a
>> fixed-size hash table (TLOB_MAX_MONITORED slots) with RCU-deferred
>> free.
>>
>> Two userspace interfaces:
>>   - tracefs: uprobe pair registration via the monitor file using the
>>     format "pid:threshold_us:offset_start:offset_stop:binary_path"
>>   - /dev/rv ioctls (CONFIG_RV_CHARDEV): TLOB_IOCTL_TRACE_START /
>>     TRACE_STOP; TRACE_STOP returns -EOVERFLOW on violation
>>
>> Each /dev/rv fd has a per-fd mmap ring buffer (physically contiguous
>> pages). A control page (struct tlob_mmap_page) at offset 0 exposes
>> head/tail/dropped for lockless userspace reads; struct tlob_event
>> records follow at data_offset. Drop-new policy on overflow.
>>
>> UAPI: include/uapi/linux/rv.h (tlob_start_args, tlob_event,
>>        tlob_mmap_page, ioctl numbers), monitor_tlob.rst,
>>        ioctl-number.rst (RV_IOC_MAGIC=0xB9).
>>
> 
> I'm not fully grasping all the requirements for the monitors yet, but I see you
> are reimplementing a lot of functionality in the monitor itself rather than
> within RV; let's see if we can consolidate some of it:
> 
>   * you're using timer expirations, can we do it with timed automata? [1]
>   * RV automata usually don't have an /unmonitored/ state: your trace_start event
> would be the start condition (da_event_start) and the monitor would go
> non-running at each violation (it calls da_monitor_reset() automatically); all
> setup/cleanup logic should be handled implicitly within RV. I believe that would
> also save you that ugly trace_event_tlob() redefinition.
>   * you're maintaining a local hash table for each task_struct, that could use
> the per-object monitors [2] where your "object" is in fact your struct,
> allocated when you start the monitor with all appropriate fields and indexed by
> pid
>   * you are handling violations manually, considering timed automata trigger a
> full fledged violation on timeouts, can you use the RV-way (error tracepoints or
> reactors only)? Do you need the additional reporting within the
> tracepoint/ioctl? Cannot the userspace consumer desume all those from other
> events and let RV do just the monitoring?
>   * I like the uprobe thing, we could probably move all that to a common helper
> once we figure out how to make it generic.
> 
> Note: [1] and [2] didn't reach upstream yet, but should reach linux-next soon.
> 

Thanks for the review.  Here's my plan for each point -- let me know if 
the direction looks right.


- Timed automata

The HA framework [1] is a good match when the timeout threshold is 
global or state-determined, but tlob needs a per-invocation threshold 
supplied at TRACE_START time -- fitting that into HA would require 
framework changes.

My plan is to use da_monitor_init_hook() -- the same mechanism HA 
monitors use internally -- to arm the per-invocation hrtimer once 
da_create_storage() has stored the monitor_target.  This gives the same 
"timer fires => violation" semantics without touching the HA infrastructure.

If you see a cleaner way to pass per-invocation data through HA I'm 
happy to go that route.
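Sketched as kernel-side pseudocode (the hook name, the tlob_task_state
layout, and the generated da_handle_event_tlob() helper are all
assumptions until the per-object series lands):

```
/* Pseudocode -- arm the budget timer from the DA init hook, after
 * da_create_storage() has populated the per-task monitor_target. */
static void tlob_init_hook(struct tlob_task_state *ts)
{
	hrtimer_setup(&ts->timer, tlob_timer_fn, CLOCK_MONOTONIC,
		      HRTIMER_MODE_REL_HARD);
	hrtimer_start(&ts->timer,
		      ns_to_ktime(ts->threshold_us * NSEC_PER_USEC),
		      HRTIMER_MODE_REL_HARD);
}

/* Expiry feeds the DA like any other event; the framework records
 * the violation and resets the monitor. */
static enum hrtimer_restart tlob_timer_fn(struct hrtimer *timer)
{
	struct tlob_task_state *ts =
		container_of(timer, struct tlob_task_state, timer);

	da_handle_event_tlob(ts->task, budget_expired_tlob);
	return HRTIMER_NORESTART;
}
```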


- Unmonitored state / da_handle_start_event

Fair point.  I'll drop the explicit unmonitored state and the
trace_event_tlob() redefinition.  tlob_start_task() will use
da_handle_start_event() to allocate storage, set initial state to on_cpu,
and fire the init hook to arm the timer in one shot.  tlob_stop_task()
calls da_monitor_reset() directly.

- Per-object monitors

Will do.  The custom hash table goes away; I'll switch to RV_MON_PER_OBJ
with:

     typedef struct tlob_task_state *monitor_target;

da_get_target_by_id() handles the sched_switch hot path lookup.


- RV-way violations

Agreed.  budget_expired will be declared INVALID in all states so the
framework calls react() (error_tlob tracepoint + any registered reactor)
and da_monitor_reset() automatically.  tlob won't emit any tracepoint of
its own.

One note on the /dev/rv ioctl: TLOB_IOCTL_TRACE_STOP returns -EOVERFLOW
to the caller when the budget was exceeded.  This is just a syscall 
return code -- not a second reporting path -- to let in-process 
instrumentation react inline without polling the trace buffer.
Let me know if you have concerns about keeping this.


- Generic uprobe helper

Proposed interface:

     struct rv_uprobe *rv_uprobe_attach_path(
             struct path *path, loff_t offset,
             int (*entry_fn)(struct rv_uprobe *, struct pt_regs *, __u64 *),
             int (*ret_fn)  (struct rv_uprobe *, unsigned long func,
                             struct pt_regs *, __u64 *),
             void *priv);

     struct rv_uprobe *rv_uprobe_attach(
             const char *binpath, loff_t offset,
             int (*entry_fn)(struct rv_uprobe *, struct pt_regs *, __u64 *),
             int (*ret_fn)  (struct rv_uprobe *, unsigned long func,
                             struct pt_regs *, __u64 *),
             void *priv);

     void rv_uprobe_detach(struct rv_uprobe *p);

struct rv_uprobe exposes three read-only fields to monitors (offset, 
priv, path); the uprobe_consumer and callbacks would be kept private to 
the implementation, so monitors need not include <linux/uprobes.h>.

rv_uprobe_attach() resolves the path and delegates to 
rv_uprobe_attach_path(); the latter avoids a redundant kern_path() when 
registering multiple probes on the same binary:

     kern_path(binpath, LOOKUP_FOLLOW, &path);
     b->start = rv_uprobe_attach_path(&path, offset_start, entry_fn,
                                      NULL, b);
     b->stop  = rv_uprobe_attach_path(&path, offset_stop,  stop_fn,
                                      NULL, b);
     path_put(&path);

Does the interface look reasonable, or did you have a different shape in 
mind?


--
Best wishes,
Wen


> 
> [1] -
> https://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git/commit/?h=rv/for-next&id=f5587d1b6ec938afb2f74fe399a68020d66923e4
> [2] -
> https://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git/commit/?h=rv/for-next&id=da282bf7fadb095ee0a40c32ff0126429c769b45
> 
>> Signed-off-by: Wen Yang <wen.yang@linux.dev>
>> ---
>>   Documentation/trace/rv/index.rst              |   1 +
>>   Documentation/trace/rv/monitor_tlob.rst       | 381 +++++++
>>   .../userspace-api/ioctl/ioctl-number.rst      |   1 +
>>   include/uapi/linux/rv.h                       | 181 ++++
>>   kernel/trace/rv/Kconfig                       |  17 +
>>   kernel/trace/rv/Makefile                      |   2 +
>>   kernel/trace/rv/monitors/tlob/Kconfig         |  51 +
>>   kernel/trace/rv/monitors/tlob/tlob.c          | 986 ++++++++++++++++++
>>   kernel/trace/rv/monitors/tlob/tlob.h          | 145 +++
>>   kernel/trace/rv/monitors/tlob/tlob_trace.h    |  42 +
>>   kernel/trace/rv/rv.c                          |   4 +
>>   kernel/trace/rv/rv_dev.c                      | 602 +++++++++++
>>   kernel/trace/rv/rv_trace.h                    |  50 +
>>   13 files changed, 2463 insertions(+)
>>   create mode 100644 Documentation/trace/rv/monitor_tlob.rst
>>   create mode 100644 include/uapi/linux/rv.h
>>   create mode 100644 kernel/trace/rv/monitors/tlob/Kconfig
>>   create mode 100644 kernel/trace/rv/monitors/tlob/tlob.c
>>   create mode 100644 kernel/trace/rv/monitors/tlob/tlob.h
>>   create mode 100644 kernel/trace/rv/monitors/tlob/tlob_trace.h
>>   create mode 100644 kernel/trace/rv/rv_dev.c
>>
>> diff --git a/Documentation/trace/rv/index.rst b/Documentation/trace/rv/index.rst
>> index a2812ac5c..4f2bfaf38 100644
>> --- a/Documentation/trace/rv/index.rst
>> +++ b/Documentation/trace/rv/index.rst
>> @@ -15,3 +15,4 @@ Runtime Verification
>>      monitor_wwnr.rst
>>      monitor_sched.rst
>>      monitor_rtapp.rst
>> +   monitor_tlob.rst
>> diff --git a/Documentation/trace/rv/monitor_tlob.rst b/Documentation/trace/rv/monitor_tlob.rst
>> new file mode 100644
>> index 000000000..d498e9894
>> --- /dev/null
>> +++ b/Documentation/trace/rv/monitor_tlob.rst
>> @@ -0,0 +1,381 @@
>> +.. SPDX-License-Identifier: GPL-2.0
>> +
>> +Monitor tlob
>> +============
>> +
>> +- Name: tlob - task latency over budget
>> +- Type: per-task deterministic automaton
>> +- Author: Wen Yang <wen.yang@linux.dev>
>> +
>> +Description
>> +-----------
>> +
>> +The tlob monitor tracks per-task elapsed time (CLOCK_MONOTONIC, including
>> +both on-CPU and off-CPU time) and reports a violation when the monitored
>> +task exceeds a configurable latency budget threshold.
>> +
>> +The monitor implements a three-state deterministic automaton::
>> +
>> +                              |
>> +                              | (initial)
>> +                              v
>> +                    +--------------+
>> +          +-------> | unmonitored  |
>> +          |         +--------------+
>> +          |                |
>> +          |          trace_start
>> +          |                v
>> +          |         +--------------+
>> +          |         |   on_cpu     |
>> +          |         +--------------+
>> +          |           |         |
>> +          |  switch_out|         | trace_stop / budget_expired
>> +          |            v         v
>> +          |  +--------------+  (unmonitored)
>> +          |  |   off_cpu    |
>> +          |  +--------------+
>> +          |     |          |
>> +          |     | switch_in| trace_stop / budget_expired
>> +          |     v          v
>> +          |  (on_cpu)  (unmonitored)
>> +          |
>> +          +-- trace_stop (from on_cpu or off_cpu)
>> +
>> +  Key transitions:
>> +    unmonitored   --(trace_start)-->   on_cpu
>> +    on_cpu        --(switch_out)-->    off_cpu
>> +    off_cpu       --(switch_in)-->     on_cpu
>> +    on_cpu        --(trace_stop)-->    unmonitored
>> +    off_cpu       --(trace_stop)-->    unmonitored
>> +    on_cpu        --(budget_expired)-> unmonitored   [violation]
>> +    off_cpu       --(budget_expired)-> unmonitored   [violation]
>> +
>> +  sched_wakeup self-loops in on_cpu and unmonitored; switch_out and
>> +  sched_wakeup self-loop in off_cpu.  budget_expired is fired by the
>> +  one-shot hrtimer; it always
>> +  transitions to unmonitored regardless of whether the task is on-CPU
>> +  or off-CPU when the timer fires.
>> +
>> +State Descriptions
>> +------------------
>> +
>> +- **unmonitored**: Task is not being traced.  Scheduling events
>> +  (``switch_in``, ``switch_out``, ``sched_wakeup``) are silently
>> +  ignored (self-loop).  The monitor waits for a ``trace_start`` event
>> +  to begin a new observation window.
>> +
>> +- **on_cpu**: Task is running on the CPU with the deadline timer armed.
>> +  A one-shot hrtimer was set for ``threshold_us`` microseconds at
>> +  ``trace_start`` time.  A ``switch_out`` event transitions to
>> +  ``off_cpu``; the hrtimer keeps running (off-CPU time counts toward
>> +  the budget).  A ``trace_stop`` cancels the timer and returns to
>> +  ``unmonitored`` (normal completion).  If the hrtimer fires
>> +  (``budget_expired``) the violation is recorded and the automaton
>> +  transitions to ``unmonitored``.
>> +
>> +- **off_cpu**: Task was preempted or blocked.  The one-shot hrtimer
>> +  continues to run.  A ``switch_in`` event returns to ``on_cpu``.
>> +  A ``trace_stop`` cancels the timer and returns to ``unmonitored``.
>> +  If the hrtimer fires (``budget_expired``) while the task is off-CPU,
>> +  the violation is recorded and the automaton transitions to
>> +  ``unmonitored``.
>> +
>> +Rationale
>> +---------
>> +
>> +The per-task latency budget threshold allows operators to express timing
>> +requirements in microseconds and receive an immediate ftrace event when a
>> +task exceeds its budget.  This is useful for real-time tasks
>> +(``SCHED_FIFO`` / ``SCHED_DEADLINE``) where total elapsed time must
>> +remain within a known bound.
>> +
>> +Each task has an independent threshold, so up to ``TLOB_MAX_MONITORED``
>> +(64) tasks with different timing requirements can be monitored
>> +simultaneously.
>> +
>> +On threshold violation the automaton records a ``tlob_budget_exceeded``
>> +ftrace event carrying the final on-CPU / off-CPU time breakdown, but does
>> +not kill or throttle the task.  Monitoring can be restarted by issuing a
>> +new ``trace_start`` event (or a new ``TLOB_IOCTL_TRACE_START`` ioctl).
>> +
>> +A per-task one-shot hrtimer is armed at ``trace_start`` for exactly
>> +``threshold_us`` microseconds.  It fires at most once per monitoring
>> +window, performs an O(1) hash lookup, records the violation, and injects
>> +the ``budget_expired`` event into the DA.  When ``CONFIG_RV_MON_TLOB``
>> +is not set there is zero runtime cost.
>> +
>> +Usage
>> +-----
>> +
>> +tracefs interface (uprobe-based external monitoring)
>> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> +
>> +The ``monitor`` tracefs file allows any privileged user to instrument an
>> +unmodified binary via uprobes, without changing its source code.  Write a
>> +four-field record to attach two plain entry uprobes: one at
>> +``offset_start`` fires ``tlob_start_task()`` and one at ``offset_stop``
>> +fires ``tlob_stop_task()``, so the latency budget covers exactly the code
>> +region between the two offsets::
>> +
>> +  threshold_us:offset_start:offset_stop:binary_path
>> +
>> +``binary_path`` comes last so it may freely contain ``:`` (e.g. paths
>> +inside a container namespace).
>> +
>> +The uprobes fire for every task that executes the probed instruction in
>> +the binary, consistent with the native uprobe semantics.  All tasks that
>> +execute the code region get independent per-task monitoring slots.
>> +
>> +Using two plain entry uprobes (rather than a uretprobe for the stop) means
>> +that a mistyped offset can never corrupt the call stack; the worst outcome
>> +of a bad ``offset_stop`` is a missed stop that causes the hrtimer to fire
>> +and report a budget violation.
>> +
>> +Example  --  monitor a code region in ``/usr/bin/myapp`` with a 5 ms
>> +budget, where the region starts at offset 0x12a0 and ends at 0x12f0::
>> +
>> +  echo 1 > /sys/kernel/tracing/rv/monitors/tlob/enable
>> +
>> +  # Bind uprobes: start probe starts the clock, stop probe stops it
>> +  echo "5000:0x12a0:0x12f0:/usr/bin/myapp" \
>> +      > /sys/kernel/tracing/rv/monitors/tlob/monitor
>> +
>> +  # Remove the uprobe binding for this code region
>> +  echo "-0x12a0:/usr/bin/myapp" \
>> +      > /sys/kernel/tracing/rv/monitors/tlob/monitor
>> +
>> +  # List registered uprobe bindings (mirrors the write format)
>> +  cat /sys/kernel/tracing/rv/monitors/tlob/monitor
>> +  # -> 5000:0x12a0:0x12f0:/usr/bin/myapp
>> +
>> +  # Read violations from the trace buffer
>> +  cat /sys/kernel/tracing/trace
>> +
>> +Up to ``TLOB_MAX_MONITORED`` tasks may be monitored simultaneously.
>> +
>> +The offsets can be obtained with ``nm`` or ``readelf``::
>> +
>> +  nm -n /usr/bin/myapp | grep my_function
>> +  # -> 0000000000012a0 T my_function
>> +
>> +  readelf -s /usr/bin/myapp | grep my_function
>> +  # -> 42: 0000000000012a0  336 FUNC GLOBAL DEFAULT  13 my_function
>> +
>> +  # offset_start = 0x12a0 (function entry)
>> +  # offset_stop  = 0x12a0 + 0x50 = 0x12f0 (or any instruction before return)
>> +
>> +Notes:
>> +
>> +- The uprobes fire for every task that executes the probed instruction,
>> +  so concurrent calls from different threads each get independent
>> +  monitoring slots.
>> +- ``offset_stop`` need not be a function return; it can be any instruction
>> +  within the region.  If the stop probe is never reached (e.g. early exit
>> +  path bypasses it), the hrtimer fires and a budget violation is reported.
>> +- Each ``(binary_path, offset_start)`` pair may only be registered once.
>> +  A second write with the same ``offset_start`` for the same binary is
>> +  rejected with ``-EEXIST``.  Two entry uprobes at the same address would
>> +  both fire for every task, causing ``tlob_start_task()`` to be called
>> +  twice; the second call would silently fail with ``-EEXIST`` and the
>> +  second binding's threshold would never take effect.  Different code
>> +  regions that share the same ``offset_stop`` (common exit point) are
>> +  explicitly allowed.
>> +- The uprobe binding is removed when ``-offset_start:binary_path`` is
>> +  written to ``monitor``, or when the monitor is disabled.
>> +- The ``tag`` field in every ``tlob_budget_exceeded`` event is
>> +  automatically set to ``offset_start`` for the tracefs path, so
>> +  violation events for different code regions are immediately
>> +  distinguishable even when ``threshold_us`` values are identical.
>> +
>> +ftrace ring buffer (budget violation events)
>> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> +
>> +When a monitored task exceeds its latency budget the hrtimer fires,
>> +records the violation, and emits a single ``tlob_budget_exceeded`` event
>> +into the ftrace ring buffer.  **Nothing is written to the ftrace ring
>> +buffer while the task is within budget.**
>> +
>> +The event carries the on-CPU / off-CPU time breakdown so that root-cause
>> +analysis (CPU-bound vs. scheduling / I/O overrun) is immediate::
>> +
>> +  cat /sys/kernel/tracing/trace
>> +
>> +Example output::
>> +
>> +  myapp-1234 [003] .... 12345.678: tlob_budget_exceeded: \
>> +    myapp[1234]: budget exceeded threshold=5000 \
>> +    on_cpu=820 off_cpu=4500 switches=3 state=off_cpu tag=0x00000000000012a0
>> +
>> +Field descriptions:
>> +
>> +``threshold``
>> +  Configured latency budget in microseconds.
>> +
>> +``on_cpu``
>> +  Cumulative on-CPU time since ``trace_start``, in microseconds.
>> +
>> +``off_cpu``
>> +  Cumulative off-CPU (scheduling + I/O wait) time since ``trace_start``,
>> +  in microseconds.
>> +
>> +``switches``
>> +  Number of times the task was scheduled out during this window.
>> +
>> +``state``
>> +  DA state when the hrtimer fired: ``on_cpu`` means the task was executing
>> +  when the budget expired (CPU-bound overrun); ``off_cpu`` means the task
>> +  was preempted or blocked (scheduling / I/O overrun).
>> +
>> +``tag``
>> +  Opaque 64-bit cookie supplied by the caller via ``tlob_start_args.tag``
>> +  (ioctl path) or automatically set to ``offset_start`` (tracefs uprobe
>> +  path).  Use it to distinguish violations from different code regions
>> +  monitored by the same thread.  Zero when not set.
>> +
>> +To capture violations in a file::
>> +
>> +  trace-cmd record -e tlob_budget_exceeded &
>> +  # ... run workload ...
>> +  trace-cmd report
>> +
>> +/dev/rv ioctl interface (self-instrumentation)
>> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> +
>> +Tasks can self-instrument their own code paths via the ``/dev/rv`` misc
>> +device (requires ``CONFIG_RV_CHARDEV``).  The kernel key is
>> +``task_struct``; multiple threads sharing a single fd each get their own
>> +independent monitoring slot.
>> +
>> +**Synchronous mode**  --  the calling thread checks its own result::
>> +
>> +  int fd = open("/dev/rv", O_RDWR);
>> +
>> +  struct tlob_start_args args = {
>> +      .threshold_us = 50000,   /* 50 ms */
>> +      .tag          = 0,       /* optional; 0 = don't care */
>> +      .notify_fd    = -1,      /* no fd notification */
>> +  };
>> +  ioctl(fd, TLOB_IOCTL_TRACE_START, &args);
>> +
>> +  /* ... code path under observation ... */
>> +
>> +  int ret = ioctl(fd, TLOB_IOCTL_TRACE_STOP, NULL);
>> +  /* ret == 0:          within budget  */
>> +  /* ret == -EOVERFLOW: budget exceeded */
>> +
>> +  close(fd);
>> +
>> +**Asynchronous mode**  --  a dedicated monitor thread receives violation
>> +records via ``read()`` on a shared fd, decoupling the observation from
>> +the critical path::
>> +
>> +  /* Monitor thread: open a dedicated fd. */
>> +  int monitor_fd = open("/dev/rv", O_RDWR);
>> +
>> +  /* Worker thread: set notify_fd = monitor_fd in TRACE_START args. */
>> +  int work_fd = open("/dev/rv", O_RDWR);
>> +  struct tlob_start_args args = {
>> +      .threshold_us = 10000,   /* 10 ms */
>> +      .tag          = REGION_A,
>> +      .notify_fd    = monitor_fd,
>> +  };
>> +  ioctl(work_fd, TLOB_IOCTL_TRACE_START, &args);
>> +  /* ... critical section ... */
>> +  ioctl(work_fd, TLOB_IOCTL_TRACE_STOP, NULL);
>> +
>> +  /* Monitor thread: blocking read() returns one or more tlob_event
>> +   * records. */
>> +  struct tlob_event ntfs[8];
>> +  ssize_t n = read(monitor_fd, ntfs, sizeof(ntfs));
>> +  for (int i = 0; i < n / sizeof(struct tlob_event); i++) {
>> +      struct tlob_event *ntf = &ntfs[i];
>> +      printf("tid=%u tag=0x%llx exceeded budget=%llu us "
>> +             "(on_cpu=%llu off_cpu=%llu switches=%u state=%s)\n",
>> +             ntf->tid, ntf->tag, ntf->threshold_us,
>> +             ntf->on_cpu_us, ntf->off_cpu_us, ntf->switches,
>> +             ntf->state ? "on_cpu" : "off_cpu");
>> +  }
>> +
>> +**mmap ring buffer**  --  zero-copy consumption of violation events::
>> +
>> +  int fd = open("/dev/rv", O_RDWR);
>> +  struct tlob_start_args args = {
>> +      .threshold_us = 1000,   /* 1 ms */
>> +      .notify_fd    = fd,     /* push violations to own ring buffer */
>> +  };
>> +  ioctl(fd, TLOB_IOCTL_TRACE_START, &args);
>> +
>> +  /* Map the ring: one control page + capacity data records. */
>> +  size_t pagesize = sysconf(_SC_PAGESIZE);
>> +  size_t cap = 64;   /* read from page->capacity after mmap */
>> +  size_t len = pagesize + cap * sizeof(struct tlob_event);
>> +  void *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
>> +
>> +  struct tlob_mmap_page *page = map;
>> +  struct tlob_event *data =
>> +      (struct tlob_event *)((char *)map + page->data_offset);
>> +
>> +  /* Consumer loop: poll for events, read without copying. */
>> +  while (1) {
>> +      poll(&(struct pollfd){fd, POLLIN, 0}, 1, -1);
>> +
>> +      uint32_t head = __atomic_load_n(&page->data_head, __ATOMIC_ACQUIRE);
>> +      uint32_t tail = page->data_tail;
>> +      while (tail != head) {
>> +          handle(&data[tail & (page->capacity - 1)]);
>> +          tail++;
>> +      }
>> +      __atomic_store_n(&page->data_tail, tail, __ATOMIC_RELEASE);
>> +  }
>> +
>> +Note: ``read()`` and ``mmap()`` share the same ring and ``data_tail``
>> +cursor.  Do not use both simultaneously on the same fd.
>> +
>> +``tlob_event`` fields:
>> +
>> +``tid``
>> +  Thread ID (``task_pid_vnr``) of the violating task.
>> +
>> +``threshold_us``
>> +  Budget that was exceeded, in microseconds.
>> +
>> +``on_cpu_us``
>> +  Cumulative on-CPU time at violation time, in microseconds.
>> +
>> +``off_cpu_us``
>> +  Cumulative off-CPU time at violation time, in microseconds.
>> +
>> +``switches``
>> +  Number of context switches since ``TRACE_START``.
>> +
>> +``state``
>> +  1 = timer fired while task was on-CPU; 0 = timer fired while off-CPU.
>> +
>> +``tag``
>> +  Cookie from ``tlob_start_args.tag``; for the tracefs uprobe path this
>> +  equals ``offset_start``.  Zero when not set.
>> +
>> +tracefs files
>> +-------------
>> +
>> +The following files are created under
>> +``/sys/kernel/tracing/rv/monitors/tlob/``:
>> +
>> +``enable`` (rw)
>> +  Write ``1`` to enable the monitor; write ``0`` to disable it and
>> +  stop all currently monitored tasks.
>> +
>> +``desc`` (ro)
>> +  Human-readable description of the monitor.
>> +
>> +``monitor`` (rw)
>> +  Write ``threshold_us:offset_start:offset_stop:binary_path`` to bind two
>> +  plain entry uprobes in *binary_path*.  The uprobe at *offset_start* fires
>> +  ``tlob_start_task()``; the uprobe at *offset_stop* fires
>> +  ``tlob_stop_task()``.  Returns ``-EEXIST`` if a binding with the same
>> +  *offset_start* already exists for *binary_path*.  Write
>> +  ``-offset_start:binary_path`` to remove the binding.  Read to list
>> +  registered bindings, one
>> +  ``threshold_us:0xoffset_start:0xoffset_stop:binary_path`` entry per line.
>> +
>> +Specification
>> +-------------
>> +
>> +Graphviz DOT file in tools/verification/models/tlob.dot
>> diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
>> index 331223761..8d3af68db 100644
>> --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
>> +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
>> @@ -385,6 +385,7 @@ Code  Seq#    Include File                                             Comments
>>   0xB8  01-02  uapi/misc/mrvl_cn10k_dpi.h                                Marvell CN10K DPI driver
>>   0xB8  all    uapi/linux/mshv.h                                         Microsoft Hyper-V /dev/mshv driver
>>                                                                          <mailto:linux-hyperv@vger.kernel.org>
>> +0xB9  00-3F  linux/rv.h                                                Runtime Verification (RV) monitors
>>   0xBA  00-0F  uapi/linux/liveupdate.h                                   Pasha Tatashin
>>                                                                          <mailto:pasha.tatashin@soleen.com>
>>   0xC0  00-0F  linux/usb/iowarrior.h
>> diff --git a/include/uapi/linux/rv.h b/include/uapi/linux/rv.h
>> new file mode 100644
>> index 000000000..d1b96d8cd
>> --- /dev/null
>> +++ b/include/uapi/linux/rv.h
>> @@ -0,0 +1,181 @@
>> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
>> +/*
>> + * UAPI definitions for Runtime Verification (RV) monitors.
>> + *
>> + * All RV monitors that expose an ioctl self-instrumentation interface
>> + * share the magic byte RV_IOC_MAGIC (0xB9), registered in
>> + * Documentation/userspace-api/ioctl/ioctl-number.rst.
>> + *
>> + * A single /dev/rv misc device serves as the entry point.  ioctl numbers
>> + * encode both the monitor identity and the operation:
>> + *
>> + *   0x01 - 0x1F  tlob (task latency over budget)
>> + *   0x20 - 0x3F  reserved for future RV monitors
>> + *
>> + * Usage examples and design rationale are in:
>> + *   Documentation/trace/rv/monitor_tlob.rst
>> + */
>> +
>> +#ifndef _UAPI_LINUX_RV_H
>> +#define _UAPI_LINUX_RV_H
>> +
>> +#include <linux/ioctl.h>
>> +#include <linux/types.h>
>> +
>> +/* Magic byte shared by all RV monitor ioctls. */
>> +#define RV_IOC_MAGIC	0xB9
>> +
>> +/* -----------------------------------------------------------------------
>> + * tlob: task latency over budget monitor  (nr 0x01 - 0x1F)
>> + * -----------------------------------------------------------------------
>> + */
>> +
>> +/**
>> + * struct tlob_start_args - arguments for TLOB_IOCTL_TRACE_START
>> + * @threshold_us: Latency budget for this critical section, in microseconds.
>> + *               Must be greater than zero.
>> + * @tag:         Opaque 64-bit cookie supplied by the caller.  Echoed back
>> + *               verbatim in the tlob_budget_exceeded ftrace event and in any
>> + *               tlob_event record delivered via @notify_fd.  Use it
>> + *               to identify
>> + *               which code region triggered a violation when the same thread
>> + *               monitors multiple regions sequentially.  Set to 0 if not
>> + *               needed.
>> + * @notify_fd:   File descriptor that will receive a tlob_event record on
>> + *               violation.  Must refer to an open /dev/rv fd.  May equal
>> + *               the calling fd (self-notification, useful for retrieving the
>> + *               on_cpu_us / off_cpu_us breakdown after TRACE_STOP returns
>> + *               -EOVERFLOW).  Set to -1 to disable fd notification; in that
>> + *               case violations are only signalled via the TRACE_STOP return
>> + *               value and the tlob_budget_exceeded ftrace event.
>> + * @flags:       Must be 0.  Reserved for future extensions.
>> + */
>> +struct tlob_start_args {
>> +	__u64 threshold_us;
>> +	__u64 tag;
>> +	__s32 notify_fd;
>> +	__u32 flags;
>> +};
>> +
>> +/**
>> + * struct tlob_event - one budget-exceeded event
>> + *
>> + * Consumed by read() on the notify_fd registered at TLOB_IOCTL_TRACE_START.
>> + * Each record describes a single budget exceedance for one task.
>> + *
>> + * @tid:          Thread ID (task_pid_vnr) of the violating task.
>> + * @threshold_us: Budget that was exceeded, in microseconds.
>> + * @on_cpu_us:    Cumulative on-CPU time at violation time, in microseconds.
>> + * @off_cpu_us:   Cumulative off-CPU (scheduling + I/O wait) time at
>> + *               violation time, in microseconds.
>> + * @switches:     Number of context switches since TRACE_START.
>> + * @state:        DA state at violation: 1 = on_cpu, 0 = off_cpu.
>> + * @tag:          Cookie from tlob_start_args.tag; for the tracefs uprobe
>> + *               path this is the offset_start value.  Zero when not set.
>> + */
>> +struct tlob_event {
>> +	__u32 tid;
>> +	__u32 pad;
>> +	__u64 threshold_us;
>> +	__u64 on_cpu_us;
>> +	__u64 off_cpu_us;
>> +	__u32 switches;
>> +	__u32 state;   /* 1 = on_cpu, 0 = off_cpu */
>> +	__u64 tag;
>> +};
>> +
>> +/**
>> + * struct tlob_mmap_page - control page for the mmap'd violation ring buffer
>> + *
>> + * Mapped at offset 0 of the mmap region returned by mmap(2) on a /dev/rv fd.
>> + * The data array of struct tlob_event records begins at offset @data_offset
>> + * (always one page from the mmap base; use this field rather than
>> + * hard-coding PAGE_SIZE so the code remains correct across architectures).
>> + *
>> + * Ring layout:
>> + *
>> + *   mmap base + 0             : struct tlob_mmap_page  (one page)
>> + *   mmap base + data_offset   : struct tlob_event[capacity]
>> + *
>> + * The mmap length determines the ring capacity.  Compute it as:
>> + *
>> + *   raw    = sysconf(_SC_PAGESIZE) + capacity * sizeof(struct tlob_event)
>> + *   length = (raw + sysconf(_SC_PAGESIZE) - 1) & ~(sysconf(_SC_PAGESIZE) - 1)
>> + *
>> + * i.e. round the raw byte count up to the next page boundary before
>> + * passing it to mmap(2).  The kernel requires a page-aligned length.
>> + * The requested capacity must be a power of 2.  Read @capacity after a
>> + * successful mmap(2) for the value the kernel actually granted.
>> + *
>> + * Producer/consumer ordering contract:
>> + *
>> + *   Kernel (producer):
>> + *     data[data_head & (capacity - 1)] = event;
>> + *     // pairs with load-acquire in userspace:
>> + *     smp_store_release(&page->data_head, data_head + 1);
>> + *
>> + *   Userspace (consumer):
>> + *     // pairs with store-release in kernel:
>> + *     head = __atomic_load_n(&page->data_head, __ATOMIC_ACQUIRE);
>> + *     for (tail = page->data_tail; tail != head; tail++)
>> + *         handle(&data[tail & (capacity - 1)]);
>> + *     __atomic_store_n(&page->data_tail, tail, __ATOMIC_RELEASE);
>> + *
>> + * @data_head and @data_tail are monotonically increasing __u32 counters
>> + * in units of records.  Unsigned 32-bit wrap-around is handled correctly
>> + * by modular arithmetic; the ring is full when
>> + * (data_head - data_tail) == capacity.
>> + *
>> + * When the ring is full the kernel drops the incoming record and increments
>> + * @dropped.  The consumer should check @dropped periodically to detect loss.
>> + *
>> + * read() and mmap() share the same ring buffer.  Do not use both
>> + * simultaneously on the same fd.
>> + *
>> + * @data_head:   Next write slot index.  Updated by the kernel with
>> + *               store-release ordering.  Read by userspace with
>> + *               load-acquire.
>> + * @data_tail:   Next read slot index.  Updated by userspace.  Read by the
>> + *               kernel to detect overflow.
>> + * @capacity:    Actual ring capacity in records (power of 2).  Written
>> + *               once by the kernel at mmap time; read-only for userspace
>> + *               thereafter.
>> + * @version:     Ring buffer ABI version; currently 1.
>> + * @data_offset: Byte offset from the mmap base to the data array.
>> + *               Always equal to sysconf(_SC_PAGESIZE) on the running kernel.
>> + * @record_size: sizeof(struct tlob_event) as seen by the kernel.  Verify
>> + *               this matches userspace's sizeof before indexing the array.
>> + * @dropped:     Number of events dropped because the ring was full.
>> + *               Monotonically increasing; read with __ATOMIC_RELAXED.
>> + */
>> +struct tlob_mmap_page {
>> +	__u32  data_head;
>> +	__u32  data_tail;
>> +	__u32  capacity;
>> +	__u32  version;
>> +	__u32  data_offset;
>> +	__u32  record_size;
>> +	__u64  dropped;
>> +};
>> +
>> +/*
>> + * TLOB_IOCTL_TRACE_START - begin monitoring the calling task.
>> + *
>> + * Arms a per-task hrtimer for threshold_us microseconds.  If args.notify_fd
>> + * is >= 0, a tlob_event record is pushed into that fd's ring buffer on
>> + * violation in addition to the tlob_budget_exceeded ftrace event.
>> + * args.notify_fd == -1 disables fd notification.
>> + *
>> + * Violation records are consumed by read() on the notify_fd (blocking or
>> + * non-blocking depending on O_NONBLOCK).  On violation,
>> + * TLOB_IOCTL_TRACE_STOP also returns -EOVERFLOW regardless of whether
>> + * notify_fd is set.
>> + *
>> + * args.flags must be 0.
>> + */
>> +#define TLOB_IOCTL_TRACE_START		_IOW(RV_IOC_MAGIC, 0x01, struct tlob_start_args)
>> +
>> +/*
>> + * TLOB_IOCTL_TRACE_STOP - end monitoring the calling task.
>> + *
>> + * Returns 0 if within budget, -EOVERFLOW if the budget was exceeded.
>> + */
>> +#define TLOB_IOCTL_TRACE_STOP		_IO(RV_IOC_MAGIC,  0x02)
>> +
>> +#endif /* _UAPI_LINUX_RV_H */
>> diff --git a/kernel/trace/rv/Kconfig b/kernel/trace/rv/Kconfig
>> index 5b4be87ba..227573cda 100644
>> --- a/kernel/trace/rv/Kconfig
>> +++ b/kernel/trace/rv/Kconfig
>> @@ -65,6 +65,7 @@ source "kernel/trace/rv/monitors/pagefault/Kconfig"
>>   source "kernel/trace/rv/monitors/sleep/Kconfig"
>>   # Add new rtapp monitors here
>>   
>> +source "kernel/trace/rv/monitors/tlob/Kconfig"
>>   # Add new monitors here
>>   
>>   config RV_REACTORS
>> @@ -93,3 +94,19 @@ config RV_REACT_PANIC
>>   	help
>>   	  Enables the panic reactor. The panic reactor emits a printk()
>>   	  message if an exception is found and panic()s the system.
>> +
>> +config RV_CHARDEV
>> +	bool "RV ioctl interface via /dev/rv"
>> +	depends on RV
>> +	help
>> +	  Register a /dev/rv misc device that exposes an ioctl interface
>> +	  for RV monitor self-instrumentation.  All RV monitors share the
>> +	  single device node; ioctl numbers encode the monitor identity.
>> +
>> +	  When enabled, user-space programs can open /dev/rv and use
>> +	  monitor-specific ioctl commands to bracket code regions they
>> +	  want the kernel RV subsystem to observe.
>> +
>> +	  Say Y here if you want to use the tlob self-instrumentation
>> +	  ioctl interface; otherwise say N.
>> diff --git a/kernel/trace/rv/Makefile b/kernel/trace/rv/Makefile
>> index 750e4ad6f..cc3781a3b 100644
>> --- a/kernel/trace/rv/Makefile
>> +++ b/kernel/trace/rv/Makefile
>> @@ -3,6 +3,7 @@
>>   ccflags-y += -I $(src)		# needed for trace events
>>   
>>   obj-$(CONFIG_RV) += rv.o
>> +obj-$(CONFIG_RV_CHARDEV) += rv_dev.o
>>   obj-$(CONFIG_RV_MON_WIP) += monitors/wip/wip.o
>>   obj-$(CONFIG_RV_MON_WWNR) += monitors/wwnr/wwnr.o
>>   obj-$(CONFIG_RV_MON_SCHED) += monitors/sched/sched.o
>> @@ -17,6 +18,7 @@ obj-$(CONFIG_RV_MON_STS) += monitors/sts/sts.o
>>   obj-$(CONFIG_RV_MON_NRP) += monitors/nrp/nrp.o
>>   obj-$(CONFIG_RV_MON_SSSW) += monitors/sssw/sssw.o
>>   obj-$(CONFIG_RV_MON_OPID) += monitors/opid/opid.o
>> +obj-$(CONFIG_RV_MON_TLOB) += monitors/tlob/tlob.o
>>   # Add new monitors here
>>   obj-$(CONFIG_RV_REACTORS) += rv_reactors.o
>>   obj-$(CONFIG_RV_REACT_PRINTK) += reactor_printk.o
>> diff --git a/kernel/trace/rv/monitors/tlob/Kconfig
>> b/kernel/trace/rv/monitors/tlob/Kconfig
>> new file mode 100644
>> index 000000000..010237480
>> --- /dev/null
>> +++ b/kernel/trace/rv/monitors/tlob/Kconfig
>> @@ -0,0 +1,51 @@
>> +# SPDX-License-Identifier: GPL-2.0-only
>> +#
>> +config RV_MON_TLOB
>> +	bool "tlob monitor"
>> +	depends on RV
>> +	depends on UPROBES
>> +	select DA_MON_EVENTS_ID
>> +	help
>> +	  Enable the tlob (task latency over budget) monitor.  This monitor
>> +	  tracks the elapsed time (CLOCK_MONOTONIC) of a marked code path
>> +	  within a task (including both on-CPU and off-CPU time) and
>> +	  reports a violation when the elapsed time exceeds a configurable
>> +	  budget threshold.
>> +
>> +	  The monitor implements a three-state deterministic automaton.
>> +	  States: unmonitored, on_cpu, off_cpu.
>> +	  Key transitions:
>> +	    unmonitored --(trace_start)-->    on_cpu
>> +	    on_cpu      --(switch_out)-->     off_cpu
>> +	    off_cpu     --(switch_in)-->      on_cpu
>> +	    on_cpu      --(trace_stop)-->     unmonitored
>> +	    off_cpu     --(trace_stop)-->     unmonitored
>> +	    on_cpu      --(budget_expired)--> unmonitored
>> +	    off_cpu     --(budget_expired)--> unmonitored
>> +
>> +	  External configuration is done via the tracefs "monitor" file:
>> +	    echo pid:threshold_us:binary:offset_start:offset_stop \
>> +	                          > .../rv/monitors/tlob/monitor  (add task)
>> +	    echo -pid             > .../rv/monitors/tlob/monitor  (remove task)
>> +	    cat                     .../rv/monitors/tlob/monitor  (list tasks)
>> +
>> +	  The uprobe binding places two plain entry uprobes at offset_start
>> +	  and offset_stop in the binary; these trigger tlob_start_task() and
>> +	  tlob_stop_task() respectively.  Using two entry uprobes (rather
>> +	  than a uretprobe) means that a mistyped offset can never corrupt
>> +	  the call stack; the worst outcome is a missed stop, which causes
>> +	  the hrtimer to fire and report a budget violation.
>> +
>> +	  Violation events are delivered via a lock-free mmap ring buffer
>> +	  on /dev/rv (enabled by CONFIG_RV_CHARDEV).  The consumer mmap()s
>> +	  the device, reads records from the data array using the head/tail
>> +	  indices in the control page, and advances data_tail when done.
>> +
>> +	  For self-instrumentation, use TLOB_IOCTL_TRACE_START /
>> +	  TLOB_IOCTL_TRACE_STOP via the /dev/rv misc device (enabled by
>> +	  CONFIG_RV_CHARDEV).
>> +
>> +	  Up to TLOB_MAX_MONITORED tasks may be monitored simultaneously.
>> +
>> +	  For further information, see:
>> +	    Documentation/trace/rv/monitor_tlob.rst
>> diff --git a/kernel/trace/rv/monitors/tlob/tlob.c
>> b/kernel/trace/rv/monitors/tlob/tlob.c
>> new file mode 100644
>> index 000000000..a6e474025
>> --- /dev/null
>> +++ b/kernel/trace/rv/monitors/tlob/tlob.c
>> @@ -0,0 +1,986 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * tlob: task latency over budget monitor
>> + *
>> + * Track the elapsed wall-clock time of a marked code path and detect when
>> + * a monitored task exceeds its per-task latency budget.  CLOCK_MONOTONIC
>> + * is used so both on-CPU and off-CPU time count toward the budget.
>> + *
>> + * Per-task state is maintained in a spinlock-protected hash table.  A
>> + * one-shot hrtimer fires at the deadline; if the task has not called
>> + * trace_stop by then, a violation is recorded.
>> + *
>> + * Up to TLOB_MAX_MONITORED tasks may be tracked simultaneously.
>> + *
>> + * Copyright (C) 2026 Wen Yang <wen.yang@linux.dev>
>> + */
>> +#include <linux/file.h>
>> +#include <linux/fs.h>
>> +#include <linux/ftrace.h>
>> +#include <linux/hash.h>
>> +#include <linux/hrtimer.h>
>> +#include <linux/kernel.h>
>> +#include <linux/ktime.h>
>> +#include <linux/module.h>
>> +#include <linux/init.h>
>> +#include <linux/namei.h>
>> +#include <linux/poll.h>
>> +#include <linux/rv.h>
>> +#include <linux/sched.h>
>> +#include <linux/slab.h>
>> +#include <linux/atomic.h>
>> +#include <linux/rcupdate.h>
>> +#include <linux/spinlock.h>
>> +#include <linux/tracefs.h>
>> +#include <linux/uaccess.h>
>> +#include <linux/uprobes.h>
>> +#include <kunit/visibility.h>
>> +#include <rv/instrumentation.h>
>> +
>> +/* rv_interface_lock is defined in kernel/trace/rv/rv.c */
>> +extern struct mutex rv_interface_lock;
>> +
>> +#define MODULE_NAME "tlob"
>> +
>> +#include <rv_trace.h>
>> +#include <trace/events/sched.h>
>> +
>> +#define RV_MON_TYPE RV_MON_PER_TASK
>> +#include "tlob.h"
>> +#include <rv/da_monitor.h>
>> +
>> +/* Hash table size; must be a power of two. */
>> +#define TLOB_HTABLE_BITS		6
>> +#define TLOB_HTABLE_SIZE		(1 << TLOB_HTABLE_BITS)
>> +
>> +/* Maximum binary path length for uprobe binding. */
>> +#define TLOB_MAX_PATH			256
>> +
>> +/* Per-task latency monitoring state. */
>> +struct tlob_task_state {
>> +	struct hlist_node	hlist;
>> +	struct task_struct	*task;
>> +	u64			threshold_us;
>> +	u64			tag;
>> +	struct hrtimer		deadline_timer;
>> +	int			canceled;	/* protected by entry_lock */
>> +	struct file		*notify_file;	/* NULL or held reference */
>> +
>> +	/*
>> +	 * entry_lock serialises the mutable accounting fields below.
>> +	 * Lock order: tlob_table_lock -> entry_lock (never reverse).
>> +	 */
>> +	raw_spinlock_t		entry_lock;
>> +	u64			on_cpu_us;
>> +	u64			off_cpu_us;
>> +	ktime_t			last_ts;
>> +	u32			switches;
>> +	u8			da_state;
>> +
>> +	struct rcu_head		rcu;	/* for call_rcu() teardown */
>> +};
>> +
>> +/* Per-uprobe-binding state: a start + stop probe pair for one binary region. */
>> +struct tlob_uprobe_binding {
>> +	struct list_head	list;
>> +	u64			threshold_us;
>> +	struct path		path;
>> +	char			binpath[TLOB_MAX_PATH];	/* canonical path for read/remove */
>> +	loff_t			offset_start;
>> +	loff_t			offset_stop;
>> +	struct uprobe_consumer	entry_uc;
>> +	struct uprobe_consumer	stop_uc;
>> +	struct uprobe		*entry_uprobe;
>> +	struct uprobe		*stop_uprobe;
>> +};
>> +
>> +/* Object pool for tlob_task_state. */
>> +static struct kmem_cache *tlob_state_cache;
>> +
>> +/* Hash table and lock protecting table structure (insert/delete/canceled). */
>> +static struct hlist_head tlob_htable[TLOB_HTABLE_SIZE];
>> +static DEFINE_RAW_SPINLOCK(tlob_table_lock);
>> +static atomic_t tlob_num_monitored = ATOMIC_INIT(0);
>> +
>> +/* Uprobe binding list; protected by tlob_uprobe_mutex. */
>> +static LIST_HEAD(tlob_uprobe_list);
>> +static DEFINE_MUTEX(tlob_uprobe_mutex);
>> +
>> +/* Forward declaration */
>> +static enum hrtimer_restart tlob_deadline_timer_fn(struct hrtimer *timer);
>> +
>> +/* Hash table helpers */
>> +
>> +static unsigned int tlob_hash_task(const struct task_struct *task)
>> +{
>> +	return hash_ptr((void *)task, TLOB_HTABLE_BITS);
>> +}
>> +
>> +/*
>> + * tlob_find_rcu - look up per-task state.
>> + * Must be called under rcu_read_lock() or with tlob_table_lock held.
>> + */
>> +static struct tlob_task_state *tlob_find_rcu(struct task_struct *task)
>> +{
>> +	struct tlob_task_state *ws;
>> +	unsigned int h = tlob_hash_task(task);
>> +
>> +	hlist_for_each_entry_rcu(ws, &tlob_htable[h], hlist,
>> +				 lockdep_is_held(&tlob_table_lock))
>> +		if (ws->task == task)
>> +			return ws;
>> +	return NULL;
>> +}
>> +
>> +/* Allocate and initialise a new per-task state entry. */
>> +static struct tlob_task_state *tlob_alloc(struct task_struct *task,
>> +					  u64 threshold_us, u64 tag)
>> +{
>> +	struct tlob_task_state *ws;
>> +
>> +	ws = kmem_cache_zalloc(tlob_state_cache, GFP_ATOMIC);
>> +	if (!ws)
>> +		return NULL;
>> +
>> +	ws->task = task;
>> +	get_task_struct(task);
>> +	ws->threshold_us = threshold_us;
>> +	ws->tag = tag;
>> +	ws->last_ts = ktime_get();
>> +	ws->da_state = on_cpu_tlob;
>> +	raw_spin_lock_init(&ws->entry_lock);
>> +	hrtimer_setup(&ws->deadline_timer, tlob_deadline_timer_fn,
>> +		      CLOCK_MONOTONIC, HRTIMER_MODE_REL);
>> +	return ws;
>> +}
>> +
>> +/* RCU callback: free the slab once no readers remain. */
>> +static void tlob_free_rcu_slab(struct rcu_head *head)
>> +{
>> +	struct tlob_task_state *ws =
>> +		container_of(head, struct tlob_task_state, rcu);
>> +	kmem_cache_free(tlob_state_cache, ws);
>> +}
>> +
>> +/* Arm the one-shot deadline timer for threshold_us microseconds. */
>> +static void tlob_arm_deadline(struct tlob_task_state *ws)
>> +{
>> +	hrtimer_start(&ws->deadline_timer,
>> +		      ns_to_ktime(ws->threshold_us * NSEC_PER_USEC),
>> +		      HRTIMER_MODE_REL);
>> +}
>> +
>> +/*
>> + * Push a violation record into a monitor fd's ring buffer (softirq context).
>> + * Drop-new policy: discard incoming record when full.  smp_store_release on
>> + * data_head pairs with smp_load_acquire in the consumer.
>> + */
>> +static void tlob_event_push(struct rv_file_priv *priv,
>> +			    const struct tlob_event *info)
>> +{
>> +	struct tlob_ring *ring = &priv->ring;
>> +	unsigned long flags;
>> +	u32 head, tail;
>> +
>> +	spin_lock_irqsave(&ring->lock, flags);
>> +
>> +	head = ring->page->data_head;
>> +	tail = READ_ONCE(ring->page->data_tail);
>> +
>> +	if (head - tail > ring->mask) {
>> +		/* Ring full: drop incoming record. */
>> +		ring->page->dropped++;
>> +		spin_unlock_irqrestore(&ring->lock, flags);
>> +		return;
>> +	}
>> +
>> +	ring->data[head & ring->mask] = *info;
>> +	/* pairs with smp_load_acquire() in the consumer */
>> +	smp_store_release(&ring->page->data_head, head + 1);
>> +
>> +	spin_unlock_irqrestore(&ring->lock, flags);
>> +
>> +	wake_up_interruptible_poll(&priv->waitq, EPOLLIN | EPOLLRDNORM);
>> +}
>> +
>> +#if IS_ENABLED(CONFIG_KUNIT)
>> +void tlob_event_push_kunit(struct rv_file_priv *priv,
>> +			  const struct tlob_event *info)
>> +{
>> +	tlob_event_push(priv, info);
>> +}
>> +EXPORT_SYMBOL_IF_KUNIT(tlob_event_push_kunit);
>> +#endif /* CONFIG_KUNIT */
>> +
>> +/*
>> + * Budget exceeded: remove the entry, record the violation, and inject
>> + * budget_expired into the DA.
>> + *
>> + * Lock order: tlob_table_lock -> entry_lock.  tlob_stop_task() sets
>> + * ws->canceled under both locks; if we see it here the stop path owns
>> + * cleanup.
>> + * fput/put_task_struct are done before call_rcu(); the RCU callback only
>> + * reclaims the slab.
>> + */
>> +static enum hrtimer_restart tlob_deadline_timer_fn(struct hrtimer *timer)
>> +{
>> +	struct tlob_task_state *ws =
>> +		container_of(timer, struct tlob_task_state, deadline_timer);
>> +	struct tlob_event info = {};
>> +	struct file *notify_file;
>> +	struct task_struct *task;
>> +	unsigned long flags;
>> +	/* snapshots taken under entry_lock */
>> +	u64 on_cpu_us, off_cpu_us, threshold_us, tag;
>> +	u32 switches;
>> +	bool on_cpu;
>> +	bool push_event = false;
>> +
>> +	raw_spin_lock_irqsave(&tlob_table_lock, flags);
>> +	/* stop path sets canceled under both locks; if set it owns cleanup */
>> +	if (ws->canceled) {
>> +		raw_spin_unlock_irqrestore(&tlob_table_lock, flags);
>> +		return HRTIMER_NORESTART;
>> +	}
>> +
>> +	/* Finalize accounting and snapshot all fields under entry_lock. */
>> +	raw_spin_lock(&ws->entry_lock);
>> +
>> +	{
>> +		ktime_t now = ktime_get();
>> +		u64 delta_us = ktime_to_us(ktime_sub(now, ws->last_ts));
>> +
>> +		if (ws->da_state == on_cpu_tlob)
>> +			ws->on_cpu_us += delta_us;
>> +		else
>> +			ws->off_cpu_us += delta_us;
>> +	}
>> +
>> +	ws->canceled  = 1;
>> +	on_cpu_us     = ws->on_cpu_us;
>> +	off_cpu_us    = ws->off_cpu_us;
>> +	threshold_us  = ws->threshold_us;
>> +	tag           = ws->tag;
>> +	switches      = ws->switches;
>> +	on_cpu        = (ws->da_state == on_cpu_tlob);
>> +	notify_file   = ws->notify_file;
>> +	if (notify_file) {
>> +		info.tid          = task_pid_vnr(ws->task);
>> +		info.threshold_us = threshold_us;
>> +		info.on_cpu_us    = on_cpu_us;
>> +		info.off_cpu_us   = off_cpu_us;
>> +		info.switches     = switches;
>> +		info.state        = on_cpu ? 1 : 0;
>> +		info.tag          = tag;
>> +		push_event        = true;
>> +	}
>> +
>> +	raw_spin_unlock(&ws->entry_lock);
>> +
>> +	hlist_del_rcu(&ws->hlist);
>> +	atomic_dec(&tlob_num_monitored);
>> +	/*
>> +	 * Hold a reference so task remains valid across da_handle_event()
>> +	 * after we drop tlob_table_lock.
>> +	 */
>> +	task = ws->task;
>> +	get_task_struct(task);
>> +	raw_spin_unlock_irqrestore(&tlob_table_lock, flags);
>> +
>> +	/*
>> +	 * Both locks are now released; ws is exclusively owned (removed from
>> +	 * the hash table with canceled=1).  Emit the tracepoint and push the
>> +	 * violation record.
>> +	 */
>> +	trace_tlob_budget_exceeded(ws->task, threshold_us, on_cpu_us,
>> +				   off_cpu_us, switches, on_cpu, tag);
>> +
>> +	if (push_event) {
>> +		struct rv_file_priv *priv = notify_file->private_data;
>> +
>> +		if (priv)
>> +			tlob_event_push(priv, &info);
>> +	}
>> +
>> +	da_handle_event(task, budget_expired_tlob);
>> +
>> +	if (notify_file)
>> +		fput(notify_file);	/* ref from fget() at TRACE_START */
>> +	put_task_struct(ws->task);	/* ref from tlob_alloc() */
>> +	put_task_struct(task);		/* extra ref from get_task_struct() above */
>> +	call_rcu(&ws->rcu, tlob_free_rcu_slab);
>> +	return HRTIMER_NORESTART;
>> +}
>> +
>> +/* Tracepoint handlers */
>> +
>> +/*
>> + * handle_sched_switch - advance the DA and accumulate on/off-CPU time.
>> + *
>> + * RCU read-side for lock-free lookup; entry_lock for per-task accounting.
>> + * da_handle_event() is called after rcu_read_unlock() to avoid holding the
>> + * read-side critical section across the RV framework.
>> + */
>> +static void handle_sched_switch(void *data, bool preempt,
>> +				struct task_struct *prev,
>> +				struct task_struct *next,
>> +				unsigned int prev_state)
>> +{
>> +	struct tlob_task_state *ws;
>> +	unsigned long flags;
>> +	bool do_prev = false, do_next = false;
>> +	ktime_t now;
>> +
>> +	rcu_read_lock();
>> +
>> +	ws = tlob_find_rcu(prev);
>> +	if (ws) {
>> +		raw_spin_lock_irqsave(&ws->entry_lock, flags);
>> +		if (!ws->canceled) {
>> +			now = ktime_get();
>> +			ws->on_cpu_us += ktime_to_us(ktime_sub(now, ws->last_ts));
>> +			ws->last_ts = now;
>> +			ws->switches++;
>> +			ws->da_state = off_cpu_tlob;
>> +			do_prev = true;
>> +		}
>> +		raw_spin_unlock_irqrestore(&ws->entry_lock, flags);
>> +	}
>> +
>> +	ws = tlob_find_rcu(next);
>> +	if (ws) {
>> +		raw_spin_lock_irqsave(&ws->entry_lock, flags);
>> +		if (!ws->canceled) {
>> +			now = ktime_get();
>> +			ws->off_cpu_us += ktime_to_us(ktime_sub(now, ws->last_ts));
>> +			ws->last_ts = now;
>> +			ws->da_state = on_cpu_tlob;
>> +			do_next = true;
>> +		}
>> +		raw_spin_unlock_irqrestore(&ws->entry_lock, flags);
>> +	}
>> +
>> +	rcu_read_unlock();
>> +
>> +	if (do_prev)
>> +		da_handle_event(prev, switch_out_tlob);
>> +	if (do_next)
>> +		da_handle_event(next, switch_in_tlob);
>> +}
>> +
>> +static void handle_sched_wakeup(void *data, struct task_struct *p)
>> +{
>> +	struct tlob_task_state *ws;
>> +	unsigned long flags;
>> +	bool found = false;
>> +
>> +	rcu_read_lock();
>> +	ws = tlob_find_rcu(p);
>> +	if (ws) {
>> +		raw_spin_lock_irqsave(&ws->entry_lock, flags);
>> +		found = !ws->canceled;
>> +		raw_spin_unlock_irqrestore(&ws->entry_lock, flags);
>> +	}
>> +	rcu_read_unlock();
>> +
>> +	if (found)
>> +		da_handle_event(p, sched_wakeup_tlob);
>> +}
>> +
>> +/* -----------------------------------------------------------------------
>> + * Core start/stop helpers (also called from rv_dev.c)
>> + * -----------------------------------------------------------------------
>> + */
>> +
>> +/*
>> + * __tlob_insert - insert @ws into the hash table and arm its deadline timer.
>> + *
>> + * Re-checks for duplicates and capacity under tlob_table_lock; the caller
>> + * may have done a lock-free pre-check before allocating @ws.  On failure @ws
>> + * is freed directly (never in table, so no call_rcu needed).
>> + */
>> +static int __tlob_insert(struct task_struct *task, struct tlob_task_state *ws)
>> +{
>> +	unsigned int h;
>> +	unsigned long flags;
>> +
>> +	raw_spin_lock_irqsave(&tlob_table_lock, flags);
>> +	if (tlob_find_rcu(task)) {
>> +		raw_spin_unlock_irqrestore(&tlob_table_lock, flags);
>> +		if (ws->notify_file)
>> +			fput(ws->notify_file);
>> +		put_task_struct(ws->task);
>> +		kmem_cache_free(tlob_state_cache, ws);
>> +		return -EEXIST;
>> +	}
>> +	if (atomic_read(&tlob_num_monitored) >= TLOB_MAX_MONITORED) {
>> +		raw_spin_unlock_irqrestore(&tlob_table_lock, flags);
>> +		if (ws->notify_file)
>> +			fput(ws->notify_file);
>> +		put_task_struct(ws->task);
>> +		kmem_cache_free(tlob_state_cache, ws);
>> +		return -ENOSPC;
>> +	}
>> +	h = tlob_hash_task(task);
>> +	hlist_add_head_rcu(&ws->hlist, &tlob_htable[h]);
>> +	atomic_inc(&tlob_num_monitored);
>> +	raw_spin_unlock_irqrestore(&tlob_table_lock, flags);
>> +
>> +	da_handle_start_run_event(task, trace_start_tlob);
>> +	tlob_arm_deadline(ws);
>> +	return 0;
>> +}
>> +
>> +/**
>> + * tlob_start_task - begin monitoring @task with latency budget @threshold_us.
>> + *
>> + * @notify_file: /dev/rv fd whose ring buffer receives a tlob_event on
>> + *               violation; caller transfers the fget() reference to tlob.c.
>> + *               Pass NULL for synchronous mode (violations only via
>> + *               TRACE_STOP return value and the tlob_budget_exceeded event).
>> + *
>> + * Returns 0, -ENODEV, -ERANGE, -EEXIST, -ENOSPC, or -ENOMEM.  On failure
>> + * the caller retains responsibility for any @notify_file reference.
>> + */
>> +int tlob_start_task(struct task_struct *task, u64 threshold_us,
>> +		    struct file *notify_file, u64 tag)
>> +{
>> +	struct tlob_task_state *ws;
>> +	unsigned long flags;
>> +
>> +	if (!tlob_state_cache)
>> +		return -ENODEV;
>> +
>> +	/* reject a zero budget and products that would overflow ktime_t */
>> +	if (!threshold_us || threshold_us > (u64)KTIME_MAX / NSEC_PER_USEC)
>> +		return -ERANGE;
>> +
>> +	/* Quick pre-check before allocation. */
>> +	raw_spin_lock_irqsave(&tlob_table_lock, flags);
>> +	if (tlob_find_rcu(task)) {
>> +		raw_spin_unlock_irqrestore(&tlob_table_lock, flags);
>> +		return -EEXIST;
>> +	}
>> +	if (atomic_read(&tlob_num_monitored) >= TLOB_MAX_MONITORED) {
>> +		raw_spin_unlock_irqrestore(&tlob_table_lock, flags);
>> +		return -ENOSPC;
>> +	}
>> +	raw_spin_unlock_irqrestore(&tlob_table_lock, flags);
>> +
>> +	ws = tlob_alloc(task, threshold_us, tag);
>> +	if (!ws)
>> +		return -ENOMEM;
>> +
>> +	ws->notify_file = notify_file;
>> +	return __tlob_insert(task, ws);
>> +}
>> +EXPORT_SYMBOL_GPL(tlob_start_task);
>> +
>> +/**
>> + * tlob_stop_task - stop monitoring @task before the deadline fires.
>> + *
>> + * Sets canceled under entry_lock (inside tlob_table_lock) before calling
>> + * hrtimer_cancel(), racing safely with the timer callback.
>> + *
>> + * Returns 0 if within budget, -ESRCH if the entry is gone (deadline already
>> + * fired, or TRACE_START was never called).
>> + */
>> +int tlob_stop_task(struct task_struct *task)
>> +{
>> +	struct tlob_task_state *ws;
>> +	struct file *notify_file;
>> +	unsigned long flags;
>> +
>> +	raw_spin_lock_irqsave(&tlob_table_lock, flags);
>> +	ws = tlob_find_rcu(task);
>> +	if (!ws) {
>> +		raw_spin_unlock_irqrestore(&tlob_table_lock, flags);
>> +		return -ESRCH;
>> +	}
>> +
>> +	/* Prevent handle_sched_switch from updating accounting after removal. */
>> +	raw_spin_lock(&ws->entry_lock);
>> +	ws->canceled = 1;
>> +	raw_spin_unlock(&ws->entry_lock);
>> +
>> +	hlist_del_rcu(&ws->hlist);
>> +	atomic_dec(&tlob_num_monitored);
>> +	raw_spin_unlock_irqrestore(&tlob_table_lock, flags);
>> +
>> +	hrtimer_cancel(&ws->deadline_timer);
>> +
>> +	da_handle_event(task, trace_stop_tlob);
>> +
>> +	notify_file = ws->notify_file;
>> +	if (notify_file)
>> +		fput(notify_file);
>> +	put_task_struct(ws->task);
>> +	call_rcu(&ws->rcu, tlob_free_rcu_slab);
>> +
>> +	return 0;
>> +}
>> +EXPORT_SYMBOL_GPL(tlob_stop_task);
>> +
>> +/* Stop monitoring all tracked tasks; called on monitor disable. */
>> +static void tlob_stop_all(void)
>> +{
>> +	struct tlob_task_state *batch[TLOB_MAX_MONITORED];
>> +	struct tlob_task_state *ws;
>> +	struct hlist_node *tmp;
>> +	unsigned long flags;
>> +	int n = 0, i;
>> +
>> +	raw_spin_lock_irqsave(&tlob_table_lock, flags);
>> +	for (i = 0; i < TLOB_HTABLE_SIZE; i++) {
>> +		hlist_for_each_entry_safe(ws, tmp, &tlob_htable[i], hlist) {
>> +			raw_spin_lock(&ws->entry_lock);
>> +			ws->canceled = 1;
>> +			raw_spin_unlock(&ws->entry_lock);
>> +			hlist_del_rcu(&ws->hlist);
>> +			atomic_dec(&tlob_num_monitored);
>> +			if (n < TLOB_MAX_MONITORED)
>> +				batch[n++] = ws;
>> +		}
>> +	}
>> +	raw_spin_unlock_irqrestore(&tlob_table_lock, flags);
>> +
>> +	for (i = 0; i < n; i++) {
>> +		ws = batch[i];
>> +		hrtimer_cancel(&ws->deadline_timer);
>> +		da_handle_event(ws->task, trace_stop_tlob);
>> +		if (ws->notify_file)
>> +			fput(ws->notify_file);
>> +		put_task_struct(ws->task);
>> +		call_rcu(&ws->rcu, tlob_free_rcu_slab);
>> +	}
>> +}
>> +
>> +/* uprobe binding helpers */
>> +
>> +static int tlob_uprobe_entry_handler(struct uprobe_consumer *uc,
>> +				     struct pt_regs *regs, __u64 *data)
>> +{
>> +	struct tlob_uprobe_binding *b =
>> +		container_of(uc, struct tlob_uprobe_binding, entry_uc);
>> +
>> +	tlob_start_task(current, b->threshold_us, NULL, (u64)b->offset_start);
>> +	return 0;
>> +}
>> +
>> +static int tlob_uprobe_stop_handler(struct uprobe_consumer *uc,
>> +				    struct pt_regs *regs, __u64 *data)
>> +{
>> +	tlob_stop_task(current);
>> +	return 0;
>> +}
>> +
>> +/*
>> + * Register start + stop entry uprobes for a binding.
>> + * Both are plain entry uprobes (no uretprobe), so a wrong offset never
>> + * corrupts the call stack; the worst outcome is a missed stop (hrtimer
>> + * fires and reports a budget violation).
>> + * Called with tlob_uprobe_mutex held.
>> + */
>> +static int tlob_add_uprobe(u64 threshold_us, const char *binpath,
>> +			   loff_t offset_start, loff_t offset_stop)
>> +{
>> +	struct tlob_uprobe_binding *b, *tmp_b;
>> +	char pathbuf[TLOB_MAX_PATH];
>> +	struct inode *inode;
>> +	char *canon;
>> +	int ret;
>> +
>> +	if (binpath[0] != '/')
>> +		return -EINVAL;
>> +
>> +	b = kzalloc(sizeof(*b), GFP_KERNEL);
>> +	if (!b)
>> +		return -ENOMEM;
>> +
>> +	b->threshold_us = threshold_us;
>> +	b->offset_start = offset_start;
>> +	b->offset_stop  = offset_stop;
>> +
>> +	ret = kern_path(binpath, LOOKUP_FOLLOW, &b->path);
>> +	if (ret)
>> +		goto err_free;
>> +
>> +	if (!d_is_reg(b->path.dentry)) {
>> +		ret = -EINVAL;
>> +		goto err_path;
>> +	}
>> +
>> +	/* Reject duplicate start offset for the same binary. */
>> +	list_for_each_entry(tmp_b, &tlob_uprobe_list, list) {
>> +		if (tmp_b->offset_start == offset_start &&
>> +		    tmp_b->path.dentry == b->path.dentry) {
>> +			ret = -EEXIST;
>> +			goto err_path;
>> +		}
>> +	}
>> +
>> +	/* Store canonical path for read-back and removal matching. */
>> +	canon = d_path(&b->path, pathbuf, sizeof(pathbuf));
>> +	if (IS_ERR(canon)) {
>> +		ret = PTR_ERR(canon);
>> +		goto err_path;
>> +	}
>> +	strscpy(b->binpath, canon, sizeof(b->binpath));
>> +
>> +	b->entry_uc.handler = tlob_uprobe_entry_handler;
>> +	b->stop_uc.handler  = tlob_uprobe_stop_handler;
>> +
>> +	inode = d_real_inode(b->path.dentry);
>> +
>> +	b->entry_uprobe = uprobe_register(inode, offset_start, 0, &b->entry_uc);
>> +	if (IS_ERR(b->entry_uprobe)) {
>> +		ret = PTR_ERR(b->entry_uprobe);
>> +		b->entry_uprobe = NULL;
>> +		goto err_path;
>> +	}
>> +
>> +	b->stop_uprobe = uprobe_register(inode, offset_stop, 0, &b->stop_uc);
>> +	if (IS_ERR(b->stop_uprobe)) {
>> +		ret = PTR_ERR(b->stop_uprobe);
>> +		b->stop_uprobe = NULL;
>> +		goto err_entry;
>> +	}
>> +
>> +	list_add_tail(&b->list, &tlob_uprobe_list);
>> +	return 0;
>> +
>> +err_entry:
>> +	uprobe_unregister_nosync(b->entry_uprobe, &b->entry_uc);
>> +	uprobe_unregister_sync();
>> +err_path:
>> +	path_put(&b->path);
>> +err_free:
>> +	kfree(b);
>> +	return ret;
>> +}
>> +
>> +/*
>> + * Remove the uprobe binding for (offset_start, binpath).
>> + * binpath is resolved to a dentry for comparison so symlinks are handled
>> + * correctly.  Called with tlob_uprobe_mutex held.
>> + */
>> +static void tlob_remove_uprobe_by_key(loff_t offset_start, const char *binpath)
>> +{
>> +	struct tlob_uprobe_binding *b, *tmp;
>> +	struct path remove_path;
>> +
>> +	if (kern_path(binpath, LOOKUP_FOLLOW, &remove_path))
>> +		return;
>> +
>> +	list_for_each_entry_safe(b, tmp, &tlob_uprobe_list, list) {
>> +		if (b->offset_start != offset_start)
>> +			continue;
>> +		if (b->path.dentry != remove_path.dentry)
>> +			continue;
>> +		uprobe_unregister_nosync(b->entry_uprobe, &b->entry_uc);
>> +		uprobe_unregister_nosync(b->stop_uprobe,  &b->stop_uc);
>> +		list_del(&b->list);
>> +		uprobe_unregister_sync();
>> +		path_put(&b->path);
>> +		kfree(b);
>> +		break;
>> +	}
>> +
>> +	path_put(&remove_path);
>> +}
>> +
>> +/* Unregister all uprobe bindings; called from disable_tlob(). */
>> +static void tlob_remove_all_uprobes(void)
>> +{
>> +	struct tlob_uprobe_binding *b, *tmp;
>> +	LIST_HEAD(free_list);
>> +
>> +	mutex_lock(&tlob_uprobe_mutex);
>> +	list_for_each_entry_safe(b, tmp, &tlob_uprobe_list, list) {
>> +		uprobe_unregister_nosync(b->entry_uprobe, &b->entry_uc);
>> +		uprobe_unregister_nosync(b->stop_uprobe,  &b->stop_uc);
>> +		list_move(&b->list, &free_list);
>> +	}
>> +	mutex_unlock(&tlob_uprobe_mutex);
>> +
>> +	/* Wait for in-flight handlers before freeing the consumer structs. */
>> +	uprobe_unregister_sync();
>> +
>> +	list_for_each_entry_safe(b, tmp, &free_list, list) {
>> +		list_del(&b->list);
>> +		path_put(&b->path);
>> +		kfree(b);
>> +	}
>> +}
>> +
>> +/*
>> + * tracefs "monitor" file
>> + *
>> + * Read:  one "threshold_us:0xoffset_start:0xoffset_stop:binary_path\n"
>> + *        line per registered uprobe binding.
>> + * Write: "threshold_us:offset_start:offset_stop:binary_path" - add uprobe binding
>> + *        "-offset_start:binary_path"                         - remove uprobe binding
>> + */
>> +
>> +static ssize_t tlob_monitor_read(struct file *file,
>> +				 char __user *ubuf,
>> +				 size_t count, loff_t *ppos)
>> +{
>> +	/* threshold(20) + 2 offsets(2*18) + delimiters, plus the path */
>> +	const int line_sz = TLOB_MAX_PATH + 72;
>> +	struct tlob_uprobe_binding *b;
>> +	char *buf, *p;
>> +	int n = 0, buf_sz, pos = 0;
>> +	ssize_t ret;
>> +
>> +	mutex_lock(&tlob_uprobe_mutex);
>> +	list_for_each_entry(b, &tlob_uprobe_list, list)
>> +		n++;
>> +	mutex_unlock(&tlob_uprobe_mutex);
>> +
>> +	buf_sz = (n ? n : 1) * line_sz + 1;
>> +	buf = kmalloc(buf_sz, GFP_KERNEL);
>> +	if (!buf)
>> +		return -ENOMEM;
>> +
>> +	mutex_lock(&tlob_uprobe_mutex);
>> +	list_for_each_entry(b, &tlob_uprobe_list, list) {
>> +		p = b->binpath;
>> +		pos += scnprintf(buf + pos, buf_sz - pos,
>> +				 "%llu:0x%llx:0x%llx:%s\n",
>> +				 b->threshold_us,
>> +				 (unsigned long long)b->offset_start,
>> +				 (unsigned long long)b->offset_stop,
>> +				 p);
>> +	}
>> +	mutex_unlock(&tlob_uprobe_mutex);
>> +
>> +	ret = simple_read_from_buffer(ubuf, count, ppos, buf, pos);
>> +	kfree(buf);
>> +	return ret;
>> +}
>> +
>> +/*
>> + * Parse "threshold_us:offset_start:offset_stop:binary_path".
>> + * binary_path comes last so it may freely contain ':'.
>> + * Returns 0 on success.
>> + */
>> +VISIBLE_IF_KUNIT int tlob_parse_uprobe_line(char *buf, u64 *thr_out,
>> +					    char **path_out,
>> +					    loff_t *start_out, loff_t *stop_out)
>> +{
>> +	unsigned long long thr;
>> +	long long start, stop;
>> +	int n = 0;
>> +
>> +	/*
>> +	 * %llu : decimal-only (microseconds)
>> +	 * %lli : auto-base, accepts 0x-prefixed hex for offsets
>> +	 * %n   : records the byte offset of the first path character
>> +	 */
>> +	if (sscanf(buf, "%llu:%lli:%lli:%n", &thr, &start, &stop, &n) != 3)
>> +		return -EINVAL;
>> +	if (thr == 0 || n == 0 || buf[n] == '\0')
>> +		return -EINVAL;
>> +	if (start < 0 || stop < 0)
>> +		return -EINVAL;
>> +
>> +	*thr_out   = thr;
>> +	*start_out = start;
>> +	*stop_out  = stop;
>> +	*path_out  = buf + n;
>> +	return 0;
>> +}
>> +
>> +static ssize_t tlob_monitor_write(struct file *file,
>> +				  const char __user *ubuf,
>> +				  size_t count, loff_t *ppos)
>> +{
>> +	char buf[TLOB_MAX_PATH + 64];
>> +	loff_t offset_start, offset_stop;
>> +	u64 threshold_us;
>> +	char *binpath;
>> +	int ret;
>> +
>> +	if (count >= sizeof(buf))
>> +		return -EINVAL;
>> +	if (copy_from_user(buf, ubuf, count))
>> +		return -EFAULT;
>> +	buf[count] = '\0';
>> +
>> +	if (count > 0 && buf[count - 1] == '\n')
>> +		buf[count - 1] = '\0';
>> +
>> +	/* Remove request: "-offset_start:binary_path" */
>> +	if (buf[0] == '-') {
>> +		long long off;
>> +		int n = 0;
>> +
>> +		if (sscanf(buf + 1, "%lli:%n", &off, &n) != 1 || n == 0)
>> +			return -EINVAL;
>> +		binpath = buf + 1 + n;
>> +		if (binpath[0] != '/')
>> +			return -EINVAL;
>> +
>> +		mutex_lock(&tlob_uprobe_mutex);
>> +		tlob_remove_uprobe_by_key((loff_t)off, binpath);
>> +		mutex_unlock(&tlob_uprobe_mutex);
>> +
>> +		return (ssize_t)count;
>> +	}
>> +
>> +	/*
>> +	 * Uprobe binding: "threshold_us:offset_start:offset_stop:binary_path"
>> +	 * binpath points into buf at the start of the path field.
>> +	 */
>> +	ret = tlob_parse_uprobe_line(buf, &threshold_us,
>> +				     &binpath, &offset_start, &offset_stop);
>> +	if (ret)
>> +		return ret;
>> +
>> +	mutex_lock(&tlob_uprobe_mutex);
>> +	ret = tlob_add_uprobe(threshold_us, binpath, offset_start, offset_stop);
>> +	mutex_unlock(&tlob_uprobe_mutex);
>> +	return ret ? ret : (ssize_t)count;
>> +}
>> +
>> +static const struct file_operations tlob_monitor_fops = {
>> +	.open	= simple_open,
>> +	.read	= tlob_monitor_read,
>> +	.write	= tlob_monitor_write,
>> +	.llseek	= noop_llseek,
>> +};
>> +
>> +/*
>> + * __tlob_init_monitor / __tlob_destroy_monitor - called with rv_interface_lock
>> + * held (required by da_monitor_init/destroy via rv_get/put_task_monitor_slot).
>> + */
>> +static int __tlob_init_monitor(void)
>> +{
>> +	int i, retval;
>> +
>> +	tlob_state_cache = kmem_cache_create("tlob_task_state",
>> +					     sizeof(struct tlob_task_state),
>> +					     0, 0, NULL);
>> +	if (!tlob_state_cache)
>> +		return -ENOMEM;
>> +
>> +	for (i = 0; i < TLOB_HTABLE_SIZE; i++)
>> +		INIT_HLIST_HEAD(&tlob_htable[i]);
>> +	atomic_set(&tlob_num_monitored, 0);
>> +
>> +	retval = da_monitor_init();
>> +	if (retval) {
>> +		kmem_cache_destroy(tlob_state_cache);
>> +		tlob_state_cache = NULL;
>> +		return retval;
>> +	}
>> +
>> +	rv_this.enabled = 1;
>> +	return 0;
>> +}
>> +
>> +static void __tlob_destroy_monitor(void)
>> +{
>> +	rv_this.enabled = 0;
>> +	tlob_stop_all();
>> +	tlob_remove_all_uprobes();
>> +	/*
>> +	 * Drain pending call_rcu() callbacks from tlob_stop_all() before
>> +	 * destroying the kmem_cache.
>> +	 */
>> +	synchronize_rcu();
>> +	da_monitor_destroy();
>> +	kmem_cache_destroy(tlob_state_cache);
>> +	tlob_state_cache = NULL;
>> +}
>> +
>> +/*
>> + * tlob_init_monitor / tlob_destroy_monitor - KUnit wrappers that acquire
>> + * rv_interface_lock, satisfying the lockdep_assert_held() inside
>> + * rv_get/put_task_monitor_slot().
>> + */
>> +VISIBLE_IF_KUNIT int tlob_init_monitor(void)
>> +{
>> +	int ret;
>> +
>> +	mutex_lock(&rv_interface_lock);
>> +	ret = __tlob_init_monitor();
>> +	mutex_unlock(&rv_interface_lock);
>> +	return ret;
>> +}
>> +EXPORT_SYMBOL_IF_KUNIT(tlob_init_monitor);
>> +
>> +VISIBLE_IF_KUNIT void tlob_destroy_monitor(void)
>> +{
>> +	mutex_lock(&rv_interface_lock);
>> +	__tlob_destroy_monitor();
>> +	mutex_unlock(&rv_interface_lock);
>> +}
>> +EXPORT_SYMBOL_IF_KUNIT(tlob_destroy_monitor);
>> +
>> +VISIBLE_IF_KUNIT int tlob_enable_hooks(void)
>> +{
>> +	rv_attach_trace_probe("tlob", sched_switch, handle_sched_switch);
>> +	rv_attach_trace_probe("tlob", sched_wakeup, handle_sched_wakeup);
>> +	return 0;
>> +}
>> +EXPORT_SYMBOL_IF_KUNIT(tlob_enable_hooks);
>> +
>> +VISIBLE_IF_KUNIT void tlob_disable_hooks(void)
>> +{
>> +	rv_detach_trace_probe("tlob", sched_switch, handle_sched_switch);
>> +	rv_detach_trace_probe("tlob", sched_wakeup, handle_sched_wakeup);
>> +}
>> +EXPORT_SYMBOL_IF_KUNIT(tlob_disable_hooks);
>> +
>> +/*
>> + * enable_tlob / disable_tlob - called by rv_enable/disable_monitor() which
>> + * already holds rv_interface_lock; call the __ variants directly.
>> + */
>> +static int enable_tlob(void)
>> +{
>> +	int retval;
>> +
>> +	retval = __tlob_init_monitor();
>> +	if (retval)
>> +		return retval;
>> +
>> +	return tlob_enable_hooks();
>> +}
>> +
>> +static void disable_tlob(void)
>> +{
>> +	tlob_disable_hooks();
>> +	__tlob_destroy_monitor();
>> +}
>> +
>> +static struct rv_monitor rv_this = {
>> +	.name		= "tlob",
>> +	.description	= "Per-task latency-over-budget monitor.",
>> +	.enable		= enable_tlob,
>> +	.disable	= disable_tlob,
>> +	.reset		= da_monitor_reset_all,
>> +	.enabled	= 0,
>> +};
>> +
>> +static int __init register_tlob(void)
>> +{
>> +	int ret;
>> +
>> +	ret = rv_register_monitor(&rv_this, NULL);
>> +	if (ret)
>> +		return ret;
>> +
>> +	if (rv_this.root_d) {
>> +		tracefs_create_file("monitor", 0644, rv_this.root_d, NULL,
>> +				    &tlob_monitor_fops);
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +static void __exit unregister_tlob(void)
>> +{
>> +	rv_unregister_monitor(&rv_this);
>> +}
>> +
>> +module_init(register_tlob);
>> +module_exit(unregister_tlob);
>> +
>> +MODULE_LICENSE("GPL");
>> +MODULE_AUTHOR("Wen Yang <wen.yang@linux.dev>");
>> +MODULE_DESCRIPTION("tlob: task latency over budget per-task monitor.");
>> diff --git a/kernel/trace/rv/monitors/tlob/tlob.h b/kernel/trace/rv/monitors/tlob/tlob.h
>> new file mode 100644
>> index 000000000..3438a6175
>> --- /dev/null
>> +++ b/kernel/trace/rv/monitors/tlob/tlob.h
>> @@ -0,0 +1,145 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +#ifndef _RV_TLOB_H
>> +#define _RV_TLOB_H
>> +
>> +/*
>> + * C representation of the tlob automaton, generated from tlob.dot via rvgen
>> + * and extended with tlob_start_task()/tlob_stop_task() declarations.
>> + * For the format description see Documentation/trace/rv/deterministic_automata.rst
>> + */
>> +
>> +#include <linux/rv.h>
>> +#include <uapi/linux/rv.h>
>> +
>> +#define MONITOR_NAME tlob
>> +
>> +enum states_tlob {
>> +	unmonitored_tlob,
>> +	on_cpu_tlob,
>> +	off_cpu_tlob,
>> +	state_max_tlob,
>> +};
>> +
>> +#define INVALID_STATE state_max_tlob
>> +
>> +enum events_tlob {
>> +	trace_start_tlob,
>> +	switch_in_tlob,
>> +	switch_out_tlob,
>> +	sched_wakeup_tlob,
>> +	trace_stop_tlob,
>> +	budget_expired_tlob,
>> +	event_max_tlob,
>> +};
>> +
>> +struct automaton_tlob {
>> +	char *state_names[state_max_tlob];
>> +	char *event_names[event_max_tlob];
>> +	unsigned char function[state_max_tlob][event_max_tlob];
>> +	unsigned char initial_state;
>> +	bool final_states[state_max_tlob];
>> +};
>> +
>> +static const struct automaton_tlob automaton_tlob = {
>> +	.state_names = {
>> +		"unmonitored",
>> +		"on_cpu",
>> +		"off_cpu",
>> +	},
>> +	.event_names = {
>> +		"trace_start",
>> +		"switch_in",
>> +		"switch_out",
>> +		"sched_wakeup",
>> +		"trace_stop",
>> +		"budget_expired",
>> +	},
>> +	.function = {
>> +		/* unmonitored */
>> +		{
>> +			on_cpu_tlob,		/* trace_start    */
>> +			unmonitored_tlob,	/* switch_in      */
>> +			unmonitored_tlob,	/* switch_out     */
>> +			unmonitored_tlob,	/* sched_wakeup   */
>> +			INVALID_STATE,		/* trace_stop     */
>> +			INVALID_STATE,		/* budget_expired */
>> +		},
>> +		/* on_cpu */
>> +		{
>> +			INVALID_STATE,		/* trace_start    */
>> +			INVALID_STATE,		/* switch_in      */
>> +			off_cpu_tlob,		/* switch_out     */
>> +			on_cpu_tlob,		/* sched_wakeup   */
>> +			unmonitored_tlob,	/* trace_stop     */
>> +			unmonitored_tlob,	/* budget_expired */
>> +		},
>> +		/* off_cpu */
>> +		{
>> +			INVALID_STATE,		/* trace_start    */
>> +			on_cpu_tlob,		/* switch_in      */
>> +			off_cpu_tlob,		/* switch_out     */
>> +			off_cpu_tlob,		/* sched_wakeup   */
>> +			unmonitored_tlob,	/* trace_stop     */
>> +			unmonitored_tlob,	/* budget_expired */
>> +		},
>> +	},
>> +	/*
>> +	 * final_states: unmonitored is the sole accepting state.
>> +	 * Violations are recorded via ntf_push and tlob_budget_exceeded.
>> +	 */
>> +	.initial_state = unmonitored_tlob,
>> +	.final_states = { 1, 0, 0 },
>> +};
>> +
>> +/* Exported for use by the RV ioctl layer (rv_dev.c) */
>> +int tlob_start_task(struct task_struct *task, u64 threshold_us,
>> +		    struct file *notify_file, u64 tag);
>> +int tlob_stop_task(struct task_struct *task);
>> +
>> +/* Maximum number of concurrently monitored tasks (also used by KUnit). */
>> +#define TLOB_MAX_MONITORED	64U
>> +
>> +/*
>> + * Ring buffer constants (also published in UAPI for mmap size calculation).
>> + */
>> +#define TLOB_RING_DEFAULT_CAP	64U	/* records allocated at open()  */
>> +#define TLOB_RING_MIN_CAP	 8U	/* minimum accepted by mmap()   */
>> +#define TLOB_RING_MAX_CAP	4096U	/* maximum accepted by mmap()   */
>> +
>> +/**
>> + * struct tlob_ring - per-fd mmap-capable violation ring buffer.
>> + *
>> + * Allocated as a contiguous page range at rv_open() time:
>> + *   page 0:    struct tlob_mmap_page  (shared with userspace)
>> + *   pages 1-N: struct tlob_event[capacity]
>> + */
>> +struct tlob_ring {
>> +	struct tlob_mmap_page	*page;
>> +	struct tlob_event	*data;
>> +	u32			 mask;
>> +	spinlock_t		 lock;
>> +	unsigned long		 base;
>> +	unsigned int		 order;
>> +};
>> +
>> +/**
>> + * struct rv_file_priv - per-fd private data for /dev/rv.
>> + */
>> +struct rv_file_priv {
>> +	struct tlob_ring	ring;
>> +	wait_queue_head_t	waitq;
>> +};
>> +
>> +#if IS_ENABLED(CONFIG_KUNIT)
>> +int tlob_init_monitor(void);
>> +void tlob_destroy_monitor(void);
>> +int tlob_enable_hooks(void);
>> +void tlob_disable_hooks(void);
>> +void tlob_event_push_kunit(struct rv_file_priv *priv,
>> +			  const struct tlob_event *info);
>> +int tlob_parse_uprobe_line(char *buf, u64 *thr_out,
>> +			   char **path_out,
>> +			   loff_t *start_out, loff_t *stop_out);
>> +#endif /* CONFIG_KUNIT */
>> +
>> +#endif /* _RV_TLOB_H */
>> diff --git a/kernel/trace/rv/monitors/tlob/tlob_trace.h b/kernel/trace/rv/monitors/tlob/tlob_trace.h
>> new file mode 100644
>> index 000000000..b08d67776
>> --- /dev/null
>> +++ b/kernel/trace/rv/monitors/tlob/tlob_trace.h
>> @@ -0,0 +1,42 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +
>> +/*
>> + * Snippet to be included in rv_trace.h
>> + */
>> +
>> +#ifdef CONFIG_RV_MON_TLOB
>> +/*
>> + * tlob uses the generic event_da_monitor_id and error_da_monitor_id event
>> + * classes so that both event classes are instantiated.  This avoids a
>> + * -Werror=unused-variable warning that the compiler emits when a
>> + * DECLARE_EVENT_CLASS has no corresponding DEFINE_EVENT instance.
>> + *
>> + * The event_tlob tracepoint is defined here but the call-site in
>> + * da_handle_event() is overridden with a no-op macro below so that no
>> + * trace record is emitted on every scheduler context switch.  Budget
>> + * violations are reported via the dedicated tlob_budget_exceeded event.
>> + *
>> + * error_tlob IS kept active so that invalid DA transitions (programming
>> + * errors) are still visible in the ftrace ring buffer for debugging.
>> + */
>> +DEFINE_EVENT(event_da_monitor_id, event_tlob,
>> +	     TP_PROTO(int id, char *state, char *event, char *next_state,
>> +		      bool final_state),
>> +	     TP_ARGS(id, state, event, next_state, final_state));
>> +
>> +DEFINE_EVENT(error_da_monitor_id, error_tlob,
>> +	     TP_PROTO(int id, char *state, char *event),
>> +	     TP_ARGS(id, state, event));
>> +
>> +/*
>> + * Override the trace_event_tlob() call-site with a no-op after the
>> + * DEFINE_EVENT above has satisfied the event class instantiation
>> + * requirement.  The tracepoint symbol itself exists (and can be enabled
>> + * via tracefs) but the automatic call from da_handle_event() is silenced
>> + * to avoid per-context-switch ftrace noise during normal operation.
>> + */
>> +#undef trace_event_tlob
>> +#define trace_event_tlob(id, state, event, next_state, final_state)	\
>> +	do { (void)(id); (void)(state); (void)(event);			\
>> +	     (void)(next_state); (void)(final_state); } while (0)
>> +#endif /* CONFIG_RV_MON_TLOB */
>> diff --git a/kernel/trace/rv/rv.c b/kernel/trace/rv/rv.c
>> index ee4e68102..e754e76d5 100644
>> --- a/kernel/trace/rv/rv.c
>> +++ b/kernel/trace/rv/rv.c
>> @@ -148,6 +148,10 @@
>>   #include <rv_trace.h>
>>   #endif
>>   
>> +#ifdef CONFIG_RV_MON_TLOB
>> +EXPORT_TRACEPOINT_SYMBOL_GPL(tlob_budget_exceeded);
>> +#endif
>> +
>>   #include "rv.h"
>>   
>>   DEFINE_MUTEX(rv_interface_lock);
>> diff --git a/kernel/trace/rv/rv_dev.c b/kernel/trace/rv/rv_dev.c
>> new file mode 100644
>> index 000000000..a052f3203
>> --- /dev/null
>> +++ b/kernel/trace/rv/rv_dev.c
>> @@ -0,0 +1,602 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * rv_dev.c - /dev/rv misc device for RV monitor self-instrumentation
>> + *
>> + * A single misc device (MISC_DYNAMIC_MINOR) serves all RV monitors.
>> + * ioctl numbers encode the monitor identity:
>> + *
>> + *   0x01 - 0x1F  tlob (task latency over budget)
>> + *   0x20 - 0x3F  reserved
>> + *
>> + * Each monitor exports tlob_start_task() / tlob_stop_task() which are
>> + * called here.  The calling task is identified by current.
>> + *
>> + * Magic: RV_IOC_MAGIC (0xB9), defined in include/uapi/linux/rv.h
>> + *
>> + * Per-fd private data (rv_file_priv)
>> + * ------------------------------------
>> + * Every open() of /dev/rv allocates an rv_file_priv (defined in tlob.h).
>> + * When TLOB_IOCTL_TRACE_START is called with args.notify_fd >= 0, violations
>> + * are pushed as tlob_event records into that fd's per-fd ring buffer (tlob_ring)
>> + * and its poll/epoll waitqueue is woken.
>> + *
>> + * Consumers drain records with read() on the notify_fd; read() blocks until
>> + * at least one record is available (unless O_NONBLOCK is set).
>> + *
>> + * Per-thread "started" tracking (tlob_task_handle)
>> + * -------------------------------------------------
>> + * tlob_stop_task() returns -ESRCH in two distinct situations:
>> + *
>> + *   (a) The deadline timer already fired and removed the tlob hash-table
>> + *       entry before TRACE_STOP arrived -> budget was exceeded -> -EOVERFLOW
>> + *
>> + *   (b) TRACE_START was never called for this thread -> programming error
>> + *       -> -ESRCH
>> + *
>> + * To distinguish them, rv_dev.c maintains a lightweight hash table
>> + * (tlob_handles) that records a tlob_task_handle for every task_struct *
>> + * for which a successful TLOB_IOCTL_TRACE_START has been
>> + * issued but the corresponding TLOB_IOCTL_TRACE_STOP has not yet arrived.
>> + *
>> + * tlob_task_handle is a thin "session ticket"  --  it carries only the
>> + * task pointer and the owning file descriptor.  The heavy per-task state
>> + * (hrtimer, DA state, threshold) lives in tlob_task_state inside tlob.c.
>> + *
>> + * The table is keyed on task_struct * (same key as tlob.c), protected
>> + * by tlob_handles_lock (spinlock, irq-safe).  No get_task_struct()
>> + * refcount is needed here because tlob.c already holds a reference for
>> + * each live entry.
>> + *
>> + * Multiple threads may share the same fd.  Each thread has its own
>> + * tlob_task_handle in the table, so concurrent TRACE_START / TRACE_STOP
>> + * calls from different threads do not interfere.
>> + *
>> + * The fd release path (rv_release) calls tlob_stop_task() for every
>> + * handle in tlob_handles that belongs to the closing fd, ensuring cleanup
>> + * even if the user forgets to call TRACE_STOP.
>> + */
>> +#include <linux/file.h>
>> +#include <linux/fs.h>
>> +#include <linux/gfp.h>
>> +#include <linux/hash.h>
>> +#include <linux/mm.h>
>> +#include <linux/miscdevice.h>
>> +#include <linux/module.h>
>> +#include <linux/poll.h>
>> +#include <linux/sched.h>
>> +#include <linux/slab.h>
>> +#include <linux/spinlock.h>
>> +#include <linux/uaccess.h>
>> +#include <uapi/linux/rv.h>
>> +
>> +#ifdef CONFIG_RV_MON_TLOB
>> +#include "monitors/tlob/tlob.h"
>> +#endif
>> +
>> +/* -----------------------------------------------------------------------
>> + * tlob_task_handle - per-thread session ticket for the ioctl interface
>> + *
>> + * One handle is allocated by TLOB_IOCTL_TRACE_START and freed by
>> + * TLOB_IOCTL_TRACE_STOP (or by rv_release if the fd is closed).
>> + *
>> + * @hlist:  Hash-table linkage in tlob_handles (keyed on task pointer).
>> + * @task:   The monitored thread.  Plain pointer; no refcount held here
>> + *          because tlob.c holds one for the lifetime of the monitoring
>> + *          window, which encompasses the lifetime of this handle.
>> + * @file:   The /dev/rv file descriptor that issued TRACE_START.
>> + *          Used by rv_release() to sweep orphaned handles on close().
>> + * -----------------------------------------------------------------------
>> + */
>> +#define TLOB_HANDLES_BITS	5
>> +#define TLOB_HANDLES_SIZE	(1 << TLOB_HANDLES_BITS)
>> +
>> +struct tlob_task_handle {
>> +	struct hlist_node	hlist;
>> +	struct task_struct	*task;
>> +	struct file		*file;
>> +};
>> +
>> +static struct hlist_head tlob_handles[TLOB_HANDLES_SIZE];
>> +static DEFINE_SPINLOCK(tlob_handles_lock);
>> +
>> +static unsigned int tlob_handle_hash(const struct task_struct *task)
>> +{
>> +	return hash_ptr((void *)task, TLOB_HANDLES_BITS);
>> +}
>> +
>> +/* Must be called with tlob_handles_lock held. */
>> +static struct tlob_task_handle *
>> +tlob_handle_find_locked(struct task_struct *task)
>> +{
>> +	struct tlob_task_handle *h;
>> +	unsigned int slot = tlob_handle_hash(task);
>> +
>> +	hlist_for_each_entry(h, &tlob_handles[slot], hlist) {
>> +		if (h->task == task)
>> +			return h;
>> +	}
>> +	return NULL;
>> +}
>> +
>> +/*
>> + * tlob_handle_alloc - record that @task has an active monitoring session
>> + *                     opened via @file.
>> + *
>> + * Returns 0 on success, -EEXIST if @task already has a handle (double
>> + * TRACE_START without TRACE_STOP), -ENOMEM on allocation failure.
>> + */
>> +static int tlob_handle_alloc(struct task_struct *task, struct file *file)
>> +{
>> +	struct tlob_task_handle *h;
>> +	unsigned long flags;
>> +	unsigned int slot;
>> +
>> +	h = kmalloc(sizeof(*h), GFP_KERNEL);
>> +	if (!h)
>> +		return -ENOMEM;
>> +	h->task = task;
>> +	h->file = file;
>> +
>> +	spin_lock_irqsave(&tlob_handles_lock, flags);
>> +	if (tlob_handle_find_locked(task)) {
>> +		spin_unlock_irqrestore(&tlob_handles_lock, flags);
>> +		kfree(h);
>> +		return -EEXIST;
>> +	}
>> +	slot = tlob_handle_hash(task);
>> +	hlist_add_head(&h->hlist, &tlob_handles[slot]);
>> +	spin_unlock_irqrestore(&tlob_handles_lock, flags);
>> +	return 0;
>> +}
>> +
>> +/*
>> + * tlob_handle_free - remove the handle for @task and free it.
>> + *
>> + * Returns 1 if a handle existed (TRACE_START was called), 0 if not found
>> + * (TRACE_START was never called for this thread).
>> + */
>> +static int tlob_handle_free(struct task_struct *task)
>> +{
>> +	struct tlob_task_handle *h;
>> +	unsigned long flags;
>> +
>> +	spin_lock_irqsave(&tlob_handles_lock, flags);
>> +	h = tlob_handle_find_locked(task);
>> +	if (h) {
>> +		hlist_del_init(&h->hlist);
>> +		spin_unlock_irqrestore(&tlob_handles_lock, flags);
>> +		kfree(h);
>> +		return 1;
>> +	}
>> +	spin_unlock_irqrestore(&tlob_handles_lock, flags);
>> +	return 0;
>> +}
>> +
>> +/*
>> + * tlob_handle_sweep_file - release all handles owned by @file.
>> + *
>> + * Called from rv_release() when the fd is closed without TRACE_STOP.
>> + * Calls tlob_stop_task() for each orphaned handle to drain the tlob
>> + * monitoring entries and prevent resource leaks in tlob.c.
>> + *
>> + * Handles are collected under the lock (short critical section), then
>> + * processed outside it (tlob_stop_task() may sleep/spin internally).
>> + */
>> +#ifdef CONFIG_RV_MON_TLOB
>> +static void tlob_handle_sweep_file(struct file *file)
>> +{
>> +	/* at most TLOB_MAX_MONITORED live handles can exist at once */
>> +	struct tlob_task_handle *batch[TLOB_MAX_MONITORED];
>> +	struct tlob_task_handle *h;
>> +	struct hlist_node *tmp;
>> +	unsigned long flags;
>> +	int i, n = 0;
>> +
>> +	spin_lock_irqsave(&tlob_handles_lock, flags);
>> +	for (i = 0; i < TLOB_HANDLES_SIZE; i++) {
>> +		hlist_for_each_entry_safe(h, tmp, &tlob_handles[i], hlist) {
>> +			if (h->file == file) {
>> +				hlist_del_init(&h->hlist);
>> +				batch[n++] = h;
>> +			}
>> +		}
>> +	}
>> +	spin_unlock_irqrestore(&tlob_handles_lock, flags);
>> +
>> +	for (i = 0; i < n; i++) {
>> +		/*
>> +		 * Ignore -ESRCH: the deadline timer may have already fired
>> +		 * and cleaned up the tlob entry.
>> +		 */
>> +		tlob_stop_task(batch[i]->task);
>> +		kfree(batch[i]);
>> +	}
>> +}
>> +#else
>> +static inline void tlob_handle_sweep_file(struct file *file) {}
>> +#endif /* CONFIG_RV_MON_TLOB */
>> +
>> +/* -----------------------------------------------------------------------
>> + * Ring buffer lifecycle
>> + * -----------------------------------------------------------------------
>> + */
>> +
>> +/*
>> + * tlob_ring_alloc - allocate a ring of @cap records (must be a power of 2).
>> + *
>> + * Allocates a physically contiguous block of pages:
>> + *   page 0     : struct tlob_mmap_page  (control page, shared with userspace)
>> + *   pages 1..N : struct tlob_event[cap] (data pages)
>> + *
>> + * Each page is marked reserved so it can be mapped to userspace via mmap().
>> + */
>> +static int tlob_ring_alloc(struct tlob_ring *ring, u32 cap)
>> +{
>> +	unsigned int total = PAGE_SIZE + cap * sizeof(struct tlob_event);
>> +	unsigned int order = get_order(total);
>> +	unsigned long base;
>> +	unsigned int i;
>> +
>> +	base = __get_free_pages(GFP_KERNEL | __GFP_ZERO, order);
>> +	if (!base)
>> +		return -ENOMEM;
>> +
>> +	for (i = 0; i < (1u << order); i++)
>> +		SetPageReserved(virt_to_page((void *)(base + i * PAGE_SIZE)));
>> +
>> +	ring->base  = base;
>> +	ring->order = order;
>> +	ring->page  = (struct tlob_mmap_page *)base;
>> +	ring->data  = (struct tlob_event *)(base + PAGE_SIZE);
>> +	ring->mask  = cap - 1;
>> +	spin_lock_init(&ring->lock);
>> +
>> +	ring->page->capacity    = cap;
>> +	ring->page->version     = 1;
>> +	ring->page->data_offset = PAGE_SIZE;
>> +	ring->page->record_size = sizeof(struct tlob_event);
>> +	return 0;
>> +}
>> +
>> +static void tlob_ring_free(struct tlob_ring *ring)
>> +{
>> +	unsigned int i;
>> +
>> +	if (!ring->base)
>> +		return;
>> +
>> +	for (i = 0; i < (1u << ring->order); i++)
>> +		ClearPageReserved(virt_to_page((void *)(ring->base + i * PAGE_SIZE)));
>> +
>> +	free_pages(ring->base, ring->order);
>> +	ring->base = 0;
>> +	ring->page = NULL;
>> +	ring->data = NULL;
>> +}
>> +
>> +/* -----------------------------------------------------------------------
>> + * File operations
>> + * -----------------------------------------------------------------------
>> + */
>> +
>> +static int rv_open(struct inode *inode, struct file *file)
>> +{
>> +	struct rv_file_priv *priv;
>> +	int ret;
>> +
>> +	priv = kzalloc(sizeof(*priv), GFP_KERNEL);
>> +	if (!priv)
>> +		return -ENOMEM;
>> +
>> +	ret = tlob_ring_alloc(&priv->ring, TLOB_RING_DEFAULT_CAP);
>> +	if (ret) {
>> +		kfree(priv);
>> +		return ret;
>> +	}
>> +
>> +	init_waitqueue_head(&priv->waitq);
>> +	file->private_data = priv;
>> +	return 0;
>> +}
>> +
>> +static int rv_release(struct inode *inode, struct file *file)
>> +{
>> +	struct rv_file_priv *priv = file->private_data;
>> +
>> +	tlob_handle_sweep_file(file);
>> +	tlob_ring_free(&priv->ring);
>> +	kfree(priv);
>> +	file->private_data = NULL;
>> +	return 0;
>> +}
>> +
>> +static __poll_t rv_poll(struct file *file, poll_table *wait)
>> +{
>> +	struct rv_file_priv *priv = file->private_data;
>> +
>> +	if (!priv)
>> +		return EPOLLERR;
>> +
>> +	poll_wait(file, &priv->waitq, wait);
>> +
>> +	/*
>> +	 * Pairs with smp_store_release(&ring->page->data_head, ...) in
>> +	 * tlob_event_push().  No lock needed: head is written by the kernel
>> +	 * producer and read here; tail is written by the consumer and we only
>> +	 * need an approximate check for the poll fast path.
>> +	 */
>> +	if (smp_load_acquire(&priv->ring.page->data_head) !=
>> +	    READ_ONCE(priv->ring.page->data_tail))
>> +		return EPOLLIN | EPOLLRDNORM;
>> +
>> +	return 0;
>> +}
>> +
>> +/*
>> + * rv_read - consume tlob_event violation records from this fd's ring buffer.
>> + *
>> + * Each read() returns a whole number of struct tlob_event records.  @count must
>> + * be at least sizeof(struct tlob_event); partial-record sizes are rejected with
>> + * -EINVAL.
>> + *
>> + * Blocking behaviour follows O_NONBLOCK on the fd:
>> + *   O_NONBLOCK clear: blocks until at least one record is available.
>> + *   O_NONBLOCK set:   returns -EAGAIN immediately if the ring is empty.
>> + *
>> + * Returns the number of bytes copied (always a multiple of sizeof tlob_event),
>> + * -EAGAIN if non-blocking and empty, or a negative error code.
>> + *
>> + * read() and mmap() share the same ring and data_tail cursor; do not use both
>> + * simultaneously on the same fd.
>> +static ssize_t rv_read(struct file *file, char __user *buf, size_t count,
>> +		       loff_t *ppos)
>> +{
>> +	struct rv_file_priv *priv = file->private_data;
>> +	struct tlob_ring *ring;
>> +	size_t rec = sizeof(struct tlob_event);
>> +	unsigned long irqflags;
>> +	ssize_t done = 0;
>> +	int ret;
>> +
>> +	if (!priv)
>> +		return -ENODEV;
>> +
>> +	ring = &priv->ring;
>> +
>> +	if (count < rec)
>> +		return -EINVAL;
>> +
>> +	/* Blocking path: sleep until the producer advances data_head. */
>> +	if (!(file->f_flags & O_NONBLOCK)) {
>> +		ret = wait_event_interruptible(priv->waitq,
>> +			/* pairs with smp_store_release() in the producer */
>> +			smp_load_acquire(&ring->page->data_head) !=
>> +			READ_ONCE(ring->page->data_tail));
>> +		if (ret)
>> +			return ret;
>> +	}
>> +
>> +	/*
>> +	 * Drain records into the caller's buffer.  ring->lock serialises
>> +	 * concurrent read() callers and the softirq producer.
>> +	 */
>> +	while (done + rec <= count) {
>> +		struct tlob_event record;
>> +		u32 head, tail;
>> +
>> +		spin_lock_irqsave(&ring->lock, irqflags);
>> +		/* pairs with smp_store_release() in the producer */
>> +		head = smp_load_acquire(&ring->page->data_head);
>> +		tail = ring->page->data_tail;
>> +		if (head == tail) {
>> +			spin_unlock_irqrestore(&ring->lock, irqflags);
>> +			break;
>> +		}
>> +		record = ring->data[tail & ring->mask];
>> +		WRITE_ONCE(ring->page->data_tail, tail + 1);
>> +		spin_unlock_irqrestore(&ring->lock, irqflags);
>> +
>> +		if (copy_to_user(buf + done, &record, rec))
>> +			return done ? done : -EFAULT;
>> +		done += rec;
>> +	}
>> +
>> +	return done ? done : -EAGAIN;
>> +}
>> +
>> +/*
>> + * rv_mmap - map the per-fd violation ring buffer into userspace.
>> + *
>> + * The mmap region covers the full ring allocation:
>> + *
>> + *   offset 0          : struct tlob_mmap_page  (control page)
>> + *   offset PAGE_SIZE  : struct tlob_event[capacity]  (data pages)
>> + *
>> + * The caller must map exactly PAGE_SIZE + capacity * sizeof(struct tlob_event)
>> + * bytes starting at offset 0 (vm_pgoff must be 0).  The actual capacity is
>> + * read from tlob_mmap_page.capacity after a successful mmap(2).
>> + *
>> + * Private mappings (MAP_PRIVATE) are rejected: the shared data_tail field
>> + * written by userspace must be visible to the kernel producer.
>> + */
>> +static int rv_mmap(struct file *file, struct vm_area_struct *vma)
>> +{
>> +	struct rv_file_priv *priv = file->private_data;
>> +	struct tlob_ring    *ring;
>> +	unsigned long        size = vma->vm_end - vma->vm_start;
>> +	unsigned long        ring_size;
>> +
>> +	if (!priv)
>> +		return -ENODEV;
>> +
>> +	ring = &priv->ring;
>> +
>> +	if (vma->vm_pgoff != 0)
>> +		return -EINVAL;
>> +
>> +	ring_size = PAGE_ALIGN(PAGE_SIZE + ((unsigned long)(ring->mask + 1) *
>> +					    sizeof(struct tlob_event)));
>> +	if (size != ring_size)
>> +		return -EINVAL;
>> +
>> +	if (!(vma->vm_flags & VM_SHARED))
>> +		return -EINVAL;
>> +
>> +	return remap_pfn_range(vma, vma->vm_start,
>> +			       page_to_pfn(virt_to_page((void *)ring->base)),
>> +			       ring_size, vma->vm_page_prot);
>> +}
>> +
>> +/* -----------------------------------------------------------------------
>> + * ioctl dispatcher
>> + * -----------------------------------------------------------------------
>> + */
>> +
>> +static long rv_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
>> +{
>> +	unsigned int nr = _IOC_NR(cmd);
>> +
>> +	/*
>> +	 * Verify the magic byte so we don't accidentally handle ioctls
>> +	 * intended for a different device.
>> +	 */
>> +	if (_IOC_TYPE(cmd) != RV_IOC_MAGIC)
>> +		return -ENOTTY;
>> +
>> +#ifdef CONFIG_RV_MON_TLOB
>> +	/* tlob: ioctl numbers 0x01 - 0x1F */
>> +	switch (cmd) {
>> +	case TLOB_IOCTL_TRACE_START: {
>> +		struct tlob_start_args args;
>> +		struct file *notify_file = NULL;
>> +		int ret, hret;
>> +
>> +		if (copy_from_user(&args,
>> +				   (struct tlob_start_args __user *)arg,
>> +				   sizeof(args)))
>> +			return -EFAULT;
>> +		if (args.threshold_us == 0)
>> +			return -EINVAL;
>> +		if (args.flags != 0)
>> +			return -EINVAL;
>> +
>> +		/*
>> +		 * If notify_fd >= 0, resolve it to a file pointer.
>> +		 * fget() bumps the reference count; tlob.c drops it
>> +		 * via fput() when the monitoring window ends.
>> +		 * Reject non-/dev/rv fds to prevent type confusion.
>> +		 */
>> +		if (args.notify_fd >= 0) {
>> +			notify_file = fget(args.notify_fd);
>> +			if (!notify_file)
>> +				return -EBADF;
>> +			if (notify_file->f_op != file->f_op) {
>> +				fput(notify_file);
>> +				return -EINVAL;
>> +			}
>> +		}
>> +
>> +		ret = tlob_start_task(current, args.threshold_us,
>> +				      notify_file, args.tag);
>> +		if (ret != 0) {
>> +			/* tlob.c did not take ownership; drop ref. */
>> +			if (notify_file)
>> +				fput(notify_file);
>> +			return ret;
>> +		}
>> +
>> +		/*
>> +		 * Record session handle.  Free any stale handle left by
>> +		 * a previous window whose deadline timer fired (timer
>> +		 * removes tlob_task_state but cannot touch tlob_handles).
>> +		 */
>> +		tlob_handle_free(current);
>> +		hret = tlob_handle_alloc(current, file);
>> +		if (hret < 0) {
>> +			tlob_stop_task(current);
>> +			return hret;
>> +		}
>> +		return 0;
>> +	}
>> +	case TLOB_IOCTL_TRACE_STOP: {
>> +		int had_handle;
>> +		int ret;
>> +
>> +		/*
>> +		 * Atomically remove the session handle for current.
>> +		 *
>> +		 *   had_handle == 0: TRACE_START was never called for
>> +		 *                    this thread -> caller bug -> -ESRCH
>> +		 *
>> +		 *   had_handle == 1: TRACE_START was called.  If
>> +		 *                    tlob_stop_task() now returns
>> +		 *                    -ESRCH, the deadline timer already
>> +		 *                    fired -> budget exceeded -> -EOVERFLOW
>> +		 */
>> +		had_handle = tlob_handle_free(current);
>> +		if (!had_handle)
>> +			return -ESRCH;
>> +
>> +		ret = tlob_stop_task(current);
>> +		return (ret == -ESRCH) ? -EOVERFLOW : ret;
>> +	}
>> +	default:
>> +		break;
>> +	}
>> +#endif /* CONFIG_RV_MON_TLOB */
>> +
>> +	return -ENOTTY;
>> +}
>> +
>> +/* -----------------------------------------------------------------------
>> + * Module init / exit
>> + * -----------------------------------------------------------------------
>> + */
>> +
>> +static const struct file_operations rv_fops = {
>> +	.owner		= THIS_MODULE,
>> +	.open		= rv_open,
>> +	.release	= rv_release,
>> +	.read		= rv_read,
>> +	.poll		= rv_poll,
>> +	.mmap		= rv_mmap,
>> +	.unlocked_ioctl	= rv_ioctl,
>> +#ifdef CONFIG_COMPAT
>> +	.compat_ioctl	= rv_ioctl,
>> +#endif
>> +	.llseek		= noop_llseek,
>> +};
>> +
>> +/*
>> + * 0666: /dev/rv is a self-instrumentation device.  All ioctls operate
>> + * exclusively on the calling task (current); no task can monitor another
>> + * via this interface.  Opening the device does not grant any privilege
>> + * beyond observing one's own latency, so world-read/write is appropriate.
>> + */
>> +static struct miscdevice rv_miscdev = {
>> +	.minor	= MISC_DYNAMIC_MINOR,
>> +	.name	= "rv",
>> +	.fops	= &rv_fops,
>> +	.mode	= 0666,
>> +};
>> +
>> +static int __init rv_ioctl_init(void)
>> +{
>> +	int i;
>> +
>> +	for (i = 0; i < TLOB_HANDLES_SIZE; i++)
>> +		INIT_HLIST_HEAD(&tlob_handles[i]);
>> +
>> +	return misc_register(&rv_miscdev);
>> +}
>> +
>> +static void __exit rv_ioctl_exit(void)
>> +{
>> +	misc_deregister(&rv_miscdev);
>> +}
>> +
>> +module_init(rv_ioctl_init);
>> +module_exit(rv_ioctl_exit);
>> +
>> +MODULE_LICENSE("GPL");
>> +MODULE_DESCRIPTION("RV ioctl interface via /dev/rv");
>> diff --git a/kernel/trace/rv/rv_trace.h b/kernel/trace/rv/rv_trace.h
>> index 4a6faddac..65d6c6485 100644
>> --- a/kernel/trace/rv/rv_trace.h
>> +++ b/kernel/trace/rv/rv_trace.h
>> @@ -126,6 +126,7 @@ DECLARE_EVENT_CLASS(error_da_monitor_id,
>>   #include <monitors/snroc/snroc_trace.h>
>>   #include <monitors/nrp/nrp_trace.h>
>>   #include <monitors/sssw/sssw_trace.h>
>> +#include <monitors/tlob/tlob_trace.h>
>>   // Add new monitors based on CONFIG_DA_MON_EVENTS_ID here
>>   
>>   #endif /* CONFIG_DA_MON_EVENTS_ID */
>> @@ -202,6 +203,55 @@ TRACE_EVENT(rv_retries_error,
>>   		__get_str(event), __get_str(name))
>>   );
>>   #endif /* CONFIG_RV_MON_MAINTENANCE_EVENTS */
>> +
>> +#ifdef CONFIG_RV_MON_TLOB
>> +/*
>> + * tlob_budget_exceeded - emitted when a monitored task exceeds its latency
>> + * budget.  Carries the on-CPU / off-CPU time breakdown so that the cause
>> + * of the overrun (CPU-bound vs. scheduling/I/O latency) is immediately
>> + * visible in the ftrace ring buffer without post-processing.
>> + */
>> +TRACE_EVENT(tlob_budget_exceeded,
>> +
>> +	TP_PROTO(struct task_struct *task, u64 threshold_us,
>> +		 u64 on_cpu_us, u64 off_cpu_us, u32 switches,
>> +		 bool state_is_on_cpu, u64 tag),
>> +
>> +	TP_ARGS(task, threshold_us, on_cpu_us, off_cpu_us, switches,
>> +		state_is_on_cpu, tag),
>> +
>> +	TP_STRUCT__entry(
>> +		__string(comm,		task->comm)
>> +		__field(pid_t,		pid)
>> +		__field(u64,		threshold_us)
>> +		__field(u64,		on_cpu_us)
>> +		__field(u64,		off_cpu_us)
>> +		__field(u32,		switches)
>> +		__field(bool,		state_is_on_cpu)
>> +		__field(u64,		tag)
>> +	),
>> +
>> +	TP_fast_assign(
>> +		__assign_str(comm);
>> +		__entry->pid		= task->pid;
>> +		__entry->threshold_us	= threshold_us;
>> +		__entry->on_cpu_us	= on_cpu_us;
>> +		__entry->off_cpu_us	= off_cpu_us;
>> +		__entry->switches	= switches;
>> +		__entry->state_is_on_cpu = state_is_on_cpu;
>> +		__entry->tag		= tag;
>> +	),
>> +
>> +	TP_printk("%s[%d]: budget exceeded threshold=%llu on_cpu=%llu off_cpu=%llu switches=%u state=%s tag=0x%016llx",
>> +		__get_str(comm), __entry->pid,
>> +		__entry->threshold_us,
>> +		__entry->on_cpu_us, __entry->off_cpu_us,
>> +		__entry->switches,
>> +		__entry->state_is_on_cpu ? "on_cpu" : "off_cpu",
>> +		__entry->tag)
>> +);
>> +#endif /* CONFIG_RV_MON_TLOB */
>> +
>>   #endif /* _TRACE_RV_H */
>>   
>>   /* This part must be outside protection */
> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH 2/4] rv/tlob: Add tlob deterministic automaton monitor
  2026-04-16 15:09     ` Wen Yang
@ 2026-04-16 15:35       ` Gabriele Monaco
  0 siblings, 0 replies; 11+ messages in thread
From: Gabriele Monaco @ 2026-04-16 15:35 UTC (permalink / raw)
  To: Wen Yang
  Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	linux-trace-kernel, linux-kernel

Hello,

On Thu, 2026-04-16 at 23:09 +0800, Wen Yang wrote:
> 
> Thanks for the review.  Here's my plan for each point -- let me know if 
> the direction looks right.
> 
> 
> - Timed automata
> 
> The HA framework [1] is a good match when the timeout threshold is 
> global or state-determined, but tlob needs a per-invocation threshold 
> supplied at TRACE_START time -- fitting that into HA would require 
> framework changes.

Not quite; look at the nomiss monitor: the deadline comes directly from the
deadline entity.

What I meant by using a per-object monitor is that you can use your custom
struct as the monitor target; it holds your per-invocation threshold because
you instantiate it on start.

Now you can simply do ha_get_target(ha_mon)->threshold and you get your value.

You can define "clk < THRESHOLD_NS()" in the dot representation and rvgen will
do most of the work for you. It's probably better to use nanoseconds so you
avoid conversions when dealing with hrtimers. You can convert transparently
when initialising, so the user still passes microseconds.

> My plan is to use da_monitor_init_hook() -- the same mechanism HA 
> monitors use internally -- to arm the per-invocation hrtimer once 
> da_create_storage() has stored the monitor_target.  This gives the same 
> "timer fires => violation" semantics without touching the HA infrastructure.
> 
> If you see a cleaner way to pass per-invocation data through HA I'm 
> happy to go that route.

The above looks cleaner to me, what do you think?

da_monitor_init_hook() isn't really meant to be used by monitors; it's more
for the infrastructure to extend da_monitor.h easily. Sure, you can use it if
there's no other way, though.

> - Unmonitored state / da_handle_start_event
> 
> Fair point.  I'll drop the explicit unmonitored state and the
> trace_event_tlob() redefinition.  tlob_start_task() will use
> da_handle_start_event() to allocate storage, set initial state to on_cpu,
> and fire the init hook to arm the timer in one shot.  tlob_stop_task()
> calls da_monitor_reset() directly.
> 
> - Per-object monitors
> 
> Will do.  The custom hash table goes away; I'll switch to RV_MON_PER_OBJ
> with:
> 
>      typedef struct tlob_task_state *monitor_target;
> 
> da_get_target_by_id() handles the sched_switch hot path lookup.
> 

Exactly! That should do.

> - RV-way violations
> 
> Agreed.  budget_expired will be declared INVALID in all states so the
> framework calls react() (error_tlob tracepoint + any registered reactor)
> and da_monitor_reset() automatically.  tlob won't emit any tracepoint of
> its own.
> 
> One note on the /dev/rv ioctl: TLOB_IOCTL_TRACE_STOP returns -EOVERFLOW
> to the caller when the budget was exceeded.  This is just a syscall 
> return code -- not a second reporting path -- to let in-process 
> instrumentation react inline without polling the trace buffer.
> Let me know if you have concerns about keeping this.
> 

I'm not sure how much faster it can be compared to attaching to tracefs, which
should be quite lightweight if you just listen for error events. Sure, you'd
need a few more libraries.

I'm a bit concerned about adding new interfaces (ioctl) when we already have
tracepoints and reactors. The reactors themselves are not as flexible as they
should be, though; if required, we could definitely create an ioctl reactor
just for this.

For now, ignore all this and continue with TLOB_IOCTL_TRACE_STOP; we can think
about the details later.

> - Generic uprobe helper
> 
> Proposed interface:
> 
>      struct rv_uprobe *rv_uprobe_attach_path(
>              struct path *path, loff_t offset,
>              int (*entry_fn)(struct rv_uprobe *, struct pt_regs *, __u64 *),
>              int (*ret_fn)  (struct rv_uprobe *, unsigned long func,
>                              struct pt_regs *, __u64 *),
>              void *priv);
> 
>      struct rv_uprobe *rv_uprobe_attach(
>              const char *binpath, loff_t offset,
>              int (*entry_fn)(struct rv_uprobe *, struct pt_regs *, __u64 *),
>              int (*ret_fn)  (struct rv_uprobe *, unsigned long func,
>                              struct pt_regs *, __u64 *),
>              void *priv);
> 
>      void rv_uprobe_detach(struct rv_uprobe *p);
> 
> struct rv_uprobe exposes three read-only fields to monitors (offset, 
> priv, path); the uprobe_consumer and callbacks would be kept private to 
> the implementation, so monitors need not include <linux/uprobes.h>.
> 
> rv_uprobe_attach() resolves the path and delegates to 
> rv_uprobe_attach_path(); the latter avoids a redundant kern_path() when 
> registering multiple probes on the same binary:
> 
>      kern_path(binpath, LOOKUP_FOLLOW, &path);
>      b->start = rv_uprobe_attach_path(&path, offset_start, entry_fn, NULL, b);
>      b->stop  = rv_uprobe_attach_path(&path, offset_stop,  stop_fn,  NULL, b);
>      path_put(&path);
> 
> Does the interface look reasonable, or did you have a different shape in 
> mind?
> 

Yeah, seems reasonable. We'd then need to keep the uprobe around for
deinitialisation, but keeping it global is probably the best way to do that
without overengineering anything.

Thanks,
Gabriele


^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2026-04-16 15:35 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-04-12 19:27 [RFC PATCH 0/4] rv/tlob: Add task latency over budget RV monitor wen.yang
2026-04-12 19:27 ` [RFC PATCH 1/4] rv/tlob: Add tlob model DOT file wen.yang
2026-04-13  8:19   ` Gabriele Monaco
2026-04-12 19:27 ` [RFC PATCH 2/4] rv/tlob: Add tlob deterministic automaton monitor wen.yang
2026-04-13  8:19   ` Gabriele Monaco
2026-04-16 15:09     ` Wen Yang
2026-04-16 15:35       ` Gabriele Monaco
2026-04-12 19:27 ` [RFC PATCH 3/4] rv/tlob: Add KUnit tests for the tlob monitor wen.yang
2026-04-16 12:09   ` Gabriele Monaco
2026-04-12 19:27 ` [RFC PATCH 4/4] selftests/rv: Add selftest " wen.yang
2026-04-16 12:00   ` Gabriele Monaco
