From: Wen Yang <wen.yang@linux.dev>
To: Gabriele Monaco <gmonaco@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>,
	Masami Hiramatsu <mhiramat@kernel.org>,
	Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
	linux-trace-kernel@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH 2/4] rv/tlob: Add tlob deterministic automaton monitor
Date: Thu, 16 Apr 2026 23:09:37 +0800
Message-ID: <228deda8-3685-4f07-afd5-d3f3ca531154@linux.dev>
In-Reply-To: <74a624434b59c00f9407909b8696f041536d9418.camel@redhat.com>



On 4/13/26 16:19, Gabriele Monaco wrote:
> On Mon, 2026-04-13 at 03:27 +0800, wen.yang@linux.dev wrote:
>> From: Wen Yang <wen.yang@linux.dev>
>>
>> Add the tlob (task latency over budget) RV monitor. tlob tracks the
>> monotonic elapsed time (CLOCK_MONOTONIC) of a marked per-task code
>> path, including time off-CPU, and fires a per-task hrtimer when the
>> elapsed time exceeds a configurable budget.
>>
>> Three-state DA (unmonitored/on_cpu/off_cpu) driven by trace_start,
>> switch_in/out, and budget_expired events. Per-task state lives in a
>> fixed-size hash table (TLOB_MAX_MONITORED slots) with RCU-deferred
>> free.
>>
>> Two userspace interfaces:
>>   - tracefs: uprobe pair registration via the monitor file using the
>>     format "pid:threshold_us:offset_start:offset_stop:binary_path"
>>   - /dev/rv ioctls (CONFIG_RV_CHARDEV): TLOB_IOCTL_TRACE_START /
>>     TRACE_STOP; TRACE_STOP returns -EOVERFLOW on violation
>>
>> Each /dev/rv fd has a per-fd mmap ring buffer (physically contiguous
>> pages). A control page (struct tlob_mmap_page) at offset 0 exposes
>> head/tail/dropped for lockless userspace reads; struct tlob_event
>> records follow at data_offset. Drop-new policy on overflow.
>>
>> UAPI: include/uapi/linux/rv.h (tlob_start_args, tlob_event,
>>        tlob_mmap_page, ioctl numbers), monitor_tlob.rst,
>>        ioctl-number.rst (RV_IOC_MAGIC=0xB9).
>>
> 
> I'm not fully grasping all the requirements for the monitors yet, but I see you
> are reimplementing a lot of functionality in the monitor itself rather than
> within RV, let's see if we can consolidate some of them:
> 
>   * you're using timer expirations, can we do it with timed automata? [1]
>   * RV automata usually don't have an /unmonitored/ state, your trace_start event
> would be the start condition (da_event_start) and the monitor will get
> non-running at each violation (it calls da_monitor_reset() automatically), all
> setup/cleanup logic should be handled implicitly within RV. I believe that would
> also save you that ugly trace_event_tlob() redefinition.
>   * you're maintaining a local hash table for each task_struct, that could use
> the per-object monitors [2] where your "object" is in fact your struct,
> allocated when you start the monitor with all appropriate fields and indexed by
> pid
>   * you are handling violations manually; considering timed automata trigger a
> full-fledged violation on timeouts, can you use the RV-way (error tracepoints or
> reactors only)? Do you need the additional reporting within the
> tracepoint/ioctl? Can't the userspace consumer deduce all of that from other
> events and let RV do just the monitoring?
>   * I like the uprobe thing, we could probably move all that to a common helper
> once we figure out how to make it generic.
> 
> Note: [1] and [2] didn't reach upstream yet, but should reach linux-next soon.
> 

Thanks for the review.  Here's my plan for each point -- let me know if 
the direction looks right.


- Timed automata

The HA framework [1] is a good match when the timeout threshold is 
global or state-determined, but tlob needs a per-invocation threshold 
supplied at TRACE_START time -- fitting that into HA would require 
framework changes.

My plan is to use da_monitor_init_hook() -- the same mechanism HA 
monitors use internally -- to arm the per-invocation hrtimer once 
da_create_storage() has stored the monitor_target.  This gives the same 
"timer fires => violation" semantics without touching the HA infrastructure.

If you see a cleaner way to pass per-invocation data through HA I'm 
happy to go that route.
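
To make the per-invocation requirement concrete, here is a minimal
userspace sketch (not kernel code) of what each TRACE_START has to carry:
its own absolute deadline derived from the caller's threshold.
struct budget_window and both helpers are hypothetical names used only
for illustration; the real monitor would arm an hrtimer rather than
compare timestamps.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_USEC 1000ULL

/* Hypothetical per-invocation record; not the patch's tlob_task_state. */
struct budget_window {
	uint64_t start_ns;	/* CLOCK_MONOTONIC timestamp at TRACE_START */
	uint64_t deadline_ns;	/* absolute expiry, fixed for this invocation */
};

/* Begin a window: the threshold comes from the caller, not from the model. */
static void budget_window_start(struct budget_window *w, uint64_t now_ns,
				uint64_t threshold_us)
{
	assert(threshold_us > 0);	/* the UAPI requires a non-zero budget */
	w->start_ns = now_ns;
	w->deadline_ns = now_ns + threshold_us * NSEC_PER_USEC;
}

/* True once elapsed time reaches the deadline, on-CPU or off-CPU alike. */
static bool budget_window_expired(const struct budget_window *w,
				  uint64_t now_ns)
{
	return now_ns >= w->deadline_ns;
}
```

Two windows started at the same instant with different thresholds expire
at different times, which is the property a global or state-determined
timeout cannot express.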


- Unmonitored state / da_handle_start_event

Fair point.  I'll drop the explicit unmonitored state and the
trace_event_tlob() redefinition.  tlob_start_task() will use
da_handle_start_event() to allocate storage, set initial state to on_cpu,
and fire the init hook to arm the timer in one shot.  tlob_stop_task()
calls da_monitor_reset() directly.

- Per-object monitors

Will do.  The custom hash table goes away; I'll switch to RV_MON_PER_OBJ
with:

     typedef struct tlob_task_state *monitor_target;

da_get_target_by_id() handles the sched_switch hot path lookup.


- RV-way violations

Agreed.  budget_expired will be declared INVALID in all states so the
framework calls react() (error_tlob tracepoint + any registered reactor)
and da_monitor_reset() automatically.  tlob won't emit any tracepoint of
its own.
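
As a sanity check of that plan, below is a small userspace model (a
sketch, not the generated RV code) of a two-state automaton in which
budget_expired has no valid successor.  All model_* names and the INVALID
marker are made up for this illustration, and treating switch_in in
on_cpu as a self-loop is my assumption; in the kernel the reaction and
the reset would be done by the DA framework.

```c
#include <assert.h>
#include <stdbool.h>

enum model_state { ON_CPU, OFF_CPU, STATE_MAX };
enum model_event { SWITCH_IN, SWITCH_OUT, BUDGET_EXPIRED, EVENT_MAX };

#define INVALID STATE_MAX	/* marks a transition the model forbids */

/* next_state[state][event]; budget_expired is INVALID in every state. */
static const enum model_state next_state[STATE_MAX][EVENT_MAX] = {
	[ON_CPU]  = { [SWITCH_IN] = ON_CPU, [SWITCH_OUT] = OFF_CPU,
		      [BUDGET_EXPIRED] = INVALID },
	[OFF_CPU] = { [SWITCH_IN] = ON_CPU, [SWITCH_OUT] = OFF_CPU,
		      [BUDGET_EXPIRED] = INVALID },
};

struct model_monitor {
	bool monitoring;		/* false until a start event arrives */
	enum model_state state;
	unsigned int reactions;		/* stands in for react()/error tracepoint */
};

/* Start condition: begin monitoring with on_cpu as the initial state. */
static void model_start(struct model_monitor *m)
{
	m->monitoring = true;
	m->state = ON_CPU;
}

/* Reset: stop monitoring until the next start event. */
static void model_reset(struct model_monitor *m)
{
	m->monitoring = false;
	m->state = ON_CPU;
}

/* Returns false when the event is a violation; events are ignored while idle. */
static bool model_handle_event(struct model_monitor *m, enum model_event ev)
{
	enum model_state next;

	if (!m->monitoring)
		return true;

	next = next_state[m->state][ev];
	if (next == INVALID) {
		m->reactions++;
		model_reset(m);
		return false;
	}
	m->state = next;
	return true;
}
```

With this shape there is no separate reporting code in the monitor: the
violation is simply the event that has no entry in the table.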

One note on the /dev/rv ioctl: TLOB_IOCTL_TRACE_STOP returns -EOVERFLOW
to the caller when the budget was exceeded.  This is just a syscall 
return code -- not a second reporting path -- to let in-process 
instrumentation react inline without polling the trace buffer.
Let me know if you have concerns about keeping this.
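
For reference, this is roughly how an in-process caller would consume
that return code.  Through the libc ioctl() wrapper a kernel return of
-EOVERFLOW shows up as -1 with errno set to EOVERFLOW, so the sketch
classifies the (ret, errno) pair.  The enum and the helper name are
hypothetical and not part of the proposed UAPI.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical result type for in-process instrumentation. */
enum tlob_stop_result {
	TLOB_WITHIN_BUDGET,
	TLOB_OVER_BUDGET,
	TLOB_STOP_ERROR,
};

/*
 * ret is the value returned by ioctl(fd, TLOB_IOCTL_TRACE_STOP, NULL);
 * err is errno sampled immediately after the call.
 */
static enum tlob_stop_result tlob_classify_stop(int ret, int err)
{
	if (ret == 0)
		return TLOB_WITHIN_BUDGET;
	if (ret == -1 && err == EOVERFLOW)
		return TLOB_OVER_BUDGET;
	return TLOB_STOP_ERROR;	/* e.g. bad fd or no window started */
}
```

A caller would pass the ioctl() result and errno straight in and branch
on TLOB_OVER_BUDGET inline, without touching the trace buffer.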


- Generic uprobe helper

Proposed interface:

     struct rv_uprobe *rv_uprobe_attach_path(
             struct path *path, loff_t offset,
             int (*entry_fn)(struct rv_uprobe *, struct pt_regs *, __u64 *),
             int (*ret_fn)  (struct rv_uprobe *, unsigned long func,
                             struct pt_regs *, __u64 *),
             void *priv);

     struct rv_uprobe *rv_uprobe_attach(
             const char *binpath, loff_t offset,
             int (*entry_fn)(struct rv_uprobe *, struct pt_regs *, __u64 *),
             int (*ret_fn)  (struct rv_uprobe *, unsigned long func,
                             struct pt_regs *, __u64 *),
             void *priv);

     void rv_uprobe_detach(struct rv_uprobe *p);

struct rv_uprobe exposes three read-only fields to monitors (offset, 
priv, path); the uprobe_consumer and callbacks would be kept private to 
the implementation, so monitors need not include <linux/uprobes.h>.

rv_uprobe_attach() resolves the path and delegates to 
rv_uprobe_attach_path(); the latter avoids a redundant kern_path() when 
registering multiple probes on the same binary:

     kern_path(binpath, LOOKUP_FOLLOW, &path);
     b->start = rv_uprobe_attach_path(&path, offset_start, entry_fn, NULL, b);
     b->stop  = rv_uprobe_attach_path(&path, offset_stop,  stop_fn,  NULL, b);
     path_put(&path);
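
The snippet above omits error handling.  A userspace mock of the unwind
order I have in mind is sketched below; mock_attach()/mock_detach() are
stand-ins invented for this example (they are not the proposed helper),
and returning NULL on failure is an assumption, since the real interface
might use ERR_PTR() instead.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for a probe handle, so the flow can be compiled and run. */
struct mock_uprobe {
	long offset;
	void *priv;
	int attached;
};

static int live_probes;		/* number of currently attached stand-ins */

/* Stand-in attach: fails for negative offsets, otherwise fills the slot. */
static struct mock_uprobe *mock_attach(struct mock_uprobe *slot, long offset,
				       void *priv)
{
	if (offset < 0)
		return NULL;
	slot->offset = offset;
	slot->priv = priv;
	slot->attached = 1;
	live_probes++;
	return slot;
}

/* Stand-in detach: safe to call with NULL or an already detached probe. */
static void mock_detach(struct mock_uprobe *p)
{
	if (p && p->attached) {
		p->attached = 0;
		live_probes--;
	}
}

struct mock_binding {
	struct mock_uprobe *start, *stop;
	struct mock_uprobe slots[2];
};

/* Attach start then stop; if stop fails, undo start so nothing leaks. */
static int mock_binding_attach(struct mock_binding *b, long offset_start,
			       long offset_stop)
{
	b->start = mock_attach(&b->slots[0], offset_start, b);
	if (!b->start)
		return -1;

	b->stop = mock_attach(&b->slots[1], offset_stop, b);
	if (!b->stop) {
		mock_detach(b->start);
		b->start = NULL;
		return -1;
	}
	return 0;
}
```

The point is only the ordering: a failed second attach must detach the
first probe before the path reference is dropped.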

Does the interface look reasonable, or did you have a different shape in 
mind?


--
Best wishes,
Wen


> 
> [1] -
> https://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git/commit/?h=rv/for-next&id=f5587d1b6ec938afb2f74fe399a68020d66923e4
> [2] -
> https://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git/commit/?h=rv/for-next&id=da282bf7fadb095ee0a40c32ff0126429c769b45
> 
>> Signed-off-by: Wen Yang <wen.yang@linux.dev>
>> ---
>>   Documentation/trace/rv/index.rst              |   1 +
>>   Documentation/trace/rv/monitor_tlob.rst       | 381 +++++++
>>   .../userspace-api/ioctl/ioctl-number.rst      |   1 +
>>   include/uapi/linux/rv.h                       | 181 ++++
>>   kernel/trace/rv/Kconfig                       |  17 +
>>   kernel/trace/rv/Makefile                      |   2 +
>>   kernel/trace/rv/monitors/tlob/Kconfig         |  51 +
>>   kernel/trace/rv/monitors/tlob/tlob.c          | 986 ++++++++++++++++++
>>   kernel/trace/rv/monitors/tlob/tlob.h          | 145 +++
>>   kernel/trace/rv/monitors/tlob/tlob_trace.h    |  42 +
>>   kernel/trace/rv/rv.c                          |   4 +
>>   kernel/trace/rv/rv_dev.c                      | 602 +++++++++++
>>   kernel/trace/rv/rv_trace.h                    |  50 +
>>   13 files changed, 2463 insertions(+)
>>   create mode 100644 Documentation/trace/rv/monitor_tlob.rst
>>   create mode 100644 include/uapi/linux/rv.h
>>   create mode 100644 kernel/trace/rv/monitors/tlob/Kconfig
>>   create mode 100644 kernel/trace/rv/monitors/tlob/tlob.c
>>   create mode 100644 kernel/trace/rv/monitors/tlob/tlob.h
>>   create mode 100644 kernel/trace/rv/monitors/tlob/tlob_trace.h
>>   create mode 100644 kernel/trace/rv/rv_dev.c
>>
>> diff --git a/Documentation/trace/rv/index.rst b/Documentation/trace/rv/index.rst
>> index a2812ac5c..4f2bfaf38 100644
>> --- a/Documentation/trace/rv/index.rst
>> +++ b/Documentation/trace/rv/index.rst
>> @@ -15,3 +15,4 @@ Runtime Verification
>>      monitor_wwnr.rst
>>      monitor_sched.rst
>>      monitor_rtapp.rst
>> +   monitor_tlob.rst
>> diff --git a/Documentation/trace/rv/monitor_tlob.rst b/Documentation/trace/rv/monitor_tlob.rst
>> new file mode 100644
>> index 000000000..d498e9894
>> --- /dev/null
>> +++ b/Documentation/trace/rv/monitor_tlob.rst
>> @@ -0,0 +1,381 @@
>> +.. SPDX-License-Identifier: GPL-2.0
>> +
>> +Monitor tlob
>> +============
>> +
>> +- Name: tlob - task latency over budget
>> +- Type: per-task deterministic automaton
>> +- Author: Wen Yang <wen.yang@linux.dev>
>> +
>> +Description
>> +-----------
>> +
>> +The tlob monitor tracks per-task elapsed time (CLOCK_MONOTONIC, including
>> +both on-CPU and off-CPU time) and reports a violation when the monitored
>> +task exceeds a configurable latency budget threshold.
>> +
>> +The monitor implements a three-state deterministic automaton::
>> +
>> +                              |
>> +                              | (initial)
>> +                              v
>> +                    +--------------+
>> +          +-------> | unmonitored  |
>> +          |         +--------------+
>> +          |                |
>> +          |          trace_start
>> +          |                v
>> +          |         +--------------+
>> +          |         |   on_cpu     |
>> +          |         +--------------+
>> +          |           |         |
>> +          |  switch_out|         | trace_stop / budget_expired
>> +          |            v         v
>> +          |  +--------------+  (unmonitored)
>> +          |  |   off_cpu    |
>> +          |  +--------------+
>> +          |     |         |
>> +          |     | switch_in| trace_stop / budget_expired
>> +          |     v         v
>> +          |  (on_cpu)  (unmonitored)
>> +          |
>> +          +-- trace_stop (from on_cpu or off_cpu)
>> +
>> +  Key transitions:
>> +    unmonitored   --(trace_start)-->   on_cpu
>> +    on_cpu        --(switch_out)-->    off_cpu
>> +    off_cpu       --(switch_in)-->     on_cpu
>> +    on_cpu        --(trace_stop)-->    unmonitored
>> +    off_cpu       --(trace_stop)-->    unmonitored
>> +    on_cpu        --(budget_expired)-> unmonitored   [violation]
>> +    off_cpu       --(budget_expired)-> unmonitored   [violation]
>> +
>> +  sched_wakeup self-loops in on_cpu and unmonitored; switch_out and
>> +  sched_wakeup self-loop in off_cpu.  budget_expired is fired by the one-shot hrtimer; it always
>> +  transitions to unmonitored regardless of whether the task is on-CPU
>> +  or off-CPU when the timer fires.
>> +
>> +State Descriptions
>> +------------------
>> +
>> +- **unmonitored**: Task is not being traced.  Scheduling events
>> +  (``switch_in``, ``switch_out``, ``sched_wakeup``) are silently
>> +  ignored (self-loop).  The monitor waits for a ``trace_start`` event
>> +  to begin a new observation window.
>> +
>> +- **on_cpu**: Task is running on the CPU with the deadline timer armed.
>> +  A one-shot hrtimer was set for ``threshold_us`` microseconds at
>> +  ``trace_start`` time.  A ``switch_out`` event transitions to
>> +  ``off_cpu``; the hrtimer keeps running (off-CPU time counts toward
>> +  the budget).  A ``trace_stop`` cancels the timer and returns to
>> +  ``unmonitored`` (normal completion).  If the hrtimer fires
>> +  (``budget_expired``) the violation is recorded and the automaton
>> +  transitions to ``unmonitored``.
>> +
>> +- **off_cpu**: Task was preempted or blocked.  The one-shot hrtimer
>> +  continues to run.  A ``switch_in`` event returns to ``on_cpu``.
>> +  A ``trace_stop`` cancels the timer and returns to ``unmonitored``.
>> +  If the hrtimer fires (``budget_expired``) while the task is off-CPU,
>> +  the violation is recorded and the automaton transitions to
>> +  ``unmonitored``.
>> +
>> +Rationale
>> +---------
>> +
>> +The per-task latency budget threshold allows operators to express timing
>> +requirements in microseconds and receive an immediate ftrace event when a
>> +task exceeds its budget.  This is useful for real-time tasks
>> +(``SCHED_FIFO`` / ``SCHED_DEADLINE``) where total elapsed time must
>> +remain within a known bound.
>> +
>> +Each task has an independent threshold, so up to ``TLOB_MAX_MONITORED``
>> +(64) tasks with different timing requirements can be monitored
>> +simultaneously.
>> +
>> +On threshold violation the automaton records a ``tlob_budget_exceeded``
>> +ftrace event carrying the final on-CPU / off-CPU time breakdown, but does
>> +not kill or throttle the task.  Monitoring can be restarted by issuing a
>> +new ``trace_start`` event (or a new ``TLOB_IOCTL_TRACE_START`` ioctl).
>> +
>> +A per-task one-shot hrtimer is armed at ``trace_start`` for exactly
>> +``threshold_us`` microseconds.  It fires at most once per monitoring
>> +window, performs an O(1) hash lookup, records the violation, and injects
>> +the ``budget_expired`` event into the DA.  When ``CONFIG_RV_MON_TLOB``
>> +is not set there is zero runtime cost.
>> +
>> +Usage
>> +-----
>> +
>> +tracefs interface (uprobe-based external monitoring)
>> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> +
>> +The ``monitor`` tracefs file allows any privileged user to instrument an
>> +unmodified binary via uprobes, without changing its source code.  Write a
>> +four-field record to attach two plain entry uprobes: one at
>> +``offset_start`` fires ``tlob_start_task()`` and one at ``offset_stop``
>> +fires ``tlob_stop_task()``, so the latency budget covers exactly the code
>> +region between the two offsets::
>> +
>> +  threshold_us:offset_start:offset_stop:binary_path
>> +
>> +``binary_path`` comes last so it may freely contain ``:`` (e.g. paths
>> +inside a container namespace).
>> +
>> +The uprobes fire for every task that executes the probed instruction in
>> +the binary, consistent with the native uprobe semantics.  All tasks that
>> +execute the code region get independent per-task monitoring slots.
>> +
>> +Using two plain entry uprobes (rather than a uretprobe for the stop) means
>> +that a mistyped offset can never corrupt the call stack; the worst outcome
>> +of a bad ``offset_stop`` is a missed stop that causes the hrtimer to fire
>> +and report a budget violation.
>> +
>> +Example  --  monitor a code region in ``/usr/bin/myapp`` with a 5 ms
>> +budget, where the region starts at offset 0x12a0 and ends at 0x12f0::
>> +
>> +  echo 1 > /sys/kernel/tracing/rv/monitors/tlob/enable
>> +
>> +  # Bind uprobes: start probe starts the clock, stop probe stops it
>> +  echo "5000:0x12a0:0x12f0:/usr/bin/myapp" \
>> +      > /sys/kernel/tracing/rv/monitors/tlob/monitor
>> +
>> +  # Remove the uprobe binding for this code region
>> +  echo "-0x12a0:/usr/bin/myapp" > /sys/kernel/tracing/rv/monitors/tlob/monitor
>> +
>> +  # List registered uprobe bindings (mirrors the write format)
>> +  cat /sys/kernel/tracing/rv/monitors/tlob/monitor
>> +  # -> 5000:0x12a0:0x12f0:/usr/bin/myapp
>> +
>> +  # Read violations from the trace buffer
>> +  cat /sys/kernel/tracing/trace
>> +
>> +Up to ``TLOB_MAX_MONITORED`` tasks may be monitored simultaneously.
>> +
>> +The offsets can be obtained with ``nm`` or ``readelf``::
>> +
>> +  nm -n /usr/bin/myapp | grep my_function
>> +  # -> 00000000000012a0 T my_function
>> +
>> +  readelf -s /usr/bin/myapp | grep my_function
>> +  # -> 42: 00000000000012a0  336 FUNC GLOBAL DEFAULT  13 my_function
>> +
>> +  # offset_start = 0x12a0 (function entry)
>> +  # offset_stop  = 0x12a0 + 0x50 = 0x12f0 (or any instruction before return)
>> +
>> +Notes:
>> +
>> +- The uprobes fire for every task that executes the probed instruction,
>> +  so concurrent calls from different threads each get independent
>> +  monitoring slots.
>> +- ``offset_stop`` need not be a function return; it can be any instruction
>> +  within the region.  If the stop probe is never reached (e.g. early exit
>> +  path bypasses it), the hrtimer fires and a budget violation is reported.
>> +- Each ``(binary_path, offset_start)`` pair may only be registered once.
>> +  A second write with the same ``offset_start`` for the same binary is
>> +  rejected with ``-EEXIST``.  Two entry uprobes at the same address would
>> +  both fire for every task, causing ``tlob_start_task()`` to be called
>> +  twice; the second call would silently fail with ``-EEXIST`` and the
>> +  second binding's threshold would never take effect.  Different code
>> +  regions that share the same ``offset_stop`` (common exit point) are
>> +  explicitly allowed.
>> +- The uprobe binding is removed when ``-offset_start:binary_path`` is
>> +  written to ``monitor``, or when the monitor is disabled.
>> +- The ``tag`` field in every ``tlob_budget_exceeded`` event is
>> +  automatically set to ``offset_start`` for the tracefs path, so
>> +  violation events for different code regions are immediately
>> +  distinguishable even when ``threshold_us`` values are identical.
>> +
>> +ftrace ring buffer (budget violation events)
>> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> +
>> +When a monitored task exceeds its latency budget the hrtimer fires,
>> +records the violation, and emits a single ``tlob_budget_exceeded`` event
>> +into the ftrace ring buffer.  **Nothing is written to the ftrace ring
>> +buffer while the task is within budget.**
>> +
>> +The event carries the on-CPU / off-CPU time breakdown so that root-cause
>> +analysis (CPU-bound vs. scheduling / I/O overrun) is immediate::
>> +
>> +  cat /sys/kernel/tracing/trace
>> +
>> +Example output::
>> +
>> +  myapp-1234 [003] .... 12345.678: tlob_budget_exceeded: \
>> +    myapp[1234]: budget exceeded threshold=5000 \
>> +    on_cpu=820 off_cpu=4500 switches=3 state=off_cpu tag=0x00000000000012a0
>> +
>> +Field descriptions:
>> +
>> +``threshold``
>> +  Configured latency budget in microseconds.
>> +
>> +``on_cpu``
>> +  Cumulative on-CPU time since ``trace_start``, in microseconds.
>> +
>> +``off_cpu``
>> +  Cumulative off-CPU (scheduling + I/O wait) time since ``trace_start``,
>> +  in microseconds.
>> +
>> +``switches``
>> +  Number of times the task was scheduled out during this window.
>> +
>> +``state``
>> +  DA state when the hrtimer fired: ``on_cpu`` means the task was executing
>> +  when the budget expired (CPU-bound overrun); ``off_cpu`` means the task
>> +  was preempted or blocked (scheduling / I/O overrun).
>> +
>> +``tag``
>> +  Opaque 64-bit cookie supplied by the caller via ``tlob_start_args.tag``
>> +  (ioctl path) or automatically set to ``offset_start`` (tracefs uprobe
>> +  path).  Use it to distinguish violations from different code regions
>> +  monitored by the same thread.  Zero when not set.
>> +
>> +To capture violations in a file::
>> +
>> +  trace-cmd record -e tlob_budget_exceeded &
>> +  # ... run workload ...
>> +  trace-cmd report
>> +
>> +/dev/rv ioctl interface (self-instrumentation)
>> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> +
>> +Tasks can self-instrument their own code paths via the ``/dev/rv`` misc
>> +device (requires ``CONFIG_RV_CHARDEV``).  The kernel key is
>> +``task_struct``; multiple threads sharing a single fd each get their own
>> +independent monitoring slot.
>> +
>> +**Synchronous mode**  --  the calling thread checks its own result::
>> +
>> +  int fd = open("/dev/rv", O_RDWR);
>> +
>> +  struct tlob_start_args args = {
>> +      .threshold_us = 50000,   /* 50 ms */
>> +      .tag          = 0,       /* optional; 0 = don't care */
>> +      .notify_fd    = -1,      /* no fd notification */
>> +  };
>> +  ioctl(fd, TLOB_IOCTL_TRACE_START, &args);
>> +
>> +  /* ... code path under observation ... */
>> +
>> +  int ret = ioctl(fd, TLOB_IOCTL_TRACE_STOP, NULL);
>> +  /* ret == 0:                        within budget  */
>> +  /* ret == -1 && errno == EOVERFLOW: budget exceeded */
>> +
>> +  close(fd);
>> +
>> +**Asynchronous mode**  --  a dedicated monitor thread receives violation
>> +records via ``read()`` on a shared fd, decoupling the observation from
>> +the critical path::
>> +
>> +  /* Monitor thread: open a dedicated fd. */
>> +  int monitor_fd = open("/dev/rv", O_RDWR);
>> +
>> +  /* Worker thread: set notify_fd = monitor_fd in TRACE_START args. */
>> +  int work_fd = open("/dev/rv", O_RDWR);
>> +  struct tlob_start_args args = {
>> +      .threshold_us = 10000,   /* 10 ms */
>> +      .tag          = REGION_A,
>> +      .notify_fd    = monitor_fd,
>> +  };
>> +  ioctl(work_fd, TLOB_IOCTL_TRACE_START, &args);
>> +  /* ... critical section ... */
>> +  ioctl(work_fd, TLOB_IOCTL_TRACE_STOP, NULL);
>> +
>> +  /* Monitor thread: blocking read() returns one or more tlob_event records. */
>> +  struct tlob_event ntfs[8];
>> +  ssize_t n = read(monitor_fd, ntfs, sizeof(ntfs));
>> +  for (ssize_t i = 0; i < n / (ssize_t)sizeof(struct tlob_event); i++) {
>> +      struct tlob_event *ntf = &ntfs[i];
>> +      printf("tid=%u tag=0x%llx exceeded budget=%llu us "
>> +             "(on_cpu=%llu off_cpu=%llu switches=%u state=%s)\n",
>> +             ntf->tid, ntf->tag, ntf->threshold_us,
>> +             ntf->on_cpu_us, ntf->off_cpu_us, ntf->switches,
>> +             ntf->state ? "on_cpu" : "off_cpu");
>> +  }
>> +
>> +**mmap ring buffer**  --  zero-copy consumption of violation events::
>> +
>> +  int fd = open("/dev/rv", O_RDWR);
>> +  struct tlob_start_args args = {
>> +      .threshold_us = 1000,   /* 1 ms */
>> +      .notify_fd    = fd,     /* push violations to own ring buffer */
>> +  };
>> +  ioctl(fd, TLOB_IOCTL_TRACE_START, &args);
>> +
>> +  /* Map the ring: one control page + capacity data records. */
>> +  size_t pagesize = sysconf(_SC_PAGESIZE);
>> +  size_t cap = 64;   /* read from page->capacity after mmap */
>> +  size_t len = pagesize + cap * sizeof(struct tlob_event);
>> +  void *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
>> +
>> +  struct tlob_mmap_page *page = map;
>> +  struct tlob_event *data =
>> +      (struct tlob_event *)((char *)map + page->data_offset);
>> +
>> +  /* Consumer loop: poll for events, read without copying. */
>> +  while (1) {
>> +      poll(&(struct pollfd){fd, POLLIN, 0}, 1, -1);
>> +
>> +      uint32_t head = __atomic_load_n(&page->data_head, __ATOMIC_ACQUIRE);
>> +      uint32_t tail = page->data_tail;
>> +      while (tail != head) {
>> +          handle(&data[tail & (page->capacity - 1)]);
>> +          tail++;
>> +      }
>> +      __atomic_store_n(&page->data_tail, tail, __ATOMIC_RELEASE);
>> +  }
>> +
>> +Note: ``read()`` and ``mmap()`` share the same ring and ``data_tail``
>> +cursor.  Do not use both simultaneously on the same fd.
>> +
>> +``tlob_event`` fields:
>> +
>> +``tid``
>> +  Thread ID (``task_pid_vnr``) of the violating task.
>> +
>> +``threshold_us``
>> +  Budget that was exceeded, in microseconds.
>> +
>> +``on_cpu_us``
>> +  Cumulative on-CPU time at violation time, in microseconds.
>> +
>> +``off_cpu_us``
>> +  Cumulative off-CPU time at violation time, in microseconds.
>> +
>> +``switches``
>> +  Number of context switches since ``TRACE_START``.
>> +
>> +``state``
>> +  1 = timer fired while task was on-CPU; 0 = timer fired while off-CPU.
>> +
>> +``tag``
>> +  Cookie from ``tlob_start_args.tag``; for the tracefs uprobe path this
>> +  equals ``offset_start``.  Zero when not set.
>> +
>> +tracefs files
>> +-------------
>> +
>> +The following files are created under
>> +``/sys/kernel/tracing/rv/monitors/tlob/``:
>> +
>> +``enable`` (rw)
>> +  Write ``1`` to enable the monitor; write ``0`` to disable it and
>> +  stop all currently monitored tasks.
>> +
>> +``desc`` (ro)
>> +  Human-readable description of the monitor.
>> +
>> +``monitor`` (rw)
>> +  Write ``threshold_us:offset_start:offset_stop:binary_path`` to bind two
>> +  plain entry uprobes in *binary_path*.  The uprobe at *offset_start* fires
>> +  ``tlob_start_task()``; the uprobe at *offset_stop* fires
>> +  ``tlob_stop_task()``.  Returns ``-EEXIST`` if a binding with the same
>> +  *offset_start* already exists for *binary_path*.  Write
>> +  ``-offset_start:binary_path`` to remove the binding.  Read to list
>> +  registered bindings, one
>> +  ``threshold_us:0xoffset_start:0xoffset_stop:binary_path`` entry per line.
>> +
>> +Specification
>> +-------------
>> +
>> +Graphviz DOT file in tools/verification/models/tlob.dot
>> diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
>> index 331223761..8d3af68db 100644
>> --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
>> +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
>> @@ -385,6 +385,7 @@ Code  Seq#    Include File                                             Comments
>>   0xB8  01-02  uapi/misc/mrvl_cn10k_dpi.h                                Marvell CN10K DPI driver
>>   0xB8  all    uapi/linux/mshv.h                                         Microsoft Hyper-V /dev/mshv driver
>>                                                                          <mailto:linux-hyperv@vger.kernel.org>
>> +0xB9  00-3F  linux/rv.h                                                 Runtime Verification (RV) monitors
>>   0xBA  00-0F  uapi/linux/liveupdate.h                                   Pasha Tatashin
>>                                                                          <mailto:pasha.tatashin@soleen.com>
>>   0xC0  00-0F  linux/usb/iowarrior.h
>> diff --git a/include/uapi/linux/rv.h b/include/uapi/linux/rv.h
>> new file mode 100644
>> index 000000000..d1b96d8cd
>> --- /dev/null
>> +++ b/include/uapi/linux/rv.h
>> @@ -0,0 +1,181 @@
>> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
>> +/*
>> + * UAPI definitions for Runtime Verification (RV) monitors.
>> + *
>> + * All RV monitors that expose an ioctl self-instrumentation interface
>> + * share the magic byte RV_IOC_MAGIC (0xB9), registered in
>> + * Documentation/userspace-api/ioctl/ioctl-number.rst.
>> + *
>> + * A single /dev/rv misc device serves as the entry point.  ioctl numbers
>> + * encode both the monitor identity and the operation:
>> + *
>> + *   0x01 - 0x1F  tlob (task latency over budget)
>> + *   0x20 - 0x3F  reserved for future RV monitors
>> + *
>> + * Usage examples and design rationale are in:
>> + *   Documentation/trace/rv/monitor_tlob.rst
>> + */
>> +
>> +#ifndef _UAPI_LINUX_RV_H
>> +#define _UAPI_LINUX_RV_H
>> +
>> +#include <linux/ioctl.h>
>> +#include <linux/types.h>
>> +
>> +/* Magic byte shared by all RV monitor ioctls. */
>> +#define RV_IOC_MAGIC	0xB9
>> +
>> +/* -----------------------------------------------------------------------
>> + * tlob: task latency over budget monitor  (nr 0x01 - 0x1F)
>> + * -----------------------------------------------------------------------
>> + */
>> +
>> +/**
>> + * struct tlob_start_args - arguments for TLOB_IOCTL_TRACE_START
>> + * @threshold_us: Latency budget for this critical section, in microseconds.
>> + *               Must be greater than zero.
>> + * @tag:         Opaque 64-bit cookie supplied by the caller.  Echoed back
>> + *               verbatim in the tlob_budget_exceeded ftrace event and in any
>> + *               tlob_event record delivered via @notify_fd.  Use it to identify
>> + *               which code region triggered a violation when the same thread
>> + *               monitors multiple regions sequentially.  Set to 0 if not
>> + *               needed.
>> + * @notify_fd:   File descriptor that will receive a tlob_event record on
>> + *               violation.  Must refer to an open /dev/rv fd.  May equal
>> + *               the calling fd (self-notification, useful for retrieving the
>> + *               on_cpu_us / off_cpu_us breakdown after TRACE_STOP returns
>> + *               -EOVERFLOW).  Set to -1 to disable fd notification; in that
>> + *               case violations are only signalled via the TRACE_STOP return
>> + *               value and the tlob_budget_exceeded ftrace event.
>> + * @flags:       Must be 0.  Reserved for future extensions.
>> + */
>> +struct tlob_start_args {
>> +	__u64 threshold_us;
>> +	__u64 tag;
>> +	__s32 notify_fd;
>> +	__u32 flags;
>> +};
>> +
>> +/**
>> + * struct tlob_event - one budget-exceeded event
>> + *
>> + * Consumed by read() on the notify_fd registered at TLOB_IOCTL_TRACE_START.
>> + * Each record describes a single budget exceedance for one task.
>> + *
>> + * @tid:          Thread ID (task_pid_vnr) of the violating task.
>> + * @threshold_us: Budget that was exceeded, in microseconds.
>> + * @on_cpu_us:    Cumulative on-CPU time at violation time, in microseconds.
>> + * @off_cpu_us:   Cumulative off-CPU (scheduling + I/O wait) time at
>> + *               violation time, in microseconds.
>> + * @switches:     Number of context switches since TRACE_START.
>> + * @state:        DA state at violation: 1 = on_cpu, 0 = off_cpu.
>> + * @tag:          Cookie from tlob_start_args.tag; for the tracefs uprobe path
>> + *               this is the offset_start value.  Zero when not set.
>> + */
>> +struct tlob_event {
>> +	__u32 tid;
>> +	__u32 pad;
>> +	__u64 threshold_us;
>> +	__u64 on_cpu_us;
>> +	__u64 off_cpu_us;
>> +	__u32 switches;
>> +	__u32 state;   /* 1 = on_cpu, 0 = off_cpu */
>> +	__u64 tag;
>> +};
>> +
>> +/**
>> + * struct tlob_mmap_page - control page for the mmap'd violation ring buffer
>> + *
>> + * Mapped at offset 0 of the mmap region returned by mmap(2) on a /dev/rv fd.
>> + * The data array of struct tlob_event records begins at offset @data_offset
>> + * (always one page from the mmap base; use this field rather than hard-coding
>> + * PAGE_SIZE so the code remains correct across architectures).
>> + *
>> + * Ring layout:
>> + *
>> + *   mmap base + 0             : struct tlob_mmap_page  (one page)
>> + *   mmap base + data_offset   : struct tlob_event[capacity]
>> + *
>> + * The mmap length determines the ring capacity.  Compute it as:
>> + *
>> + *   raw    = sysconf(_SC_PAGESIZE) + capacity * sizeof(struct tlob_event)
>> + *   length = (raw + sysconf(_SC_PAGESIZE) - 1) & ~(sysconf(_SC_PAGESIZE) - 1)
>> + *
>> + * i.e. round the raw byte count up to the next page boundary before
>> + * passing it to mmap(2).  The kernel requires a page-aligned length.
>> + * capacity must be a power of 2.  Read @capacity after a successful
>> + * mmap(2) for the actual value.
>> + *
>> + * Producer/consumer ordering contract:
>> + *
>> + *   Kernel (producer):
>> + *     data[data_head & (capacity - 1)] = event;
>> + *     // pairs with load-acquire in userspace:
>> + *     smp_store_release(&page->data_head, data_head + 1);
>> + *
>> + *   Userspace (consumer):
>> + *     // pairs with store-release in kernel:
>> + *     head = __atomic_load_n(&page->data_head, __ATOMIC_ACQUIRE);
>> + *     for (tail = page->data_tail; tail != head; tail++)
>> + *         handle(&data[tail & (capacity - 1)]);
>> + *     __atomic_store_n(&page->data_tail, tail, __ATOMIC_RELEASE);
>> + *
>> + * @data_head and @data_tail are monotonically increasing __u32 counters
>> + * in units of records.  Unsigned 32-bit wrap-around is handled correctly
>> + * by modular arithmetic; the ring is full when
>> + * (data_head - data_tail) == capacity.
>> + *
>> + * When the ring is full the kernel drops the incoming record and increments
>> + * @dropped.  The consumer should check @dropped periodically to detect loss.
>> + *
>> + * read() and mmap() share the same ring buffer.  Do not use both
>> + * simultaneously on the same fd.
>> + *
>> + * @data_head:   Next write slot index.  Updated by the kernel with
>> + *               store-release ordering.  Read by userspace with load-acquire.
>> + * @data_tail:   Next read slot index.  Updated by userspace.  Read by the
>> + *               kernel to detect overflow.
>> + * @capacity:    Actual ring capacity in records (power of 2).  Written once
>> + *               by the kernel at mmap time; read-only for userspace thereafter.
>> + * @version:     Ring buffer ABI version; currently 1.
>> + * @data_offset: Byte offset from the mmap base to the data array.
>> + *               Always equal to sysconf(_SC_PAGESIZE) on the running kernel.
>> + * @record_size: sizeof(struct tlob_event) as seen by the kernel.  Verify
>> + *               this matches userspace's sizeof before indexing the array.
>> + * @dropped:     Number of events dropped because the ring was full.
>> + *               Monotonically increasing; read with __ATOMIC_RELAXED.
>> + */
>> +struct tlob_mmap_page {
>> +	__u32  data_head;
>> +	__u32  data_tail;
>> +	__u32  capacity;
>> +	__u32  version;
>> +	__u32  data_offset;
>> +	__u32  record_size;
>> +	__u64  dropped;
>> +};
>> +
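To make the ring layout and ordering contract above concrete, here is a minimal userspace sketch of the mmap length computation and the consumer loop. It is an illustration under stated assumptions, not the patch's code: `struct ctl_page` is a local mirror of `struct tlob_mmap_page`, `struct rec` is a stand-in record (the real `struct tlob_event` layout is not assumed), and `ring_mmap_length()`, `ring_drain()` and `sum_tags()` are hypothetical helper names. The `__atomic_*` calls are the GCC/Clang builtins named in the comment block.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local mirror of the control page from the patch; illustration only. */
struct ctl_page {
	uint32_t data_head;
	uint32_t data_tail;
	uint32_t capacity;
	uint32_t version;
	uint32_t data_offset;
	uint32_t record_size;
	uint64_t dropped;
};

/* Stand-in record; the real struct tlob_event layout is not assumed. */
struct rec {
	uint64_t tag;
};

/* One control page plus capacity records, rounded up to a whole page. */
static size_t ring_mmap_length(size_t page_size, size_t capacity,
			       size_t record_size)
{
	size_t raw = page_size + capacity * record_size;

	/* The mask trick below only works for power-of-two page sizes. */
	assert(page_size && (page_size & (page_size - 1)) == 0);
	return (raw + page_size - 1) & ~(page_size - 1);
}

/* Example handler (hypothetical): accumulate the tags it is shown. */
static uint64_t tag_sum;

static void sum_tags(const struct rec *r)
{
	tag_sum += r->tag;
}

/* Drain every published record; returns how many were handled. */
static uint32_t ring_drain(struct ctl_page *page, const struct rec *data,
			   void (*handle)(const struct rec *))
{
	uint32_t mask = page->capacity - 1;
	/* Pairs with the producer's store-release on data_head. */
	uint32_t head = __atomic_load_n(&page->data_head, __ATOMIC_ACQUIRE);
	uint32_t tail = page->data_tail;
	uint32_t n = 0;

	/* tail != head stays correct across u32 wrap-around. */
	for (; tail != head; tail++, n++)
		handle(&data[tail & mask]);

	/* Publish the new tail so the producer can reuse the slots. */
	__atomic_store_n(&page->data_tail, tail, __ATOMIC_RELEASE);
	return n;
}
```

A real consumer would obtain `page` from mmap(2) on the /dev/rv fd, locate the data array at `data_offset`, and check `record_size` before indexing, as the comment block asks.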
>> +/*
>> + * TLOB_IOCTL_TRACE_START - begin monitoring the calling task.
>> + *
>> + * Arms a per-task hrtimer for threshold_us microseconds.  If args.notify_fd
>> + * is >= 0, a tlob_event record is pushed into that fd's ring buffer on
>> + * violation in addition to the tlob_budget_exceeded ftrace event.
>> + * args.notify_fd == -1 disables fd notification.
>> + *
>> + * Violation records are consumed by read() on the notify_fd (blocking or
>> + * non-blocking depending on O_NONBLOCK).  On violation, TLOB_IOCTL_TRACE_STOP
>> + * also returns -EOVERFLOW regardless of whether notify_fd is set.
>> + *
>> + * args.flags must be 0.
>> + */
>> +#define TLOB_IOCTL_TRACE_START		_IOW(RV_IOC_MAGIC, 0x01, struct tlob_start_args)
>> +
>> +/*
>> + * TLOB_IOCTL_TRACE_STOP - end monitoring the calling task.
>> + *
>> + * Returns 0 if within budget, -EOVERFLOW if the budget was exceeded.
>> + */
>> +#define TLOB_IOCTL_TRACE_STOP		_IO(RV_IOC_MAGIC,  0x02)
>> +
>> +#endif /* _UAPI_LINUX_RV_H */
>> diff --git a/kernel/trace/rv/Kconfig b/kernel/trace/rv/Kconfig
>> index 5b4be87ba..227573cda 100644
>> --- a/kernel/trace/rv/Kconfig
>> +++ b/kernel/trace/rv/Kconfig
>> @@ -65,6 +65,7 @@ source "kernel/trace/rv/monitors/pagefault/Kconfig"
>>   source "kernel/trace/rv/monitors/sleep/Kconfig"
>>   # Add new rtapp monitors here
>>   
>> +source "kernel/trace/rv/monitors/tlob/Kconfig"
>>   # Add new monitors here
>>   
>>   config RV_REACTORS
>> @@ -93,3 +94,19 @@ config RV_REACT_PANIC
>>   	help
>>   	  Enables the panic reactor. The panic reactor emits a printk()
>>   	  message if an exception is found and panic()s the system.
>> +
>> +config RV_CHARDEV
>> +	bool "RV ioctl interface via /dev/rv"
>> +	depends on RV
>> +	default n
>> +	help
>> +	  Register a /dev/rv misc device that exposes an ioctl interface
>> +	  for RV monitor self-instrumentation.  All RV monitors share the
>> +	  single device node; ioctl numbers encode the monitor identity.
>> +
>> +	  When enabled, user-space programs can open /dev/rv and use
>> +	  monitor-specific ioctl commands to bracket code regions they
>> +	  want the kernel RV subsystem to observe.
>> +
>> +	  Say Y here if you want to use the tlob self-instrumentation
>> +	  ioctl interface; otherwise say N.
>> diff --git a/kernel/trace/rv/Makefile b/kernel/trace/rv/Makefile
>> index 750e4ad6f..cc3781a3b 100644
>> --- a/kernel/trace/rv/Makefile
>> +++ b/kernel/trace/rv/Makefile
>> @@ -3,6 +3,7 @@
>>   ccflags-y += -I $(src)		# needed for trace events
>>   
>>   obj-$(CONFIG_RV) += rv.o
>> +obj-$(CONFIG_RV_CHARDEV) += rv_dev.o
>>   obj-$(CONFIG_RV_MON_WIP) += monitors/wip/wip.o
>>   obj-$(CONFIG_RV_MON_WWNR) += monitors/wwnr/wwnr.o
>>   obj-$(CONFIG_RV_MON_SCHED) += monitors/sched/sched.o
>> @@ -17,6 +18,7 @@ obj-$(CONFIG_RV_MON_STS) += monitors/sts/sts.o
>>   obj-$(CONFIG_RV_MON_NRP) += monitors/nrp/nrp.o
>>   obj-$(CONFIG_RV_MON_SSSW) += monitors/sssw/sssw.o
>>   obj-$(CONFIG_RV_MON_OPID) += monitors/opid/opid.o
>> +obj-$(CONFIG_RV_MON_TLOB) += monitors/tlob/tlob.o
>>   # Add new monitors here
>>   obj-$(CONFIG_RV_REACTORS) += rv_reactors.o
>>   obj-$(CONFIG_RV_REACT_PRINTK) += reactor_printk.o
>> diff --git a/kernel/trace/rv/monitors/tlob/Kconfig b/kernel/trace/rv/monitors/tlob/Kconfig
>> new file mode 100644
>> index 000000000..010237480
>> --- /dev/null
>> +++ b/kernel/trace/rv/monitors/tlob/Kconfig
>> @@ -0,0 +1,51 @@
>> +# SPDX-License-Identifier: GPL-2.0-only
>> +#
>> +config RV_MON_TLOB
>> +	depends on RV
>> +	depends on UPROBES
>> +	select DA_MON_EVENTS_ID
>> +	bool "tlob monitor"
>> +	help
>> +	  Enable the tlob (task latency over budget) monitor. This monitor
>> +	  tracks the elapsed time (CLOCK_MONOTONIC) of a marked code path within a
>> +	  task (including both on-CPU and off-CPU time) and reports a
>> +	  violation when the elapsed time exceeds a configurable budget
>> +	  threshold.
>> +
>> +	  The monitor implements a three-state deterministic automaton.
>> +	  States: unmonitored, on_cpu, off_cpu.
>> +	  Key transitions:
>> +	    unmonitored    --(trace_start)-->    on_cpu
>> +	    on_cpu   --(switch_out)-->     off_cpu
>> +	    off_cpu  --(switch_in)-->      on_cpu
>> +	    on_cpu   --(trace_stop)-->    unmonitored
>> +	    off_cpu  --(trace_stop)-->    unmonitored
>> +	    on_cpu   --(budget_expired)--> unmonitored
>> +	    off_cpu  --(budget_expired)--> unmonitored
>> +
>> +	  External configuration is done via the tracefs "monitor" file:
>> +	    echo threshold_us:offset_start:offset_stop:binary_path > .../rv/monitors/tlob/monitor
>> +	    echo -offset_start:binary_path > .../rv/monitors/tlob/monitor  (remove binding)
>> +	    cat .../rv/monitors/tlob/monitor  (list bindings)
>> +
>> +	  The uprobe binding places two plain entry uprobes at offset_start and
>> +	  offset_stop in the binary; these trigger tlob_start_task() and
>> +	  tlob_stop_task() respectively.  Using two entry uprobes (rather than a
>> +	  uretprobe) means that a mistyped offset can never corrupt the call
>> +	  stack; the worst outcome is a missed stop, which causes the hrtimer to
>> +	  fire and report a budget violation.
>> +
>> +	  Violation events are delivered via a lock-free mmap ring buffer on
>> +	  /dev/rv (enabled by CONFIG_RV_CHARDEV).  The consumer mmap()s the
>> +	  device, reads records from the data array using the head/tail indices
>> +	  in the control page, and advances data_tail when done.
>> +
>> +	  For self-instrumentation, use TLOB_IOCTL_TRACE_START /
>> +	  TLOB_IOCTL_TRACE_STOP via the /dev/rv misc device (enabled by
>> +	  CONFIG_RV_CHARDEV).
>> +
>> +	  Up to TLOB_MAX_MONITORED tasks may be monitored simultaneously.
>> +
>> +	  For further information, see:
>> +	    Documentation/trace/rv/monitor_tlob.rst
>> +
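The key transitions listed in the Kconfig help can be sketched as a plain transition table. This is only an illustration of the automaton as described there, not the generated tlob.h model: the enum names, `next_state[]` and `da_step()` are hypothetical, and only the seven listed transitions are encoded (every other state/event pair is treated as invalid here, which is an assumption).

```c
#include <assert.h>

enum tlob_state { UNMONITORED, ON_CPU, OFF_CPU, NR_STATES };
enum tlob_event_id {
	TRACE_START, SWITCH_OUT, SWITCH_IN, TRACE_STOP, BUDGET_EXPIRED,
	NR_EVENTS
};

#define INVALID (-1)

/* Rows are current states, columns are events, cells are next states. */
static const signed char next_state[NR_STATES][NR_EVENTS] = {
	[UNMONITORED] = {
		[TRACE_START]    = ON_CPU,
		[SWITCH_OUT]     = INVALID,
		[SWITCH_IN]      = INVALID,
		[TRACE_STOP]     = INVALID,
		[BUDGET_EXPIRED] = INVALID,
	},
	[ON_CPU] = {
		[TRACE_START]    = INVALID,
		[SWITCH_OUT]     = OFF_CPU,
		[SWITCH_IN]      = INVALID,
		[TRACE_STOP]     = UNMONITORED,
		[BUDGET_EXPIRED] = UNMONITORED,
	},
	[OFF_CPU] = {
		[TRACE_START]    = INVALID,
		[SWITCH_OUT]     = INVALID,
		[SWITCH_IN]      = ON_CPU,
		[TRACE_STOP]     = UNMONITORED,
		[BUDGET_EXPIRED] = UNMONITORED,
	},
};

/* Return the next state, or INVALID if the table has no such transition. */
static int da_step(int cur, int ev)
{
	assert(cur >= 0 && cur < NR_STATES);
	assert(ev >= 0 && ev < NR_EVENTS);
	return next_state[cur][ev];
}
```

Walking a task through start, one context switch out and back in, then stop returns it to UNMONITORED; a switch_in while already on_cpu yields INVALID in this sketch.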
>> diff --git a/kernel/trace/rv/monitors/tlob/tlob.c b/kernel/trace/rv/monitors/tlob/tlob.c
>> new file mode 100644
>> index 000000000..a6e474025
>> --- /dev/null
>> +++ b/kernel/trace/rv/monitors/tlob/tlob.c
>> @@ -0,0 +1,986 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * tlob: task latency over budget monitor
>> + *
>> + * Track the elapsed wall-clock time of a marked code path and detect when
>> + * a monitored task exceeds its per-task latency budget.  CLOCK_MONOTONIC
>> + * is used so both on-CPU and off-CPU time count toward the budget.
>> + *
>> + * Per-task state is maintained in a spinlock-protected hash table.  A
>> + * one-shot hrtimer fires at the deadline; if the task has not called
>> + * trace_stop by then, a violation is recorded.
>> + *
>> + * Up to TLOB_MAX_MONITORED tasks may be tracked simultaneously.
>> + *
>> + * Copyright (C) 2026 Wen Yang <wen.yang@linux.dev>
>> + */
>> +#include <linux/file.h>
>> +#include <linux/fs.h>
>> +#include <linux/ftrace.h>
>> +#include <linux/hash.h>
>> +#include <linux/hrtimer.h>
>> +#include <linux/kernel.h>
>> +#include <linux/ktime.h>
>> +#include <linux/module.h>
>> +#include <linux/init.h>
>> +#include <linux/namei.h>
>> +#include <linux/poll.h>
>> +#include <linux/rv.h>
>> +#include <linux/sched.h>
>> +#include <linux/slab.h>
>> +#include <linux/atomic.h>
>> +#include <linux/rcupdate.h>
>> +#include <linux/spinlock.h>
>> +#include <linux/tracefs.h>
>> +#include <linux/uaccess.h>
>> +#include <linux/uprobes.h>
>> +#include <kunit/visibility.h>
>> +#include <rv/instrumentation.h>
>> +
>> +/* rv_interface_lock is defined in kernel/trace/rv/rv.c */
>> +extern struct mutex rv_interface_lock;
>> +
>> +#define MODULE_NAME "tlob"
>> +
>> +#include <rv_trace.h>
>> +#include <trace/events/sched.h>
>> +
>> +#define RV_MON_TYPE RV_MON_PER_TASK
>> +#include "tlob.h"
>> +#include <rv/da_monitor.h>
>> +
>> +/* Hash table size; must be a power of two. */
>> +#define TLOB_HTABLE_BITS		6
>> +#define TLOB_HTABLE_SIZE		(1 << TLOB_HTABLE_BITS)
>> +
>> +/* Maximum binary path length for uprobe binding. */
>> +#define TLOB_MAX_PATH			256
>> +
>> +/* Per-task latency monitoring state. */
>> +struct tlob_task_state {
>> +	struct hlist_node	hlist;
>> +	struct task_struct	*task;
>> +	u64			threshold_us;
>> +	u64			tag;
>> +	struct hrtimer		deadline_timer;
>> +	int			canceled;	/* protected by entry_lock */
>> +	struct file		*notify_file;	/* NULL or held reference */
>> +
>> +	/*
>> +	 * entry_lock serialises the mutable accounting fields below.
>> +	 * Lock order: tlob_table_lock -> entry_lock (never reverse).
>> +	 */
>> +	raw_spinlock_t		entry_lock;
>> +	u64			on_cpu_us;
>> +	u64			off_cpu_us;
>> +	ktime_t			last_ts;
>> +	u32			switches;
>> +	u8			da_state;
>> +
>> +	struct rcu_head		rcu;	/* for call_rcu() teardown */
>> +};
>> +
>> +/* Per-uprobe-binding state: a start + stop probe pair for one binary region. */
>> +struct tlob_uprobe_binding {
>> +	struct list_head	list;
>> +	u64			threshold_us;
>> +	struct path		path;
>> +	char			binpath[TLOB_MAX_PATH];	/* canonical path for read/remove */
>> +	loff_t			offset_start;
>> +	loff_t			offset_stop;
>> +	struct uprobe_consumer	entry_uc;
>> +	struct uprobe_consumer	stop_uc;
>> +	struct uprobe		*entry_uprobe;
>> +	struct uprobe		*stop_uprobe;
>> +};
>> +
>> +/* Object pool for tlob_task_state. */
>> +static struct kmem_cache *tlob_state_cache;
>> +
>> +/* Hash table and lock protecting table structure (insert/delete/canceled). */
>> +static struct hlist_head tlob_htable[TLOB_HTABLE_SIZE];
>> +static DEFINE_RAW_SPINLOCK(tlob_table_lock);
>> +static atomic_t tlob_num_monitored = ATOMIC_INIT(0);
>> +
>> +/* Uprobe binding list; protected by tlob_uprobe_mutex. */
>> +static LIST_HEAD(tlob_uprobe_list);
>> +static DEFINE_MUTEX(tlob_uprobe_mutex);
>> +
>> +/* Forward declaration */
>> +static enum hrtimer_restart tlob_deadline_timer_fn(struct hrtimer *timer);
>> +
>> +/* Hash table helpers */
>> +
>> +static unsigned int tlob_hash_task(const struct task_struct *task)
>> +{
>> +	return hash_ptr((void *)task, TLOB_HTABLE_BITS);
>> +}
>> +
>> +/*
>> + * tlob_find_rcu - look up per-task state.
>> + * Must be called under rcu_read_lock() or with tlob_table_lock held.
>> + */
>> +static struct tlob_task_state *tlob_find_rcu(struct task_struct *task)
>> +{
>> +	struct tlob_task_state *ws;
>> +	unsigned int h = tlob_hash_task(task);
>> +
>> +	hlist_for_each_entry_rcu(ws, &tlob_htable[h], hlist,
>> +				 lockdep_is_held(&tlob_table_lock))
>> +		if (ws->task == task)
>> +			return ws;
>> +	return NULL;
>> +}
>> +
>> +/* Allocate and initialise a new per-task state entry. */
>> +static struct tlob_task_state *tlob_alloc(struct task_struct *task,
>> +					  u64 threshold_us, u64 tag)
>> +{
>> +	struct tlob_task_state *ws;
>> +
>> +	ws = kmem_cache_zalloc(tlob_state_cache, GFP_ATOMIC);
>> +	if (!ws)
>> +		return NULL;
>> +
>> +	ws->task = task;
>> +	get_task_struct(task);
>> +	ws->threshold_us = threshold_us;
>> +	ws->tag = tag;
>> +	ws->last_ts = ktime_get();
>> +	ws->da_state = on_cpu_tlob;
>> +	raw_spin_lock_init(&ws->entry_lock);
>> +	hrtimer_setup(&ws->deadline_timer, tlob_deadline_timer_fn,
>> +		      CLOCK_MONOTONIC, HRTIMER_MODE_REL);
>> +	return ws;
>> +}
>> +
>> +/* RCU callback: free the slab once no readers remain. */
>> +static void tlob_free_rcu_slab(struct rcu_head *head)
>> +{
>> +	struct tlob_task_state *ws =
>> +		container_of(head, struct tlob_task_state, rcu);
>> +	kmem_cache_free(tlob_state_cache, ws);
>> +}
>> +
>> +/* Arm the one-shot deadline timer for threshold_us microseconds. */
>> +static void tlob_arm_deadline(struct tlob_task_state *ws)
>> +{
>> +	hrtimer_start(&ws->deadline_timer,
>> +		      ns_to_ktime(ws->threshold_us * NSEC_PER_USEC),
>> +		      HRTIMER_MODE_REL);
>> +}
>> +
>> +/*
>> + * Push a violation record into a monitor fd's ring buffer (softirq context).
>> + * Drop-new policy: discard incoming record when full.  smp_store_release on
>> + * data_head pairs with smp_load_acquire in the consumer.
>> + */
>> +static void tlob_event_push(struct rv_file_priv *priv,
>> +			    const struct tlob_event *info)
>> +{
>> +	struct tlob_ring *ring = &priv->ring;
>> +	unsigned long flags;
>> +	u32 head, tail;
>> +
>> +	spin_lock_irqsave(&ring->lock, flags);
>> +
>> +	head = ring->page->data_head;
>> +	tail = READ_ONCE(ring->page->data_tail);
>> +
>> +	if (head - tail > ring->mask) {
>> +		/* Ring full: drop incoming record. */
>> +		ring->page->dropped++;
>> +		spin_unlock_irqrestore(&ring->lock, flags);
>> +		return;
>> +	}
>> +
>> +	ring->data[head & ring->mask] = *info;
>> +	/* pairs with smp_load_acquire() in the consumer */
>> +	smp_store_release(&ring->page->data_head, head + 1);
>> +
>> +	spin_unlock_irqrestore(&ring->lock, flags);
>> +
>> +	wake_up_interruptible_poll(&priv->waitq, EPOLLIN | EPOLLRDNORM);
>> +}
>> +
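The drop-new check in tlob_event_push() relies on unsigned wrap-around: `head - tail` is the occupancy even after either counter passes 2^32. A small userspace model of just that arithmetic, assuming a fixed capacity of 4 and with the spinlock and wakeup left out; `struct ring_model` and `ring_model_push()` are hypothetical names, and the `__atomic_*` builtins stand in for READ_ONCE()/smp_store_release():

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_CAPACITY 4u	/* must be a power of two */

struct ring_model {
	uint32_t head;		/* next write slot, monotonically increasing */
	uint32_t tail;		/* next read slot, advanced by the consumer */
	uint32_t mask;		/* RING_CAPACITY - 1 */
	uint64_t dropped;	/* records discarded because the ring was full */
	uint64_t data[RING_CAPACITY];
};

/* Push one value; returns false (and counts a drop) when the ring is full. */
static bool ring_model_push(struct ring_model *r, uint64_t value)
{
	uint32_t head = r->head;
	uint32_t tail = __atomic_load_n(&r->tail, __ATOMIC_RELAXED);

	/* Modular subtraction gives the occupancy across u32 wrap-around. */
	if (head - tail > r->mask) {
		r->dropped++;
		return false;
	}

	r->data[head & r->mask] = value;
	/* Publish the record only after its payload has been written. */
	__atomic_store_n(&r->head, head + 1, __ATOMIC_RELEASE);
	return true;
}
```

Writing the test as `> mask` instead of `== capacity` gives the same answer while occupancy stays within capacity, and it also reports "full" if the tail read from the shared page is out of range.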
>> +#if IS_ENABLED(CONFIG_KUNIT)
>> +void tlob_event_push_kunit(struct rv_file_priv *priv,
>> +			  const struct tlob_event *info)
>> +{
>> +	tlob_event_push(priv, info);
>> +}
>> +EXPORT_SYMBOL_IF_KUNIT(tlob_event_push_kunit);
>> +#endif /* CONFIG_KUNIT */
>> +
>> +/*
>> + * Budget exceeded: remove the entry, record the violation, and inject
>> + * budget_expired into the DA.
>> + *
>> + * Lock order: tlob_table_lock -> entry_lock.  tlob_stop_task() sets
>> + * ws->canceled under both locks; if we see it here the stop path owns cleanup.
>> + * fput/put_task_struct are done before call_rcu(); the RCU callback only
>> + * reclaims the slab.
>> + */
>> +static enum hrtimer_restart tlob_deadline_timer_fn(struct hrtimer *timer)
>> +{
>> +	struct tlob_task_state *ws =
>> +		container_of(timer, struct tlob_task_state, deadline_timer);
>> +	struct tlob_event info = {};
>> +	struct file *notify_file;
>> +	struct task_struct *task;
>> +	unsigned long flags;
>> +	/* snapshots taken under entry_lock */
>> +	u64 on_cpu_us, off_cpu_us, threshold_us, tag;
>> +	u32 switches;
>> +	bool on_cpu;
>> +	bool push_event = false;
>> +
>> +	raw_spin_lock_irqsave(&tlob_table_lock, flags);
>> +	/* stop path sets canceled under both locks; if set it owns cleanup */
>> +	if (ws->canceled) {
>> +		raw_spin_unlock_irqrestore(&tlob_table_lock, flags);
>> +		return HRTIMER_NORESTART;
>> +	}
>> +
>> +	/* Finalize accounting and snapshot all fields under entry_lock. */
>> +	raw_spin_lock(&ws->entry_lock);
>> +
>> +	{
>> +		ktime_t now = ktime_get();
>> +		u64 delta_us = ktime_to_us(ktime_sub(now, ws->last_ts));
>> +
>> +		if (ws->da_state == on_cpu_tlob)
>> +			ws->on_cpu_us += delta_us;
>> +		else
>> +			ws->off_cpu_us += delta_us;
>> +	}
>> +
>> +	ws->canceled  = 1;
>> +	on_cpu_us     = ws->on_cpu_us;
>> +	off_cpu_us    = ws->off_cpu_us;
>> +	threshold_us  = ws->threshold_us;
>> +	tag           = ws->tag;
>> +	switches      = ws->switches;
>> +	on_cpu        = (ws->da_state == on_cpu_tlob);
>> +	notify_file   = ws->notify_file;
>> +	if (notify_file) {
>> +		info.tid          = task_pid_vnr(ws->task);
>> +		info.threshold_us = threshold_us;
>> +		info.on_cpu_us    = on_cpu_us;
>> +		info.off_cpu_us   = off_cpu_us;
>> +		info.switches     = switches;
>> +		info.state        = on_cpu ? 1 : 0;
>> +		info.tag          = tag;
>> +		push_event        = true;
>> +	}
>> +
>> +	raw_spin_unlock(&ws->entry_lock);
>> +
>> +	hlist_del_rcu(&ws->hlist);
>> +	atomic_dec(&tlob_num_monitored);
>> +	/*
>> +	 * Hold a reference so task remains valid across da_handle_event()
>> +	 * after we drop tlob_table_lock.
>> +	 */
>> +	task = ws->task;
>> +	get_task_struct(task);
>> +	raw_spin_unlock_irqrestore(&tlob_table_lock, flags);
>> +
>> +	/*
>> +	 * Both locks are now released; ws is exclusively owned (removed from
>> +	 * the hash table with canceled=1).  Emit the tracepoint and push the
>> +	 * violation record.
>> +	 */
>> +	trace_tlob_budget_exceeded(ws->task, threshold_us, on_cpu_us,
>> +				   off_cpu_us, switches, on_cpu, tag);
>> +
>> +	if (push_event) {
>> +		struct rv_file_priv *priv = notify_file->private_data;
>> +
>> +		if (priv)
>> +			tlob_event_push(priv, &info);
>> +	}
>> +
>> +	da_handle_event(task, budget_expired_tlob);
>> +
>> +	if (notify_file)
>> +		fput(notify_file);		/* ref from fget() at TRACE_START */
>> +	put_task_struct(ws->task);		/* ref from tlob_alloc() */
>> +	put_task_struct(task);			/* extra ref from get_task_struct() above */
>> +	call_rcu(&ws->rcu, tlob_free_rcu_slab);
>> +	return HRTIMER_NORESTART;
>> +}
>> +
>> +/* Tracepoint handlers */
>> +
>> +/*
>> + * handle_sched_switch - advance the DA and accumulate on/off-CPU time.
>> + *
>> + * RCU read-side for lock-free lookup; entry_lock for per-task accounting.
>> + * da_handle_event() is called after rcu_read_unlock() to avoid holding the
>> + * read-side critical section across the RV framework.
>> + */
>> +static void handle_sched_switch(void *data, bool preempt,
>> +				struct task_struct *prev,
>> +				struct task_struct *next,
>> +				unsigned int prev_state)
>> +{
>> +	struct tlob_task_state *ws;
>> +	unsigned long flags;
>> +	bool do_prev = false, do_next = false;
>> +	ktime_t now;
>> +
>> +	rcu_read_lock();
>> +
>> +	ws = tlob_find_rcu(prev);
>> +	if (ws) {
>> +		raw_spin_lock_irqsave(&ws->entry_lock, flags);
>> +		if (!ws->canceled) {
>> +			now = ktime_get();
>> +			ws->on_cpu_us += ktime_to_us(ktime_sub(now, ws->last_ts));
>> +			ws->last_ts = now;
>> +			ws->switches++;
>> +			ws->da_state = off_cpu_tlob;
>> +			do_prev = true;
>> +		}
>> +		raw_spin_unlock_irqrestore(&ws->entry_lock, flags);
>> +	}
>> +
>> +	ws = tlob_find_rcu(next);
>> +	if (ws) {
>> +		raw_spin_lock_irqsave(&ws->entry_lock, flags);
>> +		if (!ws->canceled) {
>> +			now = ktime_get();
>> +			ws->off_cpu_us += ktime_to_us(ktime_sub(now, ws->last_ts));
>> +			ws->last_ts = now;
>> +			ws->da_state = on_cpu_tlob;
>> +			do_next = true;
>> +		}
>> +		raw_spin_unlock_irqrestore(&ws->entry_lock, flags);
>> +	}
>> +
>> +	rcu_read_unlock();
>> +
>> +	if (do_prev)
>> +		da_handle_event(prev, switch_out_tlob);
>> +	if (do_next)
>> +		da_handle_event(next, switch_in_tlob);
>> +}
>> +
>> +static void handle_sched_wakeup(void *data, struct task_struct *p)
>> +{
>> +	struct tlob_task_state *ws;
>> +	unsigned long flags;
>> +	bool found = false;
>> +
>> +	rcu_read_lock();
>> +	ws = tlob_find_rcu(p);
>> +	if (ws) {
>> +		raw_spin_lock_irqsave(&ws->entry_lock, flags);
>> +		found = !ws->canceled;
>> +		raw_spin_unlock_irqrestore(&ws->entry_lock, flags);
>> +	}
>> +	rcu_read_unlock();
>> +
>> +	if (found)
>> +		da_handle_event(p, sched_wakeup_tlob);
>> +}
>> +
>> +/* -----------------------------------------------------------------------
>> + * Core start/stop helpers (also called from rv_dev.c)
>> + * -----------------------------------------------------------------------
>> + */
>> +
>> +/*
>> + * __tlob_insert - insert @ws into the hash table and arm its deadline timer.
>> + *
>> + * Re-checks for duplicates and capacity under tlob_table_lock; the caller
>> + * may have done a lock-free pre-check before allocating @ws.  On failure @ws
>> + * is freed directly (never in table, so no call_rcu needed).
>> + */
>> +static int __tlob_insert(struct task_struct *task, struct tlob_task_state *ws)
>> +{
>> +	unsigned int h;
>> +	unsigned long flags;
>> +
>> +	raw_spin_lock_irqsave(&tlob_table_lock, flags);
>> +	if (tlob_find_rcu(task)) {
>> +		raw_spin_unlock_irqrestore(&tlob_table_lock, flags);
>> +		if (ws->notify_file)
>> +			fput(ws->notify_file);
>> +		put_task_struct(ws->task);
>> +		kmem_cache_free(tlob_state_cache, ws);
>> +		return -EEXIST;
>> +	}
>> +	if (atomic_read(&tlob_num_monitored) >= TLOB_MAX_MONITORED) {
>> +		raw_spin_unlock_irqrestore(&tlob_table_lock, flags);
>> +		if (ws->notify_file)
>> +			fput(ws->notify_file);
>> +		put_task_struct(ws->task);
>> +		kmem_cache_free(tlob_state_cache, ws);
>> +		return -ENOSPC;
>> +	}
>> +	h = tlob_hash_task(task);
>> +	hlist_add_head_rcu(&ws->hlist, &tlob_htable[h]);
>> +	atomic_inc(&tlob_num_monitored);
>> +	raw_spin_unlock_irqrestore(&tlob_table_lock, flags);
>> +
>> +	da_handle_start_run_event(task, trace_start_tlob);
>> +	tlob_arm_deadline(ws);
>> +	return 0;
>> +}
>> +
>> +/**
>> + * tlob_start_task - begin monitoring @task with latency budget @threshold_us.
>> + *
>> + * @notify_file: /dev/rv fd whose ring buffer receives a tlob_event on
>> + *               violation; caller transfers the fget() reference to tlob.c.
>> + *               Pass NULL for synchronous mode (violations only via
>> + *               TRACE_STOP return value and the tlob_budget_exceeded event).
>> + *
>> + * Returns 0, -ENODEV, -ERANGE, -EEXIST, -ENOSPC, or -ENOMEM.  On failure the
>> + * caller retains responsibility for any @notify_file reference.
>> + */
>> +int tlob_start_task(struct task_struct *task, u64 threshold_us,
>> +		    struct file *notify_file, u64 tag)
>> +{
>> +	struct tlob_task_state *ws;
>> +	unsigned long flags;
>> +
>> +	if (!tlob_state_cache)
>> +		return -ENODEV;
>> +
>> +	if (threshold_us > (u64)KTIME_MAX / NSEC_PER_USEC)
>> +		return -ERANGE;
>> +
>> +	/* Quick pre-check before allocation. */
>> +	raw_spin_lock_irqsave(&tlob_table_lock, flags);
>> +	if (tlob_find_rcu(task)) {
>> +		raw_spin_unlock_irqrestore(&tlob_table_lock, flags);
>> +		return -EEXIST;
>> +	}
>> +	if (atomic_read(&tlob_num_monitored) >= TLOB_MAX_MONITORED) {
>> +		raw_spin_unlock_irqrestore(&tlob_table_lock, flags);
>> +		return -ENOSPC;
>> +	}
>> +	raw_spin_unlock_irqrestore(&tlob_table_lock, flags);
>> +
>> +	ws = tlob_alloc(task, threshold_us, tag);
>> +	if (!ws)
>> +		return -ENOMEM;
>> +
>> +	ws->notify_file = notify_file;
>> +	return __tlob_insert(task, ws);
>> +}
>> +EXPORT_SYMBOL_GPL(tlob_start_task);
>> +
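The -ERANGE check in tlob_start_task() guards the later microsecond-to-nanosecond multiplication in tlob_arm_deadline(). A userspace sketch of the same guard, assuming KTIME_MAX equals INT64_MAX and NSEC_PER_USEC equals 1000 as in the kernel headers; `threshold_us_to_ns()` and `NSEC_PER_USEC_SKETCH` are hypothetical names:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define NSEC_PER_USEC_SKETCH 1000ULL	/* mirrors the kernel's NSEC_PER_USEC */

/*
 * Convert a microsecond budget to nanoseconds, refusing values whose
 * product would not fit in a signed 64-bit nanosecond count.
 */
static int threshold_us_to_ns(uint64_t threshold_us, int64_t *ns_out)
{
	if (threshold_us > (uint64_t)INT64_MAX / NSEC_PER_USEC_SKETCH)
		return -ERANGE;

	*ns_out = (int64_t)(threshold_us * NSEC_PER_USEC_SKETCH);
	return 0;
}
```

Doing the comparison before the multiplication matters: checking the product afterwards would be too late, because the u64 multiply would already have wrapped.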
>> +/**
>> + * tlob_stop_task - stop monitoring @task before the deadline fires.
>> + *
>> + * Sets canceled under entry_lock (inside tlob_table_lock) before calling
>> + * hrtimer_cancel(), racing safely with the timer callback.
>> + *
>> + * Returns 0 if within budget, -ESRCH if the entry is gone (deadline already
>> + * fired, or TRACE_START was never called).
>> + */
>> +int tlob_stop_task(struct task_struct *task)
>> +{
>> +	struct tlob_task_state *ws;
>> +	struct file *notify_file;
>> +	unsigned long flags;
>> +
>> +	raw_spin_lock_irqsave(&tlob_table_lock, flags);
>> +	ws = tlob_find_rcu(task);
>> +	if (!ws) {
>> +		raw_spin_unlock_irqrestore(&tlob_table_lock, flags);
>> +		return -ESRCH;
>> +	}
>> +
>> +	/* Prevent handle_sched_switch from updating accounting after removal. */
>> +	raw_spin_lock(&ws->entry_lock);
>> +	ws->canceled = 1;
>> +	raw_spin_unlock(&ws->entry_lock);
>> +
>> +	hlist_del_rcu(&ws->hlist);
>> +	atomic_dec(&tlob_num_monitored);
>> +	raw_spin_unlock_irqrestore(&tlob_table_lock, flags);
>> +
>> +	hrtimer_cancel(&ws->deadline_timer);
>> +
>> +	da_handle_event(task, trace_stop_tlob);
>> +
>> +	notify_file = ws->notify_file;
>> +	if (notify_file)
>> +		fput(notify_file);
>> +	put_task_struct(ws->task);
>> +	call_rcu(&ws->rcu, tlob_free_rcu_slab);
>> +
>> +	return 0;
>> +}
>> +EXPORT_SYMBOL_GPL(tlob_stop_task);
>> +
>> +/* Stop monitoring all tracked tasks; called on monitor disable. */
>> +static void tlob_stop_all(void)
>> +{
>> +	struct tlob_task_state *batch[TLOB_MAX_MONITORED];
>> +	struct tlob_task_state *ws;
>> +	struct hlist_node *tmp;
>> +	unsigned long flags;
>> +	int n = 0, i;
>> +
>> +	raw_spin_lock_irqsave(&tlob_table_lock, flags);
>> +	for (i = 0; i < TLOB_HTABLE_SIZE; i++) {
>> +		hlist_for_each_entry_safe(ws, tmp, &tlob_htable[i], hlist) {
>> +			raw_spin_lock(&ws->entry_lock);
>> +			ws->canceled = 1;
>> +			raw_spin_unlock(&ws->entry_lock);
>> +			hlist_del_rcu(&ws->hlist);
>> +			atomic_dec(&tlob_num_monitored);
>> +			if (n < TLOB_MAX_MONITORED)
>> +				batch[n++] = ws;
>> +		}
>> +	}
>> +	raw_spin_unlock_irqrestore(&tlob_table_lock, flags);
>> +
>> +	for (i = 0; i < n; i++) {
>> +		ws = batch[i];
>> +		hrtimer_cancel(&ws->deadline_timer);
>> +		da_handle_event(ws->task, trace_stop_tlob);
>> +		if (ws->notify_file)
>> +			fput(ws->notify_file);
>> +		put_task_struct(ws->task);
>> +		call_rcu(&ws->rcu, tlob_free_rcu_slab);
>> +	}
>> +}
>> +
>> +/* uprobe binding helpers */
>> +
>> +static int tlob_uprobe_entry_handler(struct uprobe_consumer *uc,
>> +				     struct pt_regs *regs, __u64 *data)
>> +{
>> +	struct tlob_uprobe_binding *b =
>> +		container_of(uc, struct tlob_uprobe_binding, entry_uc);
>> +
>> +	tlob_start_task(current, b->threshold_us, NULL, (u64)b->offset_start);
>> +	return 0;
>> +}
>> +
>> +static int tlob_uprobe_stop_handler(struct uprobe_consumer *uc,
>> +				    struct pt_regs *regs, __u64 *data)
>> +{
>> +	tlob_stop_task(current);
>> +	return 0;
>> +}
>> +
>> +/*
>> + * Register start + stop entry uprobes for a binding.
>> + * Both are plain entry uprobes (no uretprobe), so a wrong offset never
>> + * corrupts the call stack; the worst outcome is a missed stop (hrtimer
>> + * fires and reports a budget violation).
>> + * Called with tlob_uprobe_mutex held.
>> + */
>> +static int tlob_add_uprobe(u64 threshold_us, const char *binpath,
>> +			   loff_t offset_start, loff_t offset_stop)
>> +{
>> +	struct tlob_uprobe_binding *b, *tmp_b;
>> +	char pathbuf[TLOB_MAX_PATH];
>> +	struct inode *inode;
>> +	char *canon;
>> +	int ret;
>> +
>> +	b = kzalloc(sizeof(*b), GFP_KERNEL);
>> +	if (!b)
>> +		return -ENOMEM;
>> +
>> +	if (binpath[0] != '/') {
>> +		kfree(b);
>> +		return -EINVAL;
>> +	}
>> +
>> +	b->threshold_us = threshold_us;
>> +	b->offset_start = offset_start;
>> +	b->offset_stop  = offset_stop;
>> +
>> +	ret = kern_path(binpath, LOOKUP_FOLLOW, &b->path);
>> +	if (ret)
>> +		goto err_free;
>> +
>> +	if (!d_is_reg(b->path.dentry)) {
>> +		ret = -EINVAL;
>> +		goto err_path;
>> +	}
>> +
>> +	/* Reject duplicate start offset for the same binary. */
>> +	list_for_each_entry(tmp_b, &tlob_uprobe_list, list) {
>> +		if (tmp_b->offset_start == offset_start &&
>> +		    tmp_b->path.dentry == b->path.dentry) {
>> +			ret = -EEXIST;
>> +			goto err_path;
>> +		}
>> +	}
>> +
>> +	/* Store canonical path for read-back and removal matching. */
>> +	canon = d_path(&b->path, pathbuf, sizeof(pathbuf));
>> +	if (IS_ERR(canon)) {
>> +		ret = PTR_ERR(canon);
>> +		goto err_path;
>> +	}
>> +	strscpy(b->binpath, canon, sizeof(b->binpath));
>> +
>> +	b->entry_uc.handler = tlob_uprobe_entry_handler;
>> +	b->stop_uc.handler  = tlob_uprobe_stop_handler;
>> +
>> +	inode = d_real_inode(b->path.dentry);
>> +
>> +	b->entry_uprobe = uprobe_register(inode, offset_start, 0, &b->entry_uc);
>> +	if (IS_ERR(b->entry_uprobe)) {
>> +		ret = PTR_ERR(b->entry_uprobe);
>> +		b->entry_uprobe = NULL;
>> +		goto err_path;
>> +	}
>> +
>> +	b->stop_uprobe = uprobe_register(inode, offset_stop, 0, &b->stop_uc);
>> +	if (IS_ERR(b->stop_uprobe)) {
>> +		ret = PTR_ERR(b->stop_uprobe);
>> +		b->stop_uprobe = NULL;
>> +		goto err_entry;
>> +	}
>> +
>> +	list_add_tail(&b->list, &tlob_uprobe_list);
>> +	return 0;
>> +
>> +err_entry:
>> +	uprobe_unregister_nosync(b->entry_uprobe, &b->entry_uc);
>> +	uprobe_unregister_sync();
>> +err_path:
>> +	path_put(&b->path);
>> +err_free:
>> +	kfree(b);
>> +	return ret;
>> +}
>> +
>> +/*
>> + * Remove the uprobe binding for (offset_start, binpath).
>> + * binpath is resolved to a dentry for comparison so symlinks are handled
>> + * correctly.  Called with tlob_uprobe_mutex held.
>> + */
>> +static void tlob_remove_uprobe_by_key(loff_t offset_start, const char *binpath)
>> +{
>> +	struct tlob_uprobe_binding *b, *tmp;
>> +	struct path remove_path;
>> +
>> +	if (kern_path(binpath, LOOKUP_FOLLOW, &remove_path))
>> +		return;
>> +
>> +	list_for_each_entry_safe(b, tmp, &tlob_uprobe_list, list) {
>> +		if (b->offset_start != offset_start)
>> +			continue;
>> +		if (b->path.dentry != remove_path.dentry)
>> +			continue;
>> +		uprobe_unregister_nosync(b->entry_uprobe, &b->entry_uc);
>> +		uprobe_unregister_nosync(b->stop_uprobe,  &b->stop_uc);
>> +		list_del(&b->list);
>> +		uprobe_unregister_sync();
>> +		path_put(&b->path);
>> +		kfree(b);
>> +		break;
>> +	}
>> +
>> +	path_put(&remove_path);
>> +}
>> +
>> +/* Unregister all uprobe bindings; called from disable_tlob(). */
>> +static void tlob_remove_all_uprobes(void)
>> +{
>> +	struct tlob_uprobe_binding *b, *tmp;
>> +
>> +	mutex_lock(&tlob_uprobe_mutex);
>> +	list_for_each_entry_safe(b, tmp, &tlob_uprobe_list, list) {
>> +		uprobe_unregister_nosync(b->entry_uprobe, &b->entry_uc);
>> +		uprobe_unregister_nosync(b->stop_uprobe,  &b->stop_uc);
>> +		list_del(&b->list);
>> +		path_put(&b->path);
>> +		kfree(b);
>> +	}
>> +	mutex_unlock(&tlob_uprobe_mutex);
>> +	uprobe_unregister_sync();
>> +}
>> +
>> +/*
>> + * tracefs "monitor" file
>> + *
>> + * Read:  one "threshold_us:0xoffset_start:0xoffset_stop:binary_path\n"
>> + *        line per registered uprobe binding.
>> + * Write: "threshold_us:offset_start:offset_stop:binary_path" - add uprobe binding
>> + *        "-offset_start:binary_path"                         - remove uprobe binding
>> + */
>> +
>> +static ssize_t tlob_monitor_read(struct file *file,
>> +				 char __user *ubuf,
>> +				 size_t count, loff_t *ppos)
>> +{
>> +	/* pid(10) + threshold(20) + 2 offsets(2*18) + path(256) + delimiters */
>> +	const int line_sz = TLOB_MAX_PATH + 72;
>> +	struct tlob_uprobe_binding *b;
>> +	char *buf, *p;
>> +	int n = 0, buf_sz, pos = 0;
>> +	ssize_t ret;
>> +
>> +	mutex_lock(&tlob_uprobe_mutex);
>> +	list_for_each_entry(b, &tlob_uprobe_list, list)
>> +		n++;
>> +	mutex_unlock(&tlob_uprobe_mutex);
>> +
>> +	buf_sz = (n ? n : 1) * line_sz + 1;
>> +	buf = kmalloc(buf_sz, GFP_KERNEL);
>> +	if (!buf)
>> +		return -ENOMEM;
>> +
>> +	mutex_lock(&tlob_uprobe_mutex);
>> +	list_for_each_entry(b, &tlob_uprobe_list, list) {
>> +		p = b->binpath;
>> +		pos += scnprintf(buf + pos, buf_sz - pos,
>> +				 "%llu:0x%llx:0x%llx:%s\n",
>> +				 b->threshold_us,
>> +				 (unsigned long long)b->offset_start,
>> +				 (unsigned long long)b->offset_stop,
>> +				 p);
>> +	}
>> +	mutex_unlock(&tlob_uprobe_mutex);
>> +
>> +	ret = simple_read_from_buffer(ubuf, count, ppos, buf, pos);
>> +	kfree(buf);
>> +	return ret;
>> +}
>> +
>> +/*
>> + * Parse "threshold_us:offset_start:offset_stop:binary_path".
>> + * binary_path comes last so it may freely contain ':'.
>> + * Returns 0 on success.
>> + */
>> +VISIBLE_IF_KUNIT int tlob_parse_uprobe_line(char *buf, u64 *thr_out,
>> +					    char **path_out,
>> +					    loff_t *start_out, loff_t *stop_out)
>> +{
>> +	unsigned long long thr;
>> +	long long start, stop;
>> +	int n = 0;
>> +
>> +	/*
>> +	 * %llu : decimal-only (microseconds)
>> +	 * %lli : auto-base, accepts 0x-prefixed hex for offsets
>> +	 * %n   : records the byte offset of the first path character
>> +	 */
>> +	if (sscanf(buf, "%llu:%lli:%lli:%n", &thr, &start, &stop, &n) != 3)
>> +		return -EINVAL;
>> +	if (thr == 0 || n == 0 || buf[n] == '\0')
>> +		return -EINVAL;
>> +	if (start < 0 || stop < 0)
>> +		return -EINVAL;
>> +
>> +	*thr_out   = thr;
>> +	*start_out = start;
>> +	*stop_out  = stop;
>> +	*path_out  = buf + n;
>> +	return 0;
>> +}
>> +
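The `%n` trick in tlob_parse_uprobe_line() can be tried in userspace. The sketch below reuses the same format string with the C library's sscanf(), on the assumption that it behaves like the kernel's for these conversions; `parse_binding()` is a hypothetical name. Because `%n` sits after the third ':', it is stored only when that delimiter is present, so `n == 0` signals a missing path field.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Parse "threshold_us:offset_start:offset_stop:binary_path" in place.
 * On success *path points into buf at the first path character.
 */
static int parse_binding(char *buf, unsigned long long *thr,
			 long long *start, long long *stop, char **path)
{
	int n = 0;

	/* %llu: decimal threshold; %lli: auto-base offsets; %n: path index. */
	if (sscanf(buf, "%llu:%lli:%lli:%n", thr, start, stop, &n) != 3)
		return -EINVAL;
	/* n stays 0 if the third ':' was never matched. */
	if (*thr == 0 || n == 0 || buf[n] == '\0')
		return -EINVAL;
	if (*start < 0 || *stop < 0)
		return -EINVAL;

	*path = buf + n;
	return 0;
}
```

Since the path is simply the remainder of the buffer, a path containing ':' is accepted without any escaping.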
>> +static ssize_t tlob_monitor_write(struct file *file,
>> +				  const char __user *ubuf,
>> +				  size_t count, loff_t *ppos)
>> +{
>> +	char buf[TLOB_MAX_PATH + 64];
>> +	loff_t offset_start, offset_stop;
>> +	u64 threshold_us;
>> +	char *binpath;
>> +	int ret;
>> +
>> +	if (count >= sizeof(buf))
>> +		return -EINVAL;
>> +	if (copy_from_user(buf, ubuf, count))
>> +		return -EFAULT;
>> +	buf[count] = '\0';
>> +
>> +	if (count > 0 && buf[count - 1] == '\n')
>> +		buf[count - 1] = '\0';
>> +
>> +	/* Remove request: "-offset_start:binary_path" */
>> +	if (buf[0] == '-') {
>> +		long long off;
>> +		int n = 0;
>> +
>> +		if (sscanf(buf + 1, "%lli:%n", &off, &n) != 1 || n == 0)
>> +			return -EINVAL;
>> +		binpath = buf + 1 + n;
>> +		if (binpath[0] != '/')
>> +			return -EINVAL;
>> +
>> +		mutex_lock(&tlob_uprobe_mutex);
>> +		tlob_remove_uprobe_by_key((loff_t)off, binpath);
>> +		mutex_unlock(&tlob_uprobe_mutex);
>> +
>> +		return (ssize_t)count;
>> +	}
>> +
>> +	/*
>> +	 * Uprobe binding: "threshold_us:offset_start:offset_stop:binary_path"
>> +	 * binpath points into buf at the start of the path field.
>> +	 */
>> +	ret = tlob_parse_uprobe_line(buf, &threshold_us,
>> +				     &binpath, &offset_start, &offset_stop);
>> +	if (ret)
>> +		return ret;
>> +
>> +	mutex_lock(&tlob_uprobe_mutex);
>> +	ret = tlob_add_uprobe(threshold_us, binpath, offset_start, offset_stop);
>> +	mutex_unlock(&tlob_uprobe_mutex);
>> +	return ret ? ret : (ssize_t)count;
>> +}
>> +
>> +static const struct file_operations tlob_monitor_fops = {
>> +	.open	= simple_open,
>> +	.read	= tlob_monitor_read,
>> +	.write	= tlob_monitor_write,
>> +	.llseek	= noop_llseek,
>> +};
>> +
>> +/*
>> + * __tlob_init_monitor / __tlob_destroy_monitor - called with rv_interface_lock
>> + * held (required by da_monitor_init/destroy via rv_get/put_task_monitor_slot).
>> + */
>> +static int __tlob_init_monitor(void)
>> +{
>> +	int i, retval;
>> +
>> +	tlob_state_cache = kmem_cache_create("tlob_task_state",
>> +					     sizeof(struct tlob_task_state),
>> +					     0, 0, NULL);
>> +	if (!tlob_state_cache)
>> +		return -ENOMEM;
>> +
>> +	for (i = 0; i < TLOB_HTABLE_SIZE; i++)
>> +		INIT_HLIST_HEAD(&tlob_htable[i]);
>> +	atomic_set(&tlob_num_monitored, 0);
>> +
>> +	retval = da_monitor_init();
>> +	if (retval) {
>> +		kmem_cache_destroy(tlob_state_cache);
>> +		tlob_state_cache = NULL;
>> +		return retval;
>> +	}
>> +
>> +	rv_this.enabled = 1;
>> +	return 0;
>> +}
>> +
>> +static void __tlob_destroy_monitor(void)
>> +{
>> +	rv_this.enabled = 0;
>> +	tlob_stop_all();
>> +	tlob_remove_all_uprobes();
>> +	/*
>> +	 * Drain pending call_rcu() callbacks from tlob_stop_all() before
>> +	 * destroying the kmem_cache.
>> +	 */
>> +	synchronize_rcu();
>> +	da_monitor_destroy();
>> +	kmem_cache_destroy(tlob_state_cache);
>> +	tlob_state_cache = NULL;
>> +}
>> +
>> +/*
>> + * tlob_init_monitor / tlob_destroy_monitor - KUnit wrappers that acquire
>> + * rv_interface_lock, satisfying the lockdep_assert_held() inside
>> + * rv_get/put_task_monitor_slot().
>> + */
>> +VISIBLE_IF_KUNIT int tlob_init_monitor(void)
>> +{
>> +	int ret;
>> +
>> +	mutex_lock(&rv_interface_lock);
>> +	ret = __tlob_init_monitor();
>> +	mutex_unlock(&rv_interface_lock);
>> +	return ret;
>> +}
>> +EXPORT_SYMBOL_IF_KUNIT(tlob_init_monitor);
>> +
>> +VISIBLE_IF_KUNIT void tlob_destroy_monitor(void)
>> +{
>> +	mutex_lock(&rv_interface_lock);
>> +	__tlob_destroy_monitor();
>> +	mutex_unlock(&rv_interface_lock);
>> +}
>> +EXPORT_SYMBOL_IF_KUNIT(tlob_destroy_monitor);
>> +
>> +VISIBLE_IF_KUNIT int tlob_enable_hooks(void)
>> +{
>> +	rv_attach_trace_probe("tlob", sched_switch, handle_sched_switch);
>> +	rv_attach_trace_probe("tlob", sched_wakeup, handle_sched_wakeup);
>> +	return 0;
>> +}
>> +EXPORT_SYMBOL_IF_KUNIT(tlob_enable_hooks);
>> +
>> +VISIBLE_IF_KUNIT void tlob_disable_hooks(void)
>> +{
>> +	rv_detach_trace_probe("tlob", sched_switch, handle_sched_switch);
>> +	rv_detach_trace_probe("tlob", sched_wakeup, handle_sched_wakeup);
>> +}
>> +EXPORT_SYMBOL_IF_KUNIT(tlob_disable_hooks);
>> +
>> +/*
>> + * enable_tlob / disable_tlob - called by rv_enable/disable_monitor() which
>> + * already holds rv_interface_lock; call the __ variants directly.
>> + */
>> +static int enable_tlob(void)
>> +{
>> +	int retval;
>> +
>> +	retval = __tlob_init_monitor();
>> +	if (retval)
>> +		return retval;
>> +
>> +	return tlob_enable_hooks();
>> +}
>> +
>> +static void disable_tlob(void)
>> +{
>> +	tlob_disable_hooks();
>> +	__tlob_destroy_monitor();
>> +}
>> +
>> +static struct rv_monitor rv_this = {
>> +	.name		= "tlob",
>> +	.description	= "Per-task latency-over-budget monitor.",
>> +	.enable		= enable_tlob,
>> +	.disable	= disable_tlob,
>> +	.reset		= da_monitor_reset_all,
>> +	.enabled	= 0,
>> +};
>> +
>> +static int __init register_tlob(void)
>> +{
>> +	int ret;
>> +
>> +	ret = rv_register_monitor(&rv_this, NULL);
>> +	if (ret)
>> +		return ret;
>> +
>> +	if (rv_this.root_d) {
>> +		tracefs_create_file("monitor", 0644, rv_this.root_d, NULL,
>> +				    &tlob_monitor_fops);
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +static void __exit unregister_tlob(void)
>> +{
>> +	rv_unregister_monitor(&rv_this);
>> +}
>> +
>> +module_init(register_tlob);
>> +module_exit(unregister_tlob);
>> +
>> +MODULE_LICENSE("GPL");
>> +MODULE_AUTHOR("Wen Yang <wen.yang@linux.dev>");
>> +MODULE_DESCRIPTION("tlob: task latency over budget per-task monitor.");
>> diff --git a/kernel/trace/rv/monitors/tlob/tlob.h b/kernel/trace/rv/monitors/tlob/tlob.h
>> new file mode 100644
>> index 000000000..3438a6175
>> --- /dev/null
>> +++ b/kernel/trace/rv/monitors/tlob/tlob.h
>> @@ -0,0 +1,145 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +#ifndef _RV_TLOB_H
>> +#define _RV_TLOB_H
>> +
>> +/*
>> + * C representation of the tlob automaton, generated from tlob.dot via rvgen
>> + * and extended with tlob_start_task()/tlob_stop_task() declarations.
>> + * For the format description see Documentation/trace/rv/deterministic_automata.rst
>> + */
>> +
>> +#include <linux/rv.h>
>> +#include <uapi/linux/rv.h>
>> +
>> +#define MONITOR_NAME tlob
>> +
>> +enum states_tlob {
>> +	unmonitored_tlob,
>> +	on_cpu_tlob,
>> +	off_cpu_tlob,
>> +	state_max_tlob,
>> +};
>> +
>> +#define INVALID_STATE state_max_tlob
>> +
>> +enum events_tlob {
>> +	trace_start_tlob,
>> +	switch_in_tlob,
>> +	switch_out_tlob,
>> +	sched_wakeup_tlob,
>> +	trace_stop_tlob,
>> +	budget_expired_tlob,
>> +	event_max_tlob,
>> +};
>> +
>> +struct automaton_tlob {
>> +	char *state_names[state_max_tlob];
>> +	char *event_names[event_max_tlob];
>> +	unsigned char function[state_max_tlob][event_max_tlob];
>> +	unsigned char initial_state;
>> +	bool final_states[state_max_tlob];
>> +};
>> +
>> +static const struct automaton_tlob automaton_tlob = {
>> +	.state_names = {
>> +		"unmonitored",
>> +		"on_cpu",
>> +		"off_cpu",
>> +	},
>> +	.event_names = {
>> +		"trace_start",
>> +		"switch_in",
>> +		"switch_out",
>> +		"sched_wakeup",
>> +		"trace_stop",
>> +		"budget_expired",
>> +	},
>> +	.function = {
>> +		/* unmonitored */
>> +		{
>> +			on_cpu_tlob,		/* trace_start    */
>> +			unmonitored_tlob,	/* switch_in      */
>> +			unmonitored_tlob,	/* switch_out     */
>> +			unmonitored_tlob,	/* sched_wakeup   */
>> +			INVALID_STATE,		/* trace_stop     */
>> +			INVALID_STATE,		/* budget_expired */
>> +		},
>> +		/* on_cpu */
>> +		{
>> +			INVALID_STATE,		/* trace_start    */
>> +			INVALID_STATE,		/* switch_in      */
>> +			off_cpu_tlob,		/* switch_out     */
>> +			on_cpu_tlob,		/* sched_wakeup   */
>> +			unmonitored_tlob,	/* trace_stop     */
>> +			unmonitored_tlob,	/* budget_expired */
>> +		},
>> +		/* off_cpu */
>> +		{
>> +			INVALID_STATE,		/* trace_start    */
>> +			on_cpu_tlob,		/* switch_in      */
>> +			off_cpu_tlob,		/* switch_out     */
>> +			off_cpu_tlob,		/* sched_wakeup   */
>> +			unmonitored_tlob,	/* trace_stop     */
>> +			unmonitored_tlob,	/* budget_expired */
>> +		},
>> +	},
>> +	/*
>> +	 * final_states: unmonitored is the sole accepting state.
>> +	 * Violations are recorded via ntf_push and tlob_budget_exceeded.
>> +	 */
>> +	.initial_state = unmonitored_tlob,
>> +	.final_states = { 1, 0, 0 },
>> +};
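
To make the transition table easier to check by hand, here is a small userspace sketch that replays it. The table contents are copied from the .function initializer above; the shortened enum names, next_state[] and step() are illustrative and are not the da_monitor helpers the kernel actually uses.

```c
#include <assert.h>

enum state { UNMONITORED, ON_CPU, OFF_CPU, STATE_MAX };
enum event { TRACE_START, SWITCH_IN, SWITCH_OUT, SCHED_WAKEUP,
	     TRACE_STOP, BUDGET_EXPIRED, EVENT_MAX };

#define INVALID STATE_MAX

/* Same rows and columns as automaton_tlob.function in the patch. */
static const unsigned char next_state[STATE_MAX][EVENT_MAX] = {
	[UNMONITORED] = { ON_CPU, UNMONITORED, UNMONITORED, UNMONITORED,
			  INVALID, INVALID },
	[ON_CPU]      = { INVALID, INVALID, OFF_CPU, ON_CPU,
			  UNMONITORED, UNMONITORED },
	[OFF_CPU]     = { INVALID, ON_CPU, OFF_CPU, OFF_CPU,
			  UNMONITORED, UNMONITORED },
};

/* Returns 1 and updates *cur on a valid transition, 0 otherwise. */
static int step(enum state *cur, enum event ev)
{
	unsigned char next = next_state[*cur][ev];

	if (next == INVALID)
		return 0;
	*cur = (enum state)next;
	return 1;
}
```

Walking trace_start, switch_out, switch_in, budget_expired from the initial state ends back in unmonitored, while trace_stop without a prior trace_start is rejected as an invalid transition.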
>> +
>> +/* Exported for use by the RV ioctl layer (rv_dev.c) */
>> +int tlob_start_task(struct task_struct *task, u64 threshold_us,
>> +		    struct file *notify_file, u64 tag);
>> +int tlob_stop_task(struct task_struct *task);
>> +
>> +/* Maximum number of concurrently monitored tasks (also used by KUnit). */
>> +#define TLOB_MAX_MONITORED	64U
>> +
>> +/*
>> + * Ring buffer constants (also published in UAPI for mmap size calculation).
>> + */
>> +#define TLOB_RING_DEFAULT_CAP	64U	/* records allocated at open()  */
>> +#define TLOB_RING_MIN_CAP	 8U	/* minimum accepted by mmap()   */
>> +#define TLOB_RING_MAX_CAP	4096U	/* maximum accepted by mmap()   */
>> +
>> +/**
>> + * struct tlob_ring - per-fd mmap-capable violation ring buffer.
>> + *
>> + * Allocated as a contiguous page range at rv_open() time:
>> + *   page 0:    struct tlob_mmap_page  (shared with userspace)
>> + *   pages 1-N: struct tlob_event[capacity]
>> + */
>> +struct tlob_ring {
>> +	struct tlob_mmap_page	*page;
>> +	struct tlob_event	*data;
>> +	u32			 mask;
>> +	spinlock_t		 lock;
>> +	unsigned long		 base;
>> +	unsigned int		 order;
>> +};
>> +
>> +/**
>> + * struct rv_file_priv - per-fd private data for /dev/rv.
>> + */
>> +struct rv_file_priv {
>> +	struct tlob_ring	ring;
>> +	wait_queue_head_t	waitq;
>> +};
>> +
>> +#if IS_ENABLED(CONFIG_KUNIT)
>> +int tlob_init_monitor(void);
>> +void tlob_destroy_monitor(void);
>> +int tlob_enable_hooks(void);
>> +void tlob_disable_hooks(void);
>> +void tlob_event_push_kunit(struct rv_file_priv *priv,
>> +			  const struct tlob_event *info);
>> +int tlob_parse_uprobe_line(char *buf, u64 *thr_out,
>> +			   char **path_out,
>> +			   loff_t *start_out, loff_t *stop_out);
>> +#endif /* CONFIG_KUNIT */
>> +
>> +#endif /* _RV_TLOB_H */
>> diff --git a/kernel/trace/rv/monitors/tlob/tlob_trace.h b/kernel/trace/rv/monitors/tlob/tlob_trace.h
>> new file mode 100644
>> index 000000000..b08d67776
>> --- /dev/null
>> +++ b/kernel/trace/rv/monitors/tlob/tlob_trace.h
>> @@ -0,0 +1,42 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +
>> +/*
>> + * Snippet to be included in rv_trace.h
>> + */
>> +
>> +#ifdef CONFIG_RV_MON_TLOB
>> +/*
>> + * tlob uses the generic event_da_monitor_id and error_da_monitor_id event
>> + * classes so that both event classes are instantiated.  This avoids a
>> + * -Werror=unused-variable warning that the compiler emits when a
>> + * DECLARE_EVENT_CLASS has no corresponding DEFINE_EVENT instance.
>> + *
>> + * The event_tlob tracepoint is defined here but the call-site in
>> + * da_handle_event() is overridden with a no-op macro below so that no
>> + * trace record is emitted on every scheduler context switch.  Budget
>> + * violations are reported via the dedicated tlob_budget_exceeded event.
>> + *
>> + * error_tlob IS kept active so that invalid DA transitions (programming
>> + * errors) are still visible in the ftrace ring buffer for debugging.
>> + */
>> +DEFINE_EVENT(event_da_monitor_id, event_tlob,
>> +	     TP_PROTO(int id, char *state, char *event, char *next_state,
>> +		      bool final_state),
>> +	     TP_ARGS(id, state, event, next_state, final_state));
>> +
>> +DEFINE_EVENT(error_da_monitor_id, error_tlob,
>> +	     TP_PROTO(int id, char *state, char *event),
>> +	     TP_ARGS(id, state, event));
>> +
>> +/*
>> + * Override the trace_event_tlob() call-site with a no-op after the
>> + * DEFINE_EVENT above has satisfied the event class instantiation
>> + * requirement.  The tracepoint symbol itself exists (and can be enabled
>> + * via tracefs) but the automatic call from da_handle_event() is silenced
>> + * to avoid per-context-switch ftrace noise during normal operation.
>> + */
>> +#undef trace_event_tlob
>> +#define trace_event_tlob(id, state, event, next_state, final_state)	\
>> +	do { (void)(id); (void)(state); (void)(event);			\
>> +	     (void)(next_state); (void)(final_state); } while (0)
>> +#endif /* CONFIG_RV_MON_TLOB */
>> diff --git a/kernel/trace/rv/rv.c b/kernel/trace/rv/rv.c
>> index ee4e68102..e754e76d5 100644
>> --- a/kernel/trace/rv/rv.c
>> +++ b/kernel/trace/rv/rv.c
>> @@ -148,6 +148,10 @@
>>   #include <rv_trace.h>
>>   #endif
>>   
>> +#ifdef CONFIG_RV_MON_TLOB
>> +EXPORT_TRACEPOINT_SYMBOL_GPL(tlob_budget_exceeded);
>> +#endif
>> +
>>   #include "rv.h"
>>   
>>   DEFINE_MUTEX(rv_interface_lock);
>> diff --git a/kernel/trace/rv/rv_dev.c b/kernel/trace/rv/rv_dev.c
>> new file mode 100644
>> index 000000000..a052f3203
>> --- /dev/null
>> +++ b/kernel/trace/rv/rv_dev.c
>> @@ -0,0 +1,602 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * rv_dev.c - /dev/rv misc device for RV monitor self-instrumentation
>> + *
>> + * A single misc device (MISC_DYNAMIC_MINOR) serves all RV monitors.
>> + * ioctl numbers encode the monitor identity:
>> + *
>> + *   0x01 - 0x1F  tlob (task latency over budget)
>> + *   0x20 - 0x3F  reserved
>> + *
>> + * The tlob monitor exports tlob_start_task() / tlob_stop_task(), which are
>> + * called from here.  The calling task is identified by current.
>> + *
>> + * Magic: RV_IOC_MAGIC (0xB9), defined in include/uapi/linux/rv.h
>> + *
>> + * Per-fd private data (rv_file_priv)
>> + * ------------------------------------
>> + * Every open() of /dev/rv allocates an rv_file_priv (defined in tlob.h).
>> + * When TLOB_IOCTL_TRACE_START is called with args.notify_fd >= 0, violations
>> + * are pushed as tlob_event records into that fd's per-fd ring buffer (tlob_ring)
>> + * and its poll/epoll waitqueue is woken.
>> + *
>> + * Consumers drain records with read() on the notify_fd; read() blocks until
>> + * at least one record is available (unless O_NONBLOCK is set).
>> + *
>> + * Per-thread "started" tracking (tlob_task_handle)
>> + * -------------------------------------------------
>> + * tlob_stop_task() returns -ESRCH in two distinct situations:
>> + *
>> + *   (a) The deadline timer already fired and removed the tlob hash-table
>> + *       entry before TRACE_STOP arrived -> budget was exceeded
>> + *       -> TRACE_STOP reports -EOVERFLOW
>> + *
>> + *   (b) TRACE_START was never called for this thread -> programming error
>> + *       -> TRACE_STOP reports -ESRCH
>> + *
>> + * To distinguish them, rv_dev.c maintains a lightweight hash table
>> + * (tlob_handles) that records a tlob_task_handle for every task_struct *
>> + * for which a successful TLOB_IOCTL_TRACE_START has been
>> + * issued but the corresponding TLOB_IOCTL_TRACE_STOP has not yet arrived.
>> + *
>> + * tlob_task_handle is a thin "session ticket": it carries only the
>> + * task pointer and the owning struct file.  The heavy per-task state
>> + * (hrtimer, DA state, threshold) lives in tlob_task_state inside tlob.c.
>> + *
>> + * The table is keyed on task_struct * (same key as tlob.c), protected
>> + * by tlob_handles_lock (spinlock, irq-safe).  No get_task_struct()
>> + * refcount is needed here because tlob.c already holds a reference for
>> + * each live entry.
>> + *
>> + * Multiple threads may share the same fd.  Each thread has its own
>> + * tlob_task_handle in the table, so concurrent TRACE_START / TRACE_STOP
>> + * calls from different threads do not interfere.
>> + *
>> + * The fd release path (rv_release) calls tlob_stop_task() for every
>> + * handle in tlob_handles that belongs to the closing fd, ensuring cleanup
>> + * even if the user forgets to call TRACE_STOP.
>> + */
>> +#include <linux/file.h>
>> +#include <linux/fs.h>
>> +#include <linux/gfp.h>
>> +#include <linux/hash.h>
>> +#include <linux/mm.h>
>> +#include <linux/miscdevice.h>
>> +#include <linux/module.h>
>> +#include <linux/poll.h>
>> +#include <linux/sched.h>
>> +#include <linux/slab.h>
>> +#include <linux/spinlock.h>
>> +#include <linux/uaccess.h>
>> +#include <uapi/linux/rv.h>
>> +
>> +#ifdef CONFIG_RV_MON_TLOB
>> +#include "monitors/tlob/tlob.h"
>> +#endif
>> +
>> +/* -----------------------------------------------------------------------
>> + * tlob_task_handle - per-thread session ticket for the ioctl interface
>> + *
>> + * One handle is allocated by TLOB_IOCTL_TRACE_START and freed by
>> + * TLOB_IOCTL_TRACE_STOP (or by rv_release if the fd is closed).
>> + *
>> + * @hlist:  Hash-table linkage in tlob_handles (keyed on task pointer).
>> + * @task:   The monitored thread.  Plain pointer; no refcount held here
>> + *          because tlob.c holds one for the lifetime of the monitoring
>> + *          window, which encompasses the lifetime of this handle.
>> + * @file:   The /dev/rv file descriptor that issued TRACE_START.
>> + *          Used by rv_release() to sweep orphaned handles on close().
>> + * -----------------------------------------------------------------------
>> + */
>> +#define TLOB_HANDLES_BITS	5
>> +#define TLOB_HANDLES_SIZE	(1 << TLOB_HANDLES_BITS)
>> +
>> +struct tlob_task_handle {
>> +	struct hlist_node	hlist;
>> +	struct task_struct	*task;
>> +	struct file		*file;
>> +};
>> +
>> +static struct hlist_head tlob_handles[TLOB_HANDLES_SIZE];
>> +static DEFINE_SPINLOCK(tlob_handles_lock);
>> +
>> +static unsigned int tlob_handle_hash(const struct task_struct *task)
>> +{
>> +	return hash_ptr((void *)task, TLOB_HANDLES_BITS);
>> +}
>> +
>> +/* Must be called with tlob_handles_lock held. */
>> +static struct tlob_task_handle *
>> +tlob_handle_find_locked(struct task_struct *task)
>> +{
>> +	struct tlob_task_handle *h;
>> +	unsigned int slot = tlob_handle_hash(task);
>> +
>> +	hlist_for_each_entry(h, &tlob_handles[slot], hlist) {
>> +		if (h->task == task)
>> +			return h;
>> +	}
>> +	return NULL;
>> +}
>> +
>> +/*
>> + * tlob_handle_alloc - record that @task has an active monitoring session
>> + *                     opened via @file.
>> + *
>> + * Returns 0 on success, -EEXIST if @task already has a handle (double
>> + * TRACE_START without TRACE_STOP), -ENOMEM on allocation failure.
>> + */
>> +static int tlob_handle_alloc(struct task_struct *task, struct file *file)
>> +{
>> +	struct tlob_task_handle *h;
>> +	unsigned long flags;
>> +	unsigned int slot;
>> +
>> +	h = kmalloc(sizeof(*h), GFP_KERNEL);
>> +	if (!h)
>> +		return -ENOMEM;
>> +	h->task = task;
>> +	h->file = file;
>> +
>> +	spin_lock_irqsave(&tlob_handles_lock, flags);
>> +	if (tlob_handle_find_locked(task)) {
>> +		spin_unlock_irqrestore(&tlob_handles_lock, flags);
>> +		kfree(h);
>> +		return -EEXIST;
>> +	}
>> +	slot = tlob_handle_hash(task);
>> +	hlist_add_head(&h->hlist, &tlob_handles[slot]);
>> +	spin_unlock_irqrestore(&tlob_handles_lock, flags);
>> +	return 0;
>> +}
>> +
>> +/*
>> + * tlob_handle_free - remove the handle for @task and free it.
>> + *
>> + * Returns 1 if a handle existed (TRACE_START was called), 0 if not found
>> + * (TRACE_START was never called for this thread).
>> + */
>> +static int tlob_handle_free(struct task_struct *task)
>> +{
>> +	struct tlob_task_handle *h;
>> +	unsigned long flags;
>> +
>> +	spin_lock_irqsave(&tlob_handles_lock, flags);
>> +	h = tlob_handle_find_locked(task);
>> +	if (h) {
>> +		hlist_del_init(&h->hlist);
>> +		spin_unlock_irqrestore(&tlob_handles_lock, flags);
>> +		kfree(h);
>> +		return 1;
>> +	}
>> +	spin_unlock_irqrestore(&tlob_handles_lock, flags);
>> +	return 0;
>> +}
>> +
>> +/*
>> + * tlob_handle_sweep_file - release all handles owned by @file.
>> + *
>> + * Called from rv_release() when the fd is closed without TRACE_STOP.
>> + * Calls tlob_stop_task() for each orphaned handle to drain the tlob
>> + * monitoring entries and prevent resource leaks in tlob.c.
>> + *
>> + * Handles are collected under the lock (short critical section), then
>> + * processed outside it (tlob_stop_task() may sleep/spin internally).
>> + */
>> +#ifdef CONFIG_RV_MON_TLOB
>> +static void tlob_handle_sweep_file(struct file *file)
>> +{
>> +	struct tlob_task_handle *batch[TLOB_HANDLES_SIZE];
>> +	struct tlob_task_handle *h;
>> +	struct hlist_node *tmp;
>> +	unsigned long flags;
>> +	int i, n = 0;
>> +
>> +	spin_lock_irqsave(&tlob_handles_lock, flags);
>> +	for (i = 0; i < TLOB_HANDLES_SIZE; i++) {
>> +		hlist_for_each_entry_safe(h, tmp, &tlob_handles[i], hlist) {
>> +			if (h->file == file) {
>> +				hlist_del_init(&h->hlist);
>> +				batch[n++] = h;
>> +			}
>> +		}
>> +	}
>> +	spin_unlock_irqrestore(&tlob_handles_lock, flags);
>> +
>> +	for (i = 0; i < n; i++) {
>> +		/*
>> +		 * Ignore -ESRCH: the deadline timer may have already fired
>> +		 * and cleaned up the tlob entry.
>> +		 */
>> +		tlob_stop_task(batch[i]->task);
>> +		kfree(batch[i]);
>> +	}
>> +}
>> +#else
>> +static inline void tlob_handle_sweep_file(struct file *file) {}
>> +#endif /* CONFIG_RV_MON_TLOB */
>> +
>> +/* -----------------------------------------------------------------------
>> + * Ring buffer lifecycle
>> + * -----------------------------------------------------------------------
>> + */
>> +
>> +/*
>> + * tlob_ring_alloc - allocate a ring of @cap records (must be a power of 2).
>> + *
>> + * Allocates a physically contiguous block of pages:
>> + *   page 0     : struct tlob_mmap_page  (control page, shared with userspace)
>> + *   pages 1..N : struct tlob_event[cap] (data pages)
>> + *
>> + * Each page is marked reserved so it can be mapped to userspace via mmap().
>> + */
>> +static int tlob_ring_alloc(struct tlob_ring *ring, u32 cap)
>> +{
>> +	unsigned int total = PAGE_SIZE + cap * sizeof(struct tlob_event);
>> +	unsigned int order = get_order(total);
>> +	unsigned long base;
>> +	unsigned int i;
>> +
>> +	base = __get_free_pages(GFP_KERNEL | __GFP_ZERO, order);
>> +	if (!base)
>> +		return -ENOMEM;
>> +
>> +	for (i = 0; i < (1u << order); i++)
>> +		SetPageReserved(virt_to_page((void *)(base + i * PAGE_SIZE)));
>> +
>> +	ring->base  = base;
>> +	ring->order = order;
>> +	ring->page  = (struct tlob_mmap_page *)base;
>> +	ring->data  = (struct tlob_event *)(base + PAGE_SIZE);
>> +	ring->mask  = cap - 1;
>> +	spin_lock_init(&ring->lock);
>> +
>> +	ring->page->capacity    = cap;
>> +	ring->page->version     = 1;
>> +	ring->page->data_offset = PAGE_SIZE;
>> +	ring->page->record_size = sizeof(struct tlob_event);
>> +	return 0;
>> +}
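
A small userspace sketch of the sizing arithmetic in tlob_ring_alloc() and rv_mmap(), to show how the allocation order relates to the length userspace has to map. sizeof(struct tlob_event) is not visible in this part of the patch, so the record size is a parameter here; order_for(), ring_bytes() and mmap_len() are illustrative names.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest order with (page_size << order) >= size, like get_order(). */
static unsigned int order_for(size_t size, size_t page_size)
{
	unsigned int order = 0;

	while ((page_size << order) < size)
		order++;
	return order;
}

/* One control page followed by cap fixed-size records. */
static size_t ring_bytes(size_t cap, size_t rec_size, size_t page_size)
{
	return page_size + cap * rec_size;
}

/* Length rv_mmap() accepts: ring_bytes() rounded up to whole pages. */
static size_t mmap_len(size_t cap, size_t rec_size, size_t page_size)
{
	size_t total = ring_bytes(cap, rec_size, page_size);

	return (total + page_size - 1) / page_size * page_size;
}
```

Assuming 4 KiB pages and a 64-byte record, the default capacity of 64 gives exactly two pages (order 1). The maximum capacity of 4096 needs 65 pages, so the allocation is rounded up to order 7 (128 pages) although only 65 of them are mapped.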
>> +
>> +static void tlob_ring_free(struct tlob_ring *ring)
>> +{
>> +	unsigned int i;
>> +
>> +	if (!ring->base)
>> +		return;
>> +
>> +	for (i = 0; i < (1u << ring->order); i++)
>> +		ClearPageReserved(virt_to_page((void *)(ring->base + i * PAGE_SIZE)));
>> +
>> +	free_pages(ring->base, ring->order);
>> +	ring->base = 0;
>> +	ring->page = NULL;
>> +	ring->data = NULL;
>> +}
>> +
>> +/* -----------------------------------------------------------------------
>> + * File operations
>> + * -----------------------------------------------------------------------
>> + */
>> +
>> +static int rv_open(struct inode *inode, struct file *file)
>> +{
>> +	struct rv_file_priv *priv;
>> +	int ret;
>> +
>> +	priv = kzalloc(sizeof(*priv), GFP_KERNEL);
>> +	if (!priv)
>> +		return -ENOMEM;
>> +
>> +	ret = tlob_ring_alloc(&priv->ring, TLOB_RING_DEFAULT_CAP);
>> +	if (ret) {
>> +		kfree(priv);
>> +		return ret;
>> +	}
>> +
>> +	init_waitqueue_head(&priv->waitq);
>> +	file->private_data = priv;
>> +	return 0;
>> +}
>> +
>> +static int rv_release(struct inode *inode, struct file *file)
>> +{
>> +	struct rv_file_priv *priv = file->private_data;
>> +
>> +	tlob_handle_sweep_file(file);
>> +	tlob_ring_free(&priv->ring);
>> +	kfree(priv);
>> +	file->private_data = NULL;
>> +	return 0;
>> +}
>> +
>> +static __poll_t rv_poll(struct file *file, poll_table *wait)
>> +{
>> +	struct rv_file_priv *priv = file->private_data;
>> +
>> +	if (!priv)
>> +		return EPOLLERR;
>> +
>> +	poll_wait(file, &priv->waitq, wait);
>> +
>> +	/*
>> +	 * Pairs with smp_store_release(&ring->page->data_head, ...) in
>> +	 * tlob_event_push().  No lock needed: head is written by the kernel
>> +	 * producer and read here; tail is written by the consumer and we only
>> +	 * need an approximate check for the poll fast path.
>> +	 */
>> +	if (smp_load_acquire(&priv->ring.page->data_head) !=
>> +	    READ_ONCE(priv->ring.page->data_tail))
>> +		return EPOLLIN | EPOLLRDNORM;
>> +
>> +	return 0;
>> +}
>> +
>> +/*
>> + * rv_read - consume tlob_event violation records from this fd's ring buffer.
>> + *
>> + * Each read() returns a whole number of struct tlob_event records.  @count must
>> + * be at least sizeof(struct tlob_event); smaller sizes are rejected with
>> + * -EINVAL, and any trailing space too small for a full record is left unused.
>> + *
>> + * Blocking behaviour follows O_NONBLOCK on the fd:
>> + *   O_NONBLOCK clear: blocks until at least one record is available.
>> + *   O_NONBLOCK set:   returns -EAGAIN immediately if the ring is empty.
>> + *
>> + * Returns the number of bytes copied (always a multiple of sizeof tlob_event),
>> + * -EAGAIN if non-blocking and empty, or a negative error code.
>> + *
>> + * read() and mmap() share the same ring and data_tail cursor; do not use
>> + * both simultaneously on the same fd.
>> + */
>> +static ssize_t rv_read(struct file *file, char __user *buf, size_t count,
>> +		       loff_t *ppos)
>> +{
>> +	struct rv_file_priv *priv = file->private_data;
>> +	struct tlob_ring *ring;
>> +	size_t rec = sizeof(struct tlob_event);
>> +	unsigned long irqflags;
>> +	ssize_t done = 0;
>> +	int ret;
>> +
>> +	if (!priv)
>> +		return -ENODEV;
>> +
>> +	ring = &priv->ring;
>> +
>> +	if (count < rec)
>> +		return -EINVAL;
>> +
>> +	/* Blocking path: sleep until the producer advances data_head. */
>> +	if (!(file->f_flags & O_NONBLOCK)) {
>> +		ret = wait_event_interruptible(priv->waitq,
>> +			/* pairs with smp_store_release() in the producer */
>> +			smp_load_acquire(&ring->page->data_head) !=
>> +			READ_ONCE(ring->page->data_tail));
>> +		if (ret)
>> +			return ret;
>> +	}
>> +
>> +	/*
>> +	 * Drain records into the caller's buffer.  ring->lock serialises
>> +	 * concurrent read() callers and the softirq producer.
>> +	 */
>> +	while (done + rec <= count) {
>> +		struct tlob_event record;
>> +		u32 head, tail;
>> +
>> +		spin_lock_irqsave(&ring->lock, irqflags);
>> +		/* pairs with smp_store_release() in the producer */
>> +		head = smp_load_acquire(&ring->page->data_head);
>> +		tail = ring->page->data_tail;
>> +		if (head == tail) {
>> +			spin_unlock_irqrestore(&ring->lock, irqflags);
>> +			break;
>> +		}
>> +		record = ring->data[tail & ring->mask];
>> +		WRITE_ONCE(ring->page->data_tail, tail + 1);
>> +		spin_unlock_irqrestore(&ring->lock, irqflags);
>> +
>> +		if (copy_to_user(buf + done, &record, rec))
>> +			return done ? done : -EFAULT;
>> +		done += rec;
>> +	}
>> +
>> +	return done ? done : -EAGAIN;
>> +}
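
The head/tail indexing used by rv_read() can be sketched in userspace as below. This is a single-threaded model only: it omits ring->lock and the acquire/release pairing, and the producer side is written from the "drop-new on overflow" description in the changelog since tlob_event_push() is not part of this hunk. struct ring, ring_push() and ring_pop() are illustrative names, and int stands in for struct tlob_event.

```c
#include <assert.h>
#include <stdint.h>

#define CAP 8u	/* must be a power of two so that CAP - 1 is a mask */

struct ring {
	uint32_t head;		/* free-running producer index */
	uint32_t tail;		/* free-running consumer index */
	uint32_t dropped;	/* records discarded while full */
	int data[CAP];
};

/* Drop-new policy: a full ring rejects the record and counts it. */
static int ring_push(struct ring *r, int v)
{
	if (r->head - r->tail >= CAP) {
		r->dropped++;
		return 0;
	}
	r->data[r->head & (CAP - 1)] = v;
	r->head++;
	return 1;
}

/* Mirrors the drain loop body: empty when head == tail. */
static int ring_pop(struct ring *r, int *v)
{
	if (r->head == r->tail)
		return 0;
	*v = r->data[r->tail & (CAP - 1)];
	r->tail++;
	return 1;
}
```

Because the indices are free-running u32 values and only masked on access, head - tail stays correct across the 32-bit wrap, and head == tail unambiguously means empty.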
>> +
>> +/*
>> + * rv_mmap - map the per-fd violation ring buffer into userspace.
>> + *
>> + * The mmap region covers the full ring allocation:
>> + *
>> + *   offset 0          : struct tlob_mmap_page  (control page)
>> + *   offset PAGE_SIZE  : struct tlob_event[capacity]  (data pages)
>> + *
>> + * The caller must map exactly PAGE_SIZE + capacity * sizeof(struct tlob_event)
>> + * bytes, rounded up to a page boundary, starting at offset 0 (vm_pgoff must
>> + * be 0).  The actual capacity is
>> + * read from tlob_mmap_page.capacity after a successful mmap(2).
>> + *
>> + * Private mappings (MAP_PRIVATE) are rejected: the shared data_tail field
>> + * written by userspace must be visible to the kernel producer.
>> + */
>> +static int rv_mmap(struct file *file, struct vm_area_struct *vma)
>> +{
>> +	struct rv_file_priv *priv = file->private_data;
>> +	struct tlob_ring    *ring;
>> +	unsigned long        size = vma->vm_end - vma->vm_start;
>> +	unsigned long        ring_size;
>> +
>> +	if (!priv)
>> +		return -ENODEV;
>> +
>> +	ring = &priv->ring;
>> +
>> +	if (vma->vm_pgoff != 0)
>> +		return -EINVAL;
>> +
>> +	ring_size = PAGE_ALIGN(PAGE_SIZE + ((unsigned long)(ring->mask + 1) *
>> +					    sizeof(struct tlob_event)));
>> +	if (size != ring_size)
>> +		return -EINVAL;
>> +
>> +	if (!(vma->vm_flags & VM_SHARED))
>> +		return -EINVAL;
>> +
>> +	return remap_pfn_range(vma, vma->vm_start,
>> +			       page_to_pfn(virt_to_page((void *)ring->base)),
>> +			       ring_size, vma->vm_page_prot);
>> +}
>> +
>> +/* -----------------------------------------------------------------------
>> + * ioctl dispatcher
>> + * -----------------------------------------------------------------------
>> + */
>> +
>> +static long rv_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
>> +{
>> +	unsigned int nr = _IOC_NR(cmd);
>> +
>> +	/*
>> +	 * Verify the magic byte so we don't accidentally handle ioctls
>> +	 * intended for a different device.
>> +	 */
>> +	if (_IOC_TYPE(cmd) != RV_IOC_MAGIC)
>> +		return -ENOTTY;
>> +
>> +#ifdef CONFIG_RV_MON_TLOB
>> +	/* tlob: ioctl numbers 0x01 - 0x1F */
>> +	switch (cmd) {
>> +	case TLOB_IOCTL_TRACE_START: {
>> +		struct tlob_start_args args;
>> +		struct file *notify_file = NULL;
>> +		int ret, hret;
>> +
>> +		if (copy_from_user(&args,
>> +				   (struct tlob_start_args __user *)arg,
>> +				   sizeof(args)))
>> +			return -EFAULT;
>> +		if (args.threshold_us == 0)
>> +			return -EINVAL;
>> +		if (args.flags != 0)
>> +			return -EINVAL;
>> +
>> +		/*
>> +		 * If notify_fd >= 0, resolve it to a file pointer.
>> +		 * fget() bumps the reference count; tlob.c drops it
>> +		 * via fput() when the monitoring window ends.
>> +		 * Reject non-/dev/rv fds to prevent type confusion.
>> +		 */
>> +		if (args.notify_fd >= 0) {
>> +			notify_file = fget(args.notify_fd);
>> +			if (!notify_file)
>> +				return -EBADF;
>> +			if (notify_file->f_op != file->f_op) {
>> +				fput(notify_file);
>> +				return -EINVAL;
>> +			}
>> +		}
>> +
>> +		ret = tlob_start_task(current, args.threshold_us,
>> +				      notify_file, args.tag);
>> +		if (ret != 0) {
>> +			/* tlob.c did not take ownership; drop ref. */
>> +			if (notify_file)
>> +				fput(notify_file);
>> +			return ret;
>> +		}
>> +
>> +		/*
>> +		 * Record session handle.  Free any stale handle left by
>> +		 * a previous window whose deadline timer fired (timer
>> +		 * removes tlob_task_state but cannot touch tlob_handles).
>> +		 */
>> +		tlob_handle_free(current);
>> +		hret = tlob_handle_alloc(current, file);
>> +		if (hret < 0) {
>> +			tlob_stop_task(current);
>> +			return hret;
>> +		}
>> +		return 0;
>> +	}
>> +	case TLOB_IOCTL_TRACE_STOP: {
>> +		int had_handle;
>> +		int ret;
>> +
>> +		/*
>> +		 * Atomically remove the session handle for current.
>> +		 *
>> +		 *   had_handle == 0: TRACE_START was never called for
>> +		 *                    this thread -> caller bug -> -ESRCH
>> +		 *
>> +		 *   had_handle == 1: TRACE_START was called.  If
>> +		 *                    tlob_stop_task() now returns
>> +		 *                    -ESRCH, the deadline timer already
>> +		 *                    fired -> budget exceeded -> -EOVERFLOW
>> +		 */
>> +		had_handle = tlob_handle_free(current);
>> +		if (!had_handle)
>> +			return -ESRCH;
>> +
>> +		ret = tlob_stop_task(current);
>> +		return (ret == -ESRCH) ? -EOVERFLOW : ret;
>> +	}
>> +	default:
>> +		break;
>> +	}
>> +#endif /* CONFIG_RV_MON_TLOB */
>> +
>> +	return -ENOTTY;
>> +}
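
The TRACE_STOP return-value mapping described in the header comment reduces to the following decision, shown here as a standalone userspace function. trace_stop_result() is an illustrative name; had_handle corresponds to the result of tlob_handle_free() and stop_ret to the result of tlob_stop_task().

```c
#include <assert.h>
#include <errno.h>

/*
 * had_handle: 1 if TRACE_START was recorded for this thread, else 0.
 * stop_ret:   what tlob_stop_task() returned.
 */
static int trace_stop_result(int had_handle, int stop_ret)
{
	/* No session ticket: TRACE_START was never called. */
	if (!had_handle)
		return -ESRCH;

	/* Ticket exists but the monitor entry is gone: the timer fired. */
	return stop_ret == -ESRCH ? -EOVERFLOW : stop_ret;
}
```

So userspace sees -ESRCH only for an unmatched TRACE_STOP, -EOVERFLOW when the budget was exceeded before the stop arrived, and any other tlob_stop_task() result unchanged.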
>> +
>> +/* -----------------------------------------------------------------------
>> + * Module init / exit
>> + * -----------------------------------------------------------------------
>> + */
>> +
>> +static const struct file_operations rv_fops = {
>> +	.owner		= THIS_MODULE,
>> +	.open		= rv_open,
>> +	.release	= rv_release,
>> +	.read		= rv_read,
>> +	.poll		= rv_poll,
>> +	.mmap		= rv_mmap,
>> +	.unlocked_ioctl	= rv_ioctl,
>> +#ifdef CONFIG_COMPAT
>> +	.compat_ioctl	= rv_ioctl,
>> +#endif
>> +	.llseek		= noop_llseek,
>> +};
>> +
>> +/*
>> + * 0666: /dev/rv is a self-instrumentation device.  All ioctls operate
>> + * exclusively on the calling task (current); no task can monitor another
>> + * via this interface.  Opening the device does not grant any privilege
>> + * beyond observing one's own latency, so world-read/write is appropriate.
>> + */
>> +static struct miscdevice rv_miscdev = {
>> +	.minor	= MISC_DYNAMIC_MINOR,
>> +	.name	= "rv",
>> +	.fops	= &rv_fops,
>> +	.mode	= 0666,
>> +};
>> +
>> +static int __init rv_ioctl_init(void)
>> +{
>> +	int i;
>> +
>> +	for (i = 0; i < TLOB_HANDLES_SIZE; i++)
>> +		INIT_HLIST_HEAD(&tlob_handles[i]);
>> +
>> +	return misc_register(&rv_miscdev);
>> +}
>> +
>> +static void __exit rv_ioctl_exit(void)
>> +{
>> +	misc_deregister(&rv_miscdev);
>> +}
>> +
>> +module_init(rv_ioctl_init);
>> +module_exit(rv_ioctl_exit);
>> +
>> +MODULE_LICENSE("GPL");
>> +MODULE_DESCRIPTION("RV ioctl interface via /dev/rv");
>> diff --git a/kernel/trace/rv/rv_trace.h b/kernel/trace/rv/rv_trace.h
>> index 4a6faddac..65d6c6485 100644
>> --- a/kernel/trace/rv/rv_trace.h
>> +++ b/kernel/trace/rv/rv_trace.h
>> @@ -126,6 +126,7 @@ DECLARE_EVENT_CLASS(error_da_monitor_id,
>>   #include <monitors/snroc/snroc_trace.h>
>>   #include <monitors/nrp/nrp_trace.h>
>>   #include <monitors/sssw/sssw_trace.h>
>> +#include <monitors/tlob/tlob_trace.h>
>>   // Add new monitors based on CONFIG_DA_MON_EVENTS_ID here
>>   
>>   #endif /* CONFIG_DA_MON_EVENTS_ID */
>> @@ -202,6 +203,55 @@ TRACE_EVENT(rv_retries_error,
>>   		__get_str(event), __get_str(name))
>>   );
>>   #endif /* CONFIG_RV_MON_MAINTENANCE_EVENTS */
>> +
>> +#ifdef CONFIG_RV_MON_TLOB
>> +/*
>> + * tlob_budget_exceeded - emitted when a monitored task exceeds its latency
>> + * budget.  Carries the on-CPU / off-CPU time breakdown so that the cause
>> + * of the overrun (CPU-bound vs. scheduling/I/O latency) is immediately
>> + * visible in the ftrace ring buffer without post-processing.
>> + */
>> +TRACE_EVENT(tlob_budget_exceeded,
>> +
>> +	TP_PROTO(struct task_struct *task, u64 threshold_us,
>> +		 u64 on_cpu_us, u64 off_cpu_us, u32 switches,
>> +		 bool state_is_on_cpu, u64 tag),
>> +
>> +	TP_ARGS(task, threshold_us, on_cpu_us, off_cpu_us, switches,
>> +		state_is_on_cpu, tag),
>> +
>> +	TP_STRUCT__entry(
>> +		__string(comm,		task->comm)
>> +		__field(pid_t,		pid)
>> +		__field(u64,		threshold_us)
>> +		__field(u64,		on_cpu_us)
>> +		__field(u64,		off_cpu_us)
>> +		__field(u32,		switches)
>> +		__field(bool,		state_is_on_cpu)
>> +		__field(u64,		tag)
>> +	),
>> +
>> +	TP_fast_assign(
>> +		__assign_str(comm);
>> +		__entry->pid		= task->pid;
>> +		__entry->threshold_us	= threshold_us;
>> +		__entry->on_cpu_us	= on_cpu_us;
>> +		__entry->off_cpu_us	= off_cpu_us;
>> +		__entry->switches	= switches;
>> +		__entry->state_is_on_cpu = state_is_on_cpu;
>> +		__entry->tag		= tag;
>> +	),
>> +
>> +	TP_printk("%s[%d]: budget exceeded threshold=%llu on_cpu=%llu off_cpu=%llu switches=%u state=%s tag=0x%016llx",
>> +		__get_str(comm), __entry->pid,
>> +		__entry->threshold_us,
>> +		__entry->on_cpu_us, __entry->off_cpu_us,
>> +		__entry->switches,
>> +		__entry->state_is_on_cpu ? "on_cpu" : "off_cpu",
>> +		__entry->tag)
>> +);
>> +#endif /* CONFIG_RV_MON_TLOB */
>> +
>>   #endif /* _TRACE_RV_H */
>>   
>>   /* This part must be outside protection */
> 
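
For readers following the quoted TLOB_IOCTL_TRACE_STOP branch, here is a
minimal sketch of its return-value mapping as a standalone function. The
helper name tlob_trace_stop_result() and its two parameters are
hypothetical and not part of the patch; they stand in for the results of
tlob_handle_free(current) and tlob_stop_task(current). Only the mapping
itself comes from the quoted code and its comment.

```c
#include <errno.h>

/*
 * Hypothetical helper, not part of the patch: mirrors the return-value
 * mapping of the quoted TLOB_IOCTL_TRACE_STOP branch.
 *
 * had_handle: stand-in for the result of tlob_handle_free(current)
 * stop_ret:   stand-in for the result of tlob_stop_task(current)
 */
int tlob_trace_stop_result(int had_handle, int stop_ret)
{
	/* No session handle: TRACE_START was never called for this thread. */
	if (!had_handle)
		return -ESRCH;

	/*
	 * A handle existed but the monitored task entry is already gone:
	 * per the quoted comment, the deadline timer fired first, so the
	 * budget was exceeded.
	 */
	if (stop_ret == -ESRCH)
		return -EOVERFLOW;

	/* Otherwise pass the stop result through unchanged (0 on success). */
	return stop_ret;
}
```

Under this mapping a caller sees -ESRCH only for a TRACE_STOP without a
matching TRACE_START, and -EOVERFLOW only when a started session ran past
its budget before TRACE_STOP was issued.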
