* Re: [PATCH v7 14/42] KVM: guest_memfd: Handle lru_add fbatch refcounts during conversion safety check
From: Vlastimil Babka (SUSE) @ 2026-06-08 8:45 UTC (permalink / raw)
To: ackerleytng, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
david, ira.weiny, jmattson, jthoughton, michael.roth, oupton,
pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
steven.price, tabba, willy, wyihan, yan.y.zhao, forkloop,
pratyush, suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
Sean Christopherson, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe
Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <20260522-gmem-inplace-conversion-v7-14-2f0fae496530@google.com>
On 5/23/26 02:17, Ackerley Tng via B4 Relay wrote:
> From: Ackerley Tng <ackerleytng@google.com>
>
> When checking if a guest_memfd folio is safe for conversion, its refcount
> is examined. A folio may be present in a per-CPU lru_add fbatch, which
> temporarily increases its refcount. This can lead to a false positive,
> incorrectly indicating that the folio is in use and preventing the
> conversion, even if it is otherwise safe. The conversion process might not
> be on the same CPU that holds the folio in its fbatch, making a simple
> per-CPU check insufficient.
>
> To address this, drain all CPUs' lru_add fbatches if an unexpectedly high
> refcount is encountered during the safety check. This is performed at most
> once per conversion request. Draining only if the folio in question may be
> lru cached.
>
> guest_memfd folios are unevictable, so they can only reside in the lru_add
> fbatch. If the folio's refcount is still unsafe after draining, then the
> conversion is truly deemed unsafe.
>
> Reviewed-by: Fuad Tabba <tabba@google.com>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
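A minimal userspace sketch of the drain-once logic described in the commit
message above, under stated assumptions: `struct fake_folio`,
`fake_drain_all()` and `range_safe_for_conversion()` are hypothetical names
for illustration, not kernel APIs. The sketch also skips the "may be lru
cached" test that the patch applies before draining.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a folio: only the fields this sketch needs. */
struct fake_folio {
	int refcount;       /* total references currently held */
	int expected_refs;  /* references the conversion path expects */
	int batch_refs;     /* extra references held by per-CPU batches */
};

/* Counts how often the (expensive) all-CPU drain was requested. */
static int drain_calls;

/* Hypothetical stand-in for draining every CPU's lru_add batch. */
static void fake_drain_all(struct fake_folio *folios, size_t n)
{
	drain_calls++;
	for (size_t i = 0; i < n; i++) {
		folios[i].refcount -= folios[i].batch_refs;
		folios[i].batch_refs = 0;
	}
}

/*
 * Return true if every folio has exactly the expected refcount.
 * On the first unexpected refcount, drain once and re-check that folio.
 * Any mismatch that survives the drain is treated as a real user, and
 * the whole request is reported as unsafe.
 */
static bool range_safe_for_conversion(struct fake_folio *folios, size_t n)
{
	bool drained = false;

	assert(folios != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (folios[i].refcount == folios[i].expected_refs)
			continue;

		if (!drained) {
			fake_drain_all(folios, n);
			drained = true;
			if (folios[i].refcount == folios[i].expected_refs)
				continue;
		}
		return false;
	}
	return true;
}
```

The `drained` flag is what keeps the drain to at most one per request: a
second elevated refcount is reported as unsafe instead of triggering another
drain.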
* Re: [PATCH] rethook: Use tsk->on_cpu to check task execution state
From: Tengda Wu @ 2026-06-08 8:31 UTC (permalink / raw)
To: Masami Hiramatsu
Cc: Peter Zijlstra, Josh Poimboeuf, Steven Rostedt, Mathieu Desnoyers,
Alexei Starovoitov, linux-trace-kernel, linux-kernel
In-Reply-To: <20260608115646.97d80d30aed182d468496449@kernel.org>
On 2026/6/8 10:56, Masami Hiramatsu wrote:
> On Mon, 8 Jun 2026 09:52:37 +0800
> Tengda Wu <wutengda@huaweicloud.com> wrote:
>
>>
>>
>> On 2026/6/5 21:43, Masami Hiramatsu wrote:
>>> On Thu, 4 Jun 2026 11:34:45 +0200
>>> Peter Zijlstra <peterz@infradead.org> wrote:
>>>
>>>> On Mon, Jun 01, 2026 at 08:40:01AM +0900, Masami Hiramatsu wrote:
>>>>
>>>>> Peter, is it OK to drop @rq from task_on_cpu()?
>>>>
>>>> Sure.
>>>>
>>>>> Then we can use it from rethook.
>>>>
>>>> Well, it is in sched/sched.h, which is an internal header, and no you
>>>> cannot use that header in rethook.
>>>
>>> Ah, OK. Hmm, then we should not use it. Maybe ->on_cpu is also internal
>>> state?
>>>
>>>>
>>>> But lets step back first, what is the actual problem here, why are we
>>>> looking at ->on_cpu at all?
>>>
>>> Tengda, can you explain it?
>>> I think you want to take a stacktrace on !current process, and
>>> rethook_find_ret_addr() is rejected if the task is in the running state.
>>>
>>> But if you can share actual situation what you need, it is
>>> helpful for us to understand.
>>>
>>> Thank you,
>>>
>>>
>>
>>
>> Sure.
>>
>> Background: We are verifying the support of live patches for functions that
>> have a kretprobe. The specific verification method is as follows:
>>
>> We construct a function foo() that calls bar():
>>
>> void bar(void)
>> {
>> for (;;) {
>> schedule();
>> }
>> }
>>
>> void foo(void)
>> {
>> bar();
>> }
>>
>> A kretprobe is attached to bar():
>>
>> echo 'r:rp1 bar' > /sys/kernel/tracing/kprobe_events
>> echo 1 > /sys/kernel/tracing/events/kprobes/rp1/enable
>>
>> Then foo() is triggered. The expected behavior is that bar() will call
>> schedule() and yield the CPU.
>>
>> After that, the live patch is activated to attempt replacing the implementation
>> of foo(). The expectation is that this should succeed.
>>
>> However, in reality, because the task that called schedule() is still in the
>> RUNNING state, the condition task_is_running(tsk) inside rethook_find_ret_addr()
>> is not satisfied, causing the function to return early. This, in turn,
>> prevents stack_trace_save_tsk_reliable() from determining the stack as
>> reliable, leading to a failure in activating the live patch.
>
> Hmm, is bar() an infinite loop, or a bounded loop that takes a long time
> and so yields for a while? Anyway, it does not look like a good design pattern.
> Is it possible to avoid busy loops and instead use Workers, or wait for
> something to complete or for input within a loop?
>
>>
>> **Not sure if this is correct:**
>>
>> We believe that after a task voluntarily calls schedule(), when the stack
>> is expected to be reliable, it is a safe time to activate a live patch.
>
> In this case, I don't know how to block the loop inside the bar.
> Even if !tsk->on_cpu, the tsk can restart running right after checking
> the flag.
>
The infinite loop in bar() is indeed a poor design pattern. This test
case is only artificial, not from real-world code. It is merely
intended to verify live patch support for various cases.
However, the point you raised has indeed made me think. I realize that
checking only tsk->on_cpu is not sufficient -- there is also a race
condition where the task could be scheduled back onto a CPU right after
the check. I need to re-examine the validity of this test case and
whether it represents a safe live patch activation scenario.
Thank you again for your patience and for pointing out these
fundamental issues. Your guidance is much appreciated.
Best regards,
Tengda
* Re: [PATCH v2 13/13] verification/rvgen: Remove dead code
From: Nam Cao @ 2026-06-08 8:29 UTC (permalink / raw)
To: Gabriele Monaco, Wander Lairson Costa, Steven Rostedt,
linux-trace-kernel, linux-kernel
In-Reply-To: <310e485ce394c7d258e142e14d5e51c5e15e1d30.camel@redhat.com>
Gabriele Monaco <gmonaco@redhat.com> writes:
> You might want to remove unused imports (linters should help you with
> that too):
> * re, typing.Iterator, and itertools.islice from automata.py
> * deque and ConstraintRule from dot2k
Thanks, I overlooked those warnings amid the noise from the existing
warnings :(
Let me clean up the existing pylint issues, so that new warnings are
easily noticed.
Nam
* Re: [PATCH next] kernel/trace/trace_printk: Use kstrdup() instead of kmalloc() and strcpy()
From: Masami Hiramatsu @ 2026-06-08 8:27 UTC (permalink / raw)
To: david.laight.linux
Cc: Kees Cook, linux-hardening, Arnd Bergmann, linux-kernel,
linux-trace-kernel, Masami Hiramatsu, Steven Rostedt
In-Reply-To: <20260606202633.5018-34-david.laight.linux@gmail.com>
On Sat, 6 Jun 2026 21:26:28 +0100
david.laight.linux@gmail.com wrote:
> From: David Laight <david.laight.linux@gmail.com>
>
> Signed-off-by: David Laight <david.laight.linux@gmail.com>
> ---
> This is one of a group of patches that remove potentially unbounded
> strcpy() calls.
>
> They are mostly replaced by strscpy() or, when strlen() has just been
> called, with memcpy() (usually including the '\0').
>
> Calls with copy string literals into arrays are left unchanged.
> They are safe and easily detected as such.
>
> The changes were made by getting the compiler to detect the calls and
> then fixing the code by hand.
>
> Note that all the changes are only compile tested.
>
> Some Makefiles were changed to allow files to contain strcpy().
> As well as 'difficult to fix' files, this included 'show' functions
> as they really need to use sysfs_emit() or seq_printf().
>
> All the patches are being sent individually to avoid very long cc lists.
> Apologies for the terse commit messages and likely unexpected tags.
> (There are about 100 patches in total.)
>
This looks good to me.
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Thanks,
> kernel/trace/trace_printk.c | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/kernel/trace/trace_printk.c b/kernel/trace/trace_printk.c
> index 3ea17af60169..98171a2398e4 100644
> --- a/kernel/trace/trace_printk.c
> +++ b/kernel/trace/trace_printk.c
> @@ -71,10 +71,9 @@ void hold_module_trace_bprintk_format(const char **start, const char **end)
> fmt = NULL;
> tb_fmt = kmalloc_obj(*tb_fmt);
> if (tb_fmt) {
> - fmt = kmalloc(strlen(*iter) + 1, GFP_KERNEL);
> + fmt = kstrdup(*iter, GFP_KERNEL);
> if (fmt) {
> list_add_tail(&tb_fmt->list, &trace_bprintk_fmt_list);
> - strcpy(fmt, *iter);
> tb_fmt->fmt = fmt;
> } else
> kfree(tb_fmt);
> --
> 2.39.5
>
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
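A userspace sketch of the pattern the quoted hunk replaces, for illustration
only: `copy_two_step()` mirrors the old kmalloc()+strcpy() sequence and
`copy_one_step()` stands in for kstrdup(). Both function names are
hypothetical, and malloc() is used here in place of the kernel allocator.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Old pattern: size the buffer, allocate it, then copy in a separate step. */
static char *copy_two_step(const char *src)
{
	char *dst;

	assert(src != NULL);

	dst = malloc(strlen(src) + 1);
	if (dst)
		strcpy(dst, src);
	return dst;
}

/*
 * New pattern: one helper sizes, allocates and copies, so the length
 * computed for the allocation is the same one used for the copy.
 * Returns NULL for a NULL source or on allocation failure.
 */
static char *copy_one_step(const char *src)
{
	size_t len;
	char *dst;

	if (!src)
		return NULL;

	len = strlen(src) + 1;	/* include the terminating '\0' */
	dst = malloc(len);
	if (dst)
		memcpy(dst, src, len);
	return dst;
}
```

The caller still has to check the result for NULL and free it later, as the
patched code does with `fmt`.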
* Re: [PATCHv8 bpf-next 28/29] selftests/bpf: Add tracing multi attach benchmark test
From: Jiri Olsa @ 2026-06-08 8:25 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, bpf,
linux-trace-kernel, Martin KaFai Lau, Eduard Zingerman, Song Liu,
Yonghong Song, Menglong Dong, Steven Rostedt
In-Reply-To: <DJ30RPYRXESZ.2P3LP98X55VMG@gmail.com>
On Sun, Jun 07, 2026 at 11:13:32AM -0700, Alexei Starovoitov wrote:
> On Sat Jun 6, 2026 at 5:39 AM PDT, Jiri Olsa wrote:
> > Adding benchmark test that attaches to (almost) all allowed tracing
> > functions and display attach/detach times.
> >
> > # ./test_progs -t tracing_multi_bench_attach -v
> > bpf_testmod.ko is already unloaded.
> > Loading bpf_testmod.ko...
> > Successfully loaded bpf_testmod.ko.
> > serial_test_tracing_multi_bench_attach:PASS:btf__load_vmlinux_btf 0 nsec
> > serial_test_tracing_multi_bench_attach:PASS:tracing_multi_bench__open_and_load 0 nsec
> > serial_test_tracing_multi_bench_attach:PASS:get_syms 0 nsec
> > serial_test_tracing_multi_bench_attach:PASS:bpf_program__attach_tracing_multi 0 nsec
> > serial_test_tracing_multi_bench_attach: found 51186 functions
> > serial_test_tracing_multi_bench_attach: attached in 1.295s
> > serial_test_tracing_multi_bench_attach: detached in 0.243s
>
> ...
>
> > + if (!ASSERT_OK(bpf_get_ksyms(&ksyms, true), "get_syms"))
> > + goto cleanup;
> > +
> > + /* Get all ftrace 'safe' symbols.. */
> > + for (i = 0; i < ksyms->filtered_cnt; i++) {
> > + if (!tsearch(&ksyms->filtered_syms[i], &root, compare)) {
> > + ASSERT_FAIL("tsearch failed");
> > + goto cleanup;
> > + }
> > + }
> > +
> > + /* ..and filter them through BTF and btf_type_is_traceable_func. */
> > + nr = btf__type_cnt(btf);
> > + for (type_id = 1; type_id < nr; type_id++) {
> > + const struct btf_type *type;
> > + const char *str;
> > +
> > + type = btf__type_by_id(btf, type_id);
> > + if (!type)
> > + break;
> > +
> > + if (BTF_INFO_KIND(type->info) != BTF_KIND_FUNC)
> > + continue;
> > +
> > + str = btf__name_by_offset(btf, type->name_off);
> > + if (!str)
> > + break;
> > +
> > + if (!tfind(&str, &root, compare))
> > + continue;
> > +
> > + if (!btf_type_is_traceable_func(btf, type))
> > + continue;
> > +
> > + err = libbpf_ensure_mem((void **) &ids, &cap, sizeof(*ids), cnt + 1);
> > + if (err)
> > + goto cleanup;
> > +
> > + ids[cnt++] = type_id;
> > + }
>
> This filtering wasn't enough.
> I've added removal of duplicates here while applying:
>
> + /*
> + * Collect names that are not unique in kallsyms. The kernel resolves a
> + * tracing-multi BTF id to an address with kallsyms_lookup_name(), which
> + * returns the first symbol of that name. For a duplicate name that may
> + * be a different (non-ftrace-able) instance than the ftrace-able one in
> + * available_filter_functions, so attaching to it by BTF id fails with
> + * -ENOENT (e.g. t_start/t_next/t_stop). ksyms->syms is sorted by name,
> + * so equal names are adjacent.
> + */
> + for (i = 1; i < ksyms->sym_cnt; i++) {
> + if (strcmp(ksyms->syms[i].name, ksyms->syms[i - 1].name))
> + continue;
> + if (!tsearch(&ksyms->syms[i].name, &dups, compare)) {
> + ASSERT_FAIL("tsearch failed");
> + goto cleanup;
> + }
> + }
>
>
> + /* Skip names that are not unique in kallsyms, see above. */
> + if (tfind(&str, &dups, compare))
> + continue;
>
>
> As claude explains it:
> ----
> 1. The kernel attaches tracing_multi by BTF id. To get an address it resolves
> the BTF function name via kallsyms_lookup_name(tname) and requires
> ftrace_location(addr) — kernel/bpf/verifier.c:19380:
> addr = kallsyms_lookup_name(tname);
> ...
> if (!addr || !ftrace_location(addr))
> return -ENOENT;
> 2. t_start/t_next/t_stop each have 5 instances in this kernel. Only one is
> ftrace-able — the copies in kernel/trace/* are built notrace (ftrace's Makefile
> strips -pg), so only the unrelated copy is in available_filter_functions:
> 3. kallsyms_lookup_name() returns the lowest-address instance among equal names
> (exact strcmp, lowest seq). That instance has no fentry → ftrace_location()
> returns 0 → -ENOENT, which aborts the whole all-or-nothing bench attach.
>
> Why the bench includes them: it intersects BTF FUNC names with
> available_filter_functions names. Since some t_start is ftrace-able, the name
> passes the filter — but the kernel resolves the wrong (non-ftrace-able)
> t_start. The author's kernel apparently had the ftrace-able copy at the lowest
> address, so it passed there.
>
> This is a pre-existing limitation, not multi-specific: single fentry attach by
> BTF id uses the same kallsyms_lookup_name(tname) path (verifier.c:19120) — you
> can't reliably fentry-attach to any duplicate-named function on this kernel
> either.
> ----
strange I never triggered that.. but makes sense
>
> Maybe we should adjust bpf_get_ksyms() instead. Not sure.
the only other user is test_kprobe_multi_bench_attach which does
not care about this, so I think at this point keeping this in the
serial_test_tracing_multi_bench_attach is enough for now
thanks,
jirka
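A standalone sketch of the duplicate-name filter discussed above, assuming
only what the quoted comment states: the name array is sorted, so equal names
are adjacent. `collect_dup_names()` and `is_dup_name()` are hypothetical
helpers. The applied hunk stores duplicates in a tsearch() tree, while this
sketch uses a flat array and a linear lookup to stay short.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * names[] must be sorted so that equal strings are adjacent.
 * Stores each name that occurs more than once into dups[] exactly once
 * and returns how many were stored. dups[] needs room for n / 2 entries.
 */
static size_t collect_dup_names(const char **names, size_t n,
				const char **dups)
{
	size_t cnt = 0;

	assert(names != NULL || n == 0);

	for (size_t i = 1; i < n; i++) {
		/* Different from the previous name: not a duplicate. */
		if (strcmp(names[i], names[i - 1]) != 0)
			continue;
		/* A run of three or more equal names is recorded once. */
		if (cnt && strcmp(dups[cnt - 1], names[i]) == 0)
			continue;
		dups[cnt++] = names[i];
	}
	return cnt;
}

/* Linear membership test, standing in for the tfind() lookup. */
static int is_dup_name(const char *name, const char **dups, size_t cnt)
{
	for (size_t i = 0; i < cnt; i++) {
		if (strcmp(name, dups[i]) == 0)
			return 1;
	}
	return 0;
}
```

A caller would skip every BTF function name for which `is_dup_name()`
returns nonzero, which matches the "skip names that are not unique" step in
the quoted hunk.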
* Re: [PATCH mm-unstable v19 14/14] Documentation: mm: update the admin guide for mTHP collapse
From: Lance Yang @ 2026-06-08 7:41 UTC (permalink / raw)
To: npache
Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
lance.yang, liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat,
mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
zokeefe, bagasdotme
In-Reply-To: <20260605161422.213817-15-npache@redhat.com>
On Fri, Jun 05, 2026 at 10:14:21AM -0600, Nico Pache wrote:
>Now that we can collapse to mTHPs lets update the admin guide to
>reflect these changes and provide proper guidance on how to utilize it.
>
>Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>
>Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
>Signed-off-by: Nico Pache <npache@redhat.com>
>---
Reviewed-by: Lance Yang <lance.yang@linux.dev>
* Re: [PATCH mm-unstable v19 12/14] mm/khugepaged: avoid unnecessary mTHP collapse attempts
From: Lance Yang @ 2026-06-08 7:36 UTC (permalink / raw)
To: npache
Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
lance.yang, liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat,
mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
zokeefe, usama.arif
In-Reply-To: <20260605161422.213817-13-npache@redhat.com>
On Fri, Jun 05, 2026 at 10:14:19AM -0600, Nico Pache wrote:
>There are cases where, if an attempted collapse fails, all subsequent
>orders are guaranteed to also fail. Avoid these collapse attempts by
>bailing out early.
>
>Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>
>Acked-by: Usama Arif <usama.arif@linux.dev>
>Acked-by: David Hildenbrand (Arm) <david@kernel.org>
>Signed-off-by: Nico Pache <npache@redhat.com>
>---
Reviewed-by: Lance Yang <lance.yang@linux.dev>
* Re: [PATCH mm-unstable v19 10/14] mm/khugepaged: introduce collapse_possible_orders helper functions
From: Lance Yang @ 2026-06-08 7:27 UTC (permalink / raw)
To: npache
Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
lance.yang, liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat,
mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
zokeefe
In-Reply-To: <20260605161422.213817-11-npache@redhat.com>
On Fri, Jun 05, 2026 at 10:14:17AM -0600, Nico Pache wrote:
>Add collapse_possible_orders() to generalize THP order eligibility. The
>function determines which THP orders are permitted based on collapse
>context (khugepaged vs madv_collapse). We also add collapse_possible()
>as a thin wrapper around collapse_possible_orders() that returns a bool
>rather than the whole bitmap.
>
>This consolidates collapse configuration logic and provides a clean
>interface for future mTHP collapse support where the orders may be
>different.
>
>Acked-by: David Hildenbrand (Arm) <david@kernel.org>
>Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>Signed-off-by: Nico Pache <npache@redhat.com>
>---
Reviewed-by: Lance Yang <lance.yang@linux.dev>
* Re: [PATCH mm-unstable v19 09/14] mm/khugepaged: improve tracepoints for mTHP orders
From: Lance Yang @ 2026-06-08 7:19 UTC (permalink / raw)
To: npache
Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
lance.yang, liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat,
mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
zokeefe
In-Reply-To: <20260605161422.213817-10-npache@redhat.com>
On Fri, Jun 05, 2026 at 10:14:16AM -0600, Nico Pache wrote:
>Add the order to the mm_collapse_huge_page<_swapin,_isolate> tracepoints to
>give better insight into what order is being operated at for.
>
>Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>
>Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>Acked-by: David Hildenbrand (Arm) <david@kernel.org>
>Signed-off-by: Nico Pache <npache@redhat.com>
>---
Reviewed-by: Lance Yang <lance.yang@linux.dev>
* Re: [PATCH mm-unstable v19 08/14] mm/khugepaged: add per-order mTHP collapse failure statistics
From: Lance Yang @ 2026-06-08 7:13 UTC (permalink / raw)
To: npache
Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
lance.yang, liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat,
mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
zokeefe
In-Reply-To: <20260605161422.213817-9-npache@redhat.com>
On Fri, Jun 05, 2026 at 10:14:15AM -0600, Nico Pache wrote:
>Add three new mTHP statistics to track collapse failures for different
>orders when encountering swap PTEs, excessive none PTEs, and shared PTEs:
>
>- collapse_exceed_swap_pte: Increment when mTHP collapse fails due to
> encountering a swap PTE.
>
>- collapse_exceed_none_pte: Counts when mTHP collapse fails due to
> exceeding the none PTE threshold for the given order
>
>- collapse_exceed_shared_pte: Counts when mTHP collapse fails due to
> encountering a shared PTE.
>
>These statistics complement the existing THP_SCAN_EXCEED_* events by
>providing per-order granularity for mTHP collapse attempts. The stats are
>exposed via sysfs under
>`/sys/kernel/mm/transparent_hugepage/hugepages-*/stats/` for each
>supported hugepage size.
>
>As we currently do not support collapsing mTHPs that contain a swap or
>shared entry, those statistics keep track of how often we are
>encountering failed mTHP collapses due to these restrictions.
>
>We will add support for mTHP collapse for anonymous pages next; lets also
>track when this happens at the PMD level within the per-mTHP stats.
>
>Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>
>Acked-by: David Hildenbrand (Arm) <david@kernel.org>
>Signed-off-by: Nico Pache <npache@redhat.com>
>---
Reviewed-by: Lance Yang <lance.yang@linux.dev>
* Re: [PATCH mm-unstable v19 07/14] mm/khugepaged: skip collapsing mTHP to smaller orders
From: Lance Yang @ 2026-06-08 6:59 UTC (permalink / raw)
To: npache
Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
lance.yang, liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat,
mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
zokeefe, usama.arif
In-Reply-To: <20260605161422.213817-8-npache@redhat.com>
On Fri, Jun 05, 2026 at 10:14:14AM -0600, Nico Pache wrote:
>khugepaged may try to collapse a mTHP to a folio of equal or smaller size,
>possibly resulting in a partially mapped source folio, which is undesired.
>Skip these cases until we have a way to check if its ok to collapse to a
>smaller mTHP size (like in the case of a partially mapped folio). This
>check is not done during the scan phase as the current collapse order is
>unknown at that time.
>
>This patch is inspired by Dev Jain's work on khugepaged mTHP support [1].
>
>[1] https://lore.kernel.org/lkml/20241216165105.56185-11-dev.jain@arm.com/
>
>Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>
>Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>Acked-by: David Hildenbrand (arm) <david@kernel.org>
>Acked-by: Usama Arif <usama.arif@linux.dev>
>Co-developed-by: Dev Jain <dev.jain@arm.com>
>Signed-off-by: Dev Jain <dev.jain@arm.com>
>Signed-off-by: Nico Pache <npache@redhat.com>
>---
> mm/khugepaged.c | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
>diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>index c2769d82a719..191e529c185c 100644
>--- a/mm/khugepaged.c
>+++ b/mm/khugepaged.c
>@@ -697,6 +697,14 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
> goto out;
> }
> }
>+ /*
>+ * TODO: In some cases of partially-mapped folios, we'd actually
>+ * want to collapse.
>+ */
Partially mapped folios can be handled later :)
Reviewed-by: Lance Yang <lance.yang@linux.dev>
>+ if (!is_pmd_order(order) && folio_order(folio) >= order) {
>+ result = SCAN_PTE_MAPPED_HUGEPAGE;
>+ goto out;
>+ }
>
> if (folio_test_large(folio)) {
> struct folio *f;
>--
>2.54.0
>
>
* Re: [PATCH mm-unstable v19 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
From: Lance Yang @ 2026-06-08 4:54 UTC (permalink / raw)
To: npache
Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
lance.yang, liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat,
mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
zokeefe
In-Reply-To: <20260605161422.213817-7-npache@redhat.com>
On Fri, Jun 05, 2026 at 10:14:13AM -0600, Nico Pache wrote:
>Pass an order to collapse_huge_page to support collapsing anon memory to
>arbitrary orders within a PMD. order indicates what mTHP size we are
>attempting to collapse to.
>
>For non-PMD collapse we must leave the anon VMA write locked until after
>we collapse the mTHP-- in the PMD case all the pages are isolated, but in
>the mTHP case this is not true, and we must keep the lock to prevent
>access/changes to the page tables. This can happen if the rmap walkers hit
>a pmd_none while the PMD entry is currently unavailable due to being
>temporarily removed during the collapse phase.
>
>To properly establish the page table hierarchy without violating any
>expectations from certain architectures (e.g. MIPS), we must make sure to
>have the PMD reinstalled before the PTEs, and hold both PTE/PMD locks
>before calling update_mmu_cache_range() (if they are distinct locks).
>
>Signed-off-by: Nico Pache <npache@redhat.com>
>---
Nothing else jumped out at me. Anything left can be sorted out later, as
David and Lorenzo said :)
Reviewed-by: Lance Yang <lance.yang@linux.dev>
* Re: [PATCH mm-unstable v19 05/14] mm/khugepaged: require collapse_huge_page to enter/exit with the lock dropped
From: Lance Yang @ 2026-06-08 4:34 UTC (permalink / raw)
To: npache
Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
lance.yang, liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat,
mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
zokeefe
In-Reply-To: <20260605161422.213817-6-npache@redhat.com>
On Fri, Jun 05, 2026 at 10:14:12AM -0600, Nico Pache wrote:
>Currently the collapse_huge_page function requires the mmap_read_lock to
>enter with it held, and exit with it dropped. This function moves the
>unlock into its parent caller, and changes this semantic to requiring it
>to enter/exit with it always unlocked.
>
>In future patches, we need this expectation, as for in mTHP collapse, we
>may have already dropped the lock, and do not want to conditionally
>check for this by passing through the lock_dropped variable.
>
>No functional change is expected as one of the first things the
>collapse_huge_page function does is drop this lock before allocating the
>hugepage.
>
>Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>
>Acked-by: David Hildenbrand (Arm) <david@kernel.org>
>Signed-off-by: Nico Pache <npache@redhat.com>
>---
Reviewed-by: Lance Yang <lance.yang@linux.dev>
* Re: [PATCH] rethook: Use tsk->on_cpu to check task execution state
From: Masami Hiramatsu @ 2026-06-08 2:56 UTC (permalink / raw)
To: Tengda Wu, Josh Poimboeuf
Cc: Peter Zijlstra, Steven Rostedt, Mathieu Desnoyers,
Alexei Starovoitov, linux-trace-kernel, linux-kernel
In-Reply-To: <679a1c8f-1e4d-4ae5-83e1-d0068e6de1a6@huaweicloud.com>
On Mon, 8 Jun 2026 09:52:37 +0800
Tengda Wu <wutengda@huaweicloud.com> wrote:
>
>
> On 2026/6/5 21:43, Masami Hiramatsu wrote:
> > On Thu, 4 Jun 2026 11:34:45 +0200
> > Peter Zijlstra <peterz@infradead.org> wrote:
> >
> >> On Mon, Jun 01, 2026 at 08:40:01AM +0900, Masami Hiramatsu wrote:
> >>
> >>> Peter, is it OK to drop @rq from task_on_cpu()?
> >>
> >> Sure.
> >>
> >>> Then we can use it from rethook.
> >>
> >> Well, it is in sched/sched.h, which is an internal header, and no you
> >> cannot use that header in rethook.
> >
> > Ah, OK. Hmm, then we should not use it. Maybe ->on_cpu is also internal
> > state?
> >
> >>
> >> But lets step back first, what is the actual problem here, why are we
> >> looking at ->on_cpu at all?
> >
> > Tengda, can you explain it?
> > I think you want to take a stacktrace on !current process, and
> > rethook_find_ret_addr() is rejected if the task is in the running state.
> >
> > But if you can share actual situation what you need, it is
> > helpful for us to understand.
> >
> > Thank you,
> >
> >
>
>
> Sure.
>
> Background: We are verifying the support of live patches for functions that
> have a kretprobe. The specific verification method is as follows:
>
> We construct a function foo() that calls bar():
>
> void bar(void)
> {
> for (;;) {
> schedule();
> }
> }
>
> void foo(void)
> {
> bar();
> }
>
> A kretprobe is attached to bar():
>
> echo 'r:rp1 bar' > /sys/kernel/tracing/kprobe_events
> echo 1 > /sys/kernel/tracing/events/kprobes/rp1/enable
>
> Then foo() is triggered. The expected behavior is that bar() will call
> schedule() and yield the CPU.
>
> After that, the live patch is activated to attempt replacing the implementation
> of foo(). The expectation is that this should succeed.
>
> However, in reality, because the task that called schedule() is still in the
> RUNNING state, the condition task_is_running(tsk) inside rethook_find_ret_addr()
> is not satisfied, causing the function to return early. This, in turn,
> prevents stack_trace_save_tsk_reliable() from determining the stack as
> reliable, leading to a failure in activating the live patch.
Hmm, is bar() an infinite loop, or a bounded loop that takes a long time
and so yields for a while? Anyway, it does not look like a good design pattern.
Is it possible to avoid busy loops and instead use Workers, or wait for
something to complete or for input within a loop?
>
> **Not sure if this is correct:**
>
> We believe that after a task voluntarily calls schedule(), when the stack
> is expected to be reliable, it is a safe time to activate a live patch.
In this case, I don't know how to block the loop inside the bar.
Even if !tsk->on_cpu, the tsk can restart running right after checking
the flag.
> Additionally, a similar tsk->on_cpu check can be found elsewhere in the
> kernel (See task_on_another_cpu() in arch/x86/include/asm/unwind.h).
> Therefore, we propose changing the task_is_running(tsk) condition to
> tsk->on_cpu.
Yes, but the comment at that call site says other checks cover the race:
/*
* Refuse to unwind the stack of a task while it's executing on another
* CPU. This check is racy, but that's ok: the unwinder has other
* checks to prevent it from going off the rails.
*/
if (task_on_another_cpu(task))
goto err;
Josh, do you know how this avoids the race?
Thank you,
>
> Thanks,
> Tengda
>
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
* Re: [PATCH v2 3/6] bootconfig: render embedded bootconfig as a kernel cmdline at build time
From: Masami Hiramatsu @ 2026-06-08 2:29 UTC (permalink / raw)
To: Breno Leitao
Cc: Andrew Morton, Nathan Chancellor, paulmck, Nicolas Schier,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, linux-kernel, linux-trace-kernel, linux-kbuild,
bpf, kernel-team
In-Reply-To: <20260605-bootconfig_using_tools-v2-3-d309f544b5f7@debian.org>
Hi Breno,
On Fri, 05 Jun 2026 05:03:34 -0700
Breno Leitao <leitao@debian.org> wrote:
> diff --git a/tools/bootconfig/Makefile b/tools/bootconfig/Makefile
> index 90eb47c9d8de..aa75a7828685 100644
> --- a/tools/bootconfig/Makefile
> +++ b/tools/bootconfig/Makefile
> @@ -15,10 +15,14 @@ override CFLAGS += -Wall -g -I$(CURDIR)/include
> ALL_TARGETS := bootconfig
> ALL_PROGRAMS := $(patsubst %,$(OUTPUT)%,$(ALL_TARGETS))
>
> -all: $(ALL_PROGRAMS) test
> +all: $(ALL_PROGRAMS)
>
> +# bootconfig is a build host tool: Kbuild's prepare hook runs it on the
> +# build machine to render the embedded cmdline, so always compile it with
> +# $(HOSTCC). Using $(CC) would cross-compile it under ARCH=... builds and
> +# fail to exec on the host ("Exec format error").
> $(OUTPUT)bootconfig: main.c include/linux/bootconfig.h $(LIBSRC)
> - $(CC) $(filter %.c,$^) $(CFLAGS) $(LDFLAGS) -o $@
> + $(HOSTCC) $(filter %.c,$^) $(CFLAGS) $(LDFLAGS) -o $@
> Is it safe to pass $(CFLAGS) and $(LDFLAGS) to $(HOSTCC) here?
> When cross-compiling, $(CFLAGS) and $(LDFLAGS) often contain target-specific
> flags. Passing these target flags to the host compiler might cause it to fail,
> or incorrectly generate binaries for the target architecture that fail to
> execute on the build host.
Sashiko found a problem here. Hmm, I would also like to be able to build the
bootconfig tool on its own, just as a tool (like `cd tools/bootconfig; make`).
Can we detect whether make is invoked from the kernel build process or not,
and switch CC/CFLAGS/LDFLAGS accordingly?
>
> Additionally, since bootconfig is an administrative utility meant to be
> deployed on the target system, will permanently hardcoding $(HOSTCC) prevent
> users from cross-compiling it for their target devices?
This is also a good point. We need both a cross-built binary and a host binary.
Thank you,
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
* Re: [PATCH] rethook: Use tsk->on_cpu to check task execution state
From: Tengda Wu @ 2026-06-08 1:52 UTC (permalink / raw)
To: Masami Hiramatsu, Peter Zijlstra
Cc: Steven Rostedt, Mathieu Desnoyers, Alexei Starovoitov,
linux-trace-kernel, linux-kernel
In-Reply-To: <20260605224341.c926299d613b6102912c9a3f@kernel.org>
On 2026/6/5 21:43, Masami Hiramatsu wrote:
> On Thu, 4 Jun 2026 11:34:45 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
>
>> On Mon, Jun 01, 2026 at 08:40:01AM +0900, Masami Hiramatsu wrote:
>>
>>> Peter, is it OK to drop @rq from task_on_cpu()?
>>
>> Sure.
>>
>>> Then we can use it from rethook.
>>
>> Well, it is in sched/sched.h, which is an internal header, and no you
>> cannot use that header in rethook.
>
> Ah, OK. Hmm, then we should not use it. Maybe ->on_cpu is also internal
> state?
>
>>
>> But lets step back first, what is the actual problem here, why are we
>> looking at ->on_cpu at all?
>
> Tengda, can you explain it?
> I think you want to take a stack trace of a !current process, and
> rethook_find_ret_addr() rejects it if the task is in the running state.
>
> But if you can share the actual situation you need this for, it would
> help us understand.
>
> Thank you,
>
>
Sure.
Background: We are verifying the support of live patches for functions that
have a kretprobe. The specific verification method is as follows:
We construct a function foo() that calls bar():
void bar(void)
{
for (;;) {
schedule();
}
}
void foo(void)
{
bar();
}
A kretprobe is attached to bar():
echo 'r:rp1 bar' > /sys/kernel/tracing/kprobe_events
echo 1 > /sys/kernel/tracing/events/kprobes/rp1/enable
Then foo() is triggered. The expected behavior is that bar() will call
schedule() and yield the CPU.
After that, the live patch is activated to attempt replacing the implementation
of foo(). The expectation is that this should succeed.
However, in reality, because the task that called schedule() is still in the
RUNNING state, the task_is_running(tsk) check inside rethook_find_ret_addr()
evaluates to true, causing the function to return early. This, in turn,
prevents stack_trace_save_tsk_reliable() from determining the stack as
reliable, leading to a failure in activating the live patch.
**Not sure if this is correct:**
We believe that after a task voluntarily calls schedule(), when the stack
is expected to be reliable, it is a safe time to activate a live patch.
Additionally, a similar tsk->on_cpu check can be found elsewhere in the
kernel (See task_on_another_cpu() in arch/x86/include/asm/unwind.h).
Therefore, we propose changing the task_is_running(tsk) condition to
tsk->on_cpu.
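A toy userspace model of the distinction described above, for illustration
only. It is not kernel code: struct toy_task, the TOY_* constants and the
toy_* helpers are hypothetical names made up for this sketch, and the
scheduler is reduced to two fields.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the scheduler state discussed above. */
#define TOY_TASK_RUNNING	0u	/* runnable: on a CPU or waiting for one */
#define TOY_TASK_SLEEPING	1u	/* blocked until woken up */

struct toy_task {
	unsigned int state;	/* models tsk->__state */
	int on_cpu;		/* models tsk->on_cpu */
};

/* Models task_is_running(): true for any runnable task, even off-CPU. */
static bool toy_task_is_running(const struct toy_task *t)
{
	assert(t != NULL);
	return t->state == TOY_TASK_RUNNING;
}

/* Models the proposed check: true only while actually executing. */
static bool toy_task_on_cpu(const struct toy_task *t)
{
	assert(t != NULL);
	return t->on_cpu != 0;
}

/* Models schedule() called without changing the task state first. */
static void toy_yield(struct toy_task *t)
{
	t->on_cpu = 0;	/* switched out, state stays TOY_TASK_RUNNING */
}

/* Models the scheduler picking the task again. */
static void toy_switch_in(struct toy_task *t)
{
	t->on_cpu = 1;
}
```

After toy_yield(), toy_task_is_running() is still true while
toy_task_on_cpu() is false, which corresponds to the window described
above. Nothing in the model stops toy_switch_in() from happening right
after the check, so a check of this kind is inherently racy.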
Thanks,
Tengda
* Re: [RFC PATCH v3 0/3] trace: stack trace deduplication for ftrace ring buffer
From: Li Pengfei @ 2026-06-08 2:06 UTC (permalink / raw)
To: rostedt, mhiramat
Cc: linux-trace-kernel, linux-kernel, cmllamas, zhangbo56, Pengfei Li
In-Reply-To: <cover.1779769138.git.lipengfei28@xiaomi.com>
From: Pengfei Li <lipengfei28@xiaomi.com>
Hi Steven, Masami,
Friendly ping on this v3 series. Let me know if there's anything
I should adjust or any direction concerns you'd like me to address.
Lore link:
https://lore.kernel.org/all/cover.1779769138.git.lipengfei28@xiaomi.com/
Pengfei
* Re: [PATCHv8 bpf-next 00/29] bpf: tracing_multi link
From: patchwork-bot+netdevbpf @ 2026-06-07 18:20 UTC (permalink / raw)
To: Jiri Olsa
Cc: ast, daniel, andrii, hengqi.chen, bpf, linux-trace-kernel,
martin.lau, eddyz87, songliubraving, yhs, menglong8.dong, rostedt
In-Reply-To: <20260606123955.345967-1-jolsa@kernel.org>
Hello:
This series was applied to bpf/bpf-next.git (master)
by Alexei Starovoitov <ast@kernel.org>:
On Sat, 6 Jun 2026 14:39:25 +0200 you wrote:
> hi,
> adding tracing_multi link support that allows fast attachment
> of tracing program to many functions.
>
> RFC: https://lore.kernel.org/bpf/20260203093819.2105105-1-jolsa@kernel.org/
> v1: https://lore.kernel.org/bpf/20260220100649.628307-1-jolsa@kernel.org/
> v2: https://lore.kernel.org/bpf/20260304222141.497203-1-jolsa@kernel.org/
> v3: https://lore.kernel.org/bpf/20260316075138.465430-1-jolsa@kernel.org/
> v4: https://lore.kernel.org/bpf/20260324081846.2334094-1-jolsa@kernel.org/
> v5: https://lore.kernel.org/bpf/20260417192502.194548-1-jolsa@kernel.org/
> v6: https://lore.kernel.org/bpf/20260527113951.46265-1-jolsa@kernel.org/
> v7: https://lore.kernel.org/bpf/20260603110554.29590-1-jolsa@kernel.org/
>
> [...]
Here is the summary with links:
- [PATCHv8,bpf-next,01/29] ftrace: Add ftrace_hash_count function
https://git.kernel.org/bpf/bpf-next/c/e57f13eaab25
- [PATCHv8,bpf-next,02/29] ftrace: Add ftrace_hash_remove function
https://git.kernel.org/bpf/bpf-next/c/af7c32365090
- [PATCHv8,bpf-next,03/29] ftrace: Add add_ftrace_hash_entry function
https://git.kernel.org/bpf/bpf-next/c/2cd298c106e0
- [PATCHv8,bpf-next,04/29] bpf: Use mutex lock pool for bpf trampolines
https://git.kernel.org/bpf/bpf-next/c/e6abd4cd157b
- [PATCHv8,bpf-next,05/29] bpf: Add struct bpf_trampoline_ops object
https://git.kernel.org/bpf/bpf-next/c/8a35e8db740f
- [PATCHv8,bpf-next,06/29] bpf: Move trampoline image setup into bpf_trampoline_ops callbacks
https://git.kernel.org/bpf/bpf-next/c/bf4bc3e11c41
- [PATCHv8,bpf-next,07/29] bpf: Add bpf_trampoline_add/remove_prog functions
https://git.kernel.org/bpf/bpf-next/c/e6cc9ed677e6
- [PATCHv8,bpf-next,08/29] bpf: Add struct bpf_tramp_node object
https://git.kernel.org/bpf/bpf-next/c/65499074efaf
- [PATCHv8,bpf-next,09/29] bpf: Factor fsession link to use struct bpf_tramp_node
https://git.kernel.org/bpf/bpf-next/c/880db5d4abb2
- [PATCHv8,bpf-next,10/29] bpf: Add multi tracing attach types
https://git.kernel.org/bpf/bpf-next/c/d14e6b4346bf
- [PATCHv8,bpf-next,11/29] bpf: Move sleepable verification code to btf_id_allow_sleepable
https://git.kernel.org/bpf/bpf-next/c/bd06659d3b8a
- [PATCHv8,bpf-next,12/29] bpf: Add bpf_trampoline_multi_attach/detach functions
https://git.kernel.org/bpf/bpf-next/c/aef4dfa790b2
- [PATCHv8,bpf-next,13/29] bpf: Add support for tracing multi link
https://git.kernel.org/bpf/bpf-next/c/c1d32dea5d46
- [PATCHv8,bpf-next,14/29] bpf: Add support for tracing_multi link cookies
https://git.kernel.org/bpf/bpf-next/c/46b42af27d40
- [PATCHv8,bpf-next,15/29] bpf: Add support for tracing_multi link session
https://git.kernel.org/bpf/bpf-next/c/ba042ed6446f
- [PATCHv8,bpf-next,16/29] bpf: Add support for tracing_multi link fdinfo
https://git.kernel.org/bpf/bpf-next/c/8abecdafd575
- [PATCHv8,bpf-next,17/29] libbpf: Add bpf_object_cleanup_btf function
https://git.kernel.org/bpf/bpf-next/c/fe9c8cb2b52b
- [PATCHv8,bpf-next,18/29] libbpf: Add bpf_link_create support for tracing_multi link
https://git.kernel.org/bpf/bpf-next/c/630e85a9f005
- [PATCHv8,bpf-next,19/29] libbpf: Add btf_type_is_traceable_func function
https://git.kernel.org/bpf/bpf-next/c/616a93b473a6
- [PATCHv8,bpf-next,20/29] libbpf: Add support to create tracing multi link
https://git.kernel.org/bpf/bpf-next/c/f2aa370dfe57
- [PATCHv8,bpf-next,21/29] selftests/bpf: Add tracing multi skel/pattern/ids attach tests
https://git.kernel.org/bpf/bpf-next/c/2922dd58413c
- [PATCHv8,bpf-next,22/29] selftests/bpf: Add tracing multi skel/pattern/ids module attach tests
https://git.kernel.org/bpf/bpf-next/c/2863f074f146
- [PATCHv8,bpf-next,23/29] selftests/bpf: Add tracing multi intersect tests
https://git.kernel.org/bpf/bpf-next/c/4309f580a0a6
- [PATCHv8,bpf-next,24/29] selftests/bpf: Add tracing multi cookies test
https://git.kernel.org/bpf/bpf-next/c/1b938f42f5fa
- [PATCHv8,bpf-next,25/29] selftests/bpf: Add tracing multi session test
https://git.kernel.org/bpf/bpf-next/c/69f25d4b0c17
- [PATCHv8,bpf-next,26/29] selftests/bpf: Add tracing multi attach fails test
https://git.kernel.org/bpf/bpf-next/c/1fd832854997
- [PATCHv8,bpf-next,27/29] selftests/bpf: Add tracing multi verifier fails test
https://git.kernel.org/bpf/bpf-next/c/443c91d08c4b
- [PATCHv8,bpf-next,28/29] selftests/bpf: Add tracing multi attach benchmark test
(no matching commit)
- [PATCHv8,bpf-next,29/29] selftests/bpf: Add tracing multi attach rollback tests
https://git.kernel.org/bpf/bpf-next/c/b349efe49a12
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
* Re: [PATCHv8 bpf-next 28/29] selftests/bpf: Add tracing multi attach benchmark test
From: Alexei Starovoitov @ 2026-06-07 18:13 UTC (permalink / raw)
To: Jiri Olsa, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
Cc: bpf, linux-trace-kernel, Martin KaFai Lau, Eduard Zingerman,
Song Liu, Yonghong Song, Menglong Dong, Steven Rostedt
In-Reply-To: <20260606123955.345967-29-jolsa@kernel.org>
On Sat Jun 6, 2026 at 5:39 AM PDT, Jiri Olsa wrote:
> Adding benchmark test that attaches to (almost) all allowed tracing
> functions and display attach/detach times.
>
> # ./test_progs -t tracing_multi_bench_attach -v
> bpf_testmod.ko is already unloaded.
> Loading bpf_testmod.ko...
> Successfully loaded bpf_testmod.ko.
> serial_test_tracing_multi_bench_attach:PASS:btf__load_vmlinux_btf 0 nsec
> serial_test_tracing_multi_bench_attach:PASS:tracing_multi_bench__open_and_load 0 nsec
> serial_test_tracing_multi_bench_attach:PASS:get_syms 0 nsec
> serial_test_tracing_multi_bench_attach:PASS:bpf_program__attach_tracing_multi 0 nsec
> serial_test_tracing_multi_bench_attach: found 51186 functions
> serial_test_tracing_multi_bench_attach: attached in 1.295s
> serial_test_tracing_multi_bench_attach: detached in 0.243s
...
> + if (!ASSERT_OK(bpf_get_ksyms(&ksyms, true), "get_syms"))
> + goto cleanup;
> +
> + /* Get all ftrace 'safe' symbols.. */
> + for (i = 0; i < ksyms->filtered_cnt; i++) {
> + if (!tsearch(&ksyms->filtered_syms[i], &root, compare)) {
> + ASSERT_FAIL("tsearch failed");
> + goto cleanup;
> + }
> + }
> +
> + /* ..and filter them through BTF and btf_type_is_traceable_func. */
> + nr = btf__type_cnt(btf);
> + for (type_id = 1; type_id < nr; type_id++) {
> + const struct btf_type *type;
> + const char *str;
> +
> + type = btf__type_by_id(btf, type_id);
> + if (!type)
> + break;
> +
> + if (BTF_INFO_KIND(type->info) != BTF_KIND_FUNC)
> + continue;
> +
> + str = btf__name_by_offset(btf, type->name_off);
> + if (!str)
> + break;
> +
> + if (!tfind(&str, &root, compare))
> + continue;
> +
> + if (!btf_type_is_traceable_func(btf, type))
> + continue;
> +
> + err = libbpf_ensure_mem((void **) &ids, &cap, sizeof(*ids), cnt + 1);
> + if (err)
> + goto cleanup;
> +
> + ids[cnt++] = type_id;
> + }
This filtering wasn't enough.
I've added removal of duplicates here while applying:
+ /*
+ * Collect names that are not unique in kallsyms. The kernel resolves a
+ * tracing-multi BTF id to an address with kallsyms_lookup_name(), which
+ * returns the first symbol of that name. For a duplicate name that may
+ * be a different (non-ftrace-able) instance than the ftrace-able one in
+ * available_filter_functions, so attaching to it by BTF id fails with
+ * -ENOENT (e.g. t_start/t_next/t_stop). ksyms->syms is sorted by name,
+ * so equal names are adjacent.
+ */
+ for (i = 1; i < ksyms->sym_cnt; i++) {
+ if (strcmp(ksyms->syms[i].name, ksyms->syms[i - 1].name))
+ continue;
+ if (!tsearch(&ksyms->syms[i].name, &dups, compare)) {
+ ASSERT_FAIL("tsearch failed");
+ goto cleanup;
+ }
+ }
+ /* Skip names that are not unique in kallsyms, see above. */
+ if (tfind(&str, &dups, compare))
+ continue;
As Claude explains it:
----
1. The kernel attaches tracing_multi by BTF id. To get an address it resolves
the BTF function name via kallsyms_lookup_name(tname) and requires
ftrace_location(addr) — kernel/bpf/verifier.c:19380:
addr = kallsyms_lookup_name(tname);
...
if (!addr || !ftrace_location(addr))
return -ENOENT;
2. t_start/t_next/t_stop each have 5 instances in this kernel. Only one is
ftrace-able — the copies in kernel/trace/* are built notrace (ftrace's Makefile
strips -pg), so only the unrelated copy is in available_filter_functions:
3. kallsyms_lookup_name() returns the lowest-address instance among equal names
(exact strcmp, lowest seq). That instance has no fentry → ftrace_location()
returns 0 → -ENOENT, which aborts the whole all-or-nothing bench attach.
Why the bench includes them: it intersects BTF FUNC names with
available_filter_functions names. Since some t_start is ftrace-able, the name
passes the filter — but the kernel resolves the wrong (non-ftrace-able)
t_start. The author's kernel apparently had the ftrace-able copy at the lowest
address, so it passed there.
This is a pre-existing limitation, not multi-specific: single fentry attach by
BTF id uses the same kallsyms_lookup_name(tname) path (verifier.c:19120) — you
can't reliably fentry-attach to any duplicate-named function on this kernel
either.
----
Maybe we should adjust bpf_get_ksyms() instead. Not sure.
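A toy model of the failure mode, for readers following along. It is not
kernel or selftest code: struct toy_sym, toy_lookup_first() and
toy_name_is_dup() are made-up names, and "the first entry with a matching
name wins" is an assumption taken from the explanation above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical symbol entry; 'traceable' models "has an ftrace location". */
struct toy_sym {
	const char *name;
	unsigned long addr;
	bool traceable;
};

/* Models a by-name lookup that returns the first entry with that name. */
static const struct toy_sym *toy_lookup_first(const struct toy_sym *syms,
					      size_t n, const char *name)
{
	assert(syms != NULL && name != NULL);
	for (size_t i = 0; i < n; i++) {
		if (strcmp(syms[i].name, name) == 0)
			return &syms[i];
	}
	return NULL;
}

/* True if 'name' occurs more than once; such names are skipped by the fix. */
static bool toy_name_is_dup(const struct toy_sym *syms, size_t n,
			    const char *name)
{
	size_t hits = 0;

	assert(syms != NULL && name != NULL);
	for (size_t i = 0; i < n; i++) {
		if (strcmp(syms[i].name, name) == 0)
			hits++;
	}
	return hits > 1;
}
```

With two t_start entries where only the second one is traceable,
toy_lookup_first() hands back the non-traceable one, so filtering on
"some instance of this name is traceable" is not sufficient. Skipping
every name for which toy_name_is_dup() is true sidesteps the ambiguity
at the cost of losing a few functions.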
After the fix in my gcc 15 kernel with KASAN:
serial_test_tracing_multi_bench_attach: found 49045 functions
serial_test_tracing_multi_bench_attach: attached in 4.115s
serial_test_tracing_multi_bench_attach: detached in 2.313s
* [PATCH v3 9/9] selftests/verification: add tlob selftests
From: wen.yang @ 2026-06-07 16:13 UTC (permalink / raw)
To: Gabriele Monaco
Cc: Steven Rostedt, linux-trace-kernel, linux-kernel, Wen Yang
In-Reply-To: <cover.1780847473.git.wen.yang@linux.dev>
From: Wen Yang <wen.yang@linux.dev>
Add selftest coverage for the tlob uprobe monitoring interface under
tools/testing/selftests/verification/.
test.d/tlob/ contains both the helper sources (tlob_target, tlob_sym)
and the seven test scripts so the test suite is self-contained.
tlob_target provides busy-spin, sleep, and preempt workloads; tlob_sym
resolves ELF symbol offsets for uprobe registration.
Seven test scripts exercise uprobe binding management, budget violation
detection, and per-state time accounting (running_ns, waiting_ns,
sleeping_ns).
Signed-off-by: Wen Yang <wen.yang@linux.dev>
---
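Note for reviewers (not part of the commit): the core of tlob_sym is the
translation from a symbol's virtual address to a file offset through the
PT_LOAD segment that contains it. A minimal sketch of that step follows.
struct toy_load_seg and toy_vaddr_to_offset() are hypothetical names used
only here; the helper in the patch works on the real ELF program headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down view of a PT_LOAD program header. */
struct toy_load_seg {
	uint64_t vaddr;		/* p_vaddr: virtual address of the segment */
	uint64_t offset;	/* p_offset: start of the segment in the file */
	uint64_t filesz;	/* p_filesz: bytes backed by the file */
};

/*
 * Translate a virtual address to a file offset.
 * Returns 0 and stores the result in *out on success, or -1 if no
 * segment covers the address.
 */
static int toy_vaddr_to_offset(const struct toy_load_seg *segs, size_t nsegs,
			       uint64_t vaddr, uint64_t *out)
{
	assert(out != NULL);
	assert(segs != NULL || nsegs == 0);

	for (size_t i = 0; i < nsegs; i++) {
		const struct toy_load_seg *s = &segs[i];

		/* Subtract first so the range test cannot overflow. */
		if (vaddr >= s->vaddr && vaddr - s->vaddr < s->filesz) {
			*out = vaddr - s->vaddr + s->offset;
			return 0;
		}
	}
	return -1;
}
```

Unlike the patch, the sketch reports "no segment found" through the return
value instead of treating a file offset of 0 as the failure marker.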
.../testing/selftests/verification/.gitignore | 2 +
tools/testing/selftests/verification/Makefile | 19 +-
.../verification/test.d/tlob/Makefile | 20 ++
.../verification/test.d/tlob/test.d/functions | 1 +
.../verification/test.d/tlob/tlob_sym.c | 189 ++++++++++++++++++
.../verification/test.d/tlob/tlob_target.c | 138 +++++++++++++
.../verification/test.d/tlob/uprobe_bind.tc | 37 ++++
.../test.d/tlob/uprobe_detail_running.tc | 51 +++++
.../test.d/tlob/uprobe_detail_sleeping.tc | 50 +++++
.../test.d/tlob/uprobe_detail_waiting.tc | 66 ++++++
.../verification/test.d/tlob/uprobe_multi.tc | 64 ++++++
.../test.d/tlob/uprobe_no_event.tc | 19 ++
.../test.d/tlob/uprobe_violation.tc | 67 +++++++
13 files changed, 722 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/verification/test.d/tlob/Makefile
create mode 100644 tools/testing/selftests/verification/test.d/tlob/test.d/functions
create mode 100644 tools/testing/selftests/verification/test.d/tlob/tlob_sym.c
create mode 100644 tools/testing/selftests/verification/test.d/tlob/tlob_target.c
create mode 100644 tools/testing/selftests/verification/test.d/tlob/uprobe_bind.tc
create mode 100644 tools/testing/selftests/verification/test.d/tlob/uprobe_detail_running.tc
create mode 100644 tools/testing/selftests/verification/test.d/tlob/uprobe_detail_sleeping.tc
create mode 100644 tools/testing/selftests/verification/test.d/tlob/uprobe_detail_waiting.tc
create mode 100644 tools/testing/selftests/verification/test.d/tlob/uprobe_multi.tc
create mode 100644 tools/testing/selftests/verification/test.d/tlob/uprobe_no_event.tc
create mode 100644 tools/testing/selftests/verification/test.d/tlob/uprobe_violation.tc
diff --git a/tools/testing/selftests/verification/.gitignore b/tools/testing/selftests/verification/.gitignore
index 2659417cb2c7..cbbd03ee16c7 100644
--- a/tools/testing/selftests/verification/.gitignore
+++ b/tools/testing/selftests/verification/.gitignore
@@ -1,2 +1,4 @@
# SPDX-License-Identifier: GPL-2.0-only
logs
+test.d/tlob/tlob_sym
+test.d/tlob/tlob_target
diff --git a/tools/testing/selftests/verification/Makefile b/tools/testing/selftests/verification/Makefile
index aa8790c22a71..0b32bdfdb8db 100644
--- a/tools/testing/selftests/verification/Makefile
+++ b/tools/testing/selftests/verification/Makefile
@@ -1,8 +1,25 @@
# SPDX-License-Identifier: GPL-2.0
-all:
TEST_PROGS := verificationtest-ktap
TEST_FILES := test.d settings
EXTRA_CLEAN := $(OUTPUT)/logs/*
+# Subdirectories that provide binaries used by the test runner.
+# Each entry must contain a Makefile that accepts OUTDIR= and
+# deposits its binaries there.
+BUILD_SUBDIRS := test.d/tlob
+
include ../lib.mk
+
+all: $(patsubst %,_build_%,$(BUILD_SUBDIRS))
+
+clean: $(patsubst %,_clean_%,$(BUILD_SUBDIRS))
+
+.PHONY: $(patsubst %,_build_%,$(BUILD_SUBDIRS)) \
+ $(patsubst %,_clean_%,$(BUILD_SUBDIRS))
+
+$(patsubst %,_build_%,$(BUILD_SUBDIRS)): _build_%:
+ $(MAKE) -C $* OUTDIR="$(OUTPUT)" TOOLS_INCLUDES="$(TOOLS_INCLUDES)"
+
+$(patsubst %,_clean_%,$(BUILD_SUBDIRS)): _clean_%:
+ $(MAKE) -C $* OUTDIR="$(OUTPUT)" clean
diff --git a/tools/testing/selftests/verification/test.d/tlob/Makefile b/tools/testing/selftests/verification/test.d/tlob/Makefile
new file mode 100644
index 000000000000..29b3519b255f
--- /dev/null
+++ b/tools/testing/selftests/verification/test.d/tlob/Makefile
@@ -0,0 +1,20 @@
+# SPDX-License-Identifier: GPL-2.0
+# Builds tlob selftest helper binaries in the directory of this Makefile.
+#
+# Invoked by ../../Makefile via BUILD_SUBDIRS; outputs tlob_sym and
+# tlob_target alongside the .tc scripts so they are self-contained.
+
+CFLAGS += $(TOOLS_INCLUDES)
+
+.PHONY: all
+all: tlob_sym tlob_target
+
+tlob_sym: tlob_sym.c
+ $(CC) $(CFLAGS) -o $@ $<
+
+tlob_target: tlob_target.c
+ $(CC) $(CFLAGS) -o $@ $<
+
+.PHONY: clean
+clean:
+ $(RM) tlob_sym tlob_target
diff --git a/tools/testing/selftests/verification/test.d/tlob/test.d/functions b/tools/testing/selftests/verification/test.d/tlob/test.d/functions
new file mode 100644
index 000000000000..0b4c5e4344d2
--- /dev/null
+++ b/tools/testing/selftests/verification/test.d/tlob/test.d/functions
@@ -0,0 +1 @@
+. "${TOP_DIR%/*}/functions"
diff --git a/tools/testing/selftests/verification/test.d/tlob/tlob_sym.c b/tools/testing/selftests/verification/test.d/tlob/tlob_sym.c
new file mode 100644
index 000000000000..1b7ba1c6d95b
--- /dev/null
+++ b/tools/testing/selftests/verification/test.d/tlob/tlob_sym.c
@@ -0,0 +1,189 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * tlob_sym.c - ELF symbol-to-file-offset utility for tlob selftests
+ *
+ * Usage: tlob_sym sym_offset <binary> <symbol>
+ *
+ * Prints the ELF file offset of <symbol> in <binary> to stdout.
+ *
+ * Exit: 0 = found, 1 = error / not found.
+ */
+#include <elf.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/mman.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+static int sym_offset(const char *binary, const char *symname)
+{
+ int fd;
+ struct stat st;
+ void *map;
+ Elf64_Ehdr *ehdr;
+ Elf32_Ehdr *ehdr32;
+ int is64;
+ uint64_t sym_vaddr = 0;
+ int found = 0;
+ uint64_t file_offset = 0;
+
+ fd = open(binary, O_RDONLY);
+ if (fd < 0) {
+ fprintf(stderr, "open %s: %s\n", binary, strerror(errno));
+ return 1;
+ }
+ if (fstat(fd, &st) < 0) {
+ close(fd);
+ return 1;
+ }
+ map = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
+ close(fd);
+ if (map == MAP_FAILED) {
+ fprintf(stderr, "mmap: %s\n", strerror(errno));
+ return 1;
+ }
+
+ ehdr = (Elf64_Ehdr *)map;
+ ehdr32 = (Elf32_Ehdr *)map;
+ if (st.st_size < 4 ||
+ ehdr->e_ident[EI_MAG0] != ELFMAG0 ||
+ ehdr->e_ident[EI_MAG1] != ELFMAG1 ||
+ ehdr->e_ident[EI_MAG2] != ELFMAG2 ||
+ ehdr->e_ident[EI_MAG3] != ELFMAG3) {
+ fprintf(stderr, "%s: not an ELF file\n", binary);
+ munmap(map, (size_t)st.st_size);
+ return 1;
+ }
+ is64 = (ehdr->e_ident[EI_CLASS] == ELFCLASS64);
+
+ if (is64) {
+ Elf64_Shdr *shdrs = (Elf64_Shdr *)((char *)map + ehdr->e_shoff);
+ Elf64_Shdr *shstrtab_hdr = &shdrs[ehdr->e_shstrndx];
+ const char *shstrtab = (char *)map + shstrtab_hdr->sh_offset;
+ int si;
+
+ for (int pass = 0; pass < 2 && !found; pass++) {
+ const char *target = pass ? ".dynsym" : ".symtab";
+
+ for (si = 0; si < ehdr->e_shnum && !found; si++) {
+ Elf64_Shdr *sh = &shdrs[si];
+ const char *name = shstrtab + sh->sh_name;
+
+ if (strcmp(name, target) != 0)
+ continue;
+
+ Elf64_Shdr *strtab_sh = &shdrs[sh->sh_link];
+ const char *strtab = (char *)map + strtab_sh->sh_offset;
+ Elf64_Sym *syms = (Elf64_Sym *)((char *)map + sh->sh_offset);
+ uint64_t nsyms = sh->sh_size / sizeof(Elf64_Sym);
+ uint64_t j;
+
+ for (j = 0; j < nsyms; j++) {
+ if (strcmp(strtab + syms[j].st_name, symname) == 0) {
+ sym_vaddr = syms[j].st_value;
+ found = 1;
+ break;
+ }
+ }
+ }
+ }
+
+ if (!found) {
+ fprintf(stderr, "symbol '%s' not found in %s\n", symname, binary);
+ munmap(map, (size_t)st.st_size);
+ return 1;
+ }
+
+ Elf64_Phdr *phdrs = (Elf64_Phdr *)((char *)map + ehdr->e_phoff);
+ int pi;
+
+ for (pi = 0; pi < ehdr->e_phnum; pi++) {
+ Elf64_Phdr *ph = &phdrs[pi];
+
+ if (ph->p_type != PT_LOAD)
+ continue;
+ if (sym_vaddr >= ph->p_vaddr &&
+ sym_vaddr < ph->p_vaddr + ph->p_filesz) {
+ file_offset = sym_vaddr - ph->p_vaddr + ph->p_offset;
+ break;
+ }
+ }
+ } else {
+ Elf32_Shdr *shdrs = (Elf32_Shdr *)((char *)map + ehdr32->e_shoff);
+ Elf32_Shdr *shstrtab_hdr = &shdrs[ehdr32->e_shstrndx];
+ const char *shstrtab = (char *)map + shstrtab_hdr->sh_offset;
+ int si;
+ uint32_t sym_vaddr32 = 0;
+
+ for (int pass = 0; pass < 2 && !found; pass++) {
+ const char *target = pass ? ".dynsym" : ".symtab";
+
+ for (si = 0; si < ehdr32->e_shnum && !found; si++) {
+ Elf32_Shdr *sh = &shdrs[si];
+ const char *name = shstrtab + sh->sh_name;
+
+ if (strcmp(name, target) != 0)
+ continue;
+
+ Elf32_Shdr *strtab_sh = &shdrs[sh->sh_link];
+ const char *strtab = (char *)map + strtab_sh->sh_offset;
+ Elf32_Sym *syms = (Elf32_Sym *)((char *)map + sh->sh_offset);
+ uint32_t nsyms = sh->sh_size / sizeof(Elf32_Sym);
+ uint32_t j;
+
+ for (j = 0; j < nsyms; j++) {
+ if (strcmp(strtab + syms[j].st_name, symname) == 0) {
+ sym_vaddr32 = syms[j].st_value;
+ found = 1;
+ break;
+ }
+ }
+ }
+ }
+
+ if (!found) {
+ fprintf(stderr, "symbol '%s' not found in %s\n", symname, binary);
+ munmap(map, (size_t)st.st_size);
+ return 1;
+ }
+
+ Elf32_Phdr *phdrs = (Elf32_Phdr *)((char *)map + ehdr32->e_phoff);
+ int pi;
+
+ for (pi = 0; pi < ehdr32->e_phnum; pi++) {
+ Elf32_Phdr *ph = &phdrs[pi];
+
+ if (ph->p_type != PT_LOAD)
+ continue;
+ if (sym_vaddr32 >= ph->p_vaddr &&
+ sym_vaddr32 < ph->p_vaddr + ph->p_filesz) {
+ file_offset = sym_vaddr32 - ph->p_vaddr + ph->p_offset;
+ break;
+ }
+ }
+ sym_vaddr = sym_vaddr32;
+ }
+
+ munmap(map, (size_t)st.st_size);
+
+ if (!file_offset && sym_vaddr) {
+ fprintf(stderr, "could not map vaddr 0x%lx to file offset\n",
+ (unsigned long)sym_vaddr);
+ return 1;
+ }
+
+ printf("0x%lx\n", (unsigned long)file_offset);
+ return 0;
+}
+
+int main(int argc, char *argv[])
+{
+ if (argc != 4 || strcmp(argv[1], "sym_offset") != 0) {
+ fprintf(stderr, "Usage: %s sym_offset <binary> <symbol>\n", argv[0]);
+ return 1;
+ }
+ return sym_offset(argv[2], argv[3]);
+}
diff --git a/tools/testing/selftests/verification/test.d/tlob/tlob_target.c b/tools/testing/selftests/verification/test.d/tlob/tlob_target.c
new file mode 100644
index 000000000000..0fdbc575d71d
--- /dev/null
+++ b/tools/testing/selftests/verification/test.d/tlob/tlob_target.c
@@ -0,0 +1,138 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * tlob_target.c - uprobe target binary for tlob selftests.
+ *
+ * Provides three start/stop probe pairs, each designed to exercise a
+ * different dominant component of the detail_env_tlob ns breakdown:
+ *
+ * tlob_busy_work / tlob_busy_work_done - busy-spin: running_ns dominates
+ * tlob_sleep_work / tlob_sleep_work_done - nanosleep: sleeping_ns dominates
+ * tlob_preempt_work / tlob_preempt_work_done - busy-spin: waiting_ns dominates
+ * (needs an RT competitor on the same CPU)
+ *
+ * Usage: tlob_target <duration_ms> [mode]
+ *
+ * mode is one of: busy (default), sleep, preempt.
+ * Loops in 200 ms iterations until <duration_ms> has elapsed
+ * (0 = run for ~24 hours).
+ */
+#define _GNU_SOURCE
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <time.h>
+
+#ifndef noinline
+#define noinline __attribute__((noinline))
+#endif
+
+static inline int timespec_before(const struct timespec *a,
+ const struct timespec *b)
+{
+ return a->tv_sec < b->tv_sec ||
+ (a->tv_sec == b->tv_sec && a->tv_nsec < b->tv_nsec);
+}
+
+static void timespec_add_ms(struct timespec *ts, unsigned long ms)
+{
+ ts->tv_sec += ms / 1000;
+ ts->tv_nsec += (long)(ms % 1000) * 1000000L;
+ if (ts->tv_nsec >= 1000000000L) {
+ ts->tv_sec++;
+ ts->tv_nsec -= 1000000000L;
+ }
+}
+
+/* stop probe; noinline keeps the entry point visible to uprobes */
+noinline void tlob_busy_work_done(void)
+{
+ /* empty: uprobe fires on entry */
+}
+
+/* start probe; busy-spin so running_ns dominates */
+noinline void tlob_busy_work(unsigned long duration_ns)
+{
+ struct timespec start, now;
+ unsigned long elapsed;
+
+ clock_gettime(CLOCK_MONOTONIC, &start);
+ do {
+ clock_gettime(CLOCK_MONOTONIC, &now);
+ elapsed = (unsigned long)(now.tv_sec - start.tv_sec)
+ * 1000000000UL
+ + (unsigned long)(now.tv_nsec - start.tv_nsec);
+ } while (elapsed < duration_ns);
+
+ tlob_busy_work_done();
+}
+
+/* stop probe; noinline keeps the entry point visible to uprobes */
+noinline void tlob_sleep_work_done(void)
+{
+ /* empty: uprobe fires on entry */
+}
+
+/* start probe; nanosleep so sleeping_ns dominates */
+noinline void tlob_sleep_work(unsigned long duration_ms)
+{
+ struct timespec ts = {
+ .tv_sec = duration_ms / 1000,
+ .tv_nsec = (long)(duration_ms % 1000) * 1000000L,
+ };
+ nanosleep(&ts, NULL);
+ tlob_sleep_work_done();
+}
+
+/* stop probe; noinline keeps the entry point visible to uprobes */
+noinline void tlob_preempt_work_done(void)
+{
+ /* empty: uprobe fires on entry */
+}
+
+/*
+ * start probe; busy-spin so an RT competitor on the same CPU drives
+ * waiting_ns (prev_state==0 -> preempt event, task stays runnable off-CPU).
+ */
+noinline void tlob_preempt_work(unsigned long duration_ms)
+{
+ struct timespec start, now;
+ unsigned long elapsed;
+
+ clock_gettime(CLOCK_MONOTONIC, &start);
+ do {
+ clock_gettime(CLOCK_MONOTONIC, &now);
+ elapsed = (unsigned long)(now.tv_sec - start.tv_sec)
+ * 1000000000UL
+ + (unsigned long)(now.tv_nsec - start.tv_nsec);
+ } while (elapsed < duration_ms * 1000000UL);
+
+ tlob_preempt_work_done();
+}
+
+int main(int argc, char *argv[])
+{
+ unsigned long duration_ms = 0;
+ const char *mode = "busy";
+ struct timespec deadline, now;
+
+ if (argc >= 2)
+ duration_ms = strtoul(argv[1], NULL, 10);
+ if (argc >= 3)
+ mode = argv[2];
+
+ clock_gettime(CLOCK_MONOTONIC, &deadline);
+ timespec_add_ms(&deadline, duration_ms ? duration_ms : 86400000UL);
+
+ do {
+ if (strcmp(mode, "sleep") == 0)
+ tlob_sleep_work(200);
+ else if (strcmp(mode, "preempt") == 0)
+ tlob_preempt_work(200);
+ else
+ tlob_busy_work(200 * 1000000UL);
+ clock_gettime(CLOCK_MONOTONIC, &now);
+ } while (timespec_before(&now, &deadline));
+
+ return 0;
+}
diff --git a/tools/testing/selftests/verification/test.d/tlob/uprobe_bind.tc b/tools/testing/selftests/verification/test.d/tlob/uprobe_bind.tc
new file mode 100644
index 000000000000..1ac3db6ca7bb
--- /dev/null
+++ b/tools/testing/selftests/verification/test.d/tlob/uprobe_bind.tc
@@ -0,0 +1,37 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0-or-later
+# description: Test tlob monitor uprobe binding (visible in monitor file, removable, duplicate rejected)
+# requires: tlob:monitor
+
+RV_BINDIR="${RV_BINDIR:-$(realpath "$(dirname "${1:-$0}")")}"
+UPROBE_TARGET="${RV_BINDIR}/tlob_target"
+TLOB_SYM="${RV_BINDIR}/tlob_sym"
+[ -x "$UPROBE_TARGET" ] || exit_unsupported
+[ -x "$TLOB_SYM" ] || exit_unsupported
+TLOB_MONITOR=monitors/tlob/monitor
+
+busy_offset=$("$TLOB_SYM" sym_offset "$UPROBE_TARGET" tlob_busy_work 2>/dev/null)
+stop_offset=$("$TLOB_SYM" sym_offset "$UPROBE_TARGET" tlob_busy_work_done 2>/dev/null)
+[ -n "$busy_offset" ] || exit_unsupported
+[ -n "$stop_offset" ] || exit_unsupported
+
+"$UPROBE_TARGET" 30000 &
+busy_pid=$!
+sleep 0.05
+
+echo 1 > monitors/tlob/enable
+echo "p ${UPROBE_TARGET}:${busy_offset} ${stop_offset} threshold=5000000000" > "$TLOB_MONITOR"
+
+# Binding must appear in monitor file with canonical hex-offset format.
+grep -qE "^p ${UPROBE_TARGET}:0x[0-9a-f]+ 0x[0-9a-f]+ threshold=[0-9]+$" "$TLOB_MONITOR"
+grep -q "threshold=5000000000" "$TLOB_MONITOR"
+
+# Duplicate offset_start must be rejected.
+! echo "p ${UPROBE_TARGET}:${busy_offset} ${stop_offset} threshold=9999000" > "$TLOB_MONITOR" 2>/dev/null
+
+# Remove the binding; it must no longer appear.
+echo "-${UPROBE_TARGET}:${busy_offset}" > "$TLOB_MONITOR"
+! grep -q "^p .*:0x${busy_offset#0x} " "$TLOB_MONITOR"
+
+kill "$busy_pid" 2>/dev/null || true; wait "$busy_pid" 2>/dev/null || true
+echo 0 > monitors/tlob/enable
diff --git a/tools/testing/selftests/verification/test.d/tlob/uprobe_detail_running.tc b/tools/testing/selftests/verification/test.d/tlob/uprobe_detail_running.tc
new file mode 100644
index 000000000000..2814caa34902
--- /dev/null
+++ b/tools/testing/selftests/verification/test.d/tlob/uprobe_detail_running.tc
@@ -0,0 +1,51 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0-or-later
+# description: Test tlob monitor detail running (running_ns dominates when task busy-spins between probes)
+# requires: tlob:monitor
+
+RV_BINDIR="${RV_BINDIR:-$(realpath "$(dirname "${1:-$0}")")}"
+UPROBE_TARGET="${RV_BINDIR}/tlob_target"
+TLOB_SYM="${RV_BINDIR}/tlob_sym"
+[ -x "$UPROBE_TARGET" ] || exit_unsupported
+[ -x "$TLOB_SYM" ] || exit_unsupported
+TLOB_MONITOR=monitors/tlob/monitor
+
+start_offset=$("$TLOB_SYM" sym_offset "$UPROBE_TARGET" tlob_busy_work 2>/dev/null)
+stop_offset=$("$TLOB_SYM" sym_offset "$UPROBE_TARGET" tlob_busy_work_done 2>/dev/null)
+[ -n "$start_offset" ] || exit_unsupported
+[ -n "$stop_offset" ] || exit_unsupported
+
+"$UPROBE_TARGET" 5000 &
+busy_pid=$!
+sleep 0.05
+
+echo 1 > /sys/kernel/tracing/events/rv/detail_env_tlob/enable
+echo 1 > /sys/kernel/tracing/tracing_on
+echo 1 > monitors/tlob/enable
+echo > /sys/kernel/tracing/trace
+
+# 10 µs budget; task busy-spins 200 ms per iteration -> running_ns dominates.
+echo "p ${UPROBE_TARGET}:${start_offset} ${stop_offset} threshold=10000" > "$TLOB_MONITOR"
+
+found=0; i=0
+while [ "$i" -lt 30 ]; do
+ sleep 0.1
+ grep -q "detail_env_tlob" /sys/kernel/tracing/trace && { found=1; break; }
+ i=$((i+1))
+done
+
+echo "-${UPROBE_TARGET}:${start_offset}" > "$TLOB_MONITOR" 2>/dev/null
+kill "$busy_pid" 2>/dev/null || true; wait "$busy_pid" 2>/dev/null || true
+echo 0 > /sys/kernel/tracing/events/rv/detail_env_tlob/enable
+echo 0 > monitors/tlob/enable
+
+[ "$found" = "1" ]
+
+line=$(grep "detail_env_tlob" /sys/kernel/tracing/trace | head -n 1)
+running=$(echo "$line" | sed 's/.*running_ns=\([0-9]*\).*/\1/')
+waiting=$(echo "$line" | sed 's/.*waiting_ns=\([0-9]*\).*/\1/')
+sleeping=$(echo "$line" | sed 's/.*sleeping_ns=\([0-9]*\).*/\1/')
+# Busy-spin keeps the task on-CPU: running_ns must exceed sleeping_ns.
+[ "$running" -gt "$sleeping" ]
+
+echo > /sys/kernel/tracing/trace
diff --git a/tools/testing/selftests/verification/test.d/tlob/uprobe_detail_sleeping.tc b/tools/testing/selftests/verification/test.d/tlob/uprobe_detail_sleeping.tc
new file mode 100644
index 000000000000..0a6470b4cadb
--- /dev/null
+++ b/tools/testing/selftests/verification/test.d/tlob/uprobe_detail_sleeping.tc
@@ -0,0 +1,50 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0-or-later
+# description: Test tlob monitor detail sleeping (sleeping_ns dominates when task blocks between probes)
+# requires: tlob:monitor
+
+RV_BINDIR="${RV_BINDIR:-$(realpath "$(dirname "${1:-$0}")")}"
+UPROBE_TARGET="${RV_BINDIR}/tlob_target"
+TLOB_SYM="${RV_BINDIR}/tlob_sym"
+[ -x "$UPROBE_TARGET" ] || exit_unsupported
+[ -x "$TLOB_SYM" ] || exit_unsupported
+TLOB_MONITOR=monitors/tlob/monitor
+
+start_offset=$("$TLOB_SYM" sym_offset "$UPROBE_TARGET" tlob_sleep_work 2>/dev/null)
+stop_offset=$("$TLOB_SYM" sym_offset "$UPROBE_TARGET" tlob_sleep_work_done 2>/dev/null)
+[ -n "$start_offset" ] || exit_unsupported
+[ -n "$stop_offset" ] || exit_unsupported
+
+"$UPROBE_TARGET" 5000 sleep &
+busy_pid=$!
+sleep 0.05
+
+echo 1 > /sys/kernel/tracing/events/rv/detail_env_tlob/enable
+echo 1 > /sys/kernel/tracing/tracing_on
+echo 1 > monitors/tlob/enable
+echo > /sys/kernel/tracing/trace
+
+# 50 ms budget; task sleeps 200 ms per iteration -> sleeping_ns dominates.
+echo "p ${UPROBE_TARGET}:${start_offset} ${stop_offset} threshold=50000000" > "$TLOB_MONITOR"
+
+found=0; i=0
+while [ "$i" -lt 30 ]; do
+ sleep 0.1
+ grep -q "detail_env_tlob" /sys/kernel/tracing/trace && { found=1; break; }
+ i=$((i+1))
+done
+
+echo "-${UPROBE_TARGET}:${start_offset}" > "$TLOB_MONITOR" 2>/dev/null
+kill "$busy_pid" 2>/dev/null || true; wait "$busy_pid" 2>/dev/null || true
+echo 0 > /sys/kernel/tracing/events/rv/detail_env_tlob/enable
+echo 0 > monitors/tlob/enable
+
+[ "$found" = "1" ]
+
+line=$(grep "detail_env_tlob" /sys/kernel/tracing/trace | head -n 1)
+running=$(echo "$line" | sed 's/.*running_ns=\([0-9]*\).*/\1/')
+waiting=$(echo "$line" | sed 's/.*waiting_ns=\([0-9]*\).*/\1/')
+sleeping=$(echo "$line" | sed 's/.*sleeping_ns=\([0-9]*\).*/\1/')
+[ "$sleeping" -gt "$((running + waiting))" ]
+
+echo > /sys/kernel/tracing/trace
diff --git a/tools/testing/selftests/verification/test.d/tlob/uprobe_detail_waiting.tc b/tools/testing/selftests/verification/test.d/tlob/uprobe_detail_waiting.tc
new file mode 100644
index 000000000000..ef22fce700fc
--- /dev/null
+++ b/tools/testing/selftests/verification/test.d/tlob/uprobe_detail_waiting.tc
@@ -0,0 +1,66 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0-or-later
+# description: Test tlob monitor detail waiting (waiting_ns dominates when task is preempted between probes)
+# requires: tlob:monitor
+
+RV_BINDIR="${RV_BINDIR:-$(realpath "$(dirname "${1:-$0}")")}"
+UPROBE_TARGET="${RV_BINDIR}/tlob_target"
+TLOB_SYM="${RV_BINDIR}/tlob_sym"
+[ -x "$UPROBE_TARGET" ] || exit_unsupported
+[ -x "$TLOB_SYM" ] || exit_unsupported
+TLOB_MONITOR=monitors/tlob/monitor
+
+command -v chrt > /dev/null || exit_unsupported
+command -v taskset > /dev/null || exit_unsupported
+
+start_offset=$("$TLOB_SYM" sym_offset "$UPROBE_TARGET" tlob_preempt_work 2>/dev/null)
+stop_offset=$("$TLOB_SYM" sym_offset "$UPROBE_TARGET" tlob_preempt_work_done 2>/dev/null)
+[ -n "$start_offset" ] || exit_unsupported
+[ -n "$stop_offset" ] || exit_unsupported
+
+cpu=0
+
+echo 1 > /sys/kernel/tracing/events/rv/detail_env_tlob/enable
+echo 1 > /sys/kernel/tracing/tracing_on
+echo 1 > monitors/tlob/enable
+echo > /sys/kernel/tracing/trace
+
+# Register probe before the target starts so the start uprobe fires on the
+# first entry to tlob_preempt_work. Budget: 500 ms.
+echo "p ${UPROBE_TARGET}:${start_offset} ${stop_offset} threshold=500000000" > "$TLOB_MONITOR"
+
+# Target starts; start probe fires on tlob_preempt_work entry.
+taskset -c "$cpu" "$UPROBE_TARGET" 5000 preempt &
+busy_pid=$!
+sleep 0.05
+
+# RT hog on the same CPU preempts the target; target stays in waiting state
+# (runnable, off-CPU) until the budget expires -> waiting_ns dominates.
+chrt -f 99 taskset -c "$cpu" sh -c 'while true; do :; done' 2>/dev/null &
+hog_pid=$!
+
+found=0; i=0
+while [ "$i" -lt 30 ]; do
+ sleep 0.1
+ grep -q "detail_env_tlob" /sys/kernel/tracing/trace && { found=1; break; }
+ i=$((i+1))
+done
+
+# Kill the RT hog first so tlob_target can release any in-flight SRCU read
+# section from uprobe_notify_resume; otherwise probe removal blocks in
+# synchronize_srcu with the hog monopolising the CPU at FIFO-99.
+kill "$hog_pid" 2>/dev/null || true; wait "$hog_pid" 2>/dev/null || true
+kill "$busy_pid" 2>/dev/null || true; wait "$busy_pid" 2>/dev/null || true
+echo "-${UPROBE_TARGET}:${start_offset}" > "$TLOB_MONITOR" 2>/dev/null
+echo 0 > /sys/kernel/tracing/events/rv/detail_env_tlob/enable
+echo 0 > monitors/tlob/enable
+
+[ "$found" = "1" ]
+
+line=$(grep "detail_env_tlob" /sys/kernel/tracing/trace | head -n 1)
+running=$(echo "$line" | sed 's/.*running_ns=\([0-9]*\).*/\1/')
+sleeping=$(echo "$line" | sed 's/.*sleeping_ns=\([0-9]*\).*/\1/')
+waiting=$(echo "$line" | sed 's/.*waiting_ns=\([0-9]*\).*/\1/')
+[ "$waiting" -gt "$((running + sleeping))" ]
+
+echo > /sys/kernel/tracing/trace
diff --git a/tools/testing/selftests/verification/test.d/tlob/uprobe_multi.tc b/tools/testing/selftests/verification/test.d/tlob/uprobe_multi.tc
new file mode 100644
index 000000000000..f1bd6c955f1d
--- /dev/null
+++ b/tools/testing/selftests/verification/test.d/tlob/uprobe_multi.tc
@@ -0,0 +1,64 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0-or-later
+# description: Test tlob monitor multiple uprobe bindings (different offsets fire independently)
+# requires: tlob:monitor
+
+RV_BINDIR="${RV_BINDIR:-$(realpath "$(dirname "${1:-$0}")")}"
+UPROBE_TARGET="${RV_BINDIR}/tlob_target"
+TLOB_SYM="${RV_BINDIR}/tlob_sym"
+[ -x "$UPROBE_TARGET" ] || exit_unsupported
+[ -x "$TLOB_SYM" ] || exit_unsupported
+TLOB_MONITOR=monitors/tlob/monitor
+
+busy_offset=$("$TLOB_SYM" sym_offset "$UPROBE_TARGET" tlob_busy_work 2>/dev/null)
+busy_stop=$("$TLOB_SYM" sym_offset "$UPROBE_TARGET" tlob_busy_work_done 2>/dev/null)
+sleep_offset=$("$TLOB_SYM" sym_offset "$UPROBE_TARGET" tlob_sleep_work 2>/dev/null)
+sleep_stop=$("$TLOB_SYM" sym_offset "$UPROBE_TARGET" tlob_sleep_work_done 2>/dev/null)
+[ -n "$busy_offset" ] || exit_unsupported
+[ -n "$busy_stop" ] || exit_unsupported
+[ -n "$sleep_offset" ] || exit_unsupported
+[ -n "$sleep_stop" ] || exit_unsupported
+
+"$UPROBE_TARGET" 30000 & # busy mode: tlob_busy_work fires every 200 ms
+busy_pid=$!
+"$UPROBE_TARGET" 30000 sleep & # sleep mode: tlob_sleep_work fires every 200 ms
+sleep_pid=$!
+sleep 0.05
+
+echo 1 > /sys/kernel/tracing/events/rv/error_env_tlob/enable
+echo 1 > /sys/kernel/tracing/events/rv/detail_env_tlob/enable
+echo 1 > /sys/kernel/tracing/tracing_on
+echo 1 > monitors/tlob/enable
+echo > /sys/kernel/tracing/trace
+
+# Binding A: 5 s budget on the busy probe - must not fire in 200 ms loops.
+echo "p ${UPROBE_TARGET}:${busy_offset} ${busy_stop} threshold=5000000000" > "$TLOB_MONITOR"
+# Binding B: 10 µs budget on the sleep probe - fires on first invocation.
+echo "p ${UPROBE_TARGET}:${sleep_offset} ${sleep_stop} threshold=10000" > "$TLOB_MONITOR"
+
+# Wait up to 2 s for error_env_tlob from binding B.
+found=0; i=0
+while [ "$i" -lt 20 ]; do
+ sleep 0.1
+ grep -q "error_env_tlob" /sys/kernel/tracing/trace && { found=1; break; }
+ i=$((i+1))
+done
+
+echo "-${UPROBE_TARGET}:${busy_offset}" > "$TLOB_MONITOR" 2>/dev/null
+echo "-${UPROBE_TARGET}:${sleep_offset}" > "$TLOB_MONITOR" 2>/dev/null
+kill "$sleep_pid" 2>/dev/null || true; wait "$sleep_pid" 2>/dev/null || true
+kill "$busy_pid" 2>/dev/null || true; wait "$busy_pid" 2>/dev/null || true
+
+echo 0 > monitors/tlob/enable
+echo 0 > /sys/kernel/tracing/events/rv/error_env_tlob/enable
+echo 0 > /sys/kernel/tracing/events/rv/detail_env_tlob/enable
+
+[ "$found" = "1" ]
+# error_env_tlob payload: clock variable must be present.
+# The event field can be "budget_exceeded" (hrtimer path) or the DA event
+# name ("sleep", "preempt") depending on which fires first; don't constrain it.
+grep "error_env_tlob" /sys/kernel/tracing/trace | head -n 1 | grep -q "clk_elapsed="
+# detail_env_tlob must appear alongside the error.
+grep -q "detail_env_tlob" /sys/kernel/tracing/trace
+
+echo > /sys/kernel/tracing/trace
diff --git a/tools/testing/selftests/verification/test.d/tlob/uprobe_no_event.tc b/tools/testing/selftests/verification/test.d/tlob/uprobe_no_event.tc
new file mode 100644
index 000000000000..a143635a60ce
--- /dev/null
+++ b/tools/testing/selftests/verification/test.d/tlob/uprobe_no_event.tc
@@ -0,0 +1,19 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0-or-later
+# description: Test tlob monitor no spurious events without active uprobe binding
+# requires: tlob:monitor
+
+TLOB_MONITOR=monitors/tlob/monitor
+
+echo 1 > /sys/kernel/tracing/events/rv/error_env_tlob/enable
+echo 1 > /sys/kernel/tracing/tracing_on
+echo 1 > monitors/tlob/enable
+echo > /sys/kernel/tracing/trace
+
+sleep 0.5
+
+! grep -q "error_env_tlob" /sys/kernel/tracing/trace
+
+echo 0 > monitors/tlob/enable
+echo 0 > /sys/kernel/tracing/events/rv/error_env_tlob/enable
+echo > /sys/kernel/tracing/trace
diff --git a/tools/testing/selftests/verification/test.d/tlob/uprobe_violation.tc b/tools/testing/selftests/verification/test.d/tlob/uprobe_violation.tc
new file mode 100644
index 000000000000..d210d9c3a92d
--- /dev/null
+++ b/tools/testing/selftests/verification/test.d/tlob/uprobe_violation.tc
@@ -0,0 +1,67 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0-or-later
+# description: Test tlob monitor budget violation (error_env_tlob and detail_env_tlob fire with correct fields)
+# requires: tlob:monitor
+
+RV_BINDIR="${RV_BINDIR:-$(realpath "$(dirname "${1:-$0}")")}"
+UPROBE_TARGET="${RV_BINDIR}/tlob_target"
+TLOB_SYM="${RV_BINDIR}/tlob_sym"
+[ -x "$UPROBE_TARGET" ] || exit_unsupported
+[ -x "$TLOB_SYM" ] || exit_unsupported
+TLOB_MONITOR=monitors/tlob/monitor
+
+busy_offset=$("$TLOB_SYM" sym_offset "$UPROBE_TARGET" tlob_busy_work 2>/dev/null)
+stop_offset=$("$TLOB_SYM" sym_offset "$UPROBE_TARGET" tlob_busy_work_done 2>/dev/null)
+[ -n "$busy_offset" ] || exit_unsupported
+[ -n "$stop_offset" ] || exit_unsupported
+
+"$UPROBE_TARGET" 30000 &
+busy_pid=$!
+sleep 0.05
+
+echo 1 > /sys/kernel/tracing/events/rv/error_env_tlob/enable
+echo 1 > /sys/kernel/tracing/events/rv/detail_env_tlob/enable
+echo 1 > /sys/kernel/tracing/tracing_on
+echo 1 > monitors/tlob/enable
+echo > /sys/kernel/tracing/trace
+
+# 10 µs budget - fires almost immediately; task is busy-spinning on-CPU.
+echo "p ${UPROBE_TARGET}:${busy_offset} ${stop_offset} threshold=10000" > "$TLOB_MONITOR"
+
+# wait up to 2 s for detail_env_tlob
+found=0; i=0
+while [ "$i" -lt 20 ]; do
+ sleep 0.1
+ grep -q "detail_env_tlob" /sys/kernel/tracing/trace && { found=1; break; }
+ i=$((i+1))
+done
+
+echo "-${UPROBE_TARGET}:${busy_offset}" > "$TLOB_MONITOR" 2>/dev/null
+kill "$busy_pid" 2>/dev/null || true; wait "$busy_pid" 2>/dev/null || true
+echo 0 > /sys/kernel/tracing/events/rv/error_env_tlob/enable
+echo 0 > /sys/kernel/tracing/events/rv/detail_env_tlob/enable
+echo 0 > monitors/tlob/enable
+
+[ "$found" = "1" ]
+
+# error_env_tlob must carry the clk_elapsed environment field.
+# The event label is "budget_exceeded" when detected by the hrtimer callback,
+# or the triggering sched event name when detected by the constraint path on a
+# preemption that races with the timer (common on PREEMPT_RT / VM). Both are
+# valid detections; check the env field instead of the label.
+grep "error_env_tlob" /sys/kernel/tracing/trace | head -n 1 | grep -q "clk_elapsed="
+
+# detail_env_tlob must have all five fields with the correct threshold
+line=$(grep "detail_env_tlob" /sys/kernel/tracing/trace | head -n 1)
+echo "$line" | grep -q "pid="
+echo "$line" | grep -q "threshold_ns=10000"
+echo "$line" | grep -q "running_ns="
+echo "$line" | grep -q "waiting_ns="
+echo "$line" | grep -q "sleeping_ns="
+
+# Busy-spin keeps the task on-CPU: running_ns must exceed sleeping_ns.
+running=$(echo "$line" | sed 's/.*running_ns=\([0-9]*\).*/\1/')
+sleeping=$(echo "$line" | sed 's/.*sleeping_ns=\([0-9]*\).*/\1/')
+[ "$running" -gt "$sleeping" ]
+
+echo > /sys/kernel/tracing/trace
--
2.43.0
* [PATCH v3 8/9] selftests/verification: fix verificationtest-ktap for out-of-tree execution
From: wen.yang @ 2026-06-07 16:13 UTC (permalink / raw)
To: Gabriele Monaco
Cc: Steven Rostedt, linux-trace-kernel, linux-kernel, Wen Yang
In-Reply-To: <cover.1780847473.git.wen.yang@linux.dev>
From: Wen Yang <wen.yang@linux.dev>
verificationtest-ktap used CWD-relative paths which broke when
invoked outside the verification directory (e.g. via vng).
Resolve paths via realpath "$(dirname "$0")" so the script works
from any working directory. Accept an optional subdirectory argument
interpreted relative to the script's directory.
Suggested-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Wen Yang <wen.yang@linux.dev>
---
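A minimal sketch of the path resolution described in the changelog above, wrapped in a helper so it can be exercised without running ftracetest. The helper name resolve_testdir and its two arguments are assumptions made for illustration; the patch itself inlines the two assignments.

```shell
# Hypothetical helper mirroring the patch: resolve an optional subdirectory
# relative to the directory that contains the script, not the caller's CWD.
resolve_testdir() {
    script=$1                               # path the script was invoked as ($0)
    dir=$(realpath "$(dirname "$script")")  # absolute directory of the script
    # The subshell keeps the caller's working directory untouched.
    (cd "$dir" && realpath "${2:-.}")
}
```

Called as resolve_testdir "$0" "$1", this prints the script's own directory when no argument is given and the named subdirectory otherwise, regardless of where the caller's shell currently is.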
tools/testing/selftests/verification/verificationtest-ktap | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/verification/verificationtest-ktap b/tools/testing/selftests/verification/verificationtest-ktap
index 18f7fe324e2f..055747cef38a 100755
--- a/tools/testing/selftests/verification/verificationtest-ktap
+++ b/tools/testing/selftests/verification/verificationtest-ktap
@@ -5,4 +5,6 @@
#
# Copyright (C) Arm Ltd., 2023
-../ftrace/ftracetest -K -v --rv ../verification
+dir=$(realpath "$(dirname "$0")")
+testdir=$(cd "$dir" && realpath "${1:-.}")
+"$dir/../ftrace/ftracetest" -K -v --rv "$testdir"
--
2.43.0
* [PATCH v3 7/9] rv/tlob: add KUnit tests for the tlob monitor
From: wen.yang @ 2026-06-07 16:13 UTC (permalink / raw)
To: Gabriele Monaco
Cc: Steven Rostedt, linux-trace-kernel, linux-kernel, Wen Yang
In-Reply-To: <cover.1780847473.git.wen.yang@linux.dev>
From: Wen Yang <wen.yang@linux.dev>
Add CONFIG_TLOB_KUNIT_TEST (tristate, depends on RV_MON_TLOB && KUNIT,
default KUNIT_ALL_TESTS) with a single test suite covering the uprobe
line parser: valid bindings are accepted, malformed ones return -EINVAL,
and out-of-range thresholds return -ERANGE.
Signed-off-by: Wen Yang <wen.yang@linux.dev>
---
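The three test tables in this patch encode the rules for "add" lines. As a rough userspace illustration only (this is not the kernel parser), the same rules can be approximated in POSIX sh. The function names, the use of errno numbers 22 (EINVAL) and 34 (ERANGE) as return codes, and the value 3600000000000 for TLOB_MAX_THRESHOLD_NS (inferred from the "+ 1" test vector) are all assumptions. Remove lines ("-PATH:OFFSET_START") are not covered.

```shell
# Userspace approximation of the add-line rules exercised by the KUnit tables.
# Assumed limits: 1000 ns minimum, 3600000000000 ns maximum threshold.
TLOB_MIN_NS=1000
TLOB_MAX_NS=3600000000000

# Succeed for a decimal or 0x-prefixed hexadecimal unsigned integer.
tlob_is_uint() {
    case $1 in
    0x*) case ${1#0x} in ''|*[!0-9a-fA-F]*) return 1 ;; esac ;;
    ''|*[!0-9]*) return 1 ;;
    esac
}

# Return 0 if the line is well formed, 22 if malformed, 34 if out of range.
tlob_check_add() {
    # Expect exactly four tokens: p PATH:START STOP threshold=NS
    read -r op loc stop kv extra <<EOF
$1
EOF
    [ "$op" = "p" ] && [ -z "$extra" ] || return 22
    case $loc in *:*) ;; *) return 22 ;; esac
    path=${loc%:*}     # the last ':' separates PATH from START
    start=${loc##*:}
    [ -n "$path" ] || return 22
    case $kv in threshold=*) ns=${kv#threshold=} ;; *) return 22 ;; esac
    tlob_is_uint "$start" && tlob_is_uint "$stop" && tlob_is_uint "$ns" || return 22
    # Start and stop offsets must differ.
    [ "$(($start))" -ne "$(($stop))" ] || return 22
    [ "$(($ns))" -ge "$TLOB_MIN_NS" ] && [ "$(($ns))" -le "$TLOB_MAX_NS" ] || return 34
}
```

Splitting on the last ':' is what lets a path such as /opt/my:app/bin pass, matching the third entry of tlob_parse_valid.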
kernel/trace/rv/Makefile | 1 +
kernel/trace/rv/monitors/tlob/.kunitconfig | 6 ++
kernel/trace/rv/monitors/tlob/Kconfig | 7 ++
kernel/trace/rv/monitors/tlob/tlob_kunit.c | 92 ++++++++++++++++++++++
4 files changed, 106 insertions(+)
create mode 100644 kernel/trace/rv/monitors/tlob/.kunitconfig
create mode 100644 kernel/trace/rv/monitors/tlob/tlob_kunit.c
diff --git a/kernel/trace/rv/Makefile b/kernel/trace/rv/Makefile
index ae59e97f8682..316d53398345 100644
--- a/kernel/trace/rv/Makefile
+++ b/kernel/trace/rv/Makefile
@@ -21,6 +21,7 @@ obj-$(CONFIG_RV_MON_STALL) += monitors/stall/stall.o
obj-$(CONFIG_RV_MON_DEADLINE) += monitors/deadline/deadline.o
obj-$(CONFIG_RV_MON_NOMISS) += monitors/nomiss/nomiss.o
obj-$(CONFIG_RV_MON_TLOB) += monitors/tlob/tlob.o
+obj-$(CONFIG_TLOB_KUNIT_TEST) += monitors/tlob/tlob_kunit.o
# Add new monitors here
obj-$(CONFIG_RV_UPROBE) += rv_uprobe.o
obj-$(CONFIG_RV_REACTORS) += rv_reactors.o
diff --git a/kernel/trace/rv/monitors/tlob/.kunitconfig b/kernel/trace/rv/monitors/tlob/.kunitconfig
new file mode 100644
index 000000000000..35d313dfc20d
--- /dev/null
+++ b/kernel/trace/rv/monitors/tlob/.kunitconfig
@@ -0,0 +1,6 @@
+CONFIG_FTRACE=y
+CONFIG_KUNIT=y
+CONFIG_MODULES=y
+CONFIG_RV=y
+CONFIG_RV_MON_TLOB=y
+CONFIG_TLOB_KUNIT_TEST=y
diff --git a/kernel/trace/rv/monitors/tlob/Kconfig b/kernel/trace/rv/monitors/tlob/Kconfig
index b29a375de228..7ec3326640c2 100644
--- a/kernel/trace/rv/monitors/tlob/Kconfig
+++ b/kernel/trace/rv/monitors/tlob/Kconfig
@@ -10,3 +10,10 @@ config RV_MON_TLOB
monitor. tlob tracks per-task elapsed wall-clock time across a
user-delimited code section and emits error_env_tlob when the
elapsed time exceeds a configurable per-invocation budget.
+
+config TLOB_KUNIT_TEST
+ tristate "KUnit tests for tlob monitor" if !KUNIT_ALL_TESTS
+ depends on RV_MON_TLOB && KUNIT
+ default KUNIT_ALL_TESTS
+ help
+ Enable KUnit unit tests for the tlob RV monitor.
diff --git a/kernel/trace/rv/monitors/tlob/tlob_kunit.c b/kernel/trace/rv/monitors/tlob/tlob_kunit.c
new file mode 100644
index 000000000000..6450d61b26c3
--- /dev/null
+++ b/kernel/trace/rv/monitors/tlob/tlob_kunit.c
@@ -0,0 +1,92 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * KUnit tests for the tlob RV monitor.
+ *
+ */
+#include <kunit/test.h>
+
+#include "tlob.h"
+
+MODULE_IMPORT_NS("EXPORTED_FOR_KUNIT_TESTING");
+
+static const char * const tlob_parse_valid[] = {
+ "p /usr/bin/myapp:4768 4848 threshold=5000000",
+ "p /usr/bin/myapp:0x12a0 0x12f0 threshold=10000000",
+ "p /opt/my:app/bin:0x100 0x200 threshold=1000000",
+};
+
+static const char * const tlob_parse_invalid[] = {
+ /* add: malformed */
+ "p :0x100 0x200 threshold=5000",
+ "p /usr/bin/myapp:0x100 threshold=5000",
+ "p /usr/bin/myapp:-1 0x200 threshold=5000",
+ "p /usr/bin/myapp:0x100 0x200",
+ "p /usr/bin/myapp:0x100 0x100 threshold=5000",
+ /* remove: malformed */
+ "-usr/bin/myapp:0x100",
+ "-/usr/bin/myapp",
+ "-/:0x100",
+ "-/usr/bin/myapp:abc",
+};
+
+/* threshold_ns < 1000 or > TLOB_MAX_THRESHOLD_NS return -ERANGE, not -EINVAL. */
+static const char * const tlob_parse_out_of_range[] = {
+ "p /usr/bin/myapp:0x100 0x200 threshold=0",
+ "p /usr/bin/myapp:0x100 0x200 threshold=999",
+ "p /usr/bin/myapp:0x100 0x200 threshold=3600000000001", /* TLOB_MAX_THRESHOLD_NS + 1 */
+};
+
+/*
+ * Valid add lines return -ENOENT (kern_path() finds no such file in the test
+ * environment) rather than 0; a non-(-EINVAL) return confirms the format was
+ * accepted by the parser.
+ */
+static void tlob_parse_valid_accepted(struct kunit *test)
+{
+ char buf[128];
+ int i;
+
+ for (i = 0; i < ARRAY_SIZE(tlob_parse_valid); i++) {
+ strscpy(buf, tlob_parse_valid[i], sizeof(buf));
+ KUNIT_EXPECT_NE(test, tlob_create_or_delete_uprobe(buf), -EINVAL);
+ }
+}
+
+static void tlob_parse_invalid_rejected(struct kunit *test)
+{
+ char buf[128];
+ int i;
+
+ for (i = 0; i < ARRAY_SIZE(tlob_parse_invalid); i++) {
+ strscpy(buf, tlob_parse_invalid[i], sizeof(buf));
+ KUNIT_EXPECT_EQ(test, tlob_create_or_delete_uprobe(buf), -EINVAL);
+ }
+}
+
+static void tlob_parse_out_of_range_rejected(struct kunit *test)
+{
+ char buf[128];
+ int i;
+
+ for (i = 0; i < ARRAY_SIZE(tlob_parse_out_of_range); i++) {
+ strscpy(buf, tlob_parse_out_of_range[i], sizeof(buf));
+ KUNIT_EXPECT_EQ(test, tlob_create_or_delete_uprobe(buf), -ERANGE);
+ }
+}
+
+static struct kunit_case tlob_parse_cases[] = {
+ KUNIT_CASE(tlob_parse_valid_accepted),
+ KUNIT_CASE(tlob_parse_invalid_rejected),
+ KUNIT_CASE(tlob_parse_out_of_range_rejected),
+ {}
+};
+
+static struct kunit_suite tlob_parse_suite = {
+ .name = "tlob_parse",
+ .test_cases = tlob_parse_cases,
+};
+
+kunit_test_suite(tlob_parse_suite);
+
+MODULE_DESCRIPTION("KUnit tests for the tlob RV monitor");
+MODULE_LICENSE("GPL");
--
2.43.0
* [PATCH v3 6/9] rv/tlob: add tlob hybrid automaton monitor
From: wen.yang @ 2026-06-07 16:13 UTC (permalink / raw)
To: Gabriele Monaco
Cc: Steven Rostedt, linux-trace-kernel, linux-kernel, Wen Yang
In-Reply-To: <cover.1780847473.git.wen.yang@linux.dev>
From: Wen Yang <wen.yang@linux.dev>
Add tlob (task latency over budget), a per-task hybrid automaton RV
monitor that tracks elapsed wall-clock time across a user-delimited
code section and emits error_env_tlob when the elapsed time exceeds a
configurable budget.
The monitor uses RV_MON_PER_OBJ with three states (running, waiting,
sleeping) driven by sched_switch and sched_wakeup tracepoints, and a
single clock invariant clk_elapsed < budget enforced by an hrtimer
(HRTIMER_MODE_REL_HARD). On violation, detail_env_tlob provides a
per-state time breakdown (running_ns, waiting_ns, sleeping_ns).
Per-task state is managed via DA_ALLOC_POOL to avoid allocation on the
scheduler tracepoint path. Uprobe pairs are registered through the
tracefs monitor file as "p PATH:OFFSET_START OFFSET_STOP threshold=NS".
Also add ha_cancel_timer_sync() to ha_monitor.h, a blocking cancel
variant needed by tlob's stop_task path to ensure the hrtimer callback
has completed before the per-task monitor state is freed.
Suggested-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Wen Yang <wen.yang@linux.dev>
---
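The per-state breakdown in detail_env_tlob is what lets a user tell which phase consumed the budget. A small sketch of how a script might pick the dominant phase from one trace line, using the same sed extraction idiom as the selftests in this series. The function name is hypothetical, and the sample line in the usage note only assumes that the fields appear as name=value pairs.

```shell
# Print which per-state counter is largest in one detail_env_tlob trace line.
# Field names come from the tracepoint; the function name is hypothetical.
tlob_dominant_phase() {
    line=$1
    best=""; best_ns=-1
    for f in running waiting sleeping; do
        # Capture the digits that follow "<f>_ns="; print nothing if absent.
        ns=$(printf '%s\n' "$line" | sed -n "s/.*${f}_ns=\([0-9][0-9]*\).*/\1/p")
        [ -n "$ns" ] || return 1    # field missing: not a detail_env_tlob line
        if [ "$ns" -gt "$best_ns" ]; then best=$f; best_ns=$ns; fi
    done
    echo "$best"
}
```

For a line ending in "running_ns=9800 waiting_ns=150 sleeping_ns=50" the function prints "running". Per the documentation added by this patch, a dominant running_ns suggests a compute overrun, waiting_ns scheduler pressure, and sleeping_ns I/O latency.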
Documentation/trace/rv/index.rst | 1 +
Documentation/trace/rv/monitor_tlob.rst | 177 ++++
kernel/trace/rv/Kconfig | 1 +
kernel/trace/rv/Makefile | 1 +
kernel/trace/rv/monitors/tlob/Kconfig | 12 +
kernel/trace/rv/monitors/tlob/tlob.c | 968 +++++++++++++++++++++
kernel/trace/rv/monitors/tlob/tlob.h | 148 ++++
kernel/trace/rv/monitors/tlob/tlob_trace.h | 49 ++
kernel/trace/rv/rv_trace.h | 1 +
9 files changed, 1358 insertions(+)
create mode 100644 Documentation/trace/rv/monitor_tlob.rst
create mode 100644 kernel/trace/rv/monitors/tlob/Kconfig
create mode 100644 kernel/trace/rv/monitors/tlob/tlob.c
create mode 100644 kernel/trace/rv/monitors/tlob/tlob.h
create mode 100644 kernel/trace/rv/monitors/tlob/tlob_trace.h
diff --git a/Documentation/trace/rv/index.rst b/Documentation/trace/rv/index.rst
index 29769f06bb0f..1501545b5f08 100644
--- a/Documentation/trace/rv/index.rst
+++ b/Documentation/trace/rv/index.rst
@@ -16,5 +16,6 @@ Runtime Verification
monitor_wwnr.rst
monitor_sched.rst
monitor_rtapp.rst
+ monitor_tlob.rst
monitor_stall.rst
monitor_deadline.rst
diff --git a/Documentation/trace/rv/monitor_tlob.rst b/Documentation/trace/rv/monitor_tlob.rst
new file mode 100644
index 000000000000..c651272eab89
--- /dev/null
+++ b/Documentation/trace/rv/monitor_tlob.rst
@@ -0,0 +1,177 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Monitor tlob
+============
+
+- Name: tlob - task latency over budget
+- Type: per-object hybrid automaton (RV_MON_PER_OBJ)
+- Author: Wen Yang <wen.yang@linux.dev>
+
+Description
+-----------
+
+The tlob monitor tracks per-task elapsed wall-clock time (CLOCK_MONOTONIC,
+spanning running, waiting, and sleeping states) and reports a violation when
+the monitored task exceeds a configurable per-invocation budget threshold.
+
+The monitor implements a three-state hybrid automaton with a single clock
+environment variable ``clk_elapsed``. The clock invariant
+``clk_elapsed < BUDGET_NS()`` is active in all three states; when it is
+violated the HA timer fires and the framework emits ``error_env_tlob``
+then calls ``da_monitor_reset()`` automatically::
+
+ | (initial, via task_start)
+ v
+ +--------------+
+ | running | <-----------+
+ +--------------+ |
+ | | |
+ sleep preempt switch_in
+ | | |
+ v v |
+ +---------+ +---------+ |
+ | sleeping| | waiting | -------+
+ +---------+ +---------+
+ | ^
+ +---wakeup---+
+
+ Key transitions:
+ running --(sleep)------> sleeping (task blocks waiting for a resource)
+ running --(preempt)----> waiting (task preempted, back in runqueue)
+ sleeping --(wakeup)-----> waiting (resource available, enters runqueue)
+ waiting --(switch_in)--> running (scheduler picks task, back on CPU)
+
+ ``tlob_start_task()`` calls ``da_handle_start_run_event(task->pid, ws, start_tlob)``.
+ The ``start_tlob`` self-loop on the ``running`` state triggers
+ ``ha_setup_invariants()``, which resets ``clk_elapsed`` and arms the budget
+ timer automatically. ``tlob_stop_task()`` cancels the HA timer synchronously
+ via ``ha_cancel_timer_sync()``, then calls ``da_monitor_reset()``.
+
+The idle condition (monitor not yet started, or reset after a stop or
+violation) is handled implicitly by the RV framework
+(``da_mon->monitoring == 0``); it is not an explicit DA state.
+
+Per-task state lives in ``struct tlob_task_state`` which is stored as
+``monitor_target`` in the framework's ``da_monitor_storage``, indexed by
+pid. The per-invocation ``threshold_ns`` is read via
+``ha_get_target(ha_mon)->threshold_ns`` inside the HA constraint functions,
+following the same pattern as the ``nomiss`` monitor.
+
+Usage
+-----
+
+tracefs interface (uprobe-based external monitoring)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The ``monitor`` tracefs file instruments an unmodified binary via uprobes.
+The format follows the ftrace ``uprobe_events`` convention (``PATH:OFFSET``
+for the probe location, ``key=value`` for configuration parameters)::
+
+ p PATH:OFFSET_START OFFSET_STOP threshold=NS
+
+The uprobe at ``OFFSET_START`` fires ``tlob_start_task()``; the uprobe at
+``OFFSET_STOP`` fires ``tlob_stop_task()``. Both offsets are ELF file
+offsets of entry points in ``PATH``. ``PATH`` may contain ``:``; the last
+``:`` in the ``PATH:OFFSET_START`` token is the separator.
+
+To remove a binding, write ``-PATH:OFFSET_START``. A complete session::
+
+ echo 1 > /sys/kernel/tracing/rv/monitors/tlob/enable
+
+ echo "p /usr/bin/myapp:0x12a0 0x12f0 threshold=5000000" \
+ > /sys/kernel/tracing/rv/monitors/tlob/monitor
+
+ # Remove a binding
+ echo "-/usr/bin/myapp:0x12a0" > /sys/kernel/tracing/rv/monitors/tlob/monitor
+
+ # List registered bindings
+ cat /sys/kernel/tracing/rv/monitors/tlob/monitor
+
+ # Read violations from the trace buffer
+ cat /sys/kernel/tracing/trace
+
+Violation tracepoints
+~~~~~~~~~~~~~~~~~~~~~
+
+Two tracepoints are emitted together on a budget violation:
+
+``error_env_tlob``
+ Standard HA clock-invariant tracepoint (emitted by the RV framework).
+ Fields: ``id`` (task pid), ``state``, ``event`` (``"budget_exceeded"`` or a
+ racing sched event name), ``env`` (``"clk_elapsed"``).
+
+``detail_env_tlob``
+ Tlob-specific breakdown of elapsed time per DA state.
+ Fields: ``id`` (task pid), ``threshold_ns``, ``running_ns``,
+ ``waiting_ns``, ``sleeping_ns``.
+
+ Use ``detail_env_tlob`` to diagnose *which phase* consumed the budget:
+ high ``sleeping_ns`` indicates I/O latency; high ``waiting_ns`` indicates
+ scheduler pressure; high ``running_ns`` indicates a compute overrun.
+
+Example: correlate the two tracepoints to see the breakdown::
+
+ trace-cmd record -e error_env_tlob -e detail_env_tlob &
+ # ... run workload ...
+ trace-cmd report
+
+tracefs files
+~~~~~~~~~~~~~
+
+The following files are specific to tlob under
+``/sys/kernel/tracing/rv/monitors/tlob/``:
+
+``monitor`` (rw)
+ Write ``p PATH:OFFSET_START OFFSET_STOP threshold=NS``
+ to bind two entry uprobes. Write ``-PATH:OFFSET_START`` to remove a
+ binding. Read to list registered bindings in the same format.
+ See the `tracefs interface (uprobe-based external monitoring)`_ section above.
+
+Kernel API
+----------
+
+``tlob_start_task`` and ``tlob_stop_task`` are the implementation-level
+functions called by the uprobe entry/exit handlers; the interface is
+driven from userspace.
+
+.. kernel-doc:: kernel/trace/rv/monitors/tlob/tlob.c
+ :functions: tlob_start_task tlob_stop_task
+
+``tlob_start_task(task, threshold_ns)``
+ Begin monitoring *task* with a total latency budget of *threshold_ns*
+ nanoseconds. Allocates per-task state, sets initial DA state to
+ ``running``, resets ``clk_elapsed``, and arms the HA budget timer.
+ Returns 0, -ENODEV (monitor disabled), -ERANGE (threshold out of range),
+ -EALREADY (already monitoring), -ENOSPC (at capacity), or -ENOMEM.
+
+``tlob_stop_task(task)``
+ Stop monitoring *task*. Synchronously cancels the HA timer via
+ ``ha_cancel_timer_sync()``, then checks ``da_monitoring()`` for the outcome.
+ Returns 0 (clean stop, within budget), -EOVERFLOW (budget was exceeded),
+ -ESRCH (not monitored), or -EAGAIN (concurrent stop racing).
+
+Design notes
+------------
+
+Limitations:
+
+- The initial DA state is always ``running``, set by feeding the ``start_tlob``
+ event to ``da_handle_start_run_event()``. Monitoring a non-current task that
+ is already in waiting or sleeping state at call time misclassifies the first
+ interval as ``running_ns``.
+- ``TASK_STOPPED`` and ``TASK_TRACED`` carry ``prev_state != 0`` and are
+ therefore counted as ``sleeping_ns``, indistinguishable from
+ I/O-blocked time.
+- ``sched_wakeup_new`` is not hooked. In practice this is not an issue
+ because ``tlob_start_task`` is always called from a running context.
+
+Specification
+-------------
+
+Graphviz DOT file in tools/verification/models/tlob.dot.
+
+KUnit tests under ``kernel/trace/rv/monitors/tlob/tlob_kunit.c``
+(CONFIG_TLOB_KUNIT_TEST).
+
+User-space integration tests under ``tools/testing/selftests/verification/``
+(requires CONFIG_RV_MON_TLOB=y and root).
diff --git a/kernel/trace/rv/Kconfig b/kernel/trace/rv/Kconfig
index e2e0033a00b9..ed2de31d0312 100644
--- a/kernel/trace/rv/Kconfig
+++ b/kernel/trace/rv/Kconfig
@@ -85,6 +85,7 @@ source "kernel/trace/rv/monitors/sleep/Kconfig"
source "kernel/trace/rv/monitors/stall/Kconfig"
source "kernel/trace/rv/monitors/deadline/Kconfig"
source "kernel/trace/rv/monitors/nomiss/Kconfig"
+source "kernel/trace/rv/monitors/tlob/Kconfig"
# Add new deadline monitors here
# Add new monitors here
diff --git a/kernel/trace/rv/Makefile b/kernel/trace/rv/Makefile
index f139b904bea3..ae59e97f8682 100644
--- a/kernel/trace/rv/Makefile
+++ b/kernel/trace/rv/Makefile
@@ -20,6 +20,7 @@ obj-$(CONFIG_RV_MON_OPID) += monitors/opid/opid.o
obj-$(CONFIG_RV_MON_STALL) += monitors/stall/stall.o
obj-$(CONFIG_RV_MON_DEADLINE) += monitors/deadline/deadline.o
obj-$(CONFIG_RV_MON_NOMISS) += monitors/nomiss/nomiss.o
+obj-$(CONFIG_RV_MON_TLOB) += monitors/tlob/tlob.o
# Add new monitors here
obj-$(CONFIG_RV_UPROBE) += rv_uprobe.o
obj-$(CONFIG_RV_REACTORS) += rv_reactors.o
diff --git a/kernel/trace/rv/monitors/tlob/Kconfig b/kernel/trace/rv/monitors/tlob/Kconfig
new file mode 100644
index 000000000000..b29a375de228
--- /dev/null
+++ b/kernel/trace/rv/monitors/tlob/Kconfig
@@ -0,0 +1,12 @@
+# SPDX-License-Identifier: GPL-2.0-only
+#
+config RV_MON_TLOB
+ depends on RV && UPROBES && HIGH_RES_TIMERS
+ select HA_MON_EVENTS_ID
+ select RV_UPROBE
+ bool "tlob monitor"
+ help
+ Enable the tlob (task latency over budget) hybrid-automaton RV
+ monitor. tlob tracks per-task elapsed wall-clock time across a
+ user-delimited code section and emits error_env_tlob when the
+ elapsed time exceeds a configurable per-invocation budget.
diff --git a/kernel/trace/rv/monitors/tlob/tlob.c b/kernel/trace/rv/monitors/tlob/tlob.c
new file mode 100644
index 000000000000..d8e0c4794720
--- /dev/null
+++ b/kernel/trace/rv/monitors/tlob/tlob.c
@@ -0,0 +1,968 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * tlob: task latency over budget monitor
+ *
+ * Track the elapsed wall-clock time of a marked code path and detect when
+ * a monitored task exceeds its per-task latency budget. CLOCK_MONOTONIC
+ * is used so both on-CPU and off-CPU time count toward the budget.
+ *
+ * On a budget violation, two tracepoints are emitted from the hrtimer
+ * callback: error_env_tlob signals the violation, and detail_env_tlob
+ * provides a per-state time breakdown (running_ns, waiting_ns, sleeping_ns)
+ * that shows whether the overrun occurred while running, waiting, or sleeping.
+ *
+ * The monitor uses RV_MON_PER_OBJ: per-task state (struct tlob_task_state)
+ * is stored as monitor_target in the framework's hash table.
+ *
+ * One HA clock invariant is enforced:
+ * clk_elapsed < BUDGET_NS() (active in all states)
+ *
+ * tlob_start_task() uses da_handle_start_run_event(start_tlob) to initialise
+ * the monitor: the DA framework sets the initial state and processes the start
+ * event, which resets clk_elapsed and arms the budget hrtimer via
+ * ha_setup_invariants(). The HA timer is cancelled synchronously by
+ * ha_cancel_timer_sync() in tlob_stop_task().
+ *
+ * Copyright (C) 2026 Wen Yang <wen.yang@linux.dev>
+ */
+#include <linux/hrtimer.h>
+#include <linux/kernel.h>
+#include <linux/ktime.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/namei.h>
+#include <linux/rv.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/tracefs.h>
+#include <kunit/visibility.h>
+#include <rv/instrumentation.h>
+#include <rv/rv_uprobe.h>
+#include "../../rv.h"
+
+#define MODULE_NAME "tlob"
+
+#include <trace/events/sched.h>
+#include <rv_trace.h>
+
+/*
+ * Per-task latency monitoring state. One instance per monitoring window.
+ * Stored as monitor_target in da_monitor_storage; freed via call_rcu.
+ */
+struct tlob_task_state {
+ struct task_struct *task; /* via get_task_struct */
+ u64 threshold_ns; /* budget in nanoseconds */
+
+ /* 1 = cleanup claimed; ha_setup_invariants won't restart the timer. */
+ atomic_t stopping;
+
+ /* Serialises the ns accumulators; held briefly (hardirq-safe). */
+ raw_spinlock_t entry_lock;
+ u64 running_ns; /* time in running state */
+ u64 waiting_ns; /* time in waiting state */
+ u64 sleeping_ns; /* time in sleeping state */
+ ktime_t last_ts;
+
+ struct rcu_head rcu; /* for call_rcu() teardown */
+};
+
+#define RV_MON_TYPE RV_MON_PER_OBJ
+#define HA_TIMER_TYPE HA_TIMER_HRTIMER
+#define DA_MON_ALLOCATION_STRATEGY DA_ALLOC_POOL
+
+/* Type for da_monitor_storage.target; must be defined before the includes. */
+typedef struct tlob_task_state *monitor_target;
+
+/* Forward-declared so da_monitor_reset_hook works before ha_monitor.h. */
+static inline void tlob_reset_notify(struct da_monitor *da_mon);
+#define da_monitor_reset_hook tlob_reset_notify
+
+/* Override EVENT_NONE_LBL so the timer-fired violation shows "budget_exceeded". */
+#define EVENT_NONE_LBL "budget_exceeded"
+
+#include "tlob.h"
+
+/*
+ * DA_MON_POOL_SIZE must be defined HERE: after tlob.h (which defines
+ * TLOB_MAX_MONITORED) and before #include <rv/ha_monitor.h> (which
+ * transitively includes da_monitor.h and expands __da_monitor_init_pool
+ * using this macro). Placing the define either before tlob.h or after
+ * ha_monitor.h causes a build error.
+ */
+#define DA_MON_POOL_SIZE TLOB_MAX_MONITORED
+
+/*
+ * Forward-declare tlob_extra_cleanup so the #define below is valid when
+ * da_monitor.h (included via ha_monitor.h) expands da_extra_cleanup inside
+ * da_monitor_destroy(). The full definition follows after ha_monitor.h.
+ */
+static inline void tlob_extra_cleanup(struct da_monitor *da_mon);
+#define da_extra_cleanup tlob_extra_cleanup
+
+#include <rv/ha_monitor.h>
+
+/*
+ * Called from da_monitor_reset() on both normal stop and hrtimer expiry.
+ * On violation (stopping==0), emits detail_env_tlob.
+ */
+static inline void tlob_reset_notify(struct da_monitor *da_mon)
+{
+ struct ha_monitor *ha_mon = to_ha_monitor(da_mon);
+ struct tlob_task_state *ws;
+
+ ha_monitor_reset_env(da_mon);
+
+ ws = ha_get_target(ha_mon);
+ if (!ws)
+ return;
+
+ /*
+ * Emit per-state breakdown on budget violation only.
+ * stopping==0: timer callback owns this path (genuine overrun).
+ * stopping==1: normal stop claimed ownership first; skip.
+ */
+ if (!atomic_read(&ws->stopping)) {
+ unsigned int curr_state = READ_ONCE(da_mon->curr_state);
+ u64 running_ns, waiting_ns, sleeping_ns, partial_ns;
+ unsigned long flags;
+
+ /*
+ * Snapshot accumulators; partial_ns covers curr_state time
+ * not yet folded in (transition-out pending).
+ */
+ raw_spin_lock_irqsave(&ws->entry_lock, flags);
+ partial_ns = ktime_get_ns() - ktime_to_ns(ws->last_ts);
+ running_ns = ws->running_ns +
+ (curr_state == running_tlob ? partial_ns : 0);
+ waiting_ns = ws->waiting_ns +
+ (curr_state == waiting_tlob ? partial_ns : 0);
+ sleeping_ns = ws->sleeping_ns +
+ (curr_state == sleeping_tlob ? partial_ns : 0);
+ raw_spin_unlock_irqrestore(&ws->entry_lock, flags);
+
+ trace_detail_env_tlob(da_get_id(da_mon), ws->threshold_ns,
+ running_ns, waiting_ns, sleeping_ns);
+ }
+}
+
+#define BUDGET_NS(ha_mon) (ha_get_target(ha_mon)->threshold_ns)
+
+/* HA constraint functions (called by ha_monitor_handle_constraint) */
+
+static u64 ha_get_env(struct ha_monitor *ha_mon, enum envs_tlob env, u64 time_ns)
+{
+ if (env == clk_elapsed_tlob)
+ return ha_get_clk_ns(ha_mon, env, time_ns);
+ return ENV_INVALID_VALUE;
+}
+
+/*
+ * ha_verify_invariants - clk_elapsed < BUDGET_NS must hold in all states.
+ *
+ * The invariant is uniform across running/waiting/sleeping; check it
+ * unconditionally rather than enumerating each state.
+ */
+static inline bool ha_verify_invariants(struct ha_monitor *ha_mon,
+ enum states curr_state, enum events event,
+ enum states next_state, u64 time_ns)
+{
+ return ha_check_invariant_ns(ha_mon, clk_elapsed_tlob, time_ns);
+}
+
+/*
+ * Convert invariant (deadline) to guard (reset anchor) on state transitions.
+ *
+ * The conversion is identical for every departing state; skip only self-loops.
+ */
+static inline void ha_convert_inv_guard(struct ha_monitor *ha_mon,
+ enum states curr_state, enum events event,
+ enum states next_state, u64 time_ns)
+{
+ if (curr_state != next_state)
+ ha_inv_to_guard(ha_mon, clk_elapsed_tlob, BUDGET_NS(ha_mon), time_ns);
+}
+
+/* No per-event guard conditions for tlob; invariants suffice. */
+static inline bool ha_verify_guards(struct ha_monitor *ha_mon,
+ enum states curr_state, enum events event,
+ enum states next_state, u64 time_ns)
+{
+ return true;
+}
+
+/*
+ * Arm or cancel the HA budget timer on state transitions.
+ *
+ * The timer must run in every monitored state (running/waiting/sleeping),
+ * so arm it whenever next_state is any of the three. On a self-loop caused
+ * by a non-start event the timer is already running; skip the redundant
+ * restart. On a true state change the old timer is implicitly superseded by
+ * the new ha_start_timer_ns() call.
+ *
+ * Guard on stopping: sched_switch events can arrive after ha_cancel_timer_sync,
+ * restarting the timer and triggering an ODEBUG "activate active" splat.
+ * The _acquire pairs with the cmpxchg_release in tlob_stop_task.
+ */
+static inline void ha_setup_invariants(struct ha_monitor *ha_mon,
+ enum states curr_state, enum events event,
+ enum states next_state, u64 time_ns)
+{
+ if (next_state == curr_state && event != start_tlob)
+ return;
+
+ if (next_state < state_max_tlob) {
+ if (!atomic_read_acquire(&ha_get_target(ha_mon)->stopping))
+ ha_start_timer_ns(ha_mon, clk_elapsed_tlob, BUDGET_NS(ha_mon), time_ns);
+ } else {
+ ha_cancel_timer(ha_mon);
+ }
+}
+
+static bool ha_verify_constraint(struct ha_monitor *ha_mon,
+ enum states curr_state, enum events event,
+ enum states next_state, u64 time_ns)
+{
+ if (!ha_verify_invariants(ha_mon, curr_state, event, next_state, time_ns))
+ return false;
+
+ ha_convert_inv_guard(ha_mon, curr_state, event, next_state, time_ns);
+
+ if (!ha_verify_guards(ha_mon, curr_state, event, next_state, time_ns))
+ return false;
+
+ ha_setup_invariants(ha_mon, curr_state, event, next_state, time_ns);
+
+ return true;
+}
+
+static struct kmem_cache *tlob_state_cache;
+
+/* Uprobe binding list; protected by tlob_uprobe_mutex. */
+static LIST_HEAD(tlob_uprobe_list);
+static DEFINE_MUTEX(tlob_uprobe_mutex);
+
+/* Serialises duplicate-check + da_handle_start_run_event() for the same pid. */
+static DEFINE_MUTEX(tlob_start_mutex);
+
+
+/* Per-uprobe-binding state: a start + stop probe pair for one binary region. */
+struct tlob_uprobe_binding {
+ struct list_head list;
+ u64 threshold_ns;
+ char binpath[TLOB_MAX_PATH];
+ loff_t offset_start;
+ loff_t offset_stop;
+ struct rv_uprobe *start_probe;
+ struct rv_uprobe *stop_probe;
+};
+
+/* RCU callback: free the slab once no readers remain. */
+static void tlob_free_rcu(struct rcu_head *head)
+{
+ struct tlob_task_state *ws =
+ container_of(head, struct tlob_task_state, rcu);
+ kmem_cache_free(tlob_state_cache, ws);
+}
+
+/*
+ * da_extra_cleanup - per-task teardown called by da_monitor_destroy().
+ *
+ * Claims cleanup ownership via CAS, cancels the budget timer, drops the task
+ * reference, and schedules the slab free via call_rcu().
+ * Must run before da_monitor_reset() (i.e. before hash_del_rcu()) so that
+ * ha_cancel_timer_sync() can safely access the still-registered ha_monitor.
+ */
+static inline void tlob_extra_cleanup(struct da_monitor *da_mon)
+{
+ struct ha_monitor *ha_mon = to_ha_monitor(da_mon);
+ struct tlob_task_state *ws = ha_get_target(ha_mon);
+
+ if (!ws)
+ return;
+
+ if (atomic_cmpxchg_release(&ws->stopping, 0, 1) != 0)
+ return;
+
+ ha_cancel_timer_sync(ha_mon);
+ put_task_struct(ws->task);
+ call_rcu(&ws->rcu, tlob_free_rcu);
+}
+
+/*
+ * __tlob_acc - accumulate elapsed ns into one per-state counter.
+ *
+ * Looks up the task's tlob_task_state under RCU, adds the interval
+ * [ws->last_ts, now] to the field at @offset within the state struct,
+ * and updates last_ts. Returns true if the task is monitored.
+ *
+ * entry_lock is a raw spinlock so this is safe from hardirq context.
+ */
+static inline bool __tlob_acc(struct task_struct *task, ktime_t now,
+ size_t offset)
+{
+ struct tlob_task_state *ws;
+ unsigned long flags;
+
+ scoped_guard(rcu) {
+ ws = da_get_target_by_id(task->pid);
+ if (!ws)
+ return false;
+ raw_spin_lock_irqsave(&ws->entry_lock, flags);
+ *(u64 *)((char *)ws + offset) += ktime_to_ns(ktime_sub(now, ws->last_ts));
+ ws->last_ts = now;
+ raw_spin_unlock_irqrestore(&ws->entry_lock, flags);
+ }
+ return true;
+}
+
+/* Accumulate running_ns for prev; returns true if prev is monitored. */
+static inline bool tlob_acc_running(struct task_struct *task, ktime_t now)
+{
+ return __tlob_acc(task, now, offsetof(struct tlob_task_state, running_ns));
+}
+
+/* Accumulate waiting_ns for next; returns true if next is monitored. */
+static inline bool tlob_acc_waiting(struct task_struct *task, ktime_t now)
+{
+ return __tlob_acc(task, now, offsetof(struct tlob_task_state, waiting_ns));
+}
+
+/*
+ * handle_sched_switch - advance the DA on every context switch.
+ *
+ * Generates three DA events:
+ * prev, prev_state != 0 -> sleep_tlob (running -> sleeping)
+ * prev, prev_state == 0 -> preempt_tlob (running -> waiting)
+ * next -> switch_in_tlob (waiting -> running)
+ *
+ * A single ktime_get() at handler entry is shared by both acc calls so that
+ * prev's running_ns and next's waiting_ns share the same context-switch
+ * timestamp; neither absorbs handler overhead into its accumulator.
+ *
+ * No waiting->sleeping edge exists: a task can only block voluntarily
+ * (call schedule()) while it is executing on CPU, which corresponds to
+ * the running DA state. A task in the waiting state is TASK_RUNNING in
+ * kernel terms (on the runqueue) and cannot block itself.
+ *
+ * da_handle_event() is called only for tasks that __tlob_acc() found in the
+ * hash table; unmonitored tasks are skipped after that single lookup.
+ */
+static void handle_sched_switch(void *data, bool preempt_unused,
+ struct task_struct *prev,
+ struct task_struct *next,
+ unsigned int prev_state)
+{
+ ktime_t now = ktime_get();
+ bool prev_preempted = (prev_state == 0);
+
+ /*
+ * No global monitored-task counter is checked here. The hash lookup
+ * returns NULL for unmonitored tasks, which is about as cheap as an
+ * atomic_read() guard, and da_handle_event() checks
+ * rv_monitoring_on() and da_monitoring(da_mon) itself through
+ * da_monitor_handling_event(). Not having a shared counter also avoids
+ * touching its cacheline on every context switch.
+ */
+ if (tlob_acc_running(prev, now))
+ da_handle_event(prev->pid, NULL,
+ prev_preempted ? preempt_tlob : sleep_tlob);
+ if (tlob_acc_waiting(next, now))
+ da_handle_event(next->pid, NULL, switch_in_tlob);
+}
+
+/* Accumulate sleeping_ns on wakeup; returns true if task is monitored. */
+static inline bool tlob_acc_sleeping(struct task_struct *task, ktime_t now)
+{
+ return __tlob_acc(task, now, offsetof(struct tlob_task_state, sleeping_ns));
+}
+
+/*
+ * handle_sched_wakeup - sleeping -> waiting transition.
+ *
+ * try_to_wake_up() skips TASK_RUNNING tasks, so this never fires for a
+ * task already in running or waiting state.
+ */
+static void handle_sched_wakeup(void *data, struct task_struct *p)
+{
+ ktime_t now = ktime_get();
+
+ /* Same reasoning as handle_sched_switch: rely on hash-lookup fast path. */
+ if (tlob_acc_sleeping(p, now))
+ da_handle_event(p->pid, NULL, wakeup_tlob);
+}
+
+/*
+ * handle_sched_process_exit - clean up if a task exits without TRACE_STOP.
+ *
+ * Called in do_exit() context; the task still has a valid pid here.
+ * tlob_stop_task() returns -ESRCH if the task is not monitored, which is fine.
+ */
+static void handle_sched_process_exit(void *data, struct task_struct *p,
+ bool group_dead)
+{
+ tlob_stop_task(p);
+}
+
+
+
+/**
+ * tlob_start_task - begin monitoring @task with budget @threshold_ns ns.
+ * @task: Task to monitor; may be current or another task.
+ * @threshold_ns: Latency budget in nanoseconds (wall-clock; running + waiting + sleeping).
+ * Must be in [1000, TLOB_MAX_THRESHOLD_NS].
+ *
+ * Returns 0, -ENODEV, -ERANGE, -EALREADY, -ENOMEM, or -ENOSPC.
+ */
+int tlob_start_task(struct task_struct *task, u64 threshold_ns)
+{
+ struct tlob_task_state *ws;
+
+ if (!da_monitor_enabled())
+ return -ENODEV;
+
+ if (threshold_ns < 1000 || threshold_ns > TLOB_MAX_THRESHOLD_NS)
+ return -ERANGE;
+
+ /* Serialise duplicate-check + pool-slot claim for the same pid. */
+ guard(mutex)(&tlob_start_mutex);
+
+ if (da_get_target_by_id(task->pid))
+ return -EALREADY;
+
+ ws = kmem_cache_zalloc(tlob_state_cache, GFP_KERNEL);
+ if (!ws)
+ return -ENOMEM;
+
+ ws->task = task;
+ get_task_struct(task);
+ ws->threshold_ns = threshold_ns;
+ ws->last_ts = ktime_get();
+ raw_spin_lock_init(&ws->entry_lock);
+
+ /*
+ * da_handle_start_run_event() claims a pool slot via da_prepare_storage(),
+ * initialises the monitor, and delivers start_tlob in one step: the
+ * generated ha_setup_invariants() resets clk_elapsed and arms the timer.
+ * A return value of 0 means the pool is exhausted; map it to -ENOSPC.
+ */
+ if (!da_handle_start_run_event(task->pid, ws, start_tlob)) {
+ put_task_struct(task);
+ kmem_cache_free(tlob_state_cache, ws);
+ return -ENOSPC;
+ }
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(tlob_start_task);
+
+/**
+ * tlob_stop_task - stop monitoring @task.
+ * @task: Task to stop.
+ *
+ * CAS on ws->stopping (0->1) under RCU claims cleanup ownership;
+ * the winner cancels the timer synchronously and frees all resources.
+ *
+ * Returns 0, -EOVERFLOW (budget exceeded), -ESRCH (not monitored),
+ * or -EAGAIN (concurrent caller claimed cleanup).
+ */
+int tlob_stop_task(struct task_struct *task)
+{
+ struct da_monitor *da_mon;
+ struct ha_monitor *ha_mon;
+ struct tlob_task_state *ws;
+ bool budget_exceeded;
+
+ scoped_guard(rcu) {
+ ws = da_get_target_by_id(task->pid);
+ if (!ws)
+ return -ESRCH;
+
+ da_mon = da_get_monitor(task->pid, NULL);
+ if (unlikely(!da_mon)) {
+ /* ws in hash but da_mon gone; internal inconsistency. */
+ WARN_ON_ONCE(1);
+ return -ESRCH;
+ }
+
+ ha_mon = to_ha_monitor(da_mon);
+
+ /*
+ * CAS (0->1) claims cleanup ownership under RCU (ws guaranteed valid).
+ * _release pairs with atomic_read_acquire in ha_setup_invariants.
+ */
+ if (atomic_cmpxchg_release(&ws->stopping, 0, 1) != 0)
+ return -EAGAIN;
+ }
+
+ /* Wait for in-flight timer callback before reading da_monitoring. */
+ ha_cancel_timer_sync(ha_mon);
+
+ /* Timer fired first -> budget exceeded; otherwise reset normally. */
+ scoped_guard(rcu) {
+ budget_exceeded = !da_monitoring(da_mon);
+ if (!budget_exceeded)
+ da_monitor_reset(da_mon);
+ }
+ da_destroy_storage(task->pid);
+
+ put_task_struct(ws->task);
+ call_rcu(&ws->rcu, tlob_free_rcu);
+ return budget_exceeded ? -EOVERFLOW : 0;
+}
+EXPORT_SYMBOL_GPL(tlob_stop_task);
+
+
+static int tlob_uprobe_entry_handler(struct rv_uprobe *p, struct pt_regs *regs,
+ __u64 *data)
+{
+ struct tlob_uprobe_binding *b = p->priv;
+
+ tlob_start_task(current, b->threshold_ns);
+ return 0;
+}
+
+static int tlob_uprobe_stop_handler(struct rv_uprobe *p, struct pt_regs *regs,
+ __u64 *data)
+{
+ tlob_stop_task(current);
+ return 0;
+}
+
+/*
+ * Register start + stop entry uprobes for a binding.
+ * Called with tlob_uprobe_mutex held.
+ */
+static int tlob_add_uprobe(u64 threshold_ns, const char *binpath,
+ loff_t offset_start, loff_t offset_stop)
+{
+ struct tlob_uprobe_binding *b, *tmp_b;
+ char pathbuf[TLOB_MAX_PATH];
+ struct path path;
+ char *canon;
+ int ret;
+
+ if (binpath[0] != '/')
+ return -EINVAL;
+
+ b = kzalloc_obj(*b, GFP_KERNEL);
+ if (!b)
+ return -ENOMEM;
+
+ b->threshold_ns = threshold_ns;
+ b->offset_start = offset_start;
+ b->offset_stop = offset_stop;
+
+ ret = kern_path(binpath, LOOKUP_FOLLOW, &path);
+ if (ret)
+ goto err_free;
+
+ if (!d_is_reg(path.dentry)) {
+ ret = -EINVAL;
+ goto err_path;
+ }
+
+ /* Reject duplicate start offset for the same binary. */
+ list_for_each_entry(tmp_b, &tlob_uprobe_list, list) {
+ if (tmp_b->offset_start == offset_start &&
+ tmp_b->start_probe->path.dentry == path.dentry) {
+ ret = -EEXIST;
+ goto err_path;
+ }
+ }
+
+ canon = d_path(&path, pathbuf, sizeof(pathbuf));
+ if (IS_ERR(canon)) {
+ ret = PTR_ERR(canon);
+ goto err_path;
+ }
+ strscpy(b->binpath, canon, sizeof(b->binpath));
+
+ /* Both probes share b (priv) and path; attach_path refs path itself. */
+ b->start_probe = rv_uprobe_attach_path(&path, offset_start,
+ tlob_uprobe_entry_handler, NULL, b);
+ if (IS_ERR(b->start_probe)) {
+ ret = PTR_ERR(b->start_probe);
+ b->start_probe = NULL;
+ goto err_path;
+ }
+
+ b->stop_probe = rv_uprobe_attach_path(&path, offset_stop,
+ tlob_uprobe_stop_handler, NULL, b);
+ if (IS_ERR(b->stop_probe)) {
+ ret = PTR_ERR(b->stop_probe);
+ b->stop_probe = NULL;
+ goto err_start;
+ }
+
+ path_put(&path);
+ list_add_tail(&b->list, &tlob_uprobe_list);
+ return 0;
+
+err_start:
+ rv_uprobe_detach(b->start_probe);
+err_path:
+ path_put(&path);
+err_free:
+ kfree(b);
+ return ret;
+}
+
+static int tlob_remove_uprobe_by_key(loff_t offset_start, const char *binpath)
+{
+ struct tlob_uprobe_binding *b, *tmp;
+ struct path remove_path;
+ int ret;
+
+ ret = kern_path(binpath, LOOKUP_FOLLOW, &remove_path);
+ if (ret)
+ return ret;
+
+ ret = -ENOENT;
+ list_for_each_entry_safe(b, tmp, &tlob_uprobe_list, list) {
+ if (b->offset_start != offset_start)
+ continue;
+ if (b->start_probe->path.dentry != remove_path.dentry)
+ continue;
+ list_del(&b->list);
+ rv_uprobe_detach(b->start_probe);
+ rv_uprobe_detach(b->stop_probe);
+ kfree(b);
+ ret = 0;
+ break;
+ }
+
+ path_put(&remove_path);
+ return ret;
+}
+
+static void tlob_remove_all_uprobes(void)
+{
+ struct tlob_uprobe_binding *b, *tmp;
+ LIST_HEAD(pending);
+
+ mutex_lock(&tlob_uprobe_mutex);
+ list_for_each_entry_safe(b, tmp, &tlob_uprobe_list, list) {
+ list_move(&b->list, &pending);
+ rv_uprobe_unregister_nosync(b->start_probe);
+ rv_uprobe_unregister_nosync(b->stop_probe);
+ }
+ mutex_unlock(&tlob_uprobe_mutex);
+
+ if (list_empty(&pending))
+ return;
+
+ /*
+ * One global barrier for all probes dequeued above; no new handlers
+ * for any of them can fire after this returns.
+ */
+ rv_uprobe_sync();
+
+ list_for_each_entry_safe(b, tmp, &pending, list) {
+ rv_uprobe_free(b->start_probe);
+ rv_uprobe_free(b->stop_probe);
+ kfree(b);
+ }
+}
+
+static ssize_t tlob_monitor_read(struct file *file,
+ char __user *ubuf,
+ size_t count, loff_t *ppos)
+{
+ const int line_sz = TLOB_MAX_PATH + 128;
+ struct tlob_uprobe_binding *b;
+ char *buf, *p;
+ int n = 0, buf_sz, pos = 0;
+ ssize_t ret;
+
+ mutex_lock(&tlob_uprobe_mutex);
+ list_for_each_entry(b, &tlob_uprobe_list, list)
+ n++;
+
+ buf_sz = (n ? n : 1) * line_sz + 1;
+ buf = kmalloc(buf_sz, GFP_KERNEL);
+ if (!buf) {
+ mutex_unlock(&tlob_uprobe_mutex);
+ return -ENOMEM;
+ }
+
+ list_for_each_entry(b, &tlob_uprobe_list, list) {
+ p = b->binpath;
+ pos += scnprintf(buf + pos, buf_sz - pos,
+ "p %s:0x%llx 0x%llx threshold=%llu\n",
+ p,
+ (unsigned long long)b->offset_start,
+ (unsigned long long)b->offset_stop,
+ b->threshold_ns);
+ }
+ mutex_unlock(&tlob_uprobe_mutex);
+
+ ret = simple_read_from_buffer(ubuf, count, ppos, buf, pos);
+ kfree(buf);
+ return ret;
+}
+
+/*
+ * Parse "p PATH:OFFSET_START OFFSET_STOP threshold=NS".
+ * PATH may contain ':'; the last ':' separates path from offset.
+ * Returns 0, -EINVAL, or -ERANGE.
+ */
+static int tlob_parse_uprobe_line(char *buf, u64 *thr_out,
+ char **path_out,
+ loff_t *start_out, loff_t *stop_out)
+{
+ unsigned long long thr = 0, stop_val = 0;
+ long long start_val;
+ char *p, *path_token, *token, *colon;
+ bool got_stop = false, got_thr = false;
+ int n;
+
+ /* Must start with "p " */
+ if (buf[0] != 'p' || buf[1] != ' ')
+ return -EINVAL;
+
+ p = buf + 2;
+ while (*p == ' ')
+ p++;
+
+ /* First space-delimited token is PATH:OFFSET_START */
+ path_token = strsep(&p, " \t");
+ if (!path_token || !*path_token)
+ return -EINVAL;
+
+ /* Split at last ':' to handle paths that contain ':'. */
+ colon = strrchr(path_token, ':');
+ if (!colon || colon - path_token < 2)
+ return -EINVAL;
+ *colon = '\0';
+
+ if (path_token[0] != '/')
+ return -EINVAL;
+
+ n = 0;
+ if (sscanf(colon + 1, "%lli%n", &start_val, &n) != 1 || n == 0)
+ return -EINVAL;
+ if (start_val < 0)
+ return -EINVAL;
+
+ /* Remaining tokens: OFFSET_STOP threshold=NS */
+ while (p && (token = strsep(&p, " \t")) != NULL) {
+ if (!*token)
+ continue;
+ if (strncmp(token, "threshold=", 10) == 0) {
+ if (kstrtoull(token + 10, 0, &thr))
+ return -EINVAL;
+ if (thr < 1000 || thr > TLOB_MAX_THRESHOLD_NS)
+ return -ERANGE;
+ got_thr = true;
+ } else if (!got_stop) {
+ long long sv;
+
+ n = 0;
+ if (sscanf(token, "%lli%n", &sv, &n) != 1 || n == 0)
+ return -EINVAL;
+ if (sv < 0)
+ return -EINVAL;
+ stop_val = (unsigned long long)sv;
+ got_stop = true;
+ } else {
+ return -EINVAL;
+ }
+ }
+
+ if (!got_stop || !got_thr)
+ return -EINVAL;
+ if (start_val == (long long)stop_val)
+ return -EINVAL;
+
+ *thr_out = thr;
+ *path_out = path_token;
+ *start_out = (loff_t)start_val;
+ *stop_out = (loff_t)stop_val;
+ return 0;
+}
+
+/* Parse "-PATH:OFFSET_START" (ftrace uprobe_events removal convention). */
+static int tlob_parse_remove_line(char *buf, char **path_out, loff_t *start_out)
+{
+ char *binpath, *colon;
+ long long off;
+ int n = 0;
+
+ if (buf[0] != '-')
+ return -EINVAL;
+ binpath = buf + 1;
+ if (binpath[0] != '/')
+ return -EINVAL;
+ colon = strrchr(binpath, ':');
+ if (!colon || colon - binpath < 2)
+ return -EINVAL;
+ *colon = '\0';
+ if (sscanf(colon + 1, "%lli%n", &off, &n) != 1 || n == 0)
+ return -EINVAL;
+ *path_out = binpath;
+ *start_out = (loff_t)off;
+ return 0;
+}
+
+VISIBLE_IF_KUNIT int tlob_create_or_delete_uprobe(char *buf)
+{
+ loff_t offset_start, offset_stop;
+ u64 threshold_ns;
+ char *binpath;
+ int ret;
+
+ if (buf[0] == '-') {
+ ret = tlob_parse_remove_line(buf, &binpath, &offset_start);
+ if (ret)
+ return ret;
+ mutex_lock(&tlob_uprobe_mutex);
+ ret = tlob_remove_uprobe_by_key(offset_start, binpath);
+ mutex_unlock(&tlob_uprobe_mutex);
+ return ret;
+ }
+ ret = tlob_parse_uprobe_line(buf, &threshold_ns, &binpath,
+ &offset_start, &offset_stop);
+ if (ret)
+ return ret;
+ mutex_lock(&tlob_uprobe_mutex);
+ ret = tlob_add_uprobe(threshold_ns, binpath, offset_start, offset_stop);
+ mutex_unlock(&tlob_uprobe_mutex);
+ return ret;
+}
+EXPORT_SYMBOL_IF_KUNIT(tlob_create_or_delete_uprobe);
+
+static ssize_t tlob_monitor_write(struct file *file,
+ const char __user *ubuf,
+ size_t count, loff_t *ppos)
+{
+ char buf[TLOB_MAX_PATH + 128];
+
+ if (count >= sizeof(buf))
+ return -EINVAL;
+ if (copy_from_user(buf, ubuf, count))
+ return -EFAULT;
+ buf[count] = '\0';
+ if (count > 0 && buf[count - 1] == '\n')
+ buf[count - 1] = '\0';
+ return tlob_create_or_delete_uprobe(buf) ?: (ssize_t)count;
+}
+
+static const struct file_operations tlob_monitor_fops = {
+ .open = simple_open,
+ .read = tlob_monitor_read,
+ .write = tlob_monitor_write,
+ .llseek = noop_llseek,
+};
+
+static int __tlob_init_monitor(void)
+{
+ int retval;
+
+ tlob_state_cache = kmem_cache_create("tlob_task_state",
+ sizeof(struct tlob_task_state),
+ 0, 0, NULL);
+ if (!tlob_state_cache)
+ return -ENOMEM;
+
+ retval = ha_monitor_init();
+ if (retval) {
+ kmem_cache_destroy(tlob_state_cache);
+ tlob_state_cache = NULL;
+ return retval;
+ }
+
+ rv_this.enabled = 1;
+ return 0;
+}
+
+static void __tlob_destroy_monitor(void)
+{
+ rv_this.enabled = 0;
+ /*
+ * Remove uprobes first; rv_uprobe_sync() inside ensures all in-flight
+ * handlers have finished before we proceed.
+ */
+ tlob_remove_all_uprobes();
+
+ /*
+ * da_monitor_destroy() iterates any remaining entries via da_extra_cleanup
+ * (tlob_extra_cleanup), cancels their timers, and frees their state.
+ * rcu_barrier() inside drains both da_pool_return_cb and tlob_free_rcu
+ * callbacks before the pool arrays are freed.
+ */
+ ha_monitor_destroy();
+ kmem_cache_destroy(tlob_state_cache);
+ tlob_state_cache = NULL;
+}
+
+static int tlob_enable_hooks(void)
+{
+ rv_attach_trace_probe("tlob", sched_switch, handle_sched_switch);
+ rv_attach_trace_probe("tlob", sched_wakeup, handle_sched_wakeup);
+ rv_attach_trace_probe("tlob", sched_process_exit, handle_sched_process_exit);
+ return 0;
+}
+
+static void tlob_disable_hooks(void)
+{
+ rv_detach_trace_probe("tlob", sched_switch, handle_sched_switch);
+ rv_detach_trace_probe("tlob", sched_wakeup, handle_sched_wakeup);
+ rv_detach_trace_probe("tlob", sched_process_exit, handle_sched_process_exit);
+}
+
+static int enable_tlob(void)
+{
+ int retval;
+
+ retval = __tlob_init_monitor();
+ if (retval)
+ return retval;
+
+ return tlob_enable_hooks();
+}
+
+static void disable_tlob(void)
+{
+ tlob_disable_hooks();
+ __tlob_destroy_monitor();
+}
+
+static struct rv_monitor rv_this = {
+ .name = "tlob",
+ .description = "Per-task latency-over-budget monitor.",
+ .enable = enable_tlob,
+ .disable = disable_tlob,
+ .reset = da_monitor_reset_all,
+ .enabled = 0,
+};
+
+static int __init register_tlob(void)
+{
+ int ret;
+
+ ret = rv_register_monitor(&rv_this, NULL);
+ if (ret)
+ return ret;
+
+ if (rv_this.root_d) {
+ if (!tracefs_create_file("monitor", 0644, rv_this.root_d, NULL,
+ &tlob_monitor_fops)) {
+ rv_unregister_monitor(&rv_this);
+ return -ENOMEM;
+ }
+ }
+
+ return 0;
+}
+
+static void __exit unregister_tlob(void)
+{
+ rv_unregister_monitor(&rv_this);
+}
+
+module_init(register_tlob);
+module_exit(unregister_tlob);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Wen Yang <wen.yang@linux.dev>");
+MODULE_DESCRIPTION("tlob: task latency over budget per-task monitor.");
diff --git a/kernel/trace/rv/monitors/tlob/tlob.h b/kernel/trace/rv/monitors/tlob/tlob.h
new file mode 100644
index 000000000000..b6724e629c69
--- /dev/null
+++ b/kernel/trace/rv/monitors/tlob/tlob.h
@@ -0,0 +1,148 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _RV_TLOB_H
+#define _RV_TLOB_H
+
+/*
+ * C representation of the tlob hybrid automaton.
+ *
+ * Three-state HA following sched_stat / wwnr monitor naming conventions:
+ *
+ * running (initial) - task is executing on CPU [sched_stat: runtime]
+ * waiting - task is in runqueue, awaiting CPU [sched_stat: wait]
+ * sleeping - task is blocked, awaiting resource [sched_stat: sleep]
+ *
+ * Events (derived from sched_switch / sched_wakeup tracepoints):
+ * start - tlob_start_task() running → running (resets clock, arms timer)
+ * sleep - sched_switch, prev_state != 0 running → sleeping
+ * preempt - sched_switch, prev_state == 0 running → waiting
+ * wakeup - sched_wakeup sleeping → waiting
+ * switch_in - sched_switch, next == task waiting → running
+ *
+ * One HA clock invariant:
+ * clk_elapsed < BUDGET_NS() active in all states (total latency budget)
+ *
+ * tlob_start_task() uses da_handle_start_run_event(start_tlob) to initialise
+ * the monitor: the DA framework sets the initial state and then processes the
+ * start event, which resets clk_elapsed and arms the budget hrtimer via the
+ * generated ha_setup_invariants().
+ * tlob_stop_task() calls ha_cancel_timer_sync() + da_monitor_reset() directly.
+ *
+ * For the format description see:
+ * Documentation/trace/rv/deterministic_automata.rst
+ */
+
+#include <linux/rv.h>
+#include <linux/sched.h>
+
+#define MONITOR_NAME tlob
+
+enum states_tlob {
+ running_tlob,
+ waiting_tlob,
+ sleeping_tlob,
+ state_max_tlob,
+};
+
+#define INVALID_STATE state_max_tlob
+
+enum events_tlob {
+ start_tlob,
+ sleep_tlob,
+ preempt_tlob,
+ wakeup_tlob,
+ switch_in_tlob,
+ event_max_tlob,
+};
+
+/*
+ * HA environment variable: clk_elapsed is the only clock.
+ * It measures wall-clock time since tlob_start_task(); active in all states.
+ */
+enum envs_tlob {
+ clk_elapsed_tlob,
+ env_max_tlob,
+ env_max_stored_tlob = env_max_tlob,
+};
+
+_Static_assert(env_max_stored_tlob <= MAX_HA_ENV_LEN, "Not enough slots");
+#define HA_CLK_NS
+
+struct automaton_tlob {
+ char *state_names[state_max_tlob];
+ char *event_names[event_max_tlob];
+ char *env_names[env_max_tlob];
+ unsigned char function[state_max_tlob][event_max_tlob];
+ unsigned char initial_state;
+ bool final_states[state_max_tlob];
+};
+
+static const struct automaton_tlob automaton_tlob = {
+ .state_names = {
+ "running",
+ "waiting",
+ "sleeping",
+ },
+ .event_names = {
+ "start",
+ "sleep",
+ "preempt",
+ "wakeup",
+ "switch_in",
+ },
+ .env_names = {
+ "clk_elapsed",
+ },
+ .function = {
+ /* running */
+ {
+ running_tlob, /* start (tlob_start_task, resets clock) */
+ sleeping_tlob, /* sleep (sched_switch, prev_state != 0) */
+ waiting_tlob, /* preempt (sched_switch, prev_state == 0) */
+ INVALID_STATE, /* wakeup (TASK_RUNNING can't be woken) */
+ INVALID_STATE, /* switch_in (already on CPU) */
+ },
+ /* waiting */
+ {
+ INVALID_STATE, /* start (not in running state) */
+ INVALID_STATE, /* sleep (not on CPU) */
+ INVALID_STATE, /* preempt (not on CPU) */
+ INVALID_STATE, /* wakeup (already TASK_RUNNING) */
+ running_tlob, /* switch_in */
+ },
+ /* sleeping */
+ {
+ INVALID_STATE, /* start (not in running state) */
+ INVALID_STATE, /* sleep (already sleeping) */
+ INVALID_STATE, /* preempt (not on CPU) */
+ waiting_tlob, /* wakeup */
+ INVALID_STATE, /* switch_in (must go through waiting first) */
+ },
+ },
+ .initial_state = running_tlob,
+ .final_states = { 1, 0, 0 },
+};
+
+/* Maximum number of concurrently monitored tasks. */
+#define TLOB_MAX_MONITORED 64U
+
+/* Maximum binary path length for uprobe binding. */
+#define TLOB_MAX_PATH 256
+
+/*
+ * Upper bound on the monitoring budget (1 hour = 3 600 000 000 000 ns).
+ * The ns-resolution accumulators (running_ns, waiting_ns, sleeping_ns)
+ * are u64; keeping the window below this limit ensures they stay well
+ * clear of u64 overflow and covers every realistic latency-monitoring
+ * use case.
+ */
+#define TLOB_MAX_THRESHOLD_NS 3600000000000ULL
+
+/* Exported to ioctl/uprobe layers and KUnit */
+int tlob_start_task(struct task_struct *task, u64 threshold_ns);
+int tlob_stop_task(struct task_struct *task);
+
+#if IS_ENABLED(CONFIG_KUNIT)
+int tlob_create_or_delete_uprobe(char *buf);
+#endif /* CONFIG_KUNIT */
+
+#endif /* _RV_TLOB_H */
diff --git a/kernel/trace/rv/monitors/tlob/tlob_trace.h b/kernel/trace/rv/monitors/tlob/tlob_trace.h
new file mode 100644
index 000000000000..1ac4900d38e8
--- /dev/null
+++ b/kernel/trace/rv/monitors/tlob/tlob_trace.h
@@ -0,0 +1,49 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+/*
+ * Snippet to be included in rv_trace.h
+ */
+
+#ifdef CONFIG_RV_MON_TLOB
+DEFINE_EVENT(event_da_monitor_id, event_tlob,
+ TP_PROTO(int id, char *state, char *event, char *next_state, bool final_state),
+ TP_ARGS(id, state, event, next_state, final_state));
+
+DEFINE_EVENT(error_da_monitor_id, error_tlob,
+ TP_PROTO(int id, char *state, char *event),
+ TP_ARGS(id, state, event));
+
+DEFINE_EVENT(error_env_da_monitor_id, error_env_tlob,
+ TP_PROTO(int id, char *state, char *event, char *env),
+ TP_ARGS(id, state, event, env));
+
+/*
+ * detail_env_tlob - per-state latency breakdown emitted on budget violation.
+ *
+ * Fired immediately after error_env_tlob from the hrtimer callback.
+ * Fields show how much time was spent in each DA state since tlob_start_task().
+ * running_ns + waiting_ns + sleeping_ns ≈ total elapsed time (threshold_ns exceeded).
+ */
+TRACE_EVENT(detail_env_tlob,
+ TP_PROTO(int id, u64 threshold_ns,
+ u64 running_ns, u64 waiting_ns, u64 sleeping_ns),
+ TP_ARGS(id, threshold_ns, running_ns, waiting_ns, sleeping_ns),
+ TP_STRUCT__entry(
+ __field(int, id)
+ __field(u64, threshold_ns)
+ __field(u64, running_ns)
+ __field(u64, waiting_ns)
+ __field(u64, sleeping_ns)
+ ),
+ TP_fast_assign(
+ __entry->id = id;
+ __entry->threshold_ns = threshold_ns;
+ __entry->running_ns = running_ns;
+ __entry->waiting_ns = waiting_ns;
+ __entry->sleeping_ns = sleeping_ns;
+ ),
+ TP_printk("pid=%d threshold_ns=%llu running_ns=%llu waiting_ns=%llu sleeping_ns=%llu",
+ __entry->id, __entry->threshold_ns,
+ __entry->running_ns, __entry->waiting_ns, __entry->sleeping_ns)
+);
+#endif /* CONFIG_RV_MON_TLOB */
diff --git a/kernel/trace/rv/rv_trace.h b/kernel/trace/rv/rv_trace.h
index 9622c269789c..a4bc215c1f15 100644
--- a/kernel/trace/rv/rv_trace.h
+++ b/kernel/trace/rv/rv_trace.h
@@ -189,6 +189,7 @@ DECLARE_EVENT_CLASS(error_env_da_monitor_id,
#include <monitors/stall/stall_trace.h>
#include <monitors/nomiss/nomiss_trace.h>
+#include <monitors/tlob/tlob_trace.h>
// Add new monitors based on CONFIG_HA_MON_EVENTS_ID here
#endif
--
2.43.0
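
The transition table in the tlob header above can be exercised outside the kernel. What follows is a minimal userspace sketch, not the monitor's actual implementation: the `sk_` names and both helper functions are hypothetical, and only the table contents are copied from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace mirror of the tlob states and events. */
enum sk_state { SK_RUNNING, SK_WAITING, SK_SLEEPING, SK_STATE_MAX, SK_INVALID };
enum sk_event { SK_EV_START, SK_EV_SLEEP, SK_EV_PREEMPT, SK_EV_WAKEUP,
		SK_EV_SWITCH_IN, SK_EVENT_MAX };

/* Same rows and columns as the .function table in the patch. */
static const enum sk_state sk_table[SK_STATE_MAX][SK_EVENT_MAX] = {
	/* running */
	{ SK_RUNNING, SK_SLEEPING, SK_WAITING, SK_INVALID, SK_INVALID },
	/* waiting */
	{ SK_INVALID, SK_INVALID, SK_INVALID, SK_INVALID, SK_RUNNING },
	/* sleeping */
	{ SK_INVALID, SK_INVALID, SK_INVALID, SK_WAITING, SK_INVALID },
};

/* Look up one transition; out-of-range input is treated as invalid. */
static enum sk_state sk_next(enum sk_state cur, enum sk_event ev)
{
	if ((int)cur < 0 || cur >= SK_STATE_MAX)
		return SK_INVALID;
	if ((int)ev < 0 || ev >= SK_EVENT_MAX)
		return SK_INVALID;
	return sk_table[cur][ev];
}

/* Walk a sequence from the initial state; stop at the first invalid step. */
static enum sk_state sk_run(const enum sk_event *evs, size_t n)
{
	enum sk_state cur = SK_RUNNING;	/* .initial_state = running_tlob */
	size_t i;

	for (i = 0; i < n; i++) {
		cur = sk_next(cur, evs[i]);
		if (cur == SK_INVALID)
			break;
	}
	return cur;
}
```

Under these assumptions, the sequence sleep, wakeup, switch_in returns to the running state, while sleep followed directly by switch_in is rejected because a sleeping task must pass through waiting first.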
* [PATCH v3 5/9] rv/ha: make da_monitor_reset_hook and EVENT_NONE_LBL overridable
From: wen.yang @ 2026-06-07 16:13 UTC (permalink / raw)
To: Gabriele Monaco
Cc: Steven Rostedt, linux-trace-kernel, linux-kernel, Wen Yang
In-Reply-To: <cover.1780847473.git.wen.yang@linux.dev>
From: Wen Yang <wen.yang@linux.dev>
Wrap the two definitions with #ifndef guards so that HA-based monitors
can substitute their own implementations before including this header:

    /* in monitor.c, before #include <rv/ha_monitor.h> */
    #define da_monitor_reset_hook my_monitor_reset_env
    #define EVENT_NONE_LBL "idle"

No behaviour change for monitors that do not override either macro.

Signed-off-by: Wen Yang <wen.yang@linux.dev>
---
include/rv/ha_monitor.h | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/include/rv/ha_monitor.h b/include/rv/ha_monitor.h
index e5860900a337..610da54c111f 100644
--- a/include/rv/ha_monitor.h
+++ b/include/rv/ha_monitor.h
@@ -36,7 +36,10 @@ static bool ha_monitor_handle_constraint(struct da_monitor *da_mon,
da_id_type id);
#define da_monitor_event_hook ha_monitor_handle_constraint
#define da_monitor_init_hook ha_monitor_init_env
+/* Allow monitors to override da_monitor_reset_hook before including this header. */
+#ifndef da_monitor_reset_hook
#define da_monitor_reset_hook ha_monitor_reset_env
+#endif
#define da_monitor_sync_hook() synchronize_rcu()
#if !defined(HA_SKIP_AUTO_CLEANUP) && RV_MON_TYPE == RV_MON_PER_TASK
@@ -75,7 +78,9 @@ _Static_assert(offsetof(struct ha_monitor, da_mon) == 0,
#define ENV_INVALID_VALUE U64_MAX
/* Error with no event occurs only on timeouts */
#define EVENT_NONE EVENT_MAX
+#ifndef EVENT_NONE_LBL
#define EVENT_NONE_LBL "none"
+#endif
#define ENV_BUFFER_SIZE 64
#ifdef CONFIG_RV_REACTORS
--
2.43.0
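
The override mechanism in the patch above is the ordinary preprocessor pattern of defining a macro before the header's #ifndef guard is reached. Here is a single-file sketch of it, assuming simplified hook signatures: the real hooks take monitor arguments, and `my_monitor_reset_env` and the `sketch_` helpers are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* --- "monitor.c" side: overrides placed before the header contents --- */
static int my_monitor_reset_env(void)	/* hypothetical monitor-specific hook */
{
	return 1;
}
#define da_monitor_reset_hook my_monitor_reset_env
#define EVENT_NONE_LBL "idle"

/* --- "ha_monitor.h" side: defaults apply only if nothing was defined --- */
static int ha_monitor_reset_env(void)	/* simplified stand-in for the default */
{
	return 0;
}
#ifndef da_monitor_reset_hook
#define da_monitor_reset_hook ha_monitor_reset_env
#endif
#ifndef EVENT_NONE_LBL
#define EVENT_NONE_LBL "none"
#endif

/* Calls whichever hook the macro resolved to. */
static int sketch_run_reset_hook(void)
{
	return da_monitor_reset_hook();
}

/* Calls the default directly, for comparison. */
static int sketch_run_default_hook(void)
{
	return ha_monitor_reset_env();
}

static const char *sketch_event_none_label(void)
{
	return EVENT_NONE_LBL;
}
```

In this sketch the override wins: the hook macro resolves to the monitor's function and the label is "idle". Removing the two #define lines at the top restores the header defaults.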
* [PATCH v3 4/9] rv/ha: fix ha_invariant_passed_ns silent bypass of invariant check
From: wen.yang @ 2026-06-07 16:13 UTC (permalink / raw)
To: Gabriele Monaco
Cc: Steven Rostedt, linux-trace-kernel, linux-kernel, Wen Yang
In-Reply-To: <cover.1780847473.git.wen.yang@linux.dev>
From: Wen Yang <wen.yang@linux.dev>
The function is documented as "prepare the invariant and return the time
since reset", but on the first call (env_store == U64_MAX) it exits
early without calling ha_set_invariant_ns():

    if (ha_monitor_env_invalid(ha_mon, env))  /* env_store == U64_MAX */
        return 0;  /* ha_set_invariant_ns skipped, env_store stays U64_MAX */
    ...
    ha_set_invariant_ns(ha_mon, env, expire - passed, time_ns);

This leaves env_store == U64_MAX, so ha_check_invariant_ns() always
passes on the first activation regardless of elapsed time:

    return READ_ONCE(ha_mon->env_store[env]) >= time_ns;  /* U64_MAX >= any */

Fix: establish the guard before converting to the invariant:

    if (ha_monitor_env_invalid(ha_mon, env))
        ha_reset_clk_ns(ha_mon, env, time_ns);  /* guard: env_store = time_ns */
    passed = ha_get_env(ha_mon, env, time_ns);
    ha_set_invariant_ns(ha_mon, env, expire - passed, time_ns);
    /* invariant: env_store = time_ns + expire */

Apply the same fix to ha_invariant_passed_jiffy().

Signed-off-by: Wen Yang <wen.yang@linux.dev>
---
include/rv/ha_monitor.h | 17 +++++++++++++----
1 file changed, 13 insertions(+), 4 deletions(-)
diff --git a/include/rv/ha_monitor.h b/include/rv/ha_monitor.h
index 28d3c74cabfc..e5860900a337 100644
--- a/include/rv/ha_monitor.h
+++ b/include/rv/ha_monitor.h
@@ -365,16 +365,22 @@ static inline bool ha_check_invariant_ns(struct ha_monitor *ha_mon,
}
/*
* ha_invariant_passed_ns - prepare the invariant and return the time since reset
+ *
+ * If the env has not been initialised yet (first entry into a state with an
+ * invariant), anchor the guard clock at the current time so that the full
+ * budget is available from this point. This preserves the documented
+ * guard→invariant ordering: ha_set_invariant_ns() is always preceded by a
+ * valid guard representation in env_store.
*/
static inline u64 ha_invariant_passed_ns(struct ha_monitor *ha_mon, enum envs env,
u64 expire, u64 time_ns)
{
- u64 passed = 0;
+ u64 passed;
if (env < 0 || env >= ENV_MAX_STORED)
return 0;
if (ha_monitor_env_invalid(ha_mon, env))
- return 0;
+ ha_reset_clk_ns(ha_mon, env, time_ns);
passed = ha_get_env(ha_mon, env, time_ns);
ha_set_invariant_ns(ha_mon, env, expire - passed, time_ns);
return passed;
@@ -404,16 +410,19 @@ static inline bool ha_check_invariant_jiffy(struct ha_monitor *ha_mon,
}
/*
* ha_invariant_passed_jiffy - prepare the invariant and return the time since reset
+ *
+ * Same first-use semantics as ha_invariant_passed_ns(): anchor the guard clock
+ * now if the env has not been initialised.
*/
static inline u64 ha_invariant_passed_jiffy(struct ha_monitor *ha_mon, enum envs env,
u64 expire, u64 time_ns)
{
- u64 passed = 0;
+ u64 passed;
if (env < 0 || env >= ENV_MAX_STORED)
return 0;
if (ha_monitor_env_invalid(ha_mon, env))
- return 0;
+ ha_reset_clk_jiffy(ha_mon, env);
passed = ha_get_env(ha_mon, env, time_ns);
ha_set_invariant_jiffy(ha_mon, env, expire - passed);
return passed;
--
2.43.0
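
The first-use bypass described in the commit message above can be reproduced with a small userspace model. This is a sketch only: all `model_` names are hypothetical, and the helper semantics (guard stores the reset time, invariant stores the expiry time, the check compares the stored value against now) are assumed from the comments in the commit message rather than taken from the kernel helpers themselves.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_ENV_INVALID UINT64_MAX	/* mirrors ENV_INVALID_VALUE */

struct model_env {
	uint64_t store;
};

/* Guard representation: store the time of the last clock reset. */
static void model_reset_clk(struct model_env *e, uint64_t now)
{
	e->store = now;
}

/* Time elapsed since the guard was anchored. */
static uint64_t model_get_env(const struct model_env *e, uint64_t now)
{
	return now - e->store;
}

/* Invariant representation: store the absolute expiry time. */
static void model_set_invariant(struct model_env *e, uint64_t left, uint64_t now)
{
	e->store = now + left;
}

static bool model_check_invariant(const struct model_env *e, uint64_t now)
{
	return e->store >= now;
}

/* Pre-patch behaviour: early return leaves the env invalid. */
static uint64_t model_passed_old(struct model_env *e, uint64_t expire, uint64_t now)
{
	uint64_t passed;

	if (e->store == MODEL_ENV_INVALID)
		return 0;
	passed = model_get_env(e, now);
	model_set_invariant(e, expire - passed, now);
	return passed;
}

/* Post-patch behaviour: anchor the guard first, then convert. */
static uint64_t model_passed_new(struct model_env *e, uint64_t expire, uint64_t now)
{
	uint64_t passed;

	if (e->store == MODEL_ENV_INVALID)
		model_reset_clk(e, now);
	passed = model_get_env(e, now);
	model_set_invariant(e, expire - passed, now);
	return passed;
}
```

With a budget of 50 starting at time 100, the old path leaves the stored value at the invalid marker, so the check passes at any later time. The new path stores 150, so the check passes up to 150 and fails afterwards.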