Linux Trace Kernel
 help / color / mirror / Atom feed
* Re: [PATCH v6 1/4] mm/memory-failure: report MF_MSG_KERNEL for reserved pages
From: David Hildenbrand (Arm) @ 2026-05-13  7:53 UTC (permalink / raw)
  To: jane.chu, Breno Leitao, Miaohe Lin, Naoya Horiguchi,
	Andrew Morton, Jonathan Corbet, Shuah Khan, Lorenzo Stoakes,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Shuah Khan, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Liam R. Howlett
  Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest,
	linux-trace-kernel, kernel-team, Lance Yang
In-Reply-To: <816e3d8e-22d2-49a4-92ae-981568f38792@oracle.com>

On 5/12/26 19:58, jane.chu@oracle.com wrote:
> 
> 
> On 5/12/2026 1:17 AM, David Hildenbrand (Arm) wrote:
>> On 5/11/26 17:38, Breno Leitao wrote:
>>> When get_hwpoison_page() returns a negative value, distinguish
>>> reserved pages from other failure cases by reporting MF_MSG_KERNEL
>>> instead of MF_MSG_GET_HWPOISON. Reserved pages belong to the kernel
>>> and should be classified accordingly for proper handling.
>>>
>>> Sample PG_reserved before the get_hwpoison_page() call. In the
>>> MF_COUNT_INCREASED path get_any_page() can drop the caller's
>>> reference before returning -EIO, after which the underlying page may
>>> have been freed and reallocated with page->flags reset; reading
>>> PageReserved(p) at that point would observe stale or unrelated state.
>>> The pre-call snapshot reflects what the page actually was at the
>>> time of the failure event.
>>>
>>> Acked-by: Miaohe Lin <linmiaohe@huawei.com>
>>> Reviewed-by: Lance Yang <lance.yang@linux.dev>
>>> Signed-off-by: Breno Leitao <leitao@debian.org>
>>> ---
>>>   mm/memory-failure.c | 19 ++++++++++++++++++-
>>>   1 file changed, 18 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>>> index 866c4428ac7ef..f112fb27a8ff6 100644
>>> --- a/mm/memory-failure.c
>>> +++ b/mm/memory-failure.c
>>> @@ -2348,6 +2348,7 @@ int memory_failure(unsigned long pfn, int flags)
>>>       unsigned long page_flags;
>>>       bool retry = true;
>>>       int hugetlb = 0;
>>> +    bool is_reserved;
>>>         if (!sysctl_memory_failure_recovery)
>>>           panic("Memory failure on page %lx", pfn);
>>> @@ -2411,6 +2412,18 @@ int memory_failure(unsigned long pfn, int flags)
>>>        * In fact it's dangerous to directly bump up page count from 0,
>>>        * that may make page_ref_freeze()/page_ref_unfreeze() mismatch.
>>>        */
>>> +    /*
>>> +     * Pages with PG_reserved set are not currently managed by the
>>> +     * page allocator (memblock-reserved memory, driver reservations,
>>> +     * etc.), so classify them as kernel-owned for reporting.
>>> +     *
>>> +     * Sample the flag before get_hwpoison_page(): in the
>>> +     * MF_COUNT_INCREASED path, get_any_page() can drop the caller's
>>> +     * reference before returning -EIO, after which page->flags may
>>> +     * have been reset by the allocator.
>>> +     */
>>> +    is_reserved = PageReserved(p);
>>> +
>>>       res = get_hwpoison_page(p, flags);
>>>       if (!res) {
>>>           if (is_free_buddy_page(p)) {
>>> @@ -2432,7 +2445,11 @@ int memory_failure(unsigned long pfn, int flags)
>>>           }
>>>           goto unlock_mutex;
>>>       } else if (res < 0) {
>>> -        res = action_result(pfn, MF_MSG_GET_HWPOISON, MF_IGNORED);
>>> +        if (is_reserved)
>>> +            res = action_result(pfn, MF_MSG_KERNEL, MF_IGNORED);
>>> +        else
>>> +            res = action_result(pfn, MF_MSG_GET_HWPOISON,
>>> +                        MF_IGNORED);
>>>           goto unlock_mutex;
>>>       }
>>>  
>>
>> It's a bit odd that we need this handling when we already have handling for
>> reserved pages in error_states[].
>>
>> HWPoisonHandlable() would always essentially reject PG_reserved pages. So
>> __get_hwpoison_page() ... would always fail? Making
>> get_hwpoison_page()->get_any_page() always fail?
>>
>> But then, we never call identify_page_state()? And never call me_kernel()?
>>
>> This all looks very odd.
>>
>> Why would you even want to call get_hwpoison_page() in the first place if you
>> find PageReserved?
>>
> 
> Ah, good point!
> It seems to me that all unhandable pages should head out to identify_page_state:
> 
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -2411,6 +2411,10 @@ int memory_failure(unsigned long pfn, int flags)
>          * In fact it's dangerous to directly bump up page count from 0,
>          * that may make page_ref_freeze()/page_ref_unfreeze() mismatch.
>          */
> +
> +       if (!HWPoisonHandlable(page, flags)
> +               goto identify_page_state;
> +
>         res = get_hwpoison_page(p, flags);
>         if (!res) {
>                 if (is_free_buddy_page(p)) {

That's one option, or we just let get_hwpoison_page() return clearer error
codes, let it take care of checking PageReserved, and process the error codes
return by get_hwpoison_page() in a better way.

-- 
Cheers,

David

^ permalink raw reply

* Re: [PATCH v6 1/4] mm/memory-failure: report MF_MSG_KERNEL for reserved pages
From: David Hildenbrand (Arm) @ 2026-05-13  7:53 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Miaohe Lin, Naoya Horiguchi, Andrew Morton, Jonathan Corbet,
	Shuah Khan, Lorenzo Stoakes, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Shuah Khan, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett, linux-mm,
	linux-kernel, linux-doc, linux-kselftest, linux-trace-kernel,
	kernel-team, Lance Yang
In-Reply-To: <agMj4ukhj1PkXXrN@gmail.com>

On 5/12/26 15:04, Breno Leitao wrote:
> On Tue, May 12, 2026 at 10:17:00AM +0200, David Hildenbrand (Arm) wrote:
>>> @@ -2348,6 +2348,7 @@ int memory_failure(unsigned long pfn, int flags)
>>>  	unsigned long page_flags;
>>>  	bool retry = true;
>>>  	int hugetlb = 0;
>>> +	bool is_reserved;
>>>  
>>>  	if (!sysctl_memory_failure_recovery)
>>>  		panic("Memory failure on page %lx", pfn);
>>> @@ -2411,6 +2412,18 @@ int memory_failure(unsigned long pfn, int flags)
>>>  	 * In fact it's dangerous to directly bump up page count from 0,
>>>  	 * that may make page_ref_freeze()/page_ref_unfreeze() mismatch.
>>>  	 */
>>> +	/*
>>> +	 * Pages with PG_reserved set are not currently managed by the
>>> +	 * page allocator (memblock-reserved memory, driver reservations,
>>> +	 * etc.), so classify them as kernel-owned for reporting.
>>> +	 *
>>> +	 * Sample the flag before get_hwpoison_page(): in the
>>> +	 * MF_COUNT_INCREASED path, get_any_page() can drop the caller's
>>> +	 * reference before returning -EIO, after which page->flags may
>>> +	 * have been reset by the allocator.
>>> +	 */
>>> +	is_reserved = PageReserved(p);
>>> +
>>>  	res = get_hwpoison_page(p, flags);
>>>  	if (!res) {
>>>  		if (is_free_buddy_page(p)) {
>>> @@ -2432,7 +2445,11 @@ int memory_failure(unsigned long pfn, int flags)
>>>  		}
>>>  		goto unlock_mutex;
>>>  	} else if (res < 0) {
>>> -		res = action_result(pfn, MF_MSG_GET_HWPOISON, MF_IGNORED);
>>> +		if (is_reserved)
>>> +			res = action_result(pfn, MF_MSG_KERNEL, MF_IGNORED);
>>> +		else
>>> +			res = action_result(pfn, MF_MSG_GET_HWPOISON,
>>> +					    MF_IGNORED);
>>>  		goto unlock_mutex;
>>>  	}
>>>  
>>>
>>
>> It's a bit odd that we need this handling when we already have handling for
>> reserved pages in error_states[].
>>
>> HWPoisonHandlable() would always essentially reject PG_reserved pages. So
>> __get_hwpoison_page() ... would always fail? Making
>> get_hwpoison_page()->get_any_page() always fail?
>>
>> But then, we never call identify_page_state()? And never call me_kernel()?
> 
> From what I read, it seems that error_states[0] = { reserved, reserved, MF_MSG_KERNEL, me_kernel }
> has been effectively dead code on the hwpoison-from-MCE path for a
> while.
> 
> My v6 patch relabels the failure-path output to match what me_kernel() would
> have reported anyway.
> 
>> This all looks very odd.
>>
>> Why would you even want to call get_hwpoison_page() in the first place if you
>> find PageReserved?
> 
> Are you suggesting we should all the page action as soon as we detect the page
> is reserved and get out?
> 
> Something as:
> 
>     if (PageReserved(p)) {
>         res = action_result(pfn, MF_MSG_KERNEL, MF_IGNORED);
>         goto unlock_mutex;
>     }
> 
>     res = get_hwpoison_page(p, flags);

Or you combine this patch with the other patch and let simply
get_hwpoison_page() check that, and return an appropriate error code for
unhandable that you can process here?

Like, maybe, returning -EIO directly?


res = get_hwpoison_page(p, flags);
switch (res) {
case 0: /* Success */
	...
	break
case -EIO: /* Unhandable kernel page. */
	...
	break;
case -EBUSY: /* Race, try again? */
	...
	break;
case ...
}

You can add more return codes as you see fit.

-- 
Cheers,

David

^ permalink raw reply

* Re: [RFC PATCH v2 10/10] selftests/verification: add tlob selftests
From: Gabriele Monaco @ 2026-05-13  7:46 UTC (permalink / raw)
  To: wen.yang; +Cc: linux-trace-kernel, linux-kernel, Steven Rostedt
In-Reply-To: <8148267505ef90175b6b69e1ffb3aa560ff42d35.1778522945.git.wen.yang@linux.dev>

On Tue, 2026-05-12 at 02:24 +0800, wen.yang@linux.dev wrote:
> From: Wen Yang <wen.yang@linux.dev>
> 
> Add selftest coverage for the tlob RV monitor in
> tools/testing/selftests/verification/.
> 
> Two helper binaries are built by tlob/Makefile: tlob_helper for the
> ioctl interface (/dev/rv) and tlob_uprobe_target for the uprobe tests.
> The top-level Makefile delegates to tlob/ via a generic MONITOR_SUBDIRS
> pattern so monitor-specific build details stay within each monitor's
> own subdirectory.
> 
> Eight test files cover the tracefs control interface (tracefs.tc), the
> ioctl self-instrumentation interface (ioctl.tc, 8 scenarios), and the
> uprobe external monitoring interface (uprobe_bind.tc, uprobe_violation.tc,
> uprobe_no_event.tc, uprobe_multi.tc, uprobe_detail_sleeping.tc,
> uprobe_detail_waiting.tc).

Thanks for the deep test suite!

I run it on a VM (virtme-ng on my x86 16 core fedora box) and have it hanging at
step 9 (you see 8 is ok and after I get an RCU splat):

$ sudo vng -v -- make -C tools/testing/selftests/verification run_tests
...
# ok 5 Test tlob ioctl self-instrumentation (within/over-budget, error paths)
# ok 6 Test tlob monitor tracefs interface (enable/disable and files)
# ok 7 Test uprobe binding (visible in monitor file, removable, duplicate rejected)
# ok 8 Test uprobe detail sleeping (sleeping_ns dominates when task blocks between probes)
[   53.989561] tlob_target (1756) used greatest stack depth: 11792 bytes left
[   75.100818] rcu: INFO: rcu_preempt self-detected stall on CPU
[   75.100825] rcu: 	0-...!: (26082 ticks this GP) idle=a8e4/1/0x4000000000000000 softirq=0/0 fqs=13 rcuc=26078 jiffies(starved)
[   75.100833] rcu: 	(t=26000 jiffies g=17333 q=146 ncpus=16)
[   75.100836] rcu: rcu_preempt kthread timer wakeup didn't happen for 24040 jiffies! g17333 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[   75.100839] rcu: 	Possible timer handling issue on cpu=7 timer-softirq=317
[   75.100840] rcu: rcu_preempt kthread starved for 24043 jiffies! g17333 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=7
[   75.100843] rcu: 	Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
[   75.100843] rcu: RCU grace-period kthread stack dump:
[   75.100845] task:rcu_preempt     state:I stack:14104 pid:17    tgid:17    ppid:2      task_flags:0x208040 flags:0x00080000
[   75.100856] Call Trace:
[   75.100859]  <TASK>
[   75.100870]  __schedule+0x4f1/0x1490
[   75.100890]  ? __pfx_rcu_gp_kthread+0x10/0x10
[   75.100898]  schedule+0x5b/0x210
[   75.100901]  ? schedule_timeout+0xae/0x130
[   75.100905]  schedule_timeout+0xae/0x130
[   75.100911]  ? __pfx_process_timeout+0x10/0x10
[   75.100925]  rcu_gp_fqs_loop+0x114/0x880
[   75.100933]  ? lock_release+0x2ea/0x4a0
[   75.100945]  ? __pfx_rcu_gp_kthread+0x10/0x10
[   75.100948]  rcu_gp_kthread+0x26b/0x320
[   75.100951]  ? preempt_count_sub+0x5f/0x80
[   75.100963]  ? __pfx_rcu_gp_kthread+0x10/0x10
[   75.100966]  kthread+0xf3/0x130
[   75.100970]  ? __pfx_kthread+0x10/0x10
[   75.100978]  ret_from_fork+0x3b4/0x420
[   75.100984]  ? __pfx_kthread+0x10/0x10
[   75.100989]  ret_from_fork_asm+0x1a/0x30
[   75.101018]  </TASK>
[   75.101019] rcu: Stack dump where RCU GP kthread last ran:
[   75.101021] Sending NMI from CPU 0 to CPUs 7:
[   75.101106] NMI backtrace for cpu 7
[   75.101118] CPU: 7 UID: 0 PID: 0 Comm: swapper/7 Not tainted 7.1.0-rc2+ #160 PREEMPT_{RT,(lazy)} 
[   75.101124] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[   75.101128] RIP: 0010:pv_native_safe_halt+0xf/0x20
[   75.101139] Code: 75 70 00 c3 cc cc cc cc 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa eb 07 0f 00 2d 25 6e 1c 00 fb f4 <c3> cc cc cc cc 66 2e 0f 1f 84 00 00 00 00 00 66 90 90 90 90 90 90
[   75.101142] RSP: 0018:ffffd22ec0103eb8 EFLAGS: 00000296
[   75.101147] RAX: 00000000000529f3 RBX: 0000000000000000 RCX: ffffffff8ca56131
[   75.101170] RDX: ffff8de4c185c280 RSI: 0000000000000000 RDI: ffffffff8ca56131
[   75.101172] RBP: ffff8de4c185c280 R08: 0000000000000000 R09: 0000000000000000
[   75.101174] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000007
[   75.101176] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[   75.101373] FS:  0000000000000000(0000) GS:ffff8de56a091000(0000) knlGS:0000000000000000
[   75.101379] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   75.101381] CR2: 00007fb886a53f98 CR3: 000000003be5c002 CR4: 0000000000770ef0
[   75.101383] PKRU: 55555554
[   75.101384] Call Trace:
[   75.101388]  <TASK>
[   75.101389]  default_idle+0x9/0x10
[   75.101397]  default_idle_call+0x85/0x240
[   75.101404]  do_idle+0x291/0x300
[   75.101412]  ? schedule_idle+0x22/0x40
[   75.101415]  cpu_startup_entry+0x29/0x30
[   75.101418]  start_secondary+0xf8/0x100
[   75.101424]  common_startup_64+0x12c/0x138
[   75.101435]  </TASK>
[   75.102036] CPU: 0 UID: 0 PID: 1758 Comm: sh Not tainted 7.1.0-rc2+ #160 PREEMPT_{RT,(lazy)} 
[   75.102040] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[   75.102042] RIP: 0033:0x556458604e3f
[   75.102049] Code: 3c 18 4e 8d 04 3f 42 c6 04 21 00 0f b6 01 4c 89 7d b0 4c 89 c3 e9 bf ed ff ff 90 41 0f b6 c1 48 8d 15 c5 3f 11 00 80 3c 02 00 <0f> 84 a9 f0 ff ff 48 8b 45 80 f6 40 08 50 0f 85 9b f0 ff ff e9 78
[   75.102051] RSP: 002b:00007ffc7ac46e30 EFLAGS: 00000246
[   75.102054] RAX: 0000000000000074 RBX: 0000000000000074 RCX: 000055646adb8a60
[   75.102056] RDX: 0000556458718e00 RSI: 0000000000000018 RDI: 0000000000000000
[   75.102057] RBP: 00007ffc7ac46f20 R08: 000055646adc3100 R09: 0000000000000074
[   75.102058] R10: 0000000000000021 R11: 0000000000000001 R12: 0000000000000000
[   75.102059] R13: 0000000000000070 R14: 000055646adb9cf0 R15: 0000000000000000
[   75.102061] FS:  00007f832822b740 GS:  0000000000000000


Did you see that? Am I doing something wrong?

Thanks,
Gabriele

> 
> Tested on x86_64 with vng (virtme-ng):
> 
>   TAP version 13
>   1..12
>   ok 1 Test monitor enable/disable
>   ok 2 Test monitor reactor setting
>   ok 3 Check available monitors
>   ok 4 Test wwnr monitor with printk reactor
>   ok 5 Test tlob ioctl self-instrumentation (within/over-budget, error paths)
>   ok 6 Test tlob monitor tracefs interface (enable/disable and files)
>   ok 7 uprobe binding: visible in monitor file, removable, duplicate offset
> rejected
>   ok 8 uprobe detail sleeping: sleeping_ns dominates when task blocks between
> probes
>   ok 9 uprobe detail waiting: waiting_ns dominates when task is preempted
> between probes
>   ok 10 Two bindings on same binary with different offsets and budgets fire
> independently
>   ok 11 Verify no spurious error_env_tlob events without an active uprobe
> binding
>   ok 12 uprobe violation: error_env_tlob and detail_env_tlob fire with correct
> fields
>   # Totals: pass:12 fail:0 xfail:0 xpass:0 skip:0 error:0
> 
> Suggested-by: Gabriele Monaco <gmonaco@redhat.com> 
> Signed-off-by: Wen Yang <wen.yang@linux.dev>
> ---
>  tools/testing/selftests/verification/Makefile |  21 +-
>  .../verification/test.d/tlob/ioctl.tc         |  36 +
>  .../verification/test.d/tlob/tracefs.tc       |  17 +
>  .../verification/test.d/tlob/uprobe_bind.tc   |  34 +
>  .../test.d/tlob/uprobe_detail_sleeping.tc     |  47 ++
>  .../test.d/tlob/uprobe_detail_waiting.tc      |  60 ++
>  .../verification/test.d/tlob/uprobe_multi.tc  |  60 ++
>  .../test.d/tlob/uprobe_no_event.tc            |  19 +
>  .../test.d/tlob/uprobe_violation.tc           |  60 ++
>  .../selftests/verification/tlob/Makefile      |  21 +
>  .../selftests/verification/tlob/tlob_ioctl.c  | 626 ++++++++++++++++++
>  .../selftests/verification/tlob/tlob_target.c | 138 ++++
>  12 files changed, 1138 insertions(+), 1 deletion(-)
>  create mode 100644 tools/testing/selftests/verification/test.d/tlob/ioctl.tc
>  create mode 100644
> tools/testing/selftests/verification/test.d/tlob/tracefs.tc
>  create mode 100644
> tools/testing/selftests/verification/test.d/tlob/uprobe_bind.tc
>  create mode 100644
> tools/testing/selftests/verification/test.d/tlob/uprobe_detail_sleeping.tc
>  create mode 100644
> tools/testing/selftests/verification/test.d/tlob/uprobe_detail_waiting.tc
>  create mode 100644
> tools/testing/selftests/verification/test.d/tlob/uprobe_multi.tc
>  create mode 100644
> tools/testing/selftests/verification/test.d/tlob/uprobe_no_event.tc
>  create mode 100644
> tools/testing/selftests/verification/test.d/tlob/uprobe_violation.tc
>  create mode 100644 tools/testing/selftests/verification/tlob/Makefile
>  create mode 100644 tools/testing/selftests/verification/tlob/tlob_ioctl.c
>  create mode 100644 tools/testing/selftests/verification/tlob/tlob_target.c
> 
> diff --git a/tools/testing/selftests/verification/Makefile
> b/tools/testing/selftests/verification/Makefile
> index aa8790c22a71..b5584fd3762d 100644
> --- a/tools/testing/selftests/verification/Makefile
> +++ b/tools/testing/selftests/verification/Makefile
> @@ -1,8 +1,27 @@
>  # SPDX-License-Identifier: GPL-2.0
> -all:
>  
>  TEST_PROGS := verificationtest-ktap
>  TEST_FILES := test.d settings
>  EXTRA_CLEAN := $(OUTPUT)/logs/*
>  
> +# Subdirectories that provide helper binaries for the test runner.
> +# Each entry must contain a Makefile that accepts OUTDIR= and deposits
> +# its binaries there; verificationtest-ktap adds OUTDIR to PATH so
> +# the ftracetest require-checks resolve the binaries by name.
> +MONITOR_SUBDIRS := tlob
> +
>  include ../lib.mk
> +
> +# Build and clean each monitor subdirectory.
> +all: $(patsubst %,_build_%,$(MONITOR_SUBDIRS))
> +
> +clean: $(patsubst %,_clean_%,$(MONITOR_SUBDIRS))
> +
> +.PHONY: $(patsubst %,_build_%,$(MONITOR_SUBDIRS)) \
> +        $(patsubst %,_clean_%,$(MONITOR_SUBDIRS))
> +
> +$(patsubst %,_build_%,$(MONITOR_SUBDIRS)): _build_%:
> +	$(MAKE) -C $* OUTDIR="$(OUTPUT)" TOOLS_INCLUDES="$(TOOLS_INCLUDES)"
> +
> +$(patsubst %,_clean_%,$(MONITOR_SUBDIRS)): _clean_%:
> +	$(MAKE) -C $* OUTDIR="$(OUTPUT)" clean
> diff --git a/tools/testing/selftests/verification/test.d/tlob/ioctl.tc
> b/tools/testing/selftests/verification/test.d/tlob/ioctl.tc
> new file mode 100644
> index 000000000000..54ae249af9a6
> --- /dev/null
> +++ b/tools/testing/selftests/verification/test.d/tlob/ioctl.tc
> @@ -0,0 +1,36 @@
> +#!/bin/sh
> +# SPDX-License-Identifier: GPL-2.0-or-later
> +# description: Test tlob ioctl self-instrumentation (within/over-budget,
> error paths)
> +# requires: tlob:monitor tlob_ioctl:program
> +
> +TLOB_HELPER=$(command -v tlob_ioctl)
> +
> +[ -c /dev/rv ] || exit_unsupported
> +
> +echo 1 > monitors/tlob/enable
> +
> +# within budget: 50 ms threshold, 10 ms workload
> +"$TLOB_HELPER" within_budget
> +
> +# over budget in running state: 1 ms threshold, 100 ms busy-spin
> +"$TLOB_HELPER" over_budget_running
> +
> +# over budget in sleeping state: 3 ms threshold, 50 ms sleep
> +"$TLOB_HELPER" over_budget_sleeping
> +
> +# over budget in waiting state: 1 us threshold, sched_yield
> +"$TLOB_HELPER" over_budget_waiting
> +
> +# error paths
> +"$TLOB_HELPER" double_start
> +"$TLOB_HELPER" stop_no_start
> +
> +# per-thread isolation
> +"$TLOB_HELPER" multi_thread
> +
> +# bind against disabled monitor must return ENODEV, not crash
> +echo 0 > monitors/tlob/enable
> +"$TLOB_HELPER" not_enabled
> +echo 1 > monitors/tlob/enable
> +
> +echo 0 > monitors/tlob/enable
> diff --git a/tools/testing/selftests/verification/test.d/tlob/tracefs.tc
> b/tools/testing/selftests/verification/test.d/tlob/tracefs.tc
> new file mode 100644
> index 000000000000..5d1e7cc02498
> --- /dev/null
> +++ b/tools/testing/selftests/verification/test.d/tlob/tracefs.tc
> @@ -0,0 +1,17 @@
> +#!/bin/sh
> +# SPDX-License-Identifier: GPL-2.0-or-later
> +# description: Test tlob monitor tracefs interface (enable/disable and files)
> +# requires: tlob:monitor
> +
> +check_requires monitors/tlob/enable monitors/tlob/desc monitors/tlob/monitor
> +
> +# enable / disable via the enable file
> +echo 1 > monitors/tlob/enable
> +grep -q 1 monitors/tlob/enable
> +echo "tlob" >> enabled_monitors
> +grep -q tlob enabled_monitors
> +
> +echo 0 > monitors/tlob/enable
> +grep -q 0 monitors/tlob/enable
> +echo "!tlob" >> enabled_monitors
> +! grep -q "^tlob$" enabled_monitors
> diff --git a/tools/testing/selftests/verification/test.d/tlob/uprobe_bind.tc
> b/tools/testing/selftests/verification/test.d/tlob/uprobe_bind.tc
> new file mode 100644
> index 000000000000..41e20d593855
> --- /dev/null
> +++ b/tools/testing/selftests/verification/test.d/tlob/uprobe_bind.tc
> @@ -0,0 +1,34 @@
> +#!/bin/sh
> +# SPDX-License-Identifier: GPL-2.0-or-later
> +# description: Test uprobe binding (visible in monitor file, removable,
> duplicate rejected)
> +# requires: tlob:monitor tlob_ioctl:program tlob_target:program
> +
> +TLOB_HELPER=$(command -v tlob_ioctl)
> +UPROBE_TARGET=$(command -v tlob_target)
> +TLOB_MONITOR=monitors/tlob/monitor
> +
> +busy_offset=$("$TLOB_HELPER" sym_offset "$UPROBE_TARGET" tlob_busy_work
> 2>/dev/null)
> +stop_offset=$("$TLOB_HELPER" sym_offset "$UPROBE_TARGET" tlob_busy_work_done
> 2>/dev/null)
> +[ -n "$busy_offset" ] || exit_unsupported
> +[ -n "$stop_offset" ] || exit_unsupported
> +
> +"$UPROBE_TARGET" 30000 &
> +busy_pid=$!
> +sleep 0.05
> +
> +echo 1 > monitors/tlob/enable
> +echo "p ${UPROBE_TARGET}:${busy_offset} ${stop_offset} threshold=5000000" >
> "$TLOB_MONITOR"
> +
> +# Binding must appear in monitor file with canonical hex-offset format.
> +grep -qE "^p ${UPROBE_TARGET}:0x[0-9a-f]+ 0x[0-9a-f]+ threshold=[0-9]+$"
> "$TLOB_MONITOR"
> +grep -q "threshold=5000000" "$TLOB_MONITOR"
> +
> +# Duplicate offset_start must be rejected.
> +! echo "p ${UPROBE_TARGET}:${busy_offset} ${stop_offset} threshold=9999" >
> "$TLOB_MONITOR" 2>/dev/null
> +
> +# Remove the binding; it must no longer appear.
> +echo "-${UPROBE_TARGET}:${busy_offset}" > "$TLOB_MONITOR"
> +! grep -q "^p .*:0x${busy_offset#0x} " "$TLOB_MONITOR"
> +
> +kill "$busy_pid" 2>/dev/null; wait "$busy_pid" 2>/dev/null || true
> +echo 0 > monitors/tlob/enable
> diff --git
> a/tools/testing/selftests/verification/test.d/tlob/uprobe_detail_sleeping.tc
> b/tools/testing/selftests/verification/test.d/tlob/uprobe_detail_sleeping.tc
> new file mode 100644
> index 000000000000..2b8656e0fef1
> --- /dev/null
> +++
> b/tools/testing/selftests/verification/test.d/tlob/uprobe_detail_sleeping.tc
> @@ -0,0 +1,47 @@
> +#!/bin/sh
> +# SPDX-License-Identifier: GPL-2.0-or-later
> +# description: Test uprobe detail sleeping (sleeping_ns dominates when task
> blocks between probes)
> +# requires: tlob:monitor tlob_ioctl:program tlob_target:program
> +
> +TLOB_HELPER=$(command -v tlob_ioctl)
> +UPROBE_TARGET=$(command -v tlob_target)
> +TLOB_MONITOR=monitors/tlob/monitor
> +
> +start_offset=$("$TLOB_HELPER" sym_offset "$UPROBE_TARGET" tlob_sleep_work
> 2>/dev/null)
> +stop_offset=$("$TLOB_HELPER" sym_offset "$UPROBE_TARGET" tlob_sleep_work_done
> 2>/dev/null)
> +[ -n "$start_offset" ] || exit_unsupported
> +[ -n "$stop_offset" ] || exit_unsupported
> +
> +"$UPROBE_TARGET" 5000 sleep &
> +busy_pid=$!
> +sleep 0.05
> +
> +echo 1 > /sys/kernel/tracing/events/rv/detail_env_tlob/enable
> +echo 1 > /sys/kernel/tracing/tracing_on
> +echo 1 > monitors/tlob/enable
> +echo > /sys/kernel/tracing/trace
> +
> +# 50 ms budget; task sleeps 200 ms per iteration -> sleeping_ns dominates.
> +echo "p ${UPROBE_TARGET}:${start_offset} ${stop_offset} threshold=50000" >
> "$TLOB_MONITOR"
> +
> +found=0; i=0
> +while [ "$i" -lt 30 ]; do
> +	sleep 0.1
> +	grep -q "detail_env_tlob" /sys/kernel/tracing/trace && { found=1;
> break; }
> +	i=$((i+1))
> +done
> +
> +echo "-${UPROBE_TARGET}:${start_offset}" > "$TLOB_MONITOR" 2>/dev/null
> +kill "$busy_pid" 2>/dev/null; wait "$busy_pid" 2>/dev/null || true
> +echo 0 > /sys/kernel/tracing/events/rv/detail_env_tlob/enable
> +echo 0 > monitors/tlob/enable
> +
> +[ "$found" = "1" ]
> +
> +line=$(grep "detail_env_tlob" /sys/kernel/tracing/trace | head -n 1)
> +running=$(echo "$line" | sed 's/.*running_ns=\([0-9]*\).*/\1/')
> +waiting=$(echo "$line" | sed 's/.*waiting_ns=\([0-9]*\).*/\1/')
> +sleeping=$(echo "$line" | sed 's/.*sleeping_ns=\([0-9]*\).*/\1/')
> +[ "$sleeping" -gt "$((running + waiting))" ]
> +
> +echo > /sys/kernel/tracing/trace
> diff --git
> a/tools/testing/selftests/verification/test.d/tlob/uprobe_detail_waiting.tc
> b/tools/testing/selftests/verification/test.d/tlob/uprobe_detail_waiting.tc
> new file mode 100644
> index 000000000000..0705854f24df
> --- /dev/null
> +++
> b/tools/testing/selftests/verification/test.d/tlob/uprobe_detail_waiting.tc
> @@ -0,0 +1,60 @@
> +#!/bin/sh
> +# SPDX-License-Identifier: GPL-2.0-or-later
> +# description: Test uprobe detail waiting (waiting_ns dominates when task is
> preempted between probes)
> +# requires: tlob:monitor tlob_ioctl:program tlob_target:program
> +
> +TLOB_HELPER=$(command -v tlob_ioctl)
> +UPROBE_TARGET=$(command -v tlob_target)
> +TLOB_MONITOR=monitors/tlob/monitor
> +
> +command -v chrt    > /dev/null || exit_unsupported
> +command -v taskset > /dev/null || exit_unsupported
> +
> +start_offset=$("$TLOB_HELPER" sym_offset "$UPROBE_TARGET" tlob_preempt_work
> 2>/dev/null)
> +stop_offset=$("$TLOB_HELPER" sym_offset "$UPROBE_TARGET"
> tlob_preempt_work_done 2>/dev/null)
> +[ -n "$start_offset" ] || exit_unsupported
> +[ -n "$stop_offset" ]  || exit_unsupported
> +
> +cpu=0
> +
> +echo 1 > /sys/kernel/tracing/events/rv/detail_env_tlob/enable
> +echo 1 > /sys/kernel/tracing/tracing_on
> +echo 1 > monitors/tlob/enable
> +echo > /sys/kernel/tracing/trace
> +
> +# Register probe before the target starts so the start uprobe fires on the
> +# first entry to tlob_preempt_work. Budget: 500 ms.
> +echo "p ${UPROBE_TARGET}:${start_offset} ${stop_offset} threshold=500000" >
> "$TLOB_MONITOR"
> +
> +# Target starts; start probe fires on tlob_preempt_work entry.
> +taskset -c "$cpu" "$UPROBE_TARGET" 5000 preempt &
> +busy_pid=$!
> +sleep 0.05
> +
> +# RT hog on the same CPU preempts the target; target stays in waiting state
> +# (runnable, off-CPU) until the budget expires -> waiting_ns dominates.
> +chrt -f 99 taskset -c "$cpu" sh -c 'while true; do :; done' 2>/dev/null &
> +hog_pid=$!
> +
> +found=0; i=0
> +while [ "$i" -lt 30 ]; do
> +	sleep 0.1
> +	grep -q "detail_env_tlob" /sys/kernel/tracing/trace && { found=1;
> break; }
> +	i=$((i+1))
> +done
> +
> +echo "-${UPROBE_TARGET}:${start_offset}" > "$TLOB_MONITOR" 2>/dev/null
> +kill "$hog_pid" 2>/dev/null; wait "$hog_pid" 2>/dev/null || true
> +kill "$busy_pid" 2>/dev/null; wait "$busy_pid" 2>/dev/null || true
> +echo 0 > /sys/kernel/tracing/events/rv/detail_env_tlob/enable
> +echo 0 > monitors/tlob/enable
> +
> +[ "$found" = "1" ]
> +
> +line=$(grep "detail_env_tlob" /sys/kernel/tracing/trace | head -n 1)
> +running=$(echo "$line" | sed 's/.*running_ns=\([0-9]*\).*/\1/')
> +sleeping=$(echo "$line" | sed 's/.*sleeping_ns=\([0-9]*\).*/\1/')
> +waiting=$(echo "$line" | sed 's/.*waiting_ns=\([0-9]*\).*/\1/')
> +[ "$waiting" -gt "$((running + sleeping))" ]
> +
> +echo > /sys/kernel/tracing/trace
> diff --git a/tools/testing/selftests/verification/test.d/tlob/uprobe_multi.tc
> b/tools/testing/selftests/verification/test.d/tlob/uprobe_multi.tc
> new file mode 100644
> index 000000000000..c4b8f7108ae9
> --- /dev/null
> +++ b/tools/testing/selftests/verification/test.d/tlob/uprobe_multi.tc
> @@ -0,0 +1,60 @@
> +#!/bin/sh
> +# SPDX-License-Identifier: GPL-2.0-or-later
> +# description: Test two uprobe bindings on same binary (different offsets
> fire independently)
> +# requires: tlob:monitor tlob_ioctl:program tlob_target:program
> +
> +TLOB_HELPER=$(command -v tlob_ioctl)
> +UPROBE_TARGET=$(command -v tlob_target)
> +TLOB_MONITOR=monitors/tlob/monitor
> +
> +busy_offset=$("$TLOB_HELPER" sym_offset "$UPROBE_TARGET" tlob_busy_work
> 2>/dev/null)
> +busy_stop=$("$TLOB_HELPER" sym_offset "$UPROBE_TARGET" tlob_busy_work_done
> 2>/dev/null)
> +sleep_offset=$("$TLOB_HELPER" sym_offset "$UPROBE_TARGET" tlob_sleep_work
> 2>/dev/null)
> +sleep_stop=$("$TLOB_HELPER" sym_offset "$UPROBE_TARGET" tlob_sleep_work_done
> 2>/dev/null)
> +[ -n "$busy_offset" ]  || exit_unsupported
> +[ -n "$busy_stop" ]    || exit_unsupported
> +[ -n "$sleep_offset" ] || exit_unsupported
> +[ -n "$sleep_stop" ]   || exit_unsupported
> +
> +"$UPROBE_TARGET" 30000 &       # busy mode: tlob_busy_work fires every 200 ms
> +busy_pid=$!
> +"$UPROBE_TARGET" 30000 sleep & # sleep mode: tlob_sleep_work fires every 200
> ms
> +sleep_pid=$!
> +sleep 0.05
> +
> +echo 1 > /sys/kernel/tracing/events/rv/error_env_tlob/enable
> +echo 1 > /sys/kernel/tracing/events/rv/detail_env_tlob/enable
> +echo 1 > /sys/kernel/tracing/tracing_on
> +echo 1 > monitors/tlob/enable
> +echo > /sys/kernel/tracing/trace
> +
> +# Binding A: 5 s budget on the busy probe - must not fire in 200 ms loops.
> +echo "p ${UPROBE_TARGET}:${busy_offset} ${busy_stop} threshold=5000000" >
> "$TLOB_MONITOR"
> +# Binding B: 10 ns budget on the sleep probe - fires on first invocation.
> +echo "p ${UPROBE_TARGET}:${sleep_offset} ${sleep_stop} threshold=10" >
> "$TLOB_MONITOR"
> +
> +# Wait up to 2 s for error_env_tlob from binding B.
> +found=0; i=0
> +while [ "$i" -lt 20 ]; do
> +	sleep 0.1
> +	grep -q "error_env_tlob" /sys/kernel/tracing/trace && { found=1;
> break; }
> +	i=$((i+1))
> +done
> +
> +echo "-${UPROBE_TARGET}:${busy_offset}" > "$TLOB_MONITOR" 2>/dev/null
> +echo "-${UPROBE_TARGET}:${sleep_offset}" > "$TLOB_MONITOR" 2>/dev/null
> +kill "$sleep_pid" 2>/dev/null; wait "$sleep_pid" 2>/dev/null || true
> +kill "$busy_pid" 2>/dev/null; wait "$busy_pid" 2>/dev/null || true
> +
> +echo 0 > monitors/tlob/enable
> +echo 0 > /sys/kernel/tracing/events/rv/error_env_tlob/enable
> +echo 0 > /sys/kernel/tracing/events/rv/detail_env_tlob/enable
> +
> +[ "$found" = "1" ]
> +# error_env_tlob payload: label and clock variable must be present.
> +grep "error_env_tlob" /sys/kernel/tracing/trace | head -n 1 | grep -q
> "budget_exceeded"
> +grep "error_env_tlob" /sys/kernel/tracing/trace | head -n 1 | grep -q
> "clk_elapsed="
> +# detail_env_tlob must appear alongside the error.
> +grep -q "detail_env_tlob" /sys/kernel/tracing/trace
> +
> +echo > /sys/kernel/tracing/trace
> diff --git
> a/tools/testing/selftests/verification/test.d/tlob/uprobe_no_event.tc
> b/tools/testing/selftests/verification/test.d/tlob/uprobe_no_event.tc
> new file mode 100644
> index 000000000000..4a74853346e3
> --- /dev/null
> +++ b/tools/testing/selftests/verification/test.d/tlob/uprobe_no_event.tc
> @@ -0,0 +1,19 @@
> +#!/bin/sh
> +# SPDX-License-Identifier: GPL-2.0-or-later
> +# description: Test no spurious error_env_tlob events without an active
> uprobe binding
> +# requires: tlob:monitor tlob_ioctl:program
> +
> +TLOB_MONITOR=monitors/tlob/monitor
> +
> +echo 1 > /sys/kernel/tracing/events/rv/error_env_tlob/enable
> +echo 1 > /sys/kernel/tracing/tracing_on
> +echo 1 > monitors/tlob/enable
> +echo > /sys/kernel/tracing/trace
> +
> +sleep 0.5
> +
> +! grep -q "error_env_tlob" /sys/kernel/tracing/trace
> +
> +echo 0 > monitors/tlob/enable
> +echo 0 > /sys/kernel/tracing/events/rv/error_env_tlob/enable
> +echo > /sys/kernel/tracing/trace
> diff --git
> a/tools/testing/selftests/verification/test.d/tlob/uprobe_violation.tc
> b/tools/testing/selftests/verification/test.d/tlob/uprobe_violation.tc
> new file mode 100644
> index 000000000000..624fdb950f6b
> --- /dev/null
> +++ b/tools/testing/selftests/verification/test.d/tlob/uprobe_violation.tc
> @@ -0,0 +1,60 @@
> +#!/bin/sh
> +# SPDX-License-Identifier: GPL-2.0-or-later
> +# description: Test uprobe violation (error_env_tlob and detail_env_tlob fire
> with correct fields)
> +# requires: tlob:monitor tlob_ioctl:program tlob_target:program
> +
> +TLOB_HELPER=$(command -v tlob_ioctl)
> +UPROBE_TARGET=$(command -v tlob_target)
> +TLOB_MONITOR=monitors/tlob/monitor
> +
> +busy_offset=$("$TLOB_HELPER" sym_offset "$UPROBE_TARGET" tlob_busy_work
> 2>/dev/null)
> +stop_offset=$("$TLOB_HELPER" sym_offset "$UPROBE_TARGET" tlob_busy_work_done
> 2>/dev/null)
> +[ -n "$busy_offset" ] || exit_unsupported
> +[ -n "$stop_offset" ] || exit_unsupported
> +
> +"$UPROBE_TARGET" 30000 &
> +busy_pid=$!
> +sleep 0.05
> +
> +echo 1 > /sys/kernel/tracing/events/rv/error_env_tlob/enable
> +echo 1 > /sys/kernel/tracing/events/rv/detail_env_tlob/enable
> +echo 1 > /sys/kernel/tracing/tracing_on
> +echo 1 > monitors/tlob/enable
> +echo > /sys/kernel/tracing/trace
> +
> +# 10 ns budget - fires almost immediately; task is busy-spinning on-CPU.
> +echo "p ${UPROBE_TARGET}:${busy_offset} ${stop_offset} threshold=10" >
> "$TLOB_MONITOR"
> +
> +# wait up to 2 s for detail_env_tlob
> +found=0; i=0
> +while [ "$i" -lt 20 ]; do
> +	sleep 0.1
> +	grep -q "detail_env_tlob" /sys/kernel/tracing/trace && { found=1;
> break; }
> +	i=$((i+1))
> +done
> +
> +echo "-${UPROBE_TARGET}:${busy_offset}" > "$TLOB_MONITOR" 2>/dev/null
> +kill "$busy_pid" 2>/dev/null; wait "$busy_pid" 2>/dev/null || true
> +echo 0 > /sys/kernel/tracing/events/rv/error_env_tlob/enable
> +echo 0 > /sys/kernel/tracing/events/rv/detail_env_tlob/enable
> +echo 0 > monitors/tlob/enable
> +
> +[ "$found" = "1" ]
> +
> +# error_env_tlob event label must be budget_exceeded
> +grep "error_env_tlob" /sys/kernel/tracing/trace | head -n 1 | grep -q
> "budget_exceeded"
> +
> +# detail_env_tlob must have all five fields with the correct threshold
> +line=$(grep "detail_env_tlob" /sys/kernel/tracing/trace | head -n 1)
> +echo "$line" | grep -q "pid="
> +echo "$line" | grep -q "threshold_us=10"
> +echo "$line" | grep -q "running_ns="
> +echo "$line" | grep -q "waiting_ns="
> +echo "$line" | grep -q "sleeping_ns="
> +
> +# Busy-spin keeps the task on-CPU: running_ns must exceed sleeping_ns.
> +running=$(echo "$line" | sed 's/.*running_ns=\([0-9]*\).*/\1/')
> +sleeping=$(echo "$line" | sed 's/.*sleeping_ns=\([0-9]*\).*/\1/')
> +[ "$running" -gt "$sleeping" ]
> +
> +echo > /sys/kernel/tracing/trace
> diff --git a/tools/testing/selftests/verification/tlob/Makefile
> b/tools/testing/selftests/verification/tlob/Makefile
> new file mode 100644
> index 000000000000..1bedf946cb34
> --- /dev/null
> +++ b/tools/testing/selftests/verification/tlob/Makefile
> @@ -0,0 +1,21 @@
> +# SPDX-License-Identifier: GPL-2.0
> +# Builds tlob selftest helper binaries.
> +#
> +# Invoked by ../Makefile; pass OUTDIR to control the output directory
> +# and TOOLS_INCLUDES for the in-tree UAPI -isystem flag.
> +
> +OUTDIR ?= $(CURDIR)/..
> +CFLAGS += $(TOOLS_INCLUDES)
> +
> +.PHONY: all
> +all: $(OUTDIR)/tlob_ioctl $(OUTDIR)/tlob_target
> +
> +$(OUTDIR)/tlob_ioctl: tlob_ioctl.c
> +	$(CC) $(CFLAGS) -o $@ $< -lpthread
> +
> +$(OUTDIR)/tlob_target: tlob_target.c
> +	$(CC) $(CFLAGS) -o $@ $<
> +
> +.PHONY: clean
> +clean:
> +	$(RM) $(OUTDIR)/tlob_ioctl $(OUTDIR)/tlob_target
> diff --git a/tools/testing/selftests/verification/tlob/tlob_ioctl.c
> b/tools/testing/selftests/verification/tlob/tlob_ioctl.c
> new file mode 100644
> index 000000000000..abb4e2e80a2c
> --- /dev/null
> +++ b/tools/testing/selftests/verification/tlob/tlob_ioctl.c
> @@ -0,0 +1,626 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * tlob_ioctl.c - ioctl test driver and ELF utility for tlob selftests
> + *
> + * Usage: tlob_ioctl <subcommand> [args...]
> + *
> + *   not_enabled          - TRACE_START without monitor enabled -> ENODEV
> + *   within_budget        - sleep within budget -> 0
> + *   over_budget_running  - busy-spin past budget -> EOVERFLOW
> + *   over_budget_sleeping - sleep past budget -> EOVERFLOW
> + *   over_budget_waiting  - sched_yield into waiting state -> EOVERFLOW
> + *   double_start         - two starts without stop -> EALREADY
> + *   stop_no_start        - stop without start -> EINVAL
> + *   multi_thread         - two fds: thread A within budget, thread B over
> + *   bench                - TRACE_START/STOP latency (TAP output, always
> passes)
> + *   sym_offset <binary> <symbol> - print ELF file offset of symbol
> + *
> + * Exit: 0 = pass, 1 = fail, 2 = skip (device not available).
> + */
> +#define _GNU_SOURCE
> +#include <elf.h>
> +#include <errno.h>
> +#include <fcntl.h>
> +#include <pthread.h>
> +#include <sched.h>
> +#include <stdint.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <sys/ioctl.h>
> +#include <sys/mman.h>
> +#include <sys/stat.h>
> +#include <time.h>
> +#include <unistd.h>
> +
> +#include <linux/rv.h>
> +
> +static int rv_fd = -1;
> +
> +static int open_rv(void)
> +{
> +	struct rv_bind_args bind = { .monitor_name = "tlob" };
> +
> +	rv_fd = open("/dev/rv", O_RDWR);
> +	if (rv_fd < 0) {
> +		fprintf(stderr, "open /dev/rv: %s\n", strerror(errno));
> +		return -1;
> +	}
> +	if (ioctl(rv_fd, RV_IOCTL_BIND_MONITOR, &bind) < 0) {
> +		fprintf(stderr, "bind tlob: %s\n", strerror(errno));
> +		close(rv_fd);
> +		rv_fd = -1;
> +		return -1;
> +	}
> +	return 0;
> +}
> +
> +static void busy_spin_us(unsigned long us)
> +{
> +	struct timespec start, now;
> +	unsigned long elapsed;
> +
> +	clock_gettime(CLOCK_MONOTONIC, &start);
> +	do {
> +		clock_gettime(CLOCK_MONOTONIC, &now);
> +		elapsed = (unsigned long)(now.tv_sec - start.tv_sec)
> +			  * 1000000000UL
> +			+ (unsigned long)(now.tv_nsec - start.tv_nsec);
> +	} while (elapsed < us * 1000UL);
> +}
> +
> +static int trace_start(uint64_t threshold_us)
> +{
> +	struct tlob_start_args args = {
> +		.threshold_us = threshold_us,
> +	};
> +
> +	return ioctl(rv_fd, TLOB_IOCTL_TRACE_START, &args);
> +}
> +
> +static int trace_stop(void)
> +{
> +	return ioctl(rv_fd, TLOB_IOCTL_TRACE_STOP, NULL);
> +}
> +
> +/* Synchronous TRACE_START / TRACE_STOP tests */
> +
> +/* Bind to a disabled monitor must return ENODEV without crashing */
> +static int test_not_enabled(void)
> +{
> +	struct rv_bind_args bind = { .monitor_name = "tlob" };
> +	int fd;
> +	int ret;
> +
> +	fd = open("/dev/rv", O_RDWR);
> +	if (fd < 0) {
> +		fprintf(stderr, "open /dev/rv: %s\n", strerror(errno));
> +		return 2; /* skip */
> +	}
> +
> +	ret = ioctl(fd, RV_IOCTL_BIND_MONITOR, &bind);
> +	close(fd);
> +
> +	if (ret == 0) {
> +		fprintf(stderr, "RV_IOCTL_BIND_MONITOR: expected ENODEV, got
> success\n");
> +		return 1;
> +	}
> +	if (errno != ENODEV) {
> +		fprintf(stderr, "RV_IOCTL_BIND_MONITOR: expected ENODEV, got
> %s\n",
> +			strerror(errno));
> +		return 1;
> +	}
> +	return 0;
> +}
> +
> +static int test_within_budget(void)
> +{
> +	int ret;
> +
> +	/* 50 ms budget */
> +	if (trace_start(50000) < 0) {
> +		fprintf(stderr, "TRACE_START: %s\n", strerror(errno));
> +		return 1;
> +	}
> +	usleep(10000); /* 10 ms */
> +	ret = trace_stop();
> +	if (ret != 0) {
> +		fprintf(stderr, "TRACE_STOP: expected 0, got %d errno=%s\n",
> +			ret, strerror(errno));
> +		return 1;
> +	}
> +	return 0;
> +}
> +
> +static int test_over_budget_running(void)
> +{
> +	int ret;
> +
> +	/* 1 ms budget */
> +	if (trace_start(1000) < 0) {
> +		fprintf(stderr, "TRACE_START: %s\n", strerror(errno));
> +		return 1;
> +	}
> +	busy_spin_us(100000); /* 100 ms */
> +	ret = trace_stop();
> +	if (ret == 0) {
> +		fprintf(stderr, "TRACE_STOP: expected EOVERFLOW, got 0\n");
> +		return 1;
> +	}
> +	if (errno != EOVERFLOW) {
> +		fprintf(stderr, "TRACE_STOP: expected EOVERFLOW, got %s\n",
> +			strerror(errno));
> +		return 1;
> +	}
> +	return 0;
> +}
> +
> +static int test_over_budget_sleeping(void)
> +{
> +	int ret;
> +
> +	/* 3 ms budget */
> +	if (trace_start(3000) < 0) {
> +		fprintf(stderr, "TRACE_START: %s\n", strerror(errno));
> +		return 1;
> +	}
> +	usleep(50000); /* 50 ms; sleeping time counts toward budget */
> +	ret = trace_stop();
> +	if (ret == 0) {
> +		fprintf(stderr, "TRACE_STOP: expected EOVERFLOW, got 0\n");
> +		return 1;
> +	}
> +	if (errno != EOVERFLOW) {
> +		fprintf(stderr, "TRACE_STOP: expected EOVERFLOW, got %s\n",
> +			strerror(errno));
> +		return 1;
> +	}
> +	return 0;
> +}
> +
> +static int test_over_budget_waiting(void)
> +{
> +	int ret;
> +
> +	/* 1 us budget */
> +	if (trace_start(1) < 0) {
> +		fprintf(stderr, "TRACE_START: %s\n", strerror(errno));
> +		return 1;
> +	}
> +	sched_yield(); /* running -> waiting -> running */
> +	busy_spin_us(10); /* 10 us >> 1 us budget; hrtimer fires during spin
> */
> +	ret = trace_stop();
> +	if (ret == 0) {
> +		fprintf(stderr, "TRACE_STOP: expected EOVERFLOW, got 0\n");
> +		return 1;
> +	}
> +	if (errno != EOVERFLOW) {
> +		fprintf(stderr, "TRACE_STOP: expected EOVERFLOW, got %s\n",
> +			strerror(errno));
> +		return 1;
> +	}
> +	return 0;
> +}
> +
> +/* Error-handling tests */
> +
> +static int test_double_start(void)
> +{
> +	int ret;
> +
> +	/* 10 s: large enough the hrtimer won't fire during the test */
> +	if (trace_start(10000000ULL) < 0) {
> +		fprintf(stderr, "first TRACE_START: %s\n", strerror(errno));
> +		return 1;
> +	}
> +	ret = trace_start(10000000);
> +	if (ret == 0) {
> +		fprintf(stderr, "second TRACE_START: expected EALREADY, got
> 0\n");
> +		trace_stop();
> +		return 1;
> +	}
> +	if (errno != EALREADY) {
> +		fprintf(stderr, "second TRACE_START: expected EALREADY, got
> %s\n",
> +			strerror(errno));
> +		trace_stop();
> +		return 1;
> +	}
> +	trace_stop();
> +	return 0;
> +}
> +
> +static int test_stop_no_start(void)
> +{
> +	int ret;
> +
> +	/* Ensure clean state: ignore error from a stale entry */
> +	trace_stop();
> +
> +	ret = trace_stop();
> +	if (ret == 0) {
> +		fprintf(stderr, "TRACE_STOP: expected EINVAL, got 0\n");
> +		return 1;
> +	}
> +	if (errno != EINVAL) {
> +		fprintf(stderr, "TRACE_STOP: expected EINVAL, got %s\n",
> +			strerror(errno));
> +		return 1;
> +	}
> +	return 0;
> +}
> +
> +/* Two threads, each with its own fd: A within budget, B over budget. */
> +
> +struct mt_thread_args {
> +	uint64_t      threshold_us;
> +	unsigned long workload_us;
> +	int           busy;
> +	int           expect_eoverflow;
> +	int           result;
> +};
> +
> +static void *mt_thread_fn(void *arg)
> +{
> +	struct mt_thread_args *a = arg;
> +	struct tlob_start_args args = { .threshold_us = a->threshold_us };
> +	struct rv_bind_args bind = { .monitor_name = "tlob" };
> +	int fd;
> +	int ret;
> +
> +	fd = open("/dev/rv", O_RDWR);
> +	if (fd < 0) {
> +		fprintf(stderr, "thread open /dev/rv: %s\n",
> strerror(errno));
> +		a->result = 1;
> +		return NULL;
> +	}
> +	if (ioctl(fd, RV_IOCTL_BIND_MONITOR, &bind) < 0) {
> +		fprintf(stderr, "thread bind tlob: %s\n", strerror(errno));
> +		close(fd);
> +		a->result = 1;
> +		return NULL;
> +	}
> +
> +	ret = ioctl(fd, TLOB_IOCTL_TRACE_START, &args);
> +	if (ret < 0) {
> +		fprintf(stderr, "thread TRACE_START: %s\n", strerror(errno));
> +		close(fd);
> +		a->result = 1;
> +		return NULL;
> +	}
> +
> +	if (a->busy)
> +		busy_spin_us(a->workload_us);
> +	else
> +		usleep(a->workload_us);
> +
> +	ret = ioctl(fd, TLOB_IOCTL_TRACE_STOP, NULL);
> +	if (a->expect_eoverflow) {
> +		if (ret == 0 || errno != EOVERFLOW) {
> +			fprintf(stderr, "thread: expected EOVERFLOW, got
> ret=%d errno=%s\n",
> +				ret, strerror(errno));
> +			close(fd);
> +			a->result = 1;
> +			return NULL;
> +		}
> +	} else {
> +		if (ret != 0) {
> +			fprintf(stderr, "thread: expected 0, got ret=%d
> errno=%s\n",
> +				ret, strerror(errno));
> +			close(fd);
> +			a->result = 1;
> +			return NULL;
> +		}
> +	}
> +	close(fd);
> +	a->result = 0;
> +	return NULL;
> +}
> +
> +static int test_multi_thread(void)
> +{
> +	pthread_t ta, tb;
> +	struct mt_thread_args a = {
> +		.threshold_us     = 20000,   /* 20 ms */
> +		.workload_us      = 5000,    /* 5 ms sleep -> within budget
> */
> +		.busy             = 0,
> +		.expect_eoverflow = 0,
> +	};
> +	struct mt_thread_args b = {
> +		.threshold_us     = 3000,    /* 3 ms */
> +		.workload_us      = 30000,   /* 30 ms spin -> over budget */
> +		.busy             = 1,
> +		.expect_eoverflow = 1,
> +	};
> +
> +	pthread_create(&ta, NULL, mt_thread_fn, &a);
> +	pthread_create(&tb, NULL, mt_thread_fn, &b);
> +	pthread_join(ta, NULL);
> +	pthread_join(tb, NULL);
> +
> +	return (a.result || b.result) ? 1 : 0;
> +}
> +
> +/*
> + * Benchmark TRACE_START, TRACE_STOP, and round-trip ioctls.
> + * Output uses TAP '#' prefix; always returns 0.
> + */
> +#define BENCH_WARMUP  32
> +#define BENCH_N      1000
> +
> +static long long timespec_diff_ns(const struct timespec *a,
> +				   const struct timespec *b)
> +{
> +	return (long long)(b->tv_sec - a->tv_sec) * 1000000000LL
> +		+ (b->tv_nsec - a->tv_nsec);
> +}
> +
> +static int test_bench(void)
> +{
> +	struct tlob_start_args args = {
> +		.threshold_us = 10000000ULL, /* 10 s */
> +	};
> +	struct timespec t0, t1;
> +	long long total_start_ns = 0, total_stop_ns = 0, total_rt_ns = 0;
> +	int i;
> +
> +	/* warm up */
> +	for (i = 0; i < BENCH_WARMUP; i++) {
> +		if (ioctl(rv_fd, TLOB_IOCTL_TRACE_START, &args) == 0)
> +			ioctl(rv_fd, TLOB_IOCTL_TRACE_STOP, NULL);
> +	}
> +
> +	/* start only */
> +	for (i = 0; i < BENCH_N; i++) {
> +		clock_gettime(CLOCK_MONOTONIC, &t0);
> +		ioctl(rv_fd, TLOB_IOCTL_TRACE_START, &args);
> +		clock_gettime(CLOCK_MONOTONIC, &t1);
> +		total_start_ns += timespec_diff_ns(&t0, &t1);
> +		ioctl(rv_fd, TLOB_IOCTL_TRACE_STOP, NULL);
> +	}
> +
> +	/* stop only */
> +	for (i = 0; i < BENCH_N; i++) {
> +		ioctl(rv_fd, TLOB_IOCTL_TRACE_START, &args);
> +		clock_gettime(CLOCK_MONOTONIC, &t0);
> +		ioctl(rv_fd, TLOB_IOCTL_TRACE_STOP, NULL);
> +		clock_gettime(CLOCK_MONOTONIC, &t1);
> +		total_stop_ns += timespec_diff_ns(&t0, &t1);
> +	}
> +
> +	/* round-trip */
> +	clock_gettime(CLOCK_MONOTONIC, &t0);
> +	for (i = 0; i < BENCH_N; i++) {
> +		ioctl(rv_fd, TLOB_IOCTL_TRACE_START, &args);
> +		ioctl(rv_fd, TLOB_IOCTL_TRACE_STOP, NULL);
> +	}
> +	clock_gettime(CLOCK_MONOTONIC, &t1);
> +	total_rt_ns = timespec_diff_ns(&t0, &t1);
> +
> +	printf("# start ioctl only:      %lld ns/iter (N=%d, includes
> syscall)\n",
> +	       total_start_ns / BENCH_N, BENCH_N);
> +	printf("# stop ioctl only:       %lld ns/iter (N=%d, includes
> syscall)\n",
> +	       total_stop_ns / BENCH_N, BENCH_N);
> +	printf("# start+stop roundtrip:  %lld ns/iter (N=%d, includes 2
> syscalls)\n",
> +	       total_rt_ns / BENCH_N, BENCH_N);
> +	return 0;
> +}
> +
> +/*
> + * Print the ELF file offset of <symname> in <binary>.  Walks .symtab
> + * (falling back to .dynsym) and converts vaddr to file offset via PT_LOAD.
> + * Supports 32- and 64-bit ELF.
> + */
> +static int sym_offset(const char *binary, const char *symname)
> +{
> +	int fd;
> +	struct stat st;
> +	void *map;
> +	Elf64_Ehdr *ehdr;
> +	Elf32_Ehdr *ehdr32;
> +	int is64;
> +	uint64_t sym_vaddr = 0;
> +	int found = 0;
> +	uint64_t file_offset = 0;
> +
> +	fd = open(binary, O_RDONLY);
> +	if (fd < 0) {
> +		fprintf(stderr, "open %s: %s\n", binary, strerror(errno));
> +		return 1;
> +	}
> +	if (fstat(fd, &st) < 0) {
> +		close(fd);
> +		return 1;
> +	}
> +	map = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
> +	close(fd);
> +	if (map == MAP_FAILED) {
> +		fprintf(stderr, "mmap: %s\n", strerror(errno));
> +		return 1;
> +	}
> +
> +	ehdr = (Elf64_Ehdr *)map;
> +	ehdr32 = (Elf32_Ehdr *)map;
> +	if (st.st_size < 4 ||
> +	    ehdr->e_ident[EI_MAG0] != ELFMAG0 ||
> +	    ehdr->e_ident[EI_MAG1] != ELFMAG1 ||
> +	    ehdr->e_ident[EI_MAG2] != ELFMAG2 ||
> +	    ehdr->e_ident[EI_MAG3] != ELFMAG3) {
> +		fprintf(stderr, "%s: not an ELF file\n", binary);
> +		munmap(map, (size_t)st.st_size);
> +		return 1;
> +	}
> +	is64 = (ehdr->e_ident[EI_CLASS] == ELFCLASS64);
> +
> +	if (is64) {
> +		Elf64_Shdr *shdrs = (Elf64_Shdr *)((char *)map + ehdr-
> >e_shoff);
> +		Elf64_Shdr *shstrtab_hdr = &shdrs[ehdr->e_shstrndx];
> +		const char *shstrtab = (char *)map + shstrtab_hdr->sh_offset;
> +		int si;
> +
> +		/* prefer .symtab; fall back to .dynsym */
> +		for (int pass = 0; pass < 2 && !found; pass++) {
> +			const char *target = pass ? ".dynsym" : ".symtab";
> +
> +			for (si = 0; si < ehdr->e_shnum && !found; si++) {
> +				Elf64_Shdr *sh = &shdrs[si];
> +				const char *name = shstrtab + sh->sh_name;
> +
> +				if (strcmp(name, target) != 0)
> +					continue;
> +
> +				Elf64_Shdr *strtab_sh = &shdrs[sh->sh_link];
> +				const char *strtab = (char *)map + strtab_sh-
> >sh_offset;
> +				Elf64_Sym *syms = (Elf64_Sym *)((char *)map +
> sh->sh_offset);
> +				uint64_t nsyms = sh->sh_size /
> sizeof(Elf64_Sym);
> +				uint64_t j;
> +
> +				for (j = 0; j < nsyms; j++) {
> +					if (strcmp(strtab + syms[j].st_name,
> symname) == 0) {
> +						sym_vaddr = syms[j].st_value;
> +						found = 1;
> +						break;
> +					}
> +				}
> +			}
> +		}
> +
> +		if (!found) {
> +			fprintf(stderr, "symbol '%s' not found in %s\n",
> symname, binary);
> +			munmap(map, (size_t)st.st_size);
> +			return 1;
> +		}
> +
> +		/* Convert vaddr to file offset via PT_LOAD segments */
> +		Elf64_Phdr *phdrs = (Elf64_Phdr *)((char *)map + ehdr-
> >e_phoff);
> +		int pi;
> +
> +		for (pi = 0; pi < ehdr->e_phnum; pi++) {
> +			Elf64_Phdr *ph = &phdrs[pi];
> +
> +			if (ph->p_type != PT_LOAD)
> +				continue;
> +			if (sym_vaddr >= ph->p_vaddr &&
> +			    sym_vaddr < ph->p_vaddr + ph->p_filesz) {
> +				file_offset = sym_vaddr - ph->p_vaddr + ph-
> >p_offset;
> +				break;
> +			}
> +		}
> +	} else {
> +		/* 32-bit ELF */
> +		Elf32_Shdr *shdrs = (Elf32_Shdr *)((char *)map + ehdr32-
> >e_shoff);
> +		Elf32_Shdr *shstrtab_hdr = &shdrs[ehdr32->e_shstrndx];
> +		const char *shstrtab = (char *)map + shstrtab_hdr->sh_offset;
> +		int si;
> +		uint32_t sym_vaddr32 = 0;
> +
> +		for (int pass = 0; pass < 2 && !found; pass++) {
> +			const char *target = pass ? ".dynsym" : ".symtab";
> +
> +			for (si = 0; si < ehdr32->e_shnum && !found; si++) {
> +				Elf32_Shdr *sh = &shdrs[si];
> +				const char *name = shstrtab + sh->sh_name;
> +
> +				if (strcmp(name, target) != 0)
> +					continue;
> +
> +				Elf32_Shdr *strtab_sh = &shdrs[sh->sh_link];
> +				const char *strtab = (char *)map + strtab_sh-
> >sh_offset;
> +				Elf32_Sym *syms = (Elf32_Sym *)((char *)map +
> sh->sh_offset);
> +				uint32_t nsyms = sh->sh_size /
> sizeof(Elf32_Sym);
> +				uint32_t j;
> +
> +				for (j = 0; j < nsyms; j++) {
> +					if (strcmp(strtab + syms[j].st_name,
> symname) == 0) {
> +						sym_vaddr32 =
> syms[j].st_value;
> +						found = 1;
> +						break;
> +					}
> +				}
> +			}
> +		}
> +
> +		if (!found) {
> +			fprintf(stderr, "symbol '%s' not found in %s\n",
> symname, binary);
> +			munmap(map, (size_t)st.st_size);
> +			return 1;
> +		}
> +
> +		Elf32_Phdr *phdrs = (Elf32_Phdr *)((char *)map + ehdr32-
> >e_phoff);
> +		int pi;
> +
> +		for (pi = 0; pi < ehdr32->e_phnum; pi++) {
> +			Elf32_Phdr *ph = &phdrs[pi];
> +
> +			if (ph->p_type != PT_LOAD)
> +				continue;
> +			if (sym_vaddr32 >= ph->p_vaddr &&
> +			    sym_vaddr32 < ph->p_vaddr + ph->p_filesz) {
> +				file_offset = sym_vaddr32 - ph->p_vaddr + ph-
> >p_offset;
> +				break;
> +			}
> +		}
> +		sym_vaddr = sym_vaddr32;
> +	}
> +
> +	munmap(map, (size_t)st.st_size);
> +
> +	if (!file_offset && sym_vaddr) {
> +		fprintf(stderr, "could not map vaddr 0x%lx to file offset\n",
> +			(unsigned long)sym_vaddr);
> +		return 1;
> +	}
> +
> +	printf("0x%lx\n", (unsigned long)file_offset);
> +	return 0;
> +}
> +
> +int main(int argc, char *argv[])
> +{
> +	int rc;
> +
> +	if (argc < 2) {
> +		fprintf(stderr, "Usage: %s <subcommand> [args...]\n",
> argv[0]);
> +		return 1;
> +	}
> +
> +	/* sym_offset does not need /dev/rv */
> +	if (strcmp(argv[1], "sym_offset") == 0) {
> +		if (argc < 4) {
> +			fprintf(stderr, "Usage: %s sym_offset <binary>
> <symbol>\n",
> +				argv[0]);
> +			return 1;
> +		}
> +		return sym_offset(argv[2], argv[3]);
> +	}
> +
> +	/* not_enabled: monitor is disabled; bind must return ENODEV without
> open_rv() */
> +	if (strcmp(argv[1], "not_enabled") == 0)
> +		return test_not_enabled();
> +
> +	if (open_rv() < 0)
> +		return 2; /* skip */
> +
> +	if (strcmp(argv[1], "bench") == 0)
> +		rc = test_bench();
> +	else if (strcmp(argv[1], "within_budget") == 0)
> +		rc = test_within_budget();
> +	else if (strcmp(argv[1], "over_budget_running") == 0)
> +		rc = test_over_budget_running();
> +	else if (strcmp(argv[1], "over_budget_sleeping") == 0)
> +		rc = test_over_budget_sleeping();
> +	else if (strcmp(argv[1], "over_budget_waiting") == 0)
> +		rc = test_over_budget_waiting();
> +	else if (strcmp(argv[1], "double_start") == 0)
> +		rc = test_double_start();
> +	else if (strcmp(argv[1], "stop_no_start") == 0)
> +		rc = test_stop_no_start();
> +	else if (strcmp(argv[1], "multi_thread") == 0)
> +		rc = test_multi_thread();
> +	else {
> +		fprintf(stderr, "Unknown test: %s\n", argv[1]);
> +		rc = 1;
> +	}
> +
> +	close(rv_fd);
> +	return rc;
> +}
> diff --git a/tools/testing/selftests/verification/tlob/tlob_target.c
> b/tools/testing/selftests/verification/tlob/tlob_target.c
> new file mode 100644
> index 000000000000..0fdbc575d71d
> --- /dev/null
> +++ b/tools/testing/selftests/verification/tlob/tlob_target.c
> @@ -0,0 +1,138 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * tlob_target.c - uprobe target binary for tlob selftests.
> + *
> + * Provides three start/stop probe pairs, each designed to exercise a
> + * different dominant component of the detail_env_tlob ns breakdown:
> + *
> + *   tlob_busy_work    / tlob_busy_work_done    - busy-spin: running_ns
> dominates
> + *   tlob_sleep_work   / tlob_sleep_work_done   - nanosleep: sleeping_ns
> dominates
> + *   tlob_preempt_work / tlob_preempt_work_done - busy-spin: waiting_ns
> dominates
> + *                                                (needs an RT competitor on
> the same CPU)
> + *
> + * Usage: tlob_target <duration_ms> [mode]
> + *
> + * mode is one of: busy (default), sleep, preempt.
> + * Loops in 200 ms iterations until <duration_ms> has elapsed
> + * (0 = run for ~24 hours).
> + */
> +#define _GNU_SOURCE
> +#include <stdint.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <time.h>
> +
> +#ifndef noinline
> +#define noinline __attribute__((noinline))
> +#endif
> +
> +static inline int timespec_before(const struct timespec *a,
> +				   const struct timespec *b)
> +{
> +	return a->tv_sec < b->tv_sec ||
> +	       (a->tv_sec == b->tv_sec && a->tv_nsec < b->tv_nsec);
> +}
> +
> +static void timespec_add_ms(struct timespec *ts, unsigned long ms)
> +{
> +	ts->tv_sec  += ms / 1000;
> +	ts->tv_nsec += (long)(ms % 1000) * 1000000L;
> +	if (ts->tv_nsec >= 1000000000L) {
> +		ts->tv_sec++;
> +		ts->tv_nsec -= 1000000000L;
> +	}
> +}
> +
> +/* stop probe; noinline keeps the entry point visible to uprobes */
> +noinline void tlob_busy_work_done(void)
> +{
> +	/* empty: uprobe fires on entry */
> +}
> +
> +/* start probe; busy-spin so running_ns dominates */
> +noinline void tlob_busy_work(unsigned long duration_ns)
> +{
> +	struct timespec start, now;
> +	unsigned long elapsed;
> +
> +	clock_gettime(CLOCK_MONOTONIC, &start);
> +	do {
> +		clock_gettime(CLOCK_MONOTONIC, &now);
> +		elapsed = (unsigned long)(now.tv_sec - start.tv_sec)
> +			  * 1000000000UL
> +			+ (unsigned long)(now.tv_nsec - start.tv_nsec);
> +	} while (elapsed < duration_ns);
> +
> +	tlob_busy_work_done();
> +}
> +
> +/* stop probe; noinline keeps the entry point visible to uprobes */
> +noinline void tlob_sleep_work_done(void)
> +{
> +	/* empty: uprobe fires on entry */
> +}
> +
> +/* start probe; nanosleep so sleeping_ns dominates */
> +noinline void tlob_sleep_work(unsigned long duration_ms)
> +{
> +	struct timespec ts = {
> +		.tv_sec  = duration_ms / 1000,
> +		.tv_nsec = (long)(duration_ms % 1000) * 1000000L,
> +	};
> +	nanosleep(&ts, NULL);
> +	tlob_sleep_work_done();
> +}
> +
> +/* stop probe; noinline keeps the entry point visible to uprobes */
> +noinline void tlob_preempt_work_done(void)
> +{
> +	/* empty: uprobe fires on entry */
> +}
> +
> +/*
> + * start probe; busy-spin so an RT competitor on the same CPU drives
> + * waiting_ns (prev_state==0 -> preempt event, task stays runnable off-CPU).
> + */
> +noinline void tlob_preempt_work(unsigned long duration_ms)
> +{
> +	struct timespec start, now;
> +	unsigned long elapsed;
> +
> +	clock_gettime(CLOCK_MONOTONIC, &start);
> +	do {
> +		clock_gettime(CLOCK_MONOTONIC, &now);
> +		elapsed = (unsigned long)(now.tv_sec - start.tv_sec)
> +			  * 1000000000UL
> +			+ (unsigned long)(now.tv_nsec - start.tv_nsec);
> +	} while (elapsed < duration_ms * 1000000UL);
> +
> +	tlob_preempt_work_done();
> +}
> +
> +int main(int argc, char *argv[])
> +{
> +	unsigned long duration_ms = 0;
> +	const char *mode = "busy";
> +	struct timespec deadline, now;
> +
> +	if (argc >= 2)
> +		duration_ms = strtoul(argv[1], NULL, 10);
> +	if (argc >= 3)
> +		mode = argv[2];
> +
> +	clock_gettime(CLOCK_MONOTONIC, &deadline);
> +	timespec_add_ms(&deadline, duration_ms ? duration_ms : 86400000UL);
> +
> +	do {
> +		if (strcmp(mode, "sleep") == 0)
> +			tlob_sleep_work(200);
> +		else if (strcmp(mode, "preempt") == 0)
> +			tlob_preempt_work(200);
> +		else
> +			tlob_busy_work(200 * 1000000UL);
> +		clock_gettime(CLOCK_MONOTONIC, &now);
> +	} while (timespec_before(&now, &deadline));
> +
> +	return 0;
> +}


^ permalink raw reply

* Re: [RFC PATCH v2 02/10] rv/da: fix per-task da_monitor_destroy() ordering and sync
From: Wen Yang @ 2026-05-13  5:32 UTC (permalink / raw)
  To: Gabriele Monaco; +Cc: linux-trace-kernel, linux-kernel, Steven Rostedt
In-Reply-To: <8e80cbcf739304de95356f1fac677261628977fa.camel@redhat.com>



On 5/12/26 17:09, Gabriele Monaco wrote:
> On Tue, 2026-05-12 at 10:27 +0200, Gabriele Monaco wrote:
>> On Tue, 2026-05-12 at 02:24 +0800, wen.yang@linux.dev wrote:
>>> From: Wen Yang <wen.yang@linux.dev>
>>>
>>> The following two paths race:
>>>
>>>    CPU 0 (disable_stall/__rv_disable_monitor)  CPU 1 (wwnr probe handler)
>> 							^ did you mean stall?
> 
> Ok I got it now, so essentially you'd reproduce it like:
> 
> * start a DA per-task monitor (no timer)
> * stop it, a handler is still running after reset, it sets monitoring back to 1
> * start an HA per-task monitor
> 
> that would use the same slot that is now looking like:
> 
>   { monitoring = 1, timer.function = NULL }
> 
> because it was not initialised as HA but monitoring was reset in the race.
> 
> Thinking about this again, it isn't just an issue with per-task monitors, all
> monitors reusing slots would suffer from it.
> Besides, relying on monitoring can be fragile when using LTL monitors on the
> same task (those don't even have monitoring).
> 
> Perhaps the solution isn't that trivial, I'm going to give one more thought on
> it, but thanks again for bringing this up!
> 
> Gabriele
> 
>>>    ------------------------------------------  -----------------------------
>>>    disable_stall()
>>>      da_monitor_destroy()
>>>        da_monitor_reset_all()          <------ [task T: monitoring=0]
>>>                                                da_monitor_start(&T->rv[n])
>>>                                                /* no timer_setup */
>>>                                                 monitoring=1  <----
>>>    tracepoint_synchronize_unregister()
>>>    // CPU 1 probe has already returned; sync returns
>>>
>>> Later, enable_stall() acquires the same slot and calls da_monitor_init():
>>>
>>>    da_monitor_reset_all()
>>>      da_monitor_reset(&T->rv[slot])    // monitoring=1, timer.function==0
>>>        ha_monitor_reset_env()
>>>          ha_cancel_timer()
>>>            timer_delete(&ha_mon->timer)  // ODEBUG: timer never initialised
>>>
>>>    ODEBUG: assert_init not available (active state 0)
>>>    object type: timer_list
>>>    Call trace: timer_delete <- da_monitor_reset_all <- enable_stall
>>>
>>> Call tracepoint_synchronize_unregister() inside da_monitor_destroy()
>>> before da_monitor_reset_all().  The unregister_trace_xxx() calls in the
>>> monitor's disable() have already disconnected the tracepoints; the sync
>>> here drains any handler still in flight, so no new monitoring=1 can
>>> appear after da_monitor_reset_all() clears the slot.
>>>
>>> Also fix the slot release ordering: release the slot only after
>>> reset_all() to avoid accessing rv[] with an out-of-bounds index.
>>>
>>> Fixes: f5587d1b6ec9 ("rv: Add Hybrid Automata monitor type")
>>> Signed-off-by: Wen Yang <wen.yang@linux.dev>
>>> ---
>>
>> Thanks for the fix, I have a similar one waiting for submission.
>>
>> These are technically 2 separate fixes though: the ordering with unset
>> task_mon_slot (independent on HA) and the synchronisation with pending
>> tracepoints. They probably deserve separate patches and visibility, the first
>> has always been around and we're technically overwriting who knows what.
>>
>>
>> The explanation above is a bit hard to follow though, are you talking about a
>> handler for the same (stall) monitor running after the reset, effectively
>> undoing it by setting the monitoring flag?
>>
>> Then this is indeed an issue with ha_monitor_reset_env() which expects a clean
>> environment.
>>
>> So that's basically what you'd see now much more often because in fact we
>> don't
>> reset the right slot (though, again, that's a different issue).
>>
>>
>> Calling tracepoint_synchronize_unregister() there too would surely fix, but it
>> used to be kinda slow. But it's probably gotten faster since now tracepoints
>> use
>> SRCU, so we can wait for a dedicated grace period.
>>
>> I liked the idea to wait cumulatively in the end, but that's just making
>> things
>> harder.. Let's do like this:
>>
>> Prepare 2 separate patches as fixes, put the task slot one first (would ease
>> backporting), mention this issue with the race condition only in the second.
>> You can send them independently and I'll add them to the tree as urgent.
>>
>>
>> I'm soon going to send my set of fixes that will also include the task slot
>> patch (not removing to ease my life with conflicts).
>>

Hi Gabriele,

Thanks for both messages.  Two patches are ready; let me address
your follow-up concerns before sending.

   1. "all monitors reusing slots would suffer from it"

      Only RV_MON_PER_TASK uses the rv_get/put_task_monitor_slot()
      pool.  RV_MON_GLOBAL and RV_MON_PER_CPU each have dedicated
      storage (a single static variable and a per-cpu variable) and
      never share slots across monitor types.  The race is exclusive
      to PER_TASK, so fixing that variant's da_monitor_destroy() is
      the correct scope.

   2. "LTL monitors don't even have monitoring"

      tracepoint_synchronize_unregister() does not rely on the
      monitoring flag at all.  It is a system-wide barrier — it
      calls synchronize_rcu_tasks_trace() followed by
      synchronize_srcu(&tracepoint_srcu) — draining every in-flight
      tracepoint handler on every CPU regardless of which monitor
      dispatched it.  LTL handlers are covered without any special
      treatment.

The slot-ordering issue (patch 1) affects all per-task DA monitors,
not only HA ones — "independent on HA" — because
RV_PER_TASK_MONITOR_INIT equals CONFIG_RV_PER_TASK_MONITORS (one
past the end of rv[]), so da_monitor_reset_all() overwrites whatever
follows rv[] in task_struct whenever any per-task monitor is
disabled.

Also corrected "wwnr probe handler" to "stall probe handler" in
patch 2 per your annotation.

Please let me know if the above reasoning addresses your concerns.


--
Best wishes,
Wen

>>
>>>   include/rv/da_monitor.h | 18 ++++++++++++++++--
>>>   1 file changed, 16 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/include/rv/da_monitor.h b/include/rv/da_monitor.h
>>> index 00ded3d5ab3f..d04bb3229c75 100644
>>> --- a/include/rv/da_monitor.h
>>> +++ b/include/rv/da_monitor.h
>>> @@ -304,6 +304,20 @@ static int da_monitor_init(void)
>>>   
>>>   /*
>>>    * da_monitor_destroy - return the allocated slot
>>> + *
>>> + * Call tracepoint_synchronize_unregister() before reset_all() to close
>>> + * the race where an in-flight non-HA probe handler sets monitoring=1
>>> + * (without calling timer_setup()) after da_monitor_reset_all() has
>>> + * already cleared the slot but before the caller's own sync completes.
>>> + * Without this barrier, an HA_TIMER_WHEEL monitor that later acquires
>>> + * the same slot would call timer_delete() on a never-initialised
>>> + * timer_list, triggering ODEBUG warnings.
>>> + *
>>> + * Note: tracepoint_synchronize_unregister() is a system-wide barrier
>>> + * that waits for all CPUs to finish any in-flight tracepoint handlers.
>>> + * The caller's own __rv_disable_monitor() issues a second sync after
>>> + * returning from disable(); that redundant call is harmless on the
>>> + * infrequent admin (enable/disable) path.
>>>    */
>>>   static inline void da_monitor_destroy(void)
>>>   {
>>> @@ -311,10 +325,10 @@ static inline void da_monitor_destroy(void)
>>>   		WARN_ONCE(1, "Disabling a disabled monitor: "
>>> __stringify(MONITOR_NAME));
>>>   		return;
>>>   	}
>>> +	tracepoint_synchronize_unregister();
>>> +	da_monitor_reset_all();
>>>   	rv_put_task_monitor_slot(task_mon_slot);
>>>   	task_mon_slot = RV_PER_TASK_MONITOR_INIT;
>>> -
>>> -	da_monitor_reset_all();
>>>   }
>>>   
>>>   #elif RV_MON_TYPE == RV_MON_PER_OBJ
> 

^ permalink raw reply

* Re: [PATCH v2] mm: vmscan: rework lru_shrink and write_folio tracepoints
From: kernel test robot @ 2026-05-13  3:04 UTC (permalink / raw)
  To: qiwu.chen, rostedt, mhiramat, akpm, hannes, david, mhocko, willy
  Cc: oe-kbuild-all, linux-trace-kernel, linux-mm, qiwu.chen
In-Reply-To: <20260506083652.100160-1-qiwu.chen@transsion.com>

Hi qiwu.chen,

kernel test robot noticed the following build errors:

[auto build test ERROR on linus/master]
[also build test ERROR on v7.1-rc3]
[cannot apply to akpm-mm/mm-everything trace/for-next next-20260508]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/qiwu-chen/mm-vmscan-rework-lru_shrink-and-write_folio-tracepoints/20260513-040720
base:   linus/master
patch link:    https://lore.kernel.org/r/20260506083652.100160-1-qiwu.chen%40transsion.com
patch subject: [PATCH v2] mm: vmscan: rework lru_shrink and write_folio tracepoints
config: sh-defconfig (https://download.01.org/0day-ci/archive/20260513/202605131057.E7FZbuAc-lkp@intel.com/config)
compiler: sh4-linux-gcc (GCC) 15.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260513/202605131057.E7FZbuAc-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202605131057.E7FZbuAc-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from include/trace/define_trace.h:132,
                    from include/trace/events/vmscan.h:602,
                    from mm/vmscan.c:72:
   include/trace/events/vmscan.h: In function 'trace_raw_output_mm_vmscan_write_folio':
   include/trace/events/vmscan.h:358:19: warning: format '%p' expects argument of type 'void *', but argument 3 has type 'long unsigned int' [-Wformat=]
     358 |         TP_printk("folio=%p lru=%s",
         |                   ^~~~~~~~~~~~~~~~~
   include/trace/trace_events.h:219:34: note: in definition of macro 'DECLARE_EVENT_CLASS'
     219 |         trace_event_printf(iter, print);                                \
         |                                  ^~~~~
   include/trace/trace_events.h:45:30: note: in expansion of macro 'PARAMS'
      45 |                              PARAMS(print));                   \
         |                              ^~~~~~
   include/trace/events/vmscan.h:342:1: note: in expansion of macro 'TRACE_EVENT'
     342 | TRACE_EVENT(mm_vmscan_write_folio,
         | ^~~~~~~~~~~
   include/trace/events/vmscan.h:358:9: note: in expansion of macro 'TP_printk'
     358 |         TP_printk("folio=%p lru=%s",
         |         ^~~~~~~~~
   In file included from include/trace/trace_events.h:256:
   include/trace/events/vmscan.h:358:27: note: format string is defined here
     358 |         TP_printk("folio=%p lru=%s",
         |                          ~^
         |                           |
         |                           void *
         |                          %ld
   include/trace/events/vmscan.h: In function 'do_trace_event_raw_event_mm_vmscan_write_folio':
>> include/trace/events/vmscan.h:354:32: error: assignment to 'long unsigned int' from 'struct folio *' makes integer from pointer without a cast [-Wint-conversion]
     354 |                 __entry->folio = folio;
         |                                ^
   include/trace/trace_events.h:427:11: note: in definition of macro '__DECLARE_EVENT_CLASS'
     427 |         { assign; }                                                     \
         |           ^~~~~~
   include/trace/trace_events.h:435:23: note: in expansion of macro 'PARAMS'
     435 |                       PARAMS(assign), PARAMS(print))                    \
         |                       ^~~~~~
   include/trace/trace_events.h:40:9: note: in expansion of macro 'DECLARE_EVENT_CLASS'
      40 |         DECLARE_EVENT_CLASS(name,                              \
         |         ^~~~~~~~~~~~~~~~~~~
   include/trace/trace_events.h:44:30: note: in expansion of macro 'PARAMS'
      44 |                              PARAMS(assign),                   \
         |                              ^~~~~~
   include/trace/events/vmscan.h:342:1: note: in expansion of macro 'TRACE_EVENT'
     342 | TRACE_EVENT(mm_vmscan_write_folio,
         | ^~~~~~~~~~~
   include/trace/events/vmscan.h:353:9: note: in expansion of macro 'TP_fast_assign'
     353 |         TP_fast_assign(
         |         ^~~~~~~~~~~~~~
   In file included from include/trace/define_trace.h:133:
   include/trace/events/vmscan.h: In function 'do_perf_trace_mm_vmscan_write_folio':
>> include/trace/events/vmscan.h:354:32: error: assignment to 'long unsigned int' from 'struct folio *' makes integer from pointer without a cast [-Wint-conversion]
     354 |                 __entry->folio = folio;
         |                                ^
   include/trace/perf.h:51:11: note: in definition of macro '__DECLARE_EVENT_CLASS'
      51 |         { assign; }                                                     \
         |           ^~~~~~
   include/trace/perf.h:67:23: note: in expansion of macro 'PARAMS'
      67 |                       PARAMS(assign), PARAMS(print))                    \
         |                       ^~~~~~
   include/trace/trace_events.h:40:9: note: in expansion of macro 'DECLARE_EVENT_CLASS'
      40 |         DECLARE_EVENT_CLASS(name,                              \
         |         ^~~~~~~~~~~~~~~~~~~
   include/trace/trace_events.h:44:30: note: in expansion of macro 'PARAMS'
      44 |                              PARAMS(assign),                   \
         |                              ^~~~~~
   include/trace/events/vmscan.h:342:1: note: in expansion of macro 'TRACE_EVENT'
     342 | TRACE_EVENT(mm_vmscan_write_folio,
         | ^~~~~~~~~~~
   include/trace/events/vmscan.h:353:9: note: in expansion of macro 'TP_fast_assign'
     353 |         TP_fast_assign(
         |         ^~~~~~~~~~~~~~


vim +354 include/trace/events/vmscan.h

   343	
   344		TP_PROTO(struct folio *folio),
   345	
   346		TP_ARGS(folio),
   347	
   348		TP_STRUCT__entry(
   349			__field(unsigned long, folio)
   350			__field(int, lru)
   351		),
   352	
   353		TP_fast_assign(
 > 354			__entry->folio = folio;
   355			__entry->lru = folio_lru_list(folio);
   356		),
   357	
   358		TP_printk("folio=%p lru=%s",
   359			__entry->folio,
   360			__print_symbolic(__entry->lru, LRU_NAMES))
   361	);
   362	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply

* Re: [PATCH v2] mm: vmscan: rework lru_shrink and write_folio tracepoints
From: kernel test robot @ 2026-05-13  1:19 UTC (permalink / raw)
  To: qiwu.chen, rostedt, mhiramat, akpm, hannes, david, mhocko, willy
  Cc: oe-kbuild-all, linux-trace-kernel, linux-mm, qiwu.chen
In-Reply-To: <20260506083652.100160-1-qiwu.chen@transsion.com>

Hi qiwu.chen,

kernel test robot noticed the following build warnings:

[auto build test WARNING on linus/master]
[also build test WARNING on v7.1-rc3]
[cannot apply to akpm-mm/mm-everything trace/for-next next-20260508]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/qiwu-chen/mm-vmscan-rework-lru_shrink-and-write_folio-tracepoints/20260513-040720
base:   linus/master
patch link:    https://lore.kernel.org/r/20260506083652.100160-1-qiwu.chen%40transsion.com
patch subject: [PATCH v2] mm: vmscan: rework lru_shrink and write_folio tracepoints
config: xtensa-randconfig-001-20260513 (https://download.01.org/0day-ci/archive/20260513/202605130942.9wJFWm9M-lkp@intel.com/config)
compiler: xtensa-linux-gcc (GCC) 8.5.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260513/202605130942.9wJFWm9M-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202605130942.9wJFWm9M-lkp@intel.com/

All warnings (new ones prefixed by >>):

   In file included from include/trace/define_trace.h:132,
                    from include/trace/events/vmscan.h:602,
                    from mm/vmscan.c:72:
   include/trace/events/vmscan.h: In function 'trace_raw_output_mm_vmscan_write_folio':
   include/trace/events/vmscan.h:358:12: warning: format '%p' expects argument of type 'void *', but argument 3 has type 'long unsigned int' [-Wformat=]
     TP_printk("folio=%p lru=%s",
               ^~~~~~~~~~~~~~~~~
   include/trace/trace_events.h:219:27: note: in definition of macro 'DECLARE_EVENT_CLASS'
     trace_event_printf(iter, print);    \
                              ^~~~~
   include/trace/trace_events.h:45:9: note: in expansion of macro 'PARAMS'
            PARAMS(print));         \
            ^~~~~~
   include/trace/events/vmscan.h:342:1: note: in expansion of macro 'TRACE_EVENT'
    TRACE_EVENT(mm_vmscan_write_folio,
    ^~~~~~~~~~~
   include/trace/events/vmscan.h:358:2: note: in expansion of macro 'TP_printk'
     TP_printk("folio=%p lru=%s",
     ^~~~~~~~~
   In file included from include/trace/trace_events.h:256,
                    from include/trace/define_trace.h:132,
                    from include/trace/events/vmscan.h:602,
                    from mm/vmscan.c:72:
   include/trace/events/vmscan.h:358:20: note: format string is defined here
     TP_printk("folio=%p lru=%s",
                      ~^
                      %ld
   In file included from include/trace/define_trace.h:132,
                    from include/trace/events/vmscan.h:602,
                    from mm/vmscan.c:72:
   include/trace/events/vmscan.h: In function 'do_trace_event_raw_event_mm_vmscan_write_folio':
>> include/trace/events/vmscan.h:354:18: warning: assignment to 'long unsigned int' from 'struct folio *' makes integer from pointer without a cast [-Wint-conversion]
      __entry->folio = folio;
                     ^
   include/trace/trace_events.h:427:4: note: in definition of macro '__DECLARE_EVENT_CLASS'
     { assign; }       \
       ^~~~~~
   include/trace/trace_events.h:435:9: note: in expansion of macro 'PARAMS'
            PARAMS(assign), PARAMS(print))   \
            ^~~~~~
   include/trace/trace_events.h:40:2: note: in expansion of macro 'DECLARE_EVENT_CLASS'
     DECLARE_EVENT_CLASS(name,          \
     ^~~~~~~~~~~~~~~~~~~
   include/trace/trace_events.h:44:9: note: in expansion of macro 'PARAMS'
            PARAMS(assign),         \
            ^~~~~~
   include/trace/events/vmscan.h:342:1: note: in expansion of macro 'TRACE_EVENT'
    TRACE_EVENT(mm_vmscan_write_folio,
    ^~~~~~~~~~~
   include/trace/events/vmscan.h:353:2: note: in expansion of macro 'TP_fast_assign'
     TP_fast_assign(
     ^~~~~~~~~~~~~~


vim +354 include/trace/events/vmscan.h

   343	
   344		TP_PROTO(struct folio *folio),
   345	
   346		TP_ARGS(folio),
   347	
   348		TP_STRUCT__entry(
   349			__field(unsigned long, folio)
   350			__field(int, lru)
   351		),
   352	
   353		TP_fast_assign(
 > 354			__entry->folio = folio;
   355			__entry->lru = folio_lru_list(folio);
   356		),
   357	
   358		TP_printk("folio=%p lru=%s",
   359			__entry->folio,
   360			__print_symbolic(__entry->lru, LRU_NAMES))
   361	);
   362	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply

* Re: [PATCH v2] mm: vmscan: rework lru_shrink and write_folio tracepoints
From: kernel test robot @ 2026-05-13  0:45 UTC (permalink / raw)
  To: qiwu.chen, rostedt, mhiramat, akpm, hannes, david, mhocko, willy
  Cc: oe-kbuild-all, linux-trace-kernel, linux-mm, qiwu.chen
In-Reply-To: <20260506083652.100160-1-qiwu.chen@transsion.com>

Hi qiwu.chen,

kernel test robot noticed the following build warnings:

[auto build test WARNING on linus/master]
[also build test WARNING on v7.1-rc3]
[cannot apply to akpm-mm/mm-everything trace/for-next next-20260508]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/qiwu-chen/mm-vmscan-rework-lru_shrink-and-write_folio-tracepoints/20260513-040720
base:   linus/master
patch link:    https://lore.kernel.org/r/20260506083652.100160-1-qiwu.chen%40transsion.com
patch subject: [PATCH v2] mm: vmscan: rework lru_shrink and write_folio tracepoints
config: sh-defconfig (https://download.01.org/0day-ci/archive/20260513/202605130842.zWTTtyaL-lkp@intel.com/config)
compiler: sh4-linux-gcc (GCC) 15.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260513/202605130842.zWTTtyaL-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202605130842.zWTTtyaL-lkp@intel.com/

All warnings (new ones prefixed by >>):

   In file included from include/trace/define_trace.h:132,
                    from include/trace/events/vmscan.h:602,
                    from mm/vmscan.c:72:
   include/trace/events/vmscan.h: In function 'trace_raw_output_mm_vmscan_write_folio':
>> include/trace/events/vmscan.h:358:19: warning: format '%p' expects argument of type 'void *', but argument 3 has type 'long unsigned int' [-Wformat=]
     358 |         TP_printk("folio=%p lru=%s",
         |                   ^~~~~~~~~~~~~~~~~
   include/trace/trace_events.h:219:34: note: in definition of macro 'DECLARE_EVENT_CLASS'
     219 |         trace_event_printf(iter, print);                                \
         |                                  ^~~~~
   include/trace/trace_events.h:45:30: note: in expansion of macro 'PARAMS'
      45 |                              PARAMS(print));                   \
         |                              ^~~~~~
   include/trace/events/vmscan.h:342:1: note: in expansion of macro 'TRACE_EVENT'
     342 | TRACE_EVENT(mm_vmscan_write_folio,
         | ^~~~~~~~~~~
   include/trace/events/vmscan.h:358:9: note: in expansion of macro 'TP_printk'
     358 |         TP_printk("folio=%p lru=%s",
         |         ^~~~~~~~~
   In file included from include/trace/trace_events.h:256:
   include/trace/events/vmscan.h:358:27: note: format string is defined here
     358 |         TP_printk("folio=%p lru=%s",
         |                          ~^
         |                           |
         |                           void *
         |                          %ld
   include/trace/events/vmscan.h: In function 'do_trace_event_raw_event_mm_vmscan_write_folio':
   include/trace/events/vmscan.h:354:32: error: assignment to 'long unsigned int' from 'struct folio *' makes integer from pointer without a cast [-Wint-conversion]
     354 |                 __entry->folio = folio;
         |                                ^
   include/trace/trace_events.h:427:11: note: in definition of macro '__DECLARE_EVENT_CLASS'
     427 |         { assign; }                                                     \
         |           ^~~~~~
   include/trace/trace_events.h:435:23: note: in expansion of macro 'PARAMS'
     435 |                       PARAMS(assign), PARAMS(print))                    \
         |                       ^~~~~~
   include/trace/trace_events.h:40:9: note: in expansion of macro 'DECLARE_EVENT_CLASS'
      40 |         DECLARE_EVENT_CLASS(name,                              \
         |         ^~~~~~~~~~~~~~~~~~~
   include/trace/trace_events.h:44:30: note: in expansion of macro 'PARAMS'
      44 |                              PARAMS(assign),                   \
         |                              ^~~~~~
   include/trace/events/vmscan.h:342:1: note: in expansion of macro 'TRACE_EVENT'
     342 | TRACE_EVENT(mm_vmscan_write_folio,
         | ^~~~~~~~~~~
   include/trace/events/vmscan.h:353:9: note: in expansion of macro 'TP_fast_assign'
     353 |         TP_fast_assign(
         |         ^~~~~~~~~~~~~~
   In file included from include/trace/define_trace.h:133:
   include/trace/events/vmscan.h: In function 'do_perf_trace_mm_vmscan_write_folio':
   include/trace/events/vmscan.h:354:32: error: assignment to 'long unsigned int' from 'struct folio *' makes integer from pointer without a cast [-Wint-conversion]
     354 |                 __entry->folio = folio;
         |                                ^
   include/trace/perf.h:51:11: note: in definition of macro '__DECLARE_EVENT_CLASS'
      51 |         { assign; }                                                     \
         |           ^~~~~~
   include/trace/perf.h:67:23: note: in expansion of macro 'PARAMS'
      67 |                       PARAMS(assign), PARAMS(print))                    \
         |                       ^~~~~~
   include/trace/trace_events.h:40:9: note: in expansion of macro 'DECLARE_EVENT_CLASS'
      40 |         DECLARE_EVENT_CLASS(name,                              \
         |         ^~~~~~~~~~~~~~~~~~~
   include/trace/trace_events.h:44:30: note: in expansion of macro 'PARAMS'
      44 |                              PARAMS(assign),                   \
         |                              ^~~~~~
   include/trace/events/vmscan.h:342:1: note: in expansion of macro 'TRACE_EVENT'
     342 | TRACE_EVENT(mm_vmscan_write_folio,
         | ^~~~~~~~~~~
   include/trace/events/vmscan.h:353:9: note: in expansion of macro 'TP_fast_assign'
     353 |         TP_fast_assign(
         |         ^~~~~~~~~~~~~~


vim +358 include/trace/events/vmscan.h

   343	
   344		TP_PROTO(struct folio *folio),
   345	
   346		TP_ARGS(folio),
   347	
   348		TP_STRUCT__entry(
   349			__field(unsigned long, folio)
   350			__field(int, lru)
   351		),
   352	
   353		TP_fast_assign(
   354			__entry->folio = folio;
   355			__entry->lru = folio_lru_list(folio);
   356		),
   357	
 > 358		TP_printk("folio=%p lru=%s",
   359			__entry->folio,
   360			__print_symbolic(__entry->lru, LRU_NAMES))
   361	);
   362	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply

* Re: [PATCH RFC v5 10/53] KVM: guest_memfd: Add basic support for KVM_SET_MEMORY_ATTRIBUTES2
From: Ackerley Tng @ 2026-05-12 22:30 UTC (permalink / raw)
  To: Liam R. Howlett
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	ira.weiny, jmattson, jthoughton, michael.roth, oupton,
	pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
	steven.price, tabba, willy, wyihan, yan.y.zhao, forkloop,
	pratyush, suzuki.poulose, aneesh.kumar, Paolo Bonzini,
	Sean Christopherson, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
	Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka,
	kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <1DAB05E2-7F30-45D7-B155-B66C59D31AFF@infradead.org>

"Liam R. Howlett" <liam@infradead.org> writes:

>
> [...snip...]
>
>>
>>The invariant in this maple tree is that contiguous ranges with the same
>>attribute are stored as a single range.
>>
>>The goal of this first part is to get the entry at the index just after
>>the requested range, and see what the attribute there is. If that
>>attribute is what we're about to set, extend the requested range for
>>storing to the end of that range.
>>
>>If there is another range higher than end + 1, with the invariant
>>maintained, that attribute has to be different than the attribute stored
>>at end. Hence, we only want to extend this requested range up till end.
>>
>
> mas_find() will look for an entry at the given address for the first search, and if it is not found it will continue to search upwards.  Since you limit the search to end, it will work as you want and there isn't a bug as I was thinking in my sleep deprived state.
>
> Since you are searching for exactly one address (end), it might serve you better to walk there.  Maybe walking is a better API for what you are doing here?
>

Thanks again for this tip! I'll try the walk API in the next revision
after v6 [1]

[1] https://lore.kernel.org/all/20260507-gmem-inplace-conversion-v6-0-91ab5a8b19a4@google.com/T/

>
>>> Do you have testing of these functions somewhere?
>>>
>>
>>GMEM_CONVERSION_MULTIPAGE_TEST_INIT_SHARED(indexing, 4) tests setting
>>attributes in ranges. If test_page is 2,
>>
>>1. [0, 4) starts off shared (4 is the number of pages in the guest_memfd)
>>2. [2, 3) is converted to private
>>    => so the ranges should now be [0, 2), [2, 3), [3, 4)
>>3. [2, 3) is converted back to shared
>>    => so the ranges should now be [0, 4)
>>
>>I verified this by inserting some trace_printk()s and inspecting manually.
>>
>
> Thanks.  I find the exclusive ranges a bit odd to think about in the maple tree context, but this test case makes sense.  This is especially odd to look at a single index entry, at least for me.
>
> I generally have a set of test cases and append any bug reproduces to that list so they are unlikely to reoccur.  My testing is certainly different from what you'll be doing, but this method has done well with the quality of code improving over time, and limited (if any) regressions.
>

I've not worked directly with the maple tree tests but the xarray tests
(similarly set up, I believe) are a joy to work with.

> I actually insist that any fix has a test before I accept them.  There are two reasons for this: 1. Avoiding the regression. 2. People really understand the bug if they can create a reproducer.
>
> I hope this helps.
>
>

The maple tree tests are set up to directly test maple tree code, but
KVM selftests test from the userspace interface, and it's hard to test
this invariant from userspace.

>>>> +	if (entry && xa_to_value(entry) == attributes)
>>>> +		last = mas->last;
>>>> +
>>>> +	if (start > 0) {
>>>> +		mas_set_range(mas, start - 1, start - 1);
>>>> +		entry = mas_find(mas, start - 1);
>>>> +		if (entry && xa_to_value(entry) == attributes)
>>>> +			start = mas->index;
>>>> +	}
>>>> +
>>>> +	mas_set_range(mas, start, last);
>>>> +	return mas_preallocate(mas, xa_mk_value(attributes), GFP_KERNEL);
>>>> +}
>>>> +
>>>>
>>>> [...snip...]
>>>>

^ permalink raw reply

* Re: [RFC PATCH] trace: Introduce a new filter_pred "caller"
From: Steven Rostedt @ 2026-05-12 19:41 UTC (permalink / raw)
  To: Chen Jun; +Cc: mhiramat, mathieu.desnoyers, linux-kernel, linux-trace-kernel
In-Reply-To: <20260508122623.74290-1-chenjun102@huawei.com>

On Fri, 8 May 2026 20:26:23 +0800
Chen Jun <chenjun102@huawei.com> wrote:

> Low-level functions have many call paths, and sometimes
> we only care about the calls on a specific call path.
> Add a new filter to filter based on the call stack.
> 
> Usage:
> 1. echo 'caller=="$function_name"' > events/../filter
> 
> Only support OP_EQ and OP_NE

Cute.

> 
> Signed-off-by: Chen Jun <chenjun102@huawei.com>
> ---
>  include/linux/trace_events.h       |  1 +
>  kernel/trace/trace.h               |  3 ++-
>  kernel/trace/trace_events.c        |  1 +
>  kernel/trace/trace_events_filter.c | 40 ++++++++++++++++++++++++++++--
>  4 files changed, 42 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
> index 40a43a4c7caf..1f109669a391 100644
> --- a/include/linux/trace_events.h
> +++ b/include/linux/trace_events.h
> @@ -851,6 +851,7 @@ enum {
>  	FILTER_COMM,
>  	FILTER_CPU,
>  	FILTER_STACKTRACE,
> +	FILTER_CALLER,
>  };
>  
>  extern int trace_event_raw_init(struct trace_event_call *call);
> diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
> index 80fe152af1dd..4e4b92ce264f 100644
> --- a/kernel/trace/trace.h
> +++ b/kernel/trace/trace.h
> @@ -1825,7 +1825,8 @@ static inline bool is_string_field(struct ftrace_event_field *field)
>  	       field->filter_type == FILTER_RDYN_STRING ||
>  	       field->filter_type == FILTER_STATIC_STRING ||
>  	       field->filter_type == FILTER_PTR_STRING ||
> -	       field->filter_type == FILTER_COMM;
> +	       field->filter_type == FILTER_COMM ||
> +	       field->filter_type == FILTER_CALLER;
>  }
>  
>  static inline bool is_function_field(struct ftrace_event_field *field)
> diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
> index c46e623e7e0d..6d220d7eec73 100644
> --- a/kernel/trace/trace_events.c
> +++ b/kernel/trace/trace_events.c
> @@ -199,6 +199,7 @@ static int trace_define_generic_fields(void)
>  	__generic_field(char *, comm, FILTER_COMM);
>  	__generic_field(char *, stacktrace, FILTER_STACKTRACE);
>  	__generic_field(char *, STACKTRACE, FILTER_STACKTRACE);
> +	__generic_field(char *, caller, FILTER_CALLER);
>  
>  	return ret;
>  }
> diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
> index 609325f57942..1cf040065abe 100644
> --- a/kernel/trace/trace_events_filter.c
> +++ b/kernel/trace/trace_events_filter.c
> @@ -72,6 +72,7 @@ enum filter_pred_fn {
>  	FILTER_PRED_FN_CPUMASK,
>  	FILTER_PRED_FN_CPUMASK_CPU,
>  	FILTER_PRED_FN_FUNCTION,
> +	FILTER_PRED_FN_CALLER,
>  	FILTER_PRED_FN_,
>  	FILTER_PRED_TEST_VISITED,
>  };
> @@ -1009,6 +1010,21 @@ static int filter_pred_function(struct filter_pred *pred, void *event)
>  	return pred->op == OP_EQ ? ret : !ret;
>  }
>  
> +/* Filter predicate for caller. */
> +static int filter_pred_caller(struct filter_pred *pred, void *event)
> +{
> +	unsigned long entries[32];

Let's make that only 16 in size. Having 256 bytes added to the stack in
random places may cause an overflow. 128 bytes isn't as bad. Either that,
or we need to preallocate per-cpu memory and use that. But that makes the
patch much more complex. I rather just use 16 entries instead for now. If
we need more, then we can add the extra complexity.

Also, you need to update Documentation/trace/events.rst.

Thanks,

-- Steve


> +	unsigned int nr_entries;
> +	int i;
> +
> +	nr_entries = stack_trace_save(entries, ARRAY_SIZE(entries), 0);
> +	for (i = 0; i < nr_entries ; i++)
> +		if (pred->val <= entries[i] && entries[i] < pred->val2)
> +			return !pred->not;
> +
> +	return pred->not;
> +}
> +
>  /*
>   * regex_match_foo - Basic regex callbacks
>   *
> @@ -1617,6 +1633,8 @@ static int filter_pred_fn_call(struct filter_pred *pred, void *event)
>  		return filter_pred_cpumask_cpu(pred, event);
>  	case FILTER_PRED_FN_FUNCTION:
>  		return filter_pred_function(pred, event);
> +	case FILTER_PRED_FN_CALLER:
> +		return filter_pred_caller(pred, event);
>  	case FILTER_PRED_TEST_VISITED:
>  		return test_pred_visited_fn(pred, event);
>  	default:
> @@ -2002,10 +2020,28 @@ static int parse_pred(const char *str, void *data,
>  
>  		} else if (field->filter_type == FILTER_DYN_STRING) {
>  			pred->fn_num = FILTER_PRED_FN_STRLOC;
> -		} else if (field->filter_type == FILTER_RDYN_STRING)
> +		} else if (field->filter_type == FILTER_RDYN_STRING) {
>  			pred->fn_num = FILTER_PRED_FN_STRRELLOC;
> -		else {
> +		} else if (field->filter_type == FILTER_CALLER) {
> +			unsigned long caller;
> +
> +			if (op == OP_GLOB)
> +				goto err_free;
>  
> +			pred->fn_num = FILTER_PRED_FN_CALLER;
> +			caller = kallsyms_lookup_name(pred->regex->pattern);
> +			if (!caller) {
> +				parse_error(pe, FILT_ERR_NO_FUNCTION, pos + i);
> +				goto err_free;
> +			}
> +			/* Now find the function start and end address */
> +			if (!kallsyms_lookup_size_offset(caller, &size, &offset)) {
> +				parse_error(pe, FILT_ERR_NO_FUNCTION, pos + i);
> +				goto err_free;
> +			}
> +			pred->val = caller - offset;
> +			pred->val2 = pred->val + size;
> +		} else {
>  			if (!ustring_per_cpu) {
>  				/* Once allocated, keep it around for good */
>  				ustring_per_cpu = alloc_percpu(struct ustring_buffer);


^ permalink raw reply

* Re: [PATCH bpf 1/2] uprobes/x86: Fix red zone clobbering in nop5 optimization
From: Andrii Nakryiko @ 2026-05-12 19:38 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Jiri Olsa, Masami Hiramatsu, Andrii Nakryiko, bpf,
	linux-trace-kernel, Oleg Nesterov, Peter Zijlstra, Ingo Molnar
In-Reply-To: <CAADnVQLfgxEzpSTLhsxN2BYWpSxJ+RYku03UMfrSTi4Abu5SBw@mail.gmail.com>

On Tue, May 12, 2026 at 12:27 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, May 12, 2026 at 10:07 AM Jiri Olsa <olsajiri@gmail.com> wrote:
> >
> > +       /*
> > +        * We have nop10 (with first byte overwritten to int3),
> > +        * change it to:
> > +        *   lea 0x80(%rsp), %rsp
> > +        *   call tramp
> > +        *
> > +        * The first lea instruction skips the stack redzone so the call
> > +        * instruction can safely push return address on stack.
> > +        */
>
> typo: lea -128(%rsp), %rsp
>
> you can also do:
>
> add $-128, %rsp + call tramp = 4 + 5 = 9 bytes instead of 10.

When I asked AI about this it explained that add instruction modifies
flags, so it's not a good fit here. lea doesn't touch flags.

>
> Initially I didn't like this approach, since we just introduced
> usdt nop5 and now need to recompile everything again,
> but looking at the fix it's definitely simpler than alternatives
> and doesn't have annoying limitations.


yeah, limitations are annoying, especially with those global "DO NOT
OPTIMIZE" flags... Jiri, let's polish your version and land it?

^ permalink raw reply

* Re: [PATCH bpf 1/2] uprobes/x86: Fix red zone clobbering in nop5 optimization
From: Alexei Starovoitov @ 2026-05-12 19:27 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Masami Hiramatsu, Andrii Nakryiko, bpf, linux-trace-kernel,
	Oleg Nesterov, Peter Zijlstra, Ingo Molnar
In-Reply-To: <agNeEzjiThzmJHiP@krava>

On Tue, May 12, 2026 at 10:07 AM Jiri Olsa <olsajiri@gmail.com> wrote:
>
> +       /*
> +        * We have nop10 (with first byte overwritten to int3),
> +        * change it to:
> +        *   lea 0x80(%rsp), %rsp
> +        *   call tramp
> +        *
> +        * The first lea instruction skips the stack redzone so the call
> +        * instruction can safely push return address on stack.
> +        */

typo: lea -128(%rsp), %rsp

you can also do:

add $-128, %rsp + call tramp = 4 + 5 = 9 bytes instead of 10.

Initially I didn't like this approach, since we just introduced
usdt nop5 and now need to recompile everything again,
but looking at the fix it's definitely simpler than alternatives
and doesn't have annoying limitations.

^ permalink raw reply

* Re: [PATCH v6 1/4] mm/memory-failure: report MF_MSG_KERNEL for reserved pages
From: jane.chu @ 2026-05-12 17:58 UTC (permalink / raw)
  To: David Hildenbrand (Arm), Breno Leitao, Miaohe Lin,
	Naoya Horiguchi, Andrew Morton, Jonathan Corbet, Shuah Khan,
	Lorenzo Stoakes, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Shuah Khan, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett
  Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest,
	linux-trace-kernel, kernel-team, Lance Yang
In-Reply-To: <9504c193-8c01-4d03-8f62-c50fd7fbdbc0@kernel.org>



On 5/12/2026 1:17 AM, David Hildenbrand (Arm) wrote:
> On 5/11/26 17:38, Breno Leitao wrote:
>> When get_hwpoison_page() returns a negative value, distinguish
>> reserved pages from other failure cases by reporting MF_MSG_KERNEL
>> instead of MF_MSG_GET_HWPOISON. Reserved pages belong to the kernel
>> and should be classified accordingly for proper handling.
>>
>> Sample PG_reserved before the get_hwpoison_page() call. In the
>> MF_COUNT_INCREASED path get_any_page() can drop the caller's
>> reference before returning -EIO, after which the underlying page may
>> have been freed and reallocated with page->flags reset; reading
>> PageReserved(p) at that point would observe stale or unrelated state.
>> The pre-call snapshot reflects what the page actually was at the
>> time of the failure event.
>>
>> Acked-by: Miaohe Lin <linmiaohe@huawei.com>
>> Reviewed-by: Lance Yang <lance.yang@linux.dev>
>> Signed-off-by: Breno Leitao <leitao@debian.org>
>> ---
>>   mm/memory-failure.c | 19 ++++++++++++++++++-
>>   1 file changed, 18 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>> index 866c4428ac7ef..f112fb27a8ff6 100644
>> --- a/mm/memory-failure.c
>> +++ b/mm/memory-failure.c
>> @@ -2348,6 +2348,7 @@ int memory_failure(unsigned long pfn, int flags)
>>   	unsigned long page_flags;
>>   	bool retry = true;
>>   	int hugetlb = 0;
>> +	bool is_reserved;
>>   
>>   	if (!sysctl_memory_failure_recovery)
>>   		panic("Memory failure on page %lx", pfn);
>> @@ -2411,6 +2412,18 @@ int memory_failure(unsigned long pfn, int flags)
>>   	 * In fact it's dangerous to directly bump up page count from 0,
>>   	 * that may make page_ref_freeze()/page_ref_unfreeze() mismatch.
>>   	 */
>> +	/*
>> +	 * Pages with PG_reserved set are not currently managed by the
>> +	 * page allocator (memblock-reserved memory, driver reservations,
>> +	 * etc.), so classify them as kernel-owned for reporting.
>> +	 *
>> +	 * Sample the flag before get_hwpoison_page(): in the
>> +	 * MF_COUNT_INCREASED path, get_any_page() can drop the caller's
>> +	 * reference before returning -EIO, after which page->flags may
>> +	 * have been reset by the allocator.
>> +	 */
>> +	is_reserved = PageReserved(p);
>> +
>>   	res = get_hwpoison_page(p, flags);
>>   	if (!res) {
>>   		if (is_free_buddy_page(p)) {
>> @@ -2432,7 +2445,11 @@ int memory_failure(unsigned long pfn, int flags)
>>   		}
>>   		goto unlock_mutex;
>>   	} else if (res < 0) {
>> -		res = action_result(pfn, MF_MSG_GET_HWPOISON, MF_IGNORED);
>> +		if (is_reserved)
>> +			res = action_result(pfn, MF_MSG_KERNEL, MF_IGNORED);
>> +		else
>> +			res = action_result(pfn, MF_MSG_GET_HWPOISON,
>> +					    MF_IGNORED);
>>   		goto unlock_mutex;
>>   	}
>>   
>>
> 
> It's a bit odd that we need this handling when we already have handling for
> reserved pages in error_states[].
> 
> HWPoisonHandlable() would always essentially reject PG_reserved pages. So
> __get_hwpoison_page() ... would always fail? Making
> get_hwpoison_page()->get_any_page() always fail?
> 
> But then, we never call identify_page_state()? And never call me_kernel()?
> 
> This all looks very odd.
> 
> Why would you even want to call get_hwpoison_page() in the first place if you
> find PageReserved?
> 

Ah, good point!
It seems to me that all unhandable pages should head out to 
identify_page_state:

--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -2411,6 +2411,10 @@ int memory_failure(unsigned long pfn, int flags)
          * In fact it's dangerous to directly bump up page count from 0,
          * that may make page_ref_freeze()/page_ref_unfreeze() mismatch.
          */
+
+       if (!HWPoisonHandlable(page, flags)
+               goto identify_page_state;
+
         res = get_hwpoison_page(p, flags);
         if (!res) {
                 if (is_free_buddy_page(p)) {

thanks,
-jane





^ permalink raw reply

* [PATCH v2] rtla: Stop the record trace on interrupt
From: Crystal Wood @ 2026-05-12 17:37 UTC (permalink / raw)
  To: Tomas Glozar
  Cc: Steven Rostedt, linux-trace-kernel, John Kacur, Costa Shulyupin,
	Wander Lairson Costa, Crystal Wood

Before, when rtla got a signal, it stopped the main trace but not the
record trace.  With "--on-end trace", this can lead to
save_trace_to_file() failing to keep up, especially on a debug kernel. 
Plus, it adds post-stoppage noise to the trace file.

Signed-off-by: Crystal Wood <crwood@redhat.com>
---
v2: clarify that this matters for --on-end trace

 tools/tracing/rtla/src/common.c   | 19 +++++++++++--------
 tools/tracing/rtla/src/common.h   |  1 -
 tools/tracing/rtla/src/timerlat.c |  2 +-
 3 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/tools/tracing/rtla/src/common.c b/tools/tracing/rtla/src/common.c
index 35e3d3aa922e..effad523e8cf 100644
--- a/tools/tracing/rtla/src/common.c
+++ b/tools/tracing/rtla/src/common.c
@@ -10,7 +10,7 @@
 
 #include "common.h"
 
-struct trace_instance *trace_inst;
+struct osnoise_tool *trace_tool;
 volatile int stop_tracing;
 int nr_cpus;
 
@@ -21,12 +21,16 @@ static void stop_trace(int sig)
 		 * Stop requested twice in a row; abort event processing and
 		 * exit immediately
 		 */
-		tracefs_iterate_stop(trace_inst->inst);
+		if (trace_tool)
+			tracefs_iterate_stop(trace_tool->trace.inst);
 		return;
 	}
 	stop_tracing = 1;
-	if (trace_inst)
-		trace_instance_stop(trace_inst);
+	if (trace_tool) {
+		trace_instance_stop(&trace_tool->trace);
+		if (trace_tool->record)
+			trace_instance_stop(&trace_tool->record->trace);
+	}
 }
 
 /*
@@ -273,11 +277,10 @@ int run_tool(struct tool_ops *ops, int argc, char *argv[])
 	tool->params = params;
 
 	/*
-	 * Save trace instance into global variable so that SIGINT can stop
-	 * the timerlat tracer.
+	 * Expose the tool to signal handlers so they can stop the trace.
 	 * Otherwise, rtla could loop indefinitely when overloaded.
 	 */
-	trace_inst = &tool->trace;
+	trace_tool = tool;
 
 	retval = ops->apply_config(tool);
 	if (retval) {
@@ -285,7 +288,7 @@ int run_tool(struct tool_ops *ops, int argc, char *argv[])
 		goto out_free;
 	}
 
-	retval = enable_tracer_by_name(trace_inst->inst, ops->tracer);
+	retval = enable_tracer_by_name(tool->trace.inst, ops->tracer);
 	if (retval) {
 		err_msg("Failed to enable %s tracer\n", ops->tracer);
 		goto out_free;
diff --git a/tools/tracing/rtla/src/common.h b/tools/tracing/rtla/src/common.h
index 51665db4ffce..eba40b6d9504 100644
--- a/tools/tracing/rtla/src/common.h
+++ b/tools/tracing/rtla/src/common.h
@@ -54,7 +54,6 @@ struct osnoise_context {
 	int			opt_workload;
 };
 
-extern struct trace_instance *trace_inst;
 extern volatile int stop_tracing;
 
 struct hist_params {
diff --git a/tools/tracing/rtla/src/timerlat.c b/tools/tracing/rtla/src/timerlat.c
index f8c057518d22..637f68d684f5 100644
--- a/tools/tracing/rtla/src/timerlat.c
+++ b/tools/tracing/rtla/src/timerlat.c
@@ -202,7 +202,7 @@ void timerlat_analyze(struct osnoise_tool *tool, bool stopped)
 		 * If the trace did not stop with --aa-only, at least print
 		 * the max known latency.
 		 */
-		max_lat = tracefs_instance_file_read(trace_inst->inst, "tracing_max_latency", NULL);
+		max_lat = tracefs_instance_file_read(tool->trace.inst, "tracing_max_latency", NULL);
 		if (max_lat) {
 			printf("  Max latency was %s\n", max_lat);
 			free(max_lat);
-- 
2.54.0


^ permalink raw reply related

* [PATCH v2 2/2] serial: qcom-geni: Add tracepoints for Qualcomm GENI serial driver
From: Praveen Talari @ 2026-05-12 17:14 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Greg Kroah-Hartman, Jiri Slaby, Konrad Dybcio
  Cc: linux-kernel, linux-trace-kernel, linux-arm-msm, linux-serial,
	Mukesh Kumar Savaliya, Aniket Randive, chandana.chiluveru,
	jyothi.seerapu, Praveen Talari
In-Reply-To: <20260512-add-tracepoints-for-qcom-geni-serial-v2-0-a5726421b3af@oss.qualcomm.com>

Add tracing to the Qualcomm GENI serial driver to improve runtime
observability.

Trace hooks are added at key points including termios and clock
configuration, manual control get/set, interrupt handling, and data
TX/RX paths.

Usage examples:

Enable all serial traces:
  echo 1 > /sys/kernel/debug/tracing/events/qcom_geni_serial/enable
  cat /sys/kernel/debug/tracing/trace_pipe

Example trace output:

2517.938432: geni_serial_clk_cfg: a94000.serial: desired_rate=1843200
     clk_rate=7372800 clk_div=4 clk_idx=0
2517.938753: geni_serial_irq: a94000.serial: m_irq=0x88800000
     s_irq=0x08000111 dma_tx=0x00000000 dma_rx=0x00000000
2517.938803: geni_serial_set_termios: a94000.serial: baud=115200 bpc=8
     tx_trans=0x00000002 tx_par=0x00000000 rx_trans=0x00000000
rx_par=0x00000000 stop=0
2517.938807: geni_serial_set_mctrl: a94000.serial: mctrl=0x8006
     uart_manual_rfr=0x00000000
2517.938818: geni_serial_get_mctrl: a94000.serial: mctrl=0x0160
     geni_ios=0x00000001
2517.939165: geni_serial_irq: a94000.serial: m_irq=0x00400000
     s_irq=0x00000000 dma_tx=0x00000000 dma_rx=0x00000000
2517.939592: geni_serial_tx_data: a94000.serial: tx_len=8 data=61 62 63
     64 65 66 67 68
2517.940610: geni_serial_irq: a94000.serial: m_irq=0x00000001
     s_irq=0x00000000 dma_tx=0x00000003 dma_rx=0x00000000
2517.942174: geni_serial_irq: a94000.serial: m_irq=0x08000000
     s_irq=0x08000100 dma_tx=0x00000000 dma_rx=0x00000003
2517.942323: geni_serial_rx_data: a94000.serial: rx_len=8 data=61 62 63
     64 65 66 67 68
2517.942680: geni_serial_set_mctrl: a94000.serial: mctrl=0x8000
     uart_manual_rfr=0x80000002

Signed-off-by: Praveen Talari <praveen.talari@oss.qualcomm.com>
---
 drivers/tty/serial/qcom_geni_serial.c | 27 +++++++++++++++++++++++----
 1 file changed, 23 insertions(+), 4 deletions(-)

diff --git a/drivers/tty/serial/qcom_geni_serial.c b/drivers/tty/serial/qcom_geni_serial.c
index e6b0a55f0cfb..9e2de074d799 100644
--- a/drivers/tty/serial/qcom_geni_serial.c
+++ b/drivers/tty/serial/qcom_geni_serial.c
@@ -7,6 +7,9 @@
 /* Disable MMIO tracing to prevent excessive logging of unwanted MMIO traces */
 #define __DISABLE_TRACE_MMIO__
 
+#define CREATE_TRACE_POINTS
+#include <trace/events/qcom_geni_serial.h>
+
 #include <linux/clk.h>
 #include <linux/console.h>
 #include <linux/io.h>
@@ -225,7 +228,7 @@ static void qcom_geni_serial_config_port(struct uart_port *uport, int cfg_flags)
 static unsigned int qcom_geni_serial_get_mctrl(struct uart_port *uport)
 {
 	unsigned int mctrl = TIOCM_DSR | TIOCM_CAR;
-	u32 geni_ios;
+	u32 geni_ios = 0;
 
 	if (uart_console(uport)) {
 		mctrl |= TIOCM_CTS;
@@ -235,6 +238,8 @@ static unsigned int qcom_geni_serial_get_mctrl(struct uart_port *uport)
 			mctrl |= TIOCM_CTS;
 	}
 
+	trace_geni_serial_get_mctrl(uport->dev, mctrl, geni_ios);
+
 	return mctrl;
 }
 
@@ -253,6 +258,8 @@ static void qcom_geni_serial_set_mctrl(struct uart_port *uport,
 	if (!(mctrl & TIOCM_RTS) && !uport->suspended)
 		uart_manual_rfr = UART_MANUAL_RFR_EN | UART_RFR_NOT_READY;
 	writel(uart_manual_rfr, uport->membase + SE_UART_MANUAL_RFR);
+
+	trace_geni_serial_set_mctrl(uport->dev, mctrl, uart_manual_rfr);
 }
 
 static const char *qcom_geni_serial_get_type(struct uart_port *uport)
@@ -683,6 +690,8 @@ static void qcom_geni_serial_start_tx_dma(struct uart_port *uport)
 	xmit_size = kfifo_out_linear_ptr(&tport->xmit_fifo, &tail,
 			UART_XMIT_SIZE);
 
+	trace_geni_serial_tx_data(uport->dev, tail, xmit_size);
+
 	qcom_geni_set_rs485_mode(uport, SER_RS485_RTS_ON_SEND);
 
 	qcom_geni_serial_setup_tx(uport, xmit_size);
@@ -909,8 +918,10 @@ static void qcom_geni_serial_handle_rx_dma(struct uart_port *uport, bool drop)
 		return;
 	}
 
-	if (!drop)
+	if (!drop) {
+		trace_geni_serial_rx_data(uport->dev, port->rx_buf, rx_in);
 		handle_rx_uart(uport, rx_in);
+	}
 
 	ret = geni_se_rx_dma_prep(&port->se, port->rx_buf,
 				  DMA_RX_BUF_SIZE,
@@ -1069,6 +1080,10 @@ static irqreturn_t qcom_geni_serial_isr(int isr, void *dev)
 	geni_status = readl(uport->membase + SE_GENI_STATUS);
 	dma = readl(uport->membase + SE_GENI_DMA_MODE_EN);
 	m_irq_en = readl(uport->membase + SE_GENI_M_IRQ_EN);
+
+	trace_geni_serial_irq(uport->dev, m_irq_status, s_irq_status,
+			      dma_tx_status, dma_rx_status);
+
 	writel(m_irq_status, uport->membase + SE_GENI_M_IRQ_CLEAR);
 	writel(s_irq_status, uport->membase + SE_GENI_S_IRQ_CLEAR);
 	writel(dma_tx_status, uport->membase + SE_DMA_TX_IRQ_CLR);
@@ -1281,8 +1296,8 @@ static int geni_serial_set_rate(struct uart_port *uport, unsigned int baud)
 		return -EINVAL;
 	}
 
-	dev_dbg(port->se.dev, "desired_rate = %u, clk_rate = %lu, clk_div = %u\n, clk_idx = %u\n",
-		baud * sampling_rate, clk_rate, clk_div, clk_idx);
+	trace_geni_serial_clk_cfg(uport->dev, baud * sampling_rate, clk_rate,
+				  clk_div, clk_idx);
 
 	uport->uartclk = clk_rate;
 	port->clk_rate = clk_rate;
@@ -1432,6 +1447,10 @@ static void qcom_geni_serial_set_termios(struct uart_port *uport,
 	writel(bits_per_char, uport->membase + SE_UART_TX_WORD_LEN);
 	writel(bits_per_char, uport->membase + SE_UART_RX_WORD_LEN);
 	writel(stop_bit_len, uport->membase + SE_UART_TX_STOP_BIT_LEN);
+
+	trace_geni_serial_set_termios(uport->dev, baud, bits_per_char,
+				      tx_trans_cfg, tx_parity_cfg, rx_trans_cfg,
+				      rx_parity_cfg, stop_bit_len);
 }
 
 #ifdef CONFIG_SERIAL_QCOM_GENI_CONSOLE

-- 
2.34.1


^ permalink raw reply related

* [PATCH v2 1/2] serial: qcom-geni: trace: Add tracepoint support for Qualcomm GENI serial
From: Praveen Talari @ 2026-05-12 17:14 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Greg Kroah-Hartman, Jiri Slaby, Konrad Dybcio
  Cc: linux-kernel, linux-trace-kernel, linux-arm-msm, linux-serial,
	Mukesh Kumar Savaliya, Aniket Randive, chandana.chiluveru,
	jyothi.seerapu, Praveen Talari
In-Reply-To: <20260512-add-tracepoints-for-qcom-geni-serial-v2-0-a5726421b3af@oss.qualcomm.com>

Add tracepoint support to the Qualcomm GENI serial driver to provide
runtime visibility into driver behavior without requiring invasive debug
patches.

The trace events cover UART termios configuration, clock setup, modem
control state, interrupt status, and TX/RX data, making it easier to
diagnose communication issues in the field.

Signed-off-by: Praveen Talari <praveen.talari@oss.qualcomm.com>
---
v1->v2:
- Removed multiple TX/RX trace events, instead used
  DECLARE_EVENT_CLASS and DEFINE_EVENT.
---
 include/trace/events/qcom_geni_serial.h | 172 ++++++++++++++++++++++++++++++++
 1 file changed, 172 insertions(+)

diff --git a/include/trace/events/qcom_geni_serial.h b/include/trace/events/qcom_geni_serial.h
new file mode 100644
index 000000000000..5e23827881d0
--- /dev/null
+++ b/include/trace/events/qcom_geni_serial.h
@@ -0,0 +1,172 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM qcom_geni_serial
+
+#if !defined(_TRACE_QCOM_GENI_SERIAL_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_QCOM_GENI_SERIAL_H
+
+#include <linux/device.h>
+#include <linux/tracepoint.h>
+
+TRACE_EVENT(geni_serial_set_termios,
+	    TP_PROTO(struct device *dev, unsigned int baud,
+		     unsigned int bits_per_char, u32 tx_trans_cfg,
+		     u32 tx_parity_cfg, u32 rx_trans_cfg,
+		     u32 rx_parity_cfg, u32 stop_bit_len),
+	    TP_ARGS(dev, baud, bits_per_char, tx_trans_cfg, tx_parity_cfg,
+		    rx_trans_cfg, rx_parity_cfg, stop_bit_len),
+
+	    TP_STRUCT__entry(__string(name, dev_name(dev))
+			     __field(unsigned int, baud)
+			     __field(unsigned int, bits_per_char)
+			     __field(u32, tx_trans_cfg)
+			     __field(u32, tx_parity_cfg)
+			     __field(u32, rx_trans_cfg)
+			     __field(u32, rx_parity_cfg)
+			     __field(u32, stop_bit_len)
+	    ),
+
+	    TP_fast_assign(__assign_str(name);
+			   __entry->baud = baud;
+			   __entry->bits_per_char = bits_per_char;
+			   __entry->tx_trans_cfg = tx_trans_cfg;
+			   __entry->tx_parity_cfg = tx_parity_cfg;
+			   __entry->rx_trans_cfg = rx_trans_cfg;
+			   __entry->rx_parity_cfg = rx_parity_cfg;
+			   __entry->stop_bit_len = stop_bit_len;
+	    ),
+
+	    TP_printk("%s: baud=%u bpc=%u tx_trans=0x%08x tx_par=0x%08x rx_trans=0x%08x rx_par=0x%08x stop=%u",
+		      __get_str(name), __entry->baud, __entry->bits_per_char,
+		      __entry->tx_trans_cfg, __entry->tx_parity_cfg,
+		      __entry->rx_trans_cfg, __entry->rx_parity_cfg,
+		      __entry->stop_bit_len)
+);
+
+TRACE_EVENT(geni_serial_clk_cfg,
+	    TP_PROTO(struct device *dev, unsigned int desired_rate,
+		     unsigned long clk_rate, unsigned int clk_div,
+		     unsigned int clk_idx),
+	    TP_ARGS(dev, desired_rate, clk_rate, clk_div, clk_idx),
+
+	    TP_STRUCT__entry(__string(name, dev_name(dev))
+			     __field(unsigned int, desired_rate)
+			     __field(unsigned long, clk_rate)
+			     __field(unsigned int, clk_div)
+			     __field(unsigned int, clk_idx)
+	    ),
+
+	    TP_fast_assign(__assign_str(name);
+			   __entry->desired_rate = desired_rate;
+			   __entry->clk_rate = clk_rate;
+			   __entry->clk_div = clk_div;
+			   __entry->clk_idx = clk_idx;
+	    ),
+
+	    TP_printk("%s: desired_rate=%u clk_rate=%lu clk_div=%u clk_idx=%u",
+		      __get_str(name), __entry->desired_rate, __entry->clk_rate,
+		      __entry->clk_div, __entry->clk_idx)
+);
+
+TRACE_EVENT(geni_serial_irq,
+	    TP_PROTO(struct device *dev, u32 m_irq, u32 s_irq,
+		     u32 dma_tx, u32 dma_rx),
+	    TP_ARGS(dev, m_irq, s_irq, dma_tx, dma_rx),
+
+	    TP_STRUCT__entry(__string(name, dev_name(dev))
+			     __field(u32, m_irq)
+			     __field(u32, s_irq)
+			     __field(u32, dma_tx)
+			     __field(u32, dma_rx)
+	    ),
+
+	    TP_fast_assign(__assign_str(name);
+			   __entry->m_irq = m_irq;
+			   __entry->s_irq = s_irq;
+			   __entry->dma_tx = dma_tx;
+			   __entry->dma_rx = dma_rx;
+	    ),
+
+	    TP_printk("%s: m_irq=0x%08x s_irq=0x%08x dma_tx=0x%08x dma_rx=0x%08x",
+		      __get_str(name), __entry->m_irq, __entry->s_irq,
+		      __entry->dma_tx, __entry->dma_rx)
+);
+
+DECLARE_EVENT_CLASS(geni_serial_data,
+
+	TP_PROTO(struct device *dev, const u8 *buf, unsigned int len),
+
+	TP_ARGS(dev, buf, len),
+
+	TP_STRUCT__entry(__string(name, dev_name(dev))
+			 __field(unsigned int, len)
+			 __dynamic_array(u8, data, len)
+	),
+
+	TP_fast_assign(__assign_str(name);
+		       __entry->len = len;
+		       memcpy(__get_dynamic_array(data), buf, len);
+	),
+
+	TP_printk("%s: len=%u data=%s",
+		  __get_str(name), __entry->len,
+		  __print_hex(__get_dynamic_array(data), __entry->len))
+);
+
+DEFINE_EVENT(geni_serial_data, geni_serial_tx_data,
+
+	TP_PROTO(struct device *dev, const u8 *buf, unsigned int len),
+
+	TP_ARGS(dev, buf, len)
+
+);
+
+DEFINE_EVENT(geni_serial_data, geni_serial_rx_data,
+
+	TP_PROTO(struct device *dev, const u8 *buf, unsigned int len),
+
+	TP_ARGS(dev, buf, len)
+
+);
+
+TRACE_EVENT(geni_serial_set_mctrl,
+	    TP_PROTO(struct device *dev, unsigned int mctrl,
+		     u32 uart_manual_rfr),
+	    TP_ARGS(dev, mctrl, uart_manual_rfr),
+
+	    TP_STRUCT__entry(__string(name, dev_name(dev))
+			     __field(unsigned int, mctrl)
+			     __field(u32, uart_manual_rfr)
+	    ),
+
+	    TP_fast_assign(__assign_str(name);
+			   __entry->mctrl = mctrl;
+			   __entry->uart_manual_rfr = uart_manual_rfr;
+	    ),
+
+	    TP_printk("%s: mctrl=0x%04x uart_manual_rfr=0x%08x",
+		      __get_str(name), __entry->mctrl, __entry->uart_manual_rfr)
+);
+
+TRACE_EVENT(geni_serial_get_mctrl,
+	    TP_PROTO(struct device *dev, unsigned int mctrl, u32 geni_ios),
+	    TP_ARGS(dev, mctrl, geni_ios),
+
+	    TP_STRUCT__entry(__string(name, dev_name(dev))
+			     __field(unsigned int, mctrl)
+			     __field(u32, geni_ios)
+	    ),
+
+	    TP_fast_assign(__assign_str(name);
+			   __entry->mctrl = mctrl;
+			   __entry->geni_ios = geni_ios;
+	    ),
+
+	    TP_printk("%s: mctrl=0x%04x geni_ios=0x%08x",
+		      __get_str(name), __entry->mctrl, __entry->geni_ios)
+);
+
+#endif /* _TRACE_QCOM_GENI_SERIAL_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>

-- 
2.34.1


^ permalink raw reply related

* [PATCH v2 0/2] Add tracepoints support for Qualcomm GENI Serial drivers
From: Praveen Talari @ 2026-05-12 17:14 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Greg Kroah-Hartman, Jiri Slaby, Konrad Dybcio
  Cc: linux-kernel, linux-trace-kernel, linux-arm-msm, linux-serial,
	Mukesh Kumar Savaliya, Aniket Randive, chandana.chiluveru,
	jyothi.seerapu, Praveen Talari

Add tracepoints to the Qualcomm GENI (Generic Interface) serial driver.
These trace events enable runtime debugging and performance analysis of
UART operations.

The trace events cover UART termios configuration, clock setup, manual
control state, interrupt status, and actual transmitted/received data in
hexadecimal format.

Usage examples:

Enable all serial traces:
  echo 1 > /sys/kernel/debug/tracing/events/qcom_geni_serial/enable
  cat /sys/kernel/debug/tracing/trace_pipe

Example trace output:

2517.938432: geni_serial_clk_cfg: a94000.serial: desired_rate=1843200
     clk_rate=7372800 clk_div=4 clk_idx=0
2517.938753: geni_serial_irq: a94000.serial: m_irq=0x88800000
     s_irq=0x08000111 dma_tx=0x00000000 dma_rx=0x00000000
2517.938803: geni_serial_set_termios: a94000.serial: baud=115200 bpc=8
     tx_trans=0x00000002 tx_par=0x00000000 rx_trans=0x00000000
rx_par=0x00000000 stop=0
2517.938807: geni_serial_set_mctrl: a94000.serial: mctrl=0x8006
     uart_manual_rfr=0x00000000
2517.938818: geni_serial_get_mctrl: a94000.serial: mctrl=0x0160
     geni_ios=0x00000001
2517.939165: geni_serial_irq: a94000.serial: m_irq=0x00400000
     s_irq=0x00000000 dma_tx=0x00000000 dma_rx=0x00000000
2517.939592: geni_serial_tx_data: a94000.serial: tx_len=8 data=61 62 63
     64 65 66 67 68
2517.940610: geni_serial_irq: a94000.serial: m_irq=0x00000001
     s_irq=0x00000000 dma_tx=0x00000003 dma_rx=0x00000000
2517.942174: geni_serial_irq: a94000.serial: m_irq=0x08000000
     s_irq=0x08000100 dma_tx=0x00000000 dma_rx=0x00000003
2517.942323: geni_serial_rx_data: a94000.serial: rx_len=8 data=61 62 63
     64 65 66 67 68
2517.942680: geni_serial_set_mctrl: a94000.serial: mctrl=0x8000
     uart_manual_rfr=0x80000002

Signed-off-by: Praveen Talari <praveen.talari@oss.qualcomm.com>
---
Changes in v2:
- removed multiple trace events for TX/RX events, instead used
  DECLARE_EVENT_CLASS and DEFINE_EVENT.
- Link to v1: https://lore.kernel.org/r/20260506-add-tracepoints-for-qcom-geni-serial-v1-0-544b22612e08@oss.qualcomm.com

---
Praveen Talari (2):
      serial: qcom-geni: trace: Add tracepoint support for Qualcomm GENI serial
      serial: qcom-geni: Add tracepoints for Qualcomm GENI serial driver

 drivers/tty/serial/qcom_geni_serial.c   |  27 ++++-
 include/trace/events/qcom_geni_serial.h | 172 ++++++++++++++++++++++++++++++++
 2 files changed, 195 insertions(+), 4 deletions(-)
---
base-commit: 1f5ffc672165ff851063a5fd044b727ab2517ae3
change-id: 20260427-add-tracepoints-for-qcom-geni-serial-948777218b7b

Best regards,
-- 
Praveen Talari <praveen.talari@oss.qualcomm.com>


^ permalink raw reply

* Re: [PATCH bpf 1/2] uprobes/x86: Fix red zone clobbering in nop5 optimization
From: Jiri Olsa @ 2026-05-12 17:06 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Jiri Olsa, Andrii Nakryiko, bpf, linux-trace-kernel, oleg, peterz,
	mingo
In-Reply-To: <20260512141431.a70375744fdae263bda5b722@kernel.org>

On Tue, May 12, 2026 at 02:14:31PM +0900, Masami Hiramatsu wrote:
> On Sun, 10 May 2026 23:25:26 +0200
> Jiri Olsa <olsajiri@gmail.com> wrote:
> 
> > On Fri, May 08, 2026 at 05:30:56PM -0700, Andrii Nakryiko wrote:
> > > The x86 uprobe nop5 optimization currently replaces a 5-byte NOP at the
> > > probe site with a CALL into a uprobe trampoline. CALL pushes a return
> > > address to [rsp-8]. On x86-64 this is inside the 128-byte red zone, where
> > > user code may keep temporary data without adjusting rsp.
> > > 
> > > Use a 5-byte JMP instead. JMP does not write to the user stack, but it
> > > also does not provide a return address. Replace the single trampoline
> > > entry with a page of 16-byte slots. Each optimized probe jumps to its
> > > assigned slot, the slot moves rsp below the red zone, saves the registers
> > > clobbered by syscall, and invokes the uprobe syscall:
> > > 
> > >   Probe site:   jmp slot_N              (5B, replaces nop5)
> > > 
> > >   Slot N:       lea  -128(%rsp), %rsp   (5B)  skip red zone
> > >                 push %rcx               (1B)  save (syscall clobbers)
> > >                 push %r11               (2B)  save (syscall clobbers)
> > >                 push %rax               (1B)  save (syscall uses for nr)
> > >                 mov  $336, %eax         (5B)  uprobe syscall number
> > >                 syscall                 (2B)
> > > 
> > > All slots contain identical code at different offsets, so the trampoline
> > > page is generated once at boot and mapped read-execute into each process.
> > > The syscall handler identifies the slot from regs->ip, which points just
> > > after the syscall instruction, and uses a per-mm slot table to recover the
> > > original probe address.
> > > 
> > > The uprobe syscall does not return to the trampoline slot. The handler
> > > restores the probe-site register state, runs the uprobe consumers, sets
> > > pt_regs to continue at probe_addr + 5 unless a consumer redirected
> > > execution, and returns directly through the IRET path. This preserves
> > > general purpose registers, including rcx and r11, without requiring any
> > > post-syscall cleanup code in the trampoline and avoids call/ret, RSB, and
> > > shadow stack concerns.
> > > 
> > > Protect the per-mm trampoline list with RCU and free trampoline metadata
> > > with kfree_rcu(). This lets the syscall path resolve trampoline slots
> > > without taking mmap_lock. The optimized-instruction detection path also
> > > walks the trampoline list under an RCU read-side lock. Since that path
> > > starts from the JMP target, it translates the slot start to the post-syscall
> > > IP expected by the shared resolver before checking the trampoline mapping.
> > > 
> > > Each trampoline page provides 256 slots. Slots stay permanently assigned
> > > to their first probe address and are reused only when the same address is
> > > probed again. Reassigning detached slots is deliberately avoided because a
> > > thread can remain in a trampoline for an unbounded time due to ptrace,
> > > interrupts, or scheduling delays. If a reachable trampoline page runs out
> > > of slots, probes that cannot allocate a slot fall back to the slower INT3
> > > path.
> > > 
> > > Require the entire trampoline page to be reachable by a rel32 JMP before
> > > reusing it for a probe. This keeps every slot in the page within the range
> > > that can be encoded at the probe site.
> > > 
> > > Change the error code returned when the uprobe syscall is invoked outside
> > > a kernel-generated trampoline from -ENXIO to -EPROTO. This lets libbpf and
> > > similar libraries distinguish fixed kernels from kernels with the
> > > red-zone-clobbering implementation and enable nop5 optimization only on
> > > fixed kernels.
> > > 
> > > Performance (usdt single-thread, M/s):
> > > 
> > >                   usdt-nop  usdt-nop5-base  usdt-nop5-fix  nop5-change  iret%
> > >   Skylake          3.149        6.422          4.865         -24.3%     39.1%
> > >   Milan            2.910        3.443          3.820         +11.0%     24.3%
> > >   Sapphire Rapids  1.896        4.023          3.693          -8.2%     24.9%
> > >   Bergamo          3.393        3.895          3.849          -1.2%     24.5%
> > > 
> > > The fixed nop5 path remains faster than the non-optimized INT3 path on all
> > > measured systems. The regression relative to the old CALL-based trampoline
> > > comes from IRET being more expensive than SYSRET, most noticeably on older
> > > Intel Skylake. Newer Intel CPUs and tested AMD CPUs have lower IRET cost,
> > > and AMD Milan improves because removing mmap_lock from the hot path more
> > > than offsets the IRET cost.
> > > 
> > > Multi-threaded throughput scales nearly linearly with the number of CPUs, like
> > > it used to, thanks to lockless RCU-protected uprobe trampoline lookup.
> > 
> > hi,
> > thanks a lot for the fix
> > 
> > FWIW we discussed also an option to have 10-bytes nop and do:
> >   [rsp+0x80, call trampoline]
> > 
> > we would not need the slots re-use logic, but not sure what other
> > surprises there are with 10-bytes nop
> 
> Does this mean we have to update UDST implementation?

it's the optimized uprobe code, that's used for usdt that emits nop5 instead
of single nop

> 
> > 
> > I tried that change [1], it seems to work, but it has other
> > difficulties, like I think the unoptimized path needs to do:
> >   [rsp+0x80, call trampoline] -> [jmp end of 10-bytes nop]
> > instead of patching back the 10-byte nop, because some thread
> > could be inside the nop area already.
> 
> Yeah, but at that moment, we know where the modified code is.
> Maybe memory dump shows different code, but that is also true
> if uprobe is active. So I think it is OK.

hum, I'm not what you mean.. I attached the kernel change from my changes,
if you want to comment on top of that

the whole change including user space changes is in here:
  https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git/log/?h=redzone_fix

jirka


---
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index ebb1baf1eb1d..a6db7b76cb49 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -636,9 +636,20 @@ struct uprobe_trampoline {
 	unsigned long		vaddr;
 };
 
+static const u8 lea_rsp[] = { 0x48, 0x8d, 0x64, 0x24, 0x80 };
+
+#define LEA_INSN_SIZE		5
+#define UPROBE_OPT_INSN_SIZE	(LEA_INSN_SIZE + CALL_INSN_SIZE)
+#define REDZONE_SIZE		0x80
+
+static bool is_lea_insn(const uprobe_opcode_t *insn)
+{
+	return !memcmp(insn, lea_rsp, LEA_INSN_SIZE);
+}
+
 static bool is_reachable_by_call(unsigned long vtramp, unsigned long vaddr)
 {
-	long delta = (long)(vaddr + 5 - vtramp);
+	long delta = (long)(vaddr + UPROBE_OPT_INSN_SIZE - vtramp);
 
 	return delta >= INT_MIN && delta <= INT_MAX;
 }
@@ -651,7 +662,7 @@ static unsigned long find_nearest_trampoline(unsigned long vaddr)
 	};
 	unsigned long low_limit, high_limit;
 	unsigned long low_tramp, high_tramp;
-	unsigned long call_end = vaddr + 5;
+	unsigned long call_end = vaddr + UPROBE_OPT_INSN_SIZE;
 
 	if (check_add_overflow(call_end, INT_MIN, &low_limit))
 		low_limit = PAGE_SIZE;
@@ -826,8 +837,8 @@ SYSCALL_DEFINE0(uprobe)
 	regs->ax  = args.ax;
 	regs->r11 = args.r11;
 	regs->cx  = args.cx;
-	regs->ip  = args.retaddr - 5;
-	regs->sp += sizeof(args);
+	regs->ip  = args.retaddr - UPROBE_OPT_INSN_SIZE;
+	regs->sp += sizeof(args) + REDZONE_SIZE;
 	regs->orig_ax = -1;
 
 	sp = regs->sp;
@@ -844,12 +855,12 @@ SYSCALL_DEFINE0(uprobe)
 	 */
 	if (regs->sp != sp) {
 		/* skip the trampoline call */
-		if (args.retaddr - 5 == regs->ip)
-			regs->ip += 5;
+		if (args.retaddr - UPROBE_OPT_INSN_SIZE == regs->ip)
+			regs->ip += UPROBE_OPT_INSN_SIZE;
 		return regs->ax;
 	}
 
-	regs->sp -= sizeof(args);
+	regs->sp -= sizeof(args) + REDZONE_SIZE;
 
 	/* for the case uprobe_consumer has changed ax/r11/cx */
 	args.ax  = regs->ax;
@@ -857,7 +868,7 @@ SYSCALL_DEFINE0(uprobe)
 	args.cx  = regs->cx;
 
 	/* keep return address unless we are instructed otherwise */
-	if (args.retaddr - 5 != regs->ip)
+	if (args.retaddr - UPROBE_OPT_INSN_SIZE != regs->ip)
 		args.retaddr = regs->ip;
 
 	if (shstk_push(args.retaddr) == -EFAULT)
@@ -891,7 +902,7 @@ asm (
 	"pop %rax\n"
 	"pop %r11\n"
 	"pop %rcx\n"
-	"ret\n"
+	"ret $0x80\n"
 	"int3\n"
 	".balign " __stringify(PAGE_SIZE) "\n"
 	".popsection\n"
@@ -930,9 +941,9 @@ static int verify_insn(struct page *page, unsigned long vaddr, uprobe_opcode_t *
 		       int nbytes, void *data)
 {
 	struct write_opcode_ctx *ctx = data;
-	uprobe_opcode_t old_opcode[5];
+	uprobe_opcode_t old_opcode[UPROBE_OPT_INSN_SIZE];
 
-	uprobe_copy_from_page(page, ctx->base, (uprobe_opcode_t *) &old_opcode, 5);
+	uprobe_copy_from_page(page, ctx->base, old_opcode, UPROBE_OPT_INSN_SIZE);
 
 	switch (ctx->expect) {
 	case EXPECT_SWBP:
@@ -940,7 +951,7 @@ static int verify_insn(struct page *page, unsigned long vaddr, uprobe_opcode_t *
 			return 1;
 		break;
 	case EXPECT_CALL:
-		if (is_call_insn(&old_opcode[0]))
+		if (is_lea_insn(old_opcode))
 			return 1;
 		break;
 	}
@@ -963,7 +974,7 @@ static int verify_insn(struct page *page, unsigned long vaddr, uprobe_opcode_t *
  *   - SMP sync all CPUs
  */
 static int int3_update(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
-		       unsigned long vaddr, char *insn, bool optimize)
+		       unsigned long vaddr, char *insn, int size, bool optimize)
 {
 	uprobe_opcode_t int3 = UPROBE_SWBP_INSN;
 	struct write_opcode_ctx ctx = {
@@ -990,7 +1001,7 @@ static int int3_update(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
 
 	/* Write all but the first byte of the patched range. */
 	ctx.expect = EXPECT_SWBP;
-	err = uprobe_write(auprobe, vma, vaddr + 1, insn + 1, 4, verify_insn,
+	err = uprobe_write(auprobe, vma, vaddr + 1, insn + 1, size - 1, verify_insn,
 			   true /* is_register */, false /* do_update_ref_ctr */,
 			   &ctx);
 	if (err)
@@ -1017,17 +1028,35 @@ static int int3_update(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
 static int swbp_optimize(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
 			 unsigned long vaddr, unsigned long tramp)
 {
-	u8 call[5];
+	u8 insn[UPROBE_OPT_INSN_SIZE];
 
-	__text_gen_insn(call, CALL_INSN_OPCODE, (const void *) vaddr,
-			(const void *) tramp, CALL_INSN_SIZE);
-	return int3_update(auprobe, vma, vaddr, call, true /* optimize */);
+	/*
+	 * We have nop10 (with first byte overwritten to int3),
+	 * change it to:
+	 *   lea 0x80(%rsp), %rsp
+	 *   call tramp
+	 *
+	 * The first lea instruction skips the stack redzone so the call
+	 * instruction can safely push return address on stack.
+	 */
+	memcpy(insn, lea_rsp, LEA_INSN_SIZE);
+	__text_gen_insn(insn + LEA_INSN_SIZE, CALL_INSN_OPCODE,
+			(const void *)(vaddr + LEA_INSN_SIZE),
+			(const void *)tramp, CALL_INSN_SIZE);
+	return int3_update(auprobe, vma, vaddr, insn, UPROBE_OPT_INSN_SIZE, true /* optimize */);
 }
 
 static int swbp_unoptimize(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
 			   unsigned long vaddr)
 {
-	return int3_update(auprobe, vma, vaddr, auprobe->insn, false /* optimize */);
+	/*
+	 * Write JMP rel8 to end of the 10-byte slot instead of restoring the
+	 * original nop10, because we could have thread already inside lea
+	 * instruction.
+	 */
+	u8 jmp[UPROBE_OPT_INSN_SIZE] = { 0xeb, UPROBE_OPT_INSN_SIZE - 2 };
+
+	return int3_update(auprobe, vma, vaddr, jmp, 2, false /* optimize */);
 }
 
 static int copy_from_vaddr(struct mm_struct *mm, unsigned long vaddr, void *dst, int len)
@@ -1049,19 +1078,21 @@ static bool __is_optimized(uprobe_opcode_t *insn, unsigned long vaddr)
 	struct __packed __arch_relative_insn {
 		u8 op;
 		s32 raddr;
-	} *call = (struct __arch_relative_insn *) insn;
+	} *call = (struct __arch_relative_insn *)(insn + LEA_INSN_SIZE);
 
-	if (!is_call_insn(insn))
+	if (!is_lea_insn(insn))
 		return false;
-	return __in_uprobe_trampoline(vaddr + 5 + call->raddr);
+	if (!is_call_insn((uprobe_opcode_t *) call))
+		return false;
+	return __in_uprobe_trampoline(vaddr + UPROBE_OPT_INSN_SIZE + call->raddr);
 }
 
 static int is_optimized(struct mm_struct *mm, unsigned long vaddr)
 {
-	uprobe_opcode_t insn[5];
+	uprobe_opcode_t insn[UPROBE_OPT_INSN_SIZE];
 	int err;
 
-	err = copy_from_vaddr(mm, vaddr, &insn, 5);
+	err = copy_from_vaddr(mm, vaddr, &insn, UPROBE_OPT_INSN_SIZE);
 	if (err)
 		return err;
 	return __is_optimized((uprobe_opcode_t *)&insn, vaddr);
@@ -1131,7 +1162,7 @@ static int __arch_uprobe_optimize(struct arch_uprobe *auprobe, struct mm_struct
 void arch_uprobe_optimize(struct arch_uprobe *auprobe, unsigned long vaddr)
 {
 	struct mm_struct *mm = current->mm;
-	uprobe_opcode_t insn[5];
+	uprobe_opcode_t insn[UPROBE_OPT_INSN_SIZE];
 
 	if (!should_optimize(auprobe))
 		return;
@@ -1142,7 +1173,7 @@ void arch_uprobe_optimize(struct arch_uprobe *auprobe, unsigned long vaddr)
 	 * Check if some other thread already optimized the uprobe for us,
 	 * if it's the case just go away silently.
 	 */
-	if (copy_from_vaddr(mm, vaddr, &insn, 5))
+	if (copy_from_vaddr(mm, vaddr, &insn, UPROBE_OPT_INSN_SIZE))
 		goto unlock;
 	if (!is_swbp_insn((uprobe_opcode_t*) &insn))
 		goto unlock;
@@ -1160,14 +1191,23 @@ void arch_uprobe_optimize(struct arch_uprobe *auprobe, unsigned long vaddr)
 
 static bool can_optimize(struct insn *insn, unsigned long vaddr)
 {
-	if (!insn->x86_64 || insn->length != 5)
+	if (!insn->x86_64)
 		return false;
 
-	if (!insn_is_nop(insn))
+	/* We can't do cross page atomic writes yet. */
+	if (PAGE_SIZE - (vaddr & ~PAGE_MASK) < UPROBE_OPT_INSN_SIZE)
 		return false;
 
-	/* We can't do cross page atomic writes yet. */
-	return PAGE_SIZE - (vaddr & ~PAGE_MASK) >= 5;
+	if (insn->length == UPROBE_OPT_INSN_SIZE && insn_is_nop(insn))
+		return true;
+
+	/* JMP rel8 to end of slot — written by swbp_unoptimize. */
+	if (insn->length == 2 &&
+	    insn->opcode.bytes[0] == 0xEB &&
+	    insn->immediate.value == UPROBE_OPT_INSN_SIZE - 2)
+		return true;
+
+	return false;
 }
 #else /* 32-bit: */
 /*

^ permalink raw reply related

* Re: [PATCH bpf 1/2] uprobes/x86: Fix red zone clobbering in nop5 optimization
From: Jiri Olsa @ 2026-05-12 16:47 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Jiri Olsa, Andrii Nakryiko, bpf, linux-trace-kernel, oleg, peterz,
	mingo, mhiramat
In-Reply-To: <CAEf4Bza9PjbaVjFxYDmWPXXGV+Z-_Hn2Kz_KB2TOa5s-_UJ1xA@mail.gmail.com>

On Mon, May 11, 2026 at 06:41:06PM +0200, Andrii Nakryiko wrote:
> On Sun, May 10, 2026 at 2:25 PM Jiri Olsa <olsajiri@gmail.com> wrote:
> >
> > On Fri, May 08, 2026 at 05:30:56PM -0700, Andrii Nakryiko wrote:
> > > The x86 uprobe nop5 optimization currently replaces a 5-byte NOP at the
> > > probe site with a CALL into a uprobe trampoline. CALL pushes a return
> > > address to [rsp-8]. On x86-64 this is inside the 128-byte red zone, where
> > > user code may keep temporary data without adjusting rsp.
> > >
> > > Use a 5-byte JMP instead. JMP does not write to the user stack, but it
> > > also does not provide a return address. Replace the single trampoline
> > > entry with a page of 16-byte slots. Each optimized probe jumps to its
> > > assigned slot, the slot moves rsp below the red zone, saves the registers
> > > clobbered by syscall, and invokes the uprobe syscall:
> > >
> > >   Probe site:   jmp slot_N              (5B, replaces nop5)
> > >
> > >   Slot N:       lea  -128(%rsp), %rsp   (5B)  skip red zone
> > >                 push %rcx               (1B)  save (syscall clobbers)
> > >                 push %r11               (2B)  save (syscall clobbers)
> > >                 push %rax               (1B)  save (syscall uses for nr)
> > >                 mov  $336, %eax         (5B)  uprobe syscall number
> > >                 syscall                 (2B)
> > >
> > > All slots contain identical code at different offsets, so the trampoline
> > > page is generated once at boot and mapped read-execute into each process.
> > > The syscall handler identifies the slot from regs->ip, which points just
> > > after the syscall instruction, and uses a per-mm slot table to recover the
> > > original probe address.
> > >
> > > The uprobe syscall does not return to the trampoline slot. The handler
> > > restores the probe-site register state, runs the uprobe consumers, sets
> > > pt_regs to continue at probe_addr + 5 unless a consumer redirected
> > > execution, and returns directly through the IRET path. This preserves
> > > general purpose registers, including rcx and r11, without requiring any
> > > post-syscall cleanup code in the trampoline and avoids call/ret, RSB, and
> > > shadow stack concerns.
> > >
> > > Protect the per-mm trampoline list with RCU and free trampoline metadata
> > > with kfree_rcu(). This lets the syscall path resolve trampoline slots
> > > without taking mmap_lock. The optimized-instruction detection path also
> > > walks the trampoline list under an RCU read-side lock. Since that path
> > > starts from the JMP target, it translates the slot start to the post-syscall
> > > IP expected by the shared resolver before checking the trampoline mapping.
> > >
> > > Each trampoline page provides 256 slots. Slots stay permanently assigned
> > > to their first probe address and are reused only when the same address is
> > > probed again. Reassigning detached slots is deliberately avoided because a
> > > thread can remain in a trampoline for an unbounded time due to ptrace,
> > > interrupts, or scheduling delays. If a reachable trampoline page runs out
> > > of slots, probes that cannot allocate a slot fall back to the slower INT3
> > > path.
> > >
> > > Require the entire trampoline page to be reachable by a rel32 JMP before
> > > reusing it for a probe. This keeps every slot in the page within the range
> > > that can be encoded at the probe site.
> > >
> > > Change the error code returned when the uprobe syscall is invoked outside
> > > a kernel-generated trampoline from -ENXIO to -EPROTO. This lets libbpf and
> > > similar libraries distinguish fixed kernels from kernels with the
> > > red-zone-clobbering implementation and enable nop5 optimization only on
> > > fixed kernels.
> > >
> > > Performance (usdt single-thread, M/s):
> > >
> > >                   usdt-nop  usdt-nop5-base  usdt-nop5-fix  nop5-change  iret%
> > >   Skylake          3.149        6.422          4.865         -24.3%     39.1%
> > >   Milan            2.910        3.443          3.820         +11.0%     24.3%
> > >   Sapphire Rapids  1.896        4.023          3.693          -8.2%     24.9%
> > >   Bergamo          3.393        3.895          3.849          -1.2%     24.5%
> > >
> > > The fixed nop5 path remains faster than the non-optimized INT3 path on all
> > > measured systems. The regression relative to the old CALL-based trampoline
> > > comes from IRET being more expensive than SYSRET, most noticeably on older
> > > Intel Skylake. Newer Intel CPUs and tested AMD CPUs have lower IRET cost,
> > > and AMD Milan improves because removing mmap_lock from the hot path more
> > > than offsets the IRET cost.
> > >
> > > Multi-threaded throughput scales nearly linearly with the number of CPUs, like
> > > it used to, thanks to lockless RCU-protected uprobe trampoline lookup.
> >
> > hi,
> > thanks a lot for the fix
> >
> > FWIW we discussed also an option to have 10-bytes nop and do:
> >   [rsp+0x80, call trampoline]
> >
> > we would not need the slots re-use logic, but not sure what other
> > surprises there are with 10-bytes nop
> >
> > I tried that change [1], it seems to work, but it has other
> > difficulties, like I think the unoptimized path needs to do:
> >   [rsp+0x80, call trampoline] -> [jmp end of 10-bytes nop]
> > instead of patching back the 10-byte nop, because some thread
> > could be inside the nop area already.
> >
> 
> Yeah, nop10 and this jump-over-nop10 approach is an alternative. I
> don't have strong feelings apart from the ridiculousness of a 10-byte
> nop :)
> 
> did you get a chance to benchmark your nop10 approach, curious how do
> the number look like

yes, it's the same as with the nop5

  base:
          usermode-count :  152.509 ± 0.044M/s
          syscall-count  :   15.177 ± 0.021M/s
          uprobe-nop     :    3.215 ± 0.002M/s
          uprobe-push    :    3.054 ± 0.003M/s
          uprobe-ret     :    1.100 ± 0.002M/s
          uprobe-nop5    :    7.251 ± 0.034M/s
          uretprobe-nop  :    2.149 ± 0.012M/s
          uretprobe-push :    2.088 ± 0.001M/s
          uretprobe-ret  :    0.960 ± 0.001M/s
          uretprobe-nop5 :    3.402 ± 0.001M/s
          usdt-nop       :    3.185 ± 0.024M/s
          usdt-nop5      :    7.378 ± 0.016M/s

  nop10:
          usermode-count :  152.503 ± 0.024M/s
          syscall-count  :   15.977 ± 0.047M/s
          uprobe-nop     :    3.174 ± 0.011M/s
          uprobe-push    :    3.030 ± 0.006M/s
          uprobe-ret     :    1.124 ± 0.004M/s
          uprobe-nop5    :    7.201 ± 0.012M/s
          uretprobe-nop  :    2.141 ± 0.005M/s
          uretprobe-push :    2.078 ± 0.007M/s
          uretprobe-ret  :    0.947 ± 0.003M/s
          uretprobe-nop5 :    3.384 ± 0.014M/s
          usdt-nop       :    3.247 ± 0.002M/s
          usdt-nop5      :    7.374 ± 0.027M/s

jirka

^ permalink raw reply

* Re: [RFC][PATCH] unwind: Add stacktrace_setup system call
From: Steven Rostedt @ 2026-05-12 16:47 UTC (permalink / raw)
  To: Jens Remus
  Cc: LKML, Linux Trace Kernel, Masami Hiramatsu, Mathieu Desnoyers,
	Josh Poimboeuf, Peter Zijlstra, Ingo Molnar, Jiri Olsa,
	Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
	Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi, Beau Belgrave,
	Linus Torvalds, Andrew Morton, Florian Weimer, Kees Cook,
	Carlos O'Donell, Sam James, Dylan Hatch, Borislav Petkov,
	Dave Hansen, David Hildenbrand, H. Peter Anvin, Liam R. Howlett,
	Lorenzo Stoakes, Michal Hocko, Mike Rapoport, Suren Baghdasaryan,
	Vlastimil Babka, Heiko Carstens, Vasily Gorbik
In-Reply-To: <43158d95-b4c2-44d2-a244-eb546fb2bfaa@linux.ibm.com>

On Fri, 8 May 2026 09:46:30 +0200
Jens Remus <jremus@linux.ibm.com> wrote:

> >   STACKTRACE_REGISTER_SFRAME - This registers the sframe
> >   STACKTRACE_UNREGISTER_SFRAME - This removes the sframe
> > 
> > Signed-off-by: Steven Rostedt <rostedt@goodmis.org>  
> 
> LGTM.  Some comments/questions below.

Note, after talking with people at LSF/MM/BPF, I plan on completely
changing this system call into two distinct ones, and only for sframes.
I'll be sending that later this week.

> 
> > diff --git a/include/uapi/linux/stacktrace.h b/include/uapi/linux/stacktrace.h  
> 
> > @@ -0,0 +1,10 @@
> > +/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
> > +#ifndef _UAPI_LINUX_STACKTRACE_H
> > +#define _UAPI_LINUX_STACKTRACE_H
> > +
> > +enum stacktrace_setup_types {
> > +	STACKTRACE_REGISTER_SFRAME	= 1,
> > +	STACKTRACE_UNREGISTER_SFRAME	= 2,
> > +};
> > +
> > +#endif /* _UAPI_LINUX_STACKTRACE_H */  
> 
> > diff --git a/kernel/unwind/sframe.c b/kernel/unwind/sframe.c  
> 
> Having the syscall live in kernel/unwind/sframe.c means it is only
> available if config option HAVE_UNWIND_USER_SFRAME is selected (which
> triggers sframe.o to be built and linked into the kernel), which makes
> sense as long as it only implements sframe-specific functionality.
> I suppose it could be moved elsewhere if non-sframe use cases would
> arise in the future?

The new system calls will only be for sframes. Other unwinders will need to
implement their own system calls.

> 
> Would Dylan need to guard it when introducing HAVE_UNWIND_KERNEL_SFRAME?
> Provided the syscall fails with -ENOSYS if not implemented (e.g. when
> HAVE_UNWIND_USER_SFRAME is not enabled) the dummy implementations of
> sframe_add_section() and sframe_remove_section() in linux/sframe.h also
> return -ENOSYS, so the user observable behavior would be the same and
> it would not matter.  Do you agree?

I'll reply to that when Dylan's patches get closer to acceptance ;-)

> 
> > @@ -12,8 +12,10 @@
> >  #include <linux/mm.h>
> >  #include <linux/string_helpers.h>
> >  #include <linux/sframe.h>
> > +#include <linux/syscalls.h>
> >  #include <asm/unwind_user_sframe.h>
> >  #include <linux/unwind_user_types.h>
> > +#include <uapi/linux/stacktrace.h>
> >  
> >  #include "sframe.h"
> >  #include "sframe_debug.h"
> > @@ -838,3 +840,38 @@ void sframe_free_mm(struct mm_struct *mm)
> >  
> >  	mtree_destroy(&mm->sframe_mt);
> >  }
> > +
> > +/**
> > + * sys_stacktrace_setup - register an address for user space stacktrace walking.
> > + * @op: Type of operation to perform
> > + * @addr_start: The virtual address of the stacktrace information
> > + * @addr_length: The length of the stacktrace information
> > + * @text_start: The virtual address of the text that @addr_start represents
> > + * @text_length: The length of teh text
> > + *
> > + * This system call is used by dynamic library utilities to inform the kernel
> > + * of meta data that it loaded that can be used by the kernel to know how
> > + * to stack walk the given text locations.
> > + *
> > + * Currently only sframes are supported, but in the future, this may be used
> > + * to tell the kernel about JIT code which will most likely have a different
> > + * format.
> > + *
> > + * The type command may be extended and parameters may be used for other
> > + * purposes.
> > + *
> > + * Return: 0 if successful, otherwise a negative error.
> > + */
> > +SYSCALL_DEFINE5(stacktrace_setup, int, op, unsigned long, addr_start,
> > +		unsigned long, addr_length, unsigned long, text_start,
> > +		unsigned long, text_length)  
> 
> Would it make sense to keep the parameters generic from start, similar
> to how it is done in prctl()?  Or can this be changed later, if the need
> arises?

With discussions at LSF/MM/BPF I'll have the system call parameters be a
pointer to a structure, and a size of that structure. All the API will then
be part of the structure.

Thanks for reviewing,

-- Steve

> 
> SYSCALL_DEFINE5(stacktrace_setup, int, op, unsigned long, arg2,
> 		unsigned long, arg3, unsigned long, arg4, unsigned long, arg5)
> 
> > +{
> > +	switch (op) {
> > +	case STACKTRACE_REGISTER_SFRAME:
> > +		return sframe_add_section(addr_start, addr_start + addr_length,
> > +					  text_start, text_start+text_length);  
> 
> Nit:
> 					  text_start, text_start + text_length);
> 
> > +	case STACKTRACE_UNREGISTER_SFRAME:
> > +		return sframe_remove_section(addr_start);
> > +	}
> > +	return -EINVAL;
> > +}  
> Thanks and regards,
> Jens


^ permalink raw reply

* Re: [PATCHv2] uprobes: Use flexible array for xol_area bitmap
From: Oleg Nesterov @ 2026-05-12 16:17 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Rosen Penev, linux-kernel, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	James Clark, open list:PERFORMANCE EVENTS SUBSYSTEM,
	open list:UPROBES
In-Reply-To: <20260512234857.44e2b0fafa4961900fdb7246@kernel.org>

On 05/12, Masami Hiramatsu wrote:
>
> On Tue, 12 May 2026 13:29:52 +0200
> Oleg Nesterov <oleg@redhat.com> wrote:
>
> > >
> > > -	area = kzalloc_obj(*area);
> > > +	area = kzalloc_flex(*area, bitmap, BITS_TO_LONGS(UINSNS_PER_PAGE));
> >
> > The downside is that kmalloc will use kmem_cache with ->object_size = PAGE_SIZE * 2,
> > almost half of the allocated memory won't be used...
>
> Hmm, is the bitmap so big?
>
> #define UINSNS_PER_PAGE			(PAGE_SIZE/UPROBE_XOL_SLOT_BYTES)
>
> And even on arm64,
>
> #define UPROBE_XOL_SLOT_BYTES	AARCH64_INSN_SIZE
>
> So if PAGE_SIZE is 4k, UINSNS_PER_PAGE is 1k, its BITS_TO_LONGS will
> be 1024/64 = 16. So 128 bytes. So the object is allocated from
> object_size = 256 ?

Indeed you are right.

Sorry for the noise and thanks for correcting me! I can't even explain how can
I came to conclusion that object_size can be greater than PAGE_SIZE with this
change ;)

So I think the patch from Rosen is fine.

Thanks,

Oleg.


^ permalink raw reply

* Re: [PATCH mm-unstable v17 11/14] mm/khugepaged: Introduce mTHP collapse support
From: Wei Yang @ 2026-05-12 15:44 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat,
	mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe
In-Reply-To: <20260511185817.686831-12-npache@redhat.com>

On Mon, May 11, 2026 at 12:58:11PM -0600, Nico Pache wrote:
>Enable khugepaged to collapse to mTHP orders. This patch implements the
>main scanning logic using a bitmap to track occupied pages and a stack
>structure that allows us to find optimal collapse sizes.
>
>Previous to this patch, PMD collapse had 3 main phases, a light weight
>scanning phase (mmap_read_lock) that determines a potential PMD
>collapse, an alloc phase (mmap unlocked), then finally heavier collapse
>phase (mmap_write_lock).
>
>To enabled mTHP collapse we make the following changes:
>
>During PMD scan phase, track occupied pages in a bitmap. When mTHP
>orders are enabled, we remove the restriction of max_ptes_none during the
>scan phase to avoid missing potential mTHP collapse candidates. Once we
>have scanned the full PMD range and updated the bitmap to track occupied
>pages, we use the bitmap to find the optimal mTHP size.
>
>Implement collapse_scan_bitmap() to perform binary recursion on the bitmap
>and determine the best eligible order for the collapse. A stack structure
>is used instead of traditional recursion to manage the search. This also
>prevents a traditional recursive approach when the kernel stack struct is
>limited. The algorithm recursively splits the bitmap into smaller chunks to
>find the highest order mTHPs that satisfy the collapse criteria. We start
>by attempting the PMD order, then moved on the consecutively lower orders
>(mTHP collapse). The stack maintains a pair of variables (offset, order),
>indicating the number of PTEs from the start of the PMD, and the order of
>the potential collapse candidate.
>
>The algorithm for consuming the bitmap works as such:
>    1) push (0, HPAGE_PMD_ORDER) onto the stack
>    2) pop the stack
>    3) check if the number of set bits in that (offset,order) pair
>       statisfy the max_ptes_none threshold for that order
>    4) if yes, attempt collapse
>    5) if no (or collapse fails), push two new stack items representing
>       the left and right halves of the current bitmap range, at the
>       next lower order
>    6) repeat at step (2) until stack is empty.
>
>Below is a diagram representing the algorithm and stack items:
>
>                            offset   mid_offset
>                            |        |
>                            |        |
>                            v        v
>          ____________________________________
>         |          PTE Page Table            |
>         --------------------------------------
>			    <-------><------->
>                             order-1  order-1
>
>mTHP collapses reject regions containing swapped out or shared pages.
>This is because adding new entries can lead to new none pages, and these
>may lead to constant promotion into a higher order mTHP. A similar
>issue can occur with "max_ptes_none > HPAGE_PMD_NR/2" due to a collapse
>introducing at least 2x the number of pages, and on a future scan will
>satisfy the promotion condition once again. This issue is prevented via
>the collapse_max_ptes_none() function which imposes the max_ptes_none
>restrictions above.
>
>We currently only support mTHP collapse for max_ptes_none values of 0
>and HPAGE_PMD_NR - 1. resulting in the following behavior:
>
>    - max_ptes_none=0: Never introduce new empty pages during collapse
>    - max_ptes_none=HPAGE_PMD_NR-1: Always try collapse to the highest
>      available mTHP order
>
>Any other max_ptes_none value will emit a warning and skip mTHP collapse
>attempts. There should be no behavior change for PMD collapse.
>
>Once we determine what mTHP sizes fits best in that PMD range a collapse
>is attempted. A minimum collapse order of 2 is used as this is the lowest
>order supported by anon memory as defined by THP_ORDERS_ALL_ANON.
>
>Currently madv_collapse is not supported and will only attempt PMD
>collapse.
>
>We can also remove the check for is_khugepaged inside the PMD scan as
>the collapse_max_ptes_none() function handles this logic now.
>
>Signed-off-by: Nico Pache <npache@redhat.com>

[...]

>+static int mthp_collapse(struct mm_struct *mm, unsigned long address,
>+		int referenced, int unmapped, struct collapse_control *cc,
>+		unsigned long enabled_orders)
>+{
>+	unsigned int nr_occupied_ptes, nr_ptes;
>+	int max_ptes_none, collapsed = 0, stack_size = 0;
>+	unsigned long collapse_address;
>+	struct mthp_range range;
>+	u16 offset;
>+	u8 order;
>+
>+	collapse_mthp_stack_push(cc, &stack_size, 0, HPAGE_PMD_ORDER);
>+
>+	while (stack_size) {
>+		range = collapse_mthp_stack_pop(cc, &stack_size);
>+		order = range.order;
>+		offset = range.offset;
>+		nr_ptes = 1UL << order;
>+
>+		if (!test_bit(order, &enabled_orders))
>+			goto next_order;
>+
>+		max_ptes_none = collapse_max_ptes_none(cc, NULL, order);

I am thinking whether there is a behavioral change for userfaultfd_armed(vma).

collapse_single_pmd()
    collapse_scan_pmd
        max_ptes_none = collapse_max_ptes_none(cc, vma)
        max_ptes_none = KHUGEPAGED_MAX_PTES_LIMIT                --- (1)
        mthp_collapse
            max_ptes_none = collapse_max_ptes_none(cc, NULL)     --- (2)
            collapse_huge_page(mm)
                hugepage_vma_revalidate(&vma)
                __collapse_huge_page_isolate(vma)
                    max_ptes_none = collapse_max_ptes_none(cc, vma)

Before mthp_collapse() introduced, userfaultfd_armed(vma) is skipped if there
is any pte_none_or_zero() in collapse_scan_pmd().

But now, max_ptes_none could be set to KHUGEPAGED_MAX_PTES_LIMIT at (1), so
that we can scan all the pte to get the bitmap. This means
userfaultfd_armed(vma) could continue even with pte_none_or_zero().

Then in mthp_collapse(), collapse_max_ptes_none() at (2) ignores
userfaultfd_armed(vma), which means it will continue to collapse a
userfaultfd_armed(vma) when there is pte_none_or_zero(). 

The good news is we will stop at __collapse_huge_page_isolate(), where we
get collapse_max_ptes_none() with vma. But we already did a lot of work.

Not sure if I missed something.

>+
>+		if (max_ptes_none < 0)
>+			return collapsed;
>+
>+		nr_occupied_ptes = collapse_mthp_count_present(cc, offset,
>+							       nr_ptes);
>+
>+		if (nr_occupied_ptes >= nr_ptes - max_ptes_none) {
>+			int ret;
>+
>+			collapse_address = address + offset * PAGE_SIZE;
>+			ret = collapse_huge_page(mm, collapse_address, referenced,
>+						 unmapped, cc, order);
>+			if (ret == SCAN_SUCCEED) {
>+				collapsed += nr_ptes;
>+				continue;
>+			}
>+		}
>+
>+next_order:
>+		if (order > KHUGEPAGED_MIN_MTHP_ORDER) {
>+			const u8 next_order = order - 1;
>+			const u16 mid_offset = offset + (nr_ptes / 2);
>+
>+			collapse_mthp_stack_push(cc, &stack_size, mid_offset,
>+						 next_order);
>+			collapse_mthp_stack_push(cc, &stack_size, offset,
>+						 next_order);
>+		}
>+	}
>+	return collapsed;
>+}
>+
> static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> 		struct vm_area_struct *vma, unsigned long start_addr,
> 		bool *lock_dropped, struct collapse_control *cc)
> {
>-	const int max_ptes_none = collapse_max_ptes_none(cc, vma, HPAGE_PMD_ORDER);
>+	int max_ptes_none = collapse_max_ptes_none(cc, vma, HPAGE_PMD_ORDER);
> 	const unsigned int max_ptes_shared = collapse_max_ptes_shared(cc, HPAGE_PMD_ORDER);
> 	const unsigned int max_ptes_swap = collapse_max_ptes_swap(cc, HPAGE_PMD_ORDER);
>+	enum tva_type tva_flags = cc->is_khugepaged ? TVA_KHUGEPAGED : TVA_FORCED_COLLAPSE;
> 	pmd_t *pmd;
>-	pte_t *pte, *_pte;
>-	int none_or_zero = 0, shared = 0, referenced = 0;
>+	pte_t *pte, *_pte, pteval;
>+	int i;
>+	int none_or_zero = 0, shared = 0, nr_collapsed = 0, referenced = 0;
> 	enum scan_result result = SCAN_FAIL;
> 	struct page *page = NULL;
> 	struct folio *folio = NULL;
> 	unsigned long addr;
>+	unsigned long enabled_orders;
> 	spinlock_t *ptl;
> 	int node = NUMA_NO_NODE, unmapped = 0;
> 
>@@ -1429,8 +1579,19 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> 		goto out;
> 	}
> 
>+	bitmap_zero(cc->mthp_bitmap, MAX_PTRS_PER_PTE);
> 	memset(cc->node_load, 0, sizeof(cc->node_load));
> 	nodes_clear(cc->alloc_nmask);
>+
>+	enabled_orders = collapse_allowable_orders(vma, vma->vm_flags, tva_flags);

Would it be 0 at this point?

>+
>+	/*
>+	 * If PMD is the only enabled order, enforce max_ptes_none, otherwise
>+	 * scan all pages to populate the bitmap for mTHP collapse.
>+	 */
>+	if (enabled_orders != BIT(HPAGE_PMD_ORDER))
>+		max_ptes_none = KHUGEPAGED_MAX_PTES_LIMIT;
>+
> 	pte = pte_offset_map_lock(mm, pmd, start_addr, &ptl);
> 	if (!pte) {
> 		cc->progress++;
>@@ -1438,11 +1599,13 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> 		goto out;
> 	}
> 
>-	for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
>-	     _pte++, addr += PAGE_SIZE) {
>+	for (i = 0; i < HPAGE_PMD_NR; i++) {
>+		_pte = pte + i;
>+		addr = start_addr + i * PAGE_SIZE;
>+		pteval = ptep_get(_pte);
>+
> 		cc->progress++;
> 
>-		pte_t pteval = ptep_get(_pte);
> 		if (pte_none_or_zero(pteval)) {
> 			if (++none_or_zero > max_ptes_none) {
> 				result = SCAN_EXCEED_NONE_PTE;
>@@ -1522,6 +1685,8 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> 			}
> 		}
> 
>+		/* Set bit for occupied pages */
>+		__set_bit(i, cc->mthp_bitmap);
> 		/*
> 		 * Record which node the original page is from and save this
> 		 * information to cc->node_load[].
>@@ -1580,10 +1745,11 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> 	if (result == SCAN_SUCCEED) {
> 		/* collapse_huge_page expects the lock to be dropped before calling */
> 		mmap_read_unlock(mm);
>-		result = collapse_huge_page(mm, start_addr, referenced,
>-					    unmapped, cc, HPAGE_PMD_ORDER);
>+		nr_collapsed = mthp_collapse(mm, start_addr, referenced, unmapped,
>+					      cc, enabled_orders);
> 		/* collapse_huge_page will return with the mmap_lock released */

collapse_huge_page will return with mmap_lock released, but mthp_collapse()
may not?

> 		*lock_dropped = true;
>+		result = nr_collapsed ? SCAN_SUCCEED : SCAN_FAIL;
> 	}
> out:
> 	trace_mm_khugepaged_scan_pmd(mm, folio, referenced,
>-- 
>2.54.0

-- 
Wei Yang
Help you, Help me

^ permalink raw reply

* Re: [PATCHv2] uprobes: Use flexible array for xol_area bitmap
From: Masami Hiramatsu @ 2026-05-12 14:48 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Rosen Penev, linux-kernel, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	James Clark, Masami Hiramatsu,
	open list:PERFORMANCE EVENTS SUBSYSTEM, open list:UPROBES
In-Reply-To: <agMPMAnsCH2ZRhf5@redhat.com>

On Tue, 12 May 2026 13:29:52 +0200
Oleg Nesterov <oleg@redhat.com> wrote:

> On 05/11, Rosen Penev wrote:
> >
> >  struct xol_area {
> >  	wait_queue_head_t		wq;		/* if all slots are busy */
> > -	unsigned long			*bitmap;	/* 0 = free slot */
> >
> >  	struct page			*page;
> >  	/*
> > @@ -117,6 +116,7 @@ struct xol_area {
> >  	 * the vma go away, and we must handle that reasonably gracefully.
> >  	 */
> >  	unsigned long			vaddr;		/* Page(s) of instruction slots */
> > +	unsigned long			bitmap[];	/* 0 = free slot */
> >  };
> >
> >  static void uprobe_warn(struct task_struct *t, const char *msg)
> > @@ -1755,18 +1755,13 @@ static struct xol_area *__create_xol_area(unsigned long vaddr)
> >  	struct xol_area *area;
> >  	void *insns;
> >
> > -	area = kzalloc_obj(*area);
> > +	area = kzalloc_flex(*area, bitmap, BITS_TO_LONGS(UINSNS_PER_PAGE));
> 
> The downside is that kmalloc will use kmem_cache with ->object_size = PAGE_SIZE * 2,
> almost half of the allocated memory won't be used...

Hmm, is the bitmap so big? 

#define UINSNS_PER_PAGE			(PAGE_SIZE/UPROBE_XOL_SLOT_BYTES)

And even on arm64, 

#define UPROBE_XOL_SLOT_BYTES	AARCH64_INSN_SIZE

So if PAGE_SIZE is 4k, UINSNS_PER_PAGE is 1k, its BITS_TO_LONGS will
be 1024/64 = 16. So 128 bytes. So the object is allocated from
object_size = 256 ?

Thank you,

> 
> But technically the patch looks correct so I won't argue.
> 
> Oleg.
> 
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply

* Re: [PATCH] tracing: Switch trace_recursion_record.c code over to use guard()
From: Yash Suthar @ 2026-05-12 14:41 UTC (permalink / raw)
  To: rostedt
  Cc: mhiramat, mathieu.desnoyers, linux-kernel, linux-trace-kernel,
	skhan, me
In-Reply-To: <20260502174741.39636-1-yashsuthar983@gmail.com>

Gentle ping.

Sincerely,
Yash Suthar

On Sat, May 2, 2026 at 11:17 PM Yash Suthar <yashsuthar983@gmail.com> wrote:
>
> Switch mutex_lock()/mutex_unlock() to guard().
> also drop the ret local variable and return directly.
>
> Signed-off-by: Yash Suthar <yashsuthar983@gmail.com>
> ---
>  kernel/trace/trace_recursion_record.c | 8 +++-----
>  1 file changed, 3 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/trace/trace_recursion_record.c b/kernel/trace/trace_recursion_record.c
> index 784fe1fbb866..bac4bc844ccd 100644
> --- a/kernel/trace/trace_recursion_record.c
> +++ b/kernel/trace/trace_recursion_record.c
> @@ -180,9 +180,8 @@ static const struct seq_operations recursed_function_seq_ops = {
>
>  static int recursed_function_open(struct inode *inode, struct file *file)
>  {
> -       int ret = 0;
> +       guard(mutex)(&recursed_function_lock);
>
> -       mutex_lock(&recursed_function_lock);
>         /* If this file was opened for write, then erase contents */
>         if ((file->f_mode & FMODE_WRITE) && (file->f_flags & O_TRUNC)) {
>                 /* disable updating records */
> @@ -194,10 +193,9 @@ static int recursed_function_open(struct inode *inode, struct file *file)
>                 atomic_set(&nr_records, 0);
>         }
>         if (file->f_mode & FMODE_READ)
> -               ret = seq_open(file, &recursed_function_seq_ops);
> -       mutex_unlock(&recursed_function_lock);
> +               return seq_open(file, &recursed_function_seq_ops);
>
> -       return ret;
> +       return 0;
>  }
>
>  static ssize_t recursed_function_write(struct file *file,
> --
> 2.43.0
>

^ permalink raw reply

* [RFC PATCH v2 18/28] mm/damon: trace probe_hits
From: SeongJae Park @ 2026-05-12 14:36 UTC (permalink / raw)
  Cc: SeongJae Park, Andrew Morton, Masami Hiramatsu, Mathieu Desnoyers,
	Steven Rostedt, damon, linux-kernel, linux-mm, linux-trace-kernel
In-Reply-To: <20260512143645.113201-1-sj@kernel.org>

Introduce a new tracepoint for exposing the per-region per-probe
positive sample count via tracefs.

Signed-off-by: SeongJae Park <sj@kernel.org>
---
 include/trace/events/damon.h | 36 ++++++++++++++++++++++++++++++++++++
 mm/damon/core.c              |  7 +++++++
 2 files changed, 43 insertions(+)

diff --git a/include/trace/events/damon.h b/include/trace/events/damon.h
index 7e25f4469b81b..d7b94c7640217 100644
--- a/include/trace/events/damon.h
+++ b/include/trace/events/damon.h
@@ -130,6 +130,42 @@ TRACE_EVENT(damon_monitor_intervals_tune,
 	TP_printk("sample_us=%lu", __entry->sample_us)
 );
 
+TRACE_EVENT(damon_aggregated_v2,
+
+	TP_PROTO(unsigned int target_id, struct damon_region *r,
+		unsigned int nr_regions, unsigned int nr_probes),
+
+	TP_ARGS(target_id, r, nr_regions, nr_probes),
+
+	TP_STRUCT__entry(
+		__field(unsigned long, target_id)
+		__field(unsigned long, start)
+		__field(unsigned long, end)
+		__field(unsigned int, nr_regions)
+		__field(unsigned int, nr_accesses)
+		__field(unsigned int, age)
+		__dynamic_array(unsigned char, probe_hits, nr_probes)
+	),
+
+	TP_fast_assign(
+		__entry->target_id = target_id;
+		__entry->start = r->ar.start;
+		__entry->end = r->ar.end;
+		__entry->nr_regions = nr_regions;
+		__entry->nr_accesses = r->nr_accesses;
+		__entry->age = r->age;
+		memcpy(__get_dynamic_array(probe_hits), r->probe_hits,
+			sizeof(*r->probe_hits) * nr_probes);
+	),
+
+	TP_printk("target_id=%lu nr_regions=%u %lu-%lu: %u %u probe_hits=%s",
+			__entry->target_id, __entry->nr_regions,
+			__entry->start, __entry->end,
+			__entry->nr_accesses, __entry->age,
+			__print_hex(__get_dynamic_array(probe_hits),
+				__get_dynamic_array_len(probe_hits)))
+);
+
 TRACE_EVENT(damon_aggregated,
 
 	TP_PROTO(unsigned int target_id, struct damon_region *r,
diff --git a/mm/damon/core.c b/mm/damon/core.c
index fe6c789f2cecb..14b15c9876516 100644
--- a/mm/damon/core.c
+++ b/mm/damon/core.c
@@ -1905,6 +1905,11 @@ static void kdamond_reset_aggregated(struct damon_ctx *c)
 {
 	struct damon_target *t;
 	unsigned int ti = 0;	/* target's index */
+	unsigned int nr_probes = 0;
+	struct damon_probe *probe;
+
+	damon_for_each_probe(probe, c)
+		nr_probes++;
 
 	damon_for_each_target(t, c) {
 		struct damon_region *r;
@@ -1913,6 +1918,8 @@ static void kdamond_reset_aggregated(struct damon_ctx *c)
 			int i;
 
 			trace_damon_aggregated(ti, r, damon_nr_regions(t));
+			trace_damon_aggregated_v2(ti, r, damon_nr_regions(t),
+					nr_probes);
 			damon_warn_fix_nr_accesses_corruption(r);
 			r->last_nr_accesses = r->nr_accesses;
 			r->nr_accesses = 0;
-- 
2.47.3

^ permalink raw reply related

* [RFC PATCH v2 00/28] mm/damon: introduce data attributes monitoring
From: SeongJae Park @ 2026-05-12 14:36 UTC (permalink / raw)
  Cc: SeongJae Park, Liam R. Howlett, Andrew Morton, David Hildenbrand,
	Jonathan Corbet, Lorenzo Stoakes, Masami Hiramatsu,
	Mathieu Desnoyers, Michal Hocko, Mike Rapoport, Shuah Khan,
	Shuah Khan, Steven Rostedt, Suren Baghdasaryan, Vlastimil Babka,
	damon, linux-doc, linux-kernel, linux-kselftest, linux-mm,
	linux-trace-kernel

TL; DR
======

Extend DAMON for monitoring general data attributes other than accesses.
The short term motivation is lightweight page type (e.g., belonging
cgroup) aware monitoring.  In long term, this will help extending DAMON
for multiple access events capture primitives (e.g., page faults and
PMU) and eventually pivotting DAMON to a "Data Attributes Monitoring and
Operations eNgine" in long term.

Background: High Cost of Page Level Properties Monitoring
=========================================================

DAMON is initially introduced as a Data Access MONitor.  It has been
extended for not only access monitoring but also data access-aware
system operations (DAMOS).  But still the monitoring part is only for
data accesses.

Data access patterns is good information, but some users need more
holistic views.  Particularly, users want to show the access pattern
information together with the types of the memory.  For example, users
who work for making huge pages efficiently want to know how much of
DAMON-found hot/cold regions are backed by huge pages.  Users who run
multiple workloads with different cgroups want to know how much of
DAMON-found hot/cold regions belong to specific cgroups.

For the user demand, we developed a DAMOS extension for page level
properties based monitoring [1], which has landed on 6.14.  Using the
feature, users can inform the page level data properties that they are
interested in, in a flexible format that uses DAMOS filters.  Then,
DAMON applies the filters to each folio of the entire DAMON region and
lets users know how many bytes of memory in each DAMON region passed the
given filters.

This gives page level detailed and deterministic information to users.
But, because the operation is done at page level, the overhead is
proportional to the memory size.  It was useful for test or debugging
purposes on a small number of machines.  But it was obviously too heavy
to be enabled always on all machines running the real user workloads.
For real world workloads, it was recommended to use the feature with
user-space controlled sampling approaches.  For example, users could do
the page level monitoring only once per hour, on randomly selected one
percent of machines of their fleet.  If the runtime and the  size of the
fleet is long and big enough, it should provide statistically meaningful
data.

But users are too busy to implement such controls on their own.

Data Attributes Monitoring
==========================

Extend DAMON to monitor not only data accesses, but also general data
attributes.  Do the extension while keeping the main promise of DAMON,
the bounded and best-effort minimum overhead.

Allow users to specify what data attributes in addition to the data
access they want to monitor.  Users can install one 'data probe' per
data attribute of their interest for this purpose.  The 'data probe'
should be able to be applied to any memory, and determine if the given
memory has the appropriate data attribute.  E.g., if memory of physical
address 42 belongs to cgroup A.  Each 'data probe' is configured with
filters that are very similar to the DAMOS filters.

When DAMON checks if each sampling address memory of each region is
accessed since the last check, it applies data probes if registered.
Same to the number of access check-positive samples accounting
(nr_accesses), it accounts the number of each data probe-positive
samples in another per-region counters array, namely 'probe_hits'. When
DAMON resets nr_accesses every aggregation interval, it resets
'probe_hits' together.

Users can read 'probe_hits' just before the values are reset.  In this
way, users can know how many hot/cold memory regions have data
attributes of their interest.  E.g., 30 percent of this system's hot
memory is belonging to cgroup A, and 80 percent of the cgroup
A-belonging hot memory is backed by huge pages.

Patches Sequence
================

First eight patches implement the core feature, interface and the
working support.  Patch 1 introduces data probe data structure, namely
damon_probe.  Patch 2 extends damon_ctx for installing data probes.
Patch 3 introduces another data structure for filters of each data
probe, namely damon_filter.  Patch 4 updates damon_ctx commit function
to handle the probes.  Patch 5 extends damon_region for the per-region
per-probe positive samples counter, namely probe_hits.  Patch 6 extends
damon_operations for applying probes on the underlying DAMON operations
implementation.  Patch 7 updates kdamond_fn() to invoke the probes
applying callback.  Patch 8 finally implements the probes support on
paddr ops.

Ten changes for user interface (patches 9-18) come next.  Patches 9-13
implements sysfs directories and files for setting data probes, namely
probes directory, probe directory, filters directory, filter directory
and filter directory internal files, respectively.  Patch 14 connects
the user inputs that are made via the sysfs files to DAMON core.
Following three patches (patches 15-17) implement sysfs directories and
files for showing the probe_hits to users, namely probes directory,
probe directory and hits files, respectively.  Patch 18 introduces a new
tracepoint for showing the probe_hits via tracefs.

Patch 19 adds a selftest for the sysfs files.

Patches 20 and 21 documents the design and usage of the new feature,
respectively.

Seven additional patches (patches 22-28) for monitoring belonging memory
cgroup follow.  Depending on the feedback, this part might be separated
to another series in future.  Patch 22 defines the DAMON filter type for
the new attribute, namely DAMON_FILTER_TYPE_MEMCG.  Patch 23 add the
support on paddr ops.  Patch 24 updates the sysfs interface for setup of
the target memcg.  Patch 25 move code for easy reuse of the filter
target memcg setup.  Patch 26 connects the user input to the core layer.
Finally, patches 27 and 28 update the design and usage documents for the
memcg attribute monitoring support.

Discussions
===========

This allows the page properties monitoring with overhead that is low
enough to be enabled always on real world workloads.  Because the
sampling time for access check is reused for data attributes check,  the
upper-bounded and best-effort minimum overhead of DAMON is kept.
Because the sampling memory for access check is reused for data
attributes check, additional overhead is minimum.

Still DAMOS-based page level properties monitoring should be useful,
because it provides a deterministic page level information.  When in
doubt of the sampling based information, running DAMOS-based one
together and comparing the results would be useful, for debugging and
tuning.

Plan for Dropping RFC tag
=========================

I'm considering renaming the tracepoint for exposing probe_hits
(damon_aggregated_v2).

Making changes for feedback from myself, humans and Sashiko should be
the major remaining work.

I'm currently hoping to drop the RFC tag by 7.2-rc1.

Future Works: Mid Term
========================

This version of implementation is limiting the maximum number of data
probes to four.  I will try to find a way to remove the limit in future.
I personally think it should be enough for common use cases, though, and
therefore not giving high priority at the moment.

Future Works: Long Term
=======================

There are user requests for extending DAMON with detailed access
information, for example, per-CPUs/threads/read/writes monitoring.  For
that, I was working [2] on extending DAMON to use page fault events as
another access check primitives, and making the infrastructure flexible
for future use of yet another access check primitive.  Actually there is
another ongoing work [3] for extending DAMON with PMU events.  The
motivation of the work is reducing the overhead, though.

In my work [2], I was introducing a new interface for access sampling
primitives control.  Now I think this data probe interface can be used
for that, too.  That is, data access becomes just one type of data
attribute.  Also, pg_idle-confirmed access, page fault-confirmed access,
and PMU event-confirmed access will be different types of data
attributes.

The regions adjustment mechanism is currently working based on the
access information.  That's because DAMON is designed for data access
monitoring.  That is, data access information is the primary interest,
and therefore DAMON adjusts regions in a way that can best-present the
information.

Once data access becomes just one of data attributes, there is no reason
to think data access that special.  There might be some users not
interested in access at all but want to know the location of memory of
specific type.  Data probes interface will allow doing that.  Further,
we could extend the interface to let users set any data attribute as the
'primary' attribute.  Then, DAMON will split and merge regions in a way
that can best-present the 'primary' attributes.

DAMOS will also be extended, to specify targets based on not only the
data access pattern, but all user-registered data attributes.  From this
stage, we may be able to call DAMON as a "Data Attributes Monitoring and
Operations eNgine".

[1] https://lore.kernel.org/20250106193401.109161-1-sj@kernel.org
[2] https://lore.kernel.org/20251208062943.68824-1-sj@kernel.org/
[3] https://lore.kernel.org/20260423004211.7037-1-akinobu.mita@gmail.com

Changes from RFC
- rfc: https://lore.kernel.org/all/20260426205222.93895-1-sj@kernel.org/
- Support memcg DAMON filter.
- Use per-probe probe_hits sysfs file.
- Use dynamic_array for probe_hits tracing.
- Fix filter matching field.
- Fix folio leaking in damon_pa_filter_pass().
- Move nr_regions of damon_aggregated_v2 tracepoint after end.
- Rename DAMON_TEST_TYPE_ANON to DAMON_FILTER_TYPE_ANON.

SeongJae Park (28):
  mm/damon/core: introduce struct damon_probe
  mm/damon/core: embed damon_probe objects in damon_ctx
  mm/damon/core: introduce damon_filter
  mm/damon/core: commit probes
  mm/damon/core: introduce damon_region->probe_hits
  mm/damon/core: introduce damon_ops->apply_probes
  mm/damon/core: do data attributes monitoring
  mm/damon/paddr: support data attributes monitoring
  mm/damon/sysfs: implement probes dir
  mm/damon/sysfs: implement probe dir
  mm/damon/sysfs: implement filters directory
  mm/damon/sysfs: implement filter dir
  mm/damon/sysfs: implement filter dir files
  mm/damon/sysfs: setup probes on DAMON core API parameters
  mm/damon/sysfs-schemes: implement tried_regions/<r>/probes/
  mm/damon/sysfs-schemes: implement probe dir
  mm/damon/sysfs-schemes: implement probe/hits file
  mm/damon: trace probe_hits
  selftests/damon/sysfs.sh: test probes dir
  Docs/mm/damon/design: document data attributes monitoring
  Docs/admin-guide/mm/damon/usage: document data attributes monitoring
  mm/damon/core: introduce DAMON_FILTER_TYPE_MEMCG
  mm/damon/paddr: support DAMON_FILTER_TYPE_MEMCG
  mm/damon/sysfs: add filters/<F>/path file
  mm/damon/sysfs-schemes: move memcg_path_to_id() to sysfs-common
  mm/damon/sysfs: setup damon_filter->memcg_id from path
  Docs/mm/damon/design: update for memcg damon filter
  Docs/admin-guide/mm/damon/usage: update for memcg damon filter

 Documentation/admin-guide/mm/damon/usage.rst |  48 +-
 Documentation/mm/damon/design.rst            |  39 ++
 include/linux/damon.h                        |  67 +++
 include/trace/events/damon.h                 |  36 ++
 mm/damon/core.c                              | 195 +++++++
 mm/damon/paddr.c                             |  76 +++
 mm/damon/sysfs-common.c                      |  41 ++
 mm/damon/sysfs-common.h                      |   2 +
 mm/damon/sysfs-schemes.c                     | 222 ++++++--
 mm/damon/sysfs.c                             | 557 +++++++++++++++++++
 tools/testing/selftests/damon/sysfs.sh       |  48 ++
 11 files changed, 1280 insertions(+), 51 deletions(-)


base-commit: 610724cfd93c1c413faf9e5bb63926fe54849887
-- 
2.47.3

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox