Linux Trace Kernel
* Re: [PATCH v6 5/7] locking: Add contended_release tracepoint to qspinlock
From: Peter Zijlstra @ 2026-06-11 13:44 UTC (permalink / raw)
  To: Dmitry Ilvokhin
  Cc: Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long,
	Thomas Bogendoerfer, Juergen Gross, Ajay Kaher, Alexey Makhalov,
	Broadcom internal kernel review list, Thomas Gleixner,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Arnd Bergmann,
	Dennis Zhou, Tejun Heo, Christoph Lameter, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, linux-kernel, linux-mips,
	virtualization, linux-arch, linux-mm, linux-trace-kernel,
	kernel-team, Paul E. McKenney
In-Reply-To: <aiphFXe_TPNPxZ_n@shell.ilvokhin.com>

On Thu, Jun 11, 2026 at 07:17:41AM +0000, Dmitry Ilvokhin wrote:
> On Wed, Jun 03, 2026 at 02:08:11PM +0200, Peter Zijlstra wrote:
> > Also, I think someone should go do some performance runs with
> > ARCH_INLINE_SPIN_* set for x86 just like for s390.
> 
> As promised, I set ARCH_INLINE_SPIN_UNLOCK{,_BH,_IRQ,_IRQRESTORE} for
> x86 and measured the effect on a few real workloads.
> 
> Short version: inlining of _raw_spin_unlock() adds measurable kernel
> i-cache pressure on every workload I tried, and on a
> kernel-i-cache-bound one (nginx connection churn) it costs ~1.27%
> throughput. I did not find a workload where it helps.

Thanks for checking!

* Re: [PATCHv7 bpf-next 03/29] ftrace: Add add_ftrace_hash_entry function
From: Steven Rostedt @ 2026-06-11 13:46 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Kumar Kartikeya Dwivedi, Jiri Olsa, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, bpf, linux-trace-kernel,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	Menglong Dong
In-Reply-To: <CAADnVQ+RhRiUp8FeptdVKZimGd-Wv8C+NbW5U4hh+7Hi5abQxw@mail.gmail.com>

On Wed, 10 Jun 2026 08:42:51 -0700
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> If my memory doesn't fail me you said it's fine during v1,v2 iterations.

I'm fine with it, but I wanted to test it first. Did I give an Acked-by?

> The last v3 - v8 you were silent, so we assumed you're still fine.

When I see AI reports that say the series needs a fix, I don't bother
looking. I thought that was the entire point of AI; to let the maintainer
not have to review if the AI found something.


> 
> While at it, please review Mykyta's set:
> https://patchwork.kernel.org/user/todo/netdevbpf/?series=1096695
> 
> It's also been pending for almost a month now.

Have a better link? I just get a blank page as "TODO" is set to what I have.

-- Steve


* [PATCH] tracing: ring_buffer: Check page order under reader_lock
From: Yash Suthar @ 2026-06-11 15:17 UTC (permalink / raw)
  To: rostedt, mhiramat
  Cc: mathieu.desnoyers, tz.stoyanov, linux-kernel, linux-trace-kernel,
	skhan, me, syzbot+2dd9d02f60775ce5c1fb, Yash Suthar

When ring_buffer_subbuf_order_set() swaps the sub-buffer order
concurrently, ring_buffer_read_page() can read a stale page size,
because the order can change after it has been checked.

If the order changes, the memset() at the end of ring_buffer_read_page()
uses the new subbuf_size, which can be larger than the old one, and we
then hit an out-of-bounds write.

To resolve this, move the order check under the reader_lock and
calculate subbuf_size from the checked order to prevent the race.

syzbot did not provide a reproducer for this crash; the race
condition was found via code inspection of the trace and is
logically sound.

Fixes: bce761d75745 ("ring-buffer: Read and write to ring buffers with custom sub buffer size")
Reported-by: syzbot+2dd9d02f60775ce5c1fb@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=2dd9d02f60775ce5c1fb

Signed-off-by: Yash Suthar <yashsuthar983@gmail.com>
---
 kernel/trace/ring_buffer.c | 20 ++++++++++++++------
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 7b07d2004cc6..e098eeb1d694 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -6898,6 +6898,7 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
 	struct buffer_data_page *bpage;
 	struct buffer_page *reader;
 	unsigned long missed_events;
+	unsigned int subbuf_size;
 	unsigned int commit;
 	unsigned int read;
 	u64 save_timestamp;
@@ -6918,15 +6919,22 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
 	if (!data_page || !data_page->data)
 		return -1;
 
-	if (data_page->order != buffer->subbuf_order)
-		return -1;
-
 	bpage = data_page->data;
 	if (!bpage)
 		return -1;
 
 	guard(raw_spinlock_irqsave)(&cpu_buffer->reader_lock);
 
+	/*
+	 * Check data_page order under lock to prevent a race with a
+	 * concurrent ring_buffer_subbuf_order_set() swap, which can
+	 * cause an out-of-bounds memset() if the subbuf_size changes.
+	 */
+	if (data_page->order != buffer->subbuf_order)
+		return -1;
+
+	subbuf_size = (PAGE_SIZE << data_page->order) - BUF_PAGE_HDR_SIZE;
+
 	reader = rb_get_reader_page(cpu_buffer);
 	if (!reader)
 		return -1;
@@ -7043,7 +7051,7 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
 		/* If there is room at the end of the page to save the
 		 * missed events, then record it there.
 		 */
-		if (buffer->subbuf_size - commit >= sizeof(missed_events)) {
+		if (subbuf_size - commit >= sizeof(missed_events)) {
 			memcpy(&bpage->data[commit], &missed_events,
 			       sizeof(missed_events));
 			local_add(RB_MISSED_STORED, &bpage->commit);
@@ -7055,8 +7063,8 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
 	/*
 	 * This page may be off to user land. Zero it out here.
 	 */
-	if (commit < buffer->subbuf_size)
-		memset(&bpage->data[commit], 0, buffer->subbuf_size - commit);
+	if (commit < subbuf_size)
+		memset(&bpage->data[commit], 0, subbuf_size - commit);
 
 	return read;
 }
-- 
2.43.0


* Re: [PATCH v7 00/42] guest_memfd: In-place conversion support
From: Sean Christopherson @ 2026-06-11 15:46 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: Ackerley Tng via B4 Relay, aik, andrew.jones, binbin.wu, brauner,
	chao.p.peng, david, ira.weiny, jmattson, jthoughton, michael.roth,
	oupton, pankaj.gupta, qperret, rick.p.edgecombe, rientjes,
	shivankg, steven.price, tabba, willy, wyihan, yan.y.zhao,
	forkloop, pratyush, suzuki.poulose, aneesh.kumar, liam,
	Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
	Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka,
	kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <CAEvNRgF31BzyFyVUa7tDJ=qJ-8ws2kxfNjLxmV=OxKSqhaOiPw@mail.gmail.com>

On Wed, Jun 10, 2026, Ackerley Tng wrote:
> Sean Christopherson <seanjc@google.com> writes:
> 
> > On Thu, Jun 04, 2026, Ackerley Tng wrote:
> >> Sean Christopherson <seanjc@google.com> writes:
> >> >> + KVM: selftests: Test conversion with elevated page refcount
> >> >>     + Askar pointed out that soon vmsplice may not pin pages. Should I
> >> >>       pin pages through CONFIG_GUP_TEST like in [2]? I prefer not to
> >> >>       take a dependency on CONFIG_GUP_TEST.
> >> >
> >> > I'm not exactly excited about taking a dependency on CONFIG_GUP_TEST either, but
> >> > it probably is the least awful choice.  E.g. KVM also pins pages in certain flows,
> >> > but we're _also_ actively working to remove the need to pin.
> >> >
> >> > Hmm, maybe IORING_REGISTER_PBUF_RING?  AFAICT, it's almost literally a "pin user
> >> > memory" syscall.
> >> >
> >>
> >> Hmm that takes a dependency on io_uring, which isn't always compiled
> >> in. Between CONFIG_IO_URING and CONFIG_GUP_TEST, I'd rather
> >> CONFIG_GUP_TEST.
> >
> > Or try both?  If it's not a ridiculous amount of work.
> 
> CONFIG_GUP_TEST was tried in [1]
> 
> [1] https://lore.kernel.org/all/baa8838f623102931e755cf34c86314b305af49c.1747264138.git.ackerleytng@google.com/
> 
> It looks like this
> 
>   static void pin_pages(void *vaddr, uint64_t size)
>   {
>   	const struct pin_longterm_test args = {
>   		.addr = (uint64_t)vaddr,
>   		.size = size,
>   		.flags = PIN_LONGTERM_TEST_FLAG_USE_WRITE,
>   	};
> 
>   	gup_test_fd = open("/sys/kernel/debug/gup_test", O_RDWR);
>   	TEST_REQUIRE(gup_test_fd > 0);

Use __open_path_or_exit().  I also think it makes sense to make these available
to all KVM selftests, there are probably other testcases that could utilize page
pinning.

>   	TEST_ASSERT_EQ(ioctl(gup_test_fd, PIN_LONGTERM_TEST_START, &args), 0);
>   }
> 
>   static void unpin_pages(void)
>   {
>   	TEST_ASSERT_EQ(ioctl(gup_test_fd, PIN_LONGTERM_TEST_STOP), 0);
>   }
> 
> So in the test I'll call pin_pages(), then try to convert, see that it
> fails with EAGAIN and reports the expected error_offset, then I call
> unpin_pages(), then I convert again and expect success.
> 
> Are you uncomfortable with the CONFIG_GUP_TEST interface?

No, my concern is/was the potential for leaking pages if the test fails/crashes,
but it looks like gup_test_release() ensures all pins are dropped when the file is
released, so that should be a non-issue.

> What would you like me to try with CONFIG_IO_URING? I'm thinking that the
> main difference between the two is just down to which non-default CONFIG
> option we want to take for guest_memfd tests.

* Re: [PATCH v2 3/4] mm/fs: split the file's i_mmap tree
From: Lorenzo Stoakes @ 2026-06-11 15:48 UTC (permalink / raw)
  To: Huang Shijie
  Cc: Pedro Falcato, akpm, viro, brauner, jack, muchun.song, osalvador,
	david, surenb, mjguzik, liam, vbabka, shakeel.butt, rppt, mhocko,
	corbet, skhan, linux, dinguyen, schuster.simon, James.Bottomley,
	deller, djbw, willy, peterz, mingo, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, james.clark,
	mhiramat, oleg, ziy, baolin.wang, npache, ryan.roberts, dev.jain,
	baohua, lance.yang, linmiaohe, nao.horiguchi, jannh, riel, harry,
	will, brian.ruley, rmk+kernel, dave.anglin, linux-mm, linux-doc,
	linux-kernel, linux-arm-kernel, linux-parisc, linux-fsdevel,
	nvdimm, linux-perf-users, linux-trace-kernel, zhongyuan,
	fangbaoshun, yingzhiwei
In-Reply-To: <aiqFgGbIo1Psy3pI@pedro-suse.lan>

On Thu, Jun 11, 2026 at 12:11:27PM +0100, Pedro Falcato wrote:
> Hi,
>
> On Thu, Jun 11, 2026 at 02:18:59PM +0800, Huang Shijie wrote:
> > In the UnixBench tests, there is a test "execl" which tests
> > the execve system call.
> >   For example, a Hygon's server has 12 NUMA nodes, and 384 CPUs.
> > When we test our server with "./Run -c 384 execl",
> > the test result is not good enough. The i_mmap locks contended heavily on
> > "libc.so" and "ld.so". The i_mmap tree for "libc.so" can have
> > over 6000 VMAs, all the VMAs can be in different NUMA nodes. The insert/remove
> > operations do not run quickly enough.
>
> I _really_ would have appreciated some coordination here, because I said I was
> going to take a look at it. I have something that I think is much simpler

Agreed, this is the second (or in fact third?) time in recent weeks that
I'm aware of where publicly discussed work has been duplicated with a
series that came in later.

It's really important, when doing work that impacts core stuff, to have a
look around and see if others are looking at it, as there's nothing more
frustrating than to work on something, discuss it publicly, only to find
somebody sends a competing series.

It can be tricky, as sometimes it's not obvious, or it might not be so
easily found, but I would strongly suggest always making an effort on that
front.

But you didn't even try to send this as an RFC either :)

> in practice. These patches are also way too complex to be dropped just before
> the merge window.

This late in the cycle means -> next cycle. So you'd have needed to resend
it at rc1 in a couple weeks anyway.

>
> Some comments:
>
> >
> >  In order to reduce the competition of the i_mmap lock, this patch does
> > following:
> >    1.) Split the single i_mmap tree into several sibling trees:
> >        Each tree has a lock. The CONFIG_SPLIT_I_MMAP is used to
> >        turn on/off this feature.
>
> There is no need for a config option. This needs to Just Work.

Yeah, this is just a no-go. We don't add config options for changes to core
rmap code.

>
> >    2.) Introduce a new field "tree_idx" for vm_area_struct to save the
> >        sibling tree index for this VMA.
>
> This is possibly contentious, but there are holes in vm_area_struct.
> So I think this is fine.

Yeah no thanks for the extra field, I already have plans for those gaps in
vm_area_struct.

I am in fact writing code right now that uses them...

>
> >    3.) Introduce a new field "vma_count" for address_space.
> >        The new mapping_mapped() will use it.
> >    4.) Rewrite the vma_interval_tree_foreach()

I also intend to send a series that does a bunch of changes in the rmap
code that this would conflict with.

So let's all coordinate please.

> >    5.) Rewrite the lock functions.

Yeah looping on file rmap lock/unlock is gross.

> >
> >  After this patch, the VMA insert/remove operations will work faster,
> > and we can get over 400% performance improvement with the above test.
> >
> > Signed-off-by: Huang Shijie <huangsj@hygon.cn>

I had a look through and this code is really overwrought and you're putting
a bunch of confusing open-coded logic all over the codebase without comments.

This isn't upstreamable quality and you really should have sent this as an
RFC first so we could discuss the approach.

Thanks, Lorenzo

* Re: [PATCH v2 1/4] mm: use mapping_mapped to simplify the code
From: Lorenzo Stoakes @ 2026-06-11 15:52 UTC (permalink / raw)
  To: Huang Shijie
  Cc: akpm, viro, brauner, jack, muchun.song, osalvador, david, surenb,
	mjguzik, liam, vbabka, shakeel.butt, rppt, mhocko, corbet, skhan,
	linux, dinguyen, schuster.simon, James.Bottomley, deller, djbw,
	willy, peterz, mingo, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, james.clark,
	mhiramat, oleg, ziy, baolin.wang, npache, ryan.roberts, dev.jain,
	baohua, lance.yang, linmiaohe, nao.horiguchi, jannh, pfalcato,
	riel, harry, will, brian.ruley, rmk+kernel, dave.anglin, linux-mm,
	linux-doc, linux-kernel, linux-arm-kernel, linux-parisc,
	linux-fsdevel, nvdimm, linux-perf-users, linux-trace-kernel,
	zhongyuan, fangbaoshun, yingzhiwei
In-Reply-To: <20260611061915.2354307-2-huangsj@hygon.cn>

On Thu, Jun 11, 2026 at 02:18:57PM +0800, Huang Shijie wrote:
> Use mapping_mapped() to simplify the code, make
> the code tidy and clean.
>
> Signed-off-by: Huang Shijie <huangsj@hygon.cn>

Yeah as Pedro said this one could just be sent separately, and I in fact
suggest you do that :) So:

Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>

Cheers, Lorenzo

> ---
>  fs/hugetlbfs/inode.c | 4 ++--
>  mm/memory.c          | 4 ++--
>  2 files changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index 78d61bf2bd9b..216e1a0dd0b2 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -614,7 +614,7 @@ static void hugetlb_vmtruncate(struct inode *inode, loff_t offset)
>
>  	i_size_write(inode, offset);
>  	i_mmap_lock_write(mapping);
> -	if (!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root))
> +	if (mapping_mapped(mapping))
>  		hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0,
>  				      ZAP_FLAG_DROP_MARKER);
>  	i_mmap_unlock_write(mapping);
> @@ -675,7 +675,7 @@ static long hugetlbfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
>
>  	/* Unmap users of full pages in the hole. */
>  	if (hole_end > hole_start) {
> -		if (!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root))
> +		if (mapping_mapped(mapping))
>  			hugetlb_vmdelete_list(&mapping->i_mmap,
>  					      hole_start >> PAGE_SHIFT,
>  					      hole_end >> PAGE_SHIFT, 0);
> diff --git a/mm/memory.c b/mm/memory.c
> index 86a973119bd4..5335077765e2 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4386,7 +4386,7 @@ void unmap_mapping_folio(struct folio *folio)
>  	details.zap_flags = ZAP_FLAG_DROP_MARKER;
>
>  	i_mmap_lock_read(mapping);
> -	if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root)))
> +	if (unlikely(mapping_mapped(mapping)))
>  		unmap_mapping_range_tree(&mapping->i_mmap, first_index,
>  					 last_index, &details);
>  	i_mmap_unlock_read(mapping);
> @@ -4416,7 +4416,7 @@ void unmap_mapping_pages(struct address_space *mapping, pgoff_t start,
>  		last_index = ULONG_MAX;
>
>  	i_mmap_lock_read(mapping);
> -	if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root)))
> +	if (unlikely(mapping_mapped(mapping)))
>  		unmap_mapping_range_tree(&mapping->i_mmap, first_index,
>  					 last_index, &details);
>  	i_mmap_unlock_read(mapping);
> --
> 2.53.0
>
>

* Re: [PATCH v4] rethook: Remove the running task check in rethook_find_ret_addr()
From: XIAO WU @ 2026-06-11 15:53 UTC (permalink / raw)
  To: sashiko-reviews, Masami Hiramatsu, Petr Mladek, Peter Zijlstra
  Cc: Tengda Wu, Mathieu Desnoyers, Alexei Starovoitov, Steven Rostedt,
	linux-kernel, linux-trace-kernel, live-patching
In-Reply-To: <20260610015032.4BFAA1F00893@smtp.kernel.org>

Hi Tengda,

Sashiko [1] reviewed this patch and found that removing the
task_is_running() check exposes stack unwinders to real crashes — not
just "invalid information."  A PoC confirms this: a KASAN panic triggers
within seconds when /proc/<pid>/stack reads the stack of a task that is
concurrently running a kretprobe.

[1] https://sashiko.dev/#/patchset/20260610013658.1837963-1-wutengda%40huaweicloud.com

 > diff --git a/kernel/trace/rethook.c b/kernel/trace/rethook.c
 > index 5a8bdf88999a..1e7fdebe3cd5 100644
 > --- a/kernel/trace/rethook.c
 > +++ b/kernel/trace/rethook.c
 > @@ -250,9 +251,6 @@ unsigned long rethook_find_ret_addr(struct task_struct *tsk, unsigned long frame
 >      if (WARN_ON_ONCE(!cur))
 >          return 0;
 >
 > -    if (tsk != current && task_is_running(tsk))
 > -        return 0;
 > -
 >      do {
 >          ret = __rethook_find_ret_addr(tsk, cur);
 >          if (!ret)

The commit message states:

 > The iteration is already safe from crashes because
 > unwind_next_frame() holds RCU and rethook_node structures are
 > RCU-freed; even if the iteration goes off the rails and returns
 > invalid information, it will not crash.

There are two problems with this claim, both reproducible.

**Problem 1: stack-out-of-bounds in unwind_next_frame itself**

The PoC below reliably triggers the following KASAN panic — not in the
rethook list traversal, but inside unwind_next_frame():

[ 1833.494623] BUG: KASAN: stack-out-of-bounds in unwind_next_frame+0x861/0x2080
[ 1833.494651] Read of size 2 at addr ffffc90003e6f5f0 by task poc/9854
[ 1833.494707] Call Trace:
[ 1833.494719]  dump_stack_lvl+0x116/0x1f0
[ 1833.494743]  print_report+0xf4/0x600
[ 1833.494788]  kasan_report+0xe0/0x110
[ 1833.494836]  unwind_next_frame+0x861/0x2080
[ 1833.494948]  arch_stack_walk+0x99/0x100
[ 1833.495000]  stack_trace_save_tsk+0x16a/0x200
[ 1833.495054]  proc_pid_stack+0x173/0x2b0
[ 1833.495103]  seq_read_iter+0x519/0x12d0
[ 1833.495166]  seq_read+0x3b7/0x590
[ 1833.495297]  vfs_read+0x1f5/0xd20
[ 1833.495497]  ksys_read+0x135/0x250
[ 1833.495549]  do_syscall_64+0x129/0x850
[ 1833.495566]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 1833.498894] Kernel panic - not syncing: KASAN: panic_on_warn set ...

page last free pid 9737 tgid 9737 stack trace:
  do_sys_openat2+0xbf/0x260          <-- target task inside kretprobe
  __x64_sys_openat+0x179/0x210

This crash has nothing to do with rethook_node lifetimes or RCU.  It
happens because the ORC unwinder reads stack memory while the target
task concurrently executes a kretprobe trampoline that modifies return
addresses.  The unwinder follows corrupted frame data past valid stack
boundaries.  RCU protection of rethook_node structures is irrelevant —
this crash occurs at the stack frame interpretation level, before any
rethook list traversal.

The old task_is_running() check prevented the unwinder from attempting
to unwind a running task's stack in the first place.

**Problem 2: use-after-free via rethook_node recycling**

Even if the stack-out-of-bounds above were addressed, a second crash
path exists in the rethook list traversal itself.

rethook_recycle() immediately pushes nodes back to the objpool without
an RCU grace period:

   kernel/trace/rethook.c:
   void rethook_recycle(struct rethook_node *node)
   {
           ...
           objpool_push(node, &node->rethook->pool);
   }

Meanwhile, unwind_next_frame() in arch/x86/kernel/unwind_orc.c drops
RCU between frames while the cursor (*cur) persists across iterations:

   arch/x86/kernel/unwind_orc.c:
   bool unwind_next_frame(...)
   {
           ...
           guard(rcu)();    // RCU held for one frame
           ...
   }                        // RCU dropped here

When the unwinder calls __rethook_find_ret_addr() in the next frame
iteration, it does:

   struct llist_node *first = tsk->rethooks.first;
   ...
   *cur = first;
   ...
   node = node->next;       // node may have been recycled

If the target task returns from a probed function between frames, its
rethook_node is recycled and can be instantly reallocated to another
task.  The unwinder's stale cursor then dereferences a freed pointer,
leading to use-after-free.
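
The recycling hazard can be pictured with a small userspace toy. It is a sketch under assumptions, not the kernel's objpool: `demo_node`, `demo_pool`, the `owner` field and the LIFO free list are all hypothetical and chosen only to make the reuse deterministic. It shows that a cursor saved before a recycle can later point at a node that already belongs to someone else.

```c
#include <assert.h>
#include <stddef.h>

struct demo_node {
	struct demo_node *next;  /* free-list link */
	int owner;               /* hypothetical: id of the current user */
};

struct demo_pool {
	struct demo_node *free;  /* LIFO free list (an assumption) */
};

/* Put all @n nodes on the free list. */
void demo_pool_init(struct demo_pool *pool, struct demo_node *nodes, size_t n)
{
	pool->free = NULL;
	for (size_t i = 0; i < n; i++) {
		nodes[i].owner = -1;
		nodes[i].next = pool->free;
		pool->free = &nodes[i];
	}
}

/* Take a node for @owner, or return NULL if the pool is empty. */
struct demo_node *demo_pool_get(struct demo_pool *pool, int owner)
{
	struct demo_node *node = pool->free;

	if (node) {
		pool->free = node->next;
		node->next = NULL;
		node->owner = owner;
	}
	return node;
}

/* Return a node immediately, with no grace period of any kind. */
void demo_pool_recycle(struct demo_pool *pool, struct demo_node *node)
{
	node->owner = -1;
	node->next = pool->free;
	pool->free = node;
}

/* Does a previously saved cursor still refer to @expected_owner's node? */
int demo_cursor_valid(const struct demo_node *cursor, int expected_owner)
{
	assert(cursor != NULL);
	return cursor->owner == expected_owner;
}
```

If task 1 takes a node, a reader saves a pointer to it, task 1 recycles it and task 2 then takes a node, the reader's saved pointer now refers to task 2's node. In the toy the memory stays valid, so the stale access is only a logic error; the concern above is what the same sequence does to an unwinder that walks the list through such a pointer.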

## Reproducer

The PoC sets up a kretprobe on do_sys_openat2, creates hot-loop threads
calling open(), and concurrently reads /proc/<tid>/stack.  The race
triggers within seconds (Problem 1 above; Problem 2 may reproduce on
kernels without KASAN or with different timing).

Build:  gcc -static -pthread -o poc poc.c
Run:    ./poc [runtime_seconds]
Needs:  root, CONFIG_KASAN=y

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <sys/syscall.h>
#include <sched.h>
#include <fcntl.h>
#include <errno.h>
#include <signal.h>
#include <pthread.h>
#include <dirent.h>

#define TRACE "/sys/kernel/tracing"

volatile int stop = 0;

static int tfs(const char *f, const char *b)
{
     char p[256]; int fd, r;
     snprintf(p, 256, "%s/%s", TRACE, f);
     fd = open(p, O_WRONLY | O_TRUNC);
     if (fd < 0) {
         system("mount -t tracefs tracefs /sys/kernel/tracing 2>/dev/null");
         usleep(50000);
         fd = open(p, O_WRONLY | O_TRUNC);
     }
     if (fd < 0) return -1;
     r = write(fd, b, strlen(b));
     close(fd);
     return r < 0 ? -1 : 0;
}

void *hot_thread(void *arg)
{
     while (!__atomic_load_n(&stop, __ATOMIC_RELAXED)) {
         int fd = open("/dev/null", O_RDONLY);
         if (fd >= 0) close(fd);
     }
     return NULL;
}

void *reader_thread(void *arg)
{
     pid_t target = *(pid_t *)arg;
     char path[64], buf[8192];
     snprintf(path, 64, "/proc/%d/stack", target);
     while (!__atomic_load_n(&stop, __ATOMIC_RELAXED)) {
         int fd = open(path, O_RDONLY);
         if (fd >= 0) { read(fd, buf, 8191); close(fd); }
     }
     return NULL;
}

void sigh(int s) { stop = 1; }

int main(int argc, char *argv[])
{
     int runtime = 120;
     if (argc > 1) runtime = atoi(argv[1]);

     printf("rethook race PoC\n");
     if (geteuid()) { printf("root needed\n"); return 1; }
     signal(SIGINT, sigh);

     pthread_t hot[4], rdr[4];
     pid_t hot_tids[4];
     int pairs = 4;

     for (int c = 0; c < runtime / 5 && !stop; c++) {
         tfs("events/kprobes/myretprobe/enable", "0");
         tfs("kprobe_events", "-:myretprobe");
         usleep(100);
         tfs("kprobe_events", "r:myretprobe do_sys_openat2 $retval");
         tfs("events/kprobes/myretprobe/enable", "1");

         pid_t main_tid = syscall(SYS_gettid);

         for (int i = 0; i < pairs; i++)
             pthread_create(&hot[i], NULL, hot_thread, NULL);

         usleep(300000);

         int cnt = 0;  /* readers actually started this round */
         {
             DIR *d = opendir("/proc/self/task");
             if (d) {
                 struct dirent *de;
                 while ((de = readdir(d)) != NULL && cnt < pairs) {
                     pid_t t = atoi(de->d_name);
                     if (t > 0 && t != main_tid)
                         hot_tids[cnt++] = t;
                 }
                 closedir(d);
             }
             for (int i = 0; i < cnt; i++)
                 pthread_create(&rdr[i], NULL, reader_thread, &hot_tids[i]);
         }

         printf("round %d\n", c);
         sleep(5);

         stop = 1;
         usleep(100000);

         for (int i = 0; i < pairs; i++) pthread_join(hot[i], NULL);
         /* Join only the readers that were created above. */
         for (int i = 0; i < cnt; i++) pthread_join(rdr[i], NULL);

         stop = 0;
         usleep(1000);
     }

     tfs("events/kprobes/myretprobe/enable", "0");
     tfs("kprobe_events", "-:myretprobe");
     printf("Done\n");
     return 0;
}

## Summary

The v4 commit message claims the iteration "will not crash," but the PoC
demonstrates a reproducible KASAN panic:

1. stack-out-of-bounds in unwind_next_frame (ORC unwinder reads
    concurrently-modified stack frames of a running task)

2. Potential use-after-free in __rethook_find_ret_addr (rethook nodes
    recycled without RCU grace period, cursor persists across RCU drops)

The old task_is_running() check was racy but served as a practical
safety net.  Removing it without adding equivalent protection in the
callers (proc_pid_stack, BPF stack walkers) exposes users to kernel
panics via /proc/<pid>/stack on any task running a kretprobe.

Thanks,
Xiao


* Re: [PATCH v2 0/4] mm: split the file's i_mmap tree for NUMA
From: Lorenzo Stoakes @ 2026-06-11 16:00 UTC (permalink / raw)
  To: Huang Shijie
  Cc: akpm, viro, brauner, jack, muchun.song, osalvador, david, surenb,
	mjguzik, liam, vbabka, shakeel.butt, rppt, mhocko, corbet, skhan,
	linux, dinguyen, schuster.simon, James.Bottomley, deller, djbw,
	willy, peterz, mingo, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, james.clark,
	mhiramat, oleg, ziy, baolin.wang, npache, ryan.roberts, dev.jain,
	baohua, lance.yang, linmiaohe, nao.horiguchi, jannh, pfalcato,
	riel, harry, will, brian.ruley, rmk+kernel, dave.anglin, linux-mm,
	linux-doc, linux-kernel, linux-arm-kernel, linux-parisc,
	linux-fsdevel, nvdimm, linux-perf-users, linux-trace-kernel,
	zhongyuan, fangbaoshun, yingzhiwei
In-Reply-To: <20260611061915.2354307-1-huangsj@hygon.cn>

Hi Huang,

You seem to be replacing the file rmap altogether here, so you really ought
to have sent this as an RFC so we could discuss it as a community first.

Especially so as Pedro had publicly mentioned his plans to implement
something similar here, so coordination would have been appreciated.

Anyway, as Pedro has pointed out, the code is overly complicated, it's far
too configurable (not always a good thing), and the locking implementation
is questionable.

You seem to be adding a whole bunch of open-coded complexity too, which is
not something we want. Abstraction is key for the rmap.

You're also not adding any kdoc comments or really many comments at all,
and you've not added any tests (though perhaps it's difficult given how
core this is).

So I would suggest that perhaps any respin should be sent as an RFC so we
can engage in that conversation and ensure we're all on the same page?

Especially since Pedro plans to send an alternative, simpler, solution I
believe.

It's also not helpful that you haven't examined the non-NUMA case :)
perhaps your particular server behaves a certain way that this approach
aids, but regresses other NUMA configurations?

We'd really need to be sure of this before accepting invasive changes like
this.

Thanks, Lorenzo

On Thu, Jun 11, 2026 at 02:18:56PM +0800, Huang Shijie wrote:
>   In NUMA, there may be many NUMA nodes and many CPUs.
> For example, a Hygon's server has 12 NUMA nodes, and 384 CPUs.
> In the UnixBench tests, there is a test "execl" which tests
> the execve system call.
>
>   When we test our server with "./Run -c 384 execl",
> the test result is not good enough. The i_mmap locks contended heavily on
> "libc.so" and "ld.so". For example, the i_mmap tree for "libc.so" can have
> over 6000 VMAs, all the VMAs can be in different NUMA nodes.
> The insert/remove operations do not run quickly enough.

You really need to send detailed, statistically valid numbers across
different NUMA configurations for changes like this to be considered.

>
> patch 1 & patch 2 try to hide the direct access of i_mmap.
> patch 3 splits the i_mmap into sibling trees, each tree has separate lock,
> and we can get better performance with this patch set in our NUMA server:
>     we can get over 400% performance improvement.
>
> I did not test the non-NUMA case, since I do not have such server.

Yeah this isn't a great thing to hear :) you need to demonstrate this
doesn't regress non-NUMA machines or NUMA machines of a different
configuration.
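
For readers following the thread, the "sibling trees, each with a separate lock" idea quoted above can be pictured roughly as below. This is a userspace sketch under assumptions and is not the code from the series: `demo_mapping`, `demo_shard`, `DEMO_NR_SHARDS` and the caller-supplied shard index are hypothetical, a counter stands in for each tree, and pthread mutexes stand in for the i_mmap locks.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define DEMO_NR_SHARDS 4  /* hypothetical number of sibling trees */

struct demo_shard {
	pthread_mutex_t lock;
	unsigned long nr_items;  /* stands in for one sibling tree */
};

struct demo_mapping {
	struct demo_shard shards[DEMO_NR_SHARDS];
};

void demo_mapping_init(struct demo_mapping *m)
{
	for (size_t i = 0; i < DEMO_NR_SHARDS; i++) {
		pthread_mutex_init(&m->shards[i].lock, NULL);
		m->shards[i].nr_items = 0;
	}
}

/* An insert contends only on the one shard selected by @idx. */
void demo_insert(struct demo_mapping *m, unsigned int idx)
{
	struct demo_shard *s;

	assert(idx < DEMO_NR_SHARDS);
	s = &m->shards[idx];

	pthread_mutex_lock(&s->lock);
	s->nr_items++;
	pthread_mutex_unlock(&s->lock);
}

/*
 * A whole-mapping operation has to take every shard lock, always in
 * the same order so that two such walkers cannot deadlock.
 */
unsigned long demo_count_all(struct demo_mapping *m)
{
	unsigned long total = 0;

	for (size_t i = 0; i < DEMO_NR_SHARDS; i++)
		pthread_mutex_lock(&m->shards[i].lock);
	for (size_t i = 0; i < DEMO_NR_SHARDS; i++)
		total += m->shards[i].nr_items;
	for (size_t i = DEMO_NR_SHARDS; i-- > 0; )
		pthread_mutex_unlock(&m->shards[i].lock);

	return total;
}
```

The sketch makes the trade-off visible: inserts and removals touch one lock, while anything that needs a consistent view of the whole mapping pays for a loop over all of them, which is why measurements on other configurations matter.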

>
> v1 --> v2:
> 	Not only split the i_mmap tree, but also split the lock.
> 	v1 : https://lkml.org/lkml/2026/4/13/199
>
> Huang Shijie (4):
>   mm: use mapping_mapped to simplify the code
>   mm: use get_i_mmap_root to access the file's i_mmap
>   mm/fs: split the file's i_mmap tree
>   docs/mm: update document for split i_mmap tree
>
>  Documentation/mm/process_addrs.rst |  63 +++++++---
>  arch/arm/mm/fault-armv.c           |   3 +-
>  arch/arm/mm/flush.c                |   3 +-
>  arch/nios2/mm/cacheflush.c         |   3 +-
>  arch/parisc/kernel/cache.c         |   4 +-
>  fs/Kconfig                         |   8 ++
>  fs/dax.c                           |   3 +-
>  fs/hugetlbfs/inode.c               |  30 +++--
>  fs/inode.c                         |  75 +++++++++++-
>  include/linux/fs.h                 | 179 ++++++++++++++++++++++++++++-
>  include/linux/mm.h                 |  81 +++++++++++++
>  include/linux/mm_types.h           |   3 +
>  kernel/events/uprobes.c            |   3 +-
>  mm/hugetlb.c                       |   7 +-
>  mm/internal.h                      |   3 +-
>  mm/khugepaged.c                    |   6 +-
>  mm/memory-failure.c                |   8 +-
>  mm/memory.c                        |   8 +-
>  mm/mmap.c                          |  11 +-
>  mm/nommu.c                         |  28 +++--
>  mm/pagewalk.c                      |   4 +-
>  mm/rmap.c                          |   2 +-
>  mm/vma.c                           |  74 +++++++++---
>  mm/vma_init.c                      |   3 +
>  24 files changed, 534 insertions(+), 78 deletions(-)

This is a _lot_ of changes you're making here. It therefore feels like the
abstraction is broken somewhat?

>
> --
> 2.53.0
>
>

Thanks, Lorenzo

* Re: [PATCHv7 bpf-next 03/29] ftrace: Add add_ftrace_hash_entry function
From: Alexei Starovoitov @ 2026-06-11 17:35 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Kumar Kartikeya Dwivedi, Jiri Olsa, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, bpf, linux-trace-kernel,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	Menglong Dong
In-Reply-To: <20260611094648.04622890@gandalf.local.home>

On Thu Jun 11, 2026 at 6:46 AM PDT, Steven Rostedt wrote:
> On Wed, 10 Jun 2026 08:42:51 -0700
> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>
>> If my memory doesn't fail me you said it's fine during v1,v2 iterations.
>
> I'm fine with it, but I wanted to test it first. Did I give an Acked-by?
>
>> The last v3 - v8 you were silent, so we assumed you're still fine.
>
> When I see AI reports that say the series needs a fix, I don't bother
> looking. I thought that was the entire point of AI; to let the maintainer
> not have to review if the AI found something.

AI finds things to consider, but when they're considered and postponed
to the future it doesn't understand that and keeps reporting the same thing
every revision. So it might look like patches are landing with
outstanding AI complaints, but this is not the case.

btw since patches touch ftrace from time to time should we add your
ftrace testsuite to bpf CI ?
How automated is it?

>
>> 
>> While at it, please review Mykyta's set:
>> https://patchwork.kernel.org/user/todo/netdevbpf/?series=1096695
>> 
>> It's also been pending for almost a month now.
>
> Have a better link? I just get a blank page as "TODO" is set to what I have.

Ohh. I meant this set:
https://lore.kernel.org/bpf/CAEf4BzZFjsEv3aLktwdCZF6EXoCL+eefX+6xa3XGrhBmfO1SqA@mail.gmail.com/
where you said that you'll think more about it after pto.
It would be great to land it now for this merge window, so we have
discoverability right away, and if a better approach comes along in the
future we can adjust to it later.
 

^ permalink raw reply

* [syzbot ci] Re: mm: split the file's i_mmap tree for NUMA
From: syzbot ci @ 2026-06-11 20:24 UTC (permalink / raw)
  To: acme, adrian.hunter, akpm, alexander.shishkin, baohua,
	baolin.wang, brauner, brian.ruley, corbet, dave.anglin, david,
	deller, dev.jain, dinguyen, djbw, fangbaoshun, harry, huangsj,
	irogers, jack, james.bottomley, james.clark, jannh, jolsa,
	lance.yang, liam, linmiaohe, linux-arm-kernel, linux-doc,
	linux-fsdevel, linux-kernel, linux-mm, linux-parisc,
	linux-perf-users, linux-trace-kernel, linux, ljs, mark.rutland,
	mhiramat, mhocko, mingo, mjguzik, muchun.song, namhyung,
	nao.horiguchi, npache, nvdimm, oleg, osalvador, peterz
  Cc: syzbot, syzkaller-bugs
In-Reply-To: <20260611061915.2354307-1-huangsj@hygon.cn>

syzbot ci has tested the following series

[v2] mm: split the file's i_mmap tree for NUMA
https://lore.kernel.org/all/20260611061915.2354307-1-huangsj@hygon.cn
* [PATCH v2 1/4] mm: use mapping_mapped to simplify the code
* [PATCH v2 2/4] mm: use get_i_mmap_root to access the file's i_mmap
* [PATCH v2 3/4] mm/fs: split the file's i_mmap tree
* [PATCH v2 4/4] docs/mm: update document for split i_mmap tree

and found the following issue:
INFO: trying to register non-static key in do_one_initcall

Full report is available here:
https://ci.syzbot.org/series/a9bada61-06e7-40d5-b423-5f2d69a60209

***

INFO: trying to register non-static key in do_one_initcall

tree:      linux-next
URL:       https://kernel.googlesource.com/pub/scm/linux/kernel/git/next/linux-next
base:      14546c7bef6c1036fc82e36c1a200b0caccd339a
arch:      amd64
compiler:  Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
config:    https://ci.syzbot.org/builds/2f92f704-660a-4108-9172-7e620e10ce46/config

acpi PNP0A08:00: _OSC: platform does not support [PCIeHotplug LTR]
acpi PNP0A08:00: _OSC: OS now controls [PME AER PCIeCapability]
PCI host bridge to bus 0000:00
pci_bus 0000:00: Unknown NUMA node; performance will be reduced
pci_bus 0000:00: root bus resource [io  0x0000-0x0cf7 window]
pci_bus 0000:00: root bus resource [io  0x0d00-0xffff window]
pci_bus 0000:00: root bus resource [mem 0x000a0000-0x000bffff window]
pci_bus 0000:00: root bus resource [mem 0x80000000-0xafffffff window]
pci_bus 0000:00: root bus resource [mem 0xc0000000-0xfebfffff window]
pci_bus 0000:00: root bus resource [mem 0x240000000-0xa3fffffff window]
pci_bus 0000:00: root bus resource [bus 00-ff]
pci 0000:00:00.0: [8086:29c0] type 00 class 0x060000 conventional PCI endpoint
pci 0000:00:01.0: [1234:1111] type 00 class 0x030000 conventional PCI endpoint
pci 0000:00:01.0: BAR 0 [mem 0xfd000000-0xfdffffff pref]
pci 0000:00:01.0: BAR 2 [mem 0xfebf0000-0xfebf0fff]
pci 0000:00:01.0: ROM [mem 0xfebe0000-0xfebeffff pref]
pci 0000:00:01.0: Video device with shadowed ROM at [mem 0x000c0000-0x000dffff]
pci 0000:00:02.0: [1af4:1005] type 00 class 0x00ff00 conventional PCI endpoint
pci 0000:00:02.0: BAR 0 [io  0xc080-0xc09f]
pci 0000:00:02.0: BAR 1 [mem 0xfebf1000-0xfebf1fff]
pci 0000:00:02.0: BAR 4 [mem 0xfe000000-0xfe003fff 64bit pref]
pci 0000:00:03.0: [8086:100e] type 00 class 0x020000 conventional PCI endpoint
pci 0000:00:03.0: BAR 0 [mem 0xfebc0000-0xfebdffff]
pci 0000:00:03.0: BAR 1 [io  0xc000-0xc03f]
pci 0000:00:03.0: ROM [mem 0xfeb80000-0xfebbffff pref]
pci 0000:00:1f.0: [8086:2918] type 00 class 0x060100 conventional PCI endpoint
pci 0000:00:1f.0: quirk: [io  0x0600-0x067f] claimed by ICH6 ACPI/GPIO/TCO
pci 0000:00:1f.2: [8086:2922] type 00 class 0x010601 conventional PCI endpoint
pci 0000:00:1f.2: BAR 4 [io  0xc0a0-0xc0bf]
pci 0000:00:1f.2: BAR 5 [mem 0xfebf2000-0xfebf2fff]
pci 0000:00:1f.3: [8086:2930] type 00 class 0x0c0500 conventional PCI endpoint
pci 0000:00:1f.3: BAR 4 [io  0x0700-0x073f]
ACPI: PCI: Interrupt link LNKA configured for IRQ 10
ACPI: PCI: Interrupt link LNKB configured for IRQ 10
ACPI: PCI: Interrupt link LNKC configured for IRQ 11
ACPI: PCI: Interrupt link LNKD configured for IRQ 11
ACPI: PCI: Interrupt link LNKE configured for IRQ 10
ACPI: PCI: Interrupt link LNKF configured for IRQ 10
ACPI: PCI: Interrupt link LNKG configured for IRQ 11
ACPI: PCI: Interrupt link LNKH configured for IRQ 11
ACPI: PCI: Interrupt link GSIA configured for IRQ 16
ACPI: PCI: Interrupt link GSIB configured for IRQ 17
ACPI: PCI: Interrupt link GSIC configured for IRQ 18
ACPI: PCI: Interrupt link GSID configured for IRQ 19
ACPI: PCI: Interrupt link GSIE configured for IRQ 20
ACPI: PCI: Interrupt link GSIF configured for IRQ 21
ACPI: PCI: Interrupt link GSIG configured for IRQ 22
ACPI: PCI: Interrupt link GSIH configured for IRQ 23
iommu: Default domain type: Translated
iommu: DMA domain TLB invalidation policy: lazy mode
INFO: trying to register non-static key.
The code is fine but needs lockdep annotation, or maybe
you didn't initialize this object before use?
turning off the locking correctness validator.
CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted syzkaller #0 PREEMPT(full) 
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0xe8/0x150
 assign_lock_key+0x133/0x150
 register_lock_class+0xcc/0x2e0
 __lock_acquire+0xad/0x2cf0
 lock_acquire+0x106/0x350
 down_write+0x96/0x200
 dma_resv_lockdep+0x39c/0x660
 do_one_initcall+0x250/0x870
 do_initcall_level+0x104/0x190
 do_initcalls+0x59/0xa0
 kernel_init_freeable+0x2a6/0x3e0
 kernel_init+0x1d/0x1d0
 ret_from_fork+0x514/0xb70
 ret_from_fork_asm+0x1a/0x30
 </TASK>
------------[ cut here ]------------
DEBUG_RWSEMS_WARN_ON(sem->magic != sem): count = 0x1, magic = 0x0, owner = 0xffff888102a95940, curr 0xffff888102a95940, list not empty
WARNING: kernel/locking/rwsem.c:1405 at up_write+0x1e2/0x410, CPU#0: swapper/0/1
Modules linked in:
CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted syzkaller #0 PREEMPT(full) 
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
RIP: 0010:up_write+0x2b1/0x410
Code: c0 c0 e6 cc 8b 49 c7 c2 a0 e6 cc 8b 4c 0f 44 d0 48 8b 7c 24 10 48 c7 c6 40 e8 cc 8b 48 8b 54 24 08 48 8b 0c 24 4d 89 f9 41 52 <67> 48 0f b9 3a 48 83 c4 08 e8 21 1f 0d 03 e9 b2 fd ff ff 90 0f 0b
RSP: 0000:ffffc90000067480 EFLAGS: 00010246
RAX: ffffffff8bcce6c0 RBX: ffffc900000677d0 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffffffff8bcce840 RDI: ffffffff90338290
RBP: ffffc90000067830 R08: ffff888102a95940 R09: ffff888102a95940
R10: ffffffff8bcce6c0 R11: fffff5200000cefc R12: ffffc90000067828
R13: dffffc0000000000 R14: 1ffff9200000cf06 R15: ffff888102a95940
FS:  0000000000000000(0000) GS:ffff88818dc9e000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff88823ffff000 CR3: 000000000e74a000 CR4: 00000000000006f0
Call Trace:
 <TASK>
 dma_resv_lockdep+0x3a4/0x660
 do_one_initcall+0x250/0x870
 do_initcall_level+0x104/0x190
 do_initcalls+0x59/0xa0
 kernel_init_freeable+0x2a6/0x3e0
 kernel_init+0x1d/0x1d0
 ret_from_fork+0x514/0xb70
 ret_from_fork_asm+0x1a/0x30
 </TASK>


***

If these findings have caused you to resend the series or submit a
separate fix, please add the following tag to your commit message:
  Tested-by: syzbot@syzkaller.appspotmail.com

---
This report is generated by a bot. It may contain errors.
syzbot ci engineers can be reached at syzkaller@googlegroups.com.

To test a patch for this bug, please reply with `#syz test`
(should be on a separate line).

The patch should be attached to the email.
Note: arguments like custom git repos and branches are not supported.

^ permalink raw reply

* Re: [RFC PATCH 1/2] tracing/osnoise: Sample IPI counts
From: Crystal Wood @ 2026-06-11 20:49 UTC (permalink / raw)
  To: Valentin Schneider, Tomas Glozar
  Cc: linux-kernel, linux-trace-kernel, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Costa Shulyupin,
	Ivan Pravdin
In-Reply-To: <xhsmh33yt2wtc.mognet@vschneid-thinkpadt14sgen2i.remote.csb>

On Thu, 2026-06-11 at 12:30 +0200, Valentin Schneider wrote:
> On 11/06/26 10:59, Tomas Glozar wrote:
> > [just replying to comments, will do a full review later]
> > 
> > st 10. 6. 2026 v 21:51 odesílatel Crystal Wood <crwood@redhat.com> napsal:
> > > 
> > > On Wed, 2026-06-10 at 15:04 +0200, Valentin Schneider wrote:
> > > > Osnoise already implictly accounts IPIs via its IRQ tracking,
> > > 
> > > Does it?  It seems that IPIs bypass the kernel/irq subsystem on some
> > > arches (including x86, but not ARM).
> > > 
> > > It would be nice to solve this properly by adding generic ipi
> > > entry/exit tracing (similar to what ARM already has).
> > > 
> > 
> > Isn't that precisely what the ipi tracepoints used by this
> > implementation (ipi:ipi_send_cpu) are for?
> > 
> 
> Well, these catch the emission of the IPI, which is great for investigation
> - slap a stacktrace trigger and you (most of the time) get the source of
> your interference.
> 
> However Crystal's point is that on x86 (and I assume other archs) receiving
> & handling these IPIs is "special" and doesn't go through the generic irq
> subsystem and thus has to be tracked separately, which is why osnoise has
> this fairly lengthy osnoise_arch_register() thing.

Oh, I missed the arch hook.  I feel better now :-)

(I'd feel better if it didn't rely on osnoise-specific arch code being
updated to match if some new interrupt path pops up, but oh well.)


> > 
> > > > Alternatively I can have this be purely supported in userspace osnoise by
> > > > hooking into the IPI events and counting IPIs separately from the osnoise
> > > > events.
> > > 
> > > One benefit I could see of doing this in kernel osnoise would be if you
> > > could atomically correlate the count with the particular noise
> > > interval, but this patch doesn't do that.
> > > 
> > 
> > The count is already reported by cycle on the kernel side in the
> > patchset, right? It's only missing in the current RTLA (userspace)
> > part, as there is no statistic using the information. But it can still
> > be collected through custom histogram triggers.

Not sure I follow... this patchset reports a count of IPIs, not cycle
info, but the count is based on when the IPIs were sent, not received. 
The IPI send events capture cycle info, but that's not what this
patchset adds.

I'm not sure that it really matters though.  I had been thinking of this
more like the interference count, which is atomic with respect to a
single noise (and thus the sender of the noise would be outside that
window).  But this count is reported over the entire osnoise sample
period, so a little slop is probably OK.

-Crystal

> 


^ permalink raw reply

* Re: [PATCH v4] rethook: Remove the running task check in rethook_find_ret_addr()
From: Tengda Wu @ 2026-06-12  1:54 UTC (permalink / raw)
  To: XIAO WU, sashiko-reviews, Masami Hiramatsu, Petr Mladek,
	Peter Zijlstra
  Cc: Mathieu Desnoyers, Alexei Starovoitov, Steven Rostedt,
	linux-kernel, linux-trace-kernel, live-patching
In-Reply-To: <tencent_3D17DC5BE32C8A51D938AF50F221321F6206@qq.com>

Hi Xiao,

Thank you very much for your detailed analysis and verification.

On 2026/6/11 23:53, XIAO WU wrote:
> Hi Tengda,
> 
> Sashiko [1] reviewed this patch and found that removing the
> task_is_running() check exposes stack unwinders to real crashes — not
> just "invalid information."  A PoC confirms this: a KASAN panic triggers
> within seconds when /proc/<pid>/stack reads the stack of a task that is
> concurrently running a kretprobe.
> 
> [1] https://sashiko.dev/#/patchset/20260610013658.1837963-1-wutengda%40huaweicloud.com
> 
>> diff --git a/kernel/trace/rethook.c b/kernel/trace/rethook.c
>> index 5a8bdf88999a..1e7fdebe3cd5 100644
>> --- a/kernel/trace/rethook.c
>> +++ b/kernel/trace/rethook.c
>> @@ -250,9 +251,6 @@ unsigned long rethook_find_ret_addr(struct task_struct *tsk, unsigned long frame
>>      if (WARN_ON_ONCE(!cur))
>>          return 0;
>>
>> -    if (tsk != current && task_is_running(tsk))
>> -        return 0;
>> -
>>      do {
>>          ret = __rethook_find_ret_addr(tsk, cur);
>>          if (!ret)
> 
> The commit message states:
> 
>> The iteration is already safe from crashes because
>> unwind_next_frame() holds RCU and rethook_node structures are
>> RCU-freed; even if the iteration goes off the rails and returns
>> invalid information, it will not crash.
> 
> There are two problems with this claim, both reproducible.
> 
> **Problem 1: stack-out-of-bounds in unwind_next_frame itself**
> 
> The PoC below reliably triggers the following KASAN panic — not in the
> rethook list traversal, but inside unwind_next_frame():
> 
> [ 1833.494623] BUG: KASAN: stack-out-of-bounds in unwind_next_frame+0x861/0x2080
> [ 1833.494651] Read of size 2 at addr ffffc90003e6f5f0 by task poc/9854
> [ 1833.494707] Call Trace:
> [ 1833.494719]  dump_stack_lvl+0x116/0x1f0
> [ 1833.494743]  print_report+0xf4/0x600
> [ 1833.494788]  kasan_report+0xe0/0x110
> [ 1833.494836]  unwind_next_frame+0x861/0x2080
> [ 1833.494948]  arch_stack_walk+0x99/0x100
> [ 1833.495000]  stack_trace_save_tsk+0x16a/0x200
> [ 1833.495054]  proc_pid_stack+0x173/0x2b0
> [ 1833.495103]  seq_read_iter+0x519/0x12d0
> [ 1833.495166]  seq_read+0x3b7/0x590
> [ 1833.495297]  vfs_read+0x1f5/0xd20
> [ 1833.495497]  ksys_read+0x135/0x250
> [ 1833.495549]  do_syscall_64+0x129/0x850
> [ 1833.495566]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
> [ 1833.498894] Kernel panic - not syncing: KASAN: panic_on_warn set ...
> 
> page last free pid 9737 tgid 9737 stack trace:
>  do_sys_openat2+0xbf/0x260          <-- target task inside kretprobe
>  __x64_sys_openat+0x179/0x210
> 
> This crash has nothing to do with rethook_node lifetimes or RCU.  It
> happens because the ORC unwinder reads stack memory while the target
> task concurrently executes a kretprobe trampoline that modifies return
> addresses.  The unwinder follows corrupted frame data past valid stack
> boundaries.  RCU protection of rethook_node structures is irrelevant —
> this crash occurs at the stack frame interpretation level, before any
> rethook list traversal.
> 
> The old task_is_running() check prevented the unwinder from attempting
> to unwind a running task's stack in the first place.
> 

Yes, I was able to reproduce the issue locally using your PoC. The problem
does exist as you described. I need to take a deeper look and figure out how
to properly fix it.

> **Problem 2: use-after-free via rethook_node recycling**
> 
> Even if the stack-out-of-bounds above were addressed, a second crash
> path exists in the rethook list traversal itself.
> 
> rethook_recycle() immediately pushes nodes back to the objpool without
> an RCU grace period:
> 
>   kernel/trace/rethook.c:
>   void rethook_recycle(struct rethook_node *node)
>   {
>           ...
>           objpool_push(node, &node->rethook->pool);
>   }
> 
> Meanwhile, unwind_next_frame() in arch/x86/kernel/unwind_orc.c drops
> RCU between frames while the cursor (*cur) persists across iterations:
> 
>   arch/x86/kernel/unwind_orc.c:
>   bool unwind_next_frame(...)
>   {
>           ...
>           guard(rcu)();    // RCU held for one frame
>           ...
>   }                        // RCU dropped here
> 
> When the unwinder calls __rethook_find_ret_addr() in the next frame
> iteration, it does:
> 
>   struct llist_node *first = tsk->rethooks.first;
>   ...
>   *cur = first;
>   ...
>   node = node->next;       // node may have been recycled
> 
> If the target task returns from a probed function between frames, its
> rethook_node is recycled and can be instantly reallocated to another
> task.  The unwinder's stale cursor then dereferences a freed pointer,
> leading to use-after-free.
> 

Yes, Sashiko also pointed this out. You have opened this issue for further
analysis to clarify its fault model. It appears to be a pre-existing issue
that may require a separate patch to resolve.
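
For readers following along, the stale-cursor hazard in Problem 2 can be
shown with a small user-space sketch. Everything in it (struct pool_node,
struct cursor, the gen counter) is hypothetical and is neither the rethook
code nor a proposed fix. It only shows how a saved cursor can end up pointing
at a node that was recycled to another owner, and how a generation counter
would make that detectable. It is single-threaded; a real concurrent version
would also need proper memory ordering.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical pool object; "gen" is bumped every time it is recycled. */
struct pool_node {
	struct pool_node *next;
	unsigned int gen;
	int owner;		/* stand-in for the task that holds the node */
};

/* A cursor remembers both the node and the generation it observed. */
struct cursor {
	struct pool_node *node;
	unsigned int gen;
};

static void cursor_save(struct cursor *c, struct pool_node *n)
{
	c->node = n;
	c->gen = n->gen;
}

/* Recycling hands the same memory to a new owner without any grace period. */
static void node_recycle(struct pool_node *n, int new_owner)
{
	n->gen++;
	n->owner = new_owner;
	n->next = NULL;
}

/* The cursor is only trustworthy if the node was not recycled since. */
static bool cursor_valid(const struct cursor *c)
{
	return c->node != NULL && c->node->gen == c->gen;
}
```

Without such a check, the pointer in the cursor still looks valid after
node_recycle(), which is the situation described for *cur above.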

> ## Reproducer
> 
> The PoC sets up a kretprobe on do_sys_openat2, creates hot-loop threads
> calling open(), and concurrently reads /proc/<tid>/stack.  The race
> triggers within seconds (Problem 1 above; Problem 2 may reproduce on
> kernels without KASAN or with different timing).
> 
> Build:  gcc -static -pthread -o poc poc.c
> Run:    ./poc [runtime_seconds]
> Needs:  root, CONFIG_KASAN=y
> 
> #define _GNU_SOURCE
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <unistd.h>
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <sys/wait.h>
> #include <sys/syscall.h>
> #include <sched.h>
> #include <fcntl.h>
> #include <errno.h>
> #include <signal.h>
> #include <pthread.h>
> #include <dirent.h>
> 
> #define TRACE "/sys/kernel/tracing"
> 
> volatile int stop = 0;
> 
> static int tfs(const char *f, const char *b)
> {
>     char p[256]; int fd, r;
>     snprintf(p, 256, "%s/%s", TRACE, f);
>     fd = open(p, O_WRONLY | O_TRUNC);
>     if (fd < 0) {
>         system("mount -t tracefs tracefs /sys/kernel/tracing 2>/dev/null");
>         usleep(50000);
>         fd = open(p, O_WRONLY | O_TRUNC);
>     }
>     if (fd < 0) return -1;
>     r = write(fd, b, strlen(b));
>     close(fd);
>     return r < 0 ? -1 : 0;
> }
> 
> void *hot_thread(void *arg)
> {
>     while (!__atomic_load_n(&stop, __ATOMIC_RELAXED)) {
>         int fd = open("/dev/null", O_RDONLY);
>         if (fd >= 0) close(fd);
>     }
>     return NULL;
> }
> 
> void *reader_thread(void *arg)
> {
>     pid_t target = *(pid_t *)arg;
>     char path[64], buf[8192];
>     snprintf(path, 64, "/proc/%d/stack", target);
>     while (!__atomic_load_n(&stop, __ATOMIC_RELAXED)) {
>         int fd = open(path, O_RDONLY);
>         if (fd >= 0) { read(fd, buf, 8191); close(fd); }
>     }
>     return NULL;
> }
> 
> void sigh(int s) { stop = 1; }
> 
> int main(int argc, char *argv[])
> {
>     int runtime = 120;
>     if (argc > 1) runtime = atoi(argv[1]);
> 
>     printf("rethook race PoC\n");
>     if (geteuid()) { printf("root needed\n"); return 1; }
>     signal(SIGINT, sigh);
> 
>     pthread_t hot[4], rdr[4];
>     pid_t hot_tids[4];
>     int pairs = 4;
> 
>     for (int c = 0; c < runtime / 5 && !stop; c++) {
>         tfs("events/kprobes/myretprobe/enable", "0");
>         tfs("kprobe_events", "-:myretprobe");
>         usleep(100);
>         tfs("kprobe_events", "r:myretprobe do_sys_openat2 $retval");
>         tfs("events/kprobes/myretprobe/enable", "1");
> 
>         pid_t main_tid = syscall(SYS_gettid);
> 
>         for (int i = 0; i < pairs; i++)
>             pthread_create(&hot[i], NULL, hot_thread, NULL);
> 
>         usleep(300000);
> 
>         {
>             DIR *d = opendir("/proc/self/task");
>             int cnt = 0;
>             if (d) {
>                 struct dirent *de;
>                 while ((de = readdir(d)) != NULL && cnt < pairs) {
>                     pid_t t = atoi(de->d_name);
>                     if (t > 0 && t != main_tid)
>                         hot_tids[cnt++] = t;
>                 }
>                 closedir(d);
>             }
>             for (int i = 0; i < cnt; i++)
>                 pthread_create(&rdr[i], NULL, reader_thread, &hot_tids[i]);
>         }
> 
>         printf("round %d\n", c);
>         sleep(5);
> 
>         stop = 1;
>         usleep(100000);
> 
>         for (int i = 0; i < pairs; i++) pthread_join(hot[i], NULL);
>         for (int i = 0; i < pairs; i++) pthread_join(rdr[i], NULL);
> 
>         stop = 0;
>         usleep(1000);
>     }
> 
>     tfs("events/kprobes/myretprobe/enable", "0");
>     tfs("kprobe_events", "-:myretprobe");
>     printf("Done\n");
>     return 0;
> }
> 
> ## Summary
> 
> The v4 commit message claims the iteration "will not crash," but the PoC
> demonstrates a reproducible KASAN panic:
> 
> 1. stack-out-of-bounds in unwind_next_frame (ORC unwinder reads
>    concurrently-modified stack frames of a running task)
> 
> 2. Potential use-after-free in __rethook_find_ret_addr (rethook nodes
>    recycled without RCU grace period, cursor persists across RCU drops)
> 
> The old task_is_running() check was racy but served as a practical
> safety net.  Removing it without adding equivalent protection in the
> callers (proc_pid_stack, BPF stack walkers) exposes users to kernel
> panics via /proc/<pid>/stack on any task running a kretprobe.

Once again, I truly appreciate your thorough review and testing.
I'm not sure if I can fully resolve these issues, but if I succeed, I will
send out a v5.

Best regards,
Tengda


^ permalink raw reply

* Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: Zenghui Yu @ 2026-06-12  5:09 UTC (permalink / raw)
  To: Gregory Price
  Cc: lsf-pc, linux-kernel, linux-cxl, cgroups, linux-mm,
	linux-trace-kernel, damon, kernel-team, guanhao.wang
In-Reply-To: <ainFROZ3WrGioyuY@gourry-fedora-PF4VCD3F>

[ trim the Cc list ]

Hi Gregory,

On 2026/6/11 4:12, Gregory Price wrote:

> I will still probably send the next RFC version tomorrow or friday,
> as I want to get some eyes on the __GFP_PRIVATE-less pattern.

Could you please Cc me on the next version? I would appreciate it and would
be happy to follow this work.

Thanks,
Zenghui

^ permalink raw reply

* Re: [PATCH v2 1/4] mm: use mapping_mapped to simplify the code
From: Huang Shijie @ 2026-06-12  6:03 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, viro, brauner, jack, muchun.song, osalvador, david, surenb,
	mjguzik, liam, vbabka, shakeel.butt, rppt, mhocko, corbet, skhan,
	linux, dinguyen, schuster.simon, James.Bottomley, deller, djbw,
	willy, peterz, mingo, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, james.clark,
	mhiramat, oleg, ziy, baolin.wang, npache, ryan.roberts, dev.jain,
	baohua, lance.yang, linmiaohe, nao.horiguchi, jannh, pfalcato,
	riel, harry, will, brian.ruley, rmk+kernel, dave.anglin, linux-mm,
	linux-doc, linux-kernel, linux-arm-kernel, linux-parisc,
	linux-fsdevel, nvdimm, linux-perf-users, linux-trace-kernel,
	zhongyuan, fangbaoshun, yingzhiwei
In-Reply-To: <airZn524Ip8VsWra@lucifer>

Hi Lorenzo & Pedro,
On Thu, Jun 11, 2026 at 04:52:54PM +0100, Lorenzo Stoakes wrote:
> On Thu, Jun 11, 2026 at 02:18:57PM +0800, Huang Shijie wrote:
> > Use mapping_mapped() to simplify the code, make
> > the code tidy and clean.
> >
> > Signed-off-by: Huang Shijie <huangsj@hygon.cn>
> 
> Yeah as Pedro said this one could just be sent separately, and I in fact
> suggest you do that :) So:
> 
Thank you Pedro and Lorenzo.
I can send a separate patch later.

Thanks
Huang Shijie


^ permalink raw reply

* Re: [PATCH v2 3/4] mm/fs: split the file's i_mmap tree
From: Huang Shijie @ 2026-06-12  6:44 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: akpm, viro, brauner, jack, muchun.song, osalvador, david, surenb,
	mjguzik, liam, ljs, vbabka, shakeel.butt, rppt, mhocko, corbet,
	skhan, linux, dinguyen, schuster.simon, James.Bottomley, deller,
	djbw, willy, peterz, mingo, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, james.clark,
	mhiramat, oleg, ziy, baolin.wang, npache, ryan.roberts, dev.jain,
	baohua, lance.yang, linmiaohe, nao.horiguchi, jannh, riel, harry,
	will, brian.ruley, rmk+kernel, dave.anglin, linux-mm, linux-doc,
	linux-kernel, linux-arm-kernel, linux-parisc, linux-fsdevel,
	nvdimm, linux-perf-users, linux-trace-kernel, zhongyuan,
	fangbaoshun, yingzhiwei
In-Reply-To: <aiqFgGbIo1Psy3pI@pedro-suse.lan>

On Thu, Jun 11, 2026 at 12:11:27PM +0100, Pedro Falcato wrote:
> Hi,
> 
> On Thu, Jun 11, 2026 at 02:18:59PM +0800, Huang Shijie wrote:
> > In the UnixBench tests, there is a test "execl" which tests
> > the execve system call.
> >   For example, a Hygon's server has 12 NUMA nodes, and 384 CPUs.
> > When we test our server with "./Run -c 384 execl",
> > the test result is not good enough. The i_mmap locks contended heavily on
> > "libc.so" and "ld.so". The i_mmap tree for "libc.so" can be
> > over 6000 VMAs, all the VMAs can be in different NUMA mode. The insert/remove
> > operations do not run quickly enough.
> 
> I _really_ would have appreciated some coordination here, because I said I was
> going to take a look at it. I have something that I think is much simpler
Okay, no problem. 

I waited for more than a month and thought you were busy with other
things, so I spent more than a week finishing patch set v2.


> in practice. These patches are also way too complex to be dropped just before
> the merge window.
> 
> Some comments:
> 
> > 
> >  In order to reduce the competition of the i_mmap lock, this patch does
> > following:
> >    1.) Split the single i_mmap tree into several sibling trees:
> >        Each tree has a lock. The CONFIG_SPLIT_I_MMAP is used to
> >        turn on/off this feature.
> 
> There is no need for a config option. This needs to Just Work.
> 
> >    2.) Introduce a new field "tree_idx" for vm_area_struct to save the
> >        sibling tree index for this VMA.
> 
> This is possibly contentious, but there are holes in vm_area_struct.
> So I think this is fine.
> 
> >    3.) Introduce a new field "vma_count" for address_space.
> >        The new mapping_mapped() will use it.
> >    4.) Rewrite the vma_interval_tree_foreach()
> >    5.) Rewrite the lock functions.	
> > 
> >  After this patch, the VMA insert/remove operations will work faster,
> > and we can get over 400% performance improvement with the above test.
> > 
> > Signed-off-by: Huang Shijie <huangsj@hygon.cn>
> > ---
> >  fs/Kconfig               |   8 ++
> >  fs/hugetlbfs/inode.c     |  20 ++++-
> >  fs/inode.c               |  75 ++++++++++++++++-
> >  include/linux/fs.h       | 174 ++++++++++++++++++++++++++++++++++++++-
> >  include/linux/mm.h       |  80 ++++++++++++++++++
> >  include/linux/mm_types.h |   3 +
> >  mm/internal.h            |   3 +-
> >  mm/mmap.c                |  11 ++-
> >  mm/nommu.c               |  23 ++++--
> >  mm/pagewalk.c            |   2 +-
> >  mm/vma.c                 |  72 +++++++++++-----
> >  mm/vma_init.c            |   3 +
> >  12 files changed, 436 insertions(+), 38 deletions(-)
> > 
> > diff --git a/fs/Kconfig b/fs/Kconfig
> > index 43cb06de297f..e24804f70432 100644
> > --- a/fs/Kconfig
> > +++ b/fs/Kconfig
> > @@ -9,6 +9,14 @@ menu "File systems"
> >  config DCACHE_WORD_ACCESS
> >         bool
> >  
> > +config SPLIT_I_MMAP
> > +	bool "Split the file's i_mmap to several trees"
> > +	default n
> > +	help
> > +	  Split the file's i_mmap to several trees, each tree has a separate
> > +	  lock. This will reduce the lock contention of file's i_mmap tree,
> > +	  but it will cost more memory for per inode.
> > +
> >  config VALIDATE_FS_PARSER
> >  	bool "Validate filesystem parameter description"
> >  	help
> > diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> > index da5b41ea5bdd..68d8308418dd 100644
> > --- a/fs/hugetlbfs/inode.c
> > +++ b/fs/hugetlbfs/inode.c
> > @@ -891,6 +891,23 @@ static struct inode *hugetlbfs_get_root(struct super_block *sb,
> >   */
> >  static struct lock_class_key hugetlbfs_i_mmap_rwsem_key;
> >  
> > +#ifdef CONFIG_SPLIT_I_MMAP
> > +static void hugetlbfs_lockdep_set_class(struct address_space *mapping)
> > +{
> > +	int i;
> > +
> > +	for (i = 0; i < split_tree_num; i++) {
> > +		lockdep_set_class(&mapping->i_mmap[i].rwsem,
> > +				&hugetlbfs_i_mmap_rwsem_key);
> > +	}
> > +}
> > +#else
> > +static void hugetlbfs_lockdep_set_class(struct address_space *mapping)
> > +{
> > +	lockdep_set_class(&mapping->i_mmap_rwsem, &hugetlbfs_i_mmap_rwsem_key);
> > +}
> > +#endif
> > +
> >  static struct inode *hugetlbfs_get_inode(struct super_block *sb,
> >  					struct mnt_idmap *idmap,
> >  					struct inode *dir,
> > @@ -915,8 +932,7 @@ static struct inode *hugetlbfs_get_inode(struct super_block *sb,
> >  
> >  		inode->i_ino = get_next_ino();
> >  		inode_init_owner(idmap, inode, dir, mode);
> > -		lockdep_set_class(&inode->i_mapping->i_mmap_rwsem,
> > -				&hugetlbfs_i_mmap_rwsem_key);
> > +		hugetlbfs_lockdep_set_class(inode->i_mapping);
> >  		inode->i_mapping->a_ops = &hugetlbfs_aops;
> >  		simple_inode_init_ts(inode);
> >  		info->resv_map = resv_map;
> > diff --git a/fs/inode.c b/fs/inode.c
> > index 62c579a0cf7d..cb67ae83f5b3 100644
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -214,6 +214,70 @@ static int no_open(struct inode *inode, struct file *file)
> >  	return -ENXIO;
> >  }
> >  
> > +#ifdef CONFIG_SPLIT_I_MMAP
> > +int split_tree_num;
> > +static int split_tree_align __maybe_unused = 32;
> > +
> > +static void __init init_split_tree_num(void)
> > +{
> > +#ifdef CONFIG_NUMA
> > +	split_tree_num = nr_node_ids;
> > +#else
> > +	split_tree_num = ALIGN(nr_cpu_ids, split_tree_align);
> > +#endif
> > +}
> 
> Again, too configurable. I think you're too stuck up on the NUMA case -

If you do not take NUMA into account, the performance will _NOT_ improve
on our NUMA server. I have tested code which ignores NUMA, and I got bad
performance. Avoiding remote access is very important for a NUMA server.

> which does not matter for many people - and may actively harm NUMA users. If
> I have a 128 core 2 NUMA node system, what should I shard by?
It is easy to extend the tree number for NUMA. :)

For the 128 core 2 NUMA, we can extend to more trees, such as:
   Two trees for each NUMA node.
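
A minimal sketch of what "two trees for each NUMA node" could look like as
an index calculation. This is not taken from the patch: struct shard_layout,
trees_per_node and shard_index() are hypothetical names, and spreading CPUs
over a node's trees by a simple modulo is only one possible choice.

```c
#include <stddef.h>

/* Hypothetical layout: every NUMA node owns trees_per_node sibling trees. */
struct shard_layout {
	unsigned int nr_nodes;		/* stands in for nr_node_ids */
	unsigned int trees_per_node;	/* assumed tunable, e.g. 2 */
};

/* Total number of sibling trees for this layout. */
static unsigned int shard_total(const struct shard_layout *l)
{
	return l->nr_nodes * l->trees_per_node;
}

/*
 * Pick a tree for a VMA created on (node, cpu). The trees of one node are
 * contiguous, so a lookup never leaves the local node's range. The caller
 * is assumed to pass node < nr_nodes.
 */
static unsigned int shard_index(const struct shard_layout *l,
				unsigned int node, unsigned int cpu)
{
	return node * l->trees_per_node + cpu % l->trees_per_node;
}
```

With nr_nodes = 2 and trees_per_node = 2, a 128 core machine would get four
trees, and the CPUs of each node would alternate between that node's two trees.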

> 
> > +
> > +static void free_mapping_i_mmap(struct address_space *mapping)
> > +{
> > +	int i;
> > +
> > +	if (!mapping->i_mmap)
> > +		return;
> > +
> > +	for (i = 0; i < split_tree_num; i++)
> > +		kfree(mapping->i_mmap[i]);
> > +
> > +	kfree(mapping->i_mmap);
> > +	mapping->i_mmap = NULL;
> > +}
> > +
> > +static int init_mapping_i_mmap(struct address_space *mapping, gfp_t gfp)
> > +{
> > +	struct i_mmap_tree *tree;
> > +	int i;
> > +
> > +	/* The extra one is used as terminator in vma_interval_tree_foreach() */
> > +	mapping->i_mmap = kzalloc(sizeof(tree) * (split_tree_num + 1), gfp);
> > +	if (!mapping->i_mmap)
> > +		return -ENOMEM;
> > +
> > +	for (i = 0; i < split_tree_num; i++) {
> > +		tree = kzalloc_node(sizeof(*tree), gfp, i);
> > +		if (!tree)
> > +			goto nomem;
> > +
> > +		tree->root = RB_ROOT_CACHED;
> > +		init_rwsem(&tree->rwsem);
> 
> This (as-is) should blow up with lockdep + the locking loops down there.
okay, I will check it later.

thanks a lot.
> 
> > +
> > +		mapping->i_mmap[i] = tree;
> > +	}
> > +	return 0;
> > +nomem:
> > +	free_mapping_i_mmap(mapping);
> > +	return -ENOMEM;
> > +}
> 
> Honestly, it's likely that a simple static array in struct address_space
The array size is not fixed, so we cannot add a static array in address_space.
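
To illustrate the run-time-sized layout, here is a user-space sketch of a
pointer array with one extra NULL slot used as the iteration terminator,
following the comment in init_mapping_i_mmap() above. struct tree and the
trees_*() helpers are hypothetical stand-ins, and calloc()/free() replace
the kernel allocators.

```c
#include <stdlib.h>

/* Hypothetical stand-in for struct i_mmap_tree. */
struct tree {
	int id;
};

/* Allocate n trees plus one NULL slot that terminates iteration. */
static struct tree **trees_alloc(size_t n)
{
	struct tree **v = calloc(n + 1, sizeof(*v));
	size_t i;

	if (!v)
		return NULL;

	for (i = 0; i < n; i++) {
		v[i] = calloc(1, sizeof(*v[i]));
		if (!v[i])
			goto nomem;
		v[i]->id = (int)i;
	}
	return v;
nomem:
	/* Free only the entries that were successfully allocated. */
	while (i--)
		free(v[i]);
	free(v);
	return NULL;
}

/* Count the trees by walking until the NULL terminator. */
static size_t trees_count(struct tree **v)
{
	size_t n = 0;

	while (v[n])
		n++;
	return n;
}

/* Release every tree, then the pointer array itself. */
static void trees_free(struct tree **v)
{
	struct tree **p;

	if (!v)
		return;
	for (p = v; *p; p++)
		free(*p);
	free(v);
}
```

The walkers never need to know the array length, which is what lets the
number of trees be decided at boot.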

> suffices. I would not go through the trouble of getting everything very
> tight and NUMA correct.
> 
> > +#else
> > +static int init_mapping_i_mmap(struct address_space *mapping, gfp_t gfp)
> > +{
> > +	mapping->i_mmap = RB_ROOT_CACHED;
> > +	init_rwsem(&mapping->i_mmap_rwsem);
> > +	return 0;
> > +}
> > +
> > +static void free_mapping_i_mmap(struct address_space *mapping) { }
> > +static void __init init_split_tree_num(void) {}
> > +#endif
> > +
> >  /**
> >   * inode_init_always_gfp - perform inode structure initialisation
> >   * @sb: superblock inode belongs to
> > @@ -302,9 +366,14 @@ int inode_init_always_gfp(struct super_block *sb, struct inode *inode, gfp_t gfp
> >  #endif
> >  	inode->i_flctx = NULL;
> >  
> > -	if (unlikely(security_inode_alloc(inode, gfp)))
> > +	if (init_mapping_i_mmap(mapping, gfp))
> >  		return -ENOMEM;
> >  
> > +	if (unlikely(security_inode_alloc(inode, gfp))) {
> > +		free_mapping_i_mmap(mapping);
> > +		return -ENOMEM;
> > +	}
> > +
> >  	this_cpu_inc(nr_inodes);
> >  
> >  	return 0;
> > @@ -380,6 +449,7 @@ void __destroy_inode(struct inode *inode)
> >  	if (inode->i_default_acl && !is_uncached_acl(inode->i_default_acl))
> >  		posix_acl_release(inode->i_default_acl);
> >  #endif
> > +	free_mapping_i_mmap(&inode->i_data);
> >  	this_cpu_dec(nr_inodes);
> >  }
> >  EXPORT_SYMBOL(__destroy_inode);
> > @@ -480,9 +550,7 @@ EXPORT_SYMBOL(inc_nlink);
> >  static void __address_space_init_once(struct address_space *mapping)
> >  {
> >  	xa_init_flags(&mapping->i_pages, XA_FLAGS_LOCK_IRQ | XA_FLAGS_ACCOUNT);
> > -	init_rwsem(&mapping->i_mmap_rwsem);
> >  	spin_lock_init(&mapping->i_private_lock);
> > -	mapping->i_mmap = RB_ROOT_CACHED;
> >  }
> >  
> >  void address_space_init_once(struct address_space *mapping)
> > @@ -2619,6 +2687,7 @@ void __init inode_init(void)
> >  					&i_hash_mask,
> >  					0,
> >  					0);
> > +	init_split_tree_num();
> >  }
> >  
> >  void init_special_inode(struct inode *inode, umode_t mode, dev_t rdev)
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index cd46615b8f53..f4b3645b61df 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -450,6 +450,25 @@ struct mapping_metadata_bhs {
> >  	struct list_head list;	/* The list of bhs (b_assoc_buffers) */
> >  };
> >  
> > +#ifdef CONFIG_SPLIT_I_MMAP
> > +/*
> > + * struct i_mmap_tree - A single sibling tree of the file's split i_mmap.
> > + * @root: The red/black interval tree root.
> > + * @rwsem: Protects insert/remove operations on this sibling tree.
> > + * @vma_count: Number of VMAs in this sibling tree.
> > + *
> > + * When CONFIG_SPLIT_I_MMAP is enabled, the file's single i_mmap tree is
> > + * split into split_tree_num sibling trees, each with its own lock. This
> > + * reduces lock contention by allowing concurrent VMA insert/remove
> > + * operations on different sibling trees.
> > + */
> > +struct i_mmap_tree {
> > +	struct rb_root_cached	root;
> > +	struct rw_semaphore	rwsem;
> > +	atomic_t		vma_count;
> 
> I don't see what you need this vma_count for? I get the one in address_space,
> but this one does not seem useful.
For the non-NUMA case, we can use it to decide which tree the new VMA should
go into.
Round-robin is not good enough for a dynamic system.

> 
> > +};
> > +#endif
> > +
> >  /**
> >   * struct address_space - Contents of a cacheable, mappable object.
> >   * @host: Owner, either the inode or the block_device.
> > @@ -461,8 +480,13 @@ struct mapping_metadata_bhs {
> >   * @gfp_mask: Memory allocation flags to use for allocating pages.
> >   * @i_mmap_writable: Number of VM_SHARED, VM_MAYWRITE mappings.
> >   * @nr_thps: Number of THPs in the pagecache (non-shmem only).
> > - * @i_mmap: Tree of private and shared mappings.
> > - * @i_mmap_rwsem: Protects @i_mmap and @i_mmap_writable.
> > + * @i_mmap: Tree of private and shared mappings. When CONFIG_SPLIT_I_MMAP
> > + *   is enabled, this is an array of split_tree_num struct i_mmap_tree
> > + *   pointers (plus a NULL terminator).
> 
> NULL terminator wastes more memory, so I would really strongly avoid it as
> well.
Do you have a better idea?
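
One possible direction, sketched here in userspace C rather than kernel code:
keep the tree count in the iterator state, so the array needs no NULL
terminator. The names struct tree, struct tree_iter, tree_iter_next() and
count_items() are hypothetical and do not come from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct i_mmap_tree; only a payload is kept. */
struct tree {
	int nr_items;
};

/* Iterator state: the array, its length, and the next slot to visit. */
struct tree_iter {
	struct tree **trees;
	size_t nr;
	size_t pos;
};

static void tree_iter_init(struct tree_iter *it, struct tree **trees, size_t nr)
{
	it->trees = trees;
	it->nr = nr;
	it->pos = 0;
}

/* Return the next non-empty tree, or NULL once all nr slots were visited. */
static struct tree *tree_iter_next(struct tree_iter *it)
{
	while (it->pos < it->nr) {
		struct tree *t = it->trees[it->pos++];

		if (t->nr_items > 0)
			return t;
	}
	return NULL;
}

/* Sum the items of all trees using the bounded iterator. */
static int count_items(struct tree **trees, size_t nr)
{
	struct tree_iter it;
	struct tree *t;
	int total = 0;

	tree_iter_init(&it, trees, nr);
	while ((t = tree_iter_next(&it)) != NULL)
		total += t->nr_items;
	return total;
}
```

The loop bound comes from the iterator, so the array holds exactly nr
pointers and a walk over zero trees simply returns nothing.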

> 
> > + * @vma_count: Total number of VMAs across all sibling trees (only when
> > + *   CONFIG_SPLIT_I_MMAP is enabled). Used by mapping_mapped().
> > + * @i_mmap_rwsem: Protects @i_mmap and @i_mmap_writable (only when
> > + *   CONFIG_SPLIT_I_MMAP is disabled; otherwise per-tree rwsem is used).
> 
> So, there are very good reasons why you still need an i_mmap_rwsem protecting
> state, even with split mmap trees. Which I'll go into later.
> 
> >   * @nrpages: Number of page entries, protected by the i_pages lock.
> >   * @writeback_index: Writeback starts here.
> >   * @a_ops: Methods.
> > @@ -480,14 +504,19 @@ struct address_space {
> >  	/* number of thp, only for non-shmem files */
> >  	atomic_t		nr_thps;
> >  #endif
> > +#ifdef CONFIG_SPLIT_I_MMAP
> > +	struct i_mmap_tree	**i_mmap;
> > +	atomic_t		vma_count;
> > +#else
> >  	struct rb_root_cached	i_mmap;
> > +	struct rw_semaphore	i_mmap_rwsem;
> > +#endif
> >  	unsigned long		nrpages;
> >  	pgoff_t			writeback_index;
> >  	const struct address_space_operations *a_ops;
> >  	unsigned long		flags;
> >  	errseq_t		wb_err;
> >  	spinlock_t		i_private_lock;
> > -	struct rw_semaphore	i_mmap_rwsem;
> 
> See d3b1a9a778e1 ("fs/address_space: move i_mmap_rwsem to mitigate a false sharing with i_mmap.")
Got it.
> 
> >  } __attribute__((aligned(sizeof(long)))) __randomize_layout;
> >  	/*
> >  	 * On most architectures that alignment is already the case; but
> > @@ -508,6 +537,133 @@ static inline bool mapping_tagged(const struct address_space *mapping, xa_mark_t
> >  	return xa_marked(&mapping->i_pages, tag);
> >  }
> >  
> > +#ifdef CONFIG_SPLIT_I_MMAP
> > +static inline int mapping_mapped(const struct address_space *mapping)
> > +{
> > +	return	atomic_read(&mapping->vma_count);
> 
> Now that I think of it, I don't think we need atomic_t, only unsigned long +
> READ_ONCE() suffices. Increments can race just fine, we don't expect any 
> consistency there - if you want consistency you probably hold the i_mmap lock.
> 
okay. I will check it.

> > +}
> > +
> > +static inline void inc_mapping_vma(struct address_space *mapping,
> > +				struct vm_area_struct *vma)
> > +{
> > +	struct i_mmap_tree *tree = mapping->i_mmap[vma->tree_idx];
> > +
> > +	atomic_inc(&tree->vma_count);
> > +	atomic_inc(&mapping->vma_count);
> > +}
> > +
> > +static inline void dec_mapping_vma(struct address_space *mapping,
> > +				struct vm_area_struct *vma)
> > +{
> > +	struct i_mmap_tree *tree = mapping->i_mmap[vma->tree_idx];
> > +
> > +	atomic_dec(&tree->vma_count);
> > +	atomic_dec(&mapping->vma_count);
> > +}
> 
> This probably shouldn't be in linux/fs.h.
> 
> > +
> > +static inline struct rb_root_cached *get_i_mmap_root(struct address_space *mapping)
> > +{
> > +	return (struct rb_root_cached *)mapping->i_mmap;
> > +}
> > +
> > +static inline void i_mmap_tree_lock_write(struct address_space *mapping,
> > +					struct vm_area_struct *vma)
> > +{
> > +	struct i_mmap_tree *tree = mapping->i_mmap[vma->tree_idx];
> > +
> > +	down_write(&tree->rwsem);
> > +}
> > +
> > +static inline void i_mmap_tree_unlock_write(struct address_space *mapping,
> > +					struct vm_area_struct *vma)
> > +{
> > +	struct i_mmap_tree *tree = mapping->i_mmap[vma->tree_idx];
> > +
> > +	up_write(&tree->rwsem);
> > +}
> > +
> > +#define i_mmap_lock_write_prepare(mapping)
> > +#define i_mmap_unlock_write_complete(mapping)
> 
> It's unclear to me why you added write_prepare() and write_complete().
> 
> > +
> > +extern int split_tree_num;
> > +static inline void i_mmap_lock_write(struct address_space *mapping)
> > +{
> > +	int i;
> > +
> > +	for (i = 0; i < split_tree_num; i++)
> > +		down_write(&mapping->i_mmap[i]->rwsem);
> 
> Oof, this is an incredibly large hammer. This is basically why I think keeping
> i_mmap_rwsem (in a different form) is required. You do not want to take $nr_cpus
> locks (read _or_ write). For my design, I keep i_mmap_rwsem, but I invert its
> meaning - taking it in write = I'm reading from the tree; taking it in read =
> I'm writing to the tree. This provides some lighter-weight exclusion between
> rmap walks and rmap tree manipulation.
Okay, it seems your method is better. I will wait for your patch.

> 
> _Technically_, you shouldn't need to always take a lock when manipulating the
> tree. A pattern like mnt_hold_writers()/mnt_get_write_access() can probably
> work well. But it may be too complex ATM.
> 
> 
> Also, note that you pretty much do not want i_mmap_lock_write() users after
> the conversion is done.
> 
> > +}
> > +
> > +static inline int i_mmap_trylock_write(struct address_space *mapping)
> > +{
> > +	int i;
> > +
> > +	for (i = 0; i < split_tree_num; i++) {
> > +		if (!down_write_trylock(&mapping->i_mmap[i]->rwsem)) {
> > +			while (i--)
> > +				up_write(&mapping->i_mmap[i]->rwsem);
> > +			return 0;
> > +		}
> > +	}
> > +	return 1;
> > +}
> > +
> > +static inline void i_mmap_unlock_write(struct address_space *mapping)
> > +{
> > +	int i;
> > +
> > +	for (i = 0; i < split_tree_num; i++)
> > +		up_write(&mapping->i_mmap[i]->rwsem);
> > +}
> > +
> > +static inline int i_mmap_trylock_read(struct address_space *mapping)
> > +{
> > +	int i;
> > +
> > +	for (i = 0; i < split_tree_num; i++) {
> > +		if (!down_read_trylock(&mapping->i_mmap[i]->rwsem)) {
> > +			while (i--)
> > +				up_read(&mapping->i_mmap[i]->rwsem);
> > +			return 0;
> > +		}
> > +	}
> > +	return 1;
> > +}
> > +
> > +static inline void i_mmap_lock_read(struct address_space *mapping)
> > +{
> > +	int i;
> > +
> > +	for (i = 0; i < split_tree_num; i++)
> > +		down_read(&mapping->i_mmap[i]->rwsem);
> > +}
> > +
> > +static inline void i_mmap_unlock_read(struct address_space *mapping)
> > +{
> > +	int i;
> > +
> > +	for (i = 0; i < split_tree_num; i++)
> > +		up_read(&mapping->i_mmap[i]->rwsem);
> > +}
> > +
> > +static inline void i_mmap_assert_locked(struct address_space *mapping)
> > +{
> > +	int i;
> > +
> > +	for (i = 0; i < split_tree_num; i++)
> > +		lockdep_assert_held(&mapping->i_mmap[i]->rwsem);
> > +}
> > +
> > +static inline void i_mmap_assert_write_locked(struct address_space *mapping)
> > +{
> > +	int i;
> > +
> > +	for (i = 0; i < split_tree_num; i++)
> > +		lockdep_assert_held_write(&mapping->i_mmap[i]->rwsem);
> > +}
> > +
> > +#else
> > +
> >  static inline void i_mmap_lock_write(struct address_space *mapping)
> >  {
> >  	down_write(&mapping->i_mmap_rwsem);
> > @@ -561,6 +717,18 @@ static inline struct rb_root_cached *get_i_mmap_root(struct address_space *mappi
> >  	return &mapping->i_mmap;
> >  }
> >  
> > +static inline void inc_mapping_vma(struct address_space *mapping,
> > +				struct vm_area_struct *vma) { }
> > +static inline void dec_mapping_vma(struct address_space *mapping,
> > +				struct vm_area_struct *vma) { }
> > +
> > +#define i_mmap_lock_write_prepare(mapping)	i_mmap_lock_write(mapping)
> > +#define i_mmap_unlock_write_complete(mapping)	i_mmap_unlock_write(mapping)
> > +#define i_mmap_tree_lock_write(mapping, vma)
> > +#define i_mmap_tree_unlock_write(mapping, vma)
> > +
> > +#endif
> > +
> >  /*
> >   * Might pages of this file have been modified in userspace?
> >   * Note that i_mmap_writable counts all VM_SHARED, VM_MAYWRITE vmas: do_mmap
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 0a45c6a8b9f2..9aa8119fa9bf 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -4041,11 +4041,91 @@ struct vm_area_struct *vma_interval_tree_iter_first(struct rb_root_cached *root,
> >  struct vm_area_struct *vma_interval_tree_iter_next(struct vm_area_struct *node,
> >  				unsigned long start, unsigned long last);
> >  
> > +#ifdef CONFIG_SPLIT_I_MMAP
> > +extern int split_tree_num;
> > +
> > +static inline int smallest_tree_idx(struct file *file)
> > +{
> > +	struct address_space *mapping = file->f_mapping;
> > +	int tmp = INT_MAX, count;
> > +	int i, j = 0;
> > +
> > +	/*
> > +	 * Since a not 100% accurate value is still okay,
> > +	 * we do not need any lock here.
> > +	 */
> > +	for (i = 0; i < split_tree_num; i++) {
> > +		count = atomic_read(&mapping->i_mmap[i]->vma_count);
> > +		if (count < tmp) {
> > +			j = i;
> > +			tmp = count;
> > +			if (!tmp)
> > +				break;
> > +		}
> > +	}
> 
> Ohh, I see why you want the per-subtree vma_count now. But is this a net-win?
It keeps the trees as even as possible.

> I think doing something like vma-pointer-hashing or just smp_processor_id()
> would work a-ok.
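
For comparison, a pointer-hashing choice could look roughly like the
userspace sketch below. ptr_to_tree_idx() is a hypothetical helper; the
multiplier is the 64-bit golden-ratio constant commonly used for
multiplicative hashing. It needs no per-tree counter and no scan of the
array.

```c
#include <assert.h>
#include <stdint.h>

/* 64-bit golden-ratio multiplier used for multiplicative hashing. */
#define GOLDEN_RATIO_64 0x61C8864680B583EBull

/* Map a pointer to an index in [0, nr_trees); nr_trees must be > 0. */
static unsigned int ptr_to_tree_idx(const void *ptr, unsigned int nr_trees)
{
	uint64_t h = (uint64_t)(uintptr_t)ptr * GOLDEN_RATIO_64;

	/* Use the high bits, which are the best mixed after the multiply. */
	return (unsigned int)((h >> 32) % nr_trees);
}
```

The index depends only on the pointer value, so the same VMA always maps to
the same tree, but the distribution is only statistically even.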
> 
> > +	return j;
> > +}
> > +
> > +static inline void vma_set_tree_idx(struct vm_area_struct *vma)
> > +{
> > +#ifdef CONFIG_NUMA
> > +	vma->tree_idx = numa_node_id();
> > +#else
> > +	vma->tree_idx = smallest_tree_idx(vma->vm_file);
> > +#endif
> > +}
> > +
> > +static inline struct rb_root_cached *get_rb_root(struct vm_area_struct *vma,
> > +					struct address_space *mapping)
> > +{
> > +	return &mapping->i_mmap[vma->tree_idx]->root;
> > +}
> > +
> > +/* Find the first valid VMA in the sibling trees */
> > +static inline struct vm_area_struct *first_vma(struct i_mmap_tree ***__r,
> > +				unsigned long start, unsigned long last)
> > +{
> > +	struct vm_area_struct *vma = NULL;
> > +	struct i_mmap_tree **tree = *__r;
> > +	struct rb_root_cached *root;
> > +
> > +	while (*tree) {
> > +		root = &(*tree)->root;
> > +		tree++;
> > +		vma = vma_interval_tree_iter_first(root, start, last);
> > +		if (vma)
> > +			break;
> > +	}
> > +
> > +	/* Save for the next loop */
> > +	*__r = tree;
> > +	return vma;
> > +}
> > +
> > +/*
> > + * Please use get_i_mmap_root() to get the @root.
> > + * @_tmp is referenced to avoid unused variable warning.
> > + */
> > +#define vma_interval_tree_foreach(vma, root, start, last)		\
> > +	for (struct i_mmap_tree **_r = (struct i_mmap_tree **)(root),	\
> > +		**_tmp = (vma = first_vma(&_r, start, last)) ? _r : NULL;\
> > +	     ((_tmp && vma) || (vma = first_vma(&_r, start, last)));	\
> > +		vma = vma_interval_tree_iter_next(vma, start, last))
> > +#else
> >  /* Please use get_i_mmap_root() to get the @root */
> >  #define vma_interval_tree_foreach(vma, root, start, last)		\
> >  	for (vma = vma_interval_tree_iter_first(root, start, last);	\
> >  	     vma; vma = vma_interval_tree_iter_next(vma, start, last))
> >  
> > +static inline void vma_set_tree_idx(struct vm_area_struct *vma) { }
> > +
> > +static inline struct rb_root_cached *get_rb_root(struct vm_area_struct *vma,
> > +					struct address_space *mapping)
> > +{
> > +	return &mapping->i_mmap;
> > +}
> > +#endif
> > +
> >  void anon_vma_interval_tree_insert(struct anon_vma_chain *node,
> >  				   struct rb_root_cached *root);
> >  void anon_vma_interval_tree_remove(struct anon_vma_chain *node,
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index a308e2c23b82..8d6aab3346ce 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -1072,6 +1072,9 @@ struct vm_area_struct {
> >  #ifdef __HAVE_PFNMAP_TRACKING
> >  	struct pfnmap_track_ctx *pfnmap_track_ctx;
> >  #endif
> > +#ifdef CONFIG_SPLIT_I_MMAP
> > +	int tree_idx;			/* The sibling tree index for the VMA */
> > +#endif
> 
> FTR the struct hole isn't here, but right after vm_lock_seq or vm_refcnt in
> most configs.
okay, thanks.
I did not notice the struct hole issue.
> 
> >  } __randomize_layout;
> >  
> >  /* Clears all bits in the VMA flags bitmap, non-atomically. */
> > diff --git a/mm/internal.h b/mm/internal.h
> > index 5a2ddcf68e0b..2d35cacffd19 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -1888,7 +1888,8 @@ static inline void maybe_rmap_unlock_action(struct vm_area_struct *vma,
> >  
> >  	VM_WARN_ON_ONCE(vma_is_anonymous(vma));
> >  	file = vma->vm_file;
> > -	i_mmap_unlock_write(file->f_mapping);
> > +	i_mmap_tree_unlock_write(file->f_mapping, vma);
> > +	i_mmap_unlock_write_complete(file->f_mapping);
> >  	action->hide_from_rmap_until_complete = false;
> >  }
> >  
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index d714fdb357e5..70036ec9dcaa 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -1825,15 +1825,20 @@ __latent_entropy int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
> >  			struct address_space *mapping = file->f_mapping;
> >  
> >  			get_file(file);
> > -			i_mmap_lock_write(mapping);
> > +			i_mmap_lock_write_prepare(mapping);
> > +			i_mmap_tree_lock_write(mapping, mpnt);
> > +
> >  			if (vma_is_shared_maywrite(tmp))
> >  				mapping_allow_writable(mapping);
> >  			flush_dcache_mmap_lock(mapping);
> >  			/* insert tmp into the share list, just after mpnt */
> >  			vma_interval_tree_insert_after(tmp, mpnt,
> > -					get_i_mmap_root(mapping));
> > +					get_rb_root(mpnt, mapping));
> > +			inc_mapping_vma(mapping, tmp);
> 
> Honestly, would prefer to hide all of these details from mmap.
Yes, we can.

But we would need to change the functions in mm/interval_tree.c.

> 
> >  			flush_dcache_mmap_unlock(mapping);
> > -			i_mmap_unlock_write(mapping);
> > +
> > +			i_mmap_tree_unlock_write(mapping, mpnt);
> > +			i_mmap_unlock_write_complete(mapping);
> >  		}
> >  
> >  		if (!(tmp->vm_flags & VM_WIPEONFORK))
> > diff --git a/mm/nommu.c b/mm/nommu.c
> > index 0f18ffc658e9..1f2c60a220f6 100644
> > --- a/mm/nommu.c
> > +++ b/mm/nommu.c
> > @@ -567,11 +567,16 @@ static void setup_vma_to_mm(struct vm_area_struct *vma, struct mm_struct *mm)
> >  	if (vma->vm_file) {
> >  		struct address_space *mapping = vma->vm_file->f_mapping;
> >  
> > -		i_mmap_lock_write(mapping);
> > +		i_mmap_lock_write_prepare(mapping);
> > +		i_mmap_tree_lock_write(mapping, vma);
> > +
> >  		flush_dcache_mmap_lock(mapping);
> > -		vma_interval_tree_insert(vma, get_i_mmap_root(mapping));
> > +		vma_interval_tree_insert(vma, get_rb_root(vma, mapping));
> > +		inc_mapping_vma(mapping, vma);
> >  		flush_dcache_mmap_unlock(mapping);
> > -		i_mmap_unlock_write(mapping);
> > +
> > +		i_mmap_tree_unlock_write(mapping, vma);
> > +		i_mmap_unlock_write_complete(mapping);
> >  	}
> >  }
> >  
> > @@ -583,11 +588,16 @@ static void cleanup_vma_from_mm(struct vm_area_struct *vma)
> >  		struct address_space *mapping;
> >  		mapping = vma->vm_file->f_mapping;
> >  
> > -		i_mmap_lock_write(mapping);
> > +		i_mmap_lock_write_prepare(mapping);
> > +		i_mmap_tree_lock_write(mapping, vma);
> > +
> >  		flush_dcache_mmap_lock(mapping);
> > -		vma_interval_tree_remove(vma, get_i_mmap_root(mapping));
> > +		vma_interval_tree_remove(vma, get_rb_root(vma, mapping));
> > +		dec_mapping_vma(mapping, vma);
> >  		flush_dcache_mmap_unlock(mapping);
> > -		i_mmap_unlock_write(mapping);
> > +
> > +		i_mmap_tree_unlock_write(mapping, vma);
> > +		i_mmap_unlock_write_complete(mapping);
> >  	}
> >  }
> >  
> > @@ -1063,6 +1073,7 @@ unsigned long do_mmap(struct file *file,
> >  	if (file) {
> >  		region->vm_file = get_file(file);
> >  		vma->vm_file = get_file(file);
> > +		vma_set_tree_idx(vma);
> 
> This is unrelated, shouldn't be done here.
> 
> >  	}
> >  
> >  	down_write(&nommu_region_sem);
> > diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> > index 8df1b5077951..d5745519d95a 100644
> > --- a/mm/pagewalk.c
> > +++ b/mm/pagewalk.c
> > @@ -809,7 +809,7 @@ int walk_page_mapping(struct address_space *mapping, pgoff_t first_index,
> >  	if (!check_ops_safe(ops))
> >  		return -EINVAL;
> >  
> > -	lockdep_assert_held(&mapping->i_mmap_rwsem);
> > +	i_mmap_assert_locked(mapping);
> 
> This kind of conversion should be done in a separate step.
> 
> >  	vma_interval_tree_foreach(vma, get_i_mmap_root(mapping), first_index,
> >  				  first_index + nr - 1) {
> >  		/* Clip to the vma */
> > diff --git a/mm/vma.c b/mm/vma.c
> > index 6159650c1b42..2055758064a9 100644
> > --- a/mm/vma.c
> > +++ b/mm/vma.c
> > @@ -234,22 +234,23 @@ static void __vma_link_file(struct vm_area_struct *vma,
> >  		mapping_allow_writable(mapping);
> >  
> >  	flush_dcache_mmap_lock(mapping);
> > -	vma_interval_tree_insert(vma, get_i_mmap_root(mapping));
> > +	vma_interval_tree_insert(vma, get_rb_root(vma, mapping));
> > +	inc_mapping_vma(mapping, vma);
> 
> inc_mapping_vma() should probably be done implicitly by insertion?
Yes, we can.
It is cleaner to hide it in vma_interval_tree_insert().

> 
> >  	flush_dcache_mmap_unlock(mapping);
> >  }
> >  
> > -/*
> > - * Requires inode->i_mapping->i_mmap_rwsem
> > - */
> >  static void __remove_shared_vm_struct(struct vm_area_struct *vma,
> >  				      struct address_space *mapping)
> >  {
> > +	i_mmap_tree_lock_write(mapping, vma);
> >  	if (vma_is_shared_maywrite(vma))
> >  		mapping_unmap_writable(mapping);
> >  
> >  	flush_dcache_mmap_lock(mapping);
> > -	vma_interval_tree_remove(vma, get_i_mmap_root(mapping));
> > +	vma_interval_tree_remove(vma, get_rb_root(vma, mapping));
> > +	dec_mapping_vma(mapping, vma);
> >  	flush_dcache_mmap_unlock(mapping);
> > +	i_mmap_tree_unlock_write(mapping, vma);
> >  }
> >  
> >  /*
> > @@ -297,8 +298,9 @@ static void vma_prepare(struct vma_prepare *vp)
> >  			uprobe_munmap(vp->adj_next, vp->adj_next->vm_start,
> >  				      vp->adj_next->vm_end);
> >  
> > -		i_mmap_lock_write(vp->mapping);
> > +		i_mmap_lock_write_prepare(vp->mapping);
> >  		if (vp->insert && vp->insert->vm_file) {
> > +			i_mmap_tree_lock_write(vp->mapping, vp->insert);
> >  			/*
> >  			 * Put into interval tree now, so instantiated pages
> >  			 * are visible to arm/parisc __flush_dcache_page
> > @@ -307,6 +309,7 @@ static void vma_prepare(struct vma_prepare *vp)
> >  			 */
> >  			__vma_link_file(vp->insert,
> >  					vp->insert->vm_file->f_mapping);
> > +			i_mmap_tree_unlock_write(vp->mapping, vp->insert);
> >  		}
> >  	}
> >  
> > @@ -318,12 +321,17 @@ static void vma_prepare(struct vma_prepare *vp)
> >  	}
> >  
> >  	if (vp->file) {
> > +		i_mmap_tree_lock_write(vp->mapping, vp->vma);
> >  		flush_dcache_mmap_lock(vp->mapping);
> >  		vma_interval_tree_remove(vp->vma,
> > -					get_i_mmap_root(vp->mapping));
> > -		if (vp->adj_next)
> > +					get_rb_root(vp->vma, vp->mapping));
> > +		dec_mapping_vma(vp->mapping, vp->vma);
> > +		if (vp->adj_next) {
> > +			i_mmap_tree_lock_write(vp->mapping, vp->adj_next);
> >  			vma_interval_tree_remove(vp->adj_next,
> > -					get_i_mmap_root(vp->mapping));
> > +					get_rb_root(vp->adj_next, vp->mapping));
> > +			dec_mapping_vma(vp->mapping, vp->adj_next);
> > +		}
> >  	}
> >  
> >  }
> > @@ -340,12 +348,17 @@ static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi,
> >  			 struct mm_struct *mm)
> >  {
> >  	if (vp->file) {
> > -		if (vp->adj_next)
> > +		if (vp->adj_next) {
> >  			vma_interval_tree_insert(vp->adj_next,
> > -					get_i_mmap_root(vp->mapping));
> > +					get_rb_root(vp->adj_next, vp->mapping));
> > +			inc_mapping_vma(vp->mapping, vp->adj_next);
> > +			i_mmap_tree_unlock_write(vp->mapping, vp->adj_next);
> > +		}
> >  		vma_interval_tree_insert(vp->vma,
> > -					get_i_mmap_root(vp->mapping));
> > +					get_rb_root(vp->vma, vp->mapping));
> > +		inc_mapping_vma(vp->mapping, vp->vma);
> >  		flush_dcache_mmap_unlock(vp->mapping);
> > +		i_mmap_tree_unlock_write(vp->mapping, vp->vma);
> >  	}
> >  
> >  	if (vp->remove && vp->file) {
> > @@ -370,7 +383,7 @@ static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi,
> >  	}
> >  
> >  	if (vp->file) {
> > -		i_mmap_unlock_write(vp->mapping);
> > +		i_mmap_unlock_write_complete(vp->mapping);
> >  
> >  		if (!vp->skip_vma_uprobe) {
> >  			uprobe_mmap(vp->vma);
> > @@ -1799,12 +1812,12 @@ static void unlink_file_vma_batch_process(struct unlink_vma_file_batch *vb)
> >  	int i;
> >  
> >  	mapping = vb->vmas[0]->vm_file->f_mapping;
> > -	i_mmap_lock_write(mapping);
> > +	i_mmap_lock_write_prepare(mapping);
> >  	for (i = 0; i < vb->count; i++) {
> >  		VM_WARN_ON_ONCE(vb->vmas[i]->vm_file->f_mapping != mapping);
> >  		__remove_shared_vm_struct(vb->vmas[i], mapping);
> >  	}
> > -	i_mmap_unlock_write(mapping);
> > +	i_mmap_unlock_write_complete(mapping);
> >  
> >  	unlink_file_vma_batch_init(vb);
> >  }
> > @@ -1836,10 +1849,13 @@ static void vma_link_file(struct vm_area_struct *vma, bool hold_rmap_lock)
> >  
> >  	if (file) {
> >  		mapping = file->f_mapping;
> > -		i_mmap_lock_write(mapping);
> > +		i_mmap_lock_write_prepare(mapping);
> > +		i_mmap_tree_lock_write(mapping, vma);
> >  		__vma_link_file(vma, mapping);
> > -		if (!hold_rmap_lock)
> > -			i_mmap_unlock_write(mapping);
> > +		if (!hold_rmap_lock) {
> > +			i_mmap_tree_unlock_write(mapping, vma);
> > +			i_mmap_unlock_write_complete(mapping);
> > +		}
> >  	}
> >  }
> >  
> > @@ -2164,6 +2180,23 @@ static void vm_lock_anon_vma(struct mm_struct *mm, struct anon_vma *anon_vma)
> >  	}
> >  }
> 
> I can but hope that all of the above is quite simplified before we get to the
> "making file rmap more complicated" bit.
:(
If we do not need to care about ARM devices, we can make it simple.

Thanks
Huang Shijie



* Re: [PATCH v2 0/4] mm: split the file's i_mmap tree for NUMA
From: Huang Shijie @ 2026-06-12  7:02 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, viro, brauner, jack, muchun.song, osalvador, david, surenb,
	mjguzik, liam, vbabka, shakeel.butt, rppt, mhocko, corbet, skhan,
	linux, dinguyen, schuster.simon, James.Bottomley, deller, djbw,
	willy, peterz, mingo, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, james.clark,
	mhiramat, oleg, ziy, baolin.wang, npache, ryan.roberts, dev.jain,
	baohua, lance.yang, linmiaohe, nao.horiguchi, jannh, pfalcato,
	riel, harry, will, brian.ruley, rmk+kernel, dave.anglin, linux-mm,
	linux-doc, linux-kernel, linux-arm-kernel, linux-parisc,
	linux-fsdevel, nvdimm, linux-perf-users, linux-trace-kernel,
	zhongyuan, fangbaoshun, yingzhiwei
In-Reply-To: <airY5q_SspdbQDbi@lucifer>

On Thu, Jun 11, 2026 at 05:00:49PM +0100, Lorenzo Stoakes wrote:
> Hi Huang,
> 
> You seem to be replacing the file rmap altogether here, so you really ought
> to have sent this as an RFC so we could discuss it as a community first.
No problem.

> 
> Especially so as Pedro had publicly mentioned his plans to implement
> something similar here, so coordination would have been appreciated.
Yes. I am very happy to work with Pedro.

> 
> Anyway, as Pedro has pointed out, the code is overly complicated, it's far
> too configurable (not always a good thing), and the locking implementation
> is questionable.
I can make the code simpler. :)

> 
> You seem to be adding a whole bunch of open-coded complexity too, which is
> not something we want. Abstraction is key for the rmap.
> 
> You're also not adding any kdoc comments or really many comments at all,
> and you've not added any tests (though perhaps it's difficult given how
> core this is).
Got it.

> 
> So I would suggest that perhaps any respin should be sent as an RFC so we
> can engage in that conversation and ensure we're all on the same page?
> 
> Especially since Pedro plans to send an alternative, simpler, solution I
> believe.
> 
> It's also not helpful that you haven't examined the non-NUMA case :)
> perhaps your particular server behaves a certain way that this approach
> aids, but regresses other NUMA configurations?

Hmm. I had hoped someone could help me test this patch set on a non-NUMA
server.

It seems I should find a non-NUMA server before I send out the patch set. :)

> 
> We'd really need to be sure of this before accepting invasive changes like
> this.
Okay.

Thanks
Huang Shijie



* Re: [RFC PATCH 1/2] tracing/osnoise: Sample IPI counts
From: Valentin Schneider @ 2026-06-12  8:53 UTC (permalink / raw)
  To: Tomas Glozar
  Cc: Crystal Wood, linux-kernel, linux-trace-kernel, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Costa Shulyupin,
	Ivan Pravdin
In-Reply-To: <CAP4=nvTOV9jSermE_E_eUh29ks7B9Nv1Nkq1DfvA5nLRv=5hxg@mail.gmail.com>

On 11/06/26 13:55, Tomas Glozar wrote:
> čt 11. 6. 2026 v 12:31 odesílatel Valentin Schneider
> <vschneid@redhat.com> napsal:
>> >
>> > Isn't that precisely what the ipi tracepoints used by this
>> > implementation (ipi:ipi_send_cpu) are for?
>> >
>>
>> Well, these catch the emission of the IPI, which is great for investigation
>> - slap a stacktrace trigger and you (most of the time) get the source of
>> your interference.
>>
>> However Crystal's point is that on x86 (and I assume other archs) receiving
>> & handling these IPIs is "special" and doesn't go through the generic irq
>> subsystem and thus has to be tracked separately, which is why osnoise has
>> this fairly lengthy osnoise_arch_register() thing.
>>
>
> Ah, right. This is not IPI specific, though, IIUC - Intel also has
> other IRQs that have to be traced using Intel-specific trace points,
> like irq_vectors:local_timer, which is also handled in
> osnoise_arch_register(). On ARM from what I recall, most (all?) IRQs
> are traced with irq:* tracepoints.
>
> So there are two parts to this:
>
> - Detecting interference from IPIs firing as osnoise:irq_noise (to be
> analyzed by timerlat auto analysis, and also will appear by default in
> trace output if enabled, regardless of the tool, as all osnoise:*
> tracepoints are enabled there). This is done locally using the already
> existing path (no race hazard), but requires arch-specific detection.
>
> - Counting IPIs when they are being sent. This is the new feature, and
> the count is being recorded in osnoise_sample.
>
> I guess that means that if there were a generic IPI interface, it
> would be easier to use that for IPI counting, as the event would be
> CPU-local? As you say, for tracing of the IPI source, the sending
> tracepoints are better, and that you can already dump the stack trace
> of with --event/--trigger. timerlat auto-analysis could be extended to
> connect the specific IPI to the IRQ noise and display its stack trace
> automatically, instead of manually analyzing the trace output.
>

Right, at least for the smp_call stuff (which includes irq_work) we can
leverage:

  csd_queue_cpu (on the sending CPU)
  csd_function_entry (on the receiving CPU)

by indexing on the @csd address; once upon a time [1] I had this:

  $ echo 'hist:keys=cpu,csd.hex:ts=common_timestamp.usecs:src=common_cpu' >\
       /sys/kernel/tracing/events/csd/csd_queue_cpu/trigger
  $ echo 'csd_latency unsigned int src_cpu; '\
       'unsigned int dst_cpu; '\
       'unsigned long csd; u64 time' >\
       /sys/kernel/tracing/synthetic_events

  $ echo 'hist:keys=common_cpu,csd.hex:
  time=common_timestamp.usecs-$ts:
  onmatch(csd.csd_queue_cpu).trace(csd_latency,$src,common_cpu,csd,$time)' >\
       /sys/kernel/tracing/events/csd/csd_function_entry/trigger

  $ trace-cmd record -e 'synthetic:csd_latency' hackbench
  $ trace-cmd report
  <idle>-0     [001]   115.236810: csd_latency:          src_cpu=7, dst_cpu=1, csd=18446612682588476192, time=134
  <idle>-0     [000]   115.240676: csd_latency:          src_cpu=7, dst_cpu=0, csd=18446612682588214048, time=103
  <idle>-0     [009]   115.241320: csd_latency:          src_cpu=7, dst_cpu=9, csd=18446612682143963384, time=83
  <idle>-0     [007]   115.242817: csd_latency:          src_cpu=8, dst_cpu=7, csd=18446612682150759032, time=93
  <idle>-0     [005]   115.247802: csd_latency:          src_cpu=7, dst_cpu=5, csd=18446612682144441144, time=114
  <idle>-0     [005]   115.271775: csd_latency:          src_cpu=7, dst_cpu=5, csd=18446612682144441144, time=151
  <idle>-0     [000]   115.279620: csd_latency:          src_cpu=7, dst_cpu=0, csd=18446612682588214048, time=87
  <idle>-0     [000]   115.281727: csd_latency:          src_cpu=7, dst_cpu=0, csd=18446612682588214048, time=101

[1]: https://lore.kernel.org/lkml/xhsmh4jn8y8vt.mognet@vschneid.remote.csb/

I believe you're right that leveraging this would be useful for
timerlat-aa; I'll add it to my todolist :-)
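
The matching done by the two histogram triggers above can be sketched in
plain C, assuming a small fixed-size table. struct pending, on_queue() and
on_entry() are hypothetical names for illustration, not tracefs or rtla
code: the queue event stores a timestamp keyed by (target CPU, csd), and the
entry event on the receiving CPU looks that key up and yields the latency.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_PENDING 64	/* hypothetical table size for the sketch */

struct pending {
	int used;
	unsigned int src_cpu;
	unsigned int dst_cpu;
	uint64_t csd;
	uint64_t ts_us;
};

static struct pending table[MAX_PENDING];

/* Send side: remember when (dst_cpu, csd) was queued and by which CPU. */
static int on_queue(unsigned int src_cpu, unsigned int dst_cpu,
		    uint64_t csd, uint64_t ts_us)
{
	for (int i = 0; i < MAX_PENDING; i++) {
		if (!table[i].used) {
			table[i] = (struct pending){ 1, src_cpu, dst_cpu,
						     csd, ts_us };
			return 0;
		}
	}
	return -1;	/* table full, event dropped */
}

/*
 * Receive side: find the entry queued for (cpu, csd), release it, report
 * the sender through *src_cpu and return the latency in microseconds.
 * Returns -1 if no matching queue event was recorded.
 */
static int64_t on_entry(unsigned int cpu, uint64_t csd, uint64_t ts_us,
			unsigned int *src_cpu)
{
	for (int i = 0; i < MAX_PENDING; i++) {
		if (table[i].used && table[i].dst_cpu == cpu &&
		    table[i].csd == csd) {
			table[i].used = 0;
			*src_cpu = table[i].src_cpu;
			return (int64_t)(ts_us - table[i].ts_us);
		}
	}
	return -1;
}
```

Keying on the csd address alone would not be enough in this sketch, since
the same csd can be queued again later; the entry is released on the first
match for that reason.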

>> >> Isn't this racy to do from a different CPU?  Both in terms of the
>> >> counter, and the timing of the increment relative to when the IPI is
>> >> actually received.  Not necessarily a huge deal if you only care about
>> >> zero versus bignum, but still.  At least worth a comment, if we go with
>> >> this approach.
>> >>
>> >
>> > I also think it's a bit confusing, especially as the other accesses to
>> > osn_var are cpu-local, but here, "cpu" is the *target* CPU, not the
>> > current CPU. Not sure how expensive it would be to do atomic_add for
>> > that, at least it's something to consider.
>> >
>>
>> I suppose that could be an argument for doing that stat aggregation in
>> userspace osnoise - event handlers are run after the fact via
>> tracefs_iterate_raw_events(), it's all inherently slower since it's just
>> increments of one (one per handled event) but it's also all done in
>> userspace on a control thread and doesn't bog down the kernelspace.
>>
>
> You can also do per-cpu counters in-kernel and sum them in the end,
> but that would take cpus^2 space (indexed by [current_cpu,
> target_cpu]). The question is whether there could be enough samples to
> overload sample collection (like it happens for timerlat, which
> collects data in-kernel using BPF instead).
>
> In-kernel counting can be tested with " --event ipi:ipi_send_cpu
> --trigger hist:key=cpu" - IIRC, tracefs histograms use atomic
> operations (via tracing_map) to protect the entries from races in
> multi thread access. Of course, that is inferior to what the patchset
> implements, as it doesn't record which osnoise cycle the IPI was sent
> in, nor can record cpumask IPIs.
>

I suppose I'll need to go do some benchmarking, but I'm starting to lean
towards the side of atomic incs for IPI counts being okay considering the
sort of latencies we track.
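
For illustration only, the two counting schemes discussed above could be
sketched in userspace C11 roughly as follows. This is not the osnoise code;
NR_CPUS_DEMO and all of the function and array names are hypothetical, and
the per-CPU matrix is assumed to be summed only after the senders have
stopped.

```c
#include <stdatomic.h>

#define NR_CPUS_DEMO 4	/* hypothetical CPU count for this sketch */

/* Scheme A: one shared counter per target CPU, bumped atomically by any sender. */
static atomic_ulong ipi_count_atomic[NR_CPUS_DEMO];

static void count_ipi_atomic(int target_cpu)
{
	/* Relaxed ordering is enough for a pure statistics counter. */
	atomic_fetch_add_explicit(&ipi_count_atomic[target_cpu], 1,
				  memory_order_relaxed);
}

static unsigned long read_ipi_atomic(int target_cpu)
{
	return atomic_load_explicit(&ipi_count_atomic[target_cpu],
				    memory_order_relaxed);
}

/*
 * Scheme B: indexed by [current_cpu][target_cpu], so each sender only
 * writes its own row. Costs cpus^2 space and needs a sum on read.
 */
static unsigned long ipi_count_matrix[NR_CPUS_DEMO][NR_CPUS_DEMO];

static void count_ipi_matrix(int current_cpu, int target_cpu)
{
	ipi_count_matrix[current_cpu][target_cpu]++;
}

/* Assumed to run once the senders are done, so no concurrent writers. */
static unsigned long read_ipi_matrix(int target_cpu)
{
	unsigned long sum = 0;

	for (int src = 0; src < NR_CPUS_DEMO; src++)
		sum += ipi_count_matrix[src][target_cpu];
	return sum;
}
```

Scheme A keeps the read side trivial at the cost of a contended atomic on
the send path; scheme B keeps the send path cheap but pays in memory.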

>
> Tomas



* Re: [PATCH] sparc64: uprobes: add missing break
From: Andreas Larsson @ 2026-06-12  9:13 UTC (permalink / raw)
  To: Rosen Penev, linux-kernel
  Cc: Masami Hiramatsu, Oleg Nesterov, Peter Zijlstra, David S. Miller,
	open list:UPROBES, open list:SPARC + UltraSPARC (sparc/sparc64)
In-Reply-To: <20260506031815.779909-1-rosenp@gmail.com>

On 2026-05-06 05:18, Rosen Penev wrote:
> Missing fallthrough causes failure with newer compilers:
> 
> arch/sparc/kernel/uprobes.c:284:2: error: unannotated fall-through between switch labels [-Werror,-Wimplicit-fallthrough]
>   284 |         default:
>       |         ^
> arch/sparc/kernel/uprobes.c:284:2: note: insert 'break;' to avoid fall-through
>   284 |         default:
>       |         ^
>       |         break;
> 
> Signed-off-by: Rosen Penev <rosenp@gmail.com>
> ---
>  arch/sparc/kernel/uprobes.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/arch/sparc/kernel/uprobes.c b/arch/sparc/kernel/uprobes.c
> index 305017bec164..c8cac64e9988 100644
> --- a/arch/sparc/kernel/uprobes.c
> +++ b/arch/sparc/kernel/uprobes.c
> @@ -280,6 +280,7 @@ int arch_uprobe_exception_notify(struct notifier_block *self,
>  	case DIE_SSTEP:
>  		if (uprobe_post_sstep_notifier(args->regs))
>  			ret = NOTIFY_STOP;
> +		break;
>  
>  	default:
>  		break;

Reviewed-by: Andreas Larsson <andreas@gaisler.com>

Picking this up to my for-next.

Thanks,
Andreas



* [PATCH] rtla: Simplify osnoise tracer option setting code
From: Tomas Glozar @ 2026-06-12 11:51 UTC (permalink / raw)
  To: Steven Rostedt, Tomas Glozar
  Cc: John Kacur, Luis Goncalves, Crystal Wood, Costa Shulyupin,
	Wander Lairson Costa, LKML, linux-trace-kernel

Each osnoise tracer option (in /sys/kernel/tracing/osnoise) used by RTLA
requires four functions to be defined:

- static osnoise_get_<opt>() - to get the current value of the option
  and save it into struct osnoise_context's orig_<opt> field,
- osnoise_set_<opt>() - to set the value of the option requested by the
  user after reading and saving the original with osnoise_get_<opt>(),
  and save it into <opt> field of struct osnoise_context,
- osnoise_restore_<opt>() - restore the value recorded in orig_<opt>,
- static osnoise_put_<opt>() - restore the value recorded in orig_<opt>
  and update <opt> to reflect that.

The logic is duplicated for all the options, except for cpus (which is
the only string option) and period/runtime (which are handled together
and feature extra checks).
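
As a rough sketch of the get/set/restore/put lifecycle described above,
with the tracefs file replaced by a plain variable: all lifecycle_* names
and LIFECYCLE_INIT_VAL are hypothetical stand-ins, not RTLA identifiers.

```c
#define LIFECYCLE_INIT_VAL (-1LL)	/* "not read/set yet" marker */

/* Hypothetical stand-in for the value stored in the tracefs file. */
static long long lifecycle_file = 100;

struct lifecycle_ctx {
	long long orig_val;	/* value found before we touched the option */
	long long val;		/* value we set, if any */
};

/* get: return the known value, reading and saving the original if needed. */
static long long lifecycle_get(struct lifecycle_ctx *c)
{
	if (c->val != LIFECYCLE_INIT_VAL)
		return c->val;
	if (c->orig_val != LIFECYCLE_INIT_VAL)
		return c->orig_val;
	c->orig_val = lifecycle_file;
	return c->orig_val;
}

/* set: save the original first, then write and record the new value. */
static int lifecycle_set(struct lifecycle_ctx *c, long long v)
{
	if (lifecycle_get(c) == LIFECYCLE_INIT_VAL)
		return -1;
	lifecycle_file = v;
	c->val = v;
	return 0;
}

/* restore: write the original back if it differs, then forget val. */
static void lifecycle_restore(struct lifecycle_ctx *c)
{
	if (c->orig_val == LIFECYCLE_INIT_VAL)
		return;
	if (c->orig_val != c->val)
		lifecycle_file = c->orig_val;
	c->val = LIFECYCLE_INIT_VAL;
}

/* put: restore, then forget the original as well. */
static void lifecycle_put(struct lifecycle_ctx *c)
{
	lifecycle_restore(c);
	c->orig_val = LIFECYCLE_INIT_VAL;
}
```

Calling lifecycle_set() followed by lifecycle_put() leaves both the fake
file and the context in their initial state.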

Deduplicate the logic using a set of macros featuring the X macro
pattern, defined in src/common.h:

- OSNOISE_LL_OPTIONS, which invokes OSNOISE_LL_OPTION macro for all
  "long long" options,
- OSNOISE_FLAG_OPTIONS, which invokes OSNOISE_FLAG_OPTION macro for all
  flag options (boolean values in the osnoise/options file).
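
For readers unfamiliar with the X macro pattern, a minimal standalone
sketch follows. The DEMO_* names, paths and values are made up for the
example; only the shape (one list macro, redefined per use site) mirrors
what the patch does.

```c
#include <stddef.h>

/* Hypothetical option list: (field name, file path, init value). */
#define DEMO_LL_OPTIONS \
	DEMO_LL_OPTION(stop_us,   "demo/stop_us",   -1) \
	DEMO_LL_OPTION(period_us, "demo/period_us",  0)

/* Expansion 1: two struct fields per option. */
#define DEMO_LL_OPTION(name, path, init_val)	\
	long long orig_##name;			\
	long long name;
struct demo_context {
	DEMO_LL_OPTIONS
};
#undef DEMO_LL_OPTION

/* Expansion 2: initialize every field to its per-option init value. */
static void demo_context_init(struct demo_context *context)
{
#define DEMO_LL_OPTION(name, path, init_val)	\
	context->orig_##name = (init_val);	\
	context->name = (init_val);
	DEMO_LL_OPTIONS
#undef DEMO_LL_OPTION
}

/* Expansion 3: a table of paths, in list order. */
static const char *demo_paths[] = {
#define DEMO_LL_OPTION(name, path, init_val) path,
	DEMO_LL_OPTIONS
#undef DEMO_LL_OPTION
};

static size_t demo_option_count(void)
{
	return sizeof(demo_paths) / sizeof(demo_paths[0]);
}

static const char *demo_option_path(size_t i)
{
	return demo_paths[i];
}
```

Adding a third option then means adding one line to DEMO_LL_OPTIONS; the
struct, the initializer and the path table all pick it up.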

The list macros are then invoked in four places:

- for struct osnoise_context fields in src/common.h,
- for function declarations, moved into src/common.h from
  src/osnoise.h,
- for function definitions in src/osnoise.c,
- for context initialization and restoration, in osnoise_context_alloc()
  and osnoise_put_context(), both in src/osnoise.c.

OSNOISE_LL_OPTION takes three arguments: name - struct osnoise_context
field name (written "<opt>" above), path - file path relative to
/sys/kernel/tracing passed to libtracefs (e.g. osnoise/stop_tracing_us,
or tracing_thresh), and init_val - initial value of the struct fields,
corresponding to an otherwise invalid option value (some options use
OSNOISE_OPTION_INIT_VAL = -1, some use OSNOISE_TIME_INIT_VAL = 0).

OSNOISE_FLAG_OPTION is similar, but instead of path, it takes the option
string inside /sys/kernel/tracing/osnoise/options (opt_string), and no
init_val, as it is purely boolean (0 or 1).

Previously, only for options timerlat_align and osnoise_workload, the
return value of osnoise_set_<opt>() distinguished between -2 (option
cannot be set) and -1 (option not present). This distinction is extended
to all options for consistency. For most options it is currently not
used; only the osnoise_workload handling relies on it, to avoid an error
on -1 on older RTLA versions.
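
As a sketch of how a caller could use the two return codes: the demo_*
functions below are hypothetical, and the two bool flags merely simulate
the tracefs state. They are not the RTLA call sites.

```c
#include <stdbool.h>

/* Hypothetical stand-in for osnoise_set_<opt>(). */
static int demo_set_option(bool present, bool writable)
{
	if (!present)
		return -1;	/* option not present */
	if (!writable)
		return -2;	/* option present but cannot be set */
	return 0;
}

/* Caller that tolerates a missing option but fails on a write error. */
static int demo_apply_optional(bool present, bool writable)
{
	int retval = demo_set_option(present, writable);

	if (retval < -1)
		return -1;	/* hard failure: the write was rejected */

	return 0;		/* option set, or absent and skipped */
}
```

A caller that requires the option would instead treat any negative
return value as fatal.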

The change overall has two main benefits: it makes it much simpler to
add a new option, as well as to change existing logic consistently for
all of them. It also makes the code shorter by a bit over 500 lines.

There is no intentional user-visible change coming from the refactoring.
One internal difference: osnoise_restore_<opt>() for flag options now
resets <opt> instead of orig_<opt>. orig_<opt> is still reset by
osnoise_put_<opt>(), and long long options reset <opt> in both the old
and new implementation, so the old behavior was likely a mistake. It
should not matter for now, as the options are only restored once at the
end of tracing and neither <opt> nor orig_<opt> is read again.

Assisted-by: Claude:claude-opus-4-6
Signed-off-by: Tomas Glozar <tglozar@redhat.com>
---
 tools/tracing/rtla/src/common.h  |  79 +--
 tools/tracing/rtla/src/osnoise.c | 836 ++++++-------------------------
 tools/tracing/rtla/src/osnoise.h |  22 -
 3 files changed, 188 insertions(+), 749 deletions(-)

diff --git a/tools/tracing/rtla/src/common.h b/tools/tracing/rtla/src/common.h
index 04b287a03f6d..47233b0781c7 100644
--- a/tools/tracing/rtla/src/common.h
+++ b/tools/tracing/rtla/src/common.h
@@ -6,9 +6,35 @@
 #include "trace.h"
 #include "utils.h"
 
+/*
+ * OSNOISE_LL_OPTIONS - list of long long options backed by tracefs files.
+ *   OSNOISE_LL_OPTION(field_name, tracefs_path, init_value)
+ *
+ * OSNOISE_FLAG_OPTIONS - list of boolean options backed by osnoise/options.
+ *   OSNOISE_FLAG_OPTION(field_name, option_string)
+ */
+#define OSNOISE_LL_OPTIONS \
+	OSNOISE_LL_OPTION(stop_us,		"osnoise/stop_tracing_us",	 OSNOISE_OPTION_INIT_VAL) \
+	OSNOISE_LL_OPTION(stop_total_us,	"osnoise/stop_tracing_total_us", OSNOISE_OPTION_INIT_VAL) \
+	OSNOISE_LL_OPTION(print_stack,		"osnoise/print_stack",		 OSNOISE_OPTION_INIT_VAL) \
+	OSNOISE_LL_OPTION(tracing_thresh,	"tracing_thresh",		 OSNOISE_OPTION_INIT_VAL) \
+	OSNOISE_LL_OPTION(timerlat_period_us,	"osnoise/timerlat_period_us",	 OSNOISE_TIME_INIT_VAL)   \
+	OSNOISE_LL_OPTION(timerlat_align_us,	"osnoise/timerlat_align_us",	 OSNOISE_OPTION_INIT_VAL)
+
+#define OSNOISE_FLAG_OPTIONS \
+	OSNOISE_FLAG_OPTION(irq_disable,	"OSNOISE_IRQ_DISABLE") \
+	OSNOISE_FLAG_OPTION(workload,		"OSNOISE_WORKLOAD") \
+	OSNOISE_FLAG_OPTION(timerlat_align,	"TIMERLAT_ALIGN")
+
 /*
  * osnoise_context - read, store, write, restore osnoise configs.
  */
+#define OSNOISE_LL_OPTION(name, path, init_val)		\
+	long long		orig_##name;		\
+	long long		name;
+#define OSNOISE_FLAG_OPTION(name, option_str)		\
+	int			orig_opt_##name;	\
+	int			opt_##name;
 struct osnoise_context {
 	int			flags;
 	int			ref;
@@ -24,42 +50,11 @@ struct osnoise_context {
 	unsigned long long	orig_period_us;
 	unsigned long long	period_us;
 
-	/* 0 as init value */
-	long long		orig_timerlat_period_us;
-	long long		timerlat_period_us;
-
-	/* 0 as init value */
-	long long		orig_tracing_thresh;
-	long long		tracing_thresh;
-
-	/* -1 as init value because 0 is disabled */
-	long long		orig_stop_us;
-	long long		stop_us;
-
-	/* -1 as init value because 0 is disabled */
-	long long		orig_stop_total_us;
-	long long		stop_total_us;
-
-	/* -1 as init value because 0 is disabled */
-	long long		orig_print_stack;
-	long long		print_stack;
-
-	/* -1 as init value because 0 is off */
-	int			orig_opt_irq_disable;
-	int			opt_irq_disable;
-
-	/* -1 as init value because 0 is off */
-	int			orig_opt_workload;
-	int			opt_workload;
-
-	/* -1 as init value because 0 is off */
-	int			orig_opt_timerlat_align;
-	int			opt_timerlat_align;
-
-	/* 0 as init value */
-	unsigned long long	orig_timerlat_align_us;
-	unsigned long long	timerlat_align_us;
+	OSNOISE_LL_OPTIONS
+	OSNOISE_FLAG_OPTIONS
 };
+#undef OSNOISE_LL_OPTION
+#undef OSNOISE_FLAG_OPTION
 
 extern volatile int stop_tracing;
 
@@ -173,15 +168,21 @@ common_threshold_handler(const struct osnoise_tool *tool);
 int osnoise_set_cpus(struct osnoise_context *context, char *cpus);
 void osnoise_restore_cpus(struct osnoise_context *context);
 
-int osnoise_set_workload(struct osnoise_context *context, bool onoff);
+#define OSNOISE_LL_OPTION(name, path, init_val)					\
+	int osnoise_set_##name(struct osnoise_context *context, long long name);	\
+	void osnoise_restore_##name(struct osnoise_context *context);
+#define OSNOISE_FLAG_OPTION(name, option_str)					\
+	int osnoise_set_##name(struct osnoise_context *context, bool onoff);	\
+	void osnoise_restore_##name(struct osnoise_context *context);
+OSNOISE_LL_OPTIONS
+OSNOISE_FLAG_OPTIONS
+#undef OSNOISE_LL_OPTION
+#undef OSNOISE_FLAG_OPTION
 
 void osnoise_destroy_tool(struct osnoise_tool *top);
 struct osnoise_tool *osnoise_init_tool(char *tool_name);
 struct osnoise_tool *osnoise_init_trace_tool(const char *tracer);
 bool osnoise_trace_is_off(struct osnoise_tool *tool, struct osnoise_tool *record);
-int osnoise_set_stop_us(struct osnoise_context *context, long long stop_us);
-int osnoise_set_stop_total_us(struct osnoise_context *context,
-			      long long stop_total_us);
 
 int common_apply_config(struct osnoise_tool *tool, struct common_params *params);
 int top_main_loop(struct osnoise_tool *tool);
diff --git a/tools/tracing/rtla/src/osnoise.c b/tools/tracing/rtla/src/osnoise.c
index 4ff5dad013b1..7f15d00b431e 100644
--- a/tools/tracing/rtla/src/osnoise.c
+++ b/tools/tracing/rtla/src/osnoise.c
@@ -345,480 +345,73 @@ void osnoise_put_runtime_period(struct osnoise_context *context)
 }
 
 /*
- * osnoise_get_timerlat_period_us - read and save the original "timerlat_period_us"
- */
-static long long
-osnoise_get_timerlat_period_us(struct osnoise_context *context)
-{
-	long long timerlat_period_us;
-
-	if (context->timerlat_period_us != OSNOISE_TIME_INIT_VAL)
-		return context->timerlat_period_us;
-
-	if (context->orig_timerlat_period_us != OSNOISE_TIME_INIT_VAL)
-		return context->orig_timerlat_period_us;
-
-	timerlat_period_us = osnoise_read_ll_config("osnoise/timerlat_period_us");
-	if (timerlat_period_us < 0)
-		goto out_err;
-
-	context->orig_timerlat_period_us = timerlat_period_us;
-	return timerlat_period_us;
-
-out_err:
-	return OSNOISE_TIME_INIT_VAL;
-}
-
-/*
- * osnoise_set_timerlat_period_us - set "timerlat_period_us"
- */
-int osnoise_set_timerlat_period_us(struct osnoise_context *context, long long timerlat_period_us)
-{
-	long long curr_timerlat_period_us = osnoise_get_timerlat_period_us(context);
-	int retval;
-
-	if (curr_timerlat_period_us == OSNOISE_TIME_INIT_VAL)
-		return -1;
-
-	retval = osnoise_write_ll_config("osnoise/timerlat_period_us", timerlat_period_us);
-	if (retval < 0)
-		return -1;
-
-	context->timerlat_period_us = timerlat_period_us;
-
-	return 0;
-}
-
-/*
- * osnoise_restore_timerlat_period_us - restore "timerlat_period_us"
- */
-void osnoise_restore_timerlat_period_us(struct osnoise_context *context)
-{
-	int retval;
-
-	if (context->orig_timerlat_period_us == OSNOISE_TIME_INIT_VAL)
-		return;
-
-	if (context->orig_timerlat_period_us == context->timerlat_period_us)
-		goto out_done;
-
-	retval = osnoise_write_ll_config("osnoise/timerlat_period_us", context->orig_timerlat_period_us);
-	if (retval < 0)
-		err_msg("Could not restore original osnoise timerlat_period_us\n");
-
-out_done:
-	context->timerlat_period_us = OSNOISE_TIME_INIT_VAL;
-}
-
-/*
- * osnoise_put_timerlat_period_us - restore original values and cleanup data
- */
-void osnoise_put_timerlat_period_us(struct osnoise_context *context)
-{
-	osnoise_restore_timerlat_period_us(context);
-
-	if (context->orig_timerlat_period_us == OSNOISE_TIME_INIT_VAL)
-		return;
-
-	context->orig_timerlat_period_us = OSNOISE_TIME_INIT_VAL;
-}
-
-/*
- * osnoise_get_timerlat_align_us - read and save the original "timerlat_align_us"
- */
-static long long
-osnoise_get_timerlat_align_us(struct osnoise_context *context)
-{
-	long long timerlat_align_us;
-
-	if (context->timerlat_align_us != OSNOISE_OPTION_INIT_VAL)
-		return context->timerlat_align_us;
-
-	if (context->orig_timerlat_align_us != OSNOISE_OPTION_INIT_VAL)
-		return context->orig_timerlat_align_us;
-
-	timerlat_align_us = osnoise_read_ll_config("osnoise/timerlat_align_us");
-	if (timerlat_align_us < 0)
-		goto out_err;
-
-	context->orig_timerlat_align_us = timerlat_align_us;
-	return timerlat_align_us;
-
-out_err:
-	return OSNOISE_OPTION_INIT_VAL;
-}
-
-/*
- * osnoise_set_timerlat_align_us - set "timerlat_align_us"
- */
-int osnoise_set_timerlat_align_us(struct osnoise_context *context, long long timerlat_align_us)
-{
-	long long curr_timerlat_align_us = osnoise_get_timerlat_align_us(context);
-	int retval;
-
-	if (curr_timerlat_align_us == OSNOISE_OPTION_INIT_VAL)
-		return -1;
-
-	retval = osnoise_write_ll_config("osnoise/timerlat_align_us", timerlat_align_us);
-	if (retval < 0)
-		return -1;
-
-	context->timerlat_align_us = timerlat_align_us;
-
-	return 0;
-}
-
-/*
- * osnoise_restore_timerlat_align_us - restore "timerlat_align_us"
- */
-void osnoise_restore_timerlat_align_us(struct osnoise_context *context)
-{
-	int retval;
-
-	if (context->orig_timerlat_align_us == OSNOISE_OPTION_INIT_VAL)
-		return;
-
-	if (context->orig_timerlat_align_us == context->timerlat_align_us)
-		goto out_done;
-
-	retval = osnoise_write_ll_config("osnoise/timerlat_align_us",
-				   context->orig_timerlat_align_us);
-	if (retval < 0)
-		err_msg("Could not restore original osnoise timerlat_align_us\n");
-
-out_done:
-	context->timerlat_align_us = OSNOISE_OPTION_INIT_VAL;
-}
-
-/*
- * osnoise_put_timerlat_align_us - restore original values and cleanup data
- */
-void osnoise_put_timerlat_align_us(struct osnoise_context *context)
-{
-	osnoise_restore_timerlat_align_us(context);
-
-	if (context->orig_timerlat_align_us == OSNOISE_OPTION_INIT_VAL)
-		return;
-
-	context->orig_timerlat_align_us = OSNOISE_OPTION_INIT_VAL;
-}
-
-/*
- * osnoise_get_stop_us - read and save the original "stop_tracing_us"
- */
-static long long
-osnoise_get_stop_us(struct osnoise_context *context)
-{
-	long long stop_us;
-
-	if (context->stop_us != OSNOISE_OPTION_INIT_VAL)
-		return context->stop_us;
-
-	if (context->orig_stop_us != OSNOISE_OPTION_INIT_VAL)
-		return context->orig_stop_us;
-
-	stop_us = osnoise_read_ll_config("osnoise/stop_tracing_us");
-	if (stop_us < 0)
-		goto out_err;
-
-	context->orig_stop_us = stop_us;
-	return stop_us;
-
-out_err:
-	return OSNOISE_OPTION_INIT_VAL;
-}
-
-/*
- * osnoise_set_stop_us - set "stop_tracing_us"
- */
-int osnoise_set_stop_us(struct osnoise_context *context, long long stop_us)
-{
-	long long curr_stop_us = osnoise_get_stop_us(context);
-	int retval;
-
-	if (curr_stop_us == OSNOISE_OPTION_INIT_VAL)
-		return -1;
-
-	retval = osnoise_write_ll_config("osnoise/stop_tracing_us", stop_us);
-	if (retval < 0)
-		return -1;
-
-	context->stop_us = stop_us;
-
-	return 0;
-}
-
-/*
- * osnoise_restore_stop_us - restore the original "stop_tracing_us"
- */
-void osnoise_restore_stop_us(struct osnoise_context *context)
-{
-	int retval;
-
-	if (context->orig_stop_us == OSNOISE_OPTION_INIT_VAL)
-		return;
-
-	if (context->orig_stop_us == context->stop_us)
-		goto out_done;
-
-	retval = osnoise_write_ll_config("osnoise/stop_tracing_us", context->orig_stop_us);
-	if (retval < 0)
-		err_msg("Could not restore original osnoise stop_us\n");
-
-out_done:
-	context->stop_us = OSNOISE_OPTION_INIT_VAL;
-}
-
-/*
- * osnoise_put_stop_us - restore original values and cleanup data
- */
-void osnoise_put_stop_us(struct osnoise_context *context)
-{
-	osnoise_restore_stop_us(context);
-
-	if (context->orig_stop_us == OSNOISE_OPTION_INIT_VAL)
-		return;
-
-	context->orig_stop_us = OSNOISE_OPTION_INIT_VAL;
-}
-
-/*
- * osnoise_get_stop_total_us - read and save the original "stop_tracing_total_us"
- */
-static long long
-osnoise_get_stop_total_us(struct osnoise_context *context)
-{
-	long long stop_total_us;
-
-	if (context->stop_total_us != OSNOISE_OPTION_INIT_VAL)
-		return context->stop_total_us;
-
-	if (context->orig_stop_total_us != OSNOISE_OPTION_INIT_VAL)
-		return context->orig_stop_total_us;
-
-	stop_total_us = osnoise_read_ll_config("osnoise/stop_tracing_total_us");
-	if (stop_total_us < 0)
-		goto out_err;
-
-	context->orig_stop_total_us = stop_total_us;
-	return stop_total_us;
-
-out_err:
-	return OSNOISE_OPTION_INIT_VAL;
-}
-
-/*
- * osnoise_set_stop_total_us - set "stop_tracing_total_us"
- */
-int osnoise_set_stop_total_us(struct osnoise_context *context, long long stop_total_us)
-{
-	long long curr_stop_total_us = osnoise_get_stop_total_us(context);
-	int retval;
-
-	if (curr_stop_total_us == OSNOISE_OPTION_INIT_VAL)
-		return -1;
-
-	retval = osnoise_write_ll_config("osnoise/stop_tracing_total_us", stop_total_us);
-	if (retval < 0)
-		return -1;
-
-	context->stop_total_us = stop_total_us;
-
-	return 0;
-}
-
-/*
- * osnoise_restore_stop_total_us - restore the original "stop_tracing_total_us"
- */
-void osnoise_restore_stop_total_us(struct osnoise_context *context)
-{
-	int retval;
-
-	if (context->orig_stop_total_us == OSNOISE_OPTION_INIT_VAL)
-		return;
-
-	if (context->orig_stop_total_us == context->stop_total_us)
-		goto out_done;
-
-	retval = osnoise_write_ll_config("osnoise/stop_tracing_total_us",
-			context->orig_stop_total_us);
-	if (retval < 0)
-		err_msg("Could not restore original osnoise stop_total_us\n");
-
-out_done:
-	context->stop_total_us = OSNOISE_OPTION_INIT_VAL;
-}
-
-/*
- * osnoise_put_stop_total_us - restore original values and cleanup data
- */
-void osnoise_put_stop_total_us(struct osnoise_context *context)
-{
-	osnoise_restore_stop_total_us(context);
-
-	if (context->orig_stop_total_us == OSNOISE_OPTION_INIT_VAL)
-		return;
-
-	context->orig_stop_total_us = OSNOISE_OPTION_INIT_VAL;
-}
-
-/*
- * osnoise_get_print_stack - read and save the original "print_stack"
- */
-static long long
-osnoise_get_print_stack(struct osnoise_context *context)
-{
-	long long print_stack;
-
-	if (context->print_stack != OSNOISE_OPTION_INIT_VAL)
-		return context->print_stack;
-
-	if (context->orig_print_stack != OSNOISE_OPTION_INIT_VAL)
-		return context->orig_print_stack;
-
-	print_stack = osnoise_read_ll_config("osnoise/print_stack");
-	if (print_stack < 0)
-		goto out_err;
-
-	context->orig_print_stack = print_stack;
-	return print_stack;
-
-out_err:
-	return OSNOISE_OPTION_INIT_VAL;
-}
-
-/*
- * osnoise_set_print_stack - set "print_stack"
- */
-int osnoise_set_print_stack(struct osnoise_context *context, long long print_stack)
-{
-	long long curr_print_stack = osnoise_get_print_stack(context);
-	int retval;
-
-	if (curr_print_stack == OSNOISE_OPTION_INIT_VAL)
-		return -1;
-
-	retval = osnoise_write_ll_config("osnoise/print_stack", print_stack);
-	if (retval < 0)
-		return -1;
-
-	context->print_stack = print_stack;
-
-	return 0;
-}
-
-/*
- * osnoise_restore_print_stack - restore the original "print_stack"
- */
-void osnoise_restore_print_stack(struct osnoise_context *context)
-{
-	int retval;
-
-	if (context->orig_print_stack == OSNOISE_OPTION_INIT_VAL)
-		return;
-
-	if (context->orig_print_stack == context->print_stack)
-		goto out_done;
-
-	retval = osnoise_write_ll_config("osnoise/print_stack", context->orig_print_stack);
-	if (retval < 0)
-		err_msg("Could not restore original osnoise print_stack\n");
-
-out_done:
-	context->print_stack = OSNOISE_OPTION_INIT_VAL;
-}
-
-/*
- * osnoise_put_print_stack - restore original values and cleanup data
- */
-void osnoise_put_print_stack(struct osnoise_context *context)
-{
-	osnoise_restore_print_stack(context);
-
-	if (context->orig_print_stack == OSNOISE_OPTION_INIT_VAL)
-		return;
-
-	context->orig_print_stack = OSNOISE_OPTION_INIT_VAL;
-}
-
-/*
- * osnoise_get_tracing_thresh - read and save the original "tracing_thresh"
- */
-static long long
-osnoise_get_tracing_thresh(struct osnoise_context *context)
-{
-	long long tracing_thresh;
-
-	if (context->tracing_thresh != OSNOISE_OPTION_INIT_VAL)
-		return context->tracing_thresh;
-
-	if (context->orig_tracing_thresh != OSNOISE_OPTION_INIT_VAL)
-		return context->orig_tracing_thresh;
-
-	tracing_thresh = osnoise_read_ll_config("tracing_thresh");
-	if (tracing_thresh < 0)
-		goto out_err;
-
-	context->orig_tracing_thresh = tracing_thresh;
-	return tracing_thresh;
-
-out_err:
-	return OSNOISE_OPTION_INIT_VAL;
-}
-
-/*
- * osnoise_set_tracing_thresh - set "tracing_thresh"
- */
-int osnoise_set_tracing_thresh(struct osnoise_context *context, long long tracing_thresh)
-{
-	long long curr_tracing_thresh = osnoise_get_tracing_thresh(context);
-	int retval;
-
-	if (curr_tracing_thresh == OSNOISE_OPTION_INIT_VAL)
-		return -1;
-
-	retval = osnoise_write_ll_config("tracing_thresh", tracing_thresh);
-	if (retval < 0)
-		return -1;
-
-	context->tracing_thresh = tracing_thresh;
-
-	return 0;
-}
-
-/*
- * osnoise_restore_tracing_thresh - restore the original "tracing_thresh"
- */
-void osnoise_restore_tracing_thresh(struct osnoise_context *context)
-{
-	int retval;
-
-	if (context->orig_tracing_thresh == OSNOISE_OPTION_INIT_VAL)
-		return;
-
-	if (context->orig_tracing_thresh == context->tracing_thresh)
-		goto out_done;
-
-	retval = osnoise_write_ll_config("tracing_thresh", context->orig_tracing_thresh);
-	if (retval < 0)
-		err_msg("Could not restore original tracing_thresh\n");
-
-out_done:
-	context->tracing_thresh = OSNOISE_OPTION_INIT_VAL;
-}
-
-/*
- * osnoise_put_tracing_thresh - restore original values and cleanup data
- */
-void osnoise_put_tracing_thresh(struct osnoise_context *context)
-{
-	osnoise_restore_tracing_thresh(context);
-
-	if (context->orig_tracing_thresh == OSNOISE_OPTION_INIT_VAL)
-		return;
-
-	context->orig_tracing_thresh = OSNOISE_OPTION_INIT_VAL;
-}
+ * Long long option get/set/restore/put functions, generated from OSNOISE_LL_OPTIONS.
+ */
+#define OSNOISE_LL_OPTION(name, path, init_val)						\
+static long long									\
+osnoise_get_##name(struct osnoise_context *context)					\
+{											\
+	long long name;									\
+											\
+	if (context->name != (init_val))						\
+		return context->name;							\
+											\
+	if (context->orig_##name != (init_val))						\
+		return context->orig_##name;						\
+											\
+	name = osnoise_read_ll_config(path);						\
+	if (name < 0)									\
+		return (init_val);							\
+											\
+	context->orig_##name = name;							\
+	return name;									\
+}											\
+											\
+int osnoise_set_##name(struct osnoise_context *context, long long name)			\
+{											\
+	long long curr = osnoise_get_##name(context);					\
+	int retval;									\
+											\
+	if (curr == (init_val))								\
+		return -1;								\
+											\
+	retval = osnoise_write_ll_config(path, name);					\
+	if (retval < 0)									\
+		return -2;								\
+											\
+	context->name = name;								\
+	return 0;									\
+}											\
+											\
+void osnoise_restore_##name(struct osnoise_context *context)				\
+{											\
+	int retval;									\
+											\
+	if (context->orig_##name == (init_val))						\
+		return;									\
+											\
+	if (context->orig_##name == context->name)					\
+		goto out_done_##name;							\
+											\
+	retval = osnoise_write_ll_config(path, context->orig_##name);			\
+	if (retval < 0)									\
+		err_msg("Could not restore original " #name "\n");			\
+											\
+out_done_##name:									\
+	context->name = (init_val);							\
+}											\
+											\
+static void osnoise_put_##name(struct osnoise_context *context)				\
+{											\
+	osnoise_restore_##name(context);						\
+											\
+	if (context->orig_##name == (init_val))						\
+		return;									\
+											\
+	context->orig_##name = (init_val);						\
+}
+OSNOISE_LL_OPTIONS
+#undef OSNOISE_LL_OPTION
 
 static int osnoise_options_get_option(char *option)
 {
@@ -866,188 +459,70 @@ static int osnoise_options_set_option(char *option, bool onoff)
 	return tracefs_instance_file_write(NULL, "osnoise/options", no_option);
 }
 
-static int osnoise_get_irq_disable(struct osnoise_context *context)
-{
-	if (context->opt_irq_disable != OSNOISE_OPTION_INIT_VAL)
-		return context->opt_irq_disable;
-
-	if (context->orig_opt_irq_disable != OSNOISE_OPTION_INIT_VAL)
-		return context->orig_opt_irq_disable;
-
-	context->orig_opt_irq_disable = osnoise_options_get_option("OSNOISE_IRQ_DISABLE");
-
-	return context->orig_opt_irq_disable;
-}
-
-int osnoise_set_irq_disable(struct osnoise_context *context, bool onoff)
-{
-	int opt_irq_disable = osnoise_get_irq_disable(context);
-	int retval;
-
-	if (opt_irq_disable == OSNOISE_OPTION_INIT_VAL)
-		return -1;
-
-	if (opt_irq_disable == onoff)
-		return 0;
-
-	retval = osnoise_options_set_option("OSNOISE_IRQ_DISABLE", onoff);
-	if (retval < 0)
-		return -1;
-
-	context->opt_irq_disable = onoff;
-
-	return 0;
-}
-
-static void osnoise_restore_irq_disable(struct osnoise_context *context)
-{
-	int retval;
-
-	if (context->orig_opt_irq_disable == OSNOISE_OPTION_INIT_VAL)
-		return;
-
-	if (context->orig_opt_irq_disable == context->opt_irq_disable)
-		goto out_done;
-
-	retval = osnoise_options_set_option("OSNOISE_IRQ_DISABLE", context->orig_opt_irq_disable);
-	if (retval < 0)
-		err_msg("Could not restore original OSNOISE_IRQ_DISABLE option\n");
-
-out_done:
-	context->orig_opt_irq_disable = OSNOISE_OPTION_INIT_VAL;
-}
-
-static void osnoise_put_irq_disable(struct osnoise_context *context)
-{
-	osnoise_restore_irq_disable(context);
-
-	if (context->orig_opt_irq_disable == OSNOISE_OPTION_INIT_VAL)
-		return;
-
-	context->orig_opt_irq_disable = OSNOISE_OPTION_INIT_VAL;
-}
-
-static int osnoise_get_workload(struct osnoise_context *context)
-{
-	if (context->opt_workload != OSNOISE_OPTION_INIT_VAL)
-		return context->opt_workload;
-
-	if (context->orig_opt_workload != OSNOISE_OPTION_INIT_VAL)
-		return context->orig_opt_workload;
-
-	context->orig_opt_workload = osnoise_options_get_option("OSNOISE_WORKLOAD");
-
-	return context->orig_opt_workload;
-}
-
-int osnoise_set_workload(struct osnoise_context *context, bool onoff)
-{
-	int opt_workload = osnoise_get_workload(context);
-	int retval;
-
-	if (opt_workload == OSNOISE_OPTION_INIT_VAL)
-		return -1;
-
-	if (opt_workload == onoff)
-		return 0;
-
-	retval = osnoise_options_set_option("OSNOISE_WORKLOAD", onoff);
-	if (retval < 0)
-		return -2;
-
-	context->opt_workload = onoff;
-
-	return 0;
-}
-
-static void osnoise_restore_workload(struct osnoise_context *context)
-{
-	int retval;
-
-	if (context->orig_opt_workload == OSNOISE_OPTION_INIT_VAL)
-		return;
-
-	if (context->orig_opt_workload == context->opt_workload)
-		goto out_done;
-
-	retval = osnoise_options_set_option("OSNOISE_WORKLOAD", context->orig_opt_workload);
-	if (retval < 0)
-		err_msg("Could not restore original OSNOISE_WORKLOAD option\n");
-
-out_done:
-	context->orig_opt_workload = OSNOISE_OPTION_INIT_VAL;
-}
-
-static void osnoise_put_workload(struct osnoise_context *context)
-{
-	osnoise_restore_workload(context);
-
-	if (context->orig_opt_workload == OSNOISE_OPTION_INIT_VAL)
-		return;
-
-	context->orig_opt_workload = OSNOISE_OPTION_INIT_VAL;
-}
-
-static int osnoise_get_timerlat_align(struct osnoise_context *context)
-{
-	if (context->opt_timerlat_align != OSNOISE_OPTION_INIT_VAL)
-		return context->opt_timerlat_align;
-
-	if (context->orig_opt_timerlat_align != OSNOISE_OPTION_INIT_VAL)
-		return context->orig_opt_timerlat_align;
-
-	context->orig_opt_timerlat_align = osnoise_options_get_option("TIMERLAT_ALIGN");
-
-	return context->orig_opt_timerlat_align;
-}
-
-int osnoise_set_timerlat_align(struct osnoise_context *context, bool onoff)
-{
-	int opt_timerlat_align = osnoise_get_timerlat_align(context);
-	int retval;
-
-	if (opt_timerlat_align == OSNOISE_OPTION_INIT_VAL)
-		return -1;
-
-	if (opt_timerlat_align == onoff)
-		return 0;
-
-	retval = osnoise_options_set_option("TIMERLAT_ALIGN", onoff);
-	if (retval < 0)
-		return -2;
-
-	context->opt_timerlat_align = onoff;
-
-	return 0;
-}
-
-static void osnoise_restore_timerlat_align(struct osnoise_context *context)
-{
-	int retval;
-
-	if (context->orig_opt_timerlat_align == OSNOISE_OPTION_INIT_VAL)
-		return;
-
-	if (context->orig_opt_timerlat_align == context->opt_timerlat_align)
-		goto out_done;
-
-	retval = osnoise_options_set_option("TIMERLAT_ALIGN", context->orig_opt_timerlat_align);
-	if (retval < 0)
-		err_msg("Could not restore original TIMERLAT_ALIGN option\n");
-
-out_done:
-	context->orig_opt_timerlat_align = OSNOISE_OPTION_INIT_VAL;
-}
-
-static void osnoise_put_timerlat_align(struct osnoise_context *context)
-{
-	osnoise_restore_timerlat_align(context);
-
-	if (context->orig_opt_timerlat_align == OSNOISE_OPTION_INIT_VAL)
-		return;
-
-	context->orig_opt_timerlat_align = OSNOISE_OPTION_INIT_VAL;
-}
+/*
+ * Flag option get/set/restore/put functions, generated from OSNOISE_FLAG_OPTIONS.
+ */
+#define OSNOISE_FLAG_OPTION(name, option_str)						\
+static int osnoise_get_##name(struct osnoise_context *context)				\
+{											\
+	if (context->opt_##name != OSNOISE_OPTION_INIT_VAL)				\
+		return context->opt_##name;						\
+											\
+	if (context->orig_opt_##name != OSNOISE_OPTION_INIT_VAL)			\
+		return context->orig_opt_##name;					\
+											\
+	context->orig_opt_##name = osnoise_options_get_option(option_str);		\
+	return context->orig_opt_##name;						\
+}											\
+											\
+int osnoise_set_##name(struct osnoise_context *context, bool onoff)			\
+{											\
+	int val = osnoise_get_##name(context);						\
+	int retval;									\
+											\
+	if (val == OSNOISE_OPTION_INIT_VAL)						\
+		return -1;								\
+											\
+	if (val == onoff)								\
+		return 0;								\
+											\
+	retval = osnoise_options_set_option(option_str, onoff);				\
+	if (retval < 0)									\
+		return -2;								\
+											\
+	context->opt_##name = onoff;							\
+	return 0;									\
+}											\
+											\
+void osnoise_restore_##name(struct osnoise_context *context)				\
+{											\
+	int retval;									\
+											\
+	if (context->orig_opt_##name == OSNOISE_OPTION_INIT_VAL)			\
+		return;									\
+											\
+	if (context->orig_opt_##name == context->opt_##name)				\
+		goto out_done_##name;							\
+											\
+	retval = osnoise_options_set_option(option_str, context->orig_opt_##name);	\
+	if (retval < 0)									\
+		err_msg("Could not restore original " option_str " option\n");		\
+											\
+out_done_##name:									\
+	context->opt_##name = OSNOISE_OPTION_INIT_VAL;					\
+}											\
+											\
+static void osnoise_put_##name(struct osnoise_context *context)				\
+{											\
+	osnoise_restore_##name(context);						\
+											\
+	if (context->orig_opt_##name == OSNOISE_OPTION_INIT_VAL)			\
+		return;									\
+											\
+	context->orig_opt_##name = OSNOISE_OPTION_INIT_VAL;				\
+}
+OSNOISE_FLAG_OPTIONS
+#undef OSNOISE_FLAG_OPTION
 
 enum {
 	FLAG_CONTEXT_NEWLY_CREATED	= (1 << 0),
@@ -1083,29 +558,16 @@ struct osnoise_context *osnoise_context_alloc(void)
 
 	context = calloc_fatal(1, sizeof(*context));
 
-	context->orig_stop_us		= OSNOISE_OPTION_INIT_VAL;
-	context->stop_us		= OSNOISE_OPTION_INIT_VAL;
-
-	context->orig_stop_total_us	= OSNOISE_OPTION_INIT_VAL;
-	context->stop_total_us		= OSNOISE_OPTION_INIT_VAL;
-
-	context->orig_print_stack	= OSNOISE_OPTION_INIT_VAL;
-	context->print_stack		= OSNOISE_OPTION_INIT_VAL;
-
-	context->orig_tracing_thresh	= OSNOISE_OPTION_INIT_VAL;
-	context->tracing_thresh		= OSNOISE_OPTION_INIT_VAL;
-
-	context->orig_opt_irq_disable	= OSNOISE_OPTION_INIT_VAL;
-	context->opt_irq_disable	= OSNOISE_OPTION_INIT_VAL;
-
-	context->orig_opt_workload	= OSNOISE_OPTION_INIT_VAL;
-	context->opt_workload		= OSNOISE_OPTION_INIT_VAL;
-
-	context->orig_opt_timerlat_align	= OSNOISE_OPTION_INIT_VAL;
-	context->opt_timerlat_align		= OSNOISE_OPTION_INIT_VAL;
-
-	context->orig_timerlat_align_us	= OSNOISE_OPTION_INIT_VAL;
-	context->timerlat_align_us	= OSNOISE_OPTION_INIT_VAL;
+#define OSNOISE_LL_OPTION(name, path, init_val)			\
+	context->orig_##name	 = (init_val);			\
+	context->name		 = (init_val);
+#define OSNOISE_FLAG_OPTION(name, option_str)			\
+	context->orig_opt_##name = OSNOISE_OPTION_INIT_VAL; 	\
+	context->opt_##name	 = OSNOISE_OPTION_INIT_VAL;
+	OSNOISE_LL_OPTIONS
+	OSNOISE_FLAG_OPTIONS
+#undef OSNOISE_LL_OPTION
+#undef OSNOISE_FLAG_OPTION
 
 	osnoise_get_context(context);
 
@@ -1128,15 +590,13 @@ void osnoise_put_context(struct osnoise_context *context)
 
 	osnoise_put_cpus(context);
 	osnoise_put_runtime_period(context);
-	osnoise_put_stop_us(context);
-	osnoise_put_stop_total_us(context);
-	osnoise_put_timerlat_period_us(context);
-	osnoise_put_print_stack(context);
-	osnoise_put_tracing_thresh(context);
-	osnoise_put_irq_disable(context);
-	osnoise_put_workload(context);
-	osnoise_put_timerlat_align(context);
-	osnoise_put_timerlat_align_us(context);
+
+#define OSNOISE_LL_OPTION(name, path, init_val)	osnoise_put_##name(context);
+#define OSNOISE_FLAG_OPTION(name, option_str)	osnoise_put_##name(context);
+	OSNOISE_LL_OPTIONS
+	OSNOISE_FLAG_OPTIONS
+#undef OSNOISE_LL_OPTION
+#undef OSNOISE_FLAG_OPTION
 
 	free(context);
 }
diff --git a/tools/tracing/rtla/src/osnoise.h b/tools/tracing/rtla/src/osnoise.h
index 340ff5a64e6e..3d1852bffed8 100644
--- a/tools/tracing/rtla/src/osnoise.h
+++ b/tools/tracing/rtla/src/osnoise.h
@@ -34,28 +34,6 @@ int osnoise_set_runtime_period(struct osnoise_context *context,
 			       unsigned long long period);
 void osnoise_restore_runtime_period(struct osnoise_context *context);
 
-void osnoise_restore_stop_us(struct osnoise_context *context);
-void osnoise_restore_stop_total_us(struct osnoise_context *context);
-
-int osnoise_set_timerlat_period_us(struct osnoise_context *context,
-				   long long timerlat_period_us);
-void osnoise_restore_timerlat_period_us(struct osnoise_context *context);
-
-int osnoise_set_tracing_thresh(struct osnoise_context *context,
-			       long long tracing_thresh);
-void osnoise_restore_tracing_thresh(struct osnoise_context *context);
-
-void osnoise_restore_print_stack(struct osnoise_context *context);
-int osnoise_set_print_stack(struct osnoise_context *context,
-			    long long print_stack);
-
-int osnoise_set_timerlat_align_us(struct osnoise_context *context,
-				  long long timerlat_align_us);
-void osnoise_restore_timerlat_align_us(struct osnoise_context *context);
-
-int osnoise_set_timerlat_align(struct osnoise_context *context, bool onoff);
-
-int osnoise_set_irq_disable(struct osnoise_context *context, bool onoff);
 void osnoise_report_missed_events(struct osnoise_tool *tool);
 int osnoise_apply_config(struct osnoise_tool *tool, struct osnoise_params *params);
 
-- 
2.54.0
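
For readers unfamiliar with the idiom, the patch above appears to rely
on X-macros: a list macro such as OSNOISE_FLAG_OPTIONS is expanded
several times, each time with a different definition of the per-entry
macro. Below is a minimal standalone sketch of that technique. The
DEMO_* names and the two-entry list are hypothetical and do not exist
in rtla, and the per-entry macro takes one argument here instead of
two to keep the example short.

```c
#define DEMO_OPTION_INIT_VAL (-1)

/* Hypothetical option list; each entry expands via DEMO_FLAG_OPTION(). */
#define DEMO_FLAG_OPTIONS		\
	DEMO_FLAG_OPTION(irq_disable)	\
	DEMO_FLAG_OPTION(workload)

/* First expansion: declare a pair of fields per option. */
struct demo_context {
#define DEMO_FLAG_OPTION(name)	int orig_opt_##name; int opt_##name;
	DEMO_FLAG_OPTIONS
#undef DEMO_FLAG_OPTION
};

/* Second expansion: initialize every field pair. */
static void demo_context_init(struct demo_context *context)
{
#define DEMO_FLAG_OPTION(name)					\
	context->orig_opt_##name = DEMO_OPTION_INIT_VAL;	\
	context->opt_##name	 = DEMO_OPTION_INIT_VAL;
	DEMO_FLAG_OPTIONS
#undef DEMO_FLAG_OPTION
}

/* Third expansion: generate one setter per option.
 * Returns 1 if the value changed, 0 if it was already set. */
#define DEMO_FLAG_OPTION(name)						\
static int demo_set_##name(struct demo_context *context, int onoff)	\
{									\
	if (context->opt_##name == onoff)				\
		return 0;						\
	context->opt_##name = onoff;					\
	return 1;							\
}
DEMO_FLAG_OPTIONS
#undef DEMO_FLAG_OPTION
```

Adding a new option then only means adding one line to the list macro;
the fields, the initialization and the setter follow automatically.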



* Re: [PATCH v4] rethook: Remove the running task check in rethook_find_ret_addr()
From: Masami Hiramatsu @ 2026-06-12 12:20 UTC (permalink / raw)
  To: XIAO WU
  Cc: sashiko-reviews, Petr Mladek, Peter Zijlstra, Tengda Wu,
	Mathieu Desnoyers, Alexei Starovoitov, Steven Rostedt,
	linux-kernel, linux-trace-kernel, live-patching
In-Reply-To: <tencent_3D17DC5BE32C8A51D938AF50F221321F6206@qq.com>

On Thu, 11 Jun 2026 23:53:36 +0800
XIAO WU <xiaowu.417@qq.com> wrote:

> Hi Tengda,
> 
> Sashiko [1] reviewed this patch and found that removing the
> task_is_running() check exposes stack unwinders to real crashes — not
> just "invalid information."  A PoC confirms this: a KASAN panic triggers
> within seconds when /proc/<pid>/stack reads the stack of a task that is
> concurrently running a kretprobe.

Hmm, why does /proc/<pid>/stack unwind the stack in such an unreliable
way? It should stop the target process, because it is exposed to
userspace, so it should be as safe as possible.

Anyway, thanks for reporting with the test program.

> 
> [1] 
> https://sashiko.dev/#/patchset/20260610013658.1837963-1-wutengda%40huaweicloud.com
> 
>  > diff --git a/kernel/trace/rethook.c b/kernel/trace/rethook.c
>  > index 5a8bdf88999a..1e7fdebe3cd5 100644
>  > --- a/kernel/trace/rethook.c
>  > +++ b/kernel/trace/rethook.c
>  > @@ -250,9 +251,6 @@ unsigned long rethook_find_ret_addr(struct 
> task_struct *tsk, unsigned long frame
>  >      if (WARN_ON_ONCE(!cur))
>  >          return 0;
>  >
>  > -    if (tsk != current && task_is_running(tsk))
>  > -        return 0;
>  > -
>  >      do {
>  >          ret = __rethook_find_ret_addr(tsk, cur);
>  >          if (!ret)
> 
> The commit message states:
> 
>  > The iteration is already safe from crashes because
>  > unwind_next_frame() holds RCU and rethook_node structures are
>  > RCU-freed; even if the iteration goes off the rails and returns
>  > invalid information, it will not crash.
> 
> There are two problems with this claim, both reproducible.
> 
> **Problem 1: stack-out-of-bounds in unwind_next_frame itself**
> 
> The PoC below reliably triggers the following KASAN panic — not in the
> rethook list traversal, but inside unwind_next_frame():
> 
> [ 1833.494623] BUG: KASAN: stack-out-of-bounds in 
> unwind_next_frame+0x861/0x2080
> [ 1833.494651] Read of size 2 at addr ffffc90003e6f5f0 by task poc/9854
> [ 1833.494707] Call Trace:
> [ 1833.494719]  dump_stack_lvl+0x116/0x1f0
> [ 1833.494743]  print_report+0xf4/0x600
> [ 1833.494788]  kasan_report+0xe0/0x110
> [ 1833.494836]  unwind_next_frame+0x861/0x2080
> [ 1833.494948]  arch_stack_walk+0x99/0x100
> [ 1833.495000]  stack_trace_save_tsk+0x16a/0x200
> [ 1833.495054]  proc_pid_stack+0x173/0x2b0
> [ 1833.495103]  seq_read_iter+0x519/0x12d0
> [ 1833.495166]  seq_read+0x3b7/0x590
> [ 1833.495297]  vfs_read+0x1f5/0xd20
> [ 1833.495497]  ksys_read+0x135/0x250
> [ 1833.495549]  do_syscall_64+0x129/0x850
> [ 1833.495566]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
> [ 1833.498894] Kernel panic - not syncing: KASAN: panic_on_warn set ...
> 
> page last free pid 9737 tgid 9737 stack trace:
>   do_sys_openat2+0xbf/0x260          <-- target task inside kretprobe
>   __x64_sys_openat+0x179/0x210
> 
> This crash has nothing to do with rethook_node lifetimes or RCU.  It
> happens because the ORC unwinder reads stack memory while the target
> task concurrently executes a kretprobe trampoline that modifies return
> addresses.  The unwinder follows corrupted frame data past valid stack
> boundaries.  RCU protection of rethook_node structures is irrelevant —
> this crash occurs at the stack frame interpretation level, before any
> rethook list traversal.

OK, in that case, I think we should not allow list traversal.
I think accessing /proc/<pid>/stack without freezing the target task
is potentially dangerous. Shouldn't we fix this first?

> 
> The old task_is_running() check prevented the unwinder from attempting
> to unwind a running task's stack in the first place.
> 
> **Problem 2: use-after-free via rethook_node recycling**
> 
> Even if the stack-out-of-bounds above were addressed, a second crash
> path exists in the rethook list traversal itself.
> 
> rethook_recycle() immediately pushes nodes back to the objpool without
> an RCU grace period:
> 
>    kernel/trace/rethook.c:
>    void rethook_recycle(struct rethook_node *node)
>    {
>            ...
>            objpool_push(node, &node->rethook->pool);
>    }
> 
> Meanwhile, unwind_next_frame() in arch/x86/kernel/unwind_orc.c drops
> RCU between frames while the cursor (*cur) persists across iterations:
> 
>    arch/x86/kernel/unwind_orc.c:
>    bool unwind_next_frame(...)
>    {
>            ...
>            guard(rcu)();    // RCU held for one frame
>            ...
>    }                        // RCU dropped here
> 
> When the unwinder calls __rethook_find_ret_addr() in the next frame
> iteration, it does:
> 
>    struct llist_node *first = tsk->rethooks.first;
>    ...
>    *cur = first;
>    ...
>    node = node->next;       // node may have been recycled
> 
> If the target task returns from a probed function between frames, its
> rethook_node is recycled and can be instantly reallocated to another
> task.  The unwinder's stale cursor then dereferences a freed pointer,
> leading to use-after-free.


OK, this is still a real problem. We should use call_rcu() to return
the object to the objpool.

Thanks!
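
The idea behind the call_rcu() suggestion can be modelled in userspace
as a two-stage free: a recycled node is parked on a deferred list and
only becomes allocatable again once a grace period has ended. This is
a sketch only. All demo_* names are hypothetical, there is no real RCU
here, and demo_grace_period_end() merely stands in for "every reader
that could still hold a cursor to the node has finished".

```c
#include <stddef.h>

struct demo_node {
	struct demo_node *list_next;	/* link a reader's cursor may follow */
	struct demo_node *pool_next;	/* separate link used by the pool */
};

struct demo_pool {
	struct demo_node *free;		/* nodes that may be handed out again */
	struct demo_node *deferred;	/* recycled, readers may still see them */
};

/* Park the node instead of making it reusable right away.
 * list_next is left untouched so a stale cursor still sees a sane link. */
static void demo_recycle_deferred(struct demo_pool *pool,
				  struct demo_node *node)
{
	node->pool_next = pool->deferred;
	pool->deferred = node;
}

/* Stand-in for the end of a grace period: move parked nodes to the
 * free list, from where they can be reallocated. */
static void demo_grace_period_end(struct demo_pool *pool)
{
	while (pool->deferred) {
		struct demo_node *node = pool->deferred;

		pool->deferred = node->pool_next;
		node->pool_next = pool->free;
		pool->free = node;
	}
}

/* Allocate from the free list only; parked nodes are not visible. */
static struct demo_node *demo_pool_pop(struct demo_pool *pool)
{
	struct demo_node *node = pool->free;

	if (node)
		pool->free = node->pool_next;
	return node;
}
```

In this model a node recycled between two unwinder frames cannot be
handed to another task until demo_grace_period_end() runs, which is
the property the immediate objpool_push() in the report lacks.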

> 
> ## Reproducer
> 
> The PoC sets up a kretprobe on do_sys_openat2, creates hot-loop threads
> calling open(), and concurrently reads /proc/<tid>/stack.  The race
> triggers within seconds (Problem 1 above; Problem 2 may reproduce on
> kernels without KASAN or with different timing).
> 
> Build:  gcc -static -pthread -o poc poc.c
> Run:    ./poc [runtime_seconds]
> Needs:  root, CONFIG_KASAN=y
> 
> #define _GNU_SOURCE
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <unistd.h>
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <sys/wait.h>
> #include <sys/syscall.h>
> #include <sched.h>
> #include <fcntl.h>
> #include <errno.h>
> #include <signal.h>
> #include <pthread.h>
> #include <dirent.h>
> 
> #define TRACE "/sys/kernel/tracing"
> 
> volatile int stop = 0;
> 
> static int tfs(const char *f, const char *b)
> {
>      char p[256]; int fd, r;
>      snprintf(p, 256, "%s/%s", TRACE, f);
>      fd = open(p, O_WRONLY | O_TRUNC);
>      if (fd < 0) {
>          system("mount -t tracefs tracefs /sys/kernel/tracing 2>/dev/null");
>          usleep(50000);
>          fd = open(p, O_WRONLY | O_TRUNC);
>      }
>      if (fd < 0) return -1;
>      r = write(fd, b, strlen(b));
>      close(fd);
>      return r < 0 ? -1 : 0;
> }
> 
> void *hot_thread(void *arg)
> {
>      while (!__atomic_load_n(&stop, __ATOMIC_RELAXED)) {
>          int fd = open("/dev/null", O_RDONLY);
>          if (fd >= 0) close(fd);
>      }
>      return NULL;
> }
> 
> void *reader_thread(void *arg)
> {
>      pid_t target = *(pid_t *)arg;
>      char path[64], buf[8192];
>      snprintf(path, 64, "/proc/%d/stack", target);
>      while (!__atomic_load_n(&stop, __ATOMIC_RELAXED)) {
>          int fd = open(path, O_RDONLY);
>          if (fd >= 0) { read(fd, buf, 8191); close(fd); }
>      }
>      return NULL;
> }
> 
> void sigh(int s) { stop = 1; }
> 
> int main(int argc, char *argv[])
> {
>      int runtime = 120;
>      if (argc > 1) runtime = atoi(argv[1]);
> 
>      printf("rethook race PoC\n");
>      if (geteuid()) { printf("root needed\n"); return 1; }
>      signal(SIGINT, sigh);
> 
>      pthread_t hot[4], rdr[4];
>      pid_t hot_tids[4];
>      int pairs = 4;
> 
>      for (int c = 0; c < runtime / 5 && !stop; c++) {
>          tfs("events/kprobes/myretprobe/enable", "0");
>          tfs("kprobe_events", "-:myretprobe");
>          usleep(100);
>          tfs("kprobe_events", "r:myretprobe do_sys_openat2 $retval");
>          tfs("events/kprobes/myretprobe/enable", "1");
> 
>          pid_t main_tid = syscall(SYS_gettid);
> 
>          for (int i = 0; i < pairs; i++)
>              pthread_create(&hot[i], NULL, hot_thread, NULL);
> 
>          usleep(300000);
> 
>          {
>              DIR *d = opendir("/proc/self/task");
>              int cnt = 0;
>              if (d) {
>                  struct dirent *de;
>                  while ((de = readdir(d)) != NULL && cnt < pairs) {
>                      pid_t t = atoi(de->d_name);
>                      if (t > 0 && t != main_tid)
>                          hot_tids[cnt++] = t;
>                  }
>                  closedir(d);
>              }
>              for (int i = 0; i < cnt; i++)
>                  pthread_create(&rdr[i], NULL, reader_thread, &hot_tids[i]);
>          }
> 
>          printf("round %d\n", c);
>          sleep(5);
> 
>          stop = 1;
>          usleep(100000);
> 
>          for (int i = 0; i < pairs; i++) pthread_join(hot[i], NULL);
>          for (int i = 0; i < pairs; i++) pthread_join(rdr[i], NULL);
> 
>          stop = 0;
>          usleep(1000);
>      }
> 
>      tfs("events/kprobes/myretprobe/enable", "0");
>      tfs("kprobe_events", "-:myretprobe");
>      printf("Done\n");
>      return 0;
> }
> 
> ## Summary
> 
> The v4 commit message claims the iteration "will not crash," but the PoC
> demonstrates a reproducible KASAN panic:
> 
> 1. stack-out-of-bounds in unwind_next_frame (ORC unwinder reads
>     concurrently-modified stack frames of a running task)
> 
> 2. Potential use-after-free in __rethook_find_ret_addr (rethook nodes
>     recycled without RCU grace period, cursor persists across RCU drops)
> 
> The old task_is_running() check was racy but served as a practical
> safety net.  Removing it without adding equivalent protection in the
> callers (proc_pid_stack, BPF stack walkers) exposes users to kernel
> panics via /proc/<pid>/stack on any task running a kretprobe.
> 
> Thanks,
> Xiao
> 
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>


* Re: [PATCH] tracing: fprobe: Remove __packed from generic __fprobe_header
From: Markus Schneider-Pargmann @ 2026-06-12 12:51 UTC (permalink / raw)
  To: Mathieu Desnoyers, Steven Rostedt, David Laight
  Cc: Masami Hiramatsu (Google),
	Markus Schneider-Pargmann (The Capable Hub), Heiko Carstens,
	linux-kernel, linux-trace-kernel
In-Reply-To: <0ea2ae74-7452-4ba5-9549-59197c766c25@efficios.com>

Hi,

On Wed Jun 10, 2026 at 10:05 PM CEST, Mathieu Desnoyers wrote:
> On 2026-06-10 15:51, Steven Rostedt wrote:
>> On Wed, 10 Jun 2026 12:06:59 +0100
>> David Laight <david.laight.linux@gmail.com> wrote:
>> 
>>> So you only want __packed on structures that might be misaligned and those
>>> that contain misaligned members.
>>>
>>> If the structure is only guaranteed to be 32bit aligned then use __packed
>>> __aligned(4) so that two 32bit accesses get used instead of 8 8bit ones.
>>>
>>> -- David
>>>
>>>>
>>>> Thank you,
>>>>    
>>>>> Signed-off-by: Markus Schneider-Pargmann (The Capable Hub) <msp@baylibre.com>
>>>>> ---
>>>>>   kernel/trace/fprobe.c | 2 +-
>>>>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/kernel/trace/fprobe.c b/kernel/trace/fprobe.c
>>>>> index cc49ebd2a773..21751dcdb7b9 100644
>>>>> --- a/kernel/trace/fprobe.c
>>>>> +++ b/kernel/trace/fprobe.c
>>>>> @@ -181,7 +181,7 @@ static inline void read_fprobe_header(unsigned long *stack,
>>>>>   struct __fprobe_header {
>>>>>   	struct fprobe *fp;
>>>>>   	unsigned long size_words;
>>>>> -} __packed;
>>>>> +};
>>>>>   
>> 
>> Does "__packed" really do anything between a pointer and a long?
>
> If that structure is allocated at a non-void-ptr-aligned address, the
> packed attribute will ensure that the compiler doesn't emit instructions
> that require aligned loads/stores when accessing those fields.
>
> It does not change the layout of the structure per se in this specific
> case, but it informs the compiler about the lack of guarantees about
> alignment for the entire structure.
>
> x86 32/64 cannot care less about this, but it's relevant on other
> architectures.

Thanks for your feedback. I checked this before submitting the patch.
The struct is always aligned to sizeof(long):

struct __fprobe_header is only ever accessed through
read_fprobe_header() and write_fprobe_header(). Since the read will only
read what we have previously written, only the write part is relevant
here. write_fprobe_header() is only called from fprobe_fgraph_entry():

  if (write_fprobe_header(&fgraph_data[used], fp, size_words))
    used += FPROBE_HEADER_SIZE_IN_LONG + size_words;

used is always kept aligned to sizeof(long), in fact the above snippet
is the only part where it is actually changed. fgraph_data is assigned
here:

  fgraph_data = fgraph_reserve_data(gops->idx, reserved_words * sizeof(long));

fgraph_reserve_data() returns a pointer into an unsigned long array
ret_stack. ret_stack is allocated with

  ret_stack = kmem_cache_alloc(fgraph_stack_cachep, GFP_KERNEL);

and fgraph_stack_cachep is allocated with

  fgraph_stack_cachep = kmem_cache_create("fgraph_stack",
                                          SHADOW_STACK_SIZE,
                                          SHADOW_STACK_SIZE, 0, NULL);

So as far as I can see everything is sizeof(long) aligned here and it is
not allocated at a non-void-ptr-aligned address.
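
The argument above can be sketched as a small standalone check: a
header placed at any word index of an unsigned long array is naturally
aligned, as long as the header's alignment does not exceed that of
unsigned long. The demo_* names are hypothetical, void * stands in for
struct fprobe *, and the _Static_assert encodes the assumption (true
on Linux targets, where long and pointers have the same size).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Same two fields as __fprobe_header, without __packed. */
struct demo_header {
	void *fp;
	unsigned long size_words;
};

/* Assumption: a long-aligned address is good enough for the header. */
_Static_assert(_Alignof(struct demo_header) <= _Alignof(unsigned long),
	       "header needs stricter alignment than unsigned long");

#define DEMO_HEADER_SIZE_IN_LONG \
	((sizeof(struct demo_header) + sizeof(long) - 1) / sizeof(long))

/* True if a header stored at &stack[used] is naturally aligned. */
static bool demo_header_is_aligned(const unsigned long *stack, size_t used)
{
	uintptr_t addr = (uintptr_t)&stack[used];

	return addr % _Alignof(struct demo_header) == 0;
}

/* Advance the word index past one header plus its payload, mirroring
 * "used += FPROBE_HEADER_SIZE_IN_LONG + size_words". */
static size_t demo_next_used(size_t used, size_t size_words)
{
	return used + DEMO_HEADER_SIZE_IN_LONG + size_words;
}
```

Because the index only ever advances in whole words, every header
address is &stack[n] for some n, so the check holds for any payload
size.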

Best
Markus



* Re: [GIT PULL] RTLA changes for 7.2
From: Tomas Glozar @ 2026-06-12 14:19 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: Costa Shulyupin, Crystal Wood, LKML, linux-trace-kernel
In-Reply-To: <20260529130643.3080315-1-tglozar@redhat.com>

On Fri, May 29, 2026 at 15:16, Tomas Glozar <tglozar@redhat.com> wrote:
>
> ----------------------------------------------------------------
> Costa Shulyupin (1):
>       tools/rtla: Fix --dump-tasks usage in timerlat
>
> Crystal Wood (1):
>       rtla: Stop the record trace on interrupt
>
> Tomas Glozar (24):
>       rtla/tests: Cover both top and hist tools where possible
>       rtla/tests: Add get_workload_pids() helper
>       rtla/tests: Check -c/--cpus thread affinity
>       rtla/tests: Use negative match when testing --aa-only
>       rtla/tests: Extend timerlat top --aa-only coverage
>       rtla/tests: Cover all hist options in runtime tests
>       rtla/tests: Add runtime test for -H/--house-keeping
>       rtla/tests: Add runtime test for -k and -u options
>       rtla/tests: Add runtime tests for -C/--cgroup
>       rtla/tests: Add unit tests for actions module
>       rtla/actions: Restore continue flag in actions_perform()
>       rtla/tests: Add unit test for restoring continue flag
>       rtla/tests: Run runtime tests in temporary directory
>       rtla/tests: Add runtime tests for restoring continue flag
>       rtla: Add libsubcmd dependency
>       tools subcmd: support optarg as separate argument
>       tools subcmd: allow parsing distinct --opt and --no-opt
>       rtla: Parse cmdline using libsubcmd

This will create a conflict with a fix in master/7.1 [1] as it removes
the code that is fixed by the patch, as reported in linux-next already
[2]. The resolution is trivial (just remove the new, fixed code, as it
is entirely replaced by a new implementation that doesn't have the
bug) but a small note in the final pull request might be useful, so
that it's clear we know about it.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e9e41d3035032ed6053d8bad7b7077e1cb3a6540
[2] https://lore.kernel.org/linux-next/aimDwlNq_RRLjg6X@sirena.co.uk/T/#u

>       rtla/tests: Add unit tests for _parse_args() functions
>       rtla/tests: Add unit tests for CLI option callbacks
>       rtla/timerlat: Add -A/--aligned CLI option
>       rtla/tests: Add unit tests for -A/--aligned option
>       Documentation/rtla: Add -A/--aligned option
>       rtla: Document tests in README
>

Tomas



* Re: [PATCHv7 bpf-next 03/29] ftrace: Add add_ftrace_hash_entry function
From: Steven Rostedt @ 2026-06-12 14:44 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Kumar Kartikeya Dwivedi, Jiri Olsa, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, bpf, linux-trace-kernel,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	Menglong Dong
In-Reply-To: <DJ6EGJ8S87HP.2WOTGYK374XKI@gmail.com>

On Thu, 11 Jun 2026 10:35:11 -0700
"Alexei Starovoitov" <alexei.starovoitov@gmail.com> wrote:


> AI finds things to consider, but when they're considered and postponed
> to the future it doesn't understand that and keeps reporting the same
> thing every revision. So it might look like patches are landing with
> outstanding AI complaints, but this is not the case.

Well, if someone just asked me to give an ack then I would have. But I
have other things to work on. Especially since everything I do at the
moment is 100% hobby related. The chores my wife gives me now have
priority ;-)

I'll start a new job come Monday.

> 
> btw since patches touch ftrace from time to time should we add your
> ftrace testsuite to bpf CI ?

That's actually a good idea.

> How automated is it?

Very. In fact it's public. Although it's been a few years since I
updated the git repos.

I have two qemu images (currently private, but I can update them and
share): one is a 32bit x86 image and the other is a 64bit image.

The 64bit image hostname is called tracetest and the 32bit image's
hostname is tracetest-32. Both with root password of test0000.

The tests loaded on the image are here:

  https://github.com/rostedt/ftrace-tests

And the ktests I run are here:

  https://github.com/rostedt/ftrace-ktests

I would run the ktest like this:

  ktest.pl -DPATCH_CHECKOUT:=<SHA/BRANCH> -DPATCH_START:=<first-commit> tracetest-64.conf

And in another window

  ktest.pl -DPATCH_CHECKOUT:=<SHA/BRANCH> -DPATCH_START:=<first-commit> tracetest-32.conf

For example:

  ktest.pl -DPATCH_CHECKOUT:=trace/ftrace/core -DPATCH_START:=b5d6d3f73d0bac4a7e3a061372f6da166fc6ee5c tracetest-64.conf

And that will run 40 tests (I added some more since my last push, so
github doesn't have 40) and build, boot, install, test on the qemu
64bit image. It starts out testing commits from
b5d6d3f73d0bac4a7e3a061372f6da166fc6ee5c and goes through to
trace/ftrace/core. Note, the PATCH_START needs to be in the history of
the PATCH_CHECKOUT otherwise the test will fail.

If you want to run these, let me know and I can help with the setup.
It's what I gave Masami to test as well.

> 
> >  
> >> 
> >> While at it, please review Mykyta's set:
> >> https://patchwork.kernel.org/user/todo/netdevbpf/?series=1096695
> >> 
> >> It's also been pending for almost a month now.  
> >
> > Have a better link? I just get a blank page as "TODO" is set to what I have.  
> 
> Ohh. I meant this set:
> https://lore.kernel.org/bpf/CAEf4BzZFjsEv3aLktwdCZF6EXoCL+eefX+6xa3XGrhBmfO1SqA@mail.gmail.com/
> where you said that you'll think more about it after pto.

I came back on Tuesday and have yet to catch up on all the email I
ignored while away :-p

> Would be great to land it now for this merge window, so we have
> discoverability right now and if better approach comes in the future
> we can adjust to it later.
>  

I'll take a look at it.

-- Steve


* Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: Gregory Price @ 2026-06-12 15:29 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Balbir Singh, lsf-pc, linux-kernel, linux-cxl, cgroups, linux-mm,
	linux-trace-kernel, damon, kernel-team, gregkh, rafael, dakr,
	dave, jonathan.cameron, dave.jiang, alison.schofield,
	vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, ying.huang, apopple, axelrasmussen, yuanchu, weixugc,
	yury.norov, linux, mhiramat, mathieu.desnoyers, tj, hannes,
	mkoutny, jackmanb, sj, baolin.wang, npache, ryan.roberts,
	dev.jain, baohua, lance.yang, muchun.song, xu.xin16,
	chengming.zhou, jannh, linmiaohe, nao.horiguchi, pfalcato,
	rientjes, shakeel.butt, riel, harry.yoo, cl, roman.gushchin,
	chrisl, kasong, shikemeng, nphamcs, bhe, zhengqi.arch,
	terry.bowman
In-Reply-To: <ainFROZ3WrGioyuY@gourry-fedora-PF4VCD3F>

On Wed, Jun 10, 2026 at 04:12:52PM -0400, Gregory Price wrote:
> On Wed, Jun 10, 2026 at 08:59:59PM +0200, David Hildenbrand (Arm) wrote:
> > > 
> > > I understand this question in two ways:
> > > 
> > >   1) Can we disallow PAGE allocation and limit this to FOLIO allocation
> > 
> > Yes. Can we only allow folios to be allocated from private memory nodes. So let
> > me reply to that one below.
> > 
> ... snip ...
> > 
> > At LSF/MM we talked about how GFP flags are bad and how deriving stuff from the
> > context might be better. I think there was also talk about how the memalloc_*
> > interface might be a better way forward. Maybe we would start giving the
> > allocator more context ("we are allocating a folio").
> > 
> > The following is incomplete (esp. hugetlb stuff I assume), just as some idea:
> >
> 
> I will still probably send the next RFC version tomorrow or friday,
> as I want to get some eyes on the __GFP_PRIVATE-less pattern.
> 
> Also, I made a new `anondax` driver which enables userland testing
> of this functionality without any specialty hardware.
> 

(apologies for the length of this email: this will all be covered in
the coming cover letter, but I just wanted to share a bit of a preview)

===

Just another small update - I am planning to post the RFC today once I
get some mild cleanup done.  It will be based on the dax atomic hotplug

https://lore.kernel.org/linux-mm/20260605211911.2160954-1-gourry@gourry.net/

Here are a couple of specific details regarding the memalloc pieces
that I've learned over the past couple of days playing with it.

1) memalloc_folio is required to ensure non-folio allocations don't land
   on the private node, even if it happens within a memalloc_private
   context.  Since memalloc_folio may be useful in contexts outside of
   private nodes, I kept this as a separate flag.

   If we think there will *never* be additional users of memalloc_folio,
   then we could fold _folio into _private to save the flag for now and
   add it back when we actually need it.

2) memalloc_private is needed to unlock private nodes, but in the
   original NOFALLBACK-only design, you also needed __GFP_THISNODE.

   This is *highly* restrictive.  I found when playing with mbind that
   MPOL_BIND + __GFP_THISNODE generates a WARN (valid WARN, it normally
   implies a bug). 

   That leads me to #3

3) If a private node is opted into something like Demotion (the node is
   a demotion target) or mbind(), such that normal kernel operation can
   place memory there - it's *pseudo-private*, and should actually land
   in its own FALLBACK list (reachable without __GFP_THISNODE, but not
   reachable as a normal fallback allocation target).

I'm still playing with this, but I think we can even omit the
__GFP_THISNODE requirement (my initial feeling that __GFP_THISNODE
didn't buy us anything in particular seems to have panned out).

At the end of the day, this makes the whole memalloc_private_save()
pattern a heck of a lot cleaner than trying to fiddle with GFP.
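
To illustrate the scoped save/restore shape being described, here is a
userspace model. It is a sketch under assumptions: the demo_* names
and flag bits are hypothetical, a thread-local variable stands in for
the per-task flags, and the "both scopes required" predicate is one
reading of points 1) and 2) above, not the code from the coming RFC.
The save/restore pair follows the same shape as the existing
memalloc_*_save()/restore() helpers.

```c
#include <stdbool.h>

#define DEMO_MEMALLOC_PRIVATE	(1u << 0)	/* unlocks private nodes */
#define DEMO_MEMALLOC_FOLIO	(1u << 1)	/* allocation is a folio */

/* Stand-in for the per-task flags word. */
static _Thread_local unsigned int demo_task_flags;

/* Enter a scope: set the flag and return its previous state. */
static unsigned int demo_memalloc_save(unsigned int flag)
{
	unsigned int old = demo_task_flags & flag;

	demo_task_flags |= flag;
	return old;
}

/* Leave a scope: put the flag back to what save() returned, so that
 * nested scopes of the same kind do not clear the outer one early. */
static void demo_memalloc_restore(unsigned int flag, unsigned int old)
{
	demo_task_flags = (demo_task_flags & ~flag) | old;
}

/* Assumed policy: a private node is eligible only for folio
 * allocations made inside a private scope. */
static bool demo_private_node_allowed(void)
{
	unsigned int need = DEMO_MEMALLOC_PRIVATE | DEMO_MEMALLOC_FOLIO;

	return (demo_task_flags & need) == need;
}
```

The caller never passes anything through the allocation call itself;
the context decides, which is the point of preferring this over a
dedicated GFP flag.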

I think you will all enjoy how clean the code ends up, and how easily
testable it is.

As a testbed I've implemented an anondax driver (we can discuss naming) that
adds some sample NODE_PRIVATE_OPT_* flags so you can do the following.

I'm including this in the next RFC - but we can hack the entire thing
off (including the OPT flags) if we prefer to just get the base set in
without a new driver as a start.

echo 1 > dax0.0/reclaim   # kswapd and reclaim run normally on this node
echo 1 > dax0.0/demotion  # it is a demotion target
echo 1 > dax0.0/mbind     # mbind() can target this node for anon-vma's
echo 1 > dax0.0/madvise   # allow madvise() to operate on its folios
echo 1 > dax0.0/numa_balance  # allow numa balancing for this node
echo 1 > dax0.0/ltpin     # allow GUP longterm pin to operate normally
echo * > dax0.0/adistance # set the adistance for hotplug time
echo * > dax0.0/hotplug   # same as kmem/hotplug

This also means *existing hardware* can leverage private nodes if
they're capable of generating a dax device.

I've even gotten it such that you can put a private node above dram in
the adistance hierarchy - which means demotion flows downward from
device to CPU, but allocations don't default or fallback there.

This seems *immediately* useful for a variety of use cases.

~Gregory


* Re: [PATCH] tracing: fprobe: Remove __packed from generic __fprobe_header
From: Steven Rostedt @ 2026-06-12 16:36 UTC (permalink / raw)
  To: Markus Schneider-Pargmann
  Cc: Mathieu Desnoyers, David Laight, Masami Hiramatsu (Google),
	Heiko Carstens, linux-kernel, linux-trace-kernel
In-Reply-To: <DJ7328G40P9R.YB03MWLT8GQF@baylibre.com>

On Fri, 12 Jun 2026 14:51:58 +0200
"Markus Schneider-Pargmann" <msp@baylibre.com> wrote:

>   fgraph_data = fgraph_reserve_data(gops->idx, reserved_words * sizeof(long));
> 
> fgraph_reserve_data() returns a pointer into an unsigned long array
> ret_stack. ret_stack is allocated with

Correct. It is in fact a requirement that fgraph_reserve_data() returns
a long aligned pointer.

-- Steve


