Linux Trace Kernel

Linux Trace Kernel
 help / color / mirror / Atom feed

* Re: [PATCH RFC v4 10/44] KVM: guest_memfd: Add support for KVM_SET_MEMORY_ATTRIBUTES2
From: Michael Roth @ 2026-04-29 23:21 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	ira.weiny, jmattson, jthoughton, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
	wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
	aneesh.kumar, Paolo Bonzini, Sean Christopherson, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Jonathan Corbet, Shuah Khan, Shuah Khan, Vishal Annapurve,
	Andrew Morton, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
	Baoquan He, Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest, linux-mm
In-Reply-To: <CAEvNRgGbMhkX310CkFY_M5x-zod=BDTiuznrZ0XvFPUK7weL1A@mail.gmail.com>

On Fri, Apr 24, 2026 at 12:08:45PM -0700, Ackerley Tng wrote:
> Michael Roth <michael.roth@amd.com> writes:
> 
> Thank you for your patches!
> 
> >
> > [...snip...]
> >
> >>
> >> I also did some minor updates (prefixed with a "[squash]" tag) to advertise
> >> the KVM_SET_MEMORY_ATTRIBUTES2_PRESERVED flag so it can be used by
> >
> > Though I'm not sure how we deal with it if SNP/TDX at some point become
> > capable of using the PRESERVED flag *after* populate... but maybe that's
> > too unlikely to worry about? If we wanted to address it though, we could
> > have both PRESERVED and PRESERVED_BEFORE_LAUNCH so they can be
> > enumerated separately from the start.
> >
> 
> Not sure how likely it is, but if SNP and TDX can honor PRESERVE
> semantics after populate, I think we could implement support under a new
> flag like CIPHER.

That works, but it still makes things *slightly* awkward due to special-casing
the PRESERVE semantics for 1 guest type vs. another.

For instance, for the QEMU patches what I'd like to do is be able to
issue conversions with the PRESERVE flag set, and it makes sense that
when the function that handles that gets called (in this case,
guest_memfd_set_memory_attributes_fd()), it asserts that QEMU
checked for the corresponding capability/flag (via
KVM_CAP_MEMORY_ATTRIBUTES2_FLAGS) earlier on during QEMU init time
and stored that into kvm_supported_memory_attributes2_flag to check
against later in cases where we try to use PRESERVE:

  https://github.com/AMDESE/qemu/blob/1308d4bcd8dd8acd151243e96ff0be4098389b93/accel/kvm/kvm-all.c#L1655

but that check is sort of incomplete: for use-cases like pKVM it makes
sense to allow conversion+PRESERVE at all times, but for SNP/TDX we'd
ideally have an additional check like:

  static int guest_memfd_set_memory_attributes_fd(..., attrs, flags)
  {
      struct kvm_memory_attributes2 attrs;
      ...

      assert(kvm_supported_memory_attributes & attrs);
      assert(kvm_supported_memory_attributes2_flags & flags);

      /* SNP and TDX only support PRESERVE prior to launch) */
      if (vm_type_is_snp_or_tdx() && (flags & PRESERVE))
          assert(vcpus_not_yet_started_and_LAUNCH_FINISH_etc_not_yet_called);
  
      ...
  }

maybe that's fine, but there's just some asymmetry in the fact that
the capabilities checks are essentially self-documenting/self-enforcing
in this regard, except for PRESERVE+SNP/TDX where userspace needs to
sprinkle some additional vm_type-specific logic.

So that's why I was sort of thinking PRESERVE might deserve a
PRESERVE_BEFORE_LAUNCH or something so the kernel can enumerate the
support rather than userspace needing to read into the documentation /
implementation to implement finer-grained checks/etc.

> 
> CIPHER can then be used to mean "do the encryption or decryption", and
> for platforms not supporting encryption, they'd stick with PRESERVE?

It's all theoretical, but I'd imagine that *if* SNP/TDX ever added
support for allowing userspace to write encrypted memory into a guest
range, it would be something like:

  - write plain-text data to corresponding GPA
  - issue shared-to-private conversion with CIPHER (or PRESERVE) bit
    set, which will then be augmented to make some sort of firmware
    call to encrypt the memory with guest key and make it visible to
    guest

So, at a high-level, if host writes 0xbeef to shared memory, then upon
completion of the converison ioctl, 0xbeef then be visible to the guest
via encrypted/private memory, which seems to match the semantics of
PRESERVE perfectly:

+``KVM_SET_MEMORY_ATTRIBUTES2_PRESERVE``
+
+  On conversion, KVM guarantees memory contents will be preserved with
+  respect to the last written unencrypted value.  As a concrete
+  example, if the host writes ``0xbeef`` to shared memory and converts 
+  the memory to private, the guest will also read ``0xbeef``, even if
+  the in-memory data is encrypted as part of the conversion.  And vice
+  versa, if the guest writes ``0xbeef`` to private memory and then
+  converts the memory to shared, the host (and guest) will read
+  ``0xbeef`` (if the memory is accessible).

So it seems like a waste to have to add a CIPHER that would essentially
have the same semantics (even though the architectural implementation
details are different), and then complicate the documentation of
PRESERVE to have vm-specific behaviors that userspace needs to account
for, rather than leaving PRESERVE as-is and adding a
PRESERVE_BEFORE_LAUNCH that neatly encapsulates the specialness of the
SNP/TDX flow without complicating the more generally-defined flags like
PRESERVE that could in theory become usable by all variants in the future.

But then that sort of has me wondering if PRESERVE should be
PRESERVE_AFTER_LAUNCH (to cover cases where maybe SNP/TDX gain this
ability in the future, but can only use it post-launch, where other
architectures might allow it both before/after.)

...but that seems to be getting into a more general issue of how we
determine what is available before-launch/after-launch, so maybe instead
of introducing new flags, we just have 2 capabilities that returns what
flags are available at each particular phase, e.g.

  KVM_CAP_MEMORY_ATTRIBUTES2_PRE_LAUNCH_FLAGS
  KVM_CAP_MEMORY_ATTRIBUTES2_POST_LAUNCH_FLAGS

And then userspace can easily probe/lock down what it expects to be
available prior to launching the guest vs after without any vm-specific
logic, and we can continue to just stick with PRESERVE/ZERO. (and if
we do end up need to further distinguish pre vs. post-launch behavior,
this gives us some flexibility to handle that as well).

> 
> Should we redefine the semantics of PRESERVE to be "ensure that memory
> contents don't change while guest_memfd tracking is being updated" and
> avoid making a commitment on how the guest should read the memory?
> 
> The above update would be aligned with ZERO not being allowed for
> conversions to private (because KVM/guest_memfd does not make guarantees
> about the contract between the host and guest.

It's a little awkward here too, because PRESERVE is sort of making a
contract between host/guest: we are providing trusted data to the guest.
However, there's an attestation proces that allows the guest to enforce
that for pre-launch and ensure we are following through on that
contract. PRESERVE for post-launch...I'm not sure how that could be
enforced, but I think this is further reason to allow pre vs. post
launch flags to be enumerated separately.

Thanks,

Mike

> 
> This way, all of those (ZERO, PRESERVE) will focus on KVM's interface
> with the host.
> 
> This lines up for SW_PROTECTED_VMs too, since reading memory that didn't
> change in the guest is the contract between SW_PROTECTED_VMs and the
> host.
> 
> >> userspace for SNP/TDX in the kvm_gmem_populate() path as agreed upon
> >> during PUCK.
> >>
> >>
> >> [...snip...]
> >>

^ permalink raw reply

* Re: [PATCH] trace_printk: replace _______STR with __UNIQUE_ID(STR)
From: David Laight @ 2026-04-29 21:47 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Qian-Yu Lin, mhiramat, mathieu.desnoyers, linux-kernel,
	linux-trace-kernel
In-Reply-To: <20260429134226.49e95e9d@robin>

On Wed, 29 Apr 2026 13:42:26 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Thu, 30 Apr 2026 00:57:07 +0800
> Qian-Yu Lin <tiffany019230@gmail.com> wrote:
> 
> > The macro trace_printk() uses a hardcoded identifier _______STR
> > within a statement expression, which can lead to variable name
> > shadowing if a caller happens to use the same name in its scope.  
> 
> Has this ever been a problem?
> 
> > 
> > Following the pattern in commit 24ba53017e18 ("rcu: Replace ________p1
> > and _________p1 with __UNIQUE_ID(rcu)") and commit 589a9785ee3a
> >  ("min/max: remove sparse warnings when they're nested"), replace the
> >  hardcoded identifier with __UNIQUE_ID(STR).

min and max do get nested - so it is important that the 'local'
variables have different names - otherwise you can get invalid expansions.

No one is going to have: trace_printk(fmt1, trace_printk(ftm2, ...), ...)
it just doesn't make sense.

There is a slight problem the ____________________________STR might
be used by an entirely different #define.
Just using _trace_printk_str[] would make this pretty unlikely.
It would also have the advantage of making the .i file a bit easier
to read, all the UNIQUE names in min/max output make it really hard
to see what the output actually means.

> > 
> > Since __UNIQUE_ID() must be expanded once to remain consistent across
> > declaration and sizeof() within the statement expression, introduce a
> > nested helper macro ___trace_printk.  
> 
> Hmm, so we are replacing one name with underscores with another name
> with underscores?
> 
> > 
> > Signed-off-by: Qian-Yu Lin <tiffany019230@gmail.com>
> > ---
> >  include/linux/trace_printk.h | 10 +++++++---
> >  1 file changed, 7 insertions(+), 3 deletions(-)
> > 
> > diff --git a/include/linux/trace_printk.h b/include/linux/trace_printk.h
> > index 2670ec7f4262..060eccb40838 100644
> > --- a/include/linux/trace_printk.h
> > +++ b/include/linux/trace_printk.h
> > @@ -2,6 +2,7 @@
> >  #ifndef _LINUX_TRACE_PRINTK_H
> >  #define _LINUX_TRACE_PRINTK_H
> >  
> > +#include <linux/compiler.h>  
> 
> People are already saying that trace_printk.h slows down the compile.
> Does this add any overhead to the compile?

A little - nothing is free.

	David

> 
> -- Steve
> 
> 
> >  #include <linux/compiler_attributes.h>
> >  #include <linux/instruction_pointer.h>
> >  #include <linux/stddef.h>
> > @@ -84,15 +85,18 @@ do {									\
> >   * let gcc optimize the rest.
> >   */
> >  
> > -#define trace_printk(fmt, ...)				\
> > +#define ___trace_printk(fmt, str, ...)				\
> >  do {							\
> > -	char _______STR[] = __stringify((__VA_ARGS__));	\
> > -	if (sizeof(_______STR) > 3)			\
> > +	char str[] = __stringify((__VA_ARGS__));	\
> > +	if (sizeof(str) > 3)			\
> >  		do_trace_printk(fmt, ##__VA_ARGS__);	\
> >  	else						\
> >  		trace_puts(fmt);			\
> >  } while (0)
> >  
> > +#define trace_printk(fmt, ...) \
> > +	___trace_printk(fmt, __UNIQUE_ID(str), ##__VA_ARGS__)
> > +
> >  #define do_trace_printk(fmt, args...)					\
> >  do {									\
> >  	static const char *trace_printk_fmt __used			\  
> 
> 


^ permalink raw reply

* Re: [PATCH] mm/madvise: preserve uprobe breakpoints across MADV_DONTNEED
From: Daniel Walker (danielwa) @ 2026-04-29 21:11 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: David Hildenbrand (Arm),
	Darko Tominac -X (dtominac - GLOBALLOGIC INC at Cisco),
	Masami Hiramatsu, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	James Clark, Andrew Morton, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, xe-linux-external(mailer list),
	linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org,
	linux-perf-users@vger.kernel.org, linux-mm@kvack.org
In-Reply-To: <afIiqR5CqYRId0N0@redhat.com>

On Wed, Apr 29, 2026 at 05:24:25PM +0200, Oleg Nesterov wrote:
> On 04/29, David Hildenbrand (Arm) wrote:
> >
> > On 4/29/26 15:15, Darko Tominac wrote:
> >
> > > uprobe infrastructure does not re-instrument pages on individual page
> > > faults (uprobe_mmap() is only called during VMA creation, not on
> > > page-in), the breakpoints are silently lost once the discarded pages are
> > > re-read from the backing file.  The probes stop firing with no error
> > > indication, and the only recovery is to unregister and re-register the
> > > affected uprobes.
> >
> > Right. Don't MADV_DONTNEED uprobes, just like you are not supposed to
> > MADV_DONTNEED debugger breakpoints/set data etc. :)
> 
> Agreed, thanks.


Shouldn't there be some sort of compensation or notification for this, or is each person that
hits this suppose to just scratch their head and send a patch that's rejected?

Daniel

^ permalink raw reply

* Re: [RFC][PATCH] unwind: Add stacktrace_setup system call
From: Steven Rostedt @ 2026-04-29 20:03 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: LKML, Linux Trace Kernel, Masami Hiramatsu, Jens Remus,
	Josh Poimboeuf, Peter Zijlstra, Ingo Molnar, Jiri Olsa,
	Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
	Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi, Beau Belgrave,
	Linus Torvalds, Andrew Morton, Florian Weimer, Kees Cook,
	Carlos O'Donell, Sam James, Dylan Hatch, Borislav Petkov,
	Dave Hansen, David Hildenbrand, H. Peter Anvin, Liam R. Howlett,
	Lorenzo Stoakes, Michal Hocko, Mike Rapoport, Suren Baghdasaryan,
	Vlastimil Babka, Heiko Carstens, Vasily Gorbik
In-Reply-To: <20260429145815.69dcb1c0@gandalf.local.home>

On Wed, 29 Apr 2026 14:58:15 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> > If the main executable doesn't have a sframe section, I don't see why this
> > shouldn't be allowed to add one.  
> 
> The address is added via mtree_insert_range() which states it will return
> -EEXISTS if the range is occupied.

I confirmed this. I changed the test program to load its own sframe section
again, and it errors out with -EEXISTS.

-- Steve

^ permalink raw reply

* Re: [RFC][PATCH] unwind: Add stacktrace_setup system call
From: Steven Rostedt @ 2026-04-29 18:58 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: LKML, Linux Trace Kernel, Masami Hiramatsu, Jens Remus,
	Josh Poimboeuf, Peter Zijlstra, Ingo Molnar, Jiri Olsa,
	Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
	Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi, Beau Belgrave,
	Linus Torvalds, Andrew Morton, Florian Weimer, Kees Cook,
	Carlos O'Donell, Sam James, Dylan Hatch, Borislav Petkov,
	Dave Hansen, David Hildenbrand, H. Peter Anvin, Liam R. Howlett,
	Lorenzo Stoakes, Michal Hocko, Mike Rapoport, Suren Baghdasaryan,
	Vlastimil Babka, Heiko Carstens, Vasily Gorbik
In-Reply-To: <20260429145512.461e09fe@gandalf.local.home>

On Wed, 29 Apr 2026 14:55:12 -0400
Steven Rostedt <rostedt@kernel.org> wrote:

> > > + *
> > > + * This system call is used by dynamic library utilities to inform the kernel    
> > 
> > Out of curiosity: what would happen if the application chooses to invoke
> > this system call to register sframe for the main executable ? Should it be
> > allowed ?  
> 
> I haven't tried, and should look at the code. Ideally, no two sframes
> sections should cover the same code. I think it should error out if it does.
> 
> If the main executable doesn't have a sframe section, I don't see why this
> shouldn't be allowed to add one.

The address is added via mtree_insert_range() which states it will return
-EEXISTS if the range is occupied.

-- Steve

^ permalink raw reply

* Re: [RFC][PATCH] unwind: Add stacktrace_setup system call
From: Steven Rostedt @ 2026-04-29 18:55 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: LKML, Linux Trace Kernel, Masami Hiramatsu, Jens Remus,
	Josh Poimboeuf, Peter Zijlstra, Ingo Molnar, Jiri Olsa,
	Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
	Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi, Beau Belgrave,
	Linus Torvalds, Andrew Morton, Florian Weimer, Kees Cook,
	Carlos O'Donell, Sam James, Dylan Hatch, Borislav Petkov,
	Dave Hansen, David Hildenbrand, H. Peter Anvin, Liam R. Howlett,
	Lorenzo Stoakes, Michal Hocko, Mike Rapoport, Suren Baghdasaryan,
	Vlastimil Babka, Heiko Carstens, Vasily Gorbik
In-Reply-To: <ade15347-6eca-4204-9f64-9b2eed9e9aec@efficios.com>

On Wed, 29 Apr 2026 14:42:52 -0400
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> On 2026-04-29 11:43, Steven Rostedt wrote:
> [...]
> 
> AFAIU this is a functional iteration of my prior RFC here:
> 
> https://lore.kernel.org/lkml/2fa31347-3021-4604-bec3-e5a2d57b77b5@efficios.com/
> 
> perhaps it would be valuable to refer to that thread ?

Thanks, I actually decided to start this from scratch as I took a slightly
different approach than yours. But you're right, I should have mentioned it
anyway.

> 
> >   }
> > +
> > +/**
> > + * sys_stacktrace_setup - register an address for user space stacktrace walking.
> > + * @op: Type of operation to perform
> > + * @addr_start: The virtual address of the stacktrace information
> > + * @addr_length: The length of the stacktrace information
> > + * @text_start: The virtual address of the text that @addr_start represents
> > + * @text_length: The length of teh text  
> 
> the
> 
> also, perhaps add @flags just in case ? This allows changing the
> behavior without having to insert entirely new set of ops.

I thought about it and decided it was somewhat redundant to the ops. After
working on the futex code, I thought following that API was a better
approach.

> 
> > + *
> > + * This system call is used by dynamic library utilities to inform the kernel  
> 
> Out of curiosity: what would happen if the application chooses to invoke
> this system call to register sframe for the main executable ? Should it be
> allowed ?

I haven't tried, and should look at the code. Ideally, no two sframes
sections should cover the same code. I think it should error out if it does.

If the main executable doesn't have a sframe section, I don't see why this
shouldn't be allowed to add one.

> 
> > + * of meta data that it loaded that can be used by the kernel to know how
> > + * to stack walk the given text locations.
> > + *
> > + * Currently only sframes are supported, but in the future, this may be used
> > + * to tell the kernel about JIT code which will most likely have a different
> > + * format.
> > + *
> > + * The type command may be extended and parameters may be used for other
> > + * purposes.  
> 
> "type" or "op" ?

Bah, I originally had called it "type" and then renamed it to "op" (to
be consistent with the futex commands). I missed updating this. Thanks for
catching that.

-- Steve

^ permalink raw reply

* Re: [RFC][PATCH] unwind: Add stacktrace_setup system call
From: Mathieu Desnoyers @ 2026-04-29 18:42 UTC (permalink / raw)
  To: Steven Rostedt, LKML, Linux Trace Kernel
  Cc: Masami Hiramatsu, Jens Remus, Josh Poimboeuf, Peter Zijlstra,
	Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
	Beau Belgrave, Linus Torvalds, Andrew Morton, Florian Weimer,
	Kees Cook, Carlos O'Donell, Sam James, Dylan Hatch,
	Borislav Petkov, Dave Hansen, David Hildenbrand, H. Peter Anvin,
	Liam R. Howlett, Lorenzo Stoakes, Michal Hocko, Mike Rapoport,
	Suren Baghdasaryan, Vlastimil Babka, Heiko Carstens,
	Vasily Gorbik
In-Reply-To: <20260429114355.6c712e6a@gandalf.local.home>

On 2026-04-29 11:43, Steven Rostedt wrote:
[...]

AFAIU this is a functional iteration of my prior RFC here:

https://lore.kernel.org/lkml/2fa31347-3021-4604-bec3-e5a2d57b77b5@efficios.com/

perhaps it would be valuable to refer to that thread ?

>   }
> +
> +/**
> + * sys_stacktrace_setup - register an address for user space stacktrace walking.
> + * @op: Type of operation to perform
> + * @addr_start: The virtual address of the stacktrace information
> + * @addr_length: The length of the stacktrace information
> + * @text_start: The virtual address of the text that @addr_start represents
> + * @text_length: The length of teh text

the

also, perhaps add @flags just in case ? This allows changing the
behavior without having to insert entirely new set of ops.

> + *
> + * This system call is used by dynamic library utilities to inform the kernel

Out of curiosity: what would happen if the application chooses to invoke
this system call to register sframe for the main executable ? Should it be
allowed ?

> + * of meta data that it loaded that can be used by the kernel to know how
> + * to stack walk the given text locations.
> + *
> + * Currently only sframes are supported, but in the future, this may be used
> + * to tell the kernel about JIT code which will most likely have a different
> + * format.
> + *
> + * The type command may be extended and parameters may be used for other
> + * purposes.

"type" or "op" ?

Thanks,

Mathieu

> + *
> + * Return: 0 if successful, otherwise a negative error.
> + */
> +SYSCALL_DEFINE5(stacktrace_setup, int, op, unsigned long, addr_start,
> +		unsigned long, addr_length, unsigned long, text_start,
> +		unsigned long, text_length)
> +{
> +	switch (op) {
> +	case STACKTRACE_REGISTER_SFRAME:
> +		return sframe_add_section(addr_start, addr_start + addr_length,
> +					  text_start, text_start+text_length);
> +	case STACKTRACE_UNREGISTER_SFRAME:
> +		return sframe_remove_section(addr_start);
> +	}
> +	return -EINVAL;
> +}

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

^ permalink raw reply

* Re: [RFC][PATCH] unwind: Add stacktrace_setup system call
From: Mathieu Desnoyers @ 2026-04-29 18:32 UTC (permalink / raw)
  To: Steven Rostedt, Madhavan Srinivasan, Michael Ellerman
  Cc: LKML, Linux Trace Kernel, Masami Hiramatsu, Jens Remus,
	Josh Poimboeuf, Peter Zijlstra, Ingo Molnar, Jiri Olsa,
	Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
	Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi, Beau Belgrave,
	Linus Torvalds, Andrew Morton, Florian Weimer, Kees Cook,
	Carlos O'Donell, Sam James, Dylan Hatch, Borislav Petkov,
	Dave Hansen, David Hildenbrand, H. Peter Anvin, Liam R. Howlett,
	Lorenzo Stoakes, Michal Hocko, Mike Rapoport, Suren Baghdasaryan,
	Vlastimil Babka, Heiko Carstens, Vasily Gorbik
In-Reply-To: <20260429114747.4cf74d04@gandalf.local.home>

On 2026-04-29 11:47, Steven Rostedt wrote:
> On Wed, 29 Apr 2026 11:43:55 -0400
> Steven Rostedt <rostedt@kernel.org> wrote:
> 
>> diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
>> index 4fcc7c58a105..005441233932 100644
>> --- a/arch/powerpc/kernel/syscalls/syscall.tbl
>> +++ b/arch/powerpc/kernel/syscalls/syscall.tbl
>> @@ -562,3 +562,4 @@
>>   469	common	file_setattr			sys_file_setattr
>>   470	common	listns				sys_listns
>>   471	nospu	rseq_slice_yield		sys_rseq_slice_yield
>> +472	nospu	stacktrace_setup		sys_stacktrace_setup
> 
> BTW, I have no idea what "nospu" is. I did some searches for PowerPC and
> spu vs nospu and the only thing I found was a stackoverflow asking about
> it, and the response was "don't worry about it".
> 
> I only picked nospu because I noticed that rseq_slice_yield used it, but
> that doesn't give me a good feeling that it's correct.

AFAIR PowerPC SPUs are the Cell "Synergistic Processing Units", e.g.
for Playstation 3.

Thanks,

Mathieu



-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

^ permalink raw reply

* Re: [PATCH 7.2 v16 09/13] mm/khugepaged: introduce collapse_allowable_orders helper function
From: Nico Pache @ 2026-04-29 18:22 UTC (permalink / raw)
  To: David Hildenbrand (Arm), linux-doc, linux-kernel, linux-mm,
	linux-trace-kernel, ljs
  Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
	gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, Liam.Howlett, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <5564e170-d052-4445-9914-70bfd852d3b1@kernel.org>

On 4/27/26 2:24 PM, David Hildenbrand (Arm) wrote:
> On 4/19/26 20:57, Nico Pache wrote:
>> Add collapse_allowable_orders() to generalize THP order eligibility. The
>> function determines which THP orders are permitted based on collapse
>> context (khugepaged vs madv_collapse).
>>
>> This consolidates collapse configuration logic and provides a clean
>> interface for future mTHP collapse support where the orders may be
>> different.
>>
>> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> Signed-off-by: Nico Pache <npache@redhat.com>
>> ---
>
>
> [...]
>
>>      cc = kmalloc_obj(*cc);
>> diff --git a/mm/vma.c b/mm/vma.c
>> index 377321b48734..c0398fb597b3 100644
>> --- a/mm/vma.c
>> +++ b/mm/vma.c
>> @@ -989,7 +989,7 @@ static __must_check struct vm_area_struct *vma_merge_existing_range(
>>              goto abort;
>>
>>      vma_set_flags_mask(vmg->target, sticky_flags);
>> -    khugepaged_enter_vma(vmg->target, vmg->vm_flags);
>> +    khugepaged_enter_vma(vmg->target);
>>      vmg->state = VMA_MERGE_SUCCESS;
>>      return vmg->target;
>>
>> @@ -1110,7 +1110,7 @@ struct vm_area_struct *vma_merge_new_range(struct vma_merge_struct *vmg)
>>       * following VMA if we have VMAs on both sides.
>>       */
>>      if (vmg->target && !vma_expand(vmg)) {
>> -            khugepaged_enter_vma(vmg->target, vmg->vm_flags);
>> +            khugepaged_enter_vma(vmg->target);
>>              vmg->state = VMA_MERGE_SUCCESS;
>>              return vmg->target;
>>      }
>> @@ -2589,7 +2589,7 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap,
>>       * call covers the non-merge case.
>>       */
>>      if (!vma_is_anonymous(vma))
>> -            khugepaged_enter_vma(vma, map->vm_flags);
>> +            khugepaged_enter_vma(vma);
>>      *vmap = vma;
>
> Are you sure that in all cases, vma->vm_flags already corresponds to
> vmg->vm_flags / map->vm_flags?

I reviewed most of them and nothing stuck out, but I can go over them
again. Lorenzo may also have more insight into this as he is more
familiar and has been working on this stuff.

@lorenzo?

>
>
> That's a change that makes this patch unnecessary hard to follow, in particular,
> because it's not documented in the patch description.

Thats really weird I could have sworn I did update this description...
There was a lot of changes this round, so it was hard to keep track of
everything. Sorry.

>
> If you think the change is fine, you should better move that into a separate
> cleanup patch where you only drop the flags parameter from  khugepaged_enter_vma().

Yeah thats a better idea. Ill separate it out, thank you for the reviews :)

>


^ permalink raw reply

* Re: [PATCH 7.2 v16 07/13] mm/khugepaged: add per-order mTHP collapse failure statistics
From: Nico Pache @ 2026-04-29 18:21 UTC (permalink / raw)
  To: David Hildenbrand (Arm), linux-doc, linux-kernel, linux-mm,
	linux-trace-kernel
  Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
	gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, Liam.Howlett, ljs,
	mathieu.desnoyers, matthew.brost, mhiramat, mhocko, peterx,
	pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang, rientjes,
	rostedt, rppt, ryan.roberts, shivankg, sunnanyong, surenb,
	thomas.hellstrom, tiwai, usamaarif642, vbabka, vishal.moola,
	wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <f05f4506-2930-44ba-918a-e0e5bcb9d0f9@kernel.org>

On 4/27/26 2:21 PM, David Hildenbrand (Arm) wrote:
> On 4/19/26 20:57, Nico Pache wrote:
>> Add three new mTHP statistics to track collapse failures for different
>> orders when encountering swap PTEs, excessive none PTEs, and shared PTEs:
>>
>> - collapse_exceed_swap_pte: Increment when mTHP collapse fails due to swap
>>      PTEs
>>
>> - collapse_exceed_none_pte: Counts when mTHP collapse fails due to
>>      exceeding the none PTE threshold for the given order
>>
>> - collapse_exceed_shared_pte: Counts when mTHP collapse fails due to shared
>>      PTEs
>>
>> These statistics complement the existing THP_SCAN_EXCEED_* events by
>> providing per-order granularity for mTHP collapse attempts. The stats are
>> exposed via sysfs under
>> `/sys/kernel/mm/transparent_hugepage/hugepages-*/stats/` for each
>> supported hugepage size.
>>
>> As we currently dont support collapsing mTHPs that contain a swap or
>
> s/dont/do not/
>
>> shared entry, those statistics keep track of how often we are
>> encountering failed mTHP collapses due to these restrictions.
>>
>> Now that we plan to support mTHP collapse for anon pages, lets also track
>
> "We will add support for mTHP collapse for anonymous pages next; let's also ..."
>
>> when this happens at the PMD level within the per-mTHP stats.
>
> What about file collapse? For example, we do adjust
> count_vm_event(THP_SCAN_EXCEED_SWAP_PTE) and
> count_vm_event(THP_SCAN_EXCEED_NONE_PTE) there.
>
> Wouldn't we want to update the HPAGE_PMD_ORDER side of things there already? or
> would we want to use a different counter for that?

Maybe? My thought process was that because we dont support mTHP in Shmem
that those stats shouldnt be updated? not sure the right answer tbh--
hence why this comment says "now that.. for anon pages, lets track..".

>
>>
>> Signed-off-by: Nico Pache <npache@redhat.com>
>> ---
>>   Documentation/admin-guide/mm/transhuge.rst | 24 ++++++++++++++++++++++
>>   include/linux/huge_mm.h                    |  3 +++
>>   mm/huge_memory.c                           |  7 +++++++
>>   mm/khugepaged.c                            | 21 +++++++++++++++++--
>>   4 files changed, 53 insertions(+), 2 deletions(-)
>>
>> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
>> index c51932e6275d..eebb1f6bbc6c 100644
>> --- a/Documentation/admin-guide/mm/transhuge.rst
>> +++ b/Documentation/admin-guide/mm/transhuge.rst
>> @@ -714,6 +714,30 @@ nr_anon_partially_mapped
>>          an anonymous THP as "partially mapped" and count it here, even though it
>>          is not actually partially mapped anymore.
>>
>> +collapse_exceed_none_pte
>> +       The number of collapse attempts that failed due to exceeding the
>> +       max_ptes_none threshold. For mTHP collapse, Currently only max_ptes_none
>> +       values of 0 and (HPAGE_PMD_NR - 1) are supported. Any other value will
>> +       emit a warning and no mTHP collapse will be attempted. khugepaged will
>> +       try to collapse to the largest enabled (m)THP size; if it fails, it will
>> +       try the next lower enabled mTHP size. This counter records the number of
>> +       times a collapse attempt was skipped for exceeding the max_ptes_none
>> +       threshold, and khugepaged will move on to the next available mTHP size.
>
> Why is everything after the first sentence worth documenting here? This doesn't
> read like it belongs to a failure counter?
>
>> +
>> +collapse_exceed_swap_pte
>> +       The number of anonymous mTHP PTE ranges which were unable to collapse due
>> +       to containing at least one swap PTE. Currently khugepaged does not
>> +       support collapsing mTHP regions that contain a swap PTE. This counter can
>> +       be used to monitor the number of khugepaged mTHP collapses that failed
>> +       due to the presence of a swap PTE.
>
> Can we similarly simplify that (and make it consistent with the one above) to
>
> "The number of collapse attempts that failed due to exceeding the max_ptes_swap
> threshold."
>
>> +
>> +collapse_exceed_shared_pte
>> +       The number of anonymous mTHP PTE ranges which were unable to collapse due
>> +       to containing at least one shared PTE. Currently khugepaged does not
>> +       support collapsing mTHP PTE ranges that contain a shared PTE. This
>> +       counter can be used to monitor the number of khugepaged mTHP collapses
>> +       that failed due to the presence of a shared PTE.
>
> Same here
>
> "The number of collapse attempts that failed due to exceeding the
> max_ptes_shared threshold."

These all used to be very similar to what you have here. But Lorenzo
asked me to expand all this (IIRC) several times. So here we are! I
believe I even mentioned in the V15 that these are probably the "best"
defined counters in the whole doc lol.

Cheers,

-- Nico

>
> ?
>
>> +
>
> [...]
>


^ permalink raw reply

* Re: [PATCH] trace_printk: replace _______STR with __UNIQUE_ID(STR)
From: Steven Rostedt @ 2026-04-29 17:42 UTC (permalink / raw)
  To: Qian-Yu Lin; +Cc: mhiramat, mathieu.desnoyers, linux-kernel, linux-trace-kernel
In-Reply-To: <20260429165707.7020-1-tiffany019230@gmail.com>

On Thu, 30 Apr 2026 00:57:07 +0800
Qian-Yu Lin <tiffany019230@gmail.com> wrote:

> The macro trace_printk() uses a hardcoded identifier _______STR
> within a statement expression, which can lead to variable name
> shadowing if a caller happens to use the same name in its scope.

Has this ever been a problem?

> 
> Following the pattern in commit 24ba53017e18 ("rcu: Replace ________p1
> and _________p1 with __UNIQUE_ID(rcu)") and commit 589a9785ee3a
>  ("min/max: remove sparse warnings when they're nested"), replace the
>  hardcoded identifier with __UNIQUE_ID(STR).
> 
> Since __UNIQUE_ID() must be expanded once to remain consistent across
> declaration and sizeof() within the statement expression, introduce a
> nested helper macro ___trace_printk.

Hmm, so we are replacing one name with underscores with another name
with underscores?

> 
> Signed-off-by: Qian-Yu Lin <tiffany019230@gmail.com>
> ---
>  include/linux/trace_printk.h | 10 +++++++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/trace_printk.h b/include/linux/trace_printk.h
> index 2670ec7f4262..060eccb40838 100644
> --- a/include/linux/trace_printk.h
> +++ b/include/linux/trace_printk.h
> @@ -2,6 +2,7 @@
>  #ifndef _LINUX_TRACE_PRINTK_H
>  #define _LINUX_TRACE_PRINTK_H
>  
> +#include <linux/compiler.h>

People are already saying that trace_printk.h slows down the compile.
Does this add any overhead to the compile?

-- Steve


>  #include <linux/compiler_attributes.h>
>  #include <linux/instruction_pointer.h>
>  #include <linux/stddef.h>
> @@ -84,15 +85,18 @@ do {									\
>   * let gcc optimize the rest.
>   */
>  
> -#define trace_printk(fmt, ...)				\
> +#define ___trace_printk(fmt, str, ...)				\
>  do {							\
> -	char _______STR[] = __stringify((__VA_ARGS__));	\
> -	if (sizeof(_______STR) > 3)			\
> +	char str[] = __stringify((__VA_ARGS__));	\
> +	if (sizeof(str) > 3)			\
>  		do_trace_printk(fmt, ##__VA_ARGS__);	\
>  	else						\
>  		trace_puts(fmt);			\
>  } while (0)
>  
> +#define trace_printk(fmt, ...) \
> +	___trace_printk(fmt, __UNIQUE_ID(str), ##__VA_ARGS__)
> +
>  #define do_trace_printk(fmt, args...)					\
>  do {									\
>  	static const char *trace_printk_fmt __used			\


^ permalink raw reply

* [PATCH] trace_printk: replace _______STR with __UNIQUE_ID(STR)
From: Qian-Yu Lin @ 2026-04-29 16:57 UTC (permalink / raw)
  To: rostedt, mhiramat
  Cc: mathieu.desnoyers, linux-kernel, linux-trace-kernel, Qian-Yu Lin

The macro trace_printk() uses a hardcoded identifier _______STR
within a statement expression, which can lead to variable name
shadowing if a caller happens to use the same name in its scope.

Following the pattern in commit 24ba53017e18 ("rcu: Replace ________p1
and _________p1 with __UNIQUE_ID(rcu)") and commit 589a9785ee3a
 ("min/max: remove sparse warnings when they're nested"), replace the
 hardcoded identifier with __UNIQUE_ID(STR).

Since __UNIQUE_ID() must be expanded once to remain consistent across
declaration and sizeof() within the statement expression, introduce a
nested helper macro ___trace_printk.

Signed-off-by: Qian-Yu Lin <tiffany019230@gmail.com>
---
 include/linux/trace_printk.h | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/include/linux/trace_printk.h b/include/linux/trace_printk.h
index 2670ec7f4262..060eccb40838 100644
--- a/include/linux/trace_printk.h
+++ b/include/linux/trace_printk.h
@@ -2,6 +2,7 @@
 #ifndef _LINUX_TRACE_PRINTK_H
 #define _LINUX_TRACE_PRINTK_H
 
+#include <linux/compiler.h>
 #include <linux/compiler_attributes.h>
 #include <linux/instruction_pointer.h>
 #include <linux/stddef.h>
@@ -84,15 +85,18 @@ do {									\
  * let gcc optimize the rest.
  */
 
-#define trace_printk(fmt, ...)				\
+#define ___trace_printk(fmt, str, ...)				\
 do {							\
-	char _______STR[] = __stringify((__VA_ARGS__));	\
-	if (sizeof(_______STR) > 3)			\
+	char str[] = __stringify((__VA_ARGS__));	\
+	if (sizeof(str) > 3)			\
 		do_trace_printk(fmt, ##__VA_ARGS__);	\
 	else						\
 		trace_puts(fmt);			\
 } while (0)
 
+#define trace_printk(fmt, ...) \
+	___trace_printk(fmt, __UNIQUE_ID(str), ##__VA_ARGS__)
+
 #define do_trace_printk(fmt, args...)					\
 do {									\
 	static const char *trace_printk_fmt __used			\
-- 
2.25.1


^ permalink raw reply related

* Re: [PATCH v18 4/8] ring-buffer: Skip invalid sub-buffers when rewinding persistent ring buffer
From: Steven Rostedt @ 2026-04-29 16:39 UTC (permalink / raw)
  To: Masami Hiramatsu (Google)
  Cc: Catalin Marinas, Will Deacon, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel, Ian Rogers, linux-arm-kernel
In-Reply-To: <20260430002023.b19608c27a746284e8f73b80@kernel.org>

On Thu, 30 Apr 2026 00:20:23 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:

> On Tue, 28 Apr 2026 16:21:46 -0400
> Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > On Fri, 24 Apr 2026 15:52:35 +0900
> > "Masami Hiramatsu (Google)" <mhiramat@kernel.org> wrote:
> >   
> > > @@ -1892,9 +1895,19 @@ static int rb_validate_buffer(struct buffer_data_page *dpage, int cpu,
> > >  	 * subbuf_size is considered invalid.
> > >  	 */
> > >  	tail = local_read(&dpage->commit) & ~RB_MISSED_MASK;
> > > -	if (tail > meta->subbuf_size - BUF_PAGE_HDR_SIZE)
> > > -		return -1;
> > > -	return rb_read_data_buffer(dpage, tail, cpu, &ts, &delta);
> > > +	if (tail <= meta->subbuf_size - BUF_PAGE_HDR_SIZE)
> > > +		ret = rb_read_data_buffer(dpage, tail, cpu, &ts, &delta);
> > > +  
> > 
> > This code seriously needs comments.  
> 
> OK, I'll add it, or let code explain clearer?
> 
> 	if (tail <= meta->subbuf_size - BUF_PAGE_HDR_SIZE)
> 		ret = rb_read_data_buffer(dpage, tail, cpu, &ts, &delta);
> 	else
> 		ret = -1;

That's better...

> 
> Thanks,
> 

The below should have some explanation too. I can figure it out, but it
wasted more brain cycles than I would have liked ;-)

-- Steve


> >   
> > > +	if (ret < 0 || (prev_ts && prev_ts > ts) || (next_ts && ts > next_ts)) {
> > > +		local_set(&bpage->entries, 0);
> > > +		local_set(&bpage->page->commit, 0);
> > > +		bpage->page->time_stamp = prev_ts ? prev_ts : next_ts;
> > > +		ret = -1;
> > > +	} else {
> > > +		local_set(&bpage->entries, ret);
> > > +	}
> > > +
> > > +	return ret;
> > >  }
> > >    
> 
> 


^ permalink raw reply

* Re: [PATCH v18 3/8] ring-buffer: Skip invalid sub-buffers when validating persistent ring buffer
From: Steven Rostedt @ 2026-04-29 16:32 UTC (permalink / raw)
  To: Masami Hiramatsu (Google)
  Cc: Catalin Marinas, Will Deacon, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel, Ian Rogers, linux-arm-kernel
In-Reply-To: <20260430001559.a2d0489d956c06059e13e58d@kernel.org>

On Thu, 30 Apr 2026 00:15:59 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:

> It should not be a never ending loop (there are other exit conditions),
> but I agreed. What about limiting with nr_subbufs?
> 
> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> index 5326924615a4..aa89ec96e964 100644
> --- a/kernel/trace/ring_buffer.c
> +++ b/kernel/trace/ring_buffer.c
> @@ -5630,7 +5630,8 @@ __rb_get_reader_page(struct ring_buffer_per_cpu *cpu_buffer)
>  	 * a case where we will loop three times. There should be no
>  	 * reason to loop four times (that I know of).
>  	 */
> -	if (RB_WARN_ON(cpu_buffer, ++nr_loops > 3)) {
> +	if (RB_WARN_ON(cpu_buffer,
> +		++nr_loops > (cpu_buffer->ring_meta ? cpu_buffer->nr_subbufs : 3))) {
>  		reader = NULL;
>  		goto out;
>  	}

Yeah, this looks more reasonable.

Thanks,

-- Steve

^ permalink raw reply

* Re: [PATCH] tracing/probes: Limit size of event probe to 3K
From: Steven Rostedt @ 2026-04-29 15:57 UTC (permalink / raw)
  To: Masami Hiramatsu (Google); +Cc: LKML, Linux Trace Kernel, Mathieu Desnoyers
In-Reply-To: <20260429175144.9d3d6e13e935f1dfd480f6a0@kernel.org>

On Wed, 29 Apr 2026 17:51:44 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:

> Hi Steve,
> 
> BTW, to prevent regressions during future expansions, how about adding the following line?

Nothing against this, but I think it should be a separate patch.

Thanks,

-- Steve


> 
> diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
> index 2cabf8a23ec5..c5ee7920dec6 100644
> --- a/kernel/trace/trace_uprobe.c
> +++ b/kernel/trace/trace_uprobe.c
> @@ -979,6 +979,7 @@ static struct uprobe_cpu_buffer *prepare_uprobe_buffer(struct trace_uprobe *tu,
>  	ucb = uprobe_buffer_get();
>  	ucb->dsize = tu->tp.size + dsize;
>  
> +	BUILD_BUG_ON(MAX_UCB_BUFFER_SIZE < MAX_PROBE_EVENT_SIZE);
>  	if (WARN_ON_ONCE(ucb->dsize > MAX_UCB_BUFFER_SIZE)) {
>  		ucb->dsize = MAX_UCB_BUFFER_SIZE;
>  		dsize = MAX_UCB_BUFFER_SIZE - tu->tp.size;
> 

^ permalink raw reply

* Re: [PATCH] kprobes: Remove dead child probes from aggrprobe list on module unload
From: Steven Rostedt @ 2026-04-29 15:56 UTC (permalink / raw)
  To: Masami Hiramatsu (Google)
  Cc: Shijia Hu, naveen, davem, ananth, akpm, linux-kernel,
	linux-trace-kernel
In-Reply-To: <20260429174053.b66f3a0032408f544128afd4@kernel.org>

On Wed, 29 Apr 2026 17:40:53 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:

> Shijia Hu <hushijia1@uniontech.com> wrote:
> 
> > When a kernel module that registered kprobes is unloaded without calling
> > unregister_kprobe(), kprobes_module_callback() calls kill_kprobe() to
> > mark the probe(s) GONE.  If the probe is an aggrprobe, kill_kprobe()
> > also marks all child probes GONE, but it does not remove them from
> > the aggrprobe's list.  
> 
> That sounds like a bug in the module.

Agreed.

> 
> > 
> > The problem is that child probes whose struct kprobe resides in the
> > unloading module's memory are freed along with the module, yet they
> > remain on the aggrprobe's list.  Later, when another caller registers
> > a kprobe at the same address, __get_valid_kprobe() walks that list
> > and dereferences the freed child probe, causing a use-after-free.
> > 
> > Reproduction steps:
> > 
> >     1) Load module A which registers two kprobes on the same kernel
> >        function address (e.g., do_nanosleep), causing them to be
> >        aggregated under one aggrprobe.
> > 
> >     2) Unload module A without calling unregister_kprobe().
> >        Module A's memory is freed, but its two child probes remain
> >        on the aggrprobe's list as dangling pointers.  
> 
> Would you mean "load a buggy kernel module and unload it, the kernel cause
> use-after-free."? for example:
> 
> ----
> struct kprobe my_probe = {...};
> 
> init_module() {
> 	register_kprobe(&my_probe);
> }
> exit_module() {
> 	/* do nothing */
> }
> ----
> 
> Yes, this cause UAF because that module has a bug. Please call
> unregister_kprobe().

Yes, this is one of those...

  Patient: Doctor it hurts me when I do this
  Doctor: Then don't do that

... reports.

No, the kernel isn't responsible for fixing buggy modules.

-- Steve

^ permalink raw reply

* Re: [PATCH v18 7/8] ring-buffer: Cleanup persistent ring buffer validation
From: Steven Rostedt @ 2026-04-29 15:50 UTC (permalink / raw)
  To: Masami Hiramatsu (Google)
  Cc: Catalin Marinas, Will Deacon, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel, Ian Rogers, linux-arm-kernel
In-Reply-To: <20260430002139.c3b0e4c92ae52aeeaf86e1bf@kernel.org>

On Thu, 30 Apr 2026 00:21:39 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:

> > Also, could you rebase on top of v7.1-rc1?  
> 
> Is it v7.1-rc1, not ring-buffer/for-next?

The ring-buffer/for-next has been merged with conflicts. I rather not add
more conflicts ;-)  v7.1-rc1 includes ring-buffer/for-next.

-- Steve

^ permalink raw reply

* Re: [RFC][PATCH] unwind: Add stacktrace_setup system call
From: Steven Rostedt @ 2026-04-29 15:47 UTC (permalink / raw)
  To: Madhavan Srinivasan, Michael Ellerman
  Cc: LKML, Linux Trace Kernel, Masami Hiramatsu, Mathieu Desnoyers,
	Jens Remus, Josh Poimboeuf, Peter Zijlstra, Ingo Molnar,
	Jiri Olsa, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
	Beau Belgrave, Linus Torvalds, Andrew Morton, Florian Weimer,
	Kees Cook, Carlos O'Donell, Sam James, Dylan Hatch,
	Borislav Petkov, Dave Hansen, David Hildenbrand, H. Peter Anvin,
	Liam R. Howlett, Lorenzo Stoakes, Michal Hocko, Mike Rapoport,
	Suren Baghdasaryan, Vlastimil Babka, Heiko Carstens,
	Vasily Gorbik
In-Reply-To: <20260429114355.6c712e6a@gandalf.local.home>

On Wed, 29 Apr 2026 11:43:55 -0400
Steven Rostedt <rostedt@kernel.org> wrote:

> diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
> index 4fcc7c58a105..005441233932 100644
> --- a/arch/powerpc/kernel/syscalls/syscall.tbl
> +++ b/arch/powerpc/kernel/syscalls/syscall.tbl
> @@ -562,3 +562,4 @@
>  469	common	file_setattr			sys_file_setattr
>  470	common	listns				sys_listns
>  471	nospu	rseq_slice_yield		sys_rseq_slice_yield
> +472	nospu	stacktrace_setup		sys_stacktrace_setup

BTW, I have no idea what "nospu" is. I did some searches for PowerPC and
spu vs nospu and the only thing I found was a stackoverflow asking about
it, and the response was "don't worry about it".

I only picked nospu because I noticed that rseq_slice_yield used it, but
that doesn't give me a good feeling that it's correct.

-- Steve

^ permalink raw reply

* [RFC][PATCH] unwind: Add stacktrace_setup system call
From: Steven Rostedt @ 2026-04-29 15:43 UTC (permalink / raw)
  To: LKML, Linux Trace Kernel
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Jens Remus, Josh Poimboeuf,
	Peter Zijlstra, Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
	Jose E. Marchesi, Beau Belgrave, Linus Torvalds, Andrew Morton,
	Florian Weimer, Kees Cook, Carlos O'Donell, Sam James,
	Dylan Hatch, Borislav Petkov, Dave Hansen, David Hildenbrand,
	H. Peter Anvin, Liam R. Howlett, Lorenzo Stoakes, Michal Hocko,
	Mike Rapoport, Suren Baghdasaryan, Vlastimil Babka,
	Heiko Carstens, Vasily Gorbik

[-- Attachment #1: Type: text/plain, Size: 15652 bytes --]

From: Steven Rostedt <rostedt@goodmis.org>

[
   This is an RFC that adds a system call for dynamic linkers to use to
   tell the kernel where the sframe sections are when it loads dynamic
   libraries.

   It is built on top of Jens's sframe implementation for v3:

      https://lore.kernel.org/linux-trace-kernel/20260127150554.2760964-1-jremus@linux.ibm.com/

   I have a repo with that code that this applies on top of here:

      git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace.git sframe/core
       

   The name of the system call is "stacktrace_setup", but I'm not attached
   to this name. If anyone can think of a better name I'm happy to take
   suggestions.

   This patch is just to get the conversation going and the final result
   may be much different. I tested this with the attached program which is a
   major hack. I built glibc with sframe v3 support and I used readelf to
   find the sframe size and location of glibc.

   readelf -e /work/usr/lib/libc.so.6 | grep sframe
     [19] .sframe           GNU_SFRAME       00000000001d3fc0  001d3fc0

   Then I wrote a program that takes the above location and size of the
   .sframe section in libc as parameters, scans /proc/self/maps to find
   where it loaded libc and then calls this new system call with a pointer
   to the location of the sframe along with its size, as well as where the
   libc text is located.

   It then spins for 2 seconds, calls the system call again to remove the
   sframe section it loaded, and spins for another 2 seconds.

   I ran perf record --call-graph fp,defer on the program and looked for
   the do_spin() function.

   With sframe loaded:

sframe-test    1350  1396.333593:     202366 cpu/cycles/P: 
            7fdf0ec38a44 [unknown] ([vdso])
            5621a6b97243 get_time+0x19 (/work/c/sframe-test)
            5621a6b9727f do_spin+0x1f (/work/c/sframe-test)
            5621a6b975cd main+0xd4 (/work/c/sframe-test)
            7fdf0ea26bda __libc_start_call_main+0x6a (/work/usr/lib/libc.so.6)
            7fdf0ea26d05 __libc_start_main@@GLIBC_2.34+0x85 (/work/usr/lib/libc.so.6)
            5621a6b97131 _start+0x21 (/work/c/sframe-test)

   After it unloads the sframe:

sframe-test    1350  1400.332902:     657582 cpu/cycles/P: 
            7fdf0ec38a5e [unknown] ([vdso])
            5621a6b97243 get_time+0x19 (/work/c/sframe-test)
            5621a6b9727f do_spin+0x1f (/work/c/sframe-test)
            5621a6b97602 main+0x109 (/work/c/sframe-test)
            7fdf0ea26bda __libc_start_call_main+0x6a (/work/usr/lib/libc.so.6)

   As you can see, with the sframe loaded, it was able to walk further up
   the libc library.

   Again, this is just an RFC, but I would like to get agreement on the
   system call so that we can then update the dynamic linker to do this
   instead of using my hack ;-)
]

Add a system call that can be used by dynamic linkers to tell the kernel
where the sframe section is in memory for libraries it loads.

The system call stacktrace_setup takes 5 parameters:

  op - the type of operation to perform
  addr_start - The virtual address of the sframe section
  addr_length - The length of the sframe section
  text_start - the text section the sframe represents
  test_length - the length of the section

The current op values are:

  STACKTRACE_REGISTER_SFRAME - This registers the sframe
  STACKTRACE_UNREGISTER_SFRAME - This removes the sframe

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 arch/alpha/kernel/syscalls/syscall.tbl      |  1 +
 arch/arm/tools/syscall.tbl                  |  1 +
 arch/arm64/tools/syscall_32.tbl             |  1 +
 arch/m68k/kernel/syscalls/syscall.tbl       |  1 +
 arch/microblaze/kernel/syscalls/syscall.tbl |  1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl   |  1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl   |  1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl   |  1 +
 arch/parisc/kernel/syscalls/syscall.tbl     |  1 +
 arch/powerpc/kernel/syscalls/syscall.tbl    |  1 +
 arch/s390/kernel/syscalls/syscall.tbl       |  1 +
 arch/sh/kernel/syscalls/syscall.tbl         |  1 +
 arch/sparc/kernel/syscalls/syscall.tbl      |  1 +
 arch/x86/entry/syscalls/syscall_32.tbl      |  1 +
 arch/x86/entry/syscalls/syscall_64.tbl      |  1 +
 arch/xtensa/kernel/syscalls/syscall.tbl     |  1 +
 include/linux/syscalls.h                    |  1 +
 include/uapi/asm-generic/unistd.h           |  5 ++-
 include/uapi/linux/stacktrace.h             | 10 ++++++
 kernel/sys_ni.c                             |  2 ++
 kernel/unwind/sframe.c                      | 37 +++++++++++++++++++++
 scripts/syscall.tbl                         |  1 +
 22 files changed, 71 insertions(+), 1 deletion(-)
 create mode 100644 include/uapi/linux/stacktrace.h

diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index f31b7afffc34..8c320029a156 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -511,3 +511,4 @@
 579	common	file_setattr			sys_file_setattr
 580	common	listns				sys_listns
 581	common	rseq_slice_yield		sys_rseq_slice_yield
+582	common	stacktrace_setup		sys_stacktrace_setup
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index 94351e22bfcf..60f9a33b2dc5 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -486,3 +486,4 @@
 469	common	file_setattr			sys_file_setattr
 470	common	listns				sys_listns
 471	common	rseq_slice_yield		sys_rseq_slice_yield
+472	common	stacktrace_setup		sys_stacktrace_setup
diff --git a/arch/arm64/tools/syscall_32.tbl b/arch/arm64/tools/syscall_32.tbl
index 62d93d88e0fe..a0bd04a23006 100644
--- a/arch/arm64/tools/syscall_32.tbl
+++ b/arch/arm64/tools/syscall_32.tbl
@@ -483,3 +483,4 @@
 469	common	file_setattr			sys_file_setattr
 470	common	listns				sys_listns
 471	common	rseq_slice_yield		sys_rseq_slice_yield
+472	common	stacktrace_setup		sys_stacktrace_setup
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index 248934257101..266ec877300a 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -471,3 +471,4 @@
 469	common	file_setattr			sys_file_setattr
 470	common	listns				sys_listns
 471	common	rseq_slice_yield		sys_rseq_slice_yield
+472	common	stacktrace_setup		sys_stacktrace_setup
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index 223d26303627..916294849393 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -477,3 +477,4 @@
 469	common	file_setattr			sys_file_setattr
 470	common	listns				sys_listns
 471	common	rseq_slice_yield		sys_rseq_slice_yield
+472	common	stacktrace_setup		sys_stacktrace_setup
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index 7430714e2b8f..20fec148901e 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -410,3 +410,4 @@
 469	n32	file_setattr			sys_file_setattr
 470	n32	listns				sys_listns
 471	n32	rseq_slice_yield		sys_rseq_slice_yield
+472	n32	stacktrace_setup		sys_stacktrace_setup
diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
index 630aab9e5425..2743bbcab143 100644
--- a/arch/mips/kernel/syscalls/syscall_n64.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
@@ -386,3 +386,4 @@
 469	n64	file_setattr			sys_file_setattr
 470	n64	listns				sys_listns
 471	n64	rseq_slice_yield		sys_rseq_slice_yield
+472	n64	stacktrace_setup		sys_stacktrace_setup
diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
index 128653112284..187eadc4a42e 100644
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -459,3 +459,4 @@
 469	o32	file_setattr			sys_file_setattr
 470	o32	listns				sys_listns
 471	o32	rseq_slice_yield		sys_rseq_slice_yield
+472	o32	stacktrace_setup		sys_stacktrace_setup
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index c6331dad9461..9442a92ef0aa 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -470,3 +470,4 @@
 469	common	file_setattr			sys_file_setattr
 470	common	listns				sys_listns
 471	common	rseq_slice_yield		sys_rseq_slice_yield
+472	common	stacktrace_setup		sys_stacktrace_setup
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index 4fcc7c58a105..005441233932 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -562,3 +562,4 @@
 469	common	file_setattr			sys_file_setattr
 470	common	listns				sys_listns
 471	nospu	rseq_slice_yield		sys_rseq_slice_yield
+472	nospu	stacktrace_setup		sys_stacktrace_setup
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index 09a7ef04d979..bc9894b25584 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -398,3 +398,4 @@
 469	common	file_setattr			sys_file_setattr
 470	common	listns				sys_listns
 471	common	rseq_slice_yield		sys_rseq_slice_yield
+472	common	stacktrace_setup		sys_stacktrace_setup
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index 70b315cbe710..5766251b4d2d 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -475,3 +475,4 @@
 469	common	file_setattr			sys_file_setattr
 470	common	listns				sys_listns
 471	common	rseq_slice_yield		sys_rseq_slice_yield
+472	common	stacktrace_setup		sys_stacktrace_setup
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index 7e71bf7fcd14..20e7f3b856e4 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -517,3 +517,4 @@
 469	common	file_setattr			sys_file_setattr
 470	common	listns				sys_listns
 471	common	rseq_slice_yield		sys_rseq_slice_yield
+472	common	stacktrace_setup		sys_stacktrace_setup
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index f832ebd2d79b..652ede93b724 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -477,3 +477,4 @@
 469	i386	file_setattr		sys_file_setattr
 470	i386	listns			sys_listns
 471	i386	rseq_slice_yield	sys_rseq_slice_yield
+472	i386	stacktrace_setup	sys_stacktrace_setup
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 524155d655da..5da918e912a6 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -396,6 +396,7 @@
 469	common	file_setattr		sys_file_setattr
 470	common	listns			sys_listns
 471	common	rseq_slice_yield	sys_rseq_slice_yield
+472	common	stacktrace_setup	sys_stacktrace_setup
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index a9bca4e484de..34f0de06baee 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -442,3 +442,4 @@
 469	common	file_setattr			sys_file_setattr
 470	common	listns				sys_listns
 471	common	rseq_slice_yield		sys_rseq_slice_yield
+472	common	stacktrace_setup		sys_stacktrace_setup
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index f5639d5ac331..fdbea39c1b38 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -999,6 +999,7 @@ asmlinkage long sys_lsm_get_self_attr(unsigned int attr, struct lsm_ctx __user *
 asmlinkage long sys_lsm_set_self_attr(unsigned int attr, struct lsm_ctx __user *ctx,
 				      u32 size, u32 flags);
 asmlinkage long sys_lsm_list_modules(u64 __user *ids, u32 __user *size, u32 flags);
+asmlinkage long sys_stacktrace_setup(void);
 
 /*
  * Architecture-specific system calls
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index a627acc8fb5f..d3f57d8454d7 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -863,8 +863,11 @@ __SYSCALL(__NR_listns, sys_listns)
 #define __NR_rseq_slice_yield 471
 __SYSCALL(__NR_rseq_slice_yield, sys_rseq_slice_yield)
 
+#define __NR_stacktrace_setup 472
+__SYSCALL(__NR_stacktrace_setup, sys_stacktrace_setup)
+
 #undef __NR_syscalls
-#define __NR_syscalls 472
+#define __NR_syscalls 473
 
 /*
  * 32 bit systems traditionally used different
diff --git a/include/uapi/linux/stacktrace.h b/include/uapi/linux/stacktrace.h
new file mode 100644
index 000000000000..60b581f55995
--- /dev/null
+++ b/include/uapi/linux/stacktrace.h
@@ -0,0 +1,10 @@
+/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
+#ifndef _UAPI_LINUX_STACKTRACE_H
+#define _UAPI_LINUX_STACKTRACE_H
+
+enum stacktrace_setup_types {
+	STACKTRACE_REGISTER_SFRAME	= 1,
+	STACKTRACE_UNREGISTER_SFRAME	= 2,
+};
+
+#endif /* _UAPI_LINUX_STACKTRACE_H */
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index add3032da16f..76998b0f811a 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -394,3 +394,5 @@ COND_SYSCALL(rseq_slice_yield);
 
 COND_SYSCALL(uretprobe);
 COND_SYSCALL(uprobe);
+
+COND_SYSCALL(stacktrace_setup);
diff --git a/kernel/unwind/sframe.c b/kernel/unwind/sframe.c
index f24997e84e05..a842038fb03b 100644
--- a/kernel/unwind/sframe.c
+++ b/kernel/unwind/sframe.c
@@ -12,8 +12,10 @@
 #include <linux/mm.h>
 #include <linux/string_helpers.h>
 #include <linux/sframe.h>
+#include <linux/syscalls.h>
 #include <asm/unwind_user_sframe.h>
 #include <linux/unwind_user_types.h>
+#include <uapi/linux/stacktrace.h>
 
 #include "sframe.h"
 #include "sframe_debug.h"
@@ -838,3 +840,38 @@ void sframe_free_mm(struct mm_struct *mm)
 
 	mtree_destroy(&mm->sframe_mt);
 }
+
+/**
+ * sys_stacktrace_setup - register an address for user space stacktrace walking.
+ * @op: Type of operation to perform
+ * @addr_start: The virtual address of the stacktrace information
+ * @addr_length: The length of the stacktrace information
+ * @text_start: The virtual address of the text that @addr_start represents
+ * @text_length: The length of teh text
+ *
+ * This system call is used by dynamic library utilities to inform the kernel
+ * of meta data that it loaded that can be used by the kernel to know how
+ * to stack walk the given text locations.
+ *
+ * Currently only sframes are supported, but in the future, this may be used
+ * to tell the kernel about JIT code which will most likely have a different
+ * format.
+ *
+ * The type command may be extended and parameters may be used for other
+ * purposes.
+ *
+ * Return: 0 if successful, otherwise a negative error.
+ */
+SYSCALL_DEFINE5(stacktrace_setup, int, op, unsigned long, addr_start,
+		unsigned long, addr_length, unsigned long, text_start,
+		unsigned long, text_length)
+{
+	switch (op) {
+	case STACKTRACE_REGISTER_SFRAME:
+		return sframe_add_section(addr_start, addr_start + addr_length,
+					  text_start, text_start+text_length);
+	case STACKTRACE_UNREGISTER_SFRAME:
+		return sframe_remove_section(addr_start);
+	}
+	return -EINVAL;
+}
diff --git a/scripts/syscall.tbl b/scripts/syscall.tbl
index 7a42b32b6577..54a99cffeec4 100644
--- a/scripts/syscall.tbl
+++ b/scripts/syscall.tbl
@@ -412,3 +412,4 @@
 469	common	file_setattr			sys_file_setattr
 470	common	listns				sys_listns
 471	common	rseq_slice_yield		sys_rseq_slice_yield
+472	common	stacktrace_setup		sys_stacktrace_setup
-- 
2.53.0


[-- Attachment #2: sframe-test.c --]
[-- Type: text/x-c++src, Size: 2716 bytes --]

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdarg.h>
#include <errno.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/syscall.h>

#undef SYS_stacktrace_setup
#define SYS_stacktrace_setup		472

enum {
	STACKTRACE_REGISTER_SFRAME = 1,
	STACKTRACE_UNREGISTER_SFRAME,
};

#define stacktrace_setup(op, start_addr, length, text_addr, text_length)\
	syscall(SYS_stacktrace_setup, op, start_addr, length, text_addr, text_length)

static void usage(const char *name)
{
	printf("usage: %s sframe-offset sframe-length\n"
	       "\n",name);
	exit(-1);
}

static unsigned long long get_time(void)
{
	struct timeval tv;
	unsigned long long time;

	gettimeofday(&tv, NULL);

	time = tv.tv_sec * 1000000ULL;
	time += tv.tv_usec;

	return time;
}

void do_spin(void)
{
	unsigned long long start = get_time();

	/* 2 seconds */
	start += 2 * 1000 * 1000;

	while (get_time() < start)
		;
}

static int dump_self(unsigned long idx, unsigned long *sframe,
		     unsigned long *text, unsigned long *text_len)
{
	FILE *fp;
	char *line = NULL;
	size_t size;
	unsigned long ptr = 0;

	*text = 0;

	fp = fopen("/proc/self/maps", "r");
	if (!fp) {
		perror("self");
		exit(-1);
	}

	while ((!ptr || !*text) && getline(&line, &size, fp) > 0) {
		unsigned long start, end;
		char *file;
		char *perm;
		int r;
		int l;

		r = sscanf(line, "%lx-%lx %ms %*x %*s %*d %ms\n",
			   &start, &end, &perm, &file);
		if (r != 4)
			continue;
		printf("%s", line);
//		printf("file=%s\n", file);
		l = strlen(file);
		for (l--; l > 0; l--) {
			if (file[l] == '/') {
				l++;
				break;
			}
		}
		// r-xp	
		if (!strncmp(file + l, "libc.so", 7)) {
			if (!ptr) {
				ptr = start + idx;
				printf("found sframe libc.so %lx-%lx (%lx)\n", start, end, ptr);
				printf("VAL=%lx\n", *(unsigned long*)ptr);
			}
			if (!*text && !strcmp(perm, "r-xp")) {
				*text = start;
				*text_len = end - start;
				printf("found text libc.so %lx-%lx (%lx)\n", start, end, ptr);
			}
		}
		free(file);
		free(perm);
	}
	free(line);

	*sframe = ptr;
	return ptr && *text;
}

int main (int argc, char **argv)
{
	unsigned long idx; // 0x001d3fc0;
	unsigned long len; // 0x303f9;
	unsigned long ptr;
	unsigned long text;
	unsigned long text_len;
	int ret = 0;

	if (argc < 3)
		usage(argv[0]);


	idx = strtoul(argv[1], NULL, 0);
	len = strtoul(argv[2], NULL, 0);

	if (!dump_self(idx, &ptr, &text, &text_len)) {
		fprintf(stderr, "Could not find data");
		exit(-1);
	}
	ret = stacktrace_setup(STACKTRACE_REGISTER_SFRAME, ptr, len, text, text_len);
	if (ret < 0) {
		perror("Registering stacktrace");
		exit(-1);
	}

	do_spin();

	stacktrace_setup(STACKTRACE_UNREGISTER_SFRAME, ptr, len, text, text_len);

	do_spin();
	
	return 0;
}

^ permalink raw reply related

* Re: [PATCH 7.2 v16 05/13] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
From: Nico Pache @ 2026-04-29 15:34 UTC (permalink / raw)
  To: David Hildenbrand (Arm), linux-doc, linux-kernel, linux-mm,
	linux-trace-kernel
  Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
	gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, Liam.Howlett, ljs,
	mathieu.desnoyers, matthew.brost, mhiramat, mhocko, peterx,
	pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang, rientjes,
	rostedt, rppt, ryan.roberts, shivankg, sunnanyong, surenb,
	thomas.hellstrom, tiwai, usamaarif642, vbabka, vishal.moola,
	wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <6025c087-378b-4754-b1f2-3c5448b49d6c@kernel.org>

On 4/27/26 2:13 PM, David Hildenbrand (Arm) wrote:
> On 4/19/26 20:57, Nico Pache wrote:
>> Pass an order and offset to collapse_huge_page to support collapsing anon
>> memory to arbitrary orders within a PMD. order indicates what mTHP size we
>> are attempting to collapse to, and offset indicates were in the PMD to
>> start the collapse attempt.
>>
>> For non-PMD collapse we must leave the anon VMA write locked until after
>> we collapse the mTHP-- in the PMD case all the pages are isolated, but in
>> the mTHP case this is not true, and we must keep the lock to prevent
>> access/changes to the page tables. This can happen if the rmap walkers hit
>> a pmd_none while the PMD entry is currently unavailable due to being
>> temporarily removed during the collapse phase.
>>
>> Signed-off-by: Nico Pache <npache@redhat.com>
>> ---
>>   mm/khugepaged.c | 103 +++++++++++++++++++++++++++---------------------
>>   1 file changed, 57 insertions(+), 46 deletions(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 283bb63854a5..ff6f9f1883ed 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -1198,42 +1198,36 @@ static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_stru
>>      return SCAN_SUCCEED;
>>   }
>>
>> -static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long address,
>> -            int referenced, int unmapped, struct collapse_control *cc)
>> +static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long start_addr,
>> +            int referenced, int unmapped, struct collapse_control *cc,
>> +            unsigned int order)
>>   {
>>      LIST_HEAD(compound_pagelist);
>>      pmd_t *pmd, _pmd;
>> -    pte_t *pte;
>> +    pte_t *pte = NULL;
>>      pgtable_t pgtable;
>>      struct folio *folio;
>>      spinlock_t *pmd_ptl, *pte_ptl;
>>      enum scan_result result = SCAN_FAIL;
>>      struct vm_area_struct *vma;
>>      struct mmu_notifier_range range;
>> +    bool anon_vma_locked = false;
>> +    const unsigned long pmd_addr = start_addr & HPAGE_PMD_MASK;
>> +    const unsigned long end_addr = start_addr + (PAGE_SIZE << order);
>
> In general, const read better when they are at the very top of this list.

Ack! still getting in the habit of this, sorry.

>
>>
>> -    VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>> -
>> -    /*
>> -     * Before allocating the hugepage, release the mmap_lock read lock.
>> -     * The allocation can take potentially a long time if it involves
>> -     * sync compaction, and we do not need to hold the mmap_lock during
>> -     * that. We will recheck the vma after taking it again in write mode.
>> -     */
>> -    mmap_read_unlock(mm);
>> -
>
> You should spell out that locking change (moving it to the caller), and why it
> is required, in the patch description.
>
> I'd even have put this into a separate patch, as it's independent to the
> order-passing changes.

Yeah good point ill separate this out into its own patch, I also
realized I never updated the commit message for this patch to properly
describe this change. Whoops.


> [...]
>
>>       */
>>      __folio_mark_uptodate(folio);
>> -    pgtable = pmd_pgtable(_pmd);
>> -
>>      spin_lock(pmd_ptl);
>> -    BUG_ON(!pmd_none(*pmd));
>> -    pgtable_trans_huge_deposit(mm, pmd, pgtable);
>> -    map_anon_folio_pmd_nopf(folio, pmd, vma, address);
>> +    WARN_ON_ONCE(!pmd_none(*pmd));
>> +    if (is_pmd_order(order)) { /* PMD collapse */
>> +            pgtable = pmd_pgtable(_pmd);
>> +            pgtable_trans_huge_deposit(mm, pmd, pgtable);
>> +            map_anon_folio_pmd_nopf(folio, pmd, vma, pmd_addr);
>> +    } else { /* mTHP collapse */
>
> Do both these comments (PMD collapse ...) really add any value? I'd say it's
> pretty self-documenting code already.

Yeah with the new is_pmd_order helper its def a little overkill. I'll
remove them. Thanks!


>
>> +            map_anon_folio_pte_nopf(folio, pte, vma, start_addr, /*uffd_wp=*/ false);
>> +            smp_wmb(); /* make PTEs visible before PMD. See pmd_install() */
>> +            pmd_populate(mm, pmd, pmd_pgtable(_pmd));
>> +    }
>>      spin_unlock(pmd_ptl);
>>
>>      folio = NULL;
>>
>>      result = SCAN_SUCCEED;
>>   out_up_write:
>> +    if (anon_vma_locked)
>> +            anon_vma_unlock_write(vma->anon_vma);
>> +    if (pte)
>> +            pte_unmap(pte);
>


^ permalink raw reply

* Re: [PATCH 7.2 v16 05/13] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
From: Nico Pache @ 2026-04-29 15:34 UTC (permalink / raw)
  To: Usama Arif, Lorenzo Stoakes (Oracle), David Hildenbrand (Red Hat)
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, akpm,
	anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, Liam.Howlett, mathieu.desnoyers, matthew.brost,
	mhiramat, mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe
In-Reply-To: <20260420142057.392263-1-usama.arif@linux.dev>

On 4/20/26 8:20 AM, Usama Arif wrote:
> On Sun, 19 Apr 2026 12:57:42 -0600 Nico Pache <npache@redhat.com> wrote:
>
>> Pass an order and offset to collapse_huge_page to support collapsing anon
>> memory to arbitrary orders within a PMD. order indicates what mTHP size we
>> are attempting to collapse to, and offset indicates were in the PMD to
>> start the collapse attempt.
>>
>> For non-PMD collapse we must leave the anon VMA write locked until after
>> we collapse the mTHP-- in the PMD case all the pages are isolated, but in
>> the mTHP case this is not true, and we must keep the lock to prevent
>> access/changes to the page tables. This can happen if the rmap walkers hit
>> a pmd_none while the PMD entry is currently unavailable due to being
>> temporarily removed during the collapse phase.
>>
>> Signed-off-by: Nico Pache <npache@redhat.com>
>> ---
>>   mm/khugepaged.c | 103 +++++++++++++++++++++++++++---------------------
>>   1 file changed, 57 insertions(+), 46 deletions(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 283bb63854a5..ff6f9f1883ed 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -1198,42 +1198,36 @@ static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_stru
>>      return SCAN_SUCCEED;
>>   }
>>
>> -static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long address,
>> -            int referenced, int unmapped, struct collapse_control *cc)
>> +static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long start_addr,
>> +            int referenced, int unmapped, struct collapse_control *cc,
>> +            unsigned int order)
>>   {
>>      LIST_HEAD(compound_pagelist);
>>      pmd_t *pmd, _pmd;
>> -    pte_t *pte;
>> +    pte_t *pte = NULL;
>>      pgtable_t pgtable;
>>      struct folio *folio;
>>      spinlock_t *pmd_ptl, *pte_ptl;
>>      enum scan_result result = SCAN_FAIL;
>>      struct vm_area_struct *vma;
>>      struct mmu_notifier_range range;
>> +    bool anon_vma_locked = false;
>> +    const unsigned long pmd_addr = start_addr & HPAGE_PMD_MASK;
>> +    const unsigned long end_addr = start_addr + (PAGE_SIZE << order);
>>
>> -    VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>> -
>> -    /*
>> -     * Before allocating the hugepage, release the mmap_lock read lock.
>> -     * The allocation can take potentially a long time if it involves
>> -     * sync compaction, and we do not need to hold the mmap_lock during
>> -     * that. We will recheck the vma after taking it again in write mode.
>> -     */
>> -    mmap_read_unlock(mm);
>> -
>
> My understanding now is that the caller will need to drop the mmap_read lock?
>
> This needs explicit documentation at the start of the function IMO. I think
> there are callers in later patches and its very easy to miss this.

Yeah as David suggested I will probably seperate this out into its own
patch. And add some proper comments to describe the requirement. I didnt
even realize I had forgot to update the patch to describe this change
(apart from the coverletter). Thank you :)

>
>
>> -    result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
>> +    result = alloc_charge_folio(&folio, mm, cc, order);
>>      if (result != SCAN_SUCCEED)
>>              goto out_nolock;
>>
>>      mmap_read_lock(mm);
>> -    result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
>> -                                     HPAGE_PMD_ORDER);
>> +    result = hugepage_vma_revalidate(mm, pmd_addr, /*expect_anon=*/ true,
>> +                                     &vma, cc, order);
>>      if (result != SCAN_SUCCEED) {
>>              mmap_read_unlock(mm);
>>              goto out_nolock;
>>      }
>>
>> -    result = find_pmd_or_thp_or_none(mm, address, &pmd);
>> +    result = find_pmd_or_thp_or_none(mm, pmd_addr, &pmd);
>>      if (result != SCAN_SUCCEED) {
>>              mmap_read_unlock(mm);
>>              goto out_nolock;
>> @@ -1245,8 +1239,8 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>>               * released when it fails. So we jump out_nolock directly in
>>               * that case.  Continuing to collapse causes inconsistency.
>>               */
>> -            result = __collapse_huge_page_swapin(mm, vma, address, pmd,
>> -                                                 referenced, HPAGE_PMD_ORDER);
>> +            result = __collapse_huge_page_swapin(mm, vma, start_addr, pmd,
>> +                                                 referenced, order);
>>              if (result != SCAN_SUCCEED)
>>                      goto out_nolock;
>>      }
>> @@ -1261,20 +1255,21 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>>       * mmap_lock.
>>       */
>>      mmap_write_lock(mm);
>> -    result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
>> -                                     HPAGE_PMD_ORDER);
>> +    result = hugepage_vma_revalidate(mm, pmd_addr, /*expect_anon=*/ true,
>> +                                     &vma, cc, order);
>>      if (result != SCAN_SUCCEED)
>>              goto out_up_write;
>>      /* check if the pmd is still valid */
>>      vma_start_write(vma);
>> -    result = check_pmd_still_valid(mm, address, pmd);
>> +    result = check_pmd_still_valid(mm, pmd_addr, pmd);
>>      if (result != SCAN_SUCCEED)
>>              goto out_up_write;
>>
>>      anon_vma_lock_write(vma->anon_vma);
>> +    anon_vma_locked = true;
>>
>> -    mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
>> -                            address + HPAGE_PMD_SIZE);
>> +    mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, start_addr,
>> +                            end_addr);
>>      mmu_notifier_invalidate_range_start(&range);
>>
>>      pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
>> @@ -1286,26 +1281,23 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>>       * Parallel GUP-fast is fine since GUP-fast will back off when
>>       * it detects PMD is changed.
>>       */
>> -    _pmd = pmdp_collapse_flush(vma, address, pmd);
>> +    _pmd = pmdp_collapse_flush(vma, pmd_addr, pmd);
>
>
> For an mTHP collapse covering, say, 64KiB of a 2MiB PMD, the patch still
> flushes the entire PMD via pmdp_collapse_flush and tlb_remove_table_sync_one.
> That triggers cross-CPU TLB shootdowns for ~1.94MiB of unrelated mappings on
> every successful sub-PMD collapse. Probably acceptable as a first cut?

Hmm interesting consideration, I believe this is neccisary from a GUP
perspective as the comment suggests. Me, Dev, and David had discussed
that once that this lands we wanted to play around with a lot of the
locking and previously defined "requirements" to see what is actually
needed. So yeah I think for now we leave it as is. Hopefully we can make
more improvements in the future.

>
>
>>      spin_unlock(pmd_ptl);
>>      mmu_notifier_invalidate_range_end(&range);
>>      tlb_remove_table_sync_one();
>>
>> -    pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
>> +    pte = pte_offset_map_lock(mm, &_pmd, start_addr, &pte_ptl);
>>      if (pte) {
>> -            result = __collapse_huge_page_isolate(vma, address, pte, cc,
>> -                                                  HPAGE_PMD_ORDER,
>> -                                                  &compound_pagelist);
>> +            result = __collapse_huge_page_isolate(vma, start_addr, pte, cc,
>> +                                                  order, &compound_pagelist);
>>              spin_unlock(pte_ptl);
>>      } else {
>>              result = SCAN_NO_PTE_TABLE;
>>      }
>>
>>      if (unlikely(result != SCAN_SUCCEED)) {
>> -            if (pte)
>> -                    pte_unmap(pte);
>>              spin_lock(pmd_ptl);
>> -            BUG_ON(!pmd_none(*pmd));
>> +            WARN_ON_ONCE(!pmd_none(*pmd));
>
> Why was this turned into WARN_ON_ONCE? Would be good to add to commit message
> what the reason is if it has been discussed earlier.
>
> The next line writes a PMD entry over an existing one — that leaks the
> previous page table or PMD-mapped folio and can corrupt VA mappings. BUG_ON
> failed loudly and safely; WARN_ON_ONCE continues into the corruption and falls
> silent after the first hit.

Discussed here:
https://lore.kernel.org/lkml/c781ec19-724b-4c00-9ad6-56b9733cefa0@lucifer.local/

But yes im more so on your side that id rather crash the system then
warn if we are potentially corrupting things in memory.

@lorenzo and @david can we come to a definitive consensus on this?

>
>
>>              /*
>>               * We can only use set_pmd_at when establishing
>>               * hugepmds and never for establishing regular pmds that
>> @@ -1313,21 +1305,24 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>>               */
>>              pmd_populate(mm, pmd, pmd_pgtable(_pmd));
>>              spin_unlock(pmd_ptl);
>> -            anon_vma_unlock_write(vma->anon_vma);
>>              goto out_up_write;
>>      }
>>
>>      /*
>> -     * All pages are isolated and locked so anon_vma rmap
>> -     * can't run anymore.
>> +     * For PMD collapse all pages are isolated and locked so anon_vma
>> +     * rmap can't run anymore. For mTHP collapse the PMD entry has been
>> +     * removed and not all pages are isolated and locked, so we must hold
>> +     * the lock to prevent neighboring folios from attempting to access
>> +     * this PMD until its reinstalled.
>>       */
>> -    anon_vma_unlock_write(vma->anon_vma);
>> +    if (is_pmd_order(order)) {
>> +            anon_vma_unlock_write(vma->anon_vma);
>> +            anon_vma_locked = false;
>> +    }
>>
>>      result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
>> -                                       vma, address, pte_ptl,
>> -                                       HPAGE_PMD_ORDER,
>> -                                       &compound_pagelist);
>> -    pte_unmap(pte);
>> +                                       vma, start_addr, pte_ptl,
>> +                                       order, &compound_pagelist);
>>      if (unlikely(result != SCAN_SUCCEED))
>>              goto out_up_write;
>>
>> @@ -1337,18 +1332,27 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>>       * write.
>>       */
>>      __folio_mark_uptodate(folio);
>> -    pgtable = pmd_pgtable(_pmd);
>> -
>>      spin_lock(pmd_ptl);
>> -    BUG_ON(!pmd_none(*pmd));
>> -    pgtable_trans_huge_deposit(mm, pmd, pgtable);
>> -    map_anon_folio_pmd_nopf(folio, pmd, vma, address);
>> +    WARN_ON_ONCE(!pmd_none(*pmd));
>> +    if (is_pmd_order(order)) { /* PMD collapse */
>> +            pgtable = pmd_pgtable(_pmd);
>> +            pgtable_trans_huge_deposit(mm, pmd, pgtable);
>> +            map_anon_folio_pmd_nopf(folio, pmd, vma, pmd_addr);
>> +    } else { /* mTHP collapse */
>> +            map_anon_folio_pte_nopf(folio, pte, vma, start_addr, /*uffd_wp=*/ false);
>
> map_anon_folio_pte_nopf calls set_ptes and modifies pagetable, while holding
> pmd_ptl only. It should be safe as we expect pmd_none. But I think you should
> put a comment about this?

Yeah doesnt hurt. Ill try to clarify that.


>
>
>> +            smp_wmb(); /* make PTEs visible before PMD. See pmd_install() */
>> +            pmd_populate(mm, pmd, pmd_pgtable(_pmd));
>> +    }
>>      spin_unlock(pmd_ptl);
>>
>>      folio = NULL;
>>
>>      result = SCAN_SUCCEED;
>>   out_up_write:
>> +    if (anon_vma_locked)
>> +            anon_vma_unlock_write(vma->anon_vma);
>> +    if (pte)
>> +            pte_unmap(pte);
>>      mmap_write_unlock(mm);
>>   out_nolock:
>>      if (folio)
>> @@ -1525,8 +1529,15 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>>   out_unmap:
>>      pte_unmap_unlock(pte, ptl);
>>      if (result == SCAN_SUCCEED) {
>> +            /*
>> +             * Before allocating the hugepage, release the mmap_lock read lock.
>> +             * The allocation can take potentially a long time if it involves
>> +             * sync compaction, and we do not need to hold the mmap_lock during
>> +             * that. We will recheck the vma after taking it again in write mode.
>> +             */
>> +            mmap_read_unlock(mm);
>>              result = collapse_huge_page(mm, start_addr, referenced,
>> -                                        unmapped, cc);
>> +                                        unmapped, cc, HPAGE_PMD_ORDER);
>>              /* collapse_huge_page will return with the mmap_lock released */
>>              *lock_dropped = true;
>>      }
>> --
>> 2.53.0
>>
>>
>


^ permalink raw reply

* Re: [PATCH] mm/madvise: preserve uprobe breakpoints across MADV_DONTNEED
From: Oleg Nesterov @ 2026-04-29 15:24 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Darko Tominac, Masami Hiramatsu, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	James Clark, Andrew Morton, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, xe-linux-external, danielwa,
	linux-kernel, linux-trace-kernel, linux-perf-users, linux-mm
In-Reply-To: <889e4186-9f75-45d3-bd9a-1c90b218a635@kernel.org>

On 04/29, David Hildenbrand (Arm) wrote:
>
> On 4/29/26 15:15, Darko Tominac wrote:
>
> > uprobe infrastructure does not re-instrument pages on individual page
> > faults (uprobe_mmap() is only called during VMA creation, not on
> > page-in), the breakpoints are silently lost once the discarded pages are
> > re-read from the backing file.  The probes stop firing with no error
> > indication, and the only recovery is to unregister and re-register the
> > affected uprobes.
>
> Right. Don't MADV_DONTNEED uprobes, just like you are not supposed to
> MADV_DONTNEED debugger breakpoints/set data etc. :)

Agreed, thanks.

Oleg.


^ permalink raw reply

* Re: [PATCH v18 7/8] ring-buffer: Cleanup persistent ring buffer validation
From: Masami Hiramatsu @ 2026-04-29 15:21 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Catalin Marinas, Will Deacon, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel, Ian Rogers, linux-arm-kernel
In-Reply-To: <20260428162457.1ca8c4b6@gandalf.local.home>

On Tue, 28 Apr 2026 16:24:57 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Fri, 24 Apr 2026 15:52:59 +0900
> "Masami Hiramatsu (Google)" <mhiramat@kernel.org> wrote:
> 
> > From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
> > 
> > Cleanup rb_meta_validate_events() function to make it easier to read.
> > This includes the following cleanups:
> >  - Introduce rb_validatation_state to hold working variables in
> >    validation.
> >  - Move repleated validation state updates into rb_validate_buffer().
> >  - Move reader_page injection code outside of rb_meta_validate_events().
> > 
> > Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
> > ---
> >  kernel/trace/ring_buffer.c |  186 ++++++++++++++++++++++----------------------
> >  1 file changed, 95 insertions(+), 91 deletions(-)
> > 
> > diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> > index de653a8e3cec..9850a0d8d24b 100644
> > --- a/kernel/trace/ring_buffer.c
> > +++ b/kernel/trace/ring_buffer.c
> > @@ -1883,8 +1883,16 @@ static int rb_read_data_buffer(struct buffer_data_page *dpage, int tail, int cpu
> >  	return events;
> >  }
> >  
> > -static int rb_validate_buffer(struct buffer_page *bpage, int cpu,
> > -			      struct ring_buffer_cpu_meta *meta, u64 prev_ts, u64 next_ts)
> > +struct rb_validation_state {
> > +	unsigned long entries;
> > +	unsigned long entry_bytes;
> > +	int discarded;
> > +	u64 ts;
> > +};
> > +
> > +static int __rb_validate_buffer(struct buffer_page *bpage, int cpu,
> > +				struct ring_buffer_cpu_meta *meta,
> > +				u64 prev_ts, u64 next_ts)
> >  {
> 
> This can still use those comments (from patch 4).
> 
> Also, could you rebase on top of v7.1-rc1?

Is it v7.1-rc1, not ring-buffer/for-next?

Thanks,

> 
> Thanks Masami!
> 
> -- Steve
> 
> >  	struct buffer_data_page *dpage = bpage->page;
> >  	unsigned long long ts;
> > @@ -1914,16 +1922,82 @@ static int rb_validate_buffer(struct buffer_page *bpage, int cpu,
> >  	return ret;
> >  }


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply

* Re: [PATCH v18 4/8] ring-buffer: Skip invalid sub-buffers when rewinding persistent ring buffer
From: Masami Hiramatsu @ 2026-04-29 15:20 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Catalin Marinas, Will Deacon, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel, Ian Rogers, linux-arm-kernel
In-Reply-To: <20260428162146.78e52988@gandalf.local.home>

On Tue, 28 Apr 2026 16:21:46 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Fri, 24 Apr 2026 15:52:35 +0900
> "Masami Hiramatsu (Google)" <mhiramat@kernel.org> wrote:
> 
> > @@ -1892,9 +1895,19 @@ static int rb_validate_buffer(struct buffer_data_page *dpage, int cpu,
> >  	 * subbuf_size is considered invalid.
> >  	 */
> >  	tail = local_read(&dpage->commit) & ~RB_MISSED_MASK;
> > -	if (tail > meta->subbuf_size - BUF_PAGE_HDR_SIZE)
> > -		return -1;
> > -	return rb_read_data_buffer(dpage, tail, cpu, &ts, &delta);
> > +	if (tail <= meta->subbuf_size - BUF_PAGE_HDR_SIZE)
> > +		ret = rb_read_data_buffer(dpage, tail, cpu, &ts, &delta);
> > +
> 
> This code seriously needs comments.

OK, I'll add it, or let code explain clearer?

	if (tail <= meta->subbuf_size - BUF_PAGE_HDR_SIZE)
		ret = rb_read_data_buffer(dpage, tail, cpu, &ts, &delta);
	else
		ret = -1;

Thanks,

> 
> -- Steve
> 
> > +	if (ret < 0 || (prev_ts && prev_ts > ts) || (next_ts && ts > next_ts)) {
> > +		local_set(&bpage->entries, 0);
> > +		local_set(&bpage->page->commit, 0);
> > +		bpage->page->time_stamp = prev_ts ? prev_ts : next_ts;
> > +		ret = -1;
> > +	} else {
> > +		local_set(&bpage->entries, ret);
> > +	}
> > +
> > +	return ret;
> >  }
> >  


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply

* Re: [PATCH v18 3/8] ring-buffer: Skip invalid sub-buffers when validating persistent ring buffer
From: Masami Hiramatsu @ 2026-04-29 15:15 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Catalin Marinas, Will Deacon, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel, Ian Rogers, linux-arm-kernel
In-Reply-To: <20260428155508.4f47279e@gandalf.local.home>

On Tue, 28 Apr 2026 15:55:08 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Fri, 24 Apr 2026 15:52:27 +0900
> "Masami Hiramatsu (Google)" <mhiramat@kernel.org> wrote:
> 
> > @@ -5648,11 +5668,12 @@ __rb_get_reader_page(struct ring_buffer_per_cpu *cpu_buffer)
> >   again:
> >  	/*
> >  	 * This should normally only loop twice. But because the
> > -	 * start of the reader inserts an empty page, it causes
> > -	 * a case where we will loop three times. There should be no
> > -	 * reason to loop four times (that I know of).
> > +	 * start of the reader inserts an empty page, it causes a
> > +	 * case where we will loop three times. There should be no
> > +	 * reason to loop four times unless the ring buffer is a
> > +	 * recovered persistent ring buffer.
> 
> Can you explain more to why this is allowed for persistent ring buffer?

Ah, that was introduced in v15.

  Changes in v15:
  - Skip reader_page loop check on persistent ring buffer because
    there can be contiguous empty(invalidated) pages. 


So, finding next reader_page, we need to skip empty pages, which is normally
not contiguous. However, if we see more than 3 invalid pages on recovering
persistent ring buffer, it will be reset and become empty.


> 
> Note, I do not like any loops that can go into an infinite loop and lock up
> the machine. If something goes wrong with a persistent ring buffer, then
> this could possibly go into an infinite loop.

Yeah, so I think we should not use goto here. OK, let me update it to
an actual loop.

> 
> I want to understand why this is allowed, and possibly add a check that
> prevents this from never ending.

It should not be a never ending loop (there are other exit conditions),
but I agreed. What about limiting with nr_subbufs?

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 5326924615a4..aa89ec96e964 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -5630,7 +5630,8 @@ __rb_get_reader_page(struct ring_buffer_per_cpu *cpu_buffer)
 	 * a case where we will loop three times. There should be no
 	 * reason to loop four times (that I know of).
 	 */
-	if (RB_WARN_ON(cpu_buffer, ++nr_loops > 3)) {
+	if (RB_WARN_ON(cpu_buffer,
+		++nr_loops > (cpu_buffer->ring_meta ? cpu_buffer->nr_subbufs : 3))) {
 		reader = NULL;
 		goto out;
 	}

> 
> -- Steve
> 
> 
> >  	 */
> > -	if (RB_WARN_ON(cpu_buffer, ++nr_loops > 3)) {
> > +	if (RB_WARN_ON(cpu_buffer, ++nr_loops > 3 && !cpu_buffer->ring_meta)) {
> >  		reader = NULL;
> >  		goto out;
> >  	}
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox