Linux Trace Kernel

Linux Trace Kernel
 help / color / mirror / Atom feed

* Re: [PATCH RFC v4 07/44] KVM: guest_memfd: Only prepare folios for private pages
From: Michael Roth @ 2026-04-01 15:16 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	ira.weiny, jmattson, jroedel, jthoughton, oupton, pankaj.gupta,
	qperret, rick.p.edgecombe, rientjes, shivankg, steven.price,
	tabba, willy, wyihan, yan.y.zhao, forkloop, pratyush,
	suzuki.poulose, aneesh.kumar, Paolo Bonzini, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Jason Gunthorpe, Vlastimil Babka, kvm,
	linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm
In-Reply-To: <CAEvNRgF+FjJ1EWSR_rzD1=N040ZitiRrM2O3N0Kj5yN5rT3h+Q@mail.gmail.com>

On Wed, Apr 01, 2026 at 07:05:16AM -0700, Ackerley Tng wrote:
> Ackerley Tng <ackerleytng@google.com> writes:
> 
> > All-shared guest_memfd used to be only supported for non-CoCo VMs where
> > preparation doesn't apply. INIT_SHARED is about to be supported for
> > non-CoCo VMs in a later patch in this series.
> >
> > In addition, KVM_SET_MEMORY_ATTRIBUTES2 is about to be supported in
> > guest_memfd in a later patch in this series.
> >
> > This means that the kvm fault handler may now call kvm_gmem_get_pfn() on a
> > shared folio for a CoCo VM where preparation applies.
> >
> > Add a check to make sure that preparation is only performed for private
> > folios.
> >
> > Preparation will be undone on freeing (see kvm_gmem_free_folio()) and on
> > conversion to shared.
> >
> > Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> > ---
> >  virt/kvm/guest_memfd.c | 9 ++++++---
> >  1 file changed, 6 insertions(+), 3 deletions(-)
> >
> > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> > index b6ffa8734175d..d414ebfcb4c19 100644
> > --- a/virt/kvm/guest_memfd.c
> > +++ b/virt/kvm/guest_memfd.c
> > @@ -900,6 +900,7 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> >  		     int *max_order)
> >  {
> >  	pgoff_t index = kvm_gmem_get_index(slot, gfn);
> > +	struct inode *inode;
> >  	struct folio *folio;
> >  	int r = 0;
> >
> > @@ -907,7 +908,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> >  	if (!file)
> >  		return -EFAULT;
> >
> > -	filemap_invalidate_lock_shared(file_inode(file)->i_mapping);
> > +	inode = file_inode(file);
> > +	filemap_invalidate_lock_shared(inode->i_mapping);
> >
> >  	folio = __kvm_gmem_get_pfn(file, slot, index, pfn, max_order);
> >  	if (IS_ERR(folio)) {
> > @@ -920,7 +922,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> >  		folio_mark_uptodate(folio);
> >  	}
> >
> > -	r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
> > +	if (kvm_gmem_is_private_mem(inode, index))
> > +		r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
> 
> Michael, I might have misunderstood you at the last guest_memfd call:
> sev_gmem_prepare() doesn't prepare a page for being a shared page,
> right? Does this work? That prepare is only called to "make private"?

Hmm, I guess your guest_memfd-inplace-conversion-v4 branch is out of sync with
these patches?

I have the below local patch based on top of that for SNP-specific enablement,
which is basically identically, so suffice to say: yes, this should work
for SNP :) If any architecture pops up that needs to do some prep in
advance of mapping shared pages, then we could potentially plumb the
shared/private flag through to the arch-specific prep hook, as was also
suggested on the call, but it doesn't seem like that's needed by any
users for now.

-Mike

  Author: Michael Roth <michael.roth@amd.com>
  Date:   Mon Oct 27 07:58:32 2025 -0500
  
      KVM: guest_memfd: Don't prepare shared folios
  
      In the current guest_memfd logic, "preparation" is only used currently
      to describe the additional work of putting a guest_memfd page into an
      architecturally-defined "private" state, such as updating RMP table
      entries for SEV-SNP guests. As such, there's no input to the
      corresponding kvm_arch_gmem_prepare() hooks as to whether a page is
      being prepared/accessed as shared or as private, so "preparation" will
      end up being erroneously done on pages that were supposed to remain in a
      shared state. Rather than plumb through the additional information
      needed to distinguish between shared vs. private preparation, just
      continue to only do preparation on private pages, as was the case prior
      to support for GUEST_MEMFD_FLAG_MMAP being introduced.
  
      Signed-off-by: Michael Roth <michael.roth@amd.com>
  
  diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
  index 3acc6d983449..4869e59e4fc5 100644
  --- a/virt/kvm/guest_memfd.c
  +++ b/virt/kvm/guest_memfd.c
  @@ -1249,7 +1249,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
                  folio_mark_uptodate(folio);
          }
  
  -       r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
  +       if (!kvm_gmem_is_shared_mem(file_inode(file), index))
  +               r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
  
          folio_unlock(folio);

> 
> >
> >  	folio_unlock(folio);
> >
> > @@ -930,7 +933,7 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> >  		folio_put(folio);
> >
> >  out:
> > -	filemap_invalidate_unlock_shared(file_inode(file)->i_mapping);
> > +	filemap_invalidate_unlock_shared(inode->i_mapping);
> >  	return r;
> >  }
> >  EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_gmem_get_pfn);
> >
> > --
> > 2.53.0.1018.g2bb0e51243-goog

^ permalink raw reply

* Re: [PATCH 0/2] Fix trace remotes read with an offline CPU
From: Steven Rostedt @ 2026-04-01 15:16 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Vincent Donnefort, mhiramat, mathieu.desnoyers,
	linux-trace-kernel, kernel-team, linux-kernel
In-Reply-To: <861pgy4t1i.wl-maz@kernel.org>

On Wed, 01 Apr 2026 16:08:41 +0100
Marc Zyngier <maz@kernel.org> wrote:

> Do you want to look at v2 [1], which was sent two hours later?

Yeah I saw that right after I sent my ack.

Yep, my ack still stands, please add it to them.

Thanks,

-- Steve

^ permalink raw reply

* Re: [PATCH 0/2] Fix trace remotes read with an offline CPU
From: Marc Zyngier @ 2026-04-01 15:08 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Vincent Donnefort, mhiramat, mathieu.desnoyers,
	linux-trace-kernel, kernel-team, linux-kernel
In-Reply-To: <20260401103159.747ae168@gandalf.local.home>

On Wed, 01 Apr 2026 15:31:59 +0100,
Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> On Wed,  1 Apr 2026 03:50:01 +0100
> Vincent Donnefort <vdonnefort@google.com> wrote:
> 
> > This small series is fixing non-consuming read of a trace remote when the
> > trace_buffer is created after a CPU is offline.
> > 
> > It also extends hotplug testing coverage to include this test case.
> > 
> > I have based this series on top of kvmarm/next which contains the hypervisor
> > tracing patches.
> > 
> > Vincent Donnefort (2):
> >   tracing: Non-consuming read for trace remotes with an offline CPU
> >   tracing: selftests: Extend hotplug testing for trace remotes
> 
> This looks fine to me. Since they only affect the new remote code and will
> not conflict with anything I'm doing, they can go on top of the KVM patches
> in the arm64 tree.
> 
> Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> 
> for the series.

Do you want to look at v2 [1], which was sent two hours later?

	M.

[1] https://lore.kernel.org/r/20260401045100.3394299-1-vdonnefort@google.com

-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply

* Re: [PATCH v2] bootconfig: Apply early options from embedded config
From: Breno Leitao @ 2026-04-01 15:01 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Jonathan Corbet, Shuah Khan, linux-kernel, linux-trace-kernel,
	linux-doc, oss, paulmck, rostedt, kernel-team
In-Reply-To: <20260401224853.d8ed517a344c4be51d371a9c@kernel.org>

On Wed, Apr 01, 2026 at 10:48:53PM +0900, Masami Hiramatsu wrote:

> > The challenge extends beyond that. There are numerous early_parameter()
> > definitions scattered throughout the kernel that may or may not be
> > utilized by setup_arch().
> >
> > For example, consider `early_param("mitigations", ..)` in
> > ./kernel/cpu.c. This modifies the cpu_mitigations global variable, which
> > is referenced in various locations across different architectures.
> >
> > It's worth noting that we have over 300 early_parameter() instances in
> > the kernel.
> >
> > Given this, analyzing all these early parameters and examining each one
> > individually represents a substantial amount of work.
>
> Yes, that may require a substantial amount of work. But to improve
> the kernel framework around the parameter handling, eventually we
> need to examine each early parameter.

I'm still uncertain about this approach. The goal is to identify and
categorize the early parameters that are parsed prior to bootconfig
initialization.

Moreover, this work could become obsolete if bootconfig's initialization
point shifts earlier or later in the boot sequence, necessitating
another comprehensive analysis.

Conversely, if we successfully move bootconfig initialization earlier
by breaking the dependency of memblock (assuming this is feasible), the
vast majority of early parameters would execute after bootconfig is
configured, eliminating the need for this extensive categorization work.

Please, feel free to tell what approach might be better for the project.

> > Are there alternative approaches? At this point, I'm leaning toward
> > breaking bootconfig's dependency on memblock, allowing us to invoke it
> > before setup_arch(). Is this the only practical solution available?!
>
> Basically, the memblock dependency comes from allocating copy of data.
> Only for the embedded bootconfig, we can just pass copy memory block
> to the xbc_init(). Something like;
>
> xbc_init() {
>  xbc_data = memblock_alloc();
>  memcpy(xbc_data, data);
>  __xbc_init(xbc_data);
> }
>
> embedded_xbc_init() {
>   __xbc_init(embedded_bootconfig_data);
> }
>
> Afterwards, we can pass mixture of embedded bootcofnigt and initrd
> bootconfig data to parser again.
>
> (But in this case, we must be careful not to override the early
>  parameters that we have already applied.)

Do you have any additional recommendations if I proceed with this
approach?

Thank you for your detailed responses and insights.
--breno

^ permalink raw reply

* Re: [PATCH bpf] bpf: Reject sleepable kprobe_multi programs at attach time
From: Kumar Kartikeya Dwivedi @ 2026-04-01 14:42 UTC (permalink / raw)
  To: Varun R Mallya, jolsa
  Cc: andrii, ast, bpf, daniel, eddyz87, kpsingh, linux-kernel,
	linux-trace-kernel, martin.lau, mathieu.desnoyers, mattbobrowski,
	mhiramat, rostedt, song, yonghong.song
In-Reply-To: <CAP01T75D7p2smSyoFvyZykgC5bJdobL_cqtk6-w+LmP4NP9SpQ@mail.gmail.com>

On Wed, 1 Apr 2026 at 16:08, Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
>
> On Wed, 1 Apr 2026 at 15:49, Varun R Mallya <varunrmallya@gmail.com> wrote:
> >
> > kprobe.multi programs run in atomic/RCU context and cannot sleep.
> > However, bpf_kprobe_multi_link_attach() did not validate whether the
> > program being attached had the sleepable flag set, allowing sleepable
> > helpers such as bpf_copy_from_user() to be invoked from a non-sleepable
> > context.
> >
> > This causes a "sleeping function called from invalid context" splat:
> >
> >   BUG: sleeping function called from invalid context at ./include/linux/uaccess.h:169
> >   in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1787, name: sudo
> >   preempt_count: 1, expected: 0
> >   RCU nest depth: 2, expected: 0
> >
> > Fix this by rejecting sleepable programs early in
> > bpf_kprobe_multi_link_attach(), before any further processing.
> >
> > Fixes: 0dcac272540613d41c05e89679e4ddb978b612f1 ("bpf: Add multi kprobe link")
>
> Please use the right format.
> Fixes: 0dcac2725406 ("bpf: Add multi kprobe link")
>
> You can add the follow to your .gitconfig to generate the fixes tag.
> [alias]
>     fixes = log -1 --format='Fixes: %h (\"%s\")'
>
> > Signed-off-by: Varun R Mallya <varunrmallya@gmail.com>
> > ---
> >  kernel/trace/bpf_trace.c | 4 ++++
> >  1 file changed, 4 insertions(+)
> >
> > diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> > index 0b040a417442..af7079aa0f36 100644
> > --- a/kernel/trace/bpf_trace.c
> > +++ b/kernel/trace/bpf_trace.c
> > @@ -2752,6 +2752,10 @@ int bpf_kprobe_multi_link_attach(const union bpf_attr *attr, struct bpf_prog *pr
> >         if (!is_kprobe_multi(prog))
> >                 return -EINVAL;
> >
> > +       /* kprobe_multi is not allowed to be sleepable. */
> > +       if (prog->sleepable)
> > +               return -EINVAL;
> > +
>
> Looks ok to me, Jiri, could you also take a look?
>
> >         /* Writing to context is not allowed for kprobes. */
> >         if (prog->aux->kprobe_write_ctx)
> >                 return -EINVAL;
> > --
> > 2.53.0
> >

Also, please add a selftest.

pw-bot: cr

^ permalink raw reply

* Re: [PATCH v9 3/3] tracing/Documentation: Add a section about backup instance
From: Steven Rostedt @ 2026-04-01 14:41 UTC (permalink / raw)
  To: Masami Hiramatsu (Google)
  Cc: Mathieu Desnoyers, linux-kernel, linux-trace-kernel
In-Reply-To: <20260401121541.e2c8c549ae172ccf2d5e3035@kernel.org>

On Wed, 1 Apr 2026 12:15:41 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:

> > > +It is also possible to record trace data at system boot time by specifying
> > > +events with the persistent ring buffer, but in this case the data before the
> > > +reboot will be lost before it can be read. This problem can be solved by a
> > > +backup instance. From the kernel command line::
> > > +
> > > +  reserve_mem=12M:4096:trace trace_instance=boot_map@trace,sched,irq trace_instance=backup=boot_map  
> > 
> > The above didn't actually work well without my patch to enable events on
> > the persistent ring buffer with the backup. But keep it, as it will work
> > with my patch ;-)  
> 
> Ah, thanks!
> Let me rebase on it.

Well, my patch hasn't been added yet ;-)  But don't change anything. When
this goes upstream both patches should have been applied.

-- Steve

^ permalink raw reply

* Re: [PATCH v9 2/3] tracing: Remove the backup instance automatically after read
From: Steven Rostedt @ 2026-04-01 14:40 UTC (permalink / raw)
  To: Masami Hiramatsu (Google)
  Cc: Mathieu Desnoyers, linux-kernel, linux-trace-kernel
In-Reply-To: <20260401121957.2665d454390aff97593bb996@kernel.org>

On Wed, 1 Apr 2026 12:19:57 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:

> > 
> >   CPU 0							CPU 1
> >   -----							-----
> >   open(trace_pipe);
> >   read(..);
> >   close(trace_pipe);
> >      kick the work queue to delete it....
> > 						rmdir();
> > 							[instance deleted]  
> 
> I thought this requires trace_types_lock, and after kicked the queue,
> can rmdir() gets the tr? (__trace_array_get() return error if
> tr->free_on_close is set)

rmdir() doesn't use __trace_array_get(), it uses trace_array_find() which
we shouldn't need to modify.

static int instance_rmdir(const char *name)
{
	struct trace_array *tr;

	guard(mutex)(&event_mutex);
	guard(mutex)(&trace_types_lock);

	tr = trace_array_find(name);
	if (!tr)
		return -ENODEV;

	return __remove_instance(tr);
}

> 
> > 
> >   __remove_instance();
> > 
> >    [ now the tr is freed, and the remove will crash!]
> > 
> > 
> > What would prevent this is this is to use trace_array_destroy() that checks
> > this and also adds the proper locking:
> > 
> > static void trace_array_autoremove(struct work_struct *work)
> > {
> > 	struct trace_array *tr = container_of(work, struct trace_array, autoremove_work);
> > 
> > 	trace_array_destroy(tr);
> > }  
> 
> OK, let's use it.

Yes, by using trace_array_destroy(), it will fix this.

Thanks,

-- Steve

^ permalink raw reply

* Re: [PATCH 0/2] Fix trace remotes read with an offline CPU
From: Steven Rostedt @ 2026-04-01 14:31 UTC (permalink / raw)
  To: Vincent Donnefort
  Cc: mhiramat, mathieu.desnoyers, linux-trace-kernel, maz, kernel-team,
	linux-kernel
In-Reply-To: <20260401025003.3258729-1-vdonnefort@google.com>

On Wed,  1 Apr 2026 03:50:01 +0100
Vincent Donnefort <vdonnefort@google.com> wrote:

> This small series is fixing non-consuming read of a trace remote when the
> trace_buffer is created after a CPU is offline.
> 
> It also extends hotplug testing coverage to include this test case.
> 
> I have based this series on top of kvmarm/next which contains the hypervisor
> tracing patches.
> 
> Vincent Donnefort (2):
>   tracing: Non-consuming read for trace remotes with an offline CPU
>   tracing: selftests: Extend hotplug testing for trace remotes

This looks fine to me. Since they only affect the new remote code and will
not conflict with anything I'm doing, they can go on top of the KVM patches
in the arm64 tree.

Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>

for the series.

Thanks,

-- Steve

^ permalink raw reply

* Re: [PATCH bpf] bpf: Reject sleepable kprobe_multi programs at attach time
From: Kumar Kartikeya Dwivedi @ 2026-04-01 14:08 UTC (permalink / raw)
  To: Varun R Mallya, jolsa
  Cc: andrii, ast, bpf, daniel, eddyz87, kpsingh, linux-kernel,
	linux-trace-kernel, martin.lau, mathieu.desnoyers, mattbobrowski,
	mhiramat, rostedt, song, yonghong.song
In-Reply-To: <20260401134921.362148-1-varunrmallya@gmail.com>

On Wed, 1 Apr 2026 at 15:49, Varun R Mallya <varunrmallya@gmail.com> wrote:
>
> kprobe.multi programs run in atomic/RCU context and cannot sleep.
> However, bpf_kprobe_multi_link_attach() did not validate whether the
> program being attached had the sleepable flag set, allowing sleepable
> helpers such as bpf_copy_from_user() to be invoked from a non-sleepable
> context.
>
> This causes a "sleeping function called from invalid context" splat:
>
>   BUG: sleeping function called from invalid context at ./include/linux/uaccess.h:169
>   in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1787, name: sudo
>   preempt_count: 1, expected: 0
>   RCU nest depth: 2, expected: 0
>
> Fix this by rejecting sleepable programs early in
> bpf_kprobe_multi_link_attach(), before any further processing.
>
> Fixes: 0dcac272540613d41c05e89679e4ddb978b612f1 ("bpf: Add multi kprobe link")

Please use the right format.
Fixes: 0dcac2725406 ("bpf: Add multi kprobe link")

You can add the follow to your .gitconfig to generate the fixes tag.
[alias]
    fixes = log -1 --format='Fixes: %h (\"%s\")'

> Signed-off-by: Varun R Mallya <varunrmallya@gmail.com>
> ---
>  kernel/trace/bpf_trace.c | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> index 0b040a417442..af7079aa0f36 100644
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -2752,6 +2752,10 @@ int bpf_kprobe_multi_link_attach(const union bpf_attr *attr, struct bpf_prog *pr
>         if (!is_kprobe_multi(prog))
>                 return -EINVAL;
>
> +       /* kprobe_multi is not allowed to be sleepable. */
> +       if (prog->sleepable)
> +               return -EINVAL;
> +

Looks ok to me, Jiri, could you also take a look?

>         /* Writing to context is not allowed for kprobes. */
>         if (prog->aux->kprobe_write_ctx)
>                 return -EINVAL;
> --
> 2.53.0
>

^ permalink raw reply

* Re: [PATCH RFC v4 07/44] KVM: guest_memfd: Only prepare folios for private pages
From: Ackerley Tng @ 2026-04-01 14:05 UTC (permalink / raw)
  To: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	ira.weiny, jmattson, jroedel, jthoughton, michael.roth, oupton,
	pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
	steven.price, tabba, willy, wyihan, yan.y.zhao, forkloop,
	pratyush, suzuki.poulose, aneesh.kumar, Paolo Bonzini,
	Sean Christopherson, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, Jason Gunthorpe,
	Vlastimil Babka
  Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm
In-Reply-To: <20260326-gmem-inplace-conversion-v4-7-e202fe950ffd@google.com>

Ackerley Tng <ackerleytng@google.com> writes:

> All-shared guest_memfd used to be only supported for non-CoCo VMs where
> preparation doesn't apply. INIT_SHARED is about to be supported for
> non-CoCo VMs in a later patch in this series.
>
> In addition, KVM_SET_MEMORY_ATTRIBUTES2 is about to be supported in
> guest_memfd in a later patch in this series.
>
> This means that the kvm fault handler may now call kvm_gmem_get_pfn() on a
> shared folio for a CoCo VM where preparation applies.
>
> Add a check to make sure that preparation is only performed for private
> folios.
>
> Preparation will be undone on freeing (see kvm_gmem_free_folio()) and on
> conversion to shared.
>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> ---
>  virt/kvm/guest_memfd.c | 9 ++++++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index b6ffa8734175d..d414ebfcb4c19 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -900,6 +900,7 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>  		     int *max_order)
>  {
>  	pgoff_t index = kvm_gmem_get_index(slot, gfn);
> +	struct inode *inode;
>  	struct folio *folio;
>  	int r = 0;
>
> @@ -907,7 +908,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>  	if (!file)
>  		return -EFAULT;
>
> -	filemap_invalidate_lock_shared(file_inode(file)->i_mapping);
> +	inode = file_inode(file);
> +	filemap_invalidate_lock_shared(inode->i_mapping);
>
>  	folio = __kvm_gmem_get_pfn(file, slot, index, pfn, max_order);
>  	if (IS_ERR(folio)) {
> @@ -920,7 +922,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>  		folio_mark_uptodate(folio);
>  	}
>
> -	r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
> +	if (kvm_gmem_is_private_mem(inode, index))
> +		r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);

Michael, I might have misunderstood you at the last guest_memfd call:
sev_gmem_prepare() doesn't prepare a page for being a shared page,
right? Does this work? That prepare is only called to "make private"?

>
>  	folio_unlock(folio);
>
> @@ -930,7 +933,7 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>  		folio_put(folio);
>
>  out:
> -	filemap_invalidate_unlock_shared(file_inode(file)->i_mapping);
> +	filemap_invalidate_unlock_shared(inode->i_mapping);
>  	return r;
>  }
>  EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_gmem_get_pfn);
>
> --
> 2.53.0.1018.g2bb0e51243-goog

^ permalink raw reply

* [PATCH] bootconfig: Skip printing early params to cmdline from bootconfig
From: Masami Hiramatsu (Google) @ 2026-04-01 14:02 UTC (permalink / raw)
  To: Masami Hiramatsu, Steven Rostedt
  Cc: Breno Leitao, linux-kernel, linux-trace-kernel

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

If user configures `kernel.key` in bootconfig, the 'key' is shown
in kernel cmdline (/proc/cmdline) and kernel boot parameter
handler associated with 'key' is invoked. However, since the
bootconfig does not support the parameter defined with early_param,
those keys are shown in '/proc/cmdline' but not handled by kernel.

This could easily mislead users who expected to be able to specify
early parameters via the boot configuration, leading them to wonder
why it doesn't work.

Let's skip printing out early params to cmdline buffer, and warn
if there is such parameters in bootconfig.

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
 init/main.c |   30 ++++++++++++++++++++++++------
 1 file changed, 24 insertions(+), 6 deletions(-)

diff --git a/init/main.c b/init/main.c
index 1cb395dd94e4..095c497ea2df 100644
--- a/init/main.c
+++ b/init/main.c
@@ -324,10 +324,21 @@ static void * __init get_boot_config_from_initrd(size_t *_size)
 
 static char xbc_namebuf[XBC_KEYLEN_MAX] __initdata;
 
+static bool __init is_early_param(const char *param)
+{
+	const struct obs_kernel_param *p;
+
+	for (p = __setup_start; p < __setup_end; p++) {
+		if (p->early && parameq(param, p->str))
+			return true;
+	}
+	return false;
+}
+
 #define rest(dst, end) ((end) > (dst) ? (end) - (dst) : 0)
 
 static int __init xbc_snprint_cmdline(char *buf, size_t size,
-				      struct xbc_node *root)
+				      struct xbc_node *root, bool is_kernel)
 {
 	struct xbc_node *knode, *vnode;
 	char *end = buf + size;
@@ -340,6 +351,13 @@ static int __init xbc_snprint_cmdline(char *buf, size_t size,
 		if (ret < 0)
 			return ret;
 
+		/* We will skip early params because it is not applied. */
+		if (is_kernel && is_early_param(xbc_namebuf)) {
+			pr_warn_once("early_param(e.g. %s.%s) is not passed to cmdline from bootconfig\n",
+				     xbc_node_get_data(root), xbc_namebuf);
+			continue;
+		}
+
 		vnode = xbc_node_get_child(knode);
 		if (!vnode) {
 			ret = snprintf(buf, rest(buf, end), "%s ", xbc_namebuf);
@@ -368,7 +386,7 @@ static int __init xbc_snprint_cmdline(char *buf, size_t size,
 #undef rest
 
 /* Make an extra command line under given key word */
-static char * __init xbc_make_cmdline(const char *key)
+static char * __init xbc_make_cmdline(const char *key, bool is_kernel)
 {
 	struct xbc_node *root;
 	char *new_cmdline;
@@ -379,7 +397,7 @@ static char * __init xbc_make_cmdline(const char *key)
 		return NULL;
 
 	/* Count required buffer size */
-	len = xbc_snprint_cmdline(NULL, 0, root);
+	len = xbc_snprint_cmdline(NULL, 0, root, is_kernel);
 	if (len <= 0)
 		return NULL;
 
@@ -389,7 +407,7 @@ static char * __init xbc_make_cmdline(const char *key)
 		return NULL;
 	}
 
-	ret = xbc_snprint_cmdline(new_cmdline, len + 1, root);
+	ret = xbc_snprint_cmdline(new_cmdline, len + 1, root, is_kernel);
 	if (ret < 0 || ret > len) {
 		pr_err("Failed to print extra kernel cmdline.\n");
 		memblock_free(new_cmdline, len + 1);
@@ -465,9 +483,9 @@ static void __init setup_boot_config(void)
 		xbc_get_info(&ret, NULL);
 		pr_info("Load bootconfig: %ld bytes %d nodes\n", (long)size, ret);
 		/* keys starting with "kernel." are passed via cmdline */
-		extra_command_line = xbc_make_cmdline("kernel");
+		extra_command_line = xbc_make_cmdline("kernel", true);
 		/* Also, "init." keys are init arguments */
-		extra_init_args = xbc_make_cmdline("init");
+		extra_init_args = xbc_make_cmdline("init", false);
 	}
 	return;
 }


^ permalink raw reply related

* [PATCH bpf] bpf: Reject sleepable kprobe_multi programs at attach time
From: Varun R Mallya @ 2026-04-01 13:49 UTC (permalink / raw)
  To: varunrmallya
  Cc: andrii, ast, bpf, daniel, eddyz87, jolsa, kpsingh, linux-kernel,
	linux-trace-kernel, martin.lau, mathieu.desnoyers, mattbobrowski,
	memxor, mhiramat, rostedt, song, yonghong.song
In-Reply-To: <ac0hLyMuPHiZfnVc@computer>

kprobe.multi programs run in atomic/RCU context and cannot sleep.
However, bpf_kprobe_multi_link_attach() did not validate whether the
program being attached had the sleepable flag set, allowing sleepable
helpers such as bpf_copy_from_user() to be invoked from a non-sleepable
context.

This causes a "sleeping function called from invalid context" splat:

  BUG: sleeping function called from invalid context at ./include/linux/uaccess.h:169
  in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1787, name: sudo
  preempt_count: 1, expected: 0
  RCU nest depth: 2, expected: 0

Fix this by rejecting sleepable programs early in
bpf_kprobe_multi_link_attach(), before any further processing.

Fixes: 0dcac272540613d41c05e89679e4ddb978b612f1 ("bpf: Add multi kprobe link")
Signed-off-by: Varun R Mallya <varunrmallya@gmail.com>
---
 kernel/trace/bpf_trace.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 0b040a417442..af7079aa0f36 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -2752,6 +2752,10 @@ int bpf_kprobe_multi_link_attach(const union bpf_attr *attr, struct bpf_prog *pr
 	if (!is_kprobe_multi(prog))
 		return -EINVAL;
 
+	/* kprobe_multi is not allowed to be sleepable. */
+	if (prog->sleepable)
+		return -EINVAL;
+
 	/* Writing to context is not allowed for kprobes. */
 	if (prog->aux->kprobe_write_ctx)
 		return -EINVAL;
-- 
2.53.0


^ permalink raw reply related

* Re: [PATCH v2] bootconfig: Apply early options from embedded config
From: Masami Hiramatsu @ 2026-04-01 13:48 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Jonathan Corbet, Shuah Khan, linux-kernel, linux-trace-kernel,
	linux-doc, oss, paulmck, rostedt, kernel-team
In-Reply-To: <acvjcCqIAeHyIiQN@gmail.com>

Hi Breno,

On Tue, 31 Mar 2026 08:27:59 -0700
Breno Leitao <leitao@debian.org> wrote:

> hello Masami,
> 
> On Tue, Mar 31, 2026 at 12:58:27PM +0900, Masami Hiramatsu wrote:
> 
> > > 3) Ensure that early bootconfig parameters don't overwrite the boot command
> > >    line. For example, if the boot command line has foo=bar and bootconfig
> > >    later has foo=baz, the command line value should take precedence.
> > >    This prevents early boot code (in setup_arch()) from seeing a parameter
> > >    value that will be changed later.
> >
> > OK, this also needs to be considered. Currently we just pass the bootconfig
> > parameters right before bootloader given parameters as "extra_command_line"
> > if "bootconfig" in cmdline or CONFIG_BOOT_CONFIG_FORCE=y.
> >
> > [boot_config(.kernel)]<command_line>[ -- [boot_config(.init)][init_command_line]]
> >
> > This is currently expected behavior. The bootconfig parameters are
> > expected to be overridden by command_line or command_line are appended.
> 
> That's correct, and I have no intention of changing this behavior. Here's
> the current approach:
> 
> 1) Early parameters from the bootloader are parsed first in setup_arch()
> 
> 2) Subsequently, bootconfig_apply_early_params() is invoked. Any early
>    parameter that was already parsed from the bootloader (in setup_arch())
>    will be skipped at this stage.

Ah, I meant if we skip these parameters, we should not show it in the
command line via extra_command_line. This is still a minor issue at this
point. It should find early parameters in kernel.* parameters and do not
show it in extra_command_line, because those parameters are ignored.
So it is better to make a separated patch to fix that.

For example, if we pass

kernel.mem=1G

via bootconfig, it will be shown in the /proc/cmdline, but it is
not applied. This can confuse user.


> 
> > If we change this for early params, we also should change the expected
> > output of /proc/cmdline too. I think we have 2 options;
> >
> >  - As before, we expect the parameters provided by the boot configuration
> >    to be processed first and then overridden later by the command line.
> >
> > Or,
> >
> >  - ignore all parameters which is given from the command line, this also
> >    updates existing setup_boot_config() (means xbc_snprint_cmdline() ).
> >
> > Anyway, this behavior change will also be a bit critical... We have
> > to announce it.
> 
> As mentioned above, I don't anticipate any changes to existing behavior.
> Bootconfig parsing remains unchanged. The only modification is that
> bootconfig_apply_early_params() will skip any early config parameter
> that's already present in the bootloader command line.

Yes, but it is just different from existing one.
Suppose that if we have "early" and "normal" keys in the kernel, those
are handled by early_param() and __setup() respectively.

If we use bootconfig, like

 kernel {
   early = bconf_val
   normal = bconf_val
 }
 
And passes "early=foo normal=bar" via cmdline.

In this case, the /proc/cmdline eventually has

"early=bconf_val normal=bconf_val early=foo normal=bar"

And the "normal" callback called with "bconf_val" and "bar" twice.
However, "early" callback will be called with "foo" only once.

That can confuse users too.
I believe it's important for the system to behave in a way that is
as close as possible to the user's mind model.
Because the behavior is inconsistent when multiple parameters with
the same name are specified on the kernel command line, it is
necessary to ensure that users can later look at it and infer what
happened.

I mean, if a parameter is skipped, it should not be printed at
/proc/cmdline, because it can mislead user (and maybe bug reporter)
when a problem happens.

> 
> > > +Note that embedded bootconfig is parsed after ``setup_arch()``, so
> > > +early options that are consumed during architecture initialization
> > > +(e.g., ``mem=``, ``memmap=``, ``earlycon``, ``noapic``, ``nolapic``,
> > > +``acpi=``, ``numa=``, ``iommu=``) may not take effect from bootconfig.
> > > +
> >
> > This is easy to explain, but it's quite troublesome for users to
> > determine which parameters are unavailable.
> 
> Agreed. This turned out to be significantly more complex than I
> initially anticipated.

Yeah, that's complicated.

> 
> I'm uncertain whether we can accomplish this without examining every
> early_parameter() implementation in depth.

Agreed. My proposal is something like a divide and conquer approach.
Since these are implemented architecture by architecture, you need to
check the implementation for each architecture, and I think it's best
to implement them one by one using Kconfig.

> 
> > Currently we can identify
> > it by `git grep early_param -- arch/${ARCH}`. But it is setup in
> > setup_arch() we need to track the source code. (Or ask AI :))
> 
> The challenge extends beyond that. There are numerous early_parameter()
> definitions scattered throughout the kernel that may or may not be
> utilized by setup_arch().
> 
> For example, consider `early_param("mitigations", ..)` in
> ./kernel/cpu.c. This modifies the cpu_mitigations global variable, which
> is referenced in various locations across different architectures.
> 
> It's worth noting that we have over 300 early_parameter() instances in
> the kernel.
> 
> Given this, analyzing all these early parameters and examining each one
> individually represents a substantial amount of work.

Yes, that may require a substantial amount of work. But to improve
the kernel framework around the parameter handling, eventually we
need to examine each early parameter.

> 
> Are there alternative approaches? At this point, I'm leaning toward
> breaking bootconfig's dependency on memblock, allowing us to invoke it
> before setup_arch(). Is this the only practical solution available?!

Basically, the memblock dependency comes from allocating copy of data.
Only for the embedded bootconfig, we can just pass copy memory block
to the xbc_init(). Something like;

xbc_init() {
 xbc_data = memblock_alloc();
 memcpy(xbc_data, data);
 __xbc_init(xbc_data);
}

embedded_xbc_init() {
  __xbc_init(embedded_bootconfig_data);
}

Afterwards, we can pass mixture of embedded bootcofnigt and initrd
bootconfig data to parser again.

(But in this case, we must be careful not to override the early
 parameters that we have already applied.)

Thank you,

-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply

* [BUG] bpf: kprobe_multi allows BPF_F_SLEEPABLE programs, causing sleeping-in-atomic splat
From: Varun R Mallya @ 2026-04-01 13:44 UTC (permalink / raw)
  To: bpf
  Cc: linux-kernel, kpsingh, mattbobrowski, ast, daniel, andrii,
	martin.lau, eddyz87, memxor, song, yonghong.song, jolsa, rostedt,
	mhiramat, mathieu.desnoyers, linux-trace-kernel

Hi,

I found a bug in the kprobe_multi BPF link creation path where the kernel
fails to reject programs with BPF_F_SLEEPABLE set.

kprobe.multi programs run in an atomic/RCU context and cannot sleep.
However, bpf_link_create() with BPF_TRACE_KPROBE_MULTI does not validate
whether the program being attached has BPF_F_SLEEPABLE set. This allows
sleepable helpers such as bpf_copy_from_user() to be invoked from a
non-sleepable context, triggering a "sleeping function called from invalid
context" splat.

Reproducer:

  /* crash.bpf.c */
  SEC("kprobe.multi")
  int handle_kprobe_multi_crash(struct pt_regs *ctx)
  {
      char buf[16];
      int ret;
      ret = bpf_copy_from_user(buf, sizeof(buf), (void *)ctx->sp);
      bpf_printk("bpf_copy_from_user ret: %d\n", ret);
      return 0;
  }

  /* loader: manually set BPF_F_SLEEPABLE before load, then attach via
     bpf_link_create() with BPF_TRACE_KPROBE_MULTI */
  bpf_program__set_flags(prog, BPF_F_SLEEPABLE);

Kernel output (bpf-next 7.0.0-rc5+):

[  483.577248] BUG: sleeping function called from invalid context at ./include/linux/uaccess.h:
169
[  483.577364] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1787, name: sudo
[  483.577420] preempt_count: 1, expected: 0
[  483.577453] RCU nest depth: 2, expected: 0
[  483.577486] INFO: lockdep is turned off.
Attachment successful! The bug is confirmed.
Check 'sudo dmesg -w' for BPF prints and warnings.
Triggering v[  483.577569] CfsPU: 7 UID: 0 PID: 1787 Comm: sudo Tainted: G        W           _
rea7.0.0-rc5+ #4 PREEMPT(full) 
d[  483.577571] T.ainted: [W]=WARN..

Su[  483.577572] Hrvardware name: Bochs Bochs, BIOS iBochs 01/01/2011v
[  483.577573] Call Trace:
[  483.577575]  e<TASK>
[  483.577578]  d!dump_stack_lvl+0x54/0x70
 [  483.577582]  If __might_resched+b0x200/0x220
p[  483.577585]  f__might_fault+0x_c2c/0x80
o[  483.577587]  p_copy_from_user+y0x23/0x80
_[  483.577591]  bpf_copy_from_user+0x27/0x50
[  483.577594]  fbpf_prog_1906cf6roa66546b5e_handlem__kprobe_multi_crash+0x33/0x4d
u[  483.577596]  skprobe_multi_linek_handler+0x15d/r re0x260
[  483.577599]  t? kprobe_multi_lurink_handler+0x99n/0x260
[  483.577602]  e? __pfx_vfs_readd +0x10/0x10
0[  483.577604]   in? ksys_read+0x7c/0x100
 [  483.577605]  dm? vfs_read+0x4/0esg, x360
t[  483.577607]  fprobe_ftrace_entry+0x3b8/0x480
[  483.577609]  h? ksys_read+0x7ce /0x100
e[  483.577610]  x? fprobe_ftrace_pentry+0x93/0x480l
o[  483.577612]  it? lock_release+0x42/0x330
 [  483.577615]  ? vfs_read+0x9/0wx360
o[  483.577616]  r? __x64_sys_rt_sigaction+0xde/0xke120
[  483.577619]  ? vfs_read+0x9/0x360
[  483.577620]  d? ksys_read+0x7c./0x100
[  483.577622]  
? do_syscall_64+0x143/0xf80
[  483.577624]  ? entry_SYSCALL_64_after_hwframe+0x77/0x7f
[  483.577625]  ? trace_irq_disable+0x1d/0xc0
[  483.577627]  ? entry_SYSCALL_64_after_hwframe+0x77/0x7f
[  483.577631]  </TASK>

I am sending in a patch that fixes this.

Signed-off-by: Varun R Mallya <varunrmallya@gmail.com>

^ permalink raw reply

* [PATCH v2] rv: Allow epoll in rtapp-sleep monitor
From: Nam Cao @ 2026-04-01 13:08 UTC (permalink / raw)
  To: Gabriele Monaco; +Cc: Steven Rostedt, linux-trace-kernel, linux-kernel, Nam Cao

Since commit 0c43094f8cc9 ("eventpoll: Replace rwlock with spinlock"),
epoll_wait is real-time-safe syscall for sleeping.

Add epoll_wait to the list of rt-safe sleeping APIs.

Signed-off-by: Nam Cao <namcao@linutronix.de>
Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>
---
v2: Use ltl_atom_update() instead of ltl_atom_set() in handle_sys_enter()
---
 kernel/trace/rv/monitors/sleep/sleep.c    |  8 ++
 kernel/trace/rv/monitors/sleep/sleep.h    | 98 ++++++++++++-----------
 tools/verification/models/rtapp/sleep.ltl |  1 +
 3 files changed, 61 insertions(+), 46 deletions(-)

diff --git a/kernel/trace/rv/monitors/sleep/sleep.c b/kernel/trace/rv/monitors/sleep/sleep.c
index c1347da69e9d..8dfe5ec13e19 100644
--- a/kernel/trace/rv/monitors/sleep/sleep.c
+++ b/kernel/trace/rv/monitors/sleep/sleep.c
@@ -49,6 +49,7 @@ static void ltl_atoms_init(struct task_struct *task, struct ltl_monitor *mon, bo
 		ltl_atom_set(mon, LTL_NANOSLEEP_TIMER_ABSTIME, false);
 		ltl_atom_set(mon, LTL_CLOCK_NANOSLEEP, false);
 		ltl_atom_set(mon, LTL_FUTEX_WAIT, false);
+		ltl_atom_set(mon, LTL_EPOLL_WAIT, false);
 		ltl_atom_set(mon, LTL_FUTEX_LOCK_PI, false);
 		ltl_atom_set(mon, LTL_BLOCK_ON_RT_MUTEX, false);
 	}
@@ -63,6 +64,7 @@ static void ltl_atoms_init(struct task_struct *task, struct ltl_monitor *mon, bo
 		ltl_atom_set(mon, LTL_NANOSLEEP_CLOCK_TAI, false);
 		ltl_atom_set(mon, LTL_NANOSLEEP_TIMER_ABSTIME, false);
 		ltl_atom_set(mon, LTL_CLOCK_NANOSLEEP, false);
+		ltl_atom_set(mon, LTL_EPOLL_WAIT, false);
 
 		if (strstarts(task->comm, "migration/"))
 			ltl_atom_set(mon, LTL_TASK_IS_MIGRATION, true);
@@ -162,6 +164,11 @@ static void handle_sys_enter(void *data, struct pt_regs *regs, long id)
 			break;
 		}
 		break;
+#ifdef __NR_epoll_wait
+	case __NR_epoll_wait:
+		ltl_atom_update(current, LTL_EPOLL_WAIT, true);
+		break;
+#endif
 	}
 }
 
@@ -174,6 +181,7 @@ static void handle_sys_exit(void *data, struct pt_regs *regs, long ret)
 	ltl_atom_set(mon, LTL_NANOSLEEP_CLOCK_MONOTONIC, false);
 	ltl_atom_set(mon, LTL_NANOSLEEP_CLOCK_TAI, false);
 	ltl_atom_set(mon, LTL_NANOSLEEP_TIMER_ABSTIME, false);
+	ltl_atom_set(mon, LTL_EPOLL_WAIT, false);
 	ltl_atom_update(current, LTL_CLOCK_NANOSLEEP, false);
 }
 
diff --git a/kernel/trace/rv/monitors/sleep/sleep.h b/kernel/trace/rv/monitors/sleep/sleep.h
index 2ab46fd218d2..95dc2727c059 100644
--- a/kernel/trace/rv/monitors/sleep/sleep.h
+++ b/kernel/trace/rv/monitors/sleep/sleep.h
@@ -15,6 +15,7 @@ enum ltl_atom {
 	LTL_ABORT_SLEEP,
 	LTL_BLOCK_ON_RT_MUTEX,
 	LTL_CLOCK_NANOSLEEP,
+	LTL_EPOLL_WAIT,
 	LTL_FUTEX_LOCK_PI,
 	LTL_FUTEX_WAIT,
 	LTL_KERNEL_THREAD,
@@ -40,6 +41,7 @@ static const char *ltl_atom_str(enum ltl_atom atom)
 		"ab_sl",
 		"bl_on_rt_mu",
 		"cl_na",
+		"ep_wa",
 		"fu_lo_pi",
 		"fu_wa",
 		"ker_th",
@@ -75,39 +77,41 @@ static_assert(RV_NUM_BA_STATES <= RV_MAX_BA_STATES);
 
 static void ltl_start(struct task_struct *task, struct ltl_monitor *mon)
 {
-	bool task_is_migration = test_bit(LTL_TASK_IS_MIGRATION, mon->atoms);
-	bool task_is_rcu = test_bit(LTL_TASK_IS_RCU, mon->atoms);
-	bool val40 = task_is_rcu || task_is_migration;
-	bool futex_lock_pi = test_bit(LTL_FUTEX_LOCK_PI, mon->atoms);
-	bool val41 = futex_lock_pi || val40;
-	bool block_on_rt_mutex = test_bit(LTL_BLOCK_ON_RT_MUTEX, mon->atoms);
-	bool val5 = block_on_rt_mutex || val41;
-	bool kthread_should_stop = test_bit(LTL_KTHREAD_SHOULD_STOP, mon->atoms);
-	bool abort_sleep = test_bit(LTL_ABORT_SLEEP, mon->atoms);
-	bool val32 = abort_sleep || kthread_should_stop;
 	bool woken_by_nmi = test_bit(LTL_WOKEN_BY_NMI, mon->atoms);
-	bool val33 = woken_by_nmi || val32;
 	bool woken_by_hardirq = test_bit(LTL_WOKEN_BY_HARDIRQ, mon->atoms);
-	bool val34 = woken_by_hardirq || val33;
 	bool woken_by_equal_or_higher_prio = test_bit(LTL_WOKEN_BY_EQUAL_OR_HIGHER_PRIO,
 	     mon->atoms);
-	bool val14 = woken_by_equal_or_higher_prio || val34;
 	bool wake = test_bit(LTL_WAKE, mon->atoms);
-	bool val13 = !wake;
-	bool kernel_thread = test_bit(LTL_KERNEL_THREAD, mon->atoms);
+	bool task_is_rcu = test_bit(LTL_TASK_IS_RCU, mon->atoms);
+	bool task_is_migration = test_bit(LTL_TASK_IS_MIGRATION, mon->atoms);
+	bool sleep = test_bit(LTL_SLEEP, mon->atoms);
+	bool rt = test_bit(LTL_RT, mon->atoms);
+	bool nanosleep_timer_abstime = test_bit(LTL_NANOSLEEP_TIMER_ABSTIME, mon->atoms);
 	bool nanosleep_clock_tai = test_bit(LTL_NANOSLEEP_CLOCK_TAI, mon->atoms);
 	bool nanosleep_clock_monotonic = test_bit(LTL_NANOSLEEP_CLOCK_MONOTONIC, mon->atoms);
-	bool val24 = nanosleep_clock_monotonic || nanosleep_clock_tai;
-	bool nanosleep_timer_abstime = test_bit(LTL_NANOSLEEP_TIMER_ABSTIME, mon->atoms);
-	bool val25 = nanosleep_timer_abstime && val24;
-	bool clock_nanosleep = test_bit(LTL_CLOCK_NANOSLEEP, mon->atoms);
-	bool val18 = clock_nanosleep && val25;
+	bool kthread_should_stop = test_bit(LTL_KTHREAD_SHOULD_STOP, mon->atoms);
+	bool kernel_thread = test_bit(LTL_KERNEL_THREAD, mon->atoms);
 	bool futex_wait = test_bit(LTL_FUTEX_WAIT, mon->atoms);
-	bool val9 = futex_wait || val18;
+	bool futex_lock_pi = test_bit(LTL_FUTEX_LOCK_PI, mon->atoms);
+	bool epoll_wait = test_bit(LTL_EPOLL_WAIT, mon->atoms);
+	bool clock_nanosleep = test_bit(LTL_CLOCK_NANOSLEEP, mon->atoms);
+	bool block_on_rt_mutex = test_bit(LTL_BLOCK_ON_RT_MUTEX, mon->atoms);
+	bool abort_sleep = test_bit(LTL_ABORT_SLEEP, mon->atoms);
+	bool val42 = task_is_rcu || task_is_migration;
+	bool val43 = futex_lock_pi || val42;
+	bool val5 = block_on_rt_mutex || val43;
+	bool val34 = abort_sleep || kthread_should_stop;
+	bool val35 = woken_by_nmi || val34;
+	bool val36 = woken_by_hardirq || val35;
+	bool val14 = woken_by_equal_or_higher_prio || val36;
+	bool val13 = !wake;
+	bool val26 = nanosleep_clock_monotonic || nanosleep_clock_tai;
+	bool val27 = nanosleep_timer_abstime && val26;
+	bool val18 = clock_nanosleep && val27;
+	bool val20 = val18 || epoll_wait;
+	bool val9 = futex_wait || val20;
 	bool val11 = val9 || kernel_thread;
-	bool sleep = test_bit(LTL_SLEEP, mon->atoms);
 	bool val2 = !sleep;
-	bool rt = test_bit(LTL_RT, mon->atoms);
 	bool val1 = !rt;
 	bool val3 = val1 || val2;
 
@@ -124,39 +128,41 @@ static void ltl_start(struct task_struct *task, struct ltl_monitor *mon)
 static void
 ltl_possible_next_states(struct ltl_monitor *mon, unsigned int state, unsigned long *next)
 {
-	bool task_is_migration = test_bit(LTL_TASK_IS_MIGRATION, mon->atoms);
-	bool task_is_rcu = test_bit(LTL_TASK_IS_RCU, mon->atoms);
-	bool val40 = task_is_rcu || task_is_migration;
-	bool futex_lock_pi = test_bit(LTL_FUTEX_LOCK_PI, mon->atoms);
-	bool val41 = futex_lock_pi || val40;
-	bool block_on_rt_mutex = test_bit(LTL_BLOCK_ON_RT_MUTEX, mon->atoms);
-	bool val5 = block_on_rt_mutex || val41;
-	bool kthread_should_stop = test_bit(LTL_KTHREAD_SHOULD_STOP, mon->atoms);
-	bool abort_sleep = test_bit(LTL_ABORT_SLEEP, mon->atoms);
-	bool val32 = abort_sleep || kthread_should_stop;
 	bool woken_by_nmi = test_bit(LTL_WOKEN_BY_NMI, mon->atoms);
-	bool val33 = woken_by_nmi || val32;
 	bool woken_by_hardirq = test_bit(LTL_WOKEN_BY_HARDIRQ, mon->atoms);
-	bool val34 = woken_by_hardirq || val33;
 	bool woken_by_equal_or_higher_prio = test_bit(LTL_WOKEN_BY_EQUAL_OR_HIGHER_PRIO,
 	     mon->atoms);
-	bool val14 = woken_by_equal_or_higher_prio || val34;
 	bool wake = test_bit(LTL_WAKE, mon->atoms);
-	bool val13 = !wake;
-	bool kernel_thread = test_bit(LTL_KERNEL_THREAD, mon->atoms);
+	bool task_is_rcu = test_bit(LTL_TASK_IS_RCU, mon->atoms);
+	bool task_is_migration = test_bit(LTL_TASK_IS_MIGRATION, mon->atoms);
+	bool sleep = test_bit(LTL_SLEEP, mon->atoms);
+	bool rt = test_bit(LTL_RT, mon->atoms);
+	bool nanosleep_timer_abstime = test_bit(LTL_NANOSLEEP_TIMER_ABSTIME, mon->atoms);
 	bool nanosleep_clock_tai = test_bit(LTL_NANOSLEEP_CLOCK_TAI, mon->atoms);
 	bool nanosleep_clock_monotonic = test_bit(LTL_NANOSLEEP_CLOCK_MONOTONIC, mon->atoms);
-	bool val24 = nanosleep_clock_monotonic || nanosleep_clock_tai;
-	bool nanosleep_timer_abstime = test_bit(LTL_NANOSLEEP_TIMER_ABSTIME, mon->atoms);
-	bool val25 = nanosleep_timer_abstime && val24;
-	bool clock_nanosleep = test_bit(LTL_CLOCK_NANOSLEEP, mon->atoms);
-	bool val18 = clock_nanosleep && val25;
+	bool kthread_should_stop = test_bit(LTL_KTHREAD_SHOULD_STOP, mon->atoms);
+	bool kernel_thread = test_bit(LTL_KERNEL_THREAD, mon->atoms);
 	bool futex_wait = test_bit(LTL_FUTEX_WAIT, mon->atoms);
-	bool val9 = futex_wait || val18;
+	bool futex_lock_pi = test_bit(LTL_FUTEX_LOCK_PI, mon->atoms);
+	bool epoll_wait = test_bit(LTL_EPOLL_WAIT, mon->atoms);
+	bool clock_nanosleep = test_bit(LTL_CLOCK_NANOSLEEP, mon->atoms);
+	bool block_on_rt_mutex = test_bit(LTL_BLOCK_ON_RT_MUTEX, mon->atoms);
+	bool abort_sleep = test_bit(LTL_ABORT_SLEEP, mon->atoms);
+	bool val42 = task_is_rcu || task_is_migration;
+	bool val43 = futex_lock_pi || val42;
+	bool val5 = block_on_rt_mutex || val43;
+	bool val34 = abort_sleep || kthread_should_stop;
+	bool val35 = woken_by_nmi || val34;
+	bool val36 = woken_by_hardirq || val35;
+	bool val14 = woken_by_equal_or_higher_prio || val36;
+	bool val13 = !wake;
+	bool val26 = nanosleep_clock_monotonic || nanosleep_clock_tai;
+	bool val27 = nanosleep_timer_abstime && val26;
+	bool val18 = clock_nanosleep && val27;
+	bool val20 = val18 || epoll_wait;
+	bool val9 = futex_wait || val20;
 	bool val11 = val9 || kernel_thread;
-	bool sleep = test_bit(LTL_SLEEP, mon->atoms);
 	bool val2 = !sleep;
-	bool rt = test_bit(LTL_RT, mon->atoms);
 	bool val1 = !rt;
 	bool val3 = val1 || val2;
 
diff --git a/tools/verification/models/rtapp/sleep.ltl b/tools/verification/models/rtapp/sleep.ltl
index 6379bbeb6212..6f26c4810f78 100644
--- a/tools/verification/models/rtapp/sleep.ltl
+++ b/tools/verification/models/rtapp/sleep.ltl
@@ -5,6 +5,7 @@ RT_FRIENDLY_SLEEP = (RT_VALID_SLEEP_REASON or KERNEL_THREAD)
 
 RT_VALID_SLEEP_REASON = FUTEX_WAIT
                      or RT_FRIENDLY_NANOSLEEP
+                     or EPOLL_WAIT
 
 RT_FRIENDLY_NANOSLEEP = CLOCK_NANOSLEEP
                     and NANOSLEEP_TIMER_ABSTIME
-- 
2.47.3


^ permalink raw reply related

* Re: [PATCH] rv: Allow epoll in rtapp-sleep monitor
From: Gabriele Monaco @ 2026-04-01 11:47 UTC (permalink / raw)
  To: Nam Cao; +Cc: Steven Rostedt, linux-trace-kernel, linux-kernel
In-Reply-To: <87v7ec6n1n.fsf@yellow.woof>

On Tue, 2026-03-31 at 17:23 +0200, Nam Cao wrote:
> Gabriele Monaco <gmonaco@redhat.com> writes:
> > On Tue, 2026-03-31 at 12:49 +0200, Nam Cao wrote:
> > > diff --git a/kernel/trace/rv/monitors/sleep/sleep.c
> > > b/kernel/trace/rv/monitors/sleep/sleep.c
> > > index c1347da69e9d..59091863c17c 100644
> > > --- a/kernel/trace/rv/monitors/sleep/sleep.c
> > > +++ b/kernel/trace/rv/monitors/sleep/sleep.c
> > > @@ -162,6 +164,11 @@ static void handle_sys_enter(void *data, struct
> > > pt_regs
> > > *regs, long id)
> > >  			break;
> > >  		}
> > >  		break;
> > > +#ifdef __NR_epoll_wait
> > > +	case __NR_epoll_wait:
> > > +		ltl_atom_set(mon, LTL_EPOLL_WAIT, true);
> > > +		break;
> > > +#endif
> > 
> > Sashiko (the AI bot) wonders why this isn't ltl_atom_update() like other
> > things
> > around here. Is that intentional?
> 
> No that's not intentional. It does not affect verification result, but
> still should be fixed. I will send v2.
> 
> Funnily a colleague just told me earlier today about how good AIs are at
> reviewing..

Well, they're probably already better than (most of) us in those kind of
consistency assessments..

Anyway, I tried your changes and it seems to work as I'd expect. Feel free to
send a V2 and I can include it in the pull request.

Thanks,
Gabriele


^ permalink raw reply

* [PATCH v2 2/2] tracing/hist: reject synthetic-field strings that exceed MAX_FILTER_STR_VAL
From: Pengpeng Hou @ 2026-04-01 11:22 UTC (permalink / raw)
  To: rostedt
  Cc: mhiramat, mathieu.desnoyers, tom.zanussi, linux-kernel,
	linux-trace-kernel, pengpeng

The synthetic field helpers build a prefixed synthetic field name and a
hist trigger command in fixed MAX_FILTER_STR_VAL staging buffers. Even
when each individual field or filter string stays within that limit, the
combined "keys=...:synthetic_...=... if ..." command can exceed 256
bytes and overrun the scratch buffer.

Keep MAX_FILTER_STR_VAL as the tracing-side limit and reject synthetic
field names or generated commands that do not fit in that bound before
formatting them into the fixed buffers.

Fixes: 02205a6752f2 ("tracing: Add support for 'field variables'")
Signed-off-by: Pengpeng Hou <pengpeng@iscas.ac.cn>
---
Changes since v1: https://lore.kernel.org/all/20260329030950.32503-1-pengpeng@iscas.ac.cn/

- keep MAX_FILTER_STR_VAL as the fixed tracing-side limit
- reject overlong synthetic names and generated commands with -E2BIG
- drop the previous dynamic-allocation approach

 kernel/trace/trace_events_hist.c | 57 +++++++++++++++++++++++---------
 1 file changed, 41 insertions(+), 16 deletions(-)

diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c
index f9c8a4f078ea..4172c91605af 100644
--- a/kernel/trace/trace_events_hist.c
+++ b/kernel/trace/trace_events_hist.c
@@ -2966,12 +2966,16 @@ find_synthetic_field_var(struct hist_trigger_data *target_hist_data,
 	struct hist_field *event_var;
 	char *synthetic_name;
 
+	if ((sizeof("synthetic_") - 1) + strlen(field_name) >=
+	    MAX_FILTER_STR_VAL)
+		return ERR_PTR(-E2BIG);
+
 	synthetic_name = kzalloc(MAX_FILTER_STR_VAL, GFP_KERNEL);
 	if (!synthetic_name)
 		return ERR_PTR(-ENOMEM);
 
-	strcpy(synthetic_name, "synthetic_");
-	strcat(synthetic_name, field_name);
+	scnprintf(synthetic_name, MAX_FILTER_STR_VAL, "synthetic_%s",
+		  field_name);
 
 	event_var = find_event_var(target_hist_data, system, event_name, synthetic_name);
 
@@ -3018,6 +3022,8 @@ create_field_var_hist(struct hist_trigger_data *target_hist_data,
 	struct hist_field *event_var;
 	char *saved_filter;
 	char *cmd;
+	size_t cmdlen;
+	size_t off;
 	int ret;
 
 	if (target_hist_data->n_field_var_hists >= SYNTH_FIELDS_MAX) {
@@ -3048,13 +3054,36 @@ create_field_var_hist(struct hist_trigger_data *target_hist_data,
 	/* See if a synthetic field variable has already been created */
 	event_var = find_synthetic_field_var(target_hist_data, subsys_name,
 					     event_name, field_name);
-	if (!IS_ERR_OR_NULL(event_var))
+	if (IS_ERR(event_var))
+		return event_var;
+	if (event_var)
 		return event_var;
 
 	var_hist = kzalloc_obj(*var_hist);
 	if (!var_hist)
 		return ERR_PTR(-ENOMEM);
 
+	saved_filter = find_trigger_filter(hist_data, file);
+
+	cmdlen = strlen("keys=") + strlen(":synthetic_") +
+		 strlen(field_name) + strlen("=") + strlen(field_name);
+	first = true;
+	for_each_hist_key_field(i, hist_data) {
+		key_field = hist_data->fields[i];
+		if (!first)
+			cmdlen++;
+		cmdlen += strlen(key_field->field->name);
+		first = false;
+	}
+
+	if (saved_filter)
+		cmdlen += strlen(" if ") + strlen(saved_filter);
+
+	if (cmdlen >= MAX_FILTER_STR_VAL) {
+		kfree(var_hist);
+		return ERR_PTR(-E2BIG);
+	}
+
 	cmd = kzalloc(MAX_FILTER_STR_VAL, GFP_KERNEL);
 	if (!cmd) {
 		kfree(var_hist);
@@ -3062,28 +3091,24 @@ create_field_var_hist(struct hist_trigger_data *target_hist_data,
 	}
 
 	/* Use the same keys as the compatible histogram */
-	strcat(cmd, "keys=");
+	off = scnprintf(cmd, MAX_FILTER_STR_VAL, "keys=");
+	first = true;
 
 	for_each_hist_key_field(i, hist_data) {
 		key_field = hist_data->fields[i];
-		if (!first)
-			strcat(cmd, ",");
-		strcat(cmd, key_field->field->name);
+		off += scnprintf(cmd + off, MAX_FILTER_STR_VAL - off, "%s%s",
+				 first ? "" : ",", key_field->field->name);
 		first = false;
 	}
 
 	/* Create the synthetic field variable specification */
-	strcat(cmd, ":synthetic_");
-	strcat(cmd, field_name);
-	strcat(cmd, "=");
-	strcat(cmd, field_name);
+	off += scnprintf(cmd + off, MAX_FILTER_STR_VAL - off,
+			 ":synthetic_%s=%s", field_name, field_name);
 
 	/* Use the same filter as the compatible histogram */
-	saved_filter = find_trigger_filter(hist_data, file);
-	if (saved_filter) {
-		strcat(cmd, " if ");
-		strcat(cmd, saved_filter);
-	}
+	if (saved_filter)
+		scnprintf(cmd + off, MAX_FILTER_STR_VAL - off, " if %s",
+			  saved_filter);
 
 	var_hist->cmd = kstrdup(cmd, GFP_KERNEL);
 	if (!var_hist->cmd) {
-- 
2.50.1 (Apple Git-155)


^ permalink raw reply related

* [PATCH v2 1/2] tracing/hist: rebuild full_name on each hist_field_name() call
From: Pengpeng Hou @ 2026-04-01 11:22 UTC (permalink / raw)
  To: rostedt
  Cc: mhiramat, mathieu.desnoyers, tom.zanussi, linux-kernel,
	linux-trace-kernel, pengpeng

hist_field_name() uses a static MAX_FILTER_STR_VAL buffer for fully
qualified variable-reference names, but it currently appends into that
buffer with strcat() without rebuilding it first. As a result, repeated
calls append a new "system.event.field" name onto the previous one,
which can eventually run past the end of full_name.

Build the name with snprintf() on each call and return NULL if the fully
qualified name does not fit in MAX_FILTER_STR_VAL.

Fixes: 067fe038e70f ("tracing: Add variable reference handling to hist triggers")
Signed-off-by: Pengpeng Hou <pengpeng@iscas.ac.cn>
---
Changes since v1: https://lore.kernel.org/all/20260329030950.32503-1-pengpeng@iscas.ac.cn/

- rebuild full_name on each call instead of falling back to field->name
- return NULL on overflow as suggested
- split out the snprintf() length check instead of using an inline if

 kernel/trace/trace_events_hist.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c
index 73ea180cad55..f9c8a4f078ea 100644
--- a/kernel/trace/trace_events_hist.c
+++ b/kernel/trace/trace_events_hist.c
@@ -1361,12 +1361,14 @@ static const char *hist_field_name(struct hist_field *field,
 		 field->flags & HIST_FIELD_FL_VAR_REF) {
 		if (field->system) {
 			static char full_name[MAX_FILTER_STR_VAL];
+			int len;
+
+			len = snprintf(full_name, sizeof(full_name), "%s.%s.%s",
+				       field->system, field->event_name,
+				       field->name);
+			if (len >= sizeof(full_name))
+				return NULL;
 
-			strcat(full_name, field->system);
-			strcat(full_name, ".");
-			strcat(full_name, field->event_name);
-			strcat(full_name, ".");
-			strcat(full_name, field->name);
 			field_name = full_name;
 		} else
 			field_name = field->name;
-- 
2.50.1 (Apple Git-155)


^ permalink raw reply related

* Re: [PATCH v2] bootconfig: Apply early options from embedded config
From: Kiryl Shutsemau @ 2026-04-01  9:40 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Breno Leitao, Jonathan Corbet, Shuah Khan, linux-kernel,
	linux-trace-kernel, linux-doc, oss, paulmck, rostedt, kernel-team
In-Reply-To: <20260401100237.ff9a37505d041a00e4d8658a@kernel.org>

On Wed, Apr 01, 2026 at 10:02:37AM +0900, Masami Hiramatsu wrote:
> On Tue, 31 Mar 2026 11:18:33 +0100
> Kiryl Shutsemau <kirill@shutemov.name> wrote:
> 
> > On Wed, Mar 25, 2026 at 11:22:04PM +0900, Masami Hiramatsu wrote:
> > > > diff --git a/init/main.c b/init/main.c
> > > > index 453ac9dff2da0..14a04c283fa48 100644
> > > > --- a/init/main.c
> > > > +++ b/init/main.c
> > > > @@ -416,9 +416,64 @@ static int __init warn_bootconfig(char *str)
> > > >  	return 0;
> > > >  }
> > > >  
> > > > +/*
> > > > + * do_early_param() is defined later in this file but called from
> > > > + * bootconfig_apply_early_params() below, so we need a forward declaration.
> > > > + */
> > > > +static int __init do_early_param(char *param, char *val,
> > > > +				 const char *unused, void *arg);
> > > > +
> > > > +/*
> > > > + * bootconfig_apply_early_params - dispatch kernel.* keys from the embedded
> > > > + * bootconfig as early_param() calls.
> > > > + *
> > > > + * early_param() handlers must run before most of the kernel initialises
> > > > + * (e.g. before the GIC driver reads irqchip.gicv3_pseudo_nmi).  A bootconfig
> > > > + * attached to the initrd arrives too late for this because the initrd is not
> > > > + * mapped yet when early params are processed.  The embedded bootconfig lives
> > > > + * in the kernel image itself (.init.data), so it is always reachable.
> > > > + *
> > > > + * This function is called from setup_boot_config() which runs in
> > > > + * start_kernel() before parse_early_param(), making the timing correct.
> > > > + */
> > > > +static void __init bootconfig_apply_early_params(void)
> > > 
> > > [sashiko comment]
> > > | Does this run early enough for architectural parameters?
> > > | While setup_boot_config() runs before parse_early_param() in start_kernel(),
> > > | it runs after setup_arch(). setup_boot_config() relies on xbc_init() which
> > > | uses the memblock allocator, requiring setup_arch() to have already
> > > | initialized it.
> > > | However, the kernel expects many early parameters (like mem=, earlycon,
> > > | noapic, and iommu) to be parsed during setup_arch() via the architecture's
> > > | call to parse_early_param(). Since setup_arch() completes before
> > > | setup_boot_config() runs, will these architectural early parameters be
> > > | silently ignored because the decisions they influence were already
> > > | finalized?
> > > 
> > > This is the major reason that I did not support early parameter
> > > in bootconfig. Some archs initialize kernel_cmdline in setup_arch()
> > > and setup early parameters in it.
> > > To fix this, we need to change setup_arch() for each architecture so
> > > that it calls this bootconfig_apply_early_params().
> > 
> > Hi Masami,
> > 
> > I have a question about bootconfig design. Is it necessary to parse the
> > bootconfig at boot time?
> > 
> > We could avoid a lot of complexity if we flattened the bootconfig into a
> > simple key-value list before embedding it in the kernel image or
> > attaching it to the initrd. That would eliminate the need for memory
> > allocations and allow the config to be used earlier during boot.
> 
> Hi Kiryl,
> 
> Thanks for a good question.
> 
> If it is embedded in the kernel, yes, we can compile it. But if it is
> attached to initrd, the kernel needs to verify it. So anyway we have to
> parse the key-value. Of course we can make it a binary data structure,
> but it still needs verification. Moreover, memory allocation is not
> required by design (the first design uses a statically allocated data).
> 
> Even if it is normalized as key-value, we can not trust the contents
> without outer verification system, like verified boot. Therefore, our
> basic strategy is to parse the text.

Hm. I am not sure what issues verification aims to solve. Could you
elaborate.

With normalized key-value structure, we should be able to extract the
needed value in the same way as with normal kernel command line, no?

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply

* [PATCH] mm/vmscan: add tracepoint for changes in min_seq and max_seq
From: wangzhen @ 2026-04-01  7:17 UTC (permalink / raw)
  To: akpm@linux-foundation.org
  Cc: rostedt@goodmis.org, mhiramat@kernel.org,
	mathieu.desnoyers@efficios.com, hannes@cmpxchg.org,
	david@kernel.org, mhocko@kernel.org, zhengqi.arch@bytedance.com,
	shakeel.butt@linux.dev, ljs@kernel.org, axelrasmussen@google.com,
	yuanchu@google.com, weixugc@google.com, jiayuan.chen@shopee.com,
	wangzhen, linux-kernel@vger.kernel.org,
	linux-trace-kernel@vger.kernel.org, linux-mm@kvack.org,
	kasong@tencent.com, baolin.wang@linux.alibaba.com,
	baohua@kernel.org

From 20f685d13826ebaa86705a689128893be6316c7b Mon Sep 17 00:00:00 2001
From: w00021541 <wangzhen5@hihonor.com>
Date: Tue, 31 Mar 2026 16:00:48 +0800
Subject: [PATCH] mm/vmscan: add tracepoint for changes in min_seq and max_seq

Currently, when tracking the changes in min_seq and max_seq of mglru,
it can only be done through debugfs. The granularity of debugfs is too
coarse when we debug the generations of mglru. there's no way to trace
the increase in min_seq and max_seq, so we add 2 tracepoints,
mm_vmscan_lru_inc_min_seq and mm_vmscan_lru_inc_max_seq.

Test results:

kswapd0-94      [003] d..1.    72.664921: mm_vmscan_lru_inc_min_seq: memcg_id=87 type=0 swappiness=140 min_seq=2
kswapd0-94      [003] d..1.    72.664922: mm_vmscan_lru_inc_min_seq: memcg_id=87 type=1 swappiness=140 min_seq=1
kswapd0-94      [003] d..1.    72.667303: mm_vmscan_lru_inc_max_seq: memcg_id=87 swappiness=140 max_seq=4
kswapd0-94      [000] d..1.    72.807691: mm_vmscan_lru_inc_min_seq: memcg_id=25 type=0 swappiness=140 min_seq=2
kswapd0-94      [000] d..1.    72.807692: mm_vmscan_lru_inc_min_seq: memcg_id=25 type=1 swappiness=140 min_seq=1
kswapd0-94      [000] d..1.    72.810955: mm_vmscan_lru_inc_max_seq: memcg_id=25 swappiness=140 max_seq=4
kswapd0-94      [005] d..1.    73.482586: mm_vmscan_lru_inc_min_seq: memcg_id=91 type=0 swappiness=140 min_seq=2
kswapd0-94      [005] d..1.    73.482588: mm_vmscan_lru_inc_min_seq: memcg_id=91 type=1 swappiness=140 min_seq=1
kswapd0-94      [005] d..1.    73.485509: mm_vmscan_lru_inc_max_seq: memcg_id=91 swappiness=140 max_seq=4
kswapd0-94      [000] d..1.    77.708630: mm_vmscan_lru_inc_min_seq: memcg_id=88 type=0 swappiness=140 min_seq=1
kswapd0-94      [000] d..1.    77.709491: mm_vmscan_lru_inc_min_seq: memcg_id=88 type=0 swappiness=140 min_seq=2
kswapd0-94      [000] d..1.    77.709491: mm_vmscan_lru_inc_min_seq: memcg_id=88 type=1 swappiness=140 min_seq=1
kswapd0-94      [000] d..1.    77.715550: mm_vmscan_lru_inc_max_seq: memcg_id=88 swappiness=140 max_seq=4
kswapd0-94      [003] d..1.    77.749288: mm_vmscan_lru_inc_min_seq: memcg_id=4 type=0 swappiness=140 min_seq=2
kswapd0-94      [003] d..1.    77.749290: mm_vmscan_lru_inc_min_seq: memcg_id=4 type=1 swappiness=140 min_seq=1
kswapd0-94      [003] d..1.    77.763055: mm_vmscan_lru_inc_max_seq: memcg_id=4 swappiness=140 max_seq=4

Signed-off-by: w00021541 <wangzhen5@hihonor.com>
---
 include/trace/events/vmscan.h | 49 +++++++++++++++++++++++++++++++++++
 mm/vmscan.c                   |  6 +++++
 2 files changed, 55 insertions(+)

diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index ea58e4656abf..6f2ae5c597fd 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -488,6 +488,55 @@ TRACE_EVENT(mm_vmscan_lru_shrink_active,
 		show_reclaim_flags(__entry->reclaim_flags))
 );
 
+#ifdef CONFIG_LRU_GEN
+TRACE_EVENT(mm_vmscan_lru_inc_min_seq,
+
+	TP_PROTO(struct mem_cgroup *memcg, int type, int swappiness, unsigned long min_seq),
+
+	TP_ARGS(memcg, type, swappiness, min_seq),
+
+	TP_STRUCT__entry(
+		__field(int, id)
+		__field(int, type)
+		__field(int, swappiness)
+		__field(unsigned long, min_seq)
+	),
+
+	TP_fast_assign(
+		__entry->id = mem_cgroup_id(memcg);
+		__entry->type = type;
+		__entry->swappiness = swappiness;
+		__entry->min_seq = min_seq;
+	),
+
+	TP_printk("memcg_id=%d type=%d swappiness=%d min_seq=%ld",
+		__entry->id, __entry->type,
+		__entry->swappiness, __entry->min_seq)
+);
+
+TRACE_EVENT(mm_vmscan_lru_inc_max_seq,
+
+	TP_PROTO(struct mem_cgroup *memcg, int swappiness, unsigned long max_seq),
+
+	TP_ARGS(memcg, swappiness, max_seq),
+
+	TP_STRUCT__entry(
+		__field(int, id)
+		__field(int, swappiness)
+		__field(unsigned long, max_seq)
+	),
+
+	TP_fast_assign(
+		__entry->id = mem_cgroup_id(memcg);
+		__entry->swappiness = swappiness;
+		__entry->max_seq = max_seq;
+	),
+
+	TP_printk("memcg_id=%d swappiness=%d max_seq=%ld",
+		__entry->id, __entry->swappiness, __entry->max_seq)
+);
+#endif /* CONFIG_LRU_GEN */
+
 TRACE_EVENT(mm_vmscan_node_reclaim_begin,
 
 	TP_PROTO(int nid, int order, gfp_t gfp_flags),
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0fc9373e8251..6b5d21ee45ba 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3848,6 +3848,7 @@ static bool inc_min_seq(struct lruvec *lruvec, int type, int swappiness)
 	int zone;
 	int remaining = MAX_LRU_BATCH;
 	struct lru_gen_folio *lrugen = &lruvec->lrugen;
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
 	int hist = lru_hist_from_seq(lrugen->min_seq[type]);
 	int new_gen, old_gen = lru_gen_from_seq(lrugen->min_seq[type]);
 
@@ -3892,6 +3893,7 @@ static bool inc_min_seq(struct lruvec *lruvec, int type, int swappiness)
 done:
 	reset_ctrl_pos(lruvec, type, true);
 	WRITE_ONCE(lrugen->min_seq[type], lrugen->min_seq[type] + 1);
+	trace_mm_vmscan_lru_inc_min_seq(memcg, type, swappiness, lrugen->min_seq[type]);
 
 	return true;
 }
@@ -3902,6 +3904,7 @@ static bool try_to_inc_min_seq(struct lruvec *lruvec, int swappiness)
 	bool success = false;
 	bool seq_inc_flag = false;
 	struct lru_gen_folio *lrugen = &lruvec->lrugen;
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
 	DEFINE_MIN_SEQ(lruvec);
 
 	VM_WARN_ON_ONCE(!seq_is_valid(lruvec));
@@ -3947,6 +3950,7 @@ static bool try_to_inc_min_seq(struct lruvec *lruvec, int swappiness)
 
 		reset_ctrl_pos(lruvec, type, true);
 		WRITE_ONCE(lrugen->min_seq[type], min_seq[type]);
+		trace_mm_vmscan_lru_inc_min_seq(memcg, type, swappiness, lrugen->min_seq[type]);
 		success = true;
 	}
 
@@ -3959,6 +3963,7 @@ static bool inc_max_seq(struct lruvec *lruvec, unsigned long seq, int swappiness
 	int prev, next;
 	int type, zone;
 	struct lru_gen_folio *lrugen = &lruvec->lrugen;
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
 restart:
 	if (seq < READ_ONCE(lrugen->max_seq))
 		return false;
@@ -4012,6 +4017,7 @@ static bool inc_max_seq(struct lruvec *lruvec, unsigned long seq, int swappiness
 	WRITE_ONCE(lrugen->timestamps[next], jiffies);
 	/* make sure preceding modifications appear */
 	smp_store_release(&lrugen->max_seq, lrugen->max_seq + 1);
+	trace_mm_vmscan_lru_inc_max_seq(memcg, swappiness, lrugen->max_seq);
 unlock:
 	spin_unlock_irq(&lruvec->lru_lock);
 
-- 
2.17.1


^ permalink raw reply related

* [PATCH v10 3/3] tracing/Documentation: Add a section about backup instance
From: Masami Hiramatsu (Google) @ 2026-04-01  6:38 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel
In-Reply-To: <177502545902.1311542.15822886746171808738.stgit@mhiramat.tok.corp.google.com>

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

Add a section about backup instance to the debugging.rst.

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
 Changes in v10:
  - Add a newline.
 Changes in v6:
  - Fix typos.
---
 Documentation/trace/debugging.rst |   19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/Documentation/trace/debugging.rst b/Documentation/trace/debugging.rst
index 4d88c346fc38..bca1710d92bf 100644
--- a/Documentation/trace/debugging.rst
+++ b/Documentation/trace/debugging.rst
@@ -159,3 +159,22 @@ If setting it from the kernel command line, it is recommended to also
 disable tracing with the "traceoff" flag, and enable tracing after boot up.
 Otherwise the trace from the most recent boot will be mixed with the trace
 from the previous boot, and may make it confusing to read.
+
+Using a backup instance for keeping previous boot data
+------------------------------------------------------
+
+It is also possible to record trace data at system boot time by specifying
+events with the persistent ring buffer, but in this case the data before the
+reboot will be lost before it can be read. This problem can be solved by a
+backup instance. From the kernel command line::
+
+  reserve_mem=12M:4096:trace trace_instance=boot_map@trace,sched,irq trace_instance=backup=boot_map
+
+On boot up, the previous data in the "boot_map" is copied to the "backup"
+instance, and the "sched:*" and "irq:*" events for the current boot are traced
+in the "boot_map". Thus the user can read the previous boot data from the "backup"
+instance without stopping the trace.
+
+Note that this "backup" instance is readonly, and will be removed automatically
+if you clear the trace data or read out all trace data from the "trace_pipe"
+or the "trace_pipe_raw" files.


^ permalink raw reply related

* [PATCH v10 2/3] tracing: Remove the backup instance automatically after read
From: Masami Hiramatsu (Google) @ 2026-04-01  6:37 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel
In-Reply-To: <177502545902.1311542.15822886746171808738.stgit@mhiramat.tok.corp.google.com>

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

Since the backup instance is readonly, after reading all data
via pipe, no data is left on the instance. Thus it can be
removed safely after closing all files.
This also removes it if user resets the ring buffer manually
via 'trace' file.

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
 Changes in v10:
  - Fix to use trace_array_destroy() to make it safer.
  - Use cancel_work_sync() to wait for cancel when removing.
  - Remove unneeded work_pending() check in trace_array_kick_autoremove()
    (for trace_array_cancel_autoremove() to avoid deadlock of workqueue
    it still use work_pending() check)
 Changes in v9:
   - Fix to initialize autoremove workqueue only for backup.
   - Fix to return -ENODEV if trace_array_get() refers freeing instance.
 Changes in v6:
   - Fix typo in comment.
   - Only when there is a readonly trace array, initialize autoremove_wq.
   - Fix to exit loop in trace_array_get() if tr is found in the list.
 Changes in v4:
   - Update description.
---
 kernel/trace/trace.c |   79 ++++++++++++++++++++++++++++++++++++++++++++++----
 kernel/trace/trace.h |    6 ++++
 2 files changed, 79 insertions(+), 6 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index fdffa62a2b4e..3b3ec6645feb 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -539,8 +539,59 @@ void trace_set_ring_buffer_expanded(struct trace_array *tr)
 	tr->ring_buffer_expanded = true;
 }
 
+static void trace_array_autoremove(struct work_struct *work)
+{
+	struct trace_array *tr = container_of(work, struct trace_array, autoremove_work);
+
+	trace_array_destroy(tr);
+}
+
+static struct workqueue_struct *autoremove_wq;
+
+static void trace_array_kick_autoremove(struct trace_array *tr)
+{
+	if (autoremove_wq)
+		queue_work(autoremove_wq, &tr->autoremove_work);
+}
+
+static void trace_array_cancel_autoremove(struct trace_array *tr)
+{
+	/*
+	 * Since this can be called inside trace_array_autoremove(),
+	 * it has to avoid deadlock of the workqueue.
+	 */
+	if (work_pending(&tr->autoremove_work))
+		cancel_work_sync(&tr->autoremove_work);
+}
+
+static void trace_array_init_autoremove(struct trace_array *tr)
+{
+	INIT_WORK(&tr->autoremove_work, trace_array_autoremove);
+}
+
+static void trace_array_start_autoremove(void)
+{
+	if (autoremove_wq)
+		return;
+
+	autoremove_wq = alloc_workqueue("tr_autoremove_wq",
+					WQ_UNBOUND | WQ_HIGHPRI, 0);
+	if (!autoremove_wq)
+		pr_warn("Unable to allocate tr_autoremove_wq. autoremove disabled.\n");
+}
+
 LIST_HEAD(ftrace_trace_arrays);
 
+static int __trace_array_get(struct trace_array *this_tr)
+{
+	/* When free_on_close is set, this is not available anymore. */
+	if (autoremove_wq && this_tr->free_on_close)
+		return -ENODEV;
+
+	this_tr->ref++;
+	return 0;
+}
+
 int trace_array_get(struct trace_array *this_tr)
 {
 	struct trace_array *tr;
@@ -548,8 +599,7 @@ int trace_array_get(struct trace_array *this_tr)
 	guard(mutex)(&trace_types_lock);
 	list_for_each_entry(tr, &ftrace_trace_arrays, list) {
 		if (tr == this_tr) {
-			tr->ref++;
-			return 0;
+			return __trace_array_get(tr);
 		}
 	}
 
@@ -560,6 +610,12 @@ static void __trace_array_put(struct trace_array *this_tr)
 {
 	WARN_ON(!this_tr->ref);
 	this_tr->ref--;
+	/*
+	 * When free_on_close is set, prepare removing the array
+	 * when the last reference is released.
+	 */
+	if (this_tr->ref == 1 && this_tr->free_on_close)
+		trace_array_kick_autoremove(this_tr);
 }
 
 /**
@@ -4829,6 +4885,10 @@ static void update_last_data(struct trace_array *tr)
 	/* Only if the buffer has previous boot data clear and update it. */
 	tr->flags &= ~TRACE_ARRAY_FL_LAST_BOOT;
 
+	/* If this is a backup instance, mark it for autoremove. */
+	if (tr->flags & TRACE_ARRAY_FL_VMALLOC)
+		tr->free_on_close = true;
+
 	/* Reset the module list and reload them */
 	if (tr->scratch) {
 		struct trace_scratch *tscratch = tr->scratch;
@@ -8442,8 +8502,8 @@ struct trace_array *trace_array_find_get(const char *instance)
 
 	guard(mutex)(&trace_types_lock);
 	tr = trace_array_find(instance);
-	if (tr)
-		tr->ref++;
+	if (tr && __trace_array_get(tr) < 0)
+		tr = NULL;
 
 	return tr;
 }
@@ -8540,6 +8600,8 @@ trace_array_create_systems(const char *name, const char *systems,
 	if (ftrace_allocate_ftrace_ops(tr) < 0)
 		goto out_free_tr;
 
+	trace_array_init_autoremove(tr);
+
 	ftrace_init_trace_array(tr);
 
 	init_trace_flags_index(tr);
@@ -8650,7 +8712,9 @@ struct trace_array *trace_array_get_by_name(const char *name, const char *system
 
 	list_for_each_entry(tr, &ftrace_trace_arrays, list) {
 		if (tr->name && strcmp(tr->name, name) == 0) {
-			tr->ref++;
+			/* if this fails, @tr is going to be removed. */
+			if (__trace_array_get(tr) < 0)
+				tr = NULL;
 			return tr;
 		}
 	}
@@ -8689,6 +8753,7 @@ static int __remove_instance(struct trace_array *tr)
 			set_tracer_flag(tr, 1ULL << i, 0);
 	}
 
+	trace_array_cancel_autoremove(tr);
 	tracing_set_nop(tr);
 	clear_ftrace_function_probes(tr);
 	event_trace_del_tracer(tr);
@@ -9653,8 +9718,10 @@ __init static void enable_instances(void)
 		/*
 		 * Backup buffers can be freed but need vfree().
 		 */
-		if (backup)
+		if (backup) {
 			tr->flags |= TRACE_ARRAY_FL_VMALLOC | TRACE_ARRAY_FL_RDONLY;
+			trace_array_start_autoremove();
+		}
 
 		if (start || backup) {
 			tr->flags |= TRACE_ARRAY_FL_BOOT | TRACE_ARRAY_FL_LAST_BOOT;
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 2aa11178af59..a34f2d9479db 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -455,6 +455,12 @@ struct trace_array {
 	 * we do not waste memory on systems that are not using tracing.
 	 */
 	bool ring_buffer_expanded;
+	/*
+	 * If the ring buffer is a read only backup instance, it will be
+	 * removed after dumping all data via pipe, because no readable data.
+	 */
+	bool free_on_close;
+	struct work_struct	autoremove_work;
 };
 
 enum {


^ permalink raw reply related

* [PATCH v10 1/3] tracing: Make the backup instance non-reusable
From: Masami Hiramatsu (Google) @ 2026-04-01  6:37 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel
In-Reply-To: <177502545902.1311542.15822886746171808738.stgit@mhiramat.tok.corp.google.com>

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

Since there is no reason to reuse the backup instance, make it
readonly (but erasable).
Note that only backup instances are readonly, because
other trace instances will be empty unless it is writable.
Only backup instances have copy entries from the original.

With this change, most of the trace control files are removed
from the backup instance, including eventfs enable/filter etc.

 # find /sys/kernel/tracing/instances/backup/events/ | wc -l
 4093
 # find /sys/kernel/tracing/instances/boot_map/events/ | wc -l
 9573

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
 Changes in v9:
   - Add forcibly readonly check in open() operations.
 Changes in v8:
   - Remove read-only checks in read() operation.
 Changes in v7:
   - Return -EACCES instead of -EPERM.
 Changes in v6:
   - Remove tracing_on file from readonly instances.
   - Remove unused writable_mode from tracing_init_tracefs_percpu().
   - Cleanup init_tracer_tracefs() and create_event_toplevel_files().
   - Remove TRACE_MODE_WRITE_MASK.
   - Add TRACE_ARRAY_FL_RDONLY.
 Changes in v5:
   - Rebased on the latest for-next (and hide show_event_filters/triggers
     if the instance is readonly.
 Changes in v4:
  - Make trace data erasable. (not reusable)
 Changes in v3:
  - Resuse the beginning part of event_entries for readonly files.
  - Remove readonly file_operations and checking readonly flag in
    each write operation.
 Changes in v2:
  - Use readonly file_operations to prohibit writing instead of
    checking flags in write() callbacks.
  - Remove writable files from eventfs.
---
 kernel/trace/trace.c        |   81 +++++++++++++++++++++++++++----------------
 kernel/trace/trace.h        |    7 ++++
 kernel/trace/trace_boot.c   |    5 ++-
 kernel/trace/trace_events.c |   76 +++++++++++++++++++++++-----------------
 4 files changed, 104 insertions(+), 65 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 7f2fbf9c36df..fdffa62a2b4e 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -3414,6 +3414,11 @@ int tracing_open_generic_tr(struct inode *inode, struct file *filp)
 	if (ret)
 		return ret;
 
+	if ((filp->f_mode & FMODE_WRITE) && trace_array_is_readonly(tr)) {
+		trace_array_put(tr);
+		return -EACCES;
+	}
+
 	filp->private_data = inode->i_private;
 
 	return 0;
@@ -6435,6 +6440,11 @@ static int tracing_clock_open(struct inode *inode, struct file *file)
 	if (ret)
 		return ret;
 
+	if ((file->f_mode & FMODE_WRITE) && trace_array_is_readonly(tr)) {
+		trace_array_put(tr);
+		return -EACCES;
+	}
+
 	ret = single_open(file, tracing_clock_show, inode->i_private);
 	if (ret < 0)
 		trace_array_put(tr);
@@ -8771,17 +8781,22 @@ static __init void create_trace_instances(struct dentry *d_tracer)
 static void
 init_tracer_tracefs(struct trace_array *tr, struct dentry *d_tracer)
 {
+	umode_t writable_mode = TRACE_MODE_WRITE;
 	int cpu;
 
+	if (trace_array_is_readonly(tr))
+		writable_mode = TRACE_MODE_READ;
+
 	trace_create_file("available_tracers", TRACE_MODE_READ, d_tracer,
-			tr, &show_traces_fops);
+			  tr, &show_traces_fops);
 
-	trace_create_file("current_tracer", TRACE_MODE_WRITE, d_tracer,
-			tr, &set_tracer_fops);
+	trace_create_file("current_tracer", writable_mode, d_tracer,
+			  tr, &set_tracer_fops);
 
-	trace_create_file("tracing_cpumask", TRACE_MODE_WRITE, d_tracer,
+	trace_create_file("tracing_cpumask", writable_mode, d_tracer,
 			  tr, &tracing_cpumask_fops);
 
+	/* Options are used for changing print-format even for readonly instance. */
 	trace_create_file("trace_options", TRACE_MODE_WRITE, d_tracer,
 			  tr, &tracing_iter_fops);
 
@@ -8791,12 +8806,36 @@ init_tracer_tracefs(struct trace_array *tr, struct dentry *d_tracer)
 	trace_create_file("trace_pipe", TRACE_MODE_READ, d_tracer,
 			  tr, &tracing_pipe_fops);
 
-	trace_create_file("buffer_size_kb", TRACE_MODE_WRITE, d_tracer,
+	trace_create_file("buffer_size_kb", writable_mode, d_tracer,
 			  tr, &tracing_entries_fops);
 
 	trace_create_file("buffer_total_size_kb", TRACE_MODE_READ, d_tracer,
 			  tr, &tracing_total_entries_fops);
 
+	trace_create_file("trace_clock", writable_mode, d_tracer, tr,
+			  &trace_clock_fops);
+
+	trace_create_file("timestamp_mode", TRACE_MODE_READ, d_tracer, tr,
+			  &trace_time_stamp_mode_fops);
+
+	tr->buffer_percent = 50;
+
+	trace_create_file("buffer_subbuf_size_kb", writable_mode, d_tracer,
+			  tr, &buffer_subbuf_size_fops);
+
+	create_trace_options_dir(tr);
+
+	if (tr->range_addr_start)
+		trace_create_file("last_boot_info", TRACE_MODE_READ, d_tracer,
+				  tr, &last_boot_fops);
+
+	for_each_tracing_cpu(cpu)
+		tracing_init_tracefs_percpu(tr, cpu);
+
+	/* Read-only instance has above files only. */
+	if (trace_array_is_readonly(tr))
+		return;
+
 	trace_create_file("free_buffer", 0200, d_tracer,
 			  tr, &tracing_free_buffer_fops);
 
@@ -8808,49 +8847,29 @@ init_tracer_tracefs(struct trace_array *tr, struct dentry *d_tracer)
 	trace_create_file("trace_marker_raw", 0220, d_tracer,
 			  tr, &tracing_mark_raw_fops);
 
-	trace_create_file("trace_clock", TRACE_MODE_WRITE, d_tracer, tr,
-			  &trace_clock_fops);
-
-	trace_create_file("tracing_on", TRACE_MODE_WRITE, d_tracer,
-			  tr, &rb_simple_fops);
-
-	trace_create_file("timestamp_mode", TRACE_MODE_READ, d_tracer, tr,
-			  &trace_time_stamp_mode_fops);
-
-	tr->buffer_percent = 50;
-
 	trace_create_file("buffer_percent", TRACE_MODE_WRITE, d_tracer,
-			tr, &buffer_percent_fops);
-
-	trace_create_file("buffer_subbuf_size_kb", TRACE_MODE_WRITE, d_tracer,
-			  tr, &buffer_subbuf_size_fops);
+			  tr, &buffer_percent_fops);
 
 	trace_create_file("syscall_user_buf_size", TRACE_MODE_WRITE, d_tracer,
-			 tr, &tracing_syscall_buf_fops);
+			  tr, &tracing_syscall_buf_fops);
 
-	create_trace_options_dir(tr);
+	trace_create_file("tracing_on", TRACE_MODE_WRITE, d_tracer,
+			  tr, &rb_simple_fops);
 
 	trace_create_maxlat_file(tr, d_tracer);
 
 	if (ftrace_create_function_files(tr, d_tracer))
 		MEM_FAIL(1, "Could not allocate function filter files");
 
-	if (tr->range_addr_start) {
-		trace_create_file("last_boot_info", TRACE_MODE_READ, d_tracer,
-				  tr, &last_boot_fops);
 #ifdef CONFIG_TRACER_SNAPSHOT
-	} else {
+	if (!tr->range_addr_start)
 		trace_create_file("snapshot", TRACE_MODE_WRITE, d_tracer,
 				  tr, &snapshot_fops);
 #endif
-	}
 
 	trace_create_file("error_log", TRACE_MODE_WRITE, d_tracer,
 			  tr, &tracing_err_log_fops);
 
-	for_each_tracing_cpu(cpu)
-		tracing_init_tracefs_percpu(tr, cpu);
-
 	ftrace_init_tracefs(tr, d_tracer);
 }
 
@@ -9635,7 +9654,7 @@ __init static void enable_instances(void)
 		 * Backup buffers can be freed but need vfree().
 		 */
 		if (backup)
-			tr->flags |= TRACE_ARRAY_FL_VMALLOC;
+			tr->flags |= TRACE_ARRAY_FL_VMALLOC | TRACE_ARRAY_FL_RDONLY;
 
 		if (start || backup) {
 			tr->flags |= TRACE_ARRAY_FL_BOOT | TRACE_ARRAY_FL_LAST_BOOT;
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 6abd9e16ef21..2aa11178af59 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -464,6 +464,7 @@ enum {
 	TRACE_ARRAY_FL_MOD_INIT		= BIT(3),
 	TRACE_ARRAY_FL_MEMMAP		= BIT(4),
 	TRACE_ARRAY_FL_VMALLOC		= BIT(5),
+	TRACE_ARRAY_FL_RDONLY		= BIT(6),
 };
 
 #ifdef CONFIG_MODULES
@@ -493,6 +494,12 @@ extern unsigned long trace_adjust_address(struct trace_array *tr, unsigned long
 
 extern struct trace_array *printk_trace;
 
+static inline bool trace_array_is_readonly(struct trace_array *tr)
+{
+	/* backup instance is read only. */
+	return tr->flags & TRACE_ARRAY_FL_RDONLY;
+}
+
 /*
  * The global tracer (top) should be the first trace array added,
  * but we check the flag anyway.
diff --git a/kernel/trace/trace_boot.c b/kernel/trace/trace_boot.c
index dbe29b4c6a7a..2ca2541c8a58 100644
--- a/kernel/trace/trace_boot.c
+++ b/kernel/trace/trace_boot.c
@@ -61,7 +61,8 @@ trace_boot_set_instance_options(struct trace_array *tr, struct xbc_node *node)
 		v = memparse(p, NULL);
 		if (v < PAGE_SIZE)
 			pr_err("Buffer size is too small: %s\n", p);
-		if (tracing_resize_ring_buffer(tr, v, RING_BUFFER_ALL_CPUS) < 0)
+		if (trace_array_is_readonly(tr) ||
+		    tracing_resize_ring_buffer(tr, v, RING_BUFFER_ALL_CPUS) < 0)
 			pr_err("Failed to resize trace buffer to %s\n", p);
 	}
 
@@ -597,7 +598,7 @@ trace_boot_enable_tracer(struct trace_array *tr, struct xbc_node *node)
 
 	p = xbc_node_find_value(node, "tracer", NULL);
 	if (p && *p != '\0') {
-		if (tracing_set_tracer(tr, p) < 0)
+		if (trace_array_is_readonly(tr) || tracing_set_tracer(tr, p) < 0)
 			pr_err("Failed to set given tracer: %s\n", p);
 	}
 
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index de807a9e2371..7ddcee312471 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -1401,6 +1401,9 @@ static int __ftrace_set_clr_event(struct trace_array *tr, const char *match,
 {
 	int ret;
 
+	if (trace_array_is_readonly(tr))
+		return -EACCES;
+
 	mutex_lock(&event_mutex);
 	ret = __ftrace_set_clr_event_nolock(tr, match, sub, event, set, mod);
 	mutex_unlock(&event_mutex);
@@ -2969,8 +2972,8 @@ event_subsystem_dir(struct trace_array *tr, const char *name,
 	} else
 		__get_system(system);
 
-	/* ftrace only has directories no files */
-	if (strcmp(name, "ftrace") == 0)
+	/* ftrace only has directories no files, readonly instance too. */
+	if (strcmp(name, "ftrace") == 0 || trace_array_is_readonly(tr))
 		nr_entries = 0;
 	else
 		nr_entries = ARRAY_SIZE(system_entries);
@@ -3135,28 +3138,30 @@ event_create_dir(struct eventfs_inode *parent, struct trace_event_file *file)
 	int ret;
 	static struct eventfs_entry event_entries[] = {
 		{
-			.name		= "enable",
+			.name		= "format",
 			.callback	= event_callback,
-			.release	= event_release,
 		},
+#ifdef CONFIG_PERF_EVENTS
 		{
-			.name		= "filter",
+			.name		= "id",
 			.callback	= event_callback,
 		},
+#endif
+#define NR_RO_EVENT_ENTRIES	(1 + IS_ENABLED(CONFIG_PERF_EVENTS))
+/* Readonly files must be above this line and counted by NR_RO_EVENT_ENTRIES. */
 		{
-			.name		= "trigger",
+			.name		= "enable",
 			.callback	= event_callback,
+			.release	= event_release,
 		},
 		{
-			.name		= "format",
+			.name		= "filter",
 			.callback	= event_callback,
 		},
-#ifdef CONFIG_PERF_EVENTS
 		{
-			.name		= "id",
+			.name		= "trigger",
 			.callback	= event_callback,
 		},
-#endif
 #ifdef CONFIG_HIST_TRIGGERS
 		{
 			.name		= "hist",
@@ -3189,7 +3194,10 @@ event_create_dir(struct eventfs_inode *parent, struct trace_event_file *file)
 	if (!e_events)
 		return -ENOMEM;
 
-	nr_entries = ARRAY_SIZE(event_entries);
+	if (trace_array_is_readonly(tr))
+		nr_entries = NR_RO_EVENT_ENTRIES;
+	else
+		nr_entries = ARRAY_SIZE(event_entries);
 
 	name = trace_event_name(call);
 	ei = eventfs_create_dir(name, e_events, event_entries, nr_entries, file);
@@ -4532,31 +4540,44 @@ create_event_toplevel_files(struct dentry *parent, struct trace_array *tr)
 	int nr_entries;
 	static struct eventfs_entry events_entries[] = {
 		{
-			.name		= "enable",
+			.name		= "header_page",
 			.callback	= events_callback,
 		},
 		{
-			.name		= "header_page",
+			.name		= "header_event",
 			.callback	= events_callback,
 		},
+#define NR_RO_TOP_ENTRIES	2
+/* Readonly files must be above this line and counted by NR_RO_TOP_ENTRIES. */
 		{
-			.name		= "header_event",
+			.name		= "enable",
 			.callback	= events_callback,
 		},
 	};
 
-	entry = trace_create_file("set_event", TRACE_MODE_WRITE, parent,
-				  tr, &ftrace_set_event_fops);
-	if (!entry)
-		return -ENOMEM;
+	if (!trace_array_is_readonly(tr)) {
+		entry = trace_create_file("set_event", TRACE_MODE_WRITE, parent,
+					tr, &ftrace_set_event_fops);
+		if (!entry)
+			return -ENOMEM;
 
-	trace_create_file("show_event_filters", TRACE_MODE_READ, parent, tr,
-			  &ftrace_show_event_filters_fops);
+		/* There are not as crucial, just warn if they are not created */
+		trace_create_file("show_event_filters", TRACE_MODE_READ, parent, tr,
+				&ftrace_show_event_filters_fops);
 
-	trace_create_file("show_event_triggers", TRACE_MODE_READ, parent, tr,
-			  &ftrace_show_event_triggers_fops);
+		trace_create_file("show_event_triggers", TRACE_MODE_READ, parent, tr,
+				&ftrace_show_event_triggers_fops);
 
-	nr_entries = ARRAY_SIZE(events_entries);
+		trace_create_file("set_event_pid", TRACE_MODE_WRITE, parent,
+				tr, &ftrace_set_event_pid_fops);
+
+		trace_create_file("set_event_notrace_pid",
+				TRACE_MODE_WRITE, parent, tr,
+				&ftrace_set_event_notrace_pid_fops);
+		nr_entries = ARRAY_SIZE(events_entries);
+	} else {
+		nr_entries = NR_RO_TOP_ENTRIES;
+	}
 
 	e_events = eventfs_create_events_dir("events", parent, events_entries,
 					     nr_entries, tr);
@@ -4565,15 +4586,6 @@ create_event_toplevel_files(struct dentry *parent, struct trace_array *tr)
 		return -ENOMEM;
 	}
 
-	/* There are not as crucial, just warn if they are not created */
-
-	trace_create_file("set_event_pid", TRACE_MODE_WRITE, parent,
-			  tr, &ftrace_set_event_pid_fops);
-
-	trace_create_file("set_event_notrace_pid",
-			  TRACE_MODE_WRITE, parent, tr,
-			  &ftrace_set_event_notrace_pid_fops);
-
 	tr->event_dir = e_events;
 
 	return 0;


^ permalink raw reply related

* [PATCH v10 0/3] tracing: Remove backup instance after read all
From: Masami Hiramatsu (Google) @ 2026-04-01  6:37 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel

Hi,

Here is the v10 of the series to improve backup instances of
the persistent ring buffer. The previous version is here:

https://lore.kernel.org/all/177497473558.569199.6527680985537865638.stgit@mhiramat.tok.corp.google.com/

In this version, I fixed to use trace_array_destroy() and
cancel_work_sync() to avoid possible race. And remove unneeded
work_pending() check in trace_array_kick_autoremove()
(for trace_array_cancel_autoremove() to avoid deadlock of
workqueue, it still use work_pending() check) [2/3] and
add a newline at EOF [3/3]

Series Description
------------------
Since backup instances are a kind of snapshot of the persistent
ring buffer, it should be readonly. And if it is readonly
there is no reason to keep it after reading all data via trace_pipe
because the data has been consumed. But user should be able to remove
the readonly instance by rmdir or truncating `trace` file.

Thus, [1/3] makes backup instances readonly (not able to write any
events, cleanup trace, change buffer size). Also, [2/3] removes the
backup instance after consuming all data via trace_pipe.
With this improvements, even if we makes a backup instance (using
the same amount of memory of the persistent ring buffer), it will
be removed after reading the data automatically.

Thanks,

---

Masami Hiramatsu (Google) (3):
      tracing: Make the backup instance non-reusable
      tracing: Remove the backup instance automatically after read
      tracing/Documentation: Add a section about backup instance

 Documentation/trace/debugging.rst |   19 ++++
 kernel/trace/trace.c              |  160 ++++++++++++++++++++++++++++---------
 kernel/trace/trace.h              |   13 +++
 kernel/trace/trace_boot.c         |    5 +
 kernel/trace/trace_events.c       |   76 ++++++++++--------
 5 files changed, 202 insertions(+), 71 deletions(-)

--
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply

* [PATCH] ring-buffer: Enforce read ordering of trace_buffer cpumask and buffers
From: Vincent Donnefort @ 2026-04-01  5:36 UTC (permalink / raw)
  To: rostedt, mhiramat, mathieu.desnoyers, linux-trace-kernel
  Cc: kernel-team, linux-kernel, Vincent Donnefort

On CPU hotplug, if it is the first time a trace_buffer sees a CPU, a
ring_buffer_per_cpu will be allocated and its corresponding bit toggled
in the cpumask. Many readers check this cpumask to know if they can
safely read the ring_buffer_per_cpu but they are doing so without memory
ordering and may observe the cpumask bit set while having NULL buffer
pointer.

Enforce the memory read ordering by sending an IPI to all online CPUs.
The hotplug path is a slow-path anyway and it saves us from adding read
barriers in numerous call sites.

Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Suggested-by: Steven Rostedt (Google) <rostedt@goodmis.org>

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 170170bd83bd..10d2d0404434 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -7468,6 +7468,12 @@ int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu)
 	return 0;
 }
 
+static void rb_cpu_sync(void *data)
+{
+	/* Not really needed, but documents what is happening */
+	smp_rmb();
+}
+
 /*
  * We only allocate new buffers, never free them if the CPU goes down.
  * If we were to free the buffer, then the user would lose any trace that was in
@@ -7506,7 +7512,18 @@ int trace_rb_cpu_prepare(unsigned int cpu, struct hlist_node *node)
 		     cpu);
 		return -ENOMEM;
 	}
-	smp_wmb();
+
+	/*
+	 * Ensure trace_buffer readers observe the newly allocated
+	 * ring_buffer_per_cpu before they check the cpumask. Instead of using a
+	 * read barrier for all readers, send an IPI.
+	 */
+	if (unlikely(system_state == SYSTEM_RUNNING)) {
+		on_each_cpu(rb_cpu_sync, NULL, 1);
+		/* Not really needed, but documents what is happening */
+		smp_wmb();
+	}
+
 	cpumask_set_cpu(cpu, buffer->cpumask);
 	return 0;
 }

base-commit: 7aaa8047eafd0bd628065b15757d9b48c5f9c07d
-- 
2.53.0.1118.gaef5881109-goog


^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox