Linux Trace Kernel
 help / color / mirror / Atom feed
* Re: [PATCH 1/2] tracing: work around -Wmissing-format-attribute warning
From: Andy Shevchenko @ 2026-06-02 21:05 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Arnd Bergmann, Steven Rostedt, Masami Hiramatsu, Andrew Morton,
	Petr Mladek, Nathan Chancellor, Dennis Dalessandro,
	Jason Gunthorpe, Leon Romanovsky, Arend van Spriel,
	Miri Korenblit, Mathieu Desnoyers, Rasmus Villemoes,
	Sergey Senozhatsky, Nick Desaulniers, Bill Wendling, Justin Stitt,
	Vlastimil Babka (SUSE), linux-rdma, linux-kernel, linux-wireless,
	brcm80211, brcm80211-dev-list.pdl, linux-trace-kernel, llvm
In-Reply-To: <35c1ba62-e74d-4abc-aa73-ccd35968ff89@app.fastmail.com>

On Tue, Jun 02, 2026 at 10:32:04PM +0200, Arnd Bergmann wrote:
> On Tue, Jun 2, 2026, at 20:59, Andy Shevchenko wrote:
> > On Tue, Jun 02, 2026 at 05:07:05PM +0200, Arnd Bergmann wrote:

...

> > Why the __printf() annotation is in the C file and not here?
> > Is this all about headers as the second paragraph in the commit message 
> > explains?
> > I would add a comment to explain it here, otherwise we might see false 
> > patches to "make things consistent" in a wrong way.
> 
> I've tried to come up with a kerneldoc comment now, similar to
> the one for the vsnprintf() function, and added a separate prototype
> in the header. Does this address your concern?

Yes, see one nit, though.

> -int __printf(3, 0) __vsnprintf(char *buf, size_t size, const char *fmt_str, va_list args)
> +/**
> + * __vsnprintf - vsnprintf() wrapper without __printf() attribute
> + * @buf: The buffer to place the result into
> + * @size: The size of the buffer, including the trailing null space
> + * @fmt_str: The format string to use
> + * @args: Arguments for the format string
> + *
> + * This has the exact same behavior as vsnprintf() but can be used in call
> + * sites that are missing a __printf() annotation, e.g. because they
> + * get a 'va_format' argument instead of format and varargs.
> + *
> + * For this to work, the attribute is added to the declaration here but
> + * not in the header.

+ *
+ * Return: ...

> + */
> +int __printf(3, 0) __vsnprintf(char *buf, size_t size, const char *fmt_str, va_list args);

> +
> +int __vsnprintf(char *buf, size_t size, const char *fmt_str, va_list args)

Something slipped here...

-- 
With Best Regards,
Andy Shevchenko



^ permalink raw reply

* Re: [PATCH v7 07/42] KVM: guest_memfd: Only prepare folios for private pages
From: Ackerley Tng @ 2026-06-02 20:46 UTC (permalink / raw)
  To: Suzuki K Poulose, aik, andrew.jones, binbin.wu, brauner,
	chao.p.peng, david, ira.weiny, jmattson, jthoughton, michael.roth,
	oupton, pankaj.gupta, qperret, rick.p.edgecombe, rientjes,
	shivankg, steven.price, tabba, willy, wyihan, yan.y.zhao,
	forkloop, pratyush, aneesh.kumar, liam, Paolo Bonzini,
	Sean Christopherson, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
	Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka
  Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <d01cf1ec-b85d-4af6-9810-8107c0e2a4ec@arm.com>

Suzuki K Poulose <suzuki.poulose@arm.com> writes:

> On 23/05/2026 01:17, Ackerley Tng via B4 Relay wrote:
>> From: Ackerley Tng <ackerleytng@google.com>
>>
>> All-shared guest_memfd used to be only supported for non-CoCo VMs where
>> preparation doesn't apply. INIT_SHARED is about to be supported for
>> non-CoCo VMs in a later patch in this series.
>
> nit: s/non-CoCo/CoCo ?
>

Yes, thanks!

>>
>> In addition, KVM_SET_MEMORY_ATTRIBUTES2 is about to be supported in
>> guest_memfd in a later patch in this series.
>>
>> This means that the kvm fault handler may now call kvm_gmem_get_pfn() on a
>> shared folio for a CoCo VM where preparation applies.
>>
>> Add a check to make sure that preparation is only performed for private
>> folios.
>>
>> Preparation will be undone on freeing (see kvm_gmem_free_folio()) and on
>> conversion to shared.
>>
>> Signed-off-by: Michael Roth <michael.roth@amd.com>
>
> nit: Missing Co-Developed-by: ?
>

IIRC this should have been

Suggested-by: Michael Roth <michael.roth@amd.com>

IIRC Michael suggested this on one of the guest_memfd calls, Michael
please let me know if you remember otherwise!

>>
>> [...snip...]
>>

^ permalink raw reply

* Re: [PATCH 1/2] tracing: work around -Wmissing-format-attribute warning
From: Arnd Bergmann @ 2026-06-02 20:32 UTC (permalink / raw)
  To: Andy Shevchenko, Arnd Bergmann
  Cc: Steven Rostedt, Masami Hiramatsu, Andrew Morton, Petr Mladek,
	Nathan Chancellor, Dennis Dalessandro, Jason Gunthorpe,
	Leon Romanovsky, Arend van Spriel, Miri Korenblit,
	Mathieu Desnoyers, Rasmus Villemoes, Sergey Senozhatsky,
	Nick Desaulniers, Bill Wendling, Justin Stitt,
	Vlastimil Babka (SUSE), linux-rdma, linux-kernel, linux-wireless,
	brcm80211, brcm80211-dev-list.pdl, linux-trace-kernel, llvm
In-Reply-To: <ah8n-Nk305S5hRwN@ashevche-desk.local>

On Tue, Jun 2, 2026, at 20:59, Andy Shevchenko wrote:
> On Tue, Jun 02, 2026 at 05:07:05PM +0200, Arnd Bergmann wrote:
>> 
>> A number of tracing headers turn off -Wsuggest-attribute=format for
>> gcc, but they don't turn it off for clang, so the same warning still
>> happens on new versions of clang that support the format attribute.
>> 
>> To avoid duplicating the same thing in each tracing header, as well
>> as changing all of them to also turn it off for clang, add a new
>> __vsnprintf() helper that is not annotated this way in linux/sprintf.h
>> but is defined to work the same way as the regular vsprintf.
>
> vsprintf()

Fixed now

> Why the __printf() annotation is in the C file and not here?
> Is this all about headers as the second paragraph in the commit message 
> explains?
> I would add a comment to explain it here, otherwise we might see false 
> patches to "make things consistent" in a wrong way.

I've tried to come up with a kerneldoc comment now, similar to
the one for the vsnprintf() function, and added a separate prototype
in the header. Does this address your concern?

      Arnd

diff --git a/lib/vsprintf.c b/lib/vsprintf.c
index 3caf0796f54d..7c696aea2ed3 100644
--- a/lib/vsprintf.c
+++ b/lib/vsprintf.c
@@ -2975,7 +2975,23 @@ int vsnprintf(char *buf, size_t size, const char *fmt_str, va_list args)
 }
 EXPORT_SYMBOL(vsnprintf);
 
-int __printf(3, 0) __vsnprintf(char *buf, size_t size, const char *fmt_str, va_list args)
+/**
+ * __vsnprintf - vsnprintf() wrapper without __printf() attribute
+ * @buf: The buffer to place the result into
+ * @size: The size of the buffer, including the trailing null space
+ * @fmt_str: The format string to use
+ * @args: Arguments for the format string
+ *
+ * This has the exact same behavior as vsnprintf() but can be used in call
+ * sites that are missing a __printf() annotation, e.g. because they
+ * get a 'va_format' argument instead of format and varargs.
+ *
+ * For this to work, the attribute is added to the declaration here but
+ * not in the header.
+ */
+int __printf(3, 0) __vsnprintf(char *buf, size_t size, const char *fmt_str, va_list args);
+
+int __vsnprintf(char *buf, size_t size, const char *fmt_str, va_list args)
 {
 	return vsnprintf(buf, size, fmt_str, args);
 }

^ permalink raw reply related

* Re: [PATCH] tracing: Reject tracefs buffer size values that overflow bytes
From: Steven Rostedt @ 2026-06-02 20:03 UTC (permalink / raw)
  To: Sam Moelius
  Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel
In-Reply-To: <20260602184335.1554470-1-sam.moelius@trailofbits.com>

On Tue,  2 Jun 2026 18:43:34 +0000
Sam Moelius <sam.moelius@trailofbits.com> wrote:

> From: Samuel Moelius <sam.moelius@trailofbits.com>
> 
> `tracing_entries_write()` accepts a `buffer_size_kb` value as
> `unsigned long`, checks only for zero, then shifts left by 10. On
> 64-bit, writing `18014398509481984` KB wraps the byte count to zero
> and the ring buffer resize path accepts it as a tiny buffer instead
> of rejecting an impossible huge size.
> 
> The fix also adds the same pre-scale overflow check to
> `buffer_subbuf_size_write()`.

Honestly, enter stupid values, get stupid results.

I don't think this is necessary. Nothing breaks but the person may get
confused by being confused by the confusing entries they make.

-- Steve

^ permalink raw reply

* Re: [PATCH 1/2] tracing: work around -Wmissing-format-attribute warning
From: Andy Shevchenko @ 2026-06-02 18:59 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Steven Rostedt, Masami Hiramatsu, Andrew Morton, Petr Mladek,
	Nathan Chancellor, Arnd Bergmann, Dennis Dalessandro,
	Jason Gunthorpe, Leon Romanovsky, Arend van Spriel,
	Miri Korenblit, Mathieu Desnoyers, Rasmus Villemoes,
	Sergey Senozhatsky, Nick Desaulniers, Bill Wendling, Justin Stitt,
	Vlastimil Babka, linux-rdma, linux-kernel, linux-wireless,
	brcm80211, brcm80211-dev-list.pdl, linux-trace-kernel, llvm
In-Reply-To: <20260602150904.2258624-1-arnd@kernel.org>

On Tue, Jun 02, 2026 at 05:07:05PM +0200, Arnd Bergmann wrote:
> 
> A number of tracing headers turn off -Wsuggest-attribute=format for
> gcc, but they don't turn it off for clang, so the same warning still
> happens on new versions of clang that support the format attribute.
> 
> To avoid duplicating the same thing in each tracing header, as well
> as changing all of them to also turn it off for clang, add a new
> __vsnprintf() helper that is not annotated this way in linux/sprintf.h
> but is defined to work the same way as the regular vsprintf.

vsprintf()

> Aside from tracing, the same thing can be used in va_format(),
> which is part of lib/vsprintf.c itself.

...

> --- a/include/linux/sprintf.h
> +++ b/include/linux/sprintf.h
> @@ -12,6 +12,7 @@ __printf(2, 3) int sprintf(char *buf, const char * fmt, ...);
>  __printf(2, 0) int vsprintf(char *buf, const char *, va_list);
>  __printf(3, 4) int snprintf(char *buf, size_t size, const char *fmt, ...);
>  __printf(3, 0) int vsnprintf(char *buf, size_t size, const char *fmt, va_list args);
> +int __vsnprintf(char *buf, size_t size, const char *fmt, va_list args);

Why the __printf() annotation is in the C file and not here?
Is this all about headers as the second paragraph in the commit message explains?
I would add a comment to explain it here, otherwise we might see false patches to
"make things consistent" in a wrong way.

>  __printf(3, 4) int scnprintf(char *buf, size_t size, const char *fmt, ...);
>  __printf(3, 0) int vscnprintf(char *buf, size_t size, const char *fmt, va_list args);
>  __printf(2, 3) __malloc char *kasprintf(gfp_t gfp, const char *fmt, ...);

-- 
With Best Regards,
Andy Shevchenko



^ permalink raw reply

* [PATCH] tracing: Reject tracefs buffer size values that overflow bytes
From: Sam Moelius @ 2026-06-02 18:43 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Samuel Moelius, Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel

From: Samuel Moelius <sam.moelius@trailofbits.com>

`tracing_entries_write()` accepts a `buffer_size_kb` value as
`unsigned long`, checks only for zero, then shifts left by 10. On
64-bit, writing `18014398509481984` KB wraps the byte count to zero
and the ring buffer resize path accepts it as a tiny buffer instead
of rejecting an impossible huge size.

The fix also adds the same pre-scale overflow check to
`buffer_subbuf_size_write()`.

Assisted-by: Codex:gpt-5.5-cyber-preview
Signed-off-by: Samuel Moelius <sam.moelius@trailofbits.com>
---
 kernel/trace/trace.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 6eb4d3097a4d..79da29c3d525 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -5735,7 +5735,7 @@ tracing_entries_write(struct file *filp, const char __user *ubuf,
 		return ret;
 
 	/* must have at least 1 entry */
-	if (!val)
+	if (!val || val > ULONG_MAX >> 10)
 		return -EINVAL;
 
 	/* value is in KB */
@@ -8206,6 +8206,9 @@ buffer_subbuf_size_write(struct file *filp, const char __user *ubuf,
 	if (ret)
 		return ret;
 
+	if (!val || val > ULONG_MAX / 1024)
+		return -EINVAL;
+
 	val *= 1024; /* value passed in is in KB */
 
 	pages = DIV_ROUND_UP(val, PAGE_SIZE);
-- 
2.43.0


^ permalink raw reply related

* Re: [PATCH mm-unstable v18 11/14] mm/khugepaged: Introduce mTHP collapse support
From: Nico Pache @ 2026-06-02 17:26 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat,
	mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, vbabka, vishal.moola,
	wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe,
	Usama Arif, usamaarif642
In-Reply-To: <19639b08-5bf1-4974-9635-c458d512fa38@redhat.com>

On Tue, Jun 2, 2026 at 11:22 AM Nico Pache <npache@redhat.com> wrote:
>
>
>
> On 6/1/26 7:15 AM, David Hildenbrand (Arm) wrote:
> >>>
> >>> Reading this, it is unclear why exactly do we need the stack.
> >>
> >> So I looked into your items below. It seems logical, and I think it
> >> works the same way; however, your method seems slightly harder to
> >> understand due to all the edge cases and more error-prone to future
> >> changes (the stack holds implicit knowledge of the offset/order that
> >> must now be tracked in the edge cases).
> >>
> >> Given the stack is 24 bytes, I'm not sure if the extra complexity is
> >> worth saving that small amount of memory. Although we would also be
> >> getting rid of (3?) functions, so both approaches have pros and cons.
> >
> > I consider a simple forward loop over the offset ... less complexity compared to
> > a stack structure :)
> >
> >>
> >> I will implement a patch comparing your solution against mine and send
> >> it here, then we can decide which approach is better.
> >
> > Right, throw it over the fence and I'll see how to improve it further.
>
> Ok heres what the diff looks like on top of my V19.
>
> you can access the tree here https://gitlab.com/npache/linux/-/commits/mthp-v19?ref_type=heads for easier review.
>
> So far I have no problem with this approach it appeared cleaner than i thought. Did some light testing. Gonna throw it more through the ringer tomorrow.

not sure why this didnt send with the proper encoding I guess my email
is still a little screwed up

>
>
> From 9496c5d17eba7f6d04820d78c7c6f1592a58888a Mon Sep 17 00:00:00 2001
> From: Nico Pache <npache@redhat.com>
> Date: Tue, 2 Jun 2026 10:26:18 -0600
> Subject: [PATCH] convert from stack to forward loop
>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>  mm/khugepaged.c | 96 ++++++++-----------------------------------------
>  1 file changed, 15 insertions(+), 81 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 498eba009751..6de935e76ceb 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -100,28 +100,6 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
>  static struct kmem_cache *mm_slot_cache __ro_after_init;
>
>  #define KHUGEPAGED_MIN_MTHP_ORDER      2
> -/*
> - * mthp_collapse() does an iterative DFS over a binary tree, from
> - * HPAGE_PMD_ORDER down to KHUGEPAGED_MIN_MTHP_ORDER. The max stack
> - * size needed for a DFS on a binary tree is height + 1, where
> - * height = HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER.
> - *
> - * ilog2 is used in place of HPAGE_PMD_ORDER because some architectures
> - * (e.g. ppc64le) do not define HPAGE_PMD_ORDER until after build time.
> - */
> -#define MTHP_STACK_SIZE        (ilog2(MAX_PTRS_PER_PTE) - KHUGEPAGED_MIN_MTHP_ORDER + 1)
> -
> -/*
> - * Defines a range of PTE entries in a PTE page table which are being
> - * considered for mTHP collapse.
> - *
> - * @offset: the offset of the first PTE entry in a PMD range.
> - * @order: the order of the PTE entries being considered for collapse.
> - */
> -struct mthp_range {
> -       u16 offset;
> -       u8 order;
> -};
>
>  struct collapse_control {
>         bool is_khugepaged;
> @@ -137,7 +115,6 @@ struct collapse_control {
>
>         /* Each bit represents a single occupied (!none/zero) page. */
>         DECLARE_BITMAP(mthp_present_ptes, MAX_PTRS_PER_PTE);
> -       struct mthp_range mthp_bitmap_stack[MTHP_STACK_SIZE];
>  };
>
>  /**
> @@ -1458,50 +1435,14 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long s
>         return result;
>  }
>
> -static void collapse_mthp_stack_push(struct collapse_control *cc, int *stack_size,
> -                                    u16 offset, u8 order)
> -{
> -       const int size = *stack_size;
> -       struct mthp_range *stack = &cc->mthp_bitmap_stack[size];
> -
> -       VM_WARN_ON_ONCE(size >= MTHP_STACK_SIZE);
> -       stack->order = order;
> -       stack->offset = offset;
> -       (*stack_size)++;
> -}
> -
> -static struct mthp_range collapse_mthp_stack_pop(struct collapse_control *cc,
> -                                                int *stack_size)
> -{
> -       const int size = *stack_size;
> -
> -       VM_WARN_ON_ONCE(size <= 0);
> -       (*stack_size)--;
> -       return cc->mthp_bitmap_stack[size - 1];
> -}
> -
>  /*
>   * mthp_collapse() consumes the bitmap that is generated during
>   * collapse_scan_pmd() to determine what regions and mTHP orders fit best.
>   *
>   * Each bit in cc->mthp_present_ptes represents a single occupied (!none/zero)
> - * page. A stack structure cc->mthp_bitmap_stack is used to check different
> - * regions of the bitmap for collapse eligibility. The stack maintains a pair
> - * of variables (offset, order), indicating the number of PTEs from the start
> - * of the PMD, and the order of the potential collapse candidate respectively.
> - * We start at the PMD order and check if it is eligible for collapse; if not,
> - * we add two entries to the stack at a lower order to represent the left and
> - * right halves of the PTE page table we are examining.
> - *
> - *                         offset       mid_offset
> - *                         |         |
> - *                         |         |
> - *                         v         v
> - *      --------------------------------------
> - *      |       cc->mthp_present_ptes         |
> - *      --------------------------------------
> - *                         <-------><------->
> - *                          order-1  order-1
> + * page. We start at the PMD order and check if it is eligible for collapse;
> + * if not, we check the left and right halves of the PTE page table we are
> + * examining at a lower order.
>   *
>   * For each of these, we determine how many PTE entries are occupied in the
>   * range of PTE entries we propose to collapse, then we compare this to a
> @@ -1517,26 +1458,20 @@ static enum scan_result mthp_collapse(struct mm_struct *mm,
>  {
>         unsigned int nr_occupied_ptes, nr_ptes, max_ptes_none;
>         enum scan_result last_result = SCAN_FAIL;
> -       int collapsed = 0, stack_size = 0;
> +       int collapsed = 0;
>         bool alloc_failed = false;
>         unsigned long collapse_address;
> -       struct mthp_range range;
> -       u16 offset;
> -       u8 order;
> +       unsigned int offset = 0;
> +       unsigned int order = HPAGE_PMD_ORDER;
>
> -       collapse_mthp_stack_push(cc, &stack_size, 0, HPAGE_PMD_ORDER);
>
> -       while (stack_size) {
> -               range = collapse_mthp_stack_pop(cc, &stack_size);
> -               order = range.order;
> -               offset = range.offset;
> +       while (offset < HPAGE_PMD_NR) {
>                 nr_ptes = 1UL << order;
>
>                 if (!test_bit(order, &enabled_orders))
>                         goto next_order;
>
>                 max_ptes_none = collapse_max_ptes_none(cc, NULL, order);
> -
>                 nr_occupied_ptes = bitmap_weight_from(cc->mthp_present_ptes, offset,
>                                                       offset + nr_ptes);
>
> @@ -1553,7 +1488,7 @@ static enum scan_result mthp_collapse(struct mm_struct *mm,
>                                 collapsed += nr_ptes;
>                                 fallthrough;
>                         case SCAN_PTE_MAPPED_HUGEPAGE:
> -                               continue;
> +                               goto next_offset;
>                         /* Cases where lower orders might still succeed */
>                         case SCAN_ALLOC_HUGE_PAGE_FAIL:
>                                 alloc_failed = true;
> @@ -1581,15 +1516,14 @@ static enum scan_result mthp_collapse(struct mm_struct *mm,
>                 }
>
>  next_order:
> -               if ((BIT(order) - 1) & enabled_orders) {
> -                       const u8 next_order = order - 1;
> -                       const u16 mid_offset = offset + (nr_ptes / 2);
> -
> -                       collapse_mthp_stack_push(cc, &stack_size, mid_offset,
> -                                                next_order);
> -                       collapse_mthp_stack_push(cc, &stack_size, offset,
> -                                                next_order);
> +               if (order > KHUGEPAGED_MIN_MTHP_ORDER &&
> +                       (BIT(order) - 1) & enabled_orders) {
> +                       order = order - 1;
> +                       continue;
>                 }
> +next_offset:
> +               offset += nr_ptes;
> +               order = min_t(int, __ffs(offset), HPAGE_PMD_ORDER);
>         }
>  done:
>         if (collapsed)
> --
> 2.54.0
>
>
>
> >
> > [...]
> >
> >>>> +     bitmap_zero(cc->mthp_bitmap, MAX_PTRS_PER_PTE);
> >>>>       memset(cc->node_load, 0, sizeof(cc->node_load));
> >>>>       nodes_clear(cc->alloc_nmask);
> >>>> +
> >>>> +     enabled_orders = collapse_allowable_orders(vma, vma->vm_flags, tva_flags);
> >>>> +
> >>>> +     /*
> >>>> +      * If PMD is the only enabled order, enforce max_ptes_none, otherwise
> >>>> +      * scan all pages to populate the bitmap for mTHP collapse.
> >>>> +      */
> >>>
> >>> You should note here, that we re-verify in mthp_collapse().
> >>>
> >>> But the question is, whether we should relocate the check completely into
> >>> mthp_collapse(), instead of conditionally duplicating it.
> >>>
> >>> What speaks against always populating the bitmap and making the decision in
> >>> mthp_collapse()?
> >>>
> >>> Sure, we might scan a page table a bit longer, but the code gets clearer ... and
> >>> I am not sure if scanning some more page table entries is really that critical here.
> >>
> >> Someone asked me to preserve the legacy behavior (PMD only). Although
> >> rather trivial, if you set max_ptes_none=0 for example, we'd still
> >> have to do 511 iterations for no reason if PMD collapse is the only
> >> enabled order rather than bailing immediately.
> >>
> >> I'm ok with dropping it, but I think its the correct approach (despite
> >> the extra complexity). @Usama Arif brought up this point here
> >> https://lore.kernel.org/all/f8f7bb71-ca31-46ee-a62d-7ddfd83e0ead@gmail.com/
> >
> > We talk about regressions, but I am not sure if we care about scanning speed
> > within a page table that much?
> >
> > After all, we locked it and already read some entries.
> >
> > Having the same check at two places to optimize for PMD order might right now
> > feel like a good optimization, but likely an irrelevant one in a near future?
> >
> > Anyhow, won't push back, as long as we document why we are special casing things
> > here.
> >
>


^ permalink raw reply

* Re: [PATCH mm-unstable v18 11/14] mm/khugepaged: Introduce mTHP collapse support
From: Nico Pache @ 2026-06-02 17:23 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat,
	mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, vbabka, vishal.moola,
	wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe,
	Usama Arif, usamaarif642
In-Reply-To: <d3c2b00c-6810-434a-b837-0707b0a11611@kernel.org>



On 6/1/26 7:15 AM, David Hildenbrand (Arm) wrote:
>>>
>>> Reading this, it is unclear why exactly do we need the stack.
>>
>> So I looked into your items below. It seems logical, and I think it
>> works the same way; however, your method seems slightly harder to
>> understand due to all the edge cases and more error-prone to future
>> changes (the stack holds implicit knowledge of the offset/order that
>> must now be tracked in the edge cases).
>>
>> Given the stack is 24 bytes, I'm not sure if the extra complexity is
>> worth saving that small amount of memory. Although we would also be
>> getting rid of (3?) functions, so both approaches have pros and cons.
> 
> I consider a simple forward loop over the offset ... less complexity compared to
> a stack structure :)
> 
>>
>> I will implement a patch comparing your solution against mine and send
>> it here, then we can decide which approach is better.
> 
> Right, throw it over the fence and I'll see how to improve it further.

Ok heres what the diff looks like on top of my V19. 

you can access the tree here https://gitlab.com/npache/linux/-/commits/mthp-v19?ref_type=heads for easier review.

So far I have no problem with this approach it appeared cleaner than i thought. Did some light testing. Gonna throw it more through the ringer tomorrow. 


From 9496c5d17eba7f6d04820d78c7c6f1592a58888a Mon Sep 17 00:00:00 2001
From: Nico Pache <npache@redhat.com>
Date: Tue, 2 Jun 2026 10:26:18 -0600
Subject: [PATCH] convert from stack to forward loop

Signed-off-by: Nico Pache <npache@redhat.com>
---
 mm/khugepaged.c | 96 ++++++++-----------------------------------------
 1 file changed, 15 insertions(+), 81 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 498eba009751..6de935e76ceb 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -100,28 +100,6 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
 static struct kmem_cache *mm_slot_cache __ro_after_init;
 
 #define KHUGEPAGED_MIN_MTHP_ORDER	2
-/*
- * mthp_collapse() does an iterative DFS over a binary tree, from
- * HPAGE_PMD_ORDER down to KHUGEPAGED_MIN_MTHP_ORDER. The max stack
- * size needed for a DFS on a binary tree is height + 1, where
- * height = HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER.
- *
- * ilog2 is used in place of HPAGE_PMD_ORDER because some architectures
- * (e.g. ppc64le) do not define HPAGE_PMD_ORDER until after build time.
- */
-#define MTHP_STACK_SIZE	(ilog2(MAX_PTRS_PER_PTE) - KHUGEPAGED_MIN_MTHP_ORDER + 1)
-
-/*
- * Defines a range of PTE entries in a PTE page table which are being
- * considered for mTHP collapse.
- *
- * @offset: the offset of the first PTE entry in a PMD range.
- * @order: the order of the PTE entries being considered for collapse.
- */
-struct mthp_range {
-	u16 offset;
-	u8 order;
-};
 
 struct collapse_control {
 	bool is_khugepaged;
@@ -137,7 +115,6 @@ struct collapse_control {
 
 	/* Each bit represents a single occupied (!none/zero) page. */
 	DECLARE_BITMAP(mthp_present_ptes, MAX_PTRS_PER_PTE);
-	struct mthp_range mthp_bitmap_stack[MTHP_STACK_SIZE];
 };
 
 /**
@@ -1458,50 +1435,14 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long s
 	return result;
 }
 
-static void collapse_mthp_stack_push(struct collapse_control *cc, int *stack_size,
-				     u16 offset, u8 order)
-{
-	const int size = *stack_size;
-	struct mthp_range *stack = &cc->mthp_bitmap_stack[size];
-
-	VM_WARN_ON_ONCE(size >= MTHP_STACK_SIZE);
-	stack->order = order;
-	stack->offset = offset;
-	(*stack_size)++;
-}
-
-static struct mthp_range collapse_mthp_stack_pop(struct collapse_control *cc,
-						 int *stack_size)
-{
-	const int size = *stack_size;
-
-	VM_WARN_ON_ONCE(size <= 0);
-	(*stack_size)--;
-	return cc->mthp_bitmap_stack[size - 1];
-}
-
 /*
  * mthp_collapse() consumes the bitmap that is generated during
  * collapse_scan_pmd() to determine what regions and mTHP orders fit best.
  *
  * Each bit in cc->mthp_present_ptes represents a single occupied (!none/zero)
- * page. A stack structure cc->mthp_bitmap_stack is used to check different
- * regions of the bitmap for collapse eligibility. The stack maintains a pair
- * of variables (offset, order), indicating the number of PTEs from the start
- * of the PMD, and the order of the potential collapse candidate respectively.
- * We start at the PMD order and check if it is eligible for collapse; if not,
- * we add two entries to the stack at a lower order to represent the left and
- * right halves of the PTE page table we are examining.
- *
- *                         offset       mid_offset
- *                         |         |
- *                         |         |
- *                         v         v
- *      --------------------------------------
- *      |       cc->mthp_present_ptes         |
- *      --------------------------------------
- *                         <-------><------->
- *                          order-1  order-1
+ * page. We start at the PMD order and check if it is eligible for collapse;
+ * if not, we check the left and right halves of the PTE page table we are
+ * examining at a lower order.
  *
  * For each of these, we determine how many PTE entries are occupied in the
  * range of PTE entries we propose to collapse, then we compare this to a
@@ -1517,26 +1458,20 @@ static enum scan_result mthp_collapse(struct mm_struct *mm,
 {
 	unsigned int nr_occupied_ptes, nr_ptes, max_ptes_none;
 	enum scan_result last_result = SCAN_FAIL;
-	int collapsed = 0, stack_size = 0;
+	int collapsed = 0;
 	bool alloc_failed = false;
 	unsigned long collapse_address;
-	struct mthp_range range;
-	u16 offset;
-	u8 order;
+	unsigned int offset = 0;
+	unsigned int order = HPAGE_PMD_ORDER;
 
-	collapse_mthp_stack_push(cc, &stack_size, 0, HPAGE_PMD_ORDER);
 
-	while (stack_size) {
-		range = collapse_mthp_stack_pop(cc, &stack_size);
-		order = range.order;
-		offset = range.offset;
+	while (offset < HPAGE_PMD_NR) {
 		nr_ptes = 1UL << order;
 
 		if (!test_bit(order, &enabled_orders))
 			goto next_order;
 
 		max_ptes_none = collapse_max_ptes_none(cc, NULL, order);
-
 		nr_occupied_ptes = bitmap_weight_from(cc->mthp_present_ptes, offset,
 						      offset + nr_ptes);
 
@@ -1553,7 +1488,7 @@ static enum scan_result mthp_collapse(struct mm_struct *mm,
 				collapsed += nr_ptes;
 				fallthrough;
 			case SCAN_PTE_MAPPED_HUGEPAGE:
-				continue;
+				goto next_offset;
 			/* Cases where lower orders might still succeed */
 			case SCAN_ALLOC_HUGE_PAGE_FAIL:
 				alloc_failed = true;
@@ -1581,15 +1516,14 @@ static enum scan_result mthp_collapse(struct mm_struct *mm,
 		}
 
 next_order:
-		if ((BIT(order) - 1) & enabled_orders) {
-			const u8 next_order = order - 1;
-			const u16 mid_offset = offset + (nr_ptes / 2);
-
-			collapse_mthp_stack_push(cc, &stack_size, mid_offset,
-						 next_order);
-			collapse_mthp_stack_push(cc, &stack_size, offset,
-						 next_order);
+		if (order > KHUGEPAGED_MIN_MTHP_ORDER &&
+			(BIT(order) - 1) & enabled_orders) {
+			order = order - 1;
+			continue;
 		}
+next_offset:
+		offset += nr_ptes;
+		order = min_t(int, __ffs(offset), HPAGE_PMD_ORDER);
 	}
 done:
 	if (collapsed)
-- 
2.54.0



> 
> [...]
> 
>>>> +     bitmap_zero(cc->mthp_bitmap, MAX_PTRS_PER_PTE);
>>>>       memset(cc->node_load, 0, sizeof(cc->node_load));
>>>>       nodes_clear(cc->alloc_nmask);
>>>> +
>>>> +     enabled_orders = collapse_allowable_orders(vma, vma->vm_flags, tva_flags);
>>>> +
>>>> +     /*
>>>> +      * If PMD is the only enabled order, enforce max_ptes_none, otherwise
>>>> +      * scan all pages to populate the bitmap for mTHP collapse.
>>>> +      */
>>>
>>> You should note here, that we re-verify in mthp_collapse().
>>>
>>> But the question is, whether we should relocate the check completely into
>>> mthp_collapse(), instead of conditionally duplicating it.
>>>
>>> What speaks against always populating the bitmap and making the decision in
>>> mthp_collapse()?
>>>
>>> Sure, we might scan a page table a bit longer, but the code gets clearer ... and
>>> I am not sure if scanning some more page table entries is really that critical here.
>>
>> Someone asked me to preserve the legacy behavior (PMD only). Although
>> rather trivial, if you set max_ptes_none=0 for example, we'd still
>> have to do 511 iterations for no reason if PMD collapse is the only
>> enabled order rather than bailing immediately.
>>
>> I'm ok with dropping it, but I think its the correct approach (despite
>> the extra complexity). @Usama Arif brought up this point here
>> https://lore.kernel.org/all/f8f7bb71-ca31-46ee-a62d-7ddfd83e0ead@gmail.com/
> 
> We talk about regressions, but I am not sure if we care about scanning speed
> within a page table that much?
> 
> After all, we locked it and already read some entries.
> 
> Having the same check at two places to optimize for PMD order might right now
> feel like a good optimization, but likely an irrelevant one in a near future?
> 
> Anyhow, won't push back, as long as we document why we are special casing things
> here.
> 


^ permalink raw reply related

* [PATCH 13/13] Documentation/kernel-parameters: Add trace_remote
From: Vincent Donnefort @ 2026-06-02 17:11 UTC (permalink / raw)
  To: rostedt, mhiramat, mathieu.desnoyers, linux-trace-kernel
  Cc: kernel-team, linux-kernel, Vincent Donnefort
In-Reply-To: <20260602171146.2238998-1-vdonnefort@google.com>

The trace_remote parameter allows to configure a trace remote on
registration. The syntax is similar to trace_instance.

Signed-off-by: Vincent Donnefort <vdonnefort@google.com>

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 97007f4f69d4..dec8fd1f5ff9 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -7792,6 +7792,22 @@ Kernel parameters
 			See also Documentation/trace/ftrace.rst "trace options"
 			section.
 
+	trace_remote=[remote-info]
+			[FTRACE] Configure a trace remote instance at boot.
+			Format: <name>[^option1[^option2...]][,event1[,event2...]]
+
+			Supported options:
+			  dump_on_oops - Enable panic load to dump the trace
+			                 buffer on oops/panic.
+			  printk       - Redirect tracing output to printk.
+			  buf_size=<size> - Set trace buffer size (e.g. 2M).
+
+			Events are a comma-separated list of events to enable.
+			If events are specified, tracing is automatically enabled.
+
+			Multiple remotes can be configured by specifying this
+			parameter multiple times.
+
 	trace_trigger=[trigger-list]
 			[FTRACE] Add an event trigger on specific events.
 			Set a trigger on top of a specific event, with an optional
-- 
2.54.0.1032.g2f8565e1d1-goog


^ permalink raw reply related

* [PATCH 12/13] tracing/remotes: Add trace_remote cmdline options
From: Vincent Donnefort @ 2026-06-02 17:11 UTC (permalink / raw)
  To: rostedt, mhiramat, mathieu.desnoyers, linux-trace-kernel
  Cc: kernel-team, linux-kernel, Vincent Donnefort
In-Reply-To: <20260602171146.2238998-1-vdonnefort@google.com>

Following the same format as trace_instance, add a cmdline to configure
a trace remote on registration:

  trace_remote=<remote>^<opt1>^<opt2>,<evt1>,<evt2>

Enabling events automatically turns on tracing.

Signed-off-by: Vincent Donnefort <vdonnefort@google.com>

diff --git a/kernel/trace/trace_remote.c b/kernel/trace/trace_remote.c
index 3164f327d4d8..0b44636b6eea 100644
--- a/kernel/trace/trace_remote.c
+++ b/kernel/trace/trace_remote.c
@@ -13,6 +13,8 @@
 #include <linux/trace_seq.h>
 #include <linux/types.h>
 
+#include <asm/setup.h>
+
 #include "trace.h"
 
 #define TRACEFS_DIR		"remotes"
@@ -1057,6 +1059,124 @@ static int dump_on_oops_show(struct seq_file *s, void *unused)
 }
 DEFINE_SHOW_STORE_ATTRIBUTE(dump_on_oops);
 
+static char trace_remote_cmdline[COMMAND_LINE_SIZE];
+
+static int __init set_trace_remote_cmdline(char *str)
+{
+	static int idx;
+
+	if (!str)
+		return 0;
+
+	strscpy(trace_remote_cmdline + idx, str, COMMAND_LINE_SIZE - idx);
+	idx += strlen(str);
+	trace_remote_cmdline[idx++] = '\t';
+	return 1;
+}
+__setup("trace_remote=", set_trace_remote_cmdline);
+
+static void trace_remote_apply_cmdline_opts(struct trace_remote *remote, char *cmdline)
+{
+	bool printk_on = false;
+	char *opt;
+	int ret;
+
+	while ((opt = strsep(&cmdline, "^"))) {
+		if (!*opt)
+			continue;
+
+		if (!strcmp(opt, "dump_on_oops")) {
+			remote->panic_on = true;
+		} else if (!strcmp(opt, "printk")) {
+			printk_on = true;
+		} else if (!strncmp(opt, "buf_size=", 9)) {
+			/* buf_size can only be applied if the buffer is unloaded */
+			if (!WARN_ON(trace_remote_loaded(remote)))
+				remote->trace_buffer_size = memparse(opt + 9, NULL);
+		} else {
+			pr_warn("Unknown trace remote option '%s'\n", opt);
+		}
+	}
+
+	if (printk_on) {
+		ret = trace_remote_enable_printk(remote, true);
+		if (ret)
+			pr_warn("Failed to enable trace remote printk (%d)\n", ret);
+	}
+}
+
+static struct remote_event *
+trace_remote_find_event_by_name(struct trace_remote *remote, const char *name);
+
+static int
+trace_remote_enable_event(struct trace_remote *remote, struct remote_event *evt, bool enable);
+
+static void trace_remote_apply_cmdline_events(struct trace_remote *remote, char *cmdline)
+{
+	bool tracing_on = false;
+	char *token;
+	int ret;
+
+	while ((token = strsep(&cmdline, ","))) {
+		struct remote_event *evt;
+
+		if (!*token)
+			continue;
+
+		evt = trace_remote_find_event_by_name(remote, token);
+		if (!evt) {
+			pr_warn("trace remote event '%s' not found\n", token);
+			continue;
+		}
+
+		ret = trace_remote_enable_event(remote, evt, true);
+		if (ret)
+			pr_warn("Failed to enable trace remote event '%s' (%d)\n", token, ret);
+		else
+			tracing_on = true;
+	}
+
+	if (tracing_on) {
+		ret = trace_remote_enable_tracing(remote);
+		if (ret)
+			pr_warn("Failed to enable trace remote tracing (%d)\n", ret);
+	}
+}
+
+static void trace_remote_apply_cmdline(const char *name, struct trace_remote *remote)
+{
+	char *cmdline __free(kfree) = NULL;
+	char *events_cmdline = NULL;
+	char *opts_cmdline = NULL;
+	char *curr, *next;
+
+	if (!trace_remote_cmdline[0])
+		return;
+
+	cmdline = kstrdup(trace_remote_cmdline, GFP_KERNEL);
+	if (!cmdline)
+		return;
+
+	next = cmdline;
+	while ((curr = strsep(&next, "\t"))) {
+		char *token = strsep(&curr, ",");
+		char *rname = strsep(&token, "^");
+
+		if (strcmp(rname, name) == 0) {
+			opts_cmdline = token;
+			events_cmdline = curr;
+			break;
+		}
+	}
+
+	guard(mutex)(&remote->lock);
+
+	if (opts_cmdline)
+		trace_remote_apply_cmdline_opts(remote, opts_cmdline);
+	if (events_cmdline)
+		trace_remote_apply_cmdline_events(remote, events_cmdline);
+}
+
 static struct dentry *tracefs_root;
 static DEFINE_MUTEX(tracefs_lock);
 static u64 tracefs_root_count;
@@ -1228,6 +1348,7 @@ int trace_remote_register(const char *name, struct trace_remote_callbacks *cbs,
 		goto err_unregister_events;
 	}
 
+	trace_remote_apply_cmdline(name, remote);
 	no_free_ptr(remote);
 	return 0;
 
@@ -1708,3 +1829,15 @@ static struct remote_event *trace_remote_find_event(struct trace_remote *remote,
 	return bsearch((const void *)(unsigned long)id, remote->events, remote->nr_events,
 		       sizeof(*remote->events), __cmp_events);
 }
+
+static struct remote_event *
+trace_remote_find_event_by_name(struct trace_remote *remote, const char *name)
+{
+	int i;
+
+	for (i = 0; i < remote->nr_events; i++) {
+		if (!strcmp(remote->events[i].name, name))
+			return &remote->events[i];
+	}
+	return NULL;
+}
-- 
2.54.0.1032.g2f8565e1d1-goog


^ permalink raw reply related

* [PATCH 11/13] Documentation: tracing/remotes: Add detailed tracefs layout
From: Vincent Donnefort @ 2026-06-02 17:11 UTC (permalink / raw)
  To: rostedt, mhiramat, mathieu.desnoyers, linux-trace-kernel
  Cc: kernel-team, linux-kernel, Vincent Donnefort
In-Reply-To: <20260602171146.2238998-1-vdonnefort@google.com>

Add a description for each tracefs file available in a trace remote
instance.

Signed-off-by: Vincent Donnefort <vdonnefort@google.com>

diff --git a/Documentation/trace/remotes.rst b/Documentation/trace/remotes.rst
index 1f9d764f69aa..bf63b929271e 100644
--- a/Documentation/trace/remotes.rst
+++ b/Documentation/trace/remotes.rst
@@ -19,8 +19,8 @@ for which the host kernel can see and expose to user space.
 
 Register a remote
 =================
-A remote must provide a set of callbacks `struct trace_remote_callbacks` whom
-description can be found below. Those callbacks allows Tracefs to enable and
+A remote must provide a set of callbacks `struct trace_remote_callbacks` whose
+description can be found below. Those callbacks allow Tracefs to enable and
 disable tracing and events, to load and unload a tracing buffer (a set of
 ring-buffers) and to swap a reader page with the head page, which enables
 consuming reading.
@@ -28,8 +28,63 @@ consuming reading.
 .. kernel-doc:: include/linux/trace_remote.h
 
 Once registered, an instance will appear for this remote in the Tracefs
-directory **remotes/**. Buffers can then be read using the usual Tracefs files
-**trace_pipe** and **trace**.
+directory **remotes/**. The files within this directory allow configuring
+and reading the remote buffer (see `The File System` below).
+
+The File System
+===============
+A remote tracing instance is represented by a directory in Tracefs under
+**remotes/**. The layout and files within it are very similar to standard ftrace
+instances. Inside the remote directory, the following files and directories are
+available:
+
+  tracing_on
+	This file allows enabling or disabling the remote tracing.
+
+  buffer_size_kb
+	This file displays and allows changing the size of the per-CPU ring
+	buffers used by the remote. It also shows if the buffer is **loaded** or
+	**unloaded**. To change the size, the remote buffers must be unloaded
+	first. Remote buffers are automatically unloaded when **tracing_on** is
+	off, no one is reading the buffer (either by accessing **trace_pipe** or
+	when **printk** is on) and no events remain in the buffer.
+
+  trace
+	Display the human-readable content of the remote buffers. Reading this
+	file is non-consuming. Writing to this file clears the ring buffers.
+
+  trace_pipe
+	Similar to **trace** but reading it consumes the events from the ring
+	buffers (consuming read). It blocks if there are no new events.
+
+  printk
+	When enabled, all events from the remote are redirected to the kernel
+	dmesg. This is similar to the **tp_printk** option for in-kernel events.
+	It counts as a reader of the remote buffers and prevents unloading.
+
+  dump_on_oops
+	When enabled, the remote tracing buffer is dumped to the console when a
+	kernel panic occurs.
+
+  per_cpu/
+	This directory contains subdirectories for each possible CPU (e.g.,
+	**cpu0/**, **cpu1/** ...)
+
+  per_cpu/cpuX/trace
+	This is similar to the **trace** file, but it will only display the data
+	specific for the CPU. If written to, it only clears the specific CPU
+	buffer.
+
+  per_cpu/cpuX/trace_pipe
+	This is similar to the **trace_pipe** file, and is a consuming read, but
+	it will only display (and consume) the data specific to the CPU.
+
+  events/
+	This directory contains remote events that can be enabled or disabled.
+
+  events/enable
+	Allows enabling or disabling all the remote events.
+
 
 Declare a remote event
 ======================
-- 
2.54.0.1032.g2f8565e1d1-goog


^ permalink raw reply related

* [PATCH 10/13] tracing/remotes: selftests: Add a test for the dump_on_oops tracefs file
From: Vincent Donnefort @ 2026-06-02 17:11 UTC (permalink / raw)
  To: rostedt, mhiramat, mathieu.desnoyers, linux-trace-kernel
  Cc: kernel-team, linux-kernel, Vincent Donnefort
In-Reply-To: <20260602171146.2238998-1-vdonnefort@google.com>

Exercise the newly introduced dump_on_oops tracefs file that turns on
or off the trace remote buffer dump on system panic.

Signed-off-by: Vincent Donnefort <vdonnefort@google.com>

diff --git a/tools/testing/selftests/ftrace/test.d/remotes/00hypervisor/dump_on_oops.tc b/tools/testing/selftests/ftrace/test.d/remotes/00hypervisor/dump_on_oops.tc
new file mode 100644
index 000000000000..6c3f93ab922d
--- /dev/null
+++ b/tools/testing/selftests/ftrace/test.d/remotes/00hypervisor/dump_on_oops.tc
@@ -0,0 +1,11 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: Test hypervisor trace dump_on_oops
+# requires: remotes/hypervisor/write_event
+
+SOURCE_REMOTE_TEST=1
+. $TEST_DIR/remotes/dump_on_oops.tc
+
+set -e
+setup_remote "hypervisor"
+test_dump_on_oops
diff --git a/tools/testing/selftests/ftrace/test.d/remotes/dump_on_oops.tc b/tools/testing/selftests/ftrace/test.d/remotes/dump_on_oops.tc
new file mode 100644
index 000000000000..4f5f26cb3291
--- /dev/null
+++ b/tools/testing/selftests/ftrace/test.d/remotes/dump_on_oops.tc
@@ -0,0 +1,51 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: Test trace remote dump_on_oops
+# requires: remotes/test
+
+. $TEST_DIR/remotes/functions
+
+test_dump_on_oops()
+{
+    #
+    # Toggle when the buffer is unloaded
+    #
+    echo 1 > dump_on_oops
+    echo 0 > dump_on_oops
+
+    #
+    # Toggle when the buffer is loaded
+    #
+    echo 1 > tracing_on
+    assert_loaded
+
+    echo 1 > dump_on_oops
+    echo 0 > dump_on_oops
+
+    #
+    # Load and unload buffer while dump_on_oops is enabled
+    #
+    echo 0 > tracing_on
+    assert_unloaded
+
+    echo 1 > dump_on_oops
+    echo 1 > tracing_on
+    echo 0 > tracing_on
+
+    # REMOVE ME FOR A PROPER OOPS TEST
+    return
+
+    echo 1 > tracing_on
+
+    for i in $(seq 1 32); do
+        echo $i > write_event
+    done
+
+    echo c > /proc/sysrq-trigger
+}
+
+if [ -z "$SOURCE_REMOTE_TEST" ]; then
+    set -e
+    setup_remote_test
+    test_dump_on_oops
+fi
diff --git a/tools/testing/selftests/ftrace/test.d/remotes/functions b/tools/testing/selftests/ftrace/test.d/remotes/functions
index 8dd9c961977b..887b9296d0a7 100644
--- a/tools/testing/selftests/ftrace/test.d/remotes/functions
+++ b/tools/testing/selftests/ftrace/test.d/remotes/functions
@@ -9,6 +9,7 @@ setup_remote()
 	cd remotes/$name/
 	echo 0 > tracing_on
 	echo 0 > printk
+	echo 0 > dump_on_oops
 	clear_trace
 	echo 7 > buffer_size_kb
 	echo 0 > events/enable
-- 
2.54.0.1032.g2f8565e1d1-goog


^ permalink raw reply related

* [PATCH 09/13] tracing/remotes: Add dump_on_oops tracefs file
From: Vincent Donnefort @ 2026-06-02 17:11 UTC (permalink / raw)
  To: rostedt, mhiramat, mathieu.desnoyers, linux-trace-kernel
  Cc: kernel-team, linux-kernel, Vincent Donnefort
In-Reply-To: <20260602171146.2238998-1-vdonnefort@google.com>

When enabled, dump_on_oops will dump the content of the trace remote
buffer if the system panics.

Signed-off-by: Vincent Donnefort <vdonnefort@google.com>

diff --git a/kernel/trace/trace_remote.c b/kernel/trace/trace_remote.c
index e708dcd7d258..3164f327d4d8 100644
--- a/kernel/trace/trace_remote.c
+++ b/kernel/trace/trace_remote.c
@@ -7,6 +7,7 @@
 #include <linux/kstrtox.h>
 #include <linux/lockdep.h>
 #include <linux/mutex.h>
+#include <linux/panic_notifier.h>
 #include <linux/tracefs.h>
 #include <linux/trace_remote.h>
 #include <linux/trace_seq.h>
@@ -22,6 +23,7 @@ enum tri_type {
 	TRI_CONSUMING,
 	TRI_NONCONSUMING,
 	TRI_PRINTK,
+	TRI_PANIC,
 };
 
 struct trace_remote_iterator {
@@ -58,6 +60,9 @@ struct trace_remote {
 	unsigned int			poll_ms;
 	struct delayed_work		poll_work;
 	unsigned int			poll_cnt;
+	struct notifier_block		panic_notifier;
+	struct trace_remote_iterator	*panic_iter;
+	bool				panic_on;
 	bool				tracing_on;
 };
 
@@ -66,10 +71,15 @@ static bool trace_remote_loaded(struct trace_remote *remote)
 	return !!remote->trace_buffer;
 }
 
+static void trace_remote_unload(struct trace_remote *remote);
+static int trace_remote_panic_load(struct trace_remote *remote);
+static void trace_remote_panic_unload(struct trace_remote *remote);
+
 static int trace_remote_load(struct trace_remote *remote)
 {
 	struct ring_buffer_remote *rb_remote = &remote->rb_remote;
 	struct trace_buffer_desc *desc;
+	int ret = 0;
 
 	lockdep_assert_held(&remote->lock);
 
@@ -84,15 +94,28 @@ static int trace_remote_load(struct trace_remote *remote)
 	rb_remote->swap_reader_page = remote->cbs->swap_reader_page;
 	rb_remote->priv = remote->priv;
 	rb_remote->reset = remote->cbs->reset;
+	remote->trace_buffer_desc = desc;
 	remote->trace_buffer = ring_buffer_alloc_remote(rb_remote);
 	if (!remote->trace_buffer) {
 		remote->cbs->unload_trace_buffer(desc, remote->priv);
 		return -ENOMEM;
 	}
 
-	remote->trace_buffer_desc = desc;
+	if (remote->panic_on) {
+		ret = trace_remote_panic_load(remote);
+		if (ret)
+			trace_remote_unload(remote);
+	}
 
-	return 0;
+	return ret;
+}
+
+static void trace_remote_unload(struct trace_remote *remote)
+{
+	trace_remote_panic_unload(remote);
+	ring_buffer_free(remote->trace_buffer);
+	remote->trace_buffer = NULL;
+	remote->cbs->unload_trace_buffer(remote->trace_buffer_desc, remote->priv);
 }
 
 static void trace_remote_try_unload(struct trace_remote *remote)
@@ -110,9 +133,7 @@ static void trace_remote_try_unload(struct trace_remote *remote)
 	if (!ring_buffer_empty(remote->trace_buffer))
 		return;
 
-	ring_buffer_free(remote->trace_buffer);
-	remote->trace_buffer = NULL;
-	remote->cbs->unload_trace_buffer(remote->trace_buffer_desc, remote->priv);
+	trace_remote_unload(remote);
 }
 
 static int trace_remote_enable_tracing(struct trace_remote *remote)
@@ -375,58 +396,68 @@ static void trace_remote_dec_poll(struct trace_remote *remote)
 static struct trace_remote_iterator
 *trace_remote_iter(struct trace_remote *remote, int cpu, enum tri_type type)
 {
-	struct trace_remote_iterator *iter = NULL;
+	struct trace_remote_iterator *iter __free(kfree) = kzalloc_obj(*iter);
 	int ret;
 
 	lockdep_assert_held(&remote->lock);
 
-	if (type == TRI_NONCONSUMING && !trace_remote_loaded(remote))
-		return NULL;
+	if (!iter)
+		return ERR_PTR(-ENOMEM);
 
-	ret = trace_remote_get(remote, cpu);
-	if (ret)
-		return ERR_PTR(ret);
+	switch (type) {
+	case TRI_NONCONSUMING:
+		if (!trace_remote_loaded(remote))
+			return NULL;
+		fallthrough;
+	case TRI_CONSUMING:
+	case TRI_PRINTK:
+		ret = trace_remote_get(remote, cpu);
+		if (ret)
+			return ERR_PTR(ret);
+		break;
+	case TRI_PANIC:
+		break;
+	}
 
 	if (!trace_remote_has_cpu(remote, cpu)) {
 		ret = -ENODEV;
 		goto err;
 	}
 
-	iter = kzalloc_obj(*iter);
-	if (iter) {
-		iter->remote = remote;
-		iter->cpu = cpu;
-		iter->type = type;
-		trace_seq_init(&iter->seq);
+	iter->remote = remote;
+	iter->cpu = cpu;
+	iter->type = type;
+	trace_seq_init(&iter->seq);
 
-		switch (type) {
-		case TRI_PRINTK:
-			/* only one printk iter allowed */
-			if (WARN_ON_ONCE(remote->printk)) {
-				ret = -EBUSY;
-				break;
-			}
-			smp_store_release(&remote->printk, iter);
-			fallthrough;
-		case TRI_CONSUMING:
-			trace_remote_inc_poll(remote);
-			break;
-		case TRI_NONCONSUMING:
-			ret = __alloc_ring_buffer_iter(iter, cpu);
-			break;
+	switch (type) {
+	case TRI_PRINTK:
+		/* only one printk iter allowed */
+		if (WARN_ON_ONCE(remote->printk)) {
+			ret = -EBUSY;
+			goto err;
 		}
-
+		smp_store_release(&remote->printk, iter);
+		fallthrough;
+	case TRI_CONSUMING:
+		trace_remote_inc_poll(remote);
+		break;
+	case TRI_PANIC:
+	case TRI_NONCONSUMING:
+		ret = __alloc_ring_buffer_iter(iter, cpu);
 		if (ret)
 			goto err;
-
-		return iter;
+		break;
 	}
-	ret = -ENOMEM;
 
-err:
-	kfree(iter);
-	trace_remote_put(remote);
+	return no_free_ptr(iter);
 
+err:
+	switch (type) {
+	case TRI_PANIC:
+		break;
+	default:
+		trace_remote_put(remote);
+	}
 	return ERR_PTR(ret);
 }
 
@@ -449,14 +480,18 @@ static void trace_remote_iter_free(struct trace_remote_iterator *iter)
 		fallthrough;
 	case TRI_CONSUMING:
 		trace_remote_dec_poll(remote);
+		trace_remote_put(remote);
 		break;
 	case TRI_NONCONSUMING:
+		trace_remote_put(remote);
+		__free_ring_buffer_iter(iter, iter->cpu);
+		break;
+	case TRI_PANIC:
 		__free_ring_buffer_iter(iter, iter->cpu);
 		break;
 	}
 
 	kfree(iter);
-	trace_remote_put(remote);
 }
 
 static void trace_remote_iter_read_start(struct trace_remote_iterator *iter)
@@ -527,6 +562,7 @@ __peek_event(struct trace_remote_iterator *iter, int cpu, u64 *ts, unsigned long
 	case TRI_PRINTK:
 	case TRI_CONSUMING:
 		return ring_buffer_peek(iter->remote->trace_buffer, cpu, ts, lost_events);
+	case TRI_PANIC:
 	case TRI_NONCONSUMING:
 		rb_iter = __get_rb_iter(iter, cpu);
 		if (!rb_iter)
@@ -596,6 +632,7 @@ static void trace_remote_iter_move(struct trace_remote_iterator *iter)
 	case TRI_CONSUMING:
 		ring_buffer_consume(trace_buffer, iter->evt_cpu, NULL, NULL);
 		break;
+	case TRI_PANIC:
 	case TRI_NONCONSUMING:
 		ring_buffer_iter_advance(__get_rb_iter(iter, iter->evt_cpu));
 		break;
@@ -910,6 +947,116 @@ static int printk_show(struct seq_file *s, void *unused)
 }
 DEFINE_SHOW_STORE_ATTRIBUTE(printk);
 
+static int trace_remote_panic_handler(struct notifier_block *self, unsigned long ev, void *v)
+{
+	struct trace_remote *remote = container_of(self, struct trace_remote, panic_notifier);
+	struct trace_remote_iterator *iter = smp_load_acquire(&remote->panic_iter);
+	int cpu;
+
+	if (!iter) {
+		pr_warn("Unexpected error: no panic iterator for the trace remote\n");
+		return NOTIFY_DONE;
+	}
+
+	for_each_possible_cpu(cpu) {
+		if (iter->rb_iters[cpu]) {
+			/* No RING_BUFFER_ALL_CPUS to avoid taking cpu_read_lock() */
+			ring_buffer_read_remote_meta_page(remote->trace_buffer, cpu);
+			ring_buffer_iter_reset(iter->rb_iters[cpu]);
+		}
+	}
+
+	while (trace_remote_iter_read_event(iter)) {
+		trace_seq_init(&iter->seq);
+
+		trace_remote_iter_print_event(iter);
+		pr_emerg("%s", iter->seq.buffer);
+
+		trace_remote_iter_move(iter);
+	}
+
+	return NOTIFY_DONE;
+}
+
+static int trace_remote_panic_load(struct trace_remote *remote)
+{
+	struct notifier_block *notifier = &remote->panic_notifier;
+	struct trace_remote_iterator *iter;
+
+	lockdep_assert_held(&remote->lock);
+
+	if (remote->panic_iter)
+		return 0;
+
+	iter = trace_remote_iter(remote, RING_BUFFER_ALL_CPUS, TRI_PANIC);
+	if (IS_ERR(iter))
+		return PTR_ERR(iter);
+
+	smp_store_release(&remote->panic_iter, iter);
+
+	notifier->notifier_call = trace_remote_panic_handler;
+	notifier->priority = INT_MAX - 1;
+	atomic_notifier_chain_register(&panic_notifier_list, notifier);
+
+	return 0;
+}
+
+static void trace_remote_panic_unload(struct trace_remote *remote)
+{
+	struct trace_remote_iterator *iter = remote->panic_iter;
+
+	lockdep_assert_held(&remote->lock);
+
+	if (!iter)
+		return;
+
+	atomic_notifier_chain_unregister(&panic_notifier_list, &remote->panic_notifier);
+	smp_store_release(&remote->panic_iter, NULL);
+	trace_remote_iter_free(iter);
+}
+
+static ssize_t dump_on_oops_write(struct file *filp, const char __user *ubuf,
+				  size_t cnt, loff_t *ppos)
+{
+	struct seq_file *seq = filp->private_data;
+	struct trace_remote *remote = seq->private;
+	bool enable;
+	int ret;
+
+	ret = kstrtobool_from_user(ubuf, cnt, &enable);
+	if (ret)
+		return ret;
+
+	guard(mutex)(&remote->lock);
+
+	if (enable == remote->panic_on)
+		return cnt;
+
+	if (trace_remote_loaded(remote)) {
+		if (enable) {
+			ret = trace_remote_panic_load(remote);
+			if (ret)
+				return ret;
+		} else {
+			trace_remote_panic_unload(remote);
+		}
+	}
+
+	remote->panic_on = enable;
+
+	return cnt;
+}
+
+static int dump_on_oops_show(struct seq_file *s, void *unused)
+{
+	struct trace_remote *remote = s->private;
+
+	seq_printf(s, "%d\n", remote->panic_on);
+
+	return 0;
+}
+DEFINE_SHOW_STORE_ATTRIBUTE(dump_on_oops);
+
 static struct dentry *tracefs_root;
 static DEFINE_MUTEX(tracefs_lock);
 static u64 tracefs_root_count;
@@ -941,6 +1088,11 @@ static int trace_remote_init_tracefs(const char *name, struct trace_remote *remo
 	if (!d)
 		goto err;
 
+	d = trace_create_file("dump_on_oops", TRACEFS_MODE_WRITE, remote_d, remote,
+			      &dump_on_oops_fops);
+	if (!d)
+		goto err;
+
 	d = trace_create_file("buffer_size_kb", TRACEFS_MODE_WRITE, remote_d, remote,
 			      &buffer_size_kb_fops);
 	if (!d)
-- 
2.54.0.1032.g2f8565e1d1-goog


^ permalink raw reply related

* [PATCH 08/13] ring-buffer: Add ring_buffer_read_remote_meta_page()
From: Vincent Donnefort @ 2026-06-02 17:11 UTC (permalink / raw)
  To: rostedt, mhiramat, mathieu.desnoyers, linux-trace-kernel
  Cc: kernel-team, linux-kernel, Vincent Donnefort
In-Reply-To: <20260602171146.2238998-1-vdonnefort@google.com>

In preparation for the introduction of a panic handler for trace
remotes, add a ring_buffer_read_remote_meta_page(). This is basically
similar to ring_buffer_poll_remote, but it doesn't try to wake-up
readers and, in the !RING_BUFFER_ALL_CPUS case, uses panic-friendly
locks.

While at it, update trace_remote_has_cpu() to use this new function
instead of ring_buffer_poll_remote(), avoiding unnecessary wakeups when
verifying if a CPU buffer is active.

Finally, the distracted engineer who wrote that
ring_buffer_poll_remote() forgot to document it. Add a kerneldoc for
that function too.

Signed-off-by: Vincent Donnefort <vdonnefort@google.com>

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index 994f52b34344..6e008a548063 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -298,6 +298,7 @@ struct ring_buffer_remote {
 	void				*priv;
 };
 
+int ring_buffer_read_remote_meta_page(struct trace_buffer *buffer, int cpu);
 int ring_buffer_poll_remote(struct trace_buffer *buffer, int cpu);
 
 struct trace_buffer *
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 7b07d2004cc6..88ac346c65ec 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -6623,6 +6623,59 @@ bool ring_buffer_empty_cpu(struct trace_buffer *buffer, int cpu)
 }
 EXPORT_SYMBOL_GPL(ring_buffer_empty_cpu);
 
+/**
+ * ring_buffer_read_remote_meta_page - read the meta page of a remote ring buffer
+ * @buffer: The ring buffer
+ * @cpu: The CPU buffer to read (or RING_BUFFER_ALL_CPUS)
+ *
+ * Returns:
+ *  0 on success, or -EINVAL if the CPU is not in the buffer's cpumask.
+ */
+int ring_buffer_read_remote_meta_page(struct trace_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+
+	if (cpu != RING_BUFFER_ALL_CPUS) {
+		unsigned long flags;
+		bool dolock;
+
+		if (!cpumask_test_cpu(cpu, buffer->cpumask))
+			return -EINVAL;
+
+		cpu_buffer = buffer->buffers[cpu];
+
+		local_irq_save(flags);
+		dolock = rb_reader_lock(cpu_buffer);
+		rb_read_remote_meta_page(cpu_buffer);
+		rb_reader_unlock(cpu_buffer, dolock);
+		local_irq_restore(flags);
+		return 0;
+	}
+
+	guard(cpus_read_lock)();
+
+	for_each_buffer_cpu(buffer, cpu) {
+		cpu_buffer = buffer->buffers[cpu];
+
+		guard(raw_spinlock)(&cpu_buffer->reader_lock);
+		rb_read_remote_meta_page(cpu_buffer);
+	}
+
+	return 0;
+}
+
+/**
+ * ring_buffer_poll_remote - poll a remote ring buffer for new data
+ * @buffer: The ring buffer
+ * @cpu: The CPU buffer to poll (or RING_BUFFER_ALL_CPUS)
+ *
+ * This function polls the specified remote CPU buffer (or all of them)
+ * by reading its meta page to update the local reader's view. If new
+ * entries are detected, it triggers wakeups for any waiting readers.
+ *
+ * Returns:
+ *  0 on success, or -EINVAL if the CPU is not in the buffer's cpumask.
+ */
 int ring_buffer_poll_remote(struct trace_buffer *buffer, int cpu)
 {
 	struct ring_buffer_per_cpu *cpu_buffer;
diff --git a/kernel/trace/trace_remote.c b/kernel/trace/trace_remote.c
index 1bf0ba159c92..e708dcd7d258 100644
--- a/kernel/trace/trace_remote.c
+++ b/kernel/trace/trace_remote.c
@@ -291,7 +291,7 @@ static bool trace_remote_has_cpu(struct trace_remote *remote, int cpu)
 	if (cpu == RING_BUFFER_ALL_CPUS)
 		return true;
 
-	return ring_buffer_poll_remote(remote->trace_buffer, cpu) == 0;
+	return ring_buffer_read_remote_meta_page(remote->trace_buffer, cpu) == 0;
 }
 
 static void __free_ring_buffer_iter(struct trace_remote_iterator *iter, int cpu)
-- 
2.54.0.1032.g2f8565e1d1-goog


^ permalink raw reply related

* [PATCH 07/13] tracing/remotes: selftests: Prefix hypervisor folder
From: Vincent Donnefort @ 2026-06-02 17:11 UTC (permalink / raw)
  To: rostedt, mhiramat, mathieu.desnoyers, linux-trace-kernel
  Cc: kernel-team, linux-kernel, Vincent Donnefort
In-Reply-To: <20260602171146.2238998-1-vdonnefort@google.com>

Avoid interleaving run tests by prefixing the hypervisor folder.

Signed-off-by: Vincent Donnefort <vdonnefort@google.com>

diff --git a/tools/testing/selftests/ftrace/test.d/remotes/hypervisor/buffer_size.tc b/tools/testing/selftests/ftrace/test.d/remotes/00hypervisor/buffer_size.tc
similarity index 100%
rename from tools/testing/selftests/ftrace/test.d/remotes/hypervisor/buffer_size.tc
rename to tools/testing/selftests/ftrace/test.d/remotes/00hypervisor/buffer_size.tc
diff --git a/tools/testing/selftests/ftrace/test.d/remotes/hypervisor/hotplug.tc b/tools/testing/selftests/ftrace/test.d/remotes/00hypervisor/hotplug.tc
similarity index 100%
rename from tools/testing/selftests/ftrace/test.d/remotes/hypervisor/hotplug.tc
rename to tools/testing/selftests/ftrace/test.d/remotes/00hypervisor/hotplug.tc
diff --git a/tools/testing/selftests/ftrace/test.d/remotes/hypervisor/printk.tc b/tools/testing/selftests/ftrace/test.d/remotes/00hypervisor/printk.tc
similarity index 100%
rename from tools/testing/selftests/ftrace/test.d/remotes/hypervisor/printk.tc
rename to tools/testing/selftests/ftrace/test.d/remotes/00hypervisor/printk.tc
diff --git a/tools/testing/selftests/ftrace/test.d/remotes/hypervisor/reset.tc b/tools/testing/selftests/ftrace/test.d/remotes/00hypervisor/reset.tc
similarity index 100%
rename from tools/testing/selftests/ftrace/test.d/remotes/hypervisor/reset.tc
rename to tools/testing/selftests/ftrace/test.d/remotes/00hypervisor/reset.tc
diff --git a/tools/testing/selftests/ftrace/test.d/remotes/hypervisor/trace.tc b/tools/testing/selftests/ftrace/test.d/remotes/00hypervisor/trace.tc
similarity index 100%
rename from tools/testing/selftests/ftrace/test.d/remotes/hypervisor/trace.tc
rename to tools/testing/selftests/ftrace/test.d/remotes/00hypervisor/trace.tc
diff --git a/tools/testing/selftests/ftrace/test.d/remotes/hypervisor/trace_pipe.tc b/tools/testing/selftests/ftrace/test.d/remotes/00hypervisor/trace_pipe.tc
similarity index 100%
rename from tools/testing/selftests/ftrace/test.d/remotes/hypervisor/trace_pipe.tc
rename to tools/testing/selftests/ftrace/test.d/remotes/00hypervisor/trace_pipe.tc
diff --git a/tools/testing/selftests/ftrace/test.d/remotes/hypervisor/unloading.tc b/tools/testing/selftests/ftrace/test.d/remotes/00hypervisor/unloading.tc
similarity index 100%
rename from tools/testing/selftests/ftrace/test.d/remotes/hypervisor/unloading.tc
rename to tools/testing/selftests/ftrace/test.d/remotes/00hypervisor/unloading.tc
-- 
2.54.0.1032.g2f8565e1d1-goog


^ permalink raw reply

* [PATCH 06/13] tracing/remotes: selftests: Add a test for the printk tracefs file
From: Vincent Donnefort @ 2026-06-02 17:11 UTC (permalink / raw)
  To: rostedt, mhiramat, mathieu.desnoyers, linux-trace-kernel
  Cc: kernel-team, linux-kernel, Vincent Donnefort
In-Reply-To: <20260602171146.2238998-1-vdonnefort@google.com>

Exercise the newly introduced printk tracefs file that turns on and off
the dmesg redirection.

Signed-off-by: Vincent Donnefort <vdonnefort@google.com>

diff --git a/tools/testing/selftests/ftrace/test.d/remotes/functions b/tools/testing/selftests/ftrace/test.d/remotes/functions
index 05224fac3653..8dd9c961977b 100644
--- a/tools/testing/selftests/ftrace/test.d/remotes/functions
+++ b/tools/testing/selftests/ftrace/test.d/remotes/functions
@@ -8,6 +8,7 @@ setup_remote()
 
 	cd remotes/$name/
 	echo 0 > tracing_on
+	echo 0 > printk
 	clear_trace
 	echo 7 > buffer_size_kb
 	echo 0 > events/enable
diff --git a/tools/testing/selftests/ftrace/test.d/remotes/hypervisor/printk.tc b/tools/testing/selftests/ftrace/test.d/remotes/hypervisor/printk.tc
new file mode 100644
index 000000000000..aca7a2bfe293
--- /dev/null
+++ b/tools/testing/selftests/ftrace/test.d/remotes/hypervisor/printk.tc
@@ -0,0 +1,11 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: Test trace remote printk, the dmesg redirection
+# requires: remotes/hypervisor/write_event
+
+SOURCE_REMOTE_TEST=1
+. $TEST_DIR/remotes/printk.tc
+
+set -e
+setup_remote "hypervisor"
+test_printk
diff --git a/tools/testing/selftests/ftrace/test.d/remotes/printk.tc b/tools/testing/selftests/ftrace/test.d/remotes/printk.tc
new file mode 100644
index 000000000000..80eaf13e240d
--- /dev/null
+++ b/tools/testing/selftests/ftrace/test.d/remotes/printk.tc
@@ -0,0 +1,72 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: Test trace remote printk, the dmesg redirection
+# requires: remotes/test
+
+. $TEST_DIR/remotes/functions
+
+test_printk()
+{
+    echo 0 > tracing_on
+    assert_unloaded
+
+    #
+    # Test printk on/off when tracing is disabled
+    #
+    echo 1 > printk
+    test $(cat printk) -eq 1
+    assert_loaded
+
+    echo 0 > printk
+    test $(cat printk) -eq 0
+    assert_unloaded
+
+    #
+    # Test events are logged to dmesg
+    #
+    dmesg -c > /dev/null
+
+    echo 1 > tracing_on
+    assert_loaded
+    echo 1 > printk
+    test $(cat printk) -eq 1
+
+    nr_events=128
+    for i in $(seq 1 $nr_events); do
+        echo $i > write_event
+    done
+
+    sleep 1
+    output=$(mktemp $TMPDIR/remote_test.XXXXXX)
+    dmesg | grep "selftest id=" | sed 's/^[^]]*] //'> $output
+
+    check_trace 1 $nr_events $output
+
+    rm $output
+
+    #
+    # Disable printk and Test events were not consumed by dmesg
+    #
+    echo 0 > printk
+    test $(cat printk) -eq 0
+
+    start_id=$(($nr_events + 1))
+    end_id=$(($start_id + $nr_events))
+
+    for i in $(seq $start_id $end_id); do
+        echo $i > write_event
+    done
+
+    sleep 1
+
+    output=$(dump_trace_pipe)
+    check_trace $start_id $end_id $output
+    rm $output
+}
+
+if [ -z "$SOURCE_REMOTE_TEST" ]; then
+    set -e
+
+    setup_remote_test
+    test_printk
+fi
-- 
2.54.0.1032.g2f8565e1d1-goog


^ permalink raw reply related

* [PATCH 05/13] tracing/remotes: Add printk tracefs file
From: Vincent Donnefort @ 2026-06-02 17:11 UTC (permalink / raw)
  To: rostedt, mhiramat, mathieu.desnoyers, linux-trace-kernel
  Cc: kernel-team, linux-kernel, Vincent Donnefort
In-Reply-To: <20260602171146.2238998-1-vdonnefort@google.com>

When enabled, the printk tracefs file enables the redirection of all
events to dmesg. This is similar to tp_printk.

Signed-off-by: Vincent Donnefort <vdonnefort@google.com>

diff --git a/kernel/trace/trace_remote.c b/kernel/trace/trace_remote.c
index 21583fae1bd9..1bf0ba159c92 100644
--- a/kernel/trace/trace_remote.c
+++ b/kernel/trace/trace_remote.c
@@ -21,6 +21,7 @@
 enum tri_type {
 	TRI_CONSUMING,
 	TRI_NONCONSUMING,
+	TRI_PRINTK,
 };
 
 struct trace_remote_iterator {
@@ -42,6 +43,7 @@ struct trace_remote {
 	void				*priv;
 	struct trace_buffer		*trace_buffer;
 	struct trace_buffer_desc	*trace_buffer_desc;
+	struct trace_remote_iterator	*printk;
 	struct dentry			*dentry;
 	struct eventfs_inode		*eventfs_root;
 	struct eventfs_inode		*eventfs_subdir;
@@ -335,6 +337,8 @@ static int __alloc_ring_buffer_iter(struct trace_remote_iterator *iter, int cpu)
 	return 0;
 }
 
+static void trace_remote_do_printk(struct trace_remote *remote);
+
 static void __poll_remote(struct work_struct *work)
 {
 	struct delayed_work *dwork = to_delayed_work(work);
@@ -342,6 +346,7 @@ static void __poll_remote(struct work_struct *work)
 
 	remote = container_of(dwork, struct trace_remote, poll_work);
 	ring_buffer_poll_remote(remote->trace_buffer, RING_BUFFER_ALL_CPUS);
+	trace_remote_do_printk(remote);
 
 	schedule_delayed_work(dwork, msecs_to_jiffies(remote->poll_ms));
 }
@@ -351,6 +356,8 @@ static void trace_remote_inc_poll(struct trace_remote *remote)
 	/* poll_cnt <= nr_readers, inherits its overflow protection */
 	if (!remote->poll_cnt++) {
 		ring_buffer_poll_remote(remote->trace_buffer, RING_BUFFER_ALL_CPUS);
+		trace_remote_do_printk(remote);
+
 		schedule_delayed_work(&remote->poll_work, msecs_to_jiffies(remote->poll_ms));
 	}
 }
@@ -393,6 +400,14 @@ static struct trace_remote_iterator
 		trace_seq_init(&iter->seq);
 
 		switch (type) {
+		case TRI_PRINTK:
+			/* only one printk iter allowed */
+			if (WARN_ON_ONCE(remote->printk)) {
+				ret = -EBUSY;
+				break;
+			}
+			smp_store_release(&remote->printk, iter);
+			fallthrough;
 		case TRI_CONSUMING:
 			trace_remote_inc_poll(remote);
 			break;
@@ -427,6 +442,11 @@ static void trace_remote_iter_free(struct trace_remote_iterator *iter)
 	lockdep_assert_held(&remote->lock);
 
 	switch (iter->type) {
+	case TRI_PRINTK:
+		WARN_ON_ONCE(remote->printk != iter);
+		smp_store_release(&remote->printk, NULL);
+		flush_delayed_work(&remote->poll_work);
+		fallthrough;
 	case TRI_CONSUMING:
 		trace_remote_dec_poll(remote);
 		break;
@@ -504,6 +524,7 @@ __peek_event(struct trace_remote_iterator *iter, int cpu, u64 *ts, unsigned long
 	struct ring_buffer_iter *rb_iter;
 
 	switch (iter->type) {
+	case TRI_PRINTK:
 	case TRI_CONSUMING:
 		return ring_buffer_peek(iter->remote->trace_buffer, cpu, ts, lost_events);
 	case TRI_NONCONSUMING:
@@ -571,6 +592,7 @@ static void trace_remote_iter_move(struct trace_remote_iterator *iter)
 	struct trace_buffer *trace_buffer = iter->remote->trace_buffer;
 
 	switch (iter->type) {
+	case TRI_PRINTK:
 	case TRI_CONSUMING:
 		ring_buffer_consume(trace_buffer, iter->evt_cpu, NULL, NULL);
 		break;
@@ -814,6 +836,80 @@ static const struct file_operations trace_fops = {
 	.release	= trace_release,
 };
 
+static void trace_remote_do_printk(struct trace_remote *remote)
+{
+	struct trace_remote_iterator *iter = smp_load_acquire(&remote->printk);
+
+	if (!iter)
+		return;
+
+	trace_remote_iter_read_start(iter);
+
+	while (trace_remote_iter_read_event(iter)) {
+		trace_seq_init(&iter->seq);
+
+		trace_remote_iter_print_event(iter);
+		if (!pr_emerg("%s", iter->seq.buffer))
+			break;
+
+		trace_remote_iter_move(iter);
+	}
+
+	trace_remote_iter_read_finished(iter);
+}
+
+static int trace_remote_enable_printk(struct trace_remote *remote, bool enable)
+{
+	struct trace_remote_iterator *iter = remote->printk;
+
+	lockdep_assert_held(&remote->lock);
+
+	if (enable == !!iter)
+		return 0;
+
+	if (enable) {
+		iter = trace_remote_iter(remote, RING_BUFFER_ALL_CPUS, TRI_PRINTK);
+		if (IS_ERR(iter))
+			return PTR_ERR(iter);
+	} else {
+		trace_remote_iter_free(remote->printk);
+		/* trace_remote_iter_free has reset remote->printk */
+	}
+
+	return 0;
+}
+
+static ssize_t
+printk_write(struct file *filp, const char __user *ubuf, size_t cnt, loff_t *ppos)
+{
+	struct seq_file *seq = filp->private_data;
+	struct trace_remote *remote = seq->private;
+	bool val;
+	int ret;
+
+	ret = kstrtobool_from_user(ubuf, cnt, &val);
+	if (ret)
+		return ret;
+
+	guard(mutex)(&remote->lock);
+
+	ret = trace_remote_enable_printk(remote, val);
+	if (ret)
+		return ret;
+
+	return cnt;
+}
+
+static int printk_show(struct seq_file *s, void *unused)
+{
+	struct trace_remote *remote = s->private;
+
+	seq_printf(s, "%d\n", !!remote->printk);
+
+	return 0;
+}
+DEFINE_SHOW_STORE_ATTRIBUTE(printk);
+
 static struct dentry *tracefs_root;
 static DEFINE_MUTEX(tracefs_lock);
 static u64 tracefs_root_count;
@@ -858,6 +954,10 @@ static int trace_remote_init_tracefs(const char *name, struct trace_remote *remo
 	if (!d)
 		goto err;
 
+	d = trace_create_file("printk", TRACEFS_MODE_WRITE, remote_d, remote, &printk_fops);
+	if (!d)
+		goto err;
+
 	percpu_d = tracefs_create_dir("per_cpu", remote_d);
 	if (!percpu_d) {
 		pr_err("Failed to create tracefs dir "TRACEFS_DIR"%s/per_cpu/\n", name);
-- 
2.54.0.1032.g2f8565e1d1-goog


^ permalink raw reply related

* [PATCH 04/13] tracing/simple_ring_buffer: Add support for compressed length
From: Vincent Donnefort @ 2026-06-02 17:11 UTC (permalink / raw)
  To: rostedt, mhiramat, mathieu.desnoyers, linux-trace-kernel
  Cc: kernel-team, linux-kernel, Vincent Donnefort
In-Reply-To: <20260602171146.2238998-1-vdonnefort@google.com>

The array length is the total size in bytes of the data for the current
event. It is possible to compress this value into the event header type,
which has 28 unused types, saving 32 bits for sufficiently small events.

The compressed length is expressed as a multiple of the ring-buffer
alignment, 4-bytes by default. Enforces this alignment.

Signed-off-by: Vincent Donnefort <vdonnefort@google.com>

diff --git a/kernel/trace/simple_ring_buffer.c b/kernel/trace/simple_ring_buffer.c
index f4642f5adda3..6be4aa19adca 100644
--- a/kernel/trace/simple_ring_buffer.c
+++ b/kernel/trace/simple_ring_buffer.c
@@ -207,7 +207,12 @@ static unsigned long rb_event_size(unsigned long length)
 {
 	struct ring_buffer_event *event;
 
-	return length + RB_EVNT_HDR_SIZE + sizeof(event->array[0]);
+	length = ALIGN(length, RB_ALIGNMENT);
+
+	if (length > RB_MAX_SMALL_DATA)
+		length += sizeof(event->array[0]);
+
+	return length + RB_EVNT_HDR_SIZE;
 }
 
 static struct ring_buffer_event *
@@ -223,12 +228,15 @@ rb_event_add_ts_extend(struct ring_buffer_event *event, u64 delta)
 static struct ring_buffer_event *
 simple_rb_reserve_next(struct simple_rb_per_cpu *cpu_buffer, unsigned long length, u64 timestamp)
 {
-	unsigned long ts_ext_size = 0, event_size = rb_event_size(length);
 	struct simple_buffer_page *tail = cpu_buffer->tail_page;
+	unsigned long event_size, array_size, ts_ext_size = 0;
 	struct ring_buffer_event *event;
 	u32 write, prev_write;
 	u64 time_delta;
 
+	event_size = rb_event_size(length);
+	array_size = event_size - RB_EVNT_HDR_SIZE;
+
 	time_delta = timestamp - cpu_buffer->write_stamp;
 
 	if (test_time_stamp(time_delta))
@@ -259,9 +267,13 @@ simple_rb_reserve_next(struct simple_rb_per_cpu *cpu_buffer, unsigned long lengt
 		time_delta = 0;
 	}
 
-	event->type_len = 0;
+	if (length > RB_MAX_SMALL_DATA) {
+		event->type_len = 0;
+		event->array[0] = array_size;
+	} else {
+		event->type_len = DIV_ROUND_UP(array_size, RB_ALIGNMENT);
+	}
 	event->time_delta = time_delta;
-	event->array[0] = event_size - RB_EVNT_HDR_SIZE;
 
 	return event;
 }
@@ -284,7 +296,7 @@ void *simple_ring_buffer_reserve(struct simple_rb_per_cpu *cpu_buffer, unsigned
 
 	rb_event = simple_rb_reserve_next(cpu_buffer, length, timestamp);
 
-	return &rb_event->array[1];
+	return rb_event->type_len ? &rb_event->array[0] : &rb_event->array[1];
 }
 EXPORT_SYMBOL_GPL(simple_ring_buffer_reserve);
 
-- 
2.54.0.1032.g2f8565e1d1-goog


^ permalink raw reply related

* [PATCH 03/13] tracing/remotes: Use a single per-remote polling work
From: Vincent Donnefort @ 2026-06-02 17:11 UTC (permalink / raw)
  To: rostedt, mhiramat, mathieu.desnoyers, linux-trace-kernel
  Cc: kernel-team, linux-kernel, Vincent Donnefort
In-Reply-To: <20260602171146.2238998-1-vdonnefort@google.com>

Having a per-iterator polling work is wasteful when logging several
trace_remote per_cpu/trace_pipe files in parallel. This result in one
work running per-CPU, where only one would suffice.

Transition to a single per-remote polling work, scheduled on the first
consumer creation and stopped when the last consuming iterator is freed.

This blanket polls all CPUs, regardless of which ones are actually being
read. This is acceptable because the poll consists of reading the
meta-page, which is a fast operation. Also, it is more common to log all
CPUs in the system than only one, so this use-case should be favoured.

Signed-off-by: Vincent Donnefort <vdonnefort@google.com>

diff --git a/kernel/trace/trace_remote.c b/kernel/trace/trace_remote.c
index 7fed18f28fa7..21583fae1bd9 100644
--- a/kernel/trace/trace_remote.c
+++ b/kernel/trace/trace_remote.c
@@ -26,7 +26,6 @@ enum tri_type {
 struct trace_remote_iterator {
 	struct trace_remote		*remote;
 	struct trace_seq		seq;
-	struct delayed_work		poll_work;
 	unsigned long			lost_events;
 	u64				ts;
 	struct ring_buffer_iter		*rb_iter;
@@ -55,6 +54,8 @@ struct trace_remote {
 	struct rw_semaphore		*pcpu_reader_locks;
 	unsigned int			nr_readers;
 	unsigned int			poll_ms;
+	struct delayed_work		poll_work;
+	unsigned int			poll_cnt;
 	bool				tracing_on;
 };
 
@@ -291,17 +292,6 @@ static bool trace_remote_has_cpu(struct trace_remote *remote, int cpu)
 	return ring_buffer_poll_remote(remote->trace_buffer, cpu) == 0;
 }
 
-static void __poll_remote(struct work_struct *work)
-{
-	struct delayed_work *dwork = to_delayed_work(work);
-	struct trace_remote_iterator *iter;
-
-	iter = container_of(dwork, struct trace_remote_iterator, poll_work);
-	ring_buffer_poll_remote(iter->remote->trace_buffer, iter->cpu);
-	schedule_delayed_work((struct delayed_work *)work,
-			      msecs_to_jiffies(iter->remote->poll_ms));
-}
-
 static void __free_ring_buffer_iter(struct trace_remote_iterator *iter, int cpu)
 {
 	if (cpu != RING_BUFFER_ALL_CPUS) {
@@ -345,6 +335,36 @@ static int __alloc_ring_buffer_iter(struct trace_remote_iterator *iter, int cpu)
 	return 0;
 }
 
+static void __poll_remote(struct work_struct *work)
+{
+	struct delayed_work *dwork = to_delayed_work(work);
+	struct trace_remote *remote;
+
+	remote = container_of(dwork, struct trace_remote, poll_work);
+	ring_buffer_poll_remote(remote->trace_buffer, RING_BUFFER_ALL_CPUS);
+
+	schedule_delayed_work(dwork, msecs_to_jiffies(remote->poll_ms));
+}
+
+static void trace_remote_inc_poll(struct trace_remote *remote)
+{
+	/* poll_cnt <= nr_readers, inherits its overflow protection */
+	if (!remote->poll_cnt++) {
+		ring_buffer_poll_remote(remote->trace_buffer, RING_BUFFER_ALL_CPUS);
+		schedule_delayed_work(&remote->poll_work, msecs_to_jiffies(remote->poll_ms));
+	}
+}
+
+static void trace_remote_dec_poll(struct trace_remote *remote)
+{
+	if (WARN_ON_ONCE(!remote->poll_cnt))
+		return;
+
+	remote->poll_cnt--;
+	if (!remote->poll_cnt)
+		cancel_delayed_work_sync(&remote->poll_work);
+}
+
 static struct trace_remote_iterator
 *trace_remote_iter(struct trace_remote *remote, int cpu, enum tri_type type)
 {
@@ -374,9 +394,7 @@ static struct trace_remote_iterator
 
 		switch (type) {
 		case TRI_CONSUMING:
-			ring_buffer_poll_remote(remote->trace_buffer, cpu);
-			INIT_DELAYED_WORK(&iter->poll_work, __poll_remote);
-			schedule_delayed_work(&iter->poll_work, msecs_to_jiffies(remote->poll_ms));
+			trace_remote_inc_poll(remote);
 			break;
 		case TRI_NONCONSUMING:
 			ret = __alloc_ring_buffer_iter(iter, cpu);
@@ -410,7 +428,7 @@ static void trace_remote_iter_free(struct trace_remote_iterator *iter)
 
 	switch (iter->type) {
 	case TRI_CONSUMING:
-		cancel_delayed_work_sync(&iter->poll_work);
+		trace_remote_dec_poll(remote);
 		break;
 	case TRI_NONCONSUMING:
 		__free_ring_buffer_iter(iter, iter->cpu);
@@ -939,6 +957,7 @@ int trace_remote_register(const char *name, struct trace_remote_callbacks *cbs,
 	remote->poll_ms = 100;
 	mutex_init(&remote->lock);
 	init_rwsem(&remote->reader_lock);
+	INIT_DELAYED_WORK(&remote->poll_work, __poll_remote);
 
 	ret = trace_remote_init_tracefs(name, remote);
 	if (ret)
-- 
2.54.0.1032.g2f8565e1d1-goog


^ permalink raw reply related

* [PATCH 02/13] tracing/remotes: Use kstrtobool for boolean tracefs files
From: Vincent Donnefort @ 2026-06-02 17:11 UTC (permalink / raw)
  To: rostedt, mhiramat, mathieu.desnoyers, linux-trace-kernel
  Cc: kernel-team, linux-kernel, Vincent Donnefort
In-Reply-To: <20260602171146.2238998-1-vdonnefort@google.com>

Use kstrtobool in trace_remote.c where possible. This is more user-friendly
as it allows a better variety of input strings.

Signed-off-by: Vincent Donnefort <vdonnefort@google.com>

diff --git a/kernel/trace/trace_remote.c b/kernel/trace/trace_remote.c
index cfe84e9b8fe6..7fed18f28fa7 100644
--- a/kernel/trace/trace_remote.c
+++ b/kernel/trace/trace_remote.c
@@ -176,10 +176,10 @@ tracing_on_write(struct file *filp, const char __user *ubuf, size_t cnt, loff_t
 {
 	struct seq_file *seq = filp->private_data;
 	struct trace_remote *remote = seq->private;
-	unsigned long val;
+	bool val;
 	int ret;
 
-	ret = kstrtoul_from_user(ubuf, cnt, 10, &val);
+	ret = kstrtobool_from_user(ubuf, cnt, &val);
 	if (ret)
 		return ret;
 
@@ -1090,10 +1090,10 @@ static ssize_t remote_event_enable_write(struct file *filp, const char __user *u
 	struct seq_file *seq = filp->private_data;
 	struct remote_event *evt = seq->private;
 	struct trace_remote *remote = evt->remote;
-	u8 enable;
+	bool enable;
 	int ret;
 
-	ret = kstrtou8_from_user(ubuf, count, 10, &enable);
+	ret = kstrtobool_from_user(ubuf, count, &enable);
 	if (ret)
 		return ret;
 
@@ -1174,10 +1174,10 @@ static ssize_t remote_events_dir_enable_write(struct file *filp, const char __us
 					      size_t count, loff_t *ppos)
 {
 	struct trace_remote *remote = file_inode(filp)->i_private;
+	bool enable;
 	int i, ret;
-	u8 enable;
 
-	ret = kstrtou8_from_user(ubuf, count, 10, &enable);
+	ret = kstrtobool_from_user(ubuf, count, &enable);
 	if (ret)
 		return ret;
 
-- 
2.54.0.1032.g2f8565e1d1-goog


^ permalink raw reply related

* [PATCH 01/13] tracing/remotes: Release tracefs,eventfs on registration failure
From: Vincent Donnefort @ 2026-06-02 17:11 UTC (permalink / raw)
  To: rostedt, mhiramat, mathieu.desnoyers, linux-trace-kernel
  Cc: kernel-team, linux-kernel, Vincent Donnefort
In-Reply-To: <20260602171146.2238998-1-vdonnefort@google.com>

In trace_remote_register(), if registration of events or the init
callback fails, the created tracefs and eventfs directories are leaked.

Release the entire eventfs and tracefs hierarchy on trace_remote
registration.

Signed-off-by: Vincent Donnefort <vdonnefort@google.com>

diff --git a/kernel/trace/trace_remote.c b/kernel/trace/trace_remote.c
index d6c3f94d67cd..cfe84e9b8fe6 100644
--- a/kernel/trace/trace_remote.c
+++ b/kernel/trace/trace_remote.c
@@ -44,7 +44,8 @@ struct trace_remote {
 	struct trace_buffer		*trace_buffer;
 	struct trace_buffer_desc	*trace_buffer_desc;
 	struct dentry			*dentry;
-	struct eventfs_inode		*eventfs;
+	struct eventfs_inode		*eventfs_root;
+	struct eventfs_inode		*eventfs_subdir;
 	struct remote_event		*events;
 	unsigned long			nr_events;
 	unsigned long			trace_buffer_size;
@@ -795,26 +796,28 @@ static const struct file_operations trace_fops = {
 	.release	= trace_release,
 };
 
+static struct dentry *tracefs_root;
+static DEFINE_MUTEX(tracefs_lock);
+static u64 tracefs_root_count;
+
 static int trace_remote_init_tracefs(const char *name, struct trace_remote *remote)
 {
 	struct dentry *remote_d, *percpu_d, *d;
-	static struct dentry *root;
-	static DEFINE_MUTEX(lock);
 	bool root_inited = false;
 	int cpu;
 
-	guard(mutex)(&lock);
+	guard(mutex)(&tracefs_lock);
 
-	if (!root) {
-		root = tracefs_create_dir(TRACEFS_DIR, NULL);
-		if (!root) {
+	if (!tracefs_root) {
+		tracefs_root = tracefs_create_dir(TRACEFS_DIR, NULL);
+		if (!tracefs_root) {
 			pr_err("Failed to create tracefs dir "TRACEFS_DIR"\n");
 			return -ENOMEM;
 		}
 		root_inited = true;
 	}
 
-	remote_d = tracefs_create_dir(name, root);
+	remote_d = tracefs_create_dir(name, tracefs_root);
 	if (!remote_d) {
 		pr_err("Failed to create tracefs dir "TRACEFS_DIR"%s/\n", name);
 		goto err;
@@ -866,14 +869,15 @@ static int trace_remote_init_tracefs(const char *name, struct trace_remote *remo
 			goto err;
 	}
 
+	tracefs_root_count++;
 	remote->dentry = remote_d;
 
 	return 0;
 
 err:
 	if (root_inited) {
-		tracefs_remove(root);
-		root = NULL;
+		tracefs_remove(tracefs_root);
+		tracefs_root = NULL;
 	} else {
 		tracefs_remove(remote_d);
 	}
@@ -881,8 +885,26 @@ static int trace_remote_init_tracefs(const char *name, struct trace_remote *remo
 	return -ENOMEM;
 }
 
+static void trace_remote_remove_tracefs(struct trace_remote *remote)
+{
+	guard(mutex)(&tracefs_lock);
+
+	if (!remote->dentry)
+		return;
+
+	tracefs_remove(remote->dentry);
+	remote->dentry = NULL;
+
+	tracefs_root_count--;
+	if (!tracefs_root_count) {
+		tracefs_remove(tracefs_root);
+		tracefs_root = NULL;
+	}
+}
+
 static int trace_remote_register_events(const char *remote_name, struct trace_remote *remote,
 					struct remote_event *events, size_t nr_events);
+static void trace_remote_unregister_events(struct trace_remote *remote);
 
 /**
  * trace_remote_register() - Register a Tracefs remote
@@ -905,10 +927,9 @@ static int trace_remote_register_events(const char *remote_name, struct trace_re
 int trace_remote_register(const char *name, struct trace_remote_callbacks *cbs, void *priv,
 			  struct remote_event *events, size_t nr_events)
 {
-	struct trace_remote *remote;
+	struct trace_remote *remote __free(kfree) = kzalloc_obj(*remote);
 	int ret;
 
-	remote = kzalloc_obj(*remote);
 	if (!remote)
 		return -ENOMEM;
 
@@ -919,22 +940,30 @@ int trace_remote_register(const char *name, struct trace_remote_callbacks *cbs,
 	mutex_init(&remote->lock);
 	init_rwsem(&remote->reader_lock);
 
-	if (trace_remote_init_tracefs(name, remote)) {
-		kfree(remote);
-		return -ENOMEM;
-	}
+	ret = trace_remote_init_tracefs(name, remote);
+	if (ret)
+		return ret;
 
 	ret = trace_remote_register_events(name, remote, events, nr_events);
 	if (ret) {
 		pr_err("Failed to register events for trace remote '%s' (%d)\n",
 		       name, ret);
-		return ret;
+		goto err_remove_tracefs;
 	}
 
 	ret = cbs->init ? cbs->init(remote->dentry, priv) : 0;
-	if (ret)
+	if (ret) {
 		pr_err("Init failed for trace remote '%s' (%d)\n", name, ret);
+		goto err_unregister_events;
+	}
+
+	no_free_ptr(remote);
+	return 0;
 
+err_unregister_events:
+	trace_remote_unregister_events(remote);
+err_remove_tracefs:
+	trace_remote_remove_tracefs(remote);
 	return ret;
 }
 EXPORT_SYMBOL_GPL(trace_remote_register);
@@ -1267,7 +1296,6 @@ static int remote_events_dir_callback(const char *name, umode_t *mode, void **da
 static int trace_remote_init_eventfs(const char *remote_name, struct trace_remote *remote,
 				     struct remote_event *evt)
 {
-	struct eventfs_inode *eventfs = remote->eventfs;
 	static struct eventfs_entry dir_entries[] = {
 		{
 			.name		= "enable",
@@ -1292,35 +1320,37 @@ static int trace_remote_init_eventfs(const char *remote_name, struct trace_remot
 			.callback	= remote_event_callback,
 		}
 	};
-	bool eventfs_create = false;
+	struct eventfs_inode *eventfs_root, *eventfs_subdir, *e;
 
-	if (!eventfs) {
-		eventfs = eventfs_create_events_dir("events", remote->dentry, dir_entries,
-						    ARRAY_SIZE(dir_entries), remote);
-		if (IS_ERR(eventfs))
-			return PTR_ERR(eventfs);
+	eventfs_root = remote->eventfs_root;
+	eventfs_subdir = remote->eventfs_subdir;
+	if (!eventfs_root) {
+		eventfs_root = eventfs_create_events_dir("events", remote->dentry, dir_entries,
+							 ARRAY_SIZE(dir_entries), remote);
+		if (IS_ERR(eventfs_root))
+			return PTR_ERR(eventfs_root);
 
 		/*
 		 * Create similar hierarchy as local events even if a single system is supported at
 		 * the moment
 		 */
-		eventfs = eventfs_create_dir(remote_name, eventfs, NULL, 0, NULL);
-		if (IS_ERR(eventfs))
-			return PTR_ERR(eventfs);
-
-		remote->eventfs = eventfs;
-		eventfs_create = true;
+		eventfs_subdir = eventfs_create_dir(remote_name, eventfs_root, NULL, 0, NULL);
+		if (IS_ERR(eventfs_subdir)) {
+			eventfs_remove_events_dir(eventfs_root);
+			return PTR_ERR(eventfs_subdir);
+		}
 	}
 
-	eventfs = eventfs_create_dir(evt->name, eventfs, entries, ARRAY_SIZE(entries), evt);
-	if (IS_ERR(eventfs)) {
-		if (eventfs_create) {
-			eventfs_remove_events_dir(remote->eventfs);
-			remote->eventfs = NULL;
-		}
-		return PTR_ERR(eventfs);
+	e = eventfs_create_dir(evt->name, eventfs_subdir, entries, ARRAY_SIZE(entries), evt);
+	if (IS_ERR(e)) {
+		if (!remote->eventfs_root)
+			eventfs_remove_events_dir(eventfs_root);
+		return PTR_ERR(e);
 	}
 
+	remote->eventfs_root = eventfs_root;
+	remote->eventfs_subdir = eventfs_subdir;
+
 	return 0;
 }
 
@@ -1335,11 +1365,11 @@ static int trace_remote_attach_events(struct trace_remote *remote, struct remote
 		if (evt->remote)
 			return -EEXIST;
 
-		evt->remote = remote;
-
 		/* We need events to be sorted for efficient lookup */
 		if (i && evt->id <= events[i - 1].id)
 			return -EINVAL;
+
+		evt->remote = remote;
 	}
 
 	remote->events = events;
@@ -1348,14 +1378,33 @@ static int trace_remote_attach_events(struct trace_remote *remote, struct remote
 	return 0;
 }
 
+static void trace_remote_detach_events(struct trace_remote *remote, struct remote_event *events,
+					size_t nr_events)
+{
+	int i;
+
+	for (i = 0; i < nr_events; i++) {
+		struct remote_event *evt = &events[i];
+
+		if (evt->remote == remote)
+			evt->remote = NULL;
+	}
+
+	remote->events = NULL;
+	remote->nr_events = 0;
+}
+
 static int trace_remote_register_events(const char *remote_name, struct trace_remote *remote,
 					struct remote_event *events, size_t nr_events)
 {
 	int i, ret;
 
 	ret = trace_remote_attach_events(remote, events, nr_events);
-	if (ret)
+	if (ret) {
+		/* It is safe to call detach on a half-registered array */
+		trace_remote_detach_events(remote, events, nr_events);
 		return ret;
+	}
 
 	for (i = 0; i < nr_events; i++) {
 		struct remote_event *evt = &events[i];
@@ -1369,6 +1418,12 @@ static int trace_remote_register_events(const char *remote_name, struct trace_re
 	return 0;
 }
 
+static void trace_remote_unregister_events(struct trace_remote *remote)
+{
+	trace_remote_detach_events(remote, remote->events, remote->nr_events);
+	eventfs_remove_events_dir(remote->eventfs_root);
+}
+
 static int __cmp_events(const void *key, const void *data)
 {
 	const struct remote_event *evt = data;
-- 
2.54.0.1032.g2f8565e1d1-goog


^ permalink raw reply related

* [PATCH 00/13] tracing/remotes: Add printk, dump_on_oops and boot parameters
From: Vincent Donnefort @ 2026-06-02 17:11 UTC (permalink / raw)
  To: rostedt, mhiramat, mathieu.desnoyers, linux-trace-kernel
  Cc: kernel-team, linux-kernel, Vincent Donnefort

This series extends the recently introduced trace remotes
infrastructure, bringing useful features for developers:

  * dump_on_oops: Dump the trace remote buffer on system panic.
  * printk: Redirect remote events to dmesg.
  * trace_remote=: Configure a trace_remote from the commandline.

It also brings a couple of optimisations:

  * Header compressed length support for small events.
  * Single work thread for remote polling.

And some misc improvements:

  * Use kstrtobool where possible
  * Fix trace remote unregistering

I didn't put a "Fixes:" tag on the commit which is fixing the remote
registration because I do not believe this is something any user can
trigger.

Vincent Donnefort (13):
  tracing/remotes: Release tracefs,eventfs on registration failure
  tracing/remotes: Use kstrtobool for boolean tracefs files
  tracing/remotes: Use a single per-remote polling work
  tracing/simple_ring_buffer: Add support for compressed length
  tracing/remotes: Add printk tracefs file
  tracing/remotes: selftests: Add a test for the printk tracefs file
  tracing/remotes: selftests: Prefix hypervisor folder
  ring-buffer: Add ring_buffer_read_remote_meta_page()
  tracing/remotes: Add dump_on_oops tracefs file
  tracing/remotes: selftests: Add a test for the dump_on_oops tracefs
    file
  Documentation: tracing/remotes: Add detailed tracefs layout
  tracing/remotes: Add trace_remote cmdline options
  Documentation/kernel-parameters: Add trace_remote

 .../admin-guide/kernel-parameters.txt         |  16 +
 Documentation/trace/remotes.rst               |  63 +-
 include/linux/ring_buffer.h                   |   1 +
 kernel/trace/ring_buffer.c                    |  53 ++
 kernel/trace/simple_ring_buffer.c             |  22 +-
 kernel/trace/trace_remote.c                   | 649 +++++++++++++++---
 .../buffer_size.tc                            |   0
 .../remotes/00hypervisor/dump_on_oops.tc      |  11 +
 .../{hypervisor => 00hypervisor}/hotplug.tc   |   0
 .../test.d/remotes/00hypervisor/printk.tc     |  11 +
 .../{hypervisor => 00hypervisor}/reset.tc     |   0
 .../{hypervisor => 00hypervisor}/trace.tc     |   0
 .../trace_pipe.tc                             |   0
 .../{hypervisor => 00hypervisor}/unloading.tc |   0
 .../ftrace/test.d/remotes/dump_on_oops.tc     |  51 ++
 .../selftests/ftrace/test.d/remotes/functions |   2 +
 .../selftests/ftrace/test.d/remotes/printk.tc |  72 ++
 17 files changed, 847 insertions(+), 104 deletions(-)
 rename tools/testing/selftests/ftrace/test.d/remotes/{hypervisor => 00hypervisor}/buffer_size.tc (100%)
 create mode 100644 tools/testing/selftests/ftrace/test.d/remotes/00hypervisor/dump_on_oops.tc
 rename tools/testing/selftests/ftrace/test.d/remotes/{hypervisor => 00hypervisor}/hotplug.tc (100%)
 create mode 100644 tools/testing/selftests/ftrace/test.d/remotes/00hypervisor/printk.tc
 rename tools/testing/selftests/ftrace/test.d/remotes/{hypervisor => 00hypervisor}/reset.tc (100%)
 rename tools/testing/selftests/ftrace/test.d/remotes/{hypervisor => 00hypervisor}/trace.tc (100%)
 rename tools/testing/selftests/ftrace/test.d/remotes/{hypervisor => 00hypervisor}/trace_pipe.tc (100%)
 rename tools/testing/selftests/ftrace/test.d/remotes/{hypervisor => 00hypervisor}/unloading.tc (100%)
 create mode 100644 tools/testing/selftests/ftrace/test.d/remotes/dump_on_oops.tc
 create mode 100644 tools/testing/selftests/ftrace/test.d/remotes/printk.tc


base-commit: e43ffb69e0438cddd72aaa30898b4dc446f664f8
-- 
2.54.0.1032.g2f8565e1d1-goog


^ permalink raw reply

* Re: [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
From: Lance Yang @ 2026-06-02 16:34 UTC (permalink / raw)
  To: npache
  Cc: lance.yang, david, linux-doc, linux-kernel, linux-mm,
	linux-trace-kernel, aarcange, akpm, anshuman.khandual, apopple,
	baohua, baolin.wang, byungchul, catalin.marinas, cl, corbet,
	dave.hansen, dev.jain, gourry, hannes, hughd, jack, jackmanb,
	jannh, jglisse, joshua.hahnjy, kas, liam, ljs, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe, usama.arif
In-Reply-To: <CAA1CXcD7peS3WHueVgAWhhRrjBO_1b19+Xc0CfZBSO8OwJJKQw@mail.gmail.com>


On Tue, Jun 02, 2026 at 09:30:06AM -0600, Nico Pache wrote:
>On Mon, Jun 1, 2026 at 4:48 AM Lance Yang <lance.yang@linux.dev> wrote:
>>
>>
>>
>> On 2026/6/1 18:23, David Hildenbrand (Arm) wrote:
>> > On 6/1/26 11:08, Lance Yang wrote:
>> >>
>> >>
>> >> On 2026/6/1 14:54, David Hildenbrand (Arm) wrote:
>> >>> On 6/1/26 05:28, Lance Yang wrote:
>> >>>>
>> >>>>
>> >>>> Ah, fair point.
>> >>>>
>> >>>> I was mostly worried about arch hooks that walk vma->vm_mm again, rather
>> >>>> than only using the pte pointer passed in. For example, mips does:
>> >>>
>> >>> Right, a re-walk would be the real problem.
>> >>>
>> >>>>
>> >>>>     update_mmu_cache_range()
>> >>>>       -> __update_tlb()
>> >>>>         -> pgd_offset(vma->vm_mm, address)
>> >>>>         -> pte_offset_map(...)
>> >>>>
>> >>>> and __update_tlb() has this assumption:
>> >>>>
>> >>>>          /*
>> >>>>           * update_mmu_cache() is called between pte_offset_map_lock()
>> >>>>           * and pte_unmap_unlock(), so we can assume that ptep is not
>> >>>>           * NULL here: and what should be done below if it were NULL?
>> >>>>           */
>> >>>>
>> >>>> So if khugepaged happens to run with current->active_mm == vma->vm_mm
>> >>>> here, could __update_tlb() hit the none PMD, get NULL from
>> >>>> pte_offset_map(), and then dereference it?
>> >>>
>> >>> Likely yes -- that MIPS code is horrible. And the comment in MIPS code
>> >>> even spells that out. :(
>> >>>
>> >>> Do you know about other code like that, or is MIPS the only one doing a
>> >>> re-walk and crossing fingers?
>> >>>
>> >>>>
>> >>>> Just wanted to raise it since some arch code may still have assumptions
>> >>>> like this, and the always-enable-mTHP work is getting closer ...
>> >>>
>> >>> Right. I assume set_pte_at() couldn't trigger something similar (re-walk) in
>> >>> arch code,
>> >>> because we simply provide the ptep. update_mmu_cache_range() only consumes the
>> >>> pte.
>> >>>
>> >>>>
>> >>>> Probably very very very hard to hit, though :)
>> >>>
>> >>> Delaying update_mmu_cache_range() is nasty, as we'd have to make sure that
>> >>> nobody can interfere in the meantime ... and the PMD lock will not be sufficient.
>> >>>
>> >>> Maybe we could reinstall the page table with the cleared (none) entries while
>> >>> still holding the PTL?
>> >>>
>> >>> Thinking out loud:
>> >>>
>> >>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> >>> index 5ba298d420b7..e39b750b1e6f 100644
>> >>> --- a/mm/khugepaged.c
>> >>> +++ b/mm/khugepaged.c
>> >>> @@ -1413,13 +1413,17 @@ static enum scan_result collapse_huge_page(struct
>> >>> mm_struct *mm, unsigned long s
>> >>>                   map_anon_folio_pmd_nopf(folio, pmd, vma, pmd_addr);
>> >>>           } else {
>> >>>                   /*
>> >>> -                * set_ptes is called in map_anon_folio_pte_nopf with the
>> >>> -                * pmd_ptl lock still held; this is safe as the PMD is expected
>> >>> -                * to be none. The pmd entry is then repopulated below.
>> >>> +                * Re-insert the page table with the cleared entries, but
>> >>> +                * hold the PTL, such that no one can mess with the re-installed
>> >>> +                * page table until we updated the temporarily-cleared entries
>> >>> +                * through map_anon_folio_pte_nopf().
>> >>>                    */
>> >>> -               map_anon_folio_pte_nopf(folio, pte, vma, start_addr, /
>> >>> *uffd_wp=*/ false);
>> >>> -               smp_wmb(); /* make PTEs visible before PMD. See pmd_install() */
>> >>
>> >> One small thing, I think we should probably keep the smp_wmb(), and just
>> >> move it before the earlier pmd_populate().
>> >>
>> >> IIUC, the ordering we want is still:
>> >>
>> >>    clear old PTEs
>> >>    smp_wmb()
>> >>    pmd_populate()
>> >>
>> >> so another CPU cannot walk through the re-installed PMD and still observe
>> >> the old PTEs, right?
>> >
>> > There is a smp_wmb() in __folio_mark_uptodate(), that should be sufficient?
>>
>> Ah, cool! __folio_mark_uptodate() already does the job :P
>>
>> So yeah, no extra smp_wmb() needed here!
>
>are we sure? that folio_mark_uptodate is done before the PTEs are
>reinstalled. Then we reinstall the PMD right after. Currently
>separated by the smp_wmb().

Reinstalling the PMD first makes the PTE table reachable again, right?

So before pmd_populate(), we only need to order the old PTE clears before
the PTE table is reachable again; __folio_mark_uptodate() already has the
smp_wmb() for that :)

The new PTEs are filled later under the PTL.

Hopefully I didn't miss soemthing :)

>I was copying this from other THP code that performs similar PTE/PMD juggling.
>
>I can remove it, but I'd rather air on the side of caution with this.

Cheers, Lance

^ permalink raw reply

* Re: [syzbot] [trace?] KASAN: use-after-free Write in ring_buffer_read_page
From: Steven Rostedt @ 2026-06-02 16:28 UTC (permalink / raw)
  To: syzbot
  Cc: linux-kernel, linux-trace-kernel, mathieu.desnoyers, mhiramat,
	syzkaller-bugs
In-Reply-To: <6a1ede7b.b4221f80.1326c5.0003.GAE@google.com>

On Tue, 02 Jun 2026 06:45:31 -0700
syzbot <syzbot+2dd9d02f60775ce5c1fb@syzkaller.appspotmail.com> wrote:

> syzbot found the following issue on:
> 
> HEAD commit:    e7ae89a0c97c Linux 7.1-rc5
> git tree:       upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=16f06e2e580000
> kernel config:  https://syzkaller.appspot.com/x/.config?x=58acee1ac5406016
> dashboard link: https://syzkaller.appspot.com/bug?extid=2dd9d02f60775ce5c1fb
> compiler:       gcc (Debian 14.2.0-19) 14.2.0, GNU ld (GNU Binutils for Debian) 2.44
> 
> Unfortunately, I don't have any reproducer for this issue yet.

Looks like the test was doing something really weird to trigger this.
Without a reproducer, it's pretty much impossible to find out what
happened. Maybe AI could do it?

-- Steve

^ permalink raw reply

* Re: [PATCH] rtla: Fix parsing of multi-character short options
From: John Kacur @ 2026-06-02 16:21 UTC (permalink / raw)
  To: Tomas Glozar; +Cc: linux-trace-kernel, Steven Rostedt, linux-kernel
In-Reply-To: <20260602125506.3325345-1-tglozar@redhat.com>

On Tue, 2 Jun 2026 14:55:06 +0200, Tomas Glozar wrote:
> rtla: Fix parsing of multi-character short options

Thanks for the fix! I've tested this patch with a comprehensive test suite
covering all rtla commands (timerlat hist/top, osnoise hist/top, hwnoise)
with the four option formats:
  -p 100        (short with space)
  -p100         (short attached - previously broken)
  --period=100  (long with equals)
  --period 100  (long with space)

All 20 tests pass. The fix correctly resolves the issue where -p100 was
being parsed as multiple separate options.

Tested-by: John Kacur <jkacur@redhat.com>

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox