Linux Trace Kernel

* Re: [PATCH v6 0/2] blk-mq: introduce tag starvation observability
From: Aaron Tomlin @ 2026-05-21  2:07 UTC
  To: Jens Axboe
  Cc: rostedt, mhiramat, mathieu.desnoyers, bvanassche,
	johannes.thumshirn, kch, dlemoal, ritesh.list, loberman, neelx,
	sean, mproche, chjohnst, linux-block, linux-kernel,
	linux-trace-kernel
In-Reply-To: <189ddfd7-f579-4c86-bcfc-334cf574bdfc@kernel.dk>

On Mon, May 18, 2026 at 07:31:45AM -0600, Jens Axboe wrote:
> Why not just issue the trace points? Then there's close to zero
> overhead, rather than needing added counters for this, and the
> kernel to keep track. If you just issue the get/put tag kind of traces,
> then userspace can keep track. That's what blktrace has done for decades
> for things like inflight/queue depth accounting.
> 
> IOW, seems to me, this could be done with basically zero kernel
> additions outside of perhaps a trace point or two.

Hi Jens,

Thanks for taking a look.

You make a completely fair point.
I agree that pushing the accounting to userspace is the right approach,
especially given the proposed hard-coded tracepoint. For example, with
bpftrace(8):

# bpftrace -e 'tracepoint:block:block_rq_tag_wait { @tag_waits[cpu] = count(); }'
  Attaching 1 probe...
  ^C
  @tag_waits[4]: 12
  @tag_waits[12]: 87


I will drop Patch 2 from this series in the next iteration.


Kind regards,
-- 
Aaron Tomlin
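
Illustration (not part of the original message): the userspace accounting that Jens describes can be sketched as a replay over a stream of get/put tag events. This is a minimal sketch only. The event kinds `TAG_GET` and `TAG_PUT` and every `tag_*` name below are hypothetical and do not correspond to real tracepoint names; the point is just that in-flight depth and its high-water mark can be derived outside the kernel.

```c
#include <stddef.h>

/* Hypothetical event kinds; these are not real tracepoint names. */
enum tag_event_kind { TAG_GET, TAG_PUT };

struct tag_event {
	enum tag_event_kind kind;
};

struct tag_stats {
	unsigned int inflight;     /* tags currently held */
	unsigned int max_inflight; /* high-water mark seen so far */
};

/* Fold one event into the running statistics. */
static void tag_stats_update(struct tag_stats *s, const struct tag_event *ev)
{
	if (ev->kind == TAG_GET) {
		s->inflight++;
		if (s->inflight > s->max_inflight)
			s->max_inflight = s->inflight;
	} else if (s->inflight > 0) {
		/* A put without a matching get (tracing started late) is ignored. */
		s->inflight--;
	}
}

/* Replay a whole event stream and return the resulting statistics. */
static struct tag_stats tag_stats_replay(const struct tag_event *evs, size_t n)
{
	struct tag_stats s = { 0, 0 };
	size_t i;

	for (i = 0; i < n; i++)
		tag_stats_update(&s, &evs[i]);
	return s;
}
```

Because the state lives entirely in the consumer, the kernel side only needs to emit the events, which is the low-overhead property argued for above.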

* Re: [PATCH v5] tracing/eprobes: Allow use of BTF names to dereference pointers
From: Masami Hiramatsu @ 2026-05-21  1:58 UTC
  To: Steven Rostedt
  Cc: sashiko-bot, sashiko-reviews, bpf, LKML, Linux trace kernel
In-Reply-To: <20260520124832.737a946a@gandalf.local.home>

On Wed, 20 May 2026 12:48:32 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Wed, 20 May 2026 15:20:21 +0900
> Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
> 
> > > > > @@ -515,6 +542,10 @@ static void clear_btf_context(struct traceprobe_parse_context *ctx)
> > > > >  		ctx->params = NULL;
> > > > >  		ctx->nr_params = 0;
> > > > >  	}
> > > > > +	if (ctx->struct_btf) {
> > > > > +		btf_put(ctx->struct_btf);
> > > > > +		ctx->last_struct = NULL;    
> > > > 
> > > > [Severity: Low]
> > > > Should ctx->struct_btf be explicitly set to NULL after btf_put() drops
> > > > the reference?  
> > > 
> > > I'm thinking of dropping it in the '(' switch case.  
> > 
> > Can you consider making the '(' switch case part as a helper
> > function because it depends on CONFIG_DEBUG_INFO_BTF?
> 
> Should we just encapsulate that entire case statement with:
> 
> #ifdef CONFIG_DEBUG_INFO_BTF
> [..]
> #endif

Yeah, that is possible, and I would rather make it a separate
function, to simplify the switch-case block for readability.

Thank you,

> 
>  ?
> 
> -- Steve
> 
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

* Re: [PATCH mm-unstable v17 11/14] mm/khugepaged: Introduce mTHP collapse support
From: Wei Yang @ 2026-05-21  1:55 UTC
  To: Nico Pache
  Cc: Wei Yang, linux-doc, linux-kernel, linux-mm, linux-trace-kernel,
	aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, david,
	dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, liam, ljs, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe
In-Reply-To: <CAA1CXcD2KPKFrwCZd2PatQhf_e1nrvCguPD77GcNOVPFZLvsew@mail.gmail.com>

On Wed, May 20, 2026 at 06:05:31AM -0600, Nico Pache wrote:
>On Tue, May 12, 2026 at 9:44 AM Wei Yang <richard.weiyang@gmail.com> wrote:
>>
>> On Mon, May 11, 2026 at 12:58:11PM -0600, Nico Pache wrote:
>> >Enable khugepaged to collapse to mTHP orders. This patch implements the
>> >main scanning logic using a bitmap to track occupied pages and a stack
>> >structure that allows us to find optimal collapse sizes.
>> >
>> >Previous to this patch, PMD collapse had 3 main phases, a light weight
>> >scanning phase (mmap_read_lock) that determines a potential PMD
>> >collapse, an alloc phase (mmap unlocked), then finally heavier collapse
>> >phase (mmap_write_lock).
>> >
>> >To enable mTHP collapse we make the following changes:
>> >
>> >During PMD scan phase, track occupied pages in a bitmap. When mTHP
>> >orders are enabled, we remove the restriction of max_ptes_none during the
>> >scan phase to avoid missing potential mTHP collapse candidates. Once we
>> >have scanned the full PMD range and updated the bitmap to track occupied
>> >pages, we use the bitmap to find the optimal mTHP size.
>> >
>> >Implement collapse_scan_bitmap() to perform binary recursion on the bitmap
>> >and determine the best eligible order for the collapse. A stack structure
>> >is used instead of traditional recursion to manage the search. This also
>> >avoids a traditional recursive approach, given that the kernel stack is
>> >limited. The algorithm recursively splits the bitmap into smaller chunks to
>> >find the highest order mTHPs that satisfy the collapse criteria. We start
>> >by attempting the PMD order, then move on to consecutively lower orders
>> >(mTHP collapse). The stack maintains a pair of variables (offset, order),
>> >indicating the number of PTEs from the start of the PMD, and the order of
>> >the potential collapse candidate.
>> >
>> >The algorithm for consuming the bitmap works as such:
>> >    1) push (0, HPAGE_PMD_ORDER) onto the stack
>> >    2) pop the stack
>> >    3) check if the number of set bits in that (offset,order) pair
>> >       satisfies the max_ptes_none threshold for that order
>> >    4) if yes, attempt collapse
>> >    5) if no (or collapse fails), push two new stack items representing
>> >       the left and right halves of the current bitmap range, at the
>> >       next lower order
>> >    6) repeat at step (2) until stack is empty.
>> >
>> >Below is a diagram representing the algorithm and stack items:
>> >
>> >                            offset   mid_offset
>> >                            |        |
>> >                            |        |
>> >                            v        v
>> >          ____________________________________
>> >         |          PTE Page Table            |
>> >         --------------------------------------
>> >                           <-------><------->
>> >                             order-1  order-1
>> >
>> >mTHP collapses reject regions containing swapped out or shared pages.
>> >This is because adding new entries can lead to new none pages, and these
>> >may lead to constant promotion into a higher order mTHP. A similar
>> >issue can occur with "max_ptes_none > HPAGE_PMD_NR/2" due to a collapse
>> >introducing at least 2x the number of pages, and on a future scan will
>> >satisfy the promotion condition once again. This issue is prevented via
>> >the collapse_max_ptes_none() function which imposes the max_ptes_none
>> >restrictions above.
>> >
>> >We currently only support mTHP collapse for max_ptes_none values of 0
>> >and HPAGE_PMD_NR - 1, resulting in the following behavior:
>> >
>> >    - max_ptes_none=0: Never introduce new empty pages during collapse
>> >    - max_ptes_none=HPAGE_PMD_NR-1: Always try collapse to the highest
>> >      available mTHP order
>> >
>> >Any other max_ptes_none value will emit a warning and skip mTHP collapse
>> >attempts. There should be no behavior change for PMD collapse.
>> >
>> >Once we determine which mTHP size fits best in that PMD range, a collapse
>> >is attempted. A minimum collapse order of 2 is used as this is the lowest
>> >order supported by anon memory as defined by THP_ORDERS_ALL_ANON.
>> >
>> >Currently madv_collapse is not supported and will only attempt PMD
>> >collapse.
>> >
>> >We can also remove the check for is_khugepaged inside the PMD scan as
>> >the collapse_max_ptes_none() function handles this logic now.
>> >
>> >Signed-off-by: Nico Pache <npache@redhat.com>
>>
>> [...]
>>
>> >+static int mthp_collapse(struct mm_struct *mm, unsigned long address,
>> >+              int referenced, int unmapped, struct collapse_control *cc,
>> >+              unsigned long enabled_orders)
>> >+{
>> >+      unsigned int nr_occupied_ptes, nr_ptes;
>> >+      int max_ptes_none, collapsed = 0, stack_size = 0;
>> >+      unsigned long collapse_address;
>> >+      struct mthp_range range;
>> >+      u16 offset;
>> >+      u8 order;
>> >+
>> >+      collapse_mthp_stack_push(cc, &stack_size, 0, HPAGE_PMD_ORDER);
>> >+
>> >+      while (stack_size) {
>> >+              range = collapse_mthp_stack_pop(cc, &stack_size);
>> >+              order = range.order;
>> >+              offset = range.offset;
>> >+              nr_ptes = 1UL << order;
>> >+
>> >+              if (!test_bit(order, &enabled_orders))
>> >+                      goto next_order;
>> >+
>> >+              max_ptes_none = collapse_max_ptes_none(cc, NULL, order);
>>
>> I am wondering whether there is a behavioral change for userfaultfd_armed(vma).
>>
>> collapse_single_pmd()
>>     collapse_scan_pmd
>>         max_ptes_none = collapse_max_ptes_none(cc, vma)
>>         max_ptes_none = KHUGEPAGED_MAX_PTES_LIMIT                --- (1)
>>         mthp_collapse
>>             max_ptes_none = collapse_max_ptes_none(cc, NULL)     --- (2)
>>             collapse_huge_page(mm)
>>                 hugepage_vma_revalidate(&vma)
>>                 __collapse_huge_page_isolate(vma)
>>                     max_ptes_none = collapse_max_ptes_none(cc, vma)
>>
>> Before mthp_collapse() was introduced, a userfaultfd_armed(vma) was skipped if
>> there was any pte_none_or_zero() in collapse_scan_pmd().
>>
>> But now, max_ptes_none could be set to KHUGEPAGED_MAX_PTES_LIMIT at (1), so
>> that we can scan all the pte to get the bitmap. This means
>> userfaultfd_armed(vma) could continue even with pte_none_or_zero().
>>
>> Then in mthp_collapse(), collapse_max_ptes_none() at (2) ignores
>> userfaultfd_armed(vma), which means it will continue to collapse a
>> userfaultfd_armed(vma) when there is pte_none_or_zero().
>>
>> The good news is we will stop at __collapse_huge_page_isolate(), where we
>> get collapse_max_ptes_none() with vma. But we already did a lot of work.
>
>Good catch!
>
>As you stated we eventually ensure we respect the uffd checks. So
>there are no correctness issues, just the potential for wasted cycles.
>
>At (1) we only do this if mTHPs are enabled. If that is the case, the
>only waste that can arise is at the PMD order, as that order respects
>the max_ptes_none value.
>
>I think one approach is to gate (1) with the uffd check as well. That
>way, if mTHPs are enabled and it's uffd-armed, max_ptes_none will stay
>at 0, and we bail early on the scan if any none_ptes are hit.
>
>But then we lose the ability to collapse to mTHPs that are uffd-armed,
>where the PMD has none/zero-ptes but the mTHP range itself has 0
>none/zero-ptes.
>
>For example, assume a PMD of 16 PTEs: [xxxxxxxx00000000],
>where x is a populated pte and 0 is not.
>If we guard this scan at (1), then we will never check if it's possible to
>collapse to the smaller orders.
>
>Let me know if you see a flaw in my logic, I think it's best to keep it as is?
>

Yes, gating it at (1) is not the proper place.

I am wondering whether we could pass vma to (2), so that we could respect
uffd-armed?

>>
>> Not sure if I missed something.
>>
>> >+
>> >+              if (max_ptes_none < 0)
>> >+                      return collapsed;
>> >+
>> >+              nr_occupied_ptes = collapse_mthp_count_present(cc, offset,
>> >+                                                             nr_ptes);
>> >+
>> >+              if (nr_occupied_ptes >= nr_ptes - max_ptes_none) {
>> >+                      int ret;
>> >+
>> >+                      collapse_address = address + offset * PAGE_SIZE;
>> >+                      ret = collapse_huge_page(mm, collapse_address, referenced,
>> >+                                               unmapped, cc, order);
>> >+                      if (ret == SCAN_SUCCEED) {
>> >+                              collapsed += nr_ptes;
>> >+                              continue;
>> >+                      }
>> >+              }
>> >+
>> >+next_order:
>> >+              if (order > KHUGEPAGED_MIN_MTHP_ORDER) {
>> >+                      const u8 next_order = order - 1;
>> >+                      const u16 mid_offset = offset + (nr_ptes / 2);
>> >+
>> >+                      collapse_mthp_stack_push(cc, &stack_size, mid_offset,
>> >+                                               next_order);
>> >+                      collapse_mthp_stack_push(cc, &stack_size, offset,
>> >+                                               next_order);
>> >+              }
>> >+      }
>> >+      return collapsed;
>> >+}
>> >+
>> > static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>> >               struct vm_area_struct *vma, unsigned long start_addr,
>> >               bool *lock_dropped, struct collapse_control *cc)
>> > {
>> >-      const int max_ptes_none = collapse_max_ptes_none(cc, vma, HPAGE_PMD_ORDER);
>> >+      int max_ptes_none = collapse_max_ptes_none(cc, vma, HPAGE_PMD_ORDER);
>> >       const unsigned int max_ptes_shared = collapse_max_ptes_shared(cc, HPAGE_PMD_ORDER);
>> >       const unsigned int max_ptes_swap = collapse_max_ptes_swap(cc, HPAGE_PMD_ORDER);
>> >+      enum tva_type tva_flags = cc->is_khugepaged ? TVA_KHUGEPAGED : TVA_FORCED_COLLAPSE;
>> >       pmd_t *pmd;
>> >-      pte_t *pte, *_pte;
>> >-      int none_or_zero = 0, shared = 0, referenced = 0;
>> >+      pte_t *pte, *_pte, pteval;
>> >+      int i;
>> >+      int none_or_zero = 0, shared = 0, nr_collapsed = 0, referenced = 0;
>> >       enum scan_result result = SCAN_FAIL;
>> >       struct page *page = NULL;
>> >       struct folio *folio = NULL;
>> >       unsigned long addr;
>> >+      unsigned long enabled_orders;
>> >       spinlock_t *ptl;
>> >       int node = NUMA_NO_NODE, unmapped = 0;
>> >
>> >@@ -1429,8 +1579,19 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>> >               goto out;
>> >       }
>> >
>> >+      bitmap_zero(cc->mthp_bitmap, MAX_PTRS_PER_PTE);
>> >       memset(cc->node_load, 0, sizeof(cc->node_load));
>> >       nodes_clear(cc->alloc_nmask);
>> >+
>> >+      enabled_orders = collapse_allowable_orders(vma, vma->vm_flags, tva_flags);
>>
>> Would it be 0 at this point?
>
>If your question relates to the issue you brought up above, then yes,
>max_ptes_none would be 0 if it's uffd-armed. We must recheck the
>uffd-armed status before modifying it to 511.
>
>>
>> >+
>> >+      /*
>> >+       * If PMD is the only enabled order, enforce max_ptes_none, otherwise
>> >+       * scan all pages to populate the bitmap for mTHP collapse.
>> >+       */
>> >+      if (enabled_orders != BIT(HPAGE_PMD_ORDER))
>> >+              max_ptes_none = KHUGEPAGED_MAX_PTES_LIMIT;
>> >+
>> >       pte = pte_offset_map_lock(mm, pmd, start_addr, &ptl);
>> >       if (!pte) {
>> >               cc->progress++;
>> >@@ -1438,11 +1599,13 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>> >               goto out;
>> >       }
>> >
>> >-      for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
>> >-           _pte++, addr += PAGE_SIZE) {
>> >+      for (i = 0; i < HPAGE_PMD_NR; i++) {
>> >+              _pte = pte + i;
>> >+              addr = start_addr + i * PAGE_SIZE;
>> >+              pteval = ptep_get(_pte);
>> >+
>> >               cc->progress++;
>> >
>> >-              pte_t pteval = ptep_get(_pte);
>> >               if (pte_none_or_zero(pteval)) {
>> >                       if (++none_or_zero > max_ptes_none) {
>> >                               result = SCAN_EXCEED_NONE_PTE;
>> >@@ -1522,6 +1685,8 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>> >                       }
>> >               }
>> >
>> >+              /* Set bit for occupied pages */
>> >+              __set_bit(i, cc->mthp_bitmap);
>> >               /*
>> >                * Record which node the original page is from and save this
>> >                * information to cc->node_load[].
>> >@@ -1580,10 +1745,11 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>> >       if (result == SCAN_SUCCEED) {
>> >               /* collapse_huge_page expects the lock to be dropped before calling */
>> >               mmap_read_unlock(mm);
>> >-              result = collapse_huge_page(mm, start_addr, referenced,
>> >-                                          unmapped, cc, HPAGE_PMD_ORDER);
>> >+              nr_collapsed = mthp_collapse(mm, start_addr, referenced, unmapped,
>> >+                                            cc, enabled_orders);
>> >               /* collapse_huge_page will return with the mmap_lock released */
>>
>> collapse_huge_page will return with mmap_lock released, but mthp_collapse()
>> may not?
>
>We are now releasing the lock before calling mthp_collapse, which
>subsequently calls collapse_huge_page. Even if `collapse_huge_page` is
>never called-- say, because enabled_orders is 0 (which should not
>happen) and all collapse orders are skipped (never calling
>collapse_huge_page)-- we still return here with the lock dropped.
>
>I think this is sound. Let me know if you think differently.
>

You are right. I missed that the lock is released in the previous patch.

>Cheers :)
>-- Nico
>
>>
>> >               *lock_dropped = true;
>> >+              result = nr_collapsed ? SCAN_SUCCEED : SCAN_FAIL;
>> >       }
>> > out:
>> >       trace_mm_khugepaged_scan_pmd(mm, folio, referenced,
>> >--
>> >2.54.0
>>
>> --
>> Wei Yang
>> Help you, Help me
>>

-- 
Wei Yang
Help you, Help me
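
Illustration (not part of the original message): the stack-based bitmap walk from the quoted commit message (steps 1 to 6) can be sketched in userspace as below. This is a toy model under stated assumptions, not the patch's code: the "PMD" is shrunk to 16 PTEs (`SKETCH_PMD_ORDER` 4), the bitmap is a plain `uint32_t` where bit i stands for PTE i, a collapse attempt is assumed to always succeed once the threshold is met, and scaling the threshold by right-shifting `max_ptes_none` per order is an assumption about what `collapse_max_ptes_none()` does. All `sketch_*` names are hypothetical.

```c
#include <stdint.h>

#define SKETCH_PMD_ORDER 4 /* toy "PMD" of 16 PTEs, far smaller than a real one */
#define SKETCH_MIN_ORDER 2 /* lowest mTHP order, per the commit message */
#define SKETCH_STACK_MAX 8 /* ample for a walk from order 4 down to order 2 */

struct sketch_range {
	uint16_t offset; /* PTEs from the start of the PMD */
	uint8_t order;   /* candidate collapse order */
};

/* Count set bits in bitmap[offset, offset + nr). */
static unsigned int sketch_count_present(uint32_t bitmap, unsigned int offset,
					 unsigned int nr)
{
	unsigned int i, n = 0;

	for (i = 0; i < nr; i++)
		if (bitmap & (UINT32_C(1) << (offset + i)))
			n++;
	return n;
}

/* Return how many PTEs end up covered by a (simulated) collapse. */
static unsigned int sketch_mthp_collapse(uint32_t bitmap,
					 unsigned int max_ptes_none)
{
	struct sketch_range stack[SKETCH_STACK_MAX];
	unsigned int collapsed = 0;
	int top = 0;

	/* Step 1: start with the whole PMD range. */
	stack[top++] = (struct sketch_range){ 0, SKETCH_PMD_ORDER };

	while (top) {
		/* Step 2: pop the next candidate. */
		struct sketch_range r = stack[--top];
		unsigned int nr = 1u << r.order;
		/* Assumption: the threshold scales down with the order. */
		unsigned int max_none =
			max_ptes_none >> (SKETCH_PMD_ORDER - r.order);

		/* Steps 3 and 4: enough occupied PTEs means "collapse". */
		if (sketch_count_present(bitmap, r.offset, nr) >= nr - max_none) {
			collapsed += nr;
			continue;
		}

		/* Step 5: otherwise split into halves; the left half is popped first. */
		if (r.order > SKETCH_MIN_ORDER) {
			stack[top++] = (struct sketch_range){
				(uint16_t)(r.offset + nr / 2), (uint8_t)(r.order - 1) };
			stack[top++] = (struct sketch_range){
				r.offset, (uint8_t)(r.order - 1) };
		}
	}
	return collapsed;
}
```

With Nico's example layout [xxxxxxxx00000000] (bitmap `0x00FF`) and `max_ptes_none` of 0, the order-4 attempt fails and the walk settles on the fully populated left half, which is the case that gating the scan at (1) would lose.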

* [PATCH] tracing: Fix README path for synthetic_events
From: ao.sun @ 2026-05-21  1:52 UTC
  To: rostedt@goodmis.org, mhiramat@kernel.org,
	mathieu.desnoyers@efficios.com, linux-kernel@vger.kernel.org,
	linux-trace-kernel@vger.kernel.org
  Cc: jiazi.li, hongyan.xia, ao.sun

From: Ao Sun <ao.sun@transsion.com>

The events/ prefix should be removed, since synthetic_events
is now directly under the tracing root directory.

Signed-off-by: Ao Sun <ao.sun@transsion.com>
---
 kernel/trace/trace.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 6eb4d3097a4d..5dd4a8ca7586 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -4474,7 +4474,7 @@ static const char readme_msg[] =
 	"\t        snapshot()                           - snapshot the trace buffer\n\n"
 #endif
 #ifdef CONFIG_SYNTH_EVENTS
-	"  events/synthetic_events\t- Create/append/remove/show synthetic events\n"
+	"  synthetic_events\t- Create/append/remove/show synthetic events\n"
 	"\t  Write into this file to define/undefine new synthetic events.\n"
 	"\t     example: echo 'myevent u64 lat; char name[]; long[] stack' >> synthetic_events\n"
 #endif
-- 
2.34.1


* Re: [PATCH] ftrace: Use flexible array for hash buckets
From: Rosen Penev @ 2026-05-21  1:39 UTC
  To: Steven Rostedt
  Cc: linux-trace-kernel, Masami Hiramatsu, Mark Rutland,
	Mathieu Desnoyers, open list:FUNCTION HOOKS (FTRACE), sashiko-bot,
	sashiko-reviews
In-Reply-To: <20260520212829.7734bad4@fedora>

On Wed, May 20, 2026 at 6:28 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Wed, 20 May 2026 15:00:30 -0700
> Rosen Penev <rosenp@gmail.com> wrote:
>
> > Store ftrace hash buckets in the ftrace_hash allocation instead of
> > allocating the bucket array separately.
> >
> > This keeps the bucket storage tied to the hash lifetime and simplifies
> > the allocation and cleanup paths.
> >
> > Assisted-by: Codex:GPT-5.5
>
> I'll let the AIs duke it out!
>
> > Signed-off-by: Rosen Penev <rosenp@gmail.com>
> > ---
> >  kernel/trace/ftrace.c | 17 ++---------------
> >  kernel/trace/trace.h  |  2 +-
> >  2 files changed, 3 insertions(+), 16 deletions(-)
> >
> > diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
> > index b2611de3f594..25a9dca290dd 100644
> > --- a/kernel/trace/ftrace.c
> > +++ b/kernel/trace/ftrace.c
> > @@ -1082,10 +1082,7 @@ struct ftrace_func_probe {
> >   * it all the time. These are in a read only section such that if
> >   * anyone does try to modify it, it will cause an exception.
> >   */
> > -static const struct hlist_head empty_buckets[1];
> > -static const struct ftrace_hash empty_hash = {
> > -     .buckets = (struct hlist_head *)empty_buckets,
> > -};
> > +static const struct ftrace_hash empty_hash = {};
> >  #define EMPTY_HASH   ((struct ftrace_hash *)&empty_hash)
>
>
> According to Sashiko: https://sashiko.dev/#/patchset/20260520220030.16887-1-rosenp%40gmail.com
>
>    Could this conversion to a flexible array member cause an
>    out-of-bounds read when iterating over the empty hash? Because
>    empty_hash is now initialized as an empty struct, its flexible array
>    member buckets has a size of 0. However, empty_hash.size_bits is 0,
>    which means loop limits computing '1 << hash->size_bits' will
>    evaluate to 1. If functions like
>    prepare_direct_functions_for_ipmodify() iterate over a default
>    EMPTY_HASH without checking ftrace_hash_empty(), they will attempt
>    to read EMPTY_HASH->buckets[0]. This reads past the end of the
>    struct into adjacent memory in the .rodata section. If that adjacent
>    memory happens to be non-zero, the linked list loop could
>    dereference it and cause a kernel panic. Prior to this patch,
>    empty_buckets provided a safely zeroed array of size 1 to handle
>    this single iteration.
Yeah, this looks right. Might as well abandon the patch.
>
> -- Steve

* Re: [PATCH] ring-buffer: Fix reporting of missed events in iterator
From: Steven Rostedt @ 2026-05-21  1:38 UTC
  To: Masami Hiramatsu (Google); +Cc: LKML, Linux Trace Kernel, Mathieu Desnoyers
In-Reply-To: <20260521102921.a1449dedfe0392a3bb4a91b9@kernel.org>

On Thu, 21 May 2026 10:29:21 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:


> > Do not reset the missed_events flag when checking if there were missed
> > events, but instead clear it when moving the iterator head to the next
> > event.
> >   
> 
> As Sashiko pointed out, this flag should be reset in rb_iter_reset() too.
> (But that seems like another issue?)
> 

Hmm, I think that was an issue even before this patch, but Sashiko is
right. It should be fixed. I'll send a v2.

Thanks,

-- Steve

* Re: [PATCH] ring-buffer: Fix reporting of missed events in iterator
From: Masami Hiramatsu @ 2026-05-21  1:29 UTC
  To: Steven Rostedt
  Cc: LKML, Linux Trace Kernel, Masami Hiramatsu, Mathieu Desnoyers
In-Reply-To: <20260520142817.4050abab@gandalf.local.home>

On Wed, 20 May 2026 14:28:17 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> From: Steven Rostedt <rostedt@goodmis.org>
> 
> When tracing is active while reading the trace file, if the iterator
> reading the buffer detects that the writer has passed the iterator head,
> it will reset and set a "missed events" flag. This flag is passed to the
> output processing to show the user that events were missed:
> 
>   CPU:4 [LOST EVENTS]
> 
> The problem is that the flag is reset after it is checked in
> ring_buffer_iter_dropped(). But the "trace" file iterates over all the CPU
> ring buffers and it will check if they are dropped when figuring out which
> buffer to print next. This prematurely clears the missed_events flag if
> the CPU buffer with the missed events is not the one that is printed next.
> 
> On the iteration where the CPU buffer with the missed events is printed,
> the check if it had missed events would return false and the output does
> not show that events were missed.
> 
> Do not reset the missed_events flag when checking if there were missed
> events, but instead clear it when moving the iterator head to the next
> event.
> 

As Sashiko pointed out, this flag should be reset in rb_iter_reset() too.
(But that seems like another issue?)

Thank you,

> Cc: stable@vger.kernel.org
> Fixes: c9b7a4a72ff64 ("ring-buffer/tracing: Have iterator acknowledge dropped events")
> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
> ---
>  kernel/trace/ring_buffer.c | 7 ++-----
>  1 file changed, 2 insertions(+), 5 deletions(-)
> 
> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> index 5326924615a4..47b0a7b43f0f 100644
> --- a/kernel/trace/ring_buffer.c
> +++ b/kernel/trace/ring_buffer.c
> @@ -6086,10 +6086,7 @@ ring_buffer_peek(struct trace_buffer *buffer, int cpu, u64 *ts,
>   */
>  bool ring_buffer_iter_dropped(struct ring_buffer_iter *iter)
>  {
> -	bool ret = iter->missed_events != 0;
> -
> -	iter->missed_events = 0;
> -	return ret;
> +	return iter->missed_events != 0;
>  }
>  EXPORT_SYMBOL_GPL(ring_buffer_iter_dropped);
>  
> @@ -6251,7 +6248,7 @@ void ring_buffer_iter_advance(struct ring_buffer_iter *iter)
>  	unsigned long flags;
>  
>  	raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
> -
> +	iter->missed_events = 0;
>  	rb_advance_iter(iter);
>  
>  	raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
> -- 
> 2.53.0
> 
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>
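
Illustration (not part of the original message): the premature clearing described in the quoted commit message can be reproduced with a small userspace model. This is a sketch under assumptions: each CPU iterator is reduced to a flag and a next timestamp, the reader polls every CPU for dropped events while picking the oldest entry, and all `sketch_*` names are hypothetical rather than ring-buffer API.

```c
#include <stdbool.h>

struct sketch_iter {
	int missed_events;          /* set when the writer lapped this iterator */
	unsigned long long next_ts; /* timestamp of the next event to print */
};

/* Old behaviour: asking whether events were dropped also clears the flag. */
static bool sketch_dropped_old(struct sketch_iter *it)
{
	bool ret = it->missed_events != 0;

	it->missed_events = 0;
	return ret;
}

/* Patched behaviour: a pure query. */
static bool sketch_dropped_new(const struct sketch_iter *it)
{
	return it->missed_events != 0;
}

/* Consume one event; only the patched variant clears the flag here. */
static void sketch_advance(struct sketch_iter *it, bool patched)
{
	if (patched)
		it->missed_events = 0;
	it->next_ts += 100;
}

/*
 * Print nr_events events in timestamp order and count how many of them
 * would carry a "[LOST EVENTS]" marker.
 */
static int sketch_count_lost_reports(struct sketch_iter *iters, int nr_cpus,
				     int nr_events, bool patched)
{
	int reports = 0;

	for (int n = 0; n < nr_events; n++) {
		int next = -1;
		bool next_lost = false;

		/* Every CPU is polled, not only the one that gets printed. */
		for (int cpu = 0; cpu < nr_cpus; cpu++) {
			bool lost = patched ? sketch_dropped_new(&iters[cpu])
					    : sketch_dropped_old(&iters[cpu]);

			if (next < 0 || iters[cpu].next_ts < iters[next].next_ts) {
				next = cpu;
				next_lost = lost;
			}
		}
		if (next_lost)
			reports++;
		sketch_advance(&iters[next], patched);
	}
	return reports;
}
```

In this model, a CPU whose flag is set but whose event is not the oldest loses its marker under the old query, because the poll for the first printed event already cleared it. Clearing on advance keeps the marker until that CPU's event is actually printed.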

* Re: [PATCH] ftrace: Use flexible array for hash buckets
From: Steven Rostedt @ 2026-05-21  1:28 UTC
  To: Rosen Penev
  Cc: linux-trace-kernel, Masami Hiramatsu, Mark Rutland,
	Mathieu Desnoyers, open list:FUNCTION HOOKS (FTRACE), sashiko-bot,
	sashiko-reviews
In-Reply-To: <20260520220030.16887-1-rosenp@gmail.com>

On Wed, 20 May 2026 15:00:30 -0700
Rosen Penev <rosenp@gmail.com> wrote:

> Store ftrace hash buckets in the ftrace_hash allocation instead of
> allocating the bucket array separately.
> 
> This keeps the bucket storage tied to the hash lifetime and simplifies
> the allocation and cleanup paths.
> 
> Assisted-by: Codex:GPT-5.5

I'll let the AIs duke it out!

> Signed-off-by: Rosen Penev <rosenp@gmail.com>
> ---
>  kernel/trace/ftrace.c | 17 ++---------------
>  kernel/trace/trace.h  |  2 +-
>  2 files changed, 3 insertions(+), 16 deletions(-)
> 
> diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
> index b2611de3f594..25a9dca290dd 100644
> --- a/kernel/trace/ftrace.c
> +++ b/kernel/trace/ftrace.c
> @@ -1082,10 +1082,7 @@ struct ftrace_func_probe {
>   * it all the time. These are in a read only section such that if
>   * anyone does try to modify it, it will cause an exception.
>   */
> -static const struct hlist_head empty_buckets[1];
> -static const struct ftrace_hash empty_hash = {
> -	.buckets = (struct hlist_head *)empty_buckets,
> -};
> +static const struct ftrace_hash empty_hash = {};
>  #define EMPTY_HASH	((struct ftrace_hash *)&empty_hash)

According to Sashiko: https://sashiko.dev/#/patchset/20260520220030.16887-1-rosenp%40gmail.com

   Could this conversion to a flexible array member cause an
   out-of-bounds read when iterating over the empty hash? Because
   empty_hash is now initialized as an empty struct, its flexible array
   member buckets has a size of 0. However, empty_hash.size_bits is 0,
   which means loop limits computing '1 << hash->size_bits' will
   evaluate to 1. If functions like
   prepare_direct_functions_for_ipmodify() iterate over a default
   EMPTY_HASH without checking ftrace_hash_empty(), they will attempt
   to read EMPTY_HASH->buckets[0]. This reads past the end of the
   struct into adjacent memory in the .rodata section. If that adjacent
   memory happens to be non-zero, the linked list loop could
   dereference it and cause a kernel panic. Prior to this patch,
   empty_buckets provided a safely zeroed array of size 1 to handle
   this single iteration.

-- Steve
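
Illustration (not part of the original message): the size mismatch in the quoted review can be checked with plain arithmetic, without performing the out-of-bounds read itself. This is a minimal userspace sketch. `struct sketch_hash` only mirrors the fields mentioned in the review (`size_bits`, `buckets`) plus a placeholder `count`, and all `sketch_*` names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

struct sketch_hlist_head {
	void *first;
};

struct sketch_hash {
	unsigned long size_bits;
	unsigned long count;
	struct sketch_hlist_head buckets[]; /* flexible array member */
};

/* Buckets a walker visits: 1 << size_bits, which is 1 when size_bits is 0. */
static size_t sketch_hash_nr_buckets(const struct sketch_hash *hash)
{
	return (size_t)1 << hash->size_bits;
}

/* Bytes the object must span for every visited bucket to be in bounds. */
static size_t sketch_hash_bytes_needed(const struct sketch_hash *hash)
{
	return offsetof(struct sketch_hash, buckets) +
	       sketch_hash_nr_buckets(hash) * sizeof(struct sketch_hlist_head);
}

/* True if walking all buckets stays inside an object of obj_size bytes. */
static bool sketch_hash_walk_in_bounds(const struct sketch_hash *hash,
				       size_t obj_size)
{
	return sketch_hash_bytes_needed(hash) <= obj_size;
}

/* Zero-initialised static object: it has storage for zero buckets. */
static const struct sketch_hash sketch_empty_hash = { 0 };
```

Passing `sizeof(sketch_empty_hash)` as the object size makes the check fail, since one bucket is visited but none is allocated. That is the gap the separate one-element `empty_buckets` array covered before the patch.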

* Re: [PATCH] tracing: Create output file from cmd_check_undefined
From: Nathan Chancellor @ 2026-05-20 22:42 UTC
  To: Thomas Weißschuh
  Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Vincent Donnefort, Marc Zyngier, Arnd Bergmann, linux-kernel,
	linux-trace-kernel
In-Reply-To: <20260520-tracing-ringbuffer-check-v1-1-d979cfab1338@weissschuh.net>

On Wed, May 20, 2026 at 08:01:55PM +0200, Thomas Weißschuh wrote:
> As the output file is currently never created, the check will run every
> time, even if the inputs have not changed.
> 
> Create an empty output file which allows make to skip the execution when
> it is not necessary.
> 
> Fixes: 1211907ac0b5 ("tracing: Generate undef symbols allowlist for simple_ring_buffer")
> Fixes: 58b4bd18390e ("tracing: Adjust cmd_check_undefined to show unexpected undefined symbols")
> Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>

Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>

> ---
>  kernel/trace/Makefile | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
> index 1decdce8cbef..b5797457f9f4 100644
> --- a/kernel/trace/Makefile
> +++ b/kernel/trace/Makefile
> @@ -154,7 +154,8 @@ quiet_cmd_check_undefined = NM      $<
>                echo "Unexpected symbols in $<:" >&2; \
>                echo "$$undefsyms" >&2; \
>                false; \
> -          fi
> +          fi; \
> +          touch $@
>  
>  $(obj)/%.o.checked: $(obj)/%.o $(obj)/undefsyms_base.o FORCE
>  	$(call if_changed,check_undefined)
> 
> ---
> base-commit: 254f49634ee16a731174d2ae34bc50bd5f45e731
> change-id: 20260520-tracing-ringbuffer-check-3a6e748d37b7
> 
> Best regards,
> --  
> Thomas Weißschuh <linux@weissschuh.net>
> 

-- 
Cheers,
Nathan

* [PATCH] trace: allocate fields with elt struct
From: Rosen Penev @ 2026-05-20 22:31 UTC
  To: linux-trace-kernel
  Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	open list:TRACING

Use a flexible array member to embed the fields array in the
tracing_map_elt allocation, reducing the number of allocations
per element.

Since the fields are now embedded in the struct, taking the address
of a field through a const-qualified elt pointer yields a
const-qualified pointer. Rather than adding casts, switch the
comparison functions to take const void * parameters. These are
all read-only operations.

Assisted-by: OpenCode:BigPickle
Signed-off-by: Rosen Penev <rosenp@gmail.com>
---
 kernel/trace/tracing_map.c | 41 ++++++++++++++++----------------------
 kernel/trace/tracing_map.h |  8 ++++----
 2 files changed, 21 insertions(+), 28 deletions(-)

diff --git a/kernel/trace/tracing_map.c b/kernel/trace/tracing_map.c
index 1404bf752d99..97f7e3cde262 100644
--- a/kernel/trace/tracing_map.c
+++ b/kernel/trace/tracing_map.c
@@ -125,32 +125,32 @@ u64 tracing_map_read_var_once(struct tracing_map_elt *elt, unsigned int i)
 	return (u64)atomic64_read(&elt->vars[i]);
 }
 
-int tracing_map_cmp_string(void *val_a, void *val_b)
+int tracing_map_cmp_string(const void *val_a, const void *val_b)
 {
-	char *a = val_a;
-	char *b = val_b;
+	const char *a = val_a;
+	const char *b = val_b;
 
 	return strcmp(a, b);
 }
 
-int tracing_map_cmp_none(void *val_a, void *val_b)
+int tracing_map_cmp_none(const void *val_a, const void *val_b)
 {
 	return 0;
 }
 
-static int tracing_map_cmp_atomic64(void *val_a, void *val_b)
+static int tracing_map_cmp_atomic64(const void *val_a, const void *val_b)
 {
-	u64 a = atomic64_read((atomic64_t *)val_a);
-	u64 b = atomic64_read((atomic64_t *)val_b);
+	u64 a = atomic64_read((const atomic64_t *)val_a);
+	u64 b = atomic64_read((const atomic64_t *)val_b);
 
 	return (a > b) ? 1 : ((a < b) ? -1 : 0);
 }
 
 #define DEFINE_TRACING_MAP_CMP_FN(type)					\
-static int tracing_map_cmp_##type(void *val_a, void *val_b)		\
+static int tracing_map_cmp_##type(const void *val_a, const void *val_b)	\
 {									\
-	type a = (type)(*(u64 *)val_a);					\
-	type b = (type)(*(u64 *)val_b);					\
+	type a = (type)(*(const u64 *)val_a);				\
+	type b = (type)(*(const u64 *)val_b);				\
 									\
 	return (a > b) ? 1 : ((a < b) ? -1 : 0);			\
 }
@@ -385,7 +385,6 @@ static void tracing_map_elt_free(struct tracing_map_elt *elt)
 
 	if (elt->map->ops && elt->map->ops->elt_free)
 		elt->map->ops->elt_free(elt);
-	kfree(elt->fields);
 	kfree(elt->vars);
 	kfree(elt->var_set);
 	kfree(elt->key);
@@ -397,7 +396,7 @@ static struct tracing_map_elt *tracing_map_elt_alloc(struct tracing_map *map)
 	struct tracing_map_elt *elt;
 	int err = 0;
 
-	elt = kzalloc_obj(*elt);
+	elt = kzalloc_flex(*elt, fields, map->n_fields);
 	if (!elt)
 		return ERR_PTR(-ENOMEM);
 
@@ -409,12 +408,6 @@ static struct tracing_map_elt *tracing_map_elt_alloc(struct tracing_map *map)
 		goto free;
 	}
 
-	elt->fields = kzalloc_objs(*elt->fields, map->n_fields);
-	if (!elt->fields) {
-		err = -ENOMEM;
-		goto free;
-	}
-
 	elt->vars = kzalloc_objs(*elt->vars, map->n_vars);
 	if (!elt->vars) {
 		err = -ENOMEM;
@@ -848,10 +841,10 @@ static int cmp_entries_sum(const void *A, const void *B)
 {
 	const struct tracing_map_elt *elt_a, *elt_b;
 	const struct tracing_map_sort_entry *a, *b;
-	struct tracing_map_sort_key *sort_key;
-	struct tracing_map_field *field;
+	const struct tracing_map_sort_key *sort_key;
+	const struct tracing_map_field *field;
 	tracing_map_cmp_fn_t cmp_fn;
-	void *val_a, *val_b;
+	const void *val_a, *val_b;
 	int ret = 0;
 
 	a = *(const struct tracing_map_sort_entry **)A;
@@ -879,10 +872,10 @@ static int cmp_entries_key(const void *A, const void *B)
 {
 	const struct tracing_map_elt *elt_a, *elt_b;
 	const struct tracing_map_sort_entry *a, *b;
-	struct tracing_map_sort_key *sort_key;
-	struct tracing_map_field *field;
+	const struct tracing_map_sort_key *sort_key;
+	const struct tracing_map_field *field;
 	tracing_map_cmp_fn_t cmp_fn;
-	void *val_a, *val_b;
+	const void *val_a, *val_b;
 	int ret = 0;
 
 	a = *(const struct tracing_map_sort_entry **)A;
diff --git a/kernel/trace/tracing_map.h b/kernel/trace/tracing_map.h
index 18a02959d77b..90a7fde5dd02 100644
--- a/kernel/trace/tracing_map.h
+++ b/kernel/trace/tracing_map.h
@@ -13,7 +13,7 @@
 #define TRACING_MAP_VARS_MAX		16
 #define TRACING_MAP_SORT_KEYS_MAX	2
 
-typedef int (*tracing_map_cmp_fn_t) (void *val_a, void *val_b);
+typedef int (*tracing_map_cmp_fn_t) (const void *val_a, const void *val_b);
 
 /*
  * This is an overview of the tracing_map data structures and how they
@@ -137,11 +137,11 @@ struct tracing_map_field {
 
 struct tracing_map_elt {
 	struct tracing_map		*map;
-	struct tracing_map_field	*fields;
 	atomic64_t			*vars;
 	bool				*var_set;
 	void				*key;
 	void				*private_data;
+	struct tracing_map_field	fields[];
 };
 
 struct tracing_map_entry {
@@ -260,8 +260,8 @@ tracing_map_lookup(struct tracing_map *map, void *key);
 
 extern tracing_map_cmp_fn_t tracing_map_cmp_num(int field_size,
 						int field_is_signed);
-extern int tracing_map_cmp_string(void *val_a, void *val_b);
-extern int tracing_map_cmp_none(void *val_a, void *val_b);
+extern int tracing_map_cmp_string(const void *val_a, const void *val_b);
+extern int tracing_map_cmp_none(const void *val_a, const void *val_b);
 
 extern void tracing_map_update_sum(struct tracing_map_elt *elt,
 				   unsigned int i, u64 n);
-- 
2.54.0



* Re: [PATCH v6 20/43] KVM: guest_memfd: Enable INIT_SHARED on guest_memfd for x86 Coco VMs
From: Ackerley Tng @ 2026-05-20 22:04 UTC (permalink / raw)
  To: Ackerley Tng via B4 Relay, aik, andrew.jones, binbin.wu, brauner,
	chao.p.peng, david, ira.weiny, jmattson, jthoughton, michael.roth,
	oupton, pankaj.gupta, qperret, rick.p.edgecombe, rientjes,
	shivankg, steven.price, tabba, willy, wyihan, yan.y.zhao,
	forkloop, pratyush, suzuki.poulose, aneesh.kumar, liam,
	Paolo Bonzini, Sean Christopherson, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
	Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka
  Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <20260507-gmem-inplace-conversion-v6-20-91ab5a8b19a4@google.com>

Ackerley Tng via B4 Relay <devnull+ackerleytng.google.com@kernel.org>
writes:

> From: Sean Christopherson <seanjc@google.com>
>
> Now that guest_memfd supports tracking private vs. shared within gmem
> itself, allow userspace to specify INIT_SHARED on a guest_memfd instance
> for x86 Confidential Computing (CoCo) VMs, so long as per-VM attributes
> are disabled, i.e. when it's actually possible for a guest_memfd instance
> to contain shared memory.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> ---
>  arch/x86/kvm/x86.c | 11 +++++------
>  1 file changed, 5 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 1560de1e95be0..6609957ecfea3 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -14172,14 +14172,13 @@ bool kvm_arch_no_poll(struct kvm_vcpu *vcpu)
>  }
>
>  #ifdef CONFIG_KVM_GUEST_MEMFD
> -/*
> - * KVM doesn't yet support initializing guest_memfd memory as shared for VMs
> - * with private memory (the private vs. shared tracking needs to be moved into
> - * guest_memfd).
> - */
>  bool kvm_arch_supports_gmem_init_shared(struct kvm *kvm)
>  {
> -	return !kvm_arch_has_private_mem(kvm);
> +	/*
> +	 * INIT_SHARED isn't supported if the memory attributes are per-VM,
> +	 * in which case guest_memfd can _only_ be used for private memory.
> +	 */
> +	return !vm_memory_attributes || !kvm_arch_has_private_mem(kvm);

Adding a note here from PUCK on 2026-05-20:

Michael pointed out that it's odd that when vm_memory_attributes is
available, guest_memfd still can only be used for private memory.

It is a little odd, but we don't want to investigate the complexities of
supporting it, and Sean says this is working as intended, in line with
deprecating vm_memory_attributes=true.
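
To spell out the condition in the quoted hunk, here is a small
userspace model. It is only a sketch: demo_supports_init_shared() is a
hypothetical name, and both inputs are plain booleans, whereas the
kernel reads vm_memory_attributes and calls kvm_arch_has_private_mem().

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical model of the quoted return statement. INIT_SHARED is
 * refused only when attributes are per-VM and the VM has private memory.
 */
static bool demo_supports_init_shared(bool vm_memory_attributes,
				      bool has_private_mem)
{
	return !vm_memory_attributes || !has_private_mem;
}
```

Of the four input combinations, only per-VM attributes together with
private memory yields false, which is the case discussed at PUCK.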

>  }
>
>  #ifdef CONFIG_HAVE_KVM_ARCH_GMEM_PREPARE
>
> --
> 2.54.0.563.g4f69b47b94-goog


* [PATCH] ftrace: Use flexible array for hash buckets
From: Rosen Penev @ 2026-05-20 22:00 UTC (permalink / raw)
  To: linux-trace-kernel
  Cc: Steven Rostedt, Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers,
	open list:FUNCTION HOOKS (FTRACE)

Store ftrace hash buckets in the ftrace_hash allocation instead of
allocating the bucket array separately.

This keeps the bucket storage tied to the hash lifetime and simplifies
the allocation and cleanup paths.

Assisted-by: Codex:GPT-5.5
Signed-off-by: Rosen Penev <rosenp@gmail.com>
---
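A userspace sketch of the sizing this change relies on, under stated
assumptions: demo_hash and demo_bucket are hypothetical stand-ins for
ftrace_hash and hlist_head, and calloc()/free() stand in for the kernel
allocator. It is meant to illustrate the layout, not to mirror
alloc_ftrace_hash() exactly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct demo_bucket {
	void *first;
};

struct demo_hash {
	unsigned long		size_bits;
	unsigned long		count;
	struct demo_bucket	buckets[];	/* 1 << size_bits entries */
};

/*
 * Bytes needed for a hash with 1 << size_bits buckets. Note that
 * sizeof(struct demo_hash) on its own does not count any buckets.
 */
static size_t demo_hash_bytes(unsigned int size_bits)
{
	return sizeof(struct demo_hash) +
	       ((size_t)1 << size_bits) * sizeof(struct demo_bucket);
}

static struct demo_hash *demo_hash_alloc(unsigned int size_bits)
{
	struct demo_hash *hash = calloc(1, demo_hash_bytes(size_bits));

	if (!hash)
		return NULL;

	hash->size_bits = size_bits;
	return hash;
}

/* The buckets live inside the allocation, so one free releases both. */
static void demo_hash_free(struct demo_hash *hash)
{
	free(hash);
}
```

The single free() is the userspace counterpart of dropping the separate
kfree(hash->buckets) in the diff below.
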
 kernel/trace/ftrace.c | 17 ++---------------
 kernel/trace/trace.h  |  2 +-
 2 files changed, 3 insertions(+), 16 deletions(-)

diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index b2611de3f594..25a9dca290dd 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -1082,10 +1082,7 @@ struct ftrace_func_probe {
  * it all the time. These are in a read only section such that if
  * anyone does try to modify it, it will cause an exception.
  */
-static const struct hlist_head empty_buckets[1];
-static const struct ftrace_hash empty_hash = {
-	.buckets = (struct hlist_head *)empty_buckets,
-};
+static const struct ftrace_hash empty_hash = {};
 #define EMPTY_HASH	((struct ftrace_hash *)&empty_hash)
 
 struct ftrace_ops global_ops = {
@@ -1295,7 +1292,6 @@ void free_ftrace_hash(struct ftrace_hash *hash)
 	if (!hash || hash == EMPTY_HASH)
 		return;
 	ftrace_hash_clear(hash);
-	kfree(hash->buckets);
 	kfree(hash);
 }
 
@@ -1333,20 +1329,11 @@ EXPORT_SYMBOL_GPL(ftrace_free_filter);
 struct ftrace_hash *alloc_ftrace_hash(int size_bits)
 {
 	struct ftrace_hash *hash;
-	int size;
 
-	hash = kzalloc_obj(*hash);
+	hash = kzalloc_flex(*hash, buckets, BIT(size_bits));
 	if (!hash)
 		return NULL;
 
-	size = 1 << size_bits;
-	hash->buckets = kzalloc_objs(*hash->buckets, size);
-
-	if (!hash->buckets) {
-		kfree(hash);
-		return NULL;
-	}
-
 	hash->size_bits = size_bits;
 
 	return hash;
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 80fe152af1dd..5a3f81f17317 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -1000,10 +1000,10 @@ enum {
 
 struct ftrace_hash {
 	unsigned long		size_bits;
-	struct hlist_head	*buckets;
 	unsigned long		count;
 	unsigned long		flags;
 	struct rcu_head		rcu;
+	struct hlist_head	buckets[];
 };
 
 struct ftrace_func_entry *
-- 
2.54.0



* [PATCH] tracing: Use flexible array for entry fetch code
From: Rosen Penev @ 2026-05-20 21:58 UTC (permalink / raw)
  To: linux-trace-kernel
  Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Kees Cook,
	Gustavo A. R. Silva, open list:TRACING,
	open list:KERNEL HARDENING (not covered by other areas):Keyword:\b__counted_by(_le|_be|_ptr)?\b

Store probe entry fetch instructions in the probe_entry_arg
allocation instead of allocating a separate instruction array.

This keeps the entry fetch code tied to the entry argument lifetime while
leaving regular probe_arg instruction arrays separately allocated and
freed.

Assisted-by: Codex:GPT-5.5
Signed-off-by: Rosen Penev <rosenp@gmail.com>
---
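A userspace sketch of the counted flexible array pattern, assuming
hypothetical names throughout (demo_entry_arg, demo_insn,
DEMO_COUNTED_BY) and calloc() in place of the kernel allocator. The
point it tries to show is the ordering: the counter is set before the
array is touched, which is also the order used in the hunk below.

```c
#include <assert.h>
#include <stdlib.h>

/* Use the counted_by attribute where the compiler offers it. */
#if defined(__has_attribute)
# if __has_attribute(counted_by)
#  define DEMO_COUNTED_BY(m)	__attribute__((counted_by(m)))
# endif
#endif
#ifndef DEMO_COUNTED_BY
# define DEMO_COUNTED_BY(m)
#endif

enum demo_op {
	DEMO_OP_NOP = 0,
	DEMO_OP_END,
};

struct demo_insn {
	enum demo_op op;
};

struct demo_entry_arg {
	unsigned int		size;	/* number of entries in code[] */
	struct demo_insn	code[] DEMO_COUNTED_BY(size);
};

static struct demo_entry_arg *demo_entry_arg_alloc(unsigned int nr_args)
{
	unsigned int n = 2 * nr_args + 1;
	struct demo_entry_arg *earg;
	unsigned int i;

	earg = calloc(1, sizeof(*earg) + n * sizeof(earg->code[0]));
	if (!earg)
		return NULL;

	/* Set the counter first so bounds checks see the real length. */
	earg->size = n;

	/* Fill the code buffer with 'end', as the original loop does. */
	for (i = 0; i < earg->size; i++)
		earg->code[i].op = DEMO_OP_END;

	return earg;
}
```
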
 kernel/trace/trace_probe.c | 8 +-------
 kernel/trace/trace_probe.h | 2 +-
 2 files changed, 2 insertions(+), 8 deletions(-)

diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index e0d3a0da26af..39f040c863e8 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -838,15 +838,10 @@ static int __store_entry_arg(struct trace_probe *tp, int argnum)
 	int i, offset, last_offset = 0;
 
 	if (!earg) {
-		earg = kzalloc_obj(*tp->entry_arg);
+		earg = kzalloc_flex(*earg, code, 2 * tp->nr_args + 1);
 		if (!earg)
 			return -ENOMEM;
 		earg->size = 2 * tp->nr_args + 1;
-		earg->code = kzalloc_objs(struct fetch_insn, earg->size);
-		if (!earg->code) {
-			kfree(earg);
-			return -ENOMEM;
-		}
 		/* Fill the code buffer with 'end' to simplify it */
 		for (i = 0; i < earg->size; i++)
 			earg->code[i].op = FETCH_OP_END;
@@ -2051,7 +2046,6 @@ void trace_probe_cleanup(struct trace_probe *tp)
 		traceprobe_free_probe_arg(&tp->args[i]);
 
 	if (tp->entry_arg) {
-		kfree(tp->entry_arg->code);
 		kfree(tp->entry_arg);
 		tp->entry_arg = NULL;
 	}
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
index 262d8707a3df..1076f1df347b 100644
--- a/kernel/trace/trace_probe.h
+++ b/kernel/trace/trace_probe.h
@@ -238,8 +238,8 @@ struct probe_arg {
 };
 
 struct probe_entry_arg {
-	struct fetch_insn	*code;
 	unsigned int		size;	/* The entry data size */
+	struct fetch_insn	code[] __counted_by(size);
 };
 
 struct trace_uprobe_filter {
-- 
2.54.0



* [PATCHv2] tracing: simplify pages allocation
From: Rosen Penev @ 2026-05-20 21:50 UTC (permalink / raw)
  To: linux-trace-kernel
  Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Kees Cook,
	Gustavo A. R. Silva, open list:TRACING,
	open list:KERNEL HARDENING (not covered by other areas):Keyword:\b__counted_by(_le|_be|_ptr)?\b

Convert pages to a flexible array member so that it is allocated
together with the tracing_map_array struct.

This simplifies the code slightly by removing the NULL checks for
pages, which no longer make sense, and the separate kfree() calls.

Signed-off-by: Rosen Penev <rosenp@gmail.com>
---
 v2: add back kfree(a). Accidentally removed.
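 A worked userspace sketch of the geometry that now has to be computed
 before the allocation, since n_pages determines the allocation size.
 demo_fls(), demo_roundup_pow_of_two() and DEMO_PAGE_SIZE are
 hypothetical stand-ins for the kernel helpers and PAGE_SIZE, and a
 4096-byte page is an assumption.

```c
#include <assert.h>

#define DEMO_PAGE_SIZE	4096u	/* assumption: 4 KiB pages */

/* Position of the highest set bit, 1-based; 0 for an input of 0. */
static unsigned int demo_fls(unsigned int x)
{
	unsigned int r = 0;

	while (x) {
		r++;
		x >>= 1;
	}
	return r;
}

/* Smallest power of two >= x; assumes x is small (<= DEMO_PAGE_SIZE). */
static unsigned int demo_roundup_pow_of_two(unsigned int x)
{
	unsigned int p = 1;

	while (p < x)
		p <<= 1;
	return p;
}

struct demo_geometry {
	unsigned int entry_size_shift;
	unsigned int entries_per_page;
	unsigned int n_pages;
};

/* Same arithmetic as the hunk below, with at least one page. */
static struct demo_geometry demo_array_geometry(unsigned int n_elts,
						unsigned int entry_size)
{
	struct demo_geometry g;

	g.entry_size_shift = demo_fls(demo_roundup_pow_of_two(entry_size) - 1);
	g.entries_per_page = DEMO_PAGE_SIZE / (1u << g.entry_size_shift);
	g.n_pages = n_elts / g.entries_per_page;
	if (!g.n_pages)
		g.n_pages = 1;

	return g;
}
```

 For example, 24-byte entries round up to 32 bytes, giving 128 entries
 per page, so 4096 elements need 32 page pointers in the flexible array.
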
 kernel/trace/tracing_map.c | 30 +++++++++++-------------------
 kernel/trace/tracing_map.h |  2 +-
 2 files changed, 12 insertions(+), 20 deletions(-)

diff --git a/kernel/trace/tracing_map.c b/kernel/trace/tracing_map.c
index c59326ad7a84..97f7e3cde262 100644
--- a/kernel/trace/tracing_map.c
+++ b/kernel/trace/tracing_map.c
@@ -288,9 +288,6 @@ static void tracing_map_array_clear(struct tracing_map_array *a)
 {
 	unsigned int i;

-	if (!a->pages)
-		return;
-
 	for (i = 0; i < a->n_pages; i++)
 		memset(a->pages[i], 0, PAGE_SIZE);
 }
@@ -302,9 +299,6 @@ static void tracing_map_array_free(struct tracing_map_array *a)
 	if (!a)
 		return;

-	if (!a->pages)
-		goto free;
-
 	for (i = 0; i < a->n_pages; i++) {
 		if (!a->pages[i])
 			break;
@@ -312,9 +306,6 @@ static void tracing_map_array_free(struct tracing_map_array *a)
 		free_page((unsigned long)a->pages[i]);
 	}

-	kfree(a->pages);
-
- free:
 	kfree(a);
 }

@@ -322,24 +313,25 @@ static struct tracing_map_array *tracing_map_array_alloc(unsigned int n_elts,
 						  unsigned int entry_size)
 {
 	struct tracing_map_array *a;
+	unsigned int entry_size_shift;
+	unsigned int entries_per_page;
+	unsigned int n_pages;
 	unsigned int i;

-	a = kzalloc_obj(*a);
+	entry_size_shift = fls(roundup_pow_of_two(entry_size) - 1);
+	entries_per_page = PAGE_SIZE / (1 << entry_size_shift);
+	n_pages = max(1, n_elts / entries_per_page);
+
+	a = kzalloc_flex(*a, pages, n_pages);
 	if (!a)
 		return NULL;

-	a->entry_size_shift = fls(roundup_pow_of_two(entry_size) - 1);
-	a->entries_per_page = PAGE_SIZE / (1 << a->entry_size_shift);
-	a->n_pages = n_elts / a->entries_per_page;
-	if (!a->n_pages)
-		a->n_pages = 1;
+	a->entry_size_shift = entry_size_shift;
+	a->entries_per_page = entries_per_page;
+	a->n_pages = n_pages;
 	a->entry_shift = fls(a->entries_per_page) - 1;
 	a->entry_mask = (1 << a->entry_shift) - 1;

-	a->pages = kcalloc(a->n_pages, sizeof(void *), GFP_KERNEL);
-	if (!a->pages)
-		goto free;
-
 	for (i = 0; i < a->n_pages; i++) {
 		a->pages[i] = (void *)get_zeroed_page(GFP_KERNEL);
 		if (!a->pages[i])
diff --git a/kernel/trace/tracing_map.h b/kernel/trace/tracing_map.h
index ed64136782d8..90a7fde5dd02 100644
--- a/kernel/trace/tracing_map.h
+++ b/kernel/trace/tracing_map.h
@@ -167,7 +167,7 @@ struct tracing_map_array {
 	unsigned int entry_shift;
 	unsigned int entry_mask;
 	unsigned int n_pages;
-	void **pages;
+	void *pages[] __counted_by(n_pages);
 };

 #define TRACING_MAP_ARRAY_ELT(array, idx)				\
--
2.54.0



* Re: [PATCH v6 05/43] KVM: guest_memfd: Wire up kvm_get_memory_attributes() to per-gmem attributes
From: Ackerley Tng @ 2026-05-20 21:44 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	ira.weiny, jmattson, jthoughton, michael.roth, oupton,
	pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
	steven.price, willy, wyihan, yan.y.zhao, forkloop, pratyush,
	suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
	Sean Christopherson, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
	Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka,
	kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <CA+EHjTw-cUM=FrJevtSDtR7K6MwUfGfOx21LMFDn7DAy5bFzYw@mail.gmail.com>

Fuad Tabba <tabba@google.com> writes:

>
> [...snip...]
>
>> +unsigned long kvm_gmem_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
>> +{
>> +       struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
>> +       struct inode *inode;
>> +
>> +       /*
>> +        * If this gfn has no associated memslot, there's no chance of the gfn
>> +        * being backed by private memory, since guest_memfd must be used for
>> +        * private memory, and guest_memfd must be associated with some memslot.
>> +        */
>> +       if (!slot)
>> +               return 0;
>> +
>> +       CLASS(gmem_get_file, file)(slot);
>> +       if (!file)
>> +               return 0;
>> +
>> +       inode = file_inode(file);
>> +
>> +       /*
>> +        * Rely on the maple tree's internal RCU lock to ensure a
>> +        * stable result. This result can become stale as soon as the
>> +        * lock is dropped, so the caller _must_ still protect
>> +        * consumption of private vs. shared by checking
>> +        * mmu_invalidate_retry_gfn() under mmu_lock to serialize
>> +        * against ongoing attribute updates.
>> +        */
>> +       return kvm_gmem_get_attributes(inode, kvm_gmem_get_index(slot, gfn));
>> +}
>
> Doesn't this imply that all consumers of kvm_mem_is_private() should
> validate the result using mmu_lock and the invalidation sequence?

Let me know how I can improve the comment.

I think the "consumption" of private vs shared here actually means
something like "don't commit a page being faulted into page tables based
on the result of kvm_gmem_get_memory_attributes() without checking
kvm->mmu_invalidate_in_progress.", since a racing conversion may
complete before you commit.
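
As a rough illustration of what "checking before committing" means,
here is a single-threaded userspace model. All demo_* names are
hypothetical, there is no real locking or memory ordering in it, and it
is only a sketch of the snapshot-then-recheck idea rather than KVM's
implementation.

```c
#include <assert.h>
#include <stdbool.h>

struct demo_vm {
	unsigned long	invalidate_seq;
	int		invalidate_in_progress;
	bool		gfn_is_private;
};

static void demo_invalidate_begin(struct demo_vm *vm)
{
	vm->invalidate_in_progress++;
}

static void demo_invalidate_end(struct demo_vm *vm)
{
	vm->invalidate_seq++;
	vm->invalidate_in_progress--;
}

struct demo_fault {
	unsigned long	seq;		/* sequence seen before the lookup */
	bool		is_private;	/* possibly stale lookup result */
};

/* Snapshot the sequence, then do the attribute lookup. */
static struct demo_fault demo_fault_begin(const struct demo_vm *vm)
{
	struct demo_fault f = {
		.seq = vm->invalidate_seq,
		.is_private = vm->gfn_is_private,
	};

	return f;
}

/*
 * Returns true if the fault may be committed. In the kernel the
 * equivalent check runs with mmu_lock held.
 */
static bool demo_fault_commit(const struct demo_vm *vm,
			      const struct demo_fault *f)
{
	if (vm->invalidate_in_progress || vm->invalidate_seq != f->seq)
		return false;	/* lookup may be stale, retry the fault */

	return true;
}
```

A lookup that races with a conversion is rejected at commit time in
this model, whether the conversion is still in flight or already done.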

kvm_mem_is_private() is used from these places:

1. Fault handling in KVM, like page_fault_can_be_fast(),
   kvm_mmu_faultin_pfn(), kvm_mmu_page_fault(): this already handles the
   entire mmu_lock and invalidation dance. No fault will be committed if
   a racing conversion happened after kvm_mem_is_private() but before
   the commit.

2. kvm_mmu_max_mapping_level(), when recovering huge pages after
   disabling dirty logging: this path can't be reached with guest_memfd
   today, since dirty logging can't be used with guest_memfd and
   guest_memfd memslots are not updatable. Even setting that aside, it
   holds mmu_lock throughout until the huge page recovery is done.
   invalidate_begin also involves zapping the pages in the range, so if
   the order of events is

   | Thread A                     | Thread B          |
   |------------------------------|-------------------|
   | invalidate_begin + zap       |                   |
   | update attributes maple_tree | recover huge page |
   | invalidate_end               |                   |

   then recovery never sees the zapped pages: there is nothing to
   recover, so no kvm_mem_is_private() lookup happens.

3. kvm_arch_vcpu_pre_fault_memory()

   This eventually calls kvm_tdp_mmu_page_fault(), which checks
   is_page_fault_stale(), so it does check before committing.

Were there any other calls I missed?

> sev_handle_rmp_fault() calls kvm_mem_is_private() without holding
> mmu_lock and without any retry mechanism. Is that a problem?
>

Sean already replied on your actual question separately :)

> Cheers,
> /fuad
>
>
>>
>> [...snip...]
>>


* Re: [PATCH v2] perf/ftrace: Fix WARNING in __unregister_ftrace_function
From: Steven Rostedt @ 2026-05-20 20:41 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel, kernel-team, sashiko-bot, sashiko-reviews
In-Reply-To: <20260513161916.04151502@fangorn>


Rik,

Can you reply to Sashiko here?

The below comes from Sashiko: https://sashiko.dev/?list=org.kernel.vger.linux-trace-kernel#/patchset/20260513161916.04151502%40fangorn

Thanks,

-- Steve


> commit f90ac36f39b79a19c015522bb1caf7ef903caf02
> Author: Rik van Riel <riel@surriel.com>
> perf/ftrace: Fix WARNING in __unregister_ftrace_function
> This commit attempts to fix a WARN_ON triggered during perf_event_alloc() error
> cleanup by checking FTRACE_OPS_FL_ENABLED before unregistering the ftrace
> function. The warning occurs because the ftrace_ops registration failed or was
> torn down by perf_try_init_event() earlier, but the event->destroy callback is
> executed again during the error path.
> > diff --git a/kernel/trace/trace_event_perf.c b/kernel/trace/trace_event_perf.c
> > index 61e3a2620fa3c..8d64f17135e8f 100644
> > --- a/kernel/trace/trace_event_perf.c
> > +++ b/kernel/trace/trace_event_perf.c
> > @@ -495,7 +495,11 @@ static int perf_ftrace_function_register(struct perf_event *event)
> >  static int perf_ftrace_function_unregister(struct perf_event *event)
> >  {
> >  	struct ftrace_ops *ops = &event->ftrace_ops;
> > -	int ret = unregister_ftrace_function(ops);
> > +	int ret = 0;
> > +
> > +	if (ops->flags & FTRACE_OPS_FL_ENABLED)
> > +		ret = unregister_ftrace_function(ops);
> > +
> >  	ftrace_free_filter(ops);

> Does calling ftrace_free_filter() unconditionally here mask a double-teardown
> regression while leaving the underlying double-free active?
> In perf_try_init_event(), if a PMU event_init() succeeds but a subsequent
> capability check fails, it explicitly calls event->destroy(event) to roll back:
> kernel/events/core.c:perf_try_init_event() {
>     ...
> 		if (ret && event->destroy)
> 			event->destroy(event);
>     ...
> }
> However, it does not set event->destroy to NULL.
> When the event is later freed in error paths, the stale event->destroy callback
> executes a second time.
> If ftrace_free_filter() is called twice, won't it result in a direct
> double-free of ops->func_hash->filter_hash, since ftrace_free_filter() frees
> the hash but does not nullify the pointers?
> Additionally, for regular tracepoint events, wouldn't the second teardown
> execute perf_trace_event_unreg() again?
> This could double-decrement total_ref_count, potentially freeing the shared
> perf_trace_buf while other events are active, and trigger an unbalanced
> module_put() leading to module refcount underflows.
> Should the root cause be addressed by setting event->destroy = NULL in
> perf_try_init_event() immediately after invoking it?

> >  	return ret;
> >  }
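
For reference, the "clear the callback after invoking it" idea raised
in the review can be sketched in userspace as below. demo_event and the
demo_* helpers are hypothetical, and this says nothing about whether
perf_try_init_event() should actually be changed that way.

```c
#include <assert.h>
#include <stddef.h>

struct demo_event {
	void	(*destroy)(struct demo_event *event);
	int	destroy_calls;
};

static void demo_destroy(struct demo_event *event)
{
	event->destroy_calls++;
}

/* Run the teardown callback at most once by clearing it afterwards. */
static void demo_event_teardown(struct demo_event *event)
{
	if (event->destroy) {
		event->destroy(event);
		event->destroy = NULL;
	}
}
```

With the pointer cleared, a second teardown on a later error path is a
no-op instead of a repeated destroy.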



* Re: [PATCH v6 06/43] KVM: x86/mmu: Bug the VM if gmem attributes are queried to determine max mapping level
From: Sean Christopherson @ 2026-05-20 20:39 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: Fuad Tabba, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
	david, ira.weiny, jmattson, jthoughton, michael.roth, oupton,
	pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
	steven.price, willy, wyihan, yan.y.zhao, forkloop, pratyush,
	suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
	Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka, kvm,
	linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <CAEvNRgGHHkvfJ-mn9rfDvS+_1ht08YatFWo-Swt+5wFSPnC9Nw@mail.gmail.com>

On Wed, May 20, 2026, Ackerley Tng wrote:
> Sean Christopherson <seanjc@google.com> writes:
> >> >         } else {
> >> > +               /*
> >> > +                * Memory attributes cannot be obtained from guest_memfd while
> >> > +                * the MMU lock is held.
> >> > +                */
> >> > +               if (KVM_BUG_ON(static_call_query(__kvm_get_memory_attributes) ==
> >> > +                              kvm_gmem_get_memory_attributes, kvm)) {
> >> > +                       return 0;
> >> > +               }
> >> > +
> >>
> >> This directly takes the address of kvm_gmem_get_memory_attributes,
> >> which is only compiled if CONFIG_KVM_GUEST_MEMFD=y. This breaks
> >> ARCH=i386.
> >
> > And this bleeds guest_memfd implementation details into places they don't belong.
> > The right way to deal with this is to use lockdep_assert_not_held() in whatever
> > code mustn't run with mmu_lock held.  E.g.
> >
> > diff --git virt/kvm/guest_memfd.c virt/kvm/guest_memfd.c
> > index c9f155c2dc5c..3bea9c1137ef 100644
> > --- virt/kvm/guest_memfd.c
> > +++ virt/kvm/guest_memfd.c
> > @@ -547,6 +547,9 @@ unsigned long kvm_gmem_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
> >         struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
> >         struct inode *inode;
> >
> > +       /* Comment goes here. */
> > +       lockdep_assert_not_held(&kvm->mmu_lock);
> > +
> >         /*
> >          * If this gfn has no associated memslot, there's no chance of the gfn
> >          * being backed by private memory, since guest_memfd must be used for
> >
> > But I'm confused, because kvm_gmem_get_memory_attributes() doesn't actually take
> > filemap_invalidate_lock(), so what exactly is the problem?
> >
> 
> Ahh I can drop this patch now. kvm_gmem_get_memory_attributes() used to
> take the filemap_invalidate_lock(), but after Liam pointed out that
> the attributes maple tree should be using MT_FLAGS_USE_RCU, I stopped
> taking filemap_invalidate_lock() and forgot to undo this.
> 
> I'll wait a bit for more reviews and then put out another revision
> without this patch.

If this is the only issue with v6, don't send a new version, I'll just drop it
when applying.


* Re: [PATCH v6 12/43] KVM: guest_memfd: Call arch invalidate hooks on conversion
From: Ackerley Tng @ 2026-05-20 20:35 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	ira.weiny, jmattson, jthoughton, michael.roth, oupton,
	pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
	steven.price, willy, wyihan, yan.y.zhao, forkloop, pratyush,
	suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
	Sean Christopherson, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
	Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka,
	kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <CA+EHjTwOfJ=nCRoX3m5uDt=CcF_zrqGsZVBBmo4MscmPqrBxOA@mail.gmail.com>

Fuad Tabba <tabba@google.com> writes:

> On Thu, 7 May 2026 at 21:22, Ackerley Tng via B4 Relay
> <devnull+ackerleytng.google.com@kernel.org> wrote:
>>
>> From: Ackerley Tng <ackerleytng@google.com>
>>
>> When memory in guest_memfd is converted from private to shared, the
>> platform-specific state associated with the guest-private pages must be
>> invalidated or cleaned up.
>>
>> Iterate over the folios in the affected range and call the
>> kvm_arch_gmem_invalidate() hook for each PFN range. This allows
>> architectures to perform necessary teardown, such as updating hardware
>> metadata or encryption states, before the pages are transitioned to the
>> shared state.
>>
>> Invoke this helper after indicating to KVM's mmu code that an invalidation
>> is in progress to stop in-flight page faults from succeeding.
>>
>> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
>
> Minor nit below, but lgtm.
>
> Reviewed-by: Fuad Tabba <tabba@google.com>
>
> Cheers,
> /fuad
>
>> ---
>>  virt/kvm/guest_memfd.c | 41 +++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 41 insertions(+)
>>
>> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
>> index 9d82642a025e9..baf4b88dead1f 100644
>> --- a/virt/kvm/guest_memfd.c
>> +++ b/virt/kvm/guest_memfd.c
>> @@ -603,6 +603,42 @@ static bool kvm_gmem_is_safe_for_conversion(struct inode *inode, pgoff_t start,
>>         return safe;
>>  }
>>
>> +#ifdef CONFIG_HAVE_KVM_ARCH_GMEM_INVALIDATE
>> +static void kvm_gmem_invalidate(struct inode *inode, pgoff_t start, pgoff_t end)
>> +{
>> +       struct folio_batch fbatch;
>> +       pgoff_t next = start;
>> +       int i;
>> +
>> +       folio_batch_init(&fbatch);
>> +       while (filemap_get_folios(inode->i_mapping, &next, end - 1, &fbatch)) {
>> +               for (i = 0; i < folio_batch_count(&fbatch); ++i) {
>> +                       struct folio *folio = fbatch.folios[i];
>> +                       pgoff_t start_index, end_index;
>> +                       kvm_pfn_t start_pfn, end_pfn;
>> +
>> +                       start_index = max(start, folio->index);
>> +                       end_index = min(end, folio_next_index(folio));
>> +                       /*
>> +                        * end_index is either in folio or points to
>> +                        * the first page of the next folio. Hence,
>> +                        * all pages in range [start_index, end_index)
>> +                        * are contiguous.
>> +                        */
>> +                       start_pfn = folio_file_pfn(folio, start_index);
>> +                       end_pfn = start_pfn + end_index - start_index;
>> +
>> +                       kvm_arch_gmem_invalidate(start_pfn, end_pfn);
>> +               }
>> +
>> +               folio_batch_release(&fbatch);
>> +               cond_resched();
>> +       }
>> +}
>> +#else
>> +static void kvm_gmem_invalidate(struct inode *inode, pgoff_t start, pgoff_t end) {}
>> +#endif
>> +
>>  static int __kvm_gmem_set_attributes(struct inode *inode, pgoff_t start,
>>                                      size_t nr_pages, uint64_t attrs,
>>                                      pgoff_t *err_index)
>> @@ -643,7 +679,12 @@ static int __kvm_gmem_set_attributes(struct inode *inode, pgoff_t start,
>>          */
>>
>>         kvm_gmem_invalidate_begin(inode, start, end);
>> +
>> +       if (!to_private)
>> +               kvm_gmem_invalidate(inode, start, end);
>> +
>>         mas_store_prealloc(&mas, xa_mk_value(attrs));
>> +
>
> Why the unrelated extra space?
>

Hmm this space provides vertical space between invalidate_{begin,end}
and the STUFF it wraps, like

  invalidate_begin

  STUFF

  invalidate_end

More STUFF is going to go here in future patch series, such as splitting
private pages in TDX.

>>         kvm_gmem_invalidate_end(inode, start, end);
>>  out:
>>         filemap_invalidate_unlock(mapping);
>>
>> --
>> 2.54.0.563.g4f69b47b94-goog
>>
>>
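
A worked userspace sketch of the per-folio clamping in the quoted
kvm_gmem_invalidate(), for readers following the index arithmetic. The
demo_* names are hypothetical, and the model assumes a folio's PFNs are
contiguous starting at folio_first_pfn for index folio_index.

```c
#include <assert.h>

struct demo_pfn_range {
	unsigned long start_pfn;
	unsigned long end_pfn;		/* exclusive */
};

/*
 * Clamp the index range [start, end) to one folio covering indices
 * [folio_index, folio_index + folio_nr_pages) and convert to PFNs.
 */
static struct demo_pfn_range demo_folio_pfns(unsigned long start,
					     unsigned long end,
					     unsigned long folio_index,
					     unsigned long folio_nr_pages,
					     unsigned long folio_first_pfn)
{
	unsigned long folio_next = folio_index + folio_nr_pages;
	unsigned long start_index = start > folio_index ? start : folio_index;
	unsigned long end_index = end < folio_next ? end : folio_next;
	struct demo_pfn_range r;

	r.start_pfn = folio_first_pfn + (start_index - folio_index);
	r.end_pfn = r.start_pfn + end_index - start_index;

	return r;
}
```

For a conversion of indices [3, 10) over two 8-page folios, the first
folio contributes 5 PFNs and the second contributes 2.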


* Re: [PATCH v2] tracing/osnoise: Array printk init and cleanup
From: Steven Rostedt @ 2026-05-20 20:28 UTC (permalink / raw)
  To: Crystal Wood
  Cc: linux-trace-kernel, John Kacur, Tomas Glozar, Costa Shulyupin,
	Wander Lairson Costa, sashiko-bot, sashiko-reviews
In-Reply-To: <20260511223035.1475676-1-crwood@redhat.com>


[ Replying to Sashiko: https://sashiko.dev/?list=org.kernel.vger.linux-trace-kernel#/patchset/20260511223035.1475676-1-crwood%40redhat.com ]

> commit 859dc1eded9e36304f27e72643f28f30f327f078
> Author: Crystal Wood <crwood@redhat.com>
> tracing/osnoise: Array printk init and cleanup
> This commit initializes the trace array buffer for osnoise instances on
> creation. It also introduces the osnoise_print() helper function and switches
> the tracer to use trace_array_printk() instead of trace_array_printk_buf().
> > Use trace_array_printk() instead of trace_array_printk_buf(), as we're
> > only writing to the main buffer (of a non-main instance) anyway -- and

> Is this description accurate? Can the osnoise tracer also be enabled on the
> global trace array, for example by writing to current_tracer?
> If it is used on the global instance, trace_array_init_printk() explicitly
> returns -EINVAL and refuses to allocate a buffer, and trace_array_printk()
> will return 0 without printing.

Hmm, I guess it would drop any of these prints if it's using the top level
trace array. But I believe this is for use with RTLA which always creates a
trace instance.

> > diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c
> > index 75678053b21c5..2be188768ab42 100644
> > --- a/kernel/trace/trace_osnoise.c
> > +++ b/kernel/trace/trace_osnoise.c
> > @@ -83,6 +83,22 @@ struct osnoise_instance {
> >  
> >  static struct list_head osnoise_instances;
> >  
> > +static void osnoise_print(const char *fmt, ...)
> > +{
> > +	struct osnoise_instance *inst;
> > +	struct trace_array *tr;
> > +	va_list ap;
> > +
> > +	rcu_read_lock();
> > +	list_for_each_entry_rcu(inst, &osnoise_instances, list) {
> > +		tr = inst->tr;
> > +		va_start(ap, fmt);
> > +		trace_array_vprintk(tr, _RET_IP_, fmt, ap);

> Does this code create a use-after-free on the trace array if an instance is
> removed concurrently?
> When a user deletes a trace instance via rmdir, the unregister function
> removes the instance from the list using list_del_rcu(). However, the removal
> routine does not appear to wait for an RCU grace period before freeing the
> trace array itself.
> Could a concurrent execution of this loop inside the rcu_read_lock() section
> still access the unlinked instance, read the freed inst->tr, and pass it to
> trace_array_vprintk()? This appears to be an existing issue, but it still
> affects the loop here.

Hmm, this is interesting. osnoise keeps track of its own instances via an
osnoise_instances list. But it only uses kfree_rcu() to free the list
descriptor and doesn't take care of the tr being freed beforehand!

Something like this could work [not even compiled]

diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c
index 75678053b21c..bda1e0e0d2e1 100644
--- a/kernel/trace/trace_osnoise.c
+++ b/kernel/trace/trace_osnoise.c
@@ -476,8 +476,11 @@ static void print_osnoise_headers(struct seq_file *s)
 										\
 	rcu_read_lock();							\
 	list_for_each_entry_rcu(inst, &osnoise_instances, list) {		\
+		if (trace_array_get(inst->tr) < 0)				\
+			continue;						\
 		buffer = inst->tr->array_buffer.buffer;				\
 		trace_array_printk_buf(buffer, _THIS_IP_, msg);			\
+		trace_array_put(inst->tr);					\
 	}									\
 	rcu_read_unlock();							\
 	osnoise_data.tainted = true;						\
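
The idea in the sketch above is to pin each trace array before touching it and
to skip instances that are already going away. A minimal userspace
illustration of that "get, use, put" pattern follows. It is only a sketch
under stated assumptions: struct obj, obj_get(), obj_put() and print_all() are
hypothetical names, it uses a C11 "increment unless zero" reference count
rather than whatever trace_array_get() does internally, and it does not model
RCU or trace_types_lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical refcounted object standing in for a trace array. */
struct obj {
	atomic_int refs;	/* 0 means the object is being torn down */
	int prints;		/* number of messages "written" to it */
};

/* Take a reference unless the count has already dropped to zero. */
static bool obj_get(struct obj *o)
{
	int old = atomic_load(&o->refs);

	while (old > 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&o->refs, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; returns true when the caller dropped the last one. */
static bool obj_put(struct obj *o)
{
	return atomic_fetch_sub(&o->refs, 1) == 1;
}

/* "Print" to every object that can still be pinned; return how many were hit. */
static int print_all(struct obj **objs, int n)
{
	int hit = 0;

	assert(objs && n >= 0);
	for (int i = 0; i < n; i++) {
		if (!obj_get(objs[i]))
			continue;	/* going away: skip it */
		objs[i]->prints++;
		obj_put(objs[i]);
		hit++;
	}
	return hit;
}
```

Calling print_all() on one live object (refs == 1) and one dying object
(refs == 0) touches only the live one and leaves its count unchanged.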


> > +		va_end(ap);
> > +	}
> > +	rcu_read_unlock();
> > +}
> [ ... ]
> > @@ -123,6 +139,7 @@ static int osnoise_register_instance(struct trace_array *tr)
> >  	 * trace_types_lock.
> >  	 */
> >  	lockdep_assert_held(&trace_types_lock);
> > +	trace_array_init_printk(tr);

> Could this leave the per-CPU buffer unallocated when running on the global
> instance?
> If the tracer is used on the global trace array, trace_array_init_printk()
> returns -EINVAL. Because this return value is ignored here, the buffer
> remains unallocated, causing trace_array_vprintk() to fail internally.

Well, it will fail anyway because it is the top level trace buffer.

> [ ... ]

> > @@ -1189,10 +1198,10 @@ static __always_inline void osnoise_stop_exception(char *msg, int cpu)
> >  	rcu_read_lock();
> >  	list_for_each_entry_rcu(inst, &osnoise_instances, list) {
> >  		tr = inst->tr;
> > -		trace_array_printk_buf(tr->array_buffer.buffer, _THIS_IP_,
> > -				       "stop tracing hit on cpu %d due to exception: %s\n",
> > -				       smp_processor_id(),
> > -				       msg);
> > +		trace_array_printk(tr, _THIS_IP_,
> > +				   "stop tracing hit on cpu %d due to exception: %s\n",
> > +				   smp_processor_id(),
> > +				   msg);

> Does this switch to trace_array_printk() silently drop messages when tracing
> on the global instance?
> Looking at trace_array_printk(), it explicitly returns 0 if the
> TRACE_ARRAY_FL_GLOBAL flag is set on the trace array:
> kernel/trace/trace_printk.c:trace_array_printk() {
>     ...
> 	if (tr->flags & TRACE_ARRAY_FL_GLOBAL)
> 		return 0;
>     ...
> }
> This means stop and exception logs would be discarded if the tracer is
> running on the global instance.

Yep. But maybe that's not an issue?

-- Steve


^ permalink raw reply related

* Re: [PATCH v6 06/43] KVM: x86/mmu: Bug the VM if gmem attributes are queried to determine max mapping level
From: Ackerley Tng @ 2026-05-20 20:25 UTC (permalink / raw)
  To: Sean Christopherson, Fuad Tabba
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	ira.weiny, jmattson, jthoughton, michael.roth, oupton,
	pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
	steven.price, willy, wyihan, yan.y.zhao, forkloop, pratyush,
	suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
	Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka, kvm,
	linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <ag3DawWCcrpCkD0e@google.com>

Sean Christopherson <seanjc@google.com> writes:

>
> [...snip...]
>
>> > @@ -3357,6 +3357,15 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm, struct kvm_page_fault *fault,
>> >                 max_level = fault->max_level;
>> >                 is_private = fault->is_private;
>> >         } else {
>> > +               /*
>> > +                * Memory attributes cannot be obtained from guest_memfd while
>> > +                * the MMU lock is held.
>> > +                */
>> > +               if (KVM_BUG_ON(static_call_query(__kvm_get_memory_attributes) ==
>> > +                              kvm_gmem_get_memory_attributes, kvm)) {
>> > +                       return 0;
>> > +               }
>> > +
>>
>> This directly takes the address of kvm_gmem_get_memory_attributes,
>> which is only compiled if CONFIG_KVM_GUEST_MEMFD=y. This breaks
>> ARCH=i386.
>
> And this bleeds guest_memfd implementation details into places they don't belong.
> The right way to deal with this is to use lockdep_assert_not_held() in whatever
> code mustn't run with mmu_lock held.  E.g.
>
> diff --git virt/kvm/guest_memfd.c virt/kvm/guest_memfd.c
> index c9f155c2dc5c..3bea9c1137ef 100644
> --- virt/kvm/guest_memfd.c
> +++ virt/kvm/guest_memfd.c
> @@ -547,6 +547,9 @@ unsigned long kvm_gmem_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
>         struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
>         struct inode *inode;
>
> +       /* Comment goes here. */
> +       lockdep_assert_not_held(&kvm->mmu_lock);
> +
>         /*
>          * If this gfn has no associated memslot, there's no chance of the gfn
>          * being backed by private memory, since guest_memfd must be used for
>
> But I'm confused, because kvm_gmem_get_memory_attributes() doesn't actually take
> filemap_invalidate_lock(), so what exactly is the problem?
>

Ahh I can drop this patch now. kvm_gmem_get_memory_attributes() used to
take the filemap_invalidate_lock(), but after Liam pointed out that
the attributes maple tree should be using MT_FLAGS_USE_RCU, I stopped
taking filemap_invalidate_lock() and forgot to undo this.

I'll wait a bit for more reviews and then put out another revision
without this patch.

>> >                 max_level = PG_LEVEL_NUM;
>> >                 is_private = kvm_mem_is_private(kvm, gfn);
>> >         }
>> >
>> > --
>> > 2.54.0.563.g4f69b47b94-goog
>> >
>> >

^ permalink raw reply

* Re: [PATCH v2] tracepoint: add lockdep rcu_is_watching() check to trace_##name##_enabled()
From: Steven Rostedt @ 2026-05-20 19:55 UTC (permalink / raw)
  To: David Carlier
  Cc: linux-trace-kernel, linux-kernel, mhiramat, mathieu.desnoyers,
	peterz, vineeth, sashiko-bot, sashiko-reviews
In-Reply-To: <20260430144159.10985-1-devnexen@gmail.com>


[ Replying to Sashiko ]

> commit 7596bdcd9abf9a26b3964f10b5d5de3249e731e9
> Author: David Carlier <devnexen@gmail.com>
> tracepoint: add lockdep rcu_is_watching() check to trace_##name##_enabled()
> This commit attempts to restore RCU lockdep coverage for trace_call__##name()
> callers by adding a warning directly to the trace_##name##_enabled() static
> branch check.
> > When the tracepoint is enabled, the rcu_dereference inside
> > __DO_TRACE_CALL() already trips under PROVE_RCU, so the warning is
> > only needed on the !enabled path.
> Is this description accurate regarding __DO_TRACE_CALL()?
> Looking at the implementation, it uses rcu_dereference_raw() which
> intentionally bypasses lockdep checking. This means trace_call__##name()
> lacks lockdep coverage on both the enabled and disabled paths.
> > diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
> > --- a/include/linux/tracepoint.h
> > +++ b/include/linux/tracepoint.h
> > @@ -293,6 +293,10 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
> >  	static inline bool						\
> >  	trace_##name##_enabled(void)					\
> >  	{								\
> > +		if (IS_ENABLED(CONFIG_LOCKDEP)) {			\
> > +			WARN_ONCE(!rcu_is_watching(),			\
> > +				  "RCU not watching for tracepoint");	\
> > +		}							\
> >  		return static_branch_unlikely(&__tracepoint_##name.key);\
> >  	}

> Does this unconditional check cause spurious lockdep splats?
> Unlike the original trace_##name() macro, trace_##name##_enabled() takes no
> arguments and cannot evaluate the tracepoint's condition, such as
> cpu_online(raw_smp_processor_id()). If a subsystem queries
> trace_##name##_enabled() on an offline CPU where RCU is not watching, won't
> this trigger a false positive warning, even if the tracepoint itself would
> have been safely skipped by its condition?

If the CPU is offline, you shouldn't be using trace_##name##_enabled() to
begin with.


> Furthermore, is it already safe to query a static branch outside of an RCU
> watch window? Subsystems sometimes use this boolean query in RCU-idle
> contexts simply to skip data gathering without ever intending to execute
> the tracepoint.

The entire point of this patch is to make sure lockdep *always* checks that
"RCU is watching" at every trace_##name##_enabled() location, regardless of
whether the tracepoint is active or not. It's not about the static branch, but
the tracepoint that is hidden behind it.
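
To make that concrete, here is a small userspace mock of the behaviour being
argued for: the warning depends only on whether RCU is watching at the call
site, not on whether the tracepoint is enabled. Everything in it is a
stand-in (demo_rcu_watching, demo_tp_key, demo_tp_enabled() and the warning
counter are hypothetical, not kernel symbols), so treat it as a sketch of the
control flow rather than the real macro expansion.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_LOCKDEP 1			/* stand-in for IS_ENABLED(CONFIG_LOCKDEP) */

static bool demo_rcu_watching = true;	/* stand-in for rcu_is_watching() */
static bool demo_tp_key;		/* stand-in for the tracepoint's static key */
static int demo_warnings;		/* how many times the warning fired */

/* Mirrors the shape of trace_##name##_enabled() with the added check. */
static bool demo_tp_enabled(void)
{
	static bool warned;		/* WARN_ONCE-like: fire at most once */

	if (DEMO_LOCKDEP && !demo_rcu_watching && !warned) {
		warned = true;
		demo_warnings++;
	}
	/* The result still reflects only the enable state. */
	return demo_tp_key;
}
```

With the key disabled and RCU "not watching", the first call returns false
and records one warning; further calls stay silent.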


> Would it be more reliable to add the WARN_ONCE check, guarded by its cond
> parameter, directly to trace_call__##name() to appropriately mirror how
> trace_##name() behaves?

Hmm, if this does become an issue, that may be worth doing. But currently,
I do not believe the enabled() code is used by anything with conditional
tracepoints.

-- Steve

^ permalink raw reply

* Re: [PATCH] tracing: simplify pages allocation
From: Steven Rostedt @ 2026-05-20 19:38 UTC (permalink / raw)
  To: Rosen Penev
  Cc: linux-trace-kernel, Masami Hiramatsu, Mathieu Desnoyers,
	Kees Cook, Gustavo A. R. Silva, open list:TRACING,
	open list:KERNEL HARDENING (not covered by other areas):Keyword:\b__counted_by(_le|_be)?\b
In-Reply-To: <20260425014403.440786-1-rosenp@gmail.com>

On Fri, 24 Apr 2026 18:44:03 -0700
Rosen Penev <rosenp@gmail.com> wrote:

> Change to a flexible array member to allocate together with the array
> struct.
> 
> Simplifies code slightly by removing no longer correct null checks for
> pages and removing kfrees.
> 
> Signed-off-by: Rosen Penev <rosenp@gmail.com>
> ---
>  kernel/trace/tracing_map.c | 32 +++++++++++---------------------
>  kernel/trace/tracing_map.h |  2 +-
>  2 files changed, 12 insertions(+), 22 deletions(-)
> 
> diff --git a/kernel/trace/tracing_map.c b/kernel/trace/tracing_map.c
> index bf1a507695b6..627cc3fdf69e 100644
> --- a/kernel/trace/tracing_map.c
> +++ b/kernel/trace/tracing_map.c
> @@ -288,9 +288,6 @@ static void tracing_map_array_clear(struct tracing_map_array *a)
>  {
>  	unsigned int i;
>  
> -	if (!a->pages)
> -		return;
> -
>  	for (i = 0; i < a->n_pages; i++)
>  		memset(a->pages[i], 0, PAGE_SIZE);
>  }
> @@ -302,44 +299,37 @@ static void tracing_map_array_free(struct tracing_map_array *a)
>  	if (!a)
>  		return;
>  
> -	if (!a->pages)
> -		goto free;
> -
>  	for (i = 0; i < a->n_pages; i++) {
>  		if (!a->pages[i])
>  			break;
>  		kmemleak_free(a->pages[i]);
>  		free_page((unsigned long)a->pages[i]);
>  	}
> -
> -	kfree(a->pages);
> -
> - free:
> -	kfree(a);
>  }

Sashiko reported:

   https://sashiko.dev/?list=org.kernel.vger.linux-trace-kernel#/patchset/20260425014403.440786-1-rosenp%40gmail.com
  
  Does this code leak the tracing_map_array struct?
  While removing kfree(a->pages) is correct since the array is now inline,
  it looks like we still need kfree(a) to free the container struct itself
  which is allocated by kzalloc_flex() in tracing_map_array_alloc().

The report looks to be correct. Please fix.

-- Steve
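
For reference, the ownership rule at stake can be shown in plain userspace C:
once the page pointers live in a flexible array member, there is no separate
pointer array to free, but the container itself still needs exactly one free.
The sketch below uses hypothetical names (struct demo_array, demo_n_pages(),
demo_array_alloc(), demo_array_free()), calloc()/free() in place of the kernel
allocators, and an assumed 4096-byte page size. The sizing follows the
arithmetic in the patch: round the entry size up to a power of two, divide
the page by it, and use at least one page.

```c
#include <assert.h>
#include <stdlib.h>

#define DEMO_PAGE_SIZE 4096u	/* assumption: 4 KiB pages */

struct demo_array {
	unsigned int n_pages;
	void *pages[];		/* flexible array member, allocated inline */
};

/* Number of pages needed, never less than one. */
static unsigned int demo_n_pages(unsigned int n_elts, unsigned int entry_size)
{
	unsigned int rounded = 1;
	unsigned int n;

	assert(entry_size > 0 && entry_size <= DEMO_PAGE_SIZE);
	while (rounded < entry_size)	/* round up to a power of two */
		rounded <<= 1;
	n = n_elts / (DEMO_PAGE_SIZE / rounded);
	return n ? n : 1;
}

static void demo_array_free(struct demo_array *a)
{
	if (!a)
		return;
	for (unsigned int i = 0; i < a->n_pages; i++) {
		if (!a->pages[i])
			break;		/* allocation stopped here */
		free(a->pages[i]);
	}
	free(a);	/* the container is still a separate allocation */
}

static struct demo_array *demo_array_alloc(unsigned int n_elts,
					   unsigned int entry_size)
{
	unsigned int n_pages = demo_n_pages(n_elts, entry_size);
	struct demo_array *a;

	/* One allocation covers the header and the pointer array. */
	a = calloc(1, sizeof(*a) + n_pages * sizeof(a->pages[0]));
	if (!a)
		return NULL;
	a->n_pages = n_pages;

	for (unsigned int i = 0; i < n_pages; i++) {
		a->pages[i] = calloc(1, DEMO_PAGE_SIZE);
		if (!a->pages[i]) {
			demo_array_free(a);
			return NULL;
		}
	}
	return a;
}
```

With 24-byte entries the rounded size is 32, so a page holds 128 entries and
1000 elements need 7 pages; 10 elements still get one page.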


>  
>  static struct tracing_map_array *tracing_map_array_alloc(unsigned int n_elts,
>  						  unsigned int entry_size)
>  {
>  	struct tracing_map_array *a;
> +	unsigned int entry_size_shift;
> +	unsigned int entries_per_page;
> +	unsigned int n_pages;
>  	unsigned int i;
>  
> -	a = kzalloc_obj(*a);
> +	entry_size_shift = fls(roundup_pow_of_two(entry_size) - 1);
> +	entries_per_page = PAGE_SIZE / (1 << entry_size_shift);
> +	n_pages = max(1, n_elts / entries_per_page);
> +
> +	a = kzalloc_flex(*a, pages, n_pages);
>  	if (!a)
>  		return NULL;
>  
> -	a->entry_size_shift = fls(roundup_pow_of_two(entry_size) - 1);
> -	a->entries_per_page = PAGE_SIZE / (1 << a->entry_size_shift);
> -	a->n_pages = n_elts / a->entries_per_page;
> -	if (!a->n_pages)
> -		a->n_pages = 1;
> +	a->entry_size_shift = entry_size_shift;
> +	a->entries_per_page = entries_per_page;
> +	a->n_pages = n_pages;
>  	a->entry_shift = fls(a->entries_per_page) - 1;
>  	a->entry_mask = (1 << a->entry_shift) - 1;
>  
> -	a->pages = kcalloc(a->n_pages, sizeof(void *), GFP_KERNEL);
> -	if (!a->pages)
> -		goto free;
> -
>  	for (i = 0; i < a->n_pages; i++) {
>  		a->pages[i] = (void *)get_zeroed_page(GFP_KERNEL);
>  		if (!a->pages[i])
> diff --git a/kernel/trace/tracing_map.h b/kernel/trace/tracing_map.h
> index 99c37eeebc16..18a02959d77b 100644
> --- a/kernel/trace/tracing_map.h
> +++ b/kernel/trace/tracing_map.h
> @@ -167,7 +167,7 @@ struct tracing_map_array {
>  	unsigned int entry_shift;
>  	unsigned int entry_mask;
>  	unsigned int n_pages;
> -	void **pages;
> +	void *pages[] __counted_by(n_pages);
>  };
>  
>  #define TRACING_MAP_ARRAY_ELT(array, idx)				\


^ permalink raw reply

* Re: [PATCH v6 05/43] KVM: guest_memfd: Wire up kvm_get_memory_attributes() to per-gmem attributes
From: Sean Christopherson @ 2026-05-20 18:59 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: ackerleytng, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
	david, ira.weiny, jmattson, jthoughton, michael.roth, oupton,
	pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
	steven.price, willy, wyihan, yan.y.zhao, forkloop, pratyush,
	suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
	Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka, kvm,
	linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <CA+EHjTw-cUM=FrJevtSDtR7K6MwUfGfOx21LMFDn7DAy5bFzYw@mail.gmail.com>

On Wed, May 20, 2026, Fuad Tabba wrote:
> On Thu, 7 May 2026 at 21:22, Ackerley Tng via B4 Relay
> <devnull+ackerleytng.google.com@kernel.org> wrote:
> >
> > From: Sean Christopherson <seanjc@google.com>
> >
> > Implement kvm_gmem_get_memory_attributes() for guest_memfd to allow the KVM
> > core and architecture code to query per-GFN memory attributes.
> >
> > > kvm_gmem_get_memory_attributes() finds the memory slot for a given GFN and
> > > queries the guest_memfd file's attributes to determine if the page is marked as
> > > private.
> >
> > If vm_memory_attributes is not enabled, there is no shared/private tracking
> > at the VM level. Install the guest_memfd implementation as long as
> > guest_memfd is enabled to give guest_memfd a chance to respond on
> > attributes.
> >
> > guest_memfd should look up attributes regardless of whether this memslot is
> > gmem-only since attributes are now tracked by gmem regardless of whether
> > mmap() is enabled.
> >
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > Co-developed-by: Ackerley Tng <ackerleytng@google.com>
> > Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> > ---
> >  include/linux/kvm_host.h |  2 ++
> >  virt/kvm/guest_memfd.c   | 31 +++++++++++++++++++++++++++++++
> >  virt/kvm/kvm_main.c      |  3 +++
> >  3 files changed, 36 insertions(+)
> >
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index c5ba2cb34e45c..28a54298d27db 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -2557,6 +2557,8 @@ bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
> >                                          struct kvm_gfn_range *range);
> >  #endif /* CONFIG_KVM_VM_MEMORY_ATTRIBUTES */
> >
> > +unsigned long kvm_gmem_get_memory_attributes(struct kvm *kvm, gfn_t gfn);
> > +
> >  #ifdef CONFIG_KVM_GUEST_MEMFD
> >  int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> >                      gfn_t gfn, kvm_pfn_t *pfn, struct page **page,
> > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> > index 5011d38820d0d..f055e058a3f28 100644
> > --- a/virt/kvm/guest_memfd.c
> > +++ b/virt/kvm/guest_memfd.c
> > @@ -509,6 +509,37 @@ static int kvm_gmem_mmap(struct file *file, struct vm_area_struct *vma)
> >         return 0;
> >  }
> >
> > +unsigned long kvm_gmem_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
> > +{
> > +       struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
> > +       struct inode *inode;
> > +
> > +       /*
> > +        * If this gfn has no associated memslot, there's no chance of the gfn
> > +        * being backed by private memory, since guest_memfd must be used for
> > +        * private memory, and guest_memfd must be associated with some memslot.
> > +        */
> > +       if (!slot)
> > +               return 0;
> > +
> > +       CLASS(gmem_get_file, file)(slot);
> > +       if (!file)
> > +               return 0;
> > +
> > +       inode = file_inode(file);
> > +
> > +       /*
> > +        * Rely on the maple tree's internal RCU lock to ensure a
> > +        * stable result. This result can become stale as soon as the
> > +        * lock is dropped, so the caller _must_ still protect
> > +        * consumption of private vs. shared by checking
> > +        * mmu_invalidate_retry_gfn() under mmu_lock to serialize
> > +        * against ongoing attribute updates.
> > +        */
> > +       return kvm_gmem_get_attributes(inode, kvm_gmem_get_index(slot, gfn));
> > +}
> 
> Doesn't this imply that all consumers of kvm_mem_is_private() should
> validate the result using mmu_lock and the invalidation sequence?
> sev_handle_rmp_fault() calls kvm_mem_is_private() without holding
> mmu_lock and without any retry mechanism. Is that a problem?

Yes, but my understanding is that sev_handle_rmp_fault() can tolerate false
positives and false negatives.  It's not optimal, but it's "fine", and already
KVM's existing behavior, e.g. KVM gets the PFN and then smashes the RMP, without
ensuring the PFN is fresh.

Mike, is that all correct?
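
The requirement in the quoted comment (a lockless lookup whose result must be
revalidated before it is consumed) is the usual snapshot-then-recheck
pattern. A single-threaded sketch of it follows, with the racing update
simulated by a callback and no real locking. All names (struct demo_vm,
demo_set_private(), demo_flip(), demo_install()) are hypothetical and only
loosely modelled on the invalidation sequence check; this is not KVM code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_vm {
	unsigned long invalidate_seq;	/* bumped on every attribute update */
	bool is_private;		/* current attribute */
	bool mapped_private;		/* what the last install consumed */
	int mappings;			/* number of successful installs */
};

/* Attribute update: change the value and bump the sequence count. */
static void demo_set_private(struct demo_vm *vm, bool is_private)
{
	vm->is_private = is_private;
	vm->invalidate_seq++;
}

/* Helper used to simulate an update racing with an install. */
static void demo_flip(struct demo_vm *vm)
{
	demo_set_private(vm, !vm->is_private);
}

/*
 * Snapshot the sequence, look up the attribute without a lock, then
 * recheck the sequence before consuming the result. Returns false if
 * the caller has to retry.
 */
static bool demo_install(struct demo_vm *vm, void (*race)(struct demo_vm *))
{
	unsigned long seq;
	bool is_private;

	assert(vm);
	seq = vm->invalidate_seq;
	is_private = vm->is_private;	/* may be stale from here on */

	if (race)
		race(vm);		/* simulated concurrent update */

	/* In real code this recheck would run under the lock. */
	if (vm->invalidate_seq != seq)
		return false;

	vm->mapped_private = is_private;
	vm->mappings++;
	return true;
}
```

An install that races with demo_flip() is rejected and maps nothing; the
retry without a race succeeds and consumes the updated value. A caller that
skips the recheck, as described above for sev_handle_rmp_fault(), has to
tolerate acting on a stale answer.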

^ permalink raw reply

* [PATCH v20 10/10] ring-buffer: Show persistent buffer dropped events in trace_pipe file
From: Steven Rostedt @ 2026-05-20 18:49 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Catalin Marinas, Will Deacon,
	Ian Rogers
In-Reply-To: <20260520184938.749337513@kernel.org>

From: Steven Rostedt <rostedt@goodmis.org>

When the persistent ring buffer is validated on boot up, if a subbuffer is
deemed invalid, it resets the buffer and continues. Have the code preserve
the RB_MISSED_EVENTS flag in the commit portion of the subbuffer header
and pass that back so that the trace_pipe file can show the missed events
like the trace file does.

For example:

   <...>-1242    [005] d....  4429.120116: page_fault_user: address=0x7ffaebb6e728 ip=0x7ffaeb9d4960 error_code=0x7
   <...>-1242    [005] .....  4429.120124: mm_page_alloc: page=00000000055254f3 pfn=0x1373bd order=0 migratetype=1 gfp_flags=GFP_HIGHUSER_MOVABLE|__GFP_COMP
   <...>-1242    [005] d..2.  4429.120132: tlb_flush: pages:1 reason:local MM shootdown (3)
CPU:5 [LOST EVENTS]
   <...>-1242    [005] d....  4429.120661: page_fault_user: address=0x55ba7c2d0944 ip=0x55ba7c20cd02 error_code=0x7
   <...>-1242    [005] .....  4429.120669: mm_page_alloc: page=0000000005a02500 pfn=0x12b6e4 order=0 migratetype=1 gfp_flags=GFP_HIGHUSER_MOVABLE|__GFP_COMP
   <...>-1242    [005] d..2.  4429.120680: tlb_flush: pages:1 reason:local MM shootdown (3)

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
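
For readers following along, the reason the patch separates "commit" from
"size" is that the commit word carries flag bits on top of the data length.
The standalone sketch below shows that split. The DEMO_* bit positions are an
assumption modelled on the RB_MISSED_* definitions, and the helper names are
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: top two bits are flags, the rest is the data length. */
#define DEMO_MISSED_EVENTS	(UINT32_C(1) << 31)
#define DEMO_MISSED_STORED	(UINT32_C(1) << 30)
#define DEMO_MISSED_MASK	(DEMO_MISSED_EVENTS | DEMO_MISSED_STORED)

/* Data length with the flag bits stripped (what the patch calls "size"). */
static uint32_t demo_page_size(uint32_t commit)
{
	return commit & ~DEMO_MISSED_MASK;
}

/* True if the page is marked as having lost events before it. */
static bool demo_page_missed(uint32_t commit)
{
	return (commit & DEMO_MISSED_EVENTS) != 0;
}

/*
 * Build the commit word for a copied page: the number of bytes copied,
 * plus the missed-events flag carried over from the source page.
 */
static uint32_t demo_copy_commit(uint32_t src_commit, uint32_t copied)
{
	assert(!(copied & DEMO_MISSED_MASK));
	return copied | (src_commit & DEMO_MISSED_EVENTS);
}
```

A source commit of 100 with the missed flag set has a size of 100, and
copying 40 bytes out of it yields a commit word whose size is 40 with the
flag still set. Comparing lengths against the raw commit word would compare
against a value with bit 31 set.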
 kernel/trace/ring_buffer.c | 57 +++++++++++++++++++++++---------------
 1 file changed, 35 insertions(+), 22 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 9cdbee171cdc..f42d2176b92c 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -5794,6 +5794,7 @@ __rb_get_reader_page(struct ring_buffer_per_cpu *cpu_buffer)
 	struct buffer_page *reader = NULL;
 	unsigned long overwrite;
 	unsigned long flags;
+	int missed_events = 0;
 	int nr_loops = 0;
 	bool ret;
 
@@ -5894,6 +5895,9 @@ __rb_get_reader_page(struct ring_buffer_per_cpu *cpu_buffer)
 	if (!ret)
 		goto spin;
 
+	if (rb_page_commit(reader) & RB_MISSED_EVENTS)
+		missed_events = -1;
+
 	if (cpu_buffer->ring_meta)
 		rb_update_meta_reader(cpu_buffer, reader);
 
@@ -5958,6 +5962,8 @@ __rb_get_reader_page(struct ring_buffer_per_cpu *cpu_buffer)
 	 */
 	smp_rmb();
 
+	if (!cpu_buffer->lost_events)
+		cpu_buffer->lost_events = missed_events;
 
 	return reader;
 }
@@ -7059,6 +7065,7 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
 	struct buffer_page *reader;
 	long missed_events;
 	unsigned int commit;
+	unsigned int size;
 	unsigned int read;
 	u64 save_timestamp;
 	bool force_memcpy;
@@ -7094,7 +7101,8 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
 	event = rb_reader_event(cpu_buffer);
 
 	read = reader->read;
-	commit = rb_page_size(reader);
+	commit = rb_page_commit(reader);
+	size = rb_page_size(reader);
 
 	/* Check if any events were dropped */
 	missed_events = cpu_buffer->lost_events;
@@ -7108,13 +7116,14 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
 	 * we must copy the data from the page to the buffer.
 	 * Otherwise, we can simply swap the page with the one passed in.
 	 */
-	if (read || (len < (commit - read)) ||
+	if (read || (len < (size - read)) ||
 	    cpu_buffer->reader_page == cpu_buffer->commit_page ||
 	    force_memcpy) {
 		struct buffer_data_page *rpage = cpu_buffer->reader_page->page;
 		unsigned int rpos = read;
 		unsigned int pos = 0;
-		unsigned int size;
+		unsigned int event_size;
+		unsigned int flags = 0;
 
 		/*
 		 * If a full page is expected, this can still be returned
@@ -7123,19 +7132,23 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
 		 * the reader page.
 		 */
 		if (full &&
-		    (!read || (len < (commit - read)) ||
+		    (!read || (len < (size - read)) ||
 		     cpu_buffer->reader_page == cpu_buffer->commit_page))
 			return -1;
 
-		if (len > (commit - read))
-			len = (commit - read);
+		if (len > (size - read))
+			len = (size - read);
 
 		/* Always keep the time extend and data together */
-		size = rb_event_ts_length(event);
+		event_size = rb_event_ts_length(event);
 
-		if (len < size)
+		if (len < event_size)
 			return -1;
 
+		/* Carry the missed events flag over to the copied page */
+		if (commit & RB_MISSED_EVENTS)
+			flags = RB_MISSED_EVENTS;
+
 		/* save the current timestamp, since the user will need it */
 		save_timestamp = cpu_buffer->read_stamp;
 
@@ -7147,25 +7160,25 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
 			 * one or two events.
 			 * We have already ensured there's enough space if this
 			 * is a time extend. */
-			size = rb_event_length(event);
-			memcpy(dpage->data + pos, rpage->data + rpos, size);
+			event_size = rb_event_length(event);
+			memcpy(dpage->data + pos, rpage->data + rpos, event_size);
 
-			len -= size;
+			len -= event_size;
 
 			rb_advance_reader(cpu_buffer);
 			rpos = reader->read;
-			pos += size;
+			pos += event_size;
 
-			if (rpos >= commit)
+			if (rpos >= size)
 				break;
 
 			event = rb_reader_event(cpu_buffer);
 			/* Always keep the time extend and data together */
-			size = rb_event_ts_length(event);
-		} while (len >= size);
+			event_size = rb_event_ts_length(event);
+		} while (len >= event_size);
 
 		/* update dpage */
-		local_set(&dpage->commit, pos);
+		local_set(&dpage->commit, pos | flags);
 		dpage->time_stamp = save_timestamp;
 
 		/* we copied everything to the beginning */
@@ -7197,7 +7210,7 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
 
 	cpu_buffer->lost_events = 0;
 
-	commit = rb_data_page_commit(dpage);
+	size = rb_data_page_size(dpage);
 	/*
 	 * Set a flag in the commit field if we lost events
 	 */
@@ -7207,11 +7220,11 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
 		 * missed events, then record it there.
 		 */
 		if (missed_events > 0 &&
-		    buffer->subbuf_size - commit >= sizeof(missed_events)) {
-			memcpy(&dpage->data[commit], &missed_events,
+		    buffer->subbuf_size - size >= sizeof(missed_events)) {
+			memcpy(&dpage->data[size], &missed_events,
 			       sizeof(missed_events));
 			local_add(RB_MISSED_STORED, &dpage->commit);
-			commit += sizeof(missed_events);
+			size += sizeof(missed_events);
 		}
 		local_add(RB_MISSED_EVENTS, &dpage->commit);
 	}
@@ -7219,8 +7232,8 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
 	/*
 	 * This page may be off to user land. Zero it out here.
 	 */
-	if (commit < buffer->subbuf_size)
-		memset(&dpage->data[commit], 0, buffer->subbuf_size - commit);
+	if (size < buffer->subbuf_size)
+		memset(&dpage->data[size], 0, buffer->subbuf_size - size);
 
 	return read;
 }
-- 
2.53.0



^ permalink raw reply related

* [PATCH v20 09/10] ring-buffer: Show persistent buffer dropped events in trace file
From: Steven Rostedt @ 2026-05-20 18:49 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Catalin Marinas, Will Deacon,
	Ian Rogers
In-Reply-To: <20260520184938.749337513@kernel.org>

From: Steven Rostedt <rostedt@goodmis.org>

When the persistent ring buffer is validated on boot up, if a subbuffer is
deemed invalid, it resets the buffer and continues. Currently, these lost
events are not shown in the trace file output.

Have the trace iterator look for subbuffers that have the RB_MISSED_EVENTS
set and set the iter->missed_events flag when it is detected. This will
then have the trace file show "LOST EVENTS" when it reads across a
subbuffer that was corrupted and invalidated.

For example:

 <...>-1016    [005] ...1.  6230.660403: preempt_disable: caller=__mod_memcg_state+0x1c8/0x200 parent=__mod_memcg_state+0x1c8/0x200
CPU:5 [LOST EVENTS]
 <...>-1016    [005] .....  6230.660673: kmem_cache_alloc: call_site=__anon_vma_prepare+0x1ad/0x1e0 ptr=000000006e40294c name=anon_vma bytes_req=200 bytes_alloc=208 gfp_flags=GFP_KERNEL node=-1 accounted=true

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
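
A note on the type change to a signed long: with it, -1 can mean "events were
lost but the count is unknown", while a positive value is a real count that
can be stored after the data. The sketch below illustrates that decision in
userspace. The DEMO_* values, the 64-byte subbuffer, struct demo_page and
demo_mark_missed() are hypothetical and much simpler than the ring buffer
code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEMO_SUBBUF_SIZE	64u			/* assumed tiny subbuffer */
#define DEMO_MISSED_EVENTS	(UINT32_C(1) << 31)	/* events were lost */
#define DEMO_MISSED_STORED	(UINT32_C(1) << 30)	/* count stored after data */

struct demo_page {
	uint32_t commit;			/* data length plus flag bits */
	unsigned char data[DEMO_SUBBUF_SIZE];
};

/*
 * missed == 0: nothing lost.
 * missed  > 0: that many events were lost; store the count if it fits.
 * missed  < 0: events were lost but the count is unknown; flag only.
 */
static void demo_mark_missed(struct demo_page *p, uint32_t size, long missed)
{
	assert(p && size <= DEMO_SUBBUF_SIZE);

	p->commit = size;
	if (!missed)
		return;

	if (missed > 0 && DEMO_SUBBUF_SIZE - size >= sizeof(missed)) {
		memcpy(&p->data[size], &missed, sizeof(missed));
		p->commit |= DEMO_MISSED_STORED;
	}
	p->commit |= DEMO_MISSED_EVENTS;
}
```

Marking a page with -1 sets only the missed flag, whereas marking it with 3
also sets the stored flag and writes the count right after the data.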
 kernel/trace/ring_buffer.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index ae5c645b59c9..9cdbee171cdc 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -3518,6 +3518,9 @@ static void rb_inc_iter(struct ring_buffer_iter *iter)
 	else
 		rb_inc_page(&iter->head_page);
 
+	if (rb_page_commit(iter->head_page) & RB_MISSED_EVENTS)
+		iter->missed_events = -1;
+
 	iter->page_stamp = iter->read_stamp = iter->head_page->page->time_stamp;
 	iter->head = 0;
 	iter->next_event = 0;
@@ -5579,6 +5582,7 @@ static void rb_iter_reset(struct ring_buffer_iter *iter)
 	iter->head_page = cpu_buffer->reader_page;
 	iter->head = cpu_buffer->reader_page->read;
 	iter->next_event = iter->head;
+	iter->missed_events = 0;
 
 	iter->cache_reader_page = iter->head_page;
 	iter->cache_read = cpu_buffer->read;
@@ -7053,7 +7057,7 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
 	struct ring_buffer_event *event;
 	struct buffer_data_page *dpage;
 	struct buffer_page *reader;
-	unsigned long missed_events;
+	long missed_events;
 	unsigned int commit;
 	unsigned int read;
 	u64 save_timestamp;
@@ -7179,6 +7183,8 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
 		local_set(&reader->entries, 0);
 		reader->read = 0;
 		data_page->data = dpage;
+		if (!missed_events && rb_data_page_commit(dpage) & RB_MISSED_EVENTS)
+			missed_events = -1;
 
 		/*
 		 * Use the real_end for the data size,
@@ -7196,10 +7202,12 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
 	 * Set a flag in the commit field if we lost events
 	 */
 	if (missed_events) {
-		/* If there is room at the end of the page to save the
+		/*
+		 * If there is room at the end of the page to save the
 		 * missed events, then record it there.
 		 */
-		if (buffer->subbuf_size - commit >= sizeof(missed_events)) {
+		if (missed_events > 0 &&
+		    buffer->subbuf_size - commit >= sizeof(missed_events)) {
 			memcpy(&dpage->data[commit], &missed_events,
 			       sizeof(missed_events));
 			local_add(RB_MISSED_STORED, &dpage->commit);
-- 
2.53.0



^ permalink raw reply related
