Linux Trace Kernel

* Re: [PATCH] rtla: Fix parsing of multi-character short options
From: John Kacur @ 2026-06-02 16:21 UTC
  To: Tomas Glozar; +Cc: linux-trace-kernel, Steven Rostedt, linux-kernel
In-Reply-To: <20260602125506.3325345-1-tglozar@redhat.com>

On Tue, 2 Jun 2026 14:55:06 +0200, Tomas Glozar wrote:
> rtla: Fix parsing of multi-character short options

Thanks for the fix! I've tested this patch with a comprehensive test suite
covering all rtla commands (timerlat hist/top, osnoise hist/top, hwnoise)
with the four option formats:
  -p 100        (short with space)
  -p100         (short attached - previously broken)
  --period=100  (long with equals)
  --period 100  (long with space)

All 20 tests pass. The fix correctly resolves the issue where -p100 was
being parsed as multiple separate options.

Tested-by: John Kacur <jkacur@redhat.com>

* Re: [PATCH v4 2/3] perf: enable unprivileged syscall tracing with perf trace
From: Anubhav Shelat @ 2026-06-02 16:12 UTC
  To: Peter Zijlstra
  Cc: mpetlan, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	James Clark, Thomas Falcon, linux-kernel, linux-trace-kernel,
	linux-perf-users
In-Reply-To: <20260518214116.GZ3102624@noisy.programming.kicks-ass.net>

On Mon, May 18, 2026 at 5:41 PM Peter Zijlstra <peterz@infradead.org> wrote:
> Typically patches are supposed to do a single thing, you're listing 4
> things. What gives?
All four changes need to be made together to work properly. The second
point could be pulled out as a separate patch, but will be replaced
with the eventfs that Steve suggested. The other three points
represent a single logical change: selectively loosening the
perf_event_open() restrictions without exposing kernel data or
breaking uprobe functionality.

> PERF_SAMPLE_IP should be here too, no?
If PERF_SAMPLE_IP is added to the kaddr_leak mask it blocks uprobes,
so the PERF_SAMPLE_IP check is in the trace_event_perf.c changes where
I can exempt uprobes:
+       if ((p_event->attr.sample_type & PERF_SAMPLE_IP) &&
+           !p_event->attr.exclude_kernel &&
+           !(tp_event->flags & TRACE_EVENT_FL_UPROBE) &&
+           sysctl_perf_event_paranoid > 1 && !perfmon_capable())
+               return -EACCES;
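
For readers following the condition above, it can be read as a standalone predicate. The block below is a user-space sketch only, not the factored-out helper promised for the next version: `struct sample_request`, `ip_sample_denied()` and `SKETCH_SAMPLE_IP` are hypothetical names invented for illustration, and the kernel's real types, flags, and capability checks are not reproduced here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the PERF_SAMPLE_IP bit; defined here for illustration only. */
#define SKETCH_SAMPLE_IP (1u << 0)

/* Hypothetical, simplified view of the fields the quoted check looks at. */
struct sample_request {
	uint64_t sample_type;	/* requested sample bits */
	bool exclude_kernel;	/* mirrors attr.exclude_kernel */
	bool is_uprobe;		/* mirrors the TRACE_EVENT_FL_UPROBE test */
};

/*
 * Return true when the request should be refused: IP sampling is
 * requested, kernel samples are not excluded, the event is not a uprobe,
 * the paranoid level is above 1, and the caller lacks the perfmon
 * capability.
 */
static bool ip_sample_denied(const struct sample_request *req,
			     int paranoid, bool perfmon_capable)
{
	return (req->sample_type & SKETCH_SAMPLE_IP) &&
	       !req->exclude_kernel &&
	       !req->is_uprobe &&
	       paranoid > 1 &&
	       !perfmon_capable;
}
```

In this sketch a uprobe request, or one that sets exclude_kernel, is never refused by this particular predicate, which matches the stated intent of exempting uprobes.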

> And I'm not sure if tracepoints can trigger it, but PHYS_ADDR also seems
> something we shouldn't allow.
There's a check for unprivileged access to PHYS_ADDR at core.c:13917
so I didn't add it to kaddr_leak.

> And we're sure RAW doesn't include pointers
PERF_SAMPLE_RAW for TRACE_EVENT_FL_CAP_ANY tracepoints doesn't include
kernel pointers.

> Again, you're doing the same thing in multiple places. If only there was
> something to re-use a previous expression.
>
> None of this gives me warm and fuzzy feelings.
You're right. I'll factor the checks out for the next version.

Anubhav


* [PATCH 2/2] rtla: Add tests for option parsing with attached arguments
From: John Kacur @ 2026-06-02 15:52 UTC
  To: linux-trace-kernel; +Cc: Steven Rostedt, Tomas Glozar, linux-kernel
In-Reply-To: <20260602155210.60439-1-jkacur@redhat.com>

Add tests to verify that short options with attached numeric arguments
work correctly for all rtla commands after the parsing fixes.

Tests verify four formats for each command:
- Short option with space: -p 100
- Short option attached: -p100
- Long option with equals: --period=100
- Long option with space: --period 100

For osnoise and hwnoise commands, the tests also include -r 100 (runtime)
to satisfy the osnoise constraint that runtime <= period.

These tests complement the existing timerlat hist tests added in commit
d489b602c669 ("rtla/timerlat: Add tests for option parsing with attached
arguments").

Assisted-by: Claude:claude-sonnet-4-5
Signed-off-by: John Kacur <jkacur@redhat.com>
---
 tools/tracing/rtla/tests/hwnoise.t  | 10 ++++++++++
 tools/tracing/rtla/tests/osnoise.t  | 18 ++++++++++++++++++
 tools/tracing/rtla/tests/timerlat.t |  8 ++++++++
 3 files changed, 36 insertions(+)

diff --git a/tools/tracing/rtla/tests/hwnoise.t b/tools/tracing/rtla/tests/hwnoise.t
index 23ce250a6852..c61f02bc42fe 100644
--- a/tools/tracing/rtla/tests/hwnoise.t
+++ b/tools/tracing/rtla/tests/hwnoise.t
@@ -19,4 +19,14 @@ check "enable a trace event trigger" \
 	"hwnoise -t -e osnoise:irq_noise --trigger=\"hist:key=desc,duration:sort=desc,duration:vals=hitcount\" -d 10s" \
 	0 "Saving event osnoise:irq_noise hist to osnoise_irq_noise_hist.txt"
 
+# Option parsing tests - verify attached numeric arguments work correctly
+check "verify -p with space" \
+	"hwnoise -p 100 -r 100 -c 0 -d 1s" 0 "" "no-irq and no-thread"
+check "verify -p without space (attached argument)" \
+	"hwnoise -p100 -r 100 -c 0 -d 1s" 0 "" "no-irq and no-thread"
+check "verify --period with equals" \
+	"hwnoise --period=100 -r 100 -c 0 -d 1s" 0 "" "no-irq and no-thread"
+check "verify --period with space" \
+	"hwnoise --period 100 -r 100 -c 0 -d 1s" 0 "" "no-irq and no-thread"
+
 test_end
diff --git a/tools/tracing/rtla/tests/osnoise.t b/tools/tracing/rtla/tests/osnoise.t
index 396334608920..1807236431df 100644
--- a/tools/tracing/rtla/tests/osnoise.t
+++ b/tools/tracing/rtla/tests/osnoise.t
@@ -17,6 +17,24 @@ check "verify the  --trace param" \
 check "verify the --entries/-E param" \
 	"osnoise hist -P F:1 -c 0 -r 900000 -d 10s -b 10 -E 25"
 
+# Option parsing tests - verify attached numeric arguments work correctly
+check "verify -p with space" \
+	"osnoise hist -p 100 -r 100 -c 0 -d 1s" 0 "" "no-irq and no-thread"
+check "verify -p without space (attached argument)" \
+	"osnoise hist -p100 -r 100 -c 0 -d 1s" 0 "" "no-irq and no-thread"
+check "verify --period with equals" \
+	"osnoise hist --period=100 -r 100 -c 0 -d 1s" 0 "" "no-irq and no-thread"
+check "verify --period with space" \
+	"osnoise hist --period 100 -r 100 -c 0 -d 1s" 0 "" "no-irq and no-thread"
+check "verify osnoise top -p with space" \
+	"osnoise top -p 100 -r 100 -c 0 -d 1s -q" 0 "" "no-irq and no-thread"
+check "verify osnoise top -p without space (attached argument)" \
+	"osnoise top -p100 -r 100 -c 0 -d 1s -q" 0 "" "no-irq and no-thread"
+check "verify osnoise top --period with equals" \
+	"osnoise top --period=100 -r 100 -c 0 -d 1s -q" 0 "" "no-irq and no-thread"
+check "verify osnoise top --period with space" \
+	"osnoise top --period 100 -r 100 -c 0 -d 1s -q" 0 "" "no-irq and no-thread"
+
 # Test setting default period by putting an absurdly high period
 # and stopping on threshold.
 # If default period is not set, this will time out.
diff --git a/tools/tracing/rtla/tests/timerlat.t b/tools/tracing/rtla/tests/timerlat.t
index 1a63301f5d70..506227412027 100644
--- a/tools/tracing/rtla/tests/timerlat.t
+++ b/tools/tracing/rtla/tests/timerlat.t
@@ -51,6 +51,14 @@ check "verify --period with equals" \
 	"timerlat hist --period=100 -c 0 -d 1s" 0 "" "no-irq and no-thread"
 check "verify --period with space" \
 	"timerlat hist --period 100 -c 0 -d 1s" 0 "" "no-irq and no-thread"
+check "verify timerlat top -p with space" \
+	"timerlat top -p 100 -c 0 -d 1s -q" 0 "" "no-irq and no-thread"
+check "verify timerlat top -p without space (attached argument)" \
+	"timerlat top -p100 -c 0 -d 1s -q" 0 "" "no-irq and no-thread"
+check "verify timerlat top --period with equals" \
+	"timerlat top --period=100 -c 0 -d 1s -q" 0 "" "no-irq and no-thread"
+check "verify timerlat top --period with space" \
+	"timerlat top --period 100 -c 0 -d 1s -q" 0 "" "no-irq and no-thread"
 
 # Actions tests
 check "trace output through -t" \
-- 
2.54.0


* [PATCH 1/2] rtla/timerlat: Add tests for option parsing with attached arguments
From: John Kacur @ 2026-06-02 15:52 UTC
  To: linux-trace-kernel; +Cc: Steven Rostedt, Tomas Glozar, linux-kernel
In-Reply-To: <20260602155210.60439-1-jkacur@redhat.com>

Add tests to verify that numeric arguments work correctly with both
attached and detached formats:
  -p 100        (short with space)
  -p100         (short without space)
  --period=100  (long with =)
  --period 100  (long with space)

These tests prevent regression of the bug fixed in commit eefa8af46ff7
("rtla/timerlat: Fix parsing of short options with attached arguments")
where -p100 was incorrectly parsed as multiple separate options.

The tests verify that:
1. All four argument formats succeed (exit code 0)
2. None trigger the "no-irq and no-thread" error that occurred when
   the bug was present

Assisted-by: Claude:claude-sonnet-4-5
Signed-off-by: John Kacur <jkacur@redhat.com>
---
 tools/tracing/rtla/tests/timerlat.t | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/tools/tracing/rtla/tests/timerlat.t b/tools/tracing/rtla/tests/timerlat.t
index fd4935fd7b49..1a63301f5d70 100644
--- a/tools/tracing/rtla/tests/timerlat.t
+++ b/tools/tracing/rtla/tests/timerlat.t
@@ -42,6 +42,16 @@ check "verify -c/--cpus" \
 check "hist test in nanoseconds" \
 	"timerlat hist -i 2 -c 0 -n -d 10s" 2 "ns"
 
+# Option parsing tests - verify attached numeric arguments work correctly
+check "verify -p with space" \
+	"timerlat hist -p 100 -c 0 -d 1s" 0 "" "no-irq and no-thread"
+check "verify -p without space (attached argument)" \
+	"timerlat hist -p100 -c 0 -d 1s" 0 "" "no-irq and no-thread"
+check "verify --period with equals" \
+	"timerlat hist --period=100 -c 0 -d 1s" 0 "" "no-irq and no-thread"
+check "verify --period with space" \
+	"timerlat hist --period 100 -c 0 -d 1s" 0 "" "no-irq and no-thread"
+
 # Actions tests
 check "trace output through -t" \
 	"timerlat hist -T 2 -t" 2 "^  Saving trace to timerlat_trace.txt$"
-- 
2.54.0


* [PATCH 0/2] rtla: Add tests for option parsing with attached arguments
From: John Kacur @ 2026-06-02 15:52 UTC
  To: linux-trace-kernel; +Cc: Steven Rostedt, Tomas Glozar, linux-kernel

This patch series adds comprehensive tests to verify that short options
with attached numeric arguments (e.g., -p100) work correctly across all
rtla commands.

These tests complement Tomas Glozar's fix "rtla: Fix parsing of
multi-character short options" which resolves the issue where options
like -p100 were incorrectly parsed as multiple separate options due to
getopt_long() being called twice.
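
As an illustration of the four formats, here is a minimal user-space sketch built on a single getopt_long() pass. It is not rtla's parser: the helper name `parse_period()` and its option table are assumptions made up for this example; only getopt_long() itself and the -p/--period spelling come from the description above.

```c
#include <getopt.h>
#include <stdlib.h>

/*
 * Hypothetical helper, not taken from rtla: scan argv once with
 * getopt_long() and return the value given to -p/--period, or -1 if the
 * option is absent or the command line is malformed.
 */
static long parse_period(int argc, char **argv)
{
	static const struct option long_opts[] = {
		{ "period", required_argument, NULL, 'p' },
		{ NULL, 0, NULL, 0 },
	};
	long period = -1;
	int c;

	optind = 0;	/* restart scanning so the helper can be called repeatedly */
	opterr = 0;	/* keep getopt_long() quiet on errors */

	while ((c = getopt_long(argc, argv, "p:", long_opts, NULL)) != -1) {
		switch (c) {
		case 'p':
			/* optarg holds the argument text for all four formats */
			period = strtol(optarg, NULL, 10);
			break;
		default:
			return -1;
		}
	}
	return period;
}
```

Because the command line is scanned exactly once in this sketch, "-p100" reaches the 'p' case with optarg pointing at "100", the same as "-p 100", "--period=100" and "--period 100".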

The tests verify four option formats for each command:
  -p 100        (short with space)
  -p100         (short attached - previously broken)
  --period=100  (long with equals)
  --period 100  (long with space)

Commands tested:
- timerlat hist and top
- osnoise hist and top
- hwnoise

All 20 tests pass with Tomas's fix applied, confirming the issue is
resolved and preventing future regressions. These tests will continue to
work when rtla transitions to libsubcmd in the future, ensuring this
functionality remains correct across parsing implementations.

Note: Patch 1/2 is a resend of the timerlat hist tests sent previously.
Patch 2/2 adds tests for the remaining rtla commands.

Signed-off-by: John Kacur <jkacur@redhat.com>

John Kacur (2):
  rtla/timerlat: Add tests for option parsing with attached arguments
  rtla: Add tests for option parsing with attached arguments

 tools/tracing/rtla/tests/hwnoise.t  | 10 ++++++++++
 tools/tracing/rtla/tests/osnoise.t  | 18 ++++++++++++++++++
 tools/tracing/rtla/tests/timerlat.t | 18 ++++++++++++++++++
 3 files changed, 46 insertions(+)

-- 
2.54.0

* Re: [PATCH mm-unstable v18 11/14] mm/khugepaged: Introduce mTHP collapse support
From: Lance Yang @ 2026-06-02 15:44 UTC
  To: Nico Pache
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat, mhocko,
	peterx, pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang,
	rientjes, rostedt, rppt, ryan.roberts, shivankg, sunnanyong,
	surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe
In-Reply-To: <CAA1CXcA+oZmp=cxiC2_EBDxqGX94gAd335d9eFPNv=j_0=og7Q@mail.gmail.com>



On 2026/6/2 18:58, Nico Pache wrote:
> On Sun, May 31, 2026 at 1:19 AM Lance Yang <lance.yang@linux.dev> wrote:
>>
>>
>> On Fri, May 22, 2026 at 09:00:06AM -0600, Nico Pache wrote:
>> [...]
>>> @@ -1587,10 +1749,11 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>>>        if (result == SCAN_SUCCEED) {
>>>                /* collapse_huge_page expects the lock to be dropped before calling */
>>>                mmap_read_unlock(mm);
>>> -              result = collapse_huge_page(mm, start_addr, referenced,
>>> -                                          unmapped, cc, HPAGE_PMD_ORDER);
>>> -              /* collapse_huge_page will return with the mmap_lock released */
>>> +              nr_collapsed = mthp_collapse(mm, vma, start_addr, referenced,
>>> +                                           unmapped, cc, enabled_orders);
>>> +              /* mmap_lock was released above, set lock_dropped */
>>>                *lock_dropped = true;
>>> +              result = nr_collapsed ? SCAN_SUCCEED : SCAN_FAIL;
>>
>> Hmm ... don't we lose the allocation-failure result here?
>>
>> Previously collapse_scan_pmd() propagated SCAN_ALLOC_HUGE_PAGE_FAIL from
>> collapse_huge_page(), so khugepaged would call khugepaged_alloc_sleep()
>> in khugepaged_do_scan().
>>
>> Now if allocation fails and nr_collapsed stays 0, we just return
>> SCAN_FAIL. So we won't back off via khugepaged_alloc_sleep() anymore?
> 
> Ok I did the error propagation! I think I handled both of these cases
> you brought up pretty easily.

Thanks.

> However I don't know what to do in the following case: We successfully
> collapsed some portion of the PMD, but during that process, we also
> hit an allocation failure. Is it best to back off entirely, or can we
> treat some forward progress as a sign we can continue trying collapses
> without sleeping?
> 
> Basically, do we prioritize SCAN_ALLOC_HUGE_PAGE_FAIL or the
> successful collapses as the returned value?

Thinking out loud: forward progress should win here; the allocation
failure only matters if we made no progress at all?

> This is what I currently have:
> done:
>      if (collapsed)
>          return SCAN_SUCCEED;
>      if (alloc_failed)
>          return SCAN_ALLOC_HUGE_PAGE_FAIL;

I'd go with this ordering :)
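
The ordering discussed above can be sketched as a small standalone function. This is an illustration under stated assumptions, not the code from the series: `pick_result()` and the `SKETCH_*` enum are hypothetical stand-ins, and the final fallback to a plain failure is assumed from the earlier `nr_collapsed ? SCAN_SUCCEED : SCAN_FAIL` hunk.

```c
#include <stdbool.h>

/* Stand-in result codes; names echo the thread, values are arbitrary. */
enum sketch_scan_result {
	SKETCH_SCAN_FAIL,
	SKETCH_SCAN_SUCCEED,
	SKETCH_SCAN_ALLOC_HUGE_PAGE_FAIL,
};

/*
 * Forward progress wins: any successful collapse reports success, and an
 * allocation failure is only reported when nothing was collapsed at all.
 */
static enum sketch_scan_result pick_result(int nr_collapsed, bool alloc_failed)
{
	if (nr_collapsed > 0)
		return SKETCH_SCAN_SUCCEED;
	if (alloc_failed)
		return SKETCH_SCAN_ALLOC_HUGE_PAGE_FAIL;
	return SKETCH_SCAN_FAIL;	/* assumed fallback, see lead-in */
}
```

With this ordering, a caller that backs off on the allocation-failure result would only sleep when the scan made no progress.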

Cheers, Lance

* Re: [PATCH 1/2] tracing: work around -Wmissing-format-attribute warning
From: Steven Rostedt @ 2026-06-02 15:40 UTC
  To: Arnd Bergmann
  Cc: Masami Hiramatsu, Andrew Morton, Petr Mladek, Nathan Chancellor,
	Arnd Bergmann, Dennis Dalessandro, Jason Gunthorpe,
	Leon Romanovsky, Arend van Spriel, Miri Korenblit,
	Mathieu Desnoyers, Andy Shevchenko, Rasmus Villemoes,
	Sergey Senozhatsky, Nick Desaulniers, Bill Wendling, Justin Stitt,
	Vlastimil Babka, linux-rdma, linux-kernel, linux-wireless,
	brcm80211, brcm80211-dev-list.pdl, linux-trace-kernel, llvm
In-Reply-To: <20260602150904.2258624-1-arnd@kernel.org>

On Tue,  2 Jun 2026 17:07:05 +0200
Arnd Bergmann <arnd@kernel.org> wrote:

> @@ -2979,6 +2975,12 @@ int vsnprintf(char *buf, size_t size, const char *fmt_str, va_list args)
>  }
>  EXPORT_SYMBOL(vsnprintf);
>  

Should add a comment here for why this is needed.

-- Steve

> +int __printf(3, 0) __vsnprintf(char *buf, size_t size, const char *fmt_str, va_list args)
> +{
> +	return vsnprintf(buf, size, fmt_str, args);
> +}
> +EXPORT_SYMBOL(__vsnprintf);
> +

* Re: [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
From: Nico Pache @ 2026-06-02 15:30 UTC
  To: Lance Yang
  Cc: David Hildenbrand (Arm), linux-doc, linux-kernel, linux-mm,
	linux-trace-kernel, aarcange, akpm, anshuman.khandual, apopple,
	baohua, baolin.wang, byungchul, catalin.marinas, cl, corbet,
	dave.hansen, dev.jain, gourry, hannes, hughd, jack, jackmanb,
	jannh, jglisse, joshua.hahnjy, kas, liam, ljs, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe, usama.arif
In-Reply-To: <baa0a462-46e0-44ab-b583-c722ad253afe@linux.dev>

On Mon, Jun 1, 2026 at 4:48 AM Lance Yang <lance.yang@linux.dev> wrote:
>
>
>
> On 2026/6/1 18:23, David Hildenbrand (Arm) wrote:
> > On 6/1/26 11:08, Lance Yang wrote:
> >>
> >>
> >> On 2026/6/1 14:54, David Hildenbrand (Arm) wrote:
> >>> On 6/1/26 05:28, Lance Yang wrote:
> >>>>
> >>>>
> >>>> Ah, fair point.
> >>>>
> >>>> I was mostly worried about arch hooks that walk vma->vm_mm again, rather
> >>>> than only using the pte pointer passed in. For example, mips does:
> >>>
> >>> Right, a re-walk would be the real problem.
> >>>
> >>>>
> >>>>     update_mmu_cache_range()
> >>>>       -> __update_tlb()
> >>>>         -> pgd_offset(vma->vm_mm, address)
> >>>>         -> pte_offset_map(...)
> >>>>
> >>>> and __update_tlb() has this assumption:
> >>>>
> >>>>          /*
> >>>>           * update_mmu_cache() is called between pte_offset_map_lock()
> >>>>           * and pte_unmap_unlock(), so we can assume that ptep is not
> >>>>           * NULL here: and what should be done below if it were NULL?
> >>>>           */
> >>>>
> >>>> So if khugepaged happens to run with current->active_mm == vma->vm_mm
> >>>> here, could __update_tlb() hit the none PMD, get NULL from
> >>>> pte_offset_map(), and then dereference it?
> >>>
> >>> Likely yes -- that MIPS code is horrible. And the comment in MIPS code
> >>> even spells that out. :(
> >>>
> >>> Do you know about other code like that, or is MIPS the only one doing a
> >>> re-walk and crossing fingers?
> >>>
> >>>>
> >>>> Just wanted to raise it since some arch code may still have assumptions
> >>>> like this, and the always-enable-mTHP work is getting closer ...
> >>>
> > >>> Right. I assume set_pte_at() couldn't trigger something similar (re-walk)
> > >>> in arch code, because we simply provide the ptep.
> > >>> update_mmu_cache_range() only consumes the pte.
> >>>
> >>>>
> >>>> Probably very very very hard to hit, though :)
> >>>
> >>> Delaying update_mmu_cache_range() is nasty, as we'd have to make sure that
> >>> nobody can interfere in the meantime ... and the PMD lock will not be sufficient.
> >>>
> >>> Maybe we could reinstall the page table with the cleared (none) entries while
> >>> still holding the PTL?
> >>>
> >>> Thinking out loud:
> >>>
> >>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> >>> index 5ba298d420b7..e39b750b1e6f 100644
> >>> --- a/mm/khugepaged.c
> >>> +++ b/mm/khugepaged.c
> >>> @@ -1413,13 +1413,17 @@ static enum scan_result collapse_huge_page(struct
> >>> mm_struct *mm, unsigned long s
> >>>                   map_anon_folio_pmd_nopf(folio, pmd, vma, pmd_addr);
> >>>           } else {
> >>>                   /*
> >>> -                * set_ptes is called in map_anon_folio_pte_nopf with the
> >>> -                * pmd_ptl lock still held; this is safe as the PMD is expected
> >>> -                * to be none. The pmd entry is then repopulated below.
> >>> +                * Re-insert the page table with the cleared entries, but
> >>> +                * hold the PTL, such that no one can mess with the re-installed
> >>> +                * page table until we updated the temporarily-cleared entries
> >>> +                * through map_anon_folio_pte_nopf().
> >>>                    */
> > >>> -               map_anon_folio_pte_nopf(folio, pte, vma, start_addr, /*uffd_wp=*/ false);
> >>> -               smp_wmb(); /* make PTEs visible before PMD. See pmd_install() */
> >>
> >> One small thing, I think we should probably keep the smp_wmb(), and just
> >> move it before the earlier pmd_populate().
> >>
> >> IIUC, the ordering we want is still:
> >>
> >>    clear old PTEs
> >>    smp_wmb()
> >>    pmd_populate()
> >>
> >> so another CPU cannot walk through the re-installed PMD and still observe
> >> the old PTEs, right?
> >
> > There is a smp_wmb() in __folio_mark_uptodate(), that should be sufficient?
>
> Ah, cool! __folio_mark_uptodate() already does the job :P
>
> So yeah, no extra smp_wmb() needed here!

Are we sure? That folio_mark_uptodate() is done before the PTEs are
reinstalled, and we reinstall the PMD right after. The two steps are
currently separated by the smp_wmb().

I was copying this from other THP code that performs similar PTE/PMD juggling.

I can remove it, but I'd rather err on the side of caution with this.

>
> Cheers, Lance
>


* [PATCH 2/2] tracing/osnoise: add printf attribute to osnoise_print
From: Arnd Bergmann @ 2026-06-02 15:07 UTC
  To: Steven Rostedt, Masami Hiramatsu, Crystal Wood
  Cc: Arnd Bergmann, Mathieu Desnoyers, Tomas Glozar, Wang Liang,
	linux-kernel, linux-trace-kernel
In-Reply-To: <20260602150904.2258624-1-arnd@kernel.org>

From: Arnd Bergmann <arnd@arndb.de>

gcc points out that the newly added function uses printf-style arguments
and should get an attribute to allow verifying the format strings for
its callers:

kernel/trace/trace_osnoise.c: In function 'osnoise_print':
kernel/trace/trace_osnoise.c:96:17: error: function 'osnoise_print' might be a candidate for 'gnu_printf' format attribute [-Werror=suggest-attribute=format]
   96 |                 trace_array_vprintk(tr, _RET_IP_, fmt, ap);
      |                 ^~~~~~~~~~~~~~~~~~~

Add the attribute as suggested.
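
For reference, the effect of such an attribute can be shown in a small user-space sketch. The wrapper below is hypothetical and unrelated to osnoise_print() beyond being variadic; it spells out the compiler attribute directly, which is what the kernel's __printf(a, b) macro expands to.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical wrapper for illustration: the format attribute tells the
 * compiler that argument 3 is a printf-style format string and that the
 * values to check against it start at argument 4.
 */
__attribute__((format(printf, 3, 4)))
static int sketch_print(char *buf, size_t size, const char *fmt, ...)
{
	va_list ap;
	int len;

	va_start(ap, fmt);
	len = vsnprintf(buf, size, fmt, ap);
	va_end(ap);

	return len;
}
```

With the attribute in place, a call such as sketch_print(buf, sizeof(buf), "%d", "text") is flagged by -Wformat at compile time instead of misbehaving at run time.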

Fixes: 9cb99c598643 ("tracing/osnoise: Array printk init and cleanup")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
 kernel/trace/trace_osnoise.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c
index 1fbd8525ab54..6fa015e57899 100644
--- a/kernel/trace/trace_osnoise.c
+++ b/kernel/trace/trace_osnoise.c
@@ -83,7 +83,7 @@ struct osnoise_instance {
 
 static struct list_head osnoise_instances;
 
-static void osnoise_print(const char *fmt, ...)
+static __printf(1, 2) void osnoise_print(const char *fmt, ...)
 {
 	struct osnoise_instance *inst;
 	struct trace_array *tr;
-- 
2.39.5


* [PATCH 1/2] tracing: work around -Wmissing-format-attribute warning
From: Arnd Bergmann @ 2026-06-02 15:07 UTC
  To: Steven Rostedt, Masami Hiramatsu, Andrew Morton, Petr Mladek,
	Nathan Chancellor
  Cc: Arnd Bergmann, Dennis Dalessandro, Jason Gunthorpe,
	Leon Romanovsky, Arend van Spriel, Miri Korenblit,
	Mathieu Desnoyers, Andy Shevchenko, Rasmus Villemoes,
	Sergey Senozhatsky, Nick Desaulniers, Bill Wendling, Justin Stitt,
	Vlastimil Babka, linux-rdma, linux-kernel, linux-wireless,
	brcm80211, brcm80211-dev-list.pdl, linux-trace-kernel, llvm

From: Arnd Bergmann <arnd@arndb.de>

A number of tracing headers turn off -Wsuggest-attribute=format for
gcc, but they don't turn it off for clang, so the same warning still
happens on new versions of clang that support the format attribute.

To avoid duplicating the same thing in each tracing header, as well
as changing all of them to also turn it off for clang, add a new
__vsnprintf() helper that is not annotated this way in linux/sprintf.h
but is defined to work the same way as the regular vsnprintf().

Aside from tracing, the same thing can be used in va_format(),
which is part of lib/vsprintf.c itself.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
This version is a fairly simplistic way to work around the warning
reliably. I have resent two more patches to address actually
missing annotations in device drivers. With all of these
out of the way, we can move the warning from the 'make W=1'
set into the default set.

I have also prototyped a variant of this patch that passes down
a 'struct va_format' throughout the tracing code. That patch is
a little more invasive and I have no idea if that actually works,
but the result looks simpler.
---
 drivers/infiniband/hw/hfi1/trace_dbg.h               |  7 -------
 .../broadcom/brcm80211/brcmfmac/tracepoint.h         |  7 -------
 .../brcm80211/brcmsmac/brcms_trace_brcmsmac_msg.h    |  7 -------
 drivers/net/wireless/intel/iwlwifi/iwl-devtrace.c    |  3 ---
 include/linux/sprintf.h                              |  1 +
 include/linux/trace_events.h                         |  2 +-
 include/trace/events/qla.h                           |  7 -------
 include/trace/stages/stage6_event_callback.h         |  2 +-
 lib/vsprintf.c                                       | 12 +++++++-----
 samples/trace_events/trace-events-sample.c           |  2 --
 10 files changed, 10 insertions(+), 40 deletions(-)

diff --git a/drivers/infiniband/hw/hfi1/trace_dbg.h b/drivers/infiniband/hw/hfi1/trace_dbg.h
index 58304b91380f..05c4f1354269 100644
--- a/drivers/infiniband/hw/hfi1/trace_dbg.h
+++ b/drivers/infiniband/hw/hfi1/trace_dbg.h
@@ -22,11 +22,6 @@
 
 #define MAX_MSG_LEN 512
 
-#pragma GCC diagnostic push
-#ifndef __clang__
-#pragma GCC diagnostic ignored "-Wsuggest-attribute=format"
-#endif
-
 DECLARE_EVENT_CLASS(hfi1_trace_template,
 		    TP_PROTO(const char *function, struct va_format *vaf),
 		    TP_ARGS(function, vaf),
@@ -41,8 +36,6 @@ DECLARE_EVENT_CLASS(hfi1_trace_template,
 			      __get_str(msg))
 );
 
-#pragma GCC diagnostic pop
-
 /*
  * It may be nice to macroize the __hfi1_trace but the va_* stuff requires an
  * actual function to work and can not be in a macro.
diff --git a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/tracepoint.h b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/tracepoint.h
index 96032322b165..6c4e00e9ccd1 100644
--- a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/tracepoint.h
+++ b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/tracepoint.h
@@ -28,11 +28,6 @@ static inline void trace_ ## name(proto) {}
 
 #define MAX_MSG_LEN		100
 
-#pragma GCC diagnostic push
-#ifndef __clang__
-#pragma GCC diagnostic ignored "-Wsuggest-attribute=format"
-#endif
-
 TRACE_EVENT(brcmf_err,
 	TP_PROTO(const char *func, struct va_format *vaf),
 	TP_ARGS(func, vaf),
@@ -128,8 +123,6 @@ TRACE_EVENT(brcmf_sdpcm_hdr,
 		  __entry->len, ((u8 *)__get_dynamic_array(hdr))[4])
 );
 
-#pragma GCC diagnostic pop
-
 #ifdef CONFIG_BRCM_TRACING
 
 #undef TRACE_INCLUDE_PATH
diff --git a/drivers/net/wireless/broadcom/brcm80211/brcmsmac/brcms_trace_brcmsmac_msg.h b/drivers/net/wireless/broadcom/brcm80211/brcmsmac/brcms_trace_brcmsmac_msg.h
index 908ce3c864fe..dc296d8bf775 100644
--- a/drivers/net/wireless/broadcom/brcm80211/brcmsmac/brcms_trace_brcmsmac_msg.h
+++ b/drivers/net/wireless/broadcom/brcm80211/brcmsmac/brcms_trace_brcmsmac_msg.h
@@ -24,11 +24,6 @@
 
 #define MAX_MSG_LEN	100
 
-#pragma GCC diagnostic push
-#ifndef __clang__
-#pragma GCC diagnostic ignored "-Wsuggest-attribute=format"
-#endif
-
 DECLARE_EVENT_CLASS(brcms_msg_event,
 	TP_PROTO(struct va_format *vaf),
 	TP_ARGS(vaf),
@@ -77,8 +72,6 @@ TRACE_EVENT(brcms_dbg,
 	TP_printk("%s: %s", __get_str(func), __get_str(msg))
 );
 
-#pragma GCC diagnostic pop
-
 #endif /* __TRACE_BRCMSMAC_MSG_H */
 
 #ifdef CONFIG_BRCM_TRACING
diff --git a/drivers/net/wireless/intel/iwlwifi/iwl-devtrace.c b/drivers/net/wireless/intel/iwlwifi/iwl-devtrace.c
index 7e686297963d..49a8196430a7 100644
--- a/drivers/net/wireless/intel/iwlwifi/iwl-devtrace.c
+++ b/drivers/net/wireless/intel/iwlwifi/iwl-devtrace.c
@@ -12,9 +12,6 @@
 #include "iwl-trans.h"
 
 #define CREATE_TRACE_POINTS
-#ifdef CONFIG_CC_IS_GCC
-#pragma GCC diagnostic ignored "-Wsuggest-attribute=format"
-#endif
 #include "iwl-devtrace.h"
 
 EXPORT_TRACEPOINT_SYMBOL(iwlwifi_dev_ucode_event);
diff --git a/include/linux/sprintf.h b/include/linux/sprintf.h
index f06f7b785091..036a247b7c1e 100644
--- a/include/linux/sprintf.h
+++ b/include/linux/sprintf.h
@@ -12,6 +12,7 @@ __printf(2, 3) int sprintf(char *buf, const char * fmt, ...);
 __printf(2, 0) int vsprintf(char *buf, const char *, va_list);
 __printf(3, 4) int snprintf(char *buf, size_t size, const char *fmt, ...);
 __printf(3, 0) int vsnprintf(char *buf, size_t size, const char *fmt, va_list args);
+int __vsnprintf(char *buf, size_t size, const char *fmt, va_list args);
 __printf(3, 4) int scnprintf(char *buf, size_t size, const char *fmt, ...);
 __printf(3, 0) int vscnprintf(char *buf, size_t size, const char *fmt, va_list args);
 __printf(2, 3) __malloc char *kasprintf(gfp_t gfp, const char *fmt, ...);
diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index d49338c44014..4715330c7b6b 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -962,7 +962,7 @@ perf_trace_buf_submit(void *raw_data, int size, int rctx, u16 type,
 	int __ret;					\
 							\
 	va_copy(__ap, *(va));				\
-	__ret = vsnprintf(NULL, 0, fmt, __ap) + 1;	\
+	__ret = __vsnprintf(NULL, 0, fmt, __ap) + 1;	\
 	va_end(__ap);					\
 							\
 	min(__ret, TRACE_EVENT_STR_MAX);		\
diff --git a/include/trace/events/qla.h b/include/trace/events/qla.h
index 8800c35525a1..74a7534b99b6 100644
--- a/include/trace/events/qla.h
+++ b/include/trace/events/qla.h
@@ -9,11 +9,6 @@
 
 #define QLA_MSG_MAX 256
 
-#pragma GCC diagnostic push
-#ifndef __clang__
-#pragma GCC diagnostic ignored "-Wsuggest-attribute=format"
-#endif
-
 DECLARE_EVENT_CLASS(qla_log_event,
 	TP_PROTO(const char *buf,
 		struct va_format *vaf),
@@ -32,8 +27,6 @@ DECLARE_EVENT_CLASS(qla_log_event,
 	TP_printk("%s %s", __get_str(buf), __get_str(msg))
 );
 
-#pragma GCC diagnostic pop
-
 DEFINE_EVENT(qla_log_event, ql_dbg_log,
 	TP_PROTO(const char *buf, struct va_format *vaf),
 	TP_ARGS(buf, vaf)
diff --git a/include/trace/stages/stage6_event_callback.h b/include/trace/stages/stage6_event_callback.h
index 1691676fd858..7d6a6ca6e779 100644
--- a/include/trace/stages/stage6_event_callback.h
+++ b/include/trace/stages/stage6_event_callback.h
@@ -45,7 +45,7 @@
 	do {								\
 		va_list __cp_va;					\
 		va_copy(__cp_va, *(va));				\
-		vsnprintf(__get_str(dst), TRACE_EVENT_STR_MAX, fmt, __cp_va); \
+		__vsnprintf(__get_str(dst), TRACE_EVENT_STR_MAX, fmt, __cp_va); \
 		va_end(__cp_va);					\
 	} while (0)
 
diff --git a/lib/vsprintf.c b/lib/vsprintf.c
index a3017bc58986..3caf0796f54d 100644
--- a/lib/vsprintf.c
+++ b/lib/vsprintf.c
@@ -1702,9 +1702,6 @@ char *escaped_string(char *buf, char *end, u8 *addr, struct printf_spec spec,
 	return buf;
 }
 
-__diag_push();
-__diag_ignore(GCC, all, "-Wsuggest-attribute=format",
-	      "Not a valid __printf() conversion candidate.");
 static char *va_format(char *buf, char *end, struct va_format *va_fmt,
 		       struct printf_spec spec)
 {
@@ -1714,12 +1711,11 @@ static char *va_format(char *buf, char *end, struct va_format *va_fmt,
 		return buf;
 
 	va_copy(va, *va_fmt->va);
-	buf += vsnprintf(buf, end > buf ? end - buf : 0, va_fmt->fmt, va);
+	buf += __vsnprintf(buf, end > buf ? end - buf : 0, va_fmt->fmt, va);
 	va_end(va);
 
 	return buf;
 }
-__diag_pop();
 
 static noinline_for_stack
 char *uuid_string(char *buf, char *end, const u8 *addr,
@@ -2979,6 +2975,12 @@ int vsnprintf(char *buf, size_t size, const char *fmt_str, va_list args)
 }
 EXPORT_SYMBOL(vsnprintf);
 
+int __printf(3, 0) __vsnprintf(char *buf, size_t size, const char *fmt_str, va_list args)
+{
+	return vsnprintf(buf, size, fmt_str, args);
+}
+EXPORT_SYMBOL(__vsnprintf);
+
 /**
  * vscnprintf - Format a string and place it in a buffer
  * @buf: The buffer to place the result into
diff --git a/samples/trace_events/trace-events-sample.c b/samples/trace_events/trace-events-sample.c
index 9993fb5d5f98..ecc7db237f2e 100644
--- a/samples/trace_events/trace-events-sample.c
+++ b/samples/trace_events/trace-events-sample.c
@@ -9,8 +9,6 @@
  * creates the handles for the trace points.
  */
 #define CREATE_TRACE_POINTS
-__diag_ignore(GCC, all, "-Wsuggest-attribute=format",
-             "trace_event_get_offsets_foo_bar can't easily be annotated as __printf");
 #include "trace-events-sample.h"
 
 static const char *random_strings[] = {
-- 
2.39.5


* Re: [PATCH v2] tracing: fix CFI violation in probestub test
From: Masami Hiramatsu @ 2026-06-02 14:33 UTC (permalink / raw)
  To: Eva Kurchatova
  Cc: rostedt, linux-trace-kernel, linux-kernel, mathieu.desnoyers,
	peterz, jpoimboe, samitolvanen
In-Reply-To: <20260602135425.542073-1-eva.kurchatova@virtuozzo.com>

On Tue,  2 Jun 2026 16:54:08 +0300
Eva Kurchatova <eva.kurchatova@virtuozzo.com> wrote:

> When multiple callbacks are registered on the same tracepoint,
> callbacks will be indirectly called via traceiter helper.
> 
> Pointers to __probestub_* callbacks reside in __tracepoints section,
> which is excluded from ENDBR checks in objtool, causing objtool to
> assume those functions are never indirectly called.
> 
> Registering multiple callbacks using sched_wakeup test will result
> in #CP exception due to missing ENDBR in __probestub_sched_wakeup
> on a CFI-enabled machine.
> 
> Fix this by adding CFI_NOSEAL annotation to probestub declaration.
> 

Thanks, this looks good to me.

Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>

> Fixes: d5173f753750 ("objtool: Exclude __tracepoints data from ENDBR checks")
> Signed-off-by: Eva Kurchatova <eva.kurchatova@virtuozzo.com>
> ---
>  include/linux/tracepoint.h | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
> index 763eea4d80d8..38e9f49a71b7 100644
> --- a/include/linux/tracepoint.h
> +++ b/include/linux/tracepoint.h
> @@ -20,6 +20,7 @@
>  #include <linux/rcupdate_trace.h>
>  #include <linux/tracepoint-defs.h>
>  #include <linux/static_call.h>
> +#include <asm/cfi.h>
>  
>  struct module;
>  struct tracepoint;
> @@ -389,6 +390,13 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
>  	void __probestub_##_name(void *__data, proto)			\
>  	{								\
>  	}								\
> +	/*								\
> +	 * Annotate the probestub 'CFI_NOSEAL' to stop objtool from	\
> +	 * requesting the kernel remove the ENDBR, because the only	\
> +	 * references to the function are in the __tracepoint section,	\
> +	 * that objtool doesn't scan.					\
> +	 */								\
> +	CFI_NOSEAL(__probestub_##_name);				\
>  	DEFINE_STATIC_CALL(tp_func_##_name, __traceiter_##_name);	\
>  	DEFINE_RUST_DO_TRACE(_name, TP_PROTO(proto), TP_ARGS(args))
>  
> -- 
> 2.54.0
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

* [PATCH v2] tracing: fix CFI violation in probestub test
From: Eva Kurchatova @ 2026-06-02 13:54 UTC (permalink / raw)
  To: mhiramat, rostedt
  Cc: linux-trace-kernel, linux-kernel, mathieu.desnoyers, peterz,
	jpoimboe, samitolvanen, eva.kurchatova

When multiple callbacks are registered on the same tracepoint,
callbacks will be indirectly called via traceiter helper.

Pointers to __probestub_* callbacks reside in __tracepoints section,
which is excluded from ENDBR checks in objtool, causing objtool to
assume those functions are never indirectly called.

Registering multiple callbacks using sched_wakeup test will result
in #CP exception due to missing ENDBR in __probestub_sched_wakeup
on a CFI-enabled machine.

Fix this by adding CFI_NOSEAL annotation to probestub declaration.

Fixes: d5173f753750 ("objtool: Exclude __tracepoints data from ENDBR checks")
Signed-off-by: Eva Kurchatova <eva.kurchatova@virtuozzo.com>
---
 include/linux/tracepoint.h | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index 763eea4d80d8..38e9f49a71b7 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -20,6 +20,7 @@
 #include <linux/rcupdate_trace.h>
 #include <linux/tracepoint-defs.h>
 #include <linux/static_call.h>
+#include <asm/cfi.h>
 
 struct module;
 struct tracepoint;
@@ -389,6 +390,13 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
 	void __probestub_##_name(void *__data, proto)			\
 	{								\
 	}								\
+	/*								\
+	 * Annotate the probestub 'CFI_NOSEAL' to stop objtool from	\
+	 * requesting the kernel remove the ENDBR, because the only	\
+	 * references to the function are in the __tracepoint section,	\
+	 * that objtool doesn't scan.					\
+	 */								\
+	CFI_NOSEAL(__probestub_##_name);				\
 	DEFINE_STATIC_CALL(tp_func_##_name, __traceiter_##_name);	\
 	DEFINE_RUST_DO_TRACE(_name, TP_PROTO(proto), TP_ARGS(args))
 
-- 
2.54.0


* [syzbot] [trace?] KASAN: use-after-free Write in ring_buffer_read_page
From: syzbot @ 2026-06-02 13:45 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, mathieu.desnoyers, mhiramat,
	rostedt, syzkaller-bugs

Hello,

syzbot found the following issue on:

HEAD commit:    e7ae89a0c97c Linux 7.1-rc5
git tree:       upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=16f06e2e580000
kernel config:  https://syzkaller.appspot.com/x/.config?x=58acee1ac5406016
dashboard link: https://syzkaller.appspot.com/bug?extid=2dd9d02f60775ce5c1fb
compiler:       gcc (Debian 14.2.0-19) 14.2.0, GNU ld (GNU Binutils for Debian) 2.44

Unfortunately, I don't have any reproducer for this issue yet.

Downloadable assets:
disk image: https://storage.googleapis.com/syzbot-assets/9b0c5b4e3645/disk-e7ae89a0.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/ed163d3ad68b/vmlinux-e7ae89a0.xz
kernel image: https://storage.googleapis.com/syzbot-assets/f2408b333334/bzImage-e7ae89a0.xz

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+2dd9d02f60775ce5c1fb@syzkaller.appspotmail.com

==================================================================
BUG: KASAN: use-after-free in ring_buffer_read_page+0xd51/0x15a0 kernel/trace/ring_buffer.c:7059
Write of size 16308 at addr ffff88805ceb404c by task syz.3.1872/14532

CPU: 0 UID: 0 PID: 14532 Comm: syz.3.1872 Tainted: G             L      syzkaller #0 PREEMPT(full) 
Tainted: [L]=SOFTLOCKUP
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/18/2026
Call Trace:
 <TASK>
 __dump_stack lib/dump_stack.c:94 [inline]
 dump_stack_lvl+0x100/0x190 lib/dump_stack.c:120
 print_address_description mm/kasan/report.c:378 [inline]
 print_report+0x13d/0x4b0 mm/kasan/report.c:482
 kasan_report+0xdf/0x1d0 mm/kasan/report.c:595
 check_region_inline mm/kasan/generic.c:186 [inline]
 kasan_check_range+0x10f/0x1e0 mm/kasan/generic.c:200
 __asan_memset+0x23/0x50 mm/kasan/shadow.c:84
 ring_buffer_read_page+0xd51/0x15a0 kernel/trace/ring_buffer.c:7059
 tracing_buffers_read+0x2bf/0xaf0 kernel/trace/trace.c:7129
 vfs_read+0x1e4/0xb30 fs/read_write.c:572
 ksys_read+0x12a/0x250 fs/read_write.c:717
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x10b/0x830 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fb1aad9ce59
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007fb1abca5028 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
RAX: ffffffffffffffda RBX: 00007fb1ab015fa0 RCX: 00007fb1aad9ce59
RDX: 0000000000001000 RSI: 00002000000002c0 RDI: 0000000000000008
RBP: 00007fb1aae32d6f R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007fb1ab016038 R14: 00007fb1ab015fa0 R15: 00007ffec139a1b8
 </TASK>

The buggy address belongs to the physical page:
page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x5ceb4
flags: 0xfff00000000000(node=0|zone=1|lastcpupid=0x7ff)
raw: 00fff00000000000 0000000000000000 dead000000000122 0000000000000000
raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
page_owner tracks the page as allocated
page last allocated via order 0, migratetype Unmovable, gfp_mask 0x44dc0(GFP_KERNEL|__GFP_ZERO|__GFP_RETRY_MAYFAIL|__GFP_COMP), pid 5959, tgid 5958 (syz.1.37), ts 95924498456, free_ts 95918916027
 set_page_owner include/linux/page_owner.h:32 [inline]
 post_alloc_hook+0xfd/0x120 mm/page_alloc.c:1858
 prep_new_page mm/page_alloc.c:1866 [inline]
 get_page_from_freelist+0x11a6/0x33b0 mm/page_alloc.c:3946
 __alloc_frozen_pages_noprof+0x27c/0x2bc0 mm/page_alloc.c:5226
 __alloc_pages_noprof+0xb/0x110 mm/page_alloc.c:5260
 __alloc_pages_node_noprof include/linux/gfp.h:289 [inline]
 alloc_pages_node_noprof include/linux/gfp.h:316 [inline]
 alloc_cpu_data+0x60/0x130 kernel/trace/ring_buffer.c:406
 ring_buffer_alloc_read_page+0x430/0x560 kernel/trace/ring_buffer.c:6801
 tracing_buffers_read+0x603/0xaf0 kernel/trace/trace.c:7110
 vfs_read+0x1e4/0xb30 fs/read_write.c:572
 ksys_read+0x12a/0x250 fs/read_write.c:717
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x10b/0x830 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
page last free pid 5946 tgid 5945 stack trace:
 reset_page_owner include/linux/page_owner.h:25 [inline]
 __free_pages_prepare mm/page_alloc.c:1402 [inline]
 __free_frozen_pages+0x747/0x1040 mm/page_alloc.c:2943
 tlb_batch_list_free mm/mmu_gather.c:161 [inline]
 tlb_finish_mmu+0x27d/0x810 mm/mmu_gather.c:552
 exit_mmap+0x454/0xa10 mm/mmap.c:1313
 __mmput+0x12a/0x410 kernel/fork.c:1178
 mmput+0x67/0x80 kernel/fork.c:1201
 exit_mm kernel/exit.c:582 [inline]
 do_exit+0x8b2/0x2af0 kernel/exit.c:964
 do_group_exit+0xd5/0x2a0 kernel/exit.c:1119
 get_signal+0x20ff/0x2210 kernel/signal.c:3037
 arch_do_signal_or_restart+0x91/0x7a0 arch/x86/kernel/signal.c:337
 __exit_to_user_mode_loop kernel/entry/common.c:64 [inline]
 exit_to_user_mode_loop+0x8b/0x4f0 kernel/entry/common.c:98
 __exit_to_user_mode_prepare include/linux/irq-entry-common.h:207 [inline]
 syscall_exit_to_user_mode_prepare include/linux/irq-entry-common.h:230 [inline]
 syscall_exit_to_user_mode include/linux/entry-common.h:318 [inline]
 do_syscall_64+0x6f2/0x830 arch/x86/entry/syscall_64.c:100
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

Memory state around the buggy address:
 ffff88805ceb4f00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 ffff88805ceb4f80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>ffff88805ceb5000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
                   ^
 ffff88805ceb5080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
 ffff88805ceb5100: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
==================================================================


---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.

If the report is already addressed, let syzbot know by replying with:
#syz fix: exact-commit-title

If you want to overwrite report's subsystems, reply with:
#syz set subsystems: new-subsystem
(See the list of subsystem names on the web dashboard)

If the report is a duplicate of another one, reply with:
#syz dup: exact-subject-of-another-report

If you want to undo deduplication, reply with:
#syz undup

* Re: [PATCH] tracing/events: Expand ring buffer for in-kernel event enables
From: Steven Rostedt @ 2026-06-02 13:00 UTC (permalink / raw)
  To: Manjunath Patil
  Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel
In-Reply-To: <20260601233716.2517987-1-manjunath.b.patil@oracle.com>

On Mon,  1 Jun 2026 16:24:43 -0700
Manjunath Patil <manjunath.b.patil@oracle.com> wrote:

> Ftrace keeps trace arrays at a boot-minimum ring-buffer size until
> tracing is used. Tracefs event-enable paths already call
> tracing_update_buffers() before enabling events, but the exported
> in-kernel helpers trace_set_clr_event() and trace_array_set_clr_event()
> directly enable events through __ftrace_set_clr_event().
> 
> This can leave events enabled by in-kernel users recording into the tiny
> boot-minimum buffer instead of the configured default-sized buffer. Any
> caller that enables events through these exported helpers observes
> different buffer-expansion behavior than a userspace tracefs event enable.
> 
> Expand the relevant trace array before enabling events through the
> exported in-kernel helpers, matching the tracefs event-enable behavior.
> Disabling events remains unchanged.

The above explains everything correctly, but you left out one thing: what needs this?

Internal code should not be using the main ring buffer except for
debugging, in which case you can use trace_printk(), which will cause the
tracing buffers to be expanded by default.

Other areas of the kernel should create their own trace array which will be
created expanded by default too.

-- Steve

* Re: [PATCH] rtla: Fix parsing of multi-character short options
From: Tomas Glozar @ 2026-06-02 12:56 UTC (permalink / raw)
  To: Steven Rostedt, Tomas Glozar
  Cc: John Kacur, Luis Goncalves, Crystal Wood, Costa Shulyupin,
	Wander Lairson Costa, LKML, linux-trace-kernel
In-Reply-To: <20260602125506.3325345-1-tglozar@redhat.com>

On Tue, 2 Jun 2026 at 14:55, Tomas Glozar <tglozar@redhat.com> wrote:
>
> A bug was reported where the parsing of multi-character short options,
> be it a short option with an argument specified without space (e.g.
> "-p100") or multiple short options in one argument (e.g. -un), ignores
> options specific to individual tools.
>
> Furthermore, if the rest of the option is supposed to be an argument, it
> gets reinterpreted as a string of options. For example, -p100 gets
> interpreted as -100, which, due to a hackish implementation, is read as
> --no-thread --no-irq --no-irq with timerlat hist, causing rtla to error
> out:
>
> $ rtla timerlat hist -p100
> no-irq and no-thread set, there is nothing to do here
>
> This behavior is caused by getopt_long() being called twice on each
> argument, once in common_parse_options(), once in [tool]_parse_args():
>
> - common_parse_options() calls getopt_long() with an array of options
>   common for all rtla tools, while suppressing errors (opterr = 0).
> - If the option fails to parse, common_parse_options() returns 0.
> - If 0 is returned from common_parse_options(), [tool]_parse_args()
>   calls getopt_long() again, with its own set of options.
>
> * [tool] means one of {osnoise,timerlat}_{top,hist}
>
> At least in glibc, getopt_long() increments its internal nextchar
> variable even if the option is not recognized. That means that in the
> case of "-p100", common_parse_options() sets nextchar pointing to '1',
> and timerlat_hist_parse_args() sees '1', not 'p'; the same then repeats
> for the first and second '0'.
>
> As there is no way to restore the correct internal state of
> getopt_long() reliably, fix the issue by merging the common options back
> to the longopt array and option string of the [tool]_parse_args()
> functions using a macro; only the switch part is left in the original
> function, which is renamed to set_common_option().
>
> Fixes: 850cd24cb6d6 ("tools/rtla: Add common_parse_options()")
> Reported-by: John Kacur <jkacur@redhat.com>
> Signed-off-by: Tomas Glozar <tglozar@redhat.com>
> ---

I forgot to add a note to the original email: this fix is only for 7.1.
7.0 needs a tweaked version of the commit, and 7.2 will remove the
command line parsing logic entirely and replace it with libsubcmd,
where this works.
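
For readers following the thread, the merged-table approach from the
quoted commit message can be sketched in userspace C as below. This is an
illustration under assumptions, not the patch itself: the structures, the
reduced option set, the hard-coded option string (rtla builds it with
getopt_auto()), and the two-argument set_common_option() are simplified
stand-ins.

```c
#include <getopt.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for rtla's parameter structures. */
struct common_params { const char *cpus; int debug; };
struct tool_params { struct common_params common; long period; int nano; };

/* Shared long options, spliced into each tool's table like COMMON_OPTIONS. */
#define COMMON_OPTIONS \
	{"cpus",  required_argument, 0, 'c'}, \
	{"debug", no_argument,       0, 'D'}

/* Returns 1 if c was a common option and has been applied, 0 otherwise. */
int set_common_option(int c, struct common_params *common)
{
	switch (c) {
	case 'c':
		common->cpus = optarg;
		return 1;
	case 'D':
		common->debug = 1;
		return 1;
	default:
		return 0;
	}
}

/* One getopt_long() pass over one merged table; returns 0 on success. */
int parse_args(int argc, char **argv, struct tool_params *params)
{
	static const struct option long_options[] = {
		COMMON_OPTIONS,
		{"period", required_argument, 0, 'p'},
		{"nano",   no_argument,       0, 'n'},
		{0, 0, 0, 0}
	};
	int c;

	optind = 0;	/* glibc convention: force a full rescan of argv */

	while ((c = getopt_long(argc, argv, "c:Dp:n", long_options, NULL)) != -1) {
		/* Common options are dispatched first, on the already parsed c. */
		if (set_common_option(c, &params->common))
			continue;

		switch (c) {
		case 'p':
			params->period = strtol(optarg, NULL, 10);
			break;
		case 'n':
			params->nano = 1;
			break;
		default:
			return -1;	/* unknown option or missing argument */
		}
	}
	return 0;
}
```

Parsing {"rtla", "-p100", "-nD", "-c", "0"} with this sketch should yield
period 100 with nano and debug set, because only one getopt_long() call
ever advances the parser state.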

Tomas


* [PATCH] rtla: Fix parsing of multi-character short options
From: Tomas Glozar @ 2026-06-02 12:55 UTC (permalink / raw)
  To: Steven Rostedt, Tomas Glozar
  Cc: John Kacur, Luis Goncalves, Crystal Wood, Costa Shulyupin,
	Wander Lairson Costa, LKML, linux-trace-kernel

A bug was reported where the parsing of multi-character short options,
be it a short option with an argument specified without space (e.g.
"-p100") or multiple short options in one argument (e.g. -un), ignores
options specific to individual tools.

Furthermore, if the rest of the option is supposed to be an argument, it
gets reinterpreted as a string of options. For example, -p100 gets
interpreted as -100, which, due to a hackish implementation, is read as
--no-thread --no-irq --no-irq with timerlat hist, causing rtla to error
out:

$ rtla timerlat hist -p100
no-irq and no-thread set, there is nothing to do here

This behavior is caused by getopt_long() being called twice on each
argument, once in common_parse_options(), once in [tool]_parse_args():

- common_parse_options() calls getopt_long() with an array of options
  common for all rtla tools, while suppressing errors (opterr = 0).
- If the option fails to parse, common_parse_options() returns 0.
- If 0 is returned from common_parse_options(), [tool]_parse_args()
  calls getopt_long() again, with its own set of options.

* [tool] means one of {osnoise,timerlat}_{top,hist}

At least in glibc, getopt_long() increments its internal nextchar
variable even if the option is not recognized. That means that in the
case of "-p100", common_parse_options() sets nextchar pointing to '1',
and timerlat_hist_parse_args() sees '1', not 'p'; the same then repeats
for the first and second '0'.

As there is no way to restore the correct internal state of
getopt_long() reliably, fix the issue by merging the common options back
to the longopt array and option string of the [tool]_parse_args()
functions using a macro; only the switch part is left in the original
function, which is renamed to set_common_option().

Fixes: 850cd24cb6d6 ("tools/rtla: Add common_parse_options()")
Reported-by: John Kacur <jkacur@redhat.com>
Signed-off-by: Tomas Glozar <tglozar@redhat.com>
---
 tools/tracing/rtla/src/common.c        | 28 +++++---------------------
 tools/tracing/rtla/src/common.h        | 12 ++++++++++-
 tools/tracing/rtla/src/osnoise_hist.c  |  7 ++++---
 tools/tracing/rtla/src/osnoise_top.c   |  7 ++++---
 tools/tracing/rtla/src/timerlat_hist.c |  7 ++++---
 tools/tracing/rtla/src/timerlat_top.c  |  7 ++++---
 6 files changed, 32 insertions(+), 36 deletions(-)

diff --git a/tools/tracing/rtla/src/common.c b/tools/tracing/rtla/src/common.c
index 35e3d3aa922e..bc9d01ddd102 100644
--- a/tools/tracing/rtla/src/common.c
+++ b/tools/tracing/rtla/src/common.c
@@ -84,37 +84,20 @@ int getopt_auto(int argc, char **argv, const struct option *long_opts)
 }
 
 /*
- * common_parse_options - parse common command line options
+ * set_common_option - set common options
  *
+ * @c: option character
  * @argc: argument count
  * @argv: argument vector
  * @common: common parameters structure
  *
  * Parse command line options that are common to all rtla tools.
  *
- * Returns: non zero if a common option was parsed, or 0
- * if the option should be handled by tool-specific parsing.
+ * Returns: 1 if the option was set, 0 otherwise.
  */
-int common_parse_options(int argc, char **argv, struct common_params *common)
+int set_common_option(int c, int argc, char **argv, struct common_params *common)
 {
 	struct trace_events *tevent;
-	int saved_state = optind;
-	int c;
-
-	static struct option long_options[] = {
-		{"cpus",                required_argument,      0, 'c'},
-		{"cgroup",              optional_argument,      0, 'C'},
-		{"debug",               no_argument,            0, 'D'},
-		{"duration",            required_argument,      0, 'd'},
-		{"event",               required_argument,      0, 'e'},
-		{"house-keeping",       required_argument,      0, 'H'},
-		{"priority",            required_argument,      0, 'P'},
-		{0, 0, 0, 0}
-	};
-
-	opterr = 0;
-	c = getopt_auto(argc, argv, long_options);
-	opterr = 1;
 
 	switch (c) {
 	case 'c':
@@ -154,11 +137,10 @@ int common_parse_options(int argc, char **argv, struct common_params *common)
 		common->set_sched = 1;
 		break;
 	default:
-		optind = saved_state;
 		return 0;
 	}
 
-	return c;
+	return 1;
 }
 
 /*
diff --git a/tools/tracing/rtla/src/common.h b/tools/tracing/rtla/src/common.h
index 51665db4ffce..8921807bda98 100644
--- a/tools/tracing/rtla/src/common.h
+++ b/tools/tracing/rtla/src/common.h
@@ -178,7 +178,17 @@ int osnoise_set_stop_total_us(struct osnoise_context *context,
 			      long long stop_total_us);
 
 int getopt_auto(int argc, char **argv, const struct option *long_opts);
-int common_parse_options(int argc, char **argv, struct common_params *common);
+
+#define COMMON_OPTIONS \
+	{"cpus",                required_argument,      0, 'c'},\
+	{"cgroup",              optional_argument,      0, 'C'},\
+	{"debug",               no_argument,            0, 'D'},\
+	{"duration",            required_argument,      0, 'd'},\
+	{"event",               required_argument,      0, 'e'},\
+	{"house-keeping",       required_argument,      0, 'H'},\
+	{"priority",            required_argument,      0, 'P'}
+int set_common_option(int c, int argc, char **argv, struct common_params *common);
+
 int common_apply_config(struct osnoise_tool *tool, struct common_params *params);
 int top_main_loop(struct osnoise_tool *tool);
 int hist_main_loop(struct osnoise_tool *tool);
diff --git a/tools/tracing/rtla/src/osnoise_hist.c b/tools/tracing/rtla/src/osnoise_hist.c
index 8ad816b80265..cb4ce58c5987 100644
--- a/tools/tracing/rtla/src/osnoise_hist.c
+++ b/tools/tracing/rtla/src/osnoise_hist.c
@@ -475,6 +475,7 @@ static struct common_params
 
 	while (1) {
 		static struct option long_options[] = {
+			COMMON_OPTIONS,
 			{"auto",		required_argument,	0, 'a'},
 			{"bucket-size",		required_argument,	0, 'b'},
 			{"entries",		required_argument,	0, 'E'},
@@ -498,15 +499,15 @@ static struct common_params
 			{0, 0, 0, 0}
 		};
 
-		if (common_parse_options(argc, argv, &params->common))
-			continue;
-
 		c = getopt_auto(argc, argv, long_options);
 
 		/* detect the end of the options. */
 		if (c == -1)
 			break;
 
+		if (set_common_option(c, argc, argv, &params->common))
+			continue;
+
 		switch (c) {
 		case 'a':
 			/* set sample stop to auto_thresh */
diff --git a/tools/tracing/rtla/src/osnoise_top.c b/tools/tracing/rtla/src/osnoise_top.c
index 244bdce022ad..e65312ec26c4 100644
--- a/tools/tracing/rtla/src/osnoise_top.c
+++ b/tools/tracing/rtla/src/osnoise_top.c
@@ -328,6 +328,7 @@ struct common_params *osnoise_top_parse_args(int argc, char **argv)
 
 	while (1) {
 		static struct option long_options[] = {
+			COMMON_OPTIONS,
 			{"auto",		required_argument,	0, 'a'},
 			{"help",		no_argument,		0, 'h'},
 			{"period",		required_argument,	0, 'p'},
@@ -346,15 +347,15 @@ struct common_params *osnoise_top_parse_args(int argc, char **argv)
 			{0, 0, 0, 0}
 		};
 
-		if (common_parse_options(argc, argv, &params->common))
-			continue;
-
 		c = getopt_auto(argc, argv, long_options);
 
 		/* Detect the end of the options. */
 		if (c == -1)
 			break;
 
+		if (set_common_option(c, argc, argv, &params->common))
+			continue;
+
 		switch (c) {
 		case 'a':
 			/* set sample stop to auto_thresh */
diff --git a/tools/tracing/rtla/src/timerlat_hist.c b/tools/tracing/rtla/src/timerlat_hist.c
index 79142af4f566..4b6708e333b8 100644
--- a/tools/tracing/rtla/src/timerlat_hist.c
+++ b/tools/tracing/rtla/src/timerlat_hist.c
@@ -785,6 +785,7 @@ static struct common_params
 
 	while (1) {
 		static struct option long_options[] = {
+			COMMON_OPTIONS,
 			{"auto",		required_argument,	0, 'a'},
 			{"bucket-size",		required_argument,	0, 'b'},
 			{"entries",		required_argument,	0, 'E'},
@@ -819,11 +820,11 @@ static struct common_params
 			{0, 0, 0, 0}
 		};
 
-		if (common_parse_options(argc, argv, &params->common))
-			continue;
-
 		c = getopt_auto(argc, argv, long_options);
 
+		if (set_common_option(c, argc, argv, &params->common))
+			continue;
+
 		/* detect the end of the options. */
 		if (c == -1)
 			break;
diff --git a/tools/tracing/rtla/src/timerlat_top.c b/tools/tracing/rtla/src/timerlat_top.c
index 64cbdcc878b0..91f88bbebad9 100644
--- a/tools/tracing/rtla/src/timerlat_top.c
+++ b/tools/tracing/rtla/src/timerlat_top.c
@@ -549,6 +549,7 @@ static struct common_params
 
 	while (1) {
 		static struct option long_options[] = {
+			COMMON_OPTIONS,
 			{"auto",		required_argument,	0, 'a'},
 			{"help",		no_argument,		0, 'h'},
 			{"irq",			required_argument,	0, 'i'},
@@ -577,11 +578,11 @@ static struct common_params
 			{0, 0, 0, 0}
 		};
 
-		if (common_parse_options(argc, argv, &params->common))
-			continue;
-
 		c = getopt_auto(argc, argv, long_options);
 
+		if (set_common_option(c, argc, argv, &params->common))
+			continue;
+
 		/* detect the end of the options. */
 		if (c == -1)
 			break;
-- 
2.54.0


* Re: [PATCH v2 4/8] riscv: ftrace: always preserve s0 in dynamic ftrace register frame
From: Shuai Xue @ 2026-06-02 11:37 UTC (permalink / raw)
  To: Wang Han, Paul Walmsley, Palmer Dabbelt, Albert Ou
  Cc: Steven Rostedt, Alexandre Ghiti, Masami Hiramatsu, Mark Rutland,
	Catalin Marinas, Chen Pei, Andy Chiu, Björn Töpel,
	Deepak Gupta, Puranjay Mohan, Conor Dooley, Josh Poimboeuf,
	Jiri Kosina, Miroslav Benes, Petr Mladek, Joe Lawrence,
	Shuah Khan, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, oliver.yang, zhuo.song, jkchen, linux-riscv,
	linux-kernel, linux-trace-kernel, live-patching, linux-kselftest,
	linux-perf-users
In-Reply-To: <20260528082310.1994388-5-wanghan@linux.alibaba.com>

On 5/28/26 4:23 PM, Wang Han wrote:
> The dynamic ftrace entry/exit only saved s0 (the architectural frame
> pointer) when HAVE_FUNCTION_GRAPH_FP_TEST was selected. The upcoming
> reliable frame-pointer unwinder needs s0 to be present in
> ftrace_regs unconditionally so it can use the frame pointer as the
> function-graph return-address cookie regardless of FP_TEST.

Nit: A preferred commit log:

struct __arch_ftrace_regs declares s0 unconditionally, and both
ftrace_regs_get_frame_pointer() and ftrace_partial_regs() read it
unconditionally. But the SAVE_ABI_REGS / RESTORE_ABI_REGS macros in
mcount-dyn.S only stored s0 under HAVE_FUNCTION_GRAPH_FP_TEST
(CONFIG_FUNCTION_GRAPH_TRACER && CONFIG_FRAME_POINTER). With
CONFIG_FRAME_POINTER=n the slot held whatever was on the stack before,
so any callback going through ftrace_partial_regs() saw a garbage
regs->s0. RISC-V kernels default to FRAME_POINTER=y, which is why
this has not bitten in practice.

Save and restore s0 unconditionally in the dynamic ftrace ABI register
frame. This fixes the latent garbage-s0 case, brings the dynamic ftrace
path in line with the static _mcount path (mcount.S SAVE_ABI_STATE
already saves s0 unconditionally), and matches the frame layout already
documented in the comment above SAVE_ABI_REGS. It is also a
prerequisite for the upcoming reliable unwinder, which reads
ftrace_regs_get_frame_pointer(fregs) directly.

The cost is one extra REG_S/REG_L pair per traced call, negligible
compared to the overall ftrace cost; the existing FREGS_SIZE_ON_STACK
already reserved the slot, so no extra stack space is used.

Reviewed-by: Shuai Xue <xueshuai@linux.alibaba.com>

Thanks.
Shuai

* Re: [PATCH v2 3/8] riscv: stacktrace: disable KASAN instrumentation for stacktrace.o
From: Shuai Xue @ 2026-06-02 11:22 UTC (permalink / raw)
  To: Wang Han, Paul Walmsley, Palmer Dabbelt, Albert Ou
  Cc: Steven Rostedt, Alexandre Ghiti, Masami Hiramatsu, Mark Rutland,
	Catalin Marinas, Chen Pei, Andy Chiu, Björn Töpel,
	Deepak Gupta, Puranjay Mohan, Conor Dooley, Josh Poimboeuf,
	Jiri Kosina, Miroslav Benes, Petr Mladek, Joe Lawrence,
	Shuah Khan, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, oliver.yang, zhuo.song, jkchen, linux-riscv,
	linux-kernel, linux-trace-kernel, live-patching, linux-kselftest,
	linux-perf-users
In-Reply-To: <20260528082310.1994388-4-wanghan@linux.alibaba.com>



On 5/28/26 4:23 PM, Wang Han wrote:
> KASAN records stack traces for every alloc/free, which means it walks
> the unwinder very frequently. Instrumenting the stack trace collection
> code itself adds substantial overhead and makes the traces themselves
> noisier.
> 
> Mark stacktrace.o as not KASAN-instrumented, matching the arm, arm64
> and x86 treatment of their stack unwinding code. This is a prerequisite
> preference for the upcoming reliable unwinder, but the change is valid
> on its own.
> 
> Signed-off-by: Wang Han <wanghan@linux.alibaba.com>
> ---
>   arch/riscv/kernel/Makefile | 5 +++++
>   1 file changed, 5 insertions(+)
> 
> diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile
> index cabb99cadfb6..1cb6c9ab2981 100644
> --- a/arch/riscv/kernel/Makefile
> +++ b/arch/riscv/kernel/Makefile
> @@ -44,6 +44,11 @@ CFLAGS_REMOVE_return_address.o	= $(CC_FLAGS_FTRACE)
>   CFLAGS_REMOVE_sbi_ecall.o = $(CC_FLAGS_FTRACE)
>   endif
>   
> +# When KASAN is enabled, a stack trace is recorded for every alloc/free, which
> +# can significantly impact performance. Avoid instrumenting the stack trace
> +# collection code to minimize this impact.
> +KASAN_SANITIZE_stacktrace.o := n
> +

I checked the three referenced arches:
   - arm    (arch/arm/kernel/Makefile):    KASAN only
   - arm64  (arch/arm64/kernel/Makefile):  KASAN only
   - x86    (arch/x86/kernel/Makefile):    KASAN *and* KCOV
            (KCOV_INSTRUMENT_stacktrace.o := n, plus dumpstack and
             the unwind_*.o TUs)

So as written, this patch matches arm/arm64 but NOT x86. KCOV
instruments every basic-block edge; the unwinder is a hot path
(doubly so under KASAN, where it runs on every alloc/free), so the
same rationale that justifies disabling KASAN applies to KCOV. I'd
suggest making the claim true by adding:

  KCOV_INSTRUMENT_stacktrace.o := n

(RISC-V keeps its entire unwinder in stacktrace.o, so unlike x86
there's no dumpstack/unwind_*.o to also annotate — the single TU
covers the equivalent scope.)

Alternatively, if you'd rather keep it minimal, just drop "and x86"
from the changelog so the claim matches the code.


Thanks.
Shuai

* Re: [PATCH 1/2] rtla/timerlat: Fix parsing of short options with attached arguments
From: Tomas Glozar @ 2026-06-02 11:18 UTC (permalink / raw)
  To: John Kacur
  Cc: Steven Rostedt, linux-trace-kernel, Costa Shulyupin,
	Wander Lairson Costa, Crystal Wood, Luis Claudio R . Goncalves,
	linux-kernel
In-Reply-To: <20260601211538.381649-1-jkacur@redhat.com>

On Mon, 1 Jun 2026 at 23:15, John Kacur <jkacur@redhat.com> wrote:
>
> The timerlat hist command fails to parse short options with attached
> numeric arguments (e.g., -p100) due to conflicts between digit characters
> used as option values and numeric arguments to other options.
>
> This issue was discovered when testing rtla 7.1.0-rc6 with rteval,
> which passes arguments in the compact -p100 format. The rteval tests
> failed with the confusing error "no-irq and no-thread set, there is
> nothing to do here" even though neither option was specified.
>
> The root cause is two-fold:
>
> 1. Digit characters ('0'-'9') were used as short option values for
>    long-only options like --no-irq, --no-thread, etc. This caused
>    getopt_auto() to generate an option string like 'a:b:...:u0123456:7:8:9'
>    which made getopt treat digits as valid option characters.
>
> 2. The two-phase option parsing approach (alternating calls between
>    common_parse_options() and local option parsing) confused getopt's
>    internal state when encountering arguments like -p100.
>

What actually happens is that the call to getopt_long() in
common_parse_options() does not recognize -p, but it still increments
the internal static variable nextchar by 1 before falling back to
timerlat_hist_parse_args()'s getopt_long(). That means -p is ignored
entirely and timerlat_hist_parse_args() only sees -100.

Note that options with attached arguments are not required to trigger
the bug; bundled short options are enough:

$ rtla timerlat hist -nu -c 0 -d 1s
# RTLA timerlat histogram
# Time unit is microseconds (us)
# Duration:   0 00:00:02
(rtla 7.0)

vs:

$ rtla timerlat hist -nu -c 0 -d 1s
# RTLA timerlat histogram
# Time unit is nanoseconds (ns)
# Duration:   0 00:00:02
(rtla 6.19)

Again, the nanosecond option gets dropped by the
common_parse_options() mechanism pushing nextchar to 1.

> When a user passed -p100, getopt would incorrectly parse it as four
> separate options: -p, -1, -0, and -0, silently setting no_irq and
> no_thread flags instead of recognizing "100" as the period argument.
>
> The two-phase parsing was introduced in commit 850cd24cb6d6 ("tools/rtla:
> Add common_parse_options()") which first appeared in v7.0-rc1. Prior to
> that commit, -p100 worked correctly. The digit characters as option
> values existed since the original timerlat implementation, but only
> became problematic when combined with the two-phase parsing approach.
>

Note that RTLA documentation only ever mentions the syntax "-p 100".
Nevertheless, this is a real regression, and it's not unreasonable for
users to assume the syntax without the space also works, as is common
for most commands on Un*x, for example, gcc's -I/include/path syntax.

> Fix this by:
>
> 1. Eliminating digit characters from the option string by filtering them
>    out in getopt_auto(). This prevents conflicts with numeric arguments.
>
> 2. Refactoring timerlat_hist_parse_args() to use single-pass option
>    parsing. Instead of alternating between common_parse_options() and
>    local parsing, merge all options (common and local) into a single
>    option table and parse them in one pass. This matches the approach
>    used by cyclictest and other tools.

This is a partial revert of the common_parse_options() patchset [1]. It
does fix the bug, but only for one tool (timerlat hist).
getopt_long()'s design does not let the caller set its internal
nextchar variable directly. It can be reset (by calling getopt_long()
with optind = 0, not 1 as the documentation says), but relying on that
would require a lot of work, as we'd have to calculate and restore the
original nextchar. It might be easiest to revert the entire
consolidation patchset [1], if that's worth it.

[1] https://lore.kernel.org/linux-trace-kernel/20251209100047.2692515-1-costa.shul@redhat.com/T/#u
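
For illustration, the single-pass approach discussed above can be
sketched in user space as follows. This is a minimal sketch, not rtla
code: it assumes glibc's getopt_long() (where setting optind to 0 forces
a full re-initialisation), and demo_params, demo_parse_args() and the
three-option table are hypothetical names made up for the example.

```c
#include <assert.h>
#include <getopt.h>
#include <stdlib.h>

/* Hypothetical parameter block; not the real rtla structure. */
struct demo_params {
	long period_us;		/* -p / --period */
	int nano;		/* -n / --nano */
	int user_threads;	/* -u / --user-threads */
};

/*
 * Parse all options in one pass with a single option table, so that
 * getopt's internal scan position is never shared between two parsers.
 * Returns 0 on success, -1 on an unknown option or missing argument.
 */
static int demo_parse_args(int argc, char **argv, struct demo_params *params)
{
	static const struct option long_opts[] = {
		{"period",	 required_argument, NULL, 'p'},
		{"nano",	 no_argument,	    NULL, 'n'},
		{"user-threads", no_argument,	    NULL, 'u'},
		{NULL, 0, NULL, 0},
	};
	int c;

	assert(argc >= 1 && argv && params);

	params->period_us = 0;
	params->nano = 0;
	params->user_threads = 0;

	optind = 0;	/* assumption: glibc-style full reset of getopt state */
	opterr = 0;	/* keep getopt quiet; errors are reported via return */

	while ((c = getopt_long(argc, argv, "p:nu", long_opts, NULL)) != -1) {
		switch (c) {
		case 'p':
			params->period_us = strtol(optarg, NULL, 10);
			break;
		case 'n':
			params->nano = 1;
			break;
		case 'u':
			params->user_threads = 1;
			break;
		default:
			return -1;
		}
	}
	return 0;
}
```

With one table and one getopt_long() loop, "-p100", "-p 100",
"--period=100" and "--period 100" all reach the same case, and a
bundled "-nu" sets both flags.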

<truncated>



Tomas


^ permalink raw reply

* Re: [PATCH v2 2/8] riscv: stacktrace: Add frame record metadata
From: Shuai Xue @ 2026-06-02 11:18 UTC (permalink / raw)
  To: Wang Han, Paul Walmsley, Palmer Dabbelt, Albert Ou
  Cc: Steven Rostedt, Alexandre Ghiti, Masami Hiramatsu, Mark Rutland,
	Catalin Marinas, Chen Pei, Andy Chiu, Björn Töpel,
	Deepak Gupta, Puranjay Mohan, Conor Dooley, Josh Poimboeuf,
	Jiri Kosina, Miroslav Benes, Petr Mladek, Joe Lawrence,
	Shuah Khan, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, oliver.yang, zhuo.song, jkchen, linux-riscv,
	linux-kernel, linux-trace-kernel, live-patching, linux-kselftest,
	linux-perf-users
In-Reply-To: <20260528082310.1994388-3-wanghan@linux.alibaba.com>



On 5/28/26 4:23 PM, Wang Han wrote:
> Reliable frame-pointer unwinding needs an explicit way to identify
> exception boundaries and the final entry frame. The existing unwinder
> infers those boundaries from return addresses, which is too loose for a
> future reliable unwinder.
> 
> Add a small metadata frame record to pt_regs and initialize it on
> exception entry, kernel thread fork, user fork, and early idle task
> setup. The record uses a zero {fp, ra} sentinel plus a type field so a
> later unwinder can distinguish a final user-to-kernel boundary from a
> nested kernel pt_regs boundary.
> 
> This follows the arm64 metadata frame-record model, adapted to the
> RISC-V {fp, ra} frame record convention.
> 
> The metadata is established at the RISC-V entry boundaries that need an
> explicit unwind marker:
> 
>    * exception entry clears the metadata {fp, ra} pair and uses SPP
>      (or MPP in M-mode) to record whether the pt_regs frame is the final
>      user-to-kernel boundary or a nested kernel boundary;
>    * _start_kernel builds the init task's final metadata record, while
>      the secondary CPU path sets up s0 before smp_callin() so idle-task
>      unwinding does not inherit an undefined caller frame;
>    * copy_thread creates matching final metadata records for new kernel
>      and user tasks, and keeps s0 available for the frame-pointer chain;
>    * call_on_irq_stack still reserves an aligned stack slot, but links the
>      saved {fp, ra} with the raw frame-record size so s0 points at the
>      RISC-V frame record rather than past the alignment padding.
> 
> These changes keep s0 reserved for the frame-pointer chain at task and
> stack-switch boundaries.
> 
> Signed-off-by: Wang Han <wanghan@linux.alibaba.com>
> ---
>   arch/riscv/include/asm/ptrace.h           |  9 ++++
>   arch/riscv/include/asm/stacktrace/frame.h | 53 +++++++++++++++++++++++
>   arch/riscv/kernel/asm-offsets.c           |  4 ++
>   arch/riscv/kernel/entry.S                 | 30 +++++++++++--
>   arch/riscv/kernel/head.S                  | 23 ++++++++++
>   arch/riscv/kernel/process.c               | 31 ++++++++++++-
>   6 files changed, 144 insertions(+), 6 deletions(-)
>   create mode 100644 arch/riscv/include/asm/stacktrace/frame.h
> 
> diff --git a/arch/riscv/include/asm/ptrace.h b/arch/riscv/include/asm/ptrace.h
> index addc8188152f..4b9b0f279214 100644
> --- a/arch/riscv/include/asm/ptrace.h
> +++ b/arch/riscv/include/asm/ptrace.h
> @@ -8,6 +8,7 @@
>   
>   #include <uapi/asm/ptrace.h>
>   #include <asm/csr.h>
> +#include <asm/stacktrace/frame.h>
>   #include <linux/compiler.h>
>   
>   #ifndef __ASSEMBLER__
> @@ -53,6 +54,14 @@ struct pt_regs {
>   	unsigned long cause;
>   	/* a0 value before the syscall */
>   	unsigned long orig_a0;
> +
> +	/*
> +	 * This frame record is entirely zeroed on exception entry, allowing the
> +	 * unwinder to identify exception boundaries. The type field encodes
> +	 * whether the exception was taken from user (FINAL) or kernel (PT_REGS)
> +	 * mode.
> +	 */
> +	struct frame_record_meta stackframe;
>   };
>   
>   #define PTRACE_SYSEMU			0x1f
> diff --git a/arch/riscv/include/asm/stacktrace/frame.h b/arch/riscv/include/asm/stacktrace/frame.h
> new file mode 100644
> index 000000000000..5720a6c65fe8
> --- /dev/null
> +++ b/arch/riscv/include/asm/stacktrace/frame.h
> @@ -0,0 +1,53 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +#ifndef __ASM_RISCV_STACKTRACE_FRAME_H
> +#define __ASM_RISCV_STACKTRACE_FRAME_H
> +
> +/*
> + * See: arch/arm64/include/asm/stacktrace/frame.h for the reference
> + * implementation.
> + */
> +
> +/*
> + * - FRAME_META_TYPE_NONE
> + *
> + *   This value is reserved.
> + *
> + * - FRAME_META_TYPE_FINAL
> + *
> + *   The record is the last entry on the stack.
> + *   Unwinding should terminate successfully.
> + *
> + * - FRAME_META_TYPE_PT_REGS
> + *
> + *   The record is embedded within a struct pt_regs, recording the registers at
> + *   an arbitrary point in time.
> + *   Unwinding should consume pt_regs::epc, followed by pt_regs::ra.
> + *
> + * Note: all other values are reserved and should result in unwinding
> + * terminating with an error.
> + */
> +#define FRAME_META_TYPE_NONE		0
> +#define FRAME_META_TYPE_FINAL		1
> +#define FRAME_META_TYPE_PT_REGS		2
> +
> +#ifndef __ASSEMBLER__
> +/*
> + * A standard RISC-V frame record.
> + */
> +struct frame_record {
> +	unsigned long fp;
> +	unsigned long ra;
> +};
> +
> +/*
> + * A metadata frame record indicating a special unwind.
> + * The record::{fp,ra} fields must be zero to indicate the presence of
> + * metadata.
> + */
> +struct frame_record_meta {
> +	struct frame_record record;
> +	unsigned long type;
> +};
> +#endif /* __ASSEMBLER__ */
> +
> +#endif /* __ASM_RISCV_STACKTRACE_FRAME_H */
> diff --git a/arch/riscv/kernel/asm-offsets.c b/arch/riscv/kernel/asm-offsets.c
> index af827448a609..8dfcb5a44bb8 100644
> --- a/arch/riscv/kernel/asm-offsets.c
> +++ b/arch/riscv/kernel/asm-offsets.c
> @@ -131,6 +131,9 @@ void asm_offsets(void)
>   	OFFSET(PT_BADADDR, pt_regs, badaddr);
>   	OFFSET(PT_CAUSE, pt_regs, cause);
>   
> +	DEFINE(S_STACKFRAME,		offsetof(struct pt_regs, stackframe));
> +	DEFINE(S_STACKFRAME_TYPE,	offsetof(struct pt_regs, stackframe.type));
> +
>   	OFFSET(SUSPEND_CONTEXT_REGS, suspend_context, regs);
>   
>   	OFFSET(HIBERN_PBE_ADDR, pbe, address);
> @@ -501,6 +504,7 @@ void asm_offsets(void)
>   	OFFSET(SBI_HART_BOOT_STACK_PTR_OFFSET, sbi_hart_boot_data, stack_ptr);
>   
>   	DEFINE(STACKFRAME_SIZE_ON_STACK, ALIGN(sizeof(struct stackframe), STACK_ALIGN));
> +	DEFINE(STACKFRAME_RECORD_SIZE, sizeof(struct stackframe));
>   	OFFSET(STACKFRAME_FP, stackframe, fp);
>   	OFFSET(STACKFRAME_RA, stackframe, ra);
>   #ifdef CONFIG_FUNCTION_TRACER
> diff --git a/arch/riscv/kernel/entry.S b/arch/riscv/kernel/entry.S
> index d011fb51c59a..9cae0e1eba1c 100644
> --- a/arch/riscv/kernel/entry.S
> +++ b/arch/riscv/kernel/entry.S
> @@ -11,6 +11,7 @@
>   #include <asm/asm.h>
>   #include <asm/csr.h>
>   #include <asm/scs.h>
> +#include <asm/stacktrace/frame.h>
>   #include <asm/unistd.h>
>   #include <asm/page.h>
>   #include <asm/thread_info.h>
> @@ -193,6 +194,27 @@ SYM_CODE_START(handle_exception)
>   	REG_S s4, PT_CAUSE(sp)
>   	REG_S s5, PT_TP(sp)
>   
> +	/*
> +	 * Create a metadata frame record. The unwinder will use this to
> +	 * identify and unwind exception boundaries.
> +	 */
> +	REG_S zero, (S_STACKFRAME + STACKFRAME_FP)(sp) /* stackframe.record.fp = 0 */
> +	REG_S zero, (S_STACKFRAME + STACKFRAME_RA)(sp) /* stackframe.record.ra = 0 */
> +#ifdef CONFIG_RISCV_M_MODE
> +	li t0, SR_MPP
> +	and t0, s1, t0
> +#else
> +	andi t0, s1, SR_SPP
> +#endif
> +	bnez t0, 1f
> +	li t0, FRAME_META_TYPE_FINAL
> +	j 2f
> +1:
> +	li t0, FRAME_META_TYPE_PT_REGS
> +2:
> +	REG_S t0, S_STACKFRAME_TYPE(sp)
> +	addi s0, sp, S_STACKFRAME + STACKFRAME_RECORD_SIZE
> +

One spot for symmetry (non-blocking, robustness only):

handle_kernel_stack_overflow in entry.S allocates a full
PT_SIZE_ON_STACK frame (including the new stackframe metadata fields)
but, unlike .Lsave_context, never initialises stackframe.{record,type}
nor repoints s0 at the metadata. Those three words are therefore left
as whatever was on the overflow_stack.

In practice this is currently harmless: handle_bad_stack() only calls
__show_regs() (register dump, no unwind) followed by panic(), so the
reliable unwinder never actually consumes that metadata today. So this
is not an active bug — purely a robustness / symmetry gap.

It would still be worth initialising it, because the moment someone
adds a dump_stack() here, or another CPU NMI-backtraces this task, or a
kdump image is walked offline via the frame-record chain, the garbage
type field would mislead the unwinder. Since the overflow path is by
definition entered from kernel context, FRAME_META_TYPE_PT_REGS is the
right type, and it has the nice property that the unwinder will resume
from frame_pointer(regs)==regs->s0 (the pre-overflow s0 is already
saved into PT_S0 by save_from_x6_to_x31), giving the pre-overflow call
chain instead of a hard stop.
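
To make the concern concrete, here is a small user-space model of how
an unwinder might classify a metadata record. It is only a sketch: the
struct layouts and FRAME_META_TYPE_* values are copied from the patch,
while demo_unwind_action and demo_classify() are hypothetical and do
not correspond to any function in the series.

```c
#include <assert.h>

#define FRAME_META_TYPE_NONE	0
#define FRAME_META_TYPE_FINAL	1
#define FRAME_META_TYPE_PT_REGS	2

struct frame_record {
	unsigned long fp;
	unsigned long ra;
};

struct frame_record_meta {
	struct frame_record record;
	unsigned long type;
};

/* Hypothetical outcomes of looking at one record. */
enum demo_unwind_action {
	DEMO_UNWIND_NORMAL,	/* ordinary {fp, ra} record, keep walking */
	DEMO_UNWIND_DONE,	/* final record, terminate successfully */
	DEMO_UNWIND_PT_REGS,	/* resume from the embedding pt_regs */
	DEMO_UNWIND_ERROR,	/* reserved or unknown type */
};

static enum demo_unwind_action
demo_classify(const struct frame_record_meta *meta)
{
	assert(meta);

	/* A non-zero {fp, ra} pair is not a metadata sentinel. */
	if (meta->record.fp || meta->record.ra)
		return DEMO_UNWIND_NORMAL;

	switch (meta->type) {
	case FRAME_META_TYPE_FINAL:
		return DEMO_UNWIND_DONE;
	case FRAME_META_TYPE_PT_REGS:
		return DEMO_UNWIND_PT_REGS;
	default:
		/* Includes FRAME_META_TYPE_NONE and stale stack contents. */
		return DEMO_UNWIND_ERROR;
	}
}
```

In this model, three uninitialised words are either taken for an
ordinary record with a bogus fp or rejected as an unknown type, whereas
a zeroed record with type PT_REGS lets the walk continue from the saved
registers.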


>   	/*
>   	 * Set the scratch register to 0, so that if a recursive exception
>   	 * occurs, the exception vector knows it came from the kernel
> @@ -357,8 +379,8 @@ ASM_NOKPROBE(handle_kernel_stack_overflow)
>   
>   SYM_CODE_START(ret_from_fork_kernel_asm)
>   	call schedule_tail
> -	move a0, s1 /* fn_arg */
> -	move a1, s0 /* fn */
> +	move a0, s3 /* fn_arg */
> +	move a1, s2 /* fn */
>   	move a2, sp /* pt_regs */
>   	call ret_from_fork_kernel
>   	j ret_from_exception
> @@ -383,7 +405,7 @@ SYM_FUNC_START(call_on_irq_stack)
>   	addi	sp, sp, -STACKFRAME_SIZE_ON_STACK
>   	REG_S	ra, STACKFRAME_RA(sp)
>   	REG_S	s0, STACKFRAME_FP(sp)
> -	addi	s0, sp, STACKFRAME_SIZE_ON_STACK
> +	addi	s0, sp, STACKFRAME_RECORD_SIZE
>   
>   	/* Switch to the per-CPU shadow call stack */
>   	scs_save_current
> @@ -399,7 +421,7 @@ SYM_FUNC_START(call_on_irq_stack)
>   	scs_load_current
>   
>   	/* Switch back to the thread stack and restore ra and s0 */
> -	addi	sp, s0, -STACKFRAME_SIZE_ON_STACK
> +	addi	sp, s0, -STACKFRAME_RECORD_SIZE

Worth calling out explicitly that this is more than a cosmetic refactor:
on RV32 the previous code is actually wrong, and this hunk fixes it.

   STACKFRAME_SIZE_ON_STACK = ALIGN(sizeof(struct stackframe), STACK_ALIGN)
   STACKFRAME_RECORD_SIZE   = sizeof(struct stackframe)

   RV64: sizeof(stackframe) == STACK_ALIGN == 16, so the two are equal
         and the old code happened to work.
   RV32: sizeof(stackframe) == 8 but STACK_ALIGN == 16, so the old
         "addi s0, sp, STACKFRAME_SIZE_ON_STACK" left s0 pointing 8 bytes
         past the saved {fp, ra} pair, into the alignment padding. An FP
         unwinder that derives the frame record from s0 (e.g. via
         "(struct stackframe *)s0 - 1" or fixed -8/-16(s0) loads) would
         then read garbage instead of the saved fp/ra at the IRQ-stack
         boundary.

After the change s0 lands exactly at the end of the {fp, ra} record on
both RV32 and RV64, while the aligned slot is still reserved by the
unchanged "addi sp, sp, -STACKFRAME_SIZE_ON_STACK" / matching restore.
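
The arithmetic above can be checked with a few lines of user-space C.
This is only a sketch of the size calculation: DEMO_STACK_ALIGN,
DEMO_ALIGN() and the two helpers are hypothetical stand-ins for the
kernel's STACK_ALIGN, ALIGN() and the asm-offsets constants, and the
register width is passed in as a parameter instead of coming from the
target ABI.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_STACK_ALIGN 16u
/* Round x up to a multiple of a (a must be a power of two). */
#define DEMO_ALIGN(x, a) (((x) + ((a) - 1)) & ~((size_t)(a) - 1))

/* Size of a {fp, ra} frame record for a given register width in bytes. */
static size_t demo_record_size(size_t xlen_bytes)
{
	assert(xlen_bytes == 4 || xlen_bytes == 8);
	return 2 * xlen_bytes;
}

/*
 * How far "sp + aligned slot size" lands past the end of the record,
 * i.e. the gap between the old and the new s0 value.
 */
static size_t demo_s0_overshoot(size_t xlen_bytes)
{
	size_t raw = demo_record_size(xlen_bytes);
	size_t on_stack = DEMO_ALIGN(raw, DEMO_STACK_ALIGN);

	return on_stack - raw;
}
```

For 8-byte registers the overshoot is 0, so both constants give the
same s0; for 4-byte registers it is 8, which is the padding the old
code skipped over.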

Could you mention this in the v3 commit message? It's load-bearing
context for anyone bisecting an RV32 unwind regression later, and it
also justifies why the change is correct to apply ahead of the reliable
unwinder rather than folded into it.

>   	REG_L	ra, STACKFRAME_RA(sp)
>   	REG_L	s0, STACKFRAME_FP(sp)
>   	addi	sp, sp, STACKFRAME_SIZE_ON_STACK
> diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
> index f6a8ca49e627..00e16a24f149 100644
> --- a/arch/riscv/kernel/head.S
> +++ b/arch/riscv/kernel/head.S
> @@ -14,6 +14,7 @@
>   #include <asm/hwcap.h>
>   #include <asm/image.h>
>   #include <asm/scs.h>
> +#include <asm/stacktrace/frame.h>
>   #include <asm/usercfi.h>
>   #include "efi-header.S"
>   
> @@ -177,6 +178,14 @@ secondary_start_sbi:
>   	REG_S a0, (a1)
>   1:
>   #endif
> +
> +	/*
> +	 * Set up the frame pointer for the secondary idle task so reliable
> +	 * stack unwinding terminates at the metadata frame in task_pt_regs().
> +	 * Without this, the first frame records can inherit an undefined caller
> +	 * fp and unwind past smp_callin() into .Lsecondary_park.
> +	 */
> +	addi s0, sp, S_STACKFRAME + STACKFRAME_RECORD_SIZE
>   	scs_load_current
>   	call smp_callin
>   #endif /* CONFIG_SMP */
> @@ -305,6 +314,20 @@ SYM_CODE_START(_start_kernel)
>   	la tp, init_task
>   	la sp, init_thread_union + THREAD_SIZE
>   	addi sp, sp, -PT_SIZE_ON_STACK
> +
> +	/*
> +	 * Set up a metadata frame record for the init task so that
> +	 * the unwinder can identify the outermost frame by its
> +	 * {fp, ra} = {0, 0} sentinel at the bottom of pt_regs.
> +	 * fp/s0 points above the metadata record (RISC-V
> +	 * convention).
> +	 */
> +	REG_S zero, (S_STACKFRAME + STACKFRAME_FP)(sp)
> +	REG_S zero, (S_STACKFRAME + STACKFRAME_RA)(sp)
> +	li t0, FRAME_META_TYPE_FINAL
> +	REG_S t0, S_STACKFRAME_TYPE(sp)
> +	addi s0, sp, S_STACKFRAME + STACKFRAME_RECORD_SIZE
> +
>   #if defined(CONFIG_RISCV_SBI) && defined(CONFIG_RISCV_USER_CFI)
>   	li a7, SBI_EXT_FWFT
>   	li a6, SBI_EXT_FWFT_SET
> diff --git a/arch/riscv/kernel/process.c b/arch/riscv/kernel/process.c
> index b2df7f72241a..5212926b926b 100644
> --- a/arch/riscv/kernel/process.c
> +++ b/arch/riscv/kernel/process.c
> @@ -258,8 +258,23 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
>   		/* Supervisor/Machine, irqs on: */
>   		childregs->status = SR_PP | SR_PIE;
>   
> -		p->thread.s[0] = (unsigned long)args->fn;
> -		p->thread.s[1] = (unsigned long)args->fn_arg;
> +		/*
> +		 * Set up a metadata frame record at the bottom of the
> +		 * stack for the unwinder. Use FRAME_META_TYPE_FINAL
> +		 * since this is the outermost kernel entry for the new
> +		 * task. The frame_record::{fp,ra} are already zero from
> +		 * memset().
> +		 *
> +		 * fp/s0 points above the metadata record (RISC-V
> +		 * convention). fn and fn_arg are passed via s2/s3,
> +		 * keeping s0 available for the frame pointer chain.
> +		 */
> +		childregs->stackframe.type = FRAME_META_TYPE_FINAL;
> +
> +		p->thread.s[0] = (unsigned long)(&childregs->stackframe)
> +				+ sizeof(struct frame_record);
> +		p->thread.s[2] = (unsigned long)args->fn;
> +		p->thread.s[3] = (unsigned long)args->fn_arg;
>   		p->thread.ra = (unsigned long)ret_from_fork_kernel_asm;
>   	} else {
>   		/* allocate new shadow stack if needed. In case of CLONE_VM we have to */
> @@ -278,6 +293,18 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
>   		if (clone_flags & CLONE_SETTLS)
>   			childregs->tp = tls;
>   		childregs->a0 = 0; /* Return value of fork() */
> +
> +		/*
> +		 * Set up the unwind boundary: ensure the metadata
> +		 * frame record has its {fp,ra} sentinel zeroed and
> +		 * point fp/s0 above the metadata record. The type
> +		 * field is inherited from the parent's pt_regs.
> +		 */
> +		childregs->stackframe.record.fp = 0;
> +		childregs->stackframe.record.ra = 0;

This relies on the parent always entering the kernel via
handle_exception on a user->kernel boundary (which writes
FRAME_META_TYPE_FINAL). That is true for fork()/clone() today, but:

   - The kernel-thread path right above explicitly assigns type =
     FINAL, so the user-thread path looks asymmetric and like a
     possible omission to anyone reading it cold.
   - A future caller invoking kernel_clone() from a nested-kernel
     context (parent pt_regs.type == PT_REGS) would silently produce
     a broken unwind boundary on the new task.

Recommend explicitly setting it here too:

       childregs->stackframe.type = FRAME_META_TYPE_FINAL;

Even if currently redundant, it is one assignment, costs nothing, is
self-documenting, and fails closed instead of open.
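
A tiny user-space model of the difference, as a sketch only: demo_meta
and demo_copy_user_thread() are hypothetical, and the memcpy() stands
in for the child's pt_regs starting out as a copy of the parent's,
which is what the "type field is inherited" comment in the hunk
implies.

```c
#include <assert.h>
#include <string.h>

#define FRAME_META_TYPE_FINAL	1
#define FRAME_META_TYPE_PT_REGS	2

/* Flattened stand-in for struct frame_record_meta. */
struct demo_meta {
	unsigned long fp;
	unsigned long ra;
	unsigned long type;
};

/*
 * Model of the user-thread path: the child record starts as a copy of
 * the parent's, the {fp, ra} sentinel is zeroed, and the type is either
 * inherited (as in the patch) or set explicitly (as suggested).
 */
static void demo_copy_user_thread(struct demo_meta *child,
				  const struct demo_meta *parent,
				  int set_type_explicitly)
{
	assert(child && parent);

	memcpy(child, parent, sizeof(*child));
	child->fp = 0;
	child->ra = 0;
	if (set_type_explicitly)
		child->type = FRAME_META_TYPE_FINAL;
}
```

With a parent whose record is PT_REGS, the inheriting variant hands
PT_REGS to the child, while the explicit variant always yields FINAL.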


Thanks.
Shuai

^ permalink raw reply

* Re: [PATCH mm-unstable v18 11/14] mm/khugepaged: Introduce mTHP collapse support
From: Nico Pache @ 2026-06-02 10:58 UTC (permalink / raw)
  To: Lance Yang
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat, mhocko,
	peterx, pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang,
	rientjes, rostedt, rppt, ryan.roberts, shivankg, sunnanyong,
	surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe
In-Reply-To: <20260531071845.10875-1-lance.yang@linux.dev>

On Sun, May 31, 2026 at 1:19 AM Lance Yang <lance.yang@linux.dev> wrote:
>
>
> On Fri, May 22, 2026 at 09:00:06AM -0600, Nico Pache wrote:
> [...]
> >@@ -1587,10 +1749,11 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> >       if (result == SCAN_SUCCEED) {
> >               /* collapse_huge_page expects the lock to be dropped before calling */
> >               mmap_read_unlock(mm);
> >-              result = collapse_huge_page(mm, start_addr, referenced,
> >-                                          unmapped, cc, HPAGE_PMD_ORDER);
> >-              /* collapse_huge_page will return with the mmap_lock released */
> >+              nr_collapsed = mthp_collapse(mm, vma, start_addr, referenced,
> >+                                           unmapped, cc, enabled_orders);
> >+              /* mmap_lock was released above, set lock_dropped */
> >               *lock_dropped = true;
> >+              result = nr_collapsed ? SCAN_SUCCEED : SCAN_FAIL;
>
> Hmm ... don't we lose the allocation-failure result here?
>
> Previously collapse_scan_pmd() propagated SCAN_ALLOC_HUGE_PAGE_FAIL from
> collapse_huge_page(), so khugepaged would call khugepaged_alloc_sleep()
> in khugepaged_do_scan().
>
> Now if allocation fails and nr_collapsed stays 0, we just return
> SCAN_FAIL. So we won't back off via khugepaged_alloc_sleep() anymore?

Ok I did the error propagation! I think I handled both of these cases
you brought up pretty easily.

However, I don't know what to do in the following case: we successfully
collapsed some portion of the PMD, but during that process we also
hit an allocation failure. Is it best to back off entirely, or can we
treat some forward progress as a sign that we can continue trying
collapses without sleeping?

Basically, do we prioritize SCAN_ALLOC_HUGE_PAGE_FAIL or the
successful collapses as the returned value?

This is what I currently have:
done:
    if (collapsed)
        return SCAN_SUCCEED;
    if (alloc_failed)
        return SCAN_ALLOC_HUGE_PAGE_FAIL;
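
Spelled out as a self-contained user-space model, the ordering above
behaves as follows. This is a sketch only: the enum is a hypothetical
three-value stand-in for the kernel's scan_result, and the final
fallback to a generic failure is an assumption, since the snippet does
not show what follows the two checks.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in; the kernel enum has many more values. */
enum demo_scan_result {
	DEMO_SCAN_FAIL,
	DEMO_SCAN_SUCCEED,
	DEMO_SCAN_ALLOC_HUGE_PAGE_FAIL,
};

/* Success takes priority over a recorded allocation failure. */
static enum demo_scan_result demo_collapse_result(unsigned int nr_collapsed,
						  bool alloc_failed)
{
	if (nr_collapsed)
		return DEMO_SCAN_SUCCEED;
	if (alloc_failed)
		return DEMO_SCAN_ALLOC_HUGE_PAGE_FAIL;
	return DEMO_SCAN_FAIL;	/* assumed fallback */
}
```

The mixed case (some collapses plus an allocation failure) reports
success here, so the caller would not back off; swapping the two checks
would give the opposite policy.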

Thanks,
-- Nico

>
> Cheers, Lance
>


^ permalink raw reply

* Re: [PATCH mm-unstable v18 05/14] mm/khugepaged: require collapse_huge_page to enter/exit with the lock dropped
From: Nico Pache @ 2026-06-02 10:26 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, liam, mathieu.desnoyers, matthew.brost, mhiramat,
	mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe
In-Reply-To: <ah2Ro54tMDMsPevk@lucifer>

On Mon, Jun 1, 2026 at 8:13 AM Lorenzo Stoakes <ljs@kernel.org> wrote:
>
> On Fri, May 22, 2026 at 09:00:00AM -0600, Nico Pache wrote:
> > Currently the collapse_huge_page function requires the mmap_read_lock to
> > be held on entry and drops it before returning. This patch moves the
> > unlock into its parent caller, and changes the contract so that the
> > function is always entered and exited with the lock dropped.
> >
> > In future patches we need this guarantee, because in mTHP collapse we
> > may already have dropped the lock, and we do not want to conditionally
> > check for this by passing through the lock_dropped variable.
> >
> > No functional change is expected as one of the first things the
> > collapse_huge_page function does is drop this lock before allocating the
> > hugepage.
> >
> > Acked-by: David Hildenbrand (Arm) <david@kernel.org>
> > Signed-off-by: Nico Pache <npache@redhat.com>
>
> One small nit below, otherwise LGTM, so:
>
> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>

Thank you for reviewing!

>
> > ---
> >  mm/khugepaged.c | 16 ++++++++--------
> >  1 file changed, 8 insertions(+), 8 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index e98ba5b15163..fab35d318641 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -1208,6 +1208,12 @@ static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_stru
> >       return SCAN_SUCCEED;
> >  }
> >
> > +/*
> > + * collapse_huge_page expects the mmap_lock to be unlocked before entering and
> > + * will always return with the lock unlocked, to avoid holding the mmap_lock
> > + * while allocating a THP, as that could trigger direct reclaim/compaction.
> > + * Note that the VMA must be rechecked after grabbing the mmap_lock again.
> > + */
> >  static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >               int referenced, int unmapped, struct collapse_control *cc)
> >  {
> > @@ -1223,14 +1229,6 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> >
> >       VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> >
> > -     /*
> > -      * Before allocating the hugepage, release the mmap_lock read lock.
> > -      * The allocation can take potentially a long time if it involves
> > -      * sync compaction, and we do not need to hold the mmap_lock during
> > -      * that. We will recheck the vma after taking it again in write mode.
> > -      */
> > -     mmap_read_unlock(mm);
> > -
>
> NIT: Maybe worth an mmap_assert_locked()?

But it will already be unlocked here. The contract is that we enter
unlocked and exit unlocked.

Cheers,
-- Nico

>
> >       result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
> >       if (result != SCAN_SUCCEED)
> >               goto out_nolock;
> > @@ -1535,6 +1533,8 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> >  out_unmap:
> >       pte_unmap_unlock(pte, ptl);
> >       if (result == SCAN_SUCCEED) {
> > +             /* collapse_huge_page expects the lock to be dropped before calling */
> > +             mmap_read_unlock(mm);
> >               result = collapse_huge_page(mm, start_addr, referenced,
> >                                           unmapped, cc);
> >               /* collapse_huge_page will return with the mmap_lock released */
> > --
> > 2.54.0
> >
>
> Cheers, Lorenzo
>


^ permalink raw reply

* Re: [PATCH v8 2/6] mm/memory-failure: surface unhandlable kernel pages as -ENOTRECOVERABLE
From: David Hildenbrand (Arm) @ 2026-06-02  9:41 UTC (permalink / raw)
  To: Miaohe Lin, Breno Leitao
  Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest,
	linux-trace-kernel, kernel-team, Lance Yang, Andrew Morton,
	Lorenzo Stoakes, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Shuah Khan, Naoya Horiguchi,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Jonathan Corbet, Shuah Khan, Liam R. Howlett
In-Reply-To: <33ef8821-c809-b7d1-ea77-6e8a07a6e784@huawei.com>

On 6/2/26 05:08, Miaohe Lin wrote:
> On 2026/6/1 21:22, David Hildenbrand (Arm) wrote:
>> On 6/1/26 14:28, Miaohe Lin wrote:
>>>
>>> Thanks for your patch.
>>>
>>>
>>> Once shake_page finds a lightweight range-based way to shrink slab, slab pages could be freed
>>> into buddy and above PageSlab test should be removed then. Maybe add a TODO or XXX here?
>>>
>>>
>>> I'm not sure but is it safe or a common way to test PageReserved, PageSlab,
>>> PageTable and PageLargeKmalloc without extra page refcnt?
>>
>> Checking typed pages in a racy fashion is fine (PageSlab, PageTable,
>> PageLargeKmalloc).
> 
> Got it. Thanks.
> 
>> Checking PageReserved in a racy fashion is fine as well. TESTPAGEFLAG() will
>> allow checking it on compound pages.
> 
> It seems PageReserved is not intended to be set on compound pages. I see there is PF_NO_COMPOUND
> in its definition: PAGEFLAG(Reserved, reserved, PF_NO_COMPOUND).
> 
>>
>> For PageLargeKmalloc, we would want to check the head page, though. The page
>> type is only stored for the head page.
> 
> Maybe we should check the head page for PageSlab and PageTable too? alloc_slab_page only
> sets PageSlab on the head page and __pagetable_ctor uses __folio_set_pgtable to set PageTable
> on the folio.
> 
>>
>> So maybe we want to lookup the compound head (if any) and perform the type
>> checks against that?
> 
> Maybe we should or we might miss some pages that could have been handled. And
> if compound head is required, should we hold an extra page refcnt to guard against
> possible folio split race?

Races are fine. We might miss some pages, but that can happen on races either way.


I'd just do something like

if (PageReserved(page))
	return true;

head = compound_head(page);
return PageSlab(head) || ...;
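
As a self-contained illustration of that shape, here is a user-space
model. It is a sketch under assumptions: demo_page, demo_compound_head()
and demo_is_unhandlable() are hypothetical, and the set of types checked
on the head (slab, page table, large kmalloc) is taken from the types
named earlier in this thread, not from the elided part of the snippet.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum demo_page_type {
	DEMO_TYPE_NONE,
	DEMO_TYPE_SLAB,
	DEMO_TYPE_TABLE,
	DEMO_TYPE_LARGE_KMALLOC,
};

/* Toy page: the type is only recorded on the head page. */
struct demo_page {
	bool reserved;
	enum demo_page_type type;
	struct demo_page *head;	/* NULL unless this is a tail page */
};

static struct demo_page *demo_compound_head(struct demo_page *page)
{
	return page->head ? page->head : page;
}

static bool demo_is_unhandlable(struct demo_page *page)
{
	struct demo_page *head;

	assert(page);

	/* Reserved is a per-page flag, so test the page itself. */
	if (page->reserved)
		return true;

	/* Typed-page checks go through the head page. */
	head = demo_compound_head(page);
	return head->type == DEMO_TYPE_SLAB ||
	       head->type == DEMO_TYPE_TABLE ||
	       head->type == DEMO_TYPE_LARGE_KMALLOC;
}
```

A tail page whose head is typed as slab is reported as unhandlable
here, even though the tail itself carries no type.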
	

-- 
Cheers,

David

^ permalink raw reply

* Re: [PATCH v4 08/13] rv: Ensure synchronous cleanup for HA monitors
From: Nam Cao @ 2026-06-02  9:17 UTC (permalink / raw)
  To: Gabriele Monaco, linux-kernel, Steven Rostedt, Gabriele Monaco,
	linux-trace-kernel
  Cc: Wen Yang
In-Reply-To: <20260601153840.124372-9-gmonaco@redhat.com>

Gabriele Monaco <gmonaco@redhat.com> writes:

> HA monitors may start timers, all cleanup functions currently stop the
> timers asynchronously to avoid sleeping in the wrong context.
> Nothing makes sure running callbacks terminate on cleanup.
>
> Run the entire HA timer callback in an RCU read-side critical section,
> this way we can simply synchronize_rcu() with any pending timer and are
> sure any cleanup using kfree_rcu() runs after callbacks terminated.
> Additionally make sure any unlikely callback running late won't run any
> code if the monitor is marked as disabled or if destruction started.
> Use memory barriers to serialise with racing resets.
>
> Fixes: f5587d1b6ec9 ("rv: Add Hybrid Automata monitor type")
> Fixes: 4a24127bd6cb ("rv: Add support for per-object monitors in DA/HA")
> Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>

Reviewed-by: Nam Cao <namcao@linutronix.de>

^ permalink raw reply

* Re: [PATCH v7 07/42] KVM: guest_memfd: Only prepare folios for private pages
From: Suzuki K Poulose @ 2026-06-02  9:10 UTC (permalink / raw)
  To: ackerleytng, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
	david, ira.weiny, jmattson, jthoughton, michael.roth, oupton,
	pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
	steven.price, tabba, willy, wyihan, yan.y.zhao, forkloop,
	pratyush, aneesh.kumar, liam, Paolo Bonzini, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
	Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka
  Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <d01cf1ec-b85d-4af6-9810-8107c0e2a4ec@arm.com>



On 02/06/2026 09:55, Suzuki K Poulose wrote:
> On 23/05/2026 01:17, Ackerley Tng via B4 Relay wrote:
>> From: Ackerley Tng <ackerleytng@google.com>
>>
>> All-shared guest_memfd used to be only supported for non-CoCo VMs where
>> preparation doesn't apply. INIT_SHARED is about to be supported for
>> non-CoCo VMs in a later patch in this series.
> 
> nit: s/non-CoCo/CoCo ?
> 
>>
>> In addition, KVM_SET_MEMORY_ATTRIBUTES2 is about to be supported in
>> guest_memfd in a later patch in this series.
>>
>> This means that the kvm fault handler may now call kvm_gmem_get_pfn() on a
>> shared folio for a CoCo VM where preparation applies.
>>
>> Add a check to make sure that preparation is only performed for private
>> folios.
>>
>> Preparation will be undone on freeing (see kvm_gmem_free_folio()) and on
>> conversion to shared.
>>
>> Signed-off-by: Michael Roth <michael.roth@amd.com>
> 
> nit: Missing Co-Developed-by: ?
> 
>> Reviewed-by: Fuad Tabba <tabba@google.com>
>> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
>> ---
>>   virt/kvm/guest_memfd.c | 9 ++++++---
>>   1 file changed, 6 insertions(+), 3 deletions(-)
>>
>> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
>> index 78e5435967341..adf57a3a1f5dd 100644
>> --- a/virt/kvm/guest_memfd.c
>> +++ b/virt/kvm/guest_memfd.c
>> @@ -894,6 +894,7 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>>                int *max_order)
>>   {
>>       pgoff_t index = kvm_gmem_get_index(slot, gfn);
>> +    struct inode *inode;
>>       struct folio *folio;
>>       int r = 0;
>> @@ -901,7 +902,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>>       if (!file)
>>           return -EFAULT;
>> -    filemap_invalidate_lock_shared(file_inode(file)->i_mapping);
>> +    inode = file_inode(file);
>> +    filemap_invalidate_lock_shared(inode->i_mapping);
>>       folio = __kvm_gmem_get_pfn(file, slot, index, pfn, max_order);
>>       if (IS_ERR(folio)) {
>> @@ -914,7 +916,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>>           folio_mark_uptodate(folio);
>>       }
>> -    r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
>> +    if (kvm_gmem_is_private_mem(inode, index))
> 
> Don't we need to make sure the entire folio is private, not just the
> page at the index?
>      if (kvm_gmem_range_is_private(, index, folio_nr_pages(folio)) ?

Or rather, should we go through the individual pages and apply the
preparation only to the ones that are private?
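
The two alternatives raised here can be contrasted in a small
user-space model. This is only a sketch: the per-index bool array
stands in for the guest_memfd private/shared attribute, and
demo_range_is_private() and demo_prepare_private_pages() are
hypothetical helpers, not functions from the series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Option 1: true only if every page in [index, index + nr) is private. */
static bool demo_range_is_private(const bool *is_private, size_t index,
				  size_t nr)
{
	size_t i;

	assert(is_private);

	for (i = 0; i < nr; i++)
		if (!is_private[index + i])
			return false;
	return true;
}

/*
 * Option 2: walk the pages of the folio and mark only the private ones
 * as prepared. Returns the number of pages prepared.
 */
static size_t demo_prepare_private_pages(const bool *is_private,
					 bool *prepared, size_t index,
					 size_t nr)
{
	size_t i, count = 0;

	assert(is_private && prepared);

	for (i = 0; i < nr; i++) {
		if (is_private[index + i]) {
			prepared[index + i] = true;
			count++;
		}
	}
	return count;
}
```

For a four-page folio with one shared page, the first helper refuses
the whole range, while the second prepares the three private pages and
leaves the shared one untouched.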

Suzuki

> 
> Suzuki
> 
>> +        r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
>>       folio_unlock(folio);
>> @@ -924,7 +927,7 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>>           folio_put(folio);
>>   out:
>> -    filemap_invalidate_unlock_shared(file_inode(file)->i_mapping);
>> +    filemap_invalidate_unlock_shared(inode->i_mapping);
>>       return r;
>>   }
>>   EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_gmem_get_pfn);
>>
> 


^ permalink raw reply
