Linux Trace Kernel
* Re: [PATCH v4 2/2] tracing: Drain deferred trigger frees if kthread creation fails
From: Wesley Atwell @ 2026-03-28  4:56 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-trace-kernel, linux-kernel, mhiramat, mark.rutland,
	mathieu.desnoyers, tom.zanussi
In-Reply-To: <20260327223022.167defcc@robin>

Hi Steve,

I'm glad the test case was helpful. I'll include similar testing
details in future commit messages and avoid grouping unrelated
patches.

Thanks,
Wesley Atwell

^ permalink raw reply

* Re: [PATCH v4 2/2] tracing: Drain deferred trigger frees if kthread creation fails
From: Steven Rostedt @ 2026-03-28  2:30 UTC (permalink / raw)
  To: Wesley Atwell
  Cc: linux-trace-kernel, linux-kernel, mhiramat, mark.rutland,
	mathieu.desnoyers, tom.zanussi
In-Reply-To: <CAN=sVvzFMC1m2aGT23aRpPpoddBVP59mQBmEsQyEPKBYm3J_Vw@mail.gmail.com>

On Fri, 27 Mar 2026 16:41:52 -0600
Wesley Atwell <atwellwea@gmail.com> wrote:

> Yes,
> 
> This kernel command line reliably reaches trigger_data_free() during boot:
> 
> trace_event=sched:sched_switch
> trace_trigger=sched_switch.traceon,sched_switch.traceon
> 
> On an unpatched tree, that crashes during early boot before userspace.
> The call trace goes through:
> 
> trigger_data_free()
> __kthread_create_on_node()
> try_to_wake_up()
> 
> The stack also shows the boot-time trigger registration path:
> 
> event_trigger_parse()
> trigger_process_regex()
> __trace_early_add_events()

Thanks for this. I can reproduce the crash. I'm also going to add this
to the change log as it is useful (I'll even add it to one of my
regression tests). I'll take this patch separately (this didn't need to
be a patch series, as the two patches do not depend on each other).

-- Steve

^ permalink raw reply

* Re: [PATCH v4 2/2] tracing: Drain deferred trigger frees if kthread creation fails
From: Wesley Atwell @ 2026-03-27 22:41 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-trace-kernel, linux-kernel, mhiramat, mark.rutland,
	mathieu.desnoyers, tom.zanussi
In-Reply-To: <20260327150634.5df3cf4f@gandalf.local.home>

Yes,

This kernel command line reliably reaches trigger_data_free() during boot:

trace_event=sched:sched_switch
trace_trigger=sched_switch.traceon,sched_switch.traceon

On an unpatched tree, that crashes during early boot before userspace.
The call trace goes through:

trigger_data_free()
__kthread_create_on_node()
try_to_wake_up()

The stack also shows the boot-time trigger registration path:

event_trigger_parse()
trigger_process_regex()
__trace_early_add_events()

With v4 applied, the same command line boots successfully. The guest log shows:

Failed to register trigger 'traceon' on event sched_switch

And /sys/kernel/tracing/events/sched/sched_switch/trigger contains:

traceon:unlimited

I also verified patch 1 with repeated trace_trigger= parameters:
before the patch, only the last parameter was preserved; after the
patch, both triggers were installed.

Thanks,
Wesley Atwell

^ permalink raw reply

* Re: [PATCH v13 4/4] ring-buffer: Add persistent ring buffer selftest
From: Steven Rostedt @ 2026-03-27 20:47 UTC (permalink / raw)
  To: Masami Hiramatsu (Google)
  Cc: Mathieu Desnoyers, linux-kernel, linux-trace-kernel, Ian Rogers
In-Reply-To: <20260327162508.6cac690c@gandalf.local.home>

On Fri, 27 Mar 2026 16:25:08 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> Also, I noticed that there's nothing that reads the RB_MISSING as I thought
> it might. I'll have to look into how to pass that info to the trace output.

And when I cat /sys/kernel/tracing/instances/ptracingtest/per_cpu/cpuX/trace_pipe

   (where X is the failed buffer)

It triggered an infinite loop of:

[  206.549217] ------------[ cut here ]------------
[  206.550907] WARNING: kernel/trace/ring_buffer.c:5751 at __rb_get_reader_page+0xa6b/0x1040, CPU#2: cat/1197
[  206.554111] Modules linked in:
[  206.555331] CPU: 2 UID: 0 PID: 1197 Comm: cat Tainted: G        W           7.0.0-rc4-test-00028-g7b37f48b2c57-dirty #276 PREEMPT(full) 
[  206.559048] Tainted: [W]=WARN
[  206.560244] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.17.0-debian-1.17.0-1 04/01/2014
[  206.563212] RIP: 0010:__rb_get_reader_page+0xa6b/0x1040
[  206.564964] Code: ff df 48 c1 ea 03 80 3c 02 00 0f 85 4a 05 00 00 48 8b 43 10 be 04 00 00 00 4c 8d 60 08 4c 89 e7 e8 9a 2d 63 00 f0 41 ff 04 24 <0f> 0b e9 36 fb ff ff e8 29 39 05 00 fb 0f 1f 44 00 00 4d 85 f6 0f
[  206.572295] RSP: 0018:ffff888112a77938 EFLAGS: 00010006
[  206.574095] RAX: 0000000000000001 RBX: ffff888100d6e000 RCX: 0000000000000001
[  206.576458] RDX: 0000000000000001 RSI: 0000000000000004 RDI: ffff88810027b808
[  206.578749] RBP: 1ffff1102254ef34 R08: ffffffff909a1556 R09: ffffed102004f701
[  206.581020] R10: ffffed102004f702 R11: ffff88823443a000 R12: ffff88810027b808
[  206.583312] R13: ffff888100f65f00 R14: ffff888100f65f00 R15: dffffc0000000000
[  206.585647] FS:  00007f98e4d80780(0000) GS:ffff88829e3c2000(0000) knlGS:0000000000000000
[  206.588246] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  206.590179] CR2: 00007f98e4d3e000 CR3: 000000012272e006 CR4: 0000000000172ef0
[  206.592444] Call Trace:
[  206.593518]  <TASK>
[  206.594436]  ? __pfx___rb_get_reader_page+0x10/0x10
[  206.596148]  ? lock_acquire+0x1b2/0x340
[  206.597599]  rb_buffer_peek+0x37e/0x520
[  206.598954]  ring_buffer_peek+0xe9/0x310
[  206.601956]  peek_next_entry+0x15a/0x280
[  206.603420]  __find_next_entry+0x39f/0x530
[  206.604918]  ? __pfx___mutex_lock+0x10/0x10
[  206.606474]  ? rcu_is_watching+0x15/0xb0
[  206.616049]  ? __pfx___find_next_entry+0x10/0x10
[  206.617741]  ? preempt_count_sub+0x10c/0x1c0
[  206.619242]  ? __pfx_down_read+0x10/0x10
[  206.620687]  trace_find_next_entry_inc+0x2f/0x240
[  206.622351]  tracing_read_pipe+0x4e7/0xc60
[  206.623852]  ? rw_verify_area+0x353/0x5f0
[  206.625325]  vfs_read+0x171/0xb20
[  206.626592]  ? __lock_acquire+0x487/0x2220
[  206.628135]  ? __pfx___handle_mm_fault+0x10/0x10
[  206.629784]  ? __pfx_vfs_read+0x10/0x10
[  206.632696]  ? __pfx_css_rstat_updated+0x10/0x10
[  206.634351]  ? rcu_is_watching+0x15/0xb0
[  206.635835]  ? trace_preempt_on+0x126/0x160
[  206.637362]  ? preempt_count_sub+0x10c/0x1c0
[  206.638880]  ? count_memcg_events+0x10a/0x4b0
[  206.640455]  ? find_held_lock+0x2b/0x80
[  206.641908]  ? rcu_read_unlock+0x17/0x60
[  206.643340]  ? lock_release+0x1ab/0x320
[  206.644812]  ksys_read+0xff/0x200
[  206.646127]  ? __pfx_ksys_read+0x10/0x10
[  206.647651]  do_syscall_64+0x117/0x16c0
[  206.649035]  ? irqentry_exit+0xd9/0x690
[  206.650548]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  206.652331] RIP: 0033:0x7f98e4e14eb2
[  206.653743] Code: 18 41 8b 93 08 03 00 00 59 5e 48 83 f8 fc 75 1a 83 e2 39 83 fa 08 75 12 e8 2b ff ff ff 0f 1f 00 49 89 ca 48 8b 44 24 20 0f 05 <48> 83 c4 18 c3 66 0f 1f 84 00 00 00 00 00 48 83 ec 10 ff 74 24 18
[  206.659364] RSP: 002b:00007ffdc0a8d930 EFLAGS: 00000202 ORIG_RAX: 0000000000000000
[  206.663251] RAX: ffffffffffffffda RBX: 0000000000040000 RCX: 00007f98e4e14eb2
[  206.665614] RDX: 0000000000040000 RSI: 00007f98e4d3f000 RDI: 0000000000000003
[  206.668022] RBP: 0000000000040000 R08: 0000000000000000 R09: 0000000000000000
[  206.670306] R10: 0000000000000000 R11: 0000000000000202 R12: 00007f98e4d3f000
[  206.672624] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000040000
[  206.674941]  </TASK>
[  206.675927] irq event stamp: 7898
[  206.677154] hardirqs last  enabled at (7897): [<ffffffff90991f6f>] ring_buffer_empty_cpu+0x19f/0x2f0
[  206.680088] hardirqs last disabled at (7898): [<ffffffff909a277d>] ring_buffer_peek+0x17d/0x310
[  206.682881] softirqs last  enabled at (7888): [<ffffffff9056cffc>] handle_softirqs+0x5bc/0x7c0
[  206.685710] softirqs last disabled at (7879): [<ffffffff9056d322>] __irq_exit_rcu+0x112/0x230
[  206.688483] ---[ end trace 0000000000000000 ]---

OK, that RB_MISSED_EVENTS is causing an issue. Something else we need to
look into. The warning is that __rb_get_reader_page() is trying more than 3
times. Thus I think it's constantly swapping the head page and the reader
page. Something to investigate.

So, I'm holding off pulling in these patches. I may take the first one
though.

-- Steve

^ permalink raw reply

* Re: [PATCH v13 4/4] ring-buffer: Add persistent ring buffer selftest
From: Steven Rostedt @ 2026-03-27 20:25 UTC (permalink / raw)
  To: Masami Hiramatsu (Google)
  Cc: Mathieu Desnoyers, linux-kernel, linux-trace-kernel, Ian Rogers
In-Reply-To: <177440552560.1529621.1405976992959650354.stgit@mhiramat.tok.corp.google.com>

On Wed, 25 Mar 2026 11:25:25 +0900
"Masami Hiramatsu (Google)" <mhiramat@kernel.org> wrote:

> From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
> 
> Add a self-destructive test for the persistent ring buffer. This
> will invalidate some sub-buffer pages in the persistent ring buffer
> when kernel gets panic, and check whether the number of detected
> invalid pages and the total entry_bytes are the same as record
> after reboot.
> 
> This can ensure the kernel correctly recover partially corrupted
> persistent ring buffer when boot.
> 
> The test only runs on the persistent ring buffer whose name is
> "ptracingtest". And user has to fill it up with events before
> kernel panics.
> 
> To run the test, enable CONFIG_RING_BUFFER_PERSISTENT_SELFTEST

I think a more appropriate config name would be:

  CONFIG_PERSISTENT_RING_BUFFER_ERROR_INJECT

as the option only tests error injection, not the persistent ring
buffer itself.

> and you have to setup the kernel cmdline;
> 
>  reserve_mem=20M:2M:trace trace_instance=ptracingtest^traceoff@trace
>  panic=1
> 
> And run following commands after the 1st boot;
> 
>  cd /sys/kernel/tracing/instances/ptracingtest
>  echo 1 > tracing_on
>  echo 1 > events/enable
>  sleep 3
>  echo c > /proc/sysrq-trigger

These instructions should probably be in the CONFIG help message.

> 
> After panic message, the kernel will reboot and run the verification
> on the persistent ring buffer, e.g.
> 
>  Ring buffer meta [2] invalid buffer page detected
>  Ring buffer meta [2] is from previous boot! (318 pages discarded)
>  Ring buffer testing [2] invalid pages: PASSED (318/318)
>  Ring buffer testing [2] entry_bytes: PASSED (1300476/1300476)

BTW, when I tested this, I got the above on the first boot, but if I
rebooted normally without re-enabling the persistent ring buffer, I would
get on the next boot:


[    0.966510] Ring buffer meta [2] is from previous boot! (0 pages discarded)
[    0.971338]  #2
[    1.003431] Ring buffer meta [3] is from previous boot! (0 pages discarded)
[    1.007737]  #3
[    1.039091] Ring buffer meta [4] is from previous boot! (0 pages discarded)
[    1.043181] Ring buffer testing [4] invalid pages: FAILED (0/1597)
[    1.044660] Ring buffer testing [4] entry_bytes: PASSED (6512464/6512464)
[    1.047829]  #4
[    1.079811] Ring buffer meta [5] is from previous boot! (0 pages discarded)
[    1.083728]  #5
[    1.116764] Ring buffer meta [6] is from previous boot! (0 pages discarded)
[    1.120846]  #6
[    1.156502] Ring buffer meta [7] is from previous boot! (0 pages discarded)
[    1.160857]  #7

I'll start testing the previous 3 patches and may add them to next.

Also, I noticed that there's nothing that reads the RB_MISSING as I thought
it might. I'll have to look into how to pass that info to the trace output.

-- Steve

^ permalink raw reply

* Re: Warning from free_reserved_area() in next-20260325+
From: Bert Karwatzki @ 2026-03-27 19:54 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-kernel, spasswolf, Liam.Howlett, akpm, andreas, ardb, bp,
	brauner, catalin.marinas, chleroy, dave.hansen, davem, david,
	devicetree, dvyukov, elver, glider, hannes, hpa, ilias.apalodimas,
	iommu, jack, jackmanb, kasan-dev, linux-arm-kernel, linux-efi,
	linux-fsdevel, linux-mm, linux-trace-kernel, linuxppc-dev,
	lorenzo.stoakes, m.szyprowski, maddy, mhiramat, mhocko, mingo,
	mpe, npiggin, robh, robin.murphy, saravanak, sparclinux, surenb,
	tglx, vbabka, viro, will, x86, ziy
In-Reply-To: <aca6blFFWskxAcAr@kernel.org>

Am Freitag, dem 27.03.2026 um 20:12 +0300 schrieb Mike Rapoport:
> Hi Bert,
> 
> On Fri, Mar 27, 2026 at 03:01:08PM +0100, Bert Karwatzki wrote:
> > Starting with linux next-20260325 I see the following warning early in the
> > boot process of a machine running debian stable (trixie) (except for the kernel):
> 
> Thanks for the report!
> 
> > [    0.027118] [      T0] ------------[ cut here ]------------
> > [    0.027118] [      T0] Cannot free reserved memory because of deferred initialization of the memory map
> > [    0.027119] [      T0] WARNING: mm/memblock.c:904 at __free_reserved_area+0xa9/0xc0, CPU#0: swapper/0/0
> > [    0.027122] [      T0] Modules linked in:
> > [    0.027123] [      T0] CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 7.0.0-rc5-next-20260326-master #385 PREEMPT_RT 
> > [    0.027125] [      T0] Hardware name: ASUS System Product Name/ROG STRIX B850-F GAMING WIFI, BIOS 1627 02/05/2026
> > [    0.027125] [      T0] RIP: 0010:__free_reserved_area+0xa9/0xc0
> > [    0.027126] [      T0] Code: 48 89 df 48 89 ee e8 06 fe ff ff 48 89 c3 48 39 e8 72 a0 5b 4c 89 e8 5d 41 5c 41 5d 41 5e c3 cc cc cc cc 48 8d 3d 97 c2 c6 00 <67> 48 0f b9 3a 45 31 ed eb df 66 66 2e 0f 1f 84 00 00 00 00 00 66
> > [    0.027127] [      T0] RSP: 0000:ffffffff9b203e98 EFLAGS: 00010202
> > [    0.027128] [      T0] RAX: 0000000e91c00001 RBX: ffffffff9b100c0f RCX: 0000000080000001
> > [    0.027128] [      T0] RDX: 00000000000000cc RSI: 0000000e2d42d000 RDI: ffffffff9b32ef60
> > [    0.027128] [      T0] RBP: ffff9eeafdd6fbc0 R08: 0000000000000000 R09: 0000000000000001
> > [    0.027129] [      T0] R10: 0000000000001000 R11: 8000000000000163 R12: 000000000000006f
> > [    0.027129] [      T0] R13: 0000000000000000 R14: 0000000000000045 R15: 000000005c8a1000
> > [    0.027129] [      T0] FS:  0000000000000000(0000) GS:ffff9eeb21c05000(0000) knlGS:0000000000000000
> > [    0.027130] [      T0] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [    0.027130] [      T0] CR2: ffff9ee8ad801000 CR3: 0000000e2ce1e000 CR4: 0000000000f50ef0
> > [    0.027131] [      T0] PKRU: 55555554
> > [    0.027131] [      T0] Call Trace:
> > [    0.027132] [      T0]  <TASK>
> > [    0.027132] [      T0]  free_reserved_area+0x89/0xd0
> > [    0.027133] [      T0]  alternative_instructions+0xee/0x110
> > [    0.027136] [      T0]  arch_cpu_finalize_init+0x10f/0x160
> > [    0.027138] [      T0]  start_kernel+0x686/0x710
> > [    0.027140] [      T0]  x86_64_start_reservations+0x24/0x30
> > [    0.027141] [      T0]  x86_64_start_kernel+0xd4/0xe0
> > [    0.027142] [      T0]  common_startup_64+0x13e/0x141
> > [    0.027143] [      T0]  </TASK>
> > [    0.027144] [      T0] ---[ end trace 0000000000000000 ]---
> 
> Does this patch fix it for you?
> 
> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> index e87da25d1236..62936a3bde19 100644
> --- a/arch/x86/kernel/alternative.c
> +++ b/arch/x86/kernel/alternative.c
> @@ -2448,19 +2448,31 @@ void __init alternative_instructions(void)
>  					    __smp_locks, __smp_locks_end,
>  					    _text, _etext);
>  	}
> +#endif
>  
> +	restart_nmi();
> +	alternatives_patched = 1;
> +
> +	alt_reloc_selftest();
> +}
> +
> +#ifdef CONFIG_SMP
> +/*
> + * With CONFIG_DEFERRED_STRUCT_PAGE_INIT enabled we can free_init_pages() only
> + * after the deferred initialization of the memory map is complete.
> + */
> +static int __init free_smp_locks(void)
> +{
>  	if (!uniproc_patched || num_possible_cpus() == 1) {
>  		free_init_pages("SMP alternatives",
>  				(unsigned long)__smp_locks,
>  				(unsigned long)__smp_locks_end);
>  	}
> -#endif
>  
> -	restart_nmi();
> -	alternatives_patched = 1;
> -
> -	alt_reloc_selftest();
> +	return 0;
>  }
> +arch_initcall(free_smp_locks);
> +#endif
>  
>  /**
>   * text_poke_early - Update instructions on a live kernel at boot time
>  
> > Bert Karwatzki

Yes, your patch fixes the issue in next-20260326.

Tested-by: Bert Karwatzki <spasswolf@web.de>

Bert Karwatzki

^ permalink raw reply

* Re: [PATCH v4 2/2] tracing: Drain deferred trigger frees if kthread creation fails
From: Steven Rostedt @ 2026-03-27 19:06 UTC (permalink / raw)
  To: Wesley Atwell
  Cc: linux-trace-kernel, linux-kernel, mhiramat, mark.rutland,
	mathieu.desnoyers, tom.zanussi
In-Reply-To: <20260324221326.1395799-3-atwellwea@gmail.com>

On Tue, 24 Mar 2026 16:13:26 -0600
Wesley Atwell <atwellwea@gmail.com> wrote:

> Boot-time trigger registration can fail before the trigger-data cleanup
> kthread exists. Deferring those frees until late init is fine, but the
> post-boot fallback must still drain the deferred list if kthread
> creation never succeeds.
> 
> Otherwise, boot-deferred nodes can accumulate on
> trigger_data_free_list, later frees fall back to synchronously freeing
> only the current object, and the older queued entries are leaked
> forever.
> 
> Keep the deferred boot-time behavior, but when kthread creation fails,
> drain the whole queued list synchronously. Do the same in the late-init
> drain path so queued entries are not stranded there either.
> 
> Fixes: 61d445af0a7c ("tracing: Add bulk garbage collection of freeing event_trigger_data")
> Signed-off-by: Wesley Atwell <atwellwea@gmail.com>
> ---

Do you have a test case (kernel command line) that will make
trigger_data_free() get called at boot up?

-- Steve

^ permalink raw reply
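
The leak described in the quoted commit message above can be modelled in
user space. The following is only a minimal sketch of the idea, not the
kernel code: struct object, object_free(), start_worker() and the phase
enum are hypothetical names invented for illustration, and a boolean
stands in for "kthread creation succeeded". It shows why the fallback
has to drain the whole queued list rather than free only the current
object.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical object that would be freed by a background worker. */
struct object {
	struct object *next;
};

enum phase { PHASE_BOOT, PHASE_RUNNING };

static struct object *deferred_list;	/* objects waiting to be freed */
static bool worker_started;		/* stand-in for "kthread exists" */
static int freed_count;			/* bookkeeping for the example only */

/* Stand-in for kthread creation; the caller decides whether it succeeds. */
static bool start_worker(bool can_start)
{
	worker_started = can_start;
	return worker_started;
}

/* Free every queued object, oldest leftovers included. */
static void drain_deferred(void)
{
	struct object *obj = deferred_list;

	deferred_list = NULL;
	while (obj) {
		struct object *next = obj->next;

		free(obj);
		freed_count++;
		obj = next;
	}
}

static void object_free(struct object *obj, enum phase phase, bool can_start)
{
	/* Always queue first so one path handles every case. */
	obj->next = deferred_list;
	deferred_list = obj;

	if (phase == PHASE_BOOT)
		return;		/* too early for a worker, leave it queued */

	if (worker_started || start_worker(can_start))
		return;		/* the worker is expected to drain later */

	/* No worker: drain the whole list, not just the current object. */
	drain_deferred();
}
```

In this model, an object queued during PHASE_BOOT is released by the
first later object_free() call whose worker creation fails. Freeing only
the current object at that point would leave the boot-time entry on the
list forever.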

* Re: Warning from free_reserved_area() in next-20260325+
From: Mike Rapoport @ 2026-03-27 17:12 UTC (permalink / raw)
  To: Bert Karwatzki
  Cc: linux-kernel, Liam.Howlett, akpm, andreas, ardb, bp, brauner,
	catalin.marinas, chleroy, dave.hansen, davem, david, devicetree,
	dvyukov, elver, glider, hannes, hpa, ilias.apalodimas, iommu,
	jack, jackmanb, kasan-dev, linux-arm-kernel, linux-efi,
	linux-fsdevel, linux-mm, linux-trace-kernel, linuxppc-dev,
	lorenzo.stoakes, m.szyprowski, maddy, mhiramat, mhocko, mingo,
	mpe, npiggin, robh, robin.murphy, saravanak, sparclinux, surenb,
	tglx, vbabka, viro, will, x86, ziy
In-Reply-To: <20260327140109.7561-1-spasswolf@web.de>

Hi Bert,

On Fri, Mar 27, 2026 at 03:01:08PM +0100, Bert Karwatzki wrote:
> Starting with linux next-20260325 I see the following warning early in the
> boot process of a machine running debian stable (trixie) (except for the kernel):

Thanks for the report!

> [    0.027118] [      T0] ------------[ cut here ]------------
> [    0.027118] [      T0] Cannot free reserved memory because of deferred initialization of the memory map
> [    0.027119] [      T0] WARNING: mm/memblock.c:904 at __free_reserved_area+0xa9/0xc0, CPU#0: swapper/0/0
> [    0.027122] [      T0] Modules linked in:
> [    0.027123] [      T0] CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 7.0.0-rc5-next-20260326-master #385 PREEMPT_RT 
> [    0.027125] [      T0] Hardware name: ASUS System Product Name/ROG STRIX B850-F GAMING WIFI, BIOS 1627 02/05/2026
> [    0.027125] [      T0] RIP: 0010:__free_reserved_area+0xa9/0xc0
> [    0.027126] [      T0] Code: 48 89 df 48 89 ee e8 06 fe ff ff 48 89 c3 48 39 e8 72 a0 5b 4c 89 e8 5d 41 5c 41 5d 41 5e c3 cc cc cc cc 48 8d 3d 97 c2 c6 00 <67> 48 0f b9 3a 45 31 ed eb df 66 66 2e 0f 1f 84 00 00 00 00 00 66
> [    0.027127] [      T0] RSP: 0000:ffffffff9b203e98 EFLAGS: 00010202
> [    0.027128] [      T0] RAX: 0000000e91c00001 RBX: ffffffff9b100c0f RCX: 0000000080000001
> [    0.027128] [      T0] RDX: 00000000000000cc RSI: 0000000e2d42d000 RDI: ffffffff9b32ef60
> [    0.027128] [      T0] RBP: ffff9eeafdd6fbc0 R08: 0000000000000000 R09: 0000000000000001
> [    0.027129] [      T0] R10: 0000000000001000 R11: 8000000000000163 R12: 000000000000006f
> [    0.027129] [      T0] R13: 0000000000000000 R14: 0000000000000045 R15: 000000005c8a1000
> [    0.027129] [      T0] FS:  0000000000000000(0000) GS:ffff9eeb21c05000(0000) knlGS:0000000000000000
> [    0.027130] [      T0] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    0.027130] [      T0] CR2: ffff9ee8ad801000 CR3: 0000000e2ce1e000 CR4: 0000000000f50ef0
> [    0.027131] [      T0] PKRU: 55555554
> [    0.027131] [      T0] Call Trace:
> [    0.027132] [      T0]  <TASK>
> [    0.027132] [      T0]  free_reserved_area+0x89/0xd0
> [    0.027133] [      T0]  alternative_instructions+0xee/0x110
> [    0.027136] [      T0]  arch_cpu_finalize_init+0x10f/0x160
> [    0.027138] [      T0]  start_kernel+0x686/0x710
> [    0.027140] [      T0]  x86_64_start_reservations+0x24/0x30
> [    0.027141] [      T0]  x86_64_start_kernel+0xd4/0xe0
> [    0.027142] [      T0]  common_startup_64+0x13e/0x141
> [    0.027143] [      T0]  </TASK>
> [    0.027144] [      T0] ---[ end trace 0000000000000000 ]---

Does this patch fix it for you?

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index e87da25d1236..62936a3bde19 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -2448,19 +2448,31 @@ void __init alternative_instructions(void)
 					    __smp_locks, __smp_locks_end,
 					    _text, _etext);
 	}
+#endif
 
+	restart_nmi();
+	alternatives_patched = 1;
+
+	alt_reloc_selftest();
+}
+
+#ifdef CONFIG_SMP
+/*
+ * With CONFIG_DEFERRED_STRUCT_PAGE_INIT enabled we can free_init_pages() only
+ * after the deferred initialization of the memory map is complete.
+ */
+static int __init free_smp_locks(void)
+{
 	if (!uniproc_patched || num_possible_cpus() == 1) {
 		free_init_pages("SMP alternatives",
 				(unsigned long)__smp_locks,
 				(unsigned long)__smp_locks_end);
 	}
-#endif
 
-	restart_nmi();
-	alternatives_patched = 1;
-
-	alt_reloc_selftest();
+	return 0;
 }
+arch_initcall(free_smp_locks);
+#endif
 
 /**
  * text_poke_early - Update instructions on a live kernel at boot time
 
> Bert Karwatzki

-- 
Sincerely yours,
Mike.

^ permalink raw reply related

* Re: [PATCH v2] bootconfig: Apply early options from embedded config
From: Breno Leitao @ 2026-03-27 16:11 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Jonathan Corbet, Shuah Khan, linux-kernel, linux-trace-kernel,
	linux-doc, oss, paulmck, rostedt, kernel-team
In-Reply-To: <20260327231630.2d6f4273b7d615bda4b51053@kernel.org>

On Fri, Mar 27, 2026 at 11:16:30PM +0900, Masami Hiramatsu wrote:
> > Given all the feedback on this series, I see three types of issues to address:
> >
> > 1) Minor patch improvements
> > 2) Architecture-specific super early parameters being parsed before bootconfig
> >    is available
> > 3) Unifying kernel cmdline and bootconfig interfaces
>
> I think we can start with 1) for embedded bootconfig for this series
> with using bootconfig in parse_early_param().

Thanks for the clear direction.

I'll work on integrating bootconfig into parse_early_param() to see
what can be achieved and identify any potential blockers.

I should be back soon with more fun.

Thanks so far,
--breno

^ permalink raw reply

* [GIT PULL] RTLA changes for v7.1
From: Tomas Glozar @ 2026-03-27 15:02 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Costa Shulyupin, Wander Lairson Costa, LKML, linux-trace-kernel,
	Tomas Glozar

Steven,

please pull the following changes for RTLA (more info in tag description).

Thanks,
Tomas

The following changes since commit 11439c4635edd669ae435eec308f4ab8a0804808:

  Linux 7.0-rc2 (2026-03-01 15:39:31 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/tglozar/linux.git tags/rtla-v7.1

for you to fetch changes up to 82374995b63d2de21414163828a32d52610dcaf2:

  Documentation/rtla: Document SIGINT behavior (2026-03-27 10:58:30 +0100)

----------------------------------------------------------------
RTLA patches for v7.1

- Simplify option parsing

Auto-generate getopt_long() optstring for short options from long options
array, avoiding the need to specify it manually and reducing the surface
for mistakes.

- Add unit tests

Implement unit tests (make unit-tests) using libcheck, next to existing
runtime tests (make check). Currently, three functions from utils.c are
tested.

- Add --stack-format option

In addition to stopping stack pointer decoding (with -s/--stack option)
on first unresolvable pointer, allow also skipping unresolvable pointers
and displaying everything, configurable with a new option.

- Unify number of CPUs into one global variable

Use one global variable, nr_cpus, to store the number of CPUs instead of
retrieving it and passing it at multiple places.

- Fix behavior in various corner cases

Make RTLA behave correctly in several corner cases: memory allocation
failure, invalid value read from kernel side, thread creation failure,
malformed time value input, and read/write failure or interruption by
signal.

- Improve string handling

Simplify several places in the code that handle strings, including
parsing of action arguments. A few new helper functions and variables
are added for that purpose.

- Get rid of magic numbers

A few places handling paths use a magic number of 1024. Replace it with
MAX_PATH and the ARRAY_SIZE() macro.

- Unify threshold handling

Code that handles response to latency threshold is duplicated between
tools, which has led to bugs in the past. Unify it into a new helper
as much as possible.

- Fix segfault on SIGINT during cleanup

The SIGINT handler touches dynamically allocated memory. Detach it
before freeing it during cleanup to prevent segmentation fault and
discarding of output buffers. Also, properly document SIGINT handling
while at it.

The tag was tested (make && make check && make unit-tests) as well as
pre-tested on top of next-20260326. There are no known conflicts.

Signed-off-by: Tomas Glozar <tglozar@redhat.com>

----------------------------------------------------------------
Costa Shulyupin (7):
      tools/rtla: Generate optstring from long options
      tools/build: Add feature test for libcheck
      tools/rtla: Add unit tests for utils.c
      tools/rtla: Consolidate nr_cpus usage across all tools
      tools/rtla: Remove unneeded nr_cpus arguments
      tools/rtla: Remove unneeded nr_cpus members
      tools/rtla: Remove unneeded nr_cpus from for_each_monitored_cpu

Tomas Glozar (4):
      rtla/timerlat: Add --stack-format option
      Documentation/rtla: Document --stack-format option
      rtla: Fix segfault on multiple SIGINTs
      Documentation/rtla: Document SIGINT behavior

Wander Lairson Costa (17):
      rtla: Exit on memory allocation failures during initialization
      rtla: Use strdup() to simplify code
      rtla/actions: Simplify argument parsing
      rtla: Introduce common_threshold_handler() helper
      rtla: Replace magic number with MAX_PATH
      rtla: Simplify code by caching string lengths
      rtla/timerlat: Add bounds check for softirq vector
      rtla: Handle pthread_create() failure properly
      rtla: Add str_has_prefix() helper function
      rtla: Use str_has_prefix() for prefix checks
      rtla: Enforce exact match for time unit suffixes
      rtla: Use str_has_prefix() for option prefix check
      rtla/timerlat: Simplify RTLA_NO_BPF environment variable check
      rtla/trace: Fix write loop in trace_event_save_hist()
      rtla/trace: Fix I/O handling in save_trace_to_file()
      rtla/utils: Fix resource leak in set_comm_sched_attr()
      rtla/utils: Fix loop condition in PID validation

 Documentation/tools/rtla/common_appendix.txt       |  21 ++++
 .../tools/rtla/common_timerlat_options.txt         |  12 +++
 tools/build/Makefile.feature                       |   3 +
 tools/build/feature/Makefile                       |   4 +
 tools/build/feature/test-libcheck.c                |   8 ++
 tools/tracing/rtla/Build                           |   1 +
 tools/tracing/rtla/Makefile                        |   5 +
 tools/tracing/rtla/Makefile.config                 |   8 ++
 tools/tracing/rtla/README.txt                      |   1 +
 tools/tracing/rtla/src/actions.c                   | 103 +++++++++++-------
 tools/tracing/rtla/src/actions.h                   |   8 +-
 tools/tracing/rtla/src/common.c                    | 120 +++++++++++++++++----
 tools/tracing/rtla/src/common.h                    |  24 ++++-
 tools/tracing/rtla/src/osnoise.c                   |  26 ++---
 tools/tracing/rtla/src/osnoise_hist.c              |  51 ++++-----
 tools/tracing/rtla/src/osnoise_top.c               |  41 ++-----
 tools/tracing/rtla/src/timerlat.c                  |  16 ++-
 tools/tracing/rtla/src/timerlat.h                  |   1 +
 tools/tracing/rtla/src/timerlat_aa.c               |  51 ++++++---
 tools/tracing/rtla/src/timerlat_aa.h               |   2 +-
 tools/tracing/rtla/src/timerlat_bpf.c              |  19 ++--
 tools/tracing/rtla/src/timerlat_bpf.h              |  12 +--
 tools/tracing/rtla/src/timerlat_hist.c             | 116 +++++++++-----------
 tools/tracing/rtla/src/timerlat_top.c              | 114 +++++++++-----------
 tools/tracing/rtla/src/timerlat_u.c                |  13 ++-
 tools/tracing/rtla/src/timerlat_u.h                |   1 +
 tools/tracing/rtla/src/trace.c                     | 102 ++++++++++--------
 tools/tracing/rtla/src/trace.h                     |   4 +-
 tools/tracing/rtla/src/utils.c                     | 113 ++++++++++++++-----
 tools/tracing/rtla/src/utils.h                     |  33 ++++++
 tools/tracing/rtla/tests/unit/Build                |   2 +
 tools/tracing/rtla/tests/unit/Makefile.unit        |  17 +++
 tools/tracing/rtla/tests/unit/unit_tests.c         | 119 ++++++++++++++++++++
 33 files changed, 769 insertions(+), 402 deletions(-)
 create mode 100644 tools/build/feature/test-libcheck.c
 create mode 100644 tools/tracing/rtla/tests/unit/Build
 create mode 100644 tools/tracing/rtla/tests/unit/Makefile.unit
 create mode 100644 tools/tracing/rtla/tests/unit/unit_tests.c
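
For readers unfamiliar with the "generate optstring from long options"
item in the tag description above, here is a minimal sketch of the
general technique. It is not the RTLA implementation: the function name
build_optstring() is an assumption made for illustration. It walks a
getopt_long() options array and emits one short-option letter per entry,
followed by ":" for a required argument or "::" for an optional one.

```c
#include <ctype.h>
#include <getopt.h>
#include <stddef.h>

/*
 * Build a getopt_long() optstring from a long-options array.
 * Returns 0 on success, -1 if the output buffer is too small.
 */
static int build_optstring(const struct option *opts, char *out, size_t size)
{
	size_t n = 0;

	if (size == 0)
		return -1;

	for (; opts->name; opts++) {
		int colons = 0;

		/* Skip flag-style entries and long-only options (val is not a letter or digit). */
		if (opts->flag || opts->val <= 0 || opts->val > 127 ||
		    !isalnum(opts->val))
			continue;

		if (opts->has_arg == required_argument)
			colons = 1;
		else if (opts->has_arg == optional_argument)
			colons = 2;

		/* Need room for the letter, its colons, and the final NUL. */
		if (n + 1 + (size_t)colons + 1 > size)
			return -1;

		out[n++] = (char)opts->val;
		while (colons-- > 0)
			out[n++] = ':';
	}

	out[n] = '\0';
	return 0;
}
```

The resulting string can be passed as the optstring argument of
getopt_long() together with the same array, so the short and long option
definitions cannot drift apart.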


^ permalink raw reply
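
The "Fix segfault on SIGINT during cleanup" item in the RTLA pull
request above describes a common ordering problem: a signal handler that
touches heap memory must be detached before that memory is freed. A
minimal sketch of the pattern follows. All names (tool_state,
tool_init(), tool_cleanup()) are hypothetical, and using SIG_IGN for the
detach is an assumption; the pull request does not say which disposition
RTLA installs.

```c
#include <signal.h>
#include <stdlib.h>

/* Hypothetical heap-allocated state that the SIGINT handler touches. */
struct tool_state {
	volatile sig_atomic_t stop_requested;
};

static struct tool_state *state;

static void on_sigint(int sig)
{
	(void)sig;
	if (state)
		state->stop_requested = 1;
}

static int tool_init(void)
{
	state = calloc(1, sizeof(*state));
	if (!state)
		return -1;

	if (signal(SIGINT, on_sigint) == SIG_ERR) {
		free(state);
		state = NULL;
		return -1;
	}
	return 0;
}

static void tool_cleanup(void)
{
	/*
	 * Detach the handler first so a late SIGINT cannot reach freed
	 * memory. SIG_IGN (an assumption here) also keeps a late Ctrl-C
	 * from killing the process before stdio buffers are flushed.
	 */
	signal(SIGINT, SIG_IGN);
	free(state);
	state = NULL;
}
```

Reversing the two steps in tool_cleanup() would leave a window in which
the handler can dereference a pointer that has already been freed.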

* Re: [PATCH next] tracing: Remove spurious default precision from show_event_trigger/filter formats
From: Steven Rostedt @ 2026-03-27 14:26 UTC (permalink / raw)
  To: Petr Mladek
  Cc: david.laight.linux, Masami Hiramatsu, Mathieu Desnoyers,
	linux-kernel, linux-trace-kernel, Aaron Tomlin, Rasmus Villemoes,
	Andy Shevchenko, Sergey Senozhatsky, Andrew Morton
In-Reply-To: <acZKpXQDTEPCU813@pathway.suse.cz>

On Fri, 27 Mar 2026 10:15:17 +0100
Petr Mladek <pmladek@suse.com> wrote:

> On Thu 2026-03-26 20:18:24, david.laight.linux@gmail.com wrote:
> > From: David Laight <david.laight.linux@gmail.com>
> > 
> > Change 2d8b7f9bf8e6e ("tracing: Have show_event_trigger/filter format a bit more in columns")
> > added space padding to align the output.
> > However it used ("%*.s", len, "") which requests the default precision.
> > It doesn't matter here whether the userspace default (0) or kernel
> > default (no precision) is used, but the format should be "%*s".
> > 
> > Signed-off-by: David Laight <david.laight.linux@gmail.com>  
> 
> Makes sense. It does not change the output because it printed
> an empty string "" so the precision did not matter.

Right. I use this in user space all the time, and add "%*.s" a lot.

I tested it and it doesn't change the output so I'm happy to take it
through my tree.

-- Steve
 

> 
> Reviewed-by: Petr Mladek <pmladek@suse.com>
> 
> Best Regards,
> Petr


^ permalink raw reply
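
The difference between "%*.s" and "%*s" discussed in the thread above
can be checked in user space. This is a small sketch using the hosted C
library, where a lone '.' means precision 0; kernel vsnprintf()
behaviour is described differently in the quoted patch and is not
exercised here. The helper names pad_plain() and pad_dot() are made up
for the example.

```c
#include <stdio.h>
#include <string.h>

/* Right-align `text` in a field of `width` columns using "%*s". */
static int pad_plain(char *buf, size_t size, int width, const char *text)
{
	return snprintf(buf, size, "%*s", width, text);
}

/*
 * Same call with "%*.s". In ISO C a '.' with no digits is precision 0,
 * so none of `text` is printed and only the padding remains.
 */
static int pad_dot(char *buf, size_t size, int width, const char *text)
{
	return snprintf(buf, size, "%*.s", width, text);
}
```

With an empty string, as in the tracing code, both helpers produce the
same run of spaces, which matches the observation that the patch does
not change the output. With a non-empty string the two differ: the
"%*.s" form drops the text entirely.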

* Re: [PATCH v2] bootconfig: Apply early options from embedded config
From: Masami Hiramatsu @ 2026-03-27 14:16 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Jonathan Corbet, Shuah Khan, linux-kernel, linux-trace-kernel,
	linux-doc, oss, paulmck, rostedt, kernel-team
In-Reply-To: <acZX_IXQiGwMMi5e@gmail.com>

On Fri, 27 Mar 2026 03:18:31 -0700
Breno Leitao <leitao@debian.org> wrote:

> On Thu, Mar 26, 2026 at 11:30:42PM +0900, Masami Hiramatsu wrote:
> > On Wed, 25 Mar 2026 23:22:04 +0900
> > Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
> >
> > > > +	/*
> > > > +	 * Keys that do not match any early_param() handler are silently
> > > > +	 * ignored — do_early_param() always returns 0.
> > > > +	 */
> > > > +	xbc_node_for_each_key_value(root, knode, val) {
> > >
> > > [sashiko comment]
> > > | Does this loop handle array values correctly?
> > > | xbc_node_for_each_key_value() only assigns the first value of an array to
> > > | the val pointer before advancing to the next key. It does not iterate over
> > > | the child nodes of the array.
> > > | If the bootconfig contains a multi-value key like
> > > | kernel.console = "ttyS0", "tty0", will the subsequent values in the array
> > > | be silently dropped instead of passed to the early_param handlers?
> > >
> > > Also, good catch :) we need to use xbc_node_for_each_array_value()
> > > for inner loop.
> >
> > FYI, xbc_snprint_cmdline() translates an array parameter into
> > multiple parameters. For example,
> >
> > foo = bar, buz;
> >
> > will be converted to
> >
> > foo=bar foo=buz
> >
> > Thus, I think we should do the same thing below;
> >
> > >
> > > > +		if (xbc_node_compose_key_after(root, knode, xbc_namebuf, XBC_KEYLEN_MAX) < 0)
> > > > +			continue;
> > > > +
> > > > +		/*
> > > > +		 * We need to copy const char *val to a char pointer,
> > > > +		 * which is what do_early_param() need, given it might
> > > > +		 * call strsep(), strtok() later.
> > > > +		 */
> > > > +		ret = strscpy(val_buf, val, sizeof(val_buf));
> > > > +		if (ret < 0) {
> > > > +			pr_warn("ignoring bootconfig value '%s', too long\n",
> > > > +				xbc_namebuf);
> > > > +			continue;
> > > > +		}
> > > > +		do_early_param(xbc_namebuf, val_buf, NULL, NULL);
> >
> > So instead of this;
> >
> > xbc_array_for_each_value(vnode, val) {
> > 	do_early_param(xbc_namebuf, val, NULL, NULL);
> > }
> >
> > Maybe it is a good time to reconsider unifying the kernel cmdline and
> > bootconfig from an API viewpoint.
> 
> I'm not familiar with the history on this topic. Has unifying the APIs been
> previously considered and set aside?

Previously I considered it, but I found that some early parameters must be
composed by bootloaders, and they do not support bootconfig. Thus, I introduced
setup_boot_config() to compose kernel.* parameters into the cmdline buffer.
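
For illustration only: a minimal userspace sketch of the array expansion
quoted above ("foo = bar, buz" becoming "foo=bar foo=buz"). The helper
expand_array_param() is hypothetical and is not a kernel or bootconfig API;
it only mirrors the described output format.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper, not a kernel API: write "key=value" once per array
 * element, separated by single spaces, into buf.
 * Returns the string length written, or -1 if buf is too small.
 */
static int expand_array_param(char *buf, size_t size, const char *key,
			      const char *const *vals, size_t nvals)
{
	size_t off = 0;

	if (size == 0)
		return -1;
	buf[0] = '\0';

	for (size_t i = 0; i < nvals; i++) {
		/* Prefix every element after the first with a space. */
		int n = snprintf(buf + off, size - off, "%s%s=%s",
				 i ? " " : "", key, vals[i]);

		/* snprintf() reports the untruncated length; reject overflow. */
		if (n < 0 || (size_t)n >= size - off)
			return -1;
		off += (size_t)n;
	}
	return (int)off;
}
```

Each element of the resulting string can then be handed to a handler as an
independent key=value pair, which is the behaviour suggested for the
early-parameter path.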

> 
> Given all the feedback on this series, I see three types of issues to address:
> 
> 1) Minor patch improvements
> 2) Architecture-specific super early parameters being parsed before bootconfig
>    is available
> 3) Unifying kernel cmdline and bootconfig interfaces

I think we can start with 1) for embedded bootconfig in this series,
using bootconfig in parse_early_param().

For 2), I think we need to check which parameters are expected to
be passed by bootloaders, which currently do not care about bootconfig.

For 3), eventually we may need to change how the kernel handles the
parameters. I think I need to introduce a CONFIG_BOOT_CONFIG_EXPOSED
option which keeps the xbc_*() API and parsed data accessible after
boot (removing __init) and exposes them to modules, so that all modules
can use xbc_*() to get parameters from bootconfig directly.

Thanks,

> 
> Which of these areas would you recommend I prioritize?
> 
> Thanks for the guidance,
> --breno


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply

* Warning from free_reserved_area() in next-20260325+
From: Bert Karwatzki @ 2026-03-27 14:01 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Bert Karwatzki, linux-kernel, Liam.Howlett, akpm, andreas, ardb,
	bp, brauner, catalin.marinas, chleroy, dave.hansen, davem, david,
	devicetree, dvyukov, elver, glider, hannes, hpa, ilias.apalodimas,
	iommu, jack, jackmanb, kasan-dev, linux-arm-kernel, linux-efi,
	linux-fsdevel, linux-mm, linux-trace-kernel, linuxppc-dev,
	lorenzo.stoakes, m.szyprowski, maddy, mhiramat, mhocko, mingo,
	mpe, npiggin, robh, robin.murphy, saravanak, sparclinux, surenb,
	tglx, vbabka, viro, will, x86, ziy
In-Reply-To: <20260323074836.3653702-10-rppt@kernel.org>

Starting with linux-next next-20260325, I see the following warning early in
the boot process of a machine running Debian stable (trixie), apart from the
kernel:

[    0.027118] [      T0] ------------[ cut here ]------------
[    0.027118] [      T0] Cannot free reserved memory because of deferred initialization of the memory map
[    0.027119] [      T0] WARNING: mm/memblock.c:904 at __free_reserved_area+0xa9/0xc0, CPU#0: swapper/0/0
[    0.027122] [      T0] Modules linked in:
[    0.027123] [      T0] CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 7.0.0-rc5-next-20260326-master #385 PREEMPT_RT 
[    0.027125] [      T0] Hardware name: ASUS System Product Name/ROG STRIX B850-F GAMING WIFI, BIOS 1627 02/05/2026
[    0.027125] [      T0] RIP: 0010:__free_reserved_area+0xa9/0xc0
[    0.027126] [      T0] Code: 48 89 df 48 89 ee e8 06 fe ff ff 48 89 c3 48 39 e8 72 a0 5b 4c 89 e8 5d 41 5c 41 5d 41 5e c3 cc cc cc cc 48 8d 3d 97 c2 c6 00 <67> 48 0f b9 3a 45 31 ed eb df 66 66 2e 0f 1f 84 00 00 00 00 00 66
[    0.027127] [      T0] RSP: 0000:ffffffff9b203e98 EFLAGS: 00010202
[    0.027128] [      T0] RAX: 0000000e91c00001 RBX: ffffffff9b100c0f RCX: 0000000080000001
[    0.027128] [      T0] RDX: 00000000000000cc RSI: 0000000e2d42d000 RDI: ffffffff9b32ef60
[    0.027128] [      T0] RBP: ffff9eeafdd6fbc0 R08: 0000000000000000 R09: 0000000000000001
[    0.027129] [      T0] R10: 0000000000001000 R11: 8000000000000163 R12: 000000000000006f
[    0.027129] [      T0] R13: 0000000000000000 R14: 0000000000000045 R15: 000000005c8a1000
[    0.027129] [      T0] FS:  0000000000000000(0000) GS:ffff9eeb21c05000(0000) knlGS:0000000000000000
[    0.027130] [      T0] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.027130] [      T0] CR2: ffff9ee8ad801000 CR3: 0000000e2ce1e000 CR4: 0000000000f50ef0
[    0.027131] [      T0] PKRU: 55555554
[    0.027131] [      T0] Call Trace:
[    0.027132] [      T0]  <TASK>
[    0.027132] [      T0]  free_reserved_area+0x89/0xd0
[    0.027133] [      T0]  alternative_instructions+0xee/0x110
[    0.027136] [      T0]  arch_cpu_finalize_init+0x10f/0x160
[    0.027138] [      T0]  start_kernel+0x686/0x710
[    0.027140] [      T0]  x86_64_start_reservations+0x24/0x30
[    0.027141] [      T0]  x86_64_start_kernel+0xd4/0xe0
[    0.027142] [      T0]  common_startup_64+0x13e/0x141
[    0.027143] [      T0]  </TASK>
[    0.027144] [      T0] ---[ end trace 0000000000000000 ]---

The hardware used is this:

$ cat /proc/cpuinfo
processor	: 0
vendor_id	: AuthenticAMD
cpu family	: 26
model		: 68
model name	: AMD Ryzen 9 9950X 16-Core Processor
stepping	: 0
microcode	: 0xb404035
cpu MHz		: 3607.683
cache size	: 1024 KB
physical id	: 0
siblings	: 32
core id		: 0
cpu cores	: 16
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 16
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl xtopology nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpuid_fault cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk avx_vnni avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif x2avic v_spec_ctrl vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid bus_lock_detect movdiri movdir64b overflow_recov succor smca fsrm avx512_vp2intersect flush_l1d amd_lbr_pmc_freeze
bugs		: sysret_ss_attrs spectre_v1 spectre_v2 spec_store_bypass srso spectre_v2_user vmscape
bogomips	: 8599.98
TLB size	: 192 4K pages
clflush size	: 64
cache_alignment	: 64
address sizes	: 48 bits physical, 48 bits virtual
power management: ts ttp tm hwpstate cpb eff_freq_ro [13] [14]

$ lspci
00:00.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Root Complex
00:00.2 IOMMU: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge IOMMU
00:01.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Dummy Host Bridge
00:01.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge GPP Bridge
00:01.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge GPP Bridge
00:02.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Dummy Host Bridge
00:02.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge GPP Bridge
00:03.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Dummy Host Bridge
00:04.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Dummy Host Bridge
00:08.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Dummy Host Bridge
00:08.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Internal GPP Bridge to Bus [C:A]
00:08.3 PCI bridge: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Internal GPP Bridge to Bus [C:A]
00:14.0 SMBus: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller (rev 71)
00:14.3 ISA bridge: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge (rev 51)
00:18.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Data Fabric; Function 0
00:18.1 Host bridge: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Data Fabric; Function 1
00:18.2 Host bridge: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Data Fabric; Function 2
00:18.3 Host bridge: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Data Fabric; Function 3
00:18.4 Host bridge: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Data Fabric; Function 4
00:18.5 Host bridge: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Data Fabric; Function 5
00:18.6 Host bridge: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Data Fabric; Function 6
00:18.7 Host bridge: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Data Fabric; Function 7
01:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Upstream Port of PCI Express Switch (rev 25)
02:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Downstream Port of PCI Express Switch (rev 25)
03:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 44 [RX 9060 XT] (rev c0)
03:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Navi 48 HDMI/DP Audio Controller
04:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD 9100 PRO [PM9E1]
05:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset PCIe Switch Upstream Port (rev 01)
06:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset PCIe Switch Downstream Port (rev 01)
06:06.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset PCIe Switch Downstream Port (rev 01)
06:07.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset PCIe Switch Downstream Port (rev 01)
06:08.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset PCIe Switch Downstream Port (rev 01)
06:0c.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset PCIe Switch Downstream Port (rev 01)
06:0d.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset PCIe Switch Downstream Port (rev 01)
08:00.0 Ethernet controller: Intel Corporation Ethernet Controller I226-V (rev 06)
09:00.0 Network controller: MEDIATEK Corp. Device 7925
0b:00.0 USB controller: Advanced Micro Devices, Inc. [AMD] 800 Series Chipset USB 3.x XHCI Controller (rev 01)
0c:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset SATA Controller (rev 01)
0d:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge PCIe Dummy Function (rev c1)
0d:00.2 Encryption controller: Advanced Micro Devices, Inc. [AMD] Family 19h PSP/CCP
0d:00.3 USB controller: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge USB 3.1 xHCI
0d:00.4 USB controller: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge USB 3.1 xHCI
0e:00.0 USB controller: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge USB 2.0 xHCI

Memory used is 64G:
$ LANG=C free
               total        used        free      shared  buff/cache   available
Mem:        65500068     3584080    56709424       70916     5912256    61915988
Swap:       78125052           0    78125052


Bert Karwatzki

^ permalink raw reply

* Re: [PATCH next] tracing: Remove spurious default precision from show_event_trigger/filter formats
From: Aaron Tomlin @ 2026-03-27 13:53 UTC (permalink / raw)
  To: david.laight.linux
  Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel, Petr Mladek, Rasmus Villemoes,
	Andy Shevchenko, Sergey Senozhatsky, Andrew Morton
In-Reply-To: <20260326201824.3919-1-david.laight.linux@gmail.com>

On Thu, Mar 26, 2026 at 08:18:24PM +0000, david.laight.linux@gmail.com wrote:
> From: David Laight <david.laight.linux@gmail.com>
> 
> Change 2d8b7f9bf8e6e ("tracing: Have show_event_trigger/filter format a bit more in columns")
> added space padding to align the output.
> However it used ("%*.s", len, "") which requests the default precision.
> It doesn't matter here whether the userspace default (0) or kernel
> default (no precision) is used, but the format should be "%*s".
> 
> Signed-off-by: David Laight <david.laight.linux@gmail.com>
> ---
>  kernel/trace/trace_events.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
> index 249d1cba72c0..6b54c10f9ba4 100644
> --- a/kernel/trace/trace_events.c
> +++ b/kernel/trace/trace_events.c
> @@ -1718,7 +1718,7 @@ static int t_show_filters(struct seq_file *m, void *v)
>  
>  	len = get_call_len(call);
>  
> -	seq_printf(m, "%s:%s%*.s%s\n", call->class->system,
> +	seq_printf(m, "%s:%s%*s%s\n", call->class->system,
>  		   trace_event_name(call), len, "", filter->filter_string);
>  
>  	return 0;
> @@ -1750,7 +1750,7 @@ static int t_show_triggers(struct seq_file *m, void *v)
>  	len = get_call_len(call);
>  
>  	list_for_each_entry_rcu(data, &file->triggers, list) {
> -		seq_printf(m, "%s:%s%*.s", call->class->system,
> +		seq_printf(m, "%s:%s%*s", call->class->system,
>  			   trace_event_name(call), len, "");
>  
>  		data->cmd_ops->print(m, data);
> -- 
> 2.39.5
> 

LGTM. 

Reviewed-by: Aaron Tomlin <atomlin@atomlin.com>

-- 
Aaron Tomlin

^ permalink raw reply

* Re: [PATCH v2] bootconfig: Apply early options from embedded config
From: Masami Hiramatsu @ 2026-03-27 13:37 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Jonathan Corbet, Shuah Khan, linux-kernel, linux-trace-kernel,
	linux-doc, oss, paulmck, rostedt, kernel-team
In-Reply-To: <acZPZ4XKy4QynznK@gmail.com>

On Fri, 27 Mar 2026 03:06:41 -0700
Breno Leitao <leitao@debian.org> wrote:

> Hi Masami,
> 
> On Wed, Mar 25, 2026 at 11:22:04PM +0900, Masami Hiramatsu wrote:
> > On Wed, 25 Mar 2026 03:05:38 -0700
> > Breno Leitao <leitao@debian.org> wrote:
> 
> > > +/*
> > > + * bootconfig_apply_early_params - dispatch kernel.* keys from the embedded
> > > + * bootconfig as early_param() calls.
> > > + *
> > > + * early_param() handlers must run before most of the kernel initialises
> > > + * (e.g. before the GIC driver reads irqchip.gicv3_pseudo_nmi).  A bootconfig
> > > + * attached to the initrd arrives too late for this because the initrd is not
> > > + * mapped yet when early params are processed.  The embedded bootconfig lives
> > > + * in the kernel image itself (.init.data), so it is always reachable.
> > > + *
> > > + * This function is called from setup_boot_config() which runs in
> > > + * start_kernel() before parse_early_param(), making the timing correct.
> > > + */
> > > +static void __init bootconfig_apply_early_params(void)
> >
> > [sashiko comment]
> > | Does this run early enough for architectural parameters?
> > | While setup_boot_config() runs before parse_early_param() in start_kernel(),
> > | it runs after setup_arch(). setup_boot_config() relies on xbc_init() which
> > | uses the memblock allocator, requiring setup_arch() to have already
> > | initialized it.
> > | However, the kernel expects many early parameters (like mem=, earlycon,
> > | noapic, and iommu) to be parsed during setup_arch() via the architecture's
> > | call to parse_early_param(). Since setup_arch() completes before
> > | setup_boot_config() runs, will these architectural early parameters be
> > | silently ignored because the decisions they influence were already
> > | finalized?
> >
> > This is the major reason that I did not support early parameter
> > in bootconfig. Some archs initialize kernel_cmdline in setup_arch()
> > and setup early parameters in it.
> 
> Would it be feasible to document which parameters are architecture-specific
> and must be processed during setup_arch()?

Yeah, at least we can mark what is not available in bootconfig.
Or, maybe we can export this function to setup_arch() for each
architecture.

Anyway, some cmdline options cannot be passed via
bootconfig. IIRC, for example, the initrd image address is
passed via cmdline (via devicetree) on arm64 from the bootloader.

> 
> We could potentially introduce a third parameter category alongside the
> existing early_param() and __setup():
> 
> 	* early_param()
> 	* __setup()
> 	* early_arch_param() (New)
> 
> This would allow bootconfig to support __setup() and early_param() while
> explicitly excluding early_arch_param() from bootconfig processing.

Yeah, that may be possible.

> 
> This would break down the early parameters into those that can be
> easily handled.
> 
> > To fix this, we need to change setup_arch() for each architecture so
> > that it calls this bootconfig_apply_early_params().
> 
> Could we instead integrate this into parse_early_param() itself? That
> approach would avoid the need to modify each architecture individually.

Ah, indeed. 

Thanks!

> 
> Thanks for looking at it,
> --breno


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply

* [PATCH v2 2/2] module/kallsyms: sort function symbols and use binary search
From: Stanislaw Gruszka @ 2026-03-27 11:00 UTC (permalink / raw)
  To: linux-modules, Sami Tolvanen, Luis Chamberlain, Petr Pavlu
  Cc: linux-kernel, linux-trace-kernel, live-patching, Daniel Gomez,
	Aaron Tomlin, Steven Rostedt, Masami Hiramatsu, Jordan Rome,
	Viktor Malik
In-Reply-To: <20260327110005.16499-1-stf_xl@wp.pl>

Module symbol lookup via find_kallsyms_symbol() performs a linear scan
over the entire symtab when resolving an address. The number of symbols
in module symtabs has grown over the years, largely due to additional
metadata in non-standard sections, making this lookup very slow.

Improve this by separating function symbols during module load, placing
them at the beginning of the symtab, sorting them by address, and using
binary search when resolving addresses in module text.
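
For illustration only (a userspace sketch, not the code in this patch):
the sort-then-binary-search idea on a plain array of addresses. The names
addr_cmp() and find_preceding() are made up for this example, and it uses
qsort() from the C library where the patch uses the kernel's sort().

```c
#include <stddef.h>
#include <stdlib.h>

/* Comparator for qsort(): order addresses ascending. */
static int addr_cmp(const void *a, const void *b)
{
	unsigned long va = *(const unsigned long *)a;
	unsigned long vb = *(const unsigned long *)b;

	if (va < vb)
		return -1;
	return va > vb;
}

/*
 * Binary search over a sorted array: return the index of the last entry
 * that is <= addr, or -1 if every entry lies above addr.
 */
static long find_preceding(const unsigned long *addrs, size_t n,
			   unsigned long addr)
{
	size_t low = 0, high = n;
	long best = -1;

	while (low < high) {
		size_t mid = low + (high - low) / 2;

		if (addrs[mid] <= addr) {
			best = (long)mid;	/* candidate; look for a closer one */
			low = mid + 1;
		} else {
			high = mid;		/* entry is above addr */
		}
	}
	return best;
}
```

After sorting once at load time, each lookup costs O(log n) comparisons
instead of a scan over every symbol.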

This also should improve times for linear symbol name lookups, as valid
function symbols are now located at the beginning of the symtab.

The cost of sorting is small relative to module load time. In repeated
module load tests [1], depending on .config options, this change
increases load time by between 2% and 4%. With cold caches, the difference
is not measurable, as memory access latency dominates.

The sorting could theoretically be done at compile time, but that would be
much more complicated, as we would have to simulate kernel address resolution
for symbols and then correct the relocation entries. That would be risky
if the two ever got out of sync.

The improvement can be observed when listing ftrace filter functions.

Before:

root@nano:~# time cat /sys/kernel/tracing/available_filter_functions | wc -l
74908

real	0m1.315s
user	0m0.000s
sys	0m1.312s

After:

root@nano:~# time cat /sys/kernel/tracing/available_filter_functions | wc -l
74911

real	0m0.167s
user	0m0.004s
sys	0m0.175s

(there are three more symbols introduced by the patch)

For livepatch modules, the symtab layout is preserved and the existing
linear search is used. For this case, it should be possible to keep
the original ELF symtab instead of copying it 1:1, but that is outside
the scope of this patch.

Link: https://gist.github.com/sgruszka/09f3fb1dad53a97b1aad96e1927ab117 [1]
Signed-off-by: Stanislaw Gruszka <stf_xl@wp.pl>
---
v1 -> v2: 
 - fix searching data symbols for CONFIG_KALLSYMS_ALL
 - use kallsyms_symbol_value() in elf_sym_cmp()

 include/linux/module.h   |   1 +
 kernel/module/internal.h |   1 +
 kernel/module/kallsyms.c | 171 +++++++++++++++++++++++++++++----------
 3 files changed, 130 insertions(+), 43 deletions(-)

diff --git a/include/linux/module.h b/include/linux/module.h
index ac254525014c..67c053afa882 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -379,6 +379,7 @@ struct module_memory {
 struct mod_kallsyms {
 	Elf_Sym *symtab;
 	unsigned int num_symtab;
+	unsigned int num_func_syms;
 	char *strtab;
 	char *typetab;
 };
diff --git a/kernel/module/internal.h b/kernel/module/internal.h
index 618202578b42..6a4d498619b1 100644
--- a/kernel/module/internal.h
+++ b/kernel/module/internal.h
@@ -73,6 +73,7 @@ struct load_info {
 	bool sig_ok;
 #ifdef CONFIG_KALLSYMS
 	unsigned long mod_kallsyms_init_off;
+	unsigned long num_func_syms;
 #endif
 #ifdef CONFIG_MODULE_DECOMPRESS
 #ifdef CONFIG_MODULE_STATS
diff --git a/kernel/module/kallsyms.c b/kernel/module/kallsyms.c
index f23126d804b2..d69e99e67707 100644
--- a/kernel/module/kallsyms.c
+++ b/kernel/module/kallsyms.c
@@ -10,6 +10,7 @@
 #include <linux/kallsyms.h>
 #include <linux/buildid.h>
 #include <linux/bsearch.h>
+#include <linux/sort.h>
 #include "internal.h"
 
 /* Lookup exported symbol in given range of kernel_symbols */
@@ -103,6 +104,95 @@ static bool is_core_symbol(const Elf_Sym *src, const Elf_Shdr *sechdrs,
 	return true;
 }
 
+static inline bool is_func_symbol(const Elf_Sym *sym)
+{
+	return sym->st_shndx != SHN_UNDEF && sym->st_size != 0 &&
+	       ELF_ST_TYPE(sym->st_info) == STT_FUNC;
+}
+
+static unsigned int bsearch_func_symbol(struct mod_kallsyms *kallsyms,
+					unsigned long addr,
+					unsigned long *bestval,
+					unsigned long *nextval)
+
+{
+	unsigned int mid, low = 1, high = kallsyms->num_func_syms + 1;
+	unsigned int best = 0;
+	unsigned long thisval;
+
+	while (low < high) {
+		mid = low + (high - low) / 2;
+		thisval = kallsyms_symbol_value(&kallsyms->symtab[mid]);
+
+		if (thisval <= addr) {
+			*bestval = thisval;
+			best = mid;
+			low = mid + 1;
+		} else {
+			*nextval = thisval;
+			high = mid;
+		}
+	}
+
+	return best;
+}
+
+static const char *kallsyms_symbol_name(struct mod_kallsyms *kallsyms,
+					unsigned int symnum)
+{
+	return kallsyms->strtab + kallsyms->symtab[symnum].st_name;
+}
+
+static unsigned int search_kallsyms_symbol(struct mod_kallsyms *kallsyms,
+					   unsigned long addr,
+					   unsigned long *bestval,
+					   unsigned long *nextval)
+{
+	unsigned int i, best = 0;
+
+	/*
+	 * Scan for closest preceding symbol and next symbol. (ELF starts
+	 * real symbols at 1). Skip the initial function symbols range
+	 * if num_func_syms is non-zero, those are handled separately for
+	 * the core TEXT segment lookup.
+	 */
+	for (i = 1 + kallsyms->num_func_syms; i < kallsyms->num_symtab; i++) {
+		const Elf_Sym *sym = &kallsyms->symtab[i];
+		unsigned long thisval = kallsyms_symbol_value(sym);
+
+		if (sym->st_shndx == SHN_UNDEF)
+			continue;
+
+		/*
+		 * We ignore unnamed symbols: they're uninformative
+		 * and inserted at a whim.
+		 */
+		if (*kallsyms_symbol_name(kallsyms, i) == '\0' ||
+		    is_mapping_symbol(kallsyms_symbol_name(kallsyms, i)))
+			continue;
+
+		if (thisval <= addr && thisval > *bestval) {
+			best = i;
+			*bestval = thisval;
+		}
+		if (thisval > addr && thisval < *nextval)
+			*nextval = thisval;
+	}
+
+	return best;
+}
+
+static int elf_sym_cmp(const void *a, const void *b)
+{
+	unsigned long val_a = kallsyms_symbol_value((const Elf_Sym *)a);
+	unsigned long val_b = kallsyms_symbol_value((const Elf_Sym *)b);
+
+	if (val_a < val_b)
+		return -1;
+
+	return val_a > val_b;
+}
+
 /*
  * We only allocate and copy the strings needed by the parts of symtab
  * we keep.  This is simple, but has the effect of making multiple
@@ -115,9 +205,10 @@ void layout_symtab(struct module *mod, struct load_info *info)
 	Elf_Shdr *symsect = info->sechdrs + info->index.sym;
 	Elf_Shdr *strsect = info->sechdrs + info->index.str;
 	const Elf_Sym *src;
-	unsigned int i, nsrc, ndst, strtab_size = 0;
+	unsigned int i, nsrc, ndst, nfunc, strtab_size = 0;
 	struct module_memory *mod_mem_data = &mod->mem[MOD_DATA];
 	struct module_memory *mod_mem_init_data = &mod->mem[MOD_INIT_DATA];
+	bool is_lp_mod = is_livepatch_module(mod);
 
 	/* Put symbol section at end of init part of module. */
 	symsect->sh_flags |= SHF_ALLOC;
@@ -129,12 +220,14 @@ void layout_symtab(struct module *mod, struct load_info *info)
 	nsrc = symsect->sh_size / sizeof(*src);
 
 	/* Compute total space required for the core symbols' strtab. */
-	for (ndst = i = 0; i < nsrc; i++) {
-		if (i == 0 || is_livepatch_module(mod) ||
+	for (ndst = nfunc = i = 0; i < nsrc; i++) {
+		if (i == 0 || is_lp_mod ||
 		    is_core_symbol(src + i, info->sechdrs, info->hdr->e_shnum,
 				   info->index.pcpu)) {
 			strtab_size += strlen(&info->strtab[src[i].st_name]) + 1;
 			ndst++;
+			if (!is_lp_mod && is_func_symbol(src + i))
+				nfunc++;
 		}
 	}
 
@@ -156,6 +249,7 @@ void layout_symtab(struct module *mod, struct load_info *info)
 	mod_mem_init_data->size = ALIGN(mod_mem_init_data->size,
 					__alignof__(struct mod_kallsyms));
 	info->mod_kallsyms_init_off = mod_mem_init_data->size;
+	info->num_func_syms = nfunc;
 
 	mod_mem_init_data->size += sizeof(struct mod_kallsyms);
 	info->init_typeoffs = mod_mem_init_data->size;
@@ -169,7 +263,7 @@ void layout_symtab(struct module *mod, struct load_info *info)
  */
 void add_kallsyms(struct module *mod, const struct load_info *info)
 {
-	unsigned int i, ndst;
+	unsigned int i, di, nfunc, ndst;
 	const Elf_Sym *src;
 	Elf_Sym *dst;
 	char *s;
@@ -178,6 +272,7 @@ void add_kallsyms(struct module *mod, const struct load_info *info)
 	void *data_base = mod->mem[MOD_DATA].base;
 	void *init_data_base = mod->mem[MOD_INIT_DATA].base;
 	struct mod_kallsyms *kallsyms;
+	bool is_lp_mod = is_livepatch_module(mod);
 
 	kallsyms = init_data_base + info->mod_kallsyms_init_off;
 
@@ -194,19 +289,28 @@ void add_kallsyms(struct module *mod, const struct load_info *info)
 	mod->core_kallsyms.symtab = dst = data_base + info->symoffs;
 	mod->core_kallsyms.strtab = s = data_base + info->stroffs;
 	mod->core_kallsyms.typetab = data_base + info->core_typeoffs;
+
 	strtab_size = info->core_typeoffs - info->stroffs;
 	src = kallsyms->symtab;
-	for (ndst = i = 0; i < kallsyms->num_symtab; i++) {
+	ndst = info->num_func_syms + 1;
+
+	for (nfunc = i = 0; i < kallsyms->num_symtab; i++) {
 		kallsyms->typetab[i] = elf_type(src + i, info);
-		if (i == 0 || is_livepatch_module(mod) ||
+		if (i == 0 || is_lp_mod ||
 		    is_core_symbol(src + i, info->sechdrs, info->hdr->e_shnum,
 				   info->index.pcpu)) {
 			ssize_t ret;
 
-			mod->core_kallsyms.typetab[ndst] =
-				kallsyms->typetab[i];
-			dst[ndst] = src[i];
-			dst[ndst++].st_name = s - mod->core_kallsyms.strtab;
+			if (i == 0)
+				di = 0;
+			else if (!is_lp_mod && is_func_symbol(src + i))
+				di = 1 + nfunc++;
+			else
+				di = ndst++;
+
+			mod->core_kallsyms.typetab[di] = kallsyms->typetab[i];
+			dst[di] = src[i];
+			dst[di].st_name = s - mod->core_kallsyms.strtab;
 			ret = strscpy(s, &kallsyms->strtab[src[i].st_name],
 				      strtab_size);
 			if (ret < 0)
@@ -216,9 +320,13 @@ void add_kallsyms(struct module *mod, const struct load_info *info)
 		}
 	}
 
+	WARN_ON_ONCE(nfunc != info->num_func_syms);
+	sort(dst + 1, nfunc, sizeof(Elf_Sym), elf_sym_cmp, NULL);
+
 	/* Set up to point into init section. */
 	rcu_assign_pointer(mod->kallsyms, kallsyms);
 	mod->core_kallsyms.num_symtab = ndst;
+	mod->core_kallsyms.num_func_syms = nfunc;
 }
 
 #if IS_ENABLED(CONFIG_STACKTRACE_BUILD_ID)
@@ -241,11 +349,6 @@ void init_build_id(struct module *mod, const struct load_info *info)
 }
 #endif
 
-static const char *kallsyms_symbol_name(struct mod_kallsyms *kallsyms, unsigned int symnum)
-{
-	return kallsyms->strtab + kallsyms->symtab[symnum].st_name;
-}
-
 /*
  * Given a module and address, find the corresponding symbol and return its name
  * while providing its size and offset if needed.
@@ -255,7 +358,10 @@ static const char *find_kallsyms_symbol(struct module *mod,
 					unsigned long *size,
 					unsigned long *offset)
 {
-	unsigned int i, best = 0;
+	unsigned int (*search)(struct mod_kallsyms *kallsyms,
+			       unsigned long addr, unsigned long *bestval,
+			       unsigned long *nextval);
+	unsigned int best;
 	unsigned long nextval, bestval;
 	struct mod_kallsyms *kallsyms = rcu_dereference(mod->kallsyms);
 	struct module_memory *mod_mem = NULL;
@@ -266,6 +372,11 @@ static const char *find_kallsyms_symbol(struct module *mod,
 			continue;
 #endif
 		if (within_module_mem_type(addr, mod, type)) {
+			if (type == MOD_TEXT && kallsyms->num_func_syms > 0)
+				search = bsearch_func_symbol;
+			else
+				search = search_kallsyms_symbol;
+
 			mod_mem = &mod->mem[type];
 			break;
 		}
@@ -278,33 +389,7 @@ static const char *find_kallsyms_symbol(struct module *mod,
 	nextval = (unsigned long)mod_mem->base + mod_mem->size;
 	bestval = (unsigned long)mod_mem->base - 1;
 
-	/*
-	 * Scan for closest preceding symbol, and next symbol. (ELF
-	 * starts real symbols at 1).
-	 */
-	for (i = 1; i < kallsyms->num_symtab; i++) {
-		const Elf_Sym *sym = &kallsyms->symtab[i];
-		unsigned long thisval = kallsyms_symbol_value(sym);
-
-		if (sym->st_shndx == SHN_UNDEF)
-			continue;
-
-		/*
-		 * We ignore unnamed symbols: they're uninformative
-		 * and inserted at a whim.
-		 */
-		if (*kallsyms_symbol_name(kallsyms, i) == '\0' ||
-		    is_mapping_symbol(kallsyms_symbol_name(kallsyms, i)))
-			continue;
-
-		if (thisval <= addr && thisval > bestval) {
-			best = i;
-			bestval = thisval;
-		}
-		if (thisval > addr && thisval < nextval)
-			nextval = thisval;
-	}
-
+	best = search(kallsyms, addr, &bestval, &nextval);
 	if (!best)
 		return NULL;
 
-- 
2.50.1


^ permalink raw reply related

* [PATCH v2 1/2] module/kallsyms: fix nextval for data symbol lookup
From: Stanislaw Gruszka @ 2026-03-27 11:00 UTC (permalink / raw)
  To: linux-modules, Sami Tolvanen, Luis Chamberlain, Petr Pavlu
  Cc: linux-kernel, linux-trace-kernel, live-patching, Daniel Gomez,
	Aaron Tomlin, Steven Rostedt, Masami Hiramatsu, Jordan Rome,
	Viktor Malik

The symbol lookup code assumes the queried address resides in either
MOD_TEXT or MOD_INIT_TEXT. This breaks for addresses in other module
memory regions (e.g. rodata or data), resulting in incorrect upper
bounds and wrong symbol size.

Select the module memory region the address belongs to instead of
hardcoding text sections. Also initialize the lower bound to the start
of that region, as searching from address 0 is unnecessary.
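
For illustration only (a userspace sketch, not the code in this patch):
picking the region an address falls in and initializing the search bounds
from it. struct region and init_bounds() are hypothetical stand-ins for
the kernel's module memory types.

```c
#include <stddef.h>

/* Hypothetical stand-in for a module memory region (not the kernel struct). */
struct region {
	unsigned long base;
	unsigned long size;
};

/*
 * Find the region containing addr and set the initial search bounds:
 * nextval to the end of that region, bestval to just below its start.
 * Returns the region index, or -1 if addr is in none of the regions.
 */
static int init_bounds(const struct region *regions, size_t n,
		       unsigned long addr,
		       unsigned long *bestval, unsigned long *nextval)
{
	for (size_t i = 0; i < n; i++) {
		const struct region *r = &regions[i];

		if (addr >= r->base && addr - r->base < r->size) {
			*nextval = r->base + r->size;	/* worst-case upper bound */
			*bestval = r->base - 1;		/* below any symbol in region */
			return (int)i;
		}
	}
	return -1;
}
```

Starting bestval just below the region base means any symbol at or above
the base beats the initial value, while symbols from other regions cannot
widen the reported size past the region end.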

Signed-off-by: Stanislaw Gruszka <stf_xl@wp.pl>
---
v1 -> v2: new patch.

 kernel/module/kallsyms.c | 24 ++++++++++++++++--------
 1 file changed, 16 insertions(+), 8 deletions(-)

diff --git a/kernel/module/kallsyms.c b/kernel/module/kallsyms.c
index 0fc11e45df9b..f23126d804b2 100644
--- a/kernel/module/kallsyms.c
+++ b/kernel/module/kallsyms.c
@@ -258,17 +258,25 @@ static const char *find_kallsyms_symbol(struct module *mod,
 	unsigned int i, best = 0;
 	unsigned long nextval, bestval;
 	struct mod_kallsyms *kallsyms = rcu_dereference(mod->kallsyms);
-	struct module_memory *mod_mem;
+	struct module_memory *mod_mem = NULL;
 
-	/* At worse, next value is at end of module */
-	if (within_module_init(addr, mod))
-		mod_mem = &mod->mem[MOD_INIT_TEXT];
-	else
-		mod_mem = &mod->mem[MOD_TEXT];
+	for_each_mod_mem_type(type) {
+#ifndef CONFIG_KALLSYMS_ALL
+		if (!mod_mem_type_is_text(type))
+			continue;
+#endif
+		if (within_module_mem_type(addr, mod, type)) {
+			mod_mem = &mod->mem[type];
+			break;
+		}
+	}
 
-	nextval = (unsigned long)mod_mem->base + mod_mem->size;
+	if (!mod_mem)
+		return NULL;
 
-	bestval = kallsyms_symbol_value(&kallsyms->symtab[best]);
+	/* Initialize bounds within memory region the address belongs to. */
+	nextval = (unsigned long)mod_mem->base + mod_mem->size;
+	bestval = (unsigned long)mod_mem->base - 1;
 
 	/*
 	 * Scan for closest preceding symbol, and next symbol. (ELF
-- 
2.50.1



* Re: [PATCH v2] bootconfig: Apply early options from embedded config
From: Breno Leitao @ 2026-03-27 10:18 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Jonathan Corbet, Shuah Khan, linux-kernel, linux-trace-kernel,
	linux-doc, oss, paulmck, rostedt, kernel-team
In-Reply-To: <20260326233042.f52cfc127ec934d52713bce1@kernel.org>

On Thu, Mar 26, 2026 at 11:30:42PM +0900, Masami Hiramatsu wrote:
> On Wed, 25 Mar 2026 23:22:04 +0900
> Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
>
> > > +	/*
> > > +	 * Keys that do not match any early_param() handler are silently
> > > +	 * ignored — do_early_param() always returns 0.
> > > +	 */
> > > +	xbc_node_for_each_key_value(root, knode, val) {
> >
> > [sashiko comment]
> > | Does this loop handle array values correctly?
> > | xbc_node_for_each_key_value() only assigns the first value of an array to
> > | the val pointer before advancing to the next key. It does not iterate over
> > | the child nodes of the array.
> > | If the bootconfig contains a multi-value key like
> > | kernel.console = "ttyS0", "tty0", will the subsequent values in the array
> > | be silently dropped instead of passed to the early_param handlers?
> >
> > Also, good catch :) we need to use xbc_node_for_each_array_value()
> > for inner loop.
>
> FYI, xbc_snprint_cmdline() translates an array parameter into
> multiple parameters. For example,
>
> foo = bar, buz;
>
> will be converted to
>
> foo=bar foo=buz
>
> Thus, I think we should do the same thing below;
>
> >
> > > +		if (xbc_node_compose_key_after(root, knode, xbc_namebuf, XBC_KEYLEN_MAX) < 0)
> > > +			continue;
> > > +
> > > +		/*
> > > +		 * We need to copy const char *val to a char pointer,
> > > +		 * which is what do_early_param() need, given it might
> > > +		 * call strsep(), strtok() later.
> > > +		 */
> > > +		ret = strscpy(val_buf, val, sizeof(val_buf));
> > > +		if (ret < 0) {
> > > +			pr_warn("ignoring bootconfig value '%s', too long\n",
> > > +				xbc_namebuf);
> > > +			continue;
> > > +		}
> > > +		do_early_param(xbc_namebuf, val_buf, NULL, NULL);
>
> So instead of this;
>
> xbc_array_for_each_value(vnode, val) {
> 	do_early_param(xbc_namebuf, val, NULL, NULL);
> }
>
> Maybe it is a good time to reconsider unifying kernel cmdline and bootconfig
> from an API viewpoint.
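
For reference, the array expansion quoted above (foo = bar, buz turning
into foo=bar foo=buz) can be modelled in userspace roughly as follows.
This is only a sketch: expand_array_param() is a hypothetical helper, not
a bootconfig API, and it uses snprintf() rather than any kernel routine.

```c
#include <stdio.h>
#include <stddef.h>

/*
 * Hypothetical helper: write "key=v0 key=v1 ..." into buf.
 * Returns the number of characters written (excluding the NUL),
 * or -1 if buf is too small to hold the whole result.
 */
static int expand_array_param(char *buf, size_t size, const char *key,
			      const char *const *vals, size_t nvals)
{
	size_t off = 0;
	size_t i;

	if (size == 0)
		return -1;
	buf[0] = '\0';

	for (i = 0; i < nvals; i++) {
		/* Separate entries with a single space, as on a command line. */
		int n = snprintf(buf + off, size - off, "%s%s=%s",
				 i ? " " : "", key, vals[i]);

		if (n < 0 || (size_t)n >= size - off)
			return -1;	/* output would be truncated */
		off += (size_t)n;
	}
	return (int)off;
}
```

Each array element becomes its own key=value pair, so a handler that is
invoked once per pair sees every value rather than only the first one.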

I'm not familiar with the history on this topic. Has unifying the APIs been
previously considered and set aside?

Given all the feedback on this series, I see three types of issues to address:

1) Minor patch improvements
2) Architecture-specific super early parameters being parsed before bootconfig
   is available
3) Unifying kernel cmdline and bootconfig interfaces

Which of these areas would you recommend I prioritize?

Thanks for the guidance,
--breno


* Re: [PATCH v2] bootconfig: Apply early options from embedded config
From: Breno Leitao @ 2026-03-27 10:06 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Jonathan Corbet, Shuah Khan, linux-kernel, linux-trace-kernel,
	linux-doc, oss, paulmck, rostedt, kernel-team
In-Reply-To: <20260325232204.05edbb21c7602b6408ca007b@kernel.org>

Hi Masami,

On Wed, Mar 25, 2026 at 11:22:04PM +0900, Masami Hiramatsu wrote:
> On Wed, 25 Mar 2026 03:05:38 -0700
> Breno Leitao <leitao@debian.org> wrote:

> > +/*
> > + * bootconfig_apply_early_params - dispatch kernel.* keys from the embedded
> > + * bootconfig as early_param() calls.
> > + *
> > + * early_param() handlers must run before most of the kernel initialises
> > + * (e.g. before the GIC driver reads irqchip.gicv3_pseudo_nmi).  A bootconfig
> > + * attached to the initrd arrives too late for this because the initrd is not
> > + * mapped yet when early params are processed.  The embedded bootconfig lives
> > + * in the kernel image itself (.init.data), so it is always reachable.
> > + *
> > + * This function is called from setup_boot_config() which runs in
> > + * start_kernel() before parse_early_param(), making the timing correct.
> > + */
> > +static void __init bootconfig_apply_early_params(void)
>
> [sashiko comment]
> | Does this run early enough for architectural parameters?
> | While setup_boot_config() runs before parse_early_param() in start_kernel(),
> | it runs after setup_arch(). setup_boot_config() relies on xbc_init() which
> | uses the memblock allocator, requiring setup_arch() to have already
> | initialized it.
> | However, the kernel expects many early parameters (like mem=, earlycon,
> | noapic, and iommu) to be parsed during setup_arch() via the architecture's
> | call to parse_early_param(). Since setup_arch() completes before
> | setup_boot_config() runs, will these architectural early parameters be
> | silently ignored because the decisions they influence were already
> | finalized?
>
> This is the major reason that I did not support early parameter
> in bootconfig. Some archs initialize kernel_cmdline in setup_arch()
> and set up early parameters in it.

Would it be feasible to document which parameters are architecture-specific
and must be processed during setup_arch()?

We could potentially introduce a third parameter category alongside the
existing early_param() and __setup():

	* early_param()
	* __setup()
	* early_arch_param() (New)

This would allow bootconfig to support __setup() and early_param() while
explicitly excluding early_arch_param() from bootconfig processing.

This would split the early parameters into those that bootconfig can
handle easily and those that it cannot.

> To fix this, we need to change setup_arch() for each architecture so
> that it calls this bootconfig_apply_early_params().

Could we instead integrate this into parse_early_param() itself? That
approach would avoid the need to modify each architecture individually.

Thanks for looking at it,
--breno


* Re: [PATCH next] tracing: Remove spurious default precision from show_event_trigger/filter formats
From: Petr Mladek @ 2026-03-27  9:15 UTC (permalink / raw)
  To: david.laight.linux
  Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel, Aaron Tomlin, Rasmus Villemoes,
	Andy Shevchenko, Sergey Senozhatsky, Andrew Morton
In-Reply-To: <20260326201824.3919-1-david.laight.linux@gmail.com>

On Thu 2026-03-26 20:18:24, david.laight.linux@gmail.com wrote:
> From: David Laight <david.laight.linux@gmail.com>
> 
> Change 2d8b7f9bf8e6e ("tracing: Have show_event_trigger/filter format a bit more in columns")
> added space padding to align the output.
> However it used ("%*.s", len, "") which requests the default precision.
> It doesn't matter here whether the userspace default (0) or kernel
> default (no precision) is used, but the format should be "%*s".
> 
> Signed-off-by: David Laight <david.laight.linux@gmail.com>

Makes sense. It does not change the output because it printed
an empty string "" so the precision did not matter.
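
To illustrate the userspace behaviour mentioned in the commit message,
here is a small sketch using the C library's snprintf() (not the kernel's
vsnprintf()). pad_plain() and pad_dot() are hypothetical names used only
for this example.

```c
#include <stdio.h>
#include <stddef.h>

/* Format s right-aligned in a field of `width` using "%*s" (no precision). */
static int pad_plain(char *buf, size_t size, int width, const char *s)
{
	return snprintf(buf, size, "%*s", width, s);
}

/*
 * Same call with "%*.s": in ISO C a '.' with no digits means precision 0,
 * so no characters of s are printed and only the padding remains.
 */
static int pad_dot(char *buf, size_t size, int width, const char *s)
{
	return snprintf(buf, size, "%*.s", width, s);
}
```

With a width of 5 and the string "ab", pad_plain() yields three spaces
followed by "ab", while pad_dot() yields five spaces. With the empty
string, as in the tracing code, both produce the same padding.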

Reviewed-by: Petr Mladek <pmladek@suse.com>

Best Regards,
Petr


* Re: [PATCH v2 06/19] cpufreq: Use trace_call__##name() at guarded tracepoint call sites
From: Gautham R. Shenoy @ 2026-03-27  9:10 UTC (permalink / raw)
  To: Vineeth Pillai (Google)
  Cc: Steven Rostedt, Peter Zijlstra, Huang Rui, Mario Limonciello,
	Perry Yuan, Rafael J. Wysocki, Viresh Kumar, Srinivas Pandruvada,
	Len Brown, linux-pm, linux-kernel, linux-trace-kernel
In-Reply-To: <20260323160052.17528-7-vineeth@bitbyteword.org>

Hello Vineeth,

On Mon, Mar 23, 2026 at 12:00:25PM -0400, Vineeth Pillai (Google) wrote:
> Replace trace_foo() with the new trace_call__foo() at sites already
> guarded by trace_foo_enabled(), avoiding a redundant
> static_branch_unlikely() re-evaluation inside the tracepoint.
> trace_call__foo() calls the tracepoint callbacks directly without
> utilizing the static branch again.
> 
> Suggested-by: Steven Rostedt <rostedt@goodmis.org>
> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Vineeth Pillai (Google) <vineeth@bitbyteword.org>
> Assisted-by: Claude:claude-sonnet-4-6
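
The double evaluation described in the quoted commit message can be
modelled in userspace as below. This is a sketch under loud assumptions: a
plain bool stands in for the tracepoint's static key, foo_probe() stands
in for the registered callbacks, and the counters exist only to make the
difference observable. None of this is the kernel's real implementation.

```c
#include <stdbool.h>

/* Model state: all names here are hypothetical stand-ins. */
static bool foo_enabled;
static int foo_key_checks;	/* how often the "static branch" is tested */
static int foo_events;		/* how often the callbacks run */

static void foo_probe(int value)
{
	(void)value;
	foo_events++;
}

/* Models trace_foo_enabled(): tests the key. */
static bool trace_foo_enabled(void)
{
	foo_key_checks++;
	return foo_enabled;
}

/* Models trace_foo(): tests the key, then calls the callbacks. */
static void trace_foo(int value)
{
	if (trace_foo_enabled())
		foo_probe(value);
}

/* Models trace_call__foo(): the caller has already tested the key. */
static void trace_call__foo(int value)
{
	foo_probe(value);
}

/* Guarded site, old style: the key is tested twice per event. */
static void site_old(int value)
{
	if (trace_foo_enabled())
		trace_foo(value);
}

/* Guarded site, new style: the key is tested once per event. */
static void site_new(int value)
{
	if (trace_foo_enabled())
		trace_call__foo(value);
}
```

In this model site_old() bumps foo_key_checks twice for one event and
site_new() bumps it once, which is the redundancy the series removes at
call sites that are already guarded.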


For drivers/cpufreq/amd-pstate.c and drivers/cpufreq/cpufreq.c

Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>

-- 
Thanks and Regards
gautham.

> ---
>  drivers/cpufreq/amd-pstate.c   | 10 +++++-----
>  drivers/cpufreq/cpufreq.c      |  2 +-
>  drivers/cpufreq/intel_pstate.c |  2 +-
>  3 files changed, 7 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/cpufreq/amd-pstate.c b/drivers/cpufreq/amd-pstate.c
> index 5aa9fcd80cf51..4c47324aa2f73 100644
> --- a/drivers/cpufreq/amd-pstate.c
> +++ b/drivers/cpufreq/amd-pstate.c
> @@ -247,7 +247,7 @@ static int msr_update_perf(struct cpufreq_policy *policy, u8 min_perf,
>  	if (trace_amd_pstate_epp_perf_enabled()) {
>  		union perf_cached perf = READ_ONCE(cpudata->perf);
>  
> -		trace_amd_pstate_epp_perf(cpudata->cpu,
> +		trace_call__amd_pstate_epp_perf(cpudata->cpu,
>  					  perf.highest_perf,
>  					  epp,
>  					  min_perf,
> @@ -298,7 +298,7 @@ static int msr_set_epp(struct cpufreq_policy *policy, u8 epp)
>  	if (trace_amd_pstate_epp_perf_enabled()) {
>  		union perf_cached perf = cpudata->perf;
>  
> -		trace_amd_pstate_epp_perf(cpudata->cpu, perf.highest_perf,
> +		trace_call__amd_pstate_epp_perf(cpudata->cpu, perf.highest_perf,
>  					  epp,
>  					  FIELD_GET(AMD_CPPC_MIN_PERF_MASK,
>  						    cpudata->cppc_req_cached),
> @@ -343,7 +343,7 @@ static int shmem_set_epp(struct cpufreq_policy *policy, u8 epp)
>  	if (trace_amd_pstate_epp_perf_enabled()) {
>  		union perf_cached perf = cpudata->perf;
>  
> -		trace_amd_pstate_epp_perf(cpudata->cpu, perf.highest_perf,
> +		trace_call__amd_pstate_epp_perf(cpudata->cpu, perf.highest_perf,
>  					  epp,
>  					  FIELD_GET(AMD_CPPC_MIN_PERF_MASK,
>  						    cpudata->cppc_req_cached),
> @@ -507,7 +507,7 @@ static int shmem_update_perf(struct cpufreq_policy *policy, u8 min_perf,
>  	if (trace_amd_pstate_epp_perf_enabled()) {
>  		union perf_cached perf = READ_ONCE(cpudata->perf);
>  
> -		trace_amd_pstate_epp_perf(cpudata->cpu,
> +		trace_call__amd_pstate_epp_perf(cpudata->cpu,
>  					  perf.highest_perf,
>  					  epp,
>  					  min_perf,
> @@ -588,7 +588,7 @@ static void amd_pstate_update(struct amd_cpudata *cpudata, u8 min_perf,
>  	}
>  
>  	if (trace_amd_pstate_perf_enabled() && amd_pstate_sample(cpudata)) {
> -		trace_amd_pstate_perf(min_perf, des_perf, max_perf, cpudata->freq,
> +		trace_call__amd_pstate_perf(min_perf, des_perf, max_perf, cpudata->freq,
>  			cpudata->cur.mperf, cpudata->cur.aperf, cpudata->cur.tsc,
>  				cpudata->cpu, fast_switch);
>  	}
> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> index 277884d91913c..58901047eae5a 100644
> --- a/drivers/cpufreq/cpufreq.c
> +++ b/drivers/cpufreq/cpufreq.c
> @@ -2222,7 +2222,7 @@ unsigned int cpufreq_driver_fast_switch(struct cpufreq_policy *policy,
>  
>  	if (trace_cpu_frequency_enabled()) {
>  		for_each_cpu(cpu, policy->cpus)
> -			trace_cpu_frequency(freq, cpu);
> +			trace_call__cpu_frequency(freq, cpu);
>  	}
>  
>  	return freq;
> diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
> index 11c58af419006..70be952209144 100644
> --- a/drivers/cpufreq/intel_pstate.c
> +++ b/drivers/cpufreq/intel_pstate.c
> @@ -3132,7 +3132,7 @@ static void intel_cpufreq_trace(struct cpudata *cpu, unsigned int trace_type, in
>  		return;
>  
>  	sample = &cpu->sample;
> -	trace_pstate_sample(trace_type,
> +	trace_call__pstate_sample(trace_type,
>  		0,
>  		old_pstate,
>  		cpu->pstate.current_pstate,
> -- 
> 2.53.0
> 


* Re: [PATCHv4 bpf-next 09/25] bpf: Add bpf_trampoline_multi_attach/detach functions
From: kernel test robot @ 2026-03-27  4:18 UTC (permalink / raw)
  To: Jiri Olsa, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
  Cc: oe-kbuild-all, bpf, linux-trace-kernel, Martin KaFai Lau,
	Eduard Zingerman, Song Liu, Yonghong Song, Menglong Dong,
	Steven Rostedt
In-Reply-To: <20260324081846.2334094-10-jolsa@kernel.org>

Hi Jiri,

kernel test robot noticed the following build warnings:

[auto build test WARNING on bpf-next/master]

url:    https://github.com/intel-lab-lkp/linux/commits/Jiri-Olsa/ftrace-Add-ftrace_hash_count-function/20260326-101836
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
patch link:    https://lore.kernel.org/r/20260324081846.2334094-10-jolsa%40kernel.org
patch subject: [PATCHv4 bpf-next 09/25] bpf: Add bpf_trampoline_multi_attach/detach functions
config: x86_64-randconfig-015-20260327 (https://download.01.org/0day-ci/archive/20260327/202603271242.rKGaiSYu-lkp@intel.com/config)
compiler: gcc-14 (Debian 14.2.0-19) 14.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260327/202603271242.rKGaiSYu-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202603271242.rKGaiSYu-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> kernel/bpf/trampoline.c:100:13: warning: 'trampoline_unlock_all' defined but not used [-Wunused-function]
     100 | static void trampoline_unlock_all(void)
         |             ^~~~~~~~~~~~~~~~~~~~~
>> kernel/bpf/trampoline.c:92:13: warning: 'trampoline_lock_all' defined but not used [-Wunused-function]
      92 | static void trampoline_lock_all(void)
         |             ^~~~~~~~~~~~~~~~~~~


vim +/trampoline_unlock_all +100 kernel/bpf/trampoline.c

    91	
  > 92	static void trampoline_lock_all(void)
    93	{
    94		int i;
    95	
    96		for (i = 0; i < TRAMPOLINE_LOCKS_TABLE_SIZE; i++)
    97			mutex_lock(&trampoline_locks[i].mutex);
    98	}
    99	
 > 100	static void trampoline_unlock_all(void)
   101	{
   102		int i;
   103	
   104		for (i = 0; i < TRAMPOLINE_LOCKS_TABLE_SIZE; i++)
   105			mutex_unlock(&trampoline_locks[i].mutex);
   106	}
   107	#else
   108	static struct bpf_trampoline *direct_ops_ip_lookup(struct ftrace_ops *ops, unsigned long ip)
   109	{
   110		return ops->private;
   111	}
   112	#endif /* CONFIG_HAVE_SINGLE_FTRACE_DIRECT_OPS */
   113	
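
As a generic illustration of how -Wunused-function arises and two common
ways to avoid it (not a diagnosis of this particular patch), consider the
sketch below. HYPOTHETICAL_FEATURE, lock_all(), unlock_all() and
run_locked() are made-up names. In kernel code the attribute is usually
spelled __maybe_unused.

```c
/*
 * A static helper that is compiled unconditionally but only called when a
 * config option is set draws -Wunused-function when that option is off.
 */
static int lock_depth;

/* Option 1: mark the helper so the compiler does not warn when it is unused. */
static void __attribute__((unused)) lock_all(void)
{
	lock_depth++;
}

/* Option 2: define the helper under the same condition as its only caller. */
#ifdef HYPOTHETICAL_FEATURE
static void unlock_all(void)
{
	lock_depth--;
}
#endif

/* The only caller; it uses the helpers solely when the feature is enabled. */
int run_locked(void)
{
#ifdef HYPOTHETICAL_FEATURE
	lock_all();
	unlock_all();
#endif
	return lock_depth;
}
```

Either approach keeps a W=1 build quiet for configurations in which the
caller is compiled out.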

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


* Re: [PATCH next] tracing: Remove spurious default precision from show_event_trigger/filter formats
From: Masami Hiramatsu @ 2026-03-27  0:37 UTC (permalink / raw)
  To: david.laight.linux
  Cc: Steven Rostedt, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel, Aaron Tomlin, Petr Mladek, Rasmus Villemoes,
	Andy Shevchenko, Sergey Senozhatsky, Andrew Morton
In-Reply-To: <20260326201824.3919-1-david.laight.linux@gmail.com>

On Thu, 26 Mar 2026 20:18:24 +0000
david.laight.linux@gmail.com wrote:

> From: David Laight <david.laight.linux@gmail.com>
> 
> Change 2d8b7f9bf8e6e ("tracing: Have show_event_trigger/filter format a bit more in columns")
> added space padding to align the output.
> However it used ("%*.s", len, "") which requests the default precision.
> It doesn't matter here whether the userspace default (0) or kernel
> default (no precision) is used, but the format should be "%*s".
> 

Looks good to me.

Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>

Thanks!

> Signed-off-by: David Laight <david.laight.linux@gmail.com>
> ---
>  kernel/trace/trace_events.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
> index 249d1cba72c0..6b54c10f9ba4 100644
> --- a/kernel/trace/trace_events.c
> +++ b/kernel/trace/trace_events.c
> @@ -1718,7 +1718,7 @@ static int t_show_filters(struct seq_file *m, void *v)
>  
>  	len = get_call_len(call);
>  
> -	seq_printf(m, "%s:%s%*.s%s\n", call->class->system,
> +	seq_printf(m, "%s:%s%*s%s\n", call->class->system,
>  		   trace_event_name(call), len, "", filter->filter_string);
>  
>  	return 0;
> @@ -1750,7 +1750,7 @@ static int t_show_triggers(struct seq_file *m, void *v)
>  	len = get_call_len(call);
>  
>  	list_for_each_entry_rcu(data, &file->triggers, list) {
> -		seq_printf(m, "%s:%s%*.s", call->class->system,
> +		seq_printf(m, "%s:%s%*s", call->class->system,
>  			   trace_event_name(call), len, "");
>  
>  		data->cmd_ops->print(m, data);
> -- 
> 2.39.5
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>


* [POC PATCH 6/6] KVM: selftests: Test content modes ZERO and PRESERVE for SNP
From: Ackerley Tng @ 2026-03-26 23:36 UTC (permalink / raw)
  To: ackerleytng
  Cc: aik, akpm, andrew.jones, aneesh.kumar, axelrasmussen, baohua, bhe,
	binbin.wu, bp, brauner, chao.p.peng, chrisl, corbet, dave.hansen,
	david, forkloop, hpa, ira.weiny, jgg, jmattson, jroedel,
	jthoughton, kasong, kvm, linux-doc, linux-kernel, linux-kselftest,
	linux-mm, linux-trace-kernel, mathieu.desnoyers, mhiramat,
	michael.roth, mingo, nphamcs, oupton, pankaj.gupta, pbonzini,
	pratyush, qperret, rick.p.edgecombe, rientjes, rostedt, seanjc,
	shikemeng, shivankg, shuah, skhan, steven.price, suzuki.poulose,
	tabba, tglx, vannapurve, vbabka, weixugc, willy, wyihan, x86,
	yan.y.zhao, yuanchu
In-Reply-To: <cover.1774568083.git.ackerleytng@google.com>

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 .../selftests/kvm/x86/sev_smoke_test.c        | 47 +++++++++++++++++--
 1 file changed, 44 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/kvm/x86/sev_smoke_test.c b/tools/testing/selftests/kvm/x86/sev_smoke_test.c
index c40c359f78901..b076e0afc3077 100644
--- a/tools/testing/selftests/kvm/x86/sev_smoke_test.c
+++ b/tools/testing/selftests/kvm/x86/sev_smoke_test.c
@@ -365,7 +365,26 @@ static void guest_code_conversion(u8 *test_shared_gva, u8 *test_private_gva, u64
 	vmgexit();
 }
 
-static void test_conversion(uint64_t policy)
+static void vm_set_memory_attributes_expect_error(struct kvm_vm *vm, u64 gpa,
+						  size_t size, u64 attributes,
+						  u64 flags, int expected_errno)
+{
+	loff_t error_offset = -1;
+	size_t len_ignored;
+	loff_t offset;
+	int gmem_fd;
+	int ret;
+
+	gmem_fd = kvm_gpa_to_guest_memfd(vm, gpa, &offset, &len_ignored);
+	ret = __gmem_set_memory_attributes(gmem_fd, offset, size, attributes,
+					   &error_offset, flags);
+
+	TEST_ASSERT_EQ(ret, -1);
+	TEST_ASSERT_EQ(offset, error_offset);
+	TEST_ASSERT_EQ(errno, expected_errno);
+}
+
+static void test_conversion(uint64_t policy, u64 content_mode)
 {
 	vm_vaddr_t test_private_gva;
 	vm_vaddr_t test_shared_gva;
@@ -409,6 +428,21 @@ static void test_conversion(uint64_t policy)
 	TEST_ASSERT_EQ(vcpu->run->hypercall.args[1], 1);
 	TEST_ASSERT_EQ(vcpu->run->hypercall.args[2], KVM_MAP_GPA_RANGE_ENCRYPTED | KVM_MAP_GPA_RANGE_PAGE_SZ_4K);
 
+	/* ZERO when setting memory attributes to private is always not supported. */
+	vm_set_memory_attributes_expect_error(vm, test_gpa, PAGE_SIZE,
+					      KVM_MEMORY_ATTRIBUTE_PRIVATE,
+					      KVM_SET_MEMORY_ATTRIBUTES2_ZERO,
+					      EOPNOTSUPP);
+
+	/* PRESERVE is not supported for SNP. */
+	vm_set_memory_attributes_expect_error(vm, test_gpa, PAGE_SIZE, 0,
+					      KVM_SET_MEMORY_ATTRIBUTES2_PRESERVE,
+					      EOPNOTSUPP);
+	vm_set_memory_attributes_expect_error(vm, test_gpa, PAGE_SIZE,
+					      KVM_MEMORY_ATTRIBUTE_PRIVATE,
+					      KVM_SET_MEMORY_ATTRIBUTES2_PRESERVE,
+					      EOPNOTSUPP);
+
 	vm_mem_set_private(vm, test_gpa, PAGE_SIZE, KVM_SET_MEMORY_ATTRIBUTES2_MODE_UNSPECIFIED);
 
 	vcpu_run(vcpu);
@@ -419,7 +453,12 @@ static void test_conversion(uint64_t policy)
 	TEST_ASSERT_EQ(vcpu->run->hypercall.args[1], 1);
 	TEST_ASSERT_EQ(vcpu->run->hypercall.args[2], KVM_MAP_GPA_RANGE_DECRYPTED | KVM_MAP_GPA_RANGE_PAGE_SZ_4K);
 
-	vm_mem_set_shared(vm, test_gpa, PAGE_SIZE, KVM_SET_MEMORY_ATTRIBUTES2_MODE_UNSPECIFIED);
+	vm_mem_set_shared(vm, test_gpa, PAGE_SIZE, content_mode);
+
+	if (content_mode == KVM_SET_MEMORY_ATTRIBUTES2_ZERO)
+		TEST_ASSERT_EQ(READ_ONCE(*(u8 *)test_hva), 0);
+	else
+		fprintf(stderr, "test_hva contents = %x\n", READ_ONCE(*(u8 *)test_hva));
 
 	vcpu_run(vcpu);
 
@@ -441,7 +480,9 @@ int main(int argc, char *argv[])
 	// 	test_sev_smoke(guest_sev_es_code, KVM_X86_SEV_ES_VM, SEV_POLICY_ES);
 
 	if (kvm_cpu_has(X86_FEATURE_SEV_SNP)) {
-		test_conversion(snp_default_policy());
+		test_conversion(snp_default_policy(), KVM_SET_MEMORY_ATTRIBUTES2_MODE_UNSPECIFIED);
+		test_conversion(snp_default_policy(), KVM_SET_MEMORY_ATTRIBUTES2_ZERO);
+
 		// test_sev_smoke(guest_snp_code, KVM_X86_SNP_VM, snp_default_policy());
 	}
 
-- 
2.53.0.1018.g2bb0e51243-goog



* [POC PATCH 5/6] KVM: selftests: Test conversions for SNP
From: Ackerley Tng @ 2026-03-26 23:36 UTC (permalink / raw)
  To: ackerleytng
  Cc: aik, akpm, andrew.jones, aneesh.kumar, axelrasmussen, baohua, bhe,
	binbin.wu, bp, brauner, chao.p.peng, chrisl, corbet, dave.hansen,
	david, forkloop, hpa, ira.weiny, jgg, jmattson, jroedel,
	jthoughton, kasong, kvm, linux-doc, linux-kernel, linux-kselftest,
	linux-mm, linux-trace-kernel, mathieu.desnoyers, mhiramat,
	michael.roth, mingo, nphamcs, oupton, pankaj.gupta, pbonzini,
	pratyush, qperret, rick.p.edgecombe, rientjes, rostedt, seanjc,
	shikemeng, shivankg, shuah, skhan, steven.price, suzuki.poulose,
	tabba, tglx, vannapurve, vbabka, weixugc, willy, wyihan, x86,
	yan.y.zhao, yuanchu
In-Reply-To: <cover.1774568083.git.ackerleytng@google.com>

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 .../selftests/kvm/x86/sev_smoke_test.c        | 190 +++++++++++++++++-
 1 file changed, 185 insertions(+), 5 deletions(-)

diff --git a/tools/testing/selftests/kvm/x86/sev_smoke_test.c b/tools/testing/selftests/kvm/x86/sev_smoke_test.c
index 7e69da01cecf4..c40c359f78901 100644
--- a/tools/testing/selftests/kvm/x86/sev_smoke_test.c
+++ b/tools/testing/selftests/kvm/x86/sev_smoke_test.c
@@ -253,17 +253,197 @@ static void test_sev_smoke(void *guest, uint32_t type, uint64_t policy)
 	}
 }
 
+#define GHCB_MSR_REG_GPA_REQ		0x012
+#define GHCB_MSR_REG_GPA_REQ_VAL(v)                \
+	/* GHCBData[63:12] */                      \
+	(((u64)((v) & GENMASK_ULL(51, 0)) << 12) | \
+	 /* GHCBData[11:0] */			   \
+	 GHCB_MSR_REG_GPA_REQ)
+
+#define GHCB_MSR_REG_GPA_RESP		0x013
+#define GHCB_MSR_REG_GPA_RESP_VAL(v)			\
+	/* GHCBData[63:12] */				\
+	(((u64)(v) & GENMASK_ULL(63, 12)) >> 12)
+
+#define GHCB_DATA_LOW			12
+#define GHCB_MSR_INFO_MASK		(BIT_ULL(GHCB_DATA_LOW) - 1)
+#define GHCB_RESP_CODE(v) ((v) & GHCB_MSR_INFO_MASK)
+
+/*
+ * SNP Page State Change Operation
+ *
+ * GHCBData[55:52] - Page operation:
+ *   0x0001	Page assignment, Private
+ *   0x0002	Page assignment, Shared
+ */
+enum psc_op {
+	SNP_PAGE_STATE_PRIVATE = 1,
+	SNP_PAGE_STATE_SHARED,
+};
+
+#define GHCB_MSR_PSC_REQ		0x014
+#define GHCB_MSR_PSC_REQ_GFN(gfn, op)			\
+	/* GHCBData[55:52] */				\
+	(((u64)((op) & 0xf) << 52) |			\
+	/* GHCBData[51:12] */				\
+	((u64)((gfn) & GENMASK_ULL(39, 0)) << 12) |	\
+	/* GHCBData[11:0] */				\
+	GHCB_MSR_PSC_REQ)
+
+#define GHCB_MSR_PSC_RESP		0x015
+#define GHCB_MSR_PSC_RESP_VAL(val)			\
+	/* GHCBData[63:32] */				\
+	(((u64)(val) & GENMASK_ULL(63, 32)) >> 32)
+
+static u64 ghcb_gpa;
+static void snp_register_ghcb(void)
+{
+	u64 ghcb_pfn = ghcb_gpa >> PAGE_SHIFT;
+	u64 val;
+
+	GUEST_ASSERT(ghcb_gpa);
+
+	wrmsr(MSR_AMD64_SEV_ES_GHCB, GHCB_MSR_REG_GPA_REQ_VAL(ghcb_gpa >> PAGE_SHIFT));
+	vmgexit();
+
+	val = rdmsr(MSR_AMD64_SEV_ES_GHCB);
+	GUEST_ASSERT_EQ(GHCB_RESP_CODE(val), GHCB_MSR_REG_GPA_RESP);
+	GUEST_ASSERT_EQ(GHCB_MSR_REG_GPA_RESP_VAL(val), ghcb_pfn);
+}
+
+static void snp_page_state_change(u64 gpa, enum psc_op op)
+{
+	u64 val;
+
+	wrmsr(MSR_AMD64_SEV_ES_GHCB, GHCB_MSR_PSC_REQ_GFN(gpa >> PAGE_SHIFT, op));
+	vmgexit();
+
+	val = rdmsr(MSR_AMD64_SEV_ES_GHCB);
+	GUEST_ASSERT_EQ(GHCB_RESP_CODE(val), GHCB_MSR_PSC_RESP);
+	GUEST_ASSERT_EQ(GHCB_MSR_PSC_RESP_VAL(val), 0);
+}
+
+#define RMP_PG_SIZE_4K			0
+static inline void pvalidate(void *vaddr, bool validate)
+{
+	bool no_rmpupdate;
+	int rc;
+
+	/* "pvalidate" mnemonic support in binutils 2.36 and newer */
+	asm volatile(".byte 0xF2, 0x0F, 0x01, 0xFF\n\t"
+		     : "=@ccc"(no_rmpupdate), "=a"(rc)
+		     : "a"(vaddr), "c"(RMP_PG_SIZE_4K), "d"(validate)
+		     : "memory", "cc");
+
+	GUEST_ASSERT(!no_rmpupdate);
+	GUEST_ASSERT_EQ(rc, 0);
+}
+
+#define CONVERSION_TEST_VALUE_SHARED_1 0xab
+#define CONVERSION_TEST_VALUE_SHARED_2 0xcd
+#define CONVERSION_TEST_VALUE_PRIVATE 0xef
+#define CONVERSION_TEST_VALUE_SHARED_3 0xbc
+static void guest_code_conversion(u8 *test_shared_gva, u8 *test_private_gva, u64 test_gpa)
+{
+	snp_register_ghcb();
+
+	GUEST_ASSERT_EQ(READ_ONCE(*test_shared_gva), CONVERSION_TEST_VALUE_SHARED_1);
+	WRITE_ONCE(*test_shared_gva, CONVERSION_TEST_VALUE_SHARED_2);
+
+	snp_page_state_change(test_gpa, SNP_PAGE_STATE_PRIVATE);
+	pvalidate(test_private_gva, true);
+
+	WRITE_ONCE(*test_private_gva, CONVERSION_TEST_VALUE_PRIVATE);
+	GUEST_ASSERT_EQ(READ_ONCE(*test_private_gva), CONVERSION_TEST_VALUE_PRIVATE);
+
+	pvalidate(test_private_gva, false);
+	snp_page_state_change(test_gpa, SNP_PAGE_STATE_SHARED);
+
+	WRITE_ONCE(*test_shared_gva, CONVERSION_TEST_VALUE_SHARED_3);
+
+	wrmsr(MSR_AMD64_SEV_ES_GHCB, GHCB_MSR_TERM_REQ);
+	vmgexit();
+}
+
+static void test_conversion(uint64_t policy)
+{
+	vm_vaddr_t test_private_gva;
+	vm_vaddr_t test_shared_gva;
+	struct kvm_vcpu *vcpu;
+	vm_vaddr_t ghcb_gva;
+	vm_paddr_t test_gpa;
+	struct kvm_vm *vm;
+	void *ghcb_hva;
+	void *test_hva;
+
+	vm = vm_sev_create_with_one_vcpu(KVM_X86_SNP_VM, guest_code_conversion, &vcpu);
+
+	ghcb_gva = vm_vaddr_alloc_shared(vm, PAGE_SIZE, KVM_UTIL_MIN_VADDR,
+					 MEM_REGION_TEST_DATA);
+	ghcb_hva = addr_gva2hva(vm, ghcb_gva);
+	ghcb_gpa = addr_gva2gpa(vm, ghcb_gva);
+	sync_global_to_guest(vm, ghcb_gpa);
+
+	test_shared_gva = vm_vaddr_alloc_shared(vm, PAGE_SIZE, KVM_UTIL_MIN_VADDR,
+						MEM_REGION_TEST_DATA);
+	test_hva = addr_gva2hva(vm, test_shared_gva);
+	test_gpa = addr_gva2gpa(vm, test_shared_gva);
+
+	test_private_gva = vm_vaddr_unused_gap(vm, PAGE_SIZE, KVM_UTIL_MIN_VADDR);
+	___virt_pg_map(vm, &vm->mmu, test_private_gva, test_gpa, PG_SIZE_4K, true);
+
+	vcpu_args_set(vcpu, 3, test_shared_gva, test_private_gva, test_gpa);
+
+	vm_sev_launch(vm, policy, NULL);
+
+	WRITE_ONCE(*(u8 *)test_hva, CONVERSION_TEST_VALUE_SHARED_1);
+
+	fprintf(stderr, "ghcb_hva=%p ghcb_gpa=%lx ghcb_gva=%lx\n", ghcb_hva, ghcb_gpa, ghcb_gva);
+	fprintf(stderr, "test_hva=%p test_gpa=%lx test_private_gva=%lx test_shared_gva=%lx\n", test_hva, test_gpa, test_private_gva, test_shared_gva);
+
+	vcpu_run(vcpu);
+
+	TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_HYPERCALL);
+	TEST_ASSERT_EQ(vcpu->run->hypercall.nr, KVM_HC_MAP_GPA_RANGE);
+	TEST_ASSERT_EQ(vcpu->run->hypercall.args[0], test_gpa);
+	TEST_ASSERT_EQ(vcpu->run->hypercall.args[1], 1);
+	TEST_ASSERT_EQ(vcpu->run->hypercall.args[2], KVM_MAP_GPA_RANGE_ENCRYPTED | KVM_MAP_GPA_RANGE_PAGE_SZ_4K);
+
+	vm_mem_set_private(vm, test_gpa, PAGE_SIZE, KVM_SET_MEMORY_ATTRIBUTES2_MODE_UNSPECIFIED);
+
+	vcpu_run(vcpu);
+
+	TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_HYPERCALL);
+	TEST_ASSERT_EQ(vcpu->run->hypercall.nr, KVM_HC_MAP_GPA_RANGE);
+	TEST_ASSERT_EQ(vcpu->run->hypercall.args[0], test_gpa);
+	TEST_ASSERT_EQ(vcpu->run->hypercall.args[1], 1);
+	TEST_ASSERT_EQ(vcpu->run->hypercall.args[2], KVM_MAP_GPA_RANGE_DECRYPTED | KVM_MAP_GPA_RANGE_PAGE_SZ_4K);
+
+	vm_mem_set_shared(vm, test_gpa, PAGE_SIZE, KVM_SET_MEMORY_ATTRIBUTES2_MODE_UNSPECIFIED);
+
+	vcpu_run(vcpu);
+
+	TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_SYSTEM_EVENT);
+	TEST_ASSERT_EQ(vcpu->run->system_event.type, KVM_SYSTEM_EVENT_SEV_TERM);
+	TEST_ASSERT_EQ(vcpu->run->system_event.ndata, 1);
+	TEST_ASSERT_EQ(vcpu->run->system_event.data[0], GHCB_MSR_TERM_REQ);
+
+	TEST_ASSERT_EQ(*(u8 *)test_hva, CONVERSION_TEST_VALUE_SHARED_3);
+}
+
 int main(int argc, char *argv[])
 {
 	TEST_REQUIRE(kvm_cpu_has(X86_FEATURE_SEV));
 
-	test_sev_smoke(guest_sev_code, KVM_X86_SEV_VM, 0);
+	// test_sev_smoke(guest_sev_code, KVM_X86_SEV_VM, 0);
 
-	if (kvm_cpu_has(X86_FEATURE_SEV_ES))
-		test_sev_smoke(guest_sev_es_code, KVM_X86_SEV_ES_VM, SEV_POLICY_ES);
+	// if (kvm_cpu_has(X86_FEATURE_SEV_ES))
+	// 	test_sev_smoke(guest_sev_es_code, KVM_X86_SEV_ES_VM, SEV_POLICY_ES);
 
-	if (kvm_cpu_has(X86_FEATURE_SEV_SNP))
-		test_sev_smoke(guest_snp_code, KVM_X86_SNP_VM, snp_default_policy());
+	if (kvm_cpu_has(X86_FEATURE_SEV_SNP)) {
+		test_conversion(snp_default_policy());
+		// test_sev_smoke(guest_snp_code, KVM_X86_SNP_VM, snp_default_policy());
+	}
 
 	return 0;
 }
-- 
2.53.0.1018.g2bb0e51243-goog



