All of lore.kernel.org
 help / color / mirror / Atom feed
From: Harry Yoo <harry.yoo@oracle.com>
To: Nathan Chancellor <nathan@kernel.org>
Cc: "Mathieu Desnoyers" <mathieu.desnoyers@efficios.com>,
	"Thomas Weißschuh" <thomas.weissschuh@linutronix.de>,
	"Michal Clapinski" <mclapinski@google.com>,
	"Andrew Morton" <akpm@linux-foundation.org>,
	"Thomas Gleixner" <tglx@kernel.org>,
	"Steven Rostedt" <rostedt@goodmis.org>,
	"Masami Hiramatsu" <mhiramat@kernel.org>,
	linux-mm@kvack.org, linux-trace-kernel@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: Re: NULL pointer dereference when booting ppc64_guest_defconfig in QEMU on -next
Date: Fri, 20 Mar 2026 13:17:52 +0900	[thread overview]
Message-ID: <abzKcGiRSR_E8lLN@hyeyoo> (raw)
In-Reply-To: <20260319233745.GA769346@ax162>

On Thu, Mar 19, 2026 at 04:37:45PM -0700, Nathan Chancellor wrote:
> Hi all,
> 
> I am not really sure whose bug this is, as it only appears when three
> seemingly independent patch series are applied together, so I have added
> the patch authors and their committers (along with the tracing
> maintainers) to this thread. Feel free to expand or reduce that list as
> necessary.
> 
> Our continuous integration has noticed a crash when booting
> ppc64_guest_defconfig in QEMU on the past few -next versions.
> 
>   https://github.com/ClangBuiltLinux/continuous-integration2/actions/runs/23311154492/job/67811527112
> 
> This does not appear to be clang related, as it can be reproduced with
> GCC 15.2.0 as well. Through multiple bisects, I was able to land on
> applying:
> 
>   mm: improve RSS counter approximation accuracy for proc interfaces [1]
>   vdso/datastore: Allocate data pages dynamically [2]
>   kho: fix deferred init of kho scratch [3]
> 
> and their dependent changes on top of 7.0-rc4 is enough to reproduce
> this (at least on two of my machines with the same commands). I have
> attached the diff from the result of the following 'git apply' commands
> below, done in a linux-next checkout.
> 
>   $ git checkout v7.0-rc4
>   HEAD is now at f338e7738378 Linux 7.0-rc4
> 
>   # [1]
>   $ git diff 60ddf3eed4999bae440d1cf9e5868ccb3f308b64^..087dd6d2cc12c82945ab859194c32e8e977daae3 | git apply -3v
>   ...
> 
>   # [2]
>   # Fix trivial conflict in init/main.c around headers
>   $ git diff dc432ab7130bb39f5a351281a02d4bc61e85a14a^..05988dba11791ccbb458254484826b32f17f4ad2 | git apply -3v
>   ...
> 
>   # [3]
>   # Fix conflict in kernel/liveupdate/kexec_handover.c due to lack of kho_mem_retrieve(), just add pfn_is_kho_scratch()
>   $ git show 4a78467ffb537463486968232daef1e8a2f105e3 | git apply -3v
>   ...
> 
>   $ make -skj"$(nproc)" ARCH=powerpc CROSS_COMPILE=powerpc64-linux- mrproper ppc64_guest_defconfig vmlinux
> 
>   $ curl -LSs https://github.com/ClangBuiltLinux/boot-utils/releases/download/20241120-044434/ppc64-rootfs.cpio.zst | zstd -d >rootfs.cpio
> 
>   $ qemu-system-ppc64 \
>       -display none \
>       -nodefaults \
>       -cpu power8 \
>       -machine pseries \
>       -vga none \
>       -kernel vmlinux \
>       -initrd rootfs.cpio \
>       -m 1G \
>       -serial mon:stdio

Thanks, such a detailed steps to reproduce!
Interestingly, the combination of my compiler (GCC 13.3.0) and
QEMU (8.2.2) don't trigger this bug.

>   [    0.000000][    T0] Linux version 7.0.0-rc4-dirty (nathan@framework-amd-ryzen-maxplus-395) (powerpc64-linux-gcc (GCC) 15.2.0, GNU ld (GNU Binutils) 2.45) #1 SMP PREEMPT Thu Mar 19 15:45:53 MST 2026
>   ...
>   [    0.216764][    T1] vgaarb: loaded
>   [    0.217590][    T1] clocksource: Switched to clocksource timebase
>   [    0.221007][   T12] BUG: Kernel NULL pointer dereference at 0x00000010
>   [    0.221049][   T12] Faulting instruction address: 0xc00000000044947c
>   [    0.221237][   T12] Oops: Kernel access of bad area, sig: 11 [#1]
>   [    0.221276][   T12] BE PAGE_SIZE=64K MMU=Hash  SMP NR_CPUS=2048 NUMA pSeries
>   [    0.221359][   T12] Modules linked in:
>   [    0.221556][   T12] CPU: 0 UID: 0 PID: 12 Comm: kworker/u4:0 Not tainted 7.0.0-rc4-dirty #1 PREEMPTLAZY
>   [    0.221631][   T12] Hardware name: IBM pSeries (emulated by qemu) POWER8 (architected) 0x4d0200 0xf000004 of:SLOF,HEAD pSeries
>   [    0.221765][   T12] Workqueue: trace_init_wq tracer_init_tracefs_work_func
>   [    0.222065][   T12] NIP:  c00000000044947c LR: c00000000041a584 CTR: c00000000053aa90
>   [    0.222084][   T12] REGS: c000000003bc7960 TRAP: 0380   Not tainted  (7.0.0-rc4-dirty)
>   [    0.222111][   T12] MSR:  8000000000009032 <SF,EE,ME,IR,DR,RI>  CR: 44000204  XER: 00000000
>   [    0.222287][   T12] CFAR: c000000000449420 IRQMASK: 0
>   [    0.222287][   T12] GPR00: c00000000041a584 c000000003bc7c00 c000000001c08100 c000000002892f20
>   [    0.222287][   T12] GPR04: c0000000019cfa68 c0000000019cfa60 0000000000000001 0000000000000064
>   [    0.222287][   T12] GPR08: 0000000000000002 0000000000000000 c000000003bba000 0000000000000010
>   [    0.222287][   T12] GPR12: c00000000053aa90 c000000002c50000 c000000001ab25f8 c000000001626690
>   [    0.222287][   T12] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>   [    0.222287][   T12] GPR20: c000000001624868 c000000001ab2708 c0000000019cfa08 c000000001a00d18
>   [    0.222287][   T12] GPR24: c0000000019cfa18 fffffffffffffef7 c000000003051205 c0000000019cfa68
>   [    0.222287][   T12] GPR28: 0000000000000000 c0000000019cfa60 c000000002894e90 0000000000000000
>   [    0.222526][   T12] NIP [c00000000044947c] __find_event_file+0x9c/0x110
>   [    0.222572][   T12] LR [c00000000041a584] init_tracer_tracefs+0x274/0xcc0
>   [    0.222643][   T12] Call Trace:
>   [    0.222690][   T12] [c000000003bc7c00] [c000000000b943b0] tracefs_create_file+0x1a0/0x2b0 (unreliable)
>   [    0.222766][   T12] [c000000003bc7c50] [c00000000041a584] init_tracer_tracefs+0x274/0xcc0
>   [    0.222791][   T12] [c000000003bc7dc0] [c000000002046f1c] tracer_init_tracefs_work_func+0x50/0x320
>   [    0.222809][   T12] [c000000003bc7e50] [c000000000276958] process_one_work+0x1b8/0x530
>   [    0.222828][   T12] [c000000003bc7f10] [c00000000027778c] worker_thread+0x1dc/0x3d0
>   [    0.222883][   T12] [c000000003bc7f90] [c000000000284c44] kthread+0x194/0x1b0
>   [    0.222900][   T12] [c000000003bc7fe0] [c00000000000cf30] start_kernel_thread+0x14/0x18
>   [    0.222961][   T12] Code: 7c691b78 7f63db78 2c090000 40820018 e89c0000 49107f21 60000000 2c030000 41820048 ebff0000 7c3ff040 41820038 <e93f0010> 7fa3eb78 81490058 e8890018
>   [    0.223190][   T12] ---[ end trace 0000000000000000 ]---
>   ...
>
> Interestingly, turning on CONFIG_KASAN appears to hide this, maybe
> pointing to some sort of memory corruption (or something timing
> related)? If there is any other information I can provide, I am more
> than happy to do so.

I don't have much idea on how things end up causing
NULL-pointer-deref... but let's point out suspicious things.

> [1]: https://lore.kernel.org/20260227153730.1556542-4-mathieu.desnoyers@efficios.com/

@Mathieu: In patch 1/3 description,
> Changes since v7:
> - Explicitly initialize the subsystem from start_kernel() right
>   after mm_core_init() so it is up and running before the creation of
>   the first mm at boot.

But how does this work when someone calls mm_cpumask() on init_mm early?
Looks like it will behave incorrectly because get_rss_stat_items_size()
returns zero?

While it doesn't crash on my environment, it triggers a two warnings
(with -smp 2 option added). IIUC the cpu bit should have been set in
setup_arch(), but at the wrong location. After the
percpu_counter_tree_subsystem_init() function is called, the bit doesn't
appear to be set.

[    1.392787][    T1] ------------[ cut here ]------------
[    1.392935][    T1] WARNING: arch/powerpc/mm/mmu_context.c:106 at switch_mm_irqs_off+0x190/0x1c0, CPU#0: swapper/0/1
[    1.393187][    T1] Modules linked in:
[    1.393458][    T1] CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 7.0.0-rc4-next-20260319 #1 PREEMPTLAZY
[    1.393600][    T1] Hardware name: IBM pSeries (emulated by qemu) POWER8 (architected) 0x4d0200 0xf000004 of:SLOF,HEAD pSeries
[    1.393711][    T1] NIP:  c00000000014e390 LR: c00000000014e30c CTR: 0000000000000000
[    1.393752][    T1] REGS: c000000003def7b0 TRAP: 0700   Not tainted  (7.0.0-rc4-next-20260319)
[    1.393807][    T1] MSR:  8000000002021032 <SF,VEC,ME,IR,DR,RI>  CR: 2800284a  XER: 00000000
[    1.393944][    T1] CFAR: c00000000014e328 IRQMASK: 3
[    1.393944][    T1] GPR00: c00000000014e36c c000000003defa50 c000000001bb8100 c0000000028d8c80
[    1.393944][    T1] GPR04: c000000004ddc04a 000000000000000a 0000000022222222 2222222222222222
[    1.393944][    T1] GPR08: 2222222222222222 0000000000000000 0000000000000001 0000000000008000
[    1.393944][    T1] GPR12: c000000000521e80 c000000002c70000 c00000000000fff0 0000000000000000
[    1.393944][    T1] GPR16: 0000000000000000 c00000000606c600 c000000003623ac0 0000000000000000
[    1.393944][    T1] GPR20: c000000004c66300 c00000000606fc00 0000000000000001 0000000000000001
[    1.393944][    T1] GPR24: c000000006069c00 c00000000272c500 0000000000000000 0000000000000000
[    1.393944][    T1] GPR28: c000000003d68200 0000000000000000 c0000000028d8a80 c00000000272bd00
[    1.394355][    T1] NIP [c00000000014e390] switch_mm_irqs_off+0x190/0x1c0
[    1.394395][    T1] LR [c00000000014e30c] switch_mm_irqs_off+0x10c/0x1c0
[    1.394519][    T1] Call Trace:
[    1.394584][    T1] [c000000003defa50] [c00000000014e36c] switch_mm_irqs_off+0x16c/0x1c0 (unreliable)
[    1.394676][    T1] [c000000003defab0] [c0000000006edbf0] begin_new_exec+0x534/0xf60
[    1.394732][    T1] [c000000003defb20] [c000000000795538] load_elf_binary+0x494/0x1d1c
[    1.394765][    T1] [c000000003defc70] [c0000000006eb910] bprm_execve+0x380/0x720
[    1.394796][    T1] [c000000003defd00] [c0000000006ed5a8] kernel_execve+0x12c/0x1bc
[    1.394831][    T1] [c000000003defd50] [c00000000000eda8] run_init_process+0xf8/0x160
[    1.394864][    T1] [c000000003defde0] [c0000000000100b4] kernel_init+0xcc/0x268
[    1.394899][    T1] [c000000003defe50] [c00000000000cf14] ret_from_kernel_user_thread+0x14/0x1c
[    1.394946][    T1] ---- interrupt: 0 at 0x0
[    1.395205][    T1] Code: 7fe4fb78 7f83e378 48009171 60000000 4bffff98 60000000 60000000 60000000 0fe00000 4bffff00 60000000 60000000 <0fe00000> 4bffff98 60000000 60000000
[    1.395420][    T1] ---[ end trace 0000000000000000 ]---
[    1.526024][   T67] mount (67) used greatest stack depth: 28432 bytes left
[    1.605803][   T69] mount (69) used greatest stack depth: 27872 bytes left
[    1.667853][   T71] mkdir (71) used greatest stack depth: 27248 bytes left
Saving 256 bits of creditable seed for next boot
[    1.926636][   T80] ------------[ cut here ]------------
[    1.926719][   T80] WARNING: arch/powerpc/mm/mmu_context.c:51 at switch_mm_irqs_off+0x180/0x1c0, CPU#0: S01seedrng/80
[    1.926782][   T80] Modules linked in:
[    1.926910][   T80] CPU: 0 UID: 0 PID: 80 Comm: S01seedrng Tainted: G        W           7.0.0-rc4-next-20260319 #1 PREEMPTLAZY
[    1.926990][   T80] Tainted: [W]=WARN
[    1.927025][   T80] Hardware name: IBM pSeries (emulated by qemu) POWER8 (architected) 0x4d0200 0xf000004 of:SLOF,HEAD pSeries
[    1.927091][   T80] NIP:  c00000000014e380 LR: c00000000014e24c CTR: c000000000232894
[    1.927131][   T80] REGS: c000000004d5f800 TRAP: 0700   Tainted: G        W            (7.0.0-rc4-next-20260319)
[    1.927179][   T80] MSR:  8000000000029032 <SF,EE,ME,IR,DR,RI>  CR: 28002828  XER: 20000000
[    1.927253][   T80] CFAR: c00000000014e280 IRQMASK: 1
[    1.927253][   T80] GPR00: c0000000002328ec c000000004d5faa0 c000000001bb8100 0000000000000080
[    1.927253][   T80] GPR04: c0000000028d8280 c000000004509c00 0000000000000002 c00000000272c700
[    1.927253][   T80] GPR08: fffffffffffffffe c0000000028d8280 0000000000000000 0000000048002828
[    1.927253][   T80] GPR12: c000000000232894 c000000002c70000 0000000000000000 0000000000000002
[    1.927253][   T80] GPR16: 0000000000000000 000001002f0a2958 000001002f0a2950 ffffffffffffffff
[    1.927253][   T80] GPR20: 0000000000000000 0000000000000000 c000000002ab1400 c00000000272c700
[    1.927253][   T80] GPR24: 0000000000000000 c0000000028d8a80 0000000000000000 0000000000000000
[    1.927253][   T80] GPR28: c000000004509c00 0000000000000000 c00000000272bd00 c0000000028d8280
[    1.927629][   T80] NIP [c00000000014e380] switch_mm_irqs_off+0x180/0x1c0
[    1.927678][   T80] LR [c00000000014e24c] switch_mm_irqs_off+0x4c/0x1c0
[    1.927715][   T80] Call Trace:
[    1.927737][   T80] [c000000004d5faa0] [c000000004d5faf0] 0xc000000004d5faf0 (unreliable)
[    1.927804][   T80] [c000000004d5fb00] [c0000000002328ec] do_shoot_lazy_tlb+0x58/0x84
[    1.927853][   T80] [c000000004d5fb30] [c000000000388304] smp_call_function_many_cond+0x6a0/0x8d8
[    1.927902][   T80] [c000000004d5fc20] [c000000000388624] on_each_cpu_cond_mask+0x40/0x7c
[    1.927943][   T80] [c000000004d5fc50] [c000000000232ad4] __mmdrop+0x88/0x2ec
[    1.927986][   T80] [c000000004d5fce0] [c000000000242104] do_exit+0x350/0xde4
[    1.928028][   T80] [c000000004d5fdb0] [c000000000242de0] do_group_exit+0x48/0xbc
[    1.928072][   T80] [c000000004d5fdf0] [c000000000242e74] pid_child_should_wake+0x0/0x84
[    1.928128][   T80] [c000000004d5fe10] [c000000000030218] system_call_exception+0x148/0x3c0
[    1.928176][   T80] [c000000004d5fe50] [c00000000000c6d4] system_call_common+0xf4/0x258
[    1.928217][   T80] ---- interrupt: c00 at 0x7fff8ade507c
[    1.928253][   T80] NIP:  00007fff8ade507c LR: 00007fff8ade5034 CTR: 0000000000000000
[    1.928291][   T80] REGS: c000000004d5fe80 TRAP: 0c00   Tainted: G        W            (7.0.0-rc4-next-20260319)
[    1.928333][   T80] MSR:  800000000280f032 <SF,VEC,VSX,EE,PR,FP,ME,IR,DR,RI>  CR: 24002824  XER: 00000000
[    1.928413][   T80] IRQMASK: 0
[    1.928413][   T80] GPR00: 00000000000000ea 00007fffe75beb50 00007fff8aed7300 0000000000000000
[    1.928413][   T80] GPR04: 0000000000000000 00007fffe75beda0 00007fffe75bedb0 0000000000000000
[    1.928413][   T80] GPR08: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[    1.928413][   T80] GPR12: 0000000000000000 00007fff8afaae00 00007fffca692568 0000000133cf0440
[    1.928413][   T80] GPR16: 0000000000000000 000001002f0a2958 000001002f0a2950 ffffffffffffffff
[    1.928413][   T80] GPR20: 0000000000000000 0000000000000000 00007fffe75bf838 00007fff8afa0000
[    1.928413][   T80] GPR24: 0000000126911328 0000000000000001 00007fff8af9dc00 00007fffe75bf818
[    1.928413][   T80] GPR28: 0000000000000003 fffffffffffff000 0000000000000000 00007fff8afa3e10
[    1.928765][   T80] NIP [00007fff8ade507c] 0x7fff8ade507c
[    1.928795][   T80] LR [00007fff8ade5034] 0x7fff8ade5034
[    1.928835][   T80] ---- interrupt: c00
[    1.928924][   T80] Code: 7c0803a6 4e800020 60000000 60000000 7fe4fb78 7f83e378 48009171 60000000 4bffff98 60000000 60000000 60000000 <0fe00000> 4bffff00 60000000 60000000
[    1.929054][   T80] ---[ end trace 0000000000000000 ]---

> [2]: https://lore.kernel.org/20260304-vdso-sparc64-generic-2-v6-3-d8eb3b0e1410@linutronix.de/

> [3]: https://lore.kernel.org/20260311125539.4123672-2-mclapinski@google.com/

@Michal: Something my AI buddy pointed out... (that I think is valid):

> diff --git a/mm/mm_init.c b/mm/mm_init.c
> index df34797691bd..7363b5b0d22a 100644
> --- a/mm/mm_init.c
> +++ b/mm/mm_init.c
> @@ -2078,9 +2082,11 @@ deferred_init_memmap_chunk(unsigned long start_pfn, unsigned long end_pfn,
>  			unsigned long mo_pfn = ALIGN(spfn + 1, MAX_ORDER_NR_PAGES);
>  			unsigned long chunk_end = min(mo_pfn, epfn);
>  
> -			nr_pages += deferred_init_pages(zone, spfn, chunk_end);

Previously, deferred_init_pages() returned nr of pages to add, which is
(end_pfn (= chunk_end) - spfn).

> -			deferred_free_pages(spfn, chunk_end - spfn);
> +			// KHO scratch is MAX_ORDER_NR_PAGES aligned.
> +			if (!pfn_is_kho_scratch(spfn))
> +				deferred_init_pages(zone, spfn, chunk_end);

But since the function is not always called with the change,
the calculation is moved to...

> +			deferred_free_pages(spfn, chunk_end - spfn);
>  			spfn = chunk_end;
>  
>  			if (can_resched)
> @@ -2088,6 +2094,7 @@ deferred_init_memmap_chunk(unsigned long start_pfn, unsigned long end_pfn,
>  			else
>  				touch_nmi_watchdog();
>  		}
> +		nr_pages += epfn - spfn;

Here.

But this is incorrect, because here we have:
> static unsigned long __init
> deferred_init_memmap_chunk(unsigned long start_pfn, unsigned long end_pfn,
>                            struct zone *zone, bool can_resched)
> {
>         int nid = zone_to_nid(zone);
>         unsigned long nr_pages = 0;
>         phys_addr_t start, end;
>         u64 i = 0;
> 
>         for_each_free_mem_range(i, nid, 0, &start, &end, NULL) {
>                 unsigned long spfn = PFN_UP(start);
>                 unsigned long epfn = PFN_DOWN(end);
> 
>                 if (spfn >= end_pfn)
>                         break;
> 
>                 spfn = max(spfn, start_pfn);
>                 epfn = min(epfn, end_pfn);
> 
>                 while (spfn < epfn) {

The loop condition is (spfn < epfn), and by the time the loop terminates...

>                         unsigned long mo_pfn = ALIGN(spfn + 1, MAX_ORDER_NR_PAGES);
>                         unsigned long chunk_end = min(mo_pfn, epfn);
> 
>                         // KHO scratch is MAX_ORDER_NR_PAGES aligned.
>                         if (!pfn_is_kho_scratch(spfn))
>                                 deferred_init_pages(zone, spfn, chunk_end);
> 
>                         deferred_free_pages(spfn, chunk_end - spfn);
>                         spfn = chunk_end;
> 
>                         if (can_resched)
>                                 cond_resched();
>                         else
>                                 touch_nmi_watchdog();
>                 }
>                 nr_pages += epfn - spfn;

epfn - spfn <= 0.

So the number of pages returned by deferred_init_memmap_chunk() becomes
incorrect.

The equivalent translation of what's there before would be doing
`nr_pages += chunk_end - spfn;` within the loop.

-- 
Cheers,
Harry / Hyeonggon


  reply	other threads:[~2026-03-20  4:18 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-03-19 23:37 NULL pointer dereference when booting ppc64_guest_defconfig in QEMU on -next Nathan Chancellor
2026-03-20  4:17 ` Harry Yoo [this message]
2026-03-20 12:23   ` Michał Cłapiński
2026-03-20 12:35   ` Mathieu Desnoyers
2026-03-20 13:21     ` Harry Yoo (Oracle)
2026-03-20 13:31       ` Mathieu Desnoyers
2026-03-20 14:20         ` Mathieu Desnoyers
2026-03-21  1:12           ` Ritesh Harjani
2026-03-21  2:21             ` Andrew Morton
2026-04-02 15:30               ` Mathieu Desnoyers
2026-04-02 19:26                 ` Andrew Morton
2026-03-23  1:53           ` Harry Yoo (Oracle)
2026-03-23  1:53         ` Harry Yoo (Oracle)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=abzKcGiRSR_E8lLN@hyeyoo \
    --to=harry.yoo@oracle.com \
    --cc=akpm@linux-foundation.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-trace-kernel@vger.kernel.org \
    --cc=mathieu.desnoyers@efficios.com \
    --cc=mclapinski@google.com \
    --cc=mhiramat@kernel.org \
    --cc=nathan@kernel.org \
    --cc=rostedt@goodmis.org \
    --cc=tglx@kernel.org \
    --cc=thomas.weissschuh@linutronix.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.