From: Harry Yoo <harry.yoo@oracle.com>
To: Nathan Chancellor <nathan@kernel.org>
Cc: "Mathieu Desnoyers" <mathieu.desnoyers@efficios.com>,
"Thomas Weißschuh" <thomas.weissschuh@linutronix.de>,
"Michal Clapinski" <mclapinski@google.com>,
"Andrew Morton" <akpm@linux-foundation.org>,
"Thomas Gleixner" <tglx@kernel.org>,
"Steven Rostedt" <rostedt@goodmis.org>,
"Masami Hiramatsu" <mhiramat@kernel.org>,
linux-mm@kvack.org, linux-trace-kernel@vger.kernel.org,
linux-kernel@vger.kernel.org
Subject: Re: NULL pointer dereference when booting ppc64_guest_defconfig in QEMU on -next
Date: Fri, 20 Mar 2026 13:17:52 +0900
Message-ID: <abzKcGiRSR_E8lLN@hyeyoo>
In-Reply-To: <20260319233745.GA769346@ax162>
On Thu, Mar 19, 2026 at 04:37:45PM -0700, Nathan Chancellor wrote:
> Hi all,
>
> I am not really sure whose bug this is, as it only appears when three
> seemingly independent patch series are applied together, so I have added
> the patch authors and their committers (along with the tracing
> maintainers) to this thread. Feel free to expand or reduce that list as
> necessary.
>
> Our continuous integration has noticed a crash when booting
> ppc64_guest_defconfig in QEMU on the past few -next versions.
>
> https://github.com/ClangBuiltLinux/continuous-integration2/actions/runs/23311154492/job/67811527112
>
> This does not appear to be clang related, as it can be reproduced with
> GCC 15.2.0 as well. Through multiple bisects, I was able to land on
> applying:
>
> mm: improve RSS counter approximation accuracy for proc interfaces [1]
> vdso/datastore: Allocate data pages dynamically [2]
> kho: fix deferred init of kho scratch [3]
>
> and their dependent changes on top of 7.0-rc4 is enough to reproduce
> this (at least on two of my machines with the same commands). I have
> attached the diff from the result of the following 'git apply' commands
> below, done in a linux-next checkout.
>
> $ git checkout v7.0-rc4
> HEAD is now at f338e7738378 Linux 7.0-rc4
>
> # [1]
> $ git diff 60ddf3eed4999bae440d1cf9e5868ccb3f308b64^..087dd6d2cc12c82945ab859194c32e8e977daae3 | git apply -3v
> ...
>
> # [2]
> # Fix trivial conflict in init/main.c around headers
> $ git diff dc432ab7130bb39f5a351281a02d4bc61e85a14a^..05988dba11791ccbb458254484826b32f17f4ad2 | git apply -3v
> ...
>
> # [3]
> # Fix conflict in kernel/liveupdate/kexec_handover.c due to lack of kho_mem_retrieve(), just add pfn_is_kho_scratch()
> $ git show 4a78467ffb537463486968232daef1e8a2f105e3 | git apply -3v
> ...
>
> $ make -skj"$(nproc)" ARCH=powerpc CROSS_COMPILE=powerpc64-linux- mrproper ppc64_guest_defconfig vmlinux
>
> $ curl -LSs https://github.com/ClangBuiltLinux/boot-utils/releases/download/20241120-044434/ppc64-rootfs.cpio.zst | zstd -d >rootfs.cpio
>
> $ qemu-system-ppc64 \
> -display none \
> -nodefaults \
> -cpu power8 \
> -machine pseries \
> -vga none \
> -kernel vmlinux \
> -initrd rootfs.cpio \
> -m 1G \
> -serial mon:stdio
Thanks for such detailed steps to reproduce!
Interestingly, the combination of my compiler (GCC 13.3.0) and
QEMU (8.2.2) doesn't trigger this bug.
> [ 0.000000][ T0] Linux version 7.0.0-rc4-dirty (nathan@framework-amd-ryzen-maxplus-395) (powerpc64-linux-gcc (GCC) 15.2.0, GNU ld (GNU Binutils) 2.45) #1 SMP PREEMPT Thu Mar 19 15:45:53 MST 2026
> ...
> [ 0.216764][ T1] vgaarb: loaded
> [ 0.217590][ T1] clocksource: Switched to clocksource timebase
> [ 0.221007][ T12] BUG: Kernel NULL pointer dereference at 0x00000010
> [ 0.221049][ T12] Faulting instruction address: 0xc00000000044947c
> [ 0.221237][ T12] Oops: Kernel access of bad area, sig: 11 [#1]
> [ 0.221276][ T12] BE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
> [ 0.221359][ T12] Modules linked in:
> [ 0.221556][ T12] CPU: 0 UID: 0 PID: 12 Comm: kworker/u4:0 Not tainted 7.0.0-rc4-dirty #1 PREEMPTLAZY
> [ 0.221631][ T12] Hardware name: IBM pSeries (emulated by qemu) POWER8 (architected) 0x4d0200 0xf000004 of:SLOF,HEAD pSeries
> [ 0.221765][ T12] Workqueue: trace_init_wq tracer_init_tracefs_work_func
> [ 0.222065][ T12] NIP: c00000000044947c LR: c00000000041a584 CTR: c00000000053aa90
> [ 0.222084][ T12] REGS: c000000003bc7960 TRAP: 0380 Not tainted (7.0.0-rc4-dirty)
> [ 0.222111][ T12] MSR: 8000000000009032 <SF,EE,ME,IR,DR,RI> CR: 44000204 XER: 00000000
> [ 0.222287][ T12] CFAR: c000000000449420 IRQMASK: 0
> [ 0.222287][ T12] GPR00: c00000000041a584 c000000003bc7c00 c000000001c08100 c000000002892f20
> [ 0.222287][ T12] GPR04: c0000000019cfa68 c0000000019cfa60 0000000000000001 0000000000000064
> [ 0.222287][ T12] GPR08: 0000000000000002 0000000000000000 c000000003bba000 0000000000000010
> [ 0.222287][ T12] GPR12: c00000000053aa90 c000000002c50000 c000000001ab25f8 c000000001626690
> [ 0.222287][ T12] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> [ 0.222287][ T12] GPR20: c000000001624868 c000000001ab2708 c0000000019cfa08 c000000001a00d18
> [ 0.222287][ T12] GPR24: c0000000019cfa18 fffffffffffffef7 c000000003051205 c0000000019cfa68
> [ 0.222287][ T12] GPR28: 0000000000000000 c0000000019cfa60 c000000002894e90 0000000000000000
> [ 0.222526][ T12] NIP [c00000000044947c] __find_event_file+0x9c/0x110
> [ 0.222572][ T12] LR [c00000000041a584] init_tracer_tracefs+0x274/0xcc0
> [ 0.222643][ T12] Call Trace:
> [ 0.222690][ T12] [c000000003bc7c00] [c000000000b943b0] tracefs_create_file+0x1a0/0x2b0 (unreliable)
> [ 0.222766][ T12] [c000000003bc7c50] [c00000000041a584] init_tracer_tracefs+0x274/0xcc0
> [ 0.222791][ T12] [c000000003bc7dc0] [c000000002046f1c] tracer_init_tracefs_work_func+0x50/0x320
> [ 0.222809][ T12] [c000000003bc7e50] [c000000000276958] process_one_work+0x1b8/0x530
> [ 0.222828][ T12] [c000000003bc7f10] [c00000000027778c] worker_thread+0x1dc/0x3d0
> [ 0.222883][ T12] [c000000003bc7f90] [c000000000284c44] kthread+0x194/0x1b0
> [ 0.222900][ T12] [c000000003bc7fe0] [c00000000000cf30] start_kernel_thread+0x14/0x18
> [ 0.222961][ T12] Code: 7c691b78 7f63db78 2c090000 40820018 e89c0000 49107f21 60000000 2c030000 41820048 ebff0000 7c3ff040 41820038 <e93f0010> 7fa3eb78 81490058 e8890018
> [ 0.223190][ T12] ---[ end trace 0000000000000000 ]---
> ...
>
> Interestingly, turning on CONFIG_KASAN appears to hide this, maybe
> pointing to some sort of memory corruption (or something timing
> related)? If there is any other information I can provide, I am more
> than happy to do so.
I don't have a clear idea of how these changes end up causing a
NULL pointer dereference, but let's point out the suspicious things.
> [1]: https://lore.kernel.org/20260227153730.1556542-4-mathieu.desnoyers@efficios.com/
@Mathieu: In the description of patch 1/3,
> Changes since v7:
> - Explicitly initialize the subsystem from start_kernel() right
> after mm_core_init() so it is up and running before the creation of
> the first mm at boot.
But how does this work when someone calls mm_cpumask() on init_mm early?
Looks like it will behave incorrectly because get_rss_stat_items_size()
returns zero?
While it doesn't crash in my environment, it triggers two warnings
(with the -smp 2 option added). IIUC the CPU bit should have been set in
setup_arch(), but it was set at the wrong location. After
percpu_counter_tree_subsystem_init() is called, the bit doesn't
appear to be set.
[ 1.392787][ T1] ------------[ cut here ]------------
[ 1.392935][ T1] WARNING: arch/powerpc/mm/mmu_context.c:106 at switch_mm_irqs_off+0x190/0x1c0, CPU#0: swapper/0/1
[ 1.393187][ T1] Modules linked in:
[ 1.393458][ T1] CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 7.0.0-rc4-next-20260319 #1 PREEMPTLAZY
[ 1.393600][ T1] Hardware name: IBM pSeries (emulated by qemu) POWER8 (architected) 0x4d0200 0xf000004 of:SLOF,HEAD pSeries
[ 1.393711][ T1] NIP: c00000000014e390 LR: c00000000014e30c CTR: 0000000000000000
[ 1.393752][ T1] REGS: c000000003def7b0 TRAP: 0700 Not tainted (7.0.0-rc4-next-20260319)
[ 1.393807][ T1] MSR: 8000000002021032 <SF,VEC,ME,IR,DR,RI> CR: 2800284a XER: 00000000
[ 1.393944][ T1] CFAR: c00000000014e328 IRQMASK: 3
[ 1.393944][ T1] GPR00: c00000000014e36c c000000003defa50 c000000001bb8100 c0000000028d8c80
[ 1.393944][ T1] GPR04: c000000004ddc04a 000000000000000a 0000000022222222 2222222222222222
[ 1.393944][ T1] GPR08: 2222222222222222 0000000000000000 0000000000000001 0000000000008000
[ 1.393944][ T1] GPR12: c000000000521e80 c000000002c70000 c00000000000fff0 0000000000000000
[ 1.393944][ T1] GPR16: 0000000000000000 c00000000606c600 c000000003623ac0 0000000000000000
[ 1.393944][ T1] GPR20: c000000004c66300 c00000000606fc00 0000000000000001 0000000000000001
[ 1.393944][ T1] GPR24: c000000006069c00 c00000000272c500 0000000000000000 0000000000000000
[ 1.393944][ T1] GPR28: c000000003d68200 0000000000000000 c0000000028d8a80 c00000000272bd00
[ 1.394355][ T1] NIP [c00000000014e390] switch_mm_irqs_off+0x190/0x1c0
[ 1.394395][ T1] LR [c00000000014e30c] switch_mm_irqs_off+0x10c/0x1c0
[ 1.394519][ T1] Call Trace:
[ 1.394584][ T1] [c000000003defa50] [c00000000014e36c] switch_mm_irqs_off+0x16c/0x1c0 (unreliable)
[ 1.394676][ T1] [c000000003defab0] [c0000000006edbf0] begin_new_exec+0x534/0xf60
[ 1.394732][ T1] [c000000003defb20] [c000000000795538] load_elf_binary+0x494/0x1d1c
[ 1.394765][ T1] [c000000003defc70] [c0000000006eb910] bprm_execve+0x380/0x720
[ 1.394796][ T1] [c000000003defd00] [c0000000006ed5a8] kernel_execve+0x12c/0x1bc
[ 1.394831][ T1] [c000000003defd50] [c00000000000eda8] run_init_process+0xf8/0x160
[ 1.394864][ T1] [c000000003defde0] [c0000000000100b4] kernel_init+0xcc/0x268
[ 1.394899][ T1] [c000000003defe50] [c00000000000cf14] ret_from_kernel_user_thread+0x14/0x1c
[ 1.394946][ T1] ---- interrupt: 0 at 0x0
[ 1.395205][ T1] Code: 7fe4fb78 7f83e378 48009171 60000000 4bffff98 60000000 60000000 60000000 0fe00000 4bffff00 60000000 60000000 <0fe00000> 4bffff98 60000000 60000000
[ 1.395420][ T1] ---[ end trace 0000000000000000 ]---
[ 1.526024][ T67] mount (67) used greatest stack depth: 28432 bytes left
[ 1.605803][ T69] mount (69) used greatest stack depth: 27872 bytes left
[ 1.667853][ T71] mkdir (71) used greatest stack depth: 27248 bytes left
Saving 256 bits of creditable seed for next boot
[ 1.926636][ T80] ------------[ cut here ]------------
[ 1.926719][ T80] WARNING: arch/powerpc/mm/mmu_context.c:51 at switch_mm_irqs_off+0x180/0x1c0, CPU#0: S01seedrng/80
[ 1.926782][ T80] Modules linked in:
[ 1.926910][ T80] CPU: 0 UID: 0 PID: 80 Comm: S01seedrng Tainted: G W 7.0.0-rc4-next-20260319 #1 PREEMPTLAZY
[ 1.926990][ T80] Tainted: [W]=WARN
[ 1.927025][ T80] Hardware name: IBM pSeries (emulated by qemu) POWER8 (architected) 0x4d0200 0xf000004 of:SLOF,HEAD pSeries
[ 1.927091][ T80] NIP: c00000000014e380 LR: c00000000014e24c CTR: c000000000232894
[ 1.927131][ T80] REGS: c000000004d5f800 TRAP: 0700 Tainted: G W (7.0.0-rc4-next-20260319)
[ 1.927179][ T80] MSR: 8000000000029032 <SF,EE,ME,IR,DR,RI> CR: 28002828 XER: 20000000
[ 1.927253][ T80] CFAR: c00000000014e280 IRQMASK: 1
[ 1.927253][ T80] GPR00: c0000000002328ec c000000004d5faa0 c000000001bb8100 0000000000000080
[ 1.927253][ T80] GPR04: c0000000028d8280 c000000004509c00 0000000000000002 c00000000272c700
[ 1.927253][ T80] GPR08: fffffffffffffffe c0000000028d8280 0000000000000000 0000000048002828
[ 1.927253][ T80] GPR12: c000000000232894 c000000002c70000 0000000000000000 0000000000000002
[ 1.927253][ T80] GPR16: 0000000000000000 000001002f0a2958 000001002f0a2950 ffffffffffffffff
[ 1.927253][ T80] GPR20: 0000000000000000 0000000000000000 c000000002ab1400 c00000000272c700
[ 1.927253][ T80] GPR24: 0000000000000000 c0000000028d8a80 0000000000000000 0000000000000000
[ 1.927253][ T80] GPR28: c000000004509c00 0000000000000000 c00000000272bd00 c0000000028d8280
[ 1.927629][ T80] NIP [c00000000014e380] switch_mm_irqs_off+0x180/0x1c0
[ 1.927678][ T80] LR [c00000000014e24c] switch_mm_irqs_off+0x4c/0x1c0
[ 1.927715][ T80] Call Trace:
[ 1.927737][ T80] [c000000004d5faa0] [c000000004d5faf0] 0xc000000004d5faf0 (unreliable)
[ 1.927804][ T80] [c000000004d5fb00] [c0000000002328ec] do_shoot_lazy_tlb+0x58/0x84
[ 1.927853][ T80] [c000000004d5fb30] [c000000000388304] smp_call_function_many_cond+0x6a0/0x8d8
[ 1.927902][ T80] [c000000004d5fc20] [c000000000388624] on_each_cpu_cond_mask+0x40/0x7c
[ 1.927943][ T80] [c000000004d5fc50] [c000000000232ad4] __mmdrop+0x88/0x2ec
[ 1.927986][ T80] [c000000004d5fce0] [c000000000242104] do_exit+0x350/0xde4
[ 1.928028][ T80] [c000000004d5fdb0] [c000000000242de0] do_group_exit+0x48/0xbc
[ 1.928072][ T80] [c000000004d5fdf0] [c000000000242e74] pid_child_should_wake+0x0/0x84
[ 1.928128][ T80] [c000000004d5fe10] [c000000000030218] system_call_exception+0x148/0x3c0
[ 1.928176][ T80] [c000000004d5fe50] [c00000000000c6d4] system_call_common+0xf4/0x258
[ 1.928217][ T80] ---- interrupt: c00 at 0x7fff8ade507c
[ 1.928253][ T80] NIP: 00007fff8ade507c LR: 00007fff8ade5034 CTR: 0000000000000000
[ 1.928291][ T80] REGS: c000000004d5fe80 TRAP: 0c00 Tainted: G W (7.0.0-rc4-next-20260319)
[ 1.928333][ T80] MSR: 800000000280f032 <SF,VEC,VSX,EE,PR,FP,ME,IR,DR,RI> CR: 24002824 XER: 00000000
[ 1.928413][ T80] IRQMASK: 0
[ 1.928413][ T80] GPR00: 00000000000000ea 00007fffe75beb50 00007fff8aed7300 0000000000000000
[ 1.928413][ T80] GPR04: 0000000000000000 00007fffe75beda0 00007fffe75bedb0 0000000000000000
[ 1.928413][ T80] GPR08: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 1.928413][ T80] GPR12: 0000000000000000 00007fff8afaae00 00007fffca692568 0000000133cf0440
[ 1.928413][ T80] GPR16: 0000000000000000 000001002f0a2958 000001002f0a2950 ffffffffffffffff
[ 1.928413][ T80] GPR20: 0000000000000000 0000000000000000 00007fffe75bf838 00007fff8afa0000
[ 1.928413][ T80] GPR24: 0000000126911328 0000000000000001 00007fff8af9dc00 00007fffe75bf818
[ 1.928413][ T80] GPR28: 0000000000000003 fffffffffffff000 0000000000000000 00007fff8afa3e10
[ 1.928765][ T80] NIP [00007fff8ade507c] 0x7fff8ade507c
[ 1.928795][ T80] LR [00007fff8ade5034] 0x7fff8ade5034
[ 1.928835][ T80] ---- interrupt: c00
[ 1.928924][ T80] Code: 7c0803a6 4e800020 60000000 60000000 7fe4fb78 7f83e378 48009171 60000000 4bffff98 60000000 60000000 60000000 <0fe00000> 4bffff00 60000000 60000000
[ 1.929054][ T80] ---[ end trace 0000000000000000 ]---
> [2]: https://lore.kernel.org/20260304-vdso-sparc64-generic-2-v6-3-d8eb3b0e1410@linutronix.de/
> [3]: https://lore.kernel.org/20260311125539.4123672-2-mclapinski@google.com/
@Michal: Something my AI buddy pointed out... (that I think is valid):
> diff --git a/mm/mm_init.c b/mm/mm_init.c
> index df34797691bd..7363b5b0d22a 100644
> --- a/mm/mm_init.c
> +++ b/mm/mm_init.c
> @@ -2078,9 +2082,11 @@ deferred_init_memmap_chunk(unsigned long start_pfn, unsigned long end_pfn,
> unsigned long mo_pfn = ALIGN(spfn + 1, MAX_ORDER_NR_PAGES);
> unsigned long chunk_end = min(mo_pfn, epfn);
>
> - nr_pages += deferred_init_pages(zone, spfn, chunk_end);
Previously, deferred_init_pages() returned the number of pages to add,
which is (end_pfn (= chunk_end) - spfn).
> - deferred_free_pages(spfn, chunk_end - spfn);
> + // KHO scratch is MAX_ORDER_NR_PAGES aligned.
> + if (!pfn_is_kho_scratch(spfn))
> + deferred_init_pages(zone, spfn, chunk_end);
But since the function is no longer always called after this change,
the calculation is moved to...
> + deferred_free_pages(spfn, chunk_end - spfn);
> spfn = chunk_end;
>
> if (can_resched)
> @@ -2088,6 +2094,7 @@ deferred_init_memmap_chunk(unsigned long start_pfn, unsigned long end_pfn,
> else
> touch_nmi_watchdog();
> }
> + nr_pages += epfn - spfn;
Here.
But this is incorrect, because here we have:
> static unsigned long __init
> deferred_init_memmap_chunk(unsigned long start_pfn, unsigned long end_pfn,
> struct zone *zone, bool can_resched)
> {
> int nid = zone_to_nid(zone);
> unsigned long nr_pages = 0;
> phys_addr_t start, end;
> u64 i = 0;
>
> for_each_free_mem_range(i, nid, 0, &start, &end, NULL) {
> unsigned long spfn = PFN_UP(start);
> unsigned long epfn = PFN_DOWN(end);
>
> if (spfn >= end_pfn)
> break;
>
> spfn = max(spfn, start_pfn);
> epfn = min(epfn, end_pfn);
>
> while (spfn < epfn) {
The loop condition is (spfn < epfn), and by the time the loop terminates...
> unsigned long mo_pfn = ALIGN(spfn + 1, MAX_ORDER_NR_PAGES);
> unsigned long chunk_end = min(mo_pfn, epfn);
>
> // KHO scratch is MAX_ORDER_NR_PAGES aligned.
> if (!pfn_is_kho_scratch(spfn))
> deferred_init_pages(zone, spfn, chunk_end);
>
> deferred_free_pages(spfn, chunk_end - spfn);
> spfn = chunk_end;
>
> if (can_resched)
> cond_resched();
> else
> touch_nmi_watchdog();
> }
> nr_pages += epfn - spfn;
epfn - spfn <= 0.
So the number of pages returned by deferred_init_memmap_chunk() becomes
incorrect.
The equivalent translation of the previous behavior would be doing
`nr_pages += chunk_end - spfn;` within the loop, before spfn is
advanced to chunk_end.
--
Cheers,
Harry / Hyeonggon