From: Harry Yoo <harry.yoo@oracle.com>
To: Nathan Chancellor <nathan@kernel.org>
Cc: "Mathieu Desnoyers" <mathieu.desnoyers@efficios.com>,
	"Thomas Weißschuh" <thomas.weissschuh@linutronix.de>,
	"Michal Clapinski" <mclapinski@google.com>,
	"Andrew Morton" <akpm@linux-foundation.org>,
	"Thomas Gleixner" <tglx@kernel.org>,
	"Steven Rostedt" <rostedt@goodmis.org>,
	"Masami Hiramatsu" <mhiramat@kernel.org>,
	linux-mm@kvack.org, linux-trace-kernel@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: Re: NULL pointer dereference when booting ppc64_guest_defconfig in QEMU on -next
Date: Fri, 20 Mar 2026 13:17:52 +0900
Message-ID: <abzKcGiRSR_E8lLN@hyeyoo>
In-Reply-To: <20260319233745.GA769346@ax162>

On Thu, Mar 19, 2026 at 04:37:45PM -0700, Nathan Chancellor wrote:
> Hi all,
> 
> I am not really sure whose bug this is, as it only appears when three
> seemingly independent patch series are applied together, so I have added
> the patch authors and their committers (along with the tracing
> maintainers) to this thread. Feel free to expand or reduce that list as
> necessary.
> 
> Our continuous integration has noticed a crash when booting
> ppc64_guest_defconfig in QEMU on the past few -next versions.
> 
>   https://github.com/ClangBuiltLinux/continuous-integration2/actions/runs/23311154492/job/67811527112
> 
> This does not appear to be clang related, as it can be reproduced with
> GCC 15.2.0 as well. Through multiple bisects, I was able to land on
> applying:
> 
>   mm: improve RSS counter approximation accuracy for proc interfaces [1]
>   vdso/datastore: Allocate data pages dynamically [2]
>   kho: fix deferred init of kho scratch [3]
> 
> and their dependent changes on top of 7.0-rc4 is enough to reproduce
> this (at least on two of my machines with the same commands). I have
> attached the diff from the result of the following 'git apply' commands
> below, done in a linux-next checkout.
> 
>   $ git checkout v7.0-rc4
>   HEAD is now at f338e7738378 Linux 7.0-rc4
> 
>   # [1]
>   $ git diff 60ddf3eed4999bae440d1cf9e5868ccb3f308b64^..087dd6d2cc12c82945ab859194c32e8e977daae3 | git apply -3v
>   ...
> 
>   # [2]
>   # Fix trivial conflict in init/main.c around headers
>   $ git diff dc432ab7130bb39f5a351281a02d4bc61e85a14a^..05988dba11791ccbb458254484826b32f17f4ad2 | git apply -3v
>   ...
> 
>   # [3]
>   # Fix conflict in kernel/liveupdate/kexec_handover.c due to lack of kho_mem_retrieve(), just add pfn_is_kho_scratch()
>   $ git show 4a78467ffb537463486968232daef1e8a2f105e3 | git apply -3v
>   ...
> 
>   $ make -skj"$(nproc)" ARCH=powerpc CROSS_COMPILE=powerpc64-linux- mrproper ppc64_guest_defconfig vmlinux
> 
>   $ curl -LSs https://github.com/ClangBuiltLinux/boot-utils/releases/download/20241120-044434/ppc64-rootfs.cpio.zst | zstd -d >rootfs.cpio
> 
>   $ qemu-system-ppc64 \
>       -display none \
>       -nodefaults \
>       -cpu power8 \
>       -machine pseries \
>       -vga none \
>       -kernel vmlinux \
>       -initrd rootfs.cpio \
>       -m 1G \
>       -serial mon:stdio

Thanks for such detailed steps to reproduce!
Interestingly, the combination of my compiler (GCC 13.3.0) and
QEMU (8.2.2) doesn't trigger this bug.

>   [    0.000000][    T0] Linux version 7.0.0-rc4-dirty (nathan@framework-amd-ryzen-maxplus-395) (powerpc64-linux-gcc (GCC) 15.2.0, GNU ld (GNU Binutils) 2.45) #1 SMP PREEMPT Thu Mar 19 15:45:53 MST 2026
>   ...
>   [    0.216764][    T1] vgaarb: loaded
>   [    0.217590][    T1] clocksource: Switched to clocksource timebase
>   [    0.221007][   T12] BUG: Kernel NULL pointer dereference at 0x00000010
>   [    0.221049][   T12] Faulting instruction address: 0xc00000000044947c
>   [    0.221237][   T12] Oops: Kernel access of bad area, sig: 11 [#1]
>   [    0.221276][   T12] BE PAGE_SIZE=64K MMU=Hash  SMP NR_CPUS=2048 NUMA pSeries
>   [    0.221359][   T12] Modules linked in:
>   [    0.221556][   T12] CPU: 0 UID: 0 PID: 12 Comm: kworker/u4:0 Not tainted 7.0.0-rc4-dirty #1 PREEMPTLAZY
>   [    0.221631][   T12] Hardware name: IBM pSeries (emulated by qemu) POWER8 (architected) 0x4d0200 0xf000004 of:SLOF,HEAD pSeries
>   [    0.221765][   T12] Workqueue: trace_init_wq tracer_init_tracefs_work_func
>   [    0.222065][   T12] NIP:  c00000000044947c LR: c00000000041a584 CTR: c00000000053aa90
>   [    0.222084][   T12] REGS: c000000003bc7960 TRAP: 0380   Not tainted  (7.0.0-rc4-dirty)
>   [    0.222111][   T12] MSR:  8000000000009032 <SF,EE,ME,IR,DR,RI>  CR: 44000204  XER: 00000000
>   [    0.222287][   T12] CFAR: c000000000449420 IRQMASK: 0
>   [    0.222287][   T12] GPR00: c00000000041a584 c000000003bc7c00 c000000001c08100 c000000002892f20
>   [    0.222287][   T12] GPR04: c0000000019cfa68 c0000000019cfa60 0000000000000001 0000000000000064
>   [    0.222287][   T12] GPR08: 0000000000000002 0000000000000000 c000000003bba000 0000000000000010
>   [    0.222287][   T12] GPR12: c00000000053aa90 c000000002c50000 c000000001ab25f8 c000000001626690
>   [    0.222287][   T12] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>   [    0.222287][   T12] GPR20: c000000001624868 c000000001ab2708 c0000000019cfa08 c000000001a00d18
>   [    0.222287][   T12] GPR24: c0000000019cfa18 fffffffffffffef7 c000000003051205 c0000000019cfa68
>   [    0.222287][   T12] GPR28: 0000000000000000 c0000000019cfa60 c000000002894e90 0000000000000000
>   [    0.222526][   T12] NIP [c00000000044947c] __find_event_file+0x9c/0x110
>   [    0.222572][   T12] LR [c00000000041a584] init_tracer_tracefs+0x274/0xcc0
>   [    0.222643][   T12] Call Trace:
>   [    0.222690][   T12] [c000000003bc7c00] [c000000000b943b0] tracefs_create_file+0x1a0/0x2b0 (unreliable)
>   [    0.222766][   T12] [c000000003bc7c50] [c00000000041a584] init_tracer_tracefs+0x274/0xcc0
>   [    0.222791][   T12] [c000000003bc7dc0] [c000000002046f1c] tracer_init_tracefs_work_func+0x50/0x320
>   [    0.222809][   T12] [c000000003bc7e50] [c000000000276958] process_one_work+0x1b8/0x530
>   [    0.222828][   T12] [c000000003bc7f10] [c00000000027778c] worker_thread+0x1dc/0x3d0
>   [    0.222883][   T12] [c000000003bc7f90] [c000000000284c44] kthread+0x194/0x1b0
>   [    0.222900][   T12] [c000000003bc7fe0] [c00000000000cf30] start_kernel_thread+0x14/0x18
>   [    0.222961][   T12] Code: 7c691b78 7f63db78 2c090000 40820018 e89c0000 49107f21 60000000 2c030000 41820048 ebff0000 7c3ff040 41820038 <e93f0010> 7fa3eb78 81490058 e8890018
>   [    0.223190][   T12] ---[ end trace 0000000000000000 ]---
>   ...
>
> Interestingly, turning on CONFIG_KASAN appears to hide this, maybe
> pointing to some sort of memory corruption (or something timing
> related)? If there is any other information I can provide, I am more
> than happy to do so.

I don't have a clear idea of how these end up causing a NULL pointer
dereference, but let me point out a few suspicious things.

> [1]: https://lore.kernel.org/20260227153730.1556542-4-mathieu.desnoyers@efficios.com/

@Mathieu: In the description of patch 1/3:
> Changes since v7:
> - Explicitly initialize the subsystem from start_kernel() right
>   after mm_core_init() so it is up and running before the creation of
>   the first mm at boot.

But how does this work when someone calls mm_cpumask() on init_mm early,
before the subsystem is initialized? It looks like it will behave
incorrectly, because get_rss_stat_items_size() still returns zero then?
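
To make the suspicion concrete, here is a userspace sketch of the failure
mode I have in mind. It is not the kernel code: it assumes that
mm_cpumask() finds the mask at an offset derived from
get_rss_stat_items_size(), and every fake_* name and the 64-byte layout
are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for get_rss_stat_items_size(): 0 until the
 * subsystem is initialized, non-zero afterwards. */
static size_t fake_items_size;

static size_t fake_get_items_size(void)
{
	return fake_items_size;
}

/* Hypothetical mm layout: one flexible area holding the counter items,
 * followed by the CPU mask. */
struct fake_mm {
	unsigned char flex[64];
};

/* Assumed behaviour: the mask address depends on the current item size. */
static unsigned char *fake_mm_cpumask(struct fake_mm *mm)
{
	size_t off = fake_get_items_size();

	assert(off < sizeof(mm->flex));
	return mm->flex + off;
}

/* Returns 1 if a bit set before init is still visible after init. */
static int bit_survives_init(size_t size_after_init)
{
	struct fake_mm mm;

	memset(&mm, 0, sizeof(mm));
	fake_items_size = 0;               /* early boot, e.g. setup_arch() */
	fake_mm_cpumask(&mm)[0] |= 1;      /* mark CPU 0 in the mask */
	fake_items_size = size_after_init; /* after subsystem init */
	return fake_mm_cpumask(&mm)[0] & 1;
}
```

In this model, bit_survives_init(0) returns 1, while any non-zero size
such as bit_survives_init(16) returns 0: the bit was written where the
counter items now live, and the mask is read from a different place.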

While it doesn't crash in my environment, it triggers two warnings
(with the -smp 2 option added). IIUC the CPU bit is supposed to have been
set in setup_arch(), but it was set at the wrong location. After
percpu_counter_tree_subsystem_init() is called, the bit doesn't appear
to be set.

[    1.392787][    T1] ------------[ cut here ]------------
[    1.392935][    T1] WARNING: arch/powerpc/mm/mmu_context.c:106 at switch_mm_irqs_off+0x190/0x1c0, CPU#0: swapper/0/1
[    1.393187][    T1] Modules linked in:
[    1.393458][    T1] CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 7.0.0-rc4-next-20260319 #1 PREEMPTLAZY
[    1.393600][    T1] Hardware name: IBM pSeries (emulated by qemu) POWER8 (architected) 0x4d0200 0xf000004 of:SLOF,HEAD pSeries
[    1.393711][    T1] NIP:  c00000000014e390 LR: c00000000014e30c CTR: 0000000000000000
[    1.393752][    T1] REGS: c000000003def7b0 TRAP: 0700   Not tainted  (7.0.0-rc4-next-20260319)
[    1.393807][    T1] MSR:  8000000002021032 <SF,VEC,ME,IR,DR,RI>  CR: 2800284a  XER: 00000000
[    1.393944][    T1] CFAR: c00000000014e328 IRQMASK: 3
[    1.393944][    T1] GPR00: c00000000014e36c c000000003defa50 c000000001bb8100 c0000000028d8c80
[    1.393944][    T1] GPR04: c000000004ddc04a 000000000000000a 0000000022222222 2222222222222222
[    1.393944][    T1] GPR08: 2222222222222222 0000000000000000 0000000000000001 0000000000008000
[    1.393944][    T1] GPR12: c000000000521e80 c000000002c70000 c00000000000fff0 0000000000000000
[    1.393944][    T1] GPR16: 0000000000000000 c00000000606c600 c000000003623ac0 0000000000000000
[    1.393944][    T1] GPR20: c000000004c66300 c00000000606fc00 0000000000000001 0000000000000001
[    1.393944][    T1] GPR24: c000000006069c00 c00000000272c500 0000000000000000 0000000000000000
[    1.393944][    T1] GPR28: c000000003d68200 0000000000000000 c0000000028d8a80 c00000000272bd00
[    1.394355][    T1] NIP [c00000000014e390] switch_mm_irqs_off+0x190/0x1c0
[    1.394395][    T1] LR [c00000000014e30c] switch_mm_irqs_off+0x10c/0x1c0
[    1.394519][    T1] Call Trace:
[    1.394584][    T1] [c000000003defa50] [c00000000014e36c] switch_mm_irqs_off+0x16c/0x1c0 (unreliable)
[    1.394676][    T1] [c000000003defab0] [c0000000006edbf0] begin_new_exec+0x534/0xf60
[    1.394732][    T1] [c000000003defb20] [c000000000795538] load_elf_binary+0x494/0x1d1c
[    1.394765][    T1] [c000000003defc70] [c0000000006eb910] bprm_execve+0x380/0x720
[    1.394796][    T1] [c000000003defd00] [c0000000006ed5a8] kernel_execve+0x12c/0x1bc
[    1.394831][    T1] [c000000003defd50] [c00000000000eda8] run_init_process+0xf8/0x160
[    1.394864][    T1] [c000000003defde0] [c0000000000100b4] kernel_init+0xcc/0x268
[    1.394899][    T1] [c000000003defe50] [c00000000000cf14] ret_from_kernel_user_thread+0x14/0x1c
[    1.394946][    T1] ---- interrupt: 0 at 0x0
[    1.395205][    T1] Code: 7fe4fb78 7f83e378 48009171 60000000 4bffff98 60000000 60000000 60000000 0fe00000 4bffff00 60000000 60000000 <0fe00000> 4bffff98 60000000 60000000
[    1.395420][    T1] ---[ end trace 0000000000000000 ]---
[    1.526024][   T67] mount (67) used greatest stack depth: 28432 bytes left
[    1.605803][   T69] mount (69) used greatest stack depth: 27872 bytes left
[    1.667853][   T71] mkdir (71) used greatest stack depth: 27248 bytes left
Saving 256 bits of creditable seed for next boot
[    1.926636][   T80] ------------[ cut here ]------------
[    1.926719][   T80] WARNING: arch/powerpc/mm/mmu_context.c:51 at switch_mm_irqs_off+0x180/0x1c0, CPU#0: S01seedrng/80
[    1.926782][   T80] Modules linked in:
[    1.926910][   T80] CPU: 0 UID: 0 PID: 80 Comm: S01seedrng Tainted: G        W           7.0.0-rc4-next-20260319 #1 PREEMPTLAZY
[    1.926990][   T80] Tainted: [W]=WARN
[    1.927025][   T80] Hardware name: IBM pSeries (emulated by qemu) POWER8 (architected) 0x4d0200 0xf000004 of:SLOF,HEAD pSeries
[    1.927091][   T80] NIP:  c00000000014e380 LR: c00000000014e24c CTR: c000000000232894
[    1.927131][   T80] REGS: c000000004d5f800 TRAP: 0700   Tainted: G        W            (7.0.0-rc4-next-20260319)
[    1.927179][   T80] MSR:  8000000000029032 <SF,EE,ME,IR,DR,RI>  CR: 28002828  XER: 20000000
[    1.927253][   T80] CFAR: c00000000014e280 IRQMASK: 1
[    1.927253][   T80] GPR00: c0000000002328ec c000000004d5faa0 c000000001bb8100 0000000000000080
[    1.927253][   T80] GPR04: c0000000028d8280 c000000004509c00 0000000000000002 c00000000272c700
[    1.927253][   T80] GPR08: fffffffffffffffe c0000000028d8280 0000000000000000 0000000048002828
[    1.927253][   T80] GPR12: c000000000232894 c000000002c70000 0000000000000000 0000000000000002
[    1.927253][   T80] GPR16: 0000000000000000 000001002f0a2958 000001002f0a2950 ffffffffffffffff
[    1.927253][   T80] GPR20: 0000000000000000 0000000000000000 c000000002ab1400 c00000000272c700
[    1.927253][   T80] GPR24: 0000000000000000 c0000000028d8a80 0000000000000000 0000000000000000
[    1.927253][   T80] GPR28: c000000004509c00 0000000000000000 c00000000272bd00 c0000000028d8280
[    1.927629][   T80] NIP [c00000000014e380] switch_mm_irqs_off+0x180/0x1c0
[    1.927678][   T80] LR [c00000000014e24c] switch_mm_irqs_off+0x4c/0x1c0
[    1.927715][   T80] Call Trace:
[    1.927737][   T80] [c000000004d5faa0] [c000000004d5faf0] 0xc000000004d5faf0 (unreliable)
[    1.927804][   T80] [c000000004d5fb00] [c0000000002328ec] do_shoot_lazy_tlb+0x58/0x84
[    1.927853][   T80] [c000000004d5fb30] [c000000000388304] smp_call_function_many_cond+0x6a0/0x8d8
[    1.927902][   T80] [c000000004d5fc20] [c000000000388624] on_each_cpu_cond_mask+0x40/0x7c
[    1.927943][   T80] [c000000004d5fc50] [c000000000232ad4] __mmdrop+0x88/0x2ec
[    1.927986][   T80] [c000000004d5fce0] [c000000000242104] do_exit+0x350/0xde4
[    1.928028][   T80] [c000000004d5fdb0] [c000000000242de0] do_group_exit+0x48/0xbc
[    1.928072][   T80] [c000000004d5fdf0] [c000000000242e74] pid_child_should_wake+0x0/0x84
[    1.928128][   T80] [c000000004d5fe10] [c000000000030218] system_call_exception+0x148/0x3c0
[    1.928176][   T80] [c000000004d5fe50] [c00000000000c6d4] system_call_common+0xf4/0x258
[    1.928217][   T80] ---- interrupt: c00 at 0x7fff8ade507c
[    1.928253][   T80] NIP:  00007fff8ade507c LR: 00007fff8ade5034 CTR: 0000000000000000
[    1.928291][   T80] REGS: c000000004d5fe80 TRAP: 0c00   Tainted: G        W            (7.0.0-rc4-next-20260319)
[    1.928333][   T80] MSR:  800000000280f032 <SF,VEC,VSX,EE,PR,FP,ME,IR,DR,RI>  CR: 24002824  XER: 00000000
[    1.928413][   T80] IRQMASK: 0
[    1.928413][   T80] GPR00: 00000000000000ea 00007fffe75beb50 00007fff8aed7300 0000000000000000
[    1.928413][   T80] GPR04: 0000000000000000 00007fffe75beda0 00007fffe75bedb0 0000000000000000
[    1.928413][   T80] GPR08: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[    1.928413][   T80] GPR12: 0000000000000000 00007fff8afaae00 00007fffca692568 0000000133cf0440
[    1.928413][   T80] GPR16: 0000000000000000 000001002f0a2958 000001002f0a2950 ffffffffffffffff
[    1.928413][   T80] GPR20: 0000000000000000 0000000000000000 00007fffe75bf838 00007fff8afa0000
[    1.928413][   T80] GPR24: 0000000126911328 0000000000000001 00007fff8af9dc00 00007fffe75bf818
[    1.928413][   T80] GPR28: 0000000000000003 fffffffffffff000 0000000000000000 00007fff8afa3e10
[    1.928765][   T80] NIP [00007fff8ade507c] 0x7fff8ade507c
[    1.928795][   T80] LR [00007fff8ade5034] 0x7fff8ade5034
[    1.928835][   T80] ---- interrupt: c00
[    1.928924][   T80] Code: 7c0803a6 4e800020 60000000 60000000 7fe4fb78 7f83e378 48009171 60000000 4bffff98 60000000 60000000 60000000 <0fe00000> 4bffff00 60000000 60000000
[    1.929054][   T80] ---[ end trace 0000000000000000 ]---

> [2]: https://lore.kernel.org/20260304-vdso-sparc64-generic-2-v6-3-d8eb3b0e1410@linutronix.de/

> [3]: https://lore.kernel.org/20260311125539.4123672-2-mclapinski@google.com/

@Michal: Something my AI buddy pointed out... (that I think is valid):

> diff --git a/mm/mm_init.c b/mm/mm_init.c
> index df34797691bd..7363b5b0d22a 100644
> --- a/mm/mm_init.c
> +++ b/mm/mm_init.c
> @@ -2078,9 +2082,11 @@ deferred_init_memmap_chunk(unsigned long start_pfn, unsigned long end_pfn,
>  			unsigned long mo_pfn = ALIGN(spfn + 1, MAX_ORDER_NR_PAGES);
>  			unsigned long chunk_end = min(mo_pfn, epfn);
>  
> -			nr_pages += deferred_init_pages(zone, spfn, chunk_end);

Previously, deferred_init_pages() returned the number of pages to add,
which is (end_pfn (= chunk_end) - spfn).

> -			deferred_free_pages(spfn, chunk_end - spfn);
> +			// KHO scratch is MAX_ORDER_NR_PAGES aligned.
> +			if (!pfn_is_kho_scratch(spfn))
> +				deferred_init_pages(zone, spfn, chunk_end);

But since the function is no longer always called after this change,
the calculation has been moved to...

> +			deferred_free_pages(spfn, chunk_end - spfn);
>  			spfn = chunk_end;
>  
>  			if (can_resched)
> @@ -2088,6 +2094,7 @@ deferred_init_memmap_chunk(unsigned long start_pfn, unsigned long end_pfn,
>  			else
>  				touch_nmi_watchdog();
>  		}
> +		nr_pages += epfn - spfn;

Here.

But this is incorrect, because here we have:
> static unsigned long __init
> deferred_init_memmap_chunk(unsigned long start_pfn, unsigned long end_pfn,
>                            struct zone *zone, bool can_resched)
> {
>         int nid = zone_to_nid(zone);
>         unsigned long nr_pages = 0;
>         phys_addr_t start, end;
>         u64 i = 0;
> 
>         for_each_free_mem_range(i, nid, 0, &start, &end, NULL) {
>                 unsigned long spfn = PFN_UP(start);
>                 unsigned long epfn = PFN_DOWN(end);
> 
>                 if (spfn >= end_pfn)
>                         break;
> 
>                 spfn = max(spfn, start_pfn);
>                 epfn = min(epfn, end_pfn);
> 
>                 while (spfn < epfn) {

The loop condition is (spfn < epfn), and by the time the loop terminates...

>                         unsigned long mo_pfn = ALIGN(spfn + 1, MAX_ORDER_NR_PAGES);
>                         unsigned long chunk_end = min(mo_pfn, epfn);
> 
>                         // KHO scratch is MAX_ORDER_NR_PAGES aligned.
>                         if (!pfn_is_kho_scratch(spfn))
>                                 deferred_init_pages(zone, spfn, chunk_end);
> 
>                         deferred_free_pages(spfn, chunk_end - spfn);
>                         spfn = chunk_end;
> 
>                         if (can_resched)
>                                 cond_resched();
>                         else
>                                 touch_nmi_watchdog();
>                 }
>                 nr_pages += epfn - spfn;

spfn >= epfn, i.e. epfn - spfn <= 0 (exactly 0 whenever the loop body
ran, since chunk_end is capped at epfn).

So the number of pages returned by deferred_init_memmap_chunk() becomes
incorrect.

The equivalent of what was there before would be to do
`nr_pages += chunk_end - spfn;` inside the loop, before spfn is advanced
to chunk_end.
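
For illustration, the two accounting schemes can be compared in a small
userspace model of the inner loop. This is only a sketch: FAKE_CHUNK is a
made-up stand-in for MAX_ORDER_NR_PAGES, fake_align_up() stands in for
ALIGN(), and the page init/free calls are left out because they don't
affect the count.

```c
#include <assert.h>

/* Hypothetical stand-in for MAX_ORDER_NR_PAGES; must be a power of two. */
#define FAKE_CHUNK 1024UL

/* Round x up to a multiple of a (a power of two), like ALIGN(). */
static unsigned long fake_align_up(unsigned long x, unsigned long a)
{
	assert((a & (a - 1)) == 0);
	return (x + a - 1) & ~(a - 1);
}

/* Accounting as in the patch: add (epfn - spfn) after the loop. */
static unsigned long count_after_loop(unsigned long spfn, unsigned long epfn)
{
	unsigned long nr_pages = 0;

	while (spfn < epfn) {
		unsigned long mo_pfn = fake_align_up(spfn + 1, FAKE_CHUNK);
		unsigned long chunk_end = mo_pfn < epfn ? mo_pfn : epfn;

		spfn = chunk_end;
	}
	/* If the loop ran, spfn == epfn here, so this adds 0. */
	nr_pages += epfn - spfn;
	return nr_pages;
}

/* Suggested accounting: add each chunk before spfn is advanced. */
static unsigned long count_in_loop(unsigned long spfn, unsigned long epfn)
{
	unsigned long nr_pages = 0;

	while (spfn < epfn) {
		unsigned long mo_pfn = fake_align_up(spfn + 1, FAKE_CHUNK);
		unsigned long chunk_end = mo_pfn < epfn ? mo_pfn : epfn;

		nr_pages += chunk_end - spfn;
		spfn = chunk_end;
	}
	return nr_pages;
}
```

For the range [100, 5000), count_after_loop() returns 0 while
count_in_loop() returns 4900, i.e. the full length of the range.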

-- 
Cheers,
Harry / Hyeonggon

