[PATCH 0/2] KHO Fixes

kexec.lists.infradead.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/2] KHO Fixes
@ 2025-05-18 14:23 Changyuan Lyu
  2025-05-18 14:23 ` [PATCH 1/2] memblock: show a warning if allocation in KHO scratch fails Changyuan Lyu
  2025-05-18 14:23 ` [PATCH 2/2] KHO: init new_physxa->phys_bits to fix lockdep Changyuan Lyu
  0 siblings, 2 replies; 10+ messages in thread
From: Changyuan Lyu @ 2025-05-18 14:23 UTC (permalink / raw)
  To: rppt, akpm
  Cc: graf, bhe, kexec, linux-kernel, linux-mm, chrisl, pasha.tatashin,
	jasonmiu, Changyuan Lyu

Hi,

This patchset contains 2 fixes for the KHO series
(https://lore.kernel.org/all/20250509074635.3187114-1-changyuanl@google.com/).

Best,
Changyuan

Changyuan Lyu (1):
  memblock: show a warning if allocation in KHO scratch fails

Pasha Tatashin (1):
  KHO: init new_physxa->phys_bits to fix lockdep

 kernel/kexec_handover.c | 29 +++++++++++++++++++++++++----
 mm/memblock.c           |  3 +++
 2 files changed, 28 insertions(+), 4 deletions(-)


base-commit: 9e619cd4fefd19cdce16e169d5827bc64ae01aa1
--
2.49.0.1101.gccaa498523-goog


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 1/2] memblock: show a warning if allocation in KHO scratch fails
  2025-05-18 14:23 [PATCH 0/2] KHO Fixes Changyuan Lyu
@ 2025-05-18 14:23 ` Changyuan Lyu
  2025-05-18 16:07   ` Mike Rapoport
  2025-05-18 14:23 ` [PATCH 2/2] KHO: init new_physxa->phys_bits to fix lockdep Changyuan Lyu
  1 sibling, 1 reply; 10+ messages in thread
From: Changyuan Lyu @ 2025-05-18 14:23 UTC (permalink / raw)
  To: rppt, akpm
  Cc: graf, bhe, kexec, linux-kernel, linux-mm, chrisl, pasha.tatashin,
	jasonmiu, Changyuan Lyu

When we kexec into a new kernel from an old kernel with KHO
enabled, the new kernel allocates vmemmap in the scratch area.
If the KHO scratch size is too small, vmemmap allocation would
fail and cause kernel panic, like the following,

[    0.027133] Faking a node at [mem 0x0000000000000000-0x00000004ffffffff]
[    0.027877] NODE_DATA(0) allocated [mem 0x4e4bd5a00-0x4e4bfffff]
[    0.029696] sparse_init_nid: node[0] memory map backing failed. Some memory will not be available.
[    0.029698] Zone ranges:
[    0.030974]   DMA      [mem 0x0000000000001000-0x0000000000ffffff]
[    0.031627]   DMA32    [mem 0x0000000001000000-0x00000000ffffffff]
[    0.032281]   Normal   [mem 0x0000000100000000-0x00000004ffffffff]
[    0.032930]   Device   empty
[    0.033251] Movable zone start for each node
[    0.033710] Early memory node ranges
[    0.034108]   node   0: [mem 0x0000000000001000-0x000000000007ffff]
[    0.034801]   node   0: [mem 0x0000000000100000-0x00000000773fffff]
[    0.035461]   node   0: [mem 0x0000000077400000-0x00000000775fffff]
[    0.036116]   node   0: [mem 0x0000000077600000-0x000000007fffffff]
[    0.036768]   node   0: [mem 0x0000000100000000-0x00000004ccbfffff]
[    0.037423]   node   0: [mem 0x00000004ccc00000-0x00000004e4bfffff]
[    0.038111] BUG: kernel NULL pointer dereference, address: 0000000000000010
[    0.038880] #PF: supervisor write access in kernel mode
[    0.039474] #PF: error_code(0x0002) - not-present page
[    0.040056] PGD 0 P4D 0
[    0.040335] Oops: Oops: 0002 [#1] SMP
[    0.040745] CPU: 0 UID: 0 PID: 0 Comm: swapper Not tainted 6.15.0-rc4+ #275 NONE
[    0.041541] RIP: 0010:__bitmap_set+0x2b/0x80
[    0.041992] Code: 0f 1e fa 55 48 89 e5 89 f1 89 f0 c1 e8 06 48 8d 04 c7 48 c7 c7 ff ff ff ff 48 d3 e7 41 89 f0 41 83 c8 c0 44 89 c6 01 d6 78 43 <48> 09 38 48 83 c0 08 83 fe 40 72 1a 41 8d 3c 10 83 c7 40 48 c7 00
[    0.043986] RSP: 0000:ffffffff96203df0 EFLAGS: 00010047
[    0.044546] RAX: 0000000000000010 RBX: 000000000000cc00 RCX: 0000000000000000
[    0.045311] RDX: 0000000000000040 RSI: 0000000000000000 RDI: ffffffffffffffff
[    0.046075] RBP: ffffffff96203df0 R08: 00000000ffffffc0 R09: ffffffff9626c950
[    0.046830] R10: 000000000002fffd R11: 0000000000000004 R12: 0000000000008000
[    0.047574] R13: 0000000000000000 R14: 000000000000003f R15: 000000000000009b
[    0.048313] FS:  0000000000000000(0000) GS:0000000000000000(0000) knlGS:0000000000000000
[    0.049151] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.049751] CR2: 0000000000000010 CR3: 00000004d123e000 CR4: 00000000000200b0
[    0.050494] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    0.051238] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    0.051978] Call Trace:
[    0.052235]  <TASK>
[    0.052455]  subsection_map_init+0xe4/0x130
[    0.052891]  free_area_init+0x217/0x3d0
[    0.053290]  zone_sizes_init+0x5e/0x80
[    0.053682]  paging_init+0x27/0x30
[    0.054046]  setup_arch+0x307/0x3e0
[    0.054422]  start_kernel+0x59/0x390
[    0.054820]  x86_64_start_reservations+0x28/0x30
[    0.055307]  x86_64_start_kernel+0x70/0x80
[    0.055736]  common_startup_64+0x13b/0x140
[    0.056165]  </TASK>
[    0.056392] CR2: 0000000000000010
[    0.056737] ---[ end trace 0000000000000000 ]---
[    0.057218] RIP: 0010:__bitmap_set+0x2b/0x80
[    0.057667] Code: 0f 1e fa 55 48 89 e5 89 f1 89 f0 c1 e8 06 48 8d 04 c7 48 c7 c7 ff ff ff ff 48 d3 e7 41 89 f0 41 83 c8 c0 44 89 c6 01 d6 78 43 <48> 09 38 48 83 c0 08 83 fe 40 72 1a 41 8d 3c 10 83 c7 40 48 c7 00
[    0.059650] RSP: 0000:ffffffff96203df0 EFLAGS: 00010047
[    0.060218] RAX: 0000000000000010 RBX: 000000000000cc00 RCX: 0000000000000000
[    0.060985] RDX: 0000000000000040 RSI: 0000000000000000 RDI: ffffffffffffffff
[    0.061728] RBP: ffffffff96203df0 R08: 00000000ffffffc0 R09: ffffffff9626c950
[    0.062486] R10: 000000000002fffd R11: 0000000000000004 R12: 0000000000008000
[    0.063228] R13: 0000000000000000 R14: 000000000000003f R15: 000000000000009b
[    0.063968] FS:  0000000000000000(0000) GS:0000000000000000(0000) knlGS:0000000000000000
[    0.064812] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.065423] CR2: 0000000000000010 CR3: 00000004d123e000 CR4: 00000000000200b0
[    0.066175] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    0.066926] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    0.067678] Kernel panic - not syncing: Attempted to kill the idle task!
[    0.068403] ---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]---

The panic above can be easily reproduced by the following steps,

1.  boot a VM with 20GiB physical memory (or larger) and kernel command
    line "kho=on kho_scratch=2m,256m,128m"
2.  echo 1 > /sys/kernel/debug/kho/out/finalize
3.  kexec to a new kernel

The current panic log above is confusing and it's hard to find the
root cause.

Add an error log to make it easier to debug such kind of panics.

Fixes: d59f43b57480 ("memblock: add support for scratch memory")
Signed-off-by: Changyuan Lyu <changyuanl@google.com>
---
 mm/memblock.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/memblock.c b/mm/memblock.c
index 154f1d73b61f..ed886bfd3de7 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1573,6 +1573,9 @@ phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size,
 		goto again;
 	}

+	if (flags & MEMBLOCK_KHO_SCRATCH)
+		pr_err_once("Could not allocate %pap bytes in KHO scratch\n", &size);
+
 	return 0;

 done:
--
2.49.0.1101.gccaa498523-goog


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 2/2] KHO: init new_physxa->phys_bits to fix lockdep
  2025-05-18 14:23 [PATCH 0/2] KHO Fixes Changyuan Lyu
  2025-05-18 14:23 ` [PATCH 1/2] memblock: show a warning if allocation in KHO scratch fails Changyuan Lyu
@ 2025-05-18 14:23 ` Changyuan Lyu
  2025-05-18 15:51   ` Mike Rapoport
  1 sibling, 1 reply; 10+ messages in thread
From: Changyuan Lyu @ 2025-05-18 14:23 UTC (permalink / raw)
  To: rppt, akpm
  Cc: graf, bhe, kexec, linux-kernel, linux-mm, chrisl, pasha.tatashin,
	jasonmiu, Changyuan Lyu

From: Pasha Tatashin <pasha.tatashin@soleen.com>

Lockdep shows the following warning:

INFO: trying to register non-static key.
The code is fine but needs lockdep annotation, or maybe
you didn't initialize this object before use?
turning off the locking correctness validator.

[<ffffffff810133a6>] dump_stack_lvl+0x66/0xa0
[<ffffffff8136012c>] assign_lock_key+0x10c/0x120
[<ffffffff81358bb4>] register_lock_class+0xf4/0x2f0
[<ffffffff813597ff>] __lock_acquire+0x7f/0x2c40
[<ffffffff81360cb0>] ? __pfx_hlock_conflict+0x10/0x10
[<ffffffff811707be>] ? native_flush_tlb_global+0x8e/0xa0
[<ffffffff8117096e>] ? __flush_tlb_all+0x4e/0xa0
[<ffffffff81172fc2>] ? __kernel_map_pages+0x112/0x140
[<ffffffff813ec327>] ? xa_load_or_alloc+0x67/0xe0
[<ffffffff81359556>] lock_acquire+0xe6/0x280
[<ffffffff813ec327>] ? xa_load_or_alloc+0x67/0xe0
[<ffffffff8100b9e0>] _raw_spin_lock+0x30/0x40
[<ffffffff813ec327>] ? xa_load_or_alloc+0x67/0xe0
[<ffffffff813ec327>] xa_load_or_alloc+0x67/0xe0
[<ffffffff813eb4c0>] kho_preserve_folio+0x90/0x100
[<ffffffff813ebb7f>] __kho_finalize+0xcf/0x400
[<ffffffff813ebef4>] kho_finalize+0x34/0x70

This is becase xa has its own lock, that is not initialized in
xa_load_or_alloc.

Modifiy __kho_preserve_order(), to properly call
xa_init(&new_physxa->phys_bits);

Fixes: fc33e4b44b27 ("kexec: enable KHO support for memory preservation")
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Changyuan Lyu <changyuanl@google.com>
---
 kernel/kexec_handover.c | 29 +++++++++++++++++++++++++----
 1 file changed, 25 insertions(+), 4 deletions(-)

diff --git a/kernel/kexec_handover.c b/kernel/kexec_handover.c
index 69b953551677..f0ac6a9170f8 100644
--- a/kernel/kexec_handover.c
+++ b/kernel/kexec_handover.c
@@ -144,14 +144,35 @@ static int __kho_preserve_order(struct kho_mem_track *track, unsigned long pfn,
 				unsigned int order)
 {
 	struct kho_mem_phys_bits *bits;
-	struct kho_mem_phys *physxa;
+	struct kho_mem_phys *physxa, *new_physxa;
 	const unsigned long pfn_high = pfn >> order;

 	might_sleep();

-	physxa = xa_load_or_alloc(&track->orders, order, sizeof(*physxa));
-	if (IS_ERR(physxa))
-		return PTR_ERR(physxa);
+	physxa = xa_load(&track->orders, order);
+	if (!physxa) {
+		new_physxa = kzalloc(sizeof(*physxa), GFP_KERNEL);
+		if (!new_physxa)
+			return -ENOMEM;
+
+		xa_init(&new_physxa->phys_bits);
+		physxa = xa_cmpxchg(&track->orders, order, NULL, new_physxa,
+				    GFP_KERNEL);
+		if (xa_is_err(physxa)) {
+			int err_ret = xa_err(physxa);
+
+			xa_destroy(&new_physxa->phys_bits);
+			kfree(new_physxa);
+
+			return err_ret;
+		}
+		if (physxa) {
+			xa_destroy(&new_physxa->phys_bits);
+			kfree(new_physxa);
+		} else {
+			physxa = new_physxa;
+		}
+	}

 	bits = xa_load_or_alloc(&physxa->phys_bits, pfn_high / PRESERVE_BITS,
 				sizeof(*bits));
--
2.49.0.1101.gccaa498523-goog


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/2] KHO: init new_physxa->phys_bits to fix lockdep
  2025-05-18 14:23 ` [PATCH 2/2] KHO: init new_physxa->phys_bits to fix lockdep Changyuan Lyu
@ 2025-05-18 15:51   ` Mike Rapoport
  2025-05-19 12:10     ` Pasha Tatashin
  0 siblings, 1 reply; 10+ messages in thread
From: Mike Rapoport @ 2025-05-18 15:51 UTC (permalink / raw)
  To: Changyuan Lyu
  Cc: akpm, graf, bhe, kexec, linux-kernel, linux-mm, chrisl,
	pasha.tatashin, jasonmiu

On Sun, May 18, 2025 at 07:23:15AM -0700, Changyuan Lyu wrote:
> From: Pasha Tatashin <pasha.tatashin@soleen.com>
> 
> Lockdep shows the following warning:
> 
> INFO: trying to register non-static key.
> The code is fine but needs lockdep annotation, or maybe
> you didn't initialize this object before use?
> turning off the locking correctness validator.
> 
> [<ffffffff810133a6>] dump_stack_lvl+0x66/0xa0
> [<ffffffff8136012c>] assign_lock_key+0x10c/0x120
> [<ffffffff81358bb4>] register_lock_class+0xf4/0x2f0
> [<ffffffff813597ff>] __lock_acquire+0x7f/0x2c40
> [<ffffffff81360cb0>] ? __pfx_hlock_conflict+0x10/0x10
> [<ffffffff811707be>] ? native_flush_tlb_global+0x8e/0xa0
> [<ffffffff8117096e>] ? __flush_tlb_all+0x4e/0xa0
> [<ffffffff81172fc2>] ? __kernel_map_pages+0x112/0x140
> [<ffffffff813ec327>] ? xa_load_or_alloc+0x67/0xe0
> [<ffffffff81359556>] lock_acquire+0xe6/0x280
> [<ffffffff813ec327>] ? xa_load_or_alloc+0x67/0xe0
> [<ffffffff8100b9e0>] _raw_spin_lock+0x30/0x40
> [<ffffffff813ec327>] ? xa_load_or_alloc+0x67/0xe0
> [<ffffffff813ec327>] xa_load_or_alloc+0x67/0xe0
> [<ffffffff813eb4c0>] kho_preserve_folio+0x90/0x100
> [<ffffffff813ebb7f>] __kho_finalize+0xcf/0x400
> [<ffffffff813ebef4>] kho_finalize+0x34/0x70
> 
> This is becase xa has its own lock, that is not initialized in
> xa_load_or_alloc.
> 
> Modifiy __kho_preserve_order(), to properly call
> xa_init(&new_physxa->phys_bits);
> 
> Fixes: fc33e4b44b27 ("kexec: enable KHO support for memory preservation")
> Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> Signed-off-by: Changyuan Lyu <changyuanl@google.com>
> ---
>  kernel/kexec_handover.c | 29 +++++++++++++++++++++++++----
>  1 file changed, 25 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/kexec_handover.c b/kernel/kexec_handover.c
> index 69b953551677..f0ac6a9170f8 100644
> --- a/kernel/kexec_handover.c
> +++ b/kernel/kexec_handover.c
> @@ -144,14 +144,35 @@ static int __kho_preserve_order(struct kho_mem_track *track, unsigned long pfn,
>  				unsigned int order)
>  {
>  	struct kho_mem_phys_bits *bits;
> -	struct kho_mem_phys *physxa;
> +	struct kho_mem_phys *physxa, *new_physxa;
>  	const unsigned long pfn_high = pfn >> order;
> 
>  	might_sleep();
> 
> -	physxa = xa_load_or_alloc(&track->orders, order, sizeof(*physxa));
> -	if (IS_ERR(physxa))
> -		return PTR_ERR(physxa);
> +	physxa = xa_load(&track->orders, order);
> +	if (!physxa) {
> +		new_physxa = kzalloc(sizeof(*physxa), GFP_KERNEL);
> +		if (!new_physxa)
> +			return -ENOMEM;
> +
> +		xa_init(&new_physxa->phys_bits);
> +		physxa = xa_cmpxchg(&track->orders, order, NULL, new_physxa,
> +				    GFP_KERNEL);
> +		if (xa_is_err(physxa)) {
> +			int err_ret = xa_err(physxa);
> +
> +			xa_destroy(&new_physxa->phys_bits);
> +			kfree(new_physxa);
> +
> +			return err_ret;
> +		}
> +		if (physxa) {
> +			xa_destroy(&new_physxa->phys_bits);
> +			kfree(new_physxa);
> +		} else {
> +			physxa = new_physxa;
> +		}
> +	}

You are nearly duplicating xa_load_or_alloc() here. 
Is xa_destroy() is really needed here? In the end we destroying an empty
xarray.

Unless xa_destroy() is a must something like this would be simpler IMHO:

diff --git a/kernel/kexec_handover.c b/kernel/kexec_handover.c
index ef21db6c59d5..4c8303fbf97a 100644
--- a/kernel/kexec_handover.c
+++ b/kernel/kexec_handover.c
@@ -91,10 +91,12 @@ struct kho_serialization {
 	struct khoser_mem_chunk *preserved_mem_map;
 };
 
-static void *xa_load_or_alloc(struct xarray *xa, unsigned long index, size_t sz)
+static void *xa_load_or_alloc(struct xarray *xa, unsigned long index, size_t sz,
+			      bool *new)
 {
 	void *elm, *res;
 
+	*new = false;
 	elm = xa_load(xa, index);
 	if (elm)
 		return elm;
@@ -112,6 +114,7 @@ static void *xa_load_or_alloc(struct xarray *xa, unsigned long index, size_t sz)
 		return res;
 	}
 
+	*new = true;
 	return elm;
 }
 
@@ -146,15 +149,18 @@ static int __kho_preserve_order(struct kho_mem_track *track, unsigned long pfn,
 	struct kho_mem_phys_bits *bits;
 	struct kho_mem_phys *physxa;
 	const unsigned long pfn_high = pfn >> order;
+	bool new;
 
 	might_sleep();
 
-	physxa = xa_load_or_alloc(&track->orders, order, sizeof(*physxa));
+	physxa = xa_load_or_alloc(&track->orders, order, sizeof(*physxa), &new);
 	if (IS_ERR(physxa))
 		return PTR_ERR(physxa);
+	if (new)
+		xa_init(&physxa->phys_bits);
 
 	bits = xa_load_or_alloc(&physxa->phys_bits, pfn_high / PRESERVE_BITS,
-				sizeof(*bits));
+				sizeof(*bits), &new);
 	if (IS_ERR(bits))
 		return PTR_ERR(bits);
 
And if xa_destroy() is actually required, the allocation of new xarray
should be a helper function.

>  	bits = xa_load_or_alloc(&physxa->phys_bits, pfn_high / PRESERVE_BITS,
>  				sizeof(*bits));
> --
> 2.49.0.1101.gccaa498523-goog

-- 
Sincerely yours,
Mike.


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/2] memblock: show a warning if allocation in KHO scratch fails
  2025-05-18 14:23 ` [PATCH 1/2] memblock: show a warning if allocation in KHO scratch fails Changyuan Lyu
@ 2025-05-18 16:07   ` Mike Rapoport
  2025-05-21  7:03     ` Changyuan Lyu
  0 siblings, 1 reply; 10+ messages in thread
From: Mike Rapoport @ 2025-05-18 16:07 UTC (permalink / raw)
  To: Changyuan Lyu
  Cc: akpm, graf, bhe, kexec, linux-kernel, linux-mm, chrisl,
	pasha.tatashin, jasonmiu

On Sun, May 18, 2025 at 07:23:14AM -0700, Changyuan Lyu wrote:
> When we kexec into a new kernel from an old kernel with KHO
> enabled, the new kernel allocates vmemmap in the scratch area.
> If the KHO scratch size is too small, vmemmap allocation would
> fail and cause kernel panic, like the following,
> 
> [    0.027133] Faking a node at [mem 0x0000000000000000-0x00000004ffffffff]
> [    0.027877] NODE_DATA(0) allocated [mem 0x4e4bd5a00-0x4e4bfffff]
> [    0.029696] sparse_init_nid: node[0] memory map backing failed. Some memory will not be available.
> [    0.029698] Zone ranges:
> [    0.030974]   DMA      [mem 0x0000000000001000-0x0000000000ffffff]
> [    0.031627]   DMA32    [mem 0x0000000001000000-0x00000000ffffffff]
> [    0.032281]   Normal   [mem 0x0000000100000000-0x00000004ffffffff]
> [    0.032930]   Device   empty
> [    0.033251] Movable zone start for each node
> [    0.033710] Early memory node ranges
> [    0.034108]   node   0: [mem 0x0000000000001000-0x000000000007ffff]
> [    0.034801]   node   0: [mem 0x0000000000100000-0x00000000773fffff]
> [    0.035461]   node   0: [mem 0x0000000077400000-0x00000000775fffff]
> [    0.036116]   node   0: [mem 0x0000000077600000-0x000000007fffffff]
> [    0.036768]   node   0: [mem 0x0000000100000000-0x00000004ccbfffff]
> [    0.037423]   node   0: [mem 0x00000004ccc00000-0x00000004e4bfffff]
> [    0.038111] BUG: kernel NULL pointer dereference, address: 0000000000000010
> [    0.038880] #PF: supervisor write access in kernel mode
> [    0.039474] #PF: error_code(0x0002) - not-present page
> [    0.040056] PGD 0 P4D 0
> [    0.040335] Oops: Oops: 0002 [#1] SMP
> [    0.040745] CPU: 0 UID: 0 PID: 0 Comm: swapper Not tainted 6.15.0-rc4+ #275 NONE
> [    0.041541] RIP: 0010:__bitmap_set+0x2b/0x80
> [    0.041992] Code: 0f 1e fa 55 48 89 e5 89 f1 89 f0 c1 e8 06 48 8d 04 c7 48 c7 c7 ff ff ff ff 48 d3 e7 41 89 f0 41 83 c8 c0 44 89 c6 01 d6 78 43 <48> 09 38 48 83 c0 08 83 fe 40 72 1a 41 8d 3c 10 83 c7 40 48 c7 00
> [    0.043986] RSP: 0000:ffffffff96203df0 EFLAGS: 00010047
> [    0.044546] RAX: 0000000000000010 RBX: 000000000000cc00 RCX: 0000000000000000
> [    0.045311] RDX: 0000000000000040 RSI: 0000000000000000 RDI: ffffffffffffffff
> [    0.046075] RBP: ffffffff96203df0 R08: 00000000ffffffc0 R09: ffffffff9626c950
> [    0.046830] R10: 000000000002fffd R11: 0000000000000004 R12: 0000000000008000
> [    0.047574] R13: 0000000000000000 R14: 000000000000003f R15: 000000000000009b
> [    0.048313] FS:  0000000000000000(0000) GS:0000000000000000(0000) knlGS:0000000000000000
> [    0.049151] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    0.049751] CR2: 0000000000000010 CR3: 00000004d123e000 CR4: 00000000000200b0
> [    0.050494] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [    0.051238] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [    0.051978] Call Trace:
> [    0.052235]  <TASK>
> [    0.052455]  subsection_map_init+0xe4/0x130
> [    0.052891]  free_area_init+0x217/0x3d0
> [    0.053290]  zone_sizes_init+0x5e/0x80
> [    0.053682]  paging_init+0x27/0x30
> [    0.054046]  setup_arch+0x307/0x3e0
> [    0.054422]  start_kernel+0x59/0x390
> [    0.054820]  x86_64_start_reservations+0x28/0x30
> [    0.055307]  x86_64_start_kernel+0x70/0x80
> [    0.055736]  common_startup_64+0x13b/0x140
> [    0.056165]  </TASK>
> [    0.056392] CR2: 0000000000000010
> [    0.056737] ---[ end trace 0000000000000000 ]---
> [    0.057218] RIP: 0010:__bitmap_set+0x2b/0x80
> [    0.057667] Code: 0f 1e fa 55 48 89 e5 89 f1 89 f0 c1 e8 06 48 8d 04 c7 48 c7 c7 ff ff ff ff 48 d3 e7 41 89 f0 41 83 c8 c0 44 89 c6 01 d6 78 43 <48> 09 38 48 83 c0 08 83 fe 40 72 1a 41 8d 3c 10 83 c7 40 48 c7 00
> [    0.059650] RSP: 0000:ffffffff96203df0 EFLAGS: 00010047
> [    0.060218] RAX: 0000000000000010 RBX: 000000000000cc00 RCX: 0000000000000000
> [    0.060985] RDX: 0000000000000040 RSI: 0000000000000000 RDI: ffffffffffffffff
> [    0.061728] RBP: ffffffff96203df0 R08: 00000000ffffffc0 R09: ffffffff9626c950
> [    0.062486] R10: 000000000002fffd R11: 0000000000000004 R12: 0000000000008000
> [    0.063228] R13: 0000000000000000 R14: 000000000000003f R15: 000000000000009b
> [    0.063968] FS:  0000000000000000(0000) GS:0000000000000000(0000) knlGS:0000000000000000
> [    0.064812] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    0.065423] CR2: 0000000000000010 CR3: 00000004d123e000 CR4: 00000000000200b0
> [    0.066175] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [    0.066926] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [    0.067678] Kernel panic - not syncing: Attempted to kill the idle task!
> [    0.068403] ---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]---
> 
> The panic above can be easily reproduced by the following steps,
> 
> 1.  boot a VM with 20GiB physical memory (or larger) and kernel command
>     line "kho=on kho_scratch=2m,256m,128m"
> 2.  echo 1 > /sys/kernel/debug/kho/out/finalize
> 3.  kexec to a new kernel

This can be reproduced without KHO, just squeeze the RAM size, boot with a huge
kernel and initrd and you'll get the same panic.

The issue is that sparse_init_nid() does not treat allocation failures as
fatal and just continues with some sections being unpopulated and then
subsection_map_init() presumes all the sections are valid.

This should be fixed in mm/sparse.c regardless of KHO, maybe as simple as 

diff --git a/mm/sparse.c b/mm/sparse.c
index 3c012cf83cc2..64d071f9f037 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -197,6 +197,10 @@ void __init subsection_map_init(unsigned long pfn, unsigned long nr_pages)
 		pfns = min(nr_pages, PAGES_PER_SECTION
 				- (pfn & ~PAGE_SECTION_MASK));
 		ms = __nr_to_section(nr);
+
+		if (!ms->section_mem_map)
+			continue;
+
 		subsection_mask_set(ms->usage->subsection_map, pfn, pfns);
 
 		pr_debug("%s: sec: %lu pfns: %lu set(%d, %d)\n", __func__, nr,
 
> The current panic log above is confusing and it's hard to find the
> root cause.
> 
> Add an error log to make it easier to debug such kind of panics.
> 
> Fixes: d59f43b57480 ("memblock: add support for scratch memory")
> Signed-off-by: Changyuan Lyu <changyuanl@google.com>
> ---
>  mm/memblock.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 154f1d73b61f..ed886bfd3de7 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -1573,6 +1573,9 @@ phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size,
>  		goto again;
>  	}
> 
> +	if (flags & MEMBLOCK_KHO_SCRATCH)
> +		pr_err_once("Could not allocate %pap bytes in KHO scratch\n", &size);
> +
>  	return 0;
> 
>  done:
> --
> 2.49.0.1101.gccaa498523-goog

-- 
Sincerely yours,
Mike.


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/2] KHO: init new_physxa->phys_bits to fix lockdep
  2025-05-18 15:51   ` Mike Rapoport
@ 2025-05-19 12:10     ` Pasha Tatashin
  0 siblings, 0 replies; 10+ messages in thread
From: Pasha Tatashin @ 2025-05-19 12:10 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Changyuan Lyu, akpm, graf, bhe, kexec, linux-kernel, linux-mm,
	chrisl, jasonmiu

On Sun, May 18, 2025 at 11:51 AM Mike Rapoport <rppt@kernel.org> wrote:
>
> On Sun, May 18, 2025 at 07:23:15AM -0700, Changyuan Lyu wrote:
> > From: Pasha Tatashin <pasha.tatashin@soleen.com>
> >
> > Lockdep shows the following warning:
> >
> > INFO: trying to register non-static key.
> > The code is fine but needs lockdep annotation, or maybe
> > you didn't initialize this object before use?
> > turning off the locking correctness validator.
> >
> > [<ffffffff810133a6>] dump_stack_lvl+0x66/0xa0
> > [<ffffffff8136012c>] assign_lock_key+0x10c/0x120
> > [<ffffffff81358bb4>] register_lock_class+0xf4/0x2f0
> > [<ffffffff813597ff>] __lock_acquire+0x7f/0x2c40
> > [<ffffffff81360cb0>] ? __pfx_hlock_conflict+0x10/0x10
> > [<ffffffff811707be>] ? native_flush_tlb_global+0x8e/0xa0
> > [<ffffffff8117096e>] ? __flush_tlb_all+0x4e/0xa0
> > [<ffffffff81172fc2>] ? __kernel_map_pages+0x112/0x140
> > [<ffffffff813ec327>] ? xa_load_or_alloc+0x67/0xe0
> > [<ffffffff81359556>] lock_acquire+0xe6/0x280
> > [<ffffffff813ec327>] ? xa_load_or_alloc+0x67/0xe0
> > [<ffffffff8100b9e0>] _raw_spin_lock+0x30/0x40
> > [<ffffffff813ec327>] ? xa_load_or_alloc+0x67/0xe0
> > [<ffffffff813ec327>] xa_load_or_alloc+0x67/0xe0
> > [<ffffffff813eb4c0>] kho_preserve_folio+0x90/0x100
> > [<ffffffff813ebb7f>] __kho_finalize+0xcf/0x400
> > [<ffffffff813ebef4>] kho_finalize+0x34/0x70
> >
> > This is becase xa has its own lock, that is not initialized in
> > xa_load_or_alloc.
> >
> > Modifiy __kho_preserve_order(), to properly call
> > xa_init(&new_physxa->phys_bits);
> >
> > Fixes: fc33e4b44b27 ("kexec: enable KHO support for memory preservation")
> > Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> > Signed-off-by: Changyuan Lyu <changyuanl@google.com>
> > ---
> >  kernel/kexec_handover.c | 29 +++++++++++++++++++++++++----
> >  1 file changed, 25 insertions(+), 4 deletions(-)
> >
> > diff --git a/kernel/kexec_handover.c b/kernel/kexec_handover.c
> > index 69b953551677..f0ac6a9170f8 100644
> > --- a/kernel/kexec_handover.c
> > +++ b/kernel/kexec_handover.c
> > @@ -144,14 +144,35 @@ static int __kho_preserve_order(struct kho_mem_track *track, unsigned long pfn,
> >                               unsigned int order)
> >  {
> >       struct kho_mem_phys_bits *bits;
> > -     struct kho_mem_phys *physxa;
> > +     struct kho_mem_phys *physxa, *new_physxa;
> >       const unsigned long pfn_high = pfn >> order;
> >
> >       might_sleep();
> >
> > -     physxa = xa_load_or_alloc(&track->orders, order, sizeof(*physxa));
> > -     if (IS_ERR(physxa))
> > -             return PTR_ERR(physxa);
> > +     physxa = xa_load(&track->orders, order);
> > +     if (!physxa) {
> > +             new_physxa = kzalloc(sizeof(*physxa), GFP_KERNEL);
> > +             if (!new_physxa)
> > +                     return -ENOMEM;
> > +
> > +             xa_init(&new_physxa->phys_bits);
> > +             physxa = xa_cmpxchg(&track->orders, order, NULL, new_physxa,
> > +                                 GFP_KERNEL);
> > +             if (xa_is_err(physxa)) {
> > +                     int err_ret = xa_err(physxa);
> > +
> > +                     xa_destroy(&new_physxa->phys_bits);
> > +                     kfree(new_physxa);
> > +
> > +                     return err_ret;
> > +             }
> > +             if (physxa) {
> > +                     xa_destroy(&new_physxa->phys_bits);
> > +                     kfree(new_physxa);
> > +             } else {
> > +                     physxa = new_physxa;
> > +             }
> > +     }
>
> You are nearly duplicating xa_load_or_alloc() here.
> Is xa_destroy() is really needed here? In the end we destroying an empty
> xarray.
>
> Unless xa_destroy() is a must something like this would be simpler IMHO:

I wanted to do proper xa_destroy(), as the whole point of this patch
is to satisfy lockdep, and do a proper xa_init(). The patch fixes a
warning in linux-next, and I think should be taken as is. We can do a
separate clean-up once the series lands, where xa_load_or_alloc()
could either take another argument, or split into two functions.

Pasha


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/2] memblock: show a warning if allocation in KHO scratch fails
  2025-05-18 16:07   ` Mike Rapoport
@ 2025-05-21  7:03     ` Changyuan Lyu
  2025-05-21  7:43       ` Mike Rapoport
  0 siblings, 1 reply; 10+ messages in thread
From: Changyuan Lyu @ 2025-05-21  7:03 UTC (permalink / raw)
  To: rppt
  Cc: akpm, bhe, changyuanl, chrisl, graf, jasonmiu, kexec,
	linux-kernel, linux-mm, pasha.tatashin

Hi Mike,

On Sun, May 18, 2025 at 19:07:02 +0300, Mike Rapoport <rppt@kernel.org> wrote:
> On Sun, May 18, 2025 at 07:23:14AM -0700, Changyuan Lyu wrote:
> > When we kexec into a new kernel from an old kernel with KHO
> > enabled, the new kernel allocates vmemmap in the scratch area.
> > If the KHO scratch size is too small, vmemmap allocation would
> > fail and cause kernel panic, like the following,
> >
> > [    0.027133] Faking a node at [mem 0x0000000000000000-0x00000004ffffffff]
> > [    0.027877] NODE_DATA(0) allocated [mem 0x4e4bd5a00-0x4e4bfffff]
> > [    0.029696] sparse_init_nid: node[0] memory map backing failed. Some memory will not be available.
> > [    0.029698] Zone ranges:
> > [    0.030974]   DMA      [mem 0x0000000000001000-0x0000000000ffffff]
> > [    0.031627]   DMA32    [mem 0x0000000001000000-0x00000000ffffffff]
> > [    0.032281]   Normal   [mem 0x0000000100000000-0x00000004ffffffff]
> > [    0.032930]   Device   empty
> > [    0.033251] Movable zone start for each node
> > [    0.033710] Early memory node ranges
> > [    0.034108]   node   0: [mem 0x0000000000001000-0x000000000007ffff]
> > [    0.034801]   node   0: [mem 0x0000000000100000-0x00000000773fffff]
> > [    0.035461]   node   0: [mem 0x0000000077400000-0x00000000775fffff]
> > [    0.036116]   node   0: [mem 0x0000000077600000-0x000000007fffffff]
> > [    0.036768]   node   0: [mem 0x0000000100000000-0x00000004ccbfffff]
> > [    0.037423]   node   0: [mem 0x00000004ccc00000-0x00000004e4bfffff]
> > [    0.038111] BUG: kernel NULL pointer dereference, address: 0000000000000010
> > [    0.038880] #PF: supervisor write access in kernel mode
> > [    0.039474] #PF: error_code(0x0002) - not-present page
> > [    0.040056] PGD 0 P4D 0
> > [    0.040335] Oops: Oops: 0002 [#1] SMP
> > [    0.040745] CPU: 0 UID: 0 PID: 0 Comm: swapper Not tainted 6.15.0-rc4+ #275 NONE
> > [    0.041541] RIP: 0010:__bitmap_set+0x2b/0x80
> > [    0.041992] Code: 0f 1e fa 55 48 89 e5 89 f1 89 f0 c1 e8 06 48 8d 04 c7 48 c7 c7 ff ff ff ff 48 d3 e7 41 89 f0 41 83 c8 c0 44 89 c6 01 d6 78 43 <48> 09 38 48 83 c0 08 83 fe 40 72 1a 41 8d 3c 10 83 c7 40 48 c7 00
> > [    0.043986] RSP: 0000:ffffffff96203df0 EFLAGS: 00010047
> > [    0.044546] RAX: 0000000000000010 RBX: 000000000000cc00 RCX: 0000000000000000
> > [    0.045311] RDX: 0000000000000040 RSI: 0000000000000000 RDI: ffffffffffffffff
> > [    0.046075] RBP: ffffffff96203df0 R08: 00000000ffffffc0 R09: ffffffff9626c950
> > [    0.046830] R10: 000000000002fffd R11: 0000000000000004 R12: 0000000000008000
> > [    0.047574] R13: 0000000000000000 R14: 000000000000003f R15: 000000000000009b
> > [    0.048313] FS:  0000000000000000(0000) GS:0000000000000000(0000) knlGS:0000000000000000
> > [    0.049151] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [    0.049751] CR2: 0000000000000010 CR3: 00000004d123e000 CR4: 00000000000200b0
> > [    0.050494] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [    0.051238] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > [    0.051978] Call Trace:
> > [    0.052235]  <TASK>
> > [    0.052455]  subsection_map_init+0xe4/0x130
> > [    0.052891]  free_area_init+0x217/0x3d0
> > [    0.053290]  zone_sizes_init+0x5e/0x80
> > [    0.053682]  paging_init+0x27/0x30
> > [    0.054046]  setup_arch+0x307/0x3e0
> > [    0.054422]  start_kernel+0x59/0x390
> > [    0.054820]  x86_64_start_reservations+0x28/0x30
> > [    0.055307]  x86_64_start_kernel+0x70/0x80
> > [    0.055736]  common_startup_64+0x13b/0x140
> > [    0.056165]  </TASK>
> > [    0.056392] CR2: 0000000000000010
> > [    0.056737] ---[ end trace 0000000000000000 ]---
> > [    0.057218] RIP: 0010:__bitmap_set+0x2b/0x80
> > [    0.057667] Code: 0f 1e fa 55 48 89 e5 89 f1 89 f0 c1 e8 06 48 8d 04 c7 48 c7 c7 ff ff ff ff 48 d3 e7 41 89 f0 41 83 c8 c0 44 89 c6 01 d6 78 43 <48> 09 38 48 83 c0 08 83 fe 40 72 1a 41 8d 3c 10 83 c7 40 48 c7 00
> > [    0.059650] RSP: 0000:ffffffff96203df0 EFLAGS: 00010047
> > [    0.060218] RAX: 0000000000000010 RBX: 000000000000cc00 RCX: 0000000000000000
> > [    0.060985] RDX: 0000000000000040 RSI: 0000000000000000 RDI: ffffffffffffffff
> > [    0.061728] RBP: ffffffff96203df0 R08: 00000000ffffffc0 R09: ffffffff9626c950
> > [    0.062486] R10: 000000000002fffd R11: 0000000000000004 R12: 0000000000008000
> > [    0.063228] R13: 0000000000000000 R14: 000000000000003f R15: 000000000000009b
> > [    0.063968] FS:  0000000000000000(0000) GS:0000000000000000(0000) knlGS:0000000000000000
> > [    0.064812] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [    0.065423] CR2: 0000000000000010 CR3: 00000004d123e000 CR4: 00000000000200b0
> > [    0.066175] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [    0.066926] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > [    0.067678] Kernel panic - not syncing: Attempted to kill the idle task!
> > [    0.068403] ---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]---
> >
> > The panic above can be easily reproduced by the following steps,
> >
> > 1.  boot a VM with 20GiB physical memory (or larger) and kernel command
> >     line "kho=on kho_scratch=2m,256m,128m"
> > 2.  echo 1 > /sys/kernel/debug/kho/out/finalize
> > 3.  kexec to a new kernel
>
> This can be reproduced without KHO, just squeeze the RAM size, boot with a huge
> kernel and initrd and you'll get the same panic.
>
> The issue is that sparse_init_nid() does not treat allocation failures as
> fatal and just continues with some sections being unpopulated and then
> subsection_map_init() presumes all the sections are valid.
>
> This should be fixed in mm/sparse.c regardless of KHO, maybe as simple as
>
> diff --git a/mm/sparse.c b/mm/sparse.c
> index 3c012cf83cc2..64d071f9f037 100644
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -197,6 +197,10 @@ void __init subsection_map_init(unsigned long pfn, unsigned long nr_pages)
>  		pfns = min(nr_pages, PAGES_PER_SECTION
>  				- (pfn & ~PAGE_SECTION_MASK));
>  		ms = __nr_to_section(nr);
> +
> +		if (!ms->section_mem_map)
> +			continue;
> +
>  		subsection_mask_set(ms->usage->subsection_map, pfn, pfns);
>
>  		pr_debug("%s: sec: %lu pfns: %lu set(%d, %d)\n", __func__, nr,

I tried your patch and the kernel log now looks like

[    0.027562] Faking a node at [mem 0x0000000000000000-0x000000057fffffff]
[    0.028338] NODE_DATA(0) allocated [mem 0x562bd5a00-0x562bfffff]
[    0.029201] Could not allocate 0x0000000014000000 bytes in KHO scratch
[    0.030229] sparse_init_nid: node[0] memory map backing failed. Some memory will not be available.
[    0.030232] Zone ranges:
[    0.031539]   DMA      [mem 0x0000000000001000-0x0000000000ffffff]
[    0.032242]   DMA32    [mem 0x0000000001000000-0x00000000ffffffff]
[    0.032952]   Normal   [mem 0x0000000100000000-0x000000057fffffff]
[    0.033658]   Device   empty
[    0.033987] Movable zone start for each node
[    0.034473] Early memory node ranges
[    0.034878]   node   0: [mem 0x0000000000001000-0x000000000007ffff]
[    0.035591]   node   0: [mem 0x0000000000100000-0x00000000773fffff]
[    0.036308]   node   0: [mem 0x0000000077400000-0x00000000775fffff]
[    0.037030]   node   0: [mem 0x0000000077600000-0x000000007fffffff]
[    0.037750]   node   0: [mem 0x0000000100000000-0x000000054abfffff]
[    0.038463]   node   0: [mem 0x000000054ac00000-0x0000000562bfffff]
[    0.039180]   node   0: [mem 0x0000000562c00000-0x000000057fffffff]
[    0.039901] Initmem setup node 0 [mem 0x0000000000001000-0x000000057fffffff]
[    0.040707] On node 0, zone DMA: 1 pages in unavailable ranges
[    0.041401] On node 0, zone DMA: 128 pages in unavailable ranges
[    0.221829] BUG: kernel NULL pointer dereference, address: 0000000000000018
[    0.222675] #PF: supervisor read access in kernel mode
[    0.223271] #PF: error_code(0x0000) - not-present page
[    0.223859] PGD 0 P4D 0
[    0.224152] Oops: Oops: 0000 [#1] SMP
[    0.224575] CPU: 0 UID: 0 PID: 0 Comm: swapper Not tainted 6.15.0-rc4+ #279 NONE
[    0.225439] RIP: 0010:set_pageblock_migratetype+0x97/0xd0
[    0.226069] Code: b6 c9 c1 e1 04 48 01 c8 eb 02 31 c0 48 8b 70 08 89 f9 c1 e9 07 c1 ef 0d 83 e7 03 80 e1 3c 41 b8 07 00 00 00 49 d3 e0 48 d3 e2 <48> 8b 44 fe 18 49 f7 d0 48 89 c1 4c 21 c1 48 09 d1 f0 48 0f b1 4c
[    0.228231] RSP: 0000:ffffffffa4203d58 EFLAGS: 00010046
[    0.228834] RAX: ffff8e4722bd13b0 RBX: 0000000000000000 RCX: 0000000000009b00
[    0.229664] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000000
[    0.230487] RBP: ffffffffa4203d58 R08: 0000000000000007 R09: 0000000000000000
[    0.231303] R10: ffffffffa4edc610 R11: 000000000000000c R12: 000000000054ac00
[    0.232119] R13: 0017fff000000000 R14: 0000000000000002 R15: 00000000004d8000
[    0.232937] FS:  0000000000000000(0000) GS:0000000000000000(0000) knlGS:0000000000000000
[    0.233868] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.234529] CR2: 0000000000000018 CR3: 000000055923e000 CR4: 00000000000200b0
[    0.235351] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    0.236171] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    0.236990] Call Trace:
[    0.237272]  <TASK>
[    0.237514]  memmap_init_range+0x1d8/0x210
[    0.237983]  memmap_init_zone_range+0x7f/0xb0
[    0.238488]  memmap_init+0x9a/0x120
[    0.238892]  free_area_init+0x369/0x3d0
[    0.239331]  zone_sizes_init+0x5e/0x80
[    0.239765]  paging_init+0x27/0x30
[    0.240153]  setup_arch+0x307/0x3e0
[    0.240556]  start_kernel+0x59/0x390
[    0.240968]  x86_64_start_reservations+0x28/0x30
[    0.241493]  x86_64_start_kernel+0x70/0x80
[    0.241962]  common_startup_64+0x13b/0x140
[    0.242433]  </TASK>
[    0.242682] CR2: 0000000000000018
[    0.243064] ---[ end trace 0000000000000000 ]---

It seems we are just defering the panic from subsection_map_init() to
memmap_init(). To me it is still not obvious that the failure was
caused by samll KHO scratch.

I think the error log in my original patch still makes sense since it
indicates potential panics early.

> > The current panic log above is confusing and it's hard to find the
> > root cause.
> >
> > Add an error log to make it easier to debug such kind of panics.
> >
> > Fixes: d59f43b57480 ("memblock: add support for scratch memory")
> > Signed-off-by: Changyuan Lyu <changyuanl@google.com>
> > ---
> >  mm/memblock.c | 3 +++
> >  1 file changed, 3 insertions(+)
> >
> > diff --git a/mm/memblock.c b/mm/memblock.c
> > index 154f1d73b61f..ed886bfd3de7 100644
> > --- a/mm/memblock.c
> > +++ b/mm/memblock.c
> > @@ -1573,6 +1573,9 @@ phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size,
> >  		goto again;
> >  	}
> >
> > +	if (flags & MEMBLOCK_KHO_SCRATCH)
> > +		pr_err_once("Could not allocate %pap bytes in KHO scratch\n", &size);
> > +
> >  	return 0;
> >
> >  done:
> > --
> > 2.49.0.1101.gccaa498523-goog
>
> --
> Sincerely yours,
> Mike.

Best,
Changyuan


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/2] memblock: show a warning if allocation in KHO scratch fails
  2025-05-21  7:03     ` Changyuan Lyu
@ 2025-05-21  7:43       ` Mike Rapoport
  2025-05-21  8:48         ` Oscar Salvador
  0 siblings, 1 reply; 10+ messages in thread
From: Mike Rapoport @ 2025-05-21  7:43 UTC (permalink / raw)
  To: Changyuan Lyu
  Cc: akpm, bhe, chrisl, graf, jasonmiu, kexec, linux-kernel, linux-mm,
	pasha.tatashin

Hi Changyuan,

On Wed, May 21, 2025 at 12:03:10AM -0700, Changyuan Lyu wrote:
> Hi Mike,
> 
> On Sun, May 18, 2025 at 19:07:02 +0300, Mike Rapoport <rppt@kernel.org> wrote:
> >
> > This can be reproduced without KHO, just squeeze the RAM size, boot with a huge
> > kernel and initrd and you'll get the same panic.
> >
> > The issue is that sparse_init_nid() does not treat allocation failures as
> > fatal and just continues with some sections being unpopulated and then
> > subsection_map_init() presumes all the sections are valid.
> >
> > This should be fixed in mm/sparse.c regardless of KHO, maybe as simple as
> >
> > diff --git a/mm/sparse.c b/mm/sparse.c
> > index 3c012cf83cc2..64d071f9f037 100644
> > --- a/mm/sparse.c
> > +++ b/mm/sparse.c
> > @@ -197,6 +197,10 @@ void __init subsection_map_init(unsigned long pfn, unsigned long nr_pages)
> >  		pfns = min(nr_pages, PAGES_PER_SECTION
> >  				- (pfn & ~PAGE_SECTION_MASK));
> >  		ms = __nr_to_section(nr);
> > +
> > +		if (!ms->section_mem_map)
> > +			continue;
> > +
> >  		subsection_mask_set(ms->usage->subsection_map, pfn, pfns);
> >
> >  		pr_debug("%s: sec: %lu pfns: %lu set(%d, %d)\n", __func__, nr,
> 
> I tried your patch and the kernel log now looks like
> 
> [    0.027562] Faking a node at [mem 0x0000000000000000-0x000000057fffffff]
> [    0.028338] NODE_DATA(0) allocated [mem 0x562bd5a00-0x562bfffff]
> [    0.029201] Could not allocate 0x0000000014000000 bytes in KHO scratch
> [    0.030229] sparse_init_nid: node[0] memory map backing failed. Some memory will not be available.
> [    0.030232] Zone ranges:
> [    0.031539]   DMA      [mem 0x0000000000001000-0x0000000000ffffff]
> [    0.032242]   DMA32    [mem 0x0000000001000000-0x00000000ffffffff]
> [    0.032952]   Normal   [mem 0x0000000100000000-0x000000057fffffff]
> [    0.033658]   Device   empty
> [    0.033987] Movable zone start for each node
> [    0.034473] Early memory node ranges
> [    0.034878]   node   0: [mem 0x0000000000001000-0x000000000007ffff]
> [    0.035591]   node   0: [mem 0x0000000000100000-0x00000000773fffff]
> [    0.036308]   node   0: [mem 0x0000000077400000-0x00000000775fffff]
> [    0.037030]   node   0: [mem 0x0000000077600000-0x000000007fffffff]
> [    0.037750]   node   0: [mem 0x0000000100000000-0x000000054abfffff]
> [    0.038463]   node   0: [mem 0x000000054ac00000-0x0000000562bfffff]
> [    0.039180]   node   0: [mem 0x0000000562c00000-0x000000057fffffff]
> [    0.039901] Initmem setup node 0 [mem 0x0000000000001000-0x000000057fffffff]
> [    0.040707] On node 0, zone DMA: 1 pages in unavailable ranges
> [    0.041401] On node 0, zone DMA: 128 pages in unavailable ranges
> [    0.221829] BUG: kernel NULL pointer dereference, address: 0000000000000018
> [    0.222675] #PF: supervisor read access in kernel mode
> [    0.223271] #PF: error_code(0x0000) - not-present page
> [    0.223859] PGD 0 P4D 0
> [    0.224152] Oops: Oops: 0000 [#1] SMP
> [    0.224575] CPU: 0 UID: 0 PID: 0 Comm: swapper Not tainted 6.15.0-rc4+ #279 NONE
> [    0.225439] RIP: 0010:set_pageblock_migratetype+0x97/0xd0
> [    0.226069] Code: b6 c9 c1 e1 04 48 01 c8 eb 02 31 c0 48 8b 70 08 89 f9 c1 e9 07 c1 ef 0d 83 e7 03 80 e1 3c 41 b8 07 00 00 00 49 d3 e0 48 d3 e2 <48> 8b 44 fe 18 49 f7 d0 48 89 c1 4c 21 c1 48 09 d1 f0 48 0f b1 4c
> [    0.228231] RSP: 0000:ffffffffa4203d58 EFLAGS: 00010046
> [    0.228834] RAX: ffff8e4722bd13b0 RBX: 0000000000000000 RCX: 0000000000009b00
> [    0.229664] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000000
> [    0.230487] RBP: ffffffffa4203d58 R08: 0000000000000007 R09: 0000000000000000
> [    0.231303] R10: ffffffffa4edc610 R11: 000000000000000c R12: 000000000054ac00
> [    0.232119] R13: 0017fff000000000 R14: 0000000000000002 R15: 00000000004d8000
> [    0.232937] FS:  0000000000000000(0000) GS:0000000000000000(0000) knlGS:0000000000000000
> [    0.233868] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    0.234529] CR2: 0000000000000018 CR3: 000000055923e000 CR4: 00000000000200b0
> [    0.235351] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [    0.236171] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [    0.236990] Call Trace:
> [    0.237272]  <TASK>
> [    0.237514]  memmap_init_range+0x1d8/0x210
> [    0.237983]  memmap_init_zone_range+0x7f/0xb0
> [    0.238488]  memmap_init+0x9a/0x120
> [    0.238892]  free_area_init+0x369/0x3d0
> [    0.239331]  zone_sizes_init+0x5e/0x80
> [    0.239765]  paging_init+0x27/0x30
> [    0.240153]  setup_arch+0x307/0x3e0
> [    0.240556]  start_kernel+0x59/0x390
> [    0.240968]  x86_64_start_reservations+0x28/0x30
> [    0.241493]  x86_64_start_kernel+0x70/0x80
> [    0.241962]  common_startup_64+0x13b/0x140
> [    0.242433]  </TASK>
> [    0.242682] CR2: 0000000000000018
> [    0.243064] ---[ end trace 0000000000000000 ]---
> 
> It seems we are just defering the panic from subsection_map_init() to
> memmap_init(). To me it is still not obvious that the failure was
> caused by samll KHO scratch.

Small KHO scratch only exposes the issue that from one side
sparse_init_nid() does not treat OOM condition as fatal and tries to
continue with hardly noticeable error message but from the other side, we
presume that all section data was properly allocated and access it.
 
> I think the error log in my original patch still makes sense since it
> indicates potential panics early.

This will add another barely noticeable message at the same place
sparse_init_nid() reports an error.
I don't see how it will be better.

I think we should just make sparse_init_nid() panic or at least change 
"sparse_init_nid: node[0] memory map backing failed. Some memory will not be available."
to something more visible and clear. 

> Best,
> Changyuan

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/2] memblock: show a warning if allocation in KHO scratch fails
  2025-05-21  7:43       ` Mike Rapoport
@ 2025-05-21  8:48         ` Oscar Salvador
  2025-05-21 15:27           ` Mike Rapoport
  0 siblings, 1 reply; 10+ messages in thread
From: Oscar Salvador @ 2025-05-21  8:48 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Changyuan Lyu, akpm, bhe, chrisl, graf, jasonmiu, kexec,
	linux-kernel, linux-mm, pasha.tatashin

On Wed, May 21, 2025 at 10:43:15AM +0300, Mike Rapoport wrote:
> I think we should just make sparse_init_nid() panic or at least change 
> "sparse_init_nid: node[0] memory map backing failed. Some memory will not be available."
> to something more visible and clear. 

Panicking the system seems a bit too harsh.
Those sections will not be initialized, and sure you will lose some memory,
but still.

I think that making sure that subsection_map_init() does not access
non-initialized values is enough.
Because wrt. error message, I am not sure it can get more clear that we
failed we allocate memory to back the section and so that section will
not be activated :-)

-- 
Oscar Salvador
SUSE Labs

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/2] memblock: show a warning if allocation in KHO scratch fails
  2025-05-21  8:48         ` Oscar Salvador
@ 2025-05-21 15:27           ` Mike Rapoport
  0 siblings, 0 replies; 10+ messages in thread
From: Mike Rapoport @ 2025-05-21 15:27 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Changyuan Lyu, akpm, bhe, chrisl, graf, jasonmiu, kexec,
	linux-kernel, linux-mm, pasha.tatashin

On Wed, May 21, 2025 at 10:48:55AM +0200, Oscar Salvador wrote:
> On Wed, May 21, 2025 at 10:43:15AM +0300, Mike Rapoport wrote:
> > I think we should just make sparse_init_nid() panic or at least change 
> > "sparse_init_nid: node[0] memory map backing failed. Some memory will not be available."
> > to something more visible and clear. 
> 
> Panicking the system seems a bit too harsh.
> Those sections will not be initialized, and sure you will lose some memory,
> but still.
> 
> I think that making sure that subsection_map_init() does not access
> non-initialized values is enough.

It's not only subsection_map_init(), next failing one is
memmap_init_range() and maybe there's more, but we can audit and fix them.
I believe all those accesses are at init time because after system is
booted we are careful to avoid accessing absent sections.

> Because wrt. error message, I am not sure it can get more clear that we
> failed we allocate memory to back the section and so that section will
> not be activated :-)

Add a dump_stack()? ;-)

> -- 
> Oscar Salvador
> SUSE Labs
> 

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2025-05-21 15:28 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-05-18 14:23 [PATCH 0/2] KHO Fixes Changyuan Lyu
2025-05-18 14:23 ` [PATCH 1/2] memblock: show a warning if allocation in KHO scratch fails Changyuan Lyu
2025-05-18 16:07   ` Mike Rapoport
2025-05-21  7:03     ` Changyuan Lyu
2025-05-21  7:43       ` Mike Rapoport
2025-05-21  8:48         ` Oscar Salvador
2025-05-21 15:27           ` Mike Rapoport
2025-05-18 14:23 ` [PATCH 2/2] KHO: init new_physxa->phys_bits to fix lockdep Changyuan Lyu
2025-05-18 15:51   ` Mike Rapoport
2025-05-19 12:10     ` Pasha Tatashin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).