public inbox for linux-mm@kvack.org
 help / color / mirror / Atom feed
* [PATCH v6 0/2] kho: add support for deferred struct page init
@ 2026-03-11 12:55 Michal Clapinski
  2026-03-11 12:55 ` [PATCH v6 1/2] kho: fix deferred init of kho scratch Michal Clapinski
  2026-03-11 12:55 ` [PATCH v6 2/2] kho: make preserved pages compatible with deferred struct page init Michal Clapinski
  0 siblings, 2 replies; 5+ messages in thread
From: Michal Clapinski @ 2026-03-11 12:55 UTC (permalink / raw)
  To: Evangelos Petrongonas, Pasha Tatashin, Mike Rapoport,
	Pratyush Yadav, Alexander Graf, Samiullah Khawaja, kexec,
	linux-mm
  Cc: linux-kernel, Andrew Morton, Michal Clapinski

When CONFIG_DEFERRED_STRUCT_PAGE_INIT (hereinafter DEFERRED) is
enabled, struct page initialization is deferred to parallel kthreads
that run later in the boot process.

Currently, KHO is incompatible with DEFERRED.
This series fixes that incompatibility.
---
v6:
- reimplemented the initialization of kho scratch
v5:
- rebased
v4:
- added a new commit to fix deferred init of kho scratch
- switched to ulong when referring to pfn
v3:
- changed commit msg
- don't invoke early_pfn_to_nid if CONFIG_DEFERRED_STRUCT_PAGE_INIT=n
v2:
- updated a comment

I took Evangelos's test code:
https://git.infradead.org/?p=users/vpetrog/linux.git;a=shortlog;h=refs/heads/kho-deferred-struct-page-init
and then modified it into this monster test that does 2 allocations:
at core_initcall (early) and at module_init (late). Then kexec, then
2 more allocations at those points, then restore the original 2, then
kexec, then restore the other 2. Basically, I test preservation of early
and late allocations on both cold and warm boot.
I tested it both with and without DEFERRED.

Evangelos Petrongonas (1):
  kho: make preserved pages compatible with deferred struct page init

Michal Clapinski (1):
  kho: fix deferred init of kho scratch

 include/linux/kexec_handover.h     |  6 +++
 include/linux/memblock.h           |  2 -
 kernel/liveupdate/Kconfig          |  2 -
 kernel/liveupdate/kexec_handover.c | 62 +++++++++++++++++++++++++++++-
 mm/memblock.c                      | 22 -----------
 mm/mm_init.c                       | 17 +++++---
 6 files changed, 78 insertions(+), 33 deletions(-)

-- 
2.53.0.473.g4a7958ca14-goog



^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH v6 1/2] kho: fix deferred init of kho scratch
  2026-03-11 12:55 [PATCH v6 0/2] kho: add support for deferred struct page init Michal Clapinski
@ 2026-03-11 12:55 ` Michal Clapinski
  2026-03-12 12:50   ` Mike Rapoport
  2026-03-13 13:58   ` Pratyush Yadav
  2026-03-11 12:55 ` [PATCH v6 2/2] kho: make preserved pages compatible with deferred struct page init Michal Clapinski
  1 sibling, 2 replies; 5+ messages in thread
From: Michal Clapinski @ 2026-03-11 12:55 UTC (permalink / raw)
  To: Evangelos Petrongonas, Pasha Tatashin, Mike Rapoport,
	Pratyush Yadav, Alexander Graf, Samiullah Khawaja, kexec,
	linux-mm
  Cc: linux-kernel, Andrew Morton, Michal Clapinski

Currently, if DEFERRED is enabled, kho_release_scratch will initialize
the struct pages and set the migratetype of kho scratch. Unless the whole
scratch fits below first_deferred_pfn, some of that will be overwritten
either by deferred_init_pages or memmap_init_reserved_pages.

To fix it, I initialize kho scratch early and modify every other
path to leave the scratch alone.

In detail:
1. Modify deferred_init_memmap_chunk to not initialize kho
scratch, since we already did that. Then, modify deferred_free_pages
to not set the migratetype. Also modify reserve_bootmem_region to skip
initializing kho scratch.

2. Since kho scratch is now not initialized by any other code, we have
to initialize it ourselves also on cold boot. On cold boot memblock
doesn't mark scratch as scratch, so we also have to modify the
initialization function to not use memblock regions.

Signed-off-by: Michal Clapinski <mclapinski@google.com>
---
My previous idea of marking scratch as CMA late, after deferred struct
page init was done, was bad since allocations can be made before that
and if they land in kho scratch, they become unpreservable.
Such was the case with iommu page tables.
---
 include/linux/kexec_handover.h     |  6 +++++
 include/linux/memblock.h           |  2 --
 kernel/liveupdate/kexec_handover.c | 35 +++++++++++++++++++++++++++++-
 mm/memblock.c                      | 22 -------------------
 mm/mm_init.c                       | 17 ++++++++++-----
 5 files changed, 52 insertions(+), 30 deletions(-)

diff --git a/include/linux/kexec_handover.h b/include/linux/kexec_handover.h
index ac4129d1d741..612a6da6127a 100644
--- a/include/linux/kexec_handover.h
+++ b/include/linux/kexec_handover.h
@@ -35,6 +35,7 @@ void *kho_restore_vmalloc(const struct kho_vmalloc *preservation);
 int kho_add_subtree(const char *name, void *fdt);
 void kho_remove_subtree(void *fdt);
 int kho_retrieve_subtree(const char *name, phys_addr_t *phys);
+bool pfn_is_kho_scratch(unsigned long pfn);
 
 void kho_memory_init(void);
 
@@ -109,6 +110,11 @@ static inline int kho_retrieve_subtree(const char *name, phys_addr_t *phys)
 	return -EOPNOTSUPP;
 }
 
+static inline bool pfn_is_kho_scratch(unsigned long pfn)
+{
+	return false;
+}
+
 static inline void kho_memory_init(void) { }
 
 static inline void kho_populate(phys_addr_t fdt_phys, u64 fdt_len,
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 6ec5e9ac0699..3e217414e12d 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -614,11 +614,9 @@ static inline void memtest_report_meminfo(struct seq_file *m) { }
 #ifdef CONFIG_MEMBLOCK_KHO_SCRATCH
 void memblock_set_kho_scratch_only(void);
 void memblock_clear_kho_scratch_only(void);
-void memmap_init_kho_scratch_pages(void);
 #else
 static inline void memblock_set_kho_scratch_only(void) { }
 static inline void memblock_clear_kho_scratch_only(void) { }
-static inline void memmap_init_kho_scratch_pages(void) {}
 #endif
 
 #endif /* _LINUX_MEMBLOCK_H */
diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c
index 532f455c5d4f..09cb6660ade7 100644
--- a/kernel/liveupdate/kexec_handover.c
+++ b/kernel/liveupdate/kexec_handover.c
@@ -1327,6 +1327,23 @@ int kho_retrieve_subtree(const char *name, phys_addr_t *phys)
 }
 EXPORT_SYMBOL_GPL(kho_retrieve_subtree);
 
+bool pfn_is_kho_scratch(unsigned long pfn)
+{
+	unsigned int i;
+	phys_addr_t scratch_start, scratch_end, phys = __pfn_to_phys(pfn);
+
+	for (i = 0; i < kho_scratch_cnt; i++) {
+		scratch_start = kho_scratch[i].addr;
+		scratch_end = kho_scratch[i].addr + kho_scratch[i].size;
+
+		if (scratch_start <= phys && phys < scratch_end)
+			return true;
+	}
+
+	return false;
+}
+EXPORT_SYMBOL_GPL(pfn_is_kho_scratch);
+
 static int __init kho_mem_retrieve(const void *fdt)
 {
 	struct kho_radix_tree tree;
@@ -1453,12 +1470,27 @@ static __init int kho_init(void)
 }
 fs_initcall(kho_init);
 
+static void __init kho_init_scratch_pages(void)
+{
+	if (!IS_ENABLED(CONFIG_DEFERRED_STRUCT_PAGE_INIT))
+		return;
+
+	for (int i = 0; i < kho_scratch_cnt; i++) {
+		unsigned long pfn = PFN_DOWN(kho_scratch[i].addr);
+		unsigned long end_pfn = PFN_UP(kho_scratch[i].addr + kho_scratch[i].size);
+		int nid = early_pfn_to_nid(pfn);
+
+		for (; pfn < end_pfn; pfn++)
+			init_deferred_page(pfn, nid);
+	}
+}
+
 static void __init kho_release_scratch(void)
 {
 	phys_addr_t start, end;
 	u64 i;
 
-	memmap_init_kho_scratch_pages();
+	kho_init_scratch_pages();
 
 	/*
 	 * Mark scratch mem as CMA before we return it. That way we
@@ -1487,6 +1519,7 @@ void __init kho_memory_init(void)
 			kho_in.fdt_phys = 0;
 	} else {
 		kho_reserve_scratch();
+		kho_init_scratch_pages();
 	}
 }
 
diff --git a/mm/memblock.c b/mm/memblock.c
index b3ddfdec7a80..ae6a5af46bd7 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -959,28 +959,6 @@ __init void memblock_clear_kho_scratch_only(void)
 {
 	kho_scratch_only = false;
 }
-
-__init void memmap_init_kho_scratch_pages(void)
-{
-	phys_addr_t start, end;
-	unsigned long pfn;
-	int nid;
-	u64 i;
-
-	if (!IS_ENABLED(CONFIG_DEFERRED_STRUCT_PAGE_INIT))
-		return;
-
-	/*
-	 * Initialize struct pages for free scratch memory.
-	 * The struct pages for reserved scratch memory will be set up in
-	 * reserve_bootmem_region()
-	 */
-	__for_each_mem_range(i, &memblock.memory, NULL, NUMA_NO_NODE,
-			     MEMBLOCK_KHO_SCRATCH, &start, &end, &nid) {
-		for (pfn = PFN_UP(start); pfn < PFN_DOWN(end); pfn++)
-			init_deferred_page(pfn, nid);
-	}
-}
 #endif
 
 /**
diff --git a/mm/mm_init.c b/mm/mm_init.c
index cec7bb758bdd..969048f9b320 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -798,7 +798,8 @@ void __meminit reserve_bootmem_region(phys_addr_t start,
 	for_each_valid_pfn(pfn, PFN_DOWN(start), PFN_UP(end)) {
 		struct page *page = pfn_to_page(pfn);
 
-		__init_deferred_page(pfn, nid);
+		if (!pfn_is_kho_scratch(pfn))
+			__init_deferred_page(pfn, nid);
 
 		/*
 		 * no need for atomic set_bit because the struct
@@ -2008,9 +2009,12 @@ static void __init deferred_free_pages(unsigned long pfn,
 
 	/* Free a large naturally-aligned chunk if possible */
 	if (nr_pages == MAX_ORDER_NR_PAGES && IS_MAX_ORDER_ALIGNED(pfn)) {
-		for (i = 0; i < nr_pages; i += pageblock_nr_pages)
+		for (i = 0; i < nr_pages; i += pageblock_nr_pages) {
+			if (pfn_is_kho_scratch(page_to_pfn(page + i)))
+				continue;
 			init_pageblock_migratetype(page + i, MIGRATE_MOVABLE,
 					false);
+		}
 		__free_pages_core(page, MAX_PAGE_ORDER, MEMINIT_EARLY);
 		return;
 	}
@@ -2019,7 +2023,7 @@ static void __init deferred_free_pages(unsigned long pfn,
 	accept_memory(PFN_PHYS(pfn), nr_pages * PAGE_SIZE);
 
 	for (i = 0; i < nr_pages; i++, page++, pfn++) {
-		if (pageblock_aligned(pfn))
+		if (pageblock_aligned(pfn) && !pfn_is_kho_scratch(pfn))
 			init_pageblock_migratetype(page, MIGRATE_MOVABLE,
 					false);
 		__free_pages_core(page, 0, MEMINIT_EARLY);
@@ -2090,9 +2094,11 @@ deferred_init_memmap_chunk(unsigned long start_pfn, unsigned long end_pfn,
 			unsigned long mo_pfn = ALIGN(spfn + 1, MAX_ORDER_NR_PAGES);
 			unsigned long chunk_end = min(mo_pfn, epfn);
 
-			nr_pages += deferred_init_pages(zone, spfn, chunk_end);
-			deferred_free_pages(spfn, chunk_end - spfn);
+			// KHO scratch is MAX_ORDER_NR_PAGES aligned.
+			if (!pfn_is_kho_scratch(spfn))
+				deferred_init_pages(zone, spfn, chunk_end);
 
+			deferred_free_pages(spfn, chunk_end - spfn);
 			spfn = chunk_end;
 
 			if (can_resched)
@@ -2100,6 +2106,7 @@ deferred_init_memmap_chunk(unsigned long start_pfn, unsigned long end_pfn,
 			else
 				touch_nmi_watchdog();
 		}
+		nr_pages += epfn - spfn;
 	}
 
 	return nr_pages;
-- 
2.53.0.473.g4a7958ca14-goog




* [PATCH v6 2/2] kho: make preserved pages compatible with deferred struct page init
  2026-03-11 12:55 [PATCH v6 0/2] kho: add support for deferred struct page init Michal Clapinski
  2026-03-11 12:55 ` [PATCH v6 1/2] kho: fix deferred init of kho scratch Michal Clapinski
@ 2026-03-11 12:55 ` Michal Clapinski
  1 sibling, 0 replies; 5+ messages in thread
From: Michal Clapinski @ 2026-03-11 12:55 UTC (permalink / raw)
  To: Evangelos Petrongonas, Pasha Tatashin, Mike Rapoport,
	Pratyush Yadav, Alexander Graf, Samiullah Khawaja, kexec,
	linux-mm
  Cc: linux-kernel, Andrew Morton, Michal Clapinski

From: Evangelos Petrongonas <epetron@amazon.de>

When CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, struct page
initialization is deferred to parallel kthreads that run later
in the boot process.

During KHO restoration, kho_preserved_memory_reserve() writes metadata
for each preserved memory region. However, if the struct page has not
been initialized, this write targets uninitialized memory, potentially
leading to errors like:
BUG: unable to handle page fault for address: ...

Fix this by introducing kho_get_preserved_page(), which ensures
all struct pages in a preserved region are initialized by calling
init_deferred_page(); that call is a no-op when the struct page is
already initialized.

Signed-off-by: Evangelos Petrongonas <epetron@amazon.de>
Co-developed-by: Michal Clapinski <mclapinski@google.com>
Signed-off-by: Michal Clapinski <mclapinski@google.com>
Reviewed-by: Pratyush Yadav (Google) <pratyush@kernel.org>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
I think we can't initialize those struct pages in kho_restore_page.
I encountered this stack:
  page_zone(start_page)
  __pageblock_pfn_to_page
  set_zone_contiguous
  page_alloc_init_late

So, at the end of page_alloc_init_late struct pages are expected to be
already initialized. set_zone_contiguous() looks at the first and last
struct page of each pageblock in each populated zone to figure out if
the zone is contiguous. If a kho page lands on a pageblock boundary,
this will lead to access of an uninitialized struct page.
There is also page_ext_init that invokes pfn_to_nid, which calls
page_to_nid for each section-aligned page.
There might be other places that do something similar. Therefore, it's
a good idea to initialize all struct pages by the end of deferred
struct page init. That's why I'm resending Evangelos's patch.

I also tried to implement Pratyush's idea, i.e. iterate over zones,
then get node from zone. I didn't notice any performance difference
even with 8GB of kho.
---
 kernel/liveupdate/Kconfig          |  2 --
 kernel/liveupdate/kexec_handover.c | 27 ++++++++++++++++++++++++++-
 2 files changed, 26 insertions(+), 3 deletions(-)

diff --git a/kernel/liveupdate/Kconfig b/kernel/liveupdate/Kconfig
index 1a8513f16ef7..c13af38ba23a 100644
--- a/kernel/liveupdate/Kconfig
+++ b/kernel/liveupdate/Kconfig
@@ -1,12 +1,10 @@
 # SPDX-License-Identifier: GPL-2.0-only
 
 menu "Live Update and Kexec HandOver"
-	depends on !DEFERRED_STRUCT_PAGE_INIT
 
 config KEXEC_HANDOVER
 	bool "kexec handover"
 	depends on ARCH_SUPPORTS_KEXEC_HANDOVER && ARCH_SUPPORTS_KEXEC_FILE
-	depends on !DEFERRED_STRUCT_PAGE_INIT
 	select MEMBLOCK_KHO_SCRATCH
 	select KEXEC_FILE
 	select LIBFDT
diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c
index 09cb6660ade7..1f9707d11e5f 100644
--- a/kernel/liveupdate/kexec_handover.c
+++ b/kernel/liveupdate/kexec_handover.c
@@ -471,6 +471,31 @@ struct page *kho_restore_pages(phys_addr_t phys, unsigned long nr_pages)
 }
 EXPORT_SYMBOL_GPL(kho_restore_pages);
 
+/*
+ * With CONFIG_DEFERRED_STRUCT_PAGE_INIT, struct pages in higher memory regions
+ * may not be initialized yet at the time KHO deserializes preserved memory.
+ * KHO uses the struct page to store metadata and a later initialization would
+ * overwrite it.
+ * Ensure all the struct pages in the preservation are
+ * initialized. kho_preserved_memory_reserve() marks the reservation as noinit
+ * to make sure they don't get re-initialized later.
+ */
+static struct page *__init kho_get_preserved_page(phys_addr_t phys,
+						  unsigned int order)
+{
+	unsigned long pfn = PHYS_PFN(phys);
+	int nid;
+
+	if (!IS_ENABLED(CONFIG_DEFERRED_STRUCT_PAGE_INIT))
+		return pfn_to_page(pfn);
+
+	nid = early_pfn_to_nid(pfn);
+	for (unsigned long i = 0; i < (1UL << order); i++)
+		init_deferred_page(pfn + i, nid);
+
+	return pfn_to_page(pfn);
+}
+
 static int __init kho_preserved_memory_reserve(phys_addr_t phys,
 					       unsigned int order)
 {
@@ -479,7 +504,7 @@ static int __init kho_preserved_memory_reserve(phys_addr_t phys,
 	u64 sz;
 
 	sz = 1 << (order + PAGE_SHIFT);
-	page = phys_to_page(phys);
+	page = kho_get_preserved_page(phys, order);
 
 	/* Reserve the memory preserved in KHO in memblock */
 	memblock_reserve(phys, sz);
-- 
2.53.0.473.g4a7958ca14-goog




* Re: [PATCH v6 1/2] kho: fix deferred init of kho scratch
  2026-03-11 12:55 ` [PATCH v6 1/2] kho: fix deferred init of kho scratch Michal Clapinski
@ 2026-03-12 12:50   ` Mike Rapoport
  2026-03-13 13:58   ` Pratyush Yadav
  1 sibling, 0 replies; 5+ messages in thread
From: Mike Rapoport @ 2026-03-12 12:50 UTC (permalink / raw)
  To: Michal Clapinski
  Cc: Evangelos Petrongonas, Pasha Tatashin, Pratyush Yadav,
	Alexander Graf, Samiullah Khawaja, kexec, linux-mm, linux-kernel,
	Andrew Morton

On Wed, Mar 11, 2026 at 01:55:38PM +0100, Michal Clapinski wrote:
> Currently, if DEFERRED is enabled, kho_release_scratch will initialize
> the struct pages and set migratetype of kho scratch. Unless the whole
> scratch fit below first_deferred_pfn, some of that will be overwritten
> either by deferred_init_pages or memmap_init_reserved_pages.
> 
> To fix it, I initialize kho scratch early and modify every other
> path to leave the scratch alone.
> 
> In detail:
> 1. Modify deferred_init_memmap_chunk to not initialize kho
> scratch, since we already did that. Then, modify deferred_free_pages
> to not set the migratetype. Also modify reserve_bootmem_region to skip
> initializing kho scratch.
> 
> 2. Since kho scratch is now not initialized by any other code, we have
> to initialize it ourselves also on cold boot. On cold boot memblock
> doesn't mark scratch as scratch, so we also have to modify the
> initialization function to not use memblock regions.
> 
> Signed-off-by: Michal Clapinski <mclapinski@google.com>
> ---
> My previous idea of marking scratch as CMA late, after deferred struct
> page init was done, was bad since allocations can be made before that
> and if they land in kho scratch, they become unpreservable.
> Such was the case with iommu page tables.
> ---
>  include/linux/kexec_handover.h     |  6 +++++
>  include/linux/memblock.h           |  2 --
>  kernel/liveupdate/kexec_handover.c | 35 +++++++++++++++++++++++++++++-
>  mm/memblock.c                      | 22 -------------------
>  mm/mm_init.c                       | 17 ++++++++++-----
>  5 files changed, 52 insertions(+), 30 deletions(-)
> 
> diff --git a/include/linux/kexec_handover.h b/include/linux/kexec_handover.h
> index ac4129d1d741..612a6da6127a 100644
> --- a/include/linux/kexec_handover.h
> +++ b/include/linux/kexec_handover.h
> @@ -35,6 +35,7 @@ void *kho_restore_vmalloc(const struct kho_vmalloc *preservation);
>  int kho_add_subtree(const char *name, void *fdt);
>  void kho_remove_subtree(void *fdt);
>  int kho_retrieve_subtree(const char *name, phys_addr_t *phys);
> +bool pfn_is_kho_scratch(unsigned long pfn);

I think we can rely on MEMBLOCK_KHO_SCRATCH and query ranges rather than
individual pfns.

This will also eliminate the need to special case scratch memory map
initialization on cold boot. 

>  void kho_memory_init(void);
>  
> @@ -109,6 +110,11 @@ static inline int kho_retrieve_subtree(const char *name, phys_addr_t *phys)
>  	return -EOPNOTSUPP;
>  }
>  
> +static inline bool pfn_is_kho_scratch(unsigned long pfn)
> +{
> +	return false;
> +}
> +
>  static inline void kho_memory_init(void) { }
>  
>  static inline void kho_populate(phys_addr_t fdt_phys, u64 fdt_len,
> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> index 6ec5e9ac0699..3e217414e12d 100644
> --- a/include/linux/memblock.h
> +++ b/include/linux/memblock.h
> @@ -614,11 +614,9 @@ static inline void memtest_report_meminfo(struct seq_file *m) { }
>  #ifdef CONFIG_MEMBLOCK_KHO_SCRATCH
>  void memblock_set_kho_scratch_only(void);
>  void memblock_clear_kho_scratch_only(void);
> -void memmap_init_kho_scratch_pages(void);
>  #else
>  static inline void memblock_set_kho_scratch_only(void) { }
>  static inline void memblock_clear_kho_scratch_only(void) { }
> -static inline void memmap_init_kho_scratch_pages(void) {}
>  #endif
>  
>  #endif /* _LINUX_MEMBLOCK_H */
> diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c
> index 532f455c5d4f..09cb6660ade7 100644
> --- a/kernel/liveupdate/kexec_handover.c
> +++ b/kernel/liveupdate/kexec_handover.c
> @@ -1327,6 +1327,23 @@ int kho_retrieve_subtree(const char *name, phys_addr_t *phys)
>  }
>  EXPORT_SYMBOL_GPL(kho_retrieve_subtree);
>  
> +bool pfn_is_kho_scratch(unsigned long pfn)
> +{
> +	unsigned int i;
> +	phys_addr_t scratch_start, scratch_end, phys = __pfn_to_phys(pfn);
> +
> +	for (i = 0; i < kho_scratch_cnt; i++) {
> +		scratch_start = kho_scratch[i].addr;
> +		scratch_end = kho_scratch[i].addr + kho_scratch[i].size;
> +
> +		if (scratch_start <= phys && phys < scratch_end)
> +			return true;
> +	}
> +
> +	return false;
> +}
> +EXPORT_SYMBOL_GPL(pfn_is_kho_scratch);
> +
>  static int __init kho_mem_retrieve(const void *fdt)
>  {
>  	struct kho_radix_tree tree;
> @@ -1453,12 +1470,27 @@ static __init int kho_init(void)
>  }
>  fs_initcall(kho_init);
>  
> +static void __init kho_init_scratch_pages(void)
> +{
> +	if (!IS_ENABLED(CONFIG_DEFERRED_STRUCT_PAGE_INIT))
> +		return;
> +
> +	for (int i = 0; i < kho_scratch_cnt; i++) {
> +		unsigned long pfn = PFN_DOWN(kho_scratch[i].addr);
> +		unsigned long end_pfn = PFN_UP(kho_scratch[i].addr + kho_scratch[i].size);
> +		int nid = early_pfn_to_nid(pfn);
> +
> +		for (; pfn < end_pfn; pfn++)
> +			init_deferred_page(pfn, nid);
> +	}
> +}
> +
>  static void __init kho_release_scratch(void)
>  {
>  	phys_addr_t start, end;
>  	u64 i;
>  
> -	memmap_init_kho_scratch_pages();
> +	kho_init_scratch_pages();

This should not be required if deferred init would check if a region is
MEMBLOCK_KHO_SCRATCH rather than pfn_is_kho_scratch().

>  	/*
>  	 * Mark scratch mem as CMA before we return it. That way we
> @@ -1487,6 +1519,7 @@ void __init kho_memory_init(void)
>  			kho_in.fdt_phys = 0;
>  	} else {
>  		kho_reserve_scratch();
> +		kho_init_scratch_pages();
>  	}
>  }
>  
> diff --git a/mm/memblock.c b/mm/memblock.c
> index b3ddfdec7a80..ae6a5af46bd7 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -959,28 +959,6 @@ __init void memblock_clear_kho_scratch_only(void)
>  {
>  	kho_scratch_only = false;
>  }
> -
> -__init void memmap_init_kho_scratch_pages(void)
> -{
> -	phys_addr_t start, end;
> -	unsigned long pfn;
> -	int nid;
> -	u64 i;
> -
> -	if (!IS_ENABLED(CONFIG_DEFERRED_STRUCT_PAGE_INIT))
> -		return;
> -
> -	/*
> -	 * Initialize struct pages for free scratch memory.
> -	 * The struct pages for reserved scratch memory will be set up in
> -	 * reserve_bootmem_region()
> -	 */
> -	__for_each_mem_range(i, &memblock.memory, NULL, NUMA_NO_NODE,
> -			     MEMBLOCK_KHO_SCRATCH, &start, &end, &nid) {
> -		for (pfn = PFN_UP(start); pfn < PFN_DOWN(end); pfn++)
> -			init_deferred_page(pfn, nid);
> -	}
> -}
>  #endif
>  
>  /**
> diff --git a/mm/mm_init.c b/mm/mm_init.c
> index cec7bb758bdd..969048f9b320 100644
> --- a/mm/mm_init.c
> +++ b/mm/mm_init.c
> @@ -798,7 +798,8 @@ void __meminit reserve_bootmem_region(phys_addr_t start,
>  	for_each_valid_pfn(pfn, PFN_DOWN(start), PFN_UP(end)) {
>  		struct page *page = pfn_to_page(pfn);
>  
> -		__init_deferred_page(pfn, nid);
> +		if (!pfn_is_kho_scratch(pfn))
> +			__init_deferred_page(pfn, nid);

A bit unrelated, we can move reserve_bootmem_region() to memblock and make
it static.

As for skipping the initialization, I think that
memmap_init_reserved_pages() should check if the region to reserve is in
scratch and, if so, make reserve_bootmem_region() skip struct page
initialization.
I believe everything that is MEMBLOCK_RSRV_KERNEL would be in scratch and
all reserved memory in scratch would be MEMBLOCK_RSRV_KERNEL, but it's
better to double check it.

Another somewhat related thing is that __init_page_from_nid() shouldn't
mess with pageblock migrate types, but only call __init_single_page().
It's up to __init_page_from_nid()'s caller to decide what migrate type to
use, and the caller should set it explicitly.

>  
>  		/*
>  		 * no need for atomic set_bit because the struct
> @@ -2008,9 +2009,12 @@ static void __init deferred_free_pages(unsigned long pfn,
>  
>  	/* Free a large naturally-aligned chunk if possible */
>  	if (nr_pages == MAX_ORDER_NR_PAGES && IS_MAX_ORDER_ALIGNED(pfn)) {
> -		for (i = 0; i < nr_pages; i += pageblock_nr_pages)
> +		for (i = 0; i < nr_pages; i += pageblock_nr_pages) {
> +			if (pfn_is_kho_scratch(page_to_pfn(page + i)))
> +				continue;
>  			init_pageblock_migratetype(page + i, MIGRATE_MOVABLE,
>  					false);

We can move init_pageblock_migratetype() here and below to
deferred_init_pages() and ...

> +		}
>  		__free_pages_core(page, MAX_PAGE_ORDER, MEMINIT_EARLY);
>  		return;
>  	}
> @@ -2019,7 +2023,7 @@ static void __init deferred_free_pages(unsigned long pfn,
>  	accept_memory(PFN_PHYS(pfn), nr_pages * PAGE_SIZE);
>  
>  	for (i = 0; i < nr_pages; i++, page++, pfn++) {
> -		if (pageblock_aligned(pfn))
> +		if (pageblock_aligned(pfn) && !pfn_is_kho_scratch(pfn))
>  			init_pageblock_migratetype(page, MIGRATE_MOVABLE,
>  					false);
>  		__free_pages_core(page, 0, MEMINIT_EARLY);
> @@ -2090,9 +2094,11 @@ deferred_init_memmap_chunk(unsigned long start_pfn, unsigned long end_pfn,
>  			unsigned long mo_pfn = ALIGN(spfn + 1, MAX_ORDER_NR_PAGES);
>  			unsigned long chunk_end = min(mo_pfn, epfn);
>  
> -			nr_pages += deferred_init_pages(zone, spfn, chunk_end);
> -			deferred_free_pages(spfn, chunk_end - spfn);
> +			// KHO scratch is MAX_ORDER_NR_PAGES aligned.
> +			if (!pfn_is_kho_scratch(spfn))
> +				deferred_init_pages(zone, spfn, chunk_end);

skip the entire MEMBLOCK_KHO_SCRATCH regions here and only call
deferred_free_pages() for them.

Since the outer loop already walks regions in memblock.memory it shouldn't
be hard to query memblock_region flags from the iterator, or just replace
the simplified iterator with __for_each_mem_range().
  
> +			deferred_free_pages(spfn, chunk_end - spfn);
>  			spfn = chunk_end;
>  
>  			if (can_resched)
> @@ -2100,6 +2106,7 @@ deferred_init_memmap_chunk(unsigned long start_pfn, unsigned long end_pfn,
>  			else
>  				touch_nmi_watchdog();
>  		}
> +		nr_pages += epfn - spfn;
>  	}
>  
>  	return nr_pages;
> -- 
> 2.53.0.473.g4a7958ca14-goog
> 

-- 
Sincerely yours,
Mike.



* Re: [PATCH v6 1/2] kho: fix deferred init of kho scratch
  2026-03-11 12:55 ` [PATCH v6 1/2] kho: fix deferred init of kho scratch Michal Clapinski
  2026-03-12 12:50   ` Mike Rapoport
@ 2026-03-13 13:58   ` Pratyush Yadav
  1 sibling, 0 replies; 5+ messages in thread
From: Pratyush Yadav @ 2026-03-13 13:58 UTC (permalink / raw)
  To: Michal Clapinski
  Cc: Evangelos Petrongonas, Pasha Tatashin, Mike Rapoport,
	Pratyush Yadav, Alexander Graf, Samiullah Khawaja, kexec,
	linux-mm, linux-kernel, Andrew Morton

Hi Michal,

On Wed, Mar 11 2026, Michal Clapinski wrote:

> Currently, if DEFERRED is enabled, kho_release_scratch will initialize
> the struct pages and set migratetype of kho scratch. Unless the whole
> scratch fit below first_deferred_pfn, some of that will be overwritten
> either by deferred_init_pages or memmap_init_reserved_pages.
>
> To fix it, I initialize kho scratch early and modify every other
> path to leave the scratch alone.
>
> In detail:
> 1. Modify deferred_init_memmap_chunk to not initialize kho
> scratch, since we already did that. Then, modify deferred_free_pages
> to not set the migratetype. Also modify reserve_bootmem_region to skip
> initializing kho scratch.
>
> 2. Since kho scratch is now not initialized by any other code, we have
> to initialize it ourselves also on cold boot. On cold boot memblock
> doesn't mark scratch as scratch, so we also have to modify the
> initialization function to not use memblock regions.
>
> Signed-off-by: Michal Clapinski <mclapinski@google.com>

I haven't had the time to carefully review this yet, but sharing some
high level thoughts.

IIUC the real problem isn't struct page initialization, but the fact
that if the page is at a pageblock boundary its migrate type is not
correctly set to MIGRATE_CMA. So you fix the problem by making sure no
one else but KHO can initialize the scratch pages.

I think the end result makes the already complicated page initialization
sequence even more complicated. I tried to grok that patch and it makes
my brain hurt.

Can we get away with something simpler? Here's an idea: keep the struct
page init the same as it is now, just modify
init_pageblock_migratetype() to override the migrate type if page lands
in scratch. It already does something similar with MIGRATE_PCPTYPES:

	if (unlikely(page_group_by_mobility_disabled &&
		     migratetype < MIGRATE_PCPTYPES))
		migratetype = MIGRATE_UNMOVABLE;

So we can also add:

	/*
	 * Scratch pages are always MIGRATE_CMA since they can't contain
	 * unmovable allocations.
	 */
	if (unlikely(pfn_is_kho_scratch(page_to_pfn(page))))
		migratetype = MIGRATE_CMA;

Do you think this will work? If yes, then I think it is a lot nicer than
what this patch is doing.

Also, pfn_is_kho_scratch() is pretty much a duplicate of
kho_scratch_overlap(). Please pull kho_scratch_overlap() out of
kexec_handover_debug.c and use that instead.

> ---
> My previous idea of marking scratch as CMA late, after deferred struct
> page init was done, was bad since allocations can be made before that
> and if they land in kho scratch, they become unpreservable.
> Such was the case with iommu page tables.
[...]

-- 
Regards,
Pratyush Yadav



end of thread, other threads:[~2026-03-13 13:58 UTC | newest]