linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [RFC v1 0/4] Make KHO Stateless
@ 2025-09-17  2:50 Jason Miu
  2025-09-17  2:50 ` [RFC v1 1/4] kho: Introduce KHO page table data structures Jason Miu
                   ` (5 more replies)
  0 siblings, 6 replies; 19+ messages in thread
From: Jason Miu @ 2025-09-17  2:50 UTC (permalink / raw)
  To: Alexander Graf, Andrew Morton, Baoquan He, Changyuan Lyu,
	David Matlack, David Rientjes, Jason Gunthorpe, Jason Miu,
	Joel Granados, Marcos Paulo de Souza, Mario Limonciello,
	Mike Rapoport, Pasha Tatashin, Petr Mladek, Rafael J . Wysocki,
	Steven Chen, Yan Zhao, kexec, linux-kernel, linux-mm

This series transitions KHO from an xarray-based metadata tracking
system with serialization to using page table like data structures
that can be passed directly to the next kernel.

The key motivations for this change are to:
- Eliminate the need for data serialization before kexec.
- Remove the former KHO state machine by deprecating the finalize
  and abort states.
- Pass preservation metadata more directly to the next kernel via the FDT.

The new approach uses a per-order page table structure (kho_order_table,
kho_page_table, kho_bitmap_table) to mark preserved pages. The physical
address of the root `kho_order_table` is passed in the FDT, allowing the
next kernel to reconstruct the preserved memory map.

The series includes the following changes:
1.  Introduce the KHO page table data structures.
2.  Adopt the KHO page tables, remove the xarray-based tracking and
    the serialization/finalization code.
3.  Update memblock to use direct KHO API calls, and adjust KHO FDT
    completion timing.
4.  Remove the KHO notifier system infrastructure.
        

Jason Miu (4):
  kho: Introduce KHO page table data structures
  kho: Adopt KHO page tables and remove serialization
  memblock: Remove KHO notifier usage
  kho: Remove notifier system infrastructure

 include/linux/kexec_handover.h |  44 +-
 kernel/kexec_core.c            |   4 +
 kernel/kexec_handover.c        | 821 ++++++++++++++++-----------------
 kernel/kexec_internal.h        |   2 +
 mm/memblock.c                  |  46 +-
 5 files changed, 404 insertions(+), 513 deletions(-)

-- 
2.51.0.384.g4c02a37b29-goog



^ permalink raw reply	[flat|nested] 19+ messages in thread

* [RFC v1 1/4] kho: Introduce KHO page table data structures
  2025-09-17  2:50 [RFC v1 0/4] Make KHO Stateless Jason Miu
@ 2025-09-17  2:50 ` Jason Miu
  2025-09-17 12:21   ` Jason Gunthorpe
  2025-09-17  2:50 ` [RFC v1 2/4] kho: Adopt KHO page tables and remove serialization Jason Miu
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 19+ messages in thread
From: Jason Miu @ 2025-09-17  2:50 UTC (permalink / raw)
  To: Alexander Graf, Andrew Morton, Baoquan He, Changyuan Lyu,
	David Matlack, David Rientjes, Jason Gunthorpe, Jason Miu,
	Joel Granados, Marcos Paulo de Souza, Mario Limonciello,
	Mike Rapoport, Pasha Tatashin, Petr Mladek, Rafael J . Wysocki,
	Steven Chen, Yan Zhao, kexec, linux-kernel, linux-mm

Introduce a page-table-like data structure for tracking preserved
memory pages, which will replace the current xarray-based
implementation.

The primary motivation for this change is to eliminate the need for
serialization. By marking preserved pages directly in these new tables
and passing them to the next kernel, the entire serialization process
can be removed. This ultimately allows for the removal of the KHO
finalize and abort states, simplifying the overall design.

The new KHO page table is a hierarchical structure that maps physical
addresses to preservation metadata. It begins with a root
`kho_order_table` that contains an entry for each page order. Each
entry points to a separate, multi-level tree of `kho_page_table`s that
splits a physical address into indices. The traversal terminates at a
`kho_bitmap_table`, where each bit represents a single preserved page.

This commit adds the core data structures for this hierarchy:
  - kho_order_table: The root table, indexed by page order.
  - kho_page_table: Intermediate-level tables.
  - kho_bitmap_table: The lowest-level table where individual pages
are marked.

The new functions are not yet integrated with the public
`kho_preserve_*` APIs and are marked `__maybe_unused`. The full
integration and the removal of the old xarray code will follow in a
subsequent commit.

Signed-off-by: Jason Miu <jasonmiu@google.com>
---
 kernel/kexec_handover.c | 344 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 344 insertions(+)

diff --git a/kernel/kexec_handover.c b/kernel/kexec_handover.c
index ecd1ac210dbd..0daed51c8fb7 100644
--- a/kernel/kexec_handover.c
+++ b/kernel/kexec_handover.c
@@ -46,6 +46,350 @@ static int __init kho_parse_enable(char *p)
 }
 early_param("kho", kho_parse_enable);
 
+/*
+ * KHO page tables provide a page-table-like data structure for tracking
+ * preserved memory pages. It is a hierarchical structure that starts with a
+ * `struct kho_order_table`. Each entry in this table points to the root of a
+ * `struct kho_page_table` tree, which tracks the preserved memory pages for a
+ * specific page order.
+ *
+ * Each entry in a `struct kho_page_table` points to the next level page table,
+ * until level 2, which points to a `struct kho_bitmap_table`. The lowest level
+ * (level 1) is a bitmap table where each bit represents a preserved page.
+ *
+ * The table hierarchy is shown as below.
+ *
+ * kho_order_table
+ * +-------------------------------+--------------------+
+ * | 0 order| 1 order| 2 order ... | HUGETLB_PAGE_ORDER |
+ * ++------------------------------+--------------------+
+ *  |
+ *  |
+ *  v
+ * ++------+
+ * |  Lv6  | kho_page_table
+ * ++------+
+ *  |
+ *  |
+ *  |   +-------+
+ *  +-> |  Lv5  | kho_page_table
+ *      ++------+
+ *       |
+ *       |
+ *       |   +-------+
+ *       +-> |  Lv4  | kho_page_table
+ *           ++------+
+ *            |
+ *            |
+ *            |   +-------+
+ *            +-> |  Lv3  | kho_page_table
+ *                ++------+
+ *                 |
+ *                 |
+ *                 |  +-------+
+ *                 +> |  Lv2  | kho_page_table
+ *                    ++------+
+ *                     |
+ *                     |
+ *                     |   +-------+
+ *                     +-> |  Lv1  | kho_bitmap_table
+ *                         +-------+
+ *
+ * The depth of the KHO page tables depends on the system's page size and the
+ * page order. Both larger page sizes and higher page orders result in
+ * shallower KHO page tables. For example, on a system with a 4KB native
+ * page size, 0-order tables have a depth of 6 levels.
+ *
+ * The following diagram illustrates how a physical address is split into
+ * indices for the different KHO page table levels and the final bitmap.
+ *
+ *      63      62:54    53:45    44:36    35:27        26:0
+ * +--------+--------+--------+--------+--------+-----------------+
+ * |  Lv 6  |  Lv 5  |  Lv 4  |  Lv 3  |  Lv 2  |  Lv 1 (bitmap)  |
+ * +--------+--------+--------+--------+--------+-----------------+
+ *
+ * For higher order pages, the bit fields for each level shift to the left by
+ * the page order.
+ *
+ * Each KHO page table and bitmap table is PAGE_SIZE in size. For 0-order
+ * pages, the bitmap table contains (PAGE_SIZE * 8) bits, covering a
+ * (PAGE_SIZE * 8 * PAGE_SIZE) memory range. For example, on a system with a
+ * 4KB native page size, the bitmap table contains 32768 bits and covers a
+ * 128MB memory range.
+ *
+ * Each KHO page table contains (PAGE_SIZE / 8) entries, where each entry is a
+ * descriptor (a physical address) pointing to the next level table.
+ * For example, with a 4KB page size, each page table holds 512 entries.
+ * The level 2 KHO page table is an exception, where each entry points to a
+ * KHO bitmap table instead.
+ *
+ * An entry of a KHO page table of a 4KB page system is shown as below as an
+ * example.
+ *
+ *         63:12                       11:0
+ * +------------------------------+--------------+
+ * | descriptor to next table     |    zeros     |
+ * +------------------------------+--------------+
+ */
+
+#define BITMAP_TABLE_SHIFT(_order) (PAGE_SHIFT + PAGE_SHIFT + 3 + (_order))
+#define BITMAP_TABLE_MASK(_order) ((1ULL << BITMAP_TABLE_SHIFT(_order)) - 1)
+#define PRESERVED_PAGE_OFFSET_SHIFT(_order) (PAGE_SHIFT + (_order))
+#define PAGE_TABLE_SHIFT_PER_LEVEL (ilog2(PAGE_SIZE / sizeof(unsigned long)))
+#define PAGE_TABLE_LEVEL_MASK ((1ULL << PAGE_TABLE_SHIFT_PER_LEVEL) - 1)
+#define PTR_PER_LEVEL (PAGE_SIZE / sizeof(unsigned long))
+
+typedef int (*kho_walk_callback_t)(phys_addr_t pa, int order);
+
+struct kho_bitmap_table {
+	unsigned long bitmaps[PAGE_SIZE / sizeof(unsigned long)];
+};
+
+struct kho_page_table {
+	unsigned long tables[PTR_PER_LEVEL];
+};
+
+struct kho_order_table {
+	unsigned long orders[HUGETLB_PAGE_ORDER + 1];
+};
+
+/*
+ * `kho_order_table` points to a page that serves as the root of the KHO page
+ * table hierarchy. This page is allocated during KHO module initialization.
+ * Its physical address is written to the FDT and passed to the next kernel
+ * during kexec.
+ */
+static struct kho_order_table *kho_order_table;
+
+static unsigned long kho_page_table_level_shift(int level, int order)
+{
+	/*
+	 * Calculate the cumulative bit shift required to extract the page table
+	 * index for a given physical address at a specific `level` and `order`.
+	 *
+	 * - Level 1 is the bitmap table, which has its own indexing logic, so
+	 *   the shift is 0.
+	 * - Level 2 and above: The base shift is `BITMAP_TABLE_SHIFT(order)`,
+	 *   which corresponds to the entire address space covered by a single
+	 *   level 1 bitmap table.
+	 * - Each subsequent level adds `PAGE_TABLE_SHIFT_PER_LEVEL` to the
+	 *   total shift amount.
+	 */
+	return level <= 1 ? 0 :
+		BITMAP_TABLE_SHIFT(order) + PAGE_TABLE_SHIFT_PER_LEVEL * (level - 2);
+}
+
+static int kho_get_bitmap_table_index(unsigned long pa, int order)
+{
+	/* 4KB (12bits of addr) + 8B per entries (6bits of addr) + order bits */
+	unsigned long idx = pa >> (PAGE_SHIFT + 6 + order);
+
+	return idx;
+}
+
+static int kho_get_page_table_index(unsigned long pa, int order, int level)
+{
+	unsigned long high_addr;
+	unsigned long page_table_offset;
+	unsigned long shift;
+
+	if (level == 1)
+		return kho_get_bitmap_table_index(pa, order);
+
+	shift = kho_page_table_level_shift(level, order);
+	high_addr = pa >> shift;
+
+	page_table_offset = high_addr & PAGE_TABLE_LEVEL_MASK;
+	return page_table_offset;
+}
+
+static int kho_table_level(int order)
+{
+	unsigned long bits_to_resolve;
+	int page_table_num;
+
+	/* We just need 1 bitmap table to cover all addresses */
+	if (BITMAP_TABLE_SHIFT(order) >= 64)
+		return 1;
+
+	bits_to_resolve = 64 - BITMAP_TABLE_SHIFT(order);
+
+	/*
+	 * The level we need is the bits to resolve over the bits a page tabel
+	 * can resolve. Get the ceiling as ceil(a/b) = (a + b - 1) / b.
+	 * Total level is the all table levels plus the buttom
+	 * bitmap level.
+	 */
+	page_table_num = (bits_to_resolve + PAGE_TABLE_SHIFT_PER_LEVEL - 1)
+		/ PAGE_TABLE_SHIFT_PER_LEVEL;
+	return page_table_num + 1;
+}
+
+static struct kho_page_table *kho_alloc_page_table(void)
+{
+	return (struct kho_page_table *)get_zeroed_page(GFP_KERNEL);
+}
+
+static void kho_set_preserved_page_bit(struct kho_bitmap_table *bitmap_table,
+				       unsigned long pa, int order)
+{
+	int bitmap_table_index = kho_get_bitmap_table_index(pa, order);
+	int offset;
+
+	/* Get the bit offset in a 64bits bitmap entry */
+	offset = (pa >> PRESERVED_PAGE_OFFSET_SHIFT(order)) & 0x3f;
+
+	set_bit(offset,
+		(unsigned long *)&bitmap_table->bitmaps[bitmap_table_index]);
+}
+
+static unsigned long kho_pgt_desc(struct kho_page_table *va)
+{
+	return (unsigned long)virt_to_phys(va);
+}
+
+static struct kho_page_table *kho_page_table(unsigned long desc)
+{
+	return (struct kho_page_table *)phys_to_virt(desc);
+}
+
+static int __kho_preserve_page_table(unsigned long pa, int order)
+{
+	int num_table_level = kho_table_level(order);
+	struct kho_page_table *cur;
+	struct kho_page_table *next;
+	struct kho_bitmap_table *bitmap_table;
+	int i, page_table_index;
+	unsigned long page_table_desc;
+
+	if (!kho_order_table->orders[order]) {
+		cur = kho_alloc_page_table();
+		if (!cur)
+			return -ENOMEM;
+		page_table_desc = kho_pgt_desc(cur);
+		kho_order_table->orders[order] = page_table_desc;
+	}
+
+	cur = kho_page_table(kho_order_table->orders[order]);
+
+	/* Go from high level tables to low level tables */
+	for (i = num_table_level; i > 1; i--) {
+		page_table_index = kho_get_page_table_index(pa, order, i);
+
+		if (!cur->tables[page_table_index]) {
+			next = kho_alloc_page_table();
+			if (!next)
+				return -ENOMEM;
+			cur->tables[page_table_index] = kho_pgt_desc(next);
+		} else {
+			next = kho_page_table(cur->tables[page_table_index]);
+		}
+
+		cur = next;
+	}
+
+	/* Cur is now pointing to the level 1 bitmap table */
+	bitmap_table = (struct kho_bitmap_table *)cur;
+	kho_set_preserved_page_bit(bitmap_table,
+				   pa & BITMAP_TABLE_MASK(order),
+				   order);
+
+	return 0;
+}
+
+/*
+ * TODO: __maybe_unused is added to the functions:
+ * kho_preserve_page_table()
+ * kho_walk_tables()
+ * kho_memblock_reserve()
+ * since they are not actually being called in this change.
+ * __maybe_unused will be removed in the next patch.
+ */
+static __maybe_unused int kho_preserve_page_table(unsigned long pfn, int order)
+{
+	unsigned long pa = PFN_PHYS(pfn);
+
+	might_sleep();
+
+	return __kho_preserve_page_table(pa, order);
+}
+
+static int __kho_walk_bitmap_table(int order,
+				   struct kho_bitmap_table *bitmap_table,
+				   unsigned long pa,
+				   kho_walk_callback_t cb)
+{
+	int i;
+	unsigned long offset;
+	int ret = 0;
+	int order_factor = 1 << order;
+	unsigned long *bitmap = (unsigned long *)bitmap_table;
+
+	for_each_set_bit(i, bitmap, PAGE_SIZE * BITS_PER_BYTE) {
+		offset = (unsigned long)PAGE_SIZE * order_factor * i;
+		ret = cb(offset + pa, order);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
+
+static int __kho_walk_page_tables(int order, int level,
+				  struct kho_page_table *cur, unsigned long pa,
+				  kho_walk_callback_t cb)
+{
+	struct kho_page_table *next;
+	struct kho_bitmap_table *bitmap_table;
+	int i;
+	unsigned long offset;
+	int ret = 0;
+
+	if (level == 1) {
+		bitmap_table = (struct kho_bitmap_table *)cur;
+		return __kho_walk_bitmap_table(order, bitmap_table, pa, cb);
+	}
+
+	for (i = 0; i < PTR_PER_LEVEL; i++) {
+		if (cur->tables[i]) {
+			next = kho_page_table(cur->tables[i]);
+			offset = i;
+			offset <<= kho_page_table_level_shift(level, order);
+			ret = __kho_walk_page_tables(order, level - 1,
+						     next, offset + pa, cb);
+			if (ret < 0)
+				return ret;
+		}
+	}
+
+	return 0;
+}
+
+static __maybe_unused int kho_walk_page_tables(struct kho_page_table *top, int order,
+					       kho_walk_callback_t cb)
+{
+	int num_table_level;
+
+	if (top) {
+		num_table_level = kho_table_level(order);
+		return __kho_walk_page_tables(order, num_table_level, top, 0, cb);
+	}
+
+	return 0;
+}
+
+static __maybe_unused int kho_memblock_reserve(phys_addr_t pa, int order)
+{
+	int sz = 1 << (order + PAGE_SHIFT);
+	struct page *page = phys_to_page(pa);
+
+	memblock_reserve(pa, sz);
+	memblock_reserved_mark_noinit(pa, sz);
+	page->private = order;
+
+	return 0;
+}
+
 /*
  * Keep track of memory that is to be preserved across KHO.
  *
-- 
2.51.0.384.g4c02a37b29-goog



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [RFC v1 2/4] kho: Adopt KHO page tables and remove serialization
  2025-09-17  2:50 [RFC v1 0/4] Make KHO Stateless Jason Miu
  2025-09-17  2:50 ` [RFC v1 1/4] kho: Introduce KHO page table data structures Jason Miu
@ 2025-09-17  2:50 ` Jason Miu
  2025-09-17 17:52   ` Mike Rapoport
  2025-09-17  2:50 ` [RFC v1 3/4] memblock: Remove KHO notifier usage Jason Miu
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 19+ messages in thread
From: Jason Miu @ 2025-09-17  2:50 UTC (permalink / raw)
  To: Alexander Graf, Andrew Morton, Baoquan He, Changyuan Lyu,
	David Matlack, David Rientjes, Jason Gunthorpe, Jason Miu,
	Joel Granados, Marcos Paulo de Souza, Mario Limonciello,
	Mike Rapoport, Pasha Tatashin, Petr Mladek, Rafael J . Wysocki,
	Steven Chen, Yan Zhao, kexec, linux-kernel, linux-mm

Transition the KHO system to use the new page table data structures
for managing preserved memory, replacing the previous xarray-based
approach. Remove the serialization process and the associated
finalization and abort logic.

Update the methods for marking memory to be preserved to use the KHO
page table hierarchy. Remove the former system of tracking preserved
pages using an xarray-based structure.

Change the method of passing preserved memory information to the next
kernel to be direct. Instead of serializing the memory map, place the
physical address of the `kho_order_table`, which holds the roots of
the KHO page tables for each order, in the FDT. Remove the explicit
`kho_finalize()` and `kho_abort()` functions and the logic supporting
the finalize and abort states, as they are no longer needed. This
simplifies the KHO lifecycle.

Enable the next kernel's initialization process to read the
`kho_order_table` address from the FDT. The kernel will then traverse
the KHO page table structures to discover all preserved memory
regions, reserving them to prevent early boot-time allocators from
overwriting them.

This architectural shift to using a shared page table structure
simplifies the KHO design and eliminates the overhead of serializing
and deserializing the preserved memory map.

Signed-off-by: Jason Miu <jasonmiu@google.com>
---
 include/linux/kexec_handover.h |  17 --
 kernel/kexec_handover.c        | 532 +++++----------------------------
 2 files changed, 71 insertions(+), 478 deletions(-)

diff --git a/include/linux/kexec_handover.h b/include/linux/kexec_handover.h
index 348844cffb13..c8229cb11f4b 100644
--- a/include/linux/kexec_handover.h
+++ b/include/linux/kexec_handover.h
@@ -19,23 +19,6 @@ enum kho_event {
 struct folio;
 struct notifier_block;
 
-#define DECLARE_KHOSER_PTR(name, type) \
-	union {                        \
-		phys_addr_t phys;      \
-		type ptr;              \
-	} name
-#define KHOSER_STORE_PTR(dest, val)               \
-	({                                        \
-		typeof(val) v = val;              \
-		typecheck(typeof((dest).ptr), v); \
-		(dest).phys = virt_to_phys(v);    \
-	})
-#define KHOSER_LOAD_PTR(src)                                                 \
-	({                                                                   \
-		typeof(src) s = src;                                         \
-		(typeof((s).ptr))((s).phys ? phys_to_virt((s).phys) : NULL); \
-	})
-
 struct kho_serialization;
 
 #ifdef CONFIG_KEXEC_HANDOVER
diff --git a/kernel/kexec_handover.c b/kernel/kexec_handover.c
index 0daed51c8fb7..578d1c1b9cea 100644
--- a/kernel/kexec_handover.c
+++ b/kernel/kexec_handover.c
@@ -29,7 +29,7 @@
 #include "kexec_internal.h"
 
 #define KHO_FDT_COMPATIBLE "kho-v1"
-#define PROP_PRESERVED_MEMORY_MAP "preserved-memory-map"
+#define PROP_PRESERVED_ORDER_TABLE "preserved-order-table"
 #define PROP_SUB_FDT "fdt"
 
 static bool kho_enable __ro_after_init;
@@ -297,15 +297,7 @@ static int __kho_preserve_page_table(unsigned long pa, int order)
 	return 0;
 }
 
-/*
- * TODO: __maybe_unused is added to the functions:
- * kho_preserve_page_table()
- * kho_walk_tables()
- * kho_memblock_reserve()
- * since they are not actually being called in this change.
- * __maybe_unused will be removed in the next patch.
- */
-static __maybe_unused int kho_preserve_page_table(unsigned long pfn, int order)
+static int kho_preserve_page_table(unsigned long pfn, int order)
 {
 	unsigned long pa = PFN_PHYS(pfn);
 
@@ -365,8 +357,8 @@ static int __kho_walk_page_tables(int order, int level,
 	return 0;
 }
 
-static __maybe_unused int kho_walk_page_tables(struct kho_page_table *top, int order,
-					       kho_walk_callback_t cb)
+static int kho_walk_page_tables(struct kho_page_table *top, int order,
+				kho_walk_callback_t cb)
 {
 	int num_table_level;
 
@@ -378,7 +370,7 @@ static __maybe_unused int kho_walk_page_tables(struct kho_page_table *top, int o
 	return 0;
 }
 
-static __maybe_unused int kho_memblock_reserve(phys_addr_t pa, int order)
+static int kho_memblock_reserve(phys_addr_t pa, int order)
 {
 	int sz = 1 << (order + PAGE_SHIFT);
 	struct page *page = phys_to_page(pa);
@@ -390,143 +382,12 @@ static __maybe_unused int kho_memblock_reserve(phys_addr_t pa, int order)
 	return 0;
 }
 
-/*
- * Keep track of memory that is to be preserved across KHO.
- *
- * The serializing side uses two levels of xarrays to manage chunks of per-order
- * 512 byte bitmaps. For instance if PAGE_SIZE = 4096, the entire 1G order of a
- * 1TB system would fit inside a single 512 byte bitmap. For order 0 allocations
- * each bitmap will cover 16M of address space. Thus, for 16G of memory at most
- * 512K of bitmap memory will be needed for order 0.
- *
- * This approach is fully incremental, as the serialization progresses folios
- * can continue be aggregated to the tracker. The final step, immediately prior
- * to kexec would serialize the xarray information into a linked list for the
- * successor kernel to parse.
- */
-
-#define PRESERVE_BITS (512 * 8)
-
-struct kho_mem_phys_bits {
-	DECLARE_BITMAP(preserve, PRESERVE_BITS);
-};
-
-struct kho_mem_phys {
-	/*
-	 * Points to kho_mem_phys_bits, a sparse bitmap array. Each bit is sized
-	 * to order.
-	 */
-	struct xarray phys_bits;
-};
-
-struct kho_mem_track {
-	/* Points to kho_mem_phys, each order gets its own bitmap tree */
-	struct xarray orders;
-};
-
-struct khoser_mem_chunk;
-
 struct kho_serialization {
 	struct page *fdt;
 	struct list_head fdt_list;
 	struct dentry *sub_fdt_dir;
-	struct kho_mem_track track;
-	/* First chunk of serialized preserved memory map */
-	struct khoser_mem_chunk *preserved_mem_map;
 };
 
-static void *xa_load_or_alloc(struct xarray *xa, unsigned long index, size_t sz)
-{
-	void *elm, *res;
-
-	elm = xa_load(xa, index);
-	if (elm)
-		return elm;
-
-	elm = kzalloc(sz, GFP_KERNEL);
-	if (!elm)
-		return ERR_PTR(-ENOMEM);
-
-	res = xa_cmpxchg(xa, index, NULL, elm, GFP_KERNEL);
-	if (xa_is_err(res))
-		res = ERR_PTR(xa_err(res));
-
-	if (res) {
-		kfree(elm);
-		return res;
-	}
-
-	return elm;
-}
-
-static void __kho_unpreserve(struct kho_mem_track *track, unsigned long pfn,
-			     unsigned long end_pfn)
-{
-	struct kho_mem_phys_bits *bits;
-	struct kho_mem_phys *physxa;
-
-	while (pfn < end_pfn) {
-		const unsigned int order =
-			min(count_trailing_zeros(pfn), ilog2(end_pfn - pfn));
-		const unsigned long pfn_high = pfn >> order;
-
-		physxa = xa_load(&track->orders, order);
-		if (!physxa)
-			continue;
-
-		bits = xa_load(&physxa->phys_bits, pfn_high / PRESERVE_BITS);
-		if (!bits)
-			continue;
-
-		clear_bit(pfn_high % PRESERVE_BITS, bits->preserve);
-
-		pfn += 1 << order;
-	}
-}
-
-static int __kho_preserve_order(struct kho_mem_track *track, unsigned long pfn,
-				unsigned int order)
-{
-	struct kho_mem_phys_bits *bits;
-	struct kho_mem_phys *physxa, *new_physxa;
-	const unsigned long pfn_high = pfn >> order;
-
-	might_sleep();
-
-	physxa = xa_load(&track->orders, order);
-	if (!physxa) {
-		int err;
-
-		new_physxa = kzalloc(sizeof(*physxa), GFP_KERNEL);
-		if (!new_physxa)
-			return -ENOMEM;
-
-		xa_init(&new_physxa->phys_bits);
-		physxa = xa_cmpxchg(&track->orders, order, NULL, new_physxa,
-				    GFP_KERNEL);
-
-		err = xa_err(physxa);
-		if (err || physxa) {
-			xa_destroy(&new_physxa->phys_bits);
-			kfree(new_physxa);
-
-			if (err)
-				return err;
-		} else {
-			physxa = new_physxa;
-		}
-	}
-
-	bits = xa_load_or_alloc(&physxa->phys_bits, pfn_high / PRESERVE_BITS,
-				sizeof(*bits));
-	if (IS_ERR(bits))
-		return PTR_ERR(bits);
-
-	set_bit(pfn_high % PRESERVE_BITS, bits->preserve);
-
-	return 0;
-}
-
 /* almost as free_reserved_page(), just don't free the page */
 static void kho_restore_page(struct page *page, unsigned int order)
 {
@@ -568,151 +429,29 @@ struct folio *kho_restore_folio(phys_addr_t phys)
 }
 EXPORT_SYMBOL_GPL(kho_restore_folio);
 
-/* Serialize and deserialize struct kho_mem_phys across kexec
- *
- * Record all the bitmaps in a linked list of pages for the next kernel to
- * process. Each chunk holds bitmaps of the same order and each block of bitmaps
- * starts at a given physical address. This allows the bitmaps to be sparse. The
- * xarray is used to store them in a tree while building up the data structure,
- * but the KHO successor kernel only needs to process them once in order.
- *
- * All of this memory is normal kmalloc() memory and is not marked for
- * preservation. The successor kernel will remain isolated to the scratch space
- * until it completes processing this list. Once processed all the memory
- * storing these ranges will be marked as free.
- */
-
-struct khoser_mem_bitmap_ptr {
-	phys_addr_t phys_start;
-	DECLARE_KHOSER_PTR(bitmap, struct kho_mem_phys_bits *);
-};
-
-struct khoser_mem_chunk_hdr {
-	DECLARE_KHOSER_PTR(next, struct khoser_mem_chunk *);
-	unsigned int order;
-	unsigned int num_elms;
-};
-
-#define KHOSER_BITMAP_SIZE                                   \
-	((PAGE_SIZE - sizeof(struct khoser_mem_chunk_hdr)) / \
-	 sizeof(struct khoser_mem_bitmap_ptr))
-
-struct khoser_mem_chunk {
-	struct khoser_mem_chunk_hdr hdr;
-	struct khoser_mem_bitmap_ptr bitmaps[KHOSER_BITMAP_SIZE];
-};
-
-static_assert(sizeof(struct khoser_mem_chunk) == PAGE_SIZE);
-
-static struct khoser_mem_chunk *new_chunk(struct khoser_mem_chunk *cur_chunk,
-					  unsigned long order)
-{
-	struct khoser_mem_chunk *chunk;
-
-	chunk = kzalloc(PAGE_SIZE, GFP_KERNEL);
-	if (!chunk)
-		return NULL;
-	chunk->hdr.order = order;
-	if (cur_chunk)
-		KHOSER_STORE_PTR(cur_chunk->hdr.next, chunk);
-	return chunk;
-}
-
-static void kho_mem_ser_free(struct khoser_mem_chunk *first_chunk)
-{
-	struct khoser_mem_chunk *chunk = first_chunk;
-
-	while (chunk) {
-		struct khoser_mem_chunk *tmp = chunk;
-
-		chunk = KHOSER_LOAD_PTR(chunk->hdr.next);
-		kfree(tmp);
-	}
-}
-
-static int kho_mem_serialize(struct kho_serialization *ser)
-{
-	struct khoser_mem_chunk *first_chunk = NULL;
-	struct khoser_mem_chunk *chunk = NULL;
-	struct kho_mem_phys *physxa;
-	unsigned long order;
-
-	xa_for_each(&ser->track.orders, order, physxa) {
-		struct kho_mem_phys_bits *bits;
-		unsigned long phys;
-
-		chunk = new_chunk(chunk, order);
-		if (!chunk)
-			goto err_free;
-
-		if (!first_chunk)
-			first_chunk = chunk;
-
-		xa_for_each(&physxa->phys_bits, phys, bits) {
-			struct khoser_mem_bitmap_ptr *elm;
-
-			if (chunk->hdr.num_elms == ARRAY_SIZE(chunk->bitmaps)) {
-				chunk = new_chunk(chunk, order);
-				if (!chunk)
-					goto err_free;
-			}
-
-			elm = &chunk->bitmaps[chunk->hdr.num_elms];
-			chunk->hdr.num_elms++;
-			elm->phys_start = (phys * PRESERVE_BITS)
-					  << (order + PAGE_SHIFT);
-			KHOSER_STORE_PTR(elm->bitmap, bits);
-		}
-	}
-
-	ser->preserved_mem_map = first_chunk;
-
-	return 0;
-
-err_free:
-	kho_mem_ser_free(first_chunk);
-	return -ENOMEM;
-}
-
-static void __init deserialize_bitmap(unsigned int order,
-				      struct khoser_mem_bitmap_ptr *elm)
-{
-	struct kho_mem_phys_bits *bitmap = KHOSER_LOAD_PTR(elm->bitmap);
-	unsigned long bit;
-
-	for_each_set_bit(bit, bitmap->preserve, PRESERVE_BITS) {
-		int sz = 1 << (order + PAGE_SHIFT);
-		phys_addr_t phys =
-			elm->phys_start + (bit << (order + PAGE_SHIFT));
-		struct page *page = phys_to_page(phys);
-
-		memblock_reserve(phys, sz);
-		memblock_reserved_mark_noinit(phys, sz);
-		page->private = order;
-	}
-}
-
 static void __init kho_mem_deserialize(const void *fdt)
 {
-	struct khoser_mem_chunk *chunk;
 	const phys_addr_t *mem;
-	int len;
-
-	mem = fdt_getprop(fdt, 0, PROP_PRESERVED_MEMORY_MAP, &len);
+	int len, i;
+	struct kho_order_table *order_table;
 
+	/* Retrieve the KHO order table from passed-in FDT. */
+	mem = fdt_getprop(fdt, 0, PROP_PRESERVED_ORDER_TABLE, &len);
 	if (!mem || len != sizeof(*mem)) {
-		pr_err("failed to get preserved memory bitmaps\n");
+		pr_err("failed to get preserved order table\n");
 		return;
 	}
 
-	chunk = *mem ? phys_to_virt(*mem) : NULL;
-	while (chunk) {
-		unsigned int i;
+	order_table = *mem ?
+		(struct kho_order_table *)phys_to_virt(*mem) :
+		NULL;
 
-		for (i = 0; i != chunk->hdr.num_elms; i++)
-			deserialize_bitmap(chunk->hdr.order,
-					   &chunk->bitmaps[i]);
-		chunk = KHOSER_LOAD_PTR(chunk->hdr.next);
+	if (!order_table)
+		return;
+
+	for (i = 0; i < HUGETLB_PAGE_ORDER + 1; i++) {
+		kho_walk_page_tables(kho_page_table(order_table->orders[i]),
+				     i, kho_memblock_reserve);
 	}
 }
 
@@ -977,25 +716,15 @@ EXPORT_SYMBOL_GPL(kho_add_subtree);
 
 struct kho_out {
 	struct blocking_notifier_head chain_head;
-
 	struct dentry *dir;
-
-	struct mutex lock; /* protects KHO FDT finalization */
-
 	struct kho_serialization ser;
-	bool finalized;
 };
 
 static struct kho_out kho_out = {
 	.chain_head = BLOCKING_NOTIFIER_INIT(kho_out.chain_head),
-	.lock = __MUTEX_INITIALIZER(kho_out.lock),
 	.ser = {
 		.fdt_list = LIST_HEAD_INIT(kho_out.ser.fdt_list),
-		.track = {
-			.orders = XARRAY_INIT(kho_out.ser.track.orders, 0),
-		},
 	},
-	.finalized = false,
 };
 
 int register_kho_notifier(struct notifier_block *nb)
@@ -1023,12 +752,8 @@ int kho_preserve_folio(struct folio *folio)
 {
 	const unsigned long pfn = folio_pfn(folio);
 	const unsigned int order = folio_order(folio);
-	struct kho_mem_track *track = &kho_out.ser.track;
-
-	if (kho_out.finalized)
-		return -EBUSY;
 
-	return __kho_preserve_order(track, pfn, order);
+	return kho_preserve_page_table(pfn, order);
 }
 EXPORT_SYMBOL_GPL(kho_preserve_folio);
 
@@ -1045,14 +770,8 @@ EXPORT_SYMBOL_GPL(kho_preserve_folio);
 int kho_preserve_phys(phys_addr_t phys, size_t size)
 {
 	unsigned long pfn = PHYS_PFN(phys);
-	unsigned long failed_pfn = 0;
-	const unsigned long start_pfn = pfn;
 	const unsigned long end_pfn = PHYS_PFN(phys + size);
 	int err = 0;
-	struct kho_mem_track *track = &kho_out.ser.track;
-
-	if (kho_out.finalized)
-		return -EBUSY;
 
 	if (!PAGE_ALIGNED(phys) || !PAGE_ALIGNED(size))
 		return -EINVAL;
@@ -1061,19 +780,14 @@ int kho_preserve_phys(phys_addr_t phys, size_t size)
 		const unsigned int order =
 			min(count_trailing_zeros(pfn), ilog2(end_pfn - pfn));
 
-		err = __kho_preserve_order(track, pfn, order);
-		if (err) {
-			failed_pfn = pfn;
-			break;
-		}
+		err = kho_preserve_page_table(pfn, order);
+		if (err)
+			return err;
 
 		pfn += 1 << order;
 	}
 
-	if (err)
-		__kho_unpreserve(track, start_pfn, failed_pfn);
-
-	return err;
+	return 0;
 }
 EXPORT_SYMBOL_GPL(kho_preserve_phys);
 
@@ -1081,150 +795,6 @@ EXPORT_SYMBOL_GPL(kho_preserve_phys);
 
 static struct dentry *debugfs_root;
 
-static int kho_out_update_debugfs_fdt(void)
-{
-	int err = 0;
-	struct fdt_debugfs *ff, *tmp;
-
-	if (kho_out.finalized) {
-		err = kho_debugfs_fdt_add(&kho_out.ser.fdt_list, kho_out.dir,
-					  "fdt", page_to_virt(kho_out.ser.fdt));
-	} else {
-		list_for_each_entry_safe(ff, tmp, &kho_out.ser.fdt_list, list) {
-			debugfs_remove(ff->file);
-			list_del(&ff->list);
-			kfree(ff);
-		}
-	}
-
-	return err;
-}
-
-static int kho_abort(void)
-{
-	int err;
-	unsigned long order;
-	struct kho_mem_phys *physxa;
-
-	xa_for_each(&kho_out.ser.track.orders, order, physxa) {
-		struct kho_mem_phys_bits *bits;
-		unsigned long phys;
-
-		xa_for_each(&physxa->phys_bits, phys, bits)
-			kfree(bits);
-
-		xa_destroy(&physxa->phys_bits);
-		kfree(physxa);
-	}
-	xa_destroy(&kho_out.ser.track.orders);
-
-	if (kho_out.ser.preserved_mem_map) {
-		kho_mem_ser_free(kho_out.ser.preserved_mem_map);
-		kho_out.ser.preserved_mem_map = NULL;
-	}
-
-	err = blocking_notifier_call_chain(&kho_out.chain_head, KEXEC_KHO_ABORT,
-					   NULL);
-	err = notifier_to_errno(err);
-
-	if (err)
-		pr_err("Failed to abort KHO finalization: %d\n", err);
-
-	return err;
-}
-
-static int kho_finalize(void)
-{
-	int err = 0;
-	u64 *preserved_mem_map;
-	void *fdt = page_to_virt(kho_out.ser.fdt);
-
-	err |= fdt_create(fdt, PAGE_SIZE);
-	err |= fdt_finish_reservemap(fdt);
-	err |= fdt_begin_node(fdt, "");
-	err |= fdt_property_string(fdt, "compatible", KHO_FDT_COMPATIBLE);
-	/**
-	 * Reserve the preserved-memory-map property in the root FDT, so
-	 * that all property definitions will precede subnodes created by
-	 * KHO callers.
-	 */
-	err |= fdt_property_placeholder(fdt, PROP_PRESERVED_MEMORY_MAP,
-					sizeof(*preserved_mem_map),
-					(void **)&preserved_mem_map);
-	if (err)
-		goto abort;
-
-	err = kho_preserve_folio(page_folio(kho_out.ser.fdt));
-	if (err)
-		goto abort;
-
-	err = blocking_notifier_call_chain(&kho_out.chain_head,
-					   KEXEC_KHO_FINALIZE, &kho_out.ser);
-	err = notifier_to_errno(err);
-	if (err)
-		goto abort;
-
-	err = kho_mem_serialize(&kho_out.ser);
-	if (err)
-		goto abort;
-
-	*preserved_mem_map = (u64)virt_to_phys(kho_out.ser.preserved_mem_map);
-
-	err |= fdt_end_node(fdt);
-	err |= fdt_finish(fdt);
-
-abort:
-	if (err) {
-		pr_err("Failed to convert KHO state tree: %d\n", err);
-		kho_abort();
-	}
-
-	return err;
-}
-
-static int kho_out_finalize_get(void *data, u64 *val)
-{
-	mutex_lock(&kho_out.lock);
-	*val = kho_out.finalized;
-	mutex_unlock(&kho_out.lock);
-
-	return 0;
-}
-
-static int kho_out_finalize_set(void *data, u64 _val)
-{
-	int ret = 0;
-	bool val = !!_val;
-
-	mutex_lock(&kho_out.lock);
-
-	if (val == kho_out.finalized) {
-		if (kho_out.finalized)
-			ret = -EEXIST;
-		else
-			ret = -ENOENT;
-		goto unlock;
-	}
-
-	if (val)
-		ret = kho_finalize();
-	else
-		ret = kho_abort();
-
-	if (ret)
-		goto unlock;
-
-	kho_out.finalized = val;
-	ret = kho_out_update_debugfs_fdt();
-
-unlock:
-	mutex_unlock(&kho_out.lock);
-	return ret;
-}
-
-DEFINE_DEBUGFS_ATTRIBUTE(fops_kho_out_finalize, kho_out_finalize_get,
-			 kho_out_finalize_set, "%llu\n");
-
 static int scratch_phys_show(struct seq_file *m, void *v)
 {
 	for (int i = 0; i < kho_scratch_cnt; i++)
@@ -1265,11 +835,6 @@ static __init int kho_out_debugfs_init(void)
 	if (IS_ERR(f))
 		goto err_rmdir;
 
-	f = debugfs_create_file("finalize", 0600, dir, NULL,
-				&fops_kho_out_finalize);
-	if (IS_ERR(f))
-		goto err_rmdir;
-
 	kho_out.dir = dir;
 	kho_out.ser.sub_fdt_dir = sub_fdt_dir;
 	return 0;
@@ -1381,6 +946,35 @@ static __init int kho_in_debugfs_init(const void *fdt)
 	return err;
 }
 
+static int kho_out_fdt_init(void)
+{
+	int err = 0;
+	void *fdt = page_to_virt(kho_out.ser.fdt);
+	u64 *preserved_order_table;
+
+	err |= fdt_create(fdt, PAGE_SIZE);
+	err |= fdt_finish_reservemap(fdt);
+	err |= fdt_begin_node(fdt, "");
+	err |= fdt_property_string(fdt, "compatible", KHO_FDT_COMPATIBLE);
+
+	err |= fdt_property_placeholder(fdt, PROP_PRESERVED_ORDER_TABLE,
+					sizeof(*preserved_order_table),
+					(void **)&preserved_order_table);
+	if (err)
+		goto abort;
+
+	*preserved_order_table = (u64)virt_to_phys(kho_order_table);
+
+	err |= fdt_end_node(fdt);
+	err |= fdt_finish(fdt);
+
+abort:
+	if (err)
+		pr_err("Failed to convert KHO state tree: %d\n", err);
+
+	return err;
+}
+
 static __init int kho_init(void)
 {
 	int err = 0;
@@ -1395,15 +989,26 @@ static __init int kho_init(void)
 		goto err_free_scratch;
 	}
 
+	kho_order_table = (struct kho_order_table *)
+		kzalloc(PAGE_SIZE, GFP_KERNEL);
+	if (!kho_order_table) {
+		err = -ENOMEM;
+		goto err_free_fdt;
+	}
+
+	err = kho_out_fdt_init();
+	if (err)
+		goto err_free_kho_order_table;
+
 	debugfs_root = debugfs_create_dir("kho", NULL);
 	if (IS_ERR(debugfs_root)) {
 		err = -ENOENT;
-		goto err_free_fdt;
+		goto err_free_kho_order_table;
 	}
 
 	err = kho_out_debugfs_init();
 	if (err)
-		goto err_free_fdt;
+		goto err_free_kho_order_table;
 
 	if (fdt) {
 		err = kho_in_debugfs_init(fdt);
@@ -1431,6 +1036,9 @@ static __init int kho_init(void)
 
 	return 0;
 
+err_free_kho_order_table:
+	kfree(kho_order_table);
+	kho_order_table = NULL;
 err_free_fdt:
 	put_page(kho_out.ser.fdt);
 	kho_out.ser.fdt = NULL;
@@ -1581,6 +1189,8 @@ int kho_fill_kimage(struct kimage *image)
 		return 0;
 
 	image->kho.fdt = page_to_phys(kho_out.ser.fdt);
+	/* Preserve the memory page of FDT for the next kernel */
+	kho_preserve_phys(image->kho.fdt, PAGE_SIZE);
 
 	scratch_size = sizeof(*kho_scratch) * kho_scratch_cnt;
 	scratch = (struct kexec_buf){
-- 
2.51.0.384.g4c02a37b29-goog



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [RFC v1 3/4] memblock: Remove KHO notifier usage
  2025-09-17  2:50 [RFC v1 0/4] Make KHO Stateless Jason Miu
  2025-09-17  2:50 ` [RFC v1 1/4] kho: Introduce KHO page table data structures Jason Miu
  2025-09-17  2:50 ` [RFC v1 2/4] kho: Adopt KHO page tables and remove serialization Jason Miu
@ 2025-09-17  2:50 ` Jason Miu
  2025-09-17  2:50 ` [RFC v1 4/4] kho: Remove notifier system infrastructure Jason Miu
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 19+ messages in thread
From: Jason Miu @ 2025-09-17  2:50 UTC (permalink / raw)
  To: Alexander Graf, Andrew Morton, Baoquan He, Changyuan Lyu,
	David Matlack, David Rientjes, Jason Gunthorpe, Jason Miu,
	Joel Granados, Marcos Paulo de Souza, Mario Limonciello,
	Mike Rapoport, Pasha Tatashin, Petr Mladek, Rafael J . Wysocki,
	Steven Chen, Yan Zhao, kexec, linux-kernel, linux-mm

Update memblock to use direct KHO API calls for memory preservation.

Remove the KHO notifier registration and callbacks from the memblock
subsystem. These notifiers were tied to the former KHO finalize and
abort events, which are no longer used.

Memblock now preserves its `reserve_mem` regions and registers its
metadata by calling kho_preserve_phys(), kho_preserve_folio(), and
kho_add_subtree() directly within its initialization function. This is
made possible by changes in the KHO core to complete the FDT at a
later stage, during kexec.

Signed-off-by: Jason Miu <jasonmiu@google.com>
---
 include/linux/kexec_handover.h |  7 ++----
 kernel/kexec_core.c            |  4 +++
 kernel/kexec_handover.c        | 46 +++++++++++++++++++++-------------
 kernel/kexec_internal.h        |  2 ++
 mm/memblock.c                  | 46 ++++++++--------------------------
 5 files changed, 47 insertions(+), 58 deletions(-)

diff --git a/include/linux/kexec_handover.h b/include/linux/kexec_handover.h
index c8229cb11f4b..e29dcf53de7e 100644
--- a/include/linux/kexec_handover.h
+++ b/include/linux/kexec_handover.h
@@ -19,15 +19,13 @@ enum kho_event {
 struct folio;
 struct notifier_block;
 
-struct kho_serialization;
-
 #ifdef CONFIG_KEXEC_HANDOVER
 bool kho_is_enabled(void);
 
 int kho_preserve_folio(struct folio *folio);
 int kho_preserve_phys(phys_addr_t phys, size_t size);
 struct folio *kho_restore_folio(phys_addr_t phys);
-int kho_add_subtree(struct kho_serialization *ser, const char *name, void *fdt);
+int kho_add_subtree(const char *name, void *fdt);
 int kho_retrieve_subtree(const char *name, phys_addr_t *phys);
 
 int register_kho_notifier(struct notifier_block *nb);
@@ -58,8 +56,7 @@ static inline struct folio *kho_restore_folio(phys_addr_t phys)
 	return NULL;
 }
 
-static inline int kho_add_subtree(struct kho_serialization *ser,
-				  const char *name, void *fdt)
+static inline int kho_add_subtree(const char *name, void *fdt)
 {
 	return -EOPNOTSUPP;
 }
diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index 31203f0bacaf..3cf33aaded17 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -1147,6 +1147,10 @@ int kernel_kexec(void)
 		goto Unlock;
 	}
 
+	error = kho_commit_fdt();
+	if (error)
+		goto Unlock;
+
 #ifdef CONFIG_KEXEC_JUMP
 	if (kexec_image->preserve_context) {
 		/*
diff --git a/kernel/kexec_handover.c b/kernel/kexec_handover.c
index 578d1c1b9cea..f7933b434364 100644
--- a/kernel/kexec_handover.c
+++ b/kernel/kexec_handover.c
@@ -682,9 +682,21 @@ static int kho_debugfs_fdt_add(struct list_head *list, struct dentry *dir,
 	return 0;
 }
 
+struct kho_out {
+	struct blocking_notifier_head chain_head;
+	struct dentry *dir;
+	struct kho_serialization ser;
+};
+
+static struct kho_out kho_out = {
+	.chain_head = BLOCKING_NOTIFIER_INIT(kho_out.chain_head),
+	.ser = {
+		.fdt_list = LIST_HEAD_INIT(kho_out.ser.fdt_list),
+	},
+};
+
 /**
  * kho_add_subtree - record the physical address of a sub FDT in KHO root tree.
- * @ser: serialization control object passed by KHO notifiers.
  * @name: name of the sub tree.
  * @fdt: the sub tree blob.
  *
@@ -697,8 +709,9 @@ static int kho_debugfs_fdt_add(struct list_head *list, struct dentry *dir,
  *
  * Return: 0 on success, error code on failure
  */
-int kho_add_subtree(struct kho_serialization *ser, const char *name, void *fdt)
+int kho_add_subtree(const char *name, void *fdt)
 {
+	struct kho_serialization *ser = &kho_out.ser;
 	int err = 0;
 	u64 phys = (u64)virt_to_phys(fdt);
 	void *root = page_to_virt(ser->fdt);
@@ -714,19 +727,6 @@ int kho_add_subtree(struct kho_serialization *ser, const char *name, void *fdt)
 }
 EXPORT_SYMBOL_GPL(kho_add_subtree);
 
-struct kho_out {
-	struct blocking_notifier_head chain_head;
-	struct dentry *dir;
-	struct kho_serialization ser;
-};
-
-static struct kho_out kho_out = {
-	.chain_head = BLOCKING_NOTIFIER_INIT(kho_out.chain_head),
-	.ser = {
-		.fdt_list = LIST_HEAD_INIT(kho_out.ser.fdt_list),
-	},
-};
-
 int register_kho_notifier(struct notifier_block *nb)
 {
 	return blocking_notifier_chain_register(&kho_out.chain_head, nb);
@@ -952,6 +952,7 @@ static int kho_out_fdt_init(void)
 	void *fdt = page_to_virt(kho_out.ser.fdt);
 	u64 *preserved_order_table;
 
+	/* Do not close the root node and FDT until kho_commit_fdt() */
 	err |= fdt_create(fdt, PAGE_SIZE);
 	err |= fdt_finish_reservemap(fdt);
 	err |= fdt_begin_node(fdt, "");
@@ -965,9 +966,6 @@ static int kho_out_fdt_init(void)
 
 	*preserved_order_table = (u64)virt_to_phys(kho_order_table);
 
-	err |= fdt_end_node(fdt);
-	err |= fdt_finish(fdt);
-
 abort:
 	if (err)
 		pr_err("Failed to convert KHO state tree: %d\n", err);
@@ -1211,6 +1209,18 @@ int kho_fill_kimage(struct kimage *image)
 	return 0;
 }
 
+int kho_commit_fdt(void)
+{
+	int err = 0;
+	void *fdt = page_to_virt(kho_out.ser.fdt);
+
+	/* Close the root node and commit the FDT */
+	err = fdt_end_node(fdt);
+	err |= fdt_finish(fdt);
+
+	return err;
+}
+
 static int kho_walk_scratch(struct kexec_buf *kbuf,
 			    int (*func)(struct resource *, void *))
 {
diff --git a/kernel/kexec_internal.h b/kernel/kexec_internal.h
index 228bb88c018b..490170911f5a 100644
--- a/kernel/kexec_internal.h
+++ b/kernel/kexec_internal.h
@@ -46,6 +46,7 @@ struct kexec_buf;
 int kho_locate_mem_hole(struct kexec_buf *kbuf,
 			int (*func)(struct resource *, void *));
 int kho_fill_kimage(struct kimage *image);
+int kho_commit_fdt(void);
 #else
 static inline int kho_locate_mem_hole(struct kexec_buf *kbuf,
 				      int (*func)(struct resource *, void *))
@@ -54,5 +55,6 @@ static inline int kho_locate_mem_hole(struct kexec_buf *kbuf,
 }
 
 static inline int kho_fill_kimage(struct kimage *image) { return 0; }
+static inline int kho_commit_fdt(void) { return 0; }
 #endif /* CONFIG_KEXEC_HANDOVER */
 #endif /* LINUX_KEXEC_INTERNAL_H */
diff --git a/mm/memblock.c b/mm/memblock.c
index 117d963e677c..978717d59a6f 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -6,6 +6,7 @@
  * Copyright (C) 2001 Peter Bergner.
  */
 
+#include "asm-generic/memory_model.h"
 #include <linux/kernel.h>
 #include <linux/slab.h>
 #include <linux/init.h>
@@ -2510,39 +2511,6 @@ int reserve_mem_release_by_name(const char *name)
 #define RESERVE_MEM_KHO_NODE_COMPATIBLE "reserve-mem-v1"
 static struct page *kho_fdt;
 
-static int reserve_mem_kho_finalize(struct kho_serialization *ser)
-{
-	int err = 0, i;
-
-	for (i = 0; i < reserved_mem_count; i++) {
-		struct reserve_mem_table *map = &reserved_mem_table[i];
-
-		err |= kho_preserve_phys(map->start, map->size);
-	}
-
-	err |= kho_preserve_folio(page_folio(kho_fdt));
-	err |= kho_add_subtree(ser, MEMBLOCK_KHO_FDT, page_to_virt(kho_fdt));
-
-	return notifier_from_errno(err);
-}
-
-static int reserve_mem_kho_notifier(struct notifier_block *self,
-				    unsigned long cmd, void *v)
-{
-	switch (cmd) {
-	case KEXEC_KHO_FINALIZE:
-		return reserve_mem_kho_finalize((struct kho_serialization *)v);
-	case KEXEC_KHO_ABORT:
-		return NOTIFY_DONE;
-	default:
-		return NOTIFY_BAD;
-	}
-}
-
-static struct notifier_block reserve_mem_kho_nb = {
-	.notifier_call = reserve_mem_kho_notifier,
-};
-
 static int __init prepare_kho_fdt(void)
 {
 	int err = 0, i;
@@ -2583,7 +2551,7 @@ static int __init prepare_kho_fdt(void)
 
 static int __init reserve_mem_init(void)
 {
-	int err;
+	int err, i;
 
 	if (!kho_is_enabled() || !reserved_mem_count)
 		return 0;
@@ -2592,7 +2560,15 @@ static int __init reserve_mem_init(void)
 	if (err)
 		return err;
 
-	err = register_kho_notifier(&reserve_mem_kho_nb);
+	for (i = 0; i < reserved_mem_count; i++) {
+		struct reserve_mem_table *map = &reserved_mem_table[i];
+
+		err |= kho_preserve_phys(map->start, map->size);
+	}
+
+	err |= kho_preserve_folio(page_folio(kho_fdt));
+	err |= kho_add_subtree(MEMBLOCK_KHO_FDT, page_to_virt(kho_fdt));
+
 	if (err) {
 		put_page(kho_fdt);
 		kho_fdt = NULL;
-- 
2.51.0.384.g4c02a37b29-goog



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [RFC v1 4/4] kho: Remove notifier system infrastructure
  2025-09-17  2:50 [RFC v1 0/4] Make KHO Stateless Jason Miu
                   ` (2 preceding siblings ...)
  2025-09-17  2:50 ` [RFC v1 3/4] memblock: Remove KHO notifier usage Jason Miu
@ 2025-09-17  2:50 ` Jason Miu
  2025-09-17 11:36 ` [RFC v1 0/4] Make KHO Stateless Jason Gunthorpe
  2025-09-25  9:19 ` Mike Rapoport
  5 siblings, 0 replies; 19+ messages in thread
From: Jason Miu @ 2025-09-17  2:50 UTC (permalink / raw)
  To: Alexander Graf, Andrew Morton, Baoquan He, Changyuan Lyu,
	David Matlack, David Rientjes, Jason Gunthorpe, Jason Miu,
	Joel Granados, Marcos Paulo de Souza, Mario Limonciello,
	Mike Rapoport, Pasha Tatashin, Petr Mladek, Rafael J . Wysocki,
	Steven Chen, Yan Zhao, kexec, linux-kernel, linux-mm

Remove the KHO notifier system.

Eliminate the core KHO notifier API
functions (`register_kho_notifier`, `unregister_kho_notifier`), the
`kho_event` enum, and the notifier chain head from KHO internal
structures.

This infrastructure was used to support the now-removed finalize and
abort states and is no longer required. Client subsystems now interact
with KHO through direct API calls.

Signed-off-by: Jason Miu <jasonmiu@google.com>
---
 include/linux/kexec_handover.h | 20 --------------------
 kernel/kexec_handover.c        | 15 ---------------
 2 files changed, 35 deletions(-)

diff --git a/include/linux/kexec_handover.h b/include/linux/kexec_handover.h
index e29dcf53de7e..09e8f0b0fcab 100644
--- a/include/linux/kexec_handover.h
+++ b/include/linux/kexec_handover.h
@@ -10,14 +10,7 @@ struct kho_scratch {
 	phys_addr_t size;
 };
 
-/* KHO Notifier index */
-enum kho_event {
-	KEXEC_KHO_FINALIZE = 0,
-	KEXEC_KHO_ABORT = 1,
-};
-
 struct folio;
-struct notifier_block;
 
 #ifdef CONFIG_KEXEC_HANDOVER
 bool kho_is_enabled(void);
@@ -28,9 +21,6 @@ struct folio *kho_restore_folio(phys_addr_t phys);
 int kho_add_subtree(const char *name, void *fdt);
 int kho_retrieve_subtree(const char *name, phys_addr_t *phys);
 
-int register_kho_notifier(struct notifier_block *nb);
-int unregister_kho_notifier(struct notifier_block *nb);
-
 void kho_memory_init(void);
 
 void kho_populate(phys_addr_t fdt_phys, u64 fdt_len, phys_addr_t scratch_phys,
@@ -66,16 +56,6 @@ static inline int kho_retrieve_subtree(const char *name, phys_addr_t *phys)
 	return -EOPNOTSUPP;
 }
 
-static inline int register_kho_notifier(struct notifier_block *nb)
-{
-	return -EOPNOTSUPP;
-}
-
-static inline int unregister_kho_notifier(struct notifier_block *nb)
-{
-	return -EOPNOTSUPP;
-}
-
 static inline void kho_memory_init(void)
 {
 }
diff --git a/kernel/kexec_handover.c b/kernel/kexec_handover.c
index f7933b434364..62f654b08c74 100644
--- a/kernel/kexec_handover.c
+++ b/kernel/kexec_handover.c
@@ -16,7 +16,6 @@
 #include <linux/libfdt.h>
 #include <linux/list.h>
 #include <linux/memblock.h>
-#include <linux/notifier.h>
 #include <linux/page-isolation.h>
 
 #include <asm/early_ioremap.h>
@@ -683,13 +682,11 @@ static int kho_debugfs_fdt_add(struct list_head *list, struct dentry *dir,
 }
 
 struct kho_out {
-	struct blocking_notifier_head chain_head;
 	struct dentry *dir;
 	struct kho_serialization ser;
 };
 
 static struct kho_out kho_out = {
-	.chain_head = BLOCKING_NOTIFIER_INIT(kho_out.chain_head),
 	.ser = {
 		.fdt_list = LIST_HEAD_INIT(kho_out.ser.fdt_list),
 	},
@@ -727,18 +724,6 @@ int kho_add_subtree(const char *name, void *fdt)
 }
 EXPORT_SYMBOL_GPL(kho_add_subtree);
 
-int register_kho_notifier(struct notifier_block *nb)
-{
-	return blocking_notifier_chain_register(&kho_out.chain_head, nb);
-}
-EXPORT_SYMBOL_GPL(register_kho_notifier);
-
-int unregister_kho_notifier(struct notifier_block *nb)
-{
-	return blocking_notifier_chain_unregister(&kho_out.chain_head, nb);
-}
-EXPORT_SYMBOL_GPL(unregister_kho_notifier);
-
 /**
  * kho_preserve_folio - preserve a folio across kexec.
  * @folio: folio to preserve.
-- 
2.51.0.384.g4c02a37b29-goog



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [RFC v1 0/4] Make KHO Stateless
  2025-09-17  2:50 [RFC v1 0/4] Make KHO Stateless Jason Miu
                   ` (3 preceding siblings ...)
  2025-09-17  2:50 ` [RFC v1 4/4] kho: Remove notifier system infrastructure Jason Miu
@ 2025-09-17 11:36 ` Jason Gunthorpe
  2025-09-17 14:48   ` Pasha Tatashin
  2025-09-21 22:26   ` Matthew Wilcox
  2025-09-25  9:19 ` Mike Rapoport
  5 siblings, 2 replies; 19+ messages in thread
From: Jason Gunthorpe @ 2025-09-17 11:36 UTC (permalink / raw)
  To: Jason Miu
  Cc: Alexander Graf, Andrew Morton, Baoquan He, Changyuan Lyu,
	David Matlack, David Rientjes, Joel Granados,
	Marcos Paulo de Souza, Mario Limonciello, Mike Rapoport,
	Pasha Tatashin, Petr Mladek, Rafael J . Wysocki, Steven Chen,
	Yan Zhao, kexec, linux-kernel, linux-mm

On Tue, Sep 16, 2025 at 07:50:15PM -0700, Jason Miu wrote:
> This series transitions KHO from an xarray-based metadata tracking
> system with serialization to using page table like data structures
> that can be passed directly to the next kernel.
> 
> The key motivations for this change are to:
> - Eliminate the need for data serialization before kexec.
> - Remove the former KHO state machine by deprecating the finalize
>   and abort states.
> - Pass preservation metadata more directly to the next kernel via the FDT.
> 
> The new approach uses a per-order page table structure (kho_order_table,
> kho_page_table, kho_bitmap_table) to mark preserved pages. The physical
> address of the root `kho_order_table` is passed in the FDT, allowing the
> next kernel to reconstruct the preserved memory map.

It is not a "page table" structure, it is just a radix tree with bits
as the leaf.

Jason


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC v1 1/4] kho: Introduce KHO page table data structures
  2025-09-17  2:50 ` [RFC v1 1/4] kho: Introduce KHO page table data structures Jason Miu
@ 2025-09-17 12:21   ` Jason Gunthorpe
  2025-09-17 16:18     ` Pasha Tatashin
  0 siblings, 1 reply; 19+ messages in thread
From: Jason Gunthorpe @ 2025-09-17 12:21 UTC (permalink / raw)
  To: Jason Miu
  Cc: Alexander Graf, Andrew Morton, Baoquan He, Changyuan Lyu,
	David Matlack, David Rientjes, Joel Granados,
	Marcos Paulo de Souza, Mario Limonciello, Mike Rapoport,
	Pasha Tatashin, Petr Mladek, Rafael J . Wysocki, Steven Chen,
	Yan Zhao, kexec, linux-kernel, linux-mm

On Tue, Sep 16, 2025 at 07:50:16PM -0700, Jason Miu wrote:
> + * kho_order_table
> + * +-------------------------------+--------------------+
> + * | 0 order| 1 order| 2 order ... | HUGETLB_PAGE_ORDER |
> + * ++------------------------------+--------------------+
> + *  |
> + *  |
> + *  v
> + * ++------+
> + * |  Lv6  | kho_page_table
> + * ++------+

I seem to remember suggesting this could be simplified without the
special case 7h level table table for order.

Encode the phys address as:

(order << 51) | (phys >> (PAGE_SHIFT + order))

Then you don't need another table for order, the 64 bits encode
everything consistently. Order can't be > 52 so it is
only 6 bits, meaning the result fits into at most 57 bits.

> + *      63      62:54    53:45    44:36    35:27        26:0
> + * +--------+--------+--------+--------+--------+-----------------+
> + * |  Lv 6  |  Lv 5  |  Lv 4  |  Lv 3  |  Lv 2  |  Lv 1 (bitmap)  |
> + * +--------+--------+--------+--------+--------+-----------------+

This isn't quite right, the 11:0 bits are must be zero and not used to
index anything.

Adjusting to reflect the above math, it would be like this:

 63:60   59:51    50:42    41:33    32:34    23:15       14:0
+-----+--------+--------+--------+--------+--------+-----------------+
| 0   |  Lv 6  |  Lv 5  |  Lv 4  |  Lv 3  |  Lv 2  |  Lv 1 (bitmap)  |
+-----+--------+--------+--------+--------+--------+-----------------+

The order level is just folded into lv 6

> + * For higher order pages, the bit fields for each level shift to the left by
> + * the page order.

This is probably an unncessary complexity. The table levels cost only
64 bytes, it isn't so valuable to eliminate them. So with the above
math it shifts right not left. Level 1 is always the bitmap and it
doesn't move around. I'd label this 0 in the code.

If you also fix the sizes to be 64 bytes and 4096 bytes regardless of
PAGE_SIZE then everything is easy and fixed, while still efficient on
higher PAGE_SIZE architectures.

Fruther, changing the formula to this:

(1 << (63 - PAGE_SHIFT - order)) | (phys >> (PAGE_SHIFT + order))

Will shift the overhead levels to the top of the radix tree and share
them across all orders, higher PAGE_SIZE arches will just get a single
lvl 5 and an unecessary lvl 6 - cost 64 extra bytes who cares.

> +#define BITMAP_TABLE_SHIFT(_order) (PAGE_SHIFT + PAGE_SHIFT + 3 + (_order))
> +#define BITMAP_TABLE_MASK(_order) ((1ULL << BITMAP_TABLE_SHIFT(_order)) - 1)
> +#define PRESERVED_PAGE_OFFSET_SHIFT(_order) (PAGE_SHIFT + (_order))
> +#define PAGE_TABLE_SHIFT_PER_LEVEL (ilog2(PAGE_SIZE / sizeof(unsigned long)))
> +#define PAGE_TABLE_LEVEL_MASK ((1ULL << PAGE_TABLE_SHIFT_PER_LEVEL) - 1)
> +#define PTR_PER_LEVEL (PAGE_SIZE / sizeof(unsigned long))

please use inlines and enums :(

It looks like if you make the above algorithm changes most of the this
code is deleted.

Jason


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC v1 0/4] Make KHO Stateless
  2025-09-17 11:36 ` [RFC v1 0/4] Make KHO Stateless Jason Gunthorpe
@ 2025-09-17 14:48   ` Pasha Tatashin
  2025-09-21 22:26   ` Matthew Wilcox
  1 sibling, 0 replies; 19+ messages in thread
From: Pasha Tatashin @ 2025-09-17 14:48 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jason Miu, Alexander Graf, Andrew Morton, Baoquan He,
	Changyuan Lyu, David Matlack, David Rientjes, Joel Granados,
	Marcos Paulo de Souza, Mario Limonciello, Mike Rapoport,
	Petr Mladek, Rafael J . Wysocki, Steven Chen, Yan Zhao, kexec,
	linux-kernel, linux-mm

On Wed, Sep 17, 2025 at 7:36 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Tue, Sep 16, 2025 at 07:50:15PM -0700, Jason Miu wrote:
> > This series transitions KHO from an xarray-based metadata tracking
> > system with serialization to using page table like data structures
> > that can be passed directly to the next kernel.
> >
> > The key motivations for this change are to:
> > - Eliminate the need for data serialization before kexec.
> > - Remove the former KHO state machine by deprecating the finalize
> >   and abort states.
> > - Pass preservation metadata more directly to the next kernel via the FDT.
> >
> > The new approach uses a per-order page table structure (kho_order_table,
> > kho_page_table, kho_bitmap_table) to mark preserved pages. The physical
> > address of the root `kho_order_table` is passed in the FDT, allowing the
> > next kernel to reconstruct the preserved memory map.
>
> It is not a "page table" structure, it is just a radix tree with bits
> as the leaf.

To be fair above it is referred to as a page table *like* data
structure, but I agree kho radix tree sounds like a good overall name
for this, and it might make sense to rename from kho_page_table to
kho_radix_tree in other places.

>
> Jason


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC v1 1/4] kho: Introduce KHO page table data structures
  2025-09-17 12:21   ` Jason Gunthorpe
@ 2025-09-17 16:18     ` Pasha Tatashin
  2025-09-17 16:32       ` Jason Gunthorpe
  0 siblings, 1 reply; 19+ messages in thread
From: Pasha Tatashin @ 2025-09-17 16:18 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jason Miu, Alexander Graf, Andrew Morton, Baoquan He,
	Changyuan Lyu, David Matlack, David Rientjes, Joel Granados,
	Marcos Paulo de Souza, Mario Limonciello, Mike Rapoport,
	Petr Mladek, Rafael J . Wysocki, Steven Chen, Yan Zhao, kexec,
	linux-kernel, linux-mm

On Wed, Sep 17, 2025 at 8:22 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Tue, Sep 16, 2025 at 07:50:16PM -0700, Jason Miu wrote:
> > + * kho_order_table
> > + * +-------------------------------+--------------------+
> > + * | 0 order| 1 order| 2 order ... | HUGETLB_PAGE_ORDER |
> > + * ++------------------------------+--------------------+
> > + *  |
> > + *  |
> > + *  v
> > + * ++------+
> > + * |  Lv6  | kho_page_table
> > + * ++------+
>
> I seem to remember suggesting this could be simplified without the
> special case 7h level table table for order.
>
> Encode the phys address as:
>
> (order << 51) | (phys >> (PAGE_SHIFT + order))

Why 51 and not 52, this limits to 63bit address space, is it not?

>
> Then you don't need another table for order, the 64 bits encode
> everything consistently. Order can't be > 52 so it is
> only 6 bits, meaning the result fits into at most 57 bits.
>

Hi Jason,

Nice packing. That's a really clever bit-packing scheme to create a
unified address space.

I like the idea, but I'm trying to find the benefits compared to the
current per-order tree approach.

1. Packing adds a slight performance overhead for higher orders. With
the current approach, preserving higher order pages only requires a
3/4-level page table. With bit-packing proposal we will always have
extra loads during preserve/unpreserve operations.

2. It also adds insignificant memory overhead, as extra levels will
have a couple extra pages.

3. It slightly complicates the logic in the new kernel. Instead of
simply iterating a known tree for a specific order, the boot-time
walker would need to reconstruct the per-order subtrees, and walk
them.

Perhaps I'm missing a key benefit of the unified tree? The current
approach might not be as elegant as having everything packed into the
same page table but it seems to be OK to me, and easy to understand.

Pasha


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC v1 1/4] kho: Introduce KHO page table data structures
  2025-09-17 16:18     ` Pasha Tatashin
@ 2025-09-17 16:32       ` Jason Gunthorpe
  2025-09-19  6:49         ` Jason Miu
  0 siblings, 1 reply; 19+ messages in thread
From: Jason Gunthorpe @ 2025-09-17 16:32 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: Jason Miu, Alexander Graf, Andrew Morton, Baoquan He,
	Changyuan Lyu, David Matlack, David Rientjes, Joel Granados,
	Marcos Paulo de Souza, Mario Limonciello, Mike Rapoport,
	Petr Mladek, Rafael J . Wysocki, Steven Chen, Yan Zhao, kexec,
	linux-kernel, linux-mm

On Wed, Sep 17, 2025 at 12:18:39PM -0400, Pasha Tatashin wrote:
> On Wed, Sep 17, 2025 at 8:22 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > On Tue, Sep 16, 2025 at 07:50:16PM -0700, Jason Miu wrote:
> > > + * kho_order_table
> > > + * +-------------------------------+--------------------+
> > > + * | 0 order| 1 order| 2 order ... | HUGETLB_PAGE_ORDER |
> > > + * ++------------------------------+--------------------+
> > > + *  |
> > > + *  |
> > > + *  v
> > > + * ++------+
> > > + * |  Lv6  | kho_page_table
> > > + * ++------+
> >
> > I seem to remember suggesting this could be simplified without the
> > special case 7h level table table for order.
> >
> > Encode the phys address as:
> >
> > (order << 51) | (phys >> (PAGE_SHIFT + order))
> 
> Why 51 and not 52, this limits to 63bit address space, is it not?

Yeah, might have got the math off

> I like the idea, but I'm trying to find the benefits compared to the
> current per-order tree approach.

It is probably about half the code compared to what I see here because
everything is agressively simplified.

> 3. It slightly complicates the logic in the new kernel. Instead of
> simply iterating a known tree for a specific order, the boot-time
> walker would need to reconstruct the per-order subtrees, and walk
> them.

The core walker just runs over a range, it is easy to compute the
range.

Jason


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC v1 2/4] kho: Adopt KHO page tables and remove serialization
  2025-09-17  2:50 ` [RFC v1 2/4] kho: Adopt KHO page tables and remove serialization Jason Miu
@ 2025-09-17 17:52   ` Mike Rapoport
  2025-09-19  6:58     ` Jason Miu
  0 siblings, 1 reply; 19+ messages in thread
From: Mike Rapoport @ 2025-09-17 17:52 UTC (permalink / raw)
  To: Jason Miu
  Cc: Alexander Graf, Andrew Morton, Baoquan He, Changyuan Lyu,
	David Matlack, David Rientjes, Jason Gunthorpe, Joel Granados,
	Marcos Paulo de Souza, Mario Limonciello, Pasha Tatashin,
	Petr Mladek, Rafael J . Wysocki, Steven Chen, Yan Zhao, kexec,
	linux-kernel, linux-mm

Hi Jason,

On Tue, Sep 16, 2025 at 07:50:17PM -0700, Jason Miu wrote:
> Transition the KHO system to use the new page table data structures
> for managing preserved memory, replacing the previous xarray-based
> approach. Remove the serialization process and the associated
> finalization and abort logic.
> 
> Update the methods for marking memory to be preserved to use the KHO
> page table hierarchy. Remove the former system of tracking preserved
> pages using an xarray-based structure.
> 
> Change the method of passing preserved memory information to the next
> kernel to be direct. Instead of serializing the memory map, place the
> physical address of the `kho_order_table`, which holds the roots of
> the KHO page tables for each order, in the FDT. Remove the explicit
> `kho_finalize()` and `kho_abort()` functions and the logic supporting
> the finalize and abort states, as they are no longer needed. This
> simplifies the KHO lifecycle.
> 
> Enable the next kernel's initialization process to read the
> `kho_order_table` address from the FDT. The kernel will then traverse
> the KHO page table structures to discover all preserved memory
> regions, reserving them to prevent early boot-time allocators from
> overwriting them.
> 
> This architectural shift to using a shared page table structure
> simplifies the KHO design and eliminates the overhead of serializing
> and deserializing the preserved memory map.
> 
> Signed-off-by: Jason Miu <jasonmiu@google.com>
> ---
>  include/linux/kexec_handover.h |  17 --
>  kernel/kexec_handover.c        | 532 +++++----------------------------
>  2 files changed, 71 insertions(+), 478 deletions(-)
>  
> -/*
> - * TODO: __maybe_unused is added to the functions:
> - * kho_preserve_page_table()
> - * kho_walk_tables()
> - * kho_memblock_reserve()
> - * since they are not actually being called in this change.
> - * __maybe_unused will be removed in the next patch.
> - */
> -static __maybe_unused int kho_preserve_page_table(unsigned long pfn, int order)
> +static int kho_preserve_page_table(unsigned long pfn, int order)

Just merge this and the previous patch so that the patch will replace the
current preservation mechanism with a new one.

>  {
>  	unsigned long pa = PFN_PHYS(pfn);
>  
> @@ -365,8 +357,8 @@ static int __kho_walk_page_tables(int order, int level,
>  	return 0;
>  }
>  

...

> @@ -1023,12 +752,8 @@ int kho_preserve_folio(struct folio *folio)
>  {
>  	const unsigned long pfn = folio_pfn(folio);
>  	const unsigned int order = folio_order(folio);
> -	struct kho_mem_track *track = &kho_out.ser.track;
> -
> -	if (kho_out.finalized)
> -		return -EBUSY;
>  
> -	return __kho_preserve_order(track, pfn, order);
> +	return kho_preserve_page_table(pfn, order);

I don't think we should "rename" __kho_preserve_order() to
kho_preserve_page_table(). __kho_preserve_order() could use the new data
structure, or call the new implementation, but I don't see a reason to
replace it.

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC v1 1/4] kho: Introduce KHO page table data structures
  2025-09-17 16:32       ` Jason Gunthorpe
@ 2025-09-19  6:49         ` Jason Miu
  2025-09-19 12:56           ` Jason Gunthorpe
  0 siblings, 1 reply; 19+ messages in thread
From: Jason Miu @ 2025-09-19  6:49 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Pasha Tatashin, Alexander Graf, Andrew Morton, Baoquan He,
	Changyuan Lyu, David Matlack, David Rientjes, Joel Granados,
	Marcos Paulo de Souza, Mario Limonciello, Mike Rapoport,
	Petr Mladek, Rafael J . Wysocki, Steven Chen, Yan Zhao, kexec,
	linux-kernel, linux-mm

Hi Jason,

On Wed, Sep 17, 2025 at 9:32 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Wed, Sep 17, 2025 at 12:18:39PM -0400, Pasha Tatashin wrote:
> > On Wed, Sep 17, 2025 at 8:22 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
> > >
> > > On Tue, Sep 16, 2025 at 07:50:16PM -0700, Jason Miu wrote:
> > > > + * kho_order_table
> > > > + * +-------------------------------+--------------------+
> > > > + * | 0 order| 1 order| 2 order ... | HUGETLB_PAGE_ORDER |
> > > > + * ++------------------------------+--------------------+
> > > > + *  |
> > > > + *  |
> > > > + *  v
> > > > + * ++------+
> > > > + * |  Lv6  | kho_page_table
> > > > + * ++------+
> > >
> > > I seem to remember suggesting this could be simplified without the
> > > special case 7h level table table for order.
> > >
> > > Encode the phys address as:
> > >
> > > (order << 51) | (phys >> (PAGE_SHIFT + order))
> >
> > Why 51 and not 52, this limits to 63bit address space, is it not?
>
> Yeah, might have got the math off
>
> > I like the idea, but I'm trying to find the benefits compared to the
> > current per-order tree approach.
>
> It is probably about half the code compared to what I see here because
> everything is agressively simplified.

Thank you very much for providing feedback to me, and I think this is
a very smart idea.

> > 3. It slightly complicates the logic in the new kernel. Instead of
> > simply iterating a known tree for a specific order, the boot-time
> > walker would need to reconstruct the per-order subtrees, and walk
> > them.
>
> The core walker just runs over a range, it is easy to compute the
> range.

I believe the "range" here refers to the specific portion of the tree
relevant to the `target_order` being restored, while the
`target_order` is the variable from 0 to MAX_PAGE_ORDER to be used in
the tree walk in the new kernel.

My current understanding of the walker for a given `target_order`:

  1. Find the `start_level` from the `target_order`. (for example,
target_order = 10, start_level = 4)
  2. The path from the root down to the level above `start_level` is
fixed (index 0 at each of these levels).
  3. At `start_level`, the index is also fixed, by (1 << (63 -
PAGE_SHIFT - order)) in a 9 bit slice.
  4. Then, for all levels *below* `order_level`, the walker iterates
through all 512 table entries, until the bitmap level.

so the "range" is the subtrees under the start_level, is my
understanding correct?

--
Jason Miu


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC v1 2/4] kho: Adopt KHO page tables and remove serialization
  2025-09-17 17:52   ` Mike Rapoport
@ 2025-09-19  6:58     ` Jason Miu
  0 siblings, 0 replies; 19+ messages in thread
From: Jason Miu @ 2025-09-19  6:58 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Alexander Graf, Andrew Morton, Baoquan He, Changyuan Lyu,
	David Matlack, David Rientjes, Jason Gunthorpe, Joel Granados,
	Marcos Paulo de Souza, Mario Limonciello, Pasha Tatashin,
	Petr Mladek, Rafael J . Wysocki, Steven Chen, Yan Zhao, kexec,
	linux-kernel, linux-mm

Hi Mike,

Thank you very much for your comments!

On Wed, Sep 17, 2025 at 10:52 AM Mike Rapoport <rppt@kernel.org> wrote:
> > -/*
> > - * TODO: __maybe_unused is added to the functions:
> > - * kho_preserve_page_table()
> > - * kho_walk_tables()
> > - * kho_memblock_reserve()
> > - * since they are not actually being called in this change.
> > - * __maybe_unused will be removed in the next patch.
> > - */
> > -static __maybe_unused int kho_preserve_page_table(unsigned long pfn, int order)
> > +static int kho_preserve_page_table(unsigned long pfn, int order)
>
> Just merge this and the previous patch so that the patch will replace the
> current preservation mechanism with a new one.

Sure I can do this.

> > @@ -1023,12 +752,8 @@ int kho_preserve_folio(struct folio *folio)
> >  {
> >       const unsigned long pfn = folio_pfn(folio);
> >       const unsigned int order = folio_order(folio);
> > -     struct kho_mem_track *track = &kho_out.ser.track;
> > -
> > -     if (kho_out.finalized)
> > -             return -EBUSY;
> >
> > -     return __kho_preserve_order(track, pfn, order);
> > +     return kho_preserve_page_table(pfn, order);
>
> I don't think we should "rename" __kho_preserve_order() to
> kho_preserve_page_table(). __kho_preserve_order() could use the new data
> structure, or call the new implementation, but I don't see a reason to
> replace it.
>

Ok, I prefer calling the new implementation, so it will look like:

kho_preserve_folio() -> __kho_preserve_order() -> __kho_preserve_page_table()

 __kho_preserve_page_table() is the internal implementation of
kho_preserve_page_table() and we can remove the
kho_preserve_page_table().

--
Jason Miu


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC v1 1/4] kho: Introduce KHO page table data structures
  2025-09-19  6:49         ` Jason Miu
@ 2025-09-19 12:56           ` Jason Gunthorpe
  0 siblings, 0 replies; 19+ messages in thread
From: Jason Gunthorpe @ 2025-09-19 12:56 UTC (permalink / raw)
  To: Jason Miu
  Cc: Pasha Tatashin, Alexander Graf, Andrew Morton, Baoquan He,
	Changyuan Lyu, David Matlack, David Rientjes, Joel Granados,
	Marcos Paulo de Souza, Mario Limonciello, Mike Rapoport,
	Petr Mladek, Rafael J . Wysocki, Steven Chen, Yan Zhao, kexec,
	linux-kernel, linux-mm

>   1. Find the `start_level` from the `target_order`. (for example,
> target_order = 10, start_level = 4)
>   2. The path from the root down to the level above `start_level` is
> fixed (index 0 at each of these levels).
>   3. At `start_level`, the index is also fixed, by (1 << (63 -
> PAGE_SHIFT - order)) in a 9 bit slice.
>   4. Then, for all levels *below* `order_level`, the walker iterates
> through all 512 table entries, until the bitmap level.

You don't need any special logic like that, that is my point, the
whole thing is very simple:

static int get_index(unsigned int level, u64 pos)
{
	return (pos / (level * ITEMS_PER_TABLE * ITEMS_PER_BITMAP)) %
	       ITEMS_PER_TABLE;
}

walk_table(u64 *table, unsigned int level, u64 start, u64 last)
{
	unsigned int index = get_index(level, start);
	unsigned int last_index = get_index(level, last);

	do {
		if (table[index]) {
			u64 *next_table = phys_to_virt(table[index]);

			if (level == 1)
				walk_bitmap(next_table);
			else
				walk_table(next_table, level - 1, start, last);
		}
		index++;
	} while (index <= last_index);
}

insert_table(u64 *table, unsigned int level, u64 pos)
{
	unsigned int index = get_index(level, start);
	u64 *next_table;

	if (!table[index]) {
		// allocate table[index]
	}
	else
		next_table = phys_to_virt(table[index]);
	if (level == 1)
		insert_bitmap(next_table, pos);
	else
		insert_table(next_table, level - 1, pos);
}

That's it.. No special cases requried.

The above is very limited, it only works with certain formulations
of start/last:
   start has only one bit set
   start & last == true,
   last ^ start has bits 0 -> N set N > log2(ITEMS_PER_BITMAP)

Which align to my suggestion for encoding.

Jason


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC v1 0/4] Make KHO Stateless
  2025-09-17 11:36 ` [RFC v1 0/4] Make KHO Stateless Jason Gunthorpe
  2025-09-17 14:48   ` Pasha Tatashin
@ 2025-09-21 22:26   ` Matthew Wilcox
  2025-09-21 23:07     ` Pasha Tatashin
  1 sibling, 1 reply; 19+ messages in thread
From: Matthew Wilcox @ 2025-09-21 22:26 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jason Miu, Alexander Graf, Andrew Morton, Baoquan He,
	Changyuan Lyu, David Matlack, David Rientjes, Joel Granados,
	Marcos Paulo de Souza, Mario Limonciello, Mike Rapoport,
	Pasha Tatashin, Petr Mladek, Rafael J . Wysocki, Steven Chen,
	Yan Zhao, kexec, linux-kernel, linux-mm

On Wed, Sep 17, 2025 at 08:36:09AM -0300, Jason Gunthorpe wrote:
> On Tue, Sep 16, 2025 at 07:50:15PM -0700, Jason Miu wrote:
> > This series transitions KHO from an xarray-based metadata tracking
> > system with serialization to using page table like data structures
> > that can be passed directly to the next kernel.
> > 
> > The key motivations for this change are to:
> > - Eliminate the need for data serialization before kexec.
> > - Remove the former KHO state machine by deprecating the finalize
> >   and abort states.
> > - Pass preservation metadata more directly to the next kernel via the FDT.
> > 
> > The new approach uses a per-order page table structure (kho_order_table,
> > kho_page_table, kho_bitmap_table) to mark preserved pages. The physical
> > address of the root `kho_order_table` is passed in the FDT, allowing the
> > next kernel to reconstruct the preserved memory map.
> 
> It is not a "page table" structure, it is just a radix tree with bits
> as the leaf.

Sounds like the IDA data structure.  Maybe that API needs to be enhanced
for this use case, but surely using the same data structure would be a
good thing?


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC v1 0/4] Make KHO Stateless
  2025-09-21 22:26   ` Matthew Wilcox
@ 2025-09-21 23:07     ` Pasha Tatashin
  0 siblings, 0 replies; 19+ messages in thread
From: Pasha Tatashin @ 2025-09-21 23:07 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jason Gunthorpe, Jason Miu, Alexander Graf, Andrew Morton,
	Baoquan He, Changyuan Lyu, David Matlack, David Rientjes,
	Joel Granados, Marcos Paulo de Souza, Mario Limonciello,
	Mike Rapoport, Petr Mladek, Rafael J . Wysocki, Steven Chen,
	Yan Zhao, kexec, linux-kernel, linux-mm

On Sun, Sep 21, 2025 at 6:26 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Wed, Sep 17, 2025 at 08:36:09AM -0300, Jason Gunthorpe wrote:
> > On Tue, Sep 16, 2025 at 07:50:15PM -0700, Jason Miu wrote:
> > > This series transitions KHO from an xarray-based metadata tracking
> > > system with serialization to using page table like data structures
> > > that can be passed directly to the next kernel.
> > >
> > > The key motivations for this change are to:
> > > - Eliminate the need for data serialization before kexec.
> > > - Remove the former KHO state machine by deprecating the finalize
> > >   and abort states.
> > > - Pass preservation metadata more directly to the next kernel via the FDT.
> > >
> > > The new approach uses a per-order page table structure (kho_order_table,
> > > kho_page_table, kho_bitmap_table) to mark preserved pages. The physical
> > > address of the root `kho_order_table` is passed in the FDT, allowing the
> > > next kernel to reconstruct the preserved memory map.
> >
> > It is not a "page table" structure, it is just a radix tree with bits
> > as the leaf.
>
> Sounds like the IDA data structure.  Maybe that API needs to be enhanced
> for this use case, but surely using the same data structure would be a
> good thing?

Normally, I would agree, but in this case, this has to be a simple
data structure that, in the long run, is going to be stable between
different kernel versions: the old and the next kernel must understand
it. Therefore, relying on any external data structure would require
the maintainers and other developers to be aware of this rather
unusual kernel requirement. So, I think it is much better to keep this
implementation private to KHO, whose only responsibility is reliably
passing memory pages from the old kernel to the next kernel.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC v1 0/4] Make KHO Stateless
  2025-09-17  2:50 [RFC v1 0/4] Make KHO Stateless Jason Miu
                   ` (4 preceding siblings ...)
  2025-09-17 11:36 ` [RFC v1 0/4] Make KHO Stateless Jason Gunthorpe
@ 2025-09-25  9:19 ` Mike Rapoport
  2025-09-25 12:27   ` Pratyush Yadav
  5 siblings, 1 reply; 19+ messages in thread
From: Mike Rapoport @ 2025-09-25  9:19 UTC (permalink / raw)
  To: Jason Miu
  Cc: Alexander Graf, Andrew Morton, Baoquan He, Changyuan Lyu,
	David Matlack, David Rientjes, Jason Gunthorpe, Joel Granados,
	Marcos Paulo de Souza, Mario Limonciello, Pasha Tatashin,
	Petr Mladek, Rafael J . Wysocki, Steven Chen, Yan Zhao, kexec,
	linux-kernel, linux-mm

Hi Jason,

On Tue, Sep 16, 2025 at 07:50:15PM -0700, Jason Miu wrote:
> This series transitions KHO from an xarray-based metadata tracking
> system with serialization to using page table like data structures
> that can be passed directly to the next kernel.
> 
> The key motivations for this change are to:
> - Eliminate the need for data serialization before kexec.
> - Remove the former KHO state machine by deprecating the finalize
>   and abort states.
> - Pass preservation metadata more directly to the next kernel via the FDT.

If we pass the preservation metadata directly between the kernels, it means
that any change to that data structure will break compatibility between the
new and old kernels. With serialization this is less severe because a more
recent kernel can relatively easy have backward compatible deserialization.

I'm all for removing KHO state machine, but that does not necessarily mean
we must remove the serialization of memory persistence metadata?

For example, we can do the serialization at kernel_kexec() time and if we
want to avoid memory allocations there we might preallocate pages for
khoser_mem_chunk's as amount of bitmaps grow.

It also would be interesting to see how much time is saved if we remove the
serialization.

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC v1 0/4] Make KHO Stateless
  2025-09-25  9:19 ` Mike Rapoport
@ 2025-09-25 12:27   ` Pratyush Yadav
  2025-09-25 12:33     ` Jason Gunthorpe
  0 siblings, 1 reply; 19+ messages in thread
From: Pratyush Yadav @ 2025-09-25 12:27 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Jason Miu, Alexander Graf, Andrew Morton, Baoquan He,
	Changyuan Lyu, David Matlack, David Rientjes, Jason Gunthorpe,
	Joel Granados, Marcos Paulo de Souza, Mario Limonciello,
	Pasha Tatashin, Petr Mladek, Rafael J . Wysocki, Steven Chen,
	Yan Zhao, kexec, linux-kernel, linux-mm

On Thu, Sep 25 2025, Mike Rapoport wrote:

> Hi Jason,
>
> On Tue, Sep 16, 2025 at 07:50:15PM -0700, Jason Miu wrote:
>> This series transitions KHO from an xarray-based metadata tracking
>> system with serialization to using page table like data structures
>> that can be passed directly to the next kernel.
>> 
>> The key motivations for this change are to:
>> - Eliminate the need for data serialization before kexec.
>> - Remove the former KHO state machine by deprecating the finalize
>>   and abort states.
>> - Pass preservation metadata more directly to the next kernel via the FDT.
>
> If we pass the preservation metadata directly between the kernels, it means
> that any change to that data structure will break compatibility between the
> new and old kernels. With serialization this is less severe because a more
> recent kernel can relatively easy have backward compatible deserialization.
>
> I'm all for removing KHO state machine, but that does not necessarily mean
> we must remove the serialization of memory persistence metadata?

I think the tables should be treated as the final serialized data
structure, and should get all the same properties that other KHO
serialization formats have like stable binary format, versioning, etc.

It just so happens that the table format lends itself very well to being
serialized on-the-go. When a page is marked as preserved during normal
operation, it is very simple to just allocate all the intermediate
levels and mark the page as reserved. There is no further processing
needed to "serialize" it -- like we need to do with the bitmaps today.

So I don't really see why we should introduce an intermediate processing
step when it is easy to just directly build the serialized data
structure during normal operation.

>
> For example, we can do the serialization at kernel_kexec() time and if we
> want to avoid memory allocations there we might preallocate pages for
> khoser_mem_chunk's as amount of bitmaps grow.
>
> It also would be interesting to see how much time is saved if we remove the
> serialization.

-- 
Regards,
Pratyush Yadav


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC v1 0/4] Make KHO Stateless
  2025-09-25 12:27   ` Pratyush Yadav
@ 2025-09-25 12:33     ` Jason Gunthorpe
  0 siblings, 0 replies; 19+ messages in thread
From: Jason Gunthorpe @ 2025-09-25 12:33 UTC (permalink / raw)
  To: Pratyush Yadav
  Cc: Mike Rapoport, Jason Miu, Alexander Graf, Andrew Morton,
	Baoquan He, Changyuan Lyu, David Matlack, David Rientjes,
	Joel Granados, Marcos Paulo de Souza, Mario Limonciello,
	Pasha Tatashin, Petr Mladek, Rafael J . Wysocki, Steven Chen,
	Yan Zhao, kexec, linux-kernel, linux-mm

On Thu, Sep 25, 2025 at 02:27:06PM +0200, Pratyush Yadav wrote:
> I think the tables should be treated as the final serialized data
> structure, and should get all the same properties that other KHO
> serialization formats have like stable binary format, versioning, etc.

Right, that's how I see it too

Jason


^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2025-09-25 12:33 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-09-17  2:50 [RFC v1 0/4] Make KHO Stateless Jason Miu
2025-09-17  2:50 ` [RFC v1 1/4] kho: Introduce KHO page table data structures Jason Miu
2025-09-17 12:21   ` Jason Gunthorpe
2025-09-17 16:18     ` Pasha Tatashin
2025-09-17 16:32       ` Jason Gunthorpe
2025-09-19  6:49         ` Jason Miu
2025-09-19 12:56           ` Jason Gunthorpe
2025-09-17  2:50 ` [RFC v1 2/4] kho: Adopt KHO page tables and remove serialization Jason Miu
2025-09-17 17:52   ` Mike Rapoport
2025-09-19  6:58     ` Jason Miu
2025-09-17  2:50 ` [RFC v1 3/4] memblock: Remove KHO notifier usage Jason Miu
2025-09-17  2:50 ` [RFC v1 4/4] kho: Remove notifier system infrastructure Jason Miu
2025-09-17 11:36 ` [RFC v1 0/4] Make KHO Stateless Jason Gunthorpe
2025-09-17 14:48   ` Pasha Tatashin
2025-09-21 22:26   ` Matthew Wilcox
2025-09-21 23:07     ` Pasha Tatashin
2025-09-25  9:19 ` Mike Rapoport
2025-09-25 12:27   ` Pratyush Yadav
2025-09-25 12:33     ` Jason Gunthorpe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).