* [PATCH v5 00/16] kexec: introduce Kexec HandOver (KHO)
@ 2025-03-20 1:55 Changyuan Lyu
2025-03-20 1:55 ` [PATCH v5 01/16] kexec: define functions to map and unmap segments Changyuan Lyu
` (16 more replies)
0 siblings, 17 replies; 103+ messages in thread
From: Changyuan Lyu @ 2025-03-20 1:55 UTC (permalink / raw)
To: linux-kernel
Cc: graf, akpm, luto, anthony.yznaga, arnd, ashish.kalra, benh, bp,
catalin.marinas, dave.hansen, dwmw2, ebiederm, mingo, jgowans,
corbet, krzk, rppt, mark.rutland, pbonzini, pasha.tatashin, hpa,
peterz, ptyadav, robh+dt, robh, saravanak, skinsburskii, rostedt,
tglx, thomas.lendacky, usama.arif, will, devicetree, kexec,
linux-arm-kernel, linux-doc, linux-mm, x86, Changyuan Lyu
Hi,
This is the next version of Alexander Graf and Mike Rapoport's
"kexec: introduce Kexec HandOver (KHO)" series
(https://lore.kernel.org/all/20250206132754.2596694-1-rppt@kernel.org/),
with bitmaps for preserving folios and address ranges, new hashtable-based
KHO state tree API, and reduced blackout window with kexec_file_load.
The patches are also available in git:
https://github.com/googleprodkernel/linux-liveupdate/tree/kho/v5
v4 -> v5:
- New: Preserve folios and address ranges in bitmaps [1]. Removed the
`mem` property.
- New: Hash table based API for manipulating the KHO state tree.
- Change the concept of "active phase" to "finalization phase". KHO
users can add data to or remove data from the KHO DT at any time before
the finalization phase.
- Decouple kexec_file_load and KHO FDT creation. kexec_file_load can be
done before KHO FDT is created.
- Update the example use case (reserve_mem) using the new KHO API,
replace underscores with dashes in reserve-mem fdt generation.
- Drop the YAMLs for now and add a brief description of KHO FDT before
KHO schema is stable.
- Move all sysfs interfaces to debugfs.
- Fixed the memblock test failure reported in [2].
- Incorporate fix for kho_locate_mem_hole() with !CONFIG_KEXEC_HANDOVER
[3] into "kexec: Add KHO support to kexec file loads".
[1] https://lore.kernel.org/all/20250212152336.GA3848889@nvidia.com/
[2] https://lore.kernel.org/all/20250217040448.56xejbvsr2a73h4c@master/
[3] https://lore.kernel.org/all/20250214125402.90709-1-sourabhjain@linux.ibm.com/
= Original cover letter =
Kexec today considers itself purely a boot loader: When we enter the new
kernel, any state the previous kernel left behind is irrelevant and the
new kernel reinitializes the system.
However, there are use cases where this mode of operation is not what we
actually want. In virtualization hosts, for example, we want to use kexec
to update the host kernel while virtual machine memory stays untouched.
When we add device assignment to the mix, we also need to ensure that
IOMMU and VFIO states are untouched. If we add PCIe peer to peer DMA, we
need to do the same for the PCI subsystem. If we want to kexec while an
SEV-SNP enabled virtual machine is running, we need to preserve the VM
context pages and physical memory. See "pkernfs: Persisting guest memory
and kernel/device state safely across kexec" Linux Plumbers
Conference 2023 presentation for details:
https://lpc.events/event/17/contributions/1485/
To start us on the journey to support all the use cases above, this patch
set implements the basic infrastructure to allow handover of kernel state
across kexec (Kexec HandOver, aka KHO). As a really simple example target, we use
memblock's reserve_mem.
With this patch set applied, memory that was reserved using "reserve_mem"
command line options remains intact after kexec and it is guaranteed to
reside at the same physical address.
== Alternatives ==
There are alternative approaches to (parts of) the problems above:
* Memory Pools [1] - preallocated persistent memory region + allocator
* PRMEM [2] - resizable persistent memory regions with fixed metadata
pointer on the kernel command line + allocator
* Pkernfs [3] - preallocated file system for in-kernel data with fixed
address location on the kernel command line
* PKRAM [4] - handover of user space pages using a fixed metadata page
specified via command line
All of the approaches above fundamentally have the same problem: They
require the administrator to explicitly carve out a physical memory
location because they have no mechanism outside of the kernel command
line to pass data (including memory reservations) between kexec'ing
kernels.
KHO provides that base foundation. We will determine later whether we
still need any of the approaches above for fast bulk memory handover of,
for example, IOMMU page tables. But IMHO they would all be users of KHO,
with KHO providing the foundational primitive to pass metadata and bulk
memory reservations as well as providing easy versioning for the data.
== Overview ==
We introduce a metadata file that the kernels pass between each other. How
they pass it is architecture specific. The file's format is a Flattened
Device Tree (FDT), which already has a generator and parser included in
Linux. KHO is enabled with the `kho=on` kernel command line parameter.
Drivers can add data to and remove data from the KHO root state tree
(represented as hash tables) at any time. When the state tree is converted
to an FDT (automatically by the kernel in the case of kexec_file_load, or
manually by user space through debugfs kho/out/finalize), the kernel
invokes callbacks in every driver that supports KHO so it can serialize
its state. When the actual kexec happens, the FDT is part of the image
set that we boot into. In addition, we keep "scratch regions" available
for kexec: physically contiguous memory regions that are guaranteed not
to contain any memory that KHO would preserve. The new kernel bootstraps
itself using the scratch regions and marks all handed over memory as in use.
When drivers that support KHO initialize, they introspect the FDT and
recover their state from it. This includes memory reservations, where the
driver can either discard or claim reservations.
== Limitations ==
Currently KHO is only implemented for file based kexec. The kernel
interfaces in the patch set are already in place to support user space
kexec as well, but it is not yet implemented in kexec-tools.
== How to Use ==
To use the code, please boot the kernel with the "kho=on" command line
parameter.
KHO will automatically create scratch regions. If you want to set the
scratch size explicitly you can use "kho_scratch=" command line parameter.
For instance, "kho_scratch=16M,512M,256M" will reserve a 16 MiB low
memory scratch area, a 512 MiB global scratch region, and 256 MiB
per NUMA node scratch regions on boot.
Make sure to have a reserved memory range requested with the reserve_mem
command line option. Then invoke a file based "kexec -l":
# kexec -l Image --initrd=initrd -s
# kexec -e
The new kernel will boot up and contain the previous kernel's reserve_mem
contents at the same physical address as the first kernel.
Optionally, you can finalize the KHO FDT early by
# echo 1 > /sys/kernel/debug/kho/out/finalize
which allows you to preview the FDT to be passed to the next kernel at
/sys/kernel/debug/kho/out/fdt.
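For reference, an end-to-end flow could look like the following; the
reserve_mem value below is only an illustration, use whatever fits your
setup:
  (first kernel booted with "kho=on kho_scratch=16M,512M,256M reserve_mem=64M:4096:example")
  # kexec -l Image --initrd=initrd -s
  # echo 1 > /sys/kernel/debug/kho/out/finalize    (optional: finalize early)
  # cat /sys/kernel/debug/kho/out/fdt > kho.dtb    (optional: preview the FDT)
  # kexec -e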
== Changelog ==
v3 -> v4:
- Major rework of scratch management. Rather than forcing scratch memory
allocations only very early in boot, we now rely on scratch for all
memblock allocations.
- Use a simple example use case (reserve_mem instead of ftrace)
- merge all KHO functionality into a single kernel/kexec_handover.c file
- rename CONFIG_KEXEC_KHO to CONFIG_KEXEC_HANDOVER
v2 -> v3:
- Fix make dt_binding_check
- Add descriptions for each object
- s/trace_flags/trace-flags/
- s/global_trace/global-trace/
- Make all additionalProperties false
- Change subject to reflect subsystem (dt-bindings)
- Fix indentation
- Remove superfluous examples
- Convert to 64bit syntax
- Move to kho directory
- s/"global_trace"/"global-trace"/
- s/"global_trace"/"global-trace"/
- s/"trace_flags"/"trace-flags"/
- Fix wording
- Add Documentation to MAINTAINERS file
- Remove kho reference on read error
- Move handover_dt unmap up
- s/reserve_scratch_mem/mark_phys_as_cma/
- Remove ifdeffery
- Remove superfluous comment
v1 -> v2:
- Removed: tracing: Introduce names for ring buffers
- Removed: tracing: Introduce names for events
- New: kexec: Add config option for KHO
- New: kexec: Add documentation for KHO
- New: tracing: Initialize fields before registering
- New: devicetree: Add bindings for ftrace KHO
- test bot warning fixes
- Change kconfig option to ARCH_SUPPORTS_KEXEC_KHO
- s/kho_reserve_mem/kho_reserve_previous_mem/g
- s/kho_reserve/kho_reserve_scratch/g
- Remove / reduce ifdefs
- Select crc32
- Leave anything that requires a name in trace.c to keep buffers
unnamed entities
- Put events as array into a property, use fingerprint instead of
names to identify them
- Reduce footprint without CONFIG_FTRACE_KHO
- s/kho_reserve_mem/kho_reserve_previous_mem/g
- make kho_get_fdt() const
- Add stubs for return_mem and claim_mem
- make kho_get_fdt() const
- Get events as array from a property, use fingerprint instead of
names to identify events
- Change kconfig option to ARCH_SUPPORTS_KEXEC_KHO
- s/kho_reserve_mem/kho_reserve_previous_mem/g
- s/kho_reserve/kho_reserve_scratch/g
- Leave the node generation code that needs to know the name in
trace.c so that ring buffers can stay anonymous
- s/kho_reserve/kho_reserve_scratch/g
- Move kho enums out of ifdef
- Move from names to fdt offsets. That way, trace.c can find the trace
array offset and then the ring buffer code only needs to read out
its per-CPU data. That way it can stay oblivious to its name.
- Make kho_get_fdt() const
Alexander Graf (9):
memblock: Add support for scratch memory
kexec: add Kexec HandOver (KHO) generation helpers
kexec: add KHO parsing support
kexec: add KHO support to kexec file loads
kexec: add config option for KHO
arm64: add KHO support
x86: add KHO support
memblock: add KHO support for reserve_mem
Documentation: add documentation for KHO
Changyuan Lyu (1):
hashtable: add macro HASHTABLE_INIT
Mike Rapoport (Microsoft) (5):
mm/mm_init: rename init_reserved_page to init_deferred_page
memblock: add MEMBLOCK_RSRV_KERN flag
memblock: introduce memmap_init_kho_scratch()
kexec: enable KHO support for memory preservation
x86/setup: use memblock_reserve_kern for memory used by kernel
steven chen (1):
kexec: define functions to map and unmap segments
.../admin-guide/kernel-parameters.txt | 25 +
Documentation/kho/concepts.rst | 70 +
Documentation/kho/fdt.rst | 62 +
Documentation/kho/index.rst | 14 +
Documentation/kho/usage.rst | 118 ++
Documentation/subsystem-apis.rst | 1 +
MAINTAINERS | 3 +-
arch/arm64/Kconfig | 3 +
arch/x86/Kconfig | 3 +
arch/x86/boot/compressed/kaslr.c | 52 +-
arch/x86/include/asm/setup.h | 4 +
arch/x86/include/uapi/asm/setup_data.h | 13 +-
arch/x86/kernel/e820.c | 18 +
arch/x86/kernel/kexec-bzimage64.c | 36 +
arch/x86/kernel/setup.c | 41 +-
arch/x86/realmode/init.c | 2 +
drivers/of/fdt.c | 33 +
drivers/of/kexec.c | 37 +
include/linux/hashtable.h | 7 +-
include/linux/kexec.h | 12 +
include/linux/kexec_handover.h | 202 ++
include/linux/memblock.h | 41 +-
kernel/Kconfig.kexec | 15 +
kernel/Makefile | 1 +
kernel/kexec_core.c | 58 +
kernel/kexec_file.c | 19 +
kernel/kexec_handover.c | 1755 +++++++++++++++++
kernel/kexec_internal.h | 18 +
mm/Kconfig | 4 +
mm/internal.h | 2 +
mm/memblock.c | 303 ++-
mm/mm_init.c | 19 +-
tools/testing/memblock/tests/alloc_api.c | 22 +-
.../memblock/tests/alloc_helpers_api.c | 4 +-
tools/testing/memblock/tests/alloc_nid_api.c | 20 +-
35 files changed, 2988 insertions(+), 49 deletions(-)
create mode 100644 Documentation/kho/concepts.rst
create mode 100644 Documentation/kho/fdt.rst
create mode 100644 Documentation/kho/index.rst
create mode 100644 Documentation/kho/usage.rst
create mode 100644 include/linux/kexec_handover.h
create mode 100644 kernel/kexec_handover.c
base-commit: a7f2e10ecd8f18b83951b0bab47ddaf48f93bf47
--
2.48.1.711.g2feabab25a-goog
^ permalink raw reply [flat|nested] 103+ messages in thread
* [PATCH v5 01/16] kexec: define functions to map and unmap segments
2025-03-20 1:55 [PATCH v5 00/16] kexec: introduce Kexec HandOver (KHO) Changyuan Lyu
@ 2025-03-20 1:55 ` Changyuan Lyu
2025-03-20 1:55 ` [PATCH v5 02/16] mm/mm_init: rename init_reserved_page to init_deferred_page Changyuan Lyu
` (15 subsequent siblings)
16 siblings, 0 replies; 103+ messages in thread
From: Changyuan Lyu @ 2025-03-20 1:55 UTC (permalink / raw)
To: linux-kernel
Cc: graf, akpm, luto, anthony.yznaga, arnd, ashish.kalra, benh, bp,
catalin.marinas, dave.hansen, dwmw2, ebiederm, mingo, jgowans,
corbet, krzk, rppt, mark.rutland, pbonzini, pasha.tatashin, hpa,
peterz, ptyadav, robh+dt, robh, saravanak, skinsburskii, rostedt,
tglx, thomas.lendacky, usama.arif, will, devicetree, kexec,
linux-arm-kernel, linux-doc, linux-mm, x86, steven chen,
Tushar Sugandhi, Changyuan Lyu
From: steven chen <chenste@linux.microsoft.com>
Currently, the mechanism to map and unmap segments to the kimage
structure is not available to subsystems outside of kexec. This
functionality is needed when IMA allocates the memory segments during
the kexec 'load' operation. Implement functions to map and unmap
segments to the kimage.
Implement kimage_map_segment() to enable mapping of IMA buffer source
pages to the kimage structure after the kexec 'load'. This function
takes a kimage pointer, an address, and a size; it gathers the source
pages within the specified address range, creates an array of page
pointers, and maps these to a contiguous virtual address range. The
function returns the start of this range on success, or NULL on
failure.
Implement kimage_unmap_segment() for unmapping segments using vunmap().
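For illustration, a hypothetical caller on the IMA side could use the new
helpers roughly as below; only kimage_map_segment() and
kimage_unmap_segment() come from this patch, the helper name and error
handling are made up:
	/* copy @len bytes of @data into an already loaded kexec segment */
	static int example_fill_segment(struct kimage *image,
					unsigned long segment_addr,
					unsigned long segment_size,
					const void *data, size_t len)
	{
		void *buf;
		/* map the segment's source pages into one contiguous VA range */
		buf = kimage_map_segment(image, segment_addr, segment_size);
		if (!buf)
			return -ENOMEM;
		memcpy(buf, data, min_t(size_t, len, segment_size));
		kimage_unmap_segment(buf);
		return 0;
	}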
Signed-off-by: Tushar Sugandhi <tusharsu@linux.microsoft.com>
Signed-off-by: steven chen <chenste@linux.microsoft.com>
Co-developed-by: Changyuan Lyu <changyuanl@google.com>
Signed-off-by: Changyuan Lyu <changyuanl@google.com>
---
include/linux/kexec.h | 5 ++++
kernel/kexec_core.c | 54 +++++++++++++++++++++++++++++++++++++++++++
2 files changed, 59 insertions(+)
diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index f0e9f8eda7a3..fad04f3bcf1d 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -467,6 +467,8 @@ extern bool kexec_file_dbg_print;
#define kexec_dprintk(fmt, arg...) \
do { if (kexec_file_dbg_print) pr_info(fmt, ##arg); } while (0)
+void *kimage_map_segment(struct kimage *image, unsigned long addr, unsigned long size);
+void kimage_unmap_segment(void *buffer);
#else /* !CONFIG_KEXEC_CORE */
struct pt_regs;
struct task_struct;
@@ -474,6 +476,9 @@ static inline void __crash_kexec(struct pt_regs *regs) { }
static inline void crash_kexec(struct pt_regs *regs) { }
static inline int kexec_should_crash(struct task_struct *p) { return 0; }
static inline int kexec_crash_loaded(void) { return 0; }
+static inline void *kimage_map_segment(struct kimage *image, unsigned long addr, unsigned long size)
+{ return NULL; }
+static inline void kimage_unmap_segment(void *buffer) { }
#define kexec_in_progress false
#endif /* CONFIG_KEXEC_CORE */
diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index c0bdc1686154..640d252306ea 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -867,6 +867,60 @@ int kimage_load_segment(struct kimage *image,
return result;
}
+void *kimage_map_segment(struct kimage *image,
+ unsigned long addr, unsigned long size)
+{
+ unsigned long eaddr = addr + size;
+ unsigned long src_page_addr, dest_page_addr;
+ unsigned int npages;
+ struct page **src_pages;
+ int i;
+ kimage_entry_t *ptr, entry;
+ void *vaddr = NULL;
+
+ /*
+ * Collect the source pages and map them in a contiguous VA range.
+ */
+ npages = PFN_UP(eaddr) - PFN_DOWN(addr);
+ src_pages = kvmalloc_array(npages, sizeof(*src_pages), GFP_KERNEL);
+ if (!src_pages) {
+ pr_err("Could not allocate source pages array for destination %lx.\n", addr);
+ return NULL;
+ }
+
+ i = 0;
+ for_each_kimage_entry(image, ptr, entry) {
+ if (entry & IND_DESTINATION) {
+ dest_page_addr = entry & PAGE_MASK;
+ } else if (entry & IND_SOURCE) {
+ if (dest_page_addr >= addr && dest_page_addr < eaddr) {
+ src_page_addr = entry & PAGE_MASK;
+ src_pages[i++] =
+ virt_to_page(__va(src_page_addr));
+ if (i == npages)
+ break;
+ dest_page_addr += PAGE_SIZE;
+ }
+ }
+ }
+
+ /* Sanity check. */
+ WARN_ON(i < npages);
+
+ vaddr = vmap(src_pages, npages, VM_MAP, PAGE_KERNEL);
+ kvfree(src_pages);
+
+ if (!vaddr)
+ pr_err("Could not map segment source pages for destination %lx.\n", addr);
+
+ return vaddr;
+}
+
+void kimage_unmap_segment(void *segment_buffer)
+{
+ vunmap(segment_buffer);
+}
+
struct kexec_load_limit {
/* Mutex protects the limit count. */
struct mutex mutex;
--
2.48.1.711.g2feabab25a-goog
^ permalink raw reply related [flat|nested] 103+ messages in thread
* [PATCH v5 02/16] mm/mm_init: rename init_reserved_page to init_deferred_page
2025-03-20 1:55 [PATCH v5 00/16] kexec: introduce Kexec HandOver (KHO) Changyuan Lyu
2025-03-20 1:55 ` [PATCH v5 01/16] kexec: define functions to map and unmap segments Changyuan Lyu
@ 2025-03-20 1:55 ` Changyuan Lyu
2025-03-20 7:10 ` Krzysztof Kozlowski
2025-03-20 1:55 ` [PATCH v5 03/16] memblock: add MEMBLOCK_RSRV_KERN flag Changyuan Lyu
` (14 subsequent siblings)
16 siblings, 1 reply; 103+ messages in thread
From: Changyuan Lyu @ 2025-03-20 1:55 UTC (permalink / raw)
To: linux-kernel
Cc: graf, akpm, luto, anthony.yznaga, arnd, ashish.kalra, benh, bp,
catalin.marinas, dave.hansen, dwmw2, ebiederm, mingo, jgowans,
corbet, krzk, rppt, mark.rutland, pbonzini, pasha.tatashin, hpa,
peterz, ptyadav, robh+dt, robh, saravanak, skinsburskii, rostedt,
tglx, thomas.lendacky, usama.arif, will, devicetree, kexec,
linux-arm-kernel, linux-doc, linux-mm, x86
From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
When CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, the init_reserved_page()
function performs initialization of a struct page that would normally
have been deferred.
Rename it to init_deferred_page() to better reflect what the function does.
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
mm/mm_init.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 2630cc30147e..c4b425125bad 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -705,7 +705,7 @@ defer_init(int nid, unsigned long pfn, unsigned long end_pfn)
return false;
}
-static void __meminit init_reserved_page(unsigned long pfn, int nid)
+static void __meminit init_deferred_page(unsigned long pfn, int nid)
{
pg_data_t *pgdat;
int zid;
@@ -739,7 +739,7 @@ static inline bool defer_init(int nid, unsigned long pfn, unsigned long end_pfn)
return false;
}
-static inline void init_reserved_page(unsigned long pfn, int nid)
+static inline void init_deferred_page(unsigned long pfn, int nid)
{
}
#endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
@@ -760,7 +760,7 @@ void __meminit reserve_bootmem_region(phys_addr_t start,
if (pfn_valid(start_pfn)) {
struct page *page = pfn_to_page(start_pfn);
- init_reserved_page(start_pfn, nid);
+ init_deferred_page(start_pfn, nid);
/*
* no need for atomic set_bit because the struct
--
2.48.1.711.g2feabab25a-goog
^ permalink raw reply related [flat|nested] 103+ messages in thread
* [PATCH v5 03/16] memblock: add MEMBLOCK_RSRV_KERN flag
2025-03-20 1:55 [PATCH v5 00/16] kexec: introduce Kexec HandOver (KHO) Changyuan Lyu
2025-03-20 1:55 ` [PATCH v5 01/16] kexec: define functions to map and unmap segments Changyuan Lyu
2025-03-20 1:55 ` [PATCH v5 02/16] mm/mm_init: rename init_reserved_page to init_deferred_page Changyuan Lyu
@ 2025-03-20 1:55 ` Changyuan Lyu
2025-03-20 1:55 ` [PATCH v5 04/16] memblock: Add support for scratch memory Changyuan Lyu
` (13 subsequent siblings)
16 siblings, 0 replies; 103+ messages in thread
From: Changyuan Lyu @ 2025-03-20 1:55 UTC (permalink / raw)
To: linux-kernel
Cc: graf, akpm, luto, anthony.yznaga, arnd, ashish.kalra, benh, bp,
catalin.marinas, dave.hansen, dwmw2, ebiederm, mingo, jgowans,
corbet, krzk, rppt, mark.rutland, pbonzini, pasha.tatashin, hpa,
peterz, ptyadav, robh+dt, robh, saravanak, skinsburskii, rostedt,
tglx, thomas.lendacky, usama.arif, will, devicetree, kexec,
linux-arm-kernel, linux-doc, linux-mm, x86, Changyuan Lyu
From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
Add the MEMBLOCK_RSRV_KERN flag to denote areas that were reserved for
kernel use, either directly with memblock_reserve_kern() or via memblock
allocations.
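A minimal sketch of the intended usage (the base address is a placeholder;
only memblock_reserve_kern() and memblock_reserved_kern_size() are
introduced by this patch):
	phys_addr_t base = 0x80000000;	/* placeholder */
	phys_addr_t kern_size;
	/* reserve a range for kernel use, setting MEMBLOCK_RSRV_KERN */
	if (memblock_reserve_kern(base, SZ_16M))
		pr_warn("failed to reserve range for kernel use\n");
	/* sum of kernel-use reservations below 4G, on any node */
	kern_size = memblock_reserved_kern_size(SZ_4G, NUMA_NO_NODE);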
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Co-developed-by: Changyuan Lyu <changyuanl@google.com>
Signed-off-by: Changyuan Lyu <changyuanl@google.com>
---
include/linux/memblock.h | 19 ++++++++-
mm/memblock.c | 40 +++++++++++++++----
tools/testing/memblock/tests/alloc_api.c | 22 +++++-----
.../memblock/tests/alloc_helpers_api.c | 4 +-
tools/testing/memblock/tests/alloc_nid_api.c | 20 +++++-----
5 files changed, 73 insertions(+), 32 deletions(-)
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index e79eb6ac516f..1037fd7aabf4 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -42,6 +42,9 @@ extern unsigned long long max_possible_pfn;
* kernel resource tree.
* @MEMBLOCK_RSRV_NOINIT: memory region for which struct pages are
* not initialized (only for reserved regions).
+ * @MEMBLOCK_RSRV_KERN: memory region that is reserved for kernel use,
+ * either explicitly with memblock_reserve_kern() or via memblock
+ * allocation APIs. All memblock allocations set this flag.
*/
enum memblock_flags {
MEMBLOCK_NONE = 0x0, /* No special request */
@@ -50,6 +53,7 @@ enum memblock_flags {
MEMBLOCK_NOMAP = 0x4, /* don't add to kernel direct mapping */
MEMBLOCK_DRIVER_MANAGED = 0x8, /* always detected via a driver */
MEMBLOCK_RSRV_NOINIT = 0x10, /* don't initialize struct pages */
+ MEMBLOCK_RSRV_KERN = 0x20, /* memory reserved for kernel use */
};
/**
@@ -116,7 +120,19 @@ int memblock_add_node(phys_addr_t base, phys_addr_t size, int nid,
int memblock_add(phys_addr_t base, phys_addr_t size);
int memblock_remove(phys_addr_t base, phys_addr_t size);
int memblock_phys_free(phys_addr_t base, phys_addr_t size);
-int memblock_reserve(phys_addr_t base, phys_addr_t size);
+int __memblock_reserve(phys_addr_t base, phys_addr_t size, int nid,
+ enum memblock_flags flags);
+
+static __always_inline int memblock_reserve(phys_addr_t base, phys_addr_t size)
+{
+ return __memblock_reserve(base, size, NUMA_NO_NODE, 0);
+}
+
+static __always_inline int memblock_reserve_kern(phys_addr_t base, phys_addr_t size)
+{
+ return __memblock_reserve(base, size, NUMA_NO_NODE, MEMBLOCK_RSRV_KERN);
+}
+
#ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP
int memblock_physmem_add(phys_addr_t base, phys_addr_t size);
#endif
@@ -477,6 +493,7 @@ static inline __init_memblock bool memblock_bottom_up(void)
phys_addr_t memblock_phys_mem_size(void);
phys_addr_t memblock_reserved_size(void);
+phys_addr_t memblock_reserved_kern_size(phys_addr_t limit, int nid);
unsigned long memblock_estimated_nr_free_pages(void);
phys_addr_t memblock_start_of_DRAM(void);
phys_addr_t memblock_end_of_DRAM(void);
diff --git a/mm/memblock.c b/mm/memblock.c
index 95af35fd1389..e704e3270b32 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -491,7 +491,7 @@ static int __init_memblock memblock_double_array(struct memblock_type *type,
* needn't do it
*/
if (!use_slab)
- BUG_ON(memblock_reserve(addr, new_alloc_size));
+ BUG_ON(memblock_reserve_kern(addr, new_alloc_size));
/* Update slab flag */
*in_slab = use_slab;
@@ -641,7 +641,7 @@ static int __init_memblock memblock_add_range(struct memblock_type *type,
#ifdef CONFIG_NUMA
WARN_ON(nid != memblock_get_region_node(rgn));
#endif
- WARN_ON(flags != rgn->flags);
+ WARN_ON(flags != MEMBLOCK_NONE && flags != rgn->flags);
nr_new++;
if (insert) {
if (start_rgn == -1)
@@ -901,14 +901,15 @@ int __init_memblock memblock_phys_free(phys_addr_t base, phys_addr_t size)
return memblock_remove_range(&memblock.reserved, base, size);
}
-int __init_memblock memblock_reserve(phys_addr_t base, phys_addr_t size)
+int __init_memblock __memblock_reserve(phys_addr_t base, phys_addr_t size,
+ int nid, enum memblock_flags flags)
{
phys_addr_t end = base + size - 1;
- memblock_dbg("%s: [%pa-%pa] %pS\n", __func__,
- &base, &end, (void *)_RET_IP_);
+ memblock_dbg("%s: [%pa-%pa] nid=%d flags=%x %pS\n", __func__,
+ &base, &end, nid, flags, (void *)_RET_IP_);
- return memblock_add_range(&memblock.reserved, base, size, MAX_NUMNODES, 0);
+ return memblock_add_range(&memblock.reserved, base, size, nid, flags);
}
#ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP
@@ -1459,14 +1460,14 @@ phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size,
again:
found = memblock_find_in_range_node(size, align, start, end, nid,
flags);
- if (found && !memblock_reserve(found, size))
+ if (found && !__memblock_reserve(found, size, nid, MEMBLOCK_RSRV_KERN))
goto done;
if (numa_valid_node(nid) && !exact_nid) {
found = memblock_find_in_range_node(size, align, start,
end, NUMA_NO_NODE,
flags);
- if (found && !memblock_reserve(found, size))
+ if (found && !memblock_reserve_kern(found, size))
goto done;
}
@@ -1751,6 +1752,28 @@ phys_addr_t __init_memblock memblock_reserved_size(void)
return memblock.reserved.total_size;
}
+phys_addr_t __init_memblock memblock_reserved_kern_size(phys_addr_t limit, int nid)
+{
+ struct memblock_region *r;
+ phys_addr_t total = 0;
+
+ for_each_reserved_mem_region(r) {
+ phys_addr_t size = r->size;
+
+ if (r->base > limit)
+ break;
+
+ if (r->base + r->size > limit)
+ size = limit - r->base;
+
+ if (nid == memblock_get_region_node(r) || !numa_valid_node(nid))
+ if (r->flags & MEMBLOCK_RSRV_KERN)
+ total += size;
+ }
+
+ return total;
+}
+
/**
* memblock_estimated_nr_free_pages - return estimated number of free pages
* from memblock point of view
@@ -2397,6 +2420,7 @@ static const char * const flagname[] = {
[ilog2(MEMBLOCK_NOMAP)] = "NOMAP",
[ilog2(MEMBLOCK_DRIVER_MANAGED)] = "DRV_MNG",
[ilog2(MEMBLOCK_RSRV_NOINIT)] = "RSV_NIT",
+ [ilog2(MEMBLOCK_RSRV_KERN)] = "RSV_KERN",
};
static int memblock_debug_show(struct seq_file *m, void *private)
diff --git a/tools/testing/memblock/tests/alloc_api.c b/tools/testing/memblock/tests/alloc_api.c
index 68f1a75cd72c..c55f67dd367d 100644
--- a/tools/testing/memblock/tests/alloc_api.c
+++ b/tools/testing/memblock/tests/alloc_api.c
@@ -134,7 +134,7 @@ static int alloc_top_down_before_check(void)
PREFIX_PUSH();
setup_memblock();
- memblock_reserve(memblock_end_of_DRAM() - total_size, r1_size);
+ memblock_reserve_kern(memblock_end_of_DRAM() - total_size, r1_size);
allocated_ptr = run_memblock_alloc(r2_size, SMP_CACHE_BYTES);
@@ -182,7 +182,7 @@ static int alloc_top_down_after_check(void)
total_size = r1.size + r2_size;
- memblock_reserve(r1.base, r1.size);
+ memblock_reserve_kern(r1.base, r1.size);
allocated_ptr = run_memblock_alloc(r2_size, SMP_CACHE_BYTES);
@@ -231,8 +231,8 @@ static int alloc_top_down_second_fit_check(void)
total_size = r1.size + r2.size + r3_size;
- memblock_reserve(r1.base, r1.size);
- memblock_reserve(r2.base, r2.size);
+ memblock_reserve_kern(r1.base, r1.size);
+ memblock_reserve_kern(r2.base, r2.size);
allocated_ptr = run_memblock_alloc(r3_size, SMP_CACHE_BYTES);
@@ -285,8 +285,8 @@ static int alloc_in_between_generic_check(void)
total_size = r1.size + r2.size + r3_size;
- memblock_reserve(r1.base, r1.size);
- memblock_reserve(r2.base, r2.size);
+ memblock_reserve_kern(r1.base, r1.size);
+ memblock_reserve_kern(r2.base, r2.size);
allocated_ptr = run_memblock_alloc(r3_size, SMP_CACHE_BYTES);
@@ -422,7 +422,7 @@ static int alloc_limited_space_generic_check(void)
setup_memblock();
/* Simulate almost-full memory */
- memblock_reserve(memblock_start_of_DRAM(), reserved_size);
+ memblock_reserve_kern(memblock_start_of_DRAM(), reserved_size);
allocated_ptr = run_memblock_alloc(available_size, SMP_CACHE_BYTES);
@@ -608,7 +608,7 @@ static int alloc_bottom_up_before_check(void)
PREFIX_PUSH();
setup_memblock();
- memblock_reserve(memblock_start_of_DRAM() + r1_size, r2_size);
+ memblock_reserve_kern(memblock_start_of_DRAM() + r1_size, r2_size);
allocated_ptr = run_memblock_alloc(r1_size, SMP_CACHE_BYTES);
@@ -655,7 +655,7 @@ static int alloc_bottom_up_after_check(void)
total_size = r1.size + r2_size;
- memblock_reserve(r1.base, r1.size);
+ memblock_reserve_kern(r1.base, r1.size);
allocated_ptr = run_memblock_alloc(r2_size, SMP_CACHE_BYTES);
@@ -705,8 +705,8 @@ static int alloc_bottom_up_second_fit_check(void)
total_size = r1.size + r2.size + r3_size;
- memblock_reserve(r1.base, r1.size);
- memblock_reserve(r2.base, r2.size);
+ memblock_reserve_kern(r1.base, r1.size);
+ memblock_reserve_kern(r2.base, r2.size);
allocated_ptr = run_memblock_alloc(r3_size, SMP_CACHE_BYTES);
diff --git a/tools/testing/memblock/tests/alloc_helpers_api.c b/tools/testing/memblock/tests/alloc_helpers_api.c
index 3ef9486da8a0..e5362cfd2ff3 100644
--- a/tools/testing/memblock/tests/alloc_helpers_api.c
+++ b/tools/testing/memblock/tests/alloc_helpers_api.c
@@ -163,7 +163,7 @@ static int alloc_from_top_down_no_space_above_check(void)
min_addr = memblock_end_of_DRAM() - SMP_CACHE_BYTES * 2;
/* No space above this address */
- memblock_reserve(min_addr, r2_size);
+ memblock_reserve_kern(min_addr, r2_size);
allocated_ptr = memblock_alloc_from(r1_size, SMP_CACHE_BYTES, min_addr);
@@ -199,7 +199,7 @@ static int alloc_from_top_down_min_addr_cap_check(void)
start_addr = (phys_addr_t)memblock_start_of_DRAM();
min_addr = start_addr - SMP_CACHE_BYTES * 3;
- memblock_reserve(start_addr + r1_size, MEM_SIZE - r1_size);
+ memblock_reserve_kern(start_addr + r1_size, MEM_SIZE - r1_size);
allocated_ptr = memblock_alloc_from(r1_size, SMP_CACHE_BYTES, min_addr);
diff --git a/tools/testing/memblock/tests/alloc_nid_api.c b/tools/testing/memblock/tests/alloc_nid_api.c
index 49bb416d34ff..562e4701b0e0 100644
--- a/tools/testing/memblock/tests/alloc_nid_api.c
+++ b/tools/testing/memblock/tests/alloc_nid_api.c
@@ -324,7 +324,7 @@ static int alloc_nid_min_reserved_generic_check(void)
min_addr = max_addr - r2_size;
reserved_base = min_addr - r1_size;
- memblock_reserve(reserved_base, r1_size);
+ memblock_reserve_kern(reserved_base, r1_size);
allocated_ptr = run_memblock_alloc_nid(r2_size, SMP_CACHE_BYTES,
min_addr, max_addr,
@@ -374,7 +374,7 @@ static int alloc_nid_max_reserved_generic_check(void)
max_addr = memblock_end_of_DRAM() - r1_size;
min_addr = max_addr - r2_size;
- memblock_reserve(max_addr, r1_size);
+ memblock_reserve_kern(max_addr, r1_size);
allocated_ptr = run_memblock_alloc_nid(r2_size, SMP_CACHE_BYTES,
min_addr, max_addr,
@@ -436,8 +436,8 @@ static int alloc_nid_top_down_reserved_with_space_check(void)
min_addr = r2.base + r2.size;
max_addr = r1.base;
- memblock_reserve(r1.base, r1.size);
- memblock_reserve(r2.base, r2.size);
+ memblock_reserve_kern(r1.base, r1.size);
+ memblock_reserve_kern(r2.base, r2.size);
allocated_ptr = run_memblock_alloc_nid(r3_size, SMP_CACHE_BYTES,
min_addr, max_addr,
@@ -499,8 +499,8 @@ static int alloc_nid_reserved_full_merge_generic_check(void)
min_addr = r2.base + r2.size;
max_addr = r1.base;
- memblock_reserve(r1.base, r1.size);
- memblock_reserve(r2.base, r2.size);
+ memblock_reserve_kern(r1.base, r1.size);
+ memblock_reserve_kern(r2.base, r2.size);
allocated_ptr = run_memblock_alloc_nid(r3_size, SMP_CACHE_BYTES,
min_addr, max_addr,
@@ -563,8 +563,8 @@ static int alloc_nid_top_down_reserved_no_space_check(void)
min_addr = r2.base + r2.size;
max_addr = r1.base;
- memblock_reserve(r1.base, r1.size);
- memblock_reserve(r2.base, r2.size);
+ memblock_reserve_kern(r1.base, r1.size);
+ memblock_reserve_kern(r2.base, r2.size);
allocated_ptr = run_memblock_alloc_nid(r3_size, SMP_CACHE_BYTES,
min_addr, max_addr,
@@ -909,8 +909,8 @@ static int alloc_nid_bottom_up_reserved_with_space_check(void)
min_addr = r2.base + r2.size;
max_addr = r1.base;
- memblock_reserve(r1.base, r1.size);
- memblock_reserve(r2.base, r2.size);
+ memblock_reserve_kern(r1.base, r1.size);
+ memblock_reserve_kern(r2.base, r2.size);
allocated_ptr = run_memblock_alloc_nid(r3_size, SMP_CACHE_BYTES,
min_addr, max_addr,
--
2.48.1.711.g2feabab25a-goog
^ permalink raw reply related [flat|nested] 103+ messages in thread
* [PATCH v5 04/16] memblock: Add support for scratch memory
2025-03-20 1:55 [PATCH v5 00/16] kexec: introduce Kexec HandOver (KHO) Changyuan Lyu
` (2 preceding siblings ...)
2025-03-20 1:55 ` [PATCH v5 03/16] memblock: add MEMBLOCK_RSRV_KERN flag Changyuan Lyu
@ 2025-03-20 1:55 ` Changyuan Lyu
2025-03-20 1:55 ` [PATCH v5 05/16] memblock: introduce memmap_init_kho_scratch() Changyuan Lyu
` (12 subsequent siblings)
16 siblings, 0 replies; 103+ messages in thread
From: Changyuan Lyu @ 2025-03-20 1:55 UTC (permalink / raw)
To: linux-kernel
Cc: graf, akpm, luto, anthony.yznaga, arnd, ashish.kalra, benh, bp,
catalin.marinas, dave.hansen, dwmw2, ebiederm, mingo, jgowans,
corbet, krzk, rppt, mark.rutland, pbonzini, pasha.tatashin, hpa,
peterz, ptyadav, robh+dt, robh, saravanak, skinsburskii, rostedt,
tglx, thomas.lendacky, usama.arif, will, devicetree, kexec,
linux-arm-kernel, linux-doc, linux-mm, x86
From: Alexander Graf <graf@amazon.com>
With KHO (Kexec HandOver), we need a way to ensure that the new kernel
does not allocate memory on top of any memory regions that the previous
kernel was handing over. But to know where those are, we need to include
them in the memblock.reserved array which may not be big enough to hold
all ranges that need to be persisted across kexec. To resize the array,
we need to allocate memory. That brings us into a catch-22 situation.
The solution to that is to limit memblock allocations to the scratch
regions: safe regions to operate in when there is memory that should
remain intact across kexec.
KHO provides several "scratch regions" as part of its metadata. These
scratch regions are contiguous memory blocks that are known not to contain
any memory that should be persisted across kexec. These regions should be
large enough to accommodate all memblock allocations done by the kexeced
kernel.
We introduce a new memblock_set_kho_scratch_only() function that allows
KHO to indicate that any memblock allocation must happen from the scratch
regions.
Later, we may want to perform another KHO kexec. For that, we reuse the
same scratch regions. To ensure that no eventually handed over data gets
allocated inside a scratch region, we flip the semantics of the scratch
regions with memblock_clear_kho_scratch_only(): after that call, no
allocations may happen from scratch memblock regions. We will lift that
restriction in the next patch.
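A rough sketch of how an early boot path is expected to use this; the
function below is hypothetical, the actual wiring comes with the KHO
patches later in the series:
	static void __init example_kho_memblock_init(phys_addr_t scratch_base,
						     phys_addr_t scratch_size)
	{
		/* the range handed over as scratch by the previous kernel */
		memblock_mark_kho_scratch(scratch_base, scratch_size);
		/* restrict early allocations to the scratch range ... */
		memblock_set_kho_scratch_only();
		/* ... reserve all handed-over ranges here ... */
		/* ... then lift the scratch-only restriction again */
		memblock_clear_kho_scratch_only();
	}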
Signed-off-by: Alexander Graf <graf@amazon.com>
Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
include/linux/memblock.h | 20 +++++++++++++
mm/Kconfig | 4 +++
mm/memblock.c | 61 ++++++++++++++++++++++++++++++++++++++++
3 files changed, 85 insertions(+)
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 1037fd7aabf4..a83738b7218b 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -45,6 +45,11 @@ extern unsigned long long max_possible_pfn;
* @MEMBLOCK_RSRV_KERN: memory region that is reserved for kernel use,
* either explicitly with memblock_reserve_kern() or via memblock
* allocation APIs. All memblock allocations set this flag.
+ * @MEMBLOCK_KHO_SCRATCH: memory region that kexec can pass to the next
+ * kernel in handover mode. During early boot, we do not know about all
+ * memory reservations yet, so we get scratch memory from the previous
+ * kernel that we know is good to use. It is the only memory that
+ * allocations may happen from in this phase.
*/
enum memblock_flags {
MEMBLOCK_NONE = 0x0, /* No special request */
@@ -54,6 +59,7 @@ enum memblock_flags {
MEMBLOCK_DRIVER_MANAGED = 0x8, /* always detected via a driver */
MEMBLOCK_RSRV_NOINIT = 0x10, /* don't initialize struct pages */
MEMBLOCK_RSRV_KERN = 0x20, /* memory reserved for kernel use */
+ MEMBLOCK_KHO_SCRATCH = 0x40, /* scratch memory for kexec handover */
};
/**
@@ -148,6 +154,8 @@ int memblock_mark_mirror(phys_addr_t base, phys_addr_t size);
int memblock_mark_nomap(phys_addr_t base, phys_addr_t size);
int memblock_clear_nomap(phys_addr_t base, phys_addr_t size);
int memblock_reserved_mark_noinit(phys_addr_t base, phys_addr_t size);
+int memblock_mark_kho_scratch(phys_addr_t base, phys_addr_t size);
+int memblock_clear_kho_scratch(phys_addr_t base, phys_addr_t size);
void memblock_free_all(void);
void memblock_free(void *ptr, size_t size);
@@ -292,6 +300,11 @@ static inline bool memblock_is_driver_managed(struct memblock_region *m)
return m->flags & MEMBLOCK_DRIVER_MANAGED;
}
+static inline bool memblock_is_kho_scratch(struct memblock_region *m)
+{
+ return m->flags & MEMBLOCK_KHO_SCRATCH;
+}
+
int memblock_search_pfn_nid(unsigned long pfn, unsigned long *start_pfn,
unsigned long *end_pfn);
void __next_mem_pfn_range(int *idx, int nid, unsigned long *out_start_pfn,
@@ -620,5 +633,12 @@ static inline void early_memtest(phys_addr_t start, phys_addr_t end) { }
static inline void memtest_report_meminfo(struct seq_file *m) { }
#endif
+#ifdef CONFIG_MEMBLOCK_KHO_SCRATCH
+void memblock_set_kho_scratch_only(void);
+void memblock_clear_kho_scratch_only(void);
+#else
+static inline void memblock_set_kho_scratch_only(void) { }
+static inline void memblock_clear_kho_scratch_only(void) { }
+#endif
#endif /* _LINUX_MEMBLOCK_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index 1b501db06417..550bbafe5c0b 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -506,6 +506,10 @@ config HAVE_GUP_FAST
depends on MMU
bool
+# Enable memblock support for scratch memory which is needed for kexec handover
+config MEMBLOCK_KHO_SCRATCH
+ bool
+
# Don't discard allocated memory used to track "memory" and "reserved" memblocks
# after early boot, so it can still be used to test for validity of memory.
# Also, memblocks are updated with memory hot(un)plug.
diff --git a/mm/memblock.c b/mm/memblock.c
index e704e3270b32..c0f7da7dff47 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -106,6 +106,13 @@ unsigned long min_low_pfn;
unsigned long max_pfn;
unsigned long long max_possible_pfn;
+#ifdef CONFIG_MEMBLOCK_KHO_SCRATCH
+/* When set to true, only allocate from MEMBLOCK_KHO_SCRATCH ranges */
+static bool kho_scratch_only;
+#else
+#define kho_scratch_only false
+#endif
+
static struct memblock_region memblock_memory_init_regions[INIT_MEMBLOCK_MEMORY_REGIONS] __initdata_memblock;
static struct memblock_region memblock_reserved_init_regions[INIT_MEMBLOCK_RESERVED_REGIONS] __initdata_memblock;
#ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP
@@ -165,6 +172,10 @@ bool __init_memblock memblock_has_mirror(void)
static enum memblock_flags __init_memblock choose_memblock_flags(void)
{
+ /* skip non-scratch memory for kho early boot allocations */
+ if (kho_scratch_only)
+ return MEMBLOCK_KHO_SCRATCH;
+
return system_has_some_mirror ? MEMBLOCK_MIRROR : MEMBLOCK_NONE;
}
@@ -924,6 +935,18 @@ int __init_memblock memblock_physmem_add(phys_addr_t base, phys_addr_t size)
}
#endif
+#ifdef CONFIG_MEMBLOCK_KHO_SCRATCH
+__init_memblock void memblock_set_kho_scratch_only(void)
+{
+ kho_scratch_only = true;
+}
+
+__init_memblock void memblock_clear_kho_scratch_only(void)
+{
+ kho_scratch_only = false;
+}
+#endif
+
/**
* memblock_setclr_flag - set or clear flag for a memory region
* @type: memblock type to set/clear flag for
@@ -1049,6 +1072,36 @@ int __init_memblock memblock_reserved_mark_noinit(phys_addr_t base, phys_addr_t
MEMBLOCK_RSRV_NOINIT);
}
+/**
+ * memblock_mark_kho_scratch - Mark a memory region as MEMBLOCK_KHO_SCRATCH.
+ * @base: the base phys addr of the region
+ * @size: the size of the region
+ *
+ * Only memory regions marked with %MEMBLOCK_KHO_SCRATCH will be considered
+ * for allocations during early boot with kexec handover.
+ *
+ * Return: 0 on success, -errno on failure.
+ */
+int __init_memblock memblock_mark_kho_scratch(phys_addr_t base, phys_addr_t size)
+{
+ return memblock_setclr_flag(&memblock.memory, base, size, 1,
+ MEMBLOCK_KHO_SCRATCH);
+}
+
+/**
+ * memblock_clear_kho_scratch - Clear MEMBLOCK_KHO_SCRATCH flag for a
+ * specified region.
+ * @base: the base phys addr of the region
+ * @size: the size of the region
+ *
+ * Return: 0 on success, -errno on failure.
+ */
+int __init_memblock memblock_clear_kho_scratch(phys_addr_t base, phys_addr_t size)
+{
+ return memblock_setclr_flag(&memblock.memory, base, size, 0,
+ MEMBLOCK_KHO_SCRATCH);
+}
+
static bool should_skip_region(struct memblock_type *type,
struct memblock_region *m,
int nid, int flags)
@@ -1080,6 +1133,13 @@ static bool should_skip_region(struct memblock_type *type,
if (!(flags & MEMBLOCK_DRIVER_MANAGED) && memblock_is_driver_managed(m))
return true;
+ /*
+ * In early alloc during kexec handover, we can only consider
+ * MEMBLOCK_KHO_SCRATCH regions for the allocations
+ */
+ if ((flags & MEMBLOCK_KHO_SCRATCH) && !memblock_is_kho_scratch(m))
+ return true;
+
return false;
}
@@ -2421,6 +2481,7 @@ static const char * const flagname[] = {
[ilog2(MEMBLOCK_DRIVER_MANAGED)] = "DRV_MNG",
[ilog2(MEMBLOCK_RSRV_NOINIT)] = "RSV_NIT",
[ilog2(MEMBLOCK_RSRV_KERN)] = "RSV_KERN",
+ [ilog2(MEMBLOCK_KHO_SCRATCH)] = "KHO_SCRATCH",
};
static int memblock_debug_show(struct seq_file *m, void *private)
--
2.48.1.711.g2feabab25a-goog
^ permalink raw reply related [flat|nested] 103+ messages in thread
* [PATCH v5 05/16] memblock: introduce memmap_init_kho_scratch()
2025-03-20 1:55 [PATCH v5 00/16] kexec: introduce Kexec HandOver (KHO) Changyuan Lyu
` (3 preceding siblings ...)
2025-03-20 1:55 ` [PATCH v5 04/16] memblock: Add support for scratch memory Changyuan Lyu
@ 2025-03-20 1:55 ` Changyuan Lyu
2025-03-20 1:55 ` [PATCH v5 06/16] hashtable: add macro HASHTABLE_INIT Changyuan Lyu
` (11 subsequent siblings)
16 siblings, 0 replies; 103+ messages in thread
From: Changyuan Lyu @ 2025-03-20 1:55 UTC (permalink / raw)
To: linux-kernel
Cc: graf, akpm, luto, anthony.yznaga, arnd, ashish.kalra, benh, bp,
catalin.marinas, dave.hansen, dwmw2, ebiederm, mingo, jgowans,
corbet, krzk, rppt, mark.rutland, pbonzini, pasha.tatashin, hpa,
peterz, ptyadav, robh+dt, robh, saravanak, skinsburskii, rostedt,
tglx, thomas.lendacky, usama.arif, will, devicetree, kexec,
linux-arm-kernel, linux-doc, linux-mm, x86
From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
With deferred initialization of struct page, it will be necessary to
initialize the memory map for KHO scratch regions early.
Add a memmap_init_kho_scratch() helper that will allow such initialization
in upcoming patches.
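As a rough sketch, the expected use is simply (the call site is an
assumption, the actual wiring lands in later patches of this series):
	/*
	 * With CONFIG_DEFERRED_STRUCT_PAGE_INIT, struct pages for free
	 * scratch memory would otherwise be initialized late; set them up
	 * early so the scratch ranges are usable right away.
	 */
	memmap_init_kho_scratch_pages();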
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
include/linux/memblock.h | 2 ++
mm/internal.h | 2 ++
mm/memblock.c | 22 ++++++++++++++++++++++
mm/mm_init.c | 11 ++++++++---
4 files changed, 34 insertions(+), 3 deletions(-)
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index a83738b7218b..497e2c1364a6 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -636,9 +636,11 @@ static inline void memtest_report_meminfo(struct seq_file *m) { }
#ifdef CONFIG_MEMBLOCK_KHO_SCRATCH
void memblock_set_kho_scratch_only(void);
void memblock_clear_kho_scratch_only(void);
+void memmap_init_kho_scratch_pages(void);
#else
static inline void memblock_set_kho_scratch_only(void) { }
static inline void memblock_clear_kho_scratch_only(void) { }
+static inline void memmap_init_kho_scratch_pages(void) {}
#endif
#endif /* _LINUX_MEMBLOCK_H */
diff --git a/mm/internal.h b/mm/internal.h
index 20b3535935a3..8e45a2ae961a 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1053,6 +1053,8 @@ DECLARE_STATIC_KEY_TRUE(deferred_pages);
bool __init deferred_grow_zone(struct zone *zone, unsigned int order);
#endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
+void init_deferred_page(unsigned long pfn, int nid);
+
enum mminit_level {
MMINIT_WARNING,
MMINIT_VERIFY,
diff --git a/mm/memblock.c b/mm/memblock.c
index c0f7da7dff47..d5d406a5160a 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -945,6 +945,28 @@ __init_memblock void memblock_clear_kho_scratch_only(void)
{
kho_scratch_only = false;
}
+
+void __init_memblock memmap_init_kho_scratch_pages(void)
+{
+ phys_addr_t start, end;
+ unsigned long pfn;
+ int nid;
+ u64 i;
+
+ if (!IS_ENABLED(CONFIG_DEFERRED_STRUCT_PAGE_INIT))
+ return;
+
+ /*
+ * Initialize struct pages for free scratch memory.
+ * The struct pages for reserved scratch memory will be set up in
+ * reserve_bootmem_region()
+ */
+ __for_each_mem_range(i, &memblock.memory, NULL, NUMA_NO_NODE,
+ MEMBLOCK_KHO_SCRATCH, &start, &end, &nid) {
+ for (pfn = PFN_UP(start); pfn < PFN_DOWN(end); pfn++)
+ init_deferred_page(pfn, nid);
+ }
+}
#endif
/**
diff --git a/mm/mm_init.c b/mm/mm_init.c
index c4b425125bad..04441c258b05 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -705,7 +705,7 @@ defer_init(int nid, unsigned long pfn, unsigned long end_pfn)
return false;
}
-static void __meminit init_deferred_page(unsigned long pfn, int nid)
+static void __meminit __init_deferred_page(unsigned long pfn, int nid)
{
pg_data_t *pgdat;
int zid;
@@ -739,11 +739,16 @@ static inline bool defer_init(int nid, unsigned long pfn, unsigned long end_pfn)
return false;
}
-static inline void init_deferred_page(unsigned long pfn, int nid)
+static inline void __init_deferred_page(unsigned long pfn, int nid)
{
}
#endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
+void __meminit init_deferred_page(unsigned long pfn, int nid)
+{
+ __init_deferred_page(pfn, nid);
+}
+
/*
* Initialised pages do not have PageReserved set. This function is
* called for each range allocated by the bootmem allocator and
@@ -760,7 +765,7 @@ void __meminit reserve_bootmem_region(phys_addr_t start,
if (pfn_valid(start_pfn)) {
struct page *page = pfn_to_page(start_pfn);
- init_deferred_page(start_pfn, nid);
+ __init_deferred_page(start_pfn, nid);
/*
* no need for atomic set_bit because the struct
--
2.48.1.711.g2feabab25a-goog
^ permalink raw reply related [flat|nested] 103+ messages in thread
* [PATCH v5 06/16] hashtable: add macro HASHTABLE_INIT
2025-03-20 1:55 [PATCH v5 00/16] kexec: introduce Kexec HandOver (KHO) Changyuan Lyu
` (4 preceding siblings ...)
2025-03-20 1:55 ` [PATCH v5 05/16] memblock: introduce memmap_init_kho_scratch() Changyuan Lyu
@ 2025-03-20 1:55 ` Changyuan Lyu
2025-03-20 1:55 ` [PATCH v5 07/16] kexec: add Kexec HandOver (KHO) generation helpers Changyuan Lyu
` (10 subsequent siblings)
16 siblings, 0 replies; 103+ messages in thread
From: Changyuan Lyu @ 2025-03-20 1:55 UTC (permalink / raw)
To: linux-kernel
Cc: graf, akpm, luto, anthony.yznaga, arnd, ashish.kalra, benh, bp,
catalin.marinas, dave.hansen, dwmw2, ebiederm, mingo, jgowans,
corbet, krzk, rppt, mark.rutland, pbonzini, pasha.tatashin, hpa,
peterz, ptyadav, robh+dt, robh, saravanak, skinsburskii, rostedt,
tglx, thomas.lendacky, usama.arif, will, devicetree, kexec,
linux-arm-kernel, linux-doc, linux-mm, x86, Changyuan Lyu
Similar to HLIST_HEAD_INIT, HASHTABLE_INIT allows a hashtable embedded
in another structure to be initialized at compile time.
Example,
struct tree_node {
DECLARE_HASHTABLE(properties, 4);
DECLARE_HASHTABLE(sub_nodes, 4);
};
static struct tree_node root_node = {
.properties = HASHTABLE_INIT(4),
.sub_nodes = HASHTABLE_INIT(4),
};
Signed-off-by: Changyuan Lyu <changyuanl@google.com>
---
include/linux/hashtable.h | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/include/linux/hashtable.h b/include/linux/hashtable.h
index f6c666730b8c..27e07a436e2a 100644
--- a/include/linux/hashtable.h
+++ b/include/linux/hashtable.h
@@ -13,13 +13,14 @@
#include <linux/hash.h>
#include <linux/rculist.h>
+#define HASHTABLE_INIT(bits) { [0 ... ((1 << (bits)) - 1)] = HLIST_HEAD_INIT }
+
#define DEFINE_HASHTABLE(name, bits) \
- struct hlist_head name[1 << (bits)] = \
- { [0 ... ((1 << (bits)) - 1)] = HLIST_HEAD_INIT }
+ struct hlist_head name[1 << (bits)] = HASHTABLE_INIT(bits)
#define DEFINE_READ_MOSTLY_HASHTABLE(name, bits) \
struct hlist_head name[1 << (bits)] __read_mostly = \
- { [0 ... ((1 << (bits)) - 1)] = HLIST_HEAD_INIT }
+ HASHTABLE_INIT(bits)
#define DECLARE_HASHTABLE(name, bits) \
struct hlist_head name[1 << (bits)]
--
2.48.1.711.g2feabab25a-goog
^ permalink raw reply related [flat|nested] 103+ messages in thread
* [PATCH v5 07/16] kexec: add Kexec HandOver (KHO) generation helpers
2025-03-20 1:55 [PATCH v5 00/16] kexec: introduce Kexec HandOver (KHO) Changyuan Lyu
` (5 preceding siblings ...)
2025-03-20 1:55 ` [PATCH v5 06/16] hashtable: add macro HASHTABLE_INIT Changyuan Lyu
@ 2025-03-20 1:55 ` Changyuan Lyu
2025-03-21 13:34 ` Jason Gunthorpe
2025-03-24 18:40 ` Frank van der Linden
2025-03-20 1:55 ` [PATCH v5 08/16] kexec: add KHO parsing support Changyuan Lyu
` (9 subsequent siblings)
16 siblings, 2 replies; 103+ messages in thread
From: Changyuan Lyu @ 2025-03-20 1:55 UTC (permalink / raw)
To: linux-kernel
Cc: graf, akpm, luto, anthony.yznaga, arnd, ashish.kalra, benh, bp,
catalin.marinas, dave.hansen, dwmw2, ebiederm, mingo, jgowans,
corbet, krzk, rppt, mark.rutland, pbonzini, pasha.tatashin, hpa,
peterz, ptyadav, robh+dt, robh, saravanak, skinsburskii, rostedt,
tglx, thomas.lendacky, usama.arif, will, devicetree, kexec,
linux-arm-kernel, linux-doc, linux-mm, x86, Changyuan Lyu
From: Alexander Graf <graf@amazon.com>
Add the core infrastructure to generate Kexec HandOver metadata. Kexec
HandOver is a mechanism that allows Linux to preserve state - arbitrary
properties as well as memory locations - across kexec.
It does so using two concepts:
1) State Tree - Every KHO kexec carries a state tree that describes the
state of the system. The state tree is represented as hash tables.
Device drivers can add data to and remove data from the state tree at
runtime. On kexec, the tree is converted to an FDT (flattened
device tree).
2) Scratch Regions - CMA regions that we allocate in the first kernel.
CMA gives us the guarantee that no handover pages land in those
regions, because handover pages must be at a static physical memory
location. We use these regions as the place to load future kexec
images into so that they won't collide with any handover data.
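For illustration, a hypothetical driver could publish state into the tree
roughly as below; the node and property names are made up, only the kho_*
API and KHO_NODE_INIT come from this patch:
	static struct kho_node example_node = KHO_NODE_INIT;
	static u64 example_data;
	static int example_kho_publish(void)
	{
		int err;
		/* attach the node to the KHO state tree root */
		err = kho_add_node(NULL, "example", &example_node);
		if (err < 0)
			return err;
		err = kho_add_string_prop(&example_node, "compatible", "example-v1");
		if (err < 0)
			return err;
		/* example_data must stay valid until the FDT is finalized */
		return kho_add_prop(&example_node, "data", &example_data,
				    sizeof(example_data));
	}
A driver that needs to do extra work at finalization time would
additionally register a notifier with register_kho_notifier() and react
to KEXEC_KHO_FINALIZE.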
Signed-off-by: Alexander Graf <graf@amazon.com>
Co-developed-by: Pratyush Yadav <ptyadav@amazon.de>
Signed-off-by: Pratyush Yadav <ptyadav@amazon.de>
Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Co-developed-by: Changyuan Lyu <changyuanl@google.com>
Signed-off-by: Changyuan Lyu <changyuanl@google.com>
---
MAINTAINERS | 2 +-
include/linux/kexec_handover.h | 109 +++++
kernel/Makefile | 1 +
kernel/kexec_handover.c | 865 +++++++++++++++++++++++++++++++++
mm/mm_init.c | 8 +
5 files changed, 984 insertions(+), 1 deletion(-)
create mode 100644 include/linux/kexec_handover.h
create mode 100644 kernel/kexec_handover.c
diff --git a/MAINTAINERS b/MAINTAINERS
index 12852355bd66..a000a277ccf7 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -12828,7 +12828,7 @@ F: include/linux/kernfs.h
KEXEC
L: kexec@lists.infradead.org
W: http://kernel.org/pub/linux/utils/kernel/kexec/
-F: include/linux/kexec.h
+F: include/linux/kexec*.h
F: include/uapi/linux/kexec.h
F: kernel/kexec*
diff --git a/include/linux/kexec_handover.h b/include/linux/kexec_handover.h
new file mode 100644
index 000000000000..9cd9ad31e2d1
--- /dev/null
+++ b/include/linux/kexec_handover.h
@@ -0,0 +1,109 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef LINUX_KEXEC_HANDOVER_H
+#define LINUX_KEXEC_HANDOVER_H
+
+#include <linux/types.h>
+#include <linux/hashtable.h>
+#include <linux/notifier.h>
+
+struct kho_scratch {
+ phys_addr_t addr;
+ phys_addr_t size;
+};
+
+/* KHO Notifier index */
+enum kho_event {
+ KEXEC_KHO_FINALIZE = 0,
+ KEXEC_KHO_UNFREEZE = 1,
+};
+
+#define KHO_HASHTABLE_BITS 3
+#define KHO_NODE_INIT \
+ { \
+ .props = HASHTABLE_INIT(KHO_HASHTABLE_BITS), \
+ .nodes = HASHTABLE_INIT(KHO_HASHTABLE_BITS), \
+ }
+
+struct kho_node {
+ struct hlist_node hlist;
+
+ const char *name;
+ DECLARE_HASHTABLE(props, KHO_HASHTABLE_BITS);
+ DECLARE_HASHTABLE(nodes, KHO_HASHTABLE_BITS);
+
+ struct list_head list;
+ bool visited;
+};
+
+#ifdef CONFIG_KEXEC_HANDOVER
+bool kho_is_enabled(void);
+void kho_init_node(struct kho_node *node);
+int kho_add_node(struct kho_node *parent, const char *name,
+ struct kho_node *child);
+struct kho_node *kho_remove_node(struct kho_node *parent, const char *name);
+int kho_add_prop(struct kho_node *node, const char *key, const void *val,
+ u32 size);
+void *kho_remove_prop(struct kho_node *node, const char *key, u32 *size);
+int kho_add_string_prop(struct kho_node *node, const char *key,
+ const char *val);
+
+int register_kho_notifier(struct notifier_block *nb);
+int unregister_kho_notifier(struct notifier_block *nb);
+
+void kho_memory_init(void);
+#else
+static inline bool kho_is_enabled(void)
+{
+ return false;
+}
+
+static inline void kho_init_node(struct kho_node *node)
+{
+}
+
+static inline int kho_add_node(struct kho_node *parent, const char *name,
+ struct kho_node *child)
+{
+ return -EOPNOTSUPP;
+}
+
+static inline struct kho_node *kho_remove_node(struct kho_node *parent,
+ const char *name)
+{
+ return ERR_PTR(-EOPNOTSUPP);
+}
+
+static inline int kho_add_prop(struct kho_node *node, const char *key,
+ const void *val, u32 size)
+{
+ return -EOPNOTSUPP;
+}
+
+static inline void *kho_remove_prop(struct kho_node *node, const char *key,
+ u32 *size)
+{
+ return ERR_PTR(-EOPNOTSUPP);
+}
+
+static inline int kho_add_string_prop(struct kho_node *node, const char *key,
+ const char *val)
+{
+ return -EOPNOTSUPP;
+}
+
+static inline int register_kho_notifier(struct notifier_block *nb)
+{
+ return -EOPNOTSUPP;
+}
+
+static inline int unregister_kho_notifier(struct notifier_block *nb)
+{
+ return -EOPNOTSUPP;
+}
+
+static inline void kho_memory_init(void)
+{
+}
+#endif /* CONFIG_KEXEC_HANDOVER */
+
+#endif /* LINUX_KEXEC_HANDOVER_H */
diff --git a/kernel/Makefile b/kernel/Makefile
index 87866b037fbe..cef5377c25cd 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -75,6 +75,7 @@ obj-$(CONFIG_CRASH_DUMP) += crash_core.o
obj-$(CONFIG_KEXEC) += kexec.o
obj-$(CONFIG_KEXEC_FILE) += kexec_file.o
obj-$(CONFIG_KEXEC_ELF) += kexec_elf.o
+obj-$(CONFIG_KEXEC_HANDOVER) += kexec_handover.o
obj-$(CONFIG_BACKTRACE_SELF_TEST) += backtracetest.o
obj-$(CONFIG_COMPAT) += compat.o
obj-$(CONFIG_CGROUPS) += cgroup/
diff --git a/kernel/kexec_handover.c b/kernel/kexec_handover.c
new file mode 100644
index 000000000000..df0d9debbb64
--- /dev/null
+++ b/kernel/kexec_handover.c
@@ -0,0 +1,865 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * kexec_handover.c - kexec handover metadata processing
+ * Copyright (C) 2023 Alexander Graf <graf@amazon.com>
+ * Copyright (C) 2025 Microsoft Corporation, Mike Rapoport <rppt@kernel.org>
+ * Copyright (C) 2024 Google LLC
+ */
+
+#define pr_fmt(fmt) "KHO: " fmt
+
+#include <linux/cma.h>
+#include <linux/kexec.h>
+#include <linux/libfdt.h>
+#include <linux/debugfs.h>
+#include <linux/memblock.h>
+#include <linux/notifier.h>
+#include <linux/kexec_handover.h>
+#include <linux/page-isolation.h>
+#include <linux/rwsem.h>
+#include <linux/xxhash.h>
+/*
+ * KHO is tightly coupled with mm init and needs access to some of mm
+ * internal APIs.
+ */
+#include "../mm/internal.h"
+#include "kexec_internal.h"
+
+static bool kho_enable __ro_after_init;
+
+bool kho_is_enabled(void)
+{
+ return kho_enable;
+}
+EXPORT_SYMBOL_GPL(kho_is_enabled);
+
+static int __init kho_parse_enable(char *p)
+{
+ return kstrtobool(p, &kho_enable);
+}
+early_param("kho", kho_parse_enable);
+
+/*
+ * With KHO enabled, memory can become fragmented because KHO regions may
+ * be anywhere in physical address space. The scratch regions give us
+ * safe zones that we will never see KHO allocations from. This is where we
+ * can later safely load our new kexec images into and then use the scratch
+ * area for early allocations that happen before the page allocator is
+ * initialized.
+ */
+static struct kho_scratch *kho_scratch;
+static unsigned int kho_scratch_cnt;
+
+static struct dentry *debugfs_root;
+
+struct kho_out {
+ struct blocking_notifier_head chain_head;
+
+ struct debugfs_blob_wrapper fdt_wrapper;
+ struct dentry *fdt_file;
+ struct dentry *dir;
+
+ struct rw_semaphore tree_lock;
+ struct kho_node root;
+
+ void *fdt;
+ u64 fdt_max;
+};
+
+static struct kho_out kho_out = {
+ .chain_head = BLOCKING_NOTIFIER_INIT(kho_out.chain_head),
+ .tree_lock = __RWSEM_INITIALIZER(kho_out.tree_lock),
+ .root = KHO_NODE_INIT,
+ .fdt_max = 10 * SZ_1M,
+};
+
+int register_kho_notifier(struct notifier_block *nb)
+{
+ return blocking_notifier_chain_register(&kho_out.chain_head, nb);
+}
+EXPORT_SYMBOL_GPL(register_kho_notifier);
+
+int unregister_kho_notifier(struct notifier_block *nb)
+{
+ return blocking_notifier_chain_unregister(&kho_out.chain_head, nb);
+}
+EXPORT_SYMBOL_GPL(unregister_kho_notifier);
+
+/* Helper functions for KHO state tree */
+
+struct kho_prop {
+ struct hlist_node hlist;
+
+ const char *key;
+ const void *val;
+ u32 size;
+};
+
+static unsigned long strhash(const char *s)
+{
+ return xxhash(s, strlen(s), 1120);
+}
+
+void kho_init_node(struct kho_node *node)
+{
+ hash_init(node->props);
+ hash_init(node->nodes);
+}
+EXPORT_SYMBOL_GPL(kho_init_node);
+
+/**
+ * kho_add_node - add a child node to a parent node.
+ * @parent: parent node to add to.
+ * @name: name of the child node.
+ * @child: child node to be added to @parent with @name.
+ *
+ * If @parent is NULL, @child is added to KHO state tree root node.
+ *
+ * @child must be a valid pointer through KHO FDT finalization.
+ * @name is duplicated and thus can have a short lifetime.
+ *
+ * Callers must use their own locking if there are concurrent accesses to
+ * @parent or @child.
+ *
+ * Return: 0 on success, 1 if @child is already in @parent with @name, or
+ * - -EOPNOTSUPP: KHO is not enabled in the kernel command line,
+ * - -ENOMEM: failed to duplicate @name,
+ * - -EBUSY: KHO state tree has been converted to FDT,
+ * - -EEXIST: another node of the same name has been added to the parent.
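+ *
+ * A minimal usage sketch (node and property names are illustrative):
+ *
+ *   static struct kho_node my_node;
+ *
+ *   kho_init_node(&my_node);
+ *   kho_add_string_prop(&my_node, "compatible", "my-driver-v1");
+ *   kho_add_node(NULL, "my-driver", &my_node);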
+ */
+int kho_add_node(struct kho_node *parent, const char *name,
+ struct kho_node *child)
+{
+ unsigned long name_hash;
+ int ret = 0;
+ struct kho_node *node;
+ char *child_name;
+
+ if (!kho_enable)
+ return -EOPNOTSUPP;
+
+ if (!parent)
+ parent = &kho_out.root;
+
+ child_name = kstrdup(name, GFP_KERNEL);
+ if (!child_name)
+ return -ENOMEM;
+
+ name_hash = strhash(child_name);
+
+ if (parent == &kho_out.root)
+ down_write(&kho_out.tree_lock);
+ else
+ down_read(&kho_out.tree_lock);
+
+ if (kho_out.fdt) {
+ ret = -EBUSY;
+ goto out;
+ }
+
+ hash_for_each_possible(parent->nodes, node, hlist, name_hash) {
+ if (!strcmp(node->name, child_name)) {
+ ret = node == child ? 1 : -EEXIST;
+ break;
+ }
+ }
+
+ if (ret == 0) {
+ child->name = child_name;
+ hash_add(parent->nodes, &child->hlist, name_hash);
+ }
+
+out:
+ if (parent == &kho_out.root)
+ up_write(&kho_out.tree_lock);
+ else
+ up_read(&kho_out.tree_lock);
+
+ if (ret)
+ kfree(child_name);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(kho_add_node);
+
+/**
+ * kho_remove_node - remove a child node from a parent node.
+ * @parent: parent node to remove the child from.
+ * @name: name of the child node.
+ *
+ * If @parent is NULL, the child is removed from the KHO state tree root node.
+ *
+ * Callers must use their own locking if there are concurrent accesses to
+ * @parent or the child node being removed.
+ *
+ * Return: the pointer to the child node on success, or an error pointer,
+ * - -EOPNOTSUPP: KHO is not enabled in the kernel command line,
+ * - -ENOENT: no node named @name is found.
+ * - -EBUSY: KHO state tree has been converted to FDT.
+ */
+struct kho_node *kho_remove_node(struct kho_node *parent, const char *name)
+{
+ struct kho_node *child, *ret = ERR_PTR(-ENOENT);
+ unsigned long name_hash;
+
+ if (!kho_enable)
+ return ERR_PTR(-EOPNOTSUPP);
+
+ if (!parent)
+ parent = &kho_out.root;
+
+ name_hash = strhash(name);
+
+ if (parent == &kho_out.root)
+ down_write(&kho_out.tree_lock);
+ else
+ down_read(&kho_out.tree_lock);
+
+ if (kho_out.fdt) {
+ ret = ERR_PTR(-EBUSY);
+ goto out;
+ }
+
+ hash_for_each_possible(parent->nodes, child, hlist, name_hash) {
+ if (!strcmp(child->name, name)) {
+ ret = child;
+ break;
+ }
+ }
+
+ if (!IS_ERR(ret)) {
+ hash_del(&ret->hlist);
+ kfree(ret->name);
+ ret->name = NULL;
+ }
+
+out:
+ if (parent == &kho_out.root)
+ up_write(&kho_out.tree_lock);
+ else
+ up_read(&kho_out.tree_lock);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(kho_remove_node);
+
+/**
+ * kho_add_prop - add a property to a node.
+ * @node: KHO node to add the property to.
+ * @key: key of the property.
+ * @val: pointer to the property value.
+ * @size: size of the property value in bytes.
+ *
+ * @val and @key must be valid pointers through KHO FDT finalization.
+ * Generally @key is a string literal with static lifetime.
+ *
+ * Callers must use their own locking if there are concurrent accesses to @node.
+ *
+ * Return: 0 on success, 1 if the value is already added with @key, or
+ * - -ENOMEM: failed to allocate memory,
+ * - -EBUSY: KHO state tree has been converted to FDT,
+ * - -EEXIST: another property of the same key exists.
+ */
+int kho_add_prop(struct kho_node *node, const char *key, const void *val,
+ u32 size)
+{
+ unsigned long key_hash;
+ int ret = 0;
+ struct kho_prop *prop, *p;
+
+ key_hash = strhash(key);
+ prop = kmalloc(sizeof(*prop), GFP_KERNEL);
+ if (!prop)
+ return -ENOMEM;
+
+ prop->key = key;
+ prop->val = val;
+ prop->size = size;
+
+ down_read(&kho_out.tree_lock);
+ if (kho_out.fdt) {
+ ret = -EBUSY;
+ goto out;
+ }
+
+ hash_for_each_possible(node->props, p, hlist, key_hash) {
+ if (!strcmp(p->key, key)) {
+ ret = (p->val == val && p->size == size) ? 1 : -EEXIST;
+ break;
+ }
+ }
+
+ if (!ret)
+ hash_add(node->props, &prop->hlist, key_hash);
+
+out:
+ up_read(&kho_out.tree_lock);
+
+ if (ret)
+ kfree(prop);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(kho_add_prop);
+
+/**
+ * kho_add_string_prop - add a string property to a node.
+ *
+ * See kho_add_prop() for details.
+ */
+int kho_add_string_prop(struct kho_node *node, const char *key, const char *val)
+{
+ return kho_add_prop(node, key, val, strlen(val) + 1);
+}
+EXPORT_SYMBOL_GPL(kho_add_string_prop);
+
+/**
+ * kho_remove_prop - remove a property from a node.
+ * @node: KHO node to remove the property from.
+ * @key: key of the property.
+ * @size: if non-NULL, the property size is stored in it on success.
+ *
+ * Callers must use their own locking if there are concurrent accesses to @node.
+ *
+ * Return: the pointer to the property value, or
+ * - -EBUSY: KHO state tree has been converted to FDT,
+ * - -ENOENT: no property with @key is found.
+ */
+void *kho_remove_prop(struct kho_node *node, const char *key, u32 *size)
+{
+ struct kho_prop *p, *prop = NULL;
+ unsigned long key_hash;
+ void *ret = ERR_PTR(-ENOENT);
+
+ key_hash = strhash(key);
+
+ down_read(&kho_out.tree_lock);
+
+ if (kho_out.fdt) {
+ ret = ERR_PTR(-EBUSY);
+ goto out;
+ }
+
+ hash_for_each_possible(node->props, p, hlist, key_hash) {
+ if (!strcmp(p->key, key)) {
+ prop = p;
+ break;
+ }
+ }
+
+ if (prop) {
+ ret = (void *)prop->val;
+ if (size)
+ *size = prop->size;
+ hash_del(&prop->hlist);
+ kfree(prop);
+ }
+
+out:
+ up_read(&kho_out.tree_lock);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(kho_remove_prop);
+
+static int kho_out_update_debugfs_fdt(void)
+{
+ int err = 0;
+
+ if (kho_out.fdt) {
+ kho_out.fdt_wrapper.data = kho_out.fdt;
+ kho_out.fdt_wrapper.size = fdt_totalsize(kho_out.fdt);
+ kho_out.fdt_file = debugfs_create_blob("fdt", 0400, kho_out.dir,
+ &kho_out.fdt_wrapper);
+ if (IS_ERR(kho_out.fdt_file))
+ err = -ENOENT;
+ } else {
+ debugfs_remove(kho_out.fdt_file);
+ }
+
+ return err;
+}
+
+static int kho_unfreeze(void)
+{
+ int err;
+ void *fdt;
+
+ down_write(&kho_out.tree_lock);
+ fdt = kho_out.fdt;
+ kho_out.fdt = NULL;
+ up_write(&kho_out.tree_lock);
+
+ if (fdt)
+ kvfree(fdt);
+
+ err = blocking_notifier_call_chain(&kho_out.chain_head,
+ KEXEC_KHO_UNFREEZE, NULL);
+ err = notifier_to_errno(err);
+
+ return err;
+}
+
+static int kho_flatten_tree(void *fdt)
+{
+ int iter, err = 0;
+ struct kho_node *node, *sub_node;
+ struct list_head *ele;
+ struct kho_prop *prop;
+ LIST_HEAD(stack);
+
+ kho_out.root.visited = false;
+ list_add(&kho_out.root.list, &stack);
+
+ for (ele = stack.next; !list_is_head(ele, &stack); ele = stack.next) {
+ node = list_entry(ele, struct kho_node, list);
+
+ if (node->visited) {
+ err = fdt_end_node(fdt);
+ if (err)
+ return err;
+ list_del_init(ele);
+ continue;
+ }
+
+ err = fdt_begin_node(fdt, node->name);
+ if (err)
+ return err;
+
+ hash_for_each(node->props, iter, prop, hlist) {
+ err = fdt_property(fdt, prop->key, prop->val,
+ prop->size);
+ if (err)
+ return err;
+ }
+
+ hash_for_each(node->nodes, iter, sub_node, hlist) {
+ sub_node->visited = false;
+ list_add(&sub_node->list, &stack);
+ }
+
+ node->visited = true;
+ }
+
+ return 0;
+}
+
+static int kho_convert_tree(void *buffer, int size)
+{
+ void *fdt = buffer;
+ int err = 0;
+
+ err = fdt_create(fdt, size);
+ if (err)
+ goto out;
+
+ err = fdt_finish_reservemap(fdt);
+ if (err)
+ goto out;
+
+ err = kho_flatten_tree(fdt);
+ if (err)
+ goto out;
+
+ err = fdt_finish(fdt);
+ if (err)
+ goto out;
+
+ err = fdt_check_header(fdt);
+ if (err)
+ goto out;
+
+out:
+ if (err) {
+ pr_err("failed to flatten state tree: %d\n", err);
+ return -EINVAL;
+ }
+ return 0;
+}
+
+static int kho_finalize(void)
+{
+ int err = 0;
+ void *fdt;
+
+ fdt = kvmalloc(kho_out.fdt_max, GFP_KERNEL);
+ if (!fdt)
+ return -ENOMEM;
+
+ err = blocking_notifier_call_chain(&kho_out.chain_head,
+ KEXEC_KHO_FINALIZE, NULL);
+ err = notifier_to_errno(err);
+ if (err)
+ goto unfreeze;
+
+ down_write(&kho_out.tree_lock);
+ kho_out.fdt = fdt;
+ up_write(&kho_out.tree_lock);
+
+ err = kho_convert_tree(fdt, kho_out.fdt_max);
+
+unfreeze:
+ if (err) {
+ int abort_err;
+
+ pr_err("Failed to convert KHO state tree: %d\n", err);
+
+ abort_err = kho_unfreeze();
+ if (abort_err)
+ pr_err("Failed to abort KHO state tree: %d\n",
+ abort_err);
+ }
+
+ return err;
+}
+
+/* Handling for debug/kho/out */
+static int kho_out_finalize_get(void *data, u64 *val)
+{
+ *val = !!kho_out.fdt;
+
+ return 0;
+}
+
+static int kho_out_finalize_set(void *data, u64 _val)
+{
+ int ret = 0;
+ bool val = !!_val;
+
+ if (!kexec_trylock())
+ return -EBUSY;
+
+ if (val == !!kho_out.fdt) {
+ if (kho_out.fdt)
+ ret = -EEXIST;
+ else
+ ret = -ENOENT;
+ goto unlock;
+ }
+
+ if (val)
+ ret = kho_finalize();
+ else
+ ret = kho_unfreeze();
+
+ if (ret)
+ goto unlock;
+
+ ret = kho_out_update_debugfs_fdt();
+
+unlock:
+ kexec_unlock();
+ return ret;
+}
+
+DEFINE_DEBUGFS_ATTRIBUTE(fops_kho_out_finalize, kho_out_finalize_get,
+ kho_out_finalize_set, "%llu\n");
+
+static int kho_out_fdt_max_get(void *data, u64 *val)
+{
+ *val = kho_out.fdt_max;
+
+ return 0;
+}
+
+static int kho_out_fdt_max_set(void *data, u64 val)
+{
+ int ret = 0;
+
+ if (!kexec_trylock())
+ return -EBUSY;
+
+ /* FDT already exists, it's too late to change fdt_max */
+ if (kho_out.fdt) {
+ ret = -EBUSY;
+ goto unlock;
+ }
+
+ kho_out.fdt_max = val;
+
+unlock:
+ kexec_unlock();
+ return ret;
+}
+
+DEFINE_DEBUGFS_ATTRIBUTE(fops_kho_out_fdt_max, kho_out_fdt_max_get,
+ kho_out_fdt_max_set, "%llu\n");
+
+static int scratch_phys_show(struct seq_file *m, void *v)
+{
+ for (int i = 0; i < kho_scratch_cnt; i++)
+ seq_printf(m, "0x%llx\n", kho_scratch[i].addr);
+
+ return 0;
+}
+DEFINE_SHOW_ATTRIBUTE(scratch_phys);
+
+static int scratch_len_show(struct seq_file *m, void *v)
+{
+ for (int i = 0; i < kho_scratch_cnt; i++)
+ seq_printf(m, "0x%llx\n", kho_scratch[i].size);
+
+ return 0;
+}
+DEFINE_SHOW_ATTRIBUTE(scratch_len);
+
+static __init int kho_out_debugfs_init(void)
+{
+ struct dentry *dir, *f;
+
+ dir = debugfs_create_dir("out", debugfs_root);
+ if (IS_ERR(dir))
+ return -ENOMEM;
+
+ f = debugfs_create_file("scratch_phys", 0400, dir, NULL,
+ &scratch_phys_fops);
+ if (IS_ERR(f))
+ goto err_rmdir;
+
+ f = debugfs_create_file("scratch_len", 0400, dir, NULL,
+ &scratch_len_fops);
+ if (IS_ERR(f))
+ goto err_rmdir;
+
+ f = debugfs_create_file("fdt_max", 0600, dir, NULL,
+ &fops_kho_out_fdt_max);
+ if (IS_ERR(f))
+ goto err_rmdir;
+
+ f = debugfs_create_file("finalize", 0600, dir, NULL,
+ &fops_kho_out_finalize);
+ if (IS_ERR(f))
+ goto err_rmdir;
+
+ kho_out.dir = dir;
+ return 0;
+
+err_rmdir:
+ debugfs_remove_recursive(dir);
+ return -ENOENT;
+}
+
+static __init int kho_init(void)
+{
+ int err;
+
+ if (!kho_enable)
+ return 0;
+
+ kho_out.root.name = "";
+ err = kho_add_string_prop(&kho_out.root, "compatible", "kho-v1");
+ if (err)
+ goto err_free_scratch;
+
+ debugfs_root = debugfs_create_dir("kho", NULL);
+ if (IS_ERR(debugfs_root)) {
+ err = -ENOENT;
+ goto err_free_scratch;
+ }
+
+ err = kho_out_debugfs_init();
+ if (err)
+ goto err_free_scratch;
+
+ for (int i = 0; i < kho_scratch_cnt; i++) {
+ unsigned long base_pfn = PHYS_PFN(kho_scratch[i].addr);
+ unsigned long count = kho_scratch[i].size >> PAGE_SHIFT;
+ unsigned long pfn;
+
+ for (pfn = base_pfn; pfn < base_pfn + count;
+ pfn += pageblock_nr_pages)
+ init_cma_reserved_pageblock(pfn_to_page(pfn));
+ }
+
+ return 0;
+
+err_free_scratch:
+ for (int i = 0; i < kho_scratch_cnt; i++) {
+ void *start = __va(kho_scratch[i].addr);
+ void *end = start + kho_scratch[i].size;
+
+ free_reserved_area(start, end, -1, "");
+ }
+ kho_enable = false;
+ return err;
+}
+late_initcall(kho_init);
+
+/*
+ * The scratch areas are scaled by default as percent of memory allocated from
+ * memblock. A user can override the scale with command line parameter:
+ *
+ * kho_scratch=N%
+ *
+ * It is also possible to explicitly define size for a lowmem, a global and
+ * per-node scratch areas:
+ *
+ * kho_scratch=l[KMG],n[KMG],m[KMG]
+ *
+ * The explicit size definition takes precedence over scale definition.
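+ *
+ * For example (sizes purely illustrative), kho_scratch=256M,2G,512M reserves
+ * 256M of lowmem scratch, a 2G global scratch area and 512M of scratch per
+ * NUMA node, while kho_scratch=300% scales every area to 300 percent of the
+ * early memblock reservations.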
+ */
+static unsigned int scratch_scale __initdata = 200;
+static phys_addr_t scratch_size_global __initdata;
+static phys_addr_t scratch_size_pernode __initdata;
+static phys_addr_t scratch_size_lowmem __initdata;
+
+static int __init kho_parse_scratch_size(char *p)
+{
+ unsigned long size, size_pernode, size_global;
+ char *endptr, *oldp = p;
+
+ if (!p)
+ return -EINVAL;
+
+ size = simple_strtoul(p, &endptr, 0);
+ if (*endptr == '%') {
+ scratch_scale = size;
+ pr_notice("scratch scale is %d percent\n", scratch_scale);
+ } else {
+ size = memparse(p, &p);
+ if (!size || p == oldp)
+ return -EINVAL;
+
+ if (*p != ',')
+ return -EINVAL;
+
+ oldp = p;
+ size_global = memparse(p + 1, &p);
+ if (!size_global || p == oldp)
+ return -EINVAL;
+
+ if (*p != ',')
+ return -EINVAL;
+
+ size_pernode = memparse(p + 1, &p);
+ if (!size_pernode)
+ return -EINVAL;
+
+ scratch_size_lowmem = size;
+ scratch_size_global = size_global;
+ scratch_size_pernode = size_pernode;
+ scratch_scale = 0;
+
+ pr_notice("scratch areas: lowmem: %lluMB global: %lluMB pernode: %lldMB\n",
+ (u64)(scratch_size_lowmem >> 20),
+ (u64)(scratch_size_global >> 20),
+ (u64)(scratch_size_pernode >> 20));
+ }
+
+ return 0;
+}
+early_param("kho_scratch", kho_parse_scratch_size);
+
+static void __init scratch_size_update(void)
+{
+ phys_addr_t size;
+
+ if (!scratch_scale)
+ return;
+
+ size = memblock_reserved_kern_size(ARCH_LOW_ADDRESS_LIMIT,
+ NUMA_NO_NODE);
+ size = size * scratch_scale / 100;
+ scratch_size_lowmem = round_up(size, CMA_MIN_ALIGNMENT_BYTES);
+
+ size = memblock_reserved_kern_size(MEMBLOCK_ALLOC_ANYWHERE,
+ NUMA_NO_NODE);
+ size = size * scratch_scale / 100 - scratch_size_lowmem;
+ scratch_size_global = round_up(size, CMA_MIN_ALIGNMENT_BYTES);
+}
+
+static phys_addr_t __init scratch_size_node(int nid)
+{
+ phys_addr_t size;
+
+ if (scratch_scale) {
+ size = memblock_reserved_kern_size(MEMBLOCK_ALLOC_ANYWHERE,
+ nid);
+ size = size * scratch_scale / 100;
+ } else {
+ size = scratch_size_pernode;
+ }
+
+ return round_up(size, CMA_MIN_ALIGNMENT_BYTES);
+}
+
+/**
+ * kho_reserve_scratch - Reserve a contiguous chunk of memory for kexec
+ *
+ * With KHO we can preserve arbitrary pages in the system. To ensure we still
+ * have a large contiguous region of memory when we search the physical address
+ * space for target memory, let's make sure we always have a large CMA region
+ * active. This CMA region will only be used for movable pages which are not a
+ * problem for us during KHO because we can just move them somewhere else.
+ */
+static void __init kho_reserve_scratch(void)
+{
+ phys_addr_t addr, size;
+ int nid, i = 0;
+
+ if (!kho_enable)
+ return;
+
+ scratch_size_update();
+
+ /* FIXME: deal with node hot-plug/remove */
+ kho_scratch_cnt = num_online_nodes() + 2;
+ size = kho_scratch_cnt * sizeof(*kho_scratch);
+ kho_scratch = memblock_alloc(size, PAGE_SIZE);
+ if (!kho_scratch)
+ goto err_disable_kho;
+
+ /*
+ * reserve scratch area in low memory for lowmem allocations in the
+ * next kernel
+ */
+ size = scratch_size_lowmem;
+ addr = memblock_phys_alloc_range(size, CMA_MIN_ALIGNMENT_BYTES, 0,
+ ARCH_LOW_ADDRESS_LIMIT);
+ if (!addr)
+ goto err_free_scratch_desc;
+
+ kho_scratch[i].addr = addr;
+ kho_scratch[i].size = size;
+ i++;
+
+ /* reserve large contiguous area for allocations without nid */
+ size = scratch_size_global;
+ addr = memblock_phys_alloc(size, CMA_MIN_ALIGNMENT_BYTES);
+ if (!addr)
+ goto err_free_scratch_areas;
+
+ kho_scratch[i].addr = addr;
+ kho_scratch[i].size = size;
+ i++;
+
+ for_each_online_node(nid) {
+ size = scratch_size_node(nid);
+ addr = memblock_alloc_range_nid(size, CMA_MIN_ALIGNMENT_BYTES,
+ 0, MEMBLOCK_ALLOC_ACCESSIBLE,
+ nid, true);
+ if (!addr)
+ goto err_free_scratch_areas;
+
+ kho_scratch[i].addr = addr;
+ kho_scratch[i].size = size;
+ i++;
+ }
+
+ return;
+
+err_free_scratch_areas:
+ for (i--; i >= 0; i--)
+ memblock_phys_free(kho_scratch[i].addr, kho_scratch[i].size);
+err_free_scratch_desc:
+ memblock_free(kho_scratch, kho_scratch_cnt * sizeof(*kho_scratch));
+err_disable_kho:
+ kho_enable = false;
+}
+
+void __init kho_memory_init(void)
+{
+ kho_reserve_scratch();
+}
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 04441c258b05..757659b7a26b 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -30,6 +30,7 @@
#include <linux/crash_dump.h>
#include <linux/execmem.h>
#include <linux/vmstat.h>
+#include <linux/kexec_handover.h>
#include "internal.h"
#include "slab.h"
#include "shuffle.h"
@@ -2661,6 +2662,13 @@ void __init mm_core_init(void)
report_meminit();
kmsan_init_shadow();
stack_depot_early_init();
+
+ /*
+ * KHO memory setup must happen while memblock is still active, but
+ * as close as possible to buddy initialization
+ */
+ kho_memory_init();
+
mem_init();
kmem_cache_init();
/*
--
2.48.1.711.g2feabab25a-goog
* [PATCH v5 08/16] kexec: add KHO parsing support
2025-03-20 1:55 [PATCH v5 00/16] kexec: introduce Kexec HandOver (KHO) Changyuan Lyu
` (6 preceding siblings ...)
2025-03-20 1:55 ` [PATCH v5 07/16] kexec: add Kexec HandOver (KHO) generation helpers Changyuan Lyu
@ 2025-03-20 1:55 ` Changyuan Lyu
2025-03-20 1:55 ` [PATCH v5 09/16] kexec: enable KHO support for memory preservation Changyuan Lyu
` (8 subsequent siblings)
16 siblings, 0 replies; 103+ messages in thread
From: Changyuan Lyu @ 2025-03-20 1:55 UTC (permalink / raw)
To: linux-kernel
Cc: graf, akpm, luto, anthony.yznaga, arnd, ashish.kalra, benh, bp,
catalin.marinas, dave.hansen, dwmw2, ebiederm, mingo, jgowans,
corbet, krzk, rppt, mark.rutland, pbonzini, pasha.tatashin, hpa,
peterz, ptyadav, robh+dt, robh, saravanak, skinsburskii, rostedt,
tglx, thomas.lendacky, usama.arif, will, devicetree, kexec,
linux-arm-kernel, linux-doc, linux-mm, x86, Changyuan Lyu
From: Alexander Graf <graf@amazon.com>
When we have a KHO kexec, we get an FDT blob and scratch region to
populate the state of the system. Provide helper functions that allow
architecture code to easily handle memory reservations based on them and
give device drivers visibility into the KHO FDT and memory reservations
so they can recover their own state.
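As a rough illustration (the node name, property name, and the
my_driver_restore() helper below are hypothetical and not part of this
series), a driver could recover its state early in boot along these
lines:
    struct kho_in_node node;
    const phys_addr_t *prop;
    u32 size;
    if (kho_get_node(NULL, "my-driver", &node))
        return;
    if (kho_node_check_compatible(&node, "my-driver-v1"))
        return;
    prop = kho_get_prop(&node, "state-phys", &size);
    if (prop && size == sizeof(*prop))
        my_driver_restore(*prop);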
Signed-off-by: Alexander Graf <graf@amazon.com>
Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Co-developed-by: Changyuan Lyu <changyuanl@google.com>
Signed-off-by: Changyuan Lyu <changyuanl@google.com>
---
include/linux/kexec_handover.h | 48 ++++++
kernel/kexec_handover.c | 302 ++++++++++++++++++++++++++++++++-
mm/memblock.c | 1 +
3 files changed, 350 insertions(+), 1 deletion(-)
diff --git a/include/linux/kexec_handover.h b/include/linux/kexec_handover.h
index 9cd9ad31e2d1..c665ff6cd728 100644
--- a/include/linux/kexec_handover.h
+++ b/include/linux/kexec_handover.h
@@ -35,6 +35,10 @@ struct kho_node {
bool visited;
};
+struct kho_in_node {
+ int offset;
+};
+
#ifdef CONFIG_KEXEC_HANDOVER
bool kho_is_enabled(void);
void kho_init_node(struct kho_node *node);
@@ -51,6 +55,19 @@ int register_kho_notifier(struct notifier_block *nb);
int unregister_kho_notifier(struct notifier_block *nb);
void kho_memory_init(void);
+
+void kho_populate(phys_addr_t handover_fdt_phys, phys_addr_t scratch_phys,
+ u64 scratch_len);
+
+int kho_get_node(const struct kho_in_node *parent, const char *name,
+ struct kho_in_node *child);
+int kho_get_nodes(const struct kho_in_node *parent,
+ int (*func)(const char *, const struct kho_in_node *, void *),
+ void *data);
+const void *kho_get_prop(const struct kho_in_node *node, const char *key,
+ u32 *size);
+int kho_node_check_compatible(const struct kho_in_node *node,
+ const char *compatible);
#else
static inline bool kho_is_enabled(void)
{
@@ -104,6 +121,37 @@ static inline int unregister_kho_notifier(struct notifier_block *nb)
static inline void kho_memory_init(void)
{
}
+
+static inline void kho_populate(phys_addr_t handover_fdt_phys,
+ phys_addr_t scratch_phys, u64 scratch_len)
+{
+}
+
+static inline int kho_get_node(const struct kho_in_node *parent,
+ const char *name, struct kho_in_node *child)
+{
+ return -EOPNOTSUPP;
+}
+
+static inline int kho_get_nodes(const struct kho_in_node *parent,
+ int (*func)(const char *,
+ const struct kho_in_node *, void *),
+ void *data)
+{
+ return -EOPNOTSUPP;
+}
+
+static inline const void *kho_get_prop(const struct kho_in_node *node,
+ const char *key, u32 *size)
+{
+ return ERR_PTR(-EOPNOTSUPP);
+}
+
+static inline int kho_node_check_compatible(const struct kho_in_node *node,
+ const char *compatible)
+{
+ return -EOPNOTSUPP;
+}
#endif /* CONFIG_KEXEC_HANDOVER */
#endif /* LINUX_KEXEC_HANDOVER_H */
diff --git a/kernel/kexec_handover.c b/kernel/kexec_handover.c
index df0d9debbb64..6ebad2f023f9 100644
--- a/kernel/kexec_handover.c
+++ b/kernel/kexec_handover.c
@@ -73,6 +73,20 @@ static struct kho_out kho_out = {
.fdt_max = 10 * SZ_1M,
};
+struct kho_in {
+ struct debugfs_blob_wrapper fdt_wrapper;
+ struct dentry *dir;
+ phys_addr_t kho_scratch_phys;
+ phys_addr_t fdt_phys;
+};
+
+static struct kho_in kho_in;
+
+static const void *kho_get_fdt(void)
+{
+ return kho_in.fdt_phys ? phys_to_virt(kho_in.fdt_phys) : NULL;
+}
+
int register_kho_notifier(struct notifier_block *nb)
{
return blocking_notifier_chain_register(&kho_out.chain_head, nb);
@@ -85,6 +99,144 @@ int unregister_kho_notifier(struct notifier_block *nb)
}
EXPORT_SYMBOL_GPL(unregister_kho_notifier);
+/**
+ * kho_get_node - retrieve a node saved in KHO FDT.
+ * @parent: the parent node to look up @name under.
+ * @name: the name of the node to look for.
+ * @child: if a node named @name is found under @parent, it is stored in @child.
+ *
+ * If @parent is NULL, this function looks up @name under the KHO root node.
+ *
+ * Return: 0 on success, and @child is populated, error code on failure.
+ */
+int kho_get_node(const struct kho_in_node *parent, const char *name,
+ struct kho_in_node *child)
+{
+ int parent_offset = 0;
+ int offset = 0;
+ const void *fdt = kho_get_fdt();
+
+ if (!fdt)
+ return -ENOENT;
+
+ if (!child)
+ return -EINVAL;
+
+ if (parent)
+ parent_offset = parent->offset;
+
+ offset = fdt_subnode_offset(fdt, parent_offset, name);
+ if (offset < 0)
+ return -ENOENT;
+
+ child->offset = offset;
+ return 0;
+}
+EXPORT_SYMBOL_GPL(kho_get_node);
+
+/**
+ * kho_get_nodes - iterate over all direct child nodes.
+ * @parent: the parent node to look for child nodes.
+ * @func: a function pointer to be called on each child node.
+ * @data: auxiliary data to be passed to @func.
+ *
+ * For every direct child node of @parent, @func is called with the child node
+ * name, the child node (a struct kho_in_node *), and @data.
+ *
+ * If @parent is NULL, this function iterates over the child nodes of the KHO
+ * root node.
+ *
+ * Return: 0 on success, error code on failure.
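+ *
+ * A minimal callback sketch (names are illustrative):
+ *
+ *   static int kho_dump_child(const char *name,
+ *                             const struct kho_in_node *node, void *data)
+ *   {
+ *           pr_info("found child node %s\n", name);
+ *           return 0;
+ *   }
+ *
+ *   kho_get_nodes(NULL, kho_dump_child, NULL);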
+ */
+int kho_get_nodes(const struct kho_in_node *parent,
+ int (*func)(const char *, const struct kho_in_node *, void *),
+ void *data)
+{
+ int parent_offset = 0;
+ struct kho_in_node child;
+ const char *name;
+ int ret = 0;
+ const void *fdt = kho_get_fdt();
+
+ if (!fdt)
+ return -ENOENT;
+
+ if (parent)
+ parent_offset = parent->offset;
+
+ fdt_for_each_subnode(child.offset, fdt, parent_offset) {
+ if (child.offset < 0)
+ return -EINVAL;
+
+ name = fdt_get_name(fdt, child.offset, NULL);
+
+ if (!name)
+ return -EINVAL;
+
+ ret = func(name, &child, data);
+
+ if (ret < 0)
+ break;
+ }
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(kho_get_nodes);
+
+/**
+ * kho_get_prop - retrieve the property data stored in the KHO tree.
+ * @node: the node to read the property from.
+ * @key: the key of the property.
+ * @size: a pointer to store the size of the data in bytes.
+ *
+ * Return: pointer to the data, and data size is stored in @size, or NULL on
+ * failure.
+ */
+const void *kho_get_prop(const struct kho_in_node *node, const char *key,
+ u32 *size)
+{
+ int offset = 0;
+ u32 s;
+ const void *fdt = kho_get_fdt();
+
+ if (!fdt)
+ return NULL;
+
+ if (node)
+ offset = node->offset;
+
+ if (!size)
+ size = &s;
+
+ return fdt_getprop(fdt, offset, key, size);
+}
+EXPORT_SYMBOL_GPL(kho_get_prop);
+
+/**
+ * kho_node_check_compatible - check a node's compatible property.
+ * @node: the node to check.
+ * @compatible: the compatible string.
+ *
+ * Wrapper of fdt_node_check_compatible().
+ *
+ * Return: 0 if @compatible is in the node's "compatible" list, or
+ * error code on failure.
+ */
+int kho_node_check_compatible(const struct kho_in_node *node,
+ const char *compatible)
+{
+ int result = 0;
+ const void *fdt = kho_get_fdt();
+
+ if (!fdt)
+ return -ENOENT;
+
+ result = fdt_node_check_compatible(fdt, node->offset, compatible);
+
+ return result ? -EINVAL : 0;
+}
+EXPORT_SYMBOL_GPL(kho_node_check_compatible);
+
/* Helper functions for KHO state tree */
struct kho_prop {
@@ -605,6 +757,32 @@ static int scratch_len_show(struct seq_file *m, void *v)
}
DEFINE_SHOW_ATTRIBUTE(scratch_len);
+/* Handling for debugfs/kho/in */
+static __init int kho_in_debugfs_init(const void *fdt)
+{
+ struct dentry *file;
+ int err;
+
+ kho_in.dir = debugfs_create_dir("in", debugfs_root);
+ if (IS_ERR(kho_in.dir))
+ return PTR_ERR(kho_in.dir);
+
+ kho_in.fdt_wrapper.size = fdt_totalsize(fdt);
+ kho_in.fdt_wrapper.data = (void *)fdt;
+ file = debugfs_create_blob("fdt", 0400, kho_in.dir,
+ &kho_in.fdt_wrapper);
+ if (IS_ERR(file)) {
+ err = PTR_ERR(file);
+ goto err_rmdir;
+ }
+
+ return 0;
+
+err_rmdir:
+ debugfs_remove(kho_in.dir);
+ return err;
+}
+
static __init int kho_out_debugfs_init(void)
{
struct dentry *dir, *f;
@@ -644,6 +822,7 @@ static __init int kho_out_debugfs_init(void)
static __init int kho_init(void)
{
int err;
+ const void *fdt = kho_get_fdt();
if (!kho_enable)
return 0;
@@ -663,6 +842,21 @@ static __init int kho_init(void)
if (err)
goto err_free_scratch;
+ if (fdt) {
+ err = kho_in_debugfs_init(fdt);
+ /*
+ * Failure to create /sys/kernel/debug/kho/in does not prevent
+ * reviving state from KHO and setting up KHO for the next
+ * kexec.
+ */
+ if (err)
+ pr_err("failed exposing handover FDT in debugfs\n");
+
+ kho_scratch = __va(kho_in.kho_scratch_phys);
+
+ return 0;
+ }
+
for (int i = 0; i < kho_scratch_cnt; i++) {
unsigned long base_pfn = PHYS_PFN(kho_scratch[i].addr);
unsigned long count = kho_scratch[i].size >> PAGE_SHIFT;
@@ -859,7 +1053,113 @@ static void __init kho_reserve_scratch(void)
kho_enable = false;
}
+static void __init kho_release_scratch(void)
+{
+ phys_addr_t start, end;
+ u64 i;
+
+ memmap_init_kho_scratch_pages();
+
+ /*
+ * Mark scratch mem as CMA before we return it. That way we
+ * ensure that no kernel allocations happen on it. That means
+ * we can reuse it as scratch memory again later.
+ */
+ __for_each_mem_range(i, &memblock.memory, NULL, NUMA_NO_NODE,
+ MEMBLOCK_KHO_SCRATCH, &start, &end, NULL) {
+ ulong start_pfn = pageblock_start_pfn(PFN_DOWN(start));
+ ulong end_pfn = pageblock_align(PFN_UP(end));
+ ulong pfn;
+
+ for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages)
+ set_pageblock_migratetype(pfn_to_page(pfn),
+ MIGRATE_CMA);
+ }
+}
+
void __init kho_memory_init(void)
{
- kho_reserve_scratch();
+ if (!kho_get_fdt())
+ kho_reserve_scratch();
+ else
+ kho_release_scratch();
+}
+
+void __init kho_populate(phys_addr_t handover_fdt_phys,
+ phys_addr_t scratch_phys, u64 scratch_len)
+{
+ void *handover_fdt;
+ struct kho_scratch *scratch;
+ u32 fdt_size = 0;
+
+ /* Determine the real size of the FDT */
+ handover_fdt =
+ early_memremap(handover_fdt_phys, sizeof(struct fdt_header));
+ if (!handover_fdt) {
+ pr_warn("setup: failed to memremap kexec FDT (0x%llx)\n",
+ handover_fdt_phys);
+ return;
+ }
+
+ if (fdt_check_header(handover_fdt)) {
+ pr_warn("setup: kexec handover FDT is invalid (0x%llx)\n",
+ handover_fdt_phys);
+ early_memunmap(handover_fdt, sizeof(struct fdt_header));
+ return;
+ }
+
+ fdt_size = fdt_totalsize(handover_fdt);
+ kho_in.fdt_phys = handover_fdt_phys;
+
+ early_memunmap(handover_fdt, sizeof(struct fdt_header));
+
+ /* Reserve the DT so we can still access it in late boot */
+ memblock_reserve(handover_fdt_phys, fdt_size);
+
+ kho_in.kho_scratch_phys = scratch_phys;
+ kho_scratch_cnt = scratch_len / sizeof(*kho_scratch);
+ scratch = early_memremap(scratch_phys, scratch_len);
+ if (!scratch) {
+ pr_warn("setup: failed to memremap kexec scratch (0x%llx)\n",
+ scratch_phys);
+ return;
+ }
+
+ /*
+ * The previous kernel passed us safe contiguous blocks of memory to use
+ * for early boot purposes so that we can resize the memblock array as
+ * needed.
+ */
+ for (int i = 0; i < kho_scratch_cnt; i++) {
+ struct kho_scratch *area = &scratch[i];
+ u64 size = area->size;
+
+ memblock_add(area->addr, size);
+
+ if (WARN_ON(memblock_mark_kho_scratch(area->addr, size))) {
+ pr_err("Kexec failed to mark the scratch region. Disabling KHO revival.");
+ kho_in.fdt_phys = 0;
+ scratch = NULL;
+ break;
+ }
+ pr_debug("Marked 0x%pa+0x%pa as scratch", &area->addr, &size);
+ }
+
+ early_memunmap(scratch, scratch_len);
+
+ if (!scratch)
+ return;
+
+ memblock_reserve(scratch_phys, scratch_len);
+
+ /*
+ * Now that we have a viable region of scratch memory, let's tell
+ * the memblocks allocator to only use that for any allocations.
+ * That way we ensure that nothing scribbles over in use data while
+ * we initialize the page tables which we will need to ingest all
+ * memory reservations from the previous kernel.
+ */
+ memblock_set_kho_scratch_only();
+
+ pr_info("setup: Found kexec handover data. Will skip init for some devices\n");
}
diff --git a/mm/memblock.c b/mm/memblock.c
index d5d406a5160a..d28abf3def1c 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -2374,6 +2374,7 @@ void __init memblock_free_all(void)
free_unused_memmap();
reset_all_zones_managed_pages();
+ memblock_clear_kho_scratch_only();
pages = free_low_memory_core_early();
totalram_pages_add(pages);
}
--
2.48.1.711.g2feabab25a-goog
* [PATCH v5 09/16] kexec: enable KHO support for memory preservation
2025-03-20 1:55 [PATCH v5 00/16] kexec: introduce Kexec HandOver (KHO) Changyuan Lyu
` (7 preceding siblings ...)
2025-03-20 1:55 ` [PATCH v5 08/16] kexec: add KHO parsing support Changyuan Lyu
@ 2025-03-20 1:55 ` Changyuan Lyu
2025-03-21 13:46 ` Jason Gunthorpe
` (3 more replies)
2025-03-20 1:55 ` [PATCH v5 10/16] kexec: add KHO support to kexec file loads Changyuan Lyu
` (7 subsequent siblings)
16 siblings, 4 replies; 103+ messages in thread
From: Changyuan Lyu @ 2025-03-20 1:55 UTC (permalink / raw)
To: linux-kernel
Cc: graf, akpm, luto, anthony.yznaga, arnd, ashish.kalra, benh, bp,
catalin.marinas, dave.hansen, dwmw2, ebiederm, mingo, jgowans,
corbet, krzk, rppt, mark.rutland, pbonzini, pasha.tatashin, hpa,
peterz, ptyadav, robh+dt, robh, saravanak, skinsburskii, rostedt,
tglx, thomas.lendacky, usama.arif, will, devicetree, kexec,
linux-arm-kernel, linux-doc, linux-mm, x86, Jason Gunthorpe,
Changyuan Lyu
From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
Introduce APIs allowing KHO users to preserve memory across kexec and
get access to that memory after boot of the kexeced kernel:
kho_preserve_folio() - record a folio to be preserved over kexec
kho_restore_folio() - recreate the folio from the preserved memory
kho_preserve_phys() - record a physically contiguous range to be
preserved over kexec
kho_restore_phys() - recreate order-0 pages corresponding to the
preserved physical range
The memory preservations are tracked by two levels of xarrays to manage
chunks of per-order 512 byte bitmaps. For instance the entire 1G order
of a 1TB x86 system would fit inside a single 512 byte bitmap. For
order 0 allocations each bitmap will cover 16M of address space. Thus,
for 16G of memory at most 512K of bitmap memory will be needed for order 0.
At serialization time all bitmaps are recorded in a linked list of pages
for the next kernel to process, and the physical address of the list is
recorded in the KHO FDT.
The next kernel then processes that list, reserves the memory ranges and
later, when a user requests a folio or a physical range, KHO restores
corresponding memory map entries.
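As a rough sketch (my_node, the "state-phys" property and the error
handling are illustrative only), a KHO user could preserve a folio on
the outgoing side and revive it in the next kernel:
    /* outgoing kernel, any time before finalization */
    static phys_addr_t state_phys;
    err = kho_preserve_folio(folio);
    if (!err) {
        state_phys = PFN_PHYS(folio_pfn(folio));
        err = kho_add_prop(&my_node, "state-phys",
                           &state_phys, sizeof(state_phys));
    }
    /* next kernel, after reading "state-phys" back from the KHO FDT */
    folio = kho_restore_folio(state_phys);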
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Co-developed-by: Changyuan Lyu <changyuanl@google.com>
Signed-off-by: Changyuan Lyu <changyuanl@google.com>
---
include/linux/kexec_handover.h | 38 +++
kernel/kexec_handover.c | 486 ++++++++++++++++++++++++++++++++-
2 files changed, 522 insertions(+), 2 deletions(-)
diff --git a/include/linux/kexec_handover.h b/include/linux/kexec_handover.h
index c665ff6cd728..d52a7b500f4c 100644
--- a/include/linux/kexec_handover.h
+++ b/include/linux/kexec_handover.h
@@ -5,6 +5,7 @@
#include <linux/types.h>
#include <linux/hashtable.h>
#include <linux/notifier.h>
+#include <linux/mm_types.h>
struct kho_scratch {
phys_addr_t addr;
@@ -54,6 +55,13 @@ int kho_add_string_prop(struct kho_node *node, const char *key,
int register_kho_notifier(struct notifier_block *nb);
int unregister_kho_notifier(struct notifier_block *nb);
+int kho_preserve_folio(struct folio *folio);
+int kho_unpreserve_folio(struct folio *folio);
+int kho_preserve_phys(phys_addr_t phys, size_t size);
+int kho_unpreserve_phys(phys_addr_t phys, size_t size);
+struct folio *kho_restore_folio(phys_addr_t phys);
+void *kho_restore_phys(phys_addr_t phys, size_t size);
+
void kho_memory_init(void);
void kho_populate(phys_addr_t handover_fdt_phys, phys_addr_t scratch_phys,
@@ -118,6 +126,36 @@ static inline int unregister_kho_notifier(struct notifier_block *nb)
return -EOPNOTSUPP;
}
+static inline int kho_preserve_folio(struct folio *folio)
+{
+ return -EOPNOTSUPP;
+}
+
+static inline int kho_unpreserve_folio(struct folio *folio)
+{
+ return -EOPNOTSUPP;
+}
+
+static inline int kho_preserve_phys(phys_addr_t phys, size_t size)
+{
+ return -EOPNOTSUPP;
+}
+
+static inline int kho_unpreserve_phys(phys_addr_t phys, size_t size)
+{
+ return -EOPNOTSUPP;
+}
+
+static inline struct folio *kho_restore_folio(phys_addr_t phys)
+{
+ return NULL;
+}
+
+static inline void *kho_restore_phys(phys_addr_t phys, size_t size)
+{
+ return NULL;
+}
+
static inline void kho_memory_init(void)
{
}
diff --git a/kernel/kexec_handover.c b/kernel/kexec_handover.c
index 6ebad2f023f9..592563c21369 100644
--- a/kernel/kexec_handover.c
+++ b/kernel/kexec_handover.c
@@ -62,6 +62,13 @@ struct kho_out {
struct rw_semaphore tree_lock;
struct kho_node root;
+ /**
+ * Physical address of the first struct khoser_mem_chunk containing
+ * serialized data from struct kho_mem_track.
+ */
+ phys_addr_t first_chunk_phys;
+ struct kho_node preserved_memory;
+
void *fdt;
u64 fdt_max;
};
@@ -70,6 +77,7 @@ static struct kho_out kho_out = {
.chain_head = BLOCKING_NOTIFIER_INIT(kho_out.chain_head),
.tree_lock = __RWSEM_INITIALIZER(kho_out.tree_lock),
.root = KHO_NODE_INIT,
+ .preserved_memory = KHO_NODE_INIT,
.fdt_max = 10 * SZ_1M,
};
@@ -237,6 +245,461 @@ int kho_node_check_compatible(const struct kho_in_node *node,
}
EXPORT_SYMBOL_GPL(kho_node_check_compatible);
+/*
+ * Keep track of memory that is to be preserved across KHO.
+ *
+ * The serializing side uses two levels of xarrays to manage chunks of per-order
+ * 512 byte bitmaps. For instance the entire 1G order of a 1TB system would fit
+ * inside a single 512 byte bitmap. For order 0 allocations each bitmap will
+ * cover 16M of address space. Thus, for 16G of memory at most 512K
+ * of bitmap memory will be needed for order 0.
+ *
+ * This approach is fully incremental: as the serialization progresses, folios
+ * can continue to be aggregated into the tracker. The final step, immediately
+ * prior to kexec, serializes the xarray information into a linked list for
+ * the successor kernel to parse.
+ */
+
+#define PRESERVE_BITS (512 * 8)
+
+struct kho_mem_phys_bits {
+ DECLARE_BITMAP(preserve, PRESERVE_BITS);
+};
+
+struct kho_mem_phys {
+ /*
+ * Points to kho_mem_phys_bits, a sparse bitmap array. Each bit is sized
+ * to order.
+ */
+ struct xarray phys_bits;
+};
+
+struct kho_mem_track {
+ /* Points to kho_mem_phys, each order gets its own bitmap tree */
+ struct xarray orders;
+};
+
+static struct kho_mem_track kho_mem_track;
+
+static void *xa_load_or_alloc(struct xarray *xa, unsigned long index, size_t sz)
+{
+ void *elm, *res;
+
+ elm = xa_load(xa, index);
+ if (elm)
+ return elm;
+
+ elm = kzalloc(sz, GFP_KERNEL);
+ if (!elm)
+ return ERR_PTR(-ENOMEM);
+
+ res = xa_cmpxchg(xa, index, NULL, elm, GFP_KERNEL);
+ if (xa_is_err(res))
+ res = ERR_PTR(xa_err(res));
+
+ if (res) {
+ kfree(elm);
+ return res;
+ }
+
+ return elm;
+}
+
+static void __kho_unpreserve(struct kho_mem_track *tracker, unsigned long pfn,
+ unsigned int order)
+{
+ struct kho_mem_phys_bits *bits;
+ struct kho_mem_phys *physxa;
+ unsigned long pfn_hi = pfn >> order;
+
+ physxa = xa_load(&tracker->orders, order);
+ if (!physxa)
+ return;
+
+ bits = xa_load(&physxa->phys_bits, pfn_hi / PRESERVE_BITS);
+ if (!bits)
+ return;
+
+ clear_bit(pfn_hi % PRESERVE_BITS, bits->preserve);
+}
+
+static int __kho_preserve(struct kho_mem_track *tracker, unsigned long pfn,
+ unsigned int order)
+{
+ struct kho_mem_phys_bits *bits;
+ struct kho_mem_phys *physxa;
+ unsigned long pfn_hi = pfn >> order;
+
+ might_sleep();
+
+ physxa = xa_load_or_alloc(&tracker->orders, order, sizeof(*physxa));
+ if (IS_ERR(physxa))
+ return PTR_ERR(physxa);
+
+ bits = xa_load_or_alloc(&physxa->phys_bits, pfn_hi / PRESERVE_BITS,
+ sizeof(*bits));
+ if (IS_ERR(bits))
+ return PTR_ERR(bits);
+
+ set_bit(pfn_hi % PRESERVE_BITS, bits->preserve);
+
+ return 0;
+}
+
+/**
+ * kho_preserve_folio - preserve a folio across KHO.
+ * @folio: folio to preserve
+ *
+ * Records that the entire folio is preserved across KHO. The order
+ * will be preserved as well.
+ *
+ * Return: 0 on success, error code on failure
+ */
+int kho_preserve_folio(struct folio *folio)
+{
+ unsigned long pfn = folio_pfn(folio);
+ unsigned int order = folio_order(folio);
+ int err;
+
+ if (!kho_enable)
+ return -EOPNOTSUPP;
+
+ down_read(&kho_out.tree_lock);
+ if (kho_out.fdt) {
+ err = -EBUSY;
+ goto unlock;
+ }
+
+ err = __kho_preserve(&kho_mem_track, pfn, order);
+
+unlock:
+ up_read(&kho_out.tree_lock);
+
+ return err;
+}
+EXPORT_SYMBOL_GPL(kho_preserve_folio);
+
+/**
+ * kho_unpreserve_folio - unpreserve a folio
+ * @folio: folio to unpreserve
+ *
+ * Remove the record of a folio previously preserved by kho_preserve_folio().
+ *
+ * Return: 0 on success, error code on failure
+ */
+int kho_unpreserve_folio(struct folio *folio)
+{
+ unsigned long pfn = folio_pfn(folio);
+ unsigned int order = folio_order(folio);
+ int err = 0;
+
+ down_read(&kho_out.tree_lock);
+ if (kho_out.fdt) {
+ err = -EBUSY;
+ goto unlock;
+ }
+
+ __kho_unpreserve(&kho_mem_track, pfn, order);
+
+unlock:
+ up_read(&kho_out.tree_lock);
+
+ return err;
+}
+EXPORT_SYMBOL_GPL(kho_unpreserve_folio);
+
+/**
+ * kho_preserve_phys - preserve a physically contiguous range across KHO.
+ * @phys: physical address of the range
+ * @size: size of the range
+ *
+ * Records that the entire range from @phys to @phys + @size is preserved
+ * across KHO.
+ *
+ * Return: 0 on success, error code on failure
+ */
+int kho_preserve_phys(phys_addr_t phys, size_t size)
+{
+ unsigned long pfn = PHYS_PFN(phys), end_pfn = PHYS_PFN(phys + size);
+ unsigned int order = ilog2(end_pfn - pfn);
+ unsigned long failed_pfn;
+ int err = 0;
+
+ if (!kho_enable)
+ return -EOPNOTSUPP;
+
+ down_read(&kho_out.tree_lock);
+ if (kho_out.fdt) {
+ err = -EBUSY;
+ goto unlock;
+ }
+
+ for (; pfn < end_pfn;
+ pfn += (1 << order), order = ilog2(end_pfn - pfn)) {
+ err = __kho_preserve(&kho_mem_track, pfn, order);
+ if (err) {
+ failed_pfn = pfn;
+ break;
+ }
+ }
+
+ if (err)
+ for (pfn = PHYS_PFN(phys), order = ilog2(end_pfn - pfn);
+ pfn < failed_pfn;
+ pfn += (1 << order), order = ilog2(end_pfn - pfn))
+ __kho_unpreserve(&kho_mem_track, pfn, order);
+
+unlock:
+ up_read(&kho_out.tree_lock);
+
+ return err;
+}
+EXPORT_SYMBOL_GPL(kho_preserve_phys);
+
+/**
+ * kho_unpreserve_phys - unpreserve a physically contiguous range
+ * @phys: physical address of the range
+ * @size: size of the range
+ *
+ * Remove the record of a range previously preserved by kho_preserve_phys().
+ *
+ * Return: 0 on success, error code on failure
+ */
+int kho_unpreserve_phys(phys_addr_t phys, size_t size)
+{
+ unsigned long pfn = PHYS_PFN(phys), end_pfn = PHYS_PFN(phys + size);
+ unsigned int order = ilog2(end_pfn - pfn);
+ int err = 0;
+
+ down_read(&kho_out.tree_lock);
+ if (kho_out.fdt) {
+ err = -EBUSY;
+ goto unlock;
+ }
+
+ for (; pfn < end_pfn; pfn += (1 << order), order = ilog2(end_pfn - pfn))
+ __kho_unpreserve(&kho_mem_track, pfn, order);
+
+unlock:
+ up_read(&kho_out.tree_lock);
+
+ return err;
+}
+EXPORT_SYMBOL_GPL(kho_unpreserve_phys);
+
+/* almost as free_reserved_page(), just don't free the page */
+static void kho_restore_page(struct page *page)
+{
+ ClearPageReserved(page);
+ init_page_count(page);
+ adjust_managed_page_count(page, 1);
+}
+
+struct folio *kho_restore_folio(phys_addr_t phys)
+{
+ struct page *page = pfn_to_online_page(PHYS_PFN(phys));
+ unsigned long order;
+
+ if (!page)
+ return NULL;
+
+ order = page->private;
+ if (order)
+ prep_compound_page(page, order);
+ else
+ kho_restore_page(page);
+
+ return page_folio(page);
+}
+EXPORT_SYMBOL_GPL(kho_restore_folio);
+
+void *kho_restore_phys(phys_addr_t phys, size_t size)
+{
+ unsigned long start_pfn, end_pfn, pfn;
+ void *va = __va(phys);
+
+ start_pfn = PFN_DOWN(phys);
+ end_pfn = PFN_UP(phys + size);
+
+ for (pfn = start_pfn; pfn < end_pfn; pfn++) {
+ struct page *page = pfn_to_online_page(pfn);
+
+ if (!page)
+ return NULL;
+ kho_restore_page(page);
+ }
+
+ return va;
+}
+EXPORT_SYMBOL_GPL(kho_restore_phys);
+
+#define KHOSER_PTR(type) \
+ union { \
+ phys_addr_t phys; \
+ type ptr; \
+ }
+#define KHOSER_STORE_PTR(dest, val) \
+ ({ \
+ (dest).phys = virt_to_phys(val); \
+ typecheck(typeof((dest).ptr), val); \
+ })
+#define KHOSER_LOAD_PTR(src) \
+ ((src).phys ? (typeof((src).ptr))(phys_to_virt((src).phys)) : NULL)
+
+struct khoser_mem_bitmap_ptr {
+ phys_addr_t phys_start;
+ KHOSER_PTR(struct kho_mem_phys_bits *) bitmap;
+};
+
+struct khoser_mem_chunk;
+
+struct khoser_mem_chunk_hdr {
+ KHOSER_PTR(struct khoser_mem_chunk *) next;
+ unsigned int order;
+ unsigned int num_elms;
+};
+
+#define KHOSER_BITMAP_SIZE \
+ ((PAGE_SIZE - sizeof(struct khoser_mem_chunk_hdr)) / \
+ sizeof(struct khoser_mem_bitmap_ptr))
+
+struct khoser_mem_chunk {
+ struct khoser_mem_chunk_hdr hdr;
+ struct khoser_mem_bitmap_ptr bitmaps[KHOSER_BITMAP_SIZE];
+};
+static_assert(sizeof(struct khoser_mem_chunk) == PAGE_SIZE);
+
+static struct khoser_mem_chunk *new_chunk(struct khoser_mem_chunk *cur_chunk,
+ unsigned long order)
+{
+ struct khoser_mem_chunk *chunk;
+
+ chunk = (struct khoser_mem_chunk *)get_zeroed_page(GFP_KERNEL);
+ if (!chunk)
+ return NULL;
+ chunk->hdr.order = order;
+ if (cur_chunk)
+ KHOSER_STORE_PTR(cur_chunk->hdr.next, chunk);
+ return chunk;
+}
+
+static void kho_mem_ser_free(struct khoser_mem_chunk *first_chunk)
+{
+ struct khoser_mem_chunk *chunk = first_chunk;
+
+ while (chunk) {
+ unsigned long chunk_page = (unsigned long)chunk;
+
+ chunk = KHOSER_LOAD_PTR(chunk->hdr.next);
+ free_page(chunk_page);
+ }
+}
+
+/*
+ * Record all the bitmaps in a linked list of pages for the next kernel to
+ * process. Each chunk holds bitmaps of the same order and each block of bitmaps
+ * starts at a given physical address. This allows the bitmaps to be sparse. The
+ * xarray is used to store them in a tree while building up the data structure,
+ * but the KHO successor kernel only needs to process them once in order.
+ *
+ * All of this memory is normal kmalloc() memory and is not marked for
+ * preservation. The successor kernel will remain isolated to the scratch space
+ * until it completes processing this list. Once processed all the memory
+ * storing these ranges will be marked as free.
+ */
+static struct khoser_mem_chunk *kho_mem_serialize(void)
+{
+ struct kho_mem_track *tracker = &kho_mem_track;
+ struct khoser_mem_chunk *first_chunk = NULL;
+ struct khoser_mem_chunk *chunk = NULL;
+ struct kho_mem_phys *physxa;
+ unsigned long order;
+
+ xa_for_each(&tracker->orders, order, physxa) {
+ struct kho_mem_phys_bits *bits;
+ unsigned long phys;
+
+ chunk = new_chunk(chunk, order);
+ if (!chunk)
+ goto err_free;
+
+ if (!first_chunk)
+ first_chunk = chunk;
+
+ xa_for_each(&physxa->phys_bits, phys, bits) {
+ struct khoser_mem_bitmap_ptr *elm;
+
+ if (chunk->hdr.num_elms == ARRAY_SIZE(chunk->bitmaps)) {
+ chunk = new_chunk(chunk, order);
+ if (!chunk)
+ goto err_free;
+ }
+
+ elm = &chunk->bitmaps[chunk->hdr.num_elms];
+ chunk->hdr.num_elms++;
+ elm->phys_start = (phys * PRESERVE_BITS)
+ << (order + PAGE_SHIFT);
+ KHOSER_STORE_PTR(elm->bitmap, bits);
+ }
+ }
+
+ return first_chunk;
+
+err_free:
+ kho_mem_ser_free(first_chunk);
+ return ERR_PTR(-ENOMEM);
+}
+
+static void deserialize_bitmap(unsigned int order,
+ struct khoser_mem_bitmap_ptr *elm)
+{
+ struct kho_mem_phys_bits *bitmap = KHOSER_LOAD_PTR(elm->bitmap);
+ unsigned long bit;
+
+ for_each_set_bit(bit, bitmap->preserve, PRESERVE_BITS) {
+ int sz = 1 << (order + PAGE_SHIFT);
+ phys_addr_t phys =
+ elm->phys_start + (bit << (order + PAGE_SHIFT));
+ struct page *page = phys_to_page(phys);
+
+ memblock_reserve(phys, sz);
+ memblock_reserved_mark_noinit(phys, sz);
+ page->private = order;
+ }
+}
+
+static void __init kho_mem_deserialize(void)
+{
+ struct khoser_mem_chunk *chunk;
+ struct kho_in_node preserved_mem;
+ const phys_addr_t *mem;
+ int err;
+ u32 len;
+
+ err = kho_get_node(NULL, "preserved-memory", &preserved_mem);
+ if (err) {
+ pr_err("no preserved-memory node: %d\n", err);
+ return;
+ }
+
+ mem = kho_get_prop(&preserved_mem, "metadata", &len);
+ if (!mem || len != sizeof(*mem)) {
+ pr_err("failed to get preserved memory bitmaps\n");
+ return;
+ }
+
+ chunk = *mem ? phys_to_virt(*mem) : NULL;
+ while (chunk) {
+ unsigned int i;
+
+ memblock_reserve(virt_to_phys(chunk), sizeof(*chunk));
+
+ for (i = 0; i != chunk->hdr.num_elms; i++)
+ deserialize_bitmap(chunk->hdr.order,
+ &chunk->bitmaps[i]);
+ chunk = KHOSER_LOAD_PTR(chunk->hdr.next);
+ }
+}
+
/* Helper functions for KHO state tree */
struct kho_prop {
@@ -545,6 +1008,11 @@ static int kho_unfreeze(void)
if (fdt)
kvfree(fdt);
+ if (kho_out.first_chunk_phys) {
+ kho_mem_ser_free(phys_to_virt(kho_out.first_chunk_phys));
+ kho_out.first_chunk_phys = 0;
+ }
+
err = blocking_notifier_call_chain(&kho_out.chain_head,
KEXEC_KHO_UNFREEZE, NULL);
err = notifier_to_errno(err);
@@ -633,6 +1101,7 @@ static int kho_finalize(void)
{
int err = 0;
void *fdt;
+ struct khoser_mem_chunk *first_chunk;
fdt = kvmalloc(kho_out.fdt_max, GFP_KERNEL);
if (!fdt)
@@ -648,6 +1117,13 @@ static int kho_finalize(void)
kho_out.fdt = fdt;
up_write(&kho_out.tree_lock);
+ first_chunk = kho_mem_serialize();
+ if (IS_ERR(first_chunk)) {
+ err = PTR_ERR(first_chunk);
+ goto unfreeze;
+ }
+ kho_out.first_chunk_phys = first_chunk ? virt_to_phys(first_chunk) : 0;
+
err = kho_convert_tree(fdt, kho_out.fdt_max);
unfreeze:
@@ -829,6 +1305,10 @@ static __init int kho_init(void)
kho_out.root.name = "";
err = kho_add_string_prop(&kho_out.root, "compatible", "kho-v1");
+ err |= kho_add_prop(&kho_out.preserved_memory, "metadata",
+ &kho_out.first_chunk_phys, sizeof(phys_addr_t));
+ err |= kho_add_node(&kho_out.root, "preserved-memory",
+ &kho_out.preserved_memory);
if (err)
goto err_free_scratch;
@@ -1079,10 +1559,12 @@ static void __init kho_release_scratch(void)
void __init kho_memory_init(void)
{
- if (!kho_get_fdt())
+ if (!kho_get_fdt()) {
kho_reserve_scratch();
- else
+ } else {
+ kho_mem_deserialize();
kho_release_scratch();
+ }
}
void __init kho_populate(phys_addr_t handover_fdt_phys,
--
2.48.1.711.g2feabab25a-goog
* [PATCH v5 10/16] kexec: add KHO support to kexec file loads
2025-03-20 1:55 [PATCH v5 00/16] kexec: introduce Kexec HandOver (KHO) Changyuan Lyu
` (8 preceding siblings ...)
2025-03-20 1:55 ` [PATCH v5 09/16] kexec: enable KHO support for memory preservation Changyuan Lyu
@ 2025-03-20 1:55 ` Changyuan Lyu
2025-03-21 13:48 ` Jason Gunthorpe
2025-03-20 1:55 ` [PATCH v5 11/16] kexec: add config option for KHO Changyuan Lyu
` (6 subsequent siblings)
16 siblings, 1 reply; 103+ messages in thread
From: Changyuan Lyu @ 2025-03-20 1:55 UTC (permalink / raw)
To: linux-kernel
Cc: graf, akpm, luto, anthony.yznaga, arnd, ashish.kalra, benh, bp,
catalin.marinas, dave.hansen, dwmw2, ebiederm, mingo, jgowans,
corbet, krzk, rppt, mark.rutland, pbonzini, pasha.tatashin, hpa,
peterz, ptyadav, robh+dt, robh, saravanak, skinsburskii, rostedt,
tglx, thomas.lendacky, usama.arif, will, devicetree, kexec,
linux-arm-kernel, linux-doc, linux-mm, x86, Changyuan Lyu
From: Alexander Graf <graf@amazon.com>
Kexec has two modes: a user space driven mode and a kernel driven mode.
For the kernel driven mode, kernel code determines the physical
addresses of all target buffers that the payload gets copied into.
With KHO, we can only safely copy payloads into the "scratch area".
Teach the kexec file loader about it, so it only allocates for that
area. In addition, enlighten it with support to ask the KHO subsystem
for its respective payloads to copy into target memory. Also teach the
KHO subsystem how to fill the images for file loads.
Signed-off-by: Alexander Graf <graf@amazon.com>
Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Co-developed-by: Changyuan Lyu <changyuanl@google.com>
Signed-off-by: Changyuan Lyu <changyuanl@google.com>
---
include/linux/kexec.h | 7 +++
kernel/kexec_core.c | 4 ++
kernel/kexec_file.c | 19 +++++++
kernel/kexec_handover.c | 108 ++++++++++++++++++++++++++++++++++++++++
kernel/kexec_internal.h | 18 +++++++
5 files changed, 156 insertions(+)
diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index fad04f3bcf1d..d59eee60e36e 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -364,6 +364,13 @@ struct kimage {
size_t ima_buffer_size;
#endif
+#ifdef CONFIG_KEXEC_HANDOVER
+ struct {
+ struct kexec_segment *scratch;
+ struct kexec_segment *fdt;
+ } kho;
+#endif
+
/* Core ELF header buffer */
void *elf_headers;
unsigned long elf_headers_sz;
diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index 640d252306ea..67fb9c0b3714 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -1053,6 +1053,10 @@ int kernel_kexec(void)
goto Unlock;
}
+ error = kho_copy_fdt(kexec_image);
+ if (error)
+ goto Unlock;
+
#ifdef CONFIG_KEXEC_JUMP
if (kexec_image->preserve_context) {
/*
diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index 3eedb8c226ad..070ef206f573 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -253,6 +253,11 @@ kimage_file_prepare_segments(struct kimage *image, int kernel_fd, int initrd_fd,
/* IMA needs to pass the measurement list to the next kernel. */
ima_add_kexec_buffer(image);
+ /* If KHO is active, add its images to the list */
+ ret = kho_fill_kimage(image);
+ if (ret)
+ goto out;
+
/* Call image load handler */
ldata = kexec_image_load_default(image);
@@ -636,6 +641,14 @@ int kexec_locate_mem_hole(struct kexec_buf *kbuf)
if (kbuf->mem != KEXEC_BUF_MEM_UNKNOWN)
return 0;
+ /*
+ * If KHO is active, only use KHO scratch memory. All other memory
+ * could potentially be handed over.
+ */
+ ret = kho_locate_mem_hole(kbuf, locate_mem_hole_callback);
+ if (ret <= 0)
+ return ret;
+
if (!IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK))
ret = kexec_walk_resources(kbuf, locate_mem_hole_callback);
else
@@ -764,6 +777,12 @@ static int kexec_calculate_store_digests(struct kimage *image)
if (ksegment->kbuf == pi->purgatory_buf)
continue;
+#ifdef CONFIG_KEXEC_HANDOVER
+ /* Skip KHO FDT as its contents are copied in kernel_kexec(). */
+ if (ksegment == image->kho.fdt)
+ continue;
+#endif
+
ret = crypto_shash_update(desc, ksegment->kbuf,
ksegment->bufsz);
if (ret)
diff --git a/kernel/kexec_handover.c b/kernel/kexec_handover.c
index 592563c21369..5108e2cc1a22 100644
--- a/kernel/kexec_handover.c
+++ b/kernel/kexec_handover.c
@@ -245,6 +245,85 @@ int kho_node_check_compatible(const struct kho_in_node *node,
}
EXPORT_SYMBOL_GPL(kho_node_check_compatible);
+int kho_fill_kimage(struct kimage *image)
+{
+ ssize_t scratch_size;
+ int err = 0;
+
+ if (!kho_enable)
+ return 0;
+
+ /* Allocate target memory for KHO FDT */
+ struct kexec_buf fdt = {
+ .image = image,
+ .buffer = NULL,
+ .bufsz = 0,
+ .mem = KEXEC_BUF_MEM_UNKNOWN,
+ .memsz = kho_out.fdt_max,
+ .buf_align = SZ_64K, /* Makes it easier to map */
+ .buf_max = ULONG_MAX,
+ .top_down = true,
+ };
+ err = kexec_add_buffer(&fdt);
+ if (err) {
+ pr_err("failed to reserved a segment for KHO FDT: %d\n", err);
+ return err;
+ }
+ image->kho.fdt = &image->segment[image->nr_segments - 1];
+
+ scratch_size = sizeof(*kho_scratch) * kho_scratch_cnt;
+ struct kexec_buf scratch = {
+ .image = image,
+ .buffer = kho_scratch,
+ .bufsz = scratch_size,
+ .mem = KEXEC_BUF_MEM_UNKNOWN,
+ .memsz = scratch_size,
+ .buf_align = SZ_64K, /* Makes it easier to map */
+ .buf_max = ULONG_MAX,
+ .top_down = true,
+ };
+ err = kexec_add_buffer(&scratch);
+ if (err)
+ return err;
+ image->kho.scratch = &image->segment[image->nr_segments - 1];
+
+ return 0;
+}
+
+static int kho_walk_scratch(struct kexec_buf *kbuf,
+ int (*func)(struct resource *, void *))
+{
+ int ret = 0;
+ int i;
+
+ for (i = 0; i < kho_scratch_cnt; i++) {
+ struct resource res = {
+ .start = kho_scratch[i].addr,
+ .end = kho_scratch[i].addr + kho_scratch[i].size - 1,
+ };
+
+ /* Try to fit the kimage into our KHO scratch region */
+ ret = func(&res, kbuf);
+ if (ret)
+ break;
+ }
+
+ return ret;
+}
+
+int kho_locate_mem_hole(struct kexec_buf *kbuf,
+ int (*func)(struct resource *, void *))
+{
+ int ret;
+
+ if (!kho_enable || kbuf->image->type == KEXEC_TYPE_CRASH)
+ return 1;
+
+ ret = kho_walk_scratch(kbuf, func);
+
+ return ret == 1 ? 0 : -EADDRNOTAVAIL;
+}
+
/*
* Keep track of memory that is to be preserved across KHO.
*
@@ -1141,6 +1220,35 @@ static int kho_finalize(void)
return err;
}
+int kho_copy_fdt(struct kimage *image)
+{
+ int err = 0;
+ void *fdt;
+
+ if (!kho_enable || !image->file_mode)
+ return 0;
+
+ if (!kho_out.fdt) {
+ err = kho_finalize();
+ kho_out_update_debugfs_fdt();
+ if (err)
+ return err;
+ }
+
+ fdt = kimage_map_segment(image, image->kho.fdt->mem,
+ PAGE_ALIGN(kho_out.fdt_max));
+ if (!fdt) {
+ pr_err("failed to vmap fdt ksegment in kimage\n");
+ return -ENOMEM;
+ }
+
+ memcpy(fdt, kho_out.fdt, fdt_totalsize(kho_out.fdt));
+
+ kimage_unmap_segment(fdt);
+
+ return 0;
+}
+
/* Handling for debug/kho/out */
static int kho_out_finalize_get(void *data, u64 *val)
{
diff --git a/kernel/kexec_internal.h b/kernel/kexec_internal.h
index d35d9792402d..ec9555a4751d 100644
--- a/kernel/kexec_internal.h
+++ b/kernel/kexec_internal.h
@@ -39,4 +39,22 @@ extern size_t kexec_purgatory_size;
#else /* CONFIG_KEXEC_FILE */
static inline void kimage_file_post_load_cleanup(struct kimage *image) { }
#endif /* CONFIG_KEXEC_FILE */
+
+struct kexec_buf;
+
+#ifdef CONFIG_KEXEC_HANDOVER
+int kho_locate_mem_hole(struct kexec_buf *kbuf,
+ int (*func)(struct resource *, void *));
+int kho_fill_kimage(struct kimage *image);
+int kho_copy_fdt(struct kimage *image);
+#else
+static inline int kho_locate_mem_hole(struct kexec_buf *kbuf,
+ int (*func)(struct resource *, void *))
+{
+ return 1;
+}
+
+static inline int kho_fill_kimage(struct kimage *image) { return 0; }
+static inline int kho_copy_fdt(struct kimage *image) { return 0; }
+#endif /* CONFIG_KEXEC_HANDOVER */
#endif /* LINUX_KEXEC_INTERNAL_H */
--
2.48.1.711.g2feabab25a-goog
* [PATCH v5 11/16] kexec: add config option for KHO
2025-03-20 1:55 [PATCH v5 00/16] kexec: introduce Kexec HandOver (KHO) Changyuan Lyu
` (9 preceding siblings ...)
2025-03-20 1:55 ` [PATCH v5 10/16] kexec: add KHO support to kexec file loads Changyuan Lyu
@ 2025-03-20 1:55 ` Changyuan Lyu
2025-03-20 7:10 ` Krzysztof Kozlowski
2025-03-24 4:18 ` Dave Young
2025-03-20 1:55 ` [PATCH v5 12/16] arm64: add KHO support Changyuan Lyu
` (5 subsequent siblings)
16 siblings, 2 replies; 103+ messages in thread
From: Changyuan Lyu @ 2025-03-20 1:55 UTC (permalink / raw)
To: linux-kernel
Cc: graf, akpm, luto, anthony.yznaga, arnd, ashish.kalra, benh, bp,
catalin.marinas, dave.hansen, dwmw2, ebiederm, mingo, jgowans,
corbet, krzk, rppt, mark.rutland, pbonzini, pasha.tatashin, hpa,
peterz, ptyadav, robh+dt, robh, saravanak, skinsburskii, rostedt,
tglx, thomas.lendacky, usama.arif, will, devicetree, kexec,
linux-arm-kernel, linux-doc, linux-mm, x86, Changyuan Lyu
From: Alexander Graf <graf@amazon.com>
We have all generic code in place now to support Kexec with KHO. This
patch adds a config option that depends on architecture support to
enable KHO support.
Signed-off-by: Alexander Graf <graf@amazon.com>
Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Co-developed-by: Changyuan Lyu <changyuanl@google.com>
Signed-off-by: Changyuan Lyu <changyuanl@google.com>
---
kernel/Kconfig.kexec | 15 +++++++++++++++
1 file changed, 15 insertions(+)
diff --git a/kernel/Kconfig.kexec b/kernel/Kconfig.kexec
index 4d111f871951..57db99e758a8 100644
--- a/kernel/Kconfig.kexec
+++ b/kernel/Kconfig.kexec
@@ -95,6 +95,21 @@ config KEXEC_JUMP
Jump between original kernel and kexeced kernel and invoke
code in physical address mode via KEXEC
+config KEXEC_HANDOVER
+ bool "kexec handover"
+ depends on ARCH_SUPPORTS_KEXEC_HANDOVER && ARCH_SUPPORTS_KEXEC_FILE
+ select MEMBLOCK_KHO_SCRATCH
+ select KEXEC_FILE
+ select DEBUG_FS
+ select LIBFDT
+ select CMA
+ select XXHASH
+ help
+ Allow kexec to hand over state across kernels by generating and
+ passing additional metadata to the target kernel. This is useful
+ to keep data or state alive across the kexec. For this to work,
+ both source and target kernels need to have this option enabled.
+
config CRASH_DUMP
bool "kernel crash dumps"
default ARCH_DEFAULT_CRASH_DUMP
--
2.48.1.711.g2feabab25a-goog
^ permalink raw reply related [flat|nested] 103+ messages in thread
* [PATCH v5 12/16] arm64: add KHO support
2025-03-20 1:55 [PATCH v5 00/16] kexec: introduce Kexec HandOver (KHO) Changyuan Lyu
` (10 preceding siblings ...)
2025-03-20 1:55 ` [PATCH v5 11/16] kexec: add config option for KHO Changyuan Lyu
@ 2025-03-20 1:55 ` Changyuan Lyu
2025-03-20 7:13 ` Krzysztof Kozlowski
2025-04-11 3:47 ` Changyuan Lyu
2025-03-20 1:55 ` [PATCH v5 13/16] x86/setup: use memblock_reserve_kern for memory used by kernel Changyuan Lyu
` (4 subsequent siblings)
16 siblings, 2 replies; 103+ messages in thread
From: Changyuan Lyu @ 2025-03-20 1:55 UTC (permalink / raw)
To: linux-kernel
Cc: graf, akpm, luto, anthony.yznaga, arnd, ashish.kalra, benh, bp,
catalin.marinas, dave.hansen, dwmw2, ebiederm, mingo, jgowans,
corbet, krzk, rppt, mark.rutland, pbonzini, pasha.tatashin, hpa,
peterz, ptyadav, robh+dt, robh, saravanak, skinsburskii, rostedt,
tglx, thomas.lendacky, usama.arif, will, devicetree, kexec,
linux-arm-kernel, linux-doc, linux-mm, x86, Changyuan Lyu
From: Alexander Graf <graf@amazon.com>
We now have all bits in place to support KHO kexecs. Add awareness of
KHO to the kexec file-load and boot paths for arm64 and add the
respective Kconfig option to the architecture so that it can use KHO
successfully.
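For readers who want the shape of the boot-path consumer without reading the
whole hunk, here is a rough C sketch of what early_init_dt_check_kho() below
does. It assumes the drivers/of/fdt.c context (of_get_flat_dt_prop(),
dt_mem_next_cell(), the dt_root_*_cells globals) and trims the length checks,
so treat it as an approximation of the patch, not a drop-in replacement.

/* Rough sketch of the boot-path consumer added below (length checks
 * trimmed); assumes the drivers/of/fdt.c context.
 */
static void __init kho_from_chosen_sketch(unsigned long chosen_node)
{
	const __be32 *p;
	u64 fdt_phys, scratch_phys, scratch_size;
	int len;

	/* Two address ranges that the previous kernel appended to /chosen */
	p = of_get_flat_dt_prop(chosen_node, "linux,kho-fdt", &len);
	if (!p)
		return;
	fdt_phys = dt_mem_next_cell(dt_root_addr_cells, &p);

	p = of_get_flat_dt_prop(chosen_node, "linux,kho-scratch", &len);
	if (!p)
		return;
	scratch_phys = dt_mem_next_cell(dt_root_addr_cells, &p);
	scratch_size = dt_mem_next_cell(dt_root_size_cells, &p);

	/* Hand both ranges to the KHO core before memblock init completes */
	kho_populate(fdt_phys, scratch_phys, scratch_size);
}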
Signed-off-by: Alexander Graf <graf@amazon.com>
Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Co-developed-by: Changyuan Lyu <changyuanl@google.com>
Signed-off-by: Changyuan Lyu <changyuanl@google.com>
---
arch/arm64/Kconfig | 3 +++
drivers/of/fdt.c | 33 +++++++++++++++++++++++++++++++++
drivers/of/kexec.c | 37 +++++++++++++++++++++++++++++++++++++
3 files changed, 73 insertions(+)
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 940343beb3d4..c997b27b7da1 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -1589,6 +1589,9 @@ config ARCH_SUPPORTS_KEXEC_IMAGE_VERIFY_SIG
config ARCH_DEFAULT_KEXEC_IMAGE_VERIFY_SIG
def_bool y
+config ARCH_SUPPORTS_KEXEC_HANDOVER
+ def_bool y
+
config ARCH_SUPPORTS_CRASH_DUMP
def_bool y
diff --git a/drivers/of/fdt.c b/drivers/of/fdt.c
index aedd0e2dcd89..73f80e3f7188 100644
--- a/drivers/of/fdt.c
+++ b/drivers/of/fdt.c
@@ -25,6 +25,7 @@
#include <linux/serial_core.h>
#include <linux/sysfs.h>
#include <linux/random.h>
+#include <linux/kexec_handover.h>
#include <asm/setup.h> /* for COMMAND_LINE_SIZE */
#include <asm/page.h>
@@ -875,6 +876,35 @@ void __init early_init_dt_check_for_usable_mem_range(void)
memblock_add(rgn[i].base, rgn[i].size);
}
+/**
+ * early_init_dt_check_kho - Decode info required for kexec handover from DT
+ */
+static void __init early_init_dt_check_kho(void)
+{
+ unsigned long node = chosen_node_offset;
+ u64 kho_start, scratch_start, scratch_size;
+ const __be32 *p;
+ int l;
+
+ if (!IS_ENABLED(CONFIG_KEXEC_HANDOVER) || (long)node < 0)
+ return;
+
+ p = of_get_flat_dt_prop(node, "linux,kho-fdt", &l);
+ if (l != (dt_root_addr_cells + dt_root_size_cells) * sizeof(__be32))
+ return;
+
+ kho_start = dt_mem_next_cell(dt_root_addr_cells, &p);
+
+ p = of_get_flat_dt_prop(node, "linux,kho-scratch", &l);
+ if (l != (dt_root_addr_cells + dt_root_size_cells) * sizeof(__be32))
+ return;
+
+ scratch_start = dt_mem_next_cell(dt_root_addr_cells, &p);
+ scratch_size = dt_mem_next_cell(dt_root_addr_cells, &p);
+
+ kho_populate(kho_start, scratch_start, scratch_size);
+}
+
#ifdef CONFIG_SERIAL_EARLYCON
int __init early_init_dt_scan_chosen_stdout(void)
@@ -1169,6 +1199,9 @@ void __init early_init_dt_scan_nodes(void)
/* Handle linux,usable-memory-range property */
early_init_dt_check_for_usable_mem_range();
+
+ /* Handle kexec handover */
+ early_init_dt_check_kho();
}
bool __init early_init_dt_scan(void *dt_virt, phys_addr_t dt_phys)
diff --git a/drivers/of/kexec.c b/drivers/of/kexec.c
index 5b924597a4de..db7d7014d8b4 100644
--- a/drivers/of/kexec.c
+++ b/drivers/of/kexec.c
@@ -264,6 +264,38 @@ static inline int setup_ima_buffer(const struct kimage *image, void *fdt,
}
#endif /* CONFIG_IMA_KEXEC */
+static int kho_add_chosen(const struct kimage *image, void *fdt, int chosen_node)
+{
+ int ret = 0;
+#ifdef CONFIG_KEXEC_HANDOVER
+ phys_addr_t dt_mem = 0;
+ phys_addr_t dt_len = 0;
+ phys_addr_t scratch_mem = 0;
+ phys_addr_t scratch_len = 0;
+
+ if (!image->kho.fdt || !image->kho.scratch)
+ return 0;
+
+ dt_mem = image->kho.fdt->mem;
+ dt_len = image->kho.fdt->memsz;
+
+ scratch_mem = image->kho.scratch->mem;
+ scratch_len = image->kho.scratch->bufsz;
+
+ pr_debug("Adding kho metadata to DT");
+
+ ret = fdt_appendprop_addrrange(fdt, 0, chosen_node, "linux,kho-fdt",
+ dt_mem, dt_len);
+ if (ret)
+ return ret;
+
+ ret = fdt_appendprop_addrrange(fdt, 0, chosen_node, "linux,kho-scratch",
+ scratch_mem, scratch_len);
+
+#endif /* CONFIG_KEXEC_HANDOVER */
+ return ret;
+}
+
/*
* of_kexec_alloc_and_setup_fdt - Alloc and setup a new Flattened Device Tree
*
@@ -414,6 +446,11 @@ void *of_kexec_alloc_and_setup_fdt(const struct kimage *image,
#endif
}
+ /* Add kho metadata if this is a KHO image */
+ ret = kho_add_chosen(image, fdt, chosen_node);
+ if (ret)
+ goto out;
+
/* add bootargs */
if (cmdline) {
ret = fdt_setprop_string(fdt, chosen_node, "bootargs", cmdline);
--
2.48.1.711.g2feabab25a-goog
^ permalink raw reply related [flat|nested] 103+ messages in thread
* [PATCH v5 13/16] x86/setup: use memblock_reserve_kern for memory used by kernel
2025-03-20 1:55 [PATCH v5 00/16] kexec: introduce Kexec HandOver (KHO) Changyuan Lyu
` (11 preceding siblings ...)
2025-03-20 1:55 ` [PATCH v5 12/16] arm64: add KHO support Changyuan Lyu
@ 2025-03-20 1:55 ` Changyuan Lyu
2025-03-20 1:55 ` [PATCH v5 14/16] x86: add KHO support Changyuan Lyu
` (3 subsequent siblings)
16 siblings, 0 replies; 103+ messages in thread
From: Changyuan Lyu @ 2025-03-20 1:55 UTC (permalink / raw)
To: linux-kernel
Cc: graf, akpm, luto, anthony.yznaga, arnd, ashish.kalra, benh, bp,
catalin.marinas, dave.hansen, dwmw2, ebiederm, mingo, jgowans,
corbet, krzk, rppt, mark.rutland, pbonzini, pasha.tatashin, hpa,
peterz, ptyadav, robh+dt, robh, saravanak, skinsburskii, rostedt,
tglx, thomas.lendacky, usama.arif, will, devicetree, kexec,
linux-arm-kernel, linux-doc, linux-mm, x86
From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
memblock_reserve() does not distinguish memory used by firmware from memory
used by the kernel.
The distinction is nice to have for accounting of early memory allocations
and reservations, but it is essential for kexec handover (KHO) to know how
much memory the kernel consumes during boot.
Use memblock_reserve_kern() to reserve kernel memory, such as the kernel
image, initrd and setup data.
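As a quick illustration of the mechanical change (a sketch assuming the x86
setup.c context; memblock_reserve_kern() takes the same (base, size) arguments
as memblock_reserve(), as the hunks below show):

/* Sketch: a kernel-owned early reservation is switched from
 * memblock_reserve() to memblock_reserve_kern() so KHO can account
 * for it; firmware-owned regions keep using memblock_reserve().
 */
static void __init reserve_kernel_text_sketch(void)
{
	phys_addr_t base = __pa_symbol(_text);
	phys_addr_t size = (unsigned long)__end_of_kernel_reserve -
			   (unsigned long)_text;

	memblock_reserve_kern(base, size);
}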
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
arch/x86/kernel/setup.c | 16 ++++++++--------
1 file changed, 8 insertions(+), 8 deletions(-)
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index cebee310e200..ead370570eb2 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -220,8 +220,8 @@ static void __init cleanup_highmap(void)
static void __init reserve_brk(void)
{
if (_brk_end > _brk_start)
- memblock_reserve(__pa_symbol(_brk_start),
- _brk_end - _brk_start);
+ memblock_reserve_kern(__pa_symbol(_brk_start),
+ _brk_end - _brk_start);
/* Mark brk area as locked down and no longer taking any
new allocations */
@@ -294,7 +294,7 @@ static void __init early_reserve_initrd(void)
!ramdisk_image || !ramdisk_size)
return; /* No initrd provided by bootloader */
- memblock_reserve(ramdisk_image, ramdisk_end - ramdisk_image);
+ memblock_reserve_kern(ramdisk_image, ramdisk_end - ramdisk_image);
}
static void __init reserve_initrd(void)
@@ -347,7 +347,7 @@ static void __init add_early_ima_buffer(u64 phys_addr)
}
if (data->size) {
- memblock_reserve(data->addr, data->size);
+ memblock_reserve_kern(data->addr, data->size);
ima_kexec_buffer_phys = data->addr;
ima_kexec_buffer_size = data->size;
}
@@ -447,7 +447,7 @@ static void __init memblock_x86_reserve_range_setup_data(void)
len = sizeof(*data);
pa_next = data->next;
- memblock_reserve(pa_data, sizeof(*data) + data->len);
+ memblock_reserve_kern(pa_data, sizeof(*data) + data->len);
if (data->type == SETUP_INDIRECT) {
len += data->len;
@@ -461,7 +461,7 @@ static void __init memblock_x86_reserve_range_setup_data(void)
indirect = (struct setup_indirect *)data->data;
if (indirect->type != SETUP_INDIRECT)
- memblock_reserve(indirect->addr, indirect->len);
+ memblock_reserve_kern(indirect->addr, indirect->len);
}
pa_data = pa_next;
@@ -649,8 +649,8 @@ static void __init early_reserve_memory(void)
* __end_of_kernel_reserve symbol must be explicitly reserved with a
* separate memblock_reserve() or they will be discarded.
*/
- memblock_reserve(__pa_symbol(_text),
- (unsigned long)__end_of_kernel_reserve - (unsigned long)_text);
+ memblock_reserve_kern(__pa_symbol(_text),
+ (unsigned long)__end_of_kernel_reserve - (unsigned long)_text);
/*
* The first 4Kb of memory is a BIOS owned area, but generally it is
--
2.48.1.711.g2feabab25a-goog
^ permalink raw reply related [flat|nested] 103+ messages in thread
* [PATCH v5 14/16] x86: add KHO support
2025-03-20 1:55 [PATCH v5 00/16] kexec: introduce Kexec HandOver (KHO) Changyuan Lyu
` (12 preceding siblings ...)
2025-03-20 1:55 ` [PATCH v5 13/16] x86/setup: use memblock_reserve_kern for memory used by kernel Changyuan Lyu
@ 2025-03-20 1:55 ` Changyuan Lyu
2025-03-20 1:55 ` [PATCH v5 15/16] memblock: add KHO support for reserve_mem Changyuan Lyu
` (2 subsequent siblings)
16 siblings, 0 replies; 103+ messages in thread
From: Changyuan Lyu @ 2025-03-20 1:55 UTC (permalink / raw)
To: linux-kernel
Cc: graf, akpm, luto, anthony.yznaga, arnd, ashish.kalra, benh, bp,
catalin.marinas, dave.hansen, dwmw2, ebiederm, mingo, jgowans,
corbet, krzk, rppt, mark.rutland, pbonzini, pasha.tatashin, hpa,
peterz, ptyadav, robh+dt, robh, saravanak, skinsburskii, rostedt,
tglx, thomas.lendacky, usama.arif, will, devicetree, kexec,
linux-arm-kernel, linux-doc, linux-mm, x86, Changyuan Lyu
From: Alexander Graf <graf@amazon.com>
We now have all bits in place to support KHO kexecs. This patch adds
awareness of KHO to the kexec file-load and boot paths for x86 and
adds the respective Kconfig option to the architecture so that it can
use KHO successfully.
In addition, it makes the decompression code KHO-aware so that its
KASLR location finder only considers memory regions that are not already
occupied by KHO memory.
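For reference, the handover ABI this patch introduces on x86 is a single
setup_data entry of type SETUP_KEXEC_KHO carrying a struct kho_data. The
annotated copy below reflects how the fields are filled in setup_kho()
further down; the comments are this summary's reading of the patch, not
separate documentation.

#include <linux/types.h>

/* Annotated copy of the struct added in setup_data.h below. */
struct kho_data {
	__u64 dt_addr;       /* physical address of the KHO FDT segment */
	__u64 dt_size;       /* size of that segment in bytes */
	__u64 scratch_addr;  /* physical address of the kho_scratch array */
	__u64 scratch_size;  /* size of that array in bytes */
} __attribute__((packed));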
Signed-off-by: Alexander Graf <graf@amazon.com>
Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Co-developed-by: Changyuan Lyu <changyuanl@google.com>
Signed-off-by: Changyuan Lyu <changyuanl@google.com>
---
arch/x86/Kconfig | 3 ++
arch/x86/boot/compressed/kaslr.c | 52 +++++++++++++++++++++++++-
arch/x86/include/asm/setup.h | 4 ++
arch/x86/include/uapi/asm/setup_data.h | 13 ++++++-
arch/x86/kernel/e820.c | 18 +++++++++
arch/x86/kernel/kexec-bzimage64.c | 36 ++++++++++++++++++
arch/x86/kernel/setup.c | 25 +++++++++++++
arch/x86/realmode/init.c | 2 +
include/linux/kexec_handover.h | 13 +++++--
9 files changed, 161 insertions(+), 5 deletions(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 0e27ebd7e36a..acd180e3002f 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2091,6 +2091,9 @@ config ARCH_SUPPORTS_KEXEC_BZIMAGE_VERIFY_SIG
config ARCH_SUPPORTS_KEXEC_JUMP
def_bool y
+config ARCH_SUPPORTS_KEXEC_HANDOVER
+ def_bool y
+
config ARCH_SUPPORTS_CRASH_DUMP
def_bool X86_64 || (X86_32 && HIGHMEM)
diff --git a/arch/x86/boot/compressed/kaslr.c b/arch/x86/boot/compressed/kaslr.c
index f03d59ea6e40..ff1168881016 100644
--- a/arch/x86/boot/compressed/kaslr.c
+++ b/arch/x86/boot/compressed/kaslr.c
@@ -760,6 +760,55 @@ static void process_e820_entries(unsigned long minimum,
}
}
+/*
+ * If KHO is active, only process its scratch areas to ensure we are not
+ * stepping onto preserved memory.
+ */
+#ifdef CONFIG_KEXEC_HANDOVER
+static bool process_kho_entries(unsigned long minimum, unsigned long image_size)
+{
+ struct kho_scratch *kho_scratch;
+ struct setup_data *ptr;
+ int i, nr_areas = 0;
+
+ ptr = (struct setup_data *)(unsigned long)boot_params_ptr->hdr.setup_data;
+ while (ptr) {
+ if (ptr->type == SETUP_KEXEC_KHO) {
+ struct kho_data *kho = (struct kho_data *)ptr->data;
+
+ kho_scratch = (void *)kho->scratch_addr;
+ nr_areas = kho->scratch_size / sizeof(*kho_scratch);
+
+ break;
+ }
+
+ ptr = (struct setup_data *)(unsigned long)ptr->next;
+ }
+
+ if (!nr_areas)
+ return false;
+
+ for (i = 0; i < nr_areas; i++) {
+ struct kho_scratch *area = &kho_scratch[i];
+ struct mem_vector region = {
+ .start = area->addr,
+ .size = area->size,
+ };
+
+ if (process_mem_region(&region, minimum, image_size))
+ break;
+ }
+
+ return true;
+}
+#else
+static inline bool process_kho_entries(unsigned long minimum,
+ unsigned long image_size)
+{
+ return false;
+}
+#endif
+
static unsigned long find_random_phys_addr(unsigned long minimum,
unsigned long image_size)
{
@@ -775,7 +824,8 @@ static unsigned long find_random_phys_addr(unsigned long minimum,
return 0;
}
- if (!process_efi_entries(minimum, image_size))
+ if (!process_kho_entries(minimum, image_size) &&
+ !process_efi_entries(minimum, image_size))
process_e820_entries(minimum, image_size);
phys_addr = slots_fetch_random();
diff --git a/arch/x86/include/asm/setup.h b/arch/x86/include/asm/setup.h
index 85f4fde3515c..70e045321d4b 100644
--- a/arch/x86/include/asm/setup.h
+++ b/arch/x86/include/asm/setup.h
@@ -66,6 +66,10 @@ extern void x86_ce4100_early_setup(void);
static inline void x86_ce4100_early_setup(void) { }
#endif
+#ifdef CONFIG_KEXEC_HANDOVER
+#include <linux/kexec_handover.h>
+#endif
+
#ifndef _SETUP
#include <asm/espfix.h>
diff --git a/arch/x86/include/uapi/asm/setup_data.h b/arch/x86/include/uapi/asm/setup_data.h
index b111b0c18544..c258c37768ee 100644
--- a/arch/x86/include/uapi/asm/setup_data.h
+++ b/arch/x86/include/uapi/asm/setup_data.h
@@ -13,7 +13,8 @@
#define SETUP_CC_BLOB 7
#define SETUP_IMA 8
#define SETUP_RNG_SEED 9
-#define SETUP_ENUM_MAX SETUP_RNG_SEED
+#define SETUP_KEXEC_KHO 10
+#define SETUP_ENUM_MAX SETUP_KEXEC_KHO
#define SETUP_INDIRECT (1<<31)
#define SETUP_TYPE_MAX (SETUP_ENUM_MAX | SETUP_INDIRECT)
@@ -78,6 +79,16 @@ struct ima_setup_data {
__u64 size;
} __attribute__((packed));
+/*
+ * Locations of kexec handover metadata
+ */
+struct kho_data {
+ __u64 dt_addr;
+ __u64 dt_size;
+ __u64 scratch_addr;
+ __u64 scratch_size;
+} __attribute__((packed));
+
#endif /* __ASSEMBLY__ */
#endif /* _UAPI_ASM_X86_SETUP_DATA_H */
diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index 82b96ed9890a..0b81cd70b02a 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -1329,6 +1329,24 @@ void __init e820__memblock_setup(void)
memblock_add(entry->addr, entry->size);
}
+ /*
+ * At this point with KHO we only allocate from scratch memory.
+ * At the same time, we configure memblock to only allow
+ * allocations from memory below ISA_END_ADDRESS which is not
+ * a natural scratch region, because Linux ignores memory below
+ * ISA_END_ADDRESS at runtime. Besides very few (if any) early
+ * allocations, we must allocate the real-mode trampoline below
+ * ISA_END_ADDRESS.
+ *
+ * To make sure that we can actually perform allocations during
+ * this phase, let's mark memory below ISA_END_ADDRESS as scratch
+ * so we can allocate from there in a scratch-only world.
+ *
+ * After real mode trampoline is allocated, we clear scratch
+ * marking from the memory below ISA_END_ADDRESS
+ */
+ memblock_mark_kho_scratch(0, ISA_END_ADDRESS);
+
/* Throw away partial pages: */
memblock_trim_memory(PAGE_SIZE);
diff --git a/arch/x86/kernel/kexec-bzimage64.c b/arch/x86/kernel/kexec-bzimage64.c
index 68530fad05f7..09d6a068b14c 100644
--- a/arch/x86/kernel/kexec-bzimage64.c
+++ b/arch/x86/kernel/kexec-bzimage64.c
@@ -233,6 +233,31 @@ setup_ima_state(const struct kimage *image, struct boot_params *params,
#endif /* CONFIG_IMA_KEXEC */
}
+static void setup_kho(const struct kimage *image, struct boot_params *params,
+ unsigned long params_load_addr,
+ unsigned int setup_data_offset)
+{
+#ifdef CONFIG_KEXEC_HANDOVER
+ struct setup_data *sd = (void *)params + setup_data_offset;
+ struct kho_data *kho = (void *)sd + sizeof(*sd);
+
+ sd->type = SETUP_KEXEC_KHO;
+ sd->len = sizeof(struct kho_data);
+
+ /* Only add if we have all KHO images in place */
+ if (!image->kho.fdt || !image->kho.scratch)
+ return;
+
+ /* Add setup data */
+ kho->dt_addr = image->kho.fdt->mem;
+ kho->dt_size = image->kho.fdt->memsz;
+ kho->scratch_addr = image->kho.scratch->mem;
+ kho->scratch_size = image->kho.scratch->bufsz;
+ sd->next = params->hdr.setup_data;
+ params->hdr.setup_data = params_load_addr + setup_data_offset;
+#endif /* CONFIG_KEXEC_HANDOVER */
+}
+
static int
setup_boot_parameters(struct kimage *image, struct boot_params *params,
unsigned long params_load_addr,
@@ -312,6 +337,13 @@ setup_boot_parameters(struct kimage *image, struct boot_params *params,
sizeof(struct ima_setup_data);
}
+ if (IS_ENABLED(CONFIG_KEXEC_HANDOVER)) {
+ /* Setup space to store preservation metadata */
+ setup_kho(image, params, params_load_addr, setup_data_offset);
+ setup_data_offset += sizeof(struct setup_data) +
+ sizeof(struct kho_data);
+ }
+
/* Setup RNG seed */
setup_rng_seed(params, params_load_addr, setup_data_offset);
@@ -479,6 +511,10 @@ static void *bzImage64_load(struct kimage *image, char *kernel,
kbuf.bufsz += sizeof(struct setup_data) +
sizeof(struct ima_setup_data);
+ if (IS_ENABLED(CONFIG_KEXEC_HANDOVER))
+ kbuf.bufsz += sizeof(struct setup_data) +
+ sizeof(struct kho_data);
+
params = kzalloc(kbuf.bufsz, GFP_KERNEL);
if (!params)
return ERR_PTR(-ENOMEM);
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index ead370570eb2..e2c54181405b 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -385,6 +385,28 @@ int __init ima_get_kexec_buffer(void **addr, size_t *size)
}
#endif
+static void __init add_kho(u64 phys_addr, u32 data_len)
+{
+#ifdef CONFIG_KEXEC_HANDOVER
+ struct kho_data *kho;
+ u64 addr = phys_addr + sizeof(struct setup_data);
+ u64 size = data_len - sizeof(struct setup_data);
+
+ kho = early_memremap(addr, size);
+ if (!kho) {
+ pr_warn("setup: failed to memremap kho data (0x%llx, 0x%llx)\n",
+ addr, size);
+ return;
+ }
+
+ kho_populate(kho->dt_addr, kho->scratch_addr, kho->scratch_size);
+
+ early_memunmap(kho, size);
+#else
+ pr_warn("Passed KHO data, but CONFIG_KEXEC_HANDOVER not set. Ignoring.\n");
+#endif
+}
+
static void __init parse_setup_data(void)
{
struct setup_data *data;
@@ -413,6 +435,9 @@ static void __init parse_setup_data(void)
case SETUP_IMA:
add_early_ima_buffer(pa_data);
break;
+ case SETUP_KEXEC_KHO:
+ add_kho(pa_data, data_len);
+ break;
case SETUP_RNG_SEED:
data = early_memremap(pa_data, data_len);
add_bootloader_randomness(data->data, data->len);
diff --git a/arch/x86/realmode/init.c b/arch/x86/realmode/init.c
index f9bc444a3064..9b9f4534086d 100644
--- a/arch/x86/realmode/init.c
+++ b/arch/x86/realmode/init.c
@@ -65,6 +65,8 @@ void __init reserve_real_mode(void)
* setup_arch().
*/
memblock_reserve(0, SZ_1M);
+
+ memblock_clear_kho_scratch(0, SZ_1M);
}
static void __init sme_sev_setup_real_mode(struct trampoline_header *th)
diff --git a/include/linux/kexec_handover.h b/include/linux/kexec_handover.h
index d52a7b500f4c..2dd51a77d56c 100644
--- a/include/linux/kexec_handover.h
+++ b/include/linux/kexec_handover.h
@@ -3,9 +3,6 @@
#define LINUX_KEXEC_HANDOVER_H
#include <linux/types.h>
-#include <linux/hashtable.h>
-#include <linux/notifier.h>
-#include <linux/mm_types.h>
struct kho_scratch {
phys_addr_t addr;
@@ -18,6 +15,15 @@ enum kho_event {
KEXEC_KHO_UNFREEZE = 1,
};
+#ifdef _SETUP
+struct notifier_block;
+struct kho_node;
+struct folio;
+#else
+#include <linux/notifier.h>
+#include <linux/hashtable.h>
+#include <linux/mm_types.h>
+
#define KHO_HASHTABLE_BITS 3
#define KHO_NODE_INIT \
{ \
@@ -35,6 +41,7 @@ struct kho_node {
struct list_head list;
bool visited;
};
+#endif /* _SETUP */
struct kho_in_node {
int offset;
--
2.48.1.711.g2feabab25a-goog
^ permalink raw reply related [flat|nested] 103+ messages in thread
* [PATCH v5 15/16] memblock: add KHO support for reserve_mem
2025-03-20 1:55 [PATCH v5 00/16] kexec: introduce Kexec HandOver (KHO) Changyuan Lyu
` (13 preceding siblings ...)
2025-03-20 1:55 ` [PATCH v5 14/16] x86: add KHO support Changyuan Lyu
@ 2025-03-20 1:55 ` Changyuan Lyu
2025-03-20 1:55 ` [PATCH v5 16/16] Documentation: add documentation for KHO Changyuan Lyu
2025-03-25 14:19 ` [PATCH v5 00/16] kexec: introduce Kexec HandOver (KHO) Pasha Tatashin
16 siblings, 0 replies; 103+ messages in thread
From: Changyuan Lyu @ 2025-03-20 1:55 UTC (permalink / raw)
To: linux-kernel
Cc: graf, akpm, luto, anthony.yznaga, arnd, ashish.kalra, benh, bp,
catalin.marinas, dave.hansen, dwmw2, ebiederm, mingo, jgowans,
corbet, krzk, rppt, mark.rutland, pbonzini, pasha.tatashin, hpa,
peterz, ptyadav, robh+dt, robh, saravanak, skinsburskii, rostedt,
tglx, thomas.lendacky, usama.arif, will, devicetree, kexec,
linux-arm-kernel, linux-doc, linux-mm, x86, Changyuan Lyu
From: Alexander Graf <graf@amazon.com>
Linux has recently gained support for "reserve_mem": A mechanism to
allocate a region of memory early enough in boot that we can cross our
fingers and hope it stays at the same location during most boots, so we
can store for example ftrace buffers into it.
Thanks to KASLR, we can never be really sure that "reserve_mem"
allocations are static across kexec. Let's teach it KHO awareness so
that it serializes its reservations on kexec exit and deserializes them
again on boot, preserving the exact same mapping across kexec.
This is an example user for KHO in the KHO patch set to ensure we have
at least one (not very controversial) user in the tree before extending
KHO's use to more subsystems.
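As a sketch of the producer pattern this patch instantiates (function names
follow the reserve_mem code in the diff below; the "example" node name, the
example_start/example_size placeholders, and the simplified error handling are
illustrative only, not part of the patch):

/* Minimal KHO producer sketch modeled on the reserve_mem code below. */
static struct kho_node example_node = KHO_NODE_INIT;
static phys_addr_t example_start, example_size;	/* placeholders */

static int example_kho_notifier(struct notifier_block *self,
				unsigned long cmd, void *v)
{
	int err;

	if (cmd != KEXEC_KHO_FINALIZE)
		return NOTIFY_DONE;

	/* Publish a named node under the KHO state-tree root... */
	err = kho_add_node(NULL, "example", &example_node);
	if (err < 0)
		return NOTIFY_STOP;

	/* ...describe it, and keep the backing memory across kexec. */
	err = kho_add_string_prop(&example_node, "compatible", "example-v1");
	err |= kho_preserve_phys(example_start, example_size);

	return err ? NOTIFY_STOP : NOTIFY_DONE;
}

static struct notifier_block example_kho_nb = {
	.notifier_call = example_kho_notifier,
};

/* registered once at init time: register_kho_notifier(&example_kho_nb); */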
Signed-off-by: Alexander Graf <graf@amazon.com>
Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Co-developed-by: Changyuan Lyu <changyuanl@google.com>
Signed-off-by: Changyuan Lyu <changyuanl@google.com>
---
mm/memblock.c | 179 ++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 179 insertions(+)
diff --git a/mm/memblock.c b/mm/memblock.c
index d28abf3def1c..dd698c55b87e 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -17,6 +17,10 @@
#include <linux/seq_file.h>
#include <linux/memblock.h>
+#ifdef CONFIG_KEXEC_HANDOVER
+#include <linux/kexec_handover.h>
+#endif /* CONFIG_KEXEC_HANDOVER */
+
#include <asm/sections.h>
#include <linux/io.h>
@@ -2431,6 +2435,176 @@ int reserve_mem_find_by_name(const char *name, phys_addr_t *start, phys_addr_t *
}
EXPORT_SYMBOL_GPL(reserve_mem_find_by_name);
+#ifdef CONFIG_KEXEC_HANDOVER
+#define MEMBLOCK_KHO_NODE "memblock"
+#define MEMBLOCK_KHO_NODE_COMPATIBLE "memblock-v1"
+#define RESERVE_MEM_KHO_NODE_COMPATIBLE "reserve-mem-v1"
+
+static struct kho_node memblock_kho_node = KHO_NODE_INIT;
+
+static void reserve_mem_kho_reset(void)
+{
+ int i;
+ struct kho_node *node;
+
+ kho_remove_node(NULL, MEMBLOCK_KHO_NODE);
+ kho_remove_prop(&memblock_kho_node, "compatible", NULL);
+
+ for (i = 0; i < reserved_mem_count; i++) {
+ struct reserve_mem_table *map = &reserved_mem_table[i];
+
+ node = kho_remove_node(&memblock_kho_node, map->name);
+ if (IS_ERR(node))
+ continue;
+
+ kho_unpreserve_phys(map->start, map->size);
+
+ kho_remove_prop(node, "compatible", NULL);
+ kho_remove_prop(node, "start", NULL);
+ kho_remove_prop(node, "size", NULL);
+
+ kfree(node);
+ }
+}
+
+static int reserve_mem_kho_finalize(void)
+{
+ int i, err = 0;
+ struct kho_node *node;
+
+ if (!reserved_mem_count)
+ return NOTIFY_DONE;
+
+ err = kho_add_node(NULL, MEMBLOCK_KHO_NODE, &memblock_kho_node);
+ if (err == 1)
+ return NOTIFY_DONE;
+
+ err |= kho_add_string_prop(&memblock_kho_node, "compatible",
+ MEMBLOCK_KHO_NODE_COMPATIBLE);
+
+ for (i = 0; i < reserved_mem_count; i++) {
+ struct reserve_mem_table *map = &reserved_mem_table[i];
+
+ node = kmalloc(sizeof(*node), GFP_KERNEL);
+ if (!node) {
+ err = -ENOMEM;
+ break;
+ }
+
+ err |= kho_preserve_phys(map->start, map->size);
+
+ kho_init_node(node);
+ err |= kho_add_string_prop(node, "compatible",
+ RESERVE_MEM_KHO_NODE_COMPATIBLE);
+ err |= kho_add_prop(node, "start", &map->start,
+ sizeof(map->start));
+ err |= kho_add_prop(node, "size", &map->size,
+ sizeof(map->size));
+ err |= kho_add_node(&memblock_kho_node, map->name, node);
+
+ if (err)
+ break;
+ }
+
+ if (err) {
+ pr_err("failed to save reserve_mem to KHO: %d\n", err);
+ reserve_mem_kho_reset();
+ return NOTIFY_STOP;
+ }
+
+ return NOTIFY_DONE;
+}
+
+static int reserve_mem_kho_notifier(struct notifier_block *self,
+ unsigned long cmd, void *v)
+{
+ switch (cmd) {
+ case KEXEC_KHO_FINALIZE:
+ return reserve_mem_kho_finalize();
+ case KEXEC_KHO_UNFREEZE:
+ return NOTIFY_DONE;
+ default:
+ return NOTIFY_BAD;
+ }
+}
+
+static struct notifier_block reserve_mem_kho_nb = {
+ .notifier_call = reserve_mem_kho_notifier,
+};
+
+static int __init reserve_mem_init(void)
+{
+ if (!kho_is_enabled())
+ return 0;
+
+ return register_kho_notifier(&reserve_mem_kho_nb);
+}
+core_initcall(reserve_mem_init);
+
+static bool __init reserve_mem_kho_revive(const char *name, phys_addr_t size,
+ phys_addr_t align)
+{
+ int err, len_start, len_size;
+ struct kho_in_node node, child;
+ const phys_addr_t *p_start, *p_size;
+
+ err = kho_get_node(NULL, MEMBLOCK_KHO_NODE, &node);
+ if (err)
+ return false;
+
+ err = kho_node_check_compatible(&node, MEMBLOCK_KHO_NODE_COMPATIBLE);
+ if (err) {
+ pr_warn("Node '%s' is incompatible with %s: %d\n",
+ MEMBLOCK_KHO_NODE, MEMBLOCK_KHO_NODE_COMPATIBLE, err);
+ return false;
+ }
+
+ err = kho_get_node(&node, name, &child);
+ if (err) {
+ pr_warn("Node '%s' has no child '%s': %d\n",
+ MEMBLOCK_KHO_NODE, name, err);
+ return false;
+ }
+ err = kho_node_check_compatible(&child, RESERVE_MEM_KHO_NODE_COMPATIBLE);
+ if (err) {
+ pr_warn("Node '%s/%s' is incompatible with %s: %d\n",
+ MEMBLOCK_KHO_NODE, name,
+ RESERVE_MEM_KHO_NODE_COMPATIBLE, err);
+ return false;
+ }
+
+ p_start = kho_get_prop(&child, "start", &len_start);
+ p_size = kho_get_prop(&child, "size", &len_size);
+ if (!p_start || len_start != sizeof(*p_start) || !p_size ||
+ len_size != sizeof(*p_size)) {
+ return false;
+ }
+
+ if (*p_start & (align - 1)) {
+ pr_warn("KHO reserve-mem '%s' has wrong alignment (0x%lx, 0x%lx)\n",
+ name, (long)align, (long)*p_start);
+ return false;
+ }
+
+ if (*p_size != size) {
+ pr_warn("KHO reserve-mem '%s' has wrong size (0x%lx != 0x%lx)\n",
+ name, (long)*p_size, (long)size);
+ return false;
+ }
+
+ reserved_mem_add(*p_start, size, name);
+ pr_info("Revived memory reservation '%s' from KHO\n", name);
+
+ return true;
+}
+#else
+static bool __init reserve_mem_kho_revive(const char *name, phys_addr_t size,
+ phys_addr_t align)
+{
+ return false;
+}
+#endif /* CONFIG_KEXEC_HANDOVER */
+
/*
* Parse reserve_mem=nn:align:name
*/
@@ -2486,6 +2660,11 @@ static int __init reserve_mem(char *p)
if (reserve_mem_find_by_name(name, &start, &tmp))
return -EBUSY;
+ /* Pick previous allocations up from KHO if available */
+ if (reserve_mem_kho_revive(name, size, align))
+ return 1;
+
+ /* TODO: Allocation must be outside of scratch region */
start = memblock_phys_alloc(size, align);
if (!start)
return -ENOMEM;
--
2.48.1.711.g2feabab25a-goog
^ permalink raw reply related [flat|nested] 103+ messages in thread
* [PATCH v5 16/16] Documentation: add documentation for KHO
2025-03-20 1:55 [PATCH v5 00/16] kexec: introduce Kexec HandOver (KHO) Changyuan Lyu
` (14 preceding siblings ...)
2025-03-20 1:55 ` [PATCH v5 15/16] memblock: add KHO support for reserve_mem Changyuan Lyu
@ 2025-03-20 1:55 ` Changyuan Lyu
2025-03-20 14:45 ` Jonathan Corbet
2025-03-25 14:19 ` [PATCH v5 00/16] kexec: introduce Kexec HandOver (KHO) Pasha Tatashin
16 siblings, 1 reply; 103+ messages in thread
From: Changyuan Lyu @ 2025-03-20 1:55 UTC (permalink / raw)
To: linux-kernel
Cc: graf, akpm, luto, anthony.yznaga, arnd, ashish.kalra, benh, bp,
catalin.marinas, dave.hansen, dwmw2, ebiederm, mingo, jgowans,
corbet, krzk, rppt, mark.rutland, pbonzini, pasha.tatashin, hpa,
peterz, ptyadav, robh+dt, robh, saravanak, skinsburskii, rostedt,
tglx, thomas.lendacky, usama.arif, will, devicetree, kexec,
linux-arm-kernel, linux-doc, linux-mm, x86, Changyuan Lyu
From: Alexander Graf <graf@amazon.com>
With KHO in place, let's add documentation that describes what it is and
how to use it.
Signed-off-by: Alexander Graf <graf@amazon.com>
Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Co-developed-by: Changyuan Lyu <changyuanl@google.com>
Signed-off-by: Changyuan Lyu <changyuanl@google.com>
---
.../admin-guide/kernel-parameters.txt | 25 ++++
Documentation/kho/concepts.rst | 70 +++++++++++
Documentation/kho/fdt.rst | 62 +++++++++
Documentation/kho/index.rst | 14 +++
Documentation/kho/usage.rst | 118 ++++++++++++++++++
Documentation/subsystem-apis.rst | 1 +
MAINTAINERS | 1 +
7 files changed, 291 insertions(+)
create mode 100644 Documentation/kho/concepts.rst
create mode 100644 Documentation/kho/fdt.rst
create mode 100644 Documentation/kho/index.rst
create mode 100644 Documentation/kho/usage.rst
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index fb8752b42ec8..d715c6d9dbb3 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2698,6 +2698,31 @@
kgdbwait [KGDB,EARLY] Stop kernel execution and enter the
kernel debugger at the earliest opportunity.
+ kho= [KEXEC,EARLY]
+ Format: { "0" | "1" | "off" | "on" | "y" | "n" }
+ Enables or disables Kexec HandOver.
+ "0" | "off" | "n" - kexec handover is disabled
+ "1" | "on" | "y" - kexec handover is enabled
+
+ kho_scratch= [KEXEC,EARLY]
+ Format: ll[KMG],mm[KMG],nn[KMG] | nn%
+ Defines the size of the KHO scratch region. The KHO
+ scratch regions are physically contiguous memory
+ ranges that can only be used for non-kernel
+ allocations. That way, even when memory is heavily
+ fragmented with handed over memory, the kexeced
+ kernel will always have enough contiguous ranges to
+ bootstrap itself.
+
+ It is possible to specify the exact amount of
+ memory in the form of "ll[KMG],mm[KMG],nn[KMG]"
+ where the first parameter defines the size of a low
+ memory scratch area, the second parameter defines
+ the size of a global scratch area and the third
+ parameter defines the size of additional per-node
+ scratch areas. The form "nn%" defines scale factor
+ (in percents) of memory that was used during boot.
+
kmac= [MIPS] Korina ethernet MAC address.
Configure the RouterBoard 532 series on-chip
Ethernet adapter MAC address.
diff --git a/Documentation/kho/concepts.rst b/Documentation/kho/concepts.rst
new file mode 100644
index 000000000000..174e23404ebc
--- /dev/null
+++ b/Documentation/kho/concepts.rst
@@ -0,0 +1,70 @@
+.. SPDX-License-Identifier: GPL-2.0-or-later
+.. _concepts:
+
+=======================
+Kexec Handover Concepts
+=======================
+
+Kexec HandOver (KHO) is a mechanism that allows Linux to preserve state -
+arbitrary properties as well as memory locations - across kexec.
+
+It introduces multiple concepts:
+
+KHO State tree
+==============
+
+Every KHO kexec carries a state tree, in the format of a flattened device tree
+(FDT), that describes the state of the system. Device drivers can register with
+KHO to serialize their state before kexec. After KHO, device drivers can read
+the FDT and extract their previous state.
+
+KHO only uses the FDT container format and libfdt library, but does not
+adhere to the same property semantics that normal device trees do: Properties
+are passed in native endianness and standardized properties like ``regs`` and
+``ranges`` do not exist, hence there are no ``#...-cells`` properties.
+
+Scratch Regions
+===============
+
+To boot into kexec, we need to have a physically contiguous memory range that
+contains no handed over memory. Kexec then places the target kernel and initrd
+into that region. The new kernel exclusively uses this region for memory
+allocations during boot, up to the initialization of the page allocator.
+
+We guarantee that we always have such regions through the scratch regions: On
+first boot KHO allocates several physically contiguous memory regions. Since
+after kexec these regions will be used by early memory allocations, there is a
+scratch region per NUMA node plus a scratch region to satisfy allocation
+requests that do not require a particular NUMA node assignment.
+By default, the size of the scratch regions is calculated based on the amount
+of memory allocated during boot. The ``kho_scratch`` kernel command line
+option may be used to explicitly define the size of the scratch regions.
+The scratch regions are declared as CMA when the page allocator is initialized
+so that their memory can be used during the system's lifetime. CMA gives us the
+guarantee that no handover pages land in that region, because handover pages
+must be at a static physical memory location and CMA enforces that only
+movable pages can be located inside.
+
+After KHO kexec, we ignore the ``kho_scratch`` kernel command line option and
+instead reuse the exact same region that was originally allocated. This allows
+us to recursively execute any number of KHO kexecs. Because we used this region
+for boot memory allocations and as target memory for kexec blobs, some parts
+of that memory region may be reserved. These reservations are irrelevant for
+the next KHO, because kexec can overwrite even the original kernel.
+
+.. _finalization_phase:
+
+KHO finalization phase
+======================
+
+To enable a user space based kexec file loader, the kernel needs to be able to
+provide the FDT that describes the previous kernel's state before
+performing the actual kexec. The process of generating that FDT is
+called serialization. When the FDT is generated, some properties
+of the system may become immutable because they are already written down
+in the FDT. That state is called the KHO finalization phase.
+
+With the in-kernel kexec file loader, i.e., using the syscall
+``kexec_file_load``, the KHO FDT is not created until the actual kexec. Thus the
+finalization phase is much shorter. User space can optionally choose to generate
+the FDT early using the debugfs interface.
diff --git a/Documentation/kho/fdt.rst b/Documentation/kho/fdt.rst
new file mode 100644
index 000000000000..70b508533b77
--- /dev/null
+++ b/Documentation/kho/fdt.rst
@@ -0,0 +1,62 @@
+.. SPDX-License-Identifier: GPL-2.0-or-later
+
+=======
+KHO FDT
+=======
+
+KHO uses the flattened device tree (FDT) container format and libfdt
+library to create and parse the data that is passed between the
+kernels. The properties in the KHO FDT are stored in native format and can
+include any data KHO users need to preserve. Parsing of FDT subnodes is the
+responsibility of KHO users, except for nodes and properties defined by
+KHO itself.
+
+KHO nodes and properties
+========================
+
+Node ``preserved-memory``
+-------------------------
+
+KHO saves a special node named ``preserved-memory`` under the root node.
+This node contains the metadata for KHO to preserve pages across kexec.
+
+Property ``compatible``
+-----------------------
+
+The ``compatible`` property determines compatibility between the kernel
+that created the KHO FDT and the kernel that attempts to load it.
+If the kernel that loads the KHO FDT is not compatible with it, the entire
+KHO process will be bypassed.
+
+Examples
+========
+
+The following example demonstrates a KHO FDT that preserves two memory
+regions created with the ``reserve_mem`` kernel command line parameter::
+
+ /dts-v1/;
+
+ / {
+ compatible = "kho-v1";
+
+ memblock {
+ compatible = "memblock-v1";
+
+ region1 {
+ compatible = "reserve-mem-v1";
+ start = <0xc07a 0x4000000>;
+ size = <0x01 0x00>;
+ };
+
+ region2 {
+ compatible = "reserve-mem-v1";
+ start = <0xc07b 0x4000000>;
+ size = <0x8000 0x00>;
+ };
+
+ };
+
+ preserved-memory {
+ metadata = <0x00 0x00>;
+ };
+ };
diff --git a/Documentation/kho/index.rst b/Documentation/kho/index.rst
new file mode 100644
index 000000000000..d108c3f8d15c
--- /dev/null
+++ b/Documentation/kho/index.rst
@@ -0,0 +1,14 @@
+.. SPDX-License-Identifier: GPL-2.0-or-later
+
+========================
+Kexec Handover Subsystem
+========================
+
+.. toctree::
+ :maxdepth: 1
+
+ concepts
+ usage
+ fdt
+
+.. only:: subproject and html
diff --git a/Documentation/kho/usage.rst b/Documentation/kho/usage.rst
new file mode 100644
index 000000000000..b45dc58e8d3f
--- /dev/null
+++ b/Documentation/kho/usage.rst
@@ -0,0 +1,118 @@
+.. SPDX-License-Identifier: GPL-2.0-or-later
+
+====================
+Kexec Handover Usage
+====================
+
+Kexec HandOver (KHO) is a mechanism that allows Linux to preserve state -
+arbitrary properties as well as memory locations - across kexec.
+
+This document expects that you are familiar with the base KHO
+:ref:`concepts <concepts>`. If you have not read
+them yet, please do so now.
+
+Prerequisites
+=============
+
+KHO is available when the ``CONFIG_KEXEC_HANDOVER`` config option is set to y
+at compile time. Every KHO producer may have its own config option that you
+need to enable if you would like to preserve their respective state across
+kexec.
+
+To use KHO, please boot the kernel with the ``kho=on`` command line
+parameter. You may use the ``kho_scratch`` parameter to define the size of the
+scratch regions. For example, ``kho_scratch=16M,512M,256M`` will reserve a
+16 MiB low memory scratch area, a 512 MiB global scratch region, and 256 MiB
+per NUMA node scratch regions on boot.
+
+Perform a KHO kexec
+===================
+
+First, before you perform a KHO kexec, you can optionally move the system into
+the :ref:`KHO finalization phase <finalization_phase>` ::
+
+ $ echo 1 > /sys/kernel/debug/kho/out/finalize
+
+After this command, the KHO FDT is available in
+``/sys/kernel/debug/kho/out/fdt``.
+
+Next, load the target payload and kexec into it. It is important that you
+use the ``-s`` parameter to use the in-kernel kexec file loader, as user
+space kexec tooling currently has no support for KHO with the user space
+based file loader ::
+
+ # kexec -l Image --initrd=initrd -s
+ # kexec -e
+
+If you skipped finalization in the first step, ``kexec -e`` triggers
+FDT finalization automatically. The new kernel will boot up and contain
+some of the previous kernel's state.
+
+For example, if you used ``reserve_mem`` command line parameter to create
+an early memory reservation, the new kernel will have that memory at the
+same physical address as the old kernel.
+
+Unfreeze KHO FDT data
+=====================
+
+You can move the system out of the KHO finalization phase by calling ::
+
+ $ echo 0 > /sys/kernel/debug/kho/out/finalize
+
+After this command, the KHO FDT is no longer available in
+``/sys/kernel/debug/kho/out/fdt``, and the states kept in KHO can be
+modified by other kernel subsystems again.
+
+debugfs Interfaces
+==================
+
+Currently KHO creates the following debugfs interfaces. Notice that these
+interfaces may change in the future. They will be moved to sysfs once KHO is
+stabilized.
+
+``/sys/kernel/debug/kho/out/finalize``
+ Kexec HandOver (KHO) allows Linux to transition the state of
+ compatible drivers into the next kexec'ed kernel. To do so,
+ device drivers will serialize their current state into an FDT.
+ While the state is serialized, they are unable to perform
+ any modifications to state that was serialized, such as
+ handed over memory allocations.
+
+ When this file contains "1", the system is in the transition
+ state. When it contains "0", it is not. To switch between the
+ two states, echo the respective number into this file.
+
+``/sys/kernel/debug/kho/out/fdt_max``
+ KHO needs to allocate a buffer for the FDT that gets
+ generated before it knows the final size. By default, it
+ will allocate 10 MiB for it. You can write to this file
+ to modify the size of that allocation.
+
+``/sys/kernel/debug/kho/out/fdt``
+ When the KHO state tree is finalized, the kernel exposes the
+ flattened device tree blob that carries its current KHO
+ state in this file. Kexec user space tooling can use this
+ as the input file for the KHO payload image.
+
+``/sys/kernel/debug/kho/out/scratch_len``
+ To support continuous KHO kexecs, we need to reserve
+ physically contiguous memory regions that will always stay
+ available for future kexec allocations. This file describes
+ the length of these memory regions. Kexec user space tooling
+ can use this to determine where it should place its payload
+ images.
+
+``/sys/kernel/debug/kho/out/scratch_phys``
+ To support continuous KHO kexecs, we need to reserve
+ physically contiguous memory regions that will always stay
+ available for future kexec allocations. This file describes
+ the physical location of these memory regions. Kexec user space
+ tooling can use this to determine where it should place its
+ payload images.
+
+``/sys/kernel/debug/kho/in/fdt``
+ When the kernel was booted with Kexec HandOver (KHO),
+ the state tree that carries metadata about the previous
+ kernel's state is in this file in the format of a flattened
+ device tree. This file may disappear once all of its consumers
+ have finished interpreting their metadata.
diff --git a/Documentation/subsystem-apis.rst b/Documentation/subsystem-apis.rst
index b52ad5b969d4..5fc69d6ff9f0 100644
--- a/Documentation/subsystem-apis.rst
+++ b/Documentation/subsystem-apis.rst
@@ -90,3 +90,4 @@ Other subsystems
peci/index
wmi/index
tee/index
+ kho/index
diff --git a/MAINTAINERS b/MAINTAINERS
index a000a277ccf7..d0df0b380e34 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -12828,6 +12828,7 @@ F: include/linux/kernfs.h
KEXEC
L: kexec@lists.infradead.org
W: http://kernel.org/pub/linux/utils/kernel/kexec/
+F: Documentation/kho/
F: include/linux/kexec*.h
F: include/uapi/linux/kexec.h
F: kernel/kexec*
--
2.48.1.711.g2feabab25a-goog
^ permalink raw reply related [flat|nested] 103+ messages in thread
* Re: [PATCH v5 11/16] kexec: add config option for KHO
2025-03-20 1:55 ` [PATCH v5 11/16] kexec: add config option for KHO Changyuan Lyu
@ 2025-03-20 7:10 ` Krzysztof Kozlowski
2025-03-20 17:18 ` Changyuan Lyu
2025-03-24 4:18 ` Dave Young
1 sibling, 1 reply; 103+ messages in thread
From: Krzysztof Kozlowski @ 2025-03-20 7:10 UTC (permalink / raw)
To: Changyuan Lyu, linux-kernel
Cc: graf, akpm, luto, anthony.yznaga, arnd, ashish.kalra, benh, bp,
catalin.marinas, dave.hansen, dwmw2, ebiederm, mingo, jgowans,
corbet, rppt, mark.rutland, pbonzini, pasha.tatashin, hpa, peterz,
ptyadav, robh+dt, robh, saravanak, skinsburskii, rostedt, tglx,
thomas.lendacky, usama.arif, will, devicetree, kexec,
linux-arm-kernel, linux-doc, linux-mm, x86
On 20/03/2025 02:55, Changyuan Lyu wrote:
> From: Alexander Graf <graf@amazon.com>
>
> We have all generic code in place now to support Kexec with KHO. This
> patch adds a config option that depends on architecture support to
> enable KHO support.
>
> Signed-off-by: Alexander Graf <graf@amazon.com>
> Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> Co-developed-by: Changyuan Lyu <changyuanl@google.com>
> Signed-off-by: Changyuan Lyu <changyuanl@google.com>
What did you exactly co-develop here? Few changes does not mean you are
a co-developer.
Best regards,
Krzysztof
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 02/16] mm/mm_init: rename init_reserved_page to init_deferred_page
2025-03-20 1:55 ` [PATCH v5 02/16] mm/mm_init: rename init_reserved_page to init_deferred_page Changyuan Lyu
@ 2025-03-20 7:10 ` Krzysztof Kozlowski
2025-03-20 17:15 ` Changyuan Lyu
0 siblings, 1 reply; 103+ messages in thread
From: Krzysztof Kozlowski @ 2025-03-20 7:10 UTC (permalink / raw)
To: Changyuan Lyu, linux-kernel
Cc: graf, akpm, luto, anthony.yznaga, arnd, ashish.kalra, benh, bp,
catalin.marinas, dave.hansen, dwmw2, ebiederm, mingo, jgowans,
corbet, rppt, mark.rutland, pbonzini, pasha.tatashin, hpa, peterz,
ptyadav, robh+dt, robh, saravanak, skinsburskii, rostedt, tglx,
thomas.lendacky, usama.arif, will, devicetree, kexec,
linux-arm-kernel, linux-doc, linux-mm, x86
On 20/03/2025 02:55, Changyuan Lyu wrote:
> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
>
> When CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, init_reserved_page()
> function performs initialization of a struct page that would have been
> deferred normally.
>
> Rename it to init_deferred_page() to better reflect what the function does.
>
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Incorrect DCO chain.
Best regards,
Krzysztof
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 12/16] arm64: add KHO support
2025-03-20 1:55 ` [PATCH v5 12/16] arm64: add KHO support Changyuan Lyu
@ 2025-03-20 7:13 ` Krzysztof Kozlowski
2025-03-20 8:30 ` Krzysztof Kozlowski
2025-03-20 23:29 ` Changyuan Lyu
2025-04-11 3:47 ` Changyuan Lyu
1 sibling, 2 replies; 103+ messages in thread
From: Krzysztof Kozlowski @ 2025-03-20 7:13 UTC (permalink / raw)
To: Changyuan Lyu, linux-kernel
Cc: graf, akpm, luto, anthony.yznaga, arnd, ashish.kalra, benh, bp,
catalin.marinas, dave.hansen, dwmw2, ebiederm, mingo, jgowans,
corbet, rppt, mark.rutland, pbonzini, pasha.tatashin, hpa, peterz,
ptyadav, robh+dt, robh, saravanak, skinsburskii, rostedt, tglx,
thomas.lendacky, usama.arif, will, devicetree, kexec,
linux-arm-kernel, linux-doc, linux-mm, x86
On 20/03/2025 02:55, Changyuan Lyu wrote:
>
> +/**
> + * early_init_dt_check_kho - Decode info required for kexec handover from DT
> + */
> +static void __init early_init_dt_check_kho(void)
> +{
> + unsigned long node = chosen_node_offset;
> + u64 kho_start, scratch_start, scratch_size;
> + const __be32 *p;
> + int l;
> +
> + if (!IS_ENABLED(CONFIG_KEXEC_HANDOVER) || (long)node < 0)
> + return;
> +
> + p = of_get_flat_dt_prop(node, "linux,kho-fdt", &l);
You are adding undocumented ABI for OF properties. That's not what was
explained last time.
NAK.
Best regards,
Krzysztof
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 12/16] arm64: add KHO support
2025-03-20 7:13 ` Krzysztof Kozlowski
@ 2025-03-20 8:30 ` Krzysztof Kozlowski
2025-03-20 23:29 ` Changyuan Lyu
1 sibling, 0 replies; 103+ messages in thread
From: Krzysztof Kozlowski @ 2025-03-20 8:30 UTC (permalink / raw)
To: Changyuan Lyu, linux-kernel
Cc: graf, akpm, luto, anthony.yznaga, arnd, ashish.kalra, benh, bp,
catalin.marinas, dave.hansen, dwmw2, ebiederm, mingo, jgowans,
corbet, rppt, mark.rutland, pbonzini, pasha.tatashin, hpa, peterz,
ptyadav, robh+dt, robh, saravanak, skinsburskii, rostedt, tglx,
thomas.lendacky, usama.arif, will, devicetree, kexec,
linux-arm-kernel, linux-doc, linux-mm, x86
On Thu, Mar 20, 2025 at 08:13:24AM +0100, Krzysztof Kozlowski wrote:
> On 20/03/2025 02:55, Changyuan Lyu wrote:
> >
> > +/**
> > + * early_init_dt_check_kho - Decode info required for kexec handover from DT
> > + */
> > +static void __init early_init_dt_check_kho(void)
> > +{
> > + unsigned long node = chosen_node_offset;
> > + u64 kho_start, scratch_start, scratch_size;
> > + const __be32 *p;
> > + int l;
> > +
> > + if (!IS_ENABLED(CONFIG_KEXEC_HANDOVER) || (long)node < 0)
> > + return;
> > +
> > + p = of_get_flat_dt_prop(node, "linux,kho-fdt", &l);
>
>
> You are adding undocumented ABI for OF properties. That's not what was
> explained last time.
>
> NAK.
Also there are checkpatch warnings :/
Best regards,
Krzysztof
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 16/16] Documentation: add documentation for KHO
2025-03-20 1:55 ` [PATCH v5 16/16] Documentation: add documentation for KHO Changyuan Lyu
@ 2025-03-20 14:45 ` Jonathan Corbet
2025-03-21 6:33 ` Changyuan Lyu
0 siblings, 1 reply; 103+ messages in thread
From: Jonathan Corbet @ 2025-03-20 14:45 UTC (permalink / raw)
To: Changyuan Lyu, linux-kernel
Cc: graf, akpm, luto, anthony.yznaga, arnd, ashish.kalra, benh, bp,
catalin.marinas, dave.hansen, dwmw2, ebiederm, mingo, jgowans,
krzk, rppt, mark.rutland, pbonzini, pasha.tatashin, hpa, peterz,
ptyadav, robh+dt, robh, saravanak, skinsburskii, rostedt, tglx,
thomas.lendacky, usama.arif, will, devicetree, kexec,
linux-arm-kernel, linux-doc, linux-mm, x86, Changyuan Lyu
Changyuan Lyu <changyuanl@google.com> writes:
> From: Alexander Graf <graf@amazon.com>
>
> With KHO in place, let's add documentation that describes what it is and
> how to use it.
>
> Signed-off-by: Alexander Graf <graf@amazon.com>
> Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> Co-developed-by: Changyuan Lyu <changyuanl@google.com>
> Signed-off-by: Changyuan Lyu <changyuanl@google.com>
> ---
> .../admin-guide/kernel-parameters.txt | 25 ++++
> Documentation/kho/concepts.rst | 70 +++++++++++
> Documentation/kho/fdt.rst | 62 +++++++++
> Documentation/kho/index.rst | 14 +++
> Documentation/kho/usage.rst | 118 ++++++++++++++++++
> Documentation/subsystem-apis.rst | 1 +
> MAINTAINERS | 1 +
> 7 files changed, 291 insertions(+)
> create mode 100644 Documentation/kho/concepts.rst
> create mode 100644 Documentation/kho/fdt.rst
> create mode 100644 Documentation/kho/index.rst
> create mode 100644 Documentation/kho/usage.rst
I will ask again: please let's not create another top-level docs
directory for this...? It looks like it belongs in the admin guide to
me.
Thanks,
jon
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 02/16] mm/mm_init: rename init_reserved_page to init_deferred_page
2025-03-20 7:10 ` Krzysztof Kozlowski
@ 2025-03-20 17:15 ` Changyuan Lyu
0 siblings, 0 replies; 103+ messages in thread
From: Changyuan Lyu @ 2025-03-20 17:15 UTC (permalink / raw)
To: krzk
Cc: akpm, anthony.yznaga, arnd, ashish.kalra, benh, bp,
catalin.marinas, changyuanl, corbet, dave.hansen, devicetree,
dwmw2, ebiederm, graf, hpa, jgowans, kexec, linux-arm-kernel,
linux-doc, linux-kernel, linux-mm, luto, mark.rutland, mingo,
pasha.tatashin, pbonzini, peterz, ptyadav, robh+dt, robh, rostedt,
rppt, saravanak, skinsburskii, tglx, thomas.lendacky, will, x86
Hi Krzysztof,
On Thu, Mar 20, 2025 at 08:10:48 +0100, Krzysztof Kozlowski <krzk@kernel.org> wrote:
> On 20/03/2025 02:55, Changyuan Lyu wrote:
> > From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> >
> > When CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, init_reserved_page()
> > function performs initialization of a struct page that would have been
> > deferred normally.
> >
> > Rename it to init_deferred_page() to better reflect what the function does.
> >
> > Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Changyuan Lyu <changyuanl@google.com>
>
> Incorrect DCO chain.
>
Thanks for the reminder! I missed "Signed-off-by" in a few patches from
Mike and Alex where I did not make any changes. I will fix them in the
next version.
Best,
Changyuan
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 11/16] kexec: add config option for KHO
2025-03-20 7:10 ` Krzysztof Kozlowski
@ 2025-03-20 17:18 ` Changyuan Lyu
0 siblings, 0 replies; 103+ messages in thread
From: Changyuan Lyu @ 2025-03-20 17:18 UTC (permalink / raw)
To: krzk
Cc: akpm, anthony.yznaga, arnd, ashish.kalra, benh, bp,
catalin.marinas, changyuanl, corbet, dave.hansen, devicetree,
dwmw2, ebiederm, graf, hpa, jgowans, kexec, linux-arm-kernel,
linux-doc, linux-kernel, linux-mm, luto, mark.rutland, mingo,
pasha.tatashin, pbonzini, peterz, ptyadav, robh+dt, robh, rostedt,
rppt, saravanak, skinsburskii, tglx, thomas.lendacky, will, x86
Hi Krzysztof,
On Thu, Mar 20, 2025 at 08:10:37 +0100, Krzysztof Kozlowski <krzk@kernel.org> wrote:
> On 20/03/2025 02:55, Changyuan Lyu wrote:
> > From: Alexander Graf <graf@amazon.com>
> >
> > We have all generic code in place now to support Kexec with KHO. This
> > patch adds a config option that depends on architecture support to
> > enable KHO support.
> >
> > Signed-off-by: Alexander Graf <graf@amazon.com>
> > Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> > Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> > Co-developed-by: Changyuan Lyu <changyuanl@google.com>
> > Signed-off-by: Changyuan Lyu <changyuanl@google.com>
> What did you exactly co-develop here? Few changes does not mean you are
> a co-developer.
I proposed and implemented the hashtable-based state tree API in the
previous patch "kexec: add Kexec HandOver (KHO) generation helpers" [1]
and then added `select XXHASH` here. If one line of change does not
qualify for "Co-developed-by", I will remove it from the commit message
in the next version.
[1] https://lore.kernel.org/all/20250320015551.2157511-8-changyuanl@google.com/
Best,
Changyuan
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 12/16] arm64: add KHO support
2025-03-20 7:13 ` Krzysztof Kozlowski
2025-03-20 8:30 ` Krzysztof Kozlowski
@ 2025-03-20 23:29 ` Changyuan Lyu
1 sibling, 0 replies; 103+ messages in thread
From: Changyuan Lyu @ 2025-03-20 23:29 UTC (permalink / raw)
To: krzk
Cc: akpm, anthony.yznaga, arnd, ashish.kalra, benh, bp,
catalin.marinas, changyuanl, corbet, dave.hansen, devicetree,
dwmw2, ebiederm, graf, hpa, jgowans, kexec, linux-arm-kernel,
linux-doc, linux-kernel, linux-mm, luto, mark.rutland, mingo,
pasha.tatashin, pbonzini, peterz, ptyadav, robh+dt, robh, rostedt,
rppt, saravanak, skinsburskii, tglx, thomas.lendacky, will, x86
Hi, Krzysztof
On Thu, Mar 20, 2025 at 08:13:24 +0100, Krzysztof Kozlowski <krzk@kernel.org> wrote:
> On 20/03/2025 02:55, Changyuan Lyu wrote:
> >
> > +/**
> > + * early_init_dt_check_kho - Decode info required for kexec handover from DT
> > + */
> > +static void __init early_init_dt_check_kho(void)
> > +{
> > + unsigned long node = chosen_node_offset;
> > + u64 kho_start, scratch_start, scratch_size;
> > + const __be32 *p;
> > + int l;
> > +
> > + if (!IS_ENABLED(CONFIG_KEXEC_HANDOVER) || (long)node < 0)
> > + return;
> > +
> > + p = of_get_flat_dt_prop(node, "linux,kho-fdt", &l);
>
>
> You are adding undocumented ABI for OF properties. That's not what was
> explained last time.
Thanks for the reminder! Sorry I missed [1].
I drafted a PR to add the documentation:
https://github.com/devicetree-org/dt-schema/pull/158 .
[1] https://lore.kernel.org/all/b1e59928-b2f4-47d4-87b8-424234c52f8d@kernel.org/
Best,
Changyuan
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 16/16] Documentation: add documentation for KHO
2025-03-20 14:45 ` Jonathan Corbet
@ 2025-03-21 6:33 ` Changyuan Lyu
2025-03-21 13:46 ` Jonathan Corbet
0 siblings, 1 reply; 103+ messages in thread
From: Changyuan Lyu @ 2025-03-21 6:33 UTC (permalink / raw)
To: corbet
Cc: akpm, anthony.yznaga, arnd, ashish.kalra, benh, bp,
catalin.marinas, changyuanl, dave.hansen, devicetree, dwmw2,
ebiederm, graf, hpa, jgowans, kexec, krzk, linux-arm-kernel,
linux-doc, linux-kernel, linux-mm, luto, mark.rutland, mingo,
pasha.tatashin, pbonzini, peterz, ptyadav, robh+dt, robh, rostedt,
rppt, saravanak, skinsburskii, tglx, thomas.lendacky, will, x86
Hi Jonathan,
On Thu, Mar 20, 2025 at 08:45:03 -0600, Jonathan Corbet <corbet@lwn.net> wrote:
> Changyuan Lyu <changyuanl@google.com> writes:
>
> > From: Alexander Graf <graf@amazon.com>
> >
> > With KHO in place, let's add documentation that describes what it is and
> > how to use it.
> >
> > Signed-off-by: Alexander Graf <graf@amazon.com>
> > Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> > Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> > Co-developed-by: Changyuan Lyu <changyuanl@google.com>
> > Signed-off-by: Changyuan Lyu <changyuanl@google.com>
> > ---
> > .../admin-guide/kernel-parameters.txt | 25 ++++
> > Documentation/kho/concepts.rst | 70 +++++++++++
> > Documentation/kho/fdt.rst | 62 +++++++++
> > Documentation/kho/index.rst | 14 +++
> > Documentation/kho/usage.rst | 118 ++++++++++++++++++
> > Documentation/subsystem-apis.rst | 1 +
> > MAINTAINERS | 1 +
> > 7 files changed, 291 insertions(+)
> > create mode 100644 Documentation/kho/concepts.rst
> > create mode 100644 Documentation/kho/fdt.rst
> > create mode 100644 Documentation/kho/index.rst
> > create mode 100644 Documentation/kho/usage.rst
>
> I will ask again: please let's not create another top-level docs
> directory for this...? It looks like it belongs in the admin guide to
> me.
Thanks for reviewing the patch! Sure, I will move usage.rst to
Documentation/admin-guide in the next version.
However, I think concepts.rst and fdt.rst are not end-user oriented,
but are for kernel developers of other subsystems who use the KHO API
(I also plan to include the kernel-doc generated from
kernel/kexec_handover.c and include/linux/kexec_handover.h there in
the next version).
Would Documentation/core-api be a better choice?
Best,
Changyuan
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 07/16] kexec: add Kexec HandOver (KHO) generation helpers
2025-03-20 1:55 ` [PATCH v5 07/16] kexec: add Kexec HandOver (KHO) generation helpers Changyuan Lyu
@ 2025-03-21 13:34 ` Jason Gunthorpe
2025-03-23 19:02 ` Changyuan Lyu
2025-03-24 18:40 ` Frank van der Linden
1 sibling, 1 reply; 103+ messages in thread
From: Jason Gunthorpe @ 2025-03-21 13:34 UTC (permalink / raw)
To: Changyuan Lyu
Cc: linux-kernel, graf, akpm, luto, anthony.yznaga, arnd,
ashish.kalra, benh, bp, catalin.marinas, dave.hansen, dwmw2,
ebiederm, mingo, jgowans, corbet, krzk, rppt, mark.rutland,
pbonzini, pasha.tatashin, hpa, peterz, ptyadav, robh+dt, robh,
saravanak, skinsburskii, rostedt, tglx, thomas.lendacky,
usama.arif, will, devicetree, kexec, linux-arm-kernel, linux-doc,
linux-mm, x86
On Wed, Mar 19, 2025 at 06:55:42PM -0700, Changyuan Lyu wrote:
> From: Alexander Graf <graf@amazon.com>
>
> Add the core infrastructure to generate Kexec HandOver metadata. Kexec
> HandOver is a mechanism that allows Linux to preserve state - arbitrary
> properties as well as memory locations - across kexec.
>
> It does so using 2 concepts:
>
> 1) State Tree - Every KHO kexec carries a state tree that describes the
> state of the system. The state tree is represented as hash-tables.
> Device drivers can add/remove their data into/from the state tree at
> system runtime. On kexec, the tree is converted to FDT (flattened
> device tree).
Why are we changing this? I much preferred the idea of having recursive
FDTs to this notion of copying everything into tables and then out into
an FDT. Now that we have the preserved pages mechanism there is a pretty
direct path to doing recursive FDT.
I feel like this patch is premature; it should come later in the
project along with a stronger justification for this approach.
IMHO keep things simple for this series, just the very basics.
> +int register_kho_notifier(struct notifier_block *nb)
> +{
> + return blocking_notifier_chain_register(&kho_out.chain_head, nb);
> +}
> +EXPORT_SYMBOL_GPL(register_kho_notifier);
And another different set of notifiers? :(
> +static int kho_finalize(void)
> +{
> + int err = 0;
> + void *fdt;
> +
> + fdt = kvmalloc(kho_out.fdt_max, GFP_KERNEL);
> + if (!fdt)
> + return -ENOMEM;
We go to all the trouble of keeping track of stuff in dynamic hashes,
but we still can't automatically size the fdt, and we keep the dumb
uapi that makes the user specify it? :( :(
Jason
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 09/16] kexec: enable KHO support for memory preservation
2025-03-20 1:55 ` [PATCH v5 09/16] kexec: enable KHO support for memory preservation Changyuan Lyu
@ 2025-03-21 13:46 ` Jason Gunthorpe
2025-03-22 19:12 ` Mike Rapoport
2025-03-23 19:07 ` Changyuan Lyu
2025-03-27 10:03 ` Pratyush Yadav
` (2 subsequent siblings)
3 siblings, 2 replies; 103+ messages in thread
From: Jason Gunthorpe @ 2025-03-21 13:46 UTC (permalink / raw)
To: Changyuan Lyu
Cc: linux-kernel, graf, akpm, luto, anthony.yznaga, arnd,
ashish.kalra, benh, bp, catalin.marinas, dave.hansen, dwmw2,
ebiederm, mingo, jgowans, corbet, krzk, rppt, mark.rutland,
pbonzini, pasha.tatashin, hpa, peterz, ptyadav, robh+dt, robh,
saravanak, skinsburskii, rostedt, tglx, thomas.lendacky,
usama.arif, will, devicetree, kexec, linux-arm-kernel, linux-doc,
linux-mm, x86
On Wed, Mar 19, 2025 at 06:55:44PM -0700, Changyuan Lyu wrote:
> +/**
> + * kho_preserve_folio - preserve a folio across KHO.
> + * @folio: folio to preserve
> + *
> + * Records that the entire folio is preserved across KHO. The order
> + * will be preserved as well.
> + *
> + * Return: 0 on success, error code on failure
> + */
> +int kho_preserve_folio(struct folio *folio)
> +{
> + unsigned long pfn = folio_pfn(folio);
> + unsigned int order = folio_order(folio);
> + int err;
> +
> + if (!kho_enable)
> + return -EOPNOTSUPP;
> +
> + down_read(&kho_out.tree_lock);
> + if (kho_out.fdt) {
What are the lock and the fdt test for?
I'm getting the feeling that kho_preserve_folio() and the like should
probably accept some kind of 'struct kho_serialization *'; then we
don't need this to prove we are within a valid serialization window.
The pointer could be passed through the notifiers.
The global variables in this series are sort of ugly..
We want this to be fast, so try hard to avoid a lock..
> +void *kho_restore_phys(phys_addr_t phys, size_t size)
> +{
> + unsigned long start_pfn, end_pfn, pfn;
> + void *va = __va(phys);
> +
> + start_pfn = PFN_DOWN(phys);
> + end_pfn = PFN_UP(phys + size);
> +
> + for (pfn = start_pfn; pfn < end_pfn; pfn++) {
> + struct page *page = pfn_to_online_page(pfn);
> +
> + if (!page)
> + return NULL;
> + kho_restore_page(page);
> + }
> +
> + return va;
> +}
> +EXPORT_SYMBOL_GPL(kho_restore_phys);
What do you imagine this being used for? I'm not sure what value there
is in returning a void *. How does the caller "free" this?
> +#define KHOSER_PTR(type) \
> + union { \
> + phys_addr_t phys; \
> + type ptr; \
> + }
> +#define KHOSER_STORE_PTR(dest, val) \
> + ({ \
> + (dest).phys = virt_to_phys(val); \
> + typecheck(typeof((dest).ptr), val); \
> + })
> +#define KHOSER_LOAD_PTR(src) \
> + ((src).phys ? (typeof((src).ptr))(phys_to_virt((src).phys)) : NULL)
I had imagined these macros would be in a header and usable by drivers
that also want to use structs to carry information.
> +static void deserialize_bitmap(unsigned int order,
> + struct khoser_mem_bitmap_ptr *elm)
> +{
> + struct kho_mem_phys_bits *bitmap = KHOSER_LOAD_PTR(elm->bitmap);
> + unsigned long bit;
> +
> + for_each_set_bit(bit, bitmap->preserve, PRESERVE_BITS) {
> + int sz = 1 << (order + PAGE_SHIFT);
> + phys_addr_t phys =
> + elm->phys_start + (bit << (order + PAGE_SHIFT));
> + struct page *page = phys_to_page(phys);
> +
> + memblock_reserve(phys, sz);
> + memblock_reserved_mark_noinit(phys, sz);
Mike asked about this earlier: is it worth combining runs of set bits
to increase sz? Or is this sort of temporary, pending something better
that doesn't rely on memblock_reserve?
> + page->private = order;
Can't we just set the page order directly? Why use private?
> @@ -829,6 +1305,10 @@ static __init int kho_init(void)
>
> kho_out.root.name = "";
?
> err = kho_add_string_prop(&kho_out.root, "compatible", "kho-v1");
> + err |= kho_add_prop(&kho_out.preserved_memory, "metadata",
> + &kho_out.first_chunk_phys, sizeof(phys_addr_t));
metadata doesn't feel like a great name..
Please also document the whole FDT schema thoroughly!
There should be YAML files, just like in the normal DT case, defining
all of this. This level of documentation and stability was one of the
selling reasons why FDT is being used here!
Jason
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 16/16] Documentation: add documentation for KHO
2025-03-21 6:33 ` Changyuan Lyu
@ 2025-03-21 13:46 ` Jonathan Corbet
0 siblings, 0 replies; 103+ messages in thread
From: Jonathan Corbet @ 2025-03-21 13:46 UTC (permalink / raw)
To: Changyuan Lyu
Cc: akpm, anthony.yznaga, arnd, ashish.kalra, benh, bp,
catalin.marinas, changyuanl, dave.hansen, devicetree, dwmw2,
ebiederm, graf, hpa, jgowans, kexec, krzk, linux-arm-kernel,
linux-doc, linux-kernel, linux-mm, luto, mark.rutland, mingo,
pasha.tatashin, pbonzini, peterz, ptyadav, robh+dt, robh, rostedt,
rppt, saravanak, skinsburskii, tglx, thomas.lendacky, will, x86
Changyuan Lyu <changyuanl@google.com> writes:
> However, I think concepts.rst and fdt.rst are not not end-user oriented,
> but for kernel developers of other subsystems to use KHO API (I also plan
> to include the kernel-doc generated from kernel/kexec_handover.c and
> include/linux/kexec_handover.h here in the next version).
> Should Documentation/core-api be a better choice?
Yes, I think it would be.
Thanks,
jon
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 10/16] kexec: add KHO support to kexec file loads
2025-03-20 1:55 ` [PATCH v5 10/16] kexec: add KHO support to kexec file loads Changyuan Lyu
@ 2025-03-21 13:48 ` Jason Gunthorpe
0 siblings, 0 replies; 103+ messages in thread
From: Jason Gunthorpe @ 2025-03-21 13:48 UTC (permalink / raw)
To: Changyuan Lyu
Cc: linux-kernel, graf, akpm, luto, anthony.yznaga, arnd,
ashish.kalra, benh, bp, catalin.marinas, dave.hansen, dwmw2,
ebiederm, mingo, jgowans, corbet, krzk, rppt, mark.rutland,
pbonzini, pasha.tatashin, hpa, peterz, ptyadav, robh+dt, robh,
saravanak, skinsburskii, rostedt, tglx, thomas.lendacky,
usama.arif, will, devicetree, kexec, linux-arm-kernel, linux-doc,
linux-mm, x86
On Wed, Mar 19, 2025 at 06:55:45PM -0700, Changyuan Lyu wrote:
> +int kho_copy_fdt(struct kimage *image)
> +{
> + int err = 0;
> + void *fdt;
> +
> + if (!kho_enable || !image->file_mode)
> + return 0;
> +
> + if (!kho_out.fdt) {
> + err = kho_finalize();
> + kho_out_update_debugfs_fdt();
> + if (err)
> + return err;
> + }
> +
> + fdt = kimage_map_segment(image, image->kho.fdt->mem,
> + PAGE_ALIGN(kho_out.fdt_max));
> + if (!fdt) {
> + pr_err("failed to vmap fdt ksegment in kimage\n");
> + return -ENOMEM;
> + }
I continue to think this copying is kind of pointless.
I liked the point where we could get the FDT blob into userspace and
validate it, i.e. through debugfs.
But as a way to actually do the kexec, I think you should not copy but
instead preserve the memory holding the FDT using the new preservation
mechanism and simply put a 16-byte size/length on the image to refer
to it.
This would also clean up the special hard-coded memblock reserve that
preserves the FDT memory, as normal preservation would take care of it.
Now that this is all in the kernel, it seems like there is no reason to
do the copying version any more.
Jason
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 09/16] kexec: enable KHO support for memory preservation
2025-03-21 13:46 ` Jason Gunthorpe
@ 2025-03-22 19:12 ` Mike Rapoport
2025-03-23 18:55 ` Jason Gunthorpe
2025-03-23 19:07 ` Changyuan Lyu
1 sibling, 1 reply; 103+ messages in thread
From: Mike Rapoport @ 2025-03-22 19:12 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Changyuan Lyu, linux-kernel, graf, akpm, luto, anthony.yznaga,
arnd, ashish.kalra, benh, bp, catalin.marinas, dave.hansen, dwmw2,
ebiederm, mingo, jgowans, corbet, krzk, mark.rutland, pbonzini,
pasha.tatashin, hpa, peterz, ptyadav, robh+dt, robh, saravanak,
skinsburskii, rostedt, tglx, thomas.lendacky, usama.arif, will,
devicetree, kexec, linux-arm-kernel, linux-doc, linux-mm, x86
On Fri, Mar 21, 2025 at 10:46:29AM -0300, Jason Gunthorpe wrote:
> On Wed, Mar 19, 2025 at 06:55:44PM -0700, Changyuan Lyu wrote:
> >
> > +static void deserialize_bitmap(unsigned int order,
> > + struct khoser_mem_bitmap_ptr *elm)
> > +{
> > + struct kho_mem_phys_bits *bitmap = KHOSER_LOAD_PTR(elm->bitmap);
> > + unsigned long bit;
> > +
> > + for_each_set_bit(bit, bitmap->preserve, PRESERVE_BITS) {
> > + int sz = 1 << (order + PAGE_SHIFT);
> > + phys_addr_t phys =
> > + elm->phys_start + (bit << (order + PAGE_SHIFT));
> > + struct page *page = phys_to_page(phys);
> > +
> > + memblock_reserve(phys, sz);
> > + memblock_reserved_mark_noinit(phys, sz);
>
> Mike asked about this earlier, is it work combining runs of set bits
> to increase sz? Or is this sort of temporary pending something better
> that doesn't rely on memblock_reserve?
This hunk actually came from me. I decided to keep it simple for now
and check what the alternatives are, like moving away from
memblock_reserve(), adding a maple_tree, or something else entirely.
> > + page->private = order;
>
> Can't just set the page order directly? Why use private?
Setting the order means recreating the folio the way prep_compound_page()
does. I think it's better to postpone it until the folio is requested. This
way it might run after SMP is enabled. Besides, when we start allocating
folios separately from struct page, initializing it here would be a real
issue.
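For illustration, a rough sketch of that lazy reconstruction
(hypothetical helper, not the exact code in this series):

static struct folio *restore_preserved_folio(phys_addr_t phys)
{
	struct page *page = pfn_to_online_page(PHYS_PFN(phys));
	unsigned int order;

	if (!page)
		return NULL;

	/* the order was stashed in page->private at deserialization */
	order = page->private;
	page->private = 0;
	if (order)
		prep_compound_page(page, order);	/* mm-internal helper */

	return page_folio(page);
}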
> Jason
--
Sincerely yours,
Mike.
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 09/16] kexec: enable KHO support for memory preservation
2025-03-22 19:12 ` Mike Rapoport
@ 2025-03-23 18:55 ` Jason Gunthorpe
2025-03-24 18:18 ` Mike Rapoport
0 siblings, 1 reply; 103+ messages in thread
From: Jason Gunthorpe @ 2025-03-23 18:55 UTC (permalink / raw)
To: Mike Rapoport
Cc: Changyuan Lyu, linux-kernel, graf, akpm, luto, anthony.yznaga,
arnd, ashish.kalra, benh, bp, catalin.marinas, dave.hansen, dwmw2,
ebiederm, mingo, jgowans, corbet, krzk, mark.rutland, pbonzini,
pasha.tatashin, hpa, peterz, ptyadav, robh+dt, robh, saravanak,
skinsburskii, rostedt, tglx, thomas.lendacky, usama.arif, will,
devicetree, kexec, linux-arm-kernel, linux-doc, linux-mm, x86
On Sat, Mar 22, 2025 at 03:12:26PM -0400, Mike Rapoport wrote:
> This hunk actually came from me. I decided to keep it simple for now and
> check what are the alternatives, like moving away from memblock_reserve(),
> adding a maple_tree or even something else.
Okay, makes sense to me
> > > + page->private = order;
> >
> > Can't just set the page order directly? Why use private?
>
> Setting the order means recreating the folio the way prep_compound_page()
> does. I think it's better to postpone it until the folio is requested. This
> way it might run after SMP is enabled.
I see, that makes sense, but it could still use page->order..
> Besides, when we start allocating
> folios separately from struct page, initializing it here would be a real
> issue.
Yes, but then we also wouldn't have page->private to make it work.
Somehow anything we want to carry over would have to be encoded in the
memdesc directly.
I think this supports my remark somewhere else that any user of this
that wants to preserve per-page data should do it on its own somehow,
as an add-on on top?
Jason
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 07/16] kexec: add Kexec HandOver (KHO) generation helpers
2025-03-21 13:34 ` Jason Gunthorpe
@ 2025-03-23 19:02 ` Changyuan Lyu
2025-03-24 16:28 ` Jason Gunthorpe
0 siblings, 1 reply; 103+ messages in thread
From: Changyuan Lyu @ 2025-03-23 19:02 UTC (permalink / raw)
To: jgg
Cc: akpm, anthony.yznaga, arnd, ashish.kalra, benh, bp,
catalin.marinas, changyuanl, corbet, dave.hansen, devicetree,
dwmw2, ebiederm, graf, hpa, jgowans, kexec, krzk,
linux-arm-kernel, linux-doc, linux-kernel, linux-mm, luto,
mark.rutland, mingo, pasha.tatashin, pbonzini, peterz, ptyadav,
robh+dt, robh, rostedt, rppt, saravanak, skinsburskii, tglx,
thomas.lendacky, will, x86
Hi Jason, thanks for reviewing the patchset!
On Fri, Mar 21, 2025 at 10:34:47 -0300, Jason Gunthorpe <jgg@nvidia.com> wrote:
> On Wed, Mar 19, 2025 at 06:55:42PM -0700, Changyuan Lyu wrote:
> > From: Alexander Graf <graf@amazon.com>
> >
> > Add the core infrastructure to generate Kexec HandOver metadata. Kexec
> > HandOver is a mechanism that allows Linux to preserve state - arbitrary
> > properties as well as memory locations - across kexec.
> >
> > It does so using 2 concepts:
> >
> > 1) State Tree - Every KHO kexec carries a state tree that describes the
> > state of the system. The state tree is represented as hash-tables.
> > Device drivers can add/remove their data into/from the state tree at
> > system runtime. On kexec, the tree is converted to FDT (flattened
> > device tree).
>
> Why are we changing this? I much prefered the idea of having recursive
> FDTs than this notion copying eveything into tables then out into FDT?
> Now that we have the preserved pages mechanism there is a pretty
> direct path to doing recursive FDT.
We are not copying data into the hashtables; instead, the hashtables
only record the address and size of the data to be serialized into the
FDT. The idea is similar to recording preserved folios in the xarray
and then serializing it to linked pages.
> I feel like this patch is premature, it should come later in the
> project along with a stronger justification for this approach.
>
> IHMO keep things simple for this series, just the very basics.
The main purpose of using hashtables is to enable KHO users to save
data to KHO at any time, not just at the time of activating/finalizing
KHO through sysfs/debugfs. For example, FDBox can save data into the
KHO tree as soon as a new fd is saved to KHO. Also, using hashtables
allows KHO users to add data to KHO concurrently, while with notifiers,
KHO users' callbacks are executed serially.
Regarding the suggestion of recursive FDT, I feel like it is already
doable with this patchset, or even with Mike's V4 patch. A KHO user can
just allocate a buffer, serialize all its state to the buffer using
libfdt (or even using another binary format), save the address of the
buffer to KHO's tree, and finally register the buffer's underlying
pages/folios with kho_preserve_folio().
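To make that concrete, here is a minimal sketch of such a user;
kho_preserve_folio() is from this series, while my_driver_kho_node and
the exact kho_add_prop() usage are assumptions for illustration only:

static int my_driver_save_to_kho(void)
{
	struct folio *folio = folio_alloc(GFP_KERNEL, 0);
	phys_addr_t phys;
	void *fdt;
	int err;

	if (!folio)
		return -ENOMEM;
	fdt = folio_address(folio);

	/* serialize this driver's state into its own FDT fragment */
	err = fdt_create(fdt, PAGE_SIZE);
	err |= fdt_finish_reservemap(fdt);
	err |= fdt_begin_node(fdt, "");
	err |= fdt_property_string(fdt, "compatible", "my-driver-v1");
	err |= fdt_property_u64(fdt, "some-state", 42);
	err |= fdt_end_node(fdt);
	err |= fdt_finish(fdt);
	if (err) {
		folio_put(folio);
		return -EINVAL;
	}

	/* keep the backing page across kexec */
	err = kho_preserve_folio(folio);
	if (err) {
		folio_put(folio);
		return err;
	}

	/*
	 * Record where the fragment lives in this driver's KHO state
	 * tree node (my_driver_kho_node is set up elsewhere,
	 * hypothetical).
	 */
	phys = PFN_PHYS(folio_pfn(folio));
	return kho_add_prop(&my_driver_kho_node, "fdt", &phys, sizeof(phys));
}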
> > +int register_kho_notifier(struct notifier_block *nb)
> > +{
> > + return blocking_notifier_chain_register(&kho_out.chain_head, nb);
> > +}
> > +EXPORT_SYMBOL_GPL(register_kho_notifier);
>
> And another different set of notifiers? :(
I changed the semantics of the notifiers. In Mike's V4, the KHO
notifier passes the fdt pointer to KHO users so they can push data into
the blob. In this patchset, it notifies KHO users about the last chance
to save data to KHO.
It is not necessary for every KHO user to register a notifier, as they
can use the helper functions to save data to the KHO tree at any time
(but before the KHO tree is converted and frozen). For example, FDBox
would not need a notifier if it saves data to the KHO tree as soon as
an FD is registered with it.
However, some KHO users may still want to add data just before kexec,
so I kept the notifiers to allow KHO users to be notified when the
state tree hashtables are about to be frozen and converted to FDT.
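For illustration, a minimal sketch of such a late user;
register_kho_notifier() is from this series, while the meaning of the
action/data arguments and my_driver_flush_state_to_kho() are
assumptions:

static int my_driver_kho_notify(struct notifier_block *nb,
				unsigned long action, void *data)
{
	/* last chance to push remaining state into the KHO tree */
	my_driver_flush_state_to_kho();
	return NOTIFY_OK;
}

static struct notifier_block my_driver_kho_nb = {
	.notifier_call = my_driver_kho_notify,
};

static int __init my_driver_kho_init(void)
{
	return register_kho_notifier(&my_driver_kho_nb);
}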
> > +static int kho_finalize(void)
> > +{
> > + int err = 0;
> > + void *fdt;
> > +
> > + fdt = kvmalloc(kho_out.fdt_max, GFP_KERNEL);
> > + if (!fdt)
> > + return -ENOMEM;
>
> We go to all the trouble of keeping track of stuff in dynamic hashes
> but still can't automatically size the fdt and keep the dumb uapi to
> have the user say? :( :(
The reason for keeping fdt_max in this patchset is to simplify support
for kexec_file_load().
We want to be able to do kexec_file_load() first and then do KHO
activation/finalization, to move kexec_file_load() out of the blackout
window. At the time of kexec_file_load(), we need to pass the KHO FDT
address to the new kernel's setup data (x86) or devicetree (arm), but
the KHO FDT is not generated yet. The simple solution used in this
patchset is to reserve a ksegment of size fdt_max and pass the address
of that ksegment to the new kernel. The final FDT is copied to that
ksegment in kernel_kexec().
An extra benefit of this solution is that the reserved ksegment is
physically contiguous.
To completely remove fdt_max, I am considering the idea in [1]. At the
time of kexec_file_load(), we pass the address of an anchor page to
the new kernel, and the anchor page will later be filled with the
physical addresses of the pages containing the FDT blob. Multiple
anchor pages can be linked together. The FDT blob pages can be
physically noncontiguous.
[1] https://lore.kernel.org/all/CA+CK2bBBX+HgD0HLj-AyTScM59F2wXq11BEPgejPMHoEwqj+_Q@mail.gmail.com/
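A hypothetical layout of such an anchor page (illustration only, not
part of this series or of [1]):

struct kho_fdt_anchor {
	u64 next_anchor_phys;	/* next anchor page, 0 if this is the last */
	u64 nr_fdt_pages;	/* number of entries below */
	u64 fdt_page_phys[];	/* physical addresses of the FDT blob pages */
};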
Best,
Changyuan
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 09/16] kexec: enable KHO support for memory preservation
2025-03-21 13:46 ` Jason Gunthorpe
2025-03-22 19:12 ` Mike Rapoport
@ 2025-03-23 19:07 ` Changyuan Lyu
2025-03-25 2:04 ` Jason Gunthorpe
1 sibling, 1 reply; 103+ messages in thread
From: Changyuan Lyu @ 2025-03-23 19:07 UTC (permalink / raw)
To: jgg
Cc: akpm, anthony.yznaga, arnd, ashish.kalra, benh, bp,
catalin.marinas, changyuanl, corbet, dave.hansen, devicetree,
dwmw2, ebiederm, graf, hpa, jgowans, kexec, krzk,
linux-arm-kernel, linux-doc, linux-kernel, linux-mm, luto,
mark.rutland, mingo, pasha.tatashin, pbonzini, peterz, ptyadav,
robh+dt, robh, rostedt, rppt, saravanak, skinsburskii, tglx,
thomas.lendacky, will, x86
On Fri, Mar 21, 2025 at 10:46:29 -0300, Jason Gunthorpe <jgg@nvidia.com> wrote:
> On Wed, Mar 19, 2025 at 06:55:44PM -0700, Changyuan Lyu wrote:
> > +/**
> > + * kho_preserve_folio - preserve a folio across KHO.
> > + * @folio: folio to preserve
> > + *
> > + * Records that the entire folio is preserved across KHO. The order
> > + * will be preserved as well.
> > + *
> > + * Return: 0 on success, error code on failure
> > + */
> > +int kho_preserve_folio(struct folio *folio)
> > +{
> > + unsigned long pfn = folio_pfn(folio);
> > + unsigned int order = folio_order(folio);
> > + int err;
> > +
> > + if (!kho_enable)
> > + return -EOPNOTSUPP;
> > +
> > + down_read(&kho_out.tree_lock);
> > + if (kho_out.fdt) {
>
> What is the lock and fdt test for?
It is to avoid a race between the following 2 operations,
- converting the hashtables and mem tracker to FDT,
- adding new data to the hashtables/mem tracker.
Please also see the function kho_finalize() in the previous patch
"kexec: add Kexec HandOver (KHO) generation helpers" [1].
kho_finalize() iterates over all the hashtables and the mem tracker.
We want to make sure that during those iterations, no new data is
added to the hashtables and mem tracker.
Also, once the FDT has been generated, the mem tracker has already been
serialized to linked pages, so we return -EBUSY to prevent more data
from being added to the mem tracker.
> I'm getting the feeling that probably kho_preserve_folio() and the
> like should accept some kind of
> 'struct kho_serialization *' and then we don't need this to prove we
> are within a valid serialization window. It could pass the pointer
> through the notifiers
If we use notifiers, callbacks have to be done serially.
> The global variables in this series are sort of ugly..
>
> We want this to be fast, so try hard to avoid a lock..
In most cases we only need the read lock. Different KHO users can add
data into their own subnodes in parallel.
We only need the write lock if
- 2 KHO users register subnodes under the KHO root node at the same time,
- the KHO root tree is about to be converted to FDT.
> > +void *kho_restore_phys(phys_addr_t phys, size_t size)
> > +{
> > + unsigned long start_pfn, end_pfn, pfn;
> > + void *va = __va(phys);
> > +
> > + start_pfn = PFN_DOWN(phys);
> > + end_pfn = PFN_UP(phys + size);
> > +
> > + for (pfn = start_pfn; pfn < end_pfn; pfn++) {
> > + struct page *page = pfn_to_online_page(pfn);
> > +
> > + if (!page)
> > + return NULL;
> > + kho_restore_page(page);
> > + }
> > +
> > + return va;
> > +}
> > +EXPORT_SYMBOL_GPL(kho_restore_phys);
>
> What do you imagine this is used for? I'm not sure what value there is
> in returning a void *? How does the caller "free" this?
This function is also from Mike :)
I suppose some KHO users may still preserve memory using memory ranges
(instead of folios). In the restore stage they need a helper to set up
the pages of the preserved memory ranges.
A void * is returned so the KHO user can access the memory contents
through the virtual address.
I guess the caller can free the ranges with free_pages()?
It would make sense to return nothing and let the caller call `__va`
if they want; then the function signature would look more symmetric
with `kho_preserve_phys`.
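For context, a minimal sketch of how such a user might look;
kho_preserve_phys()/kho_restore_phys() are from this series, while the
surrounding driver code is hypothetical:

static int my_driver_restore(phys_addr_t state_phys, size_t state_size)
{
	/*
	 * state_phys/state_size are assumed to have been read back from
	 * this driver's KHO node; the range was preserved in the old
	 * kernel with kho_preserve_phys(state_phys, state_size).
	 */
	void *state = kho_restore_phys(state_phys, state_size);

	if (!state)
		return -ENOENT;	/* range was not preserved */

	my_driver_parse_state(state, state_size);	/* hypothetical */
	return 0;
}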
> > +#define KHOSER_PTR(type) \
> > + union { \
> > + phys_addr_t phys; \
> > + type ptr; \
> > + }
> > +#define KHOSER_STORE_PTR(dest, val) \
> > + ({ \
> > + (dest).phys = virt_to_phys(val); \
> > + typecheck(typeof((dest).ptr), val); \
> > + })
> > +#define KHOSER_LOAD_PTR(src) \
> > + ((src).phys ? (typeof((src).ptr))(phys_to_virt((src).phys)) : NULL)
>
> I had imagined these macros would be in a header and usably by drivers
> that also want to use structs to carry information.
>
OK I will move them to the header file in the next version.
> > [...]
> > @@ -829,6 +1305,10 @@ static __init int kho_init(void)
> >
> > kho_out.root.name = "";
>
> ?
The root node name is set to an empty string since fdt_begin_node()
calls strlen() on the node name.
It is equivalent to `err = fdt_begin_node(fdt, "")` in kho_serialize()
of Mike's V4 patch [2].
> > err = kho_add_string_prop(&kho_out.root, "compatible", "kho-v1");
> > + err |= kho_add_prop(&kho_out.preserved_memory, "metadata",
> > + &kho_out.first_chunk_phys, sizeof(phys_addr_t));
>
> metedata doesn't fee like a great a better name..
>
> Please also document all the FDT schema thoroughly!
>
> There should be yaml files just like in the normal DT case defining
> all of this. This level of documentation and stability was one of the
> selling reasons why FDT is being used here!
The YAML files were dropped because we think it may take a while for
our schema to become stable, so we started with some simple plain text.
We can add the prop and node docs that are considered stable at this
point back as YAML in the next version.
[1] https://lore.kernel.org/all/20250320015551.2157511-8-changyuanl@google.com/
[2] https://lore.kernel.org/all/20250206132754.2596694-6-rppt@kernel.org/
Best,
Changyuan
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 11/16] kexec: add config option for KHO
2025-03-20 1:55 ` [PATCH v5 11/16] kexec: add config option for KHO Changyuan Lyu
2025-03-20 7:10 ` Krzysztof Kozlowski
@ 2025-03-24 4:18 ` Dave Young
2025-03-24 19:26 ` Pasha Tatashin
2025-03-25 6:57 ` Baoquan He
1 sibling, 2 replies; 103+ messages in thread
From: Dave Young @ 2025-03-24 4:18 UTC (permalink / raw)
To: Changyuan Lyu
Cc: linux-kernel, graf, akpm, luto, anthony.yznaga, arnd,
ashish.kalra, benh, bp, catalin.marinas, dave.hansen, dwmw2,
ebiederm, mingo, jgowans, corbet, krzk, rppt, mark.rutland,
pbonzini, pasha.tatashin, hpa, peterz, ptyadav, robh+dt, robh,
saravanak, skinsburskii, rostedt, tglx, thomas.lendacky,
usama.arif, will, devicetree, kexec, linux-arm-kernel, linux-doc,
linux-mm, x86
On Thu, 20 Mar 2025 at 23:05, Changyuan Lyu <changyuanl@google.com> wrote:
>
> From: Alexander Graf <graf@amazon.com>
>
> We have all generic code in place now to support Kexec with KHO. This
> patch adds a config option that depends on architecture support to
> enable KHO support.
>
> Signed-off-by: Alexander Graf <graf@amazon.com>
> Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> Co-developed-by: Changyuan Lyu <changyuanl@google.com>
> Signed-off-by: Changyuan Lyu <changyuanl@google.com>
> ---
> kernel/Kconfig.kexec | 15 +++++++++++++++
> 1 file changed, 15 insertions(+)
>
> diff --git a/kernel/Kconfig.kexec b/kernel/Kconfig.kexec
> index 4d111f871951..57db99e758a8 100644
> --- a/kernel/Kconfig.kexec
> +++ b/kernel/Kconfig.kexec
> @@ -95,6 +95,21 @@ config KEXEC_JUMP
> Jump between original kernel and kexeced kernel and invoke
> code in physical address mode via KEXEC
>
> +config KEXEC_HANDOVER
> + bool "kexec handover"
> + depends on ARCH_SUPPORTS_KEXEC_HANDOVER && ARCH_SUPPORTS_KEXEC_FILE
> + select MEMBLOCK_KHO_SCRATCH
> + select KEXEC_FILE
> + select DEBUG_FS
> + select LIBFDT
> + select CMA
> + select XXHASH
> + help
> + Allow kexec to hand over state across kernels by generating and
> + passing additional metadata to the target kernel. This is useful
> + to keep data or state alive across the kexec. For this to work,
> + both source and target kernels need to have this option enabled.
> +
Have you tested kdump? In my mind there are two issues: one is that,
with CMA enabled, it could cause kdump crashkernel memory reservation
failures more often due to fragmented low memory. Secondly, in the
kdump kernel, dumping the crazy scratch memory into the vmcore is not
very meaningful. I suspect this has not been tested under kdump; if
so, please disable this option for kdump.
> config CRASH_DUMP
> bool "kernel crash dumps"
> default ARCH_DEFAULT_CRASH_DUMP
> --
> 2.48.1.711.g2feabab25a-goog
>
>
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 07/16] kexec: add Kexec HandOver (KHO) generation helpers
2025-03-23 19:02 ` Changyuan Lyu
@ 2025-03-24 16:28 ` Jason Gunthorpe
2025-03-25 0:21 ` Changyuan Lyu
0 siblings, 1 reply; 103+ messages in thread
From: Jason Gunthorpe @ 2025-03-24 16:28 UTC (permalink / raw)
To: Changyuan Lyu
Cc: akpm, anthony.yznaga, arnd, ashish.kalra, benh, bp,
catalin.marinas, corbet, dave.hansen, devicetree, dwmw2, ebiederm,
graf, hpa, jgowans, kexec, krzk, linux-arm-kernel, linux-doc,
linux-kernel, linux-mm, luto, mark.rutland, mingo, pasha.tatashin,
pbonzini, peterz, ptyadav, robh+dt, robh, rostedt, rppt,
saravanak, skinsburskii, tglx, thomas.lendacky, will, x86
On Sun, Mar 23, 2025 at 12:02:04PM -0700, Changyuan Lyu wrote:
> > Why are we changing this? I much prefered the idea of having recursive
> > FDTs than this notion copying eveything into tables then out into FDT?
> > Now that we have the preserved pages mechanism there is a pretty
> > direct path to doing recursive FDT.
>
> We are not copying data into the hashtables, instead the hashtables only
> record the address and size of the data to be serialized into FDT.
> The idea is similar to recording preserved folios in xarray
> and then serialize it to linked pages.
I understand that; I mean you are copying the keys/tree/etc. It
doesn't seem like a good idea to me.
> > I feel like this patch is premature, it should come later in the
> > project along with a stronger justification for this approach.
> >
> > IHMO keep things simple for this series, just the very basics.
>
> The main purpose of using hashtables is to enable KHO users to save
> data to KHO at any time, not just at the time of activate/finalize KHO
> through sysfs/debugfs. For example, FDBox can save the data into KHO
> tree once a new fd is saved to KHO. Also, using hashtables allows KHO
> users to add data to KHO concurrently, while with notifiers, KHO users'
> callbacks are executed serially.
This is why I like the recursive FDT scheme. Each serialization
operation can open its own FDT, write to it, and then close it
sequentially within its own operation without any worries about
concurrency.
The top level just aggregates the FDT blobs (which are in preserved
memory).
To me, all this complexity here with the hash table and the copying
makes no sense compared to that. It is all around slower.
> Regarding the suggestion of recursive FDT, I feel like it is already
> doable with this patchset, or even with Mike's V4 patch.
Of course it is doable; here we are really talking about what the
right, recommended way to use this system is. Recursive FDT is a
better methodology than hash tables.
> just allocates a buffer, serialize all its states to the buffer using
> libfdt (or even using other binary formats), save the address of the
> buffer to KHO's tree, and finally register the buffer's underlying
> pages/folios with kho_preserve_folio().
Yes, exactly! I think this is how we should operate this system as a
paradigm, not a giant FDT, hash table and so on...
> I changed the semantics of the notifiers. In Mike's V4, the KHO notifier
> is to pass the fdt pointer to KHO users to push data into the blob. In
> this patchset, it notifies KHO users about the last chance for saving
> data to KHO.
I think Mike's semantics make more sense.. At least I'd want to see an
actual example of something that wants to do a last-minute adjustment
before adding the code.
> However, some KHO users may still want to add data just before kexec,
> so I kept the notifiers and allow KHO users to get notified when the
> state tree hashtables are about to be frozen and converted to FDT.
Let's try, as much as possible, not to add API surface that has no
present user, please. You can shove this into speculative patches that
someone can pick up if they need this semantic.
> To completely remove fdt_max, I am considering the idea in [1]. At the
> time of kexec_file_load(), we pass the address of an anchor page to
> the new kernel, and the anchor page will later be fulfilled with the
> physical addresses of the pages containing the FDT blob. Multiple
> anchor pages can be linked together. The FDT blob pages can be physically
> noncontiguous.
Yes, this is basically what I suggested too. I think this is much
preferred and doesn't require the wacky uapi.
Except I suggested that you really just need a single u64 pointing to
a preserved page holding the top-level FDT.
With recursive FDT I think we can say that no FDT fragment should
exceed PAGE_SIZE, and things become much simpler, IMHO.
Jason
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 09/16] kexec: enable KHO support for memory preservation
2025-03-23 18:55 ` Jason Gunthorpe
@ 2025-03-24 18:18 ` Mike Rapoport
2025-03-24 20:07 ` Jason Gunthorpe
0 siblings, 1 reply; 103+ messages in thread
From: Mike Rapoport @ 2025-03-24 18:18 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Changyuan Lyu, linux-kernel, graf, akpm, luto, anthony.yznaga,
arnd, ashish.kalra, benh, bp, catalin.marinas, dave.hansen, dwmw2,
ebiederm, mingo, jgowans, corbet, krzk, mark.rutland, pbonzini,
pasha.tatashin, hpa, peterz, ptyadav, robh+dt, robh, saravanak,
skinsburskii, rostedt, tglx, thomas.lendacky, usama.arif, will,
devicetree, kexec, linux-arm-kernel, linux-doc, linux-mm, x86
On Sun, Mar 23, 2025 at 03:55:52PM -0300, Jason Gunthorpe wrote:
> On Sat, Mar 22, 2025 at 03:12:26PM -0400, Mike Rapoport wrote:
>
> > > > + page->private = order;
> > >
> > > Can't just set the page order directly? Why use private?
> >
> > Setting the order means recreating the folio the way prep_compound_page()
> > does. I think it's better to postpone it until the folio is requested. This
> > way it might run after SMP is enabled.
>
> I see, that makes sense, but also it could stil use page->order..
But there's no page->order :)
> > Besides, when we start allocating
> > folios separately from struct page, initializing it here would be a real
> > issue.
>
> Yes, but also we wouldn't have page->private to make it work.. Somehow
> anything we want to carry over would have to become encoded in the
> memdesc directly.
This is a problem to solve in 2026 :)
The January update for the State of the Page [1] talks about a
reasonable goal of shrinking struct page to (approximately):
struct page {
unsigned long flags;
union {
struct list_head buddy_list;
struct list_head pcp_list;
struct {
unsigned long memdesc;
int _refcount;
};
};
union {
unsigned long private;
struct {
int _folio_mapcount;
};
};
};
[1] https://lore.kernel.org/linux-mm/Z37pxbkHPbLYnDKn@casper.infradead.org/
> Jason
--
Sincerely yours,
Mike.
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 07/16] kexec: add Kexec HandOver (KHO) generation helpers
2025-03-20 1:55 ` [PATCH v5 07/16] kexec: add Kexec HandOver (KHO) generation helpers Changyuan Lyu
2025-03-21 13:34 ` Jason Gunthorpe
@ 2025-03-24 18:40 ` Frank van der Linden
2025-03-25 19:19 ` Mike Rapoport
1 sibling, 1 reply; 103+ messages in thread
From: Frank van der Linden @ 2025-03-24 18:40 UTC (permalink / raw)
To: Changyuan Lyu
Cc: linux-kernel, graf, akpm, luto, anthony.yznaga, arnd,
ashish.kalra, benh, bp, catalin.marinas, dave.hansen, dwmw2,
ebiederm, mingo, jgowans, corbet, krzk, rppt, mark.rutland,
pbonzini, pasha.tatashin, hpa, peterz, ptyadav, robh+dt, robh,
saravanak, skinsburskii, rostedt, tglx, thomas.lendacky,
usama.arif, will, devicetree, kexec, linux-arm-kernel, linux-doc,
linux-mm, x86
On Wed, Mar 19, 2025 at 6:56 PM Changyuan Lyu <changyuanl@google.com> wrote:
>
> From: Alexander Graf <graf@amazon.com>
>
> Add the core infrastructure to generate Kexec HandOver metadata. Kexec
> HandOver is a mechanism that allows Linux to preserve state - arbitrary
> properties as well as memory locations - across kexec.
>
> It does so using 2 concepts:
>
> 1) State Tree - Every KHO kexec carries a state tree that describes the
> state of the system. The state tree is represented as hash-tables.
> Device drivers can add/remove their data into/from the state tree at
> system runtime. On kexec, the tree is converted to FDT (flattened
> device tree).
>
> 2) Scratch Regions - CMA regions that we allocate in the first kernel.
> CMA gives us the guarantee that no handover pages land in those
> regions, because handover pages must be at a static physical memory
> location. We use these regions as the place to load future kexec
> images so that they won't collide with any handover data.
>
> Signed-off-by: Alexander Graf <graf@amazon.com>
> Co-developed-by: Pratyush Yadav <ptyadav@amazon.de>
> Signed-off-by: Pratyush Yadav <ptyadav@amazon.de>
> Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> Co-developed-by: Changyuan Lyu <changyuanl@google.com>
> Signed-off-by: Changyuan Lyu <changyuanl@google.com>
> ---
> MAINTAINERS | 2 +-
> include/linux/kexec_handover.h | 109 +++++
> kernel/Makefile | 1 +
> kernel/kexec_handover.c | 865 +++++++++++++++++++++++++++++++++
> mm/mm_init.c | 8 +
> 5 files changed, 984 insertions(+), 1 deletion(-)
> create mode 100644 include/linux/kexec_handover.h
> create mode 100644 kernel/kexec_handover.c
[...]
> diff --git a/mm/mm_init.c b/mm/mm_init.c
> index 04441c258b05..757659b7a26b 100644
> --- a/mm/mm_init.c
> +++ b/mm/mm_init.c
> @@ -30,6 +30,7 @@
> #include <linux/crash_dump.h>
> #include <linux/execmem.h>
> #include <linux/vmstat.h>
> +#include <linux/kexec_handover.h>
> #include "internal.h"
> #include "slab.h"
> #include "shuffle.h"
> @@ -2661,6 +2662,13 @@ void __init mm_core_init(void)
> report_meminit();
> kmsan_init_shadow();
> stack_depot_early_init();
> +
> + /*
> + * KHO memory setup must happen while memblock is still active, but
> + * as close as possible to buddy initialization
> + */
> + kho_memory_init();
> +
> mem_init();
> kmem_cache_init();
> /*
Thanks for the work on this.
Obviously it needs to happen while memblock is still active - but why
as close as possible to buddy initialization?
Ordering is always a sticky issue when it comes to doing things during
boot, of course. In this case, I can see scenarios where code that
runs a little earlier may want to use some preserved memory. The
current requirement in the patch set seems to be "after sparse/page
init", but I'm not sure why it needs to be as close as possibly to
buddy init.
- Frank
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 11/16] kexec: add config option for KHO
2025-03-24 4:18 ` Dave Young
@ 2025-03-24 19:26 ` Pasha Tatashin
2025-03-25 1:24 ` Dave Young
2025-03-25 6:57 ` Baoquan He
1 sibling, 1 reply; 103+ messages in thread
From: Pasha Tatashin @ 2025-03-24 19:26 UTC (permalink / raw)
To: Dave Young
Cc: Changyuan Lyu, linux-kernel, graf, akpm, luto, anthony.yznaga,
arnd, ashish.kalra, benh, bp, catalin.marinas, dave.hansen, dwmw2,
ebiederm, mingo, jgowans, corbet, krzk, rppt, mark.rutland,
pbonzini, hpa, peterz, ptyadav, robh+dt, robh, saravanak,
skinsburskii, rostedt, tglx, thomas.lendacky, usama.arif, will,
devicetree, kexec, linux-arm-kernel, linux-doc, linux-mm, x86
On Mon, Mar 24, 2025 at 12:18 AM Dave Young <dyoung@redhat.com> wrote:
>
> On Thu, 20 Mar 2025 at 23:05, Changyuan Lyu <changyuanl@google.com> wrote:
> >
> > From: Alexander Graf <graf@amazon.com>
> >
> > We have all generic code in place now to support Kexec with KHO. This
> > patch adds a config option that depends on architecture support to
> > enable KHO support.
> >
> > Signed-off-by: Alexander Graf <graf@amazon.com>
> > Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> > Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> > Co-developed-by: Changyuan Lyu <changyuanl@google.com>
> > Signed-off-by: Changyuan Lyu <changyuanl@google.com>
> > ---
> > kernel/Kconfig.kexec | 15 +++++++++++++++
> > 1 file changed, 15 insertions(+)
> >
> > diff --git a/kernel/Kconfig.kexec b/kernel/Kconfig.kexec
> > index 4d111f871951..57db99e758a8 100644
> > --- a/kernel/Kconfig.kexec
> > +++ b/kernel/Kconfig.kexec
> > @@ -95,6 +95,21 @@ config KEXEC_JUMP
> > Jump between original kernel and kexeced kernel and invoke
> > code in physical address mode via KEXEC
> >
> > +config KEXEC_HANDOVER
> > + bool "kexec handover"
> > + depends on ARCH_SUPPORTS_KEXEC_HANDOVER && ARCH_SUPPORTS_KEXEC_FILE
> > + select MEMBLOCK_KHO_SCRATCH
> > + select KEXEC_FILE
> > + select DEBUG_FS
> > + select LIBFDT
> > + select CMA
> > + select XXHASH
> > + help
> > + Allow kexec to hand over state across kernels by generating and
> > + passing additional metadata to the target kernel. This is useful
> > + to keep data or state alive across the kexec. For this to work,
> > + both source and target kernels need to have this option enabled.
> > +
>
> Have you tested kdump? In my mind there are two issues, one is with
> CMA enabled, it could cause kdump crashkernel memory reservation
> failures more often due to the fragmented low memory. Secondly, in
As I understand it, the CMA low memory scratch reservation is needed
only to support some legacy PCI devices that cannot address the full
64-bit space. If so, I am not sure KHO needs to be supported on
machines with such devices. However, even if we keep it, it should
really be small, so I would not expect it to be a problem for crash
kernel memory reservation.
> kdump kernel dump the crazy scratch memory in vmcore is not very
> meaningful. Otherwise I suspect this is not tested under kdump. If
> so please disable this option for kdump.
The scratch memory will appear as regular CMA in the vmcore. The crash
kernel can be kexec loaded only from userland, long after the scratch
memory is converted to CMA.
Pasha
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 09/16] kexec: enable KHO support for memory preservation
2025-03-24 18:18 ` Mike Rapoport
@ 2025-03-24 20:07 ` Jason Gunthorpe
2025-03-26 12:07 ` Mike Rapoport
0 siblings, 1 reply; 103+ messages in thread
From: Jason Gunthorpe @ 2025-03-24 20:07 UTC (permalink / raw)
To: Mike Rapoport
Cc: Changyuan Lyu, linux-kernel, graf, akpm, luto, anthony.yznaga,
arnd, ashish.kalra, benh, bp, catalin.marinas, dave.hansen, dwmw2,
ebiederm, mingo, jgowans, corbet, krzk, mark.rutland, pbonzini,
pasha.tatashin, hpa, peterz, ptyadav, robh+dt, robh, saravanak,
skinsburskii, rostedt, tglx, thomas.lendacky, usama.arif, will,
devicetree, kexec, linux-arm-kernel, linux-doc, linux-mm, x86
On Mon, Mar 24, 2025 at 02:18:34PM -0400, Mike Rapoport wrote:
> On Sun, Mar 23, 2025 at 03:55:52PM -0300, Jason Gunthorpe wrote:
> > On Sat, Mar 22, 2025 at 03:12:26PM -0400, Mike Rapoport wrote:
> >
> > > > > + page->private = order;
> > > >
> > > > Can't just set the page order directly? Why use private?
> > >
> > > Setting the order means recreating the folio the way prep_compound_page()
> > > does. I think it's better to postpone it until the folio is requested. This
> > > way it might run after SMP is enabled.
> >
> > I see, that makes sense, but also it could stil use page->order..
>
> But there's no page->order :)
I mean this:
static inline unsigned int folio_order(const struct folio *folio)
{
if (!folio_test_large(folio))
return 0;
return folio->_flags_1 & 0xff;
}
> > Yes, but also we wouldn't have page->private to make it work.. Somehow
> > anything we want to carry over would have to become encoded in the
> > memdesc directly.
>
> This is a problem to solve in 2026 :)
Yes :)
Jason
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 07/16] kexec: add Kexec HandOver (KHO) generation helpers
2025-03-24 16:28 ` Jason Gunthorpe
@ 2025-03-25 0:21 ` Changyuan Lyu
2025-03-25 2:20 ` Jason Gunthorpe
0 siblings, 1 reply; 103+ messages in thread
From: Changyuan Lyu @ 2025-03-25 0:21 UTC (permalink / raw)
To: jgg
Cc: akpm, anthony.yznaga, arnd, ashish.kalra, benh, bp,
catalin.marinas, changyuanl, corbet, dave.hansen, devicetree,
dwmw2, ebiederm, graf, hpa, jgowans, kexec, krzk,
linux-arm-kernel, linux-doc, linux-kernel, linux-mm, luto,
mark.rutland, mingo, pasha.tatashin, pbonzini, peterz, ptyadav,
robh+dt, robh, rostedt, rppt, saravanak, skinsburskii, tglx,
thomas.lendacky, will, x86
Hi Jason,
On Mon, Mar 24, 2025 at 13:28:53 -0300, Jason Gunthorpe <jgg@nvidia.com> wrote:
> [...]
> > > I feel like this patch is premature, it should come later in the
> > > project along with a stronger justification for this approach.
> > >
> > > IHMO keep things simple for this series, just the very basics.
> >
> > The main purpose of using hashtables is to enable KHO users to save
> > data to KHO at any time, not just at the time of activate/finalize KHO
> > through sysfs/debugfs. For example, FDBox can save the data into KHO
> > tree once a new fd is saved to KHO. Also, using hashtables allows KHO
> > users to add data to KHO concurrently, while with notifiers, KHO users'
> > callbacks are executed serially.
>
> This is why I like the recursive FDT scheme. Each serialization
> operation can open its own FDT write to it and the close it
> sequenatially within its operation without any worries about
> concurrency.
>
> The top level just aggregates the FDT blobs (which are in preserved
> memory)
>
> To me all this complexity here with the hash table and the copying
> makes no sense compared to that. It is all around slower.
>
> > Regarding the suggestion of recursive FDT, I feel like it is already
> > doable with this patchset, or even with Mike's V4 patch.
>
> Of course it is doable, here we are really talk about what is the
> right, recommended way to use this system. recurisive FDT is a better
> methodology than hash tables
>
> > just allocates a buffer, serialize all its states to the buffer using
> > libfdt (or even using other binary formats), save the address of the
> > buffer to KHO's tree, and finally register the buffer's underlying
> > pages/folios with kho_preserve_folio().
>
> Yes, exactly! I think this is how we should operate this system as a
> paradig, not a giant FDT, hash table and so on...
>
> [...]
> > To completely remove fdt_max, I am considering the idea in [1]. At the
> > time of kexec_file_load(), we pass the address of an anchor page to
> > the new kernel, and the anchor page will later be fulfilled with the
> > physical addresses of the pages containing the FDT blob. Multiple
> > anchor pages can be linked together. The FDT blob pages can be physically
> > noncontiguous.
>
> Yes, this is basically what I suggested too. I think this is much
> prefered and doesn't require the wakky uapi.
>
> Except I suggested you just really need a single u64 to point to a
> preserved page holding the top level FDT.
>
> With recursive FDT I think we can say that no FDT fragement should
> exceed PAGE_SIZE, and things become much simpler, IMHO.
Thanks for the suggestions! I am a little bit concerned about assuming
every FDT fragment is smaller than PAGE_SIZE. In case a child FDT is
larger than PAGE_SIZE, I would like to turn the single u64 in the parent
FDT into a u64 list to record all the underlying pages of the child FDT.
To be concrete and make sure I understand your suggestions correctly,
I drafted the following design,
Suppose we have 2 KHO users, memblock and gpu@0x2000000000; the KHO
FDT (top-level FDT) would look like the following,
/dts-v1/;
/ {
compatible = "kho-v1";
memblock {
kho,recursive-fdt = <0x00 0x40001000>;
};
gpu@0x2000000000 {
kho,recursive-fdt = <0x00 0x40002000>;
};
};
kho,recursive-fdt in "memblock" points to a page containing another
FDT,
/ {
compatible = "memblock-v1";
n1 {
compatible = "reserve-mem-v1";
size = <0x04 0x00>;
start = <0xc06b 0x4000000>;
};
n2 {
compatible = "reserve-mem-v1";
size = <0x04 0x00>;
start = <0xc067 0x4000000>;
};
};
Similarly, "kho,recursive-fdt" in "gpu@0x2000000000" points to a page
containing another FDT,
/ {
compatible = "gpu-v1"
key1 = "v1";
key2 = "v2";
node1 {
kho,recursive-fdt = <0x00 0x40003000 0x00 0x40005000>;
}
node2 {
key3 = "v3";
key4 = "v4";
}
}
and kho,recursive-fdt in "node1" contains 2 non-contagious pages backing
the following large FDT fragment,
/ {
compatible = "gpu-subnode1-v1";
key5 = "v5";
key6 = "v6";
key7 = "v7";
key8 = "v8";
... // many many keys and small values
}
In this way we assume that most FDT fragments are smaller than 1 page,
so "kho,recursive-fdt" is usually just 1 u64, but we can also handle
larger fragments if that really happens.
I also allow KHO users to add sub nodes in place, instead of forcing
them to create a new FDT fragment for every sub node, if the KHO user
is confident that those subnodes are small enough to fit in the parent
node's page. In this way we do not need to waste a full page for a
small sub node. An example is the "memblock" node above.
Finally, the KHO top-level FDT may also be larger than 1 page; this
can be handled using the anchor-page method discussed in the previous
mails.
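For completeness, a rough sketch (hypothetical helper, not part of
this series) of how the new kernel could follow such a property to
reach a child fragment:

static const void *kho_get_child_fdt(const void *fdt, int node)
{
	const __be64 *prop;
	int len;

	prop = fdt_getprop(fdt, node, "kho,recursive-fdt", &len);
	if (!prop || len < (int)sizeof(u64))
		return NULL;

	/* the first u64 is the physical address of the child fragment */
	return phys_to_virt(be64_to_cpup(prop));
}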
What do you think?
Best,
Changyuan
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 11/16] kexec: add config option for KHO
2025-03-24 19:26 ` Pasha Tatashin
@ 2025-03-25 1:24 ` Dave Young
2025-03-25 3:07 ` Dave Young
0 siblings, 1 reply; 103+ messages in thread
From: Dave Young @ 2025-03-25 1:24 UTC (permalink / raw)
To: Pasha Tatashin
Cc: Changyuan Lyu, linux-kernel, graf, akpm, luto, anthony.yznaga,
arnd, ashish.kalra, benh, bp, catalin.marinas, dave.hansen, dwmw2,
ebiederm, mingo, jgowans, corbet, krzk, rppt, mark.rutland,
pbonzini, hpa, peterz, ptyadav, robh+dt, robh, saravanak,
skinsburskii, rostedt, tglx, thomas.lendacky, usama.arif, will,
devicetree, kexec, linux-arm-kernel, linux-doc, linux-mm, x86
On Tue, 25 Mar 2025 at 03:27, Pasha Tatashin <pasha.tatashin@soleen.com> wrote:
>
> On Mon, Mar 24, 2025 at 12:18 AM Dave Young <dyoung@redhat.com> wrote:
> >
> > On Thu, 20 Mar 2025 at 23:05, Changyuan Lyu <changyuanl@google.com> wrote:
> > >
> > > From: Alexander Graf <graf@amazon.com>
> > >
> > > We have all generic code in place now to support Kexec with KHO. This
> > > patch adds a config option that depends on architecture support to
> > > enable KHO support.
> > >
> > > Signed-off-by: Alexander Graf <graf@amazon.com>
> > > Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> > > Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> > > Co-developed-by: Changyuan Lyu <changyuanl@google.com>
> > > Signed-off-by: Changyuan Lyu <changyuanl@google.com>
> > > ---
> > > kernel/Kconfig.kexec | 15 +++++++++++++++
> > > 1 file changed, 15 insertions(+)
> > >
> > > diff --git a/kernel/Kconfig.kexec b/kernel/Kconfig.kexec
> > > index 4d111f871951..57db99e758a8 100644
> > > --- a/kernel/Kconfig.kexec
> > > +++ b/kernel/Kconfig.kexec
> > > @@ -95,6 +95,21 @@ config KEXEC_JUMP
> > > Jump between original kernel and kexeced kernel and invoke
> > > code in physical address mode via KEXEC
> > >
> > > +config KEXEC_HANDOVER
> > > + bool "kexec handover"
> > > + depends on ARCH_SUPPORTS_KEXEC_HANDOVER && ARCH_SUPPORTS_KEXEC_FILE
> > > + select MEMBLOCK_KHO_SCRATCH
> > > + select KEXEC_FILE
> > > + select DEBUG_FS
> > > + select LIBFDT
> > > + select CMA
> > > + select XXHASH
> > > + help
> > > + Allow kexec to hand over state across kernels by generating and
> > > + passing additional metadata to the target kernel. This is useful
> > > + to keep data or state alive across the kexec. For this to work,
> > > + both source and target kernels need to have this option enabled.
> > > +
> >
> > Have you tested kdump? In my mind there are two issues, one is with
> > CMA enabled, it could cause kdump crashkernel memory reservation
> > failures more often due to the fragmented low memory. Secondly, in
>
> As I understand cma low memory scratch reservation is needed only to
> support some legacy pci devices that cannot use the full 64-bit space.
> If so, I am not sure if KHO needs to be supported on machines with
> such devices. However, even if we keep it, it should really be small,
> so I would not expect that to be a problem for crash kernel memory
> reservation.
It is not easy to estimate how much KHO reserved memory is needed. I
assume this is a mechanism for all kinds of different users, so it is
not predictable. Also, it is not only about the size; it also makes
the memory fragmented.
>
> > kdump kernel dump the crazy scratch memory in vmcore is not very
> > meaningful. Otherwise I suspect this is not tested under kdump. If
> > so please disable this option for kdump.
>
> The scratch memory will appear as regular CMA in the vmcore. The crash
> kernel can be kexec loaded only from userland, long after the scratch
> memory is converted to CMA.
Depending on the reserved size: if it is big enough, it should be
excluded from vmcore dumping.
Also, a kdump kernel should skip handling the old KHO state passed
from the previous kernel.
>
> Pasha
>
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 09/16] kexec: enable KHO support for memory preservation
2025-03-23 19:07 ` Changyuan Lyu
@ 2025-03-25 2:04 ` Jason Gunthorpe
0 siblings, 0 replies; 103+ messages in thread
From: Jason Gunthorpe @ 2025-03-25 2:04 UTC (permalink / raw)
To: Changyuan Lyu
Cc: akpm, anthony.yznaga, arnd, ashish.kalra, benh, bp,
catalin.marinas, corbet, dave.hansen, devicetree, dwmw2, ebiederm,
graf, hpa, jgowans, kexec, krzk, linux-arm-kernel, linux-doc,
linux-kernel, linux-mm, luto, mark.rutland, mingo, pasha.tatashin,
pbonzini, peterz, ptyadav, robh+dt, robh, rostedt, rppt,
saravanak, skinsburskii, tglx, thomas.lendacky, will, x86
On Sun, Mar 23, 2025 at 12:07:58PM -0700, Changyuan Lyu wrote:
> > > + down_read(&kho_out.tree_lock);
> > > + if (kho_out.fdt) {
> >
> > What is the lock and fdt test for?
>
> It is to avoid the competition between the following 2 operations,
> - converting the hashtables and mem traker to FDT,
> - adding new data to hashtable/mem tracker.
I think you should strive to prevent this by code construction at a
higher level.
Do not lock each preserve; lock entire object serialization operations.
For instance, if we do recursive FDT, then you'd lock the call that
builds a single FDT page for a single object.
> In most cases we only need read lock. Different KHO users can adding
> data into their own subnodes in parallel.
Read locks like this are still quite slow in parallel systems; there
is a lot of cacheline bouncing, as taking a read lock still has to
write to the lock memory.
> > What do you imagine this is used for? I'm not sure what value there is
> > in returning a void *? How does the caller "free" this?
>
> This function is also from Mike :)
>
> I suppose some KHO users may still
> preserve memory using memory ranges (instead of folio).
I don't know what that would be, but the folio scheme is all about
preserving memory from the buddy allocator; I don't know what this is
for or how it would be used.
IMHO, split this into its own patch and include it in the series that
would use it.
> I guess the caller can free the ranges by free_pages()?
The folios were not set up right, so no. And if that is the case, then
you'd just get the struct page and convert it to a void * with some
helper function, not implement a whole new function...
> > There should be yaml files just like in the normal DT case defining
> > all of this. This level of documentation and stability was one of the
> > selling reasons why FDT is being used here!
>
> YAML files were dropped because we think it may take a while for our
> schema to be near stable. So we start from some simple plain text. We
> can add some prop and node docs (that are considered stable at this point)
> back to YAML in the next version.
You need to do something to document what is going on here and show
the full schema with some explanation. It is hard to grasp the full
intention just from the C code.
Jason
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 07/16] kexec: add Kexec HandOver (KHO) generation helpers
2025-03-25 0:21 ` Changyuan Lyu
@ 2025-03-25 2:20 ` Jason Gunthorpe
0 siblings, 0 replies; 103+ messages in thread
From: Jason Gunthorpe @ 2025-03-25 2:20 UTC (permalink / raw)
To: Changyuan Lyu
Cc: akpm, anthony.yznaga, arnd, ashish.kalra, benh, bp,
catalin.marinas, corbet, dave.hansen, devicetree, dwmw2, ebiederm,
graf, hpa, jgowans, kexec, krzk, linux-arm-kernel, linux-doc,
linux-kernel, linux-mm, luto, mark.rutland, mingo, pasha.tatashin,
pbonzini, peterz, ptyadav, robh+dt, robh, rostedt, rppt,
saravanak, skinsburskii, tglx, thomas.lendacky, will, x86
On Mon, Mar 24, 2025 at 05:21:45PM -0700, Changyuan Lyu wrote:
> Thanks for the suggestions! I am a little bit concerned about assuming
> every FDT fragment is smaller than PAGE_SIZE. In case a child FDT is
> larger than PAGE_SIZE, I would like to turn the single u64 in the parent
> FDT into a u64 list to record all the underlying pages of the child FDT.
Maybe, but I'd suggest leaving some accommodation for this in the API
and not implementing it until we see proof it is needed. 4k is a lot of
space for an FDT, and if you are doing per-object FDTs I don't see us
exceeding it.
For instance, vfio, memfd, and iommufd object FDTs would not get
close.
> In this way we assume that most FDT fragments are smaller than 1 page so
> "kho,recursive-fdt" is usually just 1 u64, but we can also handle
> larger fragments if that really happens.
Yes, this is close to what I imagine.
You have to decide if the child FDT top pointers will be stored
directly in parent FDTs like you sketched above, or if they should be
stored in some dedicated allocated and preserved data structure, the
way the memory preservation works. There are some trade-offs in each
direction.
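As a concrete illustration of the first option (child FDT page addresses
stored directly in the parent), here is a rough user-space sketch using
libfdt's sequential-write API; the property name follows the
"kho,recursive-fdt" idea discussed above, while the function name, node
name and limits are invented for the example, and error checking of the
intermediate calls is omitted for brevity:

#include <libfdt.h>
#include <stdint.h>

/* Write a parent FDT whose "some-object" node records the physical
 * addresses of the pages backing a child FDT as a list of u64s. */
static int emit_parent_fdt(void *buf, int bufsize,
			   const uint64_t *child_pages, int nr_pages)
{
	fdt64_t be[8];
	int i, err;

	if (nr_pages > 8)
		return -FDT_ERR_NOSPACE;
	for (i = 0; i < nr_pages; i++)
		be[i] = cpu_to_fdt64(child_pages[i]);	/* FDT properties are big-endian */

	err = fdt_create(buf, bufsize);
	if (err)
		return err;
	fdt_finish_reservemap(buf);
	fdt_begin_node(buf, "");
	fdt_begin_node(buf, "some-object");
	/* one u64 per page backing the child FDT */
	fdt_property(buf, "kho,recursive-fdt", be, nr_pages * sizeof(fdt64_t));
	fdt_end_node(buf);
	fdt_end_node(buf);
	return fdt_finish(buf);
}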
> I also allow KHO users to add sub nodes in-place, instead of forcing
> to create a new FDT fragment for every sub node, if the KHO user is
> confident that those subnodes are small enough to fit in the parent
> node's page. In this way we do not need to waste a full page for a small
> sub node. An example is the "memblock" node above.
Well, I think that sort of misses the bigger picture. What we want is
to run serialization of everything in parallel. So merging like you
say will complicate that.
Really, I think we will have on the order of tens of objects to
serialize, so I don't really care if they use partial pages if that
makes the serialization faster. As long as the memory is freed once
the live update is done, the waste doesn't matter.
> Finally, the KHO top level FDT may also be larger than 1 page, this can
> be handled using the anchor-page method discussed in the previous mails.
This is one of the trade-offs I mentioned. If you inline the objects
as FDT nodes then you have to scale the FDT across multiple pages.
If you do a binary structure, like the memory preservation does, then you
have to serialize to something that is inherently scalable and 4k granular.
The 4k FDT limit really only works if you make liberal use of pointers
to binary data. Anything that does not have a predictable size limit would
be in some related binary structure.
So... I'd probably suggest thinking about how to make a multi-page FDT
work in the memory description, but not implementing it now. When we
reach the point where we know we need a multi-page FDT, then someone
would have to implement a growable FDT through vmap or something like
that to make it work.
Keep this initial step simple; we clearly don't need more than a 4k FDT
at this point and we aren't doing a stable kexec ABI either. So simplify,
simplify, simplify to get a very thin minimal functionality merged to
put the fdbox step on top of.
Jason
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 11/16] kexec: add config option for KHO
2025-03-25 1:24 ` Dave Young
@ 2025-03-25 3:07 ` Dave Young
0 siblings, 0 replies; 103+ messages in thread
From: Dave Young @ 2025-03-25 3:07 UTC (permalink / raw)
To: Pasha Tatashin
Cc: Changyuan Lyu, linux-kernel, graf, akpm, luto, anthony.yznaga,
arnd, ashish.kalra, benh, bp, catalin.marinas, dave.hansen, dwmw2,
ebiederm, mingo, jgowans, corbet, krzk, rppt, mark.rutland,
pbonzini, hpa, peterz, ptyadav, robh+dt, robh, saravanak,
skinsburskii, rostedt, tglx, thomas.lendacky, usama.arif, will,
devicetree, kexec, linux-arm-kernel, linux-doc, linux-mm, x86
On Tue, 25 Mar 2025 at 09:24, Dave Young <dyoung@redhat.com> wrote:
>
> On Tue, 25 Mar 2025 at 03:27, Pasha Tatashin <pasha.tatashin@soleen.com> wrote:
> >
> > On Mon, Mar 24, 2025 at 12:18 AM Dave Young <dyoung@redhat.com> wrote:
> > >
> > > On Thu, 20 Mar 2025 at 23:05, Changyuan Lyu <changyuanl@google.com> wrote:
> > > >
> > > > From: Alexander Graf <graf@amazon.com>
> > > >
> > > > We have all generic code in place now to support Kexec with KHO. This
> > > > patch adds a config option that depends on architecture support to
> > > > enable KHO support.
> > > >
> > > > Signed-off-by: Alexander Graf <graf@amazon.com>
> > > > Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> > > > Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> > > > Co-developed-by: Changyuan Lyu <changyuanl@google.com>
> > > > Signed-off-by: Changyuan Lyu <changyuanl@google.com>
> > > > ---
> > > > kernel/Kconfig.kexec | 15 +++++++++++++++
> > > > 1 file changed, 15 insertions(+)
> > > >
> > > > diff --git a/kernel/Kconfig.kexec b/kernel/Kconfig.kexec
> > > > index 4d111f871951..57db99e758a8 100644
> > > > --- a/kernel/Kconfig.kexec
> > > > +++ b/kernel/Kconfig.kexec
> > > > @@ -95,6 +95,21 @@ config KEXEC_JUMP
> > > > Jump between original kernel and kexeced kernel and invoke
> > > > code in physical address mode via KEXEC
> > > >
> > > > +config KEXEC_HANDOVER
> > > > + bool "kexec handover"
> > > > + depends on ARCH_SUPPORTS_KEXEC_HANDOVER && ARCH_SUPPORTS_KEXEC_FILE
> > > > + select MEMBLOCK_KHO_SCRATCH
> > > > + select KEXEC_FILE
> > > > + select DEBUG_FS
> > > > + select LIBFDT
> > > > + select CMA
> > > > + select XXHASH
> > > > + help
> > > > + Allow kexec to hand over state across kernels by generating and
> > > > + passing additional metadata to the target kernel. This is useful
> > > > + to keep data or state alive across the kexec. For this to work,
> > > > + both source and target kernels need to have this option enabled.
> > > > +
> > >
> > > Have you tested kdump? In my mind there are two issues, one is with
> > > CMA enabled, it could cause kdump crashkernel memory reservation
> > > failures more often due to the fragmented low memory. Secondly, in
> >
> > As I understand cma low memory scratch reservation is needed only to
> > support some legacy pci devices that cannot use the full 64-bit space.
> > If so, I am not sure if KHO needs to be supported on machines with
> > such devices. However, even if we keep it, it should really be small,
> > so I would not expect that to be a problem for crash kernel memory
> > reservation.
>
> It is not easy to estimate how much of the KHO reserved memory is
> needed. I assume this as a mechanism for all different users, it is
> not predictable. Also it is not only about the size, but also it
> makes the memory fragmented.
>
> >
> > > kdump kernel dump the crazy scratch memory in vmcore is not very
> > > meaningful. Otherwise I suspect this is not tested under kdump. If
> > > so please disable this option for kdump.
> >
> > The scratch memory will appear as regular CMA in the vmcore. The crash
> > kernel can be kexec loaded only from userland, long after the scratch
> > memory is converted to CMA.
>
> Depending on the reserved size, if big enough it should be excluded in
> vmcore dumping.
> Otherwise if it is a kdump kernel it should skip the handling of the
> KHO passed previous old states.
If you do not want KHO to conflict with kdump, then the above should
be handled and well tested. Then leave it to end users and
distributions to decide if they want both enabled,
considering the risk of crashkernel reservation failure.
>
> >
> > Pasha
> >
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 11/16] kexec: add config option for KHO
2025-03-24 4:18 ` Dave Young
2025-03-24 19:26 ` Pasha Tatashin
@ 2025-03-25 6:57 ` Baoquan He
2025-03-25 8:36 ` Dave Young
2025-03-25 14:04 ` Pasha Tatashin
1 sibling, 2 replies; 103+ messages in thread
From: Baoquan He @ 2025-03-25 6:57 UTC (permalink / raw)
To: Dave Young
Cc: Changyuan Lyu, linux-kernel, graf, akpm, luto, anthony.yznaga,
arnd, ashish.kalra, benh, bp, catalin.marinas, dave.hansen, dwmw2,
ebiederm, mingo, jgowans, corbet, krzk, rppt, mark.rutland,
pbonzini, pasha.tatashin, hpa, peterz, ptyadav, robh+dt, robh,
saravanak, skinsburskii, rostedt, tglx, thomas.lendacky,
usama.arif, will, devicetree, kexec, linux-arm-kernel, linux-doc,
linux-mm, x86
On 03/24/25 at 12:18pm, Dave Young wrote:
> On Thu, 20 Mar 2025 at 23:05, Changyuan Lyu <changyuanl@google.com> wrote:
> >
> > From: Alexander Graf <graf@amazon.com>
> >
> > We have all generic code in place now to support Kexec with KHO. This
> > patch adds a config option that depends on architecture support to
> > enable KHO support.
> >
> > Signed-off-by: Alexander Graf <graf@amazon.com>
> > Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> > Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> > Co-developed-by: Changyuan Lyu <changyuanl@google.com>
> > Signed-off-by: Changyuan Lyu <changyuanl@google.com>
> > ---
> > kernel/Kconfig.kexec | 15 +++++++++++++++
> > 1 file changed, 15 insertions(+)
> >
> > diff --git a/kernel/Kconfig.kexec b/kernel/Kconfig.kexec
> > index 4d111f871951..57db99e758a8 100644
> > --- a/kernel/Kconfig.kexec
> > +++ b/kernel/Kconfig.kexec
> > @@ -95,6 +95,21 @@ config KEXEC_JUMP
> > Jump between original kernel and kexeced kernel and invoke
> > code in physical address mode via KEXEC
> >
> > +config KEXEC_HANDOVER
> > + bool "kexec handover"
> > + depends on ARCH_SUPPORTS_KEXEC_HANDOVER && ARCH_SUPPORTS_KEXEC_FILE
> > + select MEMBLOCK_KHO_SCRATCH
> > + select KEXEC_FILE
> > + select DEBUG_FS
> > + select LIBFDT
> > + select CMA
> > + select XXHASH
> > + help
> > + Allow kexec to hand over state across kernels by generating and
> > + passing additional metadata to the target kernel. This is useful
> > + to keep data or state alive across the kexec. For this to work,
> > + both source and target kernels need to have this option enabled.
> > +
>
> Have you tested kdump? In my mind there are two issues, one is with
> CMA enabled, it could cause kdump crashkernel memory reservation
> failures more often due to the fragmented low memory. Secondly, in
KHO scratch memory is reserved much later than crashkernel, so we may not
need to worry about it.
====================
start_kernel()
......
-->setup_arch(&command_line);
-->arch_reserve_crashkernel();
......
-->mm_core_init();
-->kho_memory_init();
> kdump kernel dump the crazy scratch memory in vmcore is not very
> meaningful. Otherwise I suspect this is not tested under kdump. If
> so please disable this option for kdump.
Yeah, it's not meaningful to dump out scratch memory into vmcore. We
may need to dig it out from the elfcorehdr. While it's an optimization,
KHO scratch is not big relative to the entire system memory, so it can be
done at a later stage. My personal opinion.
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 11/16] kexec: add config option for KHO
2025-03-25 6:57 ` Baoquan He
@ 2025-03-25 8:36 ` Dave Young
2025-03-26 9:17 ` Dave Young
2025-03-25 14:04 ` Pasha Tatashin
1 sibling, 1 reply; 103+ messages in thread
From: Dave Young @ 2025-03-25 8:36 UTC (permalink / raw)
To: Baoquan He
Cc: Changyuan Lyu, linux-kernel, graf, akpm, luto, anthony.yznaga,
arnd, ashish.kalra, benh, bp, catalin.marinas, dave.hansen, dwmw2,
ebiederm, mingo, jgowans, corbet, krzk, rppt, mark.rutland,
pbonzini, pasha.tatashin, hpa, peterz, ptyadav, robh+dt, robh,
saravanak, skinsburskii, rostedt, tglx, thomas.lendacky,
usama.arif, will, devicetree, kexec, linux-arm-kernel, linux-doc,
linux-mm, x86
> >
> > Have you tested kdump? In my mind there are two issues, one is with
> > CMA enabled, it could cause kdump crashkernel memory reservation
> > failures more often due to the fragmented low memory. Secondly, in
>
> kho scracth memorys are reserved much later than crashkernel, we may not
> need to worry about it.
> ====================
> start_kernel()
> ......
> -->setup_arch(&command_line);
> -->arch_reserve_crashkernel();
> ......
> -->mm_core_init();
> -->kho_memory_init();
>
> > kdump kernel dump the crazy scratch memory in vmcore is not very
> > meaningful. Otherwise I suspect this is not tested under kdump. If
> > so please disable this option for kdump.
OK, it is fine if this is the case; thanks, Baoquan, for clearing up this worry.
But the other concerns still need to be addressed, e.g. KHO use cases
are not good for kdump.
There could be more to think about,
e.g. the issues discussed in this thread:
https://lore.kernel.org/lkml/Z7dc9Cd8KX3b_brB@dwarf.suse.cz/T/
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 11/16] kexec: add config option for KHO
2025-03-25 6:57 ` Baoquan He
2025-03-25 8:36 ` Dave Young
@ 2025-03-25 14:04 ` Pasha Tatashin
1 sibling, 0 replies; 103+ messages in thread
From: Pasha Tatashin @ 2025-03-25 14:04 UTC (permalink / raw)
To: Baoquan He
Cc: Dave Young, Changyuan Lyu, linux-kernel, graf, akpm, luto,
anthony.yznaga, arnd, ashish.kalra, benh, bp, catalin.marinas,
dave.hansen, dwmw2, ebiederm, mingo, jgowans, corbet, krzk, rppt,
mark.rutland, pbonzini, hpa, peterz, ptyadav, robh+dt, robh,
saravanak, skinsburskii, rostedt, tglx, thomas.lendacky,
usama.arif, will, devicetree, kexec, linux-arm-kernel, linux-doc,
linux-mm, x86
On Tue, Mar 25, 2025 at 2:58 AM Baoquan He <bhe@redhat.com> wrote:
>
> On 03/24/25 at 12:18pm, Dave Young wrote:
> > On Thu, 20 Mar 2025 at 23:05, Changyuan Lyu <changyuanl@google.com> wrote:
> > >
> > > From: Alexander Graf <graf@amazon.com>
> > >
> > > We have all generic code in place now to support Kexec with KHO. This
> > > patch adds a config option that depends on architecture support to
> > > enable KHO support.
> > >
> > > Signed-off-by: Alexander Graf <graf@amazon.com>
> > > Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> > > Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> > > Co-developed-by: Changyuan Lyu <changyuanl@google.com>
> > > Signed-off-by: Changyuan Lyu <changyuanl@google.com>
> > > ---
> > > kernel/Kconfig.kexec | 15 +++++++++++++++
> > > 1 file changed, 15 insertions(+)
> > >
> > > diff --git a/kernel/Kconfig.kexec b/kernel/Kconfig.kexec
> > > index 4d111f871951..57db99e758a8 100644
> > > --- a/kernel/Kconfig.kexec
> > > +++ b/kernel/Kconfig.kexec
> > > @@ -95,6 +95,21 @@ config KEXEC_JUMP
> > > Jump between original kernel and kexeced kernel and invoke
> > > code in physical address mode via KEXEC
> > >
> > > +config KEXEC_HANDOVER
> > > + bool "kexec handover"
> > > + depends on ARCH_SUPPORTS_KEXEC_HANDOVER && ARCH_SUPPORTS_KEXEC_FILE
> > > + select MEMBLOCK_KHO_SCRATCH
> > > + select KEXEC_FILE
> > > + select DEBUG_FS
> > > + select LIBFDT
> > > + select CMA
> > > + select XXHASH
> > > + help
> > > + Allow kexec to hand over state across kernels by generating and
> > > + passing additional metadata to the target kernel. This is useful
> > > + to keep data or state alive across the kexec. For this to work,
> > > + both source and target kernels need to have this option enabled.
> > > +
> >
> > Have you tested kdump? In my mind there are two issues, one is with
> > CMA enabled, it could cause kdump crashkernel memory reservation
> > failures more often due to the fragmented low memory. Secondly, in
>
> kho scracth memorys are reserved much later than crashkernel, we may not
> need to worry about it.
> ====================
> start_kernel()
> ......
> -->setup_arch(&command_line);
> -->arch_reserve_crashkernel();
> ......
> -->mm_core_init();
> -->kho_memory_init();
>
> > kdump kernel dump the crazy scratch memory in vmcore is not very
> > meaningful. Otherwise I suspect this is not tested under kdump. If
> > so please disable this option for kdump.
>
> Yeah, it's not meaningful to dump out scratch memorys into vmcore. We
> may need to dig them out from eflcorehdr. While it's an optimization,
> kho scratch is not big relative to the entire system memory. It can be
> done in later stage. My personal opinion.
But, we don't; we only dump out the regular CMA memory that absolutely
should be part of vmcore. When scratch is used during boot, it is used
for regular early boot kernel allocations, such as to allocate memmap,
which is an essential part of the crash dump.
Pasha
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 00/16] kexec: introduce Kexec HandOver (KHO)
2025-03-20 1:55 [PATCH v5 00/16] kexec: introduce Kexec HandOver (KHO) Changyuan Lyu
` (15 preceding siblings ...)
2025-03-20 1:55 ` [PATCH v5 16/16] Documentation: add documentation for KHO Changyuan Lyu
@ 2025-03-25 14:19 ` Pasha Tatashin
2025-03-25 15:03 ` Mike Rapoport
16 siblings, 1 reply; 103+ messages in thread
From: Pasha Tatashin @ 2025-03-25 14:19 UTC (permalink / raw)
To: Changyuan Lyu
Cc: linux-kernel, graf, akpm, luto, anthony.yznaga, arnd,
ashish.kalra, benh, bp, catalin.marinas, dave.hansen, dwmw2,
ebiederm, mingo, jgowans, corbet, krzk, rppt, mark.rutland,
pbonzini, hpa, peterz, ptyadav, robh+dt, robh, saravanak,
skinsburskii, rostedt, tglx, thomas.lendacky, usama.arif, will,
devicetree, kexec, linux-arm-kernel, linux-doc, linux-mm, x86
> To use the code, please boot the kernel with the "kho=on" command line
> parameter.
> KHO will automatically create scratch regions. If you want to set the
> scratch size explicitly you can use "kho_scratch=" command line parameter.
> For instance, "kho_scratch=16M,512M,256M" will reserve a 16 MiB low
> memory scratch area, a 512 MiB global scratch region, and 256 MiB
> per NUMA node scratch regions on boot.
kho_scratch= is confusing. It should be renamed to what this memory
actually represents, which is memory that cannot be preserved by KHO.
I suggest renaming all references to "scratch" and this parameter to:
kho_nopersistent= or kho_nopreserve=
This way, we can also add checks that early allocations done by the
kernel in this memory do not get preserved. We can also add checks to
ensure that scarce low DMA memory does not get preserved across
reboots, and we avoid adding fragmentation to that region.
Pasha
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 00/16] kexec: introduce Kexec HandOver (KHO)
2025-03-25 14:19 ` [PATCH v5 00/16] kexec: introduce Kexec HandOver (KHO) Pasha Tatashin
@ 2025-03-25 15:03 ` Mike Rapoport
0 siblings, 0 replies; 103+ messages in thread
From: Mike Rapoport @ 2025-03-25 15:03 UTC (permalink / raw)
To: Pasha Tatashin, Changyuan Lyu
Cc: linux-kernel, graf, akpm, luto, anthony.yznaga, arnd,
ashish.kalra, benh, bp, catalin.marinas, dave.hansen, dwmw2,
ebiederm, mingo, jgowans, corbet, krzk, mark.rutland, pbonzini,
hpa, peterz, ptyadav, robh+dt, robh, saravanak, skinsburskii,
rostedt, tglx, thomas.lendacky, usama.arif, will, devicetree,
kexec, linux-arm-kernel, linux-doc, linux-mm, x86
On March 25, 2025 10:19:53 AM EDT, Pasha Tatashin <pasha.tatashin@soleen.com> wrote:
>> To use the code, please boot the kernel with the "kho=on" command line
>> parameter.
>> KHO will automatically create scratch regions. If you want to set the
>> scratch size explicitly you can use "kho_scratch=" command line parameter.
>> For instance, "kho_scratch=16M,512M,256M" will reserve a 16 MiB low
>> memory scratch area, a 512 MiB global scratch region, and 256 MiB
>> per NUMA node scratch regions on boot.
>
>kho_scratch= is confusing. It should be renamed to what this memory
>actually represents, which is memory that cannot be preserved by KHO.
>
>I suggest renaming all references to "scratch" and this parameter to:
>
>kho_nopersistent= or kho_nopreserve=
I'm leaning towards kho_bootstrap
>This way, we can also add checks that early allocations done by the
>kernel in this memory do not get preserved. We can also add checks to
>ensure that scarce low DMA memory does not get preserved across
>reboots, and we avoid adding fragmentation to that region.
>
>Pasha
>
--
Sincerely yours,
Mike
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 07/16] kexec: add Kexec HandOver (KHO) generation helpers
2025-03-24 18:40 ` Frank van der Linden
@ 2025-03-25 19:19 ` Mike Rapoport
2025-03-25 21:56 ` Frank van der Linden
0 siblings, 1 reply; 103+ messages in thread
From: Mike Rapoport @ 2025-03-25 19:19 UTC (permalink / raw)
To: Frank van der Linden
Cc: Changyuan Lyu, linux-kernel, graf, akpm, luto, anthony.yznaga,
arnd, ashish.kalra, benh, bp, catalin.marinas, dave.hansen, dwmw2,
ebiederm, mingo, jgowans, corbet, krzk, mark.rutland, pbonzini,
pasha.tatashin, hpa, peterz, ptyadav, robh+dt, robh, saravanak,
skinsburskii, rostedt, tglx, thomas.lendacky, usama.arif, will,
devicetree, kexec, linux-arm-kernel, linux-doc, linux-mm, x86
On Mon, Mar 24, 2025 at 11:40:43AM -0700, Frank van der Linden wrote:
> On Wed, Mar 19, 2025 at 6:56 PM Changyuan Lyu <changyuanl@google.com> wrote:
> >
> > From: Alexander Graf <graf@amazon.com>
> >
> > Add the core infrastructure to generate Kexec HandOver metadata. Kexec
> > HandOver is a mechanism that allows Linux to preserve state - arbitrary
> > properties as well as memory locations - across kexec.
> >
> > It does so using 2 concepts:
> >
> > 1) State Tree - Every KHO kexec carries a state tree that describes the
> > state of the system. The state tree is represented as hash-tables.
> > Device drivers can add/remove their data into/from the state tree at
> > system runtime. On kexec, the tree is converted to FDT (flattened
> > device tree).
> >
> > 2) Scratch Regions - CMA regions that we allocate in the first kernel.
> > CMA gives us the guarantee that no handover pages land in those
> > regions, because handover pages must be at a static physical memory
> > location. We use these regions as the place to load future kexec
> > images so that they won't collide with any handover data.
> >
> > Signed-off-by: Alexander Graf <graf@amazon.com>
> > Co-developed-by: Pratyush Yadav <ptyadav@amazon.de>
> > Signed-off-by: Pratyush Yadav <ptyadav@amazon.de>
> > Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> > Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> > Co-developed-by: Changyuan Lyu <changyuanl@google.com>
> > Signed-off-by: Changyuan Lyu <changyuanl@google.com>
> > ---
> > MAINTAINERS | 2 +-
> > include/linux/kexec_handover.h | 109 +++++
> > kernel/Makefile | 1 +
> > kernel/kexec_handover.c | 865 +++++++++++++++++++++++++++++++++
> > mm/mm_init.c | 8 +
> > 5 files changed, 984 insertions(+), 1 deletion(-)
> > create mode 100644 include/linux/kexec_handover.h
> > create mode 100644 kernel/kexec_handover.c
> [...]
> > diff --git a/mm/mm_init.c b/mm/mm_init.c
> > index 04441c258b05..757659b7a26b 100644
> > --- a/mm/mm_init.c
> > +++ b/mm/mm_init.c
> > @@ -30,6 +30,7 @@
> > #include <linux/crash_dump.h>
> > #include <linux/execmem.h>
> > #include <linux/vmstat.h>
> > +#include <linux/kexec_handover.h>
> > #include "internal.h"
> > #include "slab.h"
> > #include "shuffle.h"
> > @@ -2661,6 +2662,13 @@ void __init mm_core_init(void)
> > report_meminit();
> > kmsan_init_shadow();
> > stack_depot_early_init();
> > +
> > + /*
> > + * KHO memory setup must happen while memblock is still active, but
> > + * as close as possible to buddy initialization
> > + */
> > + kho_memory_init();
> > +
> > mem_init();
> > kmem_cache_init();
> > /*
>
>
> Thanks for the work on this.
>
> Obviously it needs to happen while memblock is still active - but why
> as close as possible to buddy initialization?
One reason is to have all memblock allocations done by then so the
scratch area can be autoscaled. Another reason is to keep memblock
structures small for as long as possible, as memblock_reserve()ing the
preserved memory would inflate them quite a bit.
And it's overall simpler if memblock only allocates from scratch rather
than doing some of the early allocations from scratch and some elsewhere
while still making sure they avoid the preserved ranges.
> Ordering is always a sticky issue when it comes to doing things during
> boot, of course. In this case, I can see scenarios where code that
> runs a little earlier may want to use some preserved memory. The
Can you elaborate about such scenarios?
> current requirement in the patch set seems to be "after sparse/page
> init", but I'm not sure why it needs to be as close as possibly to
> buddy init.
Why would you say that sparse/page init would be a requirement here?
> - Frank
--
Sincerely yours,
Mike.
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 07/16] kexec: add Kexec HandOver (KHO) generation helpers
2025-03-25 19:19 ` Mike Rapoport
@ 2025-03-25 21:56 ` Frank van der Linden
2025-03-26 11:59 ` Mike Rapoport
0 siblings, 1 reply; 103+ messages in thread
From: Frank van der Linden @ 2025-03-25 21:56 UTC (permalink / raw)
To: Mike Rapoport
Cc: Changyuan Lyu, linux-kernel, graf, akpm, luto, anthony.yznaga,
arnd, ashish.kalra, benh, bp, catalin.marinas, dave.hansen, dwmw2,
ebiederm, mingo, jgowans, corbet, krzk, mark.rutland, pbonzini,
pasha.tatashin, hpa, peterz, ptyadav, robh+dt, robh, saravanak,
skinsburskii, rostedt, tglx, thomas.lendacky, usama.arif, will,
devicetree, kexec, linux-arm-kernel, linux-doc, linux-mm, x86
On Tue, Mar 25, 2025 at 12:19 PM Mike Rapoport <rppt@kernel.org> wrote:
>
> On Mon, Mar 24, 2025 at 11:40:43AM -0700, Frank van der Linden wrote:
[...]
> > Thanks for the work on this.
> >
> > Obviously it needs to happen while memblock is still active - but why
> > as close as possible to buddy initialization?
>
> One reason is to have all memblock allocations done to autoscale the
> scratch area. Another reason is to keep memblock structures small as long
> as possible as memblock_reserve()ing the preserved memory would quite
> inflate them.
>
> And it's overall simpler if memblock only allocates from scratch rather
> than doing some of early allocations from scratch and some elsewhere and
> still making sure they avoid the preserved ranges.
Ah, thanks, I see the argument for the scratch area sizing.
>
> > Ordering is always a sticky issue when it comes to doing things during
> > boot, of course. In this case, I can see scenarios where code that
> > runs a little earlier may want to use some preserved memory. The
>
> Can you elaborate about such scenarios?
There has, for example, been some talk about making hugetlbfs
persistent. You could have hugetlb_cma active. The hugetlb CMA areas
are set up quite early, quite some time before KHO restores memory. So
that would have to be changed somehow if the location of the KHO init
call remains as close to buddy init as possible. I
suspect there may be other uses.
Although I suppose you could just look up the addresses and then
reserve them yourself, you would just need the KHO FDT to be
initialized. And you'd need to avoid the KHO bitmap deserialize trying
to redo the ranges you've already done.
>
> > current requirement in the patch set seems to be "after sparse/page
> > init", but I'm not sure why it needs to be as close as possibly to
> > buddy init.
>
> Why would you say that sparse/page init would be a requirement here?
At least in its current form, the KHO code expects vmemmap to be
initialized, as it does its restore based on page structures, which
deserialize_bitmap expects. The use of the page->private
field was discussed in a separate thread, I think. If that is done
differently, it wouldn't rely on vmemmap being initialized.
A few more things I've noticed (not sure if these were discussed before):
* Should KHO depend on CONFIG_DEFERRED_STRUCT_PAGE_INIT? Essentially,
marking memblock ranges as NOINIT doesn't work without
DEFERRED_STRUCT_PAGE_INIT. Although, if the page->private use
disappears, this wouldn't be an issue anymore.
* As a future extension, it could be nice to store vmemmap init
information in the KHO FDT. Then you can use that to init ranges in an
optimized way (HVO hugetlb or DAX-style persisted ranges) straight
away.
- Frank
> > - Frank
>
> --
> Sincerely yours,
> Mike.
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 11/16] kexec: add config option for KHO
2025-03-25 8:36 ` Dave Young
@ 2025-03-26 9:17 ` Dave Young
2025-03-26 11:28 ` Mike Rapoport
0 siblings, 1 reply; 103+ messages in thread
From: Dave Young @ 2025-03-26 9:17 UTC (permalink / raw)
To: Baoquan He
Cc: Changyuan Lyu, linux-kernel, graf, akpm, luto, anthony.yznaga,
arnd, ashish.kalra, benh, bp, catalin.marinas, dave.hansen, dwmw2,
ebiederm, mingo, jgowans, corbet, krzk, rppt, mark.rutland,
pbonzini, pasha.tatashin, hpa, peterz, ptyadav, robh+dt, robh,
saravanak, skinsburskii, rostedt, tglx, thomas.lendacky,
usama.arif, will, devicetree, kexec, linux-arm-kernel, linux-doc,
linux-mm, x86
On Tue, 25 Mar 2025 at 16:36, Dave Young <dyoung@redhat.com> wrote:
>
> > >
> > > Have you tested kdump? In my mind there are two issues, one is with
> > > CMA enabled, it could cause kdump crashkernel memory reservation
> > > failures more often due to the fragmented low memory. Secondly, in
> >
> > kho scracth memorys are reserved much later than crashkernel, we may not
> > need to worry about it.
> > ====================
> > start_kernel()
> > ......
> > -->setup_arch(&command_line);
> > -->arch_reserve_crashkernel();
> > ......
> > -->mm_core_init();
> > -->kho_memory_init();
> >
> > > kdump kernel dump the crazy scratch memory in vmcore is not very
> > > meaningful. Otherwise I suspect this is not tested under kdump. If
> > > so please disable this option for kdump.
>
> Ok, it is fine if this is the case, thanks Baoquan for clearing this worry.
>
> But the other concerns are still need to address, eg. KHO use cases
> are not good for kdump.
> There could be more to think about.
> eg. the issues talked in thread:
> https://lore.kernel.org/lkml/Z7dc9Cd8KX3b_brB@dwarf.suse.cz/T/
Rethinking about this, beyond the previous concerns: transferring the
old kernel state to the kdump kernel makes no sense, since the old state
is not stable once the kernel has crashed.
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 11/16] kexec: add config option for KHO
2025-03-26 9:17 ` Dave Young
@ 2025-03-26 11:28 ` Mike Rapoport
2025-03-26 12:09 ` Dave Young
0 siblings, 1 reply; 103+ messages in thread
From: Mike Rapoport @ 2025-03-26 11:28 UTC (permalink / raw)
To: Dave Young
Cc: Baoquan He, Changyuan Lyu, linux-kernel, graf, akpm, luto,
anthony.yznaga, arnd, ashish.kalra, benh, bp, catalin.marinas,
dave.hansen, dwmw2, ebiederm, mingo, jgowans, corbet, krzk,
mark.rutland, pbonzini, pasha.tatashin, hpa, peterz, ptyadav,
robh+dt, robh, saravanak, skinsburskii, rostedt, tglx,
thomas.lendacky, usama.arif, will, devicetree, kexec,
linux-arm-kernel, linux-doc, linux-mm, x86
Hi Dave,
On Wed, Mar 26, 2025 at 05:17:16PM +0800, Dave Young wrote:
> On Tue, 25 Mar 2025 at 16:36, Dave Young <dyoung@redhat.com> wrote:
> >
> > > >
> > > > Have you tested kdump? In my mind there are two issues, one is with
> > > > CMA enabled, it could cause kdump crashkernel memory reservation
> > > > failures more often due to the fragmented low memory. Secondly, in
> > >
> > > kho scracth memorys are reserved much later than crashkernel, we may not
> > > need to worry about it.
> > > ====================
> > > start_kernel()
> > > ......
> > > -->setup_arch(&command_line);
> > > -->arch_reserve_crashkernel();
> > > ......
> > > -->mm_core_init();
> > > -->kho_memory_init();
> > >
> > > > kdump kernel dump the crazy scratch memory in vmcore is not very
> > > > meaningful. Otherwise I suspect this is not tested under kdump. If
> > > > so please disable this option for kdump.
> >
> > Ok, it is fine if this is the case, thanks Baoquan for clearing this worry.
> >
> > But the other concerns are still need to address, eg. KHO use cases
> > are not good for kdump.
> > There could be more to think about.
> > eg. the issues talked in thread:
> > https://lore.kernel.org/lkml/Z7dc9Cd8KX3b_brB@dwarf.suse.cz/T/
>
> Rethink about this, other than previous concerns. Transferring the
> old kernel state to kdump kernel makes no sense since the old state is
> not stable as the kernel has crashed.
KHO won't be active in the kdump case. The KHO segments are only added to
kexec_image and never to kexec_crash_image.
--
Sincerely yours,
Mike.
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 07/16] kexec: add Kexec HandOver (KHO) generation helpers
2025-03-25 21:56 ` Frank van der Linden
@ 2025-03-26 11:59 ` Mike Rapoport
2025-03-26 16:25 ` Frank van der Linden
0 siblings, 1 reply; 103+ messages in thread
From: Mike Rapoport @ 2025-03-26 11:59 UTC (permalink / raw)
To: Frank van der Linden
Cc: Changyuan Lyu, linux-kernel, graf, akpm, luto, anthony.yznaga,
arnd, ashish.kalra, benh, bp, catalin.marinas, dave.hansen, dwmw2,
ebiederm, mingo, jgowans, corbet, krzk, mark.rutland, pbonzini,
pasha.tatashin, hpa, peterz, ptyadav, robh+dt, robh, saravanak,
skinsburskii, rostedt, tglx, thomas.lendacky, usama.arif, will,
devicetree, kexec, linux-arm-kernel, linux-doc, linux-mm, x86
On Tue, Mar 25, 2025 at 02:56:52PM -0700, Frank van der Linden wrote:
> On Tue, Mar 25, 2025 at 12:19 PM Mike Rapoport <rppt@kernel.org> wrote:
> >
> > On Mon, Mar 24, 2025 at 11:40:43AM -0700, Frank van der Linden wrote:
> [...]
> > > Thanks for the work on this.
> > >
> > > Obviously it needs to happen while memblock is still active - but why
> > > as close as possible to buddy initialization?
> >
> > One reason is to have all memblock allocations done to autoscale the
> > scratch area. Another reason is to keep memblock structures small as long
> > as possible as memblock_reserve()ing the preserved memory would quite
> > inflate them.
> >
> > And it's overall simpler if memblock only allocates from scratch rather
> > than doing some of early allocations from scratch and some elsewhere and
> > still making sure they avoid the preserved ranges.
>
> Ah, thanks, I see the argument for the scratch area sizing.
>
> >
> > > Ordering is always a sticky issue when it comes to doing things during
> > > boot, of course. In this case, I can see scenarios where code that
> > > runs a little earlier may want to use some preserved memory. The
> >
> > Can you elaborate about such scenarios?
>
> There has, for example, been some talk about making hugetlbfs
> persistent. You could have hugetlb_cma active. The hugetlb CMA areas
> are set up quite early, quite some time before KHO restores memory. So
> that would have to be changed somehow if the location of the KHO init
> call would remain as close as possible to buddy init as possible. I
> suspect there may be other uses.
I think we can address this when/if we implement preservation for hugetlbfs,
and it will be tricky.
If hugetlb in the first kernel uses a lot of memory, we just won't have
enough scratch space for early hugetlb reservations in the second kernel
regardless of hugetlb_cma. On the other hand, we already have the preserved
hugetlbfs memory, so we'd probably need to reserve less memory in the
second kernel.
But anyway, it's completely different discussion about how to preserve
hugetlbfs.
> > > current requirement in the patch set seems to be "after sparse/page
> > > init", but I'm not sure why it needs to be as close as possibly to
> > > buddy init.
> >
> > Why would you say that sparse/page init would be a requirement here?
>
> At least in its current form, the KHO code expects vmemmap to be
> initialized, as it does its restore base on page structures, as
> deserialize_bitmap expects them. I think the use of the page->private
> field was discussed in a separate thread, I think. If that is done
> differently, it wouldn't rely on vmemmap being initialized.
In its current form KHO does rely on vmemmap being allocated, but it does
not rely on it being initialized. Marking memblock ranges NOINIT ensures
nothing touches the corresponding struct pages, and KHO can use their fields
up to the point the memory is returned to KHO callers.
> A few more things I've noticed (not sure if these were discussed before):
>
> * Should KHO depend on CONFIG_DEFERRED_STRUCT_PAGE_INIT? Essentially,
> marking memblock ranges as NOINIT doesn't work without
> DEFERRED_STRUCT_PAGE_INIT. Although, if the page->private use
> disappears, this wouldn't be an issue anymore.
It does.
memmap_init_reserved_pages() is always called, regardless of whether
CONFIG_DEFERRED_STRUCT_PAGE_INIT is set, and it skips initialization
of NOINIT regions.
> * As a future extension, it could be nice to store vmemmap init
> information in the KHO FDT. Then you can use that to init ranges in an
> optimized way (HVO hugetlb or DAX-style persisted ranges) straight
> away.
These days the memmap contents are unstable because of the folio/memdesc
project, but in general carrying memory map data from kernel to kernel is
indeed something to consider.
> - Frank
--
Sincerely yours,
Mike.
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 09/16] kexec: enable KHO support for memory preservation
2025-03-24 20:07 ` Jason Gunthorpe
@ 2025-03-26 12:07 ` Mike Rapoport
0 siblings, 0 replies; 103+ messages in thread
From: Mike Rapoport @ 2025-03-26 12:07 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Changyuan Lyu, linux-kernel, graf, akpm, luto, anthony.yznaga,
arnd, ashish.kalra, benh, bp, catalin.marinas, dave.hansen, dwmw2,
ebiederm, mingo, jgowans, corbet, krzk, mark.rutland, pbonzini,
pasha.tatashin, hpa, peterz, ptyadav, robh+dt, robh, saravanak,
skinsburskii, rostedt, tglx, thomas.lendacky, usama.arif, will,
devicetree, kexec, linux-arm-kernel, linux-doc, linux-mm, x86
On Mon, Mar 24, 2025 at 05:07:36PM -0300, Jason Gunthorpe wrote:
> On Mon, Mar 24, 2025 at 02:18:34PM -0400, Mike Rapoport wrote:
> > On Sun, Mar 23, 2025 at 03:55:52PM -0300, Jason Gunthorpe wrote:
> > > On Sat, Mar 22, 2025 at 03:12:26PM -0400, Mike Rapoport wrote:
> > >
> > > > > > + page->private = order;
> > > > >
> > > > > Can't just set the page order directly? Why use private?
> > > >
> > > > Setting the order means recreating the folio the way prep_compound_page()
> > > > does. I think it's better to postpone it until the folio is requested. This
> > > > way it might run after SMP is enabled.
> > >
> > > I see, that makes sense, but also it could stil use page->order..
> >
> > But there's no page->order :)
>
> I mean this:
>
> static inline unsigned int folio_order(const struct folio *folio)
> {
> if (!folio_test_large(folio))
> return 0;
> return folio->_flags_1 & 0xff;
> }
I don't think it's better than page->private; KHO will need to call
prep_compound_page() anyway, so these will be overwritten there.
And I don't remember exactly, but having those set before prep_compound_page()
might trigger VM_BUG_ON_PGFLAGS().
> Jason
--
Sincerely yours,
Mike.
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 11/16] kexec: add config option for KHO
2025-03-26 11:28 ` Mike Rapoport
@ 2025-03-26 12:09 ` Dave Young
0 siblings, 0 replies; 103+ messages in thread
From: Dave Young @ 2025-03-26 12:09 UTC (permalink / raw)
To: Mike Rapoport
Cc: Baoquan He, Changyuan Lyu, linux-kernel, graf, akpm, luto,
anthony.yznaga, arnd, ashish.kalra, benh, bp, catalin.marinas,
dave.hansen, dwmw2, ebiederm, mingo, jgowans, corbet, krzk,
mark.rutland, pbonzini, pasha.tatashin, hpa, peterz, ptyadav,
robh+dt, robh, saravanak, skinsburskii, rostedt, tglx,
thomas.lendacky, usama.arif, will, devicetree, kexec,
linux-arm-kernel, linux-doc, linux-mm, x86
On Wed, 26 Mar 2025 at 19:34, Mike Rapoport <rppt@kernel.org> wrote:
>
> Hi Dave,
>
> On Wed, Mar 26, 2025 at 05:17:16PM +0800, Dave Young wrote:
> > On Tue, 25 Mar 2025 at 16:36, Dave Young <dyoung@redhat.com> wrote:
> > >
> > > > >
> > > > > Have you tested kdump? In my mind there are two issues, one is with
> > > > > CMA enabled, it could cause kdump crashkernel memory reservation
> > > > > failures more often due to the fragmented low memory. Secondly, in
> > > >
> > > > kho scracth memorys are reserved much later than crashkernel, we may not
> > > > need to worry about it.
> > > > ====================
> > > > start_kernel()
> > > > ......
> > > > -->setup_arch(&command_line);
> > > > -->arch_reserve_crashkernel();
> > > > ......
> > > > -->mm_core_init();
> > > > -->kho_memory_init();
> > > >
> > > > > kdump kernel dump the crazy scratch memory in vmcore is not very
> > > > > meaningful. Otherwise I suspect this is not tested under kdump. If
> > > > > so please disable this option for kdump.
> > >
> > > Ok, it is fine if this is the case, thanks Baoquan for clearing this worry.
> > >
> > > But the other concerns are still need to address, eg. KHO use cases
> > > are not good for kdump.
> > > There could be more to think about.
> > > eg. the issues talked in thread:
> > > https://lore.kernel.org/lkml/Z7dc9Cd8KX3b_brB@dwarf.suse.cz/T/
> >
> > Rethink about this, other than previous concerns. Transferring the
> > old kernel state to kdump kernel makes no sense since the old state is
> > not stable as the kernel has crashed.
>
> KHO won't be active for kdump case. The KHO segments are only added to
> kexec_image and never to kexec_crash_image.
Good to know, thanks!
>
> --
> Sincerely yours,
> Mike.
>
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 07/16] kexec: add Kexec HandOver (KHO) generation helpers
2025-03-26 11:59 ` Mike Rapoport
@ 2025-03-26 16:25 ` Frank van der Linden
0 siblings, 0 replies; 103+ messages in thread
From: Frank van der Linden @ 2025-03-26 16:25 UTC (permalink / raw)
To: Mike Rapoport
Cc: Changyuan Lyu, linux-kernel, graf, akpm, luto, anthony.yznaga,
arnd, ashish.kalra, benh, bp, catalin.marinas, dave.hansen, dwmw2,
ebiederm, mingo, jgowans, corbet, krzk, mark.rutland, pbonzini,
pasha.tatashin, hpa, peterz, ptyadav, robh+dt, robh, saravanak,
skinsburskii, rostedt, tglx, thomas.lendacky, usama.arif, will,
devicetree, kexec, linux-arm-kernel, linux-doc, linux-mm, x86
On Wed, Mar 26, 2025 at 4:59 AM Mike Rapoport <rppt@kernel.org> wrote:
[...]
> > There has, for example, been some talk about making hugetlbfs
> > persistent. You could have hugetlb_cma active. The hugetlb CMA areas
> > are set up quite early, quite some time before KHO restores memory. So
> > that would have to be changed somehow if the location of the KHO init
> > call would remain as close as possible to buddy init as possible. I
> > suspect there may be other uses.
>
> I think we can address this when/if implementing preservation for hugetlbfs
> and it will be tricky.
> If hugetlb in the first kernel uses a lot of memory, we just won't have
> enough scratch space for early hugetlb reservations in the second kernel
> regardless of hugetlb_cma. On the other hand, we already have the preserved
> hugetlbfs memory, so we'd probably need to reserve less memory in the
> second kernel.
>
> But anyway, it's completely different discussion about how to preserve
> hugetlbfs.
Right, there would have to be a KHO interface to carry over the
early reserved memory and reinit it early too.
>
> > > > current requirement in the patch set seems to be "after sparse/page
> > > > init", but I'm not sure why it needs to be as close as possibly to
> > > > buddy init.
> > >
> > > Why would you say that sparse/page init would be a requirement here?
> >
> > At least in its current form, the KHO code expects vmemmap to be
> > initialized, as it does its restore base on page structures, as
> > deserialize_bitmap expects them. I think the use of the page->private
> > field was discussed in a separate thread, I think. If that is done
> > differently, it wouldn't rely on vmemmap being initialized.
>
> In the current form KHO does relies on vmemmap being allocated, but it does
> not rely on it being initialized. Marking memblock ranges NOINT ensures
> nothing touches the corresponding struct pages and KHO can use their fields
> up to the point the memory is returned to KHO callers.
>
> > A few more things I've noticed (not sure if these were discussed before):
> >
> > * Should KHO depend on CONFIG_DEFERRED_STRUCT_PAGE_INIT? Essentially,
> > marking memblock ranges as NOINIT doesn't work without
> > DEFERRED_STRUCT_PAGE_INIT. Although, if the page->private use
> > disappears, this wouldn't be an issue anymore.
>
> It does.
> memmap_init_reserved_pages() is called always, no matter of
> CONFIG_DEFERRED_STRUCT_PAGE_INIT is set or not and it skips initialization
> of NOINIT regions.
Yeah, I see - the ordering makes this work out.
MEMBLOCK_RSRV_NOINIT is a bit confusing in the sense that if you do a
memblock allocation in the !CONFIG_DEFERRED_STRUCT_PAGE_INIT case, and
that allocation is done before free_area_init(), the pages will always
get initialized regardless, since memmap_init_range() will do it. But
this is done before the KHO deserialize, so it works out.
>
> > * As a future extension, it could be nice to store vmemmap init
> > information in the KHO FDT. Then you can use that to init ranges in an
> > optimized way (HVO hugetlb or DAX-style persisted ranges) straight
> > away.
>
> These days memmap contents is unstable because of the folio/memdesc
> project, but in general carrying memory map data from kernel to kernel is
> indeed something to consider.
Yes, I think we might have a need for that, but we'll see.
Thanks,
- Frank
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 09/16] kexec: enable KHO support for memory preservation
2025-03-20 1:55 ` [PATCH v5 09/16] kexec: enable KHO support for memory preservation Changyuan Lyu
2025-03-21 13:46 ` Jason Gunthorpe
@ 2025-03-27 10:03 ` Pratyush Yadav
2025-03-27 13:31 ` Jason Gunthorpe
2025-04-02 19:16 ` Pratyush Yadav
2025-04-03 15:50 ` Pratyush Yadav
3 siblings, 1 reply; 103+ messages in thread
From: Pratyush Yadav @ 2025-03-27 10:03 UTC (permalink / raw)
To: Changyuan Lyu
Cc: linux-kernel, graf, akpm, luto, anthony.yznaga, arnd,
ashish.kalra, benh, bp, catalin.marinas, dave.hansen, dwmw2,
ebiederm, mingo, jgowans, corbet, krzk, rppt, mark.rutland,
pbonzini, pasha.tatashin, hpa, peterz, robh+dt, robh, saravanak,
skinsburskii, rostedt, tglx, thomas.lendacky, usama.arif, will,
devicetree, kexec, linux-arm-kernel, linux-doc, linux-mm, x86,
Jason Gunthorpe
Hi Changyuan,
On Wed, Mar 19 2025, Changyuan Lyu wrote:
> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
>
> Introduce APIs allowing KHO users to preserve memory across kexec and
> get access to that memory after boot of the kexeced kernel
>
> kho_preserve_folio() - record a folio to be preserved over kexec
> kho_restore_folio() - recreates the folio from the preserved memory
> kho_preserve_phys() - record physically contiguous range to be
> preserved over kexec.
> kho_restore_phys() - recreates order-0 pages corresponding to the
> preserved physical range
>
> The memory preservations are tracked by two levels of xarrays to manage
> chunks of per-order 512 byte bitmaps. For instance the entire 1G order
> of a 1TB x86 system would fit inside a single 512 byte bitmap. For
> order 0 allocations each bitmap will cover 16M of address space. Thus,
> for 16G of memory at most 512K of bitmap memory will be needed for order 0.
>
> At serialization time all bitmaps are recorded in a linked list of pages
> for the next kernel to process and the physical address of the list is
> recorded in KHO FDT.
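(Working the quoted numbers through as a sanity check: a 512-byte bitmap
holds 512 * 8 = 4096 bits; at order 0 with 4 KiB pages that covers
4096 * 4 KiB = 16 MiB, so 16 GiB of order-0 memory needs 1024 bitmaps,
i.e. 512 KiB of bitmap memory, and at 1 GiB order the 4096 bits cover
4 TiB, so a 1 TiB machine indeed fits in a single bitmap.)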
Why build the xarray only to transform it down to bitmaps when you can
build the bitmaps from the get go? This would end up wasting both time
and memory. At least from this patch, I don't really see much else being
done with the xarray apart from setting bits in the bitmap.
Of course, with the current linked list structure, this cannot work. But
I don't see why we need to have it. I think having a page-table like
structure would be better -- only instead of having PTEs at the lowest
levels, you have the bitmap.
Just like page tables, each table is page-size. So each page at the
lowest level can have 4k * 8 == 32768 bits. This maps to 128 MiB of 4k
pages. The next level will be pointers to the level 1 table, just like
in page tables. So we get 4096 / 8 == 512 pointers. Each level 2 table
maps to 64 GiB of memory. Similarly, level 3 table maps to 32 TiB and
level 4 to 16 PiB.
Now, __kho_preserve() can just find or allocate the table entry for the
PFN and set its bit. Similar work has to be done when doing the xarray
access as well, so this should have roughly the same performance. When
doing KHO, we just need to record the base address of the table and we
are done. This saves us from doing the expensive copying/transformation
of data in the critical path.
I don't see any obvious downsides compared to the current format. The
serialized state might end up taking slightly more memory due to upper
level tables, but it should still be much less than having two
representations of the same information exist simultaneously.
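To make the layout concrete, here is a minimal user-space sketch of such a
table, limited to two levels (covering 64 GiB) for brevity; the type and
function names are invented for illustration, and an in-kernel version
would use page allocations and the usual bitmap helpers instead of calloc():

#include <stdlib.h>

#define PAGE_SIZE	4096UL
#define BITS_PER_LONG	(8 * sizeof(unsigned long))
#define BITS_PER_TABLE	(PAGE_SIZE * 8)			/* 32768 bits -> 128 MiB of 4K pages */
#define PTRS_PER_TABLE	(PAGE_SIZE / sizeof(void *))	/* 512 slots on a 64-bit system */

struct bitmap_table {		/* lowest level: one bit per 4 KiB page */
	unsigned long bits[PAGE_SIZE / sizeof(unsigned long)];
};

struct dir_table {		/* upper level: pointers to lower-level tables */
	struct bitmap_table *slot[PTRS_PER_TABLE];
};

/* Mark one PFN as preserved, allocating the lowest-level table on demand. */
static int preserve_pfn(struct dir_table *dir, unsigned long pfn)
{
	unsigned long idx = pfn / BITS_PER_TABLE;	/* which 128 MiB chunk */
	unsigned long bit = pfn % BITS_PER_TABLE;
	struct bitmap_table *bt;

	if (idx >= PTRS_PER_TABLE)	/* beyond 64 GiB in this two-level sketch */
		return -1;

	bt = dir->slot[idx];
	if (!bt) {
		bt = calloc(1, sizeof(*bt));	/* a page allocation in the kernel */
		if (!bt)
			return -1;
		dir->slot[idx] = bt;
	}
	bt->bits[bit / BITS_PER_LONG] |= 1UL << (bit % BITS_PER_LONG);
	return 0;
}

Serialization would then only need to record the physical address of the
top-level table, as described above.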
>
> The next kernel then processes that list, reserves the memory ranges and
> later, when a user requests a folio or a physical range, KHO restores
> corresponding memory map entries.
>
> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> Co-developed-by: Changyuan Lyu <changyuanl@google.com>
> Signed-off-by: Changyuan Lyu <changyuanl@google.com>
[...]
> +static void deserialize_bitmap(unsigned int order,
> + struct khoser_mem_bitmap_ptr *elm)
> +{
> + struct kho_mem_phys_bits *bitmap = KHOSER_LOAD_PTR(elm->bitmap);
> + unsigned long bit;
> +
> + for_each_set_bit(bit, bitmap->preserve, PRESERVE_BITS) {
> + int sz = 1 << (order + PAGE_SHIFT);
> + phys_addr_t phys =
> + elm->phys_start + (bit << (order + PAGE_SHIFT));
> + struct page *page = phys_to_page(phys);
> +
> + memblock_reserve(phys, sz);
> + memblock_reserved_mark_noinit(phys, sz);
Why waste time and memory building the reserved ranges? We already have
all the information in the serialized bitmaps, and memblock is already
only allocating from scratch. So we should not need this at all, and
instead simply skip these pages in memblock_free_pages(). With the
page-table like format I mentioned above, this should be very easy since
you can find out whether a page is reserved or not in O(1) time.
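Assuming the same two-level layout sketched earlier in this mail, such an
O(1) check could look roughly like this (again purely illustrative, not the
series' code):

/* Check whether a PFN was marked preserved in the sketched table. */
static int pfn_is_preserved(const struct dir_table *dir, unsigned long pfn)
{
	unsigned long idx = pfn / BITS_PER_TABLE;
	unsigned long bit = pfn % BITS_PER_TABLE;
	const struct bitmap_table *bt;

	if (idx >= PTRS_PER_TABLE)
		return 0;
	bt = dir->slot[idx];
	if (!bt)
		return 0;
	return !!(bt->bits[bit / BITS_PER_LONG] & (1UL << (bit % BITS_PER_LONG)));
}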
> + page->private = order;
> + }
> +}
> +
> +static void __init kho_mem_deserialize(void)
> +{
> + struct khoser_mem_chunk *chunk;
> + struct kho_in_node preserved_mem;
> + const phys_addr_t *mem;
> + int err;
> + u32 len;
> +
> + err = kho_get_node(NULL, "preserved-memory", &preserved_mem);
> + if (err) {
> + pr_err("no preserved-memory node: %d\n", err);
> + return;
> + }
> +
> + mem = kho_get_prop(&preserved_mem, "metadata", &len);
> + if (!mem || len != sizeof(*mem)) {
> + pr_err("failed to get preserved memory bitmaps\n");
> + return;
> + }
> +
> + chunk = *mem ? phys_to_virt(*mem) : NULL;
> + while (chunk) {
> + unsigned int i;
> +
> + memblock_reserve(virt_to_phys(chunk), sizeof(*chunk));
> +
> + for (i = 0; i != chunk->hdr.num_elms; i++)
> + deserialize_bitmap(chunk->hdr.order,
> + &chunk->bitmaps[i]);
> + chunk = KHOSER_LOAD_PTR(chunk->hdr.next);
> + }
> +}
> +
> /* Helper functions for KHO state tree */
>
> struct kho_prop {
[...]
--
Regards,
Pratyush Yadav
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 09/16] kexec: enable KHO support for memory preservation
2025-03-27 10:03 ` Pratyush Yadav
@ 2025-03-27 13:31 ` Jason Gunthorpe
2025-03-27 17:28 ` Pratyush Yadav
0 siblings, 1 reply; 103+ messages in thread
From: Jason Gunthorpe @ 2025-03-27 13:31 UTC (permalink / raw)
To: Pratyush Yadav
Cc: Changyuan Lyu, linux-kernel, graf, akpm, luto, anthony.yznaga,
arnd, ashish.kalra, benh, bp, catalin.marinas, dave.hansen, dwmw2,
ebiederm, mingo, jgowans, corbet, krzk, rppt, mark.rutland,
pbonzini, pasha.tatashin, hpa, peterz, robh+dt, robh, saravanak,
skinsburskii, rostedt, tglx, thomas.lendacky, usama.arif, will,
devicetree, kexec, linux-arm-kernel, linux-doc, linux-mm, x86
On Thu, Mar 27, 2025 at 10:03:17AM +0000, Pratyush Yadav wrote:
> Of course, with the current linked list structure, this cannot work. But
> I don't see why we need to have it. I think having a page-table like
> structure would be better -- only instead of having PTEs at the lowest
> levels, you have the bitmap.
Yes, but there is a trade-off here between what I could write in 30 mins
and what is maximally possible :) The xarray is providing a page table
implementation in library form.
I think this whole thing can be optimized, especially the
memblock_reserve side, but the idea here is to get started and once we
have some data on what the actual preservation workload is then
someone can optimize this.
Otherwise we are going to be spending months just polishing this one
patch without any actual data on where the performance issues and hot
spots actually are.
Jason
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 09/16] kexec: enable KHO support for memory preservation
2025-03-27 13:31 ` Jason Gunthorpe
@ 2025-03-27 17:28 ` Pratyush Yadav
2025-03-28 12:53 ` Jason Gunthorpe
2025-04-02 16:44 ` Changyuan Lyu
0 siblings, 2 replies; 103+ messages in thread
From: Pratyush Yadav @ 2025-03-27 17:28 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Changyuan Lyu, linux-kernel, graf, akpm, luto, anthony.yznaga,
arnd, ashish.kalra, benh, bp, catalin.marinas, dave.hansen, dwmw2,
ebiederm, mingo, jgowans, corbet, krzk, rppt, mark.rutland,
pbonzini, pasha.tatashin, hpa, peterz, robh+dt, robh, saravanak,
skinsburskii, rostedt, tglx, thomas.lendacky, usama.arif, will,
devicetree, kexec, linux-arm-kernel, linux-doc, linux-mm, x86
On Thu, Mar 27 2025, Jason Gunthorpe wrote:
> On Thu, Mar 27, 2025 at 10:03:17AM +0000, Pratyush Yadav wrote:
>
>> Of course, with the current linked list structure, this cannot work. But
>> I don't see why we need to have it. I think having a page-table like
>> structure would be better -- only instead of having PTEs at the lowest
>> levels, you have the bitmap.
>
> Yes, but there is a trade off here of what I could write in 30 mins
> and what is maximally possible :) The xarray is providing a page table
> implementation in a library form.
>
> I think this whole thing can be optimized, especially the
> memblock_reserve side, but the idea here is to get started and once we
> have some data on what the actual preservation workload is then
> someone can optimize this.
>
> Otherwise we are going to be spending months just polishing this one
> patch without any actual data on where the performance issues and hot
> spots actually are.
The memblock_reserve side we can optimize later, I agree. But the memory
preservation format is ABI and I think that is worth spending a little
more time on. And I don't think it should be that much more complex than
the current format.
I want to hack around with it, so I'll give it a try over the next few
days and see what I can come up with.
--
Regards,
Pratyush Yadav
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 09/16] kexec: enable KHO support for memory preservation
2025-03-27 17:28 ` Pratyush Yadav
@ 2025-03-28 12:53 ` Jason Gunthorpe
2025-04-02 16:44 ` Changyuan Lyu
1 sibling, 0 replies; 103+ messages in thread
From: Jason Gunthorpe @ 2025-03-28 12:53 UTC (permalink / raw)
To: Pratyush Yadav
Cc: Changyuan Lyu, linux-kernel, graf, akpm, luto, anthony.yznaga,
arnd, ashish.kalra, benh, bp, catalin.marinas, dave.hansen, dwmw2,
ebiederm, mingo, jgowans, corbet, krzk, rppt, mark.rutland,
pbonzini, pasha.tatashin, hpa, peterz, robh+dt, robh, saravanak,
skinsburskii, rostedt, tglx, thomas.lendacky, usama.arif, will,
devicetree, kexec, linux-arm-kernel, linux-doc, linux-mm, x86
On Thu, Mar 27, 2025 at 05:28:40PM +0000, Pratyush Yadav wrote:
> > Otherwise we are going to be spending months just polishing this one
> > patch without any actual data on where the performance issues and hot
> > spots actually are.
>
> The memblock_reserve side we can optimize later, I agree. But the memory
> preservation format is ABI
I think the agreement was that nothing is ABI at this point..
> and I think that is worth spending a little
> more time on. And I don't think it should be that much more complex than
> the current format.
Maybe!
Jason
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 09/16] kexec: enable KHO support for memory preservation
2025-03-27 17:28 ` Pratyush Yadav
2025-03-28 12:53 ` Jason Gunthorpe
@ 2025-04-02 16:44 ` Changyuan Lyu
2025-04-02 16:47 ` Pratyush Yadav
1 sibling, 1 reply; 103+ messages in thread
From: Changyuan Lyu @ 2025-04-02 16:44 UTC (permalink / raw)
To: ptyadav
Cc: akpm, anthony.yznaga, arnd, ashish.kalra, benh, bp,
catalin.marinas, changyuanl, corbet, dave.hansen, devicetree,
dwmw2, ebiederm, graf, hpa, jgg, jgowans, kexec, krzk,
linux-arm-kernel, linux-doc, linux-kernel, linux-mm, luto,
mark.rutland, mingo, pasha.tatashin, pbonzini, peterz, robh+dt,
robh, rostedt, rppt, saravanak, skinsburskii, tglx,
thomas.lendacky, will, x86
Hi Pratyush, Thanks for suggestions!
On Thu, Mar 27, 2025 at 17:28:40 +0000, Pratyush Yadav <ptyadav@amazon.de> wrote:
> On Thu, Mar 27 2025, Jason Gunthorpe wrote:
>
> > On Thu, Mar 27, 2025 at 10:03:17AM +0000, Pratyush Yadav wrote:
> >
> >> Of course, with the current linked list structure, this cannot work. But
> >> I don't see why we need to have it. I think having a page-table like
> >> structure would be better -- only instead of having PTEs at the lowest
> >> levels, you have the bitmap.
> >
> > Yes, but there is a trade off here of what I could write in 30 mins
> > and what is maximally possible :) The xarray is providing a page table
> > implementation in a library form.
> >
> > I think this whole thing can be optimized, especially the
> > memblock_reserve side, but the idea here is to get started and once we
> > have some data on what the actual preservation workload is then
> > someone can optimize this.
> >
> > Otherwise we are going to be spending months just polishing this one
> > patch without any actual data on where the performance issues and hot
> > spots actually are.
>
> The memblock_reserve side we can optimize later, I agree. But the memory
> preservation format is ABI and I think that is worth spending a little
> more time on. And I don't think it should be that much more complex than
> the current format.
>
> I want to hack around with it, so I'll give it a try over the next few
> days and see what I can come up with.
I agree with Jason that "nothing is ABI at this
point" and it will take some time for KHO to stabilize.
On the other hand, if you have already come up with something working and
simple, we can include it in the next version.
(Sorry for the late reply, I was traveling.)
Best,
Changyuan
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 09/16] kexec: enable KHO support for memory preservation
2025-04-02 16:44 ` Changyuan Lyu
@ 2025-04-02 16:47 ` Pratyush Yadav
2025-04-02 18:37 ` Pasha Tatashin
0 siblings, 1 reply; 103+ messages in thread
From: Pratyush Yadav @ 2025-04-02 16:47 UTC (permalink / raw)
To: Changyuan Lyu
Cc: akpm, anthony.yznaga, arnd, ashish.kalra, benh, bp,
catalin.marinas, corbet, dave.hansen, devicetree, dwmw2, ebiederm,
graf, hpa, jgg, jgowans, kexec, krzk, linux-arm-kernel, linux-doc,
linux-kernel, linux-mm, luto, mark.rutland, mingo, pasha.tatashin,
pbonzini, peterz, robh+dt, robh, rostedt, rppt, saravanak,
skinsburskii, tglx, thomas.lendacky, will, x86
Hi,
On Wed, Apr 02 2025, Changyuan Lyu wrote:
> Hi Pratyush, Thanks for suggestions!
>
> On Thu, Mar 27, 2025 at 17:28:40 +0000, Pratyush Yadav <ptyadav@amazon.de> wrote:
>> On Thu, Mar 27 2025, Jason Gunthorpe wrote:
>>
>> > On Thu, Mar 27, 2025 at 10:03:17AM +0000, Pratyush Yadav wrote:
>> >
>> >> Of course, with the current linked list structure, this cannot work. But
>> >> I don't see why we need to have it. I think having a page-table like
>> >> structure would be better -- only instead of having PTEs at the lowest
>> >> levels, you have the bitmap.
>> >
>> > Yes, but there is a trade off here of what I could write in 30 mins
>> > and what is maximally possible :) The xarray is providing a page table
>> > implementation in a library form.
>> >
>> > I think this whole thing can be optimized, especially the
>> > memblock_reserve side, but the idea here is to get started and once we
>> > have some data on what the actual preservation workload is then
>> > someone can optimize this.
>> >
>> > Otherwise we are going to be spending months just polishing this one
>> > patch without any actual data on where the performance issues and hot
>> > spots actually are.
>>
>> The memblock_reserve side we can optimize later, I agree. But the memory
>> preservation format is ABI and I think that is worth spending a little
>> more time on. And I don't think it should be that much more complex than
>> the current format.
>>
>> I want to hack around with it, so I'll give it a try over the next few
>> days and see what I can come up with.
>
> I agree with Jason that "nothing is ABI at this
> point" and it will take some time for KHO to stabilize.
>
> On the other hand if you have already came up with something working and
> simple, we can include it in the next version.
I already have something that works with zero-order pages. I am
currently implementing support for other orders. It is almost done, but
I need to test it and do a performance comparison with the current
patch. Will post something soon!
--
Regards,
Pratyush Yadav
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 09/16] kexec: enable KHO support for memory preservation
2025-04-02 16:47 ` Pratyush Yadav
@ 2025-04-02 18:37 ` Pasha Tatashin
2025-04-02 18:49 ` Pratyush Yadav
0 siblings, 1 reply; 103+ messages in thread
From: Pasha Tatashin @ 2025-04-02 18:37 UTC (permalink / raw)
To: Pratyush Yadav
Cc: Changyuan Lyu, akpm, anthony.yznaga, arnd, ashish.kalra, benh, bp,
catalin.marinas, corbet, dave.hansen, devicetree, dwmw2, ebiederm,
graf, hpa, jgg, jgowans, kexec, krzk, linux-arm-kernel, linux-doc,
linux-kernel, linux-mm, luto, mark.rutland, mingo, pbonzini,
peterz, robh+dt, robh, rostedt, rppt, saravanak, skinsburskii,
tglx, thomas.lendacky, will, x86
On Wed, Apr 2, 2025 at 12:47 PM Pratyush Yadav <ptyadav@amazon.de> wrote:
>
> Hi,
>
> On Wed, Apr 02 2025, Changyuan Lyu wrote:
>
> > Hi Pratyush, Thanks for suggestions!
> >
> > On Thu, Mar 27, 2025 at 17:28:40 +0000, Pratyush Yadav <ptyadav@amazon.de> wrote:
> >> On Thu, Mar 27 2025, Jason Gunthorpe wrote:
> >>
> >> > On Thu, Mar 27, 2025 at 10:03:17AM +0000, Pratyush Yadav wrote:
> >> >
> >> >> Of course, with the current linked list structure, this cannot work. But
> >> >> I don't see why we need to have it. I think having a page-table like
> >> >> structure would be better -- only instead of having PTEs at the lowest
> >> >> levels, you have the bitmap.
> >> >
> >> > Yes, but there is a trade off here of what I could write in 30 mins
> >> > and what is maximally possible :) The xarray is providing a page table
> >> > implementation in a library form.
> >> >
> >> > I think this whole thing can be optimized, especially the
> >> > memblock_reserve side, but the idea here is to get started and once we
> >> > have some data on what the actual preservation workload is then
> >> > someone can optimize this.
> >> >
> >> > Otherwise we are going to be spending months just polishing this one
> >> > patch without any actual data on where the performance issues and hot
> >> > spots actually are.
> >>
> >> The memblock_reserve side we can optimize later, I agree. But the memory
> >> preservation format is ABI and I think that is worth spending a little
> >> more time on. And I don't think it should be that much more complex than
> >> the current format.
> >>
> >> I want to hack around with it, so I'll give it a try over the next few
> >> days and see what I can come up with.
> >
> > I agree with Jason that "nothing is ABI at this
> > point" and it will take some time for KHO to stabilize.
> >
> > On the other hand if you have already came up with something working and
> > simple, we can include it in the next version.
>
> I already have something that works with zero-order pages. I am
> currently implementing support for other orders. It is almost done, but
> I need to test it and do a performance comparison with the current
> patch. Will post something soon!
Hi Pratyush,
Just to clarify, how soon? We are about to post v6 for KHO, with all
other comments in this thread addressed.
Thanks,
Pasha
>
> --
> Regards,
> Pratyush Yadav
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 09/16] kexec: enable KHO support for memory preservation
2025-04-02 18:37 ` Pasha Tatashin
@ 2025-04-02 18:49 ` Pratyush Yadav
0 siblings, 0 replies; 103+ messages in thread
From: Pratyush Yadav @ 2025-04-02 18:49 UTC (permalink / raw)
To: Pasha Tatashin
Cc: Changyuan Lyu, akpm, anthony.yznaga, arnd, ashish.kalra, benh, bp,
catalin.marinas, corbet, dave.hansen, devicetree, dwmw2, ebiederm,
graf, hpa, jgg, jgowans, kexec, krzk, linux-arm-kernel, linux-doc,
linux-kernel, linux-mm, luto, mark.rutland, mingo, pbonzini,
peterz, robh+dt, robh, rostedt, rppt, saravanak, skinsburskii,
tglx, thomas.lendacky, will, x86
On Wed, Apr 02 2025, Pasha Tatashin wrote:
> On Wed, Apr 2, 2025 at 12:47 PM Pratyush Yadav <ptyadav@amazon.de> wrote:
>>
>> Hi,
>>
>> On Wed, Apr 02 2025, Changyuan Lyu wrote:
>>
>> > Hi Pratyush, Thanks for suggestions!
>> >
>> > On Thu, Mar 27, 2025 at 17:28:40 +0000, Pratyush Yadav <ptyadav@amazon.de> wrote:
[...]
>> >>
>> >> The memblock_reserve side we can optimize later, I agree. But the memory
>> >> preservation format is ABI and I think that is worth spending a little
>> >> more time on. And I don't think it should be that much more complex than
>> >> the current format.
>> >>
>> >> I want to hack around with it, so I'll give it a try over the next few
>> >> days and see what I can come up with.
>> >
>> > I agree with Jason that "nothing is ABI at this
>> > point" and it will take some time for KHO to stabilize.
>> >
>> > On the other hand if you have already came up with something working and
>> > simple, we can include it in the next version.
>>
>> I already have something that works with zero-order pages. I am
>> currently implementing support for other orders. It is almost done, but
>> I need to test it and do a performance comparison with the current
>> patch. Will post something soon!
>
> Hi Pratyush,
>
> Just to clarify, how soon? We are about to post v6 for KHO, with all
> other comments in this thread addressed.
I have it working, but I need to clean up the code a bit and test it
better. So hopefully end of this week or early next week.
--
Regards,
Pratyush Yadav
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 09/16] kexec: enable KHO support for memory preservation
2025-03-20 1:55 ` [PATCH v5 09/16] kexec: enable KHO support for memory preservation Changyuan Lyu
2025-03-21 13:46 ` Jason Gunthorpe
2025-03-27 10:03 ` Pratyush Yadav
@ 2025-04-02 19:16 ` Pratyush Yadav
2025-04-03 11:42 ` Jason Gunthorpe
` (2 more replies)
2025-04-03 15:50 ` Pratyush Yadav
3 siblings, 3 replies; 103+ messages in thread
From: Pratyush Yadav @ 2025-04-02 19:16 UTC (permalink / raw)
To: Changyuan Lyu
Cc: linux-kernel, graf, akpm, luto, anthony.yznaga, arnd,
ashish.kalra, benh, bp, catalin.marinas, dave.hansen, dwmw2,
ebiederm, mingo, jgowans, corbet, krzk, rppt, mark.rutland,
pbonzini, pasha.tatashin, hpa, peterz, robh+dt, robh, saravanak,
skinsburskii, rostedt, tglx, thomas.lendacky, usama.arif, will,
devicetree, kexec, linux-arm-kernel, linux-doc, linux-mm, x86,
Jason Gunthorpe
Hi Changyuan,
On Wed, Mar 19 2025, Changyuan Lyu wrote:
> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
>
> Introduce APIs allowing KHO users to preserve memory across kexec and
> get access to that memory after boot of the kexeced kernel
>
> kho_preserve_folio() - record a folio to be preserved over kexec
> kho_restore_folio() - recreates the folio from the preserved memory
> kho_preserve_phys() - record physically contiguous range to be
> preserved over kexec.
> kho_restore_phys() - recreates order-0 pages corresponding to the
> preserved physical range
>
> The memory preservations are tracked by two levels of xarrays to manage
> chunks of per-order 512 byte bitmaps. For instance the entire 1G order
> of a 1TB x86 system would fit inside a single 512 byte bitmap. For
> order 0 allocations each bitmap will cover 16M of address space. Thus,
> for 16G of memory at most 512K of bitmap memory will be needed for order 0.
>
> At serialization time all bitmaps are recorded in a linked list of pages
> for the next kernel to process and the physical address of the list is
> recorded in KHO FDT.
>
> The next kernel then processes that list, reserves the memory ranges and
> later, when a user requests a folio or a physical range, KHO restores
> corresponding memory map entries.
>
> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> Co-developed-by: Changyuan Lyu <changyuanl@google.com>
> Signed-off-by: Changyuan Lyu <changyuanl@google.com>
> ---
> include/linux/kexec_handover.h | 38 +++
> kernel/kexec_handover.c | 486 ++++++++++++++++++++++++++++++++-
> 2 files changed, 522 insertions(+), 2 deletions(-)
[...]
> +int kho_preserve_phys(phys_addr_t phys, size_t size)
> +{
> + unsigned long pfn = PHYS_PFN(phys), end_pfn = PHYS_PFN(phys + size);
> + unsigned int order = ilog2(end_pfn - pfn);
This caught my eye when playing around with the code. It does not put
any limit on the order, so it can exceed NR_PAGE_ORDERS. Also, when
initializing the page after KHO, we pass the order directly to
prep_compound_page() without sanity checking it. The next kernel might
not support all the orders the current one supports. Perhaps something
to fix?
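For illustration, a clamp along these lines inside the loop would keep the
order in range (only a rough sketch reusing the names from this patch; the
MAX_PAGE_ORDER cap is my assumption of what the limit should be):

/* sketch: never hand the tracker an order the buddy allocator can't represent */
for (; pfn < end_pfn; pfn += (1 << order)) {
	order = min_t(unsigned int, ilog2(end_pfn - pfn), MAX_PAGE_ORDER);
	err = __kho_preserve(&kho_mem_track, pfn, order);
	if (err) {
		failed_pfn = pfn;
		break;
	}
}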
> + unsigned long failed_pfn;
> + int err = 0;
> +
> + if (!kho_enable)
> + return -EOPNOTSUPP;
> +
> + down_read(&kho_out.tree_lock);
> + if (kho_out.fdt) {
> + err = -EBUSY;
> + goto unlock;
> + }
> +
> + for (; pfn < end_pfn;
> + pfn += (1 << order), order = ilog2(end_pfn - pfn)) {
> + err = __kho_preserve(&kho_mem_track, pfn, order);
> + if (err) {
> + failed_pfn = pfn;
> + break;
> + }
> + }
[...
> +struct folio *kho_restore_folio(phys_addr_t phys)
> +{
> + struct page *page = pfn_to_online_page(PHYS_PFN(phys));
> + unsigned long order = page->private;
> +
> + if (!page)
> + return NULL;
> +
> + order = page->private;
> + if (order)
> + prep_compound_page(page, order);
> + else
> + kho_restore_page(page);
> +
> + return page_folio(page);
> +}
[...]
--
Regards,
Pratyush Yadav
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 09/16] kexec: enable KHO support for memory preservation
2025-04-02 19:16 ` Pratyush Yadav
@ 2025-04-03 11:42 ` Jason Gunthorpe
2025-04-03 13:58 ` Mike Rapoport
2025-04-03 13:57 ` Mike Rapoport
2025-04-11 4:02 ` Changyuan Lyu
2 siblings, 1 reply; 103+ messages in thread
From: Jason Gunthorpe @ 2025-04-03 11:42 UTC (permalink / raw)
To: Pratyush Yadav
Cc: Changyuan Lyu, linux-kernel, graf, akpm, luto, anthony.yznaga,
arnd, ashish.kalra, benh, bp, catalin.marinas, dave.hansen, dwmw2,
ebiederm, mingo, jgowans, corbet, krzk, rppt, mark.rutland,
pbonzini, pasha.tatashin, hpa, peterz, robh+dt, robh, saravanak,
skinsburskii, rostedt, tglx, thomas.lendacky, usama.arif, will,
devicetree, kexec, linux-arm-kernel, linux-doc, linux-mm, x86
On Wed, Apr 02, 2025 at 07:16:27PM +0000, Pratyush Yadav wrote:
> > +int kho_preserve_phys(phys_addr_t phys, size_t size)
> > +{
> > + unsigned long pfn = PHYS_PFN(phys), end_pfn = PHYS_PFN(phys + size);
> > + unsigned int order = ilog2(end_pfn - pfn);
>
> This caught my eye when playing around with the code. It does not put
> any limit on the order, so it can exceed NR_PAGE_ORDERS. Also, when
> initializing the page after KHO, we pass the order directly to
> prep_compound_page() without sanity checking it. The next kernel might
> not support all the orders the current one supports. Perhaps something
> to fix?
IMHO we should delete the phys functions until we get a user of them
Jason
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 09/16] kexec: enable KHO support for memory preservation
2025-04-02 19:16 ` Pratyush Yadav
2025-04-03 11:42 ` Jason Gunthorpe
@ 2025-04-03 13:57 ` Mike Rapoport
2025-04-11 4:02 ` Changyuan Lyu
2 siblings, 0 replies; 103+ messages in thread
From: Mike Rapoport @ 2025-04-03 13:57 UTC (permalink / raw)
To: Pratyush Yadav
Cc: Changyuan Lyu, linux-kernel, graf, akpm, luto, anthony.yznaga,
arnd, ashish.kalra, benh, bp, catalin.marinas, dave.hansen, dwmw2,
ebiederm, mingo, jgowans, corbet, krzk, mark.rutland, pbonzini,
pasha.tatashin, hpa, peterz, robh+dt, robh, saravanak,
skinsburskii, rostedt, tglx, thomas.lendacky, usama.arif, will,
devicetree, kexec, linux-arm-kernel, linux-doc, linux-mm, x86,
Jason Gunthorpe
On Wed, Apr 02, 2025 at 07:16:27PM +0000, Pratyush Yadav wrote:
> Hi Changyuan,
>
> On Wed, Mar 19 2025, Changyuan Lyu wrote:
>
> > From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> >
> > Introduce APIs allowing KHO users to preserve memory across kexec and
> > get access to that memory after boot of the kexeced kernel
> >
> > kho_preserve_folio() - record a folio to be preserved over kexec
> > kho_restore_folio() - recreates the folio from the preserved memory
> > kho_preserve_phys() - record physically contiguous range to be
> > preserved over kexec.
> > kho_restore_phys() - recreates order-0 pages corresponding to the
> > preserved physical range
> >
> > The memory preservations are tracked by two levels of xarrays to manage
> > chunks of per-order 512 byte bitmaps. For instance the entire 1G order
> > of a 1TB x86 system would fit inside a single 512 byte bitmap. For
> > order 0 allocations each bitmap will cover 16M of address space. Thus,
> > for 16G of memory at most 512K of bitmap memory will be needed for order 0.
> >
> > At serialization time all bitmaps are recorded in a linked list of pages
> > for the next kernel to process and the physical address of the list is
> > recorded in KHO FDT.
> >
> > The next kernel then processes that list, reserves the memory ranges and
> > later, when a user requests a folio or a physical range, KHO restores
> > corresponding memory map entries.
> >
> > Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> > Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> > Co-developed-by: Changyuan Lyu <changyuanl@google.com>
> > Signed-off-by: Changyuan Lyu <changyuanl@google.com>
> > ---
> > include/linux/kexec_handover.h | 38 +++
> > kernel/kexec_handover.c | 486 ++++++++++++++++++++++++++++++++-
> > 2 files changed, 522 insertions(+), 2 deletions(-)
> [...]
> > +int kho_preserve_phys(phys_addr_t phys, size_t size)
> > +{
> > + unsigned long pfn = PHYS_PFN(phys), end_pfn = PHYS_PFN(phys + size);
> > + unsigned int order = ilog2(end_pfn - pfn);
>
> This caught my eye when playing around with the code. It does not put
> any limit on the order, so it can exceed NR_PAGE_ORDERS. Also, when
I don't see a problem with this
> initializing the page after KHO, we pass the order directly to
> prep_compound_page() without sanity checking it. The next kernel might
> not support all the orders the current one supports. Perhaps something
> to fix?
And this needs to be fixed and we should refuse to create folios larger
than MAX_ORDER.
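Roughly something like this on the restore side (a sketch only; the exact
limit macro, MAX_PAGE_ORDER here, is an assumption):

struct folio *kho_restore_folio(phys_addr_t phys)
{
	struct page *page = pfn_to_online_page(PHYS_PFN(phys));
	unsigned long order;

	if (!page)
		return NULL;

	/* sketch: refuse orders the running kernel cannot represent */
	order = page->private;
	if (order > MAX_PAGE_ORDER)
		return NULL;

	if (order)
		prep_compound_page(page, order);
	else
		kho_restore_page(page);

	return page_folio(page);
}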
> > + unsigned long failed_pfn;
> > + int err = 0;
> > +
> > + if (!kho_enable)
> > + return -EOPNOTSUPP;
> > +
> > + down_read(&kho_out.tree_lock);
> > + if (kho_out.fdt) {
> > + err = -EBUSY;
> > + goto unlock;
> > + }
> > +
> > + for (; pfn < end_pfn;
> > + pfn += (1 << order), order = ilog2(end_pfn - pfn)) {
> > + err = __kho_preserve(&kho_mem_track, pfn, order);
> > + if (err) {
> > + failed_pfn = pfn;
> > + break;
> > + }
> > + }
> [...
> > +struct folio *kho_restore_folio(phys_addr_t phys)
> > +{
> > + struct page *page = pfn_to_online_page(PHYS_PFN(phys));
> > + unsigned long order = page->private;
> > +
> > + if (!page)
> > + return NULL;
> > +
> > + order = page->private;
> > + if (order)
> > + prep_compound_page(page, order);
> > + else
> > + kho_restore_page(page);
> > +
> > + return page_folio(page);
> > +}
> [...]
>
> --
> Regards,
> Pratyush Yadav
--
Sincerely yours,
Mike.
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 09/16] kexec: enable KHO support for memory preservation
2025-04-03 11:42 ` Jason Gunthorpe
@ 2025-04-03 13:58 ` Mike Rapoport
2025-04-03 14:24 ` Jason Gunthorpe
0 siblings, 1 reply; 103+ messages in thread
From: Mike Rapoport @ 2025-04-03 13:58 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Pratyush Yadav, Changyuan Lyu, linux-kernel, graf, akpm, luto,
anthony.yznaga, arnd, ashish.kalra, benh, bp, catalin.marinas,
dave.hansen, dwmw2, ebiederm, mingo, jgowans, corbet, krzk,
mark.rutland, pbonzini, pasha.tatashin, hpa, peterz, robh+dt,
robh, saravanak, skinsburskii, rostedt, tglx, thomas.lendacky,
usama.arif, will, devicetree, kexec, linux-arm-kernel, linux-doc,
linux-mm, x86
On Thu, Apr 03, 2025 at 08:42:09AM -0300, Jason Gunthorpe wrote:
> On Wed, Apr 02, 2025 at 07:16:27PM +0000, Pratyush Yadav wrote:
> > > +int kho_preserve_phys(phys_addr_t phys, size_t size)
> > > +{
> > > + unsigned long pfn = PHYS_PFN(phys), end_pfn = PHYS_PFN(phys + size);
> > > + unsigned int order = ilog2(end_pfn - pfn);
> >
> > This caught my eye when playing around with the code. It does not put
> > any limit on the order, so it can exceed NR_PAGE_ORDERS. Also, when
> > initializing the page after KHO, we pass the order directly to
> > prep_compound_page() without sanity checking it. The next kernel might
> > not support all the orders the current one supports. Perhaps something
> > to fix?
>
> IMHO we should delete the phys functions until we get a user of them
The only user of memory tracker in this series uses kho_preserve_phys()
> Jason
--
Sincerely yours,
Mike.
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 09/16] kexec: enable KHO support for memory preservation
2025-04-03 13:58 ` Mike Rapoport
@ 2025-04-03 14:24 ` Jason Gunthorpe
2025-04-04 9:54 ` Mike Rapoport
0 siblings, 1 reply; 103+ messages in thread
From: Jason Gunthorpe @ 2025-04-03 14:24 UTC (permalink / raw)
To: Mike Rapoport
Cc: Pratyush Yadav, Changyuan Lyu, linux-kernel, graf, akpm, luto,
anthony.yznaga, arnd, ashish.kalra, benh, bp, catalin.marinas,
dave.hansen, dwmw2, ebiederm, mingo, jgowans, corbet, krzk,
mark.rutland, pbonzini, pasha.tatashin, hpa, peterz, robh+dt,
robh, saravanak, skinsburskii, rostedt, tglx, thomas.lendacky,
usama.arif, will, devicetree, kexec, linux-arm-kernel, linux-doc,
linux-mm, x86
On Thu, Apr 03, 2025 at 04:58:27PM +0300, Mike Rapoport wrote:
> On Thu, Apr 03, 2025 at 08:42:09AM -0300, Jason Gunthorpe wrote:
> > On Wed, Apr 02, 2025 at 07:16:27PM +0000, Pratyush Yadav wrote:
> > > > +int kho_preserve_phys(phys_addr_t phys, size_t size)
> > > > +{
> > > > + unsigned long pfn = PHYS_PFN(phys), end_pfn = PHYS_PFN(phys + size);
> > > > + unsigned int order = ilog2(end_pfn - pfn);
> > >
> > > This caught my eye when playing around with the code. It does not put
> > > any limit on the order, so it can exceed NR_PAGE_ORDERS. Also, when
> > > initializing the page after KHO, we pass the order directly to
> > > prep_compound_page() without sanity checking it. The next kernel might
> > > not support all the orders the current one supports. Perhaps something
> > > to fix?
> >
> > IMHO we should delete the phys functions until we get a user of them
>
> The only user of memory tracker in this series uses kho_preserve_phys()
But it really shouldn't. The reserved memory is a completely different
mechanism than buddy allocator preservation. It doesn't even call
kho_restore_phys() on those pages, it just feeds the ranges directly to:
+ reserved_mem_add(*p_start, size, name);
The bitmaps should be understood as preserving memory from the buddy
allocator only.
IMHO it should not call kho_preserve_phys() at all.
Jason
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 09/16] kexec: enable KHO support for memory preservation
2025-03-20 1:55 ` [PATCH v5 09/16] kexec: enable KHO support for memory preservation Changyuan Lyu
` (2 preceding siblings ...)
2025-04-02 19:16 ` Pratyush Yadav
@ 2025-04-03 15:50 ` Pratyush Yadav
2025-04-03 16:10 ` Jason Gunthorpe
3 siblings, 1 reply; 103+ messages in thread
From: Pratyush Yadav @ 2025-04-03 15:50 UTC (permalink / raw)
To: Changyuan Lyu
Cc: linux-kernel, graf, akpm, luto, anthony.yznaga, arnd,
ashish.kalra, benh, bp, catalin.marinas, dave.hansen, dwmw2,
ebiederm, mingo, jgowans, corbet, krzk, rppt, mark.rutland,
pbonzini, pasha.tatashin, hpa, peterz, robh+dt, robh, saravanak,
skinsburskii, rostedt, tglx, thomas.lendacky, usama.arif, will,
devicetree, kexec, linux-arm-kernel, linux-doc, linux-mm, x86,
Jason Gunthorpe
Hi all,
The below patch implements the table-based memory preservation mechanism
I suggested. It is a replacement for this patch. Instead of using an
xarray of bitmaps and converting them into a linked list of bitmaps at
serialization time, it tracks preserved pages in a page-table-like
format that needs no extra work when serializing. This results in
noticeably better performance when preserving a large number of pages.
To compare performance, I allocated 48 GiB of memory and preserved it
using KHO. Below is the time taken to make the reservations, and then
serialize that to FDT.
Linked list: 577ms +- 0.7% (6 samples)
Table: 469ms +- 0.6% (6 samples)
From this, we can see that the table is almost 19% faster.
This test was done with only one thread, but since it is possible to
make reservations in parallel, the performance would increase even more
-- especially since the linked list serialization cannot be parallelized
easily.
In terms of memory usage, I could not collect reliable data, but I don't
think there should be a significant difference between the two approaches
since the bitmaps are the same density, and the only difference would be
extra metadata (chunks vs upper level tables).
Memory usage for tables can be further optimized if needed by collapsing
full tables. That is, if all bits in an L1 table are set, we can just not
allocate a page for it, and instead set a flag in the L2 descriptor.
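(As a rough illustration of that idea, with a hypothetical flag bit that is
not part of the patch below: since the tables are page aligned, a low bit of
the descriptor is free and could mark a fully populated range.)

/* hypothetical: "every page covered by this L1 range is preserved" */
#define KHOMEM_DESC_FULL	BIT(0)

static inline bool khomem_desc_full(khomem_desc_t desc)
{
	return desc & KHOMEM_DESC_FULL;
}

The walk and lookup paths would then have to check khomem_desc_full() before
treating a descriptor as empty or dereferencing it, since khomem_desc_none()
only looks at the address bits.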
The patch currently has a limitation where it does not free any of the
empty tables after an unpreserve operation. But Changyuan's patch also
doesn't do it, so at least it is not any worse off.
In terms of code size, I believe both are roughly the same. This patch
is 609 lines compared to Changyuan's 522, many of which come from the
longer comment.
When working on this patch, I realized that kho_mem_deserialize() is
currently _very_ slow. It takes over 2 seconds to make memblock
reservations for 48 GiB of 0-order pages. I suppose this can later be
optimized by teaching memblock_free_all() to skip preserved pages
instead of making memblock reservations.
Regards,
Pratyush Yadav
---- 8< ----
From 40c1274052709e4d102cc9fe55fa94272f827283 Mon Sep 17 00:00:00 2001
From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
Date: Wed, 19 Mar 2025 18:55:44 -0700
Subject: [PATCH] kexec: enable KHO support for memory preservation
Introduce APIs allowing KHO users to preserve memory across kexec and
get access to that memory after boot of the kexeced kernel
kho_preserve_folio() - record a folio to be preserved over kexec
kho_restore_folio() - recreates the folio from the preserved memory
kho_preserve_phys() - record physically contiguous range to be
preserved over kexec.
kho_restore_phys() - recreates order-0 pages corresponding to the
preserved physical range
The memory preservations are tracked by using 4 levels of tables,
similar to page tables, except at the lowest level, a bitmap is present
instead of PTEs. Each page order has its own separate table. A set bit
in the bitmap represents a page of the table's order. The tables are
named simply by their level, with the highest being level 4 (L4), the
next being level 3 (L3) and so on.
Assuming 0-order 4K pages, an L1 table will have a total of 4096 * 8 ==
32768 bits. This maps to 128 MiB of memory. L2 and above tables will
consist of pointers to lower level tables, so each level will have 4096
/ 8 == 512 pointers. This means each level 2 table maps to 64 GiB of
memory, each level 3 table maps to 32 TiB of memory, and each level 4
table maps to 16 PiB of memory. More information on the table format can
be found in the comment in the patch.
At serialization time, all that needs to be done is to record the
top-level table descriptors for each order. The next kernel can use
those to find the tables, walk them, reserve the memory ranges, and
later when a user requests a folio or a physical range, KHO restores
corresponding memory map entries.
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Co-developed-by: Changyuan Lyu <changyuanl@google.com>
Signed-off-by: Changyuan Lyu <changyuanl@google.com>
[ptyadav@amazon.de: table based preserved page tracking]
Co-developed-by: Pratyush Yadav <ptyadav@amazon.de>
Signed-off-by: Pratyush Yadav <ptyadav@amazon.de>
---
include/linux/kexec_handover.h | 38 +++
kernel/kexec_handover.c | 573 ++++++++++++++++++++++++++++++++-
2 files changed, 609 insertions(+), 2 deletions(-)
diff --git a/include/linux/kexec_handover.h b/include/linux/kexec_handover.h
index c665ff6cd728a..d52a7b500f4ce 100644
--- a/include/linux/kexec_handover.h
+++ b/include/linux/kexec_handover.h
@@ -5,6 +5,7 @@
#include <linux/types.h>
#include <linux/hashtable.h>
#include <linux/notifier.h>
+#include <linux/mm_types.h>
struct kho_scratch {
phys_addr_t addr;
@@ -54,6 +55,13 @@ int kho_add_string_prop(struct kho_node *node, const char *key,
int register_kho_notifier(struct notifier_block *nb);
int unregister_kho_notifier(struct notifier_block *nb);
+int kho_preserve_folio(struct folio *folio);
+int kho_unpreserve_folio(struct folio *folio);
+int kho_preserve_phys(phys_addr_t phys, size_t size);
+int kho_unpreserve_phys(phys_addr_t phys, size_t size);
+struct folio *kho_restore_folio(phys_addr_t phys);
+void *kho_restore_phys(phys_addr_t phys, size_t size);
+
void kho_memory_init(void);
void kho_populate(phys_addr_t handover_fdt_phys, phys_addr_t scratch_phys,
@@ -118,6 +126,36 @@ static inline int unregister_kho_notifier(struct notifier_block *nb)
return -EOPNOTSUPP;
}
+static inline int kho_preserve_folio(struct folio *folio)
+{
+ return -EOPNOTSUPP;
+}
+
+static inline int kho_unpreserve_folio(struct folio *folio)
+{
+ return -EOPNOTSUPP;
+}
+
+static inline int kho_preserve_phys(phys_addr_t phys, size_t size)
+{
+ return -EOPNOTSUPP;
+}
+
+static inline int kho_unpreserve_phys(phys_addr_t phys, size_t size)
+{
+ return -EOPNOTSUPP;
+}
+
+static inline struct folio *kho_restore_folio(phys_addr_t phys)
+{
+ return NULL;
+}
+
+static inline void *kho_restore_phys(phys_addr_t phys, size_t size)
+{
+ return NULL;
+}
+
static inline void kho_memory_init(void)
{
}
diff --git a/kernel/kexec_handover.c b/kernel/kexec_handover.c
index 6ebad2f023f95..16f10bd06de0a 100644
--- a/kernel/kexec_handover.c
+++ b/kernel/kexec_handover.c
@@ -3,6 +3,7 @@
* kexec_handover.c - kexec handover metadata processing
* Copyright (C) 2023 Alexander Graf <graf@amazon.com>
* Copyright (C) 2025 Microsoft Corporation, Mike Rapoport <rppt@kernel.org>
+ * Copyright (C) 2025 Amazon.com Inc. or its affiliates, Pratyush Yadav <ptyadav@amazon.de>
* Copyright (C) 2024 Google LLC
*/
@@ -62,6 +63,10 @@ struct kho_out {
struct rw_semaphore tree_lock;
struct kho_node root;
+ /* Array containing the L4 table descriptors for each order. */
+ unsigned long mem_tables[NR_PAGE_ORDERS];
+ struct kho_node preserved_memory;
+
void *fdt;
u64 fdt_max;
};
@@ -70,6 +75,7 @@ static struct kho_out kho_out = {
.chain_head = BLOCKING_NOTIFIER_INIT(kho_out.chain_head),
.tree_lock = __RWSEM_INITIALIZER(kho_out.tree_lock),
.root = KHO_NODE_INIT,
+ .preserved_memory = KHO_NODE_INIT,
.fdt_max = 10 * SZ_1M,
};
@@ -237,6 +243,559 @@ int kho_node_check_compatible(const struct kho_in_node *node,
}
EXPORT_SYMBOL_GPL(kho_node_check_compatible);
+/*
+ * Keep track of memory that is to be preserved across KHO.
+ *
+ * The memory is tracked by using 4 levels of tables, similar to page tables,
+ * except at the lowest level, a bitmap is present instead of PTEs. Each page
+ * order has its own separate table. A set bit in the bitmap represents a page
+ * of the table's order. The tables are named simply by their level, with the
+ * highest being level 4 (L4), the next being level 3 (L3) and so on.
+ *
+ * The table hierarchy can be seen with the below diagram.
+ *
+ * +----+
+ * | L4 |
+ * +----+
+ * |
+ * | +----+
+ * +-->| L3 |
+ * +----+
+ * |
+ * | +----+
+ * +-->| L2 |
+ * +----+
+ * |
+ * | +----+
+ * +-->| L1 |
+ * +----+
+ *
+ * Assuming 0-order 4K pages, an L1 table will have a total of 4096 * 8 == 32768
+ * bits. This maps to 128 MiB of memory. L2 and above tables will consist of
+ * pointers to lower level tables, so each level will have 4096 / 8 == 512
+ * pointers. This means each level 2 table maps to 64 GiB of memory, each level
+ * 3 table maps to 32 TiB of memory, and each level 4 table maps to 16 PiB of
+ * memory.
+ *
+ * The below diagram shows how the address is split into the different levels
+ * for 0-order 4K pages:
+ *
+ * 63:54 53:45 44:36 35:27 26:12 11:0
+ * +----------+--------+--------+--------+--------------+-----------+
+ * | Ignored | L4 | L3 | L2 | L1 | Page off |
+ * +----------+--------+--------+--------+--------------+-----------+
+ *
+ * For higher order pages, the bits for each level get shifted left by the
+ * order.
+ *
+ * Each table except L1 contains a descriptor for the next level table. For 4K
+ * pages, the below diagram shows the format of the descriptor:
+ *
+ * 63:12 11:0
+ * +----------+--------+--------+--------+--------------+-----------+
+ * | Pointer to next level table | Reserved |
+ * +----------+--------+--------+--------+--------------+-----------+
+ *
+ * The reserved bits must be zero, but can be used for flags in later versions.
+ */
+
+typedef unsigned long khomem_desc_t;
+typedef int (*khomem_walk_fn_t)(unsigned long phys, unsigned int order, void *arg);
+
+#define PTRS_PER_LEVEL (PAGE_SIZE / sizeof(unsigned long))
+#define KHOMEM_L1_BITS (PAGE_SIZE * BITS_PER_BYTE)
+#define KHOMEM_L1_MASK ((1 << ilog2(KHOMEM_L1_BITS)) - 1)
+#define KHOMEM_L1_SHIFT (PAGE_SHIFT)
+#define KHOMEM_L2_SHIFT (KHOMEM_L1_SHIFT + ilog2(KHOMEM_L1_BITS))
+#define KHOMEM_L3_SHIFT (KHOMEM_L2_SHIFT + ilog2(PTRS_PER_LEVEL))
+#define KHOMEM_L4_SHIFT (KHOMEM_L3_SHIFT + ilog2(PTRS_PER_LEVEL))
+#define KHOMEM_PFN_MASK PAGE_MASK
+
+static unsigned int khomem_level_shifts[] = {
+ [1] = KHOMEM_L1_SHIFT,
+ [2] = KHOMEM_L2_SHIFT,
+ [3] = KHOMEM_L3_SHIFT,
+ [4] = KHOMEM_L4_SHIFT,
+};
+
+static inline unsigned long khomem_table_index(unsigned long address,
+ unsigned int level,
+ unsigned int order)
+{
+ unsigned long mask = level == 1 ? KHOMEM_L1_MASK : (PTRS_PER_LEVEL - 1);
+ /* Avoid undefined behaviour in case shift is too big. */
+ int shift = min_t(int, khomem_level_shifts[level] + order, BITS_PER_LONG - 1);
+
+ return (address >> shift) & mask;
+}
+
+static inline khomem_desc_t *khomem_table(khomem_desc_t desc)
+{
+ return __va(desc & KHOMEM_PFN_MASK);
+}
+
+static inline khomem_desc_t *khomem_table_offset(khomem_desc_t *table,
+ unsigned long address,
+ unsigned int level,
+ unsigned int order)
+{
+ return khomem_table(*table) + khomem_table_index(address, level, order);
+}
+
+static inline bool khomem_desc_none(khomem_desc_t desc)
+{
+ return !(desc & KHOMEM_PFN_MASK);
+}
+
+static inline void khomem_bitmap_preserve(khomem_desc_t *desc,
+ unsigned long address,
+ unsigned int order)
+{
+ /* set_bit() is atomic, so no need for locking. */
+ set_bit(khomem_table_index(address, 1, order), khomem_table(*desc));
+}
+
+static inline void khomem_bitmap_unpreserve(khomem_desc_t *desc,
+ unsigned long address,
+ unsigned int order)
+{
+ /* clear_bit() is atomic, so no need for locking. */
+ clear_bit(khomem_table_index(address, 1, order), khomem_table(*desc));
+}
+
+static inline khomem_desc_t khomem_mkdesc(void *table)
+{
+ return virt_to_phys(table) & KHOMEM_PFN_MASK;
+}
+
+static int __khomem_table_alloc(khomem_desc_t *desc)
+{
+ if (khomem_desc_none(*desc)) {
+ khomem_desc_t *table, val;
+
+ table = (khomem_desc_t *)get_zeroed_page(GFP_KERNEL);
+ if (!table)
+ return -ENOMEM;
+
+ val = khomem_mkdesc(table);
+ if (cmpxchg(desc, 0, val))
+ /* Someone else already allocated it. */
+ free_page((unsigned long)table);
+ }
+
+ return 0;
+}
+
+static khomem_desc_t *khomem_table_alloc(khomem_desc_t *desc,
+ unsigned long address,
+ unsigned int level,
+ unsigned int order)
+{
+ if (__khomem_table_alloc(desc))
+ return NULL;
+
+ return khomem_table_offset(desc, address, level, order);
+}
+
+static int khomem_preserve(khomem_desc_t *l4, unsigned long pfn,
+ unsigned int order)
+{
+ unsigned long address = PFN_PHYS(pfn);
+ khomem_desc_t *l4p, *l3p, *l2p;
+ int ret;
+
+ l4p = khomem_table_alloc(l4, address, 4, order);
+ if (!l4p)
+ return -ENOMEM;
+
+ l3p = khomem_table_alloc(l4p, address, 3, order);
+ if (!l3p)
+ return -ENOMEM;
+
+ l2p = khomem_table_alloc(l3p, address, 2, order);
+ if (!l2p)
+ return -ENOMEM;
+
+ /*
+ * The L1 table is handled differently since it is a bitmap, not a table of
+ * descriptors. So offsetting into it directly does not work.
+ */
+ ret = __khomem_table_alloc(l2p);
+ if (ret)
+ return ret;
+
+ khomem_bitmap_preserve(l2p, address, order);
+ return 0;
+}
+
+/* TODO: Clean up empty tables eventually. */
+static void khomem_unpreserve(khomem_desc_t *l4, unsigned long pfn,
+ unsigned int order)
+{
+ unsigned long address = PFN_PHYS(pfn);
+ khomem_desc_t *l4p, *l3p, *l2p;
+
+ if (khomem_desc_none(*l4))
+ return;
+
+ l4p = khomem_table_offset(l4, address, 4, order);
+ if (khomem_desc_none(*l4p))
+ return;
+
+ l3p = khomem_table_offset(l4p, address, 3, order);
+ if (khomem_desc_none(*l3p))
+ return;
+
+ l2p = khomem_table_offset(l3p, address, 2, order);
+ if (khomem_desc_none(*l2p))
+ return;
+
+ khomem_bitmap_unpreserve(l2p, address, order);
+}
+
+static int khomem_walk_l1(unsigned long *table, unsigned long addr,
+ unsigned int order, khomem_walk_fn_t fn, void *arg)
+{
+ int ret, i;
+
+ for_each_set_bit(i, table, KHOMEM_L1_BITS) {
+ ret = fn(addr + (i * (PAGE_SIZE << order)), order, arg);
+ if (ret)
+ return ret;
+ }
+
+ return 0;
+}
+
+static int __khomem_walk_table(khomem_desc_t *base, unsigned int level,
+ unsigned long addr, unsigned int order,
+ khomem_walk_fn_t fn, void *arg)
+{
+ unsigned long block = (1UL << (khomem_level_shifts[level] + order));
+ khomem_desc_t *cur;
+ int ret;
+
+ if (level == 1)
+ return khomem_walk_l1(base, addr, order, fn, arg);
+
+ for (cur = base; cur < base + PTRS_PER_LEVEL; cur++, addr += block) {
+ if (!khomem_desc_none(*cur)) {
+ ret = __khomem_walk_table(khomem_table(*cur), level - 1,
+ addr, order, fn, arg);
+ if (ret)
+ return ret;
+ }
+ }
+
+ return 0;
+}
+
+static int khomem_walk_preserved(khomem_desc_t *l4, unsigned int order,
+ khomem_walk_fn_t fn, void *arg)
+{
+ if (khomem_desc_none(*l4))
+ return 0;
+
+ return __khomem_walk_table(khomem_table(*l4), 4, 0, order, fn, arg);
+}
+
+struct kho_mem_track {
+ /* Points to L4 KHOMEM descriptor, each order gets its own table. */
+ struct xarray orders;
+};
+
+static struct kho_mem_track kho_mem_track;
+
+static void *xa_load_or_alloc(struct xarray *xa, unsigned long index, size_t sz)
+{
+ void *elm, *res;
+
+ elm = xa_load(xa, index);
+ if (elm)
+ return elm;
+
+ elm = kzalloc(sz, GFP_KERNEL);
+ if (!elm)
+ return ERR_PTR(-ENOMEM);
+
+ res = xa_cmpxchg(xa, index, NULL, elm, GFP_KERNEL);
+ if (xa_is_err(res))
+ res = ERR_PTR(xa_err(res));
+
+ if (res) {
+ kfree(elm);
+ return res;
+ }
+
+ return elm;
+}
+
+static void __kho_unpreserve(struct kho_mem_track *tracker, unsigned long pfn,
+ unsigned int order)
+{
+ khomem_desc_t *l4;
+
+ l4 = xa_load(&tracker->orders, order);
+ if (!l4)
+ return;
+
+ khomem_unpreserve(l4, pfn, order);
+}
+
+static int __kho_preserve(struct kho_mem_track *tracker, unsigned long pfn,
+ unsigned int order)
+{
+ khomem_desc_t *l4;
+
+ might_sleep();
+
+ l4 = xa_load_or_alloc(&tracker->orders, order, sizeof(*l4));
+ if (IS_ERR(l4))
+ return PTR_ERR(l4);
+
+ khomem_preserve(l4, pfn, order);
+
+ return 0;
+}
+
+/**
+ * kho_preserve_folio - preserve a folio across KHO.
+ * @folio: folio to preserve
+ *
+ * Records that the entire folio is preserved across KHO. The order
+ * will be preserved as well.
+ *
+ * Return: 0 on success, error code on failure
+ */
+int kho_preserve_folio(struct folio *folio)
+{
+ unsigned long pfn = folio_pfn(folio);
+ unsigned int order = folio_order(folio);
+ int err;
+
+ if (!kho_enable)
+ return -EOPNOTSUPP;
+
+ down_read(&kho_out.tree_lock);
+ if (kho_out.fdt) {
+ err = -EBUSY;
+ goto unlock;
+ }
+
+ err = __kho_preserve(&kho_mem_track, pfn, order);
+
+unlock:
+ up_read(&kho_out.tree_lock);
+
+ return err;
+}
+EXPORT_SYMBOL_GPL(kho_preserve_folio);
+
+/**
+ * kho_unpreserve_folio - unpreserve a folio
+ * @folio: folio to unpreserve
+ *
+ * Remove the record of a folio previously preserved by kho_preserve_folio().
+ *
+ * Return: 0 on success, error code on failure
+ */
+int kho_unpreserve_folio(struct folio *folio)
+{
+ unsigned long pfn = folio_pfn(folio);
+ unsigned int order = folio_order(folio);
+ int err = 0;
+
+ down_read(&kho_out.tree_lock);
+ if (kho_out.fdt) {
+ err = -EBUSY;
+ goto unlock;
+ }
+
+ __kho_unpreserve(&kho_mem_track, pfn, order);
+
+unlock:
+ up_read(&kho_out.tree_lock);
+
+ return err;
+}
+EXPORT_SYMBOL_GPL(kho_unpreserve_folio);
+
+/**
+ * kho_preserve_phys - preserve a physically contiguous range across KHO.
+ * @phys: physical address of the range
+ * @size: size of the range
+ *
+ * Records that the entire range from @phys to @phys + @size is preserved
+ * across KHO.
+ *
+ * Return: 0 on success, error code on failure
+ */
+int kho_preserve_phys(phys_addr_t phys, size_t size)
+{
+ unsigned long pfn = PHYS_PFN(phys), end_pfn = PHYS_PFN(phys + size);
+ unsigned int order = ilog2(end_pfn - pfn);
+ unsigned long failed_pfn;
+ int err = 0;
+
+ if (!kho_enable)
+ return -EOPNOTSUPP;
+
+ down_read(&kho_out.tree_lock);
+ if (kho_out.fdt) {
+ err = -EBUSY;
+ goto unlock;
+ }
+
+ for (; pfn < end_pfn;
+ pfn += (1 << order), order = ilog2(end_pfn - pfn)) {
+ err = __kho_preserve(&kho_mem_track, pfn, order);
+ if (err) {
+ failed_pfn = pfn;
+ break;
+ }
+ }
+
+ if (err)
+ for (pfn = PHYS_PFN(phys); pfn < failed_pfn;
+ pfn += (1 << order), order = ilog2(end_pfn - pfn))
+ __kho_unpreserve(&kho_mem_track, pfn, order);
+
+unlock:
+ up_read(&kho_out.tree_lock);
+
+ return err;
+}
+EXPORT_SYMBOL_GPL(kho_preserve_phys);
+
+/**
+ * kho_unpreserve_phys - unpreserve a physically contiguous range
+ * @phys: physical address of the range
+ * @size: size of the range
+ *
+ * Remove the record of a range previously preserved by kho_preserve_phys().
+ *
+ * Return: 0 on success, error code on failure
+ */
+int kho_unpreserve_phys(phys_addr_t phys, size_t size)
+{
+ unsigned long pfn = PHYS_PFN(phys), end_pfn = PHYS_PFN(phys + size);
+ unsigned int order = ilog2(end_pfn - pfn);
+ int err = 0;
+
+ down_read(&kho_out.tree_lock);
+ if (kho_out.fdt) {
+ err = -EBUSY;
+ goto unlock;
+ }
+
+ for (; pfn < end_pfn; pfn += (1 << order), order = ilog2(end_pfn - pfn))
+ __kho_unpreserve(&kho_mem_track, pfn, order);
+
+unlock:
+ up_read(&kho_out.tree_lock);
+
+ return err;
+}
+EXPORT_SYMBOL_GPL(kho_unpreserve_phys);
+
+/* almost as free_reserved_page(), just don't free the page */
+static void kho_restore_page(struct page *page)
+{
+ ClearPageReserved(page);
+ init_page_count(page);
+ adjust_managed_page_count(page, 1);
+}
+
+struct folio *kho_restore_folio(phys_addr_t phys)
+{
+ struct page *page = pfn_to_online_page(PHYS_PFN(phys));
+ unsigned long order;
+
+ if (!page)
+ return NULL;
+
+ order = page->private;
+ if (order)
+ prep_compound_page(page, order);
+ else
+ kho_restore_page(page);
+
+ return page_folio(page);
+}
+EXPORT_SYMBOL_GPL(kho_restore_folio);
+
+void *kho_restore_phys(phys_addr_t phys, size_t size)
+{
+ unsigned long start_pfn, end_pfn, pfn;
+ void *va = __va(phys);
+
+ start_pfn = PFN_DOWN(phys);
+ end_pfn = PFN_UP(phys + size);
+
+ for (pfn = start_pfn; pfn < end_pfn; pfn++) {
+ struct page *page = pfn_to_online_page(pfn);
+
+ if (!page)
+ return NULL;
+ kho_restore_page(page);
+ }
+
+ return va;
+}
+EXPORT_SYMBOL_GPL(kho_restore_phys);
+
+static void kho_mem_serialize(void)
+{
+ struct kho_mem_track *tracker = &kho_mem_track;
+ khomem_desc_t *desc;
+ unsigned long order;
+
+ xa_for_each(&tracker->orders, order, desc) {
+ if (WARN_ON(order >= NR_PAGE_ORDERS))
+ break;
+ kho_out.mem_tables[order] = *desc;
+ }
+}
+
+static int kho_mem_deser_walk_fn(unsigned long phys, unsigned int order,
+ void *arg)
+{
+ struct page *page = phys_to_page(phys);
+ unsigned long sz = 1UL << (PAGE_SHIFT + order);
+
+ memblock_reserve(phys, sz);
+ memblock_reserved_mark_noinit(phys, sz);
+ page->private = order;
+
+ return 0;
+}
+
+static void __init kho_mem_deserialize(void)
+{
+ struct kho_in_node preserved_mem;
+ const unsigned long *tables;
+ unsigned int nr_tables;
+ int err, order;
+ u32 len;
+
+ err = kho_get_node(NULL, "preserved-memory", &preserved_mem);
+ if (err) {
+ pr_err("no preserved-memory node: %d\n", err);
+ return;
+ }
+
+ tables = kho_get_prop(&preserved_mem, "metadata", &len);
+ if (!tables || len % sizeof(*tables)) {
+ pr_err("failed to get preserved memory table\n");
+ return;
+ }
+
+ nr_tables = min_t(unsigned int, len / sizeof(*tables), NR_PAGE_ORDERS);
+ for (order = 0; order < nr_tables; order++)
+ khomem_walk_preserved((khomem_desc_t *)&tables[order], order,
+ kho_mem_deser_walk_fn, NULL);
+}
+
/* Helper functions for KHO state tree */
struct kho_prop {
@@ -542,6 +1101,8 @@ static int kho_unfreeze(void)
kho_out.fdt = NULL;
up_write(&kho_out.tree_lock);
+ memset(kho_out.mem_tables, 0, sizeof(kho_out.mem_tables));
+
if (fdt)
kvfree(fdt);
@@ -648,6 +1209,8 @@ static int kho_finalize(void)
kho_out.fdt = fdt;
up_write(&kho_out.tree_lock);
+ kho_mem_serialize();
+
err = kho_convert_tree(fdt, kho_out.fdt_max);
unfreeze:
@@ -829,6 +1392,10 @@ static __init int kho_init(void)
kho_out.root.name = "";
err = kho_add_string_prop(&kho_out.root, "compatible", "kho-v1");
+ err |= kho_add_prop(&kho_out.preserved_memory, "metadata",
+ kho_out.mem_tables, sizeof(kho_out.mem_tables));
+ err |= kho_add_node(&kho_out.root, "preserved-memory",
+ &kho_out.preserved_memory);
if (err)
goto err_free_scratch;
@@ -1079,10 +1646,12 @@ static void __init kho_release_scratch(void)
void __init kho_memory_init(void)
{
- if (!kho_get_fdt())
+ if (!kho_get_fdt()) {
kho_reserve_scratch();
- else
+ } else {
+ kho_mem_deserialize();
kho_release_scratch();
+ }
}
void __init kho_populate(phys_addr_t handover_fdt_phys,
--
2.47.1
^ permalink raw reply related [flat|nested] 103+ messages in thread
* Re: [PATCH v5 09/16] kexec: enable KHO support for memory preservation
2025-04-03 15:50 ` Pratyush Yadav
@ 2025-04-03 16:10 ` Jason Gunthorpe
2025-04-03 17:37 ` Pratyush Yadav
2025-04-09 8:35 ` Mike Rapoport
0 siblings, 2 replies; 103+ messages in thread
From: Jason Gunthorpe @ 2025-04-03 16:10 UTC (permalink / raw)
To: Pratyush Yadav
Cc: Changyuan Lyu, linux-kernel, graf, akpm, luto, anthony.yznaga,
arnd, ashish.kalra, benh, bp, catalin.marinas, dave.hansen, dwmw2,
ebiederm, mingo, jgowans, corbet, krzk, rppt, mark.rutland,
pbonzini, pasha.tatashin, hpa, peterz, robh+dt, robh, saravanak,
skinsburskii, rostedt, tglx, thomas.lendacky, usama.arif, will,
devicetree, kexec, linux-arm-kernel, linux-doc, linux-mm, x86
On Thu, Apr 03, 2025 at 03:50:04PM +0000, Pratyush Yadav wrote:
> The patch currently has a limitation where it does not free any of the
> empty tables after a unpreserve operation. But Changyuan's patch also
> doesn't do it so at least it is not any worse off.
Why do we even have unpreserve? Just discard the entire KHO operation
in bulk.
> When working on this patch, I realized that kho_mem_deserialize() is
> currently _very_ slow. It takes over 2 seconds to make memblock
> reservations for 48 GiB of 0-order pages. I suppose this can later be
> optimized by teaching memblock_free_all() to skip preserved pages
> instead of making memblock reservations.
Yes, this was my prior point of not having actual data to know what
the actual hot spots are.. This saves a few ms on an operation that
takes over 2 seconds :)
> +typedef unsigned long khomem_desc_t;
This should be more like:
union {
void *table;
phys_addr_t table_phys;
};
Since we are not using the low bits right now and it is a lot cheaper
to convert from va to phys only once during the final step. __va is
not exactly fast.
> +#define PTRS_PER_LEVEL (PAGE_SIZE / sizeof(unsigned long))
> +#define KHOMEM_L1_BITS (PAGE_SIZE * BITS_PER_BYTE)
> +#define KHOMEM_L1_MASK ((1 << ilog2(KHOMEM_L1_BITS)) - 1)
> +#define KHOMEM_L1_SHIFT (PAGE_SHIFT)
> +#define KHOMEM_L2_SHIFT (KHOMEM_L1_SHIFT + ilog2(KHOMEM_L1_BITS))
> +#define KHOMEM_L3_SHIFT (KHOMEM_L2_SHIFT + ilog2(PTRS_PER_LEVEL))
> +#define KHOMEM_L4_SHIFT (KHOMEM_L3_SHIFT + ilog2(PTRS_PER_LEVEL))
> +#define KHOMEM_PFN_MASK PAGE_MASK
This all works better if you just use GENMASK and FIELD_GET
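For the order-0 layout described in the comment that would be roughly (a
sketch only; folding in the per-order shift is left aside here):

#include <linux/bitfield.h>

/* bits 35:27 index the L2 table for 4K, order-0 pages */
#define KHOMEM_L2_IDX	GENMASK_ULL(35, 27)

static inline unsigned long khomem_l2_index(unsigned long address)
{
	return FIELD_GET(KHOMEM_L2_IDX, address);
}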
> +static int __khomem_table_alloc(khomem_desc_t *desc)
> +{
> + if (khomem_desc_none(*desc)) {
Needs READ_ONCE
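i.e. something along these lines (the same function, sketched with the read
annotated):

static int __khomem_table_alloc(khomem_desc_t *desc)
{
	/* snapshot the descriptor once; cmpxchg() below still resolves races */
	khomem_desc_t old = READ_ONCE(*desc);

	if (khomem_desc_none(old)) {
		khomem_desc_t *table, val;

		table = (khomem_desc_t *)get_zeroed_page(GFP_KERNEL);
		if (!table)
			return -ENOMEM;

		val = khomem_mkdesc(table);
		if (cmpxchg(desc, 0, val))
			/* someone else already allocated it */
			free_page((unsigned long)table);
	}

	return 0;
}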
> +struct kho_mem_track {
> + /* Points to L4 KHOMEM descriptor, each order gets its own table. */
> + struct xarray orders;
> +};
I think it would be easy to add a 5th level and just use bits 63:57 as
a 6 bit order. Then you don't need all this stuff either.
> +int kho_preserve_folio(struct folio *folio)
> +{
> + unsigned long pfn = folio_pfn(folio);
> + unsigned int order = folio_order(folio);
> + int err;
> +
> + if (!kho_enable)
> + return -EOPNOTSUPP;
> +
> + down_read(&kho_out.tree_lock);
This lock still needs to go away
> +static void kho_mem_serialize(void)
> +{
> + struct kho_mem_track *tracker = &kho_mem_track;
> + khomem_desc_t *desc;
> + unsigned long order;
> +
> + xa_for_each(&tracker->orders, order, desc) {
> + if (WARN_ON(order >= NR_PAGE_ORDERS))
> + break;
> + kho_out.mem_tables[order] = *desc;
Missing the virt_to_phys?
> + nr_tables = min_t(unsigned int, len / sizeof(*tables), NR_PAGE_ORDERS);
> + for (order = 0; order < nr_tables; order++)
> + khomem_walk_preserved((khomem_desc_t *)&tables[order], order,
Missing phys_to_virt
Please don't remove the KHOSER stuff, and do use it with proper
structs and types. It is part of keeping this stuff understandable.
Jason
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 09/16] kexec: enable KHO support for memory preservation
2025-04-03 16:10 ` Jason Gunthorpe
@ 2025-04-03 17:37 ` Pratyush Yadav
2025-04-04 12:54 ` Jason Gunthorpe
2025-04-09 8:35 ` Mike Rapoport
1 sibling, 1 reply; 103+ messages in thread
From: Pratyush Yadav @ 2025-04-03 17:37 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Changyuan Lyu, linux-kernel, graf, akpm, luto, anthony.yznaga,
arnd, ashish.kalra, benh, bp, catalin.marinas, dave.hansen, dwmw2,
ebiederm, mingo, jgowans, corbet, krzk, rppt, mark.rutland,
pbonzini, pasha.tatashin, hpa, peterz, robh+dt, robh, saravanak,
skinsburskii, rostedt, tglx, thomas.lendacky, usama.arif, will,
devicetree, kexec, linux-arm-kernel, linux-doc, linux-mm, x86
On Thu, Apr 03 2025, Jason Gunthorpe wrote:
> On Thu, Apr 03, 2025 at 03:50:04PM +0000, Pratyush Yadav wrote:
>
>> The patch currently has a limitation where it does not free any of the
>> empty tables after a unpreserve operation. But Changyuan's patch also
>> doesn't do it so at least it is not any worse off.
>
> We do we even have unpreserve? Just discard the entire KHO operation
> in a bulk.
Yeah, I guess that makes sense.
>
>> When working on this patch, I realized that kho_mem_deserialize() is
>> currently _very_ slow. It takes over 2 seconds to make memblock
>> reservations for 48 GiB of 0-order pages. I suppose this can later be
>> optimized by teaching memblock_free_all() to skip preserved pages
>> instead of making memblock reservations.
>
> Yes, this was my prior point of not having actual data to know what
> the actual hot spots are.. This saves a few ms on an operation that
> takes over 2 seconds :)
Yes, you're right. But for 2.5 days of work it isn't too shabby :-)
And I think this will help make the 2 seconds much smaller as well later
down the line since we can now find out if a given page is reserved in a
few operations, and do it in parallel.
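(For example, a lookup reusing the helpers from the patch could look roughly
like this; a sketch only, not something the posted patch provides:)

static bool khomem_pfn_preserved(khomem_desc_t *l4, unsigned long pfn,
				 unsigned int order)
{
	unsigned long address = PFN_PHYS(pfn);
	khomem_desc_t *p = l4;
	unsigned int level;

	/* walk L4 -> L3 -> L2, checking that each next table exists */
	for (level = 4; level >= 2; level--) {
		if (khomem_desc_none(*p))
			return false;
		p = khomem_table_offset(p, address, level, order);
	}

	/* p now points at the descriptor of the L1 bitmap */
	if (khomem_desc_none(*p))
		return false;

	return test_bit(khomem_table_index(address, 1, order),
			khomem_table(*p));
}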
>
>> +typedef unsigned long khomem_desc_t;
>
> This should be more like:
>
> union {
> void *table;
> phys_addr_t table_phys;
> };
>
> Since we are not using the low bits right now and it is alot cheaper
> to convert from va to phys only once during the final step. __va is
> not exactly fast.
The descriptor is used on _every_ level of the table, not just the top.
So if we use virtual addresses, at serialize time we would have to walk
the whole table and convert all addresses to physical. And __va() does
not seem to be doing too much. On x86, it expands to:
#define __va(x) ((void *)((unsigned long)(x)+PAGE_OFFSET))
and on ARM64 to:
#define __va(x) ((void *)__phys_to_virt((phys_addr_t)(x)))
#define __phys_to_virt(x) ((unsigned long)((x) - PHYS_OFFSET) | PAGE_OFFSET)
So only some addition and bitwise or. Should be fast enough I reckon.
Maybe walking the table once is faster than calculating va every time,
but that walking would happen in the blackout time, and we'd need more data
on whether that optimization is worth it.
>
>> +#define PTRS_PER_LEVEL (PAGE_SIZE / sizeof(unsigned long))
>> +#define KHOMEM_L1_BITS (PAGE_SIZE * BITS_PER_BYTE)
>> +#define KHOMEM_L1_MASK ((1 << ilog2(KHOMEM_L1_BITS)) - 1)
>> +#define KHOMEM_L1_SHIFT (PAGE_SHIFT)
>> +#define KHOMEM_L2_SHIFT (KHOMEM_L1_SHIFT + ilog2(KHOMEM_L1_BITS))
>> +#define KHOMEM_L3_SHIFT (KHOMEM_L2_SHIFT + ilog2(PTRS_PER_LEVEL))
>> +#define KHOMEM_L4_SHIFT (KHOMEM_L3_SHIFT + ilog2(PTRS_PER_LEVEL))
>> +#define KHOMEM_PFN_MASK PAGE_MASK
>
> This all works better if you just use GENMASK and FIELD_GET
I suppose yes. Though the masks need to be shifted by page order, so we need
to be careful. Will take a look.
>
>> +static int __khomem_table_alloc(khomem_desc_t *desc)
>> +{
>> + if (khomem_desc_none(*desc)) {
>
> Needs READ_ONCE
ACK, will add.
>
>> +struct kho_mem_track {
>> + /* Points to L4 KHOMEM descriptor, each order gets its own table. */
>> + struct xarray orders;
>> +};
>
> I think it would be easy to add a 5th level and just use bits 63:57 as
> a 6 bit order. Then you don't need all this stuff either.
I am guessing you mean to store the order in the table descriptor
itself, instead of having a different table for each order. I don't
think that would work since say you have a level 1 table spanning 128
MiB. You can have pages of different orders in that 128 MiB, and have no
way of knowing which is which. To have all orders in one table, we would
need more than one bit per page at the lowest level.
Though now that I think of it, it is probably much simpler to just use
khomem_desc_t orders[NR_PAGE_ORDERS] instead of the xarray.
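Roughly (sketch only):

struct kho_mem_track {
	/* one top-level (L4) descriptor per order, no xarray needed */
	khomem_desc_t orders[NR_PAGE_ORDERS];
};

static int __kho_preserve(struct kho_mem_track *tracker, unsigned long pfn,
			  unsigned int order)
{
	if (WARN_ON(order >= NR_PAGE_ORDERS))
		return -EINVAL;

	return khomem_preserve(&tracker->orders[order], pfn, order);
}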
>
>> +int kho_preserve_folio(struct folio *folio)
>> +{
>> + unsigned long pfn = folio_pfn(folio);
>> + unsigned int order = folio_order(folio);
>> + int err;
>> +
>> + if (!kho_enable)
>> + return -EOPNOTSUPP;
>> +
>> + down_read(&kho_out.tree_lock);
>
> This lock still needs to go away
Agree. I hope Changyuan's next version fixes it. I didn't really touch
any of these functions.
>
>> +static void kho_mem_serialize(void)
>> +{
>> + struct kho_mem_track *tracker = &kho_mem_track;
>> + khomem_desc_t *desc;
>> + unsigned long order;
>> +
>> + xa_for_each(&tracker->orders, order, desc) {
>> + if (WARN_ON(order >= NR_PAGE_ORDERS))
>> + break;
>> + kho_out.mem_tables[order] = *desc;
>
> Missing the virt_to_phys?
Nope. This isn't storing the pointer to the descriptor, but the _value_
of the descriptor -- so it already contains the physical address of the
level 4 table.
>
>> + nr_tables = min_t(unsigned int, len / sizeof(*tables), NR_PAGE_ORDERS);
>> + for (order = 0; order < nr_tables; order++)
>> + khomem_walk_preserved((khomem_desc_t *)&tables[order], order,
>
> Missing phys_to_virt
Same as above. tables contains the _values_ of the descriptors, which
already hold physical addresses that we turn into virtual ones in
khomem_table().
>
> Please dont' remove the KHOSER stuff, and do use it with proper
> structs and types. It is part of keeping this stuff understandable.
I didn't see any need for KHOSER stuff here to be honest. The only time
we deal with KHO pointers is with table addresses, and that is already
well abstracted in khomem_mkdesc() (I suppose that can require a
khomem_desc_t * instead of a void *, but beyond that it is quite easy to
understand IMO).
--
Regards,
Pratyush Yadav
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 09/16] kexec: enable KHO support for memory preservation
2025-04-03 14:24 ` Jason Gunthorpe
@ 2025-04-04 9:54 ` Mike Rapoport
2025-04-04 12:47 ` Jason Gunthorpe
0 siblings, 1 reply; 103+ messages in thread
From: Mike Rapoport @ 2025-04-04 9:54 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Pratyush Yadav, Changyuan Lyu, linux-kernel, graf, akpm, luto,
anthony.yznaga, arnd, ashish.kalra, benh, bp, catalin.marinas,
dave.hansen, dwmw2, ebiederm, mingo, jgowans, corbet, krzk,
mark.rutland, pbonzini, pasha.tatashin, hpa, peterz, robh+dt,
robh, saravanak, skinsburskii, rostedt, tglx, thomas.lendacky,
usama.arif, will, devicetree, kexec, linux-arm-kernel, linux-doc,
linux-mm, x86
Hi Jason,
On Thu, Apr 03, 2025 at 11:24:38AM -0300, Jason Gunthorpe wrote:
> On Thu, Apr 03, 2025 at 04:58:27PM +0300, Mike Rapoport wrote:
> > On Thu, Apr 03, 2025 at 08:42:09AM -0300, Jason Gunthorpe wrote:
> > > On Wed, Apr 02, 2025 at 07:16:27PM +0000, Pratyush Yadav wrote:
> > > > > +int kho_preserve_phys(phys_addr_t phys, size_t size)
> > > > > +{
> > > > > + unsigned long pfn = PHYS_PFN(phys), end_pfn = PHYS_PFN(phys + size);
> > > > > + unsigned int order = ilog2(end_pfn - pfn);
> > > >
> > > > This caught my eye when playing around with the code. It does not put
> > > > any limit on the order, so it can exceed NR_PAGE_ORDERS. Also, when
> > > > initializing the page after KHO, we pass the order directly to
> > > > prep_compound_page() without sanity checking it. The next kernel might
> > > > not support all the orders the current one supports. Perhaps something
> > > > to fix?
> > >
> > > IMHO we should delete the phys functions until we get a user of them
> >
> > The only user of memory tracker in this series uses kho_preserve_phys()
>
> But it really shouldn't. The reserved memory is a completely different
> mechanism than buddy allocator preservation. It doesn't even call
> kho_restore_phys() those pages, it just feeds the ranges directly to:
>
> + reserved_mem_add(*p_start, size, name);
>
> The bitmaps should be understood as preserving memory from the buddy
> allocator only.
>
> IMHO it should not call kho_preserve_phys() at all.
Do you mean that for preserving large physical ranges we need something
entirely different?
Then we don't need the bitmaps at this point, as we don't have any users
for kho_preserve_folio() and we should not worry ourself with orders and
restoration of high order folios until then ;-)
Now more seriously, I considered the bitmaps and sparse xarrays as good
initial implementation of memory preservation that can do both physical
ranges now and folios later when we'll need them. It might not be the
optimal solution in the long run but we don't have enough data right now to
do any optimizations for real. Preserving huge amounts of order-0 pages
does not seem to me a representative test case at all.
The xarrays + bitmaps do have the limitation that we cannot store any
information about the folio except its order, and if we anyway need
something else to preserve physical ranges, I suggest starting with
preserving ranges and then adding optimizations for the folio case.
As I've mentioned earlier, maple tree is perfect for tracking ranges: it
is simpler than the other alternatives, it allows storing information
about a range, and it gives easy and efficient coalescing of adjacent
ranges with matching properties. The maple tree based memory tracker is
less memory efficient than bitmaps if we count how much data is required
to preserve gigabytes of distinct order-0 pages, but I don't think this
is the right thing to measure, at least until we have some real data
about how KHO is used.
Here's something that implements preservation of ranges (compile tested
only); adding folios with their orders and maybe other information would
be quite easy.
/*
 * Keep track of memory that is to be preserved across KHO.
 *
 * For simplicity use a maple tree that conveniently stores ranges and
 * allows adding BITS_PER_XA_VALUE of metadata to each range.
 */
struct kho_mem_track {
	struct maple_tree ranges;
};

static struct kho_mem_track kho_mem_track;
typedef unsigned long kho_range_desc_t;

static int __kho_preserve(struct kho_mem_track *tracker, unsigned long addr,
			  size_t size, kho_range_desc_t info)
{
	struct maple_tree *ranges = &tracker->ranges;
	MA_STATE(mas, ranges, addr - 1, addr + size + 1);
	unsigned long lower, upper;
	void *area = NULL;

	lower = addr;
	upper = addr + size - 1;

	might_sleep();

	area = mas_walk(&mas);
	if (area && mas.last == addr - 1)
		lower = mas.index;

	area = mas_next(&mas, ULONG_MAX);
	if (area && mas.index == addr + size)
		upper = mas.last;

	mas_set_range(&mas, lower, upper);
	return mas_store_gfp(&mas, xa_mk_value(info), GFP_KERNEL);
}

/**
 * kho_preserve_phys - preserve a physically contiguous range across KHO.
 * @phys: physical address of the range
 * @size: size of the range
 *
 * Records that the entire range from @phys to @phys + @size is preserved
 * across KHO.
 *
 * Return: 0 on success, error code on failure
 */
int kho_preserve_phys(phys_addr_t phys, size_t size)
{
	return __kho_preserve(&kho_mem_track, phys, size, 0);
}
EXPORT_SYMBOL_GPL(kho_preserve_phys);

#define KHOSER_PTR(type)	union { phys_addr_t phys; type ptr; }
#define KHOSER_STORE_PTR(dest, val)			\
	({						\
		(dest).phys = virt_to_phys(val);	\
		typecheck(typeof((dest).ptr), val);	\
	})
#define KHOSER_LOAD_PTR(src) \
	((src).phys ? (typeof((src).ptr))phys_to_virt((src).phys) : NULL)

struct khoser_mem_range {
	phys_addr_t start;
	phys_addr_t size;
	unsigned long data;
};

struct khoser_mem_chunk_hdr {
	KHOSER_PTR(struct khoser_mem_chunk *) next;
	unsigned long num_ranges;
};

#define KHOSER_RANGES_SIZE					\
	((PAGE_SIZE - sizeof(struct khoser_mem_chunk_hdr)) /	\
	 sizeof(struct khoser_mem_range))

struct khoser_mem_chunk {
	struct khoser_mem_chunk_hdr hdr;
	struct khoser_mem_range ranges[KHOSER_RANGES_SIZE];
};

static int new_chunk(struct khoser_mem_chunk **cur_chunk)
{
	struct khoser_mem_chunk *chunk;

	chunk = kzalloc(sizeof(*chunk), GFP_KERNEL);
	if (!chunk)
		return -ENOMEM;

	if (*cur_chunk)
		KHOSER_STORE_PTR((*cur_chunk)->hdr.next, chunk);
	*cur_chunk = chunk;
	return 0;
}

/*
 * Record all the ranges in a linked list of pages for the next kernel to
 * process. Each chunk holds an array of ranges. The maple_tree is used to
 * store them in a tree while building up the data structure, but the KHO
 * successor kernel only needs to process them once in order.
 *
 * All of this memory is normal kmalloc() memory and is not marked for
 * preservation. The successor kernel will remain isolated to the scratch
 * space until it completes processing this list. Once processed all the
 * memory storing these ranges will be marked as free.
 */
static int kho_mem_serialize(phys_addr_t *fdt_value)
{
	struct kho_mem_track *tracker = &kho_mem_track;
	struct maple_tree *ranges = &tracker->ranges;
	struct khoser_mem_chunk *first_chunk = NULL;
	struct khoser_mem_chunk *chunk = NULL;
	MA_STATE(mas, ranges, 0, ULONG_MAX);
	void *entry;
	int err;

	mas_for_each(&mas, entry, ULONG_MAX) {
		size_t size = mas.last - mas.index + 1;
		struct khoser_mem_range *range;

		/* Allocate the first chunk, and a new one whenever the
		 * current chunk is full. */
		if (!chunk || chunk->hdr.num_ranges == ARRAY_SIZE(chunk->ranges)) {
			err = new_chunk(&chunk);
			if (err)
				goto err_free;
			if (!first_chunk)
				first_chunk = chunk;
		}

		range = &chunk->ranges[chunk->hdr.num_ranges];
		range->start = mas.index;
		range->size = size;
		range->data = xa_to_value(entry);
		chunk->hdr.num_ranges++;
	}

	*fdt_value = virt_to_phys(first_chunk);
	return 0;

err_free:
	chunk = first_chunk;
	while (chunk) {
		struct khoser_mem_chunk *tmp = chunk;

		chunk = KHOSER_LOAD_PTR(chunk->hdr.next);
		kfree(tmp);
	}
	return err;
}

static void __init deserialize_range(struct khoser_mem_range *range)
{
	memblock_reserved_mark_noinit(range->start, range->size);
	memblock_reserve(range->start, range->size);
}

static void __init kho_mem_deserialize(void)
{
	const void *fdt = kho_get_fdt();
	struct khoser_mem_chunk *chunk;
	const phys_addr_t *mem;
	int len, node;

	if (!fdt)
		return;

	node = fdt_path_offset(fdt, "/preserved-memory");
	if (node < 0) {
		pr_err("no preserved-memory node: %d\n", node);
		return;
	}

	mem = fdt_getprop(fdt, node, "metadata", &len);
	if (!mem || len != sizeof(*mem)) {
		pr_err("failed to get preserved memory metadata\n");
		return;
	}

	chunk = phys_to_virt(*mem);
	while (chunk) {
		unsigned int i;

		memblock_reserve(virt_to_phys(chunk), sizeof(*chunk));
		for (i = 0; i != chunk->hdr.num_ranges; i++)
			deserialize_range(&chunk->ranges[i]);
		chunk = KHOSER_LOAD_PTR(chunk->hdr.next);
	}
}
--
Sincerely yours,
Mike.
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 09/16] kexec: enable KHO support for memory preservation
2025-04-04 9:54 ` Mike Rapoport
@ 2025-04-04 12:47 ` Jason Gunthorpe
2025-04-04 13:53 ` Mike Rapoport
0 siblings, 1 reply; 103+ messages in thread
From: Jason Gunthorpe @ 2025-04-04 12:47 UTC (permalink / raw)
To: Mike Rapoport
Cc: Pratyush Yadav, Changyuan Lyu, linux-kernel, graf, akpm, luto,
anthony.yznaga, arnd, ashish.kalra, benh, bp, catalin.marinas,
dave.hansen, dwmw2, ebiederm, mingo, jgowans, corbet, krzk,
mark.rutland, pbonzini, pasha.tatashin, hpa, peterz, robh+dt,
robh, saravanak, skinsburskii, rostedt, tglx, thomas.lendacky,
usama.arif, will, devicetree, kexec, linux-arm-kernel, linux-doc,
linux-mm, x86
On Fri, Apr 04, 2025 at 12:54:25PM +0300, Mike Rapoport wrote:
> > IMHO it should not call kho_preserve_phys() at all.
>
> Do you mean that for preserving large physical ranges we need something
> entirely different?
If they don't use the buddy allocator, then yes?
> Then we don't need the bitmaps at this point, as we don't have any users
> for kho_preserve_folio() and we should not worry ourself with orders and
> restoration of high order folios until then ;-)
Arguably yes :\
Maybe change the reserved regions code to put the region list in a
folio and preserve the folio instead of using FDT as a "demo" for the
functionality.
> The xarrays + bitmaps do have the limitation that we cannot store any
> information about the folio except its order and if we are anyway need
> something else to preserve physical ranges, I suggest starting with
> preserving ranges and then adding optimizations for the folio case.
Why? What is the use case for physical ranges that isn't handled
entirely by reserved_mem_add()?
We know what the future use case is for the folio preservation, all
the drivers and the iommu are going to rely on this.
> Here's something that implements preservation of ranges (compile tested
> only) and adding folios with their orders and maybe other information would
> be quite easy.
But folios and their orders are the *whole point*; again, I don't see
any use case for preserving ranges, beyond it being a way to optimize
the memblock reserve path. But that path should be fixed up to just
use the bitmap directly.
Jason
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 09/16] kexec: enable KHO support for memory preservation
2025-04-03 17:37 ` Pratyush Yadav
@ 2025-04-04 12:54 ` Jason Gunthorpe
2025-04-04 15:39 ` Pratyush Yadav
0 siblings, 1 reply; 103+ messages in thread
From: Jason Gunthorpe @ 2025-04-04 12:54 UTC (permalink / raw)
To: Pratyush Yadav
Cc: Changyuan Lyu, linux-kernel, graf, akpm, luto, anthony.yznaga,
arnd, ashish.kalra, benh, bp, catalin.marinas, dave.hansen, dwmw2,
ebiederm, mingo, jgowans, corbet, krzk, rppt, mark.rutland,
pbonzini, pasha.tatashin, hpa, peterz, robh+dt, robh, saravanak,
skinsburskii, rostedt, tglx, thomas.lendacky, usama.arif, will,
devicetree, kexec, linux-arm-kernel, linux-doc, linux-mm, x86
On Thu, Apr 03, 2025 at 05:37:06PM +0000, Pratyush Yadav wrote:
> And I think this will help make the 2 seconds much smaller as well later
> down the line since we can now find out if a given page is reserved in a
> few operations, and do it in parallel.
Yes, most certainly
> > This should be more like:
> >
> > union {
> > void *table;
> > phys_addr_t table_phys;
> > };
> >
> > Since we are not using the low bits right now and it is alot cheaper
> > to convert from va to phys only once during the final step. __va is
> > not exactly fast.
>
> The descriptor is used on _every_ level of the table, not just the
> top.
Yes
> So if we use virtual addresses, at serialize time we would have to walk
> the whole table and covert all addresses to physical.
Yes
> And __va() does
> not seem to be doing too much. On x86, it expands to:
>
> #define __va(x) ((void *)((unsigned long)(x)+PAGE_OFFSET))
>
> and on ARM64 to:
>
> #define __va(x) ((void *)__phys_to_virt((phys_addr_t)(x)))
> #define __phys_to_virt(x) ((unsigned long)((x) - PHYS_OFFSET) | PAGE_OFFSET)
Hmm, I was sure sparsemem added a bunch of stuff to this path, maybe
I'm thinking of page_to_phys
> >> +struct kho_mem_track {
> >> + /* Points to L4 KHOMEM descriptor, each order gets its own table. */
> >> + struct xarray orders;
> >> +};
> >
> > I think it would be easy to add a 5th level and just use bits 63:57 as
> > a 6 bit order. Then you don't need all this stuff either.
>
> I am guessing you mean to store the order in the table descriptor
> itself, instead of having a different table for each order.
Not quite, I mean to index the per-order sub trees by using the high
order bits. You still end up with N separate bitmap trees, but instead
of using an xarray to hold their top pointers you hold them in a 5th
level.
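Roughly (untested sketch, KHOMEM_ORDER_SHIFT and khomem_key() are made-up
names): fold the order into the high bits of the key so the normal table
walk takes care of the per-order split:

#define KHOMEM_ORDER_SHIFT	57

static inline u64 khomem_key(phys_addr_t phys, unsigned int order)
{
	/* Physical addresses stay well below bit 57, so the order bits
	 * never collide with the address bits. */
	return (u64)phys | ((u64)order << KHOMEM_ORDER_SHIFT);
}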
> Though now that I think of it, it is probably much simpler to just use
> khomem_desc_t orders[NR_PAGE_ORDERS] instead of the xarray.
Which is basically this, but encoding the index to orders in the address
Jason
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 09/16] kexec: enable KHO support for memory preservation
2025-04-04 12:47 ` Jason Gunthorpe
@ 2025-04-04 13:53 ` Mike Rapoport
2025-04-04 14:30 ` Jason Gunthorpe
2025-04-04 16:15 ` Pratyush Yadav
0 siblings, 2 replies; 103+ messages in thread
From: Mike Rapoport @ 2025-04-04 13:53 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Pratyush Yadav, Changyuan Lyu, linux-kernel, graf, akpm, luto,
anthony.yznaga, arnd, ashish.kalra, benh, bp, catalin.marinas,
dave.hansen, dwmw2, ebiederm, mingo, jgowans, corbet, krzk,
mark.rutland, pbonzini, pasha.tatashin, hpa, peterz, robh+dt,
robh, saravanak, skinsburskii, rostedt, tglx, thomas.lendacky,
usama.arif, will, devicetree, kexec, linux-arm-kernel, linux-doc,
linux-mm, x86
On Fri, Apr 04, 2025 at 09:47:29AM -0300, Jason Gunthorpe wrote:
> On Fri, Apr 04, 2025 at 12:54:25PM +0300, Mike Rapoport wrote:
> > > IMHO it should not call kho_preserve_phys() at all.
> >
> > Do you mean that for preserving large physical ranges we need something
> > entirely different?
>
> If they don't use the buddy allocator, then yes?
>
> > Then we don't need the bitmaps at this point, as we don't have any users
> > for kho_preserve_folio() and we should not worry ourself with orders and
> > restoration of high order folios until then ;-)
>
> Arguably yes :\
>
> Maybe change the reserved regions code to put the region list in a
> folio and preserve the folio instead of using FDT as a "demo" for the
> functionality.
Folios are not available when we restore reserved regions, so this just won't
work.
> > The xarrays + bitmaps do have the limitation that we cannot store any
> > information about the folio except its order and if we are anyway need
> > something else to preserve physical ranges, I suggest starting with
> > preserving ranges and then adding optimizations for the folio case.
>
> Why? What is the use case for physical ranges that isn't handled
> entirely by reserved_mem_add()?
>
> We know what the future use case is for the folio preservation, all
> the drivers and the iommu are going to rely on this.
We don't know how much of the preservation will be based on folios.
Most drivers do not use folios and for preserving memfd* and hugetlb we'd
need to have some dance around that memory anyway. So I think
kho_preserve_folio() would be a part of the fdbox or whatever that
functionality will be called.
> > Here's something that implements preservation of ranges (compile tested
> > only) and adding folios with their orders and maybe other information would
> > be quite easy.
>
> But folios and their orders is the *whole point*, again I don't see
> any use case for preserving ranges, beyond it being a way to optimize
> the memblock reserve path. But that path should be fixed up to just
> use the bitmap directly..
Are they?
The purpose of basic KHO is to make sure the memory we want to preserve is
not trampled over. Preserving folios with their orders means we need to
make sure the memory range of the folio is preserved and that we carry
additional information to actually recreate the folio object, in case it
is needed and in case it is possible. Hugetlb, for instance, has its own
way of initializing folios and just keeping the order won't be enough for
that.
As for the optimizations of the memblock reserve path, currently that is
what hurts the most in my and Pratyush's experiments. They are not very
representative, but still, preserving lots of pages/folios spread all
over would take its toll on the mm initialization. And I don't think
invasive changes to how the buddy allocator and the memory map are
initialized are the best way to move forward and optimize that. Quite
possibly we'd want to be able to minimize the number of *ranges* that we
preserve.
So from the three alternatives we have now (xarrays + bitmaps, tables +
bitmaps and maple tree for ranges) maple tree seems to be the simplest and
efficient enough to start with.
Preserving folio orders with it is really straightforward, and until we
see some real data on how the entire KHO machinery is used, I'd prefer
simple over anything else.
> Jason
--
Sincerely yours,
Mike.
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 09/16] kexec: enable KHO support for memory preservation
2025-04-04 13:53 ` Mike Rapoport
@ 2025-04-04 14:30 ` Jason Gunthorpe
2025-04-04 16:24 ` Pratyush Yadav
2025-04-06 16:11 ` Mike Rapoport
2025-04-04 16:15 ` Pratyush Yadav
1 sibling, 2 replies; 103+ messages in thread
From: Jason Gunthorpe @ 2025-04-04 14:30 UTC (permalink / raw)
To: Mike Rapoport
Cc: Pratyush Yadav, Changyuan Lyu, linux-kernel, graf, akpm, luto,
anthony.yznaga, arnd, ashish.kalra, benh, bp, catalin.marinas,
dave.hansen, dwmw2, ebiederm, mingo, jgowans, corbet, krzk,
mark.rutland, pbonzini, pasha.tatashin, hpa, peterz, robh+dt,
robh, saravanak, skinsburskii, rostedt, tglx, thomas.lendacky,
usama.arif, will, devicetree, kexec, linux-arm-kernel, linux-doc,
linux-mm, x86
On Fri, Apr 04, 2025 at 04:53:13PM +0300, Mike Rapoport wrote:
> > Maybe change the reserved regions code to put the region list in a
> > folio and preserve the folio instead of using FDT as a "demo" for the
> > functionality.
>
> Folios are not available when we restore reserved regions, this just won't
> work.
You don't need the folio at that point, you just need the data in the
page.
The folio would be freed after starting up the buddy allocator.
> > We know what the future use case is for the folio preservation, all
> > the drivers and the iommu are going to rely on this.
>
> We don't know how much of the preservation will be based on folios.
I think almost all of it. Where else does memory come from for drivers?
> Most drivers do not use folios
Yes they do, either through kmalloc or through alloc_page/etc. "folio"
here is just some generic word meaning memory from the buddy allocator.
The big question on my mind is if we need a way to preserve slab
objects as well..
> and for preserving memfd* and hugetlb we'd need to have some dance
> around that memory anyway.
memfd is all folios - what do you mean?
hugetlb is moving toward folios.. eg guestmemfd is supposed to be
taking the hugetlb special stuff and turning it into folios.
> So I think kho_preserve_folio() would be a part of the fdbox or
> whatever that functionality will be called.
It is part of KHO. Preserving the folios has to be sequenced with
starting the buddy allocator, and that is KHO's entire responsibility.
I could see something like preserving slab being in a different layer,
built on preserving folios.
> Are they?
> The purpose of basic KHO is to make sure the memory we want to preserve is
> not trampled over. Preserving folios with their orders means we need to
> make sure memory range of the folio is preserved and we carry additional
> information to actually recreate the folio object, in case it is needed and
> in case it is possible. Hughetlb, for instance has its own way initializing
> folios and just keeping the order won't be enough for that.
I expect many things will need a side-car data structure to record that
additional metadata. hugetlb can start with folios, then switch them
over to its non-folio stuff based on its metadata.
The point is the basic low level KHO mechanism is simple folios -
memory from the buddy allocator with a neutral struct folio that the
caller can then customize to its own memory descriptor type on restore.
Eventually restore would allocate a caller specific memdesc and it
wouldn't be "folios" at all. We just don't have the right words yet to
describe this.
> As for the optimizations of memblock reserve path, currently it what hurts
> the most in my and Pratyush experiments. They are not very representative,
> but still, preserving lots of pages/folios spread all over would have it's
> toll on the mm initialization.
> And I don't think invasive changes to how
> buddy and memory map initialization are the best way to move forward and
> optimize that.
I'm pretty sure this is going to be the best performance path, but I
have no idea how invasive it would be to the buddy allocator to make
it work.
> Quite possibly we'd want to be able to minimize amount of *ranges*
> that we preserve.
I'm not sure, that seems backwards to me, we really don't want to have
KHO mem zones! So I think optimizing for, and thinking about ranges
doesn't make sense.
The big ranges will arise naturally because things like hugetlb
reservations should all be contiguous and the resulting folios should
all be allocated for the VM and also all be contiguous. So vast, vast
amounts of memory will be high order and contiguous.
> Preserving folio orders with it is really straighforward and until we see
> some real data of how the entire KHO machinery is used, I'd prefer simple
> over anything else.
mapletree may not even work as it has a very high bound on memory
usage if the preservation workload is small and fragmented. This is
why I didn't want to use a list of ranges in the first place.
It also doesn't work so well if you need to preserve the order too :\
Until we know the workload(s) and how much memory the maple tree
version will cost, I don't think it is a good general starting point.
Jason
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 09/16] kexec: enable KHO support for memory preservation
2025-04-04 12:54 ` Jason Gunthorpe
@ 2025-04-04 15:39 ` Pratyush Yadav
0 siblings, 0 replies; 103+ messages in thread
From: Pratyush Yadav @ 2025-04-04 15:39 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Changyuan Lyu, linux-kernel, graf, akpm, luto, anthony.yznaga,
arnd, ashish.kalra, benh, bp, catalin.marinas, dave.hansen, dwmw2,
ebiederm, mingo, jgowans, corbet, krzk, rppt, mark.rutland,
pbonzini, pasha.tatashin, hpa, peterz, robh+dt, robh, saravanak,
skinsburskii, rostedt, tglx, thomas.lendacky, usama.arif, will,
devicetree, kexec, linux-arm-kernel, linux-doc, linux-mm, x86
On Fri, Apr 04 2025, Jason Gunthorpe wrote:
> On Thu, Apr 03, 2025 at 05:37:06PM +0000, Pratyush Yadav wrote:
>
[...]
>> > This should be more like:
>> >
>> > union {
>> > void *table;
>> > phys_addr_t table_phys;
>> > };
>> >
>> > Since we are not using the low bits right now and it is alot cheaper
>> > to convert from va to phys only once during the final step. __va is
>> > not exactly fast.
>>
>> The descriptor is used on _every_ level of the table, not just the
>> top.
>
> Yes
>
>> So if we use virtual addresses, at serialize time we would have to walk
>> the whole table and covert all addresses to physical.
>
> Yes
>
>> And __va() does
>> not seem to be doing too much. On x86, it expands to:
>>
>> #define __va(x) ((void *)((unsigned long)(x)+PAGE_OFFSET))
>>
>> and on ARM64 to:
>>
>> #define __va(x) ((void *)__phys_to_virt((phys_addr_t)(x)))
>> #define __phys_to_virt(x) ((unsigned long)((x) - PHYS_OFFSET) | PAGE_OFFSET)
>
> Hmm, I was sure sparsemem added a bunch of stuff to this path, maybe
> I'm thinking of page_to_phys
Yep, page_to_phys for sparsemem is somewhat expensive, but __va() seems
to be fine.
#define __page_to_pfn(pg) \
({ const struct page *__pg = (pg); \
int __sec = page_to_section(__pg); \
(unsigned long)(__pg - __section_mem_map_addr(__nr_to_section(__sec))); \
})
>
>> >> +struct kho_mem_track {
>> >> + /* Points to L4 KHOMEM descriptor, each order gets its own table. */
>> >> + struct xarray orders;
>> >> +};
>> >
>> > I think it would be easy to add a 5th level and just use bits 63:57 as
>> > a 6 bit order. Then you don't need all this stuff either.
>>
>> I am guessing you mean to store the order in the table descriptor
>> itself, instead of having a different table for each order.
>
> Not quite, I mean to index the per-order sub trees by using the high
> order bits. You still end up with N seperate bitmap trees, but instead
> of using an xarray to hold their top pointers you hold them in a 5th
> level.
>
>> Though now that I think of it, it is probably much simpler to just use
>> khomem_desc_t orders[NR_PAGE_ORDERS] instead of the xarray.
>
> Which is basically this, but encoding the index to orders in the address
I think this is way easier to wrap your head around compared to trying
to encode orders in the address. That doesn't even work that well since
the address has pretty much no connection to the order. A page at
address, say, 0x100000 might be of any order. So we would have to encode
the order into the address and then decode it when doing table
operations. Just having a separate table indexed by order is way
simpler.
--
Regards,
Pratyush Yadav
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 09/16] kexec: enable KHO support for memory preservation
2025-04-04 13:53 ` Mike Rapoport
2025-04-04 14:30 ` Jason Gunthorpe
@ 2025-04-04 16:15 ` Pratyush Yadav
2025-04-06 16:34 ` Mike Rapoport
1 sibling, 1 reply; 103+ messages in thread
From: Pratyush Yadav @ 2025-04-04 16:15 UTC (permalink / raw)
To: Mike Rapoport
Cc: Jason Gunthorpe, Changyuan Lyu, linux-kernel, graf, akpm, luto,
anthony.yznaga, arnd, ashish.kalra, benh, bp, catalin.marinas,
dave.hansen, dwmw2, ebiederm, mingo, jgowans, corbet, krzk,
mark.rutland, pbonzini, pasha.tatashin, hpa, peterz, robh+dt,
robh, saravanak, skinsburskii, rostedt, tglx, thomas.lendacky,
usama.arif, will, devicetree, kexec, linux-arm-kernel, linux-doc,
linux-mm, x86
Hi Mike,
On Fri, Apr 04 2025, Mike Rapoport wrote:
[...]
> As for the optimizations of memblock reserve path, currently it what hurts
> the most in my and Pratyush experiments. They are not very representative,
> but still, preserving lots of pages/folios spread all over would have it's
> toll on the mm initialization. And I don't think invasive changes to how
> buddy and memory map initialization are the best way to move forward and
> optimize that. Quite possibly we'd want to be able to minimize amount of
> *ranges* that we preserve.
>
> So from the three alternatives we have now (xarrays + bitmaps, tables +
> bitmaps and maple tree for ranges) maple tree seems to be the simplest and
> efficient enough to start with.
But you'd need to somehow serialize the maple tree ranges into some
format. So you would either end up going back to the kho_mem ranges we
had, or have to invent something more complex. The sample code you wrote
is pretty much going back to having kho_mem ranges.
And if you say that we should minimize the amount of ranges, the table +
bitmaps is still a fairly good data structure. You can very well have a
higher order table where your entire range is a handful of bits. This
lets you track a small number of ranges fairly efficiently -- both in
terms of memory and in terms of CPU. I think the only place where it
doesn't work as well as a maple tree is if you want to merge or split a
lot of ranges quickly. But if you say that you only want to have a handful
of ranges, does that really matter?
Also, I think the allocation pattern depends on which use case you have
in mind. For hypervisor live update, you might very well only have a
handful of ranges. The use case I have in mind is for taking a userspace
process, quickly checkpointing it by dumping its memory contents to a
memfd, and restoring it after KHO. For that, the ability to do random
sparse allocations quickly helps a lot.
So IMO the table works well for both sparse and dense allocations. So
why have a data structure that only solves one problem when we can have
one that solves both? And honestly, I don't think the table is that much
more complex either -- both in terms of understanding the idea and in
terms of code -- the whole thing is like 200 lines.
Also, I think changes to buddy initialization _are_ the way to optimize
boot times. Having maple tree ranges and moving them around into
memblock ranges does not really scale very well for anything other than
a handful of ranges, and we shouldn't limit ourselves to that without
good reason.
>
> Preserving folio orders with it is really straighforward and until we see
> some real data of how the entire KHO machinery is used, I'd prefer simple
> over anything else.
--
Regards,
Pratyush Yadav
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 09/16] kexec: enable KHO support for memory preservation
2025-04-04 14:30 ` Jason Gunthorpe
@ 2025-04-04 16:24 ` Pratyush Yadav
2025-04-04 17:31 ` Jason Gunthorpe
2025-04-06 16:13 ` Mike Rapoport
2025-04-06 16:11 ` Mike Rapoport
1 sibling, 2 replies; 103+ messages in thread
From: Pratyush Yadav @ 2025-04-04 16:24 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Mike Rapoport, Changyuan Lyu, linux-kernel, graf, akpm, luto,
anthony.yznaga, arnd, ashish.kalra, benh, bp, catalin.marinas,
dave.hansen, dwmw2, ebiederm, mingo, jgowans, corbet, krzk,
mark.rutland, pbonzini, pasha.tatashin, hpa, peterz, robh+dt,
robh, saravanak, skinsburskii, rostedt, tglx, thomas.lendacky,
usama.arif, will, devicetree, kexec, linux-arm-kernel, linux-doc,
linux-mm, x86
On Fri, Apr 04 2025, Jason Gunthorpe wrote:
> On Fri, Apr 04, 2025 at 04:53:13PM +0300, Mike Rapoport wrote:
[...]
>> Most drivers do not use folios
>
> Yes they do, either through kmalloc or through alloc_page/etc. "folio"
> here is just some generic word meaning memory from the buddy allocator.
>
> The big question on my mind is if we need a way to preserve slab
> objects as well..
Only if the objects in the slab cache are of a format that doesn't
change, and I am not sure if that is the case anywhere. Maybe a driver
written with KHO in mind would find it useful, but that's way down the
line.
>
>> and for preserving memfd* and hugetlb we'd need to have some dance
>> around that memory anyway.
>
> memfd is all folios - what do you mean?
>
> hugetlb is moving toward folios.. eg guestmemfd is supposed to be
> taking the hugetlb special stuff and turning it into folios.
>
>> So I think kho_preserve_folio() would be a part of the fdbox or
>> whatever that functionality will be called.
>
> It is part of KHO. Preserving the folios has to be sequenced with
> starting the buddy allocator, and that is KHO's entire responsibility.
>
> I could see something like preserving slab being in a different layer,
> built on preserving folios.
Agree with both points.
[...]
>> As for the optimizations of memblock reserve path, currently it what hurts
>> the most in my and Pratyush experiments. They are not very representative,
>> but still, preserving lots of pages/folios spread all over would have it's
>> toll on the mm initialization.
>
>> And I don't think invasive changes to how
>> buddy and memory map initialization are the best way to move forward and
>> optimize that.
>
> I'm pretty sure this is going to be the best performance path, but I
> have no idea how invasive it would be to the buddy alloactor to make
> it work.
I don't imagine it would be that invasive TBH. memblock_free_pages()
already checks for kmsan_memblock_free_pages() or
early_page_initialised(); it can also check for kho_page() just as
easily.
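Something like this, roughly (kho_page() is a hypothetical helper and the
body is paraphrased rather than the actual mm/mm_init.c code):

void __init memblock_free_pages(struct page *page, unsigned long pfn,
				unsigned int order)
{
	if (!kmsan_memblock_free_pages(page, order))
		return;		/* KMSAN will take care of these pages */

	/* Hypothetical: keep KHO-preserved pages out of the buddy allocator. */
	if (kho_page(pfn, order))
		return;

	__free_pages_core(page, order, MEMINIT_EARLY);
}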
>
>> Quite possibly we'd want to be able to minimize amount of *ranges*
>> that we preserve.
>
> I'm not sure, that seems backwards to me, we really don't want to have
> KHO mem zones! So I think optimizing for, and thinking about ranges
> doesn't make sense.
>
> The big ranges will arise naturally beacuse things like hugetlb
> reservations should all be contiguous and the resulting folios should
> all be allocated for the VM and also all be contigous. So vast, vast
> amounts of memory will be high order and contiguous.
Yes, and those can work quite well with table + bitmaps too.
[...]
--
Regards,
Pratyush Yadav
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 09/16] kexec: enable KHO support for memory preservation
2025-04-04 16:24 ` Pratyush Yadav
@ 2025-04-04 17:31 ` Jason Gunthorpe
2025-04-06 16:13 ` Mike Rapoport
1 sibling, 0 replies; 103+ messages in thread
From: Jason Gunthorpe @ 2025-04-04 17:31 UTC (permalink / raw)
To: Pratyush Yadav
Cc: Mike Rapoport, Changyuan Lyu, linux-kernel, graf, akpm, luto,
anthony.yznaga, arnd, ashish.kalra, benh, bp, catalin.marinas,
dave.hansen, dwmw2, ebiederm, mingo, jgowans, corbet, krzk,
mark.rutland, pbonzini, pasha.tatashin, hpa, peterz, robh+dt,
robh, saravanak, skinsburskii, rostedt, tglx, thomas.lendacky,
usama.arif, will, devicetree, kexec, linux-arm-kernel, linux-doc,
linux-mm, x86
On Fri, Apr 04, 2025 at 04:24:54PM +0000, Pratyush Yadav wrote:
> Only if the objects in the slab cache are of a format that doesn't
> change, and I am not sure if that is the case anywhere. Maybe a driver
> written with KHO in mind would find it useful, but that's way down the
> line.
Things like iommu STE/CD entries are HW specified and currently
allocated from slab, so there may be something interesting there.
They could also possibly be converted to page allocations..
But I think we will come up with cases where maybe a kmemcache
specifically for KHO objects makes some sense. Need to get further
along.
Jason
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 09/16] kexec: enable KHO support for memory preservation
2025-04-04 14:30 ` Jason Gunthorpe
2025-04-04 16:24 ` Pratyush Yadav
@ 2025-04-06 16:11 ` Mike Rapoport
2025-04-07 14:16 ` Jason Gunthorpe
1 sibling, 1 reply; 103+ messages in thread
From: Mike Rapoport @ 2025-04-06 16:11 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Pratyush Yadav, Changyuan Lyu, linux-kernel, graf, akpm, luto,
anthony.yznaga, arnd, ashish.kalra, benh, bp, catalin.marinas,
dave.hansen, dwmw2, ebiederm, mingo, jgowans, corbet, krzk,
mark.rutland, pbonzini, pasha.tatashin, hpa, peterz, robh+dt,
robh, saravanak, skinsburskii, rostedt, tglx, thomas.lendacky,
usama.arif, will, devicetree, kexec, linux-arm-kernel, linux-doc,
linux-mm, x86
On Fri, Apr 04, 2025 at 11:30:31AM -0300, Jason Gunthorpe wrote:
> On Fri, Apr 04, 2025 at 04:53:13PM +0300, Mike Rapoport wrote:
> > > Maybe change the reserved regions code to put the region list in a
> > > folio and preserve the folio instead of using FDT as a "demo" for the
> > > functionality.
> >
> > Folios are not available when we restore reserved regions, this just won't
> > work.
>
> You don't need the folio at that point, you just need the data in the
> page.
>
> The folio would be freed after starting up the buddy allocator.
Maybe, but it seems a bit far-fetched to me.
> > > We know what the future use case is for the folio preservation, all
> > > the drivers and the iommu are going to rely on this.
> >
> > We don't know how much of the preservation will be based on folios.
>
> I think almost all of it. Where else does memory come from for drivers?
alloc_pages()? vmalloc()?
These don't use struct folio unless there's __GFP_COMP in the
alloc_pages() call, and in my mind "folio" means memory described by
struct folio.
> > Most drivers do not use folios
>
> Yes they do, either through kmalloc or through alloc_page/etc. "folio"
> here is just some generic word meaning memory from the buddy allocator.
How about we find some less ambiguous term? Using "folio" for memory
returned from kmalloc is really confusing. And even alloc_pages() does not
treat all memory it returns as folios.
How about we call them ranges? ;-)
> The big question on my mind is if we need a way to preserve slab
> objects as well..
>
> > and for preserving memfd* and hugetlb we'd need to have some dance
> > around that memory anyway.
>
> memfd is all folios - what do you mean?
memfd is struct folios indeed, but some of them are hugetlb, and even for
those that are not I'm not sure that kho_preserve_folio(struct folio *)
and kho_restore_folio(some token?) will be enough. I totally might be
wrong here.
> hugetlb is moving toward folios.. eg guestmemfd is supposed to be
> taking the hugetlb special stuff and turning it into folios.
At some point yes. But I really hope KHO can happen faster than hugetlb and
guestmemfd convergence.
> > So I think kho_preserve_folio() would be a part of the fdbox or
> > whatever that functionality will be called.
>
> It is part of KHO. Preserving the folios has to be sequenced with
> starting the buddy allocator, and that is KHO's entire responsibility.
So if you call "folio" any memory range that comes from the page allocator, I
do agree. But since it's not necessarily struct folio, and struct folio is
mostly used with file descriptors, the kho_preserve_folio(struct folio *)
API can be a part of fdbox.
Preserving struct folio is one of several cases where we'd want to preserve
ranges. There's simplistic memblock case that does not care about any
memdesc, there's memory returned from alloc_pages() without __GFP_COMP,
there's vmalloc() and of course there's memory with struct folio.
But the basic KHO primitive should preserve ranges because they are the
common denominator of alloc_pages(), folio_alloc(), vmalloc() and memblock.
> I could see something like preserving slab being in a different layer,
> built on preserving folios.
Maybe, on top of ranges. slab is yet another memdesc.
> > Are they?
> > The purpose of basic KHO is to make sure the memory we want to preserve is
> > not trampled over. Preserving folios with their orders means we need to
> > make sure memory range of the folio is preserved and we carry additional
> > information to actually recreate the folio object, in case it is needed and
> > in case it is possible. Hughetlb, for instance has its own way initializing
> > folios and just keeping the order won't be enough for that.
>
> I expect many things will need a side-car datastructure to record that
> additional meta-data. hugetlb can start with folios, then switch them
> over to its non-folio stuff based on its metadata.
>
> The point is the basic low level KHO mechanism is simple folios -
> memory from the buddy allocator with an neutral struct folio that the
> caller can then customize to its own memory descriptor type on restore.
I can't say I understand what you mean by "neutral struct folio", but we
can't really use struct folio for memory that wasn't a struct folio in
the first place. There's a ton of checks for flags etc. in mm core that
could blow up if we use the wrong memdesc.
Hence the use of page->private for order of folios. It's stable (for now)
and can be used by any page owner.
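E.g. something along these lines (illustration only, the helper names are
made up):

/* At deserialize time stash the order in page->private of the head page,
 * then read it back when the folio is actually restored. */
static void kho_note_order(struct page *page, unsigned int order)
{
	set_page_private(page, order);
}

static struct folio *kho_restore_folio_sketch(phys_addr_t phys)
{
	struct page *page = pfn_to_page(PHYS_PFN(phys));
	unsigned int order = page_private(page);

	set_page_private(page, 0);
	if (order)
		prep_compound_page(page, order);
	return page_folio(page);
}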
> Eventually restore would allocate a caller specific memdesc and it
> wouldn't be "folios" at all. We just don't have the right words yet to
> describe this.
>
> > As for the optimizations of memblock reserve path, currently it what hurts
> > the most in my and Pratyush experiments. They are not very representative,
> > but still, preserving lots of pages/folios spread all over would have it's
> > toll on the mm initialization.
>
> > And I don't think invasive changes to how
> > buddy and memory map initialization are the best way to move forward and
> > optimize that.
>
> I'm pretty sure this is going to be the best performance path, but I
> have no idea how invasive it would be to the buddy alloactor to make
> it work.
I'm not sure about the best performance, but if we are to completely bypass
memblock_reserve() we'd need an alternative memory map and free lists
initialization for KHO. I believe it's too premature to target that at this
point.
> > Quite possibly we'd want to be able to minimize amount of *ranges*
> > that we preserve.
>
> I'm not sure, that seems backwards to me, we really don't want to have
> KHO mem zones! So I think optimizing for, and thinking about ranges
> doesn't make sense.
"folio" as "some generic word meaning memory from the buddy allocator" and
range are quite the same thing.
> The big ranges will arise naturally beacuse things like hugetlb
> reservations should all be contiguous and the resulting folios should
> all be allocated for the VM and also all be contigous. So vast, vast
> amounts of memory will be high order and contiguous.
So there won't be a problem with too many memblock_reserve() calls then.
> > Preserving folio orders with it is really straighforward and until we see
> > some real data of how the entire KHO machinery is used, I'd prefer simple
> > over anything else.
>
> mapletree may not even work as it has a very high bound on memory
> usage if the preservation workload is small and fragmented. This is
> why I didn't want to use list of ranges in the first place.
But aren't "vast, vast amounts of memory will be high order and
contiguous."? ;-)
For a small and fragmented workload the bitmaps become really sparse and
we are wasting memory for nothing. The maple tree only tracks memory that
is actually used and coalesces adjacent ranges, so although it's
unbounded in theory, in practice it may not be that bad.
> It also doesn't work so well if you need to preserve the order too :\
It does. In the example I sent there's an unsigned long to store
"kho_mem_info_t", which can definitely contain the order.
> Until we know the workload(s) and cost how much memory the maple tree
> version will use I don't think it is a good general starting point.
I did an experiment with preserving 8G of memory allocated with randomly
chosen orders. For each order (0 to 10) I got roughly 1000 "folios". I
measured the time kho_mem_deserialize() takes with xarrays + bitmaps vs
the maple tree based implementation. The maple tree outperformed it by a
factor of 10 and its serialized data used 6 times less memory.
> Jason
--
Sincerely yours,
Mike.
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 09/16] kexec: enable KHO support for memory preservation
2025-04-04 16:24 ` Pratyush Yadav
2025-04-04 17:31 ` Jason Gunthorpe
@ 2025-04-06 16:13 ` Mike Rapoport
1 sibling, 0 replies; 103+ messages in thread
From: Mike Rapoport @ 2025-04-06 16:13 UTC (permalink / raw)
To: Pratyush Yadav
Cc: Jason Gunthorpe, Changyuan Lyu, linux-kernel, graf, akpm, luto,
anthony.yznaga, arnd, ashish.kalra, benh, bp, catalin.marinas,
dave.hansen, dwmw2, ebiederm, mingo, jgowans, corbet, krzk,
mark.rutland, pbonzini, pasha.tatashin, hpa, peterz, robh+dt,
robh, saravanak, skinsburskii, rostedt, tglx, thomas.lendacky,
usama.arif, will, devicetree, kexec, linux-arm-kernel, linux-doc,
linux-mm, x86
On Fri, Apr 04, 2025 at 04:24:54PM +0000, Pratyush Yadav wrote:
> On Fri, Apr 04 2025, Jason Gunthorpe wrote:
> >
> > I'm pretty sure this is going to be the best performance path, but I
> > have no idea how invasive it would be to the buddy alloactor to make
> > it work.
>
> I don't imagine it would be that invasive TBH. memblock_free_pages()
> already checks for kmsan_memblock_free_pages() or
> early_page_initialised(), it can also check for kho_page() just as
> easily.
And how does it help us?
> --
> Regards,
> Pratyush Yadav
>
--
Sincerely yours,
Mike.
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 09/16] kexec: enable KHO support for memory preservation
2025-04-04 16:15 ` Pratyush Yadav
@ 2025-04-06 16:34 ` Mike Rapoport
2025-04-07 14:23 ` Jason Gunthorpe
0 siblings, 1 reply; 103+ messages in thread
From: Mike Rapoport @ 2025-04-06 16:34 UTC (permalink / raw)
To: Pratyush Yadav
Cc: Jason Gunthorpe, Changyuan Lyu, linux-kernel, graf, akpm, luto,
anthony.yznaga, arnd, ashish.kalra, benh, bp, catalin.marinas,
dave.hansen, dwmw2, ebiederm, mingo, jgowans, corbet, krzk,
mark.rutland, pbonzini, pasha.tatashin, hpa, peterz, robh+dt,
robh, saravanak, skinsburskii, rostedt, tglx, thomas.lendacky,
usama.arif, will, devicetree, kexec, linux-arm-kernel, linux-doc,
linux-mm, x86
On Fri, Apr 04, 2025 at 04:15:28PM +0000, Pratyush Yadav wrote:
> Hi Mike,
>
> On Fri, Apr 04 2025, Mike Rapoport wrote:
>
> [...]
> > As for the optimizations of memblock reserve path, currently it what hurts
> > the most in my and Pratyush experiments. They are not very representative,
> > but still, preserving lots of pages/folios spread all over would have it's
> > toll on the mm initialization. And I don't think invasive changes to how
> > buddy and memory map initialization are the best way to move forward and
> > optimize that. Quite possibly we'd want to be able to minimize amount of
> > *ranges* that we preserve.
> >
> > So from the three alternatives we have now (xarrays + bitmaps, tables +
> > bitmaps and maple tree for ranges) maple tree seems to be the simplest and
> > efficient enough to start with.
>
> But you'd need to somehow serialize the maple tree ranges into some
> format. So you would either end up going back to the kho_mem ranges we
> had, or have to invent something more complex. The sample code you wrote
> is pretty much going back to having kho_mem ranges.
It's a bit better, and it's not part of the FDT, which Jason was so much
against :)
> And if you say that we should minimize the amount of ranges, the table +
> bitmaps is still a fairly good data structure. You can very well have a
> higher order table where your entire range is a handful of bits. This
> lets you track a small number of ranges fairly efficiently -- both in
> terms of memory and in terms of CPU. I think the only place where it
> doesn't work as well as a maple tree is if you want to merge or split a
> lot ranges quickly. But if you say that you only want to have a handful
> of ranges, does that really matter?
Until we all agree that we are bypassing memblock_reserve() and
reimplementing memory map and free list initialization for KHO, we must
minimize the number of memblock_reserve() calls. And the maple tree makes
it easy to merge ranges where appropriate, resulting in a much smaller
number of ranges than kho_mem had.
> Also, I think the allocation pattern depends on which use case you have
> in mind. For hypervisor live update, you might very well only have a
> handful of ranges. The use case I have in mind is for taking a userspace
> process, quickly checkpointing it by dumping its memory contents to a
> memfd, and restoring it after KHO. For that, the ability to do random
> sparse allocations quickly helps a lot.
>
> So IMO the table works well for both sparse and dense allocations. So
> why have a data structure that only solves one problem when we can have
> one that solves both? And honestly, I don't think the table is that much
> more complex either -- both in terms of understanding the idea and in
> terms of code -- the whole thing is like 200 lines.
It's more than 200 lines longer than the maple tree version if we count
the lines. My point is both the table and the xarrays are trying to
optimize for an unknown goal. kho_mem with all its drawbacks was an
obvious baseline. The maple tree improves that baseline and is more
straightforward than the alternatives.
> Also, I think changes to buddy initialization _is_ the way to optimize
> boot times. Having maple tree ranges and moving them around into
> memblock ranges does not really scale very well for anything other than
> a handful of ranges, and we shouldn't limit ourselves to that without
> good reason.
As I said, this means an alternative implementation of the memory map and
free lists, which has been and remains quite fragile.
So we'd better start with something that does not require that in the
roadmap.
> --
> Regards,
> Pratyush Yadav
--
Sincerely yours,
Mike.
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 09/16] kexec: enable KHO support for memory preservation
2025-04-06 16:11 ` Mike Rapoport
@ 2025-04-07 14:16 ` Jason Gunthorpe
2025-04-07 16:31 ` Mike Rapoport
2025-04-09 16:28 ` Mike Rapoport
0 siblings, 2 replies; 103+ messages in thread
From: Jason Gunthorpe @ 2025-04-07 14:16 UTC (permalink / raw)
To: Mike Rapoport
Cc: Pratyush Yadav, Changyuan Lyu, linux-kernel, graf, akpm, luto,
anthony.yznaga, arnd, ashish.kalra, benh, bp, catalin.marinas,
dave.hansen, dwmw2, ebiederm, mingo, jgowans, corbet, krzk,
mark.rutland, pbonzini, pasha.tatashin, hpa, peterz, robh+dt,
robh, saravanak, skinsburskii, rostedt, tglx, thomas.lendacky,
usama.arif, will, devicetree, kexec, linux-arm-kernel, linux-doc,
linux-mm, x86
On Sun, Apr 06, 2025 at 07:11:14PM +0300, Mike Rapoport wrote:
> > > > We know what the future use case is for the folio preservation, all
> > > > the drivers and the iommu are going to rely on this.
> > >
> > > We don't know how much of the preservation will be based on folios.
> >
> > I think almost all of it. Where else does memory come from for drivers?
>
> alloc_pages()? vmalloc()?
alloc_pages is a 0 order "folio". vmalloc is an array of 0 order
folios (?)
> These don't use struct folio unless there's __GFP_COMP in alloc_pages()
> call, and in my mind "folio" is memory described by struct folio.
I understand Matthew wants to get rid of non __GFP_COMP usage.
> > > Most drivers do not use folios
> >
> > Yes they do, either through kmalloc or through alloc_page/etc. "folio"
> > here is just some generic word meaning memory from the buddy allocator.
>
> How about we find some less ambiguous term? Using "folio" for memory
> returned from kmalloc is really confusing. And even alloc_pages() does not
> treat all memory it returns as folios.
>
> How about we call them ranges? ;-)
memdescs if you want to be forward looking. It is not ranges.
The point very much is that they are well defined allocations from the
buddy allocator that can be freed back to the buddy allocator. We
provide an API sort of like alloc_pages/folio_alloc to get the pointer
back out and that is the only way to use it.
> > > and for preserving memfd* and hugetlb we'd need to have some dance
> > > around that memory anyway.
> >
> > memfd is all folios - what do you mean?
>
> memfd is struct folios indeed, but some of them are hugetlb and even for
> those that are not I'm not sure that kho_preserve_folio(struct *folio)
> kho_restore_folio(some token?) will be enough. I totally might be wrong
> here.
Well, that is the point, we want it to be enough and we need to make
it work. Ranges are the wrong thing to fall back on if there are
problems.
> > hugetlb is moving toward folios.. eg guestmemfd is supposed to be
> > taking the hugetlb special stuff and turning it into folios.
>
> At some point yes. But I really hope KHO can happen faster than hugetlb and
> guestmemfd convergence.
Regardless, it is still representable as a near-folio thing since
there are struct pages backing hugetlbfs.
> > > So I think kho_preserve_folio() would be a part of the fdbox or
> > > whatever that functionality will be called.
> >
> > It is part of KHO. Preserving the folios has to be sequenced with
> > starting the buddy allocator, and that is KHO's entire responsibility.
>
> So if you call "folio" any memory range that comes from page allocator, I
> do agree.
Yes
> But since it's not necessarily struct folio, and struct folio is
> mostly used with file descriptors, the kho_preserve_folio(struct folio *)
> API can be a part of fdbox.
KHO needs to provide a way to give back an allocated struct page/folio
that can be freed back to the buddy allocator, of the proper
order. Whatever you call that function it belongs to KHO as it is
KHO's primary responsibility to manage the buddy allocator and the
struct pages.
Today initializing the folio is the work required to do that.
> Preserving struct folio is one of several case where we'd want to preserve
> ranges. There's simplistic memblock case that does not care about any
> memdesc, there's memory returned from alloc_pages() without __GFP_COMP,
> there's vmalloc() and of course there's memory with struct folio.
non-struct page memory is fundamentally different from struct-page
memory, we don't even start up the buddy allocator on non-struct page
memory, and we don't allocate struct pages for it.
This should be a completely different flow.
Buddy allocator memory should start up in the next kernel as allocate
multi-order "folios", with allocated struct pages, with a working
folio_put()/etc to free them.
> I can't say I understand what do you mean by "neutral struct folio", but we
> can't really use struct folio for memory that wasn't struct folio at the
> first place. There's a ton of checks for flags etc in mm core that could
> blow up if we use a wrong memdesc.
For instance go look at how slab gets memory from the allocator:
folio = (struct folio *)alloc_frozen_pages(flags, order);
slab = folio_slab(folio);
__folio_set_slab(folio);
I know the naming is tortured, but this is how it works right now. You
allocate "netrual" folios, then you change them into your memdesc
specific subtype. And somehow we simultaneously call this thing page,
folio and slab memdesc :\
So for preservation it makes complete sense that you'd have a
'kho_restore_frozen_folios/pages()' that returns a struct page/struct
folio in the exact same state as though it was newly allocated.
> "folio" as "some generic word meaning memory from the buddy allocator" and
> range are quite the same thing.
Not quite, folios are constrained to be aligned powers of two and we
expect the order to round trip through the system.
'ranges' are just ranges, no implied alignment, no round tripping of
the order.
> > > Preserving folio orders with it is really straighforward and until we see
> > > some real data of how the entire KHO machinery is used, I'd prefer simple
> > > over anything else.
> >
> > mapletree may not even work as it has a very high bound on memory
> > usage if the preservation workload is small and fragmented. This is
> > why I didn't want to use list of ranges in the first place.
>
> But aren't "vast, vast amounts of memory will be high order and
> contiguous."? ;-)
Yes, if you have a 500GB host most likely something like 480GB will be
high order contiguous, then you have 20GB that has to be random
layout. That is still a lot of memory to eat up in a discontinuous
maple tree.
> > It also doesn't work so well if you need to preserve the order too :\
>
> It does. In the example I've sent there's an unsigned long to store
> "kho_mem_info_t", which definitely can contain order.
It mucks up the combining logic since you can't combine maple tree
nodes with different orders, and now you have defeated the main
argument of using ranges :\
> > Until we know the workload(s) and cost how much memory the maple tree
> > version will use I don't think it is a good general starting point.
>
> I did and experiment with preserving 8G of memory allocated with randomly
> chosen order. For each order (0 to 10) I've got roughly 1000 "folios". I
> measured time kho_mem_deserialize() takes with xarrays + bitmaps vs maple
> tree based implementation. The maple tree outperformed by factor of 10 and
> it's serialized data used 6 times less memory.
That seems like it means most of your memory ended up contiguous and
the maple tree didn't split nodes to preserve order. :\ Also the
bitmap scanning to optimize the memblock reserve isn't implemented for
the xarray... I don't think this is representative.
Jason
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 09/16] kexec: enable KHO support for memory preservation
2025-04-06 16:34 ` Mike Rapoport
@ 2025-04-07 14:23 ` Jason Gunthorpe
0 siblings, 0 replies; 103+ messages in thread
From: Jason Gunthorpe @ 2025-04-07 14:23 UTC (permalink / raw)
To: Mike Rapoport
Cc: Pratyush Yadav, Changyuan Lyu, linux-kernel, graf, akpm, luto,
anthony.yznaga, arnd, ashish.kalra, benh, bp, catalin.marinas,
dave.hansen, dwmw2, ebiederm, mingo, jgowans, corbet, krzk,
mark.rutland, pbonzini, pasha.tatashin, hpa, peterz, robh+dt,
robh, saravanak, skinsburskii, rostedt, tglx, thomas.lendacky,
usama.arif, will, devicetree, kexec, linux-arm-kernel, linux-doc,
linux-mm, x86
On Sun, Apr 06, 2025 at 07:34:30PM +0300, Mike Rapoport wrote:
> It's more than 200 lines longer than the maple tree if we count the lines.
> My point is both table and xarrays are trying to optimize for an unknown
> goal.
Not unknown; the point of the bitmap scheme is to be memory
deterministic.
You can measure your workload and say you need XX MB of memory
to succeed at a KHO using bitmaps.
With a maple tree you need to both measure your workload and compute a
worst case fragmentation, then say you need YY MB of memory to succeed
at the KHO.
Since we are looking only at the worst case, YY > XX.
These are engineered systems, there is limited memory available to the
hypervisor, and every MB is basically accounted for to minimize the
memory requirement.
So every action needs to be worst cased and accounted for in the
hypervisor memory budget.
> As I said, this means an alternative implementation of the memory map and
> free lists, which has been and remains quite fragile.
> So we'd better start with something that does not require that in the
> roadmap.
I think the obvious next step is to use the bitmaps to generate
contiguous ranges to pass into memblock reserve. That will get you
performance equivalent to the maple tree and deterministic memory usage.
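Something along these lines, purely as a sketch (the helper name and
the bitmap/base_pfn parameters are made up here, not from the series):

static void kho_reserve_from_bitmap(const unsigned long *bitmap,
				    unsigned long base_pfn,
				    unsigned long nbits)
{
	unsigned long start = find_first_bit(bitmap, nbits);

	/* Coalesce each run of set bits into a single memblock_reserve() */
	while (start < nbits) {
		unsigned long end = find_next_zero_bit(bitmap, nbits, start);

		memblock_reserve(PFN_PHYS(base_pfn + start),
				 PFN_PHYS(end - start));
		start = find_next_bit(bitmap, nbits, end);
	}
}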
Jason
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 09/16] kexec: enable KHO support for memory preservation
2025-04-07 14:16 ` Jason Gunthorpe
@ 2025-04-07 16:31 ` Mike Rapoport
2025-04-07 17:03 ` Jason Gunthorpe
2025-04-09 16:28 ` Mike Rapoport
1 sibling, 1 reply; 103+ messages in thread
From: Mike Rapoport @ 2025-04-07 16:31 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Pratyush Yadav, Changyuan Lyu, linux-kernel, graf, akpm, luto,
anthony.yznaga, arnd, ashish.kalra, benh, bp, catalin.marinas,
dave.hansen, dwmw2, ebiederm, mingo, jgowans, corbet, krzk,
mark.rutland, pbonzini, pasha.tatashin, hpa, peterz, robh+dt,
robh, saravanak, skinsburskii, rostedt, tglx, thomas.lendacky,
usama.arif, will, devicetree, kexec, linux-arm-kernel, linux-doc,
linux-mm, x86
On Mon, Apr 07, 2025 at 11:16:26AM -0300, Jason Gunthorpe wrote:
> On Sun, Apr 06, 2025 at 07:11:14PM +0300, Mike Rapoport wrote:
> > > > > We know what the future use case is for the folio preservation, all
> > > > > the drivers and the iommu are going to rely on this.
> > > >
> > > > We don't know how much of the preservation will be based on folios.
> > >
> > > I think almost all of it. Where else does memory come from for drivers?
> >
> > alloc_pages()? vmalloc()?
>
> alloc_pages is a 0 order "folio". vmalloc is an array of 0 order
> folios (?)
According to Matthew's current plan [1], vmalloc is misc memory :)
> > How about we find some less ambiguous term? Using "folio" for memory
> > returned from kmalloc is really confusing. And even alloc_pages() does not
> > treat all memory it returns as folios.
> >
> > How about we call them ranges? ;-)
>
> memdescs if you want to be forward looking. It is not ranges.
>
> The point very much is that they are well defined allocations from the
> buddy allocator that can be freed back to the buddy allocator. We
> provide an API sort of like alloc_pages/folio_alloc to get the pointer
> back out and that is the only way to use it.
>
> KHO needs to provide a way to give back an allocated struct page/folio
> that can be freed back to the buddy allocator, of the proper
> order. Whatever you call that function it belongs to KHO as it is
> KHO's primary responsibility to manage the buddy allocator and the
> struct pages.
>
> Today initializing the folio is the work required to do that.
Ok, let's stick with memdesc then. Putting the name aside, it looks like we do
agree that KHO needs to provide a way to preserve memory allocated from
buddy along with some of the metadata describing that memory, like the order
for multi-order allocations.
The issue I see with bitmaps is that there's nothing except the order that
we can save. And if sometime later we have to recreate the memdesc for that
memory, that would mean allocating the correct data structure, i.e. struct
folio, struct slab, maybe struct vmalloc.
I'm not sure we are going to preserve slabs, at least in the foreseeable
future, but vmalloc seems like something that we'd have to address.
> > I did an experiment with preserving 8G of memory allocated with randomly
> > chosen order. For each order (0 to 10) I've got roughly 1000 "folios". I
> > measured the time kho_mem_deserialize() takes with xarrays + bitmaps vs maple
> > tree based implementation. The maple tree outperformed by a factor of 10 and
> > its serialized data used 6 times less memory.
>
> That seems like it means most of your memory ended up contiguous and
> the maple tree didn't split nodes to preserve order. :\
I was cheating to some extent, but not that much. I preserved the order in
kho_mem_info_t, and if adjacent folios were of different orders they were
not merged into a single maple tree node. But when all memory is free and
not fragmented, my understanding is that buddy will allocate folios of the
same order next to each other, so they could be merged in the maple tree.
> Also the bitmap scanning to optimize the memblock reserve isn't
> implemented for xarray.. I don't think this is representative..
I believe that even with the bitmap scanning optimization the maple tree would
perform much better when the memory is not fragmented. And when it is
fragmented, both will need to call memblock_reserve() a similar number of
times and there won't be a real difference. Of course the maple tree will
consume much more memory in the worst case.
[1] https://kernelnewbies.org/MatthewWilcox/Memdescs
> Jason
>
--
Sincerely yours,
Mike.
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 09/16] kexec: enable KHO support for memory preservation
2025-04-07 16:31 ` Mike Rapoport
@ 2025-04-07 17:03 ` Jason Gunthorpe
2025-04-09 9:06 ` Mike Rapoport
0 siblings, 1 reply; 103+ messages in thread
From: Jason Gunthorpe @ 2025-04-07 17:03 UTC (permalink / raw)
To: Mike Rapoport
Cc: Pratyush Yadav, Changyuan Lyu, linux-kernel, graf, akpm, luto,
anthony.yznaga, arnd, ashish.kalra, benh, bp, catalin.marinas,
dave.hansen, dwmw2, ebiederm, mingo, jgowans, corbet, krzk,
mark.rutland, pbonzini, pasha.tatashin, hpa, peterz, robh+dt,
robh, saravanak, skinsburskii, rostedt, tglx, thomas.lendacky,
usama.arif, will, devicetree, kexec, linux-arm-kernel, linux-doc,
linux-mm, x86
On Mon, Apr 07, 2025 at 07:31:21PM +0300, Mike Rapoport wrote:
> > alloc_pages is a 0 order "folio". vmalloc is an array of 0 order
> > folios (?)
>
> According to current Matthew's plan [1] vmalloc is misc memory :)
Someday! :)
> Ok, let's stick with memdesc then. Put aside the name it looks like we do
> agree that KHO needs to provide a way to preserve memory allocated from
> buddy along with some of the metadata describing that memory, like order
> for multi-order allocations.
+1
> The issue I see with bitmaps is that there's nothing except the order that
> we can save. And if sometime later we'd have to recreate memdesc for that
> memory, that would mean allocating a correct data structure, i.e. struct
> folio, struct slab, struct vmalloc maybe.
Yes. The caller would have to take care of this using a
caller-specific serialization of any memdesc data. Slab, for example, would
presumably have to record the object size and the object allocation bitmap.
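Purely as an illustration of the kind of per-slab record that implies
(nothing like this exists in the series):

	struct kho_slab_record {
		phys_addr_t	backing;	/* the preserved folio */
		unsigned int	object_size;
		unsigned int	objects;
		unsigned long	alloc_bitmap[];	/* one bit per live object */
	};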
> I'm not sure we are going to preserve slabs at least at the foreseeable
> future, but vmalloc seems like something that we'd have to address.
And I suspect vmalloc doesn't need to preserve any memdesc information?
It can all be recreated.
> > Also the bitmap scanning to optimize the memblock reserve isn't
> > implemented for xarray.. I don't think this is representative..
>
> I believe that even with optimization of bitmap scanning maple tree would
> perform much better when the memory is not fragmented.
Hard to guess. Bitmap scanning is not free, especially if there are
lots of zeros, but allocating maple tree nodes and locking them
is not free either, so who knows where things cross over..
> And when it is fragmented both will need to call memblock_reserve()
> similar number of times and there won't be real difference. Of
> course maple tree will consume much more memory in the worst case.
Yes.
bitmaps are bounded like the comment says, 512K for 16G of memory with
arbitrary order-0 fragmentation.
Assuming absolute worst case fragmentation, the maple tree (@24 bytes per
range, alternating allocated/freed pattern) would require around
50MB. Then almost doubled, since we have the maple tree and then the
serialized copy.
100MB vs 512K - I will pick the 512K :)
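(Spelling out the arithmetic behind those numbers, assuming 4K pages
and one bit per preserved page:

  16G / 4K = 4,194,304 pages -> 4,194,304 bits = 512K of bitmap
  worst case for ranges: every other page preserved -> 2,097,152 ranges
  2,097,152 * 24 bytes ~= 48M, roughly doubled for the in-memory maple
  tree plus the serialized copy ~= 100M)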
Jason
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 09/16] kexec: enable KHO support for memory preservation
2025-04-03 16:10 ` Jason Gunthorpe
2025-04-03 17:37 ` Pratyush Yadav
@ 2025-04-09 8:35 ` Mike Rapoport
1 sibling, 0 replies; 103+ messages in thread
From: Mike Rapoport @ 2025-04-09 8:35 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Pratyush Yadav, Changyuan Lyu, linux-kernel, graf, akpm, luto,
anthony.yznaga, arnd, ashish.kalra, benh, bp, catalin.marinas,
dave.hansen, dwmw2, ebiederm, mingo, jgowans, corbet, krzk,
mark.rutland, pbonzini, pasha.tatashin, hpa, peterz, robh+dt,
robh, saravanak, skinsburskii, rostedt, tglx, thomas.lendacky,
usama.arif, will, devicetree, kexec, linux-arm-kernel, linux-doc,
linux-mm, x86
On Thu, Apr 03, 2025 at 01:10:01PM -0300, Jason Gunthorpe wrote:
> On Thu, Apr 03, 2025 at 03:50:04PM +0000, Pratyush Yadav wrote:
>
> > +struct kho_mem_track {
> > + /* Points to L4 KHOMEM descriptor, each order gets its own table. */
> > + struct xarray orders;
> > +};
>
> I think it would be easy to add a 5th level and just use bits 63:57 as
> a 6 bit order. Then you don't need all this stuff either.
Even 4 levels won't work with 16K and 64K pages.
To use tables we'd need to scale the number of levels based on PAGE_SIZE.
> Jason
--
Sincerely yours,
Mike.
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 09/16] kexec: enable KHO support for memory preservation
2025-04-07 17:03 ` Jason Gunthorpe
@ 2025-04-09 9:06 ` Mike Rapoport
2025-04-09 12:56 ` Jason Gunthorpe
0 siblings, 1 reply; 103+ messages in thread
From: Mike Rapoport @ 2025-04-09 9:06 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Pratyush Yadav, Changyuan Lyu, linux-kernel, graf, akpm, luto,
anthony.yznaga, arnd, ashish.kalra, benh, bp, catalin.marinas,
dave.hansen, dwmw2, ebiederm, mingo, jgowans, corbet, krzk,
mark.rutland, pbonzini, pasha.tatashin, hpa, peterz, robh+dt,
robh, saravanak, skinsburskii, rostedt, tglx, thomas.lendacky,
usama.arif, will, devicetree, kexec, linux-arm-kernel, linux-doc,
linux-mm, x86
On Mon, Apr 07, 2025 at 02:03:05PM -0300, Jason Gunthorpe wrote:
> On Mon, Apr 07, 2025 at 07:31:21PM +0300, Mike Rapoport wrote:
> >
> > Ok, let's stick with memdesc then. Put aside the name it looks like we do
> > agree that KHO needs to provide a way to preserve memory allocated from
> > buddy along with some of the metadata describing that memory, like order
> > for multi-order allocations.
>
> +1
>
> > The issue I see with bitmaps is that there's nothing except the order that
> > we can save. And if sometime later we'd have to recreate memdesc for that
> > memory, that would mean allocating a correct data structure, i.e. struct
> > folio, struct slab, struct vmalloc maybe.
>
> Yes. The caller would have to take care of this using a caller
> specific serialization of any memdesc data. Like slab would have to
> presumably record the object size and the object allocation bitmap.
>
> > I'm not sure we are going to preserve slabs at least at the foreseeable
> > future, but vmalloc seems like something that we'd have to address.
>
> And I suspect vmalloc doesn't need to preserve any memdesc information?
> It can all be recreated
vmalloc does not have anything in its memdesc now, just plain order-0 pages
from alloc_pages variants.
Now that we've settled on terminology, and given that currently memdesc ==
struct page, I think we need kho_preserve_folio(struct folio *folio) for actual
struct folios and, apparently, other high-order allocations, and
kho_preserve_pages(struct page *page, int nr) for memblock, vmalloc and
alloc_pages_exact.
On the restore path kho_restore_folio() will recreate the multi-order thingy by
doing parts of what prep_new_page() does. And kho_restore_pages() will
recreate order-0 pages as if they were allocated from buddy.
If the caller needs more in its memdesc, it is responsible for filling in the
missing bits.
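Roughly, the two pairs I have in mind would look like this (signatures
are illustrative only, nothing here is settled):

	int kho_preserve_folio(struct folio *folio);
	int kho_preserve_pages(struct page *page, int nr);

	struct folio *kho_restore_folio(phys_addr_t phys);
	struct page *kho_restore_pages(phys_addr_t phys, int nr);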
> > > Also the bitmap scanning to optimize the memblock reserve isn't
> > > implemented for xarray.. I don't think this is representative..
> >
> > I believe that even with optimization of bitmap scanning maple tree would
> > perform much better when the memory is not fragmented.
>
> Hard to guess, bitmap scanning is not free, especially if there are
> lots of zeros, but memory allocating maple tree nodes and locking them
> is not free either so who knows where things cross over..
>
> > And when it is fragmented both will need to call memblock_reserve()
> > similar number of times and there won't be real difference. Of
> > course maple tree will consume much more memory in the worst case.
>
> Yes.
>
> bitmaps are bounded like the comment says, 512K for 16G of memory with
> arbitary order 0 fragmentation.
>
> Assuming absolute worst case fragmentation maple tree (@24 bytes per
> range, alternating allocated/freed pattern) would require around
> 50M. Then almost doubled since we have the maple tree and then the
> serialized copy.
>
> 100Mb vs 512k - I will pick the 512K :)
Nah, memory is cheap nowadays :)
Ok, let's start with bitmaps and then see what are the actual bottlenecks
we have to optimize.
> Jason
--
Sincerely yours,
Mike.
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 09/16] kexec: enable KHO support for memory preservation
2025-04-09 9:06 ` Mike Rapoport
@ 2025-04-09 12:56 ` Jason Gunthorpe
2025-04-09 13:58 ` Mike Rapoport
0 siblings, 1 reply; 103+ messages in thread
From: Jason Gunthorpe @ 2025-04-09 12:56 UTC (permalink / raw)
To: Mike Rapoport
Cc: Pratyush Yadav, Changyuan Lyu, linux-kernel, graf, akpm, luto,
anthony.yznaga, arnd, ashish.kalra, benh, bp, catalin.marinas,
dave.hansen, dwmw2, ebiederm, mingo, jgowans, corbet, krzk,
mark.rutland, pbonzini, pasha.tatashin, hpa, peterz, robh+dt,
robh, saravanak, skinsburskii, rostedt, tglx, thomas.lendacky,
usama.arif, will, devicetree, kexec, linux-arm-kernel, linux-doc,
linux-mm, x86
On Wed, Apr 09, 2025 at 12:06:27PM +0300, Mike Rapoport wrote:
> Now we've settled with terminology, and given that currently memdesc ==
> struct page, I think we need kho_preserve_folio(struct *folio) for actual
> struct folios and, apparently other high order allocations, and
> kho_preserve_pages(struct page *, int nr) for memblock, vmalloc and
> alloc_pages_exact.
I'm not sure that is consistent with what Matthew is trying to build;
I think we are trying to remove 'struct page' usage, especially for
compound pages. Right now, though it is confusing, folio is the right
word to encompass both page cache memory and random memdescs from
other subsystems.
Maybe next year we will get a memdesc API that will clarify this
substantially.
> On the restore path kho_restore_folio() will recreate multi-order thingy by
> doing parts of what prep_new_page() does. And kho_restore_pages() will
> recreate order-0 pages as if they were allocated from buddy.
I don't see that we need two functions; folio should handle order-0 pages
just fine, and callers should generally either not be using struct
page at all or be using their own memdesc/folio.
If we need a second function it would be a void * function that is for
things that need memory but have no interest in the memdesc. Arguably
this should be slab preservation. There is a corner case of preserving
slab allocations >= PAGE_SIZE that is much simpler than general slab
preservation, maybe that would be interesting..
I think we still don't really know what will be needed, so I'd stick
with folio only as that allows building the memfd and a potential slab
preservation system.
Then we can see where we get to with further patches doing
serialization of actual things.
Jason
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 09/16] kexec: enable KHO support for memory preservation
2025-04-09 12:56 ` Jason Gunthorpe
@ 2025-04-09 13:58 ` Mike Rapoport
2025-04-09 15:37 ` Jason Gunthorpe
0 siblings, 1 reply; 103+ messages in thread
From: Mike Rapoport @ 2025-04-09 13:58 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Pratyush Yadav, Changyuan Lyu, linux-kernel, graf, akpm, luto,
anthony.yznaga, arnd, ashish.kalra, benh, bp, catalin.marinas,
dave.hansen, dwmw2, ebiederm, mingo, jgowans, corbet, krzk,
mark.rutland, pbonzini, pasha.tatashin, hpa, peterz, robh+dt,
robh, saravanak, skinsburskii, rostedt, tglx, thomas.lendacky,
usama.arif, will, devicetree, kexec, linux-arm-kernel, linux-doc,
linux-mm, x86
On Wed, Apr 09, 2025 at 09:56:30AM -0300, Jason Gunthorpe wrote:
> On Wed, Apr 09, 2025 at 12:06:27PM +0300, Mike Rapoport wrote:
>
> > Now we've settled with terminology, and given that currently memdesc ==
> > struct page, I think we need kho_preserve_folio(struct *folio) for actual
> > struct folios and, apparently other high order allocations, and
> > kho_preserve_pages(struct page *, int nr) for memblock, vmalloc and
> > alloc_pages_exact.
>
> I'm not sure that is consistent with what Matthew is trying to build,
> I think we are trying to remove 'struct page' usage, especially for
> compound pages. Right now, though it is confusing, folio is the right
> word to encompass both page cache memory and random memdescs from
> other subsystems.
I disagree about random memdescs, just take a look at struct folio.
> Maybe next year we will get a memdesc API that will clarify this
> substantially.
>
> > On the restore path kho_restore_folio() will recreate multi-order thingy by
> > doing parts of what prep_new_page() does. And kho_restore_pages() will
> > recreate order-0 pages as if they were allocated from buddy.
>
> I don't see we need two functions, folio should handle 0 order pages
> just fine, and callers should generally be either not using struct
> page at all or using their own memdesc/folio.
struct folio is 4 struct pages. I don't see it suitable for order-0 pages
at all.
> If we need a second function it would be a void * function that is for
> things that need memory but have no interest in the memdesc. Arguably
> this should be slab preservation. There is a corner case of preserving
> slab allocations >= PAGE_SIZE that is much simpler than general slab
> preservation, maybe that would be interesting..
>
> I think we still don't really know what will be needed, so I'd stick
> with folio only as that allows building the memfd and a potential slab
> preservation system.
A void * interface seems to me much more reasonable than a folio one as the
starting point, because it allows preserving folios with the right order but
is not limited to them.
I don't mind having kho_preserve_folio() from day 1 and even stretching the
use case we have right now to use it to preserve FDT memory.
But kho_preserve_folio() does not make sense for reserve_mem and it won't
make sense for vmalloc.
The weird games slab does with casting back and forth to folio also seem
transitional to me, and there won't be folios in slab later.
> Then we can see where we get to with further patches doing
> serialization of actual things.
>
> Jason
--
Sincerely yours,
Mike.
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 09/16] kexec: enable KHO support for memory preservation
2025-04-09 13:58 ` Mike Rapoport
@ 2025-04-09 15:37 ` Jason Gunthorpe
2025-04-09 16:19 ` Mike Rapoport
0 siblings, 1 reply; 103+ messages in thread
From: Jason Gunthorpe @ 2025-04-09 15:37 UTC (permalink / raw)
To: Mike Rapoport
Cc: Pratyush Yadav, Changyuan Lyu, linux-kernel, graf, akpm, luto,
anthony.yznaga, arnd, ashish.kalra, benh, bp, catalin.marinas,
dave.hansen, dwmw2, ebiederm, mingo, jgowans, corbet, krzk,
mark.rutland, pbonzini, pasha.tatashin, hpa, peterz, robh+dt,
robh, saravanak, skinsburskii, rostedt, tglx, thomas.lendacky,
usama.arif, will, devicetree, kexec, linux-arm-kernel, linux-doc,
linux-mm, x86
On Wed, Apr 09, 2025 at 04:58:16PM +0300, Mike Rapoport wrote:
> > I'm not sure that is consistent with what Matthew is trying to build,
> > I think we are trying to remove 'struct page' usage, especially for
> > compound pages. Right now, though it is confusing, folio is the right
> > word to encompass both page cache memory and random memdescs from
> > other subsystems.
>
> I disagree about random memdescs, just take a look at struct folio.
It is weird and confusing if you look too closely
> > I don't see we need two functions, folio should handle 0 order pages
> > just fine, and callers should generally be either not using struct
> > page at all or using their own memdesc/folio.
>
> struct folio is 4 struct pages. I don't see it suitable for order-0 pages
> at all.
It is used widely now for order-0 cases. There are lots of rules about
how the members of struct folio can be used, and one of them is that you
can't exceed the 64-byte space for an order-0 allocation.
> > I think we still don't really know what will be needed, so I'd stick
> > with folio only as that allows building the memfd and a potential slab
> > preservation system.
>
> void * seems to me much more reasonable than folio one as the starting
> point because it allows preserving folios with the right order but it's not
> limited to it.
It would just call kho_preserve_folio() under the covers though.
> I don't mind having kho_preserve_folio() from day 1 and even stretching the
> use case we have right now to use it to preserve FDT memory.
>
> But kho_preserve_folio() does not make sense for reserve_mem and it won't
> make sense for vmalloc.
It does for vmalloc too, just stop thinking about it as a
folio-for-pagecache and instead as an arbitrary-order handle to buddy
allocator memory that will someday be changed to a memdesc :|
> The weird games slab does with casting back and forth to folio also seem to
> me like transitional and there won't be that folios in slab later.
Yes transitional, but we are at the transitional point and KHO should
fit in.
The lowest allocator primitive returns folios, which can represent any
order, and the caller casts to their own memdesc.
Later the lowest primitive will be to set up a memdesc and folio/others
will become much more logically split.
Jason
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 09/16] kexec: enable KHO support for memory preservation
2025-04-09 15:37 ` Jason Gunthorpe
@ 2025-04-09 16:19 ` Mike Rapoport
2025-04-09 16:28 ` Jason Gunthorpe
0 siblings, 1 reply; 103+ messages in thread
From: Mike Rapoport @ 2025-04-09 16:19 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Pratyush Yadav, Changyuan Lyu, linux-kernel, graf, akpm, luto,
anthony.yznaga, arnd, ashish.kalra, benh, bp, catalin.marinas,
dave.hansen, dwmw2, ebiederm, mingo, jgowans, corbet, krzk,
mark.rutland, pbonzini, pasha.tatashin, hpa, peterz, robh+dt,
robh, saravanak, skinsburskii, rostedt, tglx, thomas.lendacky,
usama.arif, will, devicetree, kexec, linux-arm-kernel, linux-doc,
linux-mm, x86
On Wed, Apr 09, 2025 at 12:37:14PM -0300, Jason Gunthorpe wrote:
> On Wed, Apr 09, 2025 at 04:58:16PM +0300, Mike Rapoport wrote:
> > >
> > > I think we still don't really know what will be needed, so I'd stick
> > > with folio only as that allows building the memfd and a potential slab
> > > preservation system.
> >
> > void * seems to me much more reasonable than folio one as the starting
> > point because it allows preserving folios with the right order but it's not
> > limited to it.
>
> It would just call kho_preserve_folio() under the covers though.
How that will work for memblock and 1G pages?
> > I don't mind having kho_preserve_folio() from day 1 and even stretching the
> > use case we have right now to use it to preserve FDT memory.
> >
> > But kho_preserve_folio() does not make sense for reserve_mem and it won't
> > make sense for vmalloc.
>
> It does for vmalloc too, just stop thinking about it as a
> folio-for-pagecache and instead as an arbitary order handle to buddy
> allocator memory that will someday be changed to a memdesc :|
But we have a memdesc today: it's struct page. It will be shrunk and maybe
renamed, and it will contain a pointer rather than data, but that's what a
basic memdesc is.
And when the data structure that the memdesc points to is allocated
separately, folios won't make sense for order-0 allocations.
> > The weird games slab does with casting back and forth to folio also seem to
> > me like transitional and there won't be that folios in slab later.
>
> Yes transitional, but we are at the transitional point and KHO should
> fit in.
>
> The lowest allocator primitive returns folios, which can represent any
> order, and the caller casts to their own memdesc.
The lowest allocation primitive returns pages.
struct folio *__folio_alloc_noprof(gfp_t gfp, unsigned int order, int preferred_nid,
		nodemask_t *nodemask)
{
	struct page *page = __alloc_pages_noprof(gfp | __GFP_COMP, order,
			preferred_nid, nodemask);

	return page_rmappable_folio(page);
}
EXPORT_SYMBOL(__folio_alloc_noprof);
And page_rmappable_folio() hints at folio-for-pagecache very clearly.
And I don't think folio will be the lowest primitive buddy returns anytime
soon, if ever.
> Jason
>
--
Sincerely yours,
Mike.
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 09/16] kexec: enable KHO support for memory preservation
2025-04-09 16:19 ` Mike Rapoport
@ 2025-04-09 16:28 ` Jason Gunthorpe
2025-04-10 16:51 ` Matthew Wilcox
0 siblings, 1 reply; 103+ messages in thread
From: Jason Gunthorpe @ 2025-04-09 16:28 UTC (permalink / raw)
To: Mike Rapoport
Cc: Pratyush Yadav, Changyuan Lyu, linux-kernel, graf, akpm, luto,
anthony.yznaga, arnd, ashish.kalra, benh, bp, catalin.marinas,
dave.hansen, dwmw2, ebiederm, mingo, jgowans, corbet, krzk,
mark.rutland, pbonzini, pasha.tatashin, hpa, peterz, robh+dt,
robh, saravanak, skinsburskii, rostedt, tglx, thomas.lendacky,
usama.arif, will, devicetree, kexec, linux-arm-kernel, linux-doc,
linux-mm, x86
On Wed, Apr 09, 2025 at 07:19:30PM +0300, Mike Rapoport wrote:
> On Wed, Apr 09, 2025 at 12:37:14PM -0300, Jason Gunthorpe wrote:
> > On Wed, Apr 09, 2025 at 04:58:16PM +0300, Mike Rapoport wrote:
> > > >
> > > > I think we still don't really know what will be needed, so I'd stick
> > > > with folio only as that allows building the memfd and a potential slab
> > > > preservation system.
> > >
> > > void * seems to me much more reasonable than folio one as the starting
> > > point because it allows preserving folios with the right order but it's not
> > > limited to it.
> >
> > It would just call kho_preserve_folio() under the covers though.
>
> How that will work for memblock and 1G pages?
memblock has to do its own thing, it isn't the buddy allocator.
1G pages should be very high order folios
> > It does for vmalloc too, just stop thinking about it as a
> > folio-for-pagecache and instead as an arbitrary-order handle to buddy
> > allocator memory that will someday be changed to a memdesc :|
>
> But we have memdesc today, it's struct page.
No, I don't think it is. struct page seems to be turning into
something legacy that indicates the code has not been converted to the
new stuff yet.
> And when the data structure that memdesc points to will be allocated
> separately folios won't make sense for order-0 allocations.
At that point the lowest level allocator function will be allocating
the memdesc along with the struct page. Then folio will become
restricted to only actual folio memdescs and a lot of the type punning
should go away. We are not there yet.
> > The lowest allocator primitive returns folios, which can represent any
> > order, and the caller casts to their own memdesc.
>
> The lowest allocation primitive returns pages.
Yes, but as I understand things, we should not be calling that
interface in new code because we are trying to make 'struct page' go
away.
Instead you should use the folio interfaces and cast to your own
memdesc, or use an allocator interface that returns void * (ie slab)
and never touch the struct page area.
AFAICT, and I just wrote one of these..
> And I don't think folio will be a lowest primitive buddy returns anytime
> soon if ever.
Maybe not internally, but driver facing, I think it should be true.
Like I just completely purged all struct page from the iommu code:
https://lore.kernel.org/linux-iommu/0-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com/
I don't want some weird KHO interface that doesn't align with using
__folio_alloc_node() and folio_put() as the lowest level allocator
interface.
Jason
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 09/16] kexec: enable KHO support for memory preservation
2025-04-07 14:16 ` Jason Gunthorpe
2025-04-07 16:31 ` Mike Rapoport
@ 2025-04-09 16:28 ` Mike Rapoport
2025-04-09 18:32 ` Jason Gunthorpe
1 sibling, 1 reply; 103+ messages in thread
From: Mike Rapoport @ 2025-04-09 16:28 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Pratyush Yadav, Changyuan Lyu, linux-kernel, graf, akpm, luto,
anthony.yznaga, arnd, ashish.kalra, benh, bp, catalin.marinas,
dave.hansen, dwmw2, ebiederm, mingo, jgowans, corbet, krzk,
mark.rutland, pbonzini, pasha.tatashin, hpa, peterz, robh+dt,
robh, saravanak, skinsburskii, rostedt, tglx, thomas.lendacky,
usama.arif, will, devicetree, kexec, linux-arm-kernel, linux-doc,
linux-mm, x86
On Mon, Apr 07, 2025 at 11:16:26AM -0300, Jason Gunthorpe wrote:
> On Sun, Apr 06, 2025 at 07:11:14PM +0300, Mike Rapoport wrote:
>
> KHO needs to provide a way to give back an allocated struct page/folio
> that can be freed back to the buddy alloactor, of the proper
> order. Whatever you call that function it belongs to KHO as it is
> KHO's primary responsibility to manage the buddy allocator and the
> struct pages.
If the order is only important for freeing memory back to the page allocator,
you don't really need it. Freeing a contiguous power-of-two number of pages
with proper alignment will give the same result, just a tad slower.
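A minimal sketch of that point, assuming the restored struct pages come
back as individually freeable order-0 pages (the helper name is made up):

static void free_contig_run(struct page *page, unsigned int order)
{
	unsigned long i;

	/*
	 * Buddy re-merges naturally aligned buddies as they are freed,
	 * so the end state matches freeing one order-N allocation,
	 * just with more calls into the allocator.
	 */
	for (i = 0; i < (1UL << order); i++)
		__free_page(page + i);
}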
> Jason
>
--
Sincerely yours,
Mike.
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 09/16] kexec: enable KHO support for memory preservation
2025-04-09 16:28 ` Mike Rapoport
@ 2025-04-09 18:32 ` Jason Gunthorpe
0 siblings, 0 replies; 103+ messages in thread
From: Jason Gunthorpe @ 2025-04-09 18:32 UTC (permalink / raw)
To: Mike Rapoport
Cc: Pratyush Yadav, Changyuan Lyu, linux-kernel, graf, akpm, luto,
anthony.yznaga, arnd, ashish.kalra, benh, bp, catalin.marinas,
dave.hansen, dwmw2, ebiederm, mingo, jgowans, corbet, krzk,
mark.rutland, pbonzini, pasha.tatashin, hpa, peterz, robh+dt,
robh, saravanak, skinsburskii, rostedt, tglx, thomas.lendacky,
usama.arif, will, devicetree, kexec, linux-arm-kernel, linux-doc,
linux-mm, x86
On Wed, Apr 09, 2025 at 07:28:47PM +0300, Mike Rapoport wrote:
> On Mon, Apr 07, 2025 at 11:16:26AM -0300, Jason Gunthorpe wrote:
> > On Sun, Apr 06, 2025 at 07:11:14PM +0300, Mike Rapoport wrote:
> >
> > KHO needs to provide a way to give back an allocated struct page/folio
> > that can be freed back to the buddy alloactor, of the proper
> > order. Whatever you call that function it belongs to KHO as it is
> > KHO's primary responsibility to manage the buddy allocator and the
> > struct pages.
>
> If order is only important for freeing memory back to page allocator, you
> don't really need it. Freeing contiguous power-of-two number of pages with
> proper alignment will give the same result, just a tad slower.
What I'm asking for is transparency for the driver.
The iommu is going to be doing:
folio = __folio_alloc_node(order >= 0)
[.. init struct ioptdesc that is overlayed with struct folio ]
folio_put(folio);
As that is how you make memdescs work today. So when we add KHO, I
want to see:
folio = __folio_alloc_node(order >= 0);
[.. init struct ioptdesc that is overlayed with struct folio ]
kho_preserve_folio(folio);
// kexec
folio = kho_restore_folio(phys);
[.. init struct ioptdesc that is overlayed with struct folio ]
folio_put(folio);
Working fully.
I do not want to mess with the existing folio_put() code just because
KHO can't preserve __folio_alloc_node().
Someday I think we will switch to a flow more like
memory = memdesc_alloc(&ioptdesc, order >= 0);
[.. init struct ioptdesc that is a new allocation]
kho_preserve_memdesc(ioptdesc)
// kexec
memory = kho_restore_memdesc(phys, &ioptdesc)
[.. init struct ioptdesc that is a new allocation]
memdesc_free(memory, ioptdesc);
Jason
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 09/16] kexec: enable KHO support for memory preservation
2025-04-09 16:28 ` Jason Gunthorpe
@ 2025-04-10 16:51 ` Matthew Wilcox
2025-04-10 17:31 ` Jason Gunthorpe
0 siblings, 1 reply; 103+ messages in thread
From: Matthew Wilcox @ 2025-04-10 16:51 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Mike Rapoport, Pratyush Yadav, Changyuan Lyu, linux-kernel, graf,
akpm, luto, anthony.yznaga, arnd, ashish.kalra, benh, bp,
catalin.marinas, dave.hansen, dwmw2, ebiederm, mingo, jgowans,
corbet, krzk, mark.rutland, pbonzini, pasha.tatashin, hpa, peterz,
robh+dt, robh, saravanak, skinsburskii, rostedt, tglx,
thomas.lendacky, usama.arif, will, devicetree, kexec,
linux-arm-kernel, linux-doc, linux-mm, x86
On Wed, Apr 09, 2025 at 01:28:37PM -0300, Jason Gunthorpe wrote:
> On Wed, Apr 09, 2025 at 07:19:30PM +0300, Mike Rapoport wrote:
> > But we have memdesc today, it's struct page.
>
> No, I don't think it is. struct page seems to be turning into
> something legacy that indicates the code has not been converted to the
> new stuff yet.
No, struct page will be with us for a while. Possibly forever. I have
started reluctantly talking about a future in which there aren't struct
pages, but it's really premature at this point. That's a 2030 kind
of future.
For 2025-2029, we will still have alloc_page(s)(). It's just that
the size of struct page will be gradually shrinking over that time.
> > And when the data structure that memdesc points to will be allocated
> > separately folios won't make sense for order-0 allocations.
>
> At that point the lowest level allocator function will be allocating
> the memdesc along with the struct page. Then folio will become
> restricted to only actual folio memdescs and alot of the type punning
> should go away. We are not there yet.
We'll have a few allocator functions. There'll be a slab_alloc(),
folio_alloc(), pt_alloc() and so on. I sketched out how these might
work last year:
https://kernelnewbies.org/MatthewWilcox/FolioAlloc
> > > The lowest allocator primitive returns folios, which can represent any
> > > order, and the caller casts to their own memdesc.
> >
> > The lowest allocation primitive returns pages.
>
> Yes, but as I understand things, we should not be calling that
> interface in new code because we are trying to make 'struct page' go
> away.
>
> Instead you should use the folio interfaces and cast to your own
> memdesc, or use an allocator interface that returns void * (ie slab)
> and never touch the struct page area.
>
> AFAICT, and I just wrote one of these..
Casting is the best you can do today because I haven't provided a better
interface yet.
> > And I don't think folio will be a lowest primitive buddy returns anytime
> > soon if ever.
>
> Maybe not internally, but driver facing, I think it should be true.
>
> Like I just completely purged all struct page from the iommu code:
>
> https://lore.kernel.org/linux-iommu/0-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com/
>
> I don't want some weird KHO interface that doesn't align with using
> __folio_alloc_node() and folio_put() as the lowest level allocator
> interface.
I think it's fine to say "the KHO interface doesn't support bare pages;
you must have a memdesc". But I'm not sure that's the right approach.
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 09/16] kexec: enable KHO support for memory preservation
2025-04-10 16:51 ` Matthew Wilcox
@ 2025-04-10 17:31 ` Jason Gunthorpe
0 siblings, 0 replies; 103+ messages in thread
From: Jason Gunthorpe @ 2025-04-10 17:31 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Mike Rapoport, Pratyush Yadav, Changyuan Lyu, linux-kernel, graf,
akpm, luto, anthony.yznaga, arnd, ashish.kalra, benh, bp,
catalin.marinas, dave.hansen, dwmw2, ebiederm, mingo, jgowans,
corbet, krzk, mark.rutland, pbonzini, pasha.tatashin, hpa, peterz,
robh+dt, robh, saravanak, skinsburskii, rostedt, tglx,
thomas.lendacky, usama.arif, will, devicetree, kexec,
linux-arm-kernel, linux-doc, linux-mm, x86
On Thu, Apr 10, 2025 at 05:51:51PM +0100, Matthew Wilcox wrote:
> On Wed, Apr 09, 2025 at 01:28:37PM -0300, Jason Gunthorpe wrote:
> > On Wed, Apr 09, 2025 at 07:19:30PM +0300, Mike Rapoport wrote:
> > > But we have memdesc today, it's struct page.
> >
> > No, I don't think it is. struct page seems to be turning into
> > something legacy that indicates the code has not been converted to the
> > new stuff yet.
>
> No, struct page will be with us for a while. Possibly forever. I have
> started reluctantly talking about a future in which there aren't struct
> pages, but it's really premature at this point. That's a 2030 kind
> of future.
I was trying to say that code that uses the struct page type might be
thought of as 'old code', and maybe in drivers that have been updated
to use KHO we can also insist that they be modernized away from struct
page?
> For 2025-2029, we will still have alloc_page(s)(). It's just that
> the size of struct page will be gradually shrinking over that time.
For instance while we still have alloc_pages(), yes, but we could say
that KHO enabled drivers should not use it and be migrated to
folio_alloc or slab instead.
> > I don't want some weird KHO interface that doesn't align with using
> > __folio_alloc_node() and folio_put() as the lowest level allocator
> > interface.
>
> I think it's fine to say "the KHO interface doesn't support bare pages;
> you must have a memdesc". But I'm not sure that's the right approach.
The KHO interface needs to know how to initialize the memdesc. So if
the very lowest allocator function is:
page = alloc_page_desc(gfp, order, MEMDESC_TYPE_FOLIO(folio));
Then we need a KHO 'restore' version of that that also accepts the
MEMDESC_TYPE_XX() too.
Bare pages would be built on top of that layering and supply a memdesc
that is equivalent to whatever a normal allocation of a bare page
would get, so that normal freeing of a bare page works properly.
IOW the KHO preserve/restore APIs should mirror alloc/free APIs.
struct folio *folio = folio_alloc()
kho_folio_preserve(folio)
folio = kho_folio_restore()
folio_put()
void *vmem = kvmalloc(PAGE_SIZE * N);
kho_vmalloc_preserve(vmem)
vmem = kho_vmalloc_restore()
kvfree(vmem)
void *mem = kmalloc(PAGE_SIZE)
kho_slab_preserve(mem)
mem = kho_slab_restore()
kfree(mem)
The point of the restore function is to put everything back so that
the matching free function works.
I'm guessing if we imagine the above 3 options, then in a memdesc
world they will all be implemented under the covers by doing some
internal kho_restore_page_desc() which will be the lowest level
primitive.
So, I'm not sure what the API should be for bare pages (i.e. no use of
the memdesc). kmalloc(PAGE_SIZE) would certainly be nice if we can
make it work.
Jason
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 12/16] arm64: add KHO support
2025-03-20 1:55 ` [PATCH v5 12/16] arm64: add KHO support Changyuan Lyu
2025-03-20 7:13 ` Krzysztof Kozlowski
@ 2025-04-11 3:47 ` Changyuan Lyu
1 sibling, 0 replies; 103+ messages in thread
From: Changyuan Lyu @ 2025-04-11 3:47 UTC (permalink / raw)
To: changyuanl
Cc: akpm, anthony.yznaga, arnd, ashish.kalra, benh, bp,
catalin.marinas, corbet, dave.hansen, devicetree, dwmw2, ebiederm,
graf, hpa, jgowans, kexec, krzk, linux-arm-kernel, linux-doc,
linux-kernel, linux-mm, luto, mark.rutland, mingo, pasha.tatashin,
pbonzini, peterz, ptyadav, robh+dt, robh, rostedt, rppt,
saravanak, skinsburskii, tglx, thomas.lendacky, will, x86
On Wed, Mar 19, 2025 at 18:55:47 -0700, Changyuan Lyu <changyuanl@google.com> wrote:
> From: Alexander Graf <graf@amazon.com>
> [...]
> +/**
> + * early_init_dt_check_kho - Decode info required for kexec handover from DT
> + */
> +static void __init early_init_dt_check_kho(void)
> +{
> + unsigned long node = chosen_node_offset;
> + u64 kho_start, scratch_start, scratch_size;
> + const __be32 *p;
> + int l;
> +
> + if (!IS_ENABLED(CONFIG_KEXEC_HANDOVER) || (long)node < 0)
> + return;
> +
> + p = of_get_flat_dt_prop(node, "linux,kho-fdt", &l);
> + if (l != (dt_root_addr_cells + dt_root_size_cells) * sizeof(__be32))
> + return;
> +
> + kho_start = dt_mem_next_cell(dt_root_addr_cells, &p);
> +
> + p = of_get_flat_dt_prop(node, "linux,kho-scratch", &l);
> + if (l != (dt_root_addr_cells + dt_root_size_cells) * sizeof(__be32))
> + return;
> +
> + scratch_start = dt_mem_next_cell(dt_root_addr_cells, &p);
> + scratch_size = dt_mem_next_cell(dt_root_addr_cells, &p);
> +
> + kho_populate(kho_start, scratch_start, scratch_size);
> +}
> [...]
> +static int kho_add_chosen(const struct kimage *image, void *fdt, int chosen_node)
> +{
> + int ret = 0;
> +#ifdef CONFIG_KEXEC_HANDOVER
> + phys_addr_t dt_mem = 0;
> + phys_addr_t dt_len = 0;
> + phys_addr_t scratch_mem = 0;
> + phys_addr_t scratch_len = 0;
> +
> + if (!image->kho.fdt || !image->kho.scratch)
> + return 0;
> +
> + dt_mem = image->kho.fdt->mem;
> + dt_len = image->kho.fdt->memsz;
> +
> + scratch_mem = image->kho.scratch->mem;
> + scratch_len = image->kho.scratch->bufsz;
> +
> + pr_debug("Adding kho metadata to DT");
> +
> + ret = fdt_appendprop_addrrange(fdt, 0, chosen_node, "linux,kho-fdt",
> + dt_mem, dt_len);
> + if (ret)
> + return ret;
> +
> + ret = fdt_appendprop_addrrange(fdt, 0, chosen_node, "linux,kho-scratch",
> + scratch_mem, scratch_len);
While testing on ARM64 today, I realized that calling "fdt_appendprop_addrrange"
here introduces a bug that prevents consecutive KHO-enabled kexecs.
Suppose we do a KHO kexec from kernel 1 to kernel 2 and then from kernel 2 to
kernel 3. The firmware DT that kernel 2 got from kernel 1 already has
"linux,kho-fdt" and "linux,kho-scratch" in the "chosen" node. When KHO
kexec-ing to kernel 3 from kernel 2, kernel 2 will __append__ its KHO
FDT address to "linux,kho-fdt". Thus the "linux,kho-fdt" received by
kernel 3 contains 2 address ranges, and kernel 3 fails in
early_init_dt_check_kho() above.
I will fix this bug in v6.
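For illustration only, one possible shape of the fix (not necessarily
what v6 will do) is to delete the properties inherited from the previous
KHO boot before appending the new values, ignoring -FDT_ERR_NOTFOUND on
a first boot:

	ret = fdt_delprop(fdt, chosen_node, "linux,kho-fdt");
	if (ret && ret != -FDT_ERR_NOTFOUND)
		return ret;

	ret = fdt_delprop(fdt, chosen_node, "linux,kho-scratch");
	if (ret && ret != -FDT_ERR_NOTFOUND)
		return ret;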
> +
> +#endif /* CONFIG_KEXEC_HANDOVER */
> + return ret;
> +}
> [...]
Best,
Changyuan
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 09/16] kexec: enable KHO support for memory preservation
2025-04-02 19:16 ` Pratyush Yadav
2025-04-03 11:42 ` Jason Gunthorpe
2025-04-03 13:57 ` Mike Rapoport
@ 2025-04-11 4:02 ` Changyuan Lyu
2 siblings, 0 replies; 103+ messages in thread
From: Changyuan Lyu @ 2025-04-11 4:02 UTC (permalink / raw)
To: ptyadav
Cc: akpm, anthony.yznaga, arnd, ashish.kalra, benh, bp,
catalin.marinas, changyuanl, corbet, dave.hansen, devicetree,
dwmw2, ebiederm, graf, hpa, jgg, jgowans, kexec, krzk,
linux-arm-kernel, linux-doc, linux-kernel, linux-mm, luto,
mark.rutland, mingo, pasha.tatashin, pbonzini, peterz, robh+dt,
robh, rostedt, rppt, saravanak, skinsburskii, tglx,
thomas.lendacky, will, x86
Hi Pratyush,
Thanks for reviewing!
On Wed, Apr 02, 2025 at 19:16:27 +0000, Pratyush Yadav <ptyadav@amazon.de> wrote:
> Hi Changyuan,
>
> On Wed, Mar 19 2025, Changyuan Lyu wrote:
> > [...]
> > +int kho_preserve_phys(phys_addr_t phys, size_t size)
> > +{
> > + unsigned long pfn = PHYS_PFN(phys), end_pfn = PHYS_PFN(phys + size);
> > + unsigned int order = ilog2(end_pfn - pfn);
>
> This caught my eye when playing around with the code. It does not put
> any limit on the order, so it can exceed NR_PAGE_ORDERS.
I agree with Mike that this should not be a problem.
> Also, when
> initializing the page after KHO, we pass the order directly to
> prep_compound_page() without sanity checking it. The next kernel might
> not support all the orders the current one supports. Perhaps something
> to fix?
Yes, the new kernel should check the order.
> > + unsigned long failed_pfn;
> > + int err = 0;
> > +
> > + if (!kho_enable)
> > + return -EOPNOTSUPP;
> > +
> > + down_read(&kho_out.tree_lock);
> > + if (kho_out.fdt) {
> > + err = -EBUSY;
> > + goto unlock;
> > + }
> > +
> > + for (; pfn < end_pfn;
> > + pfn += (1 << order), order = ilog2(end_pfn - pfn)) {
> > + err = __kho_preserve(&kho_mem_track, pfn, order);
I realized another bug here: we did not check whether "pfn" is aligned to
1 << order. For example, if the function inputs are
@phys = 4096 and @size = 8192, then in the 1st iteration pfn = 1, end_pfn = 3,
and order = 1. This is problematic since these 2 pages should be viewed
as 2 folios of order 0, not as 1 folio of order 1.
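A minimal sketch of one way to pick an aligned order (for illustration
only; the actual v6 change may differ, and capping at MAX_PAGE_ORDER is
glossed over here):

	for (; pfn < end_pfn; pfn += 1UL << order) {
		/*
		 * Cap the order by the remaining length and by the
		 * natural alignment of pfn, so each chunk is an
		 * aligned power-of-two block.
		 */
		order = ilog2(end_pfn - pfn);
		if (pfn)
			order = min_t(unsigned int, order, __ffs(pfn));

		err = __kho_preserve(&kho_mem_track, pfn, order);
		if (err)
			break;
	}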
> > + if (err) {
> > + failed_pfn = pfn;
> > + break;
> > + }
> > + }
> [...]
I will fix the 2 bugs above in V6.
Best,
Changyuan
^ permalink raw reply [flat|nested] 103+ messages in thread
End of thread (newest: 2025-04-11 4:02 UTC)
Thread overview: 103+ messages
2025-03-20 1:55 [PATCH v5 00/16] kexec: introduce Kexec HandOver (KHO) Changyuan Lyu
2025-03-20 1:55 ` [PATCH v5 01/16] kexec: define functions to map and unmap segments Changyuan Lyu
2025-03-20 1:55 ` [PATCH v5 02/16] mm/mm_init: rename init_reserved_page to init_deferred_page Changyuan Lyu
2025-03-20 7:10 ` Krzysztof Kozlowski
2025-03-20 17:15 ` Changyuan Lyu
2025-03-20 1:55 ` [PATCH v5 03/16] memblock: add MEMBLOCK_RSRV_KERN flag Changyuan Lyu
2025-03-20 1:55 ` [PATCH v5 04/16] memblock: Add support for scratch memory Changyuan Lyu
2025-03-20 1:55 ` [PATCH v5 05/16] memblock: introduce memmap_init_kho_scratch() Changyuan Lyu
2025-03-20 1:55 ` [PATCH v5 06/16] hashtable: add macro HASHTABLE_INIT Changyuan Lyu
2025-03-20 1:55 ` [PATCH v5 07/16] kexec: add Kexec HandOver (KHO) generation helpers Changyuan Lyu
2025-03-21 13:34 ` Jason Gunthorpe
2025-03-23 19:02 ` Changyuan Lyu
2025-03-24 16:28 ` Jason Gunthorpe
2025-03-25 0:21 ` Changyuan Lyu
2025-03-25 2:20 ` Jason Gunthorpe
2025-03-24 18:40 ` Frank van der Linden
2025-03-25 19:19 ` Mike Rapoport
2025-03-25 21:56 ` Frank van der Linden
2025-03-26 11:59 ` Mike Rapoport
2025-03-26 16:25 ` Frank van der Linden
2025-03-20 1:55 ` [PATCH v5 08/16] kexec: add KHO parsing support Changyuan Lyu
2025-03-20 1:55 ` [PATCH v5 09/16] kexec: enable KHO support for memory preservation Changyuan Lyu
2025-03-21 13:46 ` Jason Gunthorpe
2025-03-22 19:12 ` Mike Rapoport
2025-03-23 18:55 ` Jason Gunthorpe
2025-03-24 18:18 ` Mike Rapoport
2025-03-24 20:07 ` Jason Gunthorpe
2025-03-26 12:07 ` Mike Rapoport
2025-03-23 19:07 ` Changyuan Lyu
2025-03-25 2:04 ` Jason Gunthorpe
2025-03-27 10:03 ` Pratyush Yadav
2025-03-27 13:31 ` Jason Gunthorpe
2025-03-27 17:28 ` Pratyush Yadav
2025-03-28 12:53 ` Jason Gunthorpe
2025-04-02 16:44 ` Changyuan Lyu
2025-04-02 16:47 ` Pratyush Yadav
2025-04-02 18:37 ` Pasha Tatashin
2025-04-02 18:49 ` Pratyush Yadav
2025-04-02 19:16 ` Pratyush Yadav
2025-04-03 11:42 ` Jason Gunthorpe
2025-04-03 13:58 ` Mike Rapoport
2025-04-03 14:24 ` Jason Gunthorpe
2025-04-04 9:54 ` Mike Rapoport
2025-04-04 12:47 ` Jason Gunthorpe
2025-04-04 13:53 ` Mike Rapoport
2025-04-04 14:30 ` Jason Gunthorpe
2025-04-04 16:24 ` Pratyush Yadav
2025-04-04 17:31 ` Jason Gunthorpe
2025-04-06 16:13 ` Mike Rapoport
2025-04-06 16:11 ` Mike Rapoport
2025-04-07 14:16 ` Jason Gunthorpe
2025-04-07 16:31 ` Mike Rapoport
2025-04-07 17:03 ` Jason Gunthorpe
2025-04-09 9:06 ` Mike Rapoport
2025-04-09 12:56 ` Jason Gunthorpe
2025-04-09 13:58 ` Mike Rapoport
2025-04-09 15:37 ` Jason Gunthorpe
2025-04-09 16:19 ` Mike Rapoport
2025-04-09 16:28 ` Jason Gunthorpe
2025-04-10 16:51 ` Matthew Wilcox
2025-04-10 17:31 ` Jason Gunthorpe
2025-04-09 16:28 ` Mike Rapoport
2025-04-09 18:32 ` Jason Gunthorpe
2025-04-04 16:15 ` Pratyush Yadav
2025-04-06 16:34 ` Mike Rapoport
2025-04-07 14:23 ` Jason Gunthorpe
2025-04-03 13:57 ` Mike Rapoport
2025-04-11 4:02 ` Changyuan Lyu
2025-04-03 15:50 ` Pratyush Yadav
2025-04-03 16:10 ` Jason Gunthorpe
2025-04-03 17:37 ` Pratyush Yadav
2025-04-04 12:54 ` Jason Gunthorpe
2025-04-04 15:39 ` Pratyush Yadav
2025-04-09 8:35 ` Mike Rapoport
2025-03-20 1:55 ` [PATCH v5 10/16] kexec: add KHO support to kexec file loads Changyuan Lyu
2025-03-21 13:48 ` Jason Gunthorpe
2025-03-20 1:55 ` [PATCH v5 11/16] kexec: add config option for KHO Changyuan Lyu
2025-03-20 7:10 ` Krzysztof Kozlowski
2025-03-20 17:18 ` Changyuan Lyu
2025-03-24 4:18 ` Dave Young
2025-03-24 19:26 ` Pasha Tatashin
2025-03-25 1:24 ` Dave Young
2025-03-25 3:07 ` Dave Young
2025-03-25 6:57 ` Baoquan He
2025-03-25 8:36 ` Dave Young
2025-03-26 9:17 ` Dave Young
2025-03-26 11:28 ` Mike Rapoport
2025-03-26 12:09 ` Dave Young
2025-03-25 14:04 ` Pasha Tatashin
2025-03-20 1:55 ` [PATCH v5 12/16] arm64: add KHO support Changyuan Lyu
2025-03-20 7:13 ` Krzysztof Kozlowski
2025-03-20 8:30 ` Krzysztof Kozlowski
2025-03-20 23:29 ` Changyuan Lyu
2025-04-11 3:47 ` Changyuan Lyu
2025-03-20 1:55 ` [PATCH v5 13/16] x86/setup: use memblock_reserve_kern for memory used by kernel Changyuan Lyu
2025-03-20 1:55 ` [PATCH v5 14/16] x86: add KHO support Changyuan Lyu
2025-03-20 1:55 ` [PATCH v5 15/16] memblock: add KHO support for reserve_mem Changyuan Lyu
2025-03-20 1:55 ` [PATCH v5 16/16] Documentation: add documentation for KHO Changyuan Lyu
2025-03-20 14:45 ` Jonathan Corbet
2025-03-21 6:33 ` Changyuan Lyu
2025-03-21 13:46 ` Jonathan Corbet
2025-03-25 14:19 ` [PATCH v5 00/16] kexec: introduce Kexec HandOver (KHO) Pasha Tatashin
2025-03-25 15:03 ` Mike Rapoport