From: Pratyush Yadav <pratyush@kernel.org>
To: Mike Rapoport <rppt@kernel.org>,
Pasha Tatashin <pasha.tatashin@soleen.com>,
Pratyush Yadav <pratyush@kernel.org>,
Alexander Graf <graf@amazon.com>,
Muchun Song <muchun.song@linux.dev>,
Oscar Salvador <osalvador@suse.de>,
David Hildenbrand <david@kernel.org>,
Andrew Morton <akpm@linux-foundation.org>,
Jason Miu <jasonmiu@google.com>,
Jork Loeser <jloeser@linux.microsoft.com>
Cc: kexec@lists.infradead.org, linux-mm@kvack.org,
linux-kernel@vger.kernel.org
Subject: [PATCH v2 15/18] kho: extend scratch
Date: Fri, 5 Jun 2026 20:34:48 +0200 [thread overview]
Message-ID: <20260605183501.3884950-16-pratyush@kernel.org> (raw)
In-Reply-To: <20260605183501.3884950-1-pratyush@kernel.org>
From: "Pratyush Yadav (Google)" <pratyush@kernel.org>
Motivation
==========
The scratch space is allocated by the first kernel in the KHO chain, and
is reused by all subsequent kernels. The size of the space is either set
via the commandline by the system administrator or by calculating the
amount of memory used by the kernel and adding a multiplier. In either
case, the scratch space is a heuristic and is liable to fill up and fail
allocation if a kernel uses more memory than expected.
In addition, gigantic huge pages (usually 1 GiB) are allocated via
memblock, and in a KHO boot that memory comes from the scratch space. In
hypervisors it is common to dedicate a major part of the system's memory
to gigantic hugepages for VM memory.
If this memory needs to come from scratch space, then scratch needs to
be greater than the memory needed for huge pages, which is impractical.
In addition, hugepages can be preserved memory. Allocating them from
scratch violates the assumption that scratch contains no preserved
memory.
Methodology
===========
Discover areas that don't contain any preserved memory at boot by
walking the preserved memory radix tree. Mark them as scratch to allow
allocations from them. This makes KHO more resilient to memory pressure
and allows supporting huge page preservation.
Since the preserved memory radix tree mixes both physical address and
order into a single key, and does not track table pages, it is difficult
to identify free areas from it directly. Walk the tree and digest it
down into another radix tree. The latter tracks blocks of
KHO_EXT_SHIFT (1 GiB as of now) granularity. Then walk the digested tree
and mark the areas between the present keys as scratch.
Performance
===========
The discovery algorithm traverses the preserved memory radix tree
exactly once. While it does use memory for the digested radix tree,
since the blocks are split by 1 GiB, a single bitmap with 4k pages can
track up to 32 TiB of memory. So there are likely to be very few radix
tree pages used in this tracking. For systems with all physical memory
below 32 TiB, this should result in a total of 6 pages being
used (KHO_TREE_MAX_DEPTH == 6).
An alternate way of achieving this would be to call kho_mem_retrieve()
earlier in boot and mark all the KHO preservations as reserved. But that
can blow up memblock.reserved with a bunch of 4K pages scattered
everywhere, which will reduce performance of subsequent allocations.
Since the free blocks are tracked in chunks of 1 GiB, this won't blow up
memblock.memory as much.
There is no inherent reason for using 1 GiB as the discovered block
size. This can be changed later if needed. Currently, KHO is mainly
targeted for server grade systems with hundreds of gigabytes to
terabytes of memory. So 1 GiB is a reasonable granularity for those
systems. For smaller systems this doesn't work as well, but we can
arrive at a better heuristic when we have concrete use cases.
Practical evaluation
====================
The testing is done on a x86_64 qemu VM running under KVM with 64G
memory and 12 CPUs. The machine pre-allocates 50 1G pages.
Since the performance scales with how busy the radix tree is, tests are
done with 2 preservation patterns: first with two 1M memfds, second with
two 1G memfds, both using 4k pages.
Test case 1 - 1M memfd
~~~~~~~~~~~~~~~~~~~~~~
This test case has two memfds with 1M memory each in 4k pages, plus
other preservations from LUO core and other KHO users.
This is how the radix tree stats look like:
radix_nodes: 0x13
nr_preservations: 0x214
mem_preserved: 0x227000
per order preservations:
order 0: 0x20f
order 1: 0x4
order 4: 0x1
and this is how long it takes to extend the scratch after KHO boot:
KHO: KHO extend time: 47 us
KHO: KHO extend total mem: 0xe6c17b000 (~57G)
Test case 2 - 1G memfd
~~~~~~~~~~~~~~~~~~~~~~
This test case has two memfds with 1G memory each in 4k pages, plus
other preservations from LUO core and other KHO users.
This is how the radix tree stats look like:
radix_nodes: 0x28
nr_preservations: 0x80816
mem_preserved: 0x80829000
per order preservations:
order 0: 0x80811
order 1: 0x4
order 4: 0x1
and this is how long it takes to extend the scratch after KHO boot:
KHO: KHO extend time: 22514 us
KHO: KHO extend total mem: 0xd3f200000 (~52G)
Signed-off-by: Pratyush Yadav (Google) <pratyush@kernel.org>
---
Notes:
As one might notice, the "scratch" terminology starts to break down
here. There is the original "scratch", which is passed down by the
previous kernel. It is marked MEMBLOCK_KHO_SCRATCH. There is also the
discovered "scratch", which also gets marked MEMBLOCK_KHO_SCRATCH, but
has nothing to do with the former.
For limiting the scope of this series, I haven't done the rename here. I
can do it as a follow up series once this stabilizes and lands into
-next.
I suggest the following scheme:
- Rename "KHO scratch" to "KHO bootmem". Update the documentation and
all code to use this name. We have the kho_scratch kernel cmdline
parameter, which is harder to change, but perhaps we can rename it to
"kho_bootmem" and if someone complains we can add it back.
- Rename MEMBLOCK_KHO_SCRATCH to MEMBLOCK_KHO_NOPRSRV. This describes
the property of the memory not its origin. Then KHO can mark its
"bootmem" as KHO_NOPRSRV because bootmem never has any preserved
memory. Later, kho_extend_scratch() (which is also due for a better
name) can also mark its discovered areas as KHO_NOPRSRV.
kernel/liveupdate/kexec_handover.c | 149 +++++++++++++++++++++++++----
1 file changed, 132 insertions(+), 17 deletions(-)
diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c
index af22086ca2d6..8540608b8602 100644
--- a/kernel/liveupdate/kexec_handover.c
+++ b/kernel/liveupdate/kexec_handover.c
@@ -84,6 +84,23 @@ static struct kho_out kho_out = {
},
};
+struct kho_in {
+ phys_addr_t fdt_phys;
+ phys_addr_t scratch_phys;
+ char previous_release[__NEW_UTS_LEN + 1];
+ u32 kexec_count;
+ struct kho_debugfs dbg;
+ struct kho_radix_tree radix_tree;
+};
+
+static struct kho_in kho_in = {
+};
+
+static const void *kho_get_fdt(void)
+{
+ return kho_in.fdt_phys ? phys_to_virt(kho_in.fdt_phys) : NULL;
+}
+
/**
* kho_encode_radix_key - Encodes a physical address and order into a radix key.
* @phys: The physical address of the page.
@@ -869,6 +886,119 @@ static void __init kho_reserve_scratch(void)
kho_enable = false;
}
+#define KHO_EXT_SHIFT 30 /* 1 GiB */
+
+static int __init kho_ext_walk_key(unsigned long key, void *data)
+{
+ struct kho_radix_tree *tree = data;
+ phys_addr_t start, end;
+ unsigned int order;
+ int err;
+
+ start = kho_decode_radix_key(key, &order);
+ end = start + (1UL << (order + PAGE_SHIFT));
+
+ while (start < end) {
+ err = kho_radix_add_key(tree, start >> KHO_EXT_SHIFT);
+ if (err)
+ return err;
+
+ start += (1UL << KHO_EXT_SHIFT);
+ }
+
+ return 0;
+}
+
+static int __init kho_ext_walk_node(phys_addr_t phys, void *data)
+{
+ struct kho_radix_tree *tree = data;
+
+ return kho_radix_add_key(tree, phys >> KHO_EXT_SHIFT);
+}
+
+static int __init kho_ext_mark_scratch(unsigned long key, void *data)
+{
+ phys_addr_t *prev_end = data;
+ phys_addr_t start = key << KHO_EXT_SHIFT;
+ int err;
+
+ if (start > *prev_end) {
+ err = memblock_mark_kho_scratch(*prev_end, start - *prev_end);
+ if (err)
+ return err;
+ }
+
+ *prev_end = start + (1UL << KHO_EXT_SHIFT);
+ return 0;
+}
+
+/**
+ * kho_extend_scratch - Extend the scratch regions
+ *
+ * The KHO radix tree mixes both physical address and order into a single key.
+ * This makes it hard to look for free ranges directly. This function first
+ * walks the radix tree and digests it down into another radix tree, whose keys
+ * identify blocks of KHO_EXT_SHIFT which contain preserved memory.
+ *
+ * Then it walks the digested radix tree and marks everything that doesn't have
+ * preserved memory as scratch.
+ *
+ * NOTE: This function allocates memory so it should be called when scratch has
+ * available space.
+ *
+ * NOTE: The pages of the KHO radix tree tables are not marked as preserved in
+ * the KHO tree. But they are expected to remain untouched until the tree is
+ * fully parsed. So this function also considers them to be "preserved memory"
+ * and marks their blocks as busy.
+ */
+static void __init kho_extend_scratch(void)
+{
+ const struct kho_radix_walk_cb kho_cb = {
+ .leaf = kho_ext_walk_key,
+ .node = kho_ext_walk_node,
+ };
+ const struct kho_radix_walk_cb ext_cb = {
+ .leaf = kho_ext_mark_scratch,
+ };
+ struct kho_radix_tree radix;
+ phys_addr_t prev_end = 0;
+ int err = 0;
+
+ if (!is_kho_boot())
+ return;
+
+ /* Make sure the KHO radix tree is initialized. */
+ err = kho_radix_init_tree(&kho_in.radix_tree,
+ kho_get_mem_map(kho_get_fdt()));
+ if (err)
+ goto print;
+
+ err = kho_radix_init_tree(&radix, NULL);
+ if (err)
+ goto print;
+
+ /* Walk the KHO radix tree to find busy blocks. */
+ err = kho_radix_walk_tree(&kho_in.radix_tree, &kho_cb, &radix);
+ if (err)
+ goto out;
+
+ /* Walk the blocks and mark everything between keys as scratch. */
+ err = kho_radix_walk_tree(&radix, &ext_cb, &prev_end);
+ if (err)
+ goto out;
+
+ /* Mark everything from last busy block to end of DRAM. */
+ if (prev_end < memblock_end_of_DRAM())
+ err = memblock_mark_kho_scratch(prev_end, memblock_end_of_DRAM() - prev_end);
+
+ /* fallthrough */
+out:
+ kho_radix_destroy_tree(&radix);
+print:
+ if (err)
+ pr_err("Failed to extend scratch: %pe\n", ERR_PTR(err));
+}
+
/**
* kho_add_subtree - record the physical address of a sub blob in KHO root tree.
* @name: name of the sub tree.
@@ -1443,23 +1573,6 @@ void kho_restore_free(void *mem)
}
EXPORT_SYMBOL_GPL(kho_restore_free);
-struct kho_in {
- phys_addr_t fdt_phys;
- phys_addr_t scratch_phys;
- char previous_release[__NEW_UTS_LEN + 1];
- u32 kexec_count;
- struct kho_debugfs dbg;
- struct kho_radix_tree radix_tree;
-};
-
-static struct kho_in kho_in = {
-};
-
-static const void *kho_get_fdt(void)
-{
- return kho_in.fdt_phys ? phys_to_virt(kho_in.fdt_phys) : NULL;
-}
-
/**
* is_kho_boot - check if current kernel was booted via KHO-enabled
* kexec
@@ -1763,6 +1876,8 @@ void __init kho_memory_init_early(void)
*/
if (kho_in.scratch_phys)
kho_scratch = phys_to_virt(kho_in.scratch_phys);
+
+ kho_extend_scratch();
}
void __init kho_memory_init(void)
--
2.54.0.1032.g2f8565e1d1-goog
next prev parent reply other threads:[~2026-06-05 18:35 UTC|newest]
Thread overview: 21+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-06-05 18:34 [PATCH v2 00/18] kho: make boot time huge page allocation work nicely with KHO Pratyush Yadav
2026-06-05 18:34 ` [PATCH v2 01/18] kho: generalize radix tree APIs Pratyush Yadav
2026-06-05 18:34 ` [PATCH v2 02/18] kho: disallow wide keys in radix tree Pratyush Yadav
2026-06-05 22:06 ` Jork Loeser
2026-06-08 9:10 ` Pratyush Yadav
2026-06-05 18:34 ` [PATCH v2 03/18] kho: return virtual address of mem_map Pratyush Yadav
2026-06-05 18:34 ` [PATCH v2 04/18] kho: store incoming radix tree in kho_in Pratyush Yadav
2026-06-05 18:34 ` [PATCH v2 05/18] kho: move all memory retrieval logic to kho_mem_retrieve() Pratyush Yadav
2026-06-05 18:34 ` [PATCH v2 06/18] kho: add a struct for radix callbacks Pratyush Yadav
2026-06-05 18:34 ` [PATCH v2 07/18] kho: add callback for table pages Pratyush Yadav
2026-06-05 18:34 ` [PATCH v2 08/18] kho: add data argument to radix walk callback Pratyush Yadav
2026-06-05 18:34 ` [PATCH v2 09/18] kho: allow early-boot usage of the KHO radix tree Pratyush Yadav
2026-06-05 18:34 ` [PATCH v2 10/18] kho: allow destroying " Pratyush Yadav
2026-06-05 18:34 ` [PATCH v2 11/18] kho: add kho_radix_init_tree() Pratyush Yadav
2026-06-05 18:34 ` [PATCH v2 12/18] kho: export kho_scratch_overlap() Pratyush Yadav
2026-06-05 18:34 ` [PATCH v2 13/18] kho: initialize kho_scratch pointer earlier in boot Pratyush Yadav
2026-06-05 18:34 ` [PATCH v2 14/18] memblock: use kho_scratch_overlap() to decide migratetype Pratyush Yadav
2026-06-05 18:34 ` Pratyush Yadav [this message]
2026-06-05 18:34 ` [PATCH v2 16/18] memblock: make HugeTLB bootmem allocation work with KHO Pratyush Yadav
2026-06-05 18:34 ` [PATCH v2 17/18] memblock: allow calculating reserved size by flags Pratyush Yadav
2026-06-05 18:34 ` [PATCH v2 18/18] kho: exclude hugetlb memory from scratch size calculation Pratyush Yadav
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260605183501.3884950-16-pratyush@kernel.org \
--to=pratyush@kernel.org \
--cc=akpm@linux-foundation.org \
--cc=david@kernel.org \
--cc=graf@amazon.com \
--cc=jasonmiu@google.com \
--cc=jloeser@linux.microsoft.com \
--cc=kexec@lists.infradead.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=muchun.song@linux.dev \
--cc=osalvador@suse.de \
--cc=pasha.tatashin@soleen.com \
--cc=rppt@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox