* [RFC PATCH 01/20] kho: generalize radix tree APIs
2026-05-28 0:41 [RFC PATCH 00/20] mshv: enable kexec with Hyper-V donated pages and partitions Jork Loeser
@ 2026-05-28 0:41 ` Jork Loeser
2026-05-28 1:22 ` sashiko-bot
2026-05-28 0:41 ` [RFC PATCH 02/20] kho: store incoming radix tree in kho_in Jork Loeser
` (18 subsequent siblings)
19 siblings, 1 reply; 37+ messages in thread
From: Jork Loeser @ 2026-05-28 0:41 UTC
To: linux-hyperv, linux-mm, kexec
Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Mike Rapoport, Pasha Tatashin, Pratyush Yadav, Alexander Graf,
Jason Miu, Andrew Morton, David Hildenbrand, Muchun Song,
Oscar Salvador, Baoquan He, Catalin Marinas, Will Deacon,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Kees Cook, Ran Xiaokai, Justinien Bouron,
Sourabh Jain, Pingfan Liu, Rafael J. Wysocki, Mario Limonciello,
linux-arm-kernel, x86, linux-kernel, Michael Kelley, Jork Loeser
From: "Pratyush Yadav (Google)" <pratyush@kernel.org>
The KHO radix tree is a data structure that tracks the presence or
absence of an arbitrary key; nothing in it is inherently tied to KHO
memory preservation tracking. This generality was a design goal, so that
other users of KHO can reuse the tree.
Despite that, the radix tree APIs are closely tied to KHO memory
preservation tracking. A key is added with kho_radix_add_page(), which
presents the operation as page tracking and takes a PFN and an order.
kho_radix_del_page() does the same. Both functions internally encode the
key that goes into the radix tree. kho_radix_walk_tree() has the same
problem: it bakes the physical address and order into the callback
arguments.
Generalize the APIs by taking the key directly and doing the encoding at
the callers. Rename the functions to kho_radix_add_key() and
kho_radix_del_key(). In practice, this removes a line each from the
functions and moves the encoding function call to the callers.
Similarly, update kho_radix_tree_walk_callback_t to take the key
directly.
To keep the naming convention consistent, rename
kho_radix_{encode,decode}_key() to kho_{encode,decode}_radix_key().
Signed-off-by: Pratyush Yadav (Google) <pratyush@kernel.org>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com>
---
include/linux/kho_radix_tree.h | 18 ++---
kernel/liveupdate/kexec_handover.c | 119 ++++++++++++++---------------
2 files changed, 63 insertions(+), 74 deletions(-)
diff --git a/include/linux/kho_radix_tree.h b/include/linux/kho_radix_tree.h
index 84e918b96e53..f368f3b9f923 100644
--- a/include/linux/kho_radix_tree.h
+++ b/include/linux/kho_radix_tree.h
@@ -34,30 +34,24 @@ struct kho_radix_tree {
struct mutex lock; /* protects the tree's structure and root pointer */
};
-typedef int (*kho_radix_tree_walk_callback_t)(phys_addr_t phys,
- unsigned int order);
+typedef int (*kho_radix_tree_walk_callback_t)(unsigned long key);
#ifdef CONFIG_KEXEC_HANDOVER
-int kho_radix_add_page(struct kho_radix_tree *tree, unsigned long pfn,
- unsigned int order);
-
-void kho_radix_del_page(struct kho_radix_tree *tree, unsigned long pfn,
- unsigned int order);
-
+int kho_radix_add_key(struct kho_radix_tree *tree, unsigned long key);
+void kho_radix_del_key(struct kho_radix_tree *tree, unsigned long key);
int kho_radix_walk_tree(struct kho_radix_tree *tree,
kho_radix_tree_walk_callback_t cb);
#else /* #ifdef CONFIG_KEXEC_HANDOVER */
-static inline int kho_radix_add_page(struct kho_radix_tree *tree, long pfn,
- unsigned int order)
+static inline int kho_radix_add_key(struct kho_radix_tree *tree, unsigned long key)
{
return -EOPNOTSUPP;
}
-static inline void kho_radix_del_page(struct kho_radix_tree *tree,
- unsigned long pfn, unsigned int order) { }
+static inline void kho_radix_del_key(struct kho_radix_tree *tree,
+ unsigned long key) { }
static inline int kho_radix_walk_tree(struct kho_radix_tree *tree,
kho_radix_tree_walk_callback_t cb)
diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c
index 4834a809985a..05a6eb56e176 100644
--- a/kernel/liveupdate/kexec_handover.c
+++ b/kernel/liveupdate/kexec_handover.c
@@ -85,7 +85,7 @@ static struct kho_out kho_out = {
};
/**
- * kho_radix_encode_key - Encodes a physical address and order into a radix key.
+ * kho_encode_radix_key - Encodes a physical address and order into a radix key.
* @phys: The physical address of the page.
* @order: The order of the page.
*
@@ -95,7 +95,7 @@ static struct kho_out kho_out = {
*
* Return: The encoded unsigned long radix key.
*/
-static unsigned long kho_radix_encode_key(phys_addr_t phys, unsigned int order)
+static unsigned long kho_encode_radix_key(phys_addr_t phys, unsigned int order)
{
/* Order bits part */
unsigned long h = 1UL << (KHO_ORDER_0_LOG2 - order);
@@ -106,17 +106,17 @@ static unsigned long kho_radix_encode_key(phys_addr_t phys, unsigned int order)
}
/**
- * kho_radix_decode_key - Decodes a radix key back into a physical address and order.
+ * kho_decode_radix_key - Decodes a radix key back into a physical address and order.
* @key: The unsigned long key to decode.
* @order: An output parameter, a pointer to an unsigned int where the decoded
* page order will be stored.
*
- * This function reverses the encoding performed by kho_radix_encode_key(),
+ * This function reverses the encoding performed by kho_encode_radix_key(),
* extracting the original physical address and page order from a given key.
*
* Return: The decoded physical address.
*/
-static phys_addr_t kho_radix_decode_key(unsigned long key, unsigned int *order)
+static phys_addr_t kho_decode_radix_key(unsigned long key, unsigned int *order)
{
unsigned int order_bit = fls64(key);
phys_addr_t phys;
@@ -144,24 +144,21 @@ static unsigned long kho_radix_get_table_index(unsigned long key,
}
/**
- * kho_radix_add_page - Marks a page as preserved in the radix tree.
+ * kho_radix_add_key - Add a key to the radix tree.
* @tree: The KHO radix tree.
- * @pfn: The page frame number of the page to preserve.
- * @order: The order of the page.
+ * @key: The key to add.
*
- * This function traverses the radix tree based on the key derived from @pfn
- * and @order. It sets the corresponding bit in the leaf bitmap to mark the
- * page for preservation. If intermediate nodes do not exist along the path,
- * they are allocated and added to the tree.
+ * This function traverses the radix tree based on the key provided. It sets the
+ * corresponding bit in the leaf bitmap to mark the key as present. If
+ * intermediate nodes do not exist along the path, they are allocated and added
+ * to the tree.
*
* Return: 0 on success, or a negative error code on failure.
*/
-int kho_radix_add_page(struct kho_radix_tree *tree,
- unsigned long pfn, unsigned int order)
+int kho_radix_add_key(struct kho_radix_tree *tree, unsigned long key)
{
/* Newly allocated nodes for error cleanup */
struct kho_radix_node *intermediate_nodes[KHO_TREE_MAX_DEPTH] = { 0 };
- unsigned long key = kho_radix_encode_key(PFN_PHYS(pfn), order);
struct kho_radix_node *anchor_node = NULL;
struct kho_radix_node *node = tree->root;
struct kho_radix_node *new_node;
@@ -224,22 +221,19 @@ int kho_radix_add_page(struct kho_radix_tree *tree,
return err;
}
-EXPORT_SYMBOL_GPL(kho_radix_add_page);
+EXPORT_SYMBOL_GPL(kho_radix_add_key);
/**
- * kho_radix_del_page - Removes a page's preservation status from the radix tree.
+ * kho_radix_del_key - Removes the key from the radix tree.
* @tree: The KHO radix tree.
- * @pfn: The page frame number of the page to unpreserve.
- * @order: The order of the page.
+ * @key: The key to remove.
*
* This function traverses the radix tree and clears the bit corresponding to
- * the page, effectively removing its "preserved" status. It does not free
- * the tree's intermediate nodes, even if they become empty.
+ * the key, effectively removing it from the tree. It does not free the tree's
+ * intermediate nodes, even if they become empty.
*/
-void kho_radix_del_page(struct kho_radix_tree *tree, unsigned long pfn,
- unsigned int order)
+void kho_radix_del_key(struct kho_radix_tree *tree, unsigned long key)
{
- unsigned long key = kho_radix_encode_key(PFN_PHYS(pfn), order);
struct kho_radix_node *node = tree->root;
struct kho_radix_leaf *leaf;
unsigned int i, idx;
@@ -270,21 +264,18 @@ void kho_radix_del_page(struct kho_radix_tree *tree, unsigned long pfn,
idx = kho_radix_get_bitmap_index(key);
__clear_bit(idx, leaf->bitmap);
}
-EXPORT_SYMBOL_GPL(kho_radix_del_page);
+EXPORT_SYMBOL_GPL(kho_radix_del_key);
static int kho_radix_walk_leaf(struct kho_radix_leaf *leaf,
unsigned long key,
kho_radix_tree_walk_callback_t cb)
{
unsigned long *bitmap = (unsigned long *)leaf;
- unsigned int order;
- phys_addr_t phys;
unsigned int i;
int err;
for_each_set_bit(i, bitmap, PAGE_SIZE * BITS_PER_BYTE) {
- phys = kho_radix_decode_key(key | i, &order);
- err = cb(phys, order);
+ err = cb(key | i);
if (err)
return err;
}
@@ -332,15 +323,14 @@ static int __kho_radix_walk_tree(struct kho_radix_node *root,
}
/**
- * kho_radix_walk_tree - Traverses the radix tree and calls a callback for each preserved page.
+ * kho_radix_walk_tree - Traverses the radix tree and calls a callback for each key.
* @tree: A pointer to the KHO radix tree to walk.
* @cb: A callback function of type kho_radix_tree_walk_callback_t that will be
- * invoked for each preserved page found in the tree. The callback receives
- * the physical address and order of the preserved page.
+ * invoked for each key in the tree.
*
* This function walks the radix tree, searching from the specified top level
- * down to the lowest level (level 0). For each preserved page found, it invokes
- * the provided callback, passing the page's physical address and order.
+ * down to the lowest level (level 0). For each key found, it invokes the
+ * provided callback.
*
* Return: 0 if the walk completed the specified tree, or the non-zero return
* value from the callback that stopped the walk.
@@ -484,13 +474,16 @@ static struct page *__init kho_get_preserved_page(phys_addr_t phys,
return pfn_to_page(pfn);
}
-static int __init kho_preserved_memory_reserve(phys_addr_t phys,
- unsigned int order)
+static int __init kho_preserved_memory_reserve(unsigned long key)
{
union kho_page_info info;
struct page *page;
+ unsigned int order;
+ phys_addr_t phys;
u64 sz;
+ phys = kho_decode_radix_key(key, &order);
+
sz = 1 << (order + PAGE_SHIFT);
page = kho_get_preserved_page(phys, order);
@@ -618,30 +611,20 @@ early_param("kho_scratch", kho_parse_scratch_size);
static void __init scratch_size_update(void)
{
- /*
- * If fixed sizes are not provided via command line, calculate them
- * now.
- */
- if (scratch_scale) {
- phys_addr_t size;
+ phys_addr_t size;
- size = memblock_reserved_kern_size(ARCH_LOW_ADDRESS_LIMIT,
- NUMA_NO_NODE);
- size = size * scratch_scale / 100;
- scratch_size_lowmem = size;
+ if (!scratch_scale)
+ return;
- size = memblock_reserved_kern_size(MEMBLOCK_ALLOC_ANYWHERE,
- NUMA_NO_NODE);
- size = size * scratch_scale / 100 - scratch_size_lowmem;
- scratch_size_global = size;
- }
+ size = memblock_reserved_kern_size(ARCH_LOW_ADDRESS_LIMIT,
+ NUMA_NO_NODE);
+ size = size * scratch_scale / 100;
+ scratch_size_lowmem = round_up(size, CMA_MIN_ALIGNMENT_BYTES);
- /*
- * Scratch areas are released as MIGRATE_CMA. Round them up to the right
- * size.
- */
- scratch_size_lowmem = round_up(scratch_size_lowmem, CMA_MIN_ALIGNMENT_BYTES);
- scratch_size_global = round_up(scratch_size_global, CMA_MIN_ALIGNMENT_BYTES);
+ size = memblock_reserved_kern_size(MEMBLOCK_ALLOC_ANYWHERE,
+ NUMA_NO_NODE);
+ size = size * scratch_scale / 100 - scratch_size_lowmem;
+ scratch_size_global = round_up(size, CMA_MIN_ALIGNMENT_BYTES);
}
static phys_addr_t __init scratch_size_node(int nid)
@@ -859,7 +842,8 @@ int kho_preserve_folio(struct folio *folio)
if (WARN_ON(kho_scratch_overlap(pfn << PAGE_SHIFT, PAGE_SIZE << order)))
return -EINVAL;
- return kho_radix_add_page(tree, pfn, order);
+ return kho_radix_add_key(tree, kho_encode_radix_key(PFN_PHYS(pfn),
+ order));
}
EXPORT_SYMBOL_GPL(kho_preserve_folio);
@@ -877,7 +861,7 @@ void kho_unpreserve_folio(struct folio *folio)
const unsigned long pfn = folio_pfn(folio);
const unsigned int order = folio_order(folio);
- kho_radix_del_page(tree, pfn, order);
+ kho_radix_del_key(tree, kho_encode_radix_key(PFN_PHYS(pfn), order));
}
EXPORT_SYMBOL_GPL(kho_unpreserve_folio);
@@ -906,7 +890,8 @@ static void __kho_unpreserve(struct kho_radix_tree *tree,
while (pfn < end_pfn) {
order = __kho_preserve_pages_order(pfn, end_pfn);
- kho_radix_del_page(tree, pfn, order);
+ kho_radix_del_key(tree, kho_encode_radix_key(PFN_PHYS(pfn),
+ order));
pfn += 1 << order;
}
@@ -937,9 +922,19 @@ int kho_preserve_pages(struct page *page, unsigned long nr_pages)
}
while (pfn < end_pfn) {
- unsigned int order = __kho_preserve_pages_order(pfn, end_pfn);
+ unsigned int order =
+ min(count_trailing_zeros(pfn), ilog2(end_pfn - pfn));
+
+ /*
+ * Make sure all the pages in a single preservation are in the
+ * same NUMA node. The restore machinery can not cope with a
+ * preservation spanning multiple NUMA nodes.
+ */
+ while (pfn_to_nid(pfn) != pfn_to_nid(pfn + (1UL << order) - 1))
+ order--;
- err = kho_radix_add_page(tree, pfn, order);
+ err = kho_radix_add_key(tree, kho_encode_radix_key(PFN_PHYS(pfn),
+ order));
if (err) {
failed_pfn = pfn;
break;
--
2.43.0
* Re: [RFC PATCH 01/20] kho: generalize radix tree APIs
From: sashiko-bot @ 2026-05-28 1:22 UTC
To: Jork Loeser; +Cc: linux-hyperv
Thank you for your contribution! Sashiko AI review found 8 potential issues to consider:
New issues:
- [High] Refactoring of scratch_size_update() skips CMA alignment for fixed scratch sizes.
- [High] Unsigned integer underflow when calculating scratch_size_global for unaligned lowmem sizes.
- [High] The generalized radix tree API silently truncates bits from generic 64-bit keys.
- [Low] Unnecessary duplication of __kho_preserve_pages_order() logic in kho_preserve_pages().
Pre-existing issues:
- [Critical] Heap buffer overflow in radix tree node allocation on 32-bit platforms with 4-byte phys_addr_t.
- [High] Sleeping memory allocation (GFP_KERNEL) is performed while holding a mutex required in reclaim paths.
- [High] Bitwise shift operations using a 32-bit signed integer literal '1' overflow for large memory preservations.
- [High] Lack of bounds checking on reconstructed radix tree keys causes an underflow of 'order', leading to severe memory corruption.
--
commit 8b45214cacd5b180aa13f584e115cd95e44f26d2
Author: Pratyush Yadav (Google) <pratyush@kernel.org>
kho: generalize radix tree APIs
This patch generalizes the KHO radix tree APIs to track arbitrary keys directly
instead of hardcoding page tracking operations (PFN and order). It renames the
functions to kho_radix_add_key() and kho_radix_del_key() and adjusts the
callback structures accordingly.
> diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c
> index 4834a809985ab..05a6eb56e1766 100644
> --- a/kernel/liveupdate/kexec_handover.c
> +++ b/kernel/liveupdate/kexec_handover.c
[ ... ]
> @@ -106,17 +106,17 @@ static unsigned long kho_radix_encode_key(phys_addr_t phys, unsigned int order)
> }
>
> /**
> - * kho_radix_decode_key - Decodes a radix key back into a physical address and order.
> + * kho_decode_radix_key - Decodes a radix key back into a physical address and order.
> * @key: The unsigned long key to decode.
> * @order: An output parameter, a pointer to an unsigned int where the decoded
> * page order will be stored.
> *
> - * This function reverses the encoding performed by kho_radix_encode_key(),
> + * This function reverses the encoding performed by kho_encode_radix_key(),
> * extracting the original physical address and page order from a given key.
> *
> * Return: The decoded physical address.
> */
> -static phys_addr_t kho_radix_decode_key(unsigned long key, unsigned int *order)
> +static phys_addr_t kho_decode_radix_key(unsigned long key, unsigned int *order)
> {
> unsigned int order_bit = fls64(key);
> phys_addr_t phys;
>
> /* order_bit is numbered starting at 1 from fls64 */
> *order = KHO_ORDER_0_LOG2 - order_bit + 1;
[Severity: High]
This isn't a bug introduced by this patch, but does this calculation risk an
unsigned integer underflow?
If a corrupted tree (for instance, an FDT from a crashing kernel) provides
entries at inappropriately high indices, the reconstructed key in
__kho_radix_walk_tree() might have bits set above KHO_ORDER_0_LOG2.
If fls64(key) returns an order_bit greater than KHO_ORDER_0_LOG2 + 1, this
calculation will underflow and produce an enormous order. This value is later
used in loops bounded by (1UL << order), which could lead to out-of-bounds
iteration.
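
The wrap-around described above can be reproduced in userspace. The following is a minimal sketch, not the kernel code: it assumes 4 KiB pages and KHO_ORDER_0_LOG2 == 64 - PAGE_SHIFT, and model_fls64() and decode_order_checked() are hypothetical helpers introduced only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for illustration: 4 KiB pages, 64-bit keys. */
#define PAGE_SHIFT		12u
#define KHO_ORDER_0_LOG2	(64u - PAGE_SHIFT)

/*
 * Userspace stand-in for fls64(): 1-based index of the most significant
 * set bit, or 0 when no bit is set.
 */
static unsigned int model_fls64(uint64_t x)
{
	unsigned int bit = 0;

	while (x) {
		bit++;
		x >>= 1;
	}
	return bit;
}

/* Same arithmetic as the quoted line; wraps for out-of-range keys. */
static unsigned int decode_order_unchecked(uint64_t key)
{
	return KHO_ORDER_0_LOG2 - model_fls64(key) + 1;
}

/*
 * Hypothetical guarded variant: returns 0 and stores the order on
 * success, or -1 when the key has no order bit or has bits set above
 * the position used for order 0.
 */
static int decode_order_checked(uint64_t key, unsigned int *order)
{
	unsigned int order_bit = model_fls64(key);

	if (order_bit == 0 || order_bit > KHO_ORDER_0_LOG2 + 1)
		return -1;

	*order = KHO_ORDER_0_LOG2 - order_bit + 1;
	return 0;
}
```

Under these assumptions, a key with only bit 52 set decodes to order 0, while a key with bit 63 set makes the unchecked variant wrap to (unsigned int)-11. Whether a check of this shape belongs in the decoder or in the tree walker is a design choice for the series authors.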
[ ... ]
> @@ -144,24 +144,21 @@ static unsigned long kho_radix_get_table_index(unsigned long key,
> }
>
> /**
> - * kho_radix_add_page - Marks a page as preserved in the radix tree.
> + * kho_radix_add_key - Add a key to the radix tree.
> * @tree: The KHO radix tree.
> - * @pfn: The page frame number of the page to preserve.
> - * @order: The order of the page.
> + * @key: The key to add.
> *
> - * This function traverses the radix tree based on the key derived from @pfn
> - * and @order. It sets the corresponding bit in the leaf bitmap to mark the
> - * page for preservation. If intermediate nodes do not exist along the path,
> - * they are allocated and added to the tree.
> + * This function traverses the radix tree based on the key provided. It sets the
> + * corresponding bit in the leaf bitmap to mark the key as present. If
> + * intermediate nodes do not exist along the path, they are allocated and added
> + * to the tree.
> *
> * Return: 0 on success, or a negative error code on failure.
> */
> -int kho_radix_add_page(struct kho_radix_tree *tree,
> - unsigned long pfn, unsigned int order)
> +int kho_radix_add_key(struct kho_radix_tree *tree, unsigned long key)
[Severity: High]
Does this new arbitrary key API truncate bits from generic 64-bit keys?
The KHO radix tree depth and shifting logic appear to be statically sized
based on physical address and order bounds. When an arbitrary 64-bit key
is passed here, any bits that exceed the maximum addressable depth might
be silently ignored (e.g., bits shifted out or lost in the modulo operations
inside kho_radix_get_table_index()).
Could this cause distinct large 64-bit keys to silently alias to the same
path and overwrite each other?
> {
> /* Newly allocated nodes for error cleanup */
> struct kho_radix_node *intermediate_nodes[KHO_TREE_MAX_DEPTH] = { 0 };
> - unsigned long key = kho_radix_encode_key(PFN_PHYS(pfn), order);
> struct kho_radix_node *anchor_node = NULL;
[Severity: Critical]
This is a pre-existing issue, but can the node allocation later in this
function cause an out-of-bounds heap access on 32-bit platforms where
sizeof(phys_addr_t) is 4?
The struct kho_radix_node array size is based on KHO_TABLE_SIZE_LOG2,
which is calculated as ilog2(PAGE_SIZE / sizeof(phys_addr_t)).
On 32-bit systems, this evaluates to 1024 elements of u64, requiring
8192 bytes. However, the allocation relies on get_zeroed_page():
new_node = (struct kho_radix_node *)get_zeroed_page(GFP_KERNEL);
This only allocates a single 4096-byte page. Could writes to indices >= 512,
or walking the tree via __kho_radix_walk_tree(), overflow this allocation?
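
The size arithmetic behind this concern can be checked with a small userspace sketch. It assumes 4 KiB pages and, as stated above, u64 table entries; model_ilog2() and node_bytes() are hypothetical names, not kernel functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE	4096u	/* assumed 4 KiB pages */

/* Floor of log2(v), standing in for the kernel's ilog2(). */
static unsigned int model_ilog2(size_t v)
{
	unsigned int log = 0;

	assert(v != 0);
	while (v >>= 1)
		log++;
	return log;
}

/*
 * Bytes occupied by one node when the entry count is derived from
 * sizeof(phys_addr_t) but each entry is stored as a u64.
 */
static size_t node_bytes(size_t phys_addr_size)
{
	unsigned int table_size_log2 =
		model_ilog2(MODEL_PAGE_SIZE / phys_addr_size);

	return ((size_t)1 << table_size_log2) * sizeof(uint64_t);
}
```

With an 8-byte phys_addr_t the node fits exactly in one 4096-byte page; with a 4-byte phys_addr_t the same formula yields 8192 bytes, twice what a single page provides.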
[Severity: High]
This is also a pre-existing issue, but can using GFP_KERNEL for the node
allocation lead to an ABBA deadlock?
This allocation happens while holding tree->lock:
kho_radix_add_key()
mutex_lock(&tree->lock)
...
get_zeroed_page(GFP_KERNEL)
Since GFP_KERNEL can sleep and trigger direct memory reclaim, if a page cache
folio preserved for KHO is evicted, the filesystem's eviction path could call
kho_unpreserve_folio().
That function would attempt to acquire the exact same tree->lock, resulting
in a deadlock. Should this allocation be moved outside the lock or downgraded
to GFP_NOWAIT?
[ ... ]
> @@ -484,13 +474,16 @@ static struct page *__init kho_get_preserved_page(phys_addr_t phys,
> return pfn_to_page(pfn);
> }
>
> -static int __init kho_preserved_memory_reserve(phys_addr_t phys,
> - unsigned int order)
> +static int __init kho_preserved_memory_reserve(unsigned long key)
> {
> union kho_page_info info;
> struct page *page;
> + unsigned int order;
> + phys_addr_t phys;
> u64 sz;
>
> + phys = kho_decode_radix_key(key, &order);
> +
> sz = 1 << (order + PAGE_SHIFT);
[Severity: High]
This is a pre-existing issue, but will this bitwise shift overflow for large
memory preservations?
Since the literal 1 is a 32-bit signed integer, a shift amount of 31 or
greater (which can happen for contiguous preservations of 2GB or more, like
hugepages or CMA areas) results in a signed integer overflow.
This would produce a negative 32-bit value that is sign-extended into the
64-bit sz variable, leading to incorrect memblock reservations.
Does this need to use 1UL or 1ULL?
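
A userspace sketch of the difference follows, assuming PAGE_SHIFT == 12. Shifting a signed 1 into the sign bit is undefined behavior in C, so the first helper only models the commonly observed outcome (a negative 32-bit value that is then sign-extended) instead of performing that shift; both helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT	12u	/* assumed 4 KiB pages */

/*
 * Models the commonly observed result of "sz = 1 << (order + PAGE_SHIFT)"
 * with a 32-bit int: the value is computed in 32 bits, reinterpreted as
 * signed, and sign-extended when widened to 64 bits.
 */
static uint64_t size_int_literal_model(unsigned int order)
{
	unsigned int shift = order + PAGE_SHIFT;

	assert(shift <= 31);
	int32_t narrow = (int32_t)(UINT32_C(1) << shift);

	return (uint64_t)(int64_t)narrow;
}

/* The shift carried out on a 64-bit unsigned literal. */
static uint64_t size_wide_literal(unsigned int order)
{
	unsigned int shift = order + PAGE_SHIFT;

	assert(shift < 64);
	return UINT64_C(1) << shift;
}
```

For order 19 (2 GiB with 4 KiB pages) the wide variant returns 0x80000000, while the modeled narrow variant returns 0xffffffff80000000. For smaller orders the two agree.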
[ ... ]
> @@ -618,30 +611,20 @@ early_param("kho_scratch", kho_parse_scratch_size);
>
> static void __init scratch_size_update(void)
> {
> - /*
> - * If fixed sizes are not provided via command line, calculate them
> - * now.
> - */
> - if (scratch_scale) {
> - phys_addr_t size;
> + phys_addr_t size;
>
> - size = memblock_reserved_kern_size(ARCH_LOW_ADDRESS_LIMIT,
> - NUMA_NO_NODE);
> - size = size * scratch_scale / 100;
> - scratch_size_lowmem = size;
> + if (!scratch_scale)
> + return;
[Severity: High]
Does this early return skip CMA alignment for fixed scratch sizes?
In the previous implementation, scratch_size_lowmem and scratch_size_global
were unconditionally aligned using round_up(..., CMA_MIN_ALIGNMENT_BYTES)
at the end of the function, regardless of whether scratch_scale was set.
By returning early when scratch_scale is 0 (which occurs when fixed sizes
are provided via the command line), the alignment step is bypassed.
Will this cause CMA initialization to fail when iterating over partial
pageblocks?
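
For comparison, here is a userspace model of the ordering used by the removed lines, where rounding is applied whether or not scratch_scale is set. It is a sketch only: ALIGN_BYTES is an assumed stand-in for CMA_MIN_ALIGNMENT_BYTES, the reserved_* parameters stand in for the memblock_reserved_kern_size() results, and struct scratch_sizes is a hypothetical container for the two globals.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed stand-in for CMA_MIN_ALIGNMENT_BYTES (a power of two). */
#define ALIGN_BYTES	(UINT64_C(4) << 20)

static uint64_t round_up_pow2(uint64_t v, uint64_t align)
{
	return (v + align - 1) & ~(align - 1);
}

/* Hypothetical container for the two scratch size globals. */
struct scratch_sizes {
	uint64_t lowmem;
	uint64_t global;
};

/*
 * Model of the pre-patch flow: scale-based sizes are computed only when
 * scale is non-zero, but both sizes are always rounded up afterwards, so
 * fixed sizes from the command line are aligned as well.
 */
static struct scratch_sizes
scratch_size_update_model(struct scratch_sizes fixed, unsigned int scale,
			  uint64_t reserved_low, uint64_t reserved_all)
{
	struct scratch_sizes out = fixed;

	if (scale) {
		out.lowmem = reserved_low * scale / 100;
		out.global = reserved_all * scale / 100 - out.lowmem;
	}

	out.lowmem = round_up_pow2(out.lowmem, ALIGN_BYTES);
	out.global = round_up_pow2(out.global, ALIGN_BYTES);
	return out;
}
```

In this model a fixed lowmem size of 1 MiB + 1 byte with scale 0 still comes out as 4 MiB, which is the behavior the early return would drop.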
>
> - size = memblock_reserved_kern_size(MEMBLOCK_ALLOC_ANYWHERE,
> - NUMA_NO_NODE);
> - size = size * scratch_scale / 100 - scratch_size_lowmem;
> - scratch_size_global = size;
> - }
> + size = memblock_reserved_kern_size(ARCH_LOW_ADDRESS_LIMIT,
> + NUMA_NO_NODE);
> + size = size * scratch_scale / 100;
> + scratch_size_lowmem = round_up(size, CMA_MIN_ALIGNMENT_BYTES);
>
> - /*
> - * Scratch areas are released as MIGRATE_CMA. Round them up to the right
> - * size.
> - */
> - scratch_size_lowmem = round_up(scratch_size_lowmem, CMA_MIN_ALIGNMENT_BYTES);
> - scratch_size_global = round_up(scratch_size_global, CMA_MIN_ALIGNMENT_BYTES);
> + size = memblock_reserved_kern_size(MEMBLOCK_ALLOC_ANYWHERE,
> + NUMA_NO_NODE);
> + size = size * scratch_scale / 100 - scratch_size_lowmem;
[Severity: High]
Can this subtraction result in an unsigned integer underflow?
Because the round_up() for scratch_size_lowmem now occurs before the
calculation of scratch_size_global, scratch_size_lowmem may be larger
than the unaligned size (size * scratch_scale / 100).
On a system where all memory is below ARCH_LOW_ADDRESS_LIMIT, both calls to
memblock_reserved_kern_size() return the same value. The rounded-up
scratch_size_lowmem is then subtracted from the unaligned total, and the
result is mathematically negative whenever rounding increased
scratch_size_lowmem.
Since size is a phys_addr_t (unsigned), this will underflow to an enormous
positive value and likely lead to an unfulfillable CMA reservation attempt.
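
The subtraction step can be modeled in userspace as follows. This is a sketch under stated assumptions: ALIGN_BYTES stands in for CMA_MIN_ALIGNMENT_BYTES, the reserved_* parameters stand in for the memblock_reserved_kern_size() results, and the clamped variant is a hypothetical illustration, not a proposed patch. The sketch stops before the final round_up(), whose effect on a wrapped value is not modeled here.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed stand-in for CMA_MIN_ALIGNMENT_BYTES (a power of two). */
#define ALIGN_BYTES	(UINT64_C(4) << 20)

static uint64_t round_up_pow2(uint64_t v, uint64_t align)
{
	return (v + align - 1) & ~(align - 1);
}

/*
 * Patched order of operations: lowmem is rounded up first and then
 * subtracted from the unaligned total, so the result can wrap.
 */
static uint64_t global_unaligned_patched(uint64_t reserved_low,
					 uint64_t reserved_all,
					 unsigned int scale)
{
	uint64_t lowmem = round_up_pow2(reserved_low * scale / 100,
					ALIGN_BYTES);

	return reserved_all * scale / 100 - lowmem;
}

/* Hypothetical guarded variant that clamps at zero instead of wrapping. */
static uint64_t global_unaligned_clamped(uint64_t reserved_low,
					 uint64_t reserved_all,
					 unsigned int scale)
{
	uint64_t lowmem = round_up_pow2(reserved_low * scale / 100,
					ALIGN_BYTES);
	uint64_t total = reserved_all * scale / 100;

	return total > lowmem ? total - lowmem : 0;
}
```

With both reserved sizes equal to 10 MiB + 4 KiB and a scale of 100, lowmem rounds up to 12 MiB and the unguarded subtraction wraps to 2^64 - 2093056, while the clamped variant returns 0.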
[ ... ]
> @@ -937,9 +922,19 @@ int kho_preserve_pages(struct page *page, unsigned long nr_pages)
> }
>
> while (pfn < end_pfn) {
> - unsigned int order = __kho_preserve_pages_order(pfn, end_pfn);
> + unsigned int order =
> + min(count_trailing_zeros(pfn), ilog2(end_pfn - pfn));
> +
> + /*
> + * Make sure all the pages in a single preservation are in the
> + * same NUMA node. The restore machinery can not cope with a
> + * preservation spanning multiple NUMA nodes.
> + */
> + while (pfn_to_nid(pfn) != pfn_to_nid(pfn + (1UL << order) - 1))
> + order--;
[Severity: Low]
Is there a specific reason to inline and duplicate the logic from
__kho_preserve_pages_order() here?
The helper function __kho_preserve_pages_order() remains in the file and
is still actively used by __kho_unpreserve(). Duplicating this complex
order calculation and NUMA node alignment logic increases the risk of the
two implementations diverging in the future.
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260528004204.1484584-1-jloeser@linux.microsoft.com?part=1
* [RFC PATCH 02/20] kho: store incoming radix tree in kho_in
From: Jork Loeser @ 2026-05-28 0:41 UTC
To: linux-hyperv, linux-mm, kexec
Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Mike Rapoport, Pasha Tatashin, Pratyush Yadav, Alexander Graf,
Jason Miu, Andrew Morton, David Hildenbrand, Muchun Song,
Oscar Salvador, Baoquan He, Catalin Marinas, Will Deacon,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Kees Cook, Ran Xiaokai, Justinien Bouron,
Sourabh Jain, Pingfan Liu, Rafael J. Wysocki, Mario Limonciello,
linux-arm-kernel, x86, linux-kernel, Michael Kelley, Jork Loeser
From: "Pratyush Yadav (Google)" <pratyush@kernel.org>
This allows other functions to also use the radix tree. While at it,
also use kho_get_mem_map_phys() instead of duplicating the code to get
the radix tree root from the FDT.
Signed-off-by: Pratyush Yadav (Google) <pratyush@kernel.org>
Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com>
---
kernel/liveupdate/kexec_handover.c | 27 ++++++++-------------------
1 file changed, 8 insertions(+), 19 deletions(-)
diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c
index 05a6eb56e176..afc986845839 100644
--- a/kernel/liveupdate/kexec_handover.c
+++ b/kernel/liveupdate/kexec_handover.c
@@ -1316,6 +1316,7 @@ struct kho_in {
char previous_release[__NEW_UTS_LEN + 1];
u32 kexec_count;
struct kho_debugfs dbg;
+ struct kho_radix_tree radix_tree;
};
static struct kho_in kho_in = {
@@ -1395,24 +1396,10 @@ EXPORT_SYMBOL_GPL(kho_retrieve_subtree);
static int __init kho_mem_retrieve(const void *fdt)
{
- struct kho_radix_tree tree;
- const phys_addr_t *mem;
- int len;
-
- /* Retrieve the KHO radix tree from passed-in FDT. */
- mem = fdt_getprop(fdt, 0, KHO_FDT_MEMORY_MAP_PROP_NAME, &len);
-
- if (!mem || len != sizeof(*mem)) {
- pr_err("failed to get preserved KHO memory tree\n");
- return -ENOENT;
- }
-
- if (!*mem)
- return -EINVAL;
-
- tree.root = phys_to_virt(*mem);
- mutex_init(&tree.lock);
- return kho_radix_walk_tree(&tree, kho_preserved_memory_reserve);
+ kho_in.radix_tree.root = phys_to_virt(kho_get_mem_map_phys(fdt));
+ mutex_init(&kho_in.radix_tree.lock);
+ return kho_radix_walk_tree(&kho_in.radix_tree,
+ kho_preserved_memory_reserve);
}
static __init int kho_out_fdt_setup(void)
@@ -1619,8 +1606,10 @@ void __init kho_memory_init(void)
if (kho_in.scratch_phys) {
kho_scratch = phys_to_virt(kho_in.scratch_phys);
- if (kho_mem_retrieve(kho_get_fdt()))
+ if (kho_mem_retrieve(kho_get_fdt())) {
kho_in.fdt_phys = 0;
+ kho_in.radix_tree.root = NULL;
+ }
} else {
kho_reserve_scratch();
}
--
2.43.0
* Re: [RFC PATCH 02/20] kho: store incoming radix tree in kho_in
From: sashiko-bot @ 2026-05-28 1:08 UTC
To: Jork Loeser; +Cc: linux-hyperv
Thank you for your contribution! Sashiko AI review found 1 potential issue to consider:
- [High] Missing error handling leads to boot panic
--
commit 84425fcb93791aa9b9c2db978b57ed659f2c9ec2
Author: Pratyush Yadav (Google) <pratyush@kernel.org>
kho: store incoming radix tree in kho_in
This commit stores the incoming radix tree in the kho_in struct to allow
other functions to use it. It also refactors the code to use
kho_get_mem_map_phys() for retrieving the radix tree root from the FDT.
> diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c
> --- a/kernel/liveupdate/kexec_handover.c
> +++ b/kernel/liveupdate/kexec_handover.c
[ ... ]
> @@ -1395,24 +1396,10 @@ EXPORT_SYMBOL_GPL(kho_retrieve_subtree);
>
> static int __init kho_mem_retrieve(const void *fdt)
> {
> - struct kho_radix_tree tree;
> - const phys_addr_t *mem;
> - int len;
> -
> - /* Retrieve the KHO radix tree from passed-in FDT. */
> - mem = fdt_getprop(fdt, 0, KHO_FDT_MEMORY_MAP_PROP_NAME, &len);
> -
> - if (!mem || len != sizeof(*mem)) {
> - pr_err("failed to get preserved KHO memory tree\n");
> - return -ENOENT;
> - }
> -
> - if (!*mem)
> - return -EINVAL;
> -
> - tree.root = phys_to_virt(*mem);
> - mutex_init(&tree.lock);
> - return kho_radix_walk_tree(&tree, kho_preserved_memory_reserve);
> + kho_in.radix_tree.root = phys_to_virt(kho_get_mem_map_phys(fdt));
[Severity: High]
Can this lead to a boot panic if kho_get_mem_map_phys() fails?
If kho_get_mem_map_phys() returns 0 on failure, phys_to_virt(0) will likely
generate a non-NULL pointer. This bypasses the safety validation later when
walking the tree:
kernel/liveupdate/kexec_handover.c:kho_radix_walk_tree()
if (WARN_ON_ONCE(!tree->root))
return -EINVAL;
Walking the tree with this invalid address would then trigger a panic during
early boot.
I note this issue appears to be fixed later in the series by commit
80928d5942267 (kho: return virtual address of mem_map), but wanted to point
out the temporary breakage here.
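
To illustrate why a NULL check on the root would not catch this case, here is a userspace model. It assumes a direct-map style phys_to_virt() that adds a fixed offset; MODEL_PAGE_OFFSET is an assumed x86-64-like value, and model_set_root() is a hypothetical helper showing one way to validate the physical address before converting it.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed direct-map base; the real value is architecture-specific. */
#define MODEL_PAGE_OFFSET	UINT64_C(0xffff888000000000)

/* Model of a direct-map phys_to_virt(): a plain offset addition. */
static uint64_t model_phys_to_virt(uint64_t phys)
{
	return phys + MODEL_PAGE_OFFSET;
}

/*
 * Hypothetical guarded setup: reject a zero physical address before the
 * conversion, so the stored root stays 0 (NULL) on failure.
 */
static int model_set_root(uint64_t mem_map_phys, uint64_t *root_virt)
{
	if (!mem_map_phys) {
		*root_virt = 0;
		return -1;
	}

	*root_virt = model_phys_to_virt(mem_map_phys);
	return 0;
}
```

In this model, converting physical address 0 yields the non-zero direct-map base, so a later "!tree->root" test passes even though no tree was found.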
> + mutex_init(&kho_in.radix_tree.lock);
> + return kho_radix_walk_tree(&kho_in.radix_tree,
> + kho_preserved_memory_reserve);
> }
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260528004204.1484584-1-jloeser@linux.microsoft.com?part=2
* [RFC PATCH 03/20] kho: add a struct for radix callbacks
From: Jork Loeser @ 2026-05-28 0:41 UTC
To: linux-hyperv, linux-mm, kexec
Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Mike Rapoport, Pasha Tatashin, Pratyush Yadav, Alexander Graf,
Jason Miu, Andrew Morton, David Hildenbrand, Muchun Song,
Oscar Salvador, Baoquan He, Catalin Marinas, Will Deacon,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Kees Cook, Ran Xiaokai, Justinien Bouron,
Sourabh Jain, Pingfan Liu, Rafael J. Wysocki, Mario Limonciello,
linux-arm-kernel, x86, linux-kernel, Michael Kelley, Jork Loeser
From: "Pratyush Yadav (Google)" <pratyush@kernel.org>
A future commit will add more callbacks for the KHO radix tree. Add a
struct for collecting the callbacks.
Signed-off-by: Pratyush Yadav (Google) <pratyush@kernel.org>
Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com>
---
include/linux/kho_radix_tree.h | 15 ++++++++++++---
kernel/liveupdate/kexec_handover.c | 29 ++++++++++++++++-------------
2 files changed, 28 insertions(+), 16 deletions(-)
diff --git a/include/linux/kho_radix_tree.h b/include/linux/kho_radix_tree.h
index f368f3b9f923..030da6399d28 100644
--- a/include/linux/kho_radix_tree.h
+++ b/include/linux/kho_radix_tree.h
@@ -34,14 +34,23 @@ struct kho_radix_tree {
struct mutex lock; /* protects the tree's structure and root pointer */
};
-typedef int (*kho_radix_tree_walk_callback_t)(unsigned long key);
+/**
+ * struct kho_radix_walk_cb - Callbacks for KHO radix tree walk.
+ * @key: Called on each present key in the radix tree.
+ *
+ * For each callback, a return value of 0 continues the walk and a non-zero
+ * return value is directly returned to the caller.
+ */
+struct kho_radix_walk_cb {
+ int (*key)(unsigned long key);
+};
#ifdef CONFIG_KEXEC_HANDOVER
int kho_radix_add_key(struct kho_radix_tree *tree, unsigned long key);
void kho_radix_del_key(struct kho_radix_tree *tree, unsigned long key);
int kho_radix_walk_tree(struct kho_radix_tree *tree,
- kho_radix_tree_walk_callback_t cb);
+ const struct kho_radix_walk_cb *cb);
#else /* #ifdef CONFIG_KEXEC_HANDOVER */
@@ -54,7 +63,7 @@ static inline void kho_radix_del_key(struct kho_radix_tree *tree,
unsigned long key) { }
static inline int kho_radix_walk_tree(struct kho_radix_tree *tree,
- kho_radix_tree_walk_callback_t cb)
+ const struct kho_radix_walk_cb *cb)
{
return -EOPNOTSUPP;
}
diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c
index afc986845839..b22b3cec251e 100644
--- a/kernel/liveupdate/kexec_handover.c
+++ b/kernel/liveupdate/kexec_handover.c
@@ -266,16 +266,18 @@ void kho_radix_del_key(struct kho_radix_tree *tree, unsigned long key)
}
EXPORT_SYMBOL_GPL(kho_radix_del_key);
-static int kho_radix_walk_leaf(struct kho_radix_leaf *leaf,
- unsigned long key,
- kho_radix_tree_walk_callback_t cb)
+static int kho_radix_walk_leaf(struct kho_radix_leaf *leaf, unsigned long key,
+ const struct kho_radix_walk_cb *cb)
{
unsigned long *bitmap = (unsigned long *)leaf;
unsigned int i;
int err;
+ if (!cb->key)
+ return 0;
+
for_each_set_bit(i, bitmap, PAGE_SIZE * BITS_PER_BYTE) {
- err = cb(key | i);
+ err = cb->key(key | i);
if (err)
return err;
}
@@ -285,7 +287,7 @@ static int kho_radix_walk_leaf(struct kho_radix_leaf *leaf,
static int __kho_radix_walk_tree(struct kho_radix_node *root,
unsigned int level, unsigned long start,
- kho_radix_tree_walk_callback_t cb)
+ const struct kho_radix_walk_cb *cb)
{
struct kho_radix_node *node;
struct kho_radix_leaf *leaf;
@@ -325,18 +327,16 @@ static int __kho_radix_walk_tree(struct kho_radix_node *root,
/**
* kho_radix_walk_tree - Traverses the radix tree and calls a callback for each key.
* @tree: A pointer to the KHO radix tree to walk.
- * @cb: A callback function of type kho_radix_tree_walk_callback_t that will be
- * invoked for each key in the tree.
+ * @cb: Set of callbacks to be invoked during the tree walk.
*
- * This function walks the radix tree, searching from the specified top level
- * down to the lowest level (level 0). For each key found, it invokes the
- * provided callback.
+ * This function walks the radix tree, searching from the top level down to the
+ * lowest level (level 0), invoking the appropriate callbacks.
*
* Return: 0 if the walk completed the specified tree, or the non-zero return
* value from the callback that stopped the walk.
*/
int kho_radix_walk_tree(struct kho_radix_tree *tree,
- kho_radix_tree_walk_callback_t cb)
+ const struct kho_radix_walk_cb *cb)
{
if (WARN_ON_ONCE(!tree->root))
return -EINVAL;
@@ -1396,10 +1396,13 @@ EXPORT_SYMBOL_GPL(kho_retrieve_subtree);
static int __init kho_mem_retrieve(const void *fdt)
{
+ const struct kho_radix_walk_cb cb = {
+ .key = kho_preserved_memory_reserve,
+ };
+
kho_in.radix_tree.root = phys_to_virt(kho_get_mem_map_phys(fdt));
mutex_init(&kho_in.radix_tree.lock);
- return kho_radix_walk_tree(&kho_in.radix_tree,
- kho_preserved_memory_reserve);
+ return kho_radix_walk_tree(&kho_in.radix_tree, &cb);
}
static __init int kho_out_fdt_setup(void)
--
2.43.0
^ permalink raw reply related [flat|nested] 37+ messages in thread
* [RFC PATCH 04/20] kho: add callback for table pages
2026-05-28 0:41 [RFC PATCH 00/20] mshv: enable kexec with Hyper-V donated pages and partitions Jork Loeser
` (2 preceding siblings ...)
2026-05-28 0:41 ` [RFC PATCH 03/20] kho: add a struct for radix callbacks Jork Loeser
@ 2026-05-28 0:41 ` Jork Loeser
2026-05-28 1:33 ` sashiko-bot
2026-05-28 0:41 ` [RFC PATCH 05/20] kho: add data argument to radix walk callback Jork Loeser
` (15 subsequent siblings)
19 siblings, 1 reply; 37+ messages in thread
From: Jork Loeser @ 2026-05-28 0:41 UTC (permalink / raw)
To: linux-hyperv, linux-mm, kexec
Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Mike Rapoport, Pasha Tatashin, Pratyush Yadav, Alexander Graf,
Jason Miu, Andrew Morton, David Hildenbrand, Muchun Song,
Oscar Salvador, Baoquan He, Catalin Marinas, Will Deacon,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Kees Cook, Ran Xiaokai, Justinien Bouron,
Sourabh Jain, Pingfan Liu, Rafael J. Wysocki, Mario Limonciello,
linux-arm-kernel, x86, linux-kernel, Michael Kelley, Jork Loeser
From: "Pratyush Yadav (Google)" <pratyush@kernel.org>
The KHO memory preservation radix tree does not mark the table pages
themselves as scratch. This is done to avoid a circular dependency where
preserving a page can lead to allocating other preserved pages. This
means any walker looking for free ranges of memory outside of scratch
areas will ignore the table pages.
Add a table callback that is invoked for each table page. The callback
is given the physical address of the table page.
Signed-off-by: Pratyush Yadav (Google) <pratyush@kernel.org>
Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com>
---
include/linux/kho_radix_tree.h | 3 +++
kernel/liveupdate/kexec_handover.c | 12 ++++++++++++
2 files changed, 15 insertions(+)
diff --git a/include/linux/kho_radix_tree.h b/include/linux/kho_radix_tree.h
index 030da6399d28..fe7151d89361 100644
--- a/include/linux/kho_radix_tree.h
+++ b/include/linux/kho_radix_tree.h
@@ -37,12 +37,15 @@ struct kho_radix_tree {
/**
* struct kho_radix_walk_cb - Callbacks for KHO radix tree walk.
* @key: Called on each present key in the radix tree.
+ * @table: Called on each table of the radix tree itself. Receives the
+ * physical address of the page containing the table.
*
* For each callback, a return value of 0 continues the walk and a non-zero
* return value is directly returned to the caller.
*/
struct kho_radix_walk_cb {
int (*key)(unsigned long key);
+ int (*table)(phys_addr_t phys);
};
#ifdef CONFIG_KEXEC_HANDOVER
diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c
index b22b3cec251e..0f8d058f1a27 100644
--- a/kernel/liveupdate/kexec_handover.c
+++ b/kernel/liveupdate/kexec_handover.c
@@ -273,6 +273,12 @@ static int kho_radix_walk_leaf(struct kho_radix_leaf *leaf, unsigned long key,
unsigned int i;
int err;
+ if (cb->table) {
+ err = cb->table(virt_to_phys(leaf));
+ if (err)
+ return err;
+ }
+
if (!cb->key)
return 0;
@@ -295,6 +301,12 @@ static int __kho_radix_walk_tree(struct kho_radix_node *root,
unsigned int shift;
int err;
+ if (cb->table) {
+ err = cb->table(virt_to_phys(root));
+ if (err)
+ return err;
+ }
+
for (i = 0; i < PAGE_SIZE / sizeof(phys_addr_t); i++) {
if (!root->table[i])
continue;
--
2.43.0
^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: [RFC PATCH 04/20] kho: add callback for table pages
2026-05-28 0:41 ` [RFC PATCH 04/20] kho: add callback for table pages Jork Loeser
@ 2026-05-28 1:33 ` sashiko-bot
0 siblings, 0 replies; 37+ messages in thread
From: sashiko-bot @ 2026-05-28 1:33 UTC (permalink / raw)
To: Jork Loeser; +Cc: linux-hyperv
Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:
Pre-existing issues:
- [Critical] Severe out-of-bounds memory accesses and ABI incompatibility due to incorrect node sizing on 32-bit platforms.
- [High] Pre-existing memory corruption risk due to `kzalloc` crossing page boundaries for KHO radix root node.
- [High] Undefined behavior and logic failure due to 64-bit shift assumptions on 32-bit platforms.
--
commit a4a688bed2326d7eaaf8a3a47636fe600e228c89
Author: Pratyush Yadav (Google) <pratyush@kernel.org>
kho: add callback for table pages
The KHO memory preservation radix tree does not mark the table pages
themselves as scratch. This means any walker looking for free ranges of
memory outside of scratch areas will ignore the table. This patch adds a
table callback that is invoked for each table page to address this.
> diff --git a/include/linux/kho_radix_tree.h b/include/linux/kho_radix_tree.h
> index 030da6399d288..fe7151d89361d 100644
> --- a/include/linux/kho_radix_tree.h
> +++ b/include/linux/kho_radix_tree.h
> @@ -37,12 +37,15 @@ struct kho_radix_tree {
> /**
> * struct kho_radix_walk_cb - Callbacks for KHO radix tree walk.
> * @key: Called on each present key in the radix tree.
> + * @table: Called on each table of the radix tree itself. Receives the
> + * physical address of the page containing the table.
> *
> * For each callback, a return value of 0 continues the walk and a non-zero
> * return value is directly returned to the caller.
> */
> struct kho_radix_walk_cb {
> int (*key)(unsigned long key);
[Severity: High]
This isn't a bug introduced by this patch, but does the key encoding work
correctly on 32-bit platforms?
Looking at kho_encode_radix_key(), the code performs
1UL << (KHO_ORDER_0_LOG2 - order). With KHO_ORDER_0_LOG2 defined as
64 - PAGE_SHIFT (52 for 4KB pages), doesn't shifting a 32-bit 1UL by
52 bits invoke undefined behavior?
Additionally, since the radix key is typed as unsigned long (which is 32 bits
on these platforms), is it structurally possible to encode and decode the
52-bit shifted physical address and order bit as designed?
> + int (*table)(phys_addr_t phys);
> };
>
[ ... ]
> diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c
> index b22b3cec251ee..0f8d058f1a272 100644
> --- a/kernel/liveupdate/kexec_handover.c
> +++ b/kernel/liveupdate/kexec_handover.c
[ ... ]
> @@ -295,6 +301,12 @@ static int __kho_radix_walk_tree(struct kho_radix_node *root,
> unsigned int shift;
> int err;
>
> + if (cb->table) {
> + err = cb->table(virt_to_phys(root));
[Severity: High]
This isn't a bug introduced by this patch, but does the KHO radix tree root
node get allocated via kzalloc(PAGE_SIZE, GFP_KERNEL) in kho_init()?
The SLUB allocator does not guarantee page alignment for PAGE_SIZE objects
when debugging features (like redzones) are enabled. If the root node crosses
a physical page boundary, won't this table callback only receive the physical
address of the start of the object?
If a walker treats this address as a single page, could it fail to mark the
second page as in-use? During kexec, couldn't this unmarked page be treated
as free scratch memory and overwritten, corrupting the KHO radix tree for the
incoming kernel?
Note: I see this is fixed later in the series in commit
d207a159c7c112b64706fe2fb64399a65932fdb6 ("kho: add kho_radix_init_tree()").
> + if (err)
> + return err;
> + }
> +
> for (i = 0; i < PAGE_SIZE / sizeof(phys_addr_t); i++) {
[Severity: Critical]
This is a pre-existing issue, but does this loop correctly bound memory
accesses on 32-bit platforms?
Since sizeof(phys_addr_t) is 4 bytes on 32-bit platforms,
PAGE_SIZE / sizeof(phys_addr_t) evaluates to 1024 for 4KB pages.
KHO_TABLE_SIZE_LOG2 evaluates to 10, meaning the table array in
struct kho_radix_node has 1024 elements of type u64 (8 bytes each).
This makes the structure 8192 bytes large.
If node allocations only allocate a single PAGE_SIZE (4096 bytes), won't
accessing root->table[i] in this loop up to 1024 read and write well past
the bounds of the allocated page?
Also, by relying on sizeof(phys_addr_t), doesn't the FDT format vary by
platform and break ABI cross-compatibility?
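The arithmetic behind this question can be checked with a small userspace sketch. It is not kernel code: SKETCH_PAGE_SIZE and both helper names are hypothetical, and the 8-byte slot width is the assumption stated above (u64 table entries), not something verified against the KHO headers.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumption: 4KB pages */

/* Number of slots visited by a loop bounded by PAGE_SIZE / sizeof(address). */
static size_t sketch_loop_bound(size_t addr_size)
{
	return SKETCH_PAGE_SIZE / addr_size;
}

/* Bytes spanned by that loop when every table slot is a u64. */
static size_t sketch_bytes_spanned(size_t addr_size)
{
	return sketch_loop_bound(addr_size) * sizeof(uint64_t);
}
```

With a 4-byte address type the loop covers 1024 slots, which spans 8192 bytes of u64 entries, twice a 4096-byte page. With an 8-byte address type it covers 512 slots and stays within one page.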
> if (!root->table[i])
> continue;
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260528004204.1484584-1-jloeser@linux.microsoft.com?part=4
^ permalink raw reply [flat|nested] 37+ messages in thread
* [RFC PATCH 05/20] kho: add data argument to radix walk callback
2026-05-28 0:41 [RFC PATCH 00/20] mshv: enable kexec with Hyper-V donated pages and partitions Jork Loeser
` (3 preceding siblings ...)
2026-05-28 0:41 ` [RFC PATCH 04/20] kho: add callback for table pages Jork Loeser
@ 2026-05-28 0:41 ` Jork Loeser
2026-05-28 1:11 ` sashiko-bot
2026-05-28 0:41 ` [RFC PATCH 06/20] kho: allow early-boot usage of the KHO radix tree Jork Loeser
` (14 subsequent siblings)
19 siblings, 1 reply; 37+ messages in thread
From: Jork Loeser @ 2026-05-28 0:41 UTC (permalink / raw)
To: linux-hyperv, linux-mm, kexec
Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Mike Rapoport, Pasha Tatashin, Pratyush Yadav, Alexander Graf,
Jason Miu, Andrew Morton, David Hildenbrand, Muchun Song,
Oscar Salvador, Baoquan He, Catalin Marinas, Will Deacon,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Kees Cook, Ran Xiaokai, Justinien Bouron,
Sourabh Jain, Pingfan Liu, Rafael J. Wysocki, Mario Limonciello,
linux-arm-kernel, x86, linux-kernel, Michael Kelley, Jork Loeser
From: "Pratyush Yadav (Google)" <pratyush@kernel.org>
Add an opaque data pointer argument to the callbacks in struct
kho_radix_walk_cb. Callers can use it to pass extra information to the
callback.
Signed-off-by: Pratyush Yadav (Google) <pratyush@kernel.org>
Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com>
---
include/linux/kho_radix_tree.h | 8 ++++----
kernel/liveupdate/kexec_handover.c | 24 +++++++++++++-----------
2 files changed, 17 insertions(+), 15 deletions(-)
diff --git a/include/linux/kho_radix_tree.h b/include/linux/kho_radix_tree.h
index fe7151d89361..6c0f7d82716b 100644
--- a/include/linux/kho_radix_tree.h
+++ b/include/linux/kho_radix_tree.h
@@ -44,8 +44,8 @@ struct kho_radix_tree {
* return value is directly returned to the caller.
*/
struct kho_radix_walk_cb {
- int (*key)(unsigned long key);
- int (*table)(phys_addr_t phys);
+ int (*key)(unsigned long key, void *data);
+ int (*table)(phys_addr_t phys, void *data);
};
#ifdef CONFIG_KEXEC_HANDOVER
@@ -53,7 +53,7 @@ struct kho_radix_walk_cb {
int kho_radix_add_key(struct kho_radix_tree *tree, unsigned long key);
void kho_radix_del_key(struct kho_radix_tree *tree, unsigned long key);
int kho_radix_walk_tree(struct kho_radix_tree *tree,
- const struct kho_radix_walk_cb *cb);
+ const struct kho_radix_walk_cb *cb, void *data);
#else /* #ifdef CONFIG_KEXEC_HANDOVER */
@@ -66,7 +66,7 @@ static inline void kho_radix_del_key(struct kho_radix_tree *tree,
unsigned long key) { }
static inline int kho_radix_walk_tree(struct kho_radix_tree *tree,
- const struct kho_radix_walk_cb *cb)
+ const struct kho_radix_walk_cb *cb, void *data)
{
return -EOPNOTSUPP;
}
diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c
index 0f8d058f1a27..f6de6bf63226 100644
--- a/kernel/liveupdate/kexec_handover.c
+++ b/kernel/liveupdate/kexec_handover.c
@@ -267,14 +267,14 @@ void kho_radix_del_key(struct kho_radix_tree *tree, unsigned long key)
EXPORT_SYMBOL_GPL(kho_radix_del_key);
static int kho_radix_walk_leaf(struct kho_radix_leaf *leaf, unsigned long key,
- const struct kho_radix_walk_cb *cb)
+ const struct kho_radix_walk_cb *cb, void *data)
{
unsigned long *bitmap = (unsigned long *)leaf;
unsigned int i;
int err;
if (cb->table) {
- err = cb->table(virt_to_phys(leaf));
+ err = cb->table(virt_to_phys(leaf), data);
if (err)
return err;
}
@@ -283,7 +283,7 @@ static int kho_radix_walk_leaf(struct kho_radix_leaf *leaf, unsigned long key,
return 0;
for_each_set_bit(i, bitmap, PAGE_SIZE * BITS_PER_BYTE) {
- err = cb->key(key | i);
+ err = cb->key(key | i, data);
if (err)
return err;
}
@@ -293,7 +293,7 @@ static int kho_radix_walk_leaf(struct kho_radix_leaf *leaf, unsigned long key,
static int __kho_radix_walk_tree(struct kho_radix_node *root,
unsigned int level, unsigned long start,
- const struct kho_radix_walk_cb *cb)
+ const struct kho_radix_walk_cb *cb, void *data)
{
struct kho_radix_node *node;
struct kho_radix_leaf *leaf;
@@ -302,7 +302,7 @@ static int __kho_radix_walk_tree(struct kho_radix_node *root,
int err;
if (cb->table) {
- err = cb->table(virt_to_phys(root));
+ err = cb->table(virt_to_phys(root), data);
if (err)
return err;
}
@@ -323,10 +323,10 @@ static int __kho_radix_walk_tree(struct kho_radix_node *root,
* node is pointing to the level 0 bitmap.
*/
leaf = (struct kho_radix_leaf *)node;
- err = kho_radix_walk_leaf(leaf, key, cb);
+ err = kho_radix_walk_leaf(leaf, key, cb, data);
} else {
err = __kho_radix_walk_tree(node, level - 1,
- key, cb);
+ key, cb, data);
}
if (err)
@@ -340,6 +340,7 @@ static int __kho_radix_walk_tree(struct kho_radix_node *root,
* kho_radix_walk_tree - Traverses the radix tree and calls a callback for each key.
* @tree: A pointer to the KHO radix tree to walk.
* @cb: Set of callbacks to be invoked during the tree walk.
+ * @data: Opaque data pointer passed to each callback in @cb.
*
* This function walks the radix tree, searching from the top level down to the
* lowest level (level 0), invoking the appropriate callbacks.
@@ -348,14 +349,15 @@ static int __kho_radix_walk_tree(struct kho_radix_node *root,
* value from the callback that stopped the walk.
*/
int kho_radix_walk_tree(struct kho_radix_tree *tree,
- const struct kho_radix_walk_cb *cb)
+ const struct kho_radix_walk_cb *cb, void *data)
{
if (WARN_ON_ONCE(!tree->root))
return -EINVAL;
guard(mutex)(&tree->lock);
- return __kho_radix_walk_tree(tree->root, KHO_TREE_MAX_DEPTH - 1, 0, cb);
+ return __kho_radix_walk_tree(tree->root, KHO_TREE_MAX_DEPTH - 1, 0, cb,
+ data);
}
EXPORT_SYMBOL_GPL(kho_radix_walk_tree);
@@ -486,7 +488,7 @@ static struct page *__init kho_get_preserved_page(phys_addr_t phys,
return pfn_to_page(pfn);
}
-static int __init kho_preserved_memory_reserve(unsigned long key)
+static int __init kho_preserved_memory_reserve(unsigned long key, void *data)
{
union kho_page_info info;
struct page *page;
@@ -1414,7 +1416,7 @@ static int __init kho_mem_retrieve(const void *fdt)
kho_in.radix_tree.root = phys_to_virt(kho_get_mem_map_phys(fdt));
mutex_init(&kho_in.radix_tree.lock);
- return kho_radix_walk_tree(&kho_in.radix_tree, &cb);
+ return kho_radix_walk_tree(&kho_in.radix_tree, &cb, NULL);
}
static __init int kho_out_fdt_setup(void)
--
2.43.0
^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: [RFC PATCH 05/20] kho: add data argument to radix walk callback
2026-05-28 0:41 ` [RFC PATCH 05/20] kho: add data argument to radix walk callback Jork Loeser
@ 2026-05-28 1:11 ` sashiko-bot
0 siblings, 0 replies; 37+ messages in thread
From: sashiko-bot @ 2026-05-28 1:11 UTC (permalink / raw)
To: Jork Loeser; +Cc: linux-hyperv
Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
Pre-existing issues:
- [High] Signed integer overflow in size calculation in kho_preserved_memory_reserve().
--
commit 64e1a8a692077719d0e7f42a7ff19088fba754f4
Author: Pratyush Yadav (Google) <pratyush@kernel.org>
kho: add data argument to radix walk callback
This commit adds an opaque data pointer argument to the callbacks in
struct kho_radix_walk_cb. This allows callers to pass extra information
to the callback functions during tree walks.
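For illustration only, the callback-struct-plus-opaque-pointer pattern can be sketched in userspace as follows. The names struct sketch_walk_cb, sketch_walk() and sketch_count_key() are hypothetical stand-ins, and the walk covers a flat bitmap rather than the multi-level KHO radix tree.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for struct kho_radix_walk_cb. */
struct sketch_walk_cb {
	int (*key)(unsigned long key, void *data);
};

/*
 * Visit every set bit in @bitmap and hand it to cb->key together with @data.
 * A non-zero return from the callback stops the walk and is returned as-is.
 */
static int sketch_walk(const unsigned long *bitmap, size_t nbits,
		       const struct sketch_walk_cb *cb, void *data)
{
	const size_t bits_per_word = sizeof(unsigned long) * CHAR_BIT;
	size_t i;
	int err;

	/* An unset callback means there is nothing to report. */
	if (!cb->key)
		return 0;

	for (i = 0; i < nbits; i++) {
		if (!(bitmap[i / bits_per_word] & (1UL << (i % bits_per_word))))
			continue;

		err = cb->key(i, data);
		if (err)
			return err;
	}

	return 0;
}

/* Example callback: count the keys through the opaque pointer. */
static int sketch_count_key(unsigned long key, void *data)
{
	unsigned long *count = data;

	(void)key;
	(*count)++;
	return 0;
}
```

A caller that needs no context can pass NULL for the data pointer, as kho_mem_retrieve() does in this patch.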
> diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c
> index 0f8d058f1a272..f6de6bf632261 100644
> --- a/kernel/liveupdate/kexec_handover.c
> +++ b/kernel/liveupdate/kexec_handover.c
[ ... ]
> @@ -486,7 +488,7 @@ static struct page *__init kho_get_preserved_page(phys_addr_t phys,
> return pfn_to_page(pfn);
> }
>
> -static int __init kho_preserved_memory_reserve(unsigned long key)
> +static int __init kho_preserved_memory_reserve(unsigned long key, void *data)
> {
> union kho_page_info info;
> struct page *page;
[Severity: High]
This isn't a bug introduced by this patch, but does sz overflow when
calculating the memory size in this function?
Looking at kho_preserved_memory_reserve(), the memory size is calculated as:
sz = 1 << (order + PAGE_SHIFT);
The literal 1 is a signed 32-bit integer. When KHO preserves memory using
kho_preserve_pages(), the maximum order is bounded by NUMA node boundaries
rather than MAX_PAGE_ORDER, allowing very large contiguous chunks to be
preserved as a single radix tree entry.
For example, if a 2GB contiguous memory chunk is preserved, the order is 19.
With PAGE_SHIFT = 12, order + PAGE_SHIFT is 31. The expression 1 << 31
overflows the signed 32-bit integer into -2147483648, which is then
sign-extended into u64 sz as 0xffffffff80000000.
If order + PAGE_SHIFT >= 32 (e.g., preserving a 4GB chunk), the behavior is
undefined and often evaluates to 1.
Calling memblock_reserve() with either of these invalid sizes will cause the
system to either exhaust all physical memory and fail to boot, or fail to
actually reserve the preserved memory, leading to severe memory corruption
when the successor kernel allocates and overwrites the preserved state.
Could this expression be updated to use the 1ULL literal to prevent
overflow?
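A minimal userspace sketch of the suggested form, assuming 4KB pages. SKETCH_PAGE_SHIFT and sketch_block_size() are hypothetical names used only for this illustration; they are not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12u	/* assumption: 4KB pages */

/* Size in bytes of a block of 2^order pages, computed in 64 bits throughout. */
static uint64_t sketch_block_size(unsigned int order)
{
	unsigned int shift = order + SKETCH_PAGE_SHIFT;

	/* Shifting by the width of the type or more is undefined. */
	assert(shift < 64);

	return 1ULL << shift;
}
```

With the 64-bit literal, order 19 gives 0x80000000 (2GB) and order 20 gives 0x100000000 (4GB), with no sign extension in either case.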
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260528004204.1484584-1-jloeser@linux.microsoft.com?part=5
^ permalink raw reply [flat|nested] 37+ messages in thread
* [RFC PATCH 06/20] kho: allow early-boot usage of the KHO radix tree
2026-05-28 0:41 [RFC PATCH 00/20] mshv: enable kexec with Hyper-V donated pages and partitions Jork Loeser
` (4 preceding siblings ...)
2026-05-28 0:41 ` [RFC PATCH 05/20] kho: add data argument to radix walk callback Jork Loeser
@ 2026-05-28 0:41 ` Jork Loeser
2026-05-28 1:40 ` sashiko-bot
2026-05-28 0:41 ` [RFC PATCH 07/20] kho: allow destroying " Jork Loeser
` (13 subsequent siblings)
19 siblings, 1 reply; 37+ messages in thread
From: Jork Loeser @ 2026-05-28 0:41 UTC (permalink / raw)
To: linux-hyperv, linux-mm, kexec
Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Mike Rapoport, Pasha Tatashin, Pratyush Yadav, Alexander Graf,
Jason Miu, Andrew Morton, David Hildenbrand, Muchun Song,
Oscar Salvador, Baoquan He, Catalin Marinas, Will Deacon,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Kees Cook, Ran Xiaokai, Justinien Bouron,
Sourabh Jain, Pingfan Liu, Rafael J. Wysocki, Mario Limonciello,
linux-arm-kernel, x86, linux-kernel, Michael Kelley, Jork Loeser
From: "Pratyush Yadav (Google)" <pratyush@kernel.org>
The KHO radix tree allocates memory for table pages from the buddy
allocator using get_zeroed_page(). This is not available in early boot
when memblock is still active.
Using the radix tree in early boot is useful for KHO to track metadata
about its memory. One such example is for tracking free blocks for
memory allocation when scratch runs out of space. This feature will be
added in the following commits.
Add kho_radix_{alloc,free}_node() which allocate and free the table
pages. They use slab_is_available() to decide which allocator to use.
While slab_is_available() indicates availability of the slab allocator,
it gets initialized right before buddy so it serves the same practical
purpose.
Signed-off-by: Pratyush Yadav (Google) <pratyush@kernel.org>
Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com>
---
kernel/liveupdate/kexec_handover.c | 24 ++++++++++++++++++++++--
1 file changed, 22 insertions(+), 2 deletions(-)
diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c
index f6de6bf63226..5c201e605b96 100644
--- a/kernel/liveupdate/kexec_handover.c
+++ b/kernel/liveupdate/kexec_handover.c
@@ -143,6 +143,26 @@ static unsigned long kho_radix_get_table_index(unsigned long key,
return (key >> s) % (1 << KHO_TABLE_SIZE_LOG2);
}
+static void __ref *kho_radix_alloc_node(void)
+{
+ struct kho_radix_node *node;
+
+ if (slab_is_available())
+ node = (struct kho_radix_node *)get_zeroed_page(GFP_KERNEL);
+ else
+ node = memblock_alloc(PAGE_SIZE, PAGE_SIZE);
+
+ return node;
+}
+
+static void __ref kho_radix_free_node(struct kho_radix_node *node)
+{
+ if (slab_is_available())
+ free_page((unsigned long)node);
+ else
+ memblock_free(node, PAGE_SIZE);
+}
+
/**
* kho_radix_add_key - Add a key to the radix tree.
* @tree: The KHO radix tree.
@@ -183,7 +203,7 @@ int kho_radix_add_key(struct kho_radix_tree *tree, unsigned long key)
}
/* Next node is empty, create a new node for it */
- new_node = (struct kho_radix_node *)get_zeroed_page(GFP_KERNEL);
+ new_node = kho_radix_alloc_node();
if (!new_node) {
err = -ENOMEM;
goto err_free_nodes;
@@ -214,7 +234,7 @@ int kho_radix_add_key(struct kho_radix_tree *tree, unsigned long key)
err_free_nodes:
for (i = KHO_TREE_MAX_DEPTH - 1; i > 0; i--) {
if (intermediate_nodes[i])
- free_page((unsigned long)intermediate_nodes[i]);
+ kho_radix_free_node(intermediate_nodes[i]);
}
if (anchor_node)
anchor_node->table[anchor_idx] = 0;
--
2.43.0
^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: [RFC PATCH 06/20] kho: allow early-boot usage of the KHO radix tree
2026-05-28 0:41 ` [RFC PATCH 06/20] kho: allow early-boot usage of the KHO radix tree Jork Loeser
@ 2026-05-28 1:40 ` sashiko-bot
0 siblings, 0 replies; 37+ messages in thread
From: sashiko-bot @ 2026-05-28 1:40 UTC (permalink / raw)
To: Jork Loeser; +Cc: linux-hyperv
Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
- [High] kho_radix_alloc_node() incorrectly uses slab_is_available() to determine if the buddy allocator can be used, leading to double-allocation and memory corruption during boot.
- [High] kho_radix_free_node() conditionally frees memory based on the current global boot state rather than how the memory was originally allocated, leading to page state corruption or crashes.
--
commit 470d80274c16904cd71b8b9c878c7272d6b58647
Author: Pratyush Yadav (Google) <pratyush@kernel.org>
kho: allow early-boot usage of the KHO radix tree
This commit updates the KHO radix tree to support allocations during early boot
when the buddy allocator is not yet available. It introduces
kho_radix_alloc_node() and kho_radix_free_node() which use slab_is_available()
to toggle between memblock and buddy allocators.
> diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c
> index f6de6bf632261..5c201e605b969 100644
> --- a/kernel/liveupdate/kexec_handover.c
> +++ b/kernel/liveupdate/kexec_handover.c
[ ... ]
> @@ -143,6 +143,26 @@ static unsigned long kho_radix_get_table_index(unsigned long key,
> return (key >> s) % (1 << KHO_TABLE_SIZE_LOG2);
> }
>
> +static void __ref *kho_radix_alloc_node(void)
> +{
> + struct kho_radix_node *node;
> +
> + if (slab_is_available())
> + node = (struct kho_radix_node *)get_zeroed_page(GFP_KERNEL);
> + else
> + node = memblock_alloc(PAGE_SIZE, PAGE_SIZE);
[Severity: High]
Will this check cause memory corruption during the window between buddy
initialization and slab initialization?
During boot, memblock_free_all() hands over all unreserved pages to the buddy
allocator. However, slab_is_available() remains false until kmem_cache_init()
finishes later in mm_core_init().
If kho_radix_alloc_node() is called in this window, it evaluates
slab_is_available() as false and falls back to memblock_alloc(). This means
it will reserve physical memory that the buddy allocator is already actively
managing, potentially leading to a silent double-allocation where KHO radix
tree nodes alias with other kernel allocations.
> +
> + return node;
> +}
> +
> +static void __ref kho_radix_free_node(struct kho_radix_node *node)
> +{
> + if (slab_is_available())
> + free_page((unsigned long)node);
> + else
> + memblock_free(node, PAGE_SIZE);
> +}
[Severity: High]
Could checking slab_is_available() at the time of freeing cause crashes or
memory leaks for nodes allocated during early boot?
If a node is allocated early via memblock_alloc(), it is backed by a
PageReserved physical page. If this same node is later freed after the
slab allocator has been initialized, slab_is_available() will be true,
causing the free path to incorrectly call free_page() on the reserved page.
Calling free_page() on a memblock-reserved page fails in __free_pages_prepare()
because PAGE_FLAGS_CHECK_AT_FREE includes PG_reserved. This triggers a
bad_page() warning, dumps the stack, and leaks the memory.
Should this code determine the deallocation method based on the node's
actual allocation origin, such as checking PageReserved(virt_to_page(node)),
rather than checking the current global boot state?
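The general idea of releasing by allocation origin can be sketched in userspace as below. This is not the PageReserved() approach mentioned above and not kernel code: every name is hypothetical, calloc()/free() stand in for both allocators, and the origin is stored in a separate descriptor because a real table node occupies a full page.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical: where a node's backing memory came from. */
enum sketch_origin {
	SKETCH_ORIGIN_EARLY,	/* stands in for memblock_alloc() */
	SKETCH_ORIGIN_LATE,	/* stands in for get_zeroed_page() */
};

struct sketch_node {
	enum sketch_origin origin;	/* recorded once, at allocation time */
	void *mem;
};

/* Stand-in for the global boot state that slab_is_available() reports. */
static bool sketch_late_boot;

/* Counters so a caller can observe which release path ran. */
static unsigned int sketch_early_frees;
static unsigned int sketch_late_frees;

static int sketch_node_alloc(struct sketch_node *node, size_t size)
{
	node->origin = sketch_late_boot ? SKETCH_ORIGIN_LATE
					: SKETCH_ORIGIN_EARLY;
	node->mem = calloc(1, size);

	return node->mem ? 0 : -1;
}

static void sketch_node_free(struct sketch_node *node)
{
	/* Dispatch on the recorded origin, not on the current boot state. */
	if (node->origin == SKETCH_ORIGIN_EARLY)
		sketch_early_frees++;
	else
		sketch_late_frees++;

	free(node->mem);
	node->mem = NULL;
}
```

A node allocated while sketch_late_boot is false still takes the early release path when it is freed after the flag flips, which is the property the question above asks for.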
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260528004204.1484584-1-jloeser@linux.microsoft.com?part=6
^ permalink raw reply [flat|nested] 37+ messages in thread
* [RFC PATCH 07/20] kho: allow destroying KHO radix tree
2026-05-28 0:41 [RFC PATCH 00/20] mshv: enable kexec with Hyper-V donated pages and partitions Jork Loeser
` (5 preceding siblings ...)
2026-05-28 0:41 ` [RFC PATCH 06/20] kho: allow early-boot usage of the KHO radix tree Jork Loeser
@ 2026-05-28 0:41 ` Jork Loeser
2026-05-28 0:41 ` [RFC PATCH 08/20] kho: add kho_radix_init_tree() Jork Loeser
` (12 subsequent siblings)
19 siblings, 0 replies; 37+ messages in thread
From: Jork Loeser @ 2026-05-28 0:41 UTC (permalink / raw)
To: linux-hyperv, linux-mm, kexec
Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Mike Rapoport, Pasha Tatashin, Pratyush Yadav, Alexander Graf,
Jason Miu, Andrew Morton, David Hildenbrand, Muchun Song,
Oscar Salvador, Baoquan He, Catalin Marinas, Will Deacon,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Kees Cook, Ran Xiaokai, Justinien Bouron,
Sourabh Jain, Pingfan Liu, Rafael J. Wysocki, Mario Limonciello,
linux-arm-kernel, x86, linux-kernel, Michael Kelley, Jork Loeser
From: "Pratyush Yadav (Google)" <pratyush@kernel.org>
Add kho_radix_destroy_tree() which allows destroying the radix tree and
freeing all its pages.
Signed-off-by: Pratyush Yadav (Google) <pratyush@kernel.org>
Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com>
---
include/linux/kho_radix_tree.h | 3 +++
kernel/liveupdate/kexec_handover.c | 34 ++++++++++++++++++++++++++++++
2 files changed, 37 insertions(+)
diff --git a/include/linux/kho_radix_tree.h b/include/linux/kho_radix_tree.h
index 6c0f7d82716b..617395a6647a 100644
--- a/include/linux/kho_radix_tree.h
+++ b/include/linux/kho_radix_tree.h
@@ -54,6 +54,7 @@ int kho_radix_add_key(struct kho_radix_tree *tree, unsigned long key);
void kho_radix_del_key(struct kho_radix_tree *tree, unsigned long key);
int kho_radix_walk_tree(struct kho_radix_tree *tree,
const struct kho_radix_walk_cb *cb, void *data);
+void kho_radix_destroy_tree(struct kho_radix_tree *tree);
#else /* #ifdef CONFIG_KEXEC_HANDOVER */
@@ -71,6 +72,8 @@ static inline int kho_radix_walk_tree(struct kho_radix_tree *tree,
return -EOPNOTSUPP;
}
+static inline void kho_radix_destroy_tree(struct kho_radix_tree *tree) { }
+
#endif /* #ifdef CONFIG_KEXEC_HANDOVER */
#endif /* _LINUX_KHO_RADIX_TREE_H */
diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c
index 5c201e605b96..3f3ea71baa1a 100644
--- a/kernel/liveupdate/kexec_handover.c
+++ b/kernel/liveupdate/kexec_handover.c
@@ -286,6 +286,40 @@ void kho_radix_del_key(struct kho_radix_tree *tree, unsigned long key)
}
EXPORT_SYMBOL_GPL(kho_radix_del_key);
+static void __kho_radix_destroy_tree(struct kho_radix_node *root,
+ unsigned int level)
+{
+ unsigned long i;
+
+ if (level == 0) {
+ kho_radix_free_node(root);
+ return;
+ }
+
+ for (i = 0; i < PAGE_SIZE / sizeof(phys_addr_t); i++) {
+ if (root->table[i])
+ __kho_radix_destroy_tree(phys_to_virt(root->table[i]),
+ level - 1);
+ }
+
+ kho_radix_free_node(root);
+}
+
+/**
+ * kho_radix_destroy_tree - Destroy the radix tree
+ * @tree: The radix tree to destroy
+ *
+ * Walk @tree and free all its nodes.
+ */
+void kho_radix_destroy_tree(struct kho_radix_tree *tree)
+{
+ if (!tree->root)
+ return;
+
+ __kho_radix_destroy_tree(tree->root, KHO_TREE_MAX_DEPTH - 1);
+}
+EXPORT_SYMBOL_GPL(kho_radix_destroy_tree);
+
static int kho_radix_walk_leaf(struct kho_radix_leaf *leaf, unsigned long key,
const struct kho_radix_walk_cb *cb, void *data)
{
--
2.43.0
* [RFC PATCH 08/20] kho: add kho_radix_init_tree()
2026-05-28 0:41 [RFC PATCH 00/20] mshv: enable kexec with Hyper-V donated pages and partitions Jork Loeser
` (6 preceding siblings ...)
2026-05-28 0:41 ` [RFC PATCH 07/20] kho: allow destroying " Jork Loeser
@ 2026-05-28 0:41 ` Jork Loeser
2026-05-28 1:21 ` sashiko-bot
2026-05-28 0:41 ` [RFC PATCH 09/20] memblock: introduce MEMBLOCK_KHO_SCRATCH_EXT Jork Loeser
` (11 subsequent siblings)
19 siblings, 1 reply; 37+ messages in thread
From: Jork Loeser @ 2026-05-28 0:41 UTC (permalink / raw)
To: linux-hyperv, linux-mm, kexec
Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Mike Rapoport, Pasha Tatashin, Pratyush Yadav, Alexander Graf,
Jason Miu, Andrew Morton, David Hildenbrand, Muchun Song,
Oscar Salvador, Baoquan He, Catalin Marinas, Will Deacon,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Kees Cook, Ran Xiaokai, Justinien Bouron,
Sourabh Jain, Pingfan Liu, Rafael J. Wysocki, Mario Limonciello,
linux-arm-kernel, x86, linux-kernel, Michael Kelley, Jork Loeser
From: "Pratyush Yadav (Google)" <pratyush@kernel.org>
Move the initialization logic of the radix tree into
kho_radix_init_tree() instead of having users open-code it. This makes
the boundaries cleaner and reduces code duplication when a new user of
the radix tree is added in a future commit.
Signed-off-by: Pratyush Yadav (Google) <pratyush@kernel.org>
Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com>
---
include/linux/kho_radix_tree.h | 7 ++++++
kernel/liveupdate/kexec_handover.c | 37 ++++++++++++++++++++++++++++--
2 files changed, 42 insertions(+), 2 deletions(-)
diff --git a/include/linux/kho_radix_tree.h b/include/linux/kho_radix_tree.h
index 617395a6647a..c0840ecb230c 100644
--- a/include/linux/kho_radix_tree.h
+++ b/include/linux/kho_radix_tree.h
@@ -54,6 +54,7 @@ int kho_radix_add_key(struct kho_radix_tree *tree, unsigned long key);
void kho_radix_del_key(struct kho_radix_tree *tree, unsigned long key);
int kho_radix_walk_tree(struct kho_radix_tree *tree,
const struct kho_radix_walk_cb *cb, void *data);
+int kho_radix_init_tree(struct kho_radix_tree *tree, struct kho_radix_node *root);
void kho_radix_destroy_tree(struct kho_radix_tree *tree);
#else /* #ifdef CONFIG_KEXEC_HANDOVER */
@@ -72,6 +73,12 @@ static inline int kho_radix_walk_tree(struct kho_radix_tree *tree,
return -EOPNOTSUPP;
}
+static inline int kho_radix_init_tree(struct kho_radix_tree *tree,
+ struct kho_radix_node *root)
+{
+ return 0;
+}
+
static inline void kho_radix_destroy_tree(struct kho_radix_tree *tree) { }
#endif /* #ifdef CONFIG_KEXEC_HANDOVER */
diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c
index 3f3ea71baa1a..b2d1572808eb 100644
--- a/kernel/liveupdate/kexec_handover.c
+++ b/kernel/liveupdate/kexec_handover.c
@@ -305,6 +305,34 @@ static void __kho_radix_destroy_tree(struct kho_radix_node *root,
kho_radix_free_node(root);
}
+/**
+ * kho_radix_init_tree - initialize the radix tree.
+ * @tree: the tree to initialize.
+ * @root: root table of the radix tree.
+ *
+ * Initialize the radix tree with the given root node. If root is %NULL, an
+ * empty root table is allocated. If root is not %NULL, it is the caller's
+ * responsibility to make sure the root is valid and in the correct format.
+ *
+ * Return: 0 on success, -errno on failure.
+ */
+int kho_radix_init_tree(struct kho_radix_tree *tree, struct kho_radix_node *root)
+{
+ /* Already initialized. */
+ if (tree->root)
+ return 0;
+
+ if (!root)
+ root = kho_radix_alloc_node();
+ if (!root)
+ return -ENOMEM;
+
+ tree->root = root;
+ mutex_init(&tree->lock);
+ return 0;
+}
+EXPORT_SYMBOL_GPL(kho_radix_init_tree);
+
/**
* kho_radix_destroy_tree - Destroy the radix tree
* @tree: The radix tree to destroy
@@ -1467,9 +1495,14 @@ static int __init kho_mem_retrieve(const void *fdt)
const struct kho_radix_walk_cb cb = {
.key = kho_preserved_memory_reserve,
};
+ phys_addr_t mem_map_phys;
+ int err;
+
+ mem_map_phys = kho_get_mem_map_phys(fdt);
+ err = kho_radix_init_tree(&kho_in.radix_tree, phys_to_virt(mem_map_phys));
+ if (err)
+ return err;
- kho_in.radix_tree.root = phys_to_virt(kho_get_mem_map_phys(fdt));
- mutex_init(&kho_in.radix_tree.lock);
return kho_radix_walk_tree(&kho_in.radix_tree, &cb, NULL);
}
--
2.43.0
* Re: [RFC PATCH 08/20] kho: add kho_radix_init_tree()
2026-05-28 0:41 ` [RFC PATCH 08/20] kho: add kho_radix_init_tree() Jork Loeser
@ 2026-05-28 1:21 ` sashiko-bot
0 siblings, 0 replies; 37+ messages in thread
From: sashiko-bot @ 2026-05-28 1:21 UTC (permalink / raw)
To: Jork Loeser; +Cc: linux-hyperv
Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [High] kho_radix_destroy_tree() leaves tree->root dangling, causing subsequent re-initializations to erroneously succeed and lead to Use-After-Free.
--
commit d207a159c7c112b64706fe2fb64399a65932fdb6
Author: Pratyush Yadav (Google) <pratyush@kernel.org>
kho: add kho_radix_init_tree()
Move the initialization logic of the radix tree into
kho_radix_init_tree() instead of having users open-code it. This makes
the boundaries cleaner and reduces code duplication when a new user of
the radix tree is added in a future commit.
[ ... ]
> diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c
> index 3f3ea71baa1ad..b2d1572808ebf 100644
> --- a/kernel/liveupdate/kexec_handover.c
> +++ b/kernel/liveupdate/kexec_handover.c
> @@ -305,6 +305,34 @@ static void __kho_radix_destroy_tree(struct kho_radix_node *root,
> kho_radix_free_node(root);
> }
>
> +/**
> + * kho_radix_init_tree - initialize the radix tree.
> + * @tree: the tree to initialize.
> + * @root: root table of the radix tree.
> + *
> + * Initialize the radix tree with the given root node. If root is %NULL, an
> + * empty root table is allocated. If root is not %NULL, it is the caller's
> + * responsibility to make sure the root is valid and in the correct format.
> + *
> + * Return: 0 on success, -errno on failure.
> + */
> +int kho_radix_init_tree(struct kho_radix_tree *tree, struct kho_radix_node *root)
> +{
> + /* Already initialized. */
> + if (tree->root)
> + return 0;
[Severity: High]
Can this cause a use-after-free if a tree is re-initialized?
Looking at kho_radix_destroy_tree(), it frees the tree nodes but does not set
tree->root to NULL:
kernel/liveupdate/kexec_handover.c:kho_radix_destroy_tree() {
if (!tree->root)
return;
__kho_radix_destroy_tree(tree->root, KHO_TREE_MAX_DEPTH - 1);
}
If a tree is destroyed and then kho_radix_init_tree() is called on it again,
wouldn't this check see the dangling tree->root pointer and incorrectly return
success, leading to a use-after-free on subsequent tree operations?
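One way to close this gap is to have the destroy path clear the root pointer after freeing. The following is a minimal userspace sketch of that idea, not the kernel code: struct demo_tree and the demo_* functions are hypothetical stand-ins, and calloc()/free() replace the kernel's node allocator.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for struct kho_radix_tree. */
struct demo_tree {
	void *root;
};

/* Models the "already initialized" early return in kho_radix_init_tree(). */
static int demo_init_tree(struct demo_tree *tree)
{
	assert(tree != NULL);

	if (tree->root)
		return 0;

	tree->root = calloc(1, 4096);
	return tree->root ? 0 : -ENOMEM;
}

/* Frees the root and clears the pointer so a later init allocates again. */
static void demo_destroy_tree(struct demo_tree *tree)
{
	assert(tree != NULL);

	free(tree->root);
	tree->root = NULL;
}
```

With the pointer cleared, calling demo_init_tree() after demo_destroy_tree() allocates a fresh root instead of returning early on a stale one.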
> +
> + if (!root)
> + root = kho_radix_alloc_node();
> + if (!root)
> + return -ENOMEM;
> +
> + tree->root = root;
> + mutex_init(&tree->lock);
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(kho_radix_init_tree);
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260528004204.1484584-1-jloeser@linux.microsoft.com?part=8
* [RFC PATCH 09/20] memblock: introduce MEMBLOCK_KHO_SCRATCH_EXT
2026-05-28 0:41 [RFC PATCH 00/20] mshv: enable kexec with Hyper-V donated pages and partitions Jork Loeser
` (7 preceding siblings ...)
2026-05-28 0:41 ` [RFC PATCH 08/20] kho: add kho_radix_init_tree() Jork Loeser
@ 2026-05-28 0:41 ` Jork Loeser
2026-05-28 0:41 ` [RFC PATCH 10/20] kho: extended scratch Jork Loeser
` (10 subsequent siblings)
19 siblings, 0 replies; 37+ messages in thread
From: Jork Loeser @ 2026-05-28 0:41 UTC (permalink / raw)
To: linux-hyperv, linux-mm, kexec
Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Mike Rapoport, Pasha Tatashin, Pratyush Yadav, Alexander Graf,
Jason Miu, Andrew Morton, David Hildenbrand, Muchun Song,
Oscar Salvador, Baoquan He, Catalin Marinas, Will Deacon,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Kees Cook, Ran Xiaokai, Justinien Bouron,
Sourabh Jain, Pingfan Liu, Rafael J. Wysocki, Mario Limonciello,
linux-arm-kernel, x86, linux-kernel, Michael Kelley, Jork Loeser
From: "Pratyush Yadav (Google)" <pratyush@kernel.org>
In upcoming commits, KHO will learn how to discover free blocks of
memory by walking the KHO radix tree. It will then mark those regions
as scratch to allow memory allocation in case scratch runs low.
To differentiate the extended scratch areas from the main scratch areas,
introduce MEMBLOCK_KHO_SCRATCH_EXT. Use it when choosing memblock flags
for allocations during the scratch-only phase. Teach should_skip_region()
to check for both flags before deciding if the region should be skipped.
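The two-flag check described above can also be read as a single mask test. The sketch below is a userspace model under that reading, not the memblock code: the DEMO_* constants and demo_skip_for_scratch() are hypothetical names, and only the scratch part of the skip decision is modelled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the memblock flag values in this patch. */
#define DEMO_KHO_SCRATCH	0x40u
#define DEMO_KHO_SCRATCH_EXT	0x80u

/*
 * Returns true when a region must be skipped for a scratch-only request.
 * @req: flags requested by the allocator, @region: flags set on the region.
 */
static bool demo_skip_for_scratch(unsigned int req, unsigned int region)
{
	const unsigned int mask = DEMO_KHO_SCRATCH | DEMO_KHO_SCRATCH_EXT;

	/* Not a scratch-only request: this check does not apply. */
	if (!(req & mask))
		return false;

	/* Keep the region if it carries at least one requested scratch flag. */
	return !(req & region & mask);
}
```

A request for both flags accepts a region marked with either one, while a request for only MEMBLOCK_KHO_SCRATCH still skips a region that is marked only as extended scratch.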
Signed-off-by: Pratyush Yadav (Google) <pratyush@kernel.org>
Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com>
---
include/linux/memblock.h | 10 ++++++++++
mm/memblock.c | 41 ++++++++++++++++++++++++++++++++++------
2 files changed, 45 insertions(+), 6 deletions(-)
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 5afcd99aa8c1..4f535ca4947a 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -51,6 +51,9 @@ extern unsigned long long max_possible_pfn;
* memory reservations yet, so we get scratch memory from the previous
* kernel that we know is good to use. It is the only memory that
* allocations may happen from in this phase.
+ * @MEMBLOCK_KHO_SCRATCH_EXT: same as MEMBLOCK_KHO_SCRATCH but was discovered at
+ * boot time by finding gaps in preserved memory instead of being passed from
+ * previous kernel. Does not get passed to the next kernel.
*/
enum memblock_flags {
MEMBLOCK_NONE = 0x0, /* No special request */
@@ -61,6 +64,7 @@ enum memblock_flags {
MEMBLOCK_RSRV_NOINIT = 0x10, /* don't initialize struct pages */
MEMBLOCK_RSRV_KERN = 0x20, /* memory reserved for kernel use */
MEMBLOCK_KHO_SCRATCH = 0x40, /* scratch memory for kexec handover */
+ MEMBLOCK_KHO_SCRATCH_EXT= 0x80, /* extended scratch memory for KHO */
};
/**
@@ -157,6 +161,7 @@ int memblock_clear_nomap(phys_addr_t base, phys_addr_t size);
int memblock_reserved_mark_noinit(phys_addr_t base, phys_addr_t size);
int memblock_reserved_mark_kern(phys_addr_t base, phys_addr_t size);
int memblock_mark_kho_scratch(phys_addr_t base, phys_addr_t size);
+int memblock_mark_kho_scratch_ext(phys_addr_t base, phys_addr_t size);
int memblock_clear_kho_scratch(phys_addr_t base, phys_addr_t size);
void memblock_free(void *ptr, size_t size);
@@ -304,6 +309,11 @@ static inline bool memblock_is_kho_scratch(struct memblock_region *m)
return m->flags & MEMBLOCK_KHO_SCRATCH;
}
+static inline bool memblock_is_kho_scratch_ext(struct memblock_region *m)
+{
+ return m->flags & MEMBLOCK_KHO_SCRATCH_EXT;
+}
+
int memblock_search_pfn_nid(unsigned long pfn, unsigned long *start_pfn,
unsigned long *end_pfn);
void __next_mem_pfn_range(int *idx, int nid, unsigned long *out_start_pfn,
diff --git a/mm/memblock.c b/mm/memblock.c
index 6349c48154f4..6f76a6bb96d6 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -182,7 +182,7 @@ static enum memblock_flags __init_memblock choose_memblock_flags(void)
{
/* skip non-scratch memory for kho early boot allocations */
if (kho_scratch_only)
- return MEMBLOCK_KHO_SCRATCH;
+ return MEMBLOCK_KHO_SCRATCH | MEMBLOCK_KHO_SCRATCH_EXT;
return system_has_some_mirror ? MEMBLOCK_MIRROR : MEMBLOCK_NONE;
}
@@ -1180,8 +1180,9 @@ int __init_memblock memblock_reserved_mark_kern(phys_addr_t base, phys_addr_t si
* @base: the base phys addr of the region
* @size: the size of the region
*
- * Only memory regions marked with %MEMBLOCK_KHO_SCRATCH will be considered
- * for allocations during early boot with kexec handover.
+ * Only memory regions marked with %MEMBLOCK_KHO_SCRATCH or
+ * %MEMBLOCK_KHO_SCRATCH_EXT will be considered for allocations during early
+ * boot with kexec handover.
*
* Return: 0 on success, -errno on failure.
*/
@@ -1205,6 +1206,23 @@ __init int memblock_clear_kho_scratch(phys_addr_t base, phys_addr_t size)
MEMBLOCK_KHO_SCRATCH);
}
+/**
+ * memblock_mark_kho_scratch_ext - Mark a memory region as MEMBLOCK_KHO_SCRATCH_EXT.
+ * @base: the base phys addr of the region
+ * @size: the size of the region
+ *
+ * Only memory regions marked with %MEMBLOCK_KHO_SCRATCH or
+ * %MEMBLOCK_KHO_SCRATCH_EXT will be considered for allocations during early
+ * boot with kexec handover.
+ *
+ * Return: 0 on success, -errno on failure.
+ */
+__init int memblock_mark_kho_scratch_ext(phys_addr_t base, phys_addr_t size)
+{
+ return memblock_setclr_flag(&memblock.memory, base, size, 1,
+ MEMBLOCK_KHO_SCRATCH_EXT);
+}
+
static bool should_skip_region(struct memblock_type *type,
struct memblock_region *m,
int nid, int flags)
@@ -1238,10 +1256,20 @@ static bool should_skip_region(struct memblock_type *type,
/*
* In early alloc during kexec handover, we can only consider
- * MEMBLOCK_KHO_SCRATCH regions for the allocations
+ * MEMBLOCK_KHO_SCRATCH or MEMBLOCK_KHO_SCRATCH_EXT regions for the
+ * allocations.
*/
- if ((flags & MEMBLOCK_KHO_SCRATCH) && !memblock_is_kho_scratch(m))
- return true;
+ if (flags & (MEMBLOCK_KHO_SCRATCH | MEMBLOCK_KHO_SCRATCH_EXT)) {
+ bool skip = true;
+
+ if ((flags & MEMBLOCK_KHO_SCRATCH) && memblock_is_kho_scratch(m))
+ skip = false;
+
+ if ((flags & MEMBLOCK_KHO_SCRATCH_EXT) && memblock_is_kho_scratch_ext(m))
+ skip = false;
+
+ return skip;
+ }
return false;
}
@@ -2801,6 +2829,7 @@ static const char * const flagname[] = {
[ilog2(MEMBLOCK_RSRV_NOINIT)] = "RSV_NIT",
[ilog2(MEMBLOCK_RSRV_KERN)] = "RSV_KERN",
[ilog2(MEMBLOCK_KHO_SCRATCH)] = "KHO_SCRATCH",
+ [ilog2(MEMBLOCK_KHO_SCRATCH_EXT)] = "KHO_SCRATCH_EXT",
};
static int memblock_debug_show(struct seq_file *m, void *private)
--
2.43.0
* [RFC PATCH 10/20] kho: extended scratch
2026-05-28 0:41 [RFC PATCH 00/20] mshv: enable kexec with Hyper-V donated pages and partitions Jork Loeser
` (8 preceding siblings ...)
2026-05-28 0:41 ` [RFC PATCH 09/20] memblock: introduce MEMBLOCK_KHO_SCRATCH_EXT Jork Loeser
@ 2026-05-28 0:41 ` Jork Loeser
2026-05-28 1:21 ` sashiko-bot
2026-05-28 0:41 ` [RFC PATCH 11/20] kho: return virtual address of mem_map Jork Loeser
` (9 subsequent siblings)
19 siblings, 1 reply; 37+ messages in thread
From: Jork Loeser @ 2026-05-28 0:41 UTC (permalink / raw)
To: linux-hyperv, linux-mm, kexec
Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Mike Rapoport, Pasha Tatashin, Pratyush Yadav, Alexander Graf,
Jason Miu, Andrew Morton, David Hildenbrand, Muchun Song,
Oscar Salvador, Baoquan He, Catalin Marinas, Will Deacon,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Kees Cook, Ran Xiaokai, Justinien Bouron,
Sourabh Jain, Pingfan Liu, Rafael J. Wysocki, Mario Limonciello,
linux-arm-kernel, x86, linux-kernel, Michael Kelley, Jork Loeser
From: "Pratyush Yadav (Google)" <pratyush@kernel.org>
Motivation
==========
The scratch space is allocated by the first kernel in the KHO chain, and
is reused by all subsequent kernels. The size of the space is either set
via the commandline by the system administrator or by calculating the
amount of memory used by the kernel and adding a multiplier. In either
case, the scratch space is a heuristic and is liable to fill up and fail
allocation if a kernel uses more memory than expected.
In addition, gigantic huge pages (usually 1 GiB) are allocated via
memblock, and in a KHO boot that memory comes from the scratch space. In
hypervisors it is common to dedicate a major part of the system's memory
to gigantic hugepages for VM memory.
If this memory needs to come from scratch space, then scratch needs to
be greater than the memory needed for huge pages, which is impractical.
In addition, hugepages can be preserved memory. Allocating them from
scratch violates the assumption that scratch contains no preserved
memory.
Methodology
===========
Introduce extended scratch areas. These areas are discovered at boot by
walking the preserved memory radix tree and looking for free blocks of
memory. They are then marked as scratch to allow allocations from them.
This makes KHO more resilient to memory pressure and allows supporting
huge page preservation.
Since the preserved memory radix tree mixes both physical address and
order into a single key, and does not track table pages, it is difficult
to identify free areas from it directly. Walk the tree and digest it
down into another radix tree. The latter tracks blocks of
KHO_EXT_SHIFT (1 GiB as of now) granularity. Then walk the digested tree
and mark the areas between the present keys as scratch.
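The two passes can be modelled in userspace roughly as follows. This is a simplified sketch, not the patch's implementation: a flat bool array stands in for the digested radix tree, struct demo_range and the demo_* functions are hypothetical, and DEMO_MAX_BLOCKS is an arbitrary cap chosen for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_EXT_SHIFT	30	/* 1 GiB blocks, mirroring KHO_EXT_SHIFT */
#define DEMO_MAX_BLOCKS	64	/* hypothetical cap: models 64 GiB of DRAM */

struct demo_range {
	uint64_t start;
	uint64_t size;
};

/* Pass 1: digest preserved ranges into a per-block "busy" map. */
static void demo_digest(const struct demo_range *r, size_t n,
			bool busy[DEMO_MAX_BLOCKS])
{
	for (size_t i = 0; i < n; i++) {
		uint64_t first, last;

		assert(r[i].size > 0);
		first = r[i].start >> DEMO_EXT_SHIFT;
		last = (r[i].start + r[i].size - 1) >> DEMO_EXT_SHIFT;
		assert(last < DEMO_MAX_BLOCKS);

		for (uint64_t b = first; b <= last; b++)
			busy[b] = true;
	}
}

/* Pass 2: visit busy blocks in ascending order and emit the gaps. */
static size_t demo_find_gaps(const bool busy[DEMO_MAX_BLOCKS],
			     uint64_t dram_end,
			     struct demo_range *out, size_t max_out)
{
	uint64_t prev_end = 0;
	size_t n = 0;

	for (uint64_t b = 0; b < DEMO_MAX_BLOCKS; b++) {
		uint64_t start = b << DEMO_EXT_SHIFT;

		if (!busy[b])
			continue;
		if (start > prev_end && n < max_out)
			out[n++] = (struct demo_range){ prev_end, start - prev_end };
		prev_end = start + (UINT64_C(1) << DEMO_EXT_SHIFT);
	}

	/* Everything from the last busy block to the end of DRAM is free. */
	if (prev_end < dram_end && n < max_out)
		out[n++] = (struct demo_range){ prev_end, dram_end - prev_end };

	return n;
}
```

For example, with preserved pages only in the first and fourth 1 GiB blocks of an 8 GiB machine, the second pass reports the gaps [1 GiB, 3 GiB) and [4 GiB, 8 GiB).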
Performance
===========
The discovery algorithm traverses the preserved memory radix tree
exactly once. While it does use memory for the digested radix tree,
since the blocks are split by 1 GiB, a single bitmap with 4k pages can
track up to 32 TiB of memory. So there are likely to be very few radix
tree pages used in this tracking. For systems with all physical memory
below 32 TiB, this should result in a total of 6 pages being
used (KHO_TREE_MAX_DEPTH == 6).
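The 32 TiB figure follows from one bit per 1 GiB block in a 4 KiB leaf bitmap: 4096 * 8 = 32768 bits, and 32768 * 1 GiB = 32 TiB. A small sketch of that arithmetic, assuming a page-sized leaf bitmap as stated above (demo_leaf_coverage() is a hypothetical helper, not a kernel function):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes of physical memory covered by one page-sized leaf bitmap when each
 * bit stands for one block of (1 << block_shift) bytes.
 */
static uint64_t demo_leaf_coverage(uint64_t page_size, unsigned int block_shift)
{
	uint64_t bits = page_size * 8;	/* one bit per tracked block */

	assert(block_shift < 64);
	return bits << block_shift;
}
```

Calling demo_leaf_coverage(4096, 30) gives 2^45 bytes, which is 32 TiB.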
An alternate way of achieving this would be to call kho_mem_retrieve()
earlier in boot and mark all the KHO preservations as reserved. But that
can blow up memblock.reserved with a bunch of 4K pages scattered
everywhere, which will reduce performance of subsequent allocations.
Since the free blocks are tracked in chunks of 1 GiB, this won't blow up
memblock.memory as much.
Practical evaluation
====================
The testing is done on an x86_64 QEMU VM running under KVM with 64G of
memory and 12 CPUs. The machine pre-allocates 50 1G pages.
Since the performance scales with how busy the radix tree is, tests are
done with 2 preservation patterns: first with two 1M memfds, second with
two 1G memfds, both using 4k pages.
Test case 1 - 1M memfd
~~~~~~~~~~~~~~~~~~~~~~
This test case has two memfds with 1M memory each in 4k pages, plus
other preservations from LUO core and other KHO users.
This is what the radix tree stats look like:
radix_nodes: 0x2f
nr_preservations: 0x22d
mem_preserved: 0xa2b000
per order preservations:
order 0: 0x215
order 1: 0x9
order 2: 0x1
order 3: 0x2
order 4: 0x5
order 5: 0x1
order 6: 0x2
order 7: 0x2
order 9: 0x1
order 10: 0x1
and this is how long it takes to extend the scratch after KHO boot:
kho_extend_scratch(): time taken: 88 us
kho_extend_scratch(): total memory recovered: 0xf7ff7b000 (~62G)
Test case 2 - 1G memfd
~~~~~~~~~~~~~~~~~~~~~~
This test case has two memfds with 1G memory each in 4k pages, plus
other preservations from LUO core and other KHO users.
This is what the radix tree stats look like:
radix_nodes: 0x45
nr_preservations: 0x80832
mem_preserved: 0x8102d000
per order preservations:
order 0: 0x80817
order 1: 0x7
order 2: 0x2
order 3: 0x4
order 4: 0x2
order 5: 0x2
order 6: 0x4
order 7: 0x3
order 8: 0x1
order 9: 0x2
and this is how long it takes to extend the scratch after KHO boot:
kho_extend_scratch(): time taken: 21769 us
kho_extend_scratch(): total memory recovered: 0xe40000000 (57G)
Signed-off-by: Pratyush Yadav (Google) <pratyush@kernel.org>
Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com>
---
include/linux/kexec_handover.h | 1 +
kernel/liveupdate/kexec_handover.c | 148 +++++++++++++++++++++++++----
mm/mm_init.c | 1 +
3 files changed, 133 insertions(+), 17 deletions(-)
diff --git a/include/linux/kexec_handover.h b/include/linux/kexec_handover.h
index 8968c56d2d73..6ce46f36ed99 100644
--- a/include/linux/kexec_handover.h
+++ b/include/linux/kexec_handover.h
@@ -37,6 +37,7 @@ void kho_remove_subtree(void *blob);
int kho_retrieve_subtree(const char *name, phys_addr_t *phys, size_t *size);
void kho_memory_init(void);
+void kho_extend_scratch(void);
void kho_populate(phys_addr_t fdt_phys, u64 fdt_len, phys_addr_t scratch_phys,
u64 scratch_len);
diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c
index b2d1572808eb..a006a883ee94 100644
--- a/kernel/liveupdate/kexec_handover.c
+++ b/kernel/liveupdate/kexec_handover.c
@@ -84,6 +84,23 @@ static struct kho_out kho_out = {
},
};
+struct kho_in {
+ phys_addr_t fdt_phys;
+ phys_addr_t scratch_phys;
+ char previous_release[__NEW_UTS_LEN + 1];
+ u32 kexec_count;
+ struct kho_debugfs dbg;
+ struct kho_radix_tree radix_tree;
+};
+
+static struct kho_in kho_in = {
+};
+
+static const void *kho_get_fdt(void)
+{
+ return kho_in.fdt_phys ? phys_to_virt(kho_in.fdt_phys) : NULL;
+}
+
/**
* kho_encode_radix_key - Encodes a physical address and order into a radix key.
* @phys: The physical address of the page.
@@ -825,6 +842,120 @@ static void __init kho_reserve_scratch(void)
kho_enable = false;
}
+#define KHO_EXT_SHIFT 30 /* 1 GiB */
+
+static int __init kho_ext_walk_key(unsigned long key, void *data)
+{
+ struct kho_radix_tree *tree = data;
+ phys_addr_t start, end;
+ unsigned int order;
+ int err;
+
+ start = kho_decode_radix_key(key, &order);
+ end = start + (1UL << (order + PAGE_SHIFT));
+
+ while (start < end) {
+ err = kho_radix_add_key(tree, start >> KHO_EXT_SHIFT);
+ if (err)
+ return err;
+
+ start += (1UL << KHO_EXT_SHIFT);
+ }
+
+ return 0;
+}
+
+static int __init kho_ext_walk_table(phys_addr_t phys, void *data)
+{
+ struct kho_radix_tree *tree = data;
+
+ return kho_radix_add_key(tree, phys >> KHO_EXT_SHIFT);
+}
+
+static int __init kho_ext_mark_scratch(unsigned long key, void *data)
+{
+ phys_addr_t *prev_end = data;
+ phys_addr_t start = key << KHO_EXT_SHIFT;
+ int err;
+
+ if (start > *prev_end) {
+ err = memblock_mark_kho_scratch_ext(*prev_end, start - *prev_end);
+ if (err)
+ return err;
+ }
+
+ *prev_end = start + (1UL << KHO_EXT_SHIFT);
+ return 0;
+}
+
+/**
+ * kho_extend_scratch - Extend the scratch regions
+ *
+ * The KHO radix tree mixes both physical address and order into a single key.
+ * This makes it hard to look for free ranges directly. This function first
+ * walks the radix tree and digests it down into another radix tree, whose keys
+ * identify blocks of KHO_EXT_SHIFT which contain preserved memory.
+ *
+ * Then it walks the digested radix tree and marks everything that doesn't have
+ * preserved memory as scratch.
+ *
+ * NOTE: This function allocates memory so it should be called when scratch has
+ * available space.
+ *
+ * NOTE: The pages of the KHO radix tree tables are not marked as preserved in
+ * the KHO tree. But they are expected to remain untouched until the tree is
+ * fully parsed. So this function also considers them to be "preserved memory"
+ * and marks their blocks as busy.
+ */
+void __init kho_extend_scratch(void)
+{
+ const struct kho_radix_walk_cb kho_cb = {
+ .key = kho_ext_walk_key,
+ .table = kho_ext_walk_table,
+ };
+ const struct kho_radix_walk_cb ext_cb = {
+ .key = kho_ext_mark_scratch,
+ };
+ struct kho_radix_tree radix;
+ phys_addr_t prev_end = 0, mem_map_phys;
+ int err = 0;
+
+ if (!is_kho_boot())
+ return;
+
+ /* Make sure the KHO radix tree is initialized. */
+ mem_map_phys = kho_get_mem_map_phys(kho_get_fdt());
+ err = kho_radix_init_tree(&kho_in.radix_tree, phys_to_virt(mem_map_phys));
+ if (err)
+ goto print;
+
+ err = kho_radix_init_tree(&radix, NULL);
+ if (err)
+ goto print;
+
+ /* Walk the KHO radix tree to find busy blocks. */
+ err = kho_radix_walk_tree(&kho_in.radix_tree, &kho_cb, &radix);
+ if (err)
+ goto out;
+
+ /* Walk the blocks and mark everything between keys as scratch. */
+ err = kho_radix_walk_tree(&radix, &ext_cb, &prev_end);
+ if (err)
+ goto out;
+
+ /* Mark everything from last busy block to end of DRAM. */
+ if (prev_end < memblock_end_of_DRAM())
+ err = memblock_mark_kho_scratch_ext(prev_end,
+ memblock_end_of_DRAM() - prev_end);
+
+ /* fallthrough */
+out:
+ kho_radix_destroy_tree(&radix);
+print:
+ if (err)
+ pr_err("Failed to extend scratch: %pe\n", ERR_PTR(err));
+}
+
/**
* kho_add_subtree - record the physical address of a sub blob in KHO root tree.
* @name: name of the sub tree.
@@ -1406,23 +1537,6 @@ void kho_restore_free(void *mem)
}
EXPORT_SYMBOL_GPL(kho_restore_free);
-struct kho_in {
- phys_addr_t fdt_phys;
- phys_addr_t scratch_phys;
- char previous_release[__NEW_UTS_LEN + 1];
- u32 kexec_count;
- struct kho_debugfs dbg;
- struct kho_radix_tree radix_tree;
-};
-
-static struct kho_in kho_in = {
-};
-
-static const void *kho_get_fdt(void)
-{
- return kho_in.fdt_phys ? phys_to_virt(kho_in.fdt_phys) : NULL;
-}
-
/**
* is_kho_boot - check if current kernel was booted via KHO-enabled
* kexec
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 6de3a77eb9ae..bbca4cc9b912 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -2702,6 +2702,7 @@ void __init __weak mem_init(void)
void __init mm_core_init_early(void)
{
+ kho_extend_scratch();
hugetlb_cma_reserve();
hugetlb_bootmem_alloc();
--
2.43.0
* Re: [RFC PATCH 10/20] kho: extended scratch
2026-05-28 0:41 ` [RFC PATCH 10/20] kho: extended scratch Jork Loeser
@ 2026-05-28 1:21 ` sashiko-bot
0 siblings, 0 replies; 37+ messages in thread
From: sashiko-bot @ 2026-05-28 1:21 UTC (permalink / raw)
To: Jork Loeser; +Cc: linux-hyperv
Thank you for your contribution! Sashiko AI review found 5 potential issue(s) to consider:
- [High] Missing error check for `kho_get_mem_map_phys()` return value
- [High] Missing static inline stub for `kho_extend_scratch()` causes build failures
- [High] 32-bit integer overflow in `kho_ext_mark_scratch()` causes incorrect physical address calculations
- [Medium] Uninitialized stack variable `radix` passed to `kho_radix_init_tree()`
- [Medium] Nested acquisition of mutexes from the same lockdep class
--
commit 69c3d241e681519286b4c5b53a9ed0d4287fb2b8
Author: Pratyush Yadav (Google) <pratyush@kernel.org>
kho: extended scratch
The scratch space is allocated by the first kernel in the KHO chain and reused.
This commit introduces extended scratch areas discovered at boot by walking the
preserved memory radix tree.
> diff --git a/include/linux/kexec_handover.h b/include/linux/kexec_handover.h
> index 8968c56d2d73e..6ce46f36ed992 100644
> --- a/include/linux/kexec_handover.h
> +++ b/include/linux/kexec_handover.h
> @@ -37,6 +37,7 @@ void kho_remove_subtree(void *blob);
> int kho_retrieve_subtree(const char *name, phys_addr_t *phys, size_t *size);
>
> void kho_memory_init(void);
> +void kho_extend_scratch(void);
[Severity: High]
Does this missing static inline stub cause build failures when
CONFIG_KEXEC_HANDOVER=n?
kho_extend_scratch() is called unconditionally from mm_core_init_early().
However, its prototype here is only declared when CONFIG_KEXEC_HANDOVER is
enabled. Compiling the kernel with CONFIG_KEXEC_HANDOVER=n will result in an
implicit function declaration error.
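The usual pattern for this is a prototype under the config option and an empty static inline stub otherwise. The sketch below shows the shape in a self-contained form, under the assumption that the header follows that pattern: DEMO_CONFIG_KHO stands in for CONFIG_KEXEC_HANDOVER and is left undefined, and the demo_* names are hypothetical.

```c
#include <assert.h>

/* DEMO_CONFIG_KHO stands in for CONFIG_KEXEC_HANDOVER; undefined here. */
#ifdef DEMO_CONFIG_KHO
void demo_extend_scratch(void);
#else
/* Empty stub so unconditional callers still compile and link. */
static inline void demo_extend_scratch(void) { }
#endif

/* Models an unconditional caller such as mm_core_init_early(). */
static int demo_mm_core_init_early(void)
{
	demo_extend_scratch();
	return 0;
}
```

With the stub in place, the caller builds in both configurations without an #ifdef at the call site.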
[ ... ]
> diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c
> index b2d1572808ebf..a006a883ee94a 100644
> --- a/kernel/liveupdate/kexec_handover.c
> +++ b/kernel/liveupdate/kexec_handover.c
[ ... ]
> @@ -825,6 +842,120 @@ static void __init kho_reserve_scratch(void)
> kho_enable = false;
> }
>
> +#define KHO_EXT_SHIFT 30 /* 1 GiB */
> +
> +static int __init kho_ext_walk_key(unsigned long key, void *data)
> +{
> + struct kho_radix_tree *tree = data;
> + phys_addr_t start, end;
> + unsigned int order;
> + int err;
> +
> + start = kho_decode_radix_key(key, &order);
> + end = start + (1UL << (order + PAGE_SHIFT));
> +
> + while (start < end) {
> + err = kho_radix_add_key(tree, start >> KHO_EXT_SHIFT);
[Severity: Medium]
Will this trigger a lockdep warning due to nested acquisition of mutexes from
the same lockdep class?
When kho_radix_walk_tree() is called on kho_in.radix_tree, it acquires its
lock and then calls this walk callback. This callback calls kho_radix_add_key()
on the new radix tree, which acquires its lock. Because both locks are
initialized by kho_radix_init_tree(), they share the same lockdep class.
Acquiring a second lock of the same class without using mutex_lock_nested()
can trigger a "possible recursive locking detected" lockdep warning.
> + if (err)
> + return err;
> +
> + start += (1UL << KHO_EXT_SHIFT);
> + }
> +
> + return 0;
> +}
[ ... ]
> +static int __init kho_ext_mark_scratch(unsigned long key, void *data)
> +{
> + phys_addr_t *prev_end = data;
> + phys_addr_t start = key << KHO_EXT_SHIFT;
[Severity: High]
Can this cause a 32-bit integer overflow on 32-bit architectures?
The key variable is of type unsigned long, which is 32-bit on 32-bit
architectures. Since KHO_EXT_SHIFT is 30, any key >= 4 (representing memory at
or above 4 GiB) will overflow the 32-bit shift and result in a truncated
address. Should key be cast to phys_addr_t before shifting?
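The truncation can be reproduced in userspace by modelling a 32-bit unsigned long with uint32_t. This is only an illustration of the arithmetic, with hypothetical demo_* helpers; the second helper shows the effect of widening before the shift, as a cast to a 64-bit phys_addr_t would.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_EXT_SHIFT 30	/* mirrors KHO_EXT_SHIFT */

/* Models "key << KHO_EXT_SHIFT" where unsigned long is 32 bits wide. */
static uint64_t demo_shift_narrow(uint32_t key)
{
	/* The shift happens in 32 bits, so the high bits are lost. */
	return (uint32_t)(key << DEMO_EXT_SHIFT);
}

/* Widen first, then shift, so the full address is kept. */
static uint64_t demo_shift_wide(uint32_t key)
{
	return (uint64_t)key << DEMO_EXT_SHIFT;
}
```

For key 4 the narrow variant yields 0 while the widened variant yields 0x100000000 (4 GiB); for keys below 4 both agree.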
> + int err;
> +
> + if (start > *prev_end) {
> + err = memblock_mark_kho_scratch_ext(*prev_end, start - *prev_end);
> + if (err)
> + return err;
> + }
> +
> + *prev_end = start + (1UL << KHO_EXT_SHIFT);
> + return 0;
> +}
[ ... ]
> +void __init kho_extend_scratch(void)
> +{
> + const struct kho_radix_walk_cb kho_cb = {
> + .key = kho_ext_walk_key,
> + .table = kho_ext_walk_table,
> + };
> + const struct kho_radix_walk_cb ext_cb = {
> + .key = kho_ext_mark_scratch,
> + };
> + struct kho_radix_tree radix;
[Severity: Medium]
Is it safe to pass this uninitialized stack variable radix to
kho_radix_init_tree()?
It is allocated on the stack but not zero-initialized. If kho_radix_init_tree()
conditionally initializes its fields (e.g., checking if tree->root is non-NULL
to avoid double initialization for global trees), it could read garbage stack
memory and return without initializing radix.lock. Subsequent operations would
then dereference uninitialized pointers.
> + phys_addr_t prev_end = 0, mem_map_phys;
> + int err = 0;
> +
> + if (!is_kho_boot())
> + return;
> +
> + /* Make sure the KHO radix tree is initialized. */
> + mem_map_phys = kho_get_mem_map_phys(kho_get_fdt());
> + err = kho_radix_init_tree(&kho_in.radix_tree, phys_to_virt(mem_map_phys));
[Severity: High]
Does this code miss checking the return value of kho_get_mem_map_phys()?
kho_get_mem_map_phys() returns 0 on failure (e.g., missing FDT property),
and this return value is passed directly to phys_to_virt() without validation.
If the map is missing, the code will initialize the KHO radix tree using
physical address 0x0 as the root, leading to arbitrary memory being parsed
as tree nodes, which can cause memory corruption or a kernel panic during
early boot.
> + if (err)
> + goto print;
> +
> + err = kho_radix_init_tree(&radix, NULL);
> + if (err)
> + goto print;
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260528004204.1484584-1-jloeser@linux.microsoft.com?part=10
* [RFC PATCH 11/20] kho: return virtual address of mem_map
2026-05-28 0:41 [RFC PATCH 00/20] mshv: enable kexec with Hyper-V donated pages and partitions Jork Loeser
` (9 preceding siblings ...)
2026-05-28 0:41 ` [RFC PATCH 10/20] kho: extended scratch Jork Loeser
@ 2026-05-28 0:41 ` Jork Loeser
2026-05-28 1:27 ` sashiko-bot
2026-05-28 0:41 ` [RFC PATCH 12/20] mm/hugetlb: make bootmem allocation work with KHO Jork Loeser
` (8 subsequent siblings)
19 siblings, 1 reply; 37+ messages in thread
From: Jork Loeser @ 2026-05-28 0:41 UTC (permalink / raw)
To: linux-hyperv, linux-mm, kexec
Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Mike Rapoport, Pasha Tatashin, Pratyush Yadav, Alexander Graf,
Jason Miu, Andrew Morton, David Hildenbrand, Muchun Song,
Oscar Salvador, Baoquan He, Catalin Marinas, Will Deacon,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Kees Cook, Ran Xiaokai, Justinien Bouron,
Sourabh Jain, Pingfan Liu, Rafael J. Wysocki, Mario Limonciello,
linux-arm-kernel, x86, linux-kernel, Michael Kelley, Jork Loeser
From: "Pratyush Yadav (Google)" <pratyush@kernel.org>
There are currently 3 callers of kho_get_mem_map_phys(). Two of them,
kho_mem_retrieve() and kho_extend_scratch(), need the virtual address.
The third, kho_populate(), only checks that the map is present. Make
things simpler by directly returning the virtual address. Rename
kho_get_mem_map_phys() to kho_get_mem_map() to accurately reflect what it
returns.
Signed-off-by: Pratyush Yadav (Google) <pratyush@kernel.org>
Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com>
---
kernel/liveupdate/kexec_handover.c | 28 +++++++++++++++-------------
1 file changed, 15 insertions(+), 13 deletions(-)
diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c
index a006a883ee94..797ec285b698 100644
--- a/kernel/liveupdate/kexec_handover.c
+++ b/kernel/liveupdate/kexec_handover.c
@@ -610,10 +610,11 @@ static int __init kho_preserved_memory_reserve(unsigned long key, void *data)
return 0;
}
-/* Returns physical address of the preserved memory map from FDT */
-static phys_addr_t __init kho_get_mem_map_phys(const void *fdt)
+/* Returns virtual address of the preserved memory map from FDT */
+static __init void *kho_get_mem_map(const void *fdt)
{
const void *mem_ptr;
+ phys_addr_t mem_map_phys;
int len;
mem_ptr = fdt_getprop(fdt, 0, KHO_FDT_MEMORY_MAP_PROP_NAME, &len);
@@ -622,7 +623,11 @@ static phys_addr_t __init kho_get_mem_map_phys(const void *fdt)
return 0;
}
- return get_unaligned((const u64 *)mem_ptr);
+ mem_map_phys = get_unaligned((const u64 *)mem_ptr);
+ if (!mem_map_phys)
+ return NULL;
+
+ return phys_to_virt(mem_map_phys);
}
/*
@@ -917,15 +922,15 @@ void __init kho_extend_scratch(void)
.key = kho_ext_mark_scratch,
};
struct kho_radix_tree radix;
- phys_addr_t prev_end = 0, mem_map_phys;
+ phys_addr_t prev_end = 0;
int err = 0;
if (!is_kho_boot())
return;
/* Make sure the KHO radix tree is initialized. */
- mem_map_phys = kho_get_mem_map_phys(kho_get_fdt());
- err = kho_radix_init_tree(&kho_in.radix_tree, phys_to_virt(mem_map_phys));
+ err = kho_radix_init_tree(&kho_in.radix_tree,
+ kho_get_mem_map(kho_get_fdt()));
if (err)
goto print;
@@ -1609,11 +1614,9 @@ static int __init kho_mem_retrieve(const void *fdt)
const struct kho_radix_walk_cb cb = {
.key = kho_preserved_memory_reserve,
};
- phys_addr_t mem_map_phys;
int err;
- mem_map_phys = kho_get_mem_map_phys(fdt);
- err = kho_radix_init_tree(&kho_in.radix_tree, phys_to_virt(mem_map_phys));
+ err = kho_radix_init_tree(&kho_in.radix_tree, kho_get_mem_map(fdt));
if (err)
return err;
@@ -1838,8 +1841,7 @@ void __init kho_populate(phys_addr_t fdt_phys, u64 fdt_len,
{
unsigned int scratch_cnt = scratch_len / sizeof(*kho_scratch);
struct kho_scratch *scratch = NULL;
- phys_addr_t mem_map_phys;
- void *fdt = NULL;
+ void *fdt = NULL, *mem_map;
bool populated = false;
int err;
@@ -1862,8 +1864,8 @@ void __init kho_populate(phys_addr_t fdt_phys, u64 fdt_len,
goto unmap_fdt;
}
- mem_map_phys = kho_get_mem_map_phys(fdt);
- if (!mem_map_phys)
+ mem_map = kho_get_mem_map(fdt);
+ if (!mem_map)
goto unmap_fdt;
scratch = early_memremap(scratch_phys, scratch_len);
--
2.43.0
* Re: [RFC PATCH 11/20] kho: return virtual address of mem_map
2026-05-28 0:41 ` [RFC PATCH 11/20] kho: return virtual address of mem_map Jork Loeser
@ 2026-05-28 1:27 ` sashiko-bot
0 siblings, 0 replies; 37+ messages in thread
From: sashiko-bot @ 2026-05-28 1:27 UTC (permalink / raw)
To: Jork Loeser; +Cc: linux-hyperv
Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
- [Critical] The local variable `radix` in `kho_extend_scratch` is uninitialized, leading to skipped initialization and a subsequent kernel panic when walking the radix tree.
- [Low] kho_get_mem_map returns an integer 0 instead of a NULL pointer on an error path.
--
commit 80928d5942267d8e8edcbc91a1c5643fdd5aecee
Author: Pratyush Yadav (Google) <pratyush@kernel.org>
kho: return virtual address of mem_map
Changes kho_get_mem_map_phys() to kho_get_mem_map() and makes it return a
virtual address instead of a physical address to simplify its callers.
> diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c
> index a006a883ee94a..797ec285b698c 100644
> --- a/kernel/liveupdate/kexec_handover.c
> +++ b/kernel/liveupdate/kexec_handover.c
> @@ -610,10 +610,11 @@ static int __init kho_preserved_memory_reserve(unsigned long key, void *data)
> return 0;
> }
>
> -/* Returns physical address of the preserved memory map from FDT */
> -static phys_addr_t __init kho_get_mem_map_phys(const void *fdt)
> +/* Returns virtual address of the preserved memory map from FDT */
> +static __init void *kho_get_mem_map(const void *fdt)
> {
> const void *mem_ptr;
> + phys_addr_t mem_map_phys;
> int len;
>
> mem_ptr = fdt_getprop(fdt, 0, KHO_FDT_MEMORY_MAP_PROP_NAME, &len);
> @@ -622,7 +623,11 @@ static phys_addr_t __init kho_get_mem_map_phys(const void *fdt)
> return 0;
> }
[Severity: Low]
Since the return type has been changed to a pointer, should this return NULL
instead of the integer 0 to avoid a sparse warning?
> - return get_unaligned((const u64 *)mem_ptr);
> + mem_map_phys = get_unaligned((const u64 *)mem_ptr);
> + if (!mem_map_phys)
> + return NULL;
> +
> + return phys_to_virt(mem_map_phys);
> }
>
[ ... ]
> @@ -917,15 +922,15 @@ void __init kho_extend_scratch(void)
> .key = kho_ext_mark_scratch,
> };
> struct kho_radix_tree radix;
> - phys_addr_t prev_end = 0, mem_map_phys;
> + phys_addr_t prev_end = 0;
> int err = 0;
[Severity: Critical]
Because radix is declared on the stack without being zeroed, it is
uninitialized. When kho_radix_init_tree() is called later, it checks if
tree->root is already set:
kernel/liveupdate/kexec_handover.c:kho_radix_init_tree() {
...
if (tree->root)
return 0;
...
}
Since radix.root contains uninitialized stack garbage, it is highly likely
to be non-zero. Could this cause kho_radix_init_tree() to return without
initializing the lock and setting up the root node, leading to a panic when
the tree is walked?
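To illustrate the concern, here is a small user-space model. struct tree_model, init_tree_model() and init_sets_up_lock() are hypothetical simplifications (the real struct holds a mutex, modeled here as a flag), and stale stack contents are simulated with a known non-NULL pointer rather than by reading uninitialized memory.

```c
#include <stddef.h>
#include <stdlib.h>

/* Simplified model of the tree; the real struct also holds a mutex. */
struct tree_model {
	void *root;
	int lock_ready;
};

/* Any non-NULL address works as a stand-in for stale stack contents. */
static char stale_marker;

/* Mirrors the early return quoted above: a non-NULL root skips setup. */
static int init_tree_model(struct tree_model *tree)
{
	if (tree->root)
		return 0;

	tree->root = calloc(1, 64);
	if (!tree->root)
		return -1;

	tree->lock_ready = 1;
	return 0;
}

/* Returns 1 if init really set up the lock for the given starting root. */
static int init_sets_up_lock(void *initial_root)
{
	struct tree_model tree = { .root = initial_root, .lock_ready = 0 };
	int ready;

	if (init_tree_model(&tree))
		return 0;

	ready = tree.lock_ready;
	if (tree.root != initial_root)
		free(tree.root);
	return ready;
}
```

In the model, a zeroed struct gets its lock set up while one with a stale root does not, which suggests declaring the local as `struct kho_radix_tree radix = {};` or zeroing it before the init call.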
>
> if (!is_kho_boot())
> return;
>
> /* Make sure the KHO radix tree is initialized. */
> - mem_map_phys = kho_get_mem_map_phys(kho_get_fdt());
> - err = kho_radix_init_tree(&kho_in.radix_tree, phys_to_virt(mem_map_phys));
> + err = kho_radix_init_tree(&kho_in.radix_tree,
> + kho_get_mem_map(kho_get_fdt()));
> if (err)
> goto print;
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260528004204.1484584-1-jloeser@linux.microsoft.com?part=11
* [RFC PATCH 12/20] mm/hugetlb: make bootmem allocation work with KHO
2026-05-28 0:41 [RFC PATCH 00/20] mshv: enable kexec with Hyper-V donated pages and partitions Jork Loeser
` (10 preceding siblings ...)
2026-05-28 0:41 ` [RFC PATCH 11/20] kho: return virtual address of mem_map Jork Loeser
@ 2026-05-28 0:41 ` Jork Loeser
2026-05-28 1:06 ` sashiko-bot
2026-05-28 0:41 ` [RFC PATCH 13/20] kho: add radix tree freeze and del_key() error reporting Jork Loeser
` (7 subsequent siblings)
19 siblings, 1 reply; 37+ messages in thread
From: Jork Loeser @ 2026-05-28 0:41 UTC (permalink / raw)
To: linux-hyperv, linux-mm, kexec
Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Mike Rapoport, Pasha Tatashin, Pratyush Yadav, Alexander Graf,
Jason Miu, Andrew Morton, David Hildenbrand, Muchun Song,
Oscar Salvador, Baoquan He, Catalin Marinas, Will Deacon,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Kees Cook, Ran Xiaokai, Justinien Bouron,
Sourabh Jain, Pingfan Liu, Rafael J. Wysocki, Mario Limonciello,
linux-arm-kernel, x86, linux-kernel, Michael Kelley, Jork Loeser
From: "Pratyush Yadav (Google)" <pratyush@kernel.org>
Gigantic page allocation is somewhat broken currently when KHO is used.
First, it breaks KHO scratch size accounting. RSRV_KERN is used to
track how much memory is reserved for use by the kernel. Since
alloc_bootmem() calls the memblock_alloc*() APIs, the hugepages
allocated also get marked as RSRV_KERN.
Allocations marked RSRV_KERN are used by KHO to calculate how much
scratch space it should reserve to make sure the next kernel has enough
memory to boot when it is in scratch-only phase. Counting hugepages in
that blows up scratch size, and can lead to the scratch allocation
failing, making KHO unusable. This will show up when huge pages make up
more than 50% of system memory, which is a fairly common use case.
Secondly, while not supported right now, huge pages are user memory and
can be preserved via KHO. The scratch spaces should not have any
preserved memory. Allocating hugepages from scratch (on a KHO boot) can
lead to them being un-preservable.
Introduce memblock_alloc_nid_user(). This does two things: first, it
instructs __memblock_alloc_range_nid() not to use scratch areas to
fulfill the allocation. If KHO is in scratch-only mode, allocations will
only be made from extended scratch areas. Second, it removes RSRV_KERN
from the allocation to make sure it doesn't mess up scratch size
accounting.
To reduce duplication, introduce __memblock_alloc_range_nid() which does
exactly what memblock_alloc_range_nid() used to do, but takes the flags
from its caller. Then make memblock_alloc_range_nid() a wrapper to it.
This lets memblock_alloc_nid_user() re-use most of the logic without
causing churn to update all callers of memblock_alloc_range_nid() and
adding yet another argument to it.
Signed-off-by: Pratyush Yadav (Google) <pratyush@kernel.org>
Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com>
---
include/linux/memblock.h | 4 ++
mm/hugetlb.c | 19 ++----
mm/memblock.c | 138 ++++++++++++++++++++++++++++++---------
3 files changed, 116 insertions(+), 45 deletions(-)
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 4f535ca4947a..c7056cf3f0f2 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -160,6 +160,7 @@ int memblock_mark_nomap(phys_addr_t base, phys_addr_t size);
int memblock_clear_nomap(phys_addr_t base, phys_addr_t size);
int memblock_reserved_mark_noinit(phys_addr_t base, phys_addr_t size);
int memblock_reserved_mark_kern(phys_addr_t base, phys_addr_t size);
+int memblock_reserved_clear_kern(phys_addr_t base, phys_addr_t size);
int memblock_mark_kho_scratch(phys_addr_t base, phys_addr_t size);
int memblock_mark_kho_scratch_ext(phys_addr_t base, phys_addr_t size);
int memblock_clear_kho_scratch(phys_addr_t base, phys_addr_t size);
@@ -431,6 +432,9 @@ void *memblock_alloc_try_nid(phys_addr_t size, phys_addr_t align,
phys_addr_t min_addr, phys_addr_t max_addr,
int nid);
+void *memblock_alloc_nid_user(phys_addr_t size, phys_addr_t align, int nid,
+ bool exact_nid);
+
static __always_inline void *memblock_alloc(phys_addr_t size, phys_addr_t align)
{
return memblock_alloc_try_nid(size, align, MEMBLOCK_LOW_LIMIT,
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 571212b80835..46f2b1bd5abe 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3033,26 +3033,19 @@ static __init void *alloc_bootmem(struct hstate *h, int nid, bool node_exact)
if (hugetlb_early_cma(h))
m = hugetlb_cma_alloc_bootmem(h, &listnode, node_exact);
else {
- if (node_exact)
- m = memblock_alloc_exact_nid_raw(huge_page_size(h),
- huge_page_size(h), 0,
- MEMBLOCK_ALLOC_ACCESSIBLE, nid);
- else {
- m = memblock_alloc_try_nid_raw(huge_page_size(h),
- huge_page_size(h), 0,
- MEMBLOCK_ALLOC_ACCESSIBLE, nid);
+ m = memblock_alloc_nid_user(huge_page_size(h), huge_page_size(h),
+ nid, node_exact);
+ if (m) {
/*
* For pre-HVO to work correctly, pages need to be on
* the list for the node they were actually allocated
* from. That node may be different in the case of
- * fallback by memblock_alloc_try_nid_raw. So,
- * extract the actual node first.
+ * fallback by memblock_alloc_try_nid_raw. So, extract
+ * the actual node first.
*/
- if (m)
+ if (node_exact)
listnode = early_pfn_to_nid(PHYS_PFN(__pa(m)));
- }
- if (m) {
m->flags = 0;
m->cma = NULL;
}
diff --git a/mm/memblock.c b/mm/memblock.c
index 6f76a6bb96d6..8cd52d34ad6e 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -178,11 +178,21 @@ bool __init_memblock memblock_has_mirror(void)
return system_has_some_mirror;
}
-static enum memblock_flags __init_memblock choose_memblock_flags(void)
+static enum memblock_flags __init_memblock choose_memblock_flags(bool user)
{
/* skip non-scratch memory for kho early boot allocations */
- if (kho_scratch_only)
- return MEMBLOCK_KHO_SCRATCH | MEMBLOCK_KHO_SCRATCH_EXT;
+ if (kho_scratch_only) {
+ enum memblock_flags flags = MEMBLOCK_KHO_SCRATCH_EXT;
+
+ /*
+ * Scratch can only be used for kernel memory, since user memory
+ * might be preserved and thus can not be in scratch.
+ */
+ if (!user)
+ flags |= MEMBLOCK_KHO_SCRATCH;
+
+ return flags;
+ }
return system_has_some_mirror ? MEMBLOCK_MIRROR : MEMBLOCK_NONE;
}
@@ -346,7 +356,7 @@ static phys_addr_t __init_memblock memblock_find_in_range(phys_addr_t start,
phys_addr_t align)
{
phys_addr_t ret;
- enum memblock_flags flags = choose_memblock_flags();
+ enum memblock_flags flags = choose_memblock_flags(false);
again:
ret = memblock_find_in_range_node(size, align, start, end,
@@ -1175,6 +1185,20 @@ int __init_memblock memblock_reserved_mark_kern(phys_addr_t base, phys_addr_t si
MEMBLOCK_RSRV_KERN);
}
+/**
+ * memblock_reserved_clear_kern - Clear MEMBLOCK_RSRV_KERN flag for region
+ *
+ * @base: the base phys addr of the region
+ * @size: the size of the region
+ *
+ * Return: 0 on success, -errno on failure.
+ */
+int __init_memblock memblock_reserved_clear_kern(phys_addr_t base, phys_addr_t size)
+{
+ return memblock_setclr_flag(&memblock.reserved, base, size, 0,
+ MEMBLOCK_RSRV_KERN);
+}
+
/**
* memblock_mark_kho_scratch - Mark a memory region as MEMBLOCK_KHO_SCRATCH.
* @base: the base phys addr of the region
@@ -1534,37 +1558,11 @@ int __init_memblock memblock_set_node(phys_addr_t base, phys_addr_t size,
return 0;
}
-/**
- * memblock_alloc_range_nid - allocate boot memory block
- * @size: size of memory block to be allocated in bytes
- * @align: alignment of the region and block's size
- * @start: the lower bound of the memory region to allocate (phys address)
- * @end: the upper bound of the memory region to allocate (phys address)
- * @nid: nid of the free area to find, %NUMA_NO_NODE for any node
- * @exact_nid: control the allocation fall back to other nodes
- *
- * The allocation is performed from memory region limited by
- * memblock.current_limit if @end == %MEMBLOCK_ALLOC_ACCESSIBLE.
- *
- * If the specified node can not hold the requested memory and @exact_nid
- * is false, the allocation falls back to any node in the system.
- *
- * For systems with memory mirroring, the allocation is attempted first
- * from the regions with mirroring enabled and then retried from any
- * memory region.
- *
- * In addition, function using kmemleak_alloc_phys for allocated boot
- * memory block, it is never reported as leaks.
- *
- * Return:
- * Physical address of allocated memory block on success, %0 on failure.
- */
-phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size,
+static phys_addr_t __init __memblock_alloc_range_nid(phys_addr_t size,
phys_addr_t align, phys_addr_t start,
phys_addr_t end, int nid,
- bool exact_nid)
+ bool exact_nid, enum memblock_flags flags)
{
- enum memblock_flags flags = choose_memblock_flags();
phys_addr_t found;
/*
@@ -1633,6 +1631,41 @@ phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size,
return found;
}
+/**
+ * memblock_alloc_range_nid - allocate boot memory block
+ * @size: size of memory block to be allocated in bytes
+ * @align: alignment of the region and block's size
+ * @start: the lower bound of the memory region to allocate (phys address)
+ * @end: the upper bound of the memory region to allocate (phys address)
+ * @nid: nid of the free area to find, %NUMA_NO_NODE for any node
+ * @exact_nid: control the allocation fall back to other nodes
+ *
+ * The allocation is performed from memory region limited by
+ * memblock.current_limit if @end == %MEMBLOCK_ALLOC_ACCESSIBLE.
+ *
+ * If the specified node can not hold the requested memory and @exact_nid
+ * is false, the allocation falls back to any node in the system.
+ *
+ * For systems with memory mirroring, the allocation is attempted first
+ * from the regions with mirroring enabled and then retried from any
+ * memory region.
+ *
+ * In addition, function using kmemleak_alloc_phys for allocated boot
+ * memory block, it is never reported as leaks.
+ *
+ * Return:
+ * Physical address of allocated memory block on success, %0 on failure.
+ */
+phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size,
+ phys_addr_t align, phys_addr_t start,
+ phys_addr_t end, int nid,
+ bool exact_nid)
+{
+ enum memblock_flags flags = choose_memblock_flags(false);
+
+ return __memblock_alloc_range_nid(size, align, start, end, nid, exact_nid, flags);
+}
+
/**
* memblock_phys_alloc_range - allocate a memory block inside specified range
* @size: size of memory block to be allocated in bytes
@@ -1784,6 +1817,47 @@ void * __init memblock_alloc_try_nid_raw(
false);
}
+/**
+ * memblock_alloc_nid_user - allocate boot memory for use by userspace
+ * @size: size of the memory block to be allocated in bytes
+ * @align: alignment of the region and block's size
+ * @exact_nid: control the allocation fall back to other nodes
+ *
+ * Public function, provides additional debug information (including caller
+ * info), if enabled. Does not zero allocated memory, does not panic if request
+ * cannot be satisfied.
+ *
+ * If the specified node can not hold the requested memory and @exact_nid is
+ * false, the allocation falls back to any node in the system. The allocated
+ * memory has no restrictions on minimum or maximum address, and does not count
+ * towards %MEMBLOCK_RSRV_KERN.
+ *
+ * Return:
+ * Virtual address of allocated memory block on success, %NULL on failure.
+ */
+void * __init memblock_alloc_nid_user(phys_addr_t size, phys_addr_t align,
+ int nid, bool exact_nid)
+{
+ enum memblock_flags flags = choose_memblock_flags(true);
+ phys_addr_t alloc;
+
+ memblock_dbg("%s: %llu bytes align=0x%llx nid=%d %pS\n",
+ __func__, (u64)size, (u64)align, nid, (void *)_RET_IP_);
+
+ alloc = __memblock_alloc_range_nid(size, align, 0, MEMBLOCK_ALLOC_ACCESSIBLE,
+ nid, exact_nid, flags);
+ if (!alloc)
+ return NULL;
+
+ /* User memory should not be marked with RSRV_KERN. */
+ if (memblock_reserved_clear_kern(alloc, size)) {
+ memblock_phys_free(alloc, size);
+ return NULL;
+ }
+
+ return phys_to_virt(alloc);
+}
+
/**
* memblock_alloc_try_nid - allocate boot memory block
* @size: size of memory block to be allocated in bytes
--
2.43.0
* Re: [RFC PATCH 12/20] mm/hugetlb: make bootmem allocation work with KHO
2026-05-28 0:41 ` [RFC PATCH 12/20] mm/hugetlb: make bootmem allocation work with KHO Jork Loeser
@ 2026-05-28 1:06 ` sashiko-bot
0 siblings, 0 replies; 37+ messages in thread
From: sashiko-bot @ 2026-05-28 1:06 UTC (permalink / raw)
To: Jork Loeser; +Cc: linux-hyperv
Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
- [High] Incorrect NUMA node extraction for huge page allocation fallback due to inverted `node_exact` condition, breaking HugeTLB Vmemmap Optimization (HVO).
- [Medium] Silent memory leak in `memblock_alloc_nid_user` error path due to inherently failing `memblock_phys_free`.
--
commit 8f512f66429ea3e879e3104adb050f66df1c4fb3
Author: Pratyush Yadav (Google) <pratyush@kernel.org>
mm/hugetlb: make bootmem allocation work with KHO
This patch addresses KHO scratch size accounting issues by introducing
memblock_alloc_nid_user(). It ensures that huge pages are not allocated
from scratch areas and removes the MEMBLOCK_RSRV_KERN flag from these
allocations, preventing them from artificially inflating scratch space
requirements.
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 571212b80835e..46f2b1bd5abe8 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3033,26 +3033,19 @@ static __init void *alloc_bootmem(struct hstate *h, int nid, bool node_exact)
[ ... ]
> - if (node_exact)
> - m = memblock_alloc_exact_nid_raw(huge_page_size(h),
> - huge_page_size(h), 0,
> - MEMBLOCK_ALLOC_ACCESSIBLE, nid);
> - else {
> - m = memblock_alloc_try_nid_raw(huge_page_size(h),
> - huge_page_size(h), 0,
> - MEMBLOCK_ALLOC_ACCESSIBLE, nid);
> + m = memblock_alloc_nid_user(huge_page_size(h), huge_page_size(h),
> + nid, node_exact);
> + if (m) {
> /*
> * For pre-HVO to work correctly, pages need to be on
> * the list for the node they were actually allocated
> * from. That node may be different in the case of
> - * fallback by memblock_alloc_try_nid_raw. So,
> - * extract the actual node first.
> + * fallback by memblock_alloc_try_nid_raw. So, extract
> + * the actual node first.
> */
> - if (m)
> + if (node_exact)
> listnode = early_pfn_to_nid(PHYS_PFN(__pa(m)));
[Severity: High]
Does this extract the node under the wrong condition?
The original code extracted the node only in the fallback path (!node_exact),
since exact allocations guarantee the requested node is used.
With this change, if node_exact is true, we redundantly extract the node,
but if node_exact is false (meaning a fallback could have happened), we skip
extracting the actual node.
Can this place fallback-allocated huge pages onto the wrong node's list,
breaking NUMA locality and pre-HVO assumptions?
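The difference between the two conditions can be shown with a tiny model. Both functions below are hypothetical; actual_nid stands in for what early_pfn_to_nid() would report for the allocated page.

```c
#include <stdbool.h>

/*
 * Hypothetical model of the list-node choice. Only a non-exact request
 * can fall back to another node, so that is the case that needs the
 * actual node looked up.
 */
static int pick_listnode(int requested_nid, int actual_nid, bool node_exact)
{
	int listnode = requested_nid;

	if (!node_exact)
		listnode = actual_nid;

	return listnode;
}

/* The condition as posted in the patch, kept for comparison. */
static int pick_listnode_as_posted(int requested_nid, int actual_nid,
				   bool node_exact)
{
	int listnode = requested_nid;

	if (node_exact)
		listnode = actual_nid;

	return listnode;
}
```

For a non-exact request on node 0 that was satisfied from node 1, the first variant selects node 1, while the posted condition keeps node 0.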
[ ... ]
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 6f76a6bb96d61..8cd52d34ad6e4 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
[ ... ]
> @@ -1784,6 +1817,47 @@ void * __init memblock_alloc_try_nid_raw(
> +void * __init memblock_alloc_nid_user(phys_addr_t size, phys_addr_t align,
> + int nid, bool exact_nid)
> +{
[ ... ]
> + alloc = __memblock_alloc_range_nid(size, align, 0, MEMBLOCK_ALLOC_ACCESSIBLE,
> + nid, exact_nid, flags);
> + if (!alloc)
> + return NULL;
> +
> + /* User memory should not be marked with RSRV_KERN. */
> + if (memblock_reserved_clear_kern(alloc, size)) {
> + memblock_phys_free(alloc, size);
> + return NULL;
> + }
[Severity: Medium]
Will this error path silently leak memory?
If memblock_reserved_clear_kern() fails, it is because
memblock_isolate_range() encountered an out-of-memory scenario when trying
to split a region and memblock_double_array() failed.
When we then call memblock_phys_free(), it eventually calls
memblock_remove_range(), which relies on the exact same
memblock_isolate_range() call to split the region before removing it.
Since the array is still full, this isolation will deterministically fail
again, and memblock_phys_free() will return without actually freeing the
memory, leaving it permanently reserved with MEMBLOCK_RSRV_KERN set.
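A deliberately simplified user-space model of that dependency follows. The names (region_array, isolate_model(), clear_flag_model(), free_model()) are hypothetical, the array never grows, and the model only checks whether isolation could succeed instead of performing the split.

```c
#include <errno.h>
#include <stddef.h>

#define MAX_REGIONS 2u

struct region {
	unsigned long base, size;
	unsigned int flags;
};

struct region_array {
	struct region r[MAX_REGIONS];
	size_t cnt;
};

/*
 * Model of isolating [base, base + size): every edge that does not line
 * up with the containing region needs one spare slot for the split.
 */
static int isolate_model(const struct region_array *a, unsigned long base,
			 unsigned long size)
{
	unsigned long end = base + size;

	for (size_t i = 0; i < a->cnt; i++) {
		const struct region *r = &a->r[i];
		unsigned long rend = r->base + r->size;
		size_t extra;

		if (base < r->base || end > rend)
			continue;

		extra = (size_t)(base > r->base) + (size_t)(end < rend);
		return a->cnt + extra > MAX_REGIONS ? -ENOMEM : 0;
	}
	return -ENOENT;
}

/* Both operations depend on the same isolation step succeeding first. */
static int clear_flag_model(const struct region_array *a, unsigned long base,
			    unsigned long size)
{
	return isolate_model(a, base, size);
}

static int free_model(const struct region_array *a, unsigned long base,
		      unsigned long size)
{
	return isolate_model(a, base, size);
}
```

In this model, once the flag clear fails for lack of a slot, the free fails for the same reason, so the rollback cannot be relied on to undo the reservation.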
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260528004204.1484584-1-jloeser@linux.microsoft.com?part=12
* [RFC PATCH 13/20] kho: add radix tree freeze and del_key() error reporting
2026-05-28 0:41 [RFC PATCH 00/20] mshv: enable kexec with Hyper-V donated pages and partitions Jork Loeser
` (11 preceding siblings ...)
2026-05-28 0:41 ` [RFC PATCH 12/20] mm/hugetlb: make bootmem allocation work with KHO Jork Loeser
@ 2026-05-28 0:41 ` Jork Loeser
2026-05-28 1:34 ` sashiko-bot
2026-05-28 0:41 ` [RFC PATCH 14/20] kho: Add crash-kernel-safe radix tree presence check Jork Loeser
` (6 subsequent siblings)
19 siblings, 1 reply; 37+ messages in thread
From: Jork Loeser @ 2026-05-28 0:41 UTC (permalink / raw)
To: linux-hyperv, linux-mm, kexec
Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Mike Rapoport, Pasha Tatashin, Pratyush Yadav, Alexander Graf,
Jason Miu, Andrew Morton, David Hildenbrand, Muchun Song,
Oscar Salvador, Baoquan He, Catalin Marinas, Will Deacon,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Kees Cook, Ran Xiaokai, Justinien Bouron,
Sourabh Jain, Pingfan Liu, Rafael J. Wysocki, Mario Limonciello,
linux-arm-kernel, x86, linux-kernel, Michael Kelley, Jork Loeser
Add kho_radix_tree_freeze() to prevent further modifications to a
KHO radix tree. After freezing, kho_radix_add_key() and
kho_radix_del_key() return -EBUSY. This is used by the MSHV page
preservation code to lock the tree before serializing it for kexec.
Also change kho_radix_del_key() from void to int so it can report
-EBUSY (frozen) and -ENOENT (key not present).
Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com>
---
include/linux/kho_radix_tree.h | 24 ++++++++++----
kernel/liveupdate/kexec_handover.c | 51 +++++++++++++++++++++++-------
2 files changed, 57 insertions(+), 18 deletions(-)
diff --git a/include/linux/kho_radix_tree.h b/include/linux/kho_radix_tree.h
index c0840ecb230c..4fe2238e1e30 100644
--- a/include/linux/kho_radix_tree.h
+++ b/include/linux/kho_radix_tree.h
@@ -21,10 +21,10 @@
* scheme. Each key is an unsigned long that combines a page's physical
* address and its order.
*
- * Client code is responsible for allocating the root node of the tree,
- * initializing the mutex lock, and managing its lifecycle. It must use the
- * tree data structures defined in the KHO ABI,
- * `include/linux/kho/abi/kexec_handover.h`.
+ * Client code must initialize the tree using kho_radix_tree_init(). Pass
+ * a physical address to restore a tree preserved across kexec, or 0 to
+ * allocate a fresh empty tree. The tree uses data structures defined in
+ * the KHO ABI, `include/linux/kho/abi/kexec_handover.h`.
*/
struct kho_radix_node;
@@ -32,6 +32,7 @@ struct kho_radix_node;
struct kho_radix_tree {
struct kho_radix_node *root;
struct mutex lock; /* protects the tree's structure and root pointer */
+ bool frozen;
};
/**
@@ -51,11 +52,12 @@ struct kho_radix_walk_cb {
#ifdef CONFIG_KEXEC_HANDOVER
int kho_radix_add_key(struct kho_radix_tree *tree, unsigned long key);
-void kho_radix_del_key(struct kho_radix_tree *tree, unsigned long key);
+int kho_radix_del_key(struct kho_radix_tree *tree, unsigned long key);
int kho_radix_walk_tree(struct kho_radix_tree *tree,
const struct kho_radix_walk_cb *cb, void *data);
int kho_radix_init_tree(struct kho_radix_tree *tree, struct kho_radix_node *root);
void kho_radix_destroy_tree(struct kho_radix_tree *tree);
+int kho_radix_tree_freeze(struct kho_radix_tree *tree);
#else /* #ifdef CONFIG_KEXEC_HANDOVER */
@@ -64,8 +66,11 @@ static inline int kho_radix_add_key(struct kho_radix_tree *tree, unsigned long k
return -EOPNOTSUPP;
}
-static inline void kho_radix_del_key(struct kho_radix_tree *tree,
- unsigned long key) { }
+static inline int kho_radix_del_key(struct kho_radix_tree *tree,
+ unsigned long key)
+{
+ return -EOPNOTSUPP;
+}
static inline int kho_radix_walk_tree(struct kho_radix_tree *tree,
const struct kho_radix_walk_cb *cb, void *data)
@@ -81,6 +86,11 @@ static inline int kho_radix_init_tree(struct kho_radix_tree *tree,
static inline void kho_radix_destroy_tree(struct kho_radix_tree *tree) { }
+static inline int kho_radix_tree_freeze(struct kho_radix_tree *tree)
+{
+ return -EOPNOTSUPP;
+}
+
#endif /* #ifdef CONFIG_KEXEC_HANDOVER */
#endif /* _LINUX_KHO_RADIX_TREE_H */
diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c
index 797ec285b698..2e2b4e73f00d 100644
--- a/kernel/liveupdate/kexec_handover.c
+++ b/kernel/liveupdate/kexec_handover.c
@@ -79,9 +79,6 @@ struct kho_out {
static struct kho_out kho_out = {
.lock = __MUTEX_INITIALIZER(kho_out.lock),
- .radix_tree = {
- .lock = __MUTEX_INITIALIZER(kho_out.radix_tree.lock),
- },
};
struct kho_in {
@@ -180,6 +177,28 @@ static void __ref kho_radix_free_node(struct kho_radix_node *node)
memblock_free(node, PAGE_SIZE);
}
+/**
+ * kho_radix_tree_freeze - Freeze the tree, preventing further modifications.
+ * @tree: The KHO radix tree to freeze.
+ *
+ * After freezing, kho_radix_add_key() and kho_radix_del_key() will return
+ * -EBUSY. The check is performed under the tree's mutex, so there is no
+ * race between a concurrent add/del and the freeze.
+ *
+ * Return: 0 on success, -EBUSY if the tree is already frozen.
+ */
+int kho_radix_tree_freeze(struct kho_radix_tree *tree)
+{
+ guard(mutex)(&tree->lock);
+
+ if (tree->frozen)
+ return -EBUSY;
+
+ tree->frozen = true;
+ return 0;
+}
+EXPORT_SYMBOL_GPL(kho_radix_tree_freeze);
+
/**
* kho_radix_add_key - Add a key to the radix tree.
* @tree: The KHO radix tree.
@@ -210,6 +229,9 @@ int kho_radix_add_key(struct kho_radix_tree *tree, unsigned long key)
guard(mutex)(&tree->lock);
+ if (tree->frozen)
+ return -EBUSY;
+
/* Go from high levels to low levels */
for (i = KHO_TREE_MAX_DEPTH - 1; i > 0; i--) {
idx = kho_radix_get_table_index(key, i);
@@ -268,20 +290,26 @@ EXPORT_SYMBOL_GPL(kho_radix_add_key);
* This function traverses the radix tree and clears the bit corresponding to
* the key, effectively removing it from the tree. It does not free the tree's
* intermediate nodes, even if they become empty.
+ *
+ * Return: 0 on success, -EINVAL if the tree is uninitialized, -EBUSY if
+ * frozen, -ENOENT if the key was not present.
*/
-void kho_radix_del_key(struct kho_radix_tree *tree, unsigned long key)
+int kho_radix_del_key(struct kho_radix_tree *tree, unsigned long key)
{
struct kho_radix_node *node = tree->root;
struct kho_radix_leaf *leaf;
unsigned int i, idx;
if (WARN_ON_ONCE(!tree->root))
- return;
+ return -EINVAL;
might_sleep();
guard(mutex)(&tree->lock);
+ if (WARN_ON_ONCE(tree->frozen))
+ return -EBUSY;
+
/* Go from high levels to low levels */
for (i = KHO_TREE_MAX_DEPTH - 1; i > 0; i--) {
idx = kho_radix_get_table_index(key, i);
@@ -291,7 +319,7 @@ void kho_radix_del_key(struct kho_radix_tree *tree, unsigned long key)
* return with a warning.
*/
if (WARN_ON(!node->table[idx]))
- return;
+ return -ENOENT;
node = phys_to_virt(node->table[idx]);
}
@@ -300,6 +328,8 @@ void kho_radix_del_key(struct kho_radix_tree *tree, unsigned long key)
leaf = (struct kho_radix_leaf *)node;
idx = kho_radix_get_bitmap_index(key);
__clear_bit(idx, leaf->bitmap);
+
+ return 0;
}
EXPORT_SYMBOL_GPL(kho_radix_del_key);
@@ -346,6 +376,7 @@ int kho_radix_init_tree(struct kho_radix_tree *tree, struct kho_radix_node *root
tree->root = root;
mutex_init(&tree->lock);
+ tree->frozen = false;
return 0;
}
EXPORT_SYMBOL_GPL(kho_radix_init_tree);
@@ -1746,11 +1777,9 @@ static __init int kho_init(void)
if (!kho_enable)
return 0;
- tree->root = kzalloc(PAGE_SIZE, GFP_KERNEL);
- if (!tree->root) {
- err = -ENOMEM;
+ err = kho_radix_init_tree(tree, NULL);
+ if (err)
goto err_free_scratch;
- }
kho_out.fdt = kho_alloc_preserve(PAGE_SIZE);
if (IS_ERR(kho_out.fdt)) {
@@ -1807,7 +1836,7 @@ static __init int kho_init(void)
err_free_fdt:
kho_unpreserve_free(kho_out.fdt);
err_free_kho_radix_tree_root:
- kfree(tree->root);
+ free_page((unsigned long)tree->root);
tree->root = NULL;
err_free_scratch:
kho_out.fdt = NULL;
--
2.43.0
* Re: [RFC PATCH 13/20] kho: add radix tree freeze and del_key() error reporting
2026-05-28 0:41 ` [RFC PATCH 13/20] kho: add radix tree freeze and del_key() error reporting Jork Loeser
@ 2026-05-28 1:34 ` sashiko-bot
0 siblings, 0 replies; 37+ messages in thread
From: sashiko-bot @ 2026-05-28 1:34 UTC (permalink / raw)
To: Jork Loeser; +Cc: linux-hyperv
Thank you for your contribution! Sashiko AI review found 4 potential issue(s) to consider:
- [High] `kho_radix_tree_freeze()` locks the tree mutex without verifying if the tree has been initialized.
- [Medium] `kho_radix_del_key()` does not accurately report `-ENOENT` if the key is missing but its leaf node exists.
- [Low] The docstring for `kho_radix_tree` instructs developers to pass a physical address to initialize the tree, which will lead to a crash.
- [Low] The error path in `kho_init()` frees the radix tree root using `free_page()`, breaking the allocator abstraction.
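For the [Medium] item, a user-space sketch of a delete that reports a missing key even when the leaf exists. struct leaf_model, leaf_set() and leaf_del() are hypothetical, and the 512-bit leaf size is an arbitrary choice for the example; in the kernel a non-atomic test-and-clear helper such as __test_and_clear_bit() could play the same role.

```c
#include <errno.h>
#include <limits.h>

#define LEAF_BITS	512
#define BITS_PER_WORD	(sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical leaf: a plain bitmap, one bit per key slot. */
struct leaf_model {
	unsigned long bitmap[LEAF_BITS / (sizeof(unsigned long) * CHAR_BIT)];
};

static void leaf_set(struct leaf_model *leaf, unsigned int idx)
{
	leaf->bitmap[idx / BITS_PER_WORD] |= 1UL << (idx % BITS_PER_WORD);
}

/* Clears the bit, but only after checking that it was actually set. */
static int leaf_del(struct leaf_model *leaf, unsigned int idx)
{
	unsigned long mask = 1UL << (idx % BITS_PER_WORD);
	unsigned long *word = &leaf->bitmap[idx / BITS_PER_WORD];

	if (!(*word & mask))
		return -ENOENT;

	*word &= ~mask;
	return 0;
}
```

Deleting the same key twice then yields 0 followed by -ENOENT, which matches the return values the kernel-doc in the patch promises.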
--
commit da7ab3d5519dc40884c4213f1650e1238149beb0
Author: Jork Loeser <jloeser@linux.microsoft.com>
kho: add radix tree freeze and del_key() error reporting
This commit adds kho_radix_tree_freeze() to prevent modifications to a KHO
radix tree. It also updates kho_radix_del_key() to return an integer so it
can report errors like -EBUSY and -ENOENT.
> diff --git a/include/linux/kho_radix_tree.h b/include/linux/kho_radix_tree.h
> index c0840ecb230c8..4fe2238e1e30a 100644
> --- a/include/linux/kho_radix_tree.h
> +++ b/include/linux/kho_radix_tree.h
> @@ -21,10 +21,10 @@
> * scheme. Each key is an unsigned long that combines a page's physical
> * address and its order.
> *
> - * Client code is responsible for allocating the root node of the tree,
> - * initializing the mutex lock, and managing its lifecycle. It must use the
> - * tree data structures defined in the KHO ABI,
> - * `include/linux/kho/abi/kexec_handover.h`.
> + * Client code must initialize the tree using kho_radix_tree_init(). Pass
> + * a physical address to restore a tree preserved across kexec, or 0 to
[Severity: Low]
Is this documentation accurate? The tree traversal functions treat
tree->root as a virtual address pointer. If a caller passes a raw physical
address without using phys_to_virt(), would it cause a page fault upon the
first dereference?
> + * allocate a fresh empty tree. The tree uses data structures defined in
> + * the KHO ABI, `include/linux/kho/abi/kexec_handover.h`.
> */
[ ... ]
> diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c
> index 797ec285b698c..2e2b4e73f00d3 100644
> --- a/kernel/liveupdate/kexec_handover.c
> +++ b/kernel/liveupdate/kexec_handover.c
> @@ -79,9 +79,6 @@ struct kho_out {
>
> static struct kho_out kho_out = {
> .lock = __MUTEX_INITIALIZER(kho_out.lock),
> - .radix_tree = {
> - .lock = __MUTEX_INITIALIZER(kho_out.radix_tree.lock),
> - },
> };
[ ... ]
> @@ -180,6 +177,28 @@ static void __ref kho_radix_free_node(struct kho_radix_node *node)
> memblock_free(node, PAGE_SIZE);
> }
>
> +/**
> + * kho_radix_tree_freeze - Freeze the tree, preventing further modifications.
> + * @tree: The KHO radix tree to freeze.
> + *
> + * After freezing, kho_radix_add_key() and kho_radix_del_key() will return
> + * -EBUSY. The check is performed under the tree's mutex, so there is no
> + * race between a concurrent add/del and the freeze.
> + *
> + * Return: 0 on success, -EBUSY if the tree is already frozen.
> + */
> +int kho_radix_tree_freeze(struct kho_radix_tree *tree)
> +{
> + guard(mutex)(&tree->lock);
[Severity: High]
Could this cause a crash if called on an uninitialized tree?
Because the static __MUTEX_INITIALIZER was removed from
kho_out.radix_tree.lock in this patch, the mutex is zero-initialized before
kho_radix_init_tree() is called.
Unlike kho_radix_add_key() and kho_radix_del_key(), which safely check
if (WARN_ON_ONCE(!tree->root)) before taking the mutex, this function
directly attempts to lock it.
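One possible shape for the fix, sketched as a userspace model rather than
kernel code: test tree->root before touching the lock, mirroring what the
add/del paths already do. The struct, the name model_tree_freeze(), and the
-EINVAL return value are assumptions made for illustration only; the real
function would take the tree mutex at the point the comment marks.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-in for struct kho_radix_tree (illustrative only). */
struct model_tree {
	void *root;	/* NULL until the tree has been initialized */
	bool frozen;
};

/* Freeze the tree, refusing to proceed if it was never initialized. */
int model_tree_freeze(struct model_tree *tree)
{
	/* Bail out before touching the lock when init has not run yet. */
	if (!tree->root)
		return -EINVAL;

	/* The kernel version would acquire the tree mutex here. */
	if (tree->frozen)
		return -EBUSY;

	tree->frozen = true;
	return 0;
}
```

Whether an uninitialized tree should warn, as the add/del paths do, or just
return an error quietly is a separate policy question.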
> +
> + if (tree->frozen)
> + return -EBUSY;
> +
> + tree->frozen = true;
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(kho_radix_tree_freeze);
[ ... ]
> @@ -300,6 +328,8 @@ void kho_radix_del_key(struct kho_radix_tree *tree, unsigned long key)
> leaf = (struct kho_radix_leaf *)node;
> idx = kho_radix_get_bitmap_index(key);
> __clear_bit(idx, leaf->bitmap);
> +
> + return 0;
> }
[Severity: Medium]
Will this correctly report -ENOENT if the key is missing but the leaf
node exists?
The commit message states kho_radix_del_key() was updated to report -ENOENT
when a key is not present. However, if other keys share the same leaf node
and the target key's bit is already 0, __clear_bit() does nothing and this
silently returns 0.
Should this use test_and_clear_bit() and return -ENOENT if the bit was not
previously set?
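A minimal userspace sketch of that idea, assuming a 512-bit leaf bitmap and
hypothetical model_* names (none of this is the kernel's code). The helper
clears the bit and reports whether it was set beforehand, so the delete path
can distinguish "removed" from "was never there". In the kernel, the
non-atomic __test_and_clear_bit() would presumably be enough, since the tree
mutex is held at this point.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdbool.h>

#define MODEL_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define MODEL_LEAF_BITS 512	/* assumed leaf size for the model */

struct model_leaf {
	unsigned long bitmap[MODEL_LEAF_BITS / (sizeof(unsigned long) * CHAR_BIT)];
};

/* Clear bit idx and report whether it was set before the call. */
bool model_test_and_clear_bit(unsigned int idx, unsigned long *bitmap)
{
	unsigned long mask = 1UL << (idx % MODEL_BITS_PER_LONG);
	unsigned long *word = &bitmap[idx / MODEL_BITS_PER_LONG];
	bool was_set = (*word & mask) != 0;

	*word &= ~mask;
	return was_set;
}

/* Delete a key from a leaf; returns -ENOENT if its bit was not set. */
int model_leaf_del(struct model_leaf *leaf, unsigned int idx)
{
	assert(idx < MODEL_LEAF_BITS);

	if (!model_test_and_clear_bit(idx, leaf->bitmap))
		return -ENOENT;
	return 0;
}
```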
> EXPORT_SYMBOL_GPL(kho_radix_del_key);
[ ... ]
> @@ -1807,7 +1836,7 @@ static __init int kho_init(void)
> err_free_fdt:
> kho_unpreserve_free(kho_out.fdt);
> err_free_kho_radix_tree_root:
> - kfree(tree->root);
> + free_page((unsigned long)tree->root);
[Severity: Low]
Is it safe to use free_page() directly here instead of calling
kho_radix_free_node()?
Since kho_radix_init_tree() allocates the node internally via
kho_radix_alloc_node(), calling free_page() directly bypasses the fallback
safety checks in kho_radix_free_node(), which handles memblock_free() when
the slab is unavailable.
> tree->root = NULL;
> err_free_scratch:
> kho_out.fdt = NULL;
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260528004204.1484584-1-jloeser@linux.microsoft.com?part=13
^ permalink raw reply [flat|nested] 37+ messages in thread
* [RFC PATCH 14/20] kho: Add crash-kernel-safe radix tree presence check
2026-05-28 0:41 [RFC PATCH 00/20] mshv: enable kexec with Hyper-V donated pages and partitions Jork Loeser
` (12 preceding siblings ...)
2026-05-28 0:41 ` [RFC PATCH 13/20] kho: add radix tree freeze and del_key() error reporting Jork Loeser
@ 2026-05-28 0:41 ` Jork Loeser
2026-05-28 1:27 ` sashiko-bot
2026-05-28 0:41 ` [RFC PATCH 15/20] mshv: Use page tracker to manage MSHV-owned pages and preserve with KHO Jork Loeser
` (5 subsequent siblings)
19 siblings, 1 reply; 37+ messages in thread
From: Jork Loeser @ 2026-05-28 0:41 UTC (permalink / raw)
To: linux-hyperv, linux-mm, kexec
Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Mike Rapoport, Pasha Tatashin, Pratyush Yadav, Alexander Graf,
Jason Miu, Andrew Morton, David Hildenbrand, Muchun Song,
Oscar Salvador, Baoquan He, Catalin Marinas, Will Deacon,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Kees Cook, Ran Xiaokai, Justinien Bouron,
Sourabh Jain, Pingfan Liu, Rafael J. Wysocki, Mario Limonciello,
linux-arm-kernel, x86, linux-kernel, Michael Kelley, Jork Loeser
In the crash kernel, the old kernel's memory is outside the direct
map. Add a read-only radix tree variant that memremaps nodes during
init so that subsequent page presence checks can traverse the tree
with plain pointer dereferencing.
This will be used by the MSHV driver to exclude hypervisor-owned pages
from /proc/vmcore via a pfn_is_ram() callback.
Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com>
---
include/linux/kho_radix_tree.h | 30 +++++++
kernel/liveupdate/kexec_handover.c | 124 +++++++++++++++++++++++++++++
2 files changed, 154 insertions(+)
diff --git a/include/linux/kho_radix_tree.h b/include/linux/kho_radix_tree.h
index 4fe2238e1e30..e906a874e612 100644
--- a/include/linux/kho_radix_tree.h
+++ b/include/linux/kho_radix_tree.h
@@ -49,6 +49,19 @@ struct kho_radix_walk_cb {
int (*table)(phys_addr_t phys, void *data);
};
+/**
+ * struct kho_radix_crash_tree - Read-only radix tree for crash kernel use.
+ * @root: pointer to the remapped root node
+ *
+ * In the crash kernel, the old kernel's memory is not in the direct map.
+ * This variant uses memremap() during init to map the tree nodes and
+ * converts the physical address table entries to virtual addresses in-place,
+ * enabling efficient pointer-based traversal without per-lookup remapping.
+ */
+struct kho_radix_crash_tree {
+ struct kho_radix_node *root;
+};
+
#ifdef CONFIG_KEXEC_HANDOVER
int kho_radix_add_key(struct kho_radix_tree *tree, unsigned long key);
@@ -59,6 +72,11 @@ int kho_radix_init_tree(struct kho_radix_tree *tree, struct kho_radix_node *root
void kho_radix_destroy_tree(struct kho_radix_tree *tree);
int kho_radix_tree_freeze(struct kho_radix_tree *tree);
+int kho_radix_crash_init(struct kho_radix_crash_tree *tree, phys_addr_t root_pa);
+
+bool kho_radix_crash_contains_page(struct kho_radix_crash_tree *tree,
+ unsigned long pfn, unsigned int order);
+
#else /* #ifdef CONFIG_KEXEC_HANDOVER */
static inline int kho_radix_add_key(struct kho_radix_tree *tree, unsigned long key)
@@ -91,6 +109,18 @@ static inline int kho_radix_tree_freeze(struct kho_radix_tree *tree)
return -EOPNOTSUPP;
}
+static inline int kho_radix_crash_init(struct kho_radix_crash_tree *tree,
+ phys_addr_t root_pa)
+{
+ return -EOPNOTSUPP;
+}
+
+static inline bool kho_radix_crash_contains_page(
+ struct kho_radix_crash_tree *tree,
+ unsigned long pfn, unsigned int order)
+{
+ return false;
+}
#endif /* #ifdef CONFIG_KEXEC_HANDOVER */
#endif /* _LINUX_KHO_RADIX_TREE_H */
diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c
index 2e2b4e73f00d..0dfdf0f9781e 100644
--- a/kernel/liveupdate/kexec_handover.c
+++ b/kernel/liveupdate/kexec_handover.c
@@ -15,6 +15,7 @@
#include <linux/kmemleak.h>
#include <linux/count_zeros.h>
#include <linux/kasan.h>
+#include <linux/io.h>
#include <linux/kexec.h>
#include <linux/kexec_handover.h>
#include <linux/kho_radix_tree.h>
@@ -396,6 +397,129 @@ void kho_radix_destroy_tree(struct kho_radix_tree *tree)
}
EXPORT_SYMBOL_GPL(kho_radix_destroy_tree);
+/*
+ * Convert a crash tree node's children from PA to VA in-place via memremap().
+ * On failure, already-remapped pages are not cleaned up -- the crash kernel
+ * is short-lived and will reboot after dump collection, so the leak is
+ * inconsequential.
+ */
+static int kho_radix_crash_convert_node(struct kho_radix_node *node,
+ unsigned int level)
+{
+ struct kho_radix_node *child;
+ unsigned int i;
+ int err;
+
+ for (i = 0; i < (1 << KHO_TABLE_SIZE_LOG2); i++) {
+ if (!node->table[i])
+ continue;
+
+ /* Validate: PA must have bit 63 clear and be page-aligned */
+ if ((node->table[i] & BIT_ULL(63)) ||
+ (node->table[i] & (PAGE_SIZE - 1)))
+ return -EINVAL;
+
+ child = memremap(node->table[i], PAGE_SIZE, MEMREMAP_WB);
+ if (!child)
+ return -ENOMEM;
+
+ /* Overwrite PA with VA in-place */
+ node->table[i] = (u64)(uintptr_t)child;
+
+ /* Recurse for intermediate levels; level 1 children are leaves */
+ if (level > 1) {
+ err = kho_radix_crash_convert_node(child, level - 1);
+ if (err)
+ return err;
+ }
+ }
+
+ return 0;
+}
+
+/**
+ * kho_radix_crash_init - Initialize a crash-kernel view of a KHO radix tree.
+ * @tree: The crash tree to initialize.
+ * @root_pa: Physical address of the radix tree root from the old kernel.
+ *
+ * Maps the old kernel's radix tree into the crash kernel's address space
+ * by memremapping each node and converting table entries from physical to
+ * virtual addresses in-place. After successful initialization, the tree
+ * can be traversed with kho_radix_crash_contains_page() using direct
+ * pointer dereferencing.
+ *
+ * This function is intended for use in the crash kernel where the old
+ * kernel's memory is not in the direct map. No locking is used as the
+ * crash kernel is effectively single-threaded during dump collection.
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+int kho_radix_crash_init(struct kho_radix_crash_tree *tree, phys_addr_t root_pa)
+{
+ struct kho_radix_node *root;
+ int err;
+
+ tree->root = NULL;
+
+ if (!root_pa || (root_pa & (PAGE_SIZE - 1)))
+ return -EINVAL;
+
+ root = memremap(root_pa, PAGE_SIZE, MEMREMAP_WB);
+ if (!root)
+ return -ENOMEM;
+
+ err = kho_radix_crash_convert_node(root, KHO_TREE_MAX_DEPTH - 1);
+ if (err)
+ return err;
+
+ tree->root = root;
+ return 0;
+}
+EXPORT_SYMBOL_GPL(kho_radix_crash_init);
+
+/**
+ * kho_radix_crash_contains_page - Check if a page is in a crash-kernel radix tree.
+ * @tree: The crash tree, previously initialized with kho_radix_crash_init().
+ * @pfn: The page frame number to check.
+ * @order: The order of the page.
+ *
+ * Traverses the radix tree using direct pointer dereferencing (the table
+ * entries were converted from PA to VA during init). No locking is used as the
+ * crash kernel is effectively single-threaded during dump collection.
+ *
+ * Note: This function checks specifically for the presence of the page at the
+ * given order. If a larger order page that encompasses this page is preserved,
+ * this function will return false.
+ *
+ * Return: true if the page is present in the tree, false otherwise.
+ */
+bool kho_radix_crash_contains_page(struct kho_radix_crash_tree *tree,
+ unsigned long pfn, unsigned int order)
+{
+ unsigned long key = kho_encode_radix_key(PFN_PHYS(pfn), order);
+ struct kho_radix_node *node = tree->root;
+ struct kho_radix_leaf *leaf;
+ unsigned int i, idx;
+
+ if (!tree->root)
+ return false;
+
+ /* Traverse using VA pointers stored in table[] */
+ for (i = KHO_TREE_MAX_DEPTH - 1; i > 0; i--) {
+ idx = kho_radix_get_table_index(key, i);
+
+ if (!node->table[idx])
+ return false;
+
+ node = (struct kho_radix_node *)(uintptr_t)node->table[idx];
+ }
+
+ leaf = (struct kho_radix_leaf *)node;
+ idx = kho_radix_get_bitmap_index(key);
+ return test_bit(idx, leaf->bitmap);
+}
+EXPORT_SYMBOL_GPL(kho_radix_crash_contains_page);
+
static int kho_radix_walk_leaf(struct kho_radix_leaf *leaf, unsigned long key,
const struct kho_radix_walk_cb *cb, void *data)
{
--
2.43.0
^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: [RFC PATCH 14/20] kho: Add crash-kernel-safe radix tree presence check
2026-05-28 0:41 ` [RFC PATCH 14/20] kho: Add crash-kernel-safe radix tree presence check Jork Loeser
@ 2026-05-28 1:27 ` sashiko-bot
0 siblings, 0 replies; 37+ messages in thread
From: sashiko-bot @ 2026-05-28 1:27 UTC (permalink / raw)
To: Jork Loeser; +Cc: linux-hyperv
Thank you for your contribution! Sashiko AI review found 5 potential issue(s) to consider:
- [High] In-place conversion destructively modifies the old kernel's memory, corrupting the crash dump.
- [High] Unbounded per-node `memremap()` calls can quickly exhaust the crash kernel's limited memory resources.
- [High] Lack of physical address validation before `memremap()` risks triggering Machine Check Exceptions (MCE).
- [Medium] `kho_radix_crash_contains_page()` fails to identify subpages of higher-order folios.
- [Medium] The exported API lacks a teardown function, resulting in permanent vmap memory leaks.
--
commit c1bba78846e7698e1b8a124f8432407d41f69b0c
Author: Jork Loeser <jloeser@linux.microsoft.com>
Subject: kho: Add crash-kernel-safe radix tree presence check
In the crash kernel, the old kernel's memory is outside the direct map.
This patch adds a read-only radix tree variant that memremaps nodes during
init so that subsequent page presence checks can traverse the tree with
plain pointer dereferencing.
> diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c
> index 2e2b4e73f00d3..0dfdf0f9781ef 100644
> --- a/kernel/liveupdate/kexec_handover.c
> +++ b/kernel/liveupdate/kexec_handover.c
[ ... ]
> @@ -396,6 +397,129 @@ void kho_radix_destroy_tree(struct kho_radix_tree *tree)
> }
> EXPORT_SYMBOL_GPL(kho_radix_destroy_tree);
>
> +/*
> + * Convert a crash tree node's children from PA to VA in-place via memremap().
> + * On failure, already-remapped pages are not cleaned up -- the crash kernel
> + * is short-lived and will reboot after dump collection, so the leak is
> + * inconsequential.
> + */
> +static int kho_radix_crash_convert_node(struct kho_radix_node *node,
> + unsigned int level)
> +{
> + struct kho_radix_node *child;
> + unsigned int i;
> + int err;
> +
> + for (i = 0; i < (1 << KHO_TABLE_SIZE_LOG2); i++) {
> + if (!node->table[i])
> + continue;
> +
> + /* Validate: PA must have bit 63 clear and be page-aligned */
> + if ((node->table[i] & BIT_ULL(63)) ||
> + (node->table[i] & (PAGE_SIZE - 1)))
> + return -EINVAL;
> +
> + child = memremap(node->table[i], PAGE_SIZE, MEMREMAP_WB);
[Severity: High]
Since this recursively calls memremap() for every populated node in the tree,
could this exhaust the crash kernel's limited memory resources? A fully
populated tree might create thousands of individual virtual mappings, consuming
significant memory for tracking structures like vm_struct and page tables.
Also, the untrusted physical addresses read from the crashed kernel's memory
are passed blindly to memremap(). If the old kernel crashed due to memory
corruption, could these pointers contain garbage values pointing to non-RAM
regions like MMIO, potentially triggering a Machine Check Exception (MCE) when
accessed?
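As an illustration of the kind of check meant here, the following userspace
sketch accepts a physical address only if it is non-zero, page-aligned, and
the whole page lies inside one known RAM range. The range table, the 4 KiB
page size, and the model_* names are assumptions; how the crash kernel would
obtain a trustworthy list of RAM ranges is left open.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE 4096ULL	/* assumed page size */

struct model_ram_range {
	uint64_t start;	/* inclusive */
	uint64_t end;	/* exclusive */
};

/* True if [pa, pa + page size) is page-aligned and inside one RAM range. */
bool model_pa_is_ram_page(uint64_t pa, const struct model_ram_range *ranges,
			  size_t nr)
{
	size_t i;

	if (pa == 0 || (pa & (MODEL_PAGE_SIZE - 1)))
		return false;

	for (i = 0; i < nr; i++) {
		if (pa >= ranges[i].start && pa < ranges[i].end &&
		    ranges[i].end - pa >= MODEL_PAGE_SIZE)
			return true;
	}
	return false;
}
```

A check of this kind would run on every table entry before it is handed to
memremap(), so a corrupted entry is rejected instead of being mapped.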
> + if (!child)
> + return -ENOMEM;
> +
> + /* Overwrite PA with VA in-place */
> + node->table[i] = (u64)(uintptr_t)child;
[Severity: High]
Because memremap() creates a direct mapping to the underlying physical memory,
does assigning the virtual address here destructively modify the old kernel's
physical memory? This seems like it would corrupt the crash dump, causing
offline analysis tools like crash or makedumpfile to encounter meaningless
virtual addresses instead of the original physical addresses.
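One alternative, sketched here as a single-level userspace model, is to leave
the old kernel's node untouched and record the mapped pointers in a private
shadow node allocated by the crash kernel. Everything named model_* is
hypothetical: model_map() merely stands in for memremap(), and the fan-out of
4 is chosen to keep the example small.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define MODEL_FANOUT 4	/* tiny fan-out for the model only */

/* Node as left behind by the old kernel: entries are physical addresses. */
struct model_pa_node {
	uint64_t table[MODEL_FANOUT];
};

/* Private shadow owned by the crash kernel: entries are mapped pointers. */
struct model_shadow_node {
	const void *child[MODEL_FANOUT];
};

/* Model "physical memory": PA (n << 12) is backed by model_ram[n]. */
static unsigned char model_ram[8][16];

/* Stand-in for memremap(); returns NULL for unmappable addresses. */
const void *model_map(uint64_t pa)
{
	uint64_t n = pa >> 12;

	if ((pa & 0xfff) || n >= 8)
		return NULL;
	return model_ram[n];
}

/* Build a shadow of one node; the source node is only ever read. */
struct model_shadow_node *model_shadow_build(const struct model_pa_node *node)
{
	struct model_shadow_node *shadow = calloc(1, sizeof(*shadow));
	unsigned int i;

	if (!shadow)
		return NULL;

	for (i = 0; i < MODEL_FANOUT; i++) {
		if (!node->table[i])
			continue;

		shadow->child[i] = model_map(node->table[i]);
		if (!shadow->child[i]) {
			free(shadow);
			return NULL;
		}
	}
	return shadow;
}
```

The cost is extra crash-kernel memory for the shadow nodes, so this trades
dump integrity against the memory concern raised above.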
> +
> + /* Recurse for intermediate levels; level 1 children are leaves */
> + if (level > 1) {
> + err = kho_radix_crash_convert_node(child, level - 1);
> + if (err)
> + return err;
> + }
> + }
> +
> + return 0;
> +}
> +
> +/**
> + * kho_radix_crash_init - Initialize a crash-kernel view of a KHO radix tree.
[ ... ]
> + */
> +int kho_radix_crash_init(struct kho_radix_crash_tree *tree, phys_addr_t root_pa)
> +{
> + struct kho_radix_node *root;
> + int err;
> +
> + tree->root = NULL;
> +
> + if (!root_pa || (root_pa & (PAGE_SIZE - 1)))
> + return -EINVAL;
> +
> + root = memremap(root_pa, PAGE_SIZE, MEMREMAP_WB);
> + if (!root)
> + return -ENOMEM;
> +
> + err = kho_radix_crash_convert_node(root, KHO_TREE_MAX_DEPTH - 1);
> + if (err)
> + return err;
> +
> + tree->root = root;
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(kho_radix_crash_init);
[Severity: Medium]
This API is exported to modules and allocates mappings via memremap(), but
there is no corresponding kho_radix_crash_destroy() provided. Even if the
crash kernel is short-lived, should kernel APIs exported to modules enforce
symmetrical resource lifecycle management to avoid permanent vmap memory
leaks if a module is unloaded or fails initialization?
> +
> +/**
> + * kho_radix_crash_contains_page - Check if a page is in a crash-kernel radix tree.
[ ... ]
> + * Note: This function checks specifically for the presence of the page at the
> + * given order. If a larger order page that encompasses this page is preserved,
> + * this function will return false.
> + *
> + * Return: true if the page is present in the tree, false otherwise.
> + */
> +bool kho_radix_crash_contains_page(struct kho_radix_crash_tree *tree,
> + unsigned long pfn, unsigned int order)
> +{
> + unsigned long key = kho_encode_radix_key(PFN_PHYS(pfn), order);
[Severity: Medium]
Since this searches for an exact match of the provided pfn and order, doesn't
this fail to identify subpages of higher-order folios? If a page was
preserved as part of a higher-order folio, an order-0 query for a PFN within
that folio will fail to find the key and incorrectly return false.
Would callers like pfn_is_ram() be forced to manually iterate through all
possible orders up to MAX_PAGE_ORDER to reliably determine if a page is
preserved?
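For reference, the caller-side loop in question might look roughly like the
sketch below. It assumes that a preserved block of a given order starts at a
PFN naturally aligned to that order, and it treats the exact-match lookup as
an opaque callback. model_lookup() is a hypothetical stub that models a tree
holding a single order-2 block, not the real
kho_radix_crash_contains_page().

```c
#include <assert.h>
#include <stdbool.h>

/* Exact-match lookup on (pfn, order); hypothetical stand-in for the real API. */
typedef bool (*model_lookup_fn)(unsigned long pfn, unsigned int order);

/* Stub tree holding one order-2 block at PFN 0x100 (PFNs 0x100..0x103). */
bool model_lookup(unsigned long pfn, unsigned int order)
{
	return pfn == 0x100 && order == 2;
}

/* True if pfn lies inside a preserved block of any order up to max_order. */
bool model_pfn_preserved(model_lookup_fn lookup, unsigned long pfn,
			 unsigned int max_order)
{
	unsigned int order;

	for (order = 0; order <= max_order; order++) {
		/* First PFN of the aligned block of this order containing pfn */
		unsigned long base = pfn & ~((1UL << order) - 1);

		if (lookup(base, order))
			return true;
	}
	return false;
}
```

That is one tree walk per order for every PFN queried, which may matter if
the check runs for each page read through /proc/vmcore.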
> + struct kho_radix_node *node = tree->root;
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260528004204.1484584-1-jloeser@linux.microsoft.com?part=14
^ permalink raw reply [flat|nested] 37+ messages in thread
* [RFC PATCH 15/20] mshv: Use page tracker to manage MSHV-owned pages and preserve with KHO
2026-05-28 0:41 [RFC PATCH 00/20] mshv: enable kexec with Hyper-V donated pages and partitions Jork Loeser
` (13 preceding siblings ...)
2026-05-28 0:41 ` [RFC PATCH 14/20] kho: Add crash-kernel-safe radix tree presence check Jork Loeser
@ 2026-05-28 0:41 ` Jork Loeser
2026-05-28 1:41 ` sashiko-bot
2026-05-28 0:41 ` [RFC PATCH 16/20] mshv: Add debugfs interface to page tracker Jork Loeser
` (4 subsequent siblings)
19 siblings, 1 reply; 37+ messages in thread
From: Jork Loeser @ 2026-05-28 0:41 UTC (permalink / raw)
To: linux-hyperv, linux-mm, kexec
Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Mike Rapoport, Pasha Tatashin, Pratyush Yadav, Alexander Graf,
Jason Miu, Andrew Morton, David Hildenbrand, Muchun Song,
Oscar Salvador, Baoquan He, Catalin Marinas, Will Deacon,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Kees Cook, Ran Xiaokai, Justinien Bouron,
Sourabh Jain, Pingfan Liu, Rafael J. Wysocki, Mario Limonciello,
linux-arm-kernel, x86, linux-kernel, Michael Kelley, Jork Loeser
The MSHV driver hands pages to the Microsoft Hypervisor (MSHV) for its
exclusive use. A kernel subsequently booted via kexec must not use these
pages, so they need to be registered with KHO.
- Adapt hv_call_deposit_pages() and hv_call_withdraw_memory() to
use the page tracker
- Use KHO to preserve MSHV-owned pages across kexec
Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com>
---
drivers/hv/Kconfig | 3 +
drivers/hv/Makefile | 2 +-
drivers/hv/hv_common.c | 3 +
drivers/hv/hv_proc.c | 32 ++-
drivers/hv/mshv_page_preserve.c | 374 ++++++++++++++++++++++++++++++++
drivers/hv/mshv_page_preserve.h | 15 ++
drivers/hv/mshv_root.h | 1 +
drivers/hv/mshv_root_hv_call.c | 12 +-
8 files changed, 434 insertions(+), 8 deletions(-)
create mode 100644 drivers/hv/mshv_page_preserve.c
create mode 100644 drivers/hv/mshv_page_preserve.h
diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
index 2d0b3fcb0ff8..0c4ffc1c701b 100644
--- a/drivers/hv/Kconfig
+++ b/drivers/hv/Kconfig
@@ -74,6 +74,9 @@ config MSHV_ROOT
# e.g. When withdrawing memory, the hypervisor gives back 4k pages in
# no particular order, making it impossible to reassemble larger pages
depends on PAGE_SIZE_4KB
+ # Pages deposited to the hypervisor must be tracked and preserved
+ # across kexec to avoid memory corruption.
+ depends on KEXEC_HANDOVER
select EVENTFD
select VIRT_XFER_TO_GUEST_WORK
select HMM_MIRROR
diff --git a/drivers/hv/Makefile b/drivers/hv/Makefile
index 888a748cc7cb..49526ae704f9 100644
--- a/drivers/hv/Makefile
+++ b/drivers/hv/Makefile
@@ -21,7 +21,7 @@ mshv_vtl-y := mshv_vtl_main.o
# Code that must be built-in
obj-$(CONFIG_HYPERV) += hv_common.o
-obj-$(subst m,y,$(CONFIG_MSHV_ROOT)) += hv_proc.o
+obj-$(subst m,y,$(CONFIG_MSHV_ROOT)) += hv_proc.o mshv_page_preserve.o
ifneq ($(CONFIG_MSHV_ROOT)$(CONFIG_MSHV_VTL),)
obj-y += mshv_common.o
endif
diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
index 6b67ac616789..8a593117e9b8 100644
--- a/drivers/hv/hv_common.c
+++ b/drivers/hv/hv_common.c
@@ -30,6 +30,7 @@
#include <linux/set_memory.h>
#include <hyperv/hvhdk.h>
#include <asm/mshyperv.h>
+#include "mshv_root.h"
u64 hv_current_partition_id = HV_PARTITION_ID_SELF;
EXPORT_SYMBOL_GPL(hv_current_partition_id);
@@ -382,6 +383,8 @@ int __init hv_common_init(void)
if (hv_parent_partition()) {
hv_synic_eventring_tail = alloc_percpu(u8 *);
BUG_ON(!hv_synic_eventring_tail);
+
+ mshv_preserve_init();
}
hv_vp_index = kmalloc_array(nr_cpu_ids, sizeof(*hv_vp_index),
diff --git a/drivers/hv/hv_proc.c b/drivers/hv/hv_proc.c
index 57b2c64197cb..0392ea1f3cc5 100644
--- a/drivers/hv/hv_proc.c
+++ b/drivers/hv/hv_proc.c
@@ -8,6 +8,7 @@
#include <linux/minmax.h>
#include <linux/export.h>
#include <asm/mshyperv.h>
+#include "mshv_root.h"
/*
* See struct hv_deposit_memory. The first u64 is partition ID, the rest
@@ -22,6 +23,7 @@ int hv_call_deposit_pages(int node, u64 partition_id, u32 num_pages)
int *counts;
int num_allocations;
int i, j, page_count;
+ int reg_i = 0, reg_j = 0;
int order;
u64 status;
int ret;
@@ -72,6 +74,18 @@ int hv_call_deposit_pages(int node, u64 partition_id, u32 num_pages)
}
num_allocations = i;
+ /* Register the pages for preservation across kexec */
+ for (i = 0; i < num_allocations; ++i) {
+ for (j = 0; j < counts[i]; ++j) {
+ ret = mshv_register_preserve_page(pages[i] + j);
+ if (ret) {
+ reg_i = i;
+ reg_j = j;
+ goto err_unregister;
+ }
+ }
+ }
+
local_irq_save(flags);
input_page = *this_cpu_ptr(hyperv_pcpu_input_arg);
@@ -90,19 +104,27 @@ int hv_call_deposit_pages(int node, u64 partition_id, u32 num_pages)
if (!hv_result_success(status)) {
hv_status_err(status, "\n");
ret = hv_result_to_errno(status);
- goto err_free_allocations;
+ reg_i = num_allocations;
+ goto err_unregister;
}
ret = 0;
goto free_buf;
-err_free_allocations:
+err_unregister:
for (i = 0; i < num_allocations; ++i) {
- base_pfn = page_to_pfn(pages[i]);
- for (j = 0; j < counts[i]; ++j)
- __free_page(pfn_to_page(base_pfn + j));
+ for (j = 0; j < counts[i]; ++j) {
+ if (i == reg_i && j == reg_j)
+ goto err_free_allocations;
+ mshv_unregister_preserve_page(pages[i] + j);
+ }
}
+err_free_allocations:
+ for (i = 0; i < num_allocations; ++i)
+ for (j = 0; j < counts[i]; ++j)
+ __free_page(pages[i] + j);
+
free_buf:
free_page((unsigned long)pages);
kfree(counts);
diff --git a/drivers/hv/mshv_page_preserve.c b/drivers/hv/mshv_page_preserve.c
new file mode 100644
index 000000000000..a79725a74663
--- /dev/null
+++ b/drivers/hv/mshv_page_preserve.c
@@ -0,0 +1,374 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Preserve pages owned by Microsoft Hypervisor
+ *
+ * When handing pages to MSHV and kexec'ing, the next kernel needs to know which
+ * pages not to touch. This file handles that preservation.
+ *
+ * Copyright (C) 2026 Microsoft Corporation, Jork Loeser <jloeser@microsoft.com>
+ */
+
+#define pr_fmt(fmt) "mshv: " fmt
+
+#include <asm/mshyperv.h>
+#include <linux/kexec.h>
+#include <linux/kexec_handover.h>
+#include <linux/kho_radix_tree.h>
+#include <linux/libfdt.h>
+#include <linux/reboot.h>
+#include "mshv_page_preserve.h"
+
+#define FDT_SUBTREE_MSHV "mshv_prsv_pt"
+#define MSHV_KHO_COMPAT_STR "mshv_kho-v1"
+
+static void *fdt_page;
+static struct kho_radix_tree preserved_pages_tree;
+
+/**
+ * mshv_register_preserve_page() - Register a page to be preserved by KHO
+ * @pg: pointer to the page to preserve
+ *
+ * Registers a single page to be preserved by KHO across kexec.
+ *
+ * Return: 0 on success, -errno on failure.
+ */
+int mshv_register_preserve_page(struct page *pg)
+{
+ return kho_radix_add_key(&preserved_pages_tree, page_to_pfn(pg));
+}
+EXPORT_SYMBOL_GPL(mshv_register_preserve_page);
+
+/**
+ * mshv_unregister_preserve_page() - Unregister a page from KHO preservation
+ * @pg: pointer to the page to unpreserve
+ *
+ * Unregisters a page that was previously registered to be preserved by KHO.
+ *
+ * Return: 0 on success, -errno on failure.
+ */
+int mshv_unregister_preserve_page(struct page *pg)
+{
+ return kho_radix_del_key(&preserved_pages_tree, page_to_pfn(pg));
+}
+EXPORT_SYMBOL_GPL(mshv_unregister_preserve_page);
+
+/* Preserve a single page identified by its PFN key with KHO */
+static int preserve_key_cb(unsigned long key, void *data)
+{
+ return kho_preserve_pages(pfn_to_page(key), 1);
+}
+
+/* Preserve a radix tree metadata page with KHO */
+static int preserve_table_cb(phys_addr_t phys, void *data)
+{
+ return kho_preserve_pages(phys_to_page(phys), 1);
+}
+
+static int create_fdt(void)
+{
+ int err;
+ void *fdt;
+ phys_addr_t root_table;
+
+ if (!fdt_page)
+ return -EINVAL;
+
+ fdt = fdt_page;
+
+ err = fdt_create(fdt, PAGE_SIZE);
+ if (err)
+ return err;
+ err = fdt_finish_reservemap(fdt);
+ if (err)
+ return err;
+ err = fdt_begin_node(fdt, "");
+ if (err)
+ return err;
+ err = fdt_property(fdt, "compatible", MSHV_KHO_COMPAT_STR,
+ strlen(MSHV_KHO_COMPAT_STR) + 1);
+ if (err)
+ return err;
+ root_table = virt_to_phys(preserved_pages_tree.root);
+ err = fdt_property(fdt, "root_table", &root_table, sizeof(root_table));
+ if (err)
+ return err;
+ err = fdt_end_node(fdt);
+ if (err)
+ return err;
+ err = fdt_finish(fdt);
+ if (err)
+ return err;
+
+ return 0;
+}
+
+/**
+ * preserve_tree() - Preserve pages owned by Microsoft Hypervisor
+ *
+ * This gets called prior to kexec and is our signal to finally preserve the
+ * pages with KHO, and create & register the named FDT. We also need to freeze
+ * the tree, since we cannot communicate any later changes.
+ *
+ * Return: 0 on success, -errno on error.
+ */
+static int preserve_tree(void)
+{
+ const struct kho_radix_walk_cb preserve_cb = {
+ .key = preserve_key_cb,
+ .table = preserve_table_cb,
+ };
+ int err;
+
+ err = kho_radix_tree_freeze(&preserved_pages_tree);
+ if (err) {
+ pr_warn("%s() - kho_radix_tree_freeze() failed: %d\n",
+ __func__, err);
+ return err;
+ }
+
+ /* Populate the pre-allocated FDT page with current tree state */
+ err = create_fdt();
+ if (err) {
+ pr_warn("%s() - create_fdt() failed: %d\n", __func__, err);
+ return err;
+ }
+
+ /* Preserve both data- and meta-pages */
+ err = kho_radix_walk_tree(&preserved_pages_tree, &preserve_cb, NULL);
+ if (err) {
+ /* We could not preserve all pages and cannot kexec. */
+ pr_warn("%s() - kho_radix_walk_tree() failed: %d\n", __func__,
+ err);
+ return err;
+ }
+
+ err = kho_preserve_pages(virt_to_page(fdt_page), 1);
+ if (err) {
+ pr_warn("%s() - kho_preserve_pages(fdt) failed: %d\n", __func__,
+ err);
+ return err;
+ }
+
+ err = kho_add_subtree(FDT_SUBTREE_MSHV, fdt_page, PAGE_SIZE);
+ if (err) {
+ /* KHO will abort and undo all preservations. We cannot kexec. */
+ pr_warn("%s() - kho_add_subtree() failed: %d\n", __func__, err);
+ return err;
+ }
+
+ pr_debug("%s() - success\n", __func__);
+ return 0;
+}
+
+/*
+ * Reboot-callback triggering page preservation prior to kexec. Other reboots
+ * need no KHO preservation.
+ */
+static int reboot_cb(struct notifier_block *nb, unsigned long action,
+ void *data)
+{
+ /* codes such as SYS_RESTART, SYS_HALT do not convey kexec specifically */
+ if (kexec_in_progress) {
+ int err;
+
+ /* Finalize handover: write KHO descriptors, flush metadata */
+ pr_debug("%s() - KHO-preserving page tree\n", __func__);
+ err = preserve_tree();
+ if (err)
+ panic("preserve_tree() failed - must not kexec: %d\n",
+ err);
+ }
+ return NOTIFY_OK;
+}
+
+/**
+ * restore_tree() - Restore the page-tree state from KHO.
+ *
+ * Return: 0 on success, -ENOENT if no KHO subtree was found (i.e. this is
+ * not a KHO boot), -EINVAL if the preserved FDT is malformed or
+ * incompatible.
+ */
+static int __init restore_tree(void)
+{
+ void *fdt;
+ phys_addr_t fdt_pa;
+ int len;
+ int node;
+ const phys_addr_t *root_table_fdt_ptr;
+ int err;
+
+ err = kho_retrieve_subtree(FDT_SUBTREE_MSHV, &fdt_pa, NULL);
+ if (err)
+ return err;
+
+ fdt = phys_to_virt(fdt_pa);
+ node = fdt_path_offset(fdt, "/");
+ if (node < 0) {
+ pr_err("Could not find root node in KHO-preserved FDT.\n");
+ return -EINVAL;
+ }
+
+ if (fdt_node_check_compatible(fdt, node, MSHV_KHO_COMPAT_STR)) {
+ /*
+ * This is unfortunate. We kexec'd into a kernel that isn't
+ * compatible with prior preservations. Pages this kernel
+ * considers available might actually be held by MSHV. The only
+ * recourse is to reboot.
+ */
+ const char *s = fdt_getprop(fdt, node, "compatible", &len);
+
+ if (s && len >= 0)
+ pr_err("Incompatible kernel: Current is %s, preserved is %.*s\n",
+ MSHV_KHO_COMPAT_STR, len, s);
+ else
+ pr_err("Incompatible kernel: preserved misses 'compatible' mark.\n");
+ return -EINVAL;
+ }
+
+ root_table_fdt_ptr = fdt_getprop(fdt, node, "root_table", &len);
+ if (!root_table_fdt_ptr || len != sizeof(*root_table_fdt_ptr)) {
+ pr_err("Could not obtain root_table property from KHO-preserved FDT.\n");
+ return -EINVAL;
+ }
+
+ /* Restore struct page so it could be freed if needed */
+ if (!kho_restore_pages(fdt_pa, 1))
+ return -EINVAL;
+
+ fdt_page = phys_to_virt(fdt_pa);
+
+ err = kho_radix_init_tree(&preserved_pages_tree,
+ phys_to_virt(*root_table_fdt_ptr));
+ if (err)
+ return -EINVAL;
+
+ pr_debug("Restored tracking from KHO.\n");
+ return 0;
+}
+
+/*
+ * Restore individual pages using KHO's helper during boot.
+ *
+ * Pages must be restored one at a time because they were deposited to
+ * the hypervisor individually and will be withdrawn individually later.
+ * Restoring them as a higher-order group would create compound pages
+ * that cannot be freed with __free_page().
+ */
+static int __init restore_key_cb(unsigned long key, void *data)
+{
+ if (!kho_restore_pages(PFN_PHYS(key), 1))
+ return -EINVAL;
+ return 0;
+}
+
+static int __init restore_table_cb(phys_addr_t phys, void *data)
+{
+ if (!kho_restore_pages(phys, 1))
+ return -EINVAL;
+ return 0;
+}
+
+/**
+ * restore_page_structs() - Restore page-structs so they can be __free_page()'d
+ *
+ * This is necessary because KHO-preserved pages are in a "weird" state
+ * post-kexec. While doing so here in bulk adds to boot time, there is no vetted
+ * alternative that would allow doing this later, when we cannot say which pages
+ * had been freshly added, and which came into the tree through KHO.
+ *
+ * Return: 0 on success, -errno on failure.
+ */
+static int __init restore_page_structs(void)
+{
+ const struct kho_radix_walk_cb cb = {
+ .key = restore_key_cb,
+ .table = restore_table_cb,
+ };
+
+ return kho_radix_walk_tree(&preserved_pages_tree, &cb, NULL);
+}
+
+/**
+ * alloc_tree() - Allocate a fresh page tree and FDT page.
+ *
+ * Called on fresh boot (no KHO data). Allocates an empty radix tree and
+ * the FDT page used to serialize state before kexec.
+ *
+ * Return: 0 on success, -errno on failure.
+ */
+static int __init alloc_tree(void)
+{
+ int err;
+
+ fdt_page = (void *)get_zeroed_page(GFP_KERNEL);
+ if (!fdt_page)
+ return -ENOMEM;
+
+ err = kho_radix_init_tree(&preserved_pages_tree, NULL);
+ if (err) {
+ free_page((unsigned long)fdt_page);
+ fdt_page = NULL;
+ return err;
+ }
+
+ return 0;
+}
+
+static struct notifier_block reboot_notifier = {
+ .notifier_call = reboot_cb,
+ .priority = 0,
+};
+
+/**
+ * mshv_preserve_init() - Initialize the page preservation
+ *
+ * Upon return:
+ * - the tracker will be ready for use (restored post-kexec, or empty
+ * post-reboot),
+ * - restored pages will be in a state that can be __free_page()'d,
+ * - KHO notification for preservation will be registered.
+ *
+ * Return: 0 on success, -errno on error.
+ */
+int __init mshv_preserve_init(void)
+{
+ int err;
+
+ if (!kho_is_enabled()) {
+ pr_err("KHO is disabled; page deposits will fail.\n");
+ return 0;
+ }
+
+ err = restore_tree();
+ if (!err) {
+ /* Restore struct pages so they can be __free_page()'d */
+ if (restore_page_structs())
+ /*
+ * Unrestored struct pages would BUG when freed
+ * at withdraw time.
+ */
+ panic("Failed to restore MSHV page structs\n");
+ } else if (err == -ENOENT) {
+ pr_debug("Nothing to restore from KHO.\n");
+ if (alloc_tree()) {
+ pr_err("Could not allocate page tree; page deposits will fail.\n");
+ return 0;
+ }
+ } else {
+ /*
+ * Pages from the prior kernel are held by MSHV but we
+ * lost track of them -- memory corruption is inevitable.
+ */
+ panic("Could not restore page tree from KHO: %d\n", err);
+ }
+
+ err = register_reboot_notifier(&reboot_notifier);
+ if (err)
+ /*
+ * Deposits would succeed but pages would not be preserved
+ * across kexec, causing memory corruption post-kexec.
+ */
+ panic("Could not register reboot notification: %d\n", err);
+
+ return 0;
+}
diff --git a/drivers/hv/mshv_page_preserve.h b/drivers/hv/mshv_page_preserve.h
new file mode 100644
index 000000000000..0609002e5f1d
--- /dev/null
+++ b/drivers/hv/mshv_page_preserve.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2026 Microsoft Corporation, Jork Loeser <jloeser@microsoft.com>
+ */
+
+#ifndef _MSHV_PAGE_PRESERVE_H
+#define _MSHV_PAGE_PRESERVE_H
+
+struct page;
+
+int mshv_preserve_init(void);
+int mshv_register_preserve_page(struct page *pg);
+int mshv_unregister_preserve_page(struct page *pg);
+
+#endif /* _MSHV_PAGE_PRESERVE_H */
diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
index 1f086dcb7aa1..362768786c17 100644
--- a/drivers/hv/mshv_root.h
+++ b/drivers/hv/mshv_root.h
@@ -18,6 +18,7 @@
#include <linux/mmu_notifier.h>
#include <uapi/linux/mshv.h>
#include "mshv_trace.h"
+#include "mshv_page_preserve.h"
/*
* Hypervisor must be between these version numbers (inclusive)
diff --git a/drivers/hv/mshv_root_hv_call.c b/drivers/hv/mshv_root_hv_call.c
index cb55d4d4be2e..f5ff03318787 100644
--- a/drivers/hv/mshv_root_hv_call.c
+++ b/drivers/hv/mshv_root_hv_call.c
@@ -69,8 +69,16 @@ int hv_call_withdraw_memory(u64 count, int node, u64 partition_id)
completed = hv_repcomp(status);
- for (i = 0; i < completed; i++)
- __free_page(pfn_to_page(output_page->gpa_page_list[i]));
+ for (i = 0; i < completed; i++) {
+ struct page *pg = pfn_to_page(output_page->gpa_page_list[i]);
+ int res = mshv_unregister_preserve_page(pg);
+
+ WARN_ONCE(res, "Failed to unregister PFN %#llx\n",
+ output_page->gpa_page_list[i]);
+
+ /* Free regardless -- HV has already released the page */
+ __free_page(pg);
+ }
if (!hv_result_success(status)) {
if (hv_result(status) == HV_STATUS_NO_RESOURCES)
--
2.43.0
^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: [RFC PATCH 15/20] mshv: Use page tracker to manage MSHV-owned pages and preserve with KHO
2026-05-28 0:41 ` [RFC PATCH 15/20] mshv: Use page tracker to manage MSHV-owned pages and preserve with KHO Jork Loeser
@ 2026-05-28 1:41 ` sashiko-bot
0 siblings, 0 replies; 37+ messages in thread
From: sashiko-bot @ 2026-05-28 1:41 UTC (permalink / raw)
To: Jork Loeser; +Cc: linux-hyperv
Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
- [High] KHO radix tree is accessed without initialization if KHO is disabled, leading to API misuse and potential crashes.
- [High] Concurrent memory withdrawal during kexec tree freeze causes memory corruption and use-after-free.
--
commit 9c3872dd668bb8911798d680d377645cc757d8e0
Author: Jork Loeser <jloeser@linux.microsoft.com>
mshv: Use page tracker to manage MSHV-owned pages and preserve with KHO
This commit adapts MSHV page deposit and withdrawal paths to use a tracker,
and registers these pages with KHO to preserve them across kexecs.
> diff --git a/drivers/hv/mshv_page_preserve.c b/drivers/hv/mshv_page_preserve.c
> --- /dev/null
> +++ b/drivers/hv/mshv_page_preserve.c
[ ... ]
> +int __init mshv_preserve_init(void)
> +{
> + int err;
> +
> + if (!kho_is_enabled()) {
> + pr_err("KHO is disabled; page deposits will fail.\n");
> + return 0;
> + }
[Severity: High]
If kho_is_enabled() returns false, this function returns 0 but sets no global
state to actually reject subsequent deposit requests. Does this mean
hv_call_deposit_pages() will proceed and unconditionally call
mshv_register_preserve_page(), leading to a crash when it tries to operate on
the uninitialized radix tree?
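One way to make the disabled case explicit is sketched below as a small
userspace model, not as kernel code. It records whether initialization
completed and has the register path check that before touching the tree. The
model_* names, the ready flag, and the choice of -ENODEV are assumptions made
for illustration; none of them come from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the tracker state; not the patch's data structure. */
struct model_tracker {
	bool ready;           /* set only after the tree was initialized */
	unsigned long count;  /* number of registered pages */
};

/* Models mshv_preserve_init(): leaves the tracker unusable if KHO is off. */
static int model_preserve_init(struct model_tracker *t, bool kho_enabled)
{
	assert(t != NULL);
	t->ready = false;
	t->count = 0;
	if (!kho_enabled)
		return 0;     /* boot continues, but deposits must be refused */
	t->ready = true;
	return 0;
}

/* Models the register path: refuse instead of touching an uninitialized tree. */
static int model_register_page(struct model_tracker *t, unsigned long pfn)
{
	assert(t != NULL);
	(void)pfn;            /* the model only counts pages */
	if (!t->ready)
		return -ENODEV;
	t->count++;
	return 0;
}
```

With a check of this kind the "page deposits will fail" message would describe
what actually happens, since callers get an error code back instead of
operating on an uninitialized tree.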
[ ... ]
> diff --git a/drivers/hv/mshv_root_hv_call.c b/drivers/hv/mshv_root_hv_call.c
> --- a/drivers/hv/mshv_root_hv_call.c
> +++ b/drivers/hv/mshv_root_hv_call.c
[ ... ]
> @@ -69,8 +69,16 @@ int hv_call_withdraw_memory(u64 count, int node, u64 partition_id)
>
> completed = hv_repcomp(status);
>
> - for (i = 0; i < completed; i++)
> - __free_page(pfn_to_page(output_page->gpa_page_list[i]));
> + for (i = 0; i < completed; i++) {
> + struct page *pg = pfn_to_page(output_page->gpa_page_list[i]);
> + int res = mshv_unregister_preserve_page(pg);
> +
> + WARN_ONCE(res, "Failed to unregister PFN %#llx\n",
> + output_page->gpa_page_list[i]);
> +
> + /* Free regardless -- HV has already released the page */
> + __free_page(pg);
> + }
[Severity: High]
Does this code introduce a use-after-free and memory corruption during kexec?
If the reboot notifier calls preserve_tree() and freezes the page tree,
mshv_unregister_preserve_page() will fail. However, the page is still freed
to the buddy allocator here. The concurrent preserve_tree() walk will then
visit this still-registered node and preserve the freed page, corrupting the
buddy allocator state.
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260528004204.1484584-1-jloeser@linux.microsoft.com?part=15
^ permalink raw reply [flat|nested] 37+ messages in thread
* [RFC PATCH 16/20] mshv: Add debugfs interface to page tracker
2026-05-28 0:41 [RFC PATCH 00/20] mshv: enable kexec with Hyper-V donated pages and partitions Jork Loeser
` (14 preceding siblings ...)
2026-05-28 0:41 ` [RFC PATCH 15/20] mshv: Use page tracker to manage MSHV-owned pages and preserve with KHO Jork Loeser
@ 2026-05-28 0:41 ` Jork Loeser
2026-05-28 1:48 ` sashiko-bot
2026-05-28 0:41 ` [RFC PATCH 17/20] hyperv: Reserve crash MSR P2 for page preservation root PA Jork Loeser
` (3 subsequent siblings)
19 siblings, 1 reply; 37+ messages in thread
From: Jork Loeser @ 2026-05-28 0:41 UTC (permalink / raw)
To: linux-hyperv, linux-mm, kexec
Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Mike Rapoport, Pasha Tatashin, Pratyush Yadav, Alexander Graf,
Jason Miu, Andrew Morton, David Hildenbrand, Muchun Song,
Oscar Salvador, Baoquan He, Catalin Marinas, Will Deacon,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Kees Cook, Ran Xiaokai, Justinien Bouron,
Sourabh Jain, Pingfan Liu, Rafael J. Wysocki, Mario Limonciello,
linux-arm-kernel, x86, linux-kernel, Michael Kelley, Jork Loeser
- expose stats (page-counts for data & metadata)
- expose tracked pages
- expose scheduler type
- add mshv_iterate_preserved() API for walking the radix tree
Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com>
---
drivers/hv/mshv_debugfs.c | 99 +++++++++++++++++++++++++++++++++
drivers/hv/mshv_page_preserve.c | 13 +++++
drivers/hv/mshv_page_preserve.h | 3 +
drivers/hv/mshv_root.h | 2 +
drivers/hv/mshv_root_main.c | 2 +-
5 files changed, 118 insertions(+), 1 deletion(-)
diff --git a/drivers/hv/mshv_debugfs.c b/drivers/hv/mshv_debugfs.c
index 3c3e02237ae9..d79898e21b36 100644
--- a/drivers/hv/mshv_debugfs.c
+++ b/drivers/hv/mshv_debugfs.c
@@ -33,11 +33,18 @@ static struct dentry *mshv_debugfs;
static struct dentry *mshv_debugfs_partition;
static struct dentry *mshv_debugfs_lp;
static struct dentry **parent_vp_stats;
+
static struct dentry *parent_partition_stats;
static u64 mshv_lps_count;
static struct hv_stats_page **mshv_lps_stats;
+struct mshv_pt_stats {
+ unsigned long count_data;
+ unsigned long count_meta;
+ struct seq_file *stat_file;
+};
+
static int lp_stats_show(struct seq_file *m, void *v)
{
const struct hv_stats_page *stats = m->private;
@@ -668,8 +675,89 @@ void mshv_debugfs_partition_remove(struct mshv_partition *partition)
partition->pt_stats_dentry);
}
+static int pt_count_data_cb(unsigned long key __maybe_unused, void *stats)
+{
+ ((struct mshv_pt_stats *)stats)->count_data++;
+ return 0;
+}
+
+static int pt_count_meta_cb(phys_addr_t phys __maybe_unused, void *stats)
+{
+ ((struct mshv_pt_stats *)stats)->count_meta++;
+ return 0;
+}
+
+static int pt_stats_show(struct seq_file *m, void *v)
+{
+ const struct kho_radix_walk_cb cb = {
+ .key = pt_count_data_cb,
+ .table = pt_count_meta_cb,
+ };
+
+ struct mshv_pt_stats pt_stats = {0};
+
+ mshv_iterate_preserved(&cb, &pt_stats);
+ seq_printf(m, "Data pages: %lu\n", pt_stats.count_data);
+ seq_printf(m, "Meta pages: %lu\n", pt_stats.count_meta);
+ return 0;
+}
+DEFINE_SHOW_ATTRIBUTE(pt_stats);
+
+static int pt_tree_data_cb(unsigned long key, void *stats)
+{
+ seq_printf(((struct mshv_pt_stats *)stats)->stat_file,
+ "data pfn: %#lx\n", key);
+ return 0;
+}
+
+static int pt_tree_meta_cb(phys_addr_t phys, void *stats)
+{
+ seq_printf(((struct mshv_pt_stats *)stats)->stat_file,
+ "meta pfn: %#llx\n",
+ (unsigned long long)(phys >> PAGE_SHIFT));
+ return 0;
+}
+
+static int pt_tree_show(struct seq_file *m, void *v)
+{
+ const struct kho_radix_walk_cb cb = {
+ .key = pt_tree_data_cb,
+ .table = pt_tree_meta_cb,
+ };
+
+ struct mshv_pt_stats pt_stats = {.stat_file = m};
+
+ mshv_iterate_preserved(&cb, &pt_stats);
+ return 0;
+}
+DEFINE_SHOW_ATTRIBUTE(pt_tree);
+
+static int __init mshv_debugfs_pt_create(struct dentry *parent)
+{
+ struct dentry *d;
+
+ d = debugfs_create_file("pt_stats", 0400, parent, NULL, &pt_stats_fops);
+ if (IS_ERR(d))
+ return PTR_ERR(d);
+
+ d = debugfs_create_file("pt_tree", 0400, parent, NULL, &pt_tree_fops);
+ if (IS_ERR(d))
+ return PTR_ERR(d);
+
+ return 0;
+}
+
+static int scheduler_info_show(struct seq_file *m, void *v)
+{
+ seq_printf(m, "Scheduler type: %s (%d)\n",
+ scheduler_type_to_string(hv_scheduler_type), hv_scheduler_type);
+ return 0;
+}
+DEFINE_SHOW_ATTRIBUTE(scheduler_info);
+
int __init mshv_debugfs_init(void)
{
+ struct dentry *d;
int err;
mshv_debugfs = debugfs_create_dir("mshv", NULL);
@@ -694,6 +782,17 @@ int __init mshv_debugfs_init(void)
if (err)
goto unmap_lp_stats;
+ err = mshv_debugfs_pt_create(mshv_debugfs);
+ if (err)
+ goto unmap_lp_stats;
+
+ d = debugfs_create_file("scheduler_info", 0400, mshv_debugfs, NULL,
+ &scheduler_info_fops);
+ if (IS_ERR(d)) {
+ err = PTR_ERR(d);
+ goto unmap_lp_stats;
+ }
+
return 0;
unmap_lp_stats:
diff --git a/drivers/hv/mshv_page_preserve.c b/drivers/hv/mshv_page_preserve.c
index a79725a74663..bc3a3a688f5b 100644
--- a/drivers/hv/mshv_page_preserve.c
+++ b/drivers/hv/mshv_page_preserve.c
@@ -52,6 +52,19 @@ int mshv_unregister_preserve_page(struct page *pg)
}
EXPORT_SYMBOL_GPL(mshv_unregister_preserve_page);
+/**
+ * mshv_iterate_preserved() - Walk all preserved pages
+ * @cb: callbacks invoked for each key/table entry
+ * @data: opaque data passed to callbacks
+ *
+ * Return: 0 on success, -errno on failure.
+ */
+int mshv_iterate_preserved(const struct kho_radix_walk_cb *cb, void *data)
+{
+ return kho_radix_walk_tree(&preserved_pages_tree, cb, data);
+}
+EXPORT_SYMBOL_GPL(mshv_iterate_preserved);
+
/* Preserve a single page identified by its PFN key with KHO */
static int preserve_key_cb(unsigned long key, void *data)
{
diff --git a/drivers/hv/mshv_page_preserve.h b/drivers/hv/mshv_page_preserve.h
index 0609002e5f1d..ac99b4e33285 100644
--- a/drivers/hv/mshv_page_preserve.h
+++ b/drivers/hv/mshv_page_preserve.h
@@ -6,10 +6,13 @@
#ifndef _MSHV_PAGE_PRESERVE_H
#define _MSHV_PAGE_PRESERVE_H
+#include <linux/kho_radix_tree.h>
+
struct page;
int mshv_preserve_init(void);
int mshv_register_preserve_page(struct page *pg);
int mshv_unregister_preserve_page(struct page *pg);
+int mshv_iterate_preserved(const struct kho_radix_walk_cb *cb, void *data);
#endif /* _MSHV_PAGE_PRESERVE_H */
diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
index 362768786c17..216053f8e0ab 100644
--- a/drivers/hv/mshv_root.h
+++ b/drivers/hv/mshv_root.h
@@ -379,4 +379,6 @@ bool mshv_region_handle_gfn_fault(struct mshv_mem_region *region, u64 gfn);
void mshv_region_movable_fini(struct mshv_mem_region *region);
bool mshv_region_movable_init(struct mshv_mem_region *region);
+const char *scheduler_type_to_string(enum hv_scheduler_type type);
+
#endif /* _MSHV_ROOT_H_ */
diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index bd1359eb58dd..5fbd01c12ab8 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -2134,7 +2134,7 @@ mshv_dev_release(struct inode *inode, struct file *filp)
static int mshv_root_sched_online;
-static const char *scheduler_type_to_string(enum hv_scheduler_type type)
+const char *scheduler_type_to_string(enum hv_scheduler_type type)
{
switch (type) {
case HV_SCHEDULER_TYPE_LP:
--
2.43.0
^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: [RFC PATCH 16/20] mshv: Add debugfs interface to page tracker
2026-05-28 0:41 ` [RFC PATCH 16/20] mshv: Add debugfs interface to page tracker Jork Loeser
@ 2026-05-28 1:48 ` sashiko-bot
0 siblings, 0 replies; 37+ messages in thread
From: sashiko-bot @ 2026-05-28 1:48 UTC (permalink / raw)
To: Jork Loeser; +Cc: linux-hyperv
Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:
- [Medium] The newly introduced `mshv_iterate_preserved()` and its debugfs callers invoke KHO APIs without checking `kho_is_enabled()`, triggering a kernel warning on systems where KHO is disabled.
- [Medium] The error handling path in `mshv_debugfs_init()` fails to clean up resources allocated by `mshv_debugfs_parent_partition_create()`.
- [Medium] The `pt_tree` debugfs interface uses `single_open()` for potentially unbounded radix tree iteration, leading to O(N^2) processing overhead, prolonged lock contention, and allocation failures.
--
commit 0782bfee5c54e7b8251a28d7d1fba762890a41cc
Author: Jork Loeser <jloeser@linux.microsoft.com>
mshv: Add debugfs interface to page tracker
This commit adds debugfs files to expose statistics and tracked pages from the
MSHV page tracker, as well as the scheduler type. It also adds a new API to
iterate over preserved pages using the KHO radix tree.
> diff --git a/drivers/hv/mshv_debugfs.c b/drivers/hv/mshv_debugfs.c
> index 3c3e02237ae94..d79898e21b361 100644
> --- a/drivers/hv/mshv_debugfs.c
> +++ b/drivers/hv/mshv_debugfs.c
[ ... ]
> @@ -668,8 +675,89 @@ void mshv_debugfs_partition_remove(struct mshv_partition *partition)
[ ... ]
> +static int pt_stats_show(struct seq_file *m, void *v)
> +{
> + const struct kho_radix_walk_cb cb = {
> + .key = pt_count_data_cb,
> + .table = pt_count_meta_cb,
> + };
> +
> + struct mshv_pt_stats pt_stats = {0};
> +
> + mshv_iterate_preserved(&cb, &pt_stats);
[Severity: Medium]
Is it possible to read this file when KHO is disabled?
According to KHO subsystem rules, callers must gate KHO usage on
kho_is_enabled(). If kho_is_enabled() is false, preserved_pages_tree may remain
uninitialized (root is NULL), and kho_radix_walk_tree() will trigger a kernel
warning.
> + seq_printf(m, "Data pages: %lu\n", pt_stats.count_data);
> + seq_printf(m, "Meta pages: %lu\n", pt_stats.count_meta);
> + return 0;
> +}
> +DEFINE_SHOW_ATTRIBUTE(pt_stats);
[ ... ]
> +static int pt_tree_show(struct seq_file *m, void *v)
> +{
> + const struct kho_radix_walk_cb cb = {
> + .key = pt_tree_data_cb,
> + .table = pt_tree_meta_cb,
> + };
> +
> + struct mshv_pt_stats pt_stats = {.stat_file = m};
> +
> + mshv_iterate_preserved(&cb, &pt_stats);
> + return 0;
> +}
> +DEFINE_SHOW_ATTRIBUTE(pt_tree);
[Severity: Medium]
Will this lead to large allocations and O(N^2) lock contention for large VMs?
DEFINE_SHOW_ATTRIBUTE() uses single_open(), which requires all output to fit in
a single contiguous kmalloc buffer. Iterating over all preserved pages can scale
to millions of pages.
If the output exceeds the initial seq_file buffer, seq_printf() discards the
overflow, but the tree traversal continues to completion. Afterwards, seq_read()
will reallocate a larger buffer and restart the entire traversal. This creates
a retry loop that monopolizes the KHO radix tree lock and will likely
fail with -ENOMEM when kmalloc reaches the MAX_ORDER limit.
Should unbounded outputs use stateful seq_operations instead?
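To get a feel for the cost of the retries, here is a rough userspace model. It
assumes the seq_file buffer starts at one page and doubles after every
overflow; that retry policy is an assumption of the sketch and is not stated in
the patch. The function name and the per-entry byte count are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Count the full tree walks needed before the output fits, assuming the
 * buffer starts at initial_size bytes and doubles after every overflow.
 * A completely full buffer is treated as overflowed.
 */
static unsigned int model_walks_until_fit(size_t entries,
					  size_t bytes_per_entry,
					  size_t initial_size)
{
	size_t needed = entries * bytes_per_entry;
	size_t size = initial_size;
	unsigned int walks = 1;

	assert(initial_size > 0);
	while (needed >= size) {
		size *= 2;  /* reallocate a larger buffer ... */
		walks++;    /* ... and repeat the entire walk */
	}
	return walks;
}
```

Under these assumptions, one million entries at about 24 bytes each would take
14 complete walks, with the final buffer at 32 MiB, before anything reaches
userspace.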
[ ... ]
> @@ -694,6 +782,17 @@ int __init mshv_debugfs_init(void)
> if (err)
> goto unmap_lp_stats;
>
> + err = mshv_debugfs_pt_create(mshv_debugfs);
> + if (err)
> + goto unmap_lp_stats;
[Severity: Medium]
Does this error path bypass cleaning up the parent partition stats?
If mshv_debugfs_pt_create() fails, it jumps to unmap_lp_stats, which
skips mshv_debugfs_parent_partition_remove(). This appears to permanently leak
the resources allocated by the preceding mshv_debugfs_parent_partition_create()
call.
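The usual shape for this kind of unwind is sketched below as a userspace model.
Each failure jumps to a label that undoes its own predecessors and then falls
through to the older cleanup. The remove_parent_partition label and the model_*
names are hypothetical; only unmap_lp_stats appears in the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flags standing in for the resources set up during init. */
struct model_state {
	bool lp_stats_mapped;
	bool parent_partition_created;
};

/* Each later failure jumps to a label that also undoes every earlier step. */
static int model_debugfs_init(struct model_state *s, bool parent_fails,
			      bool pt_create_fails)
{
	int err;

	assert(s != NULL);
	s->parent_partition_created = false;
	s->lp_stats_mapped = true;              /* step 1 */

	err = parent_fails ? -ENOMEM : 0;       /* step 2 */
	if (err)
		goto unmap_lp_stats;
	s->parent_partition_created = true;

	err = pt_create_fails ? -ENOMEM : 0;    /* step 3, the new call */
	if (err)
		goto remove_parent_partition;   /* hypothetical label */

	return 0;

remove_parent_partition:
	s->parent_partition_created = false;    /* undo step 2 first */
unmap_lp_stats:
	s->lp_stats_mapped = false;             /* then undo step 1 */
	return err;
}
```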
> +
> + d = debugfs_create_file("scheduler_info", 0400, mshv_debugfs, NULL,
> + &scheduler_info_fops);
> + if (IS_ERR(d)) {
> + err = PTR_ERR(d);
> + goto unmap_lp_stats;
> + }
> +
> return 0;
>
> unmap_lp_stats:
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260528004204.1484584-1-jloeser@linux.microsoft.com?part=16
^ permalink raw reply [flat|nested] 37+ messages in thread
* [RFC PATCH 17/20] hyperv: Reserve crash MSR P2 for page preservation root PA
2026-05-28 0:41 [RFC PATCH 00/20] mshv: enable kexec with Hyper-V donated pages and partitions Jork Loeser
` (15 preceding siblings ...)
2026-05-28 0:41 ` [RFC PATCH 16/20] mshv: Add debugfs interface to page tracker Jork Loeser
@ 2026-05-28 0:41 ` Jork Loeser
2026-05-28 1:34 ` sashiko-bot
2026-05-28 0:42 ` [RFC PATCH 18/20] mshv: Exclude Hyper-V donated pages from crash dump collection Jork Loeser
` (2 subsequent siblings)
19 siblings, 1 reply; 37+ messages in thread
From: Jork Loeser @ 2026-05-28 0:41 UTC (permalink / raw)
To: linux-hyperv, linux-mm, kexec
Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Mike Rapoport, Pasha Tatashin, Pratyush Yadav, Alexander Graf,
Jason Miu, Andrew Morton, David Hildenbrand, Muchun Song,
Oscar Salvador, Baoquan He, Catalin Marinas, Will Deacon,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Kees Cook, Ran Xiaokai, Justinien Bouron,
Sourabh Jain, Pingfan Liu, Rafael J. Wysocki, Mario Limonciello,
linux-arm-kernel, x86, linux-kernel, Michael Kelley, Jork Loeser
The crash MSRs have no formal semantics and nothing depends on their
contents, so the register assignment can be reshuffled freely.
Reserve crash MSR P2 for passing the KHO radix tree root physical
address to the crash kernel for MSHV page exclusion during dump
collection. Stop overwriting it in the panic reporting paths.
Move IP/PC to P3 and SP to P4 in hyperv_report_panic() on both x86
and ARM64. Remove the P2 write from hv_kmsg_dump().
Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com>
---
arch/arm64/hyperv/hv_core.c | 6 +++---
arch/x86/hyperv/hv_init.c | 4 ++--
drivers/hv/hv_common.c | 2 +-
3 files changed, 6 insertions(+), 6 deletions(-)
diff --git a/arch/arm64/hyperv/hv_core.c b/arch/arm64/hyperv/hv_core.c
index e33a9e3c366a..b75337c4892d 100644
--- a/arch/arm64/hyperv/hv_core.c
+++ b/arch/arm64/hyperv/hv_core.c
@@ -185,9 +185,9 @@ void hyperv_report_panic(struct pt_regs *regs, long err, bool in_die)
*/
hv_set_vpreg(HV_REGISTER_GUEST_CRASH_P0, err);
hv_set_vpreg(HV_REGISTER_GUEST_CRASH_P1, guest_id);
- hv_set_vpreg(HV_REGISTER_GUEST_CRASH_P2, regs->pc);
- hv_set_vpreg(HV_REGISTER_GUEST_CRASH_P3, regs->sp);
- hv_set_vpreg(HV_REGISTER_GUEST_CRASH_P4, 0);
+ /* P2 is reserved for the KHO preserved-pages tree root PA */
+ hv_set_vpreg(HV_REGISTER_GUEST_CRASH_P3, regs->pc);
+ hv_set_vpreg(HV_REGISTER_GUEST_CRASH_P4, regs->sp);
/*
* Let Hyper-V know there is crash data available
diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index 55a8b6de2865..cd75e2be19b2 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -675,8 +675,8 @@ void hyperv_report_panic(struct pt_regs *regs, long err, bool in_die)
wrmsrq(HV_X64_MSR_CRASH_P0, err);
wrmsrq(HV_X64_MSR_CRASH_P1, guest_id);
- wrmsrq(HV_X64_MSR_CRASH_P2, regs->ip);
- wrmsrq(HV_X64_MSR_CRASH_P3, regs->ax);
+ /* P2 is reserved for the KHO preserved-pages tree root PA */
+ wrmsrq(HV_X64_MSR_CRASH_P3, regs->ip);
wrmsrq(HV_X64_MSR_CRASH_P4, regs->sp);
/*
diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
index 8a593117e9b8..ae6415f42f25 100644
--- a/drivers/hv/hv_common.c
+++ b/drivers/hv/hv_common.c
@@ -212,7 +212,7 @@ static void hv_kmsg_dump(struct kmsg_dumper *dumper,
*/
hv_set_msr(HV_MSR_CRASH_P0, 0);
hv_set_msr(HV_MSR_CRASH_P1, 0);
- hv_set_msr(HV_MSR_CRASH_P2, 0);
+ /* P2 is reserved for the KHO preserved-pages tree root PA */
hv_set_msr(HV_MSR_CRASH_P3, bytes_written ? virt_to_phys(hv_panic_page) : 0);
hv_set_msr(HV_MSR_CRASH_P4, bytes_written);
--
2.43.0
^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: [RFC PATCH 17/20] hyperv: Reserve crash MSR P2 for page preservation root PA
2026-05-28 0:41 ` [RFC PATCH 17/20] hyperv: Reserve crash MSR P2 for page preservation root PA Jork Loeser
@ 2026-05-28 1:34 ` sashiko-bot
0 siblings, 0 replies; 37+ messages in thread
From: sashiko-bot @ 2026-05-28 1:34 UTC (permalink / raw)
To: Jork Loeser; +Cc: linux-hyperv
Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
- [High] Removing the overwrites of crash MSR P2 in panic paths can leak a stale KHO root PA to the crash kernel if KHO is currently disabled.
- [Medium] The commit message incorrectly describes the register shifts on x86, hiding the undocumented removal of `regs->ax` from the crash report.
--
commit a2f7e475a7dfd02f8ed395438172d715377b9152
Author: Jork Loeser <jloeser@linux.microsoft.com>
hyperv: Reserve crash MSR P2 for page preservation root PA
This commit reserves the crash MSR P2 to pass the KHO radix tree root
physical address to the crash kernel for MSHV page exclusion. It updates
the panic reporting paths to stop overwriting P2 and shifts the remaining
register values accordingly.
> Move IP/PC to P3 and SP to P4 in hyperv_report_panic() on both x86
> and ARM64. Remove the P2 write from hv_kmsg_dump().
[Severity: Medium]
Does this description accurately reflect the changes on x86?
On x86, SP was already assigned to P4, so it is not moved. Additionally,
moving IP to P3 overwrites the previous contents of P3 (regs->ax), entirely
dropping it from the crash report.
Could the commit message be updated to explicitly mention this loss of
diagnostic data?
> diff --git a/arch/arm64/hyperv/hv_core.c b/arch/arm64/hyperv/hv_core.c
> --- a/arch/arm64/hyperv/hv_core.c
> +++ b/arch/arm64/hyperv/hv_core.c
> @@ -185,9 +185,9 @@ void hyperv_report_panic(struct pt_regs *regs, long err, bool in_die)
> */
> hv_set_vpreg(HV_REGISTER_GUEST_CRASH_P0, err);
> hv_set_vpreg(HV_REGISTER_GUEST_CRASH_P1, guest_id);
> - hv_set_vpreg(HV_REGISTER_GUEST_CRASH_P2, regs->pc);
> - hv_set_vpreg(HV_REGISTER_GUEST_CRASH_P3, regs->sp);
> - hv_set_vpreg(HV_REGISTER_GUEST_CRASH_P4, 0);
> + /* P2 is reserved for the KHO preserved-pages tree root PA */
[Severity: High]
Could skipping the overwrite of P2 leak a stale KHO root physical address
to the crash kernel if KHO is currently disabled?
If a KHO-enabled kernel writes a valid radix tree root PA to P2 and then
kexecs into a KHO-disabled kernel, the KHO-disabled kernel might not clear
the MSR during its initialization.
If this KHO-disabled kernel later panics, leaving P2 untouched here means
the crash kernel will read the stale PA from P2. Since the memory at the
stale PA was likely reused, could walking it as a valid KHO radix tree
cause a page fault and break crash dump collection?
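One possible direction, sketched here as a userspace model with a plain
variable standing in for the register: whichever kernel owns P2 either
publishes a root PA or clears the register, so a later crash kernel never sees
a value left behind by a previous kernel. The model_* names are hypothetical
and the real fix may look quite different.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of crash register P2; its value survives a modeled kexec. */
static uint64_t model_crash_p2;

/* Models init: publish the root PA, or clear P2 when there is no tree. */
static void model_publish_root(bool kho_enabled, uint64_t root_pa)
{
	/* An enabled tracker is expected to have a non-zero root PA. */
	assert(!kho_enabled || root_pa != 0);
	model_crash_p2 = kho_enabled ? root_pa : 0;
}

/* Models the crash kernel's check: zero means there is no tree to walk. */
static bool model_crash_tree_present(void)
{
	return model_crash_p2 != 0;
}
```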
> + hv_set_vpreg(HV_REGISTER_GUEST_CRASH_P3, regs->pc);
> + hv_set_vpreg(HV_REGISTER_GUEST_CRASH_P4, regs->sp);
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260528004204.1484584-1-jloeser@linux.microsoft.com?part=17
^ permalink raw reply [flat|nested] 37+ messages in thread
* [RFC PATCH 18/20] mshv: Exclude Hyper-V donated pages from crash dump collection
2026-05-28 0:41 [RFC PATCH 00/20] mshv: enable kexec with Hyper-V donated pages and partitions Jork Loeser
` (16 preceding siblings ...)
2026-05-28 0:41 ` [RFC PATCH 17/20] hyperv: Reserve crash MSR P2 for page preservation root PA Jork Loeser
@ 2026-05-28 0:42 ` Jork Loeser
2026-05-28 2:13 ` sashiko-bot
2026-05-28 0:42 ` [RFC PATCH 19/20] kexec: export kexec_in_progress for modules Jork Loeser
2026-05-28 0:42 ` [RFC PATCH 20/20] mshv: freeze and vacuum partitions across kexec Jork Loeser
19 siblings, 1 reply; 37+ messages in thread
From: Jork Loeser @ 2026-05-28 0:42 UTC (permalink / raw)
To: linux-hyperv, linux-mm, kexec
Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Mike Rapoport, Pasha Tatashin, Pratyush Yadav, Alexander Graf,
Jason Miu, Andrew Morton, David Hildenbrand, Muchun Song,
Oscar Salvador, Baoquan He, Catalin Marinas, Will Deacon,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Kees Cook, Ran Xiaokai, Justinien Bouron,
Sourabh Jain, Pingfan Liu, Rafael J. Wysocki, Mario Limonciello,
linux-arm-kernel, x86, linux-kernel, Michael Kelley, Jork Loeser
Pages donated to Hyper-V must not be read during crash dump collection.
They are not ordinary RAM and accessing them can hang or corrupt the
crash kernel.
Use the KHO radix tree of preserved pages to drive a vmcore pfn_is_ram()
callback. The radix tree root PA is passed to the crash kernel via
Hyper-V crash MSR P2, since the old kernel's KHO FDT is not accessible
from the crash kernel's direct map.
Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com>
---
drivers/hv/mshv_page_preserve.c | 80 +++++++++++++++++++++++++++++++++
1 file changed, 80 insertions(+)
diff --git a/drivers/hv/mshv_page_preserve.c b/drivers/hv/mshv_page_preserve.c
index bc3a3a688f5b..e16fb946790d 100644
--- a/drivers/hv/mshv_page_preserve.c
+++ b/drivers/hv/mshv_page_preserve.c
@@ -11,6 +11,7 @@
#define pr_fmt(fmt) "mshv: " fmt
#include <asm/mshyperv.h>
+#include <linux/crash_dump.h>
#include <linux/kexec.h>
#include <linux/kexec_handover.h>
#include <linux/kho_radix_tree.h>
@@ -327,6 +328,57 @@ static int __init alloc_tree(void)
return 0;
}
+#ifdef CONFIG_CRASH_DUMP
+static struct kho_radix_crash_tree crash_preserved_pages_tree;
+
+/**
+ * restore_crash_tree() - Set up the crash tree for dump-time page exclusion.
+ *
+ * In the crash kernel, the old kernel's memory is not in the direct map.
+ * The old kernel stashes the radix tree root PA in Hyper-V crash MSR P2
+ * so we can retrieve it without touching the old kernel's FDT.
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+static int __init restore_crash_tree(void)
+{
+ phys_addr_t root_pa;
+
+ root_pa = hv_get_msr(HV_MSR_CRASH_P2);
+ if (!root_pa)
+ return -ENOENT;
+
+ /*
+ * The MSR may contain stale data from a previous
+ * hyperv_report_panic(). Sanity-check that it looks like a
+ * page-aligned physical address within the architectural limit.
+ */
+ if (!PAGE_ALIGNED(root_pa) || root_pa >> MAX_POSSIBLE_PHYSMEM_BITS) {
+ pr_warn("Invalid crash tree root PA: 0x%llx\n",
+ (unsigned long long)root_pa);
+ return -EINVAL;
+ }
+
+ return kho_radix_crash_init(&crash_preserved_pages_tree, root_pa);
+}
+
+static bool mshv_vmcore_pfn_is_ram(struct vmcore_cb *cb, unsigned long pfn)
+{
+ /*
+ * MSHV-owned pages must not be read during crash dump collection.
+ * Currently all pages are registered at order 0. If higher-order
+ * registrations are added, this lookup will need to handle them
+ * (e.g. by querying multiple orders or using a range-based API).
+ */
+ return !kho_radix_crash_contains_page(&crash_preserved_pages_tree,
+ pfn, 0);
+}
+
+static struct vmcore_cb mshv_vmcore_cb = {
+ .pfn_is_ram = mshv_vmcore_pfn_is_ram,
+};
+#endif
+
static struct notifier_block reboot_notifier = {
.notifier_call = reboot_cb,
.priority = 0,
@@ -347,6 +399,24 @@ int __init mshv_preserve_init(void)
{
int err;
+#ifdef CONFIG_CRASH_DUMP
+ if (is_kdump_kernel()) {
+ /*
+ * Crash kernel only needs the pfn_is_ram callback to exclude
+ * MSHV-owned pages from the dump. No page restoration, no
+ * reboot notifier — the crash kernel reboots after collection.
+ */
+ err = restore_crash_tree();
+ if (err) {
+ pr_err("Could not set up crash page tree: %d; MSHV pages may appear in dump\n",
+ err);
+ return 0;
+ }
+ register_vmcore_cb(&mshv_vmcore_cb);
+ return 0;
+ }
+#endif
+
if (!kho_is_enabled()) {
pr_err("KHO is disabled; page deposits will fail.\n");
return 0;
@@ -383,5 +453,15 @@ int __init mshv_preserve_init(void)
*/
panic("Could not register reboot notification: %d\n", err);
+ /*
+ * Stash the radix tree root PA in crash MSR P2 so the crash
+ * kernel can retrieve it without touching the old kernel's FDT
+ * (which is not in the crash kernel's direct map). The root
+ * pointer is stable once the tree is initialized — pages are
+ * added/removed within the existing tree structure.
+ */
+ hv_set_msr(HV_MSR_CRASH_P2,
+ virt_to_phys(preserved_pages_tree.root));
+
return 0;
}
--
2.43.0
^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: [RFC PATCH 18/20] mshv: Exclude Hyper-V donated pages from crash dump collection
2026-05-28 0:42 ` [RFC PATCH 18/20] mshv: Exclude Hyper-V donated pages from crash dump collection Jork Loeser
@ 2026-05-28 2:13 ` sashiko-bot
0 siblings, 0 replies; 37+ messages in thread
From: sashiko-bot @ 2026-05-28 2:13 UTC (permalink / raw)
To: Jork Loeser; +Cc: linux-hyperv
Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:
- [High] Radix tree key encoding mismatch prevents exclusion of Hyper-V owned pages during crash dump.
- [High] Failure to clear the Hyper-V crash MSR P2 when KHO is disabled or fails to initialize leaves a stale physical address, leading to memory corruption during crash dump collection.
- [Medium] The kdump initialization path calls a KHO API without being gated by `kho_is_enabled()`, violating strict KHO subsystem guidelines.
--
commit a672e76ca918e68a4de01d571d5f2e72810eec75
Author: Jork Loeser <jloeser@linux.microsoft.com>
mshv: Exclude Hyper-V donated pages from crash dump collection
Pages donated to Hyper-V must not be read during crash dump collection.
They are not ordinary RAM and accessing them can hang or corrupt the
crash kernel.
> diff --git a/drivers/hv/mshv_page_preserve.c b/drivers/hv/mshv_page_preserve.c
> index bc3a3a688f5b2..e16fb946790d5 100644
> --- a/drivers/hv/mshv_page_preserve.c
> +++ b/drivers/hv/mshv_page_preserve.c
[ ... ]
> @@ -327,6 +328,57 @@ static int __init alloc_tree(void)
> return 0;
> }
>
> +#ifdef CONFIG_CRASH_DUMP
> +static struct kho_radix_crash_tree crash_preserved_pages_tree;
> +
> +/**
> + * restore_crash_tree() - Set up the crash tree for dump-time page exclusion.
> + *
> + * In the crash kernel, the old kernel's memory is not in the direct map.
> + * The old kernel stashes the radix tree root PA in Hyper-V crash MSR P2
> + * so we can retrieve it without touching the old kernel's FDT.
> + *
> + * Return: 0 on success, negative error code on failure.
> + */
> +static int __init restore_crash_tree(void)
> +{
> + phys_addr_t root_pa;
> +
> + root_pa = hv_get_msr(HV_MSR_CRASH_P2);
> + if (!root_pa)
> + return -ENOENT;
> +
> + /*
> + * The MSR may contain stale data from a previous
> + * hyperv_report_panic(). Sanity-check that it looks like a
> + * page-aligned physical address within the architectural limit.
> + */
> + if (!PAGE_ALIGNED(root_pa) || root_pa >> MAX_POSSIBLE_PHYSMEM_BITS) {
> + pr_warn("Invalid crash tree root PA: 0x%llx\n",
> + (unsigned long long)root_pa);
> + return -EINVAL;
> + }
> +
> + return kho_radix_crash_init(&crash_preserved_pages_tree, root_pa);
> +}
> +
> +static bool mshv_vmcore_pfn_is_ram(struct vmcore_cb *cb, unsigned long pfn)
> +{
> + /*
> + * MSHV-owned pages must not be read during crash dump collection.
> + * Currently all pages are registered at order 0. If higher-order
> + * registrations are added, this lookup will need to handle them
> + * (e.g. by querying multiple orders or using a range-based API).
> + */
> + return !kho_radix_crash_contains_page(&crash_preserved_pages_tree,
> + pfn, 0);
[Severity: High]
Is the key inserted into the radix tree matched correctly here?
If pages are inserted into the tree using the raw page PFN (e.g.,
page_to_pfn()), this lookup might always miss. The
kho_radix_crash_contains_page() function internally encodes the query key
using kho_encode_radix_key(PFN_PHYS(pfn), order), which adds high-order
metadata bits.
Will this cause the crash kernel to fail to exclude Hyper-V donated pages,
treating them as normal RAM?
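To make the concern concrete, here is a minimal userspace sketch, not the
kernel code. It assumes a toy key layout in which a single high bit tags the
order; toy_encode_key(), TOY_PAGE_SHIFT and the one-slot toy_tree are
hypothetical names, and the real kho_encode_radix_key() layout may differ.
It only illustrates that a key stored as a raw PFN is not found by a lookup
that encodes its query first.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12u
/* Hypothetical: bit position that tags an order-0 key. */
#define TOY_ORDER0_BIT (64u - TOY_PAGE_SHIFT)

/* Toy stand-in for kho_encode_radix_key(); the real layout may differ. */
static uint64_t toy_encode_key(uint64_t phys, unsigned int order)
{
	uint64_t order_tag;

	assert(order < TOY_ORDER0_BIT);
	order_tag = 1ULL << (TOY_ORDER0_BIT - order);
	return order_tag | (phys >> (TOY_PAGE_SHIFT + order));
}

/* One-slot "tree": just enough to show hit/miss behaviour. */
struct toy_tree {
	uint64_t key;
	bool used;
};

static void toy_add_key(struct toy_tree *t, uint64_t key)
{
	t->key = key;
	t->used = true;
}

static bool toy_contains_key(const struct toy_tree *t, uint64_t key)
{
	return t->used && t->key == key;
}

/* Lookup modelled on the description above: it always encodes the query. */
static bool toy_contains_page(const struct toy_tree *t, uint64_t pfn,
			      unsigned int order)
{
	return toy_contains_key(t, toy_encode_key(pfn << TOY_PAGE_SHIFT, order));
}
```

In this model, a tree populated with toy_add_key(&t, pfn) misses on
toy_contains_page(&t, pfn, 0), while one populated with
toy_add_key(&t, toy_encode_key(pfn << TOY_PAGE_SHIFT, 0)) hits. Whether the
real insert path encodes its keys is exactly what needs checking.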
> +}
> +
> +static struct vmcore_cb mshv_vmcore_cb = {
> + .pfn_is_ram = mshv_vmcore_pfn_is_ram,
> +};
> +#endif
> +
> static struct notifier_block reboot_notifier = {
> .notifier_call = reboot_cb,
> .priority = 0,
> @@ -347,6 +399,24 @@ int __init mshv_preserve_init(void)
> {
> int err;
>
> +#ifdef CONFIG_CRASH_DUMP
> + if (is_kdump_kernel()) {
> + /*
> + * Crash kernel only needs the pfn_is_ram callback to exclude
> + * MSHV-owned pages from the dump. No page restoration, no
> + * reboot notifier — the crash kernel reboots after collection.
> + */
> + err = restore_crash_tree();
> + if (err) {
> + pr_err("Could not set up crash page tree: %d; MSHV pages may appear in dump\n",
> + err);
> + return 0;
> + }
> + register_vmcore_cb(&mshv_vmcore_cb);
> + return 0;
> + }
> +#endif
> +
[Severity: Medium]
Should this path be gated by kho_is_enabled() as well?
The KHO subsystem documentation mandates checking kho_is_enabled() at the
entry point of any code path that uses KHO APIs. Here, restore_crash_tree()
(which calls the KHO API kho_radix_crash_init()) is invoked before checking
if KHO is enabled.
> if (!kho_is_enabled()) {
> pr_err("KHO is disabled; page deposits will fail.\n");
> return 0;
[Severity: High]
If the system kexecs into a new kernel where KHO is disabled (or initialization
fails), this function returns early without clearing the HV_MSR_CRASH_P2 MSR.
Since Hyper-V crash MSRs persist across kexecs, could the new kernel leave a
stale physical address in P2?
If this new kernel panics, the kdump kernel will read the stale MSR. Because
the address is page-aligned and within bounds, it passes sanity checks in
restore_crash_tree(), and kho_radix_crash_init() will recursively map and
parse the arbitrary memory as a radix tree.
Because kho_radix_crash_init() executes node->table[i] = (u64)(uintptr_t)child
while walking, might it write virtual addresses into the crashed kernel's
physical memory, corrupting the crash dump or panicking the kdump kernel when
it maps invalid addresses?
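One possible shape for a fix, sketched as a userspace model rather than a
tested kernel change: clear P2 before any early return, so the only non-zero
value the kdump kernel can see is one written by this boot. The
fake_crash_p2 variable stands in for HV_MSR_CRASH_P2, and toy_preserve_init()
is a hypothetical stand-in for mshv_preserve_init(). The sketch assumes that
writing zero is acceptable at this point, which relies on
restore_crash_tree() treating a zero root PA as "no tree" (it returns
-ENOENT).

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the HV_MSR_CRASH_P2 register; persists across "kexec". */
static uint64_t fake_crash_p2;

static void fake_set_crash_p2(uint64_t val)
{
	fake_crash_p2 = val;
}

static uint64_t fake_get_crash_p2(void)
{
	return fake_crash_p2;
}

/*
 * Hypothetical model of the init path. Like the patch, it returns 0 even
 * when KHO is disabled; unlike the patch, it clears P2 first so that no
 * early return can leave a value from a previous kernel behind.
 */
static int toy_preserve_init(bool kho_enabled, uint64_t root_pa)
{
	fake_set_crash_p2(0);

	if (!kho_enabled)
		return 0;

	/* Tree is set up: publish its root for the crash kernel. */
	fake_set_crash_p2(root_pa);
	return 0;
}
```

With this ordering, a boot with KHO disabled leaves P2 at zero regardless of
what the previous kernel stored there.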
> }
>
> err = restore_tree();
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260528004204.1484584-1-jloeser@linux.microsoft.com?part=18
* [RFC PATCH 19/20] kexec: export kexec_in_progress for modules
2026-05-28 0:41 [RFC PATCH 00/20] mshv: enable kexec with Hyper-V donated pages and partitions Jork Loeser
` (17 preceding siblings ...)
2026-05-28 0:42 ` [RFC PATCH 18/20] mshv: Exclude Hyper-V donated pages from crash dump collection Jork Loeser
@ 2026-05-28 0:42 ` Jork Loeser
2026-05-28 0:42 ` [RFC PATCH 20/20] mshv: freeze and vacuum partitions across kexec Jork Loeser
19 siblings, 0 replies; 37+ messages in thread
From: Jork Loeser @ 2026-05-28 0:42 UTC (permalink / raw)
To: linux-hyperv, linux-mm, kexec
Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Mike Rapoport, Pasha Tatashin, Pratyush Yadav, Alexander Graf,
Jason Miu, Andrew Morton, David Hildenbrand, Muchun Song,
Oscar Salvador, Baoquan He, Catalin Marinas, Will Deacon,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Kees Cook, Ran Xiaokai, Justinien Bouron,
Sourabh Jain, Pingfan Liu, Rafael J. Wysocki, Mario Limonciello,
linux-arm-kernel, x86, linux-kernel, Michael Kelley, Jork Loeser
Modules that register reboot notifiers may need to distinguish a
kexec from a regular reboot. Export kexec_in_progress so they can
check without requiring a built-in wrapper.
Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com>
---
kernel/kexec_core.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index a43d2da0fe3e..68efbba52fbd 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -53,6 +53,7 @@ atomic_t __kexec_lock = ATOMIC_INIT(0);
/* Flag to indicate we are going to kexec a new kernel */
bool kexec_in_progress = false;
+EXPORT_SYMBOL_GPL(kexec_in_progress);
bool kexec_file_dbg_print;
--
2.43.0
* [RFC PATCH 20/20] mshv: freeze and vacuum partitions across kexec
2026-05-28 0:41 [RFC PATCH 00/20] mshv: enable kexec with Hyper-V donated pages and partitions Jork Loeser
` (18 preceding siblings ...)
2026-05-28 0:42 ` [RFC PATCH 19/20] kexec: export kexec_in_progress for modules Jork Loeser
@ 2026-05-28 0:42 ` Jork Loeser
2026-05-28 2:11 ` sashiko-bot
19 siblings, 1 reply; 37+ messages in thread
From: Jork Loeser @ 2026-05-28 0:42 UTC (permalink / raw)
To: linux-hyperv, linux-mm, kexec
Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Mike Rapoport, Pasha Tatashin, Pratyush Yadav, Alexander Graf,
Jason Miu, Andrew Morton, David Hildenbrand, Muchun Song,
Oscar Salvador, Baoquan He, Catalin Marinas, Will Deacon,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Kees Cook, Ran Xiaokai, Justinien Bouron,
Sourabh Jain, Pingfan Liu, Rafael J. Wysocki, Mario Limonciello,
linux-arm-kernel, x86, linux-kernel, Michael Kelley, Jork Loeser
Before kexec the kernel must ensure no VMs are actively running so that
no VP modifies VM-memory that Linux will re-use post-kexec, and record
the partition IDs so the next kernel can clean them up.
Add mshv_freeze_and_get_partition_ids() which:
- Sets a global frozen flag to block new partition and VP creation
- Prevents (re-)dispatching of existing VPs
- Kicks all VPs to exit their dispatch loops, then waits for each to
finish by acquiring its mutex
- Tears down doorbell ports owned by the parent partition, which
otherwise survive kexec and cause port ID collisions in the new kernel
- Collects all partition IDs into a kho_alloc_preserve()'d array
The freeze is triggered from the reboot notifier callback. The ID
array is serialized into the KHO FDT so the next kernel can retrieve
it via mshv_retrieve_frozen_partition_ids().
After kexec the previous kernel's partitions are still alive in the
hypervisor. vacuum_stale_partitions() retrieves their IDs from the
KHO-preserved FDT and tears each one down (finalize, withdraw
deposited memory, delete) so the pages become available to the new
kernel.
Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com>
---
drivers/hv/mshv_page_preserve.c | 100 +++++++++-
drivers/hv/mshv_page_preserve.h | 3 +
drivers/hv/mshv_root.h | 2 +
drivers/hv/mshv_root_main.c | 339 +++++++++++++++++++++++++++++---
4 files changed, 417 insertions(+), 27 deletions(-)
diff --git a/drivers/hv/mshv_page_preserve.c b/drivers/hv/mshv_page_preserve.c
index e16fb946790d..dba7975ab058 100644
--- a/drivers/hv/mshv_page_preserve.c
+++ b/drivers/hv/mshv_page_preserve.c
@@ -24,6 +24,8 @@
static void *fdt_page;
static struct kho_radix_tree preserved_pages_tree;
+static u64 *frozen_partition_ids;
+static unsigned int nr_frozen_partition_ids;
/**
* mshv_register_preserve_page() - Register a page to be preserved by KHO
@@ -78,7 +80,7 @@ static int preserve_table_cb(phys_addr_t phys, void *data)
return kho_preserve_pages(phys_to_page(phys), 1);
}
-static int create_fdt(void)
+static int create_fdt(u64 *partition_ids, unsigned int nr_partition_ids)
{
int err;
void *fdt;
@@ -106,6 +108,19 @@ static int create_fdt(void)
err = fdt_property(fdt, "root_table", &root_table, sizeof(root_table));
if (err)
return err;
+ if (nr_partition_ids) {
+ phys_addr_t ids_pa = virt_to_phys(partition_ids);
+ u32 count = nr_partition_ids;
+
+ err = fdt_property(fdt, "partition_ids", &ids_pa,
+ sizeof(ids_pa));
+ if (err)
+ return err;
+ err = fdt_property(fdt, "nr_partition_ids", &count,
+ sizeof(count));
+ if (err)
+ return err;
+ }
err = fdt_end_node(fdt);
if (err)
return err;
@@ -118,6 +133,8 @@ static int create_fdt(void)
/**
* preserve_tree() - Preserve pages owned by Microsoft Hypervisor
+ * @partition_ids: array of frozen partition IDs to serialize, or NULL
+ * @nr_partition_ids: number of entries in @partition_ids
*
* This gets called prior to kexec and is our signal to finally preserve the
* pages with KHO, and create & register the named FDT. We also need to freeze
@@ -125,7 +142,7 @@ static int create_fdt(void)
*
* Return: 0 on success, -errno on error.
*/
-static int preserve_tree(void)
+static int preserve_tree(u64 *partition_ids, unsigned int nr_partition_ids)
{
const struct kho_radix_walk_cb preserve_cb = {
.key = preserve_key_cb,
@@ -141,7 +158,7 @@ static int preserve_tree(void)
}
/* Populate the pre-allocated FDT page with current tree state */
- err = create_fdt();
+ err = create_fdt(partition_ids, nr_partition_ids);
if (err) {
pr_warn("%s() - create_fdt() failed: %d\n", __func__, err);
return err;
@@ -177,6 +194,11 @@ static int preserve_tree(void)
/*
* Reboot-callback triggering page preservation prior to kexec. Other reboots
* need no KHO preservation.
+ *
+ * The mshv_root module's higher-priority reboot notifier freezes all VPs
+ * and hands off partition IDs via mshv_set_frozen_partition_ids() before
+ * this callback runs. If the module is not loaded, no partitions exist
+ * and the tree is preserved without partition IDs.
*/
static int reboot_cb(struct notifier_block *nb, unsigned long action,
void *data)
@@ -185,9 +207,9 @@ static int reboot_cb(struct notifier_block *nb, unsigned long action,
if (kexec_in_progress) {
int err;
- /* Finalize handover: write KHO descriptors, flush metadata */
pr_debug("%s() - KHO-preserving page tree\n", __func__);
- err = preserve_tree();
+ err = preserve_tree(frozen_partition_ids,
+ nr_frozen_partition_ids);
if (err)
panic("preserve_tree() failed - must not kexec: %d\n",
err);
@@ -260,6 +282,74 @@ static int __init restore_tree(void)
return 0;
}
+/**
+ * mshv_set_frozen_partition_ids() - Hand off frozen partition IDs for KHO
+ * @ids: kho_alloc_preserve()'d array of partition IDs, or NULL
+ * @nr: number of entries in @ids
+ *
+ * Called by the mshv_root module's reboot notifier (which runs at higher
+ * priority) to pass the frozen partition ID list to the built-in page
+ * preservation code before it serializes the KHO FDT.
+ */
+void mshv_set_frozen_partition_ids(u64 *ids, unsigned int nr)
+{
+ frozen_partition_ids = ids;
+ nr_frozen_partition_ids = nr;
+}
+EXPORT_SYMBOL_GPL(mshv_set_frozen_partition_ids);
+
+/**
+ * mshv_retrieve_frozen_partition_ids() - Retrieve frozen partition IDs
+ * @partition_ids: receives pointer to the preserved ID array, or NULL
+ * @nr_ids: receives the number of entries, or 0
+ *
+ * Counterpart to mshv_freeze_and_get_partition_ids(). Reads the partition
+ * ID list from the KHO-preserved FDT. The returned pointer (if non-NULL)
+ * refers to kho_alloc_preserve()'d memory from the previous kernel.
+ *
+ * Return: 0 on success (including when no IDs are found), negative errno on
+ * error.
+ */
+int mshv_retrieve_frozen_partition_ids(u64 **partition_ids,
+ unsigned int *nr_ids)
+{
+ int node, len;
+ const phys_addr_t *ids_pa;
+ const u32 *count_prop;
+
+ *partition_ids = NULL;
+ *nr_ids = 0;
+
+ if (!fdt_page)
+ return 0;
+
+ node = fdt_path_offset(fdt_page, "/");
+ if (node < 0)
+ return 0;
+
+ ids_pa = fdt_getprop(fdt_page, node, "partition_ids", &len);
+ if (!ids_pa)
+ return 0;
+
+ if (len != sizeof(*ids_pa)) {
+ pr_err("Malformed preserved FDT: invalid partition_ids property.\n");
+ return -EINVAL;
+ }
+
+ count_prop = fdt_getprop(fdt_page, node, "nr_partition_ids", &len);
+ if (!count_prop || len != sizeof(*count_prop)) {
+ pr_err("Malformed preserved FDT: invalid nr_partition_ids property.\n");
+ return -EINVAL;
+ }
+
+ *partition_ids = phys_to_virt(*ids_pa);
+ *nr_ids = *count_prop;
+
+ pr_info("Retrieved %u frozen partition ID(s) from KHO\n", *nr_ids);
+ return 0;
+}
+EXPORT_SYMBOL_GPL(mshv_retrieve_frozen_partition_ids);
+
/*
* Restore individual pages using KHO's helper during boot.
*
diff --git a/drivers/hv/mshv_page_preserve.h b/drivers/hv/mshv_page_preserve.h
index ac99b4e33285..4625d59a3070 100644
--- a/drivers/hv/mshv_page_preserve.h
+++ b/drivers/hv/mshv_page_preserve.h
@@ -14,5 +14,8 @@ int mshv_preserve_init(void);
int mshv_register_preserve_page(struct page *pg);
int mshv_unregister_preserve_page(struct page *pg);
int mshv_iterate_preserved(const struct kho_radix_walk_cb *cb, void *data);
+void mshv_set_frozen_partition_ids(u64 *ids, unsigned int nr);
+int mshv_retrieve_frozen_partition_ids(u64 **partition_ids,
+ unsigned int *nr_ids);
#endif /* _MSHV_PAGE_PRESERVE_H */
diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
index 216053f8e0ab..7476398d3b47 100644
--- a/drivers/hv/mshv_root.h
+++ b/drivers/hv/mshv_root.h
@@ -195,6 +195,7 @@ struct mshv_root {
spinlock_t pt_ht_lock;
DECLARE_HASHTABLE(pt_htable, MSHV_PARTITIONS_HASH_BITS);
struct hv_partition_property_vmm_capabilities vmm_caps;
+ bool frozen;
};
/*
@@ -364,6 +365,7 @@ static inline void mshv_debugfs_vp_remove(struct mshv_vp *vp) { }
extern struct mshv_root mshv_root;
extern enum hv_scheduler_type hv_scheduler_type;
+
extern u8 * __percpu *hv_synic_eventring_tail;
struct mshv_mem_region *mshv_region_create(u64 guest_pfn, u64 nr_pages,
diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index 5fbd01c12ab8..e95abf4698f8 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -18,6 +18,7 @@
#include <linux/anon_inodes.h>
#include <linux/mm.h>
#include <linux/io.h>
+#include <linux/cleanup.h>
#include <linux/cpuhotplug.h>
#include <linux/random.h>
#include <asm/mshyperv.h>
@@ -27,6 +28,7 @@
#include <linux/kexec.h>
#include <linux/page-flags.h>
#include <linux/crash_dump.h>
+#include <linux/kexec_handover.h>
#include <linux/panic_notifier.h>
#include <linux/vmalloc.h>
#include <linux/rseq.h>
@@ -359,6 +361,7 @@ mshv_suspend_vp(const struct mshv_vp *vp, bool *message_in_flight)
*/
static long mshv_run_vp_with_hyp_scheduler(struct mshv_vp *vp)
{
+ bool message_in_flight;
long ret;
struct hv_register_assoc suspend_regs[2] = {
{ .name = HV_REGISTER_INTERCEPT_SUSPEND },
@@ -375,32 +378,40 @@ static long mshv_run_vp_with_hyp_scheduler(struct mshv_vp *vp)
}
ret = wait_event_interruptible(vp->run.vp_suspend_queue,
- vp->run.kicked_by_hv == 1);
- if (ret) {
- bool message_in_flight;
+ vp->run.kicked_by_hv == 1 ||
+ READ_ONCE(mshv_root.frozen));
- /*
- * Otherwise the waiting was interrupted by a signal: suspend
- * the vCPU explicitly and copy message in flight (if any).
- */
- ret = mshv_suspend_vp(vp, &message_in_flight);
- if (ret)
- return ret;
+ /* Normal wakeup: intercept arrived */
+ if (!ret && !READ_ONCE(mshv_root.frozen)) {
+ vp->run.kicked_by_hv = 0;
+ return 0;
+ }
- /* Return if no message in flight */
- if (!message_in_flight)
- return -EINTR;
+ /*
+ * Signal or frozen: VP was resumed above and may still be
+ * running in the hypervisor. Suspend it before returning.
+ */
+ ret = mshv_suspend_vp(vp, &message_in_flight);
+ if (ret)
+ return ret;
- /* Wait for the message in flight. */
- wait_event(vp->run.vp_suspend_queue, vp->run.kicked_by_hv == 1);
- }
+ /* No in-flight message or frozen — nothing to deliver */
+ if (!message_in_flight || READ_ONCE(mshv_root.frozen))
+ return -EINTR;
+
+ /* Signal case: wait for the in-flight intercept message */
+ wait_event(vp->run.vp_suspend_queue,
+ vp->run.kicked_by_hv == 1 ||
+ READ_ONCE(mshv_root.frozen));
+
+ if (READ_ONCE(mshv_root.frozen))
+ return -EINTR;
/*
* Reset the flag to make the wait_event call above work
* next time.
*/
vp->run.kicked_by_hv = 0;
-
return 0;
}
@@ -503,7 +514,8 @@ mshv_vp_wait_for_hv_kick(struct mshv_vp *vp)
ret = wait_event_interruptible(vp->run.vp_suspend_queue,
(vp->run.kicked_by_hv == 1 &&
!mshv_vp_dispatch_thread_blocked(vp)) ||
- mshv_vp_interrupt_pending(vp));
+ mshv_vp_interrupt_pending(vp) ||
+ READ_ONCE(mshv_root.frozen));
if (ret)
return -EINTR;
@@ -513,6 +525,9 @@ mshv_vp_wait_for_hv_kick(struct mshv_vp *vp)
mshv_vp_dispatch_thread_blocked(vp),
mshv_vp_interrupt_pending(vp));
+ if (READ_ONCE(mshv_root.frozen))
+ return -EBUSY;
+
vp->run.flags.root_sched_blocked = 0;
vp->run.kicked_by_hv = 0;
@@ -539,6 +554,11 @@ static long mshv_run_vp_with_root_scheduler(struct mshv_vp *vp)
u32 flags = 0;
struct hv_output_dispatch_vp output;
+ if (READ_ONCE(mshv_root.frozen)) {
+ ret = -EBUSY;
+ break;
+ }
+
if (__xfer_to_guest_mode_work_pending()) {
ret = xfer_to_guest_mode_handle_work();
@@ -712,6 +732,11 @@ static long mshv_vp_ioctl_run_vp(struct mshv_vp *vp, void __user *ret_msg)
trace_mshv_run_vp_entry(vp->vp_partition->pt_id, vp->vp_index);
do {
+ if (READ_ONCE(mshv_root.frozen)) {
+ rc = -EBUSY;
+ break;
+ }
+
if (hv_scheduler_type == HV_SCHEDULER_TYPE_ROOT)
rc = mshv_run_vp_with_root_scheduler(vp);
else
@@ -1074,6 +1099,9 @@ mshv_partition_ioctl_create_vp(struct mshv_partition *partition,
struct hv_stats_page *stats_pages[2];
long ret;
+ if (READ_ONCE(mshv_root.frozen))
+ return -EBUSY;
+
if (copy_from_user(&args, arg, sizeof(args)))
return -EFAULT;
@@ -1762,6 +1790,201 @@ static void drain_all_vps(const struct mshv_partition *partition)
}
}
+/**
+ * mshv_freeze_and_get_partition_ids() - Freeze all partitions and collect IDs
+ * @partition_ids: on success, receives a kho_alloc_preserve()'d array of
+ * partition IDs; set to NULL on failure or when no partitions exist
+ * @nr_ids: on success, receives the number of entries in @partition_ids; set to
+ * 0 on failure or when no partitions exist
+ *
+ * Sets the global frozen flag to prevent creation of new partitions and
+ * (re-)dispatching of VPs. Kicks all VPs so they exit their dispatch loops,
+ * then waits for each VP to actually finish by acquiring its mutex.
+ *
+ * Must be called before kexec to ensure no VP modifies VM-memory that Linux
+ * will re-use post-kexec.
+ *
+ * Return: 0 on success, negative errno on failure. On failure, partitions
+ * and VPs are left in an undefined state — the caller must not proceed
+ * with kexec and should panic.
+ */
+static int
+mshv_freeze_and_get_partition_ids(u64 **partition_ids, unsigned int *nr_ids)
+{
+ unsigned int nr_alloc = 0, nr_ref = 0, nr_noref = 0;
+ struct mshv_partition *partition;
+ struct mshv_vp *vp;
+ int bkt, i;
+ u64 *ids;
+
+ *partition_ids = NULL;
+ *nr_ids = 0;
+
+ scoped_guard(spinlock, &mshv_root.pt_ht_lock)
+ mshv_root.frozen = true;
+
+ /*
+ * Count partitions to size the ID array. Frozen prevents new additions,
+ * so this is an upper bound.
+ */
+ scoped_guard(rcu)
+ hash_for_each_rcu(mshv_root.pt_htable, bkt, partition, pt_hnode)
+ nr_alloc++;
+
+ if (!nr_alloc) {
+ pr_info("Frozen 0 partition(s) for kexec\n");
+ return 0;
+ }
+
+ ids = kho_alloc_preserve(nr_alloc * sizeof(*ids));
+ if (IS_ERR(ids)) {
+ pr_err("Failed to allocate partition ID array for freeze\n");
+ return PTR_ERR(ids);
+ }
+
+ /*
+ * Record every partition's ID and obtain a reference for later use.
+ *
+ * Zero-refcount partitions (destroy_partition() in progress) still get
+ * their ID recorded — destruction may not complete before kexec, and
+ * the next kernel must clean them up. Their IDs are stored at the back
+ * of the array so the kick/drain phase can iterate only the ref'd
+ * prefix ids[0..nr_ref).
+ *
+ * VP kicking is deferred to the next phase where it happens under
+ * pt_mutex, which serializes against mshv_partition_ioctl_create_vp().
+ */
+ rcu_read_lock();
+ hash_for_each_rcu(mshv_root.pt_htable, bkt, partition, pt_hnode) {
+ if (!mshv_partition_get(partition)) {
+ /*
+ * Zero refcount — destroy_partition() is in progress.
+ * All fds are closed so no VP ioctl can be running.
+ * Store at the back; skip VP kicking.
+ */
+ ids[nr_alloc - 1 - nr_noref++] = partition->pt_id;
+ continue;
+ }
+
+ ids[nr_ref++] = partition->pt_id;
+ }
+ rcu_read_unlock();
+
+ /*
+ * For each ref'd partition, acquire and release pt_mutex as a barrier
+ * against any in-flight create_vp. After this, the frozen flag
+ * prevents new VPs from being created, so pt_vp_array is stable.
+ * Then kick all VPs and drain by acquiring each vp_mutex.
+ *
+ * Root scheduler: disable_vp_dispatch() sets
+ * HV_REGISTER_DISPATCH_SUSPEND, which causes any in-progress dispatch
+ * hypercall to return. This is safe regardless of VP state because the
+ * VP only executes while the kernel thread's dispatch hypercall is
+ * active — once it returns, the VP cannot run until re-dispatched,
+ * which the frozen check prevents.
+ *
+ * Hyp scheduler: the VP runs independently in the hypervisor and must
+ * be explicitly suspended from within its dispatch loop (via
+ * mshv_suspend_vp()) when the kernel thread detects the frozen flag.
+ * wake_up_all() unblocks the kernel thread so it can do so.
+ */
+ for (i = 0; i < nr_ref; i++) {
+ /* Ref held; partition stays in hash and alive outside RCU */
+ scoped_guard(rcu)
+ partition = mshv_partition_find(ids[i]);
+
+ /* Barrier: wait for any in-flight create_vp to complete */
+ scoped_guard(mutex, &partition->pt_mutex) {}
+
+ for (bkt = 0; bkt < MSHV_MAX_VPS; bkt++) {
+ vp = partition->pt_vp_array[bkt];
+ if (!vp)
+ continue;
+
+ if (hv_scheduler_type == HV_SCHEDULER_TYPE_ROOT)
+ disable_vp_dispatch(vp);
+
+ wake_up_all(&vp->run.vp_suspend_queue);
+ }
+
+ /*
+ * Wait for every VP to finish its current ioctl. Taking the VP
+ * mutex proves the VP is no longer inside run_vp.
+ *
+ * On Hyp-scheduler, prior mshv_suspend_vp() might have failed.
+ * Since it's idempotent, we can safely re-issue and fail kexec
+ * if suspend fails again. In this case, the caller is expected
+ * to panic, so cleanup is unnecessary.
+ */
+ for (bkt = 0; bkt < MSHV_MAX_VPS; bkt++) {
+ vp = partition->pt_vp_array[bkt];
+ if (!vp)
+ continue;
+
+ scoped_guard(mutex, &vp->vp_mutex) {
+ if (hv_scheduler_type != HV_SCHEDULER_TYPE_ROOT) {
+ bool mif;
+ int ret;
+
+ ret = mshv_suspend_vp(vp, &mif);
+ if (ret)
+ return ret;
+ }
+ }
+ }
+
+ /*
+ * Tear down doorbell ports owned by the parent partition.
+ * These survive child partition deletion and kexec, so the
+ * new kernel would collide on port IDs if we leave them.
+ */
+ mshv_eventfd_release(partition);
+
+ mshv_partition_put(partition);
+ }
+
+ /* Move non-ref'd IDs next to ref'd IDs to form a contiguous array */
+ if (nr_noref) {
+ memmove(&ids[nr_ref], &ids[nr_alloc - nr_noref],
+ nr_noref * sizeof(*ids));
+ }
+
+ *partition_ids = ids;
+ *nr_ids = nr_ref + nr_noref;
+
+ pr_info("Frozen %u partition(s) for kexec\n", nr_ref + nr_noref);
+ return 0;
+}
+
+/*
+ * Reboot notifier for the mshv_root module. Runs at higher priority than
+ * the built-in page-preservation notifier so that all VPs are frozen and
+ * partition IDs are handed off before the tree is serialized.
+ */
+static int mshv_root_reboot_cb(struct notifier_block *nb, unsigned long action,
+ void *data)
+{
+ if (kexec_in_progress) {
+ u64 *partition_ids;
+ unsigned int nr_partition_ids;
+ int err;
+
+ err = mshv_freeze_and_get_partition_ids(&partition_ids,
+ &nr_partition_ids);
+ if (err)
+ panic("mshv_freeze_and_get_partition_ids() failed - must not kexec: %d\n",
+ err);
+
+ mshv_set_frozen_partition_ids(partition_ids, nr_partition_ids);
+ }
+ return NOTIFY_OK;
+}
+
+static struct notifier_block mshv_root_reboot_notifier = {
+ .notifier_call = mshv_root_reboot_cb,
+ .priority = 1, /* higher than the built-in preserve notifier (0) */
+};
+
static void
remove_partition(struct mshv_partition *partition)
{
@@ -1911,13 +2134,27 @@ mshv_partition_release(struct inode *inode, struct file *filp)
static int
add_partition(struct mshv_partition *partition)
{
- spin_lock(&mshv_root.pt_ht_lock);
+ guard(spinlock)(&mshv_root.pt_ht_lock);
+
+ /*
+ * Reject new partitions once frozen. Note: there is a small window
+ * where a concurrent create-ioctl has already called
+ * hv_call_create_partition() but not yet reached here. If kexec fires
+ * during that window, the caller's error-path
+ * hv_call_delete_partition() may never execute and the empty partition
+ * leaks in the hypervisor.
+ *
+ * No pages are deposited at that point, so only the hypervisor-internal
+ * tracking is lost. Closing this fully would require reworking the
+ * entire mshv-locking logic so that the frozen check and the hypervisor
+ * create call happen atomically.
+ */
+ if (mshv_root.frozen)
+ return -EBUSY;
hash_add_rcu(mshv_root.pt_htable, &partition->pt_hnode,
partition->pt_id);
- spin_unlock(&mshv_root.pt_ht_lock);
-
return 0;
}
@@ -2316,6 +2553,55 @@ root_scheduler_deinit(void)
free_percpu(root_scheduler_output);
}
+/**
+ * vacuum_stale_partitions() - Tear down partitions left by a prior kernel.
+ * @dev: device for logging
+ *
+ * After kexec the previous kernel's partitions are still alive in the
+ * hypervisor. Retrieve their IDs from the KHO-preserved FDT and finalize,
+ * withdraw, and delete each one so the deposited pages return to the free pool.
+ */
+static void __init vacuum_stale_partitions(struct device *dev)
+{
+ u64 *ids;
+ unsigned int nr;
+ int i, err;
+
+ err = mshv_retrieve_frozen_partition_ids(&ids, &nr);
+ if (err) {
+ dev_err(dev, "Failed to retrieve stale partition IDs: %d\n",
+ err);
+ return;
+ }
+
+ for (i = 0; i < nr; i++) {
+ dev_info(dev, "Cleaning up stale partition %llu\n",
+ ids[i]);
+
+ err = hv_call_finalize_partition(ids[i]);
+ if (err == -EINVAL) {
+ dev_info(dev, "partition %llu already gone\n",
+ ids[i]);
+ continue;
+ }
+ if (err)
+ dev_warn(dev, "finalize partition %llu failed: %d\n",
+ ids[i], err);
+
+ err = hv_call_withdraw_memory(U64_MAX, NUMA_NO_NODE, ids[i]);
+ if (err)
+ dev_warn(dev, "withdraw memory %llu failed: %d\n",
+ ids[i], err);
+
+ err = hv_call_delete_partition(ids[i]);
+ if (err)
+ dev_warn(dev, "delete partition %llu failed: %d\n",
+ ids[i], err);
+ }
+
+ kho_restore_free(ids);
+}
+
static int __init mshv_init_vmm_caps(struct device *dev)
{
int ret;
@@ -2372,10 +2658,16 @@ static int __init mshv_parent_partition_init(void)
if (ret)
goto synic_cleanup;
- ret = root_scheduler_init(dev);
+ vacuum_stale_partitions(dev);
+
+ ret = register_reboot_notifier(&mshv_root_reboot_notifier);
if (ret)
goto synic_cleanup;
+ ret = root_scheduler_init(dev);
+ if (ret)
+ goto unregister_reboot;
+
ret = mshv_debugfs_init();
if (ret)
goto deinit_root_scheduler;
@@ -2395,6 +2687,8 @@ static int __init mshv_parent_partition_init(void)
mshv_debugfs_exit();
deinit_root_scheduler:
root_scheduler_deinit();
+unregister_reboot:
+ unregister_reboot_notifier(&mshv_root_reboot_notifier);
synic_cleanup:
mshv_synic_exit();
device_deregister:
@@ -2410,6 +2704,7 @@ static void __exit mshv_parent_partition_exit(void)
misc_deregister(&mshv_dev);
mshv_irqfd_wq_cleanup();
root_scheduler_deinit();
+ unregister_reboot_notifier(&mshv_root_reboot_notifier);
mshv_synic_exit();
}
--
2.43.0
* Re: [RFC PATCH 20/20] mshv: freeze and vacuum partitions across kexec
2026-05-28 0:42 ` [RFC PATCH 20/20] mshv: freeze and vacuum partitions across kexec Jork Loeser
@ 2026-05-28 2:11 ` sashiko-bot
0 siblings, 0 replies; 37+ messages in thread
From: sashiko-bot @ 2026-05-28 2:11 UTC (permalink / raw)
To: Jork Loeser; +Cc: linux-hyperv
Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:
- [Critical] Concurrent kfree in partition creation error path causes UAF or NULL pointer dereference in mshv_freeze_and_get_partition_ids().
- [High] KHO serialization APIs are invoked without verifying if the KHO subsystem is enabled.
- [High] Lockless call to mshv_eventfd_release() races with concurrent ioctls, causing list corruption and double-frees.
--
commit 1d32c3ff3fb719ececcc6529a01a0e1cb8e01bc9
Author: Jork Loeser <jloeser@linux.microsoft.com>
mshv: freeze and vacuum partitions across kexec
This commit introduces logic to freeze VM partitions prior to a kexec,
ensuring no VP modifies VM-memory reused by the new kernel. It preserves
the partition IDs using the KHO subsystem so the successor kernel can
retrieve and clean them up. Finally, it provides a mechanism to vacuum
stale partitions left over from a previous kernel.
> diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> --- a/drivers/hv/mshv_root_main.c
> +++ b/drivers/hv/mshv_root_main.c
> @@ -1762,6 +1790,201 @@ static void drain_all_vps(const struct mshv_partition *partition)
[ ... ]
> + if (!nr_alloc) {
> + pr_info("Frozen 0 partition(s) for kexec\n");
> + return 0;
> + }
> +
> + ids = kho_alloc_preserve(nr_alloc * sizeof(*ids));
[Severity: High]
Should we verify kho_is_enabled() before calling KHO serialization APIs
like kho_alloc_preserve() here and kho_add_subtree() during preservation?
If a standard kexec is performed (with kexec_in_progress true but KHO
disabled), can calling these KHO serialization APIs lead to a NULL pointer
dereference on kho_out.fdt?
> + if (IS_ERR(ids)) {
[ ... ]
> + for (i = 0; i < nr_ref; i++) {
> + /* Ref held; partition stays in hash and alive outside RCU */
> + scoped_guard(rcu)
> + partition = mshv_partition_find(ids[i]);
> +
> + /* Barrier: wait for any in-flight create_vp to complete */
> + scoped_guard(mutex, &partition->pt_mutex) {}
[Severity: Critical]
Can mshv_partition_find() return NULL here if a partition creation error
path concurrently unlinks and unconditionally frees the partition?
If mshv_freeze_and_get_partition_ids() obtains a reference in the first
loop, but the error path ignores the reference count and frees the
partition anyway, would mshv_partition_find() return NULL once the
partition has been removed from the hash table?
Acquiring &partition->pt_mutex would then dereference that NULL pointer and
panic.
Additionally, if a concurrent kfree ignores the reference count, could any
later access through the held reference be a use-after-free?
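As a rough illustration of the defensive half of this, the sketch below
models a lookup that can fail and a caller that checks the result before
using it. It is a userspace model under stated assumptions: toy_partition,
toy_find() and toy_barrier() are hypothetical names, and the barrier_hits
counter merely stands in for taking pt_mutex. A NULL check alone would not
address the use-after-free question, which depends on the error path
honouring the reference count.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

struct toy_partition {
	uint64_t id;
	int linked;		/* non-zero while present in the "hash table" */
	int barrier_hits;	/* stands in for a pt_mutex acquire/release */
};

/* Returns NULL when the ID is absent or the entry has been unlinked. */
static struct toy_partition *toy_find(struct toy_partition *tbl, size_t n,
				      uint64_t id)
{
	for (size_t i = 0; i < n; i++)
		if (tbl[i].linked && tbl[i].id == id)
			return &tbl[i];
	return NULL;
}

/* Check the lookup result before touching the partition. */
static int toy_barrier(struct toy_partition *tbl, size_t n, uint64_t id)
{
	struct toy_partition *p = toy_find(tbl, n, id);

	if (!p)
		return -ENOENT;

	p->barrier_hits++;
	return 0;
}
```

In the patch, the caller panics on any error from
mshv_freeze_and_get_partition_ids(), so a failed lookup would still stop the
kexec, but with a diagnosable error instead of a NULL dereference.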
[ ... ]
> + /*
> + * Tear down doorbell ports owned by the parent partition.
> + * These survive child partition deletion and kexec, so the
> + * new kernel would collide on port IDs if we leave them.
> + */
> + mshv_eventfd_release(partition);
[Severity: High]
Is it safe to call mshv_eventfd_release() here without holding
partition->pt_mutex?
If mshv_root.frozen is not checked in the ioeventfd ioctl dispatch path,
could userspace concurrently issue ioctls that modify the ioeventfds_list
while this function is moving it, resulting in list corruption or
double-frees?
> +
> + mshv_partition_put(partition);
> + }
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260528004204.1484584-1-jloeser@linux.microsoft.com?part=20