Re: [RFC PATCH 01/20] kho: generalize radix tree APIs

Linux-HyperV List
 help / color / mirror / Atom feed

From: sashiko-bot@kernel.org
To: "Jork Loeser" <jloeser@linux.microsoft.com>
Cc: linux-hyperv@vger.kernel.org
Subject: Re: [RFC PATCH 01/20] kho: generalize radix tree APIs
Date: Thu, 28 May 2026 01:22:52 +0000	[thread overview]
Message-ID: <20260528012253.489281F000E9@smtp.kernel.org> (raw)
In-Reply-To: <20260528004204.1484584-2-jloeser@linux.microsoft.com>

Thank you for your contribution! Sashiko AI review found 8 potential issue(s) to consider:

New issues:
- [High] Refactoring of scratch_size_update() skips CMA alignment for fixed scratch sizes.
- [High] Unsigned integer underflow when calculating scratch_size_global for unaligned lowmem sizes.
- [High] The generalized radix tree API silently truncates bits from generic 64-bit keys.
- [Low] Unnecessary duplication of __kho_preserve_pages_order() logic in kho_preserve_pages().

Pre-existing issues:
- [Critical] Heap buffer overflow in radix tree node allocation on 32-bit platforms with 4-byte phys_addr_t.
- [High] Sleeping memory allocation (GFP_KERNEL) is performed while holding a mutex required in reclaim paths.
- [High] Bitwise shift operations using a 32-bit signed integer literal '1' overflow for large memory preservations.
- [High] Lack of bounds checking on reconstructed radix tree keys causes an underflow of 'order', leading to severe memory corruption.
--

commit 8b45214cacd5b180aa13f584e115cd95e44f26d2
Author: Pratyush Yadav (Google) <pratyush@kernel.org>

kho: generalize radix tree APIs

This patch generalizes the KHO radix tree APIs to track arbitrary keys directly
instead of hardcoding page tracking operations (PFN and order). It renames the
functions to kho_radix_add_key() and kho_radix_del_key() and adjusts the
callback structures accordingly.

> diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c
> index 4834a809985ab..05a6eb56e1766 100644
> --- a/kernel/liveupdate/kexec_handover.c
> +++ b/kernel/liveupdate/kexec_handover.c
[ ... ]
> @@ -106,17 +106,17 @@ static unsigned long kho_radix_encode_key(phys_addr_t phys, unsigned int order)
>  }
>  
>  /**
> - * kho_radix_decode_key - Decodes a radix key back into a physical address and order.
> + * kho_decode_radix_key - Decodes a radix key back into a physical address and order.
>   * @key: The unsigned long key to decode.
>   * @order: An output parameter, a pointer to an unsigned int where the decoded
>   *         page order will be stored.
>   *
> - * This function reverses the encoding performed by kho_radix_encode_key(),
> + * This function reverses the encoding performed by kho_encode_radix_key(),
>   * extracting the original physical address and page order from a given key.
>   *
>   * Return: The decoded physical address.
>   */
> -static phys_addr_t kho_radix_decode_key(unsigned long key, unsigned int *order)
> +static phys_addr_t kho_decode_radix_key(unsigned long key, unsigned int *order)
>  {
>  	unsigned int order_bit = fls64(key);
>  	phys_addr_t phys;
>  
>  	/* order_bit is numbered starting at 1 from fls64 */
>  	*order = KHO_ORDER_0_LOG2 - order_bit + 1;

[Severity: High]
This isn't a bug introduced by this patch, but does this calculation risk an
unsigned integer underflow?

If a corrupted tree (for instance, an FDT from a crashing kernel) provides
entries at inappropriately high indices, the reconstructed key in
__kho_radix_walk_tree() might have bits set above KHO_ORDER_0_LOG2.

If fls64(key) returns an order_bit greater than KHO_ORDER_0_LOG2 + 1, this
calculation will underflow and produce an enormous order. This value is later
used in loops bounded by (1UL << order), which could lead to out-of-bounds
iteration.

[ ... ]
> @@ -144,24 +144,21 @@ static unsigned long kho_radix_get_table_index(unsigned long key,
>  }
>  
>  /**
> - * kho_radix_add_page - Marks a page as preserved in the radix tree.
> + * kho_radix_add_key - Add a key to the radix tree.
>   * @tree: The KHO radix tree.
> - * @pfn: The page frame number of the page to preserve.
> - * @order: The order of the page.
> + * @key: The key to add.
>   *
> - * This function traverses the radix tree based on the key derived from @pfn
> - * and @order. It sets the corresponding bit in the leaf bitmap to mark the
> - * page for preservation. If intermediate nodes do not exist along the path,
> - * they are allocated and added to the tree.
> + * This function traverses the radix tree based on the key provided. It sets the
> + * corresponding bit in the leaf bitmap to mark the key as present. If
> + * intermediate nodes do not exist along the path, they are allocated and added
> + * to the tree.
>   *
>   * Return: 0 on success, or a negative error code on failure.
>   */
> -int kho_radix_add_page(struct kho_radix_tree *tree,
> -		       unsigned long pfn, unsigned int order)
> +int kho_radix_add_key(struct kho_radix_tree *tree, unsigned long key)

[Severity: High]
Does this new arbitrary key API truncate bits from generic 64-bit keys?

The KHO radix tree depth and shifting logic appear to be statically sized
based on physical address and order bounds. When an arbitrary 64-bit key
is passed here, any bits that exceed the maximum addressable depth might
be silently ignored (e.g., bits shifted out or lost in the modulo operations
inside kho_radix_get_table_index()).

Could this cause distinct large 64-bit keys to silently alias to the same
path and overwrite each other?

>  {
>  	/* Newly allocated nodes for error cleanup */
>  	struct kho_radix_node *intermediate_nodes[KHO_TREE_MAX_DEPTH] = { 0 };
> -	unsigned long key = kho_radix_encode_key(PFN_PHYS(pfn), order);
>  	struct kho_radix_node *anchor_node = NULL;

[Severity: Critical]
This is a pre-existing issue, but can the node allocation later in this
function cause an out-of-bounds heap access on 32-bit platforms where
sizeof(phys_addr_t) is 4?

The struct kho_radix_node array size is based on KHO_TABLE_SIZE_LOG2,
which is calculated as ilog2(PAGE_SIZE / sizeof(phys_addr_t)).
On 32-bit systems, this evaluates to 1024 elements of u64, requiring
8192 bytes. However, the allocation relies on get_zeroed_page():

	new_node = (struct kho_radix_node *)get_zeroed_page(GFP_KERNEL);

This only allocates a single 4096-byte page. Could writes to indices >= 512,
or walking the tree via __kho_radix_walk_tree(), overflow this allocation?

[Severity: High]
This is also a pre-existing issue, but can using GFP_KERNEL for the node
allocation lead to an ABBA deadlock?

This allocation happens while holding tree->lock:

kho_radix_add_key()
    mutex_lock(&tree->lock)
    ...
    get_zeroed_page(GFP_KERNEL)

Since GFP_KERNEL can sleep and trigger direct memory reclaim, if a page cache
folio preserved for KHO is evicted, the filesystem's eviction path could call
kho_unpreserve_folio().

That function would attempt to acquire the exact same tree->lock, resulting
in a deadlock. Should this allocation be moved outside the lock or downgraded
to GFP_NOWAIT?

[ ... ]
> @@ -484,13 +474,16 @@ static struct page *__init kho_get_preserved_page(phys_addr_t phys,
>  	return pfn_to_page(pfn);
>  }
>  
> -static int __init kho_preserved_memory_reserve(phys_addr_t phys,
> -					       unsigned int order)
> +static int __init kho_preserved_memory_reserve(unsigned long key)
>  {
>  	union kho_page_info info;
>  	struct page *page;
> +	unsigned int order;
> +	phys_addr_t phys;
>  	u64 sz;
>  
> +	phys = kho_decode_radix_key(key, &order);
> +
>  	sz = 1 << (order + PAGE_SHIFT);

[Severity: High]
This is a pre-existing issue, but will this bitwise shift overflow for large
memory preservations?

Since the literal 1 is a 32-bit signed integer, a shift amount of 31 or
greater (which can happen for contiguous preservations of 2GB or more, like
hugepages or CMA areas) results in a signed integer overflow.

This would produce a negative 32-bit value that is sign-extended into the
64-bit sz variable, leading to incorrect memblock reservations.
Does this need to use 1UL or 1ULL?

[ ... ]
> @@ -618,30 +611,20 @@ early_param("kho_scratch", kho_parse_scratch_size);
>  
>  static void __init scratch_size_update(void)
>  {
> -	/*
> -	 * If fixed sizes are not provided via command line, calculate them
> -	 * now.
> -	 */
> -	if (scratch_scale) {
> -		phys_addr_t size;
> +	phys_addr_t size;
>  
> -		size = memblock_reserved_kern_size(ARCH_LOW_ADDRESS_LIMIT,
> -						   NUMA_NO_NODE);
> -		size = size * scratch_scale / 100;
> -		scratch_size_lowmem = size;
> +	if (!scratch_scale)
> +		return;

[Severity: High]
Does this early return skip CMA alignment for fixed scratch sizes?

In the previous implementation, scratch_size_lowmem and scratch_size_global
were unconditionally aligned using round_up(..., CMA_MIN_ALIGNMENT_BYTES)
at the end of the function, regardless of whether scratch_scale was set.

By returning early when scratch_scale is 0 (which occurs when fixed sizes
are provided via the command line), the alignment step is bypassed.
Will this cause CMA initialization to fail when iterating over partial
pageblocks?

>  
> -		size = memblock_reserved_kern_size(MEMBLOCK_ALLOC_ANYWHERE,
> -						   NUMA_NO_NODE);
> -		size = size * scratch_scale / 100 - scratch_size_lowmem;
> -		scratch_size_global = size;
> -	}
> +	size = memblock_reserved_kern_size(ARCH_LOW_ADDRESS_LIMIT,
> +					   NUMA_NO_NODE);
> +	size = size * scratch_scale / 100;
> +	scratch_size_lowmem = round_up(size, CMA_MIN_ALIGNMENT_BYTES);
>  
> -	/*
> -	 * Scratch areas are released as MIGRATE_CMA. Round them up to the right
> -	 * size.
> -	 */
> -	scratch_size_lowmem = round_up(scratch_size_lowmem, CMA_MIN_ALIGNMENT_BYTES);
> -	scratch_size_global = round_up(scratch_size_global, CMA_MIN_ALIGNMENT_BYTES);
> +	size = memblock_reserved_kern_size(MEMBLOCK_ALLOC_ANYWHERE,
> +					   NUMA_NO_NODE);
> +	size = size * scratch_scale / 100 - scratch_size_lowmem;

[Severity: High]
Can this subtraction result in an unsigned integer underflow?

Because the round_up() for scratch_size_lowmem now occurs before the
calculation of scratch_size_global, scratch_size_lowmem may be larger
than the unaligned size (size * scratch_scale / 100).

On a system where all memory is below ARCH_LOW_ADDRESS_LIMIT, both calls to
memblock_reserved_kern_size() return the same value. The subtraction will
then subtract the rounded-up value from the unaligned value, resulting in
a negative number.

Since size is a phys_addr_t (unsigned), this will underflow to an enormous
positive value and likely lead to an unfulfillable CMA reservation attempt.

[ ... ]
> @@ -937,9 +922,19 @@ int kho_preserve_pages(struct page *page, unsigned long nr_pages)
>  	}
>  
>  	while (pfn < end_pfn) {
> -		unsigned int order = __kho_preserve_pages_order(pfn, end_pfn);
> +		unsigned int order =
> +			min(count_trailing_zeros(pfn), ilog2(end_pfn - pfn));
> +
> +		/*
> +		 * Make sure all the pages in a single preservation are in the
> +		 * same NUMA node. The restore machinery can not cope with a
> +		 * preservation spanning multiple NUMA nodes.
> +		 */
> +		while (pfn_to_nid(pfn) != pfn_to_nid(pfn + (1UL << order) - 1))
> +			order--;

[Severity: Low]
Is there a specific reason to inline and duplicate the logic from
__kho_preserve_pages_order() here?

The helper function __kho_preserve_pages_order() remains in the file and
is still actively used by __kho_unpreserve(). Duplicating this complex
order calculation and NUMA node alignment logic increases the risk of the
two implementations diverging in the future.

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260528004204.1484584-1-jloeser@linux.microsoft.com?part=1

next prev parent reply	other threads:[~2026-05-28  1:22 UTC|newest]

Thread overview: 37+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-28  0:41 [RFC PATCH 00/20] mshv: enable kexec with Hyper-V donated pages and partitions Jork Loeser
2026-05-28  0:41 ` [RFC PATCH 01/20] kho: generalize radix tree APIs Jork Loeser
2026-05-28  1:22   ` sashiko-bot [this message]
2026-05-28  0:41 ` [RFC PATCH 02/20] kho: store incoming radix tree in kho_in Jork Loeser
2026-05-28  1:08   ` sashiko-bot
2026-05-28  0:41 ` [RFC PATCH 03/20] kho: add a struct for radix callbacks Jork Loeser
2026-05-28  0:41 ` [RFC PATCH 04/20] kho: add callback for table pages Jork Loeser
2026-05-28  1:33   ` sashiko-bot
2026-05-28  0:41 ` [RFC PATCH 05/20] kho: add data argument to radix walk callback Jork Loeser
2026-05-28  1:11   ` sashiko-bot
2026-05-28  0:41 ` [RFC PATCH 06/20] kho: allow early-boot usage of the KHO radix tree Jork Loeser
2026-05-28  1:40   ` sashiko-bot
2026-05-28  0:41 ` [RFC PATCH 07/20] kho: allow destroying " Jork Loeser
2026-05-28  0:41 ` [RFC PATCH 08/20] kho: add kho_radix_init_tree() Jork Loeser
2026-05-28  1:21   ` sashiko-bot
2026-05-28  0:41 ` [RFC PATCH 09/20] memblock: introduce MEMBLOCK_KHO_SCRATCH_EXT Jork Loeser
2026-05-28  0:41 ` [RFC PATCH 10/20] kho: extended scratch Jork Loeser
2026-05-28  1:21   ` sashiko-bot
2026-05-28  0:41 ` [RFC PATCH 11/20] kho: return virtual address of mem_map Jork Loeser
2026-05-28  1:27   ` sashiko-bot
2026-05-28  0:41 ` [RFC PATCH 12/20] mm/hugetlb: make bootmem allocation work with KHO Jork Loeser
2026-05-28  1:06   ` sashiko-bot
2026-05-28  0:41 ` [RFC PATCH 13/20] kho: add radix tree freeze and del_key() error reporting Jork Loeser
2026-05-28  1:34   ` sashiko-bot
2026-05-28  0:41 ` [RFC PATCH 14/20] kho: Add crash-kernel-safe radix tree presence check Jork Loeser
2026-05-28  1:27   ` sashiko-bot
2026-05-28  0:41 ` [RFC PATCH 15/20] mshv: Use page tracker to manage MSHV-owned pages and preserve with KHO Jork Loeser
2026-05-28  1:41   ` sashiko-bot
2026-05-28  0:41 ` [RFC PATCH 16/20] mshv: Add debugfs interface to page tracker Jork Loeser
2026-05-28  1:48   ` sashiko-bot
2026-05-28  0:41 ` [RFC PATCH 17/20] hyperv: Reserve crash MSR P2 for page preservation root PA Jork Loeser
2026-05-28  1:34   ` sashiko-bot
2026-05-28  0:42 ` [RFC PATCH 18/20] mshv: Exclude Hyper-V donated pages from crash dump collection Jork Loeser
2026-05-28  2:13   ` sashiko-bot
2026-05-28  0:42 ` [RFC PATCH 19/20] kexec: export kexec_in_progress for modules Jork Loeser
2026-05-28  0:42 ` [RFC PATCH 20/20] mshv: freeze and vacuum partitions across kexec Jork Loeser
2026-05-28  2:11   ` sashiko-bot

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260528012253.489281F000E9@smtp.kernel.org \
    --to=sashiko-bot@kernel.org \
    --cc=jloeser@linux.microsoft.com \
    --cc=linux-hyperv@vger.kernel.org \
    --cc=sashiko-reviews@lists.linux.dev \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox