From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-alma10-1.taild15c8.ts.net [100.103.45.18]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 157AF18DB26 for ; Thu, 28 May 2026 01:22:53 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=100.103.45.18 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1779931375; cv=none; b=qbIoeO2Jwq9UpDvipuf8ZolAIa63QvBwYb4VStnMQd1TEbyauBlh0F8cwc2VwkKp95KaKmUcJojW7NAElv4grVPpi88oB/RTMVttGRg9AUDNeJc61v4luFA0gF0jK4smBkBINu4NO5PD9c/0yyrwjGT+6JDHm+LxnkpWehTLoxc= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1779931375; c=relaxed/simple; bh=IgM4Lco3GPKlxn0aJQgMPI/S1WBqlk46NH0GlPvYcKk=; h=From:Subject:To:Cc:In-Reply-To:References:Content-Type:Date: Message-Id; b=WuLyCR7uPR50VFAx+4wIRYreyEZ4qL9pNhXQyO6TL7xY9Akvi5n0GzzgMRxUuwQoKuwSMckVtLL7HUlL5uyK7zjLR35h1NvYZtc66c0w1pgxQKgdvAdFVCbB6Y+6Z/xDk7nlS+WmPSPEzDatIvpP2c5lt6veoUwU9QrTncvzKBY= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=gCICs0go; arc=none smtp.client-ip=100.103.45.18 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="gCICs0go" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 489281F000E9; Thu, 28 May 2026 01:22:53 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kernel.org; s=k20260515; t=1779931373; bh=QIB3FmLC4GJwDLNqKg+NKKmx1gqh1vWSAiX4s01/vT0=; h=From:Subject:Reply-To:To:Cc:In-Reply-To:References:Date; b=gCICs0goZ/CZTYD9UGnq2L02pnwBl45H6j5pz58egH6QqA3FncatvX+0pHtXUkpuF AJ8aF08WIQwCpjxXXswZjDuC9vYYiQjITOyt5w3XM4KdpchcFAHMqrXwuej0vb+yIM 7A6s3l7za6h0Gf40+fXUBPKhITcLb7PRPrSA7zgiQ09gkp//cGUZlIKotysgg5DGx2 4NY987yU3PXVWhXg4onHpQGGy45cm5FvLrZs9hknSnSv0UVdHROjm0H4qC5iV9Rv1n rvuiMd2rtInNgThVTQ73YnkXJT4z/qJ6Qq/vOLi2lhYeHgc8F5b+vAhir5+UjLHZ2M GRKEaAzqsB7fg== From: sashiko-bot@kernel.org Subject: Re: [RFC PATCH 01/20] kho: generalize radix tree APIs Reply-To: sashiko-reviews@lists.linux.dev To: "Jork Loeser" Cc: linux-hyperv@vger.kernel.org In-Reply-To: <20260528004204.1484584-2-jloeser@linux.microsoft.com> References: <20260528004204.1484584-2-jloeser@linux.microsoft.com> Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Date: Thu, 28 May 2026 01:22:52 +0000 Message-Id: <20260528012253.489281F000E9@smtp.kernel.org> Precedence: bulk X-Mailing-List: linux-hyperv@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Thank you for your contribution! Sashiko AI review found 8 potential issue(= s) to consider: New issues: - [High] Refactoring of scratch_size_update() skips CMA alignment for fixed= scratch sizes. - [High] Unsigned integer underflow when calculating scratch_size_global fo= r unaligned lowmem sizes. - [High] The generalized radix tree API silently truncates bits from generi= c 64-bit keys. - [Low] Unnecessary duplication of __kho_preserve_pages_order() logic in kh= o_preserve_pages(). Pre-existing issues: - [Critical] Heap buffer overflow in radix tree node allocation on 32-bit p= latforms with 4-byte phys_addr_t. - [High] Sleeping memory allocation (GFP_KERNEL) is performed while holding= a mutex required in reclaim paths. - [High] Bitwise shift operations using a 32-bit signed integer literal '1'= overflow for large memory preservations. - [High] Lack of bounds checking on reconstructed radix tree keys causes an= underflow of 'order', leading to severe memory corruption. -- commit 8b45214cacd5b180aa13f584e115cd95e44f26d2 Author: Pratyush Yadav (Google) kho: generalize radix tree APIs This patch generalizes the KHO radix tree APIs to track arbitrary keys dire= ctly instead of hardcoding page tracking operations (PFN and order). It renames = the functions to kho_radix_add_key() and kho_radix_del_key() and adjusts the callback structures accordingly. > diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec= _handover.c > index 4834a809985ab..05a6eb56e1766 100644 > --- a/kernel/liveupdate/kexec_handover.c > +++ b/kernel/liveupdate/kexec_handover.c [ ... ] > @@ -106,17 +106,17 @@ static unsigned long kho_radix_encode_key(phys_addr= _t phys, unsigned int order) > } > =20 > /** > - * kho_radix_decode_key - Decodes a radix key back into a physical addre= ss and order. > + * kho_decode_radix_key - Decodes a radix key back into a physical addre= ss and order. > * @key: The unsigned long key to decode. > * @order: An output parameter, a pointer to an unsigned int where the d= ecoded > * page order will be stored. > * > - * This function reverses the encoding performed by kho_radix_encode_key= (), > + * This function reverses the encoding performed by kho_encode_radix_key= (), > * extracting the original physical address and page order from a given = key. > * > * Return: The decoded physical address. > */ > -static phys_addr_t kho_radix_decode_key(unsigned long key, unsigned int = *order) > +static phys_addr_t kho_decode_radix_key(unsigned long key, unsigned int = *order) > { > unsigned int order_bit =3D fls64(key); > phys_addr_t phys; > =20 > /* order_bit is numbered starting at 1 from fls64 */ > *order =3D KHO_ORDER_0_LOG2 - order_bit + 1; [Severity: High] This isn't a bug introduced by this patch, but does this calculation risk an unsigned integer underflow? If a corrupted tree (for instance, an FDT from a crashing kernel) provides entries at inappropriately high indices, the reconstructed key in __kho_radix_walk_tree() might have bits set above KHO_ORDER_0_LOG2. If fls64(key) returns an order_bit greater than KHO_ORDER_0_LOG2 + 1, this calculation will underflow and produce an enormous order. This value is lat= er used in loops bounded by (1UL << order), which could lead to out-of-bounds iteration. [ ... ] > @@ -144,24 +144,21 @@ static unsigned long kho_radix_get_table_index(unsi= gned long key, > } > =20 > /** > - * kho_radix_add_page - Marks a page as preserved in the radix tree. > + * kho_radix_add_key - Add a key to the radix tree. > * @tree: The KHO radix tree. > - * @pfn: The page frame number of the page to preserve. > - * @order: The order of the page. > + * @key: The key to add. > * > - * This function traverses the radix tree based on the key derived from = @pfn > - * and @order. It sets the corresponding bit in the leaf bitmap to mark = the > - * page for preservation. If intermediate nodes do not exist along the p= ath, > - * they are allocated and added to the tree. > + * This function traverses the radix tree based on the key provided. It = sets the > + * corresponding bit in the leaf bitmap to mark the key as present. If > + * intermediate nodes do not exist along the path, they are allocated an= d added > + * to the tree. > * > * Return: 0 on success, or a negative error code on failure. > */ > -int kho_radix_add_page(struct kho_radix_tree *tree, > - unsigned long pfn, unsigned int order) > +int kho_radix_add_key(struct kho_radix_tree *tree, unsigned long key) [Severity: High] Does this new arbitrary key API truncate bits from generic 64-bit keys? The KHO radix tree depth and shifting logic appear to be statically sized based on physical address and order bounds. When an arbitrary 64-bit key is passed here, any bits that exceed the maximum addressable depth might be silently ignored (e.g., bits shifted out or lost in the modulo operations inside kho_radix_get_table_index()). Could this cause distinct large 64-bit keys to silently alias to the same path and overwrite each other? > { > /* Newly allocated nodes for error cleanup */ > struct kho_radix_node *intermediate_nodes[KHO_TREE_MAX_DEPTH] =3D { 0 }; > - unsigned long key =3D kho_radix_encode_key(PFN_PHYS(pfn), order); > struct kho_radix_node *anchor_node =3D NULL; [Severity: Critical] This is a pre-existing issue, but can the node allocation later in this function cause an out-of-bounds heap access on 32-bit platforms where sizeof(phys_addr_t) is 4? The struct kho_radix_node array size is based on KHO_TABLE_SIZE_LOG2, which is calculated as ilog2(PAGE_SIZE / sizeof(phys_addr_t)). On 32-bit systems, this evaluates to 1024 elements of u64, requiring 8192 bytes. However, the allocation relies on get_zeroed_page(): new_node =3D (struct kho_radix_node *)get_zeroed_page(GFP_KERNEL); This only allocates a single 4096-byte page. Could writes to indices >=3D 5= 12, or walking the tree via __kho_radix_walk_tree(), overflow this allocation? [Severity: High] This is also a pre-existing issue, but can using GFP_KERNEL for the node allocation lead to an ABBA deadlock? This allocation happens while holding tree->lock: kho_radix_add_key() mutex_lock(&tree->lock) ... get_zeroed_page(GFP_KERNEL) Since GFP_KERNEL can sleep and trigger direct memory reclaim, if a page cac= he folio preserved for KHO is evicted, the filesystem's eviction path could ca= ll kho_unpreserve_folio(). That function would attempt to acquire the exact same tree->lock, resulting in a deadlock. Should this allocation be moved outside the lock or downgrad= ed to GFP_NOWAIT? [ ... ] > @@ -484,13 +474,16 @@ static struct page *__init kho_get_preserved_page(p= hys_addr_t phys, > return pfn_to_page(pfn); > } > =20 > -static int __init kho_preserved_memory_reserve(phys_addr_t phys, > - unsigned int order) > +static int __init kho_preserved_memory_reserve(unsigned long key) > { > union kho_page_info info; > struct page *page; > + unsigned int order; > + phys_addr_t phys; > u64 sz; > =20 > + phys =3D kho_decode_radix_key(key, &order); > + > sz =3D 1 << (order + PAGE_SHIFT); [Severity: High] This is a pre-existing issue, but will this bitwise shift overflow for large memory preservations? Since the literal 1 is a 32-bit signed integer, a shift amount of 31 or greater (which can happen for contiguous preservations of 2GB or more, like hugepages or CMA areas) results in a signed integer overflow. This would produce a negative 32-bit value that is sign-extended into the 64-bit sz variable, leading to incorrect memblock reservations. Does this need to use 1UL or 1ULL? [ ... ] > @@ -618,30 +611,20 @@ early_param("kho_scratch", kho_parse_scratch_size); > =20 > static void __init scratch_size_update(void) > { > - /* > - * If fixed sizes are not provided via command line, calculate them > - * now. > - */ > - if (scratch_scale) { > - phys_addr_t size; > + phys_addr_t size; > =20 > - size =3D memblock_reserved_kern_size(ARCH_LOW_ADDRESS_LIMIT, > - NUMA_NO_NODE); > - size =3D size * scratch_scale / 100; > - scratch_size_lowmem =3D size; > + if (!scratch_scale) > + return; [Severity: High] Does this early return skip CMA alignment for fixed scratch sizes? In the previous implementation, scratch_size_lowmem and scratch_size_global were unconditionally aligned using round_up(..., CMA_MIN_ALIGNMENT_BYTES) at the end of the function, regardless of whether scratch_scale was set. By returning early when scratch_scale is 0 (which occurs when fixed sizes are provided via the command line), the alignment step is bypassed. Will this cause CMA initialization to fail when iterating over partial pageblocks? > =20 > - size =3D memblock_reserved_kern_size(MEMBLOCK_ALLOC_ANYWHERE, > - NUMA_NO_NODE); > - size =3D size * scratch_scale / 100 - scratch_size_lowmem; > - scratch_size_global =3D size; > - } > + size =3D memblock_reserved_kern_size(ARCH_LOW_ADDRESS_LIMIT, > + NUMA_NO_NODE); > + size =3D size * scratch_scale / 100; > + scratch_size_lowmem =3D round_up(size, CMA_MIN_ALIGNMENT_BYTES); > =20 > - /* > - * Scratch areas are released as MIGRATE_CMA. Round them up to the right > - * size. > - */ > - scratch_size_lowmem =3D round_up(scratch_size_lowmem, CMA_MIN_ALIGNMENT= _BYTES); > - scratch_size_global =3D round_up(scratch_size_global, CMA_MIN_ALIGNMENT= _BYTES); > + size =3D memblock_reserved_kern_size(MEMBLOCK_ALLOC_ANYWHERE, > + NUMA_NO_NODE); > + size =3D size * scratch_scale / 100 - scratch_size_lowmem; [Severity: High] Can this subtraction result in an unsigned integer underflow? Because the round_up() for scratch_size_lowmem now occurs before the calculation of scratch_size_global, scratch_size_lowmem may be larger than the unaligned size (size * scratch_scale / 100). On a system where all memory is below ARCH_LOW_ADDRESS_LIMIT, both calls to memblock_reserved_kern_size() return the same value. The subtraction will then subtract the rounded-up value from the unaligned value, resulting in a negative number. Since size is a phys_addr_t (unsigned), this will underflow to an enormous positive value and likely lead to an unfulfillable CMA reservation attempt. [ ... ] > @@ -937,9 +922,19 @@ int kho_preserve_pages(struct page *page, unsigned l= ong nr_pages) > } > =20 > while (pfn < end_pfn) { > - unsigned int order =3D __kho_preserve_pages_order(pfn, end_pfn); > + unsigned int order =3D > + min(count_trailing_zeros(pfn), ilog2(end_pfn - pfn)); > + > + /* > + * Make sure all the pages in a single preservation are in the > + * same NUMA node. The restore machinery can not cope with a > + * preservation spanning multiple NUMA nodes. > + */ > + while (pfn_to_nid(pfn) !=3D pfn_to_nid(pfn + (1UL << order) - 1)) > + order--; [Severity: Low] Is there a specific reason to inline and duplicate the logic from __kho_preserve_pages_order() here? The helper function __kho_preserve_pages_order() remains in the file and is still actively used by __kho_unpreserve(). Duplicating this complex order calculation and NUMA node alignment logic increases the risk of the two implementations diverging in the future. --=20 Sashiko AI review =C2=B7 https://sashiko.dev/#/patchset/20260528004204.1484= 584-1-jloeser@linux.microsoft.com?part=3D1