From: Jason Gunthorpe <jgg@nvidia.com>
To: Jason Miu <jasonmiu@google.com>
Cc: Alexander Graf <graf@amazon.com>,
Andrew Morton <akpm@linux-foundation.org>,
Baoquan He <bhe@redhat.com>,
Changyuan Lyu <changyuanl@google.com>,
David Matlack <dmatlack@google.com>,
David Rientjes <rientjes@google.com>,
Mike Rapoport <rppt@kernel.org>,
Pasha Tatashin <pasha.tatashin@soleen.com>,
Pratyush Yadav <pratyush@kernel.org>,
kexec@lists.infradead.org, linux-kernel@vger.kernel.org,
linux-mm@kvack.org
Subject: Re: [PATCH v1 1/3] kho: Adopt KHO radix tree data structures
Date: Mon, 6 Oct 2025 11:14:44 -0300 [thread overview]
Message-ID: <20251006141444.GN3360665@nvidia.com> (raw)
In-Reply-To: <20251001011941.1513050-2-jasonmiu@google.com>
On Tue, Sep 30, 2025 at 06:19:39PM -0700, Jason Miu wrote:
> @@ -29,7 +30,7 @@
> #include "kexec_internal.h"
>
> #define KHO_FDT_COMPATIBLE "kho-v1"
We don't bump this?
> -#define PROP_PRESERVED_MEMORY_MAP "preserved-memory-map"
> +#define PROP_PRESERVED_PAGE_RADIX_TREE "preserved-page-radix-tree"
> #define PROP_SUB_FDT "fdt"
I'de really like to see all of these sorts of definitions in some
structured ABI header not open coded all over the place..
> /*
> + * The KHO radix tree tracks preserved memory pages. It is a hierarchical
> + * structure that starts with a single root `kho_radix_tree`. This single
> + * tree stores pages of all orders.
> + *
> + * This is achieved by encoding the page's physical address and its order into
> + * a single `unsigned long` value. This encoded value is then used to traverse
> + * the tree.
> + *
> + * The tree hierarchy is shown below:
> + *
> + * kho_radix_tree_root
> + * +-------------------+
> + * | Level 6 | (struct kho_radix_tree)
> + * +-------------------+
> + * |
> + * v
> + * +-------------------+
> + * | Level 5 | (struct kho_radix_tree)
> + * +-------------------+
> + * |
> + * | ... (intermediate levels)
> + * |
> + * v
> + * +-------------------+
> + * | Level 1 | (struct kho_bitmap_table)
> + * +-------------------+
> + *
> + * The following diagram illustrates how the encoded value is split into
> + * indices for the tree levels:
> *
> + * 63:60 59:51 50:42 41:33 32:24 23:15 14:0
> + * +---------+--------+--------+--------+--------+--------+-----------------+
> + * | 0 | Lv 6 | Lv 5 | Lv 4 | Lv 3 | Lv 2 | Lv 1 (bitmap) |
> + * +---------+--------+--------+--------+--------+--------+-----------------+
> *
> + * Each `kho_radix_tree` (Levels 2-6) and `kho_bitmap_table` (Level 1) is
> + * PAGE_SIZE. Each entry in a `kho_radix_tree` is a descriptor (a physical
> + * address) pointing to the next level node. For Level 2 `kho_radix_tree`
> + * nodes, these descriptors point to a `kho_bitmap_table`. The final
> + * `kho_bitmap_table` is a bitmap where each set bit represents a single
> + * preserved page.
Maybe a note that this is example is for PAGE_SIZE=4k.
> */
> +struct kho_radix_tree {
> + unsigned long table[PAGE_SIZE / sizeof(unsigned long)];
This should be phys_addr_t.
> +};
You dropped the macros so now we don't know these are actually
pointers to 'struct kho_radix_tree'
> +/*
> + * `kho_radix_tree_root` points to a page thats serves as the root of the
> + * KHO radix tree. This page is allocated during KHO module initialization.
> + * Its physical address is written to the FDT and passed to the next kernel
> + * during kexec.
> + */
> +static struct kho_radix_tree *kho_radix_tree_root;
> +static DECLARE_RWSEM(kho_radix_tree_root_sem);
> +
> +static int kho_radix_tree_max_depth(void)
> +{
> + int page_offset_bit_num = BITS_PER_LONG - PAGE_SHIFT;
> + int order_bit_num = ilog2(__roundup_pow_of_two(page_offset_bit_num));
> + int bitmap_bit_num = PAGE_SHIFT + ilog2(BITS_PER_BYTE);
> + int table_bit_num = ilog2(PAGE_SIZE / sizeof(unsigned long));
> + int table_level_num = DIV_ROUND_UP(page_offset_bit_num -
> + bitmap_bit_num + order_bit_num,
> + table_bit_num);
All should be unsigned int. Below I suggest to put it in an enum and
use different names.. And since the function is constant it can just
be an enum TOP_LEVEL too.
> +/*
> + * The KHO radix tree tracks preserved pages by encoding a page's physical
> + * address (pa) and its order into a single unsigned long value. This value
> + * is then used to traverse the tree. The encoded value is composed of two
> + * parts: the 'order bits' in the upper part and the 'page offset' in the
> + * lower part.
> + *
> + * <-- Higher Bits ------------------------------------ Lower Bits -->
> + * +--------------------------+-----------------------------------------+
> + * | Order Bits | Page Offset |
> + * +--------------------------+-----------------------------------------+
> + * | ... 0 0 1 0 0 ... | pa >> (PAGE_SHIFT + order) |
> + * +--------------------------+-----------------------------------------+
> + * ^
> + * |
> + * This single '1' bit's position
> + * uniquely identifies the 'order'.
> + *
> + *
> + * Page Offset:
> + * The 'page offset' is the physical address normalized for its order. It
> + * effectively represents the page offset for the given order.
> + *
> + * Order Bits:
> + * The 'order bits' encode the page order by setting a single bit at a
> + * specific position. The position of this bit itself represents the order.
> + *
> + * For instance, on a 64-bit system with 4KB pages (PAGE_SHIFT = 12), the
> + * maximum range for a page offset (for order 0) is 52 bits (64 - 12). This
> + * offset occupies bits [0-51]. For order 0, the order bit is set at
> + * position 52.
> + *
> + * As the order increases, the number of bits required for the 'page offset'
> + * decreases. For example, order 1 requires one less bit for its page
> + * offset. This allows its order bit to be set at position 51 without
> + * conflicting with the page offset bits.
> + *
> + * This scheme ensures that the single order bit is always in a higher
> + * position than any bit used by the page offset for that same order,
> + * preventing collisions.
Should explain why it is like this:
This scheme allows storing all the multi-order page sizes in a single
6 level table with a good sharing of lower tables levels for 0 top
address bits. A single algorithm can efficiently process everything.
> + */
> +static unsigned long kho_radix_encode(unsigned long pa, unsigned int order)
pa is phys_addr_t in the kernel, never unsigned long.
If you want to make it all dynamic then this should be phys_addr_t
> +{
> + unsigned long h = 1UL << (BITS_PER_LONG - PAGE_SHIFT - order);
And this BITS_PER_LONG is confused, it is BITS_PER_PHYS_ADDR_T which
may not exist.
Use an enum ORDER_0_LG2 maybe
> + unsigned long l = pa >> (PAGE_SHIFT + order);
>
> + return h | l;
> +}
>
> +static unsigned long kho_radix_decode(unsigned long encoded, unsigned int *order)
Returns phys_addr_t
> {
> - void *elm, *res;
> + unsigned long order_bit = fls64(encoded);
unsigned int
> + unsigned long pa;
phys_addr_t
> + *order = BITS_PER_LONG - PAGE_SHIFT - order_bit + 1;
ORDER_0_LG2
> + pa = encoded << (PAGE_SHIFT + *order);
I'd add a comment that the shift always discards order.
> + return pa;
> +}
>
> +static unsigned long kho_radix_get_index(unsigned long encoded, int level)
unsigned int level
> +{
> + int table_bit_num = ilog2(PAGE_SIZE / sizeof(unsigned long));
> + int bitmap_bit_num = PAGE_SHIFT + ilog2(BITS_PER_BYTE);
Stick all the constants that kho_radix_tree_max_depth() are computing
in an enum instead of recomputing them..
> + unsigned long mask;
> + int s;
unsigned for all of these.
> +
> + if (level == 1) {
I think the math is easier if level 0 == bitmap..
> + s = 0;
> + mask = (1UL << bitmap_bit_num) - 1;
> + } else {
> + s = ((level - 2) * table_bit_num) + bitmap_bit_num;
eg here you are doing level-2 which is a bit weird only because of the
arbitary choice to make level=1 be the bitmap.
I'd also use some different names
table_bit_num == TABLE_SIZE_LG2
BITMAP_BIT_NUM = BITMAP_SIZE_LG2
Log2 designates the value is 1<<LG2
> + mask = (1UL << table_bit_num) - 1;
> }
>
> - return elm;
> + return (encoded >> s) & mask;
It is just:
return encoded % (1 << BITMAP_SIZE_LG2);
return (encoded >> s) % (1 << TABLE_SIZE_LG2);
The compiler is smart enough to choose bit logic if that is the
fastest option and the above is more readable.
> +static int kho_radix_set_bitmap(struct kho_bitmap_table *bit_tlb, unsigned long offset)
> {
> + if (!bit_tlb ||
> + offset >= PAGE_SIZE * BITS_PER_BYTE)
> + return -EINVAL;
>
> + set_bit(offset, bit_tlb->bitmaps);
set_bit is an atomic, you want __set_bit()
> + return 0;
> +}
>
> +static int kho_radix_preserve_page(unsigned long pa, unsigned int order)
phys_addr_t
> +{
> + unsigned long encoded = kho_radix_encode(pa, order);
> + int num_tree_level = kho_radix_tree_max_depth();
kho_radix_tree_max_depth() is constant, stick it in an enum with the
rest of them.
> + struct kho_radix_tree *current_tree, *new_tree;
> + struct kho_bitmap_table *bitmap_table;
> + int err = 0;
> + int i, idx;
various unsigned int.
>
> + down_write(&kho_radix_tree_root_sem);
>
> + current_tree = kho_radix_tree_root;
>
> + /* Go from high levels to low levels */
> + for (i = num_tree_level; i >= 1; i--) {
> + idx = kho_radix_get_index(encoded, i);
> +
> + if (i == 1) {
> + bitmap_table = (struct kho_bitmap_table *)current_tree;
> + err = kho_radix_set_bitmap(bitmap_table, idx);
> + goto out;
> + }
> +
> + if (!current_tree->table[idx]) {
> + new_tree = kho_alloc_radix_tree();
> + if (!new_tree) {
> + err = -ENOMEM;
> + goto out;
> + }
> +
> + current_tree->table[idx] =
> + (unsigned long)virt_to_phys(new_tree);
current_tree = new_tree
> + }
else
> +
> + current_tree = (struct kho_radix_tree *)
> + phys_to_virt(current_tree->table[idx]);
> }
> +
> +out:
> + up_write(&kho_radix_tree_root_sem);
> + return err;
> }
>
> +static int kho_radix_walk_bitmaps(struct kho_bitmap_table *bit_tlb,
> + unsigned long offset,
phys_addr_t
> + kho_radix_tree_walk_callback_t cb)
> {
> + unsigned long encoded = offset << (PAGE_SHIFT + ilog2(BITS_PER_BYTE));
> + unsigned long *bitmap = (unsigned long *)bit_tlb;
> + int err = 0;
> + int i;
>
> + for_each_set_bit(i, bitmap, PAGE_SIZE * BITS_PER_BYTE) {
> + err = cb(encoded | i);
> + if (err)
> + return err;
> + }
>
> + return 0;
> +}
>
> +static int kho_radix_walk_trees(struct kho_radix_tree *root, int level,
unsigned int
> + unsigned long offset,
phys_addr_t. I would call this start not offset..
> + kho_radix_tree_walk_callback_t cb)
> +{
> + int level_shift = ilog2(PAGE_SIZE / sizeof(unsigned long));
> + struct kho_radix_tree *next_tree;
> + unsigned long encoded, i;
> + int err = 0;
>
> + if (level == 1) {
> + encoded = offset;
> + return kho_radix_walk_bitmaps((struct kho_bitmap_table *)root,
> + encoded, cb);
Better to do this in the caller a few lines below
> + }
>
> + for (i = 0; i < PAGE_SIZE / sizeof(unsigned long); i++) {
> + if (root->table[i]) {
> + encoded = offset << level_shift | i;
This doesn't seem right..
The argument to the walker should be the starting encoded of the table
it is about to walk.
Since everything always starts at 0 it should always be
start | (i << level_shift)
?
> + next_tree = (struct kho_radix_tree *)
> + phys_to_virt(root->table[i]);
> + err = kho_radix_walk_trees(next_tree, level - 1, encoded, cb);
> if (err)
> return err;
> }
> }
>
> + return 0;
> +}
>
> +static int kho_memblock_reserve(phys_addr_t pa, int order)
> +{
> + int sz = 1 << (order + PAGE_SHIFT);
> + struct page *page = phys_to_page(pa);
> +
> + memblock_reserve(pa, sz);
> + memblock_reserved_mark_noinit(pa, sz);
> + page->private = order;
>
> return 0;
> }
>
> +static int kho_radix_walk_trees_callback(unsigned long encoded)
> +{
> + unsigned int order;
> + unsigned long pa;
> +
> + pa = kho_radix_decode(encoded, &order);
> +
> + return kho_memblock_reserve(pa, order);
> +}
> +
> +struct kho_serialization {
> + struct page *fdt;
> + struct list_head fdt_list;
> + struct dentry *sub_fdt_dir;
> +};
> +
> +static int __kho_preserve_order(unsigned long pfn, unsigned int order)
> +{
> + unsigned long pa = PFN_PHYS(pfn);
phys_addr_t
Jason
next prev parent reply other threads:[~2025-10-06 14:15 UTC|newest]
Thread overview: 13+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-10-01 1:19 [PATCH v1 0/3] Make KHO Stateless Jason Miu
2025-10-01 1:19 ` [PATCH v1 1/3] kho: Adopt KHO radix tree data structures Jason Miu
2025-10-02 4:29 ` kernel test robot
2025-10-06 14:14 ` Jason Gunthorpe [this message]
2025-10-06 17:26 ` Pasha Tatashin
2025-10-06 22:50 ` Jason Gunthorpe
2025-10-09 2:07 ` Jason Miu
2025-10-09 17:52 ` Jason Gunthorpe
2025-10-22 0:59 ` Jason Miu
2025-10-01 1:19 ` [PATCH v1 2/3] memblock: Remove KHO notifier usage Jason Miu
2025-10-01 16:35 ` kernel test robot
2025-10-01 1:19 ` [PATCH v1 3/3] kho: Remove notifier system infrastructure Jason Miu
2025-10-01 18:07 ` kernel test robot
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20251006141444.GN3360665@nvidia.com \
--to=jgg@nvidia.com \
--cc=akpm@linux-foundation.org \
--cc=bhe@redhat.com \
--cc=changyuanl@google.com \
--cc=dmatlack@google.com \
--cc=graf@amazon.com \
--cc=jasonmiu@google.com \
--cc=kexec@lists.infradead.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=pasha.tatashin@soleen.com \
--cc=pratyush@kernel.org \
--cc=rientjes@google.com \
--cc=rppt@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.