From: Jason Gunthorpe <jgg@nvidia.com>
To: Jason Miu <jasonmiu@google.com>
Cc: Alexander Graf <graf@amazon.com>,
Andrew Morton <akpm@linux-foundation.org>,
Baoquan He <bhe@redhat.com>,
Changyuan Lyu <changyuanl@google.com>,
David Matlack <dmatlack@google.com>,
David Rientjes <rientjes@google.com>,
Mike Rapoport <rppt@kernel.org>,
Pasha Tatashin <pasha.tatashin@soleen.com>,
Pratyush Yadav <pratyush@kernel.org>,
kexec@lists.infradead.org, linux-kernel@vger.kernel.org,
linux-mm@kvack.org
Subject: Re: [PATCH v2 1/3] kho: Adopt KHO radix tree data structures
Date: Thu, 23 Oct 2025 14:43:17 -0300 [thread overview]
Message-ID: <20251023174317.GO262900@nvidia.com> (raw)
In-Reply-To: <20251020100306.2709352-2-jasonmiu@google.com>
On Mon, Oct 20, 2025 at 03:03:04AM -0700, Jason Miu wrote:
> - * Keep track of memory that is to be preserved across KHO.
> + * The KHO radix tree tracks preserved memory pages. It is a hierarchical
> + * structure that starts with a single root `kho_radix_tree`. This single
> + * tree stores pages of all orders.
> + *
> + * This is achieved by encoding the page's physical address and its order into
> + * a single `unsigned long` value. This encoded value is then used to traverse
> + * the tree.
> + *
> + * The tree hierarchy is shown below:
> + *
> + * kho_radix_tree_root
> + * +-------------------+
> + * | Level 5 | (struct kho_radix_tree)
> + * +-------------------+
> + * |
> + * v
> + * +-------------------+
> + * | Level 4 | (struct kho_radix_tree)
> + * +-------------------+
> + * |
> + * | ... (intermediate levels)
> + * |
> + * v
> + * +-------------------+
> + * | Level 0 | (struct kho_bitmap_table)
> + * +-------------------+
> *
> - * The serializing side uses two levels of xarrays to manage chunks of per-order
> - * 512 byte bitmaps. For instance if PAGE_SIZE = 4096, the entire 1G order of a
> - * 1TB system would fit inside a single 512 byte bitmap. For order 0 allocations
> - * each bitmap will cover 16M of address space. Thus, for 16G of memory at most
> - * 512K of bitmap memory will be needed for order 0.
I think it is valuable to preserve this justification for
bitmaps. There was a lot of debate over bitmaps vs ranges and this is
the advantage of bitmaps. It is a bit subtle..
> +/*
> + * The KHO radix tree tracks preserved pages by encoding a page's physical
> + * address (pa) and its order into a single unsigned long value. This value
> + * is then used to traverse the tree. The encoded value is composed of two
> + * parts: the 'order bits' in the upper part and the 'page offset' in the
> + * lower part.
> + *
> + * <-- Higher Bits ------------------------------------ Lower Bits -->
> + * +--------------------------+-----------------------------------------+
> + * | Order Bits | Page Offset |
> + * +--------------------------+-----------------------------------------+
> + * | ... 0 0 1 0 0 ... | pa >> (PAGE_SHIFT + order) |
> + * +--------------------------+-----------------------------------------+
> + * ^
> + * |
> + * This single '1' bit's position
> + * uniquely identifies the 'order'.
> + *
> + *
> + * Page Offset:
> + * The 'page offset' is the physical address normalized for its order. It
> + * effectively represents the page offset for the given order.
> + *
> + * Order Bits:
> + * The 'order bits' encode the page order by setting a single bit at a
> + * specific position. The position of this bit itself represents the order.
> + *
> + * For instance, on a 64-bit system with 4KB pages (PAGE_SHIFT = 12), the
> + * maximum range for a page offset (for order 0) is 52 bits (64 - 12). This
> + * offset occupies bits [0-51]. For order 0, the order bit is set at
> + * position 52.
> + *
> + * As the order increases, the number of bits required for the 'page offset'
> + * decreases. For example, order 1 requires one less bit for its page
> + * offset. This allows its order bit to be set at position 51,
> + * i.e. shifting right, without conflicting with the page offset bits.
This description is correct, but the diagram is misleading. Maybe like this:
* |0| ... 0 0 0 1 | pa >> (PAGE_SHIFT + 0) |
* |1| ... 0 0 0 0 1 | pa >> (PAGE_SHIFT + 1) |
* |2| ... 0 0 0 0 0 1 | pa >> (PAGE_SHIFT + 2) |
[..]
> + *
> + * This design stores pages of all sizes (orders) in a single 6-level table. It
> + * efficiently shares lower table levels, especially due to common zero top
> + * address bits, allowing a single, efficient algorithm to manage all pages.
> + */
> +enum kho_radix_consts {
> + /* The bit position of a 0-order page, only supports 64bits system */
> + ORDER_0_LG2 = 64 - PAGE_SHIFT,
> + /* Bit number used for each level of tables */
> + TABLE_SIZE_LG2 = const_ilog2(PAGE_SIZE / sizeof(phys_addr_t)),
> + /* Bit number used for lowest level of bitmaps */
> + BITMAP_SIZE_LG2 = PAGE_SHIFT + const_ilog2(BITS_PER_BYTE),
"bit number" is a bit confusing when using a lg2 terms..
/* Size of the table in kho_radix_tree, in lg2 */
TABLE_SIZE_LG2 = const_ilog2(PAGE_SIZE / sizeof(phys_addr_t))
/* Number of bits in the kho_bitmap_table, in lg2 */
BITMAP_SIZE_LG2 = const_ilog2(BITS_PER_BYTE * PAGE_SIZE)
Then use the constants in the structs:
struct kho_radix_tree {
phys_addr_t table[1 << TABLE_SIZE_LG2];
};
struct kho_bitmap_table {
unsigned long bitmaps[(1 << BITMAP_SIZE_LG2) / BITS_PER_LONG];
};
> -struct khoser_mem_chunk;
> +static inline phys_addr_t kho_radix_tree_desc(struct kho_radix_tree *va)
> +{
> + return (phys_addr_t)virt_to_phys(va);
> +}
virt_to_phys already returns phys_addr_t ?
> +static inline struct kho_radix_tree *kho_radix_tree(phys_addr_t desc)
> +{
> + return (struct kho_radix_tree *)(phys_to_virt(desc));
> +}
phys_to_virt returns void *, no need for a cast
> +static int kho_radix_set_bitmap(struct kho_bitmap_table *bit_tlb,
> + unsigned long offset)
> +{
> + if (!bit_tlb ||
> + offset >= PAGE_SIZE * BITS_PER_BYTE)
> + return -EINVAL;
WARN_ON ? These are assertions aren't they?
> +static int kho_radix_walk_trees(struct kho_radix_tree *root, unsigned int level,
> + unsigned long start,
> + kho_radix_tree_walk_callback_t cb)
> +{
> + struct kho_radix_tree *next_tree;
> + struct kho_bitmap_table *bitmap_table;
> + unsigned long encoded, i;
> + unsigned int level_shift;
> + int err = 0;
> +
> + for (i = 0; i < PAGE_SIZE / sizeof(phys_addr_t); i++) {
> + if (root->table[i]) {
if (!root->table[i])
continue
Remove a level of indentation here
> +static int __kho_preserve_order(unsigned long pfn, unsigned int order)
> +{
> + unsigned long pa = PFN_PHYS(pfn);
> +
> + might_sleep();
> +
> + return kho_radix_preserve_page(pa, order);
might_sleep should be in kho_radix_preserve_page() ? The might should
be placed around if statements that might avoid a sleep, and that is
not in this function.
> static void __init kho_mem_deserialize(const void *fdt)
> {
> + struct kho_radix_tree *tree_root;
> const phys_addr_t *mem;
> int len;
>
> + /* Retrieve the KHO radix tree from passed-in FDT. */
> + mem = fdt_getprop(fdt, 0, PROP_PRESERVED_PAGE_RADIX_TREE, &len);
>
> if (!mem || len != sizeof(*mem)) {
> + pr_err("failed to get preserved KHO memory tree\n");
> return;
> }
>
> + tree_root = *mem ?
> + (struct kho_radix_tree *)phys_to_virt(*mem) :
> + NULL;
>
> + if (!tree_root)
> + return;
Seems weird?
if (!*mem)
return;
tree_root = phys_to_virt(*mem)
kho_radix_walk_trees(tree_root, TREE_MAX_DEPTH - 1, 0,
kho_radix_walk_trees_memblock_callback);
This is prettty good now, IMHO
Jason
next prev parent reply other threads:[~2025-10-23 17:43 UTC|newest]
Thread overview: 13+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-10-20 10:03 [PATCH v2 0/3] Make KHO Stateless Jason Miu
2025-10-20 10:03 ` [PATCH v2 1/3] kho: Adopt KHO radix tree data structures Jason Miu
2025-10-22 15:51 ` David Matlack
2025-10-23 0:51 ` Jason Miu
2025-10-23 17:43 ` Jason Gunthorpe [this message]
2025-10-29 1:28 ` Jason Miu
2025-10-23 23:45 ` Jason Gunthorpe
2025-10-24 1:23 ` Pasha Tatashin
2025-10-24 11:43 ` Jason Gunthorpe
2025-10-27 11:53 ` Mike Rapoport
2025-10-29 2:08 ` Jason Miu
2025-10-20 10:03 ` [PATCH v2 2/3] memblock: kho: Remove KHO notifier usage Jason Miu
2025-10-20 10:03 ` [PATCH v2 3/3] kho: Remove notifier system infrastructure Jason Miu
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20251023174317.GO262900@nvidia.com \
--to=jgg@nvidia.com \
--cc=akpm@linux-foundation.org \
--cc=bhe@redhat.com \
--cc=changyuanl@google.com \
--cc=dmatlack@google.com \
--cc=graf@amazon.com \
--cc=jasonmiu@google.com \
--cc=kexec@lists.infradead.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=pasha.tatashin@soleen.com \
--cc=pratyush@kernel.org \
--cc=rientjes@google.com \
--cc=rppt@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.