From: Jason Miu <jasonmiu@google.com>
To: Alexander Graf <graf@amazon.com>,
Andrew Morton <akpm@linux-foundation.org>,
Baoquan He <bhe@redhat.com>,
Changyuan Lyu <changyuanl@google.com>,
David Matlack <dmatlack@google.com>,
David Rientjes <rientjes@google.com>,
Jason Gunthorpe <jgg@nvidia.com>,
Jason Miu <jasonmiu@google.com>,
Joel Granados <joel.granados@kernel.org>,
Marcos Paulo de Souza <mpdesouza@suse.com>,
Mario Limonciello <mario.limonciello@amd.com>,
Mike Rapoport <rppt@kernel.org>,
Pasha Tatashin <pasha.tatashin@soleen.com>,
Petr Mladek <pmladek@suse.com>,
"Rafael J . Wysocki" <rafael.j.wysocki@intel.com>,
Steven Chen <chenste@linux.microsoft.com>,
Yan Zhao <yan.y.zhao@intel.com>,
kexec@lists.infradead.org, linux-kernel@vger.kernel.org,
linux-mm@kvack.org
Subject: [RFC v1 1/4] kho: Introduce KHO page table data structures
Date: Tue, 16 Sep 2025 19:50:16 -0700 [thread overview]
Message-ID: <20250917025019.1585041-2-jasonmiu@google.com> (raw)
In-Reply-To: <20250917025019.1585041-1-jasonmiu@google.com>
Introduce a page-table-like data structure for tracking preserved
memory pages, which will replace the current xarray-based
implementation.
The primary motivation for this change is to eliminate the need for
serialization. By marking preserved pages directly in these new tables
and passing them to the next kernel, the entire serialization process
can be removed. This ultimately allows for the removal of the KHO
finalize and abort states, simplifying the overall design.
The new KHO page table is a hierarchical structure that maps physical
addresses to preservation metadata. It begins with a root
`kho_order_table` that contains an entry for each page order. Each
entry points to a separate, multi-level tree of `kho_page_table`s that
splits a physical address into indices. The traversal terminates at a
`kho_bitmap_table`, where each bit represents a single preserved page.
This commit adds the core data structures for this hierarchy:
- kho_order_table: The root table, indexed by page order.
- kho_page_table: Intermediate-level tables.
- kho_bitmap_table: The lowest-level table where individual pages
are marked.
The new functions are not yet integrated with the public
`kho_preserve_*` APIs and are marked `__maybe_unused`. The full
integration and the removal of the old xarray code will follow in a
subsequent commit.
Signed-off-by: Jason Miu <jasonmiu@google.com>
---
kernel/kexec_handover.c | 344 ++++++++++++++++++++++++++++++++++++++++
1 file changed, 344 insertions(+)
diff --git a/kernel/kexec_handover.c b/kernel/kexec_handover.c
index ecd1ac210dbd..0daed51c8fb7 100644
--- a/kernel/kexec_handover.c
+++ b/kernel/kexec_handover.c
@@ -46,6 +46,350 @@ static int __init kho_parse_enable(char *p)
}
early_param("kho", kho_parse_enable);
+/*
+ * KHO page tables provide a page-table-like data structure for tracking
+ * preserved memory pages. It is a hierarchical structure that starts with a
+ * `struct kho_order_table`. Each entry in this table points to the root of a
+ * `struct kho_page_table` tree, which tracks the preserved memory pages for a
+ * specific page order.
+ *
+ * Each entry in a `struct kho_page_table` points to the next level page table,
+ * until level 2, which points to a `struct kho_bitmap_table`. The lowest level
+ * (level 1) is a bitmap table where each bit represents a preserved page.
+ *
+ * The table hierarchy is shown below.
+ *
+ * kho_order_table
+ * +-------------------------------+--------------------+
+ * | 0 order| 1 order| 2 order ... | HUGETLB_PAGE_ORDER |
+ * ++------------------------------+--------------------+
+ *  |
+ *  v
+ * +--------+
+ * | Lv6    | kho_page_table
+ * ++-------+
+ *  |
+ *  |  +--------+
+ *  +->| Lv5    | kho_page_table
+ *     ++-------+
+ *      |
+ *      |  +--------+
+ *      +->| Lv4    | kho_page_table
+ *         ++-------+
+ *          |
+ *          |  +--------+
+ *          +->| Lv3    | kho_page_table
+ *             ++-------+
+ *              |
+ *              |  +--------+
+ *              +->| Lv2    | kho_page_table
+ *                 ++-------+
+ *                  |
+ *                  |  +--------+
+ *                  +->| Lv1    | kho_bitmap_table
+ *                     +--------+
+ *
+ * The depth of the KHO page tables depends on the system's page size and the
+ * page order. Both larger page sizes and higher page orders result in
+ * shallower KHO page tables. For example, on a system with a 4KB native
+ * page size, 0-order tables have a depth of 6 levels.
+ *
+ * The following diagram illustrates how a physical address is split into
+ * indices for the different KHO page table levels and the final bitmap.
+ *
+ * 63 62:54 53:45 44:36 35:27 26:0
+ * +--------+--------+--------+--------+--------+-----------------+
+ * | Lv 6 | Lv 5 | Lv 4 | Lv 3 | Lv 2 | Lv 1 (bitmap) |
+ * +--------+--------+--------+--------+--------+-----------------+
+ *
+ * For higher order pages, the bit fields for each level shift to the left by
+ * the page order.
+ *
+ * Each KHO page table and bitmap table is PAGE_SIZE in size. For 0-order
+ * pages, the bitmap table contains (PAGE_SIZE * 8) bits, covering a
+ * (PAGE_SIZE * 8 * PAGE_SIZE) memory range. For example, on a system with a
+ * 4KB native page size, the bitmap table contains 32768 bits and covers a
+ * 128MB memory range.
+ *
+ * Each KHO page table contains (PAGE_SIZE / 8) entries, where each entry is a
+ * descriptor (a physical address) pointing to the next level table.
+ * For example, with a 4KB page size, each page table holds 512 entries.
+ * The level 2 KHO page table is an exception, where each entry points to a
+ * KHO bitmap table instead.
+ *
+ * An example entry of a KHO page table on a system with a 4KB page size is
+ * shown below.
+ *
+ * 63:12 11:0
+ * +------------------------------+--------------+
+ * | descriptor to next table | zeros |
+ * +------------------------------+--------------+
+ */
+
+#define BITMAP_TABLE_SHIFT(_order) (PAGE_SHIFT + PAGE_SHIFT + 3 + (_order))
+#define BITMAP_TABLE_MASK(_order) ((1ULL << BITMAP_TABLE_SHIFT(_order)) - 1)
+#define PRESERVED_PAGE_OFFSET_SHIFT(_order) (PAGE_SHIFT + (_order))
+#define PAGE_TABLE_SHIFT_PER_LEVEL (ilog2(PAGE_SIZE / sizeof(unsigned long)))
+#define PAGE_TABLE_LEVEL_MASK ((1ULL << PAGE_TABLE_SHIFT_PER_LEVEL) - 1)
+#define PTR_PER_LEVEL (PAGE_SIZE / sizeof(unsigned long))
+
+typedef int (*kho_walk_callback_t)(phys_addr_t pa, int order);
+
+struct kho_bitmap_table {
+ unsigned long bitmaps[PAGE_SIZE / sizeof(unsigned long)];
+};
+
+struct kho_page_table {
+ unsigned long tables[PTR_PER_LEVEL];
+};
+
+struct kho_order_table {
+ unsigned long orders[HUGETLB_PAGE_ORDER + 1];
+};
+
+/*
+ * `kho_order_table` points to a page that serves as the root of the KHO page
+ * table hierarchy. This page is allocated during KHO module initialization.
+ * Its physical address is written to the FDT and passed to the next kernel
+ * during kexec.
+ */
+static struct kho_order_table *kho_order_table;
+
+static unsigned long kho_page_table_level_shift(int level, int order)
+{
+ /*
+ * Calculate the cumulative bit shift required to extract the page table
+ * index for a given physical address at a specific `level` and `order`.
+ *
+ * - Level 1 is the bitmap table, which has its own indexing logic, so
+ * the shift is 0.
+ * - Level 2 and above: The base shift is `BITMAP_TABLE_SHIFT(order)`,
+ * which corresponds to the entire address space covered by a single
+ * level 1 bitmap table.
+ * - Each subsequent level adds `PAGE_TABLE_SHIFT_PER_LEVEL` to the
+ * total shift amount.
+ */
+ return level <= 1 ? 0 :
+ BITMAP_TABLE_SHIFT(order) + PAGE_TABLE_SHIFT_PER_LEVEL * (level - 2);
+}
+
+static int kho_get_bitmap_table_index(unsigned long pa, int order)
+{
+	/* 4KB page (12 addr bits) + 64-bit bitmap words (6 addr bits) + order bits */
+ unsigned long idx = pa >> (PAGE_SHIFT + 6 + order);
+
+ return idx;
+}
+
+static int kho_get_page_table_index(unsigned long pa, int order, int level)
+{
+ unsigned long high_addr;
+ unsigned long page_table_offset;
+ unsigned long shift;
+
+ if (level == 1)
+ return kho_get_bitmap_table_index(pa, order);
+
+ shift = kho_page_table_level_shift(level, order);
+ high_addr = pa >> shift;
+
+ page_table_offset = high_addr & PAGE_TABLE_LEVEL_MASK;
+ return page_table_offset;
+}
+
+static int kho_table_level(int order)
+{
+ unsigned long bits_to_resolve;
+ int page_table_num;
+
+ /* We just need 1 bitmap table to cover all addresses */
+ if (BITMAP_TABLE_SHIFT(order) >= 64)
+ return 1;
+
+ bits_to_resolve = 64 - BITMAP_TABLE_SHIFT(order);
+
+	/*
+	 * The number of page table levels is the number of bits left to
+	 * resolve divided by the bits one page table level resolves,
+	 * rounded up: ceil(a/b) = (a + b - 1) / b. The total depth is all
+	 * page table levels plus the bottom bitmap level.
+	 */
+ page_table_num = (bits_to_resolve + PAGE_TABLE_SHIFT_PER_LEVEL - 1)
+ / PAGE_TABLE_SHIFT_PER_LEVEL;
+ return page_table_num + 1;
+}
+
+static struct kho_page_table *kho_alloc_page_table(void)
+{
+ return (struct kho_page_table *)get_zeroed_page(GFP_KERNEL);
+}
+
+static void kho_set_preserved_page_bit(struct kho_bitmap_table *bitmap_table,
+ unsigned long pa, int order)
+{
+ int bitmap_table_index = kho_get_bitmap_table_index(pa, order);
+ int offset;
+
+	/* Get the bit offset within a 64-bit bitmap word */
+ offset = (pa >> PRESERVED_PAGE_OFFSET_SHIFT(order)) & 0x3f;
+
+ set_bit(offset,
+ (unsigned long *)&bitmap_table->bitmaps[bitmap_table_index]);
+}
+
+static unsigned long kho_pgt_desc(struct kho_page_table *va)
+{
+ return (unsigned long)virt_to_phys(va);
+}
+
+static struct kho_page_table *kho_page_table(unsigned long desc)
+{
+ return (struct kho_page_table *)phys_to_virt(desc);
+}
+
+static int __kho_preserve_page_table(unsigned long pa, int order)
+{
+ int num_table_level = kho_table_level(order);
+ struct kho_page_table *cur;
+ struct kho_page_table *next;
+ struct kho_bitmap_table *bitmap_table;
+ int i, page_table_index;
+ unsigned long page_table_desc;
+
+ if (!kho_order_table->orders[order]) {
+ cur = kho_alloc_page_table();
+ if (!cur)
+ return -ENOMEM;
+ page_table_desc = kho_pgt_desc(cur);
+ kho_order_table->orders[order] = page_table_desc;
+ }
+
+ cur = kho_page_table(kho_order_table->orders[order]);
+
+ /* Go from high level tables to low level tables */
+ for (i = num_table_level; i > 1; i--) {
+ page_table_index = kho_get_page_table_index(pa, order, i);
+
+ if (!cur->tables[page_table_index]) {
+ next = kho_alloc_page_table();
+ if (!next)
+ return -ENOMEM;
+ cur->tables[page_table_index] = kho_pgt_desc(next);
+ } else {
+ next = kho_page_table(cur->tables[page_table_index]);
+ }
+
+ cur = next;
+ }
+
+	/* cur now points to the level 1 bitmap table */
+ bitmap_table = (struct kho_bitmap_table *)cur;
+ kho_set_preserved_page_bit(bitmap_table,
+ pa & BITMAP_TABLE_MASK(order),
+ order);
+
+ return 0;
+}
+
+/*
+ * TODO: __maybe_unused is added to the following functions:
+ *   kho_preserve_page_table()
+ *   kho_walk_page_tables()
+ *   kho_memblock_reserve()
+ * since they are not yet called in this change. __maybe_unused will be
+ * removed in the next patch.
+ */
+static __maybe_unused int kho_preserve_page_table(unsigned long pfn, int order)
+{
+ unsigned long pa = PFN_PHYS(pfn);
+
+ might_sleep();
+
+ return __kho_preserve_page_table(pa, order);
+}
+
+static int __kho_walk_bitmap_table(int order,
+ struct kho_bitmap_table *bitmap_table,
+ unsigned long pa,
+ kho_walk_callback_t cb)
+{
+ int i;
+ unsigned long offset;
+ int ret = 0;
+ int order_factor = 1 << order;
+ unsigned long *bitmap = (unsigned long *)bitmap_table;
+
+ for_each_set_bit(i, bitmap, PAGE_SIZE * BITS_PER_BYTE) {
+ offset = (unsigned long)PAGE_SIZE * order_factor * i;
+ ret = cb(offset + pa, order);
+ if (ret)
+ return ret;
+ }
+
+ return 0;
+}
+
+static int __kho_walk_page_tables(int order, int level,
+ struct kho_page_table *cur, unsigned long pa,
+ kho_walk_callback_t cb)
+{
+ struct kho_page_table *next;
+ struct kho_bitmap_table *bitmap_table;
+ int i;
+ unsigned long offset;
+ int ret = 0;
+
+ if (level == 1) {
+ bitmap_table = (struct kho_bitmap_table *)cur;
+ return __kho_walk_bitmap_table(order, bitmap_table, pa, cb);
+ }
+
+ for (i = 0; i < PTR_PER_LEVEL; i++) {
+ if (cur->tables[i]) {
+ next = kho_page_table(cur->tables[i]);
+ offset = i;
+ offset <<= kho_page_table_level_shift(level, order);
+ ret = __kho_walk_page_tables(order, level - 1,
+ next, offset + pa, cb);
+ if (ret < 0)
+ return ret;
+ }
+ }
+
+ return 0;
+}
+
+static __maybe_unused int kho_walk_page_tables(struct kho_page_table *top, int order,
+ kho_walk_callback_t cb)
+{
+ int num_table_level;
+
+ if (top) {
+ num_table_level = kho_table_level(order);
+ return __kho_walk_page_tables(order, num_table_level, top, 0, cb);
+ }
+
+ return 0;
+}
+
+static __maybe_unused int kho_memblock_reserve(phys_addr_t pa, int order)
+{
+ int sz = 1 << (order + PAGE_SHIFT);
+ struct page *page = phys_to_page(pa);
+
+ memblock_reserve(pa, sz);
+ memblock_reserved_mark_noinit(pa, sz);
+ page->private = order;
+
+ return 0;
+}
+
/*
* Keep track of memory that is to be preserved across KHO.
*
--
2.51.0.384.g4c02a37b29-goog