From: Pratyush Yadav <pratyush@kernel.org>
To: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Pratyush Yadav <pratyush@kernel.org>,
Mike Rapoport <rppt@kernel.org>,
Evangelos Petrongonas <epetron@amazon.de>,
Alexander Graf <graf@amazon.com>,
Andrew Morton <akpm@linux-foundation.org>,
Jason Miu <jasonmiu@google.com>,
linux-kernel@vger.kernel.org, kexec@lists.infradead.org,
linux-mm@kvack.org, nh-open-source@amazon.com
Subject: Re: [PATCH] kho: add support for deferred struct page init
Date: Mon, 22 Dec 2025 17:24:07 +0100
Message-ID: <863452cwns.fsf@kernel.org>
In-Reply-To: <CA+CK2bDxnTEe9Ohq5zLuyF-jqgD0DPhfdq6z=yztUsXU5p5fSQ@mail.gmail.com> (Pasha Tatashin's message of "Mon, 22 Dec 2025 10:55:34 -0500")
On Mon, Dec 22 2025, Pasha Tatashin wrote:
>> > NUMA node boundaries are SECTION_SIZE aligned. Since SECTION_SIZE is
>> > larger than MAX_PAGE_ORDER it is mathematically impossible for a
>> > single chunk to span multiple nodes.
>>
>> For folios, yes. The whole folio should only be in a single node. But we
>> also have kho_preserve_pages() (formerly kho_preserve_phys()) which can
>> be used to preserve an arbitrary size of memory and _that_ doesn't have
>> to be in the same section. And if the memory is properly aligned, then
>> it will end up being just one higher-order preservation in KHO.
>
> To restore both pages and folios we use kho_restore_page(), which has
> the following:
>
> 	/*
> 	 * deserialize_bitmap() only sets the magic on the head page. This magic
> 	 * check also implicitly makes sure phys is order-aligned since for
> 	 * non-order-aligned phys addresses, magic will never be set.
> 	 */
> 	if (WARN_ON_ONCE(info.magic != KHO_PAGE_MAGIC || info.order > MAX_PAGE_ORDER))
> 		return NULL;
See my patch that drops this restriction:
https://lore.kernel.org/linux-mm/20251206230222.853493-2-pratyush@kernel.org/
I think it was wrong to add it in the first place.
>
> My understanding is that the order of the head page can never be more than
> MAX_PAGE_ORDER, which is why I am saying it will be less than SECTION_SIZE. With HugeTLB
> the order can be more than MAX_PAGE_ORDER, but in that case it still
> has to be within a single NID, since a huge page cannot be split
> across multiple nodes.
For a "proper" page/folio, that either comes from the page allocator or
from HugeTLB, you are right. But see again how kho_preserve_pages()
works:
	while (pfn < end_pfn) {
		const unsigned int order =
			min(count_trailing_zeros(pfn), ilog2(end_pfn - pfn));
		err = __kho_preserve_order(track, pfn, order);
		[...]
It combines contiguous order-aligned pages into one KHO preservation.
So say I have two nodes and the boundary between them is at 62G. If I
call kho_preserve_pages() for 60G to 64G, I will get _one_ 4G
preservation at 60G that spans both nodes. kho_restore_page() will split
it into 0-order pages on restore.
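
To make the splitting concrete, here is a rough user-space model of that
loop. It is a sketch, not the kernel code: struct chunk, ilog2_ul() and
split_range() are made-up names, the GCC/Clang builtins stand in for
count_trailing_zeros() and ilog2(), and the start PFN is assumed to be
non-zero.

```c
#include <stddef.h>

/* One order-aligned chunk, as the loop would pass to __kho_preserve_order(). */
struct chunk {
	unsigned long pfn;
	unsigned int order;
};

/* floor(log2(x)) for x > 0; hypothetical stand-in for the kernel's ilog2(). */
static unsigned int ilog2_ul(unsigned long x)
{
	return (unsigned int)(8 * sizeof(x) - 1) - (unsigned int)__builtin_clzl(x);
}

/*
 * Split [pfn, end_pfn) into order-aligned chunks the same way the quoted
 * loop does. Stores at most max chunks in out and returns how many were
 * stored. Assumes pfn != 0, since ctz(0) is undefined for the builtin.
 */
static size_t split_range(unsigned long pfn, unsigned long end_pfn,
			  struct chunk *out, size_t max)
{
	size_t n = 0;

	while (pfn < end_pfn && n < max) {
		/* Largest order the start address is aligned to. */
		unsigned int align = (unsigned int)__builtin_ctzl(pfn);
		/* Largest order that still fits in what is left. */
		unsigned int fit = ilog2_ul(end_pfn - pfn);
		unsigned int order = align < fit ? align : fit;

		out[n].pfn = pfn;
		out[n].order = order;
		n++;
		pfn += 1UL << order;
	}

	return n;
}
```

Assuming 4K pages (so 1G is 1 << 18 PFNs), split_range(60UL << 18,
64UL << 18, ...) returns a single order-20 (4G) chunk, while a range whose
start is only 2G-aligned, such as 62G to 66G, comes back as two order-19
(2G) chunks. So a single preservation can only cross a node boundary when
that boundary is less aligned than the chunk itself.
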
>
>> >> > This approach seems to give us the best of both worlds: It avoids the
>> >> > memblock dependency during restoration. It keeps the serial work in
>> >> > deserialize_bitmap() to a minimum (O(1) per region). It allows the
>> >> > heavy lifting of tail page initialization to be done later in the boot
>> >> > process, potentially in parallel, as you suggested.
>> >>
>> >> Here's another idea I have been thinking about, but never dug deep
>> >> enough to figure out if it actually works.
>> >>
>> >> __init_page_from_nid() loops through all the zones for the node to find
>> >> the zone id for the page. We can flip it the other way round and loop
>> >> through all zones (on all nodes) to find out if the PFN spans that zone.
>> >> Once we find the zone, we can directly call __init_single_page() on it.
>> >> If a contiguous chunk of preserved memory lands in one zone, we can
>> >> batch the init to save some time.
>> >>
>> >> Something like the below (completely untested):
>> >>
>> >>
>> >> 	static void kho_init_page(struct page *page)
>> >> 	{
>> >> 		unsigned long pfn = page_to_pfn(page);
>> >> 		struct zone *zone;
>> >>
>> >> 		for_each_zone(zone) {
>> >> 			if (zone_spans_pfn(zone, pfn))
>> >> 				break;
>> >> 		}
>> >>
>> >> 		/* for_each_zone() leaves zone NULL if no zone spans the PFN. */
>> >> 		if (WARN_ON_ONCE(!zone))
>> >> 			return;
>> >>
>> >> 		__init_single_page(page, pfn, zone_idx(zone), zone_to_nid(zone));
>> >> 	}
>> >>
>> >> It doesn't do the batching I mentioned, but I think it at least gets the
>> >> point across. And I think even this simple version would be a good first
>> >> step.
>> >>
>> >> This lets us initialize the page from kho_restore_folio() without having
>> >> to rely on memblock being alive, and saves us from doing work during
>> >> early boot. We should only have a handful of zones and nodes in
>> >> practice, so I think it should perform fairly well too.
>> >>
>> >> We would of course need to see how it performs in practice. If it works,
>> >> I think it would be cleaner and simpler than splitting the
>> >> initialization into two separate parts.
>> >
>> > I think your idea is clever and would work. However, consider the
>> > cache efficiency: in deserialize_bitmap(), we must write to the head
>> > struct page anyway to preserve the order. Since we are already
>> > bringing that 64-byte cacheline in and dirtying it, and since memblock
>> > is available and fast at this stage, it makes sense to fully
>> > initialize the head page right then.
>>
>> You will also bring in the cache line and dirty it during
>> kho_restore_folio() since you need to write the page refcounts. So I
>> don't think the cache efficiency makes any difference between either
>> approach.
>>
>> > If we do that, we get the nid for "free" (cache-wise) and we avoid the
>> > overhead of iterating zones during the restore phase. We can then
>> > simply inherit the nid from the head page when initializing the tail
>> > pages later.
>>
>> To get the nid, you would need to call early_pfn_to_nid(). This takes a
>> spinlock and searches through all memblock memory regions. I don't think
>> it is too expensive, but it isn't free either. And all this would be
>> done serially. With the zone search, you at least have some room for
>> concurrency.
>>
>> I think either approach only makes a difference when we have a large
>> number of low-order preservations. If we have a handful of high-order
>> preservations, I suppose the overhead of nid search would be negligible.
>
> We should be targeting a situation where the vast majority of the
> preserved memory is HugeTLB, but I am still worried about lower order
> preservation efficiency for IOMMU page tables, etc.
Yep. Plus we might get VMMs stashing some of their state in a memfd too.
>
>> Long term, I think we should hook this into page_alloc_init_late() so
>> that all the KHO pages also get initialized along with all the other
>> pages. This will result in better integration of KHO with the rest of MM
>> init, and also have more consistent page restore performance.
>
> But we keep KHO as reserved memory, and hooking it up into
> page_alloc_init_late() would make it very different, since that memory
> is part of the buddy allocator memory...
The idea I have is to have a separate call in page_alloc_init_late()
that initializes KHO pages. It would traverse the radix tree (probably in
parallel by distributing the address space across multiple threads?) and
initialize all the pages. Then kho_restore_page() would only have to
double-check the magic and it can directly return the page.
Radix tree makes parallelism easier than the linked lists we have now.
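
One way to picture "distributing the address space" is below. This is only
a sketch with made-up names (struct pfn_range, worker_slice()); it assumes
each worker gets one contiguous PFN slice whose boundaries are aligned to a
power-of-two number of pages, on the guess that aligned slices map onto
whole subtrees of the radix tree. Nothing here is taken from Jason's
patches.

```c
/* Half-open PFN range [start, end) handed to one worker. */
struct pfn_range {
	unsigned long start;
	unsigned long end;
};

/*
 * Return the slice of [start, end) that worker idx (0-based) out of
 * nr_workers should handle. Internal slice boundaries are multiples of
 * (1UL << align_order) PFNs; only the first and last slice are clamped
 * to the real start and end. A worker past the end gets an empty range.
 * Assumes nr_workers > 0 and start <= end.
 */
static struct pfn_range worker_slice(unsigned long start, unsigned long end,
				     unsigned int nr_workers, unsigned int idx,
				     unsigned int align_order)
{
	const unsigned long align = 1UL << align_order;
	/* Aligned base at or below start, so boundaries are absolute. */
	const unsigned long base = start & ~(align - 1);
	/* Per-worker span: ceil((end - base) / nr_workers), rounded up to align. */
	unsigned long per = (end - base + nr_workers - 1) / nr_workers;
	struct pfn_range r;

	per = (per + align - 1) & ~(align - 1);

	r.start = base + (unsigned long)idx * per;
	r.end = r.start + per;

	/* Clamp to the range that was actually requested. */
	if (r.start < start)
		r.start = start;
	if (r.start > end)
		r.start = end;
	if (r.end > end)
		r.end = end;

	return r;
}
```

Each worker would then walk only the preserved entries that fall inside
its slice and initialize those pages, so no two workers touch the same
struct page.
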
>
>> Jason's radix tree patches will make that a bit easier to do I think.
>> The zone search will scale better I reckon.
>
> It could. Perhaps early in boot we should reserve the radix tree, and
> use it as a source of truth for look-ups later in boot?
Yep. I think the radix tree should mark its own pages as preserved too
so they stick around later in boot.
--
Regards,
Pratyush Yadav