From: Pratyush Yadav <ptyadav@amazon.de>
To: Jason Gunthorpe <jgg@nvidia.com>
Cc: Changyuan Lyu <changyuanl@google.com>,
<linux-kernel@vger.kernel.org>, <graf@amazon.com>,
<akpm@linux-foundation.org>, <luto@kernel.org>,
<anthony.yznaga@oracle.com>, <arnd@arndb.de>,
<ashish.kalra@amd.com>, <benh@kernel.crashing.org>,
<bp@alien8.de>, <catalin.marinas@arm.com>,
<dave.hansen@linux.intel.com>, <dwmw2@infradead.org>,
<ebiederm@xmission.com>, <mingo@redhat.com>, <jgowans@amazon.com>,
<corbet@lwn.net>, <krzk@kernel.org>, <rppt@kernel.org>,
<mark.rutland@arm.com>, <pbonzini@redhat.com>,
<pasha.tatashin@soleen.com>, <hpa@zytor.com>,
<peterz@infradead.org>, <robh+dt@kernel.org>, <robh@kernel.org>,
<saravanak@google.com>, <skinsburskii@linux.microsoft.com>,
<rostedt@goodmis.org>, <tglx@linutronix.de>,
<thomas.lendacky@amd.com>, <usama.arif@bytedance.com>,
<will@kernel.org>, <devicetree@vger.kernel.org>,
<kexec@lists.infradead.org>,
<linux-arm-kernel@lists.infradead.org>,
<linux-doc@vger.kernel.org>, <linux-mm@kvack.org>,
<x86@kernel.org>
Subject: Re: [PATCH v5 09/16] kexec: enable KHO support for memory preservation
Date: Thu, 3 Apr 2025 17:37:06 +0000 [thread overview]
Message-ID: <mafs0tt75p2nx.fsf@amazon.de> (raw)
In-Reply-To: <20250403161001.GG342109@nvidia.com>
On Thu, Apr 03 2025, Jason Gunthorpe wrote:
> On Thu, Apr 03, 2025 at 03:50:04PM +0000, Pratyush Yadav wrote:
>
>> The patch currently has a limitation where it does not free any of the
>> empty tables after a unpreserve operation. But Changyuan's patch also
>> doesn't do it so at least it is not any worse off.
>
> We do we even have unpreserve? Just discard the entire KHO operation
> in a bulk.
Yeah, I guess that makes sense.
>
>> When working on this patch, I realized that kho_mem_deserialize() is
>> currently _very_ slow. It takes over 2 seconds to make memblock
>> reservations for 48 GiB of 0-order pages. I suppose this can later be
>> optimized by teaching memblock_free_all() to skip preserved pages
>> instead of making memblock reservations.
>
> Yes, this was my prior point of not having actual data to know what
> the actual hot spots are.. This saves a few ms on an operation that
> takes over 2 seconds :)
Yes, you're right. But for 2.5 days of work it isn't too shabby :-)
And I think this will help make the 2 seconds much smaller as well later
down the line since we can now find out if a given page is reserved in a
few operations, and do it in parallel.
>
>> +typedef unsigned long khomem_desc_t;
>
> This should be more like:
>
> union {
> void *table;
> phys_addr_t table_phys;
> };
>
> Since we are not using the low bits right now and it is alot cheaper
> to convert from va to phys only once during the final step. __va is
> not exactly fast.
The descriptor is used on _every_ level of the table, not just the top.
So if we use virtual addresses, at serialize time we would have to walk
the whole table and covert all addresses to physical. And __va() does
not seem to be doing too much. On x86, it expands to:
#define __va(x) ((void *)((unsigned long)(x)+PAGE_OFFSET))
and on ARM64 to:
#define __va(x) ((void *)__phys_to_virt((phys_addr_t)(x)))
#define __phys_to_virt(x) ((unsigned long)((x) - PHYS_OFFSET) | PAGE_OFFSET)
So only some addition and bitwise or. Should be fast enough I reckon.
Maybe walking the table once is faster than calculating va every time,
but that walking would happen in the blackout time, and need more data
on whether that optimization is worth it.
>
>> +#define PTRS_PER_LEVEL (PAGE_SIZE / sizeof(unsigned long))
>> +#define KHOMEM_L1_BITS (PAGE_SIZE * BITS_PER_BYTE)
>> +#define KHOMEM_L1_MASK ((1 << ilog2(KHOMEM_L1_BITS)) - 1)
>> +#define KHOMEM_L1_SHIFT (PAGE_SHIFT)
>> +#define KHOMEM_L2_SHIFT (KHOMEM_L1_SHIFT + ilog2(KHOMEM_L1_BITS))
>> +#define KHOMEM_L3_SHIFT (KHOMEM_L2_SHIFT + ilog2(PTRS_PER_LEVEL))
>> +#define KHOMEM_L4_SHIFT (KHOMEM_L3_SHIFT + ilog2(PTRS_PER_LEVEL))
>> +#define KHOMEM_PFN_MASK PAGE_MASK
>
> This all works better if you just use GENMASK and FIELD_GET
I suppose yes. Though the masks need to be shifted by page order so need
to be careful. Will take a look.
>
>> +static int __khomem_table_alloc(khomem_desc_t *desc)
>> +{
>> + if (khomem_desc_none(*desc)) {
>
> Needs READ_ONCE
ACK, will add.
>
>> +struct kho_mem_track {
>> + /* Points to L4 KHOMEM descriptor, each order gets its own table. */
>> + struct xarray orders;
>> +};
>
> I think it would be easy to add a 5th level and just use bits 63:57 as
> a 6 bit order. Then you don't need all this stuff either.
I am guessing you mean to store the order in the table descriptor
itself, instead of having a different table for each order. I don't
think that would work since say you have a level 1 table spanning 128
MiB. You can have pages of different orders in that 128 MiB, and have no
way of knowing which is which. To have all orders in one table, we would
need more than one bit per page at the lowest level.
Though now that I think of it, it is probably much simpler to just use
khomem_desc_t orders[NR_PAGE_ORDERS] instead of the xarray.
>
>> +int kho_preserve_folio(struct folio *folio)
>> +{
>> + unsigned long pfn = folio_pfn(folio);
>> + unsigned int order = folio_order(folio);
>> + int err;
>> +
>> + if (!kho_enable)
>> + return -EOPNOTSUPP;
>> +
>> + down_read(&kho_out.tree_lock);
>
> This lock still needs to go away
Agree. I hope Changyuan's next version fixes it. I didn't really touch
any of these functions.
>
>> +static void kho_mem_serialize(void)
>> +{
>> + struct kho_mem_track *tracker = &kho_mem_track;
>> + khomem_desc_t *desc;
>> + unsigned long order;
>> +
>> + xa_for_each(&tracker->orders, order, desc) {
>> + if (WARN_ON(order >= NR_PAGE_ORDERS))
>> + break;
>> + kho_out.mem_tables[order] = *desc;
>
> Missing the virt_to_phys?
Nope. This isn't storing the pointer to the descriptor, but the _value_
of the descriptor -- so it already contains the physical address of the
level 4 table.
>
>> + nr_tables = min_t(unsigned int, len / sizeof(*tables), NR_PAGE_ORDERS);
>> + for (order = 0; order < nr_tables; order++)
>> + khomem_walk_preserved((khomem_desc_t *)&tables[order], order,
>
> Missing phys_to_virt
Same as above. tables contains the _values_ of the descriptors which
already have physical addresses which we turn into virtual ones in
khomem_table().
>
> Please dont' remove the KHOSER stuff, and do use it with proper
> structs and types. It is part of keeping this stuff understandable.
I didn't see any need for KHOSER stuff here to be honest. The only time
we deal with KHO pointers is with table addresses, and that is already
well abstracted in khomem_mkdesc() (I suppose that can require a
khomem_desc_t * instead of a void *, but beyond that it is quite easy to
understand IMO).
--
Regards,
Pratyush Yadav
next prev parent reply other threads:[~2025-04-03 17:37 UTC|newest]
Thread overview: 103+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-03-20 1:55 [PATCH v5 00/16] kexec: introduce Kexec HandOver (KHO) Changyuan Lyu
2025-03-20 1:55 ` [PATCH v5 01/16] kexec: define functions to map and unmap segments Changyuan Lyu
2025-03-20 1:55 ` [PATCH v5 02/16] mm/mm_init: rename init_reserved_page to init_deferred_page Changyuan Lyu
2025-03-20 7:10 ` Krzysztof Kozlowski
2025-03-20 17:15 ` Changyuan Lyu
2025-03-20 1:55 ` [PATCH v5 03/16] memblock: add MEMBLOCK_RSRV_KERN flag Changyuan Lyu
2025-03-20 1:55 ` [PATCH v5 04/16] memblock: Add support for scratch memory Changyuan Lyu
2025-03-20 1:55 ` [PATCH v5 05/16] memblock: introduce memmap_init_kho_scratch() Changyuan Lyu
2025-03-20 1:55 ` [PATCH v5 06/16] hashtable: add macro HASHTABLE_INIT Changyuan Lyu
2025-03-20 1:55 ` [PATCH v5 07/16] kexec: add Kexec HandOver (KHO) generation helpers Changyuan Lyu
2025-03-21 13:34 ` Jason Gunthorpe
2025-03-23 19:02 ` Changyuan Lyu
2025-03-24 16:28 ` Jason Gunthorpe
2025-03-25 0:21 ` Changyuan Lyu
2025-03-25 2:20 ` Jason Gunthorpe
2025-03-24 18:40 ` Frank van der Linden
2025-03-25 19:19 ` Mike Rapoport
2025-03-25 21:56 ` Frank van der Linden
2025-03-26 11:59 ` Mike Rapoport
2025-03-26 16:25 ` Frank van der Linden
2025-03-20 1:55 ` [PATCH v5 08/16] kexec: add KHO parsing support Changyuan Lyu
2025-03-20 1:55 ` [PATCH v5 09/16] kexec: enable KHO support for memory preservation Changyuan Lyu
2025-03-21 13:46 ` Jason Gunthorpe
2025-03-22 19:12 ` Mike Rapoport
2025-03-23 18:55 ` Jason Gunthorpe
2025-03-24 18:18 ` Mike Rapoport
2025-03-24 20:07 ` Jason Gunthorpe
2025-03-26 12:07 ` Mike Rapoport
2025-03-23 19:07 ` Changyuan Lyu
2025-03-25 2:04 ` Jason Gunthorpe
2025-03-27 10:03 ` Pratyush Yadav
2025-03-27 13:31 ` Jason Gunthorpe
2025-03-27 17:28 ` Pratyush Yadav
2025-03-28 12:53 ` Jason Gunthorpe
2025-04-02 16:44 ` Changyuan Lyu
2025-04-02 16:47 ` Pratyush Yadav
2025-04-02 18:37 ` Pasha Tatashin
2025-04-02 18:49 ` Pratyush Yadav
2025-04-02 19:16 ` Pratyush Yadav
2025-04-03 11:42 ` Jason Gunthorpe
2025-04-03 13:58 ` Mike Rapoport
2025-04-03 14:24 ` Jason Gunthorpe
2025-04-04 9:54 ` Mike Rapoport
2025-04-04 12:47 ` Jason Gunthorpe
2025-04-04 13:53 ` Mike Rapoport
2025-04-04 14:30 ` Jason Gunthorpe
2025-04-04 16:24 ` Pratyush Yadav
2025-04-04 17:31 ` Jason Gunthorpe
2025-04-06 16:13 ` Mike Rapoport
2025-04-06 16:11 ` Mike Rapoport
2025-04-07 14:16 ` Jason Gunthorpe
2025-04-07 16:31 ` Mike Rapoport
2025-04-07 17:03 ` Jason Gunthorpe
2025-04-09 9:06 ` Mike Rapoport
2025-04-09 12:56 ` Jason Gunthorpe
2025-04-09 13:58 ` Mike Rapoport
2025-04-09 15:37 ` Jason Gunthorpe
2025-04-09 16:19 ` Mike Rapoport
2025-04-09 16:28 ` Jason Gunthorpe
2025-04-10 16:51 ` Matthew Wilcox
2025-04-10 17:31 ` Jason Gunthorpe
2025-04-09 16:28 ` Mike Rapoport
2025-04-09 18:32 ` Jason Gunthorpe
2025-04-04 16:15 ` Pratyush Yadav
2025-04-06 16:34 ` Mike Rapoport
2025-04-07 14:23 ` Jason Gunthorpe
2025-04-03 13:57 ` Mike Rapoport
2025-04-11 4:02 ` Changyuan Lyu
2025-04-03 15:50 ` Pratyush Yadav
2025-04-03 16:10 ` Jason Gunthorpe
2025-04-03 17:37 ` Pratyush Yadav [this message]
2025-04-04 12:54 ` Jason Gunthorpe
2025-04-04 15:39 ` Pratyush Yadav
2025-04-09 8:35 ` Mike Rapoport
2025-03-20 1:55 ` [PATCH v5 10/16] kexec: add KHO support to kexec file loads Changyuan Lyu
2025-03-21 13:48 ` Jason Gunthorpe
2025-03-20 1:55 ` [PATCH v5 11/16] kexec: add config option for KHO Changyuan Lyu
2025-03-20 7:10 ` Krzysztof Kozlowski
2025-03-20 17:18 ` Changyuan Lyu
2025-03-24 4:18 ` Dave Young
2025-03-24 19:26 ` Pasha Tatashin
2025-03-25 1:24 ` Dave Young
2025-03-25 3:07 ` Dave Young
2025-03-25 6:57 ` Baoquan He
2025-03-25 8:36 ` Dave Young
2025-03-26 9:17 ` Dave Young
2025-03-26 11:28 ` Mike Rapoport
2025-03-26 12:09 ` Dave Young
2025-03-25 14:04 ` Pasha Tatashin
2025-03-20 1:55 ` [PATCH v5 12/16] arm64: add KHO support Changyuan Lyu
2025-03-20 7:13 ` Krzysztof Kozlowski
2025-03-20 8:30 ` Krzysztof Kozlowski
2025-03-20 23:29 ` Changyuan Lyu
2025-04-11 3:47 ` Changyuan Lyu
2025-03-20 1:55 ` [PATCH v5 13/16] x86/setup: use memblock_reserve_kern for memory used by kernel Changyuan Lyu
2025-03-20 1:55 ` [PATCH v5 14/16] x86: add KHO support Changyuan Lyu
2025-03-20 1:55 ` [PATCH v5 15/16] memblock: add KHO support for reserve_mem Changyuan Lyu
2025-03-20 1:55 ` [PATCH v5 16/16] Documentation: add documentation for KHO Changyuan Lyu
2025-03-20 14:45 ` Jonathan Corbet
2025-03-21 6:33 ` Changyuan Lyu
2025-03-21 13:46 ` Jonathan Corbet
2025-03-25 14:19 ` [PATCH v5 00/16] kexec: introduce Kexec HandOver (KHO) Pasha Tatashin
2025-03-25 15:03 ` Mike Rapoport
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=mafs0tt75p2nx.fsf@amazon.de \
--to=ptyadav@amazon.de \
--cc=akpm@linux-foundation.org \
--cc=anthony.yznaga@oracle.com \
--cc=arnd@arndb.de \
--cc=ashish.kalra@amd.com \
--cc=benh@kernel.crashing.org \
--cc=bp@alien8.de \
--cc=catalin.marinas@arm.com \
--cc=changyuanl@google.com \
--cc=corbet@lwn.net \
--cc=dave.hansen@linux.intel.com \
--cc=devicetree@vger.kernel.org \
--cc=dwmw2@infradead.org \
--cc=ebiederm@xmission.com \
--cc=graf@amazon.com \
--cc=hpa@zytor.com \
--cc=jgg@nvidia.com \
--cc=jgowans@amazon.com \
--cc=kexec@lists.infradead.org \
--cc=krzk@kernel.org \
--cc=linux-arm-kernel@lists.infradead.org \
--cc=linux-doc@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=luto@kernel.org \
--cc=mark.rutland@arm.com \
--cc=mingo@redhat.com \
--cc=pasha.tatashin@soleen.com \
--cc=pbonzini@redhat.com \
--cc=peterz@infradead.org \
--cc=robh+dt@kernel.org \
--cc=robh@kernel.org \
--cc=rostedt@goodmis.org \
--cc=rppt@kernel.org \
--cc=saravanak@google.com \
--cc=skinsburskii@linux.microsoft.com \
--cc=tglx@linutronix.de \
--cc=thomas.lendacky@amd.com \
--cc=usama.arif@bytedance.com \
--cc=will@kernel.org \
--cc=x86@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).