From: Jason Gunthorpe <jgg@nvidia.com>
To: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Mike Rapoport <rppt@kernel.org>,
linux-kernel@vger.kernel.org, Alexander Graf <graf@amazon.com>,
Andrew Morton <akpm@linux-foundation.org>,
Andy Lutomirski <luto@kernel.org>,
Anthony Yznaga <anthony.yznaga@oracle.com>,
Arnd Bergmann <arnd@arndb.de>,
Ashish Kalra <ashish.kalra@amd.com>,
Benjamin Herrenschmidt <benh@kernel.crashing.org>,
Borislav Petkov <bp@alien8.de>,
Catalin Marinas <catalin.marinas@arm.com>,
Dave Hansen <dave.hansen@linux.intel.com>,
David Woodhouse <dwmw2@infradead.org>,
Eric Biederman <ebiederm@xmission.com>,
Ingo Molnar <mingo@redhat.com>, James Gowans <jgowans@amazon.com>,
Jonathan Corbet <corbet@lwn.net>,
Krzysztof Kozlowski <krzk@kernel.org>,
Mark Rutland <mark.rutland@arm.com>,
Paolo Bonzini <pbonzini@redhat.com>,
"H. Peter Anvin" <hpa@zytor.com>,
Peter Zijlstra <peterz@infradead.org>,
Pratyush Yadav <ptyadav@amazon.de>,
Rob Herring <robh+dt@kernel.org>, Rob Herring <robh@kernel.org>,
Saravana Kannan <saravanak@google.com>,
Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>,
Steven Rostedt <rostedt@goodmis.org>,
Thomas Gleixner <tglx@linutronix.de>,
Tom Lendacky <thomas.lendacky@amd.com>,
Usama Arif <usama.arif@bytedance.com>,
Will Deacon <will@kernel.org>,
devicetree@vger.kernel.org, kexec@lists.infradead.org,
linux-arm-kernel@lists.infradead.org, linux-doc@vger.kernel.org,
linux-mm@kvack.org, x86@kernel.org
Subject: Re: [PATCH v4 05/14] kexec: Add Kexec HandOver (KHO) generation helpers
Date: Wed, 12 Feb 2025 11:23:36 -0400 [thread overview]
Message-ID: <20250212152336.GA3848889@nvidia.com> (raw)
In-Reply-To: <20250211163720.GH3754072@nvidia.com>
On Tue, Feb 11, 2025 at 12:37:20PM -0400, Jason Gunthorpe wrote:
> To do that you need to preserve folios as the basic primitive.
I made a small sketch of what I suggest.
I imagine the FDT schema for this would look something like this:
/dts-v1/;

/ {
	compatible = "linux-kho,v1";
	phys-addr-size = <64>;
	void-p-size = <64>;
	preserved-folio-map = <phys_addr>;

	// The per "driver" storage
	instance@1 {..};
	instance@2 {..};
};
I think this is a lot better than what is in this series. It uses much
less memory when there are a lot of allocations, it supports folios of
any order, it is efficient for 1G guestmemfd folios, and it only needs
a few bytes in the FDT. It could also preserve and restore the high
order folio struct page folding (HVO).
The use cases I'm imagining for drivers would be pushing gigabytes of
memory into this preservation mechanism. It needs to be scalable!
This also illustrates my point that I don't think FDT is a good
representation to use exclusively. This in-memory structure is much
better and faster than trying to represent the same information
embedded directly into the FDT. I imagine this to be the general
pattern that drivers will want to use. A few bytes in the FDT pointing
at a scalable in-memory structure for the bulk of the data.
/*
 * Keep track of folio memory that is to be preserved across KHO.
 *
 * This is designed with the idea that the system will have a lot of memory, eg
 * 1TB, and the majority of it will be ~1G folios assigned to a hugetlb/etc
 * being used to back guest memory. This would leave a smaller amount of memory,
 * eg 16G, reserved for the hypervisor to use. The pages to preserve across KHO
 * would be randomly distributed over the hypervisor memory. The hypervisor
 * memory is not required to be contiguous.
 *
 * This approach is fully incremental: as serialization progresses, folios can
 * continue to be aggregated into the tracker. The final step, immediately
 * prior to kexec, serializes the xarray information into a linked list for the
 * successor kernel to parse.
 *
 * The serializing side uses two levels of xarrays to manage chunks of per-order
 * 512 byte bitmaps. For instance the entire 1G order of a 1TB system would fit
 * inside a single 512 byte bitmap. For order 0 allocations each bitmap will
 * cover 16M of address space. Thus, for 16G of hypervisor memory at most 512K
 * of bitmap memory will be needed for order 0.
 */
struct kho_mem_track {
	/* Points to kho_mem_phys, each order gets its own bitmap tree */
	struct xarray orders;
};

struct kho_mem_phys {
	/*
	 * Points to kho_mem_phys_bits, a sparse bitmap array. Each bit is
	 * sized to order.
	 */
	struct xarray phys_bits;
};

#define PRESERVE_BITS (512 * 8)

struct kho_mem_phys_bits {
	DECLARE_BITMAP(preserve, PRESERVE_BITS);
};

static void *xa_load_or_alloc(struct xarray *xa, unsigned long index,
			      size_t elmsz)
{
	void *elm;
	void *res;

	elm = xa_load(xa, index);
	if (elm)
		return elm;

	elm = kzalloc(elmsz, GFP_KERNEL);
	if (!elm)
		return ERR_PTR(-ENOMEM);

	res = xa_cmpxchg(xa, index, NULL, elm, GFP_KERNEL);
	if (xa_is_err(res)) {
		kfree(elm);
		return ERR_PTR(xa_err(res));
	}
	if (res != NULL) {
		/* Lost the race, someone else installed the element */
		kfree(elm);
		return res;
	}
	return elm;
}

/*
 * Record that the entire folio under virt is preserved across KHO. virt must
 * have come from alloc_pages/folio_alloc or similar and point to the first page
 * of the folio. The order will be preserved as well.
 */
int kho_preserve_folio(struct kho_mem_track *tracker, void *virt)
{
	struct folio *folio = virt_to_folio(virt);
	unsigned int order = folio_order(folio);
	phys_addr_t phys = virt_to_phys(virt);
	struct kho_mem_phys_bits *bits;
	struct kho_mem_phys *physxa;

	might_sleep();

	physxa = xa_load_or_alloc(&tracker->orders, order, sizeof(*physxa));
	if (IS_ERR(physxa))
		return PTR_ERR(physxa);

	phys >>= PAGE_SHIFT + order;
	static_assert(sizeof(phys_addr_t) <= sizeof(unsigned long));
	bits = xa_load_or_alloc(&physxa->phys_bits, phys / PRESERVE_BITS,
				sizeof(*bits));
	if (IS_ERR(bits))
		return PTR_ERR(bits);

	set_bit(phys % PRESERVE_BITS, bits->preserve);
	return 0;
}

#define KHOSER_PTR(type) union {phys_addr_t phys; type ptr;}
#define KHOSER_STORE_PTR(dest, val)                  \
	({                                           \
		typecheck(typeof((dest).ptr), val);  \
		(dest).phys = virt_to_phys(val);     \
	})
#define KHOSER_LOAD_PTR(src) \
	((src).phys ? (typeof((src).ptr))phys_to_virt((src).phys) : NULL)

struct khoser_mem_bitmap_ptr {
	phys_addr_t phys_start;
	KHOSER_PTR(struct kho_mem_phys_bits *) bitmap;
};

struct khoser_mem_chunk {
	unsigned int order;
	unsigned int num_elms;
	KHOSER_PTR(struct khoser_mem_chunk *) next;
	struct khoser_mem_bitmap_ptr
		bitmaps[(PAGE_SIZE - 16) / sizeof(struct khoser_mem_bitmap_ptr)];
};
static_assert(sizeof(struct khoser_mem_chunk) == PAGE_SIZE);

static int new_chunk(struct khoser_mem_chunk **cur_chunk, unsigned int order)
{
	struct khoser_mem_chunk *chunk;

	chunk = kzalloc(sizeof(*chunk), GFP_KERNEL);
	if (!chunk)
		return -ENOMEM;
	chunk->order = order;
	if (*cur_chunk)
		KHOSER_STORE_PTR((*cur_chunk)->next, chunk);
	*cur_chunk = chunk;
	return 0;
}

/*
 * Record all the bitmaps in a linked list of pages for the next kernel to
 * process. Each chunk holds bitmaps of the same order and each block of bitmaps
 * starts at a given physical address. This allows the bitmaps to be sparse. The
 * xarray is used to store them in a tree while building up the data structure,
 * but the KHO successor kernel only needs to process them once in order.
 *
 * All of this memory is normal kmalloc() memory and is not marked for
 * preservation. The successor kernel will remain isolated to the scratch space
 * until it completes processing this list. Once processed all the memory
 * storing these ranges will be marked as free.
 */
int kho_serialize(struct kho_mem_track *tracker, phys_addr_t *fdt_value)
{
	struct khoser_mem_chunk *first_chunk = NULL;
	struct khoser_mem_chunk *chunk = NULL;
	struct kho_mem_phys *physxa;
	unsigned long order;
	int ret;

	xa_for_each(&tracker->orders, order, physxa) {
		struct kho_mem_phys_bits *bits;
		unsigned long phys;

		ret = new_chunk(&chunk, order);
		if (ret)
			goto err_free;
		if (!first_chunk)
			first_chunk = chunk;

		xa_for_each(&physxa->phys_bits, phys, bits) {
			struct khoser_mem_bitmap_ptr *elm;

			if (chunk->num_elms == ARRAY_SIZE(chunk->bitmaps)) {
				ret = new_chunk(&chunk, order);
				if (ret)
					goto err_free;
			}
			elm = &chunk->bitmaps[chunk->num_elms];
			chunk->num_elms++;
			/* Undo the shifts done by kho_preserve_folio() */
			elm->phys_start = (phys * PRESERVE_BITS) <<
					  (order + PAGE_SHIFT);
			KHOSER_STORE_PTR(elm->bitmap, bits);
		}
	}

	*fdt_value = first_chunk ? virt_to_phys(first_chunk) : 0;
	return 0;

err_free:
	chunk = first_chunk;
	while (chunk) {
		struct khoser_mem_chunk *tmp = chunk;

		chunk = KHOSER_LOAD_PTR(chunk->next);
		kfree(tmp);
	}
	return ret;
}

static void preserve_bitmap(unsigned int order,
			    struct khoser_mem_bitmap_ptr *elm)
{
	struct kho_mem_phys_bits *bitmap = KHOSER_LOAD_PTR(elm->bitmap);
	unsigned int bit;

	for_each_set_bit(bit, bitmap->preserve, PRESERVE_BITS) {
		phys_addr_t phys = elm->phys_start +
				   ((phys_addr_t)bit << (order + PAGE_SHIFT));

		// Do the struct page stuff..
	}
}

void kho_deserialize(phys_addr_t fdt_value)
{
	struct khoser_mem_chunk *chunk;

	if (!fdt_value)
		return;

	chunk = phys_to_virt(fdt_value);
	while (chunk) {
		unsigned int i;

		for (i = 0; i != chunk->num_elms; i++)
			preserve_bitmap(chunk->order, &chunk->bitmaps[i]);
		chunk = KHOSER_LOAD_PTR(chunk->next);
	}
}