From: Anthony Yznaga <anthony.yznaga@oracle.com>
To: "Gowans, James" <jgowans@amazon.com>,
"linux-mm@kvack.org" <linux-mm@kvack.org>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Cc: "kexec@lists.infradead.org" <kexec@lists.infradead.org>,
"jason.zeng@intel.com" <jason.zeng@intel.com>,
"keescook@chromium.org" <keescook@chromium.org>,
"lei.l.li@intel.com" <lei.l.li@intel.com>,
"luto@kernel.org" <luto@kernel.org>,
"rppt@kernel.org" <rppt@kernel.org>,
"dave.hansen@linux.intel.com" <dave.hansen@linux.intel.com>,
"steven.sistare@oracle.com" <steven.sistare@oracle.com>,
"Graf (AWS), Alexander" <graf@amazon.de>,
"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
"mgalaxy@akamai.com" <mgalaxy@akamai.com>,
"mingo@redhat.com" <mingo@redhat.com>,
"fam.zheng@bytedance.com" <fam.zheng@bytedance.com>,
"Woodhouse, David" <dwmw@amazon.co.uk>,
"tglx@linutronix.de" <tglx@linutronix.de>,
"yuleixzhang@tencent.com" <yuleixzhang@tencent.com>,
"ebiederm@xmission.com" <ebiederm@xmission.com>,
"hpa@zytor.com" <hpa@zytor.com>,
"peterz@infradead.org" <peterz@infradead.org>,
"bp@alien8.de" <bp@alien8.de>, "x86@kernel.org" <x86@kernel.org>
Subject: Re: [RFC v3 00/21] Preserved-over-Kexec RAM
Date: Wed, 31 May 2023 16:14:10 -0700 [thread overview]
Message-ID: <66d7eda2-c136-1245-b433-784264b31683@oracle.com> (raw)
In-Reply-To: <a4f62a8e1b0f43db005cc1117c06c00e6c0c85ff.camel@amazon.com>
On 5/26/23 6:57 AM, Gowans, James wrote:
> On Wed, 2023-04-26 at 17:08 -0700, Anthony Yznaga wrote:
>> Sending out this RFC in part to gauge community interest.
>> This patchset implements preserved-over-kexec memory storage or PKRAM as a
>> method for saving memory pages of the currently executing kernel so that
>> they may be restored after kexec into a new kernel. The patches are adapted
>> from an RFC patchset sent out in 2013 by Vladimir Davydov [1]. They
>> introduce the PKRAM kernel API.
>>
>> One use case for PKRAM is preserving guest memory and/or auxiliary
>> supporting data (e.g. iommu data) across kexec to support reboot of the
>> host with minimal disruption to the guest.
> Hi Anthony,
Hi James,
Thank you for looking at this.
>
> Thanks for re-posting this - I've been wanting to rekindle the discussion
> on preserving memory across kexec for a while now.
>
> There are a few aspects at play in this space of memory management
> designed specifically for the virtualisation and live update (kexec) use-
> case which I think we should consider:
>
> 1. Preserving userspace-accessible memory across kexec: this is what pkram
> addresses.
>
> 2. Preserving kernel state: This would include memory required for kexec
> with DMA passthrough devices, like IOMMU root page and page tables, DMA-
> able buffers for drivers, etc. Also certain structures for improved kernel
> boot performance after kexec, like a PCI device cache, clock LPJ and
> possible others, sort of what Xen breadcrumbs [0] achieves. The pkram RFC
> indicates that this should be possible, though IMO this could be more
> straightforward to do with a new filesystem with first-class support for
> kernel persistence via something like inode types for kernel data.
PKRAM as it is now can preserve kernel data by streaming bytes to a
PKRAM object, but the data must be location independent since it is
stored in allocated 4k pages rather than preserved in place. This
really isn't usable for things like page tables or memory that must
not move because of DMA, etc.
One issue with preserving non-relocatable, regular memory that is not
partitioned from the kernel is the risk that a kexec kernel has already
been loaded and that its pre-computed destination where it will be copied
to on reboot will overwrite the preserved memory. Either some way of
re-processing the loaded kexec image to place it elsewhere would be
needed, or kexec load would need to be restricted from targeting
regions where memory might be preserved. Pluses for a partitioning
approach.
>
> 3. Ensuring huge/gigantic memory allocations: to improve the TLB perf of
> 2-stage translations it's beneficial to allocate guest memory in large
> contiguous blocks, preferably PUD-level blocks for multi-GiB guests. If
> the buddy allocator is used this may be a challenge both from an
> implementation and a fragmentation perspective, and it may be desirable to
> have stronger guarantees about allocation sizes.
Agreed that guaranteeing large blocks and fragmentation are issues for
PKRAM. One possible avenue to address this could be to support preserving
hugetlb pages.
>
> 4. Removing struct page overhead: When doing the huge/gigantic
> allocations, in general it won't be necessary to have per-4-KiB struct
> pages. This is something which dmemfs [1, 2] tries to achieve by using a
> large chunk of reserved memory and managing it with a new filesystem.
Has using DAX been considered? I'm not familiar with dmemfs, but it
sounds functionally similar.
>
> 5. More "advanced" memory management APIs/ioctls for virtualisation: Being
> able to support things like DMA-driven post-copy live migration, memory
> oversubscription, carving out chunks of memory from a VM to launch side-
> car VMs, more fine-grain control of IOMMU or MMU permissions, etc. This
> may be easier to achieve with a new filesystem, rather than coupling to
> tmpfs semantics and ioctls.
>
> Overall, with the above in mind, my take is that we may have a smoother
> path to implement a more comprehensive solution by going the route of a
> new purpose-built file system on top of reserved memory. Sort of like
> dmemfs with persistence and specifically support for kernel persistence.
>
> Does my take here make sense?
Yes, I believe so. There are some serious issues with PKRAM to address
before it could be truly viable (fragmentation, relocation, etc), so
a memory partitioning approach might be the way to go.
>
> I'm hoping to put together an RFC for something like the above (dmemfs
> with persistence) soon, focusing on how the IOMMU persistence will work.
> This is an important differentiating factor to cover in the RFC, IMO.
Great! I'll keep an eye out for it.
Anthony
>
>> PKRAM provides a flexible way
>> for doing this without requiring that the amount of memory used be of a
>> fixed size created a priori.
> AFAICT the main down-side of what I'm suggesting here compared to pkram,
> is that as you say here: pkram doesn't require the up-front reserving of
> memory - allocations from the global shared pool are dynamic. I'm on the
> fence as to whether this is actually a desirable property though. Carving
> out a large chunk of system memory as reserved memory for a persisted
> filesystem (as I'm suggesting) has the advantages of removing struct page
> overhead, providing better guarantees about huge/gigantic page
> allocations, and probably makes the kexec restore path simpler and more
> self-contained.
>
> I think there's an argument to be made that having a clearly-defined large
> range of memory which is persisted, and the rest is normal "ephemeral"
> kernel memory may be preferable.
>
> Keen to hear your (and others) thoughts!
>
> JG
>
> [0] http://david.woodhou.se/live-update-handover.pdf
> [1] https://lwn.net/Articles/839216/
> [2] https://lkml.org/lkml/2020/12/7/342