From: Alex Williamson <alex.williamson@redhat.com>
To: James Gowans <jgowans@amazon.com>
Cc: <linux-kernel@vger.kernel.org>,
Eric Biederman <ebiederm@xmission.com>,
<kexec@lists.infradead.org>, "Joerg Roedel" <joro@8bytes.org>,
Will Deacon <will@kernel.org>, <iommu@lists.linux.dev>,
Alexander Viro <viro@zeniv.linux.org.uk>,
"Christian Brauner" <brauner@kernel.org>,
<linux-fsdevel@vger.kernel.org>,
Paolo Bonzini <pbonzini@redhat.com>,
Sean Christopherson <seanjc@google.com>, <kvm@vger.kernel.org>,
Andrew Morton <akpm@linux-foundation.org>, <linux-mm@kvack.org>,
Alexander Graf <graf@amazon.com>,
David Woodhouse <dwmw@amazon.co.uk>,
"Jan H . Schoenherr" <jschoenh@amazon.de>,
Usama Arif <usama.arif@bytedance.com>,
Anthony Yznaga <anthony.yznaga@oracle.com>,
Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>,
<madvenka@linux.microsoft.com>, <steven.sistare@oracle.com>,
<yuleixzhang@tencent.com>
Subject: Re: [RFC 00/18] Pkernfs: Support persistence for live update
Date: Mon, 5 Feb 2024 10:10:40 -0700
Message-ID: <20240205101040.5d32a7e4.alex.williamson@redhat.com>
In-Reply-To: <20240205120203.60312-1-jgowans@amazon.com>
On Mon, 5 Feb 2024 12:01:45 +0000
James Gowans <jgowans@amazon.com> wrote:
> This RFC is to solicit feedback on the approach of implementing support for live
> update via an in-memory filesystem responsible for storing all live update state
> as files.
>
> Hypervisor live update is a mechanism to support updating a hypervisor via kexec
> in a way that has limited impact on running virtual machines. This is done by
> pausing/serialising running VMs, kexec-ing into a new kernel, starting new VMM
> processes and then deserialising/resuming the VMs so that they continue running
> from where they were. Virtual machines can have PCI devices passed through, and
> in order to support live update it’s necessary to persist the IOMMU page tables
> so that the devices can continue to do DMA to guest RAM during kexec.
>
> This RFC is a follow-on from a discussion held during LPC 2023 KVM MC
> which explored ways in which the live update problem could be tackled;
> this was one of them:
> https://lpc.events/event/17/contributions/1485/
>
> The approach sketched out in this RFC introduces a new in-memory filesystem,
> pkernfs. Pkernfs takes ownership of system RAM that is carved out of the
> normal MM allocator and donated to pkernfs, keeping it separate from the
> Linux memory management system. Files are created in pkernfs for the few
> things that need to be preserved and re-hydrated after kexec to support
> this:
>
> * Guest memory: to be able to restore the VM, its memory must be
> preserved. This is achieved by using a regular file in pkernfs for guest RAM.
> As this guest RAM is not part of the normal Linux core mm allocator and
> has no struct pages, it can be removed from the direct map, which
> improves the security posture for guest RAM, similar to memfd_secret.
>
> * IOMMU root page tables: for the IOMMU to keep translating DMA
> during kexec it needs root page tables to look up the per-domain page
> tables. IOMMU root page tables are stored in a special path in pkernfs:
> iommu/root-pgtables. The Intel IOMMU driver is modified to hook into
> pkernfs to get the chunk of memory from which it can allocate root
> pgtables.
>
> * IOMMU domain page tables: in order for VM-initiated DMA operations to
> continue running while kexec is happening, the IOVA to PA address
> translations for persisted devices need to continue to work. Similar to
> the root pgtables, the per-domain page tables for persisted devices are
> allocated from a pkernfs file so that they are also persisted across
> kexec (a sketch of such an allocator follows this list). Not all
> devices are persistent, so VFIO is updated to support defining
> persistent page tables on passed-through devices.
>
> * Updates to IOMMU and PCI are needed to make device handover across
> kexec work properly. Although not fully complete, some of the changes
> needed to avoid device resets and re-probing are sketched in this RFC.
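>
> As a very rough illustration of the allocator idea above (a simplified
> sketch rather than the code in the patches; every name below is invented
> for illustration): a bump allocator handing out page-sized chunks of a
> physically contiguous region that pkernfs has persisted.
>
>   #include <linux/string.h>   /* memset() */
>   #include <linux/types.h>    /* phys_addr_t */
>   #include <asm/page.h>       /* PAGE_SIZE */
>
>   /*
>    * Persisted region descriptor. In a real implementation the bump
>    * pointer would itself live inside the region so it survives kexec.
>    */
>   struct persistent_pgtable_region {
>           void        *base;      /* kernel VA of the region */
>           phys_addr_t  base_phys; /* physical base, stable across kexec */
>           size_t       size;      /* bytes donated by pkernfs */
>           size_t       next;      /* bump pointer */
>   };
>
>   static void *persistent_pgtable_alloc(struct persistent_pgtable_region *r,
>                                         phys_addr_t *phys)
>   {
>           void *page;
>
>           if (r->next + PAGE_SIZE > r->size)
>                   return NULL;    /* region exhausted */
>
>           page = r->base + r->next;
>           *phys = r->base_phys + r->next;
>           r->next += PAGE_SIZE;
>
>           /* The IOMMU hardware walks these pages: hand them out zeroed. */
>           memset(page, 0, PAGE_SIZE);
>           return page;
>   }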
>
> Guest RAM and IOMMU state are just the first two things needed for live update.
> Pkernfs opens the door for other kernel state to also be persisted as new
> files, which can improve kexec or add more capabilities to live update.
>
> The main aspect we’re looking for feedback/opinions on here is the concept of
> putting all persistent state in a single filesystem: combining guest RAM and
> IOMMU pgtables in one store. Also, the question of a hard separation between
> persistent memory and ephemeral memory, compared to allowing arbitrary pages to
> be persisted. Pkernfs does it via a hard separation defined at boot time;
> other approaches could make the carving out of persistent pages dynamic.
>
> Sign-offs are intentionally omitted to make it clear that this is a
> concept sketch RFC and not intended for merging.
>
> On CC are folks who have sent RFCs around this problem space before, as
> well as filesystem, kexec, IOMMU, MM and KVM lists and maintainers.
>
> == Alternatives ==
>
> There have been other RFCs which cover some aspect of the live update problem
> space. So far, all public approaches with KVM have neglected device assignment,
> which introduces a new dimension of problems. Prior art in this space includes:
>
> 1) Kexec Hand Over (KHO) [0]: This is a generic mechanism to pass kernel state
> across kexec. It also supports specifying persisted memory pages which could be
> used to carve IOMMU pgtable pages out of the new kernel’s buddy allocator.
>
> 2) PKRAM [1]: A tmpfs-style filesystem which dynamically allocates memory that can
> be used for guest RAM and is preserved across kexec by passing a pointer to the
> root page.
>
> 3) DMEMFS [2]: Similar to pkernfs, DMEMFS is a filesystem on top of a reserved
> chunk of memory specified via a kernel cmdline parameter. It is not persistent but
> aims to remove the need for struct page overhead.
>
> 4) Kernel memory pools [3, 4]: These provide a mechanism for kernel modules/drivers
> to allocate persistent memory, and restore that memory after kexec. They do
> not attempt to provide the ability to store userspace-accessible state or have a
> filesystem interface.
>
> == How to use ==
>
> Use the memmap and pkernfs cmdline args to carve memory out of system RAM and
> donate it to pkernfs. For example, to carve out 1 GiB of RAM starting at physical
> offset 1 GiB:
> memmap=1G%1G nopat pkernfs=1G!1G
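>
> For reference, a minimal sketch of how such a parameter could be parsed
> and the range kept away from the buddy allocator (a simplified
> illustration rather than the code in the patches; memparse() and
> memblock_reserve() are stock kernel helpers, everything else here is
> invented):
>
>   #include <linux/init.h>
>   #include <linux/kernel.h>   /* memparse() */
>   #include <linux/memblock.h>
>
>   static phys_addr_t pkernfs_base, pkernfs_size;
>
>   /* Parse pkernfs=<size>!<offset>; error handling and ordering against
>    * other early memory setup are glossed over. */
>   static int __init parse_pkernfs(char *p)
>   {
>           char *at;
>
>           pkernfs_size = memparse(p, &at);
>           if (*at == '!')
>                   pkernfs_base = memparse(at + 1, NULL);
>
>           /* Reserve the range so core MM never hands these pages out. */
>           memblock_reserve(pkernfs_base, pkernfs_size);
>           return 0;
>   }
>   early_param("pkernfs", parse_pkernfs);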
>
> Mount pkernfs somewhere, for example:
> mount -t pkernfs pkernfs /mnt/pkernfs
>
> Allocate a file for guest RAM:
> touch /mnt/pkernfs/guest-ram
> truncate -s 100M /mnt/pkernfs/guest-ram
>
> Add QEMU cmdline option to use this as guest RAM:
> -object memory-backend-file,id=pc.ram,size=100M,mem-path=/mnt/pkernfs/guest-ram,share=yes
> -M q35,memory-backend=pc.ram
>
> Allocate a file for IOMMU domain page tables:
> touch /mnt/pkernfs/iommu/dom-0
> truncate -s 2M /mnt/pkernfs/iommu/dom-0
>
> That file must be supplied to VFIO when creating the IOMMU container, via the
> VFIO_CONTAINER_SET_PERSISTENT_PGTABLES ioctl. Example: [4]
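>
> As a rough userspace sketch of that step (the ioctl name comes from this
> series; passing the pkernfs file's fd as the argument, and issuing it
> before VFIO_SET_IOMMU, are assumptions made purely for illustration):
>
>   #include <fcntl.h>
>   #include <sys/ioctl.h>
>   #include <linux/vfio.h>
>
>   /* Point a type1 container at the pkernfs domain pgtables file. */
>   static int setup_persistent_container(const char *group_path)
>   {
>           int container = open("/dev/vfio/vfio", O_RDWR);
>           int group = open(group_path, O_RDWR); /* e.g. /dev/vfio/26 */
>           int pgtables = open("/mnt/pkernfs/iommu/dom-0", O_RDWR);
>
>           if (container < 0 || group < 0 || pgtables < 0)
>                   return -1;
>
>           ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
>           ioctl(container, VFIO_CONTAINER_SET_PERSISTENT_PGTABLES, pgtables);
>           ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1v2_IOMMU);
>           return container;
>   }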
>
> After kexec, re-mount pkernfs and re-use those files for guest RAM and IOMMU
> state. When doing DMA mapping, specify the additional flag
> VFIO_DMA_MAP_FLAG_LIVE_UPDATE to indicate that the IOVAs are already set up.
> Example: [5].
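>
> Roughly, re-establishing a mapping after kexec then looks like the
> sketch below. The struct and VFIO_IOMMU_MAP_DMA are the existing type1
> uAPI; VFIO_DMA_MAP_FLAG_LIVE_UPDATE is added by this series, and the
> size/addresses simply mirror the 100M guest-ram file from the example
> above:
>
>   #include <fcntl.h>
>   #include <stdint.h>
>   #include <sys/ioctl.h>
>   #include <sys/mman.h>
>   #include <linux/vfio.h>
>
>   /* The flag tells type1 that the persisted IOMMU pgtables already
>    * translate this range, so only VFIO's own tracking of the mapping
>    * needs to be recreated. */
>   static int remap_after_kexec(int container)
>   {
>           size_t len = 100 << 20; /* the 100M guest-ram file above */
>           int ram = open("/mnt/pkernfs/guest-ram", O_RDWR);
>           void *vaddr = mmap(NULL, len, PROT_READ | PROT_WRITE,
>                              MAP_SHARED, ram, 0);
>           struct vfio_iommu_type1_dma_map map = {
>                   .argsz = sizeof(map),
>                   .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE |
>                            VFIO_DMA_MAP_FLAG_LIVE_UPDATE,
>                   .vaddr = (__u64)(uintptr_t)vaddr,
>                   .iova  = 0,     /* guest-physical address 0 */
>                   .size  = len,
>           };
>
>           return ioctl(container, VFIO_IOMMU_MAP_DMA, &map);
>   }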
>
> == Limitations ==
>
> This is an RFC design to sketch out the concept so that there can be a discussion
> about the general approach. There are many gaps and hacks; the idea is to keep
> this RFC as simple as possible. Limitations include:
>
> * Needing to supply the physical memory range for pkernfs as a kernel cmdline
> parameter. Better would be to allocate memory for pkernfs dynamically on first
> boot and pass that across kexec. Doing so would require additional integration
> with memblocks and some ability to pass the dynamically allocated ranges
> across. KHO [0] could support this.
>
> * A single filesystem with no support for NUMA awareness. Better would be to
> support multiple named pkernfs mounts which can cover different NUMA nodes.
>
> * Skeletal filesystem code. There’s just enough functionality to demonstrate
> the concept of using files for guest RAM and IOMMU state.
>
> * Use-after-frees for IOMMU mappings. Currently nothing stops pkernfs guest
> RAM files from being deleted or resized while IOMMU mappings are set up, which
> would allow DMA to freed memory. Better integration between guest RAM files and
> IOMMU/VFIO is necessary.
>
> * Needing to drive and re-hydrate the IOMMU page tables by defining an IOMMU file.
> Really we should move the abstraction one level up and make the whole VFIO
> container persistent via a pkernfs file. That way you’d "just" re-open the VFIO
> container file and all of the DMA mappings inside VFIO would already be set up.
Note that the vfio container is on a path towards deprecation; this
should be refocused on vfio relative to iommufd. There would need to
be a strong argument for a container/type1 extension to support this;
iommufd would need to be the first-class implementation.
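
For comparison, a very rough sketch of the same "map guest RAM" step
expressed against iommufd (IOMMU_IOAS_ALLOC and IOMMU_IOAS_MAP are the
existing uAPI; how persistence would be requested there is an open
question and deliberately not shown, and attaching the vfio device cdev
to the IOAS is omitted):

  #include <fcntl.h>
  #include <stddef.h>
  #include <stdint.h>
  #include <sys/ioctl.h>
  #include <linux/iommufd.h>

  /* Allocate an IOAS and map an mmap'ed pkernfs file into it at IOVA 0. */
  static int map_with_iommufd(void *vaddr, size_t len)
  {
          int iommufd = open("/dev/iommu", O_RDWR);
          struct iommu_ioas_alloc alloc = { .size = sizeof(alloc) };
          struct iommu_ioas_map map = {
                  .size = sizeof(map),
                  .flags = IOMMU_IOAS_MAP_FIXED_IOVA |
                           IOMMU_IOAS_MAP_READABLE |
                           IOMMU_IOAS_MAP_WRITEABLE,
                  .user_va = (__u64)(uintptr_t)vaddr,
                  .length = len,
                  .iova = 0,
          };

          if (iommufd < 0 || ioctl(iommufd, IOMMU_IOAS_ALLOC, &alloc))
                  return -1;
          map.ioas_id = alloc.out_ioas_id;
          return ioctl(iommufd, IOMMU_IOAS_MAP, &map);
  }

Thanks,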
Alex
> * Inefficient use of filesystem space. Every mapping block is 2 MiB, which is both
> wasteful and a hard upper limit on file size.
>
> [0] https://lore.kernel.org/kexec/20231213000452.88295-1-graf@amazon.com/
> [1] https://lore.kernel.org/kexec/1682554137-13938-1-git-send-email-anthony.yznaga@oracle.com/
> [2] https://lkml.org/lkml/2020/12/7/342
> [3] https://lore.kernel.org/all/169645773092.11424.7258549771090599226.stgit@skinsburskii./
> [4] https://lore.kernel.org/all/2023082506-enchanted-tripping-d1d5@gregkh/#t
> [5] https://github.com/jgowans/qemu/commit/e84cfb8186d71f797ef1f72d57d873222a9b479e
> [6] https://github.com/jgowans/qemu/commit/6e4f17f703eaf2a6f1e4cb2576d61683eaee02b0
>
>
> James Gowans (18):
> pkernfs: Introduce filesystem skeleton
> pkernfs: Add persistent inodes hooked into directories
> pkernfs: Define an allocator for persistent pages
> pkernfs: support file truncation
> pkernfs: add file mmap callback
> init: Add liveupdate cmdline param
> pkernfs: Add file type for IOMMU root pgtables
> iommu: Add allocator for pgtables from persistent region
> intel-iommu: Use pkernfs for root/context pgtable pages
> iommu/intel: zap context table entries on kexec
> dma-iommu: Always enable deferred attaches for liveupdate
> pkernfs: Add IOMMU domain pgtables file
> vfio: add ioctl to define persistent pgtables on container
> intel-iommu: Allocate domain pgtable pages from pkernfs
> pkernfs: register device memory for IOMMU domain pgtables
> vfio: support not mapping IOMMU pgtables on live-update
> pci: Don't clear bus master if persistence enabled
> vfio-pci: Assume device working after liveupdate
>
> drivers/iommu/Makefile | 1 +
> drivers/iommu/dma-iommu.c | 2 +-
> drivers/iommu/intel/dmar.c | 1 +
> drivers/iommu/intel/iommu.c | 93 +++++++++++++---
> drivers/iommu/intel/iommu.h | 5 +
> drivers/iommu/iommu.c | 22 ++--
> drivers/iommu/pgtable_alloc.c | 43 +++++++
> drivers/iommu/pgtable_alloc.h | 10 ++
> drivers/pci/pci-driver.c | 4 +-
> drivers/vfio/container.c | 27 +++++
> drivers/vfio/pci/vfio_pci_core.c | 20 ++--
> drivers/vfio/vfio.h | 2 +
> drivers/vfio/vfio_iommu_type1.c | 51 ++++++---
> fs/Kconfig | 1 +
> fs/Makefile | 3 +
> fs/pkernfs/Kconfig | 9 ++
> fs/pkernfs/Makefile | 6 +
> fs/pkernfs/allocator.c | 51 +++++++++
> fs/pkernfs/dir.c | 43 +++++++
> fs/pkernfs/file.c | 93 ++++++++++++++++
> fs/pkernfs/inode.c | 185 +++++++++++++++++++++++++++++++
> fs/pkernfs/iommu.c | 163 +++++++++++++++++++++++++++
> fs/pkernfs/pkernfs.c | 115 +++++++++++++++++++
> fs/pkernfs/pkernfs.h | 61 ++++++++++
> include/linux/init.h | 1 +
> include/linux/iommu.h | 6 +-
> include/linux/pkernfs.h | 38 +++++++
> include/uapi/linux/vfio.h | 10 ++
> init/main.c | 10 ++
> 29 files changed, 1029 insertions(+), 47 deletions(-)
> create mode 100644 drivers/iommu/pgtable_alloc.c
> create mode 100644 drivers/iommu/pgtable_alloc.h
> create mode 100644 fs/pkernfs/Kconfig
> create mode 100644 fs/pkernfs/Makefile
> create mode 100644 fs/pkernfs/allocator.c
> create mode 100644 fs/pkernfs/dir.c
> create mode 100644 fs/pkernfs/file.c
> create mode 100644 fs/pkernfs/inode.c
> create mode 100644 fs/pkernfs/iommu.c
> create mode 100644 fs/pkernfs/pkernfs.c
> create mode 100644 fs/pkernfs/pkernfs.h
> create mode 100644 include/linux/pkernfs.h
>