From: Jason Gunthorpe <jgg@ziepe.ca>
To: James Gowans <jgowans@amazon.com>
Cc: linux-kernel@vger.kernel.org,
Eric Biederman <ebiederm@xmission.com>,
kexec@lists.infradead.org, Joerg Roedel <joro@8bytes.org>,
Will Deacon <will@kernel.org>,
iommu@lists.linux.dev, Alexander Viro <viro@zeniv.linux.org.uk>,
Christian Brauner <brauner@kernel.org>,
linux-fsdevel@vger.kernel.org,
Paolo Bonzini <pbonzini@redhat.com>,
Sean Christopherson <seanjc@google.com>,
kvm@vger.kernel.org, Andrew Morton <akpm@linux-foundation.org>,
linux-mm@kvack.org, Alexander Graf <graf@amazon.com>,
David Woodhouse <dwmw@amazon.co.uk>,
"Jan H . Schoenherr" <jschoenh@amazon.de>,
Usama Arif <usama.arif@bytedance.com>,
Anthony Yznaga <anthony.yznaga@oracle.com>,
Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>,
madvenka@linux.microsoft.com, steven.sistare@oracle.com,
yuleixzhang@tencent.com
Subject: Re: [RFC 00/18] Pkernfs: Support persistence for live update
Date: Mon, 5 Feb 2024 13:42:38 -0400 [thread overview]
Message-ID: <20240205174238.GC31743@ziepe.ca> (raw)
In-Reply-To: <20240205120203.60312-1-jgowans@amazon.com>
On Mon, Feb 05, 2024 at 12:01:45PM +0000, James Gowans wrote:
> The main aspect we’re looking for feedback/opinions on here is the concept of
> putting all persistent state in a single filesystem: combining guest RAM and
> IOMMU pgtables in one store. Also, the question of a hard separation between
> persistent memory and ephemeral memory, compared to allowing arbitrary pages to
> be persisted. Pkernfs does it via a hard separation defined at boot time, other
> approaches could make the carving out of persistent pages dynamic.
I think if you are going to attempt something like this then the end
result must bring things back to having the same data structures fully
restored.
It is fine that the pkernfs holds some persistant memory that
guarentees the IOMMU can remain programmed and the VM pages can become
fixed across the kexec
But once the VMM starts to restore it self we need to get back to the
original configuration:
- A mmap that points to the VM's physical pages
- An iommufd IOAS that points to the above mmap
- An iommufd HWPT that represents that same mapping
- An iommu_domain programmed into HW that the HWPT
Ie you can't just reboot and leave the IOMMU hanging out in some
undefined land - especially in latest kernels!
For vt-d you need to retain the entire root table and all the required
context entries too, The restarting iommu needs to understand that it
has to "restore" a temporary iommu_domain from the pkernfs.
You can later reconstitute a proper iommu_domain from the VMM and
atomic switch.
So, I'm surprised to see this approach where things just live forever
in the kernfs, I don't see how "restore" is going to work very well
like this.
I would think that a save/restore mentalitity would make more
sense. For instance you could make a special iommu_domain that is fixed
and lives in the pkernfs. The operation would be to copy from the live
iommu_domain to the fixed one and then replace the iommu HW to the
fixed one.
In the post-kexec world the iommu would recreate that special domain
and point the iommu at it. (copying the root and context descriptions
out of the pkernfs). Then somehow that would get into iommufd and VFIO
so that it could take over that special mapping during its startup.
Then you'd build the normal operating ioas and hwpt (with all the
right page refcounts/etc) then switch to it and free the pkernfs
memory.
It seems alot less invasive to me. The special case is clearly a
special case and doesn't mess up the normal operation of the drivers.
It becomes more like kdump where the iommu driver is running in a
fairly normal mode, just with some stuff copied from the prior kernel.
Your text spent alot of time talking about the design of how the pages
persist, which is interesting, but it seems like only a small part of
the problem. Actually using that mechanism in a sane way and cover all
the functional issues in the HW drivers is going to be really
challenging.
> * Needing to drive and re-hydrate the IOMMU page tables by defining an IOMMU file.
> Really we should move the abstraction one level up and make the whole VFIO
> container persistent via a pkernfs file. That way you’d "just" re-open the VFIO
> container file and all of the DMA mappings inside VFIO would already be set up.
I doubt this.. It probably needs to be much finer grained actually,
otherwise you are going to be serializing everything. Somehow I think
you are better to serialize a minimum and try to reconstruct
everything else in userspace. Like conserving iommufd IDs would be a
huge PITA.
There are also going to be lots of security questions here, like we
can't just let userspace feed in any garbage and violate vfio and
iommu invariants.
Jason
next prev parent reply other threads:[~2024-02-05 17:42 UTC|newest]
Thread overview: 30+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-02-05 12:01 [RFC 00/18] Pkernfs: Support persistence for live update James Gowans
2024-02-05 12:01 ` [RFC 01/18] pkernfs: Introduce filesystem skeleton James Gowans
2024-02-05 12:01 ` [RFC 02/18] pkernfs: Add persistent inodes hooked into directies James Gowans
2024-02-05 12:01 ` [RFC 03/18] pkernfs: Define an allocator for persistent pages James Gowans
2024-02-05 12:01 ` [RFC 04/18] pkernfs: support file truncation James Gowans
2024-02-05 12:01 ` [RFC 05/18] pkernfs: add file mmap callback James Gowans
2024-02-05 23:34 ` Dave Chinner
2024-02-05 12:01 ` [RFC 06/18] init: Add liveupdate cmdline param James Gowans
2024-02-05 12:01 ` [RFC 07/18] pkernfs: Add file type for IOMMU root pgtables James Gowans
2024-02-05 12:01 ` [RFC 08/18] iommu: Add allocator for pgtables from persistent region James Gowans
2024-02-05 12:01 ` [RFC 09/18] intel-iommu: Use pkernfs for root/context pgtable pages James Gowans
2024-02-05 12:01 ` [RFC 10/18] iommu/intel: zap context table entries on kexec James Gowans
2024-02-05 12:01 ` [RFC 11/18] dma-iommu: Always enable deferred attaches for liveupdate James Gowans
2024-02-05 17:45 ` Jason Gunthorpe
2024-02-05 12:01 ` [RFC 12/18] pkernfs: Add IOMMU domain pgtables file James Gowans
2024-02-05 12:01 ` [RFC 13/18] vfio: add ioctl to define persistent pgtables on container James Gowans
2024-02-05 17:08 ` Jason Gunthorpe
2024-02-05 12:01 ` [RFC 14/18] intel-iommu: Allocate domain pgtable pages from pkernfs James Gowans
2024-02-05 17:12 ` Jason Gunthorpe
2024-02-05 12:02 ` [RFC 15/18] pkernfs: register device memory for IOMMU domain pgtables James Gowans
2024-02-05 12:02 ` [RFC 16/18] vfio: support not mapping IOMMU pgtables on live-update James Gowans
2024-02-05 12:02 ` [RFC 17/18] pci: Don't clear bus master is persistence enabled James Gowans
2024-02-05 12:02 ` [RFC 18/18] vfio-pci: Assume device working after liveupdate James Gowans
2024-02-05 17:10 ` [RFC 00/18] Pkernfs: Support persistence for live update Alex Williamson
2024-02-07 14:56 ` Gowans, James
2024-02-07 15:28 ` Jason Gunthorpe
2024-02-05 17:42 ` Jason Gunthorpe [this message]
2024-02-07 14:45 ` Gowans, James
2024-02-07 15:22 ` Jason Gunthorpe
2024-02-06 14:55 ` Luca Boccassi
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20240205174238.GC31743@ziepe.ca \
--to=jgg@ziepe.ca \
--cc=akpm@linux-foundation.org \
--cc=anthony.yznaga@oracle.com \
--cc=brauner@kernel.org \
--cc=dwmw@amazon.co.uk \
--cc=ebiederm@xmission.com \
--cc=graf@amazon.com \
--cc=iommu@lists.linux.dev \
--cc=jgowans@amazon.com \
--cc=joro@8bytes.org \
--cc=jschoenh@amazon.de \
--cc=kexec@lists.infradead.org \
--cc=kvm@vger.kernel.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=madvenka@linux.microsoft.com \
--cc=pbonzini@redhat.com \
--cc=seanjc@google.com \
--cc=skinsburskii@linux.microsoft.com \
--cc=steven.sistare@oracle.com \
--cc=usama.arif@bytedance.com \
--cc=viro@zeniv.linux.org.uk \
--cc=will@kernel.org \
--cc=yuleixzhang@tencent.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).