public inbox for linux-kernel@vger.kernel.org
* [RFC PATCH 00/13] Support iommu(fd) persistence for live update
@ 2024-09-16 11:30 James Gowans
  2024-09-16 11:30 ` [RFC PATCH 01/13] iommufd: Support marking and tracking persistent iommufds James Gowans
                   ` (12 more replies)
  0 siblings, 13 replies; 33+ messages in thread
From: James Gowans @ 2024-09-16 11:30 UTC (permalink / raw)
  To: linux-kernel
  Cc: Jason Gunthorpe, Kevin Tian, Joerg Roedel,
	Krzysztof Wilczyński, Will Deacon, Robin Murphy,
	Mike Rapoport, Madhavan T. Venkataraman, iommu,
	Sean Christopherson, Paolo Bonzini, kvm, David Woodhouse,
	Lu Baolu, Alexander Graf, anthony.yznaga, steven.sistare,
	nh-open-source, Saenz Julienne, Nicolas

Live update is a mechanism for updating a hypervisor in a way that has
limited impact on running virtual machines. This is done by
pausing/serialising running VMs, kexec-ing into a new kernel, starting
new VMM processes and then deserialising/resuming the VMs so that they
continue running from where they were. When the VMs have DMA devices
assigned to them, the IOMMU state and page tables need to be persisted
so that DMA transactions can continue across the kexec.

Currently Linux has no mechanism for keeping DMA running across kexec
and picking up handles to the original DMA mappings afterwards. We are
looking for a path to support this capability, which is necessary for
live update. This RFC patch series sketches out a potential solution,
and we are looking for feedback on the approach and the userspace
interfaces.

This RFC is intended to serve as the discussion ground for a Linux
Plumbers Conf 2024 session on iommu persistence:
https://lpc.events/event/18/contributions/1686/
... and a BoF session on memory persistence in general:
https://lpc.events/event/18/contributions/1970/
Please join those to further this discussion.

The concept is as follows:
IOMMUFDs are marked as persistent via a new option.  When a struct
iommu_domain is allocated by an iommufd, that iommu_domain also gets
marked as persistent.  Before kexec, iommufd serialises the metadata for
all persistent domains to the KHO device tree blob. Similarly the IOMMU
platform driver marks all of the page table pages as persistent.  After
kexec the persistent IOMMUFDs are exposed to userspace in sysfs.  Fresh
IOMMUFD objects are built up from the data which was passed across in
KHO. Userspace can open these sysfs files to get a handle on the IOMMUFD
again. Iommufd ensures that only persistent memory can be mapped into
persistent address spaces.

This depends on KHO as the foundational framework for passing data
across kexec and for marking pages as persistent:
https://lore.kernel.org/all/20240117144704.602-1-graf@amazon.com/

It also depends on guestmemfs as the provider of persistent guest RAM to
be mapped into the IOMMU:
https://lore.kernel.org/all/20240805093245.889357-1-jgowans@amazon.com/

The code is not a complete solution; it is more of a sketch to show the
moving parts and proposed interfaces, in order to gather feedback and
have the discussion. Only a small portion of the IOMMUFD objects and
dmar_domains is serialised; it is still necessary to work out all of the
data which needs to be serialised to produce a fully working
deserialised object.

Sign-offs are omitted to make it clear that this is not for merging yet.

Adding maintainers from IOMMU drivers, iommufd, kexec and KVM (seeing as
this is designed for live update of a hypervisor specifically), and
others who have engaged with the topic of memory persistence previously.

James Gowans (13):
  iommufd: Support marking and tracking persistent iommufds
  iommufd: Add plumbing for KHO (de)serialise
  iommu/intel: zap context table entries on kexec
  iommu: Support marking domains as persistent on alloc
  iommufd: Serialise persisted iommufds and ioas
  iommufd: Expose persistent iommufd IDs in sysfs
  iommufd: Re-hydrate a usable iommufd ctx from sysfs
  intel-iommu: Add serialise and deserialise boilerplate
  intel-iommu: Serialise dmar_domain on KHO activate
  intel-iommu: Re-hydrate persistent domains after kexec
  iommu: Add callback to restore persisted iommu_domain
  iommufd, guestmemfs: Ensure persistent file used for persistent DMA
  iommufd, guestmemfs: Pin files when mapped for persistent DMA

 drivers/iommu/amd/iommu.c                   |   4 +-
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c |   3 +-
 drivers/iommu/intel/Makefile                |   1 +
 drivers/iommu/intel/dmar.c                  |   1 +
 drivers/iommu/intel/iommu.c                 |  84 ++++++--
 drivers/iommu/intel/iommu.h                 |  31 +++
 drivers/iommu/intel/serialise.c             | 174 ++++++++++++++++
 drivers/iommu/iommufd/Makefile              |   1 +
 drivers/iommu/iommufd/hw_pagetable.c        |   5 +-
 drivers/iommu/iommufd/io_pagetable.c        |   2 +-
 drivers/iommu/iommufd/ioas.c                |  26 +++
 drivers/iommu/iommufd/iommufd_private.h     |  34 ++++
 drivers/iommu/iommufd/main.c                |  75 ++++++-
 drivers/iommu/iommufd/selftest.c            |   1 +
 drivers/iommu/iommufd/serialise.c           | 213 ++++++++++++++++++++
 fs/guestmemfs/file.c                        |  25 +++
 fs/guestmemfs/guestmemfs.h                  |   1 +
 fs/guestmemfs/inode.c                       |   4 +
 include/linux/guestmemfs.h                  |  15 ++
 include/linux/iommu.h                       |  16 +-
 include/uapi/linux/iommufd.h                |   5 +
 21 files changed, 698 insertions(+), 23 deletions(-)
 create mode 100644 drivers/iommu/intel/serialise.c
 create mode 100644 drivers/iommu/iommufd/serialise.c

-- 
2.34.1


^ permalink raw reply	[flat|nested] 33+ messages in thread

* [RFC PATCH 01/13] iommufd: Support marking and tracking persistent iommufds
  2024-09-16 11:30 [RFC PATCH 00/13] Support iommu(fd) persistence for live update James Gowans
@ 2024-09-16 11:30 ` James Gowans
  2024-09-16 11:30 ` [RFC PATCH 02/13] iommufd: Add plumbing for KHO (de)serialise James Gowans
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 33+ messages in thread
From: James Gowans @ 2024-09-16 11:30 UTC (permalink / raw)
  To: linux-kernel
  Cc: Jason Gunthorpe, Kevin Tian, Joerg Roedel,
	Krzysztof Wilczyński, Will Deacon, Robin Murphy,
	Mike Rapoport, Madhavan T. Venkataraman, iommu,
	Sean Christopherson, Paolo Bonzini, kvm, David Woodhouse,
	Lu Baolu, Alexander Graf, anthony.yznaga, steven.sistare,
	nh-open-source, Saenz Julienne, Nicolas

Introduce a new iommufd option to mark an iommufd as persistent. For now
this allocates a unique persistent ID for the iommufd from an xarray and
keeps a reference to its context there.

This will be used so that at serialisation time the open iommufds can be
iterated through and serialised.
---
 drivers/iommu/iommufd/iommufd_private.h |  1 +
 drivers/iommu/iommufd/main.c            | 47 +++++++++++++++++++++++++
 include/uapi/linux/iommufd.h            |  5 +++
 3 files changed, 53 insertions(+)

diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 92efe30a8f0d..b23f7766066c 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -28,6 +28,7 @@ struct iommufd_ctx {
 	/* Compatibility with VFIO no iommu */
 	u8 no_iommu_mode;
 	struct iommufd_ioas *vfio_ioas;
+	unsigned long persistent_id;
 };
 
 /*
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index 83bbd7c5d160..6708ad629b1e 100644
--- a/drivers/iommu/iommufd/main.c
+++ b/drivers/iommu/iommufd/main.c
@@ -29,6 +29,8 @@ struct iommufd_object_ops {
 static const struct iommufd_object_ops iommufd_object_ops[];
 static struct miscdevice vfio_misc_dev;
 
+static DEFINE_XARRAY_ALLOC(persistent_iommufds);
+
 struct iommufd_object *_iommufd_object_alloc(struct iommufd_ctx *ictx,
 					     size_t size,
 					     enum iommufd_object_type type)
@@ -287,10 +289,52 @@ static int iommufd_fops_release(struct inode *inode, struct file *filp)
 			break;
 	}
 	WARN_ON(!xa_empty(&ictx->groups));
+
+	rcu_read_lock();
+	if (ictx->persistent_id)
+		xa_erase(&persistent_iommufds, ictx->persistent_id);
+	rcu_read_unlock();
 	kfree(ictx);
 	return 0;
 }
 
+static int iommufd_option_persistent(struct iommufd_ucmd *ucmd)
+{
+	unsigned int persistent_id;
+	int rc;
+	struct iommu_option *cmd = ucmd->cmd;
+	struct iommufd_ctx *ictx = ucmd->ictx;
+	struct xa_limit id_limit = XA_LIMIT(1, UINT_MAX);
+
+	if (cmd->op == IOMMU_OPTION_OP_GET) {
+		cmd->val64 = ictx->persistent_id;
+		return 0;
+	}
+
+	if (cmd->op == IOMMU_OPTION_OP_SET) {
+		/*
+		 * iommufds can only be marked persistent before they
+		 * have been used for DMA mappings. HWPTs must be known
+		 * to be persistent at creation time.
+		 */
+		if (!xa_empty(&ictx->objects)) {
+			pr_warn("iommufd can only be marked persistent when unused\n");
+			return -EBUSY;
+		}
+
+		rc = xa_alloc(&persistent_iommufds, &persistent_id, ictx, id_limit, GFP_KERNEL_ACCOUNT);
+		if (rc) {
+			pr_warn("Unable to keep track of iommufd object\n");
+			return rc;
+		}
+
+		ictx->persistent_id = persistent_id;
+		cmd->val64 = ictx->persistent_id;
+		return 0;
+	}
+	return -EOPNOTSUPP;
+}
+
 static int iommufd_option(struct iommufd_ucmd *ucmd)
 {
 	struct iommu_option *cmd = ucmd->cmd;
@@ -306,6 +350,9 @@ static int iommufd_option(struct iommufd_ucmd *ucmd)
 	case IOMMU_OPTION_HUGE_PAGES:
 		rc = iommufd_ioas_option(ucmd);
 		break;
+	case IOMMU_OPTION_PERSISTENT:
+		rc = iommufd_option_persistent(ucmd);
+		break;
 	default:
 		return -EOPNOTSUPP;
 	}
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index 4dde745cfb7e..7d8cb242e9b0 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -276,10 +276,15 @@ struct iommu_ioas_unmap {
  *    iommu mappings. Value 0 disables combining, everything is mapped to
  *    PAGE_SIZE. This can be useful for benchmarking.  This is a per-IOAS
  *    option, the object_id must be the IOAS ID.
+ * @IOMMU_OPTION_PERSISTENT:
+ *    Value 1 sets this iommufd object as a persistent iommufd. Mappings will
+ *    survive across kexec. The returned value is the persistent ID which can
+ *    be used to restore the iommufd after kexec.
  */
 enum iommufd_option {
 	IOMMU_OPTION_RLIMIT_MODE = 0,
 	IOMMU_OPTION_HUGE_PAGES = 1,
+	IOMMU_OPTION_PERSISTENT = 2,
 };
 
 /**
-- 
2.34.1



* [RFC PATCH 02/13] iommufd: Add plumbing for KHO (de)serialise
  2024-09-16 11:30 [RFC PATCH 00/13] Support iommu(fd) persistence for live update James Gowans
  2024-09-16 11:30 ` [RFC PATCH 01/13] iommufd: Support marking and tracking persistent iommufds James Gowans
@ 2024-09-16 11:30 ` James Gowans
  2024-09-16 11:30 ` [RFC PATCH 03/13] iommu/intel: zap context table entries on kexec James Gowans
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 33+ messages in thread
From: James Gowans @ 2024-09-16 11:30 UTC (permalink / raw)
  To: linux-kernel
  Cc: Jason Gunthorpe, Kevin Tian, Joerg Roedel,
	Krzysztof Wilczyński, Will Deacon, Robin Murphy,
	Mike Rapoport, Madhavan T. Venkataraman, iommu,
	Sean Christopherson, Paolo Bonzini, kvm, David Woodhouse,
	Lu Baolu, Alexander Graf, anthony.yznaga, steven.sistare,
	nh-open-source, Saenz Julienne, Nicolas

To support serialising persistent iommufd objects to KHO, and to be able
to restore the persisted data, it is necessary to have a serialise hook
on the KHO activate path as well as a deserialise hook on module init.

This commit adds those hooks and the new serialise.c file which will
hold the serialisation logic; for now the functions are empty stubs.
---
 drivers/iommu/iommufd/Makefile          |  1 +
 drivers/iommu/iommufd/iommufd_private.h | 22 +++++++++++++++++++++
 drivers/iommu/iommufd/main.c            | 24 ++++++++++++++++++++++-
 drivers/iommu/iommufd/serialise.c       | 26 +++++++++++++++++++++++++
 4 files changed, 72 insertions(+), 1 deletion(-)
 create mode 100644 drivers/iommu/iommufd/serialise.c

diff --git a/drivers/iommu/iommufd/Makefile b/drivers/iommu/iommufd/Makefile
index cf4605962bea..80bc775c170d 100644
--- a/drivers/iommu/iommufd/Makefile
+++ b/drivers/iommu/iommufd/Makefile
@@ -13,3 +13,4 @@ iommufd-$(CONFIG_IOMMUFD_TEST) += selftest.o
 
 obj-$(CONFIG_IOMMUFD) += iommufd.o
 obj-$(CONFIG_IOMMUFD_DRIVER) += iova_bitmap.o
+iommufd-$(CONFIG_KEXEC_KHO) += serialise.o
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index b23f7766066c..a26728646a22 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -497,6 +497,28 @@ static inline void iommufd_hwpt_detach_device(struct iommufd_hw_pagetable *hwpt,
 	iommu_detach_group(hwpt->domain, idev->igroup->group);
 }
 
+/*
+ * Serialise is invoked as a callback by KHO when changing KHO active state,
+ * it stores current iommufd state into KHO's persistent store.
+ * Deserialise is run by the iommufd module when loaded to re-hydrate state
+ * carried across from the previous kernel.
+ */
+#ifdef CONFIG_KEXEC_KHO
+int iommufd_serialise_kho(struct notifier_block *self, unsigned long cmd,
+			  void *fdt);
+int __init iommufd_deserialise_kho(void);
+#else
+static inline int iommufd_serialise_kho(struct notifier_block *self,
+					unsigned long cmd, void *fdt)
+{
+	return 0;
+}
+static inline int iommufd_deserialise_kho(void)
+{
+	return 0;
+}
+#endif
+
 static inline int iommufd_hwpt_replace_device(struct iommufd_device *idev,
 					      struct iommufd_hw_pagetable *hwpt,
 					      struct iommufd_hw_pagetable *old)
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index 6708ad629b1e..fa4f0fe336ad 100644
--- a/drivers/iommu/iommufd/main.c
+++ b/drivers/iommu/iommufd/main.c
@@ -10,6 +10,7 @@
 
 #include <linux/file.h>
 #include <linux/fs.h>
+#include <linux/kexec.h>
 #include <linux/module.h>
 #include <linux/slab.h>
 #include <linux/miscdevice.h>
@@ -590,6 +591,10 @@ static struct miscdevice vfio_misc_dev = {
 	.mode = 0666,
 };
 
+static struct notifier_block serialise_kho_nb = {
+	.notifier_call = iommufd_serialise_kho,
+};
+
 static int __init iommufd_init(void)
 {
 	int ret;
@@ -603,11 +608,26 @@ static int __init iommufd_init(void)
 		if (ret)
 			goto err_misc;
 	}
+
+	if (IS_ENABLED(CONFIG_KEXEC_KHO)) {
+		ret = register_kho_notifier(&serialise_kho_nb);
+		if (ret)
+			goto err_vfio_misc;
+	}
+
+	ret = iommufd_deserialise_kho();
+	if (ret)
+		goto err_kho;
+
 	ret = iommufd_test_init();
+
 	if (ret)
-		goto err_vfio_misc;
+		goto err_kho;
 	return 0;
 
+err_kho:
+	if (IS_ENABLED(CONFIG_KEXEC_KHO))
+		unregister_kho_notifier(&serialise_kho_nb);
 err_vfio_misc:
 	if (IS_ENABLED(CONFIG_IOMMUFD_VFIO_CONTAINER))
 		misc_deregister(&vfio_misc_dev);
@@ -621,6 +641,8 @@ static void __exit iommufd_exit(void)
 	iommufd_test_exit();
 	if (IS_ENABLED(CONFIG_IOMMUFD_VFIO_CONTAINER))
 		misc_deregister(&vfio_misc_dev);
+	if (IS_ENABLED(CONFIG_KEXEC_KHO))
+		unregister_kho_notifier(&serialise_kho_nb);
 	misc_deregister(&iommu_misc_dev);
 }
 
diff --git a/drivers/iommu/iommufd/serialise.c b/drivers/iommu/iommufd/serialise.c
new file mode 100644
index 000000000000..6e8bcc384771
--- /dev/null
+++ b/drivers/iommu/iommufd/serialise.c
@@ -0,0 +1,26 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include <linux/kexec.h>
+#include "iommufd_private.h"
+
+int iommufd_serialise_kho(struct notifier_block *self, unsigned long cmd,
+			  void *fdt)
+{
+	pr_info("would serialise here\n");
+	switch (cmd) {
+	case KEXEC_KHO_ABORT:
+		/* Would do serialise rollback here. */
+		return NOTIFY_DONE;
+	case KEXEC_KHO_DUMP:
+		/* Would do serialise here. */
+		return NOTIFY_DONE;
+	default:
+		return NOTIFY_BAD;
+	}
+}
+
+int __init iommufd_deserialise_kho(void)
+{
+	pr_info("would deserialise here\n");
+	return 0;
+}
-- 
2.34.1



* [RFC PATCH 03/13] iommu/intel: zap context table entries on kexec
  2024-09-16 11:30 [RFC PATCH 00/13] Support iommu(fd) persistence for live update James Gowans
  2024-09-16 11:30 ` [RFC PATCH 01/13] iommufd: Support marking and tracking persistent iommufds James Gowans
  2024-09-16 11:30 ` [RFC PATCH 02/13] iommufd: Add plumbing for KHO (de)serialise James Gowans
@ 2024-09-16 11:30 ` James Gowans
  2024-10-03 13:27   ` Jason Gunthorpe
  2024-09-16 11:30 ` [RFC PATCH 04/13] iommu: Support marking domains as persistent on alloc James Gowans
                   ` (9 subsequent siblings)
  12 siblings, 1 reply; 33+ messages in thread
From: James Gowans @ 2024-09-16 11:30 UTC (permalink / raw)
  To: linux-kernel
  Cc: Jason Gunthorpe, Kevin Tian, Joerg Roedel,
	Krzysztof Wilczyński, Will Deacon, Robin Murphy,
	Mike Rapoport, Madhavan T. Venkataraman, iommu,
	Sean Christopherson, Paolo Bonzini, kvm, David Woodhouse,
	Lu Baolu, Alexander Graf, anthony.yznaga, steven.sistare,
	nh-open-source, Saenz Julienne, Nicolas

Instead of fully shutting down the IOMMU on kexec, only zap the context
table entries for devices. This is the initial step towards being able
to persist some domains: once a struct iommu_domain can be marked
persistent, those persistent domains will be skipped when doing the
IOMMU shutdown.
---
 drivers/iommu/intel/dmar.c  |  1 +
 drivers/iommu/intel/iommu.c | 34 ++++++++++++++++++++++++++++++----
 drivers/iommu/intel/iommu.h |  2 ++
 3 files changed, 33 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/intel/dmar.c b/drivers/iommu/intel/dmar.c
index 1c8d3141cb55..f79aba382e77 100644
--- a/drivers/iommu/intel/dmar.c
+++ b/drivers/iommu/intel/dmar.c
@@ -1099,6 +1099,7 @@ static int alloc_iommu(struct dmar_drhd_unit *drhd)
 	spin_lock_init(&iommu->device_rbtree_lock);
 	mutex_init(&iommu->iopf_lock);
 	iommu->node = NUMA_NO_NODE;
+	INIT_LIST_HEAD(&iommu->domains);
 
 	ver = readl(iommu->reg + DMAR_VER_REG);
 	pr_info("%s: reg_base_addr %llx ver %d:%d cap %llx ecap %llx\n",
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 9ff8b83c19a3..2297cbb0253f 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -1575,6 +1575,7 @@ int domain_attach_iommu(struct dmar_domain *domain, struct intel_iommu *iommu)
 		goto err_clear;
 	}
 	domain_update_iommu_cap(domain);
+	list_add(&domain->domains, &iommu->domains);
 
 	spin_unlock(&iommu->lock);
 	return 0;
@@ -3185,6 +3186,33 @@ static void intel_disable_iommus(void)
 		iommu_disable_translation(iommu);
 }
 
+static void zap_context_table_entries(struct intel_iommu *iommu)
+{
+	struct context_entry *context;
+	struct dmar_domain *domain;
+	struct device_domain_info *device;
+	u16 did_old, sid;
+
+	list_for_each_entry(domain, &iommu->domains, domains) {
+		list_for_each_entry(device, &domain->devices, link) {
+			context = iommu_context_addr(iommu, device->bus, device->devfn, 0);
+			if (!context || !context_present(context))
+				continue;
+			did_old = context_domain_id(context);
+			context_clear_entry(context);
+			__iommu_flush_cache(iommu, context, sizeof(*context));
+			sid = (((u16)device->bus) << 8) | device->devfn;
+			iommu->flush.flush_context(iommu,
+						   did_old,
+						   sid,
+						   DMA_CCMD_MASK_NOBIT,
+						   DMA_CCMD_DEVICE_INVL);
+			iommu->flush.flush_iotlb(iommu, did_old, 0, 0,
+						 DMA_TLB_DSI_FLUSH);
+		}
+	}
+}
+
 void intel_iommu_shutdown(void)
 {
 	struct dmar_drhd_unit *drhd;
@@ -3197,10 +3225,7 @@
 
 	/* Disable PMRs explicitly here. */
 	for_each_iommu(iommu, drhd)
-		iommu_disable_protect_mem_regions(iommu);
-
-	/* Make sure the IOMMUs are switched off */
-	intel_disable_iommus();
+		zap_context_table_entries(iommu);
 
 	up_write(&dmar_global_lock);
 }
diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
index b67c14da1240..cfd006588824 100644
--- a/drivers/iommu/intel/iommu.h
+++ b/drivers/iommu/intel/iommu.h
@@ -606,6 +606,7 @@ struct dmar_domain {
 	spinlock_t lock;		/* Protect device tracking lists */
 	struct list_head devices;	/* all devices' list */
 	struct list_head dev_pasids;	/* all attached pasids */
+	struct list_head domains;	/* link in iommu->domains list */
 
 	spinlock_t cache_lock;		/* Protect the cache tag list */
 	struct list_head cache_tags;	/* Cache tag list */
@@ -749,6 +750,7 @@ struct intel_iommu {
 	void *perf_statistic;
 
 	struct iommu_pmu *pmu;
+	struct list_head domains;	/* all struct dmar_domains on this IOMMU */
 };
 
 /* PCI domain-device relationship */
-- 
2.34.1



* [RFC PATCH 04/13] iommu: Support marking domains as persistent on alloc
  2024-09-16 11:30 [RFC PATCH 00/13] Support iommu(fd) persistence for live update James Gowans
                   ` (2 preceding siblings ...)
  2024-09-16 11:30 ` [RFC PATCH 03/13] iommu/intel: zap context table entries on kexec James Gowans
@ 2024-09-16 11:30 ` James Gowans
  2024-09-16 11:30 ` [RFC PATCH 05/13] iommufd: Serialise persisted iommufds and ioas James Gowans
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 33+ messages in thread
From: James Gowans @ 2024-09-16 11:30 UTC (permalink / raw)
  To: linux-kernel
  Cc: Jason Gunthorpe, Kevin Tian, Joerg Roedel,
	Krzysztof Wilczyński, Will Deacon, Robin Murphy,
	Mike Rapoport, Madhavan T. Venkataraman, iommu,
	Sean Christopherson, Paolo Bonzini, kvm, David Woodhouse,
	Lu Baolu, Alexander Graf, anthony.yznaga, steven.sistare,
	nh-open-source, Saenz Julienne, Nicolas

Add a persistent ID field to struct iommu_domain and allow it to be set
from the domain_alloc_user callback which is used by iommufd. So far it
is unused, and for now it will only be supported on the Intel IOMMU as a
proof of concept.

Going forward this ID will be used as a unique handle on the
iommu_domain so that after kexec the caller (iommufd) will be able to
restore a reference to the struct iommu_domain which existed before
kexec.
---
 drivers/iommu/amd/iommu.c                   |  4 +++-
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c |  3 ++-
 drivers/iommu/intel/iommu.c                 |  2 ++
 drivers/iommu/iommufd/hw_pagetable.c        |  5 ++++-
 drivers/iommu/iommufd/selftest.c            |  1 +
 include/linux/iommu.h                       | 11 +++++++++--
 6 files changed, 21 insertions(+), 5 deletions(-)

diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index b19e8c0f48fa..daeb609aa4f5 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -2432,12 +2432,14 @@ static struct iommu_domain *amd_iommu_domain_alloc(unsigned int type)
 static struct iommu_domain *
 amd_iommu_domain_alloc_user(struct device *dev, u32 flags,
 			    struct iommu_domain *parent,
+			    unsigned long persistent_id,
 			    const struct iommu_user_data *user_data)
 
 {
 	unsigned int type = IOMMU_DOMAIN_UNMANAGED;
 
-	if ((flags & ~IOMMU_HWPT_ALLOC_DIRTY_TRACKING) || parent || user_data)
+	if ((flags & ~IOMMU_HWPT_ALLOC_DIRTY_TRACKING) || parent || user_data ||
+	    persistent_id)
 		return ERR_PTR(-EOPNOTSUPP);
 
 	return do_iommu_domain_alloc(type, dev, flags);
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index a31460f9f3d4..41c964891c84 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -3049,6 +3049,7 @@ static struct iommu_domain arm_smmu_blocked_domain = {
 static struct iommu_domain *
 arm_smmu_domain_alloc_user(struct device *dev, u32 flags,
 			   struct iommu_domain *parent,
+			   unsigned long persistent_id,
 			   const struct iommu_user_data *user_data)
 {
 	struct arm_smmu_master *master = dev_iommu_priv_get(dev);
@@ -3058,7 +3059,7 @@ arm_smmu_domain_alloc_user(struct device *dev, u32 flags,
 
 	if (flags & ~PAGING_FLAGS)
 		return ERR_PTR(-EOPNOTSUPP);
-	if (parent || user_data)
+	if (parent || user_data || persistent_id)
 		return ERR_PTR(-EOPNOTSUPP);
 
 	smmu_domain = arm_smmu_domain_alloc();
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 2297cbb0253f..f473a8c008a7 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -3729,6 +3729,7 @@ static struct iommu_domain *intel_iommu_domain_alloc(unsigned type)
 static struct iommu_domain *
 intel_iommu_domain_alloc_user(struct device *dev, u32 flags,
 			      struct iommu_domain *parent,
+			      unsigned long persistent_id,
 			      const struct iommu_user_data *user_data)
 {
 	struct device_domain_info *info = dev_iommu_priv_get(dev);
@@ -3761,6 +3762,7 @@ intel_iommu_domain_alloc_user(struct device *dev, u32 flags,
 	domain->type = IOMMU_DOMAIN_UNMANAGED;
 	domain->owner = &intel_iommu_ops;
 	domain->ops = intel_iommu_ops.default_domain_ops;
+	domain->persistent_id = persistent_id;
 
 	if (nested_parent) {
 		dmar_domain->nested_parent = true;
diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
index aefde4443671..4bbf1dc98053 100644
--- a/drivers/iommu/iommufd/hw_pagetable.c
+++ b/drivers/iommu/iommufd/hw_pagetable.c
@@ -137,6 +137,7 @@ iommufd_hwpt_paging_alloc(struct iommufd_ctx *ictx, struct iommufd_ioas *ioas,
 
 	if (ops->domain_alloc_user) {
 		hwpt->domain = ops->domain_alloc_user(idev->dev, flags, NULL,
+						      ictx->persistent_id,
 						      user_data);
 		if (IS_ERR(hwpt->domain)) {
 			rc = PTR_ERR(hwpt->domain);
@@ -239,7 +240,9 @@ iommufd_hwpt_nested_alloc(struct iommufd_ctx *ictx,
 
 	hwpt->domain = ops->domain_alloc_user(idev->dev,
 					      flags & ~IOMMU_HWPT_FAULT_ID_VALID,
-					      parent->common.domain, user_data);
+					      parent->common.domain,
+					      ictx->persistent_id,
+					      user_data);
 	if (IS_ERR(hwpt->domain)) {
 		rc = PTR_ERR(hwpt->domain);
 		hwpt->domain = NULL;
diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c
index 222cfc11ebfd..7a9a454369d5 100644
--- a/drivers/iommu/iommufd/selftest.c
+++ b/drivers/iommu/iommufd/selftest.c
@@ -318,6 +318,7 @@ __mock_domain_alloc_nested(struct mock_iommu_domain *mock_parent,
 static struct iommu_domain *
 mock_domain_alloc_user(struct device *dev, u32 flags,
 		       struct iommu_domain *parent,
+		       unsigned long persistent_id,
 		       const struct iommu_user_data *user_data)
 {
 	struct mock_iommu_domain *mock_parent;
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 04cbdae0052e..a616e8702a1c 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -215,6 +215,11 @@ struct iommu_domain {
 	struct iommu_dma_cookie *iova_cookie;
 	int (*iopf_handler)(struct iopf_group *group);
 	void *fault_data;
+	/*
+	 * Persisting and restoring across kexec via KHO.
+	 * 0 indicates non-persistent.
+	 */
+	unsigned long persistent_id;
 	union {
 		struct {
 			iommu_fault_handler_t handler;
@@ -518,7 +523,9 @@ static inline int __iommu_copy_struct_from_user_array(
  *                     IOMMU_DOMAIN_NESTED type; otherwise, the @parent must be
  *                     NULL while the @user_data can be optionally provided, the
  *                     new domain must support __IOMMU_DOMAIN_PAGING.
- *                     Upon failure, ERR_PTR must be returned.
+ *                     Upon failure, ERR_PTR must be returned. Persistent ID is
+ *                     used to save/restore across kexec; 0 indicates not
+ *                     persistent.
  * @domain_alloc_paging: Allocate an iommu_domain that can be used for
  *                       UNMANAGED, DMA, and DMA_FQ domain types.
  * @domain_alloc_sva: Allocate an iommu_domain for Shared Virtual Addressing.
@@ -564,7 +571,7 @@ struct iommu_ops {
 	struct iommu_domain *(*domain_alloc)(unsigned iommu_domain_type);
 	struct iommu_domain *(*domain_alloc_user)(
 		struct device *dev, u32 flags, struct iommu_domain *parent,
-		const struct iommu_user_data *user_data);
+		unsigned long persistent_id, const struct iommu_user_data *user_data);
 	struct iommu_domain *(*domain_alloc_paging)(struct device *dev);
 	struct iommu_domain *(*domain_alloc_sva)(struct device *dev,
 						 struct mm_struct *mm);
-- 
2.34.1



* [RFC PATCH 05/13] iommufd: Serialise persisted iommufds and ioas
  2024-09-16 11:30 [RFC PATCH 00/13] Support iommu(fd) persistence for live update James Gowans
                   ` (3 preceding siblings ...)
  2024-09-16 11:30 ` [RFC PATCH 04/13] iommu: Support marking domains as persistent on alloc James Gowans
@ 2024-09-16 11:30 ` James Gowans
  2024-10-02 18:55   ` Jason Gunthorpe
  2024-10-16 22:20   ` Jacob Pan
  2024-09-16 11:30 ` [RFC PATCH 06/13] iommufd: Expose persistent iommufd IDs in sysfs James Gowans
                   ` (7 subsequent siblings)
  12 siblings, 2 replies; 33+ messages in thread
From: James Gowans @ 2024-09-16 11:30 UTC (permalink / raw)
  To: linux-kernel
  Cc: Jason Gunthorpe, Kevin Tian, Joerg Roedel,
	Krzysztof Wilczyński, Will Deacon, Robin Murphy,
	Mike Rapoport, Madhavan T. Venkataraman, iommu,
	Sean Christopherson, Paolo Bonzini, kvm, David Woodhouse,
	Lu Baolu, Alexander Graf, anthony.yznaga, steven.sistare,
	nh-open-source, Saenz Julienne, Nicolas

Now actually implement the serialise callback for iommufd. On KHO
activate, iterate through all persistent iommufds and write their
metadata out in the device tree format. For now just a few fields are
serialised to demonstrate the concept; to make this actually useful,
many more fields and related objects will need to be serialised too.
---
 drivers/iommu/iommufd/iommufd_private.h |  2 +
 drivers/iommu/iommufd/main.c            |  2 +-
 drivers/iommu/iommufd/serialise.c       | 81 ++++++++++++++++++++++++-
 3 files changed, 81 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index a26728646a22..ad8d180269bd 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -18,6 +18,8 @@ struct iommu_group;
 struct iommu_option;
 struct iommufd_device;
 
+extern struct xarray persistent_iommufds;
+
 struct iommufd_ctx {
 	struct file *file;
 	struct xarray objects;
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index fa4f0fe336ad..21a7e1ad40d1 100644
--- a/drivers/iommu/iommufd/main.c
+++ b/drivers/iommu/iommufd/main.c
@@ -30,7 +30,7 @@ struct iommufd_object_ops {
 static const struct iommufd_object_ops iommufd_object_ops[];
 static struct miscdevice vfio_misc_dev;
 
-static DEFINE_XARRAY_ALLOC(persistent_iommufds);
+DEFINE_XARRAY_ALLOC(persistent_iommufds);
 
 struct iommufd_object *_iommufd_object_alloc(struct iommufd_ctx *ictx,
 					     size_t size,
diff --git a/drivers/iommu/iommufd/serialise.c b/drivers/iommu/iommufd/serialise.c
index 6e8bcc384771..6b4c306dce40 100644
--- a/drivers/iommu/iommufd/serialise.c
+++ b/drivers/iommu/iommufd/serialise.c
@@ -1,19 +1,94 @@
 // SPDX-License-Identifier: GPL-2.0-only
 
 #include <linux/kexec.h>
+#include <linux/libfdt.h>
 #include "iommufd_private.h"
+#include "io_pagetable.h"
+
+/*
+ * Serialised format:
+ * /iommufd
+ *   compatible = "iommufd-v0",
+ *   iommufds = [
+ *     persistent_id = {
+ *       account_mode = u8
+ *       ioases = [
+ *         {
+ *           areas = [
+ *           ]
+ *         }
+ *       ]
+ *     }
+ *   ]
+ */
+static int serialise_iommufd(void *fdt, struct iommufd_ctx *ictx)
+{
+	int err = 0;
+	char name[24];
+	struct iommufd_object *obj;
+	unsigned long obj_idx;
+
+	snprintf(name, sizeof(name), "%lu", ictx->persistent_id);
+	err |= fdt_begin_node(fdt, name);
+	err |= fdt_begin_node(fdt, "ioases");
+	xa_for_each(&ictx->objects, obj_idx, obj) {
+		struct iommufd_ioas *ioas;
+		struct iopt_area *area;
+		int area_idx = 0;
+
+		if (obj->type != IOMMUFD_OBJ_IOAS)
+			continue;
+
+		ioas = container_of(obj, struct iommufd_ioas, obj);
+		snprintf(name, sizeof(name), "%lu", obj_idx);
+		err |= fdt_begin_node(fdt, name);
+
+		for (area = iopt_area_iter_first(&ioas->iopt, 0, ULONG_MAX); area;
+				area = iopt_area_iter_next(area, 0, ULONG_MAX)) {
+			unsigned long iova_start, iova_len;
+
+			snprintf(name, sizeof(name), "%i", area_idx);
+			err |= fdt_begin_node(fdt, name);
+			iova_start = iopt_area_iova(area);
+			iova_len = iopt_area_length(area);
+			err |= fdt_property(fdt, "iova-start",
+					&iova_start, sizeof(iova_start));
+			err |= fdt_property(fdt, "iova-len",
+					&iova_len, sizeof(iova_len));
+			err |= fdt_property(fdt, "iommu-prot",
+					&area->iommu_prot, sizeof(area->iommu_prot));
+			err |= fdt_end_node(fdt); /* area_idx */
+			++area_idx;
+		}
+		err |= fdt_end_node(fdt); /* ioas obj_idx */
+	}
+	err |= fdt_end_node(fdt); /* ioases */
+	err |= fdt_end_node(fdt); /* ictx->persistent_id */
+	return err;
+}
 
 int iommufd_serialise_kho(struct notifier_block *self, unsigned long cmd,
 			  void *fdt)
 {
-	pr_info("would serialise here\n");
+	static const char compatible[] = "iommufd-v0";
+	struct iommufd_ctx *ictx;
+	unsigned long xa_idx;
+	int err = 0;
+
 	switch (cmd) {
 	case KEXEC_KHO_ABORT:
 		/* Would do serialise rollback here. */
 		return NOTIFY_DONE;
 	case KEXEC_KHO_DUMP:
-		/* Would do serialise here. */
-		return NOTIFY_DONE;
+		err |= fdt_begin_node(fdt, "iommufd");
+		err |= fdt_property(fdt, "compatible", compatible, sizeof(compatible));
+		err |= fdt_begin_node(fdt, "iommufds");
+		xa_for_each(&persistent_iommufds, xa_idx, ictx) {
+			err |= serialise_iommufd(fdt, ictx);
+		}
+		err |= fdt_end_node(fdt); /* iommufds */
+		err |= fdt_end_node(fdt); /* iommufd */
+		return err ? NOTIFY_BAD : NOTIFY_DONE;
 	default:
 		return NOTIFY_BAD;
 	}
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread
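The per-area properties serialised above (`iova-start`, `iova-len`, `iommu-prot`) flatten an inclusive IOVA interval into a (start, length) pair; patch 07 later reconstructs `node.last` as `start + len - 1`. A minimal standalone sketch of that round-trip, using hypothetical stand-in structs rather than the kernel's `iopt_area`:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the serialised per-area properties. */
struct area_record {
	uint64_t iova_start;
	uint64_t iova_len;
	int iommu_prot;
};

/* Hypothetical stand-in for the interval-tree node ([start, last] inclusive). */
struct area_node {
	uint64_t start;
	uint64_t last;
	int prot;
};

/* Serialise: an inclusive [start, last] range becomes (start, len). */
static inline struct area_record area_to_record(const struct area_node *n)
{
	struct area_record r = {
		.iova_start = n->start,
		.iova_len = n->last - n->start + 1,
		.iommu_prot = n->prot,
	};
	return r;
}

/* Deserialise: mirrors patch 07's node.last = start + len - 1. */
static inline struct area_node record_to_area(const struct area_record *r)
{
	struct area_node n = {
		.start = r->iova_start,
		.last = r->iova_start + r->iova_len - 1,
		.prot = r->iommu_prot,
	};
	return n;
}
```

The inclusive-last convention matters: a one-page mapping at 0x1000 has last = 0x1fff, not 0x2000.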

* [RFC PATCH 06/13] iommufd: Expose persistent iommufd IDs in sysfs
  2024-09-16 11:30 [RFC PATCH 00/13] Support iommu(fd) persistence for live update James Gowans
                   ` (4 preceding siblings ...)
  2024-09-16 11:30 ` [RFC PATCH 05/13] iommufd: Serialise persisted iommufds and ioas James Gowans
@ 2024-09-16 11:30 ` James Gowans
  2024-09-16 11:30 ` [RFC PATCH 07/13] iommufd: Re-hydrate a usable iommufd ctx from sysfs James Gowans
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 33+ messages in thread
From: James Gowans @ 2024-09-16 11:30 UTC (permalink / raw)
  To: linux-kernel
  Cc: Jason Gunthorpe, Kevin Tian, Joerg Roedel,
	Krzysztof Wilczyński, Will Deacon, Robin Murphy,
	Mike Rapoport, Madhavan T. Venkataraman, iommu,
	Sean Christopherson, Paolo Bonzini, kvm, David Woodhouse,
	Lu Baolu, Alexander Graf, anthony.yznaga, steven.sistare,
	nh-open-source, Saenz Julienne, Nicolas

After kexec, userspace needs the ability to re-acquire a handle to the
IOMMUFD it was using before kexec. To let userspace discover persisted
domains and get a handle to them, expose all of the persisted IDs in
sysfs. Each persisted ID gets a directory:
/sys/kernel/iommufd_persisted/<id>
In the next commit a file will be added to this directory to allow
actually restoring the IOMMUFD.
---
 drivers/iommu/iommufd/serialise.c | 48 ++++++++++++++++++++++++++++++-
 1 file changed, 47 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/iommufd/serialise.c b/drivers/iommu/iommufd/serialise.c
index 6b4c306dce40..7f2e7b1eda13 100644
--- a/drivers/iommu/iommufd/serialise.c
+++ b/drivers/iommu/iommufd/serialise.c
@@ -21,6 +21,9 @@
  *     }
  *   ]
  */
+
+static struct kobject *persisted_dir_kobj;
+
 static int serialise_iommufd(void *fdt, struct iommufd_ctx *ictx)
 {
 	int err = 0;
@@ -94,8 +97,51 @@ int iommufd_serialise_kho(struct notifier_block *self, unsigned long cmd,
 	}
 }
 
+static ssize_t iommufd_show(struct kobject *kobj, struct kobj_attribute *attr,
+	char *buf)
+{
+	return 0;
+}
+
+static struct kobj_attribute persisted_attr =
+	__ATTR_RO_MODE(iommufd, 0440);
+
+static int deserialise_iommufds(const void *fdt, int root_off)
+{
+	int off;
+
+	/*
+	 * For each persisted iommufd id, create a directory
+	 * in sysfs with an iommufd file in it.
+	 */
+	fdt_for_each_subnode(off, fdt, root_off) {
+		struct kobject *kobj;
+		const char *name = fdt_get_name(fdt, off, NULL);
+		int rc;
+
+		kobj = kobject_create_and_add(name, persisted_dir_kobj);
+		rc = kobj ? sysfs_create_file(kobj, &persisted_attr.attr) : -ENOMEM;
+		if (rc)
+			pr_warn("Unable to create sysfs file for iommufd node %s\n", name);
+	}
+	return 0;
+}
+
 int __init iommufd_deserialise_kho(void)
 {
-	pr_info("would deserialise here\n");
+	const void *fdt = kho_get_fdt();
+	int off;
+
+	if (!fdt)
+		return 0;
+
+	/* Parent directory for persisted iommufd files. */
+	persisted_dir_kobj = kobject_create_and_add("iommufd_persisted", kernel_kobj);
+
+	off = fdt_path_offset(fdt, "/iommufd");
+	if (off <= 0)
+		return 0; /* No data in KHO */
+
+	deserialise_iommufds(fdt, fdt_subnode_offset(fdt, off, "iommufds"));
 	return 0;
 }
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC PATCH 07/13] iommufd: Re-hydrate a usable iommufd ctx from sysfs
  2024-09-16 11:30 [RFC PATCH 00/13] Support iommu(fd) persistence for live update James Gowans
                   ` (5 preceding siblings ...)
  2024-09-16 11:30 ` [RFC PATCH 06/13] iommufd: Expose persistent iommufd IDs in sysfs James Gowans
@ 2024-09-16 11:30 ` James Gowans
  2024-09-16 11:30 ` [RFC PATCH 08/13] intel-iommu: Add serialise and deserialise boilerplate James Gowans
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 33+ messages in thread
From: James Gowans @ 2024-09-16 11:30 UTC (permalink / raw)
  To: linux-kernel
  Cc: Jason Gunthorpe, Kevin Tian, Joerg Roedel,
	Krzysztof Wilczyński, Will Deacon, Robin Murphy,
	Mike Rapoport, Madhavan T. Venkataraman, iommu,
	Sean Christopherson, Paolo Bonzini, kvm, David Woodhouse,
	Lu Baolu, Alexander Graf, anthony.yznaga, steven.sistare,
	nh-open-source, Saenz Julienne, Nicolas

When the sysfs file is read, create an iommufd file descriptor, create a
fresh iommufd_ctx, and populate that ictx struct and related structs
with the data about mapped IOVA ranges from KHO.

This is done in a super yucky way by having the sysfs file's .show()
callback create a new file and then print out the new file's fd number.
Done this way because I couldn't figure out how to define a custom
.open() callback on a sysfs object.
An alternative would be to have a new iommufd pseudo-filesystem which
could be mounted somewhere and would have all of the relevant persistent
data in it.

Opinions/ideas on how best to expose persisted domains to userspace are
welcome.
---
 drivers/iommu/iommufd/io_pagetable.c    |  2 +-
 drivers/iommu/iommufd/iommufd_private.h |  4 ++
 drivers/iommu/iommufd/main.c            |  4 +-
 drivers/iommu/iommufd/serialise.c       | 54 ++++++++++++++++++++++++-
 4 files changed, 60 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/iommufd/io_pagetable.c b/drivers/iommu/iommufd/io_pagetable.c
index 05fd9d3abf1b..b4b75663d7cf 100644
--- a/drivers/iommu/iommufd/io_pagetable.c
+++ b/drivers/iommu/iommufd/io_pagetable.c
@@ -222,7 +222,7 @@ static int iopt_insert_area(struct io_pagetable *iopt, struct iopt_area *area,
 	return 0;
 }
 
-static struct iopt_area *iopt_area_alloc(void)
+struct iopt_area *iopt_area_alloc(void)
 {
 	struct iopt_area *area;
 
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index ad8d180269bd..94612cec2814 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -59,6 +59,10 @@ struct io_pagetable {
 	unsigned long iova_alignment;
 };
 
+extern const struct file_operations iommufd_fops;
+int iommufd_fops_open(struct inode *inode, struct file *filp);
+struct iopt_area *iopt_area_alloc(void);
+
 void iopt_init_table(struct io_pagetable *iopt);
 void iopt_destroy_table(struct io_pagetable *iopt);
 int iopt_get_pages(struct io_pagetable *iopt, unsigned long iova,
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index 21a7e1ad40d1..f78a4cf23741 100644
--- a/drivers/iommu/iommufd/main.c
+++ b/drivers/iommu/iommufd/main.c
@@ -233,7 +233,7 @@ static int iommufd_destroy(struct iommufd_ucmd *ucmd)
 	return iommufd_object_remove(ucmd->ictx, NULL, cmd->id, 0);
 }
 
-static int iommufd_fops_open(struct inode *inode, struct file *filp)
+int iommufd_fops_open(struct inode *inode, struct file *filp)
 {
 	struct iommufd_ctx *ictx;
 
@@ -473,7 +473,7 @@ static long iommufd_fops_ioctl(struct file *filp, unsigned int cmd,
 	return ret;
 }
 
-static const struct file_operations iommufd_fops = {
+const struct file_operations iommufd_fops = {
 	.owner = THIS_MODULE,
 	.open = iommufd_fops_open,
 	.release = iommufd_fops_release,
diff --git a/drivers/iommu/iommufd/serialise.c b/drivers/iommu/iommufd/serialise.c
index 7f2e7b1eda13..9519969bd201 100644
--- a/drivers/iommu/iommufd/serialise.c
+++ b/drivers/iommu/iommufd/serialise.c
@@ -1,5 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0-only
 
+#include <linux/anon_inodes.h>
+#include <linux/fdtable.h>
 #include <linux/kexec.h>
 #include <linux/libfdt.h>
 #include "iommufd_private.h"
@@ -97,10 +99,60 @@ int iommufd_serialise_kho(struct notifier_block *self, unsigned long cmd,
 	}
 }
 
+static int rehydrate_iommufd(char *iommufd_name)
+{
+	struct file *file;
+	int fd;
+	int off;
+	struct iommufd_ctx *ictx;
+	struct files_struct *files = current->files; /* current process's file table */
+	const void *fdt = kho_get_fdt();
+	char kho_path[42];
+
+	fd = anon_inode_getfd("iommufd", &iommufd_fops, NULL, O_RDWR);
+	if (fd < 0)
+		return fd;
+	file = files_lookup_fd_raw(files, fd);
+	iommufd_fops_open(NULL, file);
+	ictx = file->private_data;
+
+	snprintf(kho_path, sizeof(kho_path), "/iommufd/iommufds/%s/ioases", iommufd_name);
+	fdt_for_each_subnode(off, fdt, fdt_path_offset(fdt, kho_path)) {
+	    struct iommufd_ioas *ioas;
+	    int range_off;
+
+	    ioas = iommufd_ioas_alloc(ictx);
+	    iommufd_object_finalize(ictx, &ioas->obj);
+
+	    fdt_for_each_subnode(range_off, fdt, off) {
+		    const unsigned long *iova_start, *iova_len;
+		    const int *iommu_prot;
+		    int len;
+		    struct iopt_area *area = iopt_area_alloc();
+
+		    iova_start = fdt_getprop(fdt, range_off, "iova-start", &len);
+		    iova_len = fdt_getprop(fdt, range_off, "iova-len", &len);
+		    iommu_prot = fdt_getprop(fdt, range_off, "iommu-prot", &len);
+
+		    area->iommu_prot = *iommu_prot;
+		    area->node.start = *iova_start;
+		    area->node.last = *iova_start + *iova_len - 1;
+		    interval_tree_insert(&area->node, &ioas->iopt.area_itree);
+	    }
+	    /* TODO: restore link from ioas to hwpt. */
+	}
+
+	return fd;
+}
+
 static ssize_t iommufd_show(struct kobject *kobj, struct kobj_attribute *attr,
 	char *buf)
 {
-	return 0;
+	ssize_t len;
+
+	/* sysfs .show() buffers are PAGE_SIZE; bound the snprintf accordingly. */
+	len = snprintf(buf, PAGE_SIZE, "%i\n", rehydrate_iommufd("1"));
+	return len;
 }
 
 static struct kobj_attribute persisted_attr =
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread
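The `.show()` hack above prints a freshly created fd number into the sysfs read buffer, so the reading process must parse that number back out before it can use the fd. A hypothetical userspace-side sketch of that parsing step (the helper name is illustrative):

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Parse the file-descriptor number that the sysfs iommufd file's
 * .show() callback prints (e.g. "17\n"). Returns -1 on malformed
 * input. Note the fd is only valid in the process that performed
 * the read, since .show() installed it into that process's fd table.
 */
static int parse_restored_fd(const char *buf)
{
	char *end;
	long fd;

	errno = 0;
	fd = strtol(buf, &end, 10);
	if (errno || end == buf || (*end != '\n' && *end != '\0'))
		return -1;
	if (fd < 0 || fd > 65535)
		return -1;
	return (int)fd;
}
```

This also illustrates why the commit message calls the interface yucky: a read() with side effects on the caller's fd table is unusual, which is why an iommufd pseudo-filesystem is floated as an alternative.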

* [RFC PATCH 08/13] intel-iommu: Add serialise and deserialise boilerplate
  2024-09-16 11:30 [RFC PATCH 00/13] Support iommu(fd) persistence for live update James Gowans
                   ` (6 preceding siblings ...)
  2024-09-16 11:30 ` [RFC PATCH 07/13] iommufd: Re-hydrate a usable iommufd ctx from sysfs James Gowans
@ 2024-09-16 11:30 ` James Gowans
  2024-09-16 11:30 ` [RFC PATCH 09/13] intel-iommu: Serialise dmar_domain on KHO activate James Gowans
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 33+ messages in thread
From: James Gowans @ 2024-09-16 11:30 UTC (permalink / raw)
  To: linux-kernel
  Cc: Jason Gunthorpe, Kevin Tian, Joerg Roedel,
	Krzysztof Wilczyński, Will Deacon, Robin Murphy,
	Mike Rapoport, Madhavan T. Venkataraman, iommu,
	Sean Christopherson, Paolo Bonzini, kvm, David Woodhouse,
	Lu Baolu, Alexander Graf, anthony.yznaga, steven.sistare,
	nh-open-source, Saenz Julienne, Nicolas

Similar to how iommufd got serialise and deserialise hooks, add the same
to the platform IOMMU driver, in this case intel-iommu. Once again, this
will be fleshed out in the next commits to actually serialise struct
dmar_domain instances before kexec and restore them afterwards.
---
 drivers/iommu/intel/Makefile    |  1 +
 drivers/iommu/intel/iommu.c     | 18 +++++++++++++++
 drivers/iommu/intel/iommu.h     | 18 +++++++++++++++
 drivers/iommu/intel/serialise.c | 40 +++++++++++++++++++++++++++++++++
 4 files changed, 77 insertions(+)
 create mode 100644 drivers/iommu/intel/serialise.c

diff --git a/drivers/iommu/intel/Makefile b/drivers/iommu/intel/Makefile
index c8beb0281559..ca9f73992620 100644
--- a/drivers/iommu/intel/Makefile
+++ b/drivers/iommu/intel/Makefile
@@ -9,3 +9,4 @@ ifdef CONFIG_INTEL_IOMMU
 obj-$(CONFIG_IRQ_REMAP) += irq_remapping.o
 endif
 obj-$(CONFIG_INTEL_IOMMU_PERF_EVENTS) += perfmon.o
+obj-$(CONFIG_KEXEC_KHO) += serialise.o
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index f473a8c008a7..7e77b787148a 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -65,6 +65,7 @@ static int rwbf_quirk;
 static int force_on = 0;
 static int intel_iommu_tboot_noforce;
 static int no_platform_optin;
+DEFINE_XARRAY(persistent_domains);
 
 #define ROOT_ENTRY_NR (VTD_PAGE_SIZE/sizeof(struct root_entry))
 
@@ -3393,6 +3394,10 @@ static __init int tboot_force_iommu(void)
 	return 1;
 }
 
+static struct notifier_block serialise_kho_nb = {
+	.notifier_call = intel_iommu_serialise_kho,
+};
+
 int __init intel_iommu_init(void)
 {
 	int ret = -ENODEV;
@@ -3432,6 +3437,12 @@ int __init intel_iommu_init(void)
 	if (!no_iommu)
 		intel_iommu_debugfs_init();
 
+	if (IS_ENABLED(CONFIG_KEXEC_KHO)) {
+		ret = register_kho_notifier(&serialise_kho_nb);
+		if (ret)
+			goto out_free_dmar;
+	}
+
 	if (no_iommu || dmar_disabled) {
 		/*
 		 * We exit the function here to ensure IOMMU's remapping and
@@ -3738,6 +3749,7 @@ intel_iommu_domain_alloc_user(struct device *dev, u32 flags,
 	struct intel_iommu *iommu = info->iommu;
 	struct dmar_domain *dmar_domain;
 	struct iommu_domain *domain;
+	int rc;
 
 	/* Must be NESTING domain */
 	if (parent) {
@@ -3778,6 +3790,12 @@ intel_iommu_domain_alloc_user(struct device *dev, u32 flags,
 		domain->dirty_ops = &intel_dirty_ops;
 	}
 
+	if (persistent_id) {
+		rc = xa_insert(&persistent_domains, persistent_id, domain, GFP_KERNEL_ACCOUNT);
+		if (rc)
+			pr_warn("Unable to track persistent domain %lu\n", persistent_id);
+	}
+
 	return domain;
 }
 
diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
index cfd006588824..7866342f0909 100644
--- a/drivers/iommu/intel/iommu.h
+++ b/drivers/iommu/intel/iommu.h
@@ -11,6 +11,7 @@
 #define _INTEL_IOMMU_H_
 
 #include <linux/types.h>
+#include <linux/kexec.h>
 #include <linux/iova.h>
 #include <linux/io.h>
 #include <linux/idr.h>
@@ -496,6 +497,7 @@ struct q_inval {
 #define PRQ_DEPTH	((0x1000 << PRQ_ORDER) >> 5)
 
 struct dmar_pci_notify_info;
+extern struct xarray persistent_domains;
 
 #ifdef CONFIG_IRQ_REMAP
 /* 1MB - maximum possible interrupt remapping table size */
@@ -1225,6 +1227,22 @@ static inline int iommu_calculate_max_sagaw(struct intel_iommu *iommu)
 #define intel_iommu_sm (0)
 #endif
 
+#ifdef CONFIG_KEXEC_KHO
+int intel_iommu_serialise_kho(struct notifier_block *self, unsigned long cmd,
+			  void *fdt);
+int __init intel_iommu_deserialise_kho(void);
+#else
+static inline int intel_iommu_serialise_kho(struct notifier_block *self,
+					    unsigned long cmd, void *fdt)
+{
+	return 0;
+}
+static inline int intel_iommu_deserialise_kho(void)
+{
+	return 0;
+}
+#endif /* CONFIG_KEXEC_KHO */
+
 static inline const char *decode_prq_descriptor(char *str, size_t size,
 		u64 dw0, u64 dw1, u64 dw2, u64 dw3)
 {
diff --git a/drivers/iommu/intel/serialise.c b/drivers/iommu/intel/serialise.c
new file mode 100644
index 000000000000..08a548b33703
--- /dev/null
+++ b/drivers/iommu/intel/serialise.c
@@ -0,0 +1,40 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include "iommu.h"
+
+static int serialise_domain(void *fdt, struct iommu_domain *domain)
+{
+	return 0;
+}
+
+int intel_iommu_serialise_kho(struct notifier_block *self, unsigned long cmd,
+			  void *fdt)
+{
+	static const char compatible[] = "intel-iommu-v0";
+	struct iommu_domain *domain;
+	unsigned long xa_idx;
+	int err = 0;
+
+	switch (cmd) {
+	case KEXEC_KHO_ABORT:
+		/* Would do serialise rollback here. */
+		return NOTIFY_DONE;
+	case KEXEC_KHO_DUMP:
+		err |= fdt_begin_node(fdt, "intel-iommu");
+		err |= fdt_property(fdt, "compatible", compatible, sizeof(compatible));
+		err |= fdt_begin_node(fdt, "domains");
+		xa_for_each(&persistent_domains, xa_idx, domain) {
+			err |= serialise_domain(fdt, domain);
+		}
+		err |= fdt_end_node(fdt); /* domains */
+		err |= fdt_end_node(fdt); /* intel-iommu */
+		return err ? NOTIFY_BAD : NOTIFY_DONE;
+	default:
+		return NOTIFY_BAD;
+	}
+}
+
+int __init intel_iommu_deserialise_kho(void)
+{
+	return 0;
+}
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread
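Both KHO notifiers in this series share the same two-command shape: KEXEC_KHO_DUMP serialises state into the FDT (folding the libfdt errors into a single flag), KEXEC_KHO_ABORT rolls back, and anything else is rejected. A standalone model of that dispatch, with illustrative constants rather than the real kernel values:

```c
#include <assert.h>

/*
 * Illustrative stand-ins; the real KEXEC_KHO_* and NOTIFY_* values
 * come from the kernel headers this series builds on.
 */
enum kho_cmd { KHO_ABORT, KHO_DUMP, KHO_UNKNOWN };
enum notify_ret { DONE, BAD };

/*
 * dump_err models the OR-folded result of the fdt_begin_node() /
 * fdt_property() / fdt_end_node() calls: zero means every call
 * succeeded, nonzero means at least one failed.
 */
static enum notify_ret kho_notify(enum kho_cmd cmd, int dump_err)
{
	switch (cmd) {
	case KHO_ABORT:
		return DONE;	/* serialise rollback would happen here */
	case KHO_DUMP:
		return dump_err ? BAD : DONE;
	default:
		return BAD;
	}
}
```

The OR-folding lets each serialiser issue a long run of fdt calls and make a single success/failure decision at the end, at the cost of not knowing which call failed.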

* [RFC PATCH 09/13] intel-iommu: Serialise dmar_domain on KHO activate
  2024-09-16 11:30 [RFC PATCH 00/13] Support iommu(fd) persistence for live update James Gowans
                   ` (7 preceding siblings ...)
  2024-09-16 11:30 ` [RFC PATCH 08/13] intel-iommu: Add serialise and deserialise boilerplate James Gowans
@ 2024-09-16 11:30 ` James Gowans
  2024-09-16 11:30 ` [RFC PATCH 10/13] intel-iommu: Re-hydrate persistent domains after kexec James Gowans
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 33+ messages in thread
From: James Gowans @ 2024-09-16 11:30 UTC (permalink / raw)
  To: linux-kernel
  Cc: Jason Gunthorpe, Kevin Tian, Joerg Roedel,
	Krzysztof Wilczyński, Will Deacon, Robin Murphy,
	Mike Rapoport, Madhavan T. Venkataraman, iommu,
	Sean Christopherson, Paolo Bonzini, kvm, David Woodhouse,
	Lu Baolu, Alexander Graf, anthony.yznaga, steven.sistare,
	nh-open-source, Saenz Julienne, Nicolas

Add logic to iterate through the persistent domains and register their
page table pages as KHO persistent memory. Also serialise some metadata
about the domains and their attached PCI devices.

By adding the page table pages to the `mem` attribute on the KHO node,
these pages will be carved out of system memory early in boot by KHO,
guaranteeing that the new kernel will not use them for any other
purpose. This persists the page tables across kexec.
---
 drivers/iommu/intel/iommu.c     |  9 ----
 drivers/iommu/intel/iommu.h     | 10 ++++
 drivers/iommu/intel/serialise.c | 92 ++++++++++++++++++++++++++++++++-
 3 files changed, 101 insertions(+), 10 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 7e77b787148a..0a2118a3b7c4 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -46,15 +46,6 @@
 
 #define DEFAULT_DOMAIN_ADDRESS_WIDTH 57
 
-#define __DOMAIN_MAX_PFN(gaw)  ((((uint64_t)1) << ((gaw) - VTD_PAGE_SHIFT)) - 1)
-#define __DOMAIN_MAX_ADDR(gaw) ((((uint64_t)1) << (gaw)) - 1)
-
-/* We limit DOMAIN_MAX_PFN to fit in an unsigned long, and DOMAIN_MAX_ADDR
-   to match. That way, we can use 'unsigned long' for PFNs with impunity. */
-#define DOMAIN_MAX_PFN(gaw)	((unsigned long) min_t(uint64_t, \
-				__DOMAIN_MAX_PFN(gaw), (unsigned long)-1))
-#define DOMAIN_MAX_ADDR(gaw)	(((uint64_t)__DOMAIN_MAX_PFN(gaw)) << VTD_PAGE_SHIFT)
-
 static void __init check_tylersburg_isoch(void);
 static int rwbf_quirk;
 
diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
index 7866342f0909..cd932a97a9bc 100644
--- a/drivers/iommu/intel/iommu.h
+++ b/drivers/iommu/intel/iommu.h
@@ -38,6 +38,16 @@
 
 #define IOVA_PFN(addr)		((addr) >> PAGE_SHIFT)
 
+#define __DOMAIN_MAX_PFN(gaw)  ((((uint64_t)1) << ((gaw) - VTD_PAGE_SHIFT)) - 1)
+#define __DOMAIN_MAX_ADDR(gaw) ((((uint64_t)1) << (gaw)) - 1)
+
+/* We limit DOMAIN_MAX_PFN to fit in an unsigned long, and DOMAIN_MAX_ADDR
+   to match. That way, we can use 'unsigned long' for PFNs with impunity. */
+#define DOMAIN_MAX_PFN(gaw)	((unsigned long) min_t(uint64_t, \
+				__DOMAIN_MAX_PFN(gaw), (unsigned long)-1))
+#define DOMAIN_MAX_ADDR(gaw)	(((uint64_t)__DOMAIN_MAX_PFN(gaw)) << VTD_PAGE_SHIFT)
+
+
 #define VTD_STRIDE_SHIFT        (9)
 #define VTD_STRIDE_MASK         (((u64)-1) << VTD_STRIDE_SHIFT)
 
diff --git a/drivers/iommu/intel/serialise.c b/drivers/iommu/intel/serialise.c
index 08a548b33703..bc755e51732b 100644
--- a/drivers/iommu/intel/serialise.c
+++ b/drivers/iommu/intel/serialise.c
@@ -2,9 +2,99 @@
 
 #include "iommu.h"
 
+/*
+ * Serialised format:
+ * /intel-iommu
+ *     compatible = str
+ *     domains = {
+ *         persistent-id = {
+ *             mem = [ ... ] // page table pages
+ *             agaw = i32
+ *             pgd = u64
+ *             devices = {
+ *                 id = {
+ *                     u8 bus;
+ *                     u8 devfn
+ *                 },
+ *                 ...
+ *             }
+ *         }
+ *      }
+ */
+
+/*
+ * Adds all present PFNs on the PTE page to the kho_mem pointer and advances
+ * the pointer. Stolen from dma_pte_list_pagetables().
+ */
+static void save_pte_pages(struct dmar_domain *domain, int level,
+			   struct dma_pte *pte, struct kho_mem **kho_mem)
+{
+	struct page *pg;
+
+	pg = pfn_to_page(dma_pte_addr(pte) >> PAGE_SHIFT);
+
+	if (level == 1)
+		return;
+
+	pte = page_address(pg);
+	do {
+		if (dma_pte_present(pte)) {
+			(*kho_mem)->addr = dma_pte_addr(pte);
+			(*kho_mem)->len = PAGE_SIZE;
+			(*kho_mem)++;
+			if (!dma_pte_superpage(pte))
+				save_pte_pages(domain, level - 1, pte, kho_mem);
+		}
+		pte++;
+	} while (!first_pte_in_page(pte));
+}
+
 static int serialise_domain(void *fdt, struct iommu_domain *domain)
 {
-	return 0;
+	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
+	/*
+	 * kho_mems_start points to the original allocated array; kho_mems
+	 * is incremented by the callee. Keep both to know how many were added.
+	 */
+	struct kho_mem *kho_mems, *kho_mems_start;
+	struct device_domain_info *info;
+	int err = 0;
+	char name[24];
+	int device_idx = 0;
+	phys_addr_t pgd;
+
+	/*
+	 * Assume just one page worth of kho_mem objects is enough.
+	 * Better would be to track the number of page-table pages in the domain.
+	 */
+	kho_mems_start = kho_mems = kzalloc(PAGE_SIZE, GFP_KERNEL);
+
+	save_pte_pages(dmar_domain, agaw_to_level(dmar_domain->agaw),
+		       dmar_domain->pgd, &kho_mems);
+
+	snprintf(name, sizeof(name), "%lu", domain->persistent_id);
+	err |= fdt_begin_node(fdt, name);
+	err |= fdt_property(fdt, "mem", kho_mems_start,
+			sizeof(struct kho_mem) * (kho_mems - kho_mems_start));
+	err |= fdt_property(fdt, "persistent_id", &domain->persistent_id,
+			sizeof(domain->persistent_id));
+	pgd = virt_to_phys(dmar_domain->pgd);
+	err |= fdt_property(fdt, "pgd", &pgd, sizeof(pgd));
+	err |= fdt_property(fdt, "agaw", &dmar_domain->agaw,
+			sizeof(dmar_domain->agaw));
+
+	err |= fdt_begin_node(fdt, "devices");
+	list_for_each_entry(info, &dmar_domain->devices, link) {
+		snprintf(name, sizeof(name), "%i", device_idx++);
+		err |= fdt_begin_node(fdt, name);
+		err |= fdt_property(fdt, "bus", &info->bus, sizeof(info->bus));
+		err |= fdt_property(fdt, "devfn", &info->devfn, sizeof(info->devfn));
+		err |= fdt_end_node(fdt); /* device_idx */
+	}
+	err |= fdt_end_node(fdt); /* devices */
+	err |= fdt_end_node(fdt); /* domain->persistent_id */
+
+	return err;
 }
 
 int intel_iommu_serialise_kho(struct notifier_block *self, unsigned long cmd,
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread
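serialise_domain() above sizes its `mem` property by letting save_pte_pages() advance a cursor through a preallocated array and then subtracting the start pointer. A small standalone model of that cursor idiom, with a hypothetical `mem_rec` in place of `struct kho_mem`:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct kho_mem: one physical range. */
struct mem_rec {
	uint64_t addr;
	uint64_t len;
};

/*
 * Callee appends one record and advances the caller's cursor, mirroring
 * how save_pte_pages() fills kho_mem entries via a struct kho_mem **.
 */
static void emit_page(struct mem_rec **cur, uint64_t addr)
{
	(*cur)->addr = addr;
	(*cur)->len = 4096;
	(*cur)++;
}

/*
 * Caller computes the emitted count from cursor arithmetic, as
 * serialise_domain() does with (kho_mems - kho_mems_start).
 */
static size_t emit_pages(struct mem_rec *buf, const uint64_t *addrs, size_t n)
{
	struct mem_rec *cur = buf;
	size_t i;

	for (i = 0; i < n; i++)
		emit_page(&cur, addrs[i]);
	return (size_t)(cur - buf);
}
```

The pattern avoids a separate counting pass over the page tables, but as the patch's comment notes, it relies on the caller having preallocated a large-enough array.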

* [RFC PATCH 10/13] intel-iommu: Re-hydrate persistent domains after kexec
  2024-09-16 11:30 [RFC PATCH 00/13] Support iommu(fd) persistence for live update James Gowans
                   ` (8 preceding siblings ...)
  2024-09-16 11:30 ` [RFC PATCH 09/13] intel-iommu: Serialise dmar_domain on KHO activate James Gowans
@ 2024-09-16 11:30 ` James Gowans
  2024-09-16 11:31 ` [RFC PATCH 11/13] iommu: Add callback to restore persisted iommu_domain James Gowans
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 33+ messages in thread
From: James Gowans @ 2024-09-16 11:30 UTC (permalink / raw)
  To: linux-kernel
  Cc: Jason Gunthorpe, Kevin Tian, Joerg Roedel,
	Krzysztof Wilczyński, Will Deacon, Robin Murphy,
	Mike Rapoport, Madhavan T. Venkataraman, iommu,
	Sean Christopherson, Paolo Bonzini, kvm, David Woodhouse,
	Lu Baolu, Alexander Graf, anthony.yznaga, steven.sistare,
	nh-open-source, Saenz Julienne, Nicolas

Go through the domain data persisted in KHO, allocate fresh dmar_domain
structs and populate them with the persisted data. The page table pages
persisted in the "mem" field are also claimed, transferring ownership of
those pages from KHO back to the intel-iommu driver.

Once re-hydrated, the struct iommu_domain pointers are inserted into the
persistent_domains xarray so that they can be fetched later when they
need to be restored by iommufd. This will be done in the next commit.
---
 drivers/iommu/intel/iommu.c     |  9 ++++++-
 drivers/iommu/intel/iommu.h     |  1 +
 drivers/iommu/intel/serialise.c | 44 +++++++++++++++++++++++++++++++++
 3 files changed, 53 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 0a2118a3b7c4..8e0ed033b03f 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -1505,7 +1505,7 @@ static bool first_level_by_default(unsigned int type)
 	return type != IOMMU_DOMAIN_UNMANAGED;
 }
 
-static struct dmar_domain *alloc_domain(unsigned int type)
+struct dmar_domain *alloc_domain(unsigned int type)
 {
 	struct dmar_domain *domain;
 
@@ -3468,6 +3468,7 @@ int __init intel_iommu_init(void)
 
 	init_no_remapping_devices();
 
+	intel_iommu_deserialise_kho();
 	ret = init_dmars();
 	if (ret) {
 		if (force_on)
@@ -4127,6 +4128,12 @@ static struct iommu_device *intel_iommu_probe_device(struct device *dev)
 	}
 
 	dev_iommu_priv_set(dev, info);
+
+	/*
+	 * TODO: around here the device should be added to the persistent
+	 * domain if it is a persistent device.
+	 */
+
 	if (pdev && pci_ats_supported(pdev)) {
 		ret = device_rbtree_insert(iommu, info);
 		if (ret)
diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
index cd932a97a9bc..7ee050ebfaca 100644
--- a/drivers/iommu/intel/iommu.h
+++ b/drivers/iommu/intel/iommu.h
@@ -1118,6 +1118,7 @@ int qi_submit_sync(struct intel_iommu *iommu, struct qi_desc *desc,
  */
 #define QI_OPT_WAIT_DRAIN		BIT(0)
 
+struct dmar_domain *alloc_domain(unsigned int type);
 void domain_update_iotlb(struct dmar_domain *domain);
 int domain_attach_iommu(struct dmar_domain *domain, struct intel_iommu *iommu);
 void domain_detach_iommu(struct dmar_domain *domain, struct intel_iommu *iommu);
diff --git a/drivers/iommu/intel/serialise.c b/drivers/iommu/intel/serialise.c
index bc755e51732b..20f42b84d490 100644
--- a/drivers/iommu/intel/serialise.c
+++ b/drivers/iommu/intel/serialise.c
@@ -124,7 +124,51 @@ int intel_iommu_serialise_kho(struct notifier_block *self, unsigned long cmd,
 	}
 }
 
+static void deserialise_domains(const void *fdt, int root_off)
+{
+	int off;
+	struct dmar_domain *dmar_domain;
+
+	fdt_for_each_subnode(off, fdt, root_off) {
+		const struct kho_mem *kho_mems;
+		int len, idx;
+		const unsigned long *pgd_phys;
+		const int *agaw;
+		const unsigned long *persistent_id;
+		int rc;
+
+		dmar_domain = alloc_domain(IOMMU_DOMAIN_UNMANAGED);
+
+		kho_mems = fdt_getprop(fdt, off, "mem", &len);
+		for (idx = 0; idx * sizeof(struct kho_mem) < len; ++idx)
+			kho_claim_mem(&kho_mems[idx]);
+
+		pgd_phys = fdt_getprop(fdt, off, "pgd", &len);
+		dmar_domain->pgd = phys_to_virt(*pgd_phys);
+		agaw = fdt_getprop(fdt, off, "agaw", &len);
+		dmar_domain->agaw = *agaw;
+		persistent_id = fdt_getprop(fdt, off, "persistent_id", &len);
+		dmar_domain->domain.persistent_id = *persistent_id;
+
+		rc = xa_insert(&persistent_domains, *persistent_id,
+				&dmar_domain->domain, GFP_KERNEL);
+		if (rc)
+			pr_warn("Unable to re-insert persistent domain %lu\n", *persistent_id);
+	}
+}
+
 int __init intel_iommu_deserialise_kho(void)
 {
+	const void *fdt = kho_get_fdt();
+	int off;
+
+	if (!fdt)
+		return 0;
+
+	off = fdt_path_offset(fdt, "/intel-iommu");
+	if (off <= 0)
+		return 0; /* No data in KHO */
+
+	deserialise_domains(fdt, fdt_subnode_offset(fdt, off, "domains"));
 	return 0;
 }
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread
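The deserialise loop above derives the number of kho_mem entries from the byte length that fdt_getprop() returns. A small standalone check of that arithmetic, including the malformed-length case the loop currently trusts implicitly (hypothetical `mem_rec` in place of `struct kho_mem`):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct kho_mem. */
struct mem_rec {
	uint64_t addr;
	uint64_t len;
};

/*
 * Number of records in a property of prop_len bytes, matching the
 * `idx * sizeof(struct kho_mem) < len` loop bound in the patch.
 * Returns 0 for a length that is not a whole number of records,
 * which the kernel loop would otherwise partially consume.
 */
static size_t mem_prop_entries(int prop_len)
{
	if (prop_len <= 0 || (size_t)prop_len % sizeof(struct mem_rec))
		return 0;
	return (size_t)prop_len / sizeof(struct mem_rec);
}
```

fdt_getprop() returns a negative length on error, so the sign check doubles as error handling for a missing "mem" property.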

* [RFC PATCH 11/13] iommu: Add callback to restore persisted iommu_domain
  2024-09-16 11:30 [RFC PATCH 00/13] Support iommu(fd) persistence for live update James Gowans
                   ` (9 preceding siblings ...)
  2024-09-16 11:30 ` [RFC PATCH 10/13] intel-iommu: Re-hydrate persistent domains after kexec James Gowans
@ 2024-09-16 11:31 ` James Gowans
  2024-10-03 13:33   ` Jason Gunthorpe
  2024-09-16 11:31 ` [RFC PATCH 12/13] iommufd, guestmemfs: Ensure persistent file used for persistent DMA James Gowans
  2024-09-16 11:31 ` [RFC PATCH 13/13] iommufd, guestmemfs: Pin files when mapped " James Gowans
  12 siblings, 1 reply; 33+ messages in thread
From: James Gowans @ 2024-09-16 11:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: Jason Gunthorpe, Kevin Tian, Joerg Roedel,
	Krzysztof Wilczyński, Will Deacon, Robin Murphy,
	Mike Rapoport, Madhavan T. Venkataraman, iommu,
	Sean Christopherson, Paolo Bonzini, kvm, David Woodhouse,
	Lu Baolu, Alexander Graf, anthony.yznaga, steven.sistare,
	nh-open-source, Saenz Julienne, Nicolas

The previous commits re-hydrated struct iommu_domain instances and
added them to the persistent_domains xarray. Now provide a callback to
look up such a domain so that iommufd can restore a link to it.

Roughly where the restore would happen is called out in a comment, but
some more head scratching is needed to figure out how to actually do
this.
---
 drivers/iommu/intel/iommu.c       | 12 ++++++++++++
 drivers/iommu/iommufd/serialise.c |  9 ++++++++-
 include/linux/iommu.h             |  5 +++++
 3 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 8e0ed033b03f..000ddfe5b6de 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -4690,6 +4690,17 @@ static int intel_iommu_read_and_clear_dirty(struct iommu_domain *domain,
 	return 0;
 }
 
+static struct iommu_domain *intel_domain_restore(struct device *dev,
+		unsigned long persistent_id)
+{
+	struct iommu_domain *domain;
+
+	domain = xa_load(&persistent_domains, persistent_id);
+	if (!domain)
+		pr_warn("No such persisted domain id %lu\n", persistent_id);
+	return domain;
+}
+
 static const struct iommu_dirty_ops intel_dirty_ops = {
 	.set_dirty_tracking = intel_iommu_set_dirty_tracking,
 	.read_and_clear_dirty = intel_iommu_read_and_clear_dirty,
@@ -4703,6 +4714,7 @@ const struct iommu_ops intel_iommu_ops = {
 	.domain_alloc		= intel_iommu_domain_alloc,
 	.domain_alloc_user	= intel_iommu_domain_alloc_user,
 	.domain_alloc_sva	= intel_svm_domain_alloc,
+	.domain_restore		= intel_domain_restore,
 	.probe_device		= intel_iommu_probe_device,
 	.release_device		= intel_iommu_release_device,
 	.get_resv_regions	= intel_iommu_get_resv_regions,
diff --git a/drivers/iommu/iommufd/serialise.c b/drivers/iommu/iommufd/serialise.c
index 9519969bd201..baac7d6150cb 100644
--- a/drivers/iommu/iommufd/serialise.c
+++ b/drivers/iommu/iommufd/serialise.c
@@ -139,7 +139,14 @@ static int rehydrate_iommufd(char *iommufd_name)
 		    area->node.last = *iova_start + *iova_len - 1;
 		    interval_tree_insert(&area->node, &ioas->iopt.area_itree);
 	    }
-	    /* TODO: restore link from ioas to hwpt. */
+	    /*
+	     * Here we should do something to associate struct iommufd_device with the
+	     * ictx, then get the iommu_ops via dev_iommu_ops(), and call the new
+	     * .domain_restore callback to get the struct iommu_domain.
+	     * Something like:
+	     * hwpt->domain = ops->domain_restore(dev, persistent_id);
+	     * Hand-wavy - the details elude me at the moment...
+	     */
 	}
 
 	return fd;
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index a616e8702a1c..0dc97d494fd9 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -529,6 +529,8 @@ static inline int __iommu_copy_struct_from_user_array(
  * @domain_alloc_paging: Allocate an iommu_domain that can be used for
  *                       UNMANAGED, DMA, and DMA_FQ domain types.
  * @domain_alloc_sva: Allocate an iommu_domain for Shared Virtual Addressing.
+ * @domain_restore: After kexec, restore the persisted domain that was
+ *                  originally allocated with the given persistent_id.
  * @probe_device: Add device to iommu driver handling
  * @release_device: Remove device from iommu driver handling
  * @probe_finalize: Do final setup work after the device is added to an IOMMU
@@ -576,6 +578,9 @@ struct iommu_ops {
 	struct iommu_domain *(*domain_alloc_sva)(struct device *dev,
 						 struct mm_struct *mm);
 
+	struct iommu_domain *(*domain_restore)(struct device *dev,
+			unsigned long persistent_id);
+
 	struct iommu_device *(*probe_device)(struct device *dev);
 	void (*release_device)(struct device *dev);
 	void (*probe_finalize)(struct device *dev);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC PATCH 12/13] iommufd, guestmemfs: Ensure persistent file used for persistent DMA
  2024-09-16 11:30 [RFC PATCH 00/13] Support iommu(fd) persistence for live update James Gowans
                   ` (10 preceding siblings ...)
  2024-09-16 11:31 ` [RFC PATCH 11/13] iommu: Add callback to restore persisted iommu_domain James Gowans
@ 2024-09-16 11:31 ` James Gowans
  2024-10-03 13:36   ` Jason Gunthorpe
  2024-09-16 11:31 ` [RFC PATCH 13/13] iommufd, guestmemfs: Pin files when mapped " James Gowans
  12 siblings, 1 reply; 33+ messages in thread
From: James Gowans @ 2024-09-16 11:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: Jason Gunthorpe, Kevin Tian, Joerg Roedel,
	Krzysztof Wilczyński, Will Deacon, Robin Murphy,
	Mike Rapoport, Madhavan T. Venkataraman, iommu,
	Sean Christopherson, Paolo Bonzini, kvm, David Woodhouse,
	Lu Baolu, Alexander Graf, anthony.yznaga, steven.sistare,
	nh-open-source, Saenz Julienne, Nicolas

When IOASes and hardware page tables are made persistent, DMA will
continue to access the mapped memory during and after kexec. This is
only legal if we are sure that the memory being accessed by that DMA is
also persistent; it would not be legal to map normal buddy-allocator
managed anonymous memory for persistent DMA.

Currently there is one provider of persistent memory: guestmemfs:
https://lore.kernel.org/all/20240805093245.889357-1-jgowans@amazon.com/

This commit ensures that only guestmemfs memory can be mapped into
persistent iommufds. This is almost certainly the wrong way and place to
do it, but something similar to this is needed. Perhaps in page.c?
As more persistent memory providers become available they can be added
to the list to check for.
---
 drivers/iommu/iommufd/ioas.c | 22 ++++++++++++++++++++++
 fs/guestmemfs/file.c         |  5 +++++
 include/linux/guestmemfs.h   |  7 +++++++
 3 files changed, 34 insertions(+)

diff --git a/drivers/iommu/iommufd/ioas.c b/drivers/iommu/iommufd/ioas.c
index 742248276548..ce76b41d2d72 100644
--- a/drivers/iommu/iommufd/ioas.c
+++ b/drivers/iommu/iommufd/ioas.c
@@ -2,9 +2,11 @@
 /*
  * Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES
  */
+#include <linux/guestmemfs.h>
 #include <linux/interval_tree.h>
 #include <linux/iommufd.h>
 #include <linux/iommu.h>
+#include <linux/mm_types.h>
 #include <uapi/linux/iommufd.h>
 
 #include "io_pagetable.h"
@@ -217,6 +219,26 @@ int iommufd_ioas_map(struct iommufd_ucmd *ucmd)
 	if (IS_ERR(ioas))
 		return PTR_ERR(ioas);
 
+	pr_info("iommufd_ioas_map persistent id %lu\n",
+			ucmd->ictx->persistent_id);
+	if (ucmd->ictx->persistent_id) {
+#ifdef CONFIG_GUESTMEMFS_FS
+		struct vm_area_struct *vma;
+		struct mm_struct *mm = current->mm;
+
+		mmap_read_lock(mm);
+		vma = find_vma_intersection(mm,
+				 cmd->user_va, cmd->user_va + cmd->length);
+		if (!vma || !is_guestmemfs_file(vma->vm_file)) {
+			mmap_read_unlock(mm);
+			return -EFAULT;
+		}
+		mmap_read_unlock(mm);
+#else
+		return -EFAULT;
+#endif /* CONFIG_GUESTMEMFS_FS */
+	}
+
 	if (!(cmd->flags & IOMMU_IOAS_MAP_FIXED_IOVA))
 		flags = IOPT_ALLOC_IOVA;
 	rc = iopt_map_user_pages(ucmd->ictx, &ioas->iopt, &iova,
diff --git a/fs/guestmemfs/file.c b/fs/guestmemfs/file.c
index 8707a9d3ad90..ecacaf200a31 100644
--- a/fs/guestmemfs/file.c
+++ b/fs/guestmemfs/file.c
@@ -104,3 +104,8 @@ const struct file_operations guestmemfs_file_fops = {
 	.owner = THIS_MODULE,
 	.mmap = mmap,
 };
+
+bool is_guestmemfs_file(struct file const *file)
+{
+	return file && file->f_op == &guestmemfs_file_fops;
+}
diff --git a/include/linux/guestmemfs.h b/include/linux/guestmemfs.h
index 60e769c8e533..c5cd7b6a5630 100644
--- a/include/linux/guestmemfs.h
+++ b/include/linux/guestmemfs.h
@@ -3,14 +3,21 @@
 #ifndef _LINUX_GUESTMEMFS_H
 #define _LINUX_GUESTMEMFS_H
 
+#include <linux/fs.h>
+
 /*
  * Carves out chunks of memory from memblocks for guestmemfs.
  * Must be called in early boot before memblocks are freed.
  */
 # ifdef CONFIG_GUESTMEMFS_FS
 void guestmemfs_reserve_mem(void);
+bool is_guestmemfs_file(struct file const *filp);
 #else
 void guestmemfs_reserve_mem(void) { }
+static inline bool is_guestmemfs_file(struct file const *filp)
+{
+	return false;
+}
 #endif
 
 #endif
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC PATCH 13/13] iommufd, guestmemfs: Pin files when mapped for persistent DMA
  2024-09-16 11:30 [RFC PATCH 00/13] Support iommu(fd) persistence for live update James Gowans
                   ` (11 preceding siblings ...)
  2024-09-16 11:31 ` [RFC PATCH 12/13] iommufd, guestmemfs: Ensure persistent file used for persistent DMA James Gowans
@ 2024-09-16 11:31 ` James Gowans
  12 siblings, 0 replies; 33+ messages in thread
From: James Gowans @ 2024-09-16 11:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: Jason Gunthorpe, Kevin Tian, Joerg Roedel,
	Krzysztof Wilczyński, Will Deacon, Robin Murphy,
	Mike Rapoport, Madhavan T. Venkataraman, iommu,
	Sean Christopherson, Paolo Bonzini, kvm, David Woodhouse,
	Lu Baolu, Alexander Graf, anthony.yznaga, steven.sistare,
	nh-open-source, Saenz Julienne, Nicolas

Ordinarily, after kexec the new kernel would have no idea that some
files are still in use as DMA targets; this could allow the files to be
deleted while still being accessed behind the scenes, resulting in
use-after-free of the persistent memory.

To prevent this, add the ability to do long term (across kexec) pinning
of files in guestmemfs. Iommufd is updated to use this when mapping a
file into a persistent domain. As long as the file has pins it cannot be
deleted.

A hand-wavy alternative would be to use something like the iommufd's
storage domain and actually do this at the PFN level.
---
 drivers/iommu/iommufd/ioas.c            |  4 ++++
 drivers/iommu/iommufd/iommufd_private.h |  5 +++++
 drivers/iommu/iommufd/serialise.c       |  9 ++++++++-
 fs/guestmemfs/file.c                    | 20 ++++++++++++++++++++
 fs/guestmemfs/guestmemfs.h              |  1 +
 fs/guestmemfs/inode.c                   |  4 ++++
 include/linux/guestmemfs.h              |  8 ++++++++
 7 files changed, 50 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/iommufd/ioas.c b/drivers/iommu/iommufd/ioas.c
index ce76b41d2d72..8b7fa3d17e8a 100644
--- a/drivers/iommu/iommufd/ioas.c
+++ b/drivers/iommu/iommufd/ioas.c
@@ -233,6 +233,7 @@ int iommufd_ioas_map(struct iommufd_ucmd *ucmd)
 			mmap_read_unlock(mm);
 			return -EFAULT;
 		}
+		ioas->pinned_file_handle = guestmemfs_pin_file(vma->vm_file);
 		mmap_read_unlock(mm);
 #else
 		return -EFAULT;
@@ -331,6 +332,9 @@ int iommufd_ioas_unmap(struct iommufd_ucmd *ucmd)
 				     &unmapped);
 		if (rc)
 			goto out_put;
+
+		if (ioas->pinned_file_handle)
+			guestmemfs_unpin_file(ioas->pinned_file_handle);
 	}
 
 	cmd->length = unmapped;
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 94612cec2814..597a54a1adf3 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -260,12 +260,17 @@ struct iommufd_object *_iommufd_object_alloc(struct iommufd_ctx *ictx,
  * An iommu_domain & iommfd_hw_pagetable will be automatically selected
  * for a device based on the hwpt_list. If no suitable iommu_domain
  * is found a new iommu_domain will be created.
+ *
+ * If this IOAS is pinning a file for persistent DMA, pinned_file_handle will
+ * be set to a non-zero value. When unmapping this IOAS the file will be
+ * unpinned.
  */
 struct iommufd_ioas {
 	struct iommufd_object obj;
 	struct io_pagetable iopt;
 	struct mutex mutex;
 	struct list_head hwpt_list;
+	unsigned long pinned_file_handle;
 };
 
 static inline struct iommufd_ioas *iommufd_get_ioas(struct iommufd_ctx *ictx,
diff --git a/drivers/iommu/iommufd/serialise.c b/drivers/iommu/iommufd/serialise.c
index baac7d6150cb..d95e150c3dd9 100644
--- a/drivers/iommu/iommufd/serialise.c
+++ b/drivers/iommu/iommufd/serialise.c
@@ -16,6 +16,7 @@
  *       account_mode = u8
  *       ioases = [
  *         {
+ *           pinned_file_handle = u64
  *           areas = [
  *           ]
  *         }
@@ -48,6 +49,9 @@ static int serialise_iommufd(void *fdt, struct iommufd_ctx *ictx)
 		snprintf(name, sizeof(name), "%lu", obj_idx);
 		err |= fdt_begin_node(fdt, name);
 
+		err |= fdt_property(fdt, "pinned-file-handle",
+				&ioas->pinned_file_handle, sizeof(ioas->pinned_file_handle));
+
 		for (area = iopt_area_iter_first(&ioas->iopt, 0, ULONG_MAX); area;
 				area = iopt_area_iter_next(area, 0, ULONG_MAX)) {
 			unsigned long iova_start, iova_len;
@@ -119,15 +123,18 @@ static int rehydrate_iommufd(char *iommufd_name)
 	snprintf(kho_path, sizeof(kho_path), "/iommufd/iommufds/%s/ioases", iommufd_name);
 	fdt_for_each_subnode(off, fdt, fdt_path_offset(fdt, kho_path)) {
 	    struct iommufd_ioas *ioas;
+	    int len;
 	    int range_off;
+	    const unsigned long *pinned_file_handle;
 
 	    ioas = iommufd_ioas_alloc(ictx);
+	    pinned_file_handle = fdt_getprop(fdt, off, "pinned-file-handle", &len);
+	    ioas->pinned_file_handle = *pinned_file_handle;
 	    iommufd_object_finalize(ictx, &ioas->obj);
 
 	    fdt_for_each_subnode(range_off, fdt, off) {
 		    const unsigned long *iova_start, *iova_len;
 		    const int *iommu_prot;
-		    int len;
 		    struct iopt_area *area = iopt_area_alloc();
 
 		    iova_start = fdt_getprop(fdt, range_off, "iova-start", &len);
diff --git a/fs/guestmemfs/file.c b/fs/guestmemfs/file.c
index ecacaf200a31..d7840831df03 100644
--- a/fs/guestmemfs/file.c
+++ b/fs/guestmemfs/file.c
@@ -109,3 +109,23 @@ bool is_guestmemfs_file(struct file const *file)
 {
 	return file && file->f_op == &guestmemfs_file_fops;
 }
+
+unsigned long guestmemfs_pin_file(struct file *file)
+{
+	struct guestmemfs_inode *inode =
+		guestmemfs_get_persisted_inode(file->f_inode->i_sb,
+				file->f_inode->i_ino);
+
+	atomic_inc(&inode->long_term_pins);
+	return file->f_inode->i_ino;
+}
+
+void guestmemfs_unpin_file(unsigned long pin_handle)
+{
+	struct guestmemfs_inode *inode =
+		guestmemfs_get_persisted_inode(guestmemfs_sb, pin_handle);
+	int new;
+
+	new = atomic_dec_return(&inode->long_term_pins);
+	WARN_ON(new < 0);
+}
diff --git a/fs/guestmemfs/guestmemfs.h b/fs/guestmemfs/guestmemfs.h
index 91cc06ae45a5..d107ad0e3323 100644
--- a/fs/guestmemfs/guestmemfs.h
+++ b/fs/guestmemfs/guestmemfs.h
@@ -42,6 +42,7 @@ struct guestmemfs_inode {
 	char filename[GUESTMEMFS_FILENAME_LEN];
 	void *mappings;
 	int num_mappings;
+	atomic_t long_term_pins;
 };
 
 void guestmemfs_initialise_inode_store(struct super_block *sb);
diff --git a/fs/guestmemfs/inode.c b/fs/guestmemfs/inode.c
index d521b35d4992..6bc0abbde8d1 100644
--- a/fs/guestmemfs/inode.c
+++ b/fs/guestmemfs/inode.c
@@ -151,6 +151,10 @@ static int guestmemfs_unlink(struct inode *dir, struct dentry *dentry)
 
 	ino = guestmemfs_get_persisted_inode(dir->i_sb, dir->i_ino)->child_ino;
 
+	inode = guestmemfs_get_persisted_inode(dir->i_sb, dentry->d_inode->i_ino);
+	if (atomic_read(&inode->long_term_pins))
+		return -EBUSY;
+
 	/* Special case for first file in dir */
 	if (ino == dentry->d_inode->i_ino) {
 		guestmemfs_get_persisted_inode(dir->i_sb, dir->i_ino)->child_ino =
diff --git a/include/linux/guestmemfs.h b/include/linux/guestmemfs.h
index c5cd7b6a5630..c2018b4f38fd 100644
--- a/include/linux/guestmemfs.h
+++ b/include/linux/guestmemfs.h
@@ -20,4 +20,12 @@ static inline bool is_guestmemfs_file(struct file const *filp)
 }
 #endif
 
+/*
+ * Ensure that the file cannot be deleted or have its memory changed
+ * until it is unpinned. The returned value is a handle which can be
+ * used to un-pin the file.
+ */
+unsigned long guestmemfs_pin_file(struct file *file);
+void guestmemfs_unpin_file(unsigned long pin_handle);
+
 #endif
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 05/13] iommufd: Serialise persisted iommufds and ioas
  2024-09-16 11:30 ` [RFC PATCH 05/13] iommufd: Serialise persisted iommufds and ioas James Gowans
@ 2024-10-02 18:55   ` Jason Gunthorpe
  2024-10-07  8:39     ` Gowans, James
  2024-10-16 22:20   ` Jacob Pan
  1 sibling, 1 reply; 33+ messages in thread
From: Jason Gunthorpe @ 2024-10-02 18:55 UTC (permalink / raw)
  To: James Gowans
  Cc: linux-kernel, Kevin Tian, Joerg Roedel, Krzysztof Wilczyński,
	Will Deacon, Robin Murphy, Mike Rapoport,
	Madhavan T. Venkataraman, iommu, Sean Christopherson,
	Paolo Bonzini, kvm, David Woodhouse, Lu Baolu, Alexander Graf,
	anthony.yznaga, steven.sistare, nh-open-source,
	Saenz Julienne, Nicolas

On Mon, Sep 16, 2024 at 01:30:54PM +0200, James Gowans wrote:
> Now actually implementing the serialise callback for iommufd.
> On KHO activate, iterate through all persisted domains and write their
> metadata to the device tree format. For now just a few fields are
> serialised to demonstrate the concept. To actually make this useful a
> lot more fields and related objects will need to be serialised too.

But isn't that a rather difficult problem? The "a lot more fields"
include things like pointers to the mm struct, the user_struct and
task_struct, then all the pinning accounting as well.

Coming work extends this to memfds and more is coming. I would expect
this KHO stuff to use the memfd-like path to access the physical VM
memory too.

I think expecting to serialize and restore everything like this is
probably much too complicated.

If you could just retain a small portion and then directly reconstruct
the missing parts it seems like it would be more maintainable.

Ie "recover" a HWPT from a KHO on a manually created IOAS with the
right "memfd" for the backing storage. Then the recovery can just
validate that things are correct and adopt the iommu_domain as the
hwpt.

Eventually you'll want this to work for the viommus as well, and that
seems like a lot more tricky complexity..

Jason

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 03/13] iommu/intel: zap context table entries on kexec
  2024-09-16 11:30 ` [RFC PATCH 03/13] iommu/intel: zap context table entries on kexec James Gowans
@ 2024-10-03 13:27   ` Jason Gunthorpe
  0 siblings, 0 replies; 33+ messages in thread
From: Jason Gunthorpe @ 2024-10-03 13:27 UTC (permalink / raw)
  To: James Gowans
  Cc: linux-kernel, Kevin Tian, Joerg Roedel, Krzysztof Wilczyński,
	Will Deacon, Robin Murphy, Mike Rapoport,
	Madhavan T. Venkataraman, iommu, Sean Christopherson,
	Paolo Bonzini, kvm, David Woodhouse, Lu Baolu, Alexander Graf,
	anthony.yznaga, steven.sistare, nh-open-source,
	Saenz Julienne, Nicolas

On Mon, Sep 16, 2024 at 01:30:52PM +0200, James Gowans wrote:
> Instead of fully shutting down the IOMMU on kexec, rather zap context
> table entries for devices. This is the initial step to be able to
> persist some domains. Once a struct iommu_domain can be marked
> persistent then those persistent domains will be skipped when doing the
> IOMMU shut down.
> ---
>  drivers/iommu/intel/dmar.c  |  1 +
>  drivers/iommu/intel/iommu.c | 34 ++++++++++++++++++++++++++++++----
>  drivers/iommu/intel/iommu.h |  2 ++
>  3 files changed, 33 insertions(+), 4 deletions(-)

We should probably try to avoid doing this kind of stuff in
drivers. The core code can generically ask drivers to attach a
BLOCKING domain as part of the kexec sequence and the core code can
then decide which devices should be held over.

There is also some complexity here around RMRs, we can't always apply
a blocking domain... Not sure what you'd do in those cases.

IIRC we already do something like this with the bus master enable on
the PCI side?? At least, if the kernel is deciding to block DMA when
the IOMMU is on it should do it consistently and inhibit the PCI
device as well.

Jason

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 11/13] iommu: Add callback to restore persisted iommu_domain
  2024-09-16 11:31 ` [RFC PATCH 11/13] iommu: Add callback to restore persisted iommu_domain James Gowans
@ 2024-10-03 13:33   ` Jason Gunthorpe
  0 siblings, 0 replies; 33+ messages in thread
From: Jason Gunthorpe @ 2024-10-03 13:33 UTC (permalink / raw)
  To: James Gowans
  Cc: linux-kernel, Kevin Tian, Joerg Roedel, Krzysztof Wilczyński,
	Will Deacon, Robin Murphy, Mike Rapoport,
	Madhavan T. Venkataraman, iommu, Sean Christopherson,
	Paolo Bonzini, kvm, David Woodhouse, Lu Baolu, Alexander Graf,
	anthony.yznaga, steven.sistare, nh-open-source,
	Saenz Julienne, Nicolas

On Mon, Sep 16, 2024 at 01:31:00PM +0200, James Gowans wrote:
> diff --git a/drivers/iommu/iommufd/serialise.c b/drivers/iommu/iommufd/serialise.c
> index 9519969bd201..baac7d6150cb 100644
> --- a/drivers/iommu/iommufd/serialise.c
> +++ b/drivers/iommu/iommufd/serialise.c
> @@ -139,7 +139,14 @@ static int rehydrate_iommufd(char *iommufd_name)
>  		    area->node.last = *iova_start + *iova_len - 1;
>  		    interval_tree_insert(&area->node, &ioas->iopt.area_itree);
>  	    }
> -	    /* TODO: restore link from ioas to hwpt. */
> +	    /*
> +	     * Here we should do something to associate struct iommufd_device with the
> +	     * ictx, then get the iommu_ops via dev_iommu_ops(), and call the new
> +	     * .domain_restore callback to get the struct iommu_domain.
> +	     * Something like:
> +	     * hwpt->domain = ops->domain_restore(dev, persistent_id);
> +	     * Hand-wavy - the details elude me at the moment...
> +	     */
>  	}

The core code should request a iommu_domain handle for the
pre-existing translation very early on, it should not leave the device
in some weird NULL domain state. I have been trying hard to eliminate
that.

The special domain would need to remain attached and some protocol
would be needed to carefully convey that past vfio to iommufd,
including inhibiting attaching a blocked domain in VFIO
startup. Including blocking FLRs from VFIO and rejecting attaches to
other non-VFIO drivers.

This is a twisty complicated path, it needs some solid definition of
what the lifecycle of this special domain is, and some sensible exits
if userspace isn't expecting/co-operating with the hand over, or it
crashes while doing this..

> @@ -576,6 +578,9 @@ struct iommu_ops {
>  	struct iommu_domain *(*domain_alloc_sva)(struct device *dev,
>  						 struct mm_struct *mm);
>  
> +	struct iommu_domain *(*domain_restore)(struct device *dev,
> +			unsigned long persistent_id);
> +

Why do we need an ID? There is only one persistent domain per device,
right?

This may need PASID, at least Intel requires the hypervisor to handle
PASID domains, and they would need to persist as well.

Jason

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 12/13] iommufd, guestmemfs: Ensure persistent file used for persistent DMA
  2024-09-16 11:31 ` [RFC PATCH 12/13] iommufd, guestmemfs: Ensure persistent file used for persistent DMA James Gowans
@ 2024-10-03 13:36   ` Jason Gunthorpe
  0 siblings, 0 replies; 33+ messages in thread
From: Jason Gunthorpe @ 2024-10-03 13:36 UTC (permalink / raw)
  To: James Gowans
  Cc: linux-kernel, Kevin Tian, Joerg Roedel, Krzysztof Wilczyński,
	Will Deacon, Robin Murphy, Mike Rapoport,
	Madhavan T. Venkataraman, iommu, Sean Christopherson,
	Paolo Bonzini, kvm, David Woodhouse, Lu Baolu, Alexander Graf,
	anthony.yznaga, steven.sistare, nh-open-source,
	Saenz Julienne, Nicolas

On Mon, Sep 16, 2024 at 01:31:01PM +0200, James Gowans wrote:

> +#ifdef CONFIG_GUESTMEMFS_FS
> +		struct vm_area_struct *vma;
> +		struct mm_struct *mm = current->mm;
> +
> +		mmap_read_lock(mm);
> +		vma = find_vma_intersection(current->mm,
> +				 cmd->user_va, cmd->user_va + cmd->length);
> +		if (!vma || !is_guestmemfs_file(vma->vm_file)) {
> +			mmap_read_unlock(mm);
> +			return -EFAULT;
> +		}
> +		mmap_read_unlock(mm);
> +#else

Any kind of FD interaction needs to go through the new FD path that
Steve is building:

https://lore.kernel.org/linux-iommu/1727190338-385692-1-git-send-email-steven.sistare@oracle.com

I'm expecting multiple kinds of fds to fall into that pattern,
including memfs, guestmemfd, and dmabuf. guestmemfd can just be one
more..

Jason

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 05/13] iommufd: Serialise persisted iommufds and ioas
  2024-10-02 18:55   ` Jason Gunthorpe
@ 2024-10-07  8:39     ` Gowans, James
  2024-10-07  8:47       ` David Woodhouse
  2024-10-07 15:16       ` Jason Gunthorpe
  0 siblings, 2 replies; 33+ messages in thread
From: Gowans, James @ 2024-10-07  8:39 UTC (permalink / raw)
  To: jgg@ziepe.ca
  Cc: kvm@vger.kernel.org, rppt@kernel.org, kw@linux.com,
	iommu@lists.linux.dev, madvenka@linux.microsoft.com,
	anthony.yznaga@oracle.com, robin.murphy@arm.com,
	baolu.lu@linux.intel.com, nh-open-source@amazon.com,
	linux-kernel@vger.kernel.org, seanjc@google.com,
	Saenz Julienne, Nicolas, pbonzini@redhat.com,
	kevin.tian@intel.com, dwmw2@infradead.org,
	steven.sistare@oracle.com, Graf (AWS), Alexander, will@kernel.org,
	joro@8bytes.org

On Wed, 2024-10-02 at 15:55 -0300, Jason Gunthorpe wrote:
> On Mon, Sep 16, 2024 at 01:30:54PM +0200, James Gowans wrote:
> > Now actually implementing the serialise callback for iommufd.
> > On KHO activate, iterate through all persisted domains and write their
> > metadata to the device tree format. For now just a few fields are
> > serialised to demonstrate the concept. To actually make this useful a
> > lot more fields and related objects will need to be serialised too.
> 
> But isn't that a rather difficult problem? The "a lot more fields"
> include things like pointers to the mm struct, the user_struct and
> task_struct, then all the pinning accounting as well.
> 
> Coming work extends this to memfds and more is coming. I would expect
> this KHO stuff to use the memfd-like path to access the physical VM
> memory too.
> 
> I think expecting to serialize and restore everything like this is
> probably much too complicated.

On reflection I think you're right - this will be complex both from a
development and a maintenance perspective, trying to make sure we
serialise all the necessary state and reconstruct it correctly. Even
more complex when structs are refactored/changed across kernel versions.
An important requirement of this functionality is the ability to kexec
between different kernel versions including going back to an older
kernel version in the case of a rollback.

So, let's look at other options:

> 
> If you could just retain a small portion and then directly reconstruct
> the missing parts it seems like it would be more maintainable.

I think we have two other possible approaches here:

1. What this RFC is sketching out, serialising fields from the structs
and setting those fields again on deserialise. As you point out this
will be complicated.

2. Get userspace to do the work: userspace needs to re-do the ioctls
after kexec to reconstruct the objects. My main issue with this approach
is that the kernel needs to do some sort of trust but verify approach to
ensure that userspace constructs everything the same way after kexec as
it was before kexec. We don't want to end up in a state where the
iommufd objects don't match the persisted page tables.

3. Serialise and replay the ioctls. Ioctl APIs and payloads should
(must?) be stable across kernel versions. If IOMMUFD records the ioctls
executed by userspace then it could replay them as part of deserialise
and give userspace a handle to the resulting objects after kexec. This
way we are guaranteed consistent iommufd / IOAS objects. By "consistent"
I mean they are the same as before kexec and match the persisted page
tables. By having the kernel do this it means it doesn't need to depend
on userspace doing the correct thing.

What do you think of this 3rd approach? I can try to sketch it out and
send another RFC if you think it sounds reasonable.

> 
> Ie "recover" a HWPT from a KHO on a manually created IOAS with the
> right "memfd" for the backing storage. Then the recovery can just
> validate that things are correct and adopt the iommu_domain as the
> hwpt.

This sounds more like option 2 where we expect userspace to re-drive the
ioctls, but verify that they have corresponding payloads as before kexec
so that iommufd objects are consistent with persisted page tables.
If the kernel is doing verification wouldn't it be better for the kernel
to do the ioctl work itself and give the resulting objects to userspace?

> 
> Eventually you'll want this to work for the viommus as well, and that
> seems like a lot more tricky complexity..
> 
> Jason


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 05/13] iommufd: Serialise persisted iommufds and ioas
  2024-10-07  8:39     ` Gowans, James
@ 2024-10-07  8:47       ` David Woodhouse
  2024-10-07  8:57         ` Gowans, James
  2024-10-07 15:11         ` Jason Gunthorpe
  2024-10-07 15:16       ` Jason Gunthorpe
  1 sibling, 2 replies; 33+ messages in thread
From: David Woodhouse @ 2024-10-07  8:47 UTC (permalink / raw)
  To: Gowans, James, jgg@ziepe.ca
  Cc: kvm@vger.kernel.org, rppt@kernel.org, kw@linux.com,
	iommu@lists.linux.dev, madvenka@linux.microsoft.com,
	anthony.yznaga@oracle.com, robin.murphy@arm.com,
	baolu.lu@linux.intel.com, nh-open-source@amazon.com,
	linux-kernel@vger.kernel.org, seanjc@google.com,
	Saenz Julienne, Nicolas, pbonzini@redhat.com,
	kevin.tian@intel.com, steven.sistare@oracle.com,
	Graf (AWS), Alexander, will@kernel.org, joro@8bytes.org


On Mon, 2024-10-07 at 08:39 +0000, Gowans, James wrote:
> 
> I think we have two other possible approaches here:
> 
> 1. What this RFC is sketching out, serialising fields from the structs
> and setting those fields again on deserialise. As you point out this
> will be complicated.
> 
> 2. Get userspace to do the work: userspace needs to re-do the ioctls
> after kexec to reconstruct the objects. My main issue with this approach
> is that the kernel needs to do some sort of trust but verify approach to
> ensure that userspace constructs everything the same way after kexec as
> it was before kexec. We don't want to end up in a state where the
> iommufd objects don't match the persisted page tables.

To what extent does the kernel really need to trust or verify? At LPC
we seemed to speak of a model where userspace builds a "new" address
space for each device and then atomically switches to the new page
tables instead of the original ones inherited from the previous kernel.

That does involve having space for another set of page tables, of
course, but that's not impossible.

> 3. Serialise and replay the ioctls. Ioctl APIs and payloads should
> (must?) be stable across kernel versions. If IOMMUFD records the ioctls
> executed by userspace then it could replay them as part of deserialise
> and give userspace a handle to the resulting objects after kexec. This
> way we are guaranteed consistent iommufd / IOAS objects. By "consistent"
> I mean they are the same as before kexec and match the persisted page
> tables. By having the kernel do this it means it doesn't need to depend
> on userspace doing the correct thing.
> 
> What do you think of this 3rd approach? I can try to sketch it out and
> send another RFC if you think it sounds reasonable.



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 05/13] iommufd: Serialise persisted iommufds and ioas
  2024-10-07  8:47       ` David Woodhouse
@ 2024-10-07  8:57         ` Gowans, James
  2024-10-07 15:01           ` Jason Gunthorpe
  2024-10-07 15:11         ` Jason Gunthorpe
  1 sibling, 1 reply; 33+ messages in thread
From: Gowans, James @ 2024-10-07  8:57 UTC (permalink / raw)
  To: jgg@ziepe.ca, dwmw2@infradead.org
  Cc: kvm@vger.kernel.org, rppt@kernel.org, kw@linux.com,
	iommu@lists.linux.dev, madvenka@linux.microsoft.com,
	anthony.yznaga@oracle.com, robin.murphy@arm.com,
	baolu.lu@linux.intel.com, nh-open-source@amazon.com,
	linux-kernel@vger.kernel.org, seanjc@google.com,
	Saenz Julienne, Nicolas, pbonzini@redhat.com,
	kevin.tian@intel.com, maz@kernel.org, steven.sistare@oracle.com,
	Graf (AWS), Alexander, will@kernel.org, joro@8bytes.org

On Mon, 2024-10-07 at 09:47 +0100, David Woodhouse wrote:
> On Mon, 2024-10-07 at 08:39 +0000, Gowans, James wrote:
> > 
> > I think we have two other possible approaches here:
> > 
> > 1. What this RFC is sketching out, serialising fields from the structs
> > and setting those fields again on deserialise. As you point out this
> > will be complicated.
> > 
> > 2. Get userspace to do the work: userspace needs to re-do the ioctls
> > after kexec to reconstruct the objects. My main issue with this approach
> > is that the kernel needs to do some sort of trust but verify approach to
> > ensure that userspace constructs everything the same way after kexec as
> > it was before kexec. We don't want to end up in a state where the
> > iommufd objects don't match the persisted page tables.
> 
> To what extent does the kernel really need to trust or verify? At LPC
> we seemed to speak of a model where userspace builds a "new" address
> space for each device and then atomically switches to the new page
> tables instead of the original ones inherited from the previous kernel.
> 
> That does involve having space for another set of page tables, of
> course, but that's not impossible.

The idea of constructing fresh page tables and then swapping over to
that is indeed appealing, but I don't know if that's always possible.
With the ARM SMMUv3 for example I think there are break-before-make
requirements, so is it possible to do an atomic switch of the SMMUv3 page
table PGD in a hitless way? Everything here must be hitless - serialise
and deserialise must not cause any DMA faults.

If it's not possible to do a hitless atomic switch (I am unsure about
this, need to RTFM) then we're compelled to re-use the existing page
tables and if that's the case I think the kernel MUST ensure that the
iommufd IOAS object exactly match the ones before kexec. I can imagine
all sorts of mess if those go out of sync! 


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 05/13] iommufd: Serialise persisted iommufds and ioas
  2024-10-07  8:57         ` Gowans, James
@ 2024-10-07 15:01           ` Jason Gunthorpe
  2024-10-09 11:44             ` Gowans, James
  0 siblings, 1 reply; 33+ messages in thread
From: Jason Gunthorpe @ 2024-10-07 15:01 UTC (permalink / raw)
  To: Gowans, James
  Cc: dwmw2@infradead.org, kvm@vger.kernel.org, rppt@kernel.org,
	kw@linux.com, iommu@lists.linux.dev, madvenka@linux.microsoft.com,
	anthony.yznaga@oracle.com, robin.murphy@arm.com,
	baolu.lu@linux.intel.com, nh-open-source@amazon.com,
	linux-kernel@vger.kernel.org, seanjc@google.com,
	Saenz Julienne, Nicolas, pbonzini@redhat.com,
	kevin.tian@intel.com, maz@kernel.org, steven.sistare@oracle.com,
	Graf (AWS), Alexander, will@kernel.org, joro@8bytes.org

On Mon, Oct 07, 2024 at 08:57:07AM +0000, Gowans, James wrote:
> With the ARM SMMUv3 for example I think there is a break-before-make
> requirement, so is it possible to do an atomic switch of the SMMUv3 page
> table PGD in a hitless way? 

The BBM rules are only about cached translations. If all your IOPTEs
result in the same translation *size* then you are safe. You can
change the radix memory storing the IOPTEs freely, AFAIK.

BBM level 2 capable HW doesn't have those limitations either and
everything is possible.

Jason

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 05/13] iommufd: Serialise persisted iommufds and ioas
  2024-10-07  8:47       ` David Woodhouse
  2024-10-07  8:57         ` Gowans, James
@ 2024-10-07 15:11         ` Jason Gunthorpe
  1 sibling, 0 replies; 33+ messages in thread
From: Jason Gunthorpe @ 2024-10-07 15:11 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Gowans, James, kvm@vger.kernel.org, rppt@kernel.org, kw@linux.com,
	iommu@lists.linux.dev, madvenka@linux.microsoft.com,
	anthony.yznaga@oracle.com, robin.murphy@arm.com,
	baolu.lu@linux.intel.com, nh-open-source@amazon.com,
	linux-kernel@vger.kernel.org, seanjc@google.com,
	Saenz Julienne, Nicolas, pbonzini@redhat.com,
	kevin.tian@intel.com, steven.sistare@oracle.com,
	Graf (AWS), Alexander, will@kernel.org, joro@8bytes.org

On Mon, Oct 07, 2024 at 09:47:53AM +0100, David Woodhouse wrote:
> On Mon, 2024-10-07 at 08:39 +0000, Gowans, James wrote:
> > 
> > I think we have two other possible approaches here:
> > 
> > 1. What this RFC is sketching out, serialising fields from the structs
> > and setting those fields again on deserialise. As you point out this
> > will be complicated.
> > 
> > 2. Get userspace to do the work: userspace needs to re-do the ioctls
> > after kexec to reconstruct the objects. My main issue with this approach
> > is that the kernel needs to do some sort of trust but verify approach to
> > ensure that userspace constructs everything the same way after kexec as
> > it was before kexec. We don't want to end up in a state where the
> > iommufd objects don't match the persisted page tables.
> 
> To what extent does the kernel really need to trust or verify? 

If iommufd is going to adopt an existing iommu_domain then that
iommu_domain must have exactly the IOPTEs iommufd expects it to have,
otherwise there will be functional problems in iommufd.

So, IMHO, some kind of validation would be needed to ensure that
userspace has created the same structure as the old kernel had.

> At LPC we seemed to speak of a model where userspace builds a "new"
> address space for each device and then atomically switches to the
> new page tables instead of the original ones inherited from the
> previous kernel.

The hitless replace model would leave the old translation in place
while userspace builds up a replacement translation that is
equivalent. Then hitless replace would adopt the new translation and
we discard the old one's memory.

IMHO this is easiest to make correct and least maintenance burden
because the only kernel thing you are asking for in iommufd is hitless
iommu_domain replace, which we already want to add to the drivers
anyhow. (ARM already has it)
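A rough sketch of that flow, with stand-in types throughout (toy_domain, toy_device and hitless_replace() are all made up for illustration, not iommufd API): the inherited domain stays attached while a fresh, replayed one is built, and the device only switches over once the two are verified equivalent:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy domain: a fixed-size IOVA-page-index -> output-address map. */
struct toy_domain {
	uint64_t map[8];	/* 0 = unmapped */
};

struct toy_device {
	struct toy_domain *attached;
};

static bool domains_equivalent(const struct toy_domain *a,
			       const struct toy_domain *b)
{
	return memcmp(a->map, b->map, sizeof(a->map)) == 0;
}

/*
 * Returns the now-unused old domain on success so the caller can free
 * its radix memory, or NULL if validation failed (in which case the
 * device keeps the inherited translation and DMA never faults).
 */
static struct toy_domain *hitless_replace(struct toy_device *dev,
					  struct toy_domain *fresh)
{
	struct toy_domain *old = dev->attached;

	if (!domains_equivalent(old, fresh))
		return NULL;
	dev->attached = fresh;	/* the atomic switch */
	return old;
}
```

The ordering is the point: replay first, validate second, switch last, so a replay that comes out different is caught before any DMA ever sees it.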

Jason

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 05/13] iommufd: Serialise persisted iommufds and ioas
  2024-10-07  8:39     ` Gowans, James
  2024-10-07  8:47       ` David Woodhouse
@ 2024-10-07 15:16       ` Jason Gunthorpe
  1 sibling, 0 replies; 33+ messages in thread
From: Jason Gunthorpe @ 2024-10-07 15:16 UTC (permalink / raw)
  To: Gowans, James
  Cc: kvm@vger.kernel.org, rppt@kernel.org, kw@linux.com,
	iommu@lists.linux.dev, madvenka@linux.microsoft.com,
	anthony.yznaga@oracle.com, robin.murphy@arm.com,
	baolu.lu@linux.intel.com, nh-open-source@amazon.com,
	linux-kernel@vger.kernel.org, seanjc@google.com,
	Saenz Julienne, Nicolas, pbonzini@redhat.com,
	kevin.tian@intel.com, dwmw2@infradead.org,
	steven.sistare@oracle.com, Graf (AWS), Alexander, will@kernel.org,
	joro@8bytes.org

On Mon, Oct 07, 2024 at 08:39:53AM +0000, Gowans, James wrote:

> 2. Get userspace to do the work: userspace needs to re-do the ioctls
> after kexec to reconstruct the objects. My main issue with this approach
> is that the kernel needs to do some sort of trust but verify approach to
> ensure that userspace constructs everything the same way after kexec as
> it was before kexec. We don't want to end up in a state where the
> iommufd objects don't match the persisted page tables.

I think the verification is not so bad: it is just extracting the
physical addresses from the IOAS and comparing them to what is stored in
the iommu_domain. If they don't match then the domain can't be adopted
to the IOAS.

We actually don't care about anything else, if userspace creates
different objects with different parameters who cares? All that
matters is that the radix tree contains the same expected information.
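Sketched as a toy model (the types and names below are stand-ins, not iommufd internals), the check is a walk of the pinned IOAS areas against the inherited domain's translations; here the domain is modelled as a flat page-index lookup in place of something like iommu_iova_to_phys():

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12

/* Toy model of a pinned IOAS area: an IOVA range plus its PFN list. */
struct toy_area {
	uint64_t iova;
	uint64_t npages;
	const uint64_t *pfns;	/* one PFN per page in the area */
};

/*
 * domain_pa[i] is the output address the inherited iommu_domain gives
 * for IOVA page i (a stand-in for an iova-to-phys lookup). The domain
 * may be adopted only if every page of the area translates to exactly
 * the PFN the IOAS has pinned.
 */
static bool ioas_matches_domain(const struct toy_area *area,
				const uint64_t *domain_pa,
				size_t domain_pages)
{
	uint64_t p;

	for (p = 0; p < area->npages; p++) {
		uint64_t idx = (area->iova >> PAGE_SHIFT) + p;

		if (idx >= domain_pages ||
		    domain_pa[idx] != area->pfns[p] << PAGE_SHIFT)
			return false;	/* domain can't be adopted */
	}
	return true;
}
```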

> What do you think of this 3rd approach? I can try to sketch it out and
> send another RFC if you think it sounds reasonable.

I think it is the same problem, just in a more maintainable
wrapper. You still have to serialize lots and lots of different
objects and their relationships.

> > Ie "recover" a HWPT from a KHO on a manually created a IOAS with the
> > right "memfd" for the backing storage. Then the recovery can just
> > validate that things are correct and adopt the iommu_domain as the
> > hwpt.
>
> This sounds more like option 2 where we expect userspace to re-drive the
> ioctls, but verify that they have corresponding payloads as before kexec
> so that iommufd objects are consistent with persisted page tables.

Yes

> If the kernel is doing verification wouldn't it be better for the kernel
> to do the ioctl work itself and give the resulting objects to
> userspace?

No :)

It is so much easier to validate the IOPTEs in a radix tree.

At the very worst you just create a HWPT and iommu_domain for
validation, do validation and then throw it away. Compare for two
radix trees is about 50 lines in generic pt, I have it already.
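Something along these lines, using a toy fixed-fanout table rather than the real generic pt code (the struct, flag, and helper are all illustrative):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ENTRIES_PER_TABLE 512
#define PTE_LEAF (1ull << 0)	/* illustrative "this entry is a leaf" flag */

/*
 * Toy radix table: leaf entries carry an output address plus PTE_LEAF,
 * non-leaf slots point at a child table. Not the real generic pt types.
 */
struct pt_table {
	uint64_t entry[ENTRIES_PER_TABLE];
	struct pt_table *child[ENTRIES_PER_TABLE];
};

/*
 * Two trees are equivalent if every slot is leaf/non-leaf in the same
 * way, leaves encode identical translations, and children match
 * recursively. The backing memory of the tables themselves may differ.
 */
static bool pt_equal(const struct pt_table *a, const struct pt_table *b)
{
	size_t i;

	for (i = 0; i < ENTRIES_PER_TABLE; i++) {
		bool a_leaf = a->entry[i] & PTE_LEAF;
		bool b_leaf = b->entry[i] & PTE_LEAF;

		if (a_leaf != b_leaf)
			return false;
		if (a_leaf && a->entry[i] != b->entry[i])
			return false;
		if (!a_leaf) {
			if (!a->child[i] != !b->child[i])
				return false;
			if (a->child[i] &&
			    !pt_equal(a->child[i], b->child[i]))
				return false;
		}
	}
	return true;
}
```

A throwaway validation HWPT would be compared against the inherited domain with a walk like this and then discarded.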

Jason

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 05/13] iommufd: Serialise persisted iommufds and ioas
  2024-10-07 15:01           ` Jason Gunthorpe
@ 2024-10-09 11:44             ` Gowans, James
  2024-10-09 12:28               ` Jason Gunthorpe
  0 siblings, 1 reply; 33+ messages in thread
From: Gowans, James @ 2024-10-09 11:44 UTC (permalink / raw)
  To: jgg@ziepe.ca
  Cc: kvm@vger.kernel.org, rppt@kernel.org, kw@linux.com,
	iommu@lists.linux.dev, madvenka@linux.microsoft.com,
	anthony.yznaga@oracle.com, robin.murphy@arm.com,
	baolu.lu@linux.intel.com, nh-open-source@amazon.com,
	linux-kernel@vger.kernel.org, seanjc@google.com,
	Saenz Julienne, Nicolas, pbonzini@redhat.com,
	kevin.tian@intel.com, dwmw2@infradead.org,
	steven.sistare@oracle.com, Graf (AWS), Alexander, will@kernel.org,
	joro@8bytes.org, maz@kernel.org

On Mon, 2024-10-07 at 12:01 -0300, Jason Gunthorpe wrote:
> On Mon, Oct 07, 2024 at 08:57:07AM +0000, Gowans, James wrote:
> > With the ARM SMMUv3 for example I think there is a break-before-make
> > requirement, so is it possible to do an atomic switch of the SMMUv3 page
> > table PGD in a hitless way?
> 
> The BBM rules are only about cached translations. If all your IOPTEs
> result in the same translation *size* then you are safe. You can
> change the radix memory storing the IOPTEs freely, AFAIK.

Okay, but in general this still means that the page tables must have
exactly the same translations if we try to switch from one set to
another. If it is possible to change translations then translation table
entries could be created at different granularity (PTE, PMD, PUD) levels
which would violate this requirement. 

It's also possible for different IOMMU driver versions to set up the
same translations, but at different page table levels. Perhaps an older
version did not coalesce some PTEs, but a newer version does coalesce.
Would the same translations but at a different size violate BBM?

If we say that to be safe/correct in the general case then it is
necessary for the translations to be *exactly* the same before and after
kexec, is there any benefit to building new translation tables and
switching to them? We may as well continue to use the exact same page
tables and construct iommufd objects (IOAS, etc) to match.

There is also a performance consideration here: when doing live update
every millisecond of down time matters. I'm not sure if this iommufd re-
initialisation will end up being in the hot path of things that need to
be done before the VM can start running again. If it is in the hot path
then it would be useful to avoid rebuilding identical tables. Maybe it
ends up being in the "warm" path - the VM can start running but will
sleep if taking a page fault before IOMMUFD is re-initialised...

So overall my point is: if we end up with a requirement that the
pgtables are identical before and after kexec, is there any benefit
in rebuilding them?

JG

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 05/13] iommufd: Serialise persisted iommufds and ioas
  2024-10-09 11:44             ` Gowans, James
@ 2024-10-09 12:28               ` Jason Gunthorpe
  2024-10-10 15:12                 ` Gowans, James
  0 siblings, 1 reply; 33+ messages in thread
From: Jason Gunthorpe @ 2024-10-09 12:28 UTC (permalink / raw)
  To: Gowans, James
  Cc: kvm@vger.kernel.org, rppt@kernel.org, kw@linux.com,
	iommu@lists.linux.dev, madvenka@linux.microsoft.com,
	anthony.yznaga@oracle.com, robin.murphy@arm.com,
	baolu.lu@linux.intel.com, nh-open-source@amazon.com,
	linux-kernel@vger.kernel.org, seanjc@google.com,
	Saenz Julienne, Nicolas, pbonzini@redhat.com,
	kevin.tian@intel.com, dwmw2@infradead.org,
	steven.sistare@oracle.com, Graf (AWS), Alexander, will@kernel.org,
	joro@8bytes.org, maz@kernel.org

On Wed, Oct 09, 2024 at 11:44:30AM +0000, Gowans, James wrote:

> Okay, but in general this still means that the page tables must have
> exactly the same translations if we try to switch from one set to
> another. If it is possible to change translations then translation table
> entries could be created at different granularity (PTE, PMD, PUD) levels
> which would violate this requirement. 

Yes, but we strive to make page tables consistently and it isn't that
often that we get new features that would change the layout (contig bit
for instance). I'd suggest in these cases you'd add some creation flag
to the HWPT that can inhibit the new feature and your VMM will deal
with it.

Or you sweep it and manually split/join to deal with BBM < level
2. Generic pt will have code to do all of this so it is not that bad.

If this little issue already scares you then I don't think I want to
see you serialize anything more complex, there are endless scenarios
for compatibility problems :\

> It's also possible for different IOMMU driver versions to set up the
> same translations, but at different page table levels. Perhaps an older
> version did not coalesce some PTEs, but a newer version does coalesce.
> Would the same translations but at a different size violate BBM?

Yes, that is the only thing that violates BBM.

> If we say that to be safe/correct in the general case then it is
> necessary for the translations to be *exactly* the same before and after
> kexec, is there any benefit to building new translation tables and
> switching to them? We may as well continue to use the exact same page
> tables and construct iommufd objects (IOAS, etc) to match.

The benefit is principally that you did all the machinery to get up to
that point, including re-pinning and so forth all the memory, instead
of trying to magically recover that additional state.

This is the philosophy that you replay instead of de-serialize, so you
have to replay into a page table at some level to make that work.

> There is also a performance consideration here: when doing live update
> every millisecond of down time matters. I'm not sure if this iommufd re-
> initialisation will end up being in the hot path of things that need to
> be done before the VM can start running again. 

As we talked about in the session, your KVM can start running
immediately, you don't need iommufd to be fully setup.

You only need iommufd fully working again if you intend to do certain
operations, like memory hotplug or something that requires an address
map change. So you can operate in a degraded state that is largely
invisible to the guest while recovering this stuff. It shouldn't be on
your critical path.

> then it would be useful to avoid rebuilding identical tables. Maybe it
> ends up being in the "warm" path - the VM can start running but will
> sleep if taking a page fault before IOMMUFD is re-initialised...

I didn't think you'd support page faults? There are bigger issues here
if you expect to have a vIOMMU in the guest.

Jason

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 05/13] iommufd: Serialise persisted iommufds and ioas
  2024-10-09 12:28               ` Jason Gunthorpe
@ 2024-10-10 15:12                 ` Gowans, James
  2024-10-10 15:32                   ` Jason Gunthorpe
  0 siblings, 1 reply; 33+ messages in thread
From: Gowans, James @ 2024-10-10 15:12 UTC (permalink / raw)
  To: jgg@ziepe.ca
  Cc: kvm@vger.kernel.org, rppt@kernel.org, kw@linux.com,
	iommu@lists.linux.dev, madvenka@linux.microsoft.com,
	anthony.yznaga@oracle.com, robin.murphy@arm.com,
	baolu.lu@linux.intel.com, nh-open-source@amazon.com,
	linux-kernel@vger.kernel.org, seanjc@google.com,
	Saenz Julienne, Nicolas, pbonzini@redhat.com,
	kevin.tian@intel.com, dwmw2@infradead.org,
	steven.sistare@oracle.com, Graf (AWS), Alexander, will@kernel.org,
	joro@8bytes.org, maz@kernel.org

On Wed, 2024-10-09 at 09:28 -0300, Jason Gunthorpe wrote:
> On Wed, Oct 09, 2024 at 11:44:30AM +0000, Gowans, James wrote:
> 
> > Okay, but in general this still means that the page tables must have
> > exactly the same translations if we try to switch from one set to
> > another. If it is possible to change translations then translation table
> > entries could be created at different granularity (PTE, PMD, PUD) levels
> > which would violate this requirement.
> 
> Yes, but we strive to make page tables consistently and it isn't that
> often that we get new features that would change the layout (contig bit
> for instance). I'd suggest in these cases you'd add some creation flag
> to the HWPT that can inhibit the new feature and your VMM will deal
> with it.
> 
> Or you sweep it and manually split/join to deal with BBM < level
> 2. Generic pt will have code to do all of this so it is not that bad.
> 
> If this little issue already scares you then I don't think I want to
> see you serialize anything more complex, there are endless scenarios
> for compatibility problems :\

The thing that scares me is some subtle page table difference which
causes silent data corruption... This is one of the reasons I liked re-
using the existing tables, there is no way for this sort of subtle bug
to happen.
In general I agree with your point about reducing the complexity of what
we serialise and rather exercising the machinery again after kexec to
create fresh objects. The only (?) counter point to that is about
performance of anything in the hot path (discussed more below).


> 
> > If we say that to be safe/correct in the general case then it is
> > necessary for the translations to be *exactly* the same before and after
> > kexec, is there any benefit to building new translation tables and
> > switching to them? We may as well continue to use the exact same page
> > tables and construct iommufd objects (IOAS, etc) to match.
> 
> The benefit is principally that you did all the machinery to get up to
> that point, including re-pinning and so forth all the memory, instead
> of trying to magically recover that additional state.
> 
> This is the philosophy that you replay instead of de-serialize, so you
> have to replay into a page table at some level to make that work.

We could have some "skip_pgtable_update" flag which the replay machinery
sets, allowing IOMMUFD to create fresh objects internally and leave the
page tables alone?

Anyway, that sort of thing is a potential future optimisation we could
do. In the interests of making progress I'll re-work this RFC to re-
create everything (iommufd objects, IOMMU driver domains and page
tables) and do the atomic handover of page tables after re-
initialisation.

> 
> 
> > then it would be useful to avoid rebuilding identical tables. Maybe it
> > ends up being in the "warm" path - the VM can start running but will
> > sleep if taking a page fault before IOMMUFD is re-initialised...
> 
> I didn't think you'd support page faults? There are bigger issues here
> if you expect to have a vIOMMU in the guest.

vIOMMU is one case, but another is memory oversubscription. With PRI/ATS
we can oversubscribe memory which is DMA mapped. In that case a page
fault would be a blocking operation until IOMMUFD is all set up and
ready to go. I suspect there will be benefit in getting this fast, but
as long as we have a path to optimise it in future I'm totally fine to
start with re-creating everything.

JG

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 05/13] iommufd: Serialise persisted iommufds and ioas
  2024-10-10 15:12                 ` Gowans, James
@ 2024-10-10 15:32                   ` Jason Gunthorpe
  0 siblings, 0 replies; 33+ messages in thread
From: Jason Gunthorpe @ 2024-10-10 15:32 UTC (permalink / raw)
  To: Gowans, James
  Cc: kvm@vger.kernel.org, rppt@kernel.org, kw@linux.com,
	iommu@lists.linux.dev, madvenka@linux.microsoft.com,
	anthony.yznaga@oracle.com, robin.murphy@arm.com,
	baolu.lu@linux.intel.com, nh-open-source@amazon.com,
	linux-kernel@vger.kernel.org, seanjc@google.com,
	Saenz Julienne, Nicolas, pbonzini@redhat.com,
	kevin.tian@intel.com, dwmw2@infradead.org,
	steven.sistare@oracle.com, Graf (AWS), Alexander, will@kernel.org,
	joro@8bytes.org, maz@kernel.org

On Thu, Oct 10, 2024 at 03:12:09PM +0000, Gowans, James wrote:
> > If this little issue already scares you then I don't think I want to
> > see you serialize anything more complex, there are endless scenarios
> > for compatibility problems :\
> 
> The thing that scares me is some subtle page table difference which
> causes silent data corruption... This is one of the reasons I liked re-
> using the existing tables, there is no way for this sort of subtle bug
> to happen.

> > > If we say that to be safe/correct in the general case then it is
> > > necessary for the translations to be *exactly* the same before and after
> > > kexec, is there any benefit to building new translation tables and
> > > switching to them? We may as well continue to use the exact same page
> > > tables and construct iommufd objects (IOAS, etc) to match.
> > 
> > The benefit is principally that you did all the machinery to get up to
> > that point, including re-pinning and so forth all the memory, instead
> > of trying to magically recover that additional state.
> > 
> > This is the philosophy that you replay instead of de-serialize, so you
> > have to replay into a page table at some level to make that work.
> 
> We could have some "skip_pgtable_update" flag which the replay machinery
> sets, allowing IOMMUFD to create fresh objects internally and leave the
> page tables alone?

The point made before was that iommufd hard depends on the content of
the iommu_domain for correctness since it uses it as the storage for
the PFNs.

Making an assumption that the prior kernel's domain matches what iommufd
requires opens up the easy possibility of hypervisor kernel
corruption.

I think this is a bad direction..

You have to at least validate that userspace has set things up in a
way that is consistent with the prior domain before adopting it.

It would be easier to understand this if the performance costs of
doing such a validation was more understood. Perhaps it can be
optimized somehow.

> > > then it would be useful to avoid rebuilding identical tables. Maybe it
> > > ends up being in the "warm" path - the VM can start running but will
> > > sleep if taking a page fault before IOMMUFD is re-initialised...
> > 
> > I didn't think you'd support page faults? There are bigger issues here
> > if you expect to have a vIOMMU in the guest.
> 
> vIOMMU is one case, but another is memory oversubscription. With PRI/ATS
> we can oversubscribe memory which is DMA mapped. In that case a page
> fault would be a blocking operation until IOMMUFD is all set up and
> ready to go. I suspect there will be benefit in getting this fast, but
> as long as we have a path to optimise it in future I'm totally fine to
> start with re-creating everything.

Yes, this is true, but if you intend to do this kind of manipulation
of the page table then it really should be in the exact format the new
kernel is tested to understand. Expecting the new kernel to interwork
with the old kernel's page table is likely to be OK, but along the
same lines as your fear there could be differences :\

Still, PRI/ATS for backing guests storage is a pretty advanced
concept, we don't have support for that yet.

Jason

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 05/13] iommufd: Serialise persisted iommufds and ioas
  2024-09-16 11:30 ` [RFC PATCH 05/13] iommufd: Serialise persisted iommufds and ioas James Gowans
  2024-10-02 18:55   ` Jason Gunthorpe
@ 2024-10-16 22:20   ` Jacob Pan
  2024-10-28 16:03     ` Jacob Pan
  1 sibling, 1 reply; 33+ messages in thread
From: Jacob Pan @ 2024-10-16 22:20 UTC (permalink / raw)
  To: James Gowans
  Cc: linux-kernel, Jason Gunthorpe, Kevin Tian, Joerg Roedel,
	Krzysztof Wilczyński, Will Deacon, Robin Murphy,
	Mike Rapoport, Madhavan T. Venkataraman, iommu,
	Sean Christopherson, Paolo Bonzini, kvm, David Woodhouse,
	Lu Baolu, Alexander Graf, anthony.yznaga, steven.sistare,
	nh-open-source, Saenz Julienne, Nicolas, jacob.pan

Hi James,

On Mon, 16 Sep 2024 13:30:54 +0200
James Gowans <jgowans@amazon.com> wrote:

> +static int serialise_iommufd(void *fdt, struct iommufd_ctx *ictx)
> +{
> +	int err = 0;
> +	char name[24];
> +	struct iommufd_object *obj;
> +	unsigned long obj_idx;
> +
> +	snprintf(name, sizeof(name), "%lu", ictx->persistent_id);
> +	err |= fdt_begin_node(fdt, name);
> +	err |= fdt_begin_node(fdt, "ioases");
> +	xa_for_each(&ictx->objects, obj_idx, obj) {
> +		struct iommufd_ioas *ioas;
> +		struct iopt_area *area;
> +		int area_idx = 0;
> +
> +		if (obj->type != IOMMUFD_OBJ_IOAS)
> +			continue;
I was wondering how device state persistence is managed here. Is it
correct to assume that all devices bound to an iommufd context should
be persistent? If so, should we be serializing IOMMUFD_OBJ_DEVICE as
well?

I'm considering this from the perspective of user mode drivers,
including those that use noiommu mode (need to be added to iommufd
cdev). In this scenario, we only need to maintain the device states
persistently without IOAS.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 05/13] iommufd: Serialise persisted iommufds and ioas
  2024-10-16 22:20   ` Jacob Pan
@ 2024-10-28 16:03     ` Jacob Pan
  2024-11-02 10:22       ` Gowans, James
  0 siblings, 1 reply; 33+ messages in thread
From: Jacob Pan @ 2024-10-28 16:03 UTC (permalink / raw)
  To: James Gowans
  Cc: jacob.pan, Saurabh Sengar, linux-kernel, Jason Gunthorpe,
	Kevin Tian, Joerg Roedel, Krzysztof Wilczyński, Will Deacon,
	Robin Murphy, Mike Rapoport, Madhavan T. Venkataraman, iommu,
	Sean Christopherson, Paolo Bonzini, kvm, David Woodhouse,
	Lu Baolu, Alexander Graf, anthony.yznaga, steven.sistare,
	nh-open-source, Saenz Julienne, Nicolas

Hi James,

Just a gentle reminder. Let me also explain the problem we are trying
to solve for the live update of OpenHCL paravisor[1]. OpenHCL has user
space drivers based on VFIO noiommu mode, we are in the process of
converting to iommufd cdev.

Similarly, running DMA continuously across updates is required, but
unlike your case, OpenHCL updates do not involve preserving the IO page
tables, in that they are managed by the hypervisor, which is not part of
the update.

It seems reasonable to share the device persistence code path
with the plan laid out in your cover letter. The IOAS code path will be
different since the noiommu option does not have an IOAS.

If we were to revive noiommu support for iommufd cdev[2], can we use
the persistent iommufd context to allow device persistence? Perhaps
through IOMMUFD_OBJ_DEVICE and IOMMUFD_OBJ_ACCESS(used in [2])?

@David, @Jason, @Alex, @Yi, any comments or suggestions?


Thanks,

Jacob

1. openvmm/Guide/src/reference/architecture/openhcl.md at main ·
microsoft/openvmm
2. [PATCH v11 00/23] Add vfio_device cdev for iommufd support - Yi Liu

On Wed, 16 Oct 2024 15:20:47 -0700 Jacob Pan
<jacob.pan@linux.microsoft.com> wrote:

> Hi James,
> 
> On Mon, 16 Sep 2024 13:30:54 +0200
> James Gowans <jgowans@amazon.com> wrote:
> 
> > +static int serialise_iommufd(void *fdt, struct iommufd_ctx *ictx)
> > +{
> > +	int err = 0;
> > +	char name[24];
> > +	struct iommufd_object *obj;
> > +	unsigned long obj_idx;
> > +
> > +	snprintf(name, sizeof(name), "%lu", ictx->persistent_id);
> > +	err |= fdt_begin_node(fdt, name);
> > +	err |= fdt_begin_node(fdt, "ioases");
> > +	xa_for_each(&ictx->objects, obj_idx, obj) {
> > +		struct iommufd_ioas *ioas;
> > +		struct iopt_area *area;
> > +		int area_idx = 0;
> > +
> > +		if (obj->type != IOMMUFD_OBJ_IOAS)
> > +			continue;  
> I was wondering how device state persistence is managed here. Is it
> correct to assume that all devices bound to an iommufd context should
> be persistent? If so, should we be serializing IOMMUFD_OBJ_DEVICE as
> well?
> 
> I'm considering this from the perspective of user mode drivers,
> including those that use noiommu mode (need to be added to iommufd
> cdev). In this scenario, we only need to maintain the device states
> persistently without IOAS.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 05/13] iommufd: Serialise persisted iommufds and ioas
  2024-10-28 16:03     ` Jacob Pan
@ 2024-11-02 10:22       ` Gowans, James
  2024-11-04 13:00         ` Jason Gunthorpe
  0 siblings, 1 reply; 33+ messages in thread
From: Gowans, James @ 2024-11-02 10:22 UTC (permalink / raw)
  To: jacob.pan@linux.microsoft.com, yi.l.liu@intel.com,
	jinankjain@linux.microsoft.com
  Cc: kvm@vger.kernel.org, rppt@kernel.org, kw@linux.com,
	iommu@lists.linux.dev, madvenka@linux.microsoft.com,
	anthony.yznaga@oracle.com, robin.murphy@arm.com,
	baolu.lu@linux.intel.com, nh-open-source@amazon.com,
	linux-kernel@vger.kernel.org, seanjc@google.com,
	Saenz Julienne, Nicolas, pbonzini@redhat.com,
	kevin.tian@intel.com, dwmw2@infradead.org,
	ssengar@linux.microsoft.com, joro@8bytes.org, will@kernel.org,
	Graf (AWS), Alexander, steven.sistare@oracle.com, jgg@ziepe.ca

Hi Jacob, apologies for the late reply.

On Mon, 2024-10-28 at 09:03 -0700, Jacob Pan wrote:
> Hi James,
> 
> Just a gentle reminder. Let me also explain the problem we are trying
> to solve for the live update of OpenHCL paravisor[1]. OpenHCL has user
> space drivers based on VFIO noiommu mode, we are in the process of
> converting to iommufd cdev.
> 
> Similarly, running DMA continuously across updates is required, but
> unlike your case, OpenHCL updates do not involve preserving the IO page
> tables, in that they are managed by the hypervisor, which is not part of
> the update.
> 
> It seems reasonable to share the device persistence code path
> with the plan laid out in your cover letter. IOAS code path will be
> different since noiommu option does not have IOAS.
> 
> If we were to revive noiommu support for iommufd cdev[2], can we use
> the persistent iommufd context to allow device persistence? Perhaps
> through IOMMUFD_OBJ_DEVICE and IOMMUFD_OBJ_ACCESS(used in [2])?
> 
> @David, @Jason, @Alex, @Yi, any comments or suggestions?

IIRC we did discuss this device persistence use case with some of your
colleagues at Linux Plumbers. Adding Jinank to this thread as he was
part of the conversation too.

Yes, I think the guidance was to bind a device to iommufd in noiommu
mode. It does seem a bit weird to use iommufd with noiommu, but we
agreed it's the best/simplest way to get the functionality. Then as you
suggest below the IOMMUFD_OBJ_DEVICE would be serialised too in some
way, probably by iommufd telling the PCI layer that this device must be
persistent and hence not to re-probe it on kexec. I think this would get
you what you want? Specifically you want to make sure that the device is
not reset during kexec so it can keep running? And some handle for
userspace to pick it up again and continue interacting with it after
kexec.

It's all a bit hand wavy at the moment, but something along those lines
probably makes sense. I need to work on rev2 of this RFC as per Jason's
feedback in the other thread. Rev2 will make the restore path more
userspace driven, with fresh iommufd and pgtable objects being created
and then atomically swapped over too. I'll also get the PCI layer
involved with rev2. Once that's out (it'll be a few weeks as I'm on
leave) then let's take a look at how the noiommu device persistence case
would fit in.

JG

> 
> 
> Thanks,
> 
> Jacob
> 
> 1. openvmm/Guide/src/reference/architecture/openhcl.md at main ·
> microsoft/openvmm
> 2. [PATCH v11 00/23] Add vfio_device cdev for iommufd support - Yi Liu
> 
> On Wed, 16 Oct 2024 15:20:47 -0700 Jacob Pan
> <jacob.pan@linux.microsoft.com> wrote:
> 
> > Hi James,
> > 
> > On Mon, 16 Sep 2024 13:30:54 +0200
> > James Gowans <jgowans@amazon.com> wrote:
> > 
> > > +static int serialise_iommufd(void *fdt, struct iommufd_ctx *ictx)
> > > +{
> > > +   int err = 0;
> > > +   char name[24];
> > > +   struct iommufd_object *obj;
> > > +   unsigned long obj_idx;
> > > +
> > > +   snprintf(name, sizeof(name), "%lu", ictx->persistent_id);
> > > +   err |= fdt_begin_node(fdt, name);
> > > +   err |= fdt_begin_node(fdt, "ioases");
> > > +   xa_for_each(&ictx->objects, obj_idx, obj) {
> > > +           struct iommufd_ioas *ioas;
> > > +           struct iopt_area *area;
> > > +           int area_idx = 0;
> > > +
> > > +           if (obj->type != IOMMUFD_OBJ_IOAS)
> > > +                   continue;
> > I was wondering how device state persistence is managed here. Is it
> > correct to assume that all devices bound to an iommufd context should
> > be persistent? If so, should we be serializing IOMMUFD_OBJ_DEVICE as
> > well?
> > 
> > I'm considering this from the perspective of user mode drivers,
> > including those that use noiommu mode (need to be added to iommufd
> > cdev). In this scenario, we only need to maintain the device states
> > persistently without IOAS.
> 


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 05/13] iommufd: Serialise persisted iommufds and ioas
  2024-11-02 10:22       ` Gowans, James
@ 2024-11-04 13:00         ` Jason Gunthorpe
  2024-11-06 19:18           ` Jacob Pan
  0 siblings, 1 reply; 33+ messages in thread
From: Jason Gunthorpe @ 2024-11-04 13:00 UTC (permalink / raw)
  To: Gowans, James
  Cc: jacob.pan@linux.microsoft.com, yi.l.liu@intel.com,
	jinankjain@linux.microsoft.com, kvm@vger.kernel.org,
	rppt@kernel.org, kw@linux.com, iommu@lists.linux.dev,
	madvenka@linux.microsoft.com, anthony.yznaga@oracle.com,
	robin.murphy@arm.com, baolu.lu@linux.intel.com,
	nh-open-source@amazon.com, linux-kernel@vger.kernel.org,
	seanjc@google.com, Saenz Julienne, Nicolas, pbonzini@redhat.com,
	kevin.tian@intel.com, dwmw2@infradead.org,
	ssengar@linux.microsoft.com, joro@8bytes.org, will@kernel.org,
	Graf (AWS), Alexander, steven.sistare@oracle.com

On Sat, Nov 02, 2024 at 10:22:54AM +0000, Gowans, James wrote:

> Yes, I think the guidance was to bind a device to iommufd in noiommu
> mode. It does seem a bit weird to use iommufd with noiommu, but we
> agreed it's the best/simplest way to get the functionality. 

noiommu should still have an ioas and still have kernel managed page
pinning.

My remark to bring it to iommufd was to also make it a fully
architected feature and stop relying on mprotect and /proc/ tricks.

> Then as you suggest below the IOMMUFD_OBJ_DEVICE would be serialised
> too in some way, probably by iommufd telling the PCI layer that this
> device must be persistent and hence not to re-probe it on kexec.

Presumably VFIO would be doing some/most of this part since it is the
driver that will be binding?

> It's all a bit hand wavy at the moment, but something along those lines
> probably makes sense. I need to work on rev2 of this RFC as per Jason's
> feedback in the other thread. Rev2 will make the restore path more
> userspace driven, with fresh iommufd and pgtables objects being created
> and then atomically swapped over too. I'll also get the PCI layer
> involved with rev2. Once that's out (it'll be a few weeks as I'm on
> leave) then let's take a look at how the noiommu device persistence case
> would fit in.

In a certain sense it would be nice to see the noiommu flow as it
breaks apart the problem into the first dependency:

 How to get the device handed across the kexec and safely land back in
 VFIO, and only VFIO's hands.

Preserving the iommu HW configuration is an incremental step built on
that baseline.

Also, FWIW, this needs to follow good open source practices - we need
an open userspace for the feature and the kernel stuff should be
merged in a logical order.

Jason

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 05/13] iommufd: Serialise persisted iommufds and ioas
  2024-11-04 13:00         ` Jason Gunthorpe
@ 2024-11-06 19:18           ` Jacob Pan
  0 siblings, 0 replies; 33+ messages in thread
From: Jacob Pan @ 2024-11-06 19:18 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Gowans, James, yi.l.liu@intel.com, jinankjain@linux.microsoft.com,
	kvm@vger.kernel.org, rppt@kernel.org, kw@linux.com,
	iommu@lists.linux.dev, madvenka@linux.microsoft.com,
	anthony.yznaga@oracle.com, robin.murphy@arm.com,
	baolu.lu@linux.intel.com, nh-open-source@amazon.com,
	linux-kernel@vger.kernel.org, seanjc@google.com,
	Saenz Julienne, Nicolas, pbonzini@redhat.com,
	kevin.tian@intel.com, dwmw2@infradead.org,
	ssengar@linux.microsoft.com, joro@8bytes.org, will@kernel.org,
	Graf (AWS), Alexander, steven.sistare@oracle.com, jacob.pan,
	zhangyu1@microsoft.com

Hi Jason,

On Mon, 4 Nov 2024 09:00:11 -0400
Jason Gunthorpe <jgg@ziepe.ca> wrote:

> On Sat, Nov 02, 2024 at 10:22:54AM +0000, Gowans, James wrote:
> 
> > Yes, I think the guidance was to bind a device to iommufd in noiommu
> > mode. It does seem a bit weird to use iommufd with noiommu, but we
> > agreed it's the best/simplest way to get the functionality.   
> 
> noiommu should still have an ioas and still have kernel managed page
> pinning.
> 
> My remark to bring it to iommufd was to also make it a fully
> architected feature and stop relying on mprotect and /proc/ tricks.
> 
Just to clarify my tentative understanding with more details (please
correct):

1. Create an iommufd access object for a noiommu device when
binding it to an iommufd ctx.

2. All user memory used by the device in noiommu mode should be
pinned by iommufd, i.e. via iommufd_access_pin_pages().
I guess you meant to stop doing the mlock rather than the mprotect
trick? I think openHCL is using the /dev/mem trick.

3. An ioas can be attached to the noiommu iommufd_access object,
similar to an emulated device (mdev).

What kind/source of memory should be supported here?
e.g. device memory regions exposed by PCI BARs.


> > Then as you suggest below the IOMMUFD_OBJ_DEVICE would be serialised
> > too in some way, probably by iommufd telling the PCI layer that this
> > device must be persistent and hence not to re-probe it on kexec.  
> 
> Presumably VFIO would be doing some/most of this part since it is the
> driver that will be binding?
> 
Yes, it is the user mode driver that initiates the binding. I was
thinking that, since the granularity for persistence is per iommufd ctx,
the VFIO device flag marking a device keep_alive could come from the
iommufd ctx.

> > It's all a bit hand wavy at the moment, but something along those
> > lines probably makes sense. I need to work on rev2 of this RFC as
> > per Jason's feedback in the other thread. Rev2 will make the
> > restore path more userspace driven, with fresh iommufd and pgtables
> > objects being created and then atomically swapped over too. I'll
> > also get the PCI layer involved with rev2. Once that's out (it'll
> > be a few weeks as I'm on leave) then let's take a look at how the
> > noiommu device persistence case would fit in.  
> 
> In a certain sense it would be nice to see the noiommu flow as it
> breaks apart the problem into the first dependency:
> 
>  How to get the device handed across the kexec and safely land back in
>  VFIO, and only VFIO's hands.
> 
> Preserving the iommu HW configuration is an incremental step built on
> that baseline.
Makes sense; I need to catch up on the KHO series and hook up noiommu
as the first step.

> Also, FWIW, this needs to follow good open source practices - we need
> an open userspace for the feature and the kernel stuff should be
> merged in a logical order.
> 
Yes, we will have matching userspace in openHCL
https://github.com/microsoft/openvmm

Thanks,

Jacob

^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2024-11-06 19:18 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-09-16 11:30 [RFC PATCH 00/13] Support iommu(fd) persistence for live update James Gowans
2024-09-16 11:30 ` [RFC PATCH 01/13] iommufd: Support marking and tracking persistent iommufds James Gowans
2024-09-16 11:30 ` [RFC PATCH 02/13] iommufd: Add plumbing for KHO (de)serialise James Gowans
2024-09-16 11:30 ` [RFC PATCH 03/13] iommu/intel: zap context table entries on kexec James Gowans
2024-10-03 13:27   ` Jason Gunthorpe
2024-09-16 11:30 ` [RFC PATCH 04/13] iommu: Support marking domains as persistent on alloc James Gowans
2024-09-16 11:30 ` [RFC PATCH 05/13] iommufd: Serialise persisted iommufds and ioas James Gowans
2024-10-02 18:55   ` Jason Gunthorpe
2024-10-07  8:39     ` Gowans, James
2024-10-07  8:47       ` David Woodhouse
2024-10-07  8:57         ` Gowans, James
2024-10-07 15:01           ` Jason Gunthorpe
2024-10-09 11:44             ` Gowans, James
2024-10-09 12:28               ` Jason Gunthorpe
2024-10-10 15:12                 ` Gowans, James
2024-10-10 15:32                   ` Jason Gunthorpe
2024-10-07 15:11         ` Jason Gunthorpe
2024-10-07 15:16       ` Jason Gunthorpe
2024-10-16 22:20   ` Jacob Pan
2024-10-28 16:03     ` Jacob Pan
2024-11-02 10:22       ` Gowans, James
2024-11-04 13:00         ` Jason Gunthorpe
2024-11-06 19:18           ` Jacob Pan
2024-09-16 11:30 ` [RFC PATCH 06/13] iommufd: Expose persistent iommufd IDs in sysfs James Gowans
2024-09-16 11:30 ` [RFC PATCH 07/13] iommufd: Re-hydrate a usable iommufd ctx from sysfs James Gowans
2024-09-16 11:30 ` [RFC PATCH 08/13] intel-iommu: Add serialise and deserialise boilerplate James Gowans
2024-09-16 11:30 ` [RFC PATCH 09/13] intel-iommu: Serialise dmar_domain on KHO activate James Gowans
2024-09-16 11:30 ` [RFC PATCH 10/13] intel-iommu: Re-hydrate persistent domains after kexec James Gowans
2024-09-16 11:31 ` [RFC PATCH 11/13] iommu: Add callback to restore persisted iommu_domain James Gowans
2024-10-03 13:33   ` Jason Gunthorpe
2024-09-16 11:31 ` [RFC PATCH 12/13] iommufd, guestmemfs: Ensure persistent file used for persistent DMA James Gowans
2024-10-03 13:36   ` Jason Gunthorpe
2024-09-16 11:31 ` [RFC PATCH 13/13] iommufd, guestmemfs: Pin files when mapped " James Gowans

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox