public inbox for linux-kernel@vger.kernel.org
* [PATCH 00/14] iommu: Add live update state preservation
@ 2026-02-03 22:09 Samiullah Khawaja
  2026-02-03 22:09 ` [PATCH 01/14] iommu: Implement IOMMU LU FLB callbacks Samiullah Khawaja
                   ` (13 more replies)
  0 siblings, 14 replies; 98+ messages in thread
From: Samiullah Khawaja @ 2026-02-03 22:09 UTC (permalink / raw)
  To: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe
  Cc: Samiullah Khawaja, Robin Murphy, Kevin Tian, Alex Williamson,
	Shuah Khan, iommu, linux-kernel, kvm, Saeed Mahameed,
	Adithya Jayachandran, Parav Pandit, Leon Romanovsky, William Tu,
	Pratyush Yadav, Pasha Tatashin, David Matlack, Andrew Morton,
	Chris Li, Pranjal Shrivastava, Vipin Sharma, YiFei Zhu

Hi,

This patch series introduces a mechanism for preserving IOMMU state
across a live update, including an implementation for the Intel VT-d
driver.

This is a non-RFC version of the previously sent RFC:
https://lore.kernel.org/all/20251202230303.1017519-1-skhawaja@google.com/

Please take a look at the following LWN article to learn about KHO and
Live Update Orchestrator:

https://lwn.net/Articles/1033364/

This work is based on:

- linux-next (tag: next-20260115)
- MEMFD SEAL preservation series:
  https://lore.kernel.org/all/20260123095854.535058-1-pratyush@kernel.org/
- VFIO CDEV preservation series (v2):
  https://lore.kernel.org/all/20260129212510.967611-1-dmatlack@google.com/

The kernel tree with all dependencies is uploaded to the following
Github location:

https://github.com/samikhawaja/linux/tree/iommu/phase1-v1

Overall Goals:

The goal of this effort is to preserve the IOMMU domains that are
managed by iommufd and attached to devices preserved through VFIO cdev.
This allows the DMA mappings and IOMMU context of a device assigned to
a VM to be maintained across a kexec live update.

This is achieved by preserving the IOMMU page tables (using the Generic
Page Table support), the IOMMU root table and the relevant context
entries across the live update.

The functionality in the previously sent RFC is split into two phases,
and this series implements Phase 1, which provides the following
functionality:

  - Foundational work in the IOMMU core and the VT-d driver to preserve
    and restore IOMMU translation units, IOMMU domains and devices
    across a live update kexec.
  - The preservation is triggered by preserving the VFIO cdev FD and
    the bound iommufd FD into a live update session.
  - An HWPT (and its backing IOMMU domain) is preserved only if it
    contains file-type DMA mappings exclusively. In addition, the memfd
    backing such a mapping must be F_SEAL_SEAL'd at mapping time.
  - During the live update boot, the state of the preserved Intel VT-d
    units, IOMMU domains and devices is restored.
  - The restored IOMMU domains are reattached to the preserved devices
    during early boot.
  - The DMA ownership of the restored devices is also claimed during
    the live update boot, so any attempt to bind them to a non-vfio
    driver, or to bind a new iommufd to them, will fail.

Architectural Overview:

The target architecture for IOMMU state preservation across a live
update involves coordination between the Live Update Orchestrator,
iommufd, and the IOMMU drivers.

The core design uses the Live Update Orchestrator's file descriptor
preservation mechanism to preserve iommufd file descriptors. The user
marks iommufd HWPTs for preservation using a new ioctl added in this
series. Preservation of the iommufd inside an LUO session is then
triggered via the LUO ioctls. During preservation, the LUO preserve
callback for the iommufd walks the HWPTs it manages to identify those
that need preserving, and a new IOMMU core API preserves each
corresponding IOMMU domain. The IOMMU core uses the Generic Page Table
support to preserve the page tables of these domains. The domains are
then marked as preserved.
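As a rough sketch of that walk, the loop below mimics the shape of the
iommufd preserve callback. The types and the `preserve` op here are
simplified, hypothetical stand-ins, not the real API; the actual gating
on driver support is done by iommu_domain_preserve() in the series.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Simplified stand-ins for the iommufd/IOMMU core types in the series. */
struct fake_domain_ops {
	int (*preserve)(void);		/* optional driver hook */
};

struct fake_hwpt {
	const struct fake_domain_ops *ops;
	int marked;			/* set by the MARK_HWPT ioctl */
	int preserved;
};

/*
 * Walk the HWPTs and preserve only those the user marked. A domain
 * whose driver lacks a preserve op fails with -EOPNOTSUPP, matching
 * the check at the top of iommu_domain_preserve().
 */
int preserve_marked_hwpts(struct fake_hwpt *hwpts, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (!hwpts[i].marked)
			continue;
		if (!hwpts[i].ops->preserve)
			return -EOPNOTSUPP;
		int ret = hwpts[i].ops->preserve();
		if (ret)
			return ret;
		hwpts[i].preserved = 1;
	}
	return 0;
}

/* Tiny self-check hook: a driver whose preserve always succeeds. */
static int noop_preserve(void) { return 0; }
const struct fake_domain_ops demo_ops = { .preserve = noop_preserve };
```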

When the user triggers preservation of a VFIO cdev that is attached to
a preserved iommufd, the device attachment state of that VFIO cdev is
also preserved, using an API exported by iommufd. iommufd gathers all
the information that needs to be preserved and calls the IOMMU core API
to preserve the device state. The IOMMU core also preserves the state
of the IOMMU unit associated with the device.

The IOMMU core has an LUO FLB registered with the iommufd LUO file
handler, so the preserved IOMMU domain and IOMMU hardware unit state
are available during boot for early restore in the next kernel.

During boot, the driver fetches the preserved state from the IOMMU core
and restores the state of the preserved IOMMUs. Later, when the IOMMU
core walks the devices and probes them, the IOMMU domains of the
preserved devices are restored and the preserved devices are attached
to them. During attachment, the DMA ownership of these devices is also
claimed.
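For illustration, the matching that the restore path relies on to pair
a probed device with its preserved record (cf. device_ser_match() later
in the series) reduces to comparing the PCI segment and bus/devfn. A
userspace sketch with simplified, hypothetical types:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Stand-in for the preserved per-device record; the real struct
 * device_ser in the series carries more state than this.
 */
struct dev_rec {
	uint32_t devid;		/* bus/devfn, as from pci_dev_id() */
	uint32_t pci_domain;	/* PCI segment, as from pci_domain_nr() */
};

/*
 * A preserved record matches a probed device only when both the PCI
 * segment and the bus/devfn agree, mirroring device_ser_match().
 */
int dev_rec_match(const struct dev_rec *rec,
		  uint32_t devid, uint32_t pci_domain)
{
	return rec->devid == devid && rec->pci_domain == pci_domain;
}

/* PCI encodes bus and devfn into one 16-bit id: (bus << 8) | devfn. */
uint32_t make_devid(uint32_t bus, uint32_t devfn)
{
	return (bus << 8) | devfn;
}
```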

Tested:

The new iommufd_liveupdate selftest was used to verify the preservation
logic. It was tested using QEMU with virtual IOMMU (VT-d) support, with
a virtio PCIe device bound to the vfio-pci driver.

It was also tested on an Intel machine with a DSA device bound to the
vfio-pci driver.

The following steps were used for verification:

- Bind the test device to the vfio-pci driver.
- Run the test on the machine:

  ./iommufd_liveupdate <vfio-cdev-path>

- Trigger a kexec.
- After reboot, try binding the device to a non-vfio PCI driver:

  echo <device bdf> > /sys/bus/pci/drivers/pci-pf-stub/bind

- This should fail with "Device or resource busy".
- Bind the device to the vfio-pci driver and run the test again.
- The test verifies that the device cannot be bound to a new iommufd
  and that the session cannot be finished.

Future Work:

- Phase 2 with IOMMUFD restore to reclaim the preserved vfio cdev and
  restore the preserved HWPTs.
- Full support for PASID preservation.
- Nested IOMMU preservation.
- Extend support to other IOMMU architectures (e.g., AMD-Vi, Arm SMMUv3).

High-Level Sequence Flow:

The following diagrams illustrate the high-level interactions during the
preservation phase. Note that function names in the diagram are kept
abbreviated to save horizontal space.

Prepare:

Before the live update, the PREPARE event of the Live Update
Orchestrator invokes the callbacks of the registered file and subsystem
handlers.

 Userspace (VMM) | LUO Core |    iommufd    |  IOMMU Core   | IOMMU Driver
-----------------|----------|---------------|---------------|-------------
                 |          |               |               |
MARK_HWPT        |          |               |               |
--------------------------->                |               |
                 |          | Mark HWPT for |               |
                 |          | preservation  |               |
                 |          |               |               |
PRESERVE         |          |               |               |
 iommufd_fd      |          |               |               |
----------------->          |               |               |
                 | preserve |               |               |
                 |---------->               |               |
                 |          | For each HWPT |               |
                 |          |-------------->                |
                 |          |               | domain_presrv |
                 |          |               |-------------->
                 |          |               |               | gpt(preserve)
                 |          |               |<--------------|
                 |          |<--------------|               |
                 |<---------|               |               |
                 |          |               |               |
...              |          |               |               |
                 |          |               |               |
PRESERVE,        |          |               |               |
 vfio_cdev_fd    |          |               |               |
----------------->          |               |               |
                 | preserve |               |               |
                 |---------->               |               |
                 |          |               |               |
                 |          | iommu_preserv |               |
                 |          | _device()     |               |
                 |          |-------------->                |
                 |          |               | preserve      |
                 |          |               | (iommu_hw)    |
                 |          |               |-------------->
                 |          |               |               | preserve(root)
                 |          |               |               | preserve(pasid)
                 |          |               |<--------------|
                 |          |               |               |
                 |          |               | preserve      |
                 |          |               | _device(dev)  |
                 |          |               |-------------->
                 |          |               |               |
                 |          |               |<--------------|
                 |          |<--------------|               |
                 |<---------|               |               |

Restore:

After a live update, the preserved state is restored during boot.

 Userspace (VMM) | LUO Core |    iommufd    |  IOMMU Core   | IOMMU Driver
-----------------|----------|---------------|---------------|-------------
                 |          |               |               |
                 |          |               |               | Restore
                 |          |               |               | Root, DIDs
                 |          |               |               |
                 |          |               |               | Register
                 |          |               | probe devices |
                 |          |               |               |
                 |          |               | restore       |
                 |          |               | domain        |
                 |          |               |-------------->
                 |          |               |               | restore
                 |          |               | reattach      |
                 |          |               | domain        |
                 |          |               |-------------->
                 |          |               |               |


Looking forward to your feedback on this.

Pasha Tatashin (1):
  liveupdate: luo_file: Add internal APIs for file preservation

Samiullah Khawaja (11):
  iommu: Implement IOMMU LU FLB callbacks
  iommu: Implement IOMMU core liveupdate skeleton
  iommu/pages: Add APIs to preserve/unpreserve/restore iommu pages
  iommupt: Implement preserve/unpreserve/restore callbacks
  iommu/vt-d: Implement device and iommu preserve/unpreserve ops
  iommu/vt-d: Restore IOMMU state and reclaimed domain ids
  iommu: Restore and reattach preserved domains to devices
  iommu/vt-d: preserve PASID table of preserved device
  iommufd: Add APIs to preserve/unpreserve a vfio cdev
  vfio/pci: Preserve the iommufd state of the vfio cdev
  iommufd/selftest: Add test to verify iommufd preservation

YiFei Zhu (2):
  iommufd-lu: Implement ioctl to let userspace mark an HWPT to be
    preserved
  iommufd-lu: Persist iommu hardware pagetables for live update

 drivers/iommu/Kconfig                         |  11 +
 drivers/iommu/Makefile                        |   1 +
 drivers/iommu/generic_pt/iommu_pt.h           |  96 ++++
 drivers/iommu/intel/Makefile                  |   1 +
 drivers/iommu/intel/iommu.c                   | 115 +++-
 drivers/iommu/intel/iommu.h                   |  42 +-
 drivers/iommu/intel/liveupdate.c              | 304 ++++++++++
 drivers/iommu/intel/nested.c                  |   2 +-
 drivers/iommu/intel/pasid.c                   |   7 +-
 drivers/iommu/intel/pasid.h                   |   9 +
 drivers/iommu/iommu-pages.c                   |  74 +++
 drivers/iommu/iommu-pages.h                   |  30 +
 drivers/iommu/iommu.c                         |  50 +-
 drivers/iommu/iommufd/Makefile                |   1 +
 drivers/iommu/iommufd/device.c                |  69 +++
 drivers/iommu/iommufd/io_pagetable.c          |  17 +
 drivers/iommu/iommufd/io_pagetable.h          |   1 +
 drivers/iommu/iommufd/iommufd_private.h       |  38 ++
 drivers/iommu/iommufd/liveupdate.c            | 349 ++++++++++++
 drivers/iommu/iommufd/main.c                  |  16 +-
 drivers/iommu/iommufd/pages.c                 |   8 +
 drivers/iommu/liveupdate.c                    | 534 ++++++++++++++++++
 drivers/vfio/pci/vfio_pci_liveupdate.c        |  28 +-
 include/linux/generic_pt/iommu.h              |  10 +
 include/linux/iommu-lu.h                      | 144 +++++
 include/linux/iommu.h                         |  32 ++
 include/linux/iommufd.h                       |  23 +
 include/linux/kho/abi/iommu.h                 | 127 +++++
 include/linux/kho/abi/iommufd.h               |  39 ++
 include/linux/kho/abi/vfio_pci.h              |  10 +
 include/linux/liveupdate.h                    |  21 +
 include/uapi/linux/iommufd.h                  |  19 +
 kernel/liveupdate/luo_file.c                  |  71 +++
 kernel/liveupdate/luo_internal.h              |  16 +
 tools/testing/selftests/iommu/Makefile        |  12 +
 .../selftests/iommu/iommufd_liveupdate.c      | 209 +++++++
 36 files changed, 2502 insertions(+), 34 deletions(-)
 create mode 100644 drivers/iommu/intel/liveupdate.c
 create mode 100644 drivers/iommu/iommufd/liveupdate.c
 create mode 100644 drivers/iommu/liveupdate.c
 create mode 100644 include/linux/iommu-lu.h
 create mode 100644 include/linux/kho/abi/iommu.h
 create mode 100644 include/linux/kho/abi/iommufd.h
 create mode 100644 tools/testing/selftests/iommu/iommufd_liveupdate.c


base-commit: 9b7977f9e39b7768c70c2aa497f04e7569fd3e00
-- 
2.53.0.rc2.204.g2597b5adb4-goog



* [PATCH 01/14] iommu: Implement IOMMU LU FLB callbacks
  2026-02-03 22:09 [PATCH 00/14] iommu: Add live update state preservation Samiullah Khawaja
@ 2026-02-03 22:09 ` Samiullah Khawaja
  2026-03-11 21:07   ` Pranjal Shrivastava
  2026-03-16 22:54   ` Vipin Sharma
  2026-02-03 22:09 ` [PATCH 02/14] iommu: Implement IOMMU core liveupdate skeleton Samiullah Khawaja
                   ` (12 subsequent siblings)
  13 siblings, 2 replies; 98+ messages in thread
From: Samiullah Khawaja @ 2026-02-03 22:09 UTC (permalink / raw)
  To: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe
  Cc: Samiullah Khawaja, Robin Murphy, Kevin Tian, Alex Williamson,
	Shuah Khan, iommu, linux-kernel, kvm, Saeed Mahameed,
	Adithya Jayachandran, Parav Pandit, Leon Romanovsky, William Tu,
	Pratyush Yadav, Pasha Tatashin, David Matlack, Andrew Morton,
	Chris Li, Pranjal Shrivastava, Vipin Sharma, YiFei Zhu

Add a liveupdate FLB for IOMMU state preservation. Use the KHO preserve
memory alloc/free helpers to allocate memory for the IOMMU LU FLB
object and the serialization structs for devices, domains and IOMMUs.

During retrieve, walk the preserved objs nodes and restore each folio,
then recreate the FLB object.
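For reference, the chained layout being walked here can be sketched in
userspace as follows. Plain pointers and hypothetical helper names
stand in for the physical-address links and kho_restore_folio() calls
that the patch actually uses:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Mirrors struct iommu_objs_ser: each preserved page starts with a
 * link to the next page and a count of entries used in this page.
 */
struct objs_hdr {
	struct objs_hdr *next;	/* stands in for the next_objs phys link */
	uint64_t nr_objs;
};

/*
 * Walk the chain and total the entries; the retrieve path performs the
 * same traversal, restoring each folio as it goes.
 */
uint64_t count_chain(const struct objs_hdr *h)
{
	uint64_t n = 0;

	for (; h; h = h->next)
		n += h->nr_objs;
	return n;
}

/* Prepend a page holding nr entries to an existing chain. */
struct objs_hdr *push_page(struct objs_hdr *next, uint64_t nr)
{
	struct objs_hdr *h = calloc(1, sizeof(*h));

	if (h) {
		h->next = next;
		h->nr_objs = nr;
	}
	return h;
}
```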

Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
---
 drivers/iommu/Kconfig         |  11 +++
 drivers/iommu/Makefile        |   1 +
 drivers/iommu/liveupdate.c    | 177 ++++++++++++++++++++++++++++++++++
 include/linux/iommu-lu.h      |  17 ++++
 include/linux/kho/abi/iommu.h | 119 +++++++++++++++++++++++
 5 files changed, 325 insertions(+)
 create mode 100644 drivers/iommu/liveupdate.c
 create mode 100644 include/linux/iommu-lu.h
 create mode 100644 include/linux/kho/abi/iommu.h

diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index f86262b11416..fdcfbedee5ed 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -11,6 +11,17 @@ config IOMMUFD_DRIVER
 	bool
 	default n
 
+config IOMMU_LIVEUPDATE
+	bool "IOMMU live update state preservation support"
+	depends on LIVEUPDATE && IOMMUFD
+	help
+	  Enable support for preserving IOMMU state across a kexec live update.
+
+	  This allows devices managed by iommufd to maintain their DMA mappings
+	  during a kexec-based kernel update.
+
+	  If unsure, say N.
+
 menuconfig IOMMU_SUPPORT
 	bool "IOMMU Hardware Support"
 	depends on MMU
diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
index 0275821f4ef9..b3715c5a6b97 100644
--- a/drivers/iommu/Makefile
+++ b/drivers/iommu/Makefile
@@ -15,6 +15,7 @@ obj-$(CONFIG_IOMMU_IO_PGTABLE_ARMV7S) += io-pgtable-arm-v7s.o
 obj-$(CONFIG_IOMMU_IO_PGTABLE_LPAE) += io-pgtable-arm.o
 obj-$(CONFIG_IOMMU_IO_PGTABLE_LPAE_KUNIT_TEST) += io-pgtable-arm-selftests.o
 obj-$(CONFIG_IOMMU_IO_PGTABLE_DART) += io-pgtable-dart.o
+obj-$(CONFIG_IOMMU_LIVEUPDATE) += liveupdate.o
 obj-$(CONFIG_IOMMU_IOVA) += iova.o
 obj-$(CONFIG_OF_IOMMU)	+= of_iommu.o
 obj-$(CONFIG_MSM_IOMMU) += msm_iommu.o
diff --git a/drivers/iommu/liveupdate.c b/drivers/iommu/liveupdate.c
new file mode 100644
index 000000000000..6189ba32ff2c
--- /dev/null
+++ b/drivers/iommu/liveupdate.c
@@ -0,0 +1,177 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+/*
+ * Copyright (C) 2025, Google LLC
+ * Author: Samiullah Khawaja <skhawaja@google.com>
+ */
+
+#define pr_fmt(fmt)    "iommu: liveupdate: " fmt
+
+#include <linux/kexec_handover.h>
+#include <linux/liveupdate.h>
+#include <linux/iommu-lu.h>
+#include <linux/iommu.h>
+#include <linux/errno.h>
+
+static void iommu_liveupdate_restore_objs(u64 next)
+{
+	struct iommu_objs_ser *objs;
+
+	while (next) {
+		BUG_ON(!kho_restore_folio(next));
+		objs = __va(next);
+		next = objs->next_objs;
+	}
+}
+
+static void iommu_liveupdate_free_objs(u64 next, bool incoming)
+{
+	struct iommu_objs_ser *objs;
+
+	while (next) {
+		objs = __va(next);
+		next = objs->next_objs;
+
+		if (!incoming)
+			kho_unpreserve_free(objs);
+		else
+			folio_put(virt_to_folio(objs));
+	}
+}
+
+static void iommu_liveupdate_flb_free(struct iommu_lu_flb_obj *obj)
+{
+	if (obj->iommu_domains)
+		iommu_liveupdate_free_objs(obj->ser->iommu_domains_phys, false);
+
+	if (obj->devices)
+		iommu_liveupdate_free_objs(obj->ser->devices_phys, false);
+
+	if (obj->iommus)
+		iommu_liveupdate_free_objs(obj->ser->iommus_phys, false);
+
+	kho_unpreserve_free(obj->ser);
+	kfree(obj);
+}
+
+static int iommu_liveupdate_flb_preserve(struct liveupdate_flb_op_args *argp)
+{
+	struct iommu_lu_flb_obj *obj;
+	struct iommu_lu_flb_ser *ser;
+	void *mem;
+
+	obj = kzalloc(sizeof(*obj), GFP_KERNEL);
+	if (!obj)
+		return -ENOMEM;
+
+	mutex_init(&obj->lock);
+	mem = kho_alloc_preserve(sizeof(*ser));
+	if (IS_ERR(mem))
+		goto err_free;
+
+	ser = mem;
+	obj->ser = ser;
+
+	mem = kho_alloc_preserve(PAGE_SIZE);
+	if (IS_ERR(mem))
+		goto err_free;
+
+	obj->iommu_domains = mem;
+	ser->iommu_domains_phys = virt_to_phys(obj->iommu_domains);
+
+	mem = kho_alloc_preserve(PAGE_SIZE);
+	if (IS_ERR(mem))
+		goto err_free;
+
+	obj->devices = mem;
+	ser->devices_phys = virt_to_phys(obj->devices);
+
+	mem = kho_alloc_preserve(PAGE_SIZE);
+	if (IS_ERR(mem))
+		goto err_free;
+
+	obj->iommus = mem;
+	ser->iommus_phys = virt_to_phys(obj->iommus);
+
+	argp->obj = obj;
+	argp->data = virt_to_phys(ser);
+	return 0;
+
+err_free:
+	iommu_liveupdate_flb_free(obj);
+	return PTR_ERR(mem);
+}
+
+static void iommu_liveupdate_flb_unpreserve(struct liveupdate_flb_op_args *argp)
+{
+	iommu_liveupdate_flb_free(argp->obj);
+}
+
+static void iommu_liveupdate_flb_finish(struct liveupdate_flb_op_args *argp)
+{
+	struct iommu_lu_flb_obj *obj = argp->obj;
+
+	if (obj->iommu_domains)
+		iommu_liveupdate_free_objs(obj->ser->iommu_domains_phys, true);
+
+	if (obj->devices)
+		iommu_liveupdate_free_objs(obj->ser->devices_phys, true);
+
+	if (obj->iommus)
+		iommu_liveupdate_free_objs(obj->ser->iommus_phys, true);
+
+	folio_put(virt_to_folio(obj->ser));
+	kfree(obj);
+}
+
+static int iommu_liveupdate_flb_retrieve(struct liveupdate_flb_op_args *argp)
+{
+	struct iommu_lu_flb_obj *obj;
+	struct iommu_lu_flb_ser *ser;
+
+	obj = kzalloc(sizeof(*obj), GFP_ATOMIC);
+	if (!obj)
+		return -ENOMEM;
+
+	mutex_init(&obj->lock);
+	BUG_ON(!kho_restore_folio(argp->data));
+	ser = phys_to_virt(argp->data);
+	obj->ser = ser;
+
+	iommu_liveupdate_restore_objs(ser->iommu_domains_phys);
+	obj->iommu_domains = phys_to_virt(ser->iommu_domains_phys);
+
+	iommu_liveupdate_restore_objs(ser->devices_phys);
+	obj->devices = phys_to_virt(ser->devices_phys);
+
+	iommu_liveupdate_restore_objs(ser->iommus_phys);
+	obj->iommus = phys_to_virt(ser->iommus_phys);
+
+	argp->obj = obj;
+
+	return 0;
+}
+
+static struct liveupdate_flb_ops iommu_flb_ops = {
+	.preserve = iommu_liveupdate_flb_preserve,
+	.unpreserve = iommu_liveupdate_flb_unpreserve,
+	.finish = iommu_liveupdate_flb_finish,
+	.retrieve = iommu_liveupdate_flb_retrieve,
+};
+
+static struct liveupdate_flb iommu_flb = {
+	.compatible = IOMMU_LUO_FLB_COMPATIBLE,
+	.ops = &iommu_flb_ops,
+};
+
+int iommu_liveupdate_register_flb(struct liveupdate_file_handler *handler)
+{
+	return liveupdate_register_flb(handler, &iommu_flb);
+}
+EXPORT_SYMBOL(iommu_liveupdate_register_flb);
+
+int iommu_liveupdate_unregister_flb(struct liveupdate_file_handler *handler)
+{
+	return liveupdate_unregister_flb(handler, &iommu_flb);
+}
+EXPORT_SYMBOL(iommu_liveupdate_unregister_flb);
diff --git a/include/linux/iommu-lu.h b/include/linux/iommu-lu.h
new file mode 100644
index 000000000000..59095d2f1bb2
--- /dev/null
+++ b/include/linux/iommu-lu.h
@@ -0,0 +1,17 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+/*
+ * Copyright (C) 2025, Google LLC
+ * Author: Samiullah Khawaja <skhawaja@google.com>
+ */
+
+#ifndef _LINUX_IOMMU_LU_H
+#define _LINUX_IOMMU_LU_H
+
+#include <linux/liveupdate.h>
+#include <linux/kho/abi/iommu.h>
+
+int iommu_liveupdate_register_flb(struct liveupdate_file_handler *handler);
+int iommu_liveupdate_unregister_flb(struct liveupdate_file_handler *handler);
+
+#endif /* _LINUX_IOMMU_LU_H */
diff --git a/include/linux/kho/abi/iommu.h b/include/linux/kho/abi/iommu.h
new file mode 100644
index 000000000000..8e1c05cfe7bb
--- /dev/null
+++ b/include/linux/kho/abi/iommu.h
@@ -0,0 +1,119 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+/*
+ * Copyright (C) 2025, Google LLC
+ * Author: Samiullah Khawaja <skhawaja@google.com>
+ */
+
+#ifndef _LINUX_KHO_ABI_IOMMU_H
+#define _LINUX_KHO_ABI_IOMMU_H
+
+#include <linux/mutex_types.h>
+#include <linux/compiler.h>
+#include <linux/types.h>
+
+/**
+ * DOC: IOMMU File-Lifecycle Bound (FLB) Live Update ABI
+ *
+ * This header defines the ABI for preserving IOMMU state across kexec using
+ * Live Update File-Lifecycle Bound (FLB) data.
+ *
+ * This interface is a contract. Any modification to any of the serialization
+ * structs defined here constitutes a breaking change. Such changes require
+ * incrementing the version number in the IOMMU_LUO_FLB_COMPATIBLE string.
+ */
+
+#define IOMMU_LUO_FLB_COMPATIBLE "iommu-v1"
+
+enum iommu_lu_type {
+	IOMMU_INVALID,
+	IOMMU_INTEL,
+};
+
+struct iommu_obj_ser {
+	u32 idx;
+	u32 ref_count;
+	u32 deleted:1;
+	u32 incoming:1;
+} __packed;
+
+struct iommu_domain_ser {
+	struct iommu_obj_ser obj;
+	u64 top_table;
+	u64 top_level;
+	struct iommu_domain *restored_domain;
+} __packed;
+
+struct device_domain_iommu_ser {
+	u32 did;
+	u64 domain_phys;
+	u64 iommu_phys;
+} __packed;
+
+struct device_ser {
+	struct iommu_obj_ser obj;
+	u64 token;
+	u32 devid;
+	u32 pci_domain;
+	struct device_domain_iommu_ser domain_iommu_ser;
+	enum iommu_lu_type type;
+} __packed;
+
+struct iommu_intel_ser {
+	u64 phys_addr;
+	u64 root_table;
+} __packed;
+
+struct iommu_ser {
+	struct iommu_obj_ser obj;
+	u64 token;
+	enum iommu_lu_type type;
+	union {
+		struct iommu_intel_ser intel;
+	};
+} __packed;
+
+struct iommu_objs_ser {
+	u64 next_objs;
+	u64 nr_objs;
+} __packed;
+
+struct iommus_ser {
+	struct iommu_objs_ser objs;
+	struct iommu_ser iommus[];
+} __packed;
+
+struct iommu_domains_ser {
+	struct iommu_objs_ser objs;
+	struct iommu_domain_ser iommu_domains[];
+} __packed;
+
+struct devices_ser {
+	struct iommu_objs_ser objs;
+	struct device_ser devices[];
+} __packed;
+
+#define MAX_IOMMU_SERS ((PAGE_SIZE - sizeof(struct iommus_ser)) / sizeof(struct iommu_ser))
+#define MAX_IOMMU_DOMAIN_SERS \
+		((PAGE_SIZE - sizeof(struct iommu_domains_ser)) / sizeof(struct iommu_domain_ser))
+#define MAX_DEVICE_SERS ((PAGE_SIZE - sizeof(struct devices_ser)) / sizeof(struct device_ser))
+
+struct iommu_lu_flb_ser {
+	u64 iommus_phys;
+	u64 nr_iommus;
+	u64 iommu_domains_phys;
+	u64 nr_domains;
+	u64 devices_phys;
+	u64 nr_devices;
+} __packed;
+
+struct iommu_lu_flb_obj {
+	struct mutex lock;
+	struct iommu_lu_flb_ser *ser;
+
+	struct iommu_domains_ser *iommu_domains;
+	struct iommus_ser *iommus;
+	struct devices_ser *devices;
+} __packed;
+
+#endif /* _LINUX_KHO_ABI_IOMMU_H */
-- 
2.53.0.rc2.204.g2597b5adb4-goog



* [PATCH 02/14] iommu: Implement IOMMU core liveupdate skeleton
  2026-02-03 22:09 [PATCH 00/14] iommu: Add live update state preservation Samiullah Khawaja
  2026-02-03 22:09 ` [PATCH 01/14] iommu: Implement IOMMU LU FLB callbacks Samiullah Khawaja
@ 2026-02-03 22:09 ` Samiullah Khawaja
  2026-03-12 23:10   ` Pranjal Shrivastava
  2026-03-17 19:58   ` Vipin Sharma
  2026-02-03 22:09 ` [PATCH 03/14] liveupdate: luo_file: Add internal APIs for file preservation Samiullah Khawaja
                   ` (11 subsequent siblings)
  13 siblings, 2 replies; 98+ messages in thread
From: Samiullah Khawaja @ 2026-02-03 22:09 UTC (permalink / raw)
  To: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe
  Cc: Samiullah Khawaja, Robin Murphy, Kevin Tian, Alex Williamson,
	Shuah Khan, iommu, linux-kernel, kvm, Saeed Mahameed,
	Adithya Jayachandran, Parav Pandit, Leon Romanovsky, William Tu,
	Pratyush Yadav, Pasha Tatashin, David Matlack, Andrew Morton,
	Chris Li, Pranjal Shrivastava, Vipin Sharma, YiFei Zhu

Add IOMMU domain ops that IOMMU drivers can implement if they support
IOMMU domain preservation across a live update. The new IOMMU domain
preserve, unpreserve and restore APIs call these ops to perform the
respective live update operations.

Similarly, add IOMMU ops to preserve/unpreserve a device. These can be
implemented by IOMMU drivers that support preserving devices whose
IOMMU domains are preserved. During device preservation, the state of
the associated IOMMU is also preserved. A device can only be preserved
if its attached IOMMU domain is preserved and the associated IOMMU
supports preservation.
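Since several devices can sit behind a single IOMMU unit, the unit's
state is preserved once and reference counted. A minimal userspace
sketch of that pattern, with hypothetical names mirroring what
iommu_preserve_locked()/iommu_unpreserve_locked() do in this patch:

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the per-unit preserved state in the patch. */
struct unit_state {
	int ref_count;
};

struct unit {
	struct unit_state *preserved;	/* NULL until first preserve */
	struct unit_state storage;
	int hw_serialized;		/* times the driver hook ran */
};

/*
 * The first caller serializes the unit's state; later callers just
 * take a reference, so the shared state is written out exactly once.
 */
void unit_preserve(struct unit *u)
{
	if (u->preserved) {
		u->preserved->ref_count++;
		return;
	}
	u->storage.ref_count = 1;
	u->hw_serialized++;		/* driver ->preserve() would run here */
	u->preserved = &u->storage;
}

/* Drop a reference; the state is discarded only with the last one. */
void unit_unpreserve(struct unit *u)
{
	if (--u->preserved->ref_count)
		return;
	u->preserved = NULL;		/* driver ->unpreserve() would run here */
}
```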

The preserved state of the device and IOMMU needs to be fetched during
shutdown and during boot in the next kernel. Add APIs that can be used
to fetch the preserved state of a device or an IOMMU. These APIs are
only used during shutdown and after a live update, so no locking is
needed.

Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
---
 drivers/iommu/iommu.c      |   3 +
 drivers/iommu/liveupdate.c | 326 +++++++++++++++++++++++++++++++++++++
 include/linux/iommu-lu.h   | 119 ++++++++++++++
 include/linux/iommu.h      |  32 ++++
 4 files changed, 480 insertions(+)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 4926a43118e6..c0632cb5b570 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -389,6 +389,9 @@ static struct dev_iommu *dev_iommu_get(struct device *dev)
 
 	mutex_init(&param->lock);
 	dev->iommu = param;
+#ifdef CONFIG_IOMMU_LIVEUPDATE
+	dev->iommu->device_ser = NULL;
+#endif
 	return param;
 }
 
diff --git a/drivers/iommu/liveupdate.c b/drivers/iommu/liveupdate.c
index 6189ba32ff2c..83eb609b3fd7 100644
--- a/drivers/iommu/liveupdate.c
+++ b/drivers/iommu/liveupdate.c
@@ -11,6 +11,7 @@
 #include <linux/liveupdate.h>
 #include <linux/iommu-lu.h>
 #include <linux/iommu.h>
+#include <linux/pci.h>
 #include <linux/errno.h>
 
 static void iommu_liveupdate_restore_objs(u64 next)
@@ -175,3 +176,328 @@ int iommu_liveupdate_unregister_flb(struct liveupdate_file_handler *handler)
 	return liveupdate_unregister_flb(handler, &iommu_flb);
 }
 EXPORT_SYMBOL(iommu_liveupdate_unregister_flb);
+
+int iommu_for_each_preserved_device(iommu_preserved_device_iter_fn fn,
+				    void *arg)
+{
+	struct iommu_lu_flb_obj *obj;
+	struct devices_ser *devices;
+	int ret, i, idx;
+
+	ret = liveupdate_flb_get_incoming(&iommu_flb, (void **)&obj);
+	if (ret)
+		return -ENOENT;
+
+	devices = __va(obj->ser->devices_phys);
+	for (i = 0, idx = 0; i < obj->ser->nr_devices; ++i, ++idx) {
+		if (idx >= MAX_DEVICE_SERS) {
+			devices = __va(devices->objs.next_objs);
+			idx = 0;
+		}
+
+		if (devices->devices[idx].obj.deleted)
+			continue;
+
+		ret = fn(&devices->devices[idx], arg);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL(iommu_for_each_preserved_device);
+
+static inline bool device_ser_match(struct device_ser *match,
+				    struct pci_dev *pdev)
+{
+	return match->devid == pci_dev_id(pdev) && match->pci_domain == pci_domain_nr(pdev->bus);
+}
+
+struct device_ser *iommu_get_device_preserved_data(struct device *dev)
+{
+	struct iommu_lu_flb_obj *obj;
+	struct devices_ser *devices;
+	int ret, i, idx;
+
+	if (!dev_is_pci(dev))
+		return NULL;
+
+	ret = liveupdate_flb_get_incoming(&iommu_flb, (void **)&obj);
+	if (ret)
+		return NULL;
+
+	devices = __va(obj->ser->devices_phys);
+	for (i = 0, idx = 0; i < obj->ser->nr_devices; ++i, ++idx) {
+		if (idx >= MAX_DEVICE_SERS) {
+			devices = __va(devices->objs.next_objs);
+			idx = 0;
+		}
+
+		if (devices->devices[idx].obj.deleted)
+			continue;
+
+		if (device_ser_match(&devices->devices[idx], to_pci_dev(dev))) {
+			devices->devices[idx].obj.incoming = true;
+			return &devices->devices[idx];
+		}
+	}
+
+	return NULL;
+}
+EXPORT_SYMBOL(iommu_get_device_preserved_data);
+
+struct iommu_ser *iommu_get_preserved_data(u64 token, enum iommu_lu_type type)
+{
+	struct iommu_lu_flb_obj *obj;
+	struct iommus_ser *iommus;
+	int ret, i, idx;
+
+	ret = liveupdate_flb_get_incoming(&iommu_flb, (void **)&obj);
+	if (ret)
+		return NULL;
+
+	iommus = __va(obj->ser->iommus_phys);
+	for (i = 0, idx = 0; i < obj->ser->nr_iommus; ++i, ++idx) {
+		if (idx >= MAX_IOMMU_SERS) {
+			iommus = __va(iommus->objs.next_objs);
+			idx = 0;
+		}
+
+		if (iommus->iommus[idx].obj.deleted)
+			continue;
+
+		if (iommus->iommus[idx].token == token &&
+		    iommus->iommus[idx].type == type)
+			return &iommus->iommus[idx];
+	}
+
+	return NULL;
+}
+EXPORT_SYMBOL(iommu_get_preserved_data);
+
+static int reserve_obj_ser(struct iommu_objs_ser **objs_ptr, u64 max_objs)
+{
+	struct iommu_objs_ser *next_objs, *objs = *objs_ptr;
+	int idx;
+
+	if (objs->nr_objs == max_objs) {
+		next_objs = kho_alloc_preserve(PAGE_SIZE);
+		if (IS_ERR(next_objs))
+			return PTR_ERR(next_objs);
+
+		objs->next_objs = virt_to_phys(next_objs);
+		objs = next_objs;
+		*objs_ptr = objs;
+		objs->nr_objs = 0;
+		objs->next_objs = 0;
+	}
+
+	idx = objs->nr_objs++;
+	return idx;
+}
+
+int iommu_domain_preserve(struct iommu_domain *domain, struct iommu_domain_ser **ser)
+{
+	struct iommu_domain_ser *domain_ser;
+	struct iommu_lu_flb_obj *flb_obj;
+	int idx, ret;
+
+	if (!domain->ops->preserve)
+		return -EOPNOTSUPP;
+
+	ret = liveupdate_flb_get_outgoing(&iommu_flb, (void **)&flb_obj);
+	if (ret)
+		return ret;
+
+	guard(mutex)(&flb_obj->lock);
+	idx = reserve_obj_ser((struct iommu_objs_ser **)&flb_obj->iommu_domains,
+			      MAX_IOMMU_DOMAIN_SERS);
+	if (idx < 0)
+		return idx;
+
+	domain_ser = &flb_obj->iommu_domains->iommu_domains[idx];
+	idx = flb_obj->ser->nr_domains++;
+	domain_ser->obj.idx = idx;
+	domain_ser->obj.ref_count = 1;
+
+	ret = domain->ops->preserve(domain, domain_ser);
+	if (ret) {
+		domain_ser->obj.deleted = true;
+		return ret;
+	}
+
+	domain->preserved_state = domain_ser;
+	*ser = domain_ser;
+	return 0;
+}
+EXPORT_SYMBOL_GPL(iommu_domain_preserve);
+
+void iommu_domain_unpreserve(struct iommu_domain *domain)
+{
+	struct iommu_domain_ser *domain_ser;
+	struct iommu_lu_flb_obj *flb_obj;
+	int ret;
+
+	if (!domain->ops->unpreserve)
+		return;
+
+	ret = liveupdate_flb_get_outgoing(&iommu_flb, (void **)&flb_obj);
+	if (ret)
+		return;
+
+	guard(mutex)(&flb_obj->lock);
+
+	/*
+	 * There is no check for attached devices here. The correctness relies
+	 * on the Live Update Orchestrator's session lifecycle. All resources
+	 * (iommufd, vfio devices) are preserved within a single session. If the
+	 * session is torn down, the .unpreserve callbacks for all files will be
+	 * invoked, ensuring a consistent cleanup without needing explicit
+	 * refcounting for the serialized objects here.
+	 */
+	domain_ser = domain->preserved_state;
+	domain->ops->unpreserve(domain, domain_ser);
+	domain_ser->obj.deleted = true;
+	domain->preserved_state = NULL;
+}
+EXPORT_SYMBOL_GPL(iommu_domain_unpreserve);
+
+static int iommu_preserve_locked(struct iommu_device *iommu)
+{
+	struct iommu_lu_flb_obj *flb_obj;
+	struct iommu_ser *iommu_ser;
+	int idx, ret;
+
+	if (!iommu->ops->preserve)
+		return -EOPNOTSUPP;
+
+	if (iommu->outgoing_preserved_state) {
+		iommu->outgoing_preserved_state->obj.ref_count++;
+		return 0;
+	}
+
+	ret = liveupdate_flb_get_outgoing(&iommu_flb, (void **)&flb_obj);
+	if (ret)
+		return ret;
+
+	idx = reserve_obj_ser((struct iommu_objs_ser **)&flb_obj->iommus,
+			      MAX_IOMMU_SERS);
+	if (idx < 0)
+		return idx;
+
+	iommu_ser = &flb_obj->iommus->iommus[idx];
+	idx = flb_obj->ser->nr_iommus++;
+	iommu_ser->obj.idx = idx;
+	iommu_ser->obj.ref_count = 1;
+
+	ret = iommu->ops->preserve(iommu, iommu_ser);
+	if (ret)
+		iommu_ser->obj.deleted = true;
+
+	iommu->outgoing_preserved_state = iommu_ser;
+	return ret;
+}
+
+static void iommu_unpreserve_locked(struct iommu_device *iommu)
+{
+	struct iommu_ser *iommu_ser = iommu->outgoing_preserved_state;
+
+	iommu_ser->obj.ref_count--;
+	if (iommu_ser->obj.ref_count)
+		return;
+
+	iommu->outgoing_preserved_state = NULL;
+	iommu->ops->unpreserve(iommu, iommu_ser);
+	iommu_ser->obj.deleted = true;
+}
+
+int iommu_preserve_device(struct iommu_domain *domain,
+			  struct device *dev, u64 token)
+{
+	struct iommu_lu_flb_obj *flb_obj;
+	struct device_ser *device_ser;
+	struct dev_iommu *iommu;
+	struct pci_dev *pdev;
+	int ret, idx;
+
+	if (!dev_is_pci(dev))
+		return -EOPNOTSUPP;
+
+	if (!domain->preserved_state)
+		return -EINVAL;
+
+	pdev = to_pci_dev(dev);
+	iommu = dev->iommu;
+	if (!iommu->iommu_dev->ops->preserve_device ||
+	    !iommu->iommu_dev->ops->preserve)
+		return -EOPNOTSUPP;
+
+	ret = liveupdate_flb_get_outgoing(&iommu_flb, (void **)&flb_obj);
+	if (ret)
+		return ret;
+
+	guard(mutex)(&flb_obj->lock);
+	idx = reserve_obj_ser((struct iommu_objs_ser **)&flb_obj->devices,
+			      MAX_DEVICE_SERS);
+	if (idx < 0)
+		return idx;
+
+	device_ser = &flb_obj->devices->devices[idx];
+	idx = flb_obj->ser->nr_devices++;
+	device_ser->obj.idx = idx;
+	device_ser->obj.ref_count = 1;
+
+	ret = iommu_preserve_locked(iommu->iommu_dev);
+	if (ret) {
+		device_ser->obj.deleted = true;
+		return ret;
+	}
+
+	device_ser->domain_iommu_ser.domain_phys = __pa(domain->preserved_state);
+	device_ser->domain_iommu_ser.iommu_phys = __pa(iommu->iommu_dev->outgoing_preserved_state);
+	device_ser->devid = pci_dev_id(pdev);
+	device_ser->pci_domain = pci_domain_nr(pdev->bus);
+	device_ser->token = token;
+
+	ret = iommu->iommu_dev->ops->preserve_device(dev, device_ser);
+	if (ret) {
+		device_ser->obj.deleted = true;
+		iommu_unpreserve_locked(iommu->iommu_dev);
+		return ret;
+	}
+
+	dev->iommu->device_ser = device_ser;
+	return 0;
+}
+
+void iommu_unpreserve_device(struct iommu_domain *domain, struct device *dev)
+{
+	struct iommu_lu_flb_obj *flb_obj;
+	struct device_ser *device_ser;
+	struct dev_iommu *iommu;
+	struct pci_dev *pdev;
+	int ret;
+
+	if (!dev_is_pci(dev))
+		return;
+
+	pdev = to_pci_dev(dev);
+	iommu = dev->iommu;
+	if (!iommu->iommu_dev->ops->unpreserve_device ||
+	    !iommu->iommu_dev->ops->unpreserve)
+		return;
+
+	ret = liveupdate_flb_get_outgoing(&iommu_flb, (void **)&flb_obj);
+	if (WARN_ON(ret))
+		return;
+
+	guard(mutex)(&flb_obj->lock);
+	device_ser = dev_iommu_preserved_state(dev);
+	if (WARN_ON(!device_ser))
+		return;
+
+	iommu->iommu_dev->ops->unpreserve_device(dev, device_ser);
+	dev->iommu->device_ser = NULL;
+
+	iommu_unpreserve_locked(iommu->iommu_dev);
+}
diff --git a/include/linux/iommu-lu.h b/include/linux/iommu-lu.h
index 59095d2f1bb2..48c07514a776 100644
--- a/include/linux/iommu-lu.h
+++ b/include/linux/iommu-lu.h
@@ -8,9 +8,128 @@
 #ifndef _LINUX_IOMMU_LU_H
 #define _LINUX_IOMMU_LU_H
 
+#include <linux/device.h>
+#include <linux/iommu.h>
 #include <linux/liveupdate.h>
 #include <linux/kho/abi/iommu.h>
 
+typedef int (*iommu_preserved_device_iter_fn)(struct device_ser *ser,
+					      void *arg);
+#ifdef CONFIG_IOMMU_LIVEUPDATE
+static inline void *dev_iommu_preserved_state(struct device *dev)
+{
+	struct device_ser *ser;
+
+	if (!dev->iommu)
+		return NULL;
+
+	ser = dev->iommu->device_ser;
+	if (ser && !ser->obj.incoming)
+		return ser;
+
+	return NULL;
+}
+
+static inline void *dev_iommu_restored_state(struct device *dev)
+{
+	struct device_ser *ser;
+
+	if (!dev->iommu)
+		return NULL;
+
+	ser = dev->iommu->device_ser;
+	if (ser && ser->obj.incoming)
+		return ser;
+
+	return NULL;
+}
+
+static inline void *iommu_domain_restored_state(struct iommu_domain *domain)
+{
+	struct iommu_domain_ser *ser;
+
+	ser = domain->preserved_state;
+	if (ser && ser->obj.incoming)
+		return ser;
+
+	return NULL;
+}
+
+static inline int dev_iommu_restore_did(struct device *dev, struct iommu_domain *domain)
+{
+	struct device_ser *ser = dev_iommu_restored_state(dev);
+
+	if (ser && iommu_domain_restored_state(domain))
+		return ser->domain_iommu_ser.did;
+
+	return -1;
+}
+
+int iommu_for_each_preserved_device(iommu_preserved_device_iter_fn fn,
+				    void *arg);
+struct device_ser *iommu_get_device_preserved_data(struct device *dev);
+struct iommu_ser *iommu_get_preserved_data(u64 token, enum iommu_lu_type type);
+int iommu_domain_preserve(struct iommu_domain *domain, struct iommu_domain_ser **ser);
+void iommu_domain_unpreserve(struct iommu_domain *domain);
+int iommu_preserve_device(struct iommu_domain *domain,
+			  struct device *dev, u64 token);
+void iommu_unpreserve_device(struct iommu_domain *domain, struct device *dev);
+#else
+static inline void *dev_iommu_preserved_state(struct device *dev)
+{
+	return NULL;
+}
+
+static inline void *dev_iommu_restored_state(struct device *dev)
+{
+	return NULL;
+}
+
+static inline int dev_iommu_restore_did(struct device *dev, struct iommu_domain *domain)
+{
+	return -1;
+}
+
+static inline void *iommu_domain_restored_state(struct iommu_domain *domain)
+{
+	return NULL;
+}
+
+static inline int iommu_for_each_preserved_device(iommu_preserved_device_iter_fn fn, void *arg)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline struct device_ser *iommu_get_device_preserved_data(struct device *dev)
+{
+	return NULL;
+}
+
+static inline struct iommu_ser *iommu_get_preserved_data(u64 token, enum iommu_lu_type type)
+{
+	return NULL;
+}
+
+static inline int iommu_domain_preserve(struct iommu_domain *domain, struct iommu_domain_ser **ser)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline void iommu_domain_unpreserve(struct iommu_domain *domain)
+{
+}
+
+static inline int iommu_preserve_device(struct iommu_domain *domain,
+					struct device *dev, u64 token)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline void iommu_unpreserve_device(struct iommu_domain *domain, struct device *dev)
+{
+}
+#endif
+
 int iommu_liveupdate_register_flb(struct liveupdate_file_handler *handler);
 int iommu_liveupdate_unregister_flb(struct liveupdate_file_handler *handler);
 
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 54b8b48c762e..bd949c1ce7c5 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -14,6 +14,8 @@
 #include <linux/err.h>
 #include <linux/of.h>
 #include <linux/iova_bitmap.h>
+#include <linux/atomic.h>
+#include <linux/kho/abi/iommu.h>
 #include <uapi/linux/iommufd.h>
 
 #define IOMMU_READ	(1 << 0)
@@ -248,6 +250,10 @@ struct iommu_domain {
 			struct list_head next;
 		};
 	};
+
+#ifdef CONFIG_IOMMU_LIVEUPDATE
+	struct iommu_domain_ser *preserved_state;
+#endif
 };
 
 static inline bool iommu_is_dma_domain(struct iommu_domain *domain)
@@ -647,6 +653,10 @@ __iommu_copy_struct_to_user(const struct iommu_user_data *dst_data,
  *               resources shared/passed to user space IOMMU instance. Associate
  *               it with a nesting @parent_domain. It is required for driver to
  *               set @viommu->ops pointing to its own viommu_ops
+ * @preserve_device: Preserve state of a device for liveupdate.
+ * @unpreserve_device: Unpreserve state that was preserved earlier.
+ * @preserve: Preserve state of iommu translation hardware for liveupdate.
+ * @unpreserve: Unpreserve state of iommu that was preserved earlier.
  * @owner: Driver module providing these ops
  * @identity_domain: An always available, always attachable identity
  *                   translation.
@@ -703,6 +713,11 @@ struct iommu_ops {
 			   struct iommu_domain *parent_domain,
 			   const struct iommu_user_data *user_data);
 
+	int (*preserve_device)(struct device *dev, struct device_ser *device_ser);
+	void (*unpreserve_device)(struct device *dev, struct device_ser *device_ser);
+	int (*preserve)(struct iommu_device *iommu, struct iommu_ser *iommu_ser);
+	void (*unpreserve)(struct iommu_device *iommu, struct iommu_ser *iommu_ser);
+
 	const struct iommu_domain_ops *default_domain_ops;
 	struct module *owner;
 	struct iommu_domain *identity_domain;
@@ -749,6 +764,11 @@ struct iommu_ops {
  *                           specific mechanisms.
  * @set_pgtable_quirks: Set io page table quirks (IO_PGTABLE_QUIRK_*)
  * @free: Release the domain after use.
+ * @preserve: Preserve the iommu domain for liveupdate.
+ *            Returns 0 on success, a negative errno on failure.
+ * @unpreserve: Unpreserve the iommu domain that was preserved earlier.
+ * @restore: Restore the iommu domain after liveupdate.
+ *           Returns 0 on success, a negative errno on failure.
  */
 struct iommu_domain_ops {
 	int (*attach_dev)(struct iommu_domain *domain, struct device *dev,
@@ -779,6 +799,9 @@ struct iommu_domain_ops {
 				  unsigned long quirks);
 
 	void (*free)(struct iommu_domain *domain);
+	int (*preserve)(struct iommu_domain *domain, struct iommu_domain_ser *ser);
+	void (*unpreserve)(struct iommu_domain *domain, struct iommu_domain_ser *ser);
+	int (*restore)(struct iommu_domain *domain, struct iommu_domain_ser *ser);
 };
 
 /**
@@ -790,6 +813,8 @@ struct iommu_domain_ops {
  * @singleton_group: Used internally for drivers that have only one group
  * @max_pasids: number of supported PASIDs
  * @ready: set once iommu_device_register() has completed successfully
+ * @outgoing_preserved_state: preserved iommu state of outgoing kernel for
+ * liveupdate.
  */
 struct iommu_device {
 	struct list_head list;
@@ -799,6 +824,10 @@ struct iommu_device {
 	struct iommu_group *singleton_group;
 	u32 max_pasids;
 	bool ready;
+
+#ifdef CONFIG_IOMMU_LIVEUPDATE
+	struct iommu_ser *outgoing_preserved_state;
+#endif
 };
 
 /**
@@ -853,6 +882,9 @@ struct dev_iommu {
 	u32				pci_32bit_workaround:1;
 	u32				require_direct:1;
 	u32				shadow_on_flush:1;
+#ifdef CONFIG_IOMMU_LIVEUPDATE
+	struct device_ser		*device_ser;
+#endif
 };
 
 int iommu_device_register(struct iommu_device *iommu,
-- 
2.53.0.rc2.204.g2597b5adb4-goog


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH 03/14] liveupdate: luo_file: Add internal APIs for file preservation
  2026-02-03 22:09 [PATCH 00/14] iommu: Add live update state preservation Samiullah Khawaja
  2026-02-03 22:09 ` [PATCH 01/14] iommu: Implement IOMMU LU FLB callbacks Samiullah Khawaja
  2026-02-03 22:09 ` [PATCH 02/14] iommu: Implement IOMMU core liveupdate skeleton Samiullah Khawaja
@ 2026-02-03 22:09 ` Samiullah Khawaja
  2026-03-18 10:00   ` Pranjal Shrivastava
  2026-02-03 22:09 ` [PATCH 04/14] iommu/pages: Add APIs to preserve/unpreserve/restore iommu pages Samiullah Khawaja
                   ` (10 subsequent siblings)
  13 siblings, 1 reply; 98+ messages in thread
From: Samiullah Khawaja @ 2026-02-03 22:09 UTC (permalink / raw)
  To: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe
  Cc: Pasha Tatashin, Robin Murphy, Kevin Tian, Alex Williamson,
	Shuah Khan, iommu, linux-kernel, kvm, Saeed Mahameed,
	Adithya Jayachandran, Parav Pandit, Leon Romanovsky, William Tu,
	Samiullah Khawaja, Pratyush Yadav, David Matlack, Andrew Morton,
	Chris Li, Pranjal Shrivastava, Vipin Sharma, YiFei Zhu

From: Pasha Tatashin <pasha.tatashin@soleen.com>

The core liveupdate mechanism allows userspace to preserve file
descriptors. However, kernel subsystems often manage struct file
objects directly and need to participate in the preservation process
programmatically without relying solely on userspace interaction.
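For an in-kernel caller, the intended usage looks roughly like the
following sketch (the calling subsystem and its token handling are
hypothetical; only the liveupdate_get_file_incoming() entry point and
its fput() requirement come from this patch):

```c
/*
 * Sketch only: a hypothetical subsystem retrieving its preserved file
 * in the new kernel. Per the kernel-doc below, the caller receives a
 * reference that must be dropped with fput() when no longer needed.
 */
static int example_restore_file(struct liveupdate_session *s, u64 token)
{
	struct file *file;
	int err;

	err = liveupdate_get_file_incoming(s, token, &file);
	if (err)
		return err;	/* -ENOENT if the token was not preserved */

	/* ... reattach the subsystem's state to @file ... */

	fput(file);
	return 0;
}
```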

Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
---
 include/linux/liveupdate.h       | 21 ++++++++++
 kernel/liveupdate/luo_file.c     | 71 ++++++++++++++++++++++++++++++++
 kernel/liveupdate/luo_internal.h | 16 +++++++
 3 files changed, 108 insertions(+)

diff --git a/include/linux/liveupdate.h b/include/linux/liveupdate.h
index fe82a6c3005f..8e47504ba01e 100644
--- a/include/linux/liveupdate.h
+++ b/include/linux/liveupdate.h
@@ -23,6 +23,7 @@ struct file;
 /**
  * struct liveupdate_file_op_args - Arguments for file operation callbacks.
  * @handler:          The file handler being called.
+ * @session:          The session this file belongs to.
  * @retrieved:        The retrieve status for the 'can_finish / finish'
  *                    operation.
  * @file:             The file object. For retrieve: [OUT] The callback sets
@@ -40,6 +41,7 @@ struct file;
  */
 struct liveupdate_file_op_args {
 	struct liveupdate_file_handler *handler;
+	struct liveupdate_session *session;
 	bool retrieved;
 	struct file *file;
 	u64 serialized_data;
@@ -234,6 +236,13 @@ int liveupdate_unregister_flb(struct liveupdate_file_handler *fh,
 
 int liveupdate_flb_get_incoming(struct liveupdate_flb *flb, void **objp);
 int liveupdate_flb_get_outgoing(struct liveupdate_flb *flb, void **objp);
+/* kernel can internally retrieve files */
+int liveupdate_get_file_incoming(struct liveupdate_session *s, u64 token,
+				 struct file **filep);
+
+/* Get a token for an outgoing file, or -ENOENT if file is not preserved */
+int liveupdate_get_token_outgoing(struct liveupdate_session *s,
+				  struct file *file, u64 *tokenp);
 
 #else /* CONFIG_LIVEUPDATE */
 
@@ -281,5 +290,17 @@ static inline int liveupdate_flb_get_outgoing(struct liveupdate_flb *flb,
 	return -EOPNOTSUPP;
 }
 
+static inline int liveupdate_get_file_incoming(struct liveupdate_session *s,
+					       u64 token, struct file **filep)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline int liveupdate_get_token_outgoing(struct liveupdate_session *s,
+						struct file *file, u64 *tokenp)
+{
+	return -EOPNOTSUPP;
+}
+
 #endif /* CONFIG_LIVEUPDATE */
 #endif /* _LINUX_LIVEUPDATE_H */
diff --git a/kernel/liveupdate/luo_file.c b/kernel/liveupdate/luo_file.c
index 32759e846bc9..7ac591542059 100644
--- a/kernel/liveupdate/luo_file.c
+++ b/kernel/liveupdate/luo_file.c
@@ -302,6 +302,7 @@ int luo_preserve_file(struct luo_file_set *file_set, u64 token, int fd)
 	mutex_init(&luo_file->mutex);
 
 	args.handler = fh;
+	args.session = luo_session_from_file_set(file_set);
 	args.file = file;
 	err = fh->ops->preserve(&args);
 	if (err)
@@ -355,6 +356,7 @@ void luo_file_unpreserve_files(struct luo_file_set *file_set)
 					   struct luo_file, list);
 
 		args.handler = luo_file->fh;
+		args.session = luo_session_from_file_set(file_set);
 		args.file = luo_file->file;
 		args.serialized_data = luo_file->serialized_data;
 		args.private_data = luo_file->private_data;
@@ -383,6 +385,7 @@ static int luo_file_freeze_one(struct luo_file_set *file_set,
 		struct liveupdate_file_op_args args = {0};
 
 		args.handler = luo_file->fh;
+		args.session = luo_session_from_file_set(file_set);
 		args.file = luo_file->file;
 		args.serialized_data = luo_file->serialized_data;
 		args.private_data = luo_file->private_data;
@@ -404,6 +407,7 @@ static void luo_file_unfreeze_one(struct luo_file_set *file_set,
 		struct liveupdate_file_op_args args = {0};
 
 		args.handler = luo_file->fh;
+		args.session = luo_session_from_file_set(file_set);
 		args.file = luo_file->file;
 		args.serialized_data = luo_file->serialized_data;
 		args.private_data = luo_file->private_data;
@@ -590,6 +594,7 @@ int luo_retrieve_file(struct luo_file_set *file_set, u64 token,
 	}
 
 	args.handler = luo_file->fh;
+	args.session = luo_session_from_file_set(file_set);
 	args.serialized_data = luo_file->serialized_data;
 	err = luo_file->fh->ops->retrieve(&args);
 	if (!err) {
@@ -615,6 +620,7 @@ static int luo_file_can_finish_one(struct luo_file_set *file_set,
 		struct liveupdate_file_op_args args = {0};
 
 		args.handler = luo_file->fh;
+		args.session = luo_session_from_file_set(file_set);
 		args.file = luo_file->file;
 		args.serialized_data = luo_file->serialized_data;
 		args.retrieved = luo_file->retrieved;
@@ -632,6 +638,7 @@ static void luo_file_finish_one(struct luo_file_set *file_set,
 	guard(mutex)(&luo_file->mutex);
 
 	args.handler = luo_file->fh;
+	args.session = luo_session_from_file_set(file_set);
 	args.file = luo_file->file;
 	args.serialized_data = luo_file->serialized_data;
 	args.retrieved = luo_file->retrieved;
@@ -919,3 +926,67 @@ int liveupdate_unregister_file_handler(struct liveupdate_file_handler *fh)
 	return err;
 }
 EXPORT_SYMBOL_GPL(liveupdate_unregister_file_handler);
+
+/**
+ * liveupdate_get_token_outgoing - Get the token for a preserved file.
+ * @s:      The outgoing liveupdate session.
+ * @file:   The file object to search for.
+ * @tokenp: Output parameter for the found token.
+ *
+ * Searches the list of preserved files in an outgoing session for a matching
+ * file object. If found, the corresponding user-provided token is returned.
+ *
+ * This function is intended for in-kernel callers that need to correlate a
+ * file with its liveupdate token.
+ *
+ * Context: Can be called from any context that can acquire the session mutex.
+ * Return: 0 on success, -ENOENT if the file is not preserved in this session.
+ */
+int liveupdate_get_token_outgoing(struct liveupdate_session *s,
+				  struct file *file, u64 *tokenp)
+{
+	struct luo_file_set *file_set = luo_file_set_from_session(s);
+	struct luo_file *luo_file;
+	int err = -ENOENT;
+
+	list_for_each_entry(luo_file, &file_set->files_list, list) {
+		if (luo_file->file == file) {
+			if (tokenp)
+				*tokenp = luo_file->token;
+			err = 0;
+			break;
+		}
+	}
+
+	return err;
+}
+
+/**
+ * liveupdate_get_file_incoming - Retrieves a preserved file for in-kernel use.
+ * @s:      The incoming liveupdate session (restored from the previous kernel).
+ * @token:  The unique token identifying the file to retrieve.
+ * @filep:  On success, this will be populated with a pointer to the retrieved
+ *          'struct file'.
+ *
+ * Provides a kernel-internal API for other subsystems to retrieve their
+ * preserved files after a live update. This function is a simple wrapper
+ * around luo_retrieve_file(), allowing callers to find a file by its token.
+ *
+ * The operation is idempotent; subsequent calls for the same token will return
+ * a pointer to the same 'struct file' object.
+ *
+ * The caller receives a new reference to the file and must call fput() when it
+ * is no longer needed. The file's lifetime is managed by LUO and any userspace
+ * file descriptors. If the caller needs to hold a reference to the file beyond
+ * the immediate scope, it must call get_file() itself.
+ *
+ * Context: Can be called from any context in the new kernel that has a handle
+ *          to a restored session.
+ * Return: 0 on success. Returns -ENOENT if no file with the matching token is
+ *         found, or any other negative errno on failure.
+ */
+int liveupdate_get_file_incoming(struct liveupdate_session *s, u64 token,
+				 struct file **filep)
+{
+	return luo_retrieve_file(luo_file_set_from_session(s), token, filep);
+}
diff --git a/kernel/liveupdate/luo_internal.h b/kernel/liveupdate/luo_internal.h
index 8083d8739b09..a24933d24fd9 100644
--- a/kernel/liveupdate/luo_internal.h
+++ b/kernel/liveupdate/luo_internal.h
@@ -77,6 +77,22 @@ struct luo_session {
 	struct mutex mutex;
 };
 
+static inline struct liveupdate_session *luo_session_from_file_set(struct luo_file_set *file_set)
+{
+	struct luo_session *session;
+
+	session = container_of(file_set, struct luo_session, file_set);
+
+	return (struct liveupdate_session *)session;
+}
+
+static inline struct luo_file_set *luo_file_set_from_session(struct liveupdate_session *s)
+{
+	struct luo_session *session = (struct luo_session *)s;
+
+	return &session->file_set;
+}
+
 int luo_session_create(const char *name, struct file **filep);
 int luo_session_retrieve(const char *name, struct file **filep);
 int __init luo_session_setup_outgoing(void *fdt);
-- 
2.53.0.rc2.204.g2597b5adb4-goog


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH 04/14] iommu/pages: Add APIs to preserve/unpreserve/restore iommu pages
  2026-02-03 22:09 [PATCH 00/14] iommu: Add live update state preservation Samiullah Khawaja
                   ` (2 preceding siblings ...)
  2026-02-03 22:09 ` [PATCH 03/14] liveupdate: luo_file: Add internal APIs for file preservation Samiullah Khawaja
@ 2026-02-03 22:09 ` Samiullah Khawaja
  2026-03-03 16:42   ` Ankit Soni
  2026-03-17 20:59   ` Vipin Sharma
  2026-02-03 22:09 ` [PATCH 05/14] iommupt: Implement preserve/unpreserve/restore callbacks Samiullah Khawaja
                   ` (9 subsequent siblings)
  13 siblings, 2 replies; 98+ messages in thread
From: Samiullah Khawaja @ 2026-02-03 22:09 UTC (permalink / raw)
  To: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe
  Cc: Samiullah Khawaja, Robin Murphy, Kevin Tian, Alex Williamson,
	Shuah Khan, iommu, linux-kernel, kvm, Saeed Mahameed,
	Adithya Jayachandran, Parav Pandit, Leon Romanovsky, William Tu,
	Pratyush Yadav, Pasha Tatashin, David Matlack, Andrew Morton,
	Chris Li, Pranjal Shrivastava, Vipin Sharma, YiFei Zhu

IOMMU pages are allocated and freed through APIs built around struct
ioptdesc. Add helper functions for the proper preservation and
restoration of these ioptdesc-backed pages.

Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
---
 drivers/iommu/iommu-pages.c | 74 +++++++++++++++++++++++++++++++++++++
 drivers/iommu/iommu-pages.h | 30 +++++++++++++++
 2 files changed, 104 insertions(+)

diff --git a/drivers/iommu/iommu-pages.c b/drivers/iommu/iommu-pages.c
index 3bab175d8557..588a8f19b196 100644
--- a/drivers/iommu/iommu-pages.c
+++ b/drivers/iommu/iommu-pages.c
@@ -6,6 +6,7 @@
 #include "iommu-pages.h"
 #include <linux/dma-mapping.h>
 #include <linux/gfp.h>
+#include <linux/kexec_handover.h>
 #include <linux/mm.h>
 
 #define IOPTDESC_MATCH(pg_elm, elm)                    \
@@ -131,6 +132,79 @@ void iommu_put_pages_list(struct iommu_pages_list *list)
 }
 EXPORT_SYMBOL_GPL(iommu_put_pages_list);
 
+#if IS_ENABLED(CONFIG_IOMMU_LIVEUPDATE)
+void iommu_unpreserve_page(void *virt)
+{
+	kho_unpreserve_folio(ioptdesc_folio(virt_to_ioptdesc(virt)));
+}
+EXPORT_SYMBOL_GPL(iommu_unpreserve_page);
+
+int iommu_preserve_page(void *virt)
+{
+	return kho_preserve_folio(ioptdesc_folio(virt_to_ioptdesc(virt)));
+}
+EXPORT_SYMBOL_GPL(iommu_preserve_page);
+
+void iommu_unpreserve_pages(struct iommu_pages_list *list, int count)
+{
+	struct ioptdesc *iopt;
+
+	if (!count)
+		return;
+
+	/* A negative count means unpreserve all pages. */
+	if (count < 0)
+		count = 0;
+
+	list_for_each_entry(iopt, &list->pages, iopt_freelist_elm) {
+		kho_unpreserve_folio(ioptdesc_folio(iopt));
+		if (count > 0 && --count == 0)
+			break;
+	}
+}
+EXPORT_SYMBOL_GPL(iommu_unpreserve_pages);
+
+void iommu_restore_page(u64 phys)
+{
+	struct ioptdesc *iopt;
+	struct folio *folio;
+	unsigned long pgcnt;
+	unsigned int order;
+
+	folio = kho_restore_folio(phys);
+	BUG_ON(!folio);
+
+	iopt = folio_ioptdesc(folio);
+
+	order = folio_order(folio);
+	pgcnt = 1UL << order;
+	mod_node_page_state(folio_pgdat(folio), NR_IOMMU_PAGES, pgcnt);
+	lruvec_stat_mod_folio(folio, NR_SECONDARY_PAGETABLE, pgcnt);
+}
+EXPORT_SYMBOL_GPL(iommu_restore_page);
+
+int iommu_preserve_pages(struct iommu_pages_list *list)
+{
+	struct ioptdesc *iopt;
+	int count = 0;
+	int ret;
+
+	list_for_each_entry(iopt, &list->pages, iopt_freelist_elm) {
+		ret = kho_preserve_folio(ioptdesc_folio(iopt));
+		if (ret) {
+			iommu_unpreserve_pages(list, count);
+			return ret;
+		}
+
+		++count;
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(iommu_preserve_pages);
+
+#endif
+
 /**
  * iommu_pages_start_incoherent - Setup the page for cache incoherent operation
  * @virt: The page to setup
diff --git a/drivers/iommu/iommu-pages.h b/drivers/iommu/iommu-pages.h
index ae9da4f571f6..bd336fb56b5f 100644
--- a/drivers/iommu/iommu-pages.h
+++ b/drivers/iommu/iommu-pages.h
@@ -53,6 +53,36 @@ void *iommu_alloc_pages_node_sz(int nid, gfp_t gfp, size_t size);
 void iommu_free_pages(void *virt);
 void iommu_put_pages_list(struct iommu_pages_list *list);
 
+#if IS_ENABLED(CONFIG_IOMMU_LIVEUPDATE)
+int iommu_preserve_page(void *virt);
+void iommu_unpreserve_page(void *virt);
+int iommu_preserve_pages(struct iommu_pages_list *list);
+void iommu_unpreserve_pages(struct iommu_pages_list *list, int count);
+void iommu_restore_page(u64 phys);
+#else
+static inline int iommu_preserve_page(void *virt)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline void iommu_unpreserve_page(void *virt)
+{
+}
+
+static inline int iommu_preserve_pages(struct iommu_pages_list *list)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline void iommu_unpreserve_pages(struct iommu_pages_list *list, int count)
+{
+}
+
+static inline void iommu_restore_page(u64 phys)
+{
+}
+#endif
+
 /**
  * iommu_pages_list_add - add the page to a iommu_pages_list
  * @list: List to add the page to
-- 
2.53.0.rc2.204.g2597b5adb4-goog


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH 05/14] iommupt: Implement preserve/unpreserve/restore callbacks
  2026-02-03 22:09 [PATCH 00/14] iommu: Add live update state preservation Samiullah Khawaja
                   ` (3 preceding siblings ...)
  2026-02-03 22:09 ` [PATCH 04/14] iommu/pages: Add APIs to preserve/unpreserve/restore iommu pages Samiullah Khawaja
@ 2026-02-03 22:09 ` Samiullah Khawaja
  2026-03-20 21:57   ` Pranjal Shrivastava
  2026-02-03 22:09 ` [PATCH 06/14] iommu/vt-d: Implement device and iommu preserve/unpreserve ops Samiullah Khawaja
                   ` (8 subsequent siblings)
  13 siblings, 1 reply; 98+ messages in thread
From: Samiullah Khawaja @ 2026-02-03 22:09 UTC (permalink / raw)
  To: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe
  Cc: Samiullah Khawaja, Robin Murphy, Kevin Tian, Alex Williamson,
	Shuah Khan, iommu, linux-kernel, kvm, Saeed Mahameed,
	Adithya Jayachandran, Parav Pandit, Leon Romanovsky, William Tu,
	Pratyush Yadav, Pasha Tatashin, David Matlack, Andrew Morton,
	Chris Li, Pranjal Shrivastava, Vipin Sharma, YiFei Zhu

Implement the iommu domain ops for preservation, unpreservation and
restoration of iommu domains for liveupdate. Use the existing page
walker to preserve the ioptdesc of the top_table and the lower tables.
Also preserve the top_level so it can be restored during boot.
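With these ops wired up through IOMMU_PT_DOMAIN_OPS(), the intended
call flow is roughly the following sketch (not driver code; the ser
object comes from the core's serialization store introduced earlier in
the series):

```c
/*
 * Outgoing kernel: walk the page table, preserve every table page via
 * KHO, and record top_table/top_level in the serialized state.
 */
ret = domain->ops->preserve(domain, ser);

/*
 * Incoming kernel: a freshly allocated domain adopts the preserved top
 * table; the empty table it was created with is freed and every
 * preserved table page is restored from KHO.
 */
ret = domain->ops->restore(domain, ser);

/*
 * If preservation is abandoned before kexec, drop the KHO
 * registrations again.
 */
domain->ops->unpreserve(domain, ser);
```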

Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
---
 drivers/iommu/generic_pt/iommu_pt.h | 96 +++++++++++++++++++++++++++++
 include/linux/generic_pt/iommu.h    | 10 +++
 2 files changed, 106 insertions(+)

diff --git a/drivers/iommu/generic_pt/iommu_pt.h b/drivers/iommu/generic_pt/iommu_pt.h
index 3327116a441c..0a1adb6312dd 100644
--- a/drivers/iommu/generic_pt/iommu_pt.h
+++ b/drivers/iommu/generic_pt/iommu_pt.h
@@ -921,6 +921,102 @@ int DOMAIN_NS(map_pages)(struct iommu_domain *domain, unsigned long iova,
 }
 EXPORT_SYMBOL_NS_GPL(DOMAIN_NS(map_pages), "GENERIC_PT_IOMMU");
 
+/**
+ * unpreserve() - Unpreserve page tables and other state of a domain.
+ * @domain: Domain to unpreserve
+ */
+void DOMAIN_NS(unpreserve)(struct iommu_domain *domain, struct iommu_domain_ser *ser)
+{
+	struct pt_iommu *iommu_table =
+		container_of(domain, struct pt_iommu, domain);
+	struct pt_common *common = common_from_iommu(iommu_table);
+	struct pt_range range = pt_all_range(common);
+	struct pt_iommu_collect_args collect = {
+		.free_list = IOMMU_PAGES_LIST_INIT(collect.free_list),
+	};
+
+	iommu_pages_list_add(&collect.free_list, range.top_table);
+	pt_walk_range(&range, __collect_tables, &collect);
+
+	iommu_unpreserve_pages(&collect.free_list, -1);
+}
+EXPORT_SYMBOL_NS_GPL(DOMAIN_NS(unpreserve), "GENERIC_PT_IOMMU");
+
+/**
+ * preserve() - Preserve page tables and other state of a domain.
+ * @domain: Domain to preserve
+ *
+ * Returns: 0 on success, -ERRNO on failure.
+ */
+int DOMAIN_NS(preserve)(struct iommu_domain *domain, struct iommu_domain_ser *ser)
+{
+	struct pt_iommu *iommu_table =
+		container_of(domain, struct pt_iommu, domain);
+	struct pt_common *common = common_from_iommu(iommu_table);
+	struct pt_range range = pt_all_range(common);
+	struct pt_iommu_collect_args collect = {
+		.free_list = IOMMU_PAGES_LIST_INIT(collect.free_list),
+	};
+	int ret;
+
+	iommu_pages_list_add(&collect.free_list, range.top_table);
+	pt_walk_range(&range, __collect_tables, &collect);
+
+	ret = iommu_preserve_pages(&collect.free_list);
+	if (ret)
+		return ret;
+
+	ser->top_table = virt_to_phys(range.top_table);
+	ser->top_level = range.top_level;
+
+	return 0;
+}
+EXPORT_SYMBOL_NS_GPL(DOMAIN_NS(preserve), "GENERIC_PT_IOMMU");
+
+static int __restore_tables(struct pt_range *range, void *arg,
+			    unsigned int level, struct pt_table_p *table)
+{
+	struct pt_state pts = pt_init(range, level, table);
+	int ret;
+
+	for_each_pt_level_entry(&pts) {
+		if (pts.type == PT_ENTRY_TABLE) {
+			iommu_restore_page(virt_to_phys(pts.table_lower));
+			ret = pt_descend(&pts, arg, __restore_tables);
+			if (ret)
+				return ret;
+		}
+	}
+	return 0;
+}
+
+/**
+ * restore() - Restore page tables and other state of a domain.
+ * @domain: Domain to restore
+ *
+ * Returns: 0 on success, -ERRNO on failure.
+ */
+int DOMAIN_NS(restore)(struct iommu_domain *domain, struct iommu_domain_ser *ser)
+{
+	struct pt_iommu *iommu_table =
+		container_of(domain, struct pt_iommu, domain);
+	struct pt_common *common = common_from_iommu(iommu_table);
+	struct pt_range range = pt_all_range(common);
+
+	iommu_restore_page(ser->top_table);
+
+	/* Free new table */
+	iommu_free_pages(range.top_table);
+
+	/* Set the restored top table */
+	pt_top_set(common, phys_to_virt(ser->top_table), ser->top_level);
+
+	/* Restore all pages */
+	range = pt_all_range(common);
+	return pt_walk_range(&range, __restore_tables, NULL);
+}
+EXPORT_SYMBOL_NS_GPL(DOMAIN_NS(restore), "GENERIC_PT_IOMMU");
+
 struct pt_unmap_args {
 	struct iommu_pages_list free_list;
 	pt_vaddr_t unmapped;
diff --git a/include/linux/generic_pt/iommu.h b/include/linux/generic_pt/iommu.h
index 9eefbb74efd0..b824a8642571 100644
--- a/include/linux/generic_pt/iommu.h
+++ b/include/linux/generic_pt/iommu.h
@@ -13,6 +13,7 @@ struct iommu_iotlb_gather;
 struct pt_iommu_ops;
 struct pt_iommu_driver_ops;
 struct iommu_dirty_bitmap;
+struct iommu_domain_ser;
 
 /**
  * DOC: IOMMU Radix Page Table
@@ -198,6 +199,12 @@ struct pt_iommu_cfg {
 				       unsigned long iova, phys_addr_t paddr,  \
 				       size_t pgsize, size_t pgcount,          \
 				       int prot, gfp_t gfp, size_t *mapped);   \
+	int pt_iommu_##fmt##_preserve(struct iommu_domain *domain,             \
+				      struct iommu_domain_ser *ser);           \
+	void pt_iommu_##fmt##_unpreserve(struct iommu_domain *domain,          \
+					 struct iommu_domain_ser *ser);        \
+	int pt_iommu_##fmt##_restore(struct iommu_domain *domain,              \
+				     struct iommu_domain_ser *ser);            \
 	size_t pt_iommu_##fmt##_unmap_pages(                                   \
 		struct iommu_domain *domain, unsigned long iova,               \
 		size_t pgsize, size_t pgcount,                                 \
@@ -224,6 +231,9 @@ struct pt_iommu_cfg {
 #define IOMMU_PT_DOMAIN_OPS(fmt)                        \
 	.iova_to_phys = &pt_iommu_##fmt##_iova_to_phys, \
 	.map_pages = &pt_iommu_##fmt##_map_pages,       \
+	.preserve = &pt_iommu_##fmt##_preserve,		\
+	.unpreserve = &pt_iommu_##fmt##_unpreserve,	\
+	.restore = &pt_iommu_##fmt##_restore,		\
 	.unmap_pages = &pt_iommu_##fmt##_unmap_pages
 #define IOMMU_PT_DIRTY_OPS(fmt) \
 	.read_and_clear_dirty = &pt_iommu_##fmt##_read_and_clear_dirty
-- 
2.53.0.rc2.204.g2597b5adb4-goog


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH 06/14] iommu/vt-d: Implement device and iommu preserve/unpreserve ops
  2026-02-03 22:09 [PATCH 00/14] iommu: Add live update state preservation Samiullah Khawaja
                   ` (4 preceding siblings ...)
  2026-02-03 22:09 ` [PATCH 05/14] iommupt: Implement preserve/unpreserve/restore callbacks Samiullah Khawaja
@ 2026-02-03 22:09 ` Samiullah Khawaja
  2026-03-19 16:04   ` Vipin Sharma
  2026-03-20 23:01   ` Pranjal Shrivastava
  2026-02-03 22:09 ` [PATCH 07/14] iommu/vt-d: Restore IOMMU state and reclaimed domain ids Samiullah Khawaja
                   ` (7 subsequent siblings)
  13 siblings, 2 replies; 98+ messages in thread
From: Samiullah Khawaja @ 2026-02-03 22:09 UTC (permalink / raw)
  To: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe
  Cc: Samiullah Khawaja, Robin Murphy, Kevin Tian, Alex Williamson,
	Shuah Khan, iommu, linux-kernel, kvm, Saeed Mahameed,
	Adithya Jayachandran, Parav Pandit, Leon Romanovsky, William Tu,
	Pratyush Yadav, Pasha Tatashin, David Matlack, Andrew Morton,
	Chris Li, Pranjal Shrivastava, Vipin Sharma, YiFei Zhu

Add the device and IOMMU preservation implementation in a separate
file, and set the device and IOMMU preserve/unpreserve ops in struct
iommu_ops.

During a normal shutdown, IOMMU translation is disabled. When the root
table is preserved for live update, it must instead be cleaned up: the
context entries of unpreserved devices are cleared while translation is
left enabled.
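
The cleanup walks the VT-d root table; its geometry can be sketched as
follows (a userspace sketch assuming the 4 KiB table and 16-byte
root-entry sizes from the VT-d spec; names are illustrative, not the
kernel's):

```c
#include <stdint.h>

/* VT-d root-table geometry assumed from the spec: one 4 KiB table of
 * 16-byte root entries, one entry per PCI bus. */
#define VTD_PAGE_SIZE 4096u
struct root_entry { uint64_t lo, hi; };
#define ROOT_ENTRY_NR (VTD_PAGE_SIZE / sizeof(struct root_entry))

/* In scalable mode each root entry points at two context tables: a
 * lower one covering devfn 0..0x7f and an upper one covering devfn
 * 0x80..0xff, which is why the cleanup and preserve loops probe both
 * devfn 0 and devfn 0x80 for every bus. */
static unsigned int context_table_half(uint8_t devfn, int sm_supported)
{
	return sm_supported ? (devfn >> 7) : 0;
}
```

This mirrors why unpreserve_iommu_context() in this patch calls
iommu_context_addr() twice per root entry when sm_supported() is true.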

Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
---
 drivers/iommu/intel/Makefile     |   1 +
 drivers/iommu/intel/iommu.c      |  47 ++++++++++-
 drivers/iommu/intel/iommu.h      |  27 +++++++
 drivers/iommu/intel/liveupdate.c | 134 +++++++++++++++++++++++++++++++
 4 files changed, 205 insertions(+), 4 deletions(-)
 create mode 100644 drivers/iommu/intel/liveupdate.c

diff --git a/drivers/iommu/intel/Makefile b/drivers/iommu/intel/Makefile
index ada651c4a01b..d38fc101bc35 100644
--- a/drivers/iommu/intel/Makefile
+++ b/drivers/iommu/intel/Makefile
@@ -6,3 +6,4 @@ obj-$(CONFIG_INTEL_IOMMU_DEBUGFS) += debugfs.o
 obj-$(CONFIG_INTEL_IOMMU_SVM) += svm.o
 obj-$(CONFIG_IRQ_REMAP) += irq_remapping.o
 obj-$(CONFIG_INTEL_IOMMU_PERF_EVENTS) += perfmon.o
+obj-$(CONFIG_IOMMU_LIVEUPDATE) += liveupdate.o
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 134302fbcd92..c95de93fb72f 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -16,6 +16,7 @@
 #include <linux/crash_dump.h>
 #include <linux/dma-direct.h>
 #include <linux/dmi.h>
+#include <linux/iommu-lu.h>
 #include <linux/memory.h>
 #include <linux/pci.h>
 #include <linux/pci-ats.h>
@@ -52,6 +53,8 @@ static int rwbf_quirk;
 
 #define rwbf_required(iommu)	(rwbf_quirk || cap_rwbf((iommu)->cap))
 
+static bool __maybe_clean_unpreserved_context_entries(struct intel_iommu *iommu);
+
 /*
  * set to 1 to panic kernel if can't successfully enable VT-d
  * (used when kernel is launched w/ TXT)
@@ -60,8 +63,6 @@ static int force_on = 0;
 static int intel_iommu_tboot_noforce;
 static int no_platform_optin;
 
-#define ROOT_ENTRY_NR (VTD_PAGE_SIZE/sizeof(struct root_entry))
-
 /*
  * Take a root_entry and return the Lower Context Table Pointer (LCTP)
  * if marked present.
@@ -2378,8 +2379,10 @@ void intel_iommu_shutdown(void)
 		/* Disable PMRs explicitly here. */
 		iommu_disable_protect_mem_regions(iommu);
 
-		/* Make sure the IOMMUs are switched off */
-		iommu_disable_translation(iommu);
+		if (!__maybe_clean_unpreserved_context_entries(iommu)) {
+			/* Make sure the IOMMUs are switched off */
+			iommu_disable_translation(iommu);
+		}
 	}
 }
 
@@ -2902,6 +2905,38 @@ static const struct iommu_dirty_ops intel_second_stage_dirty_ops = {
 	.set_dirty_tracking = intel_iommu_set_dirty_tracking,
 };
 
+#ifdef CONFIG_IOMMU_LIVEUPDATE
+static bool __maybe_clean_unpreserved_context_entries(struct intel_iommu *iommu)
+{
+	struct device_domain_info *info;
+	struct pci_dev *pdev = NULL;
+
+	if (!iommu->iommu.outgoing_preserved_state)
+		return false;
+
+	for_each_pci_dev(pdev) {
+		info = dev_iommu_priv_get(&pdev->dev);
+		if (!info)
+			continue;
+
+		if (info->iommu != iommu)
+			continue;
+
+		if (dev_iommu_preserved_state(&pdev->dev))
+			continue;
+
+		domain_context_clear(info);
+	}
+
+	return true;
+}
+#else
+static bool __maybe_clean_unpreserved_context_entries(struct intel_iommu *iommu)
+{
+	return false;
+}
+#endif
+
 static struct iommu_domain *
 intel_iommu_domain_alloc_second_stage(struct device *dev,
 				      struct intel_iommu *iommu, u32 flags)
@@ -3925,6 +3960,10 @@ const struct iommu_ops intel_iommu_ops = {
 	.is_attach_deferred	= intel_iommu_is_attach_deferred,
 	.def_domain_type	= device_def_domain_type,
 	.page_response		= intel_iommu_page_response,
+	.preserve_device	= intel_iommu_preserve_device,
+	.unpreserve_device	= intel_iommu_unpreserve_device,
+	.preserve		= intel_iommu_preserve,
+	.unpreserve		= intel_iommu_unpreserve,
 };
 
 static void quirk_iommu_igfx(struct pci_dev *dev)
diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
index 25c5e22096d4..70032e86437d 100644
--- a/drivers/iommu/intel/iommu.h
+++ b/drivers/iommu/intel/iommu.h
@@ -557,6 +557,8 @@ struct root_entry {
 	u64     hi;
 };
 
+#define ROOT_ENTRY_NR (VTD_PAGE_SIZE / sizeof(struct root_entry))
+
 /*
  * low 64 bits:
  * 0: present
@@ -1276,6 +1278,31 @@ static inline int iopf_for_domain_replace(struct iommu_domain *new,
 	return 0;
 }
 
+#ifdef CONFIG_IOMMU_LIVEUPDATE
+int intel_iommu_preserve_device(struct device *dev, struct device_ser *device_ser);
+void intel_iommu_unpreserve_device(struct device *dev, struct device_ser *device_ser);
+int intel_iommu_preserve(struct iommu_device *iommu, struct iommu_ser *iommu_ser);
+void intel_iommu_unpreserve(struct iommu_device *iommu, struct iommu_ser *iommu_ser);
+#else
+static inline int intel_iommu_preserve_device(struct device *dev, struct device_ser *device_ser)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline void intel_iommu_unpreserve_device(struct device *dev, struct device_ser *device_ser)
+{
+}
+
+static inline int intel_iommu_preserve(struct iommu_device *iommu, struct iommu_ser *iommu_ser)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline void intel_iommu_unpreserve(struct iommu_device *iommu, struct iommu_ser *iommu_ser)
+{
+}
+#endif
+
 #ifdef CONFIG_INTEL_IOMMU_SVM
 void intel_svm_check(struct intel_iommu *iommu);
 struct iommu_domain *intel_svm_domain_alloc(struct device *dev,
diff --git a/drivers/iommu/intel/liveupdate.c b/drivers/iommu/intel/liveupdate.c
new file mode 100644
index 000000000000..82ba1daf1711
--- /dev/null
+++ b/drivers/iommu/intel/liveupdate.c
@@ -0,0 +1,134 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+/*
+ * Copyright (C) 2025, Google LLC
+ * Author: Samiullah Khawaja <skhawaja@google.com>
+ */
+
+#define pr_fmt(fmt)    "iommu: liveupdate: " fmt
+
+#include <linux/kexec_handover.h>
+#include <linux/liveupdate.h>
+#include <linux/iommu-lu.h>
+#include <linux/module.h>
+#include <linux/pci.h>
+
+#include "iommu.h"
+#include "../iommu-pages.h"
+
+static void unpreserve_iommu_context(struct intel_iommu *iommu, int end)
+{
+	struct context_entry *context;
+	int i;
+
+	if (end < 0)
+		end = ROOT_ENTRY_NR;
+
+	for (i = 0; i < end; i++) {
+		context = iommu_context_addr(iommu, i, 0, 0);
+		if (context)
+			iommu_unpreserve_page(context);
+
+		if (!sm_supported(iommu))
+			continue;
+
+		context = iommu_context_addr(iommu, i, 0x80, 0);
+		if (context)
+			iommu_unpreserve_page(context);
+	}
+}
+
+static int preserve_iommu_context(struct intel_iommu *iommu)
+{
+	struct context_entry *context;
+	int ret;
+	int i;
+
+	for (i = 0; i < ROOT_ENTRY_NR; i++) {
+		context = iommu_context_addr(iommu, i, 0, 0);
+		if (context) {
+			ret = iommu_preserve_page(context);
+			if (ret)
+				goto error;
+		}
+
+		if (!sm_supported(iommu))
+			continue;
+
+		context = iommu_context_addr(iommu, i, 0x80, 0);
+		if (context) {
+			ret = iommu_preserve_page(context);
+			if (ret)
+				goto error_sm;
+		}
+	}
+
+	return 0;
+
+error_sm:
+	context = iommu_context_addr(iommu, i, 0, 0);
+	iommu_unpreserve_page(context);
+error:
+	unpreserve_iommu_context(iommu, i);
+	return ret;
+}
+
+int intel_iommu_preserve_device(struct device *dev, struct device_ser *device_ser)
+{
+	struct device_domain_info *info = dev_iommu_priv_get(dev);
+
+	if (!dev_is_pci(dev))
+		return -EOPNOTSUPP;
+
+	if (!info)
+		return -EINVAL;
+
+	device_ser->domain_iommu_ser.did = domain_id_iommu(info->domain, info->iommu);
+	return 0;
+}
+
+void intel_iommu_unpreserve_device(struct device *dev, struct device_ser *device_ser)
+{
+}
+
+int intel_iommu_preserve(struct iommu_device *iommu_dev, struct iommu_ser *ser)
+{
+	struct intel_iommu *iommu;
+	int ret;
+
+	iommu = container_of(iommu_dev, struct intel_iommu, iommu);
+
+	spin_lock(&iommu->lock);
+	ret = preserve_iommu_context(iommu);
+	if (ret)
+		goto err;
+
+	ret = iommu_preserve_page(iommu->root_entry);
+	if (ret) {
+		unpreserve_iommu_context(iommu, -1);
+		goto err;
+	}
+
+	ser->intel.phys_addr = iommu->reg_phys;
+	ser->intel.root_table = __pa(iommu->root_entry);
+	ser->type = IOMMU_INTEL;
+	ser->token = ser->intel.phys_addr;
+	spin_unlock(&iommu->lock);
+
+	return 0;
+err:
+	spin_unlock(&iommu->lock);
+	return ret;
+}
+
+void intel_iommu_unpreserve(struct iommu_device *iommu_dev, struct iommu_ser *iommu_ser)
+{
+	struct intel_iommu *iommu;
+
+	iommu = container_of(iommu_dev, struct intel_iommu, iommu);
+
+	spin_lock(&iommu->lock);
+	unpreserve_iommu_context(iommu, -1);
+	iommu_unpreserve_page(iommu->root_entry);
+	spin_unlock(&iommu->lock);
+}
-- 
2.53.0.rc2.204.g2597b5adb4-goog


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH 07/14] iommu/vt-d: Restore IOMMU state and reclaimed domain ids
  2026-02-03 22:09 [PATCH 00/14] iommu: Add live update state preservation Samiullah Khawaja
                   ` (5 preceding siblings ...)
  2026-02-03 22:09 ` [PATCH 06/14] iommu/vt-d: Implement device and iommu preserve/unpreserve ops Samiullah Khawaja
@ 2026-02-03 22:09 ` Samiullah Khawaja
  2026-03-19 20:54   ` Vipin Sharma
  2026-03-22 19:51   ` Pranjal Shrivastava
  2026-02-03 22:09 ` [PATCH 08/14] iommu: Restore and reattach preserved domains to devices Samiullah Khawaja
                   ` (6 subsequent siblings)
  13 siblings, 2 replies; 98+ messages in thread
From: Samiullah Khawaja @ 2026-02-03 22:09 UTC (permalink / raw)
  To: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe
  Cc: Samiullah Khawaja, Robin Murphy, Kevin Tian, Alex Williamson,
	Shuah Khan, iommu, linux-kernel, kvm, Saeed Mahameed,
	Adithya Jayachandran, Parav Pandit, Leon Romanovsky, William Tu,
	Pratyush Yadav, Pasha Tatashin, David Matlack, Andrew Morton,
	Chris Li, Pranjal Shrivastava, Vipin Sharma, YiFei Zhu

During boot, fetch the preserved state of each IOMMU unit and, if
found, restore it.

- Reuse the root_table that was preserved in the previous kernel.
- Reclaim the domain ids of the preserved domains for each preserved
  device so they cannot be acquired by another domain.
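
The reclaim step can be illustrated with a small userspace analogue (a
hypothetical bitmap allocator standing in for the kernel's IDA;
reclaim_did() plays the role of ida_alloc_range(&iommu->domain_ida,
id, id, ...) claiming one exact id):

```c
#include <stdint.h>

#define NDOMAINS 64
static uint64_t dom_bitmap;	/* bit n set => domain id n in use */

/* Claim one exact id for a restored domain, analogous to
 * ida_alloc_range(&ida, id, id, ...). */
static int reclaim_did(unsigned int id)
{
	if (id >= NDOMAINS || (dom_bitmap & (1ULL << id)))
		return -1;
	dom_bitmap |= 1ULL << id;
	return (int)id;
}

/* Allocate a fresh id for a new domain; id 0 stays reserved, and
 * reclaimed ids are skipped because their bits are already set. */
static int alloc_did(void)
{
	for (unsigned int id = 1; id < NDOMAINS; id++) {
		if (!(dom_bitmap & (1ULL << id))) {
			dom_bitmap |= 1ULL << id;
			return (int)id;
		}
	}
	return -1;
}
```

Reclaiming before any fresh allocation is what guarantees that a
restored domain's id is never handed to a newly created domain.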

Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
---
 drivers/iommu/intel/iommu.c      | 26 +++++++++++++++------
 drivers/iommu/intel/iommu.h      |  7 ++++++
 drivers/iommu/intel/liveupdate.c | 40 ++++++++++++++++++++++++++++++++
 3 files changed, 66 insertions(+), 7 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index c95de93fb72f..8acb7f8a7627 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -222,12 +222,12 @@ static void clear_translation_pre_enabled(struct intel_iommu *iommu)
 	iommu->flags &= ~VTD_FLAG_TRANS_PRE_ENABLED;
 }
 
-static void init_translation_status(struct intel_iommu *iommu)
+static void init_translation_status(struct intel_iommu *iommu, bool restoring)
 {
 	u32 gsts;
 
 	gsts = readl(iommu->reg + DMAR_GSTS_REG);
-	if (gsts & DMA_GSTS_TES)
+	if (!restoring && (gsts & DMA_GSTS_TES))
 		iommu->flags |= VTD_FLAG_TRANS_PRE_ENABLED;
 }
 
@@ -670,10 +670,16 @@ void dmar_fault_dump_ptes(struct intel_iommu *iommu, u16 source_id,
 #endif
 
 /* iommu handling */
-static int iommu_alloc_root_entry(struct intel_iommu *iommu)
+static int iommu_alloc_root_entry(struct intel_iommu *iommu, struct iommu_ser *restored_state)
 {
 	struct root_entry *root;
 
+	if (restored_state) {
+		intel_iommu_liveupdate_restore_root_table(iommu, restored_state);
+		__iommu_flush_cache(iommu, iommu->root_entry, ROOT_SIZE);
+		return 0;
+	}
+
 	root = iommu_alloc_pages_node_sz(iommu->node, GFP_ATOMIC, SZ_4K);
 	if (!root) {
 		pr_err("Allocating root entry for %s failed\n",
@@ -1614,6 +1620,7 @@ static int copy_translation_tables(struct intel_iommu *iommu)
 
 static int __init init_dmars(void)
 {
+	struct iommu_ser *iommu_ser = NULL;
 	struct dmar_drhd_unit *drhd;
 	struct intel_iommu *iommu;
 	int ret;
@@ -1636,8 +1643,10 @@ static int __init init_dmars(void)
 						   intel_pasid_max_id);
 		}
 
+		iommu_ser = iommu_get_preserved_data(iommu->reg_phys, IOMMU_INTEL);
+
 		intel_iommu_init_qi(iommu);
-		init_translation_status(iommu);
+		init_translation_status(iommu, !!iommu_ser);
 
 		if (translation_pre_enabled(iommu) && !is_kdump_kernel()) {
 			iommu_disable_translation(iommu);
@@ -1651,7 +1660,7 @@ static int __init init_dmars(void)
 		 * we could share the same root & context tables
 		 * among all IOMMU's. Need to Split it later.
 		 */
-		ret = iommu_alloc_root_entry(iommu);
+		ret = iommu_alloc_root_entry(iommu, iommu_ser);
 		if (ret)
 			goto free_iommu;
 
@@ -2110,15 +2119,18 @@ int dmar_parse_one_satc(struct acpi_dmar_header *hdr, void *arg)
 static int intel_iommu_add(struct dmar_drhd_unit *dmaru)
 {
 	struct intel_iommu *iommu = dmaru->iommu;
+	struct iommu_ser *iommu_ser = NULL;
 	int ret;
 
+	iommu_ser = iommu_get_preserved_data(iommu->reg_phys, IOMMU_INTEL);
+
 	/*
 	 * Disable translation if already enabled prior to OS handover.
 	 */
-	if (iommu->gcmd & DMA_GCMD_TE)
+	if (!iommu_ser && iommu->gcmd & DMA_GCMD_TE)
 		iommu_disable_translation(iommu);
 
-	ret = iommu_alloc_root_entry(iommu);
+	ret = iommu_alloc_root_entry(iommu, iommu_ser);
 	if (ret)
 		goto out;
 
diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
index 70032e86437d..d7bf63aff17d 100644
--- a/drivers/iommu/intel/iommu.h
+++ b/drivers/iommu/intel/iommu.h
@@ -1283,6 +1283,8 @@ int intel_iommu_preserve_device(struct device *dev, struct device_ser *device_se
 void intel_iommu_unpreserve_device(struct device *dev, struct device_ser *device_ser);
 int intel_iommu_preserve(struct iommu_device *iommu, struct iommu_ser *iommu_ser);
 void intel_iommu_unpreserve(struct iommu_device *iommu, struct iommu_ser *iommu_ser);
+void intel_iommu_liveupdate_restore_root_table(struct intel_iommu *iommu,
+					       struct iommu_ser *iommu_ser);
 #else
 static inline int intel_iommu_preserve_device(struct device *dev, struct device_ser *device_ser)
 {
@@ -1301,6 +1303,11 @@ static inline int intel_iommu_preserve(struct iommu_device *iommu, struct iommu_
 static inline void intel_iommu_unpreserve(struct iommu_device *iommu, struct iommu_ser *iommu_ser)
 {
 }
+
+static inline void intel_iommu_liveupdate_restore_root_table(struct intel_iommu *iommu,
+							     struct iommu_ser *iommu_ser)
+{
+}
 #endif
 
 #ifdef CONFIG_INTEL_IOMMU_SVM
diff --git a/drivers/iommu/intel/liveupdate.c b/drivers/iommu/intel/liveupdate.c
index 82ba1daf1711..6dcb5783d1db 100644
--- a/drivers/iommu/intel/liveupdate.c
+++ b/drivers/iommu/intel/liveupdate.c
@@ -73,6 +73,46 @@ static int preserve_iommu_context(struct intel_iommu *iommu)
 	return ret;
 }
 
+static void restore_iommu_context(struct intel_iommu *iommu)
+{
+	struct context_entry *context;
+	int i;
+
+	for (i = 0; i < ROOT_ENTRY_NR; i++) {
+		context = iommu_context_addr(iommu, i, 0, 0);
+		if (context)
+			BUG_ON(!kho_restore_folio(virt_to_phys(context)));
+
+		if (!sm_supported(iommu))
+			continue;
+
+		context = iommu_context_addr(iommu, i, 0x80, 0);
+		if (context)
+			BUG_ON(!kho_restore_folio(virt_to_phys(context)));
+	}
+}
+
+static int __restore_used_domain_ids(struct device_ser *ser, void *arg)
+{
+	int id = ser->domain_iommu_ser.did;
+	struct intel_iommu *iommu = arg;
+
+	ida_alloc_range(&iommu->domain_ida, id, id, GFP_ATOMIC);
+	return 0;
+}
+
+void intel_iommu_liveupdate_restore_root_table(struct intel_iommu *iommu,
+					       struct iommu_ser *iommu_ser)
+{
+	BUG_ON(!kho_restore_folio(iommu_ser->intel.root_table));
+	iommu->root_entry = __va(iommu_ser->intel.root_table);
+
+	restore_iommu_context(iommu);
+	iommu_for_each_preserved_device(__restore_used_domain_ids, iommu);
+	pr_info("Restored IOMMU[0x%llx] Root Table at: 0x%llx\n",
+		iommu->reg_phys, iommu_ser->intel.root_table);
+}
+
 int intel_iommu_preserve_device(struct device *dev, struct device_ser *device_ser)
 {
 	struct device_domain_info *info = dev_iommu_priv_get(dev);
-- 
2.53.0.rc2.204.g2597b5adb4-goog


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH 08/14] iommu: Restore and reattach preserved domains to devices
  2026-02-03 22:09 [PATCH 00/14] iommu: Add live update state preservation Samiullah Khawaja
                   ` (6 preceding siblings ...)
  2026-02-03 22:09 ` [PATCH 07/14] iommu/vt-d: Restore IOMMU state and reclaimed domain ids Samiullah Khawaja
@ 2026-02-03 22:09 ` Samiullah Khawaja
  2026-03-10  5:16   ` Ankit Soni
  2026-03-22 21:59   ` Pranjal Shrivastava
  2026-02-03 22:09 ` [PATCH 09/14] iommu/vt-d: preserve PASID table of preserved device Samiullah Khawaja
                   ` (5 subsequent siblings)
  13 siblings, 2 replies; 98+ messages in thread
From: Samiullah Khawaja @ 2026-02-03 22:09 UTC (permalink / raw)
  To: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe
  Cc: Samiullah Khawaja, Robin Murphy, Kevin Tian, Alex Williamson,
	Shuah Khan, iommu, linux-kernel, kvm, Saeed Mahameed,
	Adithya Jayachandran, Parav Pandit, Leon Romanovsky, William Tu,
	Pratyush Yadav, Pasha Tatashin, David Matlack, Andrew Morton,
	Chris Li, Pranjal Shrivastava, Vipin Sharma, YiFei Zhu

Restore preserved domains by rebuilding their page tables through the
restore IOMMU domain op. Reattach each preserved domain to its device
during default domain setup, reusing the domain ID from the previous
kernel. Context entry setup is not needed since the context entries
are preserved across the live update.

Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
---
 drivers/iommu/intel/iommu.c  | 40 ++++++++++++++++++------------
 drivers/iommu/intel/iommu.h  |  3 ++-
 drivers/iommu/intel/nested.c |  2 +-
 drivers/iommu/iommu.c        | 47 ++++++++++++++++++++++++++++++++++--
 drivers/iommu/liveupdate.c   | 31 ++++++++++++++++++++++++
 include/linux/iommu-lu.h     |  8 ++++++
 6 files changed, 112 insertions(+), 19 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 8acb7f8a7627..83faad53f247 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -1029,7 +1029,8 @@ static bool first_level_by_default(struct intel_iommu *iommu)
 	return true;
 }
 
-int domain_attach_iommu(struct dmar_domain *domain, struct intel_iommu *iommu)
+int domain_attach_iommu(struct dmar_domain *domain, struct intel_iommu *iommu,
+			int restore_did)
 {
 	struct iommu_domain_info *info, *curr;
 	int num, ret = -ENOSPC;
@@ -1049,8 +1050,11 @@ int domain_attach_iommu(struct dmar_domain *domain, struct intel_iommu *iommu)
 		return 0;
 	}
 
-	num = ida_alloc_range(&iommu->domain_ida, IDA_START_DID,
-			      cap_ndoms(iommu->cap) - 1, GFP_KERNEL);
+	if (restore_did >= 0)
+		num = restore_did;
+	else
+		num = ida_alloc_range(&iommu->domain_ida, IDA_START_DID,
+				      cap_ndoms(iommu->cap) - 1, GFP_KERNEL);
 	if (num < 0) {
 		pr_err("%s: No free domain ids\n", iommu->name);
 		goto err_unlock;
@@ -1321,10 +1325,14 @@ static int dmar_domain_attach_device(struct dmar_domain *domain,
 {
 	struct device_domain_info *info = dev_iommu_priv_get(dev);
 	struct intel_iommu *iommu = info->iommu;
+	struct device_ser *device_ser = NULL;
 	unsigned long flags;
 	int ret;
 
-	ret = domain_attach_iommu(domain, iommu);
+	device_ser = dev_iommu_restored_state(dev);
+
+	ret = domain_attach_iommu(domain, iommu,
+				  dev_iommu_restore_did(dev, &domain->domain));
 	if (ret)
 		return ret;
 
@@ -1337,16 +1345,18 @@ static int dmar_domain_attach_device(struct dmar_domain *domain,
 	if (dev_is_real_dma_subdevice(dev))
 		return 0;
 
-	if (!sm_supported(iommu))
-		ret = domain_context_mapping(domain, dev);
-	else if (intel_domain_is_fs_paging(domain))
-		ret = domain_setup_first_level(iommu, domain, dev,
-					       IOMMU_NO_PASID, NULL);
-	else if (intel_domain_is_ss_paging(domain))
-		ret = domain_setup_second_level(iommu, domain, dev,
-						IOMMU_NO_PASID, NULL);
-	else if (WARN_ON(true))
-		ret = -EINVAL;
+	if (!device_ser) {
+		if (!sm_supported(iommu))
+			ret = domain_context_mapping(domain, dev);
+		else if (intel_domain_is_fs_paging(domain))
+			ret = domain_setup_first_level(iommu, domain, dev,
+						       IOMMU_NO_PASID, NULL);
+		else if (intel_domain_is_ss_paging(domain))
+			ret = domain_setup_second_level(iommu, domain, dev,
+							IOMMU_NO_PASID, NULL);
+		else if (WARN_ON(true))
+			ret = -EINVAL;
+	}
 
 	if (ret)
 		goto out_block_translation;
@@ -3630,7 +3640,7 @@ domain_add_dev_pasid(struct iommu_domain *domain,
 	if (!dev_pasid)
 		return ERR_PTR(-ENOMEM);
 
-	ret = domain_attach_iommu(dmar_domain, iommu);
+	ret = domain_attach_iommu(dmar_domain, iommu, -1);
 	if (ret)
 		goto out_free;
 
diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
index d7bf63aff17d..057bd6035d85 100644
--- a/drivers/iommu/intel/iommu.h
+++ b/drivers/iommu/intel/iommu.h
@@ -1174,7 +1174,8 @@ void __iommu_flush_iotlb(struct intel_iommu *iommu, u16 did, u64 addr,
  */
 #define QI_OPT_WAIT_DRAIN		BIT(0)
 
-int domain_attach_iommu(struct dmar_domain *domain, struct intel_iommu *iommu);
+int domain_attach_iommu(struct dmar_domain *domain, struct intel_iommu *iommu,
+			int restore_did);
 void domain_detach_iommu(struct dmar_domain *domain, struct intel_iommu *iommu);
 void device_block_translation(struct device *dev);
 int paging_domain_compatible(struct iommu_domain *domain, struct device *dev);
diff --git a/drivers/iommu/intel/nested.c b/drivers/iommu/intel/nested.c
index a3fb8c193ca6..4fed9f5981e5 100644
--- a/drivers/iommu/intel/nested.c
+++ b/drivers/iommu/intel/nested.c
@@ -40,7 +40,7 @@ static int intel_nested_attach_dev(struct iommu_domain *domain,
 		return ret;
 	}
 
-	ret = domain_attach_iommu(dmar_domain, iommu);
+	ret = domain_attach_iommu(dmar_domain, iommu, -1);
 	if (ret) {
 		dev_err_ratelimited(dev, "Failed to attach domain to iommu\n");
 		return ret;
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index c0632cb5b570..8103b5372364 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -18,6 +18,7 @@
 #include <linux/errno.h>
 #include <linux/host1x_context_bus.h>
 #include <linux/iommu.h>
+#include <linux/iommu-lu.h>
 #include <linux/iommufd.h>
 #include <linux/idr.h>
 #include <linux/err.h>
@@ -489,6 +490,10 @@ static int iommu_init_device(struct device *dev)
 		goto err_free;
 	}
 
+#ifdef CONFIG_IOMMU_LIVEUPDATE
+	dev->iommu->device_ser = iommu_get_device_preserved_data(dev);
+#endif
+
 	iommu_dev = ops->probe_device(dev);
 	if (IS_ERR(iommu_dev)) {
 		ret = PTR_ERR(iommu_dev);
@@ -2149,6 +2154,13 @@ static int __iommu_attach_device(struct iommu_domain *domain,
 	ret = domain->ops->attach_dev(domain, dev, old);
 	if (ret)
 		return ret;
+
+#ifdef CONFIG_IOMMU_LIVEUPDATE
+	/* The associated state can be unset once restored. */
+	if (dev_iommu_restored_state(dev))
+		WRITE_ONCE(dev->iommu->device_ser, NULL);
+#endif
+
 	dev->iommu->attach_deferred = 0;
 	trace_attach_device_to_domain(dev);
 	return 0;
@@ -3061,6 +3073,34 @@ int iommu_fwspec_add_ids(struct device *dev, const u32 *ids, int num_ids)
 }
 EXPORT_SYMBOL_GPL(iommu_fwspec_add_ids);
 
+static struct iommu_domain *__iommu_group_maybe_restore_domain(struct iommu_group *group)
+{
+	struct device_ser *device_ser;
+	struct iommu_domain *domain;
+	struct device *dev;
+
+	dev = iommu_group_first_dev(group);
+	if (!dev_is_pci(dev))
+		return NULL;
+
+	device_ser = dev_iommu_restored_state(dev);
+	if (!device_ser)
+		return NULL;
+
+	domain = iommu_restore_domain(dev, device_ser);
+	if (WARN_ON(IS_ERR(domain)))
+		return NULL;
+
+	/*
+	 * The group is owned by the entity (a preserved iommufd) that provided
+	 * this token in the previous kernel. It will be used to reclaim it
+	 * later.
+	 */
+	group->owner = (void *)device_ser->token;
+	group->owner_cnt = 1;
+	return domain;
+}
+
 /**
  * iommu_setup_default_domain - Set the default_domain for the group
  * @group: Group to change
@@ -3075,8 +3115,8 @@ static int iommu_setup_default_domain(struct iommu_group *group,
 				      int target_type)
 {
 	struct iommu_domain *old_dom = group->default_domain;
+	struct iommu_domain *dom, *restored_domain;
 	struct group_device *gdev;
-	struct iommu_domain *dom;
 	bool direct_failed;
 	int req_type;
 	int ret;
@@ -3120,6 +3160,9 @@ static int iommu_setup_default_domain(struct iommu_group *group,
 	/* We must set default_domain early for __iommu_device_set_domain */
 	group->default_domain = dom;
 	if (!group->domain) {
+		restored_domain = __iommu_group_maybe_restore_domain(group);
+		if (!restored_domain)
+			restored_domain = dom;
 		/*
 		 * Drivers are not allowed to fail the first domain attach.
 		 * The only way to recover from this is to fail attaching the
@@ -3127,7 +3170,7 @@ static int iommu_setup_default_domain(struct iommu_group *group,
 		 * in group->default_domain so it is freed after.
 		 */
 		ret = __iommu_group_set_domain_internal(
-			group, dom, IOMMU_SET_DOMAIN_MUST_SUCCEED);
+			group, restored_domain, IOMMU_SET_DOMAIN_MUST_SUCCEED);
 		if (WARN_ON(ret))
 			goto out_free_old;
 	} else {
diff --git a/drivers/iommu/liveupdate.c b/drivers/iommu/liveupdate.c
index 83eb609b3fd7..6b211436ad25 100644
--- a/drivers/iommu/liveupdate.c
+++ b/drivers/iommu/liveupdate.c
@@ -501,3 +501,34 @@ void iommu_unpreserve_device(struct iommu_domain *domain, struct device *dev)
 
 	iommu_unpreserve_locked(iommu->iommu_dev);
 }
+
+struct iommu_domain *iommu_restore_domain(struct device *dev, struct device_ser *ser)
+{
+	struct iommu_domain_ser *domain_ser;
+	struct iommu_lu_flb_obj *flb_obj;
+	struct iommu_domain *domain;
+	int ret;
+
+	domain_ser = __va(ser->domain_iommu_ser.domain_phys);
+
+	ret = liveupdate_flb_get_incoming(&iommu_flb, (void **)&flb_obj);
+	if (ret)
+		return ERR_PTR(ret);
+
+	guard(mutex)(&flb_obj->lock);
+	if (domain_ser->restored_domain)
+		return domain_ser->restored_domain;
+
+	domain_ser->obj.incoming = true;
+	domain = iommu_paging_domain_alloc(dev);
+	if (IS_ERR(domain))
+		return domain;
+
+	ret = domain->ops->restore(domain, domain_ser);
+	if (ret)
+		return ERR_PTR(ret);
+
+	domain->preserved_state = domain_ser;
+	domain_ser->restored_domain = domain;
+	return domain;
+}
diff --git a/include/linux/iommu-lu.h b/include/linux/iommu-lu.h
index 48c07514a776..4879abaf83d3 100644
--- a/include/linux/iommu-lu.h
+++ b/include/linux/iommu-lu.h
@@ -65,6 +65,8 @@ static inline int dev_iommu_restore_did(struct device *dev, struct iommu_domain
 	return -1;
 }
 
+struct iommu_domain *iommu_restore_domain(struct device *dev,
+					  struct device_ser *ser);
 int iommu_for_each_preserved_device(iommu_preserved_device_iter_fn fn,
 				    void *arg);
 struct device_ser *iommu_get_device_preserved_data(struct device *dev);
@@ -95,6 +97,12 @@ static inline void *iommu_domain_restored_state(struct iommu_domain *domain)
 	return NULL;
 }
 
+static inline struct iommu_domain *iommu_restore_domain(struct device *dev,
+							struct device_ser *ser)
+{
+	return NULL;
+}
+
 static inline int iommu_for_each_preserved_device(iommu_preserved_device_iter_fn fn, void *arg)
 {
 	return -EOPNOTSUPP;
-- 
2.53.0.rc2.204.g2597b5adb4-goog


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH 09/14] iommu/vt-d: preserve PASID table of preserved device
  2026-02-03 22:09 [PATCH 00/14] iommu: Add live update state preservation Samiullah Khawaja
                   ` (7 preceding siblings ...)
  2026-02-03 22:09 ` [PATCH 08/14] iommu: Restore and reattach preserved domains to devices Samiullah Khawaja
@ 2026-02-03 22:09 ` Samiullah Khawaja
  2026-03-23 18:19   ` Pranjal Shrivastava
  2026-02-03 22:09 ` [PATCH 10/14] iommufd-lu: Implement ioctl to let userspace mark an HWPT to be preserved Samiullah Khawaja
                   ` (4 subsequent siblings)
  13 siblings, 1 reply; 98+ messages in thread
From: Samiullah Khawaja @ 2026-02-03 22:09 UTC (permalink / raw)
  To: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe
  Cc: Samiullah Khawaja, Robin Murphy, Kevin Tian, Alex Williamson,
	Shuah Khan, iommu, linux-kernel, kvm, Saeed Mahameed,
	Adithya Jayachandran, Parav Pandit, Leon Romanovsky, William Tu,
	Pratyush Yadav, Pasha Tatashin, David Matlack, Andrew Morton,
	Chris Li, Pranjal Shrivastava, Vipin Sharma, YiFei Zhu

In scalable mode the PASID table is used to fetch the I/O page tables.
Preserve and restore the PASID table of each preserved device.
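
The index arithmetic behind the preserved structures can be sketched
as below, assuming the VT-d spec's scalable-mode layout (64-byte
PASID-table entries, so one 4 KiB table covers 64 PASIDs). It shows
why preserving only the first table is enough to cover IOMMU_NO_PASID
(PASID 0):

```c
#include <stdint.h>

/* Scalable-mode PASID lookup arithmetic assumed from the VT-d spec:
 * 64-byte PASID entries, 4096 / 64 = 64 entries per table page. */
#define PASID_ENTRY_SIZE	64u
#define PASID_TBL_ENTRIES	(4096u / PASID_ENTRY_SIZE)
#define PASID_PDE_SHIFT		6

/* Index into the PASID directory (one entry per 64-PASID table). */
static unsigned int pasid_dir_index(uint32_t pasid)
{
	return pasid >> PASID_PDE_SHIFT;
}

/* Index into the PASID table selected by the directory entry. */
static unsigned int pasid_tbl_index(uint32_t pasid)
{
	return pasid & (PASID_TBL_ENTRIES - 1);
}
```

PASID 0 maps to directory entry 0, table entry 0, so dir[0]'s table is
the one that pasid_lu_handle_pd() preserves for the NO_PASID case.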

Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
---
 drivers/iommu/intel/iommu.c      |   4 +-
 drivers/iommu/intel/iommu.h      |   5 ++
 drivers/iommu/intel/liveupdate.c | 130 +++++++++++++++++++++++++++++++
 drivers/iommu/intel/pasid.c      |   7 +-
 drivers/iommu/intel/pasid.h      |   9 +++
 include/linux/kho/abi/iommu.h    |   8 ++
 6 files changed, 160 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 83faad53f247..2d0dae57f5a2 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -2944,8 +2944,10 @@ static bool __maybe_clean_unpreserved_context_entries(struct intel_iommu *iommu)
 		if (info->iommu != iommu)
 			continue;
 
-		if (dev_iommu_preserved_state(&pdev->dev))
+		if (dev_iommu_preserved_state(&pdev->dev)) {
+			pasid_cleanup_preserved_table(&pdev->dev);
 			continue;
+		}
 
 		domain_context_clear(info);
 	}
diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
index 057bd6035d85..d24d6aeaacc0 100644
--- a/drivers/iommu/intel/iommu.h
+++ b/drivers/iommu/intel/iommu.h
@@ -1286,6 +1286,7 @@ int intel_iommu_preserve(struct iommu_device *iommu, struct iommu_ser *iommu_ser
 void intel_iommu_unpreserve(struct iommu_device *iommu, struct iommu_ser *iommu_ser);
 void intel_iommu_liveupdate_restore_root_table(struct intel_iommu *iommu,
 					       struct iommu_ser *iommu_ser);
+void pasid_cleanup_preserved_table(struct device *dev);
 #else
 static inline int intel_iommu_preserve_device(struct device *dev, struct device_ser *device_ser)
 {
@@ -1309,6 +1310,10 @@ static inline void intel_iommu_liveupdate_restore_root_table(struct intel_iommu
 							     struct iommu_ser *iommu_ser)
 {
 }
+
+static inline void pasid_cleanup_preserved_table(struct device *dev)
+{
+}
 #endif
 
 #ifdef CONFIG_INTEL_IOMMU_SVM
diff --git a/drivers/iommu/intel/liveupdate.c b/drivers/iommu/intel/liveupdate.c
index 6dcb5783d1db..53bb5fe3a764 100644
--- a/drivers/iommu/intel/liveupdate.c
+++ b/drivers/iommu/intel/liveupdate.c
@@ -14,6 +14,7 @@
 #include <linux/pci.h>
 
 #include "iommu.h"
+#include "pasid.h"
 #include "../iommu-pages.h"
 
 static void unpreserve_iommu_context(struct intel_iommu *iommu, int end)
@@ -113,9 +114,89 @@ void intel_iommu_liveupdate_restore_root_table(struct intel_iommu *iommu,
 		iommu->reg_phys, iommu_ser->intel.root_table);
 }
 
+enum pasid_lu_op {
+	PASID_LU_OP_PRESERVE = 1,
+	PASID_LU_OP_UNPRESERVE,
+	PASID_LU_OP_RESTORE,
+	PASID_LU_OP_FREE,
+};
+
+static int pasid_lu_do_op(void *table, enum pasid_lu_op op)
+{
+	int ret = 0;
+
+	switch (op) {
+	case PASID_LU_OP_PRESERVE:
+		ret = iommu_preserve_page(table);
+		break;
+	case PASID_LU_OP_UNPRESERVE:
+		iommu_unpreserve_page(table);
+		break;
+	case PASID_LU_OP_RESTORE:
+		iommu_restore_page(virt_to_phys(table));
+		break;
+	case PASID_LU_OP_FREE:
+		iommu_free_pages(table);
+		break;
+	}
+
+	return ret;
+}
+
+static int pasid_lu_handle_pd(struct pasid_dir_entry *dir, enum pasid_lu_op op)
+{
+	struct pasid_entry *table;
+	int ret;
+
+	/* Only preserve the first table, which covers NO_PASID. */
+	table = get_pasid_table_from_pde(&dir[0]);
+	if (!table)
+		return -EINVAL;
+
+	ret = pasid_lu_do_op(table, op);
+	if (ret)
+		return ret;
+
+	ret = pasid_lu_do_op(dir, op);
+	if (ret)
+		goto err;
+
+	return 0;
+err:
+	if (op == PASID_LU_OP_PRESERVE)
+		pasid_lu_do_op(table, PASID_LU_OP_UNPRESERVE);
+
+	return ret;
+}
+
+void pasid_cleanup_preserved_table(struct device *dev)
+{
+	struct pasid_table *pasid_table;
+	struct pasid_dir_entry *dir;
+	struct pasid_entry *table;
+
+	pasid_table = intel_pasid_get_table(dev);
+	if (!pasid_table)
+		return;
+
+	dir = pasid_table->table;
+	table = get_pasid_table_from_pde(&dir[0]);
+	if (!table)
+		return;
+
+	/* Clean up everything except the first entry. */
+	memset(&table[1], 0, SZ_4K - sizeof(*table));
+	memset(&dir[1], 0, SZ_4K - sizeof(struct pasid_dir_entry));
+
+	clflush_cache_range(&table[0], SZ_4K);
+	clflush_cache_range(&dir[0], SZ_4K);
+}
+
 int intel_iommu_preserve_device(struct device *dev, struct device_ser *device_ser)
 {
 	struct device_domain_info *info = dev_iommu_priv_get(dev);
+	struct pasid_table *pasid_table;
+	int ret;
 
 	if (!dev_is_pci(dev))
 		return -EOPNOTSUPP;
@@ -124,11 +205,42 @@ int intel_iommu_preserve_device(struct device *dev, struct device_ser *device_se
 		return -EINVAL;
 
 	device_ser->domain_iommu_ser.did = domain_id_iommu(info->domain, info->iommu);
+
+	if (!sm_supported(info->iommu))
+		return 0;
+
+	pasid_table = intel_pasid_get_table(dev);
+	if (!pasid_table)
+		return -EINVAL;
+
+	ret = pasid_lu_handle_pd(pasid_table->table, PASID_LU_OP_PRESERVE);
+	if (ret)
+		return ret;
+
+	device_ser->intel.pasid_table = virt_to_phys(pasid_table->table);
+	device_ser->intel.max_pasid = pasid_table->max_pasid;
 	return 0;
 }
 
 void intel_iommu_unpreserve_device(struct device *dev, struct device_ser *device_ser)
 {
+	struct device_domain_info *info = dev_iommu_priv_get(dev);
+	struct pasid_table *pasid_table;
+
+	if (!dev_is_pci(dev))
+		return;
+
+	if (!info)
+		return;
+
+	if (!sm_supported(info->iommu))
+		return;
+
+	pasid_table = intel_pasid_get_table(dev);
+	if (!pasid_table)
+		return;
+
+	pasid_lu_handle_pd(pasid_table->table, PASID_LU_OP_UNPRESERVE);
 }
 
 int intel_iommu_preserve(struct iommu_device *iommu_dev, struct iommu_ser *ser)
@@ -172,3 +284,21 @@ void intel_iommu_unpreserve(struct iommu_device *iommu_dev, struct iommu_ser *io
 	iommu_unpreserve_page(iommu->root_entry);
 	spin_unlock(&iommu->lock);
 }
+
+void *intel_pasid_try_restore_table(struct device *dev, u64 max_pasid)
+{
+	struct device_ser *ser = dev_iommu_restored_state(dev);
+
+	if (!ser)
+		return NULL;
+
+	BUG_ON(pasid_lu_handle_pd(phys_to_virt(ser->intel.pasid_table),
+				  PASID_LU_OP_RESTORE));
+	if (WARN_ON_ONCE(ser->intel.max_pasid != max_pasid)) {
+		pasid_lu_handle_pd(phys_to_virt(ser->intel.pasid_table),
+				   PASID_LU_OP_FREE);
+		return NULL;
+	}
+
+	return phys_to_virt(ser->intel.pasid_table);
+}
diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
index 3e2255057079..96b9daf9083d 100644
--- a/drivers/iommu/intel/pasid.c
+++ b/drivers/iommu/intel/pasid.c
@@ -60,8 +60,11 @@ int intel_pasid_alloc_table(struct device *dev)
 
 	size = max_pasid >> (PASID_PDE_SHIFT - 3);
 	order = size ? get_order(size) : 0;
-	dir = iommu_alloc_pages_node_sz(info->iommu->node, GFP_KERNEL,
-					1 << (order + PAGE_SHIFT));
+
+	dir = intel_pasid_try_restore_table(dev, max_pasid);
+	if (!dir)
+		dir = iommu_alloc_pages_node_sz(info->iommu->node, GFP_KERNEL,
+						1 << (order + PAGE_SHIFT));
 	if (!dir) {
 		kfree(pasid_table);
 		return -ENOMEM;
diff --git a/drivers/iommu/intel/pasid.h b/drivers/iommu/intel/pasid.h
index b4c85242dc79..e8a626c47daf 100644
--- a/drivers/iommu/intel/pasid.h
+++ b/drivers/iommu/intel/pasid.h
@@ -287,6 +287,15 @@ static inline void pasid_set_eafe(struct pasid_entry *pe)
 
 extern unsigned int intel_pasid_max_id;
 int intel_pasid_alloc_table(struct device *dev);
+#ifdef CONFIG_IOMMU_LIVEUPDATE
+void *intel_pasid_try_restore_table(struct device *dev, u64 max_pasid);
+#else
+static inline void *intel_pasid_try_restore_table(struct device *dev,
+						  u64 max_pasid)
+{
+	return NULL;
+}
+#endif
 void intel_pasid_free_table(struct device *dev);
 struct pasid_table *intel_pasid_get_table(struct device *dev);
 int intel_pasid_setup_first_level(struct intel_iommu *iommu, struct device *dev,
diff --git a/include/linux/kho/abi/iommu.h b/include/linux/kho/abi/iommu.h
index 8e1c05cfe7bb..111a46c31d92 100644
--- a/include/linux/kho/abi/iommu.h
+++ b/include/linux/kho/abi/iommu.h
@@ -50,6 +50,11 @@ struct device_domain_iommu_ser {
 	u64 iommu_phys;
 } __packed;
 
+struct device_intel_ser {
+	u64 pasid_table;
+	u64 max_pasid;
+} __packed;
+
 struct device_ser {
 	struct iommu_obj_ser obj;
 	u64 token;
@@ -57,6 +62,9 @@ struct device_ser {
 	u32 pci_domain;
 	struct device_domain_iommu_ser domain_iommu_ser;
 	enum iommu_lu_type type;
+	union {
+		struct device_intel_ser intel;
+	};
 } __packed;
 
 struct iommu_intel_ser {
-- 
2.53.0.rc2.204.g2597b5adb4-goog


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH 10/14] iommufd-lu: Implement ioctl to let userspace mark an HWPT to be preserved
  2026-02-03 22:09 [PATCH 00/14] iommu: Add live update state preservation Samiullah Khawaja
                   ` (8 preceding siblings ...)
  2026-02-03 22:09 ` [PATCH 09/14] iommu/vt-d: preserve PASID table of preserved device Samiullah Khawaja
@ 2026-02-03 22:09 ` Samiullah Khawaja
  2026-03-19 23:35   ` Vipin Sharma
  2026-03-25 14:37   ` Pranjal Shrivastava
  2026-02-03 22:09 ` [PATCH 11/14] iommufd-lu: Persist iommu hardware pagetables for live update Samiullah Khawaja
                   ` (3 subsequent siblings)
  13 siblings, 2 replies; 98+ messages in thread
From: Samiullah Khawaja @ 2026-02-03 22:09 UTC (permalink / raw)
  To: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe
  Cc: YiFei Zhu, Samiullah Khawaja, Robin Murphy, Kevin Tian,
	Alex Williamson, Shuah Khan, iommu, linux-kernel, kvm,
	Saeed Mahameed, Adithya Jayachandran, Parav Pandit,
	Leon Romanovsky, William Tu, Pratyush Yadav, Pasha Tatashin,
	David Matlack, Andrew Morton, Chris Li, Pranjal Shrivastava,
	Vipin Sharma

From: YiFei Zhu <zhuyifei@google.com>

Userspace provides a token, which will then be used at restore to
identify this HWPT. The restoration logic is not implemented and will be
added later.

Signed-off-by: YiFei Zhu <zhuyifei@google.com>
Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
---
 drivers/iommu/iommufd/Makefile          |  1 +
 drivers/iommu/iommufd/iommufd_private.h | 13 +++++++
 drivers/iommu/iommufd/liveupdate.c      | 49 +++++++++++++++++++++++++
 drivers/iommu/iommufd/main.c            |  2 +
 include/uapi/linux/iommufd.h            | 19 ++++++++++
 5 files changed, 84 insertions(+)
 create mode 100644 drivers/iommu/iommufd/liveupdate.c

diff --git a/drivers/iommu/iommufd/Makefile b/drivers/iommu/iommufd/Makefile
index 71d692c9a8f4..c3bf0b6452d3 100644
--- a/drivers/iommu/iommufd/Makefile
+++ b/drivers/iommu/iommufd/Makefile
@@ -17,3 +17,4 @@ obj-$(CONFIG_IOMMUFD_DRIVER) += iova_bitmap.o
 
 iommufd_driver-y := driver.o
 obj-$(CONFIG_IOMMUFD_DRIVER_CORE) += iommufd_driver.o
+obj-$(CONFIG_IOMMU_LIVEUPDATE) += liveupdate.o
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index eb6d1a70f673..6424e7cea5b2 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -374,6 +374,10 @@ struct iommufd_hwpt_paging {
 	bool auto_domain : 1;
 	bool enforce_cache_coherency : 1;
 	bool nest_parent : 1;
+#ifdef CONFIG_IOMMU_LIVEUPDATE
+	bool lu_preserve : 1;
+	u32 lu_token;
+#endif
 	/* Head at iommufd_ioas::hwpt_list */
 	struct list_head hwpt_item;
 	struct iommufd_sw_msi_maps present_sw_msi;
@@ -707,6 +711,15 @@ iommufd_get_vdevice(struct iommufd_ctx *ictx, u32 id)
 			    struct iommufd_vdevice, obj);
 }
 
+#ifdef CONFIG_IOMMU_LIVEUPDATE
+int iommufd_hwpt_lu_set_preserve(struct iommufd_ucmd *ucmd);
+#else
+static inline int iommufd_hwpt_lu_set_preserve(struct iommufd_ucmd *ucmd)
+{
+	return -ENOTTY;
+}
+#endif
+
 #ifdef CONFIG_IOMMUFD_TEST
 int iommufd_test(struct iommufd_ucmd *ucmd);
 void iommufd_selftest_destroy(struct iommufd_object *obj);
diff --git a/drivers/iommu/iommufd/liveupdate.c b/drivers/iommu/iommufd/liveupdate.c
new file mode 100644
index 000000000000..ae74f5b54735
--- /dev/null
+++ b/drivers/iommu/iommufd/liveupdate.c
@@ -0,0 +1,49 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#define pr_fmt(fmt) "iommufd: " fmt
+
+#include <linux/file.h>
+#include <linux/iommufd.h>
+#include <linux/liveupdate.h>
+
+#include "iommufd_private.h"
+
+int iommufd_hwpt_lu_set_preserve(struct iommufd_ucmd *ucmd)
+{
+	struct iommu_hwpt_lu_set_preserve *cmd = ucmd->cmd;
+	struct iommufd_hwpt_paging *hwpt_target, *hwpt;
+	struct iommufd_ctx *ictx = ucmd->ictx;
+	struct iommufd_object *obj;
+	unsigned long index;
+	int rc = 0;
+
+	hwpt_target = iommufd_get_hwpt_paging(ucmd, cmd->hwpt_id);
+	if (IS_ERR(hwpt_target))
+		return PTR_ERR(hwpt_target);
+
+	xa_lock(&ictx->objects);
+	xa_for_each(&ictx->objects, index, obj) {
+		if (obj->type != IOMMUFD_OBJ_HWPT_PAGING)
+			continue;
+
+		hwpt = container_of(obj, struct iommufd_hwpt_paging, common.obj);
+
+		if (hwpt == hwpt_target)
+			continue;
+		if (!hwpt->lu_preserve)
+			continue;
+		if (hwpt->lu_token == cmd->hwpt_token) {
+			rc = -EADDRINUSE;
+			goto out;
+		}
+	}
+
+	hwpt_target->lu_preserve = true;
+	hwpt_target->lu_token = cmd->hwpt_token;
+
+out:
+	xa_unlock(&ictx->objects);
+	iommufd_put_object(ictx, &hwpt_target->common.obj);
+	return rc;
+}
+
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index 5cc4b08c25f5..e1a9b3051f65 100644
--- a/drivers/iommu/iommufd/main.c
+++ b/drivers/iommu/iommufd/main.c
@@ -493,6 +493,8 @@ static const struct iommufd_ioctl_op iommufd_ioctl_ops[] = {
 		 __reserved),
 	IOCTL_OP(IOMMU_VIOMMU_ALLOC, iommufd_viommu_alloc_ioctl,
 		 struct iommu_viommu_alloc, out_viommu_id),
+	IOCTL_OP(IOMMU_HWPT_LU_SET_PRESERVE, iommufd_hwpt_lu_set_preserve,
+		 struct iommu_hwpt_lu_set_preserve, hwpt_token),
 #ifdef CONFIG_IOMMUFD_TEST
 	IOCTL_OP(IOMMU_TEST_CMD, iommufd_test, struct iommu_test_cmd, last),
 #endif
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index 2c41920b641d..25d8cff987eb 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -57,6 +57,7 @@ enum {
 	IOMMUFD_CMD_IOAS_CHANGE_PROCESS = 0x92,
 	IOMMUFD_CMD_VEVENTQ_ALLOC = 0x93,
 	IOMMUFD_CMD_HW_QUEUE_ALLOC = 0x94,
+	IOMMUFD_CMD_HWPT_LU_SET_PRESERVE = 0x95,
 };
 
 /**
@@ -1299,4 +1300,22 @@ struct iommu_hw_queue_alloc {
 	__aligned_u64 length;
 };
 #define IOMMU_HW_QUEUE_ALLOC _IO(IOMMUFD_TYPE, IOMMUFD_CMD_HW_QUEUE_ALLOC)
+
+/**
+ * struct iommu_hwpt_lu_set_preserve - ioctl(IOMMU_HWPT_LU_SET_PRESERVE)
+ * @size: sizeof(struct iommu_hwpt_lu_set_preserve)
+ * @hwpt_id: Iommufd object ID of the target HWPT
+ * @hwpt_token: Token to identify this hwpt upon restore
+ *
+ * The target HWPT will be preserved during iommufd preservation.
+ *
+ * The hwpt_token is provided by userspace. If userspace provides a token
+ * already in use within this iommufd, -EADDRINUSE is returned from this ioctl.
+ */
+struct iommu_hwpt_lu_set_preserve {
+	__u32 size;
+	__u32 hwpt_id;
+	__u32 hwpt_token;
+};
+#define IOMMU_HWPT_LU_SET_PRESERVE _IO(IOMMUFD_TYPE, IOMMUFD_CMD_HWPT_LU_SET_PRESERVE)
 #endif
-- 
2.53.0.rc2.204.g2597b5adb4-goog



* [PATCH 11/14] iommufd-lu: Persist iommu hardware pagetables for live update
  2026-02-03 22:09 [PATCH 00/14] iommu: Add live update state preservation Samiullah Khawaja
                   ` (9 preceding siblings ...)
  2026-02-03 22:09 ` [PATCH 10/14] iommufd-lu: Implement ioctl to let userspace mark an HWPT to be preserved Samiullah Khawaja
@ 2026-02-03 22:09 ` Samiullah Khawaja
  2026-02-25 23:47   ` Samiullah Khawaja
                     ` (3 more replies)
  2026-02-03 22:09 ` [PATCH 12/14] iommufd: Add APIs to preserve/unpreserve a vfio cdev Samiullah Khawaja
                   ` (2 subsequent siblings)
  13 siblings, 4 replies; 98+ messages in thread
From: Samiullah Khawaja @ 2026-02-03 22:09 UTC (permalink / raw)
  To: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe
  Cc: YiFei Zhu, Samiullah Khawaja, Robin Murphy, Kevin Tian,
	Alex Williamson, Shuah Khan, iommu, linux-kernel, kvm,
	Saeed Mahameed, Adithya Jayachandran, Parav Pandit,
	Leon Romanovsky, William Tu, Pratyush Yadav, Pasha Tatashin,
	David Matlack, Andrew Morton, Chris Li, Pranjal Shrivastava,
	Vipin Sharma

From: YiFei Zhu <zhuyifei@google.com>

The caller is expected to mark each HWPT to be preserved with an ioctl
call, providing a token that will be used at restore. At preserve time,
iommu_domain_preserve() is then called on each HWPT's domain to preserve
the iommu domain.

HWPTs containing DMA mappings backed by unpreserved memory must not be
preserved. During preservation, verify that all mappings contained in
the HWPT being preserved are file based and that all of the files are
preserved.

The memfd file preservation check alone is not enough when preserving
iommufd: the memfd might have shrunk between the mapping and the memfd
preservation, in which case pages that are currently pinned by iommu
mappings would not be preserved with the memfd. Therefore only allow
iommufd preservation when all of the iopt_pages are file backed and the
memfd was sealed against growing and shrinking (with the seal set itself
sealed) at mapping time. This guarantees that all of the pages that were
backing the memfd when it was mapped are preserved.

Once an HWPT is preserved, the iopt associated with it is made
immutable. The map and unmap ioctls operate directly on the iopt, which
contains an array of domains, while each HWPT contains only one domain.
The logic therefore becomes: mapping and unmapping are prohibited if any
of the domains in an iopt belongs to a preserved HWPT. However, tracing
to the HWPT through the domain is a lot more tedious than tracing
through the IOAS, so if an HWPT is preserved, hwpt->ioas->iopt is made
immutable.

When undoing this (making the iopts mutable again), there is never a
need to make some iopts mutable while keeping others immutable, since
the undo only happens on unpreserve and on the error path of preserve.
Simply iterate over all the IOAS objects and clear the immutability flag
on all their iopts.

Signed-off-by: YiFei Zhu <zhuyifei@google.com>
Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
---
 drivers/iommu/iommufd/io_pagetable.c    |  17 ++
 drivers/iommu/iommufd/io_pagetable.h    |   1 +
 drivers/iommu/iommufd/iommufd_private.h |  25 ++
 drivers/iommu/iommufd/liveupdate.c      | 300 ++++++++++++++++++++++++
 drivers/iommu/iommufd/main.c            |  14 +-
 drivers/iommu/iommufd/pages.c           |   8 +
 include/linux/kho/abi/iommufd.h         |  39 +++
 7 files changed, 403 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/kho/abi/iommufd.h

diff --git a/drivers/iommu/iommufd/io_pagetable.c b/drivers/iommu/iommufd/io_pagetable.c
index 436992331111..43e8a2443793 100644
--- a/drivers/iommu/iommufd/io_pagetable.c
+++ b/drivers/iommu/iommufd/io_pagetable.c
@@ -270,6 +270,11 @@ static int iopt_alloc_area_pages(struct io_pagetable *iopt,
 	}
 
 	down_write(&iopt->iova_rwsem);
+	if (iopt_lu_map_immutable(iopt)) {
+		rc = -EBUSY;
+		goto out_unlock;
+	}
+
 	if ((length & (iopt->iova_alignment - 1)) || !length) {
 		rc = -EINVAL;
 		goto out_unlock;
@@ -328,6 +333,7 @@ static void iopt_abort_area(struct iopt_area *area)
 		WARN_ON(area->pages);
 	if (area->iopt) {
 		down_write(&area->iopt->iova_rwsem);
+		WARN_ON(iopt_lu_map_immutable(area->iopt));
 		interval_tree_remove(&area->node, &area->iopt->area_itree);
 		up_write(&area->iopt->iova_rwsem);
 	}
@@ -755,6 +761,12 @@ static int iopt_unmap_iova_range(struct io_pagetable *iopt, unsigned long start,
 again:
 	down_read(&iopt->domains_rwsem);
 	down_write(&iopt->iova_rwsem);
+
+	if (iopt_lu_map_immutable(iopt)) {
+		rc = -EBUSY;
+		goto out_unlock_iova;
+	}
+
 	while ((area = iopt_area_iter_first(iopt, start, last))) {
 		unsigned long area_last = iopt_area_last_iova(area);
 		unsigned long area_first = iopt_area_iova(area);
@@ -1398,6 +1410,11 @@ int iopt_cut_iova(struct io_pagetable *iopt, unsigned long *iovas,
 	int i;
 
 	down_write(&iopt->iova_rwsem);
+	if (iopt_lu_map_immutable(iopt)) {
+		up_write(&iopt->iova_rwsem);
+		return -EBUSY;
+	}
+
 	for (i = 0; i < num_iovas; i++) {
 		struct iopt_area *area;
 
diff --git a/drivers/iommu/iommufd/io_pagetable.h b/drivers/iommu/iommufd/io_pagetable.h
index 14cd052fd320..b64cb4cf300c 100644
--- a/drivers/iommu/iommufd/io_pagetable.h
+++ b/drivers/iommu/iommufd/io_pagetable.h
@@ -234,6 +234,7 @@ struct iopt_pages {
 		struct {			/* IOPT_ADDRESS_FILE */
 			struct file *file;
 			unsigned long start;
+			u32 seals;
 		};
 		/* IOPT_ADDRESS_DMABUF */
 		struct iopt_pages_dmabuf dmabuf;
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 6424e7cea5b2..f8366a23999f 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -94,6 +94,9 @@ struct io_pagetable {
 	/* IOVA that cannot be allocated, struct iopt_reserved */
 	struct rb_root_cached reserved_itree;
 	u8 disable_large_pages;
+#ifdef CONFIG_IOMMU_LIVEUPDATE
+	bool lu_map_immutable;
+#endif
 	unsigned long iova_alignment;
 };
 
@@ -712,12 +715,34 @@ iommufd_get_vdevice(struct iommufd_ctx *ictx, u32 id)
 }
 
 #ifdef CONFIG_IOMMU_LIVEUPDATE
+int iommufd_liveupdate_register_lufs(void);
+int iommufd_liveupdate_unregister_lufs(void);
+
 int iommufd_hwpt_lu_set_preserve(struct iommufd_ucmd *ucmd);
+static inline bool iopt_lu_map_immutable(const struct io_pagetable *iopt)
+{
+	return iopt->lu_map_immutable;
+}
 #else
+static inline int iommufd_liveupdate_register_lufs(void)
+{
+	return 0;
+}
+
+static inline int iommufd_liveupdate_unregister_lufs(void)
+{
+	return 0;
+}
+
 static inline int iommufd_hwpt_lu_set_preserve(struct iommufd_ucmd *ucmd)
 {
 	return -ENOTTY;
 }
+
+static inline bool iopt_lu_map_immutable(const struct io_pagetable *iopt)
+{
+	return false;
+}
 #endif
 
 #ifdef CONFIG_IOMMUFD_TEST
diff --git a/drivers/iommu/iommufd/liveupdate.c b/drivers/iommu/iommufd/liveupdate.c
index ae74f5b54735..ec11ae345fe7 100644
--- a/drivers/iommu/iommufd/liveupdate.c
+++ b/drivers/iommu/iommufd/liveupdate.c
@@ -4,9 +4,15 @@
 
 #include <linux/file.h>
 #include <linux/iommufd.h>
+#include <linux/kexec_handover.h>
+#include <linux/kho/abi/iommufd.h>
 #include <linux/liveupdate.h>
+#include <linux/iommu-lu.h>
+#include <linux/mm.h>
+#include <linux/pci.h>
 
 #include "iommufd_private.h"
+#include "io_pagetable.h"
 
 int iommufd_hwpt_lu_set_preserve(struct iommufd_ucmd *ucmd)
 {
@@ -47,3 +53,297 @@ int iommufd_hwpt_lu_set_preserve(struct iommufd_ucmd *ucmd)
 	return rc;
 }
 
+static void iommufd_set_ioas_mutable(struct iommufd_ctx *ictx)
+{
+	struct iommufd_object *obj;
+	struct iommufd_ioas *ioas;
+	unsigned long index;
+
+	xa_lock(&ictx->objects);
+	xa_for_each(&ictx->objects, index, obj) {
+		if (obj->type != IOMMUFD_OBJ_IOAS)
+			continue;
+
+		ioas = container_of(obj, struct iommufd_ioas, obj);
+
+		/*
+		 * Not taking any IOAS lock here. All writers take LUO
+		 * session mutex, and this writer racing with readers is not
+		 * really a problem.
+		 */
+		WRITE_ONCE(ioas->iopt.lu_map_immutable, false);
+	}
+	xa_unlock(&ictx->objects);
+}
+
+static int check_iopt_pages_preserved(struct liveupdate_session *s,
+				      struct iommufd_hwpt_paging *hwpt)
+{
+	u32 req_seals = F_SEAL_SEAL | F_SEAL_GROW | F_SEAL_SHRINK;
+	struct iopt_area *area;
+	int ret;
+
+	for (area = iopt_area_iter_first(&hwpt->ioas->iopt, 0, ULONG_MAX); area;
+	     area = iopt_area_iter_next(area, 0, ULONG_MAX)) {
+		struct iopt_pages *pages = area->pages;
+
+		/* Only allow file based mapping */
+		if (pages->type != IOPT_ADDRESS_FILE)
+			return -EINVAL;
+
+		/*
+		 * The memory file must have been sealed against growing and
+		 * shrinking, with the seal set itself sealed, at mapping
+		 * time. Since then the file has neither grown nor shrunk,
+		 * so the pages in use remain pinned and preserved.
+		 */
+		if ((pages->seals & req_seals) != req_seals)
+			return -EINVAL;
+
+		/* Make sure that the file was preserved. */
+		ret = liveupdate_get_token_outgoing(s, pages->file, NULL);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
+
+static int iommufd_save_hwpts(struct iommufd_ctx *ictx,
+			      struct iommufd_lu *iommufd_lu,
+			      struct liveupdate_session *session)
+{
+	struct iommufd_hwpt_paging *hwpt, **hwpts = NULL;
+	struct iommu_domain_ser *domain_ser;
+	struct iommufd_hwpt_lu *hwpt_lu;
+	struct iommufd_object *obj;
+	unsigned int nr_hwpts = 0;
+	unsigned long index;
+	unsigned int i;
+	int rc = 0;
+
+	if (iommufd_lu) {
+		hwpts = kcalloc(iommufd_lu->nr_hwpts, sizeof(*hwpts),
+				GFP_KERNEL);
+		if (!hwpts)
+			return -ENOMEM;
+	}
+
+	xa_lock(&ictx->objects);
+	xa_for_each(&ictx->objects, index, obj) {
+		if (obj->type != IOMMUFD_OBJ_HWPT_PAGING)
+			continue;
+
+		hwpt = container_of(obj, struct iommufd_hwpt_paging, common.obj);
+		if (!hwpt->lu_preserve)
+			continue;
+
+		if (hwpt->ioas) {
+			/*
+			 * Obtain exclusive access to the IOAS and IOPT while we
+			 * set immutability
+			 */
+			mutex_lock(&hwpt->ioas->mutex);
+			down_write(&hwpt->ioas->iopt.domains_rwsem);
+			down_write(&hwpt->ioas->iopt.iova_rwsem);
+
+			hwpt->ioas->iopt.lu_map_immutable = true;
+
+			up_write(&hwpt->ioas->iopt.iova_rwsem);
+			up_write(&hwpt->ioas->iopt.domains_rwsem);
+			mutex_unlock(&hwpt->ioas->mutex);
+		}
+
+		if (!hwpt->common.domain) {
+			rc = -EINVAL;
+			xa_unlock(&ictx->objects);
+			goto out;
+		}
+
+		if (!iommufd_lu) {
+			rc = check_iopt_pages_preserved(session, hwpt);
+			if (rc) {
+				xa_unlock(&ictx->objects);
+				goto out;
+			}
+		} else {
+			hwpts[nr_hwpts] = hwpt;
+			hwpt_lu = &iommufd_lu->hwpts[nr_hwpts];
+
+			hwpt_lu->token = hwpt->lu_token;
+			hwpt_lu->reclaimed = false;
+		}
+
+		nr_hwpts++;
+	}
+	xa_unlock(&ictx->objects);
+
+	if (WARN_ON(iommufd_lu && iommufd_lu->nr_hwpts != nr_hwpts)) {
+		rc = -EFAULT;
+		goto out;
+	}
+
+	if (iommufd_lu) {
+		/*
+		 * iommu_domain_preserve may sleep and must be called
+		 * outside of xa_lock
+		 */
+		for (i = 0; i < nr_hwpts; i++) {
+			hwpt = hwpts[i];
+			hwpt_lu = &iommufd_lu->hwpts[i];
+
+			rc = iommu_domain_preserve(hwpt->common.domain, &domain_ser);
+			if (rc < 0)
+				goto out;
+
+			hwpt_lu->domain_data = __pa(domain_ser);
+		}
+	}
+
+	rc = nr_hwpts;
+
+out:
+	kfree(hwpts);
+	return rc;
+}
+
+static int iommufd_liveupdate_preserve(struct liveupdate_file_op_args *args)
+{
+	struct iommufd_ctx *ictx = iommufd_ctx_from_file(args->file);
+	struct iommufd_lu *iommufd_lu;
+	size_t serial_size;
+	void *mem;
+	int rc;
+
+	if (IS_ERR(ictx))
+		return PTR_ERR(ictx);
+
+	rc = iommufd_save_hwpts(ictx, NULL, args->session);
+	if (rc < 0)
+		goto err_ioas_mutable;
+
+	serial_size = struct_size(iommufd_lu, hwpts, rc);
+
+	mem = kho_alloc_preserve(serial_size);
+	if (!mem) {
+		rc = -ENOMEM;
+		goto err_ioas_mutable;
+	}
+
+	iommufd_lu = mem;
+	iommufd_lu->nr_hwpts = rc;
+	rc = iommufd_save_hwpts(ictx, iommufd_lu, args->session);
+	if (rc < 0)
+		goto err_free;
+
+	args->serialized_data = virt_to_phys(iommufd_lu);
+	iommufd_ctx_put(ictx);
+	return 0;
+
+err_free:
+	kho_unpreserve_free(mem);
+err_ioas_mutable:
+	iommufd_set_ioas_mutable(ictx);
+	iommufd_ctx_put(ictx);
+	return rc;
+}
+
+static int iommufd_liveupdate_freeze(struct liveupdate_file_op_args *args)
+{
+	/* No-Op; everything should be made read-only */
+	return 0;
+}
+
+static void iommufd_liveupdate_unpreserve(struct liveupdate_file_op_args *args)
+{
+	struct iommufd_ctx *ictx = iommufd_ctx_from_file(args->file);
+	struct iommufd_hwpt_paging *hwpt;
+	struct iommufd_object *obj;
+	unsigned long index;
+
+	if (WARN_ON(IS_ERR(ictx)))
+		return;
+
+	xa_lock(&ictx->objects);
+	xa_for_each(&ictx->objects, index, obj) {
+		if (obj->type != IOMMUFD_OBJ_HWPT_PAGING)
+			continue;
+
+		hwpt = container_of(obj, struct iommufd_hwpt_paging, common.obj);
+		if (!hwpt->lu_preserve)
+			continue;
+		if (!hwpt->common.domain)
+			continue;
+
+		iommu_domain_unpreserve(hwpt->common.domain);
+	}
+	xa_unlock(&ictx->objects);
+
+	kho_unpreserve_free(phys_to_virt(args->serialized_data));
+
+	iommufd_set_ioas_mutable(ictx);
+	iommufd_ctx_put(ictx);
+}
+
+static int iommufd_liveupdate_retrieve(struct liveupdate_file_op_args *args)
+{
+	return -EOPNOTSUPP;
+}
+
+static bool iommufd_liveupdate_can_finish(struct liveupdate_file_op_args *args)
+{
+	return false;
+}
+
+static void iommufd_liveupdate_finish(struct liveupdate_file_op_args *args)
+{
+}
+
+static bool iommufd_liveupdate_can_preserve(struct liveupdate_file_handler *handler,
+					    struct file *file)
+{
+	struct iommufd_ctx *ictx = iommufd_ctx_from_file(file);
+
+	if (IS_ERR(ictx))
+		return false;
+
+	iommufd_ctx_put(ictx);
+	return true;
+}
+
+static struct liveupdate_file_ops iommufd_lu_file_ops = {
+	.can_preserve = iommufd_liveupdate_can_preserve,
+	.preserve = iommufd_liveupdate_preserve,
+	.unpreserve = iommufd_liveupdate_unpreserve,
+	.freeze = iommufd_liveupdate_freeze,
+	.retrieve = iommufd_liveupdate_retrieve,
+	.can_finish = iommufd_liveupdate_can_finish,
+	.finish = iommufd_liveupdate_finish,
+};
+
+static struct liveupdate_file_handler iommufd_lu_handler = {
+	.compatible = IOMMUFD_LUO_COMPATIBLE,
+	.ops = &iommufd_lu_file_ops,
+};
+
+int iommufd_liveupdate_register_lufs(void)
+{
+	int ret;
+
+	ret = liveupdate_register_file_handler(&iommufd_lu_handler);
+	if (ret)
+		return ret;
+
+	ret = iommu_liveupdate_register_flb(&iommufd_lu_handler);
+	if (ret)
+		liveupdate_unregister_file_handler(&iommufd_lu_handler);
+
+	return ret;
+}
+
+int iommufd_liveupdate_unregister_lufs(void)
+{
+	WARN_ON(iommu_liveupdate_unregister_flb(&iommufd_lu_handler));
+
+	return liveupdate_unregister_file_handler(&iommufd_lu_handler);
+}
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index e1a9b3051f65..d7683244c67a 100644
--- a/drivers/iommu/iommufd/main.c
+++ b/drivers/iommu/iommufd/main.c
@@ -775,11 +775,21 @@ static int __init iommufd_init(void)
 		if (ret)
 			goto err_misc;
 	}
+
+	if (IS_ENABLED(CONFIG_IOMMU_LIVEUPDATE)) {
+		ret = iommufd_liveupdate_register_lufs();
+		if (ret)
+			goto err_vfio_misc;
+	}
+
 	ret = iommufd_test_init();
 	if (ret)
-		goto err_vfio_misc;
+		goto err_lufs;
 	return 0;
 
+err_lufs:
+	if (IS_ENABLED(CONFIG_IOMMU_LIVEUPDATE))
+		iommufd_liveupdate_unregister_lufs();
 err_vfio_misc:
 	if (IS_ENABLED(CONFIG_IOMMUFD_VFIO_CONTAINER))
 		misc_deregister(&vfio_misc_dev);
@@ -791,6 +801,8 @@ static int __init iommufd_init(void)
 static void __exit iommufd_exit(void)
 {
 	iommufd_test_exit();
+	if (IS_ENABLED(CONFIG_IOMMU_LIVEUPDATE))
+		iommufd_liveupdate_unregister_lufs();
 	if (IS_ENABLED(CONFIG_IOMMUFD_VFIO_CONTAINER))
 		misc_deregister(&vfio_misc_dev);
 	misc_deregister(&iommu_misc_dev);
diff --git a/drivers/iommu/iommufd/pages.c b/drivers/iommu/iommufd/pages.c
index dbe51ecb9a20..cc0e3265ba4e 100644
--- a/drivers/iommu/iommufd/pages.c
+++ b/drivers/iommu/iommufd/pages.c
@@ -55,6 +55,7 @@
 #include <linux/overflow.h>
 #include <linux/slab.h>
 #include <linux/sched/mm.h>
+#include <linux/memfd.h>
 #include <linux/vfio_pci_core.h>
 
 #include "double_span.h"
@@ -1420,6 +1421,7 @@ struct iopt_pages *iopt_alloc_file_pages(struct file *file,
 
 {
 	struct iopt_pages *pages;
+	int seals;
 
 	pages = iopt_alloc_pages(start_byte, length, writable);
 	if (IS_ERR(pages))
@@ -1427,6 +1429,12 @@ struct iopt_pages *iopt_alloc_file_pages(struct file *file,
 	pages->file = get_file(file);
 	pages->start = start - start_byte;
 	pages->type = IOPT_ADDRESS_FILE;
+
+	pages->seals = 0;
+	seals = memfd_get_seals(file);
+	if (seals > 0)
+		pages->seals = seals;
+
 	return pages;
 }
 
diff --git a/include/linux/kho/abi/iommufd.h b/include/linux/kho/abi/iommufd.h
new file mode 100644
index 000000000000..f7393ac78aa9
--- /dev/null
+++ b/include/linux/kho/abi/iommufd.h
@@ -0,0 +1,39 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+/*
+ * Copyright (C) 2025, Google LLC
+ * Author: Samiullah Khawaja <skhawaja@google.com>
+ */
+
+#ifndef _LINUX_KHO_ABI_IOMMUFD_H
+#define _LINUX_KHO_ABI_IOMMUFD_H
+
+#include <linux/mutex_types.h>
+#include <linux/compiler.h>
+#include <linux/types.h>
+
+/**
+ * DOC: IOMMUFD Live Update ABI
+ *
+ * This header defines the ABI for preserving the state of an IOMMUFD file
+ * across a kexec reboot using LUO.
+ *
+ * This interface is a contract. Any modification to any of the serialization
+ * structs defined here constitutes a breaking change. Such changes require
+ * incrementing the version number in the IOMMUFD_LUO_COMPATIBLE string.
+ */
+
+#define IOMMUFD_LUO_COMPATIBLE "iommufd-v1"
+
+struct iommufd_hwpt_lu {
+	u32 token;
+	u64 domain_data;
+	bool reclaimed;
+} __packed;
+
+struct iommufd_lu {
+	unsigned int nr_hwpts;
+	struct iommufd_hwpt_lu hwpts[];
+};
+
+#endif /* _LINUX_KHO_ABI_IOMMUFD_H */
-- 
2.53.0.rc2.204.g2597b5adb4-goog



* [PATCH 12/14] iommufd: Add APIs to preserve/unpreserve a vfio cdev
  2026-02-03 22:09 [PATCH 00/14] iommu: Add live update state preservation Samiullah Khawaja
                   ` (10 preceding siblings ...)
  2026-02-03 22:09 ` [PATCH 11/14] iommufd-lu: Persist iommu hardware pagetables for live update Samiullah Khawaja
@ 2026-02-03 22:09 ` Samiullah Khawaja
  2026-03-23 20:59   ` Vipin Sharma
  2026-03-25 20:24   ` Pranjal Shrivastava
  2026-02-03 22:09 ` [PATCH 13/14] vfio/pci: Preserve the iommufd state of the " Samiullah Khawaja
  2026-02-03 22:09 ` [PATCH 14/14] iommufd/selftest: Add test to verify iommufd preservation Samiullah Khawaja
  13 siblings, 2 replies; 98+ messages in thread
From: Samiullah Khawaja @ 2026-02-03 22:09 UTC (permalink / raw)
  To: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe
  Cc: Samiullah Khawaja, Robin Murphy, Kevin Tian, Alex Williamson,
	Shuah Khan, iommu, linux-kernel, kvm, Saeed Mahameed,
	Adithya Jayachandran, Parav Pandit, Leon Romanovsky, William Tu,
	Pratyush Yadav, Pasha Tatashin, David Matlack, Andrew Morton,
	Chris Li, Pranjal Shrivastava, Vipin Sharma, YiFei Zhu

Add APIs that can be used to preserve and unpreserve a vfio cdev. Use
the APIs exported by the IOMMU core to preserve/unpreserve the device.
Pass the LUO preservation token of the attached iommufd into the IOMMU
preserve-device API. This establishes the ownership of the device by
the preserved iommufd.

Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
---
 drivers/iommu/iommufd/device.c | 69 ++++++++++++++++++++++++++++++++++
 include/linux/iommufd.h        | 23 ++++++++++++
 2 files changed, 92 insertions(+)

diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
index 4c842368289f..30cb5218093b 100644
--- a/drivers/iommu/iommufd/device.c
+++ b/drivers/iommu/iommufd/device.c
@@ -2,6 +2,7 @@
 /* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES
  */
 #include <linux/iommu.h>
+#include <linux/iommu-lu.h>
 #include <linux/iommufd.h>
 #include <linux/pci-ats.h>
 #include <linux/slab.h>
@@ -1661,3 +1662,71 @@ int iommufd_get_hw_info(struct iommufd_ucmd *ucmd)
 	iommufd_put_object(ucmd->ictx, &idev->obj);
 	return rc;
 }
+
+#ifdef CONFIG_IOMMU_LIVEUPDATE
+int iommufd_device_preserve(struct liveupdate_session *s,
+			    struct iommufd_device *idev,
+			    u64 *tokenp)
+{
+	struct iommufd_group *igroup = idev->igroup;
+	struct iommufd_hwpt_paging *hwpt_paging;
+	struct iommufd_hw_pagetable *hwpt;
+	struct iommufd_attach *attach;
+	int ret;
+
+	mutex_lock(&igroup->lock);
+	attach = xa_load(&igroup->pasid_attach, IOMMU_NO_PASID);
+	if (!attach) {
+		ret = -ENOENT;
+		goto out;
+	}
+
+	hwpt = attach->hwpt;
+	hwpt_paging = find_hwpt_paging(hwpt);
+	if (!hwpt_paging || !hwpt_paging->lu_preserve) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	ret = liveupdate_get_token_outgoing(s, idev->ictx->file, tokenp);
+	if (ret)
+		goto out;
+
+	ret = iommu_preserve_device(hwpt_paging->common.domain,
+				    idev->dev,
+				    *tokenp);
+out:
+	mutex_unlock(&igroup->lock);
+	return ret;
+}
+EXPORT_SYMBOL_NS_GPL(iommufd_device_preserve, "IOMMUFD");
+
+void iommufd_device_unpreserve(struct liveupdate_session *s,
+			       struct iommufd_device *idev,
+			       u64 token)
+{
+	struct iommufd_group *igroup = idev->igroup;
+	struct iommufd_hwpt_paging *hwpt_paging;
+	struct iommufd_hw_pagetable *hwpt;
+	struct iommufd_attach *attach;
+
+	mutex_lock(&igroup->lock);
+	attach = xa_load(&igroup->pasid_attach, IOMMU_NO_PASID);
+	if (!attach) {
+		WARN_ON(-ENOENT);
+		goto out;
+	}
+
+	hwpt = attach->hwpt;
+	hwpt_paging = find_hwpt_paging(hwpt);
+	if (!hwpt_paging || !hwpt_paging->lu_preserve) {
+		WARN_ON(-EINVAL);
+		goto out;
+	}
+
+	iommu_unpreserve_device(hwpt_paging->common.domain, idev->dev);
+out:
+	mutex_unlock(&igroup->lock);
+}
+EXPORT_SYMBOL_NS_GPL(iommufd_device_unpreserve, "IOMMUFD");
+#endif
diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h
index 6e7efe83bc5d..c4b3ed5b518c 100644
--- a/include/linux/iommufd.h
+++ b/include/linux/iommufd.h
@@ -9,6 +9,7 @@
 #include <linux/err.h>
 #include <linux/errno.h>
 #include <linux/iommu.h>
+#include <linux/liveupdate.h>
 #include <linux/refcount.h>
 #include <linux/types.h>
 #include <linux/xarray.h>
@@ -71,6 +72,28 @@ void iommufd_device_detach(struct iommufd_device *idev, ioasid_t pasid);
 struct iommufd_ctx *iommufd_device_to_ictx(struct iommufd_device *idev);
 u32 iommufd_device_to_id(struct iommufd_device *idev);
 
+#ifdef CONFIG_IOMMU_LIVEUPDATE
+int iommufd_device_preserve(struct liveupdate_session *s,
+			    struct iommufd_device *idev,
+			    u64 *tokenp);
+void iommufd_device_unpreserve(struct liveupdate_session *s,
+			       struct iommufd_device *idev,
+			       u64 token);
+#else
+static inline int iommufd_device_preserve(struct liveupdate_session *s,
+					  struct iommufd_device *idev,
+					  u64 *tokenp)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline void iommufd_device_unpreserve(struct liveupdate_session *s,
+					     struct iommufd_device *idev,
+					     u64 token)
+{
+}
+#endif
+
 struct iommufd_access_ops {
 	u8 needs_pin_pages : 1;
 	void (*unmap)(void *data, unsigned long iova, unsigned long length);
-- 
2.53.0.rc2.204.g2597b5adb4-goog


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH 13/14] vfio/pci: Preserve the iommufd state of the vfio cdev
  2026-02-03 22:09 [PATCH 00/14] iommu: Add live update state preservation Samiullah Khawaja
                   ` (11 preceding siblings ...)
  2026-02-03 22:09 ` [PATCH 12/14] iommufd: Add APIs to preserve/unpreserve a vfio cdev Samiullah Khawaja
@ 2026-02-03 22:09 ` Samiullah Khawaja
  2026-02-17  4:18   ` Ankit Soni
                     ` (2 more replies)
  2026-02-03 22:09 ` [PATCH 14/14] iommufd/selftest: Add test to verify iommufd preservation Samiullah Khawaja
  13 siblings, 3 replies; 98+ messages in thread
From: Samiullah Khawaja @ 2026-02-03 22:09 UTC (permalink / raw)
  To: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe
  Cc: Samiullah Khawaja, Robin Murphy, Kevin Tian, Alex Williamson,
	Shuah Khan, iommu, linux-kernel, kvm, Saeed Mahameed,
	Adithya Jayachandran, Parav Pandit, Leon Romanovsky, William Tu,
	Pratyush Yadav, Pasha Tatashin, David Matlack, Andrew Morton,
	Chris Li, Pranjal Shrivastava, Vipin Sharma, YiFei Zhu

If the vfio cdev is attached to an iommufd, preserve the state of the
attached iommufd as well: the iommu state of the device and the attached
domain. The token returned by the preservation API will be used to
restore/rebind to the iommufd state after live update.

Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
---
 drivers/vfio/pci/vfio_pci_liveupdate.c | 28 +++++++++++++++++++++++++-
 include/linux/kho/abi/vfio_pci.h       | 10 +++++++++
 2 files changed, 37 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/pci/vfio_pci_liveupdate.c b/drivers/vfio/pci/vfio_pci_liveupdate.c
index c52d6bdb455f..af6fbfb7a65c 100644
--- a/drivers/vfio/pci/vfio_pci_liveupdate.c
+++ b/drivers/vfio/pci/vfio_pci_liveupdate.c
@@ -15,6 +15,7 @@
 #include <linux/liveupdate.h>
 #include <linux/errno.h>
 #include <linux/vfio.h>
+#include <linux/iommufd.h>
 
 #include "vfio_pci_priv.h"
 
@@ -39,6 +40,7 @@ static int vfio_pci_liveupdate_preserve(struct liveupdate_file_op_args *args)
 	struct vfio_pci_core_device_ser *ser;
 	struct vfio_pci_core_device *vdev;
 	struct pci_dev *pdev;
+	u64 token = 0;
 
 	vdev = container_of(device, struct vfio_pci_core_device, vdev);
 	pdev = vdev->pdev;
@@ -49,15 +51,32 @@ static int vfio_pci_liveupdate_preserve(struct liveupdate_file_op_args *args)
 	if (vfio_pci_is_intel_display(pdev))
 		return -EINVAL;
 
+#ifdef CONFIG_IOMMU_LIVEUPDATE
+	/* If iommufd is attached, preserve the underlying domain */
+	if (device->iommufd_attached) {
+		int err = iommufd_device_preserve(args->session,
+						  device->iommufd_device,
+						  &token);
+		if (err < 0)
+			return err;
+	}
+#endif
+
 	ser = kho_alloc_preserve(sizeof(*ser));
-	if (IS_ERR(ser))
+	if (IS_ERR(ser)) {
+		if (device->iommufd_attached)
+			iommufd_device_unpreserve(args->session,
+						  device->iommufd_device, token);
+
 		return PTR_ERR(ser);
+	}
 
 	pci_liveupdate_outgoing_preserve(pdev);
 
 	ser->bdf = pci_dev_id(pdev);
 	ser->domain = pci_domain_nr(pdev->bus);
 	ser->reset_works = vdev->reset_works;
+	ser->iommufd_ser.token = token;
 
 	args->serialized_data = virt_to_phys(ser);
 	return 0;
@@ -66,6 +85,13 @@ static int vfio_pci_liveupdate_preserve(struct liveupdate_file_op_args *args)
 static void vfio_pci_liveupdate_unpreserve(struct liveupdate_file_op_args *args)
 {
 	struct vfio_device *device = vfio_device_from_file(args->file);
+	struct vfio_pci_core_device_ser *ser;
+
+	ser = phys_to_virt(args->serialized_data);
+	if (device->iommufd_attached)
+		iommufd_device_unpreserve(args->session,
+					  device->iommufd_device,
+					  ser->iommufd_ser.token);
 
 	pci_liveupdate_outgoing_unpreserve(to_pci_dev(device->dev));
 	kho_unpreserve_free(phys_to_virt(args->serialized_data));
diff --git a/include/linux/kho/abi/vfio_pci.h b/include/linux/kho/abi/vfio_pci.h
index 6c3d3c6dfc09..d01bd58711c2 100644
--- a/include/linux/kho/abi/vfio_pci.h
+++ b/include/linux/kho/abi/vfio_pci.h
@@ -28,6 +28,15 @@
 
 #define VFIO_PCI_LUO_FH_COMPATIBLE "vfio-pci-v1"
 
+/**
+ * struct vfio_iommufd_ser - Serialized state of the attached iommufd.
+ *
+ * @token: The token of the bound iommufd state.
+ */
+struct vfio_iommufd_ser {
+	u64 token;
+} __packed;
+
 /**
  * struct vfio_pci_core_device_ser - Serialized state of a single VFIO PCI
  * device.
@@ -40,6 +49,7 @@ struct vfio_pci_core_device_ser {
 	u16 bdf;
 	u16 domain;
 	u8 reset_works;
+	struct vfio_iommufd_ser iommufd_ser;
 } __packed;
 
 #endif /* _LINUX_LIVEUPDATE_ABI_VFIO_PCI_H */
-- 
2.53.0.rc2.204.g2597b5adb4-goog


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH 14/14] iommufd/selftest: Add test to verify iommufd preservation
  2026-02-03 22:09 [PATCH 00/14] iommu: Add live update state preservation Samiullah Khawaja
                   ` (12 preceding siblings ...)
  2026-02-03 22:09 ` [PATCH 13/14] vfio/pci: Preserve the iommufd state of the " Samiullah Khawaja
@ 2026-02-03 22:09 ` Samiullah Khawaja
  2026-03-23 22:18   ` Vipin Sharma
  2026-03-25 21:05   ` Pranjal Shrivastava
  13 siblings, 2 replies; 98+ messages in thread
From: Samiullah Khawaja @ 2026-02-03 22:09 UTC (permalink / raw)
  To: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe
  Cc: Samiullah Khawaja, Robin Murphy, Kevin Tian, Alex Williamson,
	Shuah Khan, iommu, linux-kernel, kvm, Saeed Mahameed,
	Adithya Jayachandran, Parav Pandit, Leon Romanovsky, William Tu,
	Pratyush Yadav, Pasha Tatashin, David Matlack, Andrew Morton,
	Chris Li, Pranjal Shrivastava, Vipin Sharma, YiFei Zhu

Test iommufd preservation by setting up an iommufd and a vfio cdev and
preserving them across live update. The test takes the VFIO cdev path of
a device bound to the vfio-pci driver and binds it to the iommufd being
preserved. It also preserves the vfio cdev so that the iommufd state
associated with it is preserved as well.

The restore path is tested by restoring only the preserved vfio cdev.
The test then tries to finish the session without restoring the iommufd
and confirms that this fails.

Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
Signed-off-by: YiFei Zhu <zhuyifei@google.com>
---
 tools/testing/selftests/iommu/Makefile        |  12 +
 .../selftests/iommu/iommufd_liveupdate.c      | 209 ++++++++++++++++++
 2 files changed, 221 insertions(+)
 create mode 100644 tools/testing/selftests/iommu/iommufd_liveupdate.c

diff --git a/tools/testing/selftests/iommu/Makefile b/tools/testing/selftests/iommu/Makefile
index 84abeb2f0949..263195af4d6a 100644
--- a/tools/testing/selftests/iommu/Makefile
+++ b/tools/testing/selftests/iommu/Makefile
@@ -7,4 +7,16 @@ TEST_GEN_PROGS :=
 TEST_GEN_PROGS += iommufd
 TEST_GEN_PROGS += iommufd_fail_nth
 
+TEST_GEN_PROGS_EXTENDED += iommufd_liveupdate
+
 include ../lib.mk
+include ../liveupdate/lib/libliveupdate.mk
+
+CFLAGS += -I$(top_srcdir)/tools/include
+CFLAGS += -MD
+CFLAGS += $(EXTRA_CFLAGS)
+
+$(TEST_GEN_PROGS_EXTENDED): %: %.o $(LIBLIVEUPDATE_O)
+	$(CC) $(CFLAGS) $(CPPFLAGS) $(LDFLAGS) $(TARGET_ARCH) $< $(LIBLIVEUPDATE_O) $(LDLIBS) -static -o $@
+
+EXTRA_CLEAN += $(LIBLIVEUPDATE_O)
diff --git a/tools/testing/selftests/iommu/iommufd_liveupdate.c b/tools/testing/selftests/iommu/iommufd_liveupdate.c
new file mode 100644
index 000000000000..8b4ea9f2b7e9
--- /dev/null
+++ b/tools/testing/selftests/iommu/iommufd_liveupdate.c
@@ -0,0 +1,209 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+/*
+ * Copyright (c) 2025, Google LLC.
+ * Samiullah Khawaja <skhawaja@google.com>
+ */
+
+#include <fcntl.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <stdbool.h>
+#include <unistd.h>
+
+#define __EXPORTED_HEADERS__
+#include <linux/iommufd.h>
+#include <linux/types.h>
+#include <linux/vfio.h>
+#include <linux/sizes.h>
+#include <libliveupdate.h>
+
+#include "../kselftest.h"
+
+#define ksft_assert(condition) \
+	do { if (!(condition)) \
+	ksft_exit_fail_msg("Failed: %s at %s %d: %s\n", \
+	#condition, __FILE__, __LINE__, strerror(errno)); } while (0)
+
+int setup_cdev(const char *vfio_cdev_path)
+{
+	int cdev_fd;
+
+	cdev_fd = open(vfio_cdev_path, O_RDWR);
+	if (cdev_fd < 0)
+		ksft_exit_skip("Failed to open VFIO cdev: %s\n", vfio_cdev_path);
+
+	return cdev_fd;
+}
+
+int open_iommufd(void)
+{
+	int iommufd;
+
+	iommufd = open("/dev/iommu", O_RDWR);
+	if (iommufd < 0)
+		ksft_exit_skip("Failed to open /dev/iommu. IOMMUFD support not enabled.\n");
+
+	return iommufd;
+}
+
+int setup_iommufd(int iommufd, int memfd, int cdev_fd, int hwpt_token)
+{
+	int ret;
+
+	struct vfio_device_bind_iommufd bind = {
+		.argsz = sizeof(bind),
+		.flags = 0,
+	};
+	struct iommu_ioas_alloc alloc_data = {
+		.size = sizeof(alloc_data),
+		.flags = 0,
+	};
+	struct iommu_hwpt_alloc hwpt_alloc = {
+		.size = sizeof(hwpt_alloc),
+		.flags = 0,
+	};
+	struct vfio_device_attach_iommufd_pt attach_data = {
+		.argsz = sizeof(attach_data),
+		.flags = 0,
+	};
+	struct iommu_hwpt_lu_set_preserve set_preserve = {
+		.size = sizeof(set_preserve),
+		.hwpt_token = hwpt_token,
+	};
+	struct iommu_ioas_map_file map_file = {
+		.size = sizeof(map_file),
+		.length = SZ_1M,
+		.flags = IOMMU_IOAS_MAP_WRITEABLE | IOMMU_IOAS_MAP_READABLE,
+		.iova = SZ_4G,
+		.fd = memfd,
+		.start = 0,
+	};
+
+	bind.iommufd = iommufd;
+	ret = ioctl(cdev_fd, VFIO_DEVICE_BIND_IOMMUFD, &bind);
+	ksft_assert(!ret);
+
+	ret = ioctl(iommufd, IOMMU_IOAS_ALLOC, &alloc_data);
+	ksft_assert(!ret);
+
+	hwpt_alloc.dev_id = bind.out_devid;
+	hwpt_alloc.pt_id = alloc_data.out_ioas_id;
+	ret = ioctl(iommufd, IOMMU_HWPT_ALLOC, &hwpt_alloc);
+	ksft_assert(!ret);
+
+	attach_data.pt_id = hwpt_alloc.out_hwpt_id;
+	ret = ioctl(cdev_fd, VFIO_DEVICE_ATTACH_IOMMUFD_PT, &attach_data);
+	ksft_assert(!ret);
+
+	map_file.ioas_id = alloc_data.out_ioas_id;
+	ret = ioctl(iommufd, IOMMU_IOAS_MAP_FILE, &map_file);
+	ksft_assert(!ret);
+
+	set_preserve.hwpt_id = attach_data.pt_id;
+	ret = ioctl(iommufd, IOMMU_HWPT_LU_SET_PRESERVE, &set_preserve);
+	ksft_assert(!ret);
+
+	return ret;
+}
+
+static int create_sealed_memfd(size_t size)
+{
+	int fd, ret;
+
+	fd = memfd_create("buffer", MFD_ALLOW_SEALING);
+	ksft_assert(fd > 0);
+
+	ret = ftruncate(fd, size);
+	ksft_assert(!ret);
+
+	ret = fcntl(fd, F_ADD_SEALS,
+		    F_SEAL_GROW | F_SEAL_SHRINK | F_SEAL_SEAL);
+	ksft_assert(!ret);
+
+	return fd;
+}
+
+int main(int argc, char *argv[])
+{
+	int iommufd, cdev_fd, memfd, luo, session, ret;
+	const int token = 0x123456;
+	const int cdev_token = 0x654321;
+	const int hwpt_token = 0x789012;
+	const int memfd_token = 0x890123;
+
+	if (argc < 2) {
+		printf("Usage: ./iommufd_liveupdate <vfio_cdev_path>\n");
+		return 1;
+	}
+
+	luo = luo_open_device();
+	ksft_assert(luo > 0);
+
+	session = luo_retrieve_session(luo, "iommufd-test");
+	if (session == -ENOENT) {
+		session = luo_create_session(luo, "iommufd-test");
+
+		iommufd = open_iommufd();
+		memfd = create_sealed_memfd(SZ_1M);
+		cdev_fd = setup_cdev(argv[1]);
+
+		ret = setup_iommufd(iommufd, memfd, cdev_fd, hwpt_token);
+		ksft_assert(!ret);
+
+		/* Cannot preserve cdev before the iommufd is preserved */
+		ret = luo_session_preserve_fd(session, cdev_fd, cdev_token);
+		ksft_assert(ret);
+
+		/* Cannot preserve iommufd without preserving memfd. */
+		ret = luo_session_preserve_fd(session, iommufd, token);
+		ksft_assert(ret);
+
+		ret = luo_session_preserve_fd(session, memfd, memfd_token);
+		ksft_assert(!ret);
+
+		ret = luo_session_preserve_fd(session, iommufd, token);
+		ksft_assert(!ret);
+
+		ret = luo_session_preserve_fd(session, cdev_fd, cdev_token);
+		ksft_assert(!ret);
+
+		close(session);
+		session = luo_create_session(luo, "iommufd-test");
+
+		ret = luo_session_preserve_fd(session, memfd, memfd_token);
+		ksft_assert(!ret);
+
+		ret = luo_session_preserve_fd(session, iommufd, token);
+		ksft_assert(!ret);
+
+		ret = luo_session_preserve_fd(session, cdev_fd, cdev_token);
+		ksft_assert(!ret);
+
+		daemonize_and_wait();
+	} else {
+		struct vfio_device_bind_iommufd bind = {
+			.argsz = sizeof(bind),
+			.flags = 0,
+		};
+
+		cdev_fd = luo_session_retrieve_fd(session, cdev_token);
+		ksft_assert(cdev_fd > 0);
+
+		iommufd = luo_session_retrieve_fd(session, token);
+		ksft_assert(iommufd < 0);
+
+		iommufd = open_iommufd();
+
+		bind.iommufd = iommufd;
+		ret = ioctl(cdev_fd, VFIO_DEVICE_BIND_IOMMUFD, &bind);
+		ksft_assert(ret);
+		ksft_assert(errno == EPERM);
+
+		/* Finishing without restoring the iommufd should fail */
+		ret = luo_session_finish(session);
+		ksft_assert(ret);
+	}
+
+	return 0;
+}
-- 
2.53.0.rc2.204.g2597b5adb4-goog


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* Re: [PATCH 13/14] vfio/pci: Preserve the iommufd state of the vfio cdev
  2026-02-03 22:09 ` [PATCH 13/14] vfio/pci: Preserve the iommufd state of the " Samiullah Khawaja
@ 2026-02-17  4:18   ` Ankit Soni
  2026-03-03 18:35     ` Samiullah Khawaja
  2026-03-23 21:17   ` Vipin Sharma
  2026-03-25 20:55   ` Pranjal Shrivastava
  2 siblings, 1 reply; 98+ messages in thread
From: Ankit Soni @ 2026-02-17  4:18 UTC (permalink / raw)
  To: Samiullah Khawaja
  Cc: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe, Robin Murphy, Kevin Tian, Alex Williamson,
	Shuah Khan, iommu, linux-kernel, kvm, Saeed Mahameed,
	Adithya Jayachandran, Parav Pandit, Leon Romanovsky, William Tu,
	Pratyush Yadav, Pasha Tatashin, David Matlack, Andrew Morton,
	Chris Li, Pranjal Shrivastava, Vipin Sharma, YiFei Zhu

On Tue, Feb 03, 2026 at 10:09:47PM +0000, Samiullah Khawaja wrote:
> If the vfio cdev is attached to an iommufd, preserve the state of the
> attached iommufd also. Basically preserve the iommu state of the device
> and also the attached domain. The token returned by the preservation API
> will be used to restore/rebind to the iommufd state after liveupdate.
> 
> Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
> ---
>  drivers/vfio/pci/vfio_pci_liveupdate.c | 28 +++++++++++++++++++++++++-
>  include/linux/kho/abi/vfio_pci.h       | 10 +++++++++
>  2 files changed, 37 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_liveupdate.c b/drivers/vfio/pci/vfio_pci_liveupdate.c
> index c52d6bdb455f..af6fbfb7a65c 100644
> --- a/drivers/vfio/pci/vfio_pci_liveupdate.c
> +++ b/drivers/vfio/pci/vfio_pci_liveupdate.c
> @@ -15,6 +15,7 @@
>  #include <linux/liveupdate.h>
>  #include <linux/errno.h>
>  #include <linux/vfio.h>
> +#include <linux/iommufd.h>
>  
>  #include "vfio_pci_priv.h"
>  
> @@ -39,6 +40,7 @@ static int vfio_pci_liveupdate_preserve(struct liveupdate_file_op_args *args)
>  	struct vfio_pci_core_device_ser *ser;
>  	struct vfio_pci_core_device *vdev;
>  	struct pci_dev *pdev;
> +	u64 token = 0;
>  
>  	vdev = container_of(device, struct vfio_pci_core_device, vdev);
>  	pdev = vdev->pdev;
> @@ -49,15 +51,32 @@ static int vfio_pci_liveupdate_preserve(struct liveupdate_file_op_args *args)
>  	if (vfio_pci_is_intel_display(pdev))
>  		return -EINVAL;
>  
> +#ifdef CONFIG_IOMMU_LIVEUPDATE
> +	/* If iommufd is attached, preserve the underlying domain */
> +	if (device->iommufd_attached) {
> +		int err = iommufd_device_preserve(args->session,
> +						  device->iommufd_device,
> +						  &token);
> +		if (err < 0)
> +			return err;
> +	}
> +#endif
> +
>  	ser = kho_alloc_preserve(sizeof(*ser));
> -	if (IS_ERR(ser))
> +	if (IS_ERR(ser)) {
> +		if (device->iommufd_attached)
> +			iommufd_device_unpreserve(args->session,
> +						  device->iommufd_device, token);
> +

To use iommufd_device_preserve()/iommufd_device_unpreserve(), it looks
like the IOMMUFD namespace import is missing here: MODULE_IMPORT_NS("IOMMUFD");

-Ankit

>  		return PTR_ERR(ser);
> +	}
>  
>  	pci_liveupdate_outgoing_preserve(pdev);
>  
>  	ser->bdf = pci_dev_id(pdev);
>  	ser->domain = pci_domain_nr(pdev->bus);
>  	ser->reset_works = vdev->reset_works;
> +	ser->iommufd_ser.token = token;
>  
>  	args->serialized_data = virt_to_phys(ser);
>  	return 0;
> @@ -66,6 +85,13 @@ static int vfio_pci_liveupdate_preserve(struct liveupdate_file_op_args *args)
>  static void vfio_pci_liveupdate_unpreserve(struct liveupdate_file_op_args *args)
>  {
>  	struct vfio_device *device = vfio_device_from_file(args->file);
> +	struct vfio_pci_core_device_ser *ser;
> +
> +	ser = phys_to_virt(args->serialized_data);
> +	if (device->iommufd_attached)
> +		iommufd_device_unpreserve(args->session,
> +					  device->iommufd_device,
> +					  ser->iommufd_ser.token);
>  
>  	pci_liveupdate_outgoing_unpreserve(to_pci_dev(device->dev));
>  	kho_unpreserve_free(phys_to_virt(args->serialized_data));
> diff --git a/include/linux/kho/abi/vfio_pci.h b/include/linux/kho/abi/vfio_pci.h
> index 6c3d3c6dfc09..d01bd58711c2 100644
> --- a/include/linux/kho/abi/vfio_pci.h
> +++ b/include/linux/kho/abi/vfio_pci.h
> @@ -28,6 +28,15 @@
>  
>  #define VFIO_PCI_LUO_FH_COMPATIBLE "vfio-pci-v1"
>  
> +/**
> + * struct vfio_iommufd_ser - Serialized state of the attached iommufd.
> + *
> + * @token: The token of the bound iommufd state.
> + */
> +struct vfio_iommufd_ser {
> +	u64 token;
> +} __packed;
> +
>  /**
>   * struct vfio_pci_core_device_ser - Serialized state of a single VFIO PCI
>   * device.
> @@ -40,6 +49,7 @@ struct vfio_pci_core_device_ser {
>  	u16 bdf;
>  	u16 domain;
>  	u8 reset_works;
> +	struct vfio_iommufd_ser iommufd_ser;
>  } __packed;
>  
>  #endif /* _LINUX_LIVEUPDATE_ABI_VFIO_PCI_H */
> -- 
> 2.53.0.rc2.204.g2597b5adb4-goog
> 

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 11/14] iommufd-lu: Persist iommu hardware pagetables for live update
  2026-02-03 22:09 ` [PATCH 11/14] iommufd-lu: Persist iommu hardware pagetables for live update Samiullah Khawaja
@ 2026-02-25 23:47   ` Samiullah Khawaja
  2026-03-03  5:56   ` Ankit Soni
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 98+ messages in thread
From: Samiullah Khawaja @ 2026-02-25 23:47 UTC (permalink / raw)
  To: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe
  Cc: YiFei Zhu, Robin Murphy, Kevin Tian, Alex Williamson, Shuah Khan,
	iommu, linux-kernel, kvm, Saeed Mahameed, Adithya Jayachandran,
	Parav Pandit, Leon Romanovsky, William Tu, Pratyush Yadav,
	Pasha Tatashin, David Matlack, Andrew Morton, Chris Li,
	Pranjal Shrivastava, Vipin Sharma

On Tue, Feb 3, 2026 at 2:10 PM Samiullah Khawaja <skhawaja@google.com> wrote:
>
> From: YiFei Zhu <zhuyifei@google.com>
>
> The caller is expected to mark each HWPT to be preserved with an ioctl
> call, supplying a token that will be used at restore time. At preserve
> time, iommu_domain_preserve() is then called on each HWPT's domain to
> preserve the iommu domain.
>
> HWPTs containing DMA mappings backed by unpreserved memory must not be
> preserved. During preservation, check that all the mappings contained in
> the HWPT being preserved are file based and that all the backing files
> are preserved.
>
> The memfd file preservation check is not enough when preserving iommufd.
> The memfd might have shrunk between the mapping and memfd preservation.
> This means that if it shrunk, some pages that are currently pinned by
> iommu mappings are not preserved with the memfd. Only allow iommufd
> preservation when all the iopt_pages are file backed and the memory file
> was sealed (including F_SEAL_SEAL) when it was mapped. This guarantees
> that all the pages that were backing the memfd when it was mapped are
> preserved.
>
> Once an HWPT is preserved, the iopt associated with it is made
> immutable. The map and unmap ioctls operate directly on the iopt, which
> contains an array of domains, while each hwpt contains only one domain.
> The logic then becomes that mapping and unmapping are prohibited if any
> of the domains in an iopt belongs to a preserved hwpt. However, tracing
> to the hwpt through the domain is a lot more tedious than tracing
> through the ioas, so if an hwpt is preserved, hwpt->ioas->iopt is made
> immutable.
>
> When undoing this (making the iopts mutable again), there is never a
> need to make some iopts mutable while keeping others immutable, since
> the undo only happens on unpreserve and on the error path of preserve.
> Simply iterate over all the ioas and clear the immutability flag on all
> their iopts.
>
> Signed-off-by: YiFei Zhu <zhuyifei@google.com>
> Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
> ---
>  drivers/iommu/iommufd/io_pagetable.c    |  17 ++
>  drivers/iommu/iommufd/io_pagetable.h    |   1 +
>  drivers/iommu/iommufd/iommufd_private.h |  25 ++
>  drivers/iommu/iommufd/liveupdate.c      | 300 ++++++++++++++++++++++++
>  drivers/iommu/iommufd/main.c            |  14 +-
>  drivers/iommu/iommufd/pages.c           |   8 +
>  include/linux/kho/abi/iommufd.h         |  39 +++
>  7 files changed, 403 insertions(+), 1 deletion(-)
>  create mode 100644 include/linux/kho/abi/iommufd.h
>
> diff --git a/drivers/iommu/iommufd/io_pagetable.c b/drivers/iommu/iommufd/io_pagetable.c
> index 436992331111..43e8a2443793 100644
> --- a/drivers/iommu/iommufd/io_pagetable.c
> +++ b/drivers/iommu/iommufd/io_pagetable.c
> @@ -270,6 +270,11 @@ static int iopt_alloc_area_pages(struct io_pagetable *iopt,
>         }
>
>         down_write(&iopt->iova_rwsem);
> +       if (iopt_lu_map_immutable(iopt)) {
> +               rc = -EBUSY;
> +               goto out_unlock;
> +       }
> +
>         if ((length & (iopt->iova_alignment - 1)) || !length) {
>                 rc = -EINVAL;
>                 goto out_unlock;
> @@ -328,6 +333,7 @@ static void iopt_abort_area(struct iopt_area *area)
>                 WARN_ON(area->pages);
>         if (area->iopt) {
>                 down_write(&area->iopt->iova_rwsem);
> +               WARN_ON(iopt_lu_map_immutable(area->iopt));
>                 interval_tree_remove(&area->node, &area->iopt->area_itree);
>                 up_write(&area->iopt->iova_rwsem);
>         }
> @@ -755,6 +761,12 @@ static int iopt_unmap_iova_range(struct io_pagetable *iopt, unsigned long start,
>  again:
>         down_read(&iopt->domains_rwsem);
>         down_write(&iopt->iova_rwsem);
> +
> +       if (iopt_lu_map_immutable(iopt)) {
> +               rc = -EBUSY;
> +               goto out_unlock_iova;
> +       }
> +
>         while ((area = iopt_area_iter_first(iopt, start, last))) {
>                 unsigned long area_last = iopt_area_last_iova(area);
>                 unsigned long area_first = iopt_area_iova(area);
> @@ -1398,6 +1410,11 @@ int iopt_cut_iova(struct io_pagetable *iopt, unsigned long *iovas,
>         int i;
>
>         down_write(&iopt->iova_rwsem);
> +       if (iopt_lu_map_immutable(iopt)) {
> +               up_write(&iopt->iova_rwsem);
> +               return -EBUSY;
> +       }
> +
>         for (i = 0; i < num_iovas; i++) {
>                 struct iopt_area *area;
>
> diff --git a/drivers/iommu/iommufd/io_pagetable.h b/drivers/iommu/iommufd/io_pagetable.h
> index 14cd052fd320..b64cb4cf300c 100644
> --- a/drivers/iommu/iommufd/io_pagetable.h
> +++ b/drivers/iommu/iommufd/io_pagetable.h
> @@ -234,6 +234,7 @@ struct iopt_pages {
>                 struct {                        /* IOPT_ADDRESS_FILE */
>                         struct file *file;
>                         unsigned long start;
> +                       u32 seals;
>                 };
>                 /* IOPT_ADDRESS_DMABUF */
>                 struct iopt_pages_dmabuf dmabuf;
> diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
> index 6424e7cea5b2..f8366a23999f 100644
> --- a/drivers/iommu/iommufd/iommufd_private.h
> +++ b/drivers/iommu/iommufd/iommufd_private.h
> @@ -94,6 +94,9 @@ struct io_pagetable {
>         /* IOVA that cannot be allocated, struct iopt_reserved */
>         struct rb_root_cached reserved_itree;
>         u8 disable_large_pages;
> +#ifdef CONFIG_IOMMU_LIVEUPDATE
> +       bool lu_map_immutable;
> +#endif
>         unsigned long iova_alignment;
>  };
>
> @@ -712,12 +715,34 @@ iommufd_get_vdevice(struct iommufd_ctx *ictx, u32 id)
>  }
>
>  #ifdef CONFIG_IOMMU_LIVEUPDATE
> +int iommufd_liveupdate_register_lufs(void);
> +int iommufd_liveupdate_unregister_lufs(void);
> +
>  int iommufd_hwpt_lu_set_preserve(struct iommufd_ucmd *ucmd);
> +static inline bool iopt_lu_map_immutable(const struct io_pagetable *iopt)
> +{
> +       return iopt->lu_map_immutable;
> +}
>  #else
> +static inline int iommufd_liveupdate_register_lufs(void)
> +{
> +       return 0;
> +}
> +
> +static inline int iommufd_liveupdate_unregister_lufs(void)
> +{
> +       return 0;
> +}
> +
>  static inline int iommufd_hwpt_lu_set_preserve(struct iommufd_ucmd *ucmd)
>  {
>         return -ENOTTY;
>  }
> +
> +static inline bool iopt_lu_map_immutable(const struct io_pagetable *iopt)
> +{
> +       return false;
> +}
>  #endif
>
>  #ifdef CONFIG_IOMMUFD_TEST
> diff --git a/drivers/iommu/iommufd/liveupdate.c b/drivers/iommu/iommufd/liveupdate.c
> index ae74f5b54735..ec11ae345fe7 100644
> --- a/drivers/iommu/iommufd/liveupdate.c
> +++ b/drivers/iommu/iommufd/liveupdate.c
> @@ -4,9 +4,15 @@
>
>  #include <linux/file.h>
>  #include <linux/iommufd.h>
> +#include <linux/kexec_handover.h>
> +#include <linux/kho/abi/iommufd.h>
>  #include <linux/liveupdate.h>
> +#include <linux/iommu-lu.h>
> +#include <linux/mm.h>
> +#include <linux/pci.h>
>
>  #include "iommufd_private.h"
> +#include "io_pagetable.h"
>
>  int iommufd_hwpt_lu_set_preserve(struct iommufd_ucmd *ucmd)
>  {
> @@ -47,3 +53,297 @@ int iommufd_hwpt_lu_set_preserve(struct iommufd_ucmd *ucmd)
>         return rc;
>  }
>
> +static void iommufd_set_ioas_mutable(struct iommufd_ctx *ictx)
> +{
> +       struct iommufd_object *obj;
> +       struct iommufd_ioas *ioas;
> +       unsigned long index;
> +
> +       xa_lock(&ictx->objects);
> +       xa_for_each(&ictx->objects, index, obj) {
> +               if (obj->type != IOMMUFD_OBJ_IOAS)
> +                       continue;
> +
> +               ioas = container_of(obj, struct iommufd_ioas, obj);
> +
> +               /*
> +                * Not taking any IOAS lock here. All writers take LUO
> +                * session mutex, and this writer racing with readers is not
> +                * really a problem.
> +                */
> +               WRITE_ONCE(ioas->iopt.lu_map_immutable, false);
> +       }
> +       xa_unlock(&ictx->objects);
> +}
> +
> +static int check_iopt_pages_preserved(struct liveupdate_session *s,
> +                                     struct iommufd_hwpt_paging *hwpt)
> +{
> +       u32 req_seals = F_SEAL_SEAL | F_SEAL_GROW | F_SEAL_SHRINK;
> +       struct iopt_area *area;
> +       int ret;
> +
> +       for (area = iopt_area_iter_first(&hwpt->ioas->iopt, 0, ULONG_MAX); area;
> +            area = iopt_area_iter_next(area, 0, ULONG_MAX)) {
> +               struct iopt_pages *pages = area->pages;
> +
> +               /* Only allow file based mapping */
> +               if (pages->type != IOPT_ADDRESS_FILE)
> +                       return -EINVAL;
> +
> +               /*
> +                * The memory file must have been sealed (including
> +                * F_SEAL_SEAL) when it was mapped, so it has not grown or
> +                * shrunk since the mapping was created and the pages in
> +                * use until now remain pinned and preserved.
> +                */
> +               if ((pages->seals & req_seals) != req_seals)
> +                       return -EINVAL;
> +
> +               /* Make sure that the file was preserved. */
> +               ret = liveupdate_get_token_outgoing(s, pages->file, NULL);
> +               if (ret)
> +                       return ret;
> +       }
> +
> +       return 0;
> +}
> +
> +static int iommufd_save_hwpts(struct iommufd_ctx *ictx,
> +                             struct iommufd_lu *iommufd_lu,
> +                             struct liveupdate_session *session)
> +{
> +       struct iommufd_hwpt_paging *hwpt, **hwpts = NULL;
> +       struct iommu_domain_ser *domain_ser;
> +       struct iommufd_hwpt_lu *hwpt_lu;
> +       struct iommufd_object *obj;
> +       unsigned int nr_hwpts = 0;
> +       unsigned long index;
> +       unsigned int i;
> +       int rc = 0;
> +
> +       if (iommufd_lu) {
> +               hwpts = kcalloc(iommufd_lu->nr_hwpts, sizeof(*hwpts),
> +                               GFP_KERNEL);
> +               if (!hwpts)
> +                       return -ENOMEM;
> +       }
> +
> +       xa_lock(&ictx->objects);
> +       xa_for_each(&ictx->objects, index, obj) {
> +               if (obj->type != IOMMUFD_OBJ_HWPT_PAGING)
> +                       continue;
> +
> +               hwpt = container_of(obj, struct iommufd_hwpt_paging, common.obj);
> +               if (!hwpt->lu_preserve)
> +                       continue;
> +
> +               if (hwpt->ioas) {
> +                       /*
> +                        * Obtain exclusive access to the IOAS and IOPT while we
> +                        * set immutability
> +                        */
> +                       mutex_lock(&hwpt->ioas->mutex);
> +                       down_write(&hwpt->ioas->iopt.domains_rwsem);
> +                       down_write(&hwpt->ioas->iopt.iova_rwsem);
> +
> +                       hwpt->ioas->iopt.lu_map_immutable = true;
> +
> +                       up_write(&hwpt->ioas->iopt.iova_rwsem);
> +                       up_write(&hwpt->ioas->iopt.domains_rwsem);
> +                       mutex_unlock(&hwpt->ioas->mutex);
> +               }
> +
> +               if (!hwpt->common.domain) {
> +                       rc = -EINVAL;
> +                       xa_unlock(&ictx->objects);
> +                       goto out;
> +               }
> +
> +               if (!iommufd_lu) {
> +                       rc = check_iopt_pages_preserved(session, hwpt);
> +                       if (rc) {
> +                               xa_unlock(&ictx->objects);
> +                               goto out;
> +                       }
> +               } else if (iommufd_lu) {
> +                       hwpts[nr_hwpts] = hwpt;
> +                       hwpt_lu = &iommufd_lu->hwpts[nr_hwpts];
> +
> +                       hwpt_lu->token = hwpt->lu_token;
> +                       hwpt_lu->reclaimed = false;
> +               }
> +
> +               nr_hwpts++;
> +       }
> +       xa_unlock(&ictx->objects);
> +
> +       if (WARN_ON(iommufd_lu && iommufd_lu->nr_hwpts != nr_hwpts)) {
> +               rc = -EFAULT;
> +               goto out;
> +       }
> +
> +       if (iommufd_lu) {
> +               /*
> +                * iommu_domain_preserve may sleep and must be called
> +                * outside of xa_lock
> +                */
> +               for (i = 0; i < nr_hwpts; i++) {
> +                       hwpt = hwpts[i];
> +                       hwpt_lu = &iommufd_lu->hwpts[i];
> +
> +                       rc = iommu_domain_preserve(hwpt->common.domain, &domain_ser);
> +                       if (rc < 0)
> +                               goto out;
> +
> +                       hwpt_lu->domain_data = __pa(domain_ser);
> +               }
> +       }
> +
> +       rc = nr_hwpts;
> +
> +out:
> +       kfree(hwpts);
> +       return rc;
> +}
> +
> +static int iommufd_liveupdate_preserve(struct liveupdate_file_op_args *args)
> +{
> +       struct iommufd_ctx *ictx = iommufd_ctx_from_file(args->file);
> +       struct iommufd_lu *iommufd_lu;
> +       size_t serial_size;
> +       void *mem;
> +       int rc;
> +
> +       if (IS_ERR(ictx))
> +               return PTR_ERR(ictx);
> +
> +       rc = iommufd_save_hwpts(ictx, NULL, args->session);
> +       if (rc < 0)
> +               goto err_ioas_mutable;
> +
> +       serial_size = struct_size(iommufd_lu, hwpts, rc);
> +
> +       mem = kho_alloc_preserve(serial_size);
> +       if (!mem) {
> +               rc = -ENOMEM;
> +               goto err_ioas_mutable;
> +       }
> +
> +       iommufd_lu = mem;
> +       iommufd_lu->nr_hwpts = rc;
> +       rc = iommufd_save_hwpts(ictx, iommufd_lu, args->session);
> +       if (rc < 0)
> +               goto err_free;
> +
> +       args->serialized_data = virt_to_phys(iommufd_lu);
> +       iommufd_ctx_put(ictx);
> +       return 0;
> +
> +err_free:
> +       kho_unpreserve_free(mem);
> +err_ioas_mutable:
> +       iommufd_set_ioas_mutable(ictx);
> +       iommufd_ctx_put(ictx);
> +       return rc;
> +}
> +
> +static int iommufd_liveupdate_freeze(struct liveupdate_file_op_args *args)
> +{
> +       /* No-Op; everything should be made read-only */
> +       return 0;
> +}
> +
> +static void iommufd_liveupdate_unpreserve(struct liveupdate_file_op_args *args)
> +{
> +       struct iommufd_ctx *ictx = iommufd_ctx_from_file(args->file);
> +       struct iommufd_hwpt_paging *hwpt;
> +       struct iommufd_object *obj;
> +       unsigned long index;
> +
> +       if (WARN_ON(IS_ERR(ictx)))
> +               return;
> +
> +       xa_lock(&ictx->objects);
> +       xa_for_each(&ictx->objects, index, obj) {
> +               if (obj->type != IOMMUFD_OBJ_HWPT_PAGING)
> +                       continue;
> +
> +               hwpt = container_of(obj, struct iommufd_hwpt_paging, common.obj);
> +               if (!hwpt->lu_preserve)
> +                       continue;
> +               if (!hwpt->common.domain)
> +                       continue;
> +
> +               iommu_domain_unpreserve(hwpt->common.domain);
> +       }
> +       xa_unlock(&ictx->objects);
> +
> +       kho_unpreserve_free(phys_to_virt(args->serialized_data));
> +
> +       iommufd_set_ioas_mutable(ictx);
> +       iommufd_ctx_put(ictx);
> +}
> +
> +static int iommufd_liveupdate_retrieve(struct liveupdate_file_op_args *args)
> +{
> +       return -EOPNOTSUPP;
> +}
> +
> +static bool iommufd_liveupdate_can_finish(struct liveupdate_file_op_args *args)
> +{
> +       return false;
> +}
> +
> +static void iommufd_liveupdate_finish(struct liveupdate_file_op_args *args)
> +{
> +}
> +
> +static bool iommufd_liveupdate_can_preserve(struct liveupdate_file_handler *handler,
> +                                           struct file *file)
> +{
> +       struct iommufd_ctx *ictx = iommufd_ctx_from_file(file);
> +
> +       if (IS_ERR(ictx))
> +               return false;
> +
> +       iommufd_ctx_put(ictx);
> +       return true;
> +}
> +
> +static struct liveupdate_file_ops iommufd_lu_file_ops = {
> +       .can_preserve = iommufd_liveupdate_can_preserve,
> +       .preserve = iommufd_liveupdate_preserve,
> +       .unpreserve = iommufd_liveupdate_unpreserve,
> +       .freeze = iommufd_liveupdate_freeze,
> +       .retrieve = iommufd_liveupdate_retrieve,
> +       .can_finish = iommufd_liveupdate_can_finish,
> +       .finish = iommufd_liveupdate_finish,
> +};
> +
> +static struct liveupdate_file_handler iommufd_lu_handler = {
> +       .compatible = IOMMUFD_LUO_COMPATIBLE,
> +       .ops = &iommufd_lu_file_ops,
> +};
> +
> +int iommufd_liveupdate_register_lufs(void)
> +{
> +       int ret;
> +
> +       ret = liveupdate_register_file_handler(&iommufd_lu_handler);
> +       if (ret)
> +               return ret;
> +
> +       ret = iommu_liveupdate_register_flb(&iommufd_lu_handler);
> +       if (ret)
> +               liveupdate_unregister_file_handler(&iommufd_lu_handler);
> +
> +       return ret;
> +}
> +
> +int iommufd_liveupdate_unregister_lufs(void)
> +{
> +       WARN_ON(iommu_liveupdate_unregister_flb(&iommufd_lu_handler));
> +
> +       return liveupdate_unregister_file_handler(&iommufd_lu_handler);
> +}
> diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
> index e1a9b3051f65..d7683244c67a 100644
> --- a/drivers/iommu/iommufd/main.c
> +++ b/drivers/iommu/iommufd/main.c
> @@ -775,11 +775,21 @@ static int __init iommufd_init(void)
>                 if (ret)
>                         goto err_misc;
>         }
> +
> +       if (IS_ENABLED(CONFIG_IOMMU_LIVEUPDATE)) {
> +               ret = iommufd_liveupdate_register_lufs();

Alex Williamson pointed out in the vfio-pci preservation series that
registering the file handler in module init/exit would make the module
unloadable, as LUO takes a reference on it. That problem will also occur
here, since the registration is done in module_init. For iommufd, the
register/unregister can be moved to iommufd open/release: register on
first iommufd open and unregister on last iommufd close, basically a
kref that gets incremented/decremented on open/release.

https://lore.kernel.org/all/20260225143328.35be89f6@shazbot.org/
> +               if (ret)
> +                       goto err_vfio_misc;
> +       }
> +
>         ret = iommufd_test_init();
>         if (ret)
> -               goto err_vfio_misc;
> +               goto err_lufs;
>         return 0;
>
> +err_lufs:
> +       if (IS_ENABLED(CONFIG_IOMMU_LIVEUPDATE))
> +               iommufd_liveupdate_unregister_lufs();
>  err_vfio_misc:
>         if (IS_ENABLED(CONFIG_IOMMUFD_VFIO_CONTAINER))
>                 misc_deregister(&vfio_misc_dev);
> @@ -791,6 +801,8 @@ static int __init iommufd_init(void)
>  static void __exit iommufd_exit(void)
>  {
>         iommufd_test_exit();
> +       if (IS_ENABLED(CONFIG_IOMMU_LIVEUPDATE))
> +               iommufd_liveupdate_unregister_lufs();
>         if (IS_ENABLED(CONFIG_IOMMUFD_VFIO_CONTAINER))
>                 misc_deregister(&vfio_misc_dev);
>         misc_deregister(&iommu_misc_dev);
> diff --git a/drivers/iommu/iommufd/pages.c b/drivers/iommu/iommufd/pages.c
> index dbe51ecb9a20..cc0e3265ba4e 100644
> --- a/drivers/iommu/iommufd/pages.c
> +++ b/drivers/iommu/iommufd/pages.c
> @@ -55,6 +55,7 @@
>  #include <linux/overflow.h>
>  #include <linux/slab.h>
>  #include <linux/sched/mm.h>
> +#include <linux/memfd.h>
>  #include <linux/vfio_pci_core.h>
>
>  #include "double_span.h"
> @@ -1420,6 +1421,7 @@ struct iopt_pages *iopt_alloc_file_pages(struct file *file,
>
>  {
>         struct iopt_pages *pages;
> +       int seals;
>
>         pages = iopt_alloc_pages(start_byte, length, writable);
>         if (IS_ERR(pages))
> @@ -1427,6 +1429,12 @@ struct iopt_pages *iopt_alloc_file_pages(struct file *file,
>         pages->file = get_file(file);
>         pages->start = start - start_byte;
>         pages->type = IOPT_ADDRESS_FILE;
> +
> +       pages->seals = 0;
> +       seals = memfd_get_seals(file);
> +       if (seals > 0)
> +               pages->seals = seals;
> +
>         return pages;
>  }
>
> diff --git a/include/linux/kho/abi/iommufd.h b/include/linux/kho/abi/iommufd.h
> new file mode 100644
> index 000000000000..f7393ac78aa9
> --- /dev/null
> +++ b/include/linux/kho/abi/iommufd.h
> @@ -0,0 +1,39 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +/*
> + * Copyright (C) 2025, Google LLC
> + * Author: Samiullah Khawaja <skhawaja@google.com>
> + */
> +
> +#ifndef _LINUX_KHO_ABI_IOMMUFD_H
> +#define _LINUX_KHO_ABI_IOMMUFD_H
> +
> +#include <linux/mutex_types.h>
> +#include <linux/compiler.h>
> +#include <linux/types.h>
> +
> +/**
> + * DOC: IOMMUFD Live Update ABI
> + *
> + * This header defines the ABI for preserving the state of an IOMMUFD file
> + * across a kexec reboot using LUO.
> + *
> + * This interface is a contract. Any modification to any of the serialization
> + * structs defined here constitutes a breaking change. Such changes require
> + * incrementing the version number in the IOMMUFD_LUO_COMPATIBLE string.
> + */
> +
> +#define IOMMUFD_LUO_COMPATIBLE "iommufd-v1"
> +
> +struct iommufd_hwpt_lu {
> +       u32 token;
> +       u64 domain_data;
> +       bool reclaimed;
> +} __packed;
> +
> +struct iommufd_lu {
> +       unsigned int nr_hwpts;
> +       struct iommufd_hwpt_lu hwpts[];
> +};
> +
> +#endif /* _LINUX_KHO_ABI_IOMMUFD_H */
> --
> 2.53.0.rc2.204.g2597b5adb4-goog
>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 11/14] iommufd-lu: Persist iommu hardware pagetables for live update
  2026-02-03 22:09 ` [PATCH 11/14] iommufd-lu: Persist iommu hardware pagetables for live update Samiullah Khawaja
  2026-02-25 23:47   ` Samiullah Khawaja
@ 2026-03-03  5:56   ` Ankit Soni
  2026-03-03 18:51     ` Samiullah Khawaja
  2026-03-23 20:28   ` Vipin Sharma
  2026-03-25 20:08   ` Pranjal Shrivastava
  3 siblings, 1 reply; 98+ messages in thread
From: Ankit Soni @ 2026-03-03  5:56 UTC (permalink / raw)
  To: Samiullah Khawaja
  Cc: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe, YiFei Zhu, Robin Murphy, Kevin Tian,
	Alex Williamson, Shuah Khan, iommu, linux-kernel, kvm,
	Saeed Mahameed, Adithya Jayachandran, Parav Pandit,
	Leon Romanovsky, William Tu, Pratyush Yadav, Pasha Tatashin,
	David Matlack, Andrew Morton, Chris Li, Pranjal Shrivastava,
	Vipin Sharma

Hi,

On Tue, Feb 03, 2026 at 10:09:45PM +0000, Samiullah Khawaja wrote:
> From: YiFei Zhu <zhuyifei@google.com>
> 
> The caller is expected to mark each HWPT to be preserved with an ioctl
> call, using a token that will be used in restore. At preserve time,
> iommu_domain_preserve() is then called on each HWPT's domain to
> preserve the iommu domain.
> 
> HWPTs containing dma mappings backed by unpreserved memory should not
> be preserved. During preservation, check that the mappings contained in
> the HWPT being preserved are all file based and that all the files are
> preserved.
> 
> The memfd file preservation check alone is not enough when preserving
> iommufd. The memfd might have shrunk between mapping and memfd
> preservation, in which case some pages that are currently pinned by
> iommu mappings would not be preserved with the memfd. Only allow
> iommufd preservation when all the iopt_pages are file backed and the
> memory file's seals were themselves sealed (F_SEAL_SEAL) at mapping
> time. This guarantees that all the pages backing the memfd when it was
> mapped are preserved.
> 
> Once a HWPT is preserved, the iopt associated with the HWPT is made
> immutable. The map and unmap ioctls operate directly on the iopt, which
> contains an array of domains, while each hwpt contains only one domain.
> The logic then becomes that mapping and unmapping is prohibited if any
> of the domains in an iopt belongs to a preserved hwpt. However, tracing
> to the hwpt through the domain is a lot more tedious than tracing
> through the ioas, so if a hwpt is preserved, hwpt->ioas->iopt is made
> immutable.
> 
> When undoing this (making the iopts mutable again), there is never a
> need to make some iopts mutable while keeping others immutable, since
> the undo only happens on unpreserve and on the error path of preserve.
> Simply iterate over all the ioas and clear the immutability flag on all
> their iopts.
> 
> Signed-off-by: YiFei Zhu <zhuyifei@google.com>
> Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
> ---
>  drivers/iommu/iommufd/io_pagetable.c    |  17 ++
>  drivers/iommu/iommufd/io_pagetable.h    |   1 +
>  drivers/iommu/iommufd/iommufd_private.h |  25 ++
>  drivers/iommu/iommufd/liveupdate.c      | 300 ++++++++++++++++++++++++
>  drivers/iommu/iommufd/main.c            |  14 +-
>  drivers/iommu/iommufd/pages.c           |   8 +
>  include/linux/kho/abi/iommufd.h         |  39 +++
>  7 files changed, 403 insertions(+), 1 deletion(-)
>  create mode 100644 include/linux/kho/abi/iommufd.h
> 
> +
> +static int iommufd_save_hwpts(struct iommufd_ctx *ictx,
> +			      struct iommufd_lu *iommufd_lu,
> +			      struct liveupdate_session *session)
> +{
> +	struct iommufd_hwpt_paging *hwpt, **hwpts = NULL;
> +	struct iommu_domain_ser *domain_ser;
> +	struct iommufd_hwpt_lu *hwpt_lu;
> +	struct iommufd_object *obj;
> +	unsigned int nr_hwpts = 0;
> +	unsigned long index;
> +	unsigned int i;
> +	int rc = 0;
> +
> +	if (iommufd_lu) {
> +		hwpts = kcalloc(iommufd_lu->nr_hwpts, sizeof(*hwpts),
> +				GFP_KERNEL);
> +		if (!hwpts)
> +			return -ENOMEM;
> +	}
> +
> +	xa_lock(&ictx->objects);
> +	xa_for_each(&ictx->objects, index, obj) {
> +		if (obj->type != IOMMUFD_OBJ_HWPT_PAGING)
> +			continue;
> +
> +		hwpt = container_of(obj, struct iommufd_hwpt_paging, common.obj);
> +		if (!hwpt->lu_preserve)
> +			continue;
> +
> +		if (hwpt->ioas) {
> +			/*
> +			 * Obtain exclusive access to the IOAS and IOPT while we
> +			 * set immutability
> +			 */
> +			mutex_lock(&hwpt->ioas->mutex);
> +			down_write(&hwpt->ioas->iopt.domains_rwsem);
> +			down_write(&hwpt->ioas->iopt.iova_rwsem);

Taking mutex/rwsem under spin-lock is not a good idea.

> +
> +			hwpt->ioas->iopt.lu_map_immutable = true;
> +
> +			up_write(&hwpt->ioas->iopt.iova_rwsem);
> +			up_write(&hwpt->ioas->iopt.domains_rwsem);
> +			mutex_unlock(&hwpt->ioas->mutex);
> +		}
> +
> +		if (!hwpt->common.domain) {
> +			rc = -EINVAL;
> +			xa_unlock(&ictx->objects);
> +			goto out;
> +		}
> +
> +		if (!iommufd_lu) {
> +			rc = check_iopt_pages_preserved(session, hwpt);
> +			if (rc) {
> +				xa_unlock(&ictx->objects);
> +				goto out;
> +			}
> +		} else if (iommufd_lu) {

Redundant else_if().

-Ankit

> +			hwpts[nr_hwpts] = hwpt;
> +			hwpt_lu = &iommufd_lu->hwpts[nr_hwpts];
> +
> +			hwpt_lu->token = hwpt->lu_token;
> +			hwpt_lu->reclaimed = false;
> +		}
> +
> +		nr_hwpts++;
> +	}
> +	xa_unlock(&ictx->objects);
> +


* Re: [PATCH 04/14] iommu/pages: Add APIs to preserve/unpreserve/restore iommu pages
  2026-02-03 22:09 ` [PATCH 04/14] iommu/pages: Add APIs to preserve/unpreserve/restore iommu pages Samiullah Khawaja
@ 2026-03-03 16:42   ` Ankit Soni
  2026-03-03 18:41     ` Samiullah Khawaja
  2026-03-17 20:59   ` Vipin Sharma
  1 sibling, 1 reply; 98+ messages in thread
From: Ankit Soni @ 2026-03-03 16:42 UTC (permalink / raw)
  To: Samiullah Khawaja
  Cc: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe, Robin Murphy, Kevin Tian, Alex Williamson,
	Shuah Khan, iommu, linux-kernel, kvm, Saeed Mahameed,
	Adithya Jayachandran, Parav Pandit, Leon Romanovsky, William Tu,
	Pratyush Yadav, Pasha Tatashin, David Matlack, Andrew Morton,
	Chris Li, Pranjal Shrivastava, Vipin Sharma, YiFei Zhu

On Tue, Feb 03, 2026 at 10:09:38PM +0000, Samiullah Khawaja wrote:
> IOMMU pages are allocated and freed through APIs that operate on
> struct ioptdesc. Add helper functions for the proper preservation and
> restoration of ioptdesc.
> 
> Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
> ---
>  drivers/iommu/iommu-pages.c | 74 +++++++++++++++++++++++++++++++++++++
>  drivers/iommu/iommu-pages.h | 30 +++++++++++++++
>  2 files changed, 104 insertions(+)
> 
> diff --git a/drivers/iommu/iommu-pages.c b/drivers/iommu/iommu-pages.c
> index 3bab175d8557..588a8f19b196 100644
> --- a/drivers/iommu/iommu-pages.c
> +++ b/drivers/iommu/iommu-pages.c
> @@ -6,6 +6,7 @@
>  #include "iommu-pages.h"
>  #include <linux/dma-mapping.h>
>  #include <linux/gfp.h>
> +#include <linux/kexec_handover.h>
>  #include <linux/mm.h>
>  
>  #define IOPTDESC_MATCH(pg_elm, elm)                    \
> @@ -131,6 +132,79 @@ void iommu_put_pages_list(struct iommu_pages_list *list)
>  }
>  EXPORT_SYMBOL_GPL(iommu_put_pages_list);
>  
> +#if IS_ENABLED(CONFIG_IOMMU_LIVEUPDATE)
> +void iommu_unpreserve_page(void *virt)
> +{
> +	kho_unpreserve_folio(ioptdesc_folio(virt_to_ioptdesc(virt)));
> +}
> +EXPORT_SYMBOL_GPL(iommu_unpreserve_page);
> +
> +int iommu_preserve_page(void *virt)
> +{
> +	return kho_preserve_folio(ioptdesc_folio(virt_to_ioptdesc(virt)));
> +}
> +EXPORT_SYMBOL_GPL(iommu_preserve_page);
> +
> +void iommu_unpreserve_pages(struct iommu_pages_list *list, int count)
> +{
> +	struct ioptdesc *iopt;
> +
> +	if (!count)
> +		return;
> +
> +	/* If less than zero then unpreserve all pages. */
> +	if (count < 0)
> +		count = 0;
> +
> +	list_for_each_entry(iopt, &list->pages, iopt_freelist_elm) {
> +		kho_unpreserve_folio(ioptdesc_folio(iopt));
> +		if (count > 0 && --count ==  0)
> +			break;
> +	}
> +}
> +EXPORT_SYMBOL_GPL(iommu_unpreserve_pages);
> +
> +void iommu_restore_page(u64 phys)
> +{
> +	struct ioptdesc *iopt;
> +	struct folio *folio;
> +	unsigned long pgcnt;
> +	unsigned int order;
> +
> +	folio = kho_restore_folio(phys);
> +	BUG_ON(!folio);
> +
> +	iopt = folio_ioptdesc(folio);

iopt->incoherent = false; should be here?

> +
> +	order = folio_order(folio);
> +	pgcnt = 1UL << order;
> +	mod_node_page_state(folio_pgdat(folio), NR_IOMMU_PAGES, pgcnt);
> +	lruvec_stat_mod_folio(folio, NR_SECONDARY_PAGETABLE, pgcnt);
> +}
> +EXPORT_SYMBOL_GPL(iommu_restore_page);
> +
> +int iommu_preserve_pages(struct iommu_pages_list *list)
> +{
> +	struct ioptdesc *iopt;
> +	int count = 0;
> +	int ret;
> +
> +	list_for_each_entry(iopt, &list->pages, iopt_freelist_elm) {
> +		ret = kho_preserve_folio(ioptdesc_folio(iopt));
> +		if (ret) {
> +			iommu_unpreserve_pages(list, count);
> +			return ret;
> +		}
> +
> +		++count;
> +	}
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(iommu_preserve_pages);
> +
> +#endif
> +
>  /**
>   * iommu_pages_start_incoherent - Setup the page for cache incoherent operation
>   * @virt: The page to setup
> diff --git a/drivers/iommu/iommu-pages.h b/drivers/iommu/iommu-pages.h
> index ae9da4f571f6..bd336fb56b5f 100644
> --- a/drivers/iommu/iommu-pages.h
> +++ b/drivers/iommu/iommu-pages.h
> @@ -53,6 +53,36 @@ void *iommu_alloc_pages_node_sz(int nid, gfp_t gfp, size_t size);
>  void iommu_free_pages(void *virt);
>  void iommu_put_pages_list(struct iommu_pages_list *list);
>  
> +#if IS_ENABLED(CONFIG_IOMMU_LIVEUPDATE)
> +int iommu_preserve_page(void *virt);
> +void iommu_unpreserve_page(void *virt);
> +int iommu_preserve_pages(struct iommu_pages_list *list);
> +void iommu_unpreserve_pages(struct iommu_pages_list *list, int count);
> +void iommu_restore_page(u64 phys);
> +#else
> +static inline int iommu_preserve_page(void *virt)
> +{
> +	return -EOPNOTSUPP;
> +}
> +
> +static inline void iommu_unpreserve_page(void *virt)
> +{
> +}
> +
> +static inline int iommu_preserve_pages(struct iommu_pages_list *list)
> +{
> +	return -EOPNOTSUPP;
> +}
> +
> +static inline void iommu_unpreserve_pages(struct iommu_pages_list *list, int count)
> +{
> +}
> +
> +static inline void iommu_restore_page(u64 phys)
> +{
> +}
> +#endif
> +
>  /**
>   * iommu_pages_list_add - add the page to a iommu_pages_list
>   * @list: List to add the page to
> -- 
> 2.53.0.rc2.204.g2597b5adb4-goog
> 


* Re: [PATCH 13/14] vfio/pci: Preserve the iommufd state of the vfio cdev
  2026-02-17  4:18   ` Ankit Soni
@ 2026-03-03 18:35     ` Samiullah Khawaja
  0 siblings, 0 replies; 98+ messages in thread
From: Samiullah Khawaja @ 2026-03-03 18:35 UTC (permalink / raw)
  To: Ankit Soni
  Cc: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe, Robin Murphy, Kevin Tian, Alex Williamson,
	Shuah Khan, iommu, linux-kernel, kvm, Saeed Mahameed,
	Adithya Jayachandran, Parav Pandit, Leon Romanovsky, William Tu,
	Pratyush Yadav, Pasha Tatashin, David Matlack, Andrew Morton,
	Chris Li, Pranjal Shrivastava, Vipin Sharma, YiFei Zhu

On Tue, Feb 17, 2026 at 04:18:08AM +0000, Ankit Soni wrote:
>On Tue, Feb 03, 2026 at 10:09:47PM +0000, Samiullah Khawaja wrote:
>> If the vfio cdev is attached to an iommufd, also preserve the state of
>> the attached iommufd: the iommu state of the device as well as the
>> attached domain. The token returned by the preservation API will be
>> used to restore/rebind to the iommufd state after live update.
>>
>> Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
>> ---
>>  drivers/vfio/pci/vfio_pci_liveupdate.c | 28 +++++++++++++++++++++++++-
>>  include/linux/kho/abi/vfio_pci.h       | 10 +++++++++
>>  2 files changed, 37 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/vfio/pci/vfio_pci_liveupdate.c b/drivers/vfio/pci/vfio_pci_liveupdate.c
>> index c52d6bdb455f..af6fbfb7a65c 100644
>> --- a/drivers/vfio/pci/vfio_pci_liveupdate.c
>> +++ b/drivers/vfio/pci/vfio_pci_liveupdate.c
>> @@ -15,6 +15,7 @@
>>  #include <linux/liveupdate.h>
>>  #include <linux/errno.h>
>>  #include <linux/vfio.h>
>> +#include <linux/iommufd.h>
>>
>>  #include "vfio_pci_priv.h"
>>
>> @@ -39,6 +40,7 @@ static int vfio_pci_liveupdate_preserve(struct liveupdate_file_op_args *args)
>>  	struct vfio_pci_core_device_ser *ser;
>>  	struct vfio_pci_core_device *vdev;
>>  	struct pci_dev *pdev;
>> +	u64 token = 0;
>>
>>  	vdev = container_of(device, struct vfio_pci_core_device, vdev);
>>  	pdev = vdev->pdev;
>> @@ -49,15 +51,32 @@ static int vfio_pci_liveupdate_preserve(struct liveupdate_file_op_args *args)
>>  	if (vfio_pci_is_intel_display(pdev))
>>  		return -EINVAL;
>>
>> +#if CONFIG_IOMMU_LIVEUPDATE
>> +	/* If iommufd is attached, preserve the underlying domain */
>> +	if (device->iommufd_attached) {
>> +		int err = iommufd_device_preserve(args->session,
>> +						  device->iommufd_device,
>> +						  &token);
>> +		if (err < 0)
>> +			return err;
>> +	}
>> +#endif
>> +
>>  	ser = kho_alloc_preserve(sizeof(*ser));
>> -	if (IS_ERR(ser))
>> +	if (IS_ERR(ser)) {
>> +		if (device->iommufd_attached)
>> +			iommufd_device_unpreserve(args->session,
>> +						  device->iommufd_device, token);
>> +
>
>To use iommufd_device_preserve()/iommufd_device_unpreserve(), it looks
>like the IOMMUFD namespace import is missing here: MODULE_IMPORT_NS("IOMMUFD");
>
>-Ankit

Agreed, I will add it to this file in v2.
>
>>  		return PTR_ERR(ser);
>> +	}
>>
>>  	pci_liveupdate_outgoing_preserve(pdev);
>>
>>  	ser->bdf = pci_dev_id(pdev);
>>  	ser->domain = pci_domain_nr(pdev->bus);
>>  	ser->reset_works = vdev->reset_works;
>> +	ser->iommufd_ser.token = token;
>>
>>  	args->serialized_data = virt_to_phys(ser);
>>  	return 0;
>> @@ -66,6 +85,13 @@ static int vfio_pci_liveupdate_preserve(struct liveupdate_file_op_args *args)
>>  static void vfio_pci_liveupdate_unpreserve(struct liveupdate_file_op_args *args)
>>  {
>>  	struct vfio_device *device = vfio_device_from_file(args->file);
>> +	struct vfio_pci_core_device_ser *ser;
>> +
>> +	ser = phys_to_virt(args->serialized_data);
>> +	if (device->iommufd_attached)
>> +		iommufd_device_unpreserve(args->session,
>> +					  device->iommufd_device,
>> +					  ser->iommufd_ser.token);
>>
>>  	pci_liveupdate_outgoing_unpreserve(to_pci_dev(device->dev));
>>  	kho_unpreserve_free(phys_to_virt(args->serialized_data));
>> diff --git a/include/linux/kho/abi/vfio_pci.h b/include/linux/kho/abi/vfio_pci.h
>> index 6c3d3c6dfc09..d01bd58711c2 100644
>> --- a/include/linux/kho/abi/vfio_pci.h
>> +++ b/include/linux/kho/abi/vfio_pci.h
>> @@ -28,6 +28,15 @@
>>
>>  #define VFIO_PCI_LUO_FH_COMPATIBLE "vfio-pci-v1"
>>
>> +/**
>> + * struct vfio_iommufd_ser - Serialized state of the attached iommufd.
>> + *
>> + * @token: The token of the bound iommufd state.
>> + */
>> +struct vfio_iommufd_ser {
>> +	u32 token;
>> +} __packed;
>> +
>>  /**
>>   * struct vfio_pci_core_device_ser - Serialized state of a single VFIO PCI
>>   * device.
>> @@ -40,6 +49,7 @@ struct vfio_pci_core_device_ser {
>>  	u16 bdf;
>>  	u16 domain;
>>  	u8 reset_works;
>> +	struct vfio_iommufd_ser iommufd_ser;
>>  } __packed;
>>
>>  #endif /* _LINUX_LIVEUPDATE_ABI_VFIO_PCI_H */
>> --
>> 2.53.0.rc2.204.g2597b5adb4-goog
>>

Thanks for looking at this.

Sami


* Re: [PATCH 04/14] iommu/pages: Add APIs to preserve/unpreserve/restore iommu pages
  2026-03-03 16:42   ` Ankit Soni
@ 2026-03-03 18:41     ` Samiullah Khawaja
  2026-03-20 17:27       ` Pranjal Shrivastava
  0 siblings, 1 reply; 98+ messages in thread
From: Samiullah Khawaja @ 2026-03-03 18:41 UTC (permalink / raw)
  To: Ankit Soni
  Cc: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe, Robin Murphy, Kevin Tian, Alex Williamson,
	Shuah Khan, iommu, linux-kernel, kvm, Saeed Mahameed,
	Adithya Jayachandran, Parav Pandit, Leon Romanovsky, William Tu,
	Pratyush Yadav, Pasha Tatashin, David Matlack, Andrew Morton,
	Chris Li, Pranjal Shrivastava, Vipin Sharma, YiFei Zhu

On Tue, Mar 03, 2026 at 04:42:02PM +0000, Ankit Soni wrote:
>On Tue, Feb 03, 2026 at 10:09:38PM +0000, Samiullah Khawaja wrote:
>> IOMMU pages are allocated and freed through APIs that operate on
>> struct ioptdesc. Add helper functions for the proper preservation and
>> restoration of ioptdesc.
>>
>> Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
>> ---
>>  drivers/iommu/iommu-pages.c | 74 +++++++++++++++++++++++++++++++++++++
>>  drivers/iommu/iommu-pages.h | 30 +++++++++++++++
>>  2 files changed, 104 insertions(+)
>>
>> diff --git a/drivers/iommu/iommu-pages.c b/drivers/iommu/iommu-pages.c
>> index 3bab175d8557..588a8f19b196 100644
>> --- a/drivers/iommu/iommu-pages.c
>> +++ b/drivers/iommu/iommu-pages.c
>> @@ -6,6 +6,7 @@
>>  #include "iommu-pages.h"
>>  #include <linux/dma-mapping.h>
>>  #include <linux/gfp.h>
>> +#include <linux/kexec_handover.h>
>>  #include <linux/mm.h>
>>
>>  #define IOPTDESC_MATCH(pg_elm, elm)                    \
>> @@ -131,6 +132,79 @@ void iommu_put_pages_list(struct iommu_pages_list *list)
>>  }
>>  EXPORT_SYMBOL_GPL(iommu_put_pages_list);
>>
>> +#if IS_ENABLED(CONFIG_IOMMU_LIVEUPDATE)
>> +void iommu_unpreserve_page(void *virt)
>> +{
>> +	kho_unpreserve_folio(ioptdesc_folio(virt_to_ioptdesc(virt)));
>> +}
>> +EXPORT_SYMBOL_GPL(iommu_unpreserve_page);
>> +
>> +int iommu_preserve_page(void *virt)
>> +{
>> +	return kho_preserve_folio(ioptdesc_folio(virt_to_ioptdesc(virt)));
>> +}
>> +EXPORT_SYMBOL_GPL(iommu_preserve_page);
>> +
>> +void iommu_unpreserve_pages(struct iommu_pages_list *list, int count)
>> +{
>> +	struct ioptdesc *iopt;
>> +
>> +	if (!count)
>> +		return;
>> +
>> +	/* If less than zero then unpreserve all pages. */
>> +	if (count < 0)
>> +		count = 0;
>> +
>> +	list_for_each_entry(iopt, &list->pages, iopt_freelist_elm) {
>> +		kho_unpreserve_folio(ioptdesc_folio(iopt));
>> +		if (count > 0 && --count ==  0)
>> +			break;
>> +	}
>> +}
>> +EXPORT_SYMBOL_GPL(iommu_unpreserve_pages);
>> +
>> +void iommu_restore_page(u64 phys)
>> +{
>> +	struct ioptdesc *iopt;
>> +	struct folio *folio;
>> +	unsigned long pgcnt;
>> +	unsigned int order;
>> +
>> +	folio = kho_restore_folio(phys);
>> +	BUG_ON(!folio);
>> +
>> +	iopt = folio_ioptdesc(folio);
>
>iopt->incoherent = false; should be here?
>

Yes this should be set here. I will update this.
>> +
>> +	order = folio_order(folio);
>> +	pgcnt = 1UL << order;
>> +	mod_node_page_state(folio_pgdat(folio), NR_IOMMU_PAGES, pgcnt);
>> +	lruvec_stat_mod_folio(folio, NR_SECONDARY_PAGETABLE, pgcnt);
>> +}
>> +EXPORT_SYMBOL_GPL(iommu_restore_page);
>> +
>> +int iommu_preserve_pages(struct iommu_pages_list *list)
>> +{
>> +	struct ioptdesc *iopt;
>> +	int count = 0;
>> +	int ret;
>> +
>> +	list_for_each_entry(iopt, &list->pages, iopt_freelist_elm) {
>> +		ret = kho_preserve_folio(ioptdesc_folio(iopt));
>> +		if (ret) {
>> +			iommu_unpreserve_pages(list, count);
>> +			return ret;
>> +		}
>> +
>> +		++count;
>> +	}
>> +
>> +	return 0;
>> +}
>> +EXPORT_SYMBOL_GPL(iommu_preserve_pages);
>> +
>> +#endif
>> +
>>  /**
>>   * iommu_pages_start_incoherent - Setup the page for cache incoherent operation
>>   * @virt: The page to setup
>> diff --git a/drivers/iommu/iommu-pages.h b/drivers/iommu/iommu-pages.h
>> index ae9da4f571f6..bd336fb56b5f 100644
>> --- a/drivers/iommu/iommu-pages.h
>> +++ b/drivers/iommu/iommu-pages.h
>> @@ -53,6 +53,36 @@ void *iommu_alloc_pages_node_sz(int nid, gfp_t gfp, size_t size);
>>  void iommu_free_pages(void *virt);
>>  void iommu_put_pages_list(struct iommu_pages_list *list);
>>
>> +#if IS_ENABLED(CONFIG_IOMMU_LIVEUPDATE)
>> +int iommu_preserve_page(void *virt);
>> +void iommu_unpreserve_page(void *virt);
>> +int iommu_preserve_pages(struct iommu_pages_list *list);
>> +void iommu_unpreserve_pages(struct iommu_pages_list *list, int count);
>> +void iommu_restore_page(u64 phys);
>> +#else
>> +static inline int iommu_preserve_page(void *virt)
>> +{
>> +	return -EOPNOTSUPP;
>> +}
>> +
>> +static inline void iommu_unpreserve_page(void *virt)
>> +{
>> +}
>> +
>> +static inline int iommu_preserve_pages(struct iommu_pages_list *list)
>> +{
>> +	return -EOPNOTSUPP;
>> +}
>> +
>> +static inline void iommu_unpreserve_pages(struct iommu_pages_list *list, int count)
>> +{
>> +}
>> +
>> +static inline void iommu_restore_page(u64 phys)
>> +{
>> +}
>> +#endif
>> +
>>  /**
>>   * iommu_pages_list_add - add the page to a iommu_pages_list
>>   * @list: List to add the page to
>> --
>> 2.53.0.rc2.204.g2597b5adb4-goog
>>

Sami

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 11/14] iommufd-lu: Persist iommu hardware pagetables for live update
  2026-03-03  5:56   ` Ankit Soni
@ 2026-03-03 18:51     ` Samiullah Khawaja
  0 siblings, 0 replies; 98+ messages in thread
From: Samiullah Khawaja @ 2026-03-03 18:51 UTC (permalink / raw)
  To: Ankit Soni
  Cc: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe, YiFei Zhu, Robin Murphy, Kevin Tian,
	Alex Williamson, Shuah Khan, iommu, linux-kernel, kvm,
	Saeed Mahameed, Adithya Jayachandran, Parav Pandit,
	Leon Romanovsky, William Tu, Pratyush Yadav, Pasha Tatashin,
	David Matlack, Andrew Morton, Chris Li, Pranjal Shrivastava,
	Vipin Sharma

On Tue, Mar 03, 2026 at 05:56:15AM +0000, Ankit Soni wrote:
>Hi,
>
>On Tue, Feb 03, 2026 at 10:09:45PM +0000, Samiullah Khawaja wrote:
>> From: YiFei Zhu <zhuyifei@google.com>
>>
>> The caller is expected to mark each HWPT to be preserved with an ioctl
>> call, supplying a token that will be used at restore. At preserve time,
>> iommu_domain_preserve() is then called on each HWPT's domain to preserve
>> the iommu domain.
>>
>> HWPTs containing DMA mappings backed by unpreserved memory must not be
>> preserved. At preserve time, check that the mappings contained in the
>> HWPT being preserved are exclusively file backed and that all of the
>> backing files are themselves preserved.
>>
>> The memfd file preservation check alone is not enough when preserving
>> iommufd. The memfd might have shrunk between the mapping and the memfd
>> preservation; if it shrunk, pages that are currently pinned by iommu
>> mappings are not preserved with the memfd. So only allow iommufd
>> preservation when all the iopt_pages are file backed and the memory
>> file was sealed against shrinking during mapping. This guarantees that
>> all the pages that were backing the memfd when it was mapped are
>> preserved.
>>
>> Once an HWPT is preserved, the iopt associated with it is made
>> immutable. The map and unmap ioctls operate directly on the iopt,
>> which contains an array of domains, while each hwpt contains only one
>> domain. The rule is therefore that mapping and unmapping are
>> prohibited if any of the domains in an iopt belongs to a preserved
>> hwpt. However, tracing to the hwpt through the domain is far more
>> tedious than tracing through the ioas, so if an hwpt is preserved,
>> hwpt->ioas->iopt is made immutable.
>>
>> When undoing this (making the iopts mutable again), there is never a
>> need to make some iopts mutable while keeping others immutable, since
>> the undo only happens on unpreserve and on the error path of preserve.
>> Simply iterate over all the ioas objects and clear the immutability
>> flag on all their iopts.
>>
>> Signed-off-by: YiFei Zhu <zhuyifei@google.com>
>> Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
>> ---
>>  drivers/iommu/iommufd/io_pagetable.c    |  17 ++
>>  drivers/iommu/iommufd/io_pagetable.h    |   1 +
>>  drivers/iommu/iommufd/iommufd_private.h |  25 ++
>>  drivers/iommu/iommufd/liveupdate.c      | 300 ++++++++++++++++++++++++
>>  drivers/iommu/iommufd/main.c            |  14 +-
>>  drivers/iommu/iommufd/pages.c           |   8 +
>>  include/linux/kho/abi/iommufd.h         |  39 +++
>>  7 files changed, 403 insertions(+), 1 deletion(-)
>>  create mode 100644 include/linux/kho/abi/iommufd.h
>>
>> +
>> +static int iommufd_save_hwpts(struct iommufd_ctx *ictx,
>> +			      struct iommufd_lu *iommufd_lu,
>> +			      struct liveupdate_session *session)
>> +{
>> +	struct iommufd_hwpt_paging *hwpt, **hwpts = NULL;
>> +	struct iommu_domain_ser *domain_ser;
>> +	struct iommufd_hwpt_lu *hwpt_lu;
>> +	struct iommufd_object *obj;
>> +	unsigned int nr_hwpts = 0;
>> +	unsigned long index;
>> +	unsigned int i;
>> +	int rc = 0;
>> +
>> +	if (iommufd_lu) {
>> +		hwpts = kcalloc(iommufd_lu->nr_hwpts, sizeof(*hwpts),
>> +				GFP_KERNEL);
>> +		if (!hwpts)
>> +			return -ENOMEM;
>> +	}
>> +
>> +	xa_lock(&ictx->objects);
>> +	xa_for_each(&ictx->objects, index, obj) {
>> +		if (obj->type != IOMMUFD_OBJ_HWPT_PAGING)
>> +			continue;
>> +
>> +		hwpt = container_of(obj, struct iommufd_hwpt_paging, common.obj);
>> +		if (!hwpt->lu_preserve)
>> +			continue;
>> +
>> +		if (hwpt->ioas) {
>> +			/*
>> +			 * Obtain exclusive access to the IOAS and IOPT while we
>> +			 * set immutability
>> +			 */
>> +			mutex_lock(&hwpt->ioas->mutex);
>> +			down_write(&hwpt->ioas->iopt.domains_rwsem);
>> +			down_write(&hwpt->ioas->iopt.iova_rwsem);
>
>Taking mutex/rwsem under spin-lock is not a good idea.

Agreed. I will move this out by taking a reference on the object and
then setting it separately without xa_lock.
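
To make that concrete, here is a rough sketch (untested; whether
refcount_inc(&obj->users) is the right way to hold the object here is an
assumption) of splitting the loop into a lockless-sleep two-pass scheme:

```c
	/*
	 * Pass 1: under xa_lock only take references on the preserved
	 * HWPT_PAGING objects; no sleeping locks are acquired here.
	 */
	xa_lock(&ictx->objects);
	xa_for_each(&ictx->objects, index, obj) {
		if (obj->type != IOMMUFD_OBJ_HWPT_PAGING)
			continue;
		hwpt = container_of(obj, struct iommufd_hwpt_paging,
				    common.obj);
		if (!hwpt->lu_preserve)
			continue;
		refcount_inc(&obj->users);	/* hold the object */
		hwpts[nr_hwpts++] = hwpt;
	}
	xa_unlock(&ictx->objects);

	/*
	 * Pass 2: with the spinlock dropped it is safe to take the ioas
	 * mutex and the iopt rwsems to flip lu_map_immutable.
	 */
	for (i = 0; i < nr_hwpts; i++) {
		hwpt = hwpts[i];
		if (!hwpt->ioas)
			continue;
		mutex_lock(&hwpt->ioas->mutex);
		down_write(&hwpt->ioas->iopt.domains_rwsem);
		down_write(&hwpt->ioas->iopt.iova_rwsem);
		hwpt->ioas->iopt.lu_map_immutable = true;
		up_write(&hwpt->ioas->iopt.iova_rwsem);
		up_write(&hwpt->ioas->iopt.domains_rwsem);
		mutex_unlock(&hwpt->ioas->mutex);
	}
```

The references taken in pass 1 would of course need to be dropped on all
exit paths.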
>
>> +
>> +			hwpt->ioas->iopt.lu_map_immutable = true;
>> +
>> +			up_write(&hwpt->ioas->iopt.iova_rwsem);
>> +			up_write(&hwpt->ioas->iopt.domains_rwsem);
>> +			mutex_unlock(&hwpt->ioas->mutex);
>> +		}
>> +
>> +		if (!hwpt->common.domain) {
>> +			rc = -EINVAL;
>> +			xa_unlock(&ictx->objects);
>> +			goto out;
>> +		}
>> +
>> +		if (!iommufd_lu) {
>> +			rc = check_iopt_pages_preserved(session, hwpt);
>> +			if (rc) {
>> +				xa_unlock(&ictx->objects);
>> +				goto out;
>> +			}
>> +		} else if (iommufd_lu) {
>
>Redundant else_if().

Will remove
>
>-Ankit
>
>> +			hwpts[nr_hwpts] = hwpt;
>> +			hwpt_lu = &iommufd_lu->hwpts[nr_hwpts];
>> +
>> +			hwpt_lu->token = hwpt->lu_token;
>> +			hwpt_lu->reclaimed = false;
>> +		}
>> +
>> +		nr_hwpts++;
>> +	}
>> +	xa_unlock(&ictx->objects);
>> +

Thanks for looking into this.

Sami

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 08/14] iommu: Restore and reattach preserved domains to devices
  2026-02-03 22:09 ` [PATCH 08/14] iommu: Restore and reattach preserved domains to devices Samiullah Khawaja
@ 2026-03-10  5:16   ` Ankit Soni
  2026-03-10 21:47     ` Samiullah Khawaja
  2026-03-22 21:59   ` Pranjal Shrivastava
  1 sibling, 1 reply; 98+ messages in thread
From: Ankit Soni @ 2026-03-10  5:16 UTC (permalink / raw)
  To: Samiullah Khawaja
  Cc: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe, Robin Murphy, Kevin Tian, Alex Williamson,
	Shuah Khan, iommu, linux-kernel, kvm, Saeed Mahameed,
	Adithya Jayachandran, Parav Pandit, Leon Romanovsky, William Tu,
	Pratyush Yadav, Pasha Tatashin, David Matlack, Andrew Morton,
	Chris Li, Pranjal Shrivastava, Vipin Sharma, YiFei Zhu

On Tue, Feb 03, 2026 at 10:09:42PM +0000, Samiullah Khawaja wrote:
> Restore the preserved domains by restoring the page tables using restore
> IOMMU domain op. Reattach the preserved domain to the device during
> default domain setup. While attaching, reuse the domain ID that was used
> in the previous kernel. The context entry setup is not needed as that is
> preserved during liveupdate.
> 
> Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
> ---
>  drivers/iommu/intel/iommu.c  | 40 ++++++++++++++++++------------
>  drivers/iommu/intel/iommu.h  |  3 ++-
>  drivers/iommu/intel/nested.c |  2 +-
>  drivers/iommu/iommu.c        | 47 ++++++++++++++++++++++++++++++++++--
>  drivers/iommu/liveupdate.c   | 31 ++++++++++++++++++++++++
>  include/linux/iommu-lu.h     |  8 ++++++
>  6 files changed, 112 insertions(+), 19 deletions(-)
> 
> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> index 8acb7f8a7627..83faad53f247 100644
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -1029,7 +1029,8 @@ static bool first_level_by_default(struct intel_iommu *iommu)
>  	return true;
>  }
>  
> -int domain_attach_iommu(struct dmar_domain *domain, struct intel_iommu *iommu)
> +int domain_attach_iommu(struct dmar_domain *domain, struct intel_iommu *iommu,
> +			int restore_did)
>  {
>  	struct iommu_domain_info *info, *curr;
>  	int num, ret = -ENOSPC;
> @@ -1049,8 +1050,11 @@ int domain_attach_iommu(struct dmar_domain *domain, struct intel_iommu *iommu)
>  		return 0;
>  	}
>  
> -	num = ida_alloc_range(&iommu->domain_ida, IDA_START_DID,
> -			      cap_ndoms(iommu->cap) - 1, GFP_KERNEL);
> +	if (restore_did >= 0)
> +		num = restore_did;
> +	else
> +		num = ida_alloc_range(&iommu->domain_ida, IDA_START_DID,
> +				      cap_ndoms(iommu->cap) - 1, GFP_KERNEL);
>  	if (num < 0) {
>  		pr_err("%s: No free domain ids\n", iommu->name);
>  		goto err_unlock;
> @@ -1321,10 +1325,14 @@ static int dmar_domain_attach_device(struct dmar_domain *domain,
>  {
>  	struct device_domain_info *info = dev_iommu_priv_get(dev);
>  	struct intel_iommu *iommu = info->iommu;
> +	struct device_ser *device_ser = NULL;
>  	unsigned long flags;
>  	int ret;
>  
> -	ret = domain_attach_iommu(domain, iommu);
> +	device_ser = dev_iommu_restored_state(dev);
> +
> +	ret = domain_attach_iommu(domain, iommu,
> +				  dev_iommu_restore_did(dev, &domain->domain));
>  	if (ret)
>  		return ret;
>  
> @@ -1337,16 +1345,18 @@ static int dmar_domain_attach_device(struct dmar_domain *domain,
>  	if (dev_is_real_dma_subdevice(dev))
>  		return 0;
>  
> -	if (!sm_supported(iommu))
> -		ret = domain_context_mapping(domain, dev);
> -	else if (intel_domain_is_fs_paging(domain))
> -		ret = domain_setup_first_level(iommu, domain, dev,
> -					       IOMMU_NO_PASID, NULL);
> -	else if (intel_domain_is_ss_paging(domain))
> -		ret = domain_setup_second_level(iommu, domain, dev,
> -						IOMMU_NO_PASID, NULL);
> -	else if (WARN_ON(true))
> -		ret = -EINVAL;
> +	if (!device_ser) {
> +		if (!sm_supported(iommu))
> +			ret = domain_context_mapping(domain, dev);
> +		else if (intel_domain_is_fs_paging(domain))
> +			ret = domain_setup_first_level(iommu, domain, dev,
> +						       IOMMU_NO_PASID, NULL);
> +		else if (intel_domain_is_ss_paging(domain))
> +			ret = domain_setup_second_level(iommu, domain, dev,
> +							IOMMU_NO_PASID, NULL);
> +		else if (WARN_ON(true))
> +			ret = -EINVAL;
> +	}
>  
>  	if (ret)
>  		goto out_block_translation;
> @@ -3630,7 +3640,7 @@ domain_add_dev_pasid(struct iommu_domain *domain,
>  	if (!dev_pasid)
>  		return ERR_PTR(-ENOMEM);
>  
> -	ret = domain_attach_iommu(dmar_domain, iommu);
> +	ret = domain_attach_iommu(dmar_domain, iommu, -1);
>  	if (ret)
>  		goto out_free;
>  
> diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
> index d7bf63aff17d..057bd6035d85 100644
> --- a/drivers/iommu/intel/iommu.h
> +++ b/drivers/iommu/intel/iommu.h
> @@ -1174,7 +1174,8 @@ void __iommu_flush_iotlb(struct intel_iommu *iommu, u16 did, u64 addr,
>   */
>  #define QI_OPT_WAIT_DRAIN		BIT(0)
>  
> -int domain_attach_iommu(struct dmar_domain *domain, struct intel_iommu *iommu);
> +int domain_attach_iommu(struct dmar_domain *domain, struct intel_iommu *iommu,
> +			int restore_did);
>  void domain_detach_iommu(struct dmar_domain *domain, struct intel_iommu *iommu);
>  void device_block_translation(struct device *dev);
>  int paging_domain_compatible(struct iommu_domain *domain, struct device *dev);
> diff --git a/drivers/iommu/intel/nested.c b/drivers/iommu/intel/nested.c
> index a3fb8c193ca6..4fed9f5981e5 100644
> --- a/drivers/iommu/intel/nested.c
> +++ b/drivers/iommu/intel/nested.c
> @@ -40,7 +40,7 @@ static int intel_nested_attach_dev(struct iommu_domain *domain,
>  		return ret;
>  	}
>  
> -	ret = domain_attach_iommu(dmar_domain, iommu);
> +	ret = domain_attach_iommu(dmar_domain, iommu, -1);
>  	if (ret) {
>  		dev_err_ratelimited(dev, "Failed to attach domain to iommu\n");
>  		return ret;
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index c0632cb5b570..8103b5372364 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -18,6 +18,7 @@
>  #include <linux/errno.h>
>  #include <linux/host1x_context_bus.h>
>  #include <linux/iommu.h>
> +#include <linux/iommu-lu.h>
>  #include <linux/iommufd.h>
>  #include <linux/idr.h>
>  #include <linux/err.h>
> @@ -489,6 +490,10 @@ static int iommu_init_device(struct device *dev)
>  		goto err_free;
>  	}
>  
> +#ifdef CONFIG_IOMMU_LIVEUPDATE
> +	dev->iommu->device_ser = iommu_get_device_preserved_data(dev);
> +#endif
> +
>  	iommu_dev = ops->probe_device(dev);
>  	if (IS_ERR(iommu_dev)) {
>  		ret = PTR_ERR(iommu_dev);
> @@ -2149,6 +2154,13 @@ static int __iommu_attach_device(struct iommu_domain *domain,
>  	ret = domain->ops->attach_dev(domain, dev, old);
>  	if (ret)
>  		return ret;
> +
> +#ifdef CONFIG_IOMMU_LIVEUPDATE
> +	/* The associated state can be unset once restored. */
> +	if (dev_iommu_restored_state(dev))
> +		WRITE_ONCE(dev->iommu->device_ser, NULL);
> +#endif
> +
>  	dev->iommu->attach_deferred = 0;
>  	trace_attach_device_to_domain(dev);
>  	return 0;
> @@ -3061,6 +3073,34 @@ int iommu_fwspec_add_ids(struct device *dev, const u32 *ids, int num_ids)
>  }
>  EXPORT_SYMBOL_GPL(iommu_fwspec_add_ids);
>  
> +static struct iommu_domain *__iommu_group_maybe_restore_domain(struct iommu_group *group)
> +{
> +	struct device_ser *device_ser;
> +	struct iommu_domain *domain;
> +	struct device *dev;
> +
> +	dev = iommu_group_first_dev(group);
> +	if (!dev_is_pci(dev))
> +		return NULL;
> +
> +	device_ser = dev_iommu_restored_state(dev);
> +	if (!device_ser)
> +		return NULL;
> +
> +	domain = iommu_restore_domain(dev, device_ser);
> +	if (WARN_ON(IS_ERR(domain)))
> +		return NULL;
> +
> +	/*
> +	 * The group is owned by the entity (a preserved iommufd) that provided
> +	 * this token in the previous kernel. It will be used to reclaim it
> +	 * later.
> +	 */
> +	group->owner = (void *)device_ser->token;
> +	group->owner_cnt = 1;
> +	return domain;
> +}
> +
>  /**
>   * iommu_setup_default_domain - Set the default_domain for the group
>   * @group: Group to change
> @@ -3075,8 +3115,8 @@ static int iommu_setup_default_domain(struct iommu_group *group,
>  				      int target_type)
>  {
>  	struct iommu_domain *old_dom = group->default_domain;
> +	struct iommu_domain *dom, *restored_domain;
>  	struct group_device *gdev;
> -	struct iommu_domain *dom;
>  	bool direct_failed;
>  	int req_type;
>  	int ret;
> @@ -3120,6 +3160,9 @@ static int iommu_setup_default_domain(struct iommu_group *group,
>  	/* We must set default_domain early for __iommu_device_set_domain */
>  	group->default_domain = dom;
>  	if (!group->domain) {
> +		restored_domain = __iommu_group_maybe_restore_domain(group);
> +		if (!restored_domain)
> +			restored_domain = dom;
>  		/*
>  		 * Drivers are not allowed to fail the first domain attach.
>  		 * The only way to recover from this is to fail attaching the
> @@ -3127,7 +3170,7 @@ static int iommu_setup_default_domain(struct iommu_group *group,
>  		 * in group->default_domain so it is freed after.
>  		 */
>  		ret = __iommu_group_set_domain_internal(
> -			group, dom, IOMMU_SET_DOMAIN_MUST_SUCCEED);
> +			group, restored_domain, IOMMU_SET_DOMAIN_MUST_SUCCEED);
>  		if (WARN_ON(ret))
>  			goto out_free_old;
>  	} else {
> diff --git a/drivers/iommu/liveupdate.c b/drivers/iommu/liveupdate.c
> index 83eb609b3fd7..6b211436ad25 100644
> --- a/drivers/iommu/liveupdate.c
> +++ b/drivers/iommu/liveupdate.c
> @@ -501,3 +501,34 @@ void iommu_unpreserve_device(struct iommu_domain *domain, struct device *dev)
>  
>  	iommu_unpreserve_locked(iommu->iommu_dev);
>  }
> +
> +struct iommu_domain *iommu_restore_domain(struct device *dev, struct device_ser *ser)
> +{
> +	struct iommu_domain_ser *domain_ser;
> +	struct iommu_lu_flb_obj *flb_obj;
> +	struct iommu_domain *domain;
> +	int ret;
> +
> +	domain_ser = __va(ser->domain_iommu_ser.domain_phys);
> +
> +	ret = liveupdate_flb_get_incoming(&iommu_flb, (void **)&flb_obj);
> +	if (ret)
> +		return ERR_PTR(ret);
> +
> +	guard(mutex)(&flb_obj->lock);
> +	if (domain_ser->restored_domain)
> +		return domain_ser->restored_domain;
> +
> +	domain_ser->obj.incoming =  true;
> +	domain = iommu_paging_domain_alloc(dev);
> +	if (IS_ERR(domain))
> +		return domain;
> +
> +	ret = domain->ops->restore(domain, domain_ser);
> +	if (ret)

'domain' will leak here.

> +		return ERR_PTR(ret);
> +
> +	domain->preserved_state = domain_ser;
> +	domain_ser->restored_domain = domain;
> +	return domain;
> +}
> diff --git a/include/linux/iommu-lu.h b/include/linux/iommu-lu.h
> index 48c07514a776..4879abaf83d3 100644
> --- a/include/linux/iommu-lu.h
> +++ b/include/linux/iommu-lu.h
> @@ -65,6 +65,8 @@ static inline int dev_iommu_restore_did(struct device *dev, struct iommu_domain
>  	return -1;
>  }
>  
> +struct iommu_domain *iommu_restore_domain(struct device *dev,
> +					  struct device_ser *ser);
>  int iommu_for_each_preserved_device(iommu_preserved_device_iter_fn fn,
>  				    void *arg);
>  struct device_ser *iommu_get_device_preserved_data(struct device *dev);
> @@ -95,6 +97,12 @@ static inline void *iommu_domain_restored_state(struct iommu_domain *domain)
>  	return NULL;
>  }
>  
> +static inline struct iommu_domain *iommu_restore_domain(struct device *dev,
> +							struct device_ser *ser)
> +{
> +	return NULL;
> +}
> +
>  static inline int iommu_for_each_preserved_device(iommu_preserved_device_iter_fn fn, void *arg)
>  {
>  	return -EOPNOTSUPP;
> -- 
> 2.53.0.rc2.204.g2597b5adb4-goog
> 

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 08/14] iommu: Restore and reattach preserved domains to devices
  2026-03-10  5:16   ` Ankit Soni
@ 2026-03-10 21:47     ` Samiullah Khawaja
  0 siblings, 0 replies; 98+ messages in thread
From: Samiullah Khawaja @ 2026-03-10 21:47 UTC (permalink / raw)
  To: Ankit Soni
  Cc: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe, Robin Murphy, Kevin Tian, Alex Williamson,
	Shuah Khan, iommu, linux-kernel, kvm, Saeed Mahameed,
	Adithya Jayachandran, Parav Pandit, Leon Romanovsky, William Tu,
	Pratyush Yadav, Pasha Tatashin, David Matlack, Andrew Morton,
	Chris Li, Pranjal Shrivastava, Vipin Sharma, YiFei Zhu

On Tue, Mar 10, 2026 at 05:16:35AM +0000, Ankit Soni wrote:
>On Tue, Feb 03, 2026 at 10:09:42PM +0000, Samiullah Khawaja wrote:
>> Restore the preserved domains by restoring the page tables using restore
>> IOMMU domain op. Reattach the preserved domain to the device during
>> default domain setup. While attaching, reuse the domain ID that was used
>> in the previous kernel. The context entry setup is not needed as that is
>> preserved during liveupdate.
>>
>> Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
>> ---
>>  drivers/iommu/intel/iommu.c  | 40 ++++++++++++++++++------------
>>  drivers/iommu/intel/iommu.h  |  3 ++-
>>  drivers/iommu/intel/nested.c |  2 +-
>>  drivers/iommu/iommu.c        | 47 ++++++++++++++++++++++++++++++++++--
>>  drivers/iommu/liveupdate.c   | 31 ++++++++++++++++++++++++
>>  include/linux/iommu-lu.h     |  8 ++++++
>>  6 files changed, 112 insertions(+), 19 deletions(-)
>>
>> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
>> index 8acb7f8a7627..83faad53f247 100644
>> --- a/drivers/iommu/intel/iommu.c
>> +++ b/drivers/iommu/intel/iommu.c
>> @@ -1029,7 +1029,8 @@ static bool first_level_by_default(struct intel_iommu *iommu)
>>  	return true;
>>  }
>>
>> -int domain_attach_iommu(struct dmar_domain *domain, struct intel_iommu *iommu)
>> +int domain_attach_iommu(struct dmar_domain *domain, struct intel_iommu *iommu,
>> +			int restore_did)
>>  {
>>  	struct iommu_domain_info *info, *curr;
>>  	int num, ret = -ENOSPC;
>> @@ -1049,8 +1050,11 @@ int domain_attach_iommu(struct dmar_domain *domain, struct intel_iommu *iommu)
>>  		return 0;
>>  	}
>>
>> -	num = ida_alloc_range(&iommu->domain_ida, IDA_START_DID,
>> -			      cap_ndoms(iommu->cap) - 1, GFP_KERNEL);
>> +	if (restore_did >= 0)
>> +		num = restore_did;
>> +	else
>> +		num = ida_alloc_range(&iommu->domain_ida, IDA_START_DID,
>> +				      cap_ndoms(iommu->cap) - 1, GFP_KERNEL);
>>  	if (num < 0) {
>>  		pr_err("%s: No free domain ids\n", iommu->name);
>>  		goto err_unlock;
>> @@ -1321,10 +1325,14 @@ static int dmar_domain_attach_device(struct dmar_domain *domain,
>>  {
>>  	struct device_domain_info *info = dev_iommu_priv_get(dev);
>>  	struct intel_iommu *iommu = info->iommu;
>> +	struct device_ser *device_ser = NULL;
>>  	unsigned long flags;
>>  	int ret;
>>
>> -	ret = domain_attach_iommu(domain, iommu);
>> +	device_ser = dev_iommu_restored_state(dev);
>> +
>> +	ret = domain_attach_iommu(domain, iommu,
>> +				  dev_iommu_restore_did(dev, &domain->domain));
>>  	if (ret)
>>  		return ret;
>>
>> @@ -1337,16 +1345,18 @@ static int dmar_domain_attach_device(struct dmar_domain *domain,
>>  	if (dev_is_real_dma_subdevice(dev))
>>  		return 0;
>>
>> -	if (!sm_supported(iommu))
>> -		ret = domain_context_mapping(domain, dev);
>> -	else if (intel_domain_is_fs_paging(domain))
>> -		ret = domain_setup_first_level(iommu, domain, dev,
>> -					       IOMMU_NO_PASID, NULL);
>> -	else if (intel_domain_is_ss_paging(domain))
>> -		ret = domain_setup_second_level(iommu, domain, dev,
>> -						IOMMU_NO_PASID, NULL);
>> -	else if (WARN_ON(true))
>> -		ret = -EINVAL;
>> +	if (!device_ser) {
>> +		if (!sm_supported(iommu))
>> +			ret = domain_context_mapping(domain, dev);
>> +		else if (intel_domain_is_fs_paging(domain))
>> +			ret = domain_setup_first_level(iommu, domain, dev,
>> +						       IOMMU_NO_PASID, NULL);
>> +		else if (intel_domain_is_ss_paging(domain))
>> +			ret = domain_setup_second_level(iommu, domain, dev,
>> +							IOMMU_NO_PASID, NULL);
>> +		else if (WARN_ON(true))
>> +			ret = -EINVAL;
>> +	}
>>
>>  	if (ret)
>>  		goto out_block_translation;
>> @@ -3630,7 +3640,7 @@ domain_add_dev_pasid(struct iommu_domain *domain,
>>  	if (!dev_pasid)
>>  		return ERR_PTR(-ENOMEM);
>>
>> -	ret = domain_attach_iommu(dmar_domain, iommu);
>> +	ret = domain_attach_iommu(dmar_domain, iommu, -1);
>>  	if (ret)
>>  		goto out_free;
>>
>> diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
>> index d7bf63aff17d..057bd6035d85 100644
>> --- a/drivers/iommu/intel/iommu.h
>> +++ b/drivers/iommu/intel/iommu.h
>> @@ -1174,7 +1174,8 @@ void __iommu_flush_iotlb(struct intel_iommu *iommu, u16 did, u64 addr,
>>   */
>>  #define QI_OPT_WAIT_DRAIN		BIT(0)
>>
>> -int domain_attach_iommu(struct dmar_domain *domain, struct intel_iommu *iommu);
>> +int domain_attach_iommu(struct dmar_domain *domain, struct intel_iommu *iommu,
>> +			int restore_did);
>>  void domain_detach_iommu(struct dmar_domain *domain, struct intel_iommu *iommu);
>>  void device_block_translation(struct device *dev);
>>  int paging_domain_compatible(struct iommu_domain *domain, struct device *dev);
>> diff --git a/drivers/iommu/intel/nested.c b/drivers/iommu/intel/nested.c
>> index a3fb8c193ca6..4fed9f5981e5 100644
>> --- a/drivers/iommu/intel/nested.c
>> +++ b/drivers/iommu/intel/nested.c
>> @@ -40,7 +40,7 @@ static int intel_nested_attach_dev(struct iommu_domain *domain,
>>  		return ret;
>>  	}
>>
>> -	ret = domain_attach_iommu(dmar_domain, iommu);
>> +	ret = domain_attach_iommu(dmar_domain, iommu, -1);
>>  	if (ret) {
>>  		dev_err_ratelimited(dev, "Failed to attach domain to iommu\n");
>>  		return ret;
>> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
>> index c0632cb5b570..8103b5372364 100644
>> --- a/drivers/iommu/iommu.c
>> +++ b/drivers/iommu/iommu.c
>> @@ -18,6 +18,7 @@
>>  #include <linux/errno.h>
>>  #include <linux/host1x_context_bus.h>
>>  #include <linux/iommu.h>
>> +#include <linux/iommu-lu.h>
>>  #include <linux/iommufd.h>
>>  #include <linux/idr.h>
>>  #include <linux/err.h>
>> @@ -489,6 +490,10 @@ static int iommu_init_device(struct device *dev)
>>  		goto err_free;
>>  	}
>>
>> +#ifdef CONFIG_IOMMU_LIVEUPDATE
>> +	dev->iommu->device_ser = iommu_get_device_preserved_data(dev);
>> +#endif
>> +
>>  	iommu_dev = ops->probe_device(dev);
>>  	if (IS_ERR(iommu_dev)) {
>>  		ret = PTR_ERR(iommu_dev);
>> @@ -2149,6 +2154,13 @@ static int __iommu_attach_device(struct iommu_domain *domain,
>>  	ret = domain->ops->attach_dev(domain, dev, old);
>>  	if (ret)
>>  		return ret;
>> +
>> +#ifdef CONFIG_IOMMU_LIVEUPDATE
>> +	/* The associated state can be unset once restored. */
>> +	if (dev_iommu_restored_state(dev))
>> +		WRITE_ONCE(dev->iommu->device_ser, NULL);
>> +#endif
>> +
>>  	dev->iommu->attach_deferred = 0;
>>  	trace_attach_device_to_domain(dev);
>>  	return 0;
>> @@ -3061,6 +3073,34 @@ int iommu_fwspec_add_ids(struct device *dev, const u32 *ids, int num_ids)
>>  }
>>  EXPORT_SYMBOL_GPL(iommu_fwspec_add_ids);
>>
>> +static struct iommu_domain *__iommu_group_maybe_restore_domain(struct iommu_group *group)
>> +{
>> +	struct device_ser *device_ser;
>> +	struct iommu_domain *domain;
>> +	struct device *dev;
>> +
>> +	dev = iommu_group_first_dev(group);
>> +	if (!dev_is_pci(dev))
>> +		return NULL;
>> +
>> +	device_ser = dev_iommu_restored_state(dev);
>> +	if (!device_ser)
>> +		return NULL;
>> +
>> +	domain = iommu_restore_domain(dev, device_ser);
>> +	if (WARN_ON(IS_ERR(domain)))
>> +		return NULL;
>> +
>> +	/*
>> +	 * The group is owned by the entity (a preserved iommufd) that provided
>> +	 * this token in the previous kernel. It will be used to reclaim it
>> +	 * later.
>> +	 */
>> +	group->owner = (void *)device_ser->token;
>> +	group->owner_cnt = 1;
>> +	return domain;
>> +}
>> +
>>  /**
>>   * iommu_setup_default_domain - Set the default_domain for the group
>>   * @group: Group to change
>> @@ -3075,8 +3115,8 @@ static int iommu_setup_default_domain(struct iommu_group *group,
>>  				      int target_type)
>>  {
>>  	struct iommu_domain *old_dom = group->default_domain;
>> +	struct iommu_domain *dom, *restored_domain;
>>  	struct group_device *gdev;
>> -	struct iommu_domain *dom;
>>  	bool direct_failed;
>>  	int req_type;
>>  	int ret;
>> @@ -3120,6 +3160,9 @@ static int iommu_setup_default_domain(struct iommu_group *group,
>>  	/* We must set default_domain early for __iommu_device_set_domain */
>>  	group->default_domain = dom;
>>  	if (!group->domain) {
>> +		restored_domain = __iommu_group_maybe_restore_domain(group);
>> +		if (!restored_domain)
>> +			restored_domain = dom;
>>  		/*
>>  		 * Drivers are not allowed to fail the first domain attach.
>>  		 * The only way to recover from this is to fail attaching the
>> @@ -3127,7 +3170,7 @@ static int iommu_setup_default_domain(struct iommu_group *group,
>>  		 * in group->default_domain so it is freed after.
>>  		 */
>>  		ret = __iommu_group_set_domain_internal(
>> -			group, dom, IOMMU_SET_DOMAIN_MUST_SUCCEED);
>> +			group, restored_domain, IOMMU_SET_DOMAIN_MUST_SUCCEED);
>>  		if (WARN_ON(ret))
>>  			goto out_free_old;
>>  	} else {
>> diff --git a/drivers/iommu/liveupdate.c b/drivers/iommu/liveupdate.c
>> index 83eb609b3fd7..6b211436ad25 100644
>> --- a/drivers/iommu/liveupdate.c
>> +++ b/drivers/iommu/liveupdate.c
>> @@ -501,3 +501,34 @@ void iommu_unpreserve_device(struct iommu_domain *domain, struct device *dev)
>>
>>  	iommu_unpreserve_locked(iommu->iommu_dev);
>>  }
>> +
>> +struct iommu_domain *iommu_restore_domain(struct device *dev, struct device_ser *ser)
>> +{
>> +	struct iommu_domain_ser *domain_ser;
>> +	struct iommu_lu_flb_obj *flb_obj;
>> +	struct iommu_domain *domain;
>> +	int ret;
>> +
>> +	domain_ser = __va(ser->domain_iommu_ser.domain_phys);
>> +
>> +	ret = liveupdate_flb_get_incoming(&iommu_flb, (void **)&flb_obj);
>> +	if (ret)
>> +		return ERR_PTR(ret);
>> +
>> +	guard(mutex)(&flb_obj->lock);
>> +	if (domain_ser->restored_domain)
>> +		return domain_ser->restored_domain;
>> +
>> +	domain_ser->obj.incoming =  true;
>> +	domain = iommu_paging_domain_alloc(dev);
>> +	if (IS_ERR(domain))
>> +		return domain;
>> +
>> +	ret = domain->ops->restore(domain, domain_ser);
>> +	if (ret)
>
>'domain' will leak here.

Agreed. It should be cleaned here if restore fails. I will fix this in
the next revision.
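
Something along these lines (sketch only) should plug the leak, since
the domain was just allocated with iommu_paging_domain_alloc() and has
no other users yet:

```c
	ret = domain->ops->restore(domain, domain_ser);
	if (ret) {
		/* don't leak the freshly allocated domain on failure */
		iommu_domain_free(domain);
		return ERR_PTR(ret);
	}
```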
>
>> +		return ERR_PTR(ret);
>> +
>> +	domain->preserved_state = domain_ser;
>> +	domain_ser->restored_domain = domain;
>> +	return domain;
>> +}
>> diff --git a/include/linux/iommu-lu.h b/include/linux/iommu-lu.h
>> index 48c07514a776..4879abaf83d3 100644
>> --- a/include/linux/iommu-lu.h
>> +++ b/include/linux/iommu-lu.h
>> @@ -65,6 +65,8 @@ static inline int dev_iommu_restore_did(struct device *dev, struct iommu_domain
>>  	return -1;
>>  }
>>
>> +struct iommu_domain *iommu_restore_domain(struct device *dev,
>> +					  struct device_ser *ser);
>>  int iommu_for_each_preserved_device(iommu_preserved_device_iter_fn fn,
>>  				    void *arg);
>>  struct device_ser *iommu_get_device_preserved_data(struct device *dev);
>> @@ -95,6 +97,12 @@ static inline void *iommu_domain_restored_state(struct iommu_domain *domain)
>>  	return NULL;
>>  }
>>
>> +static inline struct iommu_domain *iommu_restore_domain(struct device *dev,
>> +							struct device_ser *ser)
>> +{
>> +	return NULL;
>> +}
>> +
>>  static inline int iommu_for_each_preserved_device(iommu_preserved_device_iter_fn fn, void *arg)
>>  {
>>  	return -EOPNOTSUPP;
>> --
>> 2.53.0.rc2.204.g2597b5adb4-goog
>>

Sami

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 01/14] iommu: Implement IOMMU LU FLB callbacks
  2026-02-03 22:09 ` [PATCH 01/14] iommu: Implement IOMMU LU FLB callbacks Samiullah Khawaja
@ 2026-03-11 21:07   ` Pranjal Shrivastava
  2026-03-12 16:43     ` Samiullah Khawaja
  2026-03-16 22:54   ` Vipin Sharma
  1 sibling, 1 reply; 98+ messages in thread
From: Pranjal Shrivastava @ 2026-03-11 21:07 UTC (permalink / raw)
  To: Samiullah Khawaja
  Cc: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe, Robin Murphy, Kevin Tian, Alex Williamson,
	Shuah Khan, iommu, linux-kernel, kvm, Saeed Mahameed,
	Adithya Jayachandran, Parav Pandit, Leon Romanovsky, William Tu,
	Pratyush Yadav, Pasha Tatashin, David Matlack, Andrew Morton,
	Chris Li, Vipin Sharma, YiFei Zhu

On Tue, Feb 03, 2026 at 10:09:35PM +0000, Samiullah Khawaja wrote:
> Add liveupdate FLB for IOMMU state preservation. Use KHO preserve memory
> alloc/free helper functions to allocate memory for the IOMMU LU FLB
> object and the serialization structs for device, domain and iommu.
> 
> During retrieve, walk through the preserved objs nodes and restore each
> folio. Also recreate the FLB obj.
> 
> Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
> ---
>  drivers/iommu/Kconfig         |  11 +++
>  drivers/iommu/Makefile        |   1 +
>  drivers/iommu/liveupdate.c    | 177 ++++++++++++++++++++++++++++++++++
>  include/linux/iommu-lu.h      |  17 ++++
>  include/linux/kho/abi/iommu.h | 119 +++++++++++++++++++++++
>  5 files changed, 325 insertions(+)
>  create mode 100644 drivers/iommu/liveupdate.c
>  create mode 100644 include/linux/iommu-lu.h
>  create mode 100644 include/linux/kho/abi/iommu.h
> 
> diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
> index f86262b11416..fdcfbedee5ed 100644
> --- a/drivers/iommu/Kconfig
> +++ b/drivers/iommu/Kconfig
> @@ -11,6 +11,17 @@ config IOMMUFD_DRIVER
>  	bool
>  	default n
>  
> +config IOMMU_LIVEUPDATE
> +	bool "IOMMU live update state preservation support"
> +	depends on LIVEUPDATE && IOMMUFD
> +	help
> +	  Enable support for preserving IOMMU state across a kexec live update.
> +
> +	  This allows devices managed by iommufd to maintain their DMA mappings
> +	  during kexec base kernel update.
> +
> +	  If unsure, say N.
> +

I'm wondering if this should be under the if IOMMU_SUPPORT block below?
I believe it was added here because IOMMUFD isn't under IOMMU_SUPPORT,
but it wouldn't make sense to "preserve" IOMMU state across a liveupdate
if IOMMU_SUPPORT is disabled. Should we move it inside the if
IOMMU_SUPPORT block for better organization, or at least add a depends
on IOMMU_SUPPORT to it? IOMMU_LIVEUPDATE still depends on the
IOMMU_SUPPORT infrastructure to actually function, as we add calls
within core functions like dev_iommu_get() etc.

>  menuconfig IOMMU_SUPPORT
>  	bool "IOMMU Hardware Support"
>  	depends on MMU
> diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
> index 0275821f4ef9..b3715c5a6b97 100644
> --- a/drivers/iommu/Makefile
> +++ b/drivers/iommu/Makefile
> @@ -15,6 +15,7 @@ obj-$(CONFIG_IOMMU_IO_PGTABLE_ARMV7S) += io-pgtable-arm-v7s.o
>  obj-$(CONFIG_IOMMU_IO_PGTABLE_LPAE) += io-pgtable-arm.o
>  obj-$(CONFIG_IOMMU_IO_PGTABLE_LPAE_KUNIT_TEST) += io-pgtable-arm-selftests.o
>  obj-$(CONFIG_IOMMU_IO_PGTABLE_DART) += io-pgtable-dart.o
> +obj-$(CONFIG_IOMMU_LIVEUPDATE) += liveupdate.o
>  obj-$(CONFIG_IOMMU_IOVA) += iova.o
>  obj-$(CONFIG_OF_IOMMU)	+= of_iommu.o
>  obj-$(CONFIG_MSM_IOMMU) += msm_iommu.o
> diff --git a/drivers/iommu/liveupdate.c b/drivers/iommu/liveupdate.c
> new file mode 100644
> index 000000000000..6189ba32ff2c
> --- /dev/null
> +++ b/drivers/iommu/liveupdate.c
> @@ -0,0 +1,177 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +
> +/*
> + * Copyright (C) 2025, Google LLC

Minor nit: 2026 OR 2025-26, here and everywhere else

> + * Author: Samiullah Khawaja <skhawaja@google.com>
> + */
> +
> +#define pr_fmt(fmt)    "iommu: liveupdate: " fmt
> +
> +#include <linux/kexec_handover.h>
> +#include <linux/liveupdate.h>
> +#include <linux/iommu-lu.h>
> +#include <linux/iommu.h>
> +#include <linux/errno.h>
> +
> +static void iommu_liveupdate_restore_objs(u64 next)
> +{
> +	struct iommu_objs_ser *objs;
> +
> +	while (next) {
> +		BUG_ON(!kho_restore_folio(next));

Same comment about BUG_ON [1] as on the
iommu_liveupdate_flb_retrieve() function below: can we consider
returning an error that the caller can check and bubble up as
-ENODATA?
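
For illustration, a sketch of how this could propagate an error instead
(assuming kho_restore_folio() returns NULL on failure, as the BUG_ON
implies) might be:

static int iommu_liveupdate_restore_objs(u64 next)
{
	struct iommu_objs_ser *objs;

	while (next) {
		/* Treat a failed restore as corrupted handover data. */
		if (!kho_restore_folio(next))
			return -ENODATA;

		objs = __va(next);
		next = objs->next_objs;
	}

	return 0;
}

The caller in iommu_liveupdate_flb_retrieve() could then check the
return value and bubble the error up.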

> +		objs = __va(next);
> +		next = objs->next_objs;
> +	}
> +}
> +
> +static void iommu_liveupdate_free_objs(u64 next, bool incoming)
> +{
> +	struct iommu_objs_ser *objs;
> +
> +	while (next) {
> +		objs = __va(next);
> +		next = objs->next_objs;
> +
> +		if (!incoming)
> +			kho_unpreserve_free(objs);
> +		else
> +			folio_put(virt_to_folio(objs));

Interesting! kho_restore_folio already adjusts the refcount via 
adjust_managed_page_count which is why folio_put() is needed for a valid
refcount. Sweet :)

> +	}
> +}
> +
> +static void iommu_liveupdate_flb_free(struct iommu_lu_flb_obj *obj)
> +{
> +	if (obj->iommu_domains)
> +		iommu_liveupdate_free_objs(obj->ser->iommu_domains_phys, false);
> +
> +	if (obj->devices)
> +		iommu_liveupdate_free_objs(obj->ser->devices_phys, false);
> +
> +	if (obj->iommus)
> +		iommu_liveupdate_free_objs(obj->ser->iommus_phys, false);
> +
> +	kho_unpreserve_free(obj->ser);
> +	kfree(obj);
> +}
> +
> +static int iommu_liveupdate_flb_preserve(struct liveupdate_flb_op_args *argp)
> +{
> +	struct iommu_lu_flb_obj *obj;
> +	struct iommu_lu_flb_ser *ser;
> +	void *mem;
> +
> +	obj = kzalloc(sizeof(*obj), GFP_KERNEL);

I know this is obvious, but let's add a comment that obj exists only in
the "current" kernel, whereas mem is supposed to survive the KHO.

> +	if (!obj)
> +		return -ENOMEM;
> +
> +	mutex_init(&obj->lock);
> +	mem = kho_alloc_preserve(sizeof(*ser));
> +	if (IS_ERR(mem))
> +		goto err_free;
> +
> +	ser = mem;
> +	obj->ser = ser;
> +
> +	mem = kho_alloc_preserve(PAGE_SIZE);
> +	if (IS_ERR(mem))
> +		goto err_free;
> +
> +	obj->iommu_domains = mem;
> +	ser->iommu_domains_phys = virt_to_phys(obj->iommu_domains);
> +
> +	mem = kho_alloc_preserve(PAGE_SIZE);
> +	if (IS_ERR(mem))
> +		goto err_free;
> +
> +	obj->devices = mem;
> +	ser->devices_phys = virt_to_phys(obj->devices);
> +
> +	mem = kho_alloc_preserve(PAGE_SIZE);
> +	if (IS_ERR(mem))
> +		goto err_free;
> +
> +	obj->iommus = mem;
> +	ser->iommus_phys = virt_to_phys(obj->iommus);
> +
> +	argp->obj = obj;
> +	argp->data = virt_to_phys(ser);
> +	return 0;
> +
> +err_free:
> +	iommu_liveupdate_flb_free(obj);
> +	return PTR_ERR(mem);
> +}
> +
> +static void iommu_liveupdate_flb_unpreserve(struct liveupdate_flb_op_args *argp)
> +{
> +	iommu_liveupdate_flb_free(argp->obj);
> +}
> +
> +static void iommu_liveupdate_flb_finish(struct liveupdate_flb_op_args *argp)
> +{
> +	struct iommu_lu_flb_obj *obj = argp->obj;
> +
> +	if (obj->iommu_domains)
> +		iommu_liveupdate_free_objs(obj->ser->iommu_domains_phys, true);
> +
> +	if (obj->devices)
> +		iommu_liveupdate_free_objs(obj->ser->devices_phys, true);
> +
> +	if (obj->iommus)
> +		iommu_liveupdate_free_objs(obj->ser->iommus_phys, true);
> +
> +	folio_put(virt_to_folio(obj->ser));
> +	kfree(obj);
> +}
> +
> +static int iommu_liveupdate_flb_retrieve(struct liveupdate_flb_op_args *argp)
> +{
> +	struct iommu_lu_flb_obj *obj;
> +	struct iommu_lu_flb_ser *ser;
> +
> +	obj = kzalloc(sizeof(*obj), GFP_ATOMIC);

Why does this have to be GFP_ATOMIC? IIUC, the retrieve path is
triggered by a userspace IOCTL in the new kernel, so the system should
be able to sleep here (unless we have a use case for calling this in
IRQ context?). AFAICT, we already call this under mutexes, so there's
no spinlock context in which sleeping would be forbidden.

GFP_ATOMIC creates a point of failure if the system is under memory
pressure. I believe we should be allowed to sleep for this allocation:
the "preserved" mappings still allow DMA to continue, and we're in no
hurry to restore the IOMMU state. I believe this could be GFP_KERNEL.

> +	if (!obj)
> +		return -ENOMEM;
> +
> +	mutex_init(&obj->lock);
> +	BUG_ON(!kho_restore_folio(argp->data));

The use of BUG_ON in new code is heavily discouraged [1].
If KHO can't restore the folio for whatever reason, we can treat it as
corruption of the handover data. I believe crashing the kernel for it
would be overkill.

Can we consider returning a graceful failure like -ENODATA or something?
BUG_ON would instantly cause a kernel panic without providing any
opportunity for the system to log the failure or attempt a graceful
teardown of the 'preserved' mapping.

> +	ser = phys_to_virt(argp->data);
> +	obj->ser = ser;
> +
> +	iommu_liveupdate_restore_objs(ser->iommu_domains_phys);
> +	obj->iommu_domains = phys_to_virt(ser->iommu_domains_phys);
> +
> +	iommu_liveupdate_restore_objs(ser->devices_phys);
> +	obj->devices = phys_to_virt(ser->devices_phys);
> +
> +	iommu_liveupdate_restore_objs(ser->iommus_phys);
> +	obj->iommus = phys_to_virt(ser->iommus_phys);
> +
> +	argp->obj = obj;
> +
> +	return 0;
> +}
> +
> +static struct liveupdate_flb_ops iommu_flb_ops = {
> +	.preserve = iommu_liveupdate_flb_preserve,
> +	.unpreserve = iommu_liveupdate_flb_unpreserve,
> +	.finish = iommu_liveupdate_flb_finish,
> +	.retrieve = iommu_liveupdate_flb_retrieve,
> +};
> +
> +static struct liveupdate_flb iommu_flb = {
> +	.compatible = IOMMU_LUO_FLB_COMPATIBLE,
> +	.ops = &iommu_flb_ops,
> +};
> +
> +int iommu_liveupdate_register_flb(struct liveupdate_file_handler *handler)
> +{
> +	return liveupdate_register_flb(handler, &iommu_flb);
> +}
> +EXPORT_SYMBOL(iommu_liveupdate_register_flb);
> +
> +int iommu_liveupdate_unregister_flb(struct liveupdate_file_handler *handler)
> +{
> +	return liveupdate_unregister_flb(handler, &iommu_flb);
> +}
> +EXPORT_SYMBOL(iommu_liveupdate_unregister_flb);
> diff --git a/include/linux/iommu-lu.h b/include/linux/iommu-lu.h
> new file mode 100644
> index 000000000000..59095d2f1bb2
> --- /dev/null
> +++ b/include/linux/iommu-lu.h
> @@ -0,0 +1,17 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +/*
> + * Copyright (C) 2025, Google LLC
> + * Author: Samiullah Khawaja <skhawaja@google.com>
> + */
> +
> +#ifndef _LINUX_IOMMU_LU_H
> +#define _LINUX_IOMMU_LU_H
> +
> +#include <linux/liveupdate.h>
> +#include <linux/kho/abi/iommu.h>
> +
> +int iommu_liveupdate_register_flb(struct liveupdate_file_handler *handler);
> +int iommu_liveupdate_unregister_flb(struct liveupdate_file_handler *handler);
> +
> +#endif /* _LINUX_IOMMU_LU_H */
> diff --git a/include/linux/kho/abi/iommu.h b/include/linux/kho/abi/iommu.h
> new file mode 100644
> index 000000000000..8e1c05cfe7bb
> --- /dev/null
> +++ b/include/linux/kho/abi/iommu.h
> @@ -0,0 +1,119 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +/*
> + * Copyright (C) 2025, Google LLC
> + * Author: Samiullah Khawaja <skhawaja@google.com>
> + */
> +
> +#ifndef _LINUX_KHO_ABI_IOMMU_H
> +#define _LINUX_KHO_ABI_IOMMU_H
> +
> +#include <linux/mutex_types.h>
> +#include <linux/compiler.h>
> +#include <linux/types.h>
> +
> +/**
> + * DOC: IOMMU File-Lifecycle Bound (FLB) Live Update ABI
> + *
> + * This header defines the ABI for preserving IOMMU state across kexec using
> + * Live Update File-Lifecycle Bound (FLB) data.
> + *
> + * This interface is a contract. Any modification to any of the serialization
> + * structs defined here constitutes a breaking change. Such changes require
> + * incrementing the version number in the IOMMU_LUO_FLB_COMPATIBLE string.
> + */
> +
> +#define IOMMU_LUO_FLB_COMPATIBLE "iommu-v1"
> +

Let's call this "iommu-liveupdate-v1" or "iommu-lu-v1" instead?

> +enum iommu_lu_type {
> +	IOMMU_INVALID,
> +	IOMMU_INTEL,
> +};
> +
> +struct iommu_obj_ser {
> +	u32 idx;
> +	u32 ref_count;
> +	u32 deleted:1;
> +	u32 incoming:1;
> +} __packed;
> +
> +struct iommu_domain_ser {
> +	struct iommu_obj_ser obj;
> +	u64 top_table;
> +	u64 top_level;
> +	struct iommu_domain *restored_domain;
> +} __packed;
> +
> +struct device_domain_iommu_ser {
> +	u32 did;
> +	u64 domain_phys;
> +	u64 iommu_phys;
> +} __packed;
> +
> +struct device_ser {
> +	struct iommu_obj_ser obj;
> +	u64 token;
> +	u32 devid;
> +	u32 pci_domain;
> +	struct device_domain_iommu_ser domain_iommu_ser;
> +	enum iommu_lu_type type;
> +} __packed;
> +
> +struct iommu_intel_ser {
> +	u64 phys_addr;
> +	u64 root_table;
> +} __packed;
> +
> +struct iommu_ser {
> +	struct iommu_obj_ser obj;
> +	u64 token;
> +	enum iommu_lu_type type;
> +	union {
> +		struct iommu_intel_ser intel;
> +	};
> +} __packed;
> +
> +struct iommu_objs_ser {
> +	u64 next_objs;
> +	u64 nr_objs;
> +} __packed;
> +
> +struct iommus_ser {
> +	struct iommu_objs_ser objs;
> +	struct iommu_ser iommus[];
> +} __packed;
> +
> +struct iommu_domains_ser {
> +	struct iommu_objs_ser objs;
> +	struct iommu_domain_ser iommu_domains[];
> +} __packed;
> +
> +struct devices_ser {
> +	struct iommu_objs_ser objs;
> +	struct device_ser devices[];
> +} __packed;
> +
> +#define MAX_IOMMU_SERS ((PAGE_SIZE - sizeof(struct iommus_ser)) / sizeof(struct iommu_ser))
> +#define MAX_IOMMU_DOMAIN_SERS \
> +		((PAGE_SIZE - sizeof(struct iommu_domains_ser)) / sizeof(struct iommu_domain_ser))
> +#define MAX_DEVICE_SERS ((PAGE_SIZE - sizeof(struct devices_ser)) / sizeof(struct device_ser))
> +
> +struct iommu_lu_flb_ser {
> +	u64 iommus_phys;
> +	u64 nr_iommus;
> +	u64 iommu_domains_phys;
> +	u64 nr_domains;
> +	u64 devices_phys;
> +	u64 nr_devices;
> +} __packed;
> +
> +struct iommu_lu_flb_obj {
> +	struct mutex lock;
> +	struct iommu_lu_flb_ser *ser;
> +
> +	struct iommu_domains_ser *iommu_domains;
> +	struct iommus_ser *iommus;
> +	struct devices_ser *devices;
> +} __packed;
> +

Let's add some comments describing the structs and their members here,
like we have in memfd [2]. These should be descriptive for the user.
For example:

+/**
+ * struct iommu_lu_flb_ser - Main serialization header for IOMMU state.
+ * @iommus_phys:        Physical address of the first page in the IOMMU unit chain.
+ * @nr_iommus:          Total number of hardware IOMMU units preserved.
+ * @iommu_domains_phys: [...]
+ * @nr_domains:         [...]
+ * @devices_phys:       [...]
+ * @nr_devices:         [...]
+ *
+ * This structure acts as the root of the IOMMU state tree. It is hitching a ride
+ * on the iommufd file descriptor's preservation flow.
+ */
+struct iommu_lu_flb_ser {
+	u64 iommus_phys;
+	u64 nr_iommus;
+	u64 iommu_domains_phys;
+	u64 nr_domains;
+	u64 devices_phys;
+	u64 nr_devices;
+} __packed;

> +#endif /* _LINUX_KHO_ABI_IOMMU_H */

Thanks,
Praan

[1] https://docs.kernel.org/process/coding-style.html#use-warn-rather-than-bug
[2] https://elixir.bootlin.com/linux/v7.0-rc3/source/include/linux/kho/abi/memfd.h


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 01/14] iommu: Implement IOMMU LU FLB callbacks
  2026-03-11 21:07   ` Pranjal Shrivastava
@ 2026-03-12 16:43     ` Samiullah Khawaja
  2026-03-12 23:43       ` Pranjal Shrivastava
  2026-03-13 15:36       ` Pranjal Shrivastava
  0 siblings, 2 replies; 98+ messages in thread
From: Samiullah Khawaja @ 2026-03-12 16:43 UTC (permalink / raw)
  To: Pranjal Shrivastava
  Cc: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe, Robin Murphy, Kevin Tian, Alex Williamson,
	Shuah Khan, iommu, linux-kernel, kvm, Saeed Mahameed,
	Adithya Jayachandran, Parav Pandit, Leon Romanovsky, William Tu,
	Pratyush Yadav, Pasha Tatashin, David Matlack, Andrew Morton,
	Chris Li, Vipin Sharma, YiFei Zhu

On Wed, Mar 11, 2026 at 09:07:00PM +0000, Pranjal Shrivastava wrote:
>On Tue, Feb 03, 2026 at 10:09:35PM +0000, Samiullah Khawaja wrote:
>> Add liveupdate FLB for IOMMU state preservation. Use KHO preserve memory
>> alloc/free helper functions to allocate memory for the IOMMU LU FLB
>> object and the serialization structs for device, domain and iommu.
>>
>> During retrieve, walk through the preserved objs nodes and restore each
>> folio. Also recreate the FLB obj.
>>
>> Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
>> ---
>>  drivers/iommu/Kconfig         |  11 +++
>>  drivers/iommu/Makefile        |   1 +
>>  drivers/iommu/liveupdate.c    | 177 ++++++++++++++++++++++++++++++++++
>>  include/linux/iommu-lu.h      |  17 ++++
>>  include/linux/kho/abi/iommu.h | 119 +++++++++++++++++++++++
>>  5 files changed, 325 insertions(+)
>>  create mode 100644 drivers/iommu/liveupdate.c
>>  create mode 100644 include/linux/iommu-lu.h
>>  create mode 100644 include/linux/kho/abi/iommu.h
>>
>> diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
>> index f86262b11416..fdcfbedee5ed 100644
>> --- a/drivers/iommu/Kconfig
>> +++ b/drivers/iommu/Kconfig
>> @@ -11,6 +11,17 @@ config IOMMUFD_DRIVER
>>  	bool
>>  	default n
>>
>> +config IOMMU_LIVEUPDATE
>> +	bool "IOMMU live update state preservation support"
>> +	depends on LIVEUPDATE && IOMMUFD
>> +	help
>> +	  Enable support for preserving IOMMU state across a kexec live update.
>> +
>> +	  This allows devices managed by iommufd to maintain their DMA mappings
>> +	  during kexec base kernel update.
>> +
>> +	  If unsure, say N.
>> +
>
>I'm wondering if this should be under the if IOMMU_SUPPORT block below?
>I believe this was added here because IOMMUFD isn't under IOMMU_SUPPORT,
>but it wouldn't make sense to "preserve" IOMMU state across a live
>update if IOMMU_SUPPORT is disabled. Should we move it inside the
>if IOMMU_SUPPORT block for better organization, or at least add a
>depends on IOMMU_SUPPORT to it? IOMMU_LIVEUPDATE still depends on the
>IOMMU_SUPPORT infrastructure to actually function, as we add calls
>within core functions like dev_iommu_get() etc.

Agreed. I will move it under IOMMU_SUPPORT and sort out any other
dependencies.
>
>>  menuconfig IOMMU_SUPPORT
>>  	bool "IOMMU Hardware Support"
>>  	depends on MMU
>> diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
>> index 0275821f4ef9..b3715c5a6b97 100644
>> --- a/drivers/iommu/Makefile
>> +++ b/drivers/iommu/Makefile
>> @@ -15,6 +15,7 @@ obj-$(CONFIG_IOMMU_IO_PGTABLE_ARMV7S) += io-pgtable-arm-v7s.o
>>  obj-$(CONFIG_IOMMU_IO_PGTABLE_LPAE) += io-pgtable-arm.o
>>  obj-$(CONFIG_IOMMU_IO_PGTABLE_LPAE_KUNIT_TEST) += io-pgtable-arm-selftests.o
>>  obj-$(CONFIG_IOMMU_IO_PGTABLE_DART) += io-pgtable-dart.o
>> +obj-$(CONFIG_IOMMU_LIVEUPDATE) += liveupdate.o
>>  obj-$(CONFIG_IOMMU_IOVA) += iova.o
>>  obj-$(CONFIG_OF_IOMMU)	+= of_iommu.o
>>  obj-$(CONFIG_MSM_IOMMU) += msm_iommu.o
>> diff --git a/drivers/iommu/liveupdate.c b/drivers/iommu/liveupdate.c
>> new file mode 100644
>> index 000000000000..6189ba32ff2c
>> --- /dev/null
>> +++ b/drivers/iommu/liveupdate.c
>> @@ -0,0 +1,177 @@
>> +// SPDX-License-Identifier: GPL-2.0-only
>> +
>> +/*
>> + * Copyright (C) 2025, Google LLC
>
>Minor nit: 2026 OR 2025-26, here and everywhere else

Will fix in next revision.
>
>> + * Author: Samiullah Khawaja <skhawaja@google.com>
>> + */
>> +
>> +#define pr_fmt(fmt)    "iommu: liveupdate: " fmt
>> +
>> +#include <linux/kexec_handover.h>
>> +#include <linux/liveupdate.h>
>> +#include <linux/iommu-lu.h>
>> +#include <linux/iommu.h>
>> +#include <linux/errno.h>
>> +
>> +static void iommu_liveupdate_restore_objs(u64 next)
>> +{
>> +	struct iommu_objs_ser *objs;
>> +
>> +	while (next) {
>> +		BUG_ON(!kho_restore_folio(next));
>
>Same comment about BUG_ON [1] as on the
>iommu_liveupdate_flb_retrieve() function below: can we consider
>returning an error that the caller can check and bubble up as
>-ENODATA?

Please see the explanation below on the BUG_ON.
>
>> +		objs = __va(next);
>> +		next = objs->next_objs;
>> +	}
>> +}
>> +
>> +static void iommu_liveupdate_free_objs(u64 next, bool incoming)
>> +{
>> +	struct iommu_objs_ser *objs;
>> +
>> +	while (next) {
>> +		objs = __va(next);
>> +		next = objs->next_objs;
>> +
>> +		if (!incoming)
>> +			kho_unpreserve_free(objs);
>> +		else
>> +			folio_put(virt_to_folio(objs));
>
>Interesting! kho_restore_folio already adjusts the refcount via
>adjust_managed_page_count which is why folio_put() is needed for a valid
>refcount. Sweet :)
>
>> +	}
>> +}
>> +
>> +static void iommu_liveupdate_flb_free(struct iommu_lu_flb_obj *obj)
>> +{
>> +	if (obj->iommu_domains)
>> +		iommu_liveupdate_free_objs(obj->ser->iommu_domains_phys, false);
>> +
>> +	if (obj->devices)
>> +		iommu_liveupdate_free_objs(obj->ser->devices_phys, false);
>> +
>> +	if (obj->iommus)
>> +		iommu_liveupdate_free_objs(obj->ser->iommus_phys, false);
>> +
>> +	kho_unpreserve_free(obj->ser);
>> +	kfree(obj);
>> +}
>> +
>> +static int iommu_liveupdate_flb_preserve(struct liveupdate_flb_op_args *argp)
>> +{
>> +	struct iommu_lu_flb_obj *obj;
>> +	struct iommu_lu_flb_ser *ser;
>> +	void *mem;
>> +
>> +	obj = kzalloc(sizeof(*obj), GFP_KERNEL);
>
>I know this is obvious, but let's add a comment that obj exists only in
>the "current" kernel, whereas mem is supposed to survive the KHO.

Agreed. Will add in the next revision.
>
>> +	if (!obj)
>> +		return -ENOMEM;
>> +
>> +	mutex_init(&obj->lock);
>> +	mem = kho_alloc_preserve(sizeof(*ser));
>> +	if (IS_ERR(mem))
>> +		goto err_free;
>> +
>> +	ser = mem;
>> +	obj->ser = ser;
>> +
>> +	mem = kho_alloc_preserve(PAGE_SIZE);
>> +	if (IS_ERR(mem))
>> +		goto err_free;
>> +
>> +	obj->iommu_domains = mem;
>> +	ser->iommu_domains_phys = virt_to_phys(obj->iommu_domains);
>> +
>> +	mem = kho_alloc_preserve(PAGE_SIZE);
>> +	if (IS_ERR(mem))
>> +		goto err_free;
>> +
>> +	obj->devices = mem;
>> +	ser->devices_phys = virt_to_phys(obj->devices);
>> +
>> +	mem = kho_alloc_preserve(PAGE_SIZE);
>> +	if (IS_ERR(mem))
>> +		goto err_free;
>> +
>> +	obj->iommus = mem;
>> +	ser->iommus_phys = virt_to_phys(obj->iommus);
>> +
>> +	argp->obj = obj;
>> +	argp->data = virt_to_phys(ser);
>> +	return 0;
>> +
>> +err_free:
>> +	iommu_liveupdate_flb_free(obj);
>> +	return PTR_ERR(mem);
>> +}
>> +
>> +static void iommu_liveupdate_flb_unpreserve(struct liveupdate_flb_op_args *argp)
>> +{
>> +	iommu_liveupdate_flb_free(argp->obj);
>> +}
>> +
>> +static void iommu_liveupdate_flb_finish(struct liveupdate_flb_op_args *argp)
>> +{
>> +	struct iommu_lu_flb_obj *obj = argp->obj;
>> +
>> +	if (obj->iommu_domains)
>> +		iommu_liveupdate_free_objs(obj->ser->iommu_domains_phys, true);
>> +
>> +	if (obj->devices)
>> +		iommu_liveupdate_free_objs(obj->ser->devices_phys, true);
>> +
>> +	if (obj->iommus)
>> +		iommu_liveupdate_free_objs(obj->ser->iommus_phys, true);
>> +
>> +	folio_put(virt_to_folio(obj->ser));
>> +	kfree(obj);
>> +}
>> +
>> +static int iommu_liveupdate_flb_retrieve(struct liveupdate_flb_op_args *argp)
>> +{
>> +	struct iommu_lu_flb_obj *obj;
>> +	struct iommu_lu_flb_ser *ser;
>> +
>> +	obj = kzalloc(sizeof(*obj), GFP_ATOMIC);
>
>Why does this have to be GFP_ATOMIC? IIUC, the retrieve path is
>triggered by a userspace IOCTL in the new kernel, so the system should
>be able to sleep here (unless we have a use case for calling this in
>IRQ context?). AFAICT, we already call this under mutexes, so there's
>no spinlock context in which sleeping would be forbidden.
>
>GFP_ATOMIC creates a point of failure if the system is under memory
>pressure. I believe we should be allowed to sleep for this allocation:
>the "preserved" mappings still allow DMA to continue, and we're in no
>hurry to restore the IOMMU state. I believe this could be GFP_KERNEL.
>
>> +	if (!obj)
>> +		return -ENOMEM;
>> +
>> +	mutex_init(&obj->lock);
>> +	BUG_ON(!kho_restore_folio(argp->data));
>
>The use of BUG_ON in new code is heavily discouraged [1].
>If KHO can't restore the folio for whatever reason, we can treat it as
>corruption of the handover data. I believe crashing the kernel for it
>would be overkill.

The FLB restore is done during early boot, and this has been discussed
in the past in the KHO/LUO and IOMMU context. Basically, if this fails,
the IOMMU state cannot be restored, and the preserved devices would
already have corrupted memory due to ongoing DMA, since we didn't
disable translation in the previous kernel. So logging an error and a
BUG_ON at this point would be most appropriate.

Please see discussion here on this topic:
https://lore.kernel.org/all/20251118153631.GB90703@nvidia.com/
>
>Can we consider returning a graceful failure like -ENODATA or something?
>BUG_ON would instantly cause a kernel panic without providing any
>opportunity for the system to log the failure or attempt a graceful
>teardown of the 'preserved' mapping.

I will update this by adding a comment and also logging an error.
>
>> +	ser = phys_to_virt(argp->data);
>> +	obj->ser = ser;
>> +
>> +	iommu_liveupdate_restore_objs(ser->iommu_domains_phys);
>> +	obj->iommu_domains = phys_to_virt(ser->iommu_domains_phys);
>> +
>> +	iommu_liveupdate_restore_objs(ser->devices_phys);
>> +	obj->devices = phys_to_virt(ser->devices_phys);
>> +
>> +	iommu_liveupdate_restore_objs(ser->iommus_phys);
>> +	obj->iommus = phys_to_virt(ser->iommus_phys);
>> +
>> +	argp->obj = obj;
>> +
>> +	return 0;
>> +}
>> +
>> +static struct liveupdate_flb_ops iommu_flb_ops = {
>> +	.preserve = iommu_liveupdate_flb_preserve,
>> +	.unpreserve = iommu_liveupdate_flb_unpreserve,
>> +	.finish = iommu_liveupdate_flb_finish,
>> +	.retrieve = iommu_liveupdate_flb_retrieve,
>> +};
>> +
>> +static struct liveupdate_flb iommu_flb = {
>> +	.compatible = IOMMU_LUO_FLB_COMPATIBLE,
>> +	.ops = &iommu_flb_ops,
>> +};
>> +
>> +int iommu_liveupdate_register_flb(struct liveupdate_file_handler *handler)
>> +{
>> +	return liveupdate_register_flb(handler, &iommu_flb);
>> +}
>> +EXPORT_SYMBOL(iommu_liveupdate_register_flb);
>> +
>> +int iommu_liveupdate_unregister_flb(struct liveupdate_file_handler *handler)
>> +{
>> +	return liveupdate_unregister_flb(handler, &iommu_flb);
>> +}
>> +EXPORT_SYMBOL(iommu_liveupdate_unregister_flb);
>> diff --git a/include/linux/iommu-lu.h b/include/linux/iommu-lu.h
>> new file mode 100644
>> index 000000000000..59095d2f1bb2
>> --- /dev/null
>> +++ b/include/linux/iommu-lu.h
>> @@ -0,0 +1,17 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +
>> +/*
>> + * Copyright (C) 2025, Google LLC
>> + * Author: Samiullah Khawaja <skhawaja@google.com>
>> + */
>> +
>> +#ifndef _LINUX_IOMMU_LU_H
>> +#define _LINUX_IOMMU_LU_H
>> +
>> +#include <linux/liveupdate.h>
>> +#include <linux/kho/abi/iommu.h>
>> +
>> +int iommu_liveupdate_register_flb(struct liveupdate_file_handler *handler);
>> +int iommu_liveupdate_unregister_flb(struct liveupdate_file_handler *handler);
>> +
>> +#endif /* _LINUX_IOMMU_LU_H */
>> diff --git a/include/linux/kho/abi/iommu.h b/include/linux/kho/abi/iommu.h
>> new file mode 100644
>> index 000000000000..8e1c05cfe7bb
>> --- /dev/null
>> +++ b/include/linux/kho/abi/iommu.h
>> @@ -0,0 +1,119 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +
>> +/*
>> + * Copyright (C) 2025, Google LLC
>> + * Author: Samiullah Khawaja <skhawaja@google.com>
>> + */
>> +
>> +#ifndef _LINUX_KHO_ABI_IOMMU_H
>> +#define _LINUX_KHO_ABI_IOMMU_H
>> +
>> +#include <linux/mutex_types.h>
>> +#include <linux/compiler.h>
>> +#include <linux/types.h>
>> +
>> +/**
>> + * DOC: IOMMU File-Lifecycle Bound (FLB) Live Update ABI
>> + *
>> + * This header defines the ABI for preserving IOMMU state across kexec using
>> + * Live Update File-Lifecycle Bound (FLB) data.
>> + *
>> + * This interface is a contract. Any modification to any of the serialization
>> + * structs defined here constitutes a breaking change. Such changes require
>> + * incrementing the version number in the IOMMU_LUO_FLB_COMPATIBLE string.
>> + */
>> +
>> +#define IOMMU_LUO_FLB_COMPATIBLE "iommu-v1"
>> +
>
>Let's call this "iommu-liveupdate-v1" or "iommu-lu-v1" instead?

Agreed.
>
>> +enum iommu_lu_type {
>> +	IOMMU_INVALID,
>> +	IOMMU_INTEL,
>> +};
>> +
>> +struct iommu_obj_ser {
>> +	u32 idx;
>> +	u32 ref_count;
>> +	u32 deleted:1;
>> +	u32 incoming:1;
>> +} __packed;
>> +
>> +struct iommu_domain_ser {
>> +	struct iommu_obj_ser obj;
>> +	u64 top_table;
>> +	u64 top_level;
>> +	struct iommu_domain *restored_domain;
>> +} __packed;
>> +
>> +struct device_domain_iommu_ser {
>> +	u32 did;
>> +	u64 domain_phys;
>> +	u64 iommu_phys;
>> +} __packed;
>> +
>> +struct device_ser {
>> +	struct iommu_obj_ser obj;
>> +	u64 token;
>> +	u32 devid;
>> +	u32 pci_domain;
>> +	struct device_domain_iommu_ser domain_iommu_ser;
>> +	enum iommu_lu_type type;
>> +} __packed;
>> +
>> +struct iommu_intel_ser {
>> +	u64 phys_addr;
>> +	u64 root_table;
>> +} __packed;
>> +
>> +struct iommu_ser {
>> +	struct iommu_obj_ser obj;
>> +	u64 token;
>> +	enum iommu_lu_type type;
>> +	union {
>> +		struct iommu_intel_ser intel;
>> +	};
>> +} __packed;
>> +
>> +struct iommu_objs_ser {
>> +	u64 next_objs;
>> +	u64 nr_objs;
>> +} __packed;
>> +
>> +struct iommus_ser {
>> +	struct iommu_objs_ser objs;
>> +	struct iommu_ser iommus[];
>> +} __packed;
>> +
>> +struct iommu_domains_ser {
>> +	struct iommu_objs_ser objs;
>> +	struct iommu_domain_ser iommu_domains[];
>> +} __packed;
>> +
>> +struct devices_ser {
>> +	struct iommu_objs_ser objs;
>> +	struct device_ser devices[];
>> +} __packed;
>> +
>> +#define MAX_IOMMU_SERS ((PAGE_SIZE - sizeof(struct iommus_ser)) / sizeof(struct iommu_ser))
>> +#define MAX_IOMMU_DOMAIN_SERS \
>> +		((PAGE_SIZE - sizeof(struct iommu_domains_ser)) / sizeof(struct iommu_domain_ser))
>> +#define MAX_DEVICE_SERS ((PAGE_SIZE - sizeof(struct devices_ser)) / sizeof(struct device_ser))
>> +
>> +struct iommu_lu_flb_ser {
>> +	u64 iommus_phys;
>> +	u64 nr_iommus;
>> +	u64 iommu_domains_phys;
>> +	u64 nr_domains;
>> +	u64 devices_phys;
>> +	u64 nr_devices;
>> +} __packed;
>> +
>> +struct iommu_lu_flb_obj {
>> +	struct mutex lock;
>> +	struct iommu_lu_flb_ser *ser;
>> +
>> +	struct iommu_domains_ser *iommu_domains;
>> +	struct iommus_ser *iommus;
>> +	struct devices_ser *devices;
>> +} __packed;
>> +
>
>Let's add some comments describing the structs and their members here,
>like we have in memfd [2]. These should be descriptive for the user.
>For example:

I agree. Will add comments for these and others.
>
>+/**
>+ * struct iommu_lu_flb_ser - Main serialization header for IOMMU state.
>+ * @iommus_phys:        Physical address of the first page in the IOMMU unit chain.
>+ * @nr_iommus:          Total number of hardware IOMMU units preserved.
>+ * @iommu_domains_phys: [...]
>+ * @nr_domains:         [...]
>+ * @devices_phys:       [...]
>+ * @nr_devices:         [...]
>+ *
>+ * This structure acts as the root of the IOMMU state tree. It is hitching a ride
>+ * on the iommufd file descriptor's preservation flow.
>+ */
>+struct iommu_lu_flb_ser {
>+	u64 iommus_phys;
>+	u64 nr_iommus;
>+	u64 iommu_domains_phys;
>+	u64 nr_domains;
>+	u64 devices_phys;
>+	u64 nr_devices;
>+} __packed;
>
>> +#endif /* _LINUX_KHO_ABI_IOMMU_H */
>
>Thanks,
>Praan
>
>[1] https://docs.kernel.org/process/coding-style.html#use-warn-rather-than-bug
>[2] https://elixir.bootlin.com/linux/v7.0-rc3/source/include/linux/kho/abi/memfd.h
>

Thanks for the review.

Sami

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 02/14] iommu: Implement IOMMU core liveupdate skeleton
  2026-02-03 22:09 ` [PATCH 02/14] iommu: Implement IOMMU core liveupdate skeleton Samiullah Khawaja
@ 2026-03-12 23:10   ` Pranjal Shrivastava
  2026-03-13 18:42     ` Samiullah Khawaja
  2026-03-17 19:58   ` Vipin Sharma
  1 sibling, 1 reply; 98+ messages in thread
From: Pranjal Shrivastava @ 2026-03-12 23:10 UTC (permalink / raw)
  To: Samiullah Khawaja
  Cc: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe, Robin Murphy, Kevin Tian, Alex Williamson,
	Shuah Khan, iommu, linux-kernel, kvm, Saeed Mahameed,
	Adithya Jayachandran, Parav Pandit, Leon Romanovsky, William Tu,
	Pratyush Yadav, Pasha Tatashin, David Matlack, Andrew Morton,
	Chris Li, Vipin Sharma, YiFei Zhu

On Tue, Feb 03, 2026 at 10:09:36PM +0000, Samiullah Khawaja wrote:
> Add IOMMU domain ops that can be implemented by the IOMMU drivers if
> they support IOMMU domain preservation across liveupdate. The new IOMMU
> domain preserve, unpreserve and restore APIs call these ops to perform
> respective live update operations.
> 
> Similarly add IOMMU ops to preserve/unpreserve a device. These can be
> implemented by the IOMMU drivers that support preservation of devices
> that have their IOMMU domains preserved. During device preservation the
> state of the associated IOMMU is also preserved. The device can only be
> preserved if the attached iommu domain is preserved and the associated
> iommu supports preservation.
> 
> The preserved state of the device and IOMMU needs to be fetched during
> shutdown and boot in the next kernel. Add APIs that can be used to fetch
> the preserved state of a device and IOMMU. The APIs will only be used
> during shutdown and after liveupdate so no locking needed.
> 
> Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
> ---
>  drivers/iommu/iommu.c      |   3 +
>  drivers/iommu/liveupdate.c | 326 +++++++++++++++++++++++++++++++++++++
>  include/linux/iommu-lu.h   | 119 ++++++++++++++
>  include/linux/iommu.h      |  32 ++++
>  4 files changed, 480 insertions(+)
> 
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 4926a43118e6..c0632cb5b570 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -389,6 +389,9 @@ static struct dev_iommu *dev_iommu_get(struct device *dev)
>  
>  	mutex_init(&param->lock);
>  	dev->iommu = param;
> +#ifdef CONFIG_IOMMU_LIVEUPDATE
> +	dev->iommu->device_ser = NULL;
> +#endif
>  	return param;
>  }
>  
> diff --git a/drivers/iommu/liveupdate.c b/drivers/iommu/liveupdate.c
> index 6189ba32ff2c..83eb609b3fd7 100644
> --- a/drivers/iommu/liveupdate.c
> +++ b/drivers/iommu/liveupdate.c
> @@ -11,6 +11,7 @@
>  #include <linux/liveupdate.h>
>  #include <linux/iommu-lu.h>
>  #include <linux/iommu.h>
> +#include <linux/pci.h>
>  #include <linux/errno.h>
>  
>  static void iommu_liveupdate_restore_objs(u64 next)
> @@ -175,3 +176,328 @@ int iommu_liveupdate_unregister_flb(struct liveupdate_file_handler *handler)
>  	return liveupdate_unregister_flb(handler, &iommu_flb);
>  }
>  EXPORT_SYMBOL(iommu_liveupdate_unregister_flb);
> +
> +int iommu_for_each_preserved_device(iommu_preserved_device_iter_fn fn,
> +				    void *arg)
> +{
> +	struct iommu_lu_flb_obj *obj;
> +	struct devices_ser *devices;
> +	int ret, i, idx;
> +
> +	ret = liveupdate_flb_get_incoming(&iommu_flb, (void **)&obj);
> +	if (ret)
> +		return -ENOENT;
> +
> +	devices = __va(obj->ser->devices_phys);
> +	for (i = 0, idx = 0; i < obj->ser->nr_devices; ++i, ++idx) {
> +		if (idx >= MAX_DEVICE_SERS) {
> +			devices = __va(devices->objs.next_objs);
> +			idx = 0;
> +		}
> +
> +		if (devices->devices[idx].obj.deleted)
> +			continue;
> +
> +		ret = fn(&devices->devices[idx], arg);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL(iommu_for_each_preserved_device);
> +
> +static inline bool device_ser_match(struct device_ser *match,
> +				    struct pci_dev *pdev)
> +{
> +	return match->devid == pci_dev_id(pdev) && match->pci_domain == pci_domain_nr(pdev->bus);
> +}
> +
> +struct device_ser *iommu_get_device_preserved_data(struct device *dev)
> +{
> +	struct iommu_lu_flb_obj *obj;
> +	struct devices_ser *devices;
> +	int ret, i, idx;
> +
> +	if (!dev_is_pci(dev))
> +		return NULL;
> +
> +	ret = liveupdate_flb_get_incoming(&iommu_flb, (void **)&obj);
> +	if (ret)
> +		return NULL;
> +
> +	devices = __va(obj->ser->devices_phys);
> +	for (i = 0, idx = 0; i < obj->ser->nr_devices; ++i, ++idx) {
> +		if (idx >= MAX_DEVICE_SERS) {
> +			devices = __va(devices->objs.next_objs);
> +			idx = 0;
> +		}
> +
> +		if (devices->devices[idx].obj.deleted)
> +			continue;
> +
> +		if (device_ser_match(&devices->devices[idx], to_pci_dev(dev))) {
> +			devices->devices[idx].obj.incoming = true;
> +			return &devices->devices[idx];
> +		}
> +	}
> +
> +	return NULL;
> +}
> +EXPORT_SYMBOL(iommu_get_device_preserved_data);
> +
> +struct iommu_ser *iommu_get_preserved_data(u64 token, enum iommu_lu_type type)
> +{
> +	struct iommu_lu_flb_obj *obj;
> +	struct iommus_ser *iommus;
> +	int ret, i, idx;
> +
> +	ret = liveupdate_flb_get_incoming(&iommu_flb, (void **)&obj);
> +	if (ret)
> +		return NULL;
> +
> +	iommus = __va(obj->ser->iommus_phys);
> +	for (i = 0, idx = 0; i < obj->ser->nr_iommus; ++i, ++idx) {
> +		if (idx >= MAX_IOMMU_SERS) {
> +			iommus = __va(iommus->objs.next_objs);
> +			idx = 0;
> +		}
> +
> +		if (iommus->iommus[idx].obj.deleted)
> +			continue;
> +
> +		if (iommus->iommus[idx].token == token &&
> +		    iommus->iommus[idx].type == type)
> +			return &iommus->iommus[idx];
> +	}
> +
> +	return NULL;
> +}
> +EXPORT_SYMBOL(iommu_get_preserved_data);
> +
> +static int reserve_obj_ser(struct iommu_objs_ser **objs_ptr, u64 max_objs)

Isn't this more of an "insert" / "populate" / write-entry operation? We
could rename it to something like iommu_ser_push_entry or
iommu_ser_write_entry.

> +{
> +	struct iommu_objs_ser *next_objs, *objs = *objs_ptr;

Not loving these names :(

TBH, the reserve_obj_ser function isn't too readable, especially with
all the variable names, here and in the LU header.. I've had to go back
and forth to the first patch. For example, here next_objs could be
next_objs_page and objs could be curr_page. (PTAL at my reply on PATCH
01 about renaming.)

> +	int idx;
> +
> +	if (objs->nr_objs == max_objs) {
> +		next_objs = kho_alloc_preserve(PAGE_SIZE);
> +		if (IS_ERR(next_objs))
> +			return PTR_ERR(next_objs);
> +
> +		objs->next_objs = virt_to_phys(next_objs);
> +		objs = next_objs;
> +		*objs_ptr = objs;
> +		objs->nr_objs = 0;
> +		objs->next_objs = 0;

This seems redundant; there's no need to zero these out.
kho_alloc_preserve passes __GFP_ZERO to folio_alloc [1], so the pages
are already zero-allocated.

> +	}
> +
> +	idx = objs->nr_objs++;
> +	return idx;
> +}

Just to give a mental model to fellow reviewers, here's how this is laid
out:

----------------------------------------------------------------------
[ PAGE START ]                                                       |
----------------------------------------------------------------------
| iommu_objs_ser (The Page Header)                                   |
|   - next_objs: 0x0000 (End of the page-chain)                      |
|   - nr_objs: 2                                                     |
----------------------------------------------------------------------
| ITEM 0: iommu_domain_ser                                           |
|   [ iommu_obj_ser (The entry header) ]                             |
|     - idx: 0                                                       |
|     - ref_count: 1                                                 |
|     - deleted: 0                                                   |
|   [ Domain Data ]                                                  |
----------------------------------------------------------------------
| ITEM 1: iommu_domain_ser                                           |
|   [ iommu_obj_ser (The entry header) ]                             |
|     - idx: 1                                                       |
|     - ref_count: 1                                                 |
|     - deleted: 0                                                   |
|   [ Domain Data ]                                                  |
----------------------------------------------------------------------
| ... (Empty space for more domains) ...                             |
|                                                                    |
----------------------------------------------------------------------
[ PAGE END ]                                                         |
----------------------------------------------------------------------

> +
> +int iommu_domain_preserve(struct iommu_domain *domain, struct iommu_domain_ser **ser)
> +{
> +	struct iommu_domain_ser *domain_ser;
> +	struct iommu_lu_flb_obj *flb_obj;
> +	int idx, ret;
> +
> +	if (!domain->ops->preserve)
> +		return -EOPNOTSUPP;
> +
> +	ret = liveupdate_flb_get_outgoing(&iommu_flb, (void **)&flb_obj);
> +	if (ret)
> +		return ret;
> +
> +	guard(mutex)(&flb_obj->lock);
> +	idx = reserve_obj_ser((struct iommu_objs_ser **)&flb_obj->iommu_domains,
> +			      MAX_IOMMU_DOMAIN_SERS);
> +	if (idx < 0)
> +		return idx;
> +
> +	domain_ser = &flb_obj->iommu_domains->iommu_domains[idx];

This is slightly less readable as well. I understand we're trying to do
iommu_domains_ser -> iommu_domain_ser[idx], but the repeated name
(iommu_domains) makes it hard to read.. we should rename this to
something like:

&flb_obj->iommu_domains_page->domain_entries[idx]

for better readability..

Also, let's add a comment explaining that reserve_obj_ser actually
advances the flb_obj's current-page pointer when necessary..

IIUC, we start with PAGE 0 initially, store its phys in the
iommu_flb_preserve op (the iommu_ser_phys et al), and then go on
allocating more pages, keeping the "current" active page registered with
the liveupdate core. Now when we jump into the new kernel, we read the
ser_phys and then follow the page chain, right?
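
To sanity-check that mental model, here's a small userspace simulation
(plain C with hypothetical names and a toy per-page capacity, not the
actual kernel code) of the writer/reader pattern used by
reserve_obj_ser and iommu_for_each_preserved_device:

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-in for the serialized page layout (hypothetical names). */
struct page_hdr {
	struct page_hdr *next;	/* the kernel stores a phys addr; a pointer here */
	unsigned int nr;	/* entries used on this page, like nr_objs */
};

#define PER_PAGE 2	/* toy capacity; the real one is derived from PAGE_SIZE */

struct ser_page {
	struct page_hdr hdr;
	int entries[PER_PAGE];
};

/* Writer: append an entry, chaining a fresh zeroed page when full.
 * Allocation failure handling is elided for brevity. */
static struct ser_page *push_entry(struct ser_page *cur, int val)
{
	if (cur->hdr.nr == PER_PAGE) {
		struct ser_page *next = calloc(1, sizeof(*next));

		cur->hdr.next = &next->hdr;
		cur = next;
	}
	cur->entries[cur->hdr.nr++] = val;
	return cur;	/* the "current" active page advances */
}

/* Reader: walk the chain against a global entry count, mirroring the
 * loop shape of iommu_for_each_preserved_device above. */
static int sum_entries(struct ser_page *head, unsigned int total)
{
	struct ser_page *p = head;
	int sum = 0;

	for (unsigned int i = 0, idx = 0; i < total; i++, idx++) {
		if (idx >= PER_PAGE) {
			p = (struct ser_page *)p->hdr.next;
			idx = 0;
		}
		sum += p->entries[idx];
	}
	return sum;
}
```

With PER_PAGE = 2, pushing five entries produces a three-page chain,
and the reader recovers all five by following hdr.next, matching how the
restore path follows devices->objs.next_objs.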

> +	idx = flb_obj->ser->nr_domains++;
> +	domain_ser->obj.idx = idx;
> +	domain_ser->obj.ref_count = 1;
> +
> +	ret = domain->ops->preserve(domain, domain_ser);
> +	if (ret) {
> +		domain_ser->obj.deleted = true;
> +		return ret;
> +	}
> +
> +	domain->preserved_state = domain_ser;
> +	*ser = domain_ser;
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(iommu_domain_preserve);
> +
> +void iommu_domain_unpreserve(struct iommu_domain *domain)
> +{
> +	struct iommu_domain_ser *domain_ser;
> +	struct iommu_lu_flb_obj *flb_obj;
> +	int ret;
> +
> +	if (!domain->ops->unpreserve)
> +		return;
> +
> +	ret = liveupdate_flb_get_outgoing(&iommu_flb, (void **)&flb_obj);
> +	if (ret)
> +		return;
> +
> +	guard(mutex)(&flb_obj->lock);
> +
> +	/*
> +	 * There is no check for attached devices here. The correctness relies
> +	 * on the Live Update Orchestrator's session lifecycle. All resources
> +	 * (iommufd, vfio devices) are preserved within a single session. If the
> +	 * session is torn down, the .unpreserve callbacks for all files will be
> +	 * invoked, ensuring a consistent cleanup without needing explicit
> +	 * refcounting for the serialized objects here.
> +	 */
> +	domain_ser = domain->preserved_state;
> +	domain->ops->unpreserve(domain, domain_ser);
> +	domain_ser->obj.deleted = true;
> +	domain->preserved_state = NULL;
> +}
> +EXPORT_SYMBOL_GPL(iommu_domain_unpreserve);
> +
> +static int iommu_preserve_locked(struct iommu_device *iommu)
> +{
> +	struct iommu_lu_flb_obj *flb_obj;
> +	struct iommu_ser *iommu_ser;
> +	int idx, ret;
> +
> +	if (!iommu->ops->preserve)
> +		return -EOPNOTSUPP;
> +
> +	if (iommu->outgoing_preserved_state) {
> +		iommu->outgoing_preserved_state->obj.ref_count++;
> +		return 0;
> +	}
> +
> +	ret = liveupdate_flb_get_outgoing(&iommu_flb, (void **)&flb_obj);
> +	if (ret)
> +		return ret;
> +
> +	idx = reserve_obj_ser((struct iommu_objs_ser **)&flb_obj->iommus,
> +			      MAX_IOMMU_SERS);
> +	if (idx < 0)
> +		return idx;
> +
> +	iommu_ser = &flb_obj->iommus->iommus[idx];
> +	idx = flb_obj->ser->nr_iommus++;
> +	iommu_ser->obj.idx = idx;
> +	iommu_ser->obj.ref_count = 1;
> +
> +	ret = iommu->ops->preserve(iommu, iommu_ser);
> +	if (ret)
> +		iommu_ser->obj.deleted = true;
> +
> +	iommu->outgoing_preserved_state = iommu_ser;
> +	return ret;
> +}
> +
> +static void iommu_unpreserve_locked(struct iommu_device *iommu)
> +{
> +	struct iommu_ser *iommu_ser = iommu->outgoing_preserved_state;
> +
> +	iommu_ser->obj.ref_count--;
> +	if (iommu_ser->obj.ref_count)
> +		return;
> +
> +	iommu->outgoing_preserved_state = NULL;
> +	iommu->ops->unpreserve(iommu, iommu_ser);
> +	iommu_ser->obj.deleted = true;
> +}
> +
> +int iommu_preserve_device(struct iommu_domain *domain,
> +			  struct device *dev, u64 token)
> +{
> +	struct iommu_lu_flb_obj *flb_obj;
> +	struct device_ser *device_ser;
> +	struct dev_iommu *iommu;
> +	struct pci_dev *pdev;
> +	int ret, idx;
> +
> +	if (!dev_is_pci(dev))
> +		return -EOPNOTSUPP;
> +
> +	if (!domain->preserved_state)
> +		return -EINVAL;
> +
> +	pdev = to_pci_dev(dev);
> +	iommu = dev->iommu;
> +	if (!iommu->iommu_dev->ops->preserve_device ||
> +	    !iommu->iommu_dev->ops->preserve)
> +		return -EOPNOTSUPP;
> +
> +	ret = liveupdate_flb_get_outgoing(&iommu_flb, (void **)&flb_obj);
> +	if (ret)
> +		return ret;
> +
> +	guard(mutex)(&flb_obj->lock);
> +	idx = reserve_obj_ser((struct iommu_objs_ser **)&flb_obj->devices,
> +			      MAX_DEVICE_SERS);
> +	if (idx < 0)
> +		return idx;
> +
> +	device_ser = &flb_obj->devices->devices[idx];
> +	idx = flb_obj->ser->nr_devices++;
> +	device_ser->obj.idx = idx;
> +	device_ser->obj.ref_count = 1;
> +
> +	ret = iommu_preserve_locked(iommu->iommu_dev);
> +	if (ret) {
> +		device_ser->obj.deleted = true;
> +		return ret;
> +	}
> +
> +	device_ser->domain_iommu_ser.domain_phys = __pa(domain->preserved_state);
> +	device_ser->domain_iommu_ser.iommu_phys = __pa(iommu->iommu_dev->outgoing_preserved_state);
> +	device_ser->devid = pci_dev_id(pdev);
> +	device_ser->pci_domain = pci_domain_nr(pdev->bus);
> +	device_ser->token = token;
> +
> +	ret = iommu->iommu_dev->ops->preserve_device(dev, device_ser);
> +	if (ret) {
> +		device_ser->obj.deleted = true;
> +		iommu_unpreserve_locked(iommu->iommu_dev);
> +		return ret;
> +	}
> +
> +	dev->iommu->device_ser = device_ser;
> +	return 0;
> +}
> +
> +void iommu_unpreserve_device(struct iommu_domain *domain, struct device *dev)
> +{
> +	struct iommu_lu_flb_obj *flb_obj;
> +	struct device_ser *device_ser;
> +	struct dev_iommu *iommu;
> +	struct pci_dev *pdev;
> +	int ret;
> +
> +	if (!dev_is_pci(dev))
> +		return;
> +
> +	pdev = to_pci_dev(dev);
> +	iommu = dev->iommu;
> +	if (!iommu->iommu_dev->ops->unpreserve_device ||
> +	    !iommu->iommu_dev->ops->unpreserve)
> +		return;
> +
> +	ret = liveupdate_flb_get_outgoing(&iommu_flb, (void **)&flb_obj);
> +	if (WARN_ON(ret))
> +		return;
> +
> +	guard(mutex)(&flb_obj->lock);
> +	device_ser = dev_iommu_preserved_state(dev);
> +	if (WARN_ON(!device_ser))
> +		return;
> +
> +	iommu->iommu_dev->ops->unpreserve_device(dev, device_ser);
> +	dev->iommu->device_ser = NULL;
> +
> +	iommu_unpreserve_locked(iommu->iommu_dev);
> +}

I'm wondering if we should guard these APIs against accidental or
potential abuse by other in-kernel drivers, since the Live Update
Orchestrator (LUO) is architecturally designed around user-space
driven sequences (IOCTLs, specific mutex ordering, etc.) for now.

Since the header file is also under include/linux, we should avoid any
possibility of drivers ending up depending on these APIs.
We could add a check based on the DMA owner:

+	/*
+	 * Only devices explicitly claimed by a user-space driver
+	 * (VFIO/IOMMUFD) are eligible for Live Update preservation.
+	 */
+	if (!iommu_group_dma_owner_claimed(dev->iommu_group))
+		return -EPERM;

This should ensure we aren't creating 'zombie' preserved states for
devices not managed by IOMMUFD/VFIO.


> diff --git a/include/linux/iommu-lu.h b/include/linux/iommu-lu.h
> index 59095d2f1bb2..48c07514a776 100644
> --- a/include/linux/iommu-lu.h
> +++ b/include/linux/iommu-lu.h
> @@ -8,9 +8,128 @@
>  #ifndef _LINUX_IOMMU_LU_H
>  #define _LINUX_IOMMU_LU_H
>  
> +#include <linux/device.h>
> +#include <linux/iommu.h>
>  #include <linux/liveupdate.h>
>  #include <linux/kho/abi/iommu.h>
>  
> +typedef int (*iommu_preserved_device_iter_fn)(struct device_ser *ser,
> +					      void *arg);
> +#ifdef CONFIG_IOMMU_LIVEUPDATE
> +static inline void *dev_iommu_preserved_state(struct device *dev)
> +{
> +	struct device_ser *ser;
> +
> +	if (!dev->iommu)
> +		return NULL;
> +
> +	ser = dev->iommu->device_ser;
> +	if (ser && !ser->obj.incoming)
> +		return ser;
> +
> +	return NULL;
> +}
> +
> +static inline void *dev_iommu_restored_state(struct device *dev)
> +{
> +	struct device_ser *ser;
> +
> +	if (!dev->iommu)
> +		return NULL;
> +
> +	ser = dev->iommu->device_ser;
> +	if (ser && ser->obj.incoming)
> +		return ser;
> +
> +	return NULL;
> +}
> +
> +static inline void *iommu_domain_restored_state(struct iommu_domain *domain)
> +{
> +	struct iommu_domain_ser *ser;
> +
> +	ser = domain->preserved_state;
> +	if (ser && ser->obj.incoming)
> +		return ser;
> +
> +	return NULL;
> +}
> +
> +static inline int dev_iommu_restore_did(struct device *dev, struct iommu_domain *domain)
> +{
> +	struct device_ser *ser = dev_iommu_restored_state(dev);
> +
> +	if (ser && iommu_domain_restored_state(domain))
> +		return ser->domain_iommu_ser.did;
> +
> +	return -1;
> +}
> +
> +int iommu_for_each_preserved_device(iommu_preserved_device_iter_fn fn,
> +				    void *arg);
> +struct device_ser *iommu_get_device_preserved_data(struct device *dev);
> +struct iommu_ser *iommu_get_preserved_data(u64 token, enum iommu_lu_type type);
> +int iommu_domain_preserve(struct iommu_domain *domain, struct iommu_domain_ser **ser);
> +void iommu_domain_unpreserve(struct iommu_domain *domain);
> +int iommu_preserve_device(struct iommu_domain *domain,
> +			  struct device *dev, u64 token);
> +void iommu_unpreserve_device(struct iommu_domain *domain, struct device *dev);
> +#else
> +static inline void *dev_iommu_preserved_state(struct device *dev)
> +{
> +	return NULL;
> +}
> +
> +static inline void *dev_iommu_restored_state(struct device *dev)
> +{
> +	return NULL;
> +}
> +
> +static inline int dev_iommu_restore_did(struct device *dev, struct iommu_domain *domain)
> +{
> +	return -1;
> +}
> +
> +static inline void *iommu_domain_restored_state(struct iommu_domain *domain)
> +{
> +	return NULL;
> +}
> +
> +static inline int iommu_for_each_preserved_device(iommu_preserved_device_iter_fn fn, void *arg)
> +{
> +	return -EOPNOTSUPP;
> +}
> +
> +static inline struct device_ser *iommu_get_device_preserved_data(struct device *dev)
> +{
> +	return NULL;
> +}
> +
> +static inline struct iommu_ser *iommu_get_preserved_data(u64 token, enum iommu_lu_type type)
> +{
> +	return NULL;
> +}
> +
> +static inline int iommu_domain_preserve(struct iommu_domain *domain, struct iommu_domain_ser **ser)
> +{
> +	return -EOPNOTSUPP;
> +}
> +
> +static inline void iommu_domain_unpreserve(struct iommu_domain *domain)
> +{
> +}
> +
> +static inline int iommu_preserve_device(struct iommu_domain *domain,
> +					struct device *dev, u64 token)
> +{
> +	return -EOPNOTSUPP;
> +}
> +
> +static inline void iommu_unpreserve_device(struct iommu_domain *domain, struct device *dev)
> +{
> +}
> +#endif
> +
>  int iommu_liveupdate_register_flb(struct liveupdate_file_handler *handler);
>  int iommu_liveupdate_unregister_flb(struct liveupdate_file_handler *handler);
>  
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index 54b8b48c762e..bd949c1ce7c5 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -14,6 +14,8 @@
>  #include <linux/err.h>
>  #include <linux/of.h>
>  #include <linux/iova_bitmap.h>
> +#include <linux/atomic.h>
> +#include <linux/kho/abi/iommu.h>
>  #include <uapi/linux/iommufd.h>
>  
>  #define IOMMU_READ	(1 << 0)
> @@ -248,6 +250,10 @@ struct iommu_domain {
>  			struct list_head next;
>  		};
>  	};
> +
> +#ifdef CONFIG_IOMMU_LIVEUPDATE
> +	struct iommu_domain_ser *preserved_state;
> +#endif
>  };
>  
>  static inline bool iommu_is_dma_domain(struct iommu_domain *domain)
> @@ -647,6 +653,10 @@ __iommu_copy_struct_to_user(const struct iommu_user_data *dst_data,
>   *               resources shared/passed to user space IOMMU instance. Associate
>   *               it with a nesting @parent_domain. It is required for driver to
>   *               set @viommu->ops pointing to its own viommu_ops
> + * @preserve_device: Preserve state of a device for liveupdate.
> + * @unpreserve_device: Unpreserve state that was preserved earlier.
> + * @preserve: Preserve state of iommu translation hardware for liveupdate.
> + * @unpreserve: Unpreserve state of iommu that was preserved earlier.
>   * @owner: Driver module providing these ops
>   * @identity_domain: An always available, always attachable identity
>   *                   translation.
> @@ -703,6 +713,11 @@ struct iommu_ops {
>  			   struct iommu_domain *parent_domain,
>  			   const struct iommu_user_data *user_data);
>  
> +	int (*preserve_device)(struct device *dev, struct device_ser *device_ser);
> +	void (*unpreserve_device)(struct device *dev, struct device_ser *device_ser);

Nit: Let's move the _device ops under the comment:
`/* Per device IOMMU features */`

> +	int (*preserve)(struct iommu_device *iommu, struct iommu_ser *iommu_ser);
> +	void (*unpreserve)(struct iommu_device *iommu, struct iommu_ser *iommu_ser);
> +

I'm wondering if there's any benefit to adding these ops under an #ifdef?

>  	const struct iommu_domain_ops *default_domain_ops;
>  	struct module *owner;
>  	struct iommu_domain *identity_domain;
> @@ -749,6 +764,11 @@ struct iommu_ops {
>   *                           specific mechanisms.
>   * @set_pgtable_quirks: Set io page table quirks (IO_PGTABLE_QUIRK_*)
>   * @free: Release the domain after use.
> + * @preserve: Preserve the iommu domain for liveupdate.
> + *            Returns 0 on success, a negative errno on failure.
> + * @unpreserve: Unpreserve the iommu domain that was preserved earlier.
> + * @restore: Restore the iommu domain after liveupdate.
> + *           Returns 0 on success, a negative errno on failure.
>   */
>  struct iommu_domain_ops {
>  	int (*attach_dev)(struct iommu_domain *domain, struct device *dev,
> @@ -779,6 +799,9 @@ struct iommu_domain_ops {
>  				  unsigned long quirks);
>  
>  	void (*free)(struct iommu_domain *domain);
> +	int (*preserve)(struct iommu_domain *domain, struct iommu_domain_ser *ser);
> +	void (*unpreserve)(struct iommu_domain *domain, struct iommu_domain_ser *ser);
> +	int (*restore)(struct iommu_domain *domain, struct iommu_domain_ser *ser);
>  };
>  
>  /**
> @@ -790,6 +813,8 @@ struct iommu_domain_ops {
>   * @singleton_group: Used internally for drivers that have only one group
>   * @max_pasids: number of supported PASIDs
>   * @ready: set once iommu_device_register() has completed successfully
> + * @outgoing_preserved_state: preserved iommu state of outgoing kernel for
> + * liveupdate.
>   */
>  struct iommu_device {
>  	struct list_head list;
> @@ -799,6 +824,10 @@ struct iommu_device {
>  	struct iommu_group *singleton_group;
>  	u32 max_pasids;
>  	bool ready;
> +
> +#ifdef CONFIG_IOMMU_LIVEUPDATE
> +	struct iommu_ser *outgoing_preserved_state;
> +#endif
>  };
>  
>  /**
> @@ -853,6 +882,9 @@ struct dev_iommu {
>  	u32				pci_32bit_workaround:1;
>  	u32				require_direct:1;
>  	u32				shadow_on_flush:1;
> +#ifdef CONFIG_IOMMU_LIVEUPDATE
> +	struct device_ser		*device_ser;
> +#endif
>  };
>  
>  int iommu_device_register(struct iommu_device *iommu,

Thanks,
Praan

[1] https://elixir.bootlin.com/linux/v7.0-rc3/source/kernel/liveupdate/kexec_handover.c#L1182

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 01/14] iommu: Implement IOMMU LU FLB callbacks
  2026-03-12 16:43     ` Samiullah Khawaja
@ 2026-03-12 23:43       ` Pranjal Shrivastava
  2026-03-13 16:47         ` Samiullah Khawaja
  2026-03-13 15:36       ` Pranjal Shrivastava
  1 sibling, 1 reply; 98+ messages in thread
From: Pranjal Shrivastava @ 2026-03-12 23:43 UTC (permalink / raw)
  To: Samiullah Khawaja
  Cc: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe, Robin Murphy, Kevin Tian, Alex Williamson,
	Shuah Khan, iommu, linux-kernel, kvm, Saeed Mahameed,
	Adithya Jayachandran, Parav Pandit, Leon Romanovsky, William Tu,
	Pratyush Yadav, Pasha Tatashin, David Matlack, Andrew Morton,
	Chris Li, Vipin Sharma, YiFei Zhu

On Thu, Mar 12, 2026 at 04:43:00PM +0000, Samiullah Khawaja wrote:
> On Wed, Mar 11, 2026 at 09:07:00PM +0000, Pranjal Shrivastava wrote:
> > On Tue, Feb 03, 2026 at 10:09:35PM +0000, Samiullah Khawaja wrote:
> > > Add liveupdate FLB for IOMMU state preservation. Use KHO preserve memory
> > > alloc/free helper functions to allocate memory for the IOMMU LU FLB
> > > object and the serialization structs for device, domain and iommu.
> > > 
> > > During retrieve, walk through the preserved objs nodes and restore each
> > > folio. Also recreate the FLB obj.
> > > 
> > > Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
> > > ---
> > >  drivers/iommu/Kconfig         |  11 +++
> > >  drivers/iommu/Makefile        |   1 +
> > >  drivers/iommu/liveupdate.c    | 177 ++++++++++++++++++++++++++++++++++
> > >  include/linux/iommu-lu.h      |  17 ++++
> > >  include/linux/kho/abi/iommu.h | 119 +++++++++++++++++++++++
> > >  5 files changed, 325 insertions(+)
> > >  create mode 100644 drivers/iommu/liveupdate.c
> > >  create mode 100644 include/linux/iommu-lu.h
> > >  create mode 100644 include/linux/kho/abi/iommu.h
> > > 
> > > diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
> > > index f86262b11416..fdcfbedee5ed 100644
> > > --- a/drivers/iommu/Kconfig
> > > +++ b/drivers/iommu/Kconfig
> > > @@ -11,6 +11,17 @@ config IOMMUFD_DRIVER
> > >  	bool
> > >  	default n
> > > 
> > > +config IOMMU_LIVEUPDATE
> > > +	bool "IOMMU live update state preservation support"
> > > +	depends on LIVEUPDATE && IOMMUFD
> > > +	help
> > > +	  Enable support for preserving IOMMU state across a kexec live update.
> > > +
> > > +	  This allows devices managed by iommufd to maintain their DMA mappings
> > > +	  during kexec base kernel update.
> > > +
> > > +	  If unsure, say N.
> > > +
> > 
> > I'm wondering if this should be under the if IOMMU_SUPPORT below? I
> > believe this was added here because IOMMUFD isn't under IOMMU_SUPPORT,
> > but it wouldn't make sense to "preserve" IOMMU across a liveupdate if
> > IOMMU_SUPPORT is disabled? Should we probably be move it inside the
> > if IOMMU_SUPPORT block for better organization, or at least have a depends
> > on IOMMU_SUPPORT added to it? The IOMMU_LUO still depends on the
> > IOMMU_SUPPORT infrastructure to actually function.. as we add calls
> > within core functions like dev_iommu_get etc.
> 
> Agreed. I will move it under IOMMU_SUPPORT and sort out any other
> dependencies.
> > 
> > >  menuconfig IOMMU_SUPPORT
> > >  	bool "IOMMU Hardware Support"
> > >  	depends on MMU
> > > diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
> > > index 0275821f4ef9..b3715c5a6b97 100644
> > > --- a/drivers/iommu/Makefile
> > > +++ b/drivers/iommu/Makefile
> > > @@ -15,6 +15,7 @@ obj-$(CONFIG_IOMMU_IO_PGTABLE_ARMV7S) += io-pgtable-arm-v7s.o
> > >  obj-$(CONFIG_IOMMU_IO_PGTABLE_LPAE) += io-pgtable-arm.o
> > >  obj-$(CONFIG_IOMMU_IO_PGTABLE_LPAE_KUNIT_TEST) += io-pgtable-arm-selftests.o
> > >  obj-$(CONFIG_IOMMU_IO_PGTABLE_DART) += io-pgtable-dart.o
> > > +obj-$(CONFIG_IOMMU_LIVEUPDATE) += liveupdate.o
> > >  obj-$(CONFIG_IOMMU_IOVA) += iova.o
> > >  obj-$(CONFIG_OF_IOMMU)	+= of_iommu.o
> > >  obj-$(CONFIG_MSM_IOMMU) += msm_iommu.o
> > > diff --git a/drivers/iommu/liveupdate.c b/drivers/iommu/liveupdate.c
> > > new file mode 100644
> > > index 000000000000..6189ba32ff2c
> > > --- /dev/null
> > > +++ b/drivers/iommu/liveupdate.c
> > > @@ -0,0 +1,177 @@
> > > +// SPDX-License-Identifier: GPL-2.0-only
> > > +
> > > +/*
> > > + * Copyright (C) 2025, Google LLC
> > 
> > Minor nit: 2026 OR 2025-26, here and everywhere else
> 
> Will fix in next revision.
> > 
> > > + * Author: Samiullah Khawaja <skhawaja@google.com>
> > > + */
> > > +
> > > +#define pr_fmt(fmt)    "iommu: liveupdate: " fmt
> > > +
> > > +#include <linux/kexec_handover.h>
> > > +#include <linux/liveupdate.h>
> > > +#include <linux/iommu-lu.h>
> > > +#include <linux/iommu.h>
> > > +#include <linux/errno.h>
> > > +
> > > +static void iommu_liveupdate_restore_objs(u64 next)
> > > +{
> > > +	struct iommu_objs_ser *objs;
> > > +
> > > +	while (next) {
> > > +		BUG_ON(!kho_restore_folio(next));
> > 
> > Same thing about BUG_ON [1] as mentioned below in the
> > iommu_liveupdate_flb_retrieve() function, can we consider returning an
> > error which can be checked in the caller and the error can be bubbled up
> > as -ENODATA?
> 
> Please see the explanation below on the BUG_ON.

[------- snip >8 --------]

> > > +
> > > +static int iommu_liveupdate_flb_retrieve(struct liveupdate_flb_op_args *argp)
> > > +{
> > > +	struct iommu_lu_flb_obj *obj;
> > > +	struct iommu_lu_flb_ser *ser;
> > > +
> > > +	obj = kzalloc(sizeof(*obj), GFP_ATOMIC);
> > 
> > Why does this have to be GFP_ATOMIC? IIUC, the retrieve path is
> > triggered by a userspace IOCTL in the new kernel. The system should be
> > able to sleep here? (unless we have a use-case to call this in IRQ-ctx?)
> > AFAICT, we call this under mutexes already, hence there's no situation
> > where we could sleep in a spinlock context?
> > 
> > GFP_ATOMIC creates a point of failure if the system is under memory
> > pressure. I believe we should be allowed to sleep for this allocation
> > because the "preserved" mappings still allow DMAs to go on and we're in
> > no hurry to restore the IOMMU state? I believe this could be GFP_KERNEL.
> > 

I guess we missed discussing this comment about s/GFP_ATOMIC/GFP_KERNEL?

> > > +	if (!obj)
> > > +		return -ENOMEM;
> > > +
> > > +	mutex_init(&obj->lock);
> > > +	BUG_ON(!kho_restore_folio(argp->data));
> > 
> > The use of BUG_ON in new code is heavily discouraged [1].
> > If KHO can't restore the folio for whatever reason, we can be treat it
> > as a corruption of the handover data. I believe crashing the kernel for
> > it would be an overkill?
> 
> The FLB restore is done during early boot and this has been discussed in
> past in KHO/LUO and IOMMU context also. But basically if this fails, the
> restoration of IOMMU state cannot be done, preserved devices would
> already have corrupted memory due to ongoing DMA as we didn't disable
> translation in the previous kernel. So logging an error and a BUG_ON at
> this point would be most appropriate.
> 
> Please see discussion here on this topic:
> https://lore.kernel.org/all/20251118153631.GB90703@nvidia.com/

I see.. so a failure here would mean an entire VM tear-down?

> > 
> > Can we consider returning a graceful failure like -ENODATA or something?
> > BUG_ON would instantly cause a kernel panic without providing no
> > opportunity for the system to log the failure or attempt a graceful
> > teardown of the 'preserved' mapping.
> 
> I will update this by adding a comment and also log an error.

Ack. Thanks

[ ---- snip >8 -----]

> > > +enum iommu_lu_type {
> > > +	IOMMU_INVALID,
> > > +	IOMMU_INTEL,
> > > +};
> > > +
> > > +struct iommu_obj_ser {
> > > +	u32 idx;
> > > +	u32 ref_count;
> > > +	u32 deleted:1;
> > > +	u32 incoming:1;
> > > +} __packed;
> > > +
> > > +struct iommu_domain_ser {
> > > +	struct iommu_obj_ser obj;
> > > +	u64 top_table;
> > > +	u64 top_level;
> > > +	struct iommu_domain *restored_domain;
> > > +} __packed;
> > > +
> > > +struct device_domain_iommu_ser {
> > > +	u32 did;

Nit: `did` sounds Intel-specific; we can either call it something more
generic or turn it into a union when we support other archs in the
future.
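
As a sketch of what the union variant could look like (the non-Intel
field name is purely hypothetical; only `did` exists in the patch
today):

```c
#include <stdint.h>

/* Hypothetical arch-neutral layout for device_domain_iommu_ser. */
struct device_domain_iommu_ser {
	union {
		uint32_t did;	/* Intel VT-d domain id */
		uint32_t asid;	/* placeholder for a future arch, e.g. an SMMU ASID */
	};
	uint64_t domain_phys;
	uint64_t iommu_phys;
} __attribute__((packed));
```

Since both union members are u32, this keeps the same packed wire size
as the current layout, so it wouldn't perturb the serialized ABI.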

> > > +	u64 domain_phys;
> > > +	u64 iommu_phys;
> > > +} __packed;
> > > +
> > > +struct device_ser {
> > > +	struct iommu_obj_ser obj;
> > > +	u64 token;
> > > +	u32 devid;
> > > +	u32 pci_domain;
> > > +	struct device_domain_iommu_ser domain_iommu_ser;
> > > +	enum iommu_lu_type type;
> > > +} __packed;
> > > +
> > > +struct iommu_intel_ser {
> > > +	u64 phys_addr;
> > > +	u64 root_table;
> > > +} __packed;
> > > +
> > > +struct iommu_ser {
> > > +	struct iommu_obj_ser obj;
> > > +	u64 token;
> > > +	enum iommu_lu_type type;
> > > +	union {
> > > +		struct iommu_intel_ser intel;
> > > +	};
> > > +} __packed;
> > > +
> > > +struct iommu_objs_ser {
> > > +	u64 next_objs;
> > > +	u64 nr_objs;
> > > +} __packed;
> > > +
> > > +struct iommus_ser {
> > > +	struct iommu_objs_ser objs;
> > > +	struct iommu_ser iommus[];
> > > +} __packed;
> > > +
> > > +struct iommu_domains_ser {
> > > +	struct iommu_objs_ser objs;
> > > +	struct iommu_domain_ser iommu_domains[];
> > > +} __packed;
> > > +
> > > +struct devices_ser {
> > > +	struct iommu_objs_ser objs;
> > > +	struct device_ser devices[];
> > > +} __packed;

I have a bone to pick here about the naming LOL, the names are kinda
confusing and make the code unreadable. I had to keep revisiting this
patch while looking at the subsequent ones to understand what's going
on..

One suggestion is to add a graphic that could help understand the layout
like:

----------------------------------------------------------------------
[ PAGE START ]                                                       |
----------------------------------------------------------------------
| iommu_objs_ser (The Page Header)                                   |
|   - next_objs: 0x0000 (End of the page-chain)                      |
|   - nr_objs: 2                                                     |
----------------------------------------------------------------------
| ITEM 0: iommu_domain_ser                                           |
|   [ iommu_obj_ser (The entry header) ]                             |
|     - idx: 0                                                       |
|     - ref_count: 1                                                 |
|     - deleted: 0                                                   |
|   [ Domain Data ]                                                  |
----------------------------------------------------------------------
| ITEM 1: iommu_domain_ser                                           |
|   [ iommu_obj_ser (The entry header) ]                             |
|     - idx: 1                                                       |
|     - ref_count: 1                                                 |
|     - deleted: 0                                                   |
|   [ Domain Data ]                                                  |
----------------------------------------------------------------------
| ... (Empty space for more domains) ...                             |
|                                                                    |
----------------------------------------------------------------------
[ PAGE END ]                                                         |
----------------------------------------------------------------------

Additionally, a few naming suggestions here:

1. struct iommu_obj_ser -> struct iommu_ser_entry_hdr
2. struct iommu_objs_ser -> struct iommu_ser_page_hdr
3. struct iommu_domains_ser -> struct iommu_ser_domain_page

This makes things clearer:

struct iommu_ser_page_hdr {
    u64 next_page_phys;
    u64 entry_count;
} __packed;

/* The Container Page */
struct iommu_ser_domain_page {
    struct iommu_ser_page_hdr hdr;
    struct iommu_ser_domain domain_entries[];
} __packed;

Similarly, something like:

4. struct devices_ser -> struct iommu_ser_device_page
5. struct iommu_lu_flb_ser -> struct iommu_flb_metadata
6. struct iommu_lu_flb_obj -> struct iommu_flb_ctx

struct iommu_flb_ctx {
	struct mutex lock;
	struct iommu_flb_metadata *cookie;

	struct iommu_ser_domain_page    *curr_domains_page;
	struct iommu_ser_iommu_ctx_page *curr_iommu_ctx_page;
	struct iommu_ser_device_page    *curr_devices_page;
} __packed;

Makes things slightly more readable.

> > > +
> > > +#define MAX_IOMMU_SERS ((PAGE_SIZE - sizeof(struct iommus_ser)) / sizeof(struct iommu_ser))

Nit: For clarity, can we consider adding another set of braces:

+#define MAX_IOMMU_SERS ((PAGE_SIZE - (sizeof(struct iommus_ser))) / (sizeof(struct iommu_ser)))

> > > +#define MAX_IOMMU_DOMAIN_SERS \
> > > +		((PAGE_SIZE - sizeof(struct iommu_domains_ser)) / sizeof(struct iommu_domain_ser))
> > > +#define MAX_DEVICE_SERS ((PAGE_SIZE - sizeof(struct devices_ser)) / sizeof(struct device_ser))
> > > +
> > > +struct iommu_lu_flb_ser {
> > > +	u64 iommus_phys;
> > > +	u64 nr_iommus;
> > > +	u64 iommu_domains_phys;
> > > +	u64 nr_domains;
> > > +	u64 devices_phys;
> > > +	u64 nr_devices;
> > > +} __packed;
> > > +
> > > +struct iommu_lu_flb_obj {
> > > +	struct mutex lock;
> > > +	struct iommu_lu_flb_ser *ser;
> > > +
> > > +	struct iommu_domains_ser *iommu_domains;
> > > +	struct iommus_ser *iommus;
> > > +	struct devices_ser *devices;
> > > +} __packed;
> > > +
> > 
> > Please let's add some comments describing the structs & their members
> > here like we have in memfd [2]. This should be descriptive for the user.
> > For example:
> 
> I agree. Will add comments for these and others.
> > 
> > +/**
> > + * struct iommu_lu_flb_ser - Main serialization header for IOMMU state.
> > + * @iommus_phys:        Physical address of the first page in the IOMMU unit chain.
> > + * @nr_iommus:          Total number of hardware IOMMU units preserved.
> > + * @iommu_domains_phys: [...]
> > + * @nr_domains:         [...]
> > + * @devices_phys:       [...]
> > + * @nr_devices:         [...]
> > + *
> > + * This structure acts as the root of the IOMMU state tree. It is hitching a ride
> > + * on the iommufd file descriptor's preservation flow.
> > + */
> > +struct iommu_lu_flb_ser {
> > +	u64 iommus_phys;
> > +	u64 nr_iommus;
> > +	u64 iommu_domains_phys;
> > +	u64 nr_domains;
> > +	u64 devices_phys;
> > +	u64 nr_devices;
> > +} __packed;
> > 
> > > +#endif /* _LINUX_KHO_ABI_IOMMU_H */
> > 
> > Thanks,
> > Praan
> > 
> > [1] https://docs.kernel.org/process/coding-style.html#use-warn-rather-than-bug
> > [2] https://elixir.bootlin.com/linux/v7.0-rc3/source/include/linux/kho/abi/memfd.h
> > 
> 
> Thanks for the review.
> 
> Sami

Thanks,
Praan 

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 01/14] iommu: Implement IOMMU LU FLB callbacks
  2026-03-12 16:43     ` Samiullah Khawaja
  2026-03-12 23:43       ` Pranjal Shrivastava
@ 2026-03-13 15:36       ` Pranjal Shrivastava
  2026-03-13 16:58         ` Samiullah Khawaja
  1 sibling, 1 reply; 98+ messages in thread
From: Pranjal Shrivastava @ 2026-03-13 15:36 UTC (permalink / raw)
  To: Samiullah Khawaja
  Cc: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe, Robin Murphy, Kevin Tian, Alex Williamson,
	Shuah Khan, iommu, linux-kernel, kvm, Saeed Mahameed,
	Adithya Jayachandran, Parav Pandit, Leon Romanovsky, William Tu,
	Pratyush Yadav, Pasha Tatashin, David Matlack, Andrew Morton,
	Chris Li, Vipin Sharma, YiFei Zhu

On Thu, Mar 12, 2026 at 04:43:00PM +0000, Samiullah Khawaja wrote:
> On Wed, Mar 11, 2026 at 09:07:00PM +0000, Pranjal Shrivastava wrote:
> > On Tue, Feb 03, 2026 at 10:09:35PM +0000, Samiullah Khawaja wrote:
> > > Add liveupdate FLB for IOMMU state preservation. Use KHO preserve memory
> > > alloc/free helper functions to allocate memory for the IOMMU LU FLB
> > > object and the serialization structs for device, domain and iommu.
> > > 
> > > During retrieve, walk through the preserved objs nodes and restore each
> > > folio. Also recreate the FLB obj.
> > > 
> > > Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
> > > ---
> > >  drivers/iommu/Kconfig         |  11 +++
> > >  drivers/iommu/Makefile        |   1 +
> > >  drivers/iommu/liveupdate.c    | 177 ++++++++++++++++++++++++++++++++++
> > >  include/linux/iommu-lu.h      |  17 ++++
> > >  include/linux/kho/abi/iommu.h | 119 +++++++++++++++++++++++
> > >  5 files changed, 325 insertions(+)
> > >  create mode 100644 drivers/iommu/liveupdate.c
> > >  create mode 100644 include/linux/iommu-lu.h
> > >  create mode 100644 include/linux/kho/abi/iommu.h
> > > 
> > > diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
> > > index f86262b11416..fdcfbedee5ed 100644
> > > --- a/drivers/iommu/Kconfig
> > > +++ b/drivers/iommu/Kconfig
> > > @@ -11,6 +11,17 @@ config IOMMUFD_DRIVER
> > >  	bool
> > >  	default n
> > > 

[ snip ---- >8 -----]

> > 
> > > +enum iommu_lu_type {
> > > +	IOMMU_INVALID,
> > > +	IOMMU_INTEL,
> > > +};
> > > +
> > > +struct iommu_obj_ser {
> > > +	u32 idx;
> > > +	u32 ref_count;
> > > +	u32 deleted:1;
> > > +	u32 incoming:1;
> > > +} __packed;
> > > +
> > > +struct iommu_domain_ser {
> > > +	struct iommu_obj_ser obj;
> > > +	u64 top_table;
> > > +	u64 top_level;
> > > +	struct iommu_domain *restored_domain;
> > > +} __packed;
> > > +
> > > +struct device_domain_iommu_ser {
> > > +	u32 did;
> > > +	u64 domain_phys;
> > > +	u64 iommu_phys;
> > > +} __packed;
> > > +
> > > +struct device_ser {
> > > +	struct iommu_obj_ser obj;
> > > +	u64 token;
> > > +	u32 devid;
> > > +	u32 pci_domain;
> > > +	struct device_domain_iommu_ser domain_iommu_ser;
> > > +	enum iommu_lu_type type;
> > > +} __packed;
> > > +
> > > +struct iommu_intel_ser {
> > > +	u64 phys_addr;
> > > +	u64 root_table;
> > > +} __packed;
> > > +

One more thing here, let's add the "intel" stuff with the intel patches

Thanks,
Praan

> > > +struct iommu_ser {
> > > +	struct iommu_obj_ser obj;
> > > +	u64 token;
> > > +	enum iommu_lu_type type;
> > > +	union {
> > > +		struct iommu_intel_ser intel;
> > > +	};
> > > +} __packed;
> > > +
> > > +struct iommu_objs_ser {
> > > +	u64 next_objs;
> > > +	u64 nr_objs;
> > > +} __packed;
> > > +
> > > +struct iommus_ser {
> > > +	struct iommu_objs_ser objs;
> > > +	struct iommu_ser iommus[];
> > > +} __packed;
> > > +
> > > +struct iommu_domains_ser {
> > > +	struct iommu_objs_ser objs;
> > > +	struct iommu_domain_ser iommu_domains[];
> > > +} __packed;
> > > +
> > > +struct devices_ser {
> > > +	struct iommu_objs_ser objs;
> > > +	struct device_ser devices[];
> > > +} __packed;
> > > +

Thanks,
Praan

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 01/14] iommu: Implement IOMMU LU FLB callbacks
  2026-03-12 23:43       ` Pranjal Shrivastava
@ 2026-03-13 16:47         ` Samiullah Khawaja
  0 siblings, 0 replies; 98+ messages in thread
From: Samiullah Khawaja @ 2026-03-13 16:47 UTC (permalink / raw)
  To: Pranjal Shrivastava
  Cc: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe, Robin Murphy, Kevin Tian, Alex Williamson,
	Shuah Khan, iommu, linux-kernel, kvm, Saeed Mahameed,
	Adithya Jayachandran, Parav Pandit, Leon Romanovsky, William Tu,
	Pratyush Yadav, Pasha Tatashin, David Matlack, Andrew Morton,
	Chris Li, Vipin Sharma, YiFei Zhu

On Thu, Mar 12, 2026 at 11:43:52PM +0000, Pranjal Shrivastava wrote:
>On Thu, Mar 12, 2026 at 04:43:00PM +0000, Samiullah Khawaja wrote:
>> On Wed, Mar 11, 2026 at 09:07:00PM +0000, Pranjal Shrivastava wrote:
>> > On Tue, Feb 03, 2026 at 10:09:35PM +0000, Samiullah Khawaja wrote:
>> > > Add liveupdate FLB for IOMMU state preservation. Use KHO preserve memory
>> > > alloc/free helper functions to allocate memory for the IOMMU LU FLB
>> > > object and the serialization structs for device, domain and iommu.
>> > >
>> > > During retrieve, walk through the preserved objs nodes and restore each
>> > > folio. Also recreate the FLB obj.
>> > >
>> > > Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
>> > > ---
>> > >  drivers/iommu/Kconfig         |  11 +++
>> > >  drivers/iommu/Makefile        |   1 +
>> > >  drivers/iommu/liveupdate.c    | 177 ++++++++++++++++++++++++++++++++++
>> > >  include/linux/iommu-lu.h      |  17 ++++
>> > >  include/linux/kho/abi/iommu.h | 119 +++++++++++++++++++++++
>> > >  5 files changed, 325 insertions(+)
>> > >  create mode 100644 drivers/iommu/liveupdate.c
>> > >  create mode 100644 include/linux/iommu-lu.h
>> > >  create mode 100644 include/linux/kho/abi/iommu.h
>> > >
>> > > diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
>> > > index f86262b11416..fdcfbedee5ed 100644
>> > > --- a/drivers/iommu/Kconfig
>> > > +++ b/drivers/iommu/Kconfig
>> > > @@ -11,6 +11,17 @@ config IOMMUFD_DRIVER
>> > >  	bool
>> > >  	default n
>> > >
>> > > +config IOMMU_LIVEUPDATE
>> > > +	bool "IOMMU live update state preservation support"
>> > > +	depends on LIVEUPDATE && IOMMUFD
>> > > +	help
>> > > +	  Enable support for preserving IOMMU state across a kexec live update.
>> > > +
>> > > +	  This allows devices managed by iommufd to maintain their DMA mappings
>> > > +	  during kexec base kernel update.
>> > > +
>> > > +	  If unsure, say N.
>> > > +
>> >
>> > I'm wondering if this should be under the if IOMMU_SUPPORT below? I
>> > believe this was added here because IOMMUFD isn't under IOMMU_SUPPORT,
>> > but it wouldn't make sense to "preserve" IOMMU across a liveupdate if
>> > IOMMU_SUPPORT is disabled? Should we move it inside the
>> > if IOMMU_SUPPORT block for better organization, or at least add a depends
>> > on IOMMU_SUPPORT to it? The IOMMU_LUO code still depends on the
>> > IOMMU_SUPPORT infrastructure to actually function.. as we add calls
>> > within core functions like dev_iommu_get etc.
>>
>> Agreed. I will move it under IOMMU_SUPPORT and sort out any other
>> dependencies.
>> >
>> > >  menuconfig IOMMU_SUPPORT
>> > >  	bool "IOMMU Hardware Support"
>> > >  	depends on MMU
>> > > diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
>> > > index 0275821f4ef9..b3715c5a6b97 100644
>> > > --- a/drivers/iommu/Makefile
>> > > +++ b/drivers/iommu/Makefile
>> > > @@ -15,6 +15,7 @@ obj-$(CONFIG_IOMMU_IO_PGTABLE_ARMV7S) += io-pgtable-arm-v7s.o
>> > >  obj-$(CONFIG_IOMMU_IO_PGTABLE_LPAE) += io-pgtable-arm.o
>> > >  obj-$(CONFIG_IOMMU_IO_PGTABLE_LPAE_KUNIT_TEST) += io-pgtable-arm-selftests.o
>> > >  obj-$(CONFIG_IOMMU_IO_PGTABLE_DART) += io-pgtable-dart.o
>> > > +obj-$(CONFIG_IOMMU_LIVEUPDATE) += liveupdate.o
>> > >  obj-$(CONFIG_IOMMU_IOVA) += iova.o
>> > >  obj-$(CONFIG_OF_IOMMU)	+= of_iommu.o
>> > >  obj-$(CONFIG_MSM_IOMMU) += msm_iommu.o
>> > > diff --git a/drivers/iommu/liveupdate.c b/drivers/iommu/liveupdate.c
>> > > new file mode 100644
>> > > index 000000000000..6189ba32ff2c
>> > > --- /dev/null
>> > > +++ b/drivers/iommu/liveupdate.c
>> > > @@ -0,0 +1,177 @@
>> > > +// SPDX-License-Identifier: GPL-2.0-only
>> > > +
>> > > +/*
>> > > + * Copyright (C) 2025, Google LLC
>> >
>> > Minor nit: 2026 OR 2025-26, here and everywhere else
>>
>> Will fix in next revision.
>> >
>> > > + * Author: Samiullah Khawaja <skhawaja@google.com>
>> > > + */
>> > > +
>> > > +#define pr_fmt(fmt)    "iommu: liveupdate: " fmt
>> > > +
>> > > +#include <linux/kexec_handover.h>
>> > > +#include <linux/liveupdate.h>
>> > > +#include <linux/iommu-lu.h>
>> > > +#include <linux/iommu.h>
>> > > +#include <linux/errno.h>
>> > > +
>> > > +static void iommu_liveupdate_restore_objs(u64 next)
>> > > +{
>> > > +	struct iommu_objs_ser *objs;
>> > > +
>> > > +	while (next) {
>> > > +		BUG_ON(!kho_restore_folio(next));
>> >
>> > Same thing about BUG_ON [1] as mentioned below in the
>> > iommu_liveupdate_flb_retrieve() function, can we consider returning an
>> > error which can be checked in the caller and the error can be bubbled up
>> > as -ENODATA?
>>
>> Please see the explanation below on the BUG_ON.
>
>[------- snip >8 --------]
>
>> > > +
>> > > +static int iommu_liveupdate_flb_retrieve(struct liveupdate_flb_op_args *argp)
>> > > +{
>> > > +	struct iommu_lu_flb_obj *obj;
>> > > +	struct iommu_lu_flb_ser *ser;
>> > > +
>> > > +	obj = kzalloc(sizeof(*obj), GFP_ATOMIC);
>> >
>> > Why does this have to be GFP_ATOMIC? IIUC, the retrieve path is
>> > triggered by a userspace IOCTL in the new kernel. The system should be
>> > able to sleep here? (unless we have a use-case to call this in IRQ-ctx?)
>> > AFAICT, we call this under mutexes already, hence there's no path
>> > where we'd be holding a spinlock and unable to sleep?
>> >
>> > GFP_ATOMIC creates a point of failure if the system is under memory
>> > pressure. I believe we should be allowed to sleep for this allocation
>> > because the "preserved" mappings still allow DMAs to go on and we're in
>> > no hurry to restore the IOMMU state? I believe this could be GFP_KERNEL.
>> >
>
>I guess we missed discussing this comment about s/GFP_ATOMIC/GFP_KERNEL?

I think this can be updated to GFP_KERNEL. My original reason for
GFP_ATOMIC was that flb_retrieve is also called during boot in driver
init, and VT-d seems to use GFP_ATOMIC for root table initialization.
Looking at those code paths, none of them run in IRQ context. The same
goes for other drivers.

I will update this.
>
>> > > +	if (!obj)
>> > > +		return -ENOMEM;
>> > > +
>> > > +	mutex_init(&obj->lock);
>> > > +	BUG_ON(!kho_restore_folio(argp->data));
>> >
>> > The use of BUG_ON in new code is heavily discouraged [1].
>> > If KHO can't restore the folio for whatever reason, we can treat it
>> > as a corruption of the handover data. I believe crashing the kernel for
>> > it would be overkill?
>>
>> The FLB restore is done during early boot, and this has been discussed
>> in the past in KHO/LUO and IOMMU contexts as well. Basically, if this
>> fails, the restoration of IOMMU state cannot be done; preserved devices
>> would already have corrupted memory due to ongoing DMA, since we didn't
>> disable translation in the previous kernel. So logging an error and a
>> BUG_ON at this point would be most appropriate.
>>
>> Please see discussion here on this topic:
>> https://lore.kernel.org/all/20251118153631.GB90703@nvidia.com/
>
>I see.. so a failure here would mean an entire VM tear-down?

Yes, a corrupted FLB/KHO would make the incoming kernel unable to
restore IOMMU state. This indicates memory corruption, and the devices
might already have corrupted kernel and/or VM memory.
>
>> >
>> > Can we consider returning a graceful failure like -ENODATA or something?
>> > BUG_ON would instantly cause a kernel panic without providing any
>> > opportunity for the system to log the failure or attempt a graceful
>> > teardown of the 'preserved' mapping.
>>
>> I will update this by adding a comment and also log an error.
>
>Ack. Thanks
>
>[ ---- snip >8 -----]
>
>> > > +enum iommu_lu_type {
>> > > +	IOMMU_INVALID,
>> > > +	IOMMU_INTEL,
>> > > +};
>> > > +
>> > > +struct iommu_obj_ser {
>> > > +	u32 idx;
>> > > +	u32 ref_count;
>> > > +	u32 deleted:1;
>> > > +	u32 incoming:1;
>> > > +} __packed;
>> > > +
>> > > +struct iommu_domain_ser {
>> > > +	struct iommu_obj_ser obj;
>> > > +	u64 top_table;
>> > > +	u64 top_level;
>> > > +	struct iommu_domain *restored_domain;
>> > > +} __packed;
>> > > +
>> > > +struct device_domain_iommu_ser {
>> > > +	u32 did;
>
>Nit: `did` sounds Intel-specific; we can either call it something
>generic or make it into a union when we support other archs in the
>future.

Yes, I will change this to a union like done at other places.
>
>> > > +	u64 domain_phys;
>> > > +	u64 iommu_phys;
>> > > +} __packed;
>> > > +
>> > > +struct device_ser {
>> > > +	struct iommu_obj_ser obj;
>> > > +	u64 token;
>> > > +	u32 devid;
>> > > +	u32 pci_domain;
>> > > +	struct device_domain_iommu_ser domain_iommu_ser;
>> > > +	enum iommu_lu_type type;
>> > > +} __packed;
>> > > +
>> > > +struct iommu_intel_ser {
>> > > +	u64 phys_addr;
>> > > +	u64 root_table;
>> > > +} __packed;
>> > > +
>> > > +struct iommu_ser {
>> > > +	struct iommu_obj_ser obj;
>> > > +	u64 token;
>> > > +	enum iommu_lu_type type;
>> > > +	union {
>> > > +		struct iommu_intel_ser intel;
>> > > +	};
>> > > +} __packed;
>> > > +
>> > > +struct iommu_objs_ser {
>> > > +	u64 next_objs;
>> > > +	u64 nr_objs;
>> > > +} __packed;
>> > > +
>> > > +struct iommus_ser {
>> > > +	struct iommu_objs_ser objs;
>> > > +	struct iommu_ser iommus[];
>> > > +} __packed;
>> > > +
>> > > +struct iommu_domains_ser {
>> > > +	struct iommu_objs_ser objs;
>> > > +	struct iommu_domain_ser iommu_domains[];
>> > > +} __packed;
>> > > +
>> > > +struct devices_ser {
>> > > +	struct iommu_objs_ser objs;
>> > > +	struct device_ser devices[];
>> > > +} __packed;
>
>I have a bone to pick here about the naming LOL, the names are kinda

LOL
>confusing and make the code unreadable, I had to keep re-visiting this
>patch while looking at the subsequent ones to understand what's going
>on..
>
>One suggestion is to add a graphic that could help understand the layout
>like:
>
>----------------------------------------------------------------------
>[ PAGE START ]                                                       |
>----------------------------------------------------------------------
>| iommu_objs_ser (The Page Header)                                   |
>|   - next_objs: 0x0000 (End of the page-chain)                      |
>|   - nr_objs: 2                                                     |
>----------------------------------------------------------------------
>| ITEM 0: iommu_domain_ser                                           |
>|   [ iommu_obj_ser (The entry header) ]                             |
>|     - idx: 0                                                       |
>|     - ref_count: 1                                                 |
>|     - deleted: 0                                                   |
>|   [ Domain Data ]                                                  |
>----------------------------------------------------------------------
>| ITEM 1: iommu_domain_ser                                           |
>|   [ iommu_obj_ser (The entry header) ]                             |
>|     - idx: 1                                                       |
>|     - ref_count: 1                                                 |
>|     - deleted: 0                                                   |
>|   [ Domain Data ]                                                  |
>----------------------------------------------------------------------
>| ... (Empty space for more domains) ...                             |
>|                                                                    |
>----------------------------------------------------------------------
>[ PAGE END ]                                                         |
>----------------------------------------------------------------------

I think this is a good idea. I can add a diagram like this, but maybe
just for objs as all the preserved state is built on top of it.
>
>Additionally, a few naming suggestions here:
>
>1. struct iommu_obj_ser -> struct iommu_ser_entry_hdr
>2. struct iommu_objs_ser -> struct iommu_ser_page_hdr
>3. struct iommu_domains_ser -> struct iommu_ser_domain_page

The hdr and page stuff in the ser name will be super confusing. Those
details are part of the objs->obj construction and I think with the
diagram you suggested above and some comments with explanation, the
objs->obj can be demystified.
>
>This makes things clearer:
>
>struct iommu_ser_page_hdr {
>    u64 next_page_phys;
>    u64 entry_count;
>} __packed;
>
>/* The Container Page */
>struct iommu_ser_domain_page {
>    struct iommu_ser_page_hdr hdr;
>    struct iommu_ser_domain domain_entries[];
>} __packed;
>
>Similarly, something like:
>
>4. struct devices_ser -> struct iommu_ser_device_page
>5. struct iommu_lu_flb_ser -> struct iommu_flb_metadata
>6. struct iommu_lu_flb_obj -> struct iommu_flb_ctx
>
>struct iommu_flb_ctx {
>	struct mutex lock;
>	struct iommu_flb_metadata *cookie;
>
>	struct iommu_ser_domain_page    *curr_domains_page;
>	struct iommu_ser_iommu_ctx_page *curr_iommu_ctx_page;
>	struct iommu_ser_device_page    *curr_devices_page;
>} __packed;
>
>Makes things slightly more readable.
>
>> > > +
>> > > +#define MAX_IOMMU_SERS ((PAGE_SIZE - sizeof(struct iommus_ser)) / sizeof(struct iommu_ser))
>
>Nit: For clarity, can we consider adding another set of braces:
>
>+#define MAX_IOMMU_SERS ((PAGE_SIZE - (sizeof(struct iommus_ser))) / (sizeof(struct iommu_ser)))

Agreed.
>
>> > > +#define MAX_IOMMU_DOMAIN_SERS \
>> > > +		((PAGE_SIZE - sizeof(struct iommu_domains_ser)) / sizeof(struct iommu_domain_ser))
>> > > +#define MAX_DEVICE_SERS ((PAGE_SIZE - sizeof(struct devices_ser)) / sizeof(struct device_ser))
>> > > +
>> > > +struct iommu_lu_flb_ser {
>> > > +	u64 iommus_phys;
>> > > +	u64 nr_iommus;
>> > > +	u64 iommu_domains_phys;
>> > > +	u64 nr_domains;
>> > > +	u64 devices_phys;
>> > > +	u64 nr_devices;
>> > > +} __packed;
>> > > +
>> > > +struct iommu_lu_flb_obj {
>> > > +	struct mutex lock;
>> > > +	struct iommu_lu_flb_ser *ser;
>> > > +
>> > > +	struct iommu_domains_ser *iommu_domains;
>> > > +	struct iommus_ser *iommus;
>> > > +	struct devices_ser *devices;
>> > > +} __packed;
>> > > +
>> >
>> > Please let's add some comments describing the structs & their members
>> > here like we have in memfd [2]. This should be descriptive for the user.
>> > For example:
>>
>> I agree. Will add comments for these and others.
>> >
>> > +/**
>> > + * struct iommu_lu_flb_ser - Main serialization header for IOMMU state.
>> > + * @iommus_phys:        Physical address of the first page in the IOMMU unit chain.
>> > + * @nr_iommus:          Total number of hardware IOMMU units preserved.
>> > + * @iommu_domains_phys: [...]
>> > + * @nr_domains:         [...]
>> > + * @devices_phys:       [...]
>> > + * @nr_devices:         [...]
>> > + *
>> > + * This structure acts as the root of the IOMMU state tree. It is hitching a ride
>> > + * on the iommufd file descriptor's preservation flow.
>> > + */
>> > +struct iommu_lu_flb_ser {
>> > +	u64 iommus_phys;
>> > +	u64 nr_iommus;
>> > +	u64 iommu_domains_phys;
>> > +	u64 nr_domains;
>> > +	u64 devices_phys;
>> > +	u64 nr_devices;
>> > +} __packed;
>> >
>> > > +#endif /* _LINUX_KHO_ABI_IOMMU_H */
>> >
>> > Thanks,
>> > Praan
>> >
>> > [1] https://docs.kernel.org/process/coding-style.html#use-warn-rather-than-bug
>> > [2] https://elixir.bootlin.com/linux/v7.0-rc3/source/include/linux/kho/abi/memfd.h
>> >
>>
>> Thanks for the review.
>>
>> Sami
>
>Thanks,
>Praan

Thanks,
Sami

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 01/14] iommu: Implement IOMMU LU FLB callbacks
  2026-03-13 15:36       ` Pranjal Shrivastava
@ 2026-03-13 16:58         ` Samiullah Khawaja
  0 siblings, 0 replies; 98+ messages in thread
From: Samiullah Khawaja @ 2026-03-13 16:58 UTC (permalink / raw)
  To: Pranjal Shrivastava
  Cc: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe, Robin Murphy, Kevin Tian, Alex Williamson,
	Shuah Khan, iommu, linux-kernel, kvm, Saeed Mahameed,
	Adithya Jayachandran, Parav Pandit, Leon Romanovsky, William Tu,
	Pratyush Yadav, Pasha Tatashin, David Matlack, Andrew Morton,
	Chris Li, Vipin Sharma, YiFei Zhu

On Fri, Mar 13, 2026 at 03:36:33PM +0000, Pranjal Shrivastava wrote:
>On Thu, Mar 12, 2026 at 04:43:00PM +0000, Samiullah Khawaja wrote:
>> On Wed, Mar 11, 2026 at 09:07:00PM +0000, Pranjal Shrivastava wrote:
>> > On Tue, Feb 03, 2026 at 10:09:35PM +0000, Samiullah Khawaja wrote:
>> > > Add liveupdate FLB for IOMMU state preservation. Use KHO preserve memory
>> > > alloc/free helper functions to allocate memory for the IOMMU LU FLB
>> > > object and the serialization structs for device, domain and iommu.
>> > >
>> > > During retrieve, walk through the preserved objs nodes and restore each
>> > > folio. Also recreate the FLB obj.
>> > >
>> > > Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
>> > > ---
>> > >  drivers/iommu/Kconfig         |  11 +++
>> > >  drivers/iommu/Makefile        |   1 +
>> > >  drivers/iommu/liveupdate.c    | 177 ++++++++++++++++++++++++++++++++++
>> > >  include/linux/iommu-lu.h      |  17 ++++
>> > >  include/linux/kho/abi/iommu.h | 119 +++++++++++++++++++++++
>> > >  5 files changed, 325 insertions(+)
>> > >  create mode 100644 drivers/iommu/liveupdate.c
>> > >  create mode 100644 include/linux/iommu-lu.h
>> > >  create mode 100644 include/linux/kho/abi/iommu.h
>> > >
>> > > diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
>> > > index f86262b11416..fdcfbedee5ed 100644
>> > > --- a/drivers/iommu/Kconfig
>> > > +++ b/drivers/iommu/Kconfig
>> > > @@ -11,6 +11,17 @@ config IOMMUFD_DRIVER
>> > >  	bool
>> > >  	default n
>> > >
>
>[ snip ---- >8 -----]
>
>> >
>> > > +enum iommu_lu_type {
>> > > +	IOMMU_INVALID,
>> > > +	IOMMU_INTEL,
>> > > +};
>> > > +
>> > > +struct iommu_obj_ser {
>> > > +	u32 idx;
>> > > +	u32 ref_count;
>> > > +	u32 deleted:1;
>> > > +	u32 incoming:1;
>> > > +} __packed;
>> > > +
>> > > +struct iommu_domain_ser {
>> > > +	struct iommu_obj_ser obj;
>> > > +	u64 top_table;
>> > > +	u64 top_level;
>> > > +	struct iommu_domain *restored_domain;
>> > > +} __packed;
>> > > +
>> > > +struct device_domain_iommu_ser {
>> > > +	u32 did;
>> > > +	u64 domain_phys;
>> > > +	u64 iommu_phys;
>> > > +} __packed;
>> > > +
>> > > +struct device_ser {
>> > > +	struct iommu_obj_ser obj;
>> > > +	u64 token;
>> > > +	u32 devid;
>> > > +	u32 pci_domain;
>> > > +	struct device_domain_iommu_ser domain_iommu_ser;
>> > > +	enum iommu_lu_type type;
>> > > +} __packed;
>> > > +
>> > > +struct iommu_intel_ser {
>> > > +	u64 phys_addr;
>> > > +	u64 root_table;
>> > > +} __packed;
>> > > +
>
>One more thing here, let's add the "intel" stuff with the intel patches

Agreed. Will move these to intel specific patches.
>
>Thanks,
>Praan
>
>> > > +struct iommu_ser {
>> > > +	struct iommu_obj_ser obj;
>> > > +	u64 token;
>> > > +	enum iommu_lu_type type;
>> > > +	union {
>> > > +		struct iommu_intel_ser intel;
>> > > +	};
>> > > +} __packed;
>> > > +
>> > > +struct iommu_objs_ser {
>> > > +	u64 next_objs;
>> > > +	u64 nr_objs;
>> > > +} __packed;
>> > > +
>> > > +struct iommus_ser {
>> > > +	struct iommu_objs_ser objs;
>> > > +	struct iommu_ser iommus[];
>> > > +} __packed;
>> > > +
>> > > +struct iommu_domains_ser {
>> > > +	struct iommu_objs_ser objs;
>> > > +	struct iommu_domain_ser iommu_domains[];
>> > > +} __packed;
>> > > +
>> > > +struct devices_ser {
>> > > +	struct iommu_objs_ser objs;
>> > > +	struct device_ser devices[];
>> > > +} __packed;
>> > > +
>
>Thanks,
>Praan

Thanks,
Sami

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 02/14] iommu: Implement IOMMU core liveupdate skeleton
  2026-03-12 23:10   ` Pranjal Shrivastava
@ 2026-03-13 18:42     ` Samiullah Khawaja
  2026-03-17 20:09       ` Pranjal Shrivastava
  0 siblings, 1 reply; 98+ messages in thread
From: Samiullah Khawaja @ 2026-03-13 18:42 UTC (permalink / raw)
  To: Pranjal Shrivastava
  Cc: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe, Robin Murphy, Kevin Tian, Alex Williamson,
	Shuah Khan, iommu, linux-kernel, kvm, Saeed Mahameed,
	Adithya Jayachandran, Parav Pandit, Leon Romanovsky, William Tu,
	Pratyush Yadav, Pasha Tatashin, David Matlack, Andrew Morton,
	Chris Li, Vipin Sharma, YiFei Zhu

On Thu, Mar 12, 2026 at 11:10:36PM +0000, Pranjal Shrivastava wrote:
>On Tue, Feb 03, 2026 at 10:09:36PM +0000, Samiullah Khawaja wrote:
>> Add IOMMU domain ops that can be implemented by the IOMMU drivers if
>> they support IOMMU domain preservation across liveupdate. The new IOMMU
>> domain preserve, unpreserve and restore APIs call these ops to perform
>> respective live update operations.
>>
>> Similarly add IOMMU ops to preserve/unpreserve a device. These can be
>> implemented by the IOMMU drivers that support preservation of devices
>> that have their IOMMU domains preserved. During device preservation the
>> state of the associated IOMMU is also preserved. The device can only be
>> preserved if the attached iommu domain is preserved and the associated
>> iommu supports preservation.
>>
>> The preserved state of the device and IOMMU needs to be fetched during
>> shutdown and boot in the next kernel. Add APIs that can be used to fetch
>> the preserved state of a device and IOMMU. The APIs will only be used
>> during shutdown and after liveupdate so no locking needed.
>>
>> Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
>> ---
>>  drivers/iommu/iommu.c      |   3 +
>>  drivers/iommu/liveupdate.c | 326 +++++++++++++++++++++++++++++++++++++
>>  include/linux/iommu-lu.h   | 119 ++++++++++++++
>>  include/linux/iommu.h      |  32 ++++
>>  4 files changed, 480 insertions(+)
>>
>> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
>> index 4926a43118e6..c0632cb5b570 100644
>> --- a/drivers/iommu/iommu.c
>> +++ b/drivers/iommu/iommu.c
>> @@ -389,6 +389,9 @@ static struct dev_iommu *dev_iommu_get(struct device *dev)
>>
>>  	mutex_init(&param->lock);
>>  	dev->iommu = param;
>> +#ifdef CONFIG_IOMMU_LIVEUPDATE
>> +	dev->iommu->device_ser = NULL;
>> +#endif
>>  	return param;
>>  }
>>
>> diff --git a/drivers/iommu/liveupdate.c b/drivers/iommu/liveupdate.c
>> index 6189ba32ff2c..83eb609b3fd7 100644
>> --- a/drivers/iommu/liveupdate.c
>> +++ b/drivers/iommu/liveupdate.c
>> @@ -11,6 +11,7 @@
>>  #include <linux/liveupdate.h>
>>  #include <linux/iommu-lu.h>
>>  #include <linux/iommu.h>
>> +#include <linux/pci.h>
>>  #include <linux/errno.h>
>>
>>  static void iommu_liveupdate_restore_objs(u64 next)
>> @@ -175,3 +176,328 @@ int iommu_liveupdate_unregister_flb(struct liveupdate_file_handler *handler)
>>  	return liveupdate_unregister_flb(handler, &iommu_flb);
>>  }
>>  EXPORT_SYMBOL(iommu_liveupdate_unregister_flb);
>> +
>> +int iommu_for_each_preserved_device(iommu_preserved_device_iter_fn fn,
>> +				    void *arg)
>> +{
>> +	struct iommu_lu_flb_obj *obj;
>> +	struct devices_ser *devices;
>> +	int ret, i, idx;
>> +
>> +	ret = liveupdate_flb_get_incoming(&iommu_flb, (void **)&obj);
>> +	if (ret)
>> +		return -ENOENT;
>> +
>> +	devices = __va(obj->ser->devices_phys);
>> +	for (i = 0, idx = 0; i < obj->ser->nr_devices; ++i, ++idx) {
>> +		if (idx >= MAX_DEVICE_SERS) {
>> +			devices = __va(devices->objs.next_objs);
>> +			idx = 0;
>> +		}
>> +
>> +		if (devices->devices[idx].obj.deleted)
>> +			continue;
>> +
>> +		ret = fn(&devices->devices[idx], arg);
>> +		if (ret)
>> +			return ret;
>> +	}
>> +
>> +	return 0;
>> +}
>> +EXPORT_SYMBOL(iommu_for_each_preserved_device);
>> +
>> +static inline bool device_ser_match(struct device_ser *match,
>> +				    struct pci_dev *pdev)
>> +{
>> +	return match->devid == pci_dev_id(pdev) && match->pci_domain == pci_domain_nr(pdev->bus);
>> +}
>> +
>> +struct device_ser *iommu_get_device_preserved_data(struct device *dev)
>> +{
>> +	struct iommu_lu_flb_obj *obj;
>> +	struct devices_ser *devices;
>> +	int ret, i, idx;
>> +
>> +	if (!dev_is_pci(dev))
>> +		return NULL;
>> +
>> +	ret = liveupdate_flb_get_incoming(&iommu_flb, (void **)&obj);
>> +	if (ret)
>> +		return NULL;
>> +
>> +	devices = __va(obj->ser->devices_phys);
>> +	for (i = 0, idx = 0; i < obj->ser->nr_devices; ++i, ++idx) {
>> +		if (idx >= MAX_DEVICE_SERS) {
>> +			devices = __va(devices->objs.next_objs);
>> +			idx = 0;
>> +		}
>> +
>> +		if (devices->devices[idx].obj.deleted)
>> +			continue;
>> +
>> +		if (device_ser_match(&devices->devices[idx], to_pci_dev(dev))) {
>> +			devices->devices[idx].obj.incoming = true;
>> +			return &devices->devices[idx];
>> +		}
>> +	}
>> +
>> +	return NULL;
>> +}
>> +EXPORT_SYMBOL(iommu_get_device_preserved_data);
>> +
>> +struct iommu_ser *iommu_get_preserved_data(u64 token, enum iommu_lu_type type)
>> +{
>> +	struct iommu_lu_flb_obj *obj;
>> +	struct iommus_ser *iommus;
>> +	int ret, i, idx;
>> +
>> +	ret = liveupdate_flb_get_incoming(&iommu_flb, (void **)&obj);
>> +	if (ret)
>> +		return NULL;
>> +
>> +	iommus = __va(obj->ser->iommus_phys);
>> +	for (i = 0, idx = 0; i < obj->ser->nr_iommus; ++i, ++idx) {
>> +		if (idx >= MAX_IOMMU_SERS) {
>> +			iommus = __va(iommus->objs.next_objs);
>> +			idx = 0;
>> +		}
>> +
>> +		if (iommus->iommus[idx].obj.deleted)
>> +			continue;
>> +
>> +		if (iommus->iommus[idx].token == token &&
>> +		    iommus->iommus[idx].type == type)
>> +			return &iommus->iommus[idx];
>> +	}
>> +
>> +	return NULL;
>> +}
>> +EXPORT_SYMBOL(iommu_get_preserved_data);
>> +
>> +static int reserve_obj_ser(struct iommu_objs_ser **objs_ptr, u64 max_objs)
>
>Isn't this more of an "insert" / "populate" / write_ser_entry? We can
>rename it to something like iommu_ser_push_entry / iommu_ser_write_entry

This is reserving an object in the objects array; the object is filled
in once the reservation is done. Maybe I can call it alloc_obj_ser or
alloc_entry_ser.
>
>> +{
>> +	struct iommu_objs_ser *next_objs, *objs = *objs_ptr;
>
>Not loving these names :(
>
>TBH, the reserve_obj_ser function isn't too readable, esp. with all the
>variable names, here and in the lu header. I've had to go back and forth
>to the first patch. For example, here next_objs can be next_objs_page &
>objs can be curr_page. (PTAL at my reply on PATCH 01 about renaming).

Basically, with the current naming there are "serialization objects" in
"serialization object arrays". The object is a "base" type with
"inherited" types for iommu, domain etc. I will add an explanation of
all this in the ABI header.

Maybe we can rename them to:

struct iommu_obj_ser;
struct iommu_obj_array_ser; or struct iommu_obj_ser_array;

Then for reserve/alloc, we use names like "next_obj_array" and
"curr_obj_array"?
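For concreteness, here's a rough userspace sketch of that proposed
layout (all field widths and names here are illustrative only, not the
actual ABI):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Base "serialization object" header embedded in every entry. */
struct iommu_obj_ser {
	uint64_t idx;
	uint32_t ref_count;
	bool deleted;
	bool incoming;
};

/* Per-page array header; chains to the next page of entries. */
struct iommu_obj_array_ser {
	uint64_t next_obj_array;	/* phys addr of next page, 0 at end */
	uint64_t nr_objs;		/* entries used in this page */
};

/* An "inherited" type embeds the base header as its first member so a
 * generic walker can inspect idx/ref_count/deleted for any entry type. */
struct iommu_domain_ser_demo {
	struct iommu_obj_ser obj;
	uint64_t domain_data;		/* placeholder for driver state */
};

static size_t demo_obj_offset(void)
{
	return offsetof(struct iommu_domain_ser_demo, obj);
}
```

The key invariant is that the base object header sits at offset 0 of
every "inherited" entry type.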
>
>> +	int idx;
>> +
>> +	if (objs->nr_objs == max_objs) {
>> +		next_objs = kho_alloc_preserve(PAGE_SIZE);
>> +		if (IS_ERR(next_objs))
>> +			return PTR_ERR(next_objs);
>> +
>> +		objs->next_objs = virt_to_phys(next_objs);
>> +		objs = next_objs;
>> +		*objs_ptr = objs;
>> +		objs->nr_objs = 0;
>> +		objs->next_objs = 0;
>
>This seems redundant, no need to zero these out, kho_alloc_preserve
>passes __GFP_ZERO to folio_alloc [1], which should z-alloc the pages.

Agreed. Will remove this.
>
>> +	}
>> +
>> +	idx = objs->nr_objs++;
>> +	return idx;
>> +}
>
>Just to give a mental model to fellow reviewers, here's how this is laid
>out:
>
>----------------------------------------------------------------------
>[ PAGE START ]                                                       |
>----------------------------------------------------------------------
>| iommu_objs_ser (The Page Header)                                   |
>|   - next_objs: 0x0000 (End of the page-chain)                      |
>|   - nr_objs: 2                                                     |
>----------------------------------------------------------------------
>| ITEM 0: iommu_domain_ser                                           |
>|   [ iommu_obj_ser (The entry header) ]                             |
>|     - idx: 0                                                       |
>|     - ref_count: 1                                                 |
>|     - deleted: 0                                                   |
>|   [ Domain Data ]                                                  |
>----------------------------------------------------------------------
>| ITEM 1: iommu_domain_ser                                           |
>|   [ iommu_obj_ser (The entry header) ]                             |
>|     - idx: 1                                                       |
>|     - ref_count: 1                                                 |
>|     - deleted: 0                                                   |
>|   [ Domain Data ]                                                  |
>----------------------------------------------------------------------
>| ... (Empty space for more domains) ...                             |
>|                                                                    |
>----------------------------------------------------------------------
>[ PAGE END ]                                                         |
>----------------------------------------------------------------------

+1

Will add a table in the header as you suggested.
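To go with that diagram, here's a self-contained userspace model of
walking the page chain (plain pointers stand in for the phys-addr
links, a tiny capacity stands in for a whole page, and all names are
illustrative):

```c
#include <stdbool.h>
#include <stdint.h>

#define ENTRIES_PER_PAGE 2	/* tiny capacity so the chain is exercised */

struct demo_entry {
	bool deleted;
	uint64_t payload;
};

struct demo_objs_page {
	struct demo_objs_page *next;	/* models next_objs (phys addr in the ABI) */
	struct demo_entry entries[ENTRIES_PER_PAGE];
};

/* Visit nr entries across the page chain, skipping deleted ones, the
 * way the iommu_for_each_preserved_device() loop does. */
static uint64_t sum_live_entries(struct demo_objs_page *page, int nr)
{
	uint64_t sum = 0;

	for (int i = 0, idx = 0; i < nr; i++, idx++) {
		if (idx >= ENTRIES_PER_PAGE) {
			page = page->next;	/* hop to the next page */
			idx = 0;
		}
		if (page->entries[idx].deleted)
			continue;		/* deleted entries stay in place */
		sum += page->entries[idx].payload;
	}
	return sum;
}

/* Build a two-page chain holding three entries, one of them deleted. */
static uint64_t demo_walk(void)
{
	struct demo_objs_page p1 = { .next = 0 };
	struct demo_objs_page p0 = { .next = &p1 };

	p0.entries[0].payload = 1;
	p0.entries[1].payload = 2;
	p0.entries[1].deleted = true;
	p1.entries[0].payload = 4;
	return sum_live_entries(&p0, 3);	/* 1 + 4, deleted entry skipped */
}
```

Note that deleted entries still occupy their slot and count toward the
total; they are only skipped by the walker.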
>
>> +
>> +int iommu_domain_preserve(struct iommu_domain *domain, struct iommu_domain_ser **ser)
>> +{
>> +	struct iommu_domain_ser *domain_ser;
>> +	struct iommu_lu_flb_obj *flb_obj;
>> +	int idx, ret;
>> +
>> +	if (!domain->ops->preserve)
>> +		return -EOPNOTSUPP;
>> +
>> +	ret = liveupdate_flb_get_outgoing(&iommu_flb, (void **)&flb_obj);
>> +	if (ret)
>> +		return ret;
>> +
>> +	guard(mutex)(&flb_obj->lock);
>> +	idx = reserve_obj_ser((struct iommu_objs_ser **)&flb_obj->iommu_domains,
>> +			      MAX_IOMMU_DOMAIN_SERS);
>> +	if (idx < 0)
>> +		return idx;
>> +
>> +	domain_ser = &flb_obj->iommu_domains->iommu_domains[idx];
>
>This is slightly less-readable as well, I understand we're trying to:
>
>iommu_domains_ser -> iommu_domain_ser[idx] but the same name
>(iommu_domains) makes it difficult to read.. we should rename this as:

Agreed.

I will update it to,

&flb_obj->curr_iommu_domains->domain_array[idx]
>
>&flb_obj->iommu_domains_page->domain_entries[idx] or something for
>better readability..
>
>Also, let's add a comment explaining that reserve_obj_ser actually
>advances the flb_obj ptr when necessary..

Agreed.
>
>IIUC, we start with PAGE 0 initially, store its phys in the
>iommu_flb_preserve op (the iommu_ser_phys et al) & then we go on
>alloc-ing more pages and keep storing the "current" active page with the
>liveupdate core. Now when we jump into the new kernel, we read the
>ser_phys and then follow the page chain, right?

Yes, we do an initial allocation of arrays for each object type and then
allocate more later as needed. The flb_obj holds the address of the
currently active array.

I will add this explanation in the header.
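As a userspace model of that grow-on-demand reservation (calloc stands
in for kho_alloc_preserve(), the capacity is shrunk so the chaining
path is exercised, and all names are illustrative):

```c
#include <stdint.h>
#include <stdlib.h>

#define MAX_OBJS_PER_PAGE 2	/* tiny, so a second page gets chained */

struct demo_objs_page {
	struct demo_objs_page *next;	/* models next_objs */
	uint64_t nr_objs;		/* slots used in THIS page */
	uint64_t entries[MAX_OBJS_PER_PAGE];
};

/* Reserve one slot in the currently active page, chaining a fresh page
 * when it is full; *curr is advanced the way reserve_obj_ser() advances
 * the current-array pointer. Returns the in-page index, or -1 on OOM. */
static int demo_reserve_slot(struct demo_objs_page **curr)
{
	struct demo_objs_page *page = *curr;

	if (page->nr_objs == MAX_OBJS_PER_PAGE) {
		/* calloc stands in for kho_alloc_preserve(PAGE_SIZE) */
		struct demo_objs_page *next = calloc(1, sizeof(*next));

		if (!next)
			return -1;
		page->next = next;	/* the ABI stores a phys addr here */
		*curr = page = next;
	}
	return (int)page->nr_objs++;
}

/* Reserve three slots starting from a single page; return how many
 * pages ended up in the chain. */
static int demo_pages_after_three_reserves(void)
{
	struct demo_objs_page *head = calloc(1, sizeof(*head));
	struct demo_objs_page *curr = head, *p;
	int pages = 0;

	if (!head)
		return -1;
	for (int i = 0; i < 3; i++)
		if (demo_reserve_slot(&curr) < 0)
			return -1;
	for (p = head; p; p = p->next)
		pages++;
	while (head) {			/* free the chain */
		p = head->next;
		free(head);
		head = p;
	}
	return pages;
}
```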
>
>> +	idx = flb_obj->ser->nr_domains++;
>> +	domain_ser->obj.idx = idx;
>> +	domain_ser->obj.ref_count = 1;
>> +
>> +	ret = domain->ops->preserve(domain, domain_ser);
>> +	if (ret) {
>> +		domain_ser->obj.deleted = true;
>> +		return ret;
>> +	}
>> +
>> +	domain->preserved_state = domain_ser;
>> +	*ser = domain_ser;
>> +	return 0;
>> +}
>> +EXPORT_SYMBOL_GPL(iommu_domain_preserve);
>> +
>> +void iommu_domain_unpreserve(struct iommu_domain *domain)
>> +{
>> +	struct iommu_domain_ser *domain_ser;
>> +	struct iommu_lu_flb_obj *flb_obj;
>> +	int ret;
>> +
>> +	if (!domain->ops->unpreserve)
>> +		return;
>> +
>> +	ret = liveupdate_flb_get_outgoing(&iommu_flb, (void **)&flb_obj);
>> +	if (ret)
>> +		return;
>> +
>> +	guard(mutex)(&flb_obj->lock);
>> +
>> +	/*
>> +	 * There is no check for attached devices here. The correctness relies
>> +	 * on the Live Update Orchestrator's session lifecycle. All resources
>> +	 * (iommufd, vfio devices) are preserved within a single session. If the
>> +	 * session is torn down, the .unpreserve callbacks for all files will be
>> +	 * invoked, ensuring a consistent cleanup without needing explicit
>> +	 * refcounting for the serialized objects here.
>> +	 */
>> +	domain_ser = domain->preserved_state;
>> +	domain->ops->unpreserve(domain, domain_ser);
>> +	domain_ser->obj.deleted = true;
>> +	domain->preserved_state = NULL;
>> +}
>> +EXPORT_SYMBOL_GPL(iommu_domain_unpreserve);
>> +
>> +static int iommu_preserve_locked(struct iommu_device *iommu)
>> +{
>> +	struct iommu_lu_flb_obj *flb_obj;
>> +	struct iommu_ser *iommu_ser;
>> +	int idx, ret;
>> +
>> +	if (!iommu->ops->preserve)
>> +		return -EOPNOTSUPP;
>> +
>> +	if (iommu->outgoing_preserved_state) {
>> +		iommu->outgoing_preserved_state->obj.ref_count++;
>> +		return 0;
>> +	}
>> +
>> +	ret = liveupdate_flb_get_outgoing(&iommu_flb, (void **)&flb_obj);
>> +	if (ret)
>> +		return ret;
>> +
>> +	idx = reserve_obj_ser((struct iommu_objs_ser **)&flb_obj->iommus,
>> +			      MAX_IOMMU_SERS);
>> +	if (idx < 0)
>> +		return idx;
>> +
>> +	iommu_ser = &flb_obj->iommus->iommus[idx];
>> +	idx = flb_obj->ser->nr_iommus++;
>> +	iommu_ser->obj.idx = idx;
>> +	iommu_ser->obj.ref_count = 1;
>> +
>> +	ret = iommu->ops->preserve(iommu, iommu_ser);
>> +	if (ret)
>> +		iommu_ser->obj.deleted = true;
>> +
>> +	iommu->outgoing_preserved_state = iommu_ser;
>> +	return ret;
>> +}
>> +
>> +static void iommu_unpreserve_locked(struct iommu_device *iommu)
>> +{
>> +	struct iommu_ser *iommu_ser = iommu->outgoing_preserved_state;
>> +
>> +	iommu_ser->obj.ref_count--;
>> +	if (iommu_ser->obj.ref_count)
>> +		return;
>> +
>> +	iommu->outgoing_preserved_state = NULL;
>> +	iommu->ops->unpreserve(iommu, iommu_ser);
>> +	iommu_ser->obj.deleted = true;
>> +}
>> +
>> +int iommu_preserve_device(struct iommu_domain *domain,
>> +			  struct device *dev, u64 token)
>> +{
>> +	struct iommu_lu_flb_obj *flb_obj;
>> +	struct device_ser *device_ser;
>> +	struct dev_iommu *iommu;
>> +	struct pci_dev *pdev;
>> +	int ret, idx;
>> +
>> +	if (!dev_is_pci(dev))
>> +		return -EOPNOTSUPP;
>> +
>> +	if (!domain->preserved_state)
>> +		return -EINVAL;
>> +
>> +	pdev = to_pci_dev(dev);
>> +	iommu = dev->iommu;
>> +	if (!iommu->iommu_dev->ops->preserve_device ||
>> +	    !iommu->iommu_dev->ops->preserve)
>> +		return -EOPNOTSUPP;
>> +
>> +	ret = liveupdate_flb_get_outgoing(&iommu_flb, (void **)&flb_obj);
>> +	if (ret)
>> +		return ret;
>> +
>> +	guard(mutex)(&flb_obj->lock);
>> +	idx = reserve_obj_ser((struct iommu_objs_ser **)&flb_obj->devices,
>> +			      MAX_DEVICE_SERS);
>> +	if (idx < 0)
>> +		return idx;
>> +
>> +	device_ser = &flb_obj->devices->devices[idx];
>> +	idx = flb_obj->ser->nr_devices++;
>> +	device_ser->obj.idx = idx;
>> +	device_ser->obj.ref_count = 1;
>> +
>> +	ret = iommu_preserve_locked(iommu->iommu_dev);
>> +	if (ret) {
>> +		device_ser->obj.deleted = true;
>> +		return ret;
>> +	}
>> +
>> +	device_ser->domain_iommu_ser.domain_phys = __pa(domain->preserved_state);
>> +	device_ser->domain_iommu_ser.iommu_phys = __pa(iommu->iommu_dev->outgoing_preserved_state);
>> +	device_ser->devid = pci_dev_id(pdev);
>> +	device_ser->pci_domain = pci_domain_nr(pdev->bus);
>> +	device_ser->token = token;
>> +
>> +	ret = iommu->iommu_dev->ops->preserve_device(dev, device_ser);
>> +	if (ret) {
>> +		device_ser->obj.deleted = true;
>> +		iommu_unpreserve_locked(iommu->iommu_dev);
>> +		return ret;
>> +	}
>> +
>> +	dev->iommu->device_ser = device_ser;
>> +	return 0;
>> +}
>> +
>> +void iommu_unpreserve_device(struct iommu_domain *domain, struct device *dev)
>> +{
>> +	struct iommu_lu_flb_obj *flb_obj;
>> +	struct device_ser *device_ser;
>> +	struct dev_iommu *iommu;
>> +	struct pci_dev *pdev;
>> +	int ret;
>> +
>> +	if (!dev_is_pci(dev))
>> +		return;
>> +
>> +	pdev = to_pci_dev(dev);
>> +	iommu = dev->iommu;
>> +	if (!iommu->iommu_dev->ops->unpreserve_device ||
>> +	    !iommu->iommu_dev->ops->unpreserve)
>> +		return;
>> +
>> +	ret = liveupdate_flb_get_outgoing(&iommu_flb, (void **)&flb_obj);
>> +	if (WARN_ON(ret))
>> +		return;
>> +
>> +	guard(mutex)(&flb_obj->lock);
>> +	device_ser = dev_iommu_preserved_state(dev);
>> +	if (WARN_ON(!device_ser))
>> +		return;
>> +
>> +	iommu->iommu_dev->ops->unpreserve_device(dev, device_ser);
>> +	dev->iommu->device_ser = NULL;
>> +
>> +	iommu_unpreserve_locked(iommu->iommu_dev);
>> +}
>
>I'm wondering if we should guard these APIs against accidental or
>potential abuse by other in-kernel drivers, since the Live Update
>Orchestrator (LUO) is, for now, architecturally designed around
>user-space driven sequences (IOCTLs, specific mutex ordering, etc.).
>
>Since the header-file is also under include/linux, we should avoid any
>possibility where we end up having drivers depending on these API.
>We could add a check based on dma owner:
>
>+	/* Only devices explicitly claimed by a user-space driver
>+	 * (VFIO/IOMMUFD) are eligible for Live Update preservation.
>+	 */
>+	if (!iommu_dma_owner_claimed(dev))
>+		return -EPERM;
>
>This should ensure we aren't creating 'zombie' preserved states for
>devices not managed by IOMMUFD/VFIO.

Agreed. I will update this.
>
>
>> diff --git a/include/linux/iommu-lu.h b/include/linux/iommu-lu.h
>> index 59095d2f1bb2..48c07514a776 100644
>> --- a/include/linux/iommu-lu.h
>> +++ b/include/linux/iommu-lu.h
>> @@ -8,9 +8,128 @@
>>  #ifndef _LINUX_IOMMU_LU_H
>>  #define _LINUX_IOMMU_LU_H
>>
>> +#include <linux/device.h>
>> +#include <linux/iommu.h>
>>  #include <linux/liveupdate.h>
>>  #include <linux/kho/abi/iommu.h>
>>
>> +typedef int (*iommu_preserved_device_iter_fn)(struct device_ser *ser,
>> +					      void *arg);
>> +#ifdef CONFIG_IOMMU_LIVEUPDATE
>> +static inline void *dev_iommu_preserved_state(struct device *dev)
>> +{
>> +	struct device_ser *ser;
>> +
>> +	if (!dev->iommu)
>> +		return NULL;
>> +
>> +	ser = dev->iommu->device_ser;
>> +	if (ser && !ser->obj.incoming)
>> +		return ser;
>> +
>> +	return NULL;
>> +}
>> +
>> +static inline void *dev_iommu_restored_state(struct device *dev)
>> +{
>> +	struct device_ser *ser;
>> +
>> +	if (!dev->iommu)
>> +		return NULL;
>> +
>> +	ser = dev->iommu->device_ser;
>> +	if (ser && ser->obj.incoming)
>> +		return ser;
>> +
>> +	return NULL;
>> +}
>> +
>> +static inline void *iommu_domain_restored_state(struct iommu_domain *domain)
>> +{
>> +	struct iommu_domain_ser *ser;
>> +
>> +	ser = domain->preserved_state;
>> +	if (ser && ser->obj.incoming)
>> +		return ser;
>> +
>> +	return NULL;
>> +}
>> +
>> +static inline int dev_iommu_restore_did(struct device *dev, struct iommu_domain *domain)
>> +{
>> +	struct device_ser *ser = dev_iommu_restored_state(dev);
>> +
>> +	if (ser && iommu_domain_restored_state(domain))
>> +		return ser->domain_iommu_ser.did;
>> +
>> +	return -1;
>> +}
>> +
>> +int iommu_for_each_preserved_device(iommu_preserved_device_iter_fn fn,
>> +				    void *arg);
>> +struct device_ser *iommu_get_device_preserved_data(struct device *dev);
>> +struct iommu_ser *iommu_get_preserved_data(u64 token, enum iommu_lu_type type);
>> +int iommu_domain_preserve(struct iommu_domain *domain, struct iommu_domain_ser **ser);
>> +void iommu_domain_unpreserve(struct iommu_domain *domain);
>> +int iommu_preserve_device(struct iommu_domain *domain,
>> +			  struct device *dev, u64 token);
>> +void iommu_unpreserve_device(struct iommu_domain *domain, struct device *dev);
>> +#else
>> +static inline void *dev_iommu_preserved_state(struct device *dev)
>> +{
>> +	return NULL;
>> +}
>> +
>> +static inline void *dev_iommu_restored_state(struct device *dev)
>> +{
>> +	return NULL;
>> +}
>> +
>> +static inline int dev_iommu_restore_did(struct device *dev, struct iommu_domain *domain)
>> +{
>> +	return -1;
>> +}
>> +
>> +static inline void *iommu_domain_restored_state(struct iommu_domain *domain)
>> +{
>> +	return NULL;
>> +}
>> +
>> +static inline int iommu_for_each_preserved_device(iommu_preserved_device_iter_fn fn, void *arg)
>> +{
>> +	return -EOPNOTSUPP;
>> +}
>> +
>> +static inline struct device_ser *iommu_get_device_preserved_data(struct device *dev)
>> +{
>> +	return NULL;
>> +}
>> +
>> +static inline struct iommu_ser *iommu_get_preserved_data(u64 token, enum iommu_lu_type type)
>> +{
>> +	return NULL;
>> +}
>> +
>> +static inline int iommu_domain_preserve(struct iommu_domain *domain, struct iommu_domain_ser **ser)
>> +{
>> +	return -EOPNOTSUPP;
>> +}
>> +
>> +static inline void iommu_domain_unpreserve(struct iommu_domain *domain)
>> +{
>> +}
>> +
>> +static inline int iommu_preserve_device(struct iommu_domain *domain,
>> +					struct device *dev, u64 token)
>> +{
>> +	return -EOPNOTSUPP;
>> +}
>> +
>> +static inline void iommu_unpreserve_device(struct iommu_domain *domain, struct device *dev)
>> +{
>> +}
>> +#endif
>> +
>>  int iommu_liveupdate_register_flb(struct liveupdate_file_handler *handler);
>>  int iommu_liveupdate_unregister_flb(struct liveupdate_file_handler *handler);
>>
>> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
>> index 54b8b48c762e..bd949c1ce7c5 100644
>> --- a/include/linux/iommu.h
>> +++ b/include/linux/iommu.h
>> @@ -14,6 +14,8 @@
>>  #include <linux/err.h>
>>  #include <linux/of.h>
>>  #include <linux/iova_bitmap.h>
>> +#include <linux/atomic.h>
>> +#include <linux/kho/abi/iommu.h>
>>  #include <uapi/linux/iommufd.h>
>>
>>  #define IOMMU_READ	(1 << 0)
>> @@ -248,6 +250,10 @@ struct iommu_domain {
>>  			struct list_head next;
>>  		};
>>  	};
>> +
>> +#ifdef CONFIG_IOMMU_LIVEUPDATE
>> +	struct iommu_domain_ser *preserved_state;
>> +#endif
>>  };
>>
>>  static inline bool iommu_is_dma_domain(struct iommu_domain *domain)
>> @@ -647,6 +653,10 @@ __iommu_copy_struct_to_user(const struct iommu_user_data *dst_data,
>>   *               resources shared/passed to user space IOMMU instance. Associate
>>   *               it with a nesting @parent_domain. It is required for driver to
>>   *               set @viommu->ops pointing to its own viommu_ops
>> + * @preserve_device: Preserve state of a device for liveupdate.
>> + * @unpreserve_device: Unpreserve state that was preserved earlier.
>> + * @preserve: Preserve state of iommu translation hardware for liveupdate.
>> + * @unpreserve: Unpreserve state of iommu that was preserved earlier.
>>   * @owner: Driver module providing these ops
>>   * @identity_domain: An always available, always attachable identity
>>   *                   translation.
>> @@ -703,6 +713,11 @@ struct iommu_ops {
>>  			   struct iommu_domain *parent_domain,
>>  			   const struct iommu_user_data *user_data);
>>
>> +	int (*preserve_device)(struct device *dev, struct device_ser *device_ser);
>> +	void (*unpreserve_device)(struct device *dev, struct device_ser *device_ser);
>
>Nit: Let's move the _device ops under the comment:
>`/* Per device IOMMU features */`

I will move these.
>
>> +	int (*preserve)(struct iommu_device *iommu, struct iommu_ser *iommu_ser);
>> +	void (*unpreserve)(struct iommu_device *iommu, struct iommu_ser *iommu_ser);
>> +
>
>I'm wondering if there's any benefit to adding these ops under an #ifdef?

These should be NULL when liveupdate is disabled, so the #ifdef should
not be needed. I'm not sure how much space we save if we don't define
them, but I will move them under #ifdef anyway.
>
>>  	const struct iommu_domain_ops *default_domain_ops;
>>  	struct module *owner;
>>  	struct iommu_domain *identity_domain;
>> @@ -749,6 +764,11 @@ struct iommu_ops {
>>   *                           specific mechanisms.
>>   * @set_pgtable_quirks: Set io page table quirks (IO_PGTABLE_QUIRK_*)
>>   * @free: Release the domain after use.
>> + * @preserve: Preserve the iommu domain for liveupdate.
>> + *            Returns 0 on success, a negative errno on failure.
>> + * @unpreserve: Unpreserve the iommu domain that was preserved earlier.
>> + * @restore: Restore the iommu domain after liveupdate.
>> + *           Returns 0 on success, a negative errno on failure.
>>   */
>>  struct iommu_domain_ops {
>>  	int (*attach_dev)(struct iommu_domain *domain, struct device *dev,
>> @@ -779,6 +799,9 @@ struct iommu_domain_ops {
>>  				  unsigned long quirks);
>>
>>  	void (*free)(struct iommu_domain *domain);
>> +	int (*preserve)(struct iommu_domain *domain, struct iommu_domain_ser *ser);
>> +	void (*unpreserve)(struct iommu_domain *domain, struct iommu_domain_ser *ser);
>> +	int (*restore)(struct iommu_domain *domain, struct iommu_domain_ser *ser);
>>  };
>>
>>  /**
>> @@ -790,6 +813,8 @@ struct iommu_domain_ops {
>>   * @singleton_group: Used internally for drivers that have only one group
>>   * @max_pasids: number of supported PASIDs
>>   * @ready: set once iommu_device_register() has completed successfully
>> + * @outgoing_preserved_state: preserved iommu state of outgoing kernel for
>> + * liveupdate.
>>   */
>>  struct iommu_device {
>>  	struct list_head list;
>> @@ -799,6 +824,10 @@ struct iommu_device {
>>  	struct iommu_group *singleton_group;
>>  	u32 max_pasids;
>>  	bool ready;
>> +
>> +#ifdef CONFIG_IOMMU_LIVEUPDATE
>> +	struct iommu_ser *outgoing_preserved_state;
>> +#endif
>>  };
>>
>>  /**
>> @@ -853,6 +882,9 @@ struct dev_iommu {
>>  	u32				pci_32bit_workaround:1;
>>  	u32				require_direct:1;
>>  	u32				shadow_on_flush:1;
>> +#ifdef CONFIG_IOMMU_LIVEUPDATE
>> +	struct device_ser		*device_ser;
>> +#endif
>>  };
>>
>>  int iommu_device_register(struct iommu_device *iommu,
>
>Thanks,
>Praan
>
>[1] https://elixir.bootlin.com/linux/v7.0-rc3/source/kernel/liveupdate/kexec_handover.c#L1182

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 01/14] iommu: Implement IOMMU LU FLB callbacks
  2026-02-03 22:09 ` [PATCH 01/14] iommu: Implement IOMMU LU FLB callbacks Samiullah Khawaja
  2026-03-11 21:07   ` Pranjal Shrivastava
@ 2026-03-16 22:54   ` Vipin Sharma
  2026-03-17  1:06     ` Samiullah Khawaja
  1 sibling, 1 reply; 98+ messages in thread
From: Vipin Sharma @ 2026-03-16 22:54 UTC (permalink / raw)
  To: Samiullah Khawaja
  Cc: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe, Robin Murphy, Kevin Tian, Alex Williamson,
	Shuah Khan, iommu, linux-kernel, kvm, Saeed Mahameed,
	Adithya Jayachandran, Parav Pandit, Leon Romanovsky, William Tu,
	Pratyush Yadav, Pasha Tatashin, David Matlack, Andrew Morton,
	Chris Li, Pranjal Shrivastava, YiFei Zhu

On Tue, Feb 03, 2026 at 10:09:35PM +0000, Samiullah Khawaja wrote:
> +config IOMMU_LIVEUPDATE
> +	bool "IOMMU live update state preservation support"
> +	depends on LIVEUPDATE && IOMMUFD
> +	help
> +	  Enable support for preserving IOMMU state across a kexec live update.
> +
> +	  This allows devices managed by iommufd to maintain their DMA mappings
> +	  during a kexec-based kernel update.
> +
> +	  If unsure, say N.
> +

Do we need a separate config? Can't we just use CONFIG_LIVEUPDATE?

>  menuconfig IOMMU_SUPPORT
>  	bool "IOMMU Hardware Support"
>  	depends on MMU
> diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
> index 0275821f4ef9..b3715c5a6b97 100644
> --- a/drivers/iommu/Makefile
> +++ b/drivers/iommu/Makefile
> @@ -15,6 +15,7 @@ obj-$(CONFIG_IOMMU_IO_PGTABLE_ARMV7S) += io-pgtable-arm-v7s.o
>  obj-$(CONFIG_IOMMU_IO_PGTABLE_LPAE) += io-pgtable-arm.o
>  obj-$(CONFIG_IOMMU_IO_PGTABLE_LPAE_KUNIT_TEST) += io-pgtable-arm-selftests.o
>  obj-$(CONFIG_IOMMU_IO_PGTABLE_DART) += io-pgtable-dart.o
> +obj-$(CONFIG_IOMMU_LIVEUPDATE) += liveupdate.o

It seems like there is a sorted order for CONFIG_IOMMU_* in the
Makefile, let's keep it the same if possible.

> +static void iommu_liveupdate_free_objs(u64 next, bool incoming)
> +{
> +	struct iommu_objs_ser *objs;
> +
> +	while (next) {
> +		objs = __va(next);

There are also calls to phys_to_virt() in other functions in this patch.
Should we use the same here for consistency?

> +		next = objs->next_objs;
> +
> +		if (!incoming)
> +			kho_unpreserve_free(objs);
> +		else
> +			folio_put(virt_to_folio(objs));
> +	}
> +}

Instead of passing a boolean and calling it with different arguments, I
think it would be simpler to just have two functions:

- iommu_liveupdate_unpreserve()
- iommu_liveupdate_folio_put()

> +
> +static void iommu_liveupdate_flb_free(struct iommu_lu_flb_obj *obj)
> +{
> +	if (obj->iommu_domains)
> +		iommu_liveupdate_free_objs(obj->ser->iommu_domains_phys, false);
> +
> +	if (obj->devices)
> +		iommu_liveupdate_free_objs(obj->ser->devices_phys, false);
> +
> +	if (obj->iommus)
> +		iommu_liveupdate_free_objs(obj->ser->iommus_phys, false);
> +
> +	kho_unpreserve_free(obj->ser);
> +	kfree(obj);
> +}
> +
> +static int iommu_liveupdate_flb_preserve(struct liveupdate_flb_op_args *argp)
> +{
> +	struct iommu_lu_flb_obj *obj;
> +	struct iommu_lu_flb_ser *ser;
> +	void *mem;
> +
> +	obj = kzalloc(sizeof(*obj), GFP_KERNEL);
> +	if (!obj)
> +		return -ENOMEM;
> +
> +	mutex_init(&obj->lock);
> +	mem = kho_alloc_preserve(sizeof(*ser));
> +	if (IS_ERR(mem))
> +		goto err_free;
> +
> +	ser = mem;
> +	obj->ser = ser;
> +
> +	mem = kho_alloc_preserve(PAGE_SIZE);
> +	if (IS_ERR(mem))
> +		goto err_free;
> +
> +	obj->iommu_domains = mem;
> +	ser->iommu_domains_phys = virt_to_phys(obj->iommu_domains);
> +
> +	mem = kho_alloc_preserve(PAGE_SIZE);
> +	if (IS_ERR(mem))
> +		goto err_free;
> +
> +	obj->devices = mem;
> +	ser->devices_phys = virt_to_phys(obj->devices);
> +
> +	mem = kho_alloc_preserve(PAGE_SIZE);
> +	if (IS_ERR(mem))
> +		goto err_free;
> +
> +	obj->iommus = mem;
> +	ser->iommus_phys = virt_to_phys(obj->iommus);
> +
> +	argp->obj = obj;
> +	argp->data = virt_to_phys(ser);
> +	return 0;
> +
> +err_free:
> +	iommu_liveupdate_flb_free(obj);

Generally, on an error a function's goto jumps to the corresponding
error label, which frees that allocation and all the ones which
happened before it. It is easier to read code that way. I know you are
also sharing the free call with iommu_liveupdate_flb_unpreserve(), but
IMHO code readability will be better this way.
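For reference, the conventional unwind pattern being described looks
roughly like this (userspace sketch; the counters exist only to check
that an injected mid-sequence failure leaks nothing):

```c
#include <stdlib.h>

static int demo_live_allocs;	/* outstanding allocations */
static int demo_fail_after;	/* allocations granted before injected failure */

static void *demo_alloc(void)
{
	if (demo_fail_after-- <= 0)
		return NULL;	/* injected failure */
	demo_live_allocs++;
	return malloc(16);
}

static void demo_free(void *p)
{
	demo_live_allocs--;
	free(p);
}

/* Each error label frees its own allocation, then falls through to the
 * labels for everything allocated before it. */
static int demo_setup(void)
{
	void *a, *b, *c;

	a = demo_alloc();
	if (!a)
		return -1;
	b = demo_alloc();
	if (!b)
		goto err_free_a;
	c = demo_alloc();
	if (!c)
		goto err_free_b;
	demo_free(c);		/* success path: demo only, free everything */
	demo_free(b);
	demo_free(a);
	return 0;

err_free_b:
	demo_free(b);
err_free_a:
	demo_free(a);
	return -1;
}

/* Fail the third allocation and report how many allocations leaked. */
static int demo_leaks_on_failure(void)
{
	demo_fail_after = 2;
	demo_live_allocs = 0;
	if (demo_setup() != -1)
		return -1;	/* should have failed */
	return demo_live_allocs;
}
```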

> +	return PTR_ERR(mem);
> +}
> +
> +static void iommu_liveupdate_flb_unpreserve(struct liveupdate_flb_op_args *argp)
> +{
> +	iommu_liveupdate_flb_free(argp->obj);
> +}
> +
> +static void iommu_liveupdate_flb_finish(struct liveupdate_flb_op_args *argp)
> +{
> +	struct iommu_lu_flb_obj *obj = argp->obj;
> +
> +	if (obj->iommu_domains)
> +		iommu_liveupdate_free_objs(obj->ser->iommu_domains_phys, true);

Can there be a case where obj->iommu_domains is NULL but
obj->ser->iommu_domains_phys is not? If that is not possible, I would
just simplify the patch and call iommu_liveupdate_free_objs()
unconditionally.

> +
> +static int iommu_liveupdate_flb_retrieve(struct liveupdate_flb_op_args *argp)
> +{
> +	struct iommu_lu_flb_obj *obj;
> +	struct iommu_lu_flb_ser *ser;
> +
> +	obj = kzalloc(sizeof(*obj), GFP_ATOMIC);
> +	if (!obj)
> +		return -ENOMEM;

Is a kzalloc() failure here recoverable, whereas an
iommu_liveupdate_restore_objs() failure below is not? If it is not
recoverable, should there be a BUG_ON here?

> +
> +	mutex_init(&obj->lock);
> +	BUG_ON(!kho_restore_folio(argp->data));
> +	ser = phys_to_virt(argp->data);
> +	obj->ser = ser;
> +
> +	iommu_liveupdate_restore_objs(ser->iommu_domains_phys);
> +	obj->iommu_domains = phys_to_virt(ser->iommu_domains_phys);

Can iommu_liveupdate_restore_obj() just return virtual address and we
can simplify code to:

	obj->iommu_domains = iommu_liveupdate_restore_objs(ser->iommu_domains_phys);

> +
> +	iommu_liveupdate_restore_objs(ser->devices_phys);
> +	obj->devices = phys_to_virt(ser->devices_phys);
> +
> +	iommu_liveupdate_restore_objs(ser->iommus_phys);
> +	obj->iommus = phys_to_virt(ser->iommus_phys);
> +
> +	argp->obj = obj;
> +
> +	return 0;
> +}
> +
> diff --git a/include/linux/iommu-lu.h b/include/linux/iommu-lu.h

I would recommend using the full name and not the short "lu";
iommu-liveupdate.h seems more readable and is not too long.

> +#define MAX_IOMMU_SERS ((PAGE_SIZE - sizeof(struct iommus_ser)) / sizeof(struct iommu_ser))
> +#define MAX_IOMMU_DOMAIN_SERS \
> +		((PAGE_SIZE - sizeof(struct iommu_domains_ser)) / sizeof(struct iommu_domain_ser))
> +#define MAX_DEVICE_SERS ((PAGE_SIZE - sizeof(struct devices_ser)) / sizeof(struct device_ser))

This is a per-page limit, not a limit on the whole serialization. Maybe
we can name it something like:

- MAX_IOMMU_SERS_PER_PAGE, or
- MAX_IOMMU_SERS_PAGE_CAPACITY
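Either way, the value is just the per-page entry capacity. A quick
self-contained check of the macro's shape, with illustrative sizes
rather than the real struct sizes:

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE 4096

/* Illustrative stand-ins for the real page-header and entry structs. */
struct demo_page_hdr { uint64_t next; uint64_t nr; };	/* 16 bytes */
struct demo_entry { uint64_t data[8]; };		/* 64 bytes */

/* Roughly the shape of MAX_DEVICE_SERS: entries that fit in one page
 * after the per-page header. */
#define MAX_DEMO_SERS_PER_PAGE \
	((DEMO_PAGE_SIZE - sizeof(struct demo_page_hdr)) / sizeof(struct demo_entry))

static size_t demo_capacity(void)
{
	return MAX_DEMO_SERS_PER_PAGE;	/* (4096 - 16) / 64 = 63 */
}
```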

> +
> +struct iommu_lu_flb_obj {
> +	struct mutex lock;
> +	struct iommu_lu_flb_ser *ser;
> +
> +	struct iommu_domains_ser *iommu_domains;
> +	struct iommus_ser *iommus;
> +	struct devices_ser *devices;
> +} __packed;
> +

I think the naming scheme used here is a little hard to absorb when we
have so many individual structs in this header file. Specifically,
struct names like:

- iommu_domains_ser vs iommu_domain_ser
- iommus_ser vs iommu_ser
- devices_ser vs device_ser
- iommu_objs_ser vs iommu_obj_ser

The first three express a container-and-elements relation; however, the
last one doesn't have that relation even though the naming pattern is
the same.

I would recommend changing the naming scheme of the containers to
something like:

	struct iommu_domain_ser_[hdr|header|table|arr] {};
	struct iommu_ser_hdr {}
	struct device_ser_hdr {}

Individual elements of a container can keep their current names.

For objs, something like: 
	iommu_objs_ser -> iommu_hdr_meta



* Re: [PATCH 01/14] iommu: Implement IOMMU LU FLB callbacks
  2026-03-16 22:54   ` Vipin Sharma
@ 2026-03-17  1:06     ` Samiullah Khawaja
  2026-03-23 23:27       ` Vipin Sharma
  0 siblings, 1 reply; 98+ messages in thread
From: Samiullah Khawaja @ 2026-03-17  1:06 UTC (permalink / raw)
  To: Vipin Sharma
  Cc: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe, Robin Murphy, Kevin Tian, Alex Williamson,
	Shuah Khan, iommu, linux-kernel, kvm, Saeed Mahameed,
	Adithya Jayachandran, Parav Pandit, Leon Romanovsky, William Tu,
	Pratyush Yadav, Pasha Tatashin, David Matlack, Andrew Morton,
	Chris Li, Pranjal Shrivastava, YiFei Zhu

On Mon, Mar 16, 2026 at 03:54:50PM -0700, Vipin Sharma wrote:
>On Tue, Feb 03, 2026 at 10:09:35PM +0000, Samiullah Khawaja wrote:
>> +config IOMMU_LIVEUPDATE
>> +	bool "IOMMU live update state preservation support"
>> +	depends on LIVEUPDATE && IOMMUFD
>> +	help
>> +	  Enable support for preserving IOMMU state across a kexec live update.
>> +
>> +	  This allows devices managed by iommufd to maintain their DMA mappings
>> +	  during a kexec-based kernel update.
>> +
>> +	  If unsure, say N.
>> +
>
>Do we need a separate config? Can't we just use CONFIG_LIVEUPDATE?

We have a separate CONFIG here so that the phase 1/2 split for iommu
preservation doesn't break the vfio preservation. See the following
discussion in RFCv2:

https://lore.kernel.org/all/aYEpHBYxlQxhXrwl@google.com/
>
>>  menuconfig IOMMU_SUPPORT
>>  	bool "IOMMU Hardware Support"
>>  	depends on MMU
>> diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
>> index 0275821f4ef9..b3715c5a6b97 100644
>> --- a/drivers/iommu/Makefile
>> +++ b/drivers/iommu/Makefile
>> @@ -15,6 +15,7 @@ obj-$(CONFIG_IOMMU_IO_PGTABLE_ARMV7S) += io-pgtable-arm-v7s.o
>>  obj-$(CONFIG_IOMMU_IO_PGTABLE_LPAE) += io-pgtable-arm.o
>>  obj-$(CONFIG_IOMMU_IO_PGTABLE_LPAE_KUNIT_TEST) += io-pgtable-arm-selftests.o
>>  obj-$(CONFIG_IOMMU_IO_PGTABLE_DART) += io-pgtable-dart.o
>> +obj-$(CONFIG_IOMMU_LIVEUPDATE) += liveupdate.o
>
>It seems like there is a sorted order for CONFIG_IOMMU_* in the
>Makefile, let's keep it the same if possible.

Will fix in the next revision.
>
>> +static void iommu_liveupdate_free_objs(u64 next, bool incoming)
>> +{
>> +	struct iommu_objs_ser *objs;
>> +
>> +	while (next) {
>> +		objs = __va(next);
>
>There are also calls to phys_to_virt() in other functions in this patch.
>Should we use the same here for consistency?

Agreed. I will fix this.
>
>> +		next = objs->next_objs;
>> +
>> +		if (!incoming)
>> +			kho_unpreserve_free(objs);
>> +		else
>> +			folio_put(virt_to_folio(objs));
>> +	}
>> +}
>
>Instead of passing a boolean and calling it with different arguments, I
>think it would be simpler to just have two functions:
>
>- iommu_liveupdate_unpreserve()
>- iommu_liveupdate_folio_put()

This is a helper function to free the serialized state without
duplicating multiple checks for the various types of state (iommu,
iommu_domain and device).

Do you think maybe I should add these two functions and make them call
the helper?
>
>> +
>> +static void iommu_liveupdate_flb_free(struct iommu_lu_flb_obj *obj)
>> +{
>> +	if (obj->iommu_domains)
>> +		iommu_liveupdate_free_objs(obj->ser->iommu_domains_phys, false);
>> +
>> +	if (obj->devices)
>> +		iommu_liveupdate_free_objs(obj->ser->devices_phys, false);
>> +
>> +	if (obj->iommus)
>> +		iommu_liveupdate_free_objs(obj->ser->iommus_phys, false);
>> +
>> +	kho_unpreserve_free(obj->ser);
>> +	kfree(obj);
>> +}
>> +
>> +static int iommu_liveupdate_flb_preserve(struct liveupdate_flb_op_args *argp)
>> +{
>> +	struct iommu_lu_flb_obj *obj;
>> +	struct iommu_lu_flb_ser *ser;
>> +	void *mem;
>> +
>> +	obj = kzalloc(sizeof(*obj), GFP_KERNEL);
>> +	if (!obj)
>> +		return -ENOMEM;
>> +
>> +	mutex_init(&obj->lock);
>> +	mem = kho_alloc_preserve(sizeof(*ser));
>> +	if (IS_ERR(mem))
>> +		goto err_free;
>> +
>> +	ser = mem;
>> +	obj->ser = ser;
>> +
>> +	mem = kho_alloc_preserve(PAGE_SIZE);
>> +	if (IS_ERR(mem))
>> +		goto err_free;
>> +
>> +	obj->iommu_domains = mem;
>> +	ser->iommu_domains_phys = virt_to_phys(obj->iommu_domains);
>> +
>> +	mem = kho_alloc_preserve(PAGE_SIZE);
>> +	if (IS_ERR(mem))
>> +		goto err_free;
>> +
>> +	obj->devices = mem;
>> +	ser->devices_phys = virt_to_phys(obj->devices);
>> +
>> +	mem = kho_alloc_preserve(PAGE_SIZE);
>> +	if (IS_ERR(mem))
>> +		goto err_free;
>> +
>> +	obj->iommus = mem;
>> +	ser->iommus_phys = virt_to_phys(obj->iommus);
>> +
>> +	argp->obj = obj;
>> +	argp->data = virt_to_phys(ser);
>> +	return 0;
>> +
>> +err_free:
>> +	iommu_liveupdate_flb_free(obj);
>
>Generally, I have seen that a goto in a function jumps to a
>corresponding error label, which frees the corresponding allocation and
>all the ones that happened before it. It is easier to read code that
>way. I know you are also combining the free call from
>iommu_liveupdate_flb_unpreserve(), but IMHO code readability will be
>better that way.

I had that originally when I was writing this function, but it gets
really cluttered :(. Instead it is cleaner, without code duplication, to
use this one cleanup function to free the state both on error and when
doing unpreserve. Please consider this a "destroy" function of obj; it
can be called from 2 places:

- Error during allocation of internal state.
- During unpreserve.
>
>> +	return PTR_ERR(mem);
>> +}
>> +
>> +static void iommu_liveupdate_flb_unpreserve(struct liveupdate_flb_op_args *argp)
>> +{
>> +	iommu_liveupdate_flb_free(argp->obj);
>> +}
>> +
>> +static void iommu_liveupdate_flb_finish(struct liveupdate_flb_op_args *argp)
>> +{
>> +	struct iommu_lu_flb_obj *obj = argp->obj;
>> +
>> +	if (obj->iommu_domains)
>> +		iommu_liveupdate_free_objs(obj->ser->iommu_domains_phys, true);
>
>Can there be a case where obj->iommu_domains is NULL but
>obj->ser->iommu_domains_phys is not? If that is not possible, can we
>just simplify the patch and unconditionally call
>iommu_liveupdate_free_objs()?

Are you suggesting that on flb_finish() the obj->iommu_domains should be
non-NULL because flb_retrieve() succeeded? If yes, then that is correct.
I will update this to call free_objs() without checking
obj->iommu_domains, and do the same for the other types.
>
>> +
>> +static int iommu_liveupdate_flb_retrieve(struct liveupdate_flb_op_args *argp)
>> +{
>> +	struct iommu_lu_flb_obj *obj;
>> +	struct iommu_lu_flb_ser *ser;
>> +
>> +	obj = kzalloc(sizeof(*obj), GFP_ATOMIC);
>> +	if (!obj)
>> +		return -ENOMEM;
>
>Is kzalloc() failure here recoverable whereas iommu_liveupdate_restore_objs()
>below is not? If it is not recoverable, should there be a BUG_ON here?

Interesting... This should be recoverable as there is no corruption or
bad state. LUO will propagate this to the caller, and it should be
handled properly. I will make sure that this is handled in init.
>
>> +
>> +	mutex_init(&obj->lock);
>> +	BUG_ON(!kho_restore_folio(argp->data));
>> +	ser = phys_to_virt(argp->data);
>> +	obj->ser = ser;
>> +
>> +	iommu_liveupdate_restore_objs(ser->iommu_domains_phys);
>> +	obj->iommu_domains = phys_to_virt(ser->iommu_domains_phys);
>
>Can iommu_liveupdate_restore_obj() just return virtual address and we
>can simplify code to:
>
>	obj->iommu_domains = iommu_liveupdate_restore_objs(ser->iommu_domains_phys);

Yes that is a good idea. I will change this.
>
>> +
>> +	iommu_liveupdate_restore_objs(ser->devices_phys);
>> +	obj->devices = phys_to_virt(ser->devices_phys);
>> +
>> +	iommu_liveupdate_restore_objs(ser->iommus_phys);
>> +	obj->iommus = phys_to_virt(ser->iommus_phys);
>> +
>> +	argp->obj = obj;
>> +
>> +	return 0;
>> +}
>> +
>> diff --git a/include/linux/iommu-lu.h b/include/linux/iommu-lu.h
>
>I would recommend using the full name and not the short "lu".
>iommu-liveupdate.h seems more readable and not too long.

Agreed. I will change this.
>
>> +#define MAX_IOMMU_SERS ((PAGE_SIZE - sizeof(struct iommus_ser)) / sizeof(struct iommu_ser))
>> +#define MAX_IOMMU_DOMAIN_SERS \
>> +		((PAGE_SIZE - sizeof(struct iommu_domains_ser)) / sizeof(struct iommu_domain_ser))
>> +#define MAX_DEVICE_SERS ((PAGE_SIZE - sizeof(struct devices_ser)) / sizeof(struct device_ser))
>
>This is a per-page limit, not a whole-serialization limit. Maybe we can
>name it something like:
>
>- MAX_IOMMU_SERS_PER_PAGE, or
>- MAX_IOMMU_SERS_PAGE_CAPACITY

Agreed.
>
>> +
>> +struct iommu_lu_flb_obj {
>> +	struct mutex lock;
>> +	struct iommu_lu_flb_ser *ser;
>> +
>> +	struct iommu_domains_ser *iommu_domains;
>> +	struct iommus_ser *iommus;
>> +	struct devices_ser *devices;
>> +} __packed;
>> +
>
>I think the naming scheme used here is a little hard to absorb when we
>have so many individual structs in this header file. Specifically,
>struct names like:
>
>- iommu_domains_ser vs iommu_domain_ser
>- iommus_ser vs iommu_ser
>- devices_ser vs device_ser
>- iommu_objs_ser vs iommu_obj_ser
>
>The first three show a container and its elements, whereas the last one
>doesn't have that relation but uses the same naming.
>
>I would recommend changing the naming scheme of the containers to something like:
>
>	struct iommu_domain_ser_[hdr|header|table|arr] {};
>	struct iommu_ser_hdr {}
>	struct device_ser_hdr {}
>
>Individual element of container can be same.
>
>For objs, something like:
>	iommu_objs_ser -> iommu_hdr_meta
>
>

Agreed. The singular vs plural for object vs aggregate is tricky. I will
rework these names. I am thinking something like the following, based on
the feedback on this patch:

struct iommu_ser_hdr; <= object hdr.
struct iommu_ser_arr_hdr <= array of objects hdr.
struct iommu_domain_ser <= contains a preserved domain.
struct iommu_domain_ser_arr <= array of domains.

Thanks,
Sami

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 02/14] iommu: Implement IOMMU core liveupdate skeleton
  2026-02-03 22:09 ` [PATCH 02/14] iommu: Implement IOMMU core liveupdate skeleton Samiullah Khawaja
  2026-03-12 23:10   ` Pranjal Shrivastava
@ 2026-03-17 19:58   ` Vipin Sharma
  2026-03-17 20:33     ` Samiullah Khawaja
  1 sibling, 1 reply; 98+ messages in thread
From: Vipin Sharma @ 2026-03-17 19:58 UTC (permalink / raw)
  To: Samiullah Khawaja
  Cc: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe, Robin Murphy, Kevin Tian, Alex Williamson,
	Shuah Khan, iommu, linux-kernel, kvm, Saeed Mahameed,
	Adithya Jayachandran, Parav Pandit, Leon Romanovsky, William Tu,
	Pratyush Yadav, Pasha Tatashin, David Matlack, Andrew Morton,
	Chris Li, Pranjal Shrivastava, YiFei Zhu

On Tue, Feb 03, 2026 at 10:09:36PM +0000, Samiullah Khawaja wrote:
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 4926a43118e6..c0632cb5b570 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -389,6 +389,9 @@ static struct dev_iommu *dev_iommu_get(struct device *dev)
>  
>  	mutex_init(&param->lock);
>  	dev->iommu = param;
> +#ifdef CONFIG_IOMMU_LIVEUPDATE
> +	dev->iommu->device_ser = NULL;
> +#endif

This should already be NULL due to kzalloc() above.

>  	return param;
>  }
>  
> diff --git a/drivers/iommu/liveupdate.c b/drivers/iommu/liveupdate.c
> index 6189ba32ff2c..83eb609b3fd7 100644
> --- a/drivers/iommu/liveupdate.c
> +++ b/drivers/iommu/liveupdate.c
> @@ -11,6 +11,7 @@
>  #include <linux/liveupdate.h>
>  #include <linux/iommu-lu.h>
>  #include <linux/iommu.h>
> +#include <linux/pci.h>
>  #include <linux/errno.h>
>  
>  static void iommu_liveupdate_restore_objs(u64 next)
> @@ -175,3 +176,328 @@ int iommu_liveupdate_unregister_flb(struct liveupdate_file_handler *handler)
>  	return liveupdate_unregister_flb(handler, &iommu_flb);
>  }
>  EXPORT_SYMBOL(iommu_liveupdate_unregister_flb);
> +
> +int iommu_for_each_preserved_device(iommu_preserved_device_iter_fn fn,
> +				    void *arg)
> +{
> +	struct iommu_lu_flb_obj *obj;
> +	struct devices_ser *devices;
> +	int ret, i, idx;
> +
> +	ret = liveupdate_flb_get_incoming(&iommu_flb, (void **)&obj);
> +	if (ret)
> +		return -ENOENT;
> +
> +	devices = __va(obj->ser->devices_phys);
> +	for (i = 0, idx = 0; i < obj->ser->nr_devices; ++i, ++idx) {
> +		if (idx >= MAX_DEVICE_SERS) {
> +			devices = __va(devices->objs.next_objs);
> +			idx = 0;
> +		}
> +
> +		if (devices->devices[idx].obj.deleted)
> +			continue;
> +
> +		ret = fn(&devices->devices[idx], arg);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL(iommu_for_each_preserved_device);

IMHO, with all the obj, ser, devices->devices[..], it is a little harder
to follow the code. I would recommend reconsidering the naming.

Also, should this function be introduced in the patch where it gets
used? The other changes in this patch are already big and complex.
Same for iommu_get_device_preserved_data() and
iommu_get_preserved_data().

I think this patch can be split into three:
Patch 1: Preserve iommu_domain
Patch 2: Preserve pci device and iommu device
Patch 3: The helper functions I mentioned above.


> +
> +static inline bool device_ser_match(struct device_ser *match,
> +				    struct pci_dev *pdev)
> +{
> +	return match->devid == pci_dev_id(pdev) && match->pci_domain == pci_domain_nr(pdev->bus);

Nit: s/match/device_ser for readability or something similar.

> +}
> +
> +struct device_ser *iommu_get_device_preserved_data(struct device *dev)
> +{
> +	struct iommu_lu_flb_obj *obj;
> +	struct devices_ser *devices;
> +	int ret, i, idx;
> +
> +	if (!dev_is_pci(dev))
> +		return NULL;
> +
> +	ret = liveupdate_flb_get_incoming(&iommu_flb, (void **)&obj);
> +	if (ret)
> +		return NULL;
> +
> +	devices = __va(obj->ser->devices_phys);
> +	for (i = 0, idx = 0; i < obj->ser->nr_devices; ++i, ++idx) {
> +		if (idx >= MAX_DEVICE_SERS) {
> +			devices = __va(devices->objs.next_objs);
> +			idx = 0;
> +		}
> +
> +		if (devices->devices[idx].obj.deleted)
> +			continue;
> +
> +		if (device_ser_match(&devices->devices[idx], to_pci_dev(dev))) {
> +			devices->devices[idx].obj.incoming = true;
> +			return &devices->devices[idx];
> +		}
> +	}
> +
> +	return NULL;
> +}
> +EXPORT_SYMBOL(iommu_get_device_preserved_data);
> +
> +struct iommu_ser *iommu_get_preserved_data(u64 token, enum iommu_lu_type type)
> +{
> +	struct iommu_lu_flb_obj *obj;
> +	struct iommus_ser *iommus;
> +	int ret, i, idx;
> +
> +	ret = liveupdate_flb_get_incoming(&iommu_flb, (void **)&obj);
> +	if (ret)
> +		return NULL;
> +
> +	iommus = __va(obj->ser->iommus_phys);
> +	for (i = 0, idx = 0; i < obj->ser->nr_iommus; ++i, ++idx) {
> +		if (idx >= MAX_IOMMU_SERS) {
> +			iommus = __va(iommus->objs.next_objs);
> +			idx = 0;
> +		}
> +
> +		if (iommus->iommus[idx].obj.deleted)
> +			continue;
> +
> +		if (iommus->iommus[idx].token == token &&
> +		    iommus->iommus[idx].type == type)
> +			return &iommus->iommus[idx];
> +	}
> +
> +	return NULL;
> +}
> +EXPORT_SYMBOL(iommu_get_preserved_data);
> +
> +static int reserve_obj_ser(struct iommu_objs_ser **objs_ptr, u64 max_objs)
> +{
> +	struct iommu_objs_ser *next_objs, *objs = *objs_ptr;
> +	int idx;
> +
> +	if (objs->nr_objs == max_objs) {
> +		next_objs = kho_alloc_preserve(PAGE_SIZE);
> +		if (IS_ERR(next_objs))
> +			return PTR_ERR(next_objs);
> +
> +		objs->next_objs = virt_to_phys(next_objs);
> +		objs = next_objs;
> +		*objs_ptr = objs;
> +		objs->nr_objs = 0;
> +		objs->next_objs = 0;
> +	}
> +
> +	idx = objs->nr_objs++;
> +	return idx;

I think we should rename these variables, as it is difficult to
comprehend the code.

> +}
> +
> +int iommu_domain_preserve(struct iommu_domain *domain, struct iommu_domain_ser **ser)
> +{
> +	struct iommu_domain_ser *domain_ser;
> +	struct iommu_lu_flb_obj *flb_obj;
> +	int idx, ret;
> +
> +	if (!domain->ops->preserve)
> +		return -EOPNOTSUPP;
> +
> +	ret = liveupdate_flb_get_outgoing(&iommu_flb, (void **)&flb_obj);
> +	if (ret)
> +		return ret;
> +
> +	guard(mutex)(&flb_obj->lock);
> +	idx = reserve_obj_ser((struct iommu_objs_ser **)&flb_obj->iommu_domains,
> +			      MAX_IOMMU_DOMAIN_SERS);
> +	if (idx < 0)
> +		return idx;
> +
> +	domain_ser = &flb_obj->iommu_domains->iommu_domains[idx];
> +	idx = flb_obj->ser->nr_domains++;

In the for loops above, nr_domains is compared with 'i', and idx is the
index inside a page. Let's not reuse idx here for nr_domains; keep it
consistent for easier understanding. Maybe use something more
descriptive than 'i', like global_idx and local_idx?

> +	domain_ser->obj.idx = idx;
> +	domain_ser->obj.ref_count = 1;
> +
> +	ret = domain->ops->preserve(domain, domain_ser);
> +	if (ret) {
> +		domain_ser->obj.deleted = true;
> +		return ret;
> +	}
> +
> +	domain->preserved_state = domain_ser;
> +	*ser = domain_ser;
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(iommu_domain_preserve);
> +
> +static int iommu_preserve_locked(struct iommu_device *iommu)
> +{
> +	struct iommu_lu_flb_obj *flb_obj;
> +	struct iommu_ser *iommu_ser;
> +	int idx, ret;

Should there be lockdep asserts in these _locked versions of the APIs?

> +
> +}
> +
> +static void iommu_unpreserve_locked(struct iommu_device *iommu)
> +{
> +	struct iommu_ser *iommu_ser = iommu->outgoing_preserved_state;
> +
> +	iommu_ser->obj.ref_count--;

Should there be a NULL check?

> +	if (iommu_ser->obj.ref_count)
> +		return;
> +
> +	iommu->outgoing_preserved_state = NULL;
> +	iommu->ops->unpreserve(iommu, iommu_ser);
> +	iommu_ser->obj.deleted = true;
> +}
> +
> +int iommu_preserve_device(struct iommu_domain *domain,
> +			  struct device *dev, u64 token)
> +{
> +	struct iommu_lu_flb_obj *flb_obj;
> +	struct device_ser *device_ser;
> +	struct dev_iommu *iommu;
> +	struct pci_dev *pdev;
> +	int ret, idx;
> +
> +	if (!dev_is_pci(dev))
> +		return -EOPNOTSUPP;
> +
> +	if (!domain->preserved_state)
> +		return -EINVAL;
> +
> +	pdev = to_pci_dev(dev);
> +	iommu = dev->iommu;
> +	if (!iommu->iommu_dev->ops->preserve_device ||
> +	    !iommu->iommu_dev->ops->preserve)

iommu_preserve_locked() is already checking for preserve(), so we can
just check preserve_device() here.

> +		return -EOPNOTSUPP;
> +
> +	ret = liveupdate_flb_get_outgoing(&iommu_flb, (void **)&flb_obj);
> +	if (ret)
> +		return ret;
> +
> +	guard(mutex)(&flb_obj->lock);
> +	idx = reserve_obj_ser((struct iommu_objs_ser **)&flb_obj->devices,
> +			      MAX_DEVICE_SERS);
> +	if (idx < 0)
> +		return idx;
> +
> +	device_ser = &flb_obj->devices->devices[idx];
> +	idx = flb_obj->ser->nr_devices++;
> +	device_ser->obj.idx = idx;
> +	device_ser->obj.ref_count = 1;
> +
> +	ret = iommu_preserve_locked(iommu->iommu_dev);
> +	if (ret) {
> +		device_ser->obj.deleted = true;
> +		return ret;
> +	}
> +
> +	device_ser->domain_iommu_ser.domain_phys = __pa(domain->preserved_state);
> +	device_ser->domain_iommu_ser.iommu_phys = __pa(iommu->iommu_dev->outgoing_preserved_state);
> +	device_ser->devid = pci_dev_id(pdev);
> +	device_ser->pci_domain = pci_domain_nr(pdev->bus);
> +	device_ser->token = token;
> +
> +	ret = iommu->iommu_dev->ops->preserve_device(dev, device_ser);
> +	if (ret) {
> +		device_ser->obj.deleted = true;
> +		iommu_unpreserve_locked(iommu->iommu_dev);
> +		return ret;
> +	}
> +
> +	dev->iommu->device_ser = device_ser;
> +	return 0;
> +}
> +
> +void iommu_unpreserve_device(struct iommu_domain *domain, struct device *dev)
> +{
> +	struct iommu_lu_flb_obj *flb_obj;
> +	struct device_ser *device_ser;
> +	struct dev_iommu *iommu;
> +	struct pci_dev *pdev;
> +	int ret;
> +
> +	if (!dev_is_pci(dev))
> +		return;
> +
> +	pdev = to_pci_dev(dev);
> +	iommu = dev->iommu;
> +	if (!iommu->iommu_dev->ops->unpreserve_device ||
> +	    !iommu->iommu_dev->ops->unpreserve)
> +		return;
> +
> +	ret = liveupdate_flb_get_outgoing(&iommu_flb, (void **)&flb_obj);
> +	if (WARN_ON(ret))

Why WARN_ON here and not in other places? Do we need it?

> +		return;
> +
> +	guard(mutex)(&flb_obj->lock);
> +	device_ser = dev_iommu_preserved_state(dev);
> +	if (WARN_ON(!device_ser))

Can't we just silently ignore this?

> +		return;
> +
> +	iommu->iommu_dev->ops->unpreserve_device(dev, device_ser);
> +	dev->iommu->device_ser = NULL;
> +
> +	iommu_unpreserve_locked(iommu->iommu_dev);
> +}


* Re: [PATCH 02/14] iommu: Implement IOMMU core liveupdate skeleton
  2026-03-13 18:42     ` Samiullah Khawaja
@ 2026-03-17 20:09       ` Pranjal Shrivastava
  2026-03-17 20:13         ` Samiullah Khawaja
  0 siblings, 1 reply; 98+ messages in thread
From: Pranjal Shrivastava @ 2026-03-17 20:09 UTC (permalink / raw)
  To: Samiullah Khawaja
  Cc: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe, Robin Murphy, Kevin Tian, Alex Williamson,
	Shuah Khan, iommu, linux-kernel, kvm, Saeed Mahameed,
	Adithya Jayachandran, Parav Pandit, Leon Romanovsky, William Tu,
	Pratyush Yadav, Pasha Tatashin, David Matlack, Andrew Morton,
	Chris Li, Vipin Sharma, YiFei Zhu

On Fri, Mar 13, 2026 at 06:42:05PM +0000, Samiullah Khawaja wrote:
> On Thu, Mar 12, 2026 at 11:10:36PM +0000, Pranjal Shrivastava wrote:
> > On Tue, Feb 03, 2026 at 10:09:36PM +0000, Samiullah Khawaja wrote:
> > > Add IOMMU domain ops that can be implemented by the IOMMU drivers if
> > > they support IOMMU domain preservation across liveupdate. The new IOMMU
> > > domain preserve, unpreserve and restore APIs call these ops to perform
> > > respective live update operations.
> > > 
> > > Similarly add IOMMU ops to preserve/unpreserve a device. These can be
> > > implemented by the IOMMU drivers that support preservation of devices
> > > that have their IOMMU domains preserved. During device preservation the
> > > state of the associated IOMMU is also preserved. The device can only be
> > > preserved if the attached iommu domain is preserved and the associated
> > > iommu supports preservation.
> > > 
> > > The preserved state of the device and IOMMU needs to be fetched during
> > > shutdown and boot in the next kernel. Add APIs that can be used to fetch
> > > the preserved state of a device and IOMMU. The APIs will only be used
> > > during shutdown and after liveupdate so no locking needed.
> > > 
> > > Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
> > > ---
> > >  drivers/iommu/iommu.c      |   3 +
> > >  drivers/iommu/liveupdate.c | 326 +++++++++++++++++++++++++++++++++++++
> > >  include/linux/iommu-lu.h   | 119 ++++++++++++++
> > >  include/linux/iommu.h      |  32 ++++
> > >  4 files changed, 480 insertions(+)
> > > 
> > > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> > > index 4926a43118e6..c0632cb5b570 100644
> > > --- a/drivers/iommu/iommu.c
> > > +++ b/drivers/iommu/iommu.c
> > > @@ -389,6 +389,9 @@ static struct dev_iommu *dev_iommu_get(struct device *dev)
> > > 
> > >  	mutex_init(&param->lock);
> > >  	dev->iommu = param;
> > > +#ifdef CONFIG_IOMMU_LIVEUPDATE
> > > +	dev->iommu->device_ser = NULL;
> > > +#endif
> > >  	return param;
> > >  }
> > > 
> > > diff --git a/drivers/iommu/liveupdate.c b/drivers/iommu/liveupdate.c
> > > index 6189ba32ff2c..83eb609b3fd7 100644
> > > --- a/drivers/iommu/liveupdate.c
> > > +++ b/drivers/iommu/liveupdate.c
> > > @@ -11,6 +11,7 @@
> > >  #include <linux/liveupdate.h>
> > >  #include <linux/iommu-lu.h>
> > >  #include <linux/iommu.h>
> > > +#include <linux/pci.h>
> > >  #include <linux/errno.h>
> > > 
> > >  static void iommu_liveupdate_restore_objs(u64 next)
> > > @@ -175,3 +176,328 @@ int iommu_liveupdate_unregister_flb(struct liveupdate_file_handler *handler)
> > >  	return liveupdate_unregister_flb(handler, &iommu_flb);
> > >  }
> > >  EXPORT_SYMBOL(iommu_liveupdate_unregister_flb);
> > > +
> > > +int iommu_for_each_preserved_device(iommu_preserved_device_iter_fn fn,
> > > +				    void *arg)
> > > +{
> > > +	struct iommu_lu_flb_obj *obj;
> > > +	struct devices_ser *devices;
> > > +	int ret, i, idx;
> > > +
> > > +	ret = liveupdate_flb_get_incoming(&iommu_flb, (void **)&obj);
> > > +	if (ret)
> > > +		return -ENOENT;
> > > +
> > > +	devices = __va(obj->ser->devices_phys);
> > > +	for (i = 0, idx = 0; i < obj->ser->nr_devices; ++i, ++idx) {
> > > +		if (idx >= MAX_DEVICE_SERS) {
> > > +			devices = __va(devices->objs.next_objs);
> > > +			idx = 0;
> > > +		}
> > > +
> > > +		if (devices->devices[idx].obj.deleted)
> > > +			continue;
> > > +
> > > +		ret = fn(&devices->devices[idx], arg);
> > > +		if (ret)
> > > +			return ret;
> > > +	}
> > > +
> > > +	return 0;
> > > +}
> > > +EXPORT_SYMBOL(iommu_for_each_preserved_device);
> > > +
> > > +static inline bool device_ser_match(struct device_ser *match,
> > > +				    struct pci_dev *pdev)
> > > +{
> > > +	return match->devid == pci_dev_id(pdev) && match->pci_domain == pci_domain_nr(pdev->bus);
> > > +}
> > > +
> > > +struct device_ser *iommu_get_device_preserved_data(struct device *dev)
> > > +{
> > > +	struct iommu_lu_flb_obj *obj;
> > > +	struct devices_ser *devices;
> > > +	int ret, i, idx;
> > > +
> > > +	if (!dev_is_pci(dev))
> > > +		return NULL;
> > > +
> > > +	ret = liveupdate_flb_get_incoming(&iommu_flb, (void **)&obj);
> > > +	if (ret)
> > > +		return NULL;
> > > +
> > > +	devices = __va(obj->ser->devices_phys);
> > > +	for (i = 0, idx = 0; i < obj->ser->nr_devices; ++i, ++idx) {
> > > +		if (idx >= MAX_DEVICE_SERS) {
> > > +			devices = __va(devices->objs.next_objs);
> > > +			idx = 0;
> > > +		}
> > > +
> > > +		if (devices->devices[idx].obj.deleted)
> > > +			continue;
> > > +
> > > +		if (device_ser_match(&devices->devices[idx], to_pci_dev(dev))) {
> > > +			devices->devices[idx].obj.incoming = true;
> > > +			return &devices->devices[idx];
> > > +		}
> > > +	}
> > > +
> > > +	return NULL;
> > > +}
> > > +EXPORT_SYMBOL(iommu_get_device_preserved_data);
> > > +
> > > +struct iommu_ser *iommu_get_preserved_data(u64 token, enum iommu_lu_type type)
> > > +{
> > > +	struct iommu_lu_flb_obj *obj;
> > > +	struct iommus_ser *iommus;
> > > +	int ret, i, idx;
> > > +
> > > +	ret = liveupdate_flb_get_incoming(&iommu_flb, (void **)&obj);
> > > +	if (ret)
> > > +		return NULL;
> > > +
> > > +	iommus = __va(obj->ser->iommus_phys);
> > > +	for (i = 0, idx = 0; i < obj->ser->nr_iommus; ++i, ++idx) {
> > > +		if (idx >= MAX_IOMMU_SERS) {
> > > +			iommus = __va(iommus->objs.next_objs);
> > > +			idx = 0;
> > > +		}
> > > +
> > > +		if (iommus->iommus[idx].obj.deleted)
> > > +			continue;
> > > +
> > > +		if (iommus->iommus[idx].token == token &&
> > > +		    iommus->iommus[idx].type == type)
> > > +			return &iommus->iommus[idx];
> > > +	}
> > > +
> > > +	return NULL;
> > > +}
> > > +EXPORT_SYMBOL(iommu_get_preserved_data);
> > > +

Also, I don't see these helpers being used anywhere in this series yet.
Should we add them with the phase-2 series, when we actually "restore"
the domain?

Thanks,
Praan

> > > +static int reserve_obj_ser(struct iommu_objs_ser **objs_ptr, u64 max_objs)
> > 
> > Isn't this more of an "insert" / "populate" / write_ser_entry? We can
> > rename it to something like iommu_ser_push_entry / iommu_ser_write_entry
> 
> This is reserving an object in the objects array. The object is filled
> in once the reservation is done. Maybe I can call it alloc_obj_ser or
> alloc_entry_ser.
> > 
> > > +{
> > > +	struct iommu_objs_ser *next_objs, *objs = *objs_ptr;
> > 
> > Not loving these names :(
> > 
> > TBH, the reserve_obj_ser function isn't too readable, esp. with all the
> > variable names, here and in the lu header. I've had to go back and
> > forth to the first patch. For example, here next_objs can be
> > next_objs_page and objs can be curr_page. (PTAL at my reply on PATCH 01
> > about renaming.)
> 
> Basically, with the current naming there are "serialization objects" in
> "serialization object arrays". The object is a "base" type with
> "inherited" types for iommu, domain etc. I will add an explanation of
> all this in the ABI header.
> 
> Maybe we can name it to:
> 
> struct iommu_obj_ser;
> struct iommu_obj_array_ser; or struct iommu_obj_ser_array;
> 
> Then for reserve/alloc, we use names like "next_obj_array" and
> "curr_obj_array"?
> > 
> > > +	int idx;
> > > +
> > > +	if (objs->nr_objs == max_objs) {
> > > +		next_objs = kho_alloc_preserve(PAGE_SIZE);
> > > +		if (IS_ERR(next_objs))
> > > +			return PTR_ERR(next_objs);
> > > +
> > > +		objs->next_objs = virt_to_phys(next_objs);
> > > +		objs = next_objs;
> > > +		*objs_ptr = objs;
> > > +		objs->nr_objs = 0;
> > > +		objs->next_objs = 0;
> > 
> > This seems redundant; no need to zero these out. kho_alloc_preserve
> > passes __GFP_ZERO to folio_alloc [1], which already zeroes the pages.
> 
> Agreed. Will remove this.
> > 
> > > +	}
> > > +
> > > +	idx = objs->nr_objs++;
> > > +	return idx;
> > > +}
> > 
> > Just to give a mental model to fellow reviewers, here's how this is laid
> > out:
> > 
> > ----------------------------------------------------------------------
> > [ PAGE START ]                                                       |
> > ----------------------------------------------------------------------
> > | iommu_objs_ser (The Page Header)                                   |
> > |   - next_objs: 0x0000 (End of the page-chain)                      |
> > |   - nr_objs: 2                                                     |
> > ----------------------------------------------------------------------
> > | ITEM 0: iommu_domain_ser                                           |
> > |   [ iommu_obj_ser (The entry header) ]                             |
> > |     - idx: 0                                                       |
> > |     - ref_count: 1                                                 |
> > |     - deleted: 0                                                   |
> > |   [ Domain Data ]                                                  |
> > ----------------------------------------------------------------------
> > | ITEM 1: iommu_domain_ser                                           |
> > |   [ iommu_obj_ser (The entry header) ]                             |
> > |     - idx: 1                                                       |
> > |     - ref_count: 1                                                 |
> > |     - deleted: 0                                                   |
> > |   [ Domain Data ]                                                  |
> > ----------------------------------------------------------------------
> > | ... (Empty space for more domains) ...                             |
> > |                                                                    |
> > ----------------------------------------------------------------------
> > [ PAGE END ]                                                         |
> > ----------------------------------------------------------------------
> 
> +1
> 
> Will add a table in the header as you suggested.
> > 
> > > +
> > > +int iommu_domain_preserve(struct iommu_domain *domain, struct iommu_domain_ser **ser)
> > > +{
> > > +	struct iommu_domain_ser *domain_ser;
> > > +	struct iommu_lu_flb_obj *flb_obj;
> > > +	int idx, ret;
> > > +
> > > +	if (!domain->ops->preserve)
> > > +		return -EOPNOTSUPP;
> > > +
> > > +	ret = liveupdate_flb_get_outgoing(&iommu_flb, (void **)&flb_obj);
> > > +	if (ret)
> > > +		return ret;
> > > +
> > > +	guard(mutex)(&flb_obj->lock);
> > > +	idx = reserve_obj_ser((struct iommu_objs_ser **)&flb_obj->iommu_domains,
> > > +			      MAX_IOMMU_DOMAIN_SERS);
> > > +	if (idx < 0)
> > > +		return idx;
> > > +
> > > +	domain_ser = &flb_obj->iommu_domains->iommu_domains[idx];
> > 
> > This is slightly less readable as well; I understand we're trying to:
> >
> > iommu_domains_ser -> iommu_domain_ser[idx], but the same name
> > (iommu_domains) makes it difficult to read. We should rename this as:
> 
> Agreed.
> 
> I will update it to,
> 
> &flb_obj->curr_iommu_domains->domain_array[idx]
> > 
> > &flb_obj->iommu_domains_page->domain_entries[idx] or something for
> > better readability..
> > 
> > Also, let's add a comment explaining that reserve_obj_ser actually
> > advances the flb_obj ptr when necessary.
> 
> Agreed.
> > 
> > IIUC, we start with PAGE 0 initially, store its phys in the
> > iommu_flb_preserve op (the iommu_ser_phys et al), and then we go on
> > alloc-ing more pages, keeping the "current" active page with the
> > liveupdate core. Now when we jump into the new kernel, we read the
> > ser_phys and then follow the page chain, right?
> 
> Yes, we do an initial allocation of arrays for each object type and then
> later allocate more as needed. The flb_obj holds the address of
> the currently active array.
> 
> I will add this explanation in the header.
> > 
> > > +	idx = flb_obj->ser->nr_domains++;
> > > +	domain_ser->obj.idx = idx;
> > > +	domain_ser->obj.ref_count = 1;
> > > +
> > > +	ret = domain->ops->preserve(domain, domain_ser);
> > > +	if (ret) {
> > > +		domain_ser->obj.deleted = true;
> > > +		return ret;
> > > +	}
> > > +
> > > +	domain->preserved_state = domain_ser;
> > > +	*ser = domain_ser;
> > > +	return 0;
> > > +}
> > > +EXPORT_SYMBOL_GPL(iommu_domain_preserve);
> > > +
> > > +void iommu_domain_unpreserve(struct iommu_domain *domain)
> > > +{
> > > +	struct iommu_domain_ser *domain_ser;
> > > +	struct iommu_lu_flb_obj *flb_obj;
> > > +	int ret;
> > > +
> > > +	if (!domain->ops->unpreserve)
> > > +		return;
> > > +
> > > +	ret = liveupdate_flb_get_outgoing(&iommu_flb, (void **)&flb_obj);
> > > +	if (ret)
> > > +		return;
> > > +
> > > +	guard(mutex)(&flb_obj->lock);
> > > +
> > > +	/*
> > > +	 * There is no check for attached devices here. The correctness relies
> > > +	 * on the Live Update Orchestrator's session lifecycle. All resources
> > > +	 * (iommufd, vfio devices) are preserved within a single session. If the
> > > +	 * session is torn down, the .unpreserve callbacks for all files will be
> > > +	 * invoked, ensuring a consistent cleanup without needing explicit
> > > +	 * refcounting for the serialized objects here.
> > > +	 */
> > > +	domain_ser = domain->preserved_state;
> > > +	domain->ops->unpreserve(domain, domain_ser);
> > > +	domain_ser->obj.deleted = true;
> > > +	domain->preserved_state = NULL;
> > > +}
> > > +EXPORT_SYMBOL_GPL(iommu_domain_unpreserve);
> > > +
> > > +static int iommu_preserve_locked(struct iommu_device *iommu)
> > > +{
> > > +	struct iommu_lu_flb_obj *flb_obj;
> > > +	struct iommu_ser *iommu_ser;
> > > +	int idx, ret;
> > > +
> > > +	if (!iommu->ops->preserve)
> > > +		return -EOPNOTSUPP;
> > > +
> > > +	if (iommu->outgoing_preserved_state) {
> > > +		iommu->outgoing_preserved_state->obj.ref_count++;
> > > +		return 0;
> > > +	}
> > > +
> > > +	ret = liveupdate_flb_get_outgoing(&iommu_flb, (void **)&flb_obj);
> > > +	if (ret)
> > > +		return ret;
> > > +
> > > +	idx = reserve_obj_ser((struct iommu_objs_ser **)&flb_obj->iommus,
> > > +			      MAX_IOMMU_SERS);
> > > +	if (idx < 0)
> > > +		return idx;
> > > +
> > > +	iommu_ser = &flb_obj->iommus->iommus[idx];
> > > +	idx = flb_obj->ser->nr_iommus++;
> > > +	iommu_ser->obj.idx = idx;
> > > +	iommu_ser->obj.ref_count = 1;
> > > +
> > > +	ret = iommu->ops->preserve(iommu, iommu_ser);
> > > +	if (ret)
> > > +		iommu_ser->obj.deleted = true;
> > > +
> > > +	iommu->outgoing_preserved_state = iommu_ser;
> > > +	return ret;
> > > +}
> > > +
> > > +static void iommu_unpreserve_locked(struct iommu_device *iommu)
> > > +{
> > > +	struct iommu_ser *iommu_ser = iommu->outgoing_preserved_state;
> > > +
> > > +	iommu_ser->obj.ref_count--;
> > > +	if (iommu_ser->obj.ref_count)
> > > +		return;
> > > +
> > > +	iommu->outgoing_preserved_state = NULL;
> > > +	iommu->ops->unpreserve(iommu, iommu_ser);
> > > +	iommu_ser->obj.deleted = true;
> > > +}
> > > +
> > > +int iommu_preserve_device(struct iommu_domain *domain,
> > > +			  struct device *dev, u64 token)
> > > +{
> > > +	struct iommu_lu_flb_obj *flb_obj;
> > > +	struct device_ser *device_ser;
> > > +	struct dev_iommu *iommu;
> > > +	struct pci_dev *pdev;
> > > +	int ret, idx;
> > > +
> > > +	if (!dev_is_pci(dev))
> > > +		return -EOPNOTSUPP;
> > > +
> > > +	if (!domain->preserved_state)
> > > +		return -EINVAL;
> > > +
> > > +	pdev = to_pci_dev(dev);
> > > +	iommu = dev->iommu;
> > > +	if (!iommu->iommu_dev->ops->preserve_device ||
> > > +	    !iommu->iommu_dev->ops->preserve)
> > > +		return -EOPNOTSUPP;
> > > +
> > > +	ret = liveupdate_flb_get_outgoing(&iommu_flb, (void **)&flb_obj);
> > > +	if (ret)
> > > +		return ret;
> > > +
> > > +	guard(mutex)(&flb_obj->lock);
> > > +	idx = reserve_obj_ser((struct iommu_objs_ser **)&flb_obj->devices,
> > > +			      MAX_DEVICE_SERS);
> > > +	if (idx < 0)
> > > +		return idx;
> > > +
> > > +	device_ser = &flb_obj->devices->devices[idx];
> > > +	idx = flb_obj->ser->nr_devices++;
> > > +	device_ser->obj.idx = idx;
> > > +	device_ser->obj.ref_count = 1;
> > > +
> > > +	ret = iommu_preserve_locked(iommu->iommu_dev);
> > > +	if (ret) {
> > > +		device_ser->obj.deleted = true;
> > > +		return ret;
> > > +	}
> > > +
> > > +	device_ser->domain_iommu_ser.domain_phys = __pa(domain->preserved_state);
> > > +	device_ser->domain_iommu_ser.iommu_phys = __pa(iommu->iommu_dev->outgoing_preserved_state);
> > > +	device_ser->devid = pci_dev_id(pdev);
> > > +	device_ser->pci_domain = pci_domain_nr(pdev->bus);
> > > +	device_ser->token = token;
> > > +
> > > +	ret = iommu->iommu_dev->ops->preserve_device(dev, device_ser);
> > > +	if (ret) {
> > > +		device_ser->obj.deleted = true;
> > > +		iommu_unpreserve_locked(iommu->iommu_dev);
> > > +		return ret;
> > > +	}
> > > +
> > > +	dev->iommu->device_ser = device_ser;
> > > +	return 0;
> > > +}
> > > +
> > > +void iommu_unpreserve_device(struct iommu_domain *domain, struct device *dev)
> > > +{
> > > +	struct iommu_lu_flb_obj *flb_obj;
> > > +	struct device_ser *device_ser;
> > > +	struct dev_iommu *iommu;
> > > +	struct pci_dev *pdev;
> > > +	int ret;
> > > +
> > > +	if (!dev_is_pci(dev))
> > > +		return;
> > > +
> > > +	pdev = to_pci_dev(dev);
> > > +	iommu = dev->iommu;
> > > +	if (!iommu->iommu_dev->ops->unpreserve_device ||
> > > +	    !iommu->iommu_dev->ops->unpreserve)
> > > +		return;
> > > +
> > > +	ret = liveupdate_flb_get_outgoing(&iommu_flb, (void **)&flb_obj);
> > > +	if (WARN_ON(ret))
> > > +		return;
> > > +
> > > +	guard(mutex)(&flb_obj->lock);
> > > +	device_ser = dev_iommu_preserved_state(dev);
> > > +	if (WARN_ON(!device_ser))
> > > +		return;
> > > +
> > > +	iommu->iommu_dev->ops->unpreserve_device(dev, device_ser);
> > > +	dev->iommu->device_ser = NULL;
> > > +
> > > +	iommu_unpreserve_locked(iommu->iommu_dev);
> > > +}
> > 
> > I'm wondering if we should guard these APIs against accidental or
> > potential abuse by other in-kernel drivers, since the Live Update
> > Orchestrator (LUO) is, for now, architecturally designed around
> > user-space driven sequences (IOCTLs, specific mutex ordering, etc.).
> > 
> > Since the header file is also under include/linux, we should avoid any
> > possibility of drivers ending up depending on these APIs.
> > We could add a check based on dma owner:
> > 
> > +	/* Only devices explicitly claimed by a user-space driver
> > +	 * (VFIO/IOMMUFD) are eligible for Live Update preservation.
> > +	 */
> > +	if (!iommu_dma_owner_claimed(dev))
> > +		return -EPERM;
> > 
> > This should ensure we aren't creating 'zombie' preserved states for
> > devices not managed by IOMMUFD/VFIO.
> 
> Agreed. I will update this.
> > 
> > 
> > > diff --git a/include/linux/iommu-lu.h b/include/linux/iommu-lu.h
> > > index 59095d2f1bb2..48c07514a776 100644
> > > --- a/include/linux/iommu-lu.h
> > > +++ b/include/linux/iommu-lu.h
> > > @@ -8,9 +8,128 @@
> > >  #ifndef _LINUX_IOMMU_LU_H
> > >  #define _LINUX_IOMMU_LU_H
> > > 
> > > +#include <linux/device.h>
> > > +#include <linux/iommu.h>
> > >  #include <linux/liveupdate.h>
> > >  #include <linux/kho/abi/iommu.h>
> > > 
> > > +typedef int (*iommu_preserved_device_iter_fn)(struct device_ser *ser,
> > > +					      void *arg);
> > > +#ifdef CONFIG_IOMMU_LIVEUPDATE
> > > +static inline void *dev_iommu_preserved_state(struct device *dev)
> > > +{
> > > +	struct device_ser *ser;
> > > +
> > > +	if (!dev->iommu)
> > > +		return NULL;
> > > +
> > > +	ser = dev->iommu->device_ser;
> > > +	if (ser && !ser->obj.incoming)
> > > +		return ser;
> > > +
> > > +	return NULL;
> > > +}
> > > +
> > > +static inline void *dev_iommu_restored_state(struct device *dev)
> > > +{
> > > +	struct device_ser *ser;
> > > +
> > > +	if (!dev->iommu)
> > > +		return NULL;
> > > +
> > > +	ser = dev->iommu->device_ser;
> > > +	if (ser && ser->obj.incoming)
> > > +		return ser;
> > > +
> > > +	return NULL;
> > > +}
> > > +
> > > +static inline void *iommu_domain_restored_state(struct iommu_domain *domain)
> > > +{
> > > +	struct iommu_domain_ser *ser;
> > > +
> > > +	ser = domain->preserved_state;
> > > +	if (ser && ser->obj.incoming)
> > > +		return ser;
> > > +
> > > +	return NULL;
> > > +}
> > > +
> > > +static inline int dev_iommu_restore_did(struct device *dev, struct iommu_domain *domain)
> > > +{
> > > +	struct device_ser *ser = dev_iommu_restored_state(dev);
> > > +
> > > +	if (ser && iommu_domain_restored_state(domain))
> > > +		return ser->domain_iommu_ser.did;
> > > +
> > > +	return -1;
> > > +}
> > > +
> > > +int iommu_for_each_preserved_device(iommu_preserved_device_iter_fn fn,
> > > +				    void *arg);
> > > +struct device_ser *iommu_get_device_preserved_data(struct device *dev);
> > > +struct iommu_ser *iommu_get_preserved_data(u64 token, enum iommu_lu_type type);
> > > +int iommu_domain_preserve(struct iommu_domain *domain, struct iommu_domain_ser **ser);
> > > +void iommu_domain_unpreserve(struct iommu_domain *domain);
> > > +int iommu_preserve_device(struct iommu_domain *domain,
> > > +			  struct device *dev, u64 token);
> > > +void iommu_unpreserve_device(struct iommu_domain *domain, struct device *dev);
> > > +#else
> > > +static inline void *dev_iommu_preserved_state(struct device *dev)
> > > +{
> > > +	return NULL;
> > > +}
> > > +
> > > +static inline void *dev_iommu_restored_state(struct device *dev)
> > > +{
> > > +	return NULL;
> > > +}
> > > +
> > > +static inline int dev_iommu_restore_did(struct device *dev, struct iommu_domain *domain)
> > > +{
> > > +	return -1;
> > > +}
> > > +
> > > +static inline void *iommu_domain_restored_state(struct iommu_domain *domain)
> > > +{
> > > +	return NULL;
> > > +}
> > > +
> > > +static inline int iommu_for_each_preserved_device(iommu_preserved_device_iter_fn fn, void *arg)
> > > +{
> > > +	return -EOPNOTSUPP;
> > > +}
> > > +
> > > +static inline struct device_ser *iommu_get_device_preserved_data(struct device *dev)
> > > +{
> > > +	return NULL;
> > > +}
> > > +
> > > +static inline struct iommu_ser *iommu_get_preserved_data(u64 token, enum iommu_lu_type type)
> > > +{
> > > +	return NULL;
> > > +}
> > > +
> > > +static inline int iommu_domain_preserve(struct iommu_domain *domain, struct iommu_domain_ser **ser)
> > > +{
> > > +	return -EOPNOTSUPP;
> > > +}
> > > +
> > > +static inline void iommu_domain_unpreserve(struct iommu_domain *domain)
> > > +{
> > > +}
> > > +
> > > +static inline int iommu_preserve_device(struct iommu_domain *domain,
> > > +					struct device *dev, u64 token)
> > > +{
> > > +	return -EOPNOTSUPP;
> > > +}
> > > +
> > > +static inline void iommu_unpreserve_device(struct iommu_domain *domain, struct device *dev)
> > > +{
> > > +}
> > > +#endif
> > > +
> > >  int iommu_liveupdate_register_flb(struct liveupdate_file_handler *handler);
> > >  int iommu_liveupdate_unregister_flb(struct liveupdate_file_handler *handler);
> > > 
> > > diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> > > index 54b8b48c762e..bd949c1ce7c5 100644
> > > --- a/include/linux/iommu.h
> > > +++ b/include/linux/iommu.h
> > > @@ -14,6 +14,8 @@
> > >  #include <linux/err.h>
> > >  #include <linux/of.h>
> > >  #include <linux/iova_bitmap.h>
> > > +#include <linux/atomic.h>
> > > +#include <linux/kho/abi/iommu.h>
> > >  #include <uapi/linux/iommufd.h>
> > > 
> > >  #define IOMMU_READ	(1 << 0)
> > > @@ -248,6 +250,10 @@ struct iommu_domain {
> > >  			struct list_head next;
> > >  		};
> > >  	};
> > > +
> > > +#ifdef CONFIG_IOMMU_LIVEUPDATE
> > > +	struct iommu_domain_ser *preserved_state;
> > > +#endif
> > >  };
> > > 
> > >  static inline bool iommu_is_dma_domain(struct iommu_domain *domain)
> > > @@ -647,6 +653,10 @@ __iommu_copy_struct_to_user(const struct iommu_user_data *dst_data,
> > >   *               resources shared/passed to user space IOMMU instance. Associate
> > >   *               it with a nesting @parent_domain. It is required for driver to
> > >   *               set @viommu->ops pointing to its own viommu_ops
> > > + * @preserve_device: Preserve state of a device for liveupdate.
> > > + * @unpreserve_device: Unpreserve state that was preserved earlier.
> > > + * @preserve: Preserve state of iommu translation hardware for liveupdate.
> > > + * @unpreserve: Unpreserve state of iommu that was preserved earlier.
> > >   * @owner: Driver module providing these ops
> > >   * @identity_domain: An always available, always attachable identity
> > >   *                   translation.
> > > @@ -703,6 +713,11 @@ struct iommu_ops {
> > >  			   struct iommu_domain *parent_domain,
> > >  			   const struct iommu_user_data *user_data);
> > > 
> > > +	int (*preserve_device)(struct device *dev, struct device_ser *device_ser);
> > > +	void (*unpreserve_device)(struct device *dev, struct device_ser *device_ser);
> > 
> > Nit: Let's move the _device ops under the comment:
> > `/* Per device IOMMU features */`
> 
> I will move these.
> > 
> > > +	int (*preserve)(struct iommu_device *iommu, struct iommu_ser *iommu_ser);
> > > +	void (*unpreserve)(struct iommu_device *iommu, struct iommu_ser *iommu_ser);
> > > +
> > 
> > I'm wondering if there's any benefit to add these ops under a #ifdef ?
> 
> These should be NULL when liveupdate is disabled, so the #ifdef should not
> be needed. Not sure how much space we save if we don't define these. I will
> move these under #ifdef anyway.
> > 
> > >  	const struct iommu_domain_ops *default_domain_ops;
> > >  	struct module *owner;
> > >  	struct iommu_domain *identity_domain;
> > > @@ -749,6 +764,11 @@ struct iommu_ops {
> > >   *                           specific mechanisms.
> > >   * @set_pgtable_quirks: Set io page table quirks (IO_PGTABLE_QUIRK_*)
> > >   * @free: Release the domain after use.
> > > + * @preserve: Preserve the iommu domain for liveupdate.
> > > + *            Returns 0 on success, a negative errno on failure.
> > > + * @unpreserve: Unpreserve the iommu domain that was preserved earlier.
> > > + * @restore: Restore the iommu domain after liveupdate.
> > > + *           Returns 0 on success, a negative errno on failure.
> > >   */
> > >  struct iommu_domain_ops {
> > >  	int (*attach_dev)(struct iommu_domain *domain, struct device *dev,
> > > @@ -779,6 +799,9 @@ struct iommu_domain_ops {
> > >  				  unsigned long quirks);
> > > 
> > >  	void (*free)(struct iommu_domain *domain);
> > > +	int (*preserve)(struct iommu_domain *domain, struct iommu_domain_ser *ser);
> > > +	void (*unpreserve)(struct iommu_domain *domain, struct iommu_domain_ser *ser);
> > > +	int (*restore)(struct iommu_domain *domain, struct iommu_domain_ser *ser);
> > >  };
> > > 
> > >  /**
> > > @@ -790,6 +813,8 @@ struct iommu_domain_ops {
> > >   * @singleton_group: Used internally for drivers that have only one group
> > >   * @max_pasids: number of supported PASIDs
> > >   * @ready: set once iommu_device_register() has completed successfully
> > > + * @outgoing_preserved_state: preserved iommu state of outgoing kernel for
> > > + * liveupdate.
> > >   */
> > >  struct iommu_device {
> > >  	struct list_head list;
> > > @@ -799,6 +824,10 @@ struct iommu_device {
> > >  	struct iommu_group *singleton_group;
> > >  	u32 max_pasids;
> > >  	bool ready;
> > > +
> > > +#ifdef CONFIG_IOMMU_LIVEUPDATE
> > > +	struct iommu_ser *outgoing_preserved_state;
> > > +#endif
> > >  };
> > > 
> > >  /**
> > > @@ -853,6 +882,9 @@ struct dev_iommu {
> > >  	u32				pci_32bit_workaround:1;
> > >  	u32				require_direct:1;
> > >  	u32				shadow_on_flush:1;
> > > +#ifdef CONFIG_IOMMU_LIVEUPDATE
> > > +	struct device_ser		*device_ser;
> > > +#endif
> > >  };
> > > 
> > >  int iommu_device_register(struct iommu_device *iommu,
> > 
> > Thanks,
> > Praan
> > 
> > [1] https://elixir.bootlin.com/linux/v7.0-rc3/source/kernel/liveupdate/kexec_handover.c#L1182

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 02/14] iommu: Implement IOMMU core liveupdate skeleton
  2026-03-17 20:09       ` Pranjal Shrivastava
@ 2026-03-17 20:13         ` Samiullah Khawaja
  2026-03-17 20:23           ` Pranjal Shrivastava
  0 siblings, 1 reply; 98+ messages in thread
From: Samiullah Khawaja @ 2026-03-17 20:13 UTC (permalink / raw)
  To: Pranjal Shrivastava
  Cc: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe, Robin Murphy, Kevin Tian, Alex Williamson,
	Shuah Khan, iommu, linux-kernel, kvm, Saeed Mahameed,
	Adithya Jayachandran, Parav Pandit, Leon Romanovsky, William Tu,
	Pratyush Yadav, Pasha Tatashin, David Matlack, Andrew Morton,
	Chris Li, Vipin Sharma, YiFei Zhu

On Tue, Mar 17, 2026 at 08:09:26PM +0000, Pranjal Shrivastava wrote:
>On Fri, Mar 13, 2026 at 06:42:05PM +0000, Samiullah Khawaja wrote:
>> On Thu, Mar 12, 2026 at 11:10:36PM +0000, Pranjal Shrivastava wrote:
>> > On Tue, Feb 03, 2026 at 10:09:36PM +0000, Samiullah Khawaja wrote:
>> > > Add IOMMU domain ops that can be implemented by the IOMMU drivers if
>> > > they support IOMMU domain preservation across liveupdate. The new IOMMU
>> > > domain preserve, unpreserve and restore APIs call these ops to perform
>> > > respective live update operations.
>> > >
>> > > Similarly add IOMMU ops to preserve/unpreserve a device. These can be
>> > > implemented by the IOMMU drivers that support preservation of devices
>> > > that have their IOMMU domains preserved. During device preservation the
>> > > state of the associated IOMMU is also preserved. The device can only be
>> > > preserved if the attached iommu domain is preserved and the associated
>> > > iommu supports preservation.
>> > >
>> > > The preserved state of the device and IOMMU needs to be fetched during
>> > > shutdown and boot in the next kernel. Add APIs that can be used to fetch
>> > > the preserved state of a device and IOMMU. The APIs will only be used
>> > > during shutdown and after liveupdate so no locking needed.
>> > >
>> > > Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
>> > > ---
>> > >  drivers/iommu/iommu.c      |   3 +
>> > >  drivers/iommu/liveupdate.c | 326 +++++++++++++++++++++++++++++++++++++
>> > >  include/linux/iommu-lu.h   | 119 ++++++++++++++
>> > >  include/linux/iommu.h      |  32 ++++
>> > >  4 files changed, 480 insertions(+)
>> > >
>> > > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
>> > > index 4926a43118e6..c0632cb5b570 100644
>> > > --- a/drivers/iommu/iommu.c
>> > > +++ b/drivers/iommu/iommu.c
>> > > @@ -389,6 +389,9 @@ static struct dev_iommu *dev_iommu_get(struct device *dev)
>> > >
>> > >  	mutex_init(&param->lock);
>> > >  	dev->iommu = param;
>> > > +#ifdef CONFIG_IOMMU_LIVEUPDATE
>> > > +	dev->iommu->device_ser = NULL;
>> > > +#endif
>> > >  	return param;
>> > >  }
>> > >
>> > > diff --git a/drivers/iommu/liveupdate.c b/drivers/iommu/liveupdate.c
>> > > index 6189ba32ff2c..83eb609b3fd7 100644
>> > > --- a/drivers/iommu/liveupdate.c
>> > > +++ b/drivers/iommu/liveupdate.c
>> > > @@ -11,6 +11,7 @@
>> > >  #include <linux/liveupdate.h>
>> > >  #include <linux/iommu-lu.h>
>> > >  #include <linux/iommu.h>
>> > > +#include <linux/pci.h>
>> > >  #include <linux/errno.h>
>> > >
>> > >  static void iommu_liveupdate_restore_objs(u64 next)
>> > > @@ -175,3 +176,328 @@ int iommu_liveupdate_unregister_flb(struct liveupdate_file_handler *handler)
>> > >  	return liveupdate_unregister_flb(handler, &iommu_flb);
>> > >  }
>> > >  EXPORT_SYMBOL(iommu_liveupdate_unregister_flb);
>> > > +
>> > > +int iommu_for_each_preserved_device(iommu_preserved_device_iter_fn fn,
>> > > +				    void *arg)
>> > > +{
>> > > +	struct iommu_lu_flb_obj *obj;
>> > > +	struct devices_ser *devices;
>> > > +	int ret, i, idx;
>> > > +
>> > > +	ret = liveupdate_flb_get_incoming(&iommu_flb, (void **)&obj);
>> > > +	if (ret)
>> > > +		return -ENOENT;
>> > > +
>> > > +	devices = __va(obj->ser->devices_phys);
>> > > +	for (i = 0, idx = 0; i < obj->ser->nr_devices; ++i, ++idx) {
>> > > +		if (idx >= MAX_DEVICE_SERS) {
>> > > +			devices = __va(devices->objs.next_objs);
>> > > +			idx = 0;
>> > > +		}
>> > > +
>> > > +		if (devices->devices[idx].obj.deleted)
>> > > +			continue;
>> > > +
>> > > +		ret = fn(&devices->devices[idx], arg);
>> > > +		if (ret)
>> > > +			return ret;
>> > > +	}
>> > > +
>> > > +	return 0;
>> > > +}
>> > > +EXPORT_SYMBOL(iommu_for_each_preserved_device);
>> > > +
>> > > +static inline bool device_ser_match(struct device_ser *match,
>> > > +				    struct pci_dev *pdev)
>> > > +{
>> > > +	return match->devid == pci_dev_id(pdev) && match->pci_domain == pci_domain_nr(pdev->bus);
>> > > +}
>> > > +
>> > > +struct device_ser *iommu_get_device_preserved_data(struct device *dev)
>> > > +{
>> > > +	struct iommu_lu_flb_obj *obj;
>> > > +	struct devices_ser *devices;
>> > > +	int ret, i, idx;
>> > > +
>> > > +	if (!dev_is_pci(dev))
>> > > +		return NULL;
>> > > +
>> > > +	ret = liveupdate_flb_get_incoming(&iommu_flb, (void **)&obj);
>> > > +	if (ret)
>> > > +		return NULL;
>> > > +
>> > > +	devices = __va(obj->ser->devices_phys);
>> > > +	for (i = 0, idx = 0; i < obj->ser->nr_devices; ++i, ++idx) {
>> > > +		if (idx >= MAX_DEVICE_SERS) {
>> > > +			devices = __va(devices->objs.next_objs);
>> > > +			idx = 0;
>> > > +		}
>> > > +
>> > > +		if (devices->devices[idx].obj.deleted)
>> > > +			continue;
>> > > +
>> > > +		if (device_ser_match(&devices->devices[idx], to_pci_dev(dev))) {
>> > > +			devices->devices[idx].obj.incoming = true;
>> > > +			return &devices->devices[idx];
>> > > +		}
>> > > +	}
>> > > +
>> > > +	return NULL;
>> > > +}
>> > > +EXPORT_SYMBOL(iommu_get_device_preserved_data);
>> > > +
>> > > +struct iommu_ser *iommu_get_preserved_data(u64 token, enum iommu_lu_type type)
>> > > +{
>> > > +	struct iommu_lu_flb_obj *obj;
>> > > +	struct iommus_ser *iommus;
>> > > +	int ret, i, idx;
>> > > +
>> > > +	ret = liveupdate_flb_get_incoming(&iommu_flb, (void **)&obj);
>> > > +	if (ret)
>> > > +		return NULL;
>> > > +
>> > > +	iommus = __va(obj->ser->iommus_phys);
>> > > +	for (i = 0, idx = 0; i < obj->ser->nr_iommus; ++i, ++idx) {
>> > > +		if (idx >= MAX_IOMMU_SERS) {
>> > > +			iommus = __va(iommus->objs.next_objs);
>> > > +			idx = 0;
>> > > +		}
>> > > +
>> > > +		if (iommus->iommus[idx].obj.deleted)
>> > > +			continue;
>> > > +
>> > > +		if (iommus->iommus[idx].token == token &&
>> > > +		    iommus->iommus[idx].type == type)
>> > > +			return &iommus->iommus[idx];
>> > > +	}
>> > > +
>> > > +	return NULL;
>> > > +}
>> > > +EXPORT_SYMBOL(iommu_get_preserved_data);
>> > > +
>
>Also, I don't see these helpers being used anywhere in this series yet.
>Should we add them with the phase-2 series, when we actually "Restore"
>the domain?

Phase 1 does restore the iommu and domains during boot. But the
iommufd/hwpt state is not restored or re-associated with the restored
domains.

These helpers are used to fetch the preserved state during boot.

Thanks,
Sami
>
>Thanks,
>Praan
>
>> > > +static int reserve_obj_ser(struct iommu_objs_ser **objs_ptr, u64 max_objs)
>> >
>> > Isn't this more of an "insert" / "populate" / write_ser_entry? We can
>> > rename it to something like iommu_ser_push_entry / iommu_ser_write_entry
>>
>> This is reserving an object in the objects array. The object is filled
>> once reservation is done. Maybe I can call it alloc_obj_ser or
>> alloc_entry_ser.
>> >
>> > > +{
>> > > +	struct iommu_objs_ser *next_objs, *objs = *objs_ptr;
>> >
>> > Not loving these names :(
>> >
>> > TBH, the reserve_obj_ser function isn't too readable, esp. with all the
>> > variable names, here and in the lu header.. I've had to go back n forth
>> > to the first patch. For example, here next_objs can be next_objs_page &
>> > objs can be curr_page. (PTAL at my reply on PATCH 01 about renaming).
>>
>> Basically, with the current naming, there are "serialization objects" in
>> "serialization object arrays". The object is a "base" type with
>> "inherited" types for iommu, domain, etc. I will add an explanation of
>> all this in the ABI header.
>>
>> Maybe we can name it to:
>>
>> struct iommu_obj_ser;
>> struct iommu_obj_array_ser; or struct iommu_obj_ser_array;
>>
>> Then for reserve/alloc, we use names like "next_obj_array" and
>> "curr_obj_array"?
>> >
>> > > +	int idx;
>> > > +
>> > > +	if (objs->nr_objs == max_objs) {
>> > > +		next_objs = kho_alloc_preserve(PAGE_SIZE);
>> > > +		if (IS_ERR(next_objs))
>> > > +			return PTR_ERR(next_objs);
>> > > +
>> > > +		objs->next_objs = virt_to_phys(next_objs);
>> > > +		objs = next_objs;
>> > > +		*objs_ptr = objs;
>> > > +		objs->nr_objs = 0;
>> > > +		objs->next_objs = 0;
>> >
>> > This seems redundant, no need to zero these out, kho_alloc_preserve
>> > passes __GFP_ZERO to folio_alloc [1], which should z-alloc the pages.
>>
>> Agreed. Will remove this.
>> >
>> > > +	}
>> > > +
>> > > +	idx = objs->nr_objs++;
>> > > +	return idx;
>> > > +}
>> >
>> > Just to give a mental model to fellow reviewers, here's how this is laid
>> > out:
>> >
>> > ----------------------------------------------------------------------
>> > [ PAGE START ]                                                       |
>> > ----------------------------------------------------------------------
>> > | iommu_objs_ser (The Page Header)                                   |
>> > |   - next_objs: 0x0000 (End of the page-chain)                      |
>> > |   - nr_objs: 2                                                     |
>> > ----------------------------------------------------------------------
>> > | ITEM 0: iommu_domain_ser                                           |
>> > |   [ iommu_obj_ser (The entry header) ]                             |
>> > |     - idx: 0                                                       |
>> > |     - ref_count: 1                                                 |
>> > |     - deleted: 0                                                   |
>> > |   [ Domain Data ]                                                  |
>> > ----------------------------------------------------------------------
>> > | ITEM 1: iommu_domain_ser                                           |
>> > |   [ iommu_obj_ser (The entry header) ]                             |
>> > |     - idx: 1                                                       |
>> > |     - ref_count: 1                                                 |
>> > |     - deleted: 0                                                   |
>> > |   [ Domain Data ]                                                  |
>> > ----------------------------------------------------------------------
>> > | ... (Empty space for more domains) ...                             |
>> > |                                                                    |
>> > ----------------------------------------------------------------------
>> > [ PAGE END ]                                                         |
>> > ----------------------------------------------------------------------
>>
>> +1
>>
>> Will add a table in the header as you suggested.
>> >
>> > > +
>> > > +int iommu_domain_preserve(struct iommu_domain *domain, struct iommu_domain_ser **ser)
>> > > +{
>> > > +	struct iommu_domain_ser *domain_ser;
>> > > +	struct iommu_lu_flb_obj *flb_obj;
>> > > +	int idx, ret;
>> > > +
>> > > +	if (!domain->ops->preserve)
>> > > +		return -EOPNOTSUPP;
>> > > +
>> > > +	ret = liveupdate_flb_get_outgoing(&iommu_flb, (void **)&flb_obj);
>> > > +	if (ret)
>> > > +		return ret;
>> > > +
>> > > +	guard(mutex)(&flb_obj->lock);
>> > > +	idx = reserve_obj_ser((struct iommu_objs_ser **)&flb_obj->iommu_domains,
>> > > +			      MAX_IOMMU_DOMAIN_SERS);
>> > > +	if (idx < 0)
>> > > +		return idx;
>> > > +
>> > > +	domain_ser = &flb_obj->iommu_domains->iommu_domains[idx];
>> >
>> > This is slightly less-readable as well, I understand we're trying to:
>> >
>> > iommu_domains_ser -> iommu_domain_ser[idx] but the same name
>> > (iommu_domains) makes it difficult to read.. we should rename this as:
>>
>> Agreed.
>>
>> I will update it to,
>>
>> &flb_obj->curr_iommu_domains->domain_array[idx]
>> >
>> > &flb_obj->iommu_domains_page->domain_entries[idx] or something for
>> > better readability..
>> >
>> > Also, let's add a comment explaining that reserve_obj_ser actually
>> > advances the flb_obj ptr when necessary..
>>
>> Agreed.
>> >
>> > IIUC, we start with PAGE 0 initially, store its phys in the
>> > iommu_flb_preserve op (the iommu_ser_phys et al) & then we go on
>> > alloc-ing more pages and keep storing the "current" active page with the
>> > liveupdate core. Now when we jump into the new kernel, we read the
>> > ser_phys and then follow the page chain, right?
>>
>> Yes, we do an initial allocation of arrays for each object type and then
>> later allocate more as needed. The flb_obj holds the address of
>> the currently active array.
>>
>> I will add this explanation in the header.
>> >
>> > > +	idx = flb_obj->ser->nr_domains++;
>> > > +	domain_ser->obj.idx = idx;
>> > > +	domain_ser->obj.ref_count = 1;
>> > > +
>> > > +	ret = domain->ops->preserve(domain, domain_ser);
>> > > +	if (ret) {
>> > > +		domain_ser->obj.deleted = true;
>> > > +		return ret;
>> > > +	}
>> > > +
>> > > +	domain->preserved_state = domain_ser;
>> > > +	*ser = domain_ser;
>> > > +	return 0;
>> > > +}
>> > > +EXPORT_SYMBOL_GPL(iommu_domain_preserve);
>> > > +
>> > > +void iommu_domain_unpreserve(struct iommu_domain *domain)
>> > > +{
>> > > +	struct iommu_domain_ser *domain_ser;
>> > > +	struct iommu_lu_flb_obj *flb_obj;
>> > > +	int ret;
>> > > +
>> > > +	if (!domain->ops->unpreserve)
>> > > +		return;
>> > > +
>> > > +	ret = liveupdate_flb_get_outgoing(&iommu_flb, (void **)&flb_obj);
>> > > +	if (ret)
>> > > +		return;
>> > > +
>> > > +	guard(mutex)(&flb_obj->lock);
>> > > +
>> > > +	/*
>> > > +	 * There is no check for attached devices here. The correctness relies
>> > > +	 * on the Live Update Orchestrator's session lifecycle. All resources
>> > > +	 * (iommufd, vfio devices) are preserved within a single session. If the
>> > > +	 * session is torn down, the .unpreserve callbacks for all files will be
>> > > +	 * invoked, ensuring a consistent cleanup without needing explicit
>> > > +	 * refcounting for the serialized objects here.
>> > > +	 */
>> > > +	domain_ser = domain->preserved_state;
>> > > +	domain->ops->unpreserve(domain, domain_ser);
>> > > +	domain_ser->obj.deleted = true;
>> > > +	domain->preserved_state = NULL;
>> > > +}
>> > > +EXPORT_SYMBOL_GPL(iommu_domain_unpreserve);
>> > > +
>> > > +static int iommu_preserve_locked(struct iommu_device *iommu)
>> > > +{
>> > > +	struct iommu_lu_flb_obj *flb_obj;
>> > > +	struct iommu_ser *iommu_ser;
>> > > +	int idx, ret;
>> > > +
>> > > +	if (!iommu->ops->preserve)
>> > > +		return -EOPNOTSUPP;
>> > > +
>> > > +	if (iommu->outgoing_preserved_state) {
>> > > +		iommu->outgoing_preserved_state->obj.ref_count++;
>> > > +		return 0;
>> > > +	}
>> > > +
>> > > +	ret = liveupdate_flb_get_outgoing(&iommu_flb, (void **)&flb_obj);
>> > > +	if (ret)
>> > > +		return ret;
>> > > +
>> > > +	idx = reserve_obj_ser((struct iommu_objs_ser **)&flb_obj->iommus,
>> > > +			      MAX_IOMMU_SERS);
>> > > +	if (idx < 0)
>> > > +		return idx;
>> > > +
>> > > +	iommu_ser = &flb_obj->iommus->iommus[idx];
>> > > +	idx = flb_obj->ser->nr_iommus++;
>> > > +	iommu_ser->obj.idx = idx;
>> > > +	iommu_ser->obj.ref_count = 1;
>> > > +
>> > > +	ret = iommu->ops->preserve(iommu, iommu_ser);
>> > > +	if (ret)
>> > > +		iommu_ser->obj.deleted = true;
>> > > +
>> > > +	iommu->outgoing_preserved_state = iommu_ser;
>> > > +	return ret;
>> > > +}
>> > > +
>> > > +static void iommu_unpreserve_locked(struct iommu_device *iommu)
>> > > +{
>> > > +	struct iommu_ser *iommu_ser = iommu->outgoing_preserved_state;
>> > > +
>> > > +	iommu_ser->obj.ref_count--;
>> > > +	if (iommu_ser->obj.ref_count)
>> > > +		return;
>> > > +
>> > > +	iommu->outgoing_preserved_state = NULL;
>> > > +	iommu->ops->unpreserve(iommu, iommu_ser);
>> > > +	iommu_ser->obj.deleted = true;
>> > > +}
>> > > +
>> > > +int iommu_preserve_device(struct iommu_domain *domain,
>> > > +			  struct device *dev, u64 token)
>> > > +{
>> > > +	struct iommu_lu_flb_obj *flb_obj;
>> > > +	struct device_ser *device_ser;
>> > > +	struct dev_iommu *iommu;
>> > > +	struct pci_dev *pdev;
>> > > +	int ret, idx;
>> > > +
>> > > +	if (!dev_is_pci(dev))
>> > > +		return -EOPNOTSUPP;
>> > > +
>> > > +	if (!domain->preserved_state)
>> > > +		return -EINVAL;
>> > > +
>> > > +	pdev = to_pci_dev(dev);
>> > > +	iommu = dev->iommu;
>> > > +	if (!iommu->iommu_dev->ops->preserve_device ||
>> > > +	    !iommu->iommu_dev->ops->preserve)
>> > > +		return -EOPNOTSUPP;
>> > > +
>> > > +	ret = liveupdate_flb_get_outgoing(&iommu_flb, (void **)&flb_obj);
>> > > +	if (ret)
>> > > +		return ret;
>> > > +
>> > > +	guard(mutex)(&flb_obj->lock);
>> > > +	idx = reserve_obj_ser((struct iommu_objs_ser **)&flb_obj->devices,
>> > > +			      MAX_DEVICE_SERS);
>> > > +	if (idx < 0)
>> > > +		return idx;
>> > > +
>> > > +	device_ser = &flb_obj->devices->devices[idx];
>> > > +	idx = flb_obj->ser->nr_devices++;
>> > > +	device_ser->obj.idx = idx;
>> > > +	device_ser->obj.ref_count = 1;
>> > > +
>> > > +	ret = iommu_preserve_locked(iommu->iommu_dev);
>> > > +	if (ret) {
>> > > +		device_ser->obj.deleted = true;
>> > > +		return ret;
>> > > +	}
>> > > +
>> > > +	device_ser->domain_iommu_ser.domain_phys = __pa(domain->preserved_state);
>> > > +	device_ser->domain_iommu_ser.iommu_phys = __pa(iommu->iommu_dev->outgoing_preserved_state);
>> > > +	device_ser->devid = pci_dev_id(pdev);
>> > > +	device_ser->pci_domain = pci_domain_nr(pdev->bus);
>> > > +	device_ser->token = token;
>> > > +
>> > > +	ret = iommu->iommu_dev->ops->preserve_device(dev, device_ser);
>> > > +	if (ret) {
>> > > +		device_ser->obj.deleted = true;
>> > > +		iommu_unpreserve_locked(iommu->iommu_dev);
>> > > +		return ret;
>> > > +	}
>> > > +
>> > > +	dev->iommu->device_ser = device_ser;
>> > > +	return 0;
>> > > +}
>> > > +
>> > > +void iommu_unpreserve_device(struct iommu_domain *domain, struct device *dev)
>> > > +{
>> > > +	struct iommu_lu_flb_obj *flb_obj;
>> > > +	struct device_ser *device_ser;
>> > > +	struct dev_iommu *iommu;
>> > > +	struct pci_dev *pdev;
>> > > +	int ret;
>> > > +
>> > > +	if (!dev_is_pci(dev))
>> > > +		return;
>> > > +
>> > > +	pdev = to_pci_dev(dev);
>> > > +	iommu = dev->iommu;
>> > > +	if (!iommu->iommu_dev->ops->unpreserve_device ||
>> > > +	    !iommu->iommu_dev->ops->unpreserve)
>> > > +		return;
>> > > +
>> > > +	ret = liveupdate_flb_get_outgoing(&iommu_flb, (void **)&flb_obj);
>> > > +	if (WARN_ON(ret))
>> > > +		return;
>> > > +
>> > > +	guard(mutex)(&flb_obj->lock);
>> > > +	device_ser = dev_iommu_preserved_state(dev);
>> > > +	if (WARN_ON(!device_ser))
>> > > +		return;
>> > > +
>> > > +	iommu->iommu_dev->ops->unpreserve_device(dev, device_ser);
>> > > +	dev->iommu->device_ser = NULL;
>> > > +
>> > > +	iommu_unpreserve_locked(iommu->iommu_dev);
>> > > +}
>> >
>> > I'm wondering if we should guard these APIs against accidental or
>> > potential abuse by other in-kernel drivers, since the Live Update
>> > Orchestrator (LUO) is, for now, architecturally designed around
>> > user-space driven sequences (IOCTLs, specific mutex ordering, etc.).
>> >
>> > Since the header file is also under include/linux, we should avoid any
>> > possibility of drivers ending up depending on these APIs.
>> > We could add a check based on dma owner:
>> >
>> > +	/* Only devices explicitly claimed by a user-space driver
>> > +	 * (VFIO/IOMMUFD) are eligible for Live Update preservation.
>> > +	 */
>> > +	if (!iommu_dma_owner_claimed(dev))
>> > +		return -EPERM;
>> >
>> > This should ensure we aren't creating 'zombie' preserved states for
>> > devices not managed by IOMMUFD/VFIO.
>>
>> Agreed. I will update this.
>> >
>> >
>> > > diff --git a/include/linux/iommu-lu.h b/include/linux/iommu-lu.h
>> > > index 59095d2f1bb2..48c07514a776 100644
>> > > --- a/include/linux/iommu-lu.h
>> > > +++ b/include/linux/iommu-lu.h
>> > > @@ -8,9 +8,128 @@
>> > >  #ifndef _LINUX_IOMMU_LU_H
>> > >  #define _LINUX_IOMMU_LU_H
>> > >
>> > > +#include <linux/device.h>
>> > > +#include <linux/iommu.h>
>> > >  #include <linux/liveupdate.h>
>> > >  #include <linux/kho/abi/iommu.h>
>> > >
>> > > +typedef int (*iommu_preserved_device_iter_fn)(struct device_ser *ser,
>> > > +					      void *arg);
>> > > +#ifdef CONFIG_IOMMU_LIVEUPDATE
>> > > +static inline void *dev_iommu_preserved_state(struct device *dev)
>> > > +{
>> > > +	struct device_ser *ser;
>> > > +
>> > > +	if (!dev->iommu)
>> > > +		return NULL;
>> > > +
>> > > +	ser = dev->iommu->device_ser;
>> > > +	if (ser && !ser->obj.incoming)
>> > > +		return ser;
>> > > +
>> > > +	return NULL;
>> > > +}
>> > > +
>> > > +static inline void *dev_iommu_restored_state(struct device *dev)
>> > > +{
>> > > +	struct device_ser *ser;
>> > > +
>> > > +	if (!dev->iommu)
>> > > +		return NULL;
>> > > +
>> > > +	ser = dev->iommu->device_ser;
>> > > +	if (ser && ser->obj.incoming)
>> > > +		return ser;
>> > > +
>> > > +	return NULL;
>> > > +}
>> > > +
>> > > +static inline void *iommu_domain_restored_state(struct iommu_domain *domain)
>> > > +{
>> > > +	struct iommu_domain_ser *ser;
>> > > +
>> > > +	ser = domain->preserved_state;
>> > > +	if (ser && ser->obj.incoming)
>> > > +		return ser;
>> > > +
>> > > +	return NULL;
>> > > +}
>> > > +
>> > > +static inline int dev_iommu_restore_did(struct device *dev, struct iommu_domain *domain)
>> > > +{
>> > > +	struct device_ser *ser = dev_iommu_restored_state(dev);
>> > > +
>> > > +	if (ser && iommu_domain_restored_state(domain))
>> > > +		return ser->domain_iommu_ser.did;
>> > > +
>> > > +	return -1;
>> > > +}
>> > > +
>> > > +int iommu_for_each_preserved_device(iommu_preserved_device_iter_fn fn,
>> > > +				    void *arg);
>> > > +struct device_ser *iommu_get_device_preserved_data(struct device *dev);
>> > > +struct iommu_ser *iommu_get_preserved_data(u64 token, enum iommu_lu_type type);
>> > > +int iommu_domain_preserve(struct iommu_domain *domain, struct iommu_domain_ser **ser);
>> > > +void iommu_domain_unpreserve(struct iommu_domain *domain);
>> > > +int iommu_preserve_device(struct iommu_domain *domain,
>> > > +			  struct device *dev, u64 token);
>> > > +void iommu_unpreserve_device(struct iommu_domain *domain, struct device *dev);
>> > > +#else
>> > > +static inline void *dev_iommu_preserved_state(struct device *dev)
>> > > +{
>> > > +	return NULL;
>> > > +}
>> > > +
>> > > +static inline void *dev_iommu_restored_state(struct device *dev)
>> > > +{
>> > > +	return NULL;
>> > > +}
>> > > +
>> > > +static inline int dev_iommu_restore_did(struct device *dev, struct iommu_domain *domain)
>> > > +{
>> > > +	return -1;
>> > > +}
>> > > +
>> > > +static inline void *iommu_domain_restored_state(struct iommu_domain *domain)
>> > > +{
>> > > +	return NULL;
>> > > +}
>> > > +
>> > > +static inline int iommu_for_each_preserved_device(iommu_preserved_device_iter_fn fn, void *arg)
>> > > +{
>> > > +	return -EOPNOTSUPP;
>> > > +}
>> > > +
>> > > +static inline struct device_ser *iommu_get_device_preserved_data(struct device *dev)
>> > > +{
>> > > +	return NULL;
>> > > +}
>> > > +
>> > > +static inline struct iommu_ser *iommu_get_preserved_data(u64 token, enum iommu_lu_type type)
>> > > +{
>> > > +	return NULL;
>> > > +}
>> > > +
>> > > +static inline int iommu_domain_preserve(struct iommu_domain *domain, struct iommu_domain_ser **ser)
>> > > +{
>> > > +	return -EOPNOTSUPP;
>> > > +}
>> > > +
>> > > +static inline void iommu_domain_unpreserve(struct iommu_domain *domain)
>> > > +{
>> > > +}
>> > > +
>> > > +static inline int iommu_preserve_device(struct iommu_domain *domain,
>> > > +					struct device *dev, u64 token)
>> > > +{
>> > > +	return -EOPNOTSUPP;
>> > > +}
>> > > +
>> > > +static inline void iommu_unpreserve_device(struct iommu_domain *domain, struct device *dev)
>> > > +{
>> > > +}
>> > > +#endif
>> > > +
>> > >  int iommu_liveupdate_register_flb(struct liveupdate_file_handler *handler);
>> > >  int iommu_liveupdate_unregister_flb(struct liveupdate_file_handler *handler);
>> > >
>> > > diff --git a/include/linux/iommu.h b/include/linux/iommu.h
>> > > index 54b8b48c762e..bd949c1ce7c5 100644
>> > > --- a/include/linux/iommu.h
>> > > +++ b/include/linux/iommu.h
>> > > @@ -14,6 +14,8 @@
>> > >  #include <linux/err.h>
>> > >  #include <linux/of.h>
>> > >  #include <linux/iova_bitmap.h>
>> > > +#include <linux/atomic.h>
>> > > +#include <linux/kho/abi/iommu.h>
>> > >  #include <uapi/linux/iommufd.h>
>> > >
>> > >  #define IOMMU_READ	(1 << 0)
>> > > @@ -248,6 +250,10 @@ struct iommu_domain {
>> > >  			struct list_head next;
>> > >  		};
>> > >  	};
>> > > +
>> > > +#ifdef CONFIG_IOMMU_LIVEUPDATE
>> > > +	struct iommu_domain_ser *preserved_state;
>> > > +#endif
>> > >  };
>> > >
>> > >  static inline bool iommu_is_dma_domain(struct iommu_domain *domain)
>> > > @@ -647,6 +653,10 @@ __iommu_copy_struct_to_user(const struct iommu_user_data *dst_data,
>> > >   *               resources shared/passed to user space IOMMU instance. Associate
>> > >   *               it with a nesting @parent_domain. It is required for driver to
>> > >   *               set @viommu->ops pointing to its own viommu_ops
>> > > + * @preserve_device: Preserve state of a device for liveupdate.
>> > > + * @unpreserve_device: Unpreserve state that was preserved earlier.
>> > > + * @preserve: Preserve state of iommu translation hardware for liveupdate.
>> > > + * @unpreserve: Unpreserve state of iommu that was preserved earlier.
>> > >   * @owner: Driver module providing these ops
>> > >   * @identity_domain: An always available, always attachable identity
>> > >   *                   translation.
>> > > @@ -703,6 +713,11 @@ struct iommu_ops {
>> > >  			   struct iommu_domain *parent_domain,
>> > >  			   const struct iommu_user_data *user_data);
>> > >
>> > > +	int (*preserve_device)(struct device *dev, struct device_ser *device_ser);
>> > > +	void (*unpreserve_device)(struct device *dev, struct device_ser *device_ser);
>> >
>> > Nit: Let's move the _device ops under the comment:
>> > `/* Per device IOMMU features */`
>>
>> I will move these.
>> >
>> > > +	int (*preserve)(struct iommu_device *iommu, struct iommu_ser *iommu_ser);
>> > > +	void (*unpreserve)(struct iommu_device *iommu, struct iommu_ser *iommu_ser);
>> > > +
>> >
>> > I'm wondering if there's any benefit to add these ops under a #ifdef ?
>>
>> These should be NULL when liveupdate is disabled, so the #ifdef should
>> not be needed. I'm not sure how much space we would save by not defining
>> these, but I will move them under the #ifdef anyway.
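For reference, the resulting layout would look roughly like this (a header fragment only, with the surrounding ops elided; the member signatures follow the patch, the #ifdef placement is the proposed change):

```c
struct iommu_ops {
	/* ... existing ops ... */
#ifdef CONFIG_IOMMU_LIVEUPDATE
	int (*preserve_device)(struct device *dev, struct device_ser *device_ser);
	void (*unpreserve_device)(struct device *dev, struct device_ser *device_ser);
	int (*preserve)(struct iommu_device *iommu, struct iommu_ser *iommu_ser);
	void (*unpreserve)(struct iommu_device *iommu, struct iommu_ser *iommu_ser);
#endif
	/* ... */
};
```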
>> >
>> > >  	const struct iommu_domain_ops *default_domain_ops;
>> > >  	struct module *owner;
>> > >  	struct iommu_domain *identity_domain;
>> > > @@ -749,6 +764,11 @@ struct iommu_ops {
>> > >   *                           specific mechanisms.
>> > >   * @set_pgtable_quirks: Set io page table quirks (IO_PGTABLE_QUIRK_*)
>> > >   * @free: Release the domain after use.
>> > > + * @preserve: Preserve the iommu domain for liveupdate.
>> > > + *            Returns 0 on success, a negative errno on failure.
>> > > + * @unpreserve: Unpreserve the iommu domain that was preserved earlier.
>> > > + * @restore: Restore the iommu domain after liveupdate.
>> > > + *           Returns 0 on success, a negative errno on failure.
>> > >   */
>> > >  struct iommu_domain_ops {
>> > >  	int (*attach_dev)(struct iommu_domain *domain, struct device *dev,
>> > > @@ -779,6 +799,9 @@ struct iommu_domain_ops {
>> > >  				  unsigned long quirks);
>> > >
>> > >  	void (*free)(struct iommu_domain *domain);
>> > > +	int (*preserve)(struct iommu_domain *domain, struct iommu_domain_ser *ser);
>> > > +	void (*unpreserve)(struct iommu_domain *domain, struct iommu_domain_ser *ser);
>> > > +	int (*restore)(struct iommu_domain *domain, struct iommu_domain_ser *ser);
>> > >  };
>> > >
>> > >  /**
>> > > @@ -790,6 +813,8 @@ struct iommu_domain_ops {
>> > >   * @singleton_group: Used internally for drivers that have only one group
>> > >   * @max_pasids: number of supported PASIDs
>> > >   * @ready: set once iommu_device_register() has completed successfully
>> > > + * @outgoing_preserved_state: preserved iommu state of outgoing kernel for
>> > > + * liveupdate.
>> > >   */
>> > >  struct iommu_device {
>> > >  	struct list_head list;
>> > > @@ -799,6 +824,10 @@ struct iommu_device {
>> > >  	struct iommu_group *singleton_group;
>> > >  	u32 max_pasids;
>> > >  	bool ready;
>> > > +
>> > > +#ifdef CONFIG_IOMMU_LIVEUPDATE
>> > > +	struct iommu_ser *outgoing_preserved_state;
>> > > +#endif
>> > >  };
>> > >
>> > >  /**
>> > > @@ -853,6 +882,9 @@ struct dev_iommu {
>> > >  	u32				pci_32bit_workaround:1;
>> > >  	u32				require_direct:1;
>> > >  	u32				shadow_on_flush:1;
>> > > +#ifdef CONFIG_IOMMU_LIVEUPDATE
>> > > +	struct device_ser		*device_ser;
>> > > +#endif
>> > >  };
>> > >
>> > >  int iommu_device_register(struct iommu_device *iommu,
>> >
>> > Thanks,
>> > Praan
>> >
>> > [1] https://elixir.bootlin.com/linux/v7.0-rc3/source/kernel/liveupdate/kexec_handover.c#L1182

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 02/14] iommu: Implement IOMMU core liveupdate skeleton
  2026-03-17 20:13         ` Samiullah Khawaja
@ 2026-03-17 20:23           ` Pranjal Shrivastava
  2026-03-17 21:03             ` Vipin Sharma
  2026-03-18 17:49             ` Samiullah Khawaja
  0 siblings, 2 replies; 98+ messages in thread
From: Pranjal Shrivastava @ 2026-03-17 20:23 UTC (permalink / raw)
  To: Samiullah Khawaja
  Cc: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe, Robin Murphy, Kevin Tian, Alex Williamson,
	Shuah Khan, iommu, linux-kernel, kvm, Saeed Mahameed,
	Adithya Jayachandran, Parav Pandit, Leon Romanovsky, William Tu,
	Pratyush Yadav, Pasha Tatashin, David Matlack, Andrew Morton,
	Chris Li, Vipin Sharma, YiFei Zhu

On Tue, Mar 17, 2026 at 08:13:47PM +0000, Samiullah Khawaja wrote:
> On Tue, Mar 17, 2026 at 08:09:26PM +0000, Pranjal Shrivastava wrote:
> > On Fri, Mar 13, 2026 at 06:42:05PM +0000, Samiullah Khawaja wrote:
> > > On Thu, Mar 12, 2026 at 11:10:36PM +0000, Pranjal Shrivastava wrote:
> > > > On Tue, Feb 03, 2026 at 10:09:36PM +0000, Samiullah Khawaja wrote:
> > > > > Add IOMMU domain ops that can be implemented by the IOMMU drivers if
> > > > > they support IOMMU domain preservation across liveupdate. The new IOMMU
> > > > > domain preserve, unpreserve and restore APIs call these ops to perform
> > > > > respective live update operations.
> > > > >
> > > > > Similarly add IOMMU ops to preserve/unpreserve a device. These can be
> > > > > implemented by the IOMMU drivers that support preservation of devices
> > > > > that have their IOMMU domains preserved. During device preservation the
> > > > > state of the associated IOMMU is also preserved. The device can only be
> > > > > preserved if the attached iommu domain is preserved and the associated
> > > > > iommu supports preservation.
> > > > >
> > > > > The preserved state of the device and IOMMU needs to be fetched during
> > > > > shutdown and boot in the next kernel. Add APIs that can be used to fetch
> > > > > the preserved state of a device and IOMMU. The APIs will only be used
> > > > > during shutdown and after liveupdate so no locking needed.
> > > > >
> > > > > Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
> > > > > ---
> > > > >  drivers/iommu/iommu.c      |   3 +
> > > > >  drivers/iommu/liveupdate.c | 326 +++++++++++++++++++++++++++++++++++++
> > > > >  include/linux/iommu-lu.h   | 119 ++++++++++++++
> > > > >  include/linux/iommu.h      |  32 ++++
> > > > >  4 files changed, 480 insertions(+)
> > > > >
> > > > > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> > > > > index 4926a43118e6..c0632cb5b570 100644
> > > > > --- a/drivers/iommu/iommu.c
> > > > > +++ b/drivers/iommu/iommu.c
> > > > > @@ -389,6 +389,9 @@ static struct dev_iommu *dev_iommu_get(struct device *dev)
> > > > >
> > > > >  	mutex_init(&param->lock);
> > > > >  	dev->iommu = param;
> > > > > +#ifdef CONFIG_IOMMU_LIVEUPDATE
> > > > > +	dev->iommu->device_ser = NULL;
> > > > > +#endif
> > > > >  	return param;
> > > > >  }
> > > > >
> > > > > diff --git a/drivers/iommu/liveupdate.c b/drivers/iommu/liveupdate.c
> > > > > index 6189ba32ff2c..83eb609b3fd7 100644
> > > > > --- a/drivers/iommu/liveupdate.c
> > > > > +++ b/drivers/iommu/liveupdate.c
> > > > > @@ -11,6 +11,7 @@
> > > > >  #include <linux/liveupdate.h>
> > > > >  #include <linux/iommu-lu.h>
> > > > >  #include <linux/iommu.h>
> > > > > +#include <linux/pci.h>
> > > > >  #include <linux/errno.h>
> > > > >
> > > > >  static void iommu_liveupdate_restore_objs(u64 next)
> > > > > @@ -175,3 +176,328 @@ int iommu_liveupdate_unregister_flb(struct liveupdate_file_handler *handler)
> > > > >  	return liveupdate_unregister_flb(handler, &iommu_flb);
> > > > >  }
> > > > >  EXPORT_SYMBOL(iommu_liveupdate_unregister_flb);
> > > > > +
> > > > > +int iommu_for_each_preserved_device(iommu_preserved_device_iter_fn fn,
> > > > > +				    void *arg)
> > > > > +{
> > > > > +	struct iommu_lu_flb_obj *obj;
> > > > > +	struct devices_ser *devices;
> > > > > +	int ret, i, idx;
> > > > > +
> > > > > +	ret = liveupdate_flb_get_incoming(&iommu_flb, (void **)&obj);
> > > > > +	if (ret)
> > > > > +		return -ENOENT;
> > > > > +
> > > > > +	devices = __va(obj->ser->devices_phys);
> > > > > +	for (i = 0, idx = 0; i < obj->ser->nr_devices; ++i, ++idx) {
> > > > > +		if (idx >= MAX_DEVICE_SERS) {
> > > > > +			devices = __va(devices->objs.next_objs);
> > > > > +			idx = 0;
> > > > > +		}
> > > > > +
> > > > > +		if (devices->devices[idx].obj.deleted)
> > > > > +			continue;
> > > > > +
> > > > > +		ret = fn(&devices->devices[idx], arg);
> > > > > +		if (ret)
> > > > > +			return ret;
> > > > > +	}
> > > > > +
> > > > > +	return 0;
> > > > > +}
> > > > > +EXPORT_SYMBOL(iommu_for_each_preserved_device);
> > > > > +
> > > > > +static inline bool device_ser_match(struct device_ser *match,
> > > > > +				    struct pci_dev *pdev)
> > > > > +{
> > > > > +	return match->devid == pci_dev_id(pdev) && match->pci_domain == pci_domain_nr(pdev->bus);
> > > > > +}
> > > > > +
> > > > > +struct device_ser *iommu_get_device_preserved_data(struct device *dev)
> > > > > +{
> > > > > +	struct iommu_lu_flb_obj *obj;
> > > > > +	struct devices_ser *devices;
> > > > > +	int ret, i, idx;
> > > > > +
> > > > > +	if (!dev_is_pci(dev))
> > > > > +		return NULL;
> > > > > +
> > > > > +	ret = liveupdate_flb_get_incoming(&iommu_flb, (void **)&obj);
> > > > > +	if (ret)
> > > > > +		return NULL;
> > > > > +
> > > > > +	devices = __va(obj->ser->devices_phys);
> > > > > +	for (i = 0, idx = 0; i < obj->ser->nr_devices; ++i, ++idx) {
> > > > > +		if (idx >= MAX_DEVICE_SERS) {
> > > > > +			devices = __va(devices->objs.next_objs);
> > > > > +			idx = 0;
> > > > > +		}
> > > > > +
> > > > > +		if (devices->devices[idx].obj.deleted)
> > > > > +			continue;
> > > > > +
> > > > > +		if (device_ser_match(&devices->devices[idx], to_pci_dev(dev))) {
> > > > > +			devices->devices[idx].obj.incoming = true;
> > > > > +			return &devices->devices[idx];
> > > > > +		}
> > > > > +	}
> > > > > +
> > > > > +	return NULL;
> > > > > +}
> > > > > +EXPORT_SYMBOL(iommu_get_device_preserved_data);
> > > > > +
> > > > > +struct iommu_ser *iommu_get_preserved_data(u64 token, enum iommu_lu_type type)
> > > > > +{
> > > > > +	struct iommu_lu_flb_obj *obj;
> > > > > +	struct iommus_ser *iommus;
> > > > > +	int ret, i, idx;
> > > > > +
> > > > > +	ret = liveupdate_flb_get_incoming(&iommu_flb, (void **)&obj);
> > > > > +	if (ret)
> > > > > +		return NULL;
> > > > > +
> > > > > +	iommus = __va(obj->ser->iommus_phys);
> > > > > +	for (i = 0, idx = 0; i < obj->ser->nr_iommus; ++i, ++idx) {
> > > > > +		if (idx >= MAX_IOMMU_SERS) {
> > > > > +			iommus = __va(iommus->objs.next_objs);
> > > > > +			idx = 0;
> > > > > +		}
> > > > > +
> > > > > +		if (iommus->iommus[idx].obj.deleted)
> > > > > +			continue;
> > > > > +
> > > > > +		if (iommus->iommus[idx].token == token &&
> > > > > +		    iommus->iommus[idx].type == type)
> > > > > +			return &iommus->iommus[idx];
> > > > > +	}
> > > > > +
> > > > > +	return NULL;
> > > > > +}
> > > > > +EXPORT_SYMBOL(iommu_get_preserved_data);
> > > > > +
> > 
> > Also, I don't see these helpers being used anywhere in this series yet?
> > Should we add them with the phase-2 series? When we actually "Restore"
> > the domain?
> 
> Phase 1 does restore the iommu and domains during boot. But the
> iommufd/hwpt state is not restored or re-associated with the restored
> domains.
> 
> These helpers are used to fetch the preserved state during boot.
> 
> Thanks,
> Sami

I see these are used in PATCH 8; should we introduce these helpers in
PATCH 8, where we actually use them, as we introduce iommu_restore_domain
there?

Thanks,
Praan

> > 
> > > > > +static int reserve_obj_ser(struct iommu_objs_ser **objs_ptr, u64 max_objs)
> > > >
> > > > Isn't this more of an "insert" / "populate" / write_ser_entry? We can
> > > > rename it to something like iommu_ser_push_entry / iommu_ser_write_entry
> > > 
> > > This is reserving an object in the objects array. The object is filled
> > > once reservation is done. Maybe I can call it alloc_obj_ser or
> > > alloc_entry_ser.
> > > >
> > > > > +{
> > > > > +	struct iommu_objs_ser *next_objs, *objs = *objs_ptr;
> > > >
> > > > Not loving these names :(
> > > >
> > > > TBH, the reserve_obj_ser function isn't too readable, especially with
> > > > all the variable names, here and in the lu header. I've had to go back
> > > > and forth to the first patch. For example, here next_objs could be
> > > > next_objs_page and objs could be curr_page. (PTAL at my reply on
> > > > PATCH 01 about renaming.)
> > > 
> > > Basically, with the current naming there are "serialization objects" in
> > > "serialization object arrays". The object is a "base" type with
> > > "inherited" types for iommu, domain, etc. I will add an explanation of
> > > all this in the ABI header.
> > > 
> > > Maybe we can name it to:
> > > 
> > > struct iommu_obj_ser;
> > > struct iommu_obj_array_ser; or struct iommu_obj_ser_array;
> > > 
> > > Then for reserve/alloc, we use names like "next_obj_array" and
> > > "curr_obj_array"?
> > > >
> > > > > +	int idx;
> > > > > +
> > > > > +	if (objs->nr_objs == max_objs) {
> > > > > +		next_objs = kho_alloc_preserve(PAGE_SIZE);
> > > > > +		if (IS_ERR(next_objs))
> > > > > +			return PTR_ERR(next_objs);
> > > > > +
> > > > > +		objs->next_objs = virt_to_phys(next_objs);
> > > > > +		objs = next_objs;
> > > > > +		*objs_ptr = objs;
> > > > > +		objs->nr_objs = 0;
> > > > > +		objs->next_objs = 0;
> > > >
> > > > This seems redundant, no need to zero these out, kho_alloc_preserve
> > > > passes __GFP_ZERO to folio_alloc [1], which should z-alloc the pages.
> > > 
> > > Agreed. Will remove this.
> > > >
> > > > > +	}
> > > > > +
> > > > > +	idx = objs->nr_objs++;
> > > > > +	return idx;
> > > > > +}
> > > >
> > > > Just to give a mental model to fellow reviewers, here's how this is laid
> > > > out:
> > > >
> > > > ----------------------------------------------------------------------
> > > > [ PAGE START ]                                                       |
> > > > ----------------------------------------------------------------------
> > > > | iommu_objs_ser (The Page Header)                                   |
> > > > |   - next_objs: 0x0000 (End of the page-chain)                      |
> > > > |   - nr_objs: 2                                                     |
> > > > ----------------------------------------------------------------------
> > > > | ITEM 0: iommu_domain_ser                                           |
> > > > |   [ iommu_obj_ser (The entry header) ]                             |
> > > > |     - idx: 0                                                       |
> > > > |     - ref_count: 1                                                 |
> > > > |     - deleted: 0                                                   |
> > > > |   [ Domain Data ]                                                  |
> > > > ----------------------------------------------------------------------
> > > > | ITEM 1: iommu_domain_ser                                           |
> > > > |   [ iommu_obj_ser (The entry header) ]                             |
> > > > |     - idx: 1                                                       |
> > > > |     - ref_count: 1                                                 |
> > > > |     - deleted: 0                                                   |
> > > > |   [ Domain Data ]                                                  |
> > > > ----------------------------------------------------------------------
> > > > | ... (Empty space for more domains) ...                             |
> > > > |                                                                    |
> > > > ----------------------------------------------------------------------
> > > > [ PAGE END ]                                                         |
> > > > ----------------------------------------------------------------------
> > > 
> > > +1
> > > 
> > > Will add a table in the header as you suggested.
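The chained-array behavior in that layout can be modeled in a few lines of standalone C (illustrative names, not the kernel's; the kernel version stores a physical address in next_objs and allocates via kho_alloc_preserve() rather than calloc()):

```c
#include <assert.h>
#include <stdlib.h>

/* Model of one serialized-object page: a small header plus a
 * fixed-capacity array of entries (the entries themselves are elided). */
#define MAX_OBJS_PER_PAGE 4

struct objs_page {
	struct objs_page *next;	/* the kernel stores a phys addr instead */
	int nr_objs;
};

/*
 * Reserve the next free slot, chaining in a fresh zeroed page when the
 * current one is full. Like reserve_obj_ser(), this advances *curr to
 * the new tail page as a side effect; the caller fills the slot after
 * the reservation succeeds.
 */
static int reserve_obj(struct objs_page **curr)
{
	struct objs_page *page = *curr;

	if (page->nr_objs == MAX_OBJS_PER_PAGE) {
		struct objs_page *next = calloc(1, sizeof(*next));

		if (!next)
			return -1;
		page->next = next;
		*curr = next;
		page = next;
	}

	return page->nr_objs++;
}
```

After four reservations fill the first page, the fifth spills into a freshly chained page and the index restarts at 0, matching the page-chain walk (`devices = __va(devices->objs.next_objs); idx = 0;`) on the restore side.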
> > > >
> > > > > +
> > > > > +int iommu_domain_preserve(struct iommu_domain *domain, struct iommu_domain_ser **ser)
> > > > > +{
> > > > > +	struct iommu_domain_ser *domain_ser;
> > > > > +	struct iommu_lu_flb_obj *flb_obj;
> > > > > +	int idx, ret;
> > > > > +
> > > > > +	if (!domain->ops->preserve)
> > > > > +		return -EOPNOTSUPP;
> > > > > +
> > > > > +	ret = liveupdate_flb_get_outgoing(&iommu_flb, (void **)&flb_obj);
> > > > > +	if (ret)
> > > > > +		return ret;
> > > > > +
> > > > > +	guard(mutex)(&flb_obj->lock);
> > > > > +	idx = reserve_obj_ser((struct iommu_objs_ser **)&flb_obj->iommu_domains,
> > > > > +			      MAX_IOMMU_DOMAIN_SERS);
> > > > > +	if (idx < 0)
> > > > > +		return idx;
> > > > > +
> > > > > +	domain_ser = &flb_obj->iommu_domains->iommu_domains[idx];
> > > >
> > > > This is slightly less readable as well. I understand we're trying to
> > > > do:
> > > >
> > > > iommu_domains_ser -> iommu_domain_ser[idx], but the same name
> > > > (iommu_domains) makes it difficult to read. We should rename this as:
> > > 
> > > Agreed.
> > > 
> > > I will update it to,
> > > 
> > > &flb_obj->curr_iommu_domains->domain_array[idx]
> > > >
> > > > &flb_obj->iommu_domains_page->domain_entries[idx] or something for
> > > > better readability..
> > > >
> > > > Also, let's add a comment explaining that reserve_obj_ser actually
> > > > advances the flb_obj ptr when necessary..
> > > 
> > > Agreed.
> > > >
> > > > IIUC, we start with PAGE 0 initially, store its phys in the
> > > > iommu_flb_preserve op (the iommu_ser_phys et al.) and then we go on
> > > > allocating more pages and keep storing the "current" active page with
> > > > the liveupdate core. Now when we jump into the new kernel, we read
> > > > the ser_phys and then follow the page chain, right?
> > > 
> > > Yes, we do an initial allocation of arrays for each object type and
> > > then later allocate more as needed. The flb_obj holds the address of
> > > the currently active array.
> > > 
> > > I will add this explanation in the header.
> > > >
> > > > > +	idx = flb_obj->ser->nr_domains++;
> > > > > +	domain_ser->obj.idx = idx;
> > > > > +	domain_ser->obj.ref_count = 1;
> > > > > +
> > > > > +	ret = domain->ops->preserve(domain, domain_ser);
> > > > > +	if (ret) {
> > > > > +		domain_ser->obj.deleted = true;
> > > > > +		return ret;
> > > > > +	}
> > > > > +
> > > > > +	domain->preserved_state = domain_ser;
> > > > > +	*ser = domain_ser;
> > > > > +	return 0;
> > > > > +}
> > > > > +EXPORT_SYMBOL_GPL(iommu_domain_preserve);
> > > > > +
> > > > > +void iommu_domain_unpreserve(struct iommu_domain *domain)
> > > > > +{
> > > > > +	struct iommu_domain_ser *domain_ser;
> > > > > +	struct iommu_lu_flb_obj *flb_obj;
> > > > > +	int ret;
> > > > > +
> > > > > +	if (!domain->ops->unpreserve)
> > > > > +		return;
> > > > > +
> > > > > +	ret = liveupdate_flb_get_outgoing(&iommu_flb, (void **)&flb_obj);
> > > > > +	if (ret)
> > > > > +		return;
> > > > > +
> > > > > +	guard(mutex)(&flb_obj->lock);
> > > > > +
> > > > > +	/*
> > > > > +	 * There is no check for attached devices here. The correctness relies
> > > > > +	 * on the Live Update Orchestrator's session lifecycle. All resources
> > > > > +	 * (iommufd, vfio devices) are preserved within a single session. If the
> > > > > +	 * session is torn down, the .unpreserve callbacks for all files will be
> > > > > +	 * invoked, ensuring a consistent cleanup without needing explicit
> > > > > +	 * refcounting for the serialized objects here.
> > > > > +	 */
> > > > > +	domain_ser = domain->preserved_state;
> > > > > +	domain->ops->unpreserve(domain, domain_ser);
> > > > > +	domain_ser->obj.deleted = true;
> > > > > +	domain->preserved_state = NULL;
> > > > > +}
> > > > > +EXPORT_SYMBOL_GPL(iommu_domain_unpreserve);
> > > > > +
> > > > > +static int iommu_preserve_locked(struct iommu_device *iommu)
> > > > > +{
> > > > > +	struct iommu_lu_flb_obj *flb_obj;
> > > > > +	struct iommu_ser *iommu_ser;
> > > > > +	int idx, ret;
> > > > > +
> > > > > +	if (!iommu->ops->preserve)
> > > > > +		return -EOPNOTSUPP;
> > > > > +
> > > > > +	if (iommu->outgoing_preserved_state) {
> > > > > +		iommu->outgoing_preserved_state->obj.ref_count++;
> > > > > +		return 0;
> > > > > +	}
> > > > > +
> > > > > +	ret = liveupdate_flb_get_outgoing(&iommu_flb, (void **)&flb_obj);
> > > > > +	if (ret)
> > > > > +		return ret;
> > > > > +
> > > > > +	idx = reserve_obj_ser((struct iommu_objs_ser **)&flb_obj->iommus,
> > > > > +			      MAX_IOMMU_SERS);
> > > > > +	if (idx < 0)
> > > > > +		return idx;
> > > > > +
> > > > > +	iommu_ser = &flb_obj->iommus->iommus[idx];
> > > > > +	idx = flb_obj->ser->nr_iommus++;
> > > > > +	iommu_ser->obj.idx = idx;
> > > > > +	iommu_ser->obj.ref_count = 1;
> > > > > +
> > > > > +	ret = iommu->ops->preserve(iommu, iommu_ser);
> > > > > +	if (ret)
> > > > > +		iommu_ser->obj.deleted = true;
> > > > > +
> > > > > +	iommu->outgoing_preserved_state = iommu_ser;
> > > > > +	return ret;
> > > > > +}
> > > > > +
> > > > > +static void iommu_unpreserve_locked(struct iommu_device *iommu)
> > > > > +{
> > > > > +	struct iommu_ser *iommu_ser = iommu->outgoing_preserved_state;
> > > > > +
> > > > > +	iommu_ser->obj.ref_count--;
> > > > > +	if (iommu_ser->obj.ref_count)
> > > > > +		return;
> > > > > +
> > > > > +	iommu->outgoing_preserved_state = NULL;
> > > > > +	iommu->ops->unpreserve(iommu, iommu_ser);
> > > > > +	iommu_ser->obj.deleted = true;
> > > > > +}
> > > > > +
> > > > > +int iommu_preserve_device(struct iommu_domain *domain,
> > > > > +			  struct device *dev, u64 token)
> > > > > +{
> > > > > +	struct iommu_lu_flb_obj *flb_obj;
> > > > > +	struct device_ser *device_ser;
> > > > > +	struct dev_iommu *iommu;
> > > > > +	struct pci_dev *pdev;
> > > > > +	int ret, idx;
> > > > > +
> > > > > +	if (!dev_is_pci(dev))
> > > > > +		return -EOPNOTSUPP;
> > > > > +
> > > > > +	if (!domain->preserved_state)
> > > > > +		return -EINVAL;
> > > > > +
> > > > > +	pdev = to_pci_dev(dev);
> > > > > +	iommu = dev->iommu;
> > > > > +	if (!iommu->iommu_dev->ops->preserve_device ||
> > > > > +	    !iommu->iommu_dev->ops->preserve)
> > > > > +		return -EOPNOTSUPP;
> > > > > +
> > > > > +	ret = liveupdate_flb_get_outgoing(&iommu_flb, (void **)&flb_obj);
> > > > > +	if (ret)
> > > > > +		return ret;
> > > > > +
> > > > > +	guard(mutex)(&flb_obj->lock);
> > > > > +	idx = reserve_obj_ser((struct iommu_objs_ser **)&flb_obj->devices,
> > > > > +			      MAX_DEVICE_SERS);
> > > > > +	if (idx < 0)
> > > > > +		return idx;
> > > > > +
> > > > > +	device_ser = &flb_obj->devices->devices[idx];
> > > > > +	idx = flb_obj->ser->nr_devices++;
> > > > > +	device_ser->obj.idx = idx;
> > > > > +	device_ser->obj.ref_count = 1;
> > > > > +
> > > > > +	ret = iommu_preserve_locked(iommu->iommu_dev);
> > > > > +	if (ret) {
> > > > > +		device_ser->obj.deleted = true;
> > > > > +		return ret;
> > > > > +	}
> > > > > +
> > > > > +	device_ser->domain_iommu_ser.domain_phys = __pa(domain->preserved_state);
> > > > > +	device_ser->domain_iommu_ser.iommu_phys = __pa(iommu->iommu_dev->outgoing_preserved_state);
> > > > > +	device_ser->devid = pci_dev_id(pdev);
> > > > > +	device_ser->pci_domain = pci_domain_nr(pdev->bus);
> > > > > +	device_ser->token = token;
> > > > > +
> > > > > +	ret = iommu->iommu_dev->ops->preserve_device(dev, device_ser);
> > > > > +	if (ret) {
> > > > > +		device_ser->obj.deleted = true;
> > > > > +		iommu_unpreserve_locked(iommu->iommu_dev);
> > > > > +		return ret;
> > > > > +	}
> > > > > +
> > > > > +	dev->iommu->device_ser = device_ser;
> > > > > +	return 0;
> > > > > +}
> > > > > +
> > > > > +void iommu_unpreserve_device(struct iommu_domain *domain, struct device *dev)
> > > > > +{
> > > > > +	struct iommu_lu_flb_obj *flb_obj;
> > > > > +	struct device_ser *device_ser;
> > > > > +	struct dev_iommu *iommu;
> > > > > +	struct pci_dev *pdev;
> > > > > +	int ret;
> > > > > +
> > > > > +	if (!dev_is_pci(dev))
> > > > > +		return;
> > > > > +
> > > > > +	pdev = to_pci_dev(dev);
> > > > > +	iommu = dev->iommu;
> > > > > +	if (!iommu->iommu_dev->ops->unpreserve_device ||
> > > > > +	    !iommu->iommu_dev->ops->unpreserve)
> > > > > +		return;
> > > > > +
> > > > > +	ret = liveupdate_flb_get_outgoing(&iommu_flb, (void **)&flb_obj);
> > > > > +	if (WARN_ON(ret))
> > > > > +		return;
> > > > > +
> > > > > +	guard(mutex)(&flb_obj->lock);
> > > > > +	device_ser = dev_iommu_preserved_state(dev);
> > > > > +	if (WARN_ON(!device_ser))
> > > > > +		return;
> > > > > +
> > > > > +	iommu->iommu_dev->ops->unpreserve_device(dev, device_ser);
> > > > > +	dev->iommu->device_ser = NULL;
> > > > > +
> > > > > +	iommu_unpreserve_locked(iommu->iommu_dev);
> > > > > +}
> > > >
> > > > I'm wondering if we should guard these APIs against accidental or
> > > > potential abuse by other in-kernel drivers, at least for now, since
> > > > the Live Update Orchestrator (LUO) is architecturally designed
> > > > around user-space driven sequences (ioctls, specific mutex
> > > > ordering, etc.).
> > > >
> > > > Since the header file is also under include/linux, we should avoid any
> > > > possibility where we end up having drivers depending on these APIs.
> > > > We could add a check based on dma owner:
> > > >
> > > > +	/* Only devices explicitly claimed by a user-space driver
> > > > +	 * (VFIO/IOMMUFD) are eligible for Live Update preservation.
> > > > +	 */
> > > > +	if (!iommu_dma_owner_claimed(dev))
> > > > +		return -EPERM;
> > > >
> > > > This should ensure we aren't creating 'zombie' preserved states for
> > > > devices not managed by IOMMUFD/VFIO.
> > > 
> > > Agreed. I will update this.
> > > >
> > > >
> > > > > diff --git a/include/linux/iommu-lu.h b/include/linux/iommu-lu.h
> > > > > index 59095d2f1bb2..48c07514a776 100644
> > > > > --- a/include/linux/iommu-lu.h
> > > > > +++ b/include/linux/iommu-lu.h
> > > > > @@ -8,9 +8,128 @@
> > > > >  #ifndef _LINUX_IOMMU_LU_H
> > > > >  #define _LINUX_IOMMU_LU_H
> > > > >
> > > > > +#include <linux/device.h>
> > > > > +#include <linux/iommu.h>
> > > > >  #include <linux/liveupdate.h>
> > > > >  #include <linux/kho/abi/iommu.h>
> > > > >
> > > > > +typedef int (*iommu_preserved_device_iter_fn)(struct device_ser *ser,
> > > > > +					      void *arg);
> > > > > +#ifdef CONFIG_IOMMU_LIVEUPDATE
> > > > > +static inline void *dev_iommu_preserved_state(struct device *dev)
> > > > > +{
> > > > > +	struct device_ser *ser;
> > > > > +
> > > > > +	if (!dev->iommu)
> > > > > +		return NULL;
> > > > > +
> > > > > +	ser = dev->iommu->device_ser;
> > > > > +	if (ser && !ser->obj.incoming)
> > > > > +		return ser;
> > > > > +
> > > > > +	return NULL;
> > > > > +}
> > > > > +
> > > > > +static inline void *dev_iommu_restored_state(struct device *dev)
> > > > > +{
> > > > > +	struct device_ser *ser;
> > > > > +
> > > > > +	if (!dev->iommu)
> > > > > +		return NULL;
> > > > > +
> > > > > +	ser = dev->iommu->device_ser;
> > > > > +	if (ser && ser->obj.incoming)
> > > > > +		return ser;
> > > > > +
> > > > > +	return NULL;
> > > > > +}
> > > > > +
> > > > > +static inline void *iommu_domain_restored_state(struct iommu_domain *domain)
> > > > > +{
> > > > > +	struct iommu_domain_ser *ser;
> > > > > +
> > > > > +	ser = domain->preserved_state;
> > > > > +	if (ser && ser->obj.incoming)
> > > > > +		return ser;
> > > > > +
> > > > > +	return NULL;
> > > > > +}
> > > > > +
> > > > > +static inline int dev_iommu_restore_did(struct device *dev, struct iommu_domain *domain)
> > > > > +{
> > > > > +	struct device_ser *ser = dev_iommu_restored_state(dev);
> > > > > +
> > > > > +	if (ser && iommu_domain_restored_state(domain))
> > > > > +		return ser->domain_iommu_ser.did;
> > > > > +
> > > > > +	return -1;
> > > > > +}
> > > > > +
> > > > > +int iommu_for_each_preserved_device(iommu_preserved_device_iter_fn fn,
> > > > > +				    void *arg);
> > > > > +struct device_ser *iommu_get_device_preserved_data(struct device *dev);
> > > > > +struct iommu_ser *iommu_get_preserved_data(u64 token, enum iommu_lu_type type);
> > > > > +int iommu_domain_preserve(struct iommu_domain *domain, struct iommu_domain_ser **ser);
> > > > > +void iommu_domain_unpreserve(struct iommu_domain *domain);
> > > > > +int iommu_preserve_device(struct iommu_domain *domain,
> > > > > +			  struct device *dev, u64 token);
> > > > > +void iommu_unpreserve_device(struct iommu_domain *domain, struct device *dev);
> > > > > +#else
> > > > > +static inline void *dev_iommu_preserved_state(struct device *dev)
> > > > > +{
> > > > > +	return NULL;
> > > > > +}
> > > > > +
> > > > > +static inline void *dev_iommu_restored_state(struct device *dev)
> > > > > +{
> > > > > +	return NULL;
> > > > > +}
> > > > > +
> > > > > +static inline int dev_iommu_restore_did(struct device *dev, struct iommu_domain *domain)
> > > > > +{
> > > > > +	return -1;
> > > > > +}
> > > > > +
> > > > > +static inline void *iommu_domain_restored_state(struct iommu_domain *domain)
> > > > > +{
> > > > > +	return NULL;
> > > > > +}
> > > > > +
> > > > > +static inline int iommu_for_each_preserved_device(iommu_preserved_device_iter_fn fn, void *arg)
> > > > > +{
> > > > > +	return -EOPNOTSUPP;
> > > > > +}
> > > > > +
> > > > > +static inline struct device_ser *iommu_get_device_preserved_data(struct device *dev)
> > > > > +{
> > > > > +	return NULL;
> > > > > +}
> > > > > +
> > > > > +static inline struct iommu_ser *iommu_get_preserved_data(u64 token, enum iommu_lu_type type)
> > > > > +{
> > > > > +	return NULL;
> > > > > +}
> > > > > +
> > > > > +static inline int iommu_domain_preserve(struct iommu_domain *domain, struct iommu_domain_ser **ser)
> > > > > +{
> > > > > +	return -EOPNOTSUPP;
> > > > > +}
> > > > > +
> > > > > +static inline void iommu_domain_unpreserve(struct iommu_domain *domain)
> > > > > +{
> > > > > +}
> > > > > +
> > > > > +static inline int iommu_preserve_device(struct iommu_domain *domain,
> > > > > +					struct device *dev, u64 token)
> > > > > +{
> > > > > +	return -EOPNOTSUPP;
> > > > > +}
> > > > > +
> > > > > +static inline void iommu_unpreserve_device(struct iommu_domain *domain, struct device *dev)
> > > > > +{
> > > > > +}
> > > > > +#endif
> > > > > +
> > > > >  int iommu_liveupdate_register_flb(struct liveupdate_file_handler *handler);
> > > > >  int iommu_liveupdate_unregister_flb(struct liveupdate_file_handler *handler);
> > > > >
> > > > > diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> > > > > index 54b8b48c762e..bd949c1ce7c5 100644
> > > > > --- a/include/linux/iommu.h
> > > > > +++ b/include/linux/iommu.h
> > > > > @@ -14,6 +14,8 @@
> > > > >  #include <linux/err.h>
> > > > >  #include <linux/of.h>
> > > > >  #include <linux/iova_bitmap.h>
> > > > > +#include <linux/atomic.h>
> > > > > +#include <linux/kho/abi/iommu.h>
> > > > >  #include <uapi/linux/iommufd.h>
> > > > >
> > > > >  #define IOMMU_READ	(1 << 0)
> > > > > @@ -248,6 +250,10 @@ struct iommu_domain {
> > > > >  			struct list_head next;
> > > > >  		};
> > > > >  	};
> > > > > +
> > > > > +#ifdef CONFIG_IOMMU_LIVEUPDATE
> > > > > +	struct iommu_domain_ser *preserved_state;
> > > > > +#endif
> > > > >  };
> > > > >
> > > > >  static inline bool iommu_is_dma_domain(struct iommu_domain *domain)
> > > > > @@ -647,6 +653,10 @@ __iommu_copy_struct_to_user(const struct iommu_user_data *dst_data,
> > > > >   *               resources shared/passed to user space IOMMU instance. Associate
> > > > >   *               it with a nesting @parent_domain. It is required for driver to
> > > > >   *               set @viommu->ops pointing to its own viommu_ops
> > > > > + * @preserve_device: Preserve state of a device for liveupdate.
> > > > > + * @unpreserve_device: Unpreserve state that was preserved earlier.
> > > > > + * @preserve: Preserve state of iommu translation hardware for liveupdate.
> > > > > + * @unpreserve: Unpreserve state of iommu that was preserved earlier.
> > > > >   * @owner: Driver module providing these ops
> > > > >   * @identity_domain: An always available, always attachable identity
> > > > >   *                   translation.
> > > > > @@ -703,6 +713,11 @@ struct iommu_ops {
> > > > >  			   struct iommu_domain *parent_domain,
> > > > >  			   const struct iommu_user_data *user_data);
> > > > >
> > > > > +	int (*preserve_device)(struct device *dev, struct device_ser *device_ser);
> > > > > +	void (*unpreserve_device)(struct device *dev, struct device_ser *device_ser);
> > > >
> > > > Nit: Let's move the _device ops under the comment:
> > > > `/* Per device IOMMU features */`
> > > 
> > > I will move these.
> > > >
> > > > > +	int (*preserve)(struct iommu_device *iommu, struct iommu_ser *iommu_ser);
> > > > > +	void (*unpreserve)(struct iommu_device *iommu, struct iommu_ser *iommu_ser);
> > > > > +
> > > >
> > > > I'm wondering if there's any benefit to add these ops under a #ifdef ?
> > > 
> > > These should be NULL when liveupdate is disabled, so the #ifdef should
> > > not be needed. Not sure how much space we save if we don't define
> > > these. I will move these under #ifdef anyway.
> > > >
> > > > >  	const struct iommu_domain_ops *default_domain_ops;
> > > > >  	struct module *owner;
> > > > >  	struct iommu_domain *identity_domain;
> > > > > @@ -749,6 +764,11 @@ struct iommu_ops {
> > > > >   *                           specific mechanisms.
> > > > >   * @set_pgtable_quirks: Set io page table quirks (IO_PGTABLE_QUIRK_*)
> > > > >   * @free: Release the domain after use.
> > > > > + * @preserve: Preserve the iommu domain for liveupdate.
> > > > > + *            Returns 0 on success, a negative errno on failure.
> > > > > + * @unpreserve: Unpreserve the iommu domain that was preserved earlier.
> > > > > + * @restore: Restore the iommu domain after liveupdate.
> > > > > + *           Returns 0 on success, a negative errno on failure.
> > > > >   */
> > > > >  struct iommu_domain_ops {
> > > > >  	int (*attach_dev)(struct iommu_domain *domain, struct device *dev,
> > > > > @@ -779,6 +799,9 @@ struct iommu_domain_ops {
> > > > >  				  unsigned long quirks);
> > > > >
> > > > >  	void (*free)(struct iommu_domain *domain);
> > > > > +	int (*preserve)(struct iommu_domain *domain, struct iommu_domain_ser *ser);
> > > > > +	void (*unpreserve)(struct iommu_domain *domain, struct iommu_domain_ser *ser);
> > > > > +	int (*restore)(struct iommu_domain *domain, struct iommu_domain_ser *ser);
> > > > >  };
> > > > >
> > > > >  /**
> > > > > @@ -790,6 +813,8 @@ struct iommu_domain_ops {
> > > > >   * @singleton_group: Used internally for drivers that have only one group
> > > > >   * @max_pasids: number of supported PASIDs
> > > > >   * @ready: set once iommu_device_register() has completed successfully
> > > > > + * @outgoing_preserved_state: preserved iommu state of outgoing kernel for
> > > > > + * liveupdate.
> > > > >   */
> > > > >  struct iommu_device {
> > > > >  	struct list_head list;
> > > > > @@ -799,6 +824,10 @@ struct iommu_device {
> > > > >  	struct iommu_group *singleton_group;
> > > > >  	u32 max_pasids;
> > > > >  	bool ready;
> > > > > +
> > > > > +#ifdef CONFIG_IOMMU_LIVEUPDATE
> > > > > +	struct iommu_ser *outgoing_preserved_state;
> > > > > +#endif
> > > > >  };
> > > > >
> > > > >  /**
> > > > > @@ -853,6 +882,9 @@ struct dev_iommu {
> > > > >  	u32				pci_32bit_workaround:1;
> > > > >  	u32				require_direct:1;
> > > > >  	u32				shadow_on_flush:1;
> > > > > +#ifdef CONFIG_IOMMU_LIVEUPDATE
> > > > > +	struct device_ser		*device_ser;
> > > > > +#endif
> > > > >  };
> > > > >
> > > > >  int iommu_device_register(struct iommu_device *iommu,
> > > >
> > > > Thanks,
> > > > Praan
> > > >
> > > > [1] https://elixir.bootlin.com/linux/v7.0-rc3/source/kernel/liveupdate/kexec_handover.c#L1182

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 02/14] iommu: Implement IOMMU core liveupdate skeleton
  2026-03-17 19:58   ` Vipin Sharma
@ 2026-03-17 20:33     ` Samiullah Khawaja
  2026-03-24 19:06       ` Vipin Sharma
  0 siblings, 1 reply; 98+ messages in thread
From: Samiullah Khawaja @ 2026-03-17 20:33 UTC (permalink / raw)
  To: Vipin Sharma
  Cc: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe, Robin Murphy, Kevin Tian, Alex Williamson,
	Shuah Khan, iommu, linux-kernel, kvm, Saeed Mahameed,
	Adithya Jayachandran, Parav Pandit, Leon Romanovsky, William Tu,
	Pratyush Yadav, Pasha Tatashin, David Matlack, Andrew Morton,
	Chris Li, Pranjal Shrivastava, YiFei Zhu

On Tue, Mar 17, 2026 at 12:58:27PM -0700, Vipin Sharma wrote:
>On Tue, Feb 03, 2026 at 10:09:36PM +0000, Samiullah Khawaja wrote:
>> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
>> index 4926a43118e6..c0632cb5b570 100644
>> --- a/drivers/iommu/iommu.c
>> +++ b/drivers/iommu/iommu.c
>> @@ -389,6 +389,9 @@ static struct dev_iommu *dev_iommu_get(struct device *dev)
>>
>>  	mutex_init(&param->lock);
>>  	dev->iommu = param;
>> +#ifdef CONFIG_IOMMU_LIVEUPDATE
>> +	dev->iommu->device_ser = NULL;
>> +#endif
>
>This should already be NULL due to kzalloc() above.

Will remove.
>
>>  	return param;
>>  }
>>
>> diff --git a/drivers/iommu/liveupdate.c b/drivers/iommu/liveupdate.c
>> index 6189ba32ff2c..83eb609b3fd7 100644
>> --- a/drivers/iommu/liveupdate.c
>> +++ b/drivers/iommu/liveupdate.c
>> @@ -11,6 +11,7 @@
>>  #include <linux/liveupdate.h>
>>  #include <linux/iommu-lu.h>
>>  #include <linux/iommu.h>
>> +#include <linux/pci.h>
>>  #include <linux/errno.h>
>>
>>  static void iommu_liveupdate_restore_objs(u64 next)
>> @@ -175,3 +176,328 @@ int iommu_liveupdate_unregister_flb(struct liveupdate_file_handler *handler)
>>  	return liveupdate_unregister_flb(handler, &iommu_flb);
>>  }
>>  EXPORT_SYMBOL(iommu_liveupdate_unregister_flb);
>> +
>> +int iommu_for_each_preserved_device(iommu_preserved_device_iter_fn fn,
>> +				    void *arg)
>> +{
>> +	struct iommu_lu_flb_obj *obj;
>> +	struct devices_ser *devices;
>> +	int ret, i, idx;
>> +
>> +	ret = liveupdate_flb_get_incoming(&iommu_flb, (void **)&obj);
>> +	if (ret)
>> +		return -ENOENT;
>> +
>> +	devices = __va(obj->ser->devices_phys);
>> +	for (i = 0, idx = 0; i < obj->ser->nr_devices; ++i, ++idx) {
>> +		if (idx >= MAX_DEVICE_SERS) {
>> +			devices = __va(devices->objs.next_objs);
>> +			idx = 0;
>> +		}
>> +
>> +		if (devices->devices[idx].obj.deleted)
>> +			continue;
>> +
>> +		ret = fn(&devices->devices[idx], arg);
>> +		if (ret)
>> +			return ret;
>> +	}
>> +
>> +	return 0;
>> +}
>> +EXPORT_SYMBOL(iommu_for_each_preserved_device);
>
>IMHO, with all the obj, ser, devices->devices[..], it is a little harder
>to follow the code. I'd recommend reconsidering the naming.

Agreed. I will update this based on the feedback on the previous patch.
>
>Also, should this function be introduced in the patch where it is
>getting used? Other changes in this patch are already big and complex.
>Same for iommu_get_device_preserved_data() and
>iommu_get_preserved_data().

These are used by the drivers but are part of the core, so they need to
be in this patch :(.

Note that this patch adds the core skeleton only, focusing on helpers
for the serialized state. It does not preserve any real iommu, domain or
device state. For example, the domains are saved through the generic
page table support in a separate patch, and the drivers preserve the
state of devices and the associated iommu in separate patches.

I will add this text in the commit message to clarify the purpose of
this patch.
>
>I think this patch can be split in three.
>Patch 1: Preserve iommu_domain
>Patch 2: Preserve pci device and iommu device
>Patch 3: The helper functions I mentioned above.
>
>
>> +
>> +static inline bool device_ser_match(struct device_ser *match,
>> +				    struct pci_dev *pdev)
>> +{
>> +	return match->devid == pci_dev_id(pdev) && match->pci_domain == pci_domain_nr(pdev->bus);
>
>Nit: s/match/device_ser for readability or something similar.

Agreed. Will do.
>
>> +}
>> +
>> +struct device_ser *iommu_get_device_preserved_data(struct device *dev)
>> +{
>> +	struct iommu_lu_flb_obj *obj;
>> +	struct devices_ser *devices;
>> +	int ret, i, idx;
>> +
>> +	if (!dev_is_pci(dev))
>> +		return NULL;
>> +
>> +	ret = liveupdate_flb_get_incoming(&iommu_flb, (void **)&obj);
>> +	if (ret)
>> +		return NULL;
>> +
>> +	devices = __va(obj->ser->devices_phys);
>> +	for (i = 0, idx = 0; i < obj->ser->nr_devices; ++i, ++idx) {
>> +		if (idx >= MAX_DEVICE_SERS) {
>> +			devices = __va(devices->objs.next_objs);
>> +			idx = 0;
>> +		}
>> +
>> +		if (devices->devices[idx].obj.deleted)
>> +			continue;
>> +
>> +		if (device_ser_match(&devices->devices[idx], to_pci_dev(dev))) {
>> +			devices->devices[idx].obj.incoming = true;
>> +			return &devices->devices[idx];
>> +		}
>> +	}
>> +
>> +	return NULL;
>> +}
>> +EXPORT_SYMBOL(iommu_get_device_preserved_data);
>> +
>> +struct iommu_ser *iommu_get_preserved_data(u64 token, enum iommu_lu_type type)
>> +{
>> +	struct iommu_lu_flb_obj *obj;
>> +	struct iommus_ser *iommus;
>> +	int ret, i, idx;
>> +
>> +	ret = liveupdate_flb_get_incoming(&iommu_flb, (void **)&obj);
>> +	if (ret)
>> +		return NULL;
>> +
>> +	iommus = __va(obj->ser->iommus_phys);
>> +	for (i = 0, idx = 0; i < obj->ser->nr_iommus; ++i, ++idx) {
>> +		if (idx >= MAX_IOMMU_SERS) {
>> +			iommus = __va(iommus->objs.next_objs);
>> +			idx = 0;
>> +		}
>> +
>> +		if (iommus->iommus[idx].obj.deleted)
>> +			continue;
>> +
>> +		if (iommus->iommus[idx].token == token &&
>> +		    iommus->iommus[idx].type == type)
>> +			return &iommus->iommus[idx];
>> +	}
>> +
>> +	return NULL;
>> +}
>> +EXPORT_SYMBOL(iommu_get_preserved_data);
>> +
>> +static int reserve_obj_ser(struct iommu_objs_ser **objs_ptr, u64 max_objs)
>> +{
>> +	struct iommu_objs_ser *next_objs, *objs = *objs_ptr;
>> +	int idx;
>> +
>> +	if (objs->nr_objs == max_objs) {
>> +		next_objs = kho_alloc_preserve(PAGE_SIZE);
>> +		if (IS_ERR(next_objs))
>> +			return PTR_ERR(next_objs);
>> +
>> +		objs->next_objs = virt_to_phys(next_objs);
>> +		objs = next_objs;
>> +		*objs_ptr = objs;
>> +		objs->nr_objs = 0;
>> +		objs->next_objs = 0;
>> +	}
>> +
>> +	idx = objs->nr_objs++;
>> +	return idx;
>
>I think we should rename these variables; as-is, the code is difficult
>to comprehend.

Will do.
>
>> +}
>> +
>> +int iommu_domain_preserve(struct iommu_domain *domain, struct iommu_domain_ser **ser)
>> +{
>> +	struct iommu_domain_ser *domain_ser;
>> +	struct iommu_lu_flb_obj *flb_obj;
>> +	int idx, ret;
>> +
>> +	if (!domain->ops->preserve)
>> +		return -EOPNOTSUPP;
>> +
>> +	ret = liveupdate_flb_get_outgoing(&iommu_flb, (void **)&flb_obj);
>> +	if (ret)
>> +		return ret;
>> +
>> +	guard(mutex)(&flb_obj->lock);
>> +	idx = reserve_obj_ser((struct iommu_objs_ser **)&flb_obj->iommu_domains,
>> +			      MAX_IOMMU_DOMAIN_SERS);
>> +	if (idx < 0)
>> +		return idx;
>> +
>> +	domain_ser = &flb_obj->iommu_domains->iommu_domains[idx];
>> +	idx = flb_obj->ser->nr_domains++;
>
>In the for loops above, nr_domains is compared with 'i', and idx is the
>index inside a page. Let's not reuse idx here for nr_domains; keep it
>consistent for easier understanding. Maybe use something more
>descriptive than 'i', like global_idx and local_idx?

Agreed. Will change it.
>
>> +	domain_ser->obj.idx = idx;
>> +	domain_ser->obj.ref_count = 1;
>> +
>> +	ret = domain->ops->preserve(domain, domain_ser);
>> +	if (ret) {
>> +		domain_ser->obj.deleted = true;
>> +		return ret;
>> +	}
>> +
>> +	domain->preserved_state = domain_ser;
>> +	*ser = domain_ser;
>> +	return 0;
>> +}
>> +EXPORT_SYMBOL_GPL(iommu_domain_preserve);
>> +
>> +static int iommu_preserve_locked(struct iommu_device *iommu)
>> +{
>> +	struct iommu_lu_flb_obj *flb_obj;
>> +	struct iommu_ser *iommu_ser;
>> +	int idx, ret;
>
>Should there be lockdep asserts in these locked version APIs?

Will add it here.
>
>> +
>> +}
>> +
>> +static void iommu_unpreserve_locked(struct iommu_device *iommu)
>> +{
>> +	struct iommu_ser *iommu_ser = iommu->outgoing_preserved_state;
>> +
>> +	iommu_ser->obj.ref_count--;
>
>Should there be a null check?

Hmm.. There is a dependency between unpreserving iommus and unpreserving
devices, so this should never be NULL unless the API is used independently.

But I think I will add it here to protect against that.
>
>> +	if (iommu_ser->obj.ref_count)
>> +		return;
>> +
>> +	iommu->outgoing_preserved_state = NULL;
>> +	iommu->ops->unpreserve(iommu, iommu_ser);
>> +	iommu_ser->obj.deleted = true;
>> +}
>> +
>> +int iommu_preserve_device(struct iommu_domain *domain,
>> +			  struct device *dev, u64 token)
>> +{
>> +	struct iommu_lu_flb_obj *flb_obj;
>> +	struct device_ser *device_ser;
>> +	struct dev_iommu *iommu;
>> +	struct pci_dev *pdev;
>> +	int ret, idx;
>> +
>> +	if (!dev_is_pci(dev))
>> +		return -EOPNOTSUPP;
>> +
>> +	if (!domain->preserved_state)
>> +		return -EINVAL;
>> +
>> +	pdev = to_pci_dev(dev);
>> +	iommu = dev->iommu;
>> +	if (!iommu->iommu_dev->ops->preserve_device ||
>> +	    !iommu->iommu_dev->ops->preserve)
>
>iommu_preserve_locked() is already checking for preserve(), we can just
>check preserve_device here.

iommu_preserve_locked() is called after reserving the serialization
state in the FLB and taking the FLB lock, so this check lets us bail out
before doing any of that work.
>
>> +		return -EOPNOTSUPP;
>> +
>> +	ret = liveupdate_flb_get_outgoing(&iommu_flb, (void **)&flb_obj);
>> +	if (ret)
>> +		return ret;
>> +
>> +	guard(mutex)(&flb_obj->lock);
>> +	idx = reserve_obj_ser((struct iommu_objs_ser **)&flb_obj->devices,
>> +			      MAX_DEVICE_SERS);
>> +	if (idx < 0)
>> +		return idx;
>> +
>> +	device_ser = &flb_obj->devices->devices[idx];
>> +	idx = flb_obj->ser->nr_devices++;
>> +	device_ser->obj.idx = idx;
>> +	device_ser->obj.ref_count = 1;
>> +
>> +	ret = iommu_preserve_locked(iommu->iommu_dev);
>> +	if (ret) {
>> +		device_ser->obj.deleted = true;
>> +		return ret;
>> +	}
>> +
>> +	device_ser->domain_iommu_ser.domain_phys = __pa(domain->preserved_state);
>> +	device_ser->domain_iommu_ser.iommu_phys = __pa(iommu->iommu_dev->outgoing_preserved_state);
>> +	device_ser->devid = pci_dev_id(pdev);
>> +	device_ser->pci_domain = pci_domain_nr(pdev->bus);
>> +	device_ser->token = token;
>> +
>> +	ret = iommu->iommu_dev->ops->preserve_device(dev, device_ser);
>> +	if (ret) {
>> +		device_ser->obj.deleted = true;
>> +		iommu_unpreserve_locked(iommu->iommu_dev);
>> +		return ret;
>> +	}
>> +
>> +	dev->iommu->device_ser = device_ser;
>> +	return 0;
>> +}
>> +
>> +void iommu_unpreserve_device(struct iommu_domain *domain, struct device *dev)
>> +{
>> +	struct iommu_lu_flb_obj *flb_obj;
>> +	struct device_ser *device_ser;
>> +	struct dev_iommu *iommu;
>> +	struct pci_dev *pdev;
>> +	int ret;
>> +
>> +	if (!dev_is_pci(dev))
>> +		return;
>> +
>> +	pdev = to_pci_dev(dev);
>> +	iommu = dev->iommu;
>> +	if (!iommu->iommu_dev->ops->unpreserve_device ||
>> +	    !iommu->iommu_dev->ops->unpreserve)
>> +		return;
>> +
>> +	ret = liveupdate_flb_get_outgoing(&iommu_flb, (void **)&flb_obj);
>> +	if (WARN_ON(ret))
>
>Why WARN_ON here and not other places? Do we need it?

Basically this means that the upper layer (iommufd/vfio) is asking to
unpreserve a device, but there is no FLB found. This should not happen
and should generate a warning.
>
>> +		return;
>> +
>> +	guard(mutex)(&flb_obj->lock);
>> +	device_ser = dev_iommu_preserved_state(dev);
>> +	if (WARN_ON(!device_ser))
>
>Can't we just silently ignore this?

See answer above for the previous WARN. Same applies here.
>
>> +		return;
>> +
>> +	iommu->iommu_dev->ops->unpreserve_device(dev, device_ser);
>> +	dev->iommu->device_ser = NULL;
>> +
>> +	iommu_unpreserve_locked(iommu->iommu_dev);
>> +}

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 04/14] iommu/pages: Add APIs to preserve/unpreserve/restore iommu pages
  2026-02-03 22:09 ` [PATCH 04/14] iommu/pages: Add APIs to preserve/unpreserve/restore iommu pages Samiullah Khawaja
  2026-03-03 16:42   ` Ankit Soni
@ 2026-03-17 20:59   ` Vipin Sharma
  2026-03-20  9:28     ` Pranjal Shrivastava
  2026-03-20 11:01     ` Pranjal Shrivastava
  1 sibling, 2 replies; 98+ messages in thread
From: Vipin Sharma @ 2026-03-17 20:59 UTC (permalink / raw)
  To: Samiullah Khawaja
  Cc: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe, Robin Murphy, Kevin Tian, Alex Williamson,
	Shuah Khan, iommu, linux-kernel, kvm, Saeed Mahameed,
	Adithya Jayachandran, Parav Pandit, Leon Romanovsky, William Tu,
	Pratyush Yadav, Pasha Tatashin, David Matlack, Andrew Morton,
	Chris Li, Pranjal Shrivastava, YiFei Zhu

On Tue, Feb 03, 2026 at 10:09:38PM +0000, Samiullah Khawaja wrote:
> +
> +void iommu_unpreserve_pages(struct iommu_pages_list *list, int count)
> +{
> +	struct ioptdesc *iopt;
> +
> +	if (!count)
> +		return;
> +
> +	/* If less than zero then unpreserve all pages. */
> +	if (count < 0)
> +		count = 0;
> +
> +	list_for_each_entry(iopt, &list->pages, iopt_freelist_elm) {
> +		kho_unpreserve_folio(ioptdesc_folio(iopt));
> +		if (count > 0 && --count ==  0)
> +			break;
> +	}
> +}

I see you are trying to have a common function both for the error
handling in iommu_preserve_pages() below and for the unpreserving in the
next patch, with an overloaded meaning of count.

 < 0 means unpreserve all of the pages
 = 0 means do nothing
 > 0 unpreserve these many pages.

It seems non-intuitive to me, and the code also ends up with multiple
ifs. I would recommend just having this function go through all of the
pages, without the count argument. In iommu_preserve_pages() below,
handle the error case with a "goto err" statement. If this is the way
things are done in the iommu codebase, feel free to ignore this
suggestion.

> +EXPORT_SYMBOL_GPL(iommu_unpreserve_pages);
> +
> +void iommu_restore_page(u64 phys)
> +{
> +	struct ioptdesc *iopt;
> +	struct folio *folio;
> +	unsigned long pgcnt;
> +	unsigned int order;
> +
> +	folio = kho_restore_folio(phys);
> +	BUG_ON(!folio);
> +
> +	iopt = folio_ioptdesc(folio);
> +
> +	order = folio_order(folio);
> +	pgcnt = 1UL << order;
> +	mod_node_page_state(folio_pgdat(folio), NR_IOMMU_PAGES, pgcnt);
> +	lruvec_stat_mod_folio(folio, NR_SECONDARY_PAGETABLE, pgcnt);

The above two functions are also called in two other places in this
file; let's create a common helper for these two operations.

> +}
> +EXPORT_SYMBOL_GPL(iommu_restore_page);
> +

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 02/14] iommu: Implement IOMMU core liveupdate skeleton
  2026-03-17 20:23           ` Pranjal Shrivastava
@ 2026-03-17 21:03             ` Vipin Sharma
  2026-03-18 18:51               ` Pranjal Shrivastava
  2026-03-18 17:49             ` Samiullah Khawaja
  1 sibling, 1 reply; 98+ messages in thread
From: Vipin Sharma @ 2026-03-17 21:03 UTC (permalink / raw)
  To: Pranjal Shrivastava
  Cc: Samiullah Khawaja, David Woodhouse, Lu Baolu, Joerg Roedel,
	Will Deacon, Jason Gunthorpe, Robin Murphy, Kevin Tian,
	Alex Williamson, Shuah Khan, iommu, linux-kernel, kvm,
	Saeed Mahameed, Adithya Jayachandran, Parav Pandit,
	Leon Romanovsky, William Tu, Pratyush Yadav, Pasha Tatashin,
	David Matlack, Andrew Morton, Chris Li, YiFei Zhu

On Tue, Mar 17, 2026 at 08:23:38PM +0000, Pranjal Shrivastava wrote:
> On Tue, Mar 17, 2026 at 08:13:47PM +0000, Samiullah Khawaja wrote:
> > On Tue, Mar 17, 2026 at 08:09:26PM +0000, Pranjal Shrivastava wrote:
> > > On Fri, Mar 13, 2026 at 06:42:05PM +0000, Samiullah Khawaja wrote:
> > > > On Thu, Mar 12, 2026 at 11:10:36PM +0000, Pranjal Shrivastava wrote:
> > > > > On Tue, Feb 03, 2026 at 10:09:36PM +0000, Samiullah Khawaja wrote:
> > > > > > +EXPORT_SYMBOL(iommu_get_preserved_data);
> > > > > > +
> > > 
> > > Also, I don't see these helpers being used anywhere in this series yet?
> > > Should we add them with the phase-2 series? When we actually "Restore"
> > > the domain?
> > 
> > Phase 1 does restore the iommu and domains during boot. But the
> > iommufd/hwpt state is not restored or re-associated with the restored
> > domains.
> > 
> > These helpers are used to fetch the preserved state during boot.
> > 
> > Thanks,
> > Sami
> 
> I see these are used in PATCH 8. Should we introduce these helpers in
> PATCH 8, where we actually use them, alongside iommu_restore_domain?
> 
> Thanks,
> Praan
> 

Please trim your responses.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 03/14] liveupdate: luo_file: Add internal APIs for file preservation
  2026-02-03 22:09 ` [PATCH 03/14] liveupdate: luo_file: Add internal APIs for file preservation Samiullah Khawaja
@ 2026-03-18 10:00   ` Pranjal Shrivastava
  2026-03-18 16:54     ` Samiullah Khawaja
  0 siblings, 1 reply; 98+ messages in thread
From: Pranjal Shrivastava @ 2026-03-18 10:00 UTC (permalink / raw)
  To: Samiullah Khawaja
  Cc: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe, Pasha Tatashin, Robin Murphy, Kevin Tian,
	Alex Williamson, Shuah Khan, iommu, linux-kernel, kvm,
	Saeed Mahameed, Adithya Jayachandran, Parav Pandit,
	Leon Romanovsky, William Tu, Pratyush Yadav, David Matlack,
	Andrew Morton, Chris Li, Vipin Sharma, YiFei Zhu

On Tue, Feb 03, 2026 at 10:09:37PM +0000, Samiullah Khawaja wrote:
> From: Pasha Tatashin <pasha.tatashin@soleen.com>
> 
> The core liveupdate mechanism allows userspace to preserve file
> descriptors. However, kernel subsystems often manage struct file
> objects directly and need to participate in the preservation process
> programmatically without relying solely on userspace interaction.
> 
> Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> ---
>  include/linux/liveupdate.h       | 21 ++++++++++
>  kernel/liveupdate/luo_file.c     | 71 ++++++++++++++++++++++++++++++++
>  kernel/liveupdate/luo_internal.h | 16 +++++++
>  3 files changed, 108 insertions(+)
> 
> diff --git a/include/linux/liveupdate.h b/include/linux/liveupdate.h
> index fe82a6c3005f..8e47504ba01e 100644
> --- a/include/linux/liveupdate.h
> +++ b/include/linux/liveupdate.h
> @@ -23,6 +23,7 @@ struct file;
>  /**
>   * struct liveupdate_file_op_args - Arguments for file operation callbacks.
>   * @handler:          The file handler being called.
> + * @session:          The session this file belongs to.
>   * @retrieved:        The retrieve status for the 'can_finish / finish'
>   *                    operation.
>   * @file:             The file object. For retrieve: [OUT] The callback sets
> @@ -40,6 +41,7 @@ struct file;
>   */
>  struct liveupdate_file_op_args {
>  	struct liveupdate_file_handler *handler;
> +	struct liveupdate_session *session;
>  	bool retrieved;

Nit: I don't think this is on the latest tree. I see `int retrieved` [1]
in the latest tree. I guess we'd need to rebase it on the latest?

[1] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/include/linux/liveupdate.h#n46

>  	struct file *file;
>  	u64 serialized_data;
> @@ -234,6 +236,13 @@ int liveupdate_unregister_flb(struct liveupdate_file_handler *fh,
>  
>  int liveupdate_flb_get_incoming(struct liveupdate_flb *flb, void **objp);
>  int liveupdate_flb_get_outgoing(struct liveupdate_flb *flb, void **objp);
> +/* kernel can internally retrieve files */
> +int liveupdate_get_file_incoming(struct liveupdate_session *s, u64 token,
> +				 struct file **filep);
> +
> +/* Get a token for an outgoing file, or -ENOENT if file is not preserved */
> +int liveupdate_get_token_outgoing(struct liveupdate_session *s,
> +				  struct file *file, u64 *tokenp);
>  
>  #else /* CONFIG_LIVEUPDATE */
>  
> @@ -281,5 +290,17 @@ static inline int liveupdate_flb_get_outgoing(struct liveupdate_flb *flb,
>  	return -EOPNOTSUPP;
>  }
>  
> +static inline int liveupdate_get_file_incoming(struct liveupdate_session *s,
> +					       u64 token, struct file **filep)
> +{
> +	return -EOPNOTSUPP;
> +}
> +
> +static inline int liveupdate_get_token_outgoing(struct liveupdate_session *s,
> +						struct file *file, u64 *tokenp)
> +{
> +	return -EOPNOTSUPP;
> +}
> +
>  #endif /* CONFIG_LIVEUPDATE */
>  #endif /* _LINUX_LIVEUPDATE_H */
> diff --git a/kernel/liveupdate/luo_file.c b/kernel/liveupdate/luo_file.c
> index 32759e846bc9..7ac591542059 100644
> --- a/kernel/liveupdate/luo_file.c
> +++ b/kernel/liveupdate/luo_file.c
> @@ -302,6 +302,7 @@ int luo_preserve_file(struct luo_file_set *file_set, u64 token, int fd)
>  	mutex_init(&luo_file->mutex);
>  
>  	args.handler = fh;
> +	args.session = luo_session_from_file_set(file_set);
>  	args.file = file;
>  	err = fh->ops->preserve(&args);
>  	if (err)
> @@ -355,6 +356,7 @@ void luo_file_unpreserve_files(struct luo_file_set *file_set)
>  					   struct luo_file, list);
>  
>  		args.handler = luo_file->fh;
> +		args.session = luo_session_from_file_set(file_set);
>  		args.file = luo_file->file;
>  		args.serialized_data = luo_file->serialized_data;
>  		args.private_data = luo_file->private_data;
> @@ -383,6 +385,7 @@ static int luo_file_freeze_one(struct luo_file_set *file_set,
>  		struct liveupdate_file_op_args args = {0};
>  
>  		args.handler = luo_file->fh;
> +		args.session = luo_session_from_file_set(file_set);
>  		args.file = luo_file->file;
>  		args.serialized_data = luo_file->serialized_data;
>  		args.private_data = luo_file->private_data;
> @@ -404,6 +407,7 @@ static void luo_file_unfreeze_one(struct luo_file_set *file_set,
>  		struct liveupdate_file_op_args args = {0};
>  
>  		args.handler = luo_file->fh;
> +		args.session = luo_session_from_file_set(file_set);
>  		args.file = luo_file->file;
>  		args.serialized_data = luo_file->serialized_data;
>  		args.private_data = luo_file->private_data;
> @@ -590,6 +594,7 @@ int luo_retrieve_file(struct luo_file_set *file_set, u64 token,
>  	}
>  
>  	args.handler = luo_file->fh;
> +	args.session = luo_session_from_file_set(file_set);
>  	args.serialized_data = luo_file->serialized_data;
>  	err = luo_file->fh->ops->retrieve(&args);
>  	if (!err) {
> @@ -615,6 +620,7 @@ static int luo_file_can_finish_one(struct luo_file_set *file_set,
>  		struct liveupdate_file_op_args args = {0};
>  
>  		args.handler = luo_file->fh;
> +		args.session = luo_session_from_file_set(file_set);
>  		args.file = luo_file->file;
>  		args.serialized_data = luo_file->serialized_data;
>  		args.retrieved = luo_file->retrieved;
> @@ -632,6 +638,7 @@ static void luo_file_finish_one(struct luo_file_set *file_set,
>  	guard(mutex)(&luo_file->mutex);
>  
>  	args.handler = luo_file->fh;
> +	args.session = luo_session_from_file_set(file_set);
>  	args.file = luo_file->file;
>  	args.serialized_data = luo_file->serialized_data;
>  	args.retrieved = luo_file->retrieved;
> @@ -919,3 +926,67 @@ int liveupdate_unregister_file_handler(struct liveupdate_file_handler *fh)
>  	return err;
>  }
>  EXPORT_SYMBOL_GPL(liveupdate_unregister_file_handler);
> +
> +/**
> + * liveupdate_get_token_outgoing - Get the token for a preserved file.
> + * @s:      The outgoing liveupdate session.
> + * @file:   The file object to search for.
> + * @tokenp: Output parameter for the found token.
> + *
> + * Searches the list of preserved files in an outgoing session for a matching
> + * file object. If found, the corresponding user-provided token is returned.
> + *
> + * This function is intended for in-kernel callers that need to correlate a
> + * file with its liveupdate token.
> + *
> + * Context: Can be called from any context that can acquire the session mutex.
> + * Return: 0 on success, -ENOENT if the file is not preserved in this session.
> + */
> +int liveupdate_get_token_outgoing(struct liveupdate_session *s,
> +				  struct file *file, u64 *tokenp)
> +{
> +	struct luo_file_set *file_set = luo_file_set_from_session(s);
> +	struct luo_file *luo_file;
> +	int err = -ENOENT;
> +
> +	list_for_each_entry(luo_file, &file_set->files_list, list) {

Shouldn't we hold a lock while traversing the file_set? Couldn't this
race with an unpreserve? If a concurrent unpreserve (writer) calls 
list_del while this reader is mid-iteration, we'll likely follow a stale
pointer..

I noticed that luo_preserve_file / unpreserve and retrieve also don't
take locks while manipulating / iterating over file_set->files_list..

Shouldn't we protect these with a lock? Otherwise, how do we avoid
races in situations like:

1. CPU 0 (reader) is iterating the list A->B->C ..
2. CPU 1 (writer) is handling an unpreserve which remove file B

Am I missing something? If we're expecting the callers to hold a lock,
we should have a lockdep_assert in these functions..

> +		if (luo_file->file == file) {
> +			if (tokenp)
> +				*tokenp = luo_file->token;
> +			err = 0;
> +			break;
> +		}
> +	}
> +
> +	return err;
> +}
> +
> +/**
> + * liveupdate_get_file_incoming - Retrieves a preserved file for in-kernel use.
> + * @s:      The incoming liveupdate session (restored from the previous kernel).
> + * @token:  The unique token identifying the file to retrieve.
> + * @filep:  On success, this will be populated with a pointer to the retrieved
> + *          'struct file'.
> + *
> + * Provides a kernel-internal API for other subsystems to retrieve their
> + * preserved files after a live update. This function is a simple wrapper
> + * around luo_retrieve_file(), allowing callers to find a file by its token.
> + *
> + * The operation is idempotent; subsequent calls for the same token will return
> + * a pointer to the same 'struct file' object.
> + *
> + * The caller receives a new reference to the file and must call fput() when it
> + * is no longer needed. The file's lifetime is managed by LUO and any userspace
> + * file descriptors. If the caller needs to hold a reference to the file beyond
> + * the immediate scope, it must call get_file() itself.
> + *

I'm a little confused here: we say the op is idempotent, but also mention
that the caller receives a new reference. I'm wondering about a situation
where a driver calls this multiple times, incrementing the refcount with
each call. Do we rely on flb_file_finish to drop all the refcounts?

We should clarify the lifecycle requirements here: is the driver 
expected to call fput() for every single call to 
liveupdate_get_file_incoming(), or is the flb_finish callback intended
to be a 'catch-all' that reaps these? 

> + * Context: Can be called from any context in the new kernel that has a handle
> + *          to a restored session.
> + * Return: 0 on success. Returns -ENOENT if no file with the matching token is
> + *         found, or any other negative errno on failure.
> + */
> +int liveupdate_get_file_incoming(struct liveupdate_session *s, u64 token,
> +				 struct file **filep)
> +{
> +	return luo_retrieve_file(luo_file_set_from_session(s), token, filep);
> +}
> diff --git a/kernel/liveupdate/luo_internal.h b/kernel/liveupdate/luo_internal.h
> index 8083d8739b09..a24933d24fd9 100644
> --- a/kernel/liveupdate/luo_internal.h
> +++ b/kernel/liveupdate/luo_internal.h
> @@ -77,6 +77,22 @@ struct luo_session {
>  	struct mutex mutex;
>  };
>  
> +static inline struct liveupdate_session *luo_session_from_file_set(struct luo_file_set *file_set)
> +{
> +	struct luo_session *session;
> +
> +	session = container_of(file_set, struct luo_session, file_set);
> +
> +	return (struct liveupdate_session *)session;
> +}
> +
> +static inline struct luo_file_set *luo_file_set_from_session(struct liveupdate_session *s)
> +{
> +	struct luo_session *session = (struct luo_session *)s;
> +
> +	return &session->file_set;
> +}
> +
>  int luo_session_create(const char *name, struct file **filep);
>  int luo_session_retrieve(const char *name, struct file **filep);
>  int __init luo_session_setup_outgoing(void *fdt);
> -- 
>

Thanks,
Praan

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 03/14] liveupdate: luo_file: Add internal APIs for file preservation
  2026-03-18 10:00   ` Pranjal Shrivastava
@ 2026-03-18 16:54     ` Samiullah Khawaja
  0 siblings, 0 replies; 98+ messages in thread
From: Samiullah Khawaja @ 2026-03-18 16:54 UTC (permalink / raw)
  To: Pranjal Shrivastava
  Cc: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe, Pasha Tatashin, Robin Murphy, Kevin Tian,
	Alex Williamson, Shuah Khan, iommu, linux-kernel, kvm,
	Saeed Mahameed, Adithya Jayachandran, Parav Pandit,
	Leon Romanovsky, William Tu, Pratyush Yadav, David Matlack,
	Andrew Morton, Chris Li, Vipin Sharma, YiFei Zhu

On Wed, Mar 18, 2026 at 10:00:57AM +0000, Pranjal Shrivastava wrote:
>On Tue, Feb 03, 2026 at 10:09:37PM +0000, Samiullah Khawaja wrote:
>> From: Pasha Tatashin <pasha.tatashin@soleen.com>
>>
>> The core liveupdate mechanism allows userspace to preserve file
>> descriptors. However, kernel subsystems often manage struct file
>> objects directly and need to participate in the preservation process
>> programmatically without relying solely on userspace interaction.
>>
>> Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
>> ---
>>  include/linux/liveupdate.h       | 21 ++++++++++
>>  kernel/liveupdate/luo_file.c     | 71 ++++++++++++++++++++++++++++++++
>>  kernel/liveupdate/luo_internal.h | 16 +++++++
>>  3 files changed, 108 insertions(+)
>>
>> diff --git a/include/linux/liveupdate.h b/include/linux/liveupdate.h
>> index fe82a6c3005f..8e47504ba01e 100644
>> --- a/include/linux/liveupdate.h
>> +++ b/include/linux/liveupdate.h
>> @@ -23,6 +23,7 @@ struct file;
>>  /**
>>   * struct liveupdate_file_op_args - Arguments for file operation callbacks.
>>   * @handler:          The file handler being called.
>> + * @session:          The session this file belongs to.
>>   * @retrieved:        The retrieve status for the 'can_finish / finish'
>>   *                    operation.
>>   * @file:             The file object. For retrieve: [OUT] The callback sets
>> @@ -40,6 +41,7 @@ struct file;
>>   */
>>  struct liveupdate_file_op_args {
>>  	struct liveupdate_file_handler *handler;
>> +	struct liveupdate_session *session;
>>  	bool retrieved;
>
>Nit: I don't think this is on the latest tree. I see `int retrieved` [1]
>in the latest tree. I guess we'd need to rebase it on the latest?

Will rebase this.
>
>https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/include/linux/liveupdate.h#n46
>
>>  	struct file *file;
>>  	u64 serialized_data;
>> @@ -234,6 +236,13 @@ int liveupdate_unregister_flb(struct liveupdate_file_handler *fh,
>>
>>  int liveupdate_flb_get_incoming(struct liveupdate_flb *flb, void **objp);
>>  int liveupdate_flb_get_outgoing(struct liveupdate_flb *flb, void **objp);
>> +/* kernel can internally retrieve files */
>> +int liveupdate_get_file_incoming(struct liveupdate_session *s, u64 token,
>> +				 struct file **filep);
>> +
>> +/* Get a token for an outgoing file, or -ENOENT if file is not preserved */
>> +int liveupdate_get_token_outgoing(struct liveupdate_session *s,
>> +				  struct file *file, u64 *tokenp);
>>
>>  #else /* CONFIG_LIVEUPDATE */
>>
>> @@ -281,5 +290,17 @@ static inline int liveupdate_flb_get_outgoing(struct liveupdate_flb *flb,
>>  	return -EOPNOTSUPP;
>>  }
>>
>> +static inline int liveupdate_get_file_incoming(struct liveupdate_session *s,
>> +					       u64 token, struct file **filep)
>> +{
>> +	return -EOPNOTSUPP;
>> +}
>> +
>> +static inline int liveupdate_get_token_outgoing(struct liveupdate_session *s,
>> +						struct file *file, u64 *tokenp)
>> +{
>> +	return -EOPNOTSUPP;
>> +}
>> +
>>  #endif /* CONFIG_LIVEUPDATE */
>>  #endif /* _LINUX_LIVEUPDATE_H */
>> diff --git a/kernel/liveupdate/luo_file.c b/kernel/liveupdate/luo_file.c
>> index 32759e846bc9..7ac591542059 100644
>> --- a/kernel/liveupdate/luo_file.c
>> +++ b/kernel/liveupdate/luo_file.c
>> @@ -302,6 +302,7 @@ int luo_preserve_file(struct luo_file_set *file_set, u64 token, int fd)
>>  	mutex_init(&luo_file->mutex);
>>
>>  	args.handler = fh;
>> +	args.session = luo_session_from_file_set(file_set);
>>  	args.file = file;
>>  	err = fh->ops->preserve(&args);
>>  	if (err)
>> @@ -355,6 +356,7 @@ void luo_file_unpreserve_files(struct luo_file_set *file_set)
>>  					   struct luo_file, list);
>>
>>  		args.handler = luo_file->fh;
>> +		args.session = luo_session_from_file_set(file_set);
>>  		args.file = luo_file->file;
>>  		args.serialized_data = luo_file->serialized_data;
>>  		args.private_data = luo_file->private_data;
>> @@ -383,6 +385,7 @@ static int luo_file_freeze_one(struct luo_file_set *file_set,
>>  		struct liveupdate_file_op_args args = {0};
>>
>>  		args.handler = luo_file->fh;
>> +		args.session = luo_session_from_file_set(file_set);
>>  		args.file = luo_file->file;
>>  		args.serialized_data = luo_file->serialized_data;
>>  		args.private_data = luo_file->private_data;
>> @@ -404,6 +407,7 @@ static void luo_file_unfreeze_one(struct luo_file_set *file_set,
>>  		struct liveupdate_file_op_args args = {0};
>>
>>  		args.handler = luo_file->fh;
>> +		args.session = luo_session_from_file_set(file_set);
>>  		args.file = luo_file->file;
>>  		args.serialized_data = luo_file->serialized_data;
>>  		args.private_data = luo_file->private_data;
>> @@ -590,6 +594,7 @@ int luo_retrieve_file(struct luo_file_set *file_set, u64 token,
>>  	}
>>
>>  	args.handler = luo_file->fh;
>> +	args.session = luo_session_from_file_set(file_set);
>>  	args.serialized_data = luo_file->serialized_data;
>>  	err = luo_file->fh->ops->retrieve(&args);
>>  	if (!err) {
>> @@ -615,6 +620,7 @@ static int luo_file_can_finish_one(struct luo_file_set *file_set,
>>  		struct liveupdate_file_op_args args = {0};
>>
>>  		args.handler = luo_file->fh;
>> +		args.session = luo_session_from_file_set(file_set);
>>  		args.file = luo_file->file;
>>  		args.serialized_data = luo_file->serialized_data;
>>  		args.retrieved = luo_file->retrieved;
>> @@ -632,6 +638,7 @@ static void luo_file_finish_one(struct luo_file_set *file_set,
>>  	guard(mutex)(&luo_file->mutex);
>>
>>  	args.handler = luo_file->fh;
>> +	args.session = luo_session_from_file_set(file_set);
>>  	args.file = luo_file->file;
>>  	args.serialized_data = luo_file->serialized_data;
>>  	args.retrieved = luo_file->retrieved;
>> @@ -919,3 +926,67 @@ int liveupdate_unregister_file_handler(struct liveupdate_file_handler *fh)
>>  	return err;
>>  }
>>  EXPORT_SYMBOL_GPL(liveupdate_unregister_file_handler);
>> +
>> +/**
>> + * liveupdate_get_token_outgoing - Get the token for a preserved file.
>> + * @s:      The outgoing liveupdate session.
>> + * @file:   The file object to search for.
>> + * @tokenp: Output parameter for the found token.
>> + *
>> + * Searches the list of preserved files in an outgoing session for a matching
>> + * file object. If found, the corresponding user-provided token is returned.
>> + *
>> + * This function is intended for in-kernel callers that need to correlate a
>> + * file with its liveupdate token.
>> + *
>> + * Context: Can be called from any context that can acquire the session mutex.

I will reword this to convey that the session mutex should be acquired.
>> + * Return: 0 on success, -ENOENT if the file is not preserved in this session.
>> + */
>> +int liveupdate_get_token_outgoing(struct liveupdate_session *s,
>> +				  struct file *file, u64 *tokenp)
>> +{
>> +	struct luo_file_set *file_set = luo_file_set_from_session(s);
>> +	struct luo_file *luo_file;
>> +	int err = -ENOENT;
>> +
>> +	list_for_each_entry(luo_file, &file_set->files_list, list) {
>
>Shouldn't we hold a lock while traversing the file_set? Couldn't this
>race with an unpreserve? If a concurrent unpreserve (writer) calls
>list_del while this reader is mid-iteration, we'll likely follow a stale
>pointer..

The session mutex is supposed to be acquired when this function is
called. I will add a lockdep assert in this function to enforce that.

Also, we call this in preserve() of vfio-cdev and the session mutex is
already acquired by luo in that context.
>
>I noticed that luo_preserve_file / unpreserve and retrieve also don't
>take locks while manipulating / iterating over file_set->files_list..
>
>Shouldn't we protect these with a lock? Otherwise, how do we avoid
>races in situations like:
>
>1. CPU 0 (reader) is iterating the list A->B->C ..
>2. CPU 1 (writer) is handling an unpreserve which remove file B
>
>Am I missing something? If we're expecting the callers to hold a lock,
>we should have a lockdep_assert in these functions..

The LUO acquires the session mutex before calling preserve/unpreserve to
protect against race conditions. See luo_session.c for details.
>
>> +		if (luo_file->file == file) {
>> +			if (tokenp)
>> +				*tokenp = luo_file->token;
>> +			err = 0;
>> +			break;
>> +		}
>> +	}
>> +
>> +	return err;
>> +}
>> +
>> +/**
>> + * liveupdate_get_file_incoming - Retrieves a preserved file for in-kernel use.
>> + * @s:      The incoming liveupdate session (restored from the previous kernel).
>> + * @token:  The unique token identifying the file to retrieve.
>> + * @filep:  On success, this will be populated with a pointer to the retrieved
>> + *          'struct file'.
>> + *
>> + * Provides a kernel-internal API for other subsystems to retrieve their
>> + * preserved files after a live update. This function is a simple wrapper
>> + * around luo_retrieve_file(), allowing callers to find a file by its token.
>> + *
>> + * The operation is idempotent; subsequent calls for the same token will return
>> + * a pointer to the same 'struct file' object.
>> + *
>> + * The caller receives a new reference to the file and must call fput() when it
>> + * is no longer needed. The file's lifetime is managed by LUO and any userspace
>> + * file descriptors. If the caller needs to hold a reference to the file beyond
>> + * the immediate scope, it must call get_file() itself.
>> + *
>
>I'm a little confused here: we say the op is idempotent, but also mention
>that the caller receives a new reference. I'm wondering about a situation
>where a driver calls this multiple times, incrementing the refcount with
>each call. Do we rely on flb_file_finish to drop all the refcounts?
>
>We should clarify the lifecycle requirements here: is the driver
>expected to call fput() for every single call to
>liveupdate_get_file_incoming(), or is the flb_finish callback intended
>to be a 'catch-all' that reaps these?

The user is supposed to call fput to release the reference as mentioned
in the kernel-doc above.

But I agree the wording around idempotency is a little confusing. I will
reword the text.
>
>> + * Context: Can be called from any context in the new kernel that has a handle
>> + *          to a restored session.
>> + * Return: 0 on success. Returns -ENOENT if no file with the matching token is
>> + *         found, or any other negative errno on failure.
>> + */
>> +int liveupdate_get_file_incoming(struct liveupdate_session *s, u64 token,
>> +				 struct file **filep)
>> +{
>> +	return luo_retrieve_file(luo_file_set_from_session(s), token, filep);
>> +}
>> diff --git a/kernel/liveupdate/luo_internal.h b/kernel/liveupdate/luo_internal.h
>> index 8083d8739b09..a24933d24fd9 100644
>> --- a/kernel/liveupdate/luo_internal.h
>> +++ b/kernel/liveupdate/luo_internal.h
>> @@ -77,6 +77,22 @@ struct luo_session {
>>  	struct mutex mutex;
>>  };
>>
>> +static inline struct liveupdate_session *luo_session_from_file_set(struct luo_file_set *file_set)
>> +{
>> +	struct luo_session *session;
>> +
>> +	session = container_of(file_set, struct luo_session, file_set);
>> +
>> +	return (struct liveupdate_session *)session;
>> +}
>> +
>> +static inline struct luo_file_set *luo_file_set_from_session(struct liveupdate_session *s)
>> +{
>> +	struct luo_session *session = (struct luo_session *)s;
>> +
>> +	return &session->file_set;
>> +}
>> +
>>  int luo_session_create(const char *name, struct file **filep);
>>  int luo_session_retrieve(const char *name, struct file **filep);
>>  int __init luo_session_setup_outgoing(void *fdt);
>> --
>>
>
>Thanks,
>Praan


* Re: [PATCH 02/14] iommu: Implement IOMMU core liveupdate skeleton
  2026-03-17 20:23           ` Pranjal Shrivastava
  2026-03-17 21:03             ` Vipin Sharma
@ 2026-03-18 17:49             ` Samiullah Khawaja
  1 sibling, 0 replies; 98+ messages in thread
From: Samiullah Khawaja @ 2026-03-18 17:49 UTC (permalink / raw)
  To: Pranjal Shrivastava
  Cc: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe, Robin Murphy, Kevin Tian, Alex Williamson,
	Shuah Khan, iommu, linux-kernel, kvm, Saeed Mahameed,
	Adithya Jayachandran, Parav Pandit, Leon Romanovsky, William Tu,
	Pratyush Yadav, Pasha Tatashin, David Matlack, Andrew Morton,
	Chris Li, Vipin Sharma, YiFei Zhu

On Tue, Mar 17, 2026 at 08:23:38PM +0000, Pranjal Shrivastava wrote:
>On Tue, Mar 17, 2026 at 08:13:47PM +0000, Samiullah Khawaja wrote:
>> On Tue, Mar 17, 2026 at 08:09:26PM +0000, Pranjal Shrivastava wrote:
>> > On Fri, Mar 13, 2026 at 06:42:05PM +0000, Samiullah Khawaja wrote:
>> > > On Thu, Mar 12, 2026 at 11:10:36PM +0000, Pranjal Shrivastava wrote:
>> > > > On Tue, Feb 03, 2026 at 10:09:36PM +0000, Samiullah Khawaja wrote:
>> > > > > +EXPORT_SYMBOL(iommu_get_preserved_data);
>> > > > > +
>> >
>> > Also, I don't see these helpers being used anywhere in this series yet?
>> > Should we add them with the phase-2 series? When we actually "Restore"
>> > the domain?
>>
>> Phase 1 does restore the iommu and domains during boot. But the
>> iommufd/hwpt state is not restored or re-associated with the restored
>> domains.
>>
>> These helpers are used to fetch the preserved state during boot.
>>
>> Thanks,
>> Sami
>
>I see these are used in PATCH 8, should we introduce these helpers in
>PATCH 8 where we actually use them? As we introduce iommu_restore_domain
>in PATCH 8?

Agreed. I will move the helper for getting the device state into PATCH
8. But the one that is used to fetch the iommu state is used by the vt-d
driver in PATCH 7. I will add a new patch before PATCH 7 to add that
helper.

Thanks,
Sami
>
>Thanks,
>Praan
>


* Re: [PATCH 02/14] iommu: Implement IOMMU core liveupdate skeleton
  2026-03-17 21:03             ` Vipin Sharma
@ 2026-03-18 18:51               ` Pranjal Shrivastava
  0 siblings, 0 replies; 98+ messages in thread
From: Pranjal Shrivastava @ 2026-03-18 18:51 UTC (permalink / raw)
  To: Vipin Sharma
  Cc: Samiullah Khawaja, David Woodhouse, Lu Baolu, Joerg Roedel,
	Will Deacon, Jason Gunthorpe, Robin Murphy, Kevin Tian,
	Alex Williamson, Shuah Khan, iommu, linux-kernel, kvm,
	Saeed Mahameed, Adithya Jayachandran, Parav Pandit,
	Leon Romanovsky, William Tu, Pratyush Yadav, Pasha Tatashin,
	David Matlack, Andrew Morton, Chris Li, YiFei Zhu

On Tue, Mar 17, 2026 at 02:03:16PM -0700, Vipin Sharma wrote:
> On Tue, Mar 17, 2026 at 08:23:38PM +0000, Pranjal Shrivastava wrote:
> > On Tue, Mar 17, 2026 at 08:13:47PM +0000, Samiullah Khawaja wrote:
> > > On Tue, Mar 17, 2026 at 08:09:26PM +0000, Pranjal Shrivastava wrote:
> > > > On Fri, Mar 13, 2026 at 06:42:05PM +0000, Samiullah Khawaja wrote:
> > > > > On Thu, Mar 12, 2026 at 11:10:36PM +0000, Pranjal Shrivastava wrote:
> > > > > > On Tue, Feb 03, 2026 at 10:09:36PM +0000, Samiullah Khawaja wrote:
> > > > > > > +EXPORT_SYMBOL(iommu_get_preserved_data);
> > > > > > > +
> > > > 
> > > > Also, I don't see these helpers being used anywhere in this series yet?
> > > > Should we add them with the phase-2 series? When we actually "Restore"
> > > > the domain?
> > > 
> > > Phase 1 does restore the iommu and domains during boot. But the
> > > iommufd/hwpt state is not restored or re-associated with the restored
> > > domains.
> > > 
> > > These helpers are used to fetch the preserved state during boot.
> > > 
> > > Thanks,
> > > Sami
> > 
> > I see these are used in PATCH 8, should we introduce these helpers in
> > PATCH 8 where we actually use them? As we introduce iommu_restore_domain
> > in PATCH 8?
> > 
> > Thanks,
> > Praan
> > 
> 
> Please trim your responses.

Ack, sorry for missing that in this thread.

Thanks,
Praan


* Re: [PATCH 06/14] iommu/vt-d: Implement device and iommu preserve/unpreserve ops
  2026-02-03 22:09 ` [PATCH 06/14] iommu/vt-d: Implement device and iommu preserve/unpreserve ops Samiullah Khawaja
@ 2026-03-19 16:04   ` Vipin Sharma
  2026-03-19 16:27     ` Samiullah Khawaja
  2026-03-20 23:01   ` Pranjal Shrivastava
  1 sibling, 1 reply; 98+ messages in thread
From: Vipin Sharma @ 2026-03-19 16:04 UTC (permalink / raw)
  To: Samiullah Khawaja
  Cc: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe, Robin Murphy, Kevin Tian, Alex Williamson,
	Shuah Khan, iommu, linux-kernel, kvm, Saeed Mahameed,
	Adithya Jayachandran, Parav Pandit, Leon Romanovsky, William Tu,
	Pratyush Yadav, Pasha Tatashin, David Matlack, Andrew Morton,
	Chris Li, Pranjal Shrivastava, YiFei Zhu

On Tue, Feb 03, 2026 at 10:09:40PM +0000, Samiullah Khawaja wrote:
>  /*
>   * Take a root_entry and return the Lower Context Table Pointer (LCTP)
>   * if marked present.
> @@ -2378,8 +2379,10 @@ void intel_iommu_shutdown(void)
>  		/* Disable PMRs explicitly here. */
>  		iommu_disable_protect_mem_regions(iommu);
>  
> -		/* Make sure the IOMMUs are switched off */
> -		iommu_disable_translation(iommu);
> +		if (!__maybe_clean_unpreserved_context_entries(iommu)) {
> +			/* Make sure the IOMMUs are switched off */
> +			iommu_disable_translation(iommu);
> +		}

I think instead of having a "maybe" function that might or might not
clear the context entries, it would be simpler to make the if-else
explicit here.

@@ -2375,8 +2375,11 @@ void intel_iommu_shutdown(void)
                /* Disable PMRs explicitly here. */
                iommu_disable_protect_mem_regions(iommu);

-               /* Make sure the IOMMUs are switched off */
-               iommu_disable_translation(iommu);
+               if (iommu->iommu.outgoing_preserved_state)
+                       clear_unpreserved_context_entries();
+               else
+                       /* Make sure the IOMMUs are switched off */
+                       iommu_disable_translation(iommu);
        }
 }

clear_unpreserved_context_entries() will explicitly do what
__maybe_clean_unpreserved_context_entries() is doing.

>  	}
>  }
>  
> @@ -2902,6 +2905,38 @@ static const struct iommu_dirty_ops intel_second_stage_dirty_ops = {
>  	.set_dirty_tracking = intel_iommu_set_dirty_tracking,
>  };
>  
> +#ifdef CONFIG_IOMMU_LIVEUPDATE
> +static bool __maybe_clean_unpreserved_context_entries(struct intel_iommu *iommu)

Nit: s/clean/clear, this matches domain_context_clear().

> +{
> +	struct device_domain_info *info;
> +	struct pci_dev *pdev = NULL;
> +
> +	if (!iommu->iommu.outgoing_preserved_state)
> +		return false;
> +
> +	for_each_pci_dev(pdev) {
> +		info = dev_iommu_priv_get(&pdev->dev);
> +		if (!info)
> +			continue;
> +
> +		if (info->iommu != iommu)
> +			continue;
> +
> +		if (dev_iommu_preserved_state(&pdev->dev))
> +			continue;
> +
> +		domain_context_clear(info);
> +	}
> +
> +	return true;
> +}
> +++ b/drivers/iommu/intel/liveupdate.c
> @@ -0,0 +1,134 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +
> +/*
> + * Copyright (C) 2025, Google LLC
> + * Author: Samiullah Khawaja <skhawaja@google.com>
> + */
> +
> +#define pr_fmt(fmt)    "iommu: liveupdate: " fmt

Should this be 'DMAR: liveupdate' to be consistent with other files in
intel iommu?

> +
> +#include <linux/kexec_handover.h>
> +#include <linux/liveupdate.h>
> +#include <linux/iommu-lu.h>
> +#include <linux/module.h>
> +#include <linux/pci.h>
> +
> +#include "iommu.h"
> +#include "../iommu-pages.h"
> +
> +static void unpreserve_iommu_context(struct intel_iommu *iommu, int end)

Nit: Can this be unpreserve_iommu_context_table(), similar to existing
free_context_table()? Same for corresponding preserve function

> +{
> +	struct context_entry *context;
> +	int i;
> +
> +	if (end < 0)
> +		end = ROOT_ENTRY_NR;

Can't we just pass ROOT_ENTRY_NR as end value instead of -1?



* Re: [PATCH 06/14] iommu/vt-d: Implement device and iommu preserve/unpreserve ops
  2026-03-19 16:04   ` Vipin Sharma
@ 2026-03-19 16:27     ` Samiullah Khawaja
  0 siblings, 0 replies; 98+ messages in thread
From: Samiullah Khawaja @ 2026-03-19 16:27 UTC (permalink / raw)
  To: Vipin Sharma
  Cc: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe, Robin Murphy, Kevin Tian, Alex Williamson,
	Shuah Khan, iommu, linux-kernel, kvm, Saeed Mahameed,
	Adithya Jayachandran, Parav Pandit, Leon Romanovsky, William Tu,
	Pratyush Yadav, Pasha Tatashin, David Matlack, Andrew Morton,
	Chris Li, Pranjal Shrivastava, YiFei Zhu

On Thu, Mar 19, 2026 at 09:04:31AM -0700, Vipin Sharma wrote:
>On Tue, Feb 03, 2026 at 10:09:40PM +0000, Samiullah Khawaja wrote:
>>  /*
>>   * Take a root_entry and return the Lower Context Table Pointer (LCTP)
>>   * if marked present.
>> @@ -2378,8 +2379,10 @@ void intel_iommu_shutdown(void)
>>  		/* Disable PMRs explicitly here. */
>>  		iommu_disable_protect_mem_regions(iommu);
>>
>> -		/* Make sure the IOMMUs are switched off */
>> -		iommu_disable_translation(iommu);
>> +		if (!__maybe_clean_unpreserved_context_entries(iommu)) {
>> +			/* Make sure the IOMMUs are switched off */
>> +			iommu_disable_translation(iommu);
>> +		}
>
>I think instead of having a "maybe" function that might or might not
>clear the context entries, it would be simpler to make the if-else
>explicit here.
>
>@@ -2375,8 +2375,11 @@ void intel_iommu_shutdown(void)
>                /* Disable PMRs explicitly here. */
>                iommu_disable_protect_mem_regions(iommu);
>
>-               /* Make sure the IOMMUs are switched off */
>-               iommu_disable_translation(iommu);
>+               if (iommu->iommu.outgoing_preserved_state)

This is only defined when CONFIG_LIVEUPDATE is enabled and would need
the #ifdefs :(.

But I can add a helper in the skeleton patch,
iommu_outgoing_preserved_state() or iommu_is_outgoing_preserved(), that
would return NULL or false respectively when liveupdate is disabled, and
use it for this check.

I will update this in the next revision.
>+                       clear_unpreserved_context_entries();
>+               else
>+                       /* Make sure the IOMMUs are switched off */
>+                       iommu_disable_translation(iommu);
>        }
> }
>
>clear_unpreserved_context_entries() will explicitly do what
>__maybe_clean_unpreserved_context_entries() is doing.
>
>>  	}
>>  }
>>
>> @@ -2902,6 +2905,38 @@ static const struct iommu_dirty_ops intel_second_stage_dirty_ops = {
>>  	.set_dirty_tracking = intel_iommu_set_dirty_tracking,
>>  };
>>
>> +#ifdef CONFIG_IOMMU_LIVEUPDATE
>> +static bool __maybe_clean_unpreserved_context_entries(struct intel_iommu *iommu)
>
>Nit: s/clean/clear, this matches domain_context_clear().

Agreed.
>
>> +{
>> +	struct device_domain_info *info;
>> +	struct pci_dev *pdev = NULL;
>> +
>> +	if (!iommu->iommu.outgoing_preserved_state)
>> +		return false;
>> +
>> +	for_each_pci_dev(pdev) {
>> +		info = dev_iommu_priv_get(&pdev->dev);
>> +		if (!info)
>> +			continue;
>> +
>> +		if (info->iommu != iommu)
>> +			continue;
>> +
>> +		if (dev_iommu_preserved_state(&pdev->dev))
>> +			continue;
>> +
>> +		domain_context_clear(info);
>> +	}
>> +
>> +	return true;
>> +}
>> +++ b/drivers/iommu/intel/liveupdate.c
>> @@ -0,0 +1,134 @@
>> +// SPDX-License-Identifier: GPL-2.0-only
>> +
>> +/*
>> + * Copyright (C) 2025, Google LLC
>> + * Author: Samiullah Khawaja <skhawaja@google.com>
>> + */
>> +
>> +#define pr_fmt(fmt)    "iommu: liveupdate: " fmt

>Should this be 'DMAR: liveupdate' to be consistent with other files in
>intel iommu?

Agreed. Will change.
>
>> +
>> +#include <linux/kexec_handover.h>
>> +#include <linux/liveupdate.h>
>> +#include <linux/iommu-lu.h>
>> +#include <linux/module.h>
>> +#include <linux/pci.h>
>> +
>> +#include "iommu.h"
>> +#include "../iommu-pages.h"
>> +
>> +static void unpreserve_iommu_context(struct intel_iommu *iommu, int end)
>
>Nit: Can this be unpreserve_iommu_context_table(), similar to existing
>free_context_table()? Same for corresponding preserve function

Agreed. I will change this.
>
>> +{
>> +	struct context_entry *context;
>> +	int i;
>> +
>> +	if (end < 0)
>> +		end = ROOT_ENTRY_NR;
>
>Can't we just pass ROOT_ENTRY_NR as end value instead of -1?

Yes, we can do that. Will update.
>

Thanks,
Sami


* Re: [PATCH 07/14] iommu/vt-d: Restore IOMMU state and reclaimed domain ids
  2026-02-03 22:09 ` [PATCH 07/14] iommu/vt-d: Restore IOMMU state and reclaimed domain ids Samiullah Khawaja
@ 2026-03-19 20:54   ` Vipin Sharma
  2026-03-20  1:05     ` Samiullah Khawaja
  2026-03-22 19:51   ` Pranjal Shrivastava
  1 sibling, 1 reply; 98+ messages in thread
From: Vipin Sharma @ 2026-03-19 20:54 UTC (permalink / raw)
  To: Samiullah Khawaja
  Cc: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe, Robin Murphy, Kevin Tian, Alex Williamson,
	Shuah Khan, iommu, linux-kernel, kvm, Saeed Mahameed,
	Adithya Jayachandran, Parav Pandit, Leon Romanovsky, William Tu,
	Pratyush Yadav, Pasha Tatashin, David Matlack, Andrew Morton,
	Chris Li, Pranjal Shrivastava, YiFei Zhu

On Tue, Feb 03, 2026 at 10:09:41PM +0000, Samiullah Khawaja wrote:
> During boot fetch the preserved state of IOMMU unit and if found then
> restore the state.
> 
> - Reuse the root_table that was preserved in the previous kernel.
> - Reclaim the domain ids of the preserved domains for each preserved
>   devices so these are not acquired by another domain.

Can this be two patches? One just restoring the root table and a second
one reclaiming the domain ids.

> -static void init_translation_status(struct intel_iommu *iommu)
> +static void init_translation_status(struct intel_iommu *iommu, bool restoring)
>  {
>  	u32 gsts;
>  
>  	gsts = readl(iommu->reg + DMAR_GSTS_REG);
> -	if (gsts & DMA_GSTS_TES)
> +	if (!restoring && (gsts & DMA_GSTS_TES))
>  		iommu->flags |= VTD_FLAG_TRANS_PRE_ENABLED;
>  }

I think we can just check in the caller, init_dmars(), and not call this
function. We are already modifying that function, so no need to make
changes here as well. WDYT?

>  
> @@ -670,10 +670,16 @@ void dmar_fault_dump_ptes(struct intel_iommu *iommu, u16 source_id,
>  #endif
>  
>  /* iommu handling */
> -static int iommu_alloc_root_entry(struct intel_iommu *iommu)
> +static int iommu_alloc_root_entry(struct intel_iommu *iommu, struct iommu_ser *restored_state)

Nit: Since we are using iommu_ser in other places, I would recommend
keeping the same variable name here as well.

> @@ -1636,8 +1643,10 @@ static int __init init_dmars(void)
>  						   intel_pasid_max_id);
>  		}
>  
> +		iommu_ser = iommu_get_preserved_data(iommu->reg_phys, IOMMU_INTEL);
> +
>  		intel_iommu_init_qi(iommu);
> -		init_translation_status(iommu);
> +		init_translation_status(iommu, !!iommu_ser);

Just put 'if' here to avoid changes in init_translation_status().

> +static int __restore_used_domain_ids(struct device_ser *ser, void *arg)

Nit: Just curious, why two __?

> +{
> +	int id = ser->domain_iommu_ser.did;
> +	struct intel_iommu *iommu = arg;
> +
> +	ida_alloc_range(&iommu->domain_ida, id, id, GFP_ATOMIC);

Should we check if allocation succeeded or not?

> +	return 0;
> +}
> +
> +void intel_iommu_liveupdate_restore_root_table(struct intel_iommu *iommu,
> +					       struct iommu_ser *iommu_ser)
> +{
> +	BUG_ON(!kho_restore_folio(iommu_ser->intel.root_table));
> +	iommu->root_entry = __va(iommu_ser->intel.root_table);
> +
> +	restore_iommu_context(iommu);
> +	iommu_for_each_preserved_device(__restore_used_domain_ids, iommu);
> +	pr_info("Restored IOMMU[0x%llx] Root Table at: 0x%llx\n",
> +		iommu->reg_phys, iommu_ser->intel.root_table);

Should raw pointer values be printed like this?


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 10/14] iommufd-lu: Implement ioctl to let userspace mark an HWPT to be preserved
  2026-02-03 22:09 ` [PATCH 10/14] iommufd-lu: Implement ioctl to let userspace mark an HWPT to be preserved Samiullah Khawaja
@ 2026-03-19 23:35   ` Vipin Sharma
  2026-03-20  0:40     ` Samiullah Khawaja
  2026-03-25 14:37   ` Pranjal Shrivastava
  1 sibling, 1 reply; 98+ messages in thread
From: Vipin Sharma @ 2026-03-19 23:35 UTC (permalink / raw)
  To: Samiullah Khawaja
  Cc: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe, YiFei Zhu, Robin Murphy, Kevin Tian,
	Alex Williamson, Shuah Khan, iommu, linux-kernel, kvm,
	Saeed Mahameed, Adithya Jayachandran, Parav Pandit,
	Leon Romanovsky, William Tu, Pratyush Yadav, Pasha Tatashin,
	David Matlack, Andrew Morton, Chris Li, Pranjal Shrivastava

On Tue, Feb 03, 2026 at 10:09:44PM +0000, Samiullah Khawaja wrote:
> From: YiFei Zhu <zhuyifei@google.com>
> @@ -374,6 +374,10 @@ struct iommufd_hwpt_paging {
>  	bool auto_domain : 1;
>  	bool enforce_cache_coherency : 1;
>  	bool nest_parent : 1;
> +#ifdef CONFIG_IOMMU_LIVEUPDATE
> +	bool lu_preserve : 1;
> +	u32 lu_token;

Should we use the full name, i.e. liveupdate, here and in other places
in this series?

> diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
> index 5cc4b08c25f5..e1a9b3051f65 100644
> --- a/drivers/iommu/iommufd/main.c
> +++ b/drivers/iommu/iommufd/main.c
> @@ -493,6 +493,8 @@ static const struct iommufd_ioctl_op iommufd_ioctl_ops[] = {
>  		 __reserved),
>  	IOCTL_OP(IOMMU_VIOMMU_ALLOC, iommufd_viommu_alloc_ioctl,
>  		 struct iommu_viommu_alloc, out_viommu_id),
> +	IOCTL_OP(IOMMU_HWPT_LU_SET_PRESERVE, iommufd_hwpt_lu_set_preserve,
> +		 struct iommu_hwpt_lu_set_preserve, hwpt_token),

Shouldn't struct iommu_hwpt_lu_set_preserve{} be added to union ucmd_buffer{}
above in this file?


* Re: [PATCH 10/14] iommufd-lu: Implement ioctl to let userspace mark an HWPT to be preserved
  2026-03-19 23:35   ` Vipin Sharma
@ 2026-03-20  0:40     ` Samiullah Khawaja
  2026-03-20 23:34       ` Vipin Sharma
  0 siblings, 1 reply; 98+ messages in thread
From: Samiullah Khawaja @ 2026-03-20  0:40 UTC (permalink / raw)
  To: Vipin Sharma
  Cc: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe, YiFei Zhu, Robin Murphy, Kevin Tian,
	Alex Williamson, Shuah Khan, iommu, linux-kernel, kvm,
	Saeed Mahameed, Adithya Jayachandran, Parav Pandit,
	Leon Romanovsky, William Tu, Pratyush Yadav, Pasha Tatashin,
	David Matlack, Andrew Morton, Chris Li, Pranjal Shrivastava

On Thu, Mar 19, 2026 at 04:35:32PM -0700, Vipin Sharma wrote:
>On Tue, Feb 03, 2026 at 10:09:44PM +0000, Samiullah Khawaja wrote:
>> From: YiFei Zhu <zhuyifei@google.com>
>> @@ -374,6 +374,10 @@ struct iommufd_hwpt_paging {
>>  	bool auto_domain : 1;
>>  	bool enforce_cache_coherency : 1;
>>  	bool nest_parent : 1;
>> +#ifdef CONFIG_IOMMU_LIVEUPDATE
>> +	bool lu_preserve : 1;
>> +	u32 lu_token;
>
>Should we use the full name, i.e. liveupdate, here and in other places
>in this series?

I think using the full name liveupdate would be too long in other places
in this series. Also, there are other examples of "luo" being used as a
short form; please see:

https://lore.kernel.org/all/20251125165850.3389713-15-pasha.tatashin@soleen.com/
>
>> diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
>> index 5cc4b08c25f5..e1a9b3051f65 100644
>> --- a/drivers/iommu/iommufd/main.c
>> +++ b/drivers/iommu/iommufd/main.c
>> @@ -493,6 +493,8 @@ static const struct iommufd_ioctl_op iommufd_ioctl_ops[] = {
>>  		 __reserved),
>>  	IOCTL_OP(IOMMU_VIOMMU_ALLOC, iommufd_viommu_alloc_ioctl,
>>  		 struct iommu_viommu_alloc, out_viommu_id),
>> +	IOCTL_OP(IOMMU_HWPT_LU_SET_PRESERVE, iommufd_hwpt_lu_set_preserve,
>> +		 struct iommu_hwpt_lu_set_preserve, hwpt_token),
>
>Shouldn't struct iommu_hwpt_lu_set_preserve{} be added to union ucmd_buffer{}
>above in this file?

Agreed. Will update.
>

* Re: [PATCH 07/14] iommu/vt-d: Restore IOMMU state and reclaimed domain ids
  2026-03-19 20:54   ` Vipin Sharma
@ 2026-03-20  1:05     ` Samiullah Khawaja
  0 siblings, 0 replies; 98+ messages in thread
From: Samiullah Khawaja @ 2026-03-20  1:05 UTC (permalink / raw)
  To: Vipin Sharma
  Cc: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe, Robin Murphy, Kevin Tian, Alex Williamson,
	Shuah Khan, iommu, linux-kernel, kvm, Saeed Mahameed,
	Adithya Jayachandran, Parav Pandit, Leon Romanovsky, William Tu,
	Pratyush Yadav, Pasha Tatashin, David Matlack, Andrew Morton,
	Chris Li, Pranjal Shrivastava, YiFei Zhu

Hi Vipin,

On Thu, Mar 19, 2026 at 01:54:27PM -0700, Vipin Sharma wrote:
>On Tue, Feb 03, 2026 at 10:09:41PM +0000, Samiullah Khawaja wrote:
>> During boot, fetch the preserved state of the IOMMU unit and, if
>> found, restore it.
>>
>> - Reuse the root_table that was preserved in the previous kernel.
>> - Reclaim the domain ids of the preserved domains for each preserved
>>   device so they are not acquired by another domain.
>
>Can this be two patches? One just restoring the root table and a second
>one reclaiming the domain ids.

This patch is restoring the state of the preserved IOMMU. The domain ids
are part of the iommu state and are used in the context entries being
restored in this patch, which is why I kept these in one patch. I will
reword the subject of this patch and add a note about this in the commit
message.
>
>> -static void init_translation_status(struct intel_iommu *iommu)
>> +static void init_translation_status(struct intel_iommu *iommu, bool restoring)
>>  {
>>  	u32 gsts;
>>
>>  	gsts = readl(iommu->reg + DMAR_GSTS_REG);
>> -	if (gsts & DMA_GSTS_TES)
>> +	if (!restoring && (gsts & DMA_GSTS_TES))
>>  		iommu->flags |= VTD_FLAG_TRANS_PRE_ENABLED;
>>  }
>
>I think we can just check in the caller, init_dmars(), and not call this
>function. We are already modifying that function, so no need to make
>changes here as well. WDYT?

Yes, I think we can move the if outside.
>
>>
>> @@ -670,10 +670,16 @@ void dmar_fault_dump_ptes(struct intel_iommu *iommu, u16 source_id,
>>  #endif
>>
>>  /* iommu handling */
>> -static int iommu_alloc_root_entry(struct intel_iommu *iommu)
>> +static int iommu_alloc_root_entry(struct intel_iommu *iommu, struct iommu_ser *restored_state)
>
>Nit: Since we are using iommu_ser in other places, I would recommend
>keeping the same variable name here as well.

Hmm... that is a good point. I think I will update the other places and
change it to restored_state/preserved_state to better reflect the
context.
>
>> @@ -1636,8 +1643,10 @@ static int __init init_dmars(void)
>>  						   intel_pasid_max_id);
>>  		}
>>
>> +		iommu_ser = iommu_get_preserved_data(iommu->reg_phys, IOMMU_INTEL);
>> +
>>  		intel_iommu_init_qi(iommu);
>> -		init_translation_status(iommu);
>> +		init_translation_status(iommu, !!iommu_ser);
>
>Just put 'if' here to avoid changes in init_translation_status().

Agreed.
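
Something like the following against this patch's context (untested
sketch):

-		init_translation_status(iommu, !!iommu_ser);
+		if (!iommu_ser)
+			init_translation_status(iommu);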
>
>> +static int __restore_used_domain_ids(struct device_ser *ser, void *arg)
>
>Nit: Just curious, why two __?

I will remove these.
>
>> +{
>> +	int id = ser->domain_iommu_ser.did;
>> +	struct intel_iommu *iommu = arg;
>> +
>> +	ida_alloc_range(&iommu->domain_ida, id, id, GFP_ATOMIC);
>
>Should we check if allocation succeeded or not?

Good point. This should not fail as we are reserving these ids during
init. But I think I will add a WARN here.
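
Something like the following (untested sketch; ida_alloc_range() returns
the allocated id on success, so a mismatch indicates failure):

	WARN_ON(ida_alloc_range(&iommu->domain_ida, id, id,
				GFP_ATOMIC) != id);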
>
>> +	return 0;
>> +}
>> +
>> +void intel_iommu_liveupdate_restore_root_table(struct intel_iommu *iommu,
>> +					       struct iommu_ser *iommu_ser)
>> +{
>> +	BUG_ON(!kho_restore_folio(iommu_ser->intel.root_table));
>> +	iommu->root_entry = __va(iommu_ser->intel.root_table);
>> +
>> +	restore_iommu_context(iommu);
>> +	iommu_for_each_preserved_device(__restore_used_domain_ids, iommu);
>> +	pr_info("Restored IOMMU[0x%llx] Root Table at: 0x%llx\n",
>> +		iommu->reg_phys, iommu_ser->intel.root_table);
>
>Should raw pointer values be printed like this?

I will remove this.
>

* Re: [PATCH 04/14] iommu/pages: Add APIs to preserve/unpreserve/restore iommu pages
  2026-03-17 20:59   ` Vipin Sharma
@ 2026-03-20  9:28     ` Pranjal Shrivastava
  2026-03-20 18:27       ` Samiullah Khawaja
  2026-03-20 11:01     ` Pranjal Shrivastava
  1 sibling, 1 reply; 98+ messages in thread
From: Pranjal Shrivastava @ 2026-03-20  9:28 UTC (permalink / raw)
  To: Vipin Sharma
  Cc: Samiullah Khawaja, David Woodhouse, Lu Baolu, Joerg Roedel,
	Will Deacon, Jason Gunthorpe, Robin Murphy, Kevin Tian,
	Alex Williamson, Shuah Khan, iommu, linux-kernel, kvm,
	Saeed Mahameed, Adithya Jayachandran, Parav Pandit,
	Leon Romanovsky, William Tu, Pratyush Yadav, Pasha Tatashin,
	David Matlack, Andrew Morton, Chris Li, YiFei Zhu

On Tue, Mar 17, 2026 at 01:59:03PM -0700, Vipin Sharma wrote:
> On Tue, Feb 03, 2026 at 10:09:38PM +0000, Samiullah Khawaja wrote:
> > +
> > +void iommu_unpreserve_pages(struct iommu_pages_list *list, int count)
> > +{
> > +	struct ioptdesc *iopt;
> > +
> > +	if (!count)
> > +		return;
> > +
> > +	/* If less than zero then unpreserve all pages. */
> > +	if (count < 0)
> > +		count = 0;
> > +
> > +	list_for_each_entry(iopt, &list->pages, iopt_freelist_elm) {
> > +		kho_unpreserve_folio(ioptdesc_folio(iopt));
> > +		if (count > 0 && --count ==  0)
> > +			break;
> > +	}
> > +}
> 
> I see you are trying to have a common function for error handling in
> iommu_preserve_pages() below and for unpreserving in the next patch,
> with an overloaded meaning of count.
> 
>  < 0 means unpreserve all of the pages
>  = 0 means do nothing
>  > 0 unpreserve these many pages.
> 
> It seems non-intuitive to me, and the code also ends up having multiple
> ifs. I would recommend just having this function go through all of the
> pages without passing a count argument. In the iommu_preserve_pages()
> function below, handle the error case with a "goto err" statement. If
> this is the way things happen in the iommu codebase, feel free to
> ignore this suggestion.
> 

+1 to this. The current version with the overloaded count is
counter-intuitive, as it relies on an implicit API contract. Since we're
using lists here, I'd recommend using the
list_for_each_entry_continue_reverse() helper:

+void iommu_unpreserve_pages(struct iommu_pages_list *list)
+{
+	struct ioptdesc *iopt;
+
+	list_for_each_entry(iopt, &list->pages, iopt_freelist_elm)
+		kho_unpreserve_folio(ioptdesc_folio(iopt));
+}
+EXPORT_SYMBOL_GPL(iommu_unpreserve_pages);
+
+int iommu_preserve_pages(struct iommu_pages_list *list)
+{
+	struct ioptdesc *iopt;
+	int ret;
+
+	list_for_each_entry(iopt, &list->pages, iopt_freelist_elm) {
+		ret = kho_preserve_folio(ioptdesc_folio(iopt));
+		if (ret)
+			goto err_rollback;
+	}
+	return 0;
+
+err_rollback:
+	/* Rollback only the successfully preserved pages */
+	list_for_each_entry_continue_reverse(iopt, &list->pages, iopt_freelist_elm)
+		kho_unpreserve_folio(ioptdesc_folio(iopt));
+	return ret;
+}

[ ---- snip >8---- ]

Thanks,
Praan

* Re: [PATCH 04/14] iommu/pages: Add APIs to preserve/unpreserve/restore iommu pages
  2026-03-17 20:59   ` Vipin Sharma
  2026-03-20  9:28     ` Pranjal Shrivastava
@ 2026-03-20 11:01     ` Pranjal Shrivastava
  2026-03-20 18:56       ` Samiullah Khawaja
  1 sibling, 1 reply; 98+ messages in thread
From: Pranjal Shrivastava @ 2026-03-20 11:01 UTC (permalink / raw)
  To: Vipin Sharma
  Cc: Samiullah Khawaja, David Woodhouse, Lu Baolu, Joerg Roedel,
	Will Deacon, Jason Gunthorpe, Robin Murphy, Kevin Tian,
	Alex Williamson, Shuah Khan, iommu, linux-kernel, kvm,
	Saeed Mahameed, Adithya Jayachandran, Parav Pandit,
	Leon Romanovsky, William Tu, Pratyush Yadav, Pasha Tatashin,
	David Matlack, Andrew Morton, Chris Li, YiFei Zhu

On Tue, Mar 17, 2026 at 01:59:03PM -0700, Vipin Sharma wrote:
> On Tue, Feb 03, 2026 at 10:09:38PM +0000, Samiullah Khawaja wrote:
> > +

[------- snip >8 --------]

> > +EXPORT_SYMBOL_GPL(iommu_unpreserve_pages);
> > +
> > +void iommu_restore_page(u64 phys)
> > +{
> > +	struct ioptdesc *iopt;
> > +	struct folio *folio;
> > +	unsigned long pgcnt;
> > +	unsigned int order;
> > +
> > +	folio = kho_restore_folio(phys);
> > +	BUG_ON(!folio);
> > +
> > +	iopt = folio_ioptdesc(folio);
> > +
> > +	order = folio_order(folio);
> > +	pgcnt = 1UL << order;
> > +	mod_node_page_state(folio_pgdat(folio), NR_IOMMU_PAGES, pgcnt);
> > +	lruvec_stat_mod_folio(folio, NR_SECONDARY_PAGETABLE, pgcnt);
> 
> The above two functions are also used in two other places in this file;
> let's create a common function for these two operations.
> 

I was thinking about this: of the two other places, one is
`__iommu_free_desc`, where we are supposed to subtract pgcnt. But I
think we can follow the -pgcnt pattern we have currently and add a
helper `iommu_folio_update_stats(folio, nr_pages)` where we can pass
-pgcnt to subtract the pages. Something like:

+static inline void iommu_folio_update_stats(struct folio *folio, long nr_pages)
+{
+	mod_node_page_state(folio_pgdat(folio), NR_IOMMU_PAGES, nr_pages);
+	lruvec_stat_mod_folio(folio, NR_SECONDARY_PAGETABLE, nr_pages);
+}

@@ -86,8 +91,7 @@ static void __iommu_free_desc(struct ioptdesc *iopt)
 	if (IOMMU_PAGES_USE_DMA_API)
 		WARN_ON_ONCE(iopt->incoherent);

-	mod_node_page_state(folio_pgdat(folio), NR_IOMMU_PAGES, -pgcnt);
-	lruvec_stat_mod_folio(folio, NR_SECONDARY_PAGETABLE, -pgcnt);
+	iommu_folio_update_stats(folio, -(long)pgcnt);
 	folio_put(folio);
 }

But I'm unsure if passing -(long)pgcnt is a good idea?

> > +}
> > +EXPORT_SYMBOL_GPL(iommu_restore_page);
> > +

Thanks,
Praan

* Re: [PATCH 04/14] iommu/pages: Add APIs to preserve/unpreserve/restore iommu pages
  2026-03-03 18:41     ` Samiullah Khawaja
@ 2026-03-20 17:27       ` Pranjal Shrivastava
  2026-03-20 18:12         ` Samiullah Khawaja
  0 siblings, 1 reply; 98+ messages in thread
From: Pranjal Shrivastava @ 2026-03-20 17:27 UTC (permalink / raw)
  To: Samiullah Khawaja
  Cc: Ankit Soni, David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe, Robin Murphy, Kevin Tian, Alex Williamson,
	Shuah Khan, iommu, linux-kernel, kvm, Saeed Mahameed,
	Adithya Jayachandran, Parav Pandit, Leon Romanovsky, William Tu,
	Pratyush Yadav, Pasha Tatashin, David Matlack, Andrew Morton,
	Chris Li, Vipin Sharma, YiFei Zhu

On Tue, Mar 03, 2026 at 06:41:26PM +0000, Samiullah Khawaja wrote:
> On Tue, Mar 03, 2026 at 04:42:02PM +0000, Ankit Soni wrote:
> > On Tue, Feb 03, 2026 at 10:09:38PM +0000, Samiullah Khawaja wrote:
> > > IOMMU pages are allocated/freed using APIs using struct ioptdesc. For
> > > the proper preservation and restoration of ioptdesc add helper
> > > functions.
> > > 
> > > Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
> > > ---
> > >  drivers/iommu/iommu-pages.c | 74 +++++++++++++++++++++++++++++++++++++
> > >  drivers/iommu/iommu-pages.h | 30 +++++++++++++++
> > >  2 files changed, 104 insertions(+)
> > > 
> > > diff --git a/drivers/iommu/iommu-pages.c b/drivers/iommu/iommu-pages.c
> > > index 3bab175d8557..588a8f19b196 100644
> > > --- a/drivers/iommu/iommu-pages.c
> > > +++ b/drivers/iommu/iommu-pages.c
> > > @@ -6,6 +6,7 @@
> > >  #include "iommu-pages.h"
> > >  #include <linux/dma-mapping.h>
> > >  #include <linux/gfp.h>
> > > +#include <linux/kexec_handover.h>
> > >  #include <linux/mm.h>
> > > 
> > >  #define IOPTDESC_MATCH(pg_elm, elm)                    \
> > > @@ -131,6 +132,79 @@ void iommu_put_pages_list(struct iommu_pages_list *list)
> > >  }
> > >  EXPORT_SYMBOL_GPL(iommu_put_pages_list);
> > > 
> > > +#if IS_ENABLED(CONFIG_IOMMU_LIVEUPDATE)
> > > +void iommu_unpreserve_page(void *virt)
> > > +{
> > > +	kho_unpreserve_folio(ioptdesc_folio(virt_to_ioptdesc(virt)));
> > > +}
> > > +EXPORT_SYMBOL_GPL(iommu_unpreserve_page);
> > > +
> > > +int iommu_preserve_page(void *virt)
> > > +{
> > > +	return kho_preserve_folio(ioptdesc_folio(virt_to_ioptdesc(virt)));
> > > +}
> > > +EXPORT_SYMBOL_GPL(iommu_preserve_page);
> > > +
> > > +void iommu_unpreserve_pages(struct iommu_pages_list *list, int count)
> > > +{
> > > +	struct ioptdesc *iopt;
> > > +
> > > +	if (!count)
> > > +		return;
> > > +
> > > +	/* If less than zero then unpreserve all pages. */
> > > +	if (count < 0)
> > > +		count = 0;
> > > +
> > > +	list_for_each_entry(iopt, &list->pages, iopt_freelist_elm) {
> > > +		kho_unpreserve_folio(ioptdesc_folio(iopt));
> > > +		if (count > 0 && --count ==  0)
> > > +			break;
> > > +	}
> > > +}
> > > +EXPORT_SYMBOL_GPL(iommu_unpreserve_pages);
> > > +
> > > +void iommu_restore_page(u64 phys)
> > > +{
> > > +	struct ioptdesc *iopt;
> > > +	struct folio *folio;
> > > +	unsigned long pgcnt;
> > > +	unsigned int order;
> > > +
> > > +	folio = kho_restore_folio(phys);
> > > +	BUG_ON(!folio);
> > > +
> > > +	iopt = folio_ioptdesc(folio);
> > 
> > iopt->incoherent = false; should be here?
> > 
> 
> Yes this should be set here. I will update this.

I'm wondering if we are silently losing state here. What if the 
preserved page was actually incoherent in the previous kernel?
I understand we likely need to initialize it to false here because we
don't have a dev pointer for DMA sync operations at this low level (though
x86 uses clflush).

But when is it set back to "incoherent" again? I don't see that 
happening during the driver re-attach phase?

Should we at least mention that this API intentionally overwrites the
preserved coherency state and that these pages must explicitly be marked
incoherent again later by the driver based on its preserved HW state OR
by the IOMMUFD re-attach?

> > > +
> > > +	order = folio_order(folio);
> > > +	pgcnt = 1UL << order;
> > > +	mod_node_page_state(folio_pgdat(folio), NR_IOMMU_PAGES, pgcnt);
> > > +	lruvec_stat_mod_folio(folio, NR_SECONDARY_PAGETABLE, pgcnt);
> > > +}
> > > +EXPORT_SYMBOL_GPL(iommu_restore_page);

[------ snip >8 -------]

Thanks,
Praan

* Re: [PATCH 04/14] iommu/pages: Add APIs to preserve/unpreserve/restore iommu pages
  2026-03-20 17:27       ` Pranjal Shrivastava
@ 2026-03-20 18:12         ` Samiullah Khawaja
  0 siblings, 0 replies; 98+ messages in thread
From: Samiullah Khawaja @ 2026-03-20 18:12 UTC (permalink / raw)
  To: Pranjal Shrivastava
  Cc: Ankit Soni, David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe, Robin Murphy, Kevin Tian, Alex Williamson,
	Shuah Khan, iommu, linux-kernel, kvm, Saeed Mahameed,
	Adithya Jayachandran, Parav Pandit, Leon Romanovsky, William Tu,
	Pratyush Yadav, Pasha Tatashin, David Matlack, Andrew Morton,
	Chris Li, Vipin Sharma, YiFei Zhu

On Fri, Mar 20, 2026 at 05:27:46PM +0000, Pranjal Shrivastava wrote:
>On Tue, Mar 03, 2026 at 06:41:26PM +0000, Samiullah Khawaja wrote:
>> On Tue, Mar 03, 2026 at 04:42:02PM +0000, Ankit Soni wrote:
>> > On Tue, Feb 03, 2026 at 10:09:38PM +0000, Samiullah Khawaja wrote:
>> > > IOMMU pages are allocated/freed using APIs using struct ioptdesc. For
>> > > the proper preservation and restoration of ioptdesc add helper
>> > > functions.
>> > >
>> > > Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
>> > > ---
>> > >  drivers/iommu/iommu-pages.c | 74 +++++++++++++++++++++++++++++++++++++
>> > >  drivers/iommu/iommu-pages.h | 30 +++++++++++++++
>> > >  2 files changed, 104 insertions(+)
>> > >
>> > > diff --git a/drivers/iommu/iommu-pages.c b/drivers/iommu/iommu-pages.c
>> > > index 3bab175d8557..588a8f19b196 100644
>> > > --- a/drivers/iommu/iommu-pages.c
>> > > +++ b/drivers/iommu/iommu-pages.c
>> > > @@ -6,6 +6,7 @@
>> > >  #include "iommu-pages.h"
>> > >  #include <linux/dma-mapping.h>
>> > >  #include <linux/gfp.h>
>> > > +#include <linux/kexec_handover.h>
>> > >  #include <linux/mm.h>
>> > >
>> > >  #define IOPTDESC_MATCH(pg_elm, elm)                    \
>> > > @@ -131,6 +132,79 @@ void iommu_put_pages_list(struct iommu_pages_list *list)
>> > >  }
>> > >  EXPORT_SYMBOL_GPL(iommu_put_pages_list);
>> > >
>> > > +#if IS_ENABLED(CONFIG_IOMMU_LIVEUPDATE)
>> > > +void iommu_unpreserve_page(void *virt)
>> > > +{
>> > > +	kho_unpreserve_folio(ioptdesc_folio(virt_to_ioptdesc(virt)));
>> > > +}
>> > > +EXPORT_SYMBOL_GPL(iommu_unpreserve_page);
>> > > +
>> > > +int iommu_preserve_page(void *virt)
>> > > +{
>> > > +	return kho_preserve_folio(ioptdesc_folio(virt_to_ioptdesc(virt)));
>> > > +}
>> > > +EXPORT_SYMBOL_GPL(iommu_preserve_page);
>> > > +
>> > > +void iommu_unpreserve_pages(struct iommu_pages_list *list, int count)
>> > > +{
>> > > +	struct ioptdesc *iopt;
>> > > +
>> > > +	if (!count)
>> > > +		return;
>> > > +
>> > > +	/* If less than zero then unpreserve all pages. */
>> > > +	if (count < 0)
>> > > +		count = 0;
>> > > +
>> > > +	list_for_each_entry(iopt, &list->pages, iopt_freelist_elm) {
>> > > +		kho_unpreserve_folio(ioptdesc_folio(iopt));
>> > > +		if (count > 0 && --count ==  0)
>> > > +			break;
>> > > +	}
>> > > +}
>> > > +EXPORT_SYMBOL_GPL(iommu_unpreserve_pages);
>> > > +
>> > > +void iommu_restore_page(u64 phys)
>> > > +{
>> > > +	struct ioptdesc *iopt;
>> > > +	struct folio *folio;
>> > > +	unsigned long pgcnt;
>> > > +	unsigned int order;
>> > > +
>> > > +	folio = kho_restore_folio(phys);
>> > > +	BUG_ON(!folio);
>> > > +
>> > > +	iopt = folio_ioptdesc(folio);
>> >
>> > iopt->incoherent = false; should be here?
>> >
>>
>> Yes this should be set here. I will update this.
>
>I'm wondering if we are silently losing state here. What if the
>preserved page was actually incoherent in the previous kernel?
>I understand we likely need to initialize it to false here because we
>don't have a dev pointer for DMA sync operations at this low level (though
>x86 uses clflush).
>
>But when is it set back to "incoherent" again? I don't see that
>happening during the driver re-attach phase?

This can be done during restore_domain, as the domain has a reference to
the dev when it is recreated. I will update the walker and add this in
the next revision.
>
>Should we at least mention that this API intentionally overwrites the
>preserved coherency state and that these pages must explicitly be marked
>incoherent again later by the driver based on its preserved HW state OR
>by the IOMMUFD re-attach?

Agreed. I will add a Note about this.
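
Something along these lines as the note (wording TBD):

	/*
	 * Note: restoring intentionally clears the preserved coherency
	 * state; the driver must mark restored tables incoherent again
	 * later (e.g. during domain restore) based on its preserved HW
	 * state.
	 */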
>
>> > > +
>> > > +	order = folio_order(folio);
>> > > +	pgcnt = 1UL << order;
>> > > +	mod_node_page_state(folio_pgdat(folio), NR_IOMMU_PAGES, pgcnt);
>> > > +	lruvec_stat_mod_folio(folio, NR_SECONDARY_PAGETABLE, pgcnt);
>> > > +}
>> > > +EXPORT_SYMBOL_GPL(iommu_restore_page);
>
>[------ snip >8 -------]
>
>Thanks,
>Praan

Thanks,
Sami

* Re: [PATCH 04/14] iommu/pages: Add APIs to preserve/unpreserve/restore iommu pages
  2026-03-20  9:28     ` Pranjal Shrivastava
@ 2026-03-20 18:27       ` Samiullah Khawaja
  0 siblings, 0 replies; 98+ messages in thread
From: Samiullah Khawaja @ 2026-03-20 18:27 UTC (permalink / raw)
  To: Pranjal Shrivastava
  Cc: Vipin Sharma, David Woodhouse, Lu Baolu, Joerg Roedel,
	Will Deacon, Jason Gunthorpe, Robin Murphy, Kevin Tian,
	Alex Williamson, Shuah Khan, iommu, linux-kernel, kvm,
	Saeed Mahameed, Adithya Jayachandran, Parav Pandit,
	Leon Romanovsky, William Tu, Pratyush Yadav, Pasha Tatashin,
	David Matlack, Andrew Morton, Chris Li, YiFei Zhu

On Fri, Mar 20, 2026 at 09:28:28AM +0000, Pranjal Shrivastava wrote:
>On Tue, Mar 17, 2026 at 01:59:03PM -0700, Vipin Sharma wrote:
>> On Tue, Feb 03, 2026 at 10:09:38PM +0000, Samiullah Khawaja wrote:
>> > +
>> > +void iommu_unpreserve_pages(struct iommu_pages_list *list, int count)
>> > +{
>> > +	struct ioptdesc *iopt;
>> > +
>> > +	if (!count)
>> > +		return;
>> > +
>> > +	/* If less than zero then unpreserve all pages. */
>> > +	if (count < 0)
>> > +		count = 0;
>> > +
>> > +	list_for_each_entry(iopt, &list->pages, iopt_freelist_elm) {
>> > +		kho_unpreserve_folio(ioptdesc_folio(iopt));
>> > +		if (count > 0 && --count ==  0)
>> > +			break;
>> > +	}
>> > +}
>>
>> I see you are trying to have a common function for error handling in
>> iommu_preserve_pages() below and for unpreserving in the next patch,
>> with an overloaded meaning of count.
>>
>>  < 0 means unpreserve all of the pages
>>  = 0 means do nothing
>>  > 0 unpreserve these many pages.
>>
>> It seems non-intuitive to me, and the code also ends up having
>> multiple ifs. I would recommend just having this function go through
>> all of the pages without passing a count argument. In the
>> iommu_preserve_pages() function below, handle the error case with a
>> "goto err" statement. If this is the way things happen in the iommu
>> codebase, feel free to ignore this suggestion.
>>
>
>+1 to this. The current version with the overloaded count is
>counter-intuitive, as it relies on an implicit API contract. Since we're
>using lists here, I'd recommend using the
>list_for_each_entry_continue_reverse() helper:

Agreed. I will use this approach in next revision.

Thanks,
Sami

>
>+void iommu_unpreserve_pages(struct iommu_pages_list *list)
>+{
>+	struct ioptdesc *iopt;
>+
>+	list_for_each_entry(iopt, &list->pages, iopt_freelist_elm)
>+		kho_unpreserve_folio(ioptdesc_folio(iopt));
>+}
>+EXPORT_SYMBOL_GPL(iommu_unpreserve_pages);
>+
>+int iommu_preserve_pages(struct iommu_pages_list *list)
>+{
>+	struct ioptdesc *iopt;
>+	int ret;
>+
>+	list_for_each_entry(iopt, &list->pages, iopt_freelist_elm) {
>+		ret = kho_preserve_folio(ioptdesc_folio(iopt));
>+		if (ret)
>+			goto err_rollback;
>+	}
>+	return 0;
>+
>+err_rollback:
>+	/* Rollback only the successfully preserved pages */
>+	list_for_each_entry_continue_reverse(iopt, &list->pages, iopt_freelist_elm)
>+		kho_unpreserve_folio(ioptdesc_folio(iopt));
>+	return ret;
>+}
>
>[ ---- snip >8---- ]
>
>Thanks,
>Praan

* Re: [PATCH 04/14] iommu/pages: Add APIs to preserve/unpreserve/restore iommu pages
  2026-03-20 11:01     ` Pranjal Shrivastava
@ 2026-03-20 18:56       ` Samiullah Khawaja
  0 siblings, 0 replies; 98+ messages in thread
From: Samiullah Khawaja @ 2026-03-20 18:56 UTC (permalink / raw)
  To: Pranjal Shrivastava
  Cc: Vipin Sharma, David Woodhouse, Lu Baolu, Joerg Roedel,
	Will Deacon, Jason Gunthorpe, Robin Murphy, Kevin Tian,
	Alex Williamson, Shuah Khan, iommu, linux-kernel, kvm,
	Saeed Mahameed, Adithya Jayachandran, Parav Pandit,
	Leon Romanovsky, William Tu, Pratyush Yadav, Pasha Tatashin,
	David Matlack, Andrew Morton, Chris Li, YiFei Zhu

On Fri, Mar 20, 2026 at 11:01:25AM +0000, Pranjal Shrivastava wrote:
>On Tue, Mar 17, 2026 at 01:59:03PM -0700, Vipin Sharma wrote:
>> On Tue, Feb 03, 2026 at 10:09:38PM +0000, Samiullah Khawaja wrote:
>> > +
>
>[------- snip >8 --------]
>
>> > +EXPORT_SYMBOL_GPL(iommu_unpreserve_pages);
>> > +
>> > +void iommu_restore_page(u64 phys)
>> > +{
>> > +	struct ioptdesc *iopt;
>> > +	struct folio *folio;
>> > +	unsigned long pgcnt;
>> > +	unsigned int order;
>> > +
>> > +	folio = kho_restore_folio(phys);
>> > +	BUG_ON(!folio);
>> > +
>> > +	iopt = folio_ioptdesc(folio);
>> > +
>> > +	order = folio_order(folio);
>> > +	pgcnt = 1UL << order;
>> > +	mod_node_page_state(folio_pgdat(folio), NR_IOMMU_PAGES, pgcnt);
>> > +	lruvec_stat_mod_folio(folio, NR_SECONDARY_PAGETABLE, pgcnt);
>>
>> The above two functions are also used in two other places in this
>> file; let's create a common function for these two operations.
>>
>
>I was thinking about this: of the two other places, one is
>`__iommu_free_desc`, where we are supposed to subtract pgcnt. But I
>think we can follow the -pgcnt pattern we have currently and add a
>helper `iommu_folio_update_stats(folio, nr_pages)` where we can pass
>-pgcnt to subtract the pages. Something like:
>
>+static inline void iommu_folio_update_stats(struct folio *folio, long nr_pages)
>+{
>+	mod_node_page_state(folio_pgdat(folio), NR_IOMMU_PAGES, nr_pages);
>+	lruvec_stat_mod_folio(folio, NR_SECONDARY_PAGETABLE, nr_pages);
>+}
>
>@@ -86,8 +91,7 @@ static void __iommu_free_desc(struct ioptdesc *iopt)
> 	if (IOMMU_PAGES_USE_DMA_API)
> 		WARN_ON_ONCE(iopt->incoherent);
>
>-	mod_node_page_state(folio_pgdat(folio), NR_IOMMU_PAGES, -pgcnt);
>-	lruvec_stat_mod_folio(folio, NR_SECONDARY_PAGETABLE, -pgcnt);
>+	iommu_folio_update_stats(folio, -(long)pgcnt);
> 	folio_put(folio);
> }
>
>But I'm unsure if passing -(long)pgcnt is a good idea?

mod_node_page_state() takes its last argument as "long", and I can see
other examples of passing -folio_nr_pages(folio) directly into it. So we
can just keep it as the following:

iommu_folio_update_stats(folio, -pgcnt);

I think it is a good cleanup. I will update this in the next revision.
>
>> > +}
>> > +EXPORT_SYMBOL_GPL(iommu_restore_page);
>> > +
>
>Thanks,
>Praan

* Re: [PATCH 05/14] iommupt: Implement preserve/unpreserve/restore callbacks
  2026-02-03 22:09 ` [PATCH 05/14] iommupt: Implement preserve/unpreserve/restore callbacks Samiullah Khawaja
@ 2026-03-20 21:57   ` Pranjal Shrivastava
  2026-03-23 16:41     ` Samiullah Khawaja
  0 siblings, 1 reply; 98+ messages in thread
From: Pranjal Shrivastava @ 2026-03-20 21:57 UTC (permalink / raw)
  To: Samiullah Khawaja
  Cc: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe, Robin Murphy, Kevin Tian, Alex Williamson,
	Shuah Khan, iommu, linux-kernel, kvm, Saeed Mahameed,
	Adithya Jayachandran, Parav Pandit, Leon Romanovsky, William Tu,
	Pratyush Yadav, Pasha Tatashin, David Matlack, Andrew Morton,
	Chris Li, Vipin Sharma, YiFei Zhu

On Tue, Feb 03, 2026 at 10:09:39PM +0000, Samiullah Khawaja wrote:
> Implement the iommu domain ops for preservation, unpreservation and
> restoration of iommu domains for liveupdate. Use the existing page
> walker to preserve the ioptdesc of the top_table and the lower tables.
> Also preserve the top_level so it can be restored during boot.
> 
> Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
> ---
>  drivers/iommu/generic_pt/iommu_pt.h | 96 +++++++++++++++++++++++++++++
>  include/linux/generic_pt/iommu.h    | 10 +++
>  2 files changed, 106 insertions(+)
> 
> diff --git a/drivers/iommu/generic_pt/iommu_pt.h b/drivers/iommu/generic_pt/iommu_pt.h
> index 3327116a441c..0a1adb6312dd 100644
> --- a/drivers/iommu/generic_pt/iommu_pt.h
> +++ b/drivers/iommu/generic_pt/iommu_pt.h
> @@ -921,6 +921,102 @@ int DOMAIN_NS(map_pages)(struct iommu_domain *domain, unsigned long iova,
>  }
>  EXPORT_SYMBOL_NS_GPL(DOMAIN_NS(map_pages), "GENERIC_PT_IOMMU");
>  
> +/**
> + * unpreserve() - Unpreserve page tables and other state of a domain.
> + * @domain: Domain to unpreserve
> + */
> +void DOMAIN_NS(unpreserve)(struct iommu_domain *domain, struct iommu_domain_ser *ser)

I don't see us using the `ser` arg in this function? Shall we remove it?

> +{
> +	struct pt_iommu *iommu_table =
> +		container_of(domain, struct pt_iommu, domain);
> +	struct pt_common *common = common_from_iommu(iommu_table);
> +	struct pt_range range = pt_all_range(common);
> +	struct pt_iommu_collect_args collect = {
> +		.free_list = IOMMU_PAGES_LIST_INIT(collect.free_list),
> +	};
> +
> +	iommu_pages_list_add(&collect.free_list, range.top_table);
> +	pt_walk_range(&range, __collect_tables, &collect);
> +
> +	iommu_unpreserve_pages(&collect.free_list, -1);
> +}
> +EXPORT_SYMBOL_NS_GPL(DOMAIN_NS(unpreserve), "GENERIC_PT_IOMMU");
> +
> +/**
> + * preserve() - Preserve page tables and other state of a domain.
> + * @domain: Domain to preserve
> + *
> + * Returns: -ERRNO on failure, on success.

Nit: 0 on success?

> + */
> +int DOMAIN_NS(preserve)(struct iommu_domain *domain, struct iommu_domain_ser *ser)
> +{
> +	struct pt_iommu *iommu_table =
> +		container_of(domain, struct pt_iommu, domain);
> +	struct pt_common *common = common_from_iommu(iommu_table);
> +	struct pt_range range = pt_all_range(common);
> +	struct pt_iommu_collect_args collect = {
> +		.free_list = IOMMU_PAGES_LIST_INIT(collect.free_list),
> +	};
> +	int ret;
> +
> +	iommu_pages_list_add(&collect.free_list, range.top_table);
> +	pt_walk_range(&range, __collect_tables, &collect);
> +
> +	ret = iommu_preserve_pages(&collect.free_list);
> +	if (ret)
> +		return ret;
> +
> +	ser->top_table = virt_to_phys(range.top_table);
> +	ser->top_level = range.top_level;
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_NS_GPL(DOMAIN_NS(preserve), "GENERIC_PT_IOMMU");
> +
> +static int __restore_tables(struct pt_range *range, void *arg,
> +			    unsigned int level, struct pt_table_p *table)
> +{
> +	struct pt_state pts = pt_init(range, level, table);
> +	int ret;
> +
> +	for_each_pt_level_entry(&pts) {
> +		if (pts.type == PT_ENTRY_TABLE) {
> +			iommu_restore_page(virt_to_phys(pts.table_lower));
> +			ret = pt_descend(&pts, arg, __restore_tables);
> +			if (ret)
> +				return ret;

If pt_descend() returns an error, we immediately return ret. However, we
have already successfully called iommu_restore_page() on pts.table_lower
and potentially many other tables earlier in the loop or higher up in
the tree.

If the domain restore fails halfway through the tree walk, how are these
partially restored pages cleaned up? Should we clean them up / free them
in an err label or does the caller perform a BUG_ON(ret) ?

> +		}
> +	}
> +	return 0;
> +}
> +
> +/**
> + * restore() - Restore page tables and other state of a domain.
> + * @domain: Domain to preserve
> + *
> + * Returns: -ERRNO on failure, on success.

Nit: 0 on success?

> + */
> +int DOMAIN_NS(restore)(struct iommu_domain *domain, struct iommu_domain_ser *ser)
> +{
> +	struct pt_iommu *iommu_table =
> +		container_of(domain, struct pt_iommu, domain);
> +	struct pt_common *common = common_from_iommu(iommu_table);
> +	struct pt_range range = pt_all_range(common);
> +
> +	iommu_restore_page(ser->top_table);
> +
> +	/* Free new table */
> +	iommu_free_pages(range.top_table);
> +
> +	/* Set the restored top table */
> +	pt_top_set(common, phys_to_virt(ser->top_table), ser->top_level);
> +
> +	/* Restore all pages*/
> +	range = pt_all_range(common);
> +	return pt_walk_range(&range, __restore_tables, NULL);
> +}
> +EXPORT_SYMBOL_NS_GPL(DOMAIN_NS(restore), "GENERIC_PT_IOMMU");
> +

[ ------ >8 Snip >8------]

Thanks,
Praan

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 06/14] iommu/vt-d: Implement device and iommu preserve/unpreserve ops
  2026-02-03 22:09 ` [PATCH 06/14] iommu/vt-d: Implement device and iommu preserve/unpreserve ops Samiullah Khawaja
  2026-03-19 16:04   ` Vipin Sharma
@ 2026-03-20 23:01   ` Pranjal Shrivastava
  2026-03-21 13:27     ` Pranjal Shrivastava
  2026-03-23 18:32     ` Samiullah Khawaja
  1 sibling, 2 replies; 98+ messages in thread
From: Pranjal Shrivastava @ 2026-03-20 23:01 UTC (permalink / raw)
  To: Samiullah Khawaja
  Cc: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe, Robin Murphy, Kevin Tian, Alex Williamson,
	Shuah Khan, iommu, linux-kernel, kvm, Saeed Mahameed,
	Adithya Jayachandran, Parav Pandit, Leon Romanovsky, William Tu,
	Pratyush Yadav, Pasha Tatashin, David Matlack, Andrew Morton,
	Chris Li, Vipin Sharma, YiFei Zhu

On Tue, Feb 03, 2026 at 10:09:40PM +0000, Samiullah Khawaja wrote:
> Add implementation of the device and iommu preservation in a separate
> file. Also set the device and iommu preserve/unpreserve ops in the
> struct iommu_ops.
> 
> During normal shutdown the iommu translation is disabled. Since the root
> table is preserved during live update, it needs to be cleaned up and the
> context entries of the unpreserved devices need to be cleared.
> 
> Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
> ---
>  drivers/iommu/intel/Makefile     |   1 +
>  drivers/iommu/intel/iommu.c      |  47 ++++++++++-
>  drivers/iommu/intel/iommu.h      |  27 +++++++
>  drivers/iommu/intel/liveupdate.c | 134 +++++++++++++++++++++++++++++++
>  4 files changed, 205 insertions(+), 4 deletions(-)
>  create mode 100644 drivers/iommu/intel/liveupdate.c
> 
> diff --git a/drivers/iommu/intel/Makefile b/drivers/iommu/intel/Makefile
> index ada651c4a01b..d38fc101bc35 100644
> --- a/drivers/iommu/intel/Makefile
> +++ b/drivers/iommu/intel/Makefile
> @@ -6,3 +6,4 @@ obj-$(CONFIG_INTEL_IOMMU_DEBUGFS) += debugfs.o
>  obj-$(CONFIG_INTEL_IOMMU_SVM) += svm.o
>  obj-$(CONFIG_IRQ_REMAP) += irq_remapping.o
>  obj-$(CONFIG_INTEL_IOMMU_PERF_EVENTS) += perfmon.o
> +obj-$(CONFIG_IOMMU_LIVEUPDATE) += liveupdate.o
> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> index 134302fbcd92..c95de93fb72f 100644
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -16,6 +16,7 @@
>  #include <linux/crash_dump.h>
>  #include <linux/dma-direct.h>
>  #include <linux/dmi.h>
> +#include <linux/iommu-lu.h>
>  #include <linux/memory.h>
>  #include <linux/pci.h>
>  #include <linux/pci-ats.h>
> @@ -52,6 +53,8 @@ static int rwbf_quirk;
>  
>  #define rwbf_required(iommu)	(rwbf_quirk || cap_rwbf((iommu)->cap))
>  
> +static bool __maybe_clean_unpreserved_context_entries(struct intel_iommu *iommu);
> +
>  /*
>   * set to 1 to panic kernel if can't successfully enable VT-d
>   * (used when kernel is launched w/ TXT)
> @@ -60,8 +63,6 @@ static int force_on = 0;
>  static int intel_iommu_tboot_noforce;
>  static int no_platform_optin;
>  
> -#define ROOT_ENTRY_NR (VTD_PAGE_SIZE/sizeof(struct root_entry))
> -
>  /*
>   * Take a root_entry and return the Lower Context Table Pointer (LCTP)
>   * if marked present.
> @@ -2378,8 +2379,10 @@ void intel_iommu_shutdown(void)
>  		/* Disable PMRs explicitly here. */
>  		iommu_disable_protect_mem_regions(iommu);
>  
> -		/* Make sure the IOMMUs are switched off */
> -		iommu_disable_translation(iommu);
> +		if (!__maybe_clean_unpreserved_context_entries(iommu)) {
> +			/* Make sure the IOMMUs are switched off */
> +			iommu_disable_translation(iommu);
> +		}
>  	}
>  }
>  
> @@ -2902,6 +2905,38 @@ static const struct iommu_dirty_ops intel_second_stage_dirty_ops = {
>  	.set_dirty_tracking = intel_iommu_set_dirty_tracking,
>  };
>  
> +#ifdef CONFIG_IOMMU_LIVEUPDATE
> +static bool __maybe_clean_unpreserved_context_entries(struct intel_iommu *iommu)
> +{
> +	struct device_domain_info *info;
> +	struct pci_dev *pdev = NULL;
> +
> +	if (!iommu->iommu.outgoing_preserved_state)
> +		return false;
> +
> +	for_each_pci_dev(pdev) {

Shouldn't we *also* clean context entries for non-pci devices?

AFAICT, in the current code, we simply turn off translation during
shutdown. However, with this change, because this cleanup loop only runs
for_each_pci_dev(), it completely ignores any non-PCI devices (like
platform devices) that might be attached to this IOMMU?

Since the IOMMU is left powered on during the handover, skipping non-PCI
devices in this cleanup loop means their context entries are left
perfectly intact and active in the hardware root table. There could be a
chance that the new kernel doesn't reclaim/overwrite these entries in
which case it could result in zombie DMAs surviving the KHO.

> +		info = dev_iommu_priv_get(&pdev->dev);
> +		if (!info)
> +			continue;
> +
> +		if (info->iommu != iommu)
> +			continue;
> +
> +		if (dev_iommu_preserved_state(&pdev->dev))
> +			continue;
> +
> +		domain_context_clear(info);
> +	}
> +
> +	return true;
> +}
> +#else
> +static bool __maybe_clean_unpreserved_context_entries(struct intel_iommu *iommu)
> +{
> +	return false;
> +}
> +#endif
> +
>  static struct iommu_domain *
>  intel_iommu_domain_alloc_second_stage(struct device *dev,
>  				      struct intel_iommu *iommu, u32 flags)
> @@ -3925,6 +3960,10 @@ const struct iommu_ops intel_iommu_ops = {
>  	.is_attach_deferred	= intel_iommu_is_attach_deferred,
>  	.def_domain_type	= device_def_domain_type,
>  	.page_response		= intel_iommu_page_response,
> +	.preserve_device	= intel_iommu_preserve_device,
> +	.unpreserve_device	= intel_iommu_unpreserve_device,
> +	.preserve		= intel_iommu_preserve,
> +	.unpreserve		= intel_iommu_unpreserve,
>  };
>  
>  static void quirk_iommu_igfx(struct pci_dev *dev)
> diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
> index 25c5e22096d4..70032e86437d 100644
> --- a/drivers/iommu/intel/iommu.h
> +++ b/drivers/iommu/intel/iommu.h
> @@ -557,6 +557,8 @@ struct root_entry {
>  	u64     hi;
>  };
>  
> +#define ROOT_ENTRY_NR (VTD_PAGE_SIZE / sizeof(struct root_entry))
> +
>  /*
>   * low 64 bits:
>   * 0: present
> @@ -1276,6 +1278,31 @@ static inline int iopf_for_domain_replace(struct iommu_domain *new,
>  	return 0;
>  }
>  
> +#ifdef CONFIG_IOMMU_LIVEUPDATE
> +int intel_iommu_preserve_device(struct device *dev, struct device_ser *device_ser);
> +void intel_iommu_unpreserve_device(struct device *dev, struct device_ser *device_ser);
> +int intel_iommu_preserve(struct iommu_device *iommu, struct iommu_ser *iommu_ser);
> +void intel_iommu_unpreserve(struct iommu_device *iommu, struct iommu_ser *iommu_ser);
> +#else
> +static inline int intel_iommu_preserve_device(struct device *dev, struct device_ser *device_ser)
> +{
> +	return -EOPNOTSUPP;
> +}
> +
> +static inline void intel_iommu_unpreserve_device(struct device *dev, struct device_ser *device_ser)
> +{
> +}
> +
> +static inline int intel_iommu_preserve(struct iommu_device *iommu, struct iommu_ser *iommu_ser)
> +{
> +	return -EOPNOTSUPP;
> +}
> +
> +static inline void intel_iommu_unpreserve(struct iommu_device *iommu, struct iommu_ser *iommu_ser)
> +{
> +}
> +#endif
> +
>  #ifdef CONFIG_INTEL_IOMMU_SVM
>  void intel_svm_check(struct intel_iommu *iommu);
>  struct iommu_domain *intel_svm_domain_alloc(struct device *dev,
> diff --git a/drivers/iommu/intel/liveupdate.c b/drivers/iommu/intel/liveupdate.c
> new file mode 100644
> index 000000000000..82ba1daf1711
> --- /dev/null
> +++ b/drivers/iommu/intel/liveupdate.c
> @@ -0,0 +1,134 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +
> +/*
> + * Copyright (C) 2025, Google LLC
> + * Author: Samiullah Khawaja <skhawaja@google.com>
> + */
> +
> +#define pr_fmt(fmt)    "iommu: liveupdate: " fmt
> +
> +#include <linux/kexec_handover.h>
> +#include <linux/liveupdate.h>
> +#include <linux/iommu-lu.h>
> +#include <linux/module.h>
> +#include <linux/pci.h>
> +
> +#include "iommu.h"
> +#include "../iommu-pages.h"
> +
> +static void unpreserve_iommu_context(struct intel_iommu *iommu, int end)
> +{
> +	struct context_entry *context;
> +	int i;
> +
> +	if (end < 0)
> +		end = ROOT_ENTRY_NR;
> +
> +	for (i = 0; i < end; i++) {
> +		context = iommu_context_addr(iommu, i, 0, 0);
> +		if (context)
> +			iommu_unpreserve_page(context);
> +
> +		if (!sm_supported(iommu))
> +			continue;
> +
> +		context = iommu_context_addr(iommu, i, 0x80, 0);
> +		if (context)
> +			iommu_unpreserve_page(context);
> +	}
> +}
> +
> +static int preserve_iommu_context(struct intel_iommu *iommu)
> +{
> +	struct context_entry *context;
> +	int ret;
> +	int i;
> +
> +	for (i = 0; i < ROOT_ENTRY_NR; i++) {
> +		context = iommu_context_addr(iommu, i, 0, 0);
> +		if (context) {
> +			ret = iommu_preserve_page(context);
> +			if (ret)
> +				goto error;
> +		}
> +
> +		if (!sm_supported(iommu))
> +			continue;
> +
> +		context = iommu_context_addr(iommu, i, 0x80, 0);
> +		if (context) {
> +			ret = iommu_preserve_page(context);
> +			if (ret)
> +				goto error_sm;
> +		}
> +	}
> +
> +	return 0;
> +
> +error_sm:
> +	context = iommu_context_addr(iommu, i, 0, 0);
> +	iommu_unpreserve_page(context);
> +error:
> +	unpreserve_iommu_context(iommu, i);
> +	return ret;
> +}
> +
> +int intel_iommu_preserve_device(struct device *dev, struct device_ser *device_ser)
> +{
> +	struct device_domain_info *info = dev_iommu_priv_get(dev);
> +
> +	if (!dev_is_pci(dev))
> +		return -EOPNOTSUPP;

Maybe also WARN_ON_ONCE ?

> +
> +	if (!info)
> +		return -EINVAL;
> +
> +	device_ser->domain_iommu_ser.did = domain_id_iommu(info->domain, info->iommu);
> +	return 0;
> +}
> +
> +void intel_iommu_unpreserve_device(struct device *dev, struct device_ser *device_ser)
> +{
> +}

Nit: We can just not populate the unpreserve_device op here, we don't
need an empty function

> +
> +int intel_iommu_preserve(struct iommu_device *iommu_dev, struct iommu_ser *ser)
> +{
> +	struct intel_iommu *iommu;
> +	int ret;
> +
> +	iommu = container_of(iommu_dev, struct intel_iommu, iommu);
> +
> +	spin_lock(&iommu->lock);

Minor note: We call this with the session->mutex acquired which is a
sleepable/preemptable context whereas spin_lock disables preemption and
puts us in an atomic context. 

While iommu_context_addr() happens to be safe here because it uses
GFP_ATOMIC & we pass alloc=0, we are heavily relying on the assumption
that iommu_context_addr() or other helpers will never sleep or attempt
a GFP_KERNEL allocation under the hood.

Given how easy it would be for a future refactor to accidentally
introduce a sleeping lock inside any of the other helpers, I'd recommend
adding an explicit comment here warning that this runs in an atomic
context (same for unpreserve below).

> +	ret = preserve_iommu_context(iommu);
> +	if (ret)
> +		goto err;
> +
> +	ret = iommu_preserve_page(iommu->root_entry);
> +	if (ret) {
> +		unpreserve_iommu_context(iommu, -1);

Nit: The same overloading discussion as Vipin pointed out in the other
patch.

> +		goto err;
> +	}
> +
> +	ser->intel.phys_addr = iommu->reg_phys;
> +	ser->intel.root_table = __pa(iommu->root_entry);
> +	ser->type = IOMMU_INTEL;
> +	ser->token = ser->intel.phys_addr;
> +	spin_unlock(&iommu->lock);
> +
> +	return 0;
> +err:
> +	spin_unlock(&iommu->lock);
> +	return ret;
> +}
> +
> +void intel_iommu_unpreserve(struct iommu_device *iommu_dev, struct iommu_ser *iommu_ser)
> +{
> +	struct intel_iommu *iommu;
> +
> +	iommu = container_of(iommu_dev, struct intel_iommu, iommu);
> +
> +	spin_lock(&iommu->lock);
> +	unpreserve_iommu_context(iommu, -1);
> +	iommu_unpreserve_page(iommu->root_entry);
> +	spin_unlock(&iommu->lock);
> +}

Thanks,
Praan

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 10/14] iommufd-lu: Implement ioctl to let userspace mark an HWPT to be preserved
  2026-03-20  0:40     ` Samiullah Khawaja
@ 2026-03-20 23:34       ` Vipin Sharma
  2026-03-23 16:24         ` Samiullah Khawaja
  0 siblings, 1 reply; 98+ messages in thread
From: Vipin Sharma @ 2026-03-20 23:34 UTC (permalink / raw)
  To: Samiullah Khawaja
  Cc: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe, YiFei Zhu, Robin Murphy, Kevin Tian,
	Alex Williamson, Shuah Khan, iommu, linux-kernel, kvm,
	Saeed Mahameed, Adithya Jayachandran, Parav Pandit,
	Leon Romanovsky, William Tu, Pratyush Yadav, Pasha Tatashin,
	David Matlack, Andrew Morton, Chris Li, Pranjal Shrivastava

On Fri, Mar 20, 2026 at 12:40:44AM +0000, Samiullah Khawaja wrote:
> On Thu, Mar 19, 2026 at 04:35:32PM -0700, Vipin Sharma wrote:
> > On Tue, Feb 03, 2026 at 10:09:44PM +0000, Samiullah Khawaja wrote:
> > > From: YiFei Zhu <zhuyifei@google.com>
> > > @@ -374,6 +374,10 @@ struct iommufd_hwpt_paging {
> > >  	bool auto_domain : 1;
> > >  	bool enforce_cache_coherency : 1;
> > >  	bool nest_parent : 1;
> > > +#ifdef CONFIG_IOMMU_LIVEUPDATE
> > > +	bool lu_preserve : 1;
> > > +	u32 lu_token;
> > 
> > Should we use full name i.e. liveupdate here and other places in this
> > series?
> 
> I think using full name liveupdate would be too long in other places in
> this series. And also there are other examples of "luo" being used as a
> short form. Please see:
> 
> https://lore.kernel.org/all/20251125165850.3389713-15-pasha.tatashin@soleen.com/

That patch is using the "luo" short form, which I think is also wrong
in the name memfd_luo, as there is nothing related to the orchestrator
(the O of LUO) in that patch. It is saving memfd state for liveupdate.
But that ship has sailed.

In the current patch, I don't think it will be too long, and it is also
easier to read the code without having to know what "lu" is. We have
worked on this series, so we know what "lu" stands for and have kind of
accepted it, but I agree with what Greg k-h was also suggesting.

https://lore.kernel.org/all/2025093023-frantic-sediment-9849@gregkh/
 - "You have more letters, please use them.  "lu" is too short."

Not a hard no from me, just a suggestion.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 06/14] iommu/vt-d: Implement device and iommu preserve/unpreserve ops
  2026-03-20 23:01   ` Pranjal Shrivastava
@ 2026-03-21 13:27     ` Pranjal Shrivastava
  2026-03-23 18:32     ` Samiullah Khawaja
  1 sibling, 0 replies; 98+ messages in thread
From: Pranjal Shrivastava @ 2026-03-21 13:27 UTC (permalink / raw)
  To: Samiullah Khawaja
  Cc: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe, Robin Murphy, Kevin Tian, Alex Williamson,
	Shuah Khan, iommu, linux-kernel, kvm, Saeed Mahameed,
	Adithya Jayachandran, Parav Pandit, Leon Romanovsky, William Tu,
	Pratyush Yadav, Pasha Tatashin, David Matlack, Andrew Morton,
	Chris Li, Vipin Sharma, YiFei Zhu

On Fri, Mar 20, 2026 at 11:01:59PM +0000, Pranjal Shrivastava wrote:
> On Tue, Feb 03, 2026 at 10:09:40PM +0000, Samiullah Khawaja wrote:
> > Add implementation of the device and iommu preservation in a separate
> > file. Also set the device and iommu preserve/unpreserve ops in the
> > struct iommu_ops.
> > 
> > During normal shutdown the iommu translation is disabled. Since the root
> > table is preserved during live update, it needs to be cleaned up and the
> > context entries of the unpreserved devices need to be cleared.
> > 
> > Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
> > ---
> >  drivers/iommu/intel/Makefile     |   1 +
> >  drivers/iommu/intel/iommu.c      |  47 ++++++++++-
> >  drivers/iommu/intel/iommu.h      |  27 +++++++
> >  drivers/iommu/intel/liveupdate.c | 134 +++++++++++++++++++++++++++++++
> >  4 files changed, 205 insertions(+), 4 deletions(-)
> >  create mode 100644 drivers/iommu/intel/liveupdate.c
> > 
> > diff --git a/drivers/iommu/intel/Makefile b/drivers/iommu/intel/Makefile
> > index ada651c4a01b..d38fc101bc35 100644
> > --- a/drivers/iommu/intel/Makefile
> > +++ b/drivers/iommu/intel/Makefile
> > @@ -6,3 +6,4 @@ obj-$(CONFIG_INTEL_IOMMU_DEBUGFS) += debugfs.o
> >  obj-$(CONFIG_INTEL_IOMMU_SVM) += svm.o
> >  obj-$(CONFIG_IRQ_REMAP) += irq_remapping.o
> >  obj-$(CONFIG_INTEL_IOMMU_PERF_EVENTS) += perfmon.o
> > +obj-$(CONFIG_IOMMU_LIVEUPDATE) += liveupdate.o
> > diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> > index 134302fbcd92..c95de93fb72f 100644
> > --- a/drivers/iommu/intel/iommu.c
> > +++ b/drivers/iommu/intel/iommu.c
> > @@ -16,6 +16,7 @@
> >  #include <linux/crash_dump.h>
> >  #include <linux/dma-direct.h>
> >  #include <linux/dmi.h>
> > +#include <linux/iommu-lu.h>
> >  #include <linux/memory.h>
> >  #include <linux/pci.h>
> >  #include <linux/pci-ats.h>
> > @@ -52,6 +53,8 @@ static int rwbf_quirk;
> >  

[ ---- snip >8 ----- ]
 
> > +
> > +int intel_iommu_preserve(struct iommu_device *iommu_dev, struct iommu_ser *ser)
> > +{
> > +	struct intel_iommu *iommu;
> > +	int ret;
> > +
> > +	iommu = container_of(iommu_dev, struct intel_iommu, iommu);
> > +
> > +	spin_lock(&iommu->lock);
> 
> Minor note: We call this with the session->mutex acquired which is a
> sleepable/preemptable context whereas spin_lock disables preemption and
> puts us in an atomic context. 
> 
> While iommu_context_addr() happens to be safe here because it uses
> GFP_ATOMIC & we pass alloc=0, we are heavily relying on the assumption
> that iommu_context_addr() or other helpers will never sleep or attempt
> a GFP_KERNEL allocation under the hood.
> 
> Given how easy it would be for a future refactor to accidentally
> introduce a sleeping lock inside any of the other helpers, I'd recommend
> adding an explicit comment here warning that this runs in an atomic
> context (same for unpreserve below).
> 

Thinking a little more about this, I think we shall instead add a
comment in the core API docs, mentioning that these helpers (like
iommu_preserve_page etc.) can be called from atomic context and hence
should not acquire any mutexes as we can't be certain of the context the
drivers call them in. 
 
Thanks,
Praan

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 07/14] iommu/vt-d: Restore IOMMU state and reclaimed domain ids
  2026-02-03 22:09 ` [PATCH 07/14] iommu/vt-d: Restore IOMMU state and reclaimed domain ids Samiullah Khawaja
  2026-03-19 20:54   ` Vipin Sharma
@ 2026-03-22 19:51   ` Pranjal Shrivastava
  2026-03-23 19:33     ` Samiullah Khawaja
  1 sibling, 1 reply; 98+ messages in thread
From: Pranjal Shrivastava @ 2026-03-22 19:51 UTC (permalink / raw)
  To: Samiullah Khawaja
  Cc: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe, Robin Murphy, Kevin Tian, Alex Williamson,
	Shuah Khan, iommu, linux-kernel, kvm, Saeed Mahameed,
	Adithya Jayachandran, Parav Pandit, Leon Romanovsky, William Tu,
	Pratyush Yadav, Pasha Tatashin, David Matlack, Andrew Morton,
	Chris Li, Vipin Sharma, YiFei Zhu

On Tue, Feb 03, 2026 at 10:09:41PM +0000, Samiullah Khawaja wrote:
> During boot fetch the preserved state of IOMMU unit and if found then
> restore the state.
> 
> - Reuse the root_table that was preserved in the previous kernel.
> - Reclaim the domain ids of the preserved domains for each preserved
>   devices so these are not acquired by another domain.
> 
> Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
> ---
>  drivers/iommu/intel/iommu.c      | 26 +++++++++++++++------
>  drivers/iommu/intel/iommu.h      |  7 ++++++
>  drivers/iommu/intel/liveupdate.c | 40 ++++++++++++++++++++++++++++++++
>  3 files changed, 66 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> index c95de93fb72f..8acb7f8a7627 100644
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -222,12 +222,12 @@ static void clear_translation_pre_enabled(struct intel_iommu *iommu)
>  	iommu->flags &= ~VTD_FLAG_TRANS_PRE_ENABLED;
>  }
>  
> -static void init_translation_status(struct intel_iommu *iommu)
> +static void init_translation_status(struct intel_iommu *iommu, bool restoring)
>  {
>  	u32 gsts;
>  
>  	gsts = readl(iommu->reg + DMAR_GSTS_REG);
> -	if (gsts & DMA_GSTS_TES)
> +	if (!restoring && (gsts & DMA_GSTS_TES))
>  		iommu->flags |= VTD_FLAG_TRANS_PRE_ENABLED;
>  }
>  
> @@ -670,10 +670,16 @@ void dmar_fault_dump_ptes(struct intel_iommu *iommu, u16 source_id,
>  #endif
>  
>  /* iommu handling */
> -static int iommu_alloc_root_entry(struct intel_iommu *iommu)
> +static int iommu_alloc_root_entry(struct intel_iommu *iommu, struct iommu_ser *restored_state)
>  {
>  	struct root_entry *root;
>  
> +	if (restored_state) {
> +		intel_iommu_liveupdate_restore_root_table(iommu, restored_state);
> +		__iommu_flush_cache(iommu, iommu->root_entry, ROOT_SIZE);
> +		return 0;
> +	}

Instead of putting this inside the allocator, shouldn't init_dmars and
intel_iommu_add check for iommu_ser and call 
intel_iommu_liveupdate_restore_root_table() directly, bypassing the 
allocation entirely? This looks like it could be a stand-alone function
which has nothing to do with allocation.

> +
>  	root = iommu_alloc_pages_node_sz(iommu->node, GFP_ATOMIC, SZ_4K);
>  	if (!root) {
>  		pr_err("Allocating root entry for %s failed\n",
> @@ -1614,6 +1620,7 @@ static int copy_translation_tables(struct intel_iommu *iommu)
>  
>  static int __init init_dmars(void)
>  {
> +	struct iommu_ser *iommu_ser = NULL;
>  	struct dmar_drhd_unit *drhd;
>  	struct intel_iommu *iommu;
>  	int ret;
> @@ -1636,8 +1643,10 @@ static int __init init_dmars(void)
>  						   intel_pasid_max_id);
>  		}
>  
> +		iommu_ser = iommu_get_preserved_data(iommu->reg_phys, IOMMU_INTEL);
> +
>  		intel_iommu_init_qi(iommu);
> -		init_translation_status(iommu);
> +		init_translation_status(iommu, !!iommu_ser);
>  
>  		if (translation_pre_enabled(iommu) && !is_kdump_kernel()) {
>  			iommu_disable_translation(iommu);
> @@ -1651,7 +1660,7 @@ static int __init init_dmars(void)
>  		 * we could share the same root & context tables
>  		 * among all IOMMU's. Need to Split it later.
>  		 */
> -		ret = iommu_alloc_root_entry(iommu);
> +		ret = iommu_alloc_root_entry(iommu, iommu_ser);
>  		if (ret)
>  			goto free_iommu;
>  
> @@ -2110,15 +2119,18 @@ int dmar_parse_one_satc(struct acpi_dmar_header *hdr, void *arg)
>  static int intel_iommu_add(struct dmar_drhd_unit *dmaru)
>  {
>  	struct intel_iommu *iommu = dmaru->iommu;
> +	struct iommu_ser *iommu_ser = NULL;
>  	int ret;
>  

Nit: Add: /* Fetch the preserved context using MMIO base as a token */ ?

> +	iommu_ser = iommu_get_preserved_data(iommu->reg_phys, IOMMU_INTEL);
> +
>  	/*
>  	 * Disable translation if already enabled prior to OS handover.
>  	 */
> -	if (iommu->gcmd & DMA_GCMD_TE)
> +	if (!iommu_ser && iommu->gcmd & DMA_GCMD_TE)
>  		iommu_disable_translation(iommu);
>  
> -	ret = iommu_alloc_root_entry(iommu);
> +	ret = iommu_alloc_root_entry(iommu, iommu_ser);

I understand that iommu_get_preserved_data() will eventually return NULL
after the flb_finish op has executed (based on the LUO IOCTLs dropping
the incoming state), but I'm sensing a potential UAF/double-restore
issue here that could happen during the boot window.

I believe we could restore the same context multiple times? I see 
intel_iommu_add() is called from both dmar_device_add() and 
dmar_device_remove() paths, and the ACPI probe has the following 
sequence [1]:

static int acpi_pci_root_add(struct acpi_device *device, ...)
{
	// ...
	if (hotadd && dmar_device_add(handle)) {
		result = -ENXIO;
		goto end;
	}

	// ...
	root->bus = pci_acpi_scan_root(root);
	if (!root->bus) {
		// ...
		result = -ENODEV;
		goto remove_dmar;
	}
	// ...

remove_dmar:
	if (hotadd)
		dmar_device_remove(handle);
end:
	return result;
}

If we successfully restored a domain during dmar_device_add(), but the
ACPI probe fails later (e.g., pci_acpi_scan_root fails), we jump to 
remove_dmar. This tears down the DMAR unit, it unwinds via 
dmar_device_remove() which eventually calls dmar_iommu_hotplug(false)
where we:

	disable_dmar_iommu(iommu);
	free_dmar_iommu(iommu);

At this point, the root table folios are freed back to the allocator.

However, if a re-scan is then triggered before the FLB drops the 
incoming state, we would call:

dmar_device_add() -> intel_iommu_add() -> iommu_alloc_root_entry() again

Because the KHO state wasn't marked as deleted/consumed, 
iommu_get_preserved_data() will hand us the exact same iommu_ser pointer?

In which case, we'd call kho_restore_folio(iommu_ser->intel.root_table)
on a physical page that might have already been reallocated?

Shouldn't the restored state be explicitly marked as consumed
(obj.deleted = 1), and shouldn't the driver properly unpreserve/clean up
the KHO tracking during the free_dmar_iommu() teardown path?

>  	if (ret)
>  		goto out;
>  
> diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
> index 70032e86437d..d7bf63aff17d 100644
> --- a/drivers/iommu/intel/iommu.h
> +++ b/drivers/iommu/intel/iommu.h
> @@ -1283,6 +1283,8 @@ int intel_iommu_preserve_device(struct device *dev, struct device_ser *device_se
>  void intel_iommu_unpreserve_device(struct device *dev, struct device_ser *device_ser);
>  int intel_iommu_preserve(struct iommu_device *iommu, struct iommu_ser *iommu_ser);
>  void intel_iommu_unpreserve(struct iommu_device *iommu, struct iommu_ser *iommu_ser);
> +void intel_iommu_liveupdate_restore_root_table(struct intel_iommu *iommu,
> +					       struct iommu_ser *iommu_ser);
>  #else
>  static inline int intel_iommu_preserve_device(struct device *dev, struct device_ser *device_ser)
>  {
> @@ -1301,6 +1303,11 @@ static inline int intel_iommu_preserve(struct iommu_device *iommu, struct iommu_
>  static inline void intel_iommu_unpreserve(struct iommu_device *iommu, struct iommu_ser *iommu_ser)
>  {
>  }
> +
> +static inline void intel_iommu_liveupdate_restore_root_table(struct intel_iommu *iommu,
> +							     struct iommu_ser *iommu_ser)
> +{
> +}
>  #endif
>  
>  #ifdef CONFIG_INTEL_IOMMU_SVM
> diff --git a/drivers/iommu/intel/liveupdate.c b/drivers/iommu/intel/liveupdate.c
> index 82ba1daf1711..6dcb5783d1db 100644
> --- a/drivers/iommu/intel/liveupdate.c
> +++ b/drivers/iommu/intel/liveupdate.c
> @@ -73,6 +73,46 @@ static int preserve_iommu_context(struct intel_iommu *iommu)
>  	return ret;
>  }
>  
> +static void restore_iommu_context(struct intel_iommu *iommu)
> +{
> +	struct context_entry *context;
> +	int i;
> +
> +	for (i = 0; i < ROOT_ENTRY_NR; i++) {
> +		context = iommu_context_addr(iommu, i, 0, 0);
> +		if (context)
> +			BUG_ON(!kho_restore_folio(virt_to_phys(context)));
> +
> +		if (!sm_supported(iommu))
> +			continue;
> +
> +		context = iommu_context_addr(iommu, i, 0x80, 0);
> +		if (context)
> +			BUG_ON(!kho_restore_folio(virt_to_phys(context)));
> +	}
> +}
> +
> +static int __restore_used_domain_ids(struct device_ser *ser, void *arg)
> +{
> +	int id = ser->domain_iommu_ser.did;
> +	struct intel_iommu *iommu = arg;
> +

Shouldn't we check if the did actually belongs to the iommu instance?
iommu_for_each_preserved_device() iterates over all preserved devices in
the system. However, here (__restore_used_domain_ids) we allocate the 
device's did in the current iommu->domain_ida without checking if that
device actually belongs to the current IOMMU?

On multi-IOMMU systems, this will cause every IOMMU's IDA to be
cross-polluted with the domain IDs of devices attached to other IOMMUs.
We must verify the device belongs to this specific IOMMU first, maybe:

if (ser->domain_iommu_ser.iommu_phys != iommu->reg_phys)
		return 0;

> +	ida_alloc_range(&iommu->domain_ida, id, id, GFP_ATOMIC);
> +	return 0;
> +}
> +
> +void intel_iommu_liveupdate_restore_root_table(struct intel_iommu *iommu,
> +					       struct iommu_ser *iommu_ser)
> +{
> +	BUG_ON(!kho_restore_folio(iommu_ser->intel.root_table));
> +	iommu->root_entry = __va(iommu_ser->intel.root_table);
> +
> +	restore_iommu_context(iommu);
> +	iommu_for_each_preserved_device(__restore_used_domain_ids, iommu);
> +	pr_info("Restored IOMMU[0x%llx] Root Table at: 0x%llx\n",
> +		iommu->reg_phys, iommu_ser->intel.root_table);
> +}
> +
>  int intel_iommu_preserve_device(struct device *dev, struct device_ser *device_ser)
>  {
>  	struct device_domain_info *info = dev_iommu_priv_get(dev);

Thanks,
Praan

[1] https://elixir.bootlin.com/linux/v7.0-rc4/source/drivers/acpi/pci_root.c#L728

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 08/14] iommu: Restore and reattach preserved domains to devices
  2026-02-03 22:09 ` [PATCH 08/14] iommu: Restore and reattach preserved domains to devices Samiullah Khawaja
  2026-03-10  5:16   ` Ankit Soni
@ 2026-03-22 21:59   ` Pranjal Shrivastava
  2026-03-23 18:02     ` Samiullah Khawaja
  1 sibling, 1 reply; 98+ messages in thread
From: Pranjal Shrivastava @ 2026-03-22 21:59 UTC (permalink / raw)
  To: Samiullah Khawaja
  Cc: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe, Robin Murphy, Kevin Tian, Alex Williamson,
	Shuah Khan, iommu, linux-kernel, kvm, Saeed Mahameed,
	Adithya Jayachandran, Parav Pandit, Leon Romanovsky, William Tu,
	Pratyush Yadav, Pasha Tatashin, David Matlack, Andrew Morton,
	Chris Li, Vipin Sharma, YiFei Zhu

On Tue, Feb 03, 2026 at 10:09:42PM +0000, Samiullah Khawaja wrote:
> Restore the preserved domains by restoring the page tables using restore
> IOMMU domain op. Reattach the preserved domain to the device during
> default domain setup. While attaching, reuse the domain ID that was used
> in the previous kernel. The context entry setup is not needed as that is
> preserved during liveupdate.
> 
> Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
> ---
>  drivers/iommu/intel/iommu.c  | 40 ++++++++++++++++++------------
>  drivers/iommu/intel/iommu.h  |  3 ++-
>  drivers/iommu/intel/nested.c |  2 +-
>  drivers/iommu/iommu.c        | 47 ++++++++++++++++++++++++++++++++++--
>  drivers/iommu/liveupdate.c   | 31 ++++++++++++++++++++++++
>  include/linux/iommu-lu.h     |  8 ++++++
>  6 files changed, 112 insertions(+), 19 deletions(-)
> 
> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> index 8acb7f8a7627..83faad53f247 100644
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -1029,7 +1029,8 @@ static bool first_level_by_default(struct intel_iommu *iommu)
>  	return true;
>  }
>  
> -int domain_attach_iommu(struct dmar_domain *domain, struct intel_iommu *iommu)
> +int domain_attach_iommu(struct dmar_domain *domain, struct intel_iommu *iommu,
> +			int restore_did)
>  {
>  	struct iommu_domain_info *info, *curr;
>  	int num, ret = -ENOSPC;
> @@ -1049,8 +1050,11 @@ int domain_attach_iommu(struct dmar_domain *domain, struct intel_iommu *iommu)
>  		return 0;
>  	}
>  
> -	num = ida_alloc_range(&iommu->domain_ida, IDA_START_DID,
> -			      cap_ndoms(iommu->cap) - 1, GFP_KERNEL);
> +	if (restore_did >= 0)

I just looked at the code and saw the comment for IDA_START_DID [1] and
I realized we probably shouldn't allow domain_ids 0 or 1 here, as they
are reserved?

It seems that DID 0 essentially acts as a blocking domain, and DID 1 is
used for identity / FLPT_DEFAULT_DID (comparing it to S1DSS in Arm)

If a corrupted KHO image passes restore_did = 0, this check evaluates to 
true. The driver will bypass the IDA, assign DID 0, and program it into 
the context entries. Wouldn't we accidentally attach the device to a
blocking mode where the VT-d hardware just silently drops/faults all DMA?

Should this check be:
if (restore_did >= IDA_START_DID) ?

> +		num = restore_did;
> +	else
> +		num = ida_alloc_range(&iommu->domain_ida, IDA_START_DID,
> +				      cap_ndoms(iommu->cap) - 1, GFP_KERNEL);
>  	if (num < 0) {
>  		pr_err("%s: No free domain ids\n", iommu->name);
>  		goto err_unlock;
> @@ -1321,10 +1325,14 @@ static int dmar_domain_attach_device(struct dmar_domain *domain,
>  {
>  	struct device_domain_info *info = dev_iommu_priv_get(dev);
>  	struct intel_iommu *iommu = info->iommu;
> +	struct device_ser *device_ser = NULL;
>  	unsigned long flags;
>  	int ret;
>  
> -	ret = domain_attach_iommu(domain, iommu);
> +	device_ser = dev_iommu_restored_state(dev);
> +
> +	ret = domain_attach_iommu(domain, iommu,
> +				  dev_iommu_restore_did(dev, &domain->domain));
>  	if (ret)
>  		return ret;
>  
> @@ -1337,16 +1345,18 @@ static int dmar_domain_attach_device(struct dmar_domain *domain,

[ ------- >8 snip -------- ]

> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index c0632cb5b570..8103b5372364 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -18,6 +18,7 @@
>  #include <linux/errno.h>
>  #include <linux/host1x_context_bus.h>
>  #include <linux/iommu.h>
> +#include <linux/iommu-lu.h>
>  #include <linux/iommufd.h>
>  #include <linux/idr.h>
>  #include <linux/err.h>
> @@ -489,6 +490,10 @@ static int iommu_init_device(struct device *dev)
>  		goto err_free;
>  	}
>  
> +#ifdef CONFIG_IOMMU_LIVEUPDATE
> +	dev->iommu->device_ser = iommu_get_device_preserved_data(dev);
> +#endif
> +
>  	iommu_dev = ops->probe_device(dev);
>  	if (IS_ERR(iommu_dev)) {
>  		ret = PTR_ERR(iommu_dev);
> @@ -2149,6 +2154,13 @@ static int __iommu_attach_device(struct iommu_domain *domain,
>  	ret = domain->ops->attach_dev(domain, dev, old);
>  	if (ret)
>  		return ret;
> +
> +#ifdef CONFIG_IOMMU_LIVEUPDATE
> +	/* The associated state can be unset once restored. */
> +	if (dev_iommu_restored_state(dev))
> +		WRITE_ONCE(dev->iommu->device_ser, NULL);
> +#endif
> +
>  	dev->iommu->attach_deferred = 0;
>  	trace_attach_device_to_domain(dev);
>  	return 0;
> @@ -3061,6 +3073,34 @@ int iommu_fwspec_add_ids(struct device *dev, const u32 *ids, int num_ids)
>  }
>  EXPORT_SYMBOL_GPL(iommu_fwspec_add_ids);
>  
> +static struct iommu_domain *__iommu_group_maybe_restore_domain(struct iommu_group *group)
> +{
> +	struct device_ser *device_ser;
> +	struct iommu_domain *domain;
> +	struct device *dev;
> +
> +	dev = iommu_group_first_dev(group);
> +	if (!dev_is_pci(dev))
> +		return NULL;
> +
> +	device_ser = dev_iommu_restored_state(dev);
> +	if (!device_ser)
> +		return NULL;
> +
> +	domain = iommu_restore_domain(dev, device_ser);
> +	if (WARN_ON(IS_ERR(domain)))
> +		return NULL;
> +
> +	/*
> +	 * The group is owned by the entity (a preserved iommufd) that provided
> +	 * this token in the previous kernel. It will be used to reclaim it
> +	 * later.
> +	 */
> +	group->owner = (void *)device_ser->token;

Umm.. this is a nak as it looks troubled for the following reasons:

1. The group->owner and group->owner_cnt are being updated directly
without holding group->mutex. We shouldn't bypass the lock as this
violates the group's concurrency protections and can cause APIs like
iommu_device_claim_dma_owner[2] to race or fail.

2. Stuff like this shouldn't be open-coded. Presumably, this was
open-coded because the standard API (__iommu_take_dma_ownership) forces
the IOMMU group into a blocking_domain, which conflicts with the fact 
that we just restored an active paging domain?

I think we should introduce a proper helper that acquires the lock,
safely sets the opaque owner, and manages the domain state without
forcing it into a blocking domain? Please note that the helper shouldn't
be exposed / exported; we wouldn't want any abusive calls claiming
ownership without attaching to a blocked domain.

3. Please note that owner is an opaque "pointer" but we set it to a u64.
Setting an opaque void *owner directly to a user-controlled u64 token is
dangerous. If a malicious or unlucky userspace provides a token value 
that happens to perfectly match a valid kernel pointer (e.g. another
subsystem's cookie), we'll have an aliasing collision.

If that subsystem subsequently calls iommu_device_claim_dma_owner(), the
if (group->owner == owner) check will falsely pass, incorrectly granting
DMA ownership and breaking device isolation.

I understand we can't restore the original owner pointer because the
original iommufd file/context from the previous kernel no longer exists.
However, to avoid aliasing could we dynamically allocate a small wrapper
(e.g., struct iommu_lu_owner { u64 token; }) and set group->owner to the
uniquely allocated pointer?

4. In the full RFC [3], we seem to be "transferring" the ownership, but
here we set owner_cnt = 1; what if it was > 1 earlier? The
iommu_device_release_dma_owner seems to handle multiple release calls
well, so this shouldn't lead to crashes, but the "owning" entity could
use this count for some bookkeeping. It'd be nice if we could have a
proper restore_owner API / helper here.

> +	group->owner_cnt = 1;
> +	return domain;
> +}
> +
>  /**
>   * iommu_setup_default_domain - Set the default_domain for the group
>   * @group: Group to change
> @@ -3075,8 +3115,8 @@ static int iommu_setup_default_domain(struct iommu_group *group,
>  				      int target_type)
>  {
>  	struct iommu_domain *old_dom = group->default_domain;
> +	struct iommu_domain *dom, *restored_domain;
>  	struct group_device *gdev;
> -	struct iommu_domain *dom;
>  	bool direct_failed;
>  	int req_type;
>  	int ret;
> @@ -3120,6 +3160,9 @@ static int iommu_setup_default_domain(struct iommu_group *group,
>  	/* We must set default_domain early for __iommu_device_set_domain */
>  	group->default_domain = dom;
>  	if (!group->domain) {
> +		restored_domain = __iommu_group_maybe_restore_domain(group);
> +		if (!restored_domain)
> +			restored_domain = dom;
>  		/*
>  		 * Drivers are not allowed to fail the first domain attach.
>  		 * The only way to recover from this is to fail attaching the
> @@ -3127,7 +3170,7 @@ static int iommu_setup_default_domain(struct iommu_group *group,
>  		 * in group->default_domain so it is freed after.
>  		 */
>  		ret = __iommu_group_set_domain_internal(
> -			group, dom, IOMMU_SET_DOMAIN_MUST_SUCCEED);
> +			group, restored_domain, IOMMU_SET_DOMAIN_MUST_SUCCEED);
>  		if (WARN_ON(ret))
>  			goto out_free_old;
>  	} else {
> diff --git a/drivers/iommu/liveupdate.c b/drivers/iommu/liveupdate.c
> index 83eb609b3fd7..6b211436ad25 100644
> --- a/drivers/iommu/liveupdate.c
> +++ b/drivers/iommu/liveupdate.c
> @@ -501,3 +501,34 @@ void iommu_unpreserve_device(struct iommu_domain *domain, struct device *dev)
>  
>  	iommu_unpreserve_locked(iommu->iommu_dev);
>  }
> +
> +struct iommu_domain *iommu_restore_domain(struct device *dev, struct device_ser *ser)
> +{
> +	struct iommu_domain_ser *domain_ser;
> +	struct iommu_lu_flb_obj *flb_obj;
> +	struct iommu_domain *domain;
> +	int ret;
> +
> +	domain_ser = __va(ser->domain_iommu_ser.domain_phys);
> +
> +	ret = liveupdate_flb_get_incoming(&iommu_flb, (void **)&flb_obj);
> +	if (ret)
> +		return ERR_PTR(ret);
> +
> +	guard(mutex)(&flb_obj->lock);
> +	if (domain_ser->restored_domain)
> +		return domain_ser->restored_domain;
> +
> +	domain_ser->obj.incoming =  true;
> +	domain = iommu_paging_domain_alloc(dev);
> +	if (IS_ERR(domain))
> +		return domain;
> +
> +	ret = domain->ops->restore(domain, domain_ser);
> +	if (ret)

did we miss free-ing domain here?

> +		return ERR_PTR(ret);
> +
> +	domain->preserved_state = domain_ser;
> +	domain_ser->restored_domain = domain;
> +	return domain;
> +}
> diff --git a/include/linux/iommu-lu.h b/include/linux/iommu-lu.h
> index 48c07514a776..4879abaf83d3 100644
> --- a/include/linux/iommu-lu.h
> +++ b/include/linux/iommu-lu.h
> @@ -65,6 +65,8 @@ static inline int dev_iommu_restore_did(struct device *dev, struct iommu_domain
>  	return -1;
>  }
>  
> +struct iommu_domain *iommu_restore_domain(struct device *dev,
> +					  struct device_ser *ser);
>  int iommu_for_each_preserved_device(iommu_preserved_device_iter_fn fn,
>  				    void *arg);
>  struct device_ser *iommu_get_device_preserved_data(struct device *dev);
> @@ -95,6 +97,12 @@ static inline void *iommu_domain_restored_state(struct iommu_domain *domain)
>  	return NULL;
>  }
>  
> +static inline struct iommu_domain *iommu_restore_domain(struct device *dev,
> +							struct device_ser *ser)
> +{
> +	return NULL;
> +}
> +
>  static inline int iommu_for_each_preserved_device(iommu_preserved_device_iter_fn fn, void *arg)
>  {
>  	return -EOPNOTSUPP;

Thanks,
Praan

[1] https://elixir.bootlin.com/linux/v7.0-rc4/source/drivers/iommu/intel/iommu.h#L795
[2] https://elixir.bootlin.com/linux/v7.0-rc4/source/drivers/iommu/iommu.c#L3367


* Re: [PATCH 10/14] iommufd-lu: Implement ioctl to let userspace mark an HWPT to be preserved
  2026-03-20 23:34       ` Vipin Sharma
@ 2026-03-23 16:24         ` Samiullah Khawaja
  0 siblings, 0 replies; 98+ messages in thread
From: Samiullah Khawaja @ 2026-03-23 16:24 UTC (permalink / raw)
  To: Vipin Sharma
  Cc: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe, YiFei Zhu, Robin Murphy, Kevin Tian,
	Alex Williamson, Shuah Khan, iommu, linux-kernel, kvm,
	Saeed Mahameed, Adithya Jayachandran, Parav Pandit,
	Leon Romanovsky, William Tu, Pratyush Yadav, Pasha Tatashin,
	David Matlack, Andrew Morton, Chris Li, Pranjal Shrivastava

On Fri, Mar 20, 2026 at 04:34:41PM -0700, Vipin Sharma wrote:
>On Fri, Mar 20, 2026 at 12:40:44AM +0000, Samiullah Khawaja wrote:
>> On Thu, Mar 19, 2026 at 04:35:32PM -0700, Vipin Sharma wrote:
>> > On Tue, Feb 03, 2026 at 10:09:44PM +0000, Samiullah Khawaja wrote:
>> > > From: YiFei Zhu <zhuyifei@google.com>
>> > > @@ -374,6 +374,10 @@ struct iommufd_hwpt_paging {
>> > >  	bool auto_domain : 1;
>> > >  	bool enforce_cache_coherency : 1;
>> > >  	bool nest_parent : 1;
>> > > +#ifdef CONFIG_IOMMU_LIVEUPDATE
>> > > +	bool lu_preserve : 1;
>> > > +	u32 lu_token;
>> >
>> > Should we use full name i.e. liveupdate here and other places in this
>> > series?
>>
>> I think using the full name "liveupdate" would be too long in other
>> places in this series. There are also other examples of "luo" being
>> used as a short form. Please see:
>>
>> https://lore.kernel.org/all/20251125165850.3389713-15-pasha.tatashin@soleen.com/
>
>That patch uses the "luo" short form, which I think is also wrong in
>naming memfd_luo, as there is nothing related to the orchestrator (the
>O of LUO) in that patch. It is saving memfd state for liveupdate. But
>that ship has sailed.

Yes, I think that makes sense.
>
>In the current patch, I don't think it will be too long, and it is also
>easier to read the code without having to know what "lu" is. We have
>worked on it, so we know what "lu" stands for and have kind of accepted
>it, but I agree with what Greg k-h was also suggesting.

Agreed, in this patch it won't be too long. I think for the public API
being used in other subsystems we can use "liveupdate", but internally
we can use lu. I will rework this.
>
>https://lore.kernel.org/all/2025093023-frantic-sediment-9849@gregkh/
> - "You have more letters, please use them.  "lu" is too short."
>
>Not a hard no from me, just a suggestion.

Thanks,
Sami


* Re: [PATCH 05/14] iommupt: Implement preserve/unpreserve/restore callbacks
  2026-03-20 21:57   ` Pranjal Shrivastava
@ 2026-03-23 16:41     ` Samiullah Khawaja
  0 siblings, 0 replies; 98+ messages in thread
From: Samiullah Khawaja @ 2026-03-23 16:41 UTC (permalink / raw)
  To: Pranjal Shrivastava
  Cc: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe, Robin Murphy, Kevin Tian, Alex Williamson,
	Shuah Khan, iommu, linux-kernel, kvm, Saeed Mahameed,
	Adithya Jayachandran, Parav Pandit, Leon Romanovsky, William Tu,
	Pratyush Yadav, Pasha Tatashin, David Matlack, Andrew Morton,
	Chris Li, Vipin Sharma, YiFei Zhu

On Fri, Mar 20, 2026 at 09:57:08PM +0000, Pranjal Shrivastava wrote:
>On Tue, Feb 03, 2026 at 10:09:39PM +0000, Samiullah Khawaja wrote:
>> Implement the iommu domain ops for preservation, unpreservation and
>> restoration of iommu domains for liveupdate. Use the existing page
>> walker to preserve the ioptdesc of the top_table and the lower tables.
>> Preserve the top_level also so it can be restored during boot.
>>
>> Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
>> ---
>>  drivers/iommu/generic_pt/iommu_pt.h | 96 +++++++++++++++++++++++++++++
>>  include/linux/generic_pt/iommu.h    | 10 +++
>>  2 files changed, 106 insertions(+)
>>
>> diff --git a/drivers/iommu/generic_pt/iommu_pt.h b/drivers/iommu/generic_pt/iommu_pt.h
>> index 3327116a441c..0a1adb6312dd 100644
>> --- a/drivers/iommu/generic_pt/iommu_pt.h
>> +++ b/drivers/iommu/generic_pt/iommu_pt.h
>> @@ -921,6 +921,102 @@ int DOMAIN_NS(map_pages)(struct iommu_domain *domain, unsigned long iova,
>>  }
>>  EXPORT_SYMBOL_NS_GPL(DOMAIN_NS(map_pages), "GENERIC_PT_IOMMU");
>>
>> +/**
>> + * unpreserve() - Unpreserve page tables and other state of a domain.
>> + * @domain: Domain to unpreserve
>> + */
>> +void DOMAIN_NS(unpreserve)(struct iommu_domain *domain, struct iommu_domain_ser *ser)
>
>I don't see us using the `ser` arg in this function? Shall we remove it?

Hmm.. this is tricky. It's not needed for GPT, but to keep the
unpreserve() callback signature complete, "ser" should be there, as an
IOMMU that preserves/unpreserves more state with each domain might need
it. I am ok either way, WDYT?
>
>> +{
>> +	struct pt_iommu *iommu_table =
>> +		container_of(domain, struct pt_iommu, domain);
>> +	struct pt_common *common = common_from_iommu(iommu_table);
>> +	struct pt_range range = pt_all_range(common);
>> +	struct pt_iommu_collect_args collect = {
>> +		.free_list = IOMMU_PAGES_LIST_INIT(collect.free_list),
>> +	};
>> +
>> +	iommu_pages_list_add(&collect.free_list, range.top_table);
>> +	pt_walk_range(&range, __collect_tables, &collect);
>> +
>> +	iommu_unpreserve_pages(&collect.free_list, -1);
>> +}
>> +EXPORT_SYMBOL_NS_GPL(DOMAIN_NS(unpreserve), "GENERIC_PT_IOMMU");
>> +
>> +/**
>> + * preserve() - Preserve page tables and other state of a domain.
>> + * @domain: Domain to preserve
>> + *
>> + * Returns: -ERRNO on failure, on success.
>
>Nit: 0 on success?

Agreed. Will update.
>
>> + */
>> +int DOMAIN_NS(preserve)(struct iommu_domain *domain, struct iommu_domain_ser *ser)
>> +{
>> +	struct pt_iommu *iommu_table =
>> +		container_of(domain, struct pt_iommu, domain);
>> +	struct pt_common *common = common_from_iommu(iommu_table);
>> +	struct pt_range range = pt_all_range(common);
>> +	struct pt_iommu_collect_args collect = {
>> +		.free_list = IOMMU_PAGES_LIST_INIT(collect.free_list),
>> +	};
>> +	int ret;
>> +
>> +	iommu_pages_list_add(&collect.free_list, range.top_table);
>> +	pt_walk_range(&range, __collect_tables, &collect);
>> +
>> +	ret = iommu_preserve_pages(&collect.free_list);
>> +	if (ret)
>> +		return ret;
>> +
>> +	ser->top_table = virt_to_phys(range.top_table);
>> +	ser->top_level = range.top_level;
>> +
>> +	return 0;
>> +}
>> +EXPORT_SYMBOL_NS_GPL(DOMAIN_NS(preserve), "GENERIC_PT_IOMMU");
>> +
>> +static int __restore_tables(struct pt_range *range, void *arg,
>> +			    unsigned int level, struct pt_table_p *table)
>> +{
>> +	struct pt_state pts = pt_init(range, level, table);
>> +	int ret;
>> +
>> +	for_each_pt_level_entry(&pts) {
>> +		if (pts.type == PT_ENTRY_TABLE) {
>> +			iommu_restore_page(virt_to_phys(pts.table_lower));
>> +			ret = pt_descend(&pts, arg, __restore_tables);
>> +			if (ret)
>> +				return ret;
>
>If pt_descend() returns an error, we immediately return ret. However, we
>have already successfully called iommu_restore_page() on pts.table_lower
>and potentially many other tables earlier in the loop or higher up in
>the tree..
>
>If the domain restore fails halfway through the tree walk, how are these
>partially restored pages cleaned up? Should we clean them up / free them
>in an err label or does the caller perform a BUG_ON(ret) ?

Yes, a failed restore here should be fatal. Looking at the pt_descend
source, it should not fail here anyway, as table_lower is initialized. I
will add a BUG_ON here.
>
>> +		}
>> +	}
>> +	return 0;
>> +}
>> +
>> +/**
>> + * restore() - Restore page tables and other state of a domain.
>> + * @domain: Domain to preserve
>> + *
>> + * Returns: -ERRNO on failure, on success.
>
>Nit: 0 on success?

Agreed.
>
>> + */
>> +int DOMAIN_NS(restore)(struct iommu_domain *domain, struct iommu_domain_ser *ser)
>> +{
>> +	struct pt_iommu *iommu_table =
>> +		container_of(domain, struct pt_iommu, domain);
>> +	struct pt_common *common = common_from_iommu(iommu_table);
>> +	struct pt_range range = pt_all_range(common);
>> +
>> +	iommu_restore_page(ser->top_table);
>> +
>> +	/* Free new table */
>> +	iommu_free_pages(range.top_table);
>> +
>> +	/* Set the restored top table */
>> +	pt_top_set(common, phys_to_virt(ser->top_table), ser->top_level);
>> +
>> +	/* Restore all pages*/
>> +	range = pt_all_range(common);
>> +	return pt_walk_range(&range, __restore_tables, NULL);
>> +}
>> +EXPORT_SYMBOL_NS_GPL(DOMAIN_NS(restore), "GENERIC_PT_IOMMU");
>> +
>
>[ ------ >8 Snip >8------]
>
>Thanks,
>Praan


* Re: [PATCH 08/14] iommu: Restore and reattach preserved domains to devices
  2026-03-22 21:59   ` Pranjal Shrivastava
@ 2026-03-23 18:02     ` Samiullah Khawaja
  0 siblings, 0 replies; 98+ messages in thread
From: Samiullah Khawaja @ 2026-03-23 18:02 UTC (permalink / raw)
  To: Pranjal Shrivastava
  Cc: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe, Robin Murphy, Kevin Tian, Alex Williamson,
	Shuah Khan, iommu, linux-kernel, kvm, Saeed Mahameed,
	Adithya Jayachandran, Parav Pandit, Leon Romanovsky, William Tu,
	Pratyush Yadav, Pasha Tatashin, David Matlack, Andrew Morton,
	Chris Li, Vipin Sharma, YiFei Zhu

On Sun, Mar 22, 2026 at 09:59:16PM +0000, Pranjal Shrivastava wrote:
>On Tue, Feb 03, 2026 at 10:09:42PM +0000, Samiullah Khawaja wrote:
>> Restore the preserved domains by restoring the page tables using restore
>> IOMMU domain op. Reattach the preserved domain to the device during
>> default domain setup. While attaching, reuse the domain ID that was used
>> in the previous kernel. The context entry setup is not needed as that is
>> preserved during liveupdate.
>>
>> Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
>> ---
>>  drivers/iommu/intel/iommu.c  | 40 ++++++++++++++++++------------
>>  drivers/iommu/intel/iommu.h  |  3 ++-
>>  drivers/iommu/intel/nested.c |  2 +-
>>  drivers/iommu/iommu.c        | 47 ++++++++++++++++++++++++++++++++++--
>>  drivers/iommu/liveupdate.c   | 31 ++++++++++++++++++++++++
>>  include/linux/iommu-lu.h     |  8 ++++++
>>  6 files changed, 112 insertions(+), 19 deletions(-)
>>
>> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
>> index 8acb7f8a7627..83faad53f247 100644
>> --- a/drivers/iommu/intel/iommu.c
>> +++ b/drivers/iommu/intel/iommu.c
>> @@ -1029,7 +1029,8 @@ static bool first_level_by_default(struct intel_iommu *iommu)
>>  	return true;
>>  }
>>
>> -int domain_attach_iommu(struct dmar_domain *domain, struct intel_iommu *iommu)
>> +int domain_attach_iommu(struct dmar_domain *domain, struct intel_iommu *iommu,
>> +			int restore_did)
>>  {
>>  	struct iommu_domain_info *info, *curr;
>>  	int num, ret = -ENOSPC;
>> @@ -1049,8 +1050,11 @@ int domain_attach_iommu(struct dmar_domain *domain, struct intel_iommu *iommu)
>>  		return 0;
>>  	}
>>
>> -	num = ida_alloc_range(&iommu->domain_ida, IDA_START_DID,
>> -			      cap_ndoms(iommu->cap) - 1, GFP_KERNEL);
>> +	if (restore_did >= 0)
>
>I just looked at the code and saw the comment for IDA_START_DID [1] and
>I realized we probably shouldn't allow domain_ids 0 or 1 here, as they
>are reserved?
>
>It seems that DID 0 essentially acts as a blocking domain, and DID 1 is
>used for identity / FLPT_DEFAULT_DID (comparing it to S1DSS in Arm)
>
>If a corrupted KHO image passes restore_did = 0, this check evaluates to
>true. The driver will bypass the IDA, assign DID 0, and program it into
>the context entries. Wouldn't we accidentally attach the device to a
>blocking mode where the VT-d hardware just silently drops/faults all DMA?
>
>Should this check be:
>if (restore_did >= IDA_START_DID) ?

Agreed. Will update.
>
>> +		num = restore_did;
>> +	else
>> +		num = ida_alloc_range(&iommu->domain_ida, IDA_START_DID,
>> +				      cap_ndoms(iommu->cap) - 1, GFP_KERNEL);
>>  	if (num < 0) {
>>  		pr_err("%s: No free domain ids\n", iommu->name);
>>  		goto err_unlock;
>> @@ -1321,10 +1325,14 @@ static int dmar_domain_attach_device(struct dmar_domain *domain,
>>  {
>>  	struct device_domain_info *info = dev_iommu_priv_get(dev);
>>  	struct intel_iommu *iommu = info->iommu;
>> +	struct device_ser *device_ser = NULL;
>>  	unsigned long flags;
>>  	int ret;
>>
>> -	ret = domain_attach_iommu(domain, iommu);
>> +	device_ser = dev_iommu_restored_state(dev);
>> +
>> +	ret = domain_attach_iommu(domain, iommu,
>> +				  dev_iommu_restore_did(dev, &domain->domain));
>>  	if (ret)
>>  		return ret;
>>
>> @@ -1337,16 +1345,18 @@ static int dmar_domain_attach_device(struct dmar_domain *domain,
>
>[ ------- >8 snip -------- ]
>
>> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
>> index c0632cb5b570..8103b5372364 100644
>> --- a/drivers/iommu/iommu.c
>> +++ b/drivers/iommu/iommu.c
>> @@ -18,6 +18,7 @@
>>  #include <linux/errno.h>
>>  #include <linux/host1x_context_bus.h>
>>  #include <linux/iommu.h>
>> +#include <linux/iommu-lu.h>
>>  #include <linux/iommufd.h>
>>  #include <linux/idr.h>
>>  #include <linux/err.h>
>> @@ -489,6 +490,10 @@ static int iommu_init_device(struct device *dev)
>>  		goto err_free;
>>  	}
>>
>> +#ifdef CONFIG_IOMMU_LIVEUPDATE
>> +	dev->iommu->device_ser = iommu_get_device_preserved_data(dev);
>> +#endif
>> +
>>  	iommu_dev = ops->probe_device(dev);
>>  	if (IS_ERR(iommu_dev)) {
>>  		ret = PTR_ERR(iommu_dev);
>> @@ -2149,6 +2154,13 @@ static int __iommu_attach_device(struct iommu_domain *domain,
>>  	ret = domain->ops->attach_dev(domain, dev, old);
>>  	if (ret)
>>  		return ret;
>> +
>> +#ifdef CONFIG_IOMMU_LIVEUPDATE
>> +	/* The associated state can be unset once restored. */
>> +	if (dev_iommu_restored_state(dev))
>> +		WRITE_ONCE(dev->iommu->device_ser, NULL);
>> +#endif
>> +
>>  	dev->iommu->attach_deferred = 0;
>>  	trace_attach_device_to_domain(dev);
>>  	return 0;
>> @@ -3061,6 +3073,34 @@ int iommu_fwspec_add_ids(struct device *dev, const u32 *ids, int num_ids)
>>  }
>>  EXPORT_SYMBOL_GPL(iommu_fwspec_add_ids);
>>
>> +static struct iommu_domain *__iommu_group_maybe_restore_domain(struct iommu_group *group)
>> +{
>> +	struct device_ser *device_ser;
>> +	struct iommu_domain *domain;
>> +	struct device *dev;
>> +
>> +	dev = iommu_group_first_dev(group);
>> +	if (!dev_is_pci(dev))
>> +		return NULL;
>> +
>> +	device_ser = dev_iommu_restored_state(dev);
>> +	if (!device_ser)
>> +		return NULL;
>> +
>> +	domain = iommu_restore_domain(dev, device_ser);
>> +	if (WARN_ON(IS_ERR(domain)))
>> +		return NULL;
>> +
>> +	/*
>> +	 * The group is owned by the entity (a preserved iommufd) that provided
>> +	 * this token in the previous kernel. It will be used to reclaim it
>> +	 * later.
>> +	 */
>> +	group->owner = (void *)device_ser->token;
>
>Umm.. this is a nak as it looks troubled for the following reasons:
>
>1. The group->owner and group->owner_cnt are being updated directly
>without holding group->mutex. We shouldn't bypass the lock as this
>violates the group's concurrency protections and can cause APIs like
>iommu_device_claim_dma_owner[2] to race or fail.

This function is called with group->mutex held during default domain
setup in the device IOMMU probe path, so this should be fine. I will add
a lockdep assertion to make it clear.
>
>2. Stuff like this shouldn't be open-coded. Presumably, this was
>open-coded because the standard API (__iommu_take_dma_ownership) forces
>the IOMMU group into a blocking_domain, which conflicts with the fact
>that we just restored an active paging domain?
>
>I think we should introduce a proper helper that acquires the lock,
>safely sets the opaque owner, and manages the domain state without
>forcing it into a blocking domain? Please note that the helper shouldn't
>be exposed / exported, we wouldn't want any abusive calls to claim owner
>without attaching to a blocked domain.

The init/restore of the owner fields in the group should only be done
when restoring the IOMMU domain for the group, with the mutex already
held. Adding a separate helper might allow setting this up somewhere
else, even in the core, so I want to avoid that.
>
>3. Please note that owner is an opaque "pointer" but we set it to a u64.
>Setting an opaque void *owner directly to a user-controlled u64 token is
>dangerous. If a malicious or unlucky userspace provides a token value
>that happens to perfectly match a valid kernel pointer (e.g. another
>subsystem's cookie), we'll have an aliasing collision.
>
>If that subsystem subsequently calls iommu_device_claim_dma_owner(), the
>if (group->owner == owner) check will falsely pass, incorrectly granting
>DMA ownership and breaking device isolation.
>
>I understand we can't restore the original owner pointer because the
>original iommufd file/context from the previous kernel no longer exists.
>However, to avoid aliasing could we dynamically allocate a small wrapper
>(e.g., struct iommu_lu_owner { u64 token; }) and set group->owner to the
>uniquely allocated pointer?

Agreed. Will update.
>
>4. In the full RFC [3], we seem to be "transferring" the ownership but
>here we seem to set the owner_cnt = 1 what if it was > 1 earlier? The
>iommu_device_release_dma_owner seems to handle multiple release calls
>well so this shouldn't lead to crashes but the "owning" entity could use
>this count for some book keeping.. it'd be nice if we could have a
>proper restore_owner API / helper here.

This is being done during the IOMMU probe of the device, when the
default domain is being set on the group, so the group will not have any
existing owner. I will add a WARN for this. The transfer of ownership
will happen after the iommufd is restored. Also note that once
initialized here, the device/group will not be owned by any other driver
or iommufd until the ownership transfer is done.
>
>> +	group->owner_cnt = 1;
>> +	return domain;
>> +}
>> +
>>  /**
>>   * iommu_setup_default_domain - Set the default_domain for the group
>>   * @group: Group to change
>> @@ -3075,8 +3115,8 @@ static int iommu_setup_default_domain(struct iommu_group *group,
>>  				      int target_type)
>>  {
>>  	struct iommu_domain *old_dom = group->default_domain;
>> +	struct iommu_domain *dom, *restored_domain;
>>  	struct group_device *gdev;
>> -	struct iommu_domain *dom;
>>  	bool direct_failed;
>>  	int req_type;
>>  	int ret;
>> @@ -3120,6 +3160,9 @@ static int iommu_setup_default_domain(struct iommu_group *group,
>>  	/* We must set default_domain early for __iommu_device_set_domain */
>>  	group->default_domain = dom;
>>  	if (!group->domain) {
>> +		restored_domain = __iommu_group_maybe_restore_domain(group);
>> +		if (!restored_domain)
>> +			restored_domain = dom;
>>  		/*
>>  		 * Drivers are not allowed to fail the first domain attach.
>>  		 * The only way to recover from this is to fail attaching the
>> @@ -3127,7 +3170,7 @@ static int iommu_setup_default_domain(struct iommu_group *group,
>>  		 * in group->default_domain so it is freed after.
>>  		 */
>>  		ret = __iommu_group_set_domain_internal(
>> -			group, dom, IOMMU_SET_DOMAIN_MUST_SUCCEED);
>> +			group, restored_domain, IOMMU_SET_DOMAIN_MUST_SUCCEED);
>>  		if (WARN_ON(ret))
>>  			goto out_free_old;
>>  	} else {
>> diff --git a/drivers/iommu/liveupdate.c b/drivers/iommu/liveupdate.c
>> index 83eb609b3fd7..6b211436ad25 100644
>> --- a/drivers/iommu/liveupdate.c
>> +++ b/drivers/iommu/liveupdate.c
>> @@ -501,3 +501,34 @@ void iommu_unpreserve_device(struct iommu_domain *domain, struct device *dev)
>>
>>  	iommu_unpreserve_locked(iommu->iommu_dev);
>>  }
>> +
>> +struct iommu_domain *iommu_restore_domain(struct device *dev, struct device_ser *ser)
>> +{
>> +	struct iommu_domain_ser *domain_ser;
>> +	struct iommu_lu_flb_obj *flb_obj;
>> +	struct iommu_domain *domain;
>> +	int ret;
>> +
>> +	domain_ser = __va(ser->domain_iommu_ser.domain_phys);
>> +
>> +	ret = liveupdate_flb_get_incoming(&iommu_flb, (void **)&flb_obj);
>> +	if (ret)
>> +		return ERR_PTR(ret);
>> +
>> +	guard(mutex)(&flb_obj->lock);
>> +	if (domain_ser->restored_domain)
>> +		return domain_ser->restored_domain;
>> +
>> +	domain_ser->obj.incoming =  true;
>> +	domain = iommu_paging_domain_alloc(dev);
>> +	if (IS_ERR(domain))
>> +		return domain;
>> +
>> +	ret = domain->ops->restore(domain, domain_ser);
>> +	if (ret)
>
>did we miss free-ing domain here?

Yes, Ankit pointed out the same. Will update this.
>
>> +		return ERR_PTR(ret);
>> +
>> +	domain->preserved_state = domain_ser;
>> +	domain_ser->restored_domain = domain;
>> +	return domain;
>> +}
>> diff --git a/include/linux/iommu-lu.h b/include/linux/iommu-lu.h
>> index 48c07514a776..4879abaf83d3 100644
>> --- a/include/linux/iommu-lu.h
>> +++ b/include/linux/iommu-lu.h
>> @@ -65,6 +65,8 @@ static inline int dev_iommu_restore_did(struct device *dev, struct iommu_domain
>>  	return -1;
>>  }
>>
>> +struct iommu_domain *iommu_restore_domain(struct device *dev,
>> +					  struct device_ser *ser);
>>  int iommu_for_each_preserved_device(iommu_preserved_device_iter_fn fn,
>>  				    void *arg);
>>  struct device_ser *iommu_get_device_preserved_data(struct device *dev);
>> @@ -95,6 +97,12 @@ static inline void *iommu_domain_restored_state(struct iommu_domain *domain)
>>  	return NULL;
>>  }
>>
>> +static inline struct iommu_domain *iommu_restore_domain(struct device *dev,
>> +							struct device_ser *ser)
>> +{
>> +	return NULL;
>> +}
>> +
>>  static inline int iommu_for_each_preserved_device(iommu_preserved_device_iter_fn fn, void *arg)
>>  {
>>  	return -EOPNOTSUPP;
>
>Thanks,
>Praan
>
>[1] https://elixir.bootlin.com/linux/v7.0-rc4/source/drivers/iommu/intel/iommu.h#L795
>[2] https://elixir.bootlin.com/linux/v7.0-rc4/source/drivers/iommu/iommu.c#L3367

Thanks,
Sami

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 09/14] iommu/vt-d: preserve PASID table of preserved device
  2026-02-03 22:09 ` [PATCH 09/14] iommu/vt-d: preserve PASID table of preserved device Samiullah Khawaja
@ 2026-03-23 18:19   ` Pranjal Shrivastava
  2026-03-23 18:51     ` Samiullah Khawaja
  0 siblings, 1 reply; 98+ messages in thread
From: Pranjal Shrivastava @ 2026-03-23 18:19 UTC (permalink / raw)
  To: Samiullah Khawaja
  Cc: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe, Robin Murphy, Kevin Tian, Alex Williamson,
	Shuah Khan, iommu, linux-kernel, kvm, Saeed Mahameed,
	Adithya Jayachandran, Parav Pandit, Leon Romanovsky, William Tu,
	Pratyush Yadav, Pasha Tatashin, David Matlack, Andrew Morton,
	Chris Li, Vipin Sharma, YiFei Zhu

On Tue, Feb 03, 2026 at 10:09:43PM +0000, Samiullah Khawaja wrote:
> In scalable mode the PASID table is used to fetch the io page tables.
> Preserve and restore the PASID table of the preserved devices.
> 
> Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
> ---
>  drivers/iommu/intel/iommu.c      |   4 +-
>  drivers/iommu/intel/iommu.h      |   5 ++
>  drivers/iommu/intel/liveupdate.c | 130 +++++++++++++++++++++++++++++++
>  drivers/iommu/intel/pasid.c      |   7 +-
>  drivers/iommu/intel/pasid.h      |   9 +++
>  include/linux/kho/abi/iommu.h    |   8 ++
>  6 files changed, 160 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> index 83faad53f247..2d0dae57f5a2 100644
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -2944,8 +2944,10 @@ static bool __maybe_clean_unpreserved_context_entries(struct intel_iommu *iommu)
>  		if (info->iommu != iommu)
>  			continue;
>  
> -		if (dev_iommu_preserved_state(&pdev->dev))
> +		if (dev_iommu_preserved_state(&pdev->dev)) {
> +			pasid_cleanup_preserved_table(&pdev->dev);
>  			continue;
> +		}
>  
>  		domain_context_clear(info);
>  	}
> diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
> index 057bd6035d85..d24d6aeaacc0 100644
> --- a/drivers/iommu/intel/iommu.h
> +++ b/drivers/iommu/intel/iommu.h
> @@ -1286,6 +1286,7 @@ int intel_iommu_preserve(struct iommu_device *iommu, struct iommu_ser *iommu_ser
>  void intel_iommu_unpreserve(struct iommu_device *iommu, struct iommu_ser *iommu_ser);
>  void intel_iommu_liveupdate_restore_root_table(struct intel_iommu *iommu,
>  					       struct iommu_ser *iommu_ser);
> +void pasid_cleanup_preserved_table(struct device *dev);
>  #else
>  static inline int intel_iommu_preserve_device(struct device *dev, struct device_ser *device_ser)
>  {
> @@ -1309,6 +1310,10 @@ static inline void intel_iommu_liveupdate_restore_root_table(struct intel_iommu
>  							     struct iommu_ser *iommu_ser)
>  {
>  }
> +
> +static inline void pasid_cleanup_preserved_table(struct device *dev)
> +{
> +}
>  #endif
>  
>  #ifdef CONFIG_INTEL_IOMMU_SVM
> diff --git a/drivers/iommu/intel/liveupdate.c b/drivers/iommu/intel/liveupdate.c
> index 6dcb5783d1db..53bb5fe3a764 100644
> --- a/drivers/iommu/intel/liveupdate.c
> +++ b/drivers/iommu/intel/liveupdate.c
> @@ -14,6 +14,7 @@
>  #include <linux/pci.h>
>  
>  #include "iommu.h"
> +#include "pasid.h"
>  #include "../iommu-pages.h"
>  
>  static void unpreserve_iommu_context(struct intel_iommu *iommu, int end)
> @@ -113,9 +114,89 @@ void intel_iommu_liveupdate_restore_root_table(struct intel_iommu *iommu,
>  		iommu->reg_phys, iommu_ser->intel.root_table);
>  }
>  
> +enum pasid_lu_op {
> +	PASID_LU_OP_PRESERVE = 1,
> +	PASID_LU_OP_UNPRESERVE,
> +	PASID_LU_OP_RESTORE,
> +	PASID_LU_OP_FREE,
> +};
> +
> +static int pasid_lu_do_op(void *table, enum pasid_lu_op op)
> +{
> +	int ret = 0;
> +
> +	switch (op) {
> +	case PASID_LU_OP_PRESERVE:
> +		ret = iommu_preserve_page(table);
> +		break;
> +	case PASID_LU_OP_UNPRESERVE:
> +		iommu_unpreserve_page(table);
> +		break;
> +	case PASID_LU_OP_RESTORE:
> +		iommu_restore_page(virt_to_phys(table));
> +		break;
> +	case PASID_LU_OP_FREE:
> +		iommu_free_pages(table);
> +		break;
> +	}
> +
> +	return ret;
> +}
> +
> +static int pasid_lu_handle_pd(struct pasid_dir_entry *dir, enum pasid_lu_op op)
> +{
> +	struct pasid_entry *table;
> +	int ret;
> +
> +	/* Only preserve first table for NO_PASID. */
> +	table = get_pasid_table_from_pde(&dir[0]);
> +	if (!table)
> +		return -EINVAL;
> +
> +	ret = pasid_lu_do_op(table, op);
> +	if (ret)
> +		return ret;
> +
> +	ret = pasid_lu_do_op(dir, op);
> +	if (ret)
> +		goto err;
> +
> +	return 0;
> +err:
> +	if (op == PASID_LU_OP_PRESERVE)
> +		pasid_lu_do_op(table, PASID_LU_OP_UNPRESERVE);
> +
> +	return ret;
> +}
> +
> +void pasid_cleanup_preserved_table(struct device *dev)
> +{
> +	struct pasid_table *pasid_table;
> +	struct pasid_dir_entry *dir;
> +	struct pasid_entry *table;
> +
> +	pasid_table = intel_pasid_get_table(dev);
> +	if (!pasid_table)
> +		return;
> +
> +	dir = pasid_table->table;
> +	table = get_pasid_table_from_pde(&dir[0]);
> +	if (!table)
> +		return;
> +
> +	/* Cleanup everything except the first entry. */
> +	memset(&table[1], 0, SZ_4K - sizeof(*table));
> +	memset(&dir[1], 0, SZ_4K - sizeof(struct pasid_dir_entry));

(Not too familiar with Intel IOMMU / VT-d)
We seem to hardcode SZ_4K when clearing the directory entries. But in
intel_pasid_alloc_table(), the allocation size seems to depend on
max_pasid which could be larger than one page (order > 0)?

If the directory is multi-page, won't we leave the trailing pages
full of stale PDE pointers that the HW could still walk?

> +
> +	clflush_cache_range(&table[0], SZ_4K);
> +	clflush_cache_range(&dir[0], SZ_4K);
> +}
> +

[ ------ >8 ------ ]

Thanks,
Praan

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 06/14] iommu/vt-d: Implement device and iommu preserve/unpreserve ops
  2026-03-20 23:01   ` Pranjal Shrivastava
  2026-03-21 13:27     ` Pranjal Shrivastava
@ 2026-03-23 18:32     ` Samiullah Khawaja
  1 sibling, 0 replies; 98+ messages in thread
From: Samiullah Khawaja @ 2026-03-23 18:32 UTC (permalink / raw)
  To: Pranjal Shrivastava
  Cc: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe, Robin Murphy, Kevin Tian, Alex Williamson,
	Shuah Khan, iommu, linux-kernel, kvm, Saeed Mahameed,
	Adithya Jayachandran, Parav Pandit, Leon Romanovsky, William Tu,
	Pratyush Yadav, Pasha Tatashin, David Matlack, Andrew Morton,
	Chris Li, Vipin Sharma, YiFei Zhu

On Fri, Mar 20, 2026 at 11:01:59PM +0000, Pranjal Shrivastava wrote:
>On Tue, Feb 03, 2026 at 10:09:40PM +0000, Samiullah Khawaja wrote:
>> Add implementation of the device and iommu preservation in a separate
>> file. Also set the device and iommu preserve/unpreserve ops in the
>> struct iommu_ops.
>>
>> During normal shutdown the iommu translation is disabled. Since the root
>> table is preserved during live update, it needs to be cleaned up and the
>> context entries of the unpreserved devices need to be cleared.
>>
>> Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
>> ---
>>  drivers/iommu/intel/Makefile     |   1 +
>>  drivers/iommu/intel/iommu.c      |  47 ++++++++++-
>>  drivers/iommu/intel/iommu.h      |  27 +++++++
>>  drivers/iommu/intel/liveupdate.c | 134 +++++++++++++++++++++++++++++++
>>  4 files changed, 205 insertions(+), 4 deletions(-)
>>  create mode 100644 drivers/iommu/intel/liveupdate.c
>>
>> diff --git a/drivers/iommu/intel/Makefile b/drivers/iommu/intel/Makefile
>> index ada651c4a01b..d38fc101bc35 100644
>> --- a/drivers/iommu/intel/Makefile
>> +++ b/drivers/iommu/intel/Makefile
>> @@ -6,3 +6,4 @@ obj-$(CONFIG_INTEL_IOMMU_DEBUGFS) += debugfs.o
>>  obj-$(CONFIG_INTEL_IOMMU_SVM) += svm.o
>>  obj-$(CONFIG_IRQ_REMAP) += irq_remapping.o
>>  obj-$(CONFIG_INTEL_IOMMU_PERF_EVENTS) += perfmon.o
>> +obj-$(CONFIG_IOMMU_LIVEUPDATE) += liveupdate.o
>> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
>> index 134302fbcd92..c95de93fb72f 100644
>> --- a/drivers/iommu/intel/iommu.c
>> +++ b/drivers/iommu/intel/iommu.c
>> @@ -16,6 +16,7 @@
>>  #include <linux/crash_dump.h>
>>  #include <linux/dma-direct.h>
>>  #include <linux/dmi.h>
>> +#include <linux/iommu-lu.h>
>>  #include <linux/memory.h>
>>  #include <linux/pci.h>
>>  #include <linux/pci-ats.h>
>> @@ -52,6 +53,8 @@ static int rwbf_quirk;
>>
>>  #define rwbf_required(iommu)	(rwbf_quirk || cap_rwbf((iommu)->cap))
>>
>> +static bool __maybe_clean_unpreserved_context_entries(struct intel_iommu *iommu);
>> +
>>  /*
>>   * set to 1 to panic kernel if can't successfully enable VT-d
>>   * (used when kernel is launched w/ TXT)
>> @@ -60,8 +63,6 @@ static int force_on = 0;
>>  static int intel_iommu_tboot_noforce;
>>  static int no_platform_optin;
>>
>> -#define ROOT_ENTRY_NR (VTD_PAGE_SIZE/sizeof(struct root_entry))
>> -
>>  /*
>>   * Take a root_entry and return the Lower Context Table Pointer (LCTP)
>>   * if marked present.
>> @@ -2378,8 +2379,10 @@ void intel_iommu_shutdown(void)
>>  		/* Disable PMRs explicitly here. */
>>  		iommu_disable_protect_mem_regions(iommu);
>>
>> -		/* Make sure the IOMMUs are switched off */
>> -		iommu_disable_translation(iommu);
>> +		if (!__maybe_clean_unpreserved_context_entries(iommu)) {
>> +			/* Make sure the IOMMUs are switched off */
>> +			iommu_disable_translation(iommu);
>> +		}
>>  	}
>>  }
>>
>> @@ -2902,6 +2905,38 @@ static const struct iommu_dirty_ops intel_second_stage_dirty_ops = {
>>  	.set_dirty_tracking = intel_iommu_set_dirty_tracking,
>>  };
>>
>> +#ifdef CONFIG_IOMMU_LIVEUPDATE
>> +static bool __maybe_clean_unpreserved_context_entries(struct intel_iommu *iommu)
>> +{
>> +	struct device_domain_info *info;
>> +	struct pci_dev *pdev = NULL;
>> +
>> +	if (!iommu->iommu.outgoing_preserved_state)
>> +		return false;
>> +
>> +	for_each_pci_dev(pdev) {
>
>Shouldn't we *also* clean context entries for non-pci devices?
>
>AFAICT, in the current code, we simply turn off translation during
>shutdowm. However, with this change, because this cleanup loop only runs
>for_each_pci_dev(), it completely ignores any non-PCI devices (like
>platform devices) that might be attached to this IOMMU?
>
>Since the IOMMU is left powered on during the handover, skipping non-PCI
>devices in this cleanup loop means their context entries are left
>perfectly intact and active in the hardware root table. There could be a
>chance that the new kernel doesn't reclaim/overwrite these entries, in
>which case zombie DMA could survive the KHO handover.

Agreed. I will update this.
>
>> +		info = dev_iommu_priv_get(&pdev->dev);
>> +		if (!info)
>> +			continue;
>> +
>> +		if (info->iommu != iommu)
>> +			continue;
>> +
>> +		if (dev_iommu_preserved_state(&pdev->dev))
>> +			continue;
>> +
>> +		domain_context_clear(info);
>> +	}
>> +
>> +	return true;
>> +}
>> +#else
>> +static bool __maybe_clean_unpreserved_context_entries(struct intel_iommu *iommu)
>> +{
>> +	return false;
>> +}
>> +#endif
>> +
>>  static struct iommu_domain *
>>  intel_iommu_domain_alloc_second_stage(struct device *dev,
>>  				      struct intel_iommu *iommu, u32 flags)
>> @@ -3925,6 +3960,10 @@ const struct iommu_ops intel_iommu_ops = {
>>  	.is_attach_deferred	= intel_iommu_is_attach_deferred,
>>  	.def_domain_type	= device_def_domain_type,
>>  	.page_response		= intel_iommu_page_response,
>> +	.preserve_device	= intel_iommu_preserve_device,
>> +	.unpreserve_device	= intel_iommu_unpreserve_device,
>> +	.preserve		= intel_iommu_preserve,
>> +	.unpreserve		= intel_iommu_unpreserve,
>>  };
>>
>>  static void quirk_iommu_igfx(struct pci_dev *dev)
>> diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
>> index 25c5e22096d4..70032e86437d 100644
>> --- a/drivers/iommu/intel/iommu.h
>> +++ b/drivers/iommu/intel/iommu.h
>> @@ -557,6 +557,8 @@ struct root_entry {
>>  	u64     hi;
>>  };
>>
>> +#define ROOT_ENTRY_NR (VTD_PAGE_SIZE / sizeof(struct root_entry))
>> +
>>  /*
>>   * low 64 bits:
>>   * 0: present
>> @@ -1276,6 +1278,31 @@ static inline int iopf_for_domain_replace(struct iommu_domain *new,
>>  	return 0;
>>  }
>>
>> +#ifdef CONFIG_IOMMU_LIVEUPDATE
>> +int intel_iommu_preserve_device(struct device *dev, struct device_ser *device_ser);
>> +void intel_iommu_unpreserve_device(struct device *dev, struct device_ser *device_ser);
>> +int intel_iommu_preserve(struct iommu_device *iommu, struct iommu_ser *iommu_ser);
>> +void intel_iommu_unpreserve(struct iommu_device *iommu, struct iommu_ser *iommu_ser);
>> +#else
>> +static inline int intel_iommu_preserve_device(struct device *dev, struct device_ser *device_ser)
>> +{
>> +	return -EOPNOTSUPP;
>> +}
>> +
>> +static inline void intel_iommu_unpreserve_device(struct device *dev, struct device_ser *device_ser)
>> +{
>> +}
>> +
>> +static inline int intel_iommu_preserve(struct iommu_device *iommu, struct iommu_ser *iommu_ser)
>> +{
>> +	return -EOPNOTSUPP;
>> +}
>> +
>> +static inline void intel_iommu_unpreserve(struct iommu_device *iommu, struct iommu_ser *iommu_ser)
>> +{
>> +}
>> +#endif
>> +
>>  #ifdef CONFIG_INTEL_IOMMU_SVM
>>  void intel_svm_check(struct intel_iommu *iommu);
>>  struct iommu_domain *intel_svm_domain_alloc(struct device *dev,
>> diff --git a/drivers/iommu/intel/liveupdate.c b/drivers/iommu/intel/liveupdate.c
>> new file mode 100644
>> index 000000000000..82ba1daf1711
>> --- /dev/null
>> +++ b/drivers/iommu/intel/liveupdate.c
>> @@ -0,0 +1,134 @@
>> +// SPDX-License-Identifier: GPL-2.0-only
>> +
>> +/*
>> + * Copyright (C) 2025, Google LLC
>> + * Author: Samiullah Khawaja <skhawaja@google.com>
>> + */
>> +
>> +#define pr_fmt(fmt)    "iommu: liveupdate: " fmt
>> +
>> +#include <linux/kexec_handover.h>
>> +#include <linux/liveupdate.h>
>> +#include <linux/iommu-lu.h>
>> +#include <linux/module.h>
>> +#include <linux/pci.h>
>> +
>> +#include "iommu.h"
>> +#include "../iommu-pages.h"
>> +
>> +static void unpreserve_iommu_context(struct intel_iommu *iommu, int end)
>> +{
>> +	struct context_entry *context;
>> +	int i;
>> +
>> +	if (end < 0)
>> +		end = ROOT_ENTRY_NR;
>> +
>> +	for (i = 0; i < end; i++) {
>> +		context = iommu_context_addr(iommu, i, 0, 0);
>> +		if (context)
>> +			iommu_unpreserve_page(context);
>> +
>> +		if (!sm_supported(iommu))
>> +			continue;
>> +
>> +		context = iommu_context_addr(iommu, i, 0x80, 0);
>> +		if (context)
>> +			iommu_unpreserve_page(context);
>> +	}
>> +}
>> +
>> +static int preserve_iommu_context(struct intel_iommu *iommu)
>> +{
>> +	struct context_entry *context;
>> +	int ret;
>> +	int i;
>> +
>> +	for (i = 0; i < ROOT_ENTRY_NR; i++) {
>> +		context = iommu_context_addr(iommu, i, 0, 0);
>> +		if (context) {
>> +			ret = iommu_preserve_page(context);
>> +			if (ret)
>> +				goto error;
>> +		}
>> +
>> +		if (!sm_supported(iommu))
>> +			continue;
>> +
>> +		context = iommu_context_addr(iommu, i, 0x80, 0);
>> +		if (context) {
>> +			ret = iommu_preserve_page(context);
>> +			if (ret)
>> +				goto error_sm;
>> +		}
>> +	}
>> +
>> +	return 0;
>> +
>> +error_sm:
>> +	context = iommu_context_addr(iommu, i, 0, 0);
>> +	iommu_unpreserve_page(context);
>> +error:
>> +	unpreserve_iommu_context(iommu, i);
>> +	return ret;
>> +}
>> +
>> +int intel_iommu_preserve_device(struct device *dev, struct device_ser *device_ser)
>> +{
>> +	struct device_domain_info *info = dev_iommu_priv_get(dev);
>> +
>> +	if (!dev_is_pci(dev))
>> +		return -EOPNOTSUPP;
>
>Maybe also WARN_ON_ONCE ?

I think WARN should not be used if the driver supports only PCIe device
preservation. Do you think returning NOTSUPP is not enough and an error
should be logged? I can add a ratelimited error log.
>
>> +
>> +	if (!info)
>> +		return -EINVAL;
>> +
>> +	device_ser->domain_iommu_ser.did = domain_id_iommu(info->domain, info->iommu);
>> +	return 0;
>> +}
>> +
>> +void intel_iommu_unpreserve_device(struct device *dev, struct device_ser *device_ser)
>> +{
>> +}
>
>Nit: We can just not populate the unpreserve_device op here, we don't
>need an empty function

Agreed.
>
>> +
>> +int intel_iommu_preserve(struct iommu_device *iommu_dev, struct iommu_ser *ser)
>> +{
>> +	struct intel_iommu *iommu;
>> +	int ret;
>> +
>> +	iommu = container_of(iommu_dev, struct intel_iommu, iommu);
>> +
>> +	spin_lock(&iommu->lock);
>
>Minor note: We call this with the session->mutex acquired which is a
>sleepable/preemptable context whereas spin_lock disables preemption and
>puts us in an atomic context.
>
>While iommu_context_addr() happens to be safe here because it uses
>GFP_ATOMIC & we pass alloc=0, we are heavily relying on the assumption
>that iommu_context_addr() or other helpers will never sleep or attempt
>a GFP_KERNEL allocation under the hood.
>
>Given how easy it would be for a future refactor to accidentally
>introduce a sleeping lock inside any of the other helpers, I'd recommend
>adding an explicit comment here warning that this runs in an atomic
>context (same for unpreserve below).

I will add a comment about this as you suggested in the later email.
>
>> +	ret = preserve_iommu_context(iommu);
>> +	if (ret)
>> +		goto err;
>> +
>> +	ret = iommu_preserve_page(iommu->root_entry);
>> +	if (ret) {
>> +		unpreserve_iommu_context(iommu, -1);
>
>Nit: The same overloading discussion as Vipin pointed out in the other
>patch.

Agreed. Will update.
>
>> +		goto err;
>> +	}
>> +
>> +	ser->intel.phys_addr = iommu->reg_phys;
>> +	ser->intel.root_table = __pa(iommu->root_entry);
>> +	ser->type = IOMMU_INTEL;
>> +	ser->token = ser->intel.phys_addr;
>> +	spin_unlock(&iommu->lock);
>> +
>> +	return 0;
>> +err:
>> +	spin_unlock(&iommu->lock);
>> +	return ret;
>> +}
>> +
>> +void intel_iommu_unpreserve(struct iommu_device *iommu_dev, struct iommu_ser *iommu_ser)
>> +{
>> +	struct intel_iommu *iommu;
>> +
>> +	iommu = container_of(iommu_dev, struct intel_iommu, iommu);
>> +
>> +	spin_lock(&iommu->lock);
>> +	unpreserve_iommu_context(iommu, -1);
>> +	iommu_unpreserve_page(iommu->root_entry);
>> +	spin_unlock(&iommu->lock);
>> +}
>
>Thanks,
>Praan

Thanks,
Sami

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 09/14] iommu/vt-d: preserve PASID table of preserved device
  2026-03-23 18:19   ` Pranjal Shrivastava
@ 2026-03-23 18:51     ` Samiullah Khawaja
  0 siblings, 0 replies; 98+ messages in thread
From: Samiullah Khawaja @ 2026-03-23 18:51 UTC (permalink / raw)
  To: Pranjal Shrivastava
  Cc: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe, Robin Murphy, Kevin Tian, Alex Williamson,
	Shuah Khan, iommu, linux-kernel, kvm, Saeed Mahameed,
	Adithya Jayachandran, Parav Pandit, Leon Romanovsky, William Tu,
	Pratyush Yadav, Pasha Tatashin, David Matlack, Andrew Morton,
	Chris Li, Vipin Sharma, YiFei Zhu

On Mon, Mar 23, 2026 at 06:19:18PM +0000, Pranjal Shrivastava wrote:
>On Tue, Feb 03, 2026 at 10:09:43PM +0000, Samiullah Khawaja wrote:
>> In scalable mode the PASID table is used to fetch the io page tables.
>> Preserve and restore the PASID table of the preserved devices.
>>
>> Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
>> ---
>>  drivers/iommu/intel/iommu.c      |   4 +-
>>  drivers/iommu/intel/iommu.h      |   5 ++
>>  drivers/iommu/intel/liveupdate.c | 130 +++++++++++++++++++++++++++++++
>>  drivers/iommu/intel/pasid.c      |   7 +-
>>  drivers/iommu/intel/pasid.h      |   9 +++
>>  include/linux/kho/abi/iommu.h    |   8 ++
>>  6 files changed, 160 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
>> index 83faad53f247..2d0dae57f5a2 100644
>> --- a/drivers/iommu/intel/iommu.c
>> +++ b/drivers/iommu/intel/iommu.c
>> @@ -2944,8 +2944,10 @@ static bool __maybe_clean_unpreserved_context_entries(struct intel_iommu *iommu)
>>  		if (info->iommu != iommu)
>>  			continue;
>>
>> -		if (dev_iommu_preserved_state(&pdev->dev))
>> +		if (dev_iommu_preserved_state(&pdev->dev)) {
>> +			pasid_cleanup_preserved_table(&pdev->dev);
>>  			continue;
>> +		}
>>
>>  		domain_context_clear(info);
>>  	}
>> diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
>> index 057bd6035d85..d24d6aeaacc0 100644
>> --- a/drivers/iommu/intel/iommu.h
>> +++ b/drivers/iommu/intel/iommu.h
>> @@ -1286,6 +1286,7 @@ int intel_iommu_preserve(struct iommu_device *iommu, struct iommu_ser *iommu_ser
>>  void intel_iommu_unpreserve(struct iommu_device *iommu, struct iommu_ser *iommu_ser);
>>  void intel_iommu_liveupdate_restore_root_table(struct intel_iommu *iommu,
>>  					       struct iommu_ser *iommu_ser);
>> +void pasid_cleanup_preserved_table(struct device *dev);
>>  #else
>>  static inline int intel_iommu_preserve_device(struct device *dev, struct device_ser *device_ser)
>>  {
>> @@ -1309,6 +1310,10 @@ static inline void intel_iommu_liveupdate_restore_root_table(struct intel_iommu
>>  							     struct iommu_ser *iommu_ser)
>>  {
>>  }
>> +
>> +static inline void pasid_cleanup_preserved_table(struct device *dev)
>> +{
>> +}
>>  #endif
>>
>>  #ifdef CONFIG_INTEL_IOMMU_SVM
>> diff --git a/drivers/iommu/intel/liveupdate.c b/drivers/iommu/intel/liveupdate.c
>> index 6dcb5783d1db..53bb5fe3a764 100644
>> --- a/drivers/iommu/intel/liveupdate.c
>> +++ b/drivers/iommu/intel/liveupdate.c
>> @@ -14,6 +14,7 @@
>>  #include <linux/pci.h>
>>
>>  #include "iommu.h"
>> +#include "pasid.h"
>>  #include "../iommu-pages.h"
>>
>>  static void unpreserve_iommu_context(struct intel_iommu *iommu, int end)
>> @@ -113,9 +114,89 @@ void intel_iommu_liveupdate_restore_root_table(struct intel_iommu *iommu,
>>  		iommu->reg_phys, iommu_ser->intel.root_table);
>>  }
>>
>> +enum pasid_lu_op {
>> +	PASID_LU_OP_PRESERVE = 1,
>> +	PASID_LU_OP_UNPRESERVE,
>> +	PASID_LU_OP_RESTORE,
>> +	PASID_LU_OP_FREE,
>> +};
>> +
>> +static int pasid_lu_do_op(void *table, enum pasid_lu_op op)
>> +{
>> +	int ret = 0;
>> +
>> +	switch (op) {
>> +	case PASID_LU_OP_PRESERVE:
>> +		ret = iommu_preserve_page(table);
>> +		break;
>> +	case PASID_LU_OP_UNPRESERVE:
>> +		iommu_unpreserve_page(table);
>> +		break;
>> +	case PASID_LU_OP_RESTORE:
>> +		iommu_restore_page(virt_to_phys(table));
>> +		break;
>> +	case PASID_LU_OP_FREE:
>> +		iommu_free_pages(table);
>> +		break;
>> +	}
>> +
>> +	return ret;
>> +}
>> +
>> +static int pasid_lu_handle_pd(struct pasid_dir_entry *dir, enum pasid_lu_op op)
>> +{
>> +	struct pasid_entry *table;
>> +	int ret;
>> +
>> +	/* Only preserve first table for NO_PASID. */
>> +	table = get_pasid_table_from_pde(&dir[0]);
>> +	if (!table)
>> +		return -EINVAL;
>> +
>> +	ret = pasid_lu_do_op(table, op);
>> +	if (ret)
>> +		return ret;
>> +
>> +	ret = pasid_lu_do_op(dir, op);
>> +	if (ret)
>> +		goto err;
>> +
>> +	return 0;
>> +err:
>> +	if (op == PASID_LU_OP_PRESERVE)
>> +		pasid_lu_do_op(table, PASID_LU_OP_UNPRESERVE);
>> +
>> +	return ret;
>> +}
>> +
>> +void pasid_cleanup_preserved_table(struct device *dev)
>> +{
>> +	struct pasid_table *pasid_table;
>> +	struct pasid_dir_entry *dir;
>> +	struct pasid_entry *table;
>> +
>> +	pasid_table = intel_pasid_get_table(dev);
>> +	if (!pasid_table)
>> +		return;
>> +
>> +	dir = pasid_table->table;
>> +	table = get_pasid_table_from_pde(&dir[0]);
>> +	if (!table)
>> +		return;
>> +
>> +	/* Cleanup everything except the first entry. */
>> +	memset(&table[1], 0, SZ_4K - sizeof(*table));
>> +	memset(&dir[1], 0, SZ_4K - sizeof(struct pasid_dir_entry));
>
>(Not too familiar with Intel IOMMU / VT-d)
>We seem to hardcode SZ_4K when clearing the directory entries. But in
>intel_pasid_alloc_table(), the allocation size seems to depend on
>max_pasid which could be larger than one page (order > 0)?
>
>If the directory is multi-page, won't we leave the trailing pages
>full of stale PDE pointers that the HW could still walk?

Agreed. It should clean up based on max_pasid. Will update.
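To make the sizing concrete: with 64-byte scalable-mode PASID-table
entries (64 per 4K page, i.e. PASID_PDE_SHIFT == 6) and 8-byte directory
entries, the directory grows past one page once max_pasid exceeds 32K.
A rough userland sketch of that arithmetic (constants assumed from the
VT-d layout, not copied from pasid.h):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SZ_4K			4096u
#define PDE_SIZE		8u	/* bytes per pasid_dir_entry */
#define PASIDS_PER_TABLE	64u	/* 4K page / 64-byte pasid_entry */

/*
 * Bytes occupied by the PASID directory for a given max_pasid: one
 * 8-byte PDE per 64 PASIDs, with the allocation rounded up to at
 * least one page. The cleanup must memset this full size, not a
 * hardcoded SZ_4K.
 */
static size_t pasid_dir_bytes(uint32_t max_pasid)
{
	size_t bytes = (size_t)(max_pasid / PASIDS_PER_TABLE) * PDE_SIZE;

	return bytes < SZ_4K ? SZ_4K : bytes;
}
```

So a 20-bit PASID space (max_pasid = 1 << 20) needs a 128K (32-page)
directory, and clearing only the first 4K would indeed leave 31 pages of
stale PDEs for the hardware to walk.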
>
>> +
>> +	clflush_cache_range(&table[0], SZ_4K);
>> +	clflush_cache_range(&dir[0], SZ_4K);
>> +}
>> +
>
>[ ------ >8 ------ ]
>
>Thanks,
>Praan

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 07/14] iommu/vt-d: Restore IOMMU state and reclaimed domain ids
  2026-03-22 19:51   ` Pranjal Shrivastava
@ 2026-03-23 19:33     ` Samiullah Khawaja
  0 siblings, 0 replies; 98+ messages in thread
From: Samiullah Khawaja @ 2026-03-23 19:33 UTC (permalink / raw)
  To: Pranjal Shrivastava
  Cc: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe, Robin Murphy, Kevin Tian, Alex Williamson,
	Shuah Khan, iommu, linux-kernel, kvm, Saeed Mahameed,
	Adithya Jayachandran, Parav Pandit, Leon Romanovsky, William Tu,
	Pratyush Yadav, Pasha Tatashin, David Matlack, Andrew Morton,
	Chris Li, Vipin Sharma, YiFei Zhu

On Sun, Mar 22, 2026 at 07:51:11PM +0000, Pranjal Shrivastava wrote:
>On Tue, Feb 03, 2026 at 10:09:41PM +0000, Samiullah Khawaja wrote:
>> During boot fetch the preserved state of IOMMU unit and if found then
>> restore the state.
>>
>> - Reuse the root_table that was preserved in the previous kernel.
>> - Reclaim the domain ids of the preserved domains for each preserved
>>   device so these are not acquired by another domain.
>>
>> Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
>> ---
>>  drivers/iommu/intel/iommu.c      | 26 +++++++++++++++------
>>  drivers/iommu/intel/iommu.h      |  7 ++++++
>>  drivers/iommu/intel/liveupdate.c | 40 ++++++++++++++++++++++++++++++++
>>  3 files changed, 66 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
>> index c95de93fb72f..8acb7f8a7627 100644
>> --- a/drivers/iommu/intel/iommu.c
>> +++ b/drivers/iommu/intel/iommu.c
>> @@ -222,12 +222,12 @@ static void clear_translation_pre_enabled(struct intel_iommu *iommu)
>>  	iommu->flags &= ~VTD_FLAG_TRANS_PRE_ENABLED;
>>  }
>>
>> -static void init_translation_status(struct intel_iommu *iommu)
>> +static void init_translation_status(struct intel_iommu *iommu, bool restoring)
>>  {
>>  	u32 gsts;
>>
>>  	gsts = readl(iommu->reg + DMAR_GSTS_REG);
>> -	if (gsts & DMA_GSTS_TES)
>> +	if (!restoring && (gsts & DMA_GSTS_TES))
>>  		iommu->flags |= VTD_FLAG_TRANS_PRE_ENABLED;
>>  }
>>
>> @@ -670,10 +670,16 @@ void dmar_fault_dump_ptes(struct intel_iommu *iommu, u16 source_id,
>>  #endif
>>
>>  /* iommu handling */
>> -static int iommu_alloc_root_entry(struct intel_iommu *iommu)
>> +static int iommu_alloc_root_entry(struct intel_iommu *iommu, struct iommu_ser *restored_state)
>>  {
>>  	struct root_entry *root;
>>
>> +	if (restored_state) {
>> +		intel_iommu_liveupdate_restore_root_table(iommu, restored_state);
>> +		__iommu_flush_cache(iommu, iommu->root_entry, ROOT_SIZE);
>> +		return 0;
>> +	}
>
>Instead of putting this inside the allocator, shouldn't init_dmars and
>intel_iommu_add check for iommu_ser and call
>intel_iommu_liveupdate_restore_root_table() directly, bypassing the
>allocation entirely? This looks like it could be a stand-alone function
>which has nothing to do with allocation.

Agreed. Will move the check out into the caller.
>
>> +
>>  	root = iommu_alloc_pages_node_sz(iommu->node, GFP_ATOMIC, SZ_4K);
>>  	if (!root) {
>>  		pr_err("Allocating root entry for %s failed\n",
>> @@ -1614,6 +1620,7 @@ static int copy_translation_tables(struct intel_iommu *iommu)
>>
>>  static int __init init_dmars(void)
>>  {
>> +	struct iommu_ser *iommu_ser = NULL;
>>  	struct dmar_drhd_unit *drhd;
>>  	struct intel_iommu *iommu;
>>  	int ret;
>> @@ -1636,8 +1643,10 @@ static int __init init_dmars(void)
>>  						   intel_pasid_max_id);
>>  		}
>>
>> +		iommu_ser = iommu_get_preserved_data(iommu->reg_phys, IOMMU_INTEL);
>> +
>>  		intel_iommu_init_qi(iommu);
>> -		init_translation_status(iommu);
>> +		init_translation_status(iommu, !!iommu_ser);
>>
>>  		if (translation_pre_enabled(iommu) && !is_kdump_kernel()) {
>>  			iommu_disable_translation(iommu);
>> @@ -1651,7 +1660,7 @@ static int __init init_dmars(void)
>>  		 * we could share the same root & context tables
>>  		 * among all IOMMU's. Need to Split it later.
>>  		 */
>> -		ret = iommu_alloc_root_entry(iommu);
>> +		ret = iommu_alloc_root_entry(iommu, iommu_ser);
>>  		if (ret)
>>  			goto free_iommu;
>>
>> @@ -2110,15 +2119,18 @@ int dmar_parse_one_satc(struct acpi_dmar_header *hdr, void *arg)
>>  static int intel_iommu_add(struct dmar_drhd_unit *dmaru)
>>  {
>>  	struct intel_iommu *iommu = dmaru->iommu;
>> +	struct iommu_ser *iommu_ser = NULL;
>>  	int ret;
>>
>
>Nit: Add: /* Fetch the preserved context using MMIO base as a token */ ?

Will add.
>
>> +	iommu_ser = iommu_get_preserved_data(iommu->reg_phys, IOMMU_INTEL);
>> +
>>  	/*
>>  	 * Disable translation if already enabled prior to OS handover.
>>  	 */
>> -	if (iommu->gcmd & DMA_GCMD_TE)
>> +	if (!iommu_ser && iommu->gcmd & DMA_GCMD_TE)
>>  		iommu_disable_translation(iommu);
>>
>> -	ret = iommu_alloc_root_entry(iommu);
>> +	ret = iommu_alloc_root_entry(iommu, iommu_ser);
>
>I understand that iommu_get_preserved_data() will eventually return NULL
>after the flb_finish op has executed (based on the LUO IOCTLs dropping
>the incoming state), but I'm sensing a potential UAF/double-restore
>issue here that could happen during the boot window.
>
>I believe we could restore the same context multiple times? I see
>intel_iommu_add() is called from both dmar_device_add() and
>dmar_device_remove() paths, and the ACPI probe has the following
>sequence [1]:
>
>static int acpi_pci_root_add(struct acpi_device *device, ...)
>{
>	// ...
>	if (hotadd && dmar_device_add(handle)) {
>		result = -ENXIO;
>		goto end;
>	}
>
>	// ...
>	root->bus = pci_acpi_scan_root(root);
>	if (!root->bus) {
>		// ...
>		result = -ENODEV;
>		goto remove_dmar;
>	}
>	// ...
>
>remove_dmar:
>	if (hotadd)
>		dmar_device_remove(handle);
>end:
>	return result;
>}
>
>If we successfully restored a domain during dmar_device_add(), but the
>ACPI probe fails later (e.g., pci_acpi_scan_root fails), we jump to
>remove_dmar. This tears down the DMAR unit, it unwinds via
>dmar_device_remove() which eventually calls dmar_iommu_hotplug(false)
>where we:
>
>	disable_dmar_iommu(iommu);
>	free_dmar_iommu(iommu);
>
>At this point, the root table folios are freed back to the allocator.
>
>However, if a re-scan is then triggered before the FLB drops the
>incoming state, we would call:
>
>dmar_device_add() -> intel_iommu_add() -> iommu_alloc_root_entry() again
>
>Because the KHO state wasn't marked as deleted/consumed,
>iommu_get_preserved_data() will hand us the exact same iommu_ser pointer?
>
>In which case, we'd call kho_restore_folio(iommu_ser->intel.root_table)
>on a physical page that might have already been reallocated?
>
>Shouldn't the restored state be explicitly marked as consumed
>(obj.deleted = 1), and shouldn't the driver properly unpreserve/clean up
>the KHO tracking during the free_dmar_iommu() teardown path?

That's a good point.

I think on disable/free the restored state should not be freed abruptly.
The iommu should be able to reuse the same state. We just need to mark
it as consumed/restored. I will rework the disable/free code paths to
handle this.
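One way to express "reusable state, but never restored twice" is a
consumed flag checked before the folio-restore work runs. A tiny
userland sketch of that idea (all names hypothetical; the real fix will
live in the dmar add/remove paths):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Sketch of the consumed-flag idea: the preserved state object survives
 * a failed probe, but once its folios have been reclaimed a second
 * restore attempt must fail so the caller falls back to a fresh
 * allocation instead of re-running the restore on reallocated pages.
 */
struct preserved_state {
	bool consumed;		/* set once the folios were reclaimed */
	int restore_calls;	/* stands in for kho_restore_folio() work */
};

/* Returns 0 on success, -1 if the state was already consumed. */
static int restore_root_table(struct preserved_state *st)
{
	if (st->consumed)
		return -1;	/* caller allocates a fresh root table */
	st->restore_calls++;
	st->consumed = true;
	return 0;
}
```

The second intel_iommu_add() after a remove/re-add then sees the state
as consumed and takes the normal allocation path rather than touching
pages that may already belong to someone else.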
>
>>  	if (ret)
>>  		goto out;
>>
>> diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
>> index 70032e86437d..d7bf63aff17d 100644
>> --- a/drivers/iommu/intel/iommu.h
>> +++ b/drivers/iommu/intel/iommu.h
>> @@ -1283,6 +1283,8 @@ int intel_iommu_preserve_device(struct device *dev, struct device_ser *device_se
>>  void intel_iommu_unpreserve_device(struct device *dev, struct device_ser *device_ser);
>>  int intel_iommu_preserve(struct iommu_device *iommu, struct iommu_ser *iommu_ser);
>>  void intel_iommu_unpreserve(struct iommu_device *iommu, struct iommu_ser *iommu_ser);
>> +void intel_iommu_liveupdate_restore_root_table(struct intel_iommu *iommu,
>> +					       struct iommu_ser *iommu_ser);
>>  #else
>>  static inline int intel_iommu_preserve_device(struct device *dev, struct device_ser *device_ser)
>>  {
>> @@ -1301,6 +1303,11 @@ static inline int intel_iommu_preserve(struct iommu_device *iommu, struct iommu_
>>  static inline void intel_iommu_unpreserve(struct iommu_device *iommu, struct iommu_ser *iommu_ser)
>>  {
>>  }
>> +
>> +static inline void intel_iommu_liveupdate_restore_root_table(struct intel_iommu *iommu,
>> +							     struct iommu_ser *iommu_ser)
>> +{
>> +}
>>  #endif
>>
>>  #ifdef CONFIG_INTEL_IOMMU_SVM
>> diff --git a/drivers/iommu/intel/liveupdate.c b/drivers/iommu/intel/liveupdate.c
>> index 82ba1daf1711..6dcb5783d1db 100644
>> --- a/drivers/iommu/intel/liveupdate.c
>> +++ b/drivers/iommu/intel/liveupdate.c
>> @@ -73,6 +73,46 @@ static int preserve_iommu_context(struct intel_iommu *iommu)
>>  	return ret;
>>  }
>>
>> +static void restore_iommu_context(struct intel_iommu *iommu)
>> +{
>> +	struct context_entry *context;
>> +	int i;
>> +
>> +	for (i = 0; i < ROOT_ENTRY_NR; i++) {
>> +		context = iommu_context_addr(iommu, i, 0, 0);
>> +		if (context)
>> +			BUG_ON(!kho_restore_folio(virt_to_phys(context)));
>> +
>> +		if (!sm_supported(iommu))
>> +			continue;
>> +
>> +		context = iommu_context_addr(iommu, i, 0x80, 0);
>> +		if (context)
>> +			BUG_ON(!kho_restore_folio(virt_to_phys(context)));
>> +	}
>> +}
>> +
>> +static int __restore_used_domain_ids(struct device_ser *ser, void *arg)
>> +{
>> +	int id = ser->domain_iommu_ser.did;
>> +	struct intel_iommu *iommu = arg;
>> +
>
>Shouldn't we check if the did actually belongs to the iommu instance?
>iommu_for_each_preserved_device() iterates over all preserved devices in
>the system. However, here (__restore_used_domain_ids) we allocate the
>device's did in the current iommu->domain_ida without checking if that
>device actually belongs to the current IOMMU?

Yes, this needs to be checked. I will add this.
>
>On multi-IOMMU systems, this will cause every IOMMU's IDA to be
>cross-polluted with the domain IDs of devices attached to other IOMMUs.
>We must verify the device belongs to this specific IOMMU first, maybe:
>
>if (ser->domain_iommu_ser.iommu_phys != iommu->reg_phys)
>		return 0;
>
>> +	ida_alloc_range(&iommu->domain_ida, id, id, GFP_ATOMIC);
>> +	return 0;
>> +}
>> +
>> +void intel_iommu_liveupdate_restore_root_table(struct intel_iommu *iommu,
>> +					       struct iommu_ser *iommu_ser)
>> +{
>> +	BUG_ON(!kho_restore_folio(iommu_ser->intel.root_table));
>> +	iommu->root_entry = __va(iommu_ser->intel.root_table);
>> +
>> +	restore_iommu_context(iommu);
>> +	iommu_for_each_preserved_device(__restore_used_domain_ids, iommu);
>> +	pr_info("Restored IOMMU[0x%llx] Root Table at: 0x%llx\n",
>> +		iommu->reg_phys, iommu_ser->intel.root_table);
>> +}
>> +
>>  int intel_iommu_preserve_device(struct device *dev, struct device_ser *device_ser)
>>  {
>>  	struct device_domain_info *info = dev_iommu_priv_get(dev);
>
>Thanks,
>Praan
>
>[1] https://elixir.bootlin.com/linux/v7.0-rc4/source/drivers/acpi/pci_root.c#L728

Thanks,
Sami

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 11/14] iommufd-lu: Persist iommu hardware pagetables for live update
  2026-02-03 22:09 ` [PATCH 11/14] iommufd-lu: Persist iommu hardware pagetables for live update Samiullah Khawaja
  2026-02-25 23:47   ` Samiullah Khawaja
  2026-03-03  5:56   ` Ankit Soni
@ 2026-03-23 20:28   ` Vipin Sharma
  2026-03-23 21:34     ` Samiullah Khawaja
  2026-03-25 20:08   ` Pranjal Shrivastava
  3 siblings, 1 reply; 98+ messages in thread
From: Vipin Sharma @ 2026-03-23 20:28 UTC (permalink / raw)
  To: Samiullah Khawaja
  Cc: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe, YiFei Zhu, Robin Murphy, Kevin Tian,
	Alex Williamson, Shuah Khan, iommu, linux-kernel, kvm,
	Saeed Mahameed, Adithya Jayachandran, Parav Pandit,
	Leon Romanovsky, William Tu, Pratyush Yadav, Pasha Tatashin,
	David Matlack, Andrew Morton, Chris Li, Pranjal Shrivastava

On Tue, Feb 03, 2026 at 10:09:45PM +0000, Samiullah Khawaja wrote:
> From: YiFei Zhu <zhuyifei@google.com>
> 
> The caller is expected to mark each HWPT to be preserved with an ioctl
> call, with a token that will be used in restore. At preserve time, each
> HWPT's domain is then called with iommu_domain_preserve to preserve the

Nit: Add () after iommu_domain_preserve; it makes it easier to see that
this is a function.

> diff --git a/drivers/iommu/iommufd/liveupdate.c b/drivers/iommu/iommufd/liveupdate.c
> +static int iommufd_save_hwpts(struct iommufd_ctx *ictx,
> +			      struct iommufd_lu *iommufd_lu,
> +			      struct liveupdate_session *session)
> +{
> +	struct iommufd_hwpt_paging *hwpt, **hwpts = NULL;

We should rename hwpts to something to denote that it is a list.

> +	struct iommu_domain_ser *domain_ser;
> +	struct iommufd_hwpt_lu *hwpt_lu;
> +	struct iommufd_object *obj;
> +	unsigned int nr_hwpts = 0;
> +	unsigned long index;
> +	unsigned int i;
> +	int rc = 0;
> +
> +	if (iommufd_lu) {
> +		hwpts = kcalloc(iommufd_lu->nr_hwpts, sizeof(*hwpts),
> +				GFP_KERNEL);
> +		if (!hwpts)
> +			return -ENOMEM;
> +	}
> +
> +	xa_lock(&ictx->objects);
> +	xa_for_each(&ictx->objects, index, obj) {
> +		if (obj->type != IOMMUFD_OBJ_HWPT_PAGING)
> +			continue;
> +
> +		hwpt = container_of(obj, struct iommufd_hwpt_paging, common.obj);
> +		if (!hwpt->lu_preserve)
> +			continue;
> +
> +		if (hwpt->ioas) {
> +			/*
> +			 * Obtain exclusive access to the IOAS and IOPT while we
> +			 * set immutability
> +			 */
> +			mutex_lock(&hwpt->ioas->mutex);
> +			down_write(&hwpt->ioas->iopt.domains_rwsem);
> +			down_write(&hwpt->ioas->iopt.iova_rwsem);
> +
> +			hwpt->ioas->iopt.lu_map_immutable = true;

It feels odd that this is not cleaned up in this function as part of
its error handling. It becomes the caller's responsibility to clean up
the side effect created by this function.

IMO, this function should clean up lu_map_immutable instead of the
callers, to make sure the cleanup always happens. You can also try
exploring the XA_MARK_* APIs to see if they can simplify the cleanup.

> +int iommufd_liveupdate_unregister_lufs(void)
> +{
> +	WARN_ON(iommu_liveupdate_unregister_flb(&iommufd_lu_handler));
> +
> +	return liveupdate_unregister_file_handler(&iommufd_lu_handler);

This seems a little inconsistent: iommu_liveupdate_unregister_flb() has
a WARN_ON and liveupdate_unregister_file_handler() does not.

Also, refer https://lore.kernel.org/all/20260226160353.6f3371bc@shazbot.org/

> diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
> @@ -775,11 +775,21 @@ static int __init iommufd_init(void)
>  		if (ret)
>  			goto err_misc;
>  	}
> +
> +	if (IS_ENABLED(CONFIG_IOMMU_LIVEUPDATE)) {
> +		ret = iommufd_liveupdate_register_lufs();
> +		if (ret)
> +			goto err_vfio_misc;
> +	}
> +

Do we need IS_ENABLED here? iommufd_private.h provides definitions for
both values of CONFIG_IOMMU_LIVEUPDATE.

Nit: What is lufs? Should we just rename to iommufd_liveupdate_register()?

>  	ret = iommufd_test_init();
>  	if (ret)
> -		goto err_vfio_misc;
> +		goto err_lufs;
>  	return 0;
>  
> +err_lufs:
> +	if (IS_ENABLED(CONFIG_IOMMU_LIVEUPDATE))
> +		iommufd_liveupdate_unregister_lufs();
>  err_vfio_misc:
>  	if (IS_ENABLED(CONFIG_IOMMUFD_VFIO_CONTAINER))
>  		misc_deregister(&vfio_misc_dev);
> @@ -791,6 +801,8 @@ static int __init iommufd_init(void)
>  static void __exit iommufd_exit(void)
>  {
>  	iommufd_test_exit();
> +	if (IS_ENABLED(CONFIG_IOMMU_LIVEUPDATE))
> +		iommufd_liveupdate_unregister_lufs();

Same as above.

> diff --git a/drivers/iommu/iommufd/pages.c b/drivers/iommu/iommufd/pages.c
> @@ -1427,6 +1429,12 @@ struct iopt_pages *iopt_alloc_file_pages(struct file *file,
>  	pages->file = get_file(file);
>  	pages->start = start - start_byte;
>  	pages->type = IOPT_ADDRESS_FILE;
> +
> +	pages->seals = 0;

This is already 0.

> +	seals = memfd_get_seals(file);
> +	if (seals > 0)
> +		pages->seals = seals;
> +
>  	return pages;
>  }
>  
> --- /dev/null
> +++ b/include/linux/kho/abi/iommufd.h
> +
> +struct iommufd_lu {
> +	unsigned int nr_hwpts;

Should this be u32 or u64?

> +	struct iommufd_hwpt_lu hwpts[];
> +};
> +

Should this also be __packed?

> +#endif /* _LINUX_KHO_ABI_IOMMUFD_H */
> -- 
> 2.53.0.rc2.204.g2597b5adb4-goog
> 


* Re: [PATCH 12/14] iommufd: Add APIs to preserve/unpreserve a vfio cdev
  2026-02-03 22:09 ` [PATCH 12/14] iommufd: Add APIs to preserve/unpreserve a vfio cdev Samiullah Khawaja
@ 2026-03-23 20:59   ` Vipin Sharma
  2026-03-23 21:38     ` Samiullah Khawaja
  2026-03-25 20:24   ` Pranjal Shrivastava
  1 sibling, 1 reply; 98+ messages in thread
From: Vipin Sharma @ 2026-03-23 20:59 UTC (permalink / raw)
  To: Samiullah Khawaja
  Cc: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe, Robin Murphy, Kevin Tian, Alex Williamson,
	Shuah Khan, iommu, linux-kernel, kvm, Saeed Mahameed,
	Adithya Jayachandran, Parav Pandit, Leon Romanovsky, William Tu,
	Pratyush Yadav, Pasha Tatashin, David Matlack, Andrew Morton,
	Chris Li, Pranjal Shrivastava, YiFei Zhu

On Tue, Feb 03, 2026 at 10:09:46PM +0000, Samiullah Khawaja wrote:
> diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
> +#ifdef CONFIG_IOMMU_LIVEUPDATE
> +int iommufd_device_preserve(struct liveupdate_session *s,
> +			    struct iommufd_device *idev,
> +			    u64 *tokenp)
> +{
> +	struct iommufd_group *igroup = idev->igroup;
> +	struct iommufd_hwpt_paging *hwpt_paging;
> +	struct iommufd_hw_pagetable *hwpt;
> +	struct iommufd_attach *attach;
> +	int ret;
> +
> +	mutex_lock(&igroup->lock);
> +	attach = xa_load(&igroup->pasid_attach, IOMMU_NO_PASID);
> +	if (!attach) {
> +		ret = -ENOENT;
> +		goto out;
> +	}

Should VFIO in its can_preserve() add this check?

> +void iommufd_device_unpreserve(struct liveupdate_session *s,
> +			       struct iommufd_device *idev,
> +			       u64 token)

Where is token used in this function?



* Re: [PATCH 13/14] vfio/pci: Preserve the iommufd state of the vfio cdev
  2026-02-03 22:09 ` [PATCH 13/14] vfio/pci: Preserve the iommufd state of the " Samiullah Khawaja
  2026-02-17  4:18   ` Ankit Soni
@ 2026-03-23 21:17   ` Vipin Sharma
  2026-03-23 22:07     ` Samiullah Khawaja
  2026-03-25 20:55   ` Pranjal Shrivastava
  2 siblings, 1 reply; 98+ messages in thread
From: Vipin Sharma @ 2026-03-23 21:17 UTC (permalink / raw)
  To: Samiullah Khawaja
  Cc: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe, Robin Murphy, Kevin Tian, Alex Williamson,
	Shuah Khan, iommu, linux-kernel, kvm, Saeed Mahameed,
	Adithya Jayachandran, Parav Pandit, Leon Romanovsky, William Tu,
	Pratyush Yadav, Pasha Tatashin, David Matlack, Andrew Morton,
	Chris Li, Pranjal Shrivastava, YiFei Zhu

On Tue, Feb 03, 2026 at 10:09:47PM +0000, Samiullah Khawaja wrote:
> If the vfio cdev is attached to an iommufd, preserve the state of the
> attached iommufd also. Basically preserve the iommu state of the device
> and also the attached domain. The token returned by the preservation API
> will be used to restore/rebind to the iommufd state after liveupdate.

Let's add the token when it is used in the restore/rebind patches in the
future.

> diff --git a/drivers/vfio/pci/vfio_pci_liveupdate.c b/drivers/vfio/pci/vfio_pci_liveupdate.c
> @@ -49,15 +51,32 @@ static int vfio_pci_liveupdate_preserve(struct liveupdate_file_op_args *args)
>  	if (vfio_pci_is_intel_display(pdev))
>  		return -EINVAL;
>  
> +#if CONFIG_IOMMU_LIVEUPDATE
> +	/* If iommufd is attached, preserve the underlying domain */
> +	if (device->iommufd_attached) {
> +		int err = iommufd_device_preserve(args->session,
> +						  device->iommufd_device,
> +						  &token);
> +		if (err < 0)
> +			return err;
> +	}
> +#endif
> +
>  	ser = kho_alloc_preserve(sizeof(*ser));
> -	if (IS_ERR(ser))
> +	if (IS_ERR(ser)) {
> +		if (device->iommufd_attached)
> +			iommufd_device_unpreserve(args->session,
> +						  device->iommufd_device, token);
> +
>  		return PTR_ERR(ser);
> +	}

driver/vfio/pci/iommufd.c has all of the code which interacts with
iommufd in VFIO, I think we should follow the convention and add a
function there which can be called from here.

> diff --git a/include/linux/kho/abi/vfio_pci.h b/include/linux/kho/abi/vfio_pci.h
> +/**
> + * struct vfio_iommufd_ser - Serialized state of the attached iommufd.
> + *
> + * @token: The token of the bound iommufd state.
> + */
> +struct vfio_iommufd_ser {
> +	u32 token;

This is u32, whereas the token above is u64.




* Re: [PATCH 11/14] iommufd-lu: Persist iommu hardware pagetables for live update
  2026-03-23 20:28   ` Vipin Sharma
@ 2026-03-23 21:34     ` Samiullah Khawaja
  0 siblings, 0 replies; 98+ messages in thread
From: Samiullah Khawaja @ 2026-03-23 21:34 UTC (permalink / raw)
  To: Vipin Sharma
  Cc: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe, YiFei Zhu, Robin Murphy, Kevin Tian,
	Alex Williamson, Shuah Khan, iommu, linux-kernel, kvm,
	Saeed Mahameed, Adithya Jayachandran, Parav Pandit,
	Leon Romanovsky, William Tu, Pratyush Yadav, Pasha Tatashin,
	David Matlack, Andrew Morton, Chris Li, Pranjal Shrivastava

On Mon, Mar 23, 2026 at 01:28:13PM -0700, Vipin Sharma wrote:
>On Tue, Feb 03, 2026 at 10:09:45PM +0000, Samiullah Khawaja wrote:
>> From: YiFei Zhu <zhuyifei@google.com>
>>
>> The caller is expected to mark each HWPT to be preserved with an ioctl
>> call, with a token that will be used in restore. At preserve time, each
>> HWPT's domain is then called with iommu_domain_preserve to preserve the
>
>Nit: Add () after iommu_domain_preserve, it makes easier to know this is
>a function.

Will do.
>
>> diff --git a/drivers/iommu/iommufd/liveupdate.c b/drivers/iommu/iommufd/liveupdate.c
>> +static int iommufd_save_hwpts(struct iommufd_ctx *ictx,
>> +			      struct iommufd_lu *iommufd_lu,
>> +			      struct liveupdate_session *session)
>> +{
>> +	struct iommufd_hwpt_paging *hwpt, **hwpts = NULL;
>
>We should rename hwpts to something to denote that it is a list.

Agreed. Will do.
>
>> +	struct iommu_domain_ser *domain_ser;
>> +	struct iommufd_hwpt_lu *hwpt_lu;
>> +	struct iommufd_object *obj;
>> +	unsigned int nr_hwpts = 0;
>> +	unsigned long index;
>> +	unsigned int i;
>> +	int rc = 0;
>> +
>> +	if (iommufd_lu) {
>> +		hwpts = kcalloc(iommufd_lu->nr_hwpts, sizeof(*hwpts),
>> +				GFP_KERNEL);
>> +		if (!hwpts)
>> +			return -ENOMEM;
>> +	}
>> +
>> +	xa_lock(&ictx->objects);
>> +	xa_for_each(&ictx->objects, index, obj) {
>> +		if (obj->type != IOMMUFD_OBJ_HWPT_PAGING)
>> +			continue;
>> +
>> +		hwpt = container_of(obj, struct iommufd_hwpt_paging, common.obj);
>> +		if (!hwpt->lu_preserve)
>> +			continue;
>> +
>> +		if (hwpt->ioas) {
>> +			/*
>> +			 * Obtain exclusive access to the IOAS and IOPT while we
>> +			 * set immutability
>> +			 */
>> +			mutex_lock(&hwpt->ioas->mutex);
>> +			down_write(&hwpt->ioas->iopt.domains_rwsem);
>> +			down_write(&hwpt->ioas->iopt.iova_rwsem);
>> +
>> +			hwpt->ioas->iopt.lu_map_immutable = true;
>
>It feels odd that this is not cleaned up in this function as a part of
>its error handling. Now it becomes caller resposibility to handle clean
>up for the side effect created by this function.

The immutability is set early during preservation and it might be
required to unset it in the caller as well when an error occurs. I have
to rework this to not acquire a mutex inside a spin lock, so I think
this part can also be reworked by doing it in the following steps,

- Alloc memory for preservation.
- Mark immutable.
- Preserve using IOMMU core APIs.
>
>IMO, this function should clean up lu_map_immutable instread of callers
>to make sure they call cleanups. You can also try exploring XA_MARK_*
>APIs if that can help in cleaning up easily.

Agreed. 
>
>> +int iommufd_liveupdate_unregister_lufs(void)
>> +{
>> +	WARN_ON(iommu_liveupdate_unregister_flb(&iommufd_lu_handler));
>> +
>> +	return liveupdate_unregister_file_handler(&iommufd_lu_handler);
>
>This seems little insconsistent, iommu_liveupdate_unregister_flb() has
>WARN_ON and liveupdate_unregister_file_handler() does not.

The error from unregister_file_handler() is propagated to the caller,
but the error from unregister_flb() is not, since we still need to call
unregister_file_handler() afterwards.
>
>Also, refer https://lore.kernel.org/all/20260226160353.6f3371bc@shazbot.org/

Yes. This will need updates according to the new changes in LUO, where
the unregister functions have a void return type.
>
>> diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
>> @@ -775,11 +775,21 @@ static int __init iommufd_init(void)
>>  		if (ret)
>>  			goto err_misc;
>>  	}
>> +
>> +	if (IS_ENABLED(CONFIG_IOMMU_LIVEUPDATE)) {
>> +		ret = iommufd_liveupdate_register_lufs();
>> +		if (ret)
>> +			goto err_vfio_misc;
>> +	}
>> +
>
>Do we need IS_ENABLED here? iommufd_private.h is providing definition
>for both values of CONFIG_IOMMU_LIVEUPDATE.
>
>Nit: What is lufs? Should we just rename to iommufd_liveupdate_register()?

Will update both.
>
>>  	ret = iommufd_test_init();
>>  	if (ret)
>> -		goto err_vfio_misc;
>> +		goto err_lufs;
>>  	return 0;
>>
>> +err_lufs:
>> +	if (IS_ENABLED(CONFIG_IOMMU_LIVEUPDATE))
>> +		iommufd_liveupdate_unregister_lufs();
>>  err_vfio_misc:
>>  	if (IS_ENABLED(CONFIG_IOMMUFD_VFIO_CONTAINER))
>>  		misc_deregister(&vfio_misc_dev);
>> @@ -791,6 +801,8 @@ static int __init iommufd_init(void)
>>  static void __exit iommufd_exit(void)
>>  {
>>  	iommufd_test_exit();
>> +	if (IS_ENABLED(CONFIG_IOMMU_LIVEUPDATE))
>> +		iommufd_liveupdate_unregister_lufs();
>
>Same as above.

Will update.
>
>> diff --git a/drivers/iommu/iommufd/pages.c b/drivers/iommu/iommufd/pages.c
>> @@ -1427,6 +1429,12 @@ struct iopt_pages *iopt_alloc_file_pages(struct file *file,
>>  	pages->file = get_file(file);
>>  	pages->start = start - start_byte;
>>  	pages->type = IOPT_ADDRESS_FILE;
>> +
>> +	pages->seals = 0;
>
>This is already 0.

Will update.
>
>> +	seals = memfd_get_seals(file);
>> +	if (seals > 0)
>> +		pages->seals = seals;
>> +
>>  	return pages;
>>  }
>>
>> --- /dev/null
>> +++ b/include/linux/kho/abi/iommufd.h
>> +
>> +struct iommufd_lu {
>> +	unsigned int nr_hwpts;
>
>Should this be u32 or u64?
>
>> +	struct iommufd_hwpt_lu hwpts[];
>> +};
>> +
>
>Should this also be __packed?

Will do both u32 and packed.
>
>> +#endif /* _LINUX_KHO_ABI_IOMMUFD_H */
>> --
>> 2.53.0.rc2.204.g2597b5adb4-goog
>>


* Re: [PATCH 12/14] iommufd: Add APIs to preserve/unpreserve a vfio cdev
  2026-03-23 20:59   ` Vipin Sharma
@ 2026-03-23 21:38     ` Samiullah Khawaja
  0 siblings, 0 replies; 98+ messages in thread
From: Samiullah Khawaja @ 2026-03-23 21:38 UTC (permalink / raw)
  To: Vipin Sharma
  Cc: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe, Robin Murphy, Kevin Tian, Alex Williamson,
	Shuah Khan, iommu, linux-kernel, kvm, Saeed Mahameed,
	Adithya Jayachandran, Parav Pandit, Leon Romanovsky, William Tu,
	Pratyush Yadav, Pasha Tatashin, David Matlack, Andrew Morton,
	Chris Li, Pranjal Shrivastava, YiFei Zhu

On Mon, Mar 23, 2026 at 01:59:13PM -0700, Vipin Sharma wrote:
>On Tue, Feb 03, 2026 at 10:09:46PM +0000, Samiullah Khawaja wrote:
>> diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
>> +#ifdef CONFIG_IOMMU_LIVEUPDATE
>> +int iommufd_device_preserve(struct liveupdate_session *s,
>> +			    struct iommufd_device *idev,
>> +			    u64 *tokenp)
>> +{
>> +	struct iommufd_group *igroup = idev->igroup;
>> +	struct iommufd_hwpt_paging *hwpt_paging;
>> +	struct iommufd_hw_pagetable *hwpt;
>> +	struct iommufd_attach *attach;
>> +	int ret;
>> +
>> +	mutex_lock(&igroup->lock);
>> +	attach = xa_load(&igroup->pasid_attach, IOMMU_NO_PASID);
>> +	if (!attach) {
>> +		ret = -ENOENT;
>> +		goto out;
>> +	}
>
>Should VFIO in its can_preserve() add this check?

Good point. We can move this check over there.
>
>> +void iommufd_device_unpreserve(struct liveupdate_session *s,
>> +			       struct iommufd_device *idev,
>> +			       u64 token)
>
>Where is token used in this function?

Will remove.
>


* Re: [PATCH 13/14] vfio/pci: Preserve the iommufd state of the vfio cdev
  2026-03-23 21:17   ` Vipin Sharma
@ 2026-03-23 22:07     ` Samiullah Khawaja
  2026-03-24 20:30       ` Vipin Sharma
  0 siblings, 1 reply; 98+ messages in thread
From: Samiullah Khawaja @ 2026-03-23 22:07 UTC (permalink / raw)
  To: Vipin Sharma
  Cc: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe, Robin Murphy, Kevin Tian, Alex Williamson,
	Shuah Khan, iommu, linux-kernel, kvm, Saeed Mahameed,
	Adithya Jayachandran, Parav Pandit, Leon Romanovsky, William Tu,
	Pratyush Yadav, Pasha Tatashin, David Matlack, Andrew Morton,
	Chris Li, Pranjal Shrivastava, YiFei Zhu

On Mon, Mar 23, 2026 at 02:17:29PM -0700, Vipin Sharma wrote:
>On Tue, Feb 03, 2026 at 10:09:47PM +0000, Samiullah Khawaja wrote:
>> If the vfio cdev is attached to an iommufd, preserve the state of the
>> attached iommufd also. Basically preserve the iommu state of the device
>> and also the attached domain. The token returned by the preservation API
>> will be used to restore/rebind to the iommufd state after liveupdate.
>
>Lets add token when it is used in restore/rebind patches in future.

Agreed.
>
>> diff --git a/drivers/vfio/pci/vfio_pci_liveupdate.c b/drivers/vfio/pci/vfio_pci_liveupdate.c
>> @@ -49,15 +51,32 @@ static int vfio_pci_liveupdate_preserve(struct liveupdate_file_op_args *args)
>>  	if (vfio_pci_is_intel_display(pdev))
>>  		return -EINVAL;
>>
>> +#if CONFIG_IOMMU_LIVEUPDATE
>> +	/* If iommufd is attached, preserve the underlying domain */
>> +	if (device->iommufd_attached) {
>> +		int err = iommufd_device_preserve(args->session,
>> +						  device->iommufd_device,
>> +						  &token);
>> +		if (err < 0)
>> +			return err;
>> +	}
>> +#endif
>> +
>>  	ser = kho_alloc_preserve(sizeof(*ser));
>> -	if (IS_ERR(ser))
>> +	if (IS_ERR(ser)) {
>> +		if (device->iommufd_attached)
>> +			iommufd_device_unpreserve(args->session,
>> +						  device->iommufd_device, token);
>> +
>>  		return PTR_ERR(ser);
>> +	}
>
>driver/vfio/pci/iommufd.c has all of the code which interacts with
>iommufd in VFIO, I think we should follow the convention and add a
>function there which can be called from here.

I am assuming you meant drivers/vfio/iommufd.c, as
drivers/vfio/pci/iommufd.c doesn't exist.

I see iommufd_ctx and other iommufd_* functions being used directly in
various places outside drivers/vfio/iommufd.c, so there is not a hard
split. I think adding a wrapper there would introduce unnecessary
indirection.
>
>> diff --git a/include/linux/kho/abi/vfio_pci.h b/include/linux/kho/abi/vfio_pci.h
>> +/**
>> + * struct vfio_iommufd_ser - Serialized state of the attached iommufd.
>> + *
>> + * @token: The token of the bound iommufd state.
>> + */
>> +struct vfio_iommufd_ser {
>> +	u32 token;
>
>This is u32, whereas the token above is u64.

Will update.
>
>

Thanks,
Sami


* Re: [PATCH 14/14] iommufd/selftest: Add test to verify iommufd preservation
  2026-02-03 22:09 ` [PATCH 14/14] iommufd/selftest: Add test to verify iommufd preservation Samiullah Khawaja
@ 2026-03-23 22:18   ` Vipin Sharma
  2026-03-27 18:32     ` Samiullah Khawaja
  2026-03-25 21:05   ` Pranjal Shrivastava
  1 sibling, 1 reply; 98+ messages in thread
From: Vipin Sharma @ 2026-03-23 22:18 UTC (permalink / raw)
  To: Samiullah Khawaja
  Cc: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe, Robin Murphy, Kevin Tian, Alex Williamson,
	Shuah Khan, iommu, linux-kernel, kvm, Saeed Mahameed,
	Adithya Jayachandran, Parav Pandit, Leon Romanovsky, William Tu,
	Pratyush Yadav, Pasha Tatashin, David Matlack, Andrew Morton,
	Chris Li, Pranjal Shrivastava, YiFei Zhu

On Tue, Feb 03, 2026 at 10:09:48PM +0000, Samiullah Khawaja wrote:
> Test iommufd preservation by setting up an iommufd and vfio cdev and
> preserve it across live update. Test takes VFIO cdev path of a device
> bound to vfio-pci driver and binds it to an iommufd being preserved. It
> also preserves the vfio cdev so the iommufd state associated with it is
> also preserved.
> 
> The restore path is tested by restoring the preserved vfio cdev only.
> Test tries to finish the session without restoring iommufd and confirms
> that it fails.

Maybe also add more details: the VFIO bind will also fail, returning
EPERM due to ...

> diff --git a/tools/testing/selftests/iommu/Makefile b/tools/testing/selftests/iommu/Makefile
> +$(TEST_GEN_PROGS_EXTENDED): %: %.o $(LIBLIVEUPDATE_O)
> +	$(CC) $(CFLAGS) $(CPPFLAGS) $(LDFLAGS) $(TARGET_ARCH) $< $(LIBLIVEUPDATE_O) $(LDLIBS) -static -o $@

Why static? The user can pass a static flag through EXTRA_CFLAGS.

> diff --git a/tools/testing/selftests/iommu/iommufd_liveupdate.c b/tools/testing/selftests/iommu/iommufd_liveupdate.c
> +
> +#define ksft_assert(condition) \
> +	do { if (!(condition)) \
> +	ksft_exit_fail_msg("Failed: %s at %s %d: %s\n", \
> +	#condition, __FILE__, __LINE__, strerror(errno)); } while (0)

Please add indentation to the code.

> +
> +int setup_cdev(const char *vfio_cdev_path)

Nit: s/setup_cdev/open_cdev

Since you are using the open_iommufd() name below, this will make it
consistent, as there is no setup here apart from opening.

> +int setup_iommufd(int iommufd, int memfd, int cdev_fd, int hwpt_token)
> +	bind.iommufd = iommufd;
> +	ret = ioctl(cdev_fd, VFIO_DEVICE_BIND_IOMMUFD, &bind);
> +	ksft_assert(!ret);
> +
> +	ret = ioctl(iommufd, IOMMU_IOAS_ALLOC, &alloc_data);
> +	ksft_assert(!ret);
> +
> +	hwpt_alloc.dev_id = bind.out_devid;
> +	hwpt_alloc.pt_id = alloc_data.out_ioas_id;
> +	ret = ioctl(iommufd, IOMMU_HWPT_ALLOC, &hwpt_alloc);
> +	ksft_assert(!ret);
> +
> +	attach_data.pt_id = hwpt_alloc.out_hwpt_id;
> +	ret = ioctl(cdev_fd, VFIO_DEVICE_ATTACH_IOMMUFD_PT, &attach_data);
> +	ksft_assert(!ret);
> +
> +	map_file.ioas_id = alloc_data.out_ioas_id;
> +	ret = ioctl(iommufd, IOMMU_IOAS_MAP_FILE, &map_file);
> +	ksft_assert(!ret);
> +
> +	set_preserve.hwpt_id = attach_data.pt_id;
> +	ret = ioctl(iommufd, IOMMU_HWPT_LU_SET_PRESERVE, &set_preserve);
> +	ksft_assert(!ret);
> +
> +	return ret;
> +}

iommufd_utils.h has functions defined for the iommufd ioctls. I think we
should use those functions here, and if ones related to iommufd are
missing we should add them there.

> +int main(int argc, char *argv[])
> +{
> +	int iommufd, cdev_fd, memfd, luo, session, ret;
> +	const int token = 0x123456;
> +	const int cdev_token = 0x654321;
> +	const int hwpt_token = 0x789012;
> +	const int memfd_token = 0x890123;
> +
> +	if (argc < 2) {
> +		printf("Usage: ./iommufd_liveupdate <vfio_cdev_path>\n");
> +		return 1;
> +	}
> +
> +	luo = luo_open_device();
> +	ksft_assert(luo > 0);
> +
> +	session = luo_retrieve_session(luo, "iommufd-test");
> +	if (session == -ENOENT) {
> +		session = luo_create_session(luo, "iommufd-test");
> +
> +		iommufd = open_iommufd();
> +		memfd = create_sealed_memfd(SZ_1M);
> +		cdev_fd = setup_cdev(argv[1]);
> +
> +		ret = setup_iommufd(iommufd, memfd, cdev_fd, hwpt_token);
> +		ksft_assert(!ret);
> +
> +		/* Cannot preserve cdev without iommufd */
> +		ret = luo_session_preserve_fd(session, cdev_fd, cdev_token);
> +		ksft_assert(ret);
> +
> +		/* Cannot preserve iommufd without preserving memfd. */
> +		ret = luo_session_preserve_fd(session, iommufd, token);
> +		ksft_assert(ret);
> +
> +		ret = luo_session_preserve_fd(session, memfd, memfd_token);
> +		ksft_assert(!ret);
> +
> +		ret = luo_session_preserve_fd(session, iommufd, token);
> +		ksft_assert(!ret);
> +
> +		ret = luo_session_preserve_fd(session, cdev_fd, cdev_token);
> +		ksft_assert(!ret);
> +

All of these ksft_assert calls are hurting my eyes :) I like the
approach in VFIO where the library APIs do the validation and the test
code only checks the actual things needed.

Should we at least create a common function to combine
luo_session_preserve() and ksft_assert()?



* Re: [PATCH 01/14] iommu: Implement IOMMU LU FLB callbacks
  2026-03-17  1:06     ` Samiullah Khawaja
@ 2026-03-23 23:27       ` Vipin Sharma
  0 siblings, 0 replies; 98+ messages in thread
From: Vipin Sharma @ 2026-03-23 23:27 UTC (permalink / raw)
  To: Samiullah Khawaja
  Cc: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe, Robin Murphy, Kevin Tian, Alex Williamson,
	Shuah Khan, iommu, linux-kernel, kvm, Saeed Mahameed,
	Adithya Jayachandran, Parav Pandit, Leon Romanovsky, William Tu,
	Pratyush Yadav, Pasha Tatashin, David Matlack, Andrew Morton,
	Chris Li, Pranjal Shrivastava, YiFei Zhu

On Tue, Mar 17, 2026 at 01:06:34AM +0000, Samiullah Khawaja wrote:
> On Mon, Mar 16, 2026 at 03:54:50PM -0700, Vipin Sharma wrote:
> > On Tue, Feb 03, 2026 at 10:09:35PM +0000, Samiullah Khawaja wrote:
> > > +config IOMMU_LIVEUPDATE
> > > +	bool "IOMMU live update state preservation support"
> > > +	depends on LIVEUPDATE && IOMMUFD
> > > +	help
> > > +	  Enable support for preserving IOMMU state across a kexec live update.
> > > +
> > > +	  This allows devices managed by iommufd to maintain their DMA mappings
> > > +	  during kexec base kernel update.
> > > +
> > > +	  If unsure, say N.
> > > +
> > 
> > Do we need a separate config? Can't we just use CONFIG_LIVEUPDATE?
> 
> We have a separate CONFIG here so that the phase 1/2 split for iommu
> preservation doesn't break the vfio preservation. See following
> discussion in the RFCv2:
> 
> https://lore.kernel.org/all/aYEpHBYxlQxhXrwl@google.com/

Sounds good. 

> > > +static void iommu_liveupdate_free_objs(u64 next, bool incoming)
> > > +{
> > > +	struct iommu_objs_ser *objs;
> > > +
> > > +	while (next) {
> > > +		objs = __va(next);
> > > +		next = objs->next_objs;
> > > +
> > > +		if (!incoming)
> > > +			kho_unpreserve_free(objs);
> > > +		else
> > > +			folio_put(virt_to_folio(objs));
> > > +	}
> > > +}
> > 
> > Instead of passing boolean, and calling with different arguments, I
> > think it will be simpler to just have two functions
> > 
> > - iommu_liveupdate_unpreserve()
> > - iommu_liveupdate_folio_put()
> 
> This is a helper function to free the serialized state without
> duplicating multiple checks for various type of state (iommu,
> iommu_domain and devices).
> 
> Do you think maybe I should add these two functions and make it call the
> helper?

Read the next response.

> > 
> > > +
> > > +static void iommu_liveupdate_flb_free(struct iommu_lu_flb_obj *obj)
> > > +{
> > > +	if (obj->iommu_domains)
> > > +		iommu_liveupdate_free_objs(obj->ser->iommu_domains_phys, false);
> > > +
> > > +	if (obj->devices)
> > > +		iommu_liveupdate_free_objs(obj->ser->devices_phys, false);
> > > +
> > > +	if (obj->iommus)
> > > +		iommu_liveupdate_free_objs(obj->ser->iommus_phys, false);
> > > +
> > > +	kho_unpreserve_free(obj->ser);
> > > +	kfree(obj);
> > > +}
> > > +
> > > +static int iommu_liveupdate_flb_preserve(struct liveupdate_flb_op_args *argp)
> > > +{
> > > +	struct iommu_lu_flb_obj *obj;
> > > +	struct iommu_lu_flb_ser *ser;
> > > +	void *mem;
> > > +
> > > +	obj = kzalloc(sizeof(*obj), GFP_KERNEL);
> > > +	if (!obj)
> > > +		return -ENOMEM;
> > > +
> > > +	mutex_init(&obj->lock);
> > > +	mem = kho_alloc_preserve(sizeof(*ser));
> > > +	if (IS_ERR(mem))
> > > +		goto err_free;
> > > +
> > > +	ser = mem;
> > > +	obj->ser = ser;
> > > +
> > > +	mem = kho_alloc_preserve(PAGE_SIZE);
> > > +	if (IS_ERR(mem))
> > > +		goto err_free;
> > > +
> > > +	obj->iommu_domains = mem;
> > > +	ser->iommu_domains_phys = virt_to_phys(obj->iommu_domains);
> > > +
> > > +	mem = kho_alloc_preserve(PAGE_SIZE);
> > > +	if (IS_ERR(mem))
> > > +		goto err_free;
> > > +
> > > +	obj->devices = mem;
> > > +	ser->devices_phys = virt_to_phys(obj->devices);
> > > +
> > > +	mem = kho_alloc_preserve(PAGE_SIZE);
> > > +	if (IS_ERR(mem))
> > > +		goto err_free;
> > > +
> > > +	obj->iommus = mem;
> > > +	ser->iommus_phys = virt_to_phys(obj->iommus);
> > > +
> > > +	argp->obj = obj;
> > > +	argp->data = virt_to_phys(ser);
> > > +	return 0;
> > > +
> > > +err_free:
> > > +	iommu_liveupdate_flb_free(obj);
> > 
> > Generally, I have seen that in a function, each goto jumps to the
> > corresponding error tag, freeing the corresponding allocation and all
> > the ones that happened before. It is easier to read code that way. I
> > know you are
> > combining the free call from iommu_liveupdate_flb_unpreserve() also.
> > IMHO, code readability will be better this way.
> 
> I had that originally when I was writing this function, but it gets
> really cluttered :(. Instead it is cleaner, without code duplication, to
> use this one cleanup function here to free the state on error and also
> when doing unpreserve. Please consider this a "destroy" function for
> obj; it can be called from two places:
> 
> - Error during allocation of internal state.
> - During unpreserve.

It is removing code duplication in

 - iommu_liveupdate_flb_preserve()
 - iommu_liveupdate_flb_unpreserve()

However, there is still duplicate code in iommu_liveupdate_flb_finish().
Another thing is that iommu_liveupdate_free_objs() is doing two
different things based on the current liveupdate state (before or after
kexec), selected by a bool argument. IMO, it is cleaner if we explicitly
write whether we are doing an unpreserve or just a folio put.

I meant something like:

static void iommu_liveupdate_unpreserve_free(u64 next)
{
       while (next) {
               struct iommu_objs_ser *objs = __va(next);
               next = objs->next_objs;
               kho_unpreserve_free(objs);
       }
}

static void iommu_liveupdate_folio_put(u64 next)
{
       while (next) {
               struct iommu_objs_ser *objs = __va(next);
               next = objs->next_objs;
               folio_put(virt_to_folio(objs));
       }
}

static int iommu_liveupdate_flb_preserve(struct liveupdate_flb_op_args *argp)
{

...

err_free_devices:
	iommu_liveupdate_unpreserve_free(obj->ser->devices_phys);
err_free_iommu_domains:
	iommu_liveupdate_unpreserve_free(obj->ser->iommu_domains_phys);
err_free_ser:
	kho_unpreserve_free(obj->ser);
err_free_obj:
	kfree(obj);
	return PTR_ERR(mem);
}

static void iommu_liveupdate_flb_unpreserve(struct liveupdate_flb_op_args *argp)
{
	struct iommu_lu_flb_obj *obj = argp->obj;

	iommu_liveupdate_unpreserve_free(obj->ser->iommus_phys);
	iommu_liveupdate_unpreserve_free(obj->ser->devices_phys);
	iommu_liveupdate_unpreserve_free(obj->ser->iommu_domains_phys);
	kho_unpreserve_free(obj->ser);
	kfree(obj);
	argp->obj = NULL;
}

static void iommu_liveupdate_flb_finish(struct liveupdate_flb_op_args *argp)
{
	struct iommu_lu_flb_obj *obj = argp->obj;

	iommu_liveupdate_folio_put(obj->ser->iommus_phys);
	iommu_liveupdate_folio_put(obj->ser->devices_phys);
	iommu_liveupdate_folio_put(obj->ser->iommu_domains_phys);
	folio_put(virt_to_folio(obj->ser));
	kfree(obj);
	argp->obj = NULL;
}

This way the code is explicit and clear about what is happening. Let me
know if you meant something else by cluttered code.
> > 
> > > +	return PTR_ERR(mem);
> > > +}
> > > +
> > > +static void iommu_liveupdate_flb_unpreserve(struct liveupdate_flb_op_args *argp)
> > > +{
> > > +	iommu_liveupdate_flb_free(argp->obj);
> > > +}
> > > +
> > > +static void iommu_liveupdate_flb_finish(struct liveupdate_flb_op_args *argp)
> > > +{
> > > +	struct iommu_lu_flb_obj *obj = argp->obj;
> > > +
> > > +	if (obj->iommu_domains)
> > > +		iommu_liveupdate_free_objs(obj->ser->iommu_domains_phys, true);
> > 
> > Can there be the case where obj->iommu_domains is NULL but
> > obj->ser->iommu_domains_phys is not? If that is not possible, I will
> > just simplify the patch and unconditionally call
> > iommu_liveupdate_free_objs()?
> 
> Are you suggesting that on flb_finish() the obj->iommu_domains should be
> non-NULL as flb_retrieve() succeeded? If yes, then that is correct. I
> will update this to call the free_objs() without checking
> obj->iommu_domains. I will do same for other types.

Yes. 


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 02/14] iommu: Implement IOMMU core liveupdate skeleton
  2026-03-17 20:33     ` Samiullah Khawaja
@ 2026-03-24 19:06       ` Vipin Sharma
  2026-03-24 19:45         ` Samiullah Khawaja
  0 siblings, 1 reply; 98+ messages in thread
From: Vipin Sharma @ 2026-03-24 19:06 UTC (permalink / raw)
  To: Samiullah Khawaja
  Cc: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe, Robin Murphy, Kevin Tian, Alex Williamson,
	Shuah Khan, iommu, linux-kernel, kvm, Saeed Mahameed,
	Adithya Jayachandran, Parav Pandit, Leon Romanovsky, William Tu,
	Pratyush Yadav, Pasha Tatashin, David Matlack, Andrew Morton,
	Chris Li, Pranjal Shrivastava, YiFei Zhu

On Tue, Mar 17, 2026 at 08:33:39PM +0000, Samiullah Khawaja wrote:
> On Tue, Mar 17, 2026 at 12:58:27PM -0700, Vipin Sharma wrote:
> > On Tue, Feb 03, 2026 at 10:09:36PM +0000, Samiullah Khawaja wrote:
> > > diff --git a/drivers/iommu/liveupdate.c b/drivers/iommu/liveupdate.c
> > > +int iommu_for_each_preserved_device(iommu_preserved_device_iter_fn fn,
> > > +				    void *arg)
> > > +{
> > > +	struct iommu_lu_flb_obj *obj;
> > > +	struct devices_ser *devices;
> > > +	int ret, i, idx;
> > > +
> > > +	ret = liveupdate_flb_get_incoming(&iommu_flb, (void **)&obj);
> > > +	if (ret)
> > > +		return -ENOENT;
> > > +
> > > +	devices = __va(obj->ser->devices_phys);
> > > +	for (i = 0, idx = 0; i < obj->ser->nr_devices; ++i, ++idx) {
> > > +		if (idx >= MAX_DEVICE_SERS) {
> > > +			devices = __va(devices->objs.next_objs);
> > > +			idx = 0;
> > > +		}
> > > +
> > > +		if (devices->devices[idx].obj.deleted)
> > > +			continue;
> > > +
> > > +		ret = fn(&devices->devices[idx], arg);
> > > +		if (ret)
> > > +			return ret;
> > > +	}
> > > +
> > > +	return 0;
> > > +}
> > > +EXPORT_SYMBOL(iommu_for_each_preserved_device);
> > Also, should this function be introduced in the patch where it is
> > getting used? Other changes in this patch are already big and complex.
> > Same for iommu_get_device_preserved_data() and
> > iommu_get_preserved_data().
> 
> These are used by the drivers, but part of core. So need to be in
> this patch :(.

Sorry, I do not understand why it has to be in this patch. Can it be
its own patch?
> 
> Note that this patch is adding core skeleton only, focusing on helpers
> for the serialized state. This patch is not preserving any real state of
> iommu, domain or devices. For example, the domains are saved through
> generic page table in a separate patch, and the drivers preserve the
> state of devices and associated iommu in separate patches.
> 
> I will add this text in the commit message to clarify the purpose of
> this patch.
> > 
> > I think this patch can be split in three.
> > Patch 1: Preserve iommu_domain
> > Patch 2: Preserve pci device and iommu device
> > Patch 3: The helper functions I mentioned above.

I understand that this patch is adding some helper functions and not
doing any actual preservation. I am suggesting to split this helper
function patch into three for easier review based on the above suggestion.
If I am not wrong, this is the biggest patch in the series, with
approximately 500 lines changed.

> > > +static void iommu_unpreserve_locked(struct iommu_device *iommu)
> > > +{
> > > +	struct iommu_ser *iommu_ser = iommu->outgoing_preserved_state;
> > > +
> > > +	iommu_ser->obj.ref_count--;
> > 
> > Should there be a null check?
> 
> Hmm.. There is a dependency of unpreservation of iommus with devices, so
> this should never be NULL unless used independently.
> 
> But I think I will add it here to protect against that.

Okay. Since it is a static function, I am fine either way.

> > > +void iommu_unpreserve_device(struct iommu_domain *domain, struct device *dev)
> > > +{
> > > +	struct iommu_lu_flb_obj *flb_obj;
> > > +	struct device_ser *device_ser;
> > > +	struct dev_iommu *iommu;
> > > +	struct pci_dev *pdev;
> > > +	int ret;
> > > +
> > > +	if (!dev_is_pci(dev))
> > > +		return;
> > > +
> > > +	pdev = to_pci_dev(dev);
> > > +	iommu = dev->iommu;
> > > +	if (!iommu->iommu_dev->ops->unpreserve_device ||
> > > +	    !iommu->iommu_dev->ops->unpreserve)
> > > +		return;
> > > +
> > > +	ret = liveupdate_flb_get_outgoing(&iommu_flb, (void **)&flb_obj);
> > > +	if (WARN_ON(ret))
> > 
> > Why WARN_ON here and not other places? Do we need it?
> 
> Basically this means that the upper layer (iommufd/vfio) is asking to
> unpreserve a device, but there is no FLB found. This should not happen
> and should generate a warning.

Yeah, but other places like iommu_domain_[preserve|unpreserve](),
iommu_preserve_locked(), and iommu_preserve_device() are also using this
function. I am confused about why it is important in this function and
not the others. Those functions are also called by the upper layer.



* Re: [PATCH 02/14] iommu: Implement IOMMU core liveupdate skeleton
  2026-03-24 19:06       ` Vipin Sharma
@ 2026-03-24 19:45         ` Samiullah Khawaja
  0 siblings, 0 replies; 98+ messages in thread
From: Samiullah Khawaja @ 2026-03-24 19:45 UTC (permalink / raw)
  To: Vipin Sharma
  Cc: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe, Robin Murphy, Kevin Tian, Alex Williamson,
	Shuah Khan, iommu, linux-kernel, kvm, Saeed Mahameed,
	Adithya Jayachandran, Parav Pandit, Leon Romanovsky, William Tu,
	Pratyush Yadav, Pasha Tatashin, David Matlack, Andrew Morton,
	Chris Li, Pranjal Shrivastava, YiFei Zhu

On Tue, Mar 24, 2026 at 12:06:10PM -0700, Vipin Sharma wrote:
>On Tue, Mar 17, 2026 at 08:33:39PM +0000, Samiullah Khawaja wrote:
>> On Tue, Mar 17, 2026 at 12:58:27PM -0700, Vipin Sharma wrote:
>> > On Tue, Feb 03, 2026 at 10:09:36PM +0000, Samiullah Khawaja wrote:
>> > > diff --git a/drivers/iommu/liveupdate.c b/drivers/iommu/liveupdate.c
>> > > +int iommu_for_each_preserved_device(iommu_preserved_device_iter_fn fn,
>> > > +				    void *arg)
>> > > +{
>> > > +	struct iommu_lu_flb_obj *obj;
>> > > +	struct devices_ser *devices;
>> > > +	int ret, i, idx;
>> > > +
>> > > +	ret = liveupdate_flb_get_incoming(&iommu_flb, (void **)&obj);
>> > > +	if (ret)
>> > > +		return -ENOENT;
>> > > +
>> > > +	devices = __va(obj->ser->devices_phys);
>> > > +	for (i = 0, idx = 0; i < obj->ser->nr_devices; ++i, ++idx) {
>> > > +		if (idx >= MAX_DEVICE_SERS) {
>> > > +			devices = __va(devices->objs.next_objs);
>> > > +			idx = 0;
>> > > +		}
>> > > +
>> > > +		if (devices->devices[idx].obj.deleted)
>> > > +			continue;
>> > > +
>> > > +		ret = fn(&devices->devices[idx], arg);
>> > > +		if (ret)
>> > > +			return ret;
>> > > +	}
>> > > +
>> > > +	return 0;
>> > > +}
>> > > +EXPORT_SYMBOL(iommu_for_each_preserved_device);
>> > Also, should this function be introduced in the patch where it is
>> > getting used? Other changes in this patch are already big and complex.
>> > Same for iommu_get_device_preserved_data() and
>> > iommu_get_preserved_data().
>>
>> These are used by the drivers, but part of core. So need to be in
>> this patch :(.
>
>Sorry, I do not understand why it has to be in this patch. Can it be
>its own patch?

I will move it to a separate patch, as per the discussion in the other thread.

See:
https://lore.kernel.org/all/abrk39_M8k45myXJ@google.com/
>>
>> Note that this patch is adding core skeleton only, focusing on helpers
>> for the serialized state. This patch is not preserving any real state of
>> iommu, domain or devices. For example, the domains are saved through
>> generic page table in a separate patch, and the drivers preserve the
>> state of devices and associated iommu in separate patches.
>>
>> I will add this text in the commit message to clarify the purpose of
>> this patch.
>> >
>> > I think this patch can be split in three.
>> > Patch 1: Preserve iommu_domain
>> > Patch 2: Preserve pci device and iommu device
>> > Patch 3: The helper functions I mentioned above.
>
>I understand that this patch is adding some helper functions and not
>doing any actual preservation. I am suggesting to split this helper
>function patch into three for easier review based on the above suggestion.
>If I am not wrong, this is the biggest patch in the series, with
>approximately 500 lines changed.

After moving the getter helpers out to a separate patch as mentioned
above, the size of this patch should shrink significantly. I will split
the remainder into domain and device + iommu preservation helper patches
as you suggested.
>
>> > > +static void iommu_unpreserve_locked(struct iommu_device *iommu)
>> > > +{
>> > > +	struct iommu_ser *iommu_ser = iommu->outgoing_preserved_state;
>> > > +
>> > > +	iommu_ser->obj.ref_count--;
>> >
>> > Should there be a null check?
>>
>> Hmm.. There is a dependency of unpreservation of iommus with devices, so
>> this should never be NULL unless used independently.
>>
>> But I think I will add it here to protect against that.
>
>Okay. Since it is a static function, I am fine either way.
>
>> > > +void iommu_unpreserve_device(struct iommu_domain *domain, struct device *dev)
>> > > +{
>> > > +	struct iommu_lu_flb_obj *flb_obj;
>> > > +	struct device_ser *device_ser;
>> > > +	struct dev_iommu *iommu;
>> > > +	struct pci_dev *pdev;
>> > > +	int ret;
>> > > +
>> > > +	if (!dev_is_pci(dev))
>> > > +		return;
>> > > +
>> > > +	pdev = to_pci_dev(dev);
>> > > +	iommu = dev->iommu;
>> > > +	if (!iommu->iommu_dev->ops->unpreserve_device ||
>> > > +	    !iommu->iommu_dev->ops->unpreserve)
>> > > +		return;
>> > > +
>> > > +	ret = liveupdate_flb_get_outgoing(&iommu_flb, (void **)&flb_obj);
>> > > +	if (WARN_ON(ret))
>> >
>> > Why WARN_ON here and not other places? Do we need it?
>>
>> Basically this means that the upper layer (iommufd/vfio) is asking to
>> unpreserve a device, but there is no FLB found. This should not happen
>> and should generate a warning.
>
>Yeah, but other places like iommu_domain_[preserve|unpreserve](),
>iommu_preserve_locked(), and iommu_preserve_device() are also using this
>function. I am confused about why it is important in this function and
>not the others. Those functions are also called by the upper layer.

Ah yes, that makes sense. I will add it to the unpreserve variants of
the functions you listed, since during preservation the FLB obj should
have been created.
>


* Re: [PATCH 13/14] vfio/pci: Preserve the iommufd state of the vfio cdev
  2026-03-23 22:07     ` Samiullah Khawaja
@ 2026-03-24 20:30       ` Vipin Sharma
  0 siblings, 0 replies; 98+ messages in thread
From: Vipin Sharma @ 2026-03-24 20:30 UTC (permalink / raw)
  To: Samiullah Khawaja
  Cc: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe, Robin Murphy, Kevin Tian, Alex Williamson,
	Shuah Khan, iommu, linux-kernel, kvm, Saeed Mahameed,
	Adithya Jayachandran, Parav Pandit, Leon Romanovsky, William Tu,
	Pratyush Yadav, Pasha Tatashin, David Matlack, Andrew Morton,
	Chris Li, Pranjal Shrivastava, YiFei Zhu

On Mon, Mar 23, 2026 at 10:07:31PM +0000, Samiullah Khawaja wrote:
> On Mon, Mar 23, 2026 at 02:17:29PM -0700, Vipin Sharma wrote:
> > On Tue, Feb 03, 2026 at 10:09:47PM +0000, Samiullah Khawaja wrote:
> > > diff --git a/drivers/vfio/pci/vfio_pci_liveupdate.c b/drivers/vfio/pci/vfio_pci_liveupdate.c
> > > @@ -49,15 +51,32 @@ static int vfio_pci_liveupdate_preserve(struct liveupdate_file_op_args *args)
> > >  	if (vfio_pci_is_intel_display(pdev))
> > >  		return -EINVAL;
> > > 
> > > +#if CONFIG_IOMMU_LIVEUPDATE
> > > +	/* If iommufd is attached, preserve the underlying domain */
> > > +	if (device->iommufd_attached) {
> > > +		int err = iommufd_device_preserve(args->session,
> > > +						  device->iommufd_device,
> > > +						  &token);
> > > +		if (err < 0)
> > > +			return err;
> > > +	}
> > > +#endif
> > > +
> > >  	ser = kho_alloc_preserve(sizeof(*ser));
> > > -	if (IS_ERR(ser))
> > > +	if (IS_ERR(ser)) {
> > > +		if (device->iommufd_attached)
> > > +			iommufd_device_unpreserve(args->session,
> > > +						  device->iommufd_device, token);
> > > +
> > >  		return PTR_ERR(ser);
> > > +	}
> > 
> > driver/vfio/pci/iommufd.c has all of the code which interacts with
> > iommufd in VFIO, I think we should follow the convention and add a
> > function there which can be called from here.
> 
> I am assuming you meant drivers/vfio/iommufd.c, as
> drivers/vfio/pci/iommufd.c doesn't exist.
> 
> I see iommufd_ctx and other iommufd_ functions being used directly in
> various places outside drivers/vfio/iommufd.c, so there is not a hard
> split. I think this would introduce unnecessary indirection.

Yes, sorry for the wrong path. Sounds good, I misunderstood things.



* Re: [PATCH 10/14] iommufd-lu: Implement ioctl to let userspace mark an HWPT to be preserved
  2026-02-03 22:09 ` [PATCH 10/14] iommufd-lu: Implement ioctl to let userspace mark an HWPT to be preserved Samiullah Khawaja
  2026-03-19 23:35   ` Vipin Sharma
@ 2026-03-25 14:37   ` Pranjal Shrivastava
  2026-03-25 17:31     ` Samiullah Khawaja
  1 sibling, 1 reply; 98+ messages in thread
From: Pranjal Shrivastava @ 2026-03-25 14:37 UTC (permalink / raw)
  To: Samiullah Khawaja
  Cc: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe, YiFei Zhu, Robin Murphy, Kevin Tian,
	Alex Williamson, Shuah Khan, iommu, linux-kernel, kvm,
	Saeed Mahameed, Adithya Jayachandran, Parav Pandit,
	Leon Romanovsky, William Tu, Pratyush Yadav, Pasha Tatashin,
	David Matlack, Andrew Morton, Chris Li, Vipin Sharma

On Tue, Feb 03, 2026 at 10:09:44PM +0000, Samiullah Khawaja wrote:
> From: YiFei Zhu <zhuyifei@google.com>
> 
> Userspace provides a token, which will then be used at restore to
> identify this HWPT. The restoration logic is not implemented and will be
> added later.
> 
> Signed-off-by: YiFei Zhu <zhuyifei@google.com>
> Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
> ---
>  drivers/iommu/iommufd/Makefile          |  1 +
>  drivers/iommu/iommufd/iommufd_private.h | 13 +++++++
>  drivers/iommu/iommufd/liveupdate.c      | 49 +++++++++++++++++++++++++
>  drivers/iommu/iommufd/main.c            |  2 +
>  include/uapi/linux/iommufd.h            | 19 ++++++++++
>  5 files changed, 84 insertions(+)
>  create mode 100644 drivers/iommu/iommufd/liveupdate.c
> 
> diff --git a/drivers/iommu/iommufd/Makefile b/drivers/iommu/iommufd/Makefile
> index 71d692c9a8f4..c3bf0b6452d3 100644
> --- a/drivers/iommu/iommufd/Makefile
> +++ b/drivers/iommu/iommufd/Makefile
> @@ -17,3 +17,4 @@ obj-$(CONFIG_IOMMUFD_DRIVER) += iova_bitmap.o
>  
>  iommufd_driver-y := driver.o
>  obj-$(CONFIG_IOMMUFD_DRIVER_CORE) += iommufd_driver.o
> +obj-$(CONFIG_IOMMU_LIVEUPDATE) += liveupdate.o
> diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
> index eb6d1a70f673..6424e7cea5b2 100644
> --- a/drivers/iommu/iommufd/iommufd_private.h
> +++ b/drivers/iommu/iommufd/iommufd_private.h
> @@ -374,6 +374,10 @@ struct iommufd_hwpt_paging {
>  	bool auto_domain : 1;
>  	bool enforce_cache_coherency : 1;
>  	bool nest_parent : 1;
> +#ifdef CONFIG_IOMMU_LIVEUPDATE
> +	bool lu_preserve : 1;
> +	u32 lu_token;

Did we downsize the token? Shouldn't this be u64 as everywhere else?

> +#endif
>  	/* Head at iommufd_ioas::hwpt_list */
>  	struct list_head hwpt_item;
>  	struct iommufd_sw_msi_maps present_sw_msi;
> @@ -707,6 +711,15 @@ iommufd_get_vdevice(struct iommufd_ctx *ictx, u32 id)
>  			    struct iommufd_vdevice, obj);
>  }
>  
> +#ifdef CONFIG_IOMMU_LIVEUPDATE
> +int iommufd_hwpt_lu_set_preserve(struct iommufd_ucmd *ucmd);
> +#else
> +static inline int iommufd_hwpt_lu_set_preserve(struct iommufd_ucmd *ucmd)
> +{
> +	return -ENOTTY;
> +}
> +#endif
> +
>  #ifdef CONFIG_IOMMUFD_TEST
>  int iommufd_test(struct iommufd_ucmd *ucmd);
>  void iommufd_selftest_destroy(struct iommufd_object *obj);
> diff --git a/drivers/iommu/iommufd/liveupdate.c b/drivers/iommu/iommufd/liveupdate.c
> new file mode 100644
> index 000000000000..ae74f5b54735
> --- /dev/null
> +++ b/drivers/iommu/iommufd/liveupdate.c
> @@ -0,0 +1,49 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +
> +#define pr_fmt(fmt) "iommufd: " fmt
> +
> +#include <linux/file.h>
> +#include <linux/iommufd.h>
> +#include <linux/liveupdate.h>
> +
> +#include "iommufd_private.h"
> +
> +int iommufd_hwpt_lu_set_preserve(struct iommufd_ucmd *ucmd)
> +{
> +	struct iommu_hwpt_lu_set_preserve *cmd = ucmd->cmd;
> +	struct iommufd_hwpt_paging *hwpt_target, *hwpt;
> +	struct iommufd_ctx *ictx = ucmd->ictx;
> +	struct iommufd_object *obj;
> +	unsigned long index;
> +	int rc = 0;
> +
> +	hwpt_target = iommufd_get_hwpt_paging(ucmd, cmd->hwpt_id);
> +	if (IS_ERR(hwpt_target))
> +		return PTR_ERR(hwpt_target);
> +
> +	xa_lock(&ictx->objects);
> +	xa_for_each(&ictx->objects, index, obj) {
> +		if (obj->type != IOMMUFD_OBJ_HWPT_PAGING)
> +			continue;

Couldn't these be HWPT_NESTED? Are we explicitly skipping HWPT_NESTED
here? ARM SMMUv3 heavily relies on IOMMU_DOMAIN_NESTED to back vIOMMUs
and hold critical guest translation state. We'd need to support 
HWPT_NESTED for arm-smmu-v3.

> +
> +		hwpt = container_of(obj, struct iommufd_hwpt_paging, common.obj);
> +
> +		if (hwpt == hwpt_target)
> +			continue;
> +		if (!hwpt->lu_preserve)
> +			continue;
> +		if (hwpt->lu_token == cmd->hwpt_token) {
> +			rc = -EADDRINUSE;
> +			goto out;
> +		}

I see that this entire loop is to avoid collisions but could we improve
this? We are doing an O(N) linear search over the entire ictx->objects
xarray while holding xa_lock on every setup call.

If the kernel requires a strict 1:1 mapping of lu_token to hwpt, 
wouldn't it be much better to track these in a dedicated xarray?

Just thinking out loud, if we added a dedicated lu_tokens xarray to
iommufd_ctx, we could drop the linear search and the lock entirely,
letting the xarray handle the collision natively like this:

	rc = xa_insert(&ictx->lu_tokens, cmd->hwpt_token, hwpt_target, GFP_KERNEL);
	if (rc == -EBUSY) {
		rc = -EADDRINUSE;
		goto out;
	} else if (rc) {
		goto out;
	}

This ensures instant collision detection without iterating the global 
object pool. When the HWPT is eventually destroyed (or un-preserved), we
simply call xa_erase(&ictx->lu_tokens, hwpt->lu_token).

> +	}
> +
> +	hwpt_target->lu_preserve = true;

I don't see a way to unset hwpt->lu_preserve once it's been set. What if
a VMM marks a HWPT for preservation, but then the guest decides to rmmod
the device before the actual kexec? The VMM would need a way to 
unpreserve it so we don't carry stale state across the live update?

Are we relying on the VMM to always call IOMMU_DESTROY on that HWPT when
it's no longer needed for preservation? A clever VMM optimizing for perf
might just pool or cache detached HWPTs for future reuse. If that HWPT
goes back into a free pool and gets re-attached to a new device later,
> the sticky lu_preserve state will inadvertently leak across the kexec.

> +	hwpt_target->lu_token = cmd->hwpt_token;
> +
> +out:
> +	xa_unlock(&ictx->objects);
> +	iommufd_put_object(ictx, &hwpt_target->common.obj);
> +	return rc;
> +}
> +
> diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
> index 5cc4b08c25f5..e1a9b3051f65 100644
> --- a/drivers/iommu/iommufd/main.c
> +++ b/drivers/iommu/iommufd/main.c
> @@ -493,6 +493,8 @@ static const struct iommufd_ioctl_op iommufd_ioctl_ops[] = {
>  		 __reserved),
>  	IOCTL_OP(IOMMU_VIOMMU_ALLOC, iommufd_viommu_alloc_ioctl,
>  		 struct iommu_viommu_alloc, out_viommu_id),
> +	IOCTL_OP(IOMMU_HWPT_LU_SET_PRESERVE, iommufd_hwpt_lu_set_preserve,
> +		 struct iommu_hwpt_lu_set_preserve, hwpt_token),
>  #ifdef CONFIG_IOMMUFD_TEST
>  	IOCTL_OP(IOMMU_TEST_CMD, iommufd_test, struct iommu_test_cmd, last),
>  #endif
> diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
> index 2c41920b641d..25d8cff987eb 100644
> --- a/include/uapi/linux/iommufd.h
> +++ b/include/uapi/linux/iommufd.h
> @@ -57,6 +57,7 @@ enum {
>  	IOMMUFD_CMD_IOAS_CHANGE_PROCESS = 0x92,
>  	IOMMUFD_CMD_VEVENTQ_ALLOC = 0x93,
>  	IOMMUFD_CMD_HW_QUEUE_ALLOC = 0x94,
> +	IOMMUFD_CMD_HWPT_LU_SET_PRESERVE = 0x95,
>  };
>  
>  /**
> @@ -1299,4 +1300,22 @@ struct iommu_hw_queue_alloc {
>  	__aligned_u64 length;
>  };
>  #define IOMMU_HW_QUEUE_ALLOC _IO(IOMMUFD_TYPE, IOMMUFD_CMD_HW_QUEUE_ALLOC)
> +
> +/**
> + * struct iommu_hwpt_lu_set_preserve - ioctl(IOMMU_HWPT_LU_SET_PRESERVE)

Nit: The IOCTL is called "IOMMU_HWPT_LU_SET_PRESERVE" which subtly
implies the existence of a "GET_PRESERVE". Should we perhaps just call
it IOMMU_HWPT_LU_PRESERVE?

> + * @size: sizeof(struct iommu_hwpt_lu_set_preserve)
> + * @hwpt_id: Iommufd object ID of the target HWPT
> + * @hwpt_token: Token to identify this hwpt upon restore
> + *
> + * The target HWPT will be preserved during iommufd preservation.
> + *
> + * The hwpt_token is provided by userspace. If userspace enters a token
> + * already in use within this iommufd, -EADDRINUSE is returned from this ioctl.
> + */
> +struct iommu_hwpt_lu_set_preserve {
> +	__u32 size;
> +	__u32 hwpt_id;
> +	__u32 hwpt_token;
> +};

Nit: Let's make sure we follow the 64-bit alignment as enforced in the
rest of this file, note the __u32 __reserved fields in existing IOCTL
structs.

> +#define IOMMU_HWPT_LU_SET_PRESERVE _IO(IOMMUFD_TYPE, IOMMUFD_CMD_HWPT_LU_SET_PRESERVE)
>  #endif

Thanks,
Praan


* Re: [PATCH 10/14] iommufd-lu: Implement ioctl to let userspace mark an HWPT to be preserved
  2026-03-25 14:37   ` Pranjal Shrivastava
@ 2026-03-25 17:31     ` Samiullah Khawaja
  2026-03-25 18:55       ` Pranjal Shrivastava
  0 siblings, 1 reply; 98+ messages in thread
From: Samiullah Khawaja @ 2026-03-25 17:31 UTC (permalink / raw)
  To: Pranjal Shrivastava
  Cc: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe, YiFei Zhu, Robin Murphy, Kevin Tian,
	Alex Williamson, Shuah Khan, iommu, linux-kernel, kvm,
	Saeed Mahameed, Adithya Jayachandran, Parav Pandit,
	Leon Romanovsky, William Tu, Pratyush Yadav, Pasha Tatashin,
	David Matlack, Andrew Morton, Chris Li, Vipin Sharma

On Wed, Mar 25, 2026 at 02:37:37PM +0000, Pranjal Shrivastava wrote:
>On Tue, Feb 03, 2026 at 10:09:44PM +0000, Samiullah Khawaja wrote:
>> From: YiFei Zhu <zhuyifei@google.com>
>>
>> Userspace provides a token, which will then be used at restore to
>> identify this HWPT. The restoration logic is not implemented and will be
>> added later.
>>
>> Signed-off-by: YiFei Zhu <zhuyifei@google.com>
>> Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
>> ---
>>  drivers/iommu/iommufd/Makefile          |  1 +
>>  drivers/iommu/iommufd/iommufd_private.h | 13 +++++++
>>  drivers/iommu/iommufd/liveupdate.c      | 49 +++++++++++++++++++++++++
>>  drivers/iommu/iommufd/main.c            |  2 +
>>  include/uapi/linux/iommufd.h            | 19 ++++++++++
>>  5 files changed, 84 insertions(+)
>>  create mode 100644 drivers/iommu/iommufd/liveupdate.c
>>
>> diff --git a/drivers/iommu/iommufd/Makefile b/drivers/iommu/iommufd/Makefile
>> index 71d692c9a8f4..c3bf0b6452d3 100644
>> --- a/drivers/iommu/iommufd/Makefile
>> +++ b/drivers/iommu/iommufd/Makefile
>> @@ -17,3 +17,4 @@ obj-$(CONFIG_IOMMUFD_DRIVER) += iova_bitmap.o
>>
>>  iommufd_driver-y := driver.o
>>  obj-$(CONFIG_IOMMUFD_DRIVER_CORE) += iommufd_driver.o
>> +obj-$(CONFIG_IOMMU_LIVEUPDATE) += liveupdate.o
>> diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
>> index eb6d1a70f673..6424e7cea5b2 100644
>> --- a/drivers/iommu/iommufd/iommufd_private.h
>> +++ b/drivers/iommu/iommufd/iommufd_private.h
>> @@ -374,6 +374,10 @@ struct iommufd_hwpt_paging {
>>  	bool auto_domain : 1;
>>  	bool enforce_cache_coherency : 1;
>>  	bool nest_parent : 1;
>> +#ifdef CONFIG_IOMMU_LIVEUPDATE
>> +	bool lu_preserve : 1;
>> +	u32 lu_token;
>
>Did we downsize the token? Shouldn't this be u64 as everywhere else?

Note that this is different from the token that is used to preserve the
FD into LUO. This token is used to mark the HWPT for preservation; that
is, it will be preserved when the FD is preserved.

I will add more text in the commit message to make it clear.

For consistency I will make it u64.
>
>> +#endif
>>  	/* Head at iommufd_ioas::hwpt_list */
>>  	struct list_head hwpt_item;
>>  	struct iommufd_sw_msi_maps present_sw_msi;
>> @@ -707,6 +711,15 @@ iommufd_get_vdevice(struct iommufd_ctx *ictx, u32 id)
>>  			    struct iommufd_vdevice, obj);
>>  }
>>
>> +#ifdef CONFIG_IOMMU_LIVEUPDATE
>> +int iommufd_hwpt_lu_set_preserve(struct iommufd_ucmd *ucmd);
>> +#else
>> +static inline int iommufd_hwpt_lu_set_preserve(struct iommufd_ucmd *ucmd)
>> +{
>> +	return -ENOTTY;
>> +}
>> +#endif
>> +
>>  #ifdef CONFIG_IOMMUFD_TEST
>>  int iommufd_test(struct iommufd_ucmd *ucmd);
>>  void iommufd_selftest_destroy(struct iommufd_object *obj);
>> diff --git a/drivers/iommu/iommufd/liveupdate.c b/drivers/iommu/iommufd/liveupdate.c
>> new file mode 100644
>> index 000000000000..ae74f5b54735
>> --- /dev/null
>> +++ b/drivers/iommu/iommufd/liveupdate.c
>> @@ -0,0 +1,49 @@
>> +// SPDX-License-Identifier: GPL-2.0-only
>> +
>> +#define pr_fmt(fmt) "iommufd: " fmt
>> +
>> +#include <linux/file.h>
>> +#include <linux/iommufd.h>
>> +#include <linux/liveupdate.h>
>> +
>> +#include "iommufd_private.h"
>> +
>> +int iommufd_hwpt_lu_set_preserve(struct iommufd_ucmd *ucmd)
>> +{
>> +	struct iommu_hwpt_lu_set_preserve *cmd = ucmd->cmd;
>> +	struct iommufd_hwpt_paging *hwpt_target, *hwpt;
>> +	struct iommufd_ctx *ictx = ucmd->ictx;
>> +	struct iommufd_object *obj;
>> +	unsigned long index;
>> +	int rc = 0;
>> +
>> +	hwpt_target = iommufd_get_hwpt_paging(ucmd, cmd->hwpt_id);
>> +	if (IS_ERR(hwpt_target))
>> +		return PTR_ERR(hwpt_target);
>> +
>> +	xa_lock(&ictx->objects);
>> +	xa_for_each(&ictx->objects, index, obj) {
>> +		if (obj->type != IOMMUFD_OBJ_HWPT_PAGING)
>> +			continue;
>
>Couldn't these be HWPT_NESTED? Are we explicitly skipping HWPT_NESTED
>here? ARM SMMUv3 heavily relies on IOMMU_DOMAIN_NESTED to back vIOMMUs
>and hold critical guest translation state. We'd need to support
>HWPT_NESTED for arm-smmu-v3.

For this series, I am not handling the NESTED and vIOMMU use cases. I
will send a separate series to handle those; this is mentioned in the
cover letter under Future work.

I will add a note in the commit message as well.
>
>> +
>> +		hwpt = container_of(obj, struct iommufd_hwpt_paging, common.obj);
>> +
>> +		if (hwpt == hwpt_target)
>> +			continue;
>> +		if (!hwpt->lu_preserve)
>> +			continue;
>> +		if (hwpt->lu_token == cmd->hwpt_token) {
>> +			rc = -EADDRINUSE;
>> +			goto out;
>> +		}
>
>I see that this entire loop is to avoid collisions but could we improve
>this? We are doing an O(N) linear search over the entire ictx->objects
>xarray while holding xa_lock on every setup call.
>
>If the kernel requires a strict 1:1 mapping of lu_token to hwpt,
>wouldn't it be much better to track these in a dedicated xarray?
>
>Just thinking out loud, if we added a dedicated lu_tokens xarray to
>iommufd_ctx, we could drop the linear search and the lock entirely,
>letting the xarray handle the collision natively like this:
>
>	rc = xa_insert(&ictx->lu_tokens, cmd->hwpt_token, hwpt_target, GFP_KERNEL);
>	if (rc == -EBUSY) {
>		rc = -EADDRINUSE;
>		goto out;
>	} else if (rc) {
>		goto out;
>	}
>
>This ensures instant collision detection without iterating the global
>object pool. When the HWPT is eventually destroyed (or un-preserved), we
>simply call xa_erase(&ictx->lu_tokens, hwpt->lu_token).

Agreed. We can call xa_erase() when the HWPT is destroyed. The dedicated
xarray can also be used during the actual preservation without taking
the objects lock.
>
>> +	}
>> +
>> +	hwpt_target->lu_preserve = true;
>
>I don't see a way to unset hwpt->lu_preserve once it's been set. What if
>a VMM marks a HWPT for preservation, but then the guest decides to rmmod
>the device before the actual kexec? The VMM would need a way to
>unpreserve it so we don't carry stale state across the live update?
>
>Are we relying on the VMM to always call IOMMU_DESTROY on that HWPT when
>it's no longer needed for preservation? A clever VMM optimizing for perf
>might just pool or cache detached HWPTs for future reuse. If that HWPT
>goes back into a free pool and gets re-attached to a new device later,
>the sticky lu_preserve state will inadvertently leak across the kexec..

As mentioned earlier, the HWPT is not actually preserved by this call.
So when the VMM dies or an rmmod happens, the HWPT will be destroyed
following the normal flow.

I will add this to the commit message.
>
>> +	hwpt_target->lu_token = cmd->hwpt_token;
>> +
>> +out:
>> +	xa_unlock(&ictx->objects);
>> +	iommufd_put_object(ictx, &hwpt_target->common.obj);
>> +	return rc;
>> +}
>> +
>> diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
>> index 5cc4b08c25f5..e1a9b3051f65 100644
>> --- a/drivers/iommu/iommufd/main.c
>> +++ b/drivers/iommu/iommufd/main.c
>> @@ -493,6 +493,8 @@ static const struct iommufd_ioctl_op iommufd_ioctl_ops[] = {
>>  		 __reserved),
>>  	IOCTL_OP(IOMMU_VIOMMU_ALLOC, iommufd_viommu_alloc_ioctl,
>>  		 struct iommu_viommu_alloc, out_viommu_id),
>> +	IOCTL_OP(IOMMU_HWPT_LU_SET_PRESERVE, iommufd_hwpt_lu_set_preserve,
>> +		 struct iommu_hwpt_lu_set_preserve, hwpt_token),
>>  #ifdef CONFIG_IOMMUFD_TEST
>>  	IOCTL_OP(IOMMU_TEST_CMD, iommufd_test, struct iommu_test_cmd, last),
>>  #endif
>> diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
>> index 2c41920b641d..25d8cff987eb 100644
>> --- a/include/uapi/linux/iommufd.h
>> +++ b/include/uapi/linux/iommufd.h
>> @@ -57,6 +57,7 @@ enum {
>>  	IOMMUFD_CMD_IOAS_CHANGE_PROCESS = 0x92,
>>  	IOMMUFD_CMD_VEVENTQ_ALLOC = 0x93,
>>  	IOMMUFD_CMD_HW_QUEUE_ALLOC = 0x94,
>> +	IOMMUFD_CMD_HWPT_LU_SET_PRESERVE = 0x95,
>>  };
>>
>>  /**
>> @@ -1299,4 +1300,22 @@ struct iommu_hw_queue_alloc {
>>  	__aligned_u64 length;
>>  };
>>  #define IOMMU_HW_QUEUE_ALLOC _IO(IOMMUFD_TYPE, IOMMUFD_CMD_HW_QUEUE_ALLOC)
>> +
>> +/**
>> + * struct iommu_hwpt_lu_set_preserve - ioctl(IOMMU_HWPT_LU_SET_PRESERVE)
>
>Nit: The IOCTL is called "IOMMU_HWPT_LU_SET_PRESERVE" which subtly
>implies the existence of a "GET_PRESERVE". Should we perhaps just call
>it IOMMU_HWPT_LU_PRESERVE?

LU_PRESERVE would imply that it is being preserved. Maybe
"IOMMU_HWPT_LU_MARK_PRESERVE"?
>
>> + * @size: sizeof(struct iommu_hwpt_lu_set_preserve)
>> + * @hwpt_id: Iommufd object ID of the target HWPT
>> + * @hwpt_token: Token to identify this hwpt upon restore
>> + *
>> + * The target HWPT will be preserved during iommufd preservation.
>> + *
>> + * The hwpt_token is provided by userspace. If userspace enters a token
>> + * already in use within this iommufd, -EADDRINUSE is returned from this ioctl.
>> + */
>> +struct iommu_hwpt_lu_set_preserve {
>> +	__u32 size;
>> +	__u32 hwpt_id;
>> +	__u32 hwpt_token;
>> +};
>
>Nit: Let's make sure we follow the 64-bit alignment as enforced in the
>rest of this file, note the __u32 __reserved fields in existing IOCTL
>structs.

Agreed. Will update
>
>> +#define IOMMU_HWPT_LU_SET_PRESERVE _IO(IOMMUFD_TYPE, IOMMUFD_CMD_HWPT_LU_SET_PRESERVE)
>>  #endif
>
>Thanks,
>Praan

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 10/14] iommufd-lu: Implement ioctl to let userspace mark an HWPT to be preserved
  2026-03-25 17:31     ` Samiullah Khawaja
@ 2026-03-25 18:55       ` Pranjal Shrivastava
  2026-03-25 20:19         ` Samiullah Khawaja
  0 siblings, 1 reply; 98+ messages in thread
From: Pranjal Shrivastava @ 2026-03-25 18:55 UTC (permalink / raw)
  To: Samiullah Khawaja
  Cc: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe, YiFei Zhu, Robin Murphy, Kevin Tian,
	Alex Williamson, Shuah Khan, iommu, linux-kernel, kvm,
	Saeed Mahameed, Adithya Jayachandran, Parav Pandit,
	Leon Romanovsky, William Tu, Pratyush Yadav, Pasha Tatashin,
	David Matlack, Andrew Morton, Chris Li, Vipin Sharma

On Wed, Mar 25, 2026 at 05:31:46PM +0000, Samiullah Khawaja wrote:
> On Wed, Mar 25, 2026 at 02:37:37PM +0000, Pranjal Shrivastava wrote:
> > On Tue, Feb 03, 2026 at 10:09:44PM +0000, Samiullah Khawaja wrote:
> > > From: YiFei Zhu <zhuyifei@google.com>
> > > 
> > > Userspace provides a token, which will then be used at restore to
> > > identify this HWPT. The restoration logic is not implemented and will be
> > > added later.
> > > 
> > > Signed-off-by: YiFei Zhu <zhuyifei@google.com>
> > > Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
> > > ---
> > >  drivers/iommu/iommufd/Makefile          |  1 +
> > >  drivers/iommu/iommufd/iommufd_private.h | 13 +++++++
> > >  drivers/iommu/iommufd/liveupdate.c      | 49 +++++++++++++++++++++++++
> > >  drivers/iommu/iommufd/main.c            |  2 +
> > >  include/uapi/linux/iommufd.h            | 19 ++++++++++
> > >  5 files changed, 84 insertions(+)
> > >  create mode 100644 drivers/iommu/iommufd/liveupdate.c
> > > 
> > > diff --git a/drivers/iommu/iommufd/Makefile b/drivers/iommu/iommufd/Makefile
> > > index 71d692c9a8f4..c3bf0b6452d3 100644
> > > --- a/drivers/iommu/iommufd/Makefile
> > > +++ b/drivers/iommu/iommufd/Makefile
> > > @@ -17,3 +17,4 @@ obj-$(CONFIG_IOMMUFD_DRIVER) += iova_bitmap.o
> > > 
> > >  iommufd_driver-y := driver.o
> > >  obj-$(CONFIG_IOMMUFD_DRIVER_CORE) += iommufd_driver.o
> > > +obj-$(CONFIG_IOMMU_LIVEUPDATE) += liveupdate.o
> > > diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
> > > index eb6d1a70f673..6424e7cea5b2 100644
> > > --- a/drivers/iommu/iommufd/iommufd_private.h
> > > +++ b/drivers/iommu/iommufd/iommufd_private.h
> > > @@ -374,6 +374,10 @@ struct iommufd_hwpt_paging {
> > >  	bool auto_domain : 1;
> > >  	bool enforce_cache_coherency : 1;
> > >  	bool nest_parent : 1;
> > > +#ifdef CONFIG_IOMMU_LIVEUPDATE
> > > +	bool lu_preserve : 1;
> > > +	u32 lu_token;
> > 
> > Did we downsize the token? Shouldn't this be u64 as everywhere else?
> 
> Note that this is different from the token that is used to preserve the
> FD into LUO. This token is used to mark the HWPT for preservation, that
> is it will be preserved when the FD is preserved.
> 
> I will add more text in the commit message to make it clear.
> 
> For consistency I will make it u64.

I understand that it's logically distinct from the FD preservation
token. However, userspace likely won't implement a separate 32-bit token
generator just for IOMMUFD live update. I assume it will just reuse the
same 64-bit restore-token allocator. Keeping this as u64 spares it from
having to downcast or manage a separate ID space just for this ioctl.

> > 
> > > +#endif
> > >  	/* Head at iommufd_ioas::hwpt_list */
> > >  	struct list_head hwpt_item;
> > >  	struct iommufd_sw_msi_maps present_sw_msi;
> > > @@ -707,6 +711,15 @@ iommufd_get_vdevice(struct iommufd_ctx *ictx, u32 id)
> > >  			    struct iommufd_vdevice, obj);
> > >  }
> > > 
> > > +#ifdef CONFIG_IOMMU_LIVEUPDATE
> > > +int iommufd_hwpt_lu_set_preserve(struct iommufd_ucmd *ucmd);
> > > +#else
> > > +static inline int iommufd_hwpt_lu_set_preserve(struct iommufd_ucmd *ucmd)
> > > +{
> > > +	return -ENOTTY;
> > > +}
> > > +#endif
> > > +
> > >  #ifdef CONFIG_IOMMUFD_TEST
> > >  int iommufd_test(struct iommufd_ucmd *ucmd);
> > >  void iommufd_selftest_destroy(struct iommufd_object *obj);
> > > diff --git a/drivers/iommu/iommufd/liveupdate.c b/drivers/iommu/iommufd/liveupdate.c
> > > new file mode 100644
> > > index 000000000000..ae74f5b54735
> > > --- /dev/null
> > > +++ b/drivers/iommu/iommufd/liveupdate.c
> > > @@ -0,0 +1,49 @@
> > > +// SPDX-License-Identifier: GPL-2.0-only
> > > +
> > > +#define pr_fmt(fmt) "iommufd: " fmt
> > > +
> > > +#include <linux/file.h>
> > > +#include <linux/iommufd.h>
> > > +#include <linux/liveupdate.h>
> > > +
> > > +#include "iommufd_private.h"
> > > +
> > > +int iommufd_hwpt_lu_set_preserve(struct iommufd_ucmd *ucmd)
> > > +{
> > > +	struct iommu_hwpt_lu_set_preserve *cmd = ucmd->cmd;
> > > +	struct iommufd_hwpt_paging *hwpt_target, *hwpt;
> > > +	struct iommufd_ctx *ictx = ucmd->ictx;
> > > +	struct iommufd_object *obj;
> > > +	unsigned long index;
> > > +	int rc = 0;
> > > +
> > > +	hwpt_target = iommufd_get_hwpt_paging(ucmd, cmd->hwpt_id);
> > > +	if (IS_ERR(hwpt_target))
> > > +		return PTR_ERR(hwpt_target);
> > > +
> > > +	xa_lock(&ictx->objects);
> > > +	xa_for_each(&ictx->objects, index, obj) {
> > > +		if (obj->type != IOMMUFD_OBJ_HWPT_PAGING)
> > > +			continue;
> > 
> > Couldn't these be HWPT_NESTED? Are we explicitly skipping HWPT_NESTED
> > here? ARM SMMUv3 heavily relies on IOMMU_DOMAIN_NESTED to back vIOMMUs
> > and hold critical guest translation state. We'd need to support
> > HWPT_NESTED for arm-smmu-v3.
> 
> For this series, I am not handling the NESTED and vIOMMU usecases. I
> will be sending a separate series to handle those, this is mentioned in
> cover letter also in the Future work.
> 
> Will add a note in commit message also.

I see, I missed it in the cover letter. Shall we add a TODO mentioning
that we'll support NESTED too? (No strong feelings about this or about
adding it to the commit message; the cover letter mentions it.)

> > 
> > > +
> > > +		hwpt = container_of(obj, struct iommufd_hwpt_paging, common.obj);
> > > +
> > > +		if (hwpt == hwpt_target)
> > > +			continue;
> > > +		if (!hwpt->lu_preserve)
> > > +			continue;
> > > +		if (hwpt->lu_token == cmd->hwpt_token) {
> > > +			rc = -EADDRINUSE;
> > > +			goto out;
> > > +		}
> > 
> > I see that this entire loop is to avoid collisions but could we improve
> > this? We are doing an O(N) linear search over the entire ictx->objects
> > xarray while holding xa_lock on every setup call.
> > 
> > If the kernel requires a strict 1:1 mapping of lu_token to hwpt,
> > wouldn't it be much better to track these in a dedicated xarray?
> > 
> > Just thinking out loud, if we added a dedicated lu_tokens xarray to
> > iommufd_ctx, we could drop the linear search and the lock entirely,
> > letting the xarray handle the collision natively like this:
> > 
> > 	rc = xa_insert(&ictx->lu_tokens, cmd->hwpt_token, hwpt_target, GFP_KERNEL);
> > 	if (rc == -EBUSY) {
> > 		rc = -EADDRINUSE;
> > 		goto out;
> > 	} else if (rc) {
> > 		goto out;
> > 	}
> > 
> > This ensures instant collision detection without iterating the global
> > object pool. When the HWPT is eventually destroyed (or un-preserved), we
> > simply call xa_erase(&ictx->lu_tokens, hwpt->lu_token).
> 
> Agreed. We can call xa_erase when it is destroyed. This can also be used
> during actual preservation without taking the objects lock.

Awesome!

> > 
> > > +	}
> > > +
> > > +	hwpt_target->lu_preserve = true;
> > 
> > I don't see a way to unset hwpt->lu_preserve once it's been set. What if
> > a VMM marks a HWPT for preservation, but then the guest decides to rmmod
> > the device before the actual kexec? The VMM would need a way to
> > unpreserve it so we don't carry stale state across the live update?
> > 
> > Are we relying on the VMM to always call IOMMU_DESTROY on that HWPT when
> > it's no longer needed for preservation? A clever VMM optimizing for perf
> > might just pool or cache detached HWPTs for future reuse. If that HWPT
> > goes back into a free pool and gets re-attached to a new device later,
> > the sticky lu_preserve state will inadvertently leak across the kexec..
> 
> As mentioned earlier, the HWPT is not being preserved in this call. So
> when VMM dies or rmmod happens, this HWPT will be destroyed following
> the normal flow.
> 

I think there might be a slight disconnect regarding the "normal flow" 
of HWPT destruction. My concern isn't about the VMM dying or a simple 1:1
teardown. My concern is about a VMM that deliberately avoids calling
IOMMU_DESTROY to optimize allocations.

The iommufd UAPI already explicitly supports the HWPT pooling model.
The IOMMU_DEVICE_ATTACH ioctl takes a pt_id, allowing a VMM to
pre-allocate an HWPT and then 'point' various devices at it over time.
(Note that detaching a device from a HWPT attaches it to a blocked
domain.)

If a VMM uses a free-list/cache for its HWPTs, a guest hot-unplug will 
cause the VMM to detach the device, but the VMM will keep the HWPT alive
in userspace for future reuse.

If that happens, the HWPT is now sitting in the VMM's free pool, but the
kernel still has it permanently flagged with lu_preserve = true. When 
the VMM later pulls that HWPT from the pool to attach to a new device 
(which might not need preservation), there is no way for the VMM to
UNMARK it for preservation.

> I will add this in commit message.
> > 
> > > +	hwpt_target->lu_token = cmd->hwpt_token;
> > > +
> > > +out:
> > > +	xa_unlock(&ictx->objects);
> > > +	iommufd_put_object(ictx, &hwpt_target->common.obj);
> > > +	return rc;
> > > +}
> > > +
> > > diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
> > > index 5cc4b08c25f5..e1a9b3051f65 100644
> > > --- a/drivers/iommu/iommufd/main.c
> > > +++ b/drivers/iommu/iommufd/main.c
> > > @@ -493,6 +493,8 @@ static const struct iommufd_ioctl_op iommufd_ioctl_ops[] = {
> > >  		 __reserved),
> > >  	IOCTL_OP(IOMMU_VIOMMU_ALLOC, iommufd_viommu_alloc_ioctl,
> > >  		 struct iommu_viommu_alloc, out_viommu_id),
> > > +	IOCTL_OP(IOMMU_HWPT_LU_SET_PRESERVE, iommufd_hwpt_lu_set_preserve,
> > > +		 struct iommu_hwpt_lu_set_preserve, hwpt_token),
> > >  #ifdef CONFIG_IOMMUFD_TEST
> > >  	IOCTL_OP(IOMMU_TEST_CMD, iommufd_test, struct iommu_test_cmd, last),
> > >  #endif
> > > diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
> > > index 2c41920b641d..25d8cff987eb 100644
> > > --- a/include/uapi/linux/iommufd.h
> > > +++ b/include/uapi/linux/iommufd.h
> > > @@ -57,6 +57,7 @@ enum {
> > >  	IOMMUFD_CMD_IOAS_CHANGE_PROCESS = 0x92,
> > >  	IOMMUFD_CMD_VEVENTQ_ALLOC = 0x93,
> > >  	IOMMUFD_CMD_HW_QUEUE_ALLOC = 0x94,
> > > +	IOMMUFD_CMD_HWPT_LU_SET_PRESERVE = 0x95,
> > >  };
> > > 
> > >  /**
> > > @@ -1299,4 +1300,22 @@ struct iommu_hw_queue_alloc {
> > >  	__aligned_u64 length;
> > >  };
> > >  #define IOMMU_HW_QUEUE_ALLOC _IO(IOMMUFD_TYPE, IOMMUFD_CMD_HW_QUEUE_ALLOC)
> > > +
> > > +/**
> > > + * struct iommu_hwpt_lu_set_preserve - ioctl(IOMMU_HWPT_LU_SET_PRESERVE)
> > 
> > Nit: The IOCTL is called "IOMMU_HWPT_LU_SET_PRESERVE" which subtly
> > implies the existence of a "GET_PRESERVE". Should we perhaps just call
> > it IOMMU_HWPT_LU_PRESERVE?
> 
> LU_PRESERVE would imply that it is being preserved. Maybe
> "IOMMU_HWPT_LU_MARK_PRESERVE"?

Yup, sounds good! Thanks

> > 
> > > + * @size: sizeof(struct iommu_hwpt_lu_set_preserve)
> > > + * @hwpt_id: Iommufd object ID of the target HWPT
> > > + * @hwpt_token: Token to identify this hwpt upon restore
> > > + *
> > > + * The target HWPT will be preserved during iommufd preservation.
> > > + *
> > > + * The hwpt_token is provided by userspace. If userspace enters a token
> > > + * already in use within this iommufd, -EADDRINUSE is returned from this ioctl.
> > > + */
> > > +struct iommu_hwpt_lu_set_preserve {
> > > +	__u32 size;
> > > +	__u32 hwpt_id;
> > > +	__u32 hwpt_token;
> > > +};
> > 
> > Nit: Let's make sure we follow the 64-bit alignment as enforced in the
> > rest of this file, note the __u32 __reserved fields in existing IOCTL
> > structs.
> 
> Agreed. Will update

Thanks,
Praan


* Re: [PATCH 11/14] iommufd-lu: Persist iommu hardware pagetables for live update
  2026-02-03 22:09 ` [PATCH 11/14] iommufd-lu: Persist iommu hardware pagetables for live update Samiullah Khawaja
                     ` (2 preceding siblings ...)
  2026-03-23 20:28   ` Vipin Sharma
@ 2026-03-25 20:08   ` Pranjal Shrivastava
  2026-03-25 20:32     ` Samiullah Khawaja
  3 siblings, 1 reply; 98+ messages in thread
From: Pranjal Shrivastava @ 2026-03-25 20:08 UTC (permalink / raw)
  To: Samiullah Khawaja
  Cc: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe, YiFei Zhu, Robin Murphy, Kevin Tian,
	Alex Williamson, Shuah Khan, iommu, linux-kernel, kvm,
	Saeed Mahameed, Adithya Jayachandran, Parav Pandit,
	Leon Romanovsky, William Tu, Pratyush Yadav, Pasha Tatashin,
	David Matlack, Andrew Morton, Chris Li, Vipin Sharma

On Tue, Feb 03, 2026 at 10:09:45PM +0000, Samiullah Khawaja wrote:
> From: YiFei Zhu <zhuyifei@google.com>
> 
> The caller is expected to mark each HWPT to be preserved with an ioctl
> call, with a token that will be used in restore. At preserve time, each
> HWPT's domain is then called with iommu_domain_preserve to preserve the
> iommu domain.
> 
> The HWPTs containing dma mappings backed by unpreserved memory should
> not be preserved. During preservation check if the mappings contained in
> the HWPT being preserved are only file based and all the files are
> preserved.
> 
> The memfd file preservation check is not enough when preserving iommufd.
> The memfd might have shrunk between the mapping and memfd preservation.
> This means that if it shrunk some pages that are right now pinned due to
> iommu mappings are not preserved with the memfd. Only allow iommufd
> preservation when all the iopt_pages are file backed and the memory file
> was seal sealed during mapping. This guarantees that all the pages that
> were backing memfd when it was mapped are preserved.
> 
> Once HWPT is preserved the iopt associated with the HWPT is made
> immutable. Since the map and unmap ioctls operates directly on iopt,
> which contains an array of domains, while each hwpt contains only one
> domain. The logic then becomes that mapping and unmapping is prohibited
> if any of the domains in an iopt belongs to a preserved hwpt. However,
> tracing to the hwpt through the domain is a lot more tedious than
> tracing through the ioas, so if an hwpt is preserved, hwpt->ioas->iopt
> is made immutable.
> 
> When undoing this (making the iopts mutable again), there's never
> a need to make some iopts mutable and some kept immutable, since
> the undo only happen on unpreserve and error path of preserve.
> Simply iterate all the ioas and clear the immutability flag on all
> their iopts.
> 
> Signed-off-by: YiFei Zhu <zhuyifei@google.com>
> Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
> ---
>  drivers/iommu/iommufd/io_pagetable.c    |  17 ++
>  drivers/iommu/iommufd/io_pagetable.h    |   1 +
>  drivers/iommu/iommufd/iommufd_private.h |  25 ++
>  drivers/iommu/iommufd/liveupdate.c      | 300 ++++++++++++++++++++++++
>  drivers/iommu/iommufd/main.c            |  14 +-
>  drivers/iommu/iommufd/pages.c           |   8 +
>  include/linux/kho/abi/iommufd.h         |  39 +++
>  7 files changed, 403 insertions(+), 1 deletion(-)
>  create mode 100644 include/linux/kho/abi/iommufd.h
> 
> diff --git a/drivers/iommu/iommufd/io_pagetable.c b/drivers/iommu/iommufd/io_pagetable.c
> index 436992331111..43e8a2443793 100644
> --- a/drivers/iommu/iommufd/io_pagetable.c
> +++ b/drivers/iommu/iommufd/io_pagetable.c
> @@ -270,6 +270,11 @@ static int iopt_alloc_area_pages(struct io_pagetable *iopt,
>  	}
>  
>  	down_write(&iopt->iova_rwsem);
> +	if (iopt_lu_map_immutable(iopt)) {
> +		rc = -EBUSY;
> +		goto out_unlock;
> +	}
> +
>  	if ((length & (iopt->iova_alignment - 1)) || !length) {
>  		rc = -EINVAL;
>  		goto out_unlock;
> @@ -328,6 +333,7 @@ static void iopt_abort_area(struct iopt_area *area)
>  		WARN_ON(area->pages);
>  	if (area->iopt) {
>  		down_write(&area->iopt->iova_rwsem);
> +		WARN_ON(iopt_lu_map_immutable(area->iopt));
>  		interval_tree_remove(&area->node, &area->iopt->area_itree);
>  		up_write(&area->iopt->iova_rwsem);
>  	}
> @@ -755,6 +761,12 @@ static int iopt_unmap_iova_range(struct io_pagetable *iopt, unsigned long start,
>  again:
>  	down_read(&iopt->domains_rwsem);
>  	down_write(&iopt->iova_rwsem);
> +
> +	if (iopt_lu_map_immutable(iopt)) {
> +		rc = -EBUSY;
> +		goto out_unlock_iova;
> +	}
> +
>  	while ((area = iopt_area_iter_first(iopt, start, last))) {
>  		unsigned long area_last = iopt_area_last_iova(area);
>  		unsigned long area_first = iopt_area_iova(area);
> @@ -1398,6 +1410,11 @@ int iopt_cut_iova(struct io_pagetable *iopt, unsigned long *iovas,
>  	int i;
>  
>  	down_write(&iopt->iova_rwsem);
> +	if (iopt_lu_map_immutable(iopt)) {
> +		up_write(&iopt->iova_rwsem);
> +		return -EBUSY;
> +	}
> +
>  	for (i = 0; i < num_iovas; i++) {
>  		struct iopt_area *area;
>  
> diff --git a/drivers/iommu/iommufd/io_pagetable.h b/drivers/iommu/iommufd/io_pagetable.h
> index 14cd052fd320..b64cb4cf300c 100644
> --- a/drivers/iommu/iommufd/io_pagetable.h
> +++ b/drivers/iommu/iommufd/io_pagetable.h
> @@ -234,6 +234,7 @@ struct iopt_pages {
>  		struct {			/* IOPT_ADDRESS_FILE */
>  			struct file *file;
>  			unsigned long start;
> +			u32 seals;
>  		};
>  		/* IOPT_ADDRESS_DMABUF */
>  		struct iopt_pages_dmabuf dmabuf;
> diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
> index 6424e7cea5b2..f8366a23999f 100644
> --- a/drivers/iommu/iommufd/iommufd_private.h
> +++ b/drivers/iommu/iommufd/iommufd_private.h
> @@ -94,6 +94,9 @@ struct io_pagetable {
>  	/* IOVA that cannot be allocated, struct iopt_reserved */
>  	struct rb_root_cached reserved_itree;
>  	u8 disable_large_pages;
> +#ifdef CONFIG_IOMMU_LIVEUPDATE
> +	bool lu_map_immutable;
> +#endif
>  	unsigned long iova_alignment;
>  };
>  
> @@ -712,12 +715,34 @@ iommufd_get_vdevice(struct iommufd_ctx *ictx, u32 id)
>  }
>  
>  #ifdef CONFIG_IOMMU_LIVEUPDATE
> +int iommufd_liveupdate_register_lufs(void);
> +int iommufd_liveupdate_unregister_lufs(void);
> +
>  int iommufd_hwpt_lu_set_preserve(struct iommufd_ucmd *ucmd);
> +static inline bool iopt_lu_map_immutable(const struct io_pagetable *iopt)
> +{
> +	return iopt->lu_map_immutable;
> +}
>  #else
> +static inline int iommufd_liveupdate_register_lufs(void)
> +{
> +	return 0;
> +}
> +
> +static inline int iommufd_liveupdate_unregister_lufs(void)
> +{
> +	return 0;
> +}
> +
>  static inline int iommufd_hwpt_lu_set_preserve(struct iommufd_ucmd *ucmd)
>  {
>  	return -ENOTTY;
>  }
> +
> +static inline bool iopt_lu_map_immutable(const struct io_pagetable *iopt)
> +{
> +	return false;
> +}
>  #endif
>  
>  #ifdef CONFIG_IOMMUFD_TEST
> diff --git a/drivers/iommu/iommufd/liveupdate.c b/drivers/iommu/iommufd/liveupdate.c
> index ae74f5b54735..ec11ae345fe7 100644
> --- a/drivers/iommu/iommufd/liveupdate.c
> +++ b/drivers/iommu/iommufd/liveupdate.c
> @@ -4,9 +4,15 @@
>  
>  #include <linux/file.h>
>  #include <linux/iommufd.h>
> +#include <linux/kexec_handover.h>
> +#include <linux/kho/abi/iommufd.h>
>  #include <linux/liveupdate.h>
> +#include <linux/iommu-lu.h>
> +#include <linux/mm.h>
> +#include <linux/pci.h>
>  
>  #include "iommufd_private.h"
> +#include "io_pagetable.h"
>  
>  int iommufd_hwpt_lu_set_preserve(struct iommufd_ucmd *ucmd)
>  {
> @@ -47,3 +53,297 @@ int iommufd_hwpt_lu_set_preserve(struct iommufd_ucmd *ucmd)
>  	return rc;
>  }
>  
> +static void iommufd_set_ioas_mutable(struct iommufd_ctx *ictx)
> +{
> +	struct iommufd_object *obj;
> +	struct iommufd_ioas *ioas;
> +	unsigned long index;
> +
> +	xa_lock(&ictx->objects);
> +	xa_for_each(&ictx->objects, index, obj) {
> +		if (obj->type != IOMMUFD_OBJ_IOAS)
> +			continue;
> +
> +		ioas = container_of(obj, struct iommufd_ioas, obj);
> +
> +		/*
> +		 * Not taking any IOAS lock here. All writers take LUO
> +		 * session mutex, and this writer racing with readers is not
> +		 * really a problem.
> +		 */
> +		WRITE_ONCE(ioas->iopt.lu_map_immutable, false);
> +	}
> +	xa_unlock(&ictx->objects);
> +}
> +
> +static int check_iopt_pages_preserved(struct liveupdate_session *s,
> +				      struct iommufd_hwpt_paging *hwpt)
> +{
> +	u32 req_seals = F_SEAL_SEAL | F_SEAL_GROW | F_SEAL_SHRINK;
> +	struct iopt_area *area;
> +	int ret;
> +
> +	for (area = iopt_area_iter_first(&hwpt->ioas->iopt, 0, ULONG_MAX); area;
> +	     area = iopt_area_iter_next(area, 0, ULONG_MAX)) {
> +		struct iopt_pages *pages = area->pages;
> +
> +		/* Only allow file based mapping */
> +		if (pages->type != IOPT_ADDRESS_FILE)
> +			return -EINVAL;

That's an important check! Should we document it somewhere for the user
too? Perhaps in the uAPI header for the preserve ioctl, so VMM authors
are aware of the anonymous-memory restriction upfront.

> +
> +		/*
> +		 * When this memory file was mapped it should be sealed and seal
> +		 * should be sealed. This means that since mapping was done the
> +		 * memory file was not grown or shrink and the pages being used
> +		 * until now remain pinnned and preserved.
> +		 */
> +		if ((pages->seals & req_seals) != req_seals)
> +			return -EINVAL;
> +
> +		/* Make sure that the file was preserved. */
> +		ret = liveupdate_get_token_outgoing(s, pages->file, NULL);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	return 0;
> +}
> +
> +static int iommufd_save_hwpts(struct iommufd_ctx *ictx,
> +			      struct iommufd_lu *iommufd_lu,
> +			      struct liveupdate_session *session)
> +{
> +	struct iommufd_hwpt_paging *hwpt, **hwpts = NULL;
> +	struct iommu_domain_ser *domain_ser;
> +	struct iommufd_hwpt_lu *hwpt_lu;
> +	struct iommufd_object *obj;
> +	unsigned int nr_hwpts = 0;
> +	unsigned long index;
> +	unsigned int i;
> +	int rc = 0;
> +
> +	if (iommufd_lu) {
> +		hwpts = kcalloc(iommufd_lu->nr_hwpts, sizeof(*hwpts),
> +				GFP_KERNEL);
> +		if (!hwpts)
> +			return -ENOMEM;
> +	}
> +
> +	xa_lock(&ictx->objects);
> +	xa_for_each(&ictx->objects, index, obj) {
> +		if (obj->type != IOMMUFD_OBJ_HWPT_PAGING)
> +			continue;
> +
> +		hwpt = container_of(obj, struct iommufd_hwpt_paging, common.obj);
> +		if (!hwpt->lu_preserve)
> +			continue;
> +
> +		if (hwpt->ioas) {
> +			/*
> +			 * Obtain exclusive access to the IOAS and IOPT while we
> +			 * set immutability
> +			 */
> +			mutex_lock(&hwpt->ioas->mutex);

Doubling down on Ankit's comment here, we should absolutely avoid taking
sleepable locks in atomic context. I can attempt testing the next series
with lockdep enabled.

> +			down_write(&hwpt->ioas->iopt.domains_rwsem);
> +			down_write(&hwpt->ioas->iopt.iova_rwsem);
> +
> +			hwpt->ioas->iopt.lu_map_immutable = true;
> +
> +			up_write(&hwpt->ioas->iopt.iova_rwsem);
> +			up_write(&hwpt->ioas->iopt.domains_rwsem);
> +			mutex_unlock(&hwpt->ioas->mutex);

Additionally, I'm wondering if this could be a helper function?
iommufd_ioas_set_immutable?

> +		}
> +
> +		if (!hwpt->common.domain) {
> +			rc = -EINVAL;
> +			xa_unlock(&ictx->objects);
> +			goto out;
> +		}
> +
> +		if (!iommufd_lu) {
> +			rc = check_iopt_pages_preserved(session, hwpt);

This seems slightly redundant: we are calling check_iopt_pages_preserved
for every single preserved HWPT, but multiple HWPTs can share the same
underlying IOAS.

If a VMM has 5 devices sharing a single IOAS, this code will walk the
exact same iopt area_itree and validate the exact same memfd seals 5
times. Should we perhaps track which iopts have already been validated
during the loop to avoid this redundant tree traversal?

> +			if (rc) {
> +				xa_unlock(&ictx->objects);
> +				goto out;
> +			}
> +		} else if (iommufd_lu) {
> +			hwpts[nr_hwpts] = hwpt;
> +			hwpt_lu = &iommufd_lu->hwpts[nr_hwpts];
> +
> +			hwpt_lu->token = hwpt->lu_token;
> +			hwpt_lu->reclaimed = false;
> +		}
> +
> +		nr_hwpts++;
> +	}
> +	xa_unlock(&ictx->objects);
> +
> +	if (WARN_ON(iommufd_lu && iommufd_lu->nr_hwpts != nr_hwpts)) {
> +		rc = -EFAULT;
> +		goto out;
> +	}
> +
> +	if (iommufd_lu) {
> +		/*
> +		 * iommu_domain_preserve may sleep and must be called
> +		 * outside of xa_lock
> +		 */
> +		for (i = 0; i < nr_hwpts; i++) {
> +			hwpt = hwpts[i];
> +			hwpt_lu = &iommufd_lu->hwpts[i];
> +
> +			rc = iommu_domain_preserve(hwpt->common.domain, &domain_ser);
> +			if (rc < 0)
> +				goto out;
> +
> +			hwpt_lu->domain_data = __pa(domain_ser);
> +		}
> +	}
> +
> +	rc = nr_hwpts;
> +
> +out:
> +	kfree(hwpts);
> +	return rc;
> +}
> +
> +static int iommufd_liveupdate_preserve(struct liveupdate_file_op_args *args)
> +{
> +	struct iommufd_ctx *ictx = iommufd_ctx_from_file(args->file);
> +	struct iommufd_lu *iommufd_lu;
> +	size_t serial_size;
> +	void *mem;
> +	int rc;
> +
> +	if (IS_ERR(ictx))
> +		return PTR_ERR(ictx);
> +
> +	rc = iommufd_save_hwpts(ictx, NULL, args->session);
> +	if (rc < 0)
> +		goto err_ioas_mutable;
> +
> +	serial_size = struct_size(iommufd_lu, hwpts, rc);
> +
> +	mem = kho_alloc_preserve(serial_size);
> +	if (!mem) {
> +		rc = -ENOMEM;
> +		goto err_ioas_mutable;
> +	}
> +
> +	iommufd_lu = mem;
> +	iommufd_lu->nr_hwpts = rc;
> +	rc = iommufd_save_hwpts(ictx, iommufd_lu, args->session);

We call iommufd_save_hwpts twice (first to count/validate & second to 
serialize). Because xa_lock is dropped between these two calls to 
allocate KHO memory, we have a TOCTOU race.

If userspace concurrently creates or destroys a preserved HWPT during
this window, nr_hwpts will change, and the second pass will hit:
WARN_ON(iommufd_lu && iommufd_lu->nr_hwpts != nr_hwpts)

I'm not sure this is a situation that warrants a WARN. Also, what about
kernels that have panic_on_warn set?

> +	if (rc < 0)
> +		goto err_free;
> +
> +	args->serialized_data = virt_to_phys(iommufd_lu);
> +	iommufd_ctx_put(ictx);
> +	return 0;
> +
> +err_free:
> +	kho_unpreserve_free(mem);
> +err_ioas_mutable:
> +	iommufd_set_ioas_mutable(ictx);
> +	iommufd_ctx_put(ictx);
> +	return rc;
> +}
> +
> +static int iommufd_liveupdate_freeze(struct liveupdate_file_op_args *args)
> +{
> +	/* No-Op; everything should be made read-only */

Is there a plan to populate this in a future series? If it's a permanent
no-op because the lu_map_immutable flag already handles the read-only
enforcement during the preserve phase, we should probably update the
comment to state that explicitly, rather than implying there is missing
logic.

> +	return 0;
> +}
> +
 
[ ---- >8 ----- ]

> diff --git a/include/linux/kho/abi/iommufd.h b/include/linux/kho/abi/iommufd.h
> new file mode 100644
> index 000000000000..f7393ac78aa9
> --- /dev/null
> +++ b/include/linux/kho/abi/iommufd.h
> @@ -0,0 +1,39 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +/*
> + * Copyright (C) 2025, Google LLC
> + * Author: Samiullah Khawaja <skhawaja@google.com>
> + */
> +
> +#ifndef _LINUX_KHO_ABI_IOMMUFD_H
> +#define _LINUX_KHO_ABI_IOMMUFD_H
> +
> +#include <linux/mutex_types.h>
> +#include <linux/compiler.h>
> +#include <linux/types.h>
> +
> +/**
> + * DOC: IOMMUFD Live Update ABI
> + *
> + * This header defines the ABI for preserving the state of an IOMMUFD file
> + * across a kexec reboot using LUO.
> + *
> + * This interface is a contract. Any modification to any of the serialization
> + * structs defined here constitutes a breaking change. Such changes require
> + * incrementing the version number in the IOMMUFD_LUO_COMPATIBLE string.
> + */
> +
> +#define IOMMUFD_LUO_COMPATIBLE "iommufd-v1"
> +
> +struct iommufd_hwpt_lu {
> +	u32 token;
> +	u64 domain_data;
> +	bool reclaimed;
> +} __packed;
> +

Nit: Because of __packed, putting the u64 after the u32 forces an
unaligned 8-byte access at offset 4. Also, it's generally safer to use
explicitly sized types like u8 instead of bool for cross-kernel ABI
structs to avoid any compiler-specific sizing ambiguities. (This is the
reason most uAPI defs avoid using bool)

We can achieve natural alignment without implicit padding by ordering it
from largest to smallest and explicitly padding it out:

struct iommufd_hwpt_lu {
	__u64 domain_data;
	__u32 token;
	__u8 reclaimed;
	__u8 padding[3];
} __packed;

This guarantees an exact 16-byte footprint with perfectly aligned 64-bit
and 32-bit accesses.

> +struct iommufd_lu {
> +	unsigned int nr_hwpts;
> +	struct iommufd_hwpt_lu hwpts[];
> +};

Same here regarding explicitly sized types and packing (as pointed out
by Vipin too).

> +
> +#endif /* _LINUX_KHO_ABI_IOMMUFD_H */

Thanks,
Praan

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 10/14] iommufd-lu: Implement ioctl to let userspace mark an HWPT to be preserved
  2026-03-25 18:55       ` Pranjal Shrivastava
@ 2026-03-25 20:19         ` Samiullah Khawaja
  2026-03-25 20:36           ` Pranjal Shrivastava
  0 siblings, 1 reply; 98+ messages in thread
From: Samiullah Khawaja @ 2026-03-25 20:19 UTC (permalink / raw)
  To: Pranjal Shrivastava
  Cc: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe, YiFei Zhu, Robin Murphy, Kevin Tian,
	Alex Williamson, Shuah Khan, iommu, linux-kernel, kvm,
	Saeed Mahameed, Adithya Jayachandran, Parav Pandit,
	Leon Romanovsky, William Tu, Pratyush Yadav, Pasha Tatashin,
	David Matlack, Andrew Morton, Chris Li, Vipin Sharma

On Wed, Mar 25, 2026 at 06:55:36PM +0000, Pranjal Shrivastava wrote:
>On Wed, Mar 25, 2026 at 05:31:46PM +0000, Samiullah Khawaja wrote:
>> On Wed, Mar 25, 2026 at 02:37:37PM +0000, Pranjal Shrivastava wrote:
>> > On Tue, Feb 03, 2026 at 10:09:44PM +0000, Samiullah Khawaja wrote:
>> > > From: YiFei Zhu <zhuyifei@google.com>
>> > >
>> > > Userspace provides a token, which will then be used at restore to
>> > > identify this HWPT. The restoration logic is not implemented and will be
>> > > added later.
>> > >
>> > > Signed-off-by: YiFei Zhu <zhuyifei@google.com>
>> > > Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
>> > > ---
>> > >  drivers/iommu/iommufd/Makefile          |  1 +
>> > >  drivers/iommu/iommufd/iommufd_private.h | 13 +++++++
>> > >  drivers/iommu/iommufd/liveupdate.c      | 49 +++++++++++++++++++++++++
>> > >  drivers/iommu/iommufd/main.c            |  2 +
>> > >  include/uapi/linux/iommufd.h            | 19 ++++++++++
>> > >  5 files changed, 84 insertions(+)
>> > >  create mode 100644 drivers/iommu/iommufd/liveupdate.c
>> > >
>> > > diff --git a/drivers/iommu/iommufd/Makefile b/drivers/iommu/iommufd/Makefile
>> > > index 71d692c9a8f4..c3bf0b6452d3 100644
>> > > --- a/drivers/iommu/iommufd/Makefile
>> > > +++ b/drivers/iommu/iommufd/Makefile
>> > > @@ -17,3 +17,4 @@ obj-$(CONFIG_IOMMUFD_DRIVER) += iova_bitmap.o
>> > >
>> > >  iommufd_driver-y := driver.o
>> > >  obj-$(CONFIG_IOMMUFD_DRIVER_CORE) += iommufd_driver.o
>> > > +obj-$(CONFIG_IOMMU_LIVEUPDATE) += liveupdate.o
>> > > diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
>> > > index eb6d1a70f673..6424e7cea5b2 100644
>> > > --- a/drivers/iommu/iommufd/iommufd_private.h
>> > > +++ b/drivers/iommu/iommufd/iommufd_private.h
>> > > @@ -374,6 +374,10 @@ struct iommufd_hwpt_paging {
>> > >  	bool auto_domain : 1;
>> > >  	bool enforce_cache_coherency : 1;
>> > >  	bool nest_parent : 1;
>> > > +#ifdef CONFIG_IOMMU_LIVEUPDATE
>> > > +	bool lu_preserve : 1;
>> > > +	u32 lu_token;
>> >
>> > Did we downsize the token? Shouldn't this be u64 as everywhere else?
>>
>> Note that this is different from the token that is used to preserve the
>> FD into LUO. This token is used to mark the HWPT for preservation;
>> that is, it will be preserved when the FD is preserved.
>>
>> I will add more text in the commit message to make it clear.
>>
>> For consistency I will make it u64.
>
>I understand that it's logically distinct from the FD preservation
>token. However, userspace likely won't implement a separate 32-bit
>token generator just for IOMMUFD live update. I assume it'll just use
>the same 64-bit restore-token allocator. Keeping this as u64 prevents
>them from having to downcast or manage a separate ID space just for
>this ioctl.

Agreed.
>
>> >
>> > > +#endif
>> > >  	/* Head at iommufd_ioas::hwpt_list */
>> > >  	struct list_head hwpt_item;
>> > >  	struct iommufd_sw_msi_maps present_sw_msi;
>> > > @@ -707,6 +711,15 @@ iommufd_get_vdevice(struct iommufd_ctx *ictx, u32 id)
>> > >  			    struct iommufd_vdevice, obj);
>> > >  }
>> > >
>> > > +#ifdef CONFIG_IOMMU_LIVEUPDATE
>> > > +int iommufd_hwpt_lu_set_preserve(struct iommufd_ucmd *ucmd);
>> > > +#else
>> > > +static inline int iommufd_hwpt_lu_set_preserve(struct iommufd_ucmd *ucmd)
>> > > +{
>> > > +	return -ENOTTY;
>> > > +}
>> > > +#endif
>> > > +
>> > >  #ifdef CONFIG_IOMMUFD_TEST
>> > >  int iommufd_test(struct iommufd_ucmd *ucmd);
>> > >  void iommufd_selftest_destroy(struct iommufd_object *obj);
>> > > diff --git a/drivers/iommu/iommufd/liveupdate.c b/drivers/iommu/iommufd/liveupdate.c
>> > > new file mode 100644
>> > > index 000000000000..ae74f5b54735
>> > > --- /dev/null
>> > > +++ b/drivers/iommu/iommufd/liveupdate.c
>> > > @@ -0,0 +1,49 @@
>> > > +// SPDX-License-Identifier: GPL-2.0-only
>> > > +
>> > > +#define pr_fmt(fmt) "iommufd: " fmt
>> > > +
>> > > +#include <linux/file.h>
>> > > +#include <linux/iommufd.h>
>> > > +#include <linux/liveupdate.h>
>> > > +
>> > > +#include "iommufd_private.h"
>> > > +
>> > > +int iommufd_hwpt_lu_set_preserve(struct iommufd_ucmd *ucmd)
>> > > +{
>> > > +	struct iommu_hwpt_lu_set_preserve *cmd = ucmd->cmd;
>> > > +	struct iommufd_hwpt_paging *hwpt_target, *hwpt;
>> > > +	struct iommufd_ctx *ictx = ucmd->ictx;
>> > > +	struct iommufd_object *obj;
>> > > +	unsigned long index;
>> > > +	int rc = 0;
>> > > +
>> > > +	hwpt_target = iommufd_get_hwpt_paging(ucmd, cmd->hwpt_id);
>> > > +	if (IS_ERR(hwpt_target))
>> > > +		return PTR_ERR(hwpt_target);
>> > > +
>> > > +	xa_lock(&ictx->objects);
>> > > +	xa_for_each(&ictx->objects, index, obj) {
>> > > +		if (obj->type != IOMMUFD_OBJ_HWPT_PAGING)
>> > > +			continue;
>> >
>> > Couldn't these be HWPT_NESTED? Are we explicitly skipping HWPT_NESTED
>> > here? ARM SMMUv3 heavily relies on IOMMU_DOMAIN_NESTED to back vIOMMUs
>> > and hold critical guest translation state. We'd need to support
>> > HWPT_NESTED for arm-smmu-v3.
>>
>> For this series, I am not handling the NESTED and vIOMMU usecases. I
>> will be sending a separate series to handle those, this is mentioned in
>> cover letter also in the Future work.
>>
>> Will add a note in commit message also.
>
>I see, I missed it in the cover letter. Shall we add a TODO mentioning
>that we'll support NESTED too? (No strong feelings about this or about
>adding it to the commit message; the cover letter already mentions it.)
>
>> >
>> > > +
>> > > +		hwpt = container_of(obj, struct iommufd_hwpt_paging, common.obj);
>> > > +
>> > > +		if (hwpt == hwpt_target)
>> > > +			continue;
>> > > +		if (!hwpt->lu_preserve)
>> > > +			continue;
>> > > +		if (hwpt->lu_token == cmd->hwpt_token) {
>> > > +			rc = -EADDRINUSE;
>> > > +			goto out;
>> > > +		}
>> >
>> > I see that this entire loop is to avoid collisions but could we improve
>> > this? We are doing an O(N) linear search over the entire ictx->objects
>> > xarray while holding xa_lock on every setup call.
>> >
>> > If the kernel requires a strict 1:1 mapping of lu_token to hwpt,
>> > wouldn't it be much better to track these in a dedicated xarray?
>> >
>> > Just thinking out loud, if we added a dedicated lu_tokens xarray to
>> > iommufd_ctx, we could drop the linear search and the lock entirely,
>> > letting the xarray handle the collision natively like this:
>> >
>> > 	rc = xa_insert(&ictx->lu_tokens, cmd->hwpt_token, hwpt_target, GFP_KERNEL);
>> > 	if (rc == -EBUSY) {
>> > 		rc = -EADDRINUSE;
>> > 		goto out;
>> > 	} else if (rc) {
>> > 		goto out;
>> > 	}
>> >
>> > This ensures instant collision detection without iterating the global
>> > object pool. When the HWPT is eventually destroyed (or un-preserved), we
>> > simply call xa_erase(&ictx->lu_tokens, hwpt->lu_token).
>>
>> Agreed. We can call xa_erase when it is destroyed. This can also be used
>> during actual preservation without taking the objects lock.
>
>Awesome!
>
>> >
>> > > +	}
>> > > +
>> > > +	hwpt_target->lu_preserve = true;
>> >
>> > I don't see a way to unset hwpt->lu_preserve once it's been set. What if
>> > a VMM marks a HWPT for preservation, but then the guest decides to rmmod
>> > the device before the actual kexec? The VMM would need a way to
>> > unpreserve it so we don't carry stale state across the live update?
>> >
>> > Are we relying on the VMM to always call IOMMU_DESTROY on that HWPT when
>> > it's no longer needed for preservation? A clever VMM optimizing for perf
>> > might just pool or cache detached HWPTs for future reuse. If that HWPT
>> > goes back into a free pool and gets re-attached to a new device later,
>> > the sticky lu_preserve state will inadvertently leak across the kexec..
>>
>> As mentioned earlier, the HWPT is not being preserved in this call. So
>> when VMM dies or rmmod happens, this HWPT will be destroyed following
>> the normal flow.
>>
>
>I think there might be a slight disconnect regarding the "normal flow"
>of HWPT destruction. My concern isn't about the VMM dying or a simple 1:1
>teardown. My concern is about a VMM that deliberately avoids calling
>IOMMU_DESTROY to optimize allocations.
>
>The iommufd UAPI already explicitly supports the HWPT pooling model.
>The IOMMU_DEVICE_ATTACH ioctl takes a pt_id, allowing a VMM to
>pre-allocate an HWPT and then 'point' various devices at it over time.
>(Note that detaching a device from a HWPT attaches it to a blocked
>domain.)
>
>If a VMM uses a free-list/cache for its HWPTs, a guest hot-unplug will
>cause the VMM to detach the device, but the VMM will keep the HWPT alive
>in userspace for future reuse.
>
>If that happens, the HWPT is now sitting in the VMM's free pool, but the
>kernel still has it permanently flagged with lu_preserve = true. When
>the VMM later pulls that HWPT from the pool to attach to a new device
>(which might not need preservation), there is no way for the VMM to
>UNMARK it for preservation.

Interesting. My thinking is that a VMM that is aware of the live update
use case should be responsible for its own object lifecycle. It should
simply discard such an HWPT rather than returning it to a free-list.

My concern with adding an unpreserve ioctl is that it forces a lot of
complex lifecycle tracking into the kernel, especially around the new
locking that would be needed to handle races between parallel iommufd
preserve/unpreserve calls.

Given that complexity, I think the cleaner approach is to avoid the new
ioctl and keep the kernel-side implementation simpler.
>
>> I will add this in commit message.
>> >
>> > > +	hwpt_target->lu_token = cmd->hwpt_token;
>> > > +
>> > > +out:
>> > > +	xa_unlock(&ictx->objects);
>> > > +	iommufd_put_object(ictx, &hwpt_target->common.obj);
>> > > +	return rc;
>> > > +}
>> > > +
>> > > diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
>> > > index 5cc4b08c25f5..e1a9b3051f65 100644
>> > > --- a/drivers/iommu/iommufd/main.c
>> > > +++ b/drivers/iommu/iommufd/main.c
>> > > @@ -493,6 +493,8 @@ static const struct iommufd_ioctl_op iommufd_ioctl_ops[] = {
>> > >  		 __reserved),
>> > >  	IOCTL_OP(IOMMU_VIOMMU_ALLOC, iommufd_viommu_alloc_ioctl,
>> > >  		 struct iommu_viommu_alloc, out_viommu_id),
>> > > +	IOCTL_OP(IOMMU_HWPT_LU_SET_PRESERVE, iommufd_hwpt_lu_set_preserve,
>> > > +		 struct iommu_hwpt_lu_set_preserve, hwpt_token),
>> > >  #ifdef CONFIG_IOMMUFD_TEST
>> > >  	IOCTL_OP(IOMMU_TEST_CMD, iommufd_test, struct iommu_test_cmd, last),
>> > >  #endif
>> > > diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
>> > > index 2c41920b641d..25d8cff987eb 100644
>> > > --- a/include/uapi/linux/iommufd.h
>> > > +++ b/include/uapi/linux/iommufd.h
>> > > @@ -57,6 +57,7 @@ enum {
>> > >  	IOMMUFD_CMD_IOAS_CHANGE_PROCESS = 0x92,
>> > >  	IOMMUFD_CMD_VEVENTQ_ALLOC = 0x93,
>> > >  	IOMMUFD_CMD_HW_QUEUE_ALLOC = 0x94,
>> > > +	IOMMUFD_CMD_HWPT_LU_SET_PRESERVE = 0x95,
>> > >  };
>> > >
>> > >  /**
>> > > @@ -1299,4 +1300,22 @@ struct iommu_hw_queue_alloc {
>> > >  	__aligned_u64 length;
>> > >  };
>> > >  #define IOMMU_HW_QUEUE_ALLOC _IO(IOMMUFD_TYPE, IOMMUFD_CMD_HW_QUEUE_ALLOC)
>> > > +
>> > > +/**
>> > > + * struct iommu_hwpt_lu_set_preserve - ioctl(IOMMU_HWPT_LU_SET_PRESERVE)
>> >
>> > Nit: The IOCTL is called "IOMMU_HWPT_LU_SET_PRESERVE" which subtly
>> > implies the existence of a "GET_PRESERVE". Should we perhaps just call
>> > it IOMMU_HWPT_LU_PRESERVE?
>>
>> LU_PRESERVE would imply that it is being preserved. Maybe
>> "IOMMU_HWPT_LU_MARK_PRESERVE"?
>
>Yup, sounds good! Thanks
>
>> >
>> > > + * @size: sizeof(struct iommu_hwpt_lu_set_preserve)
>> > > + * @hwpt_id: Iommufd object ID of the target HWPT
>> > > + * @hwpt_token: Token to identify this hwpt upon restore
>> > > + *
>> > > + * The target HWPT will be preserved during iommufd preservation.
>> > > + *
>> > > + * The hwpt_token is provided by userspace. If userspace enters a token
>> > > + * already in use within this iommufd, -EADDRINUSE is returned from this ioctl.
>> > > + */
>> > > +struct iommu_hwpt_lu_set_preserve {
>> > > +	__u32 size;
>> > > +	__u32 hwpt_id;
>> > > +	__u32 hwpt_token;
>> > > +};
>> >
>> > Nit: Let's make sure we follow the 64-bit alignment as enforced in the
>> > rest of this file, note the __u32 __reserved fields in existing IOCTL
>> > structs.
>>
>> Agreed. Will update
>
>Thanks,
>Praan

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 12/14] iommufd: Add APIs to preserve/unpreserve a vfio cdev
  2026-02-03 22:09 ` [PATCH 12/14] iommufd: Add APIs to preserve/unpreserve a vfio cdev Samiullah Khawaja
  2026-03-23 20:59   ` Vipin Sharma
@ 2026-03-25 20:24   ` Pranjal Shrivastava
  2026-03-25 20:41     ` Samiullah Khawaja
  1 sibling, 1 reply; 98+ messages in thread
From: Pranjal Shrivastava @ 2026-03-25 20:24 UTC (permalink / raw)
  To: Samiullah Khawaja
  Cc: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe, Robin Murphy, Kevin Tian, Alex Williamson,
	Shuah Khan, iommu, linux-kernel, kvm, Saeed Mahameed,
	Adithya Jayachandran, Parav Pandit, Leon Romanovsky, William Tu,
	Pratyush Yadav, Pasha Tatashin, David Matlack, Andrew Morton,
	Chris Li, Vipin Sharma, YiFei Zhu

On Tue, Feb 03, 2026 at 10:09:46PM +0000, Samiullah Khawaja wrote:
> Add APIs that can be used to preserve and unpreserve a vfio cdev. Use
> the APIs exported by the IOMMU core to preserve/unpreserve device. Pass
> the LUO preservation token of the attached iommufd into IOMMU preserve
> device API. This establishes the ownership of the device with the
> preserved iommufd.
> 
> Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
> ---
>  drivers/iommu/iommufd/device.c | 69 ++++++++++++++++++++++++++++++++++
>  include/linux/iommufd.h        | 23 ++++++++++++
>  2 files changed, 92 insertions(+)
> 
> diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
> index 4c842368289f..30cb5218093b 100644
> --- a/drivers/iommu/iommufd/device.c
> +++ b/drivers/iommu/iommufd/device.c
> @@ -2,6 +2,7 @@
>  /* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES
>   */
>  #include <linux/iommu.h>
> +#include <linux/iommu-lu.h>
>  #include <linux/iommufd.h>
>  #include <linux/pci-ats.h>
>  #include <linux/slab.h>
> @@ -1661,3 +1662,71 @@ int iommufd_get_hw_info(struct iommufd_ucmd *ucmd)
>  	iommufd_put_object(ucmd->ictx, &idev->obj);
>  	return rc;
>  }
> +
> +#ifdef CONFIG_IOMMU_LIVEUPDATE
> +int iommufd_device_preserve(struct liveupdate_session *s,
> +			    struct iommufd_device *idev,
> +			    u64 *tokenp)
> +{
> +	struct iommufd_group *igroup = idev->igroup;
> +	struct iommufd_hwpt_paging *hwpt_paging;
> +	struct iommufd_hw_pagetable *hwpt;
> +	struct iommufd_attach *attach;
> +	int ret;
> +
> +	mutex_lock(&igroup->lock);
> +	attach = xa_load(&igroup->pasid_attach, IOMMU_NO_PASID);

By explicitly looking up IOMMU_NO_PASID, we skip any PASID attachments
the device might have. Since PASID live update is NOT supported in this
series, should we check if the pasid_attach xarray contains anything 
other than IOMMU_NO_PASID and return -EOPNOTSUPP? 

Otherwise, we silently fail to preserve those domains without informing
the VMM?

> +	if (!attach) {
> +		ret = -ENOENT;
> +		goto out;
> +	}
> +
> +	hwpt = attach->hwpt;
> +	hwpt_paging = find_hwpt_paging(hwpt);
> +	if (!hwpt_paging || !hwpt_paging->lu_preserve) {
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	ret = liveupdate_get_token_outgoing(s, idev->ictx->file, tokenp);
> +	if (ret)
> +		goto out;
> +
> +	ret = iommu_preserve_device(hwpt_paging->common.domain,
> +				    idev->dev,
> +				    *tokenp);
> +out:
> +	mutex_unlock(&igroup->lock);
> +	return ret;
> +}
> +EXPORT_SYMBOL_NS_GPL(iommufd_device_preserve, "IOMMUFD");
> +
> +void iommufd_device_unpreserve(struct liveupdate_session *s,
> +			       struct iommufd_device *idev,
> +			       u64 token)
> +{
> +	struct iommufd_group *igroup = idev->igroup;
> +	struct iommufd_hwpt_paging *hwpt_paging;
> +	struct iommufd_hw_pagetable *hwpt;
> +	struct iommufd_attach *attach;
> +
> +	mutex_lock(&igroup->lock);
> +	attach = xa_load(&igroup->pasid_attach, IOMMU_NO_PASID);
> +	if (!attach) {
> +		WARN_ON(-ENOENT);

WARN_ON() takes only a condition, and -ENOENT is a non-zero constant,
so this always fires. If we want a warning that fires unconditionally
with a message, why not WARN(1, "...")? (WARN_ON() does not take a
format string.) What's the significance of passing -ENOENT as the
condition?

> +		goto out;
> +	}
> +
> +	hwpt = attach->hwpt;
> +	hwpt_paging = find_hwpt_paging(hwpt);
> +	if (!hwpt_paging || !hwpt_paging->lu_preserve) {
> +		WARN_ON(-EINVAL);

Same here for -EINVAL?

> +		goto out;
> +	}
> +
> +	iommu_unpreserve_device(hwpt_paging->common.domain, idev->dev);
> +out:
> +	mutex_unlock(&igroup->lock);
> +}
> +EXPORT_SYMBOL_NS_GPL(iommufd_device_unpreserve, "IOMMUFD");
> +#endif

[ ----- >8 ----- ]

Thanks,
Praan

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 11/14] iommufd-lu: Persist iommu hardware pagetables for live update
  2026-03-25 20:08   ` Pranjal Shrivastava
@ 2026-03-25 20:32     ` Samiullah Khawaja
  0 siblings, 0 replies; 98+ messages in thread
From: Samiullah Khawaja @ 2026-03-25 20:32 UTC (permalink / raw)
  To: Pranjal Shrivastava
  Cc: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe, YiFei Zhu, Robin Murphy, Kevin Tian,
	Alex Williamson, Shuah Khan, iommu, linux-kernel, kvm,
	Saeed Mahameed, Adithya Jayachandran, Parav Pandit,
	Leon Romanovsky, William Tu, Pratyush Yadav, Pasha Tatashin,
	David Matlack, Andrew Morton, Chris Li, Vipin Sharma

On Wed, Mar 25, 2026 at 08:08:48PM +0000, Pranjal Shrivastava wrote:
>On Tue, Feb 03, 2026 at 10:09:45PM +0000, Samiullah Khawaja wrote:
>> From: YiFei Zhu <zhuyifei@google.com>
>>
>> The caller is expected to mark each HWPT to be preserved with an ioctl
>> call, supplying a token that will be used at restore. At preserve time,
>> iommu_domain_preserve() is then called on each HWPT's domain to
>> preserve the iommu domain.
>>
>> HWPTs containing DMA mappings backed by unpreserved memory must not be
>> preserved. During preservation, check that all the mappings contained
>> in the HWPT being preserved are file based and that all of those files
>> are preserved.
>>
>> The memfd file preservation check is not enough when preserving iommufd.
>> The memfd might have shrunk between the mapping and the memfd
>> preservation. If it shrunk, some pages that are currently pinned by
>> iommu mappings are not preserved with the memfd. Only allow iommufd
>> preservation when all the iopt_pages are file backed and the memory
>> file's seals were themselves sealed at mapping time. This guarantees
>> that all the pages that were backing the memfd when it was mapped are
>> preserved.
>>
>> Once an HWPT is preserved, the iopt associated with it is made
>> immutable. The map and unmap ioctls operate directly on the iopt,
>> which contains an array of domains, while each hwpt contains only one
>> domain. The logic then becomes that mapping and unmapping are
>> prohibited if any of the domains in an iopt belongs to a preserved
>> hwpt. However, tracing to the hwpt through the domain is a lot more
>> tedious than tracing through the ioas, so if an hwpt is preserved,
>> hwpt->ioas->iopt is made immutable.
>>
>> When undoing this (making the iopts mutable again), there is never a
>> need to make some iopts mutable while keeping others immutable, since
>> the undo only happens on unpreserve and on the error path of preserve.
>> Simply iterate over all the ioas and clear the immutability flag on
>> all their iopts.
>>
>> Signed-off-by: YiFei Zhu <zhuyifei@google.com>
>> Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
>> ---
>>  drivers/iommu/iommufd/io_pagetable.c    |  17 ++
>>  drivers/iommu/iommufd/io_pagetable.h    |   1 +
>>  drivers/iommu/iommufd/iommufd_private.h |  25 ++
>>  drivers/iommu/iommufd/liveupdate.c      | 300 ++++++++++++++++++++++++
>>  drivers/iommu/iommufd/main.c            |  14 +-
>>  drivers/iommu/iommufd/pages.c           |   8 +
>>  include/linux/kho/abi/iommufd.h         |  39 +++
>>  7 files changed, 403 insertions(+), 1 deletion(-)
>>  create mode 100644 include/linux/kho/abi/iommufd.h
>>
>> diff --git a/drivers/iommu/iommufd/io_pagetable.c b/drivers/iommu/iommufd/io_pagetable.c
>> index 436992331111..43e8a2443793 100644
>> --- a/drivers/iommu/iommufd/io_pagetable.c
>> +++ b/drivers/iommu/iommufd/io_pagetable.c
>> @@ -270,6 +270,11 @@ static int iopt_alloc_area_pages(struct io_pagetable *iopt,
>>  	}
>>
>>  	down_write(&iopt->iova_rwsem);
>> +	if (iopt_lu_map_immutable(iopt)) {
>> +		rc = -EBUSY;
>> +		goto out_unlock;
>> +	}
>> +
>>  	if ((length & (iopt->iova_alignment - 1)) || !length) {
>>  		rc = -EINVAL;
>>  		goto out_unlock;
>> @@ -328,6 +333,7 @@ static void iopt_abort_area(struct iopt_area *area)
>>  		WARN_ON(area->pages);
>>  	if (area->iopt) {
>>  		down_write(&area->iopt->iova_rwsem);
>> +		WARN_ON(iopt_lu_map_immutable(area->iopt));
>>  		interval_tree_remove(&area->node, &area->iopt->area_itree);
>>  		up_write(&area->iopt->iova_rwsem);
>>  	}
>> @@ -755,6 +761,12 @@ static int iopt_unmap_iova_range(struct io_pagetable *iopt, unsigned long start,
>>  again:
>>  	down_read(&iopt->domains_rwsem);
>>  	down_write(&iopt->iova_rwsem);
>> +
>> +	if (iopt_lu_map_immutable(iopt)) {
>> +		rc = -EBUSY;
>> +		goto out_unlock_iova;
>> +	}
>> +
>>  	while ((area = iopt_area_iter_first(iopt, start, last))) {
>>  		unsigned long area_last = iopt_area_last_iova(area);
>>  		unsigned long area_first = iopt_area_iova(area);
>> @@ -1398,6 +1410,11 @@ int iopt_cut_iova(struct io_pagetable *iopt, unsigned long *iovas,
>>  	int i;
>>
>>  	down_write(&iopt->iova_rwsem);
>> +	if (iopt_lu_map_immutable(iopt)) {
>> +		up_write(&iopt->iova_rwsem);
>> +		return -EBUSY;
>> +	}
>> +
>>  	for (i = 0; i < num_iovas; i++) {
>>  		struct iopt_area *area;
>>
>> diff --git a/drivers/iommu/iommufd/io_pagetable.h b/drivers/iommu/iommufd/io_pagetable.h
>> index 14cd052fd320..b64cb4cf300c 100644
>> --- a/drivers/iommu/iommufd/io_pagetable.h
>> +++ b/drivers/iommu/iommufd/io_pagetable.h
>> @@ -234,6 +234,7 @@ struct iopt_pages {
>>  		struct {			/* IOPT_ADDRESS_FILE */
>>  			struct file *file;
>>  			unsigned long start;
>> +			u32 seals;
>>  		};
>>  		/* IOPT_ADDRESS_DMABUF */
>>  		struct iopt_pages_dmabuf dmabuf;
>> diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
>> index 6424e7cea5b2..f8366a23999f 100644
>> --- a/drivers/iommu/iommufd/iommufd_private.h
>> +++ b/drivers/iommu/iommufd/iommufd_private.h
>> @@ -94,6 +94,9 @@ struct io_pagetable {
>>  	/* IOVA that cannot be allocated, struct iopt_reserved */
>>  	struct rb_root_cached reserved_itree;
>>  	u8 disable_large_pages;
>> +#ifdef CONFIG_IOMMU_LIVEUPDATE
>> +	bool lu_map_immutable;
>> +#endif
>>  	unsigned long iova_alignment;
>>  };
>>
>> @@ -712,12 +715,34 @@ iommufd_get_vdevice(struct iommufd_ctx *ictx, u32 id)
>>  }
>>
>>  #ifdef CONFIG_IOMMU_LIVEUPDATE
>> +int iommufd_liveupdate_register_lufs(void);
>> +int iommufd_liveupdate_unregister_lufs(void);
>> +
>>  int iommufd_hwpt_lu_set_preserve(struct iommufd_ucmd *ucmd);
>> +static inline bool iopt_lu_map_immutable(const struct io_pagetable *iopt)
>> +{
>> +	return iopt->lu_map_immutable;
>> +}
>>  #else
>> +static inline int iommufd_liveupdate_register_lufs(void)
>> +{
>> +	return 0;
>> +}
>> +
>> +static inline int iommufd_liveupdate_unregister_lufs(void)
>> +{
>> +	return 0;
>> +}
>> +
>>  static inline int iommufd_hwpt_lu_set_preserve(struct iommufd_ucmd *ucmd)
>>  {
>>  	return -ENOTTY;
>>  }
>> +
>> +static inline bool iopt_lu_map_immutable(const struct io_pagetable *iopt)
>> +{
>> +	return false;
>> +}
>>  #endif
>>
>>  #ifdef CONFIG_IOMMUFD_TEST
>> diff --git a/drivers/iommu/iommufd/liveupdate.c b/drivers/iommu/iommufd/liveupdate.c
>> index ae74f5b54735..ec11ae345fe7 100644
>> --- a/drivers/iommu/iommufd/liveupdate.c
>> +++ b/drivers/iommu/iommufd/liveupdate.c
>> @@ -4,9 +4,15 @@
>>
>>  #include <linux/file.h>
>>  #include <linux/iommufd.h>
>> +#include <linux/kexec_handover.h>
>> +#include <linux/kho/abi/iommufd.h>
>>  #include <linux/liveupdate.h>
>> +#include <linux/iommu-lu.h>
>> +#include <linux/mm.h>
>> +#include <linux/pci.h>
>>
>>  #include "iommufd_private.h"
>> +#include "io_pagetable.h"
>>
>>  int iommufd_hwpt_lu_set_preserve(struct iommufd_ucmd *ucmd)
>>  {
>> @@ -47,3 +53,297 @@ int iommufd_hwpt_lu_set_preserve(struct iommufd_ucmd *ucmd)
>>  	return rc;
>>  }
>>
>> +static void iommufd_set_ioas_mutable(struct iommufd_ctx *ictx)
>> +{
>> +	struct iommufd_object *obj;
>> +	struct iommufd_ioas *ioas;
>> +	unsigned long index;
>> +
>> +	xa_lock(&ictx->objects);
>> +	xa_for_each(&ictx->objects, index, obj) {
>> +		if (obj->type != IOMMUFD_OBJ_IOAS)
>> +			continue;
>> +
>> +		ioas = container_of(obj, struct iommufd_ioas, obj);
>> +
>> +		/*
>> +		 * Not taking any IOAS lock here. All writers take LUO
>> +		 * session mutex, and this writer racing with readers is not
>> +		 * really a problem.
>> +		 */
>> +		WRITE_ONCE(ioas->iopt.lu_map_immutable, false);
>> +	}
>> +	xa_unlock(&ictx->objects);
>> +}
>> +
>> +static int check_iopt_pages_preserved(struct liveupdate_session *s,
>> +				      struct iommufd_hwpt_paging *hwpt)
>> +{
>> +	u32 req_seals = F_SEAL_SEAL | F_SEAL_GROW | F_SEAL_SHRINK;
>> +	struct iopt_area *area;
>> +	int ret;
>> +
>> +	for (area = iopt_area_iter_first(&hwpt->ioas->iopt, 0, ULONG_MAX); area;
>> +	     area = iopt_area_iter_next(area, 0, ULONG_MAX)) {
>> +		struct iopt_pages *pages = area->pages;
>> +
>> +		/* Only allow file based mapping */
>> +		if (pages->type != IOPT_ADDRESS_FILE)
>> +			return -EINVAL;
>
>That's an important check! Should we document it somewhere for the user
>too? Perhaps in the uAPI header for the preserve IOCTL, so VMM authors
>are aware of the anonymous-memory restriction upfront?

Agreed. Will add this in next revision.
>
>> +
>> +		/*
>> +		 * When this memory file was mapped, it should have been sealed,
>> +		 * with the seals themselves sealed (F_SEAL_SEAL). This means
>> +		 * the memory file has not grown or shrunk since the mapping
>> +		 * was made, and the pages in use until now remain pinned and
>> +		 * preserved.
>> +		 */
>> +		if ((pages->seals & req_seals) != req_seals)
>> +			return -EINVAL;
>> +
>> +		/* Make sure that the file was preserved. */
>> +		ret = liveupdate_get_token_outgoing(s, pages->file, NULL);
>> +		if (ret)
>> +			return ret;
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +static int iommufd_save_hwpts(struct iommufd_ctx *ictx,
>> +			      struct iommufd_lu *iommufd_lu,
>> +			      struct liveupdate_session *session)
>> +{
>> +	struct iommufd_hwpt_paging *hwpt, **hwpts = NULL;
>> +	struct iommu_domain_ser *domain_ser;
>> +	struct iommufd_hwpt_lu *hwpt_lu;
>> +	struct iommufd_object *obj;
>> +	unsigned int nr_hwpts = 0;
>> +	unsigned long index;
>> +	unsigned int i;
>> +	int rc = 0;
>> +
>> +	if (iommufd_lu) {
>> +		hwpts = kcalloc(iommufd_lu->nr_hwpts, sizeof(*hwpts),
>> +				GFP_KERNEL);
>> +		if (!hwpts)
>> +			return -ENOMEM;
>> +	}
>> +
>> +	xa_lock(&ictx->objects);
>> +	xa_for_each(&ictx->objects, index, obj) {
>> +		if (obj->type != IOMMUFD_OBJ_HWPT_PAGING)
>> +			continue;
>> +
>> +		hwpt = container_of(obj, struct iommufd_hwpt_paging, common.obj);
>> +		if (!hwpt->lu_preserve)
>> +			continue;
>> +
>> +		if (hwpt->ioas) {
>> +			/*
>> +			 * Obtain exclusive access to the IOAS and IOPT while we
>> +			 * set immutability
>> +			 */
>> +			mutex_lock(&hwpt->ioas->mutex);
>
>Doubling down on Ankit's comment here, we should absolutely avoid taking
>sleepable locks in atomic context. I can attempt testing the next series
>with lockdep enabled.
>
>> +			down_write(&hwpt->ioas->iopt.domains_rwsem);
>> +			down_write(&hwpt->ioas->iopt.iova_rwsem);
>> +
>> +			hwpt->ioas->iopt.lu_map_immutable = true;
>> +
>> +			up_write(&hwpt->ioas->iopt.iova_rwsem);
>> +			up_write(&hwpt->ioas->iopt.domains_rwsem);
>> +			mutex_unlock(&hwpt->ioas->mutex);
>
>Additionally, I'm wondering if this could be a helper function?
>iommufd_ioas_set_immutable?

I will have to restructure this part of the code to not take the mutex
in atomic context. I will move it to a helper function in the next
revision if there is an opportunity to do that.
>
>> +		}
>> +
>> +		if (!hwpt->common.domain) {
>> +			rc = -EINVAL;
>> +			xa_unlock(&ictx->objects);
>> +			goto out;
>> +		}
>> +
>> +		if (!iommufd_lu) {
>> +			rc = check_iopt_pages_preserved(session, hwpt);
>
>This seems slightly redundant: we are calling check_iopt_pages_preserved
>for every single preserved HWPT. However, multiple HWPTs can share the
>same underlying IOAS.

Agreed. I think this check and the earlier immutable thing need to be
done at IOPT level. I think the rework for that should also remove the
redundant checks here.
>
>If a VMM has 5 devices sharing a single IOAS, this code will walk the
>exact same iopt_area_itree and validate the exact same memfd seals 5
>times. Should we perhaps track which iopts have already been validated
>during the loop to avoid this redundant tree traversal?
>
>> +			if (rc) {
>> +				xa_unlock(&ictx->objects);
>> +				goto out;
>> +			}
>> +		} else if (iommufd_lu) {
>> +			hwpts[nr_hwpts] = hwpt;
>> +			hwpt_lu = &iommufd_lu->hwpts[nr_hwpts];
>> +
>> +			hwpt_lu->token = hwpt->lu_token;
>> +			hwpt_lu->reclaimed = false;
>> +		}
>> +
>> +		nr_hwpts++;
>> +	}
>> +	xa_unlock(&ictx->objects);
>> +
>> +	if (WARN_ON(iommufd_lu && iommufd_lu->nr_hwpts != nr_hwpts)) {
>> +		rc = -EFAULT;
>> +		goto out;
>> +	}
>> +
>> +	if (iommufd_lu) {
>> +		/*
>> +		 * iommu_domain_preserve may sleep and must be called
>> +		 * outside of xa_lock
>> +		 */
>> +		for (i = 0; i < nr_hwpts; i++) {
>> +			hwpt = hwpts[i];
>> +			hwpt_lu = &iommufd_lu->hwpts[i];
>> +
>> +			rc = iommu_domain_preserve(hwpt->common.domain, &domain_ser);
>> +			if (rc < 0)
>> +				goto out;
>> +
>> +			hwpt_lu->domain_data = __pa(domain_ser);
>> +		}
>> +	}
>> +
>> +	rc = nr_hwpts;
>> +
>> +out:
>> +	kfree(hwpts);
>> +	return rc;
>> +}
>> +
>> +static int iommufd_liveupdate_preserve(struct liveupdate_file_op_args *args)
>> +{
>> +	struct iommufd_ctx *ictx = iommufd_ctx_from_file(args->file);
>> +	struct iommufd_lu *iommufd_lu;
>> +	size_t serial_size;
>> +	void *mem;
>> +	int rc;
>> +
>> +	if (IS_ERR(ictx))
>> +		return PTR_ERR(ictx);
>> +
>> +	rc = iommufd_save_hwpts(ictx, NULL, args->session);
>> +	if (rc < 0)
>> +		goto err_ioas_mutable;
>> +
>> +	serial_size = struct_size(iommufd_lu, hwpts, rc);
>> +
>> +	mem = kho_alloc_preserve(serial_size);
>> +	if (!mem) {
>> +		rc = -ENOMEM;
>> +		goto err_ioas_mutable;
>> +	}
>> +
>> +	iommufd_lu = mem;
>> +	iommufd_lu->nr_hwpts = rc;
>> +	rc = iommufd_save_hwpts(ictx, iommufd_lu, args->session);
>
>We call iommufd_save_hwpts twice (first to count/validate & second to
>serialize). Because xa_lock is dropped between these two calls to
>allocate KHO memory, we have a TOCTOU race.

Agreed. Will handle this in next revision.
>
>If userspace concurrently creates or destroys a preserved HWPT during
>this window, nr_hwpts will change, and the second pass will hit:
>WARN_ON(iommufd_lu && iommufd_lu->nr_hwpts != nr_hwpts)
>
>I'm not sure if that's a situation to WARN? Also, what about kernels
>which have panic_on_warn?
>
>> +	if (rc < 0)
>> +		goto err_free;
>> +
>> +	args->serialized_data = virt_to_phys(iommufd_lu);
>> +	iommufd_ctx_put(ictx);
>> +	return 0;
>> +
>> +err_free:
>> +	kho_unpreserve_free(mem);
>> +err_ioas_mutable:
>> +	iommufd_set_ioas_mutable(ictx);
>> +	iommufd_ctx_put(ictx);
>> +	return rc;
>> +}
>> +
>> +static int iommufd_liveupdate_freeze(struct liveupdate_file_op_args *args)
>> +{
>> +	/* No-Op; everything should be made read-only */
>
>Is there a plan to populate this in a future series? If it's a permanent
>No-Op because the lu_map_immutable flag already handles the read-only
>enforcement during the preserve phase, we should probably update the
>comment to explicitly state that, rather than implying there is missing
>logic?

Agreed. Will update the comment.
>
>> +	return 0;
>> +}
>> +
>
>[ ---- >8 ----- ]
>
>> diff --git a/include/linux/kho/abi/iommufd.h b/include/linux/kho/abi/iommufd.h
>> new file mode 100644
>> index 000000000000..f7393ac78aa9
>> --- /dev/null
>> +++ b/include/linux/kho/abi/iommufd.h
>> @@ -0,0 +1,39 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +
>> +/*
>> + * Copyright (C) 2025, Google LLC
>> + * Author: Samiullah Khawaja <skhawaja@google.com>
>> + */
>> +
>> +#ifndef _LINUX_KHO_ABI_IOMMUFD_H
>> +#define _LINUX_KHO_ABI_IOMMUFD_H
>> +
>> +#include <linux/mutex_types.h>
>> +#include <linux/compiler.h>
>> +#include <linux/types.h>
>> +
>> +/**
>> + * DOC: IOMMUFD Live Update ABI
>> + *
>> + * This header defines the ABI for preserving the state of an IOMMUFD file
>> + * across a kexec reboot using LUO.
>> + *
>> + * This interface is a contract. Any modification to any of the serialization
>> + * structs defined here constitutes a breaking change. Such changes require
>> + * incrementing the version number in the IOMMUFD_LUO_COMPATIBLE string.
>> + */
>> +
>> +#define IOMMUFD_LUO_COMPATIBLE "iommufd-v1"
>> +
>> +struct iommufd_hwpt_lu {
>> +	u32 token;
>> +	u64 domain_data;
>> +	bool reclaimed;
>> +} __packed;
>> +
>
>Nit: Because of __packed, putting the u64 after the u32 forces an
>unaligned 8-byte access at offset 4. Also, it's generally safer to use
>explicitly sized types like u8 instead of bool for cross-kernel ABI
>structs to avoid any compiler-specific sizing ambiguities. (This is the
>reason most uAPI defs avoid using bool)
>
>We can achieve natural alignment without implicit padding by ordering it
>from largest to smallest and explicitly padding it out:
>
>struct iommufd_hwpt_lu {
>	__u64 domain_data;
>	__u32 token;
>	__u8 reclaimed;
>	__u8 padding[3];
>} __packed;
>
>This guarantees an exact 16-byte footprint with perfectly aligned 64-bit
>and 32-bit accesses.
>
>> +struct iommufd_lu {
>> +	unsigned int nr_hwpts;
>> +	struct iommufd_hwpt_lu hwpts[];
>> +};
>
>Same here regarding explicitly sized types and packing (as pointed by
>Vipin too)
>
>> +
>> +#endif /* _LINUX_KHO_ABI_IOMMUFD_H */
>
>Thanks,
>Praan

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 10/14] iommufd-lu: Implement ioctl to let userspace mark an HWPT to be preserved
  2026-03-25 20:19         ` Samiullah Khawaja
@ 2026-03-25 20:36           ` Pranjal Shrivastava
  2026-03-25 20:46             ` Samiullah Khawaja
  0 siblings, 1 reply; 98+ messages in thread
From: Pranjal Shrivastava @ 2026-03-25 20:36 UTC (permalink / raw)
  To: Samiullah Khawaja
  Cc: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe, YiFei Zhu, Robin Murphy, Kevin Tian,
	Alex Williamson, Shuah Khan, iommu, linux-kernel, kvm,
	Saeed Mahameed, Adithya Jayachandran, Parav Pandit,
	Leon Romanovsky, William Tu, Pratyush Yadav, Pasha Tatashin,
	David Matlack, Andrew Morton, Chris Li, Vipin Sharma

On Wed, Mar 25, 2026 at 08:19:05PM +0000, Samiullah Khawaja wrote:
> On Wed, Mar 25, 2026 at 06:55:36PM +0000, Pranjal Shrivastava wrote:
> > On Wed, Mar 25, 2026 at 05:31:46PM +0000, Samiullah Khawaja wrote:
> > > On Wed, Mar 25, 2026 at 02:37:37PM +0000, Pranjal Shrivastava wrote:
> > > > On Tue, Feb 03, 2026 at 10:09:44PM +0000, Samiullah Khawaja wrote:
> > > > > From: YiFei Zhu <zhuyifei@google.com>
> > > > >
> > > > > Userspace provides a token, which will then be used at restore to
> > > > > identify this HWPT. The restoration logic is not implemented and will be
> > > > > added later.
> > > > >
> > > > > Signed-off-by: YiFei Zhu <zhuyifei@google.com>
> > > > > Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
> > > > > ---
> > > > >  drivers/iommu/iommufd/Makefile          |  1 +
> > > > >  drivers/iommu/iommufd/iommufd_private.h | 13 +++++++
> > > > >  drivers/iommu/iommufd/liveupdate.c      | 49 +++++++++++++++++++++++++
> > > > >  drivers/iommu/iommufd/main.c            |  2 +
> > > > >  include/uapi/linux/iommufd.h            | 19 ++++++++++
> > > > >  5 files changed, 84 insertions(+)
> > > > >  create mode 100644 drivers/iommu/iommufd/liveupdate.c
> > > > >
> > > > > diff --git a/drivers/iommu/iommufd/Makefile b/drivers/iommu/iommufd/Makefile
> > > > > index 71d692c9a8f4..c3bf0b6452d3 100644
> > > > > --- a/drivers/iommu/iommufd/Makefile
> > > > > +++ b/drivers/iommu/iommufd/Makefile
> > > > > @@ -17,3 +17,4 @@ obj-$(CONFIG_IOMMUFD_DRIVER) += iova_bitmap.o
> > > > >
> > > > >  iommufd_driver-y := driver.o
> > > > >  obj-$(CONFIG_IOMMUFD_DRIVER_CORE) += iommufd_driver.o
> > > > > +obj-$(CONFIG_IOMMU_LIVEUPDATE) += liveupdate.o
> > > > > diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
> > > > > index eb6d1a70f673..6424e7cea5b2 100644
> > > > > --- a/drivers/iommu/iommufd/iommufd_private.h
> > > > > +++ b/drivers/iommu/iommufd/iommufd_private.h
> > > > > @@ -374,6 +374,10 @@ struct iommufd_hwpt_paging {
> > > > >  	bool auto_domain : 1;
> > > > >  	bool enforce_cache_coherency : 1;
> > > > >  	bool nest_parent : 1;
> > > > > +#ifdef CONFIG_IOMMU_LIVEUPDATE
> > > > > +	bool lu_preserve : 1;
> > > > > +	u32 lu_token;
> > > >
> > > > Did we downsize the token? Shouldn't this be u64 as everywhere else?
> > > 
> > > Note that this is different from the token that is used to preserve the
> > > FD into LUO. This token is used to mark the HWPT for preservation, that
> > > is it will be preserved when the FD is preserved.
> > > 
> > > I will add more text in the commit message to make it clear.
> > > 
> > > For consistency I will make it u64.
> > 
> > I understand that it's logically distinct from the FD preservation token.
> > However, userspace likely won't implement a separate 32-bit token
> > generator just for IOMMUFD Live Update. I assume it'll just use the
> > same 64-bit restore-token allocator. Keeping this as u64 prevents them
> > from having to downcast or manage a separate ID space just for this IOCTL.
> 
> Agreed.
> > 
> > > >
> > > > > +#endif
> > > > >  	/* Head at iommufd_ioas::hwpt_list */
> > > > >  	struct list_head hwpt_item;
> > > > >  	struct iommufd_sw_msi_maps present_sw_msi;
> > > > > @@ -707,6 +711,15 @@ iommufd_get_vdevice(struct iommufd_ctx *ictx, u32 id)
> > > > >  			    struct iommufd_vdevice, obj);
> > > > >  }
> > > > >
> > > > > +#ifdef CONFIG_IOMMU_LIVEUPDATE
> > > > > +int iommufd_hwpt_lu_set_preserve(struct iommufd_ucmd *ucmd);
> > > > > +#else
> > > > > +static inline int iommufd_hwpt_lu_set_preserve(struct iommufd_ucmd *ucmd)
> > > > > +{
> > > > > +	return -ENOTTY;
> > > > > +}
> > > > > +#endif
> > > > > +
> > > > >  #ifdef CONFIG_IOMMUFD_TEST
> > > > >  int iommufd_test(struct iommufd_ucmd *ucmd);
> > > > >  void iommufd_selftest_destroy(struct iommufd_object *obj);
> > > > > diff --git a/drivers/iommu/iommufd/liveupdate.c b/drivers/iommu/iommufd/liveupdate.c
> > > > > new file mode 100644
> > > > > index 000000000000..ae74f5b54735
> > > > > --- /dev/null
> > > > > +++ b/drivers/iommu/iommufd/liveupdate.c
> > > > > @@ -0,0 +1,49 @@
> > > > > +// SPDX-License-Identifier: GPL-2.0-only
> > > > > +
> > > > > +#define pr_fmt(fmt) "iommufd: " fmt
> > > > > +
> > > > > +#include <linux/file.h>
> > > > > +#include <linux/iommufd.h>
> > > > > +#include <linux/liveupdate.h>
> > > > > +
> > > > > +#include "iommufd_private.h"
> > > > > +
> > > > > +int iommufd_hwpt_lu_set_preserve(struct iommufd_ucmd *ucmd)
> > > > > +{
> > > > > +	struct iommu_hwpt_lu_set_preserve *cmd = ucmd->cmd;
> > > > > +	struct iommufd_hwpt_paging *hwpt_target, *hwpt;
> > > > > +	struct iommufd_ctx *ictx = ucmd->ictx;
> > > > > +	struct iommufd_object *obj;
> > > > > +	unsigned long index;
> > > > > +	int rc = 0;
> > > > > +
> > > > > +	hwpt_target = iommufd_get_hwpt_paging(ucmd, cmd->hwpt_id);
> > > > > +	if (IS_ERR(hwpt_target))
> > > > > +		return PTR_ERR(hwpt_target);
> > > > > +
> > > > > +	xa_lock(&ictx->objects);
> > > > > +	xa_for_each(&ictx->objects, index, obj) {
> > > > > +		if (obj->type != IOMMUFD_OBJ_HWPT_PAGING)
> > > > > +			continue;
> > > >
> > > > Couldn't these be HWPT_NESTED? Are we explicitly skipping HWPT_NESTED
> > > > here? ARM SMMUv3 heavily relies on IOMMU_DOMAIN_NESTED to back vIOMMUs
> > > > and hold critical guest translation state. We'd need to support
> > > > HWPT_NESTED for arm-smmu-v3.
> > > 
> > > For this series, I am not handling the NESTED and vIOMMU use cases. I
> > > will be sending a separate series to handle those; this is mentioned in
> > > the cover letter under Future work.
> > > 
> > > Will add a note in commit message also.
> > 
> > I see, I missed it in the cover letter. Shall we add a TODO mentioning
> > that we'll support NESTED too? (No strong feelings about this or
> > adding stuff to the commit message too.. the cover letter mentions it)
> > 
> > > >
> > > > > +
> > > > > +		hwpt = container_of(obj, struct iommufd_hwpt_paging, common.obj);
> > > > > +
> > > > > +		if (hwpt == hwpt_target)
> > > > > +			continue;
> > > > > +		if (!hwpt->lu_preserve)
> > > > > +			continue;
> > > > > +		if (hwpt->lu_token == cmd->hwpt_token) {
> > > > > +			rc = -EADDRINUSE;
> > > > > +			goto out;
> > > > > +		}
> > > >
> > > > I see that this entire loop is to avoid collisions but could we improve
> > > > this? We are doing an O(N) linear search over the entire ictx->objects
> > > > xarray while holding xa_lock on every setup call.
> > > >
> > > > If the kernel requires a strict 1:1 mapping of lu_token to hwpt,
> > > > wouldn't it be much better to track these in a dedicated xarray?
> > > >
> > > > Just thinking out loud, if we added a dedicated lu_tokens xarray to
> > > > iommufd_ctx, we could drop the linear search and the lock entirely,
> > > > letting the xarray handle the collision natively like this:
> > > >
> > > > 	rc = xa_insert(&ictx->lu_tokens, cmd->hwpt_token, hwpt_target, GFP_KERNEL);
> > > > 	if (rc == -EBUSY) {
> > > > 		rc = -EADDRINUSE;
> > > > 		goto out;
> > > > 	} else if (rc) {
> > > > 		goto out;
> > > > 	}
> > > >
> > > > This ensures instant collision detection without iterating the global
> > > > object pool. When the HWPT is eventually destroyed (or un-preserved), we
> > > > simply call xa_erase(&ictx->lu_tokens, hwpt->lu_token).
> > > 
> > > Agreed. We can call xa_erase when it is destroyed. This can also be used
> > > during actual preservation without taking the objects lock.
> > 
> > Awesome!
> > 
> > > >
> > > > > +	}
> > > > > +
> > > > > +	hwpt_target->lu_preserve = true;
> > > >
> > > > I don't see a way to unset hwpt->lu_preserve once it's been set. What if
> > > > a VMM marks a HWPT for preservation, but then the guest decides to rmmod
> > > > the device before the actual kexec? The VMM would need a way to
> > > > unpreserve it so we don't carry stale state across the live update?
> > > >
> > > > Are we relying on the VMM to always call IOMMU_DESTROY on that HWPT when
> > > > it's no longer needed for preservation? A clever VMM optimizing for perf
> > > > might just pool or cache detached HWPTs for future reuse. If that HWPT
> > > > goes back into a free pool and gets re-attached to a new device later,
> > > > the sticky lu_preserve state will inadvertently leak across the kexec..
> > > 
> > > As mentioned earlier, the HWPT is not being preserved in this call. So
> > > when the VMM dies or rmmod happens, this HWPT will be destroyed following
> > > the normal flow.
> > > 
> > 
> > I think there might be a slight disconnect regarding the "normal flow"
> > of HWPT destruction. My concern isn't about the VMM dying or a simple 1:1
> > teardown. My concern is about a VMM that deliberately avoids calling
> > IOMMU_DESTROY to optimize allocations.
> > 
> > The iommufd UAPI already explicitly supports the HWPT pooling model.
> > The IOMMU_DEVICE_ATTACH ioctl takes a pt_id, allowing a VMM to
> > pre-allocate an HWPT and then 'point' various devices at it over time.
> > (Note that detaching a device from a HWPT attaches it to a blocked
> > domain.)
> > 
> > If a VMM uses a free-list/cache for its HWPTs, a guest hot-unplug will
> > cause the VMM to detach the device, but the VMM will keep the HWPT alive
> > in userspace for future reuse.
> > 
> > If that happens, the HWPT is now sitting in the VMM's free pool, but the
> > kernel still has it permanently flagged with lu_preserve = true. When
> > the VMM later pulls that HWPT from the pool to attach to a new device
> > (which might not need preservation), there is no way for the VMM to
> > UNMARK it for preservation.
> 
> Interesting.. My thinking is that a VMM that is aware of the live update
> use case should be responsible for its own object lifecycle. It should
> simply discard such an HWPT rather than returning it to a free-list.
> 
> My concern with adding an unpreserve ioctl is that it forces a lot of
> complex lifecycle tracking into the kernel, especially around the new
> locking that would be needed to handle races between parallel iommufd
> preserve/unpreserve calls.
> 
> Given that complexity, I think the cleaner approach is to avoid the new
> ioctl and keep the kernel-side implementation simpler.

Fair point, in that case let's just make sure this contract is explicit.
We should clearly document in the UAPI that there is no way to "UNMARK"
an HWPT for preservation and that VMMs implementing HWPT pooling or
dynamic attachment must destroy preserved HWPTs rather than returning 
them to a free-list if they want liveupdate support.

Once that's documented so userspace authors know exactly what the UAPI
expects of them, I'm on board with this approach.

[ ------ >8 ------ ]

> > > 
> > > LU_PRESERVE would imply that it is being preserved. Maybe
> > > "IOMMU_HWPT_LU_MARK_PRESERVE"?
> > 
> > Yup, sounds good! Thanks
> > 
> > > >
> > > > > + * @size: sizeof(struct iommu_hwpt_lu_set_preserve)
> > > > > + * @hwpt_id: Iommufd object ID of the target HWPT
> > > > > + * @hwpt_token: Token to identify this hwpt upon restore
> > > > > + *
> > > > > + * The target HWPT will be preserved during iommufd preservation.
> > > > > + *
> > > > > + * The hwpt_token is provided by userspace. If userspace enters a token
> > > > > + * already in use within this iommufd, -EADDRINUSE is returned from this ioctl.
> > > > > + */
> > > > > +struct iommu_hwpt_lu_set_preserve {
> > > > > +	__u32 size;
> > > > > +	__u32 hwpt_id;
> > > > > +	__u32 hwpt_token;
> > > > > +};
> > > >
> > > > Nit: Let's make sure we follow the 64-bit alignment as enforced in the
> > > > rest of this file, note the __u32 __reserved fields in existing IOCTL
> > > > structs.
> > > 
> > > Agreed. Will update
 
Thanks,
Praan

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 12/14] iommufd: Add APIs to preserve/unpreserve a vfio cdev
  2026-03-25 20:24   ` Pranjal Shrivastava
@ 2026-03-25 20:41     ` Samiullah Khawaja
  2026-03-25 21:23       ` Pranjal Shrivastava
  0 siblings, 1 reply; 98+ messages in thread
From: Samiullah Khawaja @ 2026-03-25 20:41 UTC (permalink / raw)
  To: Pranjal Shrivastava
  Cc: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe, Robin Murphy, Kevin Tian, Alex Williamson,
	Shuah Khan, iommu, linux-kernel, kvm, Saeed Mahameed,
	Adithya Jayachandran, Parav Pandit, Leon Romanovsky, William Tu,
	Pratyush Yadav, Pasha Tatashin, David Matlack, Andrew Morton,
	Chris Li, Vipin Sharma, YiFei Zhu

On Wed, Mar 25, 2026 at 08:24:24PM +0000, Pranjal Shrivastava wrote:
>On Tue, Feb 03, 2026 at 10:09:46PM +0000, Samiullah Khawaja wrote:
>> Add APIs that can be used to preserve and unpreserve a vfio cdev. Use
>> the APIs exported by the IOMMU core to preserve/unpreserve device. Pass
>> the LUO preservation token of the attached iommufd into IOMMU preserve
>> device API. This establishes the ownership of the device with the
>> preserved iommufd.
>>
>> Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
>> ---
>>  drivers/iommu/iommufd/device.c | 69 ++++++++++++++++++++++++++++++++++
>>  include/linux/iommufd.h        | 23 ++++++++++++
>>  2 files changed, 92 insertions(+)
>>
>> diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
>> index 4c842368289f..30cb5218093b 100644
>> --- a/drivers/iommu/iommufd/device.c
>> +++ b/drivers/iommu/iommufd/device.c
>> @@ -2,6 +2,7 @@
>>  /* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES
>>   */
>>  #include <linux/iommu.h>
>> +#include <linux/iommu-lu.h>
>>  #include <linux/iommufd.h>
>>  #include <linux/pci-ats.h>
>>  #include <linux/slab.h>
>> @@ -1661,3 +1662,71 @@ int iommufd_get_hw_info(struct iommufd_ucmd *ucmd)
>>  	iommufd_put_object(ucmd->ictx, &idev->obj);
>>  	return rc;
>>  }
>> +
>> +#ifdef CONFIG_IOMMU_LIVEUPDATE
>> +int iommufd_device_preserve(struct liveupdate_session *s,
>> +			    struct iommufd_device *idev,
>> +			    u64 *tokenp)
>> +{
>> +	struct iommufd_group *igroup = idev->igroup;
>> +	struct iommufd_hwpt_paging *hwpt_paging;
>> +	struct iommufd_hw_pagetable *hwpt;
>> +	struct iommufd_attach *attach;
>> +	int ret;
>> +
>> +	mutex_lock(&igroup->lock);
>> +	attach = xa_load(&igroup->pasid_attach, IOMMU_NO_PASID);
>
>By explicitly looking up IOMMU_NO_PASID, we skip any PASID attachments
>the device might have. Since PASID live update is NOT supported in this
>series, should we check if the pasid_attach xarray contains anything
>other than IOMMU_NO_PASID and return -EOPNOTSUPP?
>
>Otherwise, we silently fail to preserve those domains without informing
>the VMM?

The VMM should be able to preserve the NO_PASID domains even if it has
PASID attachments. This is the intended behaviour; I will document it in
the uAPI docs.
>
>> +	if (!attach) {
>> +		ret = -ENOENT;
>> +		goto out;
>> +	}
>> +
>> +	hwpt = attach->hwpt;
>> +	hwpt_paging = find_hwpt_paging(hwpt);
>> +	if (!hwpt_paging || !hwpt_paging->lu_preserve) {
>> +		ret = -EINVAL;
>> +		goto out;
>> +	}
>> +
>> +	ret = liveupdate_get_token_outgoing(s, idev->ictx->file, tokenp);
>> +	if (ret)
>> +		goto out;
>> +
>> +	ret = iommu_preserve_device(hwpt_paging->common.domain,
>> +				    idev->dev,
>> +				    *tokenp);
>> +out:
>> +	mutex_unlock(&igroup->lock);
>> +	return ret;
>> +}
>> +EXPORT_SYMBOL_NS_GPL(iommufd_device_preserve, "IOMMUFD");
>> +
>> +void iommufd_device_unpreserve(struct liveupdate_session *s,
>> +			       struct iommufd_device *idev,
>> +			       u64 token)
>> +{
>> +	struct iommufd_group *igroup = idev->igroup;
>> +	struct iommufd_hwpt_paging *hwpt_paging;
>> +	struct iommufd_hw_pagetable *hwpt;
>> +	struct iommufd_attach *attach;
>> +
>> +	mutex_lock(&igroup->lock);
>> +	attach = xa_load(&igroup->pasid_attach, IOMMU_NO_PASID);
>> +	if (!attach) {
>> +		WARN_ON(-ENOENT);
>
>WARN_ON takes a condition.. if we want this to be printed always, why
>not WARN(1, "...")? What's the significance of having -ENOENT as a
>condition?

Will update this.
>
>> +		goto out;
>> +	}
>> +
>> +	hwpt = attach->hwpt;
>> +	hwpt_paging = find_hwpt_paging(hwpt);
>> +	if (!hwpt_paging || !hwpt_paging->lu_preserve) {
>> +		WARN_ON(-EINVAL);
>
>Same here for -EINVAL?

Same here.
>
>> +		goto out;
>> +	}
>> +
>> +	iommu_unpreserve_device(hwpt_paging->common.domain, idev->dev);
>> +out:
>> +	mutex_unlock(&igroup->lock);
>> +}
>> +EXPORT_SYMBOL_NS_GPL(iommufd_device_unpreserve, "IOMMUFD");
>> +#endif
>
>[ ----- >8 ----- ]
>
>Thanks,
>Praan

Thanks,
Sami

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 10/14] iommufd-lu: Implement ioctl to let userspace mark an HWPT to be preserved
  2026-03-25 20:36           ` Pranjal Shrivastava
@ 2026-03-25 20:46             ` Samiullah Khawaja
  0 siblings, 0 replies; 98+ messages in thread
From: Samiullah Khawaja @ 2026-03-25 20:46 UTC (permalink / raw)
  To: Pranjal Shrivastava
  Cc: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe, YiFei Zhu, Robin Murphy, Kevin Tian,
	Alex Williamson, Shuah Khan, iommu, linux-kernel, kvm,
	Saeed Mahameed, Adithya Jayachandran, Parav Pandit,
	Leon Romanovsky, William Tu, Pratyush Yadav, Pasha Tatashin,
	David Matlack, Andrew Morton, Chris Li, Vipin Sharma

On Wed, Mar 25, 2026 at 08:36:20PM +0000, Pranjal Shrivastava wrote:
>On Wed, Mar 25, 2026 at 08:19:05PM +0000, Samiullah Khawaja wrote:
>> On Wed, Mar 25, 2026 at 06:55:36PM +0000, Pranjal Shrivastava wrote:
>> > On Wed, Mar 25, 2026 at 05:31:46PM +0000, Samiullah Khawaja wrote:
>> > > On Wed, Mar 25, 2026 at 02:37:37PM +0000, Pranjal Shrivastava wrote:
>> > > > On Tue, Feb 03, 2026 at 10:09:44PM +0000, Samiullah Khawaja wrote:
>> > > > > From: YiFei Zhu <zhuyifei@google.com>
>> > > > >
>> > > > > Userspace provides a token, which will then be used at restore to
>> > > > > identify this HWPT. The restoration logic is not implemented and will be
>> > > > > added later.
>> > > > >
>> > > > > Signed-off-by: YiFei Zhu <zhuyifei@google.com>
>> > > > > Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
>> > > > > ---
>> > > > >  drivers/iommu/iommufd/Makefile          |  1 +
>> > > > >  drivers/iommu/iommufd/iommufd_private.h | 13 +++++++
>> > > > >  drivers/iommu/iommufd/liveupdate.c      | 49 +++++++++++++++++++++++++
>> > > > >  drivers/iommu/iommufd/main.c            |  2 +
>> > > > >  include/uapi/linux/iommufd.h            | 19 ++++++++++
>> > > > >  5 files changed, 84 insertions(+)
>> > > > >  create mode 100644 drivers/iommu/iommufd/liveupdate.c
>> > > > >
>> > > > > diff --git a/drivers/iommu/iommufd/Makefile b/drivers/iommu/iommufd/Makefile
>> > > > > index 71d692c9a8f4..c3bf0b6452d3 100644
>> > > > > --- a/drivers/iommu/iommufd/Makefile
>> > > > > +++ b/drivers/iommu/iommufd/Makefile
>> > > > > @@ -17,3 +17,4 @@ obj-$(CONFIG_IOMMUFD_DRIVER) += iova_bitmap.o
>> > > > >
>> > > > >  iommufd_driver-y := driver.o
>> > > > >  obj-$(CONFIG_IOMMUFD_DRIVER_CORE) += iommufd_driver.o
>> > > > > +obj-$(CONFIG_IOMMU_LIVEUPDATE) += liveupdate.o
>> > > > > diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
>> > > > > index eb6d1a70f673..6424e7cea5b2 100644
>> > > > > --- a/drivers/iommu/iommufd/iommufd_private.h
>> > > > > +++ b/drivers/iommu/iommufd/iommufd_private.h
>> > > > > @@ -374,6 +374,10 @@ struct iommufd_hwpt_paging {
>> > > > >  	bool auto_domain : 1;
>> > > > >  	bool enforce_cache_coherency : 1;
>> > > > >  	bool nest_parent : 1;
>> > > > > +#ifdef CONFIG_IOMMU_LIVEUPDATE
>> > > > > +	bool lu_preserve : 1;
>> > > > > +	u32 lu_token;
>> > > >
>> > > > Did we downsize the token? Shouldn't this be u64 as everywhere else?
>> > >
>> > > Note that this is different from the token that is used to preserve the
>> > > FD into LUO. This token is used to mark the HWPT for preservation, that
>> > > is it will be preserved when the FD is preserved.
>> > >
>> > > I will add more text in the commit message to make it clear.
>> > >
>> > > For consistency I will make it u64.
>> >
>> > I understand that it's logically distinct from the FD preservation token.
>> > However, userspace likely won't implement a separate 32-bit token
>> > generator just for IOMMUFD Live Update. I assume it'll just use the
>> > same 64-bit restore-token allocator. Keeping this as u64 prevents them
>> > from having to downcast or manage a separate ID space just for this IOCTL.
>>
>> Agreed.
>> >
>> > > >
>> > > > > +#endif
>> > > > >  	/* Head at iommufd_ioas::hwpt_list */
>> > > > >  	struct list_head hwpt_item;
>> > > > >  	struct iommufd_sw_msi_maps present_sw_msi;
>> > > > > @@ -707,6 +711,15 @@ iommufd_get_vdevice(struct iommufd_ctx *ictx, u32 id)
>> > > > >  			    struct iommufd_vdevice, obj);
>> > > > >  }
>> > > > >
>> > > > > +#ifdef CONFIG_IOMMU_LIVEUPDATE
>> > > > > +int iommufd_hwpt_lu_set_preserve(struct iommufd_ucmd *ucmd);
>> > > > > +#else
>> > > > > +static inline int iommufd_hwpt_lu_set_preserve(struct iommufd_ucmd *ucmd)
>> > > > > +{
>> > > > > +	return -ENOTTY;
>> > > > > +}
>> > > > > +#endif
>> > > > > +
>> > > > >  #ifdef CONFIG_IOMMUFD_TEST
>> > > > >  int iommufd_test(struct iommufd_ucmd *ucmd);
>> > > > >  void iommufd_selftest_destroy(struct iommufd_object *obj);
>> > > > > diff --git a/drivers/iommu/iommufd/liveupdate.c b/drivers/iommu/iommufd/liveupdate.c
>> > > > > new file mode 100644
>> > > > > index 000000000000..ae74f5b54735
>> > > > > --- /dev/null
>> > > > > +++ b/drivers/iommu/iommufd/liveupdate.c
>> > > > > @@ -0,0 +1,49 @@
>> > > > > +// SPDX-License-Identifier: GPL-2.0-only
>> > > > > +
>> > > > > +#define pr_fmt(fmt) "iommufd: " fmt
>> > > > > +
>> > > > > +#include <linux/file.h>
>> > > > > +#include <linux/iommufd.h>
>> > > > > +#include <linux/liveupdate.h>
>> > > > > +
>> > > > > +#include "iommufd_private.h"
>> > > > > +
>> > > > > +int iommufd_hwpt_lu_set_preserve(struct iommufd_ucmd *ucmd)
>> > > > > +{
>> > > > > +	struct iommu_hwpt_lu_set_preserve *cmd = ucmd->cmd;
>> > > > > +	struct iommufd_hwpt_paging *hwpt_target, *hwpt;
>> > > > > +	struct iommufd_ctx *ictx = ucmd->ictx;
>> > > > > +	struct iommufd_object *obj;
>> > > > > +	unsigned long index;
>> > > > > +	int rc = 0;
>> > > > > +
>> > > > > +	hwpt_target = iommufd_get_hwpt_paging(ucmd, cmd->hwpt_id);
>> > > > > +	if (IS_ERR(hwpt_target))
>> > > > > +		return PTR_ERR(hwpt_target);
>> > > > > +
>> > > > > +	xa_lock(&ictx->objects);
>> > > > > +	xa_for_each(&ictx->objects, index, obj) {
>> > > > > +		if (obj->type != IOMMUFD_OBJ_HWPT_PAGING)
>> > > > > +			continue;
>> > > >
>> > > > Couldn't these be HWPT_NESTED? Are we explicitly skipping HWPT_NESTED
>> > > > here? ARM SMMUv3 heavily relies on IOMMU_DOMAIN_NESTED to back vIOMMUs
>> > > > and hold critical guest translation state. We'd need to support
>> > > > HWPT_NESTED for arm-smmu-v3.
>> > >
>> > > For this series, I am not handling the NESTED and vIOMMU use cases. I
>> > > will be sending a separate series to handle those; this is mentioned in
>> > > the cover letter under Future work.
>> > >
>> > > Will add a note in commit message also.
>> >
>> > I see, I missed it in the cover letter. Shall we add a TODO mentioning
>> > that we'll support NESTED too? (No strong feelings about this or
>> > adding stuff to the commit message too.. the cover letter mentions it)
>> >
>> > > >
>> > > > > +
>> > > > > +		hwpt = container_of(obj, struct iommufd_hwpt_paging, common.obj);
>> > > > > +
>> > > > > +		if (hwpt == hwpt_target)
>> > > > > +			continue;
>> > > > > +		if (!hwpt->lu_preserve)
>> > > > > +			continue;
>> > > > > +		if (hwpt->lu_token == cmd->hwpt_token) {
>> > > > > +			rc = -EADDRINUSE;
>> > > > > +			goto out;
>> > > > > +		}
>> > > >
>> > > > I see that this entire loop is to avoid collisions but could we improve
>> > > > this? We are doing an O(N) linear search over the entire ictx->objects
>> > > > xarray while holding xa_lock on every setup call.
>> > > >
>> > > > If the kernel requires a strict 1:1 mapping of lu_token to hwpt,
>> > > > wouldn't it be much better to track these in a dedicated xarray?
>> > > >
>> > > > Just thinking out loud, if we added a dedicated lu_tokens xarray to
>> > > > iommufd_ctx, we could drop the linear search and the lock entirely,
>> > > > letting the xarray handle the collision natively like this:
>> > > >
>> > > > 	rc = xa_insert(&ictx->lu_tokens, cmd->hwpt_token, hwpt_target, GFP_KERNEL);
>> > > > 	if (rc == -EBUSY) {
>> > > > 		rc = -EADDRINUSE;
>> > > > 		goto out;
>> > > > 	} else if (rc) {
>> > > > 		goto out;
>> > > > 	}
>> > > >
>> > > > This ensures instant collision detection without iterating the global
>> > > > object pool. When the HWPT is eventually destroyed (or un-preserved), we
>> > > > simply call xa_erase(&ictx->lu_tokens, hwpt->lu_token).
>> > >
>> > > Agreed. We can call xa_erase when it is destroyed. This can also be used
>> > > during actual preservation without taking the objects lock.
>> >
>> > Awesome!
>> >
>> > > >
>> > > > > +	}
>> > > > > +
>> > > > > +	hwpt_target->lu_preserve = true;
>> > > >
>> > > > I don't see a way to unset hwpt->lu_preserve once it's been set. What if
>> > > > a VMM marks a HWPT for preservation, but then the guest decides to rmmod
>> > > > the device before the actual kexec? The VMM would need a way to
>> > > > unpreserve it so we don't carry stale state across the live update?
>> > > >
>> > > > Are we relying on the VMM to always call IOMMU_DESTROY on that HWPT when
>> > > > it's no longer needed for preservation? A clever VMM optimizing for perf
>> > > > might just pool or cache detached HWPTs for future reuse. If that HWPT
>> > > > goes back into a free pool and gets re-attached to a new device later,
>> > > > the sticky lu_preserve state will inadvertently leak across the kexec..
>> > >
>> > > As mentioned earlier, the HWPT is not being preserved in this call. So
>> > > when the VMM dies or rmmod happens, this HWPT will be destroyed following
>> > > the normal flow.
>> > >
>> >
>> > I think there might be a slight disconnect regarding the "normal flow"
>> > of HWPT destruction. My concern isn't about the VMM dying or a simple 1:1
>> > teardown. My concern is about a VMM that deliberately avoids calling
>> > IOMMU_DESTROY to optimize allocations.
>> >
>> > The iommufd UAPI already explicitly supports the HWPT pooling model.
>> > The IOMMU_DEVICE_ATTACH ioctl takes a pt_id, allowing a VMM to
>> > pre-allocate an HWPT and then 'point' various devices at it over time.
>> > (Note that detaching a device from a HWPT attaches it to a blocked
>> > domain.)
>> >
>> > If a VMM uses a free-list/cache for its HWPTs, a guest hot-unplug will
>> > cause the VMM to detach the device, but the VMM will keep the HWPT alive
>> > in userspace for future reuse.
>> >
>> > If that happens, the HWPT is now sitting in the VMM's free pool, but the
>> > kernel still has it permanently flagged with lu_preserve = true. When
>> > the VMM later pulls that HWPT from the pool to attach to a new device
>> > (which might not need preservation), there is no way for the VMM to
>> > UNMARK it for preservation.
>>
>> Interesting.. My thinking is that a VMM that is aware of the live update
>> use case should be responsible for its own object lifecycle. It should
>> simply discard such an HWPT rather than returning it to a free-list.
>>
>> My concern with adding an unpreserve ioctl is that it forces a lot of
>> complex lifecycle tracking into the kernel, especially around the new
>> locking that would be needed to handle races between parallel iommufd
>> preserve/unpreserve calls.
>>
>> Given that complexity, I think the cleaner approach is to avoid the new
>> ioctl and keep the kernel-side implementation simpler.
>
>Fair point, in that case let's just make sure this contract is explicit.
>We should clearly document in the UAPI that there is no way to "UNMARK"
>an HWPT for preservation and that VMMs implementing HWPT pooling or
>dynamic attachment must destroy preserved HWPTs rather than returning
>them to a free-list if they want liveupdate support.

That sounds good to me. I will add it to the uAPI docs in the next revision.
>
>Once that's documented so userspace authors know exactly what the UAPI
>expects of them, I'm on board with this approach.
>
>[ ------ >8 ------ ]
>
>> > >
>> > > LU_PRESERVE would imply that it is being preserved. Maybe
>> > > "IOMMU_HWPT_LU_MARK_PRESERVE"?
>> >
>> > Yup, sounds good! Thanks
>> >
>> > > >
>> > > > > + * @size: sizeof(struct iommu_hwpt_lu_set_preserve)
>> > > > > + * @hwpt_id: Iommufd object ID of the target HWPT
>> > > > > + * @hwpt_token: Token to identify this hwpt upon restore
>> > > > > + *
>> > > > > + * The target HWPT will be preserved during iommufd preservation.
>> > > > > + *
>> > > > > + * The hwpt_token is provided by userspace. If userspace enters a token
>> > > > > + * already in use within this iommufd, -EADDRINUSE is returned from this ioctl.
>> > > > > + */
>> > > > > +struct iommu_hwpt_lu_set_preserve {
>> > > > > +	__u32 size;
>> > > > > +	__u32 hwpt_id;
>> > > > > +	__u32 hwpt_token;
>> > > > > +};
>> > > >
>> > > > Nit: Let's make sure we follow the 64-bit alignment as enforced in the
>> > > > rest of this file, note the __u32 __reserved fields in existing IOCTL
>> > > > structs.
>> > >
>> > > Agreed. Will update
>
>Thanks,
>Praan

Thanks,
Sami

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 13/14] vfio/pci: Preserve the iommufd state of the vfio cdev
  2026-02-03 22:09 ` [PATCH 13/14] vfio/pci: Preserve the iommufd state of the " Samiullah Khawaja
  2026-02-17  4:18   ` Ankit Soni
  2026-03-23 21:17   ` Vipin Sharma
@ 2026-03-25 20:55   ` Pranjal Shrivastava
  2 siblings, 0 replies; 98+ messages in thread
From: Pranjal Shrivastava @ 2026-03-25 20:55 UTC (permalink / raw)
  To: Samiullah Khawaja
  Cc: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe, Robin Murphy, Kevin Tian, Alex Williamson,
	Shuah Khan, iommu, linux-kernel, kvm, Saeed Mahameed,
	Adithya Jayachandran, Parav Pandit, Leon Romanovsky, William Tu,
	Pratyush Yadav, Pasha Tatashin, David Matlack, Andrew Morton,
	Chris Li, Vipin Sharma, YiFei Zhu

On Tue, Feb 03, 2026 at 10:09:47PM +0000, Samiullah Khawaja wrote:
> If the vfio cdev is attached to an iommufd, preserve the state of the
> attached iommufd also. Basically preserve the iommu state of the device
> and also the attached domain. The token returned by the preservation API
> will be used to restore/rebind to the iommufd state after liveupdate.
> 
> Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
> ---
>  drivers/vfio/pci/vfio_pci_liveupdate.c | 28 +++++++++++++++++++++++++-
>  include/linux/kho/abi/vfio_pci.h       | 10 +++++++++
>  2 files changed, 37 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_liveupdate.c b/drivers/vfio/pci/vfio_pci_liveupdate.c
> index c52d6bdb455f..af6fbfb7a65c 100644
> --- a/drivers/vfio/pci/vfio_pci_liveupdate.c
> +++ b/drivers/vfio/pci/vfio_pci_liveupdate.c
> @@ -15,6 +15,7 @@
>  #include <linux/liveupdate.h>
>  #include <linux/errno.h>
>  #include <linux/vfio.h>
> +#include <linux/iommufd.h>
>  
>  #include "vfio_pci_priv.h"
>  
> @@ -39,6 +40,7 @@ static int vfio_pci_liveupdate_preserve(struct liveupdate_file_op_args *args)
>  	struct vfio_pci_core_device_ser *ser;
>  	struct vfio_pci_core_device *vdev;
>  	struct pci_dev *pdev;
> +	u64 token = 0;
>  
>  	vdev = container_of(device, struct vfio_pci_core_device, vdev);
>  	pdev = vdev->pdev;
> @@ -49,15 +51,32 @@ static int vfio_pci_liveupdate_preserve(struct liveupdate_file_op_args *args)
>  	if (vfio_pci_is_intel_display(pdev))
>  		return -EINVAL;
>  
> +#if CONFIG_IOMMU_LIVEUPDATE

Did we mean to use #ifdef here instead of #if?

> +	/* If iommufd is attached, preserve the underlying domain */
> +	if (device->iommufd_attached) {
> +		int err = iommufd_device_preserve(args->session,
> +						  device->iommufd_device,
> +						  &token);
> +		if (err < 0)
> +			return err;
> +	}
> +#endif
> +
>  	ser = kho_alloc_preserve(sizeof(*ser));
> -	if (IS_ERR(ser))
> +	if (IS_ERR(ser)) {
> +		if (device->iommufd_attached)
> +			iommufd_device_unpreserve(args->session,
> +						  device->iommufd_device, token);

A few minor things here: we've protected the preserve call above with an
#ifdef but not this unpreserve in the cleanup path. Also, it looks like
the previous patch wraps both of these functions in #ifdefs and defines
static inline versions of them in the #else branch, so the call sites
here shouldn't need guarding at all?

> +
>  		return PTR_ERR(ser);
> +	}
>  
>  	pci_liveupdate_outgoing_preserve(pdev);
>  
>  	ser->bdf = pci_dev_id(pdev);
>  	ser->domain = pci_domain_nr(pdev->bus);
>  	ser->reset_works = vdev->reset_works;
> +	ser->iommufd_ser.token = token;
>  
>  	args->serialized_data = virt_to_phys(ser);
>  	return 0;
> @@ -66,6 +85,13 @@ static int vfio_pci_liveupdate_preserve(struct liveupdate_file_op_args *args)
>  static void vfio_pci_liveupdate_unpreserve(struct liveupdate_file_op_args *args)
>  {
>  	struct vfio_device *device = vfio_device_from_file(args->file);
> +	struct vfio_pci_core_device_ser *ser;
> +
> +	ser = phys_to_virt(args->serialized_data);
> +	if (device->iommufd_attached)
> +		iommufd_device_unpreserve(args->session,
> +					  device->iommufd_device,
> +					  ser->iommufd_ser.token);
>  
>  	pci_liveupdate_outgoing_unpreserve(to_pci_dev(device->dev));
>  	kho_unpreserve_free(phys_to_virt(args->serialized_data));
> diff --git a/include/linux/kho/abi/vfio_pci.h b/include/linux/kho/abi/vfio_pci.h
> index 6c3d3c6dfc09..d01bd58711c2 100644
> --- a/include/linux/kho/abi/vfio_pci.h
> +++ b/include/linux/kho/abi/vfio_pci.h
> @@ -28,6 +28,15 @@
>  
>  #define VFIO_PCI_LUO_FH_COMPATIBLE "vfio-pci-v1"
>  
> +/**
> + * struct vfio_iommufd_ser - Serialized state of the attached iommufd.
> + *
> + * @token: The token of the bound iommufd state.
> + */
> +struct vfio_iommufd_ser {
> +	u32 token;
> +} __packed;
> +
>  /**
>   * struct vfio_pci_core_device_ser - Serialized state of a single VFIO PCI
>   * device.
> @@ -40,6 +49,7 @@ struct vfio_pci_core_device_ser {
>  	u16 bdf;
>  	u16 domain;
>  	u8 reset_works;
> +	struct vfio_iommufd_ser iommufd_ser;
>  } __packed;
>  

Another bit of struct alignment geekery: when vfio_iommufd_ser.token is
widened to __u64 (as pointed out in other comments), placing it right
after `reset_works` forces an unaligned 8-byte access at offset 5.
Ideally, we should add explicit padding (e.g. __u8 padding[3]) before
iommufd_ser to ensure natural alignment.

Because __packed prevents the compiler from automatically aligning 
fields, placing a 64-bit integer at an odd offset forces the CPU to
perform an unaligned memory read. While some archs handle this with a 
performance penalty, other archs could outright fault on unaligned 64-bit
accesses. Explicit padding guarantees safe, portable access across all
archs while maintaining the strict memory layout required by the KHO ABI.

Thanks,
Praan

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 14/14] iommufd/selftest: Add test to verify iommufd preservation
  2026-02-03 22:09 ` [PATCH 14/14] iommufd/selftest: Add test to verify iommufd preservation Samiullah Khawaja
  2026-03-23 22:18   ` Vipin Sharma
@ 2026-03-25 21:05   ` Pranjal Shrivastava
  2026-03-27 18:25     ` Samiullah Khawaja
  1 sibling, 1 reply; 98+ messages in thread
From: Pranjal Shrivastava @ 2026-03-25 21:05 UTC (permalink / raw)
  To: Samiullah Khawaja
  Cc: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe, Robin Murphy, Kevin Tian, Alex Williamson,
	Shuah Khan, iommu, linux-kernel, kvm, Saeed Mahameed,
	Adithya Jayachandran, Parav Pandit, Leon Romanovsky, William Tu,
	Pratyush Yadav, Pasha Tatashin, David Matlack, Andrew Morton,
	Chris Li, Vipin Sharma, YiFei Zhu

On Tue, Feb 03, 2026 at 10:09:48PM +0000, Samiullah Khawaja wrote:
> Test iommufd preservation by setting up an iommufd and vfio cdev and
> preserve it across live update. Test takes VFIO cdev path of a device
> bound to vfio-pci driver and binds it to an iommufd being preserved. It
> also preserves the vfio cdev so the iommufd state associated with it is
> also preserved.
> 
> The restore path is tested by restoring the preserved vfio cdev only.
> Test tries to finish the session without restoring iommufd and confirms
> that it fails.
> 
> Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
> Signed-off-by: YiFei Zhu <zhuyifei@google.com>
> ---
>  tools/testing/selftests/iommu/Makefile        |  12 +
>  .../selftests/iommu/iommufd_liveupdate.c      | 209 ++++++++++++++++++
>  2 files changed, 221 insertions(+)
>  create mode 100644 tools/testing/selftests/iommu/iommufd_liveupdate.c
> 
> diff --git a/tools/testing/selftests/iommu/Makefile b/tools/testing/selftests/iommu/Makefile
> index 84abeb2f0949..263195af4d6a 100644
> --- a/tools/testing/selftests/iommu/Makefile
> +++ b/tools/testing/selftests/iommu/Makefile
> @@ -7,4 +7,16 @@ TEST_GEN_PROGS :=
>  TEST_GEN_PROGS += iommufd
>  TEST_GEN_PROGS += iommufd_fail_nth
>  
> +TEST_GEN_PROGS_EXTENDED += iommufd_liveupdate
> +
>  include ../lib.mk
> +include ../liveupdate/lib/libliveupdate.mk
> +
> +CFLAGS += -I$(top_srcdir)/tools/include
> +CFLAGS += -MD
> +CFLAGS += $(EXTRA_CFLAGS)
> +
> +$(TEST_GEN_PROGS_EXTENDED): %: %.o $(LIBLIVEUPDATE_O)
> +	$(CC) $(CFLAGS) $(CPPFLAGS) $(LDFLAGS) $(TARGET_ARCH) $< $(LIBLIVEUPDATE_O) $(LDLIBS) -static -o $@
> +
> +EXTRA_CLEAN += $(LIBLIVEUPDATE_O)
> diff --git a/tools/testing/selftests/iommu/iommufd_liveupdate.c b/tools/testing/selftests/iommu/iommufd_liveupdate.c
> new file mode 100644
> index 000000000000..8b4ea9f2b7e9
> --- /dev/null
> +++ b/tools/testing/selftests/iommu/iommufd_liveupdate.c
> @@ -0,0 +1,209 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +
> +/*
> + * Copyright (c) 2025, Google LLC.
> + * Samiullah Khawaja <skhawaja@google.com>
> + */
> +
> +#include <fcntl.h>
> +#include <sys/ioctl.h>
> +#include <sys/mman.h>
> +#include <stdbool.h>
> +#include <unistd.h>
> +
> +#define __EXPORTED_HEADERS__
> +#include <linux/iommufd.h>
> +#include <linux/types.h>
> +#include <linux/vfio.h>
> +#include <linux/sizes.h>
> +#include <libliveupdate.h>
> +
> +#include "../kselftest.h"
> +
> +#define ksft_assert(condition) \
> +	do { if (!(condition)) \
> +	ksft_exit_fail_msg("Failed: %s at %s %d: %s\n", \
> +	#condition, __FILE__, __LINE__, strerror(errno)); } while (0)
> +
> +int setup_cdev(const char *vfio_cdev_path)
> +{
> +	int cdev_fd;
> +
> +	cdev_fd = open(vfio_cdev_path, O_RDWR);
> +	if (cdev_fd < 0)
> +		ksft_exit_skip("Failed to open VFIO cdev: %s\n", vfio_cdev_path);
> +
> +	return cdev_fd;
> +}
> +
> +int open_iommufd(void)
> +{
> +	int iommufd;
> +
> +	iommufd = open("/dev/iommu", O_RDWR);
> +	if (iommufd < 0)
> +		ksft_exit_skip("Failed to open /dev/iommu. IOMMUFD support not enabled.\n");
> +
> +	return iommufd;
> +}
> +
> +int setup_iommufd(int iommufd, int memfd, int cdev_fd, int hwpt_token)
> +{
> +	int ret;
> +
> +	struct vfio_device_bind_iommufd bind = {
> +		.argsz = sizeof(bind),
> +		.flags = 0,
> +	};
> +	struct iommu_ioas_alloc alloc_data = {
> +		.size = sizeof(alloc_data),
> +		.flags = 0,
> +	};
> +	struct iommu_hwpt_alloc hwpt_alloc = {
> +		.size = sizeof(hwpt_alloc),
> +		.flags = 0,
> +	};
> +	struct vfio_device_attach_iommufd_pt attach_data = {
> +		.argsz = sizeof(attach_data),
> +		.flags = 0,
> +	};
> +	struct iommu_hwpt_lu_set_preserve set_preserve = {
> +		.size = sizeof(set_preserve),
> +		.hwpt_token = hwpt_token,
> +	};
> +	struct iommu_ioas_map_file map_file = {
> +		.size = sizeof(map_file),
> +		.length = SZ_1M,
> +		.flags = IOMMU_IOAS_MAP_WRITEABLE | IOMMU_IOAS_MAP_READABLE,
> +		.iova = SZ_4G,
> +		.fd = memfd,
> +		.start = 0,
> +	};
> +
> +	bind.iommufd = iommufd;
> +	ret = ioctl(cdev_fd, VFIO_DEVICE_BIND_IOMMUFD, &bind);
> +	ksft_assert(!ret);
> +
> +	ret = ioctl(iommufd, IOMMU_IOAS_ALLOC, &alloc_data);
> +	ksft_assert(!ret);
> +
> +	hwpt_alloc.dev_id = bind.out_devid;
> +	hwpt_alloc.pt_id = alloc_data.out_ioas_id;
> +	ret = ioctl(iommufd, IOMMU_HWPT_ALLOC, &hwpt_alloc);
> +	ksft_assert(!ret);
> +
> +	attach_data.pt_id = hwpt_alloc.out_hwpt_id;
> +	ret = ioctl(cdev_fd, VFIO_DEVICE_ATTACH_IOMMUFD_PT, &attach_data);
> +	ksft_assert(!ret);
> +
> +	map_file.ioas_id = alloc_data.out_ioas_id;
> +	ret = ioctl(iommufd, IOMMU_IOAS_MAP_FILE, &map_file);
> +	ksft_assert(!ret);
> +
> +	set_preserve.hwpt_id = attach_data.pt_id;
> +	ret = ioctl(iommufd, IOMMU_HWPT_LU_SET_PRESERVE, &set_preserve);
> +	ksft_assert(!ret);
> +
> +	return ret;
> +}
> +
> +static int create_sealed_memfd(size_t size)
> +{
> +	int fd, ret;
> +
> +	fd = memfd_create("buffer", MFD_ALLOW_SEALING);
> +	ksft_assert(fd > 0);
> +
> +	ret = ftruncate(fd, size);
> +	ksft_assert(!ret);
> +
> +	ret = fcntl(fd, F_ADD_SEALS,
> +		    F_SEAL_GROW | F_SEAL_SHRINK | F_SEAL_SEAL);
> +	ksft_assert(!ret);
> +
> +	return fd;
> +}
> +
> +int main(int argc, char *argv[])

The iommufd selftest directory heavily utilizes the standard kselftest
harness (TEST_F, EXPECT_EQ, ASSERT_EQ). By writing a raw main(), this
patch circumvents the standard test reporting structure (e.g. returning
1 on argc < 2 instead of calling ksft_exit_skip(), and returning 0 at
the end instead of ksft_exit_pass()). We should rewrite this cleanly to
follow the existing iommufd test infrastructure using TEST_F blocks.

> +{
> +	int iommufd, cdev_fd, memfd, luo, session, ret;
> +	const int token = 0x123456;
> +	const int cdev_token = 0x654321;
> +	const int hwpt_token = 0x789012;
> +	const int memfd_token = 0x890123;

Shouldn't these be u64? Defining them as signed ints undermines that
contract.

> +
> +	if (argc < 2) {
> +		printf("Usage: ./iommufd_liveupdate <vfio_cdev_path>\n");
> +		return 1;
> +	}
> +
> +	luo = luo_open_device();
> +	ksft_assert(luo > 0);
> +
> +	session = luo_retrieve_session(luo, "iommufd-test");

[ ---- >8 ---- ]

Thanks,
Praan

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 12/14] iommufd: Add APIs to preserve/unpreserve a vfio cdev
  2026-03-25 20:41     ` Samiullah Khawaja
@ 2026-03-25 21:23       ` Pranjal Shrivastava
  2026-03-26  0:16         ` Samiullah Khawaja
  0 siblings, 1 reply; 98+ messages in thread
From: Pranjal Shrivastava @ 2026-03-25 21:23 UTC (permalink / raw)
  To: Samiullah Khawaja
  Cc: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe, Robin Murphy, Kevin Tian, Alex Williamson,
	Shuah Khan, iommu, linux-kernel, kvm, Saeed Mahameed,
	Adithya Jayachandran, Parav Pandit, Leon Romanovsky, William Tu,
	Pratyush Yadav, Pasha Tatashin, David Matlack, Andrew Morton,
	Chris Li, Vipin Sharma, YiFei Zhu

On Wed, Mar 25, 2026 at 08:41:46PM +0000, Samiullah Khawaja wrote:
> On Wed, Mar 25, 2026 at 08:24:24PM +0000, Pranjal Shrivastava wrote:
> > On Tue, Feb 03, 2026 at 10:09:46PM +0000, Samiullah Khawaja wrote:
> > > Add APIs that can be used to preserve and unpreserve a vfio cdev. Use
> > > the APIs exported by the IOMMU core to preserve/unpreserve device. Pass
> > > the LUO preservation token of the attached iommufd into IOMMU preserve
> > > device API. This establishes the ownership of the device with the
> > > preserved iommufd.
> > > 
> > > Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
> > > ---
> > >  drivers/iommu/iommufd/device.c | 69 ++++++++++++++++++++++++++++++++++
> > >  include/linux/iommufd.h        | 23 ++++++++++++
> > >  2 files changed, 92 insertions(+)
> > > 
> > > diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
> > > index 4c842368289f..30cb5218093b 100644
> > > --- a/drivers/iommu/iommufd/device.c
> > > +++ b/drivers/iommu/iommufd/device.c
> > > @@ -2,6 +2,7 @@
> > >  /* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES
> > >   */
> > >  #include <linux/iommu.h>
> > > +#include <linux/iommu-lu.h>
> > >  #include <linux/iommufd.h>
> > >  #include <linux/pci-ats.h>
> > >  #include <linux/slab.h>
> > > @@ -1661,3 +1662,71 @@ int iommufd_get_hw_info(struct iommufd_ucmd *ucmd)
> > >  	iommufd_put_object(ucmd->ictx, &idev->obj);
> > >  	return rc;
> > >  }
> > > +
> > > +#ifdef CONFIG_IOMMU_LIVEUPDATE
> > > +int iommufd_device_preserve(struct liveupdate_session *s,
> > > +			    struct iommufd_device *idev,
> > > +			    u64 *tokenp)
> > > +{
> > > +	struct iommufd_group *igroup = idev->igroup;
> > > +	struct iommufd_hwpt_paging *hwpt_paging;
> > > +	struct iommufd_hw_pagetable *hwpt;
> > > +	struct iommufd_attach *attach;
> > > +	int ret;
> > > +
> > > +	mutex_lock(&igroup->lock);
> > > +	attach = xa_load(&igroup->pasid_attach, IOMMU_NO_PASID);
> > 
> > By explicitly looking up IOMMU_NO_PASID, we skip any PASID attachments
> > the device might have. Since PASID live update is NOT supported in this
> > series, should we check if the pasid_attach xarray contains anything
> > other than IOMMU_NO_PASID and return -EOPNOTSUPP?
> > 
> > Otherwise, we silently fail to preserve those domains without informing
> > the VMM?
> 
> VMM should be able to preserve the NO_PASID domains even if it has PASID
> attachments. This is the intended behaviour, I will document it in the
> uAPI docs.

I think I'm miscommunicating here. My concern isn't about whether the
kernel can mechanically preserve the NO_PASID domain when PASID 
attachments exist. I agree that part works fine.

My concern is purely about silent state loss. If a VMM asks the kernel
to preserve a device, it expects the entire IOMMU state for that device
to be safely handed over. If the kernel silently skips the PASID
attachments and returns success (0), the VMM on the new kernel will wake
up assuming those PASIDs are still perfectly intact. When the guest 
attempts a PASID-tagged DMA, it will unexpectedly fault.

So the question is: how strictly should the kernel protect userspace
from this footgun? A few options that I can see:

1. Rely on uAPI docs
2. Fail the preserve ioctl (-EOPNOTSUPP) if active PASID attachments
   are detected.
3. Add an opt-in flag: We could add a flag to the ioctl 
   (IOMMU_LU_FLAG_IGNORE_PASID) so userspace has to explicitly
   acknowledge the state drop?

Options 2 or 3 are especially important when we consider backwards
compatibility. If this series is merged in 7.2 with the "silent drop"
behavior now, then when full PASID live update support is eventually
added in a future kernel, userspace will have no robust way to know
whether it's running on a kernel that preserves PASIDs or silently
drops them. By returning an error or requiring a flag now, we reserve
the right to cleanly implement the feature later without breaking the
UAPI contract.

This is an open question from me; I'm okay with any of the three
options, but I'd like to know what the maintainers think about this as
well.

[ ---- >8 ----- ]

Thanks,
Praan

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 12/14] iommufd: Add APIs to preserve/unpreserve a vfio cdev
  2026-03-25 21:23       ` Pranjal Shrivastava
@ 2026-03-26  0:16         ` Samiullah Khawaja
  0 siblings, 0 replies; 98+ messages in thread
From: Samiullah Khawaja @ 2026-03-26  0:16 UTC (permalink / raw)
  To: Pranjal Shrivastava
  Cc: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe, Robin Murphy, Kevin Tian, Alex Williamson,
	Shuah Khan, iommu, linux-kernel, kvm, Saeed Mahameed,
	Adithya Jayachandran, Parav Pandit, Leon Romanovsky, William Tu,
	Pratyush Yadav, Pasha Tatashin, David Matlack, Andrew Morton,
	Chris Li, Vipin Sharma, YiFei Zhu

On Wed, Mar 25, 2026 at 09:23:13PM +0000, Pranjal Shrivastava wrote:
>On Wed, Mar 25, 2026 at 08:41:46PM +0000, Samiullah Khawaja wrote:
>> On Wed, Mar 25, 2026 at 08:24:24PM +0000, Pranjal Shrivastava wrote:
>> > On Tue, Feb 03, 2026 at 10:09:46PM +0000, Samiullah Khawaja wrote:
>> > > Add APIs that can be used to preserve and unpreserve a vfio cdev. Use
>> > > the APIs exported by the IOMMU core to preserve/unpreserve device. Pass
>> > > the LUO preservation token of the attached iommufd into IOMMU preserve
>> > > device API. This establishes the ownership of the device with the
>> > > preserved iommufd.
>> > >
>> > > Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
>> > > ---
>> > >  drivers/iommu/iommufd/device.c | 69 ++++++++++++++++++++++++++++++++++
>> > >  include/linux/iommufd.h        | 23 ++++++++++++
>> > >  2 files changed, 92 insertions(+)
>> > >
>> > > diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
>> > > index 4c842368289f..30cb5218093b 100644
>> > > --- a/drivers/iommu/iommufd/device.c
>> > > +++ b/drivers/iommu/iommufd/device.c
>> > > @@ -2,6 +2,7 @@
>> > >  /* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES
>> > >   */
>> > >  #include <linux/iommu.h>
>> > > +#include <linux/iommu-lu.h>
>> > >  #include <linux/iommufd.h>
>> > >  #include <linux/pci-ats.h>
>> > >  #include <linux/slab.h>
>> > > @@ -1661,3 +1662,71 @@ int iommufd_get_hw_info(struct iommufd_ucmd *ucmd)
>> > >  	iommufd_put_object(ucmd->ictx, &idev->obj);
>> > >  	return rc;
>> > >  }
>> > > +
>> > > +#ifdef CONFIG_IOMMU_LIVEUPDATE
>> > > +int iommufd_device_preserve(struct liveupdate_session *s,
>> > > +			    struct iommufd_device *idev,
>> > > +			    u64 *tokenp)
>> > > +{
>> > > +	struct iommufd_group *igroup = idev->igroup;
>> > > +	struct iommufd_hwpt_paging *hwpt_paging;
>> > > +	struct iommufd_hw_pagetable *hwpt;
>> > > +	struct iommufd_attach *attach;
>> > > +	int ret;
>> > > +
>> > > +	mutex_lock(&igroup->lock);
>> > > +	attach = xa_load(&igroup->pasid_attach, IOMMU_NO_PASID);
>> >
>> > By explicitly looking up IOMMU_NO_PASID, we skip any PASID attachments
>> > the device might have. Since PASID live update is NOT supported in this
>> > series, should we check if the pasid_attach xarray contains anything
>> > other than IOMMU_NO_PASID and return -EOPNOTSUPP?
>> >
>> > Otherwise, we silently fail to preserve those domains without informing
>> > the VMM?
>>
>> VMM should be able to preserve the NO_PASID domains even if it has PASID
>> attachments. This is the intended behaviour, I will document it in the
>> uAPI docs.
>
>I think I'm miscommunicating here. My concern isn't about whether the
>kernel can mechanically preserve the NO_PASID domain when PASID
>attachments exist. I agree that part works fine.
>
>My concern is purely about silent state loss. If a VMM asks the kernel
>to preserve a device, it expects the entire IOMMU state for that device
>to be safely handed over. If the kernel silently skips the PASID
>attachments and returns success (0), the VMM on the new kernel will wake
>up assuming those PASIDs are still perfectly intact. When the guest
>attempts a PASID-tagged DMA, it will unexpectedly fault.

This is a valid concern about silent state loss, and I should have been
more elaborate in my reply earlier.

My thinking is that PASID preservation should be a granular and opt-in
feature.

- PRESERVE_DEVICE only preserves the IOMMU_NO_PASID attachment.
- A future patch would add a mechanism to mark a PASID attachment for
   preservation. The VMM would have to call this for each specific PASID
   it wants to preserve.
>
>So the question is: how strictly should the kernel protect userspace
>from this footgun? A few options that I can see:
>
>1. Rely on uAPI docs
>2. Fail the preserve ioctl (-EOPNOTSUPP) if active PASID attachments
>   are detected.
>3. Add an opt-in flag: We could add a flag to the ioctl
>   (IOMMU_LU_FLAG_IGNORE_PASID) so userspace has to explicitly
>   acknowledge the state drop?
>
>Options 2 or 3 are especially important when we consider backwards
>compatibility. If this series is merged in 7.2 with the "silent drop"
>behavior now, then when full PASID live update support is eventually
>added in a future kernel, userspace will have no robust way to know
>whether it's running on a kernel that preserves PASIDs or silently
>drops them. By returning an error or requiring a flag now, we reserve
>the right to cleanly implement the feature later without breaking the
>UAPI contract.

I think these are valid points and I will return -EOPNOTSUPP if active
PASID attachments are detected.

I will update this in next revision.
>
>This is an open question from me; I'm okay with any of the three
>options, but I'd like to know what the maintainers think about this as
>well.
>
>[ ---- >8 ----- ]
>
>Thanks,
>Praan

Thanks,
Sami

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 14/14] iommufd/selftest: Add test to verify iommufd preservation
  2026-03-25 21:05   ` Pranjal Shrivastava
@ 2026-03-27 18:25     ` Samiullah Khawaja
  2026-03-27 18:40       ` Samiullah Khawaja
  0 siblings, 1 reply; 98+ messages in thread
From: Samiullah Khawaja @ 2026-03-27 18:25 UTC (permalink / raw)
  To: Pranjal Shrivastava
  Cc: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe, Robin Murphy, Kevin Tian, Alex Williamson,
	Shuah Khan, iommu, linux-kernel, kvm, Saeed Mahameed,
	Adithya Jayachandran, Parav Pandit, Leon Romanovsky, William Tu,
	Pratyush Yadav, Pasha Tatashin, David Matlack, Andrew Morton,
	Chris Li, Vipin Sharma, YiFei Zhu

On Wed, Mar 25, 2026 at 09:05:32PM +0000, Pranjal Shrivastava wrote:
>On Tue, Feb 03, 2026 at 10:09:48PM +0000, Samiullah Khawaja wrote:
>> Test iommufd preservation by setting up an iommufd and vfio cdev and
>> preserve it across live update. Test takes VFIO cdev path of a device
>> bound to vfio-pci driver and binds it to an iommufd being preserved. It
>> also preserves the vfio cdev so the iommufd state associated with it is
>> also preserved.
>>
>> The restore path is tested by restoring the preserved vfio cdev only.
>> Test tries to finish the session without restoring iommufd and confirms
>> that it fails.
>>
>> Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
>> Signed-off-by: YiFei Zhu <zhuyifei@google.com>
>> ---
>>  tools/testing/selftests/iommu/Makefile        |  12 +
>>  .../selftests/iommu/iommufd_liveupdate.c      | 209 ++++++++++++++++++
>>  2 files changed, 221 insertions(+)
>>  create mode 100644 tools/testing/selftests/iommu/iommufd_liveupdate.c
>>
>> diff --git a/tools/testing/selftests/iommu/Makefile b/tools/testing/selftests/iommu/Makefile
>> index 84abeb2f0949..263195af4d6a 100644
>> --- a/tools/testing/selftests/iommu/Makefile
>> +++ b/tools/testing/selftests/iommu/Makefile
>> @@ -7,4 +7,16 @@ TEST_GEN_PROGS :=
>>  TEST_GEN_PROGS += iommufd
>>  TEST_GEN_PROGS += iommufd_fail_nth
>>
>> +TEST_GEN_PROGS_EXTENDED += iommufd_liveupdate
>> +
>>  include ../lib.mk
>> +include ../liveupdate/lib/libliveupdate.mk
>> +
>> +CFLAGS += -I$(top_srcdir)/tools/include
>> +CFLAGS += -MD
>> +CFLAGS += $(EXTRA_CFLAGS)
>> +
>> +$(TEST_GEN_PROGS_EXTENDED): %: %.o $(LIBLIVEUPDATE_O)
>> +	$(CC) $(CFLAGS) $(CPPFLAGS) $(LDFLAGS) $(TARGET_ARCH) $< $(LIBLIVEUPDATE_O) $(LDLIBS) -static -o $@
>> +
>> +EXTRA_CLEAN += $(LIBLIVEUPDATE_O)
>> diff --git a/tools/testing/selftests/iommu/iommufd_liveupdate.c b/tools/testing/selftests/iommu/iommufd_liveupdate.c
>> new file mode 100644
>> index 000000000000..8b4ea9f2b7e9
>> --- /dev/null
>> +++ b/tools/testing/selftests/iommu/iommufd_liveupdate.c
>> @@ -0,0 +1,209 @@
>> +// SPDX-License-Identifier: GPL-2.0-only
>> +
>> +/*
>> + * Copyright (c) 2025, Google LLC.
>> + * Samiullah Khawaja <skhawaja@google.com>
>> + */
>> +
>> +#include <fcntl.h>
>> +#include <sys/ioctl.h>
>> +#include <sys/mman.h>
>> +#include <stdbool.h>
>> +#include <unistd.h>
>> +
>> +#define __EXPORTED_HEADERS__
>> +#include <linux/iommufd.h>
>> +#include <linux/types.h>
>> +#include <linux/vfio.h>
>> +#include <linux/sizes.h>
>> +#include <libliveupdate.h>
>> +
>> +#include "../kselftest.h"
>> +
>> +#define ksft_assert(condition) \
>> +	do { if (!(condition)) \
>> +	ksft_exit_fail_msg("Failed: %s at %s %d: %s\n", \
>> +	#condition, __FILE__, __LINE__, strerror(errno)); } while (0)
>> +
>> +int setup_cdev(const char *vfio_cdev_path)
>> +{
>> +	int cdev_fd;
>> +
>> +	cdev_fd = open(vfio_cdev_path, O_RDWR);
>> +	if (cdev_fd < 0)
>> +		ksft_exit_skip("Failed to open VFIO cdev: %s\n", vfio_cdev_path);
>> +
>> +	return cdev_fd;
>> +}
>> +
>> +int open_iommufd(void)
>> +{
>> +	int iommufd;
>> +
>> +	iommufd = open("/dev/iommu", O_RDWR);
>> +	if (iommufd < 0)
>> +		ksft_exit_skip("Failed to open /dev/iommu. IOMMUFD support not enabled.\n");
>> +
>> +	return iommufd;
>> +}
>> +
>> +int setup_iommufd(int iommufd, int memfd, int cdev_fd, int hwpt_token)
>> +{
>> +	int ret;
>> +
>> +	struct vfio_device_bind_iommufd bind = {
>> +		.argsz = sizeof(bind),
>> +		.flags = 0,
>> +	};
>> +	struct iommu_ioas_alloc alloc_data = {
>> +		.size = sizeof(alloc_data),
>> +		.flags = 0,
>> +	};
>> +	struct iommu_hwpt_alloc hwpt_alloc = {
>> +		.size = sizeof(hwpt_alloc),
>> +		.flags = 0,
>> +	};
>> +	struct vfio_device_attach_iommufd_pt attach_data = {
>> +		.argsz = sizeof(attach_data),
>> +		.flags = 0,
>> +	};
>> +	struct iommu_hwpt_lu_set_preserve set_preserve = {
>> +		.size = sizeof(set_preserve),
>> +		.hwpt_token = hwpt_token,
>> +	};
>> +	struct iommu_ioas_map_file map_file = {
>> +		.size = sizeof(map_file),
>> +		.length = SZ_1M,
>> +		.flags = IOMMU_IOAS_MAP_WRITEABLE | IOMMU_IOAS_MAP_READABLE,
>> +		.iova = SZ_4G,
>> +		.fd = memfd,
>> +		.start = 0,
>> +	};
>> +
>> +	bind.iommufd = iommufd;
>> +	ret = ioctl(cdev_fd, VFIO_DEVICE_BIND_IOMMUFD, &bind);
>> +	ksft_assert(!ret);
>> +
>> +	ret = ioctl(iommufd, IOMMU_IOAS_ALLOC, &alloc_data);
>> +	ksft_assert(!ret);
>> +
>> +	hwpt_alloc.dev_id = bind.out_devid;
>> +	hwpt_alloc.pt_id = alloc_data.out_ioas_id;
>> +	ret = ioctl(iommufd, IOMMU_HWPT_ALLOC, &hwpt_alloc);
>> +	ksft_assert(!ret);
>> +
>> +	attach_data.pt_id = hwpt_alloc.out_hwpt_id;
>> +	ret = ioctl(cdev_fd, VFIO_DEVICE_ATTACH_IOMMUFD_PT, &attach_data);
>> +	ksft_assert(!ret);
>> +
>> +	map_file.ioas_id = alloc_data.out_ioas_id;
>> +	ret = ioctl(iommufd, IOMMU_IOAS_MAP_FILE, &map_file);
>> +	ksft_assert(!ret);
>> +
>> +	set_preserve.hwpt_id = attach_data.pt_id;
>> +	ret = ioctl(iommufd, IOMMU_HWPT_LU_SET_PRESERVE, &set_preserve);
>> +	ksft_assert(!ret);
>> +
>> +	return ret;
>> +}
>> +
>> +static int create_sealed_memfd(size_t size)
>> +{
>> +	int fd, ret;
>> +
>> +	fd = memfd_create("buffer", MFD_ALLOW_SEALING);
>> +	ksft_assert(fd > 0);
>> +
>> +	ret = ftruncate(fd, size);
>> +	ksft_assert(!ret);
>> +
>> +	ret = fcntl(fd, F_ADD_SEALS,
>> +		    F_SEAL_GROW | F_SEAL_SHRINK | F_SEAL_SEAL);
>> +	ksft_assert(!ret);
>> +
>> +	return fd;
>> +}
>> +
>> +int main(int argc, char *argv[])
>
>The iommufd selftest directory heavily utilizes the standard kselftest
>harness (TEST_F, EXPECT_EQ, ASSERT_EQ). By writing a raw main(), this
>patch circumvents the standard test reporting structure (e.g. returning
>1 on argc < 2 instead of calling ksft_exit_skip(), and returning 0 at
>the end instead of ksft_exit_pass()). We should rewrite this cleanly to
>follow the existing iommufd test infrastructure using TEST_F blocks.

Agreed. Will do.
>
>> +{
>> +	int iommufd, cdev_fd, memfd, luo, session, ret;
>> +	const int token = 0x123456;
>> +	const int cdev_token = 0x654321;
>> +	const int hwpt_token = 0x789012;
>> +	const int memfd_token = 0x890123;
>
>Shouldn't these be u64? Defining them as signed ints undermines that
>contract.

Agreed. Will update.
>
>> +
>> +	if (argc < 2) {
>> +		printf("Usage: ./iommufd_liveupdate <vfio_cdev_path>\n");
>> +		return 1;
>> +	}
>> +
>> +	luo = luo_open_device();
>> +	ksft_assert(luo > 0);
>> +
>> +	session = luo_retrieve_session(luo, "iommufd-test");
>
>[ ---- >8 ---- ]
>
>Thanks,
>Praan

Thanks,
Sami

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 14/14] iommufd/selftest: Add test to verify iommufd preservation
  2026-03-23 22:18   ` Vipin Sharma
@ 2026-03-27 18:32     ` Samiullah Khawaja
  0 siblings, 0 replies; 98+ messages in thread
From: Samiullah Khawaja @ 2026-03-27 18:32 UTC (permalink / raw)
  To: Vipin Sharma
  Cc: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe, Robin Murphy, Kevin Tian, Alex Williamson,
	Shuah Khan, iommu, linux-kernel, kvm, Saeed Mahameed,
	Adithya Jayachandran, Parav Pandit, Leon Romanovsky, William Tu,
	Pratyush Yadav, Pasha Tatashin, David Matlack, Andrew Morton,
	Chris Li, Pranjal Shrivastava, YiFei Zhu

On Mon, Mar 23, 2026 at 03:18:39PM -0700, Vipin Sharma wrote:
>On Tue, Feb 03, 2026 at 10:09:48PM +0000, Samiullah Khawaja wrote:
>> Test iommufd preservation by setting up an iommufd and vfio cdev and
>> preserve it across live update. Test takes VFIO cdev path of a device
>> bound to vfio-pci driver and binds it to an iommufd being preserved. It
>> also preserves the vfio cdev so the iommufd state associated with it is
>> also preserved.
>>
>> The restore path is tested by restoring the preserved vfio cdev only.
>> Test tries to finish the session without restoring iommufd and confirms
>> that it fails.
>
>Maybe also add more details: the VFIO bind will also fail, returning EPERM
>due to ...

Agreed, will add more detail.
>
>> diff --git a/tools/testing/selftests/iommu/Makefile b/tools/testing/selftests/iommu/Makefile
>> +$(TEST_GEN_PROGS_EXTENDED): %: %.o $(LIBLIVEUPDATE_O)
>> +	$(CC) $(CFLAGS) $(CPPFLAGS) $(LDFLAGS) $(TARGET_ARCH) $< $(LIBLIVEUPDATE_O) $(LDLIBS) -static -o $@
>
>Why static? The user can pass the static flag through EXTRA_CFLAGS.

I will remove this.
>
>> diff --git a/tools/testing/selftests/iommu/iommufd_liveupdate.c b/tools/testing/selftests/iommu/iommufd_liveupdate.c
>> +
>> +#define ksft_assert(condition) \
>> +	do { if (!(condition)) \
>> +	ksft_exit_fail_msg("Failed: %s at %s %d: %s\n", \
>> +	#condition, __FILE__, __LINE__, strerror(errno)); } while (0)
>
>Please add indentation to the code.

Will do.
>
>> +
>> +int setup_cdev(const char *vfio_cdev_path)
>
>Nit: s/setup_cdev/open_cdev

Will change this to open_cdev.
>
>Since you are using the open_iommufd() name below, this will make it
>consistent, as there is no setup here apart from opening.
>
>> +int setup_iommufd(int iommufd, int memfd, int cdev_fd, int hwpt_token)
>> +	bind.iommufd = iommufd;
>> +	ret = ioctl(cdev_fd, VFIO_DEVICE_BIND_IOMMUFD, &bind);
>> +	ksft_assert(!ret);
>> +
>> +	ret = ioctl(iommufd, IOMMU_IOAS_ALLOC, &alloc_data);
>> +	ksft_assert(!ret);
>> +
>> +	hwpt_alloc.dev_id = bind.out_devid;
>> +	hwpt_alloc.pt_id = alloc_data.out_ioas_id;
>> +	ret = ioctl(iommufd, IOMMU_HWPT_ALLOC, &hwpt_alloc);
>> +	ksft_assert(!ret);
>> +
>> +	attach_data.pt_id = hwpt_alloc.out_hwpt_id;
>> +	ret = ioctl(cdev_fd, VFIO_DEVICE_ATTACH_IOMMUFD_PT, &attach_data);
>> +	ksft_assert(!ret);
>> +
>> +	map_file.ioas_id = alloc_data.out_ioas_id;
>> +	ret = ioctl(iommufd, IOMMU_IOAS_MAP_FILE, &map_file);
>> +	ksft_assert(!ret);
>> +
>> +	set_preserve.hwpt_id = attach_data.pt_id;
>> +	ret = ioctl(iommufd, IOMMU_HWPT_LU_SET_PRESERVE, &set_preserve);
>> +	ksft_assert(!ret);
>> +
>> +	return ret;
>> +}
>
>iommufd_utils.h has functions defined for iommufd ioctls. I think we
>should use those functions here, and if related iommufd helpers are not
>present we should add them there.

iommufd_utils.h has utilities to use with mock testing, so those cannot
be used here. I will see if I can make separate helper functions for
these and use those here.
>
>> +int main(int argc, char *argv[])
>> +{
>> +	int iommufd, cdev_fd, memfd, luo, session, ret;
>> +	const int token = 0x123456;
>> +	const int cdev_token = 0x654321;
>> +	const int hwpt_token = 0x789012;
>> +	const int memfd_token = 0x890123;
>> +
>> +	if (argc < 2) {
>> +		printf("Usage: ./iommufd_liveupdate <vfio_cdev_path>\n");
>> +		return 1;
>> +	}
>> +
>> +	luo = luo_open_device();
>> +	ksft_assert(luo > 0);
>> +
>> +	session = luo_retrieve_session(luo, "iommufd-test");
>> +	if (session == -ENOENT) {
>> +		session = luo_create_session(luo, "iommufd-test");
>> +
>> +		iommufd = open_iommufd();
>> +		memfd = create_sealed_memfd(SZ_1M);
>> +		cdev_fd = setup_cdev(argv[1]);
>> +
>> +		ret = setup_iommufd(iommufd, memfd, cdev_fd, hwpt_token);
>> +		ksft_assert(!ret);
>> +
>> +		/* Cannot preserve cdev without iommufd */
>> +		ret = luo_session_preserve_fd(session, cdev_fd, cdev_token);
>> +		ksft_assert(ret);
>> +
>> +		/* Cannot preserve iommufd without preserving memfd. */
>> +		ret = luo_session_preserve_fd(session, iommufd, token);
>> +		ksft_assert(ret);
>> +
>> +		ret = luo_session_preserve_fd(session, memfd, memfd_token);
>> +		ksft_assert(!ret);
>> +
>> +		ret = luo_session_preserve_fd(session, iommufd, token);
>> +		ksft_assert(!ret);
>> +
>> +		ret = luo_session_preserve_fd(session, cdev_fd, cdev_token);
>> +		ksft_assert(!ret);
>> +
>
>All of these ksft_asserts are hurting my eyes :) I like the approach in
>VFIO where the library APIs do the validation and the test code only
>checks the things it actually needs to.
>
>Should we at least create a common function to combine both
>luo_session_preserve() and ksft_assert()?

Agreed. Will rework these.
>

Thanks,
Sami


* Re: [PATCH 14/14] iommufd/selftest: Add test to verify iommufd preservation
  2026-03-27 18:25     ` Samiullah Khawaja
@ 2026-03-27 18:40       ` Samiullah Khawaja
  0 siblings, 0 replies; 98+ messages in thread
From: Samiullah Khawaja @ 2026-03-27 18:40 UTC (permalink / raw)
  To: Pranjal Shrivastava
  Cc: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe, Robin Murphy, Kevin Tian, Alex Williamson,
	Shuah Khan, iommu, linux-kernel, kvm, Saeed Mahameed,
	Adithya Jayachandran, Parav Pandit, Leon Romanovsky, William Tu,
	Pratyush Yadav, Pasha Tatashin, David Matlack, Andrew Morton,
	Chris Li, Vipin Sharma, YiFei Zhu

On Fri, Mar 27, 2026 at 06:25:36PM +0000, Samiullah Khawaja wrote:
>On Wed, Mar 25, 2026 at 09:05:32PM +0000, Pranjal Shrivastava wrote:
>>On Tue, Feb 03, 2026 at 10:09:48PM +0000, Samiullah Khawaja wrote:
>>>Test iommufd preservation by setting up an iommufd and vfio cdev and
>>>preserve it across live update. Test takes VFIO cdev path of a device
>>>bound to vfio-pci driver and binds it to an iommufd being preserved. It
>>>also preserves the vfio cdev so the iommufd state associated with it is
>>>also preserved.
>>>
>>>The restore path is tested by restoring the preserved vfio cdev only.
>>>Test tries to finish the session without restoring iommufd and confirms
>>>that it fails.
>>>
>>>Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
>>>Signed-off-by: YiFei Zhu <zhuyifei@google.com>
>>>---
>>> tools/testing/selftests/iommu/Makefile        |  12 +
>>> .../selftests/iommu/iommufd_liveupdate.c      | 209 ++++++++++++++++++
>>> 2 files changed, 221 insertions(+)
>>> create mode 100644 tools/testing/selftests/iommu/iommufd_liveupdate.c
>>>
>>>+int main(int argc, char *argv[])
>>
>>The iommufd selftest directory heavily utilizes the standard kselftest
>>harness (TEST_F, EXPECT_EQ, ASSERT_EQ); by writing a raw main(), this
>>patch circumvents the standard test reporting structure (like returning 1
>>on argc < 2 instead of ksft_exit_skip(), and returning 0 at the end
>>instead of ksft_exit_pass() etc.). We should rewrite this cleanly
>>following the existing iommufd test infrastructure using TEST_F blocks.
>
>Agreed. Will do.

Thinking about this more, these liveupdate tests are two-stage tests:
there is one run before the live update and one run after it. I will have
to use the luo_test mechanism for this test. See example usage in
kexec_test in the VFIO cdev preservation series:

https://lore.kernel.org/all/20260323235817.1960573-21-dmatlack@google.com
>>
>>>+{
>>>+	int iommufd, cdev_fd, memfd, luo, session, ret;
>>>+	const int token = 0x123456;
>>>+	const int cdev_token = 0x654321;
>>>+	const int hwpt_token = 0x789012;
>>>+	const int memfd_token = 0x890123;
>>
>>Shouldn't these be u64? Defining them as signed ints undermines that
>>contract.
>
>Agreed. Will update.
>>
>>>+
>>>+	if (argc < 2) {
>>>+		printf("Usage: ./iommufd_liveupdate <vfio_cdev_path>\n");
>>>+		return 1;
>>>+	}
>>>+
>>>+	luo = luo_open_device();
>>>+	ksft_assert(luo > 0);
>>>+
>>>+	session = luo_retrieve_session(luo, "iommufd-test");
>>
>>[ ---- >8 ---- ]
>>
>>Thanks,
>>Praan
>
>Thanks,
>Sami


end of thread, other threads:[~2026-03-27 18:40 UTC | newest]

Thread overview: 98+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-02-03 22:09 [PATCH 00/14] iommu: Add live update state preservation Samiullah Khawaja
2026-02-03 22:09 ` [PATCH 01/14] iommu: Implement IOMMU LU FLB callbacks Samiullah Khawaja
2026-03-11 21:07   ` Pranjal Shrivastava
2026-03-12 16:43     ` Samiullah Khawaja
2026-03-12 23:43       ` Pranjal Shrivastava
2026-03-13 16:47         ` Samiullah Khawaja
2026-03-13 15:36       ` Pranjal Shrivastava
2026-03-13 16:58         ` Samiullah Khawaja
2026-03-16 22:54   ` Vipin Sharma
2026-03-17  1:06     ` Samiullah Khawaja
2026-03-23 23:27       ` Vipin Sharma
2026-02-03 22:09 ` [PATCH 02/14] iommu: Implement IOMMU core liveupdate skeleton Samiullah Khawaja
2026-03-12 23:10   ` Pranjal Shrivastava
2026-03-13 18:42     ` Samiullah Khawaja
2026-03-17 20:09       ` Pranjal Shrivastava
2026-03-17 20:13         ` Samiullah Khawaja
2026-03-17 20:23           ` Pranjal Shrivastava
2026-03-17 21:03             ` Vipin Sharma
2026-03-18 18:51               ` Pranjal Shrivastava
2026-03-18 17:49             ` Samiullah Khawaja
2026-03-17 19:58   ` Vipin Sharma
2026-03-17 20:33     ` Samiullah Khawaja
2026-03-24 19:06       ` Vipin Sharma
2026-03-24 19:45         ` Samiullah Khawaja
2026-02-03 22:09 ` [PATCH 03/14] liveupdate: luo_file: Add internal APIs for file preservation Samiullah Khawaja
2026-03-18 10:00   ` Pranjal Shrivastava
2026-03-18 16:54     ` Samiullah Khawaja
2026-02-03 22:09 ` [PATCH 04/14] iommu/pages: Add APIs to preserve/unpreserve/restore iommu pages Samiullah Khawaja
2026-03-03 16:42   ` Ankit Soni
2026-03-03 18:41     ` Samiullah Khawaja
2026-03-20 17:27       ` Pranjal Shrivastava
2026-03-20 18:12         ` Samiullah Khawaja
2026-03-17 20:59   ` Vipin Sharma
2026-03-20  9:28     ` Pranjal Shrivastava
2026-03-20 18:27       ` Samiullah Khawaja
2026-03-20 11:01     ` Pranjal Shrivastava
2026-03-20 18:56       ` Samiullah Khawaja
2026-02-03 22:09 ` [PATCH 05/14] iommupt: Implement preserve/unpreserve/restore callbacks Samiullah Khawaja
2026-03-20 21:57   ` Pranjal Shrivastava
2026-03-23 16:41     ` Samiullah Khawaja
2026-02-03 22:09 ` [PATCH 06/14] iommu/vt-d: Implement device and iommu preserve/unpreserve ops Samiullah Khawaja
2026-03-19 16:04   ` Vipin Sharma
2026-03-19 16:27     ` Samiullah Khawaja
2026-03-20 23:01   ` Pranjal Shrivastava
2026-03-21 13:27     ` Pranjal Shrivastava
2026-03-23 18:32     ` Samiullah Khawaja
2026-02-03 22:09 ` [PATCH 07/14] iommu/vt-d: Restore IOMMU state and reclaimed domain ids Samiullah Khawaja
2026-03-19 20:54   ` Vipin Sharma
2026-03-20  1:05     ` Samiullah Khawaja
2026-03-22 19:51   ` Pranjal Shrivastava
2026-03-23 19:33     ` Samiullah Khawaja
2026-02-03 22:09 ` [PATCH 08/14] iommu: Restore and reattach preserved domains to devices Samiullah Khawaja
2026-03-10  5:16   ` Ankit Soni
2026-03-10 21:47     ` Samiullah Khawaja
2026-03-22 21:59   ` Pranjal Shrivastava
2026-03-23 18:02     ` Samiullah Khawaja
2026-02-03 22:09 ` [PATCH 09/14] iommu/vt-d: preserve PASID table of preserved device Samiullah Khawaja
2026-03-23 18:19   ` Pranjal Shrivastava
2026-03-23 18:51     ` Samiullah Khawaja
2026-02-03 22:09 ` [PATCH 10/14] iommufd-lu: Implement ioctl to let userspace mark an HWPT to be preserved Samiullah Khawaja
2026-03-19 23:35   ` Vipin Sharma
2026-03-20  0:40     ` Samiullah Khawaja
2026-03-20 23:34       ` Vipin Sharma
2026-03-23 16:24         ` Samiullah Khawaja
2026-03-25 14:37   ` Pranjal Shrivastava
2026-03-25 17:31     ` Samiullah Khawaja
2026-03-25 18:55       ` Pranjal Shrivastava
2026-03-25 20:19         ` Samiullah Khawaja
2026-03-25 20:36           ` Pranjal Shrivastava
2026-03-25 20:46             ` Samiullah Khawaja
2026-02-03 22:09 ` [PATCH 11/14] iommufd-lu: Persist iommu hardware pagetables for live update Samiullah Khawaja
2026-02-25 23:47   ` Samiullah Khawaja
2026-03-03  5:56   ` Ankit Soni
2026-03-03 18:51     ` Samiullah Khawaja
2026-03-23 20:28   ` Vipin Sharma
2026-03-23 21:34     ` Samiullah Khawaja
2026-03-25 20:08   ` Pranjal Shrivastava
2026-03-25 20:32     ` Samiullah Khawaja
2026-02-03 22:09 ` [PATCH 12/14] iommufd: Add APIs to preserve/unpreserve a vfio cdev Samiullah Khawaja
2026-03-23 20:59   ` Vipin Sharma
2026-03-23 21:38     ` Samiullah Khawaja
2026-03-25 20:24   ` Pranjal Shrivastava
2026-03-25 20:41     ` Samiullah Khawaja
2026-03-25 21:23       ` Pranjal Shrivastava
2026-03-26  0:16         ` Samiullah Khawaja
2026-02-03 22:09 ` [PATCH 13/14] vfio/pci: Preserve the iommufd state of the " Samiullah Khawaja
2026-02-17  4:18   ` Ankit Soni
2026-03-03 18:35     ` Samiullah Khawaja
2026-03-23 21:17   ` Vipin Sharma
2026-03-23 22:07     ` Samiullah Khawaja
2026-03-24 20:30       ` Vipin Sharma
2026-03-25 20:55   ` Pranjal Shrivastava
2026-02-03 22:09 ` [PATCH 14/14] iommufd/selftest: Add test to verify iommufd preservation Samiullah Khawaja
2026-03-23 22:18   ` Vipin Sharma
2026-03-27 18:32     ` Samiullah Khawaja
2026-03-25 21:05   ` Pranjal Shrivastava
2026-03-27 18:25     ` Samiullah Khawaja
2026-03-27 18:40       ` Samiullah Khawaja

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox