Date: Mon, 27 Apr 2026 17:56:17 +0000
X-Mailing-List: kvm@vger.kernel.org
Message-ID: <20260427175633.1978233-1-skhawaja@google.com>
Subject: [PATCH v2 00/16] iommu: Add
 live update state preservation
From: Samiullah Khawaja
To: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon, Jason Gunthorpe
Cc: Samiullah Khawaja, Robin Murphy, Kevin Tian, Alex Williamson,
 Shuah Khan, iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
 kvm@vger.kernel.org, Saeed Mahameed, Adithya Jayachandran, Parav Pandit,
 Leon Romanovsky, William Tu, Pratyush Yadav, Pasha Tatashin,
 David Matlack, Andrew Morton, Chris Li, Pranjal Shrivastava,
 Vipin Sharma, YiFei Zhu
Content-Type: text/plain; charset="UTF-8"

Hi,

This patch series introduces a mechanism for IOMMU state preservation
across live update, including the Intel VT-d driver support
implementation.

Please take a look at the following LWN article to learn about KHO and
Live Update Orchestrator:

https://lwn.net/Articles/1033364/

This work depends on:

- VFIO CDEV preservation series (v3):
  https://lore.kernel.org/kvm/20260323235817.1960573-1-dmatlack@google.com/
- PCI device preservation series (v4):
  https://lore.kernel.org/all/20260423212316.3431746-1-dmatlack@google.com/
- LUO series to add FLB refcounting:
  https://lore.kernel.org/lkml/20260423174032.3140399-1-dmatlack@google.com/

The kernel tree with all dependencies is uploaded to the following
GitHub location:

https://github.com/samikhawaja/linux/tree/iommu/phase1-v2

Overall Goals:

The goal of this effort is to preserve the IOMMU domains, managed by
iommufd, attached to devices preserved through VFIO cdev. This allows
DMA mappings and IOMMU context of a device assigned to a VM to be
maintained across a kexec live update.

This is achieved by preserving IOMMU page tables using Generic Page
Table support, IOMMU root table and the relevant context entries across
live update.

The functionality in the previously sent RFC is split into two phases
and this series implements Phase 1.
Phase 1 implements the following functionality:

- Foundational work in IOMMU core and VT-d driver to preserve and
  restore IOMMU translation units, IOMMU domains and devices across
  liveupdate kexec.
- The preservation is triggered by preserving vfio cdev FD and bound
  iommufd FD into a live update session.
- An HWPT (and backing IOMMU domain) is only preserved if it contains
  only file type DMA mappings. Also, the memfd being used for such a
  mapping should be SEAL_SEAL'd during mapping.
- During live update boot, the state of preserved Intel VT-d, IOMMU
  domain and devices is restored.
- The restored IOMMU domains are reattached to the preserved devices
  during early boot.
- The DMA ownership of restored devices is also claimed during live
  update boot. This means that any attempt to bind a non-vfio driver to
  them, or to bind a new iommufd to them, will fail.

Architectural Overview:

The target architecture for IOMMU state preservation across a live
update involves coordination between the Live Update Orchestrator,
iommufd, and the IOMMU drivers.

The core design uses the Live Update Orchestrator's file descriptor
preservation mechanism to preserve iommufd file descriptors. The user
marks the iommufd HWPTs for preservation using a new ioctl added in this
series. Once done, the preservation of iommufd inside an LUO session is
triggered using LUO ioctls. During preservation, the LUO preserve
callback for an iommufd walks through the HWPTs it manages to identify
the ones that need to be preserved. Once identified, a new IOMMU core
API is used to preserve the iommu domain. The IOMMU core uses Generic
Page Table to preserve the page tables of these domains. The domains are
then marked as preserved.

When the user triggers the preservation of a VFIO cdev that is attached
to an iommufd that is preserved, the device attachment state of that
VFIO cdev is also preserved using an API exported by iommufd.
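
As a rough illustration of the mark-then-walk flow described above, here
is a toy user-space model in C. It is only a sketch under stated
assumptions: every toy_* type and function is hypothetical and is not a
structure or API added by this series, and failing the whole preserve
call when a marked HWPT has a non-file mapping is an assumption, since
the text above only says that such an HWPT is not preserved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; these are not kernel types. */
enum toy_mapping_type { TOY_MAP_ANON, TOY_MAP_FILE };

struct toy_mapping {
	enum toy_mapping_type type;
	bool memfd_sealed;	/* models the sealing requirement on the memfd */
};

struct toy_hwpt {
	unsigned int id;
	bool marked;		/* set by the "mark for preservation" step */
	bool preserved;		/* set once the backing domain is preserved */
	const struct toy_mapping *maps;
	size_t nr_maps;
};

/* An HWPT qualifies only if every mapping is file-backed and sealed. */
static bool toy_hwpt_preservable(const struct toy_hwpt *hwpt)
{
	for (size_t i = 0; i < hwpt->nr_maps; i++) {
		if (hwpt->maps[i].type != TOY_MAP_FILE ||
		    !hwpt->maps[i].memfd_sealed)
			return false;
	}
	return true;
}

/*
 * Model of the preserve callback: walk all HWPTs, skip the unmarked
 * ones, and preserve each marked one. Assumption: an ineligible marked
 * HWPT fails the whole call and earlier work is rolled back.
 * Returns 0 on success, -1 on failure.
 */
static int toy_iommufd_preserve(struct toy_hwpt *hwpts, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (!hwpts[i].marked)
			continue;
		if (!toy_hwpt_preservable(&hwpts[i])) {
			/* Undo the HWPTs preserved so far. */
			while (i--)
				hwpts[i].preserved = false;
			return -1;
		}
		hwpts[i].preserved = true;
	}
	return 0;
}
```

In this model, an unmarked HWPT is never touched by the walk, and a
marked HWPT with an anonymous mapping makes toy_iommufd_preserve()
return -1 with nothing left preserved.
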
IOMMUFD fetches all the information that needs to be preserved and calls
the IOMMU core API to preserve the device state. The IOMMU core also
preserves the state of the IOMMU that is associated with this device.

The IOMMU core has an LUO FLB registered with the iommufd LUO file
handler so the preserved iommu domain and iommu hardware unit state is
available during boot for early restore in the next kernel.

During boot, the driver fetches the preserved state from the IOMMU core
and restores the state of preserved IOMMUs. Later, when the IOMMU core
goes through the devices and probes them, the iommu domains of preserved
devices are restored and the preserved devices are attached to them.
During attachment, the DMA ownership of these devices is also claimed.

Tested:

The new iommufd_liveupdate_kexec_test selftest was used to verify the
preservation logic. It was tested using QEMU with virtual IOMMU (VT-d)
support with a virtio PCIe device bound to the vfio-pci driver. It was
also tested on an Intel machine with a DSA device bound to the vfio-pci
driver.

The following steps were used for verification:

- Bind the test device to the vfio-pci driver.
- Run the test on the machine:
    ./iommufd_liveupdate_kexec_test
- Trigger kexec.
- After reboot, try binding the device to a non-vfio PCI driver:
    echo > /sys/bus/pci/drivers/pci-pf-stub/bind
  This should fail with "Device or resource busy".
- Bind the device to the vfio-pci driver and run the test again:
    ./iommufd_liveupdate_kexec_test --stage 2
- The test verifies that the device cannot be bound to a new iommufd and
  that the session cannot be finished.

Future Work:

- Phase 2 with IOMMUFD restore to reclaim the preserved vfio cdev and
  restore the preserved HWPTs.
- Full support for PASID preservation.
- Nested IOMMU preservation.
- Extend support to other IOMMU architectures (e.g., AMD-Vi, Arm SMMUv3).
- DMA Alloc preservation support buddy allocator only.
  https://github.com/samikhawaja/linux/tree/dma-alloc-preserve

Roadmap:

The doc below gives a breakdown of the overall work into the patch
series needed to complete the live update feature from the IOMMU
perspective:

https://docs.google.com/document/d/1enDn-uPE9U77U-xHEnzn6HHGKiePSAtMIP8EDU3NO0M

High-Level Sequence Flow:

The following diagrams illustrate the high-level interactions during the
preservation phase. Note that function names in the diagram are kept
abbreviated to save horizontal space.

Preserve:

Before live update, the PREPARE event of the Live Update Orchestrator
invokes callbacks of the registered file and subsystem handlers.

Userspace (VMM)  | LUO Core | iommufd       | IOMMU Core    | IOMMU Driver
-----------------|----------|---------------|---------------|-------------
                 |          |               |               |
MARK_HWPT        |          |               |               |
---------------------------->               |               |
                 |          | Mark HWPT for |               |
                 |          | preservation  |               |
                 |          |               |               |
PRESERVE         |          |               |               |
iommufd_fd       |          |               |               |
----------------->          |               |               |
                 | preserve |               |               |
                 |---------->               |               |
                 |          | For each HWPT |               |
                 |          |--------------->               |
                 |          |               | domain_presrv |
                 |          |               |--------------->
                 |          |               |               | gpt(preserve)
                 |          |               |<--------------|
                 |          |<--------------|               |
                 |<---------|               |               |
                 |          |               |               |
...              |          |               |               |
                 |          |               |               |
PRESERVE,        |          |               |               |
vfio_cdev_fd     |          |               |               |
----------------->          |               |               |
                 | preserve |               |               |
                 |---------->               |               |
                 |          |               |               |
                 |          | iommu_preserv |               |
                 |          | _device()     |               |
                 |          |--------------->               |
                 |          |               | preserve      |
                 |          |               | (iommu_hw)    |
                 |          |               |--------------->
                 |          |               |               | preserve(root)
                 |          |               |               | preserve(pasid)
                 |          |               |<--------------|
                 |          |               |               |
                 |          |               | preserve      |
                 |          |               | _device(dev)  |
                 |          |               |--------------->
                 |          |               |               |
                 |          |               |<--------------|
                 |          |<--------------|               |
                 |<---------|               |               |

Restore:

After a live update, the preserved state is restored during boot.

Userspace (VMM)  | LUO Core | iommufd       | IOMMU Core    | IOMMU Driver
-----------------|----------|---------------|---------------|-------------
                 |          |               |               |
                 |          |               |               | Restore
                 |          |               |               | Root, DIDs
                 |          |               |               |
                 |          |               |               | Register
                 |          |               | probe devices |
                 |          |               |               |
                 |          |               | restore       |
                 |          |               | domain        |
                 |          |               |--------------->
                 |          |               |               | restore
                 |          |               | reattach      |
                 |          |               | domain        |
                 |          |               |--------------->
                 |          |               |               |

Looking forward to your feedback on this.

Changelog:

v2:
- Move IOMMU_LIVEUPDATE under IOMMU_SUPPORT dependencies.
- Update copyright year to 2026.
- Add Kernel-doc for FLB struct.
- Add an ASCII diagram for FLB memory layout.
- Change compatibility to iommu-liveupdate-v1.
- Rename structs to more reasonable names.
- Add comment explaining the rationale for BUG_ON during restoration.
- Rename 'did' to 'attachment_id'.
- Use phys_to_virt() and virt_to_phys() consistently.
- Create separate functions for FLB unpreserve and folio_put.
- Return the virtual address from the restore_array function.
- Free serialized state on finish without checking for NULL.
- Rename 'iommu-lu.h' to 'iommu-liveupdate.h'.
- Suffix max with _per_page (iommu_max_objs_per_page).
- Move unused helpers to the patches that actually need them.
- Rename reserve_obj_ser to alloc_object_ser.
- Only allow vfio drivers to use preserve_device APIs.
- Move ops declarations under CONFIG_IOMMU_LIVEUPDATE guards.
- Add lockdep_assert_held() validations in locked functions.
- Rename device_ser_match to match_device_ser.
- Create iommu_folio_update_stats(folio, nr_pages) helper.
- Explicitly set incoherent flag to false for restored pages.
- Do not use unpreserve_pages() function for error handling.
- Preserve vasz and sign_extend of a domain.
- Make the restored domain free-only.
- Ignore error of pt_descend() during restore as it should never fail
  (dead code).
- On domain restore, clear all features except the preserved.
- Add a KUnit test to verify that a restored domain can be freed with
  zeroed features.
- Bypass paging_domain_compatible() checks for restored domains.
- Use an explicit if-check for preserved state to clear the context
  entries.
- Rename clean_context to clear_unpreserved_context_entries.
- Use "DMAR:" in pr_fmt instead of "iommu:".
- Pass ROOT_ENTRY_NR directly into unpreserve functions.
- Remove IOMMU lock during preserve as not needed.
- Iterate all devices when clearing context entries instead of just PCIe
  devices.
- Do not populate empty unpreserve_device callbacks if not needed.
- Rename unpreserve_iommu_context to unpreserve_iommu_context_table.
- Check the associated IOMMU in _restore_used_domain_id.
- Add a comment that MMIO base is used as token for identifying IOMMU.
- Add a comment in _restore_used_domain_ids that IDA allocation can
  safely fail.
- Mark IOMMU state restored after restoring to prevent double
  restoration and UAF.
- Fix domain leak in iommu_restore_domain error path.
- Check DID start range before allocation.
- Add lockdep_assert_held in group domain restoration path.
- Add WARN_ON to catch unexpected group ownership during device probe.
- Add comment explaining group ownership setup and reclaim during device
  probe.
- Cleanup the full PASID directory instead of only 4K.
- Rename ioctl to IOMMU_HWPT_LIVEUPDATE_MARK_PRESERVE.
- Use XArray to save the HWPT liveupdate marks.
- Add the ioctl struct to ucmd_buffer.
- Make the ioctl 64-bit aligned (struct padding).
- Add UAPI documentation stating that HWPTs cannot be unmarked.
- Add UAPI documentation stating that only file-based mappings are
  allowed.
- Rename structs to use "liveupdate" instead of "lu".
- Rework HWPT preservation to prevent TOCTOU and mutex lock inside XA
  spin lock.
- Check preserved pages at the IOPT level, not the HWPT level.
- Return -EOPNOTSUPP if there are PASID attachments.
- Do not allow detach (or attach) once the device is preserved.
- Add MODULE_IMPORT_NS("IOMMUFD") in vfio_pci_liveupdate.c.
- Make token u64.
- Modify selftest to use live update kexec_test pattern.
- Use helper macros in test instead of repeating ksft_assert.
- Rename setup_cdev to open_cdev.
- Define constants (tokens) as u64 to avoid ABI warnings.

v1: https://lore.kernel.org/all/20260203220948.2176157-1-skhawaja@google.com/
rfcv2: https://lore.kernel.org/all/20251202230303.1017519-1-skhawaja@google.com/
rfcv1: https://lore.kernel.org/all/20250928190624.3735830-1-skhawaja@google.com/

Pasha Tatashin (1):
  liveupdate: luo_file: Add internal APIs for file preservation

Samiullah Khawaja (13):
  iommu: Implement IOMMU Live update FLB callbacks
  iommu: Implement IOMMU domain preservation
  iommu: Implement device and IOMMU HW preservation
  iommu/pages: Add APIs to preserve/unpreserve/restore iommu pages
  iommupt: Implement preserve/unpreserve/restore callbacks
  iommu/vt-d: Implement device and iommu preserve/unpreserve ops
  iommu: Add APIs to get iommu and device preserved state
  iommu/vt-d: Restore IOMMU state and reclaimed domain ids
  iommu: Restore and reattach preserved domains to devices
  iommu/vt-d: preserve PASID table of preserved device
  iommufd: Add APIs to preserve/unpreserve a vfio cdev
  vfio/pci: Preserve the iommufd state of the vfio cdev
  iommufd/selftest: Add test to verify iommufd preservation

YiFei Zhu (2):
  iommufd: Implement ioctl to mark HWPT for preservation
  iommufd: Persist iommu hardware pagetables for live update

 MAINTAINERS                               |  12 +
 drivers/iommu/Kconfig                     |  12 +
 drivers/iommu/Makefile                    |   1 +
 drivers/iommu/generic_pt/iommu_pt.h       | 131 ++++
 drivers/iommu/generic_pt/kunit_iommu_pt.h |  28 +
 drivers/iommu/intel/Makefile              |   1 +
 drivers/iommu/intel/iommu.c               | 159 ++++-
 drivers/iommu/intel/iommu.h               |  50 +-
 drivers/iommu/intel/liveupdate.c          | 337 ++++++++++
 drivers/iommu/intel/nested.c              |   2 +-
 drivers/iommu/intel/pasid.c               |   7 +-
 drivers/iommu/intel/pasid.h               |   9 +
 drivers/iommu/iommu-pages.c               | 108 +++-
 drivers/iommu/iommu-pages.h               |  30 +
 drivers/iommu/iommu.c                     |  79 ++-
 drivers/iommu/iommufd/Makefile            |   1 +
 drivers/iommu/iommufd/device.c            | 102 +++
 drivers/iommu/iommufd/io_pagetable.c      |  11 +
 drivers/iommu/iommufd/io_pagetable.h      |   1 +
 drivers/iommu/iommufd/iommufd_private.h   |  46 ++
 drivers/iommu/iommufd/liveupdate.c        | 339 ++++++++++
 drivers/iommu/iommufd/main.c              |  19 +-
 drivers/iommu/iommufd/pages.c             |   7 +
 drivers/iommu/liveupdate.c                | 592 ++++++++++++++++++
 drivers/vfio/device_cdev.c                |  10 +
 drivers/vfio/pci/vfio_pci_liveupdate.c    |  33 +-
 include/linux/generic_pt/iommu.h          |  19 +-
 include/linux/iommu-liveupdate.h          | 156 +++++
 include/linux/iommu.h                     |  47 ++
 include/linux/iommufd.h                   |  29 +
 include/linux/kho/abi/iommu.h             | 249 ++++++++
 include/linux/kho/abi/iommufd.h           |  51 ++
 include/linux/liveupdate.h                |  21 +
 include/uapi/linux/iommufd.h              |  26 +
 kernel/liveupdate/luo_file.c              |  69 ++
 kernel/liveupdate/luo_internal.h          |  17 +
 tools/testing/selftests/iommu/Makefile    |  12 +
 .../iommu/iommufd_liveupdate_kexec_test.c | 239 +++++++
 38 files changed, 3016 insertions(+), 46 deletions(-)
 create mode 100644 drivers/iommu/intel/liveupdate.c
 create mode 100644 drivers/iommu/iommufd/liveupdate.c
 create mode 100644 drivers/iommu/liveupdate.c
 create mode 100644 include/linux/iommu-liveupdate.h
 create mode 100644 include/linux/kho/abi/iommu.h
 create mode 100644 include/linux/kho/abi/iommufd.h
 create mode 100644 tools/testing/selftests/iommu/iommufd_liveupdate_kexec_test.c

base-commit: 2a4c0c11c0193889446cdb6f1540cc2b9aff97dd
-- 
2.54.0.545.g6539524ca2-goog