From: Samiullah Khawaja <skhawaja@google.com>
To: Pranjal Shrivastava <praan@google.com>
Cc: David Woodhouse <dwmw2@infradead.org>,
Lu Baolu <baolu.lu@linux.intel.com>,
Joerg Roedel <joro@8bytes.org>, Will Deacon <will@kernel.org>,
Jason Gunthorpe <jgg@ziepe.ca>, YiFei Zhu <zhuyifei@google.com>,
Robin Murphy <robin.murphy@arm.com>,
Kevin Tian <kevin.tian@intel.com>,
Alex Williamson <alex@shazbot.org>,
Shuah Khan <shuah@kernel.org>,
iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
kvm@vger.kernel.org, Saeed Mahameed <saeedm@nvidia.com>,
Adithya Jayachandran <ajayachandra@nvidia.com>,
Parav Pandit <parav@nvidia.com>,
Leon Romanovsky <leonro@nvidia.com>,
William Tu <witu@nvidia.com>,
Pratyush Yadav <pratyush@kernel.org>,
Pasha Tatashin <pasha.tatashin@soleen.com>,
David Matlack <dmatlack@google.com>,
Andrew Morton <akpm@linux-foundation.org>,
Chris Li <chrisl@kernel.org>, Vipin Sharma <vipinsh@google.com>
Subject: Re: [PATCH v2 13/16] iommufd: Persist iommu hardware pagetables for live update
Date: Wed, 20 May 2026 19:40:05 +0000 [thread overview]
Message-ID: <ag4DzYZ4ToDcyfn3@google.com> (raw)
In-Reply-To: <agz5rFcl27H8yzTY@google.com>
On Wed, May 20, 2026 at 12:00:44AM +0000, Pranjal Shrivastava wrote:
>On Mon, Apr 27, 2026 at 05:56:30PM +0000, Samiullah Khawaja wrote:
>> From: YiFei Zhu <zhuyifei@google.com>
>>
>> Register iommufd with the LUO framework and implement the preserve and
>> unpreserve ops to save marked HWPTs.
>>
>> To make sure mappings do not change during preserved state, add a
>> liveupdate_immutable flag to IOAS. When an HWPT is preserved, its IOAS
>> is marked immutable and any map/unmap attempts will fail with -EBUSY.
>> This is synchronized using the domains_rwsem to prevent races with
>> concurrent mapping operations.
>>
>> The preserve callback iterates over the marked HWPTs, verifies that the
>> backing memory pages are preserved, and calls iommu_domain_preserve() to
>> preserve the associated IOMMU domain.
>>
>> Signed-off-by: YiFei Zhu <zhuyifei@google.com>
>> Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
>
>[...]
>
>> diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
>> index 111f4d42e210..3c88aa115d08 100644
>> --- a/drivers/iommu/iommufd/iommufd_private.h
>> +++ b/drivers/iommu/iommufd/iommufd_private.h
>> @@ -98,6 +98,9 @@ struct io_pagetable {
>> /* IOVA that cannot be allocated, struct iopt_reserved */
>> struct rb_root_cached reserved_itree;
>> u8 disable_large_pages;
>> +#ifdef CONFIG_IOMMU_LIVEUPDATE
>> + bool liveupdate_immutable;
>> +#endif
>> unsigned long iova_alignment;
>> };
>>
[snip]
>> +
>> +static int iommufd_preserve_hwpt(struct iommufd_hwpt_paging *hwpt,
>> + struct iommufd_hwpt_ser *hwpt_ser,
>> + struct liveupdate_session *session)
>> +{
>> + struct iommu_domain_ser *domain_ser;
>> + bool ioas_made_immutable = false;
>> + int rc;
>> +
>> + if (!hwpt->ioas->iopt.liveupdate_immutable) {
>> + /*
>> + * Make IOAS immutable so the DMA mappings do not change while
>> + * the HWPT is preserved. Since one IOAS can have multiple
>> + * HWPTs, if an error occurs this call needs to make the IOAS
>> + * mutable again if it was the one that made it immutable.
>> + */
>> + ioas_made_immutable = true;
>> + ioas_set_immutable(hwpt->ioas, true);
>> +
>> + rc = check_iopt_pages_preserved(session, hwpt);
>> + if (rc)
>> + goto err;
>> + }
>
>Nit:
>I'm thinking what happens for a shared IOAS situation? Say, 2 devices,
>in the same container, behind 2 different IOMMUs, sharing the IOAS. Each
>device will have it's own HWPT (N:1 mapping for HWPT v/s IOAS)
>
>So, what happens if Device 1 succeeds with preservation but device 2
>doesn't? For example:
>
>1. Preserve Device 1:
> - It sees liveupdate_immutable is false.
> - Sets ioas_made_immutable = true on its local stack.
> - Flips IOAS to immutable.
> - Preservation succeeds.
>
>2. Preserve Device 2:
> - It sees liveupdate_immutable is already true (because Device 1 set it).
> - Sets ioas_made_immutable = false on its local stack.
> - The Failure: iommu_domain_preserve fails for Device 2.
> - The Jump: Hits goto err;
>
>Now, inside the err: label for device 2, it checks
>if (ioas_made_immutable), since it is FALSE for device 2,
>it does nothing.
>
>I agree we return an error code to the caller which finally cleans it up
>well, but I'm considering if we should make liveupdate_immutable
>refcountable? Since the error handling in iommufd_preserve_hwpt() is
>logically incomplete for shared IOAS as it only attempts to restore
>mutability if the current HWPT set immutable = true;
This is a valid scenario and as you already pointed that it is handled
by the error handling in the caller.
But I agree refcount makes it cleaner. I will update this.
>
>> +
>> + hwpt_ser->token = hwpt->liveupdate_token;
>> + hwpt_ser->reclaimed = false;
>> +
>> + rc = iommu_domain_preserve(hwpt->common.domain, &domain_ser);
>> + if (rc < 0)
>> + goto err;
>> +
>> + hwpt_ser->domain_data = virt_to_phys(domain_ser);
>> + return 0;
>> +
>> +err:
>> + if (ioas_made_immutable)
>> + ioas_set_immutable(hwpt->ioas, false);
>> +
>> + return rc;
>> +}
>
>[...]
>
>> +
>> +static int iommufd_liveupdate_preserve(struct liveupdate_file_op_args *args)
>> +{
>> + struct iommufd_ctx *ictx = iommufd_ctx_from_file(args->file);
>> + struct iommufd_hwpt_paging *hwpt;
>> + struct iommufd_ser *iommufd_ser;
>> + struct iommufd_object *obj;
>> + unsigned int nr_hwpts;
>> + unsigned long index;
>> + unsigned int i;
>> + void *mem;
>> + int rc;
>> +
>> + if (IS_ERR(ictx))
>> + return PTR_ERR(ictx);
>> +
>> + mutex_lock(&ictx->liveupdate_mutex);
>> +
>> + /* Count the number of HWPTs to preserve */
>> + nr_hwpts = 0;
>> + xa_lock(&ictx->objects);
>> + xa_for_each_marked(&ictx->objects, index, obj, IOMMUFD_OBJ_LIVEUPDATE_MARK) {
>> + if (obj->type != IOMMUFD_OBJ_HWPT_PAGING)
>> + continue;
>> +
>> + hwpt = to_hwpt_paging(container_of(obj, struct iommufd_hw_pagetable, obj));
>> + if (!hwpt->common.domain) {
>> + rc = -EINVAL;
>> + xa_unlock(&ictx->objects);
>> + goto out_unlock;
>> + }
>> + nr_hwpts++;
>> + }
>> + xa_unlock(&ictx->objects);
>> +
>> + mem = kho_alloc_preserve(struct_size(iommufd_ser,
>> + hwpt_array, nr_hwpts));
>> + if (!mem) {
>> + rc = -ENOMEM;
>> + goto out_unlock;
>> + }
>> +
>> + iommufd_ser = mem;
>> + iommufd_ser->nr_hwpts = nr_hwpts;
>
>Nit: Can there be a TOCTOU here? We first count nr_hwpts in the first
>loop, but actually preserve them in the loop below. Is it possible for a
>Guest to race with these loops and destroy a HWPT?
>
>That could cause a bug in the new kernel as it may try to restore
>nr_hwpts which is one more than the preserved HWPTs.
Agreed. The HWPT is locked in the preserve loop below, I will update the
nr_hwpts to handle this after that loop to handle this.
>
>> +
>> + /* Preserve HWPTs */
>> + i = 0;
>> + xa_lock(&ictx->objects);
>> + xa_for_each_marked(&ictx->objects, index, obj, IOMMUFD_OBJ_LIVEUPDATE_MARK) {
>> + if (obj->type != IOMMUFD_OBJ_HWPT_PAGING)
>> + continue;
>> +
>> + if (!iommufd_lock_obj(obj)) {
>> + rc = -ENOENT;
>> + xa_unlock(&ictx->objects);
>> + goto out_unpreserve;
>> + }
>> +
>> + /*
>> + * HWPT is locked so it will not be destroyed. The xarray lock
>> + * can be released here before preserving the HWPT.
>> + */
>> + xa_unlock(&ictx->objects);
>> + hwpt = to_hwpt_paging(container_of(obj, struct iommufd_hw_pagetable, obj));
>> + rc = iommufd_preserve_hwpt(hwpt, &iommufd_ser->hwpt_array[i++], args->session);
>> + if (rc) {
>> + iommufd_put_object(ictx, obj);
>> + goto out_unpreserve;
>> + }
>> +
>> + /* Mark as preserved */
>> + hwpt->liveupdate_preserved = true;
>> + xa_lock(&ictx->objects);
>> + }
>> + xa_unlock(&ictx->objects);
>
>[...]
>
>> diff --git a/drivers/iommu/iommufd/pages.c b/drivers/iommu/iommufd/pages.c
>> index 9bdb2945afe1..3b0c0acb8856 100644
>> --- a/drivers/iommu/iommufd/pages.c
>> +++ b/drivers/iommu/iommufd/pages.c
>> @@ -55,6 +55,7 @@
>> #include <linux/overflow.h>
>> #include <linux/slab.h>
>> #include <linux/sched/mm.h>
>> +#include <linux/memfd.h>
>> #include <linux/vfio_pci_core.h>
>>
>> #include "double_span.h"
>> @@ -1421,6 +1422,7 @@ struct iopt_pages *iopt_alloc_file_pages(struct file *file,
>>
>> {
>> struct iopt_pages *pages;
>> + int seals;
>>
>> pages = iopt_alloc_pages(start_byte, length, writable);
>> if (IS_ERR(pages))
>> @@ -1428,6 +1430,11 @@ struct iopt_pages *iopt_alloc_file_pages(struct file *file,
>> pages->file = get_file(file);
>> pages->start = start - start_byte;
>> pages->type = IOPT_ADDRESS_FILE;
>> +
>> + seals = memfd_get_seals(file);
>> + if (seals > 0)
>> + pages->seals = seals;
>> +
>
>Can caching memfd seals create a TOCTOU issue?
>IIUC, iopt_alloc_file_pages happens at map time, However, the userspace
>is allowed to map a memfd and then apply the F_ADD_SEALS via fcntl()
>later in its setup sequence? For example a sequence like:
>
>1. VMM creates a memfd. It has 0 seals.
>2. VMM calls IOMMU_IOAS_MAP_FILE. IOMMUFD caches pages->seals = 0.
>3. VMM finishes its setup and calls:
> fcntl(fd, F_ADD_SEALS, F_SEAL_GROW | F_SEAL_SHRINK | F_SEAL_SEAL).
>
>4.VMM initiates Live Update.
>5.check_iopt_pages_preserved looks at the cached pages->seals
> (which is still 0), sees the seals are missing, & kills the LiveUpdate
> with -EINVAL, even though the file is properly sealed..
This is true and it is intentionally this way to make sure that the seal
is applied during mapping otherwise user can apply the seal after
resizing the memfd and preserve IOMMU mappings that are pointing to
unpreserved pages. Consider following:
1. VMM creates a memfd and seals is zero.
2. VMM maps memfd into ioas/hwpt.
3. VMM resizes the memfd.
4. VMM seals memfd
5. VMM preserves the memfd (it only preseves the current size).
6. VMM preserves iommufd and it succeeds as memfd is sealed.
But the pages being referred by the iommu mappings are refcounted in
current kernel, but not preserved.
Check the comment in check_iopt_pages_preserved() also. I will add a
comment here also.
>
>Thus, I guess we should dynamically check seals via memfd_get_seals()
>
>> return pages;
>> }
>>
>
>Thanks,
>Praan
Sami
next prev parent reply other threads:[~2026-05-20 19:40 UTC|newest]
Thread overview: 72+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-04-27 17:56 [PATCH v2 00/16] iommu: Add live update state preservation Samiullah Khawaja
2026-04-27 17:56 ` [PATCH v2 01/16] liveupdate: luo_file: Add internal APIs for file preservation Samiullah Khawaja
2026-05-18 11:40 ` Pranjal Shrivastava
2026-05-18 19:08 ` Samiullah Khawaja
2026-04-27 17:56 ` [PATCH v2 02/16] iommu: Implement IOMMU Live update FLB callbacks Samiullah Khawaja
2026-05-01 21:45 ` David Matlack
2026-05-18 11:52 ` Pranjal Shrivastava
2026-05-18 14:10 ` Pratyush Yadav
2026-05-18 15:08 ` Pranjal Shrivastava
2026-05-18 12:33 ` Pranjal Shrivastava
2026-05-18 17:20 ` Samiullah Khawaja
2026-05-18 17:32 ` Pranjal Shrivastava
2026-05-18 17:06 ` Samiullah Khawaja
2026-04-27 17:56 ` [PATCH v2 03/16] iommu: Implement IOMMU domain preservation Samiullah Khawaja
2026-05-01 22:08 ` David Matlack
2026-05-04 18:33 ` Samiullah Khawaja
2026-05-18 13:13 ` Pranjal Shrivastava
2026-05-18 18:55 ` Samiullah Khawaja
2026-05-18 21:36 ` Pranjal Shrivastava
2026-04-27 17:56 ` [PATCH v2 04/16] iommu: Implement device and IOMMU HW preservation Samiullah Khawaja
2026-05-01 22:42 ` David Matlack
2026-05-04 19:06 ` Samiullah Khawaja
2026-05-07 2:07 ` Baolu Lu
2026-05-07 18:47 ` Samiullah Khawaja
2026-05-18 14:01 ` Pranjal Shrivastava
2026-05-18 18:33 ` Samiullah Khawaja
2026-05-18 13:55 ` Pranjal Shrivastava
2026-05-18 18:44 ` Samiullah Khawaja
2026-04-27 17:56 ` [PATCH v2 05/16] iommu/pages: Add APIs to preserve/unpreserve/restore iommu pages Samiullah Khawaja
2026-05-18 14:23 ` Pranjal Shrivastava
2026-05-18 17:22 ` Samiullah Khawaja
2026-04-27 17:56 ` [PATCH v2 06/16] iommupt: Implement preserve/unpreserve/restore callbacks Samiullah Khawaja
2026-05-07 2:55 ` Baolu Lu
2026-05-07 18:40 ` Samiullah Khawaja
2026-05-19 13:15 ` Pranjal Shrivastava
2026-05-19 17:14 ` Samiullah Khawaja
2026-04-27 17:56 ` [PATCH v2 07/16] iommu/vt-d: Implement device and iommu preserve/unpreserve ops Samiullah Khawaja
2026-05-07 6:25 ` Baolu Lu
2026-05-08 2:36 ` Samiullah Khawaja
2026-05-18 20:32 ` Samiullah Khawaja
2026-05-19 14:40 ` Pranjal Shrivastava
2026-05-19 18:26 ` Samiullah Khawaja
2026-04-27 17:56 ` [PATCH v2 08/16] iommu: Add APIs to get iommu and device preserved state Samiullah Khawaja
2026-05-19 15:52 ` Pranjal Shrivastava
2026-05-20 17:24 ` Samiullah Khawaja
2026-04-27 17:56 ` [PATCH v2 09/16] iommu/vt-d: Restore IOMMU state and reclaimed domain ids Samiullah Khawaja
2026-05-07 9:05 ` Baolu Lu
2026-05-07 17:35 ` Samiullah Khawaja
2026-05-19 21:46 ` Pranjal Shrivastava
2026-05-20 18:02 ` Pranjal Shrivastava
2026-05-20 19:59 ` Samiullah Khawaja
2026-04-27 17:56 ` [PATCH v2 10/16] iommu: Restore and reattach preserved domains to devices Samiullah Khawaja
2026-05-07 13:54 ` Baolu Lu
2026-05-07 16:52 ` Samiullah Khawaja
2026-04-27 17:56 ` [PATCH v2 11/16] iommu/vt-d: preserve PASID table of preserved device Samiullah Khawaja
2026-05-08 6:05 ` Baolu Lu
2026-05-11 18:45 ` Samiullah Khawaja
2026-05-12 11:32 ` Baolu Lu
2026-05-19 22:35 ` Pranjal Shrivastava
2026-05-20 18:13 ` Samiullah Khawaja
2026-04-27 17:56 ` [PATCH v2 12/16] iommufd: Implement ioctl to mark HWPT for preservation Samiullah Khawaja
2026-05-19 23:05 ` Pranjal Shrivastava
2026-05-20 19:50 ` Samiullah Khawaja
2026-04-27 17:56 ` [PATCH v2 13/16] iommufd: Persist iommu hardware pagetables for live update Samiullah Khawaja
2026-05-20 0:00 ` Pranjal Shrivastava
2026-05-20 19:40 ` Samiullah Khawaja [this message]
2026-04-27 17:56 ` [PATCH v2 14/16] iommufd: Add APIs to preserve/unpreserve a vfio cdev Samiullah Khawaja
2026-05-20 0:46 ` Pranjal Shrivastava
2026-04-27 17:56 ` [PATCH v2 15/16] vfio/pci: Preserve the iommufd state of the " Samiullah Khawaja
2026-05-20 0:57 ` Pranjal Shrivastava
2026-05-20 19:54 ` Samiullah Khawaja
2026-04-27 17:56 ` [PATCH v2 16/16] iommufd/selftest: Add test to verify iommufd preservation Samiullah Khawaja
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ag4DzYZ4ToDcyfn3@google.com \
--to=skhawaja@google.com \
--cc=ajayachandra@nvidia.com \
--cc=akpm@linux-foundation.org \
--cc=alex@shazbot.org \
--cc=baolu.lu@linux.intel.com \
--cc=chrisl@kernel.org \
--cc=dmatlack@google.com \
--cc=dwmw2@infradead.org \
--cc=iommu@lists.linux.dev \
--cc=jgg@ziepe.ca \
--cc=joro@8bytes.org \
--cc=kevin.tian@intel.com \
--cc=kvm@vger.kernel.org \
--cc=leonro@nvidia.com \
--cc=linux-kernel@vger.kernel.org \
--cc=parav@nvidia.com \
--cc=pasha.tatashin@soleen.com \
--cc=praan@google.com \
--cc=pratyush@kernel.org \
--cc=robin.murphy@arm.com \
--cc=saeedm@nvidia.com \
--cc=shuah@kernel.org \
--cc=vipinsh@google.com \
--cc=will@kernel.org \
--cc=witu@nvidia.com \
--cc=zhuyifei@google.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox