From: Samiullah Khawaja <skhawaja@google.com>
To: Pranjal Shrivastava <praan@google.com>
Cc: David Woodhouse <dwmw2@infradead.org>,
Lu Baolu <baolu.lu@linux.intel.com>,
Joerg Roedel <joro@8bytes.org>, Will Deacon <will@kernel.org>,
Jason Gunthorpe <jgg@ziepe.ca>, YiFei Zhu <zhuyifei@google.com>,
Robin Murphy <robin.murphy@arm.com>,
Kevin Tian <kevin.tian@intel.com>,
Alex Williamson <alex@shazbot.org>,
Shuah Khan <shuah@kernel.org>,
iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
kvm@vger.kernel.org, Saeed Mahameed <saeedm@nvidia.com>,
Adithya Jayachandran <ajayachandra@nvidia.com>,
Parav Pandit <parav@nvidia.com>,
Leon Romanovsky <leonro@nvidia.com>,
William Tu <witu@nvidia.com>,
Pratyush Yadav <pratyush@kernel.org>,
Pasha Tatashin <pasha.tatashin@soleen.com>,
David Matlack <dmatlack@google.com>,
Andrew Morton <akpm@linux-foundation.org>,
Chris Li <chrisl@kernel.org>, Vipin Sharma <vipinsh@google.com>
Subject: Re: [PATCH 10/14] iommufd-lu: Implement ioctl to let userspace mark an HWPT to be preserved
Date: Wed, 25 Mar 2026 17:31:46 +0000 [thread overview]
Message-ID: <acQSuuvxiVtGhY7A@google.com> (raw)
In-Reply-To: <acPzMRNMy6kDUglF@google.com>
On Wed, Mar 25, 2026 at 02:37:37PM +0000, Pranjal Shrivastava wrote:
>On Tue, Feb 03, 2026 at 10:09:44PM +0000, Samiullah Khawaja wrote:
>> From: YiFei Zhu <zhuyifei@google.com>
>>
>> Userspace provides a token, which will then be used at restore to
>> identify this HWPT. The restoration logic is not implemented and will be
>> added later.
>>
>> Signed-off-by: YiFei Zhu <zhuyifei@google.com>
>> Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
>> ---
>> drivers/iommu/iommufd/Makefile | 1 +
>> drivers/iommu/iommufd/iommufd_private.h | 13 +++++++
>> drivers/iommu/iommufd/liveupdate.c | 49 +++++++++++++++++++++++++
>> drivers/iommu/iommufd/main.c | 2 +
>> include/uapi/linux/iommufd.h | 19 ++++++++++
>> 5 files changed, 84 insertions(+)
>> create mode 100644 drivers/iommu/iommufd/liveupdate.c
>>
>> diff --git a/drivers/iommu/iommufd/Makefile b/drivers/iommu/iommufd/Makefile
>> index 71d692c9a8f4..c3bf0b6452d3 100644
>> --- a/drivers/iommu/iommufd/Makefile
>> +++ b/drivers/iommu/iommufd/Makefile
>> @@ -17,3 +17,4 @@ obj-$(CONFIG_IOMMUFD_DRIVER) += iova_bitmap.o
>>
>> iommufd_driver-y := driver.o
>> obj-$(CONFIG_IOMMUFD_DRIVER_CORE) += iommufd_driver.o
>> +obj-$(CONFIG_IOMMU_LIVEUPDATE) += liveupdate.o
>> diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
>> index eb6d1a70f673..6424e7cea5b2 100644
>> --- a/drivers/iommu/iommufd/iommufd_private.h
>> +++ b/drivers/iommu/iommufd/iommufd_private.h
>> @@ -374,6 +374,10 @@ struct iommufd_hwpt_paging {
>> bool auto_domain : 1;
>> bool enforce_cache_coherency : 1;
>> bool nest_parent : 1;
>> +#ifdef CONFIG_IOMMU_LIVEUPDATE
>> + bool lu_preserve : 1;
>> + u32 lu_token;
>
>Did we downsize the token? Shouldn't this be u64 as everywhere else?
Note that this is different from the token that is used to preserve the
FD into LUO. This token is used to mark the HWPT for preservation, that
is it will be preserved when the FD is preserved.
I will add more text in the commit message to make it clear.
For consistency I will make it u64.
>
>> +#endif
>> /* Head at iommufd_ioas::hwpt_list */
>> struct list_head hwpt_item;
>> struct iommufd_sw_msi_maps present_sw_msi;
>> @@ -707,6 +711,15 @@ iommufd_get_vdevice(struct iommufd_ctx *ictx, u32 id)
>> struct iommufd_vdevice, obj);
>> }
>>
>> +#ifdef CONFIG_IOMMU_LIVEUPDATE
>> +int iommufd_hwpt_lu_set_preserve(struct iommufd_ucmd *ucmd);
>> +#else
>> +static inline int iommufd_hwpt_lu_set_preserve(struct iommufd_ucmd *ucmd)
>> +{
>> + return -ENOTTY;
>> +}
>> +#endif
>> +
>> #ifdef CONFIG_IOMMUFD_TEST
>> int iommufd_test(struct iommufd_ucmd *ucmd);
>> void iommufd_selftest_destroy(struct iommufd_object *obj);
>> diff --git a/drivers/iommu/iommufd/liveupdate.c b/drivers/iommu/iommufd/liveupdate.c
>> new file mode 100644
>> index 000000000000..ae74f5b54735
>> --- /dev/null
>> +++ b/drivers/iommu/iommufd/liveupdate.c
>> @@ -0,0 +1,49 @@
>> +// SPDX-License-Identifier: GPL-2.0-only
>> +
>> +#define pr_fmt(fmt) "iommufd: " fmt
>> +
>> +#include <linux/file.h>
>> +#include <linux/iommufd.h>
>> +#include <linux/liveupdate.h>
>> +
>> +#include "iommufd_private.h"
>> +
>> +int iommufd_hwpt_lu_set_preserve(struct iommufd_ucmd *ucmd)
>> +{
>> + struct iommu_hwpt_lu_set_preserve *cmd = ucmd->cmd;
>> + struct iommufd_hwpt_paging *hwpt_target, *hwpt;
>> + struct iommufd_ctx *ictx = ucmd->ictx;
>> + struct iommufd_object *obj;
>> + unsigned long index;
>> + int rc = 0;
>> +
>> + hwpt_target = iommufd_get_hwpt_paging(ucmd, cmd->hwpt_id);
>> + if (IS_ERR(hwpt_target))
>> + return PTR_ERR(hwpt_target);
>> +
>> + xa_lock(&ictx->objects);
>> + xa_for_each(&ictx->objects, index, obj) {
>> + if (obj->type != IOMMUFD_OBJ_HWPT_PAGING)
>> + continue;
>
>Couldn't these be HWPT_NESTED? Are we explicitly skipping HWPT_NESTED
>here? ARM SMMUv3 heavily relies on IOMMU_DOMAIN_NESTED to back vIOMMUs
>and hold critical guest translation state. We'd need to support
>HWPT_NESTED for arm-smmu-v3.
For this series, I am not handling the NESTED and vIOMMU usecases. I
will be sending a separate series to handle those, this is mentioned in
cover letter also in the Future work.
Will add a note in commit message also.
>
>> +
>> + hwpt = container_of(obj, struct iommufd_hwpt_paging, common.obj);
>> +
>> + if (hwpt == hwpt_target)
>> + continue;
>> + if (!hwpt->lu_preserve)
>> + continue;
>> + if (hwpt->lu_token == cmd->hwpt_token) {
>> + rc = -EADDRINUSE;
>> + goto out;
>> + }
>
>I see that this entire loop is to avoid collisions but could we improve
>this? We are doing an O(N) linear search over the entire ictx->objects
>xarray while holding xa_lock on every setup call.
>
>If the kernel requires a strict 1:1 mapping of lu_token to hwpt,
>wouldn't it be much better to track these in a dedicated xarray?
>
>Just thinking out loud, if we added a dedicated lu_tokens xarray to
>iommufd_ctx, we could drop the linear search and the lock entirely,
>letting the xarray handle the collision natively like this:
>
> rc = xa_insert(&ictx->lu_tokens, cmd->hwpt_token, hwpt_target, GFP_KERNEL);
> if (rc == -EBUSY) {
> rc = -EADDRINUSE;
> goto out;
> } else if (rc) {
> goto out;
> }
>
>This ensures instant collision detection without iterating the global
>object pool. When the HWPT is eventually destroyed (or un-preserved), we
>simply call xa_erase(&ictx->lu_tokens, hwpt->lu_token).
Agreed. We can call xa_erase when it is destroyed. This can also be used
during actual preservation without taking the objects lock.
>
>> + }
>> +
>> + hwpt_target->lu_preserve = true;
>
>I don't see a way to unset hwpt->lu_preserve once it's been set. What if
>a VMM marks a HWPT for preservation, but then the guest decides to rmmod
>the device before the actual kexec? The VMM would need a way to
>unpreserve it so we don't carry stale state across the live update?
>
>Are we relying on the VMM to always call IOMMU_DESTROY on that HWPT when
>it's no longer needed for preservation? A clever VMM optimizing for perf
>might just pool or cache detached HWPTs for future reuse. If that HWPT
>goes back into a free pool and gets re-attached to a new device later,
>the sticky lu_preserve state will inadvertently leak across the kexec..
As mentioned earlier, the HWPT is not being preserved in this call. So
when VMM dies or rmmod happens, this HWPT will be destroyed following
the normal flow.
I will add this in commit message.
>
>> + hwpt_target->lu_token = cmd->hwpt_token;
>> +
>> +out:
>> + xa_unlock(&ictx->objects);
>> + iommufd_put_object(ictx, &hwpt_target->common.obj);
>> + return rc;
>> +}
>> +
>> diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
>> index 5cc4b08c25f5..e1a9b3051f65 100644
>> --- a/drivers/iommu/iommufd/main.c
>> +++ b/drivers/iommu/iommufd/main.c
>> @@ -493,6 +493,8 @@ static const struct iommufd_ioctl_op iommufd_ioctl_ops[] = {
>> __reserved),
>> IOCTL_OP(IOMMU_VIOMMU_ALLOC, iommufd_viommu_alloc_ioctl,
>> struct iommu_viommu_alloc, out_viommu_id),
>> + IOCTL_OP(IOMMU_HWPT_LU_SET_PRESERVE, iommufd_hwpt_lu_set_preserve,
>> + struct iommu_hwpt_lu_set_preserve, hwpt_token),
>> #ifdef CONFIG_IOMMUFD_TEST
>> IOCTL_OP(IOMMU_TEST_CMD, iommufd_test, struct iommu_test_cmd, last),
>> #endif
>> diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
>> index 2c41920b641d..25d8cff987eb 100644
>> --- a/include/uapi/linux/iommufd.h
>> +++ b/include/uapi/linux/iommufd.h
>> @@ -57,6 +57,7 @@ enum {
>> IOMMUFD_CMD_IOAS_CHANGE_PROCESS = 0x92,
>> IOMMUFD_CMD_VEVENTQ_ALLOC = 0x93,
>> IOMMUFD_CMD_HW_QUEUE_ALLOC = 0x94,
>> + IOMMUFD_CMD_HWPT_LU_SET_PRESERVE = 0x95,
>> };
>>
>> /**
>> @@ -1299,4 +1300,22 @@ struct iommu_hw_queue_alloc {
>> __aligned_u64 length;
>> };
>> #define IOMMU_HW_QUEUE_ALLOC _IO(IOMMUFD_TYPE, IOMMUFD_CMD_HW_QUEUE_ALLOC)
>> +
>> +/**
>> + * struct iommu_hwpt_lu_set_preserve - ioctl(IOMMU_HWPT_LU_SET_PRESERVE)
>
>Nit: The IOCTL is called "IOMMU_HWPT_LU_SET_PRESERVE" which subtly
>implies the existence of a "GET_PRESERVE". Should we perhaps just call
>it IOMMU_HWPT_LU_PRESERVE?
LU_PRESERVE would imply that it is being preserved. Maybe
"IOMMU_HWPT_LU_MARK_PRESERVE"?
>
>> + * @size: sizeof(struct iommu_hwpt_lu_set_preserve)
>> + * @hwpt_id: Iommufd object ID of the target HWPT
>> + * @hwpt_token: Token to identify this hwpt upon restore
>> + *
>> + * The target HWPT will be preserved during iommufd preservation.
>> + *
>> + * The hwpt_token is provided by userspace. If userspace enters a token
>> + * already in use within this iommufd, -EADDRINUSE is returned from this ioctl.
>> + */
>> +struct iommu_hwpt_lu_set_preserve {
>> + __u32 size;
>> + __u32 hwpt_id;
>> + __u32 hwpt_token;
>> +};
>
>Nit: Let's make sure we follow the 64-bit alignment as enforced in the
>rest of this file, note the __u32 __reserved fields in existing IOCTL
>structs.
Agreed. Will update
>
>> +#define IOMMU_HWPT_LU_SET_PRESERVE _IO(IOMMUFD_TYPE, IOMMUFD_CMD_HWPT_LU_SET_PRESERVE)
>> #endif
>
>Thanks,
>Praan
next prev parent reply other threads:[~2026-03-25 17:31 UTC|newest]
Thread overview: 98+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-02-03 22:09 [PATCH 00/14] iommu: Add live update state preservation Samiullah Khawaja
2026-02-03 22:09 ` [PATCH 01/14] iommu: Implement IOMMU LU FLB callbacks Samiullah Khawaja
2026-03-11 21:07 ` Pranjal Shrivastava
2026-03-12 16:43 ` Samiullah Khawaja
2026-03-12 23:43 ` Pranjal Shrivastava
2026-03-13 16:47 ` Samiullah Khawaja
2026-03-13 15:36 ` Pranjal Shrivastava
2026-03-13 16:58 ` Samiullah Khawaja
2026-03-16 22:54 ` Vipin Sharma
2026-03-17 1:06 ` Samiullah Khawaja
2026-03-23 23:27 ` Vipin Sharma
2026-02-03 22:09 ` [PATCH 02/14] iommu: Implement IOMMU core liveupdate skeleton Samiullah Khawaja
2026-03-12 23:10 ` Pranjal Shrivastava
2026-03-13 18:42 ` Samiullah Khawaja
2026-03-17 20:09 ` Pranjal Shrivastava
2026-03-17 20:13 ` Samiullah Khawaja
2026-03-17 20:23 ` Pranjal Shrivastava
2026-03-17 21:03 ` Vipin Sharma
2026-03-18 18:51 ` Pranjal Shrivastava
2026-03-18 17:49 ` Samiullah Khawaja
2026-03-17 19:58 ` Vipin Sharma
2026-03-17 20:33 ` Samiullah Khawaja
2026-03-24 19:06 ` Vipin Sharma
2026-03-24 19:45 ` Samiullah Khawaja
2026-02-03 22:09 ` [PATCH 03/14] liveupdate: luo_file: Add internal APIs for file preservation Samiullah Khawaja
2026-03-18 10:00 ` Pranjal Shrivastava
2026-03-18 16:54 ` Samiullah Khawaja
2026-02-03 22:09 ` [PATCH 04/14] iommu/pages: Add APIs to preserve/unpreserve/restore iommu pages Samiullah Khawaja
2026-03-03 16:42 ` Ankit Soni
2026-03-03 18:41 ` Samiullah Khawaja
2026-03-20 17:27 ` Pranjal Shrivastava
2026-03-20 18:12 ` Samiullah Khawaja
2026-03-17 20:59 ` Vipin Sharma
2026-03-20 9:28 ` Pranjal Shrivastava
2026-03-20 18:27 ` Samiullah Khawaja
2026-03-20 11:01 ` Pranjal Shrivastava
2026-03-20 18:56 ` Samiullah Khawaja
2026-02-03 22:09 ` [PATCH 05/14] iommupt: Implement preserve/unpreserve/restore callbacks Samiullah Khawaja
2026-03-20 21:57 ` Pranjal Shrivastava
2026-03-23 16:41 ` Samiullah Khawaja
2026-02-03 22:09 ` [PATCH 06/14] iommu/vt-d: Implement device and iommu preserve/unpreserve ops Samiullah Khawaja
2026-03-19 16:04 ` Vipin Sharma
2026-03-19 16:27 ` Samiullah Khawaja
2026-03-20 23:01 ` Pranjal Shrivastava
2026-03-21 13:27 ` Pranjal Shrivastava
2026-03-23 18:32 ` Samiullah Khawaja
2026-02-03 22:09 ` [PATCH 07/14] iommu/vt-d: Restore IOMMU state and reclaimed domain ids Samiullah Khawaja
2026-03-19 20:54 ` Vipin Sharma
2026-03-20 1:05 ` Samiullah Khawaja
2026-03-22 19:51 ` Pranjal Shrivastava
2026-03-23 19:33 ` Samiullah Khawaja
2026-02-03 22:09 ` [PATCH 08/14] iommu: Restore and reattach preserved domains to devices Samiullah Khawaja
2026-03-10 5:16 ` Ankit Soni
2026-03-10 21:47 ` Samiullah Khawaja
2026-03-22 21:59 ` Pranjal Shrivastava
2026-03-23 18:02 ` Samiullah Khawaja
2026-02-03 22:09 ` [PATCH 09/14] iommu/vt-d: preserve PASID table of preserved device Samiullah Khawaja
2026-03-23 18:19 ` Pranjal Shrivastava
2026-03-23 18:51 ` Samiullah Khawaja
2026-02-03 22:09 ` [PATCH 10/14] iommufd-lu: Implement ioctl to let userspace mark an HWPT to be preserved Samiullah Khawaja
2026-03-19 23:35 ` Vipin Sharma
2026-03-20 0:40 ` Samiullah Khawaja
2026-03-20 23:34 ` Vipin Sharma
2026-03-23 16:24 ` Samiullah Khawaja
2026-03-25 14:37 ` Pranjal Shrivastava
2026-03-25 17:31 ` Samiullah Khawaja [this message]
2026-03-25 18:55 ` Pranjal Shrivastava
2026-03-25 20:19 ` Samiullah Khawaja
2026-03-25 20:36 ` Pranjal Shrivastava
2026-03-25 20:46 ` Samiullah Khawaja
2026-02-03 22:09 ` [PATCH 11/14] iommufd-lu: Persist iommu hardware pagetables for live update Samiullah Khawaja
2026-02-25 23:47 ` Samiullah Khawaja
2026-03-03 5:56 ` Ankit Soni
2026-03-03 18:51 ` Samiullah Khawaja
2026-03-23 20:28 ` Vipin Sharma
2026-03-23 21:34 ` Samiullah Khawaja
2026-03-25 20:08 ` Pranjal Shrivastava
2026-03-25 20:32 ` Samiullah Khawaja
2026-02-03 22:09 ` [PATCH 12/14] iommufd: Add APIs to preserve/unpreserve a vfio cdev Samiullah Khawaja
2026-03-23 20:59 ` Vipin Sharma
2026-03-23 21:38 ` Samiullah Khawaja
2026-03-25 20:24 ` Pranjal Shrivastava
2026-03-25 20:41 ` Samiullah Khawaja
2026-03-25 21:23 ` Pranjal Shrivastava
2026-03-26 0:16 ` Samiullah Khawaja
2026-02-03 22:09 ` [PATCH 13/14] vfio/pci: Preserve the iommufd state of the " Samiullah Khawaja
2026-02-17 4:18 ` Ankit Soni
2026-03-03 18:35 ` Samiullah Khawaja
2026-03-23 21:17 ` Vipin Sharma
2026-03-23 22:07 ` Samiullah Khawaja
2026-03-24 20:30 ` Vipin Sharma
2026-03-25 20:55 ` Pranjal Shrivastava
2026-02-03 22:09 ` [PATCH 14/14] iommufd/selftest: Add test to verify iommufd preservation Samiullah Khawaja
2026-03-23 22:18 ` Vipin Sharma
2026-03-27 18:32 ` Samiullah Khawaja
2026-03-25 21:05 ` Pranjal Shrivastava
2026-03-27 18:25 ` Samiullah Khawaja
2026-03-27 18:40 ` Samiullah Khawaja
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=acQSuuvxiVtGhY7A@google.com \
--to=skhawaja@google.com \
--cc=ajayachandra@nvidia.com \
--cc=akpm@linux-foundation.org \
--cc=alex@shazbot.org \
--cc=baolu.lu@linux.intel.com \
--cc=chrisl@kernel.org \
--cc=dmatlack@google.com \
--cc=dwmw2@infradead.org \
--cc=iommu@lists.linux.dev \
--cc=jgg@ziepe.ca \
--cc=joro@8bytes.org \
--cc=kevin.tian@intel.com \
--cc=kvm@vger.kernel.org \
--cc=leonro@nvidia.com \
--cc=linux-kernel@vger.kernel.org \
--cc=parav@nvidia.com \
--cc=pasha.tatashin@soleen.com \
--cc=praan@google.com \
--cc=pratyush@kernel.org \
--cc=robin.murphy@arm.com \
--cc=saeedm@nvidia.com \
--cc=shuah@kernel.org \
--cc=vipinsh@google.com \
--cc=will@kernel.org \
--cc=witu@nvidia.com \
--cc=zhuyifei@google.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox