The Linux Kernel Mailing List
 help / color / mirror / Atom feed
* Re: [PATCH v2 03/16] iommu: Implement IOMMU domain preservation
       [not found]   ` <afUkbz0IQh98siYn@google.com>
@ 2026-05-04 18:33     ` Samiullah Khawaja
  0 siblings, 0 replies; 14+ messages in thread
From: Samiullah Khawaja @ 2026-05-04 18:33 UTC (permalink / raw)
  To: David Matlack
  Cc: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe, Robin Murphy, Kevin Tian, Alex Williamson,
	Shuah Khan, iommu, linux-kernel, kvm, Saeed Mahameed,
	Adithya Jayachandran, Parav Pandit, Leon Romanovsky, William Tu,
	Pratyush Yadav, Pasha Tatashin, Andrew Morton, Chris Li,
	Pranjal Shrivastava, Vipin Sharma, YiFei Zhu

On Fri, May 01, 2026 at 10:08:47PM +0000, David Matlack wrote:
>On 2026-04-27 05:56 PM, Samiullah Khawaja wrote:
>> Add IOMMU domain ops that can be implemented by the IOMMU drivers if
>> they support IOMMU domain preservation across liveupdate. The new IOMMU
>> domain preserve, unpreserve and restore APIs call these ops to perform
>> respective live update operations.
>>
>> Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
>
>> +static int alloc_object_ser(struct iommu_array_hdr_ser **curr_array_ptr, u64 max_objs)
>> +{
>> +	struct iommu_array_hdr_ser *curr_array = *curr_array_ptr;
>> +	struct iommu_array_hdr_ser *next_array;
>> +
>
>There's a trade-off being made in this function to leak deleted array
>elements instead of trying to reuse them that warrants a comment.

Are you referring to the "deleted" flag in the HDR of each object? Yes
we don't look for deleted objects in linked-list of arrays and reuse
them for simplicity. I will add a comment that says this.
>
>> +	if (curr_array->nr_objects >= max_objs) {
>> +		next_array = kho_alloc_preserve(PAGE_SIZE);
>> +		if (IS_ERR(next_array))
>> +			return PTR_ERR(next_array);
>> +
>> +		curr_array->next_array_phys = virt_to_phys(next_array);
>> +		*curr_array_ptr = next_array;
>> +		curr_array = next_array;
>> +	}
>> +
>> +	return curr_array->nr_objects++;
>> +}

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 04/16] iommu: Implement device and IOMMU HW preservation
       [not found]   ` <afUscSeXKnNAJ0ZF@google.com>
@ 2026-05-04 19:06     ` Samiullah Khawaja
  0 siblings, 0 replies; 14+ messages in thread
From: Samiullah Khawaja @ 2026-05-04 19:06 UTC (permalink / raw)
  To: David Matlack
  Cc: David Woodhouse, Lu Baolu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe, Robin Murphy, Kevin Tian, Alex Williamson,
	Shuah Khan, iommu, linux-kernel, kvm, Saeed Mahameed,
	Adithya Jayachandran, Parav Pandit, Leon Romanovsky, William Tu,
	Pratyush Yadav, Pasha Tatashin, Andrew Morton, Chris Li,
	Pranjal Shrivastava, Vipin Sharma, YiFei Zhu

On Fri, May 01, 2026 at 10:42:57PM +0000, David Matlack wrote:
>On 2026-04-27 05:56 PM, Samiullah Khawaja wrote:
>> Add IOMMU ops to preserve/unpreserve a device. These can be implemented
>
>Can you make this comment more specific about what is being preserved?
>Saying it preserves a device is vague and maybe even misleading. It's
>more about  about preserving a device's attachment to a specific domain
>correct?

There is attachment ID, but the preservation of device can have IOMMU
driver specific things, so in core I mostly mention "preseve device
specific state". In later patches in this series, we save PASID table
using the same callback. I will add more details in the commit message.
>
>> by the IOMMU drivers that support preservation of devices that have
>> their IOMMU domains preserved. During device preservation the state of
>> the associated IOMMU is also preserved as dependency.
>>
>> Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
>
>> diff --git a/drivers/iommu/liveupdate.c b/drivers/iommu/liveupdate.c
>> index f71f14518248..765d042e22e3 100644
>> --- a/drivers/iommu/liveupdate.c
>> +++ b/drivers/iommu/liveupdate.c
>
>> +static struct iommu_device_ser *alloc_iommu_device_ser(struct iommu_flb_obj *flb)
>
>It is unforunate that struct iommu_device_ser has nothing to do with
>struct iommu_device. The former represents an a PCI device, while the
>latter represents an IOMMU.

Yes, I went through various iterations of trying to name it in a
different way but keeping the "iommu_" prefix and the "device" state
naturally falls into this. Not sure if iommu_pci_device_ser is suitable?
>
>> diff --git a/include/linux/iommu-liveupdate.h b/include/linux/iommu-liveupdate.h
>> index 6019cfc27428..279c7ab04f09 100644
>> --- a/include/linux/iommu-liveupdate.h
>> +++ b/include/linux/iommu-liveupdate.h
>
>>  int iommu_domain_preserve(struct iommu_domain *domain, struct iommu_domain_ser **ser);
>>  void iommu_domain_unpreserve(struct iommu_domain *domain);
>> +int iommu_preserve_device(struct iommu_domain *domain,
>> +			  struct device *dev, u64 *preserved_state);
>> +void iommu_unpreserve_device(struct iommu_domain *domain, struct device *dev);
>
>The naming scheme is inconsistent... Maybe it can be:
>
>  iommu_preserve_domain()
>  iommu_unpreserve_domain()
>  iommu_preserve_device() or iommu_preserve_device_attachment()
>  iommu_unpreserve_device() or iommu_unpreserve_device_attachment()

Agreed, but I am trying to follow the already existing naming scheme
for domains that is used for APIs in iommu.c

iommu_domain_free()
iommu_domain_init()
iommu_domain_preserve()

But I think this is rare, I will update to this as you mentioned above:

iommu_preserve_domain()
iommu_unpreserve_domain()
iommu_preserve_device()
iommu_unpreserve_device()
>
>> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
>> index 3853a3946733..1c424b32c5fc 100644
>> --- a/include/linux/iommu.h
>> +++ b/include/linux/iommu.h
>
>> +#ifdef CONFIG_IOMMU_LIVEUPDATE
>> +	int (*preserve_device)(struct device *dev, struct iommu_device_ser *device_ser);
>> +	void (*unpreserve_device)(struct device *dev, struct iommu_device_ser *device_ser);
>> +	int (*preserve)(struct iommu_device *iommu, struct iommu_hw_ser *iommu_ser);
>> +	void (*unpreserve)(struct iommu_device *iommu, struct iommu_hw_ser *iommu_ser);
>> +#endif
>
>Maybe we can make these names a little more specific:
>
>  preserve_device_attachment()
>  unpreserve_device_attachment()

Attachment is too specific. See my comment above.
>  preserve_iommu()
>  unpreserve_iommu()

These are part of iommu_ops and having preserve_iommu() instead of
preserve() is redundant I think. Note ops like capable(), hw_info() in
the same struct.
>
>?

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 04/16] iommu: Implement device and IOMMU HW preservation
       [not found] ` <20260427175633.1978233-5-skhawaja@google.com>
       [not found]   ` <afUscSeXKnNAJ0ZF@google.com>
@ 2026-05-07  2:07   ` Baolu Lu
  2026-05-07 18:47     ` Samiullah Khawaja
  1 sibling, 1 reply; 14+ messages in thread
From: Baolu Lu @ 2026-05-07  2:07 UTC (permalink / raw)
  To: Samiullah Khawaja, David Woodhouse, Joerg Roedel, Will Deacon,
	Jason Gunthorpe
  Cc: Robin Murphy, Kevin Tian, Alex Williamson, Shuah Khan, iommu,
	linux-kernel, kvm, Saeed Mahameed, Adithya Jayachandran,
	Parav Pandit, Leon Romanovsky, William Tu, Pratyush Yadav,
	Pasha Tatashin, David Matlack, Andrew Morton, Chris Li,
	Pranjal Shrivastava, Vipin Sharma, YiFei Zhu

On 4/28/26 01:56, Samiullah Khawaja wrote:
> Add IOMMU ops to preserve/unpreserve a device. These can be implemented
> by the IOMMU drivers that support preservation of devices that have
> their IOMMU domains preserved. During device preservation the state of
> the associated IOMMU is also preserved as dependency.
> 
> Signed-off-by: Samiullah Khawaja<skhawaja@google.com>
> ---
>   drivers/iommu/liveupdate.c       | 162 +++++++++++++++++++++++++++++++
>   include/linux/iommu-liveupdate.h |  33 +++++++
>   include/linux/iommu.h            |  20 ++++
>   3 files changed, 215 insertions(+)
> 
> diff --git a/drivers/iommu/liveupdate.c b/drivers/iommu/liveupdate.c
> index f71f14518248..765d042e22e3 100644
> --- a/drivers/iommu/liveupdate.c
> +++ b/drivers/iommu/liveupdate.c
> @@ -11,6 +11,7 @@
>   #include <linux/liveupdate.h>
>   #include <linux/iommu-liveupdate.h>
>   #include <linux/iommu.h>
> +#include <linux/pci.h>
>   #include <linux/errno.h>
>   
>   #define iommu_max_objs_per_page(_array) \
> @@ -293,3 +294,164 @@ void iommu_domain_unpreserve(struct iommu_domain *domain)
>   	domain->preserved_state = NULL;
>   }
>   EXPORT_SYMBOL_GPL(iommu_domain_unpreserve);
> +
> +static struct iommu_hw_ser *alloc_iommu_hw_ser(struct iommu_flb_obj *flb)
> +{
> +	int idx;
> +
> +	idx = alloc_object_ser((struct iommu_array_hdr_ser **)&flb->curr_iommu_array,
> +			       iommu_max_objs_per_page(flb->curr_iommu_array));
> +	if (idx < 0)
> +		return ERR_PTR(idx);
> +
> +	flb->curr_iommu_array->objects[idx].hdr.ref_count = 1;
> +	return &flb->curr_iommu_array->objects[idx];
> +}
> +
> +static int iommu_preserve_locked(struct iommu_device *iommu,
> +				 struct iommu_flb_obj *flb_obj)
> +{
> +	struct iommu_hw_ser *iommu_hw_ser;
> +	int ret;
> +
> +	if (!iommu->ops->preserve)
> +		return -EOPNOTSUPP;
> +
> +	lockdep_assert_held(&flb_obj->lock);
> +	if (iommu->outgoing_preserved_state) {
> +		iommu->outgoing_preserved_state->hdr.ref_count++;
> +		return 0;
> +	}
> +
> +	iommu_hw_ser = alloc_iommu_hw_ser(flb_obj);
> +	if (IS_ERR(iommu_hw_ser))
> +		return PTR_ERR(iommu_hw_ser);
> +
> +	ret = iommu->ops->preserve(iommu, iommu_hw_ser);
> +	if (ret) {
> +		iommu_hw_ser->hdr.deleted = true;
> +		return ret;
> +	}
> +
> +	iommu->outgoing_preserved_state = iommu_hw_ser;
> +	return ret;
> +}
> +
> +static void iommu_unpreserve_locked(struct iommu_device *iommu,
> +				    struct iommu_flb_obj *flb_obj)
> +{
> +	struct iommu_hw_ser *iommu_hw_ser = iommu->outgoing_preserved_state;
> +
> +	lockdep_assert_held(&flb_obj->lock);
> +	iommu_hw_ser->hdr.ref_count--;
> +	if (iommu_hw_ser->hdr.ref_count)
> +		return;
> +
> +	iommu->outgoing_preserved_state = NULL;
> +	iommu->ops->unpreserve(iommu, iommu_hw_ser);
> +	iommu_hw_ser->hdr.deleted = true;
> +}
> +
> +static struct iommu_device_ser *alloc_iommu_device_ser(struct iommu_flb_obj *flb)
> +{
> +	int idx;
> +
> +	idx = alloc_object_ser((struct iommu_array_hdr_ser **)&flb->curr_device_array,
> +			       iommu_max_objs_per_page(flb->curr_device_array));
> +	if (idx < 0)
> +		return ERR_PTR(idx);
> +
> +	flb->curr_device_array->objects[idx].hdr.ref_count = 1;
> +	return &flb->curr_device_array->objects[idx];
> +}
> +
> +int iommu_preserve_device(struct iommu_domain *domain,
> +			  struct device *dev, u64 *preserved_state)
> +{
> +	struct iommu_flb_obj *flb_obj;
> +	struct iommu_device_ser *device_ser;
> +	struct dev_iommu *iommu;
> +	struct pci_dev *pdev;
> +	int ret;
> +
> +	if (!dev_is_pci(dev))
> +		return -EOPNOTSUPP;
> +
> +	if (!domain->preserved_state)
> +		return -EINVAL;
> +
> +	if (!iommu_group_dma_owner_claimed(dev->iommu_group))
> +		return -EINVAL;
> +
> +	pdev = to_pci_dev(dev);
> +	iommu = dev->iommu;
> +	if (!iommu->iommu_dev->ops->preserve_device ||
> +	    !iommu->iommu_dev->ops->preserve)
> +		return -EOPNOTSUPP;
> +
> +	ret = liveupdate_flb_get_outgoing(&iommu_flb, (void **)&flb_obj);
> +	if (ret)
> +		return ret;
> +
> +	guard(mutex)(&flb_obj->lock);
> +	device_ser = alloc_iommu_device_ser(flb_obj);
> +	if (IS_ERR(device_ser))
> +		return PTR_ERR(device_ser);
> +
> +	ret = iommu_preserve_locked(iommu->iommu_dev, flb_obj);
> +	if (ret) {
> +		device_ser->hdr.deleted = true;
> +		return ret;
> +	}
> +
> +	device_ser->domain_iommu_ser.domain_phys = __pa(domain->preserved_state);
> +	device_ser->domain_iommu_ser.iommu_phys = __pa(iommu->iommu_dev->outgoing_preserved_state);
> +	device_ser->devid = pci_dev_id(pdev);
> +	device_ser->pci_domain_nr = pci_domain_nr(pdev->bus);
> +
> +	ret = iommu->iommu_dev->ops->preserve_device(dev, device_ser);
> +	if (ret) {
> +		device_ser->hdr.deleted = true;
> +		iommu_unpreserve_locked(iommu->iommu_dev, flb_obj);
> +		return ret;
> +	}
> +
> +	dev->iommu->device_ser = device_ser;
> +	*preserved_state = virt_to_phys(device_ser);
> +	return 0;
> +}
> +
> +void iommu_unpreserve_device(struct iommu_domain *domain, struct device *dev)
> +{
> +	struct iommu_flb_obj *flb_obj;
> +	struct iommu_device_ser *iommu_device_ser;
> +	struct dev_iommu *iommu;
> +	struct pci_dev *pdev;
> +	int ret;
> +
> +	if (!dev_is_pci(dev))
> +		return;
> +
> +	if (!iommu_group_dma_owner_claimed(dev->iommu_group))
> +		return;
> +
> +	pdev = to_pci_dev(dev);
> +	iommu = dev->iommu;
> +	if (!iommu->iommu_dev->ops->unpreserve_device ||
> +	    !iommu->iommu_dev->ops->unpreserve)
> +		return;

Is it considered a driver bug if it implements the preserve hooks but
not unpreserve ones? This would at least cause a silent memory leak. How
about adding a WARN like this?

     if (WARN_ON_ONCE(!iommu->iommu_dev->ops->unpreserve_device ||
                      !iommu->iommu_dev->ops->unpreserve))
         return;

?

Thanks,
baolu

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 06/16] iommupt: Implement preserve/unpreserve/restore callbacks
       [not found] ` <20260427175633.1978233-7-skhawaja@google.com>
@ 2026-05-07  2:55   ` Baolu Lu
  2026-05-07 18:40     ` Samiullah Khawaja
  0 siblings, 1 reply; 14+ messages in thread
From: Baolu Lu @ 2026-05-07  2:55 UTC (permalink / raw)
  To: Samiullah Khawaja, David Woodhouse, Joerg Roedel, Will Deacon,
	Jason Gunthorpe
  Cc: Robin Murphy, Kevin Tian, Alex Williamson, Shuah Khan, iommu,
	linux-kernel, kvm, Saeed Mahameed, Adithya Jayachandran,
	Parav Pandit, Leon Romanovsky, William Tu, Pratyush Yadav,
	Pasha Tatashin, David Matlack, Andrew Morton, Chris Li,
	Pranjal Shrivastava, Vipin Sharma, YiFei Zhu

On 4/28/26 01:56, Samiullah Khawaja wrote:
> Implement the iommu domain ops for presevation, unpresevation and
> restoration of iommu domains for liveupdate. Use the existing page
> walker to preserve the ioptdesc of the top_table and the lower tables.
> 
> Preserve top_level, VASZ and FEAT Sign Extended to restore the domain in
> the next kernel. On restore the domain has only the preserved features
> enabled and all the other features are zeroed. This is ok since the
> restored domain is made immutable and can only be freed. A kunit test is
> added to verify that the IOMMU domain free can be done with trimmed
> features.
> 
> Signed-off-by: Samiullah Khawaja<skhawaja@google.com>
> ---
>   drivers/iommu/generic_pt/iommu_pt.h       | 131 ++++++++++++++++++++++
>   drivers/iommu/generic_pt/kunit_iommu_pt.h |  28 +++++
>   include/linux/generic_pt/iommu.h          |  19 +++-
>   3 files changed, 177 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/iommu/generic_pt/iommu_pt.h b/drivers/iommu/generic_pt/iommu_pt.h
> index 19b6daf88f2a..7bca827e3a55 100644
> --- a/drivers/iommu/generic_pt/iommu_pt.h
> +++ b/drivers/iommu/generic_pt/iommu_pt.h
> @@ -961,6 +961,133 @@ static int NS(map_range)(struct pt_iommu *iommu_table, dma_addr_t iova,
>   	return ret;
>   }
>   
> +#ifdef CONFIG_IOMMU_LIVEUPDATE
> +/**
> + * unpreserve() - Unpreserve page tables and other state of a domain.
> + * @domain: Domain to unpreserve
> + */
> +void DOMAIN_NS(unpreserve)(struct iommu_domain *domain, struct iommu_domain_ser *ser)
> +{
> +	struct pt_iommu *iommu_table =
> +		container_of(domain, struct pt_iommu, domain);
> +	struct pt_common *common = common_from_iommu(iommu_table);
> +	struct pt_range range = pt_all_range(common);
> +	struct pt_iommu_collect_args collect = {
> +		.free_list = IOMMU_PAGES_LIST_INIT(collect.free_list),
> +	};
> +
> +	iommu_pages_list_add(&collect.free_list, range.top_table);
> +	pt_walk_range(&range, __collect_tables, &collect);
> +
> +	iommu_unpreserve_pages(&collect.free_list);
> +}
> +EXPORT_SYMBOL_NS_GPL(DOMAIN_NS(unpreserve), "GENERIC_PT_IOMMU");
> +
> +/**
> + * preserve() - Preserve page tables and other state of a domain.
> + * @domain: Domain to preserve
> + *
> + * Returns: -ERRNO on failure, 0 on success.
> + */
> +int DOMAIN_NS(preserve)(struct iommu_domain *domain, struct iommu_domain_ser *ser)
> +{
> +	struct pt_iommu *iommu_table =
> +		container_of(domain, struct pt_iommu, domain);
> +	struct pt_common *common = common_from_iommu(iommu_table);
> +	struct pt_range range = pt_all_range(common);
> +	struct pt_iommu_collect_args collect = {
> +		.free_list = IOMMU_PAGES_LIST_INIT(collect.free_list),
> +	};
> +	int ret;
> +
> +	iommu_pages_list_add(&collect.free_list, range.top_table);
> +	pt_walk_range(&range, __collect_tables, &collect);
> +
> +	ret = iommu_preserve_pages(&collect.free_list);
> +	if (ret)
> +		return ret;
> +
> +	ser->top_table_phys = virt_to_phys(range.top_table);
> +	ser->top_level = range.top_level;
> +
> +	/*
> +	 * VASZ and SIGN_EXTEND will be needed in next kernel for collector page
> +	 * table walk to restore and free pages.
> +	 */
> +	ser->vasz = common->max_vasz_lg2;
> +	ser->sign_extend = pt_feature(common, PT_FEAT_SIGN_EXTEND);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_NS_GPL(DOMAIN_NS(preserve), "GENERIC_PT_IOMMU");
> +
> +static int __restore_tables(struct pt_range *range, void *arg,
> +			    unsigned int level, struct pt_table_p *table)
> +{
> +	struct pt_state pts = pt_init(range, level, table);
> +	int ret;
> +
> +	for_each_pt_level_entry(&pts) {
> +		if (pts.type == PT_ENTRY_TABLE) {
> +			iommu_restore_page(virt_to_phys(pts.table_lower));
> +
> +			/*
> +			 * pt_descend can only fail if pts.table_lower is not
> +			 * init. So the if statement below is dead code.
> +			 */
> +			ret = pt_descend(&pts, arg, __restore_tables);
> +			if (ret)
> +				return ret;
> +		}
> +	}
> +
> +	return 0;
> +}
> +
> +static const struct pt_iommu_ops NS(ops_immutable);
> +
> +/**
> + * restore() - Restore page tables and other state of a domain.
> + * @domain: Domain to preserve
> + *
> + * Returns: -ERRNO on failure, 0 on success.
> + */
> +int DOMAIN_NS(restore)(struct iommu_domain *domain, struct iommu_domain_ser *ser)
> +{
> +	struct pt_iommu *iommu_table =
> +		container_of(domain, struct pt_iommu, domain);
> +	struct pt_common *common = common_from_iommu(iommu_table);
> +	struct pt_range range;
> +
> +	common->max_vasz_lg2 = ser->vasz;
> +
> +	/* Make this domain immutable.*/
> +	iommu_table->ops = &NS(ops_immutable);
> +
> +	/*
> +	 * It is safe to override this here since this domain is immutable and
> +	 * can only be freed.
> +	 */
> +	common->features = 0;
> +	if (ser->sign_extend)
> +		common->features |= BIT(PT_FEAT_SIGN_EXTEND);
> +
> +	range = pt_all_range(common);
> +	iommu_restore_page(ser->top_table_phys);
> +
> +	/* Free new table */
> +	iommu_free_pages(range.top_table);
> +
> +	/* Set the restored top table */
> +	pt_top_set(common, phys_to_virt(ser->top_table_phys), ser->top_level);
> +
> +	/* Restore all pages*/
> +	range = pt_all_range(common);
> +	return pt_walk_range(&range, __restore_tables, NULL);
> +}
> +EXPORT_SYMBOL_NS_GPL(DOMAIN_NS(restore), "GENERIC_PT_IOMMU");
> +#endif
> +
>   struct pt_unmap_args {
>   	struct iommu_pages_list free_list;
>   	pt_vaddr_t unmapped;
> @@ -1138,6 +1265,10 @@ static const struct pt_iommu_ops NS(ops) = {
>   	.deinit = NS(deinit),
>   };
>   
> +static const struct pt_iommu_ops NS(ops_immutable) = {
> +	.deinit = NS(deinit),
> +};
> +
>   static int pt_init_common(struct pt_common *common)
>   {
>   	struct pt_range top_range = pt_top_range(common);
> diff --git a/drivers/iommu/generic_pt/kunit_iommu_pt.h b/drivers/iommu/generic_pt/kunit_iommu_pt.h
> index e8a63c8ea850..af1918d693ed 100644
> --- a/drivers/iommu/generic_pt/kunit_iommu_pt.h
> +++ b/drivers/iommu/generic_pt/kunit_iommu_pt.h
> @@ -426,6 +426,33 @@ static void test_mixed(struct kunit *test)
>   	check_iova(test, start, oa, len);
>   }
>   
> +static void test_restore_free(struct kunit *test)
> +{
> +	struct kunit_iommu_priv *priv = test->priv;
> +	struct pt_range top_range = pt_top_range(priv->common);
> +	u64 start = 0x3fe400ULL << 12;
> +	u64 end = 0x4c0600ULL << 12;
> +	pt_vaddr_t len = end - start;
> +
> +	if (top_range.last_va <= start || sizeof(unsigned long) == 4)
> +		kunit_skip(test, "range is too small");
> +	if ((priv->safe_pgsize_bitmap & GENMASK(30, 21)) != (BIT(30) | BIT(21)))
> +		kunit_skip(test, "incompatible psize");
> +
> +	/* Map a large mixed range to populate multiple levels of page tables */
> +	do_map(test, start, start, len);
> +
> +	/*
> +	 * Simulate a restored state by clearing all features except
> +	 * SIGN_EXTEND. This verifies that the generic page table free walker
> +	 * can correctly tear down a populated domain when other features are
> +	 * zeroed.
> +	 */
> +	priv->common->features &= BIT(PT_FEAT_SIGN_EXTEND);
> +
> +	/* The domain will be freed when the test exits. */
> +}
> +
>   static struct kunit_case iommu_test_cases[] = {
>   	KUNIT_CASE_FMT(test_increase_level),
>   	KUNIT_CASE_FMT(test_map_simple),
> @@ -434,6 +461,7 @@ static struct kunit_case iommu_test_cases[] = {
>   	KUNIT_CASE_FMT(test_random_map),
>   	KUNIT_CASE_FMT(test_pgsize_boundary),
>   	KUNIT_CASE_FMT(test_mixed),
> +	KUNIT_CASE_FMT(test_restore_free),
>   	{},
>   };
>   
> diff --git a/include/linux/generic_pt/iommu.h b/include/linux/generic_pt/iommu.h
> index dd0edd02a48a..649b3b9eb1a0 100644
> --- a/include/linux/generic_pt/iommu.h
> +++ b/include/linux/generic_pt/iommu.h
> @@ -13,6 +13,7 @@ struct iommu_iotlb_gather;
>   struct pt_iommu_ops;
>   struct pt_iommu_driver_ops;
>   struct iommu_dirty_bitmap;
> +struct iommu_domain_ser;
>   
>   /**
>    * DOC: IOMMU Radix Page Table
> @@ -251,6 +252,12 @@ struct pt_iommu_cfg {
>   #define IOMMU_PROTOTYPES(fmt)                                                  \
>   	phys_addr_t pt_iommu_##fmt##_iova_to_phys(struct iommu_domain *domain, \
>   						  dma_addr_t iova);            \
> +	int pt_iommu_##fmt##_preserve(struct iommu_domain *domain,             \
> +				      struct iommu_domain_ser *ser);           \
> +	void pt_iommu_##fmt##_unpreserve(struct iommu_domain *domain,          \
> +					 struct iommu_domain_ser *ser);        \
> +	int pt_iommu_##fmt##_restore(struct iommu_domain *domain,              \
> +				     struct iommu_domain_ser *ser);            \
>   	int pt_iommu_##fmt##_read_and_clear_dirty(                             \
>   		struct iommu_domain *domain, unsigned long iova, size_t size,  \
>   		unsigned long flags, struct iommu_dirty_bitmap *dirty);        \
> @@ -266,12 +273,22 @@ struct pt_iommu_cfg {
>   	};                              \
>   	IOMMU_PROTOTYPES(fmt)
>   
> +#ifdef CONFIG_IOMMU_LIVEUPDATE
> +#define IOMMU_PT_LIVEUPDATE_OPS(fmt)			\
> +	, .preserve = &pt_iommu_##fmt##_preserve,	\
> +	.unpreserve = &pt_iommu_##fmt##_unpreserve,	\
> +	.restore = &pt_iommu_##fmt##_restore

Nit: would it look better if we put it like this?

#define IOMMU_PT_LIVEUPDATE_OPS(fmt)			\
	, .preserve = &pt_iommu_##fmt##_preserve	\
	, .unpreserve = &pt_iommu_##fmt##_unpreserve	\
	, .restore = &pt_iommu_##fmt##_restore

> +#else
> +#define IOMMU_PT_LIVEUPDATE_OPS(fmt)
> +#endif

Thanks,
baolu

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 07/16] iommu/vt-d: Implement device and iommu preserve/unpreserve ops
       [not found] ` <20260427175633.1978233-8-skhawaja@google.com>
@ 2026-05-07  6:25   ` Baolu Lu
  2026-05-08  2:36     ` Samiullah Khawaja
  0 siblings, 1 reply; 14+ messages in thread
From: Baolu Lu @ 2026-05-07  6:25 UTC (permalink / raw)
  To: Samiullah Khawaja, David Woodhouse, Joerg Roedel, Will Deacon,
	Jason Gunthorpe
  Cc: Robin Murphy, Kevin Tian, Alex Williamson, Shuah Khan, iommu,
	linux-kernel, kvm, Saeed Mahameed, Adithya Jayachandran,
	Parav Pandit, Leon Romanovsky, William Tu, Pratyush Yadav,
	Pasha Tatashin, David Matlack, Andrew Morton, Chris Li,
	Pranjal Shrivastava, Vipin Sharma, YiFei Zhu

On 4/28/26 01:56, Samiullah Khawaja wrote:
> Add implementation of the device and iommu presevation in a separate
> file. Also set the device and iommu preserve/unpreserve ops in the
> struct iommu_ops.
> 
> During normal shutdown the iommu translation is disabled. Since the root
> table is preserved during live update, it needs to be cleaned up and the
> context entries of the unpreserved devices need to be cleared.

This is not related to preserve/unpreserve ops and could be made in a
separated patch?

> 
> Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
> ---
>   MAINTAINERS                      |   1 +
>   drivers/iommu/intel/Makefile     |   1 +
>   drivers/iommu/intel/iommu.c      |  52 +++++++++++-
>   drivers/iommu/intel/iommu.h      |  28 +++++++
>   drivers/iommu/intel/liveupdate.c | 139 +++++++++++++++++++++++++++++++
>   drivers/iommu/iommu.c            |  18 ++++
>   include/linux/iommu-liveupdate.h |  10 +++
>   include/linux/iommu.h            |  14 ++++
>   include/linux/kho/abi/iommu.h    |  18 ++++
>   9 files changed, 277 insertions(+), 4 deletions(-)
>   create mode 100644 drivers/iommu/intel/liveupdate.c
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 980041955abc..9f5c02c6c8c1 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -13495,6 +13495,7 @@ M:	Samiullah Khawaja <skhawaja@google.com>
>   R:	Pranjal Shrivastava <praan@google.com>
>   L:	iommu@lists.linux.dev
>   S:	Maintained
> +F:	drivers/iommu/intel/liveupdate.c

This file is deeply integrated into the Intel IOMMU driver. Maintaining
it separately from the rest of the driver would add unnecessary
complexity. I suggest merging this entry into the existing Intel IOMMU
block so they can be managed together.

>   F:	drivers/iommu/liveupdate.c
>   F:	include/linux/iommu-liveupdate.h
>   F:	include/linux/kho/abi/iommu.h
> diff --git a/drivers/iommu/intel/Makefile b/drivers/iommu/intel/Makefile
> index ada651c4a01b..d38fc101bc35 100644
> --- a/drivers/iommu/intel/Makefile
> +++ b/drivers/iommu/intel/Makefile
> @@ -6,3 +6,4 @@ obj-$(CONFIG_INTEL_IOMMU_DEBUGFS) += debugfs.o
>   obj-$(CONFIG_INTEL_IOMMU_SVM) += svm.o
>   obj-$(CONFIG_IRQ_REMAP) += irq_remapping.o
>   obj-$(CONFIG_INTEL_IOMMU_PERF_EVENTS) += perfmon.o
> +obj-$(CONFIG_IOMMU_LIVEUPDATE) += liveupdate.o
> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> index c3d18cd77d2f..68fecd4e57fa 100644
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -16,6 +16,7 @@
>   #include <linux/crash_dump.h>
>   #include <linux/dma-direct.h>
>   #include <linux/dmi.h>
> +#include <linux/iommu-liveupdate.h>
>   #include <linux/memory.h>
>   #include <linux/pci.h>
>   #include <linux/pci-ats.h>
> @@ -52,6 +53,8 @@ static int rwbf_quirk;
>   
>   #define rwbf_required(iommu)	(rwbf_quirk || cap_rwbf((iommu)->cap))
>   
> +static void clear_unpreserved_context_entries(struct intel_iommu *iommu);
> +
>   /*
>    * set to 1 to panic kernel if can't successfully enable VT-d
>    * (used when kernel is launched w/ TXT)
> @@ -60,8 +63,6 @@ static int force_on = 0;
>   static int intel_iommu_tboot_noforce;
>   static int no_platform_optin;
>   
> -#define ROOT_ENTRY_NR (VTD_PAGE_SIZE/sizeof(struct root_entry))
> -
>   /*
>    * Take a root_entry and return the Lower Context Table Pointer (LCTP)
>    * if marked present.
> @@ -2375,8 +2376,11 @@ void intel_iommu_shutdown(void)
>   		/* Disable PMRs explicitly here. */
>   		iommu_disable_protect_mem_regions(iommu);
>   
> -		/* Make sure the IOMMUs are switched off */
> -		iommu_disable_translation(iommu);
> +		/* Make sure the IOMMUs are switched off if not preserved. */
> +		if (iommu_preserved_state(&iommu->iommu))
> +			clear_unpreserved_context_entries(iommu);
> +		else
> +			iommu_disable_translation(iommu);
>   	}
>   }
>   
> @@ -2899,6 +2903,41 @@ static const struct iommu_dirty_ops intel_second_stage_dirty_ops = {
>   	.set_dirty_tracking = intel_iommu_set_dirty_tracking,
>   };
>   
> +#ifdef CONFIG_IOMMU_LIVEUPDATE
> +static int clear_unpreserve_context_entry_fn(struct device *dev,
> +					     struct iommu_device *iommu,
> +					     void *arg)
> +{
> +	struct device_domain_info *info;
> +
> +	info = dev_iommu_priv_get(dev);
> +	if (!info)
> +		return 0;
> +
> +	if (dev_is_pci(dev) && dev_iommu_preserved_state(dev))
> +		return 0;
> +
> +	domain_context_clear(info);
> +	return 0;
> +}
> +
> +static void clear_unpreserved_context_entries(struct intel_iommu *iommu)
> +{
> +	struct iommu_dev_iter iter = {
> +		.fn = clear_unpreserve_context_entry_fn,
> +		.iommu = &iommu->iommu,
> +		.arg = NULL,
> +
> +	};
> +
> +	iommu_for_each_dev(&iter);
> +}
> +#else
> +static void clear_unpreserved_context_entries(struct intel_iommu *iommu)
> +{
> +}
> +#endif
> +
>   static struct iommu_domain *
>   intel_iommu_domain_alloc_second_stage(struct device *dev,
>   				      struct intel_iommu *iommu, u32 flags)
> @@ -3926,6 +3965,11 @@ const struct iommu_ops intel_iommu_ops = {
>   	.is_attach_deferred	= intel_iommu_is_attach_deferred,
>   	.def_domain_type	= device_def_domain_type,
>   	.page_response		= intel_iommu_page_response,
> +#ifdef CONFIG_IOMMU_LIVEUPDATE
> +	.preserve_device	= intel_iommu_preserve_device,
> +	.preserve		= intel_iommu_preserve,
> +	.unpreserve		= intel_iommu_unpreserve,
> +#endif
>   };
>   
>   static void quirk_iommu_igfx(struct pci_dev *dev)
> diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
> index ef145560aa98..5e0bc17e76bf 100644
> --- a/drivers/iommu/intel/iommu.h
> +++ b/drivers/iommu/intel/iommu.h
> @@ -552,6 +552,8 @@ struct root_entry {
>   	u64     hi;
>   };
>   
> +#define ROOT_ENTRY_NR (VTD_PAGE_SIZE / sizeof(struct root_entry))
> +
>   /*
>    * low 64 bits:
>    * 0: present
> @@ -1284,6 +1286,32 @@ static inline int iopf_for_domain_replace(struct iommu_domain *new,
>   	return 0;
>   }
>   
> +#ifdef CONFIG_IOMMU_LIVEUPDATE
> +int intel_iommu_preserve_device(struct device *dev,
> +				struct iommu_device_ser *device_ser);
> +int intel_iommu_preserve(struct iommu_device *iommu,
> +			 struct iommu_hw_ser *iommu_ser);
> +void intel_iommu_unpreserve(struct iommu_device *iommu,
> +			    struct iommu_hw_ser *iommu_ser);
> +#else
> +static inline int intel_iommu_preserve_device(struct device *dev,
> +					      struct iommu_device_ser *device_ser)
> +{
> +	return -EOPNOTSUPP;
> +}
> +
> +static inline int intel_iommu_preserve(struct iommu_device *iommu,
> +				       struct iommu_hw_ser *iommu_ser)
> +{
> +	return -EOPNOTSUPP;
> +}
> +
> +static inline void intel_iommu_unpreserve(struct iommu_device *iommu,
> +					  struct iommu_hw_ser *iommu_ser)
> +{
> +}
> +#endif
> +
>   #ifdef CONFIG_INTEL_IOMMU_SVM
>   void intel_svm_check(struct intel_iommu *iommu);
>   struct iommu_domain *intel_svm_domain_alloc(struct device *dev,
> diff --git a/drivers/iommu/intel/liveupdate.c b/drivers/iommu/intel/liveupdate.c
> new file mode 100644
> index 000000000000..75fa68b701bf
> --- /dev/null
> +++ b/drivers/iommu/intel/liveupdate.c
> @@ -0,0 +1,139 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +
> +/*
> + * Copyright (C) 2026, Google LLC
> + * Author: Samiullah Khawaja <skhawaja@google.com>
> + */
> +
> +#define pr_fmt(fmt)    "DMAR: liveupdate: " fmt
> +
> +#include <linux/kexec_handover.h>
> +#include <linux/liveupdate.h>
> +#include <linux/iommu-liveupdate.h>
> +#include <linux/module.h>
> +#include <linux/pci.h>
> +
> +#include "iommu.h"
> +#include "../iommu-pages.h"
> +
> +static void unpreserve_iommu_context_table(struct intel_iommu *iommu, int end)
> +{
> +	struct context_entry *context;
> +	int i;
> +
> +	for (i = 0; i < end; i++) {
> +		context = iommu_context_addr(iommu, i, 0, 0);
> +		if (context)
> +			iommu_unpreserve_page(context);
> +
> +		if (!sm_supported(iommu))
> +			continue;
> +
> +		context = iommu_context_addr(iommu, i, 0x80, 0);
> +		if (context)
> +			iommu_unpreserve_page(context);
> +	}
> +}
> +
> +static int preserve_iommu_context_table(struct intel_iommu *iommu)

Since this function preserves all context tables, should it be renamed
to preserve_iommu_context_tables()?

> +{
> +	struct context_entry *context;
> +	int ret;
> +	int i;
> +
> +	for (i = 0; i < ROOT_ENTRY_NR; i++) {
> +		/*
> +		 * Alloc the context tables now to make sure the iommu unit is
> +		 * properly preserved. These might stay unused and wastes around
> +		 * 32MB max in scalable mode.
> +		 */

Instead of allocating and preserving context tables for all root entries
(as noted, can waste up to 32MB), could we restrict this only to the
entries possibly in use by active PCI devices?

> +		spin_lock(&iommu->lock);
> +		context = iommu_context_addr(iommu, i, 0, 1);
> +		spin_unlock(&iommu->lock);
> +		if (!context) {
> +			ret = -ENOMEM;
> +			goto error;
> +		}
> +		ret = iommu_preserve_page(context);
> +		if (ret)
> +			goto error;
> +
> +		if (!sm_supported(iommu))
> +			continue;
> +
> +		spin_lock(&iommu->lock);
> +		context = iommu_context_addr(iommu, i, 0x80, 1);
> +		spin_unlock(&iommu->lock);
> +		if (!context) {
> +			ret = -ENOMEM;
> +			goto error_sm;
> +		}
> +		ret = iommu_preserve_page(context);
> +		if (ret)
> +			goto error_sm;
> +	}
> +
> +	return 0;
> +
> +error_sm:
> +	context = iommu_context_addr(iommu, i, 0, 0);
> +	iommu_unpreserve_page(context);
> +error:
> +	unpreserve_iommu_context_table(iommu, i);
> +	return ret;
> +}
> +
> +int intel_iommu_preserve_device(struct device *dev,
> +				struct iommu_device_ser *device_ser)
> +{
> +	struct device_domain_info *info = dev_iommu_priv_get(dev);
> +
> +	if (!dev_is_pci(dev)) {
> +		dev_err(dev, "Cannot preserve non-PCI device\n");
> +		return -EOPNOTSUPP;
> +	}
> +
> +	if (!info)
> +		return -EINVAL;
> +
> +	device_ser->domain_iommu_ser.attachment_id = domain_id_iommu(info->domain,
> +								     info->iommu);
> +	return 0;
> +}
> +
> +int intel_iommu_preserve(struct iommu_device *iommu_dev,
> +			 struct iommu_hw_ser *ser)
> +{
> +	struct intel_iommu *iommu;
> +	int ret;
> +
> +	iommu = container_of(iommu_dev, struct intel_iommu, iommu);
> +
> +	ret = preserve_iommu_context_table(iommu);
> +	if (ret)
> +		return ret;
> +
> +	ret = iommu_preserve_page(iommu->root_entry);
> +	if (ret) {
> +		unpreserve_iommu_context_table(iommu, ROOT_ENTRY_NR);
> +		return ret;
> +	}
> +
> +	ser->intel.phys_addr = iommu->reg_phys;
> +	ser->intel.root_table = __pa(iommu->root_entry);
> +	ser->type = IOMMU_INTEL;
> +	ser->token = ser->intel.phys_addr;
> +
> +	return 0;
> +}
> +
> +void intel_iommu_unpreserve(struct iommu_device *iommu_dev,
> +			    struct iommu_hw_ser *iommu_ser)
> +{
> +	struct intel_iommu *iommu;
> +
> +	iommu = container_of(iommu_dev, struct intel_iommu, iommu);
> +
> +	unpreserve_iommu_context_table(iommu, ROOT_ENTRY_NR);
> +	iommu_unpreserve_page(iommu->root_entry);
> +}
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 782e73a9d45f..0561990f46e3 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -307,6 +307,24 @@ void iommu_device_unregister(struct iommu_device *iommu)
>   }
>   EXPORT_SYMBOL_GPL(iommu_device_unregister);
>   
> +static int _iommu_for_each_dev_cb(struct device *dev, void *data)
> +{
> +	struct iommu_dev_iter *iter = data;
> +
> +	if (dev->iommu && dev->iommu->iommu_dev == iter->iommu)
> +		return iter->fn(dev, iter->iommu, iter->arg);
> +
> +	return 0;
> +}
> +
> +void iommu_for_each_dev(struct iommu_dev_iter *iter)
> +{
> +	for (int i = 0; i < ARRAY_SIZE(iommu_buses); i++)
> +		bus_for_each_dev(iommu_buses[i], NULL, iter,
> +				 _iommu_for_each_dev_cb);
> +}
> +EXPORT_SYMBOL_GPL(iommu_for_each_dev);
> +
>   #if IS_ENABLED(CONFIG_IOMMUFD_TEST)
>   void iommu_device_unregister_bus(struct iommu_device *iommu,
>   				 const struct bus_type *bus,
> diff --git a/include/linux/iommu-liveupdate.h b/include/linux/iommu-liveupdate.h
> index 279c7ab04f09..c9d75c6b3be9 100644
> --- a/include/linux/iommu-liveupdate.h
> +++ b/include/linux/iommu-liveupdate.h
> @@ -33,6 +33,11 @@ void iommu_domain_unpreserve(struct iommu_domain *domain);
>   int iommu_preserve_device(struct iommu_domain *domain,
>   			  struct device *dev, u64 *preserved_state);
>   void iommu_unpreserve_device(struct iommu_domain *domain, struct device *dev);
> +
> +static inline void *iommu_preserved_state(struct iommu_device *iommu)
> +{
> +	return iommu->outgoing_preserved_state;
> +}
>   #else
>   static inline void *dev_iommu_preserved_state(struct device *dev)
>   {
> @@ -57,6 +62,11 @@ static inline int iommu_preserve_device(struct iommu_domain *domain,
>   static inline void iommu_unpreserve_device(struct iommu_domain *domain, struct device *dev)
>   {
>   }
> +
> +static inline void *iommu_preserved_state(struct iommu_device *iommu)
> +{
> +	return NULL;
> +}
>   #endif
>   
>   int iommu_liveupdate_register_flb(struct liveupdate_file_handler *handler);
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index 1c424b32c5fc..999be5127c65 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -1207,6 +1207,20 @@ static inline void *dev_iommu_priv_get(struct device *dev)
>   
>   void dev_iommu_priv_set(struct device *dev, void *priv);
>   
> +typedef int (*iommu_dev_iter_fn)(struct device *dev,
> +				 struct iommu_device *iommu, void *arg);
> +
> +/**
> + * struct iommu_dev_iter - Iterator for devices attached to an IOMMU
> + */
> +struct iommu_dev_iter {
> +	struct iommu_device *iommu;
> +	iommu_dev_iter_fn fn;
> +	void *arg;
> +};
> +
> +void iommu_for_each_dev(struct iommu_dev_iter *iter);

Is a generic helper necessary here? I am concerned about potential races
with concurrent operations especially probe_device or release_device.
And also the hot-plug/unplug scenarios?

Since the current preservation logic is limited to PCI devices, it might
be safer and simpler to use the existing for_each_pci_dev() for this
specific case instead of introducing a new generic helper?

> +
>   extern struct mutex iommu_probe_device_lock;
>   int iommu_probe_device(struct device *dev);
>   
> diff --git a/include/linux/kho/abi/iommu.h b/include/linux/kho/abi/iommu.h
> index 37b967820f14..5ffedf0dbd5a 100644
> --- a/include/linux/kho/abi/iommu.h
> +++ b/include/linux/kho/abi/iommu.h
> @@ -73,6 +73,7 @@
>   
>   enum iommu_type_ser {
>   	IOMMU_INVALID,
> +	IOMMU_INTEL,
>   };
>   
>   /**
> @@ -132,16 +133,33 @@ struct iommu_device_ser {
>   	struct iommu_dev_map_ser domain_iommu_ser;
>   } __packed;
>   
> +/**
> + * struct iommu_intel_ser - Serialized state of an Intel IOMMU instance
> + * @restored: Whether IOMMU state is restored
> + * @phys_addr: Physical address of the IOMMU register base
> + * @root_table: Physical address of the root entry table
> + */
> +struct iommu_intel_ser {
> +	u8 restored;
> +	u8 padding[7];
> +	u64 phys_addr;
> +	u64 root_table;
> +};
> +
>   /**
>    * struct iommu_hw_ser - Serialized state of an IOMMU instance
>    * @hdr: Common object header
>    * @token: Unique token for the IOMMU
>    * @type: IOMMU type serialized state belongs to
> + * @intel: Intel specific serialization data
>    */
>   struct iommu_hw_ser {
>   	struct iommu_hdr_ser hdr;
>   	u64 token;
>   	u64 type;
> +	union {
> +		struct iommu_intel_ser intel;
> +	};
>   } __packed;
>   
>   /**

Thanks,
baolu

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 09/16] iommu/vt-d: Restore IOMMU state and reclaimed domain ids
       [not found] ` <20260427175633.1978233-10-skhawaja@google.com>
@ 2026-05-07  9:05   ` Baolu Lu
  2026-05-07 17:35     ` Samiullah Khawaja
  0 siblings, 1 reply; 14+ messages in thread
From: Baolu Lu @ 2026-05-07  9:05 UTC (permalink / raw)
  To: Samiullah Khawaja, David Woodhouse, Joerg Roedel, Will Deacon,
	Jason Gunthorpe
  Cc: baolu.lu, Robin Murphy, Kevin Tian, Alex Williamson, Shuah Khan,
	iommu, linux-kernel, kvm, Saeed Mahameed, Adithya Jayachandran,
	Parav Pandit, Leon Romanovsky, William Tu, Pratyush Yadav,
	Pasha Tatashin, David Matlack, Andrew Morton, Chris Li,
	Pranjal Shrivastava, Vipin Sharma, YiFei Zhu

On 4/28/2026 1:56 AM, Samiullah Khawaja wrote:
> During boot fetch the preserved state of IOMMU unit and if found then
> restore the state.
> 
> - Reuse the root_table that was preserved in the previous kernel.
> - Reclaim the domain ids of the preserved domains for each preserved
>    devices so these are not acquired by another domain.
> 
> Signed-off-by: Samiullah Khawaja<skhawaja@google.com>
> ---
>   drivers/iommu/intel/iommu.c      | 55 ++++++++++++++++++++++--------
>   drivers/iommu/intel/iommu.h      |  7 ++++
>   drivers/iommu/intel/liveupdate.c | 57 ++++++++++++++++++++++++++++++++
>   3 files changed, 105 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> index 68fecd4e57fa..4118a0861f38 100644
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -670,10 +670,17 @@ void dmar_fault_dump_ptes(struct intel_iommu *iommu, u16 source_id,
>   #endif
>   
>   /* iommu handling */
> -static int iommu_alloc_root_entry(struct intel_iommu *iommu)
> +static int iommu_alloc_root_entry(struct intel_iommu *iommu,
> +				  struct iommu_hw_ser *iommu_ser)
>   {
>   	struct root_entry *root;
>   
> +	if (iommu_ser) {
> +		intel_iommu_liveupdate_restore_root_table(iommu, iommu_ser);
> +		__iommu_flush_cache(iommu, iommu->root_entry, ROOT_SIZE);

The root table is already active and being used by the hardware at this
point, correct? Since the CPU should not be modifying an active table,
the __iommu_flush_cache() call seems unnecessary and potentially risky.

Furthermore, if the iommu state has been restored, is it necessary to
call iommu_set_root_entry()? Calling it would essentially replace an
active, in-use root table with an identical one, which appears
redundant.

> +		return 0;
> +	}
> +
>   	root = iommu_alloc_pages_node_sz(iommu->node, GFP_ATOMIC, SZ_4K);
>   	if (!root) {
>   		pr_err("Allocating root entry for %s failed\n",

Thanks,
baolu

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 10/16] iommu: Restore and reattach preserved domains to devices
       [not found] ` <20260427175633.1978233-11-skhawaja@google.com>
@ 2026-05-07 13:54   ` Baolu Lu
  2026-05-07 16:52     ` Samiullah Khawaja
  0 siblings, 1 reply; 14+ messages in thread
From: Baolu Lu @ 2026-05-07 13:54 UTC (permalink / raw)
  To: Samiullah Khawaja, David Woodhouse, Joerg Roedel, Will Deacon,
	Jason Gunthorpe
  Cc: baolu.lu, Robin Murphy, Kevin Tian, Alex Williamson, Shuah Khan,
	iommu, linux-kernel, kvm, Saeed Mahameed, Adithya Jayachandran,
	Parav Pandit, Leon Romanovsky, William Tu, Pratyush Yadav,
	Pasha Tatashin, David Matlack, Andrew Morton, Chris Li,
	Pranjal Shrivastava, Vipin Sharma, YiFei Zhu

On 4/28/2026 1:56 AM, Samiullah Khawaja wrote:
> Restore the preserved domains by restoring the page tables using restore
> IOMMU domain op. Reattach the preserved domain to the device during
> default domain setup. While attaching, reuse the domain ID that was used
> in the previous kernel. The context entry setup is not needed as that is
> preserved during liveupdate.
> 
> Signed-off-by: Samiullah Khawaja<skhawaja@google.com>
> ---
>   drivers/iommu/intel/iommu.c      | 49 ++++++++++++++------
>   drivers/iommu/intel/iommu.h      |  3 +-
>   drivers/iommu/intel/nested.c     |  2 +-
>   drivers/iommu/iommu.c            | 61 ++++++++++++++++++++++++-
>   drivers/iommu/liveupdate.c       | 78 ++++++++++++++++++++++++++++++++
>   include/linux/iommu-liveupdate.h | 50 ++++++++++++++++++++
>   6 files changed, 224 insertions(+), 19 deletions(-)

Please split the changes in the Intel iommu driver and the iommu core
into different patches.

> 
> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> index 4118a0861f38..b90757164cd8 100644
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -1031,7 +1031,8 @@ static bool first_level_by_default(struct intel_iommu *iommu)
>   	return true;
>   }
>   
> -int domain_attach_iommu(struct dmar_domain *domain, struct intel_iommu *iommu)
> +int domain_attach_iommu(struct dmar_domain *domain, struct intel_iommu *iommu,
> +			int restore_did)

How about using a new helper for restored domain? For example,

	int domain_reattach_iommu(...)

or

	int domain_restore_iommu(...)

>   {
>   	struct iommu_domain_info *info, *curr;
>   	int num, ret = -ENOSPC;
> @@ -1051,8 +1052,11 @@ int domain_attach_iommu(struct dmar_domain *domain, struct intel_iommu *iommu)
>   		return 0;
>   	}
>   
> -	num = ida_alloc_range(&iommu->domain_ida, IDA_START_DID,
> -			      cap_ndoms(iommu->cap) - 1, GFP_KERNEL);
> +	if (restore_did >= IDA_START_DID)
> +		num = restore_did;

For a preserved domain ID, do we need to check whether it has been
reserved from the ida pool?

> +	else
> +		num = ida_alloc_range(&iommu->domain_ida, IDA_START_DID,
> +				      cap_ndoms(iommu->cap) - 1, GFP_KERNEL);
>   	if (num < 0) {
>   		pr_err("%s: No free domain ids\n", iommu->name);
>   		goto err_unlock;
> @@ -1320,10 +1324,14 @@ static int dmar_domain_attach_device(struct dmar_domain *domain,
>   {
>   	struct device_domain_info *info = dev_iommu_priv_get(dev);
>   	struct intel_iommu *iommu = info->iommu;
> +	struct device_ser *device_ser = NULL;
>   	unsigned long flags;
>   	int ret;
>   
> -	ret = domain_attach_iommu(domain, iommu);
> +	device_ser = dev_iommu_restored_state(dev);
> +
> +	ret = domain_attach_iommu(domain, iommu,
> +				  dev_iommu_restore_did(dev, &domain->domain));

What is the expected behavior if there is a mismatch between the
restored state and the availability of a domain ID? Specifically, if
device_ser is found but dev_iommu_restore_did() fails (returns -1), how
should the driver proceed in that case?

>   	if (ret)
>   		return ret;
>   
> @@ -1336,16 +1344,18 @@ static int dmar_domain_attach_device(struct dmar_domain *domain,
>   	if (dev_is_real_dma_subdevice(dev))
>   		return 0;
>   
> -	if (!sm_supported(iommu))
> -		ret = domain_context_mapping(domain, dev);
> -	else if (intel_domain_is_fs_paging(domain))
> -		ret = domain_setup_first_level(iommu, domain, dev,
> -					       IOMMU_NO_PASID, NULL);
> -	else if (intel_domain_is_ss_paging(domain))
> -		ret = domain_setup_second_level(iommu, domain, dev,
> -						IOMMU_NO_PASID, NULL);
> -	else if (WARN_ON(true))
> -		ret = -EINVAL;
> +	if (!device_ser) {
> +		if (!sm_supported(iommu))
> +			ret = domain_context_mapping(domain, dev);
> +		else if (intel_domain_is_fs_paging(domain))
> +			ret = domain_setup_first_level(iommu, domain, dev,
> +						       IOMMU_NO_PASID, NULL);
> +		else if (intel_domain_is_ss_paging(domain))
> +			ret = domain_setup_second_level(iommu, domain, dev,
> +							IOMMU_NO_PASID, NULL);
> +		else if (WARN_ON(true))
> +			ret = -EINVAL;
> +	}
>   
>   	if (ret)
>   		goto out_block_translation;
> @@ -3170,6 +3180,15 @@ int paging_domain_compatible(struct iommu_domain *domain, struct device *dev)
>   	struct intel_iommu *iommu = info->iommu;
>   	int ret = -EINVAL;
>   
> +#ifdef CONFIG_IOMMU_LIVEUPDATE
> +	/*
> +	 * Restored IOMMU domains are already attached to the device and can
> +	 * only be freed. So no need to check the compatibility.
> +	 */
> +	if (iommu_domain_restored_state(domain))
> +		return 0;
> +#endif
> +
>   	if (intel_domain_is_fs_paging(dmar_domain))
>   		ret = paging_domain_compatible_first_stage(dmar_domain, iommu);
>   	else if (intel_domain_is_ss_paging(dmar_domain))
> @@ -3647,7 +3666,7 @@ domain_add_dev_pasid(struct iommu_domain *domain,
>   	if (!dev_pasid)
>   		return ERR_PTR(-ENOMEM);
>   
> -	ret = domain_attach_iommu(dmar_domain, iommu);
> +	ret = domain_attach_iommu(dmar_domain, iommu, -1);
>   	if (ret)
>   		goto out_free;
>   
> diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
> index b0ec0b471a43..8e37acf7de12 100644
> --- a/drivers/iommu/intel/iommu.h
> +++ b/drivers/iommu/intel/iommu.h
> @@ -1182,7 +1182,8 @@ void __iommu_flush_iotlb(struct intel_iommu *iommu, u16 did, u64 addr,
>    */
>   #define QI_OPT_WAIT_DRAIN		BIT(0)
>   
> -int domain_attach_iommu(struct dmar_domain *domain, struct intel_iommu *iommu);
> +int domain_attach_iommu(struct dmar_domain *domain, struct intel_iommu *iommu,
> +			int restore_did);
>   void domain_detach_iommu(struct dmar_domain *domain, struct intel_iommu *iommu);
>   void device_block_translation(struct device *dev);
>   int paging_domain_compatible(struct iommu_domain *domain, struct device *dev);
> diff --git a/drivers/iommu/intel/nested.c b/drivers/iommu/intel/nested.c
> index 2b979bec56ce..6e13f697b463 100644
> --- a/drivers/iommu/intel/nested.c
> +++ b/drivers/iommu/intel/nested.c
> @@ -40,7 +40,7 @@ static int intel_nested_attach_dev(struct iommu_domain *domain,
>   		return ret;
>   	}
>   
> -	ret = domain_attach_iommu(dmar_domain, iommu);
> +	ret = domain_attach_iommu(dmar_domain, iommu, -1);
>   	if (ret) {
>   		dev_err_ratelimited(dev, "Failed to attach domain to iommu\n");
>   		return ret;

Thanks,
baolu

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 10/16] iommu: Restore and reattach preserved domains to devices
  2026-05-07 13:54   ` [PATCH v2 10/16] iommu: Restore and reattach preserved domains to devices Baolu Lu
@ 2026-05-07 16:52     ` Samiullah Khawaja
  0 siblings, 0 replies; 14+ messages in thread
From: Samiullah Khawaja @ 2026-05-07 16:52 UTC (permalink / raw)
  To: Baolu Lu
  Cc: David Woodhouse, Joerg Roedel, Will Deacon, Jason Gunthorpe,
	Robin Murphy, Kevin Tian, Alex Williamson, Shuah Khan, iommu,
	linux-kernel, kvm, Saeed Mahameed, Adithya Jayachandran,
	Parav Pandit, Leon Romanovsky, William Tu, Pratyush Yadav,
	Pasha Tatashin, David Matlack, Andrew Morton, Chris Li,
	Pranjal Shrivastava, Vipin Sharma, YiFei Zhu

On Thu, May 07, 2026 at 09:54:27PM +0800, Baolu Lu wrote:
>On 4/28/2026 1:56 AM, Samiullah Khawaja wrote:
>>Restore the preserved domains by restoring the page tables using restore
>>IOMMU domain op. Reattach the preserved domain to the device during
>>default domain setup. While attaching, reuse the domain ID that was used
>>in the previous kernel. The context entry setup is not needed as that is
>>preserved during liveupdate.
>>
>>Signed-off-by: Samiullah Khawaja<skhawaja@google.com>
>>---
>>  drivers/iommu/intel/iommu.c      | 49 ++++++++++++++------
>>  drivers/iommu/intel/iommu.h      |  3 +-
>>  drivers/iommu/intel/nested.c     |  2 +-
>>  drivers/iommu/iommu.c            | 61 ++++++++++++++++++++++++-
>>  drivers/iommu/liveupdate.c       | 78 ++++++++++++++++++++++++++++++++
>>  include/linux/iommu-liveupdate.h | 50 ++++++++++++++++++++
>>  6 files changed, 224 insertions(+), 19 deletions(-)
>
>Please split the changes in the Intel iommu driver and the iommu core
>into different patches.

Agreed. I will split these.
>
>>
>>diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
>>index 4118a0861f38..b90757164cd8 100644
>>--- a/drivers/iommu/intel/iommu.c
>>+++ b/drivers/iommu/intel/iommu.c
>>@@ -1031,7 +1031,8 @@ static bool first_level_by_default(struct intel_iommu *iommu)
>>  	return true;
>>  }
>>-int domain_attach_iommu(struct dmar_domain *domain, struct intel_iommu *iommu)
>>+int domain_attach_iommu(struct dmar_domain *domain, struct intel_iommu *iommu,
>>+			int restore_did)
>
>How about using a new helper for restored domain? For example,
>
>	int domain_reattach_iommu(...)
>
>or
>
>	int domain_restore_iommu(...)

Agreed. I didn't do this to avoid code duplication, but I think this
might be much clean. Also in the restore path, the did test can also be
done as you pointed out below.
>
>>  {
>>  	struct iommu_domain_info *info, *curr;
>>  	int num, ret = -ENOSPC;
>>@@ -1051,8 +1052,11 @@ int domain_attach_iommu(struct dmar_domain *domain, struct intel_iommu *iommu)
>>  		return 0;
>>  	}
>>-	num = ida_alloc_range(&iommu->domain_ida, IDA_START_DID,
>>-			      cap_ndoms(iommu->cap) - 1, GFP_KERNEL);
>>+	if (restore_did >= IDA_START_DID)
>>+		num = restore_did;
>
>For a preserved domain ID, do we need to check whether it has been
>reserved from the ida pool?

Yes, I will add a sanity check once I move this logic into separate
function.
>
>>+	else
>>+		num = ida_alloc_range(&iommu->domain_ida, IDA_START_DID,
>>+				      cap_ndoms(iommu->cap) - 1, GFP_KERNEL);
>>  	if (num < 0) {
>>  		pr_err("%s: No free domain ids\n", iommu->name);
>>  		goto err_unlock;
>>@@ -1320,10 +1324,14 @@ static int dmar_domain_attach_device(struct dmar_domain *domain,
>>  {
>>  	struct device_domain_info *info = dev_iommu_priv_get(dev);
>>  	struct intel_iommu *iommu = info->iommu;
>>+	struct device_ser *device_ser = NULL;
>>  	unsigned long flags;
>>  	int ret;
>>-	ret = domain_attach_iommu(domain, iommu);
>>+	device_ser = dev_iommu_restored_state(dev);
>>+
>>+	ret = domain_attach_iommu(domain, iommu,
>>+				  dev_iommu_restore_did(dev, &domain->domain));
>
>What is the expected behavior if there is a mismatch between the
>restored state and the availability of a domain ID? Specifically, if
>device_ser is found but dev_iommu_restore_did() fails (returns -1), how
>should the driver proceed in that case?

The restore of the domain should fail in this case, as this is
unexpected behaviour, as the DID should be preserved in the previous
kernel and in this (new) kernel the did should have been reclaimed
during iommu HW restore.

But I think we can sanity check these in the separate restore function.
I will rework this.
>
>>  	if (ret)
>>  		return ret;
>>@@ -1336,16 +1344,18 @@ static int dmar_domain_attach_device(struct dmar_domain *domain,
>>  	if (dev_is_real_dma_subdevice(dev))
>>  		return 0;
>>-	if (!sm_supported(iommu))
>>-		ret = domain_context_mapping(domain, dev);
>>-	else if (intel_domain_is_fs_paging(domain))
>>-		ret = domain_setup_first_level(iommu, domain, dev,
>>-					       IOMMU_NO_PASID, NULL);
>>-	else if (intel_domain_is_ss_paging(domain))
>>-		ret = domain_setup_second_level(iommu, domain, dev,
>>-						IOMMU_NO_PASID, NULL);
>>-	else if (WARN_ON(true))
>>-		ret = -EINVAL;
>>+	if (!device_ser) {
>>+		if (!sm_supported(iommu))
>>+			ret = domain_context_mapping(domain, dev);
>>+		else if (intel_domain_is_fs_paging(domain))
>>+			ret = domain_setup_first_level(iommu, domain, dev,
>>+						       IOMMU_NO_PASID, NULL);
>>+		else if (intel_domain_is_ss_paging(domain))
>>+			ret = domain_setup_second_level(iommu, domain, dev,
>>+							IOMMU_NO_PASID, NULL);
>>+		else if (WARN_ON(true))
>>+			ret = -EINVAL;
>>+	}
>>  	if (ret)
>>  		goto out_block_translation;
>>@@ -3170,6 +3180,15 @@ int paging_domain_compatible(struct iommu_domain *domain, struct device *dev)
>>  	struct intel_iommu *iommu = info->iommu;
>>  	int ret = -EINVAL;
>>+#ifdef CONFIG_IOMMU_LIVEUPDATE
>>+	/*
>>+	 * Restored IOMMU domains are already attached to the device and can
>>+	 * only be freed. So no need to check the compatibility.
>>+	 */
>>+	if (iommu_domain_restored_state(domain))
>>+		return 0;
>>+#endif
>>+
>>  	if (intel_domain_is_fs_paging(dmar_domain))
>>  		ret = paging_domain_compatible_first_stage(dmar_domain, iommu);
>>  	else if (intel_domain_is_ss_paging(dmar_domain))
>>@@ -3647,7 +3666,7 @@ domain_add_dev_pasid(struct iommu_domain *domain,
>>  	if (!dev_pasid)
>>  		return ERR_PTR(-ENOMEM);
>>-	ret = domain_attach_iommu(dmar_domain, iommu);
>>+	ret = domain_attach_iommu(dmar_domain, iommu, -1);
>>  	if (ret)
>>  		goto out_free;
>>diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
>>index b0ec0b471a43..8e37acf7de12 100644
>>--- a/drivers/iommu/intel/iommu.h
>>+++ b/drivers/iommu/intel/iommu.h
>>@@ -1182,7 +1182,8 @@ void __iommu_flush_iotlb(struct intel_iommu *iommu, u16 did, u64 addr,
>>   */
>>  #define QI_OPT_WAIT_DRAIN		BIT(0)
>>-int domain_attach_iommu(struct dmar_domain *domain, struct intel_iommu *iommu);
>>+int domain_attach_iommu(struct dmar_domain *domain, struct intel_iommu *iommu,
>>+			int restore_did);
>>  void domain_detach_iommu(struct dmar_domain *domain, struct intel_iommu *iommu);
>>  void device_block_translation(struct device *dev);
>>  int paging_domain_compatible(struct iommu_domain *domain, struct device *dev);
>>diff --git a/drivers/iommu/intel/nested.c b/drivers/iommu/intel/nested.c
>>index 2b979bec56ce..6e13f697b463 100644
>>--- a/drivers/iommu/intel/nested.c
>>+++ b/drivers/iommu/intel/nested.c
>>@@ -40,7 +40,7 @@ static int intel_nested_attach_dev(struct iommu_domain *domain,
>>  		return ret;
>>  	}
>>-	ret = domain_attach_iommu(dmar_domain, iommu);
>>+	ret = domain_attach_iommu(dmar_domain, iommu, -1);
>>  	if (ret) {
>>  		dev_err_ratelimited(dev, "Failed to attach domain to iommu\n");
>>  		return ret;
>
>Thanks,
>baolu

Thanks,
Sami

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 09/16] iommu/vt-d: Restore IOMMU state and reclaimed domain ids
  2026-05-07  9:05   ` [PATCH v2 09/16] iommu/vt-d: Restore IOMMU state and reclaimed domain ids Baolu Lu
@ 2026-05-07 17:35     ` Samiullah Khawaja
  0 siblings, 0 replies; 14+ messages in thread
From: Samiullah Khawaja @ 2026-05-07 17:35 UTC (permalink / raw)
  To: Baolu Lu
  Cc: David Woodhouse, Joerg Roedel, Will Deacon, Jason Gunthorpe,
	Robin Murphy, Kevin Tian, Alex Williamson, Shuah Khan, iommu,
	linux-kernel, kvm, Saeed Mahameed, Adithya Jayachandran,
	Parav Pandit, Leon Romanovsky, William Tu, Pratyush Yadav,
	Pasha Tatashin, David Matlack, Andrew Morton, Chris Li,
	Pranjal Shrivastava, Vipin Sharma, YiFei Zhu

On Thu, May 07, 2026 at 05:05:07PM +0800, Baolu Lu wrote:
>On 4/28/2026 1:56 AM, Samiullah Khawaja wrote:
>>During boot fetch the preserved state of IOMMU unit and if found then
>>restore the state.
>>
>>- Reuse the root_table that was preserved in the previous kernel.
>>- Reclaim the domain ids of the preserved domains for each preserved
>>   devices so these are not acquired by another domain.
>>
>>Signed-off-by: Samiullah Khawaja<skhawaja@google.com>
>>---
>>  drivers/iommu/intel/iommu.c      | 55 ++++++++++++++++++++++--------
>>  drivers/iommu/intel/iommu.h      |  7 ++++
>>  drivers/iommu/intel/liveupdate.c | 57 ++++++++++++++++++++++++++++++++
>>  3 files changed, 105 insertions(+), 14 deletions(-)
>>
>>diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
>>index 68fecd4e57fa..4118a0861f38 100644
>>--- a/drivers/iommu/intel/iommu.c
>>+++ b/drivers/iommu/intel/iommu.c
>>@@ -670,10 +670,17 @@ void dmar_fault_dump_ptes(struct intel_iommu *iommu, u16 source_id,
>>  #endif
>>  /* iommu handling */
>>-static int iommu_alloc_root_entry(struct intel_iommu *iommu)
>>+static int iommu_alloc_root_entry(struct intel_iommu *iommu,
>>+				  struct iommu_hw_ser *iommu_ser)
>>  {
>>  	struct root_entry *root;
>>+	if (iommu_ser) {
>>+		intel_iommu_liveupdate_restore_root_table(iommu, iommu_ser);
>>+		__iommu_flush_cache(iommu, iommu->root_entry, ROOT_SIZE);
>
>The root table is already active and being used by the hardware at this
>point, correct? Since the CPU should not be modifying an active table,
>the __iommu_flush_cache() call seems unnecessary and potentially risky.
>
>Furthermore, if the iommu state has been restored, is it necessary to
>call iommu_set_root_entry()? Calling it would essentially replace an
>active, in-use root table with an identical one, which appears
>redundant.

Agreed to both points. I will remove the redundant flush here.

The set_root_entry() is done outside this function in a separate loop
after setup of all iommus, I will check whether the iommu was preserved
there and skip set_root_entry().
>
>>+		return 0;
>>+	}
>>+
>>  	root = iommu_alloc_pages_node_sz(iommu->node, GFP_ATOMIC, SZ_4K);
>>  	if (!root) {
>>  		pr_err("Allocating root entry for %s failed\n",
>
>Thanks,
>baolu

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 06/16] iommupt: Implement preserve/unpreserve/restore callbacks
  2026-05-07  2:55   ` [PATCH v2 06/16] iommupt: Implement preserve/unpreserve/restore callbacks Baolu Lu
@ 2026-05-07 18:40     ` Samiullah Khawaja
  0 siblings, 0 replies; 14+ messages in thread
From: Samiullah Khawaja @ 2026-05-07 18:40 UTC (permalink / raw)
  To: Baolu Lu
  Cc: David Woodhouse, Joerg Roedel, Will Deacon, Jason Gunthorpe,
	Robin Murphy, Kevin Tian, Alex Williamson, Shuah Khan, iommu,
	linux-kernel, kvm, Saeed Mahameed, Adithya Jayachandran,
	Parav Pandit, Leon Romanovsky, William Tu, Pratyush Yadav,
	Pasha Tatashin, David Matlack, Andrew Morton, Chris Li,
	Pranjal Shrivastava, Vipin Sharma, YiFei Zhu

On Thu, May 07, 2026 at 10:55:42AM +0800, Baolu Lu wrote:
>On 4/28/26 01:56, Samiullah Khawaja wrote:
>>Implement the iommu domain ops for presevation, unpresevation and
>>restoration of iommu domains for liveupdate. Use the existing page
>>walker to preserve the ioptdesc of the top_table and the lower tables.
>>
>>Preserve top_level, VASZ and FEAT Sign Extended to restore the domain in
>>the next kernel. On restore the domain has only the preserved features
>>enabled and all the other features are zeroed. This is ok since the
>>restored domain is made immutable and can only be freed. A kunit test is
>>added to verify that the IOMMU domain free can be done with trimmed
>>features.
>>
>>Signed-off-by: Samiullah Khawaja<skhawaja@google.com>
>>---
>>@@ -251,6 +252,12 @@ struct pt_iommu_cfg {
>>  #define IOMMU_PROTOTYPES(fmt)                                                  \
>>  	phys_addr_t pt_iommu_##fmt##_iova_to_phys(struct iommu_domain *domain, \
>>  						  dma_addr_t iova);            \
>>+	int pt_iommu_##fmt##_preserve(struct iommu_domain *domain,             \
>>+				      struct iommu_domain_ser *ser);           \
>>+	void pt_iommu_##fmt##_unpreserve(struct iommu_domain *domain,          \
>>+					 struct iommu_domain_ser *ser);        \
>>+	int pt_iommu_##fmt##_restore(struct iommu_domain *domain,              \
>>+				     struct iommu_domain_ser *ser);            \
>>  	int pt_iommu_##fmt##_read_and_clear_dirty(                             \
>>  		struct iommu_domain *domain, unsigned long iova, size_t size,  \
>>  		unsigned long flags, struct iommu_dirty_bitmap *dirty);        \
>>@@ -266,12 +273,22 @@ struct pt_iommu_cfg {
>>  	};                              \
>>  	IOMMU_PROTOTYPES(fmt)
>>+#ifdef CONFIG_IOMMU_LIVEUPDATE
>>+#define IOMMU_PT_LIVEUPDATE_OPS(fmt)			\
>>+	, .preserve = &pt_iommu_##fmt##_preserve,	\
>>+	.unpreserve = &pt_iommu_##fmt##_unpreserve,	\
>>+	.restore = &pt_iommu_##fmt##_restore
>
>Nit: would it look better if we put it like this?
>
>#define IOMMU_PT_LIVEUPDATE_OPS(fmt)			\
>	, .preserve = &pt_iommu_##fmt##_preserve	\
>	, .unpreserve = &pt_iommu_##fmt##_unpreserve	\
>	, .restore = &pt_iommu_##fmt##_restore

Agreed. I will change this.
>
>>+#else
>>+#define IOMMU_PT_LIVEUPDATE_OPS(fmt)
>>+#endif
>
>Thanks,
>baolu
>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 04/16] iommu: Implement device and IOMMU HW preservation
  2026-05-07  2:07   ` Baolu Lu
@ 2026-05-07 18:47     ` Samiullah Khawaja
  0 siblings, 0 replies; 14+ messages in thread
From: Samiullah Khawaja @ 2026-05-07 18:47 UTC (permalink / raw)
  To: Baolu Lu
  Cc: David Woodhouse, Joerg Roedel, Will Deacon, Jason Gunthorpe,
	Robin Murphy, Kevin Tian, Alex Williamson, Shuah Khan, iommu,
	linux-kernel, kvm, Saeed Mahameed, Adithya Jayachandran,
	Parav Pandit, Leon Romanovsky, William Tu, Pratyush Yadav,
	Pasha Tatashin, David Matlack, Andrew Morton, Chris Li,
	Pranjal Shrivastava, Vipin Sharma, YiFei Zhu

On Thu, May 07, 2026 at 10:07:43AM +0800, Baolu Lu wrote:
>On 4/28/26 01:56, Samiullah Khawaja wrote:
>>Add IOMMU ops to preserve/unpreserve a device. These can be implemented
>>by the IOMMU drivers that support preservation of devices that have
>>their IOMMU domains preserved. During device preservation the state of
>>the associated IOMMU is also preserved as dependency.
>>
>>Signed-off-by: Samiullah Khawaja<skhawaja@google.com>
>>---
>>  drivers/iommu/liveupdate.c       | 162 +++++++++++++++++++++++++++++++
>>  include/linux/iommu-liveupdate.h |  33 +++++++
>>  include/linux/iommu.h            |  20 ++++
>>  3 files changed, 215 insertions(+)
>>

[snip]
>>+
>>+int iommu_preserve_device(struct iommu_domain *domain,
>>+			  struct device *dev, u64 *preserved_state)
>>+{
>>+	struct iommu_flb_obj *flb_obj;
>>+	struct iommu_device_ser *device_ser;
>>+	struct dev_iommu *iommu;
>>+	struct pci_dev *pdev;
>>+	int ret;
>>+
>>+	if (!dev_is_pci(dev))
>>+		return -EOPNOTSUPP;
>>+
>>+	if (!domain->preserved_state)
>>+		return -EINVAL;
>>+
>>+	if (!iommu_group_dma_owner_claimed(dev->iommu_group))
>>+		return -EINVAL;
>>+
>>+	pdev = to_pci_dev(dev);
>>+	iommu = dev->iommu;
>>+	if (!iommu->iommu_dev->ops->preserve_device ||
>>+	    !iommu->iommu_dev->ops->preserve)
>>+		return -EOPNOTSUPP;

I will check for unpreserve ops here also.
>>+
>>+	ret = liveupdate_flb_get_outgoing(&iommu_flb, (void **)&flb_obj);
>>+	if (ret)
>>+		return ret;
>>+

[snip]
>>+}
>>+
>>+void iommu_unpreserve_device(struct iommu_domain *domain, struct device *dev)
>>+{
>>+	struct iommu_flb_obj *flb_obj;
>>+	struct iommu_device_ser *iommu_device_ser;
>>+	struct dev_iommu *iommu;
>>+	struct pci_dev *pdev;
>>+	int ret;
>>+
>>+	if (!dev_is_pci(dev))
>>+		return;
>>+
>>+	if (!iommu_group_dma_owner_claimed(dev->iommu_group))
>>+		return;
>>+
>>+	pdev = to_pci_dev(dev);
>>+	iommu = dev->iommu;
>>+	if (!iommu->iommu_dev->ops->unpreserve_device ||
>>+	    !iommu->iommu_dev->ops->unpreserve)
>>+		return;
>
>Is it considered a driver bug if it implements the preserve hooks but
>not unpreserve ones? This would at least cause a silent memory leak. How
>about adding a WARN like this?

I think the preservation should not be done if the unpreserve ops are
not implemented by the driver. I will add these checks in the _preserve
functions and return EOPNOTSUPP if it is not implemented by the driver.
>
>    if (WARN_ON_ONCE(!iommu->iommu_dev->ops->unpreserve_device ||
>                     !iommu->iommu_dev->ops->unpreserve))
>        return;

We can add this warning here also.
>
>?
>
>Thanks,
>baolu
>

Thanks,
Sami

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 07/16] iommu/vt-d: Implement device and iommu preserve/unpreserve ops
  2026-05-07  6:25   ` [PATCH v2 07/16] iommu/vt-d: Implement device and iommu preserve/unpreserve ops Baolu Lu
@ 2026-05-08  2:36     ` Samiullah Khawaja
  0 siblings, 0 replies; 14+ messages in thread
From: Samiullah Khawaja @ 2026-05-08  2:36 UTC (permalink / raw)
  To: Baolu Lu
  Cc: David Woodhouse, Joerg Roedel, Will Deacon, Jason Gunthorpe,
	Robin Murphy, Kevin Tian, Alex Williamson, Shuah Khan, iommu,
	linux-kernel, kvm, Saeed Mahameed, Adithya Jayachandran,
	Parav Pandit, Leon Romanovsky, William Tu, Pratyush Yadav,
	Pasha Tatashin, David Matlack, Andrew Morton, Chris Li,
	Pranjal Shrivastava, Vipin Sharma, YiFei Zhu

On Thu, May 07, 2026 at 02:25:14PM +0800, Baolu Lu wrote:
>On 4/28/26 01:56, Samiullah Khawaja wrote:
>>Add implementation of the device and iommu presevation in a separate
>>file. Also set the device and iommu preserve/unpreserve ops in the
>>struct iommu_ops.
>>
>>During normal shutdown the iommu translation is disabled. Since the root
>>table is preserved during live update, it needs to be cleaned up and the
>>context entries of the unpreserved devices need to be cleared.
>
>This is not related to preserve/unpreserve ops and could be made in a
>separated patch?

Agreed. I will move this stuff to a separate patch.
>
>>
>>Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
>>---
>>  MAINTAINERS                      |   1 +
>>  drivers/iommu/intel/Makefile     |   1 +
>>  drivers/iommu/intel/iommu.c      |  52 +++++++++++-
>>  drivers/iommu/intel/iommu.h      |  28 +++++++
>>  drivers/iommu/intel/liveupdate.c | 139 +++++++++++++++++++++++++++++++
>>  drivers/iommu/iommu.c            |  18 ++++
>>  include/linux/iommu-liveupdate.h |  10 +++
>>  include/linux/iommu.h            |  14 ++++
>>  include/linux/kho/abi/iommu.h    |  18 ++++
>>  9 files changed, 277 insertions(+), 4 deletions(-)
>>  create mode 100644 drivers/iommu/intel/liveupdate.c
>>
>>diff --git a/MAINTAINERS b/MAINTAINERS
>>index 980041955abc..9f5c02c6c8c1 100644
>>--- a/MAINTAINERS
>>+++ b/MAINTAINERS
>>@@ -13495,6 +13495,7 @@ M:	Samiullah Khawaja <skhawaja@google.com>
>>  R:	Pranjal Shrivastava <praan@google.com>
>>  L:	iommu@lists.linux.dev
>>  S:	Maintained
>>+F:	drivers/iommu/intel/liveupdate.c
>
>This file is deeply integrated into the Intel IOMMU driver. Maintaining
>it separately from the rest of the driver would add unnecessary
>complexity. I suggest merging this entry into the existing Intel IOMMU
>block so they can be managed together.

I understand your concern. This file is driver specific and should
definitely be routed through your tree.

My goal with listing it here wasn't to change the merge path, but rather
to ensure I am responsible for the liveupdate-specific logic in this
file. Because this file acts as the backend to the generic Live Update
framework, keeping it tied to the Liveupdate IOMMU entry helps guarantee
that future changes stay architecturally aligned and tested across the
whole liveupdate feature.

However, I think I will add a dedicated 'IOMMU VT-d LIVEUPDATE' block
where you are also listed as maintainer, and the patches for this file
should absolutely continue to flow through the Intel IOMMU tree.
>
>>  F:	drivers/iommu/liveupdate.c
>>  F:	include/linux/iommu-liveupdate.h
>>  F:	include/linux/kho/abi/iommu.h
>>diff --git a/drivers/iommu/intel/Makefile b/drivers/iommu/intel/Makefile
>>index ada651c4a01b..d38fc101bc35 100644
>>+

[snip]
>>+static void unpreserve_iommu_context_table(struct intel_iommu *iommu, int end)
>>+{
>>+	struct context_entry *context;
>>+	int i;
>>+
>>+	for (i = 0; i < end; i++) {
>>+		context = iommu_context_addr(iommu, i, 0, 0);
>>+		if (context)
>>+			iommu_unpreserve_page(context);
>>+
>>+		if (!sm_supported(iommu))
>>+			continue;
>>+
>>+		context = iommu_context_addr(iommu, i, 0x80, 0);
>>+		if (context)
>>+			iommu_unpreserve_page(context);
>>+	}
>>+}
>>+
>>+static int preserve_iommu_context_table(struct intel_iommu *iommu)
>
>Since this function preserves all context tables, should it be renamed
>to preserve_iommu_context_tables()?

Agreed. Will change this.
>
>>+{
>>+	struct context_entry *context;
>>+	int ret;
>>+	int i;
>>+
>>+	for (i = 0; i < ROOT_ENTRY_NR; i++) {
>>+		/*
>>+		 * Alloc the context tables now to make sure the iommu unit is
>>+		 * properly preserved. These might stay unused and wastes around
>>+		 * 32MB max in scalable mode.
>>+		 */
>
>Instead of allocating and preserving context tables for all root entries
>(as noted, can waste up to 32MB), could we restrict this only to the
>entries possibly in use by active PCI devices?

I think the hotplug devices or VFs created through SR-IOV will be missed
that way. Lets say device A is preserved and the associated iommu is
also preserved. And then a new device B is hotplugged and preserved,
then the context table for that will be missed.

Since we don't track the context_tables that are preserved, there is no
way to incremently preserve the new-ones. Let me look into the behaviour
of KHO, maybe we can make the preserve call idempotent and do these
incrementally.
>
>>+		spin_lock(&iommu->lock);
>>+		context = iommu_context_addr(iommu, i, 0, 1);
>>+		spin_unlock(&iommu->lock);
>>+		if (!context) {
>>+			ret = -ENOMEM;
>>+			goto error;
>>+		}

[snip]
>>diff --git a/include/linux/iommu.h b/include/linux/iommu.h
>>index 1c424b32c5fc..999be5127c65 100644
>>--- a/include/linux/iommu.h
>>+++ b/include/linux/iommu.h
>>@@ -1207,6 +1207,20 @@ static inline void *dev_iommu_priv_get(struct device *dev)
>>  void dev_iommu_priv_set(struct device *dev, void *priv);
>>+typedef int (*iommu_dev_iter_fn)(struct device *dev,
>>+				 struct iommu_device *iommu, void *arg);
>>+
>>+/**
>>+ * struct iommu_dev_iter - Iterator for devices attached to an IOMMU
>>+ */
>>+struct iommu_dev_iter {
>>+	struct iommu_device *iommu;
>>+	iommu_dev_iter_fn fn;
>>+	void *arg;
>>+};
>>+
>>+void iommu_for_each_dev(struct iommu_dev_iter *iter);
>
>Is a generic helper necessary here? I am concerned about potential races
>with concurrent operations especially probe_device or release_device.
>And also the hot-plug/unplug scenarios?
>
>Since the current preservation logic is limited to PCI devices, it might
>be safer and simpler to use the existing for_each_pci_dev() for this
>specific case instead of introducing a new generic helper?

The helper is used late during shutdown to clean up context entries for
devices that are not preserved. During this phase, the system is
quiescent and hotplug is disabled, so there are no races with concurrent
probe or release operations.

I was using for_each_pci_dev() in previous version, but the iterator is
needed to clear context entries of both PCI and non-PCI devices that are
unpreserved.

Please also see related comments on previous version:
https://lore.kernel.org/all/ab3R54-kyHGDEW9L@google.com/
>
>>+
>>  extern struct mutex iommu_probe_device_lock;
>>  int iommu_probe_device(struct device *dev);

[snip]
>>+	};
>>  } __packed;
>>  /**
>
>Thanks,
>baolu
>

Thanks,
Sami

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 11/16] iommu/vt-d: preserve PASID table of preserved device
       [not found] ` <20260427175633.1978233-12-skhawaja@google.com>
@ 2026-05-08  6:05   ` Baolu Lu
  2026-05-11 18:45     ` Samiullah Khawaja
  0 siblings, 1 reply; 14+ messages in thread
From: Baolu Lu @ 2026-05-08  6:05 UTC (permalink / raw)
  To: Samiullah Khawaja, David Woodhouse, Joerg Roedel, Will Deacon,
	Jason Gunthorpe
  Cc: Robin Murphy, Kevin Tian, Alex Williamson, Shuah Khan, iommu,
	linux-kernel, kvm, Saeed Mahameed, Adithya Jayachandran,
	Parav Pandit, Leon Romanovsky, William Tu, Pratyush Yadav,
	Pasha Tatashin, David Matlack, Andrew Morton, Chris Li,
	Pranjal Shrivastava, Vipin Sharma, YiFei Zhu

On 4/28/26 01:56, Samiullah Khawaja wrote:
> In scalable mode the PASID table is used to fetch the io page tables.
> Preserve and restore the PASID table of the preserved devices.
> 
> Signed-off-by: Samiullah Khawaja<skhawaja@google.com>
> ---
>   drivers/iommu/intel/iommu.c      |   5 +-
>   drivers/iommu/intel/iommu.h      |  12 +++
>   drivers/iommu/intel/liveupdate.c | 141 +++++++++++++++++++++++++++++++
>   drivers/iommu/intel/pasid.c      |   7 +-
>   drivers/iommu/intel/pasid.h      |   9 ++
>   include/linux/kho/abi/iommu.h    |  13 +++
>   6 files changed, 184 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> index b90757164cd8..6d42051dcf7c 100644
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -2951,8 +2951,10 @@ static int clear_unpreserve_context_entry_fn(struct device *dev,
>   	if (!info)
>   		return 0;
>   
> -	if (dev_is_pci(dev) && dev_iommu_preserved_state(dev))
> +	if (dev_is_pci(dev) && dev_iommu_preserved_state(dev)) {
> +		pasid_cleanup_preserved_table(dev);
>   		return 0;
> +	}
>   
>   	domain_context_clear(info);
>   	return 0;
> @@ -4013,6 +4015,7 @@ const struct iommu_ops intel_iommu_ops = {
>   	.page_response		= intel_iommu_page_response,
>   #ifdef CONFIG_IOMMU_LIVEUPDATE
>   	.preserve_device	= intel_iommu_preserve_device,
> +	.unpreserve_device	= intel_iommu_unpreserve_device,
>   	.preserve		= intel_iommu_preserve,
>   	.unpreserve		= intel_iommu_unpreserve,
>   #endif
> diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
> index 8e37acf7de12..62076a1a0b4d 100644
> --- a/drivers/iommu/intel/iommu.h
> +++ b/drivers/iommu/intel/iommu.h
> @@ -1290,12 +1290,15 @@ static inline int iopf_for_domain_replace(struct iommu_domain *new,
>   #ifdef CONFIG_IOMMU_LIVEUPDATE
>   int intel_iommu_preserve_device(struct device *dev,
>   				struct iommu_device_ser *device_ser);
> +void intel_iommu_unpreserve_device(struct device *dev,
> +				   struct iommu_device_ser *device_ser);
>   int intel_iommu_preserve(struct iommu_device *iommu,
>   			 struct iommu_hw_ser *iommu_ser);
>   void intel_iommu_unpreserve(struct iommu_device *iommu,
>   			    struct iommu_hw_ser *iommu_ser);
>   void intel_iommu_liveupdate_restore_root_table(struct intel_iommu *iommu,
>   					       struct iommu_hw_ser *iommu_ser);
> +void pasid_cleanup_preserved_table(struct device *dev);
>   #else
>   static inline int intel_iommu_preserve_device(struct device *dev,
>   					      struct iommu_device_ser *device_ser)
> @@ -1303,6 +1306,11 @@ static inline int intel_iommu_preserve_device(struct device *dev,
>   	return -EOPNOTSUPP;
>   }
>   
> +static inline void intel_iommu_unpreserve_device(struct device *dev,
> +						 struct iommu_device_ser *device_ser)
> +{
> +}
> +
>   static inline int intel_iommu_preserve(struct iommu_device *iommu,
>   				       struct iommu_hw_ser *iommu_ser)
>   {
> @@ -1318,6 +1326,10 @@ static inline void intel_iommu_liveupdate_restore_root_table(struct intel_iommu
>   							     struct iommu_hw_ser *iommu_ser)
>   {
>   }
> +
> +static inline void pasid_cleanup_preserved_table(struct device *dev)
> +{
> +}
>   #endif
>   
>   #ifdef CONFIG_INTEL_IOMMU_SVM
> diff --git a/drivers/iommu/intel/liveupdate.c b/drivers/iommu/intel/liveupdate.c
> index 50a63812533f..404b485e97b9 100644
> --- a/drivers/iommu/intel/liveupdate.c
> +++ b/drivers/iommu/intel/liveupdate.c
> @@ -14,6 +14,7 @@
>   #include <linux/pci.h>
>   
>   #include "iommu.h"
> +#include "pasid.h"
>   #include "../iommu-pages.h"
>   
>   static void unpreserve_iommu_context_table(struct intel_iommu *iommu, int end)
> @@ -140,10 +141,96 @@ void intel_iommu_liveupdate_restore_root_table(struct intel_iommu *iommu,
>   	iommu_for_each_preserved_device(_restore_used_domain_ids, iommu);
>   }
>   
> +enum pasid_lu_op {
> +	PASID_LU_OP_PRESERVE = 1,
> +	PASID_LU_OP_UNPRESERVE,
> +	PASID_LU_OP_RESTORE,
> +	PASID_LU_OP_FREE,
> +};
> +
> +static int pasid_lu_do_op(void *table, enum pasid_lu_op op)
> +{
> +	int ret = 0;
> +
> +	switch (op) {
> +	case PASID_LU_OP_PRESERVE:
> +		ret = iommu_preserve_page(table);
> +		break;
> +	case PASID_LU_OP_UNPRESERVE:
> +		iommu_unpreserve_page(table);
> +		break;
> +	case PASID_LU_OP_RESTORE:
> +		iommu_restore_page(virt_to_phys(table));
> +		break;
> +	case PASID_LU_OP_FREE:
> +		iommu_free_pages(table);
> +		break;
> +	}
> +
> +	return ret;
> +}
> +
> +static int pasid_lu_handle_pd(struct pasid_dir_entry *dir, enum pasid_lu_op op)
> +{
> +	struct pasid_entry *table;
> +	int ret;
> +
> +	/* Only preserve first table for NO_PASID. */
> +	table = get_pasid_table_from_pde(&dir[0]);
> +	if (!table)
> +		return -EINVAL;
> +
> +	ret = pasid_lu_do_op(table, op);
> +	if (ret)
> +		return ret;
> +
> +	ret = pasid_lu_do_op(dir, op);
> +	if (ret)
> +		goto err;
> +
> +	return 0;
> +err:
> +	if (op == PASID_LU_OP_PRESERVE)
> +		pasid_lu_do_op(table, PASID_LU_OP_UNPRESERVE);
> +
> +	return ret;
> +}
> +
> +void pasid_cleanup_preserved_table(struct device *dev)
> +{
> +	struct pasid_table *pasid_table;
> +	struct pasid_dir_entry *dir;
> +	struct pasid_entry *table;
> +	size_t dir_size;
> +
> +	pasid_table = intel_pasid_get_table(dev);
> +	if (!pasid_table)
> +		return;
> +
> +	dir = pasid_table->table;
> +	table = get_pasid_table_from_pde(&dir[0]);
> +	if (!table)
> +		return;
> +
> +	/* Clear everything except the first entry in table. */
> +	memset(&table[1], 0, SZ_4K - sizeof(*table));
> +
> +	/* Use the folio order to calculate the size of Pasid Directory */
> +	dir_size = (1 << (folio_order(virt_to_folio(dir)) + PAGE_SHIFT));
> +
> +	/* Clear everything except the first entry in directory */
> +	memset(&dir[1], 0, dir_size - sizeof(struct pasid_dir_entry));
> +
> +	clflush_cache_range(&table[0], SZ_4K);
> +	clflush_cache_range(&dir[0], dir_size);
> +}

The PASID table is currently active and in use by the hardware. Clearing
the entries without the necessary hardware cache invalidation is buggy.

It seems this manual clearing is a workaround because PASID domain
preservation isn't supported yet. If so, rather than clearing the table
blindly, the code should verify if any PASIDs (other than
IOMMU_NO_PASID) are actually in use. If there are, the preserve callback
should return an error.

Or anything I overlooked here?

Thanks,
baolu

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 11/16] iommu/vt-d: preserve PASID table of preserved device
  2026-05-08  6:05   ` [PATCH v2 11/16] iommu/vt-d: preserve PASID table of preserved device Baolu Lu
@ 2026-05-11 18:45     ` Samiullah Khawaja
  0 siblings, 0 replies; 14+ messages in thread
From: Samiullah Khawaja @ 2026-05-11 18:45 UTC (permalink / raw)
  To: Baolu Lu
  Cc: David Woodhouse, Joerg Roedel, Will Deacon, Jason Gunthorpe,
	Robin Murphy, Kevin Tian, Alex Williamson, Shuah Khan, iommu,
	linux-kernel, kvm, Saeed Mahameed, Adithya Jayachandran,
	Parav Pandit, Leon Romanovsky, William Tu, Pratyush Yadav,
	Pasha Tatashin, David Matlack, Andrew Morton, Chris Li,
	Pranjal Shrivastava, Vipin Sharma, YiFei Zhu

On Fri, May 08, 2026 at 02:05:58PM +0800, Baolu Lu wrote:
>On 4/28/26 01:56, Samiullah Khawaja wrote:
>>In scalable mode the PASID table is used to fetch the io page tables.
>>Preserve and restore the PASID table of the preserved devices.
>>
>>Signed-off-by: Samiullah Khawaja<skhawaja@google.com>
>>---
>>  drivers/iommu/intel/iommu.c      |   5 +-
>>  drivers/iommu/intel/iommu.h      |  12 +++
>>  drivers/iommu/intel/liveupdate.c | 141 +++++++++++++++++++++++++++++++
>>  drivers/iommu/intel/pasid.c      |   7 +-
>>  drivers/iommu/intel/pasid.h      |   9 ++
>>  include/linux/kho/abi/iommu.h    |  13 +++
>>  6 files changed, 184 insertions(+), 3 deletions(-)
>>

[snip]
>>+
>>+void pasid_cleanup_preserved_table(struct device *dev)
>>+{
>>+	struct pasid_table *pasid_table;
>>+	struct pasid_dir_entry *dir;
>>+	struct pasid_entry *table;
>>+	size_t dir_size;
>>+
>>+	pasid_table = intel_pasid_get_table(dev);
>>+	if (!pasid_table)
>>+		return;
>>+
>>+	dir = pasid_table->table;
>>+	table = get_pasid_table_from_pde(&dir[0]);
>>+	if (!table)
>>+		return;
>>+
>>+	/* Clear everything except the first entry in table. */
>>+	memset(&table[1], 0, SZ_4K - sizeof(*table));
>>+
>>+	/* Use the folio order to calculate the size of Pasid Directory */
>>+	dir_size = (1 << (folio_order(virt_to_folio(dir)) + PAGE_SHIFT));
>>+
>>+	/* Clear everything except the first entry in directory */
>>+	memset(&dir[1], 0, dir_size - sizeof(struct pasid_dir_entry));
>>+
>>+	clflush_cache_range(&table[0], SZ_4K);
>>+	clflush_cache_range(&dir[0], dir_size);
>>+}
>
>The PASID table is currently active and in use by the hardware. Clearing
>the entries without the necessary hardware cache invalidation is buggy.
>
>It seems this manual clearing is a workaround because PASID domain
>preservation isn't supported yet. If so, rather than clearing the table
>blindly, the code should verify if any PASIDs (other than
>IOMMU_NO_PASID) are actually in use. If there are, the preserve callback
>should return an error.

Thanks for looking into this, I agree and will remove the clearing logic
here.

Yes, we do this check in iommufd, as it is the dma owner of the device,
and only DMA owned devices are allowed to be preserved.

During preservation iommufd returns an error if the device has PASID
(non NO_PASID) attachments. And once the device is preserved, any PASID
attachments are not allowed until the device is unpreserved.

I think I will make this check robust by moving it into core and use
pasid_array. It will require some plumbing as pasid_array exists in
iommu.c file.
>
>Or anything I overlooked here?
>
>Thanks,
>baolu

Thanks,
Sami

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2026-05-11 18:45 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20260427175633.1978233-1-skhawaja@google.com>
     [not found] ` <20260427175633.1978233-4-skhawaja@google.com>
     [not found]   ` <afUkbz0IQh98siYn@google.com>
2026-05-04 18:33     ` [PATCH v2 03/16] iommu: Implement IOMMU domain preservation Samiullah Khawaja
     [not found] ` <20260427175633.1978233-5-skhawaja@google.com>
     [not found]   ` <afUscSeXKnNAJ0ZF@google.com>
2026-05-04 19:06     ` [PATCH v2 04/16] iommu: Implement device and IOMMU HW preservation Samiullah Khawaja
2026-05-07  2:07   ` Baolu Lu
2026-05-07 18:47     ` Samiullah Khawaja
     [not found] ` <20260427175633.1978233-7-skhawaja@google.com>
2026-05-07  2:55   ` [PATCH v2 06/16] iommupt: Implement preserve/unpreserve/restore callbacks Baolu Lu
2026-05-07 18:40     ` Samiullah Khawaja
     [not found] ` <20260427175633.1978233-8-skhawaja@google.com>
2026-05-07  6:25   ` [PATCH v2 07/16] iommu/vt-d: Implement device and iommu preserve/unpreserve ops Baolu Lu
2026-05-08  2:36     ` Samiullah Khawaja
     [not found] ` <20260427175633.1978233-10-skhawaja@google.com>
2026-05-07  9:05   ` [PATCH v2 09/16] iommu/vt-d: Restore IOMMU state and reclaimed domain ids Baolu Lu
2026-05-07 17:35     ` Samiullah Khawaja
     [not found] ` <20260427175633.1978233-11-skhawaja@google.com>
2026-05-07 13:54   ` [PATCH v2 10/16] iommu: Restore and reattach preserved domains to devices Baolu Lu
2026-05-07 16:52     ` Samiullah Khawaja
     [not found] ` <20260427175633.1978233-12-skhawaja@google.com>
2026-05-08  6:05   ` [PATCH v2 11/16] iommu/vt-d: preserve PASID table of preserved device Baolu Lu
2026-05-11 18:45     ` Samiullah Khawaja

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox