* [PATCH v2 0/4] iommufd: Destroy vdevice on device unbind @ 2025-06-23 9:49 Xu Yilun 2025-06-23 9:49 ` [PATCH v2 1/4] iommufd: Add iommufd_object_tombstone_user() helper Xu Yilun ` (3 more replies) 0 siblings, 4 replies; 32+ messages in thread From: Xu Yilun @ 2025-06-23 9:49 UTC (permalink / raw) To: jgg, jgg, kevin.tian, will, aneesh.kumar Cc: iommu, linux-kernel, joro, robin.murphy, shuah, nicolinc, aik, dan.j.williams, baolu.lu, yilun.xu, yilun.xu This is v2 of Aneesh's patch [1]. It solves the lifecycle issue that a vdevice may outlive its idevice. It is a prerequisite for TIO, ensuring extra secure configurations (e.g. TSM Bind/Unbind) against the vdevice can be rolled back on idevice unbind, so that VFIO can still work on the physical device without surprises. This series revives some of Nicolin's vdevice_alloc changes from v5 [2]. [1]: https://lore.kernel.org/linux-iommu/20250610065146.1321816-1-aneesh.kumar@kernel.org/ [2]: https://lore.kernel.org/all/53025c827c44d68edb6469bfd940a8e8bc6147a5.1729897278.git.nicolinc@nvidia.com/ Xu Yilun (4): iommufd: Add iommufd_object_tombstone_user() helper iommufd/viommu: Fix the uninitialized iommufd_vdevice::ictx iommufd: Destroy vdevice on idevice destroy iommufd/selftest: Add coverage for vdevice tombstone drivers/iommu/iommufd/device.c | 43 +++++++++++++++++++++++++ drivers/iommu/iommufd/iommufd_private.h | 34 ++++++++++++++++++- drivers/iommu/iommufd/main.c | 32 +++++++++++++----- drivers/iommu/iommufd/viommu.c | 34 +++++++++++++++++-- tools/testing/selftests/iommu/iommufd.c | 13 ++++++++ 5 files changed, 144 insertions(+), 12 deletions(-) base-commit: 19272b37aa4f83ca52bdf9c16d5d81bdd1354494 -- 2.25.1 ^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH v2 1/4] iommufd: Add iommufd_object_tombstone_user() helper 2025-06-23 9:49 [PATCH v2 0/4] iommufd: Destroy vdevice on device unbind Xu Yilun @ 2025-06-23 9:49 ` Xu Yilun 2025-06-24 13:35 ` Jason Gunthorpe 2025-06-25 5:51 ` Aneesh Kumar K.V 2025-06-23 9:49 ` [PATCH v2 2/4] iommufd/viommu: Fix the uninitialized iommufd_vdevice::ictx Xu Yilun ` (2 subsequent siblings) 3 siblings, 2 replies; 32+ messages in thread From: Xu Yilun @ 2025-06-23 9:49 UTC (permalink / raw) To: jgg, jgg, kevin.tian, will, aneesh.kumar Cc: iommu, linux-kernel, joro, robin.murphy, shuah, nicolinc, aik, dan.j.williams, baolu.lu, yilun.xu, yilun.xu Add the iommufd_object_tombstone_user() helper, which allows the caller to destroy an iommufd object created by userspace. This is useful on some destroy paths when the kernel caller finds the object should have been removed by userspace but is still alive. With this helper, the caller destroys the object but leaves the object ID reserved (a so-called tombstone). The tombstone prevents repurposing the object ID without the awareness of the original user. Since this happens on abnormal userspace behavior, for simplicity, the tombstoned object ID is permanently leaked until iommufd_fops_release(), i.e. the original user gets an error when calling ioctl(IOMMU_DESTROY) on that ID. The first use case is to ensure the iommufd_vdevice can't outlive the associated iommufd_device. 
Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Xu Yilun <yilun.xu@linux.intel.com> --- drivers/iommu/iommufd/iommufd_private.h | 23 +++++++++++++++++- drivers/iommu/iommufd/main.c | 31 ++++++++++++++++++------- 2 files changed, 45 insertions(+), 9 deletions(-) diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h index 9ccc83341f32..fbc9ef78d81f 100644 --- a/drivers/iommu/iommufd/iommufd_private.h +++ b/drivers/iommu/iommufd/iommufd_private.h @@ -187,7 +187,8 @@ void iommufd_object_finalize(struct iommufd_ctx *ictx, struct iommufd_object *obj); enum { - REMOVE_WAIT_SHORTTERM = 1, + REMOVE_WAIT_SHORTTERM = BIT(0), + REMOVE_OBJ_TOMBSTONE = BIT(1), }; int iommufd_object_remove(struct iommufd_ctx *ictx, struct iommufd_object *to_destroy, u32 id, @@ -213,6 +214,26 @@ static inline void iommufd_object_destroy_user(struct iommufd_ctx *ictx, WARN_ON(ret); } +/* + * Similar to iommufd_object_destroy_user(), except that the object ID is left + * reserved/tombstoned. + */ +static inline void iommufd_object_tombstone_user(struct iommufd_ctx *ictx, + struct iommufd_object *obj) +{ + int ret; + + ret = iommufd_object_remove(ictx, obj, obj->id, + REMOVE_WAIT_SHORTTERM | REMOVE_OBJ_TOMBSTONE); + + /* + * If there is a bug and we couldn't destroy the object then we did put + * back the caller's users refcount and will eventually try to free it + * again during close. + */ + WARN_ON(ret); +} + /* * The HWPT allocated by autodomains is used in possibly many devices and * is automatically destroyed when its refcount reaches zero. diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c index 3df468f64e7d..5fd75aba068b 100644 --- a/drivers/iommu/iommufd/main.c +++ b/drivers/iommu/iommufd/main.c @@ -167,7 +167,7 @@ int iommufd_object_remove(struct iommufd_ctx *ictx, goto err_xa; } - xas_store(&xas, NULL); + xas_store(&xas, (flags & REMOVE_OBJ_TOMBSTONE) ? 
XA_ZERO_ENTRY : NULL); if (ictx->vfio_ioas == container_of(obj, struct iommufd_ioas, obj)) ictx->vfio_ioas = NULL; xa_unlock(&ictx->objects); @@ -238,6 +238,7 @@ static int iommufd_fops_release(struct inode *inode, struct file *filp) struct iommufd_ctx *ictx = filp->private_data; struct iommufd_sw_msi_map *next; struct iommufd_sw_msi_map *cur; + XA_STATE(xas, &ictx->objects, 0); struct iommufd_object *obj; /* @@ -251,16 +252,30 @@ static int iommufd_fops_release(struct inode *inode, struct file *filp) */ while (!xa_empty(&ictx->objects)) { unsigned int destroyed = 0; - unsigned long index; - xa_for_each(&ictx->objects, index, obj) { - if (!refcount_dec_if_one(&obj->users)) - continue; + xas_set(&xas, 0); + for (;;) { + rcu_read_lock(); + obj = xas_find(&xas, ULONG_MAX); + rcu_read_unlock(); + + if (!obj) + break; + + if (!xa_is_zero(obj)) { + if (!refcount_dec_if_one(&obj->users)) + continue; + + iommufd_object_ops[obj->type].destroy(obj); + kfree(obj); + } + destroyed++; - xa_erase(&ictx->objects, index); - iommufd_object_ops[obj->type].destroy(obj); - kfree(obj); + xas_lock(&xas); + xas_store(&xas, NULL); + xas_unlock(&xas); } + /* Bug related to users refcount */ if (WARN_ON(!destroyed)) break; -- 2.25.1 ^ permalink raw reply related [flat|nested] 32+ messages in thread
* Re: [PATCH v2 1/4] iommufd: Add iommufd_object_tombstone_user() helper 2025-06-23 9:49 ` [PATCH v2 1/4] iommufd: Add iommufd_object_tombstone_user() helper Xu Yilun @ 2025-06-24 13:35 ` Jason Gunthorpe 2025-06-25 7:24 ` Xu Yilun 0 siblings, 1 reply; 32+ messages in thread From: Jason Gunthorpe @ 2025-06-24 13:35 UTC (permalink / raw) To: Xu Yilun Cc: kevin.tian, will, aneesh.kumar, iommu, linux-kernel, joro, robin.murphy, shuah, nicolinc, aik, dan.j.williams, baolu.lu, yilun.xu On Mon, Jun 23, 2025 at 05:49:43PM +0800, Xu Yilun wrote: > @@ -251,16 +252,30 @@ static int iommufd_fops_release(struct inode *inode, struct file *filp) > */ > while (!xa_empty(&ictx->objects)) { > unsigned int destroyed = 0; > - unsigned long index; > > - xa_for_each(&ictx->objects, index, obj) { > - if (!refcount_dec_if_one(&obj->users)) > - continue; I don't think you need all these changes. xa_for_each() will skip the XA_ZERO_ENTRIES because it won't return NULL. xa_empty() will fail, but just remove it. @@ -288,6 +288,7 @@ static int iommufd_fops_release(struct inode *inode, struct file *filp) struct iommufd_sw_msi_map *next; struct iommufd_sw_msi_map *cur; struct iommufd_object *obj; + bool empty; /* * The objects in the xarray form a graph of "users" counts, and we have @@ -298,11 +299,13 @@ static int iommufd_fops_release(struct inode *inode, struct file *filp) * until the entire list is destroyed. If this can't progress then there * is some bug related to object refcounting. 
*/ - while (!xa_empty(&ictx->objects)) { + do { unsigned int destroyed = 0; unsigned long index; + empty = true; xa_for_each(&ictx->objects, index, obj) { + empty = false; if (!refcount_dec_if_one(&obj->users)) continue; destroyed++; @@ -313,8 +316,13 @@ static int iommufd_fops_release(struct inode *inode, struct file *filp) /* Bug related to users refcount */ if (WARN_ON(!destroyed)) break; - } - WARN_ON(!xa_empty(&ictx->groups)); + } while (!empty); + + /* + * There may be some tombstones left over from + * iommufd_object_tombstone_user() + */ + xa_destroy(&ictx->groups); Jason ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v2 1/4] iommufd: Add iommufd_object_tombstone_user() helper 2025-06-24 13:35 ` Jason Gunthorpe @ 2025-06-25 7:24 ` Xu Yilun 0 siblings, 0 replies; 32+ messages in thread From: Xu Yilun @ 2025-06-25 7:24 UTC (permalink / raw) To: Jason Gunthorpe Cc: kevin.tian, will, aneesh.kumar, iommu, linux-kernel, joro, robin.murphy, shuah, nicolinc, aik, dan.j.williams, baolu.lu, yilun.xu On Tue, Jun 24, 2025 at 10:35:20AM -0300, Jason Gunthorpe wrote: > On Mon, Jun 23, 2025 at 05:49:43PM +0800, Xu Yilun wrote: > > @@ -251,16 +252,30 @@ static int iommufd_fops_release(struct inode *inode, struct file *filp) > > */ > > while (!xa_empty(&ictx->objects)) { > > unsigned int destroyed = 0; > > - unsigned long index; > > > > - xa_for_each(&ictx->objects, index, obj) { > > - if (!refcount_dec_if_one(&obj->users)) > > - continue; > > I don't think you need all these changes.. > > xa_for_each will skip the XA_ZERO_ENTRIES because it won't return NULL > > xa_empty() will fail, but just remove it. > > @@ -288,6 +288,7 @@ static int iommufd_fops_release(struct inode *inode, struct file *filp) > struct iommufd_sw_msi_map *next; > struct iommufd_sw_msi_map *cur; > struct iommufd_object *obj; > + bool empty; > > /* > * The objects in the xarray form a graph of "users" counts, and we have > @@ -298,11 +299,13 @@ static int iommufd_fops_release(struct inode *inode, struct file *filp) > * until the entire list is destroyed. If this can't progress then there > * is some bug related to object refcounting. 
> */ > - while (!xa_empty(&ictx->objects)) { > + do { > unsigned int destroyed = 0; > unsigned long index; > > + empty = true; > xa_for_each(&ictx->objects, index, obj) { > + empty = false; > if (!refcount_dec_if_one(&obj->users)) > continue; > destroyed++; > @@ -313,8 +316,13 @@ static int iommufd_fops_release(struct inode *inode, struct file *filp) > /* Bug related to users refcount */ > if (WARN_ON(!destroyed)) > break; > - } > - WARN_ON(!xa_empty(&ictx->groups)); > + } while (!empty); > + > + /* > + * There may be some tombstones left over from > + * iommufd_object_tombstone_user() > + */ > + xa_destroy(&ictx->groups); I'm good with it. I was obsessed with keeping the xarray super clean. Thanks, Yilun > > > Jason ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v2 1/4] iommufd: Add iommufd_object_tombstone_user() helper 2025-06-23 9:49 ` [PATCH v2 1/4] iommufd: Add iommufd_object_tombstone_user() helper Xu Yilun 2025-06-24 13:35 ` Jason Gunthorpe @ 2025-06-25 5:51 ` Aneesh Kumar K.V 2025-06-25 8:40 ` Xu Yilun 1 sibling, 1 reply; 32+ messages in thread From: Aneesh Kumar K.V @ 2025-06-25 5:51 UTC (permalink / raw) To: Xu Yilun, jgg, jgg, kevin.tian, will Cc: iommu, linux-kernel, joro, robin.murphy, shuah, nicolinc, aik, dan.j.williams, baolu.lu, yilun.xu, yilun.xu Xu Yilun <yilun.xu@linux.intel.com> writes: > Add the iommufd_object_tombstone_user() helper, which allows the caller > to destroy an iommufd object created by userspace. > > This is useful on some destroy paths when the kernel caller finds the > object should have been removed by userspace but is still alive. With > this helper, the caller destroys the object but leave the object ID > reserved (so called tombstone). The tombstone prevents repurposing the > object ID without awareness from the original user. > > Since this happens for abnomal userspace behavior, for simplicity, the > tombstoned object ID would be permanently leaked until > iommufd_fops_release(). I.e. the original user gets an error when > calling ioctl(IOMMU_DESTROY) on that ID. > > The first use case would be to ensure the iommufd_vdevice can't outlive > the associated iommufd_device. 
> > Suggested-by: Jason Gunthorpe <jgg@nvidia.com> > Signed-off-by: Xu Yilun <yilun.xu@linux.intel.com> > --- > drivers/iommu/iommufd/iommufd_private.h | 23 +++++++++++++++++- > drivers/iommu/iommufd/main.c | 31 ++++++++++++++++++------- > 2 files changed, 45 insertions(+), 9 deletions(-) > > diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h > index 9ccc83341f32..fbc9ef78d81f 100644 > --- a/drivers/iommu/iommufd/iommufd_private.h > +++ b/drivers/iommu/iommufd/iommufd_private.h > @@ -187,7 +187,8 @@ void iommufd_object_finalize(struct iommufd_ctx *ictx, > struct iommufd_object *obj); > > enum { > - REMOVE_WAIT_SHORTTERM = 1, > + REMOVE_WAIT_SHORTTERM = BIT(0), > + REMOVE_OBJ_TOMBSTONE = BIT(1), > }; > int iommufd_object_remove(struct iommufd_ctx *ictx, > struct iommufd_object *to_destroy, u32 id, > @@ -213,6 +214,26 @@ static inline void iommufd_object_destroy_user(struct iommufd_ctx *ictx, > WARN_ON(ret); > } > > +/* > + * Similar to iommufd_object_destroy_user(), except that the object ID is left > + * reserved/tombstoned. > + */ > +static inline void iommufd_object_tombstone_user(struct iommufd_ctx *ictx, > + struct iommufd_object *obj) > +{ > + int ret; > + > + ret = iommufd_object_remove(ictx, obj, obj->id, > + REMOVE_WAIT_SHORTTERM | REMOVE_OBJ_TOMBSTONE); > + > + /* > + * If there is a bug and we couldn't destroy the object then we did put > + * back the caller's users refcount and will eventually try to free it > + * again during close. > + */ > + WARN_ON(ret); > +} > + > /* > * The HWPT allocated by autodomains is used in possibly many devices and > * is automatically destroyed when its refcount reaches zero. 
> diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c > index 3df468f64e7d..5fd75aba068b 100644 > --- a/drivers/iommu/iommufd/main.c > +++ b/drivers/iommu/iommufd/main.c > @@ -167,7 +167,7 @@ int iommufd_object_remove(struct iommufd_ctx *ictx, > goto err_xa; > } > > - xas_store(&xas, NULL); > + xas_store(&xas, (flags & REMOVE_OBJ_TOMBSTONE) ? XA_ZERO_ENTRY : NULL); > if (ictx->vfio_ioas == container_of(obj, struct iommufd_ioas, obj)) > ictx->vfio_ioas = NULL; > xa_unlock(&ictx->objects); > @@ -238,6 +238,7 @@ static int iommufd_fops_release(struct inode *inode, struct file *filp) > struct iommufd_ctx *ictx = filp->private_data; > struct iommufd_sw_msi_map *next; > struct iommufd_sw_msi_map *cur; > + XA_STATE(xas, &ictx->objects, 0); > struct iommufd_object *obj; > > /* > @@ -251,16 +252,30 @@ static int iommufd_fops_release(struct inode *inode, struct file *filp) > */ > while (!xa_empty(&ictx->objects)) { > unsigned int destroyed = 0; > - unsigned long index; > > - xa_for_each(&ictx->objects, index, obj) { > - if (!refcount_dec_if_one(&obj->users)) > - continue; > + xas_set(&xas, 0); > + for (;;) { > + rcu_read_lock(); > + obj = xas_find(&xas, ULONG_MAX); > + rcu_read_unlock(); > What is the need for the rcu_read_lock()? > + > + if (!obj) > + break; > + > + if (!xa_is_zero(obj)) { > + if (!refcount_dec_if_one(&obj->users)) > + continue; > + > + iommufd_object_ops[obj->type].destroy(obj); > + kfree(obj); > + } > + > destroyed++; > - xa_erase(&ictx->objects, index); > - iommufd_object_ops[obj->type].destroy(obj); > - kfree(obj); > + xas_lock(&xas); > + xas_store(&xas, NULL); > + xas_unlock(&xas); Is that xas_lock needed? We can't run an xarray update in parallel with this, because neither an iommufd ioctl nor a vfio cdev unbind can happen in parallel. I have this as an additional comment added to the function in my change. 
/* * We don't need additional locks because the iommufd_fops_release() function is * only triggered when the iommufd descriptor is released. At that point, no * other iommufd-based ioctl operations can be running concurrently. * * The destruction of the vdevice via idevice unbind is also safe: * iommufd_fops_release() can only be called after the idevice has been unbound. * The idevice bind operation takes a reference to the iommufd descriptor, * preventing early release. */ > } > + > /* Bug related to users refcount */ > if (WARN_ON(!destroyed)) > break; > -- > 2.25.1 -aneesh ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v2 1/4] iommufd: Add iommufd_object_tombstone_user() helper 2025-06-25 5:51 ` Aneesh Kumar K.V @ 2025-06-25 8:40 ` Xu Yilun 0 siblings, 0 replies; 32+ messages in thread From: Xu Yilun @ 2025-06-25 8:40 UTC (permalink / raw) To: Aneesh Kumar K.V Cc: jgg, jgg, kevin.tian, will, iommu, linux-kernel, joro, robin.murphy, shuah, nicolinc, aik, dan.j.williams, baolu.lu, yilun.xu On Wed, Jun 25, 2025 at 11:21:15AM +0530, Aneesh Kumar K.V wrote: > Xu Yilun <yilun.xu@linux.intel.com> writes: > > > Add the iommufd_object_tombstone_user() helper, which allows the caller > > to destroy an iommufd object created by userspace. > > > > This is useful on some destroy paths when the kernel caller finds the > > object should have been removed by userspace but is still alive. With > > this helper, the caller destroys the object but leave the object ID > > reserved (so called tombstone). The tombstone prevents repurposing the > > object ID without awareness from the original user. > > > > Since this happens for abnomal userspace behavior, for simplicity, the > > tombstoned object ID would be permanently leaked until > > iommufd_fops_release(). I.e. the original user gets an error when > > calling ioctl(IOMMU_DESTROY) on that ID. > > > > The first use case would be to ensure the iommufd_vdevice can't outlive > > the associated iommufd_device. 
> > > > Suggested-by: Jason Gunthorpe <jgg@nvidia.com> > > Signed-off-by: Xu Yilun <yilun.xu@linux.intel.com> > > --- > > drivers/iommu/iommufd/iommufd_private.h | 23 +++++++++++++++++- > > drivers/iommu/iommufd/main.c | 31 ++++++++++++++++++------- > > 2 files changed, 45 insertions(+), 9 deletions(-) > > > > diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h > > index 9ccc83341f32..fbc9ef78d81f 100644 > > --- a/drivers/iommu/iommufd/iommufd_private.h > > +++ b/drivers/iommu/iommufd/iommufd_private.h > > @@ -187,7 +187,8 @@ void iommufd_object_finalize(struct iommufd_ctx *ictx, > > struct iommufd_object *obj); > > > > enum { > > - REMOVE_WAIT_SHORTTERM = 1, > > + REMOVE_WAIT_SHORTTERM = BIT(0), > > + REMOVE_OBJ_TOMBSTONE = BIT(1), > > }; > > int iommufd_object_remove(struct iommufd_ctx *ictx, > > struct iommufd_object *to_destroy, u32 id, > > @@ -213,6 +214,26 @@ static inline void iommufd_object_destroy_user(struct iommufd_ctx *ictx, > > WARN_ON(ret); > > } > > > > +/* > > + * Similar to iommufd_object_destroy_user(), except that the object ID is left > > + * reserved/tombstoned. > > + */ > > +static inline void iommufd_object_tombstone_user(struct iommufd_ctx *ictx, > > + struct iommufd_object *obj) > > +{ > > + int ret; > > + > > + ret = iommufd_object_remove(ictx, obj, obj->id, > > + REMOVE_WAIT_SHORTTERM | REMOVE_OBJ_TOMBSTONE); > > + > > + /* > > + * If there is a bug and we couldn't destroy the object then we did put > > + * back the caller's users refcount and will eventually try to free it > > + * again during close. > > + */ > > + WARN_ON(ret); > > +} > > + > > /* > > * The HWPT allocated by autodomains is used in possibly many devices and > > * is automatically destroyed when its refcount reaches zero. 
> > diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c > > index 3df468f64e7d..5fd75aba068b 100644 > > --- a/drivers/iommu/iommufd/main.c > > +++ b/drivers/iommu/iommufd/main.c > > @@ -167,7 +167,7 @@ int iommufd_object_remove(struct iommufd_ctx *ictx, > > goto err_xa; > > } > > > > - xas_store(&xas, NULL); > > + xas_store(&xas, (flags & REMOVE_OBJ_TOMBSTONE) ? XA_ZERO_ENTRY : NULL); > > if (ictx->vfio_ioas == container_of(obj, struct iommufd_ioas, obj)) > > ictx->vfio_ioas = NULL; > > xa_unlock(&ictx->objects); > > @@ -238,6 +238,7 @@ static int iommufd_fops_release(struct inode *inode, struct file *filp) > > struct iommufd_ctx *ictx = filp->private_data; > > struct iommufd_sw_msi_map *next; > > struct iommufd_sw_msi_map *cur; > > + XA_STATE(xas, &ictx->objects, 0); > > struct iommufd_object *obj; > > > > /* > > @@ -251,16 +252,30 @@ static int iommufd_fops_release(struct inode *inode, struct file *filp) > > */ > > while (!xa_empty(&ictx->objects)) { > > unsigned int destroyed = 0; > > - unsigned long index; > > > > - xa_for_each(&ictx->objects, index, obj) { > > - if (!refcount_dec_if_one(&obj->users)) > > - continue; > > + xas_set(&xas, 0); > > + for (;;) { > > + rcu_read_lock(); > > + obj = xas_find(&xas, ULONG_MAX); > > + rcu_read_unlock(); > > > > What is the need for the rcu_read_lock()? To suppress the RCU warning in xas_find(). > > > + > > + if (!obj) > > + break; > > + > > + if (!xa_is_zero(obj)) { > > + if (!refcount_dec_if_one(&obj->users)) > > + continue; > > + > > + iommufd_object_ops[obj->type].destroy(obj); > > + kfree(obj); > > + } > > + > > destroyed++; > > - xa_erase(&ictx->objects, index); > > - iommufd_object_ops[obj->type].destroy(obj); > > - kfree(obj); > > + xas_lock(&xas); > > + xas_store(&xas, NULL); > > + xas_unlock(&xas); > > is that xas_lock needed?. we can't run a xarray update parallel to this > because neither iommufd ioctl not vfio cdev unbind can happen in parallel. 
That's true, but it's also needed to suppress the warning in xas_store(). > > I have this as an additonal comment added to the function in my change. > > /* > * We don't need additional locks because the iommufd_fops_release() function is > * only triggered when the iommufd descriptor is released. At that point, no > * other iommufd-based ioctl operations can be running concurrently. > * > * The destruction of the vdevice via idevice unbind is also safe: > * iommufd_fops_release() can only be called after the idevice has been unbound. > * The idevice bind operation takes a reference to the iommufd descriptor, > * preventing early release. > */ That's good. But Jason has another suggestion: there is no need to clear tombstones in fops_release(), so we don't need these changes at all. Thanks, Yilun > > > > } > > + > > /* Bug related to users refcount */ > > if (WARN_ON(!destroyed)) > > break; > > -- > > 2.25.1 > > -aneesh ^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH v2 2/4] iommufd/viommu: Fix the uninitialized iommufd_vdevice::ictx 2025-06-23 9:49 [PATCH v2 0/4] iommufd: Destroy vdevice on device unbind Xu Yilun 2025-06-23 9:49 ` [PATCH v2 1/4] iommufd: Add iommufd_object_tombstone_user() helper Xu Yilun @ 2025-06-23 9:49 ` Xu Yilun 2025-06-24 3:24 ` Baolu Lu 2025-06-23 9:49 ` [PATCH v2 3/4] iommufd: Destroy vdevice on idevice destroy Xu Yilun 2025-06-23 9:49 ` [PATCH v2 4/4] iommufd/selftest: Add coverage for vdevice tombstone Xu Yilun 3 siblings, 1 reply; 32+ messages in thread From: Xu Yilun @ 2025-06-23 9:49 UTC (permalink / raw) To: jgg, jgg, kevin.tian, will, aneesh.kumar Cc: iommu, linux-kernel, joro, robin.murphy, shuah, nicolinc, aik, dan.j.williams, baolu.lu, yilun.xu, yilun.xu Fix the uninitialized iommufd_vdevice::ictx. No code was using this field before, but later vdevice will use it to sync up with idevice on destroy paths. Fixes: 0ce5c2477af2 ("iommufd/viommu: Add IOMMUFD_OBJ_VDEVICE and IOMMU_VDEVICE_ALLOC ioctl") Cc: <stable@vger.kernel.org> Signed-off-by: Xu Yilun <yilun.xu@linux.intel.com> --- drivers/iommu/iommufd/viommu.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/iommu/iommufd/viommu.c b/drivers/iommu/iommufd/viommu.c index 01df2b985f02..4577b88c8560 100644 --- a/drivers/iommu/iommufd/viommu.c +++ b/drivers/iommu/iommufd/viommu.c @@ -130,6 +130,7 @@ int iommufd_vdevice_alloc_ioctl(struct iommufd_ucmd *ucmd) goto out_put_idev; } + vdev->ictx = ucmd->ictx; vdev->id = virt_id; vdev->dev = idev->dev; get_device(idev->dev); -- 2.25.1 ^ permalink raw reply related [flat|nested] 32+ messages in thread
* Re: [PATCH v2 2/4] iommufd/viommu: Fix the uninitialized iommufd_vdevice::ictx 2025-06-23 9:49 ` [PATCH v2 2/4] iommufd/viommu: Fix the uninitialized iommufd_vdevice::ictx Xu Yilun @ 2025-06-24 3:24 ` Baolu Lu 2025-06-24 6:35 ` Xu Yilun 0 siblings, 1 reply; 32+ messages in thread From: Baolu Lu @ 2025-06-24 3:24 UTC (permalink / raw) To: Xu Yilun, jgg, jgg, kevin.tian, will, aneesh.kumar Cc: iommu, linux-kernel, joro, robin.murphy, shuah, nicolinc, aik, dan.j.williams, yilun.xu On 6/23/25 17:49, Xu Yilun wrote: > Fix the uninitialized iommufd_vdevice::ictx. No code was using this > field before, but later vdevice will use it to sync up with idevice on > destroy paths. > > Fixes: 0ce5c2477af2 ("iommufd/viommu: Add IOMMUFD_OBJ_VDEVICE and IOMMU_VDEVICE_ALLOC ioctl") > Cc:<stable@vger.kernel.org> > Signed-off-by: Xu Yilun<yilun.xu@linux.intel.com> > --- > drivers/iommu/iommufd/viommu.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/drivers/iommu/iommufd/viommu.c b/drivers/iommu/iommufd/viommu.c > index 01df2b985f02..4577b88c8560 100644 > --- a/drivers/iommu/iommufd/viommu.c > +++ b/drivers/iommu/iommufd/viommu.c > @@ -130,6 +130,7 @@ int iommufd_vdevice_alloc_ioctl(struct iommufd_ucmd *ucmd) > goto out_put_idev; > } > > + vdev->ictx = ucmd->ictx; iommufd_vdevice::ictx has been removed by this commit: 6e235a772199 ("iommufd: Drop unused ictx in struct iommufd_vdevice") in linux-next. Thanks, baolu ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v2 2/4] iommufd/viommu: Fix the uninitialized iommufd_vdevice::ictx 2025-06-24 3:24 ` Baolu Lu @ 2025-06-24 6:35 ` Xu Yilun 0 siblings, 0 replies; 32+ messages in thread From: Xu Yilun @ 2025-06-24 6:35 UTC (permalink / raw) To: Baolu Lu Cc: jgg, jgg, kevin.tian, will, aneesh.kumar, iommu, linux-kernel, joro, robin.murphy, shuah, nicolinc, aik, dan.j.williams, yilun.xu On Tue, Jun 24, 2025 at 11:24:02AM +0800, Baolu Lu wrote: > On 6/23/25 17:49, Xu Yilun wrote: > > Fix the uninitialized iommufd_vdevice::ictx. No code was using this > > field before, but later vdevice will use it to sync up with idevice on > > destroy paths. > > > > Fixes: 0ce5c2477af2 ("iommufd/viommu: Add IOMMUFD_OBJ_VDEVICE and IOMMU_VDEVICE_ALLOC ioctl") > > Cc:<stable@vger.kernel.org> > > Signed-off-by: Xu Yilun<yilun.xu@linux.intel.com> > > --- > > drivers/iommu/iommufd/viommu.c | 1 + > > 1 file changed, 1 insertion(+) > > > > diff --git a/drivers/iommu/iommufd/viommu.c b/drivers/iommu/iommufd/viommu.c > > index 01df2b985f02..4577b88c8560 100644 > > --- a/drivers/iommu/iommufd/viommu.c > > +++ b/drivers/iommu/iommufd/viommu.c > > @@ -130,6 +130,7 @@ int iommufd_vdevice_alloc_ioctl(struct iommufd_ucmd *ucmd) > > goto out_put_idev; > > } > > + vdev->ictx = ucmd->ictx; > > iommufd_vdevice::ictx has been removed by this commit: > > 6e235a772199 ("iommufd: Drop unused ictx in struct iommufd_vdevice") > > in linux-next. Ah, I see the thread. This patch should be dropped. Thanks, Yilun > > Thanks, > baolu ^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH v2 3/4] iommufd: Destroy vdevice on idevice destroy 2025-06-23 9:49 [PATCH v2 0/4] iommufd: Destroy vdevice on device unbind Xu Yilun 2025-06-23 9:49 ` [PATCH v2 1/4] iommufd: Add iommufd_object_tombstone_user() helper Xu Yilun 2025-06-23 9:49 ` [PATCH v2 2/4] iommufd/viommu: Fix the uninitialized iommufd_vdevice::ictx Xu Yilun @ 2025-06-23 9:49 ` Xu Yilun 2025-06-24 3:32 ` Baolu Lu ` (4 more replies) 2025-06-23 9:49 ` [PATCH v2 4/4] iommufd/selftest: Add coverage for vdevice tombstone Xu Yilun 3 siblings, 5 replies; 32+ messages in thread From: Xu Yilun @ 2025-06-23 9:49 UTC (permalink / raw) To: jgg, jgg, kevin.tian, will, aneesh.kumar Cc: iommu, linux-kernel, joro, robin.murphy, shuah, nicolinc, aik, dan.j.williams, baolu.lu, yilun.xu, yilun.xu Destroy the iommufd_vdevice (vdev) on iommufd_device (idev) destroy so that the vdev can't outlive the idev. The iommufd_device (idev) represents the physical device bound to iommufd, while the iommufd_vdevice (vdev) represents the virtual instance of the physical device in the VM. The lifecycle of the vdev should not be longer than that of the idev. This doesn't cause real problems in existing use cases because the vdev doesn't impact the physical device; it only provides virtualization information. But to extend the vdev for Confidential Computing (CC), there is a need to do secure configuration for the vdev, e.g. TSM Bind/Unbind. These configurations should be rolled back on idev destroy, or the external driver (VFIO) functionality may be impacted. Building the association between idev & vdev requires the two objects to point to each other, but not hold references on each other. This requires proper locking. This is done by reviving some of Nicolin's patch [1]. There are 3 cases on idev destroy: 1. vdev is already destroyed by userspace. No extra handling needed. 2. vdev is still alive. Use iommufd_object_tombstone_user() to destroy the vdev and tombstone the vdev ID. 3. vdev is being destroyed by userspace. 
The vdev ID is already freed, but vdev destroy handler is not complete. The idev destroy handler should wait for vdev destroy completion. [1]: https://lore.kernel.org/all/53025c827c44d68edb6469bfd940a8e8bc6147a5.1729897278.git.nicolinc@nvidia.com/ Original-by: Nicolin Chen <nicolinc@nvidia.com> Original-by: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org> Signed-off-by: Xu Yilun <yilun.xu@linux.intel.com> --- drivers/iommu/iommufd/device.c | 43 +++++++++++++++++++++++++ drivers/iommu/iommufd/iommufd_private.h | 11 +++++++ drivers/iommu/iommufd/main.c | 1 + drivers/iommu/iommufd/viommu.c | 33 +++++++++++++++++-- 4 files changed, 85 insertions(+), 3 deletions(-) diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c index 86244403b532..908a94a27bab 100644 --- a/drivers/iommu/iommufd/device.c +++ b/drivers/iommu/iommufd/device.c @@ -137,11 +137,54 @@ static struct iommufd_group *iommufd_get_group(struct iommufd_ctx *ictx, } } +static void iommufd_device_remove_vdev(struct iommufd_device *idev) +{ + bool vdev_removing = false; + + mutex_lock(&idev->igroup->lock); + if (idev->vdev) { + struct iommufd_vdevice *vdev; + + vdev = iommufd_get_vdevice(idev->ictx, idev->vdev->obj.id); + if (IS_ERR(vdev)) { + /* vdev is removed from xarray, but is not destroyed/freed */ + vdev_removing = true; + goto unlock; + } + + /* Should never happen */ + if (WARN_ON(vdev != idev->vdev)) { + idev->vdev = NULL; + iommufd_put_object(idev->ictx, &vdev->obj); + goto unlock; + } + + /* + * vdev cannot be destroyed after refcount_inc, safe to release + * idev->igroup->lock and use idev->vdev afterward. 
+ */ + refcount_inc(&idev->vdev->obj.users); + iommufd_put_object(idev->ictx, &idev->vdev->obj); + } +unlock: + mutex_unlock(&idev->igroup->lock); + + if (vdev_removing) { + if (!wait_event_timeout(idev->ictx->destroy_wait, + !idev->vdev, + msecs_to_jiffies(60000))) + pr_crit("Time out waiting for iommufd vdevice removed\n"); + } else if (idev->vdev) { + iommufd_object_tombstone_user(idev->ictx, &idev->vdev->obj); + } +} + void iommufd_device_destroy(struct iommufd_object *obj) { struct iommufd_device *idev = container_of(obj, struct iommufd_device, obj); + iommufd_device_remove_vdev(idev); iommu_device_release_dma_owner(idev->dev); iommufd_put_group(idev->igroup); if (!iommufd_selftest_is_mock_dev(idev->dev)) diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h index fbc9ef78d81f..f58aa4439c53 100644 --- a/drivers/iommu/iommufd/iommufd_private.h +++ b/drivers/iommu/iommufd/iommufd_private.h @@ -446,6 +446,7 @@ struct iommufd_device { /* always the physical device */ struct device *dev; bool enforce_cache_coherency; + struct iommufd_vdevice *vdev; }; static inline struct iommufd_device * @@ -621,6 +622,7 @@ int iommufd_viommu_alloc_ioctl(struct iommufd_ucmd *ucmd); void iommufd_viommu_destroy(struct iommufd_object *obj); int iommufd_vdevice_alloc_ioctl(struct iommufd_ucmd *ucmd); void iommufd_vdevice_destroy(struct iommufd_object *obj); +void iommufd_vdevice_abort(struct iommufd_object *obj); struct iommufd_vdevice { struct iommufd_object obj; @@ -628,8 +630,17 @@ struct iommufd_vdevice { struct iommufd_viommu *viommu; struct device *dev; u64 id; /* per-vIOMMU virtual ID */ + struct iommufd_device *idev; }; +static inline struct iommufd_vdevice * +iommufd_get_vdevice(struct iommufd_ctx *ictx, u32 id) +{ + return container_of(iommufd_get_object(ictx, id, + IOMMUFD_OBJ_VDEVICE), + struct iommufd_vdevice, obj); +} + #ifdef CONFIG_IOMMUFD_TEST int iommufd_test(struct iommufd_ucmd *ucmd); void iommufd_selftest_destroy(struct 
iommufd_object *obj); diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c index 5fd75aba068b..3f955a123095 100644 --- a/drivers/iommu/iommufd/main.c +++ b/drivers/iommu/iommufd/main.c @@ -531,6 +531,7 @@ static const struct iommufd_object_ops iommufd_object_ops[] = { }, [IOMMUFD_OBJ_VDEVICE] = { .destroy = iommufd_vdevice_destroy, + .abort = iommufd_vdevice_abort, }, [IOMMUFD_OBJ_VEVENTQ] = { .destroy = iommufd_veventq_destroy, diff --git a/drivers/iommu/iommufd/viommu.c b/drivers/iommu/iommufd/viommu.c index 4577b88c8560..9b062e651ea5 100644 --- a/drivers/iommu/iommufd/viommu.c +++ b/drivers/iommu/iommufd/viommu.c @@ -84,16 +84,31 @@ int iommufd_viommu_alloc_ioctl(struct iommufd_ucmd *ucmd) return rc; } -void iommufd_vdevice_destroy(struct iommufd_object *obj) +void iommufd_vdevice_abort(struct iommufd_object *obj) { struct iommufd_vdevice *vdev = container_of(obj, struct iommufd_vdevice, obj); struct iommufd_viommu *viommu = vdev->viommu; + struct iommufd_device *idev = vdev->idev; + + lockdep_assert_held(&idev->igroup->lock); /* xa_cmpxchg is okay to fail if alloc failed xa_cmpxchg previously */ xa_cmpxchg(&viommu->vdevs, vdev->id, vdev, NULL, GFP_KERNEL); refcount_dec(&viommu->obj.users); put_device(vdev->dev); + idev->vdev = NULL; +} + +void iommufd_vdevice_destroy(struct iommufd_object *obj) +{ + struct iommufd_vdevice *vdev = + container_of(obj, struct iommufd_vdevice, obj); + + mutex_lock(&vdev->idev->igroup->lock); + iommufd_vdevice_abort(obj); + mutex_unlock(&vdev->idev->igroup->lock); + wake_up_interruptible_all(&vdev->ictx->destroy_wait); } int iommufd_vdevice_alloc_ioctl(struct iommufd_ucmd *ucmd) @@ -124,18 +139,28 @@ int iommufd_vdevice_alloc_ioctl(struct iommufd_ucmd *ucmd) goto out_put_idev; } + mutex_lock(&idev->igroup->lock); + if (idev->vdev) { + rc = -EEXIST; + goto out_unlock_igroup; + } + vdev = iommufd_object_alloc(ucmd->ictx, vdev, IOMMUFD_OBJ_VDEVICE); if (IS_ERR(vdev)) { rc = PTR_ERR(vdev); - goto out_put_idev; + goto 
out_unlock_igroup; } + /* vdev can't outlive idev, vdev->idev is always valid, need no refcnt */ + vdev->idev = idev; vdev->ictx = ucmd->ictx; vdev->id = virt_id; vdev->dev = idev->dev; get_device(idev->dev); vdev->viommu = viommu; refcount_inc(&viommu->obj.users); + /* idev->vdev is protected by idev->igroup->lock, need no refcnt */ + idev->vdev = vdev; curr = xa_cmpxchg(&viommu->vdevs, virt_id, NULL, vdev, GFP_KERNEL); if (curr) { @@ -148,10 +173,12 @@ int iommufd_vdevice_alloc_ioctl(struct iommufd_ucmd *ucmd) if (rc) goto out_abort; iommufd_object_finalize(ucmd->ictx, &vdev->obj); - goto out_put_idev; + goto out_unlock_igroup; out_abort: iommufd_object_abort_and_destroy(ucmd->ictx, &vdev->obj); +out_unlock_igroup: + mutex_unlock(&idev->igroup->lock); out_put_idev: iommufd_put_object(ucmd->ictx, &idev->obj); out_put_viommu: -- 2.25.1 ^ permalink raw reply related [flat|nested] 32+ messages in thread
* Re: [PATCH v2 3/4] iommufd: Destroy vdevice on idevice destroy 2025-06-23 9:49 ` [PATCH v2 3/4] iommufd: Destroy vdevice on idevice destroy Xu Yilun @ 2025-06-24 3:32 ` Baolu Lu 2025-06-24 8:11 ` Xu Yilun 2025-06-24 8:12 ` Tian, Kevin 2025-06-24 6:47 ` Xu Yilun ` (3 subsequent siblings) 4 siblings, 2 replies; 32+ messages in thread From: Baolu Lu @ 2025-06-24 3:32 UTC (permalink / raw) To: Xu Yilun, jgg, jgg, kevin.tian, will, aneesh.kumar Cc: iommu, linux-kernel, joro, robin.murphy, shuah, nicolinc, aik, dan.j.williams, yilun.xu On 6/23/25 17:49, Xu Yilun wrote: > Destroy iommufd_vdevice(vdev) on iommufd_idevice(idev) destroy so that > vdev can't outlive idev. > > iommufd_device(idev) represents the physical device bound to iommufd, > while the iommufd_vdevice(vdev) represents the virtual instance of the > physical device in the VM. The lifecycle of the vdev should not be > longer than idev. This doesn't cause real problem on existing use cases > cause vdev doesn't impact the physical device, only provides > virtualization information. But to extend vdev for Confidential > Computing(CC), there are needs to do secure configuration for the vdev, > e.g. TSM Bind/Unbind. These configurations should be rolled back on idev > destroy, or the external driver(VFIO) functionality may be impact. > > Building the association between idev & vdev requires the two objects > pointing each other, but not referencing each other. Does this mean each idevice can have at most a single vdevice? Is it possible that different PASIDs of a physical device are assigned to userspace for different purposes, such that there is a need for multiple vdevices per idevice? Thanks, baolu ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v2 3/4] iommufd: Destroy vdevice on idevice destroy 2025-06-24 3:32 ` Baolu Lu @ 2025-06-24 8:11 ` Xu Yilun 2025-06-24 8:28 ` Tian, Kevin 2025-06-24 8:12 ` Tian, Kevin 1 sibling, 1 reply; 32+ messages in thread From: Xu Yilun @ 2025-06-24 8:11 UTC (permalink / raw) To: Baolu Lu Cc: jgg, jgg, kevin.tian, will, aneesh.kumar, iommu, linux-kernel, joro, robin.murphy, shuah, nicolinc, aik, dan.j.williams, yilun.xu On Tue, Jun 24, 2025 at 11:32:01AM +0800, Baolu Lu wrote: > On 6/23/25 17:49, Xu Yilun wrote: > > Destroy iommufd_vdevice(vdev) on iommufd_idevice(idev) destroy so that > > vdev can't outlive idev. > > > > iommufd_device(idev) represents the physical device bound to iommufd, > > while the iommufd_vdevice(vdev) represents the virtual instance of the > > physical device in the VM. The lifecycle of the vdev should not be > > longer than idev. This doesn't cause real problem on existing use cases > > cause vdev doesn't impact the physical device, only provides > > virtualization information. But to extend vdev for Confidential > > Computing(CC), there are needs to do secure configuration for the vdev, > > e.g. TSM Bind/Unbind. These configurations should be rolled back on idev > > destroy, or the external driver(VFIO) functionality may be impact. > > > > Building the association between idev & vdev requires the two objects > > pointing each other, but not referencing each other. > > Does this mean each idevice can have at most a single vdevice? Is it Yes, I got this idea from here. https://lore.kernel.org/all/20250604132403.GJ5028@nvidia.com/ > possible that different PASIDs of a physical device are assigned to > userspace for different purposes, such that there is a need for multiple > vdevices per idevice? Mm.. I don't have a clear idea how SIOV assignment works with iommufd, may come back later. Thanks, Yilun ^ permalink raw reply [flat|nested] 32+ messages in thread
* RE: [PATCH v2 3/4] iommufd: Destroy vdevice on idevice destroy 2025-06-24 8:11 ` Xu Yilun @ 2025-06-24 8:28 ` Tian, Kevin 0 siblings, 0 replies; 32+ messages in thread From: Tian, Kevin @ 2025-06-24 8:28 UTC (permalink / raw) To: Xu Yilun, Baolu Lu Cc: jgg@nvidia.com, jgg@ziepe.ca, will@kernel.org, aneesh.kumar@kernel.org, iommu@lists.linux.dev, linux-kernel@vger.kernel.org, joro@8bytes.org, robin.murphy@arm.com, shuah@kernel.org, nicolinc@nvidia.com, aik@amd.com, Williams, Dan J, Xu, Yilun > From: Xu Yilun <yilun.xu@linux.intel.com> > Sent: Tuesday, June 24, 2025 4:12 PM > > On Tue, Jun 24, 2025 at 11:32:01AM +0800, Baolu Lu wrote: > > On 6/23/25 17:49, Xu Yilun wrote: > > > Destroy iommufd_vdevice(vdev) on iommufd_idevice(idev) destroy so > that > > > vdev can't outlive idev. > > > > > > iommufd_device(idev) represents the physical device bound to iommufd, > > > while the iommufd_vdevice(vdev) represents the virtual instance of the > > > physical device in the VM. The lifecycle of the vdev should not be > > > longer than idev. This doesn't cause real problem on existing use cases > > > cause vdev doesn't impact the physical device, only provides > > > virtualization information. But to extend vdev for Confidential > > > Computing(CC), there are needs to do secure configuration for the vdev, > > > e.g. TSM Bind/Unbind. These configurations should be rolled back on idev > > > destroy, or the external driver(VFIO) functionality may be impact. > > > > > > Building the association between idev & vdev requires the two objects > > > pointing each other, but not referencing each other. > > > > Does this mean each idevice can have at most a single vdevice? Is it > > Yes, I got this idea from here. > > https://lore.kernel.org/all/20250604132403.GJ5028@nvidia.com/ > > > possible that different PASIDs of a physical device are assigned to > > userspace for different purposes, such that there is a need for multiple > > vdevices per idevice? > > Mm.. 
I don't have a clear idea how SIOV assignment works with iommufd, > may come back later. > Let's put SIOV out of scope here. It's not supported yet. There are other obstacles to be figured out (e.g. igroup etc.) when it comes to the table. ^ permalink raw reply [flat|nested] 32+ messages in thread
* RE: [PATCH v2 3/4] iommufd: Destroy vdevice on idevice destroy 2025-06-24 3:32 ` Baolu Lu 2025-06-24 8:11 ` Xu Yilun @ 2025-06-24 8:12 ` Tian, Kevin 2025-06-25 1:55 ` Baolu Lu 1 sibling, 1 reply; 32+ messages in thread From: Tian, Kevin @ 2025-06-24 8:12 UTC (permalink / raw) To: Baolu Lu, Xu Yilun, jgg@nvidia.com, jgg@ziepe.ca, will@kernel.org, aneesh.kumar@kernel.org Cc: iommu@lists.linux.dev, linux-kernel@vger.kernel.org, joro@8bytes.org, robin.murphy@arm.com, shuah@kernel.org, nicolinc@nvidia.com, aik@amd.com, Williams, Dan J, Xu, Yilun > From: Baolu Lu <baolu.lu@linux.intel.com> > Sent: Tuesday, June 24, 2025 11:32 AM > > On 6/23/25 17:49, Xu Yilun wrote: > > Destroy iommufd_vdevice(vdev) on iommufd_idevice(idev) destroy so that > > vdev can't outlive idev. > > > > iommufd_device(idev) represents the physical device bound to iommufd, > > while the iommufd_vdevice(vdev) represents the virtual instance of the > > physical device in the VM. The lifecycle of the vdev should not be > > longer than idev. This doesn't cause real problem on existing use cases > > cause vdev doesn't impact the physical device, only provides > > virtualization information. But to extend vdev for Confidential > > Computing(CC), there are needs to do secure configuration for the vdev, > > e.g. TSM Bind/Unbind. These configurations should be rolled back on idev > > destroy, or the external driver(VFIO) functionality may be impact. > > > > Building the association between idev & vdev requires the two objects > > pointing each other, but not referencing each other. > > Does this mean each idevice can have at most a single vdevice? Is it > possible that different PASIDs of a physical device are assigned to > userspace for different purposes, such that there is a need for multiple > vdevices per idevice? > PASID is a resource of physical device. If it's reported to a VM then it becomes the resource of virtual device. 1:1 association makes sense here. 
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v2 3/4] iommufd: Destroy vdevice on idevice destroy 2025-06-24 8:12 ` Tian, Kevin @ 2025-06-25 1:55 ` Baolu Lu 0 siblings, 0 replies; 32+ messages in thread From: Baolu Lu @ 2025-06-25 1:55 UTC (permalink / raw) To: Tian, Kevin, Xu Yilun, jgg@nvidia.com, jgg@ziepe.ca, will@kernel.org, aneesh.kumar@kernel.org Cc: iommu@lists.linux.dev, linux-kernel@vger.kernel.org, joro@8bytes.org, robin.murphy@arm.com, shuah@kernel.org, nicolinc@nvidia.com, aik@amd.com, Williams, Dan J, Xu, Yilun On 6/24/25 16:12, Tian, Kevin wrote: >> From: Baolu Lu <baolu.lu@linux.intel.com> >> Sent: Tuesday, June 24, 2025 11:32 AM >> >> On 6/23/25 17:49, Xu Yilun wrote: >>> Destroy iommufd_vdevice(vdev) on iommufd_idevice(idev) destroy so that >>> vdev can't outlive idev. >>> >>> iommufd_device(idev) represents the physical device bound to iommufd, >>> while the iommufd_vdevice(vdev) represents the virtual instance of the >>> physical device in the VM. The lifecycle of the vdev should not be >>> longer than idev. This doesn't cause real problem on existing use cases >>> cause vdev doesn't impact the physical device, only provides >>> virtualization information. But to extend vdev for Confidential >>> Computing(CC), there are needs to do secure configuration for the vdev, >>> e.g. TSM Bind/Unbind. These configurations should be rolled back on idev >>> destroy, or the external driver(VFIO) functionality may be impact. >>> >>> Building the association between idev & vdev requires the two objects >>> pointing each other, but not referencing each other. >> >> Does this mean each idevice can have at most a single vdevice? Is it >> possible that different PASIDs of a physical device are assigned to >> userspace for different purposes, such that there is a need for multiple >> vdevices per idevice? >> > > PASID is a resource of physical device. If it's reported to a VM then > it becomes the resource of virtual device. 1:1 association makes > sense here. Okay, make sense. 
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v2 3/4] iommufd: Destroy vdevice on idevice destroy 2025-06-23 9:49 ` [PATCH v2 3/4] iommufd: Destroy vdevice on idevice destroy Xu Yilun 2025-06-24 3:32 ` Baolu Lu @ 2025-06-24 6:47 ` Xu Yilun 2025-06-24 8:22 ` Tian, Kevin ` (2 subsequent siblings) 4 siblings, 0 replies; 32+ messages in thread From: Xu Yilun @ 2025-06-24 6:47 UTC (permalink / raw) To: jgg, jgg, kevin.tian, will, aneesh.kumar Cc: iommu, linux-kernel, joro, robin.murphy, shuah, nicolinc, aik, dan.j.williams, baolu.lu, yilun.xu > +void iommufd_vdevice_destroy(struct iommufd_object *obj) > +{ > + struct iommufd_vdevice *vdev = > + container_of(obj, struct iommufd_vdevice, obj); > + > + mutex_lock(&vdev->idev->igroup->lock); > + iommufd_vdevice_abort(obj); > + mutex_unlock(&vdev->idev->igroup->lock); > + wake_up_interruptible_all(&vdev->ictx->destroy_wait); Should change to wake_up_interruptible_all(&vdev->viommu->ictx->destroy_wait); since vdev->ictx will be deleted. Thanks, Yilun ^ permalink raw reply [flat|nested] 32+ messages in thread
* RE: [PATCH v2 3/4] iommufd: Destroy vdevice on idevice destroy 2025-06-23 9:49 ` [PATCH v2 3/4] iommufd: Destroy vdevice on idevice destroy Xu Yilun 2025-06-24 3:32 ` Baolu Lu 2025-06-24 6:47 ` Xu Yilun @ 2025-06-24 8:22 ` Tian, Kevin 2025-06-26 4:59 ` Xu Yilun 2025-06-24 14:53 ` Jason Gunthorpe 2025-06-25 6:40 ` Aneesh Kumar K.V 4 siblings, 1 reply; 32+ messages in thread From: Tian, Kevin @ 2025-06-24 8:22 UTC (permalink / raw) To: Xu Yilun, jgg@nvidia.com, jgg@ziepe.ca, will@kernel.org, aneesh.kumar@kernel.org Cc: iommu@lists.linux.dev, linux-kernel@vger.kernel.org, joro@8bytes.org, robin.murphy@arm.com, shuah@kernel.org, nicolinc@nvidia.com, aik@amd.com, Williams, Dan J, baolu.lu@linux.intel.com, Xu, Yilun > From: Xu Yilun <yilun.xu@linux.intel.com> > Sent: Monday, June 23, 2025 5:50 PM > > +static void iommufd_device_remove_vdev(struct iommufd_device *idev) > +{ > + bool vdev_removing = false; > + > + mutex_lock(&idev->igroup->lock); > + if (idev->vdev) { > + struct iommufd_vdevice *vdev; > + > + vdev = iommufd_get_vdevice(idev->ictx, idev->vdev->obj.id); > + if (IS_ERR(vdev)) { > + /* vdev is removed from xarray, but is not > destroyed/freed */ > + vdev_removing = true; > + goto unlock; > + } > + > + /* Should never happen */ > + if (WARN_ON(vdev != idev->vdev)) { > + idev->vdev = NULL; > + iommufd_put_object(idev->ictx, &vdev->obj); > + goto unlock; > + } > + > + /* > + * vdev cannot be destroyed after refcount_inc, safe to > release "vdev cannot be destroyed by userspace" > + * idev->igroup->lock and use idev->vdev afterward. 
> + */ > + refcount_inc(&idev->vdev->obj.users); > + iommufd_put_object(idev->ictx, &idev->vdev->obj); s/idev->vdev/vdev/ > @@ -124,18 +139,28 @@ int iommufd_vdevice_alloc_ioctl(struct > iommufd_ucmd *ucmd) > goto out_put_idev; > } > > + mutex_lock(&idev->igroup->lock); > + if (idev->vdev) { > + rc = -EEXIST; > + goto out_unlock_igroup; > + } > + > vdev = iommufd_object_alloc(ucmd->ictx, vdev, > IOMMUFD_OBJ_VDEVICE); > if (IS_ERR(vdev)) { > rc = PTR_ERR(vdev); > - goto out_put_idev; > + goto out_unlock_igroup; > } > > + /* vdev can't outlive idev, vdev->idev is always valid, need no refcnt > */ > + vdev->idev = idev; > vdev->ictx = ucmd->ictx; > vdev->id = virt_id; > vdev->dev = idev->dev; > get_device(idev->dev); this is not necessary now, as idevice already holds a reference to device and now vdevice cannot outlive idevice. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v2 3/4] iommufd: Destroy vdevice on idevice destroy 2025-06-24 8:22 ` Tian, Kevin @ 2025-06-26 4:59 ` Xu Yilun 0 siblings, 0 replies; 32+ messages in thread From: Xu Yilun @ 2025-06-26 4:59 UTC (permalink / raw) To: Tian, Kevin Cc: jgg@nvidia.com, jgg@ziepe.ca, will@kernel.org, aneesh.kumar@kernel.org, iommu@lists.linux.dev, linux-kernel@vger.kernel.org, joro@8bytes.org, robin.murphy@arm.com, shuah@kernel.org, nicolinc@nvidia.com, aik@amd.com, Williams, Dan J, baolu.lu@linux.intel.com, Xu, Yilun On Tue, Jun 24, 2025 at 08:22:11AM +0000, Tian, Kevin wrote: > > From: Xu Yilun <yilun.xu@linux.intel.com> > > Sent: Monday, June 23, 2025 5:50 PM > > > > +static void iommufd_device_remove_vdev(struct iommufd_device *idev) > > +{ > > + bool vdev_removing = false; > > + > > + mutex_lock(&idev->igroup->lock); > > + if (idev->vdev) { > > + struct iommufd_vdevice *vdev; > > + > > + vdev = iommufd_get_vdevice(idev->ictx, idev->vdev->obj.id); > > + if (IS_ERR(vdev)) { > > + /* vdev is removed from xarray, but is not > > destroyed/freed */ > > + vdev_removing = true; > > + goto unlock; > > + } > > + > > + /* Should never happen */ > > + if (WARN_ON(vdev != idev->vdev)) { > > + idev->vdev = NULL; > > + iommufd_put_object(idev->ictx, &vdev->obj); > > + goto unlock; > > + } > > + > > + /* > > + * vdev cannot be destroyed after refcount_inc, safe to > > release > > "vdev cannot be destroyed by userspace" > > > + * idev->igroup->lock and use idev->vdev afterward. > > + */ > > + refcount_inc(&idev->vdev->obj.users); > > + iommufd_put_object(idev->ictx, &idev->vdev->obj); > > s/idev->vdev/vdev/ Will adopt these fixes. 
> > > @@ -124,18 +139,28 @@ int iommufd_vdevice_alloc_ioctl(struct > > iommufd_ucmd *ucmd) > > goto out_put_idev; > > } > > > > + mutex_lock(&idev->igroup->lock); > > + if (idev->vdev) { > > + rc = -EEXIST; > > + goto out_unlock_igroup; > > + } > > + > > vdev = iommufd_object_alloc(ucmd->ictx, vdev, > > IOMMUFD_OBJ_VDEVICE); > > if (IS_ERR(vdev)) { > > rc = PTR_ERR(vdev); > > - goto out_put_idev; > > + goto out_unlock_igroup; > > } > > > > + /* vdev can't outlive idev, vdev->idev is always valid, need no refcnt > > */ > > + vdev->idev = idev; > > vdev->ictx = ucmd->ictx; > > vdev->id = virt_id; > > vdev->dev = idev->dev; > > get_device(idev->dev); > > this is not necessary now, as idevice already holds a reference to device > and now vdevice cannot outlive idevice. I agree. Besides, Jason suggests use vdev->dev for iommufd_vdevice_abort() safe reentrancy. I think we could change to use vdev->viommu. @@ -93,10 +93,17 @@ void iommufd_vdevice_abort(struct iommufd_object *obj) lockdep_assert_held(&idev->igroup->lock); + /* + * iommufd_vdevice_abort() could be reentrant, by + * iommufd_device_unbind() or by iommufd_destroy(). Cleanup only once. + */ + if (!viommu) + return; + /* xa_cmpxchg is okay to fail if alloc failed xa_cmpxchg previously */ xa_cmpxchg(&viommu->vdevs, vdev->id, vdev, NULL, GFP_KERNEL); refcount_dec(&viommu->obj.users); - put_device(vdev->dev); + vdev->viommu = NULL; idev->vdev = NULL; Thanks, Yilun ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v2 3/4] iommufd: Destroy vdevice on idevice destroy 2025-06-23 9:49 ` [PATCH v2 3/4] iommufd: Destroy vdevice on idevice destroy Xu Yilun ` (2 preceding siblings ...) 2025-06-24 8:22 ` Tian, Kevin @ 2025-06-24 14:53 ` Jason Gunthorpe 2025-06-24 23:57 ` Tian, Kevin 2025-06-25 10:06 ` Xu Yilun 2025-06-25 6:40 ` Aneesh Kumar K.V 4 siblings, 2 replies; 32+ messages in thread From: Jason Gunthorpe @ 2025-06-24 14:53 UTC (permalink / raw) To: Xu Yilun Cc: kevin.tian, will, aneesh.kumar, iommu, linux-kernel, joro, robin.murphy, shuah, nicolinc, aik, dan.j.williams, baolu.lu, yilun.xu On Mon, Jun 23, 2025 at 05:49:45PM +0800, Xu Yilun wrote: > +static void iommufd_device_remove_vdev(struct iommufd_device *idev) > +{ > + bool vdev_removing = false; > + > + mutex_lock(&idev->igroup->lock); > + if (idev->vdev) { > + struct iommufd_vdevice *vdev; > + > + vdev = iommufd_get_vdevice(idev->ictx, idev->vdev->obj.id); > + if (IS_ERR(vdev)) { This incrs obj.users which will cause a concurrent iommufd_object_remove() to fail with -EBUSY, which we are trying to avoid. Also you can hit a race where the tombstone has NULL'd the entry but the racing destroy will then load the NULL with xas_load() and hit this: if (WARN_ON(obj != to_destroy)) { So, this doesn't look like it will work right to me.. You want somewhat different destroy logic: /* * The caller must directly obtain a shortterm_users reference without a users * reference using its own locking to protect the pointer. This function always * puts back the shortterm_users reference. 
*/ int iommufd_object_remove_tombstone(struct iommufd_ctx *ictx, struct iommufd_object *to_destroy) { XA_STATE(xas, &ictx->objects, to_destroy->id); struct iommufd_object *obj; int ret; xa_lock(&ictx->objects); obj = xas_load(&xas); if (xa_is_zero(obj) || obj == NULL) { /* * Another thread is racing to destroy this, since we have the * shortterm_users refcount the other thread has xa_unlocked() * but not passed iommufd_object_dec_wait_shortterm(). */ if (refcount_dec_and_test(&to_destroy->shortterm_users)) wake_up_interruptible_all(&ictx->destroy_wait); ret = 0; goto err_xa; } else if (WARN_ON(obj != to_destroy)) { refcount_dec(&obj->shortterm_users); ret = -ENOENT; goto err_xa; } /* * The object is still in the xarray, so this thread will try to destroy * it. Put back the callers shortterm_users. */ refcount_dec(&obj->shortterm_users); if (!refcount_dec_if_one(&obj->users)) { ret = -EBUSY; goto err_xa; } /* Leave behind a tombstone to prevent re-use of this entry */ xas_store(&xas, XA_ZERO_ENTRY); xa_unlock(&ictx->objects); /* * Since users is zero any positive users_shortterm must be racing * iommufd_put_object(), or we have a bug. 
*/ ret = iommufd_object_dec_wait_shortterm(ictx, obj); if (WARN_ON(ret)) return ret; iommufd_object_ops[obj->type].destroy(obj); kfree(obj); return 0; err_xa: xa_unlock(&ictx->objects); /* The returned object reference count is zero */ return ret; } Then you'd call it by doing something like: static void iommufd_device_remove_vdev(struct iommufd_device *idev) { struct iommufd_object *to_destroy = NULL; int ret; mutex_lock(&idev->igroup->lock); if (!idev->vdev) { mutex_unlock(&idev->igroup->lock); return; } if (refcount_inc_not_zero(&idev->vdev->obj.shortterm_users)) to_destroy = &idev->vdev->obj; mutex_unlock(&idev->igroup->lock); if (to_destroy) { ret = iommufd_object_remove_tombstone(idev->ictx, to_destroy); if (WARN_ON(ret)) return; } /* * We don't know what thread is actually going to destroy the vdev, but * once the vdev is destroyed the pointer is NULL'd. At this * point idev->users is 0 so no other thread can set a new vdev. */ if (!wait_event_timeout(idev->ictx->destroy_wait, !READ_ONCE(idev->vdev), msecs_to_jiffies(60000))) pr_crit("Time out waiting for iommufd vdevice removed\n"); } Though there is a cleaner option here, you could do: mutex_lock(&idev->igroup->lock); if (idev->vdev) iommufd_vdevice_abort(&idev->vdev->obj); mutex_unlock(&idev->igroup->lock); And make it safe to call abort twice, eg by setting dev to NULL and checking for that. First thread to get to the igroup lock, either via iommufd_vdevice_destroy() or via the above will do the actual abort synchronously without any wait_event_timeout. That seems better?? > + /* vdev can't outlive idev, vdev->idev is always valid, need no refcnt */ > + vdev->idev = idev; So this means as soon as 'idev->vdev = NULL;' happens idev is an invalid pointer. Need a WRITE_ONCE there. I would rephrase the comment as iommufd_device_destroy() waits until idev->vdev is NULL before freeing the idev, which only happens once the vdev has finished destruction.
Thus we do not need refcounting on either idev->vdev or vdev->idev. and group both assignments together. > vdev->ictx = ucmd->ictx; > vdev->id = virt_id; > vdev->dev = idev->dev; > get_device(idev->dev); > vdev->viommu = viommu; > refcount_inc(&viommu->obj.users); > + /* idev->vdev is protected by idev->igroup->lock, need no refcnt */ > + idev->vdev = vdev; This can be WRITE_ONCE too Jason ^ permalink raw reply [flat|nested] 32+ messages in thread
* RE: [PATCH v2 3/4] iommufd: Destroy vdevice on idevice destroy 2025-06-24 14:53 ` Jason Gunthorpe @ 2025-06-24 23:57 ` Tian, Kevin 2025-06-25 1:36 ` Jason Gunthorpe 2025-06-25 10:06 ` Xu Yilun 1 sibling, 1 reply; 32+ messages in thread From: Tian, Kevin @ 2025-06-24 23:57 UTC (permalink / raw) To: Jason Gunthorpe, Xu Yilun Cc: will@kernel.org, aneesh.kumar@kernel.org, iommu@lists.linux.dev, linux-kernel@vger.kernel.org, joro@8bytes.org, robin.murphy@arm.com, shuah@kernel.org, nicolinc@nvidia.com, aik@amd.com, Williams, Dan J, baolu.lu@linux.intel.com, Xu, Yilun > From: Jason Gunthorpe <jgg@nvidia.com> > Sent: Tuesday, June 24, 2025 10:54 PM > > On Mon, Jun 23, 2025 at 05:49:45PM +0800, Xu Yilun wrote: > > +static void iommufd_device_remove_vdev(struct iommufd_device *idev) > > +{ > > + bool vdev_removing = false; > > + > > + mutex_lock(&idev->igroup->lock); > > + if (idev->vdev) { > > + struct iommufd_vdevice *vdev; > > + > > + vdev = iommufd_get_vdevice(idev->ictx, idev->vdev->obj.id); > > + if (IS_ERR(vdev)) { > > This incrs obj.users which will cause a concurrent > iommufd_object_remove() to fail with -EBUSY, which we are trying to > avoid. concurrent remove means a user-initiated IOMMU_DESTROY, for which failing with -EBUSY is expected as it doesn't wait for shortterm? > > Also you can hit a race where the tombstone has NULL'd the entry but > the racing destroy will then load the NULL with xas_load() and hit this: > > if (WARN_ON(obj != to_destroy)) { IOMMU_DESTROY doesn't provide to_destroy. or are you concerned about the racing between two kernel-initiated destroy paths? at least for vdev story this doesn't sound to be the case... ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v2 3/4] iommufd: Destroy vdevice on idevice destroy 2025-06-24 23:57 ` Tian, Kevin @ 2025-06-25 1:36 ` Jason Gunthorpe 2025-06-25 2:11 ` Tian, Kevin 0 siblings, 1 reply; 32+ messages in thread From: Jason Gunthorpe @ 2025-06-25 1:36 UTC (permalink / raw) To: Tian, Kevin Cc: Xu Yilun, will@kernel.org, aneesh.kumar@kernel.org, iommu@lists.linux.dev, linux-kernel@vger.kernel.org, joro@8bytes.org, robin.murphy@arm.com, shuah@kernel.org, nicolinc@nvidia.com, aik@amd.com, Williams, Dan J, baolu.lu@linux.intel.com, Xu, Yilun On Tue, Jun 24, 2025 at 11:57:31PM +0000, Tian, Kevin wrote: > > From: Jason Gunthorpe <jgg@nvidia.com> > > Sent: Tuesday, June 24, 2025 10:54 PM > > > > On Mon, Jun 23, 2025 at 05:49:45PM +0800, Xu Yilun wrote: > > > +static void iommufd_device_remove_vdev(struct iommufd_device *idev) > > > +{ > > > + bool vdev_removing = false; > > > + > > > + mutex_lock(&idev->igroup->lock); > > > + if (idev->vdev) { > > > + struct iommufd_vdevice *vdev; > > > + > > > + vdev = iommufd_get_vdevice(idev->ictx, idev->vdev->obj.id); > > > + if (IS_ERR(vdev)) { > > > > This incrs obj.users which will cause a concurrent > > iommufd_object_remove() to fail with -EBUSY, which we are trying to > > avoid. > > concurrent remove means a user-initiated IOMMU_DESTROY, for which > failing with -EBUSY is expected as it doesn't wait for shortterm? Yes a user IOMMU_DESTROY of the vdevice should not have a transient EBUSY failure. Avoiding that is the purpose of the shortterm_users mechanism. > > Also you can hit a race where the tombstone has NULL'd the entry but > > the racing destroy will then load the NULL with xas_load() and hit this: > > > > if (WARN_ON(obj != to_destroy)) { > > IOMMU_DESTROY doesn't provide to_destroy. Right, but IOMMU_DESTROY thread could have already gone past the xa_store(NULL) and then the kernel destroy thread could reach the above WARN as it does use to_destroy. Jason ^ permalink raw reply [flat|nested] 32+ messages in thread
* RE: [PATCH v2 3/4] iommufd: Destroy vdevice on idevice destroy 2025-06-25 1:36 ` Jason Gunthorpe @ 2025-06-25 2:11 ` Tian, Kevin 2025-06-25 12:33 ` Jason Gunthorpe 0 siblings, 1 reply; 32+ messages in thread From: Tian, Kevin @ 2025-06-25 2:11 UTC (permalink / raw) To: Jason Gunthorpe Cc: Xu Yilun, will@kernel.org, aneesh.kumar@kernel.org, iommu@lists.linux.dev, linux-kernel@vger.kernel.org, joro@8bytes.org, robin.murphy@arm.com, shuah@kernel.org, nicolinc@nvidia.com, aik@amd.com, Williams, Dan J, baolu.lu@linux.intel.com, Xu, Yilun > From: Jason Gunthorpe <jgg@nvidia.com> > Sent: Wednesday, June 25, 2025 9:36 AM > > On Tue, Jun 24, 2025 at 11:57:31PM +0000, Tian, Kevin wrote: > > > From: Jason Gunthorpe <jgg@nvidia.com> > > > Sent: Tuesday, June 24, 2025 10:54 PM > > > > > > On Mon, Jun 23, 2025 at 05:49:45PM +0800, Xu Yilun wrote: > > > > +static void iommufd_device_remove_vdev(struct iommufd_device > *idev) > > > > +{ > > > > + bool vdev_removing = false; > > > > + > > > > + mutex_lock(&idev->igroup->lock); > > > > + if (idev->vdev) { > > > > + struct iommufd_vdevice *vdev; > > > > + > > > > + vdev = iommufd_get_vdevice(idev->ictx, idev->vdev->obj.id); > > > > + if (IS_ERR(vdev)) { > > > > > > This incrs obj.users which will cause a concurrent > > > iommufd_object_remove() to fail with -EBUSY, which we are trying to > > > avoid. > > > > concurrent remove means a user-initiated IOMMU_DESTROY, for which > > failing with -EBUSY is expected as it doesn't wait for shortterm? > > Yes a user IOMMU_DESTROY of the vdevice should not have a transient > EBUSY failure. Avoiding that is the purpose of the shorterm_users > mechanism. hmm my understanding is the opposite. 
currently iommufd_destroy() doesn't set REMOVE_WAIT_SHORTTERM: static int iommufd_destroy(struct iommufd_ucmd *ucmd) { struct iommu_destroy *cmd = ucmd->cmd; return iommufd_object_remove(ucmd->ictx, NULL, cmd->id, 0); } so it's natural for IOMMU_DESTROY to hit transient EBUSY when a parallel ioctl is being executed on the destroyed object: if (!refcount_dec_if_one(&obj->users)) { ret = -EBUSY; goto err_xa; } idevice unbind is just a similar (but indirect) transient race to IOMMU_DESTROY. Waiting for shortterm_users is more for kernel destroy. > > > > Also you can hit a race where the tombstone has NULL'd the entry but > > > the racing destroy will then load the NULL with xas_load() and hit this: > > > > > > if (WARN_ON(obj != to_destroy)) { > > > > IOMMU_DESTROY doesn't provide to_destroy. > > Right, but IOMMU_DESTROY thread could have already gone past the > xa_store(NULL) and then the kernel destroy thread could reach the > above WARN as it does use to_destroy. > If IOMMU_DESTROY has already gone past xa_store(NULL), there are two scenarios: 1) vdevice has been completely destroyed with idev->vdev=NULL. In such case iommufd_device_remove_vdev() is a nop. 2) vdevice destroy has not been completed with idev->vdev still valid In such case iommufd_get_vdevice() fails with vdev_removing set. Then iommufd_device_remove_vdev() will wait on idev->vdev to be NULL instead of calling iommufd_object_tombstone_user(). So the said race won't happen. 😊 ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v2 3/4] iommufd: Destroy vdevice on idevice destroy 2025-06-25 2:11 ` Tian, Kevin @ 2025-06-25 12:33 ` Jason Gunthorpe 0 siblings, 0 replies; 32+ messages in thread From: Jason Gunthorpe @ 2025-06-25 12:33 UTC (permalink / raw) To: Tian, Kevin Cc: Xu Yilun, will@kernel.org, aneesh.kumar@kernel.org, iommu@lists.linux.dev, linux-kernel@vger.kernel.org, joro@8bytes.org, robin.murphy@arm.com, shuah@kernel.org, nicolinc@nvidia.com, aik@amd.com, Williams, Dan J, baolu.lu@linux.intel.com, Xu, Yilun On Wed, Jun 25, 2025 at 02:11:40AM +0000, Tian, Kevin wrote: > > From: Jason Gunthorpe <jgg@nvidia.com> > > Sent: Wednesday, June 25, 2025 9:36 AM > > > > On Tue, Jun 24, 2025 at 11:57:31PM +0000, Tian, Kevin wrote: > > > > From: Jason Gunthorpe <jgg@nvidia.com> > > > > Sent: Tuesday, June 24, 2025 10:54 PM > > > > > > > > On Mon, Jun 23, 2025 at 05:49:45PM +0800, Xu Yilun wrote: > > > > > +static void iommufd_device_remove_vdev(struct iommufd_device > > *idev) > > > > > +{ > > > > > + bool vdev_removing = false; > > > > > + > > > > > + mutex_lock(&idev->igroup->lock); > > > > > + if (idev->vdev) { > > > > > + struct iommufd_vdevice *vdev; > > > > > + > > > > > + vdev = iommufd_get_vdevice(idev->ictx, idev->vdev->obj.id); > > > > > + if (IS_ERR(vdev)) { > > > > > > > > This incrs obj.users which will cause a concurrent > > > > iommufd_object_remove() to fail with -EBUSY, which we are trying to > > > > avoid. > > > > > > concurrent remove means a user-initiated IOMMU_DESTROY, for which > > > failing with -EBUSY is expected as it doesn't wait for shortterm? > > > > Yes a user IOMMU_DESTROY of the vdevice should not have a transient > > EBUSY failure. Avoiding that is the purpose of the shorterm_users > > mechanism. > > hmm my understanding is the opposite. > > currently iommufd_destroy() doesn't set REMOVE_WAIT_SHORTTERM: Oh yes I got them mixed up. > waiting shorterm_users is more for kernel destroy. 
Yes, it is to make kernel destroy not fail > Then iommufd_device_remove_vdev() will wait on idev->vdev to > be NULL instead of calling iommufd_object_tombstone_user(). Ah, because the original version incrs users, not just shortterm_users. Jason ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v2 3/4] iommufd: Destroy vdevice on idevice destroy 2025-06-24 14:53 ` Jason Gunthorpe 2025-06-24 23:57 ` Tian, Kevin @ 2025-06-25 10:06 ` Xu Yilun 2025-06-25 12:38 ` Jason Gunthorpe 1 sibling, 1 reply; 32+ messages in thread From: Xu Yilun @ 2025-06-25 10:06 UTC (permalink / raw) To: Jason Gunthorpe Cc: kevin.tian, will, aneesh.kumar, iommu, linux-kernel, joro, robin.murphy, shuah, nicolinc, aik, dan.j.williams, baolu.lu, yilun.xu On Tue, Jun 24, 2025 at 11:53:46AM -0300, Jason Gunthorpe wrote: > On Mon, Jun 23, 2025 at 05:49:45PM +0800, Xu Yilun wrote: > > +static void iommufd_device_remove_vdev(struct iommufd_device *idev) > > +{ > > + bool vdev_removing = false; > > + > > + mutex_lock(&idev->igroup->lock); > > + if (idev->vdev) { > > + struct iommufd_vdevice *vdev; > > + > > + vdev = iommufd_get_vdevice(idev->ictx, idev->vdev->obj.id); > > + if (IS_ERR(vdev)) { > > This incrs obj.users which will cause a concurrent > iommufd_object_remove() to fail with -EBUSY, which we are trying to > avoid. I have the same question as Kevin, leave this to that thread. [...] > /* > * We don't know what thread is actually going to destroy the vdev, but > * once the vdev is destroyed the pointer is NULL'd. At this > * point idev->users is 0 so no other thread can set a new vdev. > */ > if (!wait_event_timeout(idev->ictx->destroy_wait, > !READ_ONCE(idev->vdev), > msecs_to_jiffies(60000))) > pr_crit("Time out waiting for iommufd vdevice removed\n"); > } > > Though there is a cleaner option here, you could do: > > mutex_lock(&idev->igroup->lock); > if (idev->vdev) > iommufd_vdevice_abort(&idev->vdev->obj); > mutex_unlock(&idev->igroup->lock); > > And make it safe to call abort twice, eg by setting dev to NULL and > checking for that. First thread to get to the igroup lock, either via > iommufd_vdevice_destroy() or via the above will do the actual abort > synchronously without any wait_event_timeout. That seems better?? 
I'm fine with both options, but I'd slightly prefer not to make vdevice more special than other objects, so I still prefer the wait_event option. > > > + /* vdev can't outlive idev, vdev->idev is always valid, need no refcnt */ > > + vdev->idev = idev; > > So this means a soon as 'idev->vdev = NULL;' happens idev is an invalid pointer. Need a WRITE_ONCE there. > > I would rephrase the comment as > iommufd_device_destroy() waits until idev->vdev is NULL before > freeing the idev, which only happens once the vdev is finished > destruction. Thus we do not need refcounting on either idev->vdev or > vdev->idev. > > and group both assignments together. Good to me. Thanks, Yilun ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v2 3/4] iommufd: Destroy vdevice on idevice destroy 2025-06-25 10:06 ` Xu Yilun @ 2025-06-25 12:38 ` Jason Gunthorpe 2025-06-26 3:31 ` Xu Yilun 0 siblings, 1 reply; 32+ messages in thread From: Jason Gunthorpe @ 2025-06-25 12:38 UTC (permalink / raw) To: Xu Yilun Cc: kevin.tian, will, aneesh.kumar, iommu, linux-kernel, joro, robin.murphy, shuah, nicolinc, aik, dan.j.williams, baolu.lu, yilun.xu On Wed, Jun 25, 2025 at 06:06:00PM +0800, Xu Yilun wrote: > > /* > > * We don't know what thread is actually going to destroy the vdev, but > > * once the vdev is destroyed the pointer is NULL'd. At this > > * point idev->users is 0 so no other thread can set a new vdev. > > */ > > if (!wait_event_timeout(idev->ictx->destroy_wait, > > !READ_ONCE(idev->vdev), > > msecs_to_jiffies(60000))) > > pr_crit("Time out waiting for iommufd vdevice removed\n"); > > } > > > > Though there is a cleaner option here, you could do: > > > > mutex_lock(&idev->igroup->lock); > > if (idev->vdev) > > iommufd_vdevice_abort(&idev->vdev->obj); > > mutex_unlock(&idev->igroup->lock); > > > > And make it safe to call abort twice, eg by setting dev to NULL and > > checking for that. First thread to get to the igroup lock, either via > > iommufd_vdevice_destroy() or via the above will do the actual abort > > synchronously without any wait_event_timeout. That seems better?? > > I'm good to both options, but slightly tend not to make vdevice so > special from other objects, so still prefer the wait_event option. The wait_event is a ugly hack though, even in its existing code. The above version is better because it doesn't have any failure mode and doesn't introduce any unlocked use of the idev->vdev which is easier to reason about, no READ_ONCE/WRITE_ONCE/etc It sounds like you should largely leave the existing other parts the same as this v2, though can you try reorganize it to look a little more like the version I shared? Jason ^ permalink raw reply [flat|nested] 32+ messages in thread
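Jason's suggestion above — make the abort safe to call twice — amounts to an idempotent teardown: whichever path runs first under the lock does the real work and NULLs the back-pointer, so the later caller sees NULL and returns without touching freed state. A minimal single-threaded sketch of that pattern (illustrative names, not the actual iommufd code; the real version does this under idev->igroup->lock):

```c
#include <assert.h>
#include <stddef.h>

struct vdev { int id; };

struct idev {
	struct vdev *vdev;   /* back-pointer, NULL once torn down */
};

static int teardown_count;   /* counts real teardowns, for testing */

/* Idempotent abort: only the first caller does the teardown.  In the
 * kernel this runs under the igroup lock, so the check-and-NULL pair
 * is atomic with respect to the racing destroy path. */
static void vdevice_abort(struct idev *idev)
{
	if (!idev->vdev)
		return;          /* already aborted by the other path */
	teardown_count++;        /* real work: xa_cmpxchg, refcount_dec, ... */
	idev->vdev = NULL;
}

/* Callable from both the idevice-destroy and the vdevice-destroy
 * paths; no wait_event_timeout needed because the loser of the race
 * simply becomes a no-op. */
static void device_remove_vdev(struct idev *idev)
{
	/* mutex_lock(&idev->igroup->lock) in the real code */
	vdevice_abort(idev);
	/* mutex_unlock(&idev->igroup->lock) */
}
```

Calling the remove path twice performs exactly one teardown, which is the property that lets both racing destroy paths share one abort function.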
* Re: [PATCH v2 3/4] iommufd: Destroy vdevice on idevice destroy 2025-06-25 12:38 ` Jason Gunthorpe @ 2025-06-26 3:31 ` Xu Yilun 2025-06-26 14:36 ` Jason Gunthorpe 0 siblings, 1 reply; 32+ messages in thread From: Xu Yilun @ 2025-06-26 3:31 UTC (permalink / raw) To: Jason Gunthorpe Cc: kevin.tian, will, aneesh.kumar, iommu, linux-kernel, joro, robin.murphy, shuah, nicolinc, aik, dan.j.williams, baolu.lu, yilun.xu On Wed, Jun 25, 2025 at 09:38:32AM -0300, Jason Gunthorpe wrote: > On Wed, Jun 25, 2025 at 06:06:00PM +0800, Xu Yilun wrote: > > > /* > > > * We don't know what thread is actually going to destroy the vdev, but > > > * once the vdev is destroyed the pointer is NULL'd. At this > > > * point idev->users is 0 so no other thread can set a new vdev. > > > */ > > > if (!wait_event_timeout(idev->ictx->destroy_wait, > > > !READ_ONCE(idev->vdev), > > > msecs_to_jiffies(60000))) > > > pr_crit("Time out waiting for iommufd vdevice removed\n"); > > > } > > > > > > Though there is a cleaner option here, you could do: > > > > > > mutex_lock(&idev->igroup->lock); > > > if (idev->vdev) > > > iommufd_vdevice_abort(&idev->vdev->obj); > > > mutex_unlock(&idev->igroup->lock); > > > > > > And make it safe to call abort twice, eg by setting dev to NULL and > > > checking for that. First thread to get to the igroup lock, either via > > > iommufd_vdevice_destroy() or via the above will do the actual abort > > > synchronously without any wait_event_timeout. That seems better?? > > > > I'm good to both options, but slightly tend not to make vdevice so > > special from other objects, so still prefer the wait_event option. > > The wait_event is a ugly hack though, even in its existing code. 
The > above version is better because it doesn't have any failure mode and > doesn't introduce any unlocked use of the idev->vdev which is easier > to reason about, no READ_ONCE/WRITE_ONCE/etc > > It sounds like you should largely leave the existing other parts the > same as this v2, though can you try reorganize it to look a little > more like the version I shared? Sure. But may I confirm that you only want the reentrant iommufd_vdevice_abort() but not your iommufd_object_remove_tombstone() changes? To me, grabbing shortterm_users but not users is a new operation model. I hesitate to add it when the existing refcount_inc(&obj->users) works for this case. Thanks, Yilun > > Jason > ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v2 3/4] iommufd: Destroy vdevice on idevice destroy 2025-06-26 3:31 ` Xu Yilun @ 2025-06-26 14:36 ` Jason Gunthorpe 0 siblings, 0 replies; 32+ messages in thread From: Jason Gunthorpe @ 2025-06-26 14:36 UTC (permalink / raw) To: Xu Yilun Cc: kevin.tian, will, aneesh.kumar, iommu, linux-kernel, joro, robin.murphy, shuah, nicolinc, aik, dan.j.williams, baolu.lu, yilun.xu On Thu, Jun 26, 2025 at 11:31:06AM +0800, Xu Yilun wrote: > > The wait_event is a ugly hack though, even in its existing code. The > > above version is better because it doesn't have any failure mode and > > doesn't introduce any unlocked use of the idev->vdev which is easier > > to reason about, no READ_ONCE/WRITE_ONCE/etc > > > > It sounds like you should largely leave the existing other parts the > > same as this v2, though can you try reorganize it to look a little > > more like the version I shared? > > Sure. But may I confirm that your only want reentrant > iommufd_vdevice_abort() but not your iommufd_object_remove_tombstone() > changes? I think take a look at how I organized the control flow in the patch I sent and try to use some of those ideas, it was a bit simpler > To me, grab a shortterm_users but not a user is a new operation model. I > hesitate to add it when the existing refcount_inc(&obj->user) works for > this case. Yes, I am convinced you should not do this. Just hold the users only and use the normal destroy with the XA_ZERO_ENTRY change Along with the locked abort idea. Jason ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v2 3/4] iommufd: Destroy vdevice on idevice destroy 2025-06-23 9:49 ` [PATCH v2 3/4] iommufd: Destroy vdevice on idevice destroy Xu Yilun ` (3 preceding siblings ...) 2025-06-24 14:53 ` Jason Gunthorpe @ 2025-06-25 6:40 ` Aneesh Kumar K.V 2025-06-25 9:38 ` Xu Yilun 4 siblings, 1 reply; 32+ messages in thread From: Aneesh Kumar K.V @ 2025-06-25 6:40 UTC (permalink / raw) To: Xu Yilun, jgg, jgg, kevin.tian, will Cc: iommu, linux-kernel, joro, robin.murphy, shuah, nicolinc, aik, dan.j.williams, baolu.lu, yilun.xu, yilun.xu Xu Yilun <yilun.xu@linux.intel.com> writes: > Destroy iommufd_vdevice(vdev) on iommufd_idevice(idev) destroy so that > vdev can't outlive idev. > > iommufd_device(idev) represents the physical device bound to iommufd, > while the iommufd_vdevice(vdev) represents the virtual instance of the > physical device in the VM. The lifecycle of the vdev should not be > longer than idev. This doesn't cause real problem on existing use cases > cause vdev doesn't impact the physical device, only provides > virtualization information. But to extend vdev for Confidential > Computing(CC), there are needs to do secure configuration for the vdev, > e.g. TSM Bind/Unbind. These configurations should be rolled back on idev > destroy, or the external driver(VFIO) functionality may be impact. > > Building the association between idev & vdev requires the two objects > pointing each other, but not referencing each other. This requires > proper locking. This is done by reviving some of Nicolin's patch [1]. > > There are 3 cases on idev destroy: > > 1. vdev is already destroyed by userspace. No extra handling needed. > 2. vdev is still alive. Use iommufd_object_tombstone_user() to > destroy vdev and tombstone the vdev ID. > 3. vdev is being destroyed by userspace. The vdev ID is already freed, > but vdev destroy handler is not complete. The idev destroy handler > should wait for vdev destroy completion. 
> > [1]: https://lore.kernel.org/all/53025c827c44d68edb6469bfd940a8e8bc6147a5.1729897278.git.nicolinc@nvidia.com/ > > Original-by: Nicolin Chen <nicolinc@nvidia.com> > Original-by: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org> > Signed-off-by: Xu Yilun <yilun.xu@linux.intel.com> This is the latest version I have. But as Jason suggested we can possibly switch to short term users to avoid parallel destroy returning EBUSY. I am using a mutex lock to block parallel vdevice destroy. diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c index 86244403b532..0bee1b3eadc6 100644 --- a/drivers/iommu/iommufd/device.c +++ b/drivers/iommu/iommufd/device.c @@ -221,6 +221,8 @@ struct iommufd_device *iommufd_device_bind(struct iommufd_ctx *ictx, refcount_inc(&idev->obj.users); /* igroup refcount moves into iommufd_device */ idev->igroup = igroup; + idev->vdev = NULL; + mutex_init(&idev->lock); /* * If the caller fails after this success it must call @@ -286,6 +288,23 @@ void iommufd_device_unbind(struct iommufd_device *idev) } EXPORT_SYMBOL_NS_GPL(iommufd_device_unbind, "IOMMUFD"); +void iommufd_device_tombstone_vdevice(struct iommufd_device *idev) +{ + guard(mutex)(&idev->lock); + + if (idev->vdev) { + struct iommufd_vdevice *vdev = idev->vdev; + /* + * We want to wait for shortterm users, so we need + * to pass the obj which requires an elevated refcount. 
+ */ + refcount_inc(&vdev->obj.users); + iommufd_object_remove(idev->ictx, &vdev->obj, vdev->obj.id, + REMOVE_OBJ_TOMBSTONE | REMOVE_WAIT_SHORTTERM); + } +} +EXPORT_SYMBOL_NS_GPL(iommufd_device_tombstone_vdevice, "IOMMUFD"); + struct iommufd_ctx *iommufd_device_to_ictx(struct iommufd_device *idev) { return idev->ictx; diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h index 9ccc83341f32..960f79c31145 100644 --- a/drivers/iommu/iommufd/iommufd_private.h +++ b/drivers/iommu/iommufd/iommufd_private.h @@ -187,8 +187,10 @@ void iommufd_object_finalize(struct iommufd_ctx *ictx, struct iommufd_object *obj); enum { - REMOVE_WAIT_SHORTTERM = 1, + REMOVE_WAIT_SHORTTERM = BIT(0), + REMOVE_OBJ_TOMBSTONE = BIT(1), }; + int iommufd_object_remove(struct iommufd_ctx *ictx, struct iommufd_object *to_destroy, u32 id, unsigned int flags); @@ -199,7 +201,7 @@ int iommufd_object_remove(struct iommufd_ctx *ictx, * will be holding one. */ static inline void iommufd_object_destroy_user(struct iommufd_ctx *ictx, - struct iommufd_object *obj) + struct iommufd_object *obj) { int ret; @@ -425,6 +427,10 @@ struct iommufd_device { /* always the physical device */ struct device *dev; bool enforce_cache_coherency; + /* to protect the following members*/ + struct mutex lock; + /* if there is a vdevice mapping the idev */ + struct iommufd_vdevice *vdev; }; static inline struct iommufd_device * @@ -606,6 +612,7 @@ struct iommufd_vdevice { struct iommufd_ctx *ictx; struct iommufd_viommu *viommu; struct device *dev; + struct iommufd_device *idev; u64 id; /* per-vIOMMU virtual ID */ }; diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c index 3df468f64e7d..fd82140e6320 100644 --- a/drivers/iommu/iommufd/main.c +++ b/drivers/iommu/iommufd/main.c @@ -81,14 +81,16 @@ void iommufd_object_abort_and_destroy(struct iommufd_ctx *ictx, struct iommufd_object *iommufd_get_object(struct iommufd_ctx *ictx, u32 id, enum iommufd_object_type type) { + 
XA_STATE(xas, &ictx->objects, id); struct iommufd_object *obj; if (iommufd_should_fail()) return ERR_PTR(-ENOENT); xa_lock(&ictx->objects); - obj = xa_load(&ictx->objects, id); - if (!obj || (type != IOMMUFD_OBJ_ANY && obj->type != type) || + obj = xas_load(&xas); + if (!obj || xa_is_zero(obj) || + (type != IOMMUFD_OBJ_ANY && obj->type != type) || !iommufd_lock_obj(obj)) obj = ERR_PTR(-ENOENT); xa_unlock(&ictx->objects); @@ -167,7 +169,7 @@ int iommufd_object_remove(struct iommufd_ctx *ictx, goto err_xa; } - xas_store(&xas, NULL); + xas_store(&xas, (flags & REMOVE_OBJ_TOMBSTONE) ? XA_ZERO_ENTRY : NULL); if (ictx->vfio_ioas == container_of(obj, struct iommufd_ioas, obj)) ictx->vfio_ioas = NULL; xa_unlock(&ictx->objects); @@ -199,9 +201,30 @@ int iommufd_object_remove(struct iommufd_ctx *ictx, static int iommufd_destroy(struct iommufd_ucmd *ucmd) { + int ret; struct iommu_destroy *cmd = ucmd->cmd; + struct iommufd_object *obj; + struct iommufd_device *idev = NULL; + + obj = iommufd_get_object(ucmd->ictx, cmd->id, IOMMUFD_OBJ_ANY); + /* Destroying vdevice requires an idevice lock */ + if (!IS_ERR(obj) && obj->type == IOMMUFD_OBJ_VDEVICE) { + struct iommufd_vdevice *vdev = + container_of(obj, struct iommufd_vdevice, obj); + /* + * since vdev have an refcount on idev, this is safe. + */ + idev = vdev->idev; + mutex_lock(&idev->lock); + /* drop the additonal reference taken above */ + iommufd_put_object(ucmd->ictx, obj); + } + + ret = iommufd_object_remove(ucmd->ictx, NULL, cmd->id, 0); + if (idev) + mutex_unlock(&idev->lock); - return iommufd_object_remove(ucmd->ictx, NULL, cmd->id, 0); + return ret; } static int iommufd_fops_open(struct inode *inode, struct file *filp) @@ -233,6 +256,16 @@ static int iommufd_fops_open(struct inode *inode, struct file *filp) return 0; } +/* + * We don't need additional locks because the iommufd_fops_release() function is + * only triggered when the iommufd descriptor is released. 
At that point, no + * other iommufd-based ioctl operations can be running concurrently. + * + * The destruction of the vdevice via idevice unbind is also safe: + * iommufd_fops_release() can only be called after the idevice has been unbound. + * The idevice bind operation takes a reference to the iommufd descriptor, + * preventing early release. + */ static int iommufd_fops_release(struct inode *inode, struct file *filp) { struct iommufd_ctx *ictx = filp->private_data; @@ -251,16 +284,22 @@ static int iommufd_fops_release(struct inode *inode, struct file *filp) */ while (!xa_empty(&ictx->objects)) { unsigned int destroyed = 0; - unsigned long index; + XA_STATE(xas, &ictx->objects, 0); + + xas_for_each(&xas, obj, ULONG_MAX) { + + if (!xa_is_zero(obj)) { + if (!refcount_dec_if_one(&obj->users)) + continue; + + iommufd_object_ops[obj->type].destroy(obj); + kfree(obj); + } - xa_for_each(&ictx->objects, index, obj) { - if (!refcount_dec_if_one(&obj->users)) - continue; destroyed++; - xa_erase(&ictx->objects, index); - iommufd_object_ops[obj->type].destroy(obj); - kfree(obj); + xas_store(&xas, NULL); } + /* Bug related to users refcount */ if (WARN_ON(!destroyed)) break; diff --git a/drivers/iommu/iommufd/viommu.c b/drivers/iommu/iommufd/viommu.c index 01df2b985f02..b3a5f505c7ed 100644 --- a/drivers/iommu/iommufd/viommu.c +++ b/drivers/iommu/iommufd/viommu.c @@ -89,10 +89,18 @@ void iommufd_vdevice_destroy(struct iommufd_object *obj) struct iommufd_vdevice *vdev = container_of(obj, struct iommufd_vdevice, obj); struct iommufd_viommu *viommu = vdev->viommu; + struct iommufd_device *idev = vdev->idev; + + /* + * since we have an refcount on idev, it can't be freed. 
+ */ + lockdep_assert_held(&idev->lock); /* xa_cmpxchg is okay to fail if alloc failed xa_cmpxchg previously */ xa_cmpxchg(&viommu->vdevs, vdev->id, vdev, NULL, GFP_KERNEL); refcount_dec(&viommu->obj.users); + idev->vdev = NULL; + refcount_dec(&idev->obj.users); put_device(vdev->dev); } @@ -124,10 +132,15 @@ int iommufd_vdevice_alloc_ioctl(struct iommufd_ucmd *ucmd) goto out_put_idev; } + mutex_lock(&idev->lock); + if (idev->vdev) { + rc = -EINVAL; + goto out_put_idev_unlock; + } vdev = iommufd_object_alloc(ucmd->ictx, vdev, IOMMUFD_OBJ_VDEVICE); if (IS_ERR(vdev)) { rc = PTR_ERR(vdev); - goto out_put_idev; + goto out_put_idev_unlock; } vdev->id = virt_id; @@ -147,10 +160,17 @@ int iommufd_vdevice_alloc_ioctl(struct iommufd_ucmd *ucmd) if (rc) goto out_abort; iommufd_object_finalize(ucmd->ictx, &vdev->obj); - goto out_put_idev; + /* don't allow idev free without vdev free */ + refcount_inc(&idev->obj.users); + vdev->idev = idev; + /* vdev lifecycle now managed by idev */ + idev->vdev = vdev; + goto out_put_idev_unlock; out_abort: iommufd_object_abort_and_destroy(ucmd->ictx, &vdev->obj); +out_put_idev_unlock: + mutex_unlock(&idev->lock); out_put_idev: iommufd_put_object(ucmd->ictx, &idev->obj); out_put_viommu: diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c index 6328c3a05bcd..0bf4f8b7f8d2 100644 --- a/drivers/vfio/pci/vfio_pci_core.c +++ b/drivers/vfio/pci/vfio_pci_core.c @@ -694,6 +694,12 @@ void vfio_pci_core_close_device(struct vfio_device *core_vdev) #if IS_ENABLED(CONFIG_EEH) eeh_dev_release(vdev->pdev); #endif + + /* destroy vdevice which involves tsm unbind before we disable pci disable */ + if (core_vdev->iommufd_device) + iommufd_device_tombstone_vdevice(core_vdev->iommufd_device); + + /* tsm unbind should happen before this */ vfio_pci_core_disable(vdev); mutex_lock(&vdev->igate); diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h index 34b6e6ca4bfa..8de3d903bfc0 100644 --- a/include/linux/iommufd.h +++ 
b/include/linux/iommufd.h @@ -63,6 +63,7 @@ void iommufd_device_detach(struct iommufd_device *idev, ioasid_t pasid); struct iommufd_ctx *iommufd_device_to_ictx(struct iommufd_device *idev); u32 iommufd_device_to_id(struct iommufd_device *idev); +void iommufd_device_tombstone_vdevice(struct iommufd_device *idev); struct iommufd_access_ops { u8 needs_pin_pages : 1; ^ permalink raw reply related [flat|nested] 32+ messages in thread
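The REMOVE_OBJ_TOMBSTONE / XA_ZERO_ENTRY mechanics in the patch above — the destroyed object's slot stays occupied so the ID can never be repurposed, while lookups treat it as absent — can be modeled with a plain ID table and a sentinel pointer. A toy sketch (a fixed array, not the kernel xarray API):

```c
#include <assert.h>
#include <stddef.h>

#define TABLE_SIZE 8

/* Sentinel playing the role of XA_ZERO_ENTRY: the slot stays occupied
 * (the ID can't be handed out again) but lookups report it as absent. */
static int tombstone_marker;
#define TOMBSTONE ((void *)&tombstone_marker)

static void *table[TABLE_SIZE];

/* Allocate the lowest free ID; tombstoned slots are NOT free. */
static int id_alloc(void *obj)
{
	for (int i = 0; i < TABLE_SIZE; i++) {
		if (!table[i]) {
			table[i] = obj;
			return i;
		}
	}
	return -1;
}

/* Lookup treats a tombstone like an empty slot (cf. xa_is_zero()). */
static void *id_lookup(int id)
{
	void *e = table[id];
	return e == TOMBSTONE ? NULL : e;
}

/* Remove an ID.  With tombstone set — mirroring REMOVE_OBJ_TOMBSTONE
 * vs. a plain xas_store(NULL) — the ID is leaked until final teardown. */
static void id_remove(int id, int tombstone)
{
	table[id] = tombstone ? TOMBSTONE : NULL;
}
```

After a tombstoned remove, a lookup (the path IOMMU_DESTROY takes) fails as if the ID were gone, yet a fresh allocation skips the slot, so the original user can never be handed a recycled ID without noticing.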
* Re: [PATCH v2 3/4] iommufd: Destroy vdevice on idevice destroy 2025-06-25 6:40 ` Aneesh Kumar K.V @ 2025-06-25 9:38 ` Xu Yilun 0 siblings, 0 replies; 32+ messages in thread From: Xu Yilun @ 2025-06-25 9:38 UTC (permalink / raw) To: Aneesh Kumar K.V Cc: jgg, jgg, kevin.tian, will, iommu, linux-kernel, joro, robin.murphy, shuah, nicolinc, aik, dan.j.williams, baolu.lu, yilun.xu On Wed, Jun 25, 2025 at 12:10:28PM +0530, Aneesh Kumar K.V wrote: > Xu Yilun <yilun.xu@linux.intel.com> writes: > > > Destroy iommufd_vdevice(vdev) on iommufd_idevice(idev) destroy so that > > vdev can't outlive idev. > > > > iommufd_device(idev) represents the physical device bound to iommufd, > > while the iommufd_vdevice(vdev) represents the virtual instance of the > > physical device in the VM. The lifecycle of the vdev should not be > > longer than idev. This doesn't cause real problem on existing use cases > > cause vdev doesn't impact the physical device, only provides > > virtualization information. But to extend vdev for Confidential > > Computing(CC), there are needs to do secure configuration for the vdev, > > e.g. TSM Bind/Unbind. These configurations should be rolled back on idev > > destroy, or the external driver(VFIO) functionality may be impact. > > > > Building the association between idev & vdev requires the two objects > > pointing each other, but not referencing each other. This requires > > proper locking. This is done by reviving some of Nicolin's patch [1]. > > > > There are 3 cases on idev destroy: > > > > 1. vdev is already destroyed by userspace. No extra handling needed. > > 2. vdev is still alive. Use iommufd_object_tombstone_user() to > > destroy vdev and tombstone the vdev ID. > > 3. vdev is being destroyed by userspace. The vdev ID is already freed, > > but vdev destroy handler is not complete. The idev destroy handler > > should wait for vdev destroy completion. 
> > > > [1]: https://lore.kernel.org/all/53025c827c44d68edb6469bfd940a8e8bc6147a5.1729897278.git.nicolinc@nvidia.com/ > > > > Original-by: Nicolin Chen <nicolinc@nvidia.com> > > Original-by: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org> > > Signed-off-by: Xu Yilun <yilun.xu@linux.intel.com> > > This is the latest version I have. But as Jason suggested we can > possibly switch to short term users to avoid parallel destroy returning > EBUSY. We can discuss in that thread, I have the same question as Kevin. > I am using a mutex lock to block parallel vdevice destroy. I don't find reason to add a new lock rather than reuse the idev->igroup->lock, which is already reviewed in Nicolin's series. [...] > diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c > index 3df468f64e7d..fd82140e6320 100644 > --- a/drivers/iommu/iommufd/main.c > +++ b/drivers/iommu/iommufd/main.c > @@ -81,14 +81,16 @@ void iommufd_object_abort_and_destroy(struct iommufd_ctx *ictx, > struct iommufd_object *iommufd_get_object(struct iommufd_ctx *ictx, u32 id, > enum iommufd_object_type type) > { > + XA_STATE(xas, &ictx->objects, id); > struct iommufd_object *obj; > > if (iommufd_should_fail()) > return ERR_PTR(-ENOENT); > > xa_lock(&ictx->objects); > - obj = xa_load(&ictx->objects, id); > - if (!obj || (type != IOMMUFD_OBJ_ANY && obj->type != type) || > + obj = xas_load(&xas); > + if (!obj || xa_is_zero(obj) || xas_load() & filter out zero entries, then what's the difference from xa_load()? > + (type != IOMMUFD_OBJ_ANY && obj->type != type) || > !iommufd_lock_obj(obj)) > obj = ERR_PTR(-ENOENT); > xa_unlock(&ictx->objects); [...] 
> static int iommufd_destroy(struct iommufd_ucmd *ucmd) > { > + int ret; > struct iommu_destroy *cmd = ucmd->cmd; > + struct iommufd_object *obj; > + struct iommufd_device *idev = NULL; > + > + obj = iommufd_get_object(ucmd->ictx, cmd->id, IOMMUFD_OBJ_ANY); > + /* Destroying vdevice requires an idevice lock */ > + if (!IS_ERR(obj) && obj->type == IOMMUFD_OBJ_VDEVICE) { > + struct iommufd_vdevice *vdev = > + container_of(obj, struct iommufd_vdevice, obj); > + /* > + * since vdev have an refcount on idev, this is safe. > + */ > + idev = vdev->idev; > + mutex_lock(&idev->lock); > + /* drop the additonal reference taken above */ > + iommufd_put_object(ucmd->ictx, obj); > + } > + > + ret = iommufd_object_remove(ucmd->ictx, NULL, cmd->id, 0); > + if (idev) > + mutex_unlock(&idev->lock); I'm trying my best not to add vdev/idev-specific things to generic code. I also don't prefer adding an idev-specific lock around the generic object remove flow. That makes idev/vdev way too special. So for these locking things, I prefer Nicolin's v5 approach and would revive it. > > - return iommufd_object_remove(ucmd->ictx, NULL, cmd->id, 0); > + return ret; > } > [...] > @@ -147,10 +160,17 @@ int iommufd_vdevice_alloc_ioctl(struct iommufd_ucmd *ucmd) > if (rc) > goto out_abort; > iommufd_object_finalize(ucmd->ictx, &vdev->obj); > - goto out_put_idev; > + /* don't allow idev free without vdev free */ > + refcount_inc(&idev->obj.users); IIRC, it has been suggested to have no long-term refcount on kernel-created objects. Besides, it doesn't actually disallow anything. 
> + vdev->idev = idev; > + /* vdev lifecycle now managed by idev */ > + idev->vdev = vdev; > + goto out_put_idev_unlock; > > out_abort: > iommufd_object_abort_and_destroy(ucmd->ictx, &vdev->obj); > +out_put_idev_unlock: > + mutex_unlock(&idev->lock); > out_put_idev: > iommufd_put_object(ucmd->ictx, &idev->obj); > out_put_viommu: > diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c > index 6328c3a05bcd..0bf4f8b7f8d2 100644 > --- a/drivers/vfio/pci/vfio_pci_core.c > +++ b/drivers/vfio/pci/vfio_pci_core.c > @@ -694,6 +694,12 @@ void vfio_pci_core_close_device(struct vfio_device *core_vdev) > #if IS_ENABLED(CONFIG_EEH) > eeh_dev_release(vdev->pdev); > #endif > + > + /* destroy vdevice which involves tsm unbind before we disable pci disable */ > + if (core_vdev->iommufd_device) > + iommufd_device_tombstone_vdevice(core_vdev->iommufd_device); Ah.. I think all our effort to destroy vdevice on idevice destruction is to hide the sequence details from VFIO. > + > + /* tsm unbind should happen before this */ > vfio_pci_core_disable(vdev); I did mention there is still an issue even when the vdevice lifecycle problem is solved: VFIO for now does device-disable configuration before iommufd_device_unbind(), which causes problems. 
But my quick idea would be (not tested): @@ -544,12 +544,14 @@ static void vfio_df_device_last_close(struct vfio_device_file *df) lockdep_assert_held(&device->dev_set->lock); - if (device->ops->close_device) - device->ops->close_device(device); if (iommufd) vfio_df_iommufd_unbind(df); else vfio_device_group_unuse_iommu(device); + + if (device->ops->close_device) + device->ops->close_device(device); > > mutex_lock(&vdev->igate); > diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h > index 34b6e6ca4bfa..8de3d903bfc0 100644 > --- a/include/linux/iommufd.h > +++ b/include/linux/iommufd.h > @@ -63,6 +63,7 @@ void iommufd_device_detach(struct iommufd_device *idev, ioasid_t pasid); > > struct iommufd_ctx *iommufd_device_to_ictx(struct iommufd_device *idev); > u32 iommufd_device_to_id(struct iommufd_device *idev); > +void iommufd_device_tombstone_vdevice(struct iommufd_device *idev); > > struct iommufd_access_ops { > u8 needs_pin_pages : 1; ^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH v2 4/4] iommufd/selftest: Add coverage for vdevice tombstone 2025-06-23 9:49 [PATCH v2 0/4] iommufd: Destroy vdevice on device unbind Xu Yilun ` (2 preceding siblings ...) 2025-06-23 9:49 ` [PATCH v2 3/4] iommufd: Destroy vdevice on idevice destroy Xu Yilun @ 2025-06-23 9:49 ` Xu Yilun 2025-06-24 13:41 ` Jason Gunthorpe 3 siblings, 1 reply; 32+ messages in thread From: Xu Yilun @ 2025-06-23 9:49 UTC (permalink / raw) To: jgg, jgg, kevin.tian, will, aneesh.kumar Cc: iommu, linux-kernel, joro, robin.murphy, shuah, nicolinc, aik, dan.j.williams, baolu.lu, yilun.xu, yilun.xu This tests the flow that tombstones the vdevice when the idevice is unbound before vdevice destroy. The expected result is: - idevice unbind tombstones the vdevice ID; the ID can't be repurposed anymore. - Even ioctl(IOMMU_DESTROY) can't free the tombstoned ID. - iommufd_fops_release() can still free everything. Signed-off-by: Xu Yilun <yilun.xu@linux.intel.com> --- tools/testing/selftests/iommu/iommufd.c | 13 +++++++++++++ 1 file changed, 13 insertions(+) diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c index 1a8e85afe9aa..092e1344447e 100644 --- a/tools/testing/selftests/iommu/iommufd.c +++ b/tools/testing/selftests/iommu/iommufd.c @@ -3014,6 +3014,19 @@ TEST_F(iommufd_viommu, vdevice_cache) } } +TEST_F(iommufd_viommu, vdevice_tombstone) +{ + uint32_t viommu_id = self->viommu_id; + uint32_t dev_id = self->device_id; + uint32_t vdev_id = 0; + + if (dev_id) { + test_cmd_vdevice_alloc(viommu_id, dev_id, 0x99, &vdev_id); + test_ioctl_destroy(self->stdev_id); + EXPECT_ERRNO(ENOENT, _test_ioctl_destroy(self->fd, vdev_id)); + } +} + FIXTURE(iommufd_device_pasid) { int fd; -- 2.25.1 ^ permalink raw reply related [flat|nested] 32+ messages in thread
* Re: [PATCH v2 4/4] iommufd/selftest: Add coverage for vdevice tombstone 2025-06-23 9:49 ` [PATCH v2 4/4] iommufd/selftest: Add coverage for vdevice tombstone Xu Yilun @ 2025-06-24 13:41 ` Jason Gunthorpe 2025-06-25 8:29 ` Xu Yilun 0 siblings, 1 reply; 32+ messages in thread From: Jason Gunthorpe @ 2025-06-24 13:41 UTC (permalink / raw) To: Xu Yilun Cc: kevin.tian, will, aneesh.kumar, iommu, linux-kernel, joro, robin.murphy, shuah, nicolinc, aik, dan.j.williams, baolu.lu, yilun.xu On Mon, Jun 23, 2025 at 05:49:46PM +0800, Xu Yilun wrote: > diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c > index 1a8e85afe9aa..092e1344447e 100644 > --- a/tools/testing/selftests/iommu/iommufd.c > +++ b/tools/testing/selftests/iommu/iommufd.c > @@ -3014,6 +3014,19 @@ TEST_F(iommufd_viommu, vdevice_cache) > } > } > > +TEST_F(iommufd_viommu, vdevice_tombstone) > +{ > + uint32_t viommu_id = self->viommu_id; > + uint32_t dev_id = self->device_id; > + uint32_t vdev_id = 0; > + > + if (dev_id) { Why the if? The test should work or skip, not silently skip. self->device_id is setup by the fixture, it can't be absent. Jason ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v2 4/4] iommufd/selftest: Add coverage for vdevice tombstone 2025-06-24 13:41 ` Jason Gunthorpe @ 2025-06-25 8:29 ` Xu Yilun 0 siblings, 0 replies; 32+ messages in thread From: Xu Yilun @ 2025-06-25 8:29 UTC (permalink / raw) To: Jason Gunthorpe Cc: kevin.tian, will, aneesh.kumar, iommu, linux-kernel, joro, robin.murphy, shuah, nicolinc, aik, dan.j.williams, baolu.lu, yilun.xu On Tue, Jun 24, 2025 at 10:41:45AM -0300, Jason Gunthorpe wrote: > On Mon, Jun 23, 2025 at 05:49:46PM +0800, Xu Yilun wrote: > > diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c > > index 1a8e85afe9aa..092e1344447e 100644 > > --- a/tools/testing/selftests/iommu/iommufd.c > > +++ b/tools/testing/selftests/iommu/iommufd.c > > @@ -3014,6 +3014,19 @@ TEST_F(iommufd_viommu, vdevice_cache) > > } > > } > > > > +TEST_F(iommufd_viommu, vdevice_tombstone) > > +{ > > + uint32_t viommu_id = self->viommu_id; > > + uint32_t dev_id = self->device_id; > > + uint32_t vdev_id = 0; > > + > > + if (dev_id) { > > Why the if? The test should work or skip, not silently skip. > > self->device_id is setup by the fixture, it can't be absent. It can, when the variant no_viommu. I agree this should not be silently skipped, see below changes: Should also apply to viommu_alloc_nested_iopf, vdevice_cache. +TEST_F(iommufd_viommu, vdevice_tombstone) +{ + uint32_t viommu_id = self->viommu_id; + uint32_t dev_id = self->device_id; + uint32_t vdev_id = 0; + + if (!dev_id) + SKIP(return, "Skipping test for variant no_viommu"); + + test_cmd_vdevice_alloc(viommu_id, dev_id, 0x99, &vdev_id); + test_ioctl_destroy(self->stdev_id); + EXPECT_ERRNO(ENOENT, _test_ioctl_destroy(self->fd, vdev_id)); +} Thanks, Yilun > > Jason ^ permalink raw reply [flat|nested] 32+ messages in thread
end of thread, other threads:[~2025-06-26 14:37 UTC | newest] Thread overview: 32+ messages
2025-06-23 9:49 [PATCH v2 0/4] iommufd: Destroy vdevice on device unbind Xu Yilun
2025-06-23 9:49 ` [PATCH v2 1/4] iommufd: Add iommufd_object_tombstone_user() helper Xu Yilun
2025-06-24 13:35 ` Jason Gunthorpe
2025-06-25 7:24 ` Xu Yilun
2025-06-25 5:51 ` Aneesh Kumar K.V
2025-06-25 8:40 ` Xu Yilun
2025-06-23 9:49 ` [PATCH v2 2/4] iommufd/viommu: Fix the uninitialized iommufd_vdevice::ictx Xu Yilun
2025-06-24 3:24 ` Baolu Lu
2025-06-24 6:35 ` Xu Yilun
2025-06-23 9:49 ` [PATCH v2 3/4] iommufd: Destroy vdevice on idevice destroy Xu Yilun
2025-06-24 3:32 ` Baolu Lu
2025-06-24 8:11 ` Xu Yilun
2025-06-24 8:28 ` Tian, Kevin
2025-06-24 8:12 ` Tian, Kevin
2025-06-25 1:55 ` Baolu Lu
2025-06-24 6:47 ` Xu Yilun
2025-06-24 8:22 ` Tian, Kevin
2025-06-26 4:59 ` Xu Yilun
2025-06-24 14:53 ` Jason Gunthorpe
2025-06-24 23:57 ` Tian, Kevin
2025-06-25 1:36 ` Jason Gunthorpe
2025-06-25 2:11 ` Tian, Kevin
2025-06-25 12:33 ` Jason Gunthorpe
2025-06-25 10:06 ` Xu Yilun
2025-06-25 12:38 ` Jason Gunthorpe
2025-06-26 3:31 ` Xu Yilun
2025-06-26 14:36 ` Jason Gunthorpe
2025-06-25 6:40 ` Aneesh Kumar K.V
2025-06-25 9:38 ` Xu Yilun
2025-06-23 9:49 ` [PATCH v2 4/4] iommufd/selftest: Add coverage for vdevice tombstone Xu Yilun
2025-06-24 13:41 ` Jason Gunthorpe
2025-06-25 8:29 ` Xu Yilun