From: Alex Williamson <alex@shazbot.org>
To: Zhiping Zhang <zhipingz@meta.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>, Leon Romanovsky <leon@kernel.org>,
Bjorn Helgaas <helgaas@kernel.org>, <kvm@vger.kernel.org>,
<linux-rdma@vger.kernel.org>, <linux-pci@vger.kernel.org>,
<netdev@vger.kernel.org>, <dri-devel@lists.freedesktop.org>,
Keith Busch <kbusch@kernel.org>, Yochai Cohen <yochai@nvidia.com>,
Yishai Hadas <yishaih@nvidia.com>,
alex@shazbot.org
Subject: Re: [PATCH v4 1/3] vfio: add dma-buf get_tph callback and DMA_BUF_TPH feature
Date: Thu, 21 May 2026 16:04:12 -0600 [thread overview]
Message-ID: <20260521160412.4fa75406@shazbot.org> (raw)
In-Reply-To: <20260519201401.1558410-2-zhipingz@meta.com>
On Tue, 19 May 2026 13:13:49 -0700
Zhiping Zhang <zhipingz@meta.com> wrote:
> Add a dma-buf get_tph callback for exporters to return TPH
> (TLP Processing Hints) metadata, and add VFIO_DEVICE_FEATURE_DMA_BUF_TPH
> so userspace can attach that metadata to a VFIO-exported dma-buf.
This should be two patches, the first extending the dma-buf framework
for the get_tph callback for explicit approval from dma-buf maintainers
(who are not even copied here). The second the vfio-pci implementation
of get_tph.
> 8-bit ST and 16-bit Extended ST are distinct PCIe TPH namespaces; the
> uAPI carries both with explicit validity flags so importers get the
> value matching their requested width. SET is write-once per dma-buf;
> the existing VFIO_DEVICE_FEATURE_DMA_BUF uAPI is unchanged.
I didn't see what motivated this write-once change, I thought we
understood that it was a userspace problem that the tph values need to
be set before providing the dma-buf fd to the importer and that races
relative to that are a userspace ordering problem. Write-once seems
unnecessarily restrictive and there's no justification provided here.
> Signed-off-by: Zhiping Zhang <zhipingz@meta.com>
> ---
> drivers/vfio/pci/vfio_pci_core.c | 3 +
> drivers/vfio/pci/vfio_pci_dmabuf.c | 134 +++++++++++++++++++++++++++--
> drivers/vfio/pci/vfio_pci_priv.h | 12 +++
> include/linux/dma-buf.h | 21 +++++
> include/uapi/linux/vfio.h | 35 ++++++++
> 5 files changed, 198 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index 3f8d093aacf8..94aa6dd95701 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -1534,6 +1534,9 @@ int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
> return vfio_pci_core_feature_token(vdev, flags, arg, argsz);
> case VFIO_DEVICE_FEATURE_DMA_BUF:
> return vfio_pci_core_feature_dma_buf(vdev, flags, arg, argsz);
> + case VFIO_DEVICE_FEATURE_DMA_BUF_TPH:
> + return vfio_pci_core_feature_dma_buf_tph(vdev, flags, arg,
> + argsz);
> default:
> return -ENOTTY;
> }
> diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c
> index f87fd32e4a01..be1c65385670 100644
> --- a/drivers/vfio/pci/vfio_pci_dmabuf.c
> +++ b/drivers/vfio/pci/vfio_pci_dmabuf.c
> @@ -19,7 +19,24 @@ struct vfio_pci_dma_buf {
> u32 nr_ranges;
> struct kref kref;
> struct completion comp;
> - u8 revoked : 1;
> + /*
> + * TPH metadata published by VFIO_DEVICE_FEATURE_DMA_BUF_TPH and
> + * consumed by the @get_tph dma-buf callback.
> + *
> + * @tph_flags is the publish/consume gate: writers populate
> + * @steering_tag, @steering_tag_ext and @ph first, then store
> + * @tph_flags with smp_store_release(); readers do
> + * smp_load_acquire(&tph_flags) before accessing the value fields.
> + * @tph_flags == 0 means "TPH not set". Writers publish a non-zero
> + * value only once per dma-buf and serialize via vdev->memory_lock;
> + * readers stay lockless to avoid AB-BA against the dma_resv_lock held
> + * by importers.
> + */
Can you outline the ABBA hazard, I'm not seeing it. You're acquiring
memory_lock in the feature SET and dma_resv_lock doesn't appear to be
held when calling .get_tph(). There's a lot of lockless complication
here balanced on this claim of avoiding a hazard that doesn't appear
present.
> + u32 tph_flags;
> + u16 steering_tag_ext;
> + u8 steering_tag;
> + u8 ph;
> + bool revoked;
If we still used memory_lock for tph, these could be:
u8 tph_st_valid:1; /* memory_lock */
u8 tph_st_ext_valid:1; /* memory_lock */
u8 tph_ph:2; /* memory_lock */
u8 tph_st;
u16 tph_st_ext;
u8 revoked:1; /* dma_resv_lock */
The existing change of @revoked from bitfield to bool has no rationale
noted for it in the commit log.
> };
>
> static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf,
> @@ -69,6 +86,36 @@ vfio_pci_dma_buf_map(struct dma_buf_attachment *attachment,
> return ret;
> }
>
> +static int vfio_pci_dma_buf_get_tph(struct dma_buf *dmabuf, u16 *steering_tag,
> + u8 *ph, u8 st_width)
> +{
> + struct vfio_pci_dma_buf *priv = dmabuf->priv;
> + u32 flags;
> +
> + /* Pair with the smp_store_release() in VFIO_DEVICE_FEATURE_DMA_BUF_TPH. */
> + flags = smp_load_acquire(&priv->tph_flags);
> + if (!flags)
> + return -EOPNOTSUPP;
> +
> + switch (st_width) {
> + case 8:
> + if (!(flags & VFIO_DMA_BUF_TPH_ST))
> + return -EOPNOTSUPP;
> + *steering_tag = priv->steering_tag;
> + break;
> + case 16:
> + if (!(flags & VFIO_DMA_BUF_TPH_ST_EXT))
> + return -EOPNOTSUPP;
> + *steering_tag = priv->steering_tag_ext;
> + break;
> + default:
> + return -EINVAL;
> + }
> +
> + *ph = priv->ph;
> + return 0;
> +}
> +
> static void vfio_pci_dma_buf_unmap(struct dma_buf_attachment *attachment,
> struct sg_table *sgt,
> enum dma_data_direction dir)
> @@ -84,16 +131,17 @@ static void vfio_pci_dma_buf_unmap(struct dma_buf_attachment *attachment,
> static void vfio_pci_dma_buf_release(struct dma_buf *dmabuf)
> {
> struct vfio_pci_dma_buf *priv = dmabuf->priv;
> + struct vfio_pci_core_device *vdev = READ_ONCE(priv->vdev);
>
> /*
> * Either this or vfio_pci_dma_buf_cleanup() will remove from the list.
> * The refcount prevents both.
> */
> - if (priv->vdev) {
> - down_write(&priv->vdev->memory_lock);
> + if (vdev) {
> + down_write(&vdev->memory_lock);
> list_del_init(&priv->dmabufs_elm);
> - up_write(&priv->vdev->memory_lock);
> - vfio_device_put_registration(&priv->vdev->vdev);
> + up_write(&vdev->memory_lock);
> + vfio_device_put_registration(&vdev->vdev);
> }
> kfree(priv->phys_vec);
> kfree(priv);
This seems unnecessary. I think this is just because priv->vdev is now
(unnecessarily) set via WRITE_ONCE, right? These are very well ordered
paths, prior to exposing the dma-buf, while the device is opened, during
release, after release. They don't seem to need the READ/WRITE_ONCE
treatment. This looks like noise from trying to make it lockless.
> @@ -101,6 +149,7 @@ static void vfio_pci_dma_buf_release(struct dma_buf *dmabuf)
>
> static const struct dma_buf_ops vfio_pci_dmabuf_ops = {
> .attach = vfio_pci_dma_buf_attach,
> + .get_tph = vfio_pci_dma_buf_get_tph,
> .map_dma_buf = vfio_pci_dma_buf_map,
> .unmap_dma_buf = vfio_pci_dma_buf_unmap,
> .release = vfio_pci_dma_buf_release,
> @@ -269,7 +318,7 @@ int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
> goto err_free_priv;
> }
>
> - priv->vdev = vdev;
> + WRITE_ONCE(priv->vdev, vdev);
> priv->nr_ranges = get_dma_buf.nr_ranges;
> priv->size = length;
> ret = vdev->pci_ops->get_dmabuf_phys(vdev, &priv->provider,
> @@ -331,6 +380,77 @@ int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
> return ret;
> }
>
> +int vfio_pci_core_feature_dma_buf_tph(struct vfio_pci_core_device *vdev,
> + u32 flags,
> + struct vfio_device_feature_dma_buf_tph __user *arg,
> + size_t argsz)
> +{
> + struct vfio_device_feature_dma_buf_tph set_tph;
> + struct vfio_pci_dma_buf *priv;
> + struct dma_buf *dmabuf;
> + int ret;
> +
> + ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_SET,
> + sizeof(set_tph));
> + if (ret != 1)
> + return ret;
> +
> + if (copy_from_user(&set_tph, arg, sizeof(set_tph)))
> + return -EFAULT;
> +
> + if (set_tph.flags & ~(VFIO_DMA_BUF_TPH_ST | VFIO_DMA_BUF_TPH_ST_EXT))
> + return -EINVAL;
> +
> + if (!set_tph.flags)
> + return -EINVAL;
> +
> + /* PCIe TLP Processing Hint is a 2-bit field. */
> + if (set_tph.ph & ~0x3)
> + return -EINVAL;
> +
> + dmabuf = dma_buf_get(set_tph.dmabuf_fd);
> + if (IS_ERR(dmabuf))
> + return PTR_ERR(dmabuf);
> +
> + if (dmabuf->ops != &vfio_pci_dmabuf_ops) {
> + ret = -EINVAL;
> + goto out_put;
> + }
> +
> + priv = dmabuf->priv;
> + down_write(&vdev->memory_lock);
> + if (READ_ONCE(priv->vdev) != vdev) {
> + ret = -EINVAL;
> + goto out_unlock;
> + }
> +
> + /*
> + * TPH metadata is write-once per dma-buf so that lockless readers only
> + * have to observe a single release-published transition from 0 -> flags.
> + */
> + if (READ_ONCE(priv->tph_flags)) {
> + ret = -EBUSY;
> + goto out_unlock;
> + }
> +
> + priv->steering_tag = set_tph.steering_tag;
> + priv->steering_tag_ext = set_tph.steering_tag_ext;
> + priv->ph = set_tph.ph;
> + /*
> + * Publish the TPH values before the gate flag, so that lockless
> + * readers in vfio_pci_dma_buf_get_tph() see fully-initialized
> + * fields once they observe a non-zero tph_flags.
> + */
> + smp_store_release(&priv->tph_flags, set_tph.flags);
> + ret = 0;
> +
> +out_unlock:
> + up_write(&vdev->memory_lock);
> +out_put:
> + dma_buf_put(dmabuf);
> + return ret;
> +}
> +
> void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked)
> {
> struct vfio_pci_dma_buf *priv;
> @@ -388,7 +508,7 @@ void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev)
>
> dma_resv_lock(priv->dmabuf->resv, NULL);
> list_del_init(&priv->dmabufs_elm);
> - priv->vdev = NULL;
> + WRITE_ONCE(priv->vdev, NULL);
> priv->revoked = true;
> dma_buf_invalidate_mappings(priv->dmabuf);
> dma_resv_wait_timeout(priv->dmabuf->resv,
> diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h
> index fca9d0dfac90..c58f369be4b3 100644
> --- a/drivers/vfio/pci/vfio_pci_priv.h
> +++ b/drivers/vfio/pci/vfio_pci_priv.h
> @@ -118,6 +118,10 @@ static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
> int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
> struct vfio_device_feature_dma_buf __user *arg,
> size_t argsz);
> +int vfio_pci_core_feature_dma_buf_tph(struct vfio_pci_core_device *vdev,
> + u32 flags,
> + struct vfio_device_feature_dma_buf_tph __user *arg,
> + size_t argsz);
> void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev);
> void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked);
> #else
> @@ -128,6 +132,14 @@ vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
> {
> return -ENOTTY;
> }
> +
> +static inline int
> +vfio_pci_core_feature_dma_buf_tph(struct vfio_pci_core_device *vdev, u32 flags,
> + struct vfio_device_feature_dma_buf_tph __user *arg,
> + size_t argsz)
> +{
> + return -ENOTTY;
> +}
> static inline void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev)
> {
> }
> diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
> index d1203da56fc5..49eb6ad644a2 100644
> --- a/include/linux/dma-buf.h
> +++ b/include/linux/dma-buf.h
> @@ -113,6 +113,27 @@ struct dma_buf_ops {
> */
> void (*unpin)(struct dma_buf_attachment *attach);
>
> + /**
> + * @get_tph:
> + * @dmabuf: DMA buffer for which to retrieve TPH metadata
> + * @steering_tag: Returns the raw TPH steering tag for @st_width
> + * @ph: Returns the TPH processing hint (2-bit value)
> + * @st_width: Consumer's supported steering tag width in bits (8 or 16)
> + *
> + * Return the TPH (TLP Processing Hints) metadata associated with this
> + * DMA buffer for the requested steering-tag width. 8-bit ST and 16-bit
> + * Extended ST are distinct namespaces in the PCIe TPH ST table and may
> + * both be present with different values, so the exporter must select the
> + * value that matches @st_width and must not substitute one for the other.
> + *
> + * Return 0 on success, -EOPNOTSUPP if no metadata is available for the
> + * requested width, or -EINVAL if @st_width is not 8 or 16.
> + *
> + * This callback is optional.
> + */
> + int (*get_tph)(struct dma_buf *dmabuf, u16 *steering_tag, u8 *ph,
> + u8 st_width);
> +
> /**
> * @map_dma_buf:
> *
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 5de618a3a5ee..a9cb6cbc6ade 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -1534,6 +1534,41 @@ struct vfio_device_feature_dma_buf {
> */
> #define VFIO_DEVICE_FEATURE_MIG_PRECOPY_INFOv2 12
>
> +/**
> + * Upon VFIO_DEVICE_FEATURE_SET associate TPH (TLP Processing Hints) metadata
> + * with a vfio-exported dma-buf. The dma-buf must have been created by
> + * VFIO_DEVICE_FEATURE_DMA_BUF on this device.
> + *
> + * dmabuf_fd is the file descriptor returned by VFIO_DEVICE_FEATURE_DMA_BUF.
> + *
> + * 8-bit ST (steering_tag) and 16-bit Extended ST (steering_tag_ext) are
> + * distinct namespaces in the PCIe TPH ST table and may both be present with
> + * different values. Userspace should populate the value(s) it has from the
> + * firmware ST table for this device and set the matching VFIO_DMA_BUF_TPH_ST /
> + * VFIO_DMA_BUF_TPH_ST_EXT bit in @flags. An importer requests a specific
> + * width and receives the matching value; if the requested width is not
> + * present, the importer is told TPH is unavailable for this dma-buf.
> + *
> + * ph is the 2-bit TLP Processing Hint and must be in the range [0, 3].
> + *
> + * The user must set TPH on the dma-buf before the importer consumes it.
> + * TPH metadata is write-once per dma-buf; a second SET returns -EBUSY.
> + *
> + * Return: 0 on success, -errno on failure.
> + */
> +#define VFIO_DEVICE_FEATURE_DMA_BUF_TPH 13
> +
> +#define VFIO_DMA_BUF_TPH_ST (1 << 0) /* steering_tag valid */
> +#define VFIO_DMA_BUF_TPH_ST_EXT (1 << 1) /* steering_tag_ext valid */
> +
> +struct vfio_device_feature_dma_buf_tph {
> + __s32 dmabuf_fd;
> + __u32 flags;
> + __u8 steering_tag;
> + __u8 ph;
> + __u16 steering_tag_ext;
> +};
Sure is tempting to make the ph field the first 2-bits of u8 flags.
Thanks,
Alex
> +
> /* -------- API for Type1 VFIO IOMMU -------- */
>
> /**
next prev parent reply other threads:[~2026-05-21 22:04 UTC|newest]
Thread overview: 8+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-05-19 20:13 [PATCH v4 0/3] vfio/dma-buf: add TPH support for peer-to-peer access Zhiping Zhang
2026-05-19 20:13 ` [PATCH v4 1/3] vfio: add dma-buf get_tph callback and DMA_BUF_TPH feature Zhiping Zhang
2026-05-21 22:04 ` Alex Williamson [this message]
2026-05-21 22:24 ` Alex Williamson
2026-05-23 1:03 ` Zhiping Zhang
2026-05-22 23:53 ` Zhiping Zhang
2026-05-19 20:13 ` [PATCH v4 2/3] PCI/TPH: expose the enabled TPH requester type Zhiping Zhang
2026-05-19 20:13 ` [PATCH v4 3/3] RDMA/mlx5: get tph for p2p access when registering dma-buf mr Zhiping Zhang
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260521160412.4fa75406@shazbot.org \
--to=alex@shazbot.org \
--cc=dri-devel@lists.freedesktop.org \
--cc=helgaas@kernel.org \
--cc=jgg@ziepe.ca \
--cc=kbusch@kernel.org \
--cc=kvm@vger.kernel.org \
--cc=leon@kernel.org \
--cc=linux-pci@vger.kernel.org \
--cc=linux-rdma@vger.kernel.org \
--cc=netdev@vger.kernel.org \
--cc=yishaih@nvidia.com \
--cc=yochai@nvidia.com \
--cc=zhipingz@meta.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.