Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH v3 1/2] vfio: add dma-buf get_tph callback and DMA_BUF_TPH feature
       [not found] ` <20260512184755.4137227-2-zhipingz@meta.com>
@ 2026-05-13  1:33   ` fengchengwen
  0 siblings, 0 replies; 3+ messages in thread
From: fengchengwen @ 2026-05-13  1:33 UTC (permalink / raw)
  To: Zhiping Zhang, Alex Williamson, Jason Gunthorpe, Leon Romanovsky
  Cc: Bjorn Helgaas, kvm, linux-rdma, linux-pci, netdev, dri-devel,
	Keith Busch, Yochai Cohen, Yishai Hadas

Hi Zhiping,

I have several suggestions:

1. In struct vfio_device_feature_dma_buf_tph, steering_tag is defined as
   __u16, but PCIe TPH base steering tag is only 8-bit. We can use __u8
   for steering_tag to shrink structure size and reduce reserved padding.

2. The flags field seems unnecessary. We can use value 0 of steering_tag
   or steering_tag_ext to indicate the corresponding ST entry is not
   available, which simplifies the uAPI design.

3. All TPH metadata fields (st, ext st, ph) fit within 32 bits. We can
   wrap them into a union with atomic_t, then use atomic read/write
   instead of memory_lock plus smp_load_acquire/smp_store_release. This
   makes lockless access cleaner and avoids ordering maintenance.

For details, see the text.

On 5/13/2026 2:47 AM, Zhiping Zhang wrote:
> Add a dma-buf callback that returns raw TPH metadata from the exporter
> so peer devices can reuse the steering tag and processing hint
> associated with a VFIO-exported buffer. Add a new
> VFIO_DEVICE_FEATURE_DMA_BUF_TPH ioctl that takes the fd from
> VFIO_DEVICE_FEATURE_DMA_BUF along with the TPH values, validates the fd
> is a vfio-exported dma-buf belonging to this device, and stores the TPH
> metadata under memory_lock. The existing VFIO_DEVICE_FEATURE_DMA_BUF
> uAPI is unchanged.
> 
> 8-bit ST and 16-bit Extended ST are distinct namespaces in the PCIe TPH
> ST table (firmware reports them as separate fields with separate
> validity bits in the ACPI _DSM ST table), so the uAPI carries both
> values along with a flags field that indicates which value(s) are
> valid for this device. The exporter selects the value that matches the
> importer's requested width and returns -EOPNOTSUPP if that width is
> not present, instead of substituting a value across namespaces.
> 
> Publish the TPH fields under memory_lock and gate readers on a
> release/acquire on the flags field; this lets get_tph() run lockless
> and avoids inverting the memory_lock -> dma_resv_lock ordering set up
> by vfio_pci_dma_buf_move(). Convert the @revoked bitfield to a plain bool
> so concurrent updates of @revoked (under dma_resv_lock) and the new TPH
> fields (under memory_lock) cannot race on a shared bitfield byte.

The commit log includes many implementation details, why not remove or simply it.

> 
> Signed-off-by: Zhiping Zhang <zhipingz@meta.com>
> 
> ---
>  drivers/vfio/pci/vfio_pci_core.c   |   3 +
>  drivers/vfio/pci/vfio_pci_dmabuf.c | 113 ++++++++++++++++++++++++++++-
>  drivers/vfio/pci/vfio_pci_priv.h   |  11 +++
>  include/linux/dma-buf.h            |  21 ++++++
>  include/uapi/linux/vfio.h          |  35 +++++++++
>  5 files changed, 182 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index 3f8d093aacf8..94aa6dd95701 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -1534,6 +1534,9 @@ int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
>  		return vfio_pci_core_feature_token(vdev, flags, arg, argsz);
>  	case VFIO_DEVICE_FEATURE_DMA_BUF:
>  		return vfio_pci_core_feature_dma_buf(vdev, flags, arg, argsz);
> +	case VFIO_DEVICE_FEATURE_DMA_BUF_TPH:
> +		return vfio_pci_core_feature_dma_buf_tph(vdev, flags, arg,
> +							 argsz);
>  	default:
>  		return -ENOTTY;
>  	}
> diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c
> index f87fd32e4a01..28247602e359 100644
> --- a/drivers/vfio/pci/vfio_pci_dmabuf.c
> +++ b/drivers/vfio/pci/vfio_pci_dmabuf.c
> @@ -19,7 +19,23 @@ struct vfio_pci_dma_buf {
>  	u32 nr_ranges;
>  	struct kref kref;
>  	struct completion comp;
> -	u8 revoked : 1;
> +	/*
> +	 * TPH metadata published by VFIO_DEVICE_FEATURE_DMA_BUF_TPH and
> +	 * consumed by the @get_tph dma-buf callback.
> +	 *
> +	 * @tph_flags is the publish/consume gate: writers populate
> +	 * @steering_tag, @steering_tag_ext and @ph first, then store
> +	 * @tph_flags with smp_store_release(); readers do
> +	 * smp_load_acquire(&tph_flags) before accessing the value fields.
> +	 * @tph_flags == 0 means "TPH not set". Writers serialize via
> +	 * vdev->memory_lock; readers are lockless to avoid AB-BA against
> +	 * the dma_resv_lock held by importers.
> +	 */
> +	u32 tph_flags;

As subsequent comments, can proceed without tph_flags

> +	u16 steering_tag;
> +	u16 steering_tag_ext;
> +	u8 ph;

struct dma_buf_tph {
        union {
                atomic_t val;
                struct {
                        u16 st_ext;
                        u8 st;
                        u8 ph;
                };
        };
};
Set and get are done with atomic operation, no need for lock

> +	bool revoked;
>  };
>  
>  static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf,
> @@ -69,6 +85,35 @@ vfio_pci_dma_buf_map(struct dma_buf_attachment *attachment,
>  	return ret;
>  }
>  

...

>  
> +	/**
> +	 * @get_tph:
> +	 * @dmabuf: DMA buffer for which to retrieve TPH metadata
> +	 * @steering_tag: Returns the raw TPH steering tag for @st_width
> +	 * @ph: Returns the TPH processing hint (2-bit value)
> +	 * @st_width: Consumer's supported steering tag width in bits (8 or 16)
> +	 *
> +	 * Return the TPH (TLP Processing Hints) metadata associated with this
> +	 * DMA buffer for the requested steering-tag width. 8-bit ST and 16-bit
> +	 * Extended ST are distinct namespaces in the PCIe TPH ST table, so the
> +	 * exporter must select the value that matches @st_width and must not
> +	 * substitute one for the other.
> +	 *
> +	 * Return 0 on success, -EOPNOTSUPP if no metadata is available for the
> +	 * requested width, or -EINVAL if @st_width is not 8 or 16.
> +	 *
> +	 * This callback is optional.
> +	 */
> +	int (*get_tph)(struct dma_buf *dmabuf, u16 *steering_tag, u8 *ph,
> +		       u8 st_width);

how about rename steering_tag to st?

> +
>  	/**
>  	 * @map_dma_buf:
>  	 *
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 5de618a3a5ee..53b2bbd9fc1e 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -1534,6 +1534,41 @@ struct vfio_device_feature_dma_buf {
>   */
>  #define VFIO_DEVICE_FEATURE_MIG_PRECOPY_INFOv2  12
>  
> +/**
> + * Upon VFIO_DEVICE_FEATURE_SET associate TPH (TLP Processing Hints) metadata
> + * with a vfio-exported dma-buf. The dma-buf must have been created by
> + * VFIO_DEVICE_FEATURE_DMA_BUF on this device.
> + *
> + * dmabuf_fd is the file descriptor returned by VFIO_DEVICE_FEATURE_DMA_BUF.
> + *
> + * 8-bit ST (steering_tag) and 16-bit Extended ST (steering_tag_ext) are
> + * distinct namespaces in the PCIe TPH ST table; userspace should populate
> + * the value(s) it has from the firmware ST table for this device and set
> + * the matching VFIO_DMA_BUF_TPH_ST / VFIO_DMA_BUF_TPH_ST_EXT bit in @flags.
> + * An importer requests a specific width and receives the matching value;
> + * if the requested width is not present, the importer is told TPH is
> + * unavailable for this dma-buf.
> + *
> + * ph is the 2-bit TLP Processing Hint and must be in the range [0, 3].
> + *
> + * The user must set TPH on the dma-buf before the importer consumes it.
> + *
> + * Return: 0 on success, -errno on failure.

-1 and errno is set on failure.

> + */
> +#define VFIO_DEVICE_FEATURE_DMA_BUF_TPH 13
> +
> +#define VFIO_DMA_BUF_TPH_ST		(1 << 0)  /* steering_tag valid */
> +#define VFIO_DMA_BUF_TPH_ST_EXT		(1 << 1)  /* steering_tag_ext valid */

It could be represented by judge whether steering_tag/ext == 0

> +
> +struct vfio_device_feature_dma_buf_tph {
> +	__s32	dmabuf_fd;
> +	__u32	flags;
> +	__u16	steering_tag;
> +	__u16	steering_tag_ext;
> +	__u8	ph;
> +	__u8	reserved[3];

How about:
struct vfio_device_feature_dma_buf_tph {
	__s32	dmabuf_fd;
	__u16	st_ext;
	__u8	st;
	__u8	ph;
}
If st_ext is not zero means it valid, and also with st field.

Thanks

> +};
> +
>  /* -------- API for Type1 VFIO IOMMU -------- */
>  
>  /**


^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [PATCH v3 2/2] RDMA/mlx5: get tph for p2p access when registering dma-buf mr
       [not found] ` <20260512184755.4137227-3-zhipingz@meta.com>
@ 2026-05-13  1:49   ` fengchengwen
  2026-05-13  6:37     ` Zhiping Zhang
  0 siblings, 1 reply; 3+ messages in thread
From: fengchengwen @ 2026-05-13  1:49 UTC (permalink / raw)
  To: Zhiping Zhang, Alex Williamson, Jason Gunthorpe, Leon Romanovsky
  Cc: Bjorn Helgaas, kvm, linux-rdma, linux-pci, netdev, dri-devel,
	Keith Busch, Yochai Cohen, Yishai Hadas

On 5/13/2026 2:47 AM, Zhiping Zhang wrote:
> Query dma-buf TPH metadata when registering a dma-buf MR for peer to
> peer access and translate the raw steering tag into an mlx5 steering
> tag index. Factor mlx5_st_alloc_index() so callers that already have a
> raw steering tag can allocate the corresponding mlx5 index directly.
> Keep the DMAH path as the first priority and only fall back to dma-buf
> metadata when no DMAH is supplied.
> 
> Add pcie_tph_get_st_width() so the mlx5 IB driver can query the
> device's negotiated ST width without poking pci_dev::tph_req_type
> directly (that field is gated by CONFIG_PCIE_TPH and would otherwise
> break !CONFIG_PCIE_TPH builds). Pass the width to the dma-buf
> get_tph() callback so the exporter can return the value that matches
> the consumer's capability.

1\ Recommend the PCI/TPH modification be committed separately.
2\ How about rename it to pcie_tph_enabled_req_type() ? so we could
   use already defined macro:
#define   PCI_TPH_REQ_DISABLE		0x0 /* No TPH requests allowed */
#define   PCI_TPH_REQ_TPH_ONLY		0x1 /* TPH only requests allowed */
#define   PCI_TPH_REQ_EXT_TPH		0x3 /* Extended TPH requests allowed */

> 
> Pass the dma_buf pointer that the umem already resolved into
> get_tph_mr_dmabuf() instead of re-resolving the user-supplied fd.
> Re-resolving opens a TOCTOU where a concurrent dup2() can substitute a
> different dma_buf between umem creation and TPH lookup.
> 
> Track the per-MR ownership of the allocated mlx5 ST index on
> mlx5_ib_mr (dmabuf_st_index / dmabuf_st_owned) and release it once the
> firmware mkey no longer references it. Both the cached path
> (mlx5r_umr_revoke_mr_with_lock + ib_frmr_pool_push) and the
> destroy_mkey path call mlx5_ib_mr_put_dmabuf_st() so the ST index does
> not leak when the MR is reused from the FRMR pool.
> 
> Initialize ret in mlx5_st_create() so the cached steering-tag path
> returns success cleanly under clang builds.
> 
> Signed-off-by: Zhiping Zhang <zhipingz@meta.com>
> ---
>  drivers/infiniband/hw/mlx5/mlx5_ib.h          |  6 ++
>  drivers/infiniband/hw/mlx5/mr.c               | 72 ++++++++++++++++++-
>  .../net/ethernet/mellanox/mlx5/core/lib/st.c  | 27 ++++---
>  drivers/pci/tph.c                             | 20 ++++++
>  include/linux/mlx5/driver.h                   |  7 ++
>  include/linux/pci-tph.h                       |  2 +
>  6 files changed, 124 insertions(+), 10 deletions(-)
> 

...

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [PATCH v3 2/2] RDMA/mlx5: get tph for p2p access when registering dma-buf mr
  2026-05-13  1:49   ` [PATCH v3 2/2] RDMA/mlx5: get tph for p2p access when registering dma-buf mr fengchengwen
@ 2026-05-13  6:37     ` Zhiping Zhang
  0 siblings, 0 replies; 3+ messages in thread
From: Zhiping Zhang @ 2026-05-13  6:37 UTC (permalink / raw)
  To: fengchengwen
  Cc: Alex Williamson, Jason Gunthorpe, Leon Romanovsky, Bjorn Helgaas,
	kvm, linux-rdma, linux-pci, netdev, dri-devel, Keith Busch,
	Yochai Cohen, Yishai Hadas

On Tue, May 12, 2026 at 6:49 PM fengchengwen <fengchengwen@huawei.com> wrote:
>
> >
> On 5/13/2026 2:47 AM, Zhiping Zhang wrote:
> > Query dma-buf TPH metadata when registering a dma-buf MR for peer to
> > peer access and translate the raw steering tag into an mlx5 steering
> > tag index. Factor mlx5_st_alloc_index() so callers that already have a
> > raw steering tag can allocate the corresponding mlx5 index directly.
> > Keep the DMAH path as the first priority and only fall back to dma-buf
> > metadata when no DMAH is supplied.
> >
> > Add pcie_tph_get_st_width() so the mlx5 IB driver can query the
> > device's negotiated ST width without poking pci_dev::tph_req_type
> > directly (that field is gated by CONFIG_PCIE_TPH and would otherwise
> > break !CONFIG_PCIE_TPH builds). Pass the width to the dma-buf
> > get_tph() callback so the exporter can return the value that matches
> > the consumer's capability.
>
> 1\ Recommend the PCI/TPH modification be committed separately.
> 2\ How about rename it to pcie_tph_enabled_req_type() ? so we could
>    use already defined macro:
> #define   PCI_TPH_REQ_DISABLE           0x0 /* No TPH requests allowed */
> #define   PCI_TPH_REQ_TPH_ONLY          0x1 /* TPH only requests allowed */
> #define   PCI_TPH_REQ_EXT_TPH           0x3 /* Extended TPH requests allowed */
>

Hi Chengwen,
  Thanks for the great suggestions.
  1. Splitting the PCI/TPH helper change into a separate prep patch
sounds reasonable.
  2. I see your point about exposing the enabled TPH request type! I
want to take one
      more pass over the overall flow and switch to that if I don’t
find any issues.

  Zhiping

> >
> > Pass the dma_buf pointer that the umem already resolved into
> > get_tph_mr_dmabuf() instead of re-resolving the user-supplied fd.
> > Re-resolving opens a TOCTOU where a concurrent dup2() can substitute a
> > different dma_buf between umem creation and TPH lookup.
> >
> > Track the per-MR ownership of the allocated mlx5 ST index on
> > mlx5_ib_mr (dmabuf_st_index / dmabuf_st_owned) and release it once the
> > firmware mkey no longer references it. Both the cached path
> > (mlx5r_umr_revoke_mr_with_lock + ib_frmr_pool_push) and the
> > destroy_mkey path call mlx5_ib_mr_put_dmabuf_st() so the ST index does
> > not leak when the MR is reused from the FRMR pool.
> >
> > Initialize ret in mlx5_st_create() so the cached steering-tag path
> > returns success cleanly under clang builds.
> >
> > Signed-off-by: Zhiping Zhang <zhipingz@meta.com>
> > ---
> >  drivers/infiniband/hw/mlx5/mlx5_ib.h          |  6 ++
> >  drivers/infiniband/hw/mlx5/mr.c               | 72 ++++++++++++++++++-
> >  .../net/ethernet/mellanox/mlx5/core/lib/st.c  | 27 ++++---
> >  drivers/pci/tph.c                             | 20 ++++++
> >  include/linux/mlx5/driver.h                   |  7 ++
> >  include/linux/pci-tph.h                       |  2 +
> >  6 files changed, 124 insertions(+), 10 deletions(-)
> >
>
> ...

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2026-05-13  6:37 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20260512184755.4137227-1-zhipingz@meta.com>
     [not found] ` <20260512184755.4137227-2-zhipingz@meta.com>
2026-05-13  1:33   ` [PATCH v3 1/2] vfio: add dma-buf get_tph callback and DMA_BUF_TPH feature fengchengwen
     [not found] ` <20260512184755.4137227-3-zhipingz@meta.com>
2026-05-13  1:49   ` [PATCH v3 2/2] RDMA/mlx5: get tph for p2p access when registering dma-buf mr fengchengwen
2026-05-13  6:37     ` Zhiping Zhang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox