Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH v5 0/4] vfio/dma-buf: add TPH support for peer-to-peer access
       [not found] <20260526144401.1485788-1-zhipingz@meta.com>
@ 2026-05-27  6:55 ` Christian König
  2026-05-27 12:14   ` Jason Gunthorpe
       [not found] ` <20260526144401.1485788-3-zhipingz@meta.com>
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 11+ messages in thread
From: Christian König @ 2026-05-27  6:55 UTC (permalink / raw)
  To: Zhiping Zhang, Alex Williamson, Jason Gunthorpe, Leon Romanovsky,
	Sumit Semwal
  Cc: Bjorn Helgaas, kvm, linux-rdma, linux-pci, netdev, dri-devel,
	Keith Busch, Yochai Cohen, Yishai Hadas

On 5/26/26 16:43, Zhiping Zhang wrote:
> This series adds TLP Processing Hints (TPH) support to the VFIO dma-buf
> export path, allowing importing drivers (e.g. mlx5) to use the
> exporter's steering tag when performing peer-to-peer DMA into a
> VFIO-owned device.

I'm not an expert for TPH, but that sounds very strange to me.

As far as I know the TLP Processing Hints allow devices to give a steering tag to the root complex together with memory accesses to give fine grained control about cache usage. In other words it is an extension to the classic snoop bit.

For P2P that is obviously nonsense because we don't have P2P support for cached accesses.

So what puzzle piece I'm missing?

Regards,
Christian.

> 
> Patch 1 exposes the enabled TPH requester type through a small PCI/TPH
> helper so callers don't reach into pci_dev internals.
> Patch 2 adds the optional dma_buf_ops::get_tph callback to the dma-buf
> framework so importers can fetch TPH metadata from an exporter.
> Patch 3 implements get_tph in vfio-pci and adds the new uAPI
> (VFIO_DEVICE_FEATURE_DMA_BUF_TPH) for userspace to attach the metadata.
> Patch 4 wires up the mlx5 RDMA driver as a consumer.
> 
> Previous link:
> v4: https://lore.kernel.org/linux-pci/20260519201401.1558410-1-zhipingz@meta.com/
> v3: https://lore.kernel.org/linux-pci/20260512184755.4137227-1-zhipingz@meta.com/
> v2: https://lore.kernel.org/linux-pci/20260430200704.352228-1-zhipingz@meta.com/
> 
> Zhiping Zhang (4):
>   PCI/TPH: expose the enabled TPH requester type
>   dma-buf: add optional get_tph() callback
>   vfio/pci: implement get_tph and DMA_BUF_TPH feature
>   RDMA/mlx5: get tph for p2p access when registering dma-buf mr
> 
>  drivers/infiniband/hw/mlx5/mlx5_ib.h          |   6 +
>  drivers/infiniband/hw/mlx5/mr.c               |  86 +++++++++++++-
>  .../net/ethernet/mellanox/mlx5/core/lib/st.c  |  28 +++--
>  drivers/pci/tph.c                             |  12 ++
>  drivers/vfio/pci/vfio_pci_core.c              |   3 +
>  drivers/vfio/pci/vfio_pci_dmabuf.c            | 110 +++++++++++++++++-
>  drivers/vfio/pci/vfio_pci_priv.h              |  12 ++
>  include/linux/dma-buf.h                       |  21 ++++
>  include/linux/mlx5/driver.h                   |   7 ++
>  include/linux/pci-tph.h                       |   2 +
>  include/uapi/linux/vfio.h                     |  37 ++++++
>  11 files changed, 311 insertions(+), 13 deletions(-)
> 
> --
> 2.53.0-Meta
> 


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v5 2/4] dma-buf: add optional get_tph() callback
       [not found] ` <20260526144401.1485788-3-zhipingz@meta.com>
@ 2026-05-27  6:57   ` Christian König
  2026-05-27 17:03   ` Alex Williamson
  1 sibling, 0 replies; 11+ messages in thread
From: Christian König @ 2026-05-27  6:57 UTC (permalink / raw)
  To: Zhiping Zhang, Alex Williamson, Jason Gunthorpe, Leon Romanovsky,
	Sumit Semwal
  Cc: Bjorn Helgaas, kvm, linux-rdma, linux-pci, netdev, dri-devel,
	Keith Busch, Yochai Cohen, Yishai Hadas



On 5/26/26 16:43, Zhiping Zhang wrote:
> [Sie erhalten nicht häufig E-Mails von zhipingz@meta.com. Weitere Informationen, warum dies wichtig ist, finden Sie unter https://aka.ms/LearnAboutSenderIdentification ]
> 
> Add an optional dma-buf get_tph callback so an exporter can return TPH
> (TLP Processing Hints) metadata to an importer.
> 
> 8-bit ST and 16-bit Extended ST are distinct namespaces in the PCIe TPH
> ST table and may both be present with different values. The importer
> passes its supported steering-tag width and the exporter returns the
> matching value, or -EOPNOTSUPP if no metadata is available for that
> width.
> 
> The first user is VFIO_DEVICE_FEATURE_DMA_BUF_TPH in vfio-pci, with the
> mlx5 RDMA driver as the first importer.
> 
> Signed-off-by: Zhiping Zhang <zhipingz@meta.com>
> ---
>  include/linux/dma-buf.h | 21 +++++++++++++++++++++
>  1 file changed, 21 insertions(+)
> 
> diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
> index d1203da56fc5..49eb6ad644a2 100644
> --- a/include/linux/dma-buf.h
> +++ b/include/linux/dma-buf.h
> @@ -113,6 +113,27 @@ struct dma_buf_ops {
>          */
>         void (*unpin)(struct dma_buf_attachment *attach);
> 
> +       /**
> +        * @get_tph:
> +        * @dmabuf: DMA buffer for which to retrieve TPH metadata
> +        * @steering_tag: Returns the raw TPH steering tag for @st_width
> +        * @ph: Returns the TPH processing hint (2-bit value)

Returned values as last arguments please.

Regards,
Christian.

> +        * @st_width: Consumer's supported steering tag width in bits (8 or 16)
> +        *
> +        * Return the TPH (TLP Processing Hints) metadata associated with this
> +        * DMA buffer for the requested steering-tag width. 8-bit ST and 16-bit
> +        * Extended ST are distinct namespaces in the PCIe TPH ST table and may
> +        * both be present with different values, so the exporter must select the
> +        * value that matches @st_width and must not substitute one for the other.
> +        *
> +        * Return 0 on success, -EOPNOTSUPP if no metadata is available for the
> +        * requested width, or -EINVAL if @st_width is not 8 or 16.
> +        *
> +        * This callback is optional.
> +        */
> +       int (*get_tph)(struct dma_buf *dmabuf, u16 *steering_tag, u8 *ph,
> +                      u8 st_width);
> +
>         /**
>          * @map_dma_buf:
>          *
> --
> 2.53.0-Meta
> 


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v5 0/4] vfio/dma-buf: add TPH support for peer-to-peer access
  2026-05-27  6:55 ` [PATCH v5 0/4] vfio/dma-buf: add TPH support for peer-to-peer access Christian König
@ 2026-05-27 12:14   ` Jason Gunthorpe
  2026-05-27 12:23     ` Christian König
  0 siblings, 1 reply; 11+ messages in thread
From: Jason Gunthorpe @ 2026-05-27 12:14 UTC (permalink / raw)
  To: Christian König
  Cc: Zhiping Zhang, Alex Williamson, Leon Romanovsky, Sumit Semwal,
	Bjorn Helgaas, kvm, linux-rdma, linux-pci, netdev, dri-devel,
	Keith Busch, Yochai Cohen, Yishai Hadas

On Wed, May 27, 2026 at 08:55:49AM +0200, Christian König wrote:
> On 5/26/26 16:43, Zhiping Zhang wrote:
> > This series adds TLP Processing Hints (TPH) support to the VFIO dma-buf
> > export path, allowing importing drivers (e.g. mlx5) to use the
> > exporter's steering tag when performing peer-to-peer DMA into a
> > VFIO-owned device.
> 
> I'm not an expert for TPH, but that sounds very strange to me.
> 
> As far as I know the TLP Processing Hints allow devices to give a
> steering tag to the root complex together with memory accesses to
> give fine grained control about cache usage. In other words it is an
> extension to the classic snoop bit.

TPHS includes an bit of data on every TLP and the data transits to the
eventual completer.

It does not have to be a root port.
 
> For P2P that is obviously nonsense because we don't have P2P support
> for cached accesses.

For P2P the TPH data on the TLP will transit to the P2P completer
unchanged.

It is up to the completer do define what it does with the TPH data.

Typically root ports in CPUs will use TPH data for cache placement
instructions. But who knows what a P2P device will use it for.

In Linux the driver that owns the completing address space gets to
specify how the TPH data works based on its own device specific
knowledge.

Jason

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v5 0/4] vfio/dma-buf: add TPH support for peer-to-peer access
  2026-05-27 12:14   ` Jason Gunthorpe
@ 2026-05-27 12:23     ` Christian König
  2026-05-27 12:36       ` Jason Gunthorpe
  0 siblings, 1 reply; 11+ messages in thread
From: Christian König @ 2026-05-27 12:23 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Zhiping Zhang, Alex Williamson, Leon Romanovsky, Sumit Semwal,
	Bjorn Helgaas, kvm, linux-rdma, linux-pci, netdev, dri-devel,
	Keith Busch, Yochai Cohen, Yishai Hadas

On 5/27/26 14:14, Jason Gunthorpe wrote:
> On Wed, May 27, 2026 at 08:55:49AM +0200, Christian König wrote:
>> On 5/26/26 16:43, Zhiping Zhang wrote:
>>> This series adds TLP Processing Hints (TPH) support to the VFIO dma-buf
>>> export path, allowing importing drivers (e.g. mlx5) to use the
>>> exporter's steering tag when performing peer-to-peer DMA into a
>>> VFIO-owned device.
>>
>> I'm not an expert for TPH, but that sounds very strange to me.
>>
>> As far as I know the TLP Processing Hints allow devices to give a
>> steering tag to the root complex together with memory accesses to
>> give fine grained control about cache usage. In other words it is an
>> extension to the classic snoop bit.
> 
> TPHS includes an bit of data on every TLP and the data transits to the
> eventual completer.
> 
> It does not have to be a root port.
>  
>> For P2P that is obviously nonsense because we don't have P2P support
>> for cached accesses.
> 
> For P2P the TPH data on the TLP will transit to the P2P completer
> unchanged.
> 
> It is up to the completer do define what it does with the TPH data.
> 
> Typically root ports in CPUs will use TPH data for cache placement
> instructions. But who knows what a P2P device will use it for.
> 
> In Linux the driver that owns the completing address space gets to
> specify how the TPH data works based on its own device specific
> knowledge.

Yeah that's a good point, I should probably rephrase the question.

I'm aware of how TPH works by adding the extra ST to the TLP.

But my question is how is that useful to a PCIe endpoint? What is the effect of the ST here?

I only know about the caching use case for the root complex and that's clearly not the case here.

Regards,
Christian.


> 
> Jason


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v5 0/4] vfio/dma-buf: add TPH support for peer-to-peer access
  2026-05-27 12:23     ` Christian König
@ 2026-05-27 12:36       ` Jason Gunthorpe
  2026-05-27 12:53         ` Christian König
  0 siblings, 1 reply; 11+ messages in thread
From: Jason Gunthorpe @ 2026-05-27 12:36 UTC (permalink / raw)
  To: Christian König
  Cc: Zhiping Zhang, Alex Williamson, Leon Romanovsky, Sumit Semwal,
	Bjorn Helgaas, kvm, linux-rdma, linux-pci, netdev, dri-devel,
	Keith Busch, Yochai Cohen, Yishai Hadas

On Wed, May 27, 2026 at 02:23:46PM +0200, Christian König wrote:

> Yeah that's a good point, I should probably rephrase the question.
> 
> I'm aware of how TPH works by adding the extra ST to the TLP.
> 
> But my question is how is that useful to a PCIe endpoint? What is the effect of the ST here?

TBH I've never heard Meta explain what their device is doing with
it. At least it seems to be super important to their device..

Jason

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v5 0/4] vfio/dma-buf: add TPH support for peer-to-peer access
  2026-05-27 12:36       ` Jason Gunthorpe
@ 2026-05-27 12:53         ` Christian König
  0 siblings, 0 replies; 11+ messages in thread
From: Christian König @ 2026-05-27 12:53 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Zhiping Zhang, Alex Williamson, Leon Romanovsky, Sumit Semwal,
	Bjorn Helgaas, kvm, linux-rdma, linux-pci, netdev, dri-devel,
	Keith Busch, Yochai Cohen, Yishai Hadas

On 5/27/26 14:36, Jason Gunthorpe wrote:
> On Wed, May 27, 2026 at 02:23:46PM +0200, Christian König wrote:
> 
>> Yeah that's a good point, I should probably rephrase the question.
>>
>> I'm aware of how TPH works by adding the extra ST to the TLP.
>>
>> But my question is how is that useful to a PCIe endpoint? What is the effect of the ST here?
> 
> TBH I've never heard Meta explain what their device is doing with
> it. At least it seems to be super important to their device..

Yeah I think at least a brief description of what is going on here would be necessary for the review.

Otherwise we have only the info that the exporter wants to give an opaque ST for the importer to use and no technical description what that is good for, how to test it etc...

Regards,
Christian.

> 
> Jason


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v5 2/4] dma-buf: add optional get_tph() callback
       [not found] ` <20260526144401.1485788-3-zhipingz@meta.com>
  2026-05-27  6:57   ` [PATCH v5 2/4] dma-buf: add optional get_tph() callback Christian König
@ 2026-05-27 17:03   ` Alex Williamson
  1 sibling, 0 replies; 11+ messages in thread
From: Alex Williamson @ 2026-05-27 17:03 UTC (permalink / raw)
  To: Zhiping Zhang
  Cc: Jason Gunthorpe, Leon Romanovsky, Sumit Semwal, Christian Konig,
	Bjorn Helgaas, kvm, linux-rdma, linux-pci, netdev, dri-devel,
	Keith Busch, Yochai Cohen, Yishai Hadas, alex

On Tue, 26 May 2026 07:43:54 -0700
Zhiping Zhang <zhipingz@meta.com> wrote:

> Add an optional dma-buf get_tph callback so an exporter can return TPH
> (TLP Processing Hints) metadata to an importer.
> 
> 8-bit ST and 16-bit Extended ST are distinct namespaces in the PCIe TPH
> ST table and may both be present with different values. The importer
> passes its supported steering-tag width and the exporter returns the
> matching value, or -EOPNOTSUPP if no metadata is available for that
> width.
> 
> The first user is VFIO_DEVICE_FEATURE_DMA_BUF_TPH in vfio-pci, with the
> mlx5 RDMA driver as the first importer.
> 
> Signed-off-by: Zhiping Zhang <zhipingz@meta.com>
> ---
>  include/linux/dma-buf.h | 21 +++++++++++++++++++++
>  1 file changed, 21 insertions(+)
> 
> diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
> index d1203da56fc5..49eb6ad644a2 100644
> --- a/include/linux/dma-buf.h
> +++ b/include/linux/dma-buf.h
> @@ -113,6 +113,27 @@ struct dma_buf_ops {
>  	 */
>  	void (*unpin)(struct dma_buf_attachment *attach);
>  
> +	/**
> +	 * @get_tph:
> +	 * @dmabuf: DMA buffer for which to retrieve TPH metadata
> +	 * @steering_tag: Returns the raw TPH steering tag for @st_width
> +	 * @ph: Returns the TPH processing hint (2-bit value)
> +	 * @st_width: Consumer's supported steering tag width in bits (8 or 16)
> +	 *
> +	 * Return the TPH (TLP Processing Hints) metadata associated with this
> +	 * DMA buffer for the requested steering-tag width. 8-bit ST and 16-bit
> +	 * Extended ST are distinct namespaces in the PCIe TPH ST table and may
> +	 * both be present with different values, so the exporter must select the
> +	 * value that matches @st_width and must not substitute one for the other.
> +	 *
> +	 * Return 0 on success, -EOPNOTSUPP if no metadata is available for the
> +	 * requested width, or -EINVAL if @st_width is not 8 or 16.
> +	 *
> +	 * This callback is optional.
> +	 */
> +	int (*get_tph)(struct dma_buf *dmabuf, u16 *steering_tag, u8 *ph,
> +		       u8 st_width);

Why not make this `bool extended` rather than `u8 st_width` to avoid
the entire class of errors involving an invalid width?  Thanks,

Alex

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v5 3/4] vfio/pci: implement get_tph and DMA_BUF_TPH feature
       [not found] ` <20260526144401.1485788-4-zhipingz@meta.com>
@ 2026-05-27 18:06   ` Alex Williamson
  0 siblings, 0 replies; 11+ messages in thread
From: Alex Williamson @ 2026-05-27 18:06 UTC (permalink / raw)
  To: Zhiping Zhang
  Cc: Jason Gunthorpe, Leon Romanovsky, Sumit Semwal, Christian Konig,
	Bjorn Helgaas, kvm, linux-rdma, linux-pci, netdev, dri-devel,
	Keith Busch, Yochai Cohen, Yishai Hadas, alex

On Tue, 26 May 2026 07:43:55 -0700
Zhiping Zhang <zhipingz@meta.com> wrote:

> Implement the dma-buf get_tph callback for vfio-pci-exported dma-bufs
> and add VFIO_DEVICE_FEATURE_DMA_BUF_TPH so userspace can attach TPH
> metadata to such a dma-buf.
> 
> 8-bit ST and 16-bit Extended ST are distinct PCIe TPH namespaces; the
> uAPI carries both with explicit validity flags, and get_tph() returns
> the value matching the importer's requested width (or -EOPNOTSUPP).
> 
> The TPH descriptor is published under a new per-dma-buf mutex
> (priv->lock) and read by get_tph() under the same mutex. The same
> mutex serialises with the priv->vdev clear in
> vfio_pci_dma_buf_cleanup() so a SET racing with device teardown
> cannot observe a half-detached dma-buf. memory_lock remain on the
> existing dma-buf paths; the outer order is memory_lock -> priv->lock.
> 
> Signed-off-by: Zhiping Zhang <zhipingz@meta.com>
> ---
>  drivers/vfio/pci/vfio_pci_core.c   |   3 +
>  drivers/vfio/pci/vfio_pci_dmabuf.c | 110 ++++++++++++++++++++++++++++-
>  drivers/vfio/pci/vfio_pci_priv.h   |  12 ++++
>  include/uapi/linux/vfio.h          |  37 ++++++++++
>  4 files changed, 161 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index 050e7542952e..4fa36f2f7555 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -1569,6 +1569,9 @@ int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
>  		return vfio_pci_core_feature_token(vdev, flags, arg, argsz);
>  	case VFIO_DEVICE_FEATURE_DMA_BUF:
>  		return vfio_pci_core_feature_dma_buf(vdev, flags, arg, argsz);
> +	case VFIO_DEVICE_FEATURE_DMA_BUF_TPH:
> +		return vfio_pci_core_feature_dma_buf_tph(vdev, flags, arg,
> +							 argsz);
>  	default:
>  		return -ENOTTY;
>  	}
> diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c
> index 1a177ce7de54..3ea2978c376c 100644
> --- a/drivers/vfio/pci/vfio_pci_dmabuf.c
> +++ b/drivers/vfio/pci/vfio_pci_dmabuf.c
> @@ -19,7 +19,19 @@ struct vfio_pci_dma_buf {
>  	u32 nr_ranges;
>  	struct kref kref;
>  	struct completion comp;
> -	u8 revoked : 1;
> +	/*
> +	 * @lock serializes TPH SET vs get_tph and the priv->vdev clear in
> +	 * vfio_pci_dma_buf_cleanup(). It nests inside memory_lock:
> +	 * the outer order across these paths is
> +	 * memory_lock -> priv->lock.
> +	 */
> +	struct mutex lock;
> +	u8 tph_st_valid:1;	/* priv->lock */
> +	u8 tph_st_ext_valid:1;	/* priv->lock */
> +	u8 tph_ph:2;		/* priv->lock */
> +	u8 tph_st;		/* priv->lock */
> +	u16 tph_st_ext;		/* priv->lock */
> +	u8 revoked:1;		/* dma_resv_lock */
>  };
>  
>  static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf,
> @@ -69,6 +81,38 @@ vfio_pci_dma_buf_map(struct dma_buf_attachment *attachment,
>  	return ret;
>  }
>  
> +static int vfio_pci_dma_buf_get_tph(struct dma_buf *dmabuf, u16 *steering_tag,
> +				    u8 *ph, u8 st_width)
> +{
> +	struct vfio_pci_dma_buf *priv = dmabuf->priv;
> +	int ret = 0;
> +
> +	mutex_lock(&priv->lock);

Use guard semantics and drop ret, return from the error branch directly.

> +	switch (st_width) {
> +	case 8:
> +		if (!priv->tph_st_valid) {
> +			ret = -EOPNOTSUPP;
> +			break;
> +		}
> +		*steering_tag = priv->tph_st;
> +		*ph = priv->tph_ph;
> +		break;
> +	case 16:
> +		if (!priv->tph_st_ext_valid) {
> +			ret = -EOPNOTSUPP;
> +			break;
> +		}
> +		*steering_tag = priv->tph_st_ext;
> +		*ph = priv->tph_ph;
> +		break;
> +	default:
> +		ret = -EINVAL;
> +		break;
> +	}
> +	mutex_unlock(&priv->lock);
> +	return ret;
> +}
> +
>  static void vfio_pci_dma_buf_unmap(struct dma_buf_attachment *attachment,
>  				   struct sg_table *sgt,
>  				   enum dma_data_direction dir)
> @@ -95,12 +139,14 @@ static void vfio_pci_dma_buf_release(struct dma_buf *dmabuf)
>  		up_write(&priv->vdev->memory_lock);
>  		vfio_device_put_registration(&priv->vdev->vdev);
>  	}
> +	mutex_destroy(&priv->lock);
>  	kfree(priv->phys_vec);
>  	kfree(priv);
>  }
>  
>  static const struct dma_buf_ops vfio_pci_dmabuf_ops = {
>  	.attach = vfio_pci_dma_buf_attach,
> +	.get_tph = vfio_pci_dma_buf_get_tph,
>  	.map_dma_buf = vfio_pci_dma_buf_map,
>  	.unmap_dma_buf = vfio_pci_dma_buf_unmap,
>  	.release = vfio_pci_dma_buf_release,
> @@ -265,6 +311,7 @@ int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
>  		ret = -ENOMEM;
>  		goto err_free_ranges;
>  	}
> +	mutex_init(&priv->lock);
>  	priv->phys_vec = kzalloc_objs(*priv->phys_vec, get_dma_buf.nr_ranges);
>  	if (!priv->phys_vec) {
>  		ret = -ENOMEM;
> @@ -327,12 +374,71 @@ int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
>  err_free_phys:
>  	kfree(priv->phys_vec);
>  err_free_priv:
> +	mutex_destroy(&priv->lock);
>  	kfree(priv);
>  err_free_ranges:
>  	kfree(dma_ranges);
>  	return ret;
>  }
>  
> +int vfio_pci_core_feature_dma_buf_tph(struct vfio_pci_core_device *vdev,
> +				      u32 flags,
> +				      struct vfio_device_feature_dma_buf_tph __user *arg,
> +				      size_t argsz)
> +{
> +	struct vfio_device_feature_dma_buf_tph set_tph;
> +	struct vfio_pci_dma_buf *priv;
> +	struct dma_buf *dmabuf;
> +	int ret;
> +
> +	ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_SET,
> +				 sizeof(set_tph));
> +	if (ret != 1)
> +		return ret;

Don't we need to test whether the device supports the TPH capability
somewhere here?  AIUI the device must exposed a TPH capability to act
as a TPH completer.

> +
> +	if (copy_from_user(&set_tph, arg, sizeof(set_tph)))
> +		return -EFAULT;
> +
> +	if (set_tph.flags & ~(VFIO_DMA_BUF_TPH_ST | VFIO_DMA_BUF_TPH_ST_EXT))
> +		return -EINVAL;
> +
> +	if (!set_tph.flags)
> +		return -EINVAL;

This seems to block the path to marking both steering tags to invalid
without any particular reason to do so.

> +
> +	/* PCIe TLP Processing Hint is a 2-bit field. */
> +	if (set_tph.ph & ~0x3)
> +		return -EINVAL;
> +
> +	dmabuf = dma_buf_get(set_tph.dmabuf_fd);
> +	if (IS_ERR(dmabuf))
> +		return PTR_ERR(dmabuf);
> +
> +	if (dmabuf->ops != &vfio_pci_dmabuf_ops) {
> +		ret = -EINVAL;
> +		goto out_put;
> +	}
> +
> +	priv = dmabuf->priv;
> +	mutex_lock(&priv->lock);

Use guards.

> +	if (priv->vdev != vdev) {

For here and the next chunk, why is priv->vdev pulled into this lock?
We're calling an ioctl on the vdev, that's stable.  We have a reference
to the dma-buf, that's stable.  I think the ordering prevents this from
needing to be under the lock, which probably means it should be
s/lock/tph_lock/.

> +		ret = -EINVAL;
> +		goto out_unlock;
> +	}
> +
> +	priv->tph_st = set_tph.steering_tag;
> +	priv->tph_st_ext = set_tph.steering_tag_ext;
> +	priv->tph_ph = set_tph.ph;
> +	priv->tph_st_valid = !!(set_tph.flags & VFIO_DMA_BUF_TPH_ST);
> +	priv->tph_st_ext_valid = !!(set_tph.flags & VFIO_DMA_BUF_TPH_ST_EXT);
> +	ret = 0;
> +
> +out_unlock:
> +	mutex_unlock(&priv->lock);
> +out_put:
> +	dma_buf_put(dmabuf);
> +	return ret;
> +}
> +
>  void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked)
>  {
>  	struct vfio_pci_dma_buf *priv;
> @@ -398,7 +504,9 @@ void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev)
>  			continue;
>  
>  		list_del_init(&priv->dmabufs_elm);
> +		mutex_lock(&priv->lock);
>  		priv->vdev = NULL;
> +		mutex_unlock(&priv->lock);

As above, seems unnecessary.

>  		vfio_device_put_registration(&vdev->vdev);
>  		fput(priv->dmabuf->file);
>  	}
> diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h
> index fca9d0dfac90..c58f369be4b3 100644
> --- a/drivers/vfio/pci/vfio_pci_priv.h
> +++ b/drivers/vfio/pci/vfio_pci_priv.h
> @@ -118,6 +118,10 @@ static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
>  int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
>  				  struct vfio_device_feature_dma_buf __user *arg,
>  				  size_t argsz);
> +int vfio_pci_core_feature_dma_buf_tph(struct vfio_pci_core_device *vdev,
> +				      u32 flags,
> +				      struct vfio_device_feature_dma_buf_tph __user *arg,
> +				      size_t argsz);
>  void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev);
>  void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked);
>  #else
> @@ -128,6 +132,14 @@ vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
>  {
>  	return -ENOTTY;
>  }
> +
> +static inline int
> +vfio_pci_core_feature_dma_buf_tph(struct vfio_pci_core_device *vdev, u32 flags,
> +				  struct vfio_device_feature_dma_buf_tph __user *arg,
> +				  size_t argsz)
> +{
> +	return -ENOTTY;
> +}
>  static inline void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev)
>  {
>  }
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 5de618a3a5ee..55cac3b7122c 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -1534,6 +1534,43 @@ struct vfio_device_feature_dma_buf {
>   */
>  #define VFIO_DEVICE_FEATURE_MIG_PRECOPY_INFOv2  12
>  
> +/**
> + * Upon VFIO_DEVICE_FEATURE_SET associate TPH (TLP Processing Hints) metadata
> + * with a vfio-exported dma-buf. The dma-buf must have been created by
> + * VFIO_DEVICE_FEATURE_DMA_BUF on this device.
> + *
> + * dmabuf_fd is the file descriptor returned by VFIO_DEVICE_FEATURE_DMA_BUF.
> + *
> + * 8-bit ST (steering_tag) and 16-bit Extended ST (steering_tag_ext) are
> + * distinct namespaces in the PCIe TPH ST table and may both be present with
> + * different values. Userspace should populate the value(s) it has from the
> + * firmware ST table for this device and set the matching VFIO_DMA_BUF_TPH_ST /
> + * VFIO_DMA_BUF_TPH_ST_EXT bit in @flags. An importer requests a specific
> + * width and receives the matching value; if the requested width is not
> + * present, the importer is told TPH is unavailable for this dma-buf.
> + *
> + * ph is the 2-bit TLP Processing Hint and must be in the range [0, 3].
> + *
> + * Userspace is responsible for setting TPH on the dma-buf before handing the
> + * fd to the importer. Calling SET again replaces the previously published
> + * values; racing a SET against an importer that is already consuming the
> + * dma-buf is a userspace ordering problem.
> + *
> + * Return: 0 on success, -errno on failure.
> + */
> +#define VFIO_DEVICE_FEATURE_DMA_BUF_TPH 13
> +
> +#define VFIO_DMA_BUF_TPH_ST		(1 << 0)  /* steering_tag valid */
> +#define VFIO_DMA_BUF_TPH_ST_EXT		(1 << 1)  /* steering_tag_ext valid */
> +
> +struct vfio_device_feature_dma_buf_tph {
> +	__s32	dmabuf_fd;
> +	__u32	flags;
> +	__u8	steering_tag;
> +	__u8	ph;

Nit, I'd swap the order of these so the steering_tag fields are
back-to-back.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v5 4/4] RDMA/mlx5: get tph for p2p access when registering dma-buf mr
       [not found] ` <20260526144401.1485788-5-zhipingz@meta.com>
@ 2026-05-27 19:00   ` Alex Williamson
  2026-05-27 22:55   ` Michael Gur
  1 sibling, 0 replies; 11+ messages in thread
From: Alex Williamson @ 2026-05-27 19:00 UTC (permalink / raw)
  To: Zhiping Zhang
  Cc: Jason Gunthorpe, Leon Romanovsky, Sumit Semwal, Christian Konig,
	Bjorn Helgaas, kvm, linux-rdma, linux-pci, netdev, dri-devel,
	Keith Busch, Yochai Cohen, Yishai Hadas, alex

On Tue, 26 May 2026 07:43:56 -0700
Zhiping Zhang <zhipingz@meta.com> wrote:

> Query dma-buf TPH metadata when registering a dma-buf MR for peer-to-
> peer access and translate the returned steering tag into an mlx5 ST
> index. Keep the DMAH path as the first priority and only fall back to
> DMA-buf metadata when no DMAH is supplied.
> 
> Track per-MR ownership of the allocated ST index and release it on MR
> setup failure, destroy, and FRMR-pool reuse. Release the ST index before
> the MR is pushed back into the FRMR pool, and free mlx5_st_idx_data when
> its refcount reaches zero so repeated allocation/deallocation does not
> leak memory.
> 
> Signed-off-by: Zhiping Zhang <zhipingz@meta.com>
> ---
>  drivers/infiniband/hw/mlx5/mlx5_ib.h          |  6 ++
>  drivers/infiniband/hw/mlx5/mr.c               | 86 ++++++++++++++++++-
>  .../net/ethernet/mellanox/mlx5/core/lib/st.c  | 28 ++++--
>  include/linux/mlx5/driver.h                   |  7 ++
>  4 files changed, 115 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
> index e156dc4d7529..4ab867392267 100644
> --- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
> +++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
> @@ -721,6 +721,12 @@ struct mlx5_ib_mr {
>  			u8 revoked :1;
>  			/* Indicates previous dmabuf page fault occurred */
>  			u8 dmabuf_faulted:1;
> +			/* Set when the MR owns dmabuf_st_index and must
> +			 * release it via mlx5_st_dealloc_index() once the
> +			 * firmware mkey is no longer referencing it.
> +			 */
> +			u8 dmabuf_st_owned:1;
> +			u16 dmabuf_st_index;
>  			struct mlx5_ib_mkey null_mmkey;
>  		};
>  	};
> diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
> index 3b6da45061a5..8059b5e4da97 100644
> --- a/drivers/infiniband/hw/mlx5/mr.c
> +++ b/drivers/infiniband/hw/mlx5/mr.c
> @@ -38,6 +38,7 @@
>  #include <linux/delay.h>
>  #include <linux/dma-buf.h>
>  #include <linux/dma-resv.h>
> +#include <linux/pci-tph.h>
>  #include <rdma/frmr_pools.h>
>  #include <rdma/ib_umem_odp.h>
>  #include "dm.h"
> @@ -46,6 +47,8 @@
>  #include "data_direct.h"
>  #include "dmah.h"
>  
> +MODULE_IMPORT_NS("DMA_BUF");
> +

This doesn't appear to add any dma-buf namespace dependencies.

>  static int mkey_max_umr_order(struct mlx5_ib_dev *dev)
>  {
>  	if (MLX5_CAP_GEN(dev->mdev, umr_extended_translation_offset))
> @@ -899,6 +902,63 @@ static struct dma_buf_attach_ops mlx5_ib_dmabuf_attach_ops = {
>  	.invalidate_mappings = mlx5_ib_dmabuf_invalidate_cb,
>  };
>  
> +/*
> + * Query TPH metadata from @dmabuf and translate the raw steering tag into
> + * an mlx5 ST index. On success, returns 0 and the caller becomes the
> + * owner of *@st_index (must be released with mlx5_st_dealloc_index()
> + * once the firmware mkey no longer references it). On any failure
> + * *@st_index and *@ph are left as the no-TPH defaults set by the caller.
> + *
> + * @dmabuf must already be referenced by the caller (e.g. via the umem's
> + * attachment) so we don't re-resolve the user's fd here and avoid a
> + * dup2() TOCTOU between umem creation and TPH lookup.
> + */
> +static void get_tph_mr_dmabuf(struct mlx5_ib_dev *dev, struct dma_buf *dmabuf,
> +			      u16 *st_index, u8 *ph)
> +{
> +	u8 req_type;
> +	u16 steering_tag;
> +	u8 st_width;
> +	int ret;
> +
> +	if (!dmabuf->ops->get_tph)
> +		return;
> +
> +	req_type = pcie_tph_enabled_req_type(dev->mdev->pdev);
> +	switch (req_type) {
> +	case PCI_TPH_REQ_TPH_ONLY:
> +		st_width = 8;
> +		break;
> +	case PCI_TPH_REQ_EXT_TPH:
> +		st_width = 16;
> +		break;
> +	default:
> +		return;
> +	}
> +
> +	ret = dmabuf->ops->get_tph(dmabuf, &steering_tag, ph, st_width);
> +	if (ret) {
> +		mlx5_ib_dbg(dev, "get_tph failed (%d)\n", ret);
> +		*ph = MLX5_IB_NO_PH;
> +		return;
> +	}
> +
> +	ret = mlx5_st_alloc_index_by_tag(dev->mdev, steering_tag, st_index);
> +	if (ret) {
> +		*ph = MLX5_IB_NO_PH;
> +		mlx5_ib_dbg(dev, "st_alloc_index_by_tag failed (%d)\n", ret);
> +	}
> +}

ph handling is inconsistent, why not use a local variable and only set
the caller's pointer on success?

> +
> +static void mlx5_ib_mr_put_dmabuf_st(struct mlx5_ib_mr *mr)
> +{
> +	if (mr->umem && mr->dmabuf_st_owned) {
> +		mlx5_st_dealloc_index(mr_to_mdev(mr)->mdev,
> +				      mr->dmabuf_st_index);
> +		mr->dmabuf_st_owned = 0;
> +	}
> +}
> +
>  static struct ib_mr *
>  reg_user_mr_dmabuf(struct ib_pd *pd, struct device *dma_device,
>  		   u64 offset, u64 length, u64 virt_addr,
> @@ -941,16 +1001,26 @@ reg_user_mr_dmabuf(struct ib_pd *pd, struct device *dma_device,
>  		ph = dmah->ph;
>  		if (dmah->valid_fields & BIT(IB_DMAH_CPU_ID_EXISTS))
>  			st_index = mdmah->st_index;
> +	} else {
> +		get_tph_mr_dmabuf(dev, umem_dmabuf->attach->dmabuf,
> +				  &st_index, &ph);
>  	}
>  
>  	mr = alloc_cacheable_mr(pd, &umem_dmabuf->umem, virt_addr,
>  				access_flags, access_mode,
>  				st_index, ph);
>  	if (IS_ERR(mr)) {
> +		if (!dmah && st_index != MLX5_MKC_PCIE_TPH_NO_STEERING_TAG_INDEX)
> +			mlx5_st_dealloc_index(dev->mdev, st_index);
>  		ib_umem_release(&umem_dmabuf->umem);
>  		return ERR_CAST(mr);
>  	}
>  
> +	if (!dmah && st_index != MLX5_MKC_PCIE_TPH_NO_STEERING_TAG_INDEX) {
> +		mr->dmabuf_st_index = st_index;
> +		mr->dmabuf_st_owned = 1;
> +	}
> +
>  	mlx5_ib_dbg(dev, "mkey 0x%x\n", mr->mmkey.key);
>  
>  	atomic_add(ib_umem_num_pages(mr->umem), &dev->mdev->priv.reg_pages);
> @@ -1377,9 +1447,17 @@ static int mlx5r_handle_mkey_cleanup(struct mlx5_ib_mr *mr)
>  	bool is_odp = is_odp_mr(mr);
>  	int ret;
>  
> -	if (mr->ibmr.frmr.pool && !mlx5_umr_revoke_mr_with_lock(mr) &&
> -	    !ib_frmr_pool_push(mr->ibmr.device, &mr->ibmr))
> -		return 0;
> +	if (mr->ibmr.frmr.pool && !mlx5_umr_revoke_mr_with_lock(mr)) {
> +		/*
> +		 * The mkey has been revoked: firmware no longer references
> +		 * dmabuf_st_index, so release it before this mr can re-enter
> +		 * the FRMR cache for reuse by another registration.
> +		 */
> +		mlx5_ib_mr_put_dmabuf_st(mr);
> +
> +		if (!ib_frmr_pool_push(mr->ibmr.device, &mr->ibmr))
> +			return 0;
> +	}
>  
>  	if (is_odp)
>  		mutex_lock(&to_ib_umem_odp(mr->umem)->umem_mutex);
> @@ -1400,6 +1478,8 @@ static int mlx5r_handle_mkey_cleanup(struct mlx5_ib_mr *mr)
>  		dma_resv_unlock(
>  			to_ib_umem_dmabuf(mr->umem)->attach->dmabuf->resv);
>  	}
> +	if (!ret)
> +		mlx5_ib_mr_put_dmabuf_st(mr);
>  	return ret;
>  }
>  
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/st.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/st.c
> index 997be91f0a13..8929c17c88bc 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/lib/st.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/st.c
> @@ -29,7 +29,7 @@ struct mlx5_st *mlx5_st_create(struct mlx5_core_dev *dev)
>  	u8 direct_mode = 0;
>  	u16 num_entries;
>  	u32 tbl_loc;
> -	int ret;
> +	int ret = 0;

Unnecessary change.

>  
>  	if (!MLX5_CAP_GEN(dev, mkey_pcie_tph))
>  		return NULL;
> @@ -92,23 +92,18 @@ void mlx5_st_destroy(struct mlx5_core_dev *dev)
>  	kfree(st);
>  }
>  
> -int mlx5_st_alloc_index(struct mlx5_core_dev *dev, enum tph_mem_type mem_type,
> -			unsigned int cpu_uid, u16 *st_index)
> +int mlx5_st_alloc_index_by_tag(struct mlx5_core_dev *dev, u16 tag,
> +			       u16 *st_index)
>  {
>  	struct mlx5_st_idx_data *idx_data;
>  	struct mlx5_st *st = dev->st;
>  	unsigned long index;
>  	u32 xa_id;
> -	u16 tag;
> -	int ret;
> +	int ret = 0;
>  
>  	if (!st)
>  		return -EOPNOTSUPP;
>  
> -	ret = pcie_tph_get_cpu_st(dev->pdev, mem_type, cpu_uid, &tag);
> -	if (ret)
> -		return ret;
> -
>  	if (st->direct_mode) {
>  		*st_index = tag;
>  		return 0;
> @@ -152,6 +147,20 @@ int mlx5_st_alloc_index(struct mlx5_core_dev *dev, enum tph_mem_type mem_type,
>  	mutex_unlock(&st->lock);
>  	return ret;
>  }
> +EXPORT_SYMBOL_GPL(mlx5_st_alloc_index_by_tag);
> +
> +int mlx5_st_alloc_index(struct mlx5_core_dev *dev, enum tph_mem_type mem_type,
> +			unsigned int cpu_uid, u16 *st_index)
> +{
> +	u16 tag;
> +	int ret;
> +
> +	ret = pcie_tph_get_cpu_st(dev->pdev, mem_type, cpu_uid, &tag);
> +	if (ret)
> +		return ret;
> +
> +	return mlx5_st_alloc_index_by_tag(dev, tag, st_index);
> +}
>  EXPORT_SYMBOL_GPL(mlx5_st_alloc_index);
>  
>  int mlx5_st_dealloc_index(struct mlx5_core_dev *dev, u16 st_index)
> @@ -175,6 +184,7 @@ int mlx5_st_dealloc_index(struct mlx5_core_dev *dev, u16 st_index)
>  
>  	if (refcount_dec_and_test(&idx_data->usecount)) {
>  		xa_erase(&st->idx_xa, st_index);
> +		kfree(idx_data);
>  		/* We leave PCI config space as was before, no mkey will refer to it */
>  	}

Should this be pulled out as a fix separate from the feature added
here?  Thanks,

Alex

>  
> diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
> index 04b96c5abb57..523a9ab0ae1e 100644
> --- a/include/linux/mlx5/driver.h
> +++ b/include/linux/mlx5/driver.h
> @@ -1166,10 +1166,17 @@ int mlx5_dm_sw_icm_dealloc(struct mlx5_core_dev *dev, enum mlx5_sw_icm_type type
>  			   u64 length, u16 uid, phys_addr_t addr, u32 obj_id);
>  
>  #ifdef CONFIG_PCIE_TPH
> +int mlx5_st_alloc_index_by_tag(struct mlx5_core_dev *dev, u16 tag,
> +			       u16 *st_index);
>  int mlx5_st_alloc_index(struct mlx5_core_dev *dev, enum tph_mem_type mem_type,
>  			unsigned int cpu_uid, u16 *st_index);
>  int mlx5_st_dealloc_index(struct mlx5_core_dev *dev, u16 st_index);
>  #else
> +static inline int mlx5_st_alloc_index_by_tag(struct mlx5_core_dev *dev,
> +					     u16 tag, u16 *st_index)
> +{
> +	return -EOPNOTSUPP;
> +}
>  static inline int mlx5_st_alloc_index(struct mlx5_core_dev *dev,
>  				      enum tph_mem_type mem_type,
>  				      unsigned int cpu_uid, u16 *st_index)


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v5 1/4] PCI/TPH: expose the enabled TPH requester type
       [not found] ` <20260526144401.1485788-2-zhipingz@meta.com>
@ 2026-05-27 20:53   ` Alex Williamson
  0 siblings, 0 replies; 11+ messages in thread
From: Alex Williamson @ 2026-05-27 20:53 UTC (permalink / raw)
  To: Zhiping Zhang
  Cc: Jason Gunthorpe, Leon Romanovsky, Sumit Semwal, Christian Konig,
	Bjorn Helgaas, kvm, linux-rdma, linux-pci, netdev, dri-devel,
	Keith Busch, Yochai Cohen, Yishai Hadas, alex

On Tue, 26 May 2026 07:43:53 -0700
Zhiping Zhang <zhipingz@meta.com> wrote:

> Add pcie_tph_enabled_req_type() so drivers can query the enabled TPH
> requester mode without reaching into pci_dev internals.
> 
> This keeps pci_dev::tph_req_type inside the PCI/TPH code and provides a
> !CONFIG_PCIE_TPH stub for callers.
> 
> Signed-off-by: Zhiping Zhang <zhipingz@meta.com>
> ---
>  drivers/pci/tph.c       | 12 ++++++++++++
>  include/linux/pci-tph.h |  2 ++
>  2 files changed, 14 insertions(+)
> 
> diff --git a/drivers/pci/tph.c b/drivers/pci/tph.c
> index 91145e8d9d95..6c4492075ae9 100644
> --- a/drivers/pci/tph.c
> +++ b/drivers/pci/tph.c
> @@ -174,6 +174,18 @@ u32 pcie_tph_get_st_table_loc(struct pci_dev *pdev)
>  }
>  EXPORT_SYMBOL(pcie_tph_get_st_table_loc);
>  
> +/**
> + * pcie_tph_enabled_req_type - Return the device's enabled TPH requester type
> + * @pdev: PCI device to query
> + *
> + * Return: PCI_TPH_REQ_DISABLE, PCI_TPH_REQ_TPH_ONLY or PCI_TPH_REQ_EXT_TPH.
> + */
> +u8 pcie_tph_enabled_req_type(struct pci_dev *pdev)
> +{
> +	return pdev->tph_req_type;
> +}
> +EXPORT_SYMBOL(pcie_tph_enabled_req_type);
> +
>  /*
>   * Return the size of ST table. If ST table is not in TPH Requester Extended
>   * Capability space, return 0. Otherwise return the ST Table Size + 1.
> diff --git a/include/linux/pci-tph.h b/include/linux/pci-tph.h
> index be68cd17f2f8..fe572737b409 100644
> --- a/include/linux/pci-tph.h
> +++ b/include/linux/pci-tph.h
> @@ -30,6 +30,7 @@ void pcie_disable_tph(struct pci_dev *pdev);
>  int pcie_enable_tph(struct pci_dev *pdev, int mode);
>  u16 pcie_tph_get_st_table_size(struct pci_dev *pdev);
>  u32 pcie_tph_get_st_table_loc(struct pci_dev *pdev);
> +u8 pcie_tph_enabled_req_type(struct pci_dev *pdev);
>  #else
>  static inline int pcie_tph_set_st_entry(struct pci_dev *pdev,
>  					unsigned int index, u16 tag)
> @@ -41,6 +42,7 @@ static inline int pcie_tph_get_cpu_st(struct pci_dev *dev,
>  static inline void pcie_disable_tph(struct pci_dev *pdev) { }
>  static inline int pcie_enable_tph(struct pci_dev *pdev, int mode)
>  { return -EINVAL; }
> +static inline u8 pcie_tph_enabled_req_type(struct pci_dev *pdev) { return 0; }

nit, s/0/PCI_TPH_REQ_DISABLE/ for consistency.  Thanks,

Alex



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v5 4/4] RDMA/mlx5: get tph for p2p access when registering dma-buf mr
       [not found] ` <20260526144401.1485788-5-zhipingz@meta.com>
  2026-05-27 19:00   ` [PATCH v5 4/4] RDMA/mlx5: get tph for p2p access when registering dma-buf mr Alex Williamson
@ 2026-05-27 22:55   ` Michael Gur
  1 sibling, 0 replies; 11+ messages in thread
From: Michael Gur @ 2026-05-27 22:55 UTC (permalink / raw)
  To: Zhiping Zhang, Alex Williamson, Jason Gunthorpe, Leon Romanovsky,
	Sumit Semwal, Christian Konig
  Cc: Bjorn Helgaas, kvm, linux-rdma, linux-pci, netdev, dri-devel,
	Keith Busch, Yochai Cohen, Yishai Hadas


On 5/26/2026 5:43 PM, Zhiping Zhang wrote:
> Query dma-buf TPH metadata when registering a dma-buf MR for peer-to-
> peer access and translate the returned steering tag into an mlx5 ST
> index. Keep the DMAH path as the first priority and only fall back to
> DMA-buf metadata when no DMAH is supplied.
>
> Track per-MR ownership of the allocated ST index and release it on MR
> setup failure, destroy, and FRMR-pool reuse. Release the ST index before
> the MR is pushed back into the FRMR pool, and free mlx5_st_idx_data when
> its refcount reaches zero so repeated allocation/deallocation does not
> leak memory.
>
> Signed-off-by: Zhiping Zhang <zhipingz@meta.com>
> ---
>   drivers/infiniband/hw/mlx5/mlx5_ib.h          |  6 ++
>   drivers/infiniband/hw/mlx5/mr.c               | 86 ++++++++++++++++++-
>   .../net/ethernet/mellanox/mlx5/core/lib/st.c  | 28 ++++--
>   include/linux/mlx5/driver.h                   |  7 ++
>   4 files changed, 115 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
> index e156dc4d7529..4ab867392267 100644
> --- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
> +++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
> @@ -721,6 +721,12 @@ struct mlx5_ib_mr {
>   			u8 revoked :1;
>   			/* Indicates previous dmabuf page fault occurred */
>   			u8 dmabuf_faulted:1;
> +			/* Set when the MR owns dmabuf_st_index and must
> +			 * release it via mlx5_st_dealloc_index() once the
> +			 * firmware mkey is no longer referencing it.
> +			 */
mkey st value is kept after revoke, regardless of st alloc and dealloc.
mkeys are kept in FRMR pool for future reuse even if their st index is 
currently stale.
> +			u8 dmabuf_st_owned:1;
> +			u16 dmabuf_st_index;
st_index can be read from the frmr pool key. No need to store again.
>   			struct mlx5_ib_mkey null_mmkey;
>   		};
>   	};
> diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
> index 3b6da45061a5..8059b5e4da97 100644
> --- a/drivers/infiniband/hw/mlx5/mr.c
> +++ b/drivers/infiniband/hw/mlx5/mr.c
> @@ -38,6 +38,7 @@
>   #include <linux/delay.h>
>   #include <linux/dma-buf.h>
>   #include <linux/dma-resv.h>
> +#include <linux/pci-tph.h>
>   #include <rdma/frmr_pools.h>
>   #include <rdma/ib_umem_odp.h>
>   #include "dm.h"
> @@ -46,6 +47,8 @@
>   #include "data_direct.h"
>   #include "dmah.h"
>   
> +MODULE_IMPORT_NS("DMA_BUF");
> +
>   static int mkey_max_umr_order(struct mlx5_ib_dev *dev)
>   {
>   	if (MLX5_CAP_GEN(dev->mdev, umr_extended_translation_offset))
> @@ -899,6 +902,63 @@ static struct dma_buf_attach_ops mlx5_ib_dmabuf_attach_ops = {
>   	.invalidate_mappings = mlx5_ib_dmabuf_invalidate_cb,
>   };
>   
> +/*
> + * Query TPH metadata from @dmabuf and translate the raw steering tag into
> + * an mlx5 ST index. On success, returns 0 and the caller becomes the
> + * owner of *@st_index (must be released with mlx5_st_dealloc_index()
> + * once the firmware mkey no longer references it). On any failure
> + * *@st_index and *@ph are left as the no-TPH defaults set by the caller.
> + *
> + * @dmabuf must already be referenced by the caller (e.g. via the umem's
> + * attachment) so we don't re-resolve the user's fd here and avoid a
> + * dup2() TOCTOU between umem creation and TPH lookup.
> + */
> +static void get_tph_mr_dmabuf(struct mlx5_ib_dev *dev, struct dma_buf *dmabuf,
> +			      u16 *st_index, u8 *ph)
> +{
> +	u8 req_type;
> +	u16 steering_tag;
> +	u8 st_width;
> +	int ret;
> +
> +	if (!dmabuf->ops->get_tph)
> +		return;
> +
> +	req_type = pcie_tph_enabled_req_type(dev->mdev->pdev);
> +	switch (req_type) {
> +	case PCI_TPH_REQ_TPH_ONLY:
> +		st_width = 8;
> +		break;
> +	case PCI_TPH_REQ_EXT_TPH:
> +		st_width = 16;
> +		break;
> +	default:
> +		return;
> +	}
> +
> +	ret = dmabuf->ops->get_tph(dmabuf, &steering_tag, ph, st_width);
> +	if (ret) {
> +		mlx5_ib_dbg(dev, "get_tph failed (%d)\n", ret);
> +		*ph = MLX5_IB_NO_PH;
> +		return;
> +	}
> +
> +	ret = mlx5_st_alloc_index_by_tag(dev->mdev, steering_tag, st_index);
> +	if (ret) {
> +		*ph = MLX5_IB_NO_PH;
> +		mlx5_ib_dbg(dev, "st_alloc_index_by_tag failed (%d)\n", ret);
> +	}
> +}
> +
> +static void mlx5_ib_mr_put_dmabuf_st(struct mlx5_ib_mr *mr)
> +{
> +	if (mr->umem && mr->dmabuf_st_owned) {
> +		mlx5_st_dealloc_index(mr_to_mdev(mr)->mdev,
> +				      mr->dmabuf_st_index);
> +		mr->dmabuf_st_owned = 0;
> +	}
> +}
> +
>   static struct ib_mr *
>   reg_user_mr_dmabuf(struct ib_pd *pd, struct device *dma_device,
>   		   u64 offset, u64 length, u64 virt_addr,
> @@ -941,16 +1001,26 @@ reg_user_mr_dmabuf(struct ib_pd *pd, struct device *dma_device,
>   		ph = dmah->ph;
>   		if (dmah->valid_fields & BIT(IB_DMAH_CPU_ID_EXISTS))
>   			st_index = mdmah->st_index;
> +	} else {
> +		get_tph_mr_dmabuf(dev, umem_dmabuf->attach->dmabuf,
> +				  &st_index, &ph);
>   	}
>   
>   	mr = alloc_cacheable_mr(pd, &umem_dmabuf->umem, virt_addr,
>   				access_flags, access_mode,
>   				st_index, ph);
>   	if (IS_ERR(mr)) {
> +		if (!dmah && st_index != MLX5_MKC_PCIE_TPH_NO_STEERING_TAG_INDEX)
> +			mlx5_st_dealloc_index(dev->mdev, st_index);
>   		ib_umem_release(&umem_dmabuf->umem);
>   		return ERR_CAST(mr);
>   	}
>   
> +	if (!dmah && st_index != MLX5_MKC_PCIE_TPH_NO_STEERING_TAG_INDEX) {
> +		mr->dmabuf_st_index = st_index;
> +		mr->dmabuf_st_owned = 1;
> +	}
> +
>   	mlx5_ib_dbg(dev, "mkey 0x%x\n", mr->mmkey.key);
>   
>   	atomic_add(ib_umem_num_pages(mr->umem), &dev->mdev->priv.reg_pages);
> @@ -1377,9 +1447,17 @@ static int mlx5r_handle_mkey_cleanup(struct mlx5_ib_mr *mr)
>   	bool is_odp = is_odp_mr(mr);
>   	int ret;
>   
> -	if (mr->ibmr.frmr.pool && !mlx5_umr_revoke_mr_with_lock(mr) &&
> -	    !ib_frmr_pool_push(mr->ibmr.device, &mr->ibmr))
> -		return 0;
> +	if (mr->ibmr.frmr.pool && !mlx5_umr_revoke_mr_with_lock(mr)) {
> +		/*
> +		 * The mkey has been revoked: firmware no longer references
> +		 * dmabuf_st_index, so release it before this mr can re-enter
> +		 * the FRMR cache for reuse by another registration.
> +		 */
> +		mlx5_ib_mr_put_dmabuf_st(mr);
> +
> +		if (!ib_frmr_pool_push(mr->ibmr.device, &mr->ibmr))
> +			return 0;
> +	}

The Sashiko comment on previous version of this series was wrong about 
the concept of FRMR pools and its reuse of mkeys.

Please move the st put operation outside the mkey cleanup flow.

>   
>   	if (is_odp)
>   		mutex_lock(&to_ib_umem_odp(mr->umem)->umem_mutex);
> @@ -1400,6 +1478,8 @@ static int mlx5r_handle_mkey_cleanup(struct mlx5_ib_mr *mr)
>   		dma_resv_unlock(
>   			to_ib_umem_dmabuf(mr->umem)->attach->dmabuf->resv);
>   	}
> +	if (!ret)
> +		mlx5_ib_mr_put_dmabuf_st(mr);
>   	return ret;
>   }
>   

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2026-05-27 22:55 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20260526144401.1485788-1-zhipingz@meta.com>
2026-05-27  6:55 ` [PATCH v5 0/4] vfio/dma-buf: add TPH support for peer-to-peer access Christian König
2026-05-27 12:14   ` Jason Gunthorpe
2026-05-27 12:23     ` Christian König
2026-05-27 12:36       ` Jason Gunthorpe
2026-05-27 12:53         ` Christian König
     [not found] ` <20260526144401.1485788-3-zhipingz@meta.com>
2026-05-27  6:57   ` [PATCH v5 2/4] dma-buf: add optional get_tph() callback Christian König
2026-05-27 17:03   ` Alex Williamson
     [not found] ` <20260526144401.1485788-4-zhipingz@meta.com>
2026-05-27 18:06   ` [PATCH v5 3/4] vfio/pci: implement get_tph and DMA_BUF_TPH feature Alex Williamson
     [not found] ` <20260526144401.1485788-5-zhipingz@meta.com>
2026-05-27 19:00   ` [PATCH v5 4/4] RDMA/mlx5: get tph for p2p access when registering dma-buf mr Alex Williamson
2026-05-27 22:55   ` Michael Gur
     [not found] ` <20260526144401.1485788-2-zhipingz@meta.com>
2026-05-27 20:53   ` [PATCH v5 1/4] PCI/TPH: expose the enabled TPH requester type Alex Williamson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox