* [PATCH v7 0/5] vfio/dma-buf: add TPH support for peer-to-peer access
@ 2026-06-10 19:31 Zhiping Zhang
  2026-06-10 19:31 ` [PATCH v7 1/5] net/mlx5: free mlx5_st_idx_data on final dealloc Zhiping Zhang
                   ` (4 more replies)
  0 siblings, 5 replies; 25+ messages in thread
From: Zhiping Zhang @ 2026-06-10 19:31 UTC
  To: Alex Williamson, Jason Gunthorpe, Leon Romanovsky, Sumit Semwal,
	Christian Konig
  Cc: Bjorn Helgaas, kvm, linux-rdma, linux-pci, netdev, dri-devel,
	Keith Busch, Yochai Cohen, Yishai Hadas, Zhiping Zhang

This series adds TLP Processing Hints (TPH) support to the VFIO dma-buf
export path, allowing importing drivers (e.g. mlx5) to use the
exporter's steering tag when performing peer-to-peer DMA into a
VFIO-owned device.

There is no separate in-tree vendor kernel driver for the target device:
vfio-pci is the in-tree driver and the targeted device is managed
from userspace via VFIO passthrough. That is why the ST has to flow
through a uAPI: userspace owns the device and its ST table, so it is the
entity that can publish a meaningful value for a given dma-buf. The
kernel-visible participants are still in-tree: vfio-pci exports the
dma-buf and mlx5 imports it.

On the effect: the endpoint's PCIe ingress block uses the 8-bit ST as
an in-band instruction for the incoming P2P TLP -- selecting a target
cache partition and, on writes, an in-flight operation on the data
before it lands. The dma-buf callback keeps this opaque to the
framework -- only the producer (userspace owner of the VFIO device)
and the consumer (endpoint block) need to interpret the value. The
dma-buf get_tph callback itself is optional, but for workloads that
depend on the endpoint's in-flight operation the no-TPH fallback does
not produce the same result.

The dma-buf hook is intentionally generic and discoverable rather than
a private side channel. The exporter owns the completing address
space for the dma-buf and decides whether it can provide a meaningful
ST/PH tuple for that completer; the dma-buf core keeps the tuple opaque,
and importers merely request the namespace they support and place the
returned value on generated TLPs. Exporters that cannot derive a
meaningful tuple simply return -EOPNOTSUPP.

Patch 1 is a fix for a pre-existing bug, split out as its own patch:
mlx5_st_dealloc_index() removed the xarray entry but never freed the
backing struct, so repeated alloc/dealloc cycles leaked memory.
Patch 2 adds small PCI/TPH type helpers so drivers can query the enabled
TPH requester mode and the device's TPH Completer Supported field
without reaching into pci_dev internals (and so callers in
CONFIG_PCIE_TPH=n builds get a clean fallback).
Patch 3 adds the optional dma_buf_ops::get_tph callback plus the
dma_buf_get_tph() importer wrapper so importers can fetch TPH metadata
from an exporter under dmabuf->resv.
Patch 4 implements get_tph in vfio-pci and adds the new uAPI
(VFIO_DEVICE_FEATURE_DMA_BUF_TPH) for userspace to attach the metadata.
Patch 5 wires up the mlx5 RDMA driver as a consumer.

Build-tested with both CONFIG_PCIE_TPH=y and CONFIG_PCIE_TPH=n.
Functional validation on the target topology: PCIe analyzer captures
of the P2P TLPs confirm that the ST emitted by mlx5 matches the value
published through VFIO_DEVICE_FEATURE_DMA_BUF_TPH, and the end-to-end
P2P workload produces only results consistent with the endpoint's
ST-selected in-flight operation. For example, with userspace
publishing 8-bit ST=0xf0 and PH=2, an analyzer capture of a
peer-to-peer MWr64 shows "STP MWr64 TC=0 OHC=2 ..." followed by
"OHC-B ST=F0h PH=2 HV=1":
(TLP Captures)
08000260 -> STP MWr64 TC=0 OHC=2 TS=0 Attr=0 L=8
F0000004 -> RID=4h:0h.0h EP- Tag=F0h
E0200000 -> AddrH=000020E0h
00080006 -> AddrL=06000800h
90F00000 -> OHC-B ST=F0h PH=2 HV=1 AMA=0 AV-

Previous versions:
v6: https://lore.kernel.org/dri-devel/20260608185646.4085127-1-zhipingz@meta.com/
v5: https://lore.kernel.org/dri-devel/20260526144401.1485788-1-zhipingz@meta.com/
v4: https://lore.kernel.org/linux-pci/20260519201401.1558410-1-zhipingz@meta.com/
v3: https://lore.kernel.org/linux-pci/20260512184755.4137227-1-zhipingz@meta.com/
v2: https://lore.kernel.org/linux-pci/20260430200704.352228-1-zhipingz@meta.com/

Zhiping Zhang (5):
  net/mlx5: free mlx5_st_idx_data on final dealloc
  PCI/TPH: Add requester/completer type helpers
  dma-buf: add optional get_tph() callback
  vfio/pci: implement get_tph and DMA_BUF_TPH feature
  RDMA/mlx5: get tph for p2p access when registering dma-buf mr

 drivers/dma-buf/dma-buf.c                     |  25 ++++
 drivers/infiniband/core/frmr_pools.c          |  20 +++-
 drivers/infiniband/hw/mlx5/mr.c               | 111 +++++++++++++++++-
 .../net/ethernet/mellanox/mlx5/core/lib/st.c  |  50 ++++++--
 drivers/pci/tph.c                             |  43 +++++++
 drivers/vfio/pci/vfio_pci_core.c              |   3 +
 drivers/vfio/pci/vfio_pci_dmabuf.c            |  94 ++++++++++++++-
 drivers/vfio/pci/vfio_pci_priv.h              |  12 ++
 include/linux/dma-buf.h                       |  21 ++++
 include/linux/mlx5/driver.h                   |  12 ++
 include/linux/pci-tph.h                       |   8 ++
 include/rdma/frmr_pools.h                     |   5 +-
 include/uapi/linux/vfio.h                     |  37 ++++++
 13 files changed, 421 insertions(+), 20 deletions(-)

-- 
2.53.0-Meta


* [PATCH v7 1/5] net/mlx5: free mlx5_st_idx_data on final dealloc
  2026-06-10 19:31 [PATCH v7 0/5] vfio/dma-buf: add TPH support for peer-to-peer access Zhiping Zhang
@ 2026-06-10 19:31 ` Zhiping Zhang
  2026-06-11  7:47   ` Christian König
  2026-06-11 20:25   ` sashiko-bot
  2026-06-10 19:31 ` [PATCH v7 2/5] PCI/TPH: Add requester/completer type helpers Zhiping Zhang
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 25+ messages in thread
From: Zhiping Zhang @ 2026-06-10 19:31 UTC
  To: Alex Williamson, Jason Gunthorpe, Leon Romanovsky, Sumit Semwal,
	Christian Konig
  Cc: Bjorn Helgaas, kvm, linux-rdma, linux-pci, netdev, dri-devel,
	Keith Busch, Yochai Cohen, Yishai Hadas, Zhiping Zhang

When the last reference to an ST table entry is dropped,
mlx5_st_dealloc_index() removes the entry from idx_xa but leaks the
backing mlx5_st_idx_data allocation. Repeated alloc/dealloc cycles
therefore leak one struct mlx5_st_idx_data per cycle.

Free idx_data after the xa_erase() so the lifetime of the bookkeeping
struct matches the lifetime of the ST entry it tracks.
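
The leak and its fix can be modelled in userspace. The sketch below is
not the mlx5 code: the fixed-size table, entry_get() and entry_put() are
hypothetical stand-ins for idx_xa and the refcount helpers, and
live_allocs exists only to make a leak observable.

```c
#include <assert.h>
#include <stdlib.h>

#define TABLE_SIZE 4

/* Hypothetical stand-in for struct mlx5_st_idx_data. */
struct idx_data {
	unsigned int usecount;
};

static struct idx_data *table[TABLE_SIZE];	/* stand-in for idx_xa */
static int live_allocs;				/* outstanding allocations */

/* Take a reference on slot idx, allocating its bookkeeping on first use. */
int entry_get(unsigned int idx)
{
	if (idx >= TABLE_SIZE)
		return -1;
	if (!table[idx]) {
		table[idx] = calloc(1, sizeof(*table[idx]));
		if (!table[idx])
			return -1;
		live_allocs++;
	}
	table[idx]->usecount++;
	return 0;
}

/* Drop a reference; the last put erases the slot and frees the struct. */
void entry_put(unsigned int idx)
{
	struct idx_data *d;

	assert(idx < TABLE_SIZE && table[idx]);
	d = table[idx];
	if (--d->usecount == 0) {
		table[idx] = NULL;	/* analogous to xa_erase() */
		free(d);		/* the step this patch adds */
		live_allocs--;
	}
}
```

Without the free() call, every get/put cycle that reaches a zero
usecount would leave one allocation behind, which is the behaviour
described above.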

Fixes: 888a7776f4fb ("net/mlx5: Add support for device steering tag")
Signed-off-by: Zhiping Zhang <zhipingz@meta.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/lib/st.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/st.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/st.c
index 997be91f0a13..7cedc348790d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/st.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/st.c
@@ -175,6 +175,7 @@ int mlx5_st_dealloc_index(struct mlx5_core_dev *dev, u16 st_index)
 
 	if (refcount_dec_and_test(&idx_data->usecount)) {
 		xa_erase(&st->idx_xa, st_index);
+		kfree(idx_data);
 		/* We leave PCI config space as was before, no mkey will refer to it */
 	}
 
-- 
2.53.0-Meta



* [PATCH v7 2/5] PCI/TPH: Add requester/completer type helpers
  2026-06-10 19:31 [PATCH v7 0/5] vfio/dma-buf: add TPH support for peer-to-peer access Zhiping Zhang
  2026-06-10 19:31 ` [PATCH v7 1/5] net/mlx5: free mlx5_st_idx_data on final dealloc Zhiping Zhang
@ 2026-06-10 19:31 ` Zhiping Zhang
  2026-06-11 20:25   ` sashiko-bot
  2026-06-10 19:31 ` [PATCH v7 3/5] dma-buf: add optional get_tph() callback Zhiping Zhang
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 25+ messages in thread
From: Zhiping Zhang @ 2026-06-10 19:31 UTC
  To: Alex Williamson, Jason Gunthorpe, Leon Romanovsky, Sumit Semwal,
	Christian Konig
  Cc: Bjorn Helgaas, kvm, linux-rdma, linux-pci, netdev, dri-devel,
	Keith Busch, Yochai Cohen, Yishai Hadas, Zhiping Zhang,
	Bjorn Helgaas

Add pcie_tph_enabled_req_type() so drivers can query the enabled TPH
requester mode without reaching into pci_dev internals.

Add pcie_tph_completer_type() so drivers that publish TPH metadata for
a device acting as a completer can gate on the "TPH Completer
Supported" field of Device Capabilities 2 (bits 13:12,
PCI_EXP_DEVCAP2_TPH_COMP_MASK) rather than reusing requester-side
state. Fold the reserved 0b10 encoding into NONE so callers only see
the defined values.

This keeps pci_dev::tph_req_type and the completer-capability decode
inside the PCI/TPH code and provides !CONFIG_PCIE_TPH stubs for
callers.
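
As an illustration of the completer decode, here is a minimal userspace
sketch. It is not the kernel helper: the macro and enum names are
hypothetical local stand-ins, the field position (bits 13:12) comes from
the text above, and the 0b01 / 0b11 encodings are assumed from the PCIe
definition of the TPH Completer Supported field.

```c
#include <assert.h>
#include <stdint.h>

/* Field position per the commit message: Device Capabilities 2 bits 13:12. */
#define DEVCAP2_TPH_COMP_SHIFT	12
#define DEVCAP2_TPH_COMP_MASK	(0x3u << DEVCAP2_TPH_COMP_SHIFT)

static_assert(DEVCAP2_TPH_COMP_MASK == 0x3000u, "field must cover bits 13:12");

/* Hypothetical local names; encodings are an assumption (see lead-in). */
enum tph_comp {
	TPH_COMP_NONE     = 0x0,	/* 0b00: not supported */
	TPH_COMP_TPH_ONLY = 0x1,	/* 0b01: TPH completer */
	TPH_COMP_EXT_TPH  = 0x3,	/* 0b11: TPH and Extended TPH */
};

/* Decode the completer field, folding the reserved 0b10 value into NONE. */
enum tph_comp decode_tph_completer(uint32_t devcap2)
{
	uint32_t field = (devcap2 & DEVCAP2_TPH_COMP_MASK) >>
			 DEVCAP2_TPH_COMP_SHIFT;

	switch (field) {
	case TPH_COMP_TPH_ONLY:
		return TPH_COMP_TPH_ONLY;
	case TPH_COMP_EXT_TPH:
		return TPH_COMP_EXT_TPH;
	default:
		return TPH_COMP_NONE;
	}
}
```

Folding the reserved encoding in the helper means a caller only ever
compares against three values and cannot mistake 0b10 for a capability.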

Signed-off-by: Zhiping Zhang <zhipingz@meta.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>

---
 drivers/pci/tph.c       | 43 +++++++++++++++++++++++++++++++++++++++++
 include/linux/pci-tph.h |  8 ++++++++
 2 files changed, 51 insertions(+)

diff --git a/drivers/pci/tph.c b/drivers/pci/tph.c
index 91145e8d9d95..4fe076bba953 100644
--- a/drivers/pci/tph.c
+++ b/drivers/pci/tph.c
@@ -174,6 +174,49 @@ u32 pcie_tph_get_st_table_loc(struct pci_dev *pdev)
 }
 EXPORT_SYMBOL(pcie_tph_get_st_table_loc);
 
+/**
+ * pcie_tph_enabled_req_type - Return the device's enabled TPH requester type
+ * @pdev: PCI device to query
+ *
+ * Return: PCI_TPH_REQ_DISABLE, PCI_TPH_REQ_TPH_ONLY or PCI_TPH_REQ_EXT_TPH.
+ */
+u8 pcie_tph_enabled_req_type(struct pci_dev *pdev)
+{
+	return pdev->tph_req_type;
+}
+EXPORT_SYMBOL(pcie_tph_enabled_req_type);
+
+/**
+ * pcie_tph_completer_type - Return the device's TPH Completer support
+ * @pdev: PCI device to query
+ *
+ * Reads the "TPH Completer Supported" field (bits 13:12) of Device
+ * Capabilities 2. The reserved 0b10 encoding is folded into
+ * "not supported" so callers only need to compare against the three
+ * defined values.
+ *
+ * Return: one of %PCI_EXP_DEVCAP2_TPH_COMP_NONE,
+ *         %PCI_EXP_DEVCAP2_TPH_COMP_TPH_ONLY or
+ *         %PCI_EXP_DEVCAP2_TPH_COMP_EXT_TPH.
+ */
+u8 pcie_tph_completer_type(struct pci_dev *pdev)
+{
+	u32 reg;
+
+	if (pcie_capability_read_dword(pdev, PCI_EXP_DEVCAP2, &reg))
+		return PCI_EXP_DEVCAP2_TPH_COMP_NONE;
+
+	switch (FIELD_GET(PCI_EXP_DEVCAP2_TPH_COMP_MASK, reg)) {
+	case PCI_EXP_DEVCAP2_TPH_COMP_TPH_ONLY:
+		return PCI_EXP_DEVCAP2_TPH_COMP_TPH_ONLY;
+	case PCI_EXP_DEVCAP2_TPH_COMP_EXT_TPH:
+		return PCI_EXP_DEVCAP2_TPH_COMP_EXT_TPH;
+	default:
+		return PCI_EXP_DEVCAP2_TPH_COMP_NONE;
+	}
+}
+EXPORT_SYMBOL(pcie_tph_completer_type);
+
 /*
  * Return the size of ST table. If ST table is not in TPH Requester Extended
  * Capability space, return 0. Otherwise return the ST Table Size + 1.
diff --git a/include/linux/pci-tph.h b/include/linux/pci-tph.h
index be68cd17f2f8..7743af6fe432 100644
--- a/include/linux/pci-tph.h
+++ b/include/linux/pci-tph.h
@@ -9,6 +9,8 @@
 #ifndef LINUX_PCI_TPH_H
 #define LINUX_PCI_TPH_H
 
+#include <linux/pci_regs.h>
+
 /*
  * According to the ECN for PCI Firmware Spec, Steering Tag can be different
  * depending on the memory type: Volatile Memory or Persistent Memory. When a
@@ -30,6 +32,8 @@ void pcie_disable_tph(struct pci_dev *pdev);
 int pcie_enable_tph(struct pci_dev *pdev, int mode);
 u16 pcie_tph_get_st_table_size(struct pci_dev *pdev);
 u32 pcie_tph_get_st_table_loc(struct pci_dev *pdev);
+u8 pcie_tph_enabled_req_type(struct pci_dev *pdev);
+u8 pcie_tph_completer_type(struct pci_dev *pdev);
 #else
 static inline int pcie_tph_set_st_entry(struct pci_dev *pdev,
 					unsigned int index, u16 tag)
@@ -41,6 +45,10 @@ static inline int pcie_tph_get_cpu_st(struct pci_dev *dev,
 static inline void pcie_disable_tph(struct pci_dev *pdev) { }
 static inline int pcie_enable_tph(struct pci_dev *pdev, int mode)
 { return -EINVAL; }
+static inline u8 pcie_tph_enabled_req_type(struct pci_dev *pdev)
+{ return PCI_TPH_REQ_DISABLE; }
+static inline u8 pcie_tph_completer_type(struct pci_dev *pdev)
+{ return PCI_EXP_DEVCAP2_TPH_COMP_NONE; }
 #endif
 
 #endif /* LINUX_PCI_TPH_H */
-- 
2.53.0-Meta



* [PATCH v7 3/5] dma-buf: add optional get_tph() callback
  2026-06-10 19:31 [PATCH v7 0/5] vfio/dma-buf: add TPH support for peer-to-peer access Zhiping Zhang
  2026-06-10 19:31 ` [PATCH v7 1/5] net/mlx5: free mlx5_st_idx_data on final dealloc Zhiping Zhang
  2026-06-10 19:31 ` [PATCH v7 2/5] PCI/TPH: Add requester/completer type helpers Zhiping Zhang
@ 2026-06-10 19:31 ` Zhiping Zhang
  2026-06-11 10:35   ` Christian König
  2026-06-11 20:26   ` sashiko-bot
  2026-06-10 19:31 ` [PATCH v7 4/5] vfio/pci: implement get_tph and DMA_BUF_TPH feature Zhiping Zhang
  2026-06-10 19:31 ` [PATCH v7 5/5] RDMA/mlx5: get tph for p2p access when registering dma-buf mr Zhiping Zhang
  4 siblings, 2 replies; 25+ messages in thread
From: Zhiping Zhang @ 2026-06-10 19:31 UTC
  To: Alex Williamson, Jason Gunthorpe, Leon Romanovsky, Sumit Semwal,
	Christian Konig
  Cc: Bjorn Helgaas, kvm, linux-rdma, linux-pci, netdev, dri-devel,
	Keith Busch, Yochai Cohen, Yishai Hadas, Zhiping Zhang

Add an optional dma_buf_ops.get_tph callback and a dma_buf_get_tph()
wrapper for importers.

8-bit ST and 16-bit Extended ST are distinct PCIe TPH namespaces, so
the importer requests the namespace it can emit and the exporter
returns the matching ST/PH tuple or -EOPNOTSUPP.

dma_buf_get_tph() is the importer entry point. It returns -EOPNOTSUPP
when the exporter lacks the callback and requires dmabuf->resv to be
held while the callback runs.
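
The optional-callback contract can be sketched in plain C. This is a
userspace model, not the dma-buf API: struct buf, struct buf_ops,
demo_get_tph() and buf_get_tph() are hypothetical names, and the
dmabuf->resv locking requirement is deliberately left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct buf;

/* Hypothetical stand-in for struct dma_buf_ops; only get_tph is modelled. */
struct buf_ops {
	int (*get_tph)(struct buf *b, bool extended,
		       uint16_t *steering_tag, uint8_t *ph);
};

struct buf {
	const struct buf_ops *ops;
	bool st_valid;		/* an 8-bit ST has been published */
	uint8_t st;
	uint8_t ph;
};

/* Exporter callback: only the 8-bit namespace is populated in this model. */
int demo_get_tph(struct buf *b, bool extended,
		 uint16_t *steering_tag, uint8_t *ph)
{
	if (extended || !b->st_valid)
		return -EOPNOTSUPP;
	*steering_tag = b->st;
	*ph = b->ph;
	return 0;
}

/* Importer entry point: a missing callback is reported as -EOPNOTSUPP. */
int buf_get_tph(struct buf *b, bool extended,
		uint16_t *steering_tag, uint8_t *ph)
{
	assert(b && steering_tag && ph);
	if (!b->ops->get_tph)
		return -EOPNOTSUPP;
	return b->ops->get_tph(b, extended, steering_tag, ph);
}
```

An importer in this model sees the same -EOPNOTSUPP whether the
exporter has no callback or simply has nothing for the requested
namespace, so a single fallback path covers both cases.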

The first user is VFIO_DEVICE_FEATURE_DMA_BUF_TPH in vfio-pci, with
mlx5 as the first importer.

Signed-off-by: Zhiping Zhang <zhipingz@meta.com>
---
 drivers/dma-buf/dma-buf.c | 25 +++++++++++++++++++++++++
 include/linux/dma-buf.h   | 21 +++++++++++++++++++++
 2 files changed, 46 insertions(+)

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index d504c636dc29..aff79ea12e43 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -1144,6 +1144,31 @@ void dma_buf_unpin(struct dma_buf_attachment *attach)
 }
 EXPORT_SYMBOL_NS_GPL(dma_buf_unpin, "DMA_BUF");
 
+/**
+ * dma_buf_get_tph - Retrieve TPH metadata from an exporter
+ * @dmabuf: DMA buffer to query
+ * @extended: false for 8-bit ST, true for 16-bit Extended ST
+ * @steering_tag: returns the raw steering tag for the requested namespace
+ * @ph: returns the TPH processing hint
+ *
+ * Wrapper for the optional &dma_buf_ops.get_tph callback.
+ *
+ * Must be called with &dma_buf.resv held. Returns -EOPNOTSUPP if the
+ * exporter does not implement the callback or has no metadata for the
+ * requested namespace.
+ */
+int dma_buf_get_tph(struct dma_buf *dmabuf, bool extended,
+		    u16 *steering_tag, u8 *ph)
+{
+	dma_resv_assert_held(dmabuf->resv);
+
+	if (!dmabuf->ops->get_tph)
+		return -EOPNOTSUPP;
+
+	return dmabuf->ops->get_tph(dmabuf, extended, steering_tag, ph);
+}
+EXPORT_SYMBOL_NS_GPL(dma_buf_get_tph, "DMA_BUF");
+
 /**
  * dma_buf_map_attachment - Returns the scatterlist table of the attachment;
  * mapped into _device_ address space. Is a wrapper for map_dma_buf() of the
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
index d1203da56fc5..6a54e0f251a2 100644
--- a/include/linux/dma-buf.h
+++ b/include/linux/dma-buf.h
@@ -113,6 +113,25 @@ struct dma_buf_ops {
 	 */
 	void (*unpin)(struct dma_buf_attachment *attach);
 
+	/**
+	 * @get_tph:
+	 * @dmabuf: DMA buffer for which to retrieve TPH metadata
+	 * @extended: false for 8-bit ST, true for 16-bit Extended ST
+	 * @steering_tag: Returns the raw TPH steering tag for the requested
+	 *                namespace
+	 * @ph: Returns the TPH processing hint (2-bit value)
+	 *
+	 * Return TPH metadata for the namespace selected by @extended. Return
+	 * 0 on success, or -EOPNOTSUPP if no metadata is available.
+	 *
+	 * This callback is optional. Importers must not call it directly;
+	 * the dma_buf_get_tph() wrapper is the only entry point and handles
+	 * the NULL-callback case. The callback is invoked with
+	 * &dma_buf.resv held.
+	 */
+	int (*get_tph)(struct dma_buf *dmabuf, bool extended,
+		       u16 *steering_tag, u8 *ph);
+
 	/**
 	 * @map_dma_buf:
 	 *
@@ -563,6 +582,8 @@ void dma_buf_detach(struct dma_buf *dmabuf,
 		    struct dma_buf_attachment *attach);
 int dma_buf_pin(struct dma_buf_attachment *attach);
 void dma_buf_unpin(struct dma_buf_attachment *attach);
+int dma_buf_get_tph(struct dma_buf *dmabuf, bool extended,
+		    u16 *steering_tag, u8 *ph);
 
 struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info);
 
-- 
2.53.0-Meta



* [PATCH v7 4/5] vfio/pci: implement get_tph and DMA_BUF_TPH feature
  2026-06-10 19:31 [PATCH v7 0/5] vfio/dma-buf: add TPH support for peer-to-peer access Zhiping Zhang
                   ` (2 preceding siblings ...)
  2026-06-10 19:31 ` [PATCH v7 3/5] dma-buf: add optional get_tph() callback Zhiping Zhang
@ 2026-06-10 19:31 ` Zhiping Zhang
  2026-06-11 20:25   ` sashiko-bot
  2026-06-10 19:31 ` [PATCH v7 5/5] RDMA/mlx5: get tph for p2p access when registering dma-buf mr Zhiping Zhang
  4 siblings, 1 reply; 25+ messages in thread
From: Zhiping Zhang @ 2026-06-10 19:31 UTC
  To: Alex Williamson, Jason Gunthorpe, Leon Romanovsky, Sumit Semwal,
	Christian Konig
  Cc: Bjorn Helgaas, kvm, linux-rdma, linux-pci, netdev, dri-devel,
	Keith Busch, Yochai Cohen, Yishai Hadas, Zhiping Zhang

Implement dma-buf get_tph for vfio-pci exported dma-bufs and add
VFIO_DEVICE_FEATURE_DMA_BUF_TPH so userspace can publish TPH metadata
for a VFIO-owned device.

8-bit ST and 16-bit Extended ST are distinct PCIe TPH namespaces; the
uAPI carries both with explicit validity flags, and get_tph() returns
the value matching the importer's requested namespace or -EOPNOTSUPP.

Publish and read the TPH descriptor under dmabuf->resv, matching the
locking used for other importer-visible dma-buf state. The SET ioctl
takes dma_resv_lock_interruptible(), while the callback runs with
dmabuf->resv already held, as the dma-buf core asserts.

Reject requests the device cannot consume as a completer:
pcie_tph_completer_type() must report at least
PCI_EXP_DEVCAP2_TPH_COMP_TPH_ONLY, and Extended ST requires
PCI_EXP_DEVCAP2_TPH_COMP_EXT_TPH. Validate fields before the completer
check so userspace gets the narrowest errno.
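
The check ordering described above can be modelled as a pure function.
This is a sketch under assumptions, not the vfio-pci code:
validate_tph_request() and the enum are hypothetical, the flag values
mirror VFIO_DMA_BUF_TPH_ST / VFIO_DMA_BUF_TPH_ST_EXT from this patch,
and the completer encodings are assumed (0b10 being reserved).

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Flag values mirroring the uAPI bits added by this patch. */
#define TPH_ST		(1u << 0)
#define TPH_ST_EXT	(1u << 1)

static_assert((TPH_ST & TPH_ST_EXT) == 0, "validity flags must not overlap");

/* Hypothetical stand-ins for the completer encodings. */
enum tph_comp { COMP_NONE = 0x0, COMP_TPH_ONLY = 0x1, COMP_EXT_TPH = 0x3 };

/*
 * Field checks come first (-EINVAL) and capability checks second
 * (-EOPNOTSUPP), so a malformed request is reported as malformed even
 * on a device that lacks TPH Completer support.
 */
int validate_tph_request(uint32_t flags, uint8_t ph, enum tph_comp comp)
{
	if (flags & ~(TPH_ST | TPH_ST_EXT))
		return -EINVAL;		/* unknown flag bits */
	if (ph & ~0x3)
		return -EINVAL;		/* PH is a 2-bit value */
	if (comp == COMP_NONE)
		return -EOPNOTSUPP;	/* device cannot consume TPH */
	if ((flags & TPH_ST_EXT) && comp != COMP_EXT_TPH)
		return -EOPNOTSUPP;	/* Extended ST needs the 0b11 encoding */
	return 0;
}
```

Note that in this ordering a request with flags == 0 still fails with
-EOPNOTSUPP on a device without completer support, because the
capability check does not depend on which namespaces are being set.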

Signed-off-by: Zhiping Zhang <zhipingz@meta.com>
---
 drivers/vfio/pci/vfio_pci_core.c   |  3 +
 drivers/vfio/pci/vfio_pci_dmabuf.c | 94 +++++++++++++++++++++++++++++-
 drivers/vfio/pci/vfio_pci_priv.h   | 12 ++++
 include/uapi/linux/vfio.h          | 37 ++++++++++++
 4 files changed, 145 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 050e7542952e..4fa36f2f7555 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1569,6 +1569,9 @@ int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
 		return vfio_pci_core_feature_token(vdev, flags, arg, argsz);
 	case VFIO_DEVICE_FEATURE_DMA_BUF:
 		return vfio_pci_core_feature_dma_buf(vdev, flags, arg, argsz);
+	case VFIO_DEVICE_FEATURE_DMA_BUF_TPH:
+		return vfio_pci_core_feature_dma_buf_tph(vdev, flags, arg,
+							 argsz);
 	default:
 		return -ENOTTY;
 	}
diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c
index 1a177ce7de54..0a0705c8dbea 100644
--- a/drivers/vfio/pci/vfio_pci_dmabuf.c
+++ b/drivers/vfio/pci/vfio_pci_dmabuf.c
@@ -3,6 +3,7 @@
  */
 #include <linux/dma-buf-mapping.h>
 #include <linux/pci-p2pdma.h>
+#include <linux/pci-tph.h>
 #include <linux/dma-resv.h>
 
 #include "vfio_pci_priv.h"
@@ -19,7 +20,12 @@ struct vfio_pci_dma_buf {
 	u32 nr_ranges;
 	struct kref kref;
 	struct completion comp;
-	u8 revoked : 1;
+	u8 tph_st_valid:1;
+	u8 tph_st_ext_valid:1;
+	u8 tph_ph:2;
+	u8 tph_st;
+	u16 tph_st_ext;
+	u8 revoked:1;
 };
 
 static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf,
@@ -69,6 +75,26 @@ vfio_pci_dma_buf_map(struct dma_buf_attachment *attachment,
 	return ret;
 }
 
+static int vfio_pci_dma_buf_get_tph(struct dma_buf *dmabuf, bool extended,
+				    u16 *steering_tag, u8 *ph)
+{
+	struct vfio_pci_dma_buf *priv = dmabuf->priv;
+
+	dma_resv_assert_held(dmabuf->resv);
+
+	if (extended) {
+		if (!priv->tph_st_ext_valid)
+			return -EOPNOTSUPP;
+		*steering_tag = priv->tph_st_ext;
+	} else {
+		if (!priv->tph_st_valid)
+			return -EOPNOTSUPP;
+		*steering_tag = priv->tph_st;
+	}
+	*ph = priv->tph_ph;
+	return 0;
+}
+
 static void vfio_pci_dma_buf_unmap(struct dma_buf_attachment *attachment,
 				   struct sg_table *sgt,
 				   enum dma_data_direction dir)
@@ -101,6 +127,7 @@ static void vfio_pci_dma_buf_release(struct dma_buf *dmabuf)
 
 static const struct dma_buf_ops vfio_pci_dmabuf_ops = {
 	.attach = vfio_pci_dma_buf_attach,
+	.get_tph = vfio_pci_dma_buf_get_tph,
 	.map_dma_buf = vfio_pci_dma_buf_map,
 	.unmap_dma_buf = vfio_pci_dma_buf_unmap,
 	.release = vfio_pci_dma_buf_release,
@@ -333,6 +360,71 @@ int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
 	return ret;
 }
 
+int vfio_pci_core_feature_dma_buf_tph(struct vfio_pci_core_device *vdev,
+				      u32 flags,
+				      struct vfio_device_feature_dma_buf_tph __user *arg,
+				      size_t argsz)
+{
+	struct vfio_device_feature_dma_buf_tph set_tph;
+	struct vfio_pci_dma_buf *priv;
+	struct dma_buf *dmabuf;
+	u8 comp;
+	int ret;
+
+	ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_SET,
+				 sizeof(set_tph));
+	if (ret != 1)
+		return ret;
+
+	if (copy_from_user(&set_tph, arg, sizeof(set_tph)))
+		return -EFAULT;
+
+	if (set_tph.flags & ~(VFIO_DMA_BUF_TPH_ST | VFIO_DMA_BUF_TPH_ST_EXT))
+		return -EINVAL;
+
+	if (set_tph.ph & ~0x3)
+		return -EINVAL;
+
+	comp = pcie_tph_completer_type(vdev->pdev);
+	if (comp == PCI_EXP_DEVCAP2_TPH_COMP_NONE)
+		return -EOPNOTSUPP;
+	if ((set_tph.flags & VFIO_DMA_BUF_TPH_ST_EXT) &&
+	    comp != PCI_EXP_DEVCAP2_TPH_COMP_EXT_TPH)
+		return -EOPNOTSUPP;
+
+	dmabuf = dma_buf_get(set_tph.dmabuf_fd);
+	if (IS_ERR(dmabuf))
+		return PTR_ERR(dmabuf);
+
+	if (dmabuf->ops != &vfio_pci_dmabuf_ops) {
+		ret = -EINVAL;
+		goto out_put;
+	}
+
+	priv = dmabuf->priv;
+	if (priv->vdev != vdev) {
+		ret = -EINVAL;
+		goto out_put;
+	}
+
+	ret = dma_resv_lock_interruptible(dmabuf->resv, NULL);
+	if (ret)
+		goto out_put;
+
+	priv->tph_st         = set_tph.steering_tag;
+	priv->tph_st_ext     = set_tph.steering_tag_ext;
+	priv->tph_ph         = set_tph.ph;
+	priv->tph_st_valid   = !!(set_tph.flags & VFIO_DMA_BUF_TPH_ST);
+	priv->tph_st_ext_valid =
+		!!(set_tph.flags & VFIO_DMA_BUF_TPH_ST_EXT);
+	dma_resv_unlock(dmabuf->resv);
+	ret = 0;
+
+out_put:
+	dma_buf_put(dmabuf);
+	return ret;
+}
+
 void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked)
 {
 	struct vfio_pci_dma_buf *priv;
diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h
index fca9d0dfac90..c58f369be4b3 100644
--- a/drivers/vfio/pci/vfio_pci_priv.h
+++ b/drivers/vfio/pci/vfio_pci_priv.h
@@ -118,6 +118,10 @@ static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
 int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
 				  struct vfio_device_feature_dma_buf __user *arg,
 				  size_t argsz);
+int vfio_pci_core_feature_dma_buf_tph(struct vfio_pci_core_device *vdev,
+				      u32 flags,
+				      struct vfio_device_feature_dma_buf_tph __user *arg,
+				      size_t argsz);
 void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev);
 void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked);
 #else
@@ -128,6 +132,14 @@ vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
 {
 	return -ENOTTY;
 }
+
+static inline int
+vfio_pci_core_feature_dma_buf_tph(struct vfio_pci_core_device *vdev, u32 flags,
+				  struct vfio_device_feature_dma_buf_tph __user *arg,
+				  size_t argsz)
+{
+	return -ENOTTY;
+}
 static inline void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev)
 {
 }
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 5de618a3a5ee..5dd693220a0d 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -1534,6 +1534,43 @@ struct vfio_device_feature_dma_buf {
  */
 #define VFIO_DEVICE_FEATURE_MIG_PRECOPY_INFOv2  12
 
+/**
+ * Upon VFIO_DEVICE_FEATURE_SET associate TPH (TLP Processing Hints) metadata
+ * with a vfio-exported dma-buf. The dma-buf must have been created by
+ * VFIO_DEVICE_FEATURE_DMA_BUF on this device, and the device must report
+ * TPH Completer support in Device Capabilities 2 (bits 13:12); requests
+ * carrying VFIO_DMA_BUF_TPH_ST_EXT additionally require the device to
+ * report the Extended TPH Completer encoding. Otherwise the ioctl
+ * returns -EOPNOTSUPP.
+ *
+ * dmabuf_fd is the file descriptor returned by VFIO_DEVICE_FEATURE_DMA_BUF.
+ *
+ * 8-bit ST (steering_tag) and 16-bit Extended ST (steering_tag_ext) are
+ * distinct namespaces. Userspace supplies whichever values are valid and sets
+ * the matching VFIO_DMA_BUF_TPH_ST / VFIO_DMA_BUF_TPH_ST_EXT bits in @flags;
+ * an importer requests one namespace and receives the matching value.
+ *
+ * @flags == 0 marks any previously published ST / Extended-ST as invalid
+ * for future get_tph() requests on this dma-buf.
+ *
+ * ph is the 2-bit TLP Processing Hint and must be in the range [0, 3].
+ *
+ * Userspace must publish TPH before handing the dma-buf fd to an importer.
+ * Calling SET again replaces the published values.
+ */
+#define VFIO_DEVICE_FEATURE_DMA_BUF_TPH 13
+
+#define VFIO_DMA_BUF_TPH_ST		(1 << 0)
+#define VFIO_DMA_BUF_TPH_ST_EXT		(1 << 1)
+
+struct vfio_device_feature_dma_buf_tph {
+	__s32	dmabuf_fd;
+	__u32	flags;
+	__u16	steering_tag_ext;
+	__u8	steering_tag;
+	__u8	ph;
+};
+
 /* -------- API for Type1 VFIO IOMMU -------- */
 
 /**
-- 
2.53.0-Meta



* [PATCH v7 5/5] RDMA/mlx5: get tph for p2p access when registering dma-buf mr
  2026-06-10 19:31 [PATCH v7 0/5] vfio/dma-buf: add TPH support for peer-to-peer access Zhiping Zhang
                   ` (3 preceding siblings ...)
  2026-06-10 19:31 ` [PATCH v7 4/5] vfio/pci: implement get_tph and DMA_BUF_TPH feature Zhiping Zhang
@ 2026-06-10 19:31 ` Zhiping Zhang
  2026-06-11 12:44   ` Michael Gur
  2026-06-11 20:26   ` sashiko-bot
  4 siblings, 2 replies; 25+ messages in thread
From: Zhiping Zhang @ 2026-06-10 19:31 UTC
  To: Alex Williamson, Jason Gunthorpe, Leon Romanovsky, Sumit Semwal,
	Christian Konig
  Cc: Bjorn Helgaas, kvm, linux-rdma, linux-pci, netdev, dri-devel,
	Keith Busch, Yochai Cohen, Yishai Hadas, Zhiping Zhang

Query dma-buf TPH metadata when registering a dma-buf MR for
peer-to-peer access to a PCIe endpoint and use it to program
requester-side TPH on the outbound mkey. If the exporter has no
metadata, fall back to the existing no-TPH path.

For TPH-backed FRMRs, make the extra ST-table reference belong to the
hardware mkey handle rather than the transient MR object. Extend the
FRMR pool API so reuse and final destroy can transfer and drop that ref
at the handle lifetime boundaries, and add mlx5_st_get_index() to take
a ref on an already-known ST index.

Also decode PH from kernel_vendor_key when recreating pooled mkeys so
the requester hint matches the pool key.
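
The PH decode can be illustrated with a small pack/unpack sketch. The
two mask values are taken from the mr.c context in this patch; the
shift of 16 is an assumption inferred from the PH mask, since the
definition of MLX5_FRMR_POOLS_KERNEL_KEY_PH_SHIFT is not part of the
hunks shown here. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define KEY_ST_INDEX_MASK	0xFFFFu		/* from the patch context */
#define KEY_PH_MASK		0xFF0000u	/* from the patch context */
#define KEY_PH_SHIFT		16		/* assumed: lowest set bit of KEY_PH_MASK */

static_assert((0xFFu << KEY_PH_SHIFT) == KEY_PH_MASK,
	      "shift must match the PH mask");

/* Pack an ST index and a PH value into one pool key. */
uint64_t key_pack(uint16_t st_index, uint8_t ph)
{
	return (uint64_t)st_index | ((uint64_t)ph << KEY_PH_SHIFT);
}

/* The ST index occupies the low 16 bits of the key. */
uint16_t key_st_index(uint64_t key)
{
	return key & KEY_ST_INDEX_MASK;
}

/* Mask, then shift down; masking alone leaves PH in bits 23:16. */
uint8_t key_ph(uint64_t key)
{
	return (key & KEY_PH_MASK) >> KEY_PH_SHIFT;
}
```

With only the mask applied, a packed PH of 2 reads back as 0x20000
rather than 2, so the recreated mkey would not carry the hint that the
pool key encodes.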

Signed-off-by: Zhiping Zhang <zhipingz@meta.com>
---
 drivers/infiniband/core/frmr_pools.c          |  20 +++-
 drivers/infiniband/hw/mlx5/mr.c               | 111 +++++++++++++++++-
 .../net/ethernet/mellanox/mlx5/core/lib/st.c  |  49 ++++++--
 include/linux/mlx5/driver.h                   |  12 ++
 include/rdma/frmr_pools.h                     |   5 +-
 5 files changed, 178 insertions(+), 19 deletions(-)

diff --git a/drivers/infiniband/core/frmr_pools.c b/drivers/infiniband/core/frmr_pools.c
index 5e992ff3d7cf..61a77847118e 100644
--- a/drivers/infiniband/core/frmr_pools.c
+++ b/drivers/infiniband/core/frmr_pools.c
@@ -92,7 +92,8 @@ static void destroy_all_handles_in_queue(struct ib_device *device,
 	u32 count;
 
 	while (pop_frmr_handles_page(pool, queue, &page, &count)) {
-		pools->pool_ops->destroy_frmrs(device, page->handles, count);
+		pools->pool_ops->destroy_frmrs(device, &pool->key,
+					       page->handles, count);
 		kfree(page);
 	}
 }
@@ -136,7 +137,8 @@ static bool age_pinned_pool(struct ib_device *device, struct ib_frmr_pool *pool)
 	spin_unlock(&pool->lock);
 
 	if (destroyed)
-		pools->pool_ops->destroy_frmrs(device, handles, destroyed);
+		pools->pool_ops->destroy_frmrs(device, &pool->key, handles,
+					       destroyed);
 	kfree(handles);
 	return has_work;
 }
@@ -453,9 +455,11 @@ int ib_frmr_pools_set_pinned(struct ib_device *device, struct ib_frmr_key *key,
 }
 
 static int get_frmr_from_pool(struct ib_device *device,
-			      struct ib_frmr_pool *pool, struct ib_mr *mr)
+			      struct ib_frmr_pool *pool, struct ib_mr *mr,
+			      bool *reused)
 {
 	struct ib_frmr_pools *pools = device->frmr_pools;
+	bool local_reused = false;
 	u32 handle;
 	int err;
 
@@ -464,6 +468,7 @@ static int get_frmr_from_pool(struct ib_device *device,
 		if (pool->inactive_queue.ci > 0) {
 			handle = pop_handle_from_queue_locked(
 				&pool->inactive_queue);
+			local_reused = true;
 		} else {
 			spin_unlock(&pool->lock);
 			err = pools->pool_ops->create_frmrs(device, &pool->key,
@@ -474,6 +479,7 @@ static int get_frmr_from_pool(struct ib_device *device,
 		}
 	} else {
 		handle = pop_handle_from_queue_locked(&pool->queue);
+		local_reused = true;
 	}
 
 	pool->in_use++;
@@ -484,6 +490,8 @@ static int get_frmr_from_pool(struct ib_device *device,
 
 	mr->frmr.pool = pool;
 	mr->frmr.handle = handle;
+	if (reused)
+		*reused = local_reused;
 
 	return 0;
 }
@@ -493,10 +501,12 @@ static int get_frmr_from_pool(struct ib_device *device,
  *
  * @device: The device to pop the FRMR handle from.
  * @mr: The MR to pop the FRMR handle from.
+ * @reused: Optional output that reports whether the returned handle was
+ *	    reused from the pool instead of freshly created.
  *
  * Returns 0 on success, negative error code on failure.
  */
-int ib_frmr_pool_pop(struct ib_device *device, struct ib_mr *mr)
+int ib_frmr_pool_pop(struct ib_device *device, struct ib_mr *mr, bool *reused)
 {
 	struct ib_frmr_pools *pools = device->frmr_pools;
 	struct ib_frmr_pool *pool;
@@ -509,7 +519,7 @@ int ib_frmr_pool_pop(struct ib_device *device, struct ib_mr *mr)
 			return PTR_ERR(pool);
 	}
 
-	return get_frmr_from_pool(device, pool, mr);
+	return get_frmr_from_pool(device, pool, mr, reused);
 }
 EXPORT_SYMBOL(ib_frmr_pool_pop);
 
diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index 3b6da45061a5..5697c2862615 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -38,6 +38,7 @@
 #include <linux/delay.h>
 #include <linux/dma-buf.h>
 #include <linux/dma-resv.h>
+#include <linux/pci-tph.h>
 #include <rdma/frmr_pools.h>
 #include <rdma/ib_umem_odp.h>
 #include "dm.h"
@@ -167,12 +168,39 @@ static int get_unchangeable_access_flags(struct mlx5_ib_dev *dev,
 #define MLX5_FRMR_POOLS_KERNEL_KEY_PH_MASK 0xFF0000
 #define MLX5_FRMR_POOLS_KERNEL_KEY_ST_INDEX_MASK 0xFFFF
 
+static int mlx5_ib_get_frmr_st_handle_ref(struct mlx5_ib_dev *dev,
+					  u16 st_index)
+{
+	if (st_index == MLX5_MKC_PCIE_TPH_NO_STEERING_TAG_INDEX)
+		return 0;
+
+	return mlx5_st_get_index(dev->mdev, st_index);
+}
+
+static void mlx5_ib_put_st_index_ref(struct mlx5_ib_dev *dev, u16 st_index)
+{
+	if (st_index == MLX5_MKC_PCIE_TPH_NO_STEERING_TAG_INDEX)
+		return;
+
+	mlx5_st_dealloc_index(dev->mdev, st_index);
+}
+
+static void mlx5_ib_put_frmr_st_handle_ref(struct mlx5_ib_dev *dev,
+					   u64 kernel_vendor_key)
+{
+	u16 st_index = kernel_vendor_key &
+		       MLX5_FRMR_POOLS_KERNEL_KEY_ST_INDEX_MASK;
+
+	mlx5_ib_put_st_index_ref(dev, st_index);
+}
+
 static struct mlx5_ib_mr *
 _mlx5_frmr_pool_alloc(struct mlx5_ib_dev *dev, struct ib_umem *umem,
 		      int access_flags, int access_mode,
 		      unsigned long page_size, u16 st_index, u8 ph)
 {
 	struct mlx5_ib_mr *mr;
+	bool reused = false;
 	int err;
 
 	mr = kzalloc_obj(*mr);
@@ -195,11 +223,14 @@ _mlx5_frmr_pool_alloc(struct mlx5_ib_dev *dev, struct ib_umem *umem,
 
 	mr->ibmr.frmr.key.kernel_vendor_key =
 		st_index | (ph << MLX5_FRMR_POOLS_KERNEL_KEY_PH_SHIFT);
-	err = ib_frmr_pool_pop(&dev->ib_dev, &mr->ibmr);
+	err = ib_frmr_pool_pop(&dev->ib_dev, &mr->ibmr, &reused);
 	if (err) {
 		kfree(mr);
 		return ERR_PTR(err);
 	}
+	if (reused)
+		mlx5_ib_put_frmr_st_handle_ref(
+			dev, mr->ibmr.frmr.key.kernel_vendor_key);
 	mr->mmkey.key = mr->ibmr.frmr.handle;
 	init_waitqueue_head(&mr->mmkey.wait);
 
@@ -229,7 +260,7 @@ struct mlx5_ib_mr *mlx5_mr_cache_alloc(struct mlx5_ib_dev *dev,
 	init_waitqueue_head(&mr->mmkey.wait);
 
 	mr->ibmr.frmr.key = key;
-	ret = ib_frmr_pool_pop(&dev->ib_dev, &mr->ibmr);
+	ret = ib_frmr_pool_pop(&dev->ib_dev, &mr->ibmr, NULL);
 	if (ret) {
 		kfree(mr);
 		return ERR_PTR(ret);
@@ -273,7 +304,8 @@ static int mlx5r_create_mkeys(struct ib_device *device, struct ib_frmr_key *key,
 
 	st_index = key->kernel_vendor_key &
 		   MLX5_FRMR_POOLS_KERNEL_KEY_ST_INDEX_MASK;
-	ph = key->kernel_vendor_key & MLX5_FRMR_POOLS_KERNEL_KEY_PH_MASK;
+	ph = (key->kernel_vendor_key & MLX5_FRMR_POOLS_KERNEL_KEY_PH_MASK) >>
+	     MLX5_FRMR_POOLS_KERNEL_KEY_PH_SHIFT;
 	if (ph) {
 		/* Normalize ph: swap MLX5_IB_NO_PH for 0 */
 		if (ph == MLX5_IB_NO_PH)
@@ -299,7 +331,8 @@ static int mlx5r_create_mkeys(struct ib_device *device, struct ib_frmr_key *key,
 	return err;
 }
 
-static void mlx5r_destroy_mkeys(struct ib_device *device, u32 *handles,
+static void mlx5r_destroy_mkeys(struct ib_device *device,
+				const struct ib_frmr_key *key, u32 *handles,
 				unsigned int count)
 {
 	struct mlx5_ib_dev *dev = to_mdev(device);
@@ -311,6 +344,9 @@ static void mlx5r_destroy_mkeys(struct ib_device *device, u32 *handles,
 			pr_warn_ratelimited(
 				"mlx5_ib: failed to destroy mkey %d: %d",
 				handles[i], err);
+		else
+			mlx5_ib_put_frmr_st_handle_ref(dev,
+						       key->kernel_vendor_key);
 	}
 }
 
@@ -333,6 +369,7 @@ static int mlx5r_build_frmr_key(struct ib_device *device,
 		get_unchangeable_access_flags(dev, in->access_flags);
 	out->vendor_key = in->vendor_key;
 	out->num_dma_blocks = in->num_dma_blocks;
+	out->kernel_vendor_key = in->kernel_vendor_key;
 
 	return 0;
 }
@@ -753,6 +790,12 @@ static struct ib_mr *create_real_mr(struct ib_pd *pd, struct ib_umem *umem,
 
 	xlt_with_umr = mlx5r_umr_can_load_pas(dev, umem->length);
 	if (xlt_with_umr) {
+		err = mlx5_ib_get_frmr_st_handle_ref(dev, st_index);
+		if (err) {
+			ib_umem_release(umem);
+			return ERR_PTR(err);
+		}
+
 		mr = alloc_cacheable_mr(pd, umem, iova, access_flags,
 					MLX5_MKC_ACCESS_MODE_MTT,
 					st_index, ph);
@@ -767,6 +810,8 @@ static struct ib_mr *create_real_mr(struct ib_pd *pd, struct ib_umem *umem,
 		mutex_unlock(&dev->slow_path_mutex);
 	}
 	if (IS_ERR(mr)) {
+		if (xlt_with_umr)
+			mlx5_ib_put_st_index_ref(dev, st_index);
 		ib_umem_release(umem);
 		return ERR_CAST(mr);
 	}
@@ -899,6 +944,52 @@ static struct dma_buf_attach_ops mlx5_ib_dmabuf_attach_ops = {
 	.invalidate_mappings = mlx5_ib_dmabuf_invalidate_cb,
 };
 
+static void get_tph_mr_dmabuf(struct mlx5_ib_dev *dev, struct dma_buf *dmabuf,
+			      u16 *st_index, u8 *ph)
+{
+	u16 local_st_index;
+	u16 steering_tag;
+	u8 local_ph;
+	bool extended;
+	int ret;
+
+	switch (pcie_tph_enabled_req_type(dev->mdev->pdev)) {
+	case PCI_TPH_REQ_TPH_ONLY:
+		extended = false;
+		break;
+	case PCI_TPH_REQ_EXT_TPH:
+		extended = true;
+		break;
+	default:
+		return;
+	}
+
+	dma_resv_lock(dmabuf->resv, NULL);
+	ret = dma_buf_get_tph(dmabuf, extended, &steering_tag, &local_ph);
+	dma_resv_unlock(dmabuf->resv);
+	if (ret) {
+		if (ret != -EOPNOTSUPP)
+			mlx5_ib_dbg(dev, "get_tph failed (%d)\n", ret);
+		return;
+	}
+
+	ret = mlx5_st_alloc_index_by_tag(dev->mdev, steering_tag,
+					 &local_st_index);
+	if (ret) {
+		mlx5_ib_dbg(dev, "st_alloc_index_by_tag failed (%d)\n", ret);
+		return;
+	}
+
+	*st_index = local_st_index;
+	*ph = local_ph;
+}
+
+static void mlx5_ib_mr_put_frmr_st_handle_ref(struct mlx5_ib_mr *mr)
+{
+	mlx5_ib_put_frmr_st_handle_ref(mr_to_mdev(mr),
+				       mr->ibmr.frmr.key.kernel_vendor_key);
+}
+
 static struct ib_mr *
 reg_user_mr_dmabuf(struct ib_pd *pd, struct device *dma_device,
 		   u64 offset, u64 length, u64 virt_addr,
@@ -941,12 +1032,22 @@ reg_user_mr_dmabuf(struct ib_pd *pd, struct device *dma_device,
 		ph = dmah->ph;
 		if (dmah->valid_fields & BIT(IB_DMAH_CPU_ID_EXISTS))
 			st_index = mdmah->st_index;
+
+		err = mlx5_ib_get_frmr_st_handle_ref(dev, st_index);
+		if (err) {
+			ib_umem_release(&umem_dmabuf->umem);
+			return ERR_PTR(err);
+		}
+	} else {
+		get_tph_mr_dmabuf(dev, umem_dmabuf->attach->dmabuf,
+				  &st_index, &ph);
 	}
 
 	mr = alloc_cacheable_mr(pd, &umem_dmabuf->umem, virt_addr,
 				access_flags, access_mode,
 				st_index, ph);
 	if (IS_ERR(mr)) {
+		mlx5_ib_put_st_index_ref(dev, st_index);
 		ib_umem_release(&umem_dmabuf->umem);
 		return ERR_CAST(mr);
 	}
@@ -1400,6 +1501,8 @@ static int mlx5r_handle_mkey_cleanup(struct mlx5_ib_mr *mr)
 		dma_resv_unlock(
 			to_ib_umem_dmabuf(mr->umem)->attach->dmabuf->resv);
 	}
+	if (!ret)
+		mlx5_ib_mr_put_frmr_st_handle_ref(mr);
 	return ret;
 }
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/st.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/st.c
index 7cedc348790d..877b37b4e639 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/st.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/st.c
@@ -92,23 +92,18 @@ void mlx5_st_destroy(struct mlx5_core_dev *dev)
 	kfree(st);
 }
 
-int mlx5_st_alloc_index(struct mlx5_core_dev *dev, enum tph_mem_type mem_type,
-			unsigned int cpu_uid, u16 *st_index)
+int mlx5_st_alloc_index_by_tag(struct mlx5_core_dev *dev, u16 tag,
+			       u16 *st_index)
 {
 	struct mlx5_st_idx_data *idx_data;
 	struct mlx5_st *st = dev->st;
 	unsigned long index;
 	u32 xa_id;
-	u16 tag;
-	int ret;
+	int ret = 0;
 
 	if (!st)
 		return -EOPNOTSUPP;
 
-	ret = pcie_tph_get_cpu_st(dev->pdev, mem_type, cpu_uid, &tag);
-	if (ret)
-		return ret;
-
 	if (st->direct_mode) {
 		*st_index = tag;
 		return 0;
@@ -152,8 +147,46 @@ int mlx5_st_alloc_index(struct mlx5_core_dev *dev, enum tph_mem_type mem_type,
 	mutex_unlock(&st->lock);
 	return ret;
 }
+EXPORT_SYMBOL_GPL(mlx5_st_alloc_index_by_tag);
+
+int mlx5_st_alloc_index(struct mlx5_core_dev *dev, enum tph_mem_type mem_type,
+			unsigned int cpu_uid, u16 *st_index)
+{
+	u16 tag;
+	int ret;
+
+	ret = pcie_tph_get_cpu_st(dev->pdev, mem_type, cpu_uid, &tag);
+	if (ret)
+		return ret;
+
+	return mlx5_st_alloc_index_by_tag(dev, tag, st_index);
+}
 EXPORT_SYMBOL_GPL(mlx5_st_alloc_index);
 
+int mlx5_st_get_index(struct mlx5_core_dev *dev, u16 st_index)
+{
+	struct mlx5_st_idx_data *idx_data;
+	struct mlx5_st *st = dev->st;
+	int ret = 0;
+
+	if (!st)
+		return -EOPNOTSUPP;
+
+	if (st->direct_mode)
+		return 0;
+
+	mutex_lock(&st->lock);
+	idx_data = xa_load(&st->idx_xa, st_index);
+	if (WARN_ON_ONCE(!idx_data))
+		ret = -EINVAL;
+	else
+		refcount_inc(&idx_data->usecount);
+	mutex_unlock(&st->lock);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(mlx5_st_get_index);
+
 int mlx5_st_dealloc_index(struct mlx5_core_dev *dev, u16 st_index)
 {
 	struct mlx5_st_idx_data *idx_data;
diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
index 04b96c5abb57..0480b5c4f189 100644
--- a/include/linux/mlx5/driver.h
+++ b/include/linux/mlx5/driver.h
@@ -1166,10 +1166,22 @@ int mlx5_dm_sw_icm_dealloc(struct mlx5_core_dev *dev, enum mlx5_sw_icm_type type
 			   u64 length, u16 uid, phys_addr_t addr, u32 obj_id);
 
 #ifdef CONFIG_PCIE_TPH
+int mlx5_st_alloc_index_by_tag(struct mlx5_core_dev *dev, u16 tag,
+			       u16 *st_index);
+int mlx5_st_get_index(struct mlx5_core_dev *dev, u16 st_index);
 int mlx5_st_alloc_index(struct mlx5_core_dev *dev, enum tph_mem_type mem_type,
 			unsigned int cpu_uid, u16 *st_index);
 int mlx5_st_dealloc_index(struct mlx5_core_dev *dev, u16 st_index);
 #else
+static inline int mlx5_st_alloc_index_by_tag(struct mlx5_core_dev *dev,
+					     u16 tag, u16 *st_index)
+{
+	return -EOPNOTSUPP;
+}
+static inline int mlx5_st_get_index(struct mlx5_core_dev *dev, u16 st_index)
+{
+	return -EOPNOTSUPP;
+}
 static inline int mlx5_st_alloc_index(struct mlx5_core_dev *dev,
 				      enum tph_mem_type mem_type,
 				      unsigned int cpu_uid, u16 *st_index)
diff --git a/include/rdma/frmr_pools.h b/include/rdma/frmr_pools.h
index af1b88801fa4..a08d2b2cf9f3 100644
--- a/include/rdma/frmr_pools.h
+++ b/include/rdma/frmr_pools.h
@@ -24,7 +24,8 @@ struct ib_frmr_key {
 struct ib_frmr_pool_ops {
 	int (*create_frmrs)(struct ib_device *device, struct ib_frmr_key *key,
 			    u32 *handles, u32 count);
-	void (*destroy_frmrs)(struct ib_device *device, u32 *handles,
+	void (*destroy_frmrs)(struct ib_device *device,
+			      const struct ib_frmr_key *key, u32 *handles,
 			      u32 count);
 	int (*build_key)(struct ib_device *device, const struct ib_frmr_key *in,
 			 struct ib_frmr_key *out);
@@ -33,7 +34,7 @@ struct ib_frmr_pool_ops {
 int ib_frmr_pools_init(struct ib_device *device,
 		       const struct ib_frmr_pool_ops *pool_ops);
 void ib_frmr_pools_cleanup(struct ib_device *device);
-int ib_frmr_pool_pop(struct ib_device *device, struct ib_mr *mr);
+int ib_frmr_pool_pop(struct ib_device *device, struct ib_mr *mr, bool *reused);
 int ib_frmr_pool_push(struct ib_device *device, struct ib_mr *mr);
 
 #endif /* FRMR_POOLS_H */
-- 
2.53.0-Meta


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH v7 1/5] net/mlx5: free mlx5_st_idx_data on final dealloc
  2026-06-10 19:31 ` [PATCH v7 1/5] net/mlx5: free mlx5_st_idx_data on final dealloc Zhiping Zhang
@ 2026-06-11  7:47   ` Christian König
  2026-06-11 22:53     ` Zhiping Zhang
  2026-06-11 20:25   ` sashiko-bot
  1 sibling, 1 reply; 25+ messages in thread
From: Christian König @ 2026-06-11  7:47 UTC (permalink / raw)
  To: Zhiping Zhang, Alex Williamson, Jason Gunthorpe, Leon Romanovsky,
	Sumit Semwal
  Cc: Bjorn Helgaas, kvm, linux-rdma, linux-pci, netdev, dri-devel,
	Keith Busch, Yochai Cohen, Yishai Hadas

On 6/10/26 21:31, Zhiping Zhang wrote:
> When the last reference to an ST table entry is dropped,
> mlx5_st_dealloc_index() removed the entry from idx_xa but leaked the
> backing mlx5_st_idx_data allocation. Repeated alloc/dealloc cycles
> therefore accumulate one struct mlx5_st_idx_data per cycle.
> 
> Free idx_data after the xa_erase() so the lifetime of the bookkeeping
> struct matches the lifetime of the ST entry it tracks.
> 
> Fixes: 888a7776f4fb ("net/mlx5: Add support for device steering tag")
> Signed-off-by: Zhiping Zhang <zhipingz@meta.com>

Since this is an obvious bug fix, I think it shouldn't be part of this patch set and should go upstream completely independently.

Regards,
Christian.

> ---
>  drivers/net/ethernet/mellanox/mlx5/core/lib/st.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/st.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/st.c
> index 997be91f0a13..7cedc348790d 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/lib/st.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/st.c
> @@ -175,6 +175,7 @@ int mlx5_st_dealloc_index(struct mlx5_core_dev *dev, u16 st_index)
>  
>  	if (refcount_dec_and_test(&idx_data->usecount)) {
>  		xa_erase(&st->idx_xa, st_index);
> +		kfree(idx_data);
>  		/* We leave PCI config space as was before, no mkey will refer to it */
>  	}
>  
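
For illustration only, a minimal userspace sketch of the lifetime rule the fix restores: the bookkeeping struct is freed at the same point where the last reference erases the slot. The names st_get(), st_put(), st_table and st_live_entries are hypothetical stand-ins (a plain array replaces the xarray, a plain counter replaces refcount_t), not the mlx5 code.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the ST table and its per-entry bookkeeping. */
#define ST_SLOTS 8

struct st_entry {
	unsigned int usecount;	/* number of users holding this index */
};

static struct st_entry *st_table[ST_SLOTS];
static int st_live_entries;	/* allocations not yet freed, for leak checks */

/* Take a reference on a slot, allocating its bookkeeping on first use. */
static int st_get(unsigned int idx)
{
	assert(idx < ST_SLOTS);
	if (!st_table[idx]) {
		st_table[idx] = calloc(1, sizeof(*st_table[idx]));
		if (!st_table[idx])
			return -1;
		st_live_entries++;
	}
	st_table[idx]->usecount++;
	return 0;
}

/* Drop a reference; the last put erases the slot and frees the entry. */
static int st_put(unsigned int idx)
{
	struct st_entry *e;

	assert(idx < ST_SLOTS);
	e = st_table[idx];
	if (!e)
		return -1;
	if (--e->usecount == 0) {
		st_table[idx] = NULL;	/* corresponds to the xa_erase() step */
		free(e);		/* corresponds to the kfree() the fix adds */
		st_live_entries--;
	}
	return 0;
}
```

Without the free() in st_put(), st_live_entries would grow by one on every get/put cycle, which is the leak the commit message describes.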


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7 3/5] dma-buf: add optional get_tph() callback
  2026-06-10 19:31 ` [PATCH v7 3/5] dma-buf: add optional get_tph() callback Zhiping Zhang
@ 2026-06-11 10:35   ` Christian König
  2026-06-11 23:07     ` Zhiping Zhang
  2026-06-11 20:26   ` sashiko-bot
  1 sibling, 1 reply; 25+ messages in thread
From: Christian König @ 2026-06-11 10:35 UTC (permalink / raw)
  To: Zhiping Zhang, Alex Williamson, Jason Gunthorpe, Leon Romanovsky,
	Sumit Semwal
  Cc: Bjorn Helgaas, kvm, linux-rdma, linux-pci, netdev, dri-devel,
	Keith Busch, Yochai Cohen, Yishai Hadas

On 6/10/26 21:31, Zhiping Zhang wrote:
> Add an optional dma_buf_ops.get_tph callback and a dma_buf_get_tph()
> wrapper for importers.
> 
> 8-bit ST and 16-bit Extended ST are distinct PCIe TPH namespaces, so
> the importer requests the namespace it can emit and the exporter
> returns the matching ST/PH tuple or -EOPNOTSUPP.
> 
> dma_buf_get_tph() is the importer entry point. It returns -EOPNOTSUPP
> when the exporter lacks the callback and requires dmabuf->resv to be
> held while the callback runs.
> 
> The first user is VFIO_DEVICE_FEATURE_DMA_BUF_TPH in vfio-pci, with
> mlx5 as the first importer.
> 
> Signed-off-by: Zhiping Zhang <zhipingz@meta.com>
> ---
>  drivers/dma-buf/dma-buf.c | 25 +++++++++++++++++++++++++
>  include/linux/dma-buf.h   | 21 +++++++++++++++++++++
>  2 files changed, 46 insertions(+)
> 
> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> index d504c636dc29..aff79ea12e43 100644
> --- a/drivers/dma-buf/dma-buf.c
> +++ b/drivers/dma-buf/dma-buf.c
> @@ -1144,6 +1144,31 @@ void dma_buf_unpin(struct dma_buf_attachment *attach)
>  }
>  EXPORT_SYMBOL_NS_GPL(dma_buf_unpin, "DMA_BUF");
>  
> +/**
> + * dma_buf_get_tph - Retrieve TPH metadata from an exporter
> + * @dmabuf: DMA buffer to query
> + * @extended: false for 8-bit ST, true for 16-bit Extended ST
> + * @steering_tag: returns the raw steering tag for the requested namespace
> + * @ph: returns the TPH processing hint
> + *
> + * Wrapper for the optional &dma_buf_ops.get_tph callback.
> + *
> + * Must be called with &dma_buf.resv held. Returns -EOPNOTSUPP if the
> + * exporter does not implement the callback or has no metadata for the
> + * requested namespace.
> + */
> +int dma_buf_get_tph(struct dma_buf *dmabuf, bool extended,
> +		    u16 *steering_tag, u8 *ph)

That name needs improvement, maybe something like dma_buf_get_pci_tph().

It also needs a brief explanation of what TPH is, maybe a reference to the PCIe spec name, etc.

And document in the list of functions that this one should be called with the lock held.

> +{
> +	dma_resv_assert_held(dmabuf->resv);
> +
> +	if (!dmabuf->ops->get_tph)
> +		return -EOPNOTSUPP;
> +
> +	return dmabuf->ops->get_tph(dmabuf, extended, steering_tag, ph);
> +}
> +EXPORT_SYMBOL_NS_GPL(dma_buf_get_tph, "DMA_BUF");
> +
>  /**
>   * dma_buf_map_attachment - Returns the scatterlist table of the attachment;
>   * mapped into _device_ address space. Is a wrapper for map_dma_buf() of the
> diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
> index d1203da56fc5..6a54e0f251a2 100644
> --- a/include/linux/dma-buf.h
> +++ b/include/linux/dma-buf.h
> @@ -113,6 +113,25 @@ struct dma_buf_ops {
>  	 */
>  	void (*unpin)(struct dma_buf_attachment *attach);
>  
> +	/**
> +	 * @get_tph:
> +	 * @dmabuf: DMA buffer for which to retrieve TPH metadata
> +	 * @extended: false for 8-bit ST, true for 16-bit Extended ST
> +	 * @steering_tag: Returns the raw TPH steering tag for the requested
> +	 *                namespace
> +	 * @ph: Returns the TPH processing hint (2-bit value)
> +	 *
> +	 * Return TPH metadata for the namespace selected by @extended. Return
> +	 * 0 on success, or -EOPNOTSUPP if no metadata is available.
> +	 *
> +	 * This callback is optional. Importers must not call it directly;
> +	 * the dma_buf_get_tph() wrapper is the only entry point and handles
> +	 * the NULL-callback case. The callback is invoked with
> +	 * &dma_buf.resv held.

Most of that should be obvious; we only need to say that it's optional and that the lock must be held. Everything else can be dropped.

Most of the description/documentation should live on the wrapper function; exporters who implement the callback should know what they are doing.

Regards,
Christian.

> +	 */
> +	int (*get_tph)(struct dma_buf *dmabuf, bool extended,
> +		       u16 *steering_tag, u8 *ph);
> +
>  	/**
>  	 * @map_dma_buf:
>  	 *
> @@ -563,6 +582,8 @@ void dma_buf_detach(struct dma_buf *dmabuf,
>  		    struct dma_buf_attachment *attach);
>  int dma_buf_pin(struct dma_buf_attachment *attach);
>  void dma_buf_unpin(struct dma_buf_attachment *attach);
> +int dma_buf_get_tph(struct dma_buf *dmabuf, bool extended,
> +		    u16 *steering_tag, u8 *ph);
>  
>  struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info);
>  


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7 5/5] RDMA/mlx5: get tph for p2p access when registering dma-buf mr
  2026-06-10 19:31 ` [PATCH v7 5/5] RDMA/mlx5: get tph for p2p access when registering dma-buf mr Zhiping Zhang
@ 2026-06-11 12:44   ` Michael Gur
  2026-06-11 23:09     ` Zhiping Zhang
  2026-06-11 20:26   ` sashiko-bot
  1 sibling, 1 reply; 25+ messages in thread
From: Michael Gur @ 2026-06-11 12:44 UTC (permalink / raw)
  To: Zhiping Zhang, Alex Williamson, Jason Gunthorpe, Leon Romanovsky,
	Sumit Semwal, Christian Konig
  Cc: Bjorn Helgaas, kvm, linux-rdma, linux-pci, netdev, dri-devel,
	Keith Busch, Yochai Cohen, Yishai Hadas


On 6/10/2026 10:31 PM, Zhiping Zhang wrote:
> Query dma-buf TPH metadata when registering a dma-buf MR for peer-to-
> peer access to a PCIe endpoint and use it to program requester-side TPH
> on the outbound mkey. If the exporter has no metadata, fall back to the
> existing no-TPH path.
>
> For TPH-backed FRMRs, make the extra ST-table reference belong to the
> hardware mkey handle rather than the transient MR object. Extend the
> FRMR pool API so reuse and final destroy can transfer and drop that ref
> at the handle lifetime boundaries, and add mlx5_st_get_index() to take
> a ref on an already-known ST index.
I'd keep the ST reference tied to MRs, where the ST is actually in use.
There's no functional need to couple ST refcounting to mkey lifetime.
Once an MR is destroyed and its mkey revoked, the mkey can no longer 
generate traffic, it's just an idle entry in the FRMR pool waiting to be 
aged out or reused.
This lets us drop all FRMR pool changes from this patch and keep a 
simple flow of 'acquire on MR create, release on MR destroy'.
> Also decode PH from kernel_vendor_key when recreating pooled mkeys so
> the requester hint matches the pool key.
I fixed that in a series I sent earlier this week; please rebase the
next version on top of it.

Thanks,
Michael
> Signed-off-by: Zhiping Zhang <zhipingz@meta.com>
> ---
>   drivers/infiniband/core/frmr_pools.c          |  20 +++-
>   drivers/infiniband/hw/mlx5/mr.c               | 111 +++++++++++++++++-
>   .../net/ethernet/mellanox/mlx5/core/lib/st.c  |  49 ++++++--
>   include/linux/mlx5/driver.h                   |  12 ++
>   include/rdma/frmr_pools.h                     |   5 +-
>   5 files changed, 178 insertions(+), 19 deletions(-)
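
As a side note on the PH decode mentioned in the commit message above, here is a small standalone sketch of the pack/unpack arithmetic. The two masks are the values quoted in the patch; the shift of 16 is an assumption inferred from the PH mask, and the key_*() helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Masks mirror the values quoted in the patch. The shift of 16 is an
 * assumption inferred from the PH mask, not taken from the source.
 */
#define KEY_ST_INDEX_MASK 0xFFFFu
#define KEY_PH_MASK       0xFF0000u
#define KEY_PH_SHIFT      16

/* Pack an ST index and a processing hint into one key. */
static uint64_t key_pack(uint16_t st_index, uint8_t ph)
{
	return (uint64_t)st_index | ((uint64_t)ph << KEY_PH_SHIFT);
}

/* Recover the ST index from the low 16 bits. */
static uint16_t key_st_index(uint64_t key)
{
	return (uint16_t)(key & KEY_ST_INDEX_MASK);
}

/* Recover the hint: mask first, then shift back down. */
static uint8_t key_ph(uint64_t key)
{
	return (uint8_t)((key & KEY_PH_MASK) >> KEY_PH_SHIFT);
}

/* Mask without shift: narrowing to 8 bits discards the hint. */
static uint8_t key_ph_unshifted(uint64_t key)
{
	return (uint8_t)(key & KEY_PH_MASK);
}
```

If the decoded value is stored in an 8-bit variable, the unshifted form always yields 0, so a recreated pooled mkey would not carry the hint that its pool key encodes.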

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v7 4/5] vfio/pci: implement get_tph and DMA_BUF_TPH feature
  2026-06-11 16:11 [PATCH v7 0/5] vfio/dma-buf: add TPH support for peer-to-peer access Zhiping Zhang
@ 2026-06-11 16:11 ` Zhiping Zhang
  2026-06-12 16:46   ` sashiko-bot
  2026-06-12 17:10   ` Alex Williamson
  0 siblings, 2 replies; 25+ messages in thread
From: Zhiping Zhang @ 2026-06-11 16:11 UTC (permalink / raw)
  To: netdev; +Cc: kvm, linux-rdma, linux-pci, dri-devel, Zhiping Zhang

Implement dma-buf get_tph for vfio-pci exported dma-bufs and add
VFIO_DEVICE_FEATURE_DMA_BUF_TPH so userspace can publish TPH metadata
for a VFIO-owned device.

8-bit ST and 16-bit Extended ST are distinct PCIe TPH namespaces; the
uAPI carries both with explicit validity flags, and get_tph() returns
the value matching the importer's requested namespace or -EOPNOTSUPP.

Publish and read the TPH descriptor under dmabuf->resv, matching the
locking used for other importer-visible dma-buf state. The SET ioctl
takes dma_resv_lock_interruptible(), while the callback runs under
DMA-buf's asserted resv lock.

Reject requests the device cannot consume as a completer:
pcie_tph_completer_type() must report at least
PCI_EXP_DEVCAP2_TPH_COMP_TPH_ONLY, and Extended ST requires
PCI_EXP_DEVCAP2_TPH_COMP_EXT_TPH. Validate fields before the completer
check so userspace gets the narrowest errno.
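
For readers who want the ordering spelled out, a minimal standalone sketch of the check sequence described above, assuming a simplified three-level completer enum in place of the PCI_EXP_DEVCAP2_TPH_COMP_* values. The function name tph_check_request() and the enum are hypothetical; the flag bits match the uAPI below.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Flag bits as defined by the uAPI in this patch. */
#define TPH_ST      (1u << 0)
#define TPH_ST_EXT  (1u << 1)

/* Hypothetical completer levels standing in for the DEVCAP2 encodings. */
enum completer { COMP_NONE, COMP_TPH_ONLY, COMP_EXT_TPH };

static int tph_check_request(uint32_t flags, uint8_t ph, enum completer comp)
{
	/* Field validation first: malformed input is always -EINVAL. */
	if (flags & ~(TPH_ST | TPH_ST_EXT))
		return -EINVAL;
	if (ph & ~0x3)
		return -EINVAL;

	/* Capability checks second: well-formed but unsupported. */
	if (comp == COMP_NONE)
		return -EOPNOTSUPP;
	if ((flags & TPH_ST_EXT) && comp != COMP_EXT_TPH)
		return -EOPNOTSUPP;

	return 0;
}
```

With this order, a request that is both malformed and aimed at a device without completer support reports -EINVAL rather than -EOPNOTSUPP.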

Signed-off-by: Zhiping Zhang <zhipingz@meta.com>
---
 drivers/vfio/pci/vfio_pci_core.c   |  3 +
 drivers/vfio/pci/vfio_pci_dmabuf.c | 94 +++++++++++++++++++++++++++++-
 drivers/vfio/pci/vfio_pci_priv.h   | 12 ++++
 include/uapi/linux/vfio.h          | 37 ++++++++++++
 4 files changed, 145 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 050e7542952e..4fa36f2f7555 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1569,6 +1569,9 @@ int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
 		return vfio_pci_core_feature_token(vdev, flags, arg, argsz);
 	case VFIO_DEVICE_FEATURE_DMA_BUF:
 		return vfio_pci_core_feature_dma_buf(vdev, flags, arg, argsz);
+	case VFIO_DEVICE_FEATURE_DMA_BUF_TPH:
+		return vfio_pci_core_feature_dma_buf_tph(vdev, flags, arg,
+							 argsz);
 	default:
 		return -ENOTTY;
 	}
diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c
index 1a177ce7de54..0a0705c8dbea 100644
--- a/drivers/vfio/pci/vfio_pci_dmabuf.c
+++ b/drivers/vfio/pci/vfio_pci_dmabuf.c
@@ -3,6 +3,7 @@
  */
 #include <linux/dma-buf-mapping.h>
 #include <linux/pci-p2pdma.h>
+#include <linux/pci-tph.h>
 #include <linux/dma-resv.h>
 
 #include "vfio_pci_priv.h"
@@ -19,7 +20,12 @@ struct vfio_pci_dma_buf {
 	u32 nr_ranges;
 	struct kref kref;
 	struct completion comp;
-	u8 revoked : 1;
+	u8 tph_st_valid:1;
+	u8 tph_st_ext_valid:1;
+	u8 tph_ph:2;
+	u8 tph_st;
+	u16 tph_st_ext;
+	u8 revoked:1;
 };
 
 static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf,
@@ -69,6 +75,26 @@ vfio_pci_dma_buf_map(struct dma_buf_attachment *attachment,
 	return ret;
 }
 
+static int vfio_pci_dma_buf_get_tph(struct dma_buf *dmabuf, bool extended,
+				    u16 *steering_tag, u8 *ph)
+{
+	struct vfio_pci_dma_buf *priv = dmabuf->priv;
+
+	dma_resv_assert_held(dmabuf->resv);
+
+	if (extended) {
+		if (!priv->tph_st_ext_valid)
+			return -EOPNOTSUPP;
+		*steering_tag = priv->tph_st_ext;
+	} else {
+		if (!priv->tph_st_valid)
+			return -EOPNOTSUPP;
+		*steering_tag = priv->tph_st;
+	}
+	*ph = priv->tph_ph;
+	return 0;
+}
+
 static void vfio_pci_dma_buf_unmap(struct dma_buf_attachment *attachment,
 				   struct sg_table *sgt,
 				   enum dma_data_direction dir)
@@ -101,6 +127,7 @@ static void vfio_pci_dma_buf_release(struct dma_buf *dmabuf)
 
 static const struct dma_buf_ops vfio_pci_dmabuf_ops = {
 	.attach = vfio_pci_dma_buf_attach,
+	.get_tph = vfio_pci_dma_buf_get_tph,
 	.map_dma_buf = vfio_pci_dma_buf_map,
 	.unmap_dma_buf = vfio_pci_dma_buf_unmap,
 	.release = vfio_pci_dma_buf_release,
@@ -333,6 +360,71 @@ int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
 	return ret;
 }
 
+int vfio_pci_core_feature_dma_buf_tph(struct vfio_pci_core_device *vdev,
+				      u32 flags,
+				      struct vfio_device_feature_dma_buf_tph __user *arg,
+				      size_t argsz)
+{
+	struct vfio_device_feature_dma_buf_tph set_tph;
+	struct vfio_pci_dma_buf *priv;
+	struct dma_buf *dmabuf;
+	u8 comp;
+	int ret;
+
+	ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_SET,
+				 sizeof(set_tph));
+	if (ret != 1)
+		return ret;
+
+	if (copy_from_user(&set_tph, arg, sizeof(set_tph)))
+		return -EFAULT;
+
+	if (set_tph.flags & ~(VFIO_DMA_BUF_TPH_ST | VFIO_DMA_BUF_TPH_ST_EXT))
+		return -EINVAL;
+
+	if (set_tph.ph & ~0x3)
+		return -EINVAL;
+
+	comp = pcie_tph_completer_type(vdev->pdev);
+	if (comp == PCI_EXP_DEVCAP2_TPH_COMP_NONE)
+		return -EOPNOTSUPP;
+	if ((set_tph.flags & VFIO_DMA_BUF_TPH_ST_EXT) &&
+	    comp != PCI_EXP_DEVCAP2_TPH_COMP_EXT_TPH)
+		return -EOPNOTSUPP;
+
+	dmabuf = dma_buf_get(set_tph.dmabuf_fd);
+	if (IS_ERR(dmabuf))
+		return PTR_ERR(dmabuf);
+
+	if (dmabuf->ops != &vfio_pci_dmabuf_ops) {
+		ret = -EINVAL;
+		goto out_put;
+	}
+
+	priv = dmabuf->priv;
+	if (priv->vdev != vdev) {
+		ret = -EINVAL;
+		goto out_put;
+	}
+
+	ret = dma_resv_lock_interruptible(dmabuf->resv, NULL);
+	if (ret)
+		goto out_put;
+
+	priv->tph_st         = set_tph.steering_tag;
+	priv->tph_st_ext     = set_tph.steering_tag_ext;
+	priv->tph_ph         = set_tph.ph;
+	priv->tph_st_valid   = !!(set_tph.flags & VFIO_DMA_BUF_TPH_ST);
+	priv->tph_st_ext_valid =
+		!!(set_tph.flags & VFIO_DMA_BUF_TPH_ST_EXT);
+	dma_resv_unlock(dmabuf->resv);
+	ret = 0;
+
+out_put:
+	dma_buf_put(dmabuf);
+	return ret;
+}
+
 void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked)
 {
 	struct vfio_pci_dma_buf *priv;
diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h
index fca9d0dfac90..c58f369be4b3 100644
--- a/drivers/vfio/pci/vfio_pci_priv.h
+++ b/drivers/vfio/pci/vfio_pci_priv.h
@@ -118,6 +118,10 @@ static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
 int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
 				  struct vfio_device_feature_dma_buf __user *arg,
 				  size_t argsz);
+int vfio_pci_core_feature_dma_buf_tph(struct vfio_pci_core_device *vdev,
+				      u32 flags,
+				      struct vfio_device_feature_dma_buf_tph __user *arg,
+				      size_t argsz);
 void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev);
 void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked);
 #else
@@ -128,6 +132,14 @@ vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
 {
 	return -ENOTTY;
 }
+
+static inline int
+vfio_pci_core_feature_dma_buf_tph(struct vfio_pci_core_device *vdev, u32 flags,
+				  struct vfio_device_feature_dma_buf_tph __user *arg,
+				  size_t argsz)
+{
+	return -ENOTTY;
+}
 static inline void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev)
 {
 }
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 5de618a3a5ee..5dd693220a0d 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -1534,6 +1534,43 @@ struct vfio_device_feature_dma_buf {
  */
 #define VFIO_DEVICE_FEATURE_MIG_PRECOPY_INFOv2  12
 
+/**
+ * Upon VFIO_DEVICE_FEATURE_SET associate TPH (TLP Processing Hints) metadata
+ * with a vfio-exported dma-buf. The dma-buf must have been created by
+ * VFIO_DEVICE_FEATURE_DMA_BUF on this device, and the device must report
+ * TPH Completer support in Device Capabilities 2 (bits 13:12); requests
+ * carrying VFIO_DMA_BUF_TPH_ST_EXT additionally require the device to
+ * report the Extended TPH Completer encoding. Otherwise the ioctl
+ * returns -EOPNOTSUPP.
+ *
+ * dmabuf_fd is the file descriptor returned by VFIO_DEVICE_FEATURE_DMA_BUF.
+ *
+ * 8-bit ST (steering_tag) and 16-bit Extended ST (steering_tag_ext) are
+ * distinct namespaces. Userspace supplies whichever values are valid and sets
+ * the matching VFIO_DMA_BUF_TPH_ST / VFIO_DMA_BUF_TPH_ST_EXT bits in @flags;
+ * an importer requests one namespace and receives the matching value.
+ *
+ * @flags == 0 marks any previously published ST / Extended-ST as invalid
+ * for future get_tph() requests on this dma-buf.
+ *
+ * ph is the 2-bit TLP Processing Hint and must be in the range [0, 3].
+ *
+ * Userspace must publish TPH before handing the dma-buf fd to an importer.
+ * Calling SET again replaces the published values.
+ */
+#define VFIO_DEVICE_FEATURE_DMA_BUF_TPH 13
+
+#define VFIO_DMA_BUF_TPH_ST		(1 << 0)
+#define VFIO_DMA_BUF_TPH_ST_EXT		(1 << 1)
+
+struct vfio_device_feature_dma_buf_tph {
+	__s32	dmabuf_fd;
+	__u32	flags;
+	__u16	steering_tag_ext;
+	__u8	steering_tag;
+	__u8	ph;
+};
+
 /* -------- API for Type1 VFIO IOMMU -------- */
 
 /**
-- 
2.53.0-Meta


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH v7 4/5] vfio/pci: implement get_tph and DMA_BUF_TPH feature
  2026-06-10 19:31 ` [PATCH v7 4/5] vfio/pci: implement get_tph and DMA_BUF_TPH feature Zhiping Zhang
@ 2026-06-11 20:25   ` sashiko-bot
  2026-06-11 23:02     ` Zhiping Zhang
  0 siblings, 1 reply; 25+ messages in thread
From: sashiko-bot @ 2026-06-11 20:25 UTC (permalink / raw)
  To: Zhiping Zhang; +Cc: linux-pci, kvm, dri-devel

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
- [Medium] Unconditional validation of the flag-gated 'ph' parameter can lead to spurious -EINVAL errors, breaking the syscall trust boundary.
- [Medium] VFIO_DEVICE_FEATURE_PROBE falsely reports that VFIO_DEVICE_FEATURE_DMA_BUF_TPH is supported even if the hardware lacks TPH support.
--

commit d57637d667ca6dc95def77a0a5bd339d62d4f506
Author: Zhiping Zhang <zhipingz@meta.com>

vfio/pci: implement get_tph and DMA_BUF_TPH feature

This commit implements dma-buf get_tph for vfio-pci exported dma-bufs and
adds the VFIO_DEVICE_FEATURE_DMA_BUF_TPH feature. This allows userspace to
publish TPH metadata for a VFIO-owned device. It checks completer support
and protects the metadata under dmabuf->resv.

> diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c
> index 1a177ce7de546..0a0705c8dbea6 100644
> --- a/drivers/vfio/pci/vfio_pci_dmabuf.c
> +++ b/drivers/vfio/pci/vfio_pci_dmabuf.c

[ ... ]

> @@ -333,6 +360,71 @@ int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
>  	return ret;
>  }
>  
> +int vfio_pci_core_feature_dma_buf_tph(struct vfio_pci_core_device *vdev,
> +				      u32 flags,
> +				      struct vfio_device_feature_dma_buf_tph __user *arg,
> +				      size_t argsz)
> +{
> +	struct vfio_device_feature_dma_buf_tph set_tph;
> +	struct vfio_pci_dma_buf *priv;
> +	struct dma_buf *dmabuf;
> +	u8 comp;
> +	int ret;
> +
> +	ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_SET,
> +				 sizeof(set_tph));
> +	if (ret != 1)
> +		return ret;

[Severity: Medium]
vfio_check_feature() returns 0 for VFIO_DEVICE_FEATURE_PROBE, causing an 
early return of success. 

Should the hardware capability check (pcie_tph_completer_type) occur before
checking the feature flags, to avoid falsely reporting that the feature is
supported on devices that lack TPH?

> +
> +	if (copy_from_user(&set_tph, arg, sizeof(set_tph)))
> +		return -EFAULT;
> +
> +	if (set_tph.flags & ~(VFIO_DMA_BUF_TPH_ST | VFIO_DMA_BUF_TPH_ST_EXT))
> +		return -EINVAL;
> +
> +	if (set_tph.ph & ~0x3)
> +		return -EINVAL;

[Severity: Medium]
When userspace clears the TPH metadata by passing flags == 0, is it required
to zero-fill the unused ph field? 

Since this check is unconditional, could stack garbage in the uninitialized 
ph field cause spurious -EINVAL errors, rejecting a valid invalidation 
request?
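
One way userspace can stay clear of this regardless of how the kernel check ends up, sketched under the assumption that the struct layout from the patch is kept: zero the whole request before filling it. struct tph_request is a local mirror of the proposed uAPI struct and tph_request_clear() is a hypothetical helper, not part of any existing library.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of the struct layout proposed in the patch. */
struct tph_request {
	int32_t  dmabuf_fd;
	uint32_t flags;
	uint16_t steering_tag_ext;
	uint8_t  steering_tag;
	uint8_t  ph;
};

/* Build a request that clears any published TPH metadata (flags == 0). */
static void tph_request_clear(struct tph_request *req, int32_t dmabuf_fd)
{
	assert(req);
	/* Zero every field so ph and both tags hold 0, not stack garbage. */
	memset(req, 0, sizeof(*req));
	req->dmabuf_fd = dmabuf_fd;
}
```

A request built this way passes the unconditional ph range check, since the hint is 0 whenever no flag is set.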

> +
> +	comp = pcie_tph_completer_type(vdev->pdev);
> +	if (comp == PCI_EXP_DEVCAP2_TPH_COMP_NONE)
> +		return -EOPNOTSUPP;

[ ... ]

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260610193158.2614209-1-zhipingz@meta.com?part=4

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7 2/5] PCI/TPH: Add requester/completer type helpers
  2026-06-10 19:31 ` [PATCH v7 2/5] PCI/TPH: Add requester/completer type helpers Zhiping Zhang
@ 2026-06-11 20:25   ` sashiko-bot
  2026-06-11 23:06     ` Zhiping Zhang
  0 siblings, 1 reply; 25+ messages in thread
From: sashiko-bot @ 2026-06-11 20:25 UTC (permalink / raw)
  To: Zhiping Zhang; +Cc: kvm, linux-pci, dri-devel

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:

New issues:
- [Medium] pcie_tph_completer_type() incorrectly returns PCI_EXP_DEVCAP2_TPH_COMP_EXT_TPH when reading from a disconnected or hot-removed device.

Pre-existing issues:
- [High] The internal get_rp_completer_type() function was not updated to use the new pcie_tph_completer_type() helper, meaning it can still return the reserved 0b10 value, which can corrupt the device's TPH Requester Enable state.
--

commit e3087f1c41da11f8e5258fd8b874372276270977
Author: Zhiping Zhang <zhipingz@meta.com>

PCI/TPH: Add requester/completer type helpers

Add pcie_tph_enabled_req_type() and pcie_tph_completer_type() so drivers can
query the enabled TPH requester mode and completer support safely. Fold the
reserved 0b10 encoding into NONE so callers only see defined values.

> diff --git a/drivers/pci/tph.c b/drivers/pci/tph.c
> index 91145e8d9d951..4fe076bba953a 100644
> --- a/drivers/pci/tph.c
> +++ b/drivers/pci/tph.c
[ ... ]
> +u8 pcie_tph_completer_type(struct pci_dev *pdev)
> +{
> +	u32 reg;
> +
> +	if (pcie_capability_read_dword(pdev, PCI_EXP_DEVCAP2, &reg))
> +		return PCI_EXP_DEVCAP2_TPH_COMP_NONE;
> +
> +	switch (FIELD_GET(PCI_EXP_DEVCAP2_TPH_COMP_MASK, reg)) {

[Severity: Medium]
If the device is hot-removed, pcie_capability_read_dword() returns success (0)
but can leave reg as 0xFFFFFFFF (PCI_ERROR_RESPONSE due to master abort).

Will FIELD_GET(PCI_EXP_DEVCAP2_TPH_COMP_MASK, reg) then extract 0b11 (3),
matching PCI_EXP_DEVCAP2_TPH_COMP_EXT_TPH, causing the helper to falsely
advertise that a disconnected device supports Extended TPH? Should this logic
check if reg is ~0?
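
A minimal userspace sketch of the guard being suggested here. The mask, shift, and encodings are local stand-ins assumed to mirror the kernel's PCI_EXP_DEVCAP2_TPH_COMP_* definitions, and tph_completer_type_from_reg() is a hypothetical name, not the helper added by this patch:

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins; the values are assumed to mirror pci_regs.h. */
#define TPH_COMP_MASK		0x00003000u
#define TPH_COMP_SHIFT		12
#define TPH_COMP_NONE		0x0
#define TPH_COMP_TPH_ONLY	0x1
#define TPH_COMP_EXT_TPH	0x3

/* Hypothetical helper: decode an already-read DEVCAP2 value. */
static uint8_t tph_completer_type_from_reg(uint32_t reg)
{
	/* All ones is what a read from a missing device typically yields. */
	if (reg == UINT32_MAX)
		return TPH_COMP_NONE;

	switch ((reg & TPH_COMP_MASK) >> TPH_COMP_SHIFT) {
	case TPH_COMP_TPH_ONLY:
		return TPH_COMP_TPH_ONLY;
	case TPH_COMP_EXT_TPH:
		return TPH_COMP_EXT_TPH;
	default:
		/* Covers "not supported" and the reserved 0b10 encoding. */
		return TPH_COMP_NONE;
	}
}
```

With the all-ones check first, a read from a removed device folds to NONE instead of decoding as 0b11.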

> +	case PCI_EXP_DEVCAP2_TPH_COMP_TPH_ONLY:
> +		return PCI_EXP_DEVCAP2_TPH_COMP_TPH_ONLY;
> +	case PCI_EXP_DEVCAP2_TPH_COMP_EXT_TPH:
> +		return PCI_EXP_DEVCAP2_TPH_COMP_EXT_TPH;
> +	default:
> +		return PCI_EXP_DEVCAP2_TPH_COMP_NONE;
> +	}
> +}
> +EXPORT_SYMBOL(pcie_tph_completer_type);

[Severity: High]
This is a pre-existing issue, but the internal get_rp_completer_type() function
was not updated to use this new pcie_tph_completer_type() helper.

Because it still returns the raw extracted Completer Supported field, it can
return the reserved 0b10 value without applying the safety folding introduced
here:

drivers/pci/tph.c:get_rp_completer_type() {
    ...
	return FIELD_GET(PCI_EXP_DEVCAP2_TPH_COMP_MASK, reg);
}

If the device supports EXT_TPH (3) and the Root Port returns the reserved value
(2), could this incorrectly propagate the reserved value?

drivers/pci/tph.c:pcie_enable_tph() {
    ...
		pdev->tph_req_type = min(pdev->tph_req_type, rp_req_type);
    ...
}

This would then write the reserved value (2) into the hardware's TPH Control
register, misconfiguring the device:

drivers/pci/tph.c:pcie_enable_tph() {
    ...
	reg &= ~PCI_TPH_CTRL_REQ_EN_MASK;
	reg |= FIELD_PREP(PCI_TPH_CTRL_REQ_EN_MASK, pdev->tph_req_type);

	pci_write_config_dword(pdev, pdev->tph_cap + PCI_TPH_CTRL, reg);
    ...
}

Should get_rp_completer_type() be updated to use the new helper?

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260610193158.2614209-1-zhipingz@meta.com?part=2

* Re: [PATCH v7 1/5] net/mlx5: free mlx5_st_idx_data on final dealloc
  2026-06-10 19:31 ` [PATCH v7 1/5] net/mlx5: free mlx5_st_idx_data on final dealloc Zhiping Zhang
  2026-06-11  7:47   ` Christian König
@ 2026-06-11 20:25   ` sashiko-bot
  2026-06-11 22:54     ` Zhiping Zhang
  1 sibling, 1 reply; 25+ messages in thread
From: sashiko-bot @ 2026-06-11 20:25 UTC (permalink / raw)
  To: Zhiping Zhang; +Cc: kvm, linux-pci, dri-devel

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:

Pre-existing issues:
- [Critical] Out-of-bounds read in UVERBS_METHOD_DMAH_ALLOC handler due to missing bounds check on user-supplied cpu_id.
--

commit efb72743b37fe4c5005fca2d75b36ef50832bc6c
Author: Zhiping Zhang <zhipingz@meta.com>

net/mlx5: free mlx5_st_idx_data on final dealloc
    
When the last reference to an ST table entry is dropped,
mlx5_st_dealloc_index() removed the entry from idx_xa but leaked the
backing mlx5_st_idx_data allocation. Repeated alloc/dealloc cycles
therefore accumulate one struct mlx5_st_idx_data per cycle.
    
Free idx_data after the xa_erase() so the lifetime of the bookkeeping
struct matches the lifetime of the ST entry it tracks.
    
Fixes: 888a7776f4fb ("net/mlx5: Add support for device steering tag")
Signed-off-by: Zhiping Zhang <zhipingz@meta.com>

> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/st.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/st.c
> index 997be91f0a135..7cedc348790db 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/lib/st.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/st.c
> @@ -175,6 +175,7 @@ int mlx5_st_dealloc_index(struct mlx5_core_dev *dev, u16 st_index)
>  
>  	if (refcount_dec_and_test(&idx_data->usecount)) {
>  		xa_erase(&st->idx_xa, st_index);
> +		kfree(idx_data);
>  		/* We leave PCI config space as was before, no mkey will refer to it */
>  	}
>  

[Severity: Critical]
This is a pre-existing issue, but does
UVERBS_HANDLER(UVERBS_METHOD_DMAH_ALLOC) in
drivers/infiniband/core/uverbs_std_types_dmah.c lack bounds checking on the
untrusted user-supplied dmah->cpu_id?

drivers/infiniband/core/uverbs_std_types_dmah.c:UVERBS_HANDLER(UVERBS_METHOD_DMAH_ALLOC)() {
    ...
		ret = uverbs_copy_from(&dmah->cpu_id, attrs,
				       UVERBS_ATTR_ALLOC_DMAH_CPU_ID);
		if (ret)
			goto err;

		if (!cpumask_test_cpu(dmah->cpu_id, current->cpus_ptr)) {
			ret = -EPERM;
			goto err;
		}
    ...
}

The dmah->cpu_id is copied directly from an untrusted uverbs attribute
without value validation.

Since cpumask_check only emits a warning if the CPU ID is out of bounds and
does not cap the value, could test_bit compute a bit offset far outside the
bounds of current->cpus_ptr?

Can an attacker exploit this to read unmapped kernel memory (causing a kernel
panic) or infer the contents of kernel memory by observing whether the
syscall succeeds or fails with -EPERM?
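
If the concern holds, the usual shape of a fix is a range check ahead of the bit test. A userspace sketch, assuming a single 64-bit mask; cpu_allowed() and MASK_BITS are hypothetical and only model the ordering, and the kernel would compare against its own CPU limit instead:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define MASK_BITS 64u	/* hypothetical width of the modelled mask */

/*
 * Returns 1 if @cpu is set in @mask, 0 if it is clear, and -EINVAL if
 * @cpu is outside the mask, so the bit test never runs out of range.
 */
static int cpu_allowed(uint32_t cpu, uint64_t mask)
{
	if (cpu >= MASK_BITS)
		return -EINVAL;

	return (int)((mask >> cpu) & 1u);
}
```

The point of the ordering is that an out-of-range ID is rejected before any memory derived from it is touched.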

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260610193158.2614209-1-zhipingz@meta.com?part=1

* Re: [PATCH v7 3/5] dma-buf: add optional get_tph() callback
  2026-06-10 19:31 ` [PATCH v7 3/5] dma-buf: add optional get_tph() callback Zhiping Zhang
  2026-06-11 10:35   ` Christian König
@ 2026-06-11 20:26   ` sashiko-bot
  1 sibling, 0 replies; 25+ messages in thread
From: sashiko-bot @ 2026-06-11 20:26 UTC (permalink / raw)
  To: Zhiping Zhang; +Cc: linux-pci, kvm, dri-devel

> Add an optional dma_buf_ops.get_tph callback and a dma_buf_get_tph()
> wrapper for importers.
> 
> 8-bit ST and 16-bit Extended ST are distinct PCIe TPH namespaces, so
> the importer requests the namespace it can emit and the exporter
> returns the matching ST/PH tuple or -EOPNOTSUPP.
> 
> dma_buf_get_tph() is the importer entry point. It returns -EOPNOTSUPP
> when the exporter lacks the callback and requires dmabuf->resv to be
> held while the callback runs.
> 
> The first user is VFIO_DEVICE_FEATURE_DMA_BUF_TPH in vfio-pci, with
> mlx5 as the first importer.
> 
> Signed-off-by: Zhiping Zhang <zhipingz@meta.com>

Sashiko has reviewed this patch and found no issues. It looks great!

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260610193158.2614209-1-zhipingz@meta.com?part=3


* Re: [PATCH v7 5/5] RDMA/mlx5: get tph for p2p access when registering dma-buf mr
  2026-06-10 19:31 ` [PATCH v7 5/5] RDMA/mlx5: get tph for p2p access when registering dma-buf mr Zhiping Zhang
  2026-06-11 12:44   ` Michael Gur
@ 2026-06-11 20:26   ` sashiko-bot
  1 sibling, 0 replies; 25+ messages in thread
From: sashiko-bot @ 2026-06-11 20:26 UTC (permalink / raw)
  To: Zhiping Zhang; +Cc: dri-devel, kvm, linux-pci

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:

Pre-existing issues:
- [Critical] Off-by-one error in mlx5r_create_mkeys error handling path causes hardware mkey leak and invalid destroy.
- [High] Hardware mkeys are leaked in `ib_frmr_pools_set_pinned` if pushing a handle to the pool's queue fails.
--

commit 08f720e58b52ed548a72c452a9bd15258e2e9264
Author: Zhiping Zhang <zhipingz@meta.com>

RDMA/mlx5: get tph for p2p access when registering dma-buf mr

This commit queries dma-buf TPH metadata when registering a dma-buf MR for
peer-to-peer access, falling back to no-TPH if absent. It also transfers
the ST-table reference ownership to the hardware mkey handle rather than
the transient MR object, and updates the pool API to handle this transfer.

> diff --git a/drivers/infiniband/core/frmr_pools.c b/drivers/infiniband/core/frmr_pools.c
> --- a/drivers/infiniband/core/frmr_pools.c
> +++ b/drivers/infiniband/core/frmr_pools.c
[ ... ]
> @@ -430,16 +430,16 @@ int ib_frmr_pools_set_pinned(struct ib_device *device, struct ib_frmr_key *key,
>  	ret = pools->pool_ops->create_frmrs(device, key, handles,
>  					    needed_handles);
>  	if (ret) {
>  		kfree(handles);
>  		return ret;
>  	}
>  
>  	spin_lock(&pool->lock);
>  	for (i = 0; i < needed_handles; i++) {
>  		ret = push_handle_to_queue_locked(&pool->queue,
>  						  handles[i]);
>  		if (ret)
>  			goto end;
>  	}
>  
>  end:
>  	spin_unlock(&pool->lock);
>  	kfree(handles);

[Severity: High]
This is a pre-existing issue, but can this error path leak the successfully
created hardware mkeys?

In ib_frmr_pools_set_pinned(), up to needed_handles hardware mkeys are
successfully created via create_frmrs(). If push_handle_to_queue_locked()
fails, the code jumps to the end: label and frees the handles array.

However, the mkeys that were successfully created but not yet pushed don't
appear to be explicitly destroyed here. Could this lead to a bulk resource
leak if queue page allocation fails?
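
One way to picture the missing cleanup, as a userspace model only. pin_handles() and its counters are hypothetical and nothing here is the frmr_pools API; the sketch just shows that handles from the failing index onward were created but never queued, so they are the ones left to destroy:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Model: @needed handles were created, the queue accepts @room pushes.
 * Returns how many created handles had to be destroyed on the error path.
 */
static size_t pin_handles(size_t needed, size_t room)
{
	size_t queued = 0, destroyed = 0, i;

	for (i = 0; i < needed; i++) {
		if (queued == room)
			break;		/* push failed at index i */
		queued++;
	}

	/* Handles i..needed-1 were never queued, so nothing else owns them. */
	for (; i < needed; i++)
		destroyed++;

	assert(queued + destroyed == needed);
	return destroyed;
}
```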

> diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
> --- a/drivers/infiniband/hw/mlx5/mr.c
> +++ b/drivers/infiniband/hw/mlx5/mr.c
[ ... ]
> @@ -335,16 +335,16 @@ static int mlx5r_create_mkeys(struct ib_device *device, struct ib_frmr_key *key,
>  	for (i = 0; i < count; i++) {
>  		assign_mkey_variant(dev, handles + i, in);
>  		err = mlx5_core_create_mkey(dev->mdev, handles + i, in, inlen);
>  		if (err)
>  			goto free_in;
>  	}
>  free_in:
>  	kfree(in);
>  	if (err)
>  		for (; i > 0; i--)
>  			mlx5_core_destroy_mkey(dev->mdev, handles[i]);

[Severity: Critical]
This is a pre-existing issue, but does this rollback loop properly clean up
the successfully created mkeys?

When mlx5_core_create_mkey() fails at index i, the rollback loop 
for (; i > 0; i--) attempts to destroy handles[i]. Because the creation
failed, handles[i] still contains the unallocated index preset by
assign_mkey_variant().

Executing mlx5_core_destroy_mkey() on this value could send a destroy
command for an unowned mkey, potentially causing hardware state corruption.

Additionally, since the loop stops at i > 0, does it permanently leak the
successfully created mkey at handles[0]?

>  	return err;
>  }
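
For the rollback loop quoted above, a userspace sketch of the conventional unwind. fake_create(), fake_destroy(), and create_all() are hypothetical stand-ins, not mlx5 functions. `while (i--)` visits i-1 down to 0, so the failed slot is skipped and slot 0 is released:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_KEYS 8

static bool live[MAX_KEYS];	/* which fake keys currently exist */

static int fake_create(size_t idx, size_t fail_at)
{
	if (idx == fail_at)
		return -1;	/* simulated creation failure */
	live[idx] = true;
	return 0;
}

static void fake_destroy(size_t idx)
{
	live[idx] = false;
}

/* Create @count keys (at most MAX_KEYS); on failure undo the ones made. */
static int create_all(size_t count, size_t fail_at)
{
	size_t i;
	int err = 0;

	for (i = 0; i < count; i++) {
		err = fake_create(i, fail_at);
		if (err)
			break;
	}

	if (err)
		while (i--)	/* i-1 ... 0; never touches the failed slot */
			fake_destroy(i);

	return err;
}

static size_t live_count(void)
{
	size_t i, n = 0;

	for (i = 0; i < MAX_KEYS; i++)
		n += live[i];
	return n;
}
```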

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260610193158.2614209-1-zhipingz@meta.com?part=5

* Re: [PATCH v7 1/5] net/mlx5: free mlx5_st_idx_data on final dealloc
  2026-06-11  7:47   ` Christian König
@ 2026-06-11 22:53     ` Zhiping Zhang
  2026-06-11 23:45       ` Zhiping Zhang
  0 siblings, 1 reply; 25+ messages in thread
From: Zhiping Zhang @ 2026-06-11 22:53 UTC (permalink / raw)
  To: Christian König
  Cc: Alex Williamson, Jason Gunthorpe, Leon Romanovsky, Sumit Semwal,
	Bjorn Helgaas, kvm, linux-rdma, linux-pci, netdev, dri-devel,
	Keith Busch, Yochai Cohen, Yishai Hadas

On Thu, Jun 11, 2026 at 12:47 AM Christian König
<christian.koenig@amd.com> wrote:
>
> On 6/10/26 21:31, Zhiping Zhang wrote:
> > When the last reference to an ST table entry is dropped,
> > mlx5_st_dealloc_index() removed the entry from idx_xa but leaked the
> > backing mlx5_st_idx_data allocation. Repeated alloc/dealloc cycles
> > therefore accumulate one struct mlx5_st_idx_data per cycle.
> >
> > Free idx_data after the xa_erase() so the lifetime of the bookkeeping
> > struct matches the lifetime of the ST entry it tracks.
> >
> > Fixes: 888a7776f4fb ("net/mlx5: Add support for device steering tag")
> > Signed-off-by: Zhiping Zhang <zhipingz@meta.com>
>
> Since this is an obvious bug fix, I think it shouldn't be part of this patch set and should go upstream completely independently.
>
> Regards,
> Christian.
>

Sure. Michael replied that he has made a patch to fix it; I will rebase on top of it.

* Re: [PATCH v7 1/5] net/mlx5: free mlx5_st_idx_data on final dealloc
  2026-06-11 20:25   ` sashiko-bot
@ 2026-06-11 22:54     ` Zhiping Zhang
  0 siblings, 0 replies; 25+ messages in thread
From: Zhiping Zhang @ 2026-06-11 22:54 UTC (permalink / raw)
  To: sashiko-reviews; +Cc: kvm, linux-pci, dri-devel

On Thu, Jun 11, 2026 at 1:26 PM <sashiko-bot@kernel.org> wrote:
>
> Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
>
> Pre-existing issues:
> - [Critical] Out-of-bounds read in UVERBS_METHOD_DMAH_ALLOC handler due to missing bounds check on user-supplied cpu_id.
> --
>
> commit efb72743b37fe4c5005fca2d75b36ef50832bc6c
> Author: Zhiping Zhang <zhipingz@meta.com>
>
> net/mlx5: free mlx5_st_idx_data on final dealloc
>
> When the last reference to an ST table entry is dropped,
> mlx5_st_dealloc_index() removed the entry from idx_xa but leaked the
> backing mlx5_st_idx_data allocation. Repeated alloc/dealloc cycles
> therefore accumulate one struct mlx5_st_idx_data per cycle.
>
> Free idx_data after the xa_erase() so the lifetime of the bookkeeping
> struct matches the lifetime of the ST entry it tracks.
>
> Fixes: 888a7776f4fb ("net/mlx5: Add support for device steering tag")
> Signed-off-by: Zhiping Zhang <zhipingz@meta.com>
>
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/st.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/st.c
> > index 997be91f0a135..7cedc348790db 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/lib/st.c
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/st.c
> > @@ -175,6 +175,7 @@ int mlx5_st_dealloc_index(struct mlx5_core_dev *dev, u16 st_index)
> >
> >       if (refcount_dec_and_test(&idx_data->usecount)) {
> >               xa_erase(&st->idx_xa, st_index);
> > +             kfree(idx_data);
> >               /* We leave PCI config space as was before, no mkey will refer to it */
> >       }
> >
>
> [Severity: Critical]
> This is a pre-existing issue, but does
> UVERBS_HANDLER(UVERBS_METHOD_DMAH_ALLOC) in
> drivers/infiniband/core/uverbs_std_types_dmah.c lack bounds checking on the
> untrusted user-supplied dmah->cpu_id?
>
> drivers/infiniband/core/uverbs_std_types_dmah.c:UVERBS_HANDLER(UVERBS_METHOD_DMAH_ALLOC)() {
>     ...
>                 ret = uverbs_copy_from(&dmah->cpu_id, attrs,
>                                        UVERBS_ATTR_ALLOC_DMAH_CPU_ID);
>                 if (ret)
>                         goto err;
>
>                 if (!cpumask_test_cpu(dmah->cpu_id, current->cpus_ptr)) {
>                         ret = -EPERM;
>                         goto err;
>                 }
>     ...
> }
>
> The dmah->cpu_id is copied directly from an untrusted uverbs attribute
> without value validation.
>
> Since cpumask_check only emits a warning if the CPU ID is out of bounds and
> does not cap the value, could test_bit compute a bit offset far outside the
> bounds of current->cpus_ptr?
>
> Can an attacker exploit this to read unmapped kernel memory (causing a kernel
> panic) or infer the contents of kernel memory by observing whether the
> syscall succeeds or fails with -EPERM?
>
> --
> Sashiko AI review · https://sashiko.dev/#/patchset/20260610193158.2614209-1-zhipingz@meta.com?part=1

Not relevant to my patch series.

Zhiping

* Re: [PATCH v7 4/5] vfio/pci: implement get_tph and DMA_BUF_TPH feature
  2026-06-11 20:25   ` sashiko-bot
@ 2026-06-11 23:02     ` Zhiping Zhang
  2026-06-12 16:59       ` Alex Williamson
  0 siblings, 1 reply; 25+ messages in thread
From: Zhiping Zhang @ 2026-06-11 23:02 UTC (permalink / raw)
  To: sashiko-reviews; +Cc: linux-pci, kvm, dri-devel

On Thu, Jun 11, 2026 at 1:25 PM <sashiko-bot@kernel.org> wrote:
>
> Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
> - [Medium] Unconditional validation of the flag-gated 'ph' parameter can lead to spurious -EINVAL errors, breaking the syscall trust boundary.
> - [Medium] VFIO_DEVICE_FEATURE_PROBE falsely reports that VFIO_DEVICE_FEATURE_DMA_BUF_TPH is supported even if the hardware lacks TPH support.
> --
>
> commit d57637d667ca6dc95def77a0a5bd339d62d4f506
> Author: Zhiping Zhang <zhipingz@meta.com>
>
> vfio/pci: implement get_tph and DMA_BUF_TPH feature
>
> This commit implements dma-buf get_tph for vfio-pci exported dma-bufs and
> adds the VFIO_DEVICE_FEATURE_DMA_BUF_TPH feature. This allows userspace to
> publish TPH metadata for a VFIO-owned device. It checks completer support
> and protects the metadata under dmabuf->resv.
>
> > diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c
> > index 1a177ce7de546..0a0705c8dbea6 100644
> > --- a/drivers/vfio/pci/vfio_pci_dmabuf.c
> > +++ b/drivers/vfio/pci/vfio_pci_dmabuf.c
>
> [ ... ]
>
> > @@ -333,6 +360,71 @@ int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
> >       return ret;
> >  }
> >
> > +int vfio_pci_core_feature_dma_buf_tph(struct vfio_pci_core_device *vdev,
> > +                                   u32 flags,
> > +                                   struct vfio_device_feature_dma_buf_tph __user *arg,
> > +                                   size_t argsz)
> > +{
> > +     struct vfio_device_feature_dma_buf_tph set_tph;
> > +     struct vfio_pci_dma_buf *priv;
> > +     struct dma_buf *dmabuf;
> > +     u8 comp;
> > +     int ret;
> > +
> > +     ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_SET,
> > +                              sizeof(set_tph));
> > +     if (ret != 1)
> > +             return ret;
>
> [Severity: Medium]
> vfio_check_feature() returns 0 for VFIO_DEVICE_FEATURE_PROBE, causing an
> early return of success.
>
> Should the hardware capability check (pcie_tph_completer_type) occur before
> checking the feature flags, to avoid falsely reporting that the feature is
> supported on devices that lack TPH?
>

Will fix it with a change like the one below:
-     if (ret != 1)
+     if (ret != 1) {
+             if (ret == 0 &&
+                 pcie_tph_completer_type(vdev->pdev) ==
+                 PCI_EXP_DEVCAP2_TPH_COMP_NONE)
+                     return -EOPNOTSUPP;
              return ret;
+     }

> > +
> > +     if (copy_from_user(&set_tph, arg, sizeof(set_tph)))
> > +             return -EFAULT;
> > +
> > +     if (set_tph.flags & ~(VFIO_DMA_BUF_TPH_ST | VFIO_DMA_BUF_TPH_ST_EXT))
> > +             return -EINVAL;
> > +
> > +     if (set_tph.ph & ~0x3)
> > +             return -EINVAL;
>
> [Severity: Medium]
> When userspace clears the TPH metadata by passing flags == 0, is it required
> to zero-fill the unused ph field?
>

Not an issue.

* Re: [PATCH v7 2/5] PCI/TPH: Add requester/completer type helpers
  2026-06-11 20:25   ` sashiko-bot
@ 2026-06-11 23:06     ` Zhiping Zhang
  0 siblings, 0 replies; 25+ messages in thread
From: Zhiping Zhang @ 2026-06-11 23:06 UTC (permalink / raw)
  To: sashiko-reviews; +Cc: kvm, linux-pci, dri-devel

On Thu, Jun 11, 2026 at 1:26 PM <sashiko-bot@kernel.org> wrote:
>

> > diff --git a/drivers/pci/tph.c b/drivers/pci/tph.c
> > index 91145e8d9d951..4fe076bba953a 100644
> > --- a/drivers/pci/tph.c
> > +++ b/drivers/pci/tph.c
> [ ... ]
> > +u8 pcie_tph_completer_type(struct pci_dev *pdev)
> > +{
> > +     u32 reg;
> > +
> > +     if (pcie_capability_read_dword(pdev, PCI_EXP_DEVCAP2, &reg))
> > +             return PCI_EXP_DEVCAP2_TPH_COMP_NONE;
> > +
> > +     switch (FIELD_GET(PCI_EXP_DEVCAP2_TPH_COMP_MASK, reg)) {
>
> [Severity: Medium]
> If the device is hot-removed, pcie_capability_read_dword() returns success (0)
> but can leave reg as 0xFFFFFFFF (PCI_ERROR_RESPONSE due to master abort).
>

Will fix this one with a check like the one below, placed before the switch statement:
+     if (PCI_POSSIBLE_ERROR(reg))
+             return PCI_EXP_DEVCAP2_TPH_COMP_NONE;

* Re: [PATCH v7 3/5] dma-buf: add optional get_tph() callback
  2026-06-11 10:35   ` Christian König
@ 2026-06-11 23:07     ` Zhiping Zhang
  0 siblings, 0 replies; 25+ messages in thread
From: Zhiping Zhang @ 2026-06-11 23:07 UTC (permalink / raw)
  To: Christian König
  Cc: Alex Williamson, Jason Gunthorpe, Leon Romanovsky, Sumit Semwal,
	Bjorn Helgaas, kvm, linux-rdma, linux-pci, netdev, dri-devel,
	Keith Busch, Yochai Cohen, Yishai Hadas

On Thu, Jun 11, 2026 at 3:35 AM Christian König
<christian.koenig@amd.com> wrote:
>
> On 6/10/26 21:31, Zhiping Zhang wrote:
> > Add an optional dma_buf_ops.get_tph callback and a dma_buf_get_tph()
> > wrapper for importers.
> >
> > 8-bit ST and 16-bit Extended ST are distinct PCIe TPH namespaces, so
> > the importer requests the namespace it can emit and the exporter
> > returns the matching ST/PH tuple or -EOPNOTSUPP.
> >
> > dma_buf_get_tph() is the importer entry point. It returns -EOPNOTSUPP
> > when the exporter lacks the callback and requires dmabuf->resv to be
> > held while the callback runs.
> >
> > The first user is VFIO_DEVICE_FEATURE_DMA_BUF_TPH in vfio-pci, with
> > mlx5 as the first importer.
> >
> > Signed-off-by: Zhiping Zhang <zhipingz@meta.com>
> > ---
> >  drivers/dma-buf/dma-buf.c | 25 +++++++++++++++++++++++++
> >  include/linux/dma-buf.h   | 21 +++++++++++++++++++++
> >  2 files changed, 46 insertions(+)
> >
> > diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> > index d504c636dc29..aff79ea12e43 100644
> > --- a/drivers/dma-buf/dma-buf.c
> > +++ b/drivers/dma-buf/dma-buf.c
> > @@ -1144,6 +1144,31 @@ void dma_buf_unpin(struct dma_buf_attachment *attach)
> >  }
> >  EXPORT_SYMBOL_NS_GPL(dma_buf_unpin, "DMA_BUF");
> >
> > +/**
> > + * dma_buf_get_tph - Retrieve TPH metadata from an exporter
> > + * @dmabuf: DMA buffer to query
> > + * @extended: false for 8-bit ST, true for 16-bit Extended ST
> > + * @steering_tag: returns the raw steering tag for the requested namespace
> > + * @ph: returns the TPH processing hint
> > + *
> > + * Wrapper for the optional &dma_buf_ops.get_tph callback.
> > + *
> > + * Must be called with &dma_buf.resv held. Returns -EOPNOTSUPP if the
> > + * exporter does not implement the callback or has no metadata for the
> > + * requested namespace.
> > + */
> > +int dma_buf_get_tph(struct dma_buf *dmabuf, bool extended,
> > +                 u16 *steering_tag, u8 *ph)
>
> That name needs improvement, maybe something like dma_buf_get_pci_tph().
>
> It also needs a brief explanation of what TPH is, maybe with a reference to the PCIe spec name, etc.
>
> And document in the list of functions that this one should be called with the lock held.
>
> > +{
> > +     dma_resv_assert_held(dmabuf->resv);
> > +
> > +     if (!dmabuf->ops->get_tph)
> > +             return -EOPNOTSUPP;
> > +
> > +     return dmabuf->ops->get_tph(dmabuf, extended, steering_tag, ph);
> > +}
> > +EXPORT_SYMBOL_NS_GPL(dma_buf_get_tph, "DMA_BUF");
> > +
> >  /**
> >   * dma_buf_map_attachment - Returns the scatterlist table of the attachment;
> >   * mapped into _device_ address space. Is a wrapper for map_dma_buf() of the
> > diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
> > index d1203da56fc5..6a54e0f251a2 100644
> > --- a/include/linux/dma-buf.h
> > +++ b/include/linux/dma-buf.h
> > @@ -113,6 +113,25 @@ struct dma_buf_ops {
> >        */
> >       void (*unpin)(struct dma_buf_attachment *attach);
> >
> > +     /**
> > +      * @get_tph:
> > +      * @dmabuf: DMA buffer for which to retrieve TPH metadata
> > +      * @extended: false for 8-bit ST, true for 16-bit Extended ST
> > +      * @steering_tag: Returns the raw TPH steering tag for the requested
> > +      *                namespace
> > +      * @ph: Returns the TPH processing hint (2-bit value)
> > +      *
> > +      * Return TPH metadata for the namespace selected by @extended. Return
> > +      * 0 on success, or -EOPNOTSUPP if no metadata is available.
> > +      *
> > +      * This callback is optional. Importers must not call it directly;
> > +      * the dma_buf_get_tph() wrapper is the only entry point and handles
> > +      * the NULL-callback case. The callback is invoked with
> > +      * &dma_buf.resv held.
>
> Most of that should be obvious; we only need to say that it's optional and that the lock should be held. Everything else can be dropped.
>
> And most of the description/documentation should be on the wrapper function; exporters who implement the callback should know what they are doing.
>
> Regards,
> Christian.
>

Sure, will do.

Thanks,
Zhiping

* Re: [PATCH v7 5/5] RDMA/mlx5: get tph for p2p access when registering dma-buf mr
  2026-06-11 12:44   ` Michael Gur
@ 2026-06-11 23:09     ` Zhiping Zhang
  0 siblings, 0 replies; 25+ messages in thread
From: Zhiping Zhang @ 2026-06-11 23:09 UTC (permalink / raw)
  To: Michael Gur
  Cc: Alex Williamson, Jason Gunthorpe, Leon Romanovsky, Sumit Semwal,
	Christian Konig, Bjorn Helgaas, kvm, linux-rdma, linux-pci,
	netdev, dri-devel, Keith Busch, Yochai Cohen, Yishai Hadas

On Thu, Jun 11, 2026 at 5:44 AM Michael Gur <michaelgur@nvidia.com> wrote:
>
> On 6/10/2026 10:31 PM, Zhiping Zhang wrote:
> > Query dma-buf TPH metadata when registering a dma-buf MR for peer-to-
> > peer access to a PCIe endpoint and use it to program requester-side TPH
> > on the outbound mkey. If the exporter has no metadata, fall back to the
> > existing no-TPH path.
> >
> > For TPH-backed FRMRs, make the extra ST-table reference belong to the
> > hardware mkey handle rather than the transient MR object. Extend the
> > FRMR pool API so reuse and final destroy can transfer and drop that ref
> > at the handle lifetime boundaries, and add mlx5_st_get_index() to take
> > a ref on an already-known ST index.
> I'd keep the ST reference tied to MRs, where the ST is actually in use.
> There's no functional need to couple ST refcounting to mkey lifetime.
> Once an MR is destroyed and its mkey revoked, the mkey can no longer
> generate traffic, it's just an idle entry in the FRMR pool waiting to be
> aged out or reused.
> This lets us drop all FRMR pool changes from this patch and keep a
> simple flow of 'acquire on MR create, release on MR destroy'.
> > Also decode PH from kernel_vendor_key when recreating pooled mkeys so
> > the requester hint matches the pool key.
> I've fixed that in a series I sent earlier this week; please rebase the
> next version on top of it.
>
> Thanks,
> Michael

ack, thanks!

* Re: [PATCH v7 1/5] net/mlx5: free mlx5_st_idx_data on final dealloc
  2026-06-11 22:53     ` Zhiping Zhang
@ 2026-06-11 23:45       ` Zhiping Zhang
  0 siblings, 0 replies; 25+ messages in thread
From: Zhiping Zhang @ 2026-06-11 23:45 UTC (permalink / raw)
  To: Christian König
  Cc: Alex Williamson, Jason Gunthorpe, Leon Romanovsky, Sumit Semwal,
	Bjorn Helgaas, kvm, linux-rdma, linux-pci, netdev, dri-devel,
	Keith Busch, Yochai Cohen, Yishai Hadas

On Thu, Jun 11, 2026 at 3:53 PM Zhiping Zhang <zhipingz@meta.com> wrote:
>
> On Thu, Jun 11, 2026 at 12:47 AM Christian König
> <christian.koenig@amd.com> wrote:
> >
> > On 6/10/26 21:31, Zhiping Zhang wrote:
> > > When the last reference to an ST table entry is dropped,
> > > mlx5_st_dealloc_index() removed the entry from idx_xa but leaked the
> > > backing mlx5_st_idx_data allocation. Repeated alloc/dealloc cycles
> > > therefore accumulate one struct mlx5_st_idx_data per cycle.
> > >
> > > Free idx_data after the xa_erase() so the lifetime of the bookkeeping
> > > struct matches the lifetime of the ST entry it tracks.
> > >
> > > Fixes: 888a7776f4fb ("net/mlx5: Add support for device steering tag")
> > > Signed-off-by: Zhiping Zhang <zhipingz@meta.com>
> >
> > Since this is an obvious bug fix, I think it shouldn't be part of this patch set and should go upstream completely independently.
> >
> > Regards,
> > Christian.
> >
>
> Sure. Michael replied that he has made a patch to fix it; I will rebase on top of it.

Never mind, it seems Michael's patch did not contain the fix. Let me
submit it as a separate set.

Thanks,
Zhiping

* Re: [PATCH v7 4/5] vfio/pci: implement get_tph and DMA_BUF_TPH feature
  2026-06-11 16:11 ` [PATCH v7 4/5] vfio/pci: implement get_tph and DMA_BUF_TPH feature Zhiping Zhang
@ 2026-06-12 16:46   ` sashiko-bot
  2026-06-12 17:10   ` Alex Williamson
  1 sibling, 0 replies; 25+ messages in thread
From: sashiko-bot @ 2026-06-12 16:46 UTC (permalink / raw)
  To: Zhiping Zhang; +Cc: kvm, dri-devel, linux-pci

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
- [Medium] Unconditional validation of the flag-gated `ph` parameter outside its flag gate.
- [Medium] VFIO_DEVICE_FEATURE_PROBE incorrectly reports that the DMA_BUF_TPH feature is supported for devices that lack TPH completer capabilities.
--

commit cadaa92f5a8da2be3b081b506699064b7bf68059
Author: Zhiping Zhang <zhipingz@meta.com>

vfio/pci: implement get_tph and DMA_BUF_TPH feature

Implement dma-buf get_tph for vfio-pci exported dma-bufs and add
VFIO_DEVICE_FEATURE_DMA_BUF_TPH so userspace can publish TPH metadata
for a VFIO-owned device.

> diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c
> index 1a177ce7de546..0a0705c8dbea6 100644
> --- a/drivers/vfio/pci/vfio_pci_dmabuf.c
> +++ b/drivers/vfio/pci/vfio_pci_dmabuf.c

[ ... ]

> @@ -333,6 +360,71 @@ int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
>  	return ret;
>  }
>  
> +int vfio_pci_core_feature_dma_buf_tph(struct vfio_pci_core_device *vdev,
> +				      u32 flags,
> +				      struct vfio_device_feature_dma_buf_tph __user *arg,
> +				      size_t argsz)
> +{
> +	struct vfio_device_feature_dma_buf_tph set_tph;
> +	struct vfio_pci_dma_buf *priv;
> +	struct dma_buf *dmabuf;
> +	u8 comp;
> +	int ret;
> +
> +	ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_SET,
> +				 sizeof(set_tph));
> +	if (ret != 1)
> +		return ret;

[Severity: Medium]
Will VFIO_DEVICE_FEATURE_PROBE incorrectly report that the feature is
supported on all devices?

vfio_check_feature() returns 0 for probe requests, which means this code
will return 0 (success) before checking the hardware's TPH completer
capabilities below. Should the hardware capability check happen before
returning the probe result?
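
A small model of the ordering in question, assuming the convention described above in which the feature check yields 0 for a probe, 1 to proceed with SET, and a negative errno on error. tph_feature_gate() is a hypothetical stand-in, not VFIO code:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * @check_ret: result of the flag/argsz check (<0 error, 0 probe, 1 proceed).
 * @has_tph_completer: whether the device reports TPH completer support.
 * Returns the value the handler would act on.
 */
static int tph_feature_gate(int check_ret, bool has_tph_completer)
{
	if (check_ret < 0)
		return check_ret;	/* malformed request wins */

	if (!has_tph_completer)
		return -EOPNOTSUPP;	/* applies to probe and SET alike */

	return check_ret;		/* 0: probe succeeds, 1: carry on */
}
```

Checking the capability before returning the probe result means a probe and a later SET agree on whether the feature is usable.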

> +
> +	if (copy_from_user(&set_tph, arg, sizeof(set_tph)))
> +		return -EFAULT;
> +
> +	if (set_tph.flags & ~(VFIO_DMA_BUF_TPH_ST | VFIO_DMA_BUF_TPH_ST_EXT))
> +		return -EINVAL;
> +
> +	if (set_tph.ph & ~0x3)
> +		return -EINVAL;

[Severity: Medium]
Does this code reject valid clear requests if the unused ph field contains
uninitialized data?

When userspace sets flags == 0 to clear the metadata, ph is effectively
unused and might contain garbage. Validating it unconditionally outside of
its flag gate could cause valid API usage to fail with -EINVAL. Should the
ph validation only happen if VFIO_DMA_BUF_TPH_ST or
VFIO_DMA_BUF_TPH_ST_EXT are set?
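
A sketch of flag-gated validation, assuming placeholder flag bits. TPH_ST, TPH_ST_EXT, and validate_tph() are hypothetical; the real uAPI values are not shown in this thread:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define TPH_ST		(1u << 0)	/* placeholder bit values */
#define TPH_ST_EXT	(1u << 1)

static int validate_tph(uint32_t flags, uint8_t ph)
{
	if (flags & ~(TPH_ST | TPH_ST_EXT))
		return -EINVAL;

	/* ph only carries meaning when a steering tag is being published. */
	if (flags && (ph & ~0x3))
		return -EINVAL;

	return 0;
}
```

Under this shape a clear request (flags == 0) succeeds whatever ph contains, while a publish request still rejects out-of-range hints.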

> +
> +	comp = pcie_tph_completer_type(vdev->pdev);
> +	if (comp == PCI_EXP_DEVCAP2_TPH_COMP_NONE)
> +		return -EOPNOTSUPP;

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260611161546.4075580-1-zhipingz@meta.com?part=4


* Re: [PATCH v7 4/5] vfio/pci: implement get_tph and DMA_BUF_TPH feature
  2026-06-11 23:02     ` Zhiping Zhang
@ 2026-06-12 16:59       ` Alex Williamson
  0 siblings, 0 replies; 25+ messages in thread
From: Alex Williamson @ 2026-06-12 16:59 UTC (permalink / raw)
  To: Zhiping Zhang; +Cc: sashiko-reviews, linux-pci, kvm, dri-devel, alex

On Thu, 11 Jun 2026 16:02:25 -0700
Zhiping Zhang <zhipingz@meta.com> wrote:

> On Thu, Jun 11, 2026 at 1:25 PM <sashiko-bot@kernel.org> wrote:
> >  
> > >  
> > Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
> > - [Medium] Unconditional validation of the flag-gated 'ph' parameter can lead to spurious -EINVAL errors, breaking the syscall trust boundary.
> > - [Medium] VFIO_DEVICE_FEATURE_PROBE falsely reports that VFIO_DEVICE_FEATURE_DMA_BUF_TPH is supported even if the hardware lacks TPH support.
> > --
> >
> > commit d57637d667ca6dc95def77a0a5bd339d62d4f506
> > Author: Zhiping Zhang <zhipingz@meta.com>
> >
> > vfio/pci: implement get_tph and DMA_BUF_TPH feature
> >
> > This commit implements dma-buf get_tph for vfio-pci exported dma-bufs and
> > adds the VFIO_DEVICE_FEATURE_DMA_BUF_TPH feature. This allows userspace to
> > publish TPH metadata for a VFIO-owned device. It checks completer support
> > and protects the metadata under dmabuf->resv.
> >  
> > > diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c
> > > index 1a177ce7de546..0a0705c8dbea6 100644
> > > --- a/drivers/vfio/pci/vfio_pci_dmabuf.c
> > > +++ b/drivers/vfio/pci/vfio_pci_dmabuf.c  
> >
> > [ ... ]
> >  
> > > @@ -333,6 +360,71 @@ int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
> > >       return ret;
> > >  }
> > >
> > > +int vfio_pci_core_feature_dma_buf_tph(struct vfio_pci_core_device *vdev,
> > > +                                   u32 flags,
> > > +                                   struct vfio_device_feature_dma_buf_tph __user *arg,
> > > +                                   size_t argsz)
> > > +{
> > > +     struct vfio_device_feature_dma_buf_tph set_tph;
> > > +     struct vfio_pci_dma_buf *priv;
> > > +     struct dma_buf *dmabuf;
> > > +     u8 comp;
> > > +     int ret;
> > > +
> > > +     ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_SET,
> > > +                              sizeof(set_tph));
> > > +     if (ret != 1)
> > > +             return ret;  
> >
> > [Severity: Medium]
> > vfio_check_feature() returns 0 for VFIO_DEVICE_FEATURE_PROBE, causing an
> > early return of success.
> >
> > Should the hardware capability check (pcie_tph_completer_type) occur before
> > checking the feature flags, to avoid falsely reporting that the feature is
> > supported on devices that lack TPH?
> >  
> 
> Will fix it with a change like below:
> -     if (ret != 1)
> +     if (ret != 1) {
> +             if (ret == 0 &&
> +                 pcie_tph_completer_type(vdev->pdev) ==
> +                     PCI_EXP_DEVCAP2_TPH_COMP_NONE)
> +                     return -EOPNOTSUPP;
>               return ret;
> +     }

Typically the capability check is done before the vfio_check_feature()
call.  Thanks,

Alex


* Re: [PATCH v7 4/5] vfio/pci: implement get_tph and DMA_BUF_TPH feature
  2026-06-11 16:11 ` [PATCH v7 4/5] vfio/pci: implement get_tph and DMA_BUF_TPH feature Zhiping Zhang
  2026-06-12 16:46   ` sashiko-bot
@ 2026-06-12 17:10   ` Alex Williamson
  1 sibling, 0 replies; 25+ messages in thread
From: Alex Williamson @ 2026-06-12 17:10 UTC (permalink / raw)
  To: Zhiping Zhang; +Cc: netdev, kvm, linux-rdma, linux-pci, dri-devel, alex

On Thu, 11 Jun 2026 09:11:19 -0700
Zhiping Zhang <zhipingz@meta.com> wrote:

> Implement dma-buf get_tph for vfio-pci exported dma-bufs and add
> VFIO_DEVICE_FEATURE_DMA_BUF_TPH so userspace can publish TPH metadata
> for a VFIO-owned device.
> 
> 8-bit ST and 16-bit Extended ST are distinct PCIe TPH namespaces; the
> uAPI carries both with explicit validity flags, and get_tph() returns
> the value matching the importer's requested namespace or -EOPNOTSUPP.
> 
> Publish and read the TPH descriptor under dmabuf->resv, matching the
> locking used for other importer-visible dma-buf state. The SET ioctl
> takes dma_resv_lock_interruptible(), while the callback runs under
> DMA-buf's asserted resv lock.
> 
> Reject requests the device cannot consume as a completer:
> pcie_tph_completer_type() must report at least
> PCI_EXP_DEVCAP2_TPH_COMP_TPH_ONLY, and Extended ST requires
> PCI_EXP_DEVCAP2_TPH_COMP_EXT_TPH. Validate fields before the completer
> check so userspace gets the narrowest errno.
> 
> Signed-off-by: Zhiping Zhang <zhipingz@meta.com>
> ---
>  drivers/vfio/pci/vfio_pci_core.c   |  3 +
>  drivers/vfio/pci/vfio_pci_dmabuf.c | 94 +++++++++++++++++++++++++++++-
>  drivers/vfio/pci/vfio_pci_priv.h   | 12 ++++
>  include/uapi/linux/vfio.h          | 37 ++++++++++++
>  4 files changed, 145 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index 050e7542952e..4fa36f2f7555 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -1569,6 +1569,9 @@ int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
>  		return vfio_pci_core_feature_token(vdev, flags, arg, argsz);
>  	case VFIO_DEVICE_FEATURE_DMA_BUF:
>  		return vfio_pci_core_feature_dma_buf(vdev, flags, arg, argsz);
> +	case VFIO_DEVICE_FEATURE_DMA_BUF_TPH:
> +		return vfio_pci_core_feature_dma_buf_tph(vdev, flags, arg,
> +							 argsz);
>  	default:
>  		return -ENOTTY;
>  	}
> diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c
> index 1a177ce7de54..0a0705c8dbea 100644
> --- a/drivers/vfio/pci/vfio_pci_dmabuf.c
> +++ b/drivers/vfio/pci/vfio_pci_dmabuf.c
> @@ -3,6 +3,7 @@
>   */
>  #include <linux/dma-buf-mapping.h>
>  #include <linux/pci-p2pdma.h>
> +#include <linux/pci-tph.h>
>  #include <linux/dma-resv.h>
>  
>  #include "vfio_pci_priv.h"
> @@ -19,7 +20,12 @@ struct vfio_pci_dma_buf {
>  	u32 nr_ranges;
>  	struct kref kref;
>  	struct completion comp;
> -	u8 revoked : 1;
> +	u8 tph_st_valid:1;
> +	u8 tph_st_ext_valid:1;
> +	u8 tph_ph:2;
> +	u8 tph_st;
> +	u16 tph_st_ext;
> +	u8 revoked:1;

If these bitfields are now all protected under dma_resv_lock, they
should be grouped together with a comment to that effect; there is no
need for revoked to get kicked out to its own storage unit.  In [1] I'm
proposing that runtime-modified flags each get their own storage unit,
but for more isolated cases, so long as we keep track and enforce
serialized updates, I'm ok with runtime bitfields.  Thanks,

Alex

[1] https://lore.kernel.org/all/20260611213539.4100590-1-alex.williamson@nvidia.com/
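
A sketch of the grouping suggested above, reduced to the fields this
patch touches. The struct and helper names are hypothetical, and the
locking comment rests on the assumption that every writer of these
bits, including the existing revoked updates, holds dmabuf->resv.

```c
#include <assert.h>
#include <stdint.h>

struct dma_buf_priv_sketch {
	uint16_t tph_st_ext;
	uint8_t  tph_st;
	/*
	 * Runtime-modified bitfields sharing one storage unit.
	 * Assumption: all updates are serialized by dmabuf->resv.
	 */
	uint8_t  revoked:1;
	uint8_t  tph_st_valid:1;
	uint8_t  tph_st_ext_valid:1;
	uint8_t  tph_ph:2;
};

/* Models the SET path's stores; the caller is assumed to hold the lock. */
static void sketch_publish(struct dma_buf_priv_sketch *p, int st_valid,
			   int st_ext_valid, uint8_t st, uint16_t st_ext,
			   uint8_t ph)
{
	assert(ph <= 3);

	p->tph_st = st;
	p->tph_st_ext = st_ext;
	p->tph_ph = ph;
	p->tph_st_valid = !!st_valid;
	p->tph_st_ext_valid = !!st_ext_valid;
}
```

Because the stores to neighbouring bitfields are read-modify-write
operations on the same byte, the grouping is only safe while the
serialization assumption in the comment holds.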

>  };
>  
>  static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf,
> @@ -69,6 +75,26 @@ vfio_pci_dma_buf_map(struct dma_buf_attachment *attachment,
>  	return ret;
>  }
>  
> +static int vfio_pci_dma_buf_get_tph(struct dma_buf *dmabuf, bool extended,
> +				    u16 *steering_tag, u8 *ph)
> +{
> +	struct vfio_pci_dma_buf *priv = dmabuf->priv;
> +
> +	dma_resv_assert_held(dmabuf->resv);
> +
> +	if (extended) {
> +		if (!priv->tph_st_ext_valid)
> +			return -EOPNOTSUPP;
> +		*steering_tag = priv->tph_st_ext;
> +	} else {
> +		if (!priv->tph_st_valid)
> +			return -EOPNOTSUPP;
> +		*steering_tag = priv->tph_st;
> +	}
> +	*ph = priv->tph_ph;
> +	return 0;
> +}
> +
>  static void vfio_pci_dma_buf_unmap(struct dma_buf_attachment *attachment,
>  				   struct sg_table *sgt,
>  				   enum dma_data_direction dir)
> @@ -101,6 +127,7 @@ static void vfio_pci_dma_buf_release(struct dma_buf *dmabuf)
>  
>  static const struct dma_buf_ops vfio_pci_dmabuf_ops = {
>  	.attach = vfio_pci_dma_buf_attach,
> +	.get_tph = vfio_pci_dma_buf_get_tph,
>  	.map_dma_buf = vfio_pci_dma_buf_map,
>  	.unmap_dma_buf = vfio_pci_dma_buf_unmap,
>  	.release = vfio_pci_dma_buf_release,
> @@ -333,6 +360,71 @@ int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
>  	return ret;
>  }
>  
> +int vfio_pci_core_feature_dma_buf_tph(struct vfio_pci_core_device *vdev,
> +				      u32 flags,
> +				      struct vfio_device_feature_dma_buf_tph __user *arg,
> +				      size_t argsz)
> +{
> +	struct vfio_device_feature_dma_buf_tph set_tph;
> +	struct vfio_pci_dma_buf *priv;
> +	struct dma_buf *dmabuf;
> +	u8 comp;
> +	int ret;
> +
> +	ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_SET,
> +				 sizeof(set_tph));
> +	if (ret != 1)
> +		return ret;
> +
> +	if (copy_from_user(&set_tph, arg, sizeof(set_tph)))
> +		return -EFAULT;
> +
> +	if (set_tph.flags & ~(VFIO_DMA_BUF_TPH_ST | VFIO_DMA_BUF_TPH_ST_EXT))
> +		return -EINVAL;
> +
> +	if (set_tph.ph & ~0x3)
> +		return -EINVAL;
> +
> +	comp = pcie_tph_completer_type(vdev->pdev);
> +	if (comp == PCI_EXP_DEVCAP2_TPH_COMP_NONE)
> +		return -EOPNOTSUPP;
> +	if ((set_tph.flags & VFIO_DMA_BUF_TPH_ST_EXT) &&
> +	    comp != PCI_EXP_DEVCAP2_TPH_COMP_EXT_TPH)
> +		return -EOPNOTSUPP;
> +
> +	dmabuf = dma_buf_get(set_tph.dmabuf_fd);
> +	if (IS_ERR(dmabuf))
> +		return PTR_ERR(dmabuf);
> +
> +	if (dmabuf->ops != &vfio_pci_dmabuf_ops) {
> +		ret = -EINVAL;
> +		goto out_put;
> +	}
> +
> +	priv = dmabuf->priv;
> +	if (priv->vdev != vdev) {
> +		ret = -EINVAL;
> +		goto out_put;
> +	}
> +
> +	ret = dma_resv_lock_interruptible(dmabuf->resv, NULL);
> +	if (ret)
> +		goto out_put;
> +
> +	priv->tph_st         = set_tph.steering_tag;
> +	priv->tph_st_ext     = set_tph.steering_tag_ext;
> +	priv->tph_ph         = set_tph.ph;
> +	priv->tph_st_valid   = !!(set_tph.flags & VFIO_DMA_BUF_TPH_ST);
> +	priv->tph_st_ext_valid =
> +		!!(set_tph.flags & VFIO_DMA_BUF_TPH_ST_EXT);
> +	dma_resv_unlock(dmabuf->resv);
> +	ret = 0;
> +
> +out_put:
> +	dma_buf_put(dmabuf);
> +	return ret;
> +}
> +
>  void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked)
>  {
>  	struct vfio_pci_dma_buf *priv;
> diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h
> index fca9d0dfac90..c58f369be4b3 100644
> --- a/drivers/vfio/pci/vfio_pci_priv.h
> +++ b/drivers/vfio/pci/vfio_pci_priv.h
> @@ -118,6 +118,10 @@ static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
>  int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
>  				  struct vfio_device_feature_dma_buf __user *arg,
>  				  size_t argsz);
> +int vfio_pci_core_feature_dma_buf_tph(struct vfio_pci_core_device *vdev,
> +				      u32 flags,
> +				      struct vfio_device_feature_dma_buf_tph __user *arg,
> +				      size_t argsz);
>  void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev);
>  void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked);
>  #else
> @@ -128,6 +132,14 @@ vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
>  {
>  	return -ENOTTY;
>  }
> +
> +static inline int
> +vfio_pci_core_feature_dma_buf_tph(struct vfio_pci_core_device *vdev, u32 flags,
> +				  struct vfio_device_feature_dma_buf_tph __user *arg,
> +				  size_t argsz)
> +{
> +	return -ENOTTY;
> +}
>  static inline void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev)
>  {
>  }
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 5de618a3a5ee..5dd693220a0d 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -1534,6 +1534,43 @@ struct vfio_device_feature_dma_buf {
>   */
>  #define VFIO_DEVICE_FEATURE_MIG_PRECOPY_INFOv2  12
>  
> +/**
> + * Upon VFIO_DEVICE_FEATURE_SET associate TPH (TLP Processing Hints) metadata
> + * with a vfio-exported dma-buf. The dma-buf must have been created by
> + * VFIO_DEVICE_FEATURE_DMA_BUF on this device, and the device must report
> + * TPH Completer support in Device Capabilities 2 (bits 13:12); requests
> + * carrying VFIO_DMA_BUF_TPH_ST_EXT additionally require the device to
> + * report the Extended TPH Completer encoding. Otherwise the ioctl
> + * returns -EOPNOTSUPP.
> + *
> + * dmabuf_fd is the file descriptor returned by VFIO_DEVICE_FEATURE_DMA_BUF.
> + *
> + * 8-bit ST (steering_tag) and 16-bit Extended ST (steering_tag_ext) are
> + * distinct namespaces. Userspace supplies whichever values are valid and sets
> + * the matching VFIO_DMA_BUF_TPH_ST / VFIO_DMA_BUF_TPH_ST_EXT bits in @flags;
> + * an importer requests one namespace and receives the matching value.
> + *
> + * @flags == 0 marks any previously published ST / Extended-ST as invalid
> + * for future get_tph() requests on this dma-buf.
> + *
> + * ph is the 2-bit TLP Processing Hint and must be in the range [0, 3].
> + *
> + * Userspace must publish TPH before handing the dma-buf fd to an importer.
> + * Calling SET again replaces the published values.
> + */
> +#define VFIO_DEVICE_FEATURE_DMA_BUF_TPH 13
> +
> +#define VFIO_DMA_BUF_TPH_ST		(1 << 0)
> +#define VFIO_DMA_BUF_TPH_ST_EXT		(1 << 1)
> +
> +struct vfio_device_feature_dma_buf_tph {
> +	__s32	dmabuf_fd;
> +	__u32	flags;
> +	__u16	steering_tag_ext;
> +	__u8	steering_tag;
> +	__u8	ph;
> +};
> +
>  /* -------- API for Type1 VFIO IOMMU -------- */
>  
>  /**
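
For reference, a hedged userspace sketch of how the SET call described
in the uAPI comment above might be issued for the 8-bit ST namespace.
VFIO_DEVICE_FEATURE, VFIO_DEVICE_FEATURE_SET and struct
vfio_device_feature come from the existing <linux/vfio.h>; the feature
index 13, the flag bit and the payload layout are taken from this
patch and mirrored locally under sketch_/SKETCH_ names, since they are
only proposed here. The helper name is hypothetical.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Local mirrors of the definitions proposed in the quoted patch. */
#define SKETCH_FEATURE_DMA_BUF_TPH 13
#define SKETCH_TPH_ST              (1u << 0)

struct sketch_dma_buf_tph {
	int32_t  dmabuf_fd;
	uint32_t flags;
	uint16_t steering_tag_ext;
	uint8_t  steering_tag;
	uint8_t  ph;
};

/* Publish an 8-bit ST and PH for a dma-buf; returns 0 or -errno. */
static int sketch_publish_tph_st(int device_fd, int dmabuf_fd,
				 uint8_t st, uint8_t ph)
{
	struct vfio_device_feature hdr;
	struct sketch_dma_buf_tph tph;
	unsigned char buf[sizeof(hdr) + sizeof(tph)];

	if (ph > 3)
		return -EINVAL;

	memset(&hdr, 0, sizeof(hdr));
	hdr.argsz = sizeof(buf);
	hdr.flags = VFIO_DEVICE_FEATURE_SET | SKETCH_FEATURE_DMA_BUF_TPH;

	memset(&tph, 0, sizeof(tph));
	tph.dmabuf_fd = dmabuf_fd;
	tph.flags = SKETCH_TPH_ST;
	tph.steering_tag = st;
	tph.ph = ph;

	/* Header followed directly by the feature payload. */
	memcpy(buf, &hdr, sizeof(hdr));
	memcpy(buf + sizeof(hdr), &tph, sizeof(tph));

	if (ioctl(device_fd, VFIO_DEVICE_FEATURE, buf) < 0)
		return -errno;
	return 0;
}
```

Per the comment in the patch, this would be called before handing the
dma-buf fd to an importer, and a later call with flags == 0 would mark
the published values invalid.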



end of thread, other threads:[~2026-06-12 17:10 UTC | newest]

Thread overview: 25+ messages
2026-06-10 19:31 [PATCH v7 0/5] vfio/dma-buf: add TPH support for peer-to-peer access Zhiping Zhang
2026-06-10 19:31 ` [PATCH v7 1/5] net/mlx5: free mlx5_st_idx_data on final dealloc Zhiping Zhang
2026-06-11  7:47   ` Christian König
2026-06-11 22:53     ` Zhiping Zhang
2026-06-11 23:45       ` Zhiping Zhang
2026-06-11 20:25   ` sashiko-bot
2026-06-11 22:54     ` Zhiping Zhang
2026-06-10 19:31 ` [PATCH v7 2/5] PCI/TPH: Add requester/completer type helpers Zhiping Zhang
2026-06-11 20:25   ` sashiko-bot
2026-06-11 23:06     ` Zhiping Zhang
2026-06-10 19:31 ` [PATCH v7 3/5] dma-buf: add optional get_tph() callback Zhiping Zhang
2026-06-11 10:35   ` Christian König
2026-06-11 23:07     ` Zhiping Zhang
2026-06-11 20:26   ` sashiko-bot
2026-06-10 19:31 ` [PATCH v7 4/5] vfio/pci: implement get_tph and DMA_BUF_TPH feature Zhiping Zhang
2026-06-11 20:25   ` sashiko-bot
2026-06-11 23:02     ` Zhiping Zhang
2026-06-12 16:59       ` Alex Williamson
2026-06-10 19:31 ` [PATCH v7 5/5] RDMA/mlx5: get tph for p2p access when registering dma-buf mr Zhiping Zhang
2026-06-11 12:44   ` Michael Gur
2026-06-11 23:09     ` Zhiping Zhang
2026-06-11 20:26   ` sashiko-bot
  -- strict thread matches above, loose matches on Subject: below --
2026-06-11 16:11 [PATCH v7 0/5] vfio/dma-buf: add TPH support for peer-to-peer access Zhiping Zhang
2026-06-11 16:11 ` [PATCH v7 4/5] vfio/pci: implement get_tph and DMA_BUF_TPH feature Zhiping Zhang
2026-06-12 16:46   ` sashiko-bot
2026-06-12 17:10   ` Alex Williamson
