* [PATCH v10 0/4] vfio/dma-buf: add TPH support for peer-to-peer access
From: Zhiping Zhang @ 2026-06-30 22:42 UTC
To: Jason Gunthorpe, Leon Romanovsky, Michael Guralnik, Sumit Semwal,
Christian Konig, Alex Williamson, Bjorn Helgaas
Cc: kvm, linux-rdma, linux-pci, dri-devel, Zhiping Zhang
This series adds TLP Processing Hints (TPH) support to the VFIO dma-buf
export path, allowing importing drivers (e.g. mlx5) to use the
exporter's steering tag (ST) when performing peer-to-peer (P2P) DMA into
a VFIO-owned device.
The target device has no dedicated in-tree vendor driver: vfio-pci is
the in-tree driver, and the device is managed from userspace via VFIO
passthrough. That is why the ST has to flow through a uAPI: userspace
owns the device and its ST table, so it is the entity that can
configure a meaningful value for a given dma-buf. The kernel-visible
participants are still in-tree: vfio-pci exports the dma-buf and mlx5
imports it.
What the ST does: the endpoint's PCIe ingress block uses the ST as an
in-band instruction for each incoming P2P TLP, selecting a target cache
partition and, for writes, an in-flight operation applied to the data
before it lands. The dma-buf callback keeps the value opaque to the
framework; only the producer (the userspace owner of the VFIO device)
and the consumer (the endpoint's ingress block) need to interpret it.
The dma-buf get_pci_tph callback itself is optional, but workloads that
depend on the endpoint's in-flight operation need it, because the
fallback (TLPs without the ST) does not produce the same result.
The dma-buf hook is intentionally generic and discoverable rather than
a private side channel. The exporter owns the completing address space
for the dma-buf and decides whether it can provide a meaningful ST/PH
tuple for that completer; the dma-buf core keeps the tuple opaque, and
importers only request the ST namespace they can emit (8-bit ST or
16-bit Extended ST) and place the returned value on the TLPs they
generate. Exporters that cannot derive a meaningful tuple return
-EOPNOTSUPP.
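A minimal userspace model of that contract, for illustration only: every
name below (model_dma_buf, model_get_pci_tph, ...) is hypothetical, and
the "lock held" flag merely stands in for dmabuf->resv. It shows the
optional callback, the per-namespace validity flags the exporter keeps,
and -EOPNOTSUPP for both "no callback" and "no value for this namespace".

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-ins for struct dma_buf / struct dma_buf_ops (hypothetical). */
struct model_dma_buf;

struct model_dma_buf_ops {
	/* Optional: NULL means the exporter publishes no TPH metadata. */
	int (*get_pci_tph)(struct model_dma_buf *buf, bool extended,
			   uint16_t *steering_tag, uint8_t *ph);
};

struct model_dma_buf {
	const struct model_dma_buf_ops *ops;
	bool resv_held;		/* stands in for holding dmabuf->resv */
	/* Exporter-private state, one validity flag per ST namespace. */
	uint16_t st_ext;	/* 16-bit Extended ST */
	uint8_t st;		/* 8-bit ST */
	bool st_valid;
	bool st_ext_valid;
	uint8_t ph;		/* 2-bit processing hint */
};

/* Importer entry point: lock must be held, callback is optional. */
static int model_get_pci_tph(struct model_dma_buf *buf, bool extended,
			     uint16_t *steering_tag, uint8_t *ph)
{
	assert(buf->resv_held);	/* models dma_resv_assert_held() */
	if (!buf->ops->get_pci_tph)
		return -EOPNOTSUPP;
	return buf->ops->get_pci_tph(buf, extended, steering_tag, ph);
}

/* Exporter: answer only for the namespace the importer asked for. */
static int model_exporter_get_pci_tph(struct model_dma_buf *buf,
				      bool extended, uint16_t *steering_tag,
				      uint8_t *ph)
{
	if (extended) {
		if (!buf->st_ext_valid)
			return -EOPNOTSUPP;
		*steering_tag = buf->st_ext;
	} else {
		if (!buf->st_valid)
			return -EOPNOTSUPP;
		*steering_tag = buf->st;
	}
	*ph = buf->ph;
	return 0;
}

static const struct model_dma_buf_ops model_tph_exporter_ops = {
	.get_pci_tph = model_exporter_get_pci_tph,
};

/* An exporter that does not implement the optional callback. */
static const struct model_dma_buf_ops model_plain_exporter_ops = {
	.get_pci_tph = NULL,
};
```

With ST=0xf0 and PH=2 configured only in the 8-bit namespace, an 8-bit
request returns that tuple while an Extended ST request fails; the
exporter never silently narrows or widens a tag across namespaces.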
Patch 1 adds small PCI/TPH type helpers so drivers can query the enabled
TPH requester mode and the device's TPH Completer Supported field
without reaching into pci_dev internals (and so callers in
CONFIG_PCIE_TPH=n builds get a clean fallback).
Patch 2 adds the optional dma_buf_ops::get_pci_tph callback plus the
dma_buf_get_pci_tph() importer wrapper so importers can fetch TPH
metadata from an exporter while holding dmabuf->resv.
Patch 3 implements get_pci_tph in vfio-pci and adds the new uAPI
(VFIO_DEVICE_FEATURE_DMA_BUF_TPH) for userspace to attach the metadata.
Patch 4 wires up the mlx5 RDMA driver as a consumer.
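The completer-side logic of patches 1 and 3 can be sketched in plain C.
This is a userspace model, not the kernel code: the TPH_COMP_* names are
local copies of the 2-bit DEVCAP2 encoding (00b none, 01b TPH only, 10b
reserved, 11b TPH + Extended TPH), and the error code returned when
Extended ST is requested from a TPH-only completer is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Local copies of the "TPH Completer Supported" field, DEVCAP2 bits 13:12. */
#define TPH_COMP_SHIFT		12
#define TPH_COMP_MASK		(0x3u << TPH_COMP_SHIFT)
#define TPH_COMP_NONE		0x0
#define TPH_COMP_TPH_ONLY	0x1
#define TPH_COMP_EXT_TPH	0x3

static_assert(TPH_COMP_MASK == 0x3000u, "field occupies bits 13:12");

/* Model of pcie_tph_completer_type(), taking the raw DEVCAP2 value. */
static uint8_t model_tph_completer_type(uint32_t devcap2)
{
	/* All-ones usually means a failed config read (PCI_POSSIBLE_ERROR). */
	if (devcap2 == UINT32_MAX)
		return TPH_COMP_NONE;

	switch ((devcap2 & TPH_COMP_MASK) >> TPH_COMP_SHIFT) {
	case TPH_COMP_TPH_ONLY:
		return TPH_COMP_TPH_ONLY;
	case TPH_COMP_EXT_TPH:
		return TPH_COMP_EXT_TPH;
	default:	/* 00b, plus the reserved 10b folded into "none" */
		return TPH_COMP_NONE;
	}
}

/*
 * Hypothetical model of the completer gate described for patch 3:
 * any TPH needs at least TPH_ONLY, Extended ST needs EXT_TPH.
 */
static int model_check_tph_request(uint8_t comp, bool want_extended)
{
	if (comp == TPH_COMP_NONE)
		return -EOPNOTSUPP;
	if (want_extended && comp != TPH_COMP_EXT_TPH)
		return -EOPNOTSUPP;	/* error code is an assumption */
	return 0;
}
```

The all-ones check matters: without it, a failed read (0xffffffff) would
decode as 11b and be mistaken for full Extended TPH support.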
Build-tested with both CONFIG_PCIE_TPH=y and CONFIG_PCIE_TPH=n.
Functional validation on the target topology: PCIe analyzer captures
of the P2P TLPs confirm that the ST emitted by mlx5 matches the value
configured through VFIO_DEVICE_FEATURE_DMA_BUF_TPH, and the end-to-end
P2P workload produces only results consistent with the endpoint's
ST-selected in-flight operation. For example, with userspace
configuring 8-bit ST=0xf0 and PH=2, an analyzer capture of a
peer-to-peer MWr64 shows "STP MWr64 TC=0 OHC=2 ..." followed by
"OHC-B ST=F0h PH=2 HV=1":
(TLP Captures)
08000260 -> STP MWr64 TC=0 OHC=2 TS=0 Attr=0 L=8
F0000004 -> RID=4h:0h.0h EP- Tag=F0h
E0200000 -> AddrH=000020E0h
00080006 -> AddrL=06000800h
90F00000 -> OHC-B ST=F0h PH=2 HV=1 AMA=0 AV-
Depends on (submitted separately):
net/mlx5: free mlx5_st_idx_data on final dealloc
https://lore.kernel.org/linux-rdma/20260612170406.3339093-1-zhipingz@meta.com
Changes since v9:
Patch 3 (vfio/pci): Address Alex Williamson's comments by annotating
the existing unlocked @revoked read with READ_ONCE() and rewriting
the DMA_BUF_TPH uAPI documentation for @flags, future-query semantics,
@ph, and undefined bits. No behavior change.
Patch 4 (RDMA/mlx5): Address Michael Guralnik's comments by renaming the
per-MR ST reference helpers to drop the misleading "frmr" infix and by
preventing mlx5r_build_frmr_key() from propagating a user-provided
kernel_vendor_key. Also make the PH encoding consistent between the
FRMR path and reg_create(), and balance the MR-scoped ST reference
across both creation paths and their failure paths.
Previous versions:
v9: https://lore.kernel.org/dri-devel/20260622184211.2229399-1-zhipingz@meta.com/
v8: https://lore.kernel.org/dri-devel/20260615065912.2177918-1-zhipingz@meta.com/
v7: https://lore.kernel.org/dri-devel/20260611161546.4075580-1-zhipingz@meta.com/
v6: https://lore.kernel.org/dri-devel/20260608185646.4085127-1-zhipingz@meta.com/
v5: https://lore.kernel.org/dri-devel/20260526144401.1485788-1-zhipingz@meta.com/
v4: https://lore.kernel.org/linux-pci/20260519201401.1558410-1-zhipingz@meta.com/
v3: https://lore.kernel.org/linux-pci/20260512184755.4137227-1-zhipingz@meta.com/
v2: https://lore.kernel.org/linux-pci/20260430200704.352228-1-zhipingz@meta.com/
Zhiping Zhang (4):
PCI/TPH: Add requester/completer type helpers
dma-buf: add optional get_pci_tph() callback
vfio/pci: implement get_pci_tph and DMA_BUF_TPH feature
RDMA/mlx5: get tph for p2p access when registering dma-buf mr
drivers/dma-buf/dma-buf.c | 25 ++++
drivers/infiniband/hw/mlx5/main.c | 1 +
drivers/infiniband/hw/mlx5/mr.c | 116 +++++++++++++++++-
.../net/ethernet/mellanox/mlx5/core/lib/st.c | 49 ++++++--
drivers/pci/tph.c | 45 +++++++
drivers/vfio/pci/vfio_pci_core.c | 3 +
drivers/vfio/pci/vfio_pci_dmabuf.c | 99 ++++++++++++++-
drivers/vfio/pci/vfio_pci_priv.h | 12 ++
include/linux/dma-buf.h | 22 ++++
include/linux/mlx5/driver.h | 15 +++
include/linux/pci-tph.h | 8 ++
include/uapi/linux/vfio.h | 43 +++++++
12 files changed, 422 insertions(+), 16 deletions(-)
base-commit: 97d2a397efe7752ebf9204a1cfd365afd80c3b28
--
2.53.0-Meta
* [PATCH v10 1/4] PCI/TPH: Add requester/completer type helpers
From: Zhiping Zhang @ 2026-06-30 22:42 UTC
To: Jason Gunthorpe, Leon Romanovsky, Michael Guralnik, Sumit Semwal,
Christian Konig, Alex Williamson, Bjorn Helgaas
Cc: kvm, linux-rdma, linux-pci, dri-devel, Zhiping Zhang

Add pcie_tph_enabled_req_type() so drivers can query the enabled TPH
requester mode without reaching into pci_dev internals.

Add pcie_tph_completer_type() so drivers that publish TPH metadata for a
device acting as a completer can gate on the "TPH Completer Supported"
field of Device Capabilities 2 (bits 13:12, PCI_EXP_DEVCAP2_TPH_COMP_MASK)
rather than reusing requester-side state. Fold the reserved 0b10 encoding
into NONE so callers only see the defined values.

This keeps pci_dev::tph_req_type and the completer-capability decode
inside the PCI/TPH code and provides !CONFIG_PCIE_TPH stubs for callers.

Signed-off-by: Zhiping Zhang <zhipingz@meta.com>
---
 drivers/pci/tph.c       | 45 +++++++++++++++++++++++++++++++++++++++++
 include/linux/pci-tph.h |  8 ++++++++
 2 files changed, 53 insertions(+)

diff --git a/drivers/pci/tph.c b/drivers/pci/tph.c
index 655ffd60e62f..e7693fd9d676 100644
--- a/drivers/pci/tph.c
+++ b/drivers/pci/tph.c
@@ -173,6 +173,51 @@ u32 pcie_tph_get_st_table_loc(struct pci_dev *pdev)
 }
 EXPORT_SYMBOL(pcie_tph_get_st_table_loc);
 
+/**
+ * pcie_tph_enabled_req_type - Return the device's enabled TPH requester type
+ * @pdev: PCI device to query
+ *
+ * Return: PCI_TPH_REQ_DISABLE, PCI_TPH_REQ_TPH_ONLY or PCI_TPH_REQ_EXT_TPH.
+ */
+u8 pcie_tph_enabled_req_type(struct pci_dev *pdev)
+{
+	return pdev->tph_req_type;
+}
+EXPORT_SYMBOL(pcie_tph_enabled_req_type);
+
+/**
+ * pcie_tph_completer_type - Return the device's TPH Completer support
+ * @pdev: PCI device to query
+ *
+ * Reads the "TPH Completer Supported" field (bits 13:12) of Device
+ * Capabilities 2. The reserved 0b10 encoding is folded into
+ * "not supported" so callers only need to compare against the three
+ * defined values.
+ *
+ * Return: one of %PCI_EXP_DEVCAP2_TPH_COMP_NONE,
+ * %PCI_EXP_DEVCAP2_TPH_COMP_TPH_ONLY or
+ * %PCI_EXP_DEVCAP2_TPH_COMP_EXT_TPH.
+ */
+u8 pcie_tph_completer_type(struct pci_dev *pdev)
+{
+	u32 reg;
+
+	if (pcie_capability_read_dword(pdev, PCI_EXP_DEVCAP2, &reg))
+		return PCI_EXP_DEVCAP2_TPH_COMP_NONE;
+	if (PCI_POSSIBLE_ERROR(reg))
+		return PCI_EXP_DEVCAP2_TPH_COMP_NONE;
+
+	switch (FIELD_GET(PCI_EXP_DEVCAP2_TPH_COMP_MASK, reg)) {
+	case PCI_EXP_DEVCAP2_TPH_COMP_TPH_ONLY:
+		return PCI_EXP_DEVCAP2_TPH_COMP_TPH_ONLY;
+	case PCI_EXP_DEVCAP2_TPH_COMP_EXT_TPH:
+		return PCI_EXP_DEVCAP2_TPH_COMP_EXT_TPH;
+	default:
+		return PCI_EXP_DEVCAP2_TPH_COMP_NONE;
+	}
+}
+EXPORT_SYMBOL(pcie_tph_completer_type);
+
 /*
  * Return the size of ST table. If ST table is not in TPH Requester Extended
  * Capability space, return 0. Otherwise return the ST Table Size + 1.
diff --git a/include/linux/pci-tph.h b/include/linux/pci-tph.h
index be68cd17f2f8..7743af6fe432 100644
--- a/include/linux/pci-tph.h
+++ b/include/linux/pci-tph.h
@@ -9,6 +9,8 @@
 #ifndef LINUX_PCI_TPH_H
 #define LINUX_PCI_TPH_H
 
+#include <linux/pci_regs.h>
+
 /*
  * According to the ECN for PCI Firmware Spec, Steering Tag can be different
  * depending on the memory type: Volatile Memory or Persistent Memory. When a
@@ -30,6 +32,8 @@ void pcie_disable_tph(struct pci_dev *pdev);
 int pcie_enable_tph(struct pci_dev *pdev, int mode);
 u16 pcie_tph_get_st_table_size(struct pci_dev *pdev);
 u32 pcie_tph_get_st_table_loc(struct pci_dev *pdev);
+u8 pcie_tph_enabled_req_type(struct pci_dev *pdev);
+u8 pcie_tph_completer_type(struct pci_dev *pdev);
 #else
 static inline int pcie_tph_set_st_entry(struct pci_dev *pdev,
 					unsigned int index, u16 tag)
@@ -41,6 +45,10 @@ static inline int pcie_tph_get_cpu_st(struct pci_dev *dev,
 static inline void pcie_disable_tph(struct pci_dev *pdev) { }
 static inline int pcie_enable_tph(struct pci_dev *pdev, int mode)
 { return -EINVAL; }
+static inline u8 pcie_tph_enabled_req_type(struct pci_dev *pdev)
+{ return PCI_TPH_REQ_DISABLE; }
+static inline u8 pcie_tph_completer_type(struct pci_dev *pdev)
+{ return PCI_EXP_DEVCAP2_TPH_COMP_NONE; }
 #endif
 
 #endif /* LINUX_PCI_TPH_H */
--
2.53.0-Meta
* Re: [PATCH v10 1/4] PCI/TPH: Add requester/completer type helpers
From: sashiko-bot @ 2026-06-30 23:07 UTC
To: Zhiping Zhang; +Cc: linux-pci, kvm, dri-devel

Thank you for your contribution! Sashiko AI review found 1 potential
issue(s) to consider:
- [Medium] The internal `get_rp_completer_type()` function was not
  updated to use the new `pcie_tph_completer_type()` helper, leading to
  an API contract violation in `pcie_tph_enabled_req_type()`.
--

commit d64d0831457ed161ad198682067e0f147ca1411e
Author: Zhiping Zhang <zhipingz@meta.com>

PCI/TPH: Add requester/completer type helpers

This patch introduces pcie_tph_enabled_req_type() and
pcie_tph_completer_type() to allow drivers to query TPH requester mode
and completer capabilities without accessing pci_dev internals directly.
It also ensures the reserved 0b10 completer capability encoding is
folded into a "not supported" value for callers.

> diff --git a/drivers/pci/tph.c b/drivers/pci/tph.c
> index 655ffd60e62f6..e7693fd9d6763 100644
> --- a/drivers/pci/tph.c
> +++ b/drivers/pci/tph.c
> @@ -173,6 +173,51 @@ u32 pcie_tph_get_st_table_loc(struct pci_dev *pdev)
>  }
>  EXPORT_SYMBOL(pcie_tph_get_st_table_loc);
>
> +/**
> + * pcie_tph_enabled_req_type - Return the device's enabled TPH requester type
> + * @pdev: PCI device to query
> + *
> + * Return: PCI_TPH_REQ_DISABLE, PCI_TPH_REQ_TPH_ONLY or PCI_TPH_REQ_EXT_TPH.
> + */
> +u8 pcie_tph_enabled_req_type(struct pci_dev *pdev)
> +{
> +	return pdev->tph_req_type;
> +}

[Severity: Medium]
Could this helper return an undocumented value of 2 if a Root Port's TPH
Completer Supported field incorrectly exposes the reserved 0b10 value?

The internal get_rp_completer_type() function was not updated to use the
new pcie_tph_completer_type() helper, so it retains its manual decoding:

drivers/pci/tph.c:get_rp_completer_type() {
    ...
    return FIELD_GET(PCI_EXP_DEVCAP2_TPH_COMP_MASK, reg);
}

If get_rp_completer_type() returns 2, pcie_enable_tph() can assign this
to pdev->tph_req_type via the min() calculation:

drivers/pci/tph.c:pcie_enable_tph() {
    ...
    rp_req_type = get_rp_completer_type(pdev);

    /* Final req_type is the smallest value of two */
    pdev->tph_req_type = min(pdev->tph_req_type, rp_req_type);
}

Should get_rp_completer_type() be refactored to use
pcie_tph_completer_type() to properly fold the reserved 0b10 encoding
and avoid this API contract violation?

[ ... ]

--
Sashiko AI review · https://sashiko.dev/#/patchset/20260630224328.3218796-1-zhipingz@meta.com?part=1
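The min() argument above can be made concrete with a small userspace
model. All names are hypothetical; it assumes the requester-enable
encodings 0 (disabled), 1 (TPH only) and 3 (Extended TPH) documented for
pcie_tph_enabled_req_type(), and it ignores the all-ones error check.

```c
#include <assert.h>
#include <stdint.h>

/* Requester types as documented for pcie_tph_enabled_req_type(). */
enum { REQ_DISABLE = 0x0, REQ_TPH_ONLY = 0x1, REQ_EXT_TPH = 0x3 };

/* Taking min() only yields a sensible mode if the encodings are ordered. */
static_assert(REQ_DISABLE < REQ_TPH_ONLY && REQ_TPH_ONLY < REQ_EXT_TPH,
	      "TPH modes must be numerically ordered");

/* Raw decode, as quoted for get_rp_completer_type(): bits 13:12 as-is. */
static uint8_t raw_completer_type(uint32_t devcap2)
{
	return (devcap2 >> 12) & 0x3;
}

/* Folded decode, as in pcie_tph_completer_type(): reserved 10b -> none. */
static uint8_t folded_completer_type(uint32_t devcap2)
{
	uint8_t v = raw_completer_type(devcap2);

	return v == 0x2 ? 0x0 : v;
}

/* "Final req_type is the smallest value of two" from pcie_enable_tph(). */
static uint8_t final_req_type(uint8_t dev_req_type, uint8_t rp_type)
{
	return dev_req_type < rp_type ? dev_req_type : rp_type;
}
```

For an Extended-TPH requester behind a Root Port advertising the reserved
10b, the raw decode yields min(3, 2) = 2, a value outside the documented
set; the folded decode yields 0, so TPH stays disabled.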
* [PATCH v10 2/4] dma-buf: add optional get_pci_tph() callback
From: Zhiping Zhang @ 2026-06-30 22:42 UTC
To: Jason Gunthorpe, Leon Romanovsky, Michael Guralnik, Sumit Semwal,
Christian Konig, Alex Williamson, Bjorn Helgaas
Cc: kvm, linux-rdma, linux-pci, dri-devel, Zhiping Zhang

Add an optional dma_buf_ops.get_pci_tph callback and a
DMA-buf importer wrapper, dma_buf_get_pci_tph().

TPH is PCIe TLP Processing Hint. 8-bit ST and 16-bit Extended ST are
distinct PCIe TPH namespaces, so the importer requests the namespace it
can emit and the exporter returns the matching ST/PH tuple or
-EOPNOTSUPP.

dma_buf_get_pci_tph() is the importer entry point. It requires
&dmabuf->resv to be held while the callback runs and returns
-EOPNOTSUPP when the exporter does not provide PCI TPH metadata.

The first user is VFIO_DEVICE_FEATURE_DMA_BUF_TPH in vfio-pci, with
mlx5 as the first importer.

Signed-off-by: Zhiping Zhang <zhipingz@meta.com>
---
 drivers/dma-buf/dma-buf.c | 25 +++++++++++++++++++++++++
 include/linux/dma-buf.h   | 22 ++++++++++++++++++++++
 2 files changed, 47 insertions(+)

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index d504c636dc29..7a4c9b0d5dab 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -1144,6 +1144,31 @@ void dma_buf_unpin(struct dma_buf_attachment *attach)
 }
 EXPORT_SYMBOL_NS_GPL(dma_buf_unpin, "DMA_BUF");
 
+/**
+ * dma_buf_get_pci_tph - Retrieve PCIe TLP Processing Hint (TPH) metadata
+ * @dmabuf: DMA buffer to query
+ * @extended: false for 8-bit ST, true for 16-bit Extended ST
+ * @steering_tag: returns the raw steering tag for the requested namespace
+ * @ph: returns the TPH processing hint
+ *
+ * Wrapper for the optional &dma_buf_ops.get_pci_tph callback.
+ *
+ * Must be called with &dma_buf.resv held. Returns -EOPNOTSUPP if the
+ * exporter does not implement the callback or has no metadata for the
+ * requested namespace.
+ */
+int dma_buf_get_pci_tph(struct dma_buf *dmabuf, bool extended,
+			u16 *steering_tag, u8 *ph)
+{
+	dma_resv_assert_held(dmabuf->resv);
+
+	if (!dmabuf->ops->get_pci_tph)
+		return -EOPNOTSUPP;
+
+	return dmabuf->ops->get_pci_tph(dmabuf, extended, steering_tag, ph);
+}
+EXPORT_SYMBOL_NS_GPL(dma_buf_get_pci_tph, "DMA_BUF");
+
 /**
  * dma_buf_map_attachment - Returns the scatterlist table of the attachment;
  * mapped into _device_ address space. Is a wrapper for map_dma_buf() of the
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
index d1203da56fc5..53b2686ad8fc 100644
--- a/include/linux/dma-buf.h
+++ b/include/linux/dma-buf.h
@@ -113,6 +113,26 @@ struct dma_buf_ops {
 	 */
 	void (*unpin)(struct dma_buf_attachment *attach);
 
+	/**
+	 * @get_pci_tph:
+	 *
+	 * Retrieve PCIe TLP Processing Hint (TPH) steering metadata for
+	 * this buffer so an importer can program a matching ST/PH hint on
+	 * outbound TLPs targeting the exporter for peer-to-peer DMA.
+	 *
+	 * @dmabuf: DMA buffer for which to retrieve TPH metadata
+	 * @extended: false for 8-bit ST, true for 16-bit Extended ST
+	 * @steering_tag: Returns the raw TPH steering tag for the requested
+	 * namespace
+	 * @ph: Returns the TPH processing hint (2-bit value)
+	 *
+	 * Optional callback for dma_buf_get_pci_tph(). Called with
+	 * &dma_buf.resv held. Returns 0 on success or -EOPNOTSUPP when
+	 * the exporter has no metadata for the requested namespace.
+	 */
+	int (*get_pci_tph)(struct dma_buf *dmabuf, bool extended,
+			   u16 *steering_tag, u8 *ph);
+
 	/**
 	 * @map_dma_buf:
 	 *
@@ -563,6 +583,8 @@ void dma_buf_detach(struct dma_buf *dmabuf,
 		    struct dma_buf_attachment *attach);
 int dma_buf_pin(struct dma_buf_attachment *attach);
 void dma_buf_unpin(struct dma_buf_attachment *attach);
+int dma_buf_get_pci_tph(struct dma_buf *dmabuf, bool extended,
+			u16 *steering_tag, u8 *ph);
 
 struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info);
 
--
2.53.0-Meta
* Re: [PATCH v10 2/4] dma-buf: add optional get_pci_tph() callback
From: sashiko-bot @ 2026-06-30 23:01 UTC
To: Zhiping Zhang; +Cc: linux-pci, dri-devel, kvm

Thank you for your contribution! Sashiko AI review found 1 potential
issue(s) to consider:
- [Low] The newly added dma-buf API `dma_buf_get_pci_tph()` and its
  corresponding `get_pci_tph` callback are missing from the centralized
  dma-buf locking convention documentation.
--

commit aa25de1b77cd2b1778e96e6bffe1b1b6ce39fd9c
Author: Zhiping Zhang <zhipingz@meta.com>

dma-buf: add optional get_pci_tph() callback

This commit adds an optional get_pci_tph callback to dma_buf_ops and a
corresponding wrapper function dma_buf_get_pci_tph() for importers.
These allow importers to retrieve PCIe TLP Processing Hint (TPH)
metadata for peer-to-peer DMA. Both the callback and the wrapper require
the dma-buf reservation lock to be held.

> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> index d504c636dc294..7a4c9b0d5dab4 100644
> --- a/drivers/dma-buf/dma-buf.c
> +++ b/drivers/dma-buf/dma-buf.c

[ ... ]

> @@ -1144,6 +1144,31 @@ void dma_buf_unpin(struct dma_buf_attachment *attach)
>  }
>  EXPORT_SYMBOL_NS_GPL(dma_buf_unpin, "DMA_BUF");
>
> +/**
> + * dma_buf_get_pci_tph - Retrieve PCIe TLP Processing Hint (TPH) metadata
> + * @dmabuf: DMA buffer to query
> + * @extended: false for 8-bit ST, true for 16-bit Extended ST
> + * @steering_tag: returns the raw steering tag for the requested namespace
> + * @ph: returns the TPH processing hint
> + *
> + * Wrapper for the optional &dma_buf_ops.get_pci_tph callback.
> + *
> + * Must be called with &dma_buf.resv held. Returns -EOPNOTSUPP if the
> + * exporter does not implement the callback or has no metadata for the
> + * requested namespace.
> + */
> +int dma_buf_get_pci_tph(struct dma_buf *dmabuf, bool extended,
> +			u16 *steering_tag, u8 *ph)
> +{
> +	dma_resv_assert_held(dmabuf->resv);

[Severity: Low]
This isn't a bug, but should dma_buf_get_pci_tph() be added to the list
of functions in the "DOC: locking convention" block earlier in this
file?

The centralized documentation block serves as a single source of truth
for importers by listing all functions that require the dma-buf
reservation lock, such as dma_buf_pin() and dma_buf_vmap(). Adding this
new function to that list would help keep the locking documentation
complete.

> +
> +	if (!dmabuf->ops->get_pci_tph)
> +		return -EOPNOTSUPP;
> +
> +	return dmabuf->ops->get_pci_tph(dmabuf, extended, steering_tag, ph);
> +}
> +EXPORT_SYMBOL_NS_GPL(dma_buf_get_pci_tph, "DMA_BUF");

[ ... ]

> diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
> index d1203da56fc5f..53b2686ad8fc6 100644
> --- a/include/linux/dma-buf.h
> +++ b/include/linux/dma-buf.h

[ ... ]

> @@ -113,6 +113,26 @@ struct dma_buf_ops {
>  	 */
>  	void (*unpin)(struct dma_buf_attachment *attach);
>
> +	/**
> +	 * @get_pci_tph:
> +	 *
> +	 * Retrieve PCIe TLP Processing Hint (TPH) steering metadata for
> +	 * this buffer so an importer can program a matching ST/PH hint on
> +	 * outbound TLPs targeting the exporter for peer-to-peer DMA.
> +	 *
> +	 * @dmabuf: DMA buffer for which to retrieve TPH metadata
> +	 * @extended: false for 8-bit ST, true for 16-bit Extended ST
> +	 * @steering_tag: Returns the raw TPH steering tag for the requested
> +	 * namespace
> +	 * @ph: Returns the TPH processing hint (2-bit value)
> +	 *
> +	 * Optional callback for dma_buf_get_pci_tph(). Called with
> +	 * &dma_buf.resv held. Returns 0 on success or -EOPNOTSUPP when
> +	 * the exporter has no metadata for the requested namespace.
> +	 */
> +	int (*get_pci_tph)(struct dma_buf *dmabuf, bool extended,
> +			   u16 *steering_tag, u8 *ph);

[Severity: Low]
Similarly, should &dma_buf_ops.get_pci_tph() be appended to the list of
callbacks in the same "DOC: locking convention" block? The existing
documentation explicitly calls out exporter callbacks that are invoked
with the locked dma-buf reservation, so updating it here would keep the
locking contract documentation consistent.

--
Sashiko AI review · https://sashiko.dev/#/patchset/20260630224328.3218796-1-zhipingz@meta.com?part=2
* Re: [PATCH v10 2/4] dma-buf: add optional get_pci_tph() callback
From: Christian König @ 2026-07-01 8:25 UTC
To: Zhiping Zhang, Jason Gunthorpe, Leon Romanovsky, Michael Guralnik,
Sumit Semwal, Alex Williamson, Bjorn Helgaas
Cc: kvm, linux-rdma, linux-pci, dri-devel

On 7/1/26 00:42, Zhiping Zhang wrote:
> Add an optional dma_buf_ops.get_pci_tph callback and a
> DMA-buf importer wrapper, dma_buf_get_pci_tph().
>
> TPH is PCIe TLP Processing Hint. 8-bit ST and 16-bit Extended ST are
> distinct PCIe TPH namespaces, so the importer requests the namespace it
> can emit and the exporter returns the matching ST/PH tuple or
> -EOPNOTSUPP.
>
> dma_buf_get_pci_tph() is the importer entry point. It requires
> &dmabuf->resv to be held while the callback runs and returns
> -EOPNOTSUPP when the exporter does not provide PCI TPH metadata.
>
> The first user is VFIO_DEVICE_FEATURE_DMA_BUF_TPH in vfio-pci, with
> mlx5 as the first importer.
>
> Signed-off-by: Zhiping Zhang <zhipingz@meta.com>
> ---
>  drivers/dma-buf/dma-buf.c | 25 +++++++++++++++++++++++++
>  include/linux/dma-buf.h   | 22 ++++++++++++++++++++++
>  2 files changed, 47 insertions(+)
>
> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> index d504c636dc29..7a4c9b0d5dab 100644
> --- a/drivers/dma-buf/dma-buf.c
> +++ b/drivers/dma-buf/dma-buf.c
> @@ -1144,6 +1144,31 @@ void dma_buf_unpin(struct dma_buf_attachment *attach)
>  }
>  EXPORT_SYMBOL_NS_GPL(dma_buf_unpin, "DMA_BUF");
>
> +/**
> + * dma_buf_get_pci_tph - Retrieve PCIe TLP Processing Hint (TPH) metadata
> + * @dmabuf: DMA buffer to query
> + * @extended: false for 8-bit ST, true for 16-bit Extended ST
> + * @steering_tag: returns the raw steering tag for the requested namespace
> + * @ph: returns the TPH processing hint
> + *
> + * Wrapper for the optional &dma_buf_ops.get_pci_tph callback.
> + *
> + * Must be called with &dma_buf.resv held. Returns -EOPNOTSUPP if the
> + * exporter does not implement the callback or has no metadata for the
> + * requested namespace.

Please add something like this:

 * The returned information is only valid till the next invalidate_mappings() callback from the exporter and should be re-queried when a new mapping is created after invalidation.

Apart from that it looks good to me, but I still think we need some kind of example that this works for other DMA-buf users as well.

Just demonstrating that this also works with some simple FPGA or similar PCIe endpoint should be sufficient.

Regards,
Christian.

> + */
> +int dma_buf_get_pci_tph(struct dma_buf *dmabuf, bool extended,
> +			u16 *steering_tag, u8 *ph)
> +{
> +	dma_resv_assert_held(dmabuf->resv);
> +
> +	if (!dmabuf->ops->get_pci_tph)
> +		return -EOPNOTSUPP;
> +
> +	return dmabuf->ops->get_pci_tph(dmabuf, extended, steering_tag, ph);
> +}
> +EXPORT_SYMBOL_NS_GPL(dma_buf_get_pci_tph, "DMA_BUF");
> +
>  /**
>   * dma_buf_map_attachment - Returns the scatterlist table of the attachment;
>   * mapped into _device_ address space. Is a wrapper for map_dma_buf() of the
> diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
> index d1203da56fc5..53b2686ad8fc 100644
> --- a/include/linux/dma-buf.h
> +++ b/include/linux/dma-buf.h
> @@ -113,6 +113,26 @@ struct dma_buf_ops {
>  	 */
>  	void (*unpin)(struct dma_buf_attachment *attach);
>
> +	/**
> +	 * @get_pci_tph:
> +	 *
> +	 * Retrieve PCIe TLP Processing Hint (TPH) steering metadata for
> +	 * this buffer so an importer can program a matching ST/PH hint on
> +	 * outbound TLPs targeting the exporter for peer-to-peer DMA.
> +	 *
> +	 * @dmabuf: DMA buffer for which to retrieve TPH metadata
> +	 * @extended: false for 8-bit ST, true for 16-bit Extended ST
> +	 * @steering_tag: Returns the raw TPH steering tag for the requested
> +	 * namespace
> +	 * @ph: Returns the TPH processing hint (2-bit value)
> +	 *
> +	 * Optional callback for dma_buf_get_pci_tph(). Called with
> +	 * &dma_buf.resv held. Returns 0 on success or -EOPNOTSUPP when
> +	 * the exporter has no metadata for the requested namespace.
> +	 */
> +	int (*get_pci_tph)(struct dma_buf *dmabuf, bool extended,
> +			   u16 *steering_tag, u8 *ph);
> +
>  	/**
>  	 * @map_dma_buf:
>  	 *
> @@ -563,6 +583,8 @@ void dma_buf_detach(struct dma_buf *dmabuf,
>  		    struct dma_buf_attachment *attach);
>  int dma_buf_pin(struct dma_buf_attachment *attach);
>  void dma_buf_unpin(struct dma_buf_attachment *attach);
> +int dma_buf_get_pci_tph(struct dma_buf *dmabuf, bool extended,
> +			u16 *steering_tag, u8 *ph);
>
>  struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info);
>
* Re: [PATCH v10 2/4] dma-buf: add optional get_pci_tph() callback
From: Zhiping Zhang @ 2026-07-01 17:53 UTC
To: Christian König
Cc: Jason Gunthorpe, Leon Romanovsky, Michael Guralnik, Sumit Semwal,
Alex Williamson, Bjorn Helgaas, kvm, linux-rdma, linux-pci, dri-devel

> > diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> > index d504c636dc29..7a4c9b0d5dab 100644
> > --- a/drivers/dma-buf/dma-buf.c
> > +++ b/drivers/dma-buf/dma-buf.c
> > @@ -1144,6 +1144,31 @@ void dma_buf_unpin(struct dma_buf_attachment *attach)
> >  }
> >  EXPORT_SYMBOL_NS_GPL(dma_buf_unpin, "DMA_BUF");
> >
> > +/**
> > + * dma_buf_get_pci_tph - Retrieve PCIe TLP Processing Hint (TPH) metadata
> > + * @dmabuf: DMA buffer to query
> > + * @extended: false for 8-bit ST, true for 16-bit Extended ST
> > + * @steering_tag: returns the raw steering tag for the requested namespace
> > + * @ph: returns the TPH processing hint
> > + *
> > + * Wrapper for the optional &dma_buf_ops.get_pci_tph callback.
> > + *
> > + * Must be called with &dma_buf.resv held. Returns -EOPNOTSUPP if the
> > + * exporter does not implement the callback or has no metadata for the
> > + * requested namespace.
>
> Please add something like this:
>
> * The returned information is only valid till the next invalidate_mappings() callback from the exporter and should be re-queried when a new mapping is created after invalidation.
>

Thanks, Will do in v11!

> Apart from that it looks good to me, but I still think we need some kind of example that this works for other DMA-buf users as well.
>
> Just demonstrating that this also works with some simple FPGA or similar PCIe endpoint should be sufficient.
>
> Regards,
> Christian.
>

On v10, I have validated a second importer: another vendor's NIC
(driver not upstream yet, so locally patched to
call dma_buf_get_pci_tph). A PCIe analyzer confirms the TLP steering
tag matches the exporter's for both mlx5/ConnectX-8
and this second NIC: two unrelated importer drivers exercising the
API end-to-end.

Thanks,
Zhiping
* Re: [PATCH v10 2/4] dma-buf: add optional get_pci_tph() callback
From: Christian König @ 2026-07-02 7:06 UTC
To: Zhiping Zhang
Cc: Jason Gunthorpe, Leon Romanovsky, Michael Guralnik, Sumit Semwal,
Alex Williamson, Bjorn Helgaas, kvm, linux-rdma, linux-pci, dri-devel

On 7/1/26 19:53, Zhiping Zhang wrote:
>>> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
>>> index d504c636dc29..7a4c9b0d5dab 100644
>>> --- a/drivers/dma-buf/dma-buf.c
>>> +++ b/drivers/dma-buf/dma-buf.c
>>> @@ -1144,6 +1144,31 @@ void dma_buf_unpin(struct dma_buf_attachment *attach)
>>>  }
>>>  EXPORT_SYMBOL_NS_GPL(dma_buf_unpin, "DMA_BUF");
>>>
>>> +/**
>>> + * dma_buf_get_pci_tph - Retrieve PCIe TLP Processing Hint (TPH) metadata
>>> + * @dmabuf: DMA buffer to query
>>> + * @extended: false for 8-bit ST, true for 16-bit Extended ST
>>> + * @steering_tag: returns the raw steering tag for the requested namespace
>>> + * @ph: returns the TPH processing hint
>>> + *
>>> + * Wrapper for the optional &dma_buf_ops.get_pci_tph callback.
>>> + *
>>> + * Must be called with &dma_buf.resv held. Returns -EOPNOTSUPP if the
>>> + * exporter does not implement the callback or has no metadata for the
>>> + * requested namespace.
>>
>> Please add something like this:
>>
>> * The returned information is only valid till the next invalidate_mappings() callback from the exporter and should be re-queried when a new mapping is created after invalidation.
>>
>
> Thanks, Will do in v11!
>
>> Apart from that it looks good to me, but I still think we need some kind of example that this works for other DMA-buf users as well.
>>
>> Just demonstrating that this also works with some simple FPGA or similar PCIe endpoint should be sufficient.
>>
>> Regards,
>> Christian.
>>
>
> On v10, I have validated a second importer: another vendor's NIC
> (driver not upstream yet, so locally patched to
> call dma_buf_get_pci_tph). A PCIe analyzer confirms the TLP steering
> tag matches the exporter's for both mlx5/ConnectX-8
> and this second NIC: two unrelated importer drivers exercising the
> API end-to-end.

That sounds like it would be sufficient, yes.

Thanks,
Christian.

>
> Thanks,
> Zhiping
* [PATCH v10 3/4] vfio/pci: implement get_pci_tph and DMA_BUF_TPH feature 2026-06-30 22:42 [PATCH v10 0/4] vfio/dma-buf: add TPH support for peer-to-peer access Zhiping Zhang 2026-06-30 22:42 ` [PATCH v10 1/4] PCI/TPH: Add requester/completer type helpers Zhiping Zhang 2026-06-30 22:42 ` [PATCH v10 2/4] dma-buf: add optional get_pci_tph() callback Zhiping Zhang @ 2026-06-30 22:42 ` Zhiping Zhang 2026-06-30 23:08 ` sashiko-bot 2026-07-01 18:07 ` Alex Williamson 2026-06-30 22:42 ` [PATCH v10 4/4] RDMA/mlx5: get tph for p2p access when registering dma-buf mr Zhiping Zhang 3 siblings, 2 replies; 14+ messages in thread From: Zhiping Zhang @ 2026-06-30 22:42 UTC (permalink / raw) To: Jason Gunthorpe, Leon Romanovsky, Michael Guralnik, Sumit Semwal, Christian Konig, Alex Williamson, Bjorn Helgaas Cc: kvm, linux-rdma, linux-pci, dri-devel, Zhiping Zhang Implement dma-buf get_pci_tph for vfio-pci exported dma-bufs and add VFIO_DEVICE_FEATURE_DMA_BUF_TPH so userspace can publish TPH metadata for a VFIO-owned device. 8-bit ST and 16-bit Extended ST are distinct PCIe TPH namespaces; the uAPI carries both with explicit validity flags, and get_pci_tph() returns the value matching the importer's requested namespace or -EOPNOTSUPP. Publish and read the TPH descriptor under dmabuf->resv, matching the locking used for other importer-visible dma-buf state. The SET ioctl takes dma_resv_lock_interruptible(), while the callback runs under DMA-buf's asserted resv lock. Reject requests the device cannot consume as a completer: pcie_tph_completer_type() must report at least PCI_EXP_DEVCAP2_TPH_COMP_TPH_ONLY, and Extended ST requires PCI_EXP_DEVCAP2_TPH_COMP_EXT_TPH. Make PROBE follow the same hardware gate so the feature only probes as supported when the device can really consume it. 
Signed-off-by: Zhiping Zhang <zhipingz@meta.com>
---
 drivers/vfio/pci/vfio_pci_core.c   |  3 +
 drivers/vfio/pci/vfio_pci_dmabuf.c | 99 +++++++++++++++++++++++++++++-
 drivers/vfio/pci/vfio_pci_priv.h   | 12 ++++
 include/uapi/linux/vfio.h          | 43 +++++++++++++
 4 files changed, 155 insertions(+), 2 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index a28f1e99362c..c7d6902bc61b 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1572,6 +1572,9 @@ int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
 		return vfio_pci_core_feature_token(vdev, flags, arg, argsz);
 	case VFIO_DEVICE_FEATURE_DMA_BUF:
 		return vfio_pci_core_feature_dma_buf(vdev, flags, arg, argsz);
+	case VFIO_DEVICE_FEATURE_DMA_BUF_TPH:
+		return vfio_pci_core_feature_dma_buf_tph(vdev, flags, arg,
+							 argsz);
 	default:
 		return -ENOTTY;
 	}
diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c
index c16f460c01d6..8de72f9e7502 100644
--- a/drivers/vfio/pci/vfio_pci_dmabuf.c
+++ b/drivers/vfio/pci/vfio_pci_dmabuf.c
@@ -3,6 +3,7 @@
  */
 #include <linux/dma-buf-mapping.h>
 #include <linux/pci-p2pdma.h>
+#include <linux/pci-tph.h>
 #include <linux/dma-resv.h>
 
 #include "vfio_pci_priv.h"
@@ -19,7 +20,14 @@ struct vfio_pci_dma_buf {
 	u32 nr_ranges;
 	struct kref kref;
 	struct completion comp;
-	u8 revoked : 1;
+
+	/* Protected by dmabuf->resv. */
+	u16 tph_st_ext;
+	u8 tph_st;
+	bool revoked;
+	u8 tph_st_valid:1;
+	u8 tph_st_ext_valid:1;
+	u8 tph_ph:2;
 };
 
 static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf,
@@ -30,7 +38,7 @@ static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf,
 	if (!attachment->peer2peer)
 		return -EOPNOTSUPP;
 
-	if (priv->revoked)
+	if (READ_ONCE(priv->revoked))
 		return -ENODEV;
 
 	if (!dma_buf_attach_revocable(attachment))
@@ -69,6 +77,26 @@ vfio_pci_dma_buf_map(struct dma_buf_attachment *attachment,
 	return ret;
 }
 
+static int vfio_pci_dma_buf_get_pci_tph(struct dma_buf *dmabuf, bool extended,
+					u16 *steering_tag, u8 *ph)
+{
+	struct vfio_pci_dma_buf *priv = dmabuf->priv;
+
+	dma_resv_assert_held(dmabuf->resv);
+
+	if (extended) {
+		if (!priv->tph_st_ext_valid)
+			return -EOPNOTSUPP;
+		*steering_tag = priv->tph_st_ext;
+	} else {
+		if (!priv->tph_st_valid)
+			return -EOPNOTSUPP;
+		*steering_tag = priv->tph_st;
+	}
+	*ph = priv->tph_ph;
+	return 0;
+}
+
 static void vfio_pci_dma_buf_unmap(struct dma_buf_attachment *attachment,
 				   struct sg_table *sgt,
 				   enum dma_data_direction dir)
@@ -101,6 +129,7 @@ static void vfio_pci_dma_buf_release(struct dma_buf *dmabuf)
 
 static const struct dma_buf_ops vfio_pci_dmabuf_ops = {
 	.attach = vfio_pci_dma_buf_attach,
+	.get_pci_tph = vfio_pci_dma_buf_get_pci_tph,
 	.map_dma_buf = vfio_pci_dma_buf_map,
 	.unmap_dma_buf = vfio_pci_dma_buf_unmap,
 	.release = vfio_pci_dma_buf_release,
@@ -333,6 +362,72 @@ int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
 	return ret;
 }
 
+int vfio_pci_core_feature_dma_buf_tph(struct vfio_pci_core_device *vdev,
+				      u32 flags,
+				      struct vfio_device_feature_dma_buf_tph __user *arg,
+				      size_t argsz)
+{
+	struct vfio_device_feature_dma_buf_tph set_tph;
+	struct vfio_pci_dma_buf *priv;
+	struct dma_buf *dmabuf;
+	u8 comp;
+	int ret;
+
+	comp = pcie_tph_completer_type(vdev->pdev);
+	if (comp == PCI_EXP_DEVCAP2_TPH_COMP_NONE)
+		return -EOPNOTSUPP;
+
+	ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_SET,
+				 sizeof(set_tph));
+	if (ret != 1)
+		return ret;
+
+	if (copy_from_user(&set_tph, arg, sizeof(set_tph)))
+		return -EFAULT;
+
+	if (set_tph.flags & ~(VFIO_DMA_BUF_TPH_ST | VFIO_DMA_BUF_TPH_ST_EXT))
+		return -EINVAL;
+
+	if (set_tph.ph & ~0x3)
+		return -EINVAL;
+
+	if ((set_tph.flags & VFIO_DMA_BUF_TPH_ST_EXT) &&
+	    comp != PCI_EXP_DEVCAP2_TPH_COMP_EXT_TPH)
+		return -EOPNOTSUPP;
+
+	dmabuf = dma_buf_get(set_tph.dmabuf_fd);
+	if (IS_ERR(dmabuf))
+		return PTR_ERR(dmabuf);
+
+	if (dmabuf->ops != &vfio_pci_dmabuf_ops) {
+		ret = -EINVAL;
+		goto out_put;
+	}
+
+	priv = dmabuf->priv;
+	if (priv->vdev != vdev) {
+		ret = -EINVAL;
+		goto out_put;
+	}
+
+	ret = dma_resv_lock_interruptible(dmabuf->resv, NULL);
+	if (ret)
+		goto out_put;
+
+	priv->tph_st = set_tph.steering_tag;
+	priv->tph_st_ext = set_tph.steering_tag_ext;
+	priv->tph_ph = set_tph.ph;
+	priv->tph_st_valid = !!(set_tph.flags & VFIO_DMA_BUF_TPH_ST);
+	priv->tph_st_ext_valid =
+		!!(set_tph.flags & VFIO_DMA_BUF_TPH_ST_EXT);
+	dma_resv_unlock(dmabuf->resv);
+	ret = 0;
+
+out_put:
+	dma_buf_put(dmabuf);
+	return ret;
+}
+
 void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked)
 {
 	struct vfio_pci_dma_buf *priv;
diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h
index fca9d0dfac90..c58f369be4b3 100644
--- a/drivers/vfio/pci/vfio_pci_priv.h
+++ b/drivers/vfio/pci/vfio_pci_priv.h
@@ -118,6 +118,10 @@ static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
 int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
 				  struct vfio_device_feature_dma_buf __user *arg,
 				  size_t argsz);
+int vfio_pci_core_feature_dma_buf_tph(struct vfio_pci_core_device *vdev,
+				      u32 flags,
+				      struct vfio_device_feature_dma_buf_tph __user *arg,
+				      size_t argsz);
 void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev);
 void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked);
 #else
@@ -128,6 +132,14 @@ vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
 {
 	return -ENOTTY;
 }
+
+static inline int
+vfio_pci_core_feature_dma_buf_tph(struct vfio_pci_core_device *vdev, u32 flags,
+				  struct vfio_device_feature_dma_buf_tph __user *arg,
+				  size_t argsz)
+{
+	return -ENOTTY;
+}
 static inline void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev)
 {
 }
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 5de618a3a5ee..4c1c70aac150 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -1534,6 +1534,49 @@ struct vfio_device_feature_dma_buf {
  */
 #define VFIO_DEVICE_FEATURE_MIG_PRECOPY_INFOv2 12
 
+/**
+ * Upon VFIO_DEVICE_FEATURE_SET associate TPH (TLP Processing Hints) metadata
+ * with a vfio-exported dma-buf. The dma-buf must have been created by
+ * VFIO_DEVICE_FEATURE_DMA_BUF on this device, and the device must report
+ * TPH Completer support in Device Capabilities 2 (bits 13:12); requests
+ * carrying VFIO_DMA_BUF_TPH_ST_EXT additionally require the device to
+ * report the Extended TPH Completer encoding. Otherwise the ioctl
+ * returns -EOPNOTSUPP.
+ *
+ * @dmabuf_fd is the file descriptor returned by VFIO_DEVICE_FEATURE_DMA_BUF.
+ *
+ * 8-bit ST (@steering_tag) and 16-bit Extended ST (@steering_tag_ext) are
+ * distinct namespaces. @flags is the authoritative validity mask for the
+ * two namespaces: VFIO_DMA_BUF_TPH_ST marks @steering_tag valid and
+ * VFIO_DMA_BUF_TPH_ST_EXT marks @steering_tag_ext valid. An importer
+ * requests one namespace and receives the matching value.
+ *
+ * Each SET fully replaces the dma-buf's TPH state for future queries
+ * only. A namespace whose flag bit is clear is reported as unsupported
+ * to future importers; an importer that has already retrieved a value
+ * is unaffected. @flags == 0 leaves both namespaces unsupported for
+ * future importers.
+ *
+ * @ph is the 2-bit TLP Processing Hint and must be in the range [0, 3].
+ *
+ * Undefined @flags and @ph bits must always be zero.
+ *
+ * Userspace must configure TPH before handing the dma-buf fd to an
+ * importer.
+ */
+#define VFIO_DEVICE_FEATURE_DMA_BUF_TPH 13
+
+#define VFIO_DMA_BUF_TPH_ST	(1 << 0)
+#define VFIO_DMA_BUF_TPH_ST_EXT	(1 << 1)
+
+struct vfio_device_feature_dma_buf_tph {
+	__s32 dmabuf_fd;
+	__u32 flags;
+	__u16 steering_tag_ext;
+	__u8 steering_tag;
+	__u8 ph;
+};
+
 /* -------- API for Type1 VFIO IOMMU -------- */
 
 /**
-- 
2.53.0-Meta
* Re: [PATCH v10 3/4] vfio/pci: implement get_pci_tph and DMA_BUF_TPH feature
2026-06-30 22:42 ` [PATCH v10 3/4] vfio/pci: implement get_pci_tph and DMA_BUF_TPH feature Zhiping Zhang
@ 2026-06-30 23:08 ` sashiko-bot
2026-07-01 18:07 ` Alex Williamson
1 sibling, 0 replies; 14+ messages in thread
From: sashiko-bot @ 2026-06-30 23:08 UTC
To: Zhiping Zhang; +Cc: kvm, linux-pci, dri-devel

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
- [Low] Lockless plain load of `priv->vdev` races with concurrent modification in cleanup path.
- [Low] Missing `WRITE_ONCE()` for `priv->revoked` leads to data race with `READ_ONCE()`.
--

commit d524eab36e9482385fe1c1a6c6280e667bf56a9c
Author: Zhiping Zhang <zhipingz@meta.com>
Subject: vfio/pci: implement get_pci_tph and DMA_BUF_TPH feature

This commit implements dma-buf get_pci_tph for vfio-pci exported dma-bufs and
adds VFIO_DEVICE_FEATURE_DMA_BUF_TPH. This allows userspace to publish TPH
metadata for a VFIO-owned device.

> diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c
> index c16f460c01d68..8de72f9e75024 100644
> --- a/drivers/vfio/pci/vfio_pci_dmabuf.c
> +++ b/drivers/vfio/pci/vfio_pci_dmabuf.c
> @@ -30,7 +38,7 @@ static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf,
>  	if (!attachment->peer2peer)
>  		return -EOPNOTSUPP;
>
> -	if (priv->revoked)
> +	if (READ_ONCE(priv->revoked))

[Severity: Low]
Since this updates the read of priv->revoked to use READ_ONCE(), should the
corresponding writers in vfio_pci_dma_buf_move() also be updated to use
WRITE_ONCE()?

For example, when a concurrent state change like a device reset writes to
this field:

vfio_pci_dma_buf_move() {
    ...
    priv->revoked = true;
    ...
}

Will the lack of WRITE_ONCE() here trigger KCSAN data race warnings when
evaluated concurrently by the new READ_ONCE() check?

>  		return -ENODEV;
>
>  	if (!dma_buf_attach_revocable(attachment))

[ ... ]

> @@ -333,6 +362,72 @@ int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
>  	return ret;
>  }
>
> +int vfio_pci_core_feature_dma_buf_tph(struct vfio_pci_core_device *vdev,
> +				      u32 flags,
> +				      struct vfio_device_feature_dma_buf_tph __user *arg,
> +				      size_t argsz)
> +{

[ ... ]

> +	dmabuf = dma_buf_get(set_tph.dmabuf_fd);
> +	if (IS_ERR(dmabuf))
> +		return PTR_ERR(dmabuf);
> +
> +	if (dmabuf->ops != &vfio_pci_dmabuf_ops) {
> +		ret = -EINVAL;
> +		goto out_put;
> +	}
> +
> +	priv = dmabuf->priv;
> +	if (priv->vdev != vdev) {

[Severity: Low]
Can this lockless read of priv->vdev race with concurrent modifications in
the cleanup path?

If userspace calls the VFIO_DEVICE_FEATURE_DMA_BUF_TPH ioctl on device A,
passing a dmabuf_fd belonging to device B, while device B is concurrently
being closed, could this race with vfio_pci_dma_buf_cleanup()?

vfio_pci_dma_buf_cleanup() {
    ...
    priv->vdev = NULL;
    ...
}

Will this concurrent plain store trigger KCSAN warnings during the pointer
equality check?

> +		ret = -EINVAL;
> +		goto out_put;
> +	}
> +
> +	ret = dma_resv_lock_interruptible(dmabuf->resv, NULL);

--
Sashiko AI review · https://sashiko.dev/#/patchset/20260630224328.3218796-1-zhipingz@meta.com?part=3
* Re: [PATCH v10 3/4] vfio/pci: implement get_pci_tph and DMA_BUF_TPH feature
2026-06-30 22:42 ` [PATCH v10 3/4] vfio/pci: implement get_pci_tph and DMA_BUF_TPH feature Zhiping Zhang
2026-06-30 23:08 ` sashiko-bot
@ 2026-07-01 18:07 ` Alex Williamson
2026-07-01 21:07 ` Zhiping Zhang
1 sibling, 1 reply; 14+ messages in thread
From: Alex Williamson @ 2026-07-01 18:07 UTC
To: Zhiping Zhang
Cc: Jason Gunthorpe, Leon Romanovsky, Michael Guralnik, Sumit Semwal,
Christian Konig, Bjorn Helgaas, kvm, linux-rdma, linux-pci, dri-devel, alex

On Tue, 30 Jun 2026 15:42:25 -0700
Zhiping Zhang <zhipingz@meta.com> wrote:

> Implement dma-buf get_pci_tph for vfio-pci exported dma-bufs and add
> VFIO_DEVICE_FEATURE_DMA_BUF_TPH so userspace can publish TPH metadata
> for a VFIO-owned device.
>
> 8-bit ST and 16-bit Extended ST are distinct PCIe TPH namespaces; the
> uAPI carries both with explicit validity flags, and get_pci_tph()
> returns the value matching the importer's requested namespace or
> -EOPNOTSUPP.
>
> Publish and read the TPH descriptor under dmabuf->resv, matching the
> locking used for other importer-visible dma-buf state. The SET ioctl
> takes dma_resv_lock_interruptible(), while the callback runs under
> DMA-buf's asserted resv lock.
>
> Reject requests the device cannot consume as a completer:
> pcie_tph_completer_type() must report at least
> PCI_EXP_DEVCAP2_TPH_COMP_TPH_ONLY, and Extended ST requires
> PCI_EXP_DEVCAP2_TPH_COMP_EXT_TPH. Make PROBE follow the same hardware
> gate so the feature only probes as supported when the device can really
> consume it.
>
> Signed-off-by: Zhiping Zhang <zhipingz@meta.com>
> ---
>  drivers/vfio/pci/vfio_pci_core.c   |  3 +
>  drivers/vfio/pci/vfio_pci_dmabuf.c | 99 +++++++++++++++++++++++++++++-
>  drivers/vfio/pci/vfio_pci_priv.h   | 12 ++++
>  include/uapi/linux/vfio.h          | 43 +++++++++++++
>  4 files changed, 155 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index a28f1e99362c..c7d6902bc61b 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -1572,6 +1572,9 @@ int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
>  		return vfio_pci_core_feature_token(vdev, flags, arg, argsz);
>  	case VFIO_DEVICE_FEATURE_DMA_BUF:
>  		return vfio_pci_core_feature_dma_buf(vdev, flags, arg, argsz);
> +	case VFIO_DEVICE_FEATURE_DMA_BUF_TPH:
> +		return vfio_pci_core_feature_dma_buf_tph(vdev, flags, arg,
> +							 argsz);
>  	default:
>  		return -ENOTTY;
>  	}
> diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c
> index c16f460c01d6..8de72f9e7502 100644
> --- a/drivers/vfio/pci/vfio_pci_dmabuf.c
> +++ b/drivers/vfio/pci/vfio_pci_dmabuf.c
> @@ -3,6 +3,7 @@
>   */
>  #include <linux/dma-buf-mapping.h>
>  #include <linux/pci-p2pdma.h>
> +#include <linux/pci-tph.h>
>  #include <linux/dma-resv.h>
>
>  #include "vfio_pci_priv.h"
> @@ -19,7 +20,14 @@ struct vfio_pci_dma_buf {
>  	u32 nr_ranges;
>  	struct kref kref;
>  	struct completion comp;
> -	u8 revoked : 1;
> +
> +	/* Protected by dmabuf->resv. */
> +	u16 tph_st_ext;
> +	u8 tph_st;
> +	bool revoked;
> +	u8 tph_st_valid:1;
> +	u8 tph_st_ext_valid:1;
> +	u8 tph_ph:2;

Since it seems there will be a v11, note again the comment made here on
v9:

On Tue, 23 Jun 2026 22:24:54 -0700
Zhiping Zhang <zhipingz@meta.com> wrote:

> On Tue, Jun 23, 2026 at 11:17 AM Alex Williamson <alex@shazbot.org> wrote:
> >
> > Nit, it would be more accurate to say:
> >
> > 	/*
> > 	 * Updates protected by dmabuf->resv, @revoked additionally
> > 	 * protected by memory_lock.
> > 	 */
> >
> > revoked also has an unprotected read, but it's previously existing and
> > benign, and likely just needs a READ_ONCE() annotation.
> >
>
> Agreed, I'll update the comment and add READ_ONCE() as well.

The READ_ONCE was added, but the comment remains as in v9. The
READ_ONCE rationale should be described in the commit log too. Thanks,

Alex
* Re: [PATCH v10 3/4] vfio/pci: implement get_pci_tph and DMA_BUF_TPH feature
2026-07-01 18:07 ` Alex Williamson
@ 2026-07-01 21:07 ` Zhiping Zhang
0 siblings, 0 replies; 14+ messages in thread
From: Zhiping Zhang @ 2026-07-01 21:07 UTC
To: Alex Williamson
Cc: Jason Gunthorpe, Leon Romanovsky, Michael Guralnik, Sumit Semwal,
Christian Konig, Bjorn Helgaas, kvm, linux-rdma, linux-pci, dri-devel

> Since it seems there will be a v11, note again the comment made here on
> v9:
>
> On Tue, 23 Jun 2026 22:24:54 -0700
> Zhiping Zhang <zhipingz@meta.com> wrote:
> > On Tue, Jun 23, 2026 at 11:17 AM Alex Williamson <alex@shazbot.org> wrote:
> > >
> > > Nit, it would be more accurate to say:
> > >
> > > 	/*
> > > 	 * Updates protected by dmabuf->resv, @revoked additionally
> > > 	 * protected by memory_lock.
> > > 	 */
> > >
> > > revoked also has an unprotected read, but it's previously existing and
> > > benign, and likely just needs a READ_ONCE() annotation.
> > >
> >
> > Agreed, I'll update the comment and add READ_ONCE() as well.
>
> The READ_ONCE was added, but the comment remains as in v9. The
> READ_ONCE rationale should be described in the commit log too. Thanks,
>
> Alex

Sure, sorry for the miss. Will do!

Thanks,
Zhiping
* [PATCH v10 4/4] RDMA/mlx5: get tph for p2p access when registering dma-buf mr
2026-06-30 22:42 [PATCH v10 0/4] vfio/dma-buf: add TPH support for peer-to-peer access Zhiping Zhang
` (2 preceding siblings ...)
2026-06-30 22:42 ` [PATCH v10 3/4] vfio/pci: implement get_pci_tph and DMA_BUF_TPH feature Zhiping Zhang
@ 2026-06-30 22:42 ` Zhiping Zhang
2026-06-30 23:11 ` sashiko-bot
3 siblings, 1 reply; 14+ messages in thread
From: Zhiping Zhang @ 2026-06-30 22:42 UTC
To: Jason Gunthorpe, Leon Romanovsky, Michael Guralnik, Sumit Semwal,
Christian Konig, Alex Williamson, Bjorn Helgaas
Cc: kvm, linux-rdma, linux-pci, dri-devel, Zhiping Zhang

Peer-to-peer DMA between a mlx5 NIC and a foreign PCIe endpoint
(typically a GPU or a vfio-pci passthrough device) traverses the host
PCIe fabric. The endpoint exporting the dma-buf knows which PCIe TLP
Processing Hint (TPH) Steering Tag yields the best placement for the
traffic it will sink: per-endpoint hint selection lets the root complex
or switch direct DMA to a specific cache slice / NUMA node, cutting
cross-socket snoop traffic and DRAM pressure under sustained p2p
workloads.

Until now the mlx5 importer had no way to learn the exporter's chosen
ST tag, so dma-buf MRs were registered without TPH and ran with the
default (no-hint) routing. With dma_buf_get_pci_tph() in place this
patch wires up mlx5_ib to query that metadata at MR registration time
for p2p access and use it to program requester-side TPH on the outbound
mkey. If the exporter has no metadata, fall back to the existing no-TPH
path so behavior for non-TPH-aware exporters is unchanged.

Use mlx5_st_alloc_index_by_tag() to translate exporter-provided steering
tags into local ST entries when table mode is active, and add
mlx5_st_get_index() for DMAH-backed flows that already carry an ST
index.

For TPH-backed FRMRs, keep the extra ST-table reference tied to MR
lifetime rather than pooled mkey lifetime.
Acquire the ref before MR creation and release it again when the MR is
returned to the pool or the backing mkey is destroyed, while leaving the
generic FRMR pool core unchanged.

Import the DMA_BUF namespace for the new dma_buf_get_pci_tph() call so
modular mlx5_ib builds link cleanly.

Signed-off-by: Zhiping Zhang <zhipingz@meta.com>
---
 drivers/infiniband/hw/mlx5/main.c             |   1 +
 drivers/infiniband/hw/mlx5/mr.c               | 116 +++++++++++++++++-
 .../net/ethernet/mellanox/mlx5/core/lib/st.c  |  49 ++++++--
 include/linux/mlx5/driver.h                   |  15 +++
 4 files changed, 167 insertions(+), 14 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index 02809114fc79..a2b497f6b16b 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -60,6 +60,7 @@ MODULE_AUTHOR("Eli Cohen <eli@mellanox.com>");
 MODULE_DESCRIPTION("Mellanox 5th generation network adapters (ConnectX series) IB driver");
 MODULE_LICENSE("Dual BSD/GPL");
+MODULE_IMPORT_NS("DMA_BUF");
 
 struct mlx5_ib_event_work {
 	struct work_struct work;
diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index e6b74955d95d..55216ae63761 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -39,6 +39,7 @@
 #include <linux/delay.h>
 #include <linux/dma-buf.h>
 #include <linux/dma-resv.h>
+#include <linux/pci-tph.h>
 #include <rdma/frmr_pools.h>
 #include <rdma/ib_umem_odp.h>
 #include "dm.h"
@@ -167,6 +168,39 @@ static int get_unchangeable_access_flags(struct mlx5_ib_dev *dev,
 #define MLX5_FRMR_POOLS_KERNEL_KEY_PH_MASK GENMASK_ULL(23, 16)
 #define MLX5_FRMR_POOLS_KERNEL_KEY_ST_INDEX_MASK GENMASK_ULL(15, 0)
 
+static u8 mlx5_ib_tph_key_ph(u8 ph)
+{
+	if (ph == MLX5_IB_NO_PH || ph == 0)
+		ph ^= MLX5_IB_NO_PH;
+
+	return ph;
+}
+
+static int mlx5_ib_get_st_handle_ref(struct mlx5_ib_dev *dev, u16 st_index)
+{
+	if (st_index == MLX5_MKC_PCIE_TPH_NO_STEERING_TAG_INDEX)
+		return 0;
+
+	return mlx5_st_get_index(dev->mdev, st_index);
+}
+
+static void mlx5_ib_put_st_index_ref(struct mlx5_ib_dev *dev, u16 st_index)
+{
+	if (st_index == MLX5_MKC_PCIE_TPH_NO_STEERING_TAG_INDEX)
+		return;
+
+	mlx5_st_dealloc_index(dev->mdev, st_index);
+}
+
+static void mlx5_ib_put_st_handle_ref(struct mlx5_ib_dev *dev,
+				      u64 kernel_vendor_key)
+{
+	u16 st_index = FIELD_GET(MLX5_FRMR_POOLS_KERNEL_KEY_ST_INDEX_MASK,
+				 kernel_vendor_key);
+
+	mlx5_ib_put_st_index_ref(dev, st_index);
+}
+
 static struct mlx5_ib_mr *
 _mlx5_frmr_pool_alloc(struct mlx5_ib_dev *dev, struct ib_umem *umem,
 		      int access_flags, int access_mode,
@@ -189,13 +223,10 @@ _mlx5_frmr_pool_alloc(struct mlx5_ib_dev *dev, struct ib_umem *umem,
 			MLX5_FRMR_POOLS_KEY_ACCESS_MODE_KSM_MASK :
 			0;
 
-	/* Normalize ph: swap 0 and MLX5_IB_NO_PH */
-	if (ph == MLX5_IB_NO_PH || ph == 0)
-		ph ^= MLX5_IB_NO_PH;
-
 	mr->ibmr.frmr.key.kernel_vendor_key =
 		FIELD_PREP(MLX5_FRMR_POOLS_KERNEL_KEY_ST_INDEX_MASK, st_index) |
-		FIELD_PREP(MLX5_FRMR_POOLS_KERNEL_KEY_PH_MASK, ph);
+		FIELD_PREP(MLX5_FRMR_POOLS_KERNEL_KEY_PH_MASK,
+			   mlx5_ib_tph_key_ph(ph));
 	err = ib_frmr_pool_pop(&dev->ib_dev, &mr->ibmr);
 	if (err) {
 		kfree(mr);
@@ -218,7 +249,9 @@ struct mlx5_ib_mr *mlx5_mr_cache_alloc(struct mlx5_ib_dev *dev,
 				0 :
 				MLX5_FRMR_POOLS_KEY_ACCESS_MODE_KSM_MASK,
 		.num_dma_blocks = ndescs,
-		.kernel_vendor_key = 0, /* no PH and no ST index */
+		.kernel_vendor_key =
+			FIELD_PREP(MLX5_FRMR_POOLS_KERNEL_KEY_ST_INDEX_MASK,
+				   MLX5_MKC_PCIE_TPH_NO_STEERING_TAG_INDEX),
 	};
 	struct mlx5_ib_mr *mr;
 	int ret;
@@ -557,6 +590,10 @@ static struct mlx5_ib_mr *reg_create(struct ib_pd *pd, struct ib_umem *umem,
 	mr->ibmr.pd = pd;
 	mr->access_flags = access_flags;
 	mr->page_shift = order_base_2(page_size);
+	mr->ibmr.frmr.key.kernel_vendor_key =
+		FIELD_PREP(MLX5_FRMR_POOLS_KERNEL_KEY_ST_INDEX_MASK, st_index) |
+		FIELD_PREP(MLX5_FRMR_POOLS_KERNEL_KEY_PH_MASK,
+			   mlx5_ib_tph_key_ph(ph));
 
 	inlen = MLX5_ST_SZ_BYTES(create_mkey_in);
 	if (populate)
@@ -753,6 +790,12 @@ static struct ib_mr *create_real_mr(struct ib_pd *pd, struct ib_umem *umem,
 		st_index = mdmah->st_index;
 	}
 
+	err = mlx5_ib_get_st_handle_ref(dev, st_index);
+	if (err) {
+		ib_umem_release(umem);
+		return ERR_PTR(err);
+	}
+
 	xlt_with_umr = mlx5r_umr_can_load_pas(dev, umem->length);
 	if (xlt_with_umr) {
 		mr = alloc_cacheable_mr(pd, umem, iova, access_flags,
@@ -769,6 +812,7 @@ static struct ib_mr *create_real_mr(struct ib_pd *pd, struct ib_umem *umem,
 		mutex_unlock(&dev->slow_path_mutex);
 	}
 	if (IS_ERR(mr)) {
+		mlx5_ib_put_st_index_ref(dev, st_index);
 		ib_umem_release(umem);
 		return ERR_CAST(mr);
 	}
@@ -903,6 +947,52 @@ static struct dma_buf_attach_ops mlx5_ib_dmabuf_attach_ops = {
 	.invalidate_mappings = mlx5_ib_dmabuf_invalidate_cb,
 };
 
+static void get_pci_tph_mr_dmabuf(struct mlx5_ib_dev *dev, struct dma_buf *dmabuf,
+				  u16 *st_index, u8 *ph)
+{
+	u16 local_st_index;
+	u16 steering_tag;
+	u8 local_ph;
+	bool extended;
+	int ret;
+
+	switch (pcie_tph_enabled_req_type(dev->mdev->pdev)) {
+	case PCI_TPH_REQ_TPH_ONLY:
+		extended = false;
+		break;
+	case PCI_TPH_REQ_EXT_TPH:
+		extended = true;
+		break;
+	default:
+		return;
+	}
+
+	dma_resv_lock(dmabuf->resv, NULL);
+	ret = dma_buf_get_pci_tph(dmabuf, extended, &steering_tag, &local_ph);
+	dma_resv_unlock(dmabuf->resv);
+	if (ret) {
+		if (ret != -EOPNOTSUPP)
+			mlx5_ib_dbg(dev, "get_pci_tph failed (%d)\n", ret);
+		return;
+	}
+
+	ret = mlx5_st_alloc_index_by_tag(dev->mdev, steering_tag,
+					 &local_st_index);
+	if (ret) {
+		mlx5_ib_dbg(dev, "st_alloc_index_by_tag failed (%d)\n", ret);
+		return;
+	}
+
+	*st_index = local_st_index;
+	*ph = local_ph;
+}
+
+static void mlx5_ib_mr_put_st_handle_ref(struct mlx5_ib_mr *mr)
+{
+	mlx5_ib_put_st_handle_ref(mr_to_mdev(mr),
+				  mr->ibmr.frmr.key.kernel_vendor_key);
+}
+
 static struct ib_mr *
 reg_user_mr_dmabuf(struct ib_pd *pd, struct device *dma_device,
 		   u64 offset, u64 length, u64 virt_addr,
@@ -945,12 +1035,22 @@ reg_user_mr_dmabuf(struct ib_pd *pd, struct device *dma_device,
 		ph = dmah->ph;
 		if (dmah->valid_fields & BIT(IB_DMAH_CPU_ID_EXISTS))
 			st_index = mdmah->st_index;
+
+		err = mlx5_ib_get_st_handle_ref(dev, st_index);
+		if (err) {
+			ib_umem_release(&umem_dmabuf->umem);
+			return ERR_PTR(err);
+		}
+	} else {
+		get_pci_tph_mr_dmabuf(dev, umem_dmabuf->attach->dmabuf,
+				      &st_index, &ph);
 	}
 
 	mr = alloc_cacheable_mr(pd, &umem_dmabuf->umem, virt_addr,
 				access_flags, access_mode, st_index, ph);
 	if (IS_ERR(mr)) {
+		mlx5_ib_put_st_index_ref(dev, st_index);
 		ib_umem_release(&umem_dmabuf->umem);
 		return ERR_CAST(mr);
 	}
@@ -1405,6 +1505,7 @@ static int mlx5r_handle_mkey_cleanup(struct mlx5_ib_mr *mr)
 	if (mr->ibmr.frmr.pool) {
 		if (!mlx5_umr_revoke_mr_with_lock(mr)) {
 			ib_frmr_pool_push(mr->ibmr.device, &mr->ibmr);
+			mlx5_ib_mr_put_st_handle_ref(mr);
 			return 0;
 		}
 	}
@@ -1432,6 +1533,9 @@ static int mlx5r_handle_mkey_cleanup(struct mlx5_ib_mr *mr)
 	if (mr->ibmr.frmr.pool && !ret)
 		ib_frmr_pool_drop(&mr->ibmr);
 
+	if (!ret)
+		mlx5_ib_mr_put_st_handle_ref(mr);
+
 	return ret;
 }
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/st.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/st.c
index 7cedc348790d..877b37b4e639 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/st.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/st.c
@@ -92,23 +92,18 @@ void mlx5_st_destroy(struct mlx5_core_dev *dev)
 	kfree(st);
 }
 
-int mlx5_st_alloc_index(struct mlx5_core_dev *dev, enum tph_mem_type mem_type,
-			unsigned int cpu_uid, u16 *st_index)
+int mlx5_st_alloc_index_by_tag(struct mlx5_core_dev *dev, u16 tag,
+			       u16 *st_index)
 {
 	struct mlx5_st_idx_data *idx_data;
 	struct mlx5_st *st = dev->st;
 	unsigned long index;
 	u32 xa_id;
-	u16 tag;
-	int ret;
+	int ret = 0;
 
 	if (!st)
 		return -EOPNOTSUPP;
 
-	ret = pcie_tph_get_cpu_st(dev->pdev, mem_type, cpu_uid, &tag);
-	if (ret)
-		return ret;
-
 	if (st->direct_mode) {
 		*st_index = tag;
 		return 0;
@@ -152,8 +147,46 @@ int mlx5_st_alloc_index(struct mlx5_core_dev *dev, enum tph_mem_type mem_type,
 	mutex_unlock(&st->lock);
 	return ret;
 }
+EXPORT_SYMBOL_GPL(mlx5_st_alloc_index_by_tag);
+
+int mlx5_st_alloc_index(struct mlx5_core_dev *dev, enum tph_mem_type mem_type,
+			unsigned int cpu_uid, u16 *st_index)
+{
+	u16 tag;
+	int ret;
+
+	ret = pcie_tph_get_cpu_st(dev->pdev, mem_type, cpu_uid, &tag);
+	if (ret)
+		return ret;
+
+	return mlx5_st_alloc_index_by_tag(dev, tag, st_index);
+}
 EXPORT_SYMBOL_GPL(mlx5_st_alloc_index);
 
+int mlx5_st_get_index(struct mlx5_core_dev *dev, u16 st_index)
+{
+	struct mlx5_st_idx_data *idx_data;
+	struct mlx5_st *st = dev->st;
+	int ret = 0;
+
+	if (!st)
+		return -EOPNOTSUPP;
+
+	if (st->direct_mode)
+		return 0;
+
+	mutex_lock(&st->lock);
+	idx_data = xa_load(&st->idx_xa, st_index);
+	if (WARN_ON_ONCE(!idx_data))
+		ret = -EINVAL;
+	else
+		refcount_inc(&idx_data->usecount);
+	mutex_unlock(&st->lock);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(mlx5_st_get_index);
+
 int mlx5_st_dealloc_index(struct mlx5_core_dev *dev, u16 st_index)
 {
 	struct mlx5_st_idx_data *idx_data;
diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
index b1871c0821d0..d26e5744c843 100644
--- a/include/linux/mlx5/driver.h
+++ b/include/linux/mlx5/driver.h
@@ -1182,16 +1182,31 @@ int mlx5_dm_sw_icm_dealloc(struct mlx5_core_dev *dev, enum mlx5_sw_icm_type type
 			   u64 length, u16 uid, phys_addr_t addr, u32 obj_id);
 
 #ifdef CONFIG_PCIE_TPH
+int mlx5_st_alloc_index_by_tag(struct mlx5_core_dev *dev, u16 tag,
+			       u16 *st_index);
+int mlx5_st_get_index(struct mlx5_core_dev *dev, u16 st_index);
 int mlx5_st_alloc_index(struct mlx5_core_dev *dev, enum tph_mem_type mem_type,
 			unsigned int cpu_uid, u16 *st_index);
 int mlx5_st_dealloc_index(struct mlx5_core_dev *dev, u16 st_index);
 #else
+static inline int mlx5_st_alloc_index_by_tag(struct mlx5_core_dev *dev,
+					     u16 tag, u16 *st_index)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline int mlx5_st_get_index(struct mlx5_core_dev *dev, u16 st_index)
+{
+	return -EOPNOTSUPP;
+}
+
 static inline int mlx5_st_alloc_index(struct mlx5_core_dev *dev,
 				      enum tph_mem_type mem_type,
 				      unsigned int cpu_uid, u16 *st_index)
 {
 	return -EOPNOTSUPP;
 }
+
 static inline int mlx5_st_dealloc_index(struct mlx5_core_dev *dev, u16 st_index)
 {
 	return -EOPNOTSUPP;
-- 
2.53.0-Meta
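A small model can check the FRMR kernel_vendor_key layout used above (PH in bits 23:16, ST index in bits 15:0) and the swap of 0 and MLX5_IB_NO_PH in mlx5_ib_tph_key_ph(). The patch does not show the numeric values of MLX5_IB_NO_PH or MLX5_MKC_PCIE_TPH_NO_STEERING_TAG_INDEX. This sketch assumes 0xff and 0 respectively; please check them against mlx5_ib.h and mlx5_ifc.h. GENMASK_ULL()/FIELD_PREP() are replaced by explicit shifts and masks, and all helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* ASSUMED values, for illustration only. */
#define NO_PH       0xffu /* stands in for MLX5_IB_NO_PH */
#define NO_ST_INDEX 0u    /* stands in for MLX5_MKC_PCIE_TPH_NO_STEERING_TAG_INDEX */

/* Field layout from the patch: PH in bits 23:16, ST index in bits 15:0. */
#define KEY_PH_SHIFT 16
#define KEY_PH_MASK  (0xffull << KEY_PH_SHIFT)
#define KEY_ST_MASK  0xffffull

static_assert((KEY_PH_MASK & KEY_ST_MASK) == 0,
	      "PH and ST index fields must not overlap");

/* Swap 0 and NO_PH, so a PH field of 0 in the key means "no PH". */
static uint8_t tph_key_ph(uint8_t ph)
{
	if (ph == NO_PH || ph == 0)
		ph ^= NO_PH;
	return ph;
}

/* Equivalent of the two FIELD_PREP() terms ORed together in the patch. */
static uint64_t make_vendor_key(uint16_t st_index, uint8_t ph)
{
	return ((uint64_t)st_index & KEY_ST_MASK) |
	       (((uint64_t)tph_key_ph(ph) << KEY_PH_SHIFT) & KEY_PH_MASK);
}

/* What mlx5_ib_put_st_handle_ref() recovers with FIELD_GET(). */
static uint16_t key_st_index(uint64_t key)
{
	return (uint16_t)(key & KEY_ST_MASK);
}
```

Under these assumed values, a key for "no ST index, no PH" is all zeroes. A real PH of 0 remains distinguishable because it is stored as 0xff. Since the ST index stays recoverable from the key, the MR release paths can drop the ST-table reference without any extra per-MR field.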
* Re: [PATCH v10 4/4] RDMA/mlx5: get tph for p2p access when registering dma-buf mr
2026-06-30 22:42 ` [PATCH v10 4/4] RDMA/mlx5: get tph for p2p access when registering dma-buf mr Zhiping Zhang
@ 2026-06-30 23:11 ` sashiko-bot
0 siblings, 0 replies; 14+ messages in thread
From: sashiko-bot @ 2026-06-30 23:11 UTC
To: Zhiping Zhang; +Cc: linux-pci, dri-devel, kvm

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:

Pre-existing issues:
- [High] Memory leak of `struct mlx5_st_idx_data` in `mlx5_st_dealloc_index()` when its refcount reaches 0.
--

Note: Due to system recitation restrictions, this review report cannot quote
the original patch diff inline. The findings are provided as a free-form
summary below.

commit 9ada3f634918f1f6bc02c86bed9350c775185dbc
Author: Zhiping Zhang <zhipingz@meta.com>
Subject: RDMA/mlx5: get tph for p2p access when registering dma-buf mr

This commit wires up mlx5_ib to query dma-buf TPH metadata at MR registration
time to program requester-side TPH on the outbound mkey. It translates
exporter-provided steering tags into local ST entries and ties the extra
ST-table reference to the MR lifetime.

[Severity: High]
This is a pre-existing issue, but I noticed something while reviewing the
surrounding code in drivers/net/ethernet/mellanox/mlx5/core/lib/st.c.

In the mlx5_st_dealloc_index() function, when the usecount refcount drops to
zero, the code calls xa_erase() to remove the element from the xarray.
However, it appears the dynamically allocated memory for the idx_data
structure itself is never freed.

Could this lead to an unbounded memory leak when a user-space application
repeatedly registers and deregisters a dma-buf MR with a TPH tag?

During registration, mlx5_st_alloc_index_by_tag() allocates a new struct
mlx5_st_idx_data via kzalloc_obj(). If deregistration drops the reference
count to zero and removes the pointer from the xarray but fails to free the
structure, could this be triggered in a loop by an unprivileged user-space
process to deplete kernel memory?

Would it make sense to add a kfree() for the idx_data structure immediately
after the xa_erase() call in the refcount_dec_and_test() block?

--
Sashiko AI review · https://sashiko.dev/#/patchset/20260630224328.3218796-1-zhipingz@meta.com?part=4
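The ST-table lifecycle the review is concerned with can be modeled as follows. Alloc-by-tag or get takes a reference, dealloc drops one, and the last put must both unlink and free the entry. This is a simplified userspace sketch, not st.c: a fixed array stands in for the xarray, and there is no locking and no direct mode. Reusing an existing entry with the same tag is an assumption about mlx5_st_alloc_index_by_tag() in table mode. All names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_ENTRIES 8 /* hypothetical table size */

/* Hypothetical stand-in for struct mlx5_st_idx_data. */
struct st_entry {
	uint16_t tag;
	unsigned int usecount;
};

/* Fixed array standing in for st->idx_xa. */
struct st_table {
	struct st_entry *slot[MAX_ENTRIES];
};

/* Like mlx5_st_alloc_index_by_tag(): reuse or allocate, take one ref. */
static int st_alloc_by_tag(struct st_table *t, uint16_t tag, uint16_t *index)
{
	int free_slot = -1;

	for (int i = 0; i < MAX_ENTRIES; i++) {
		if (t->slot[i] && t->slot[i]->tag == tag) {
			t->slot[i]->usecount++;
			*index = (uint16_t)i;
			return 0;
		}
		if (!t->slot[i] && free_slot < 0)
			free_slot = i;
	}
	if (free_slot < 0)
		return -ENOSPC;

	struct st_entry *e = malloc(sizeof(*e));
	if (!e)
		return -ENOMEM;
	e->tag = tag;
	e->usecount = 1;
	t->slot[free_slot] = e;
	*index = (uint16_t)free_slot;
	return 0;
}

/* Like mlx5_st_get_index(): take an extra reference on a live index. */
static int st_get(struct st_table *t, uint16_t index)
{
	if (index >= MAX_ENTRIES || !t->slot[index])
		return -EINVAL;
	t->slot[index]->usecount++;
	return 0;
}

/* Like mlx5_st_dealloc_index(): on the last put, unlink AND free. */
static int st_put(struct st_table *t, uint16_t index)
{
	struct st_entry *e;

	if (index >= MAX_ENTRIES || !(e = t->slot[index]))
		return -EINVAL;
	assert(e->usecount > 0);
	if (--e->usecount == 0) {
		t->slot[index] = NULL; /* models xa_erase() */
		free(e);               /* the step the review reports as missing */
	}
	return 0;
}
```

In this model, removing free(e) from st_put() would leak one entry for every register/deregister cycle of a TPH-tagged dma-buf MR. That is the pattern the review describes.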
Thread overview: 14+ messages
2026-06-30 22:42 [PATCH v10 0/4] vfio/dma-buf: add TPH support for peer-to-peer access Zhiping Zhang
2026-06-30 22:42 ` [PATCH v10 1/4] PCI/TPH: Add requester/completer type helpers Zhiping Zhang
2026-06-30 23:07   ` sashiko-bot
2026-06-30 22:42 ` [PATCH v10 2/4] dma-buf: add optional get_pci_tph() callback Zhiping Zhang
2026-06-30 23:01   ` sashiko-bot
2026-07-01  8:25   ` Christian König
2026-07-01 17:53     ` Zhiping Zhang
2026-07-02  7:06       ` Christian König
2026-06-30 22:42 ` [PATCH v10 3/4] vfio/pci: implement get_pci_tph and DMA_BUF_TPH feature Zhiping Zhang
2026-06-30 23:08   ` sashiko-bot
2026-07-01 18:07   ` Alex Williamson
2026-07-01 21:07     ` Zhiping Zhang
2026-06-30 22:42 ` [PATCH v10 4/4] RDMA/mlx5: get tph for p2p access when registering dma-buf mr Zhiping Zhang
2026-06-30 23:11   ` sashiko-bot