Linux PCI subsystem development
* [PATCH v8 0/7] vfio/pci: Add PCIe TPH support
@ 2026-05-08  6:40 Chengwen Feng
  2026-05-08  6:40 ` [PATCH v8 1/7] PCI/TPH: Fix pcie_tph_get_st_table_loc() field extraction Chengwen Feng
                   ` (6 more replies)
  0 siblings, 7 replies; 16+ messages in thread
From: Chengwen Feng @ 2026-05-08  6:40 UTC (permalink / raw)
  To: alex, jgg
  Cc: wathsala.vithanage, helgaas, wei.huang2, wangzhou1, wangyushan12,
	liuyonglong, kvm, linux-pci

This patchset enables userspace control over PCIe TPH steering tags,
motivated by the following considerations:

1. Why userspace needs the capability to control steering tags:
   When PCIe devices are fully owned by userspace workloads such as DPDK
   and SPDK, only userspace has full knowledge of core binding policies
   and traffic distribution strategies. Without this series, userspace
   cannot enable TPH or configure steering tags, leaving built-in PCIe
   performance optimizations unused in high-throughput polling I/O
   scenarios.

2. Why this interface must be implemented in VFIO:
   VFIO is the standard, secure community solution for granting full
   PCIe device ownership to userspace. Existing kernel TPH interfaces
   are designed purely for in-kernel drivers. For user-owned devices,
   VFIO provides the only isolated and correct path to expose per-device
   TPH management.

TPH supports both interrupt-vector (IV) and device-specific (DS) modes.
DS mode introduces cross-VM isolation risks, such as untrusted guests
programming arbitrary steering tags that impact other domains, so a new
module parameter `enable_unsafe_tph_ds_mode` is added. It defaults to off
and, while disabled, blocks all unsafe DS-mode TPH operations.

To restrict abuse of SET_ST and prevent arbitrary steering tag programming
from userspace, the interface only accepts explicit CPU ID, memory type
and index inputs. The kernel resolves the corresponding steering tag
internally before programming, limiting userspace to controlled,
index-based configuration.

Based on earlier RFC work by Wathsala Vithanage.

v8:
- Make the GET_ST op retrieve the CPU's steering tags for DS mode.
  Note: the original implementation only allowed this for DS mode with
  no ST table. The background is that we found a NIC that defines an ST
  table in DS mode but still needs its steering tags configured in a
  device-specific way.
- Verify the index in SET_ST.
- Fix Sashiko review comments:
  1. Add a commit fixing pcie_tph_get_st_table_size() for the MSI-X
     table location
  2. Add argsz validation when copying st entries in GET_ST/SET_ST
  3. Verify mem-type in SET_ST when cpu=U32_MAX
v7:
- Address Bjorn's comment on [1/6] commit.
- Don't report DS mode by default (enable_unsafe_tph_ds_mode=0)
- Fix Sashiko review comments:
  1. pcie_tph_get_st_table_loc()'s stub returns 0
  2. TPH ioctl argsz validation used offsetofend incorrectly
  3. Disable TPH when the device is taken over by userspace and when
     userspace closes it
  4. Serialize all TPH operations under vdev->igate to prevent hardware
     control and bitfield races.
  5. Check unused ioctl field to be zero.
v6:
- Address Alex's comment on [1/6] commit.
- Fix Sashiko review comments:
  Add tph_cap validation for pcie_tph_get_st_modes/st_table_loc.
  Add argsz validation for each op cmd.
  Move disable tph from ioctl-reset to register.
  Verify reserved field for get/set ST op.
  Fix ABI mismatch due to pointer arithmetic in the get/set ST ops.

Chengwen Feng (7):
  PCI/TPH: Fix pcie_tph_get_st_table_loc() field extraction
  PCI/TPH: Export pcie_tph_get_st_modes() for external use
  PCI/TPH: Fix pcie_tph_get_st_table_size() for MSI-X table location
  vfio/pci: Add PCIe TPH interface with capability query
  vfio/pci: Add PCIe TPH enable/disable support
  vfio/pci: Add PCIe TPH GET_ST interface
  vfio/pci: Add PCIe TPH SET_ST interface

 drivers/pci/tph.c                |  31 ++--
 drivers/vfio/pci/vfio_pci.c      |  13 +-
 drivers/vfio/pci/vfio_pci_core.c | 270 ++++++++++++++++++++++++++++++-
 include/linux/pci-tph.h          |   7 +
 include/linux/vfio_pci_core.h    |   3 +-
 include/uapi/linux/vfio.h        | 133 +++++++++++++++
 6 files changed, 444 insertions(+), 13 deletions(-)

-- 
2.17.1



* [PATCH v8 1/7] PCI/TPH: Fix pcie_tph_get_st_table_loc() field extraction
  2026-05-08  6:40 [PATCH v8 0/7] vfio/pci: Add PCIe TPH support Chengwen Feng
@ 2026-05-08  6:40 ` Chengwen Feng
  2026-05-08  6:40 ` [PATCH v8 2/7] PCI/TPH: Export pcie_tph_get_st_modes() for external use Chengwen Feng
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 16+ messages in thread
From: Chengwen Feng @ 2026-05-08  6:40 UTC (permalink / raw)
  To: alex, jgg
  Cc: wathsala.vithanage, helgaas, wei.huang2, wangzhou1, wangyushan12,
	liuyonglong, kvm, linux-pci

pcie_tph_get_st_table_loc() incorrectly uses FIELD_GET(), which shifts the
field value to bit 0. But the function is designed to return raw
PCI_TPH_LOC_* values as defined in the function comment.

This causes incorrect ST table location detection. Fix it by using bitwise
AND with PCI_TPH_CAP_LOC_MASK to return the unshifted field value matching
the function specification.

This doesn't make a difference to mlx5_st_create(), the lone external
caller, because it only checks for PCI_TPH_LOC_NONE (0), but will be needed
for callers that check for PCI_TPH_LOC_CAP or PCI_TPH_LOC_MSIX.

Fixes: d2e8a34876ce ("PCI/TPH: Add Steering Tag support")
Cc: stable@vger.kernel.org
Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
Reviewed-by: Alex Williamson <alex.williamson@nvidia.com>
Reviewed-by: Bjorn Helgaas <bhelgaas@google.com>
---
 drivers/pci/tph.c | 7 +------
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/drivers/pci/tph.c b/drivers/pci/tph.c
index 91145e8d9d95..877cf556242b 100644
--- a/drivers/pci/tph.c
+++ b/drivers/pci/tph.c
@@ -170,7 +170,7 @@ u32 pcie_tph_get_st_table_loc(struct pci_dev *pdev)
 
 	pci_read_config_dword(pdev, pdev->tph_cap + PCI_TPH_CAP, &reg);
 
-	return FIELD_GET(PCI_TPH_CAP_LOC_MASK, reg);
+	return reg & PCI_TPH_CAP_LOC_MASK;
 }
 EXPORT_SYMBOL(pcie_tph_get_st_table_loc);
 
@@ -185,9 +185,6 @@ u16 pcie_tph_get_st_table_size(struct pci_dev *pdev)
 
 	/* Check ST table location first */
 	loc = pcie_tph_get_st_table_loc(pdev);
-
-	/* Convert loc to match with PCI_TPH_LOC_* defined in pci_regs.h */
-	loc = FIELD_PREP(PCI_TPH_CAP_LOC_MASK, loc);
 	if (loc != PCI_TPH_LOC_CAP)
 		return 0;
 
@@ -316,8 +313,6 @@ int pcie_tph_set_st_entry(struct pci_dev *pdev, unsigned int index, u16 tag)
 	set_ctrl_reg_req_en(pdev, PCI_TPH_REQ_DISABLE);
 
 	loc = pcie_tph_get_st_table_loc(pdev);
-	/* Convert loc to match with PCI_TPH_LOC_* */
-	loc = FIELD_PREP(PCI_TPH_CAP_LOC_MASK, loc);
 
 	switch (loc) {
 	case PCI_TPH_LOC_MSIX:
-- 
2.17.1



* [PATCH v8 2/7] PCI/TPH: Export pcie_tph_get_st_modes() for external use
  2026-05-08  6:40 [PATCH v8 0/7] vfio/pci: Add PCIe TPH support Chengwen Feng
  2026-05-08  6:40 ` [PATCH v8 1/7] PCI/TPH: Fix pcie_tph_get_st_table_loc() field extraction Chengwen Feng
@ 2026-05-08  6:40 ` Chengwen Feng
  2026-05-08 19:02   ` sashiko-bot
  2026-05-08  6:40 ` [PATCH v8 3/7] PCI/TPH: Fix pcie_tph_get_st_table_size() for MSI-X table location Chengwen Feng
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 16+ messages in thread
From: Chengwen Feng @ 2026-05-08  6:40 UTC (permalink / raw)
  To: alex, jgg
  Cc: wathsala.vithanage, helgaas, wei.huang2, wangzhou1, wangyushan12,
	liuyonglong, kvm, linux-pci

Export the helper to retrieve supported PCIe TPH steering tag modes so
that drivers like VFIO can query and expose device capabilities to
userspace.

Add stub functions for pcie_tph_get_st_table_size() and
pcie_tph_get_st_table_loc() when !CONFIG_PCIE_TPH.

Add tph_cap validation for pcie_tph_get_st_modes() and
pcie_tph_get_st_table_loc() to prevent invalid PCI configuration
space access when TPH is not supported.

Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
---
 drivers/pci/tph.c       | 19 +++++++++++++++++--
 include/linux/pci-tph.h |  7 +++++++
 2 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/tph.c b/drivers/pci/tph.c
index 877cf556242b..ba31b010f67a 100644
--- a/drivers/pci/tph.c
+++ b/drivers/pci/tph.c
@@ -145,15 +145,27 @@ static void set_ctrl_reg_req_en(struct pci_dev *pdev, u8 req_type)
 	pci_write_config_dword(pdev, pdev->tph_cap + PCI_TPH_CTRL, reg);
 }
 
-static u8 get_st_modes(struct pci_dev *pdev)
+/**
+ * pcie_tph_get_st_modes - Get supported Steering Tag modes
+ * @pdev: PCI device to query
+ *
+ * Return:
+ *  Bitmask of supported ST modes (PCI_TPH_CAP_ST_NS, PCI_TPH_CAP_ST_IV,
+ *                                 PCI_TPH_CAP_ST_DS)
+ */
+u8 pcie_tph_get_st_modes(struct pci_dev *pdev)
 {
 	u32 reg;
 
+	if (!pdev->tph_cap)
+		return 0;
+
 	pci_read_config_dword(pdev, pdev->tph_cap + PCI_TPH_CAP, &reg);
 	reg &= PCI_TPH_CAP_ST_NS | PCI_TPH_CAP_ST_IV | PCI_TPH_CAP_ST_DS;
 
 	return reg;
 }
+EXPORT_SYMBOL(pcie_tph_get_st_modes);
 
 /**
  * pcie_tph_get_st_table_loc - Return the device's ST table location
@@ -168,6 +180,9 @@ u32 pcie_tph_get_st_table_loc(struct pci_dev *pdev)
 {
 	u32 reg;
 
+	if (!pdev->tph_cap)
+		return PCI_TPH_LOC_NONE;
+
 	pci_read_config_dword(pdev, pdev->tph_cap + PCI_TPH_CAP, &reg);
 
 	return reg & PCI_TPH_CAP_LOC_MASK;
@@ -395,7 +410,7 @@ int pcie_enable_tph(struct pci_dev *pdev, int mode)
 
 	/* Sanitize and check ST mode compatibility */
 	mode &= PCI_TPH_CTRL_MODE_SEL_MASK;
-	dev_modes = get_st_modes(pdev);
+	dev_modes = pcie_tph_get_st_modes(pdev);
 	if (!((1 << mode) & dev_modes))
 		return -EINVAL;
 
diff --git a/include/linux/pci-tph.h b/include/linux/pci-tph.h
index be68cd17f2f8..5772d48ea444 100644
--- a/include/linux/pci-tph.h
+++ b/include/linux/pci-tph.h
@@ -30,6 +30,7 @@ void pcie_disable_tph(struct pci_dev *pdev);
 int pcie_enable_tph(struct pci_dev *pdev, int mode);
 u16 pcie_tph_get_st_table_size(struct pci_dev *pdev);
 u32 pcie_tph_get_st_table_loc(struct pci_dev *pdev);
+u8 pcie_tph_get_st_modes(struct pci_dev *pdev);
 #else
 static inline int pcie_tph_set_st_entry(struct pci_dev *pdev,
 					unsigned int index, u16 tag)
@@ -41,6 +42,12 @@ static inline int pcie_tph_get_cpu_st(struct pci_dev *dev,
 static inline void pcie_disable_tph(struct pci_dev *pdev) { }
 static inline int pcie_enable_tph(struct pci_dev *pdev, int mode)
 { return -EINVAL; }
+static inline u16 pcie_tph_get_st_table_size(struct pci_dev *pdev)
+{ return 0; }
+static inline u32 pcie_tph_get_st_table_loc(struct pci_dev *pdev)
+{ return 0; }
+static inline u8 pcie_tph_get_st_modes(struct pci_dev *pdev)
+{ return 0; }
 #endif
 
 #endif /* LINUX_PCI_TPH_H */
-- 
2.17.1



* [PATCH v8 3/7] PCI/TPH: Fix pcie_tph_get_st_table_size() for MSI-X table location
  2026-05-08  6:40 [PATCH v8 0/7] vfio/pci: Add PCIe TPH support Chengwen Feng
  2026-05-08  6:40 ` [PATCH v8 1/7] PCI/TPH: Fix pcie_tph_get_st_table_loc() field extraction Chengwen Feng
  2026-05-08  6:40 ` [PATCH v8 2/7] PCI/TPH: Export pcie_tph_get_st_modes() for external use Chengwen Feng
@ 2026-05-08  6:40 ` Chengwen Feng
  2026-05-08 19:31   ` sashiko-bot
  2026-05-08  6:40 ` [PATCH v8 4/7] vfio/pci: Add PCIe TPH interface with capability query Chengwen Feng
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 16+ messages in thread
From: Chengwen Feng @ 2026-05-08  6:40 UTC (permalink / raw)
  To: alex, jgg
  Cc: wathsala.vithanage, helgaas, wei.huang2, wangzhou1, wangyushan12,
	liuyonglong, kvm, linux-pci

pcie_tph_get_st_table_size() previously returned a valid size only when
the ST Table is located in the TPH Requester Extended Capability
structure. According to the PCIe spec [1], the ST Table Size field is also
valid when the ST Table is located in the MSI-X Table.

Fix it to return valid table size for both PCI_TPH_LOC_CAP and
PCI_TPH_LOC_MSIX locations.

[1] PCI Express Base 6.1 Table 7-258 TPH Requester Capability Register
ST Table Size:
- Value indicates the maximum number of ST Table entries the Function may
  use. Software reads this field to determine the ST Table Size N, which is
  encoded as N-1. For example, a returned value of 000 0000 0011b indicates
  a table size of four entries.
- There is an upper limit of 64 entries when the ST Table is located in the
  TPH Requester Extended Capability structure.
- When the ST Table is located in the MSI-X Table, this value is limited by
  the size of the MSI-X Table.
- This field is only applicable for Functions that implement an ST Table as
  indicated by the ST Table Location field. Otherwise, the value in this
  field is undefined.

Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
---
 drivers/pci/tph.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/tph.c b/drivers/pci/tph.c
index ba31b010f67a..de5bd7039cdc 100644
--- a/drivers/pci/tph.c
+++ b/drivers/pci/tph.c
@@ -191,7 +191,8 @@ EXPORT_SYMBOL(pcie_tph_get_st_table_loc);
 
 /*
  * Return the size of ST table. If ST table is not in TPH Requester Extended
- * Capability space, return 0. Otherwise return the ST Table Size + 1.
+ * Capability space or MSI-X table, return 0. Otherwise return the
+ * ST Table Size + 1.
  */
 u16 pcie_tph_get_st_table_size(struct pci_dev *pdev)
 {
@@ -200,7 +201,7 @@ u16 pcie_tph_get_st_table_size(struct pci_dev *pdev)
 
 	/* Check ST table location first */
 	loc = pcie_tph_get_st_table_loc(pdev);
-	if (loc != PCI_TPH_LOC_CAP)
+	if (loc != PCI_TPH_LOC_CAP && loc != PCI_TPH_LOC_MSIX)
 		return 0;
 
 	pci_read_config_dword(pdev, pdev->tph_cap + PCI_TPH_CAP, &reg);
-- 
2.17.1



* [PATCH v8 4/7] vfio/pci: Add PCIe TPH interface with capability query
  2026-05-08  6:40 [PATCH v8 0/7] vfio/pci: Add PCIe TPH support Chengwen Feng
                   ` (2 preceding siblings ...)
  2026-05-08  6:40 ` [PATCH v8 3/7] PCI/TPH: Fix pcie_tph_get_st_table_size() for MSI-X table location Chengwen Feng
@ 2026-05-08  6:40 ` Chengwen Feng
  2026-05-08 20:03   ` sashiko-bot
  2026-05-08 22:40   ` Alex Williamson
  2026-05-08  6:40 ` [PATCH v8 5/7] vfio/pci: Add PCIe TPH enable/disable support Chengwen Feng
                   ` (2 subsequent siblings)
  6 siblings, 2 replies; 16+ messages in thread
From: Chengwen Feng @ 2026-05-08  6:40 UTC (permalink / raw)
  To: alex, jgg
  Cc: wathsala.vithanage, helgaas, wei.huang2, wangzhou1, wangyushan12,
	liuyonglong, kvm, linux-pci

Add VFIO_DEVICE_PCI_TPH IOCTL to allow userspace to query device TPH
capabilities, supported modes, and steering tag table information.

Add module parameter 'enable_unsafe_tph_ds_mode' to restrict unsafe
device-specific TPH mode to trusted userspace only.

Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
---
 drivers/vfio/pci/vfio_pci.c      |  13 ++-
 drivers/vfio/pci/vfio_pci_core.c |  56 ++++++++++++-
 include/linux/vfio_pci_core.h    |   3 +-
 include/uapi/linux/vfio.h        | 133 +++++++++++++++++++++++++++++++
 4 files changed, 202 insertions(+), 3 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 0c771064c0b8..40bf5aa9fd0b 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -60,6 +60,12 @@ static bool disable_denylist;
 module_param(disable_denylist, bool, 0444);
 MODULE_PARM_DESC(disable_denylist, "Disable use of device denylist. Disabling the denylist allows binding to devices with known errata that may lead to exploitable stability or security issues when accessed by untrusted users.");
 
+#ifdef CONFIG_PCIE_TPH
+static bool enable_unsafe_tph_ds_mode;
+module_param(enable_unsafe_tph_ds_mode, bool, 0444);
+MODULE_PARM_DESC(enable_unsafe_tph_ds_mode, "Enable UNSAFE TPH device-specific (DS) mode. This mode provides weak isolation, cannot be safely used for virtual machines. If you do not know what this is for, step away. (default: false)");
+#endif
+
 static bool vfio_pci_dev_in_denylist(struct pci_dev *pdev)
 {
 	switch (pdev->vendor) {
@@ -257,12 +263,17 @@ static int __init vfio_pci_init(void)
 {
 	int ret;
 	bool is_disable_vga = true;
+	bool is_enable_unsafe_tph_ds_mode = false;
 
 #ifdef CONFIG_VFIO_PCI_VGA
 	is_disable_vga = disable_vga;
 #endif
+#ifdef CONFIG_PCIE_TPH
+	is_enable_unsafe_tph_ds_mode = enable_unsafe_tph_ds_mode;
+#endif
 
-	vfio_pci_core_set_params(nointxmask, is_disable_vga, disable_idle_d3);
+	vfio_pci_core_set_params(nointxmask, is_disable_vga, disable_idle_d3,
+				 is_enable_unsafe_tph_ds_mode);
 
 	/* Register and scan for devices */
 	ret = pci_register_driver(&vfio_pci_driver);
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 3f8d093aacf8..0e97b128fd63 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -29,6 +29,7 @@
 #include <linux/sched/mm.h>
 #include <linux/iommufd.h>
 #include <linux/pci-p2pdma.h>
+#include <linux/pci-tph.h>
 #if IS_ENABLED(CONFIG_EEH)
 #include <asm/eeh.h>
 #endif
@@ -41,6 +42,7 @@
 static bool nointxmask;
 static bool disable_vga;
 static bool disable_idle_d3;
+static bool enable_unsafe_tph_ds_mode;
 
 static void vfio_pci_eventfd_rcu_free(struct rcu_head *rcu)
 {
@@ -1461,6 +1463,54 @@ static int vfio_pci_ioctl_ioeventfd(struct vfio_pci_core_device *vdev,
 				  ioeventfd.fd);
 }
 
+static int vfio_pci_tph_get_cap(struct vfio_pci_core_device *vdev,
+				struct vfio_device_pci_tph_op *op,
+				void __user *uarg)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	struct vfio_pci_tph_cap cap = {0};
+	u8 mode;
+
+	if (op->argsz < offsetofend(struct vfio_device_pci_tph_op, cap))
+		return -EINVAL;
+
+	mode = pcie_tph_get_st_modes(pdev);
+	/* Hide unsafe device-specific (DS) mode by default */
+	if (!enable_unsafe_tph_ds_mode)
+		mode &= ~PCI_TPH_CAP_ST_DS;
+	if (mode == 0 || mode == PCI_TPH_CAP_ST_NS)
+		return -EOPNOTSUPP;
+
+	if (mode & PCI_TPH_CAP_ST_IV)
+		cap.supported_modes |= VFIO_PCI_TPH_MODE_IV;
+	if (mode & PCI_TPH_CAP_ST_DS)
+		cap.supported_modes |= VFIO_PCI_TPH_MODE_DS;
+	cap.st_table_sz = pcie_tph_get_st_table_size(pdev);
+
+	if (copy_to_user(uarg, &cap, sizeof(cap)))
+		return -EFAULT;
+
+	return 0;
+}
+
+static int vfio_pci_ioctl_tph(struct vfio_pci_core_device *vdev,
+			      void __user *uarg)
+{
+	struct vfio_device_pci_tph_op op = {0};
+	size_t minsz = sizeof(op.argsz) + sizeof(op.op);
+
+	if (copy_from_user(&op, uarg, minsz))
+		return -EFAULT;
+
+	switch (op.op) {
+	case VFIO_PCI_TPH_GET_CAP:
+		return vfio_pci_tph_get_cap(vdev, &op, uarg + minsz);
+	default:
+		/* Other ops are not implemented yet */
+		return -EINVAL;
+	}
+}
+
 long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
 			 unsigned long arg)
 {
@@ -1483,6 +1533,8 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
 		return vfio_pci_ioctl_reset(vdev, uarg);
 	case VFIO_DEVICE_SET_IRQS:
 		return vfio_pci_ioctl_set_irqs(vdev, uarg);
+	case VFIO_DEVICE_PCI_TPH:
+		return vfio_pci_ioctl_tph(vdev, uarg);
 	default:
 		return -ENOTTY;
 	}
@@ -2570,11 +2622,13 @@ static void vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set)
 }
 
 void vfio_pci_core_set_params(bool is_nointxmask, bool is_disable_vga,
-			      bool is_disable_idle_d3)
+			      bool is_disable_idle_d3,
+			      bool is_enable_unsafe_tph_ds_mode)
 {
 	nointxmask = is_nointxmask;
 	disable_vga = is_disable_vga;
 	disable_idle_d3 = is_disable_idle_d3;
+	enable_unsafe_tph_ds_mode = is_enable_unsafe_tph_ds_mode;
 }
 EXPORT_SYMBOL_GPL(vfio_pci_core_set_params);
 
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index 2ebba746c18f..5af2a2e04ca7 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -157,7 +157,8 @@ int vfio_pci_core_register_dev_region(struct vfio_pci_core_device *vdev,
 				      const struct vfio_pci_regops *ops,
 				      size_t size, u32 flags, void *data);
 void vfio_pci_core_set_params(bool nointxmask, bool is_disable_vga,
-			      bool is_disable_idle_d3);
+			      bool is_disable_idle_d3,
+			      bool is_enable_unsafe_tph_ds_mode);
 void vfio_pci_core_close_device(struct vfio_device *core_vdev);
 int vfio_pci_core_init_dev(struct vfio_device *core_vdev);
 void vfio_pci_core_release_dev(struct vfio_device *core_vdev);
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 5de618a3a5ee..81da2bd0c21b 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -1321,6 +1321,139 @@ struct vfio_precopy_info {
 
 #define VFIO_MIG_GET_PRECOPY_INFO _IO(VFIO_TYPE, VFIO_BASE + 21)
 
+/**
+ * struct vfio_pci_tph_cap - PCIe TPH capability information
+ * @supported_modes: Supported TPH operating modes
+ * @st_table_sz: Number of entries in ST table; 0 means no ST table
+ * @reserved: Must be zero
+ *
+ * Used with VFIO_PCI_TPH_GET_CAP operation to return device
+ * TLP Processing Hints (TPH) capabilities to userspace.
+ */
+struct vfio_pci_tph_cap {
+	__u8  supported_modes;
+#define VFIO_PCI_TPH_MODE_IV	(1u << 0) /* Interrupt vector */
+#define VFIO_PCI_TPH_MODE_DS	(1u << 1) /* Device specific */
+	__u8  reserved0;
+	__u16 st_table_sz;
+	__u32 reserved;
+};
+
+/**
+ * struct vfio_pci_tph_ctrl - TPH enable control structure
+ * @mode: Selected TPH operating mode (VFIO_PCI_TPH_MODE_*)
+ * @reserved: Must be zero
+ *
+ * Used with VFIO_PCI_TPH_ENABLE operation to specify the
+ * operating mode when enabling TPH on the device.
+ */
+struct vfio_pci_tph_ctrl {
+	__u8 mode;
+	__u8 reserved[7];
+};
+
+/**
+ * struct vfio_pci_tph_entry - Single TPH steering tag entry
+ * @cpu: CPU identifier for steering tag calculation
+ * @mem_type: Memory type (VFIO_PCI_TPH_MEM_TYPE_*)
+ * @reserved0: Must be zero
+ * @index: ST table index for programming
+ * @st: Unused for SET_ST
+ * @reserved1: Must be zero
+ *
+ * For VFIO_PCI_TPH_GET_ST:
+ *   Userspace sets @cpu and @mem_type; kernel returns @st.
+ *
+ * For VFIO_PCI_TPH_SET_ST:
+ *   Userspace sets @index, @cpu, and @mem_type.
+ *   Kernel internally computes the steering tag and programs
+ *   it into the specified @index.
+ *
+ *   If @cpu == U32_MAX, kernel clears the steering tag at
+ *   the specified @index.
+ */
+struct vfio_pci_tph_entry {
+	__u32 cpu;
+	__u8  mem_type;
+#define VFIO_PCI_TPH_MEM_TYPE_VM	0
+#define VFIO_PCI_TPH_MEM_TYPE_PM	1
+	__u8  reserved0;
+	__u16 index;
+	__u16 st;
+	__u16 reserved1;
+};
+
+/**
+ * struct vfio_pci_tph_st - Batch steering tag request
+ * @count: Number of entries in the array
+ * @reserved: Must be zero
+ * @ents: Flexible array of steering tag entries
+ *
+ * Container structure for batch get/set operations.
+ * Used with both VFIO_PCI_TPH_GET_ST and VFIO_PCI_TPH_SET_ST.
+ */
+struct vfio_pci_tph_st {
+	__u32 count;
+	__u32 reserved;
+	struct vfio_pci_tph_entry ents[];
+#define VFIO_PCI_TPH_MAX_ENTRIES    2048
+};
+
+/**
+ * struct vfio_device_pci_tph_op - Argument for VFIO_DEVICE_PCI_TPH
+ * @argsz: User allocated size of this structure
+ * @op: TPH operation (VFIO_PCI_TPH_*)
+ * @cap: Capability data for GET_CAP
+ * @ctrl: Control data for ENABLE
+ * @st: Batch entry data for GET_ST/SET_ST
+ *
+ * @argsz must be set by the user to the size of the structure
+ * being executed. Kernel validates input and returns data
+ * only within the specified size.
+ *
+ * Operations:
+ * - VFIO_PCI_TPH_GET_CAP: Query device TPH capabilities.
+ * - VFIO_PCI_TPH_ENABLE:  Enable TPH using mode from &ctrl.
+ * - VFIO_PCI_TPH_DISABLE: Disable TPH on the device.
+ * - VFIO_PCI_TPH_GET_ST:  Retrieve CPU steering tags for Device-Specific (DS)
+ *                         mode. Used when device requires SW to obtain ST
+ *                         values for programming.
+ * - VFIO_PCI_TPH_SET_ST:  Program steering tag entries into device ST table.
+ *                         Valid when ST table resides in TPH Requester
+ *                         Capability or MSI-X Table.
+ *                         If any entry fails, all programmed entries are rolled
+ *                         back to 0 before returning error.
+ */
+struct vfio_device_pci_tph_op {
+	__u32 argsz;
+	__u32 op;
+#define VFIO_PCI_TPH_GET_CAP	0
+#define VFIO_PCI_TPH_ENABLE	1
+#define VFIO_PCI_TPH_DISABLE	2
+#define VFIO_PCI_TPH_GET_ST	3
+#define VFIO_PCI_TPH_SET_ST	4
+	union {
+		struct vfio_pci_tph_cap cap;
+		struct vfio_pci_tph_ctrl ctrl;
+		struct vfio_pci_tph_st st;
+	};
+};
+
+/**
+ * VFIO_DEVICE_PCI_TPH - _IO(VFIO_TYPE, VFIO_BASE + 22)
+ *
+ * IOCTL for managing PCIe TLP Processing Hints (TPH) on
+ * a VFIO-assigned PCI device. Provides operations to query
+ * device capabilities, enable/disable TPH, retrieve CPU's
+ * steering tags, and program steering tag tables.
+ *
+ * Return: 0 on success, negative errno on failure.
+ *         -EOPNOTSUPP: Operation not supported
+ *         -ENODEV: Device or required functionality not present
+ *         -EINVAL: Invalid argument or TPH not supported
+ */
+#define VFIO_DEVICE_PCI_TPH	_IO(VFIO_TYPE, VFIO_BASE + 22)
+
 /*
  * Upon VFIO_DEVICE_FEATURE_SET, allow the device to be moved into a low power
  * state with the platform-based power management.  Device use of lower power
-- 
2.17.1



* [PATCH v8 5/7] vfio/pci: Add PCIe TPH enable/disable support
  2026-05-08  6:40 [PATCH v8 0/7] vfio/pci: Add PCIe TPH support Chengwen Feng
                   ` (3 preceding siblings ...)
  2026-05-08  6:40 ` [PATCH v8 4/7] vfio/pci: Add PCIe TPH interface with capability query Chengwen Feng
@ 2026-05-08  6:40 ` Chengwen Feng
  2026-05-08 20:46   ` sashiko-bot
  2026-05-08  6:40 ` [PATCH v8 6/7] vfio/pci: Add PCIe TPH GET_ST interface Chengwen Feng
  2026-05-08  6:40 ` [PATCH v8 7/7] vfio/pci: Add PCIe TPH SET_ST interface Chengwen Feng
  6 siblings, 1 reply; 16+ messages in thread
From: Chengwen Feng @ 2026-05-08  6:40 UTC (permalink / raw)
  To: alex, jgg
  Cc: wathsala.vithanage, helgaas, wei.huang2, wangzhou1, wangyushan12,
	liuyonglong, kvm, linux-pci

Add support to enable and disable TPH function with mode selection.

Allow the unsafe device-specific TPH mode only when the module parameter
enable_unsafe_tph_ds_mode=1 is set.

Disable TPH when:
1) Taking over ownership of the device (before user visibility),
2) Userspace closes the device FD to clean up state.

Serialize all TPH operations under vdev->igate mutex using scope-based
automatic locking to prevent hardware control and bitfield races.

Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
---
 drivers/vfio/pci/vfio_pci_core.c | 48 ++++++++++++++++++++++++++++++++
 1 file changed, 48 insertions(+)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 0e97b128fd63..bfc7e87d190f 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -738,6 +738,9 @@ void vfio_pci_core_close_device(struct vfio_device *core_vdev)
 #endif
 	vfio_pci_dma_buf_cleanup(vdev);
 
+	/* Disable TPH when userspace closes the device FD */
+	pcie_disable_tph(vdev->pdev);
+
 	vfio_pci_core_disable(vdev);
 
 	mutex_lock(&vdev->igate);
@@ -1493,18 +1496,60 @@ static int vfio_pci_tph_get_cap(struct vfio_pci_core_device *vdev,
 	return 0;
 }
 
+static int vfio_pci_tph_enable(struct vfio_pci_core_device *vdev,
+			      struct vfio_device_pci_tph_op *op,
+			      void __user *uarg)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	struct vfio_pci_tph_ctrl ctrl;
+	int mode;
+
+	if (op->argsz < offsetofend(struct vfio_device_pci_tph_op, ctrl))
+		return -EINVAL;
+
+	if (copy_from_user(&ctrl, uarg, sizeof(ctrl)))
+		return -EFAULT;
+
+	if (ctrl.mode != VFIO_PCI_TPH_MODE_IV &&
+	    ctrl.mode != VFIO_PCI_TPH_MODE_DS)
+		return -EINVAL;
+
+	if (ctrl.mode == VFIO_PCI_TPH_MODE_DS && !enable_unsafe_tph_ds_mode)
+		return -EOPNOTSUPP;
+
+	/* Reserved must be zero */
+	if (memchr_inv(ctrl.reserved, 0, sizeof(ctrl.reserved)))
+		return -EINVAL;
+
+	mode = (ctrl.mode == VFIO_PCI_TPH_MODE_IV) ? PCI_TPH_ST_IV_MODE :
+						     PCI_TPH_ST_DS_MODE;
+	return pcie_enable_tph(pdev, mode);
+}
+
+static int vfio_pci_tph_disable(struct vfio_pci_core_device *vdev)
+{
+	pcie_disable_tph(vdev->pdev);
+	return 0;
+}
+
 static int vfio_pci_ioctl_tph(struct vfio_pci_core_device *vdev,
 			      void __user *uarg)
 {
 	struct vfio_device_pci_tph_op op = {0};
 	size_t minsz = sizeof(op.argsz) + sizeof(op.op);
 
+	guard(mutex)(&vdev->igate);
+
 	if (copy_from_user(&op, uarg, minsz))
 		return -EFAULT;
 
 	switch (op.op) {
 	case VFIO_PCI_TPH_GET_CAP:
 		return vfio_pci_tph_get_cap(vdev, &op, uarg + minsz);
+	case VFIO_PCI_TPH_ENABLE:
+		return vfio_pci_tph_enable(vdev, &op, uarg + minsz);
+	case VFIO_PCI_TPH_DISABLE:
+		return vfio_pci_tph_disable(vdev);
 	default:
 		/* Other ops are not implemented yet */
 		return -EINVAL;
@@ -2257,6 +2302,9 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev)
 	if (!disable_idle_d3)
 		pm_runtime_put(dev);
 
+	/* Disable TPH when taking over ownership of the device */
+	pcie_disable_tph(pdev);
+
 	ret = vfio_register_group_dev(&vdev->vdev);
 	if (ret)
 		goto out_power;
-- 
2.17.1



* [PATCH v8 6/7] vfio/pci: Add PCIe TPH GET_ST interface
  2026-05-08  6:40 [PATCH v8 0/7] vfio/pci: Add PCIe TPH support Chengwen Feng
                   ` (4 preceding siblings ...)
  2026-05-08  6:40 ` [PATCH v8 5/7] vfio/pci: Add PCIe TPH enable/disable support Chengwen Feng
@ 2026-05-08  6:40 ` Chengwen Feng
  2026-05-08  6:40 ` [PATCH v8 7/7] vfio/pci: Add PCIe TPH SET_ST interface Chengwen Feng
  6 siblings, 0 replies; 16+ messages in thread
From: Chengwen Feng @ 2026-05-08  6:40 UTC (permalink / raw)
  To: alex, jgg
  Cc: wathsala.vithanage, helgaas, wei.huang2, wangzhou1, wangyushan12,
	liuyonglong, kvm, linux-pci

Add support for batch retrieval of CPU steering tags in device-specific
TPH mode. This interface requires the 'enable_unsafe_tph_ds_mode' module
parameter to be enabled.

Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
---
 drivers/vfio/pci/vfio_pci_core.c | 76 ++++++++++++++++++++++++++++++++
 1 file changed, 76 insertions(+)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index bfc7e87d190f..7ec2dd32f106 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1532,6 +1532,80 @@ static int vfio_pci_tph_disable(struct vfio_pci_core_device *vdev)
 	return 0;
 }
 
+static int vfio_pci_tph_get_st(struct vfio_pci_core_device *vdev,
+			       struct vfio_device_pci_tph_op *op,
+			       void __user *uarg)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	u8 mode = pcie_tph_get_st_modes(pdev);
+	struct vfio_pci_tph_entry *ents;
+	struct vfio_pci_tph_st st;
+	enum tph_mem_type mtype;
+	size_t size, ents_off;
+	int i, err;
+
+	if (!enable_unsafe_tph_ds_mode || !(mode & PCI_TPH_CAP_ST_DS))
+		return -EOPNOTSUPP;
+
+	if (op->argsz < offsetofend(struct vfio_device_pci_tph_op, st))
+		return -EINVAL;
+
+	if (copy_from_user(&st, uarg, sizeof(st)))
+		return -EFAULT;
+
+	/* Check reserved fields are zero */
+	if (memchr_inv(&st.reserved, 0, sizeof(st.reserved)))
+		return -EINVAL;
+
+	if (!st.count || st.count > VFIO_PCI_TPH_MAX_ENTRIES)
+		return -EINVAL;
+
+	size = st.count * sizeof(*ents);
+	if (op->argsz < offsetofend(struct vfio_device_pci_tph_op, st) + size)
+		return -EINVAL;
+
+	ents = kvmalloc(size, GFP_KERNEL);
+	if (!ents)
+		return -ENOMEM;
+
+	ents_off = offsetof(struct vfio_pci_tph_st, ents);
+	if (copy_from_user(ents, uarg + ents_off, size)) {
+		err = -EFAULT;
+		goto out;
+	}
+
+	for (i = 0; i < st.count; i++) {
+		/* Check reserved fields and index are zero */
+		if (memchr_inv(&ents[i].reserved0, 0, sizeof(ents[i].reserved0)) ||
+		    memchr_inv(&ents[i].reserved1, 0, sizeof(ents[i].reserved1)) ||
+		    ents[i].index != 0) {
+			err = -EINVAL;
+			goto out;
+		}
+
+		if (ents[i].mem_type == VFIO_PCI_TPH_MEM_TYPE_VM) {
+			mtype = TPH_MEM_TYPE_VM;
+		} else if (ents[i].mem_type == VFIO_PCI_TPH_MEM_TYPE_PM) {
+			mtype = TPH_MEM_TYPE_PM;
+		} else {
+			err = -EINVAL;
+			goto out;
+		}
+
+		err = pcie_tph_get_cpu_st(pdev, mtype, ents[i].cpu,
+					  &ents[i].st);
+		if (err)
+			goto out;
+	}
+
+	if (copy_to_user(uarg + ents_off, ents, size))
+		err = -EFAULT;
+
+out:
+	kvfree(ents);
+	return err;
+}
+
 static int vfio_pci_ioctl_tph(struct vfio_pci_core_device *vdev,
 			      void __user *uarg)
 {
@@ -1550,6 +1624,8 @@ static int vfio_pci_ioctl_tph(struct vfio_pci_core_device *vdev,
 		return vfio_pci_tph_enable(vdev, &op, uarg + minsz);
 	case VFIO_PCI_TPH_DISABLE:
 		return vfio_pci_tph_disable(vdev);
+	case VFIO_PCI_TPH_GET_ST:
+		return vfio_pci_tph_get_st(vdev, &op, uarg + minsz);
 	default:
 		/* Other ops are not implemented yet */
 		return -EINVAL;
-- 
2.17.1



* [PATCH v8 7/7] vfio/pci: Add PCIe TPH SET_ST interface
  2026-05-08  6:40 [PATCH v8 0/7] vfio/pci: Add PCIe TPH support Chengwen Feng
                   ` (5 preceding siblings ...)
  2026-05-08  6:40 ` [PATCH v8 6/7] vfio/pci: Add PCIe TPH GET_ST interface Chengwen Feng
@ 2026-05-08  6:40 ` Chengwen Feng
  2026-05-08 21:49   ` sashiko-bot
  6 siblings, 1 reply; 16+ messages in thread
From: Chengwen Feng @ 2026-05-08  6:40 UTC (permalink / raw)
  To: alex, jgg
  Cc: wathsala.vithanage, helgaas, wei.huang2, wangzhou1, wangyushan12,
	liuyonglong, kvm, linux-pci

Add VFIO_PCI_TPH_SET_ST operation to support batch programming of steering
tag entries. If any entry fails, roll back successfully programmed entries
to 0 to prevent inconsistent device state.

Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
---
 drivers/vfio/pci/vfio_pci_core.c | 90 ++++++++++++++++++++++++++++++++
 1 file changed, 90 insertions(+)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 7ec2dd32f106..9e399696ce6e 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1606,6 +1606,94 @@ static int vfio_pci_tph_get_st(struct vfio_pci_core_device *vdev,
 	return err;
 }
 
+static int vfio_pci_tph_set_st(struct vfio_pci_core_device *vdev,
+			       struct vfio_device_pci_tph_op *op,
+			       void __user *uarg)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	struct vfio_pci_tph_entry *ents;
+	struct vfio_pci_tph_st st;
+	enum tph_mem_type mtype;
+	size_t size, ents_off;
+	int i = 0, j, err;
+	u32 tab_sz;
+	u16 st_val;
+
+	tab_sz = pcie_tph_get_st_table_size(pdev);
+	if (tab_sz == 0)
+		return -EOPNOTSUPP;
+
+	if (op->argsz < offsetofend(struct vfio_device_pci_tph_op, st))
+		return -EINVAL;
+
+	if (copy_from_user(&st, uarg, sizeof(st)))
+		return -EFAULT;
+
+	if (!st.count || st.count > VFIO_PCI_TPH_MAX_ENTRIES)
+		return -EINVAL;
+
+	/* Check reserved fields are zero */
+	if (memchr_inv(&st.reserved, 0, sizeof(st.reserved)))
+		return -EINVAL;
+
+	size = st.count * sizeof(*ents);
+	if (op->argsz < offsetofend(struct vfio_device_pci_tph_op, st) + size)
+		return -EINVAL;
+
+	ents = kvmalloc(size, GFP_KERNEL);
+	if (!ents)
+		return -ENOMEM;
+
+	ents_off = offsetof(struct vfio_pci_tph_st, ents);
+	if (copy_from_user(ents, uarg + ents_off, size)) {
+		err = -EFAULT;
+		goto out;
+	}
+
+	for (; i < st.count; i++) {
+		err = -EINVAL;
+
+		/* Check reserved fields and st are zero */
+		if (memchr_inv(&ents[i].reserved0, 0, sizeof(ents[i].reserved0)) ||
+		    memchr_inv(&ents[i].reserved1, 0, sizeof(ents[i].reserved1)) ||
+		    ents[i].st != 0)
+			goto out;
+
+		if (ents[i].mem_type == VFIO_PCI_TPH_MEM_TYPE_VM)
+			mtype = TPH_MEM_TYPE_VM;
+		else if (ents[i].mem_type == VFIO_PCI_TPH_MEM_TYPE_PM)
+			mtype = TPH_MEM_TYPE_PM;
+		else
+			goto out;
+
+		if (ents[i].index >= tab_sz)
+			goto out;
+
+		if (ents[i].cpu == U32_MAX) {
+			err = pcie_tph_set_st_entry(pdev, ents[i].index, 0);
+			if (err)
+				goto out;
+			continue;
+		}
+
+		err = pcie_tph_get_cpu_st(pdev, mtype, ents[i].cpu, &st_val);
+		if (err)
+			goto out;
+		err = pcie_tph_set_st_entry(pdev, ents[i].index, st_val);
+		if (err)
+			goto out;
+	}
+
+out:
+	if (err) {
+		/* Roll back previously programmed entries to 0 */
+		for (j = 0; j < i; j++)
+			pcie_tph_set_st_entry(pdev, ents[j].index, 0);
+	}
+	kvfree(ents);
+	return err;
+}
+
 static int vfio_pci_ioctl_tph(struct vfio_pci_core_device *vdev,
 			      void __user *uarg)
 {
@@ -1626,6 +1714,8 @@ static int vfio_pci_ioctl_tph(struct vfio_pci_core_device *vdev,
 		return vfio_pci_tph_disable(vdev);
 	case VFIO_PCI_TPH_GET_ST:
 		return vfio_pci_tph_get_st(vdev, &op, uarg + minsz);
+	case VFIO_PCI_TPH_SET_ST:
+		return vfio_pci_tph_set_st(vdev, &op, uarg + minsz);
 	default:
 		/* Other ops are not implemented yet */
 		return -EINVAL;
-- 
2.17.1



* Re: [PATCH v8 2/7] PCI/TPH: Export pcie_tph_get_st_modes() for external use
  2026-05-08  6:40 ` [PATCH v8 2/7] PCI/TPH: Export pcie_tph_get_st_modes() for external use Chengwen Feng
@ 2026-05-08 19:02   ` sashiko-bot
  0 siblings, 0 replies; 16+ messages in thread
From: sashiko-bot @ 2026-05-08 19:02 UTC (permalink / raw)
  To: Chengwen Feng; +Cc: linux-pci

Thank you for your contribution! Sashiko AI review found 1 potential issue to consider:
- [High] pcie_tph_get_st_modes() and pcie_tph_get_st_table_loc() do not check the return value of pci_read_config_dword(), causing them to return false capabilities on read failures.
--

commit 521881ea093ae852c8fad775f8ffcbe3b9c7a46b
Author: Chengwen Feng <fengchengwen@huawei.com>

PCI/TPH: Export pcie_tph_get_st_modes() for external use

This commit exports the pcie_tph_get_st_modes() helper to allow external
drivers like VFIO to query supported PCIe TPH steering tag modes. It also
adds stub functions for when CONFIG_PCIE_TPH is disabled and includes
tph_cap validation to prevent invalid PCI configuration space access.

> diff --git a/drivers/pci/tph.c b/drivers/pci/tph.c
> index 877cf556242b4..ba31b010f67ad 100644
> --- a/drivers/pci/tph.c
> +++ b/drivers/pci/tph.c
> @@ -145,15 +145,27 @@ static void set_ctrl_reg_req_en(struct pci_dev *pdev, u8 req_type)
[ ... ]
> -static u8 get_st_modes(struct pci_dev *pdev)
> +/**
> + * pcie_tph_get_st_modes - Get supported Steering Tag modes
> + * @pdev: PCI device to query
> + *
> + * Return:
> + *  Bitmask of supported ST modes (PCI_TPH_CAP_ST_NS, PCI_TPH_CAP_ST_IV,
> + *                                 PCI_TPH_CAP_ST_DS)
> + */
> +u8 pcie_tph_get_st_modes(struct pci_dev *pdev)
>  {
>  	u32 reg;
>  
> +	if (!pdev->tph_cap)
> +		return 0;
> +
>  	pci_read_config_dword(pdev, pdev->tph_cap + PCI_TPH_CAP, &reg);
>  	reg &= PCI_TPH_CAP_ST_NS | PCI_TPH_CAP_ST_IV | PCI_TPH_CAP_ST_DS;
>  
>  	return reg;
>  }

If pci_read_config_dword() fails (for example, if the device is disconnected
or isolated), it sets the output variable reg to PCI_ERROR_RESPONSE (~0U).

Because the return value is not checked, this code will apply the bitmask
to ~0U, which evaluates to 0x07 (PCI_TPH_CAP_ST_NS | PCI_TPH_CAP_ST_IV |
PCI_TPH_CAP_ST_DS).

Since this function is being exported for external use by drivers like VFIO,
could this falsely advertise that the device supports all three ST modes
instead of none? Should the return value of pci_read_config_dword() be
checked to ensure 0 is returned on failure?
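
The masking problem described above can be reproduced in user space. The
sketch below is a minimal model, not the kernel code: the mode bit values
are assumed to mirror PCI_TPH_CAP_ST_* in pci_regs.h, and
fake_read_config_dword() is a hypothetical stand-in for
pci_read_config_dword() that leaves ~0U in the buffer on failure.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed to mirror PCI_TPH_CAP_ST_{NS,IV,DS}; illustration only. */
#define TPH_CAP_ST_NS	0x1u
#define TPH_CAP_ST_IV	0x2u
#define TPH_CAP_ST_DS	0x4u
#define TPH_CAP_ST_ALL	(TPH_CAP_ST_NS | TPH_CAP_ST_IV | TPH_CAP_ST_DS)
#define ERROR_RESPONSE	(~0u)	/* value left behind by a failed read */

/* Hypothetical stand-in for pci_read_config_dword(): 0 on success. */
static int fake_read_config_dword(int fail, uint32_t hw_value, uint32_t *val)
{
	assert(val);
	if (fail) {
		*val = ERROR_RESPONSE;
		return 0x86;	/* positive PCIBIOS-style error code */
	}
	*val = hw_value;
	return 0;
}

/* Shaped like the quoted code: the read result is ignored. */
static uint8_t st_modes_unchecked(int fail, uint32_t hw_value)
{
	uint32_t reg;

	(void)fake_read_config_dword(fail, hw_value, &reg);
	return reg & TPH_CAP_ST_ALL;
}

/* Checked variant: a failed read reports no supported modes. */
static uint8_t st_modes_checked(int fail, uint32_t hw_value)
{
	uint32_t reg;

	if (fake_read_config_dword(fail, hw_value, &reg))
		return 0;
	return reg & TPH_CAP_ST_ALL;
}
```

With a failing read, the unchecked variant reports all three modes while
the checked variant reports none; with a successful read both agree.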

> +EXPORT_SYMBOL(pcie_tph_get_st_modes);
>  
>  /**
>   * pcie_tph_get_st_table_loc - Return the device's ST table location
> @@ -168,6 +180,9 @@ u32 pcie_tph_get_st_table_loc(struct pci_dev *pdev)
>  {
>  	u32 reg;
>  
> +	if (!pdev->tph_cap)
> +		return PCI_TPH_LOC_NONE;
> +
>  	pci_read_config_dword(pdev, pdev->tph_cap + PCI_TPH_CAP, &reg);
>  
>  	return reg & PCI_TPH_CAP_LOC_MASK;

Similarly, does an unchecked read failure here lead to returning an invalid
location mask derived from the ~0U error response? Would it be safer to check
the return value and return PCI_TPH_LOC_NONE if the read fails?
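
For the location field the effect is slightly different: masking ~0U does
not produce a valid location at all. A small user-space sketch, assuming
the mask and location values match PCI_TPH_CAP_LOC_MASK and PCI_TPH_LOC_*
in pci_regs.h (the function names are hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* Assumed to mirror PCI_TPH_CAP_LOC_MASK and PCI_TPH_LOC_*. */
#define TPH_CAP_LOC_MASK	0x600u
#define TPH_LOC_NONE		0x000u
#define TPH_LOC_CAP		0x200u
#define TPH_LOC_MSIX		0x400u

/* Returns non-zero when loc is one of the three defined encodings. */
static int st_loc_is_known(uint32_t loc)
{
	return loc == TPH_LOC_NONE || loc == TPH_LOC_CAP ||
	       loc == TPH_LOC_MSIX;
}

/*
 * Checked extraction: read_err is the (hypothetical) result of the
 * config read; on failure fall back to "no ST table".
 */
static uint32_t st_table_loc_checked(int read_err, uint32_t reg)
{
	uint32_t loc;

	if (read_err)
		return TPH_LOC_NONE;
	loc = reg & TPH_CAP_LOC_MASK;
	assert(loc <= TPH_CAP_LOC_MASK);
	return loc;
}
```

Here ~0U & 0x600 yields 0x600, which is none of the defined locations, so
callers that compare against PCI_TPH_LOC_CAP or PCI_TPH_LOC_MSIX would
silently fall through rather than see an explicit "none".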

[ ... ]

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260508064053.37529-1-fengchengwen@huawei.com?part=2


* Re: [PATCH v8 3/7] PCI/TPH: Fix pcie_tph_get_st_table_size() for MSI-X table location
  2026-05-08  6:40 ` [PATCH v8 3/7] PCI/TPH: Fix pcie_tph_get_st_table_size() for MSI-X table location Chengwen Feng
@ 2026-05-08 19:31   ` sashiko-bot
  0 siblings, 0 replies; 16+ messages in thread
From: sashiko-bot @ 2026-05-08 19:31 UTC (permalink / raw)
  To: Chengwen Feng; +Cc: linux-pci

Thank you for your contribution! Sashiko AI review found 1 potential issue to consider:
- [High] Modifying pcie_tph_get_st_table_size() to return > 0 for MSI-X located ST tables causes severe out-of-bounds reads/writes in PCI configuration space during TPH state save and restore.
--

commit 73d4864dbc8b240284d0329056ab114562107193
Author: Chengwen Feng <fengchengwen@huawei.com>

    PCI/TPH: Fix pcie_tph_get_st_table_size() for MSI-X table location
    
    pcie_tph_get_st_table_size() previously only returned valid size when ST
    Table is in TPH Capability space. According to PCIe spec [1], ST table size
    is also valid when ST Table is located in MSI-X table.

[ ... ]

> diff --git a/drivers/pci/tph.c b/drivers/pci/tph.c
> index ba31b010f67ad..de5bd7039cdca 100644
> --- a/drivers/pci/tph.c
> +++ b/drivers/pci/tph.c

[ ... ]

> @@ -200,7 +201,7 @@ u16 pcie_tph_get_st_table_size(struct pci_dev *pdev)
>  
>  	/* Check ST table location first */
>  	loc = pcie_tph_get_st_table_loc(pdev);
> -	if (loc != PCI_TPH_LOC_CAP)
> +	if (loc != PCI_TPH_LOC_CAP && loc != PCI_TPH_LOC_MSIX)
>  		return 0;
>  
>  	pci_read_config_dword(pdev, pdev->tph_cap + PCI_TPH_CAP, &reg);

If pcie_tph_get_st_table_size() returns a value greater than 0 when the ST
table is located in the MSI-X table, does this cause pci_restore_tph_state()
to write out of bounds in the PCI configuration space?

Looking at pci_restore_tph_state(), it uses the size returned by
pcie_tph_get_st_table_size() to write ST entries directly into the TPH
Extended Capability:

drivers/pci/tph.c:pci_restore_tph_state() {
	...
	st_entry = (u16 *)cap;
	offset = PCI_TPH_BASE_SIZEOF;
	num_entries = pcie_tph_get_st_table_size(pdev);
	for (i = 0; i < num_entries; i++) {
		pci_write_config_word(pdev, pdev->tph_cap + offset,
				      *st_entry++);
		offset += sizeof(u16);
	}
}

When the ST table is located in MSI-X, the entries are memory mapped rather
than stored in the TPH capability itself.

Because the TPH Extended Capability is only 12 bytes long in this case, could
this loop clobber subsequent capabilities in the configuration space during
resume?

A similar out-of-bounds read would occur in pci_save_tph_state().
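
One way to keep save/restore inside the capability is to derive the number
of in-capability entries from the table location rather than from the raw
table size. The sketch below models this in user space with config space
as an array of 16-bit words; all names are hypothetical and it is not the
kernel's save/restore code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum st_loc { ST_LOC_NONE, ST_LOC_CAP, ST_LOC_MSIX };

/* Only a table located in the TPH capability occupies config space. */
static unsigned int st_entries_in_cap(enum st_loc loc, unsigned int table_size)
{
	return loc == ST_LOC_CAP ? table_size : 0;
}

/*
 * Restore saved ST entries into a fake config space. first_entry_word is
 * the word index right after the fixed part of the capability. Returns
 * the number of words written.
 */
static unsigned int restore_st_entries(uint16_t *cfg, size_t cfg_words,
				       size_t first_entry_word,
				       enum st_loc loc,
				       unsigned int table_size,
				       const uint16_t *saved)
{
	unsigned int i, n = st_entries_in_cap(loc, table_size);

	assert(cfg && saved);
	for (i = 0; i < n && first_entry_word + i < cfg_words; i++)
		cfg[first_entry_word + i] = saved[i];
	return i;
}
```

With the location check in place, an MSI-X located table results in zero
config writes, so whatever follows the 12-byte capability is left alone.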

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260508064053.37529-1-fengchengwen@huawei.com?part=3


* Re: [PATCH v8 4/7] vfio/pci: Add PCIe TPH interface with capability query
  2026-05-08  6:40 ` [PATCH v8 4/7] vfio/pci: Add PCIe TPH interface with capability query Chengwen Feng
@ 2026-05-08 20:03   ` sashiko-bot
  2026-05-08 22:40   ` Alex Williamson
  1 sibling, 0 replies; 16+ messages in thread
From: sashiko-bot @ 2026-05-08 20:03 UTC (permalink / raw)
  To: Chengwen Feng; +Cc: linux-pci

Thank you for your contribution! Sashiko AI review found 4 potential issues to consider:
- [High] UAPI Backwards Compatibility Breakage and Buffer Overflow Risk
- [High] Missing memory_lock Causes Hardware State Races During Reset
- [Medium] UAPI Bloat / Bypassing VFIO_DEVICE_FEATURE Framework
- [Medium] Incorrect Module Parameter Scope Limits Vendor Drivers
--

commit 83318520665fc31c05135b8adff081970464291e
Author: Chengwen Feng <fengchengwen@huawei.com>

vfio/pci: Add PCIe TPH interface with capability query
    
Add VFIO_DEVICE_PCI_TPH IOCTL to allow userspace to query device TPH
capabilities, supported modes, and steering tag table information.

> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 0c771064c0b84..40bf5aa9fd0be 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -60,6 +60,12 @@ static bool disable_denylist;
>  module_param(disable_denylist, bool, 0444);
>  MODULE_PARM_DESC(disable_denylist, "Disable use of device denylist. Disabling the denylist allows binding to devices with known errata that may lead to exploitable stability or security issues when accessed by untrusted users.");
>  
> +#ifdef CONFIG_PCIE_TPH
> +static bool enable_unsafe_tph_ds_mode;
> +module_param(enable_unsafe_tph_ds_mode, bool, 0444);

Vendor-specific VFIO drivers (such as mlx5_vfio_pci or hisi_vfio_pci) load
the shared vfio_pci_core.ko library directly and do not depend on vfio_pci.ko.

If users of these vendor drivers want to enable unsafe TPH DS mode, they
have no way to do so unless they explicitly load the generic vfio_pci driver,
which alters the core state for all bound devices.

Should this parameter reside directly in vfio_pci_core.c instead?

[ ... ]

> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index 3f8d093aacf8a..0e97b128fd637 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
[ ... ]
> @@ -1461,6 +1463,54 @@ static int vfio_pci_ioctl_ioeventfd(struct vfio_pci_core_device *vdev,
>  				  ioeventfd.fd);
>  }
>  
> +static int vfio_pci_tph_get_cap(struct vfio_pci_core_device *vdev,
> +				struct vfio_device_pci_tph_op *op,
> +				void __user *uarg)
> +{
> +	struct pci_dev *pdev = vdev->pdev;
> +	struct vfio_pci_tph_cap cap = {0};
> +	u8 mode;
> +
> +	if (op->argsz < offsetofend(struct vfio_device_pci_tph_op, cap))
> +		return -EINVAL;

If struct vfio_pci_tph_cap is extended in future kernel releases, offsetofend()
will increase. Will new kernels reject valid requests from older userspace
binaries with -EINVAL, breaking backwards compatibility?

> +
> +	mode = pcie_tph_get_st_modes(pdev);
> +	/* Hide unsafe device-specific (DS) mode by default */
> +	if (!enable_unsafe_tph_ds_mode)
> +		mode &= ~PCI_TPH_CAP_ST_DS;
> +	if (mode == 0 || mode == PCI_TPH_CAP_ST_NS)
> +		return -EOPNOTSUPP;
> +
> +	if (mode & PCI_TPH_CAP_ST_IV)
> +		cap.supported_modes |= VFIO_PCI_TPH_MODE_IV;
> +	if (mode & PCI_TPH_CAP_ST_DS)
> +		cap.supported_modes |= VFIO_PCI_TPH_MODE_DS;
> +	cap.st_table_sz = pcie_tph_get_st_table_size(pdev);
> +
> +	if (copy_to_user(uarg, &cap, sizeof(cap)))

If the capability structure grows in the future, will the kernel write beyond
the bounds of an older userspace application's allocated buffer?
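
Both concerns are usually addressed by validating argsz against a fixed
minimum and copying back no more than the caller declared. A minimal
user-space sketch of that pattern follows; struct cap_v2 is a hypothetical
future layout, memcpy() stands in for copy_to_user(), and none of the
names come from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct cap_v1 {			/* layout known to old userspace */
	uint32_t argsz;
	uint32_t modes;
};

struct cap_v2 {			/* hypothetical later layout */
	uint32_t argsz;
	uint32_t modes;
	uint32_t extra;
};

/* The minimum never changes once the first layout has shipped. */
#define CAP_MINSZ (offsetof(struct cap_v1, modes) + sizeof(uint32_t))

/* Fill a caller buffer of user_argsz bytes from the newest layout. */
static int fill_cap(void *user_buf, uint32_t user_argsz,
		    uint32_t modes, uint32_t extra)
{
	struct cap_v2 reply = {
		.argsz = sizeof(reply),
		.modes = modes,
		.extra = extra,
	};
	size_t n;

	assert(user_buf);
	if (user_argsz < CAP_MINSZ)
		return -EINVAL;

	/* Never write past what the caller said it allocated. */
	n = user_argsz < sizeof(reply) ? user_argsz : sizeof(reply);
	memcpy(user_buf, &reply, n);
	return 0;
}
```

An old caller passing sizeof(struct cap_v1) still succeeds and only its
first eight bytes are touched; reporting the full size back in argsz is
one possible way to let newer callers detect that more data is available.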

> +		return -EFAULT;
> +
> +	return 0;
> +}
> +
> +static int vfio_pci_ioctl_tph(struct vfio_pci_core_device *vdev,
> +			      void __user *uarg)
> +{
> +	struct vfio_device_pci_tph_op op = {0};
> +	size_t minsz = sizeof(op.argsz) + sizeof(op.op);
> +
> +	if (copy_from_user(&op, uarg, minsz))
> +		return -EFAULT;
> +
> +	switch (op.op) {
> +	case VFIO_PCI_TPH_GET_CAP:
> +		return vfio_pci_tph_get_cap(vdev, &op, uarg + minsz);

This handler does not acquire down_read(&vdev->memory_lock). Does this allow
TPH IOCTLs to run concurrently with a PCI hot reset
(VFIO_DEVICE_PCI_HOT_RESET) which takes the write lock via
vfio_pci_zap_and_down_write_memory_lock()?

Concurrent execution could allow vfio_pci_tph_get_cap() to read configuration
space while the secondary bus is in reset, yielding bogus data.

More importantly, later in this series when vfio_pci_tph_set_st() is added to
write directly to the MSI-X table via MMIO, performing MMIO writes to a device
actively undergoing reset can cause PCI Master Aborts. On certain platforms,
this can trigger AER panics or EEH isolation.

[ ... ]
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 5de618a3a5eee..81da2bd0c21b0 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
[ ... ]
> +/**
> + * VFIO_DEVICE_PCI_TPH - _IO(VFIO_TYPE, VFIO_BASE + 22)
> + *
> + * IOCTL for managing PCIe TLP Processing Hints (TPH) on
> + * a VFIO-assigned PCI device. Provides operations to query
> + * device capabilities, enable/disable TPH, retrieve CPU's
> + * steering tags, and program steering tag tables.
> + *
> + * Return: 0 on success, negative errno on failure.
> + *         -EOPNOTSUPP: Operation not supported
> + *         -ENODEV: Device or required functionality not present
> + *         -EINVAL: Invalid argument or TPH not supported
> + */
> +#define VFIO_DEVICE_PCI_TPH	_IO(VFIO_TYPE, VFIO_BASE + 22)

This introduces a new top-level IOCTL to manage PCIe TPH capabilities instead
of using the standard VFIO_DEVICE_FEATURE IOCTL.

Could this be implemented using VFIO_DEVICE_FEATURE, which was explicitly
designed to act as a unified multiplexer for querying and setting new device
features, to avoid unnecessarily bloating the UAPI namespace and duplicating
boilerplate for sizing and feature discovery?

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260508064053.37529-1-fengchengwen@huawei.com?part=4


* Re: [PATCH v8 5/7] vfio/pci: Add PCIe TPH enable/disable support
  2026-05-08  6:40 ` [PATCH v8 5/7] vfio/pci: Add PCIe TPH enable/disable support Chengwen Feng
@ 2026-05-08 20:46   ` sashiko-bot
  0 siblings, 0 replies; 16+ messages in thread
From: sashiko-bot @ 2026-05-08 20:46 UTC (permalink / raw)
  To: Chengwen Feng; +Cc: linux-pci

Thank you for your contribution! Sashiko AI review found 2 potential issues to consider:
- [High] TPH operations access the device's PCI configuration space without ensuring the device is in the D0 power state, risking accesses while suspended (D3).
- [High] `vdev->igate` is insufficient to prevent bitfield tearing on `pdev->tph_enabled` against other PCI subsystems.
--

commit e6a89cf03d6c51144466205cce4a3bf38731b5db
Author: Chengwen Feng <fengchengwen@huawei.com>

vfio/pci: Add PCIe TPH enable/disable support

This patch adds support for enabling and disabling the TPH function via
mode selection. It also disables TPH when ownership is taken or when
userspace closes the device FD.

> Serialize all TPH operations under vdev->igate mutex using scope-based
> automatic locking to prevent hardware control and bitfield races.

If pdev->tph_enabled is a bitfield inside struct pci_dev that shares a
memory word with other core PCI bitfields (like state_saved or
is_busmaster), will vdev->igate prevent bitfield tearing?

Since other kernel subsystems modify these neighboring bitfields without
holding the VFIO-private igate mutex (such as the PCI PM core modifying
state_saved during runtime suspend), concurrent updates could result in
data races and lost state updates.
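
The lost-update scenario can be replayed deterministically by splitting a
bitfield store into its load and store halves. The sketch below is a
single-threaded model with hypothetical bit positions; it does not reflect
the actual layout of struct pci_dev.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical positions of two flags sharing one 32-bit word. */
#define BIT_STATE_SAVED	(1u << 0)
#define BIT_TPH_ENABLED	(1u << 1)

/* A bitfield store is a read-modify-write: this is the "read" half. */
static uint32_t rmw_load(const uint32_t *word)
{
	assert(word);
	return *word;
}

/* ... and this is the "modify, write back" half. */
static void rmw_store(uint32_t *word, uint32_t snapshot, uint32_t set_mask)
{
	assert(word);
	*word = snapshot | set_mask;
}

/*
 * Replay the interleaving: writer A (holding a private lock) loads the
 * word, writer B (not holding that lock) sets its own bit, then A stores
 * its stale snapshot. Returns the final word.
 */
static uint32_t replay_lost_update(void)
{
	uint32_t word = 0;
	uint32_t a_snapshot = rmw_load(&word);			/* A: load  */

	rmw_store(&word, rmw_load(&word), BIT_STATE_SAVED);	/* B: RMW   */
	rmw_store(&word, a_snapshot, BIT_TPH_ENABLED);		/* A: store */
	return word;
}
```

The final word has only the TPH bit set; B's update is gone even though A
held its lock the whole time, because B never took that lock.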

> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index 0e97b128fd637..bfc7e87d190f4 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -738,6 +738,9 @@ void vfio_pci_core_close_device(struct vfio_device *core_vdev)
>  #endif
>  	vfio_pci_dma_buf_cleanup(vdev);
>  
> +	/* Disable TPH when userspace closes the device FD */
> +	pcie_disable_tph(vdev->pdev);
> +
>  	vfio_pci_core_disable(vdev);

Is it safe to access the device's PCI configuration space here?

Since vfio_pci_core_disable() is explicitly responsible for waking up the
device via pm_runtime_resume(), pcie_disable_tph() is being called before
the device is ensured to be in the D0 power state. 

Accessing the config space of a suspended device could cause PCIe errors
like Unsupported Requests.

[ ... ]

>  	switch (op.op) {
>  	case VFIO_PCI_TPH_GET_CAP:
>  		return vfio_pci_tph_get_cap(vdev, &op, uarg + minsz);
> +	case VFIO_PCI_TPH_ENABLE:
> +		return vfio_pci_tph_enable(vdev, &op, uarg + minsz);
> +	case VFIO_PCI_TPH_DISABLE:
> +		return vfio_pci_tph_disable(vdev);

Are these TPH capability accesses safe without a pm_runtime_resume_and_get()
and pm_runtime_put() wrapper?

Userspace could potentially access the config space while the device is idle
and runtime suspended.

[ ... ]

> @@ -2257,6 +2302,9 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev)
>  	if (!disable_idle_d3)
>  		pm_runtime_put(dev);
>  
> +	/* Disable TPH when taking over ownership of the device */
> +	pcie_disable_tph(pdev);
> +

Could this pcie_disable_tph() execute while the device is suspending?

It is called after pm_runtime_put(), meaning the PM core may have already
started asynchronously suspending the device, potentially leading to config
space accesses while the device is in D3.

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260508064053.37529-1-fengchengwen@huawei.com?part=5


* Re: [PATCH v8 7/7] vfio/pci: Add PCIe TPH SET_ST interface
  2026-05-08  6:40 ` [PATCH v8 7/7] vfio/pci: Add PCIe TPH SET_ST interface Chengwen Feng
@ 2026-05-08 21:49   ` sashiko-bot
  0 siblings, 0 replies; 16+ messages in thread
From: sashiko-bot @ 2026-05-08 21:49 UTC (permalink / raw)
  To: Chengwen Feng; +Cc: linux-pci

Thank you for your contribution! Sashiko AI review found 2 potential issues to consider:
- [High] The error path in `vfio_pci_tph_set_st()` destructively zeroes out successfully programmed ST entries on partial failure, causing an unknown and inconsistent device state.
- [High] Positive PCIBIOS error codes from `pcie_tph_set_st_entry()` are propagated to userspace `ioctl()`, resulting in false success reporting.
--

commit c7946d420debcc101b2f2daf21858db188b20ee7
Author: Chengwen Feng <fengchengwen@huawei.com>

vfio/pci: Add PCIe TPH SET_ST interface

This commit adds the VFIO_PCI_TPH_SET_ST operation to support batch programming
of steering tag entries. It includes an error path that attempts to roll back
successfully programmed entries to 0 if any entry fails.

> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c

[ ... ]

> @@ -1606,6 +1606,94 @@ static int vfio_pci_tph_get_st(struct vfio_pci_core_device *vdev,
>  	return err;
>  }
>  
> +static int vfio_pci_tph_set_st(struct vfio_pci_core_device *vdev,
> +			       struct vfio_device_pci_tph_op *op,
> +			       void __user *uarg)
> +{

[ ... ]

> +		if (ents[i].cpu == U32_MAX) {
> +			err = pcie_tph_set_st_entry(pdev, ents[i].index, 0);
> +			if (err)
> +				goto out;

Could this return a positive PCIBIOS error code to userspace?

pcie_tph_set_st_entry() eventually calls pci_write_config_word(), which returns
positive PCIBIOS_* error codes (like PCIBIOS_SET_FAILED). Since the VFS ioctl
handler only treats strictly negative values as errors, returning a positive
value directly will cause userspace to falsely assume the ioctl succeeded.

Should this error be converted using pcibios_err_to_errno()?
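
To illustrate the conversion being suggested, here is a user-space model
of such a mapping. It is modelled on the kernel's pcibios_err_to_errno()
helper but covers only a subset of codes; the PCIBIOS_* values are assumed
to match include/linux/pci.h, and the function name is hypothetical.

```c
#include <assert.h>
#include <errno.h>

/* Assumed to match the PCIBIOS_* definitions in include/linux/pci.h. */
#define PCIBIOS_SUCCESSFUL		0x00
#define PCIBIOS_DEVICE_NOT_FOUND	0x86
#define PCIBIOS_SET_FAILED		0x88

/* Map a positive PCIBIOS code to a negative errno; pass others through. */
static int model_pcibios_err_to_errno(int err)
{
	if (err <= 0)
		return err;	/* already success or a negative errno */

	switch (err) {
	case PCIBIOS_DEVICE_NOT_FOUND:
		return -ENODEV;
	case PCIBIOS_SET_FAILED:
		return -EIO;
	default:
		return -ERANGE;	/* unknown positive code */
	}
}

/* What an ioctl path would hand back to userspace after conversion. */
static int ioctl_result(int helper_ret)
{
	int ret = model_pcibios_err_to_errno(helper_ret);

	assert(ret <= 0);	/* never leak a positive value */
	return ret;
}
```

After conversion, a failed config write surfaces as a negative errno, so
the ioctl() caller sees -1 with errno set instead of an apparent success.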

> +			continue;
> +		}
> +
> +		err = pcie_tph_get_cpu_st(pdev, mtype, ents[i].cpu, &st_val);
> +		if (err)
> +			goto out;
> +		err = pcie_tph_set_st_entry(pdev, ents[i].index, st_val);
> +		if (err)
> +			goto out;
> +	}
> +
> +out:
> +	if (err) {
> +		/* Roll back previously programmed entries to 0 */
> +		for (j = 0; j < i; j++)
> +			pcie_tph_set_st_entry(pdev, ents[j].index, 0);

Does this error path actually leave the device in an inconsistent state if the
entries contained valid non-zero tags before this ioctl was called?

Since the original ST values are not read or saved prior to modification,
this "rollback" destructively overwrites them with 0 rather than restoring the
true previous state.

Could we read and save the original values before modifying them, or return
an error without attempting a partial revert?
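
The first option (save, then restore on failure) could look roughly like
the user-space sketch below. The ST table is modelled as a plain array,
write_entry() is a hypothetical writer that fails for the value 0xFFFF to
simulate a hardware error, and the caller supplies the scratch buffer for
saved values.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

struct st_update {
	uint16_t index;
	uint16_t value;
};

/* Hypothetical writer; 0xFFFF simulates a failing hardware write. */
static int write_entry(uint16_t *table, size_t table_len,
		       uint16_t index, uint16_t value)
{
	if (index >= table_len || value == 0xFFFF)
		return -EIO;
	table[index] = value;
	return 0;
}

/* Apply a batch; on failure restore every entry touched so far. */
static int apply_batch(uint16_t *table, size_t table_len,
		       const struct st_update *upd, uint16_t *saved,
		       size_t count)
{
	size_t i;
	int err = 0;

	assert(table && upd && saved);
	for (i = 0; i < count; i++) {
		if (upd[i].index >= table_len) {
			err = -EINVAL;
			break;
		}
		saved[i] = table[upd[i].index];	/* remember old value */
		err = write_entry(table, table_len, upd[i].index,
				  upd[i].value);
		if (err)
			break;
	}

	if (err) {
		/* Undo in reverse so duplicate indices unwind correctly. */
		while (i-- > 0)
			table[upd[i].index] = saved[i];
	}
	return err;
}
```

Restoring in reverse order matters when a batch names the same index
twice: the later save holds an intermediate value, and only the earliest
save holds the value from before the call.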

> +	}
> +	kvfree(ents);
> +	return err;
> +}

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260508064053.37529-1-fengchengwen@huawei.com?part=7


* Re: [PATCH v8 4/7] vfio/pci: Add PCIe TPH interface with capability query
  2026-05-08  6:40 ` [PATCH v8 4/7] vfio/pci: Add PCIe TPH interface with capability query Chengwen Feng
  2026-05-08 20:03   ` sashiko-bot
@ 2026-05-08 22:40   ` Alex Williamson
  2026-05-09  3:28     ` fengchengwen
  1 sibling, 1 reply; 16+ messages in thread
From: Alex Williamson @ 2026-05-08 22:40 UTC (permalink / raw)
  To: Chengwen Feng
  Cc: jgg, wathsala.vithanage, helgaas, wei.huang2, wangzhou1,
	wangyushan12, liuyonglong, kvm, linux-pci, alex

On Fri, 8 May 2026 14:40:50 +0800
Chengwen Feng <fengchengwen@huawei.com> wrote:

> Add VFIO_DEVICE_PCI_TPH IOCTL to allow userspace to query device TPH
> capabilities, supported modes, and steering tag table information.
> 
> Add module parameter 'enable_unsafe_tph_ds_mode' to restrict unsafe
> device-specific TPH mode to trusted userspace only.
> 
> Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
> ---
>  drivers/vfio/pci/vfio_pci.c      |  13 ++-
>  drivers/vfio/pci/vfio_pci_core.c |  56 ++++++++++++-
>  include/linux/vfio_pci_core.h    |   3 +-
>  include/uapi/linux/vfio.h        | 133 +++++++++++++++++++++++++++++++
>  4 files changed, 202 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 0c771064c0b8..40bf5aa9fd0b 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -60,6 +60,12 @@ static bool disable_denylist;
>  module_param(disable_denylist, bool, 0444);
>  MODULE_PARM_DESC(disable_denylist, "Disable use of device denylist. Disabling the denylist allows binding to devices with known errata that may lead to exploitable stability or security issues when accessed by untrusted users.");
>  
> +#ifdef CONFIG_PCIE_TPH
> +static bool enable_unsafe_tph_ds_mode;
> +module_param(enable_unsafe_tph_ds_mode, bool, 0444);
> +MODULE_PARM_DESC(enable_unsafe_tph_ds_mode, "Enable UNSAFE TPH device-specific (DS) mode. This mode provides weak isolation, cannot be safely used for virtual machines. If you do not know what this is for, step away. (default: false)");
> +#endif
> +

Why is the "unsafe" aspect of this keyed on mode rather than storage
location?

Currently the user cannot enable TPH, the capability is read-only, but
the user does have direct access to the MSI-X table.  We rely on an
agreement that the user needs to use SET_IRQS to allocate host vectors
and we use interrupt remapping as protection against abuse, but there's
no mediation of STs written directly to the MSI-X table.  If the device
supports IV mode with ST in the MSI-X table, nothing prevents the user
from writing those ST entries directly to the MSI-X table.  Therefore
doesn't it have the same security concern as DS mode?

Further, config space lives in the device and various devices are known
to have alternate means for accessing their config space.
Virtualization of config space is more to present the device in the VM
address space and bridge features between guest and host.  It's not
great as a security barrier.

Maybe it's really neither the mode nor storage location, and we need to
decide if TPH as a whole introduces any new security considerations.
It seems arguable whether we can actually prevent a device from
including arbitrary STs on TLPs in any case and maybe we're really only
exposing a curated programming interface.

...
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 5de618a3a5ee..81da2bd0c21b 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -1321,6 +1321,139 @@ struct vfio_precopy_info {
>  
>  #define VFIO_MIG_GET_PRECOPY_INFO _IO(VFIO_TYPE, VFIO_BASE + 21)
>  
> +/**
> + * struct vfio_pci_tph_cap - PCIe TPH capability information
> + * @supported_modes: Supported TPH operating modes
> + * @st_table_sz: Number of entries in ST table; 0 means no ST table
> + * @reserved: Must be zero
> + *
> + * Used with VFIO_PCI_TPH_GET_CAP operation to return device
> + * TLP Processing Hints (TPH) capabilities to userspace.
> + */
> +struct vfio_pci_tph_cap {
> +	__u8  supported_modes;
> +#define VFIO_PCI_TPH_MODE_IV	(1u << 0) /* Interrupt vector */
> +#define VFIO_PCI_TPH_MODE_DS	(1u << 1) /* Device specific */
> +	__u8  reserved0;
> +	__u16 st_table_sz;
> +	__u32 reserved;
> +};
> +
> +/**
> + * struct vfio_pci_tph_ctrl - TPH enable control structure
> + * @mode: Selected TPH operating mode (VFIO_PCI_TPH_MODE_*)
> + * @reserved: Must be zero
> + *
> + * Used with VFIO_PCI_TPH_ENABLE operation to specify the
> + * operating mode when enabling TPH on the device.
> + */
> +struct vfio_pci_tph_ctrl {
> +	__u8 mode;
> +	__u8 reserved[7];
> +};
> +
> +/**
> + * struct vfio_pci_tph_entry - Single TPH steering tag entry
> + * @cpu: CPU identifier for steering tag calculation
> + * @mem_type: Memory type (VFIO_PCI_TPH_MEM_TYPE_*)
> + * @reserved0: Must be zero
> + * @index: ST table index for programming
> + * @st: Unused for SET_ST
> + * @reserved1: Must be zero
> + *
> + * For VFIO_PCI_TPH_GET_ST:
> + *   Userspace sets @cpu and @mem_type; kernel returns @st.
> + *
> + * For VFIO_PCI_TPH_SET_ST:
> + *   Userspace sets @index, @cpu, and @mem_type.
> + *   Kernel internally computes the steering tag and programs
> + *   it into the specified @index.
> + *
> + *   If @cpu == U32_MAX, kernel clears the steering tag at
> + *   the specified @index.
> + */
> +struct vfio_pci_tph_entry {
> +	__u32 cpu;
> +	__u8  mem_type;
> +#define VFIO_PCI_TPH_MEM_TYPE_VM	0
> +#define VFIO_PCI_TPH_MEM_TYPE_PM	1
> +	__u8  reserved0;
> +	__u16 index;
> +	__u16 st;
> +	__u16 reserved1;
> +};
> +
> +/**
> + * struct vfio_pci_tph_st - Batch steering tag request
> + * @count: Number of entries in the array
> + * @reserved: Must be zero
> + * @ents: Flexible array of steering tag entries
> + *
> + * Container structure for batch get/set operations.
> + * Used with both VFIO_PCI_TPH_GET_ST and VFIO_PCI_TPH_SET_ST.
> + */
> +struct vfio_pci_tph_st {
> +	__u32 count;
> +	__u32 reserved;
> +	struct vfio_pci_tph_entry ents[];
> +#define VFIO_PCI_TPH_MAX_ENTRIES    2048
> +};
> +
> +/**
> + * struct vfio_device_pci_tph_op - Argument for VFIO_DEVICE_PCI_TPH
> + * @argsz: User allocated size of this structure
> + * @op: TPH operation (VFIO_PCI_TPH_*)
> + * @cap: Capability data for GET_CAP
> + * @ctrl: Control data for ENABLE
> + * @st: Batch entry data for GET_ST/SET_ST
> + *
> + * @argsz must be set by the user to the size of the structure
> + * being executed. Kernel validates input and returns data
> + * only within the specified size.
> + *
> + * Operations:
> + * - VFIO_PCI_TPH_GET_CAP: Query device TPH capabilities.
> + * - VFIO_PCI_TPH_ENABLE:  Enable TPH using mode from &ctrl.
> + * - VFIO_PCI_TPH_DISABLE: Disable TPH on the device.
> + * - VFIO_PCI_TPH_GET_ST:  Retrieve CPU steering tags for Device-Specific (DS)
> + *                         mode. Used when device requires SW to obtain ST
> + *                         values for programming.
> + * - VFIO_PCI_TPH_SET_ST:  Program steering tag entries into device ST table.
> + *                         Valid when ST table resides in TPH Requester
> + *                         Capability or MSI-X Table.
> + *                         If any entry fails, all programmed entries are rolled
> + *                         back to 0 before returning error.
> + */
> +struct vfio_device_pci_tph_op {
> +	__u32 argsz;
> +	__u32 op;
> +#define VFIO_PCI_TPH_GET_CAP	0
> +#define VFIO_PCI_TPH_ENABLE	1
> +#define VFIO_PCI_TPH_DISABLE	2
> +#define VFIO_PCI_TPH_GET_ST	3
> +#define VFIO_PCI_TPH_SET_ST	4
> +	union {
> +		struct vfio_pci_tph_cap cap;
> +		struct vfio_pci_tph_ctrl ctrl;
> +		struct vfio_pci_tph_st st;
> +	};
> +};
> +
> +/**
> + * VFIO_DEVICE_PCI_TPH - _IO(VFIO_TYPE, VFIO_BASE + 22)
> + *
> + * IOCTL for managing PCIe TLP Processing Hints (TPH) on
> + * a VFIO-assigned PCI device. Provides operations to query
> + * device capabilities, enable/disable TPH, retrieve CPU's
> + * steering tags, and program steering tag tables.
> + *
> + * Return: 0 on success, negative errno on failure.
> + *         -EOPNOTSUPP: Operation not supported
> + *         -ENODEV: Device or required functionality not present
> + *         -EINVAL: Invalid argument or TPH not supported
> + */
> +#define VFIO_DEVICE_PCI_TPH	_IO(VFIO_TYPE, VFIO_BASE + 22)
> +

This seems like the wrong shape to me and introduces yet another ioctl
multiplexer.  We already have that via the device feature interface.
I'd propose this only needs one new DEVICE_FEATURE ioctl, TPH_ST.  The
uAPI would look like:

struct vfio_device_feature_tph_st {
	__u32 flags;
#define VFIO_TPH_ST_MEM_TYPE_PM	(1 << 0)
	__u16 index;
	__u16 count;
	__u32 data[]; /* host CPU# on SET, ST value on GET */
};

The user can SET multiple STs at once that have the same mem_type
(assuming that's a reasonable limitation).  On SET, each {cpu#,
mem_type} is translated to a host value and stored internally.  A GET
returns that translated ST value for DS use cases.
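For illustration, the SET/GET semantics described above can be modelled in plain userspace C. This is only a sketch under stated assumptions, not kernel code: `tph_st_table`, `tph_st_set()`, `tph_st_get()`, `TPH_ST_TABLE_MAX` and the stub `host_cpu_to_st()` are hypothetical names, and the stub's arithmetic merely stands in for whatever host translation the kernel would perform (presumably via the existing PCI/TPH helpers).

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

#define VFIO_TPH_ST_MEM_TYPE_PM (1u << 0) /* flag from the proposal above */
#define TPH_ST_TABLE_MAX 64               /* hypothetical table size */

/* Hypothetical per-device internal table of translated steering tags. */
struct tph_st_table {
	uint16_t st[TPH_ST_TABLE_MAX];
};

/*
 * Stand-in for the host translation of {cpu#, mem_type} to a steering tag.
 * The arithmetic is made up for illustration; it only has to be deterministic.
 */
int host_cpu_to_st(uint32_t cpu, int persistent, uint16_t *st)
{
	if (cpu >= 1024)
		return -EINVAL;
	*st = (uint16_t)((cpu << 1) | (persistent ? 1 : 0));
	return 0;
}

/* SET: data[] holds host CPU numbers; all entries share one mem_type. */
int tph_st_set(struct tph_st_table *t, uint32_t flags, uint16_t index,
	       uint16_t count, const uint32_t *data)
{
	uint16_t tmp[TPH_ST_TABLE_MAX];
	int ret;

	assert(t && data);
	if (flags & ~VFIO_TPH_ST_MEM_TYPE_PM)
		return -EINVAL;
	if (!count || (uint32_t)index + count > TPH_ST_TABLE_MAX)
		return -EINVAL;

	/* Translate everything first so a failure leaves the table untouched. */
	for (uint16_t i = 0; i < count; i++) {
		ret = host_cpu_to_st(data[i],
				     !!(flags & VFIO_TPH_ST_MEM_TYPE_PM),
				     &tmp[i]);
		if (ret)
			return ret;
	}
	memcpy(&t->st[index], tmp, count * sizeof(tmp[0]));
	return 0;
}

/* GET: data[] receives the translated ST values for the same range. */
int tph_st_get(const struct tph_st_table *t, uint16_t index,
	       uint16_t count, uint32_t *data)
{
	assert(t && data);
	if (!count || (uint32_t)index + count > TPH_ST_TABLE_MAX)
		return -EINVAL;
	for (uint16_t i = 0; i < count; i++)
		data[i] = t->st[index + i];
	return 0;
}
```

Translating the whole batch before committing is a choice made for this sketch; it avoids needing a rollback when one CPU number in the batch is invalid.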

The user can use PROBE to determine if this feature is available.

We already provide the TPH capability read-only in config space, we can
use that rather than an explicit INFO/GET_CAP interface.

When the feature is available, the TPH control register is virtualized.
On enabling TPH via config space, vfio will store the translated ST
values to the appropriate location, or none, and enable TPH.  On SET
while already enabled, vfio will update both the internal table and the
device location (or none).  Thanks,

Alex
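
The enable-time flush and the write-through on SET described above could be modelled roughly as follows. This is a userspace sketch only; `tph_model`, `st_location`, `tph_set_st()`, `tph_enable()` and `ST_ENTRIES` are hypothetical names, and the `device[]` array merely stands in for the capability or MSI-X ST table.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ST_ENTRIES 8 /* hypothetical table size */

/* Where the device consumes STs from; NONE covers the "or none" case. */
enum st_location { ST_LOC_NONE, ST_LOC_CAP, ST_LOC_MSIX };

struct tph_model {
	bool enabled;                /* virtualized TPH enable state */
	enum st_location loc;        /* location selected when TPH is enabled */
	uint16_t shadow[ST_ENTRIES]; /* internal table of translated STs */
	uint16_t device[ST_ENTRIES]; /* stand-in for the device-side ST table */
};

/* Copy one internal entry to the device location, if there is one. */
void tph_sync_entry(struct tph_model *m, unsigned int idx)
{
	if (m->loc != ST_LOC_NONE)
		m->device[idx] = m->shadow[idx];
}

/* SET: always update the internal table; touch the device only when enabled. */
int tph_set_st(struct tph_model *m, unsigned int idx, uint16_t st)
{
	assert(m);
	if (idx >= ST_ENTRIES)
		return -1;
	m->shadow[idx] = st;
	if (m->enabled)
		tph_sync_entry(m, idx);
	return 0;
}

/* Trapped config write that enables TPH: flush the table, then enable. */
void tph_enable(struct tph_model *m, enum st_location loc)
{
	assert(m);
	m->loc = loc;
	for (unsigned int i = 0; i < ST_ENTRIES; i++)
		tph_sync_entry(m, i);
	m->enabled = true;
}
```

With `ST_LOC_NONE` the internal table is still updated, so a later GET can return the translated value even though nothing is written to the device.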


* Re: [PATCH v8 4/7] vfio/pci: Add PCIe TPH interface with capability query
  2026-05-08 22:40   ` Alex Williamson
@ 2026-05-09  3:28     ` fengchengwen
  2026-05-11  4:36       ` Alex Williamson
  0 siblings, 1 reply; 16+ messages in thread
From: fengchengwen @ 2026-05-09  3:28 UTC (permalink / raw)
  To: Alex Williamson
  Cc: jgg, wathsala.vithanage, helgaas, wei.huang2, wangzhou1,
	wangyushan12, liuyonglong, kvm, linux-pci

On 5/9/2026 6:40 AM, Alex Williamson wrote:
> On Fri, 8 May 2026 14:40:50 +0800
> Chengwen Feng <fengchengwen@huawei.com> wrote:
> 
>> Add VFIO_DEVICE_PCI_TPH IOCTL to allow userspace to query device TPH
>> capabilities, supported modes, and steering tag table information.
>>
>> Add module parameter 'enable_unsafe_tph_ds_mode' to restrict unsafe
>> device-specific TPH mode to trusted userspace only.
>>
>> Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
>> ---
>>  drivers/vfio/pci/vfio_pci.c      |  13 ++-
>>  drivers/vfio/pci/vfio_pci_core.c |  56 ++++++++++++-
>>  include/linux/vfio_pci_core.h    |   3 +-
>>  include/uapi/linux/vfio.h        | 133 +++++++++++++++++++++++++++++++
>>  4 files changed, 202 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
>> index 0c771064c0b8..40bf5aa9fd0b 100644
>> --- a/drivers/vfio/pci/vfio_pci.c
>> +++ b/drivers/vfio/pci/vfio_pci.c
>> @@ -60,6 +60,12 @@ static bool disable_denylist;
>>  module_param(disable_denylist, bool, 0444);
>>  MODULE_PARM_DESC(disable_denylist, "Disable use of device denylist. Disabling the denylist allows binding to devices with known errata that may lead to exploitable stability or security issues when accessed by untrusted users.");
>>  
>> +#ifdef CONFIG_PCIE_TPH
>> +static bool enable_unsafe_tph_ds_mode;
>> +module_param(enable_unsafe_tph_ds_mode, bool, 0444);
>> +MODULE_PARM_DESC(enable_unsafe_tph_ds_mode, "Enable UNSAFE TPH device-specific (DS) mode. This mode provides weak isolation, cannot be safely used for virtual machines. If you do not know what this is for, step away. (default: false)");
>> +#endif
>> +
> 
> Why is the "unsafe" aspect of this keyed on mode rather than storage
> location?
> 
> Currently the user cannot enable TPH, the capability is read-only, but
> the user does have direct access to the MSI-X table.  We rely on an
> agreement that the user needs to use SET_IRQS to allocate host vectors
> and we use interrupt remapping as protection against abuse, but there's
> no mediation of STs written directly to the MSI-X table.  If the device
> supports IV mode with ST in the MSI-X table, nothing prevents the user
> from writing those ST entries directly to the MSI-X table.  Therefore
> doesn't it have the same security concern as DS mode?


Agreed. From this perspective, STs placed in the MSI-X table are unsafe as well,
so TPH is unsafe as a whole, not just DS mode.

> 
> Further, config space lives in the device and various devices are known
> to have alternate means for accessing their config space.
> Virtualization of config space is more to present the device in the VM
> address space and bridge features between guest and host.  It's not
> great as a security barrier.
> 
> Maybe it's really neither the mode nor storage location, and we need to
> decide if TPH as a whole introduces any new security considerations.

I will adjust the module parameter to control TPH globally instead of only DS mode.

> It seems arguable whether we can actually prevent a device from
> including arbitrary STs on TLPs in any case and maybe we're really only
> exposing a curated programming interface.
> 
> ...
>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>> index 5de618a3a5ee..81da2bd0c21b 100644
>> --- a/include/uapi/linux/vfio.h
>> +++ b/include/uapi/linux/vfio.h

...

>> +#define VFIO_DEVICE_PCI_TPH	_IO(VFIO_TYPE, VFIO_BASE + 22)
>> +
> 
> This seems like the wrong shape to me and introduces yet another ioctl
> multiplexer.  We already have that via the device feature interface.
> I'd propose this only needs one new DEVICE_FEATURE ioctl, TPH_ST.  The
> uAPI would look like:
> 
> struct vfio_device_feature_tph_st {
> 	__u32 flags;
> #define VFIO_TPH_ST_MEM_TYPE_PM	(1 << 0)
> 	__u16 index;
> 	__u16 count;
> 	__u32 data[]; /* host CPU# on SET, ST value on GET */
> };
> 
> The user can SET multiple STs at once that have the same mem_type
> (assuming that's a reasonable limitation).  On SET, each {cpu#,

Agreed, using the same mem_type for a batch is a good idea.

Since a single call may need to set multiple indexes, how about:

struct vfio_pci_tph_entry {
	__u32 cpu;
	__u16 val;	/* ST index on SET, ST value on GET */
	__u16 reserved;
};

struct vfio_device_feature_tph_st {
	__u32 op;
#define VFIO_TPH_OP_GET_ST	0
#define VFIO_TPH_OP_SET_ST	1
	__u32 flags;
#define VFIO_TPH_ST_MEM_TYPE_PM	(1 << 0)
	__u16 count;
	__u16 reserved1;
	struct vfio_pci_tph_entry ents[];
};

> mem_type} is translated to a host value and stored internally.  A GET

Do we need to store them internally? How about writing directly to the device?

> returns that translated ST value for DS use cases.
> 
> The user can use PROBE to determine if this feature is available.
> 
> We already provide the TPH capability read-only in config space, we can
> use that rather than an explicit INFO/GET_CAP interface.

OK

> 
> When the feature is available, the TPH control register is virtualized.
> On enabling TPH via config space, vfio will store the translated ST
> values to the appropriate location, or none, and enable TPH.  On SET
> while already enabled, vfio will update both the internal table and the
> device location (or none).  Thanks,
> 
> Alex



* Re: [PATCH v8 4/7] vfio/pci: Add PCIe TPH interface with capability query
  2026-05-09  3:28     ` fengchengwen
@ 2026-05-11  4:36       ` Alex Williamson
  0 siblings, 0 replies; 16+ messages in thread
From: Alex Williamson @ 2026-05-11  4:36 UTC (permalink / raw)
  To: fengchengwen
  Cc: jgg, wathsala.vithanage, helgaas, wei.huang2, wangzhou1,
	wangyushan12, liuyonglong, kvm, linux-pci, alex

On Sat, 9 May 2026 11:28:03 +0800
fengchengwen <fengchengwen@huawei.com> wrote:
> On 5/9/2026 6:40 AM, Alex Williamson wrote:
> > On Fri, 8 May 2026 14:40:50 +0800
> > Chengwen Feng <fengchengwen@huawei.com> wrote:
> >> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> >> index 0c771064c0b8..40bf5aa9fd0b 100644
> >> --- a/drivers/vfio/pci/vfio_pci.c
> >> +++ b/drivers/vfio/pci/vfio_pci.c
> >> @@ -60,6 +60,12 @@ static bool disable_denylist;
> >>  module_param(disable_denylist, bool, 0444);
> >>  MODULE_PARM_DESC(disable_denylist, "Disable use of device denylist. Disabling the denylist allows binding to devices with known errata that may lead to exploitable stability or security issues when accessed by untrusted users.");
> >>  
> >> +#ifdef CONFIG_PCIE_TPH
> >> +static bool enable_unsafe_tph_ds_mode;
> >> +module_param(enable_unsafe_tph_ds_mode, bool, 0444);
> >> +MODULE_PARM_DESC(enable_unsafe_tph_ds_mode, "Enable UNSAFE TPH device-specific (DS) mode. This mode provides weak isolation, cannot be safely used for virtual machines. If you do not know what this is for, step away. (default: false)");
> >> +#endif
> >> +  
> > 
> > Why is the "unsafe" aspect of this keyed on mode rather than storage
> > location?
> > 
> > Currently the user cannot enable TPH, the capability is read-only, but
> > the user does have direct access to the MSI-X table.  We rely on an
> > agreement that the user needs to use SET_IRQS to allocate host vectors
> > and we use interrupt remapping as protection against abuse, but there's
> > no mediation of STs written directly to the MSI-X table.  If the device
> > supports IV mode with ST in the MSI-X table, nothing prevents the user
> > from writing those ST entries directly to the MSI-X table.  Therefore
> > doesn't it have the same security concern as DS mode?  
> 
> 
> Agreed. From this perspective, STs placed in the MSI-X table are unsafe as well,
> so TPH is unsafe as a whole, not just DS mode.
> 
> > 
> > Further, config space lives in the device and various devices are known
> > to have alternate means for accessing their config space.
> > Virtualization of config space is more to present the device in the VM
> > address space and bridge features between guest and host.  It's not
> > great as a security barrier.
> > 
> > Maybe it's really neither the mode nor storage location, and we need to
> > decide if TPH as a whole introduces any new security considerations.  
> 
> I will adjust the module parameter to control TPH globally instead of
> only DS mode.

I'm not convinced that's the right solution either.  It's a usage
barrier if the TPH capability isn't exposed R/W, but does it guarantee
the device won't make use of such TLPs anyway?  If the device has
config space backdoors or can otherwise be manipulated to send these
hints, a vfio-pci module option is just security theater.  It's also a
burden for users and for each variant driver for devices supporting TPH.

We do however need to consider how changing the behavior of the
capability affects existing users, like QEMU.  We may need to consider
two device features, one that only supports SET with no payload to
enable virtualized access to the TPH capability and another that
provides the ST handling interface.

> > It seems arguable whether we can actually prevent a device from
> > including arbitrary STs on TLPs in any case and maybe we're really
> > only exposing a curated programming interface.
> > 
> > ...  
> >> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> >> index 5de618a3a5ee..81da2bd0c21b 100644
> >> --- a/include/uapi/linux/vfio.h
> >> +++ b/include/uapi/linux/vfio.h  
> 
> ...
> 
> >> +#define VFIO_DEVICE_PCI_TPH	_IO(VFIO_TYPE, VFIO_BASE + 22)
> >> +  
> > 
> > This seems like the wrong shape to me and introduces yet another
> > ioctl multiplexer.  We already have that via the device feature
> > interface. I'd propose this only needs one new DEVICE_FEATURE
> > ioctl, TPH_ST.  The uAPI would look like:
> > 
> > struct vfio_device_feature_tph_st {
> > 	__u32 flags;
> > #define VFIO_TPH_ST_MEM_TYPE_PM	(1 << 0)
> > 	__u16 index;
> > 	__u16 count;
> > 	__u32 data[]; /* host CPU# on SET, ST value on GET */
> > };
> > 
> > The user can SET multiple STs at once that have the same mem_type
> > (assuming that's a reasonable limitation).  On SET, each {cpu#,  
> 
> Agreed, using the same mem_type for a batch is a good idea.
> 
> Since a single call may need to set multiple indexes, how about:
> 
> struct vfio_pci_tph_entry {
> 	__u32 cpu;
> 	__u16 val;	/* ST index on SET, ST value on GET */
> 	__u16 reserved;
> };

In the structure I proposed the user can set/get contiguous index
ranges according to index and count, where the data field can then just
be a u32 array.  Why does the user need to be able to set/get
arbitrary, non-contiguous indexes?

In a VM use case we'd likely be trapping individual writes, therefore
we'd be intercepting one index at a time.
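
To make the contiguous-range point concrete, here is a small userspace sketch that builds the payload for a range and shows that a trapped single-entry write is simply the count == 1 case. The struct mirrors the layout I proposed, with `uint32_t`/`uint16_t` in place of the kernel types so it builds standalone; `tph_st_payload_size()`, `tph_st_build_range()` and `tph_st_build_one()` are hypothetical helpers, not existing API.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Userspace mirror of the layout proposed earlier in this thread. */
struct vfio_device_feature_tph_st {
	uint32_t flags;
	uint16_t index;
	uint16_t count;
	uint32_t data[]; /* host CPU# on SET, ST value on GET */
};

/* Bytes needed for a payload carrying 'count' entries. */
size_t tph_st_payload_size(uint16_t count)
{
	return sizeof(struct vfio_device_feature_tph_st) +
	       (size_t)count * sizeof(uint32_t);
}

/* Build a SET payload covering [index, index + count). Caller frees. */
struct vfio_device_feature_tph_st *
tph_st_build_range(uint32_t flags, uint16_t index, uint16_t count,
		   const uint32_t *cpus)
{
	struct vfio_device_feature_tph_st *p;

	assert(cpus);
	p = calloc(1, tph_st_payload_size(count));
	if (!p)
		return NULL;
	p->flags = flags;
	p->index = index;
	p->count = count;
	for (uint16_t i = 0; i < count; i++)
		p->data[i] = cpus[i];
	return p;
}

/* A trapped single-entry write is just the count == 1 case. */
struct vfio_device_feature_tph_st *
tph_st_build_one(uint32_t flags, uint16_t index, uint32_t cpu)
{
	return tph_st_build_range(flags, index, 1, &cpu);
}
```

A user that really needs non-contiguous indexes can issue several count == 1 calls, at the cost of one ioctl per entry.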
 
> struct vfio_device_feature_tph_st {
> 	__u32 op;
> #define VFIO_TPH_OP_GET_ST	0
> #define VFIO_TPH_OP_SET_ST	1

The vfio device feature interface already handles set/get, we don't
need this.
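
As a rough sketch of why the op field is redundant: the direction already lives in the flags word of the outer device feature header. The flag values below are local copies of the existing VFIO_DEVICE_FEATURE layout as I understand it (check them against linux/vfio.h); `FEATURE_TPH_ST` is a made-up placeholder number, and `feature_hdr`, `tph_dir` and `tph_feature_hdr()` are hypothetical names for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Local copies of the existing VFIO_DEVICE_FEATURE flag layout. */
#define FEATURE_MASK   0xffffu
#define FEATURE_GET    (1u << 16)
#define FEATURE_SET    (1u << 17)
#define FEATURE_PROBE  (1u << 18)

/* Placeholder feature number; the real one would be assigned in the uAPI. */
#define FEATURE_TPH_ST 0x0fffu

/* Mirror of the fixed part of struct vfio_device_feature. */
struct feature_hdr {
	uint32_t argsz;
	uint32_t flags;
};

enum tph_dir { TPH_DIR_GET, TPH_DIR_SET };

/* Fill the outer header: direction is a flag, so the payload needs no op. */
void tph_feature_hdr(struct feature_hdr *h, enum tph_dir dir, int probe,
		     uint32_t payload_bytes)
{
	assert(h);
	h->flags = FEATURE_TPH_ST |
		   (dir == TPH_DIR_SET ? FEATURE_SET : FEATURE_GET) |
		   (probe ? FEATURE_PROBE : 0);
	/* In this sketch a probe carries no payload, only the header. */
	h->argsz = (uint32_t)sizeof(*h) + (probe ? 0 : payload_bytes);
}
```

The same header with PROBE set is how userspace would test for the feature before attempting a real SET or GET.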

> 	__u32 flags;
> #define VFIO_TPH_ST_MEM_TYPE_PM	(1 << 0)
> 	__u16 count;
> 	__u16 reserved1;
> 	struct vfio_pci_tph_entry ents[];
> };
> 
> > mem_type} is translated to a host value and stored internally.  A
> > GET  
> 
> Do we need to store them internally? How about writing directly to the device?

Perhaps we should, but I think we need a flag to indicate whether we're
virtualizing a write to hardware (capability or MSI-X table) or SET'ing
an index that userspace will later retrieve for DS mode via GET.
Otherwise we don't know until the mode bits are written which location,
if any, the capability is actually using.  For example the device can
support either MSI-X or capability locations, but enable DS mode.  For
consistency though, it might make sense to write to an internal table
regardless.  Thanks,

Alex


Thread overview: 16+ messages
2026-05-08  6:40 [PATCH v8 0/7] vfio/pci: Add PCIe TPH support Chengwen Feng
2026-05-08  6:40 ` [PATCH v8 1/7] PCI/TPH: Fix pcie_tph_get_st_table_loc() field extraction Chengwen Feng
2026-05-08  6:40 ` [PATCH v8 2/7] PCI/TPH: Export pcie_tph_get_st_modes() for external use Chengwen Feng
2026-05-08 19:02   ` sashiko-bot
2026-05-08  6:40 ` [PATCH v8 3/7] PCI/TPH: Fix pcie_tph_get_st_table_size() for MSI-X table location Chengwen Feng
2026-05-08 19:31   ` sashiko-bot
2026-05-08  6:40 ` [PATCH v8 4/7] vfio/pci: Add PCIe TPH interface with capability query Chengwen Feng
2026-05-08 20:03   ` sashiko-bot
2026-05-08 22:40   ` Alex Williamson
2026-05-09  3:28     ` fengchengwen
2026-05-11  4:36       ` Alex Williamson
2026-05-08  6:40 ` [PATCH v8 5/7] vfio/pci: Add PCIe TPH enable/disable support Chengwen Feng
2026-05-08 20:46   ` sashiko-bot
2026-05-08  6:40 ` [PATCH v8 6/7] vfio/pci: Add PCIe TPH GET_ST interface Chengwen Feng
2026-05-08  6:40 ` [PATCH v8 7/7] vfio/pci: Add PCIe TPH SET_ST interface Chengwen Feng
2026-05-08 21:49   ` sashiko-bot
