* [PATCH v14 0/8] vfio/pci: Add PCIe TPH support
@ 2026-05-28 12:46 Chengwen Feng
2026-05-28 12:46 ` [PATCH v14 1/8] PCI/TPH: Fix pcie_tph_get_st_table_loc() field extraction Chengwen Feng
` (8 more replies)
0 siblings, 9 replies; 22+ messages in thread
From: Chengwen Feng @ 2026-05-28 12:46 UTC (permalink / raw)
To: alex, jgg
Cc: wathsala.vithanage, helgaas, wei.huang2, zhipingz, wangzhou1,
wangyushan12, liuyonglong, kvm, linux-pci
This patchset enables userspace control over PCIe TPH steering tags,
motivated by the following considerations:
1. Why userspace needs the capability to control steering tags:
When PCIe devices are fully owned by userspace workloads such as DPDK
and SPDK, only userspace has full knowledge of core binding policies
and traffic distribution strategies. Without this series, userspace
cannot enable TPH or configure steering tags, leaving built-in PCIe
performance optimizations unused in high-throughput polling I/O
scenarios.
2. Why this interface must be implemented in VFIO:
VFIO is the standard, secure community solution for granting full
PCIe device ownership to userspace. Existing kernel TPH interfaces
are designed purely for in-kernel drivers. For user-owned devices,
VFIO provides the only isolated and correct path to expose per-device
TPH management.
TPH supports both IV and DS modes. Since both modes could introduces
cross-VM isolation risks such as untrusted guests programming arbitrary
steering tags to impact other domains:
1. If ST location in MSI-X table, untrusted guests may program the
MSI-X table.
2. If ST don't locate in MSI-X or CAP, untrusted guests may program the
device-specific register.
So a new module parameter `enable_unsafe_tph` is added. It defaults to
off, and blocks all unsafe TPH operations when disabled.
Based on earlier RFC work by Wathsala Vithanage
---
v14:
- Return PCI_TPH_LOC_NONE when !CONFIG_PCIE_TPH accord Alex's comment
- Fix Sashiko comments:
- Clear ST shadow state across user session
- Fix out-of-bounds byte masking in vfio_pci_tph_config_read
v13:
- Fix Alex's comments:
- Add virtualize of TPH request type
- Adopt two feature
v12:
- Fix Alex's comments:
- Support enable NS_MODE from userspace
- Remove restriction of get/st operation must enable TPH by impl
shadow ST table scheme
- Refine uAPI definition
Chengwen Feng (7):
PCI/TPH: Fix pcie_tph_get_st_table_loc() field extraction
PCI/TPH: Cache TPH requester capability at probe time
PCI/TPH: Add requester selection policy to pcie_enable_tph()
PCI/TPH: Add requester policy to pcie_tph_get_cpu_st()
vfio/pci: Add VFIO_DEVICE_FEATURE_TPH_ST_CONFIG for PCIe TPH ST
configuration
vfio/pci: Add VFIO_DEVICE_FEATURE_TPH_CPU_ST to query TPH steering tag
vfio/pci: Virtualize PCIe TPH capability registers
Zhiping Zhang (1):
PCI/TPH: expose the enabled TPH requester type
Documentation/PCI/tph.rst | 22 +-
drivers/net/ethernet/broadcom/bnxt/bnxt.c | 4 +-
.../net/ethernet/mellanox/mlx5/core/lib/st.c | 5 +-
drivers/pci/tph.c | 116 +++++++---
drivers/vfio/pci/vfio_pci.c | 13 +-
drivers/vfio/pci/vfio_pci_config.c | 78 +++++++
drivers/vfio/pci/vfio_pci_core.c | 199 +++++++++++++++++-
include/linux/pci-tph.h | 29 ++-
include/linux/pci.h | 1 +
include/linux/vfio_pci_core.h | 6 +-
include/uapi/linux/vfio.h | 49 +++++
11 files changed, 477 insertions(+), 45 deletions(-)
--
2.17.1
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH v14 1/8] PCI/TPH: Fix pcie_tph_get_st_table_loc() field extraction
2026-05-28 12:46 [PATCH v14 0/8] vfio/pci: Add PCIe TPH support Chengwen Feng
@ 2026-05-28 12:46 ` Chengwen Feng
2026-05-28 12:46 ` [PATCH v14 2/8] PCI/TPH: Cache TPH requester capability at probe time Chengwen Feng
` (7 subsequent siblings)
8 siblings, 0 replies; 22+ messages in thread
From: Chengwen Feng @ 2026-05-28 12:46 UTC (permalink / raw)
To: alex, jgg
Cc: wathsala.vithanage, helgaas, wei.huang2, zhipingz, wangzhou1,
wangyushan12, liuyonglong, kvm, linux-pci
pcie_tph_get_st_table_loc() incorrectly uses FIELD_GET(), which shifts the
field value to bit 0. But the function is designed to return raw
PCI_TPH_LOC_* values as defined in the function comment.
This causes incorrect ST table location detection. Fix it by using bitwise
AND with PCI_TPH_CAP_LOC_MASK to return the unshifted field value matching
the function specification.
This doesn't make a difference to mlx5_st_create(), the lone external
caller, because it only checks for PCI_TPH_LOC_NONE (0), but will be needed
for callers that check for PCI_TPH_LOC_CAP or PCI_TPH_LOC_MSIX.
Also add tph_cap validation for pcie_tph_get_st_table_loc() to prevent
invalid PCI configuration space access when TPH is not supported. Add stub
functions for pcie_tph_get_st_table_size() and pcie_tph_get_st_table_loc()
when !CONFIG_PCIE_TPH.
Fixes: d2e8a34876ce ("PCI/TPH: Add Steering Tag support")
Cc: stable@vger.kernel.org
Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
Reviewed-by: Alex Williamson <alex.williamson@nvidia.com>
Reviewed-by: Bjorn Helgaas <bhelgaas@google.com>
---
drivers/pci/tph.c | 12 +++++-------
include/linux/pci-tph.h | 5 +++++
2 files changed, 10 insertions(+), 7 deletions(-)
diff --git a/drivers/pci/tph.c b/drivers/pci/tph.c
index 91145e8d9d95..bef3a55539c4 100644
--- a/drivers/pci/tph.c
+++ b/drivers/pci/tph.c
@@ -166,11 +166,14 @@ static u8 get_st_modes(struct pci_dev *pdev)
*/
u32 pcie_tph_get_st_table_loc(struct pci_dev *pdev)
{
- u32 reg;
+ u32 reg = 0;
+
+ if (!pdev->tph_cap)
+ return PCI_TPH_LOC_NONE;
pci_read_config_dword(pdev, pdev->tph_cap + PCI_TPH_CAP, ®);
- return FIELD_GET(PCI_TPH_CAP_LOC_MASK, reg);
+ return reg & PCI_TPH_CAP_LOC_MASK;
}
EXPORT_SYMBOL(pcie_tph_get_st_table_loc);
@@ -185,9 +188,6 @@ u16 pcie_tph_get_st_table_size(struct pci_dev *pdev)
/* Check ST table location first */
loc = pcie_tph_get_st_table_loc(pdev);
-
- /* Convert loc to match with PCI_TPH_LOC_* defined in pci_regs.h */
- loc = FIELD_PREP(PCI_TPH_CAP_LOC_MASK, loc);
if (loc != PCI_TPH_LOC_CAP)
return 0;
@@ -316,8 +316,6 @@ int pcie_tph_set_st_entry(struct pci_dev *pdev, unsigned int index, u16 tag)
set_ctrl_reg_req_en(pdev, PCI_TPH_REQ_DISABLE);
loc = pcie_tph_get_st_table_loc(pdev);
- /* Convert loc to match with PCI_TPH_LOC_* */
- loc = FIELD_PREP(PCI_TPH_CAP_LOC_MASK, loc);
switch (loc) {
case PCI_TPH_LOC_MSIX:
diff --git a/include/linux/pci-tph.h b/include/linux/pci-tph.h
index be68cd17f2f8..6f02b020d7d7 100644
--- a/include/linux/pci-tph.h
+++ b/include/linux/pci-tph.h
@@ -8,6 +8,7 @@
*/
#ifndef LINUX_PCI_TPH_H
#define LINUX_PCI_TPH_H
+#include <linux/pci.h>
/*
* According to the ECN for PCI Firmware Spec, Steering Tag can be different
@@ -41,6 +42,10 @@ static inline int pcie_tph_get_cpu_st(struct pci_dev *dev,
static inline void pcie_disable_tph(struct pci_dev *pdev) { }
static inline int pcie_enable_tph(struct pci_dev *pdev, int mode)
{ return -EINVAL; }
+static inline u16 pcie_tph_get_st_table_size(struct pci_dev *pdev)
+{ return 0; }
+static inline u32 pcie_tph_get_st_table_loc(struct pci_dev *pdev)
+{ return PCI_TPH_LOC_NONE; }
#endif
#endif /* LINUX_PCI_TPH_H */
--
2.17.1
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [PATCH v14 2/8] PCI/TPH: Cache TPH requester capability at probe time
2026-05-28 12:46 [PATCH v14 0/8] vfio/pci: Add PCIe TPH support Chengwen Feng
2026-05-28 12:46 ` [PATCH v14 1/8] PCI/TPH: Fix pcie_tph_get_st_table_loc() field extraction Chengwen Feng
@ 2026-05-28 12:46 ` Chengwen Feng
2026-05-28 12:46 ` [PATCH v14 3/8] PCI/TPH: Add requester selection policy to pcie_enable_tph() Chengwen Feng
` (6 subsequent siblings)
8 siblings, 0 replies; 22+ messages in thread
From: Chengwen Feng @ 2026-05-28 12:46 UTC (permalink / raw)
To: alex, jgg
Cc: wathsala.vithanage, helgaas, wei.huang2, zhipingz, wangzhou1,
wangyushan12, liuyonglong, kvm, linux-pci
Calculate the negotiated TPH requester type from device and root port
capabilities once in pci_tph_init().
Add tph_ext_requester flag to cache whether the device is allowed to
issue Extended TPH requests after topology negotiation. If the final
requester type is disabled, clear TPH capability to prevent usage.
Simplify pcie_enable_tph() by using the cached requester capability
instead of recalculating every time.
Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
---
drivers/pci/tph.c | 43 +++++++++++++++++++++++++------------------
include/linux/pci.h | 1 +
2 files changed, 26 insertions(+), 18 deletions(-)
diff --git a/drivers/pci/tph.c b/drivers/pci/tph.c
index bef3a55539c4..b4ecc510033a 100644
--- a/drivers/pci/tph.c
+++ b/drivers/pci/tph.c
@@ -384,7 +384,6 @@ int pcie_enable_tph(struct pci_dev *pdev, int mode)
{
u32 reg;
u8 dev_modes;
- u8 rp_req_type;
/* Honor "notph" kernel parameter */
if (pci_tph_disabled)
@@ -404,23 +403,8 @@ int pcie_enable_tph(struct pci_dev *pdev, int mode)
pdev->tph_mode = mode;
- /* Get req_type supported by device and its Root Port */
- pci_read_config_dword(pdev, pdev->tph_cap + PCI_TPH_CAP, ®);
- if (FIELD_GET(PCI_TPH_CAP_EXT_TPH, reg))
- pdev->tph_req_type = PCI_TPH_REQ_EXT_TPH;
- else
- pdev->tph_req_type = PCI_TPH_REQ_TPH_ONLY;
-
- /* Check if the device is behind a Root Port */
- if (pci_pcie_type(pdev) != PCI_EXP_TYPE_RC_END) {
- rp_req_type = get_rp_completer_type(pdev);
-
- /* Final req_type is the smallest value of two */
- pdev->tph_req_type = min(pdev->tph_req_type, rp_req_type);
- }
-
- if (pdev->tph_req_type == PCI_TPH_REQ_DISABLE)
- return -EINVAL;
+ pdev->tph_req_type = pdev->tph_ext_requester ? PCI_TPH_REQ_EXT_TPH :
+ PCI_TPH_REQ_TPH_ONLY;
/* Write them into TPH control register */
pci_read_config_dword(pdev, pdev->tph_cap + PCI_TPH_CTRL, ®);
@@ -510,13 +494,36 @@ void pci_no_tph(void)
void pci_tph_init(struct pci_dev *pdev)
{
+ u8 tph_req_type, rp_req_type;
int num_entries;
u32 save_size;
+ u32 reg = 0;
pdev->tph_cap = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_TPH);
if (!pdev->tph_cap)
return;
+ /* Get req_type supported by device and its Root Port */
+ pci_read_config_dword(pdev, pdev->tph_cap + PCI_TPH_CAP, ®);
+ if (FIELD_GET(PCI_TPH_CAP_EXT_TPH, reg))
+ tph_req_type = PCI_TPH_REQ_EXT_TPH;
+ else
+ tph_req_type = PCI_TPH_REQ_TPH_ONLY;
+
+ /* Check if the device is behind a Root Port */
+ if (pci_pcie_type(pdev) != PCI_EXP_TYPE_RC_END) {
+ rp_req_type = get_rp_completer_type(pdev);
+ /* Final req_type is the smallest value of two */
+ tph_req_type = min(tph_req_type, rp_req_type);
+ }
+
+ if (tph_req_type == PCI_TPH_REQ_DISABLE) {
+ pdev->tph_cap = 0;
+ return;
+ }
+
+ pdev->tph_ext_requester = !!(tph_req_type == PCI_TPH_REQ_EXT_TPH);
+
num_entries = pcie_tph_get_st_table_size(pdev);
save_size = sizeof(u32) + num_entries * sizeof(u16);
pci_add_ext_cap_save_buffer(pdev, PCI_EXT_CAP_ID_TPH, save_size);
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 2c4454583c11..0f4361aca794 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -509,6 +509,7 @@ struct pci_dev {
unsigned int rom_bar_overlap:1; /* ROM BAR disable broken */
unsigned int rom_attr_enabled:1; /* Display of ROM attribute enabled? */
unsigned int non_mappable_bars:1; /* BARs can't be mapped to user-space */
+ unsigned int tph_ext_requester:1; /* Can issue Extended TPH requests */
pci_dev_flags_t dev_flags;
atomic_t enable_cnt; /* pci_enable_device has been called */
--
2.17.1
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [PATCH v14 3/8] PCI/TPH: Add requester selection policy to pcie_enable_tph()
2026-05-28 12:46 [PATCH v14 0/8] vfio/pci: Add PCIe TPH support Chengwen Feng
2026-05-28 12:46 ` [PATCH v14 1/8] PCI/TPH: Fix pcie_tph_get_st_table_loc() field extraction Chengwen Feng
2026-05-28 12:46 ` [PATCH v14 2/8] PCI/TPH: Cache TPH requester capability at probe time Chengwen Feng
@ 2026-05-28 12:46 ` Chengwen Feng
2026-05-28 12:46 ` [PATCH v14 4/8] PCI/TPH: Add requester policy to pcie_tph_get_cpu_st() Chengwen Feng
` (5 subsequent siblings)
8 siblings, 0 replies; 22+ messages in thread
From: Chengwen Feng @ 2026-05-28 12:46 UTC (permalink / raw)
To: alex, jgg
Cc: wathsala.vithanage, helgaas, wei.huang2, zhipingz, wangzhou1,
wangyushan12, liuyonglong, kvm, linux-pci
Introduce enum tph_req_policy to support auto, standard (8-bit ST) and
extended (16-bit ST) TPH requester selection.
Extend pcie_enable_tph() to take the new policy argument, implement
selection logic and add strict input validation. Update function prototype,
inline fallback stub and kernel-doc to reflect new parameters and behavior.
Refresh Documentation/PCI/tph.rst to document the updated function
signature and new requester policy options.
Adapt existing call sites in bnxt and mlx5 drivers to use TPH_REQ_AUTO
to preserve original runtime behavior.
Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
---
Documentation/PCI/tph.rst | 12 +++++--
drivers/net/ethernet/broadcom/bnxt/bnxt.c | 2 +-
.../net/ethernet/mellanox/mlx5/core/lib/st.c | 2 +-
drivers/pci/tph.c | 31 ++++++++++++++-----
include/linux/pci-tph.h | 16 ++++++++--
5 files changed, 50 insertions(+), 13 deletions(-)
diff --git a/Documentation/PCI/tph.rst b/Documentation/PCI/tph.rst
index b6cf22b9bd90..a5f14d17977c 100644
--- a/Documentation/PCI/tph.rst
+++ b/Documentation/PCI/tph.rst
@@ -47,15 +47,23 @@ Manage TPH
To enable TPH for a device, use the following function::
- int pcie_enable_tph(struct pci_dev *pdev, int mode);
+ int pcie_enable_tph(struct pci_dev *pdev, int mode,
+ enum tph_req_policy req_policy);
This function enables TPH support for device with a specific ST mode.
-Current supported modes include:
+
+Supported modes include:
* PCI_TPH_ST_NS_MODE - NO ST Mode
* PCI_TPH_ST_IV_MODE - Interrupt Vector Mode
* PCI_TPH_ST_DS_MODE - Device Specific Mode
+Supported requester policies:
+
+ * TPH_REQ_AUTO - Auto select by capability
+ * TPH_REQ_STANDARD - Force standard TPH (8-bit Steering Tag)
+ * TPH_REQ_EXTENDED - Force extended TPH (16-bit Steering Tag)
+
`pcie_enable_tph()` checks whether the requested mode is actually
supported by the device before enabling. The device driver can figure out
which TPH mode is supported and can be properly enabled based on the
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 008c34cff7b4..1f8d3e728a9f 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -11846,7 +11846,7 @@ static int bnxt_request_irq(struct bnxt *bp)
#endif
/* Enable TPH support as part of IRQ request */
- rc = pcie_enable_tph(bp->pdev, PCI_TPH_ST_IV_MODE);
+ rc = pcie_enable_tph(bp->pdev, PCI_TPH_ST_IV_MODE, TPH_REQ_AUTO);
if (!rc)
bp->tph_mode = PCI_TPH_ST_IV_MODE;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/st.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/st.c
index 997be91f0a13..26ce59d16785 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/st.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/st.c
@@ -55,7 +55,7 @@ struct mlx5_st *mlx5_st_create(struct mlx5_core_dev *dev)
}
/* The OS doesn't support ST */
- ret = pcie_enable_tph(pdev, PCI_TPH_ST_DS_MODE);
+ ret = pcie_enable_tph(pdev, PCI_TPH_ST_DS_MODE, TPH_REQ_AUTO);
if (ret)
return NULL;
diff --git a/drivers/pci/tph.c b/drivers/pci/tph.c
index b4ecc510033a..f7758445988e 100644
--- a/drivers/pci/tph.c
+++ b/drivers/pci/tph.c
@@ -373,14 +373,21 @@ EXPORT_SYMBOL(pcie_disable_tph);
* - PCI_TPH_ST_IV_MODE: Interrupt Vector Mode
* - PCI_TPH_ST_DS_MODE: Device Specific Mode
*
+ * @req_policy: TPH requester selection policy
+ *
+ * - TPH_REQ_AUTO: Auto select by capability
+ * - TPH_REQ_STANDARD: Force standard TPH (8-bit ST)
+ * - TPH_REQ_EXTENDED: Force extended TPH (16-bit ST)
+ *
* Check whether the mode is actually supported by the device before enabling
- * and return an error if not. Additionally determine what types of requests,
- * TPH or extended TPH, can be issued by the device based on its TPH requester
- * capability and the Root Port's completer capability.
+ * and return an error if not. Select TPH requester type according to
+ * @req_policy and negotiated capability between device and root port. Return
+ * error for invalid policy or unsupported extended TPH requests.
*
* Return: 0 on success, otherwise negative value (-errno)
*/
-int pcie_enable_tph(struct pci_dev *pdev, int mode)
+int pcie_enable_tph(struct pci_dev *pdev, int mode,
+ enum tph_req_policy req_policy)
{
u32 reg;
u8 dev_modes;
@@ -401,10 +408,20 @@ int pcie_enable_tph(struct pci_dev *pdev, int mode)
if (!((1 << mode) & dev_modes))
return -EINVAL;
- pdev->tph_mode = mode;
+ if (req_policy == TPH_REQ_AUTO) {
+ pdev->tph_req_type = pdev->tph_ext_requester ?
+ PCI_TPH_REQ_EXT_TPH : PCI_TPH_REQ_TPH_ONLY;
+ } else if (req_policy == TPH_REQ_STANDARD) {
+ pdev->tph_req_type = PCI_TPH_REQ_TPH_ONLY;
+ } else if (req_policy == TPH_REQ_EXTENDED) {
+ if (!pdev->tph_ext_requester)
+ return -EINVAL;
+ pdev->tph_req_type = PCI_TPH_REQ_EXT_TPH;
+ } else {
+ return -EINVAL;
+ }
- pdev->tph_req_type = pdev->tph_ext_requester ? PCI_TPH_REQ_EXT_TPH :
- PCI_TPH_REQ_TPH_ONLY;
+ pdev->tph_mode = mode;
/* Write them into TPH control register */
pci_read_config_dword(pdev, pdev->tph_cap + PCI_TPH_CTRL, ®);
diff --git a/include/linux/pci-tph.h b/include/linux/pci-tph.h
index 6f02b020d7d7..eb7f165b2570 100644
--- a/include/linux/pci-tph.h
+++ b/include/linux/pci-tph.h
@@ -21,6 +21,16 @@ enum tph_mem_type {
TPH_MEM_TYPE_PM /* persistent memory */
};
+/*
+ * TPH requester selection policy
+ * Used to choose standard/extended TPH when enabling TPH
+ */
+enum tph_req_policy {
+ TPH_REQ_AUTO, /* Auto select by capability */
+ TPH_REQ_STANDARD, /* Force standard TPH (8-bit ST) */
+ TPH_REQ_EXTENDED /* Force extended TPH (16-bit ST) */
+};
+
#ifdef CONFIG_PCIE_TPH
int pcie_tph_set_st_entry(struct pci_dev *pdev,
unsigned int index, u16 tag);
@@ -28,7 +38,8 @@ int pcie_tph_get_cpu_st(struct pci_dev *dev,
enum tph_mem_type mem_type,
unsigned int cpu, u16 *tag);
void pcie_disable_tph(struct pci_dev *pdev);
-int pcie_enable_tph(struct pci_dev *pdev, int mode);
+int pcie_enable_tph(struct pci_dev *pdev, int mode,
+ enum tph_req_policy req_policy);
u16 pcie_tph_get_st_table_size(struct pci_dev *pdev);
u32 pcie_tph_get_st_table_loc(struct pci_dev *pdev);
#else
@@ -40,7 +51,8 @@ static inline int pcie_tph_get_cpu_st(struct pci_dev *dev,
unsigned int cpu, u16 *tag)
{ return -EINVAL; }
static inline void pcie_disable_tph(struct pci_dev *pdev) { }
-static inline int pcie_enable_tph(struct pci_dev *pdev, int mode)
+static inline int pcie_enable_tph(struct pci_dev *pdev, int mode,
+ enum tph_req_policy req_policy)
{ return -EINVAL; }
static inline u16 pcie_tph_get_st_table_size(struct pci_dev *pdev)
{ return 0; }
--
2.17.1
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [PATCH v14 4/8] PCI/TPH: Add requester policy to pcie_tph_get_cpu_st()
2026-05-28 12:46 [PATCH v14 0/8] vfio/pci: Add PCIe TPH support Chengwen Feng
` (2 preceding siblings ...)
2026-05-28 12:46 ` [PATCH v14 3/8] PCI/TPH: Add requester selection policy to pcie_enable_tph() Chengwen Feng
@ 2026-05-28 12:46 ` Chengwen Feng
2026-05-28 12:46 ` [PATCH v14 5/8] PCI/TPH: expose the enabled TPH requester type Chengwen Feng
` (4 subsequent siblings)
8 siblings, 0 replies; 22+ messages in thread
From: Chengwen Feng @ 2026-05-28 12:46 UTC (permalink / raw)
To: alex, jgg
Cc: wathsala.vithanage, helgaas, wei.huang2, zhipingz, wangzhou1,
wangyushan12, liuyonglong, kvm, linux-pci
Extend pcie_tph_get_cpu_st() to accept enum tph_req_policy, so callers
can retrieve Steering Tag for standard or extended TPH explicitly.
Add input validation and extended TPH capability check. Update function
prototype, inline stub, kernel-doc and Documentation/PCI/tph.rst.
Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
---
Documentation/PCI/tph.rst | 10 ++++---
drivers/net/ethernet/broadcom/bnxt/bnxt.c | 2 ++
.../net/ethernet/mellanox/mlx5/core/lib/st.c | 3 ++-
drivers/pci/tph.c | 26 ++++++++++++++++---
include/linux/pci-tph.h | 9 +++++--
5 files changed, 40 insertions(+), 10 deletions(-)
diff --git a/Documentation/PCI/tph.rst b/Documentation/PCI/tph.rst
index a5f14d17977c..17a369e72af1 100644
--- a/Documentation/PCI/tph.rst
+++ b/Documentation/PCI/tph.rst
@@ -87,11 +87,13 @@ To retrieve a Steering Tag for a target memory associated with a specific
CPU, use the following function::
int pcie_tph_get_cpu_st(struct pci_dev *pdev, enum tph_mem_type type,
+ enum tph_req_policy req_policy,
unsigned int cpu, u16 *tag);
The `type` argument is used to specify the memory type, either volatile
-or persistent, of the target memory. The `cpu` argument specifies the
-CPU where the memory is associated to.
+or persistent, of the target memory. The `req_policy` used to select
+standard/extended ST. The `cpu` argument specifies the CPU where the memory
+is associated to.
After the ST value is retrieved, the device driver can use the following
function to write the ST into the device::
@@ -125,8 +127,8 @@ been changed. Here is a sample code for IRQ affinity notifier:
/* Pick a right CPU as the target - here is just an example */
cpu_id = cpumask_first(irq->cpu_mask);
- if (pcie_tph_get_cpu_st(irq->pdev, TPH_MEM_TYPE_VM, cpu_id,
- &tag))
+ if (pcie_tph_get_cpu_st(irq->pdev, TPH_MEM_TYPE_VM, TPH_REQ_AUTO,
+ cpu_id, &tag))
return;
if (pcie_tph_set_st_entry(irq->pdev, irq->msix_nr, tag))
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 1f8d3e728a9f..c3cb7cd7b74a 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -11734,6 +11734,7 @@ static void bnxt_irq_affinity_notify(struct irq_affinity_notify *notify,
return;
if (pcie_tph_get_cpu_st(irq->bp->pdev, TPH_MEM_TYPE_VM,
+ TPH_REQ_AUTO,
cpumask_first(irq->cpu_mask), &tag))
return;
@@ -11892,6 +11893,7 @@ static int bnxt_request_irq(struct bnxt *bp)
/* Init ST table entry */
if (pcie_tph_get_cpu_st(irq->bp->pdev, TPH_MEM_TYPE_VM,
+ TPH_REQ_AUTO,
cpumask_first(irq->cpu_mask),
&tag))
continue;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/st.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/st.c
index 26ce59d16785..043e13c95028 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/st.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/st.c
@@ -105,7 +105,8 @@ int mlx5_st_alloc_index(struct mlx5_core_dev *dev, enum tph_mem_type mem_type,
if (!st)
return -EOPNOTSUPP;
- ret = pcie_tph_get_cpu_st(dev->pdev, mem_type, cpu_uid, &tag);
+ ret = pcie_tph_get_cpu_st(dev->pdev, mem_type, TPH_REQ_AUTO,
+ cpu_uid, &tag);
if (ret)
return ret;
diff --git a/drivers/pci/tph.c b/drivers/pci/tph.c
index f7758445988e..4a2c3575a883 100644
--- a/drivers/pci/tph.c
+++ b/drivers/pci/tph.c
@@ -236,24 +236,44 @@ static int write_tag_to_st_table(struct pci_dev *pdev, int index, u16 tag)
* with a specific CPU
* @pdev: PCI device
* @mem_type: target memory type (volatile or persistent RAM)
+ * @req_policy: TPH requester selection policy
+ *
+ * - TPH_REQ_AUTO: Use currently active requester type
+ * - TPH_REQ_STANDARD: Retrieve tag for standard TPH (8-bit ST)
+ * - TPH_REQ_EXTENDED: Retrieve tag for extended TPH (16-bit ST)
+ *
* @cpu: associated CPU id
* @tag: Steering Tag to be returned
*
- * Return the Steering Tag for a target memory that is associated with a
- * specific CPU as indicated by cpu.
+ * Return the Steering Tag for a target memory and requester policy that is
+ * associated with a specific CPU as indicated by cpu.
*
* Return: 0 if success, otherwise negative value (-errno)
*/
int pcie_tph_get_cpu_st(struct pci_dev *pdev, enum tph_mem_type mem_type,
+ enum tph_req_policy req_policy,
unsigned int cpu, u16 *tag)
{
#ifdef CONFIG_ACPI
struct pci_dev *rp;
acpi_handle rp_acpi_handle;
union st_info info;
+ u8 tph_req_type;
u32 cpu_uid;
int ret;
+ if (req_policy == TPH_REQ_AUTO) {
+ tph_req_type = pdev->tph_req_type;
+ } else if (req_policy == TPH_REQ_STANDARD) {
+ tph_req_type = PCI_TPH_REQ_TPH_ONLY;
+ } else if (req_policy == TPH_REQ_EXTENDED) {
+ if (!pdev->tph_ext_requester)
+ return -EINVAL;
+ tph_req_type = PCI_TPH_REQ_EXT_TPH;
+ } else {
+ return -EINVAL;
+ }
+
ret = acpi_get_cpu_uid(cpu, &cpu_uid);
if (ret != 0)
return ret;
@@ -269,7 +289,7 @@ int pcie_tph_get_cpu_st(struct pci_dev *pdev, enum tph_mem_type mem_type,
return -EINVAL;
}
- *tag = tph_extract_tag(mem_type, pdev->tph_req_type, &info);
+ *tag = tph_extract_tag(mem_type, tph_req_type, &info);
pci_dbg(pdev, "get steering tag: mem_type=%s, cpu=%d, tag=%#04x\n",
(mem_type == TPH_MEM_TYPE_VM) ? "volatile" : "persistent",
diff --git a/include/linux/pci-tph.h b/include/linux/pci-tph.h
index eb7f165b2570..7f78a40173c9 100644
--- a/include/linux/pci-tph.h
+++ b/include/linux/pci-tph.h
@@ -23,10 +23,13 @@ enum tph_mem_type {
/*
* TPH requester selection policy
- * Used to choose standard/extended TPH when enabling TPH
+ * Used to choose standard/extended TPH when enabling TPH,
+ * and select requester type for Steering Tag retrieval.
*/
enum tph_req_policy {
- TPH_REQ_AUTO, /* Auto select by capability */
+ TPH_REQ_AUTO, /* Auto select by capability or
+ * use active requester type
+ */
TPH_REQ_STANDARD, /* Force standard TPH (8-bit ST) */
TPH_REQ_EXTENDED /* Force extended TPH (16-bit ST) */
};
@@ -36,6 +39,7 @@ int pcie_tph_set_st_entry(struct pci_dev *pdev,
unsigned int index, u16 tag);
int pcie_tph_get_cpu_st(struct pci_dev *dev,
enum tph_mem_type mem_type,
+ enum tph_req_policy req_policy,
unsigned int cpu, u16 *tag);
void pcie_disable_tph(struct pci_dev *pdev);
int pcie_enable_tph(struct pci_dev *pdev, int mode,
@@ -48,6 +52,7 @@ static inline int pcie_tph_set_st_entry(struct pci_dev *pdev,
{ return -EINVAL; }
static inline int pcie_tph_get_cpu_st(struct pci_dev *dev,
enum tph_mem_type mem_type,
+ enum tph_req_policy req_policy,
unsigned int cpu, u16 *tag)
{ return -EINVAL; }
static inline void pcie_disable_tph(struct pci_dev *pdev) { }
--
2.17.1
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [PATCH v14 5/8] PCI/TPH: expose the enabled TPH requester type
2026-05-28 12:46 [PATCH v14 0/8] vfio/pci: Add PCIe TPH support Chengwen Feng
` (3 preceding siblings ...)
2026-05-28 12:46 ` [PATCH v14 4/8] PCI/TPH: Add requester policy to pcie_tph_get_cpu_st() Chengwen Feng
@ 2026-05-28 12:46 ` Chengwen Feng
2026-05-28 12:46 ` [PATCH v14 6/8] vfio/pci: Add VFIO_DEVICE_FEATURE_TPH_ST_CONFIG for PCIe TPH ST configuration Chengwen Feng
` (3 subsequent siblings)
8 siblings, 0 replies; 22+ messages in thread
From: Chengwen Feng @ 2026-05-28 12:46 UTC (permalink / raw)
To: alex, jgg
Cc: wathsala.vithanage, helgaas, wei.huang2, zhipingz, wangzhou1,
wangyushan12, liuyonglong, kvm, linux-pci
From: Zhiping Zhang <zhipingz@meta.com>
Add pcie_tph_enabled_req_type() so drivers can query the enabled TPH
requester mode without reaching into pci_dev internals.
Signed-off-by: Zhiping Zhang <zhipingz@meta.com>
Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
---
drivers/pci/tph.c | 12 ++++++++++++
include/linux/pci-tph.h | 3 +++
2 files changed, 15 insertions(+)
diff --git a/drivers/pci/tph.c b/drivers/pci/tph.c
index 4a2c3575a883..c7957b4a28d3 100644
--- a/drivers/pci/tph.c
+++ b/drivers/pci/tph.c
@@ -460,6 +460,18 @@ int pcie_enable_tph(struct pci_dev *pdev, int mode,
}
EXPORT_SYMBOL(pcie_enable_tph);
+/**
+ * pcie_tph_enabled_req_type - Return the device's enabled TPH requester type
+ * @pdev: PCI device to query
+ *
+ * Return: PCI_TPH_REQ_DISABLE, PCI_TPH_REQ_TPH_ONLY or PCI_TPH_REQ_EXT_TPH.
+ */
+u8 pcie_tph_enabled_req_type(struct pci_dev *pdev)
+{
+ return pdev->tph_req_type;
+}
+EXPORT_SYMBOL(pcie_tph_enabled_req_type);
+
void pci_restore_tph_state(struct pci_dev *pdev)
{
struct pci_cap_saved_state *save_state;
diff --git a/include/linux/pci-tph.h b/include/linux/pci-tph.h
index 7f78a40173c9..55af14d2457e 100644
--- a/include/linux/pci-tph.h
+++ b/include/linux/pci-tph.h
@@ -44,6 +44,7 @@ int pcie_tph_get_cpu_st(struct pci_dev *dev,
void pcie_disable_tph(struct pci_dev *pdev);
int pcie_enable_tph(struct pci_dev *pdev, int mode,
enum tph_req_policy req_policy);
+u8 pcie_tph_enabled_req_type(struct pci_dev *pdev);
u16 pcie_tph_get_st_table_size(struct pci_dev *pdev);
u32 pcie_tph_get_st_table_loc(struct pci_dev *pdev);
#else
@@ -59,6 +60,8 @@ static inline void pcie_disable_tph(struct pci_dev *pdev) { }
static inline int pcie_enable_tph(struct pci_dev *pdev, int mode,
enum tph_req_policy req_policy)
{ return -EINVAL; }
+static inline u8 pcie_tph_enabled_req_type(struct pci_dev *pdev)
+{ return PCI_TPH_REQ_DISABLE; }
static inline u16 pcie_tph_get_st_table_size(struct pci_dev *pdev)
{ return 0; }
static inline u32 pcie_tph_get_st_table_loc(struct pci_dev *pdev)
--
2.17.1
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [PATCH v14 6/8] vfio/pci: Add VFIO_DEVICE_FEATURE_TPH_ST_CONFIG for PCIe TPH ST configuration
2026-05-28 12:46 [PATCH v14 0/8] vfio/pci: Add PCIe TPH support Chengwen Feng
` (4 preceding siblings ...)
2026-05-28 12:46 ` [PATCH v14 5/8] PCI/TPH: expose the enabled TPH requester type Chengwen Feng
@ 2026-05-28 12:46 ` Chengwen Feng
2026-05-28 15:17 ` sashiko-bot
2026-05-28 12:46 ` [PATCH v14 7/8] vfio/pci: Add VFIO_DEVICE_FEATURE_TPH_CPU_ST to query TPH steering tag Chengwen Feng
` (2 subsequent siblings)
8 siblings, 1 reply; 22+ messages in thread
From: Chengwen Feng @ 2026-05-28 12:46 UTC (permalink / raw)
To: alex, jgg
Cc: wathsala.vithanage, helgaas, wei.huang2, zhipingz, wangzhou1,
wangyushan12, liuyonglong, kvm, linux-pci
Add a new VFIO device feature VFIO_DEVICE_FEATURE_TPH_ST_CONFIG to allow
userspace to configure PCIe TPH Steering Tag table entries. This interface
supports only configuration writes, read operations are not permitted.
Implement shadow ST table to cache entries, paired with per-device mutex
for concurrent access protection. Batch write failure triggers entry
rollback to guarantee hardware and shadow table consistency.
Introduce enable_unsafe_tph module parameter to gate this non-isolated TPH
feature.
Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
---
drivers/vfio/pci/vfio_pci.c | 13 +++-
drivers/vfio/pci/vfio_pci_core.c | 129 ++++++++++++++++++++++++++++++-
include/linux/vfio_pci_core.h | 6 +-
include/uapi/linux/vfio.h | 20 +++++
4 files changed, 165 insertions(+), 3 deletions(-)
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 0c771064c0b8..6d73668459cf 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -60,6 +60,12 @@ static bool disable_denylist;
module_param(disable_denylist, bool, 0444);
MODULE_PARM_DESC(disable_denylist, "Disable use of device denylist. Disabling the denylist allows binding to devices with known errata that may lead to exploitable stability or security issues when accessed by untrusted users.");
+#ifdef CONFIG_PCIE_TPH
+static bool enable_unsafe_tph;
+module_param(enable_unsafe_tph, bool, 0444);
+MODULE_PARM_DESC(enable_unsafe_tph, "Enable PCIe TPH (Transaction Processing Hints) support. It may break platform isolation. If you do not know what this is for, step away. (default: false)");
+#endif
+
static bool vfio_pci_dev_in_denylist(struct pci_dev *pdev)
{
switch (pdev->vendor) {
@@ -257,12 +263,17 @@ static int __init vfio_pci_init(void)
{
int ret;
bool is_disable_vga = true;
+ bool is_enable_unsafe_tph = false;
#ifdef CONFIG_VFIO_PCI_VGA
is_disable_vga = disable_vga;
#endif
+#ifdef CONFIG_PCIE_TPH
+ is_enable_unsafe_tph = enable_unsafe_tph;
+#endif
- vfio_pci_core_set_params(nointxmask, is_disable_vga, disable_idle_d3);
+ vfio_pci_core_set_params(nointxmask, is_disable_vga, disable_idle_d3,
+ is_enable_unsafe_tph);
/* Register and scan for devices */
ret = pci_register_driver(&vfio_pci_driver);
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 050e7542952e..c9035c0acaf0 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -29,6 +29,7 @@
#include <linux/sched/mm.h>
#include <linux/iommufd.h>
#include <linux/pci-p2pdma.h>
+#include <linux/pci-tph.h>
#if IS_ENABLED(CONFIG_EEH)
#include <asm/eeh.h>
#endif
@@ -41,6 +42,7 @@
static bool nointxmask;
static bool disable_vga;
static bool disable_idle_d3;
+static bool enable_unsafe_tph;
static void vfio_pci_eventfd_rcu_free(struct rcu_head *rcu)
{
@@ -528,6 +530,9 @@ static const struct dev_pm_ops vfio_pci_core_pm_ops = {
NULL)
};
+static int vfio_pci_tph_init(struct vfio_pci_core_device *vdev);
+static void vfio_pci_tph_deinit(struct vfio_pci_core_device *vdev);
+
int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
{
struct pci_dev *pdev = vdev->pdev;
@@ -582,6 +587,10 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
if (ret)
goto out_free_zdev;
+ ret = vfio_pci_tph_init(vdev);
+ if (ret)
+ goto out_free_config;
+
msix_pos = pdev->msix_cap;
if (msix_pos) {
u16 flags;
@@ -606,6 +615,8 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
return 0;
+out_free_config:
+ vfio_config_free(vdev);
out_free_zdev:
vfio_pci_zdev_close_device(vdev);
out_free_state:
@@ -679,6 +690,7 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
kfree(vdev->region);
vdev->region = NULL; /* don't krealloc a freed pointer */
+ vfio_pci_tph_deinit(vdev);
vfio_config_free(vdev);
for (i = 0; i < PCI_STD_NUM_BARS; i++) {
@@ -1551,6 +1563,116 @@ static int vfio_pci_core_feature_token(struct vfio_pci_core_device *vdev,
return 0;
}
+static int vfio_pci_tph_st_shadow_size(struct vfio_pci_core_device *vdev)
+{
+ struct pci_dev *pdev = vdev->pdev;
+ u32 loc = pcie_tph_get_st_table_loc(pdev);
+ int ret;
+
+ if (loc == PCI_TPH_LOC_CAP) {
+ return pcie_tph_get_st_table_size(pdev);
+ } else if (loc == PCI_TPH_LOC_MSIX) {
+ ret = pci_msix_vec_count(pdev);
+ if (ret < 0)
+ return 0;
+ return ret;
+ } else {
+ return 0;
+ }
+}
+
+static int vfio_pci_tph_init(struct vfio_pci_core_device *vdev)
+{
+ vdev->tph_st_entries = vfio_pci_tph_st_shadow_size(vdev);
+ vdev->tph_st_shadow = NULL;
+
+ if (vdev->tph_st_entries) {
+ vdev->tph_st_shadow = kcalloc(vdev->tph_st_entries, sizeof(u16),
+ GFP_KERNEL);
+ if (!vdev->tph_st_shadow)
+ return -ENOMEM;
+ }
+
+ mutex_init(&vdev->tph_lock);
+
+ return 0;
+}
+
+static void vfio_pci_tph_deinit(struct vfio_pci_core_device *vdev)
+{
+ kfree(vdev->tph_st_shadow);
+ vdev->tph_st_shadow = NULL;
+ vdev->tph_st_entries = 0;
+
+ mutex_destroy(&vdev->tph_lock);
+}
+
+static int vfio_pci_core_feature_tph_st_config(
+ struct vfio_pci_core_device *vdev,
+ u32 flags,
+ struct vfio_device_feature_tph_st_config __user *arg,
+ size_t argsz)
+{
+ struct vfio_device_feature_tph_st_config config;
+ struct pci_dev *pdev = vdev->pdev;
+ void __user *uptr;
+ int i, idx, ret;
+ size_t sz;
+ u16 *sts;
+
+ if (!enable_unsafe_tph || !vdev->tph_st_shadow)
+ return -EOPNOTSUPP;
+
+ ret = vfio_check_feature(flags, argsz,
+ VFIO_DEVICE_FEATURE_SET |
+ VFIO_DEVICE_FEATURE_PROBE,
+ sizeof(config));
+ if (ret <= 0)
+ return ret;
+
+ if (copy_from_user(&config, arg, sizeof(config)))
+ return -EFAULT;
+
+ if (config.count == 0 || config.index >= vdev->tph_st_entries ||
+ config.count > vdev->tph_st_entries ||
+ config.index + config.count > vdev->tph_st_entries ||
+ config.reserved != 0)
+ return -EINVAL;
+
+ guard(mutex)(&vdev->tph_lock);
+
+ uptr = u64_to_user_ptr(config.data_uptr);
+ sts = memdup_array_user(uptr, config.count, sizeof(u16));
+ sz = config.count * sizeof(u16);
+ if (IS_ERR(sts))
+ return PTR_ERR(sts);
+
+ if (pcie_tph_enabled_req_type(pdev) == PCI_TPH_REQ_DISABLE) {
+ memcpy(&vdev->tph_st_shadow[config.index], sts, sz);
+ kfree(sts);
+ return 0;
+ }
+
+ for (i = 0; i < config.count; i++) {
+ idx = config.index + i;
+ ret = pcie_tph_set_st_entry(pdev, idx, sts[i]);
+ if (ret)
+ goto rollback;
+ }
+
+ memcpy(&vdev->tph_st_shadow[config.index], sts, sz);
+ kfree(sts);
+ return 0;
+
+rollback:
+ while (i-- > 0) {
+ idx = config.index + i;
+ pcie_tph_set_st_entry(pdev, idx, vdev->tph_st_shadow[idx]);
+ }
+ kfree(sts);
+ return ret;
+}
+
int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
void __user *arg, size_t argsz)
{
@@ -1569,6 +1691,9 @@ int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
return vfio_pci_core_feature_token(vdev, flags, arg, argsz);
case VFIO_DEVICE_FEATURE_DMA_BUF:
return vfio_pci_core_feature_dma_buf(vdev, flags, arg, argsz);
+ case VFIO_DEVICE_FEATURE_TPH_ST_CONFIG:
+ return vfio_pci_core_feature_tph_st_config(vdev, flags,
+ arg, argsz);
default:
return -ENOTTY;
}
@@ -2605,11 +2730,13 @@ static void vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set)
}
void vfio_pci_core_set_params(bool is_nointxmask, bool is_disable_vga,
- bool is_disable_idle_d3)
+ bool is_disable_idle_d3,
+ bool is_enable_unsafe_tph)
{
nointxmask = is_nointxmask;
disable_vga = is_disable_vga;
disable_idle_d3 = is_disable_idle_d3;
+ enable_unsafe_tph = is_enable_unsafe_tph;
}
EXPORT_SYMBOL_GPL(vfio_pci_core_set_params);
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index 89165b769e5c..ae8e7011ab8e 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -142,6 +142,9 @@ struct vfio_pci_core_device {
struct notifier_block nb;
struct rw_semaphore memory_lock;
struct list_head dmabufs;
+ struct mutex tph_lock;
+ u16 *tph_st_shadow;
+ u16 tph_st_entries;
};
enum vfio_pci_io_width {
@@ -157,7 +160,8 @@ int vfio_pci_core_register_dev_region(struct vfio_pci_core_device *vdev,
const struct vfio_pci_regops *ops,
size_t size, u32 flags, void *data);
void vfio_pci_core_set_params(bool nointxmask, bool is_disable_vga,
- bool is_disable_idle_d3);
+ bool is_disable_idle_d3,
+ bool is_enable_unsafe_tph);
void vfio_pci_core_close_device(struct vfio_device *core_vdev);
int vfio_pci_core_init_dev(struct vfio_device *core_vdev);
void vfio_pci_core_release_dev(struct vfio_device *core_vdev);
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 5de618a3a5ee..07d8cce6ef32 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -1534,6 +1534,26 @@ struct vfio_device_feature_dma_buf {
*/
#define VFIO_DEVICE_FEATURE_MIG_PRECOPY_INFOv2 12
+/**
+ * VFIO_DEVICE_FEATURE_TPH_ST_CONFIG - Configure PCIe TPH Steering Tag entries
+ *
+ * Provides userspace interface to configure PCIe TPH ST table entries.
+ *
+ * @index: Start entry offset within ST table
+ * @count: Number of consecutive entries to configure
+ * @data_uptr: Userspace data buffer for 16-bit raw ST values
+ *
+ * This feature is gated by enable_unsafe_tph module parameter.
+ */
+#define VFIO_DEVICE_FEATURE_TPH_ST_CONFIG 13
+
+struct vfio_device_feature_tph_st_config {
+ __u16 index;
+ __u16 count;
+ __u32 reserved; /* Reserved for future use, must be zero */
+ __aligned_u64 data_uptr;
+};
+
/* -------- API for Type1 VFIO IOMMU -------- */
/**
--
2.17.1
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [PATCH v14 7/8] vfio/pci: Add VFIO_DEVICE_FEATURE_TPH_CPU_ST to query TPH steering tag
2026-05-28 12:46 [PATCH v14 0/8] vfio/pci: Add PCIe TPH support Chengwen Feng
` (5 preceding siblings ...)
2026-05-28 12:46 ` [PATCH v14 6/8] vfio/pci: Add VFIO_DEVICE_FEATURE_TPH_ST_CONFIG for PCIe TPH ST configuration Chengwen Feng
@ 2026-05-28 12:46 ` Chengwen Feng
2026-05-28 12:46 ` [PATCH v14 8/8] vfio/pci: Virtualize PCIe TPH capability registers Chengwen Feng
2026-06-01 15:58 ` [PATCH v14 0/8] vfio/pci: Add PCIe TPH support Alex Williamson
8 siblings, 0 replies; 22+ messages in thread
From: Chengwen Feng @ 2026-05-28 12:46 UTC (permalink / raw)
To: alex, jgg
Cc: wathsala.vithanage, helgaas, wei.huang2, zhipingz, wangzhou1,
wangyushan12, liuyonglong, kvm, linux-pci
Add read-only VFIO device feature TPH_CPU_ST for userspace to resolve
PCIe TPH Steering Tag by given CPU ID, memory type (VM/PM), and
TPH requester type (standard/extended).
Define compact flag bits to select memory type and requester type.
Implement batch query logic: input buffer carries CPU ID array, output
returns corresponding resolved ST tags.
Add strict sanity checks for flags, count and reserved fields.
The feature is gated by enable_unsafe_tph module parameter.
Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
---
drivers/vfio/pci/vfio_pci_core.c | 64 ++++++++++++++++++++++++++++++++
include/uapi/linux/vfio.h | 29 +++++++++++++++
2 files changed, 93 insertions(+)
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index c9035c0acaf0..851e4d0828f2 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1673,6 +1673,67 @@ static int vfio_pci_core_feature_tph_st_config(
return ret;
}
+static int vfio_pci_core_feature_tph_cpu_st(struct vfio_pci_core_device *vdev,
+ u32 flags,
+ struct vfio_device_feature_tph_cpu_st __user *arg,
+ size_t argsz)
+{
+ struct vfio_device_feature_tph_cpu_st cpu_st;
+ struct pci_dev *pdev = vdev->pdev;
+ enum tph_req_policy policy;
+ enum tph_mem_type mtype;
+ void __user *uptr;
+ int i, ret;
+ u32 *cpus;
+ u16 *sts;
+ u16 st;
+
+ if (!enable_unsafe_tph)
+ return -EOPNOTSUPP;
+
+ ret = vfio_check_feature(flags, argsz,
+ VFIO_DEVICE_FEATURE_GET |
+ VFIO_DEVICE_FEATURE_PROBE,
+ sizeof(cpu_st));
+ if (ret <= 0)
+ return ret;
+
+ if (copy_from_user(&cpu_st, arg, sizeof(cpu_st)))
+ return -EFAULT;
+
+ if (cpu_st.flags & ~(VFIO_TPH_CPU_ST_MEM_TYPE_MASK |
+ VFIO_TPH_CPU_ST_REQ_TYPE_MASK) ||
+ cpu_st.count == 0 || cpu_st.count > nr_cpu_ids ||
+ cpu_st.reserved != 0)
+ return -EINVAL;
+
+ uptr = u64_to_user_ptr(cpu_st.data_uptr);
+ cpus = memdup_array_user(uptr, cpu_st.count, sizeof(u32));
+ if (IS_ERR(cpus))
+ return PTR_ERR(cpus);
+
+ mtype = (cpu_st.flags & VFIO_TPH_CPU_ST_MEM_TYPE_MASK) ==
+ VFIO_TPH_CPU_ST_MEM_TYPE_VM ? TPH_MEM_TYPE_VM : TPH_MEM_TYPE_PM;
+ policy = (cpu_st.flags & VFIO_TPH_CPU_ST_REQ_TYPE_MASK) ==
+ VFIO_TPH_CPU_ST_REQ_STANDARD ?
+ TPH_REQ_STANDARD : TPH_REQ_EXTENDED;
+ sts = (u16 *)cpus;
+ for (i = 0; i < cpu_st.count; i++) {
+ ret = pcie_tph_get_cpu_st(pdev, mtype, policy, cpus[i], &st);
+ if (ret)
+ goto out;
+ sts[i] = st;
+ }
+
+ ret = copy_to_user(uptr, sts, cpu_st.count * sizeof(u16));
+ if (ret)
+ ret = -EFAULT;
+
+out:
+ kfree(cpus);
+ return ret;
+}
+
int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
void __user *arg, size_t argsz)
{
@@ -1694,6 +1755,9 @@ int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
case VFIO_DEVICE_FEATURE_TPH_ST_CONFIG:
return vfio_pci_core_feature_tph_st_config(vdev, flags,
arg, argsz);
+ case VFIO_DEVICE_FEATURE_TPH_CPU_ST:
+ return vfio_pci_core_feature_tph_cpu_st(vdev, flags,
+ arg, argsz);
default:
return -ENOTTY;
}
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 07d8cce6ef32..027b2d553ddb 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -1554,6 +1554,35 @@ struct vfio_device_feature_tph_st_config {
__aligned_u64 data_uptr;
};
+/**
+ * VFIO_DEVICE_FEATURE_TPH_CPU_ST - Query TPH ST by CPU, mem and requester type
+ *
+ * Read-only interface to resolve Steering Tag using:
+ * - CPU ID
+ * - Memory type (volatile / persistent)
+ * - TPH requested type (standard 8-bit / extended 16-bit)
+ *
+ * @flags: bit0 - memory type, bit1 - TPH requester type
+ * @count: Number of CPU IDs to query consecutively
+ * @data_uptr: Userspace buffer: [in] u32 cpu_id array, [out] u16 st_tag array
+ *
+ * This feature is gated by enable_unsafe_tph module parameter.
+ */
+#define VFIO_DEVICE_FEATURE_TPH_CPU_ST 14
+
+struct vfio_device_feature_tph_cpu_st {
+ __u32 flags;
+#define VFIO_TPH_CPU_ST_MEM_TYPE_MASK 1u
+#define VFIO_TPH_CPU_ST_MEM_TYPE_VM (0u << 0)
+#define VFIO_TPH_CPU_ST_MEM_TYPE_PM (1u << 0)
+#define VFIO_TPH_CPU_ST_REQ_TYPE_MASK (1u << 1)
+#define VFIO_TPH_CPU_ST_REQ_STANDARD (0u << 1)
+#define VFIO_TPH_CPU_ST_REQ_EXTENDED (1u << 1)
+ __u16 count;
+ __u16 reserved; /* Reserved for future use, must be zero */
+ __aligned_u64 data_uptr;
+};
+
/* -------- API for Type1 VFIO IOMMU -------- */
/**
--
2.17.1
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [PATCH v14 8/8] vfio/pci: Virtualize PCIe TPH capability registers
2026-05-28 12:46 [PATCH v14 0/8] vfio/pci: Add PCIe TPH support Chengwen Feng
` (6 preceding siblings ...)
2026-05-28 12:46 ` [PATCH v14 7/8] vfio/pci: Add VFIO_DEVICE_FEATURE_TPH_CPU_ST to query TPH steering tag Chengwen Feng
@ 2026-05-28 12:46 ` Chengwen Feng
2026-05-28 16:42 ` sashiko-bot
2026-06-01 15:58 ` [PATCH v14 0/8] vfio/pci: Add PCIe TPH support Alex Williamson
8 siblings, 1 reply; 22+ messages in thread
From: Chengwen Feng @ 2026-05-28 12:46 UTC (permalink / raw)
To: alex, jgg
Cc: wathsala.vithanage, helgaas, wei.huang2, zhipingz, wangzhou1,
wangyushan12, liuyonglong, kvm, linux-pci
Implement read/write virtualization for PCIe TPH Extended Capability:
1. Read path: Mask out extended requester bit if hardware does not support
extended ST, to hide unavailable capability from userspace.
2. Write path: Validate TPH mode and requester configuration, mediate TPH
enable/disable requests. Sync shadow ST table to hardware after
successful TPH enabling.
Clear shadow table and reset TPH status on user session transition, to
prevent cross-session state leakage.
Gate the enable/disable TPH feature by enable_unsafe_tph module parameter.
Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
---
drivers/vfio/pci/vfio_pci_config.c | 78 ++++++++++++++++++++++++++++++
drivers/vfio/pci/vfio_pci_core.c | 8 ++-
2 files changed, 85 insertions(+), 1 deletion(-)
diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
index a10ed733f0e3..bed3794e6033 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -20,8 +20,10 @@
* must be negotiated with the underlying OS.
*/
+#include <linux/bitfield.h>
#include <linux/fs.h>
#include <linux/pci.h>
+#include <linux/pci-tph.h>
#include <linux/uaccess.h>
#include <linux/vfio.h>
#include <linux/slab.h>
@@ -35,6 +37,8 @@
((offset >= PCI_BASE_ADDRESS_0 && offset < PCI_BASE_ADDRESS_5 + 4) || \
(offset >= PCI_ROM_ADDRESS && offset < PCI_ROM_ADDRESS + 4))
+extern bool enable_unsafe_tph;
+
/*
* Lengths of PCI Config Capabilities
* 0: Removed from the user visible capability list
@@ -313,6 +317,78 @@ static int vfio_virt_config_read(struct vfio_pci_core_device *vdev, int pos,
return count;
}
+static int vfio_pci_tph_config_read(struct vfio_pci_core_device *vdev, int pos,
+ int count, struct perm_bits *perm,
+ int offset, __le32 *val)
+{
+ const int target_byte = PCI_TPH_CAP + 1;
+ struct pci_dev *pdev = vdev->pdev;
+ const u8 ext_bit = BIT(0);
+ u8 *data = (u8 *)val;
+ int byte_off;
+ int ret;
+
+ if (pdev->tph_ext_requester || offset > target_byte ||
+ offset + count <= target_byte)
+ return vfio_direct_config_read(vdev, pos, count, perm,
+ offset, val);
+
+ ret = vfio_direct_config_read(vdev, pos, count, perm, offset, val);
+
+ /* Mask out TPH requester capability */
+ byte_off = target_byte - offset;
+ /* Defensive check: prevent out-of-bounds access */
+ if (byte_off < count)
+ data[byte_off] &= ~ext_bit;
+
+ return ret;
+}
+
+static int vfio_pci_tph_config_write(struct vfio_pci_core_device *vdev, int pos,
+ int count, struct perm_bits *perm,
+ int offset, __le32 val)
+{
+ struct pci_dev *pdev = vdev->pdev;
+ u32 data = le32_to_cpu(val);
+ enum tph_req_policy policy;
+ u8 mode, req;
+ int i, ret;
+
+ if (!enable_unsafe_tph)
+ return count;
+
+ /*
+ * TPH_CTRL (offset 4) register layout:
+ * Byte 0 (offset 4): Bits [2:0] - ST Mode selection
+ * Byte 1 (offset 5): Bits [1:0] - TPH Requester enable
+ *
+ * Only intercept writes with count >= 2 which cover both control
+ * fields. Single-byte partial writes do not affect combined
+ * configuration.
+ */
+ if (offset != PCI_TPH_CTRL || count < 2)
+ return count;
+
+ guard(mutex)(&vdev->tph_lock);
+
+ mode = FIELD_GET(PCI_TPH_CTRL_MODE_SEL_MASK, data);
+ req = FIELD_GET(PCI_TPH_CTRL_REQ_EN_MASK, data);
+ if (req == PCI_TPH_REQ_TPH_ONLY || req == PCI_TPH_REQ_EXT_TPH) {
+ policy = (req == PCI_TPH_REQ_TPH_ONLY) ?
+ TPH_REQ_STANDARD : TPH_REQ_EXTENDED;
+ ret = pcie_enable_tph(pdev, mode, policy);
+ if (ret == 0 && vdev->tph_st_shadow) {
+ for (i = 0; i < vdev->tph_st_entries; i++)
+ pcie_tph_set_st_entry(pdev, i,
+ vdev->tph_st_shadow[i]);
+ }
+ } else if (req == PCI_TPH_REQ_DISABLE) {
+ pcie_disable_tph(vdev->pdev);
+ }
+
+ return count;
+}
+
static struct perm_bits direct_ro_perms = {
.readfn = vfio_direct_config_read,
};
@@ -1121,6 +1197,8 @@ int __init vfio_pci_init_perm_bits(void)
ret |= init_pci_ext_cap_err_perm(&ecap_perms[PCI_EXT_CAP_ID_ERR]);
ret |= init_pci_ext_cap_pwr_perm(&ecap_perms[PCI_EXT_CAP_ID_PWR]);
ecap_perms[PCI_EXT_CAP_ID_VNDR].writefn = vfio_raw_config_write;
+ ecap_perms[PCI_EXT_CAP_ID_TPH].readfn = vfio_pci_tph_config_read;
+ ecap_perms[PCI_EXT_CAP_ID_TPH].writefn = vfio_pci_tph_config_write;
ecap_perms[PCI_EXT_CAP_ID_DVSEC].writefn = vfio_raw_config_write;
if (ret)
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 851e4d0828f2..2a5cb945275e 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -42,7 +42,7 @@
static bool nointxmask;
static bool disable_vga;
static bool disable_idle_d3;
-static bool enable_unsafe_tph;
+bool enable_unsafe_tph;
static void vfio_pci_eventfd_rcu_free(struct rcu_head *rcu)
{
@@ -1595,6 +1595,9 @@ static int vfio_pci_tph_init(struct vfio_pci_core_device *vdev)
mutex_init(&vdev->tph_lock);
+ /* Reset TPH status on new user session */
+ pcie_disable_tph(vdev->pdev);
+
return 0;
}
@@ -1605,6 +1608,9 @@ static void vfio_pci_tph_deinit(struct vfio_pci_core_device *vdev)
vdev->tph_st_entries = 0;
mutex_destroy(&vdev->tph_lock);
+
+ /* Reset TPH status on session exit */
+ pcie_disable_tph(vdev->pdev);
}
static int vfio_pci_core_feature_tph_st_config(
--
2.17.1
^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [PATCH v14 6/8] vfio/pci: Add VFIO_DEVICE_FEATURE_TPH_ST_CONFIG for PCIe TPH ST configuration
2026-05-28 12:46 ` [PATCH v14 6/8] vfio/pci: Add VFIO_DEVICE_FEATURE_TPH_ST_CONFIG for PCIe TPH ST configuration Chengwen Feng
@ 2026-05-28 15:17 ` sashiko-bot
0 siblings, 0 replies; 22+ messages in thread
From: sashiko-bot @ 2026-05-28 15:17 UTC (permalink / raw)
To: Chengwen Feng; +Cc: linux-pci, kvm
Thank you for your contribution! Sashiko AI review found 4 potential issue(s) to consider:
- [High] The ioctl handler accesses device hardware registers without ensuring the device is in the D0 power state, which can cause host crashes.
- [Medium] The shadow ST table is not populated with the initial hardware state, leading to hardware state corruption upon rollback.
- [Low] Mixing goto-based error handling with scope-based cleanup (`guard()`) violates kernel cleanup guidelines.
- [Low] Shadow ST table memory is allocated unconditionally during device initialization, ignoring the module parameter gate.
--
commit f65e8a2860b984f5dfa0664d516582dd9981c329
Author: Chengwen Feng <fengchengwen@huawei.com>
vfio/pci: Add VFIO_DEVICE_FEATURE_TPH_ST_CONFIG for PCIe TPH ST configuration
This commit adds a new VFIO device feature to allow userspace to configure
PCIe TPH Steering Tag table entries. It implements a shadow ST table to cache
entries and introduces a module parameter to gate the feature.
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index 050e7542952ed..c9035c0acaf08 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
[ ... ]
> @@ -1551,6 +1563,116 @@ static int vfio_pci_core_feature_token(struct vfio_pci_core_device *vdev,
> return 0;
> }
>
> +static int vfio_pci_tph_st_shadow_size(struct vfio_pci_core_device *vdev)
[ ... ]
> +static int vfio_pci_tph_init(struct vfio_pci_core_device *vdev)
> +{
> + vdev->tph_st_entries = vfio_pci_tph_st_shadow_size(vdev);
> + vdev->tph_st_shadow = NULL;
> +
> + if (vdev->tph_st_entries) {
> + vdev->tph_st_shadow = kcalloc(vdev->tph_st_entries, sizeof(u16),
> + GFP_KERNEL);
[Severity: Low]
Since this feature is gated by the enable_unsafe_tph module parameter during
ioctl execution, should this memory allocation also be gated by that parameter?
If the feature is disabled (the default), this allocates memory for every
TPH-capable device that will never be used.
> + if (!vdev->tph_st_shadow)
> + return -ENOMEM;
> + }
> +
> + mutex_init(&vdev->tph_lock);
> +
> + return 0;
> +}
[ ... ]
> +static int vfio_pci_core_feature_tph_st_config(
> + struct vfio_pci_core_device *vdev,
> + u32 flags,
> + struct vfio_device_feature_tph_st_config __user *arg,
> + size_t argsz)
> +{
> + struct vfio_device_feature_tph_st_config config;
> + struct pci_dev *pdev = vdev->pdev;
> + void __user *uptr;
> + int i, idx, ret;
> + size_t sz;
> + u16 *sts;
> +
> + if (!enable_unsafe_tph || !vdev->tph_st_shadow)
> + return -EOPNOTSUPP;
> +
> + ret = vfio_check_feature(flags, argsz,
> + VFIO_DEVICE_FEATURE_SET |
> + VFIO_DEVICE_FEATURE_PROBE,
> + sizeof(config));
> + if (ret <= 0)
> + return ret;
> +
> + if (copy_from_user(&config, arg, sizeof(config)))
> + return -EFAULT;
> +
> + if (config.count == 0 || config.index >= vdev->tph_st_entries ||
> + config.count > vdev->tph_st_entries ||
> + config.index + config.count > vdev->tph_st_entries ||
> + config.reserved != 0)
> + return -EINVAL;
> +
> + guard(mutex)(&vdev->tph_lock);
[Severity: Low]
Is it okay to mix guard() based locking with goto-based error handling in the
same function?
According to include/linux/cleanup.h, the expectation is to convert all
resources to scope-based cleanup or convert none of them. This code uses a
goto rollback pattern and manual kfree() alongside guard().
> +
> + uptr = u64_to_user_ptr(config.data_uptr);
> + sts = memdup_array_user(uptr, config.count, sizeof(u16));
> + sz = config.count * sizeof(u16);
> + if (IS_ERR(sts))
> + return PTR_ERR(sts);
> +
> + if (pcie_tph_enabled_req_type(pdev) == PCI_TPH_REQ_DISABLE) {
> + memcpy(&vdev->tph_st_shadow[config.index], sts, sz);
> + kfree(sts);
> + return 0;
> + }
> +
> + for (i = 0; i < config.count; i++) {
> + idx = config.index + i;
> + ret = pcie_tph_set_st_entry(pdev, idx, sts[i]);
[Severity: High]
Does this code need to ensure the device is in the D0 power state before
accessing its hardware registers?
The ioctl handler directly calls pcie_tph_set_st_entry() which accesses the
device's PCI configuration space or memory-mapped MSI-X table. If userspace
has placed the device into a low-power state (like D3hot) and no runtime PM
reference is acquired here, accessing MMIO regions could result in Unsupported
Requests and potentially trigger a fatal machine check exception.
> + if (ret)
> + goto rollback;
> + }
> +
> + memcpy(&vdev->tph_st_shadow[config.index], sts, sz);
> + kfree(sts);
> + return 0;
> +
> +rollback:
> + while (i-- > 0) {
> + idx = config.index + i;
> + pcie_tph_set_st_entry(pdev, idx, vdev->tph_st_shadow[idx]);
[Severity: Medium]
Will this rollback write zeros to the hardware during the first ioctl call?
The shadow ST table vdev->tph_st_shadow is allocated with kcalloc and never
synchronized with the initial hardware state. If a batch write fails after
successfully modifying some entries, this rollback loop appears to write the
zeroed shadow contents back to the hardware, potentially overwriting valid
BIOS- or host-configured ST entries instead of their true initial values.
> + }
> + kfree(sts);
> + return ret;
> +}
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260528124649.14732-1-fengchengwen@huawei.com?part=6
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v14 8/8] vfio/pci: Virtualize PCIe TPH capability registers
2026-05-28 12:46 ` [PATCH v14 8/8] vfio/pci: Virtualize PCIe TPH capability registers Chengwen Feng
@ 2026-05-28 16:42 ` sashiko-bot
0 siblings, 0 replies; 22+ messages in thread
From: sashiko-bot @ 2026-05-28 16:42 UTC (permalink / raw)
To: Chengwen Feng; +Cc: linux-pci, kvm
Thank you for your contribution! Sashiko AI review found 8 potential issue(s) to consider:
- [Critical] Unconditional overwrite of hardware ST Table destroys guest's native configuration.
- [High] Silent dropping of partial writes and ST table configuration in TPH config space.
- [High] Modifying ST Table entries while the TPH Requester is enabled.
- [High] Concurrent modification of `struct pci_dev` bitfields (`tph_enabled` vs. `msix_enabled`) without a shared lock.
- [Medium] Mixing `goto`-based error handling and `guard()` scope-based cleanup in the same function.
- [Low] Masking operations on an uninitialized stack variable when `vfio_direct_config_read()` fails.
- [Low] Destroying the `tph_lock` mutex before the final operation it serializes.
- [Low] Comment incorrectly identifies the `PCI_TPH_CTRL` offset.
--
commit 597ad56e8c1a479406f7f445961a108159c417c8
Author: Chengwen Feng <fengchengwen@huawei.com>
vfio/pci: Virtualize PCIe TPH capability registers
Implement read/write virtualization for PCIe TPH Extended Capability:
1. Read path: Mask out extended requester bit if hardware does not support
extended ST, to hide unavailable capability from userspace.
2. Write path: Validate TPH mode and requester configuration, mediate TPH
enable/disable requests. Sync shadow ST table to hardware after
successful TPH enabling.
> diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
> index a10ed733f0e3a..bed3794e6033d 100644
> --- a/drivers/vfio/pci/vfio_pci_config.c
> +++ b/drivers/vfio/pci/vfio_pci_config.c
[ ... ]
> @@ -313,6 +317,78 @@ static int vfio_virt_config_read(struct vfio_pci_core_device *vdev, int pos,
> return count;
> }
>
> +static int vfio_pci_tph_config_read(struct vfio_pci_core_device *vdev, int pos,
> + int count, struct perm_bits *perm,
> + int offset, __le32 *val)
> +{
> + const int target_byte = PCI_TPH_CAP + 1;
> + struct pci_dev *pdev = vdev->pdev;
> + const u8 ext_bit = BIT(0);
> + u8 *data = (u8 *)val;
> + int byte_off;
> + int ret;
> +
> + if (pdev->tph_ext_requester || offset > target_byte ||
> + offset + count <= target_byte)
> + return vfio_direct_config_read(vdev, pos, count, perm,
> + offset, val);
> +
> + ret = vfio_direct_config_read(vdev, pos, count, perm, offset, val);
> +
> + /* Mask out TPH requester capability */
> + byte_off = target_byte - offset;
> + /* Defensive check: prevent out-of-bounds access */
> + if (byte_off < count)
> + data[byte_off] &= ~ext_bit;
[Severity: Low]
If vfio_direct_config_read() returns a negative error code, does this execute
the bitwise mask on uninitialized stack data in val before returning? Even
though the upper layers discard the value on error, perhaps this should check
for ret < 0 before proceeding?
> +
> + return ret;
> +}
> +
> +static int vfio_pci_tph_config_write(struct vfio_pci_core_device *vdev, int pos,
> + int count, struct perm_bits *perm,
> + int offset, __le32 val)
> +{
> + struct pci_dev *pdev = vdev->pdev;
> + u32 data = le32_to_cpu(val);
> + enum tph_req_policy policy;
> + u8 mode, req;
> + int i, ret;
> +
> + if (!enable_unsafe_tph)
> + return count;
> +
> + /*
> + * TPH_CTRL (offset 4) register layout:
> + * Byte 0 (offset 4): Bits [2:0] - ST Mode selection
> + * Byte 1 (offset 5): Bits [1:0] - TPH Requester enable
[Severity: Low]
This isn't a bug, but the comment states the TPH_CTRL register is at offset 4.
Isn't PCI_TPH_CTRL actually at offset 8, while offset 4 is the PCI_TPH_CAP
register?
> + *
> + * Only intercept writes with count >= 2 which cover both control
> + * fields. Single-byte partial writes do not affect combined
> + * configuration.
> + */
> + if (offset != PCI_TPH_CTRL || count < 2)
> + return count;
[Severity: High]
By returning count directly here, doesn't this silently drop all byte-wise
writes to the control register, as well as any writes to the ST Table if it is
located within the TPH capability space (PCI_TPH_LOC_CAP)? The guest might
assume the writes succeeded, leading to a permanent hardware state mismatch.
Should this fallback to writing the unhandled cases instead of returning count
early?
> +
> + guard(mutex)(&vdev->tph_lock);
> +
> + mode = FIELD_GET(PCI_TPH_CTRL_MODE_SEL_MASK, data);
> + req = FIELD_GET(PCI_TPH_CTRL_REQ_EN_MASK, data);
> + if (req == PCI_TPH_REQ_TPH_ONLY || req == PCI_TPH_REQ_EXT_TPH) {
> + policy = (req == PCI_TPH_REQ_TPH_ONLY) ?
> + TPH_REQ_STANDARD : TPH_REQ_EXTENDED;
> + ret = pcie_enable_tph(pdev, mode, policy);
[Severity: High]
Is there a data race when setting pdev->tph_enabled inside pcie_enable_tph()?
The tph_enabled flag shares the same unsigned int word in struct pci_dev as
msix_enabled. Since this path is only protected by vdev->tph_lock, if userspace
concurrently triggers VFIO_DEVICE_SET_IRQS (which mutates msix_enabled under
vdev->igate), could the compiler's read-modify-write on the shared bitfield
word silently corrupt these PCI device flags?
> + if (ret == 0 && vdev->tph_st_shadow) {
> + for (i = 0; i < vdev->tph_st_entries; i++)
> + pcie_tph_set_st_entry(pdev, i,
> + vdev->tph_st_shadow[i]);
> + }
[Severity: High]
The PCIe Base Specification (6.17.3) states that software must not modify the
Steering Tag Table while the TPH Requester Enable bit is 1. Since
pcie_enable_tph() sets the enable bit in hardware before this loop writes to
the ST Table, could this cause undefined hardware behavior or dropped TLPs?
[Severity: Critical]
Also, if a generic unmodified VM configures the ST Table natively via MMIO
without using the custom VFIO ioctl, tph_st_shadow will contain only its
default zeros. Won't this unconditional iteration wipe out the guest's valid
native ST tags with zeros at the exact moment TPH is enabled, permanently
corrupting the device state?
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index 851e4d0828f2d..2a5cb945275ea 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
[ ... ]
> @@ -1605,6 +1608,9 @@ static void vfio_pci_tph_deinit(struct vfio_pci_core_device *vdev)
> vdev->tph_st_entries = 0;
>
> mutex_destroy(&vdev->tph_lock);
> +
> + /* Reset TPH status on session exit */
> + pcie_disable_tph(vdev->pdev);
> }
[Severity: Low]
Does it make sense to destroy the vdev->tph_lock mutex before calling
pcie_disable_tph()? Even though this runs sequentially during device teardown,
it seems semantically inverted to destroy the lock before manipulating the
hardware state it protects.
[ ... ]
> @@ -1645,6 +1651,13 @@ static int vfio_pci_core_feature_tph_st_config(
> config.reserved != 0)
> return -EINVAL;
>
> guard(mutex)(&vdev->tph_lock);
>
> uptr = u64_to_user_ptr(config.data_uptr);
> sts = memdup_array_user(uptr, config.count, sizeof(u16));
> sz = config.count * sizeof(u16);
> if (IS_ERR(sts))
> return PTR_ERR(sts);
>
> if (pcie_tph_enabled_req_type(pdev) == PCI_TPH_REQ_DISABLE) {
> memcpy(&vdev->tph_st_shadow[config.index], sts, sz);
> kfree(sts);
> return 0;
> }
>
> for (i = 0; i < config.count; i++) {
> idx = config.index + i;
> ret = pcie_tph_set_st_entry(pdev, idx, sts[i]);
> if (ret)
> goto rollback;
> }
[Severity: High]
This loop calls pcie_tph_set_st_entry() to modify hardware ST entries while
pcie_tph_enabled_req_type(pdev) != PCI_TPH_REQ_DISABLE. Doesn't this violate
the PCIe spec by modifying the Steering Tag Table while the TPH Requester Enable
bit is 1?
[Severity: Medium]
According to the cleanup subsystem guidelines, mixing goto-based error handling
labels with scope-based guard() declarations in the same function is not allowed.
Since guard(mutex)(&vdev->tph_lock) is used at the start of this block, and
goto rollback is used here for error handling, could this be refactored to
fully adopt scope-based cleanup like __free(kfree) for the sts array?
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260528124649.14732-1-fengchengwen@huawei.com?part=8
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v14 0/8] vfio/pci: Add PCIe TPH support
2026-05-28 12:46 [PATCH v14 0/8] vfio/pci: Add PCIe TPH support Chengwen Feng
` (7 preceding siblings ...)
2026-05-28 12:46 ` [PATCH v14 8/8] vfio/pci: Virtualize PCIe TPH capability registers Chengwen Feng
@ 2026-06-01 15:58 ` Alex Williamson
2026-06-02 14:46 ` fengchengwen
8 siblings, 1 reply; 22+ messages in thread
From: Alex Williamson @ 2026-06-01 15:58 UTC (permalink / raw)
To: Chengwen Feng
Cc: jgg, wathsala.vithanage, helgaas, wei.huang2, zhipingz, wangzhou1,
wangyushan12, liuyonglong, kvm, linux-pci, alex
On Thu, 28 May 2026 20:46:41 +0800
Chengwen Feng <fengchengwen@huawei.com> wrote:
> This patchset enables userspace control over PCIe TPH steering tags,
> motivated by the following considerations:
>
> 1. Why userspace needs the capability to control steering tags:
> When PCIe devices are fully owned by userspace workloads such as DPDK
> and SPDK, only userspace has full knowledge of core binding policies
> and traffic distribution strategies. Without this series, userspace
> cannot enable TPH or configure steering tags, leaving built-in PCIe
> performance optimizations unused in high-throughput polling I/O
> scenarios.
>
> 2. Why this interface must be implemented in VFIO:
> VFIO is the standard, secure community solution for granting full
> PCIe device ownership to userspace. Existing kernel TPH interfaces
> are designed purely for in-kernel drivers. For user-owned devices,
> VFIO provides the only isolated and correct path to expose per-device
> TPH management.
>
> TPH supports both IV and DS modes. Since both modes could introduces
> cross-VM isolation risks such as untrusted guests programming arbitrary
> steering tags to impact other domains:
> 1. If ST location in MSI-X table, untrusted guests may program the
> MSI-X table.
> 2. If ST don't locate in MSI-X or CAP, untrusted guests may program the
> device-specific register.
> So a new module parameter `enable_unsafe_tph` is added. It defaults to
> off, and blocks all unsafe TPH operations when disabled.
>
> Based on earlier RFC work by Wathsala Vithanage
>
> ---
> v14:
> - Return PCI_TPH_LOC_NONE when !CONFIG_PCIE_TPH accord Alex's comment
> - Fix Sashiko comments:
> - Clear ST shadow state across user session
> - Fix out-of-bounds byte masking in vfio_pci_tph_config_read
> v13:
> - Fix Alex's comments:
> - Add virtualize of TPH request type
> - Adopt two feature
- Patch 2/ sets tph_cap to zero if TPH cannot be enabled, but this is
never propagated to vfio-pci to drop the TPH capability from the
capability chain.
- Patch 3/ exposing the enum to callers is a bit ugly, maybe we should
retain pci_enable_tph() as the "auto" wrapper, resulting in no
change to existing users, and add _ext and _std wrappers.
- Patch 4/ same, multiple wrappers keeps the "auto" api cleaner. Also
note that auto is broken before TPH is enabled when tph_req_type is
zero.
- Patch 6/:
- mutex_init/destroy belong in vfio_pci_core_init_dev/release_dev()
- Re-order to avoid forward declarations.
- vfio_check_feature() does not require declaring PROBE as a
supported op.
- pcie_tph_set_st_entry() can write to MMIO space without testing
for or enabling device memory or power state.
- tph_init allocates shadow table regardless of opt-in for tph
support.
- Redundant tests for count/index in feature function.
- Duplicate memcpy/kfree in feature function.
- Patch 7/:
- Unclear why it's vfio-pci's responsibility to provide this
translation interface.
- PROBE is unnecessary for vfio_check_feature().
- Array size conversion is strange.
- Patch 8/:
- default config read handles emulated bits. Clear the bit if
necessary in vconfig, use p_setd() to mark PCI_TPH_CAP_EXT_TPH
virtualized, read-only.
- Follow examples like vfio_pm_config_write, vfio_vpd_config_write,
vfio_exp_config_write, etc. Set the virtualized and writable
bits, perform default config write, evaluate if anything was
enabled, effect the change in hardware via kernel APIs.
- pcie_disable_tph() happens after the restore state is recorded.
- TPH capability becomes read/write, an ABI change to the user, with
only a host-based opt-in. Seems like the userspace driver also
needs to opt-in to the ABI change.
- Memory enable not assured for pcie_tph_set_st_entry().
- Sashiko notes there are existing hazards in the tph_enabled
storage unit relative to multiple bitfields that can be updated
concurrently. The problem is worse with this exposed through
vfio-pci.
Thanks,
Alex
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v14 0/8] vfio/pci: Add PCIe TPH support
2026-06-01 15:58 ` [PATCH v14 0/8] vfio/pci: Add PCIe TPH support Alex Williamson
@ 2026-06-02 14:46 ` fengchengwen
2026-06-02 23:08 ` Alex Williamson
0 siblings, 1 reply; 22+ messages in thread
From: fengchengwen @ 2026-06-02 14:46 UTC (permalink / raw)
To: Alex Williamson
Cc: jgg, wathsala.vithanage, helgaas, wei.huang2, zhipingz, wangzhou1,
wangyushan12, liuyonglong, kvm, linux-pci
Hi Alex,
On 6/1/2026 11:58 PM, Alex Williamson wrote:
> On Thu, 28 May 2026 20:46:41 +0800
> Chengwen Feng <fengchengwen@huawei.com> wrote:
>
>> This patchset enables userspace control over PCIe TPH steering tags,
>> motivated by the following considerations:
>>
>> 1. Why userspace needs the capability to control steering tags:
>> When PCIe devices are fully owned by userspace workloads such as DPDK
>> and SPDK, only userspace has full knowledge of core binding policies
>> and traffic distribution strategies. Without this series, userspace
>> cannot enable TPH or configure steering tags, leaving built-in PCIe
>> performance optimizations unused in high-throughput polling I/O
>> scenarios.
>>
>> 2. Why this interface must be implemented in VFIO:
>> VFIO is the standard, secure community solution for granting full
>> PCIe device ownership to userspace. Existing kernel TPH interfaces
>> are designed purely for in-kernel drivers. For user-owned devices,
>> VFIO provides the only isolated and correct path to expose per-device
>> TPH management.
>>
>> TPH supports both IV and DS modes. Since both modes could introduces
>> cross-VM isolation risks such as untrusted guests programming arbitrary
>> steering tags to impact other domains:
>> 1. If ST location in MSI-X table, untrusted guests may program the
>> MSI-X table.
>> 2. If ST don't locate in MSI-X or CAP, untrusted guests may program the
>> device-specific register.
>> So a new module parameter `enable_unsafe_tph` is added. It defaults to
>> off, and blocks all unsafe TPH operations when disabled.
>>
>> Based on earlier RFC work by Wathsala Vithanage
>>
>> ---
>> v14:
>> - Return PCI_TPH_LOC_NONE when !CONFIG_PCIE_TPH accord Alex's comment
>> - Fix Sashiko comments:
>> - Clear ST shadow state across user session
>> - Fix out-of-bounds byte masking in vfio_pci_tph_config_read
>> v13:
>> - Fix Alex's comments:
>> - Add virtualize of TPH request type
>> - Adopt two feature
>
>
> - Patch 2/ sets tph_cap to zero if TPH cannot be enabled, but this is
> never propagated to vfio-pci to drop the TPH capability from the
> capability chain.
>
> - Patch 3/ exposing the enum to callers is a bit ugly, maybe we should
> retain pci_enable_tph() as the "auto" wrapper, resulting in no
> change to existing users, and add _ext and _std wrappers.
>
> - Patch 4/ same, multiple wrappers keeps the "auto" api cleaner. Also
> note that auto is broken before TPH is enabled when tph_req_type is
> zero.
>
> - Patch 6/:
> - mutex_init/destroy belong in vfio_pci_core_init_dev/release_dev()
> - Re-order to avoid forward declarations.
> - vfio_check_feature() does not require declaring PROBE as a
> supported op.
> - pcie_tph_set_st_entry() can write to MMIO space without testing
> for or enabling device memory or power state.
> - tph_init allocates shadow table regardless of opt-in for tph
> support.
> - Redundant tests for count/index in feature function.
> - Duplicate memcpy/kfree in feature function.
>
> - Patch 7/:
> - Unclear why it's vfio-pci's responsibility to provide this
> translation interface.
> - PROBE is unnecessary for vfio_check_feature().
> - Array size conversion is strange.
>
> - Patch 8/:
> - default config read handles emulated bits. Clear the bit if
> necessary in vconfig, use p_setd() to mark PCI_TPH_CAP_EXT_TPH
> virtualized, read-only.
> - Follow examples like vfio_pm_config_write, vfio_vpd_config_write,
> vfio_exp_config_write, etc. Set the virtualized and writable
> bits, perform default config write, evaluate if anything was
> enabled, effect the change in hardware via kernel APIs.
> - pcie_disable_tph() happens after the restore state is recorded.
> - TPH capability becomes read/write, an ABI change to the user, with
> only a host-based opt-in. Seems like the userspace driver also
> needs to opt-in to the ABI change.
> - Memory enable not assured for pcie_tph_set_st_entry().
> - Sashiko notes there are existing hazards in the tph_enabled
> storage unit relative to multiple bitfields that can be updated
> concurrently. The problem is worse with this exposed through
> vfio-pci.
Thanks for review, all your comments are resolved in v15:
- Patch2: Hide TPH capability in vfio ext cap chain when TPH unsupported
via vfio_ext_cap_len returning zero.
- Patch3/4: Preserve original pcie_enable_tph() auto API intact, add
_explicit helper for vfio; fix auto enable corner case with zero req_type.
- Patch6: relocate tph init/destroy to core init/release dev, remove redundant
PROBE, range check, duplicate kfree; add power check for set_st_entry; shadow
allocated only with host opt-in + hw TPH present.
- Patch7: drop redundant translation helper, unused PROBE, weird array size
conversion.
- Patch8: emulate EXT_TPH bit via vconfig + p_setd readonly; follow standard
vfio config write flow(vconfig first then hw sync); reorder disable_tph after
save_state; new TPH ABI gated solely by host-side enable_unsafe_tph; add
pre-write device state check.
v15 posted for further review.
Thanks
>
> Thanks,
> Alex
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v14 0/8] vfio/pci: Add PCIe TPH support
2026-06-02 14:46 ` fengchengwen
@ 2026-06-02 23:08 ` Alex Williamson
2026-06-03 0:34 ` fengchengwen
0 siblings, 1 reply; 22+ messages in thread
From: Alex Williamson @ 2026-06-02 23:08 UTC (permalink / raw)
To: fengchengwen
Cc: jgg, wathsala.vithanage, helgaas, wei.huang2, zhipingz, wangzhou1,
wangyushan12, liuyonglong, kvm, linux-pci, alex
On Tue, 2 Jun 2026 22:46:13 +0800
fengchengwen <fengchengwen@huawei.com> wrote:
> On 6/1/2026 11:58 PM, Alex Williamson wrote:
> >
> >
> > - Patch 2/ sets tph_cap to zero if TPH cannot be enabled, but this is
> > never propagated to vfio-pci to drop the TPH capability from the
> > capability chain.
> >
> > - Patch 3/ exposing the enum to callers is a bit ugly, maybe we should
> > retain pci_enable_tph() as the "auto" wrapper, resulting in no
> > change to existing users, and add _ext and _std wrappers.
> >
> > - Patch 4/ same, multiple wrappers keeps the "auto" api cleaner. Also
> > note that auto is broken before TPH is enabled when tph_req_type is
> > zero.
> >
> > - Patch 6/:
> > - mutex_init/destroy belong in vfio_pci_core_init_dev/release_dev()
> > - Re-order to avoid forward declarations.
> > - vfio_check_feature() does not require declaring PROBE as a
> > supported op.
> > - pcie_tph_set_st_entry() can write to MMIO space without testing
> > for or enabling device memory or power state.
> > - tph_init allocates shadow table regardless of opt-in for tph
> > support.
> > - Redundant tests for count/index in feature function.
> > - Duplicate memcpy/kfree in feature function.
> >
> > - Patch 7/:
> > - Unclear why it's vfio-pci's responsibility to provide this
> > translation interface.
> > - PROBE is unnecessary for vfio_check_feature().
> > - Array size conversion is strange.
> >
> > - Patch 8/:
> > - default config read handles emulated bits. Clear the bit if
> > necessary in vconfig, use p_setd() to mark PCI_TPH_CAP_EXT_TPH
> > virtualized, read-only.
> > - Follow examples like vfio_pm_config_write, vfio_vpd_config_write,
> > vfio_exp_config_write, etc. Set the virtualized and writable
> > bits, perform default config write, evaluate if anything was
> > enabled, effect the change in hardware via kernel APIs.
> > - pcie_disable_tph() happens after the restore state is recorded.
> > - TPH capability becomes read/write, an ABI change to the user, with
> > only a host-based opt-in. Seems like the userspace driver also
> > needs to opt-in to the ABI change.
> > - Memory enable not assured for pcie_tph_set_st_entry().
> > - Sashiko notes there are existing hazards in the tph_enabled
> > storage unit relative to multiple bitfields that can be updated
> > concurrently. The problem is worse with this exposed through
> > vfio-pci.
>
> Thanks for review, all your comments are resolved in v15:
Not everything. At a glance I can see that v15 is still pushing for
vfio-pci to provide the interface to translate CPU IDs to steering
tags, despite my push back above that it doesn't seem within vfio-pci's
scope to provide a CPU ID -> ST interface.
The host-only opt-in issue above also doesn't appear to be addressed or
rebutted.
Typically v15 would be an indication that a series is coming to
maturity, but with the quick cadence and lack of discussion throughout
this development, I'm not even sure we've arrived at a proper uAPI for
this feature. Thanks,
Alex
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v14 0/8] vfio/pci: Add PCIe TPH support
2026-06-02 23:08 ` Alex Williamson
@ 2026-06-03 0:34 ` fengchengwen
2026-06-03 0:45 ` Jason Gunthorpe
2026-06-03 17:58 ` Alex Williamson
0 siblings, 2 replies; 22+ messages in thread
From: fengchengwen @ 2026-06-03 0:34 UTC (permalink / raw)
To: Alex Williamson
Cc: jgg, wathsala.vithanage, helgaas, wei.huang2, zhipingz, wangzhou1,
wangyushan12, liuyonglong, kvm, linux-pci
On 6/3/2026 7:08 AM, Alex Williamson wrote:
> On Tue, 2 Jun 2026 22:46:13 +0800
> fengchengwen <fengchengwen@huawei.com> wrote:
>> On 6/1/2026 11:58 PM, Alex Williamson wrote:
>>>
>>>
>>> - Patch 2/ sets tph_cap to zero if TPH cannot be enabled, but this is
>>> never propagated to vfio-pci to drop the TPH capability from the
>>> capability chain.
>>>
>>> - Patch 3/ exposing the enum to callers is a bit ugly, maybe we should
>>> retain pci_enable_tph() as the "auto" wrapper, resulting in no
>>> change to existing users, and add _ext and _std wrappers.
>>>
>>> - Patch 4/ same, multiple wrappers keeps the "auto" api cleaner. Also
>>> note that auto is broken before TPH is enabled when tph_req_type is
>>> zero.
>>>
>>> - Patch 6/:
>>> - mutex_init/destroy belong in vfio_pci_core_init_dev/release_dev()
>>> - Re-order to avoid forward declarations.
>>> - vfio_check_feature() does not require declaring PROBE as a
>>> supported op.
>>> - pcie_tph_set_st_entry() can write to MMIO space without testing
>>> for or enabling device memory or power state.
>>> - tph_init allocates shadow table regardless of opt-in for tph
>>> support.
>>> - Redundant tests for count/index in feature function.
>>> - Duplicate memcpy/kfree in feature function.
>>>
>>> - Patch 7/:
>>> - Unclear why it's vfio-pci's responsibility to provide this
>>> translation interface.
>>> - PROBE is unnecessary for vfio_check_feature().
>>> - Array size conversion is strange.
>>>
>>> - Patch 8/:
>>> - default config read handles emulated bits. Clear the bit if
>>> necessary in vconfig, use p_setd() to mark PCI_TPH_CAP_EXT_TPH
>>> virtualized, read-only.
>>> - Follow examples like vfio_pm_config_write, vfio_vpd_config_write,
>>> vfio_exp_config_write, etc. Set the virtualized and writable
>>> bits, perform default config write, evaluate if anything was
>>> enabled, effect the change in hardware via kernel APIs.
>>> - pcie_disable_tph() happens after the restore state is recorded.
>>> - TPH capability becomes read/write, an ABI change to the user, with
>>> only a host-based opt-in. Seems like the userspace driver also
>>> needs to opt-in to the ABI change.
>>> - Memory enable not assured for pcie_tph_set_st_entry().
>>> - Sashiko notes there are existing hazards in the tph_enabled
>>> storage unit relative to multiple bitfields that can be updated
>>> concurrently. The problem is worse with this exposed through
>>> vfio-pci.
>>
>> Thanks for review, all your comments are resolved in v15:
>
> Not everything. At a glance I can see that v15 is still pushing for
> vfio-pci to provide the interface to translate CPU IDs to steering
> tags, despite my push back above that it doesn't seem within vfio-pci's
> scope to provide a CPU ID -> ST interface.
This translation helper is required to handle varied TPH configurations,
including cases where ST entries live inside device private registers
instead of MSI-X or TPH capability space.
Prior discussion with Jason (
ref: https://lore.kernel.org/all/20260414151125.GF2577880@ziepe.ca/)
indicated VFIO-PCI is the correct component for per-device TPH
implementation.
Production high-speed NICs from multiple vendors rely on this CPU-to-ST
mapping logic for DPDK optimized polling I/O; removing this helper would
render the whole TPH userspace control feature unusable for existing
target workloads.
Therefore, I believe it is appropriate to support it.
>
> The host-only opt-in issue above also doesn't appear to be addressed or
> rebutted.
enable_unsafe_tph host opt-in already fully gates all new TPH ABI:
TPH cap + feature ioctl are completely hidden from userspace when off,
no userspace opt-in needed.
>
> Typically v15 would be an indication that a series is coming to
> maturity, but with the quick cadence and lack of discussion throughout
> this development, I'm not even sure we've arrived at a proper uAPI for
> this feature. Thanks,
Sorry for sending frequent quick updates without leaving adequate window
for list review.
This feature originates from an earlier ARM prototype posted last year for
DPDK TPH acceleration scenarios. Based on that foundational exploration,
this series has refined the interface through continuous upstream review
feedback.
All recent rapid iterations are code improvements and review fixes.
Thanks
>
> Alex
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v14 0/8] vfio/pci: Add PCIe TPH support
2026-06-03 0:34 ` fengchengwen
@ 2026-06-03 0:45 ` Jason Gunthorpe
2026-06-03 1:25 ` fengchengwen
2026-06-03 18:53 ` Alex Williamson
2026-06-03 17:58 ` Alex Williamson
1 sibling, 2 replies; 22+ messages in thread
From: Jason Gunthorpe @ 2026-06-03 0:45 UTC (permalink / raw)
To: fengchengwen
Cc: Alex Williamson, wathsala.vithanage, helgaas, wei.huang2,
zhipingz, wangzhou1, wangyushan12, liuyonglong, kvm, linux-pci
On Wed, Jun 03, 2026 at 08:34:53AM +0800, fengchengwen wrote:
> This translation helper is required to handle varied TPH configurations,
> including cases where ST entries live inside device private registers
> instead of MSI-X or TPH capability space.
I thought we agreed vfio had to block that feature in config space due
to security concerns? That's what enable_unsafe_tph is about, right?
> Prior discussion with Jason (
> ref: https://lore.kernel.org/all/20260414151125.GF2577880@ziepe.ca/)
> indicated VFIO-PCI is the correct component for per-device TPH
> implementation.
> > The host-only opt-in issue above also doesn't appear to be addressed or
> > rebutted.
>
> enable_unsafe_tph host opt-in already fully gates all new TPH ABI:
> TPH cap + feature ioctl are completely hidden from userspace when off,
> no userspace opt-in needed.
I agree in general that the presence of enable_unsafe_tph VFIO has to
report the per-device steering tags to userspace so it can program the
device specific registers, nothing else can do this.
But I don't really get what is going on with all the new uAPI
proposed.
I don't understand what VFIO_DEVICE_FEATURE_TPH_ST_CONFIG is supposed
to be for, userspace should not be able to directly program registers
that are kernel controled. In this case it is 'safe' and it should be
only programmed by specifying abstract information like TPH_CPU_ST
does.
I'm also wondering why it needs a shadow entry, that doesn't seem
normal for config space tables..
Jason
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v14 0/8] vfio/pci: Add PCIe TPH support
2026-06-03 0:45 ` Jason Gunthorpe
@ 2026-06-03 1:25 ` fengchengwen
2026-06-03 18:53 ` Alex Williamson
1 sibling, 0 replies; 22+ messages in thread
From: fengchengwen @ 2026-06-03 1:25 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Alex Williamson, wathsala.vithanage, helgaas, wei.huang2,
zhipingz, wangzhou1, wangyushan12, liuyonglong, kvm, linux-pci
On 6/3/2026 8:45 AM, Jason Gunthorpe wrote:
> On Wed, Jun 03, 2026 at 08:34:53AM +0800, fengchengwen wrote:
>
>> This translation helper is required to handle varied TPH configurations,
>> including cases where ST entries live inside device private registers
>> instead of MSI-X or TPH capability space.
>
> I thought we agreed vfio had to block that feature in config space due
> to security concerns? That's what enable_unsafe_tph is about, right?
Exactly. All ST related uAPIs are fully hidden and blocked when
enable_unsafe_tph is off, complying with our earlier security agreement.
>
>> Prior discussion with Jason (
>> ref: https://lore.kernel.org/all/20260414151125.GF2577880@ziepe.ca/)
>> indicated VFIO-PCI is the correct component for per-device TPH
>> implementation.
>
>>> The host-only opt-in issue above also doesn't appear to be addressed or
>>> rebutted.
>>
>> enable_unsafe_tph host opt-in already fully gates all new TPH ABI:
>> TPH cap + feature ioctl are completely hidden from userspace when off,
>> no userspace opt-in needed.
>
> I agree in general that the presence of enable_unsafe_tph VFIO has to
> report the per-device steering tags to userspace so it can program the
> device specific registers, nothing else can do this.
>
> But I don't really get what is going on with all the new uAPI
> proposed.
>
> I don't understand what VFIO_DEVICE_FEATURE_TPH_ST_CONFIG is supposed
> to be for, userspace should not be able to directly program registers
> that are kernel controled. In this case it is 'safe' and it should be
> only programmed by specifying abstract information like TPH_CPU_ST
> does.
Two separated feature ops are defined for clear layered programming model:
- TPH_CPU_ST: Query corresponding ST index via given CPU ID;
- TPH_ST_CONFIG: Write ST value, only valid for ST stored inside PCI
config space or MSI-X table.
Userspace workflow: Fetch ST via TPH_CPU_ST first, then pick proper
interface to set ST based on device ST storage location, lastly enable TPH.
Since the full device is bound to VFIO without in-kernel driver control,
letting userspace manage TPH/ST entirely is reasonable.
>
> I'm also wondering why it needs a shadow entry, that doesn't seem
> normal for config space tables..
Shadow table serves two key purposes:
1. Align with core PCI TPH API constraint which requires TPH enabled
before any ST programming, enabling consistent userspace programming
sequence;
2. Backup configured ST values to roll back hardware state during
device close for cross-session cleanup.
Thanks
>
> Jason
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v14 0/8] vfio/pci: Add PCIe TPH support
2026-06-03 0:34 ` fengchengwen
2026-06-03 0:45 ` Jason Gunthorpe
@ 2026-06-03 17:58 ` Alex Williamson
2026-06-04 1:58 ` fengchengwen
1 sibling, 1 reply; 22+ messages in thread
From: Alex Williamson @ 2026-06-03 17:58 UTC (permalink / raw)
To: fengchengwen
Cc: jgg, wathsala.vithanage, helgaas, wei.huang2, zhipingz, wangzhou1,
wangyushan12, liuyonglong, kvm, linux-pci, alex
On Wed, 3 Jun 2026 08:34:53 +0800
fengchengwen <fengchengwen@huawei.com> wrote:
> On 6/3/2026 7:08 AM, Alex Williamson wrote:
> > On Tue, 2 Jun 2026 22:46:13 +0800
> > fengchengwen <fengchengwen@huawei.com> wrote:
> >> On 6/1/2026 11:58 PM, Alex Williamson wrote:
> >>>
> >>>
> >>> - Patch 2/ sets tph_cap to zero if TPH cannot be enabled, but this is
> >>> never propagated to vfio-pci to drop the TPH capability from the
> >>> capability chain.
> >>>
> >>> - Patch 3/ exposing the enum to callers is a bit ugly, maybe we should
> >>> retain pci_enable_tph() as the "auto" wrapper, resulting in no
> >>> change to existing users, and add _ext and _std wrappers.
> >>>
> >>> - Patch 4/ same, multiple wrappers keeps the "auto" api cleaner. Also
> >>> note that auto is broken before TPH is enabled when tph_req_type is
> >>> zero.
> >>>
> >>> - Patch 6/:
> >>> - mutex_init/destroy belong in vfio_pci_core_init_dev/release_dev()
> >>> - Re-order to avoid forward declarations.
> >>> - vfio_check_feature() does not require declaring PROBE as a
> >>> supported op.
> >>> - pcie_tph_set_st_entry() can write to MMIO space without testing
> >>> for or enabling device memory or power state.
> >>> - tph_init allocates shadow table regardless of opt-in for tph
> >>> support.
> >>> - Redundant tests for count/index in feature function.
> >>> - Duplicate memcpy/kfree in feature function.
> >>>
> >>> - Patch 7/:
> >>> - Unclear why it's vfio-pci's responsibility to provide this
> >>> translation interface.
> >>> - PROBE is unnecessary for vfio_check_feature().
> >>> - Array size conversion is strange.
> >>>
> >>> - Patch 8/:
> >>> - default config read handles emulated bits. Clear the bit if
> >>> necessary in vconfig, use p_setd() to mark PCI_TPH_CAP_EXT_TPH
> >>> virtualized, read-only.
> >>> - Follow examples like vfio_pm_config_write, vfio_vpd_config_write,
> >>> vfio_exp_config_write, etc. Set the virtualized and writable
> >>> bits, perform default config write, evaluate if anything was
> >>> enabled, effect the change in hardware via kernel APIs.
> >>> - pcie_disable_tph() happens after the restore state is recorded.
> >>> - TPH capability becomes read/write, an ABI change to the user, with
> >>> only a host-based opt-in. Seems like the userspace driver also
> >>> needs to opt-in to the ABI change.
> >>> - Memory enable not assured for pcie_tph_set_st_entry().
> >>> - Sashiko notes there are existing hazards in the tph_enabled
> >>> storage unit relative to multiple bitfields that can be updated
> >>> concurrently. The problem is worse with this exposed through
> >>> vfio-pci.
> >>
> >> Thanks for review, all your comments are resolved in v15:
> >
> > Not everything. At a glance I can see that v15 is still pushing for
> > vfio-pci to provide the interface to translate CPU IDs to steering
> > tags, despite my push back above that it doesn't seem within vfio-pci's
> > scope to provide a CPU ID -> ST interface.
>
> This translation helper is required to handle varied TPH configurations,
> including cases where ST entries live inside device private registers
> instead of MSI-X or TPH capability space.
>
> Prior discussion with Jason (
> ref: https://lore.kernel.org/all/20260414151125.GF2577880@ziepe.ca/)
> indicated VFIO-PCI is the correct component for per-device TPH
> implementation.
>
> Production high-speed NICs from multiple vendors rely on this CPU-to-ST
> mapping logic for DPDK optimized polling I/O; removing this helper would
> render the whole TPH userspace control feature unusable for existing
> target workloads.
>
> Therefore, I believe it is appropriate to support it.
The citation is only describing that vfio should provide a way to
program the steering tags. That's not the issue I'm raising.
The proposed uAPI has morphed from one where the user provides CPU IDs,
which are translated to STs and written to the device or returned to
support device specific implementations, to one where raw STs are used.
I think this was a result of my question of how this interface
integrates with the concurrently proposed interface where drivers can
expose TPH values for a dmabuf. Suddenly the user is managing raw TPH
values and what used to be an integrated feature of performing CPU ID
to ST value, with necessary ugliness of exposing that raw ST value for
device specific support, becomes largely a standalone translation
interface, which might be more generically implemented elsewhere in the
kernel. It largely loses any direct association to the vfio aspect of
the device.
The interface now looks more like one where raw STs are provided to
vfio-pci through various sources. It then becomes unclear why vfio-pci
itself is necessarily one of those sources.
> > The host-only opt-in issue above also doesn't appear to be addressed or
> > rebutted.
>
> enable_unsafe_tph host opt-in already fully gates all new TPH ABI:
> TPH cap + feature ioctl are completely hidden from userspace when off,
> no userspace opt-in needed.
I'd argue that opt-in on the host doesn't mean we should ignore the ABI
change to userspace. There's an entire virtualization stack that needs
to be built on this change, intercepting writes to the TPH control
register, MSI-X vectors, and capability ST entries. The host opt-in is
a global switch, it might enable your userspace driver, but it might
also break QEMU, where writes to the TPH capability had been dropped
previously. A per instance opt-in, gated by the host opt-in, allows
seamless integration of the new feature.
> > Typically v15 would be an indication that a series is coming to
> > maturity, but with the quick cadence and lack of discussion throughout
> > this development, I'm not even sure we've arrived at a proper uAPI for
> > this feature. Thanks,
>
> Sorry for sending frequent quick updates without leaving adequate window
> for list review.
>
> This feature originates from an earlier ARM prototype posted last year for
> DPDK TPH acceleration scenarios. Based on that foundational exploration,
> this series has refined the interface through continuous upstream review
> feedback.
>
> All recent rapid iterations are code improvements and review fixes.
The issue is that feedback like above is represented as if it's been
addressed and changelogs are scant on details, requiring significant,
repeated effort to understand the actual changes between iterations.
The issues raised and interface design require discussion, not simply
assertions that it's correct and addressed. Thanks,
Alex
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v14 0/8] vfio/pci: Add PCIe TPH support
2026-06-03 0:45 ` Jason Gunthorpe
2026-06-03 1:25 ` fengchengwen
@ 2026-06-03 18:53 ` Alex Williamson
2026-06-04 18:33 ` Jason Gunthorpe
1 sibling, 1 reply; 22+ messages in thread
From: Alex Williamson @ 2026-06-03 18:53 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: fengchengwen, wathsala.vithanage, helgaas, wei.huang2, zhipingz,
wangzhou1, wangyushan12, liuyonglong, kvm, linux-pci, alex
On Tue, 2 Jun 2026 21:45:24 -0300
Jason Gunthorpe <jgg@ziepe.ca> wrote:
> On Wed, Jun 03, 2026 at 08:34:53AM +0800, fengchengwen wrote:
>
> > This translation helper is required to handle varied TPH configurations,
> > including cases where ST entries live inside device private registers
> > instead of MSI-X or TPH capability space.
>
> I thought we agreed vfio had to block that feature in config space due
> to security concerns? That's what enable_unsafe_tph is about, right?
Steering tags can effectively live in one of three places. The
architected locations are the TPH capability itself and the MSI-X
table, where support is mutually exclusive. However the device can
support either of these and still enable DS mode, thus essentially
a third location.
None of these are particularly more or less unsafe than another. It's
inadvisable for drivers to write to the MSI-X vector table, versus
using the SET_IRQS uAPI, but it's not prevented anymore via mmap. We
don't provide direct write access to the TPH capability, but backdoors
to config space are not uncommon. So if we consider TPH itself to be
unsafe, all forms of it present some degree of the same attack vector.
I'm actually still wrestling with the question whether any of this is
really unsafe beyond the scope of ownership of the device at all.
> > Prior discussion with Jason (
> > ref: https://lore.kernel.org/all/20260414151125.GF2577880@ziepe.ca/)
> > indicated VFIO-PCI is the correct component for per-device TPH
> > implementation.
>
> > > The host-only opt-in issue above also doesn't appear to be addressed or
> > > rebutted.
> >
> > enable_unsafe_tph host opt-in already fully gates all new TPH ABI:
> > TPH cap + feature ioctl are completely hidden from userspace when off,
> > no userspace opt-in needed.
>
> I agree in general that the presence of enable_unsafe_tph VFIO has to
> report the per-device steering tags to userspace so it can program the
> device specific registers, nothing else can do this.
What's the scope of "nothing else" here? The raw ST value for a CPU is
derived from an ACPI _DSM associated to a root port. So if the scope
is that the kernel needs to provide that translation, I agree, but
where in the kernel it's provided may still require some thought.
> But I don't really get what is going on with all the new uAPI
> proposed.
>
> I don't understand what VFIO_DEVICE_FEATURE_TPH_ST_CONFIG is supposed
> to be for, userspace should not be able to directly program registers
> that are kernel controled. In this case it is 'safe' and it should be
> only programmed by specifying abstract information like TPH_CPU_ST
> does.
>
> I'm also wondering why it needs a shadow entry, that doesn't seem
> normal for config space tables..
As above, the user owns the device and even in the ~safer~ storage
locations we can't really guarantee that the user cannot overwrite
MSI-X vector table entries or manipulate the device to insert values
into config space.
We also need to consider how this incorporates Zhiping's series, where
TPH values can be associated to a dmabuf. Would userspace get access
to those raw TPH values to program the initiator? It must for DS
support. Therefore the interface became one of the user handling raw
ST values and the CPU ID to ST translation became this ill-fitting
sidecar.
The shadow table comes about because we don't know until the mode is
selected whether we're using the DS (effective) location or one of the
architected locations. In fact, the internal kernel API doesn't even
let us even set a TPH value without TPH enabled, which presumes a
specific programming model that may not align with userspace.
I had envisioned that a shadow would also provide symmetry in the
interface for the DS implementation, but I was never really able to
make that stick with this rapid succession, no pause for clarification
or discussion, development style. Thanks,
Alex
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v14 0/8] vfio/pci: Add PCIe TPH support
2026-06-03 17:58 ` Alex Williamson
@ 2026-06-04 1:58 ` fengchengwen
0 siblings, 0 replies; 22+ messages in thread
From: fengchengwen @ 2026-06-04 1:58 UTC (permalink / raw)
To: Alex Williamson
Cc: jgg, wathsala.vithanage, helgaas, wei.huang2, zhipingz, wangzhou1,
wangyushan12, liuyonglong, kvm, linux-pci
On 6/4/2026 1:58 AM, Alex Williamson wrote:
> On Wed, 3 Jun 2026 08:34:53 +0800
> fengchengwen <fengchengwen@huawei.com> wrote:
>
>> On 6/3/2026 7:08 AM, Alex Williamson wrote:
>>> On Tue, 2 Jun 2026 22:46:13 +0800
>>> fengchengwen <fengchengwen@huawei.com> wrote:
>>>> On 6/1/2026 11:58 PM, Alex Williamson wrote:
>>>>>
>>>>>
>>>>> - Patch 2/ sets tph_cap to zero if TPH cannot be enabled, but this is
>>>>> never propagated to vfio-pci to drop the TPH capability from the
>>>>> capability chain.
>>>>>
>>>>> - Patch 3/ exposing the enum to callers is a bit ugly, maybe we should
>>>>> retain pci_enable_tph() as the "auto" wrapper, resulting in no
>>>>> change to existing users, and add _ext and _std wrappers.
>>>>>
>>>>> - Patch 4/ same, multiple wrappers keeps the "auto" api cleaner. Also
>>>>> note that auto is broken before TPH is enabled when tph_req_type is
>>>>> zero.
>>>>>
>>>>> - Patch 6/:
>>>>> - mutex_init/destroy belong in vfio_pci_core_init_dev/release_dev()
>>>>> - Re-order to avoid forward declarations.
>>>>> - vfio_check_feature() does not require declaring PROBE as a
>>>>> supported op.
>>>>> - pcie_tph_set_st_entry() can write to MMIO space without testing
>>>>> for or enabling device memory or power state.
>>>>> - tph_init allocates shadow table regardless of opt-in for tph
>>>>> support.
>>>>> - Redundant tests for count/index in feature function.
>>>>> - Duplicate memcpy/kfree in feature function.
>>>>>
>>>>> - Patch 7/:
>>>>> - Unclear why it's vfio-pci's responsibility to provide this
>>>>> translation interface.
>>>>> - PROBE is unnecessary for vfio_check_feature().
>>>>> - Array size conversion is strange.
>>>>>
>>>>> - Patch 8/:
>>>>> - default config read handles emulated bits. Clear the bit if
>>>>> necessary in vconfig, use p_setd() to mark PCI_TPH_CAP_EXT_TPH
>>>>> virtualized, read-only.
>>>>> - Follow examples like vfio_pm_config_write, vfio_vpd_config_write,
>>>>> vfio_exp_config_write, etc. Set the virtualized and writable
>>>>> bits, perform default config write, evaluate if anything was
>>>>> enabled, effect the change in hardware via kernel APIs.
>>>>> - pcie_disable_tph() happens after the restore state is recorded.
>>>>> - TPH capability becomes read/write, an ABI change to the user, with
>>>>> only a host-based opt-in. Seems like the userspace driver also
>>>>> needs to opt-in to the ABI change.
>>>>> - Memory enable not assured for pcie_tph_set_st_entry().
>>>>> - Sashiko notes there are existing hazards in the tph_enabled
>>>>> storage unit relative to multiple bitfields that can be updated
>>>>> concurrently. The problem is worse with this exposed through
>>>>> vfio-pci.
>>>>
>>>> Thanks for review, all your comments are resolved in v15:
>>>
>>> Not everything. At a glance I can see that v15 is still pushing for
>>> vfio-pci to provide the interface to translate CPU IDs to steering
>>> tags, despite my push back above that it doesn't seem within vfio-pci's
>>> scope to provide a CPU ID -> ST interface.
>>
>> This translation helper is required to handle varied TPH configurations,
>> including cases where ST entries live inside device private registers
>> instead of MSI-X or TPH capability space.
>>
>> Prior discussion with Jason (
>> ref: https://lore.kernel.org/all/20260414151125.GF2577880@ziepe.ca/)
>> indicated VFIO-PCI is the correct component for per-device TPH
>> implementation.
>>
>> Production high-speed NICs from multiple vendors rely on this CPU-to-ST
>> mapping logic for DPDK optimized polling I/O; removing this helper would
>> render the whole TPH userspace control feature unusable for existing
>> target workloads.
>>
>> Therefore, I believe it is appropriate to support it.
>
> The citation is only describing that vfio should provide a way to
> program the steering tags. That's not the issue I'm raising.
Sorry, it's my fault. The interface from CPU to ST is strongly related to
the user-space usage of TPH. If we place it in something like sysfs, it
might not be focused, so it's placed here.
>
> The proposed uAPI has morphed from one where the user provides CPU IDs,
> which are translated to STs and written to the device or returned to
> support device specific implementations, to one where raw STs are used.
>
> I think this was a result of my question of how this interface
> integrates with the concurrently proposed interface where drivers can
> expose TPH values for a dmabuf. Suddenly the user is managing raw TPH
> values and what used to be an integrated feature of performing CPU ID
> to ST value, with necessary ugliness of exposing that raw ST value for
> device specific support, becomes largely a standalone translation
> interface, which might be more generically implemented elsewhere in the
> kernel. It largely loses any direct association to the vfio aspect of
> the device.
>
> The interface now looks more like one where raw STs are provided to
> vfio-pci through various sources. It then becomes unclear why vfio-pci
> itself is necessarily one of those sources.
According to section 7.5.3.15 in the PCIe protocol, "TPH Completer Supported
- Value indicates Completer support for TPH or Extended TPH. Applicable only
to Root Ports and Endpoints." This means that the steering-tag can be sent to
the host CPU or another device (that is, in the P2P scenario).
It is considered that the current implementation is flexible, and it can handle
both of the preceding two scenarios.
- For the stash to the host CPU, the steering-tag value can be obtained through
the CPU_ST.
- For the stash to another device, the steering-tag value can be obtained in other
ways (non-kernel mode).
Once the steering-tag value is obtained, the following question arises: how to
write the value to the device?
- If the device supports the standard ST table (for example, in the configuration
space or MSI-X table), the ST_CONFIG interface is called to write the value.
- If the device does not support the standard ST table, the driver completes the
operation by itself.
I will add this PCIe-spec based design description into v16 cover letter.
>
>>> The host-only opt-in issue above also doesn't appear to be addressed or
>>> rebutted.
>>
>> enable_unsafe_tph host opt-in already fully gates all new TPH ABI:
>> TPH cap + feature ioctl are completely hidden from userspace when off,
>> no userspace opt-in needed.
>
> I'd argue that opt-in on the host doesn't mean we should ignore the ABI
> change to userspace. There's an entire virtualization stack that needs
> to be built on this change, intercepting writes to the TPH control
> register, MSI-X vectors, and capability ST entries. The host opt-in is
> a global switch, it might enable your userspace driver, but it might
> also break QEMU, where writes to the TPH capability had been dropped
> previously. A per instance opt-in, gated by the host opt-in, allows
> seamless integration of the new feature.
Fully understand, I will:
- Introduce new VFIO_DEVICE_FEATURE_TPH_ENABLE as per-device opt-in; each
vfio_pci_device defaults tph_abi_enable = false.
- With enable_unsafe_tph disabled globally: TPH cap remains read-only, all
new TPH feature ops are hidden completely, consistent with legacy kernel.
- When enable_unsafe_tph is enabled but userspace does not SET this new
feature: TPH capability keeps read-only, all guest writes to TPH_CTRL are
silently dropped same as original implementation, existing QEMU runs with
zero modification and no regression.
- Only after userspace explicitly SET VFIO_DEVICE_FEATURE_TPH_ENABLE on
specific device: TPH_CTRL gets WRITE permission and full TPH uAPIs become
available for that single device.
>
>>> Typically v15 would be an indication that a series is coming to
>>> maturity, but with the quick cadence and lack of discussion throughout
>>> this development, I'm not even sure we've arrived at a proper uAPI for
>>> this feature. Thanks,
>>
>> Sorry for sending frequent quick updates without leaving adequate window
>> for list review.
>>
>> This feature originates from an earlier ARM prototype posted last year for
>> DPDK TPH acceleration scenarios. Based on that foundational exploration,
>> this series has refined the interface through continuous upstream review
>> feedback.
>>
>> All recent rapid iterations are code improvements and review fixes.
>
> The issue is that feedback like above is represented as if it's been
> addressed and changelogs are scant on details, requiring significant,
> repeated effort to understand the actual changes between iterations.
> The issues raised and interface design require discussion, not simply
> assertions that it's correct and addressed. Thanks,
I apologize for sparse changelog content and insufficient design discussion
along quick version iterations. Starting from v16 submission:
- Slow down patch posting cadence and reserve enough time for community
design discussion before sending new revision.
- Enrich detailed per-patch changelog items to explicitly list every review
induced UAPI and code modification for easier tracking.
Thanks
>
> Alex
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v14 0/8] vfio/pci: Add PCIe TPH support
2026-06-03 18:53 ` Alex Williamson
@ 2026-06-04 18:33 ` Jason Gunthorpe
2026-06-04 20:46 ` Alex Williamson
0 siblings, 1 reply; 22+ messages in thread
From: Jason Gunthorpe @ 2026-06-04 18:33 UTC (permalink / raw)
To: Alex Williamson
Cc: fengchengwen, wathsala.vithanage, helgaas, wei.huang2, zhipingz,
wangzhou1, wangyushan12, liuyonglong, kvm, linux-pci
On Wed, Jun 03, 2026 at 12:53:08PM -0600, Alex Williamson wrote:
> On Tue, 2 Jun 2026 21:45:24 -0300
> Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> > On Wed, Jun 03, 2026 at 08:34:53AM +0800, fengchengwen wrote:
> >
> > > This translation helper is required to handle varied TPH configurations,
> > > including cases where ST entries live inside device private registers
> > > instead of MSI-X or TPH capability space.
> >
> > I thought we agreed vfio had to block that feature in config space due
> > to security concerns? That's what enable_unsafe_tph is about, right?
>
> Steering tags can effectively live in one of three places. The
> architected locations are the TPH capability itself and the MSI-X
> table, where support is mutually exclusive. However the device can
> support either of these and still enable DS mode, thus essentially
> a third location.
>
> None of these are particularly more or less unsafe than another. It's
> inadvisable for drivers to write to the MSI-X vector table, versus
> using the SET_IRQS uAPI, but it's not prevented anymore via mmap.
We'd have to go back to preventing it..
> We don't provide direct write access to the TPH capability, but
> backdoors to config space are not uncommon.
I think we can discount this, if a device exposes TPH through config
then it should keep it secure.
> I'm actually still wrestling with the question whether any of this is
> really unsafe beyond the scope of ownership of the device at all.
Well there is certainly some kind of DOS argument for calling it
unsafe, and we don't actually know what effects the TPH value even
has.
I'd be much more worried about some "smart" accelerator device doing
something impactful based on TPH than insecure config space for
example.
> > I agree in general that the presence of enable_unsafe_tph VFIO has to
> > report the per-device steering tags to userspace so it can program the
> > device specific registers, nothing else can do this.
>
> What's the scope of "nothing else" here? The raw ST value for a CPU is
> derived from an ACPI _DSM associated to a root port. So if the scope
> is that the kernel needs to provide that translation, I agree, but
> where in the kernel it's provided may still require some thought.
The APIs accept a PCI device:
int pcie_tph_get_cpu_st(struct pci_dev *pdev, enum tph_mem_type mem_type,
unsigned int cpu, u16 *tag)
IIRC the ACPI allows it to vary based on the source. Nothing else
could provide the PCI device owned by VFIO to these functions..
> We also need to consider how this incorporates Zhiping's series, where
> TPH values can be associated to a dmabuf. Would userspace get access
> to those raw TPH values to program the initiator? It must for DS
> support. Therefore the interface became one of the user handling raw
> ST values and the CPU ID to ST translation became this ill-fitting
> sidecar.
Arguably yes, likely iommufd would need an API to return the ST of the
dmabuf when it maps it.
Jason
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v14 0/8] vfio/pci: Add PCIe TPH support
2026-06-04 18:33 ` Jason Gunthorpe
@ 2026-06-04 20:46 ` Alex Williamson
0 siblings, 0 replies; 22+ messages in thread
From: Alex Williamson @ 2026-06-04 20:46 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: fengchengwen, wathsala.vithanage, helgaas, wei.huang2, zhipingz,
wangzhou1, wangyushan12, liuyonglong, kvm, linux-pci, alex
On Thu, 4 Jun 2026 15:33:02 -0300
Jason Gunthorpe <jgg@ziepe.ca> wrote:
> On Wed, Jun 03, 2026 at 12:53:08PM -0600, Alex Williamson wrote:
> > On Tue, 2 Jun 2026 21:45:24 -0300
> > Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >
> > > On Wed, Jun 03, 2026 at 08:34:53AM +0800, fengchengwen wrote:
> > >
> > > > This translation helper is required to handle varied TPH configurations,
> > > > including cases where ST entries live inside device private registers
> > > > instead of MSI-X or TPH capability space.
> > >
> > > I thought we agreed vfio had to block that feature in config space due
> > > to security concerns? That's what enable_unsafe_tph is about, right?
> >
> > Steering tags can effectively live in one of three places. The
> > architected locations are the TPH capability itself and the MSI-X
> > table, where support is mutually exclusive. However the device can
> > support either of these and still enable DS mode, thus essentially
> > a third location.
> >
> > None of these are particularly more or less unsafe than another. It's
> > inadvisable for drivers to write to the MSI-X vector table, versus
> > using the SET_IRQS uAPI, but it's not prevented anymore via mmap.
>
> We'd have to go back to preventing it..
Disabling mmaps across vector tables obviously re-introduces the
performance issues on devices that don't account for a 64K exclusion
zone around the MSI-X vector table.
> > We don't provide direct write access to the TPH capability, but
> > backdoors to config space are not uncommon.
>
> I think we can discount this, if a device exposes TPH through config
> then it should keep it secure.
We have plenty of other caveats for userspace drivers that demand
"well behaved devices", so I agree we might be able to ignore the
backdoor vector. That still means we can't control STs in DS mode, and
IV mode requires an opt-in to protect the vector table, at a potential
performance penalty, depending on the host and device.
> > I'm actually still wrestling with the question whether any of this
> > is really unsafe beyond the scope of ownership of the device at
> > all.
>
> Well there is certainly some kind of DOS argument for calling it
> unsafe, and we don't actually know what effects the TPH value even
> has.
>
> I'd be much more worried about some "smart" accelerator device doing
> something impactful based on TPH than insecure config space for
> example.
Is that just a general sense of FUD to add weight to push the whole
thing behind an opt-in?
We could spin the effective storage locations a few different ways,
allowing TPH without an opt-in if we could limit it to the capability
location, requiring MSI-X mmap exclusion to enable devices that support
IV, and unblocking DS with another opt-in. But if we have a general
sense that this is scary in all cases, then we might as well just have
one global opt-in.
> > > I agree in general that the presence of enable_unsafe_tph VFIO has to
> > > report the per-device steering tags to userspace so it can program the
> > > device specific registers, nothing else can do this.
> >
> > What's the scope of "nothing else" here? The raw ST value for a CPU is
> > derived from an ACPI _DSM associated to a root port. So if the scope
> > is that the kernel needs to provide that translation, I agree, but
> > where in the kernel it's provided may still require some thought.
>
> The APIs accept a PCI device:
>
> int pcie_tph_get_cpu_st(struct pci_dev *pdev, enum tph_mem_type mem_type,
> unsigned int cpu, u16 *tag)
>
>
> IIRC the ACPI allows it to vary based on the source. Nothing else
> could provide the PCI device owned by VFIO to these functions..
The function immediately translates the endpoint function to a root
bridge, so I could certainly imagine some pcie port driver that creates
an interface to provide ST translations. It could really be exposed
through sysfs too, except the entries are a multiple of NR_CPUS, so the
attribute count would explode if we adhere to a single value per
attribute. Regardless, it's an ACPI lookup, so it doesn't rely on
device ownership, only using the device to lookup the ACPI companion to
invoke the DSM.
> > We also need to consider how this incorporates Zhiping's series, where
> > TPH values can be associated to a dmabuf. Would userspace get access
> > to those raw TPH values to program the initiator? It must for DS
> > support. Therefore the interface became one of the user handling raw
> > ST values and the CPU ID to ST translation became this ill-fitting
> > sidecar.
>
> Arguably yes, likely iommufd would need an API to return the ST of the
> dmabuf when it maps it.
Right, so hosting a CPU ID to ST translation interface in the vfio-pci
device feature interface feels pretty shortsighted. I don't know what
it should look like though. Thanks,
Alex
^ permalink raw reply [flat|nested] 22+ messages in thread
end of thread, other threads:[~2026-06-04 20:46 UTC | newest]
Thread overview: 22+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-28 12:46 [PATCH v14 0/8] vfio/pci: Add PCIe TPH support Chengwen Feng
2026-05-28 12:46 ` [PATCH v14 1/8] PCI/TPH: Fix pcie_tph_get_st_table_loc() field extraction Chengwen Feng
2026-05-28 12:46 ` [PATCH v14 2/8] PCI/TPH: Cache TPH requester capability at probe time Chengwen Feng
2026-05-28 12:46 ` [PATCH v14 3/8] PCI/TPH: Add requester selection policy to pcie_enable_tph() Chengwen Feng
2026-05-28 12:46 ` [PATCH v14 4/8] PCI/TPH: Add requester policy to pcie_tph_get_cpu_st() Chengwen Feng
2026-05-28 12:46 ` [PATCH v14 5/8] PCI/TPH: expose the enabled TPH requester type Chengwen Feng
2026-05-28 12:46 ` [PATCH v14 6/8] vfio/pci: Add VFIO_DEVICE_FEATURE_TPH_ST_CONFIG for PCIe TPH ST configuration Chengwen Feng
2026-05-28 15:17 ` sashiko-bot
2026-05-28 12:46 ` [PATCH v14 7/8] vfio/pci: Add VFIO_DEVICE_FEATURE_TPH_CPU_ST to query TPH steering tag Chengwen Feng
2026-05-28 12:46 ` [PATCH v14 8/8] vfio/pci: Virtualize PCIe TPH capability registers Chengwen Feng
2026-05-28 16:42 ` sashiko-bot
2026-06-01 15:58 ` [PATCH v14 0/8] vfio/pci: Add PCIe TPH support Alex Williamson
2026-06-02 14:46 ` fengchengwen
2026-06-02 23:08 ` Alex Williamson
2026-06-03 0:34 ` fengchengwen
2026-06-03 0:45 ` Jason Gunthorpe
2026-06-03 1:25 ` fengchengwen
2026-06-03 18:53 ` Alex Williamson
2026-06-04 18:33 ` Jason Gunthorpe
2026-06-04 20:46 ` Alex Williamson
2026-06-03 17:58 ` Alex Williamson
2026-06-04 1:58 ` fengchengwen
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox