Linux PCI subsystem development
* [PATCH v17 00/12] vfio/pci: Add PCIe TPH support
@ 2026-06-16 10:46 Chengwen Feng
  2026-06-16 10:46 ` [PATCH v17 01/12] PCI/TPH: Fix pcie_tph_get_st_table_loc() field extraction Chengwen Feng
                   ` (11 more replies)
  0 siblings, 12 replies; 39+ messages in thread
From: Chengwen Feng @ 2026-06-16 10:46 UTC
  To: alex, jgg, helgaas
  Cc: wathsala.vithanage, wei.huang2, zhipingz, wangzhou1, wangyushan12,
	liuyonglong, kvm, linux-pci

This patchset enables fully userspace-configurable PCIe TPH (TLP Processing
Hints) support for VFIO, which brings performance benefits for userspace
workloads such as DPDK and SPDK.

Currently VFIO only exposes read-only TPH capability registers to
userspace, while all write operations are silently discarded. This
prevents userspace from enabling and configuring TPH, limiting performance
optimization opportunities.

Per PCIe spec 7.5.3.15: TPH Completer support is applicable to Root Ports
and Endpoints, allowing Steering Tags to target host CPUs or peer devices
for P2P transactions.

The TPH usage model can be divided into three fundamental parts:
1. Retrieve Steering Tag:
   - Tags targeting host CPUs are obtained via platform methods (ACPI _DSM)
     wrapped in pcie_tph_get_cpu_st(). Userspace requires a generic
     interface to query these CPU-associated ST values.
   - Tags targeting peer devices are managed by userspace drivers.
2. Program Steering Tag table:
   - For devices with standard ST table structures (in capability space or
     MSI-X table), userspace needs a unified interface to configure ST
     entries.
   - Devices without standard ST tables are handled by userspace itself.
3. Toggle device TPH Requester enable/disable state.

To support the above scenarios, this series extends PCI and VFIO with
complete TPH virtualization features:
- [*PCI*] Add a sysfs binary file that exports the CPU to steering-tag
  mapping, so that userspace can retrieve a CPU's ST by reading it.
- [*VFIO*] New device feature TPH_ST_CONFIG: Batch configure interface for
  device ST table entries, with shadow cache and atomic rollback support.
- [*VFIO*] Full TPH capability register virtualization: allow userspace to
  toggle TPH Requester state via TPH_CTRL register writes.

To guarantee isolation and security, this patchset adopts a two-level
safety gate design with careful ABI considerations:
1. Global unsafe gate:
   TPH caching behavior may cross isolation domains and impact shared
   platform resources. A new module parameter `enable_unsafe_tph` is
   introduced (default off) to globally gate all VFIO TPH functionalities.
2. Per-device opt-in gate:
   To preserve strict ABI compatibility and avoid unexpected hardware
   state changes for existing users, a new VFIO device feature TPH_ENABLE
   is added. TPH capabilities are only available after userspace explicitly
   enables it per-device.

The kernel PCI TPH implementation requires the TPH Requester to be enabled
before ST entries are programmed. To let userspace configure the ST table
in arbitrary order, a shadow ST table buffers ST writes made before TPH is
enabled. All cached entries are flushed to hardware when the TPH Requester
is turned on. The shadow table also provides atomic batch rollback for
reliable configuration.
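
The shadow-table behavior described above can be sketched as a small
userspace model. This is only an illustration under stated assumptions, not
the code from this series: every demo_* name, the table size, and the
in-memory "hardware" array are hypothetical stand-ins.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_ST_ENTRIES 8	/* hypothetical ST table size */

struct demo_tph_dev {
	bool     requester_on;			/* models the TPH Requester enable state */
	uint16_t hw[DEMO_ST_ENTRIES];		/* stands in for the hardware ST table */
	uint16_t shadow[DEMO_ST_ENTRIES];	/* what userspace has asked for so far */
};

/* Record one entry; touch "hardware" only while the requester is on. */
static int demo_set_entry(struct demo_tph_dev *d, size_t idx, uint16_t tag)
{
	if (idx >= DEMO_ST_ENTRIES)
		return -1;
	d->shadow[idx] = tag;
	if (d->requester_on)
		d->hw[idx] = tag;
	return 0;
}

/* Apply a batch; on any failure restore the state saved before the batch. */
static int demo_set_batch(struct demo_tph_dev *d, const size_t *idx,
			  const uint16_t *tags, size_t n)
{
	struct demo_tph_dev saved = *d;
	size_t i;

	for (i = 0; i < n; i++) {
		if (demo_set_entry(d, idx[i], tags[i])) {
			*d = saved;	/* roll back everything applied so far */
			return -1;
		}
	}
	return 0;
}

/* Turning the requester on flushes every cached entry to "hardware". */
static void demo_enable(struct demo_tph_dev *d)
{
	d->requester_on = true;
	memcpy(d->hw, d->shadow, sizeof(d->hw));
}

/* Usage: buffer before enable, flush on enable, roll back a bad batch. */
static void demo_shadow_check(void)
{
	struct demo_tph_dev d = { 0 };
	const size_t idx[] = { 1, DEMO_ST_ENTRIES };	/* second index is out of range */
	const uint16_t tags[] = { 0x22, 0x33 };

	assert(demo_set_entry(&d, 2, 0x11) == 0);
	assert(d.hw[2] == 0);		/* buffered only: requester still off */
	demo_enable(&d);
	assert(d.hw[2] == 0x11);	/* flushed on enable */
	assert(demo_set_batch(&d, idx, tags, 2) != 0);
	assert(d.shadow[1] == 0 && d.hw[1] == 0);	/* batch rolled back */
}
```

Snapshotting the whole state keeps the rollback trivially atomic in this
model; a real implementation would more likely save only the entries the
batch touches.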

The patchset is split into two logical parts: the first eight patches fix
and refactor core PCI/TPH kernel code to export the required helper
interfaces and the CPU to ST mapping; the remaining four patches implement
the corresponding VFIO TPH virtualization layer step by step.

Based on earlier RFC work by Wathsala Vithanage.

---
v17:
- Move the logic that retrieves the CPU to ST mapping from VFIO to the PCI
  subsystem
- Remove tph_lock, which appeared to be unused
- Address Sashiko review comments on v16:
  - tph_permit was a bitfield, which has concurrent-update problems
  - Fix tph_permit not being reset when the device is re-opened
  - TPH capability virtualization writes had concurrency problems and did
    not roll back to the original value
  - Missing virtualization of the TPH Capability Header leaked the physical
    Next Capability Pointer to the guest
v16:
- Support opt-in at the device level, which addresses Alex's comment.
- Split out a sub-commit: hide the TPH capability when TPH is unsupported.
- Optimize the layout of the TPH fields in the pci_dev structure.
- Improve the "virtualize PCIe TPH capability" commit: support rollback
  when a set operation fails.
- Reorder the PCI/TPH commits: put the fix commits first.
- Reorganize the cover letter to serve as the starting point for
  discussion.
v15: Address Alex's comments:
- Drop the TPH capability when tph_cap=0
- Use an _explicit suffix rather than adding a policy parameter for
  enabling TPH and getting the TPH ST.
- Make sure ST entries are set only in D0
- Reimplement TPH capability register virtualization
- Other fixes

Chengwen Feng (11):
  PCI/TPH: Fix pcie_tph_get_st_table_loc() field extraction
  PCI/TPH: Fix tph_enabled concurrent update race by bitfield packing
  PCI/TPH: Cache TPH requester capability at probe time
  PCI/TPH: Refactor pcie_enable_tph & add explicit requester variant
  PCI/TPH: Refactor pcie_tph_get_cpu_st & add explicit variant
  PCI/TPH: Add pcie_tph_supported() helper to check TPH capability
    attributes
  PCI/TPH: Add sysfs binary file to export CPU to steering-tag mapping
  vfio/pci: Hide TPH capability when TPH is unsupported
  vfio/pci: Add TPH_ENABLE feature skeleton and unsafe module parameter
  vfio/pci: Add TPH_ST_CONFIG for PCIe TPH ST configuration
  vfio/pci: Virtualize PCIe TPH capability registers

Zhiping Zhang (1):
  PCI/TPH: Expose the enabled TPH requester type

 drivers/pci/pci-sysfs.c            |   3 +
 drivers/pci/pci.h                  |   4 +
 drivers/pci/tph.c                  | 325 ++++++++++++++++++++++-------
 drivers/vfio/pci/vfio_pci.c        |  13 +-
 drivers/vfio/pci/vfio_pci_config.c | 121 +++++++++++
 drivers/vfio/pci/vfio_pci_core.c   | 152 +++++++++++++-
 include/linux/pci-tph.h            |  22 ++
 include/linux/pci.h                |   6 +-
 include/linux/vfio_pci_core.h      |   6 +-
 include/uapi/linux/pci.h           |  15 ++
 include/uapi/linux/vfio.h          |  29 +++
 11 files changed, 618 insertions(+), 78 deletions(-)

-- 
2.17.1



* [PATCH v17 01/12] PCI/TPH: Fix pcie_tph_get_st_table_loc() field extraction
  2026-06-16 10:46 [PATCH v17 00/12] vfio/pci: Add PCIe TPH support Chengwen Feng
@ 2026-06-16 10:46 ` Chengwen Feng
  2026-06-16 11:00   ` sashiko-bot
  2026-06-16 10:46 ` [PATCH v17 02/12] PCI/TPH: Fix tph_enabled concurrent update race by bitfield packing Chengwen Feng
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 39+ messages in thread
From: Chengwen Feng @ 2026-06-16 10:46 UTC
  To: alex, jgg, helgaas
  Cc: wathsala.vithanage, wei.huang2, zhipingz, wangzhou1, wangyushan12,
	liuyonglong, kvm, linux-pci

pcie_tph_get_st_table_loc() incorrectly uses FIELD_GET(), which shifts the
field value to bit 0. But the function is designed to return raw
PCI_TPH_LOC_* values as defined in the function comment.

This causes incorrect ST table location detection. Fix it by using bitwise
AND with PCI_TPH_CAP_LOC_MASK to return the unshifted field value matching
the function specification.

This doesn't make a difference to mlx5_st_create(), the lone external
caller, because it only checks for PCI_TPH_LOC_NONE (0), but will be needed
for callers that check for PCI_TPH_LOC_CAP or PCI_TPH_LOC_MSIX.
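
To make the difference concrete, here is a minimal standalone sketch. It
does not use the kernel headers: the DEMO_* constants are local stand-ins
whose values are assumed to mirror PCI_TPH_CAP_LOC_MASK and PCI_TPH_LOC_*
in pci_regs.h, and the shift helper models what FIELD_GET() returns.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins; values assumed to match pci_regs.h (bits 10:9). */
#define DEMO_LOC_MASK	0x00000600u
#define DEMO_LOC_SHIFT	9
#define DEMO_LOC_NONE	0x00000000u
#define DEMO_LOC_CAP	0x00000200u
#define DEMO_LOC_MSIX	0x00000400u

/* Old behavior: the field shifted down to bit 0, as FIELD_GET() does. */
static uint32_t demo_loc_shifted(uint32_t reg)
{
	return (reg & DEMO_LOC_MASK) >> DEMO_LOC_SHIFT;
}

/* New behavior: the field left in place, comparable with DEMO_LOC_*. */
static uint32_t demo_loc_raw(uint32_t reg)
{
	return reg & DEMO_LOC_MASK;
}

/* Usage: only the raw form matches the non-zero location constants. */
static void demo_loc_check(void)
{
	uint32_t reg = DEMO_LOC_CAP | 0x5u;	/* LOC = CAP plus unrelated low bits */

	assert(demo_loc_raw(reg) == DEMO_LOC_CAP);
	assert(demo_loc_shifted(reg) == 1);	/* never equals DEMO_LOC_CAP */
	/* LOC_NONE is 0 in both forms, which is why a NONE-only check still worked. */
	assert(demo_loc_raw(0x5u) == DEMO_LOC_NONE);
	assert(demo_loc_shifted(0x5u) == DEMO_LOC_NONE);
}
```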

Also add tph_cap validation for pcie_tph_get_st_table_loc() to prevent
invalid PCI configuration space access when TPH is not supported. Add stub
functions for pcie_tph_get_st_table_size() and pcie_tph_get_st_table_loc()
when !CONFIG_PCIE_TPH.

Fixes: d2e8a34876ce ("PCI/TPH: Add Steering Tag support")
Cc: stable@vger.kernel.org
Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
Reviewed-by: Alex Williamson <alex.williamson@nvidia.com>
Reviewed-by: Bjorn Helgaas <bhelgaas@google.com>
---
 drivers/pci/tph.c       | 12 +++++-------
 include/linux/pci-tph.h |  5 +++++
 2 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/drivers/pci/tph.c b/drivers/pci/tph.c
index 91145e8d9d95..bef3a55539c4 100644
--- a/drivers/pci/tph.c
+++ b/drivers/pci/tph.c
@@ -166,11 +166,14 @@ static u8 get_st_modes(struct pci_dev *pdev)
  */
 u32 pcie_tph_get_st_table_loc(struct pci_dev *pdev)
 {
-	u32 reg;
+	u32 reg = 0;
+
+	if (!pdev->tph_cap)
+		return PCI_TPH_LOC_NONE;
 
 	pci_read_config_dword(pdev, pdev->tph_cap + PCI_TPH_CAP, &reg);
 
-	return FIELD_GET(PCI_TPH_CAP_LOC_MASK, reg);
+	return reg & PCI_TPH_CAP_LOC_MASK;
 }
 EXPORT_SYMBOL(pcie_tph_get_st_table_loc);
 
@@ -185,9 +188,6 @@ u16 pcie_tph_get_st_table_size(struct pci_dev *pdev)
 
 	/* Check ST table location first */
 	loc = pcie_tph_get_st_table_loc(pdev);
-
-	/* Convert loc to match with PCI_TPH_LOC_* defined in pci_regs.h */
-	loc = FIELD_PREP(PCI_TPH_CAP_LOC_MASK, loc);
 	if (loc != PCI_TPH_LOC_CAP)
 		return 0;
 
@@ -316,8 +316,6 @@ int pcie_tph_set_st_entry(struct pci_dev *pdev, unsigned int index, u16 tag)
 	set_ctrl_reg_req_en(pdev, PCI_TPH_REQ_DISABLE);
 
 	loc = pcie_tph_get_st_table_loc(pdev);
-	/* Convert loc to match with PCI_TPH_LOC_* */
-	loc = FIELD_PREP(PCI_TPH_CAP_LOC_MASK, loc);
 
 	switch (loc) {
 	case PCI_TPH_LOC_MSIX:
diff --git a/include/linux/pci-tph.h b/include/linux/pci-tph.h
index be68cd17f2f8..6f02b020d7d7 100644
--- a/include/linux/pci-tph.h
+++ b/include/linux/pci-tph.h
@@ -8,6 +8,7 @@
  */
 #ifndef LINUX_PCI_TPH_H
 #define LINUX_PCI_TPH_H
+#include <linux/pci.h>
 
 /*
  * According to the ECN for PCI Firmware Spec, Steering Tag can be different
@@ -41,6 +42,10 @@ static inline int pcie_tph_get_cpu_st(struct pci_dev *dev,
 static inline void pcie_disable_tph(struct pci_dev *pdev) { }
 static inline int pcie_enable_tph(struct pci_dev *pdev, int mode)
 { return -EINVAL; }
+static inline u16 pcie_tph_get_st_table_size(struct pci_dev *pdev)
+{ return 0; }
+static inline u32 pcie_tph_get_st_table_loc(struct pci_dev *pdev)
+{ return PCI_TPH_LOC_NONE; }
 #endif
 
 #endif /* LINUX_PCI_TPH_H */
-- 
2.17.1



* [PATCH v17 02/12] PCI/TPH: Fix tph_enabled concurrent update race by bitfield packing
  2026-06-16 10:46 [PATCH v17 00/12] vfio/pci: Add PCIe TPH support Chengwen Feng
  2026-06-16 10:46 ` [PATCH v17 01/12] PCI/TPH: Fix pcie_tph_get_st_table_loc() field extraction Chengwen Feng
@ 2026-06-16 10:46 ` Chengwen Feng
  2026-06-16 10:55   ` sashiko-bot
  2026-06-16 10:46 ` [PATCH v17 03/12] PCI/TPH: Cache TPH requester capability at probe time Chengwen Feng
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 39+ messages in thread
From: Chengwen Feng @ 2026-06-16 10:46 UTC
  To: alex, jgg, helgaas
  Cc: wathsala.vithanage, wei.huang2, zhipingz, wangzhou1, wangyushan12,
	liuyonglong, kvm, linux-pci

Move tph_enabled out of the shared pci_dev bitfield and into the spare bit
of the u16 that holds tph_cap. tph_cap is immutable after enumeration and
needs only 15 bits for the offset, so the remaining bit can store
tph_enabled. This removes the cross-bitfield concurrent write hazard that
Sashiko highlighted once TPH is exposed through VFIO. No functional change.

Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
---
 include/linux/pci.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/pci.h b/include/linux/pci.h
index 2c4454583c11..109182658f76 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -471,7 +471,6 @@ struct pci_dev {
 	unsigned int	ats_enabled:1;		/* Address Translation Svc */
 	unsigned int	pasid_enabled:1;	/* Process Address Space ID */
 	unsigned int	pri_enabled:1;		/* Page Request Interface */
-	unsigned int	tph_enabled:1;		/* TLP Processing Hints */
 	unsigned int	fm_enabled:1;		/* Flit Mode (segment captured) */
 	unsigned int	is_managed:1;		/* Managed via devres */
 	unsigned int	is_msi_managed:1;	/* MSI release via devres installed */
@@ -589,7 +588,8 @@ struct pci_dev {
 	u8 reset_methods[PCI_NUM_RESET_METHODS]; /* In priority order */
 
 #ifdef CONFIG_PCIE_TPH
-	u16		tph_cap;	/* TPH capability offset */
+	u16		tph_cap:15;	/* TPH capability offset */
+	u16		tph_enabled:1;	/* Whether TPH is enabled */
 	u8		tph_mode;	/* TPH mode */
 	u8		tph_req_type;	/* TPH requester type */
 #endif
-- 
2.17.1



* [PATCH v17 03/12] PCI/TPH: Cache TPH requester capability at probe time
  2026-06-16 10:46 [PATCH v17 00/12] vfio/pci: Add PCIe TPH support Chengwen Feng
  2026-06-16 10:46 ` [PATCH v17 01/12] PCI/TPH: Fix pcie_tph_get_st_table_loc() field extraction Chengwen Feng
  2026-06-16 10:46 ` [PATCH v17 02/12] PCI/TPH: Fix tph_enabled concurrent update race by bitfield packing Chengwen Feng
@ 2026-06-16 10:46 ` Chengwen Feng
  2026-06-16 10:55   ` sashiko-bot
  2026-06-16 10:46 ` [PATCH v17 04/12] PCI/TPH: Refactor pcie_enable_tph & add explicit requester variant Chengwen Feng
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 39+ messages in thread
From: Chengwen Feng @ 2026-06-16 10:46 UTC
  To: alex, jgg, helgaas
  Cc: wathsala.vithanage, wei.huang2, zhipingz, wangzhou1, wangyushan12,
	liuyonglong, kvm, linux-pci

Calculate the negotiated TPH requester type from device and root port
capabilities once in pci_tph_init().

Add tph_ext_support flag to cache whether the device is allowed to
issue Extended TPH requests after topology negotiation. If the final
requester type is disabled, clear TPH capability to prevent usage.

Simplify pcie_enable_tph() by using the cached requester capability
instead of recalculating every time.
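
The probe-time negotiation can be modeled in plain C as follows. This is a
sketch under assumptions rather than the kernel code: the DEMO_REQ_* values
are assumed to mirror the PCI_TPH_REQ_* encodings, and the struct and
function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins assumed to mirror the PCI_TPH_REQ_* encodings. */
enum { DEMO_REQ_DISABLE = 0, DEMO_REQ_TPH_ONLY = 1, DEMO_REQ_EXT_TPH = 3 };

struct demo_tph_caps {
	uint16_t cap_offset;	/* 0 means "no usable TPH capability" */
	bool     ext_support;	/* cached result of the negotiation */
};

/* Negotiate once at probe time and cache the outcome. */
static void demo_tph_probe(struct demo_tph_caps *c, uint16_t cap_offset,
			   bool dev_ext, bool behind_rp, uint8_t rp_type)
{
	uint8_t type = dev_ext ? DEMO_REQ_EXT_TPH : DEMO_REQ_TPH_ONLY;

	c->cap_offset = cap_offset;
	c->ext_support = false;
	if (!cap_offset)
		return;

	/* The final type is the smaller of the device and Root Port values. */
	if (behind_rp && rp_type < type)
		type = rp_type;

	if (type == DEMO_REQ_DISABLE) {
		c->cap_offset = 0;	/* hide the capability entirely */
		return;
	}
	c->ext_support = (type == DEMO_REQ_EXT_TPH);
}

/* Usage: three topologies and the state each one caches. */
static void demo_probe_check(void)
{
	struct demo_tph_caps c;

	demo_tph_probe(&c, 0x100, true, true, DEMO_REQ_TPH_ONLY);
	assert(c.cap_offset == 0x100 && !c.ext_support);

	demo_tph_probe(&c, 0x100, true, true, DEMO_REQ_DISABLE);
	assert(c.cap_offset == 0);

	demo_tph_probe(&c, 0x100, true, false, DEMO_REQ_DISABLE);
	assert(c.cap_offset == 0x100 && c.ext_support);
}
```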

Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
---
 drivers/pci/tph.c   | 43 +++++++++++++++++++++++++------------------
 include/linux/pci.h |  4 +++-
 2 files changed, 28 insertions(+), 19 deletions(-)

diff --git a/drivers/pci/tph.c b/drivers/pci/tph.c
index bef3a55539c4..951f0a33ff66 100644
--- a/drivers/pci/tph.c
+++ b/drivers/pci/tph.c
@@ -384,7 +384,6 @@ int pcie_enable_tph(struct pci_dev *pdev, int mode)
 {
 	u32 reg;
 	u8 dev_modes;
-	u8 rp_req_type;
 
 	/* Honor "notph" kernel parameter */
 	if (pci_tph_disabled)
@@ -404,23 +403,8 @@ int pcie_enable_tph(struct pci_dev *pdev, int mode)
 
 	pdev->tph_mode = mode;
 
-	/* Get req_type supported by device and its Root Port */
-	pci_read_config_dword(pdev, pdev->tph_cap + PCI_TPH_CAP, &reg);
-	if (FIELD_GET(PCI_TPH_CAP_EXT_TPH, reg))
-		pdev->tph_req_type = PCI_TPH_REQ_EXT_TPH;
-	else
-		pdev->tph_req_type = PCI_TPH_REQ_TPH_ONLY;
-
-	/* Check if the device is behind a Root Port */
-	if (pci_pcie_type(pdev) != PCI_EXP_TYPE_RC_END) {
-		rp_req_type = get_rp_completer_type(pdev);
-
-		/* Final req_type is the smallest value of two */
-		pdev->tph_req_type = min(pdev->tph_req_type, rp_req_type);
-	}
-
-	if (pdev->tph_req_type == PCI_TPH_REQ_DISABLE)
-		return -EINVAL;
+	pdev->tph_req_type = pdev->tph_ext_support ? PCI_TPH_REQ_EXT_TPH :
+						     PCI_TPH_REQ_TPH_ONLY;
 
 	/* Write them into TPH control register */
 	pci_read_config_dword(pdev, pdev->tph_cap + PCI_TPH_CTRL, &reg);
@@ -510,13 +494,36 @@ void pci_no_tph(void)
 
 void pci_tph_init(struct pci_dev *pdev)
 {
+	u8 tph_req_type, rp_req_type;
 	int num_entries;
 	u32 save_size;
+	u32 reg = 0;
 
 	pdev->tph_cap = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_TPH);
 	if (!pdev->tph_cap)
 		return;
 
+	/* Get req_type supported by device and its Root Port */
+	pci_read_config_dword(pdev, pdev->tph_cap + PCI_TPH_CAP, &reg);
+	if (FIELD_GET(PCI_TPH_CAP_EXT_TPH, reg))
+		tph_req_type = PCI_TPH_REQ_EXT_TPH;
+	else
+		tph_req_type = PCI_TPH_REQ_TPH_ONLY;
+
+	/* Check if the device is behind a Root Port */
+	if (pci_pcie_type(pdev) != PCI_EXP_TYPE_RC_END) {
+		rp_req_type = get_rp_completer_type(pdev);
+		/* Final req_type is the smallest value of two */
+		tph_req_type = min(tph_req_type, rp_req_type);
+	}
+
+	if (tph_req_type == PCI_TPH_REQ_DISABLE) {
+		pdev->tph_cap = 0;
+		return;
+	}
+
+	pdev->tph_ext_support = !!(tph_req_type == PCI_TPH_REQ_EXT_TPH);
+
 	num_entries = pcie_tph_get_st_table_size(pdev);
 	save_size = sizeof(u32) + num_entries * sizeof(u16);
 	pci_add_ext_cap_save_buffer(pdev, PCI_EXT_CAP_ID_TPH, save_size);
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 109182658f76..285c0f00882e 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -588,7 +588,9 @@ struct pci_dev {
 	u8 reset_methods[PCI_NUM_RESET_METHODS]; /* In priority order */
 
 #ifdef CONFIG_PCIE_TPH
-	u16		tph_cap:15;	/* TPH capability offset */
+	u16		tph_cap:14;	/* TPH capability offset */
+	u16		tph_ext_support:1; /* Indicate whether Extended TPH
+					    * requester is supported */
 	u16		tph_enabled:1;	/* Whether TPH is enabled */
 	u8		tph_mode;	/* TPH mode */
 	u8		tph_req_type;	/* TPH requester type */
-- 
2.17.1



* [PATCH v17 04/12] PCI/TPH: Refactor pcie_enable_tph & add explicit requester variant
  2026-06-16 10:46 [PATCH v17 00/12] vfio/pci: Add PCIe TPH support Chengwen Feng
                   ` (2 preceding siblings ...)
  2026-06-16 10:46 ` [PATCH v17 03/12] PCI/TPH: Cache TPH requester capability at probe time Chengwen Feng
@ 2026-06-16 10:46 ` Chengwen Feng
  2026-06-16 10:53   ` sashiko-bot
  2026-06-16 10:46 ` [PATCH v17 05/12] PCI/TPH: Refactor pcie_tph_get_cpu_st & add explicit variant Chengwen Feng
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 39+ messages in thread
From: Chengwen Feng @ 2026-06-16 10:46 UTC
  To: alex, jgg, helgaas
  Cc: wathsala.vithanage, wei.huang2, zhipingz, wangzhou1, wangyushan12,
	liuyonglong, kvm, linux-pci

Refactor the pcie_enable_tph() implementation: extract the core logic into
a static internal enable_tph() helper that accepts an explicit requester
type.

- Keep pcie_enable_tph() as an auto wrapper with unchanged behavior: it
  selects the EXT or standard TPH requester according to device
  capability, so the existing bnxt/mlx5 callers need no modification.
- Add an exported pcie_enable_tph_explicit() with a bool 'extended'
  parameter for explicit STD/EXT selection, used by the upcoming VFIO TPH
  support.

Validation of EXT_TPH availability is kept inside the helper, so an
explicit EXT request is rejected if the hardware does not support the
extended requester.
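
The wrapper split can be illustrated with a simplified standalone model.
It is a sketch only: the demo_* names are hypothetical, the DEMO_REQ_*
values are assumed to mirror PCI_TPH_REQ_*, and ST mode validation and
register access are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins assumed to mirror the PCI_TPH_REQ_* encodings. */
enum { DEMO_REQ_DISABLE = 0, DEMO_REQ_TPH_ONLY = 1, DEMO_REQ_EXT_TPH = 3 };

struct demo_dev {
	bool    ext_support;	/* cached at probe time */
	uint8_t req_type;	/* last requester type programmed */
};

/* Shared helper: validates the requested type, then "programs" it. */
static int demo_enable_common(struct demo_dev *d, uint8_t req_type)
{
	if (req_type == DEMO_REQ_EXT_TPH && !d->ext_support)
		return -EINVAL;
	d->req_type = req_type;
	return 0;
}

/* Auto wrapper: picks the best type the device supports. */
static int demo_enable_auto(struct demo_dev *d)
{
	return demo_enable_common(d, d->ext_support ? DEMO_REQ_EXT_TPH :
						      DEMO_REQ_TPH_ONLY);
}

/* Explicit wrapper: the caller chooses, and an unsupported EXT request fails. */
static int demo_enable_explicit(struct demo_dev *d, bool extended)
{
	return demo_enable_common(d, extended ? DEMO_REQ_EXT_TPH :
						DEMO_REQ_TPH_ONLY);
}

/* Usage: a device without extended support. */
static void demo_enable_check(void)
{
	struct demo_dev d = { .ext_support = false, .req_type = DEMO_REQ_DISABLE };

	assert(demo_enable_explicit(&d, true) == -EINVAL);
	assert(d.req_type == DEMO_REQ_DISABLE);	/* state untouched on failure */
	assert(demo_enable_auto(&d) == 0);
	assert(d.req_type == DEMO_REQ_TPH_ONLY);
}
```

Keeping the capability check in the shared helper means both entry points
reject an unsupported EXT request in the same way.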

Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
---
 drivers/pci/tph.c       | 70 ++++++++++++++++++++++++++++-------------
 include/linux/pci-tph.h |  4 +++
 2 files changed, 53 insertions(+), 21 deletions(-)

diff --git a/drivers/pci/tph.c b/drivers/pci/tph.c
index 951f0a33ff66..51009ac9b379 100644
--- a/drivers/pci/tph.c
+++ b/drivers/pci/tph.c
@@ -364,23 +364,7 @@ void pcie_disable_tph(struct pci_dev *pdev)
 }
 EXPORT_SYMBOL(pcie_disable_tph);
 
-/**
- * pcie_enable_tph - Enable TPH support for device using a specific ST mode
- * @pdev: PCI device
- * @mode: ST mode to enable. Current supported modes include:
- *
- *   - PCI_TPH_ST_NS_MODE: NO ST Mode
- *   - PCI_TPH_ST_IV_MODE: Interrupt Vector Mode
- *   - PCI_TPH_ST_DS_MODE: Device Specific Mode
- *
- * Check whether the mode is actually supported by the device before enabling
- * and return an error if not. Additionally determine what types of requests,
- * TPH or extended TPH, can be issued by the device based on its TPH requester
- * capability and the Root Port's completer capability.
- *
- * Return: 0 on success, otherwise negative value (-errno)
- */
-int pcie_enable_tph(struct pci_dev *pdev, int mode)
+static int enable_tph(struct pci_dev *pdev, int mode, u8 req_type)
 {
 	u32 reg;
 	u8 dev_modes;
@@ -401,10 +385,11 @@ int pcie_enable_tph(struct pci_dev *pdev, int mode)
 	if (!((1 << mode) & dev_modes))
 		return -EINVAL;
 
-	pdev->tph_mode = mode;
+	if (req_type == PCI_TPH_REQ_EXT_TPH && !pdev->tph_ext_support)
+		return -EINVAL;
 
-	pdev->tph_req_type = pdev->tph_ext_support ? PCI_TPH_REQ_EXT_TPH :
-						     PCI_TPH_REQ_TPH_ONLY;
+	pdev->tph_mode = mode;
+	pdev->tph_req_type = req_type;
 
 	/* Write them into TPH control register */
 	pci_read_config_dword(pdev, pdev->tph_cap + PCI_TPH_CTRL, &reg);
@@ -413,7 +398,7 @@ int pcie_enable_tph(struct pci_dev *pdev, int mode)
 	reg |= FIELD_PREP(PCI_TPH_CTRL_MODE_SEL_MASK, pdev->tph_mode);
 
 	reg &= ~PCI_TPH_CTRL_REQ_EN_MASK;
-	reg |= FIELD_PREP(PCI_TPH_CTRL_REQ_EN_MASK, pdev->tph_req_type);
+	reg |= FIELD_PREP(PCI_TPH_CTRL_REQ_EN_MASK, req_type);
 
 	pci_write_config_dword(pdev, pdev->tph_cap + PCI_TPH_CTRL, reg);
 
@@ -421,8 +406,51 @@ int pcie_enable_tph(struct pci_dev *pdev, int mode)
 
 	return 0;
 }
+
+/**
+ * pcie_enable_tph - Enable TPH support for device using a specific ST mode
+ * @pdev: PCI device
+ * @mode: ST mode to enable. Current supported modes include:
+ *
+ *   - PCI_TPH_ST_NS_MODE: NO ST Mode
+ *   - PCI_TPH_ST_IV_MODE: Interrupt Vector Mode
+ *   - PCI_TPH_ST_DS_MODE: Device Specific Mode
+ *
+ * Check whether the mode is actually supported by the device before enabling
+ * and return an error if not. Additionally determine what types of requests,
+ * TPH or extended TPH, can be issued by the device based on its TPH requester
+ * capability and the Root Port's completer capability.
+ *
+ * Return: 0 on success, otherwise negative value (-errno)
+ */
+int pcie_enable_tph(struct pci_dev *pdev, int mode)
+{
+	u8 req_type = pdev->tph_ext_support ? PCI_TPH_REQ_EXT_TPH :
+					      PCI_TPH_REQ_TPH_ONLY;
+	return enable_tph(pdev, mode, req_type);
+}
 EXPORT_SYMBOL(pcie_enable_tph);
 
+/**
+ * pcie_enable_tph_explicit - Enable TPH with explicit requester selection
+ * @pdev: PCI device to operate
+ * @mode: ST table operating mode (NS/IV/DS)
+ * @extended: true = EXT_TPH, false = standard TPH only
+ *
+ * Unlike auto-detecting pcie_enable_tph(), caller selects requester type
+ * manually instead of hardware auto-selection. Rejects EXT_TPH request
+ * if device lacks extended requester capability.
+ *
+ * Return: 0 on success, negative errno on failure.
+ */
+int pcie_enable_tph_explicit(struct pci_dev *pdev, int mode, bool extended)
+{
+	u8 req_type = extended ? PCI_TPH_REQ_EXT_TPH : PCI_TPH_REQ_TPH_ONLY;
+
+	return enable_tph(pdev, mode, req_type);
+}
+EXPORT_SYMBOL(pcie_enable_tph_explicit);
+
 void pci_restore_tph_state(struct pci_dev *pdev)
 {
 	struct pci_cap_saved_state *save_state;
diff --git a/include/linux/pci-tph.h b/include/linux/pci-tph.h
index 6f02b020d7d7..ca0faa98afac 100644
--- a/include/linux/pci-tph.h
+++ b/include/linux/pci-tph.h
@@ -29,6 +29,7 @@ int pcie_tph_get_cpu_st(struct pci_dev *dev,
 			unsigned int cpu, u16 *tag);
 void pcie_disable_tph(struct pci_dev *pdev);
 int pcie_enable_tph(struct pci_dev *pdev, int mode);
+int pcie_enable_tph_explicit(struct pci_dev *pdev, int mode, bool extended);
 u16 pcie_tph_get_st_table_size(struct pci_dev *pdev);
 u32 pcie_tph_get_st_table_loc(struct pci_dev *pdev);
 #else
@@ -42,6 +43,9 @@ static inline int pcie_tph_get_cpu_st(struct pci_dev *dev,
 static inline void pcie_disable_tph(struct pci_dev *pdev) { }
 static inline int pcie_enable_tph(struct pci_dev *pdev, int mode)
 { return -EINVAL; }
+static inline int pcie_enable_tph_explicit(struct pci_dev *pdev, int mode,
+					   bool extended)
+{ return -EINVAL; }
 static inline u16 pcie_tph_get_st_table_size(struct pci_dev *pdev)
 { return 0; }
 static inline u32 pcie_tph_get_st_table_loc(struct pci_dev *pdev)
-- 
2.17.1



* [PATCH v17 05/12] PCI/TPH: Refactor pcie_tph_get_cpu_st & add explicit variant
  2026-06-16 10:46 [PATCH v17 00/12] vfio/pci: Add PCIe TPH support Chengwen Feng
                   ` (3 preceding siblings ...)
  2026-06-16 10:46 ` [PATCH v17 04/12] PCI/TPH: Refactor pcie_enable_tph & add explicit requester variant Chengwen Feng
@ 2026-06-16 10:46 ` Chengwen Feng
  2026-06-16 10:53   ` sashiko-bot
  2026-06-16 10:46 ` [PATCH v17 06/12] PCI/TPH: Expose the enabled TPH requester type Chengwen Feng
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 39+ messages in thread
From: Chengwen Feng @ 2026-06-16 10:46 UTC
  To: alex, jgg, helgaas
  Cc: wathsala.vithanage, wei.huang2, zhipingz, wangzhou1, wangyushan12,
	liuyonglong, kvm, linux-pci

Refactor pcie_tph_get_cpu_st(): extract the core logic into a static
internal get_cpu_st() helper that accepts an explicit requester type
parameter.

- Keep pcie_tph_get_cpu_st() as an auto wrapper with unchanged behavior:
  it uses the existing pdev->tph_req_type, so existing callers need no
  change.
- Add an exported pcie_tph_get_cpu_st_explicit() with a bool 'extended'
  parameter for manual STD/EXT requester selection, consumed by the
  upcoming VFIO TPH code.
- Add a capability check: reject an explicit EXT request when the device
  does not support the extended TPH requester.

Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
---
 drivers/pci/tph.c       | 68 ++++++++++++++++++++++++++++++-----------
 include/linux/pci-tph.h |  7 +++++
 2 files changed, 57 insertions(+), 18 deletions(-)

diff --git a/drivers/pci/tph.c b/drivers/pci/tph.c
index 51009ac9b379..aca08671fdfe 100644
--- a/drivers/pci/tph.c
+++ b/drivers/pci/tph.c
@@ -231,21 +231,8 @@ static int write_tag_to_st_table(struct pci_dev *pdev, int index, u16 tag)
 	return pci_write_config_word(pdev, offset, tag);
 }
 
-/**
- * pcie_tph_get_cpu_st() - Retrieve Steering Tag for a target memory associated
- * with a specific CPU
- * @pdev: PCI device
- * @mem_type: target memory type (volatile or persistent RAM)
- * @cpu: associated CPU id
- * @tag: Steering Tag to be returned
- *
- * Return the Steering Tag for a target memory that is associated with a
- * specific CPU as indicated by cpu.
- *
- * Return: 0 if success, otherwise negative value (-errno)
- */
-int pcie_tph_get_cpu_st(struct pci_dev *pdev, enum tph_mem_type mem_type,
-			unsigned int cpu, u16 *tag)
+static int get_cpu_st(struct pci_dev *pdev, enum tph_mem_type mem_type,
+		      u8 req_type, unsigned int cpu, u16 *tag)
 {
 #ifdef CONFIG_ACPI
 	struct pci_dev *rp;
@@ -269,19 +256,64 @@ int pcie_tph_get_cpu_st(struct pci_dev *pdev, enum tph_mem_type mem_type,
 		return -EINVAL;
 	}
 
-	*tag = tph_extract_tag(mem_type, pdev->tph_req_type, &info);
+	*tag = tph_extract_tag(mem_type, req_type, &info);
 
-	pci_dbg(pdev, "get steering tag: mem_type=%s, cpu=%d, tag=%#04x\n",
+	pci_dbg(pdev, "get steering tag: mem_type=%s, req_type=%u, cpu=%d, tag=%#04x\n",
 		(mem_type == TPH_MEM_TYPE_VM) ? "volatile" : "persistent",
-		cpu, *tag);
+		req_type, cpu, *tag);
 
 	return 0;
 #else
 	return -ENODEV;
 #endif
 }
+
+/**
+ * pcie_tph_get_cpu_st() - Retrieve Steering Tag for a target memory associated
+ * with a specific CPU
+ * @pdev: PCI device
+ * @mem_type: target memory type (volatile or persistent RAM)
+ * @cpu: associated CPU id
+ * @tag: Steering Tag to be returned
+ *
+ * Return the Steering Tag for a target memory that is associated with a
+ * specific CPU as indicated by cpu.
+ *
+ * Return: 0 if success, otherwise negative value (-errno)
+ */
+int pcie_tph_get_cpu_st(struct pci_dev *pdev, enum tph_mem_type mem_type,
+			unsigned int cpu, u16 *tag)
+{
+	return get_cpu_st(pdev, mem_type, pdev->tph_req_type, cpu, tag);
+}
 EXPORT_SYMBOL(pcie_tph_get_cpu_st);
 
+/**
+ * pcie_tph_get_cpu_st_explicit - Get ST with explicit requester type
+ * @pdev: PCI device
+ * @mem_type: target memory type (volatile or persistent RAM)
+ * @extended: true=EXT_TPH, false=standard TPH only
+ * @cpu: associated CPU id
+ * @tag: output steering tag pointer
+ *
+ * Unlike auto pcie_tph_get_cpu_st(), caller manually picks requester type.
+ * Rejects EXT request if device lacks extended requester capability.
+ *
+ * Return: 0 on success, negative errno on failure.
+ */
+int pcie_tph_get_cpu_st_explicit(struct pci_dev *pdev,
+				 enum tph_mem_type mem_type,
+				 bool extended, unsigned int cpu, u16 *tag)
+{
+	u8 req_type = extended ? PCI_TPH_REQ_EXT_TPH : PCI_TPH_REQ_TPH_ONLY;
+
+	if (extended && !pdev->tph_ext_support)
+		return -EINVAL;
+
+	return get_cpu_st(pdev, mem_type, req_type, cpu, tag);
+}
+EXPORT_SYMBOL(pcie_tph_get_cpu_st_explicit);
+
 /**
  * pcie_tph_set_st_entry() - Set Steering Tag in the ST table entry
  * @pdev: PCI device
diff --git a/include/linux/pci-tph.h b/include/linux/pci-tph.h
index ca0faa98afac..1a508b3d511f 100644
--- a/include/linux/pci-tph.h
+++ b/include/linux/pci-tph.h
@@ -27,6 +27,9 @@ int pcie_tph_set_st_entry(struct pci_dev *pdev,
 int pcie_tph_get_cpu_st(struct pci_dev *dev,
 			enum tph_mem_type mem_type,
 			unsigned int cpu, u16 *tag);
+int pcie_tph_get_cpu_st_explicit(struct pci_dev *pdev,
+				 enum tph_mem_type mem_type,
+				 bool extended, unsigned int cpu, u16 *tag);
 void pcie_disable_tph(struct pci_dev *pdev);
 int pcie_enable_tph(struct pci_dev *pdev, int mode);
 int pcie_enable_tph_explicit(struct pci_dev *pdev, int mode, bool extended);
@@ -40,6 +43,10 @@ static inline int pcie_tph_get_cpu_st(struct pci_dev *dev,
 				      enum tph_mem_type mem_type,
 				      unsigned int cpu, u16 *tag)
 { return -EINVAL; }
+static inline int pcie_tph_get_cpu_st_explicit(struct pci_dev *pdev,
+				enum tph_mem_type mem_type,
+				bool extended, unsigned int cpu, u16 *tag)
+{ return -EINVAL; }
 static inline void pcie_disable_tph(struct pci_dev *pdev) { }
 static inline int pcie_enable_tph(struct pci_dev *pdev, int mode)
 { return -EINVAL; }
-- 
2.17.1



* [PATCH v17 06/12] PCI/TPH: Expose the enabled TPH requester type
  2026-06-16 10:46 [PATCH v17 00/12] vfio/pci: Add PCIe TPH support Chengwen Feng
                   ` (4 preceding siblings ...)
  2026-06-16 10:46 ` [PATCH v17 05/12] PCI/TPH: Refactor pcie_tph_get_cpu_st & add explicit variant Chengwen Feng
@ 2026-06-16 10:46 ` Chengwen Feng
  2026-06-16 10:51   ` sashiko-bot
  2026-06-16 10:46 ` [PATCH v17 07/12] PCI/TPH: Add pcie_tph_supported() helper to check TPH capability attributes Chengwen Feng
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 39+ messages in thread
From: Chengwen Feng @ 2026-06-16 10:46 UTC
  To: alex, jgg, helgaas
  Cc: wathsala.vithanage, wei.huang2, zhipingz, wangzhou1, wangyushan12,
	liuyonglong, kvm, linux-pci

From: Zhiping Zhang <zhipingz@meta.com>

Add pcie_tph_enabled_req_type() so drivers can query the enabled TPH
requester mode without reaching into pci_dev internals.

Signed-off-by: Zhiping Zhang <zhipingz@meta.com>
Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
---
 drivers/pci/tph.c       | 12 ++++++++++++
 include/linux/pci-tph.h |  3 +++
 2 files changed, 15 insertions(+)

diff --git a/drivers/pci/tph.c b/drivers/pci/tph.c
index aca08671fdfe..6c4623cacc85 100644
--- a/drivers/pci/tph.c
+++ b/drivers/pci/tph.c
@@ -483,6 +483,18 @@ int pcie_enable_tph_explicit(struct pci_dev *pdev, int mode, bool extended)
 }
 EXPORT_SYMBOL(pcie_enable_tph_explicit);
 
+/**
+ * pcie_tph_enabled_req_type - Return the device's enabled TPH requester type
+ * @pdev: PCI device to query
+ *
+ * Return: PCI_TPH_REQ_DISABLE, PCI_TPH_REQ_TPH_ONLY or PCI_TPH_REQ_EXT_TPH.
+ */
+u8 pcie_tph_enabled_req_type(struct pci_dev *pdev)
+{
+	return pdev->tph_req_type;
+}
+EXPORT_SYMBOL(pcie_tph_enabled_req_type);
+
 void pci_restore_tph_state(struct pci_dev *pdev)
 {
 	struct pci_cap_saved_state *save_state;
diff --git a/include/linux/pci-tph.h b/include/linux/pci-tph.h
index 1a508b3d511f..e4f7045fc152 100644
--- a/include/linux/pci-tph.h
+++ b/include/linux/pci-tph.h
@@ -33,6 +33,7 @@ int pcie_tph_get_cpu_st_explicit(struct pci_dev *pdev,
 void pcie_disable_tph(struct pci_dev *pdev);
 int pcie_enable_tph(struct pci_dev *pdev, int mode);
 int pcie_enable_tph_explicit(struct pci_dev *pdev, int mode, bool extended);
+u8 pcie_tph_enabled_req_type(struct pci_dev *pdev);
 u16 pcie_tph_get_st_table_size(struct pci_dev *pdev);
 u32 pcie_tph_get_st_table_loc(struct pci_dev *pdev);
 #else
@@ -47,6 +48,8 @@ static inline int pcie_tph_get_cpu_st_explicit(struct pci_dev *pdev,
 				enum tph_mem_type mem_type,
 				bool extended, unsigned int cpu, u16 *tag)
 { return -EINVAL; }
+static inline u8 pcie_tph_enabled_req_type(struct pci_dev *pdev)
+{ return PCI_TPH_REQ_DISABLE; }
 static inline void pcie_disable_tph(struct pci_dev *pdev) { }
 static inline int pcie_enable_tph(struct pci_dev *pdev, int mode)
 { return -EINVAL; }
-- 
2.17.1



* [PATCH v17 07/12] PCI/TPH: Add pcie_tph_supported() helper to check TPH capability attributes
  2026-06-16 10:46 [PATCH v17 00/12] vfio/pci: Add PCIe TPH support Chengwen Feng
                   ` (5 preceding siblings ...)
  2026-06-16 10:46 ` [PATCH v17 06/12] PCI/TPH: Expose the enabled TPH requester type Chengwen Feng
@ 2026-06-16 10:46 ` Chengwen Feng
  2026-06-16 10:52   ` sashiko-bot
  2026-06-16 10:46 ` [PATCH v17 08/12] PCI/TPH: Add sysfs binary file to export CPU to steering-tag mapping Chengwen Feng
                   ` (4 subsequent siblings)
  11 siblings, 1 reply; 39+ messages in thread
From: Chengwen Feng @ 2026-06-16 10:46 UTC (permalink / raw)
  To: alex, jgg, helgaas
  Cc: wathsala.vithanage, wei.huang2, zhipingz, wangzhou1, wangyushan12,
	liuyonglong, kvm, linux-pci

Add a new helper, pcie_tph_supported(), with a want_ext parameter:
- want_ext = false: check whether the device has a valid TPH capability;
- want_ext = true: check whether the hardware supports Extended TPH.

Follow-up VFIO TPH virtualization patches use this helper to query both
basic TPH presence and Extended TPH support through one interface.

Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
---
 drivers/pci/tph.c       | 19 +++++++++++++++++++
 include/linux/pci-tph.h |  3 +++
 2 files changed, 22 insertions(+)

diff --git a/drivers/pci/tph.c b/drivers/pci/tph.c
index 6c4623cacc85..95280aab4fb5 100644
--- a/drivers/pci/tph.c
+++ b/drivers/pci/tph.c
@@ -600,3 +600,22 @@ void pci_tph_init(struct pci_dev *pdev)
 	save_size = sizeof(u32) + num_entries * sizeof(u16);
 	pci_add_ext_cap_save_buffer(pdev, PCI_EXT_CAP_ID_TPH, save_size);
 }
+
+/**
+ * pcie_tph_supported - Check TPH capability attribute
+ * @pdev: PCI device to query
+ * @want_ext: false - check TPH cap exists; true - check EXT_TPH support
+ *
+ * Return: true on matched condition, false otherwise
+ */
+bool pcie_tph_supported(struct pci_dev *pdev, bool want_ext)
+{
+	if (!pdev->tph_cap)
+		return false;
+
+	if (!want_ext)
+		return true;
+
+	return pdev->tph_ext_support;
+}
+EXPORT_SYMBOL(pcie_tph_supported);
diff --git a/include/linux/pci-tph.h b/include/linux/pci-tph.h
index e4f7045fc152..5917a0694c1d 100644
--- a/include/linux/pci-tph.h
+++ b/include/linux/pci-tph.h
@@ -36,6 +36,7 @@ int pcie_enable_tph_explicit(struct pci_dev *pdev, int mode, bool extended);
 u8 pcie_tph_enabled_req_type(struct pci_dev *pdev);
 u16 pcie_tph_get_st_table_size(struct pci_dev *pdev);
 u32 pcie_tph_get_st_table_loc(struct pci_dev *pdev);
+bool pcie_tph_supported(struct pci_dev *pdev, bool want_ext);
 #else
 static inline int pcie_tph_set_st_entry(struct pci_dev *pdev,
 					unsigned int index, u16 tag)
@@ -60,6 +61,8 @@ static inline u16 pcie_tph_get_st_table_size(struct pci_dev *pdev)
 { return 0; }
 static inline u32 pcie_tph_get_st_table_loc(struct pci_dev *pdev)
 { return PCI_TPH_LOC_NONE; }
+static inline bool pcie_tph_supported(struct pci_dev *pdev, bool want_ext)
+{ return false; }
 #endif
 
 #endif /* LINUX_PCI_TPH_H */
-- 
2.17.1



* [PATCH v17 08/12] PCI/TPH: Add sysfs binary file to export CPU to steering-tag mapping
  2026-06-16 10:46 [PATCH v17 00/12] vfio/pci: Add PCIe TPH support Chengwen Feng
                   ` (6 preceding siblings ...)
  2026-06-16 10:46 ` [PATCH v17 07/12] PCI/TPH: Add pcie_tph_supported() helper to check TPH capability attributes Chengwen Feng
@ 2026-06-16 10:46 ` Chengwen Feng
  2026-06-16 11:00   ` sashiko-bot
  2026-06-16 14:42   ` Jason Gunthorpe
  2026-06-16 10:46 ` [PATCH v17 09/12] vfio/pci: Hide TPH capability when TPH is unsupported Chengwen Feng
                   ` (3 subsequent siblings)
  11 siblings, 2 replies; 39+ messages in thread
From: Chengwen Feng @ 2026-06-16 10:46 UTC (permalink / raw)
  To: alex, jgg, helgaas
  Cc: wathsala.vithanage, wei.huang2, zhipingz, wangzhou1, wangyushan12,
	liuyonglong, kvm, linux-pci

Add a per-device sysfs binary attribute, tph_cpu_st, that exposes the
CPU-to-steering-tag data obtained from the ACPI _DSM to userspace. This
addresses the concern that VFIO should not host CPU-to-ST translation
interfaces.

Follow the standard PCI bin_attribute conventions: a dynamically visible
attribute group, a fixed-size 8-byte packed uapi entry, reads at
entry-aligned offsets, and root-only 0400 permission. Refactor the
duplicated ACPI _DSM logic into a shared tph_get_cpu_st_info() helper.

ABI: /sys/bus/pci/devices/<BDF>/tph_cpu_st

Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
---
 drivers/pci/pci-sysfs.c  |   3 ++
 drivers/pci/pci.h        |   4 ++
 drivers/pci/tph.c        | 113 +++++++++++++++++++++++++++++++++------
 include/uapi/linux/pci.h |  15 ++++++
 4 files changed, 120 insertions(+), 15 deletions(-)

diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
index d37860841260..ad9e4e8d320b 100644
--- a/drivers/pci/pci-sysfs.c
+++ b/drivers/pci/pci-sysfs.c
@@ -1832,6 +1832,9 @@ const struct attribute_group *pci_dev_attr_groups[] = {
 #ifdef CONFIG_PCI_TSM
 	&pci_tsm_auth_attr_group,
 	&pci_tsm_attr_group,
+#endif
+#ifdef CONFIG_PCIE_TPH
+	&pcie_tph_cpu_st_attr_group,
 #endif
 	NULL,
 };
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 4a14f88e543a..09306078a658 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -1366,6 +1366,10 @@ static inline pci_power_t acpi_pci_choose_state(struct pci_dev *pdev)
 extern const struct attribute_group aspm_ctrl_attr_group;
 #endif
 
+#ifdef CONFIG_PCIE_TPH
+extern const struct attribute_group pcie_tph_cpu_st_attr_group;
+#endif
+
 #ifdef CONFIG_X86_INTEL_MID
 bool pci_use_mid_pm(void);
 int mid_pci_set_power_state(struct pci_dev *pdev, pci_power_t state);
diff --git a/drivers/pci/tph.c b/drivers/pci/tph.c
index 95280aab4fb5..aca5093e8152 100644
--- a/drivers/pci/tph.c
+++ b/drivers/pci/tph.c
@@ -130,6 +130,29 @@ static acpi_status tph_invoke_dsm(acpi_handle handle, u32 cpu_uid,
 
 	return AE_OK;
 }
+
+static int tph_get_cpu_st_info(struct pci_dev *pdev, unsigned int cpu,
+			       union st_info *info)
+{
+	acpi_handle rp_acpi_handle;
+	struct pci_dev *rp;
+	u32 cpu_uid;
+	int ret;
+
+	ret = acpi_get_cpu_uid(cpu, &cpu_uid);
+	if (ret != 0)
+		return ret;
+
+	rp = pcie_find_root_port(pdev);
+	if (!rp || !rp->bus || !rp->bus->bridge)
+		return -ENODEV;
+
+	rp_acpi_handle = ACPI_HANDLE(rp->bus->bridge);
+	if (tph_invoke_dsm(rp_acpi_handle, cpu_uid, info) != AE_OK)
+		return -EINVAL;
+
+	return 0;
+}
 #endif
 
 /* Update the TPH Requester Enable field of TPH Control Register */
@@ -231,31 +254,36 @@ static int write_tag_to_st_table(struct pci_dev *pdev, int index, u16 tag)
 	return pci_write_config_word(pdev, offset, tag);
 }
 
+static void get_cpu_all_st(struct pci_dev *pdev, unsigned int cpu,
+			   struct pci_tph_cpu_st *st)
+{
+	memset(st, 0, sizeof(*st));
+#ifdef CONFIG_ACPI
+	union st_info info;
+	int ret;
+
+	ret = tph_get_cpu_st_info(pdev, cpu, &info);
+	if (ret == 0) {
+		st->vm_st = info.vm_st_valid ? info.vm_st : 0;
+		st->pm_st = info.pm_st_valid ? info.pm_st : 0;
+		st->vm_xst = info.vm_xst_valid ? info.vm_xst : 0;
+		st->pm_xst = info.pm_xst_valid ? info.pm_xst : 0;
+		st->reserved = 0;
+	}
+#endif
+}
+
 static int get_cpu_st(struct pci_dev *pdev, enum tph_mem_type mem_type,
 		      u8 req_type, unsigned int cpu, u16 *tag)
 {
 #ifdef CONFIG_ACPI
-	struct pci_dev *rp;
-	acpi_handle rp_acpi_handle;
 	union st_info info;
-	u32 cpu_uid;
 	int ret;
 
-	ret = acpi_get_cpu_uid(cpu, &cpu_uid);
+	ret = tph_get_cpu_st_info(pdev, cpu, &info);
 	if (ret != 0)
 		return ret;
 
-	rp = pcie_find_root_port(pdev);
-	if (!rp || !rp->bus || !rp->bus->bridge)
-		return -ENODEV;
-
-	rp_acpi_handle = ACPI_HANDLE(rp->bus->bridge);
-
-	if (tph_invoke_dsm(rp_acpi_handle, cpu_uid, &info) != AE_OK) {
-		*tag = 0;
-		return -EINVAL;
-	}
-
 	*tag = tph_extract_tag(mem_type, req_type, &info);
 
 	pci_dbg(pdev, "get steering tag: mem_type=%s, req_type=%u, cpu=%d, tag=%#04x\n",
@@ -619,3 +647,58 @@ bool pcie_tph_supported(struct pci_dev *pdev, bool want_ext)
 	return pdev->tph_ext_support;
 }
 EXPORT_SYMBOL(pcie_tph_supported);
+
+static ssize_t tph_cpu_st_read(struct file *filp, struct kobject *kobj,
+			       const struct bin_attribute *bin_attr, char *buf,
+			       loff_t off, size_t count)
+{
+	struct pci_dev *pdev = to_pci_dev(kobj_to_dev(kobj));
+	size_t entry_sz = PCI_TPH_CPU_ST_ENTRY_SZ;
+	struct pci_tph_cpu_st st;
+	unsigned int target_cpu;
+	size_t copy_len;
+
+	if (off >= nr_cpu_ids * entry_sz || off % entry_sz != 0)
+		return 0;
+
+	target_cpu = off / entry_sz;
+	if (!cpu_possible(target_cpu))
+		return -ENODEV;
+
+	get_cpu_all_st(pdev, target_cpu, &st);
+
+	copy_len = min_t(size_t, entry_sz, count);
+	memcpy(buf, &st, copy_len);
+
+	return copy_len;
+}
+static BIN_ATTR(tph_cpu_st, 0400, tph_cpu_st_read, NULL, 0);
+
+static const struct bin_attribute *const tph_cpu_st_bin_attrs[] = {
+	&bin_attr_tph_cpu_st,
+	NULL,
+};
+
+static size_t tph_cpu_st_bin_size(struct kobject *kobj,
+				  const struct bin_attribute *a, int n)
+{
+	return nr_cpu_ids * PCI_TPH_CPU_ST_ENTRY_SZ;
+}
+
+static umode_t tph_cpu_st_attr_is_visible(struct kobject *kobj,
+					  const struct bin_attribute *a, int n)
+{
+	struct device *dev = kobj_to_dev(kobj);
+	struct pci_dev *pdev = to_pci_dev(dev);
+
+	if (pcie_tph_supported(pdev, false))
+		return a->attr.mode;
+
+	return 0;
+}
+
+const struct attribute_group pcie_tph_cpu_st_attr_group = {
+	.bin_attrs = tph_cpu_st_bin_attrs,
+	.bin_size = tph_cpu_st_bin_size,
+	.is_bin_visible = tph_cpu_st_attr_is_visible,
+};
diff --git a/include/uapi/linux/pci.h b/include/uapi/linux/pci.h
index 4f150028965d..d8aa9f8d47f6 100644
--- a/include/uapi/linux/pci.h
+++ b/include/uapi/linux/pci.h
@@ -46,4 +46,19 @@ enum pci_hotplug_event {
 	PCI_HOTPLUG_CARD_NOT_PRESENT,
 };
 
+/*
+ * PCIe TPH sysfs binary entry for CPU-to-ST mapping
+ * Sysfs file: /sys/bus/pci/devices/<BDF>/tph_cpu_st
+ * Each entry is 8 bytes aligned, seek offset = cpu_id * PCI_TPH_CPU_ST_ENTRY_SZ
+ */
+struct pci_tph_cpu_st {
+	__u8  vm_st;        /* Volatile Memory Steering Tag (1 byte) */
+	__u8  pm_st;        /* Persistent Memory Steering Tag (1 byte) */
+	__u16 vm_xst;       /* Volatile Memory Extended Steering Tag (2 bytes) */
+	__u16 pm_xst;       /* Persistent Memory Extended Steering Tag (2 bytes) */
+	__u16 reserved;     /* Padding to 8 bytes for aligned offset lookup */
+} __attribute__((packed));
+
+#define PCI_TPH_CPU_ST_ENTRY_SZ sizeof(struct pci_tph_cpu_st)
+
 #endif /* _UAPILINUX_PCI_H */
-- 
2.17.1
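
The tph_cpu_st layout proposed in the patch above (one 8-byte entry per
CPU, located at offset cpu_id * entry size) could be consumed from
userspace roughly as sketched below. This is only an illustration: the
struct is a local mirror of the proposed struct pci_tph_cpu_st, and the
names tph_cpu_st_entry, tph_cpu_st_offset() and tph_cpu_st_decode() are
hypothetical helpers, not part of the series. It assumes the layout is
merged unchanged.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>

/*
 * Local mirror of the 8-byte entry proposed in the patch (assumption:
 * the field order and sizes stay as posted).
 */
struct tph_cpu_st_entry {
	uint8_t  vm_st;    /* Volatile Memory Steering Tag */
	uint8_t  pm_st;    /* Persistent Memory Steering Tag */
	uint16_t vm_xst;   /* Volatile Memory Extended Steering Tag */
	uint16_t pm_xst;   /* Persistent Memory Extended Steering Tag */
	uint16_t reserved; /* padding to 8 bytes */
};

#define TPH_CPU_ST_ENTRY_SZ sizeof(struct tph_cpu_st_entry)

/* File offset of the entry that belongs to a given logical CPU. */
static off_t tph_cpu_st_offset(unsigned int cpu)
{
	return (off_t)cpu * (off_t)TPH_CPU_ST_ENTRY_SZ;
}

/*
 * Decode one raw entry as returned by a read at an entry-aligned offset.
 * Returns 0 on success, -1 if fewer than 8 bytes were supplied.
 */
static int tph_cpu_st_decode(const void *raw, size_t len,
			     struct tph_cpu_st_entry *out)
{
	assert(raw != NULL && out != NULL);

	if (len < TPH_CPU_ST_ENTRY_SZ)
		return -1;

	/* The kernel copies the struct in host byte order. */
	memcpy(out, raw, TPH_CPU_ST_ENTRY_SZ);
	return 0;
}
```

A caller would open /sys/bus/pci/devices/<BDF>/tph_cpu_st as root, issue
pread(fd, raw, 8, tph_cpu_st_offset(cpu)) and pass the result to
tph_cpu_st_decode(). Per get_cpu_all_st() in the patch, a field whose
valid bit was clear in the _DSM result is reported as 0.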



* [PATCH v17 09/12] vfio/pci: Hide TPH capability when TPH is unsupported
  2026-06-16 10:46 [PATCH v17 00/12] vfio/pci: Add PCIe TPH support Chengwen Feng
                   ` (7 preceding siblings ...)
  2026-06-16 10:46 ` [PATCH v17 08/12] PCI/TPH: Add sysfs binary file to export CPU to steering-tag mapping Chengwen Feng
@ 2026-06-16 10:46 ` Chengwen Feng
  2026-06-16 10:56   ` sashiko-bot
  2026-06-16 10:46 ` [PATCH v17 10/12] vfio/pci: Add TPH_ENABLE feature skeleton and unsafe module parameter Chengwen Feng
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 39+ messages in thread
From: Chengwen Feng @ 2026-06-16 10:46 UTC (permalink / raw)
  To: alex, jgg, helgaas
  Cc: wathsala.vithanage, wei.huang2, zhipingz, wangzhou1, wangyushan12,
	liuyonglong, kvm, linux-pci

Check the device's negotiated TPH support status before parsing the TPH
extended capability. Return a length of zero to hide the capability from
userspace if TPH was disabled during topology negotiation.

Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
---
 drivers/vfio/pci/vfio_pci_config.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
index a10ed733f0e3..5c6ab172df6c 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -22,6 +22,7 @@
 
 #include <linux/fs.h>
 #include <linux/pci.h>
+#include <linux/pci-tph.h>
 #include <linux/uaccess.h>
 #include <linux/vfio.h>
 #include <linux/slab.h>
@@ -1450,6 +1451,8 @@ static int vfio_ext_cap_len(struct vfio_pci_core_device *vdev, u16 ecap, u16 epo
 		byte &= PCI_DPA_CAP_SUBSTATE_MASK;
 		return PCI_DPA_BASE_SIZEOF + byte + 1;
 	case PCI_EXT_CAP_ID_TPH:
+		if (!pcie_tph_supported(pdev, false))
+			return 0;
 		ret = pci_read_config_dword(pdev, epos + PCI_TPH_CAP, &dword);
 		if (ret)
 			return pcibios_err_to_errno(ret);
-- 
2.17.1



* [PATCH v17 10/12] vfio/pci: Add TPH_ENABLE feature skeleton and unsafe module parameter
  2026-06-16 10:46 [PATCH v17 00/12] vfio/pci: Add PCIe TPH support Chengwen Feng
                   ` (8 preceding siblings ...)
  2026-06-16 10:46 ` [PATCH v17 09/12] vfio/pci: Hide TPH capability when TPH is unsupported Chengwen Feng
@ 2026-06-16 10:46 ` Chengwen Feng
  2026-06-16 10:55   ` sashiko-bot
  2026-06-16 10:46 ` [PATCH v17 11/12] vfio/pci: Add TPH_ST_CONFIG for PCIe TPH ST configuration Chengwen Feng
  2026-06-16 10:46 ` [PATCH v17 12/12] vfio/pci: Virtualize PCIe TPH capability registers Chengwen Feng
  11 siblings, 1 reply; 39+ messages in thread
From: Chengwen Feng @ 2026-06-16 10:46 UTC (permalink / raw)
  To: alex, jgg, helgaas
  Cc: wathsala.vithanage, wei.huang2, zhipingz, wangzhou1, wangyushan12,
	liuyonglong, kvm, linux-pci

Introduce the enable_unsafe_tph module parameter to gate all TPH-related
features, and add the VFIO_DEVICE_FEATURE_TPH_ENABLE uapi together with a
per-device tph_permit flag.

This is a preparatory patch: only the feature framework is added for now.
The actual TPH_CTRL register permission control and steering tag
features (TPH_CPU_ST / TPH_ST_CONFIG) are attached in subsequent TPH
capability virtualization commits.

Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
---
 drivers/vfio/pci/vfio_pci.c        | 13 ++++++++++++-
 drivers/vfio/pci/vfio_pci_config.c |  1 +
 drivers/vfio/pci/vfio_pci_core.c   | 25 ++++++++++++++++++++++++-
 include/linux/vfio_pci_core.h      |  4 +++-
 include/uapi/linux/vfio.h          |  7 +++++++
 5 files changed, 47 insertions(+), 3 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 0c771064c0b8..6d73668459cf 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -60,6 +60,12 @@ static bool disable_denylist;
 module_param(disable_denylist, bool, 0444);
 MODULE_PARM_DESC(disable_denylist, "Disable use of device denylist. Disabling the denylist allows binding to devices with known errata that may lead to exploitable stability or security issues when accessed by untrusted users.");
 
+#ifdef CONFIG_PCIE_TPH
+static bool enable_unsafe_tph;
+module_param(enable_unsafe_tph, bool, 0444);
+MODULE_PARM_DESC(enable_unsafe_tph, "Enable PCIe TPH (Transaction Processing Hints) support. It may break platform isolation. If you do not know what this is for, step away. (default: false)");
+#endif
+
 static bool vfio_pci_dev_in_denylist(struct pci_dev *pdev)
 {
 	switch (pdev->vendor) {
@@ -257,12 +263,17 @@ static int __init vfio_pci_init(void)
 {
 	int ret;
 	bool is_disable_vga = true;
+	bool is_enable_unsafe_tph = false;
 
 #ifdef CONFIG_VFIO_PCI_VGA
 	is_disable_vga = disable_vga;
 #endif
+#ifdef CONFIG_PCIE_TPH
+	is_enable_unsafe_tph = enable_unsafe_tph;
+#endif
 
-	vfio_pci_core_set_params(nointxmask, is_disable_vga, disable_idle_d3);
+	vfio_pci_core_set_params(nointxmask, is_disable_vga, disable_idle_d3,
+				 is_enable_unsafe_tph);
 
 	/* Register and scan for devices */
 	ret = pci_register_driver(&vfio_pci_driver);
diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
index 5c6ab172df6c..251d3ec7fdd4 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -1783,6 +1783,7 @@ int vfio_config_init(struct vfio_pci_core_device *vdev)
 		goto out;
 
 	vdev->bardirty = true;
+	vdev->tph_permit = false;
 
 	/*
 	 * XXX can we just pci_load_saved_state/pci_restore_state?
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 050e7542952e..d5e534dd5829 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -41,6 +41,7 @@
 static bool nointxmask;
 static bool disable_vga;
 static bool disable_idle_d3;
+static bool enable_unsafe_tph;
 
 static void vfio_pci_eventfd_rcu_free(struct rcu_head *rcu)
 {
@@ -1551,6 +1552,24 @@ static int vfio_pci_core_feature_token(struct vfio_pci_core_device *vdev,
 	return 0;
 }
 
+static int vfio_pci_core_feature_tph_enable(struct vfio_pci_core_device *vdev,
+					    u32 flags, size_t argsz)
+{
+	int ret;
+
+	if (!enable_unsafe_tph)
+		return -EOPNOTSUPP;
+
+	ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_SET, 0);
+	if (ret <= 0)
+		return ret;
+
+	if (!vdev->tph_permit)
+		vdev->tph_permit = 1;
+
+	return 0;
+}
+
 int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
 				void __user *arg, size_t argsz)
 {
@@ -1569,6 +1588,8 @@ int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
 		return vfio_pci_core_feature_token(vdev, flags, arg, argsz);
 	case VFIO_DEVICE_FEATURE_DMA_BUF:
 		return vfio_pci_core_feature_dma_buf(vdev, flags, arg, argsz);
+	case VFIO_DEVICE_FEATURE_TPH_ENABLE:
+		return vfio_pci_core_feature_tph_enable(vdev, flags, argsz);
 	default:
 		return -ENOTTY;
 	}
@@ -2605,11 +2626,13 @@ static void vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set)
 }
 
 void vfio_pci_core_set_params(bool is_nointxmask, bool is_disable_vga,
-			      bool is_disable_idle_d3)
+			      bool is_disable_idle_d3,
+			      bool is_enable_unsafe_tph)
 {
 	nointxmask = is_nointxmask;
 	disable_vga = is_disable_vga;
 	disable_idle_d3 = is_disable_idle_d3;
+	enable_unsafe_tph = is_enable_unsafe_tph;
 }
 EXPORT_SYMBOL_GPL(vfio_pci_core_set_params);
 
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index 89165b769e5c..a946b35e6b85 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -127,6 +127,7 @@ struct vfio_pci_core_device {
 	bool			needs_pm_restore:1;
 	bool			pm_intx_masked:1;
 	bool			pm_runtime_engaged:1;
+	bool			tph_permit;
 	struct pci_saved_state	*pci_saved_state;
 	struct pci_saved_state	*pm_save;
 	int			ioeventfds_nr;
@@ -157,7 +158,8 @@ int vfio_pci_core_register_dev_region(struct vfio_pci_core_device *vdev,
 				      const struct vfio_pci_regops *ops,
 				      size_t size, u32 flags, void *data);
 void vfio_pci_core_set_params(bool nointxmask, bool is_disable_vga,
-			      bool is_disable_idle_d3);
+			      bool is_disable_idle_d3,
+			      bool is_enable_unsafe_tph);
 void vfio_pci_core_close_device(struct vfio_device *core_vdev);
 int vfio_pci_core_init_dev(struct vfio_device *core_vdev);
 void vfio_pci_core_release_dev(struct vfio_device *core_vdev);
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 5de618a3a5ee..e5a4d1d7091b 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -1534,6 +1534,13 @@ struct vfio_device_feature_dma_buf {
  */
 #define VFIO_DEVICE_FEATURE_MIG_PRECOPY_INFOv2  12
 
+/*
+ * Device-level opt-in for TPH (Transaction Processing Hints) support.
+ * When set, allows access to TPH_CPU_ST and TPH_ST_CONFIG features.
+ * Requires global enable_unsafe_tph module parameter to be enabled.
+ */
+#define VFIO_DEVICE_FEATURE_TPH_ENABLE	13
+
 /* -------- API for Type1 VFIO IOMMU -------- */
 
 /**
-- 
2.17.1



* [PATCH v17 11/12] vfio/pci: Add TPH_ST_CONFIG for PCIe TPH ST configuration
  2026-06-16 10:46 [PATCH v17 00/12] vfio/pci: Add PCIe TPH support Chengwen Feng
                   ` (9 preceding siblings ...)
  2026-06-16 10:46 ` [PATCH v17 10/12] vfio/pci: Add TPH_ENABLE feature skeleton and unsafe module parameter Chengwen Feng
@ 2026-06-16 10:46 ` Chengwen Feng
  2026-06-16 11:05   ` sashiko-bot
  2026-06-16 10:46 ` [PATCH v17 12/12] vfio/pci: Virtualize PCIe TPH capability registers Chengwen Feng
  11 siblings, 1 reply; 39+ messages in thread
From: Chengwen Feng @ 2026-06-16 10:46 UTC (permalink / raw)
  To: alex, jgg, helgaas
  Cc: wathsala.vithanage, wei.huang2, zhipingz, wangzhou1, wangyushan12,
	liuyonglong, kvm, linux-pci

Add a new VFIO device feature, VFIO_DEVICE_FEATURE_TPH_ST_CONFIG, to allow
userspace to configure PCIe TPH Steering Tag table entries. This interface
supports only configuration writes; read operations are not permitted.

Implement a shadow ST table to cache entries, serialized against
concurrent access by the per-device memory_lock. If a batch write fails,
the already-written entries are rolled back so that the hardware and the
shadow table stay consistent.

The feature is double gated:
1. The global enable_unsafe_tph module parameter must be enabled;
2. Userspace must first SET VFIO_DEVICE_FEATURE_TPH_ENABLE to set the
   per-device tph_permit flag before using TPH_ST_CONFIG.

Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
---
 drivers/vfio/pci/vfio_pci_core.c | 123 +++++++++++++++++++++++++++++++
 include/linux/vfio_pci_core.h    |   2 +
 include/uapi/linux/vfio.h        |  22 ++++++
 3 files changed, 147 insertions(+)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index d5e534dd5829..0f602faeaef3 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -29,6 +29,7 @@
 #include <linux/sched/mm.h>
 #include <linux/iommufd.h>
 #include <linux/pci-p2pdma.h>
+#include <linux/pci-tph.h>
 #if IS_ENABLED(CONFIG_EEH)
 #include <asm/eeh.h>
 #endif
@@ -529,6 +530,50 @@ static const struct dev_pm_ops vfio_pci_core_pm_ops = {
 			   NULL)
 };
 
+static int vfio_pci_tph_st_shadow_size(struct vfio_pci_core_device *vdev)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	u32 loc = pcie_tph_get_st_table_loc(pdev);
+	int ret;
+
+	if (loc == PCI_TPH_LOC_CAP) {
+		return pcie_tph_get_st_table_size(pdev);
+	} else if (loc == PCI_TPH_LOC_MSIX) {
+		ret = pci_msix_vec_count(pdev);
+		if (ret < 0)
+			return 0;
+		return ret;
+	} else {
+		return 0;
+	}
+}
+
+static int vfio_pci_tph_init(struct vfio_pci_core_device *vdev)
+{
+	vdev->tph_st_entries = 0;
+	vdev->tph_st_shadow = NULL;
+
+	if (!enable_unsafe_tph)
+		return 0;
+
+	vdev->tph_st_entries = vfio_pci_tph_st_shadow_size(vdev);
+	if (vdev->tph_st_entries) {
+		vdev->tph_st_shadow = kcalloc(vdev->tph_st_entries, sizeof(u16),
+					      GFP_KERNEL);
+		if (!vdev->tph_st_shadow)
+			return -ENOMEM;
+	}
+
+	return 0;
+}
+
+static void vfio_pci_tph_deinit(struct vfio_pci_core_device *vdev)
+{
+	kfree(vdev->tph_st_shadow);
+	vdev->tph_st_shadow = NULL;
+	vdev->tph_st_entries = 0;
+}
+
 int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
 {
 	struct pci_dev *pdev = vdev->pdev;
@@ -555,6 +600,11 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
 		goto out_disable_device;
 
 	vdev->reset_works = !ret;
+
+	ret = vfio_pci_tph_init(vdev);
+	if (ret)
+		goto out_disable_device;
+
 	pci_save_state(pdev);
 	vdev->pci_saved_state = pci_store_saved_state(pdev);
 	if (!vdev->pci_saved_state)
@@ -612,6 +662,7 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
 out_free_state:
 	kfree(vdev->pci_saved_state);
 	vdev->pci_saved_state = NULL;
+	vfio_pci_tph_deinit(vdev);
 out_disable_device:
 	pci_disable_device(pdev);
 out_power:
@@ -680,6 +731,7 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
 	kfree(vdev->region);
 	vdev->region = NULL; /* don't krealloc a freed pointer */
 
+	vfio_pci_tph_deinit(vdev);
 	vfio_config_free(vdev);
 
 	for (i = 0; i < PCI_STD_NUM_BARS; i++) {
@@ -1570,6 +1622,74 @@ static int vfio_pci_core_feature_tph_enable(struct vfio_pci_core_device *vdev,
 	return 0;
 }
 
+static int vfio_pci_core_feature_tph_st_config(
+			struct vfio_pci_core_device *vdev,
+			u32 flags,
+			struct vfio_device_feature_tph_st_config __user *arg,
+			size_t argsz)
+{
+	struct vfio_device_feature_tph_st_config config;
+	struct pci_dev *pdev = vdev->pdev;
+	void __user *uptr;
+	int i, idx, ret;
+	size_t sz;
+	u16 *sts;
+
+	if (!vdev->tph_permit || !vdev->tph_st_shadow)
+		return -EOPNOTSUPP;
+
+	ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_SET,
+				 sizeof(config));
+	if (ret <= 0)
+		return ret;
+
+	if (copy_from_user(&config, arg, sizeof(config)))
+		return -EFAULT;
+
+	if (config.count == 0 || config.reserved != 0 ||
+		config.index >= vdev->tph_st_entries ||
+		config.count > vdev->tph_st_entries - config.index)
+		return -EINVAL;
+
+	uptr = u64_to_user_ptr(config.data_uptr);
+	sts = memdup_array_user(uptr, config.count, sizeof(u16));
+	sz = config.count * sizeof(u16);
+	if (IS_ERR(sts))
+		return PTR_ERR(sts);
+
+	down_write(&vdev->memory_lock);
+	ret = vfio_pci_set_power_state(vdev, PCI_D0);
+	if (ret)
+		goto out_unlock_memory;
+
+	if (pcie_tph_enabled_req_type(pdev) == PCI_TPH_REQ_DISABLE)
+		goto update_shadow;
+
+	for (i = 0; i < config.count; i++) {
+		idx = config.index + i;
+		ret = pcie_tph_set_st_entry(pdev, idx, sts[i]);
+		if (ret)
+			goto rollback;
+	}
+
+update_shadow:
+	memcpy(&vdev->tph_st_shadow[config.index], sts, sz);
+	ret = 0;
+	goto out_unlock_memory;
+
+rollback:
+	while (i-- > 0) {
+		idx = config.index + i;
+		pcie_tph_set_st_entry(pdev, idx, vdev->tph_st_shadow[idx]);
+	}
+
+out_unlock_memory:
+	up_write(&vdev->memory_lock);
+
+	kfree(sts);
+	return ret;
+}
+
 int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
 				void __user *arg, size_t argsz)
 {
@@ -1590,6 +1710,9 @@ int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
 		return vfio_pci_core_feature_dma_buf(vdev, flags, arg, argsz);
 	case VFIO_DEVICE_FEATURE_TPH_ENABLE:
 		return vfio_pci_core_feature_tph_enable(vdev, flags, argsz);
+	case VFIO_DEVICE_FEATURE_TPH_ST_CONFIG:
+		return vfio_pci_core_feature_tph_st_config(vdev, flags,
+							   arg, argsz);
 	default:
 		return -ENOTTY;
 	}
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index a946b35e6b85..4f20d5a1d557 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -143,6 +143,8 @@ struct vfio_pci_core_device {
 	struct notifier_block	nb;
 	struct rw_semaphore	memory_lock;
 	struct list_head	dmabufs;
+	u16			*tph_st_shadow;
+	u16			tph_st_entries;
 };
 
 enum vfio_pci_io_width {
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index e5a4d1d7091b..61079594a91f 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -1541,6 +1541,28 @@ struct vfio_device_feature_dma_buf {
  */
 #define VFIO_DEVICE_FEATURE_TPH_ENABLE	13
 
+/**
+ * VFIO_DEVICE_FEATURE_TPH_ST_CONFIG - Configure PCIe TPH Steering Tag entries
+ *
+ * Provides userspace interface to configure PCIe TPH ST table entries.
+ *
+ * @index: Start entry offset within ST table
+ * @count: Number of consecutive entries to configure
+ * @data_uptr: Userspace data buffer for 16-bit raw ST values
+ *
+ * This feature requires two preconditions:
+ * 1. Global enable_unsafe_tph module parameter is enabled;
+ * 2. VFIO_DEVICE_FEATURE_TPH_ENABLE has been SET on the device beforehand.
+ */
+#define VFIO_DEVICE_FEATURE_TPH_ST_CONFIG	14
+
+struct vfio_device_feature_tph_st_config {
+	__u16 index;
+	__u16 count;
+	__u32 reserved; /* Reserved for future use, must be zero */
+	__aligned_u64 data_uptr;
+};
+
 /* -------- API for Type1 VFIO IOMMU -------- */
 
 /**
-- 
2.17.1
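
The range check and rollback behaviour of the TPH_ST_CONFIG handler in
the patch above can be modelled in plain C as sketched below. This is a
standalone illustration, not the kernel code: struct st_model, its
fail_at fault-injection field, model_write() and st_config() are all
hypothetical, the table size of 8 is arbitrary, and the numeric return
values merely stand in for -EIO and -EINVAL. The model only covers the
path where the TPH requester is enabled.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ST_ENTRIES 8

struct st_model {
	uint16_t hw[ST_ENTRIES];     /* stand-in for the device ST table */
	uint16_t shadow[ST_ENTRIES]; /* last committed values */
	int fail_at;                 /* index whose write fails, or -1 */
};

/* Stand-in for pcie_tph_set_st_entry(): may fail at one chosen index. */
static int model_write(struct st_model *m, unsigned int idx, uint16_t tag)
{
	if ((int)idx == m->fail_at)
		return -5; /* models -EIO */
	m->hw[idx] = tag;
	return 0;
}

static int st_config(struct st_model *m, uint16_t index, uint16_t count,
		     const uint16_t *tags)
{
	unsigned int i;
	int ret = 0;

	assert(m != NULL && tags != NULL);

	/* Same shape as the patch: index + count is never computed. */
	if (count == 0 || index >= ST_ENTRIES || count > ST_ENTRIES - index)
		return -22; /* models -EINVAL */

	for (i = 0; i < count; i++) {
		ret = model_write(m, index + i, tags[i]);
		if (ret)
			break;
	}

	if (ret) {
		/* Restore the entries already written from the shadow. */
		while (i-- > 0)
			model_write(m, index + i, m->shadow[index + i]);
		return ret;
	}

	/* Commit to the shadow only after every hardware write succeeded. */
	memcpy(&m->shadow[index], tags, count * sizeof(tags[0]));
	return 0;
}
```

Updating the shadow only after the whole batch succeeds is what lets the
rollback loop reuse it as the source of the previous values.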



* [PATCH v17 12/12] vfio/pci: Virtualize PCIe TPH capability registers
  2026-06-16 10:46 [PATCH v17 00/12] vfio/pci: Add PCIe TPH support Chengwen Feng
                   ` (10 preceding siblings ...)
  2026-06-16 10:46 ` [PATCH v17 11/12] vfio/pci: Add TPH_ST_CONFIG for PCIe TPH ST configuration Chengwen Feng
@ 2026-06-16 10:46 ` Chengwen Feng
  2026-06-16 11:03   ` sashiko-bot
  11 siblings, 1 reply; 39+ messages in thread
From: Chengwen Feng @ 2026-06-16 10:46 UTC (permalink / raw)
  To: alex, jgg, helgaas
  Cc: wathsala.vithanage, wei.huang2, zhipingz, wangzhou1, wangyushan12,
	liuyonglong, kvm, linux-pci

Virtualize the TPH extended capability config space registers:
- The TPH capability was previously fully read-only; permissions are now
  split: the TPH_CAP header remains read-only, while the TPH_CTRL register
  accepts writes to toggle the TPH requester enable mode.
- Block direct ST table programming via config space writes: ST entries
  can be configured only through VFIO_DEVICE_FEATURE_TPH_ST_CONFIG, and
  only after userspace has opted in by SETting TPH_ENABLE.
- Back up the original virtual config value and revert vconfig if the
  hardware TPH enable operation fails or an invalid requester mode is
  written.
- After the TPH requester is enabled via a CTRL write, sync the cached
  shadow ST table to the hardware under memory_lock, after bringing the
  device to PCI D0.

Add vconfig masking, via the new vfio_tph_mask_ext_tph_bit() helper, to
hide the EXT_TPH capability bit if the underlying hardware does not
support Extended TPH. Reset the hardware TPH state on device open/close so
that TPH configuration does not leak across sessions between different VM
lifecycles.

Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
---
 drivers/vfio/pci/vfio_pci_config.c | 117 +++++++++++++++++++++++++++++
 drivers/vfio/pci/vfio_pci_core.c   |   4 +
 2 files changed, 121 insertions(+)

diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
index 251d3ec7fdd4..1fcb53803b64 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -1086,6 +1086,118 @@ static int __init init_pci_ext_cap_pwr_perm(struct perm_bits *perm)
 	return 0;
 }
 
+/* Permissions for TPH extended capability */
+static int __init init_pci_ext_cap_tph_perm(struct perm_bits *perm)
+{
+	int i;
+
+	if (alloc_perm_bits(perm, pci_ext_cap_length[PCI_EXT_CAP_ID_TPH]))
+		return -ENOMEM;
+
+	p_setd(perm, 0, ALL_VIRT, NO_WRITE);
+	p_setd(perm, PCI_TPH_CAP, ALL_VIRT, NO_WRITE);
+
+	p_setd(perm, PCI_TPH_CTRL, ALL_VIRT, ALL_WRITE);
+
+	/* Per PCI specification: There is an upper limit of 64 entries
+	 * when the ST table is located in the TPH Requester Extended
+	 * Capability structure.
+	 * And the pci_ext_cap_length[PCI_EXT_CAP_ID_TPH] is 0xFF, so the
+	 * following operation is fine.
+	 */
+	for (i = 0; i < 64; i++)
+		p_setw(perm, PCI_TPH_BASE_SIZEOF + i * sizeof(u16),
+		       (u16)ALL_VIRT, (u16)ALL_WRITE);
+
+	return 0;
+}
+
+static void vfio_tph_mask_ext_tph_bit(struct vfio_pci_core_device *vdev,
+				      int pos)
+{
+	__le32 *vptr = (__le32 *)&vdev->vconfig[pos + PCI_TPH_CAP];
+	struct pci_dev *pdev = vdev->pdev;
+	u32 val;
+
+	if (!pcie_tph_supported(pdev, true)) {
+		val = le32_to_cpu(*vptr);
+		val &= ~PCI_TPH_CAP_EXT_TPH;
+		*vptr = cpu_to_le32(val);
+	}
+}
+
+static int vfio_find_cap_start(struct vfio_pci_core_device *vdev, int pos);
+static int vfio_tph_config_write(struct vfio_pci_core_device *vdev, int pos,
+				 int count, struct perm_bits *perm,
+				 int offset, __le32 val)
+{
+	int req_en_byte = PCI_TPH_CTRL + 1;
+	struct pci_dev *pdev = vdev->pdev;
+	__le32 org_val = 0;
+	bool extended;
+	u8 mode, req;
+	int i, ret;
+	u16 start;
+	u32 data;
+
+	if (!vdev->tph_permit)
+		return count;
+
+	down_write(&vdev->memory_lock);
+
+	/* Save the original value so it can be restored on failure */
+	if (offset <= req_en_byte && offset + count > req_en_byte)
+		vfio_default_config_read(vdev, pos, count, perm, offset,
+					 &org_val);
+
+	ret = vfio_default_config_write(vdev, pos, count, perm, offset, val);
+	if (ret != count) {
+		up_write(&vdev->memory_lock);
+		return ret;
+	}
+
+	/* Skip if write range does not cover Requester Enable byte */
+	if (offset > req_en_byte || offset + count <= req_en_byte) {
+		up_write(&vdev->memory_lock);
+		return count;
+	}
+
+	ret = vfio_pci_set_power_state(vdev, PCI_D0);
+	if (ret) {
+		vfio_default_config_write(vdev, pos, count, perm, offset,
+					  org_val);
+		up_write(&vdev->memory_lock);
+		return count;
+	}
+
+	start = vfio_find_cap_start(vdev, pos);
+	data = le32_to_cpu(*(__le32 *)&vdev->vconfig[start + PCI_TPH_CTRL]);
+	mode = FIELD_GET(PCI_TPH_CTRL_MODE_SEL_MASK, data);
+	req = FIELD_GET(PCI_TPH_CTRL_REQ_EN_MASK, data);
+
+	if (req == PCI_TPH_REQ_TPH_ONLY || req == PCI_TPH_REQ_EXT_TPH) {
+		extended = !!(req == PCI_TPH_REQ_EXT_TPH);
+		ret = pcie_enable_tph_explicit(pdev, mode, extended);
+		if (!ret && vdev->tph_st_shadow) {
+			for (i = 0; i < vdev->tph_st_entries; i++)
+				pcie_tph_set_st_entry(pdev, i,
+						      vdev->tph_st_shadow[i]);
+		}
+		if (ret)
+			vfio_default_config_write(vdev, pos, count, perm,
+						  offset, org_val);
+	} else if (req == PCI_TPH_REQ_DISABLE) {
+		pcie_disable_tph(vdev->pdev);
+	} else {
+		vfio_default_config_write(vdev, pos, count, perm, offset,
+					  org_val);
+	}
+
+	up_write(&vdev->memory_lock);
+
+	return count;
+}
+
 /*
  * Initialize the shared permission tables
  */
@@ -1101,6 +1213,7 @@ void vfio_pci_uninit_perm_bits(void)
 
 	free_perm_bits(&ecap_perms[PCI_EXT_CAP_ID_ERR]);
 	free_perm_bits(&ecap_perms[PCI_EXT_CAP_ID_PWR]);
+	free_perm_bits(&ecap_perms[PCI_EXT_CAP_ID_TPH]);
 }
 
 int __init vfio_pci_init_perm_bits(void)
@@ -1121,6 +1234,8 @@ int __init vfio_pci_init_perm_bits(void)
 	/* Extended capabilities */
 	ret |= init_pci_ext_cap_err_perm(&ecap_perms[PCI_EXT_CAP_ID_ERR]);
 	ret |= init_pci_ext_cap_pwr_perm(&ecap_perms[PCI_EXT_CAP_ID_PWR]);
+	ret |= init_pci_ext_cap_tph_perm(&ecap_perms[PCI_EXT_CAP_ID_TPH]);
+	ecap_perms[PCI_EXT_CAP_ID_TPH].writefn = vfio_tph_config_write;
 	ecap_perms[PCI_EXT_CAP_ID_VNDR].writefn = vfio_raw_config_write;
 	ecap_perms[PCI_EXT_CAP_ID_DVSEC].writefn = vfio_raw_config_write;
 
@@ -1704,6 +1819,8 @@ static int vfio_ecap_init(struct vfio_pci_core_device *vdev)
 		ret = vfio_fill_vconfig_bytes(vdev, epos, len);
 		if (ret)
 			return ret;
+		if (ecap == PCI_EXT_CAP_ID_TPH && !hidden)
+			vfio_tph_mask_ext_tph_bit(vdev, epos);
 
 		/*
 		 * If we're just using this capability to anchor the list,
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 0f602faeaef3..0bf49e36e64b 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -601,6 +601,8 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
 
 	vdev->reset_works = !ret;
 
+	/* Reset TPH status on new user session */
+	pcie_disable_tph(vdev->pdev);
 	ret = vfio_pci_tph_init(vdev);
 	if (ret)
 		goto out_disable_device;
@@ -731,6 +733,8 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
 	kfree(vdev->region);
 	vdev->region = NULL; /* don't krealloc a freed pointer */
 
+	/* Reset TPH status on session exit */
+	pcie_disable_tph(vdev->pdev);
 	vfio_pci_tph_deinit(vdev);
 	vfio_config_free(vdev);
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH v17 06/12] PCI/TPH: Expose the enabled TPH requester type
  2026-06-16 10:46 ` [PATCH v17 06/12] PCI/TPH: Expose the enabled TPH requester type Chengwen Feng
@ 2026-06-16 10:51   ` sashiko-bot
  0 siblings, 0 replies; 39+ messages in thread
From: sashiko-bot @ 2026-06-16 10:51 UTC (permalink / raw)
  To: Chengwen Feng; +Cc: kvm, linux-pci

> From: Zhiping Zhang <zhipingz@meta.com>
> 
> Add pcie_tph_enabled_req_type() so drivers can query the enabled TPH
> requester mode without reaching into pci_dev internals.
> 
> Signed-off-by: Zhiping Zhang <zhipingz@meta.com>
> Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>

Sashiko has reviewed this patch and found no issues. It looks great!

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260616104621.41915-1-fengchengwen@huawei.com?part=6


* Re: [PATCH v17 07/12] PCI/TPH: Add pcie_tph_supported() helper to check TPH capability attributes
  2026-06-16 10:46 ` [PATCH v17 07/12] PCI/TPH: Add pcie_tph_supported() helper to check TPH capability attributes Chengwen Feng
@ 2026-06-16 10:52   ` sashiko-bot
  0 siblings, 0 replies; 39+ messages in thread
From: sashiko-bot @ 2026-06-16 10:52 UTC (permalink / raw)
  To: Chengwen Feng; +Cc: linux-pci, kvm

> Add new helper pcie_tph_supported() with want_ext parameter:
> - want_ext = false: Check if device has valid TPH capability;
> - want_ext = true: Check hardware Extended TPH support.
> 
> This helper is prepared for follow-up VFIO TPH virtualization patches to
> uniformly query basic TPH existence and Extended TPH capability.
> 
> Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>

Sashiko has reviewed this patch and found no issues. It looks great!

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260616104621.41915-1-fengchengwen@huawei.com?part=7


* Re: [PATCH v17 04/12] PCI/TPH: Refactor pcie_enable_tph & add explicit requester variant
  2026-06-16 10:46 ` [PATCH v17 04/12] PCI/TPH: Refactor pcie_enable_tph & add explicit requester variant Chengwen Feng
@ 2026-06-16 10:53   ` sashiko-bot
  0 siblings, 0 replies; 39+ messages in thread
From: sashiko-bot @ 2026-06-16 10:53 UTC (permalink / raw)
  To: Chengwen Feng; +Cc: linux-pci, kvm

> Refactor pcie_enable_tph implementation: extract core logic into static
> internal enable_tph() helper accepting explicit requester type.
> 
> - Preserve original pcie_enable_tph() unchanged as auto wrapper; it
>   auto-selects EXT/standard TPH requester per device capability, existing
>   bnxt/mlx5 callers require zero modification.
> - Add exported pcie_enable_tph_explicit() with bool 'extended' parameter
>   for explicit STD/EXT selection, used by upcoming VFIO TPH support.
> 
> Input validation for EXT_TPH availability is retained inside helper to
> reject invalid explicit EXT request if hardware does not support extended
> requester.
> 
> Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>

Sashiko has reviewed this patch and found no issues. It looks great!

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260616104621.41915-1-fengchengwen@huawei.com?part=4


* Re: [PATCH v17 05/12] PCI/TPH: Refactor pcie_tph_get_cpu_st & add explicit variant
  2026-06-16 10:46 ` [PATCH v17 05/12] PCI/TPH: Refactor pcie_tph_get_cpu_st & add explicit variant Chengwen Feng
@ 2026-06-16 10:53   ` sashiko-bot
  0 siblings, 0 replies; 39+ messages in thread
From: sashiko-bot @ 2026-06-16 10:53 UTC (permalink / raw)
  To: Chengwen Feng; +Cc: kvm, linux-pci

> Refactor pcie_tph_get_cpu_st(): extract core logic into static internal
> get_cpu_st() helper accepting explicit requester type parameter.
> 
> - Preserve original pcie_tph_get_cpu_st() unchanged as auto wrapper; it
>   uses existing pdev->tph_req_type automatically, existing callers require
>   no change.
> - Add exported pcie_tph_get_cpu_st_explicit() with bool 'extended'
>   parameter for manual STD/EXT requester selection, consumed by upcoming
>   VFIO TPH code.
> - Add capability check: reject explicit EXT request when device does not
>   support extended TPH requester.
> 
> Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>

Sashiko has reviewed this patch and found no issues. It looks great!

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260616104621.41915-1-fengchengwen@huawei.com?part=5


* Re: [PATCH v17 03/12] PCI/TPH: Cache TPH requester capability at probe time
  2026-06-16 10:46 ` [PATCH v17 03/12] PCI/TPH: Cache TPH requester capability at probe time Chengwen Feng
@ 2026-06-16 10:55   ` sashiko-bot
  0 siblings, 0 replies; 39+ messages in thread
From: sashiko-bot @ 2026-06-16 10:55 UTC (permalink / raw)
  To: Chengwen Feng; +Cc: kvm, linux-pci

> Calculate the negotiated TPH requester type from device and root port
> capabilities once in pci_tph_init().
> 
> Add tph_ext_support flag to cache whether the device is allowed to
> issue Extended TPH requests after topology negotiation. If the final
> requester type is disabled, clear TPH capability to prevent usage.
> 
> Simplify pcie_enable_tph() by using the cached requester capability
> instead of recalculating every time.
> 
> Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>

Sashiko has reviewed this patch and found no issues. It looks great!

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260616104621.41915-1-fengchengwen@huawei.com?part=3


* Re: [PATCH v17 02/12] PCI/TPH: Fix tph_enabled concurrent update race by bitfield packing
  2026-06-16 10:46 ` [PATCH v17 02/12] PCI/TPH: Fix tph_enabled concurrent update race by bitfield packing Chengwen Feng
@ 2026-06-16 10:55   ` sashiko-bot
  0 siblings, 0 replies; 39+ messages in thread
From: sashiko-bot @ 2026-06-16 10:55 UTC (permalink / raw)
  To: Chengwen Feng; +Cc: linux-pci, kvm

> Move tph_enabled out of the shared pci_dev bitfield and into the spare bit
> of tph_cap's u16: tph_cap is immutable after enumeration and needs only 15
> bits for the offset, so the remaining bit stores tph_enabled. This removes
> the cross-bitfield concurrent write hazards highlighted by Sashiko after
> VFIO TPH exposure. No functional changes.
> 
> Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>

Sashiko has reviewed this patch and found no issues. It looks great!

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260616104621.41915-1-fengchengwen@huawei.com?part=2


* Re: [PATCH v17 10/12] vfio/pci: Add TPH_ENABLE feature skeleton and unsafe module parameter
  2026-06-16 10:46 ` [PATCH v17 10/12] vfio/pci: Add TPH_ENABLE feature skeleton and unsafe module parameter Chengwen Feng
@ 2026-06-16 10:55   ` sashiko-bot
  0 siblings, 0 replies; 39+ messages in thread
From: sashiko-bot @ 2026-06-16 10:55 UTC (permalink / raw)
  To: Chengwen Feng; +Cc: kvm, linux-pci

> Introduce module param enable_unsafe_tph to gate all TPH related features,
> and add VFIO_DEVICE_FEATURE_TPH_ENABLE uapi together with per-device
> tph_permit flag.
> 
> This is a preparatory implementation: only the feature framework is added
> for now. The actual TPH_CTRL register permission control and the steering
> tag features (TPH_CPU_ST / TPH_ST_CONFIG) will be attached in subsequent
> TPH capability virtualization commits.
> 
> Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>

Sashiko has reviewed this patch and found no issues. It looks great!

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260616104621.41915-1-fengchengwen@huawei.com?part=10


* Re: [PATCH v17 09/12] vfio/pci: Hide TPH capability when TPH is unsupported
  2026-06-16 10:46 ` [PATCH v17 09/12] vfio/pci: Hide TPH capability when TPH is unsupported Chengwen Feng
@ 2026-06-16 10:56   ` sashiko-bot
  0 siblings, 0 replies; 39+ messages in thread
From: sashiko-bot @ 2026-06-16 10:56 UTC (permalink / raw)
  To: Chengwen Feng; +Cc: kvm, linux-pci

Thank you for your contribution! Sashiko AI review found 1 potential issue to consider:

Pre-existing issues:
- [Critical] Missing bounds check on dynamically-sized PCIe capabilities allows a malicious device to cause a kernel heap buffer overflow.
--

commit 2b14d080b02da7dcb99b160bf40eeb6f9e38b928
Author: Chengwen Feng <fengchengwen@huawei.com>

vfio/pci: Hide TPH capability when TPH is unsupported

This commit checks the device negotiated TPH support status before parsing
the TPH extended capability. It returns zero length to hide the capability
from userspace if TPH is disabled during topology negotiation.

> diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
> index a10ed733f0e3a..5c6ab172df6c0 100644
> --- a/drivers/vfio/pci/vfio_pci_config.c
> +++ b/drivers/vfio/pci/vfio_pci_config.c
[ ... ]
> @@ -1450,6 +1451,8 @@ static int vfio_ext_cap_len(struct vfio_pci_core_device *vdev, u16 ecap, u16 epo
>  		byte &= PCI_DPA_CAP_SUBSTATE_MASK;
>  		return PCI_DPA_BASE_SIZEOF + byte + 1;
>  	case PCI_EXT_CAP_ID_TPH:
> +		if (!pcie_tph_supported(pdev, false))
> +			return 0;
>  		ret = pci_read_config_dword(pdev, epos + PCI_TPH_CAP, &dword);

[Severity: Critical]
This is a pre-existing issue, but does this function safely bound the returned
capability length?

Looking at the rest of the PCI_EXT_CAP_ID_TPH case in vfio_ext_cap_len()
in drivers/vfio/pci/vfio_pci_config.c:

    if ((dword & PCI_TPH_CAP_LOC_MASK) == PCI_TPH_LOC_CAP) {
        int sts;

        sts = dword & PCI_TPH_CAP_ST_MASK;
        sts >>= PCI_TPH_CAP_ST_SHIFT;
        return PCI_TPH_BASE_SIZEOF + (sts * 2) + 2;
    }

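Since the ST table size is read directly from the attacker-controlled device
config space, the returned length could be up to 4108 bytes: the 12 bytes of
PCI_TPH_BASE_SIZEOF plus 2047 * 2 + 2 for the largest value of the 11-bit ST
table size field.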

Later, in vfio_ecap_init() in the same file, this unbounded length is used
to write into heap buffers:

    memset(map + epos, ecap, len);
    ret = vfio_fill_vconfig_bytes(vdev, epos, len);

Since the pci_config_map and vconfig arrays are allocated to pdev->cfg_size
(typically 4096 bytes), could a malicious PCIe endpoint or virtual device
provide an oversized TPH capability size to cause a kernel heap buffer
overflow, as there is no check ensuring epos + len fits within pdev->cfg_size?
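
One way the missing check could look is sketched below. This is a standalone
illustration rather than a proposed patch: the helper name
tph_cap_len_bounded() is hypothetical, the PCI_TPH_* constants are local
copies assumed to mirror include/uapi/linux/pci_regs.h, and returning 0
(hiding the capability) for an oversized table is only one possible policy.

```c
#include <assert.h>
#include <stdint.h>

/* Local copies, assumed to mirror include/uapi/linux/pci_regs.h. */
#define PCI_TPH_CAP_LOC_MASK	0x00000600u
#define PCI_TPH_LOC_CAP		0x00000200u
#define PCI_TPH_CAP_ST_MASK	0x07FF0000u
#define PCI_TPH_CAP_ST_SHIFT	16
#define PCI_TPH_BASE_SIZEOF	0xc

/*
 * Hypothetical helper: compute the TPH capability length from the raw
 * PCI_TPH_CAP dword and return 0 (hide the capability) when
 * epos + len would run past the end of config space.
 */
static int tph_cap_len_bounded(uint32_t dword, unsigned int epos,
			       unsigned int cfg_size)
{
	unsigned int len = PCI_TPH_BASE_SIZEOF;

	if ((dword & PCI_TPH_CAP_LOC_MASK) == PCI_TPH_LOC_CAP) {
		unsigned int sts;

		sts = (dword & PCI_TPH_CAP_ST_MASK) >> PCI_TPH_CAP_ST_SHIFT;
		len += (sts * 2) + 2;
	}

	/* epos is checked first so the subtraction cannot wrap. */
	if (epos >= cfg_size || len > cfg_size - epos)
		return 0;

	return (int)len;
}
```

With a 4096-byte config space and the capability at offset 0x100, a dword
advertising the maximum table size yields 4108 bytes and would be rejected,
while a 64-entry table (140 bytes) would pass.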

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260616104621.41915-1-fengchengwen@huawei.com?part=9

* Re: [PATCH v17 01/12] PCI/TPH: Fix pcie_tph_get_st_table_loc() field extraction
  2026-06-16 10:46 ` [PATCH v17 01/12] PCI/TPH: Fix pcie_tph_get_st_table_loc() field extraction Chengwen Feng
@ 2026-06-16 11:00   ` sashiko-bot
  0 siblings, 0 replies; 39+ messages in thread
From: sashiko-bot @ 2026-06-16 11:00 UTC (permalink / raw)
  To: Chengwen Feng; +Cc: kvm, linux-pci

Thank you for your contribution! Sashiko AI review found 1 potential issue to consider:

Pre-existing issues:
- [High] Surprise removal (PCI disconnect) or a malicious PCI endpoint can trigger a heap buffer overflow in `pci_save_tph_state()` and an OOB read in `pci_restore_tph_state()` due to a TOCTOU evaluation of the ST table size.
--

commit fe1abb4bbf50a2222b456026a8b0caa06ea99438
Author: Chengwen Feng <fengchengwen@huawei.com>

PCI/TPH: Fix pcie_tph_get_st_table_loc() field extraction

pcie_tph_get_st_table_loc() incorrectly uses FIELD_GET(), which shifts the
field value to bit 0. But the function is designed to return raw
PCI_TPH_LOC_* values as defined in the function comment.

> diff --git a/drivers/pci/tph.c b/drivers/pci/tph.c
> index 91145e8d9d951..bef3a55539c46 100644
> --- a/drivers/pci/tph.c
> +++ b/drivers/pci/tph.c
[ ... ]
> @@ -185,9 +188,6 @@ u16 pcie_tph_get_st_table_size(struct pci_dev *pdev)
>  
>  	/* Check ST table location first */
>  	loc = pcie_tph_get_st_table_loc(pdev);
> -
> -	/* Convert loc to match with PCI_TPH_LOC_* defined in pci_regs.h */
> -	loc = FIELD_PREP(PCI_TPH_CAP_LOC_MASK, loc);
>  	if (loc != PCI_TPH_LOC_CAP)
>  		return 0;

[Severity: High]
This is a pre-existing issue, but does this function have a TOCTOU race
that could cause a buffer overflow during a device disconnect?

pcie_tph_get_st_table_loc() indirectly reads the PCI_TPH_CAP register.
Shortly after, pcie_tph_get_st_table_size() reads it again:

drivers/pci/tph.c:pcie_tph_get_st_table_size() {
    ...
	loc = pcie_tph_get_st_table_loc(pdev);
	if (loc != PCI_TPH_LOC_CAP)
		return 0;

	pci_read_config_dword(pdev, pdev->tph_cap + PCI_TPH_CAP, &reg);

	return FIELD_GET(PCI_TPH_CAP_ST_MASK, reg) + 1;
}

If a device disconnects exactly between these two reads, or if a malicious
endpoint provides PCI_TPH_LOC_CAP for the first read but ~0 for the second
read, the FIELD_GET() calculation evaluates to 2048.

Since pci_save_tph_state() iterates based on this dynamically queried size,
could it write up to 4096 bytes of disconnect error data into the
save_state->cap.data buffer, overflowing it?

drivers/pci/tph.c:pci_save_tph_state() {
    ...
	num_entries = pcie_tph_get_st_table_size(pdev);
	for (i = 0; i < num_entries; i++) {
		pci_read_config_word(pdev, pdev->tph_cap + offset,
				     st_entry++);
    ...
}

Similarly, could pci_restore_tph_state() be vulnerable to an out-of-bounds
heap read if the table size evaluates to 2048 on disconnect?

drivers/pci/tph.c:pci_restore_tph_state() {
    ...
	num_entries = pcie_tph_get_st_table_size(pdev);
	for (i = 0; i < num_entries; i++) {
		pci_write_config_word(pdev, pdev->tph_cap + offset,
				      *st_entry++);
    ...
}
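
A possible way to close the window is to derive both the table location and
the table size from a single register snapshot, and to cap the result at the
64-entry limit for in-capability ST tables. The sketch below is standalone
and illustrative only: tph_st_entries_from_cap() and TPH_ST_MAX_ENTRIES are
hypothetical names, the PCI_TPH_* constants are local copies assumed to
mirror pci_regs.h, and clamping (rather than rejecting) an oversized count is
an arbitrary policy choice.

```c
#include <assert.h>
#include <stdint.h>

/* Local copies, assumed to mirror include/uapi/linux/pci_regs.h. */
#define PCI_TPH_CAP_LOC_MASK	0x00000600u
#define PCI_TPH_LOC_CAP		0x00000200u
#define PCI_TPH_CAP_ST_MASK	0x07FF0000u
#define PCI_TPH_CAP_ST_SHIFT	16

/* Hypothetical name for the 64-entry limit of in-capability ST tables. */
#define TPH_ST_MAX_ENTRIES	64u

/*
 * Hypothetical helper: take ONE snapshot of PCI_TPH_CAP and derive both
 * the location check and the entry count from it.
 */
static uint16_t tph_st_entries_from_cap(uint32_t reg)
{
	uint32_t entries;

	/* All-ones is what a config read returns after surprise removal. */
	if (reg == UINT32_MAX)
		return 0;

	if ((reg & PCI_TPH_CAP_LOC_MASK) != PCI_TPH_LOC_CAP)
		return 0;

	entries = ((reg & PCI_TPH_CAP_ST_MASK) >> PCI_TPH_CAP_ST_SHIFT) + 1;

	/* Never report more entries than the in-capability limit. */
	if (entries > TPH_ST_MAX_ENTRIES)
		entries = TPH_ST_MAX_ENTRIES;

	return (uint16_t)entries;
}
```

A caller would still need to make sure the save buffer was sized for the same
bound; caching the entry count once at probe time would be another option.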

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260616104621.41915-1-fengchengwen@huawei.com?part=1

* Re: [PATCH v17 08/12] PCI/TPH: Add sysfs binary file to export CPU to steering-tag mapping
  2026-06-16 10:46 ` [PATCH v17 08/12] PCI/TPH: Add sysfs binary file to export CPU to steering-tag mapping Chengwen Feng
@ 2026-06-16 11:00   ` sashiko-bot
  2026-06-16 14:42   ` Jason Gunthorpe
  1 sibling, 0 replies; 39+ messages in thread
From: sashiko-bot @ 2026-06-16 11:00 UTC (permalink / raw)
  To: Chengwen Feng; +Cc: linux-pci, kvm

Thank you for your contribution! Sashiko AI review found 4 potential issues to consider:

New issues:
- [Medium] sysfs binary read callback incorrectly returns 0 (EOF) for unaligned offsets, breaking partial reads.
- [Medium] Returning -ENODEV for impossible CPUs aborts sequential file reads on sparse CPU topologies.

Pre-existing issues:
- [High] Missing length validation when parsing ACPI _DSM output buffer leads to an out-of-bounds read.
- [High] ACPI _DSM buffer parsing uses an unaligned 64-bit cast and C bitfields, breaking on strict-alignment and big-endian architectures.
--

commit 3d85ff02b28aea880afdba0d09e73076f825fdb4
Author: Chengwen Feng <fengchengwen@huawei.com>

PCI/TPH: Add sysfs binary file to export CPU to steering-tag mapping
    
Add per-device sysfs binary attribute tph_cpu_st to expose ACPI DSM CPU
to steering-tag data to userspace, resolving the concern that VFIO should
not host CPU-to-ST translation interfaces.

> diff --git a/drivers/pci/tph.c b/drivers/pci/tph.c
> index 95280aab4fb51..aca5093e81525 100644
> --- a/drivers/pci/tph.c
> +++ b/drivers/pci/tph.c
> @@ -130,6 +130,29 @@ static acpi_status tph_invoke_dsm(acpi_handle handle, u32 cpu_uid,
>  
>  	return AE_OK;
>  }
> +
> +static int tph_get_cpu_st_info(struct pci_dev *pdev, unsigned int cpu,
> +			       union st_info *info)
> +{
> +	acpi_handle rp_acpi_handle;
> +	struct pci_dev *rp;
> +	u32 cpu_uid;
> +	int ret;
> +
> +	ret = acpi_get_cpu_uid(cpu, &cpu_uid);
> +	if (ret != 0)
> +		return ret;
> +
> +	rp = pcie_find_root_port(pdev);
> +	if (!rp || !rp->bus || !rp->bus->bridge)
> +		return -ENODEV;
> +
> +	rp_acpi_handle = ACPI_HANDLE(rp->bus->bridge);
> +	if (tph_invoke_dsm(rp_acpi_handle, cpu_uid, info) != AE_OK)

[Severity: High]
This isn't a bug introduced by this patch, but tph_invoke_dsm() appears to
have an out-of-bounds read.  When parsing the ACPI _DSM output, it
unconditionally dereferences the buffer pointer without validating the
length:

tph_invoke_dsm() {
	...
	if (out_obj->type != ACPI_TYPE_BUFFER) {
		ACPI_FREE(out_obj);
		return AE_ERROR;
	}

	st_out->value = *((u64 *)(out_obj->buffer.pointer));
	...
}

If a buggy or malicious ACPI firmware returns a buffer smaller than 8 bytes,
could this lead to an out-of-bounds memory read?

This is also a pre-existing issue, but does casting the byte buffer to a u64*
and dereferencing it break on strict-alignment architectures?

Additionally, union st_info uses architecture-dependent C bitfields to
extract data from this 64-bit value.  Could this extract incorrect bits on
big-endian architectures, resulting in corruption of the hardware steering
tags?
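
The three concerns above (length, alignment, endianness) could be addressed
together along the following lines. This is a standalone sketch, not the
kernel code: dsm_buf_to_u64() and u64_field() are hypothetical helpers, and
the little-endian byte order of the _DSM buffer is an assumption. The actual
field positions inside union st_info are not reproduced here.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: decode the first 8 bytes of a firmware-provided
 * buffer as a little-endian u64 (byte order is an assumption). Reading
 * byte by byte avoids both the unaligned u64 load and any dependence on
 * host endianness; the length check avoids the short-buffer read.
 */
static int dsm_buf_to_u64(const uint8_t *buf, size_t len, uint64_t *out)
{
	uint64_t val = 0;
	size_t i;

	assert(out != NULL);

	if (!buf || len < sizeof(val))
		return -EINVAL;

	for (i = 0; i < sizeof(val); i++)
		val |= (uint64_t)buf[i] << (8 * i);

	*out = val;
	return 0;
}

/* Extract a field with an explicit shift and width instead of a C bitfield. */
static uint64_t u64_field(uint64_t val, unsigned int shift, unsigned int width)
{
	uint64_t mask = (width < 64) ? ((1ULL << width) - 1) : ~0ULL;

	return (val >> shift) & mask;
}
```

Explicit shifts and masks give the same result on every architecture, which
is the property C bitfields do not guarantee.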

[ ... ]

> @@ -619,3 +647,58 @@ bool pcie_tph_supported(struct pci_dev *pdev, bool want_ext)
>  	return pdev->tph_ext_support;
>  }
>  EXPORT_SYMBOL(pcie_tph_supported);
> +
> +static ssize_t tph_cpu_st_read(struct file *filp, struct kobject *kobj,
> +			       const struct bin_attribute *bin_attr, char *buf,
> +			       loff_t off, size_t count)
> +{
> +	struct pci_dev *pdev = to_pci_dev(kobj_to_dev(kobj));
> +	size_t entry_sz = PCI_TPH_CPU_ST_ENTRY_SZ;
> +	struct pci_tph_cpu_st st;
> +	unsigned int target_cpu;
> +	size_t copy_len;
> +
> +	if (off >= nr_cpu_ids * entry_sz || off % entry_sz != 0)
> +		return 0;

[Severity: Medium]
Does returning 0 here prematurely signal an End-Of-File (EOF) for unaligned
reads?  If a userspace application attempts a short read or reads at an
unaligned offset, should this reject the request with -EINVAL or properly
serve the unaligned data instead of truncating the stream?

> +
> +	target_cpu = off / entry_sz;
> +	if (!cpu_possible(target_cpu))
> +		return -ENODEV;

[Severity: Medium]
Could returning -ENODEV here abort sequential file reads on systems with
sparse CPU topologies?  If sequential readers (e.g., using cat or hexdump)
hit this error at the first missing CPU, they might fail to retrieve valid
steering tags for any CPUs located after the topology gap.  Would it be
better to skip the missing CPU by returning a zeroed entry instead?
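
For illustration, read semantics that serve arbitrary offsets and return
zeroed entries for missing CPUs might look like the standalone sketch below.
Everything here is hypothetical scaffolding: st_table_read() is not the
sysfs callback, ENTRY_SZ stands in for PCI_TPH_CPU_ST_ENTRY_SZ (8 bytes per
the commit message), and the two callbacks stand in for cpu_possible() and
get_cpu_all_st().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ENTRY_SZ 8u	/* stand-in for PCI_TPH_CPU_ST_ENTRY_SZ */

/* Hypothetical callbacks standing in for cpu_possible()/get_cpu_all_st(). */
typedef bool (*cpu_ok_fn)(unsigned int cpu);
typedef void (*fill_fn)(unsigned int cpu, uint8_t entry[ENTRY_SZ]);

/* Serve any (off, count) window of a table of nr_cpus fixed-size entries. */
static long st_table_read(char *buf, uint64_t off, size_t count,
			  unsigned int nr_cpus, cpu_ok_fn cpu_ok, fill_fn fill)
{
	uint64_t size = (uint64_t)nr_cpus * ENTRY_SZ;
	size_t done = 0;

	assert(buf && cpu_ok && fill);

	if (off >= size)
		return 0;			/* genuine EOF */
	if (count > size - off)
		count = (size_t)(size - off);

	while (done < count) {
		uint8_t entry[ENTRY_SZ] = { 0 };
		unsigned int cpu = (unsigned int)((off + done) / ENTRY_SZ);
		size_t skip = (size_t)((off + done) % ENTRY_SZ);
		size_t n = ENTRY_SZ - skip;

		if (n > count - done)
			n = count - done;
		if (cpu_ok(cpu))
			fill(cpu, entry);	/* holes stay zeroed */

		memcpy(buf + done, entry + skip, n);
		done += n;
	}

	return (long)done;
}

/* Demo stubs: CPU 1 is a hole; other entries hold 0x10 + cpu in each byte. */
static bool demo_cpu_ok(unsigned int cpu)
{
	return cpu != 1;
}

static void demo_fill(unsigned int cpu, uint8_t entry[ENTRY_SZ])
{
	memset(entry, (int)(0x10 + cpu), ENTRY_SZ);
}
```

With this shape a sequential reader keeps a stable offset-to-CPU mapping
across topology gaps, and 0 is returned only at the true end of the table.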

> +
> +	get_cpu_all_st(pdev, target_cpu, &st);
> +
> +	copy_len = min_t(size_t, entry_sz, count);
> +	memcpy(buf, &st, copy_len);
> +
> +	return copy_len;
> +}

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260616104621.41915-1-fengchengwen@huawei.com?part=8

* Re: [PATCH v17 12/12] vfio/pci: Virtualize PCIe TPH capability registers
  2026-06-16 10:46 ` [PATCH v17 12/12] vfio/pci: Virtualize PCIe TPH capability registers Chengwen Feng
@ 2026-06-16 11:03   ` sashiko-bot
  0 siblings, 0 replies; 39+ messages in thread
From: sashiko-bot @ 2026-06-16 11:03 UTC (permalink / raw)
  To: Chengwen Feng; +Cc: linux-pci, kvm

Thank you for your contribution! Sashiko AI review found 1 potential issue to consider:
- [High] Silent state desynchronization between hardware TPH and virtual config due to missing error handling for ST entry updates.
--

commit 22993c676c19c404abf1528141d0dd0e78a51f99
Author: Chengwen Feng <fengchengwen@huawei.com>

vfio/pci: Virtualize PCIe TPH capability registers

This commit virtualizes TPH extended capability config space registers.
It blocks direct ST-table programming via config space, adds vconfig
masking to hide EXT_TPH, and syncs the shadow ST table to physical hardware
when the requester is enabled.

> diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
> --- a/drivers/vfio/pci/vfio_pci_config.c
> +++ b/drivers/vfio/pci/vfio_pci_config.c

[ ... ]

> +static int vfio_tph_config_write(struct vfio_pci_core_device *vdev, int pos,
> +				 int count, struct perm_bits *perm,
> +				 int offset, __le32 val)
> +{

[ ... ]

> +	if (req == PCI_TPH_REQ_TPH_ONLY || req == PCI_TPH_REQ_EXT_TPH) {
> +		extended = !!(req == PCI_TPH_REQ_EXT_TPH);
> +		ret = pcie_enable_tph_explicit(pdev, mode, extended);
> +		if (!ret && vdev->tph_st_shadow) {
> +			for (i = 0; i < vdev->tph_st_entries; i++)
> +				pcie_tph_set_st_entry(pdev, i,
> +						      vdev->tph_st_shadow[i]);
> +		}

[Severity: High]
If pcie_tph_set_st_entry() fails (for instance, due to missing MSI-X
allocation), does it leave the hardware and virtual configuration out of
sync?

The loop ignores the return value from pcie_tph_set_st_entry(), meaning
ret will remain 0 even if the ST entry update fails. Since a failure here
can forcefully disable TPH in the physical hardware via pcie_disable_tph(),
the subsequent if (ret) config rollback will be skipped. This would leave
the guest OS believing TPH is enabled while the physical hardware is actually
disabled.

Additionally, in the VFIO_DEVICE_FEATURE_TPH_ST_CONFIG ioctl handler
(vfio_pci_core_feature_tph_st_config), the rollback loop tries to restore
entries using pcie_tph_set_st_entry(). If the initial failure unconditionally
disabled TPH on the physical device, will these rollback calls fail
immediately, leaving the recovery broken?
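
A minimal sketch of a sync loop that propagates the first failure is shown
below, so that the caller's existing rollback on a non-zero return would also
cover ST entry errors. It is standalone and illustrative only:
tph_sync_shadow() is a hypothetical helper, set_st_fn stands in for
pcie_tph_set_st_entry() (assumed to return 0 on success and a negative errno
on failure), and the demo stub exists purely to exercise the loop.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for pcie_tph_set_st_entry(). */
typedef int (*set_st_fn)(void *dev, unsigned int index, uint16_t tag);

/*
 * Program every shadow entry and stop at the first failure, so the
 * caller sees a non-zero return and can roll back the virtual config.
 */
static int tph_sync_shadow(void *dev, const uint16_t *shadow,
			   unsigned int nr_entries, set_st_fn set_entry)
{
	unsigned int i;
	int ret;

	assert(set_entry);

	for (i = 0; i < nr_entries; i++) {
		ret = set_entry(dev, i, shadow[i]);
		if (ret)
			return ret;
	}

	return 0;
}

/* Demo stub: fails with -EIO at index fail_at and counts its calls. */
struct demo_dev {
	unsigned int fail_at;
	unsigned int calls;
};

static int demo_set_entry(void *dev, unsigned int index, uint16_t tag)
{
	struct demo_dev *d = dev;

	(void)tag;
	d->calls++;
	return index == d->fail_at ? -EIO : 0;
}
```

Whether the caller should then re-enable TPH before restoring older entries
is a separate question that this sketch does not try to answer.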

> +		if (ret)
> +			vfio_default_config_write(vdev, pos, count, perm,
> +						  offset, org_val);
> +	} else if (req == PCI_TPH_REQ_DISABLE) {

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260616104621.41915-1-fengchengwen@huawei.com?part=12

* Re: [PATCH v17 11/12] vfio/pci: Add TPH_ST_CONFIG for PCIe TPH ST configuration
  2026-06-16 10:46 ` [PATCH v17 11/12] vfio/pci: Add TPH_ST_CONFIG for PCIe TPH ST configuration Chengwen Feng
@ 2026-06-16 11:05   ` sashiko-bot
  0 siblings, 0 replies; 39+ messages in thread
From: sashiko-bot @ 2026-06-16 11:05 UTC (permalink / raw)
  To: Chengwen Feng; +Cc: linux-pci, kvm

Thank you for your contribution! Sashiko AI review found 5 potential issues to consider:
- [High] Improper PM Runtime handling and missing dmabuf restoration in TPH ST configuration ioctl.
- [Medium] Missing __GFP_ACCOUNT in tph_st_shadow allocation.
- [Medium] The `tph_permit` flag is leaked across VFIO user sessions.
- [Medium] Feature PROBE for `TPH_ST_CONFIG` is broken due to an early dynamic permission check.
- [Medium] The TPH ST shadow table is not cleared when the VFIO device is reset, leading to stale data restoration.
--

commit b8ec8f816e02c6def8f7cc759648796f130131e0
Author: Chengwen Feng <fengchengwen@huawei.com>

vfio/pci: Add TPH_ST_CONFIG for PCIe TPH ST configuration

This commit adds a new VFIO device feature VFIO_DEVICE_FEATURE_TPH_ST_CONFIG
to allow userspace to configure PCIe TPH Steering Tag table entries. It
implements a shadow ST table to cache entries, paired with a per-device mutex
for concurrent access protection.

> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index ae05343701f76..cf8f6de025c9a 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c

[ ... ]

> @@ -495,6 +496,50 @@ static const struct dev_pm_ops vfio_pci_core_pm_ops = {

[ ... ]

> +static int vfio_pci_tph_init(struct vfio_pci_core_device *vdev)
> +{
> +	vdev->tph_st_entries = 0;
> +	vdev->tph_st_shadow = NULL;
> +
> +	if (!enable_unsafe_tph)
> +		return 0;
> +
> +	vdev->tph_st_entries = vfio_pci_tph_st_shadow_size(vdev);
> +	if (vdev->tph_st_entries) {
> +		vdev->tph_st_shadow = kcalloc(vdev->tph_st_entries, sizeof(u16),
> +					      GFP_KERNEL);

[Severity: Medium]
Since this per-device allocation is triggered by a userspace container
session, should it use GFP_KERNEL_ACCOUNT instead of GFP_KERNEL? Omitting
the account flag might allow a container to allocate unaccounted kernel
memory and bypass its memory cgroup limits.

> +		if (!vdev->tph_st_shadow)
> +			return -ENOMEM;
> +	}
> +
> +	return 0;
> +}
> +
> +static void vfio_pci_tph_deinit(struct vfio_pci_core_device *vdev)
> +{
> +	kfree(vdev->tph_st_shadow);
> +	vdev->tph_st_shadow = NULL;
> +	vdev->tph_st_entries = 0;

[Severity: Medium]
Should vdev->tph_permit be reset to 0 here during session exit?

Because vdev persists across open/close cycles until unbound from the driver,
a subsequent user session might inherit the tph_permit == 1 state,
bypassing the API requirement to explicitly enable TPH.

> +}

[ ... ]

> @@ -1535,6 +1587,74 @@ static int vfio_pci_core_feature_tph_enable(struct vfio_pci_core_device *vdev,

[ ... ]

> +static int vfio_pci_core_feature_tph_st_config(
> +			struct vfio_pci_core_device *vdev,
> +			u32 flags,
> +			struct vfio_device_feature_tph_st_config __user *arg,
> +			size_t argsz)
> +{
> +	struct vfio_device_feature_tph_st_config config;
> +	struct pci_dev *pdev = vdev->pdev;
> +	void __user *uptr;
> +	int i, idx, ret;
> +	size_t sz;
> +	u16 *sts;
> +
> +	if (!vdev->tph_permit || !vdev->tph_st_shadow)
> +		return -EOPNOTSUPP;

[Severity: Medium]
Will this early check break feature discovery?

VFIO allows discovering supported features using the
VFIO_DEVICE_FEATURE_PROBE flag. By checking !vdev->tph_permit before
calling vfio_check_feature(), any early capability PROBE request returns
-EOPNOTSUPP. This might make userspace incorrectly assume the kernel does
not support configuring the feature.

> +
> +	ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_SET,
> +				 sizeof(config));
> +	if (ret <= 0)
> +		return ret;

[ ... ]

> +	uptr = u64_to_user_ptr(config.data_uptr);
> +	sts = memdup_array_user(uptr, config.count, sizeof(u16));
> +	sz = config.count * sizeof(u16);
> +	if (IS_ERR(sts))
> +		return PTR_ERR(sts);
> +
> +	down_write(&vdev->memory_lock);
> +	ret = vfio_pci_set_power_state(vdev, PCI_D0);

[Severity: High]
Does this bypass the driver core's power management and miss dmabuf
restoration?

If the parent bridge is in a suspended state (e.g., D3cold), calling
vfio_pci_set_power_state() without first calling pm_runtime_resume_and_get()
can lead to bus errors (Unsupported Request) or system crashes.

Additionally, if the device was previously in a suspended state, does this
path also need to call vfio_pci_dma_buf_move(vdev, false) to restore dmabufs
and prevent permanently breaking dmabuf tracking?

> +	if (ret)
> +		goto out_unlock_memory;
> +
> +	if (pcie_tph_enabled_req_type(pdev) == PCI_TPH_REQ_DISABLE)
> +		goto update_shadow;

[ ... ]

> diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
> index 0b411d2136ed5..d5c6f73747ad9 100644
> --- a/include/linux/vfio_pci_core.h
> +++ b/include/linux/vfio_pci_core.h
> @@ -143,6 +143,8 @@ struct vfio_pci_core_device {
>  	struct notifier_block	nb;
>  	struct rw_semaphore	memory_lock;
>  	struct list_head	dmabufs;
> +	u16			*tph_st_shadow;

[Severity: Medium]
When a VFIO device is reset (e.g., via the VFIO_DEVICE_RESET ioctl or FLR),
its hardware ST table typically reverts to its default (zeroed) state.

Should the reset path (vfio_pci_ioctl_reset()) clear vdev->tph_st_shadow?

If not, a subsequent partial ST_CONFIG write or TPH enablement via config
space might incorrectly program the hardware with stale, pre-reset ST values
cached in the shadow table.

> +	u16			tph_st_entries;
>  };
>  
>  enum vfio_pci_io_width {

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260616104621.41915-1-fengchengwen@huawei.com?part=11

* Re: [PATCH v17 08/12] PCI/TPH: Add sysfs binary file to export CPU to steering-tag mapping
  2026-06-16 10:46 ` [PATCH v17 08/12] PCI/TPH: Add sysfs binary file to export CPU to steering-tag mapping Chengwen Feng
  2026-06-16 11:00   ` sashiko-bot
@ 2026-06-16 14:42   ` Jason Gunthorpe
  2026-06-16 16:57     ` Alex Williamson
  1 sibling, 1 reply; 39+ messages in thread
From: Jason Gunthorpe @ 2026-06-16 14:42 UTC (permalink / raw)
  To: Chengwen Feng
  Cc: alex, helgaas, wathsala.vithanage, wei.huang2, zhipingz,
	wangzhou1, wangyushan12, liuyonglong, kvm, linux-pci

On Tue, Jun 16, 2026 at 06:46:17PM +0800, Chengwen Feng wrote:
> Add per-device sysfs binary attribute tph_cpu_st to expose ACPI DSM CPU
> to steering-tag data to userspace, resolving the concern that VFIO should
> not host CPU-to-ST translation interfaces.
> 
> Follow PCI standard binattr framework: dynamic visible group, fixed-size
> 8-byte packed uapi entry, aligned offset read, root-only 0400 permission.
> Refactor duplicate ACPI DSM logic into shared tph_get_cpu_st_info helper.
> 
> ABI: /sys/bus/pci/devices/<BDF>/tph_cpu_st

I'm sorry, I really dislike this :(

Structured binary sysfs attributes are pretty much against the rules,
I think using sysfs at all for this interface is a bad idea.

IMHO the VFIO version was much better.

Jason

* Re: [PATCH v17 08/12] PCI/TPH: Add sysfs binary file to export CPU to steering-tag mapping
  2026-06-16 14:42   ` Jason Gunthorpe
@ 2026-06-16 16:57     ` Alex Williamson
  2026-06-16 17:27       ` Jason Gunthorpe
  2026-06-23  9:56       ` fengchengwen
  0 siblings, 2 replies; 39+ messages in thread
From: Alex Williamson @ 2026-06-16 16:57 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Chengwen Feng, helgaas, wathsala.vithanage, wei.huang2, zhipingz,
	wangzhou1, wangyushan12, liuyonglong, kvm, linux-pci, alex

On Tue, 16 Jun 2026 11:42:24 -0300
Jason Gunthorpe <jgg@ziepe.ca> wrote:

> On Tue, Jun 16, 2026 at 06:46:17PM +0800, Chengwen Feng wrote:
> > Add per-device sysfs binary attribute tph_cpu_st to expose ACPI DSM CPU
> > to steering-tag data to userspace, resolving the concern that VFIO should
> > not host CPU-to-ST translation interfaces.
> > 
> > Follow PCI standard binattr framework: dynamic visible group, fixed-size
> > 8-byte packed uapi entry, aligned offset read, root-only 0400 permission.
> > Refactor duplicate ACPI DSM logic into shared tph_get_cpu_st_info helper.
> > 
> > ABI: /sys/bus/pci/devices/<BDF>/tph_cpu_st  
> 
> I'm sorry, I really dislike this :(
> 
> Structured binary sysfs attributes are pretty much against the rules,
> I think using sysfs at all for this interface is a bad idea.
> 
> IMHO the VFIO version was much better.

There are some deficiencies in this implementation:

 - The ABI needs to be documented in
   Documentation/ABI/testing/sysfs-bus-pci

 - The attribute is at the wrong place, all endpoints would just
   replicate the root port values.  Place it at the root port.

 - Corollary, is_visible should key on whether we have CPU to ST
   mappings (_DSM) and the root port is a TPH completer (DevCap2), not
   the TPH capabilities of an endpoint - a userspace driver can already
   discover the endpoint TPH requester support.

 - The 8-byte aligned read requirement should be removed, perform a
   sub-8-byte read from the slot offset.  !cpu_possible() should be
   filled with zeros.  This allows userspace to dump the entire bin
   file or read only a set of fields.

 - Probably cleaner to zero-initialize the buffer rather than memset()
   and (redundant) reserved = 0.

IMO, this implementation in sysfs more so proves that vfio is the wrong
place for the interface.  vfio has a use case to consume STs, but it
doesn't produce them or own any of the mechanism by which they're
generated.  This proposal, with the above improvements, provides
effectively an ioctl-like interface when using a properly offset and
sized pread().
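
To illustrate the unaligned-read point above, here is a minimal userspace
sketch of a read handler that serves any offset and size from an array of
8-byte slots and returns zeros for CPUs that are not possible. The entry
layout, the demo_* names, and the stand-ins for cpu_possible() and the _DSM
lookup are all assumptions made for illustration; none of them come from the
patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_NR_CPUS 4

/* Hypothetical 8-byte slot; the real uapi layout is defined by the patch. */
struct demo_tph_entry {
	uint8_t  vm_st;
	uint8_t  pm_st;
	uint16_t vm_xst;
	uint16_t pm_xst;
	uint16_t reserved;
};
_Static_assert(sizeof(struct demo_tph_entry) == 8, "slot must be 8 bytes");

/* Stand-in for cpu_possible(): CPU 2 is treated as not possible. */
static bool demo_cpu_possible(unsigned int cpu)
{
	return cpu < DEMO_NR_CPUS && cpu != 2;
}

/* Stand-in for the _DSM lookup; the values are made up. */
static struct demo_tph_entry demo_lookup(unsigned int cpu)
{
	struct demo_tph_entry e = { 0 };

	e.vm_st = (uint8_t)(cpu + 1);
	e.vm_xst = (uint16_t)(0x100 + cpu);
	return e;
}

/* Copy count bytes starting at byte offset off; no alignment is required. */
static size_t demo_tph_read(char *buf, size_t off, size_t count)
{
	const size_t slot = sizeof(struct demo_tph_entry);
	const size_t total = DEMO_NR_CPUS * slot;
	size_t done = 0;

	assert(buf != NULL);
	if (off >= total)
		return 0;
	if (count > total - off)
		count = total - off;

	while (done < count) {
		unsigned int cpu = (unsigned int)((off + done) / slot);
		size_t in_slot = (off + done) % slot;
		size_t n = slot - in_slot;
		struct demo_tph_entry e = { 0 };	/* zeros if not possible */

		if (n > count - done)
			n = count - done;
		if (demo_cpu_possible(cpu))
			e = demo_lookup(cpu);
		memcpy(buf + done, (const char *)&e + in_slot, n);
		done += n;
	}
	return done;
}
```

With a handler shaped like this, a full dump and a one-byte pread() of a
single field go through the same path.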

The weak point is whether this bin attribute, exposing an array of
structures, fits within the socially acceptable norms of sysfs.  There
is some precedent for this, for example cc_settings_bin in infiniband,
but these might also be considered legacy.  So I don't know if this
sort of usage is a grey area that fits social norms or if it's promoting
legacy use cases that we don't want to repeat.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v17 08/12] PCI/TPH: Add sysfs binary file to export CPU to steering-tag mapping
  2026-06-16 16:57     ` Alex Williamson
@ 2026-06-16 17:27       ` Jason Gunthorpe
  2026-06-17  1:18         ` fengchengwen
  2026-06-23  9:56       ` fengchengwen
  1 sibling, 1 reply; 39+ messages in thread
From: Jason Gunthorpe @ 2026-06-16 17:27 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Chengwen Feng, helgaas, wathsala.vithanage, wei.huang2, zhipingz,
	wangzhou1, wangyushan12, liuyonglong, kvm, linux-pci

On Tue, Jun 16, 2026 at 10:57:54AM -0600, Alex Williamson wrote:

> IMO, this implementation in sysfs more so proves that vfio is the wrong
> place for the interface.  vfio has a use case to consume STs, but it
> doesn't produce them or own any of the mechanism by which they're
> generated.  

I would argue only a VFIO user can consume this information so putting
it in general sysfs is overkill.

> The weak point is whether this bin attribute, exposing an array of
> structures, fits within the socially acceptable norms of sysfs.  

I'm pretty sure this style of sysfs is not accepted and
counter-examples are things that are ancient or slipped through.

Jason

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v17 08/12] PCI/TPH: Add sysfs binary file to export CPU to steering-tag mapping
  2026-06-16 17:27       ` Jason Gunthorpe
@ 2026-06-17  1:18         ` fengchengwen
  2026-06-17  1:30           ` Alex Williamson
  0 siblings, 1 reply; 39+ messages in thread
From: fengchengwen @ 2026-06-17  1:18 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson
  Cc: helgaas, wathsala.vithanage, wei.huang2, zhipingz, wangzhou1,
	wangyushan12, liuyonglong, kvm, linux-pci

Hi Jason, Alex,

On 6/17/2026 1:27 AM, Jason Gunthorpe wrote:
> On Tue, Jun 16, 2026 at 10:57:54AM -0600, Alex Williamson wrote:
> 
>> IMO, this implementation in sysfs more so proves that vfio is the wrong
>> place for the interface.  vfio has a use case to consume STs, but it
>> doesn't produce them or own any of the mechanism by which they're
>> generated.  
> 
> I would argue only a VFIO user can consume this information so putting
> it in general sysfs is overkill.
> 
>> The weak point is whether this bin attribute, exposing an array of
>> structures, fits within the socially acceptable norms of sysfs.  
> 
> I'm pretty sure this style of sysfs is not accepted and
> counter-examples are things that are ancient or slipped through.

After weighing all feedback, I'd like to propose a lightweight single text sysfs node
design to resolve both your concerns, without introducing binary binattr or masses
of per-CPU files:

1. Only create a single r/w text attribute “tph_cpu_st” on Root Ports, no
   hundreds of separate per-CPU sysfs files;
2. Usage flow: echo target cpu_id into the file first, then cat it to fetch
   formatted human-readable ST/XST text line (cpu_id vm_st vm_xst pm_st pm_xst)
   for that CPU, e.g.
      echo 10 > /sys/bus/pci/devices/<BDF>/tph_cpu_st
      cat /sys/bus/pci/devices/<BDF>/tph_cpu_st
      and the output will look like: 10 2 20 100 120
   In the case of concurrent access, users can check whether the first value
   (cpu_id) matches the CPU they requested.

This design avoids heavy dynamic per-CPU directory/node management and
eliminates the controversial binary sysfs ABI.
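
The select-then-read flow and the cpu_id check can be modelled with a small
userspace sketch. It assumes the node keeps one selector shared by every
opener, which is how the description above reads; the demo_* names and the
ST value are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the node: one selector shared by all openers. */
struct demo_node {
	unsigned int selected_cpu;
};

struct demo_reply {
	unsigned int cpu_id;
	unsigned int st;
};

/* Models "echo <cpu> > tph_cpu_st". */
static void demo_store(struct demo_node *node, unsigned int cpu)
{
	assert(node != NULL);
	node->selected_cpu = cpu;
}

/* Models "cat tph_cpu_st"; the ST value is made up for the demo. */
static struct demo_reply demo_show(const struct demo_node *node)
{
	struct demo_reply r = { node->selected_cpu, node->selected_cpu + 100 };

	return r;
}

/* The check each reader would have to make before trusting a reply. */
static bool demo_reply_matches(struct demo_reply r, unsigned int wanted_cpu)
{
	return r.cpu_id == wanted_cpu;
}
```

If a second writer stores a different cpu_id between one reader's store and
show, the reply describes the other CPU, so each reader has to compare
cpu_id and repeat the store/show pair on a mismatch.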

Let me know if this single interactive text node approach works for both of you.

Thanks

> 
> Jason


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v17 08/12] PCI/TPH: Add sysfs binary file to export CPU to steering-tag mapping
  2026-06-17  1:18         ` fengchengwen
@ 2026-06-17  1:30           ` Alex Williamson
  2026-06-17  2:33             ` fengchengwen
  0 siblings, 1 reply; 39+ messages in thread
From: Alex Williamson @ 2026-06-17  1:30 UTC (permalink / raw)
  To: fengchengwen, Jason Gunthorpe
  Cc: helgaas, wathsala.vithanage, wei.huang2, zhipingz, wangzhou1,
	wangyushan12, liuyonglong, kvm, linux-pci

On Tue, Jun 16, 2026, at 7:18 PM, fengchengwen wrote:
> Hi Jason, Alex,
>
> On 6/17/2026 1:27 AM, Jason Gunthorpe wrote:
>> On Tue, Jun 16, 2026 at 10:57:54AM -0600, Alex Williamson wrote:
>> 
>>> IMO, this implementation in sysfs more so proves that vfio is the wrong
>>> place for the interface.  vfio has a use case to consume STs, but it
>>> doesn't produce them or own any of the mechanism by which they're
>>> generated.  
>> 
>> I would argue only a VFIO user can consume this information so putting
>> it in general sysfs is overkill.
>> 
>>> The weak point is whether this bin attribute, exposing an array of
>>> structures, fits within the socially acceptable norms of sysfs.  
>> 
>> I'm pretty sure this style of sysfs is not accepted and
>> counter-examples are things that are ancient or slipped through.
>
> After weighing all feedback, I'd like to propose a lightweight single 
> text sysfs node
> design to resolve both your concerns, without introducing binary 
> binattr or masses
> of per-CPU files:
>
> 1. Only create a single r/w text attribute “tph_cpu_st” on Root Ports, no
>    hundreds of separate per-CPU sysfs files;
> 2. Usage flow: echo target cpu_id into the file first, then cat it to fetch
>    formatted human-readable ST/XST text line (cpu_id vm_st vm_xst pm_st pm_xst)
>    for that CPU, e.g.
>       echo 10 > /sys/bus/pci/devices/<BDF>/tph_cpu_st
>       cat /sys/bus/pci/devices/<BDF>/tph_cpu_st
>       and the output will like: 10 2 20 100 120
>    In the case of concurrent access, users can determine whether the first value
>    (cpu_id) is the one they expect by making a judgment.

That's a terrible workaround for an inherently racy design.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v17 08/12] PCI/TPH: Add sysfs binary file to export CPU to steering-tag mapping
  2026-06-17  1:30           ` Alex Williamson
@ 2026-06-17  2:33             ` fengchengwen
  2026-06-17  3:01               ` Alex Williamson
  0 siblings, 1 reply; 39+ messages in thread
From: fengchengwen @ 2026-06-17  2:33 UTC (permalink / raw)
  To: Alex Williamson, Jason Gunthorpe
  Cc: helgaas, wathsala.vithanage, wei.huang2, zhipingz, wangzhou1,
	wangyushan12, liuyonglong, kvm, linux-pci

On 6/17/2026 9:30 AM, Alex Williamson wrote:
> On Tue, Jun 16, 2026, at 7:18 PM, fengchengwen wrote:
>> Hi Jason, Alex,
>>
>> On 6/17/2026 1:27 AM, Jason Gunthorpe wrote:
>>> On Tue, Jun 16, 2026 at 10:57:54AM -0600, Alex Williamson wrote:
>>>
>>>> IMO, this implementation in sysfs more so proves that vfio is the wrong
>>>> place for the interface.  vfio has a use case to consume STs, but it
>>>> doesn't produce them or own any of the mechanism by which they're
>>>> generated.  
>>>
>>> I would argue only a VFIO user can consume this information so putting
>>> it in general sysfs is overkill.
>>>
>>>> The weak point is whether this bin attribute, exposing an array of
>>>> structures, fits within the socially acceptable norms of sysfs.  
>>>
>>> I'm pretty sure this style of sysfs is not accepted and
>>> counter-examples are things that are ancient or slipped through.
>>
>> After weighing all feedback, I'd like to propose a lightweight single 
>> text sysfs node
>> design to resolve both your concerns, without introducing binary 
>> binattr or masses
>> of per-CPU files:
>>
>> 1. Only create a single r/w text attribute “tph_cpu_st” on Root Ports, no
>>    hundreds of separate per-CPU sysfs files;
>> 2. Usage flow: echo target cpu_id into the file first, then cat it to fetch
>>    formatted human-readable ST/XST text line (cpu_id vm_st vm_xst pm_st pm_xst)
>>    for that CPU, e.g.
>>       echo 10 > /sys/bus/pci/devices/<BDF>/tph_cpu_st
>>       cat /sys/bus/pci/devices/<BDF>/tph_cpu_st
>>       and the output will like: 10 2 20 100 120
>>    In the case of concurrent access, users can determine whether the first value
>>    (cpu_id) is the one they expect by making a judgment.
> 
> That's a terrible workaround for an inherently racy design.  Thanks,

This is because the DSM method is slow to invoke and involves many cores. Therefore,
this trade-off is made.

I propose another lazy-cached read-only text file design:

1. Only create one read-only sysfs attribute named tph_cpu_all on Root Port
   devices, no write store hook, completely eliminate read-write race risks;
2. Lazy initialization: All CPU steering tag entries are fetched via batch
   ACPI _DSM calls only on the first userspace read operation, then cached
   in the kernel to avoid repeated slow firmware interactions;
3. Cache invalidation only happens on Root Port PCI reset events, including
   system cold boot, AER fatal error recovery, userspace manual reset via sysfs,
   downstream hotplug and root port driver re-probe.
   No CPU hotplug notification hooks are registered to avoid extra code complexity;
4. On every sysfs read, iterate over the cached entry list and filter out
   offline CPUs according to the current online CPU mask dynamically, so
   only valid active CPU records are printed to userspace;
5. A read-write semaphore protects cached table access to support concurrent
   userspace readers safely;
6. Plain human-readable text output complies with the sysfs text-first norms
   that Jason prefers; no structured binary binattr is introduced.

Usage example:
   cat /sys/bus/pci/devices/<root-port-bdf>/tph_cpu_all
   the output will be: cpu_id vm_st pm_st vm_xst pm_xst
	0 0x00 0x00 0x0000 0x0000
	1 0x02 0x00 0x0005 0x0000
	2 0x01 0x00 0x0003 0x0000
	...
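
For reference, a userspace consumer of the text format above could parse
each record roughly as follows. This is only a sketch: the column order is
taken from the example output, while the struct and function names are
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical in-memory form of one output line. */
struct demo_cpu_st {
	unsigned int cpu_id;
	unsigned int vm_st;
	unsigned int pm_st;
	unsigned int vm_xst;
	unsigned int pm_xst;
};

/*
 * Parse one line of the form "<cpu_id> <vm_st> <pm_st> <vm_xst> <pm_xst>".
 * Returns false for the header line or any other line that does not
 * contain five numeric columns.
 */
static bool demo_parse_line(const char *line, struct demo_cpu_st *out)
{
	assert(line != NULL && out != NULL);
	return sscanf(line, "%u %x %x %x %x", &out->cpu_id, &out->vm_st,
		      &out->pm_st, &out->vm_xst, &out->pm_xst) == 5;
}
```

Every consumer would need a parser like this plus a scan over all lines to
find a single CPU.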

Thanks

> 
> Alex


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v17 08/12] PCI/TPH: Add sysfs binary file to export CPU to steering-tag mapping
  2026-06-17  2:33             ` fengchengwen
@ 2026-06-17  3:01               ` Alex Williamson
  2026-06-17  3:41                 ` fengchengwen
  0 siblings, 1 reply; 39+ messages in thread
From: Alex Williamson @ 2026-06-17  3:01 UTC (permalink / raw)
  To: fengchengwen, Jason Gunthorpe
  Cc: helgaas, wathsala.vithanage, wei.huang2, zhipingz, wangzhou1,
	wangyushan12, liuyonglong, kvm, linux-pci



On Tue, Jun 16, 2026, at 8:33 PM, fengchengwen wrote:
> On 6/17/2026 9:30 AM, Alex Williamson wrote:
>> On Tue, Jun 16, 2026, at 7:18 PM, fengchengwen wrote:
>>> Hi Jason, Alex,
>>>
>>> On 6/17/2026 1:27 AM, Jason Gunthorpe wrote:
>>>> On Tue, Jun 16, 2026 at 10:57:54AM -0600, Alex Williamson wrote:
>>>>
>>>>> IMO, this implementation in sysfs more so proves that vfio is the wrong
>>>>> place for the interface.  vfio has a use case to consume STs, but it
>>>>> doesn't produce them or own any of the mechanism by which they're
>>>>> generated.  
>>>>
>>>> I would argue only a VFIO user can consume this information so putting
>>>> it in general sysfs is overkill.
>>>>
>>>>> The weak point is whether this bin attribute, exposing an array of
>>>>> structures, fits within the socially acceptable norms of sysfs.  
>>>>
>>>> I'm pretty sure this style of sysfs is not accepted and
>>>> counter-examples are things that are ancient or slipped through.
>>>
>>> After weighing all feedback, I'd like to propose a lightweight single 
>>> text sysfs node
>>> design to resolve both your concerns, without introducing binary 
>>> binattr or masses
>>> of per-CPU files:
>>>
>>> 1. Only create a single r/w text attribute “tph_cpu_st” on Root Ports, no
>>>    hundreds of separate per-CPU sysfs files;
>>> 2. Usage flow: echo target cpu_id into the file first, then cat it to fetch
>>>    formatted human-readable ST/XST text line (cpu_id vm_st vm_xst pm_st pm_xst)
>>>    for that CPU, e.g.
>>>       echo 10 > /sys/bus/pci/devices/<BDF>/tph_cpu_st
>>>       cat /sys/bus/pci/devices/<BDF>/tph_cpu_st
>>>       and the output will like: 10 2 20 100 120
>>>    In the case of concurrent access, users can determine whether the first value
>>>    (cpu_id) is the one they expect by making a judgment.
>> 
>> That's a terrible workaround for an inherently racy design.  Thanks,
>
> This is because the DSM method is slow to invoke and involves many 
> cores. Therefore,
> this trade-off is made.
>
> I propose another lazy-cached read-only text file design:
>
> 1. Only create one read-only sysfs attribute named tph_cpu_all on Root 
> Port
>    devices, no write store hook, completely eliminate read-write race 
> risks;
> 2. Lazy initialization: All CPU steering tag entries are fetched via 
> batch
>    ACPI _DSM calls only on the first userspace read operation, cached 
> inside
>    afterwards to avoid duplicate slow firmware interactions;
> 3. Cache invalidation only happens on Root Port PCI reset events, 
> including
>    system cold boot, AER fatal error recovery, userspace manual reset 
> via sysfs,
>    downstream hotplug and root port driver re-probe.
>    No CPU hotplug notification hooks are registered to avoid extra code 
> complexity;
> 4. On every sysfs read, iterate over the cached entry list and filter 
> out
>    offline CPUs according to the current online CPU mask dynamically, so
>    only valid active CPU records are printed to userspace;
> 5. A read-write semaphore protects cached table access to support 
> concurrent
>    userspace readers safely;
> 6. Plain human-readable text output complies with sysfs text-first norms
>    Jason prefers, no structured binary binattr is introduced;
>
> Usage example:
>    cat /sys/bus/pci/devices/<root-port-bdf>/tph_cpu_all
>    the output will be: cpu_id vm_st pm_st vm_xst pm_xst
> 	0 0x00 0x00 0x0000 0x0000
> 	1 0x02 0x00 0x0005 0x0000
> 	2 0x01 0x00 0x0003 0x0000
> 	...

This is a more complicated, more bloated, less functional version of the same array of structures problem as the bin attribute, just in ascii.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v17 08/12] PCI/TPH: Add sysfs binary file to export CPU to steering-tag mapping
  2026-06-17  3:01               ` Alex Williamson
@ 2026-06-17  3:41                 ` fengchengwen
  2026-06-17  3:53                   ` Krzysztof Wilczyński
  0 siblings, 1 reply; 39+ messages in thread
From: fengchengwen @ 2026-06-17  3:41 UTC (permalink / raw)
  To: Alex Williamson, Jason Gunthorpe
  Cc: helgaas, wathsala.vithanage, wei.huang2, zhipingz, wangzhou1,
	wangyushan12, liuyonglong, kvm, linux-pci

On 6/17/2026 11:01 AM, Alex Williamson wrote:
> 
> 
> On Tue, Jun 16, 2026, at 8:33 PM, fengchengwen wrote:
>> On 6/17/2026 9:30 AM, Alex Williamson wrote:
>>> On Tue, Jun 16, 2026, at 7:18 PM, fengchengwen wrote:
>>>> Hi Jason, Alex,
>>>>
>>>> On 6/17/2026 1:27 AM, Jason Gunthorpe wrote:
>>>>> On Tue, Jun 16, 2026 at 10:57:54AM -0600, Alex Williamson wrote:
>>>>>
>>>>>> IMO, this implementation in sysfs more so proves that vfio is the wrong
>>>>>> place for the interface.  vfio has a use case to consume STs, but it
>>>>>> doesn't produce them or own any of the mechanism by which they're
>>>>>> generated.  
>>>>>
>>>>> I would argue only a VFIO user can consume this information so putting
>>>>> it in general sysfs is overkill.
>>>>>
>>>>>> The weak point is whether this bin attribute, exposing an array of
>>>>>> structures, fits within the socially acceptable norms of sysfs.  
>>>>>
>>>>> I'm pretty sure this style of sysfs is not accepted and
>>>>> counter-examples are things that are ancient or slipped through.
>>>>
>>>> After weighing all feedback, I'd like to propose a lightweight single 
>>>> text sysfs node
>>>> design to resolve both your concerns, without introducing binary 
>>>> binattr or masses
>>>> of per-CPU files:
>>>>
>>>> 1. Only create a single r/w text attribute “tph_cpu_st” on Root Ports, no
>>>>    hundreds of separate per-CPU sysfs files;
>>>> 2. Usage flow: echo target cpu_id into the file first, then cat it to fetch
>>>>    formatted human-readable ST/XST text line (cpu_id vm_st vm_xst pm_st pm_xst)
>>>>    for that CPU, e.g.
>>>>       echo 10 > /sys/bus/pci/devices/<BDF>/tph_cpu_st
>>>>       cat /sys/bus/pci/devices/<BDF>/tph_cpu_st
>>>>       and the output will like: 10 2 20 100 120
>>>>    In the case of concurrent access, users can determine whether the first value
>>>>    (cpu_id) is the one they expect by making a judgment.
>>>
>>> That's a terrible workaround for an inherently racy design.  Thanks,
>>
>> This is because the DSM method is slow to invoke and involves many 
>> cores. Therefore,
>> this trade-off is made.
>>
>> I propose another lazy-cached read-only text file design:
>>
>> 1. Only create one read-only sysfs attribute named tph_cpu_all on Root 
>> Port
>>    devices, no write store hook, completely eliminate read-write race 
>> risks;
>> 2. Lazy initialization: All CPU steering tag entries are fetched via 
>> batch
>>    ACPI _DSM calls only on the first userspace read operation, cached 
>> inside
>>    afterwards to avoid duplicate slow firmware interactions;
>> 3. Cache invalidation only happens on Root Port PCI reset events, 
>> including
>>    system cold boot, AER fatal error recovery, userspace manual reset 
>> via sysfs,
>>    downstream hotplug and root port driver re-probe.
>>    No CPU hotplug notification hooks are registered to avoid extra code 
>> complexity;
>> 4. On every sysfs read, iterate over the cached entry list and filter 
>> out
>>    offline CPUs according to the current online CPU mask dynamically, so
>>    only valid active CPU records are printed to userspace;
>> 5. A read-write semaphore protects cached table access to support 
>> concurrent
>>    userspace readers safely;
>> 6. Plain human-readable text output complies with sysfs text-first norms
>>    Jason prefers, no structured binary binattr is introduced;
>>
>> Usage example:
>>    cat /sys/bus/pci/devices/<root-port-bdf>/tph_cpu_all
>>    the output will be: cpu_id vm_st pm_st vm_xst pm_xst
>> 	0 0x00 0x00 0x0000 0x0000
>> 	1 0x02 0x00 0x0005 0x0000
>> 	2 0x01 0x00 0x0003 0x0000
>> 	...
> 
> This is a more complicated, more bloated, less functional version of the same array of structures problem as the bin attribute, just in ascii.  Thanks,

I fully get your criticism that dumping all CPU records as a complete table
is just a variant of the array export design and brings extra unnecessary
complexity.

Now there are only two viable options left for exposing CPU-to-ST mappings:
1. Add a struct bin_attribute on Root Ports to export the full array of CPU
   entries as binary blob;
2. Implement a new VFIO_DEVICE_FEATURE_TPH_CPU_ST feature for on-demand
   per-CPU queries.

Option 1 introduces a structured array binattr sysfs ABI that Jason has
explicitly stated the community no longer accepts for new additions.

For option 2, all ACPI DSM and steering tag calculation logic lives fully
within the PCI subsystem. VFIO only contains a trivial thin forwarding layer
to call the existing PCI exported helper, no platform metadata generation logic
is moved into VFIO at all. This feature acts as a native auxiliary helper for
virtualized TPH workloads.

Could you kindly reconsider option 2?

If the community still prefers option 1 after this discussion, I will fully
rework the binattr implementation and send out v18 following all your review
feedback.

Thanks

> 
> Alex


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v17 08/12] PCI/TPH: Add sysfs binary file to export CPU to steering-tag mapping
  2026-06-17  3:41                 ` fengchengwen
@ 2026-06-17  3:53                   ` Krzysztof Wilczyński
  2026-06-17  6:04                     ` fengchengwen
  0 siblings, 1 reply; 39+ messages in thread
From: Krzysztof Wilczyński @ 2026-06-17  3:53 UTC (permalink / raw)
  To: fengchengwen
  Cc: Alex Williamson, Jason Gunthorpe, helgaas, wathsala.vithanage,
	wei.huang2, zhipingz, wangzhou1, wangyushan12, liuyonglong, kvm,
	linux-pci

Hello,

[...]
> If the community still prefers option 1 after this discussion, I will fully
> rework the binattr implementation and send out v18 following all your review
> feedback.

Perhaps configfs or debugfs would be a viable alternative, if this is
something that does not have to be exposed all the time?

Just thinking out loud...

Thank you!

	Krzysztof

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v17 08/12] PCI/TPH: Add sysfs binary file to export CPU to steering-tag mapping
  2026-06-17  3:53                   ` Krzysztof Wilczyński
@ 2026-06-17  6:04                     ` fengchengwen
  0 siblings, 0 replies; 39+ messages in thread
From: fengchengwen @ 2026-06-17  6:04 UTC (permalink / raw)
  To: Krzysztof Wilczyński
  Cc: Alex Williamson, Jason Gunthorpe, helgaas, wathsala.vithanage,
	wei.huang2, zhipingz, wangzhou1, wangyushan12, liuyonglong, kvm,
	linux-pci

On 6/17/2026 11:53 AM, Krzysztof Wilczyński wrote:
> Hello,
> 
> [...]
>> If the community still prefers option 1 after this discussion, I will fully
>> rework the binattr implementation and send out v18 following all your review
>> feedback.
> 
> Perhaps configfs or debugfs would be a viable alternative, if this is
> something that does not have to exposed all the time?

For debugfs/configfs: These are not suitable for production stable ABI use.
DPDK/SPDK virtualization workloads rely on permanent, stable userspace
interfaces and cannot depend on optional debug-only filesystems.

Thanks

> 
> Just thinking out loud...
> 
> Thank you!
> 
> 	Krzysztof


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v17 08/12] PCI/TPH: Add sysfs binary file to export CPU to steering-tag mapping
  2026-06-16 16:57     ` Alex Williamson
  2026-06-16 17:27       ` Jason Gunthorpe
@ 2026-06-23  9:56       ` fengchengwen
  2026-06-26 15:22         ` Alex Williamson
  1 sibling, 1 reply; 39+ messages in thread
From: fengchengwen @ 2026-06-23  9:56 UTC (permalink / raw)
  To: Alex Williamson
  Cc: helgaas, wathsala.vithanage, wei.huang2, zhipingz, wangzhou1,
	wangyushan12, liuyonglong, kvm, linux-pci, Jason Gunthorpe

Hi Alex,

On 6/17/2026 12:57 AM, Alex Williamson wrote:
> On Tue, 16 Jun 2026 11:42:24 -0300
> Jason Gunthorpe <jgg@ziepe.ca> wrote:
> 
>> On Tue, Jun 16, 2026 at 06:46:17PM +0800, Chengwen Feng wrote:
>>> Add per-device sysfs binary attribute tph_cpu_st to expose ACPI DSM CPU
>>> to steering-tag data to userspace, resolving the concern that VFIO should
>>> not host CPU-to-ST translation interfaces.
>>>
>>> Follow PCI standard binattr framework: dynamic visible group, fixed-size
>>> 8-byte packed uapi entry, aligned offset read, root-only 0400 permission.
>>> Refactor duplicate ACPI DSM logic into shared tph_get_cpu_st_info helper.
>>>
>>> ABI: /sys/bus/pci/devices/<BDF>/tph_cpu_st  
>>
>> I'm sorry, I really dislike this :(
>>
>> Structured binary sysfs attributes are pretty much against the rules,
>> I think using sysfs at all for this interface is a bad idea.
>>
>> IMHO the VFIO version was much better.
> 
> There are some deficiencies in this implementation:
> 
>  - The ABI needs to be documented in
>    Documentation/ABI/testing/sysfs-bus-pci
> 
>  - The attribute is at the wrong place, all endpoints would just
>    replicate the root port values.  Place it at the root port.
> 
>  - Corollary, is_visible should key on whether we have CPU to ST
>    mappings (_DSM) and the root port is a TPH completer (DevCap2), not
>    the TPH capabilities of an endpoint - a userspace driver can already
>    discover the endpoint TPH requester support.
> 
>  - The 8-byte aligned read requirement should be removed, perform a
>    sub-8-byte read from the slot offset.  !cpu_possible() should be
>    filled with zeros.  This allows userspace to dump the entire bin
>    file or read only a set of fields.
> 
>  - Probably cleaner to zero-initialize the buffer rather than memset()
>    and (redundant) reserved = 0.

I’ve sent v18-resend of the PCI TPH sysfs patchset with all prior review
comments fully addressed.

I remain open to both discussed implementation options. This v18-resend
implements the sysfs binary blob approach on Root Ports as you previously
suggested.

Please kindly review v18-resend and share your thoughts.

Thanks

> 
> IMO, this implementation in sysfs more so proves that vfio is the wrong
> place for the interface.  vfio has a use case to consume STs, but it
> doesn't produce them or own any of the mechanism by which they're
> generated.  This proposal, with the above improvements, provides
> effectively an ioctl-like interface when using a properly offset and
> sized pread().
> 
> The weak point is whether this bin attribute, exposing an array of
> structures, fits within the socially acceptable norms of sysfs.  There
> is some precedent for this, for example cc_settings_bin in infiniband,
> but these might also be considered legacy.  So I don't know if this
> sort of usage is a grey area that fits social norms or if it's promoting
> legacy use cases that we don't want to repeat.  Thanks,
> 
> Alex


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v17 08/12] PCI/TPH: Add sysfs binary file to export CPU to steering-tag mapping
  2026-06-23  9:56       ` fengchengwen
@ 2026-06-26 15:22         ` Alex Williamson
  2026-06-28 11:58           ` fengchengwen
  0 siblings, 1 reply; 39+ messages in thread
From: Alex Williamson @ 2026-06-26 15:22 UTC (permalink / raw)
  To: fengchengwen
  Cc: helgaas, wathsala.vithanage, wei.huang2, zhipingz, wangzhou1,
	wangyushan12, liuyonglong, kvm, linux-pci, Jason Gunthorpe, alex

On Tue, 23 Jun 2026 17:56:51 +0800
fengchengwen <fengchengwen@huawei.com> wrote:
> 
> I’ve sent v18-resend of the PCI TPH sysfs patchset with all prior review
> comments fully addressed.
> 
> I remain open to both discussed implementation options. This v18-resend
> implements the sysfs binary blob approach on Root Ports as you previously
> suggested.

There's a problem with the pci-sysfs binary attribute beyond whether
it's acceptable from a sysfs perspective: it only solves the CPU-ID to
ST lookup.  We already know that this interface should support dma-buf
provided TPH ST and PH values from Zhiping's series.  If we take the
pci-sysfs route for CPU-ID, no subsystem exposes the value embedded
into the dma-buf.  Another solution is required.  Repeat for each
possible ST/PH source we might see in the future.

Therefore, I'm coming around to the idea that vfio-pci provides this
interface, but we need to do so in a way that it's not a wart on the
side of the interface.  I think we can do this based on an incremental
progression of features and support, where the ST can be provided
through an extensible set of objects, CPU ID, dma-buf, no-preference
(zero), and literal user-provided ST values.

The key components of the uAPI are similar to ones you've proposed:

 - An opt-in and feature flags (SET/GET): VFIO_DEVICE_FEATURE_TPH

   The TPH uAPI necessarily changes the ABI of the vfio-pci device;
   capabilities and control are virtualized.  An empty SET is the
   opt-in to the ABI change.  There are also feature flags necessary
   for uAPI surfaces not discovered through the PCI TPH capability; GET
   on this feature returns a flags field indicating supported features.
   (SET actually takes a zeroed flags field for symmetry in the vfio
   device feature API.)

 - A mechanism to program indexes in architected ST locations (SET):
   VFIO_DEVICE_FEATURE_TPH_ST

   This includes flags to define the type of source provided, such as:

    - VFIO_DEVICE_TPH_SRC_DMABUF
    - VFIO_DEVICE_TPH_SRC_NONE
    - VFIO_DEVICE_TPH_SRC_CPU_VOLATILE
    - VFIO_DEVICE_TPH_SRC_CPU_PERSISTENT
    - VFIO_DEVICE_TPH_SRC_LITERAL

   The requested namespace for the tag:

    - VFIO_DEVICE_TPH_EXTENDED

   And a failure policy, e.g. if a translation results in zero (no-pref):

    - VFIO_DEVICE_TPH_REQUIRE_ST

   This would also include a start/count and pointer to user buffer to
   allow batch settings, if the TPH subsystem can be extended to
   support these, otherwise a single u32 holds all the decodable
   sources.

 - A mechanism to decode a source (GET): VFIO_DEVICE_FEATURE_TPH_RESOLVE

   This is necessary for DS mode, where the driver may store STs in
   device specific locations, but also for dma-buf sources even in IV
   mode, where the PH needs to be decoded for the driver's use.  This
   proposal also extends the API, through progressive opt-ins, to
   support virtualization, where a VMM may (in specific scenarios)
   implement a _DSM returning host values, and write through literal ST
   values from this source, or from a guest-based userspace driver for
   a completer, where the guest kernel can source a valid ST.

   In order to prevent this from becoming the side-car/wart, returning
   ST values is limited to configurations where the resulting ST may
   actually be used within the interface.  For example, No-ST and IV
   mode would operate only with DMABUF, NONE, and CPU_* sources, the
   literal ST need not be exposed to userspace.  The RESOLVE feature
   would be limited to returning the PH value for a DMABUF source.

   The structure includes input fields of u32 flags and source, and
   output fields of u8 valid (bits declaring the validity of each of
   the following fields), u8 ph, and u16 st.  Flag bits are defined the
   same as for TPH_ST, except that NONE, LITERAL, and REQUIRE_ST are
   not applicable and are reserved.

With that rough uAPI, we can step through an incremental support path.

 1. The most basic level of support, and default sans module options,
    is No-ST mode.  After opt-in, the TPH capability is virtualized to
    report only No-ST mode supported, the remainder of the register
    zero.  Writes to ST Mode allow only 000b.  Writes to Requester
    Enable allow only valid, non-zero values.  In combination, this
    enables the device to operate only in No-ST mode.  The TPH_ST
    interface is inoperative in this mode.  The RESOLVE interface only
    accepts DMABUF sources and only provides the PH as valid.  This
    allows only the PH aspect of TPH to be used and does not risk any
    exposure or malicious use of ST values.

    NB. All modes, including this one, must also validate support for
    all reachable features through to the root port, ie. TPH Completer.

 2. Via a module option (proposed below), IV mode can be enabled where
    supported by the device.  This allows the unvirtualized interrupt
    vector mode supported bit, ST table location, and ST table size
    fields to be exposed in the capability register, and allows 001b to
    be written in the mode select control register.  Additionally, this
    brings onboard the TPH_ST feature, supporting a selection of
    source flags.  NONE and DMABUF can be assumed, based on sequencing
    of the dma-buf based TPH completer support series.  CPU sources
    rely on root port _DSM support.  GET on the noted feature flags
    above can indicate this support.  LITERAL is not enabled at this
    point, this mode only supports ST programming via objects and
    instance numbers.  RESOLVE support is identical to step 1.

    Via SET of TPH_ST, sources are decoded and ST values are written
    directly to the ST storage location.  The REQUIRE_ST flag indicates
    whether the user accepts a translation resulting in zero (no-pref)
    or if the API is to generate an error on this result.

    Support for translating CPU_* sources relies on platform support
    for _DSM, while other sources like DMABUF have no such requirement.
    Therefore, for this and following steps, the returned flags on
    FEATURE_TPH GET will include a bit indicating whether CPU_* sources
    are supported.

    There are some important considerations to resolve/document here
    between the TPH specification and Linux implementation, such as
    requiring that TPH is enabled (and, for the MSI-X vector location,
    that the programmed vector is enabled) before a TPH_ST call.  We
    need to be particularly careful of a uAPI that relies on mutable
    implementation.

 3. Again, via a module option, DS mode can be enabled where supported
    by the device.  On such a supported device (and only on such a
    supported device), the device specific mode supported flag is now
    reported unvirtualized in the capability register and the control
    register supports writes of 010b to the mode field.  The RESOLVE
    interface now supports CPU_* as a source.  RESOLVE will now return
    both the PH and ST for a DMABUF, indicated through both valid bits
    set, and the ST for a CPU ID, only the ST valid bit set.  The
    TPH_ST interface still accepts only DMABUF, CPU_*, and NONE for
    configuring architected ST storage locations.

 4. Finally, literal mode can be enabled via another progression of the
    module option.  Literal mode allows the TPH_ST feature to accept an
    actual, raw, ie. literal, ST value and also allows RESOLVE to
    return a valid ST even for devices only supporting IV mode.  This
    is intended to support VM use cases, such as a VMM configured with
    1:1 vCPU:pCPU mappings that exposes a _DSM to the guest with host
    CPU ST values, or if a completer exists in the guest that provides
    ST values through a dmabuf.  This is the only mode that allows
    writing unchecked ST values to architected ST storage locations.
    The user discovers support for this mode via a LITERAL flag
    returned from GET on the TPH feature.

The proposed module option is therefore a progression where the user is
granted increasing privilege to make use of STs:

vfio_pci_core.tph=:
	0: No-ST Mode (default)
	1: + IV Mode
	2: + DS Mode
	3: + LITERAL
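
As a minimal sketch of that progression, the accepted source flags per
level could be derived as below.  The enum and function names are
hypothetical, the flag values are copied from this proposal (they are
not in any released header), and the per-device capability checks are
deliberately left out:

```c
#include <stdbool.h>
#include <stdint.h>

/* Flag values copied from this proposal; not in any released header. */
#define VFIO_DEVICE_TPH_SRC_DMABUF          (1u << 0)
#define VFIO_DEVICE_TPH_SRC_NONE            (1u << 1)
#define VFIO_DEVICE_TPH_SRC_CPU_VOLATILE    (1u << 3)
#define VFIO_DEVICE_TPH_SRC_CPU_PERSISTENT  (1u << 4)
#define VFIO_DEVICE_TPH_SRC_LITERAL         (1u << 6)

/* Hypothetical names for the vfio_pci_core.tph= values. */
enum tph_level {
	TPH_LEVEL_NO_ST = 0,
	TPH_LEVEL_IV = 1,
	TPH_LEVEL_DS = 2,
	TPH_LEVEL_LITERAL = 3,
};

/* Sources a TPH_ST (SET) call would accept at @level; 0 means inoperative. */
static uint32_t tph_st_sources(enum tph_level level, bool cpu_dsm)
{
	uint32_t mask;

	if (level < TPH_LEVEL_IV)
		return 0;	/* TPH_ST is inoperative in No-ST mode */

	mask = VFIO_DEVICE_TPH_SRC_DMABUF | VFIO_DEVICE_TPH_SRC_NONE;
	if (cpu_dsm)		/* CPU sources rely on root port _DSM */
		mask |= VFIO_DEVICE_TPH_SRC_CPU_VOLATILE |
			VFIO_DEVICE_TPH_SRC_CPU_PERSISTENT;
	if (level >= TPH_LEVEL_LITERAL)
		mask |= VFIO_DEVICE_TPH_SRC_LITERAL;
	return mask;
}

/* Sources a RESOLVE (GET) call would accept at @level. */
static uint32_t tph_resolve_sources(enum tph_level level, bool cpu_dsm)
{
	uint32_t mask = VFIO_DEVICE_TPH_SRC_DMABUF;

	if (level >= TPH_LEVEL_DS && cpu_dsm)
		mask |= VFIO_DEVICE_TPH_SRC_CPU_VOLATILE |
			VFIO_DEVICE_TPH_SRC_CPU_PERSISTENT;
	return mask;
}
```

NONE and LITERAL never appear in the RESOLVE mask because both are
reserved for that feature.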

It should be noted that even in IV mode, the user can R/W the ST value
via mmap covering the MSI-X vector table and could read the value in
the capability storage location unless we explicitly prevent it.
Removing the latter is trivial, removing the former costs reverting
previous decisions to allow mmap of the MSI-X vector table for
performance, esp. in larger PAGE_SIZE systems.  If this is an issue,
(tph >= 1) would need to reintroduce sparse mmap for MSI-X, but higher
tph values already allow for increasingly direct ST programming.

In general, while the proposal here presents progressively increasing
access and direct use of raw ST values, this access is not considered a
security issue.  TPH only allows addressing specific caching
structures, it does not present a correctness issue.  Abusive use of ST
values is at best a QoS issue, where access to a device capable of
these features imposes some inherent risk of exposure.

The feature ioctls therefore take this sort of form:

VFIO_DEVICE_FEATURE_TPH (SET/GET):

    struct vfio_device_feature_tph {
            __u32       flags;
    };

    @flags is reserved (0) on SET, returns VFIO_DEVICE_TPH_CAP_* on GET:

    #define VFIO_DEVICE_TPH_CAP_CPU             (1 << 0) /* CPU sources resolvable (_DSM) */
    #define VFIO_DEVICE_TPH_CAP_LITERAL         (1 << 1) /* LITERAL source available */

VFIO_DEVICE_FEATURE_TPH_RESOLVE (GET):

    struct vfio_device_feature_tph_resolve {
            __u32       flags;          /* IN: VFIO_DEVICE_TPH_* source + namespace */
            __u32       src;            /* IN: CPU id or dma-buf fd, per @flags */
            __u8        valid;          /* OUT: VFIO_DEVICE_TPH_VALID_* */
    #define VFIO_DEVICE_TPH_VALID_PH    (1 << 0) /* @ph holds a processing hint */
    #define VFIO_DEVICE_TPH_VALID_ST    (1 << 1) /* @st holds a non-zero steering tag */
            __u8        ph;             /* OUT: processing hint */
            __u16       st;             /* OUT: raw ST (DS presented or LITERAL) */
    };

    @flags is common for both RESOLVE and the following TPH_ST feature,
    bits are allocated according to their order of introduction.  Some
    bits are only applicable to the TPH_ST interface and are reserved
    for RESOLVE:

    #define VFIO_DEVICE_TPH_SRC_DMABUF          (1 << 0) /* dma-buf fd */
    #define VFIO_DEVICE_TPH_SRC_NONE            (1 << 1) /* write ST 0 (TPH_ST only) */
    #define VFIO_DEVICE_TPH_REQUIRE_ST          (1 << 2) /* modifier (TPH_ST): stop on
                                                            resolved ST 0 */
    #define VFIO_DEVICE_TPH_SRC_CPU_VOLATILE    (1 << 3) /* CPU id, volatile memory */
    #define VFIO_DEVICE_TPH_SRC_CPU_PERSISTENT  (1 << 4) /* CPU id, persistent memory */
    #define VFIO_DEVICE_TPH_EXTENDED            (1 << 5) /* modifier: Extended ST
                                                            namespace */
    #define VFIO_DEVICE_TPH_SRC_LITERAL         (1 << 6) /* literal ST value
                                                            (TPH_ST only) */

    The source type must be uniquely defined, therefore only one bit in
    the source mask can be set:

    #define VFIO_DEVICE_TPH_SRC_MASK \
            (VFIO_DEVICE_TPH_SRC_DMABUF | \
             VFIO_DEVICE_TPH_SRC_NONE | \
             VFIO_DEVICE_TPH_SRC_CPU_VOLATILE | \
             VFIO_DEVICE_TPH_SRC_CPU_PERSISTENT | \
             VFIO_DEVICE_TPH_SRC_LITERAL)
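
    A sketch of how that uniqueness rule could be checked, assuming the
    flag values listed here; the helper name is hypothetical:

```c
#include <stdbool.h>
#include <stdint.h>

/* Flag values copied from this proposal; not in any released header. */
#define VFIO_DEVICE_TPH_SRC_DMABUF          (1u << 0)
#define VFIO_DEVICE_TPH_SRC_NONE            (1u << 1)
#define VFIO_DEVICE_TPH_REQUIRE_ST          (1u << 2)
#define VFIO_DEVICE_TPH_SRC_CPU_VOLATILE    (1u << 3)
#define VFIO_DEVICE_TPH_SRC_CPU_PERSISTENT  (1u << 4)
#define VFIO_DEVICE_TPH_EXTENDED            (1u << 5)
#define VFIO_DEVICE_TPH_SRC_LITERAL         (1u << 6)

#define VFIO_DEVICE_TPH_SRC_MASK \
	(VFIO_DEVICE_TPH_SRC_DMABUF | \
	 VFIO_DEVICE_TPH_SRC_NONE | \
	 VFIO_DEVICE_TPH_SRC_CPU_VOLATILE | \
	 VFIO_DEVICE_TPH_SRC_CPU_PERSISTENT | \
	 VFIO_DEVICE_TPH_SRC_LITERAL)

/* True when exactly one source bit is set; modifier bits are ignored. */
static bool tph_flags_one_source(uint32_t flags)
{
	uint32_t src = flags & VFIO_DEVICE_TPH_SRC_MASK;

	/* A non-zero value with a single bit set is a power of two. */
	return src != 0 && (src & (src - 1)) == 0;
}
```

    The modifiers (REQUIRE_ST, EXTENDED) sit outside the source mask, so
    they can be combined freely with any single source.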

VFIO_DEVICE_FEATURE_TPH_ST (SET):

    struct vfio_device_feature_tph_st {
            __u32               flags;  /* IN: VFIO_DEVICE_TPH_* source + namespace */
            __u16               start;  /* IN: first ST table index
                                               (IV: interrupt vector) */
            __u16               count;  /* IN: number of contiguous entries */
            __aligned_u64       dests;  /* IN: @count __u32 (CPU/DMABUF/LITERAL;
                                               unused for NONE) */
    };

    Batching here is intended to shorten the TPH disabled/quiescent
    window, but this is not currently supported by the TPH subsystem.
    This could degrade to a single inline entry if necessary.

    Batching also introduces a risk of partial success, where some
    sources may resolve, while others generate an error or are promoted
    to an error based on the REQUIRE_ST modifier.  The proposed
    solution for this is to allow vfio device features to return values
    other than 0 or -errno, as currently described.  This interface
    would adopt a solution similar to KVM_SET_MSRS, where the ioctl
    returns the number of successfully programmed entries.  Therefore
    if (ret == count) the full set was successful, otherwise the user
    knows exactly which entry failed.
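
    The partial-success return could look roughly like the sketch below.
    The resolver callback, the stand-in resolver, and the demo helper are
    all hypothetical; the sketch also ignores @start and writes into a
    plain array rather than real ST storage:

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-entry resolver: 0 and *st filled on success, -errno on failure. */
typedef int (*tph_resolve_fn)(uint32_t dest, uint16_t *st);

/* Returns the number of entries programmed before the first failure. */
static int tph_st_set_batch(const uint32_t *dests, uint16_t count,
			    bool require_st, tph_resolve_fn resolve,
			    uint16_t *table)
{
	uint16_t i;

	for (i = 0; i < count; i++) {
		uint16_t st;

		if (resolve(dests[i], &st))
			break;		/* source failed to resolve */
		if (require_st && st == 0)
			break;		/* no-pref promoted to an error */
		table[i] = st;
	}
	return i;
}

/* Stand-in resolver for illustration: the low 16 bits of @dest are the ST. */
static int demo_resolve(uint32_t dest, uint16_t *st)
{
	if (dest > UINT16_MAX)
		return -EINVAL;
	*st = (uint16_t)dest;
	return 0;
}

/* Demo: the second entry resolves to ST 0, which REQUIRE_ST rejects. */
static int demo_run(bool require_st)
{
	const uint32_t dests[] = { 5, 0, 7 };
	uint16_t table[3] = { 0 };

	return tph_st_set_batch(dests, 3, require_st, demo_resolve, table);
}
```

    A caller compares the return value against count; any smaller value
    is the index of the first entry that was not programmed.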

There are various intricate details involved in validating
configurations where TPH can be enabled, and I'm glossing over the uAPI
dependencies on implementation versus specification, but I think this
provides a good basis for exposing TPH support on the device.  I'd
strongly recommend the incremental enablement stages proposed here,
even if we stop before LITERAL support based on current use cases.
Please comment if there are gaps or anything is unclear.  Thanks,

Alex


* Re: [PATCH v17 08/12] PCI/TPH: Add sysfs binary file to export CPU to steering-tag mapping
  2026-06-26 15:22         ` Alex Williamson
@ 2026-06-28 11:58           ` fengchengwen
  2026-06-29 15:43             ` Alex Williamson
  0 siblings, 1 reply; 39+ messages in thread
From: fengchengwen @ 2026-06-28 11:58 UTC (permalink / raw)
  To: Alex Williamson
  Cc: helgaas, wathsala.vithanage, wei.huang2, zhipingz, wangzhou1,
	wangyushan12, liuyonglong, kvm, linux-pci, Jason Gunthorpe

Hi Alex,

Resending this message because the previous version was sent in HTML format.

On 2026/6/26 23:22, Alex Williamson wrote:
> On Tue, 23 Jun 2026 17:56:51 +0800
> fengchengwen <fengchengwen@huawei.com> wrote:
>> I’ve sent v18-resend of the PCI TPH sysfs patchset with all prior review
>> comments fully addressed.
>>
>> I remain open to both discussed implementation options. This v18-resend
>> implements the sysfs binary blob approach on Root Ports as you previously
>> suggested.
> There's a problem with the pci-sysfs binary attribute beyond whether
> it's acceptable from a sysfs perspective: it only solves the CPU-ID to
> ST lookup.  We already know that this interface should support dma-buf
> provided TPH ST and PH values from Zhiping's series.  If we take the
> pci-sysfs route for CPU-ID, no subsystem exposes the value embedded
> into the dma-buf.  Another solution is required.  Repeat for each
> possible ST/PH source we might see in the future.
>
> Therefore, I'm coming around to the idea that vfio-pci provides this
> interface, but we need to do so in a way that it's not a wart on the
> side of the interface.  I think we can do this based on an incremental
> progression of features and support, where the ST can be provided
> through an extensible set of objects, CPU ID, dma-buf, no-preference
> (zero), and literal user-provided ST values.
>
> The key components of the uAPI are similar to ones you've proposed:
>
>   - An opt-in/feature flags (SET/GET): VFIO_DEVICE_FEATURE_TPH
>
>     The TPH uAPI necessarily changes the ABI of the vfio-pci device;
>     capabilities and control are virtualized.  An empty SET is the
>     opt-in to the ABI change.  There are also feature flags necessary
>     for uAPI surfaces not discovered through the PCI TPH capability, GET
>     on this feature returns a flags field indicating supported features.
>     (SET actually takes a zeroed flags field for symmetry in the vfio
>     device feature API).
>
>   - A mechanism to program indexes in architected ST locations (SET):
>     VFIO_DEVICE_FEATURE_TPH_ST
>
>     This includes flags to define the type of source provided, such as:
>
>      - VFIO_DEVICE_TPH_SRC_DMABUF
>      - VFIO_DEVICE_TPH_SRC_NONE
>      - VFIO_DEVICE_TPH_SRC_CPU_VOLATILE
>      - VFIO_DEVICE_TPH_SRC_CPU_PERSISTENT
>      - VFIO_DEVICE_TPH_SRC_LITERAL
>
>     The requested namespace for the tag:
>
>      - VFIO_DEVICE_TPH_EXTENDED
>
>     And a failure policy, ex. if a translation results in zero, no-pref:
>
>      - VFIO_DEVICE_TPH_REQUIRE_ST
>
>     This would also include a start/count and pointer to user buffer to
>     allow batch settings, if the TPH subsystem can be extended to
>     support these, otherwise a single u32 holds all the decodable
>     sources.
>
>   - A mechanism to decode a source (GET): VFIO_DEVICE_FEATURE_TPH_RESOLVE
>
>     This is necessary for DS mode, where the driver may store STs in
>     device specific locations, but also for dma-buf sources even in IV
>     mode, where the PH needs to be decoded for the driver's use.  This
>     proposal also extends the API, through progressive opt-ins, to
>     support virtualization, where a VMM may (in specific scenarios)
>     implement a _DSM returning host values, and write through literal ST
>     values from this source, or from a guest-based userspace driver for
>     a completer, where the guest kernel can source a valid ST.
>
>     In order to prevent this from becoming the side-car/wart, returning
>     ST values is limited to configurations where the resulting ST may
>     actually be used within the interface.  For example, No-ST and IV
>     mode would operate only with DMABUF, NONE, and CPU_* sources, the
>     literal ST need not be exposed to userspace.  The RESOLVE feature
>     would be limited to returning the PH value for a DMABUF source.
>
>     The structure includes input fields of u32 flags and source, output
>     fields of u8 valid (bits declaring each of the next fields
>     validity), u8 ph, u16 st.  Flag bits are defined the same as for
>     TPH_ST, modulo NONE, LITERAL, REQUIRE_ST are not applicable and
>     reserved.
>
> With that rough uAPI, we can step through an incremental support path.
>
>   1. The most basic level of support, and default sans module options,
>      is No-ST mode.  After opt-in, the TPH capability is virtualized to
>      report only No-ST mode supported, the remainder of the register
>      zero.  Writes to ST Mode allow only 000b.  Writes to Requester
>      Enable allow only valid, non-zero values.  In combination, this
>      enables the device to operate only in No-ST mode.  The TPH_ST
>      interface is inoperative in this mode.  The RESOLVE interface only
>      accepts DMABUF sources and only provides the PH as valid.  This
>      allows only the PH aspect of TPH to be used and does not risk any
>      exposure or malicious use of ST values.
>
>      NB. All modes, including this one, must also validate support for
>      all reachable features through to the root port, ie. TPH Completer.
>
>   2. Via a module option (proposed below), IV mode can be enabled where
>      supported by the device.  This allows the unvirtualized interrupt
>      vector mode supported bit, ST table location, and ST table size
>      fields to be exposed in the capability register, and allows 001b to
>      be written in the mode select control register.  Additionally, this
>      brings onboard the TPH_ST feature, supporting a selection of
>      source flags.  NONE and DMABUF can be assumed, based on sequencing
>      of the dma-buf based TPH completer support series.  CPU sources
>      rely on root port _DSM support.  GET on the noted feature flags
>      above can indicate this support.  LITERAL is not enabled at this
>      point, this mode only supports ST programming via objects and
>      instance numbers.  RESOLVE support is identical to step 1.
>
>      Via SET of TPH_ST, sources are decoded and ST values are written
>      directly to the ST storage location.  The REQUIRE_ST flag indicates
>      whether the user accepts a translation resulting in zero (no-pref)
>      or if the API is to generate an error on this result.
>
>      Support for translating CPU_* sources relies on platform support
>      for _DSM, while other sources like DMABUF have no such requirement.
>      Therefore, for this and following steps, the returned flags on
>      FEATURE_TPH GET will include a bit indicating whether CPU_* sources
>      are supported.
>
>      There are some important considerations to resolve/document here
>      between the TPH specification and Linux implementation, such as
>      requiring TPH is enabled (and for MSI-X vector location, the
>      programmed vector is enabled), before the TPH_ST.  We need to be
>      particularly careful of a uAPI that relies on mutable
>      implementation.
>
>   3. Again, via a module option, DS mode can be enabled where supported
>      by the device.  On such a supported device (and only on such a
>      supported device), the device specific mode supported flag is now
>      reported unvirtualized in the capability register and the control
>      register supports writes of 010b to the mode field.  The RESOLVE
>      interface now supports CPU_* as a source.  RESOLVE will now return
>      both the PH and ST for a DMABUF, indicated through both valid bits
>      set, and the ST for a CPU ID, only the ST valid bit set.  The
>      TPH_ST interface still accepts only DMABUF, CPUID, and NONE for
>      configuring architected ST storage locations.
>
>   4. Finally, literal mode can be enabled via another progression of the
>      module option.  Literal mode allows the TPH_ST feature to accept an
>      actual, raw, ie. literal, ST value and also allows RESOLVE to
>      return a valid ST even for devices only supporting IV mode.  This
>      is intended to support VM use cases, such as a VMM configured with
>      1:1 vCPU:pCPU mappings that exposes a _DSM to the guest with host
>      CPU ST values, or if a completer exists in the guest that provides
>      ST values through a dmabuf.  This is the only mode that allows
>      writing unchecked ST values to architected ST storage locations.
>      The user discovers support for this mode via a LITERAL flag
>      returned from GET on the TPH feature.
>
> The proposed module option is therefore a progression where the user is
> granted increasing privilege to make use of STs:
>
> vfio_pci_core.tph=:
> 	0: No-ST Mode (default)
> 	1: + IV Mode
> 	2: + DS Mode
> 	3: + LITERAL
>
> It should be noted that even in IV mode, the user can R/W the ST value
> via mmap covering the MSI-X vector table and could read the value in
> the capability storage location unless we explicitly prevent it.
> Removing the latter is trivial, removing the former costs reverting
> previous decisions to allow mmap of the MSI-X vector table for
> performance, esp. in larger PAGE_SIZE systems.  If this is an issue,
> (tph >= 1) would need to reintroduce sparse mmap for MSI-X, but higher
> tph values already allow for increasingly direct ST programming.
>
> In general, while the proposal here presents progressively increasing
> access and direct use of raw ST values, this access is not considered a
> security issue.  TPH only allows addressing specific caching
> structures, it does not present a correctness issue.  Abusive use of ST
> values is at best a QoS issue, where access to a device capable of
> these features imposes some inherent risk of exposure.
>
> The feature ioctls therefore take this sort of form:
>
> VFIO_DEVICE_FEATURE_TPH (SET/GET):
>
>      struct vfio_device_feature_tph {
>              __u32       flags;
>      };
>
>      @flags is reserved (0) on SET, returns VFIO_DEVICE_TPH_CAP_* on GET:
>
>      #define VFIO_DEVICE_TPH_CAP_CPU             (1 << 0) /* CPU sources resolvable (_DSM) */
>      #define VFIO_DEVICE_TPH_CAP_LITERAL         (1 << 1) /* LITERAL source available */
>
> VFIO_DEVICE_FEATURE_TPH_RESOLVE (GET):
>
>      struct vfio_device_feature_tph_resolve {
>              __u32       flags;          /* IN: VFIO_DEVICE_TPH_* source + namespace */
>              __u32       src;            /* IN: CPU id or dma-buf fd, per @flags */
>              __u8        valid;          /* OUT: VFIO_DEVICE_TPH_VALID_* */
>      #define VFIO_DEVICE_TPH_VALID_PH    (1 << 0) /* @ph holds a processing hint */
>      #define VFIO_DEVICE_TPH_VALID_ST    (1 << 1) /* @st holds a non-zero steering tag */
>              __u8        ph;             /* OUT: processing hint */
>              __u16       st;             /* OUT: raw ST (DS presented or LITERAL) */
>      };
>
>      @flags is common for both RESOLVE and the following TPH_ST feature,
>      bits are allocated according to their order of introduction.  Some
>      bits are only applicable to the TPH_ST interface and are reserved
>      for RESOLVE:
>
>      #define VFIO_DEVICE_TPH_SRC_DMABUF          (1 << 0) /* dma-buf fd */
>      #define VFIO_DEVICE_TPH_SRC_NONE            (1 << 1) /* write ST 0 (TPH_ST only) */
>      #define VFIO_DEVICE_TPH_REQUIRE_ST          (1 << 2) /* modifier (TPH_ST): stop on
>                                                              resolved ST 0 */
>      #define VFIO_DEVICE_TPH_SRC_CPU_VOLATILE    (1 << 3) /* CPU id, volatile memory */
>      #define VFIO_DEVICE_TPH_SRC_CPU_PERSISTENT  (1 << 4) /* CPU id, persistent memory */
>      #define VFIO_DEVICE_TPH_EXTENDED            (1 << 5) /* modifier: Extended ST
>                                                              namespace */
>      #define VFIO_DEVICE_TPH_SRC_LITERAL         (1 << 6) /* literal ST value
>                                                              (TPH_ST only) */
>
>      The source type must be uniquely defined, therefore only one bit in
>      the source mask can be set:
>
>      #define VFIO_DEVICE_TPH_SRC_MASK \
>              (VFIO_DEVICE_TPH_SRC_DMABUF | \
>               VFIO_DEVICE_TPH_SRC_NONE | \
>               VFIO_DEVICE_TPH_SRC_CPU_VOLATILE | \
>               VFIO_DEVICE_TPH_SRC_CPU_PERSISTENT | \
>               VFIO_DEVICE_TPH_SRC_LITERAL)
>
> VFIO_DEVICE_FEATURE_TPH_ST (SET):
>
>      struct vfio_device_feature_tph_st {
>              __u32               flags;  /* IN: VFIO_DEVICE_TPH_* source + namespace */
>              __u16               start;  /* IN: first ST table index
>                                                 (IV: interrupt vector) */
>              __u16               count;  /* IN: number of contiguous entries */
>              __aligned_u64       dests;  /* IN: @count __u32 (CPU/DMABUF/LITERAL;
>                                                 unused for NONE) */
>      };
>
>      Batching here is intended to shorten the TPH disabled/quiescent
>      window, but this is not currently supported by the TPH subsystem.
>      This could degrade to a single inline entry if necessary.
>
>      Batching also introduces a risk of partial success, where some
>      sources may resolve, while others generate an error or are promoted
>      to an error based on the REQUIRE_ST modifier.  The proposed
>      solution for this is to allow vfio device features to return values
>      other than 0 or -errno, as currently described.  This interface
>      would adopt a solution similar to KVM_SET_MSRS, where the ioctl
>      returns the number of successfully programmed entries.  Therefore
>      if (ret == count) the full set was successful, otherwise the user
>      knows exactly which entry failed.
>
> There are various intricate details involved in validating
> configurations where TPH can be enabled, and I'm glossing over the uAPI
> dependencies on implementation versus specification, but I think this
> provides a good basis for exposing TPH support on the device.  I'd
> strongly recommend the incremental enablement stages proposed here,
> even if we stop before LITERAL support based on current use cases.
> Please comment if there are gaps or anything is unclear.  Thanks,

Thanks a lot for your extremely detailed and professional design breakdown
for the unified VFIO TPH uAPI framework. I’ve fully gone through all your
design points and aligned my implementation plan accordingly. I have several
key implementation questions to confirm with you as below:

1. Plan for dma-buf TPH metadata storage
   I plan to add the following TPH-related fields into struct vfio_pci_dma_buf
   in my preparatory patch series, which can be fully reused after Zhiping’s
   dma-buf TPH patches land upstream:
       u16 tph_st_ext;
       u8  tph_st;
       u8  revoked:1;
       u8  tph_st_valid:1;
       u8  tph_st_ext_valid:1;
       u8  tph_ph:2;
       u8  tph_ph_valid:1;
   The tph_ph_valid bit is newly added to track whether a valid PH value is
   bound to the dma-buf. Is this field layout and validity flag design
   acceptable?

2. Validation rule for VFIO_DEVICE_TPH_EXTENDED flag mismatches
   The VFIO_DEVICE_TPH_EXTENDED modifier you defined is an excellent design,
   letting users select either 8-bit base ST or 16-bit extended ST when
   hardware supports both variants.
   But a mismatch risk exists: users may set ST entries via TPH_ST with
   EXTENDED, then later enable TPH requester in pure 8-bit mode only, causing
   inconsistency between shadow config and active hardware mode.

   My proposed solution: maintain two separate shadow ST tables inside VFIO,
   one for base 8-bit ST and one for extended 16-bit ST. When enabling TPH
   requester mode, activate the shadow table matching the selected ST width.
   For devices only supporting 8-bit ST, directly reject EXTENDED flag in all
   TPH_ST ioctl calls.

   Should we enforce strict cross-check between EXTENDED flag used during ST
   programming and the final active requester ST width during enablement? If
   yes, is the dual shadow table approach reasonable?

3. Virtualization logic for TPH requester enable bits with heterogeneous
   completer capabilities
   Two complex real hardware topologies need proper handling:
   - Case 1: Single device with multiple queues routing TLPs to host memory
     and P2P peer memory via dma-buf flow; root port and P2P TPH completer
     capabilities may differ.
   - Case 2: Root port has no TPH completer support, while endpoint and P2P
     peers fully support TPH completer.

   I’m confused about how to virtualize the device’s TPH requester control
   bits.
   My tentative idea: take the minimum supported capability between endpoint
   and host root port. If root port lacks TPH completer, block TPH requester
   enable entirely.

   Is this the correct approach to handle heterogeneous completer capability
   across different traffic paths?

Looking forward to your feedback.

Thanks,
Chengwen Feng


>
> Alex



* Re: [PATCH v17 08/12] PCI/TPH: Add sysfs binary file to export CPU to steering-tag mapping
  2026-06-28 11:58           ` fengchengwen
@ 2026-06-29 15:43             ` Alex Williamson
  0 siblings, 0 replies; 39+ messages in thread
From: Alex Williamson @ 2026-06-29 15:43 UTC (permalink / raw)
  To: fengchengwen
  Cc: helgaas, wathsala.vithanage, wei.huang2, zhipingz, wangzhou1,
	wangyushan12, liuyonglong, kvm, linux-pci, Jason Gunthorpe, alex

On Sun, 28 Jun 2026 19:58:09 +0800
fengchengwen <fengchengwen@huawei.com> wrote:
> 
> Thanks a lot for your extremely detailed and professional design breakdown
> for the unified VFIO TPH uAPI framework. I’ve fully gone through all your
> design points and aligned my implementation plan accordingly. I have several
> key implementation questions to confirm with you as below:
> 
> 1. Plan for dma-buf TPH metadata storage
>    I plan to add the following TPH-related fields into struct vfio_pci_dma_buf
>    in my preparatory patch series, which can be fully reused after Zhiping’s
>    dma-buf TPH patches land upstream:
>        u16 tph_st_ext;
>        u8  tph_st;
>        u8  revoked:1;
>        u8  tph_st_valid:1;
>        u8  tph_st_ext_valid:1;
>        u8  tph_ph:2;
>        u8  tph_ph_valid:1;
>    The tph_ph_valid bit is newly added to track whether a valid PH value is
>    bound to the dma-buf. Is this field layout and validity flag design
>    acceptable?

In Zhiping's design, the PH completer validity is bound to the ST
validity.  In my proposal, the user makes a request relative to the
namespace, EXTENDED set = 16-bit, clear = 8-bit.  Internally we do a
.get_tph on the dmabuf based on the requested namespace and get back
success or failure.  On success, the full PH + ST is provided to the
user when running with DS or LITERAL capability available, otherwise the
ST is withheld and only the PH is provided.  I don't see a need to track
the PH validity separately.
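
Roughly, the output side of RESOLVE for a dmabuf source could be filled
in as sketched below, assuming the .get_tph call has already succeeded.
The struct and helper names are hypothetical; the valid bits are the
ones from my proposal:

```c
#include <stdbool.h>
#include <stdint.h>

/* Valid bits copied from the proposal; not in any released header. */
#define VFIO_DEVICE_TPH_VALID_PH    (1u << 0)
#define VFIO_DEVICE_TPH_VALID_ST    (1u << 1)

/* Output half of the proposed struct vfio_device_feature_tph_resolve. */
struct tph_resolve_out {
	uint8_t  valid;
	uint8_t  ph;
	uint16_t st;
};

/*
 * @st_exposed is true only when DS or LITERAL capability is available;
 * otherwise the ST is withheld and only the PH is reported.
 */
static struct tph_resolve_out tph_resolve_fill(uint8_t ph, uint16_t st,
					       bool st_exposed)
{
	struct tph_resolve_out out = {
		.valid = VFIO_DEVICE_TPH_VALID_PH,
		.ph = ph & 0x3,		/* PH is a 2-bit field */
	};

	if (st_exposed && st) {		/* VALID_ST implies a non-zero ST */
		out.valid |= VFIO_DEVICE_TPH_VALID_ST;
		out.st = st;
	}
	return out;
}
```

The PH valid bit follows directly from the lookup succeeding, which is
why no separate tph_ph_valid state is needed in this sketch.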

> 2. Validation rule for VFIO_DEVICE_TPH_EXTENDED flag mismatches
>    The VFIO_DEVICE_TPH_EXTENDED modifier you defined is an excellent design,
>    letting users select either 8-bit base ST or 16-bit extended ST when
>    hardware supports both variants.
>    But a mismatch risk exists: users may set ST entries via TPH_ST with
>    EXTENDED, then later enable TPH requester in pure 8-bit mode only, causing
>    inconsistency between shadow config and active hardware mode.
> 
>    My proposed solution: maintain two separate shadow ST tables inside VFIO,
>    one for base 8-bit ST and one for extended 16-bit ST. When enabling TPH
>    requester mode, activate the shadow table matching the selected ST width.
>    For devices only supporting 8-bit ST, directly reject EXTENDED flag in all
>    TPH_ST ioctl calls.
> 
>    Should we enforce strict cross-check between EXTENDED flag used during ST
>    programming and the final active requester ST width during enablement? If
>    yes, is the dual shadow table approach reasonable?

We've abandoned the apply-at-enable-time approach in this proposal; TPH
must first be enabled in device config space.  There is also no
buffering of user values; they're written straight through to hardware.
If the user has enabled only 8-bit mode, then a TPH_ST with the
EXTENDED flag set should generate an error.  Likewise, if the user
calls TPH_ST while Requester Enable is 00b, this generates an error
regardless of the namespace.
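
A sketch of that gating, keyed on the Requester Enable field as written
by the user.  The macro and function names are hypothetical, and the
choice of -EINVAL is only an assumption, the errno is not settled here:

```c
#include <errno.h>
#include <stdint.h>

/* Flag value copied from the proposal; not in any released header. */
#define VFIO_DEVICE_TPH_EXTENDED  (1u << 5)

/* TPH Requester Enable encodings (local names for this sketch). */
#define TPH_REQ_DISABLE   0x0	/* 00b: requester disabled */
#define TPH_REQ_TPH_ONLY  0x1	/* 01b: 8-bit ST only */
#define TPH_REQ_EXT_TPH   0x3	/* 11b: 8-bit and extended ST */

/* Returns 0 if a TPH_ST call with @flags is acceptable, else -errno. */
static int tph_st_check_enable(uint8_t req_en, uint32_t flags)
{
	if (req_en == TPH_REQ_DISABLE)
		return -EINVAL;	/* TPH must be enabled first */

	if ((flags & VFIO_DEVICE_TPH_EXTENDED) && req_en != TPH_REQ_EXT_TPH)
		return -EINVAL;	/* EXTENDED requested in 8-bit mode */

	return 0;
}
```

With Requester Enable at 11b both namespaces pass the check, so no
per-namespace shadow state is required.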
 
> 3. Virtualization logic for TPH requester enable bits with heterogeneous
>    completer capabilities
>    Two complex real hardware topologies need proper handling:
>    - Case 1: Single device with multiple queues routing TLPs to host memory
>      and P2P peer memory via dma-buf flow; root port and P2P TPH completer
>      capabilities may differ.
>    - Case 2: Root port has no TPH completer support, while endpoint and P2P
>      peers fully support TPH completer.
> 
>    I’m confused about how to virtualize the device’s TPH requester control
>    bits.
>    My tentative idea: take the minimum supported capability between endpoint
>    and host root port. If root port lacks TPH completer, block TPH requester
>    enable entirely.
> 
>    Is this the correct approach to handle heterogeneous completer capability
>    across different traffic paths?

In case 1, this is why it doesn't work to allow the user to buffer per
namespace STs to be applied based on the value written to Requester
Enable.  Register value 11b allows the requester to operate in both
namespaces simultaneously.  The only governance we can provide is to
disallow EXTENDED STs to be written when Requester Enable is 01b.

The peer completer's capability is provided through the dmabuf.  The
user can ask for the requester's preferred namespace, use the alternate
if available, or fail if there's no compatible namespace available,
which includes no .get_tph support.
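
The preferred/alternate/fail selection described above could be
sketched as follows.  This is standalone C under stated assumptions:
the namespace bit names, the bitmask representation of what the peer
reports, and the -EOPNOTSUPP return are all hypothetical, and a mask of
0 stands in for a dmabuf exporter with no .get_tph support.

```c
#include <errno.h>

/* Hypothetical namespace identifiers, one bit per ST namespace. */
#define SKETCH_TPH_NS_ST8  (1u << 0) /* 8-bit steering tags */
#define SKETCH_TPH_NS_EXT  (1u << 1) /* extended steering tags */

/*
 * Pick a namespace the peer completer can serve.
 * peer_ns_mask: namespaces the peer reports (0 if it has no .get_tph).
 * preferred:    the requester's preferred namespace bit.
 * alternate:    a fallback namespace bit, or 0 if there is none.
 * Returns the chosen namespace bit, or a negative errno on failure.
 */
static int sketch_tph_pick_namespace(unsigned int peer_ns_mask,
				     unsigned int preferred,
				     unsigned int alternate)
{
	if (peer_ns_mask & preferred)
		return (int)preferred;

	if (alternate && (peer_ns_mask & alternate))
		return (int)alternate;

	/* No compatible namespace, including the "no .get_tph" case. */
	return -EOPNOTSUPP;
}
```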

In case 2, we're gated by the Linux TPH implementation and carry it
through to the uAPI.  The overall TPH feature opt-in needs to depend on
both TPH support in the requester (the user's device) AND TPH completer
support at the root port (unless the requester itself is a RCiEP).  I
had missed elaborating on this requirement in my write-up.
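
The opt-in condition above, expressed as a small predicate.  This is an
illustrative standalone sketch, not the series' implementation: the
struct and its field names are hypothetical stand-ins for whatever the
PCI core would actually report about the device and its root port.

```c
#include <stdbool.h>

/* Hypothetical summary of the facts the opt-in decision depends on. */
struct sketch_tph_topology {
	bool requester_has_tph;      /* user's device has a TPH requester */
	bool requester_is_rciep;     /* device is a Root Complex integrated EP */
	bool root_port_is_completer; /* root port advertises TPH completer */
};

/*
 * TPH may be offered to the user only if the requester supports TPH AND
 * the root port is a TPH completer, unless the requester is an RCiEP,
 * which has no root port above it to check.
 */
static bool sketch_tph_opt_in_allowed(const struct sketch_tph_topology *t)
{
	if (!t->requester_has_tph)
		return false;

	if (t->requester_is_rciep)
		return true;

	return t->root_port_is_completer;
}
```

Under this predicate, case 2 (root port without completer support) is
refused even when the endpoint and its P2P peers support TPH.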

I'm glad you're on board with the design.  Please let me know if any
further clarifications are needed.  Thanks,

Alex


end of thread, other threads:[~2026-06-29 15:43 UTC | newest]

Thread overview: 39+ messages
2026-06-16 10:46 [PATCH v17 00/12] vfio/pci: Add PCIe TPH support Chengwen Feng
2026-06-16 10:46 ` [PATCH v17 01/12] PCI/TPH: Fix pcie_tph_get_st_table_loc() field extraction Chengwen Feng
2026-06-16 11:00   ` sashiko-bot
2026-06-16 10:46 ` [PATCH v17 02/12] PCI/TPH: Fix tph_enabled concurrent update race by bitfield packing Chengwen Feng
2026-06-16 10:55   ` sashiko-bot
2026-06-16 10:46 ` [PATCH v17 03/12] PCI/TPH: Cache TPH requester capability at probe time Chengwen Feng
2026-06-16 10:55   ` sashiko-bot
2026-06-16 10:46 ` [PATCH v17 04/12] PCI/TPH: Refactor pcie_enable_tph & add explicit requester variant Chengwen Feng
2026-06-16 10:53   ` sashiko-bot
2026-06-16 10:46 ` [PATCH v17 05/12] PCI/TPH: Refactor pcie_tph_get_cpu_st & add explicit variant Chengwen Feng
2026-06-16 10:53   ` sashiko-bot
2026-06-16 10:46 ` [PATCH v17 06/12] PCI/TPH: Expose the enabled TPH requester type Chengwen Feng
2026-06-16 10:51   ` sashiko-bot
2026-06-16 10:46 ` [PATCH v17 07/12] PCI/TPH: Add pcie_tph_supported() helper to check TPH capability attributes Chengwen Feng
2026-06-16 10:52   ` sashiko-bot
2026-06-16 10:46 ` [PATCH v17 08/12] PCI/TPH: Add sysfs binary file to export CPU to steering-tag mapping Chengwen Feng
2026-06-16 11:00   ` sashiko-bot
2026-06-16 14:42   ` Jason Gunthorpe
2026-06-16 16:57     ` Alex Williamson
2026-06-16 17:27       ` Jason Gunthorpe
2026-06-17  1:18         ` fengchengwen
2026-06-17  1:30           ` Alex Williamson
2026-06-17  2:33             ` fengchengwen
2026-06-17  3:01               ` Alex Williamson
2026-06-17  3:41                 ` fengchengwen
2026-06-17  3:53                   ` Krzysztof Wilczyński
2026-06-17  6:04                     ` fengchengwen
2026-06-23  9:56       ` fengchengwen
2026-06-26 15:22         ` Alex Williamson
2026-06-28 11:58           ` fengchengwen
2026-06-29 15:43             ` Alex Williamson
2026-06-16 10:46 ` [PATCH v17 09/12] vfio/pci: Hide TPH capability when TPH is unsupported Chengwen Feng
2026-06-16 10:56   ` sashiko-bot
2026-06-16 10:46 ` [PATCH v17 10/12] vfio/pci: Add TPH_ENABLE feature skeleton and unsafe module parameter Chengwen Feng
2026-06-16 10:55   ` sashiko-bot
2026-06-16 10:46 ` [PATCH v17 11/12] vfio/pci: Add TPH_ST_CONFIG for PCIe TPH ST configuration Chengwen Feng
2026-06-16 11:05   ` sashiko-bot
2026-06-16 10:46 ` [PATCH v17 12/12] vfio/pci: Virtualize PCIe TPH capability registers Chengwen Feng
2026-06-16 11:03   ` sashiko-bot
