From: Alex Williamson <alex@shazbot.org>
To: fengchengwen <fengchengwen@huawei.com>
Cc: <helgaas@kernel.org>, <wathsala.vithanage@arm.com>,
<wei.huang2@amd.com>, <zhipingz@meta.com>,
<wangzhou1@hisilicon.com>, <wangyushan12@huawei.com>,
<liuyonglong@huawei.com>, <kvm@vger.kernel.org>,
<linux-pci@vger.kernel.org>, Jason Gunthorpe <jgg@ziepe.ca>,
alex@shazbot.org
Subject: Re: [PATCH v17 08/12] PCI/TPH: Add sysfs binary file to export CPU to steering-tag mapping
Date: Fri, 26 Jun 2026 09:22:32 -0600 [thread overview]
Message-ID: <20260626092232.53ed3a7c@shazbot.org> (raw)
In-Reply-To: <9105ceef-5e27-4e3c-8903-d46aef52a2bd@huawei.com>
On Tue, 23 Jun 2026 17:56:51 +0800
fengchengwen <fengchengwen@huawei.com> wrote:
>
> I’ve sent v18-resend of the PCI TPH sysfs patchset with all prior review
> comments fully addressed.
>
> I remain open to both discussed implementation options. This v18-resend
> implements the sysfs binary blob approach on Root Ports as you previously
> suggested.
There's a problem with the pci-sysfs binary attribute beyond whether
it's acceptable from a sysfs perspective: it only solves the CPU-ID to
ST lookup. We already know that this interface should support dma-buf
provided TPH ST and PH values from Zhiping's series. If we take the
pci-sysfs route for CPU-ID, no subsystem exposes the value embedded
into the dma-buf. Another solution is required. Repeat for each
possible ST/PH source we might see in the future.
Therefore, I'm coming around to the idea that vfio-pci provides this
interface, but we need to do so in a way that it's not a wart on the
side of the interface. I think we can do this based on an incremental
progression of features and support, where the ST can be provided
through an extensible set of objects, CPU ID, dma-buf, no-preference
(zero), and literal user-provided ST values.
The key components of the uAPI are similar to ones you've proposed:
- An opt-in/feature flags (SET/GET): VFIO_DEVICE_FEATURE_TPH
The TPH uAPI necessarily changes the ABI of the vfio-pci device;
capabilities and control are virtualized. An empty SET is the
opt-in to the ABI change. There are also feature flags necessary
for uAPI surfaces not discovered through the PCI TPH capability, GET
on this feature returns a flags field indicating supported features.
(SET actually takes a zeroed flags field for symmetry in the vfio
device feature API).
- A mechanism to program indexes in architected ST locations (SET):
VFIO_DEVICE_FEATURE_TPH_ST
This includes flags to define the type of source provided, such as:
- VFIO_DEVICE_TPH_SRC_DMABUF
- VFIO_DEVICE_TPH_SRC_NONE
- VFIO_DEVICE_TPH_SRC_CPU_VOLATILE
- VFIO_DEVICE_TPH_SRC_CPU_PERSISTENT
- VFIO_DEVICE_TPH_SRC_LITERAL
The requested namespace for the tag:
- VFIO_DEVICE_TPH_EXTENDED
And a failure policy, ex. if a translation results in zero, no-pref:
- VFIO_DEVICE_TPH_REQUIRE_ST
This would also include a start/count and pointer to user buffer to
allow batch settings, if the TPH subsystem can be extended to
support these, otherwise a single u32 holds all the decodable
sources.
- A mechanism to decode a source (GET): VFIO_DEVICE_FEATURE_TPH_RESOLVE
This is necessary for DS mode, where the driver may store STs in
device specific locations, but also for dma-buf sources even in IV
mode, where the PH needs to be decoded for the driver's use. This
proposal also extends the API, through progressive opt-ins, to
support virtualization, where a VMM may (in specific scenarios)
implement a _DSM returning host values, and write through literal ST
values from this source, or from a guest-based userspace driver for
a completer, where the guest kernel can source a valid ST.
In order to prevent this from becoming the side-car/wart, returning
ST values is limited to configurations where the resulting ST may
actually be used within the interface. For example, No-ST and IV
mode would operate only with DMABUF, NONE, and CPU_* sources, the
literal ST need not be exposed to userspace. The RESOLVE feature
would be limited to returning the PH value for a DMABUF source.
The structure includes input fields of u32 flags and source, output
fields of u8 valid (bits declaring each of the next fields
validity), u8 ph, u16 st. Flag bits are defined the same as for
TPH_ST, modulo NONE, LITERAL, REQUIRE_ST are not applicable and
reserved.
With that rough uAPI, we can step through an incremental support path.
1. The most basic level of support, and default sans module options,
is No-ST mode. After opt-in, the TPH capability is virtualized to
report only No-ST mode supported, the remainder of the register
zero. Writes to ST Mode allow only 000b. Writes to Requester
Enable allow only valid, non-zero values. In combination, this
enables the device to operate only in No-ST mode. The TPH_ST
interface is inoperative in this mode. The RESOLVE interface only
accepts DMABUF sources and only provides the PH as valid. This
allows only the PH aspect of TPH to be used and does not risk any
exposure or malicious use of ST values.
NB. All modes, including this one, must also validate support for
all reachable features through to the root port, ie. TPH Completer.
2. Via a module option (proposed below), IV mode can be enabled where
supported by the device. This allows the unvirtualized interrupt
vector mode supported bit, ST table location, and ST table size
fields to be exposed in the capability register, and allows 001b to
be written in the mode select control register. Additionally, this
brings onboard the TPH_ST feature, supporting a selection of
source flags. NONE and DMABUF can be assumed, based on sequencing
of the dma-buf based TPH completer support series. CPU sources
rely on root port _DSM support. GET on the noted feature flags
above can indicate this support. LITERAL is not enabled at this
point, this mode only supports ST programming via objects and
instance numbers. RESOLVE support is identical to step 1.
Via SET of TPH_ST, sources are decoded and ST values are written
directly to the ST storage location. The REQUIRE_ST flag indicates
whether the user accepts a translation resulting in zero (no-pref)
or if the API is to generate an error on this result.
Support for translating CPU_* sources relies on platform support
for _DSM, while other sources like DMABUF have no such requirement.
Therefore, for this and following steps, the returned flags on
FEATURE_TPH GET will include a bit indicating whether CPU_* sources
are supported.
There are some important considerations to resolve/document here
between the TPH specification and Linux implementation, such as
requiring TPH is enabled (and for MSI-X vector location, the
programmed vector is enabled), before the TPH_ST. We need to be
particularly careful of a uAPI that relies on mutable
implementation.
3. Again, via a module option, DS mode can be enabled where supported
by the device. On such a supported device (and only on such a
supported device), the device specific mode supported flag is now
reported unvirtualized in the capability register and the control
register supports writes of 010b to the mode field. The RESOLVE
interface now supports CPU_* as a source. RESOLVE will now return
both the PH and ST for a DMABUF, indicated through both valid bits
set, and the ST for a CPU ID, only the ST valid bit set. The
TPH_ST interface still accepts only DMABUF, CPUID, and NONE for
configuring architected ST storage locations.
4. Finally, literal mode can be enabled via another progression of the
module option. Literal mode allows the TPH_ST feature to accept an
actual, raw, ie. literal, ST value and also allows RESOLVE to
return a valid ST even for devices only supporting IV mode. This
is intended to support VM use cases, such as a VMM configured with
1:1 vCPU:pCPU mappings that exposes a _DSM to the guest with host
CPU ST values, or if a completer exists in the guest that provides
ST values through a dmabuf. This is the only mode that allows
writing unchecked ST values to architected ST storage locations.
The user discovers support for this mode via a LITERAL flag
returned from GET on the TPH feature.
The proposed module option is therefore a progression where the user is
granted increasing privilege to make use of STs:
vfio_pci_core.tph=:
0: No-ST Mode (default)
1: + IV Mode
2: + DS Mode
3: + LITERAL
It should be noted that even in IV mode, the user can R/W the ST value
via mmap covering the MSI-X vector table and could read the value in
the capability storage location unless we explicitly prevent it.
Removing the latter is trivial, removing the former costs reverting
previous decisions to allow mmap of the MSI-X vector table for
performance, esp. in larger PAGE_SIZE systems. If this is an issue,
(tph >= 1) would need to reintroduce sparse mmap for MSI-X, but higher
tph values already allow for increasingly direct ST programming.
In general, while the proposal here presents progressively increasing
access and direct use of raw ST values, this access is not considered a
security issue. TPH only allows addressing specific caching
structures, it does not present a correctness issue. Abusive use of ST
values is at best a QoS issue, where access to a device capable of
these features imposes some inherent risk of exposure.
The feature ioctls therefore take this sort of form:
VFIO_DEVICE_FEATURE_TPH (SET/GET):
struct vfio_device_feature_tph {
__u32 flags;
}
@flags is reserved (0) on SET, returns VFIO_DEVICE_TPH_CAP_* on GET:
#define VFIO_DEVICE_TPH_CAP_CPU (1 << 0) /* CPU sources resolvable (_DSM) */
#define VFIO_DEVICE_TPH_CAP_LITERAL (1 << 1) /* LITERAL source available */
VFIO_DEVICE_FEATURE_TPH_RESOLVE (GET):
struct vfio_device_feature_tph_resolve {
__u32 flags; /* IN: VFIO_DEVICE_TPH_* source + namespace */
__u32 src; /* IN: CPU id or dma-buf fd, per @flags */
__u8 valid; /* OUT: VFIO_DEVICE_TPH_VALID_* */
#define VFIO_DEVICE_TPH_VALID_PH (1 << 0) /* @ph holds a processing hint */
#define VFIO_DEVICE_TPH_VALID_ST (1 << 1) /* @st holds a non-zero steering tag */
__u8 ph; /* OUT: processing hint */
__u16 st; /* OUT: raw ST (DS presented or LITERAL) */
};
@flags is common for both RESOLVE and the following TPH_ST feature,
bits are allocated according to their order of introduction. Some
bits are only applicable to the TPH_ST interface and are reserved
for RESOLVE:
#define VFIO_DEVICE_TPH_SRC_DMABUF (1 << 0) /* dma-buf fd */
#define VFIO_DEVICE_TPH_SRC_NONE (1 << 1) /* write ST 0 (TPH_ST only) */
#define VFIO_DEVICE_TPH_REQUIRE_ST (1 << 2) /* modifier (TPH_ST): stop on
resolved ST 0 */
#define VFIO_DEVICE_TPH_SRC_CPU_VOLATILE (1 << 3) /* CPU id, volatile memory */
#define VFIO_DEVICE_TPH_SRC_CPU_PERSISTENT (1 << 4) /* CPU id, persistent memory */
#define VFIO_DEVICE_TPH_EXTENDED (1 << 5) /* modifier: Extended ST
namespace */
#define VFIO_DEVICE_TPH_SRC_LITERAL (1 << 6) /* literal ST value
(TPH_ST only) */
The source type must be uniquely defined, therefore only one bit in
the source mask can be set:
#define VFIO_DEVICE_TPH_SRC_MASK \
(VFIO_DEVICE_TPH_SRC_DMABUF | \
VFIO_DEVICE_TPH_SRC_NONE | \
VFIO_DEVICE_TPH_SRC_CPU_VOLATILE | \
VFIO_DEVICE_TPH_SRC_CPU_PERSISTENT | \
VFIO_DEVICE_TPH_SRC_LITERAL)
VFIO_DEVICE_FEATURE_TPH_ST (SET):
struct vfio_device_feature_tph_st {
__u32 flags; /* IN: VFIO_DEVICE_TPH_* source + namespace */
__u16 start; /* IN: first ST table index
(IV: interrupt vector) */
__u16 count; /* IN: number of contiguous entries */
__aligned_u64 dests; /* IN: @count __u32 (CPU/DMABUF/LITERAL;
unused for NONE) */
};
Batching here is intended to shorten the TPH disabled/quiescent
window, but this is not currently supported by the TPH subsystem.
This could degrade to a single inline entry if necessary.
Batching also introduces a risk of partial success, where some
sources may resolve, while others generate an error or are promoted
to an error based on the REQUIRE_ST modifier. The proposed
solution for this is to allow vfio device features to return values
other than 0 or -errno, as currently described. This interface
would adopt a solution similar to KVM_SET_MSRS, where the ioctl
returns the number of successfully programmed entries. Therefore
if (ret == count) the full set was successful, otherwise the user
knows exactly which entry failed.
There are various intricate details involved in validating
configurations where TPH can be enabled, and I'm glossing over the uAPI
dependencies on implementation versus specification, but I think this
provides a good basis for exposing TPH support on the device. I'd
strongly recommend the incremental enablement stages proposed here,
even if we stop before LITERAL support based on current use cases.
Please comment if there are gaps or anything is unclear. Thanks,
Alex
next prev parent reply other threads:[~2026-06-26 15:22 UTC|newest]
Thread overview: 37+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-06-16 10:46 [PATCH v17 00/12] vfio/pci: Add PCIe TPH support Chengwen Feng
2026-06-16 10:46 ` [PATCH v17 01/12] PCI/TPH: Fix pcie_tph_get_st_table_loc() field extraction Chengwen Feng
2026-06-16 11:00 ` sashiko-bot
2026-06-16 10:46 ` [PATCH v17 02/12] PCI/TPH: Fix tph_enabled concurrent update race by bitfield packing Chengwen Feng
2026-06-16 10:55 ` sashiko-bot
2026-06-16 10:46 ` [PATCH v17 03/12] PCI/TPH: Cache TPH requester capability at probe time Chengwen Feng
2026-06-16 10:55 ` sashiko-bot
2026-06-16 10:46 ` [PATCH v17 04/12] PCI/TPH: Refactor pcie_enable_tph & add explicit requester variant Chengwen Feng
2026-06-16 10:53 ` sashiko-bot
2026-06-16 10:46 ` [PATCH v17 05/12] PCI/TPH: Refactor pcie_tph_get_cpu_st & add explicit variant Chengwen Feng
2026-06-16 10:53 ` sashiko-bot
2026-06-16 10:46 ` [PATCH v17 06/12] PCI/TPH: Expose the enabled TPH requester type Chengwen Feng
2026-06-16 10:51 ` sashiko-bot
2026-06-16 10:46 ` [PATCH v17 07/12] PCI/TPH: Add pcie_tph_supported() helper to check TPH capability attributes Chengwen Feng
2026-06-16 10:52 ` sashiko-bot
2026-06-16 10:46 ` [PATCH v17 08/12] PCI/TPH: Add sysfs binary file to export CPU to steering-tag mapping Chengwen Feng
2026-06-16 11:00 ` sashiko-bot
2026-06-16 14:42 ` Jason Gunthorpe
2026-06-16 16:57 ` Alex Williamson
2026-06-16 17:27 ` Jason Gunthorpe
2026-06-17 1:18 ` fengchengwen
2026-06-17 1:30 ` Alex Williamson
2026-06-17 2:33 ` fengchengwen
2026-06-17 3:01 ` Alex Williamson
2026-06-17 3:41 ` fengchengwen
2026-06-17 3:53 ` Krzysztof Wilczyński
2026-06-17 6:04 ` fengchengwen
2026-06-23 9:56 ` fengchengwen
2026-06-26 15:22 ` Alex Williamson [this message]
2026-06-16 10:46 ` [PATCH v17 09/12] vfio/pci: Hide TPH capability when TPH is unsupported Chengwen Feng
2026-06-16 10:56 ` sashiko-bot
2026-06-16 10:46 ` [PATCH v17 10/12] vfio/pci: Add TPH_ENABLE feature skeleton and unsafe module parameter Chengwen Feng
2026-06-16 10:55 ` sashiko-bot
2026-06-16 10:46 ` [PATCH v17 11/12] vfio/pci: Add TPH_ST_CONFIG for PCIe TPH ST configuration Chengwen Feng
2026-06-16 11:05 ` sashiko-bot
2026-06-16 10:46 ` [PATCH v17 12/12] vfio/pci: Virtualize PCIe TPH capability registers Chengwen Feng
2026-06-16 11:03 ` sashiko-bot
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260626092232.53ed3a7c@shazbot.org \
--to=alex@shazbot.org \
--cc=fengchengwen@huawei.com \
--cc=helgaas@kernel.org \
--cc=jgg@ziepe.ca \
--cc=kvm@vger.kernel.org \
--cc=linux-pci@vger.kernel.org \
--cc=liuyonglong@huawei.com \
--cc=wangyushan12@huawei.com \
--cc=wangzhou1@hisilicon.com \
--cc=wathsala.vithanage@arm.com \
--cc=wei.huang2@amd.com \
--cc=zhipingz@meta.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox