Linux PCI subsystem development
 help / color / mirror / Atom feed
From: Alex Williamson <alex@shazbot.org>
To: fengchengwen <fengchengwen@huawei.com>
Cc: <helgaas@kernel.org>, <wathsala.vithanage@arm.com>,
	<wei.huang2@amd.com>, <zhipingz@meta.com>,
	<wangzhou1@hisilicon.com>, <wangyushan12@huawei.com>,
	<liuyonglong@huawei.com>, <kvm@vger.kernel.org>,
	<linux-pci@vger.kernel.org>, Jason Gunthorpe <jgg@ziepe.ca>,
	alex@shazbot.org
Subject: Re: [PATCH v17 08/12] PCI/TPH: Add sysfs binary file to export CPU to steering-tag mapping
Date: Fri, 26 Jun 2026 09:22:32 -0600	[thread overview]
Message-ID: <20260626092232.53ed3a7c@shazbot.org> (raw)
In-Reply-To: <9105ceef-5e27-4e3c-8903-d46aef52a2bd@huawei.com>

On Tue, 23 Jun 2026 17:56:51 +0800
fengchengwen <fengchengwen@huawei.com> wrote:
> 
> I’ve sent v18-resend of the PCI TPH sysfs patchset with all prior review
> comments fully addressed.
> 
> I remain open to both discussed implementation options. This v18-resend
> implements the sysfs binary blob approach on Root Ports as you previously
> suggested.

There's a problem with the pci-sysfs binary attribute beyond whether
it's acceptable from a sysfs perspective: it only solves the CPU-ID to
ST lookup.  We already know that this interface should support dma-buf
provided TPH ST and PH values from Zhiping's series.  If we take the
pci-sysfs route for CPU-ID, no subsystem exposes the value embedded
into the dma-buf.  Another solution is required.  Repeat for each
possible ST/PH source we might see in the future.

Therefore, I'm coming around to the idea that vfio-pci provides this
interface, but we need to do so in a way that it's not a wart on the
side of the interface.  I think we can do this based on an incremental
progression of features and support, where the ST can be provided
through an extensible set of objects, CPU ID, dma-buf, no-preference
(zero), and literal user-provided ST values.

The key components of the uAPI are similar to ones you've proposed:

 - An opt-in/feature flags (SET/GET): VFIO_DEVICE_FEATURE_TPH

   The TPH uAPI necessarily changes the ABI of the vfio-pci device;
   capabilities and control are virtualized.  An empty SET is the
   opt-in to the ABI change.  There are also feature flags necessary
   for uAPI surfaces not discovered through the PCI TPH capability, GET
   on this feature returns a flags field indicating supported features.
   (SET actually takes a zeroed flags field for symmetry in the vfio
   device feature API).

 - A mechanism to program indexes in architected ST locations (SET):
   VFIO_DEVICE_FEATURE_TPH_ST

   This includes flags to define the type of source provided, such as:

    - VFIO_DEVICE_TPH_SRC_DMABUF
    - VFIO_DEVICE_TPH_SRC_NONE
    - VFIO_DEVICE_TPH_SRC_CPU_VOLATILE
    - VFIO_DEVICE_TPH_SRC_CPU_PERSISTENT
    - VFIO_DEVICE_TPH_SRC_LITERAL

   The requested namespace for the tag:

    - VFIO_DEVICE_TPH_EXTENDED

   And a failure policy, ex. if a translation results in zero, no-pref:

    - VFIO_DEVICE_TPH_REQUIRE_ST

   This would also include a start/count and pointer to user buffer to
   allow batch settings, if the TPH subsystem can be extended to
   support these, otherwise a single u32 holds all the decodable
   sources.

 - A mechanism to decode a source (GET): VFIO_DEVICE_FEATURE_TPH_RESOLVE

   This is necessary for DS mode, where the driver may store STs in
   device specific locations, but also for dma-buf sources even in IV
   mode, where the PH needs to be decoded for the driver's use.  This
   proposal also extends the API, through progressive opt-ins, to
   support virtualization, where a VMM may (in specific scenarios)
   implement a _DSM returning host values, and write through literal ST
   values from this source, or from a guest-based userspace driver for
   a completer, where the guest kernel can source a valid ST.

   In order to prevent this from becoming the side-car/wart, returning
   ST values is limited to configurations where the resulting ST may
   actually be used within the interface.  For example, No-ST and IV
   mode would operate only with DMABUF, NONE, and CPU_* sources, the
   literal ST need not be exposed to userspace.  The RESOLVE feature
   would be limited to returning the PH value for a DMABUF source.

   The structure includes input fields of u32 flags and source, output
   fields of u8 valid (bits declaring each of the next fields
   validity), u8 ph, u16 st.  Flag bits are defined the same as for
   TPH_ST, modulo NONE, LITERAL, REQUIRE_ST are not applicable and
   reserved.

With that rough uAPI, we can step through an incremental support path.

 1. The most basic level of support, and default sans module options,
    is No-ST mode.  After opt-in, the TPH capability is virtualized to
    report only No-ST mode supported, the remainder of the register
    zero.  Writes to ST Mode allow only 000b.  Writes to Requester
    Enable allow only valid, non-zero values.  In combination, this
    enables the device to operate only in No-ST mode.  The TPH_ST
    interface is inoperative in this mode.  The RESOLVE interface only
    accepts DMABUF sources and only provides the PH as valid.  This
    allows only the PH aspect of TPH to be used and does not risk any
    exposure or malicious use of ST values.

    NB. All modes, including this one, must also validate support for
    all reachable features through to the root port, ie. TPH Completer.

 2. Via a module option (proposed below), IV mode can be enabled where
    supported by the device.  This allows the unvirtualized interrupt
    vector mode supported bit, ST table location, and ST table size
    fields to be exposed in the capability register, and allows 001b to
    be written in the mode select control register.  Additionally, this
    brings onboard the TPH_ST feature, supporting a selection of
    source flags.  NONE and DMABUF can be assumed, based on sequencing
    of the dma-buf based TPH completer support series.  CPU sources
    rely on root port _DSM support.  GET on the noted feature flags
    above can indicate this support.  LITERAL is not enabled at this
    point, this mode only supports ST programming via objects and
    instance numbers.  RESOLVE support is identical to step 1.

    Via SET of TPH_ST, sources are decoded and ST values are written
    directly to the ST storage location.  The REQUIRE_ST flag indicates
    whether the user accepts a translation resulting in zero (no-pref)
    or if the API is to generate an error on this result.

    Support for translating CPU_* sources relies on platform support
    for _DSM, while other sources like DMABUF have no such requirement.
    Therefore, for this and following steps, the returned flags on
    FEATURE_TPH GET will include a bit indicating whether CPU_* sources
    are supported.

    There are some important considerations to resolve/document here
    between the TPH specification and Linux implementation, such as
    requiring TPH is enabled (and for MSI-X vector location, the
    programmed vector is enabled), before the TPH_ST.  We need to be
    particularly careful of a uAPI that relies on mutable
    implementation.

 3. Again, via a module option, DS mode can be enabled where supported
    by the device.  On such a supported device (and only on such a
    supported device), the device specific mode supported flag is now
    reported unvirtualized in the capability register and the control
    register supports writes of 010b to the mode field.  The RESOLVE
    interface now supports CPU_* as a source.  RESOLVE will now return
    both the PH and ST for a DMABUF, indicated through both valid bits
    set, and the ST for a CPU ID, only the ST valid bit set.  The
    TPH_ST interface still accepts only DMABUF, CPUID, and NONE for
    configuring architected ST storage locations.

 4. Finally, literal mode can be enabled via another progression of the
    module option.  Literal mode allows the TPH_ST feature to accept an
    actual, raw, ie. literal, ST value and also allows RESOLVE to
    return a valid ST even for devices only supporting IV mode.  This
    is intended to support VM use cases, such as a VMM configured with
    1:1 vCPU:pCPU mappings that exposes a _DSM to the guest with host
    CPU ST values, or if a completer exists in the guest that provides
    ST values through a dmabuf.  This is the only mode that allows
    writing unchecked ST values to architected ST storage locations.
    The user discovers support for this mode via a LITERAL flag
    returned from GET on the TPH feature.

The proposed module option is therefore a progression where the user is
granted increasing privilege to make use of STs:

vfio_pci_core.tph=:
	0: No-ST Mode (default)
	1: + IV Mode
	2: + DS Mode
	3: + LITERAL

It should be noted that even in IV mode, the user can R/W the ST value
via mmap covering the MSI-X vector table and could read the value in
the capability storage location unless we explicitly prevent it.
Removing the latter is trivial, removing the former costs reverting
previous decisions to allow mmap of the MSI-X vector table for
performance, esp. in larger PAGE_SIZE systems.  If this is an issue,
(tph >= 1) would need to reintroduce sparse mmap for MSI-X, but higher
tph values already allow for increasingly direct ST programming.

In general, while the proposal here presents progressively increasing
access and direct use of raw ST values, this access is not considered a
security issue.  TPH only allows addressing specific caching
structures, it does not present a correctness issue.  Abusive use of ST
values is at best a QoS issue, where access to a device capable of
these features imposes some inherent risk of exposure.

The feature ioctls therefore take this sort of form:

VFIO_DEVICE_FEATURE_TPH (SET/GET):

    struct vfio_device_feature_tph {
            __u32       flags;
    }

    @flags is reserved (0) on SET, returns VFIO_DEVICE_TPH_CAP_* on GET:

    #define VFIO_DEVICE_TPH_CAP_CPU             (1 << 0) /* CPU sources resolvable (_DSM) */
    #define VFIO_DEVICE_TPH_CAP_LITERAL         (1 << 1) /* LITERAL source available */

VFIO_DEVICE_FEATURE_TPH_RESOLVE (GET):

    struct vfio_device_feature_tph_resolve {
            __u32       flags;          /* IN: VFIO_DEVICE_TPH_* source + namespace */
            __u32       src;            /* IN: CPU id or dma-buf fd, per @flags */
            __u8        valid;          /* OUT: VFIO_DEVICE_TPH_VALID_* */
    #define VFIO_DEVICE_TPH_VALID_PH    (1 << 0) /* @ph holds a processing hint */
    #define VFIO_DEVICE_TPH_VALID_ST    (1 << 1) /* @st holds a non-zero steering tag */
            __u8        ph;             /* OUT: processing hint */
            __u16       st;             /* OUT: raw ST (DS presented or LITERAL) */
    };

    @flags is common for both RESOLVE and the following TPH_ST feature,
    bits are allocated according to their order of introduction.  Some
    bits are only applicable to the TPH_ST interface and are reserved
    for RESOLVE:

    #define VFIO_DEVICE_TPH_SRC_DMABUF          (1 << 0) /* dma-buf fd */
    #define VFIO_DEVICE_TPH_SRC_NONE            (1 << 1) /* write ST 0 (TPH_ST only) */
    #define VFIO_DEVICE_TPH_REQUIRE_ST          (1 << 2) /* modifier (TPH_ST): stop on
                                                            resolved ST 0 */
    #define VFIO_DEVICE_TPH_SRC_CPU_VOLATILE    (1 << 3) /* CPU id, volatile memory */
    #define VFIO_DEVICE_TPH_SRC_CPU_PERSISTENT  (1 << 4) /* CPU id, persistent memory */
    #define VFIO_DEVICE_TPH_EXTENDED            (1 << 5) /* modifier: Extended ST
                                                            namespace */
    #define VFIO_DEVICE_TPH_SRC_LITERAL         (1 << 6) /* literal ST value
                                                            (TPH_ST only) */

    The source type must be uniquely defined, therefore only one bit in
    the source mask can be set:

    #define VFIO_DEVICE_TPH_SRC_MASK \
            (VFIO_DEVICE_TPH_SRC_DMABUF | \
             VFIO_DEVICE_TPH_SRC_NONE | \
             VFIO_DEVICE_TPH_SRC_CPU_VOLATILE | \
             VFIO_DEVICE_TPH_SRC_CPU_PERSISTENT | \
             VFIO_DEVICE_TPH_SRC_LITERAL)

VFIO_DEVICE_FEATURE_TPH_ST (SET):

    struct vfio_device_feature_tph_st {
            __u32               flags;  /* IN: VFIO_DEVICE_TPH_* source + namespace */
            __u16               start;  /* IN: first ST table index
                                               (IV: interrupt vector) */
            __u16               count;  /* IN: number of contiguous entries */
            __aligned_u64       dests;  /* IN: @count __u32 (CPU/DMABUF/LITERAL;
                                               unused for NONE) */
    };

    Batching here is intended to shorten the TPH disabled/quiescent
    window, but this is not currently supported by the TPH subsystem.
    This could degrade to a single inline entry if necessary.

    Batching also introduces a risk of partial success, where some
    sources may resolve, while others generate an error or are promoted
    to an error based on the REQUIRE_ST modifier.  The proposed
    solution for this is to allow vfio device features to return values
    other than 0 or -errno, as currently described.  This interface
    would adopt a solution similar to KVM_SET_MSRS, where the ioctl
    returns the number of successfully programmed entries.  Therefore
    if (ret == count) the full set was successful, otherwise the user
    knows exactly which entry failed.

There are various intricate details involved in validating
configurations where TPH can be enabled, and I'm glossing over the uAPI
dependencies on implementation versus specification, but I think this
provides a good basis for exposing TPH support on the device.  I'd
strongly recommend the incremental enablement stages proposed here,
even if we stop before LITERAL support based on current use cases.
Please comment if there are gaps or anything is unclear.  Thanks,

Alex

  reply	other threads:[~2026-06-26 15:22 UTC|newest]

Thread overview: 37+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-16 10:46 [PATCH v17 00/12] vfio/pci: Add PCIe TPH support Chengwen Feng
2026-06-16 10:46 ` [PATCH v17 01/12] PCI/TPH: Fix pcie_tph_get_st_table_loc() field extraction Chengwen Feng
2026-06-16 11:00   ` sashiko-bot
2026-06-16 10:46 ` [PATCH v17 02/12] PCI/TPH: Fix tph_enabled concurrent update race by bitfield packing Chengwen Feng
2026-06-16 10:55   ` sashiko-bot
2026-06-16 10:46 ` [PATCH v17 03/12] PCI/TPH: Cache TPH requester capability at probe time Chengwen Feng
2026-06-16 10:55   ` sashiko-bot
2026-06-16 10:46 ` [PATCH v17 04/12] PCI/TPH: Refactor pcie_enable_tph & add explicit requester variant Chengwen Feng
2026-06-16 10:53   ` sashiko-bot
2026-06-16 10:46 ` [PATCH v17 05/12] PCI/TPH: Refactor pcie_tph_get_cpu_st & add explicit variant Chengwen Feng
2026-06-16 10:53   ` sashiko-bot
2026-06-16 10:46 ` [PATCH v17 06/12] PCI/TPH: Expose the enabled TPH requester type Chengwen Feng
2026-06-16 10:51   ` sashiko-bot
2026-06-16 10:46 ` [PATCH v17 07/12] PCI/TPH: Add pcie_tph_supported() helper to check TPH capability attributes Chengwen Feng
2026-06-16 10:52   ` sashiko-bot
2026-06-16 10:46 ` [PATCH v17 08/12] PCI/TPH: Add sysfs binary file to export CPU to steering-tag mapping Chengwen Feng
2026-06-16 11:00   ` sashiko-bot
2026-06-16 14:42   ` Jason Gunthorpe
2026-06-16 16:57     ` Alex Williamson
2026-06-16 17:27       ` Jason Gunthorpe
2026-06-17  1:18         ` fengchengwen
2026-06-17  1:30           ` Alex Williamson
2026-06-17  2:33             ` fengchengwen
2026-06-17  3:01               ` Alex Williamson
2026-06-17  3:41                 ` fengchengwen
2026-06-17  3:53                   ` Krzysztof Wilczyński
2026-06-17  6:04                     ` fengchengwen
2026-06-23  9:56       ` fengchengwen
2026-06-26 15:22         ` Alex Williamson [this message]
2026-06-16 10:46 ` [PATCH v17 09/12] vfio/pci: Hide TPH capability when TPH is unsupported Chengwen Feng
2026-06-16 10:56   ` sashiko-bot
2026-06-16 10:46 ` [PATCH v17 10/12] vfio/pci: Add TPH_ENABLE feature skeleton and unsafe module parameter Chengwen Feng
2026-06-16 10:55   ` sashiko-bot
2026-06-16 10:46 ` [PATCH v17 11/12] vfio/pci: Add TPH_ST_CONFIG for PCIe TPH ST configuration Chengwen Feng
2026-06-16 11:05   ` sashiko-bot
2026-06-16 10:46 ` [PATCH v17 12/12] vfio/pci: Virtualize PCIe TPH capability registers Chengwen Feng
2026-06-16 11:03   ` sashiko-bot

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260626092232.53ed3a7c@shazbot.org \
    --to=alex@shazbot.org \
    --cc=fengchengwen@huawei.com \
    --cc=helgaas@kernel.org \
    --cc=jgg@ziepe.ca \
    --cc=kvm@vger.kernel.org \
    --cc=linux-pci@vger.kernel.org \
    --cc=liuyonglong@huawei.com \
    --cc=wangyushan12@huawei.com \
    --cc=wangzhou1@hisilicon.com \
    --cc=wathsala.vithanage@arm.com \
    --cc=wei.huang2@amd.com \
    --cc=zhipingz@meta.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox