From: Alex Williamson <alex@shazbot.org>
To: fengchengwen <fengchengwen@huawei.com>
Cc: <jgg@ziepe.ca>, <wathsala.vithanage@arm.com>,
<helgaas@kernel.org>, <wei.huang2@amd.com>,
<wangzhou1@hisilicon.com>, <wangyushan12@huawei.com>,
<liuyonglong@huawei.com>, <kvm@vger.kernel.org>,
<linux-pci@vger.kernel.org>,
alex@shazbot.org
Subject: Re: [PATCH v11 5/5] vfio/pci: Add VFIO_DEVICE_FEATURE_TPH_ST for TPH ST entry management
Date: Fri, 22 May 2026 08:27:30 -0600 [thread overview]
Message-ID: <20260522082730.3d843a7f@shazbot.org> (raw)
In-Reply-To: <512bce27-13a2-44a1-aa5b-b5ea77c174ba@huawei.com>
On Fri, 22 May 2026 18:04:13 +0800
fengchengwen <fengchengwen@huawei.com> wrote:
> On 5/22/2026 12:10 PM, Alex Williamson wrote:
> > On Mon, 18 May 2026 15:17:01 +0800
> > Chengwen Feng <fengchengwen@huawei.com> wrote:
> >
> >> Add VFIO_DEVICE_FEATURE_TPH_ST to allow userspace to manage PCIe TPH
> >> Steering Tag entries.
> >>
> >> SET: Program contiguous ST entries when ST table resides in TPH Capability
> >> or MSI-X table. U32_MAX CPU ID clears the corresponding ST entry.
> >> GET: Retrieve ST values per CPU ID, only available in DS mode.
> >>
> >> Both operations are only valid when TPH is enabled on the device.
> >
> > What? That doesn't align with the hardware programming model. The
> > specification even suggests that the device should be quiesced or TPH
> > disabled during programming of the steering tags for deterministic
> > behavior.
>
> As mentioned in the 4/5 commit, obtaining ST must first enable TPH.
> Of course, I think it might be necessary to calculate pdev->tph_req_type
> during the capability detection phase, so that we can modify ST before
> enabling TPH. Do you think such a modification is needed?
I think a uAPI that doesn't align with the recommended programming
order of the specification, potentially resulting in non-deterministic
behavior is bogus.
> >> Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
> >> ---
> >> drivers/vfio/pci/vfio_pci_core.c | 91 ++++++++++++++++++++++++++++++++
> >> include/uapi/linux/vfio.h | 41 ++++++++++++++
> >> 2 files changed, 132 insertions(+)
> >>
> >> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> >> index 0d0140202280..13a22c1d6ea6 100644
> >> --- a/drivers/vfio/pci/vfio_pci_core.c
> >> +++ b/drivers/vfio/pci/vfio_pci_core.c
> >> @@ -1558,6 +1558,95 @@ static int vfio_pci_core_feature_token(struct vfio_pci_core_device *vdev,
> >> return 0;
> >> }
> >>
> >> +static int vfio_pci_core_feature_tph_st(struct vfio_pci_core_device *vdev,
> >> + u32 flags,
> >> + struct vfio_device_feature_tph_st __user *arg,
> >> + size_t argsz)
> >> +{
> >> + bool is_set = !!(flags & VFIO_DEVICE_FEATURE_SET);
> >> + struct vfio_device_feature_tph_st tph_st;
> >> + struct pci_dev *pdev = vdev->pdev;
> >> + enum tph_mem_type mtype;
> >> + int i, j, ret;
> >> + u32 *cpus;
> >> + u16 st;
> >> +
> >> + guard(mutex)(&vdev->tph_lock);
> >
> > Wrong place for this lock. Why do we even need it? If a user races
> > SET/GET with enable/disable, that's their problem.
>
> OK, will fix in next version
>
> >
> >> +
> >> + ret = vfio_check_feature(flags, argsz,
> >> + VFIO_DEVICE_FEATURE_GET |
> >> + VFIO_DEVICE_FEATURE_SET,
> >> + sizeof(tph_st));
> >> + if (ret <= 0)
> >> + return ret;
> >> +
> >> + if (!enable_unsafe_tph ||
> >
> > Why did we allow the feature to be probe if it can't be used?
>
> OK, will fix in next version
>
> >
> >> + pcie_tph_enabled_mode(pdev) == PCI_TPH_ST_NS_MODE)
> >
> > Here's that ambiguity of reporting NS_MODE for disabled, but as above
> > it's a bogus requirement.
>
> OK, will fix in next version
>
> >
> >> + return -EOPNOTSUPP;
> >> + if (!is_set && pcie_tph_enabled_mode(pdev) != PCI_TPH_ST_DS_MODE)
> >
> > Why? Let the interface be symmetric that GET works regardless of mode.
>
> OK, will remove the restrict
>
> >
> >> + return -EOPNOTSUPP;
> >> + if (is_set && pcie_tph_get_st_table_loc(pdev) == PCI_TPH_LOC_NONE)
> >> + return -EOPNOTSUPP;
> >> +
> >> + if (copy_from_user(&tph_st, arg, sizeof(tph_st)))
> >> + return -EFAULT;
> >> +
> >> + if (tph_st.count == 0 || tph_st.count > VFIO_TPH_ST_MAX_COUNT ||
> >> + tph_st.flags > VFIO_TPH_ST_MEM_TYPE_PM)
> >
> > Wrong operation for a flags field.
>
> Because currently flags only use bit0 (0-VM, 1-PM), this judge make sure
> it will return error if userspace passing non-zero in remain bits.
This should be a mask to test undefined bits are not set, not a value
comparison.
> >> + return -EINVAL;
> >> + if (!is_set && tph_st.index != 0)
> >
> > We can only GET index 0? Why?
>
> Because GET operation don't need index, this judgment make sure
> it will return error if userspace passing non-zero index.
The interface is not symmetric.
> >> + return -EINVAL;
> >> + if (is_set && (tph_st.index >= VFIO_TPH_ST_MAX_COUNT ||
> >> + tph_st.index + tph_st.count > VFIO_TPH_ST_MAX_COUNT))
> >
> > DS can have more that 2048 steering tags.
>
> I check the PCIE protocol, DS mode with ST in pcie-config or msi-x must not
> hold > 2048 steering tags.
The MSI-X storage location is limited to 2048, but DS mode doesn't
necessarily use either of these storage locations.
> >> + return -EINVAL;
> >> +
> >> + cpus = memdup_array_user(&arg->data, tph_st.count, sizeof(*cpus));
> >
> > Use an __aligned_u64 data_uptr like iommufd does and avoid copying this
> > into the kernel. Isn't this steering tags, why's it named cpus?
>
> we reuse the field which according your previous suggestion:
> in GET, user pass in cpus, and vfio will feedback steering tags in cpus.
Perhaps this was over stated, but aligning to the generic uAPI name
since it's dual-purpose would seem to make sense.
> >> + if (IS_ERR(cpus))
> >> + return PTR_ERR(cpus);
> >> +
> >> + mtype = tph_st.flags & VFIO_TPH_ST_MEM_TYPE_PM ? TPH_MEM_TYPE_PM :
> >> + TPH_MEM_TYPE_VM;
> >
> > Then why did we use two flag bits for this?
>
> VFIO_TPH_ST_MEM_TYPE_PM is just one bit, could you explain the two flag bits?
My bad, I overlooked that they're both shifted by 0. Typically we
wouldn't define both the 0 and 1 value, only the latter.
> >> + if (!is_set) {
> >> + for (i = 0; i < tph_st.count; i++) {
> >> + ret = pcie_tph_get_cpu_st(pdev, mtype, cpus[i], &st);
> >
> > My understanding was that GET would return the translated values from a
> > previous SET, not translate on the fly. I think in order to support
> > writing STs before enabling TPH, we're going to need to store them
> > somewhere anyway so that we can push them to the correct location when
> > TPH is enabled. That means there a buffer for SET/GET regardless of
> > the mode that's enabled.
>
> OK, will try to fix
>
> >
> >> + if (ret)
> >> + goto out;
> >> + cpus[i] = st;
> >> + }
> >> + goto out;
> >> + }
> >> +
> >> + for (i = 0; i < tph_st.count; i++) {
> >> + if (cpus[i] == U32_MAX) {
> >> + ret = pcie_tph_set_st_entry(pdev, tph_st.index + i, 0);
> >
> > Why do we need a magic value to set it to zero? Hardware doesn't have
> > a "clear ST register value".
>
> So you mean for SET, usespace direct passing steering-tag not cpus in this patch?
How does this work with Zhiping's series where there might be TPH values
exposed through the dma-buf to support p2p DMA? Where would that level
of TPH programming occur?
> >> + if (ret)
> >> + goto out;
> >> + continue;
> >> + }
> >> +
> >> + ret = pcie_tph_get_cpu_st(pdev, mtype, cpus[i], &st);
> >> + if (ret)
> >> + goto out;
> >> + ret = pcie_tph_set_st_entry(pdev, tph_st.index + i, st);
> >> + if (ret)
> >> + goto out;
> >
> > Clearly broken for DS mode.
>
> For DS mode:
> 1\ if ST table in CONFIG or MSI-X, then it could done by this way
> 2\ if ST table not in CONFIG or MSI-X, then it only need invoke GET, and set by device-specific way.
The SET only after enable model is broken, DS mode needs to support
SET, where GET returns the host translations. Thanks,
Alex
next prev parent reply other threads:[~2026-05-22 14:27 UTC|newest]
Thread overview: 18+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-05-18 7:16 [PATCH v11 0/5] vfio/pci: Add PCIe TPH support Chengwen Feng
2026-05-18 7:16 ` [PATCH v11 1/5] PCI/TPH: Fix pcie_tph_get_st_table_loc() field extraction Chengwen Feng
2026-05-18 7:16 ` [PATCH v11 2/5] PCI/TPH: Export pcie_tph_get_st_modes() for external use Chengwen Feng
2026-05-22 4:10 ` Alex Williamson
2026-05-18 7:16 ` [PATCH v11 3/5] PCI/TPH: Add pcie_tph_enabled_mode() helper Chengwen Feng
2026-05-22 4:10 ` Alex Williamson
2026-05-22 9:18 ` fengchengwen
2026-05-22 14:00 ` Alex Williamson
2026-05-18 7:17 ` [PATCH v11 4/5] vfio/pci: Add PCIe TPH configuration space virtualization Chengwen Feng
2026-05-22 4:10 ` Alex Williamson
2026-05-22 9:39 ` fengchengwen
2026-05-22 14:09 ` Alex Williamson
2026-05-18 7:17 ` [PATCH v11 5/5] vfio/pci: Add VFIO_DEVICE_FEATURE_TPH_ST for TPH ST entry management Chengwen Feng
2026-05-22 4:10 ` Alex Williamson
2026-05-22 10:04 ` fengchengwen
2026-05-22 14:27 ` Alex Williamson [this message]
2026-05-26 6:29 ` fengchengwen
2026-05-22 1:31 ` [PATCH v11 0/5] vfio/pci: Add PCIe TPH support fengchengwen
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260522082730.3d843a7f@shazbot.org \
--to=alex@shazbot.org \
--cc=fengchengwen@huawei.com \
--cc=helgaas@kernel.org \
--cc=jgg@ziepe.ca \
--cc=kvm@vger.kernel.org \
--cc=linux-pci@vger.kernel.org \
--cc=liuyonglong@huawei.com \
--cc=wangyushan12@huawei.com \
--cc=wangzhou1@hisilicon.com \
--cc=wathsala.vithanage@arm.com \
--cc=wei.huang2@amd.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox