Re: [PATCH v11 5/5] vfio/pci: Add VFIO_DEVICE_FEATURE_TPH_ST for TPH ST entry management

Kernel KVM virtualization development
 help / color / mirror / Atom feed

From: Alex Williamson <alex@shazbot.org>
To: fengchengwen <fengchengwen@huawei.com>
Cc: <jgg@ziepe.ca>, <wathsala.vithanage@arm.com>,
	<helgaas@kernel.org>, <wei.huang2@amd.com>,
	<wangzhou1@hisilicon.com>, <wangyushan12@huawei.com>,
	<liuyonglong@huawei.com>, <kvm@vger.kernel.org>,
	<linux-pci@vger.kernel.org>,
	alex@shazbot.org
Subject: Re: [PATCH v11 5/5] vfio/pci: Add VFIO_DEVICE_FEATURE_TPH_ST for TPH ST entry management
Date: Fri, 22 May 2026 08:27:30 -0600	[thread overview]
Message-ID: <20260522082730.3d843a7f@shazbot.org> (raw)
In-Reply-To: <512bce27-13a2-44a1-aa5b-b5ea77c174ba@huawei.com>

On Fri, 22 May 2026 18:04:13 +0800
fengchengwen <fengchengwen@huawei.com> wrote:

> On 5/22/2026 12:10 PM, Alex Williamson wrote:
> > On Mon, 18 May 2026 15:17:01 +0800
> > Chengwen Feng <fengchengwen@huawei.com> wrote:
> >   
> >> Add VFIO_DEVICE_FEATURE_TPH_ST to allow userspace to manage PCIe TPH
> >> Steering Tag entries.
> >>
> >> SET: Program contiguous ST entries when ST table resides in TPH Capability
> >>      or MSI-X table. U32_MAX CPU ID clears the corresponding ST entry.
> >> GET: Retrieve ST values per CPU ID, only available in DS mode.
> >>
> >> Both operations are only valid when TPH is enabled on the device.  
> > 
> > What?  That doesn't align with the hardware programming model.  The
> > specification even suggests that the device should be quiesced or TPH
> > disabled during programming of the steering tags for deterministic
> > behavior.  
> 
> As mentioned in the 4/5 commit, obtaining ST must first enable TPH.
> Of course, I think it might be necessary to calculate pdev->tph_req_type
> during the capability detection phase, so that we can modify ST before
> enabling TPH. Do you think such a modification is needed?

I think a uAPI that doesn't align with the recommended programming
order of the specification, potentially resulting in non-deterministic
behavior is bogus.

> >> Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
> >> ---
> >>  drivers/vfio/pci/vfio_pci_core.c | 91 ++++++++++++++++++++++++++++++++
> >>  include/uapi/linux/vfio.h        | 41 ++++++++++++++
> >>  2 files changed, 132 insertions(+)
> >>
> >> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> >> index 0d0140202280..13a22c1d6ea6 100644
> >> --- a/drivers/vfio/pci/vfio_pci_core.c
> >> +++ b/drivers/vfio/pci/vfio_pci_core.c
> >> @@ -1558,6 +1558,95 @@ static int vfio_pci_core_feature_token(struct vfio_pci_core_device *vdev,
> >>  	return 0;
> >>  }
> >>  
> >> +static int vfio_pci_core_feature_tph_st(struct vfio_pci_core_device *vdev,
> >> +					u32 flags,
> >> +					struct vfio_device_feature_tph_st __user *arg,
> >> +					size_t argsz)
> >> +{
> >> +	bool is_set = !!(flags & VFIO_DEVICE_FEATURE_SET);
> >> +	struct vfio_device_feature_tph_st tph_st;
> >> +	struct pci_dev *pdev = vdev->pdev;
> >> +	enum tph_mem_type mtype;
> >> +	int i, j, ret;
> >> +	u32 *cpus;
> >> +	u16 st;
> >> +
> >> +	guard(mutex)(&vdev->tph_lock);  
> > 
> > Wrong place for this lock.  Why do we even need it?  If a user races
> > SET/GET with enable/disable, that's their problem.  
> 
> OK, will fix in next version
> 
> >   
> >> +
> >> +	ret = vfio_check_feature(flags, argsz,
> >> +				 VFIO_DEVICE_FEATURE_GET |
> >> +				 VFIO_DEVICE_FEATURE_SET,
> >> +				 sizeof(tph_st));
> >> +	if (ret <= 0)
> >> +		return ret;
> >> +
> >> +	if (!enable_unsafe_tph ||  
> > 
> > Why did we allow the feature to be probe if it can't be used?  
> 
> OK, will fix in next version
> 
> >   
> >> +		pcie_tph_enabled_mode(pdev) == PCI_TPH_ST_NS_MODE)  
> > 
> > Here's that ambiguity of reporting NS_MODE for disabled, but as above
> > it's a bogus requirement.  
> 
> OK, will fix in next version
> 
> >   
> >> +		return -EOPNOTSUPP;
> >> +	if (!is_set && pcie_tph_enabled_mode(pdev) != PCI_TPH_ST_DS_MODE)  
> > 
> > Why?  Let the interface be symmetric that GET works regardless of mode.  
> 
> OK, will remove the restrict
> 
> >   
> >> +		return -EOPNOTSUPP;
> >> +	if (is_set && pcie_tph_get_st_table_loc(pdev) == PCI_TPH_LOC_NONE)
> >> +		return -EOPNOTSUPP;
> >> +
> >> +	if (copy_from_user(&tph_st, arg, sizeof(tph_st)))
> >> +		return -EFAULT;
> >> +
> >> +	if (tph_st.count == 0 || tph_st.count > VFIO_TPH_ST_MAX_COUNT ||
> >> +		tph_st.flags > VFIO_TPH_ST_MEM_TYPE_PM)  
> > 
> > Wrong operation for a flags field.  
> 
> Because currently flags only use bit0 (0-VM, 1-PM), this judge make sure
> it will return error if userspace passing non-zero in remain bits.

This should be a mask to test undefined bits are not set, not a value
comparison.

> >> +		return -EINVAL;
> >> +	if (!is_set && tph_st.index != 0)  
> > 
> > We can only GET index 0?  Why?  
> 
> Because GET operation don't need index, this judgment make sure
> it will return error if userspace passing non-zero index.

The interface is not symmetric.
 
> >> +		return -EINVAL;
> >> +	if (is_set && (tph_st.index >= VFIO_TPH_ST_MAX_COUNT ||
> >> +		       tph_st.index + tph_st.count > VFIO_TPH_ST_MAX_COUNT))  
> > 
> > DS can have more that 2048 steering tags.  
> 
> I check the PCIE protocol, DS mode with ST in pcie-config or msi-x must not
> hold > 2048 steering tags.

The MSI-X storage location is limited to 2048, but DS mode doesn't
necessarily use either of these storage locations.

> >> +		return -EINVAL;
> >> +
> >> +	cpus = memdup_array_user(&arg->data, tph_st.count, sizeof(*cpus));  
> > 
> > Use an __aligned_u64 data_uptr like iommufd does and avoid copying this
> > into the kernel.  Isn't this steering tags, why's it named cpus?  
> 
> we reuse the field which according your previous suggestion:
> in GET, user pass in cpus, and vfio will feedback steering tags in cpus.

Perhaps this was over stated, but aligning to the generic uAPI name
since it's dual-purpose would seem to make sense.
 
> >> +	if (IS_ERR(cpus))
> >> +		return PTR_ERR(cpus);
> >> +
> >> +	mtype = tph_st.flags & VFIO_TPH_ST_MEM_TYPE_PM ? TPH_MEM_TYPE_PM :
> >> +							 TPH_MEM_TYPE_VM;  
> > 
> > Then why did we use two flag bits for this?  
> 
> VFIO_TPH_ST_MEM_TYPE_PM is just one bit, could you explain the two flag bits?

My bad, I overlooked that they're both shifted by 0.  Typically we
wouldn't define both the 0 and 1 value, only the latter.
 
> >> +	if (!is_set) {
> >> +		for (i = 0; i < tph_st.count; i++) {
> >> +			ret = pcie_tph_get_cpu_st(pdev, mtype, cpus[i], &st);  
> > 
> > My understanding was that GET would return the translated values from a
> > previous SET, not translate on the fly.  I think in order to support
> > writing STs before enabling TPH, we're going to need to store them
> > somewhere anyway so that we can push them to the correct location when
> > TPH is enabled.  That means there a buffer for SET/GET regardless of
> > the mode that's enabled.  
> 
> OK, will try to fix
> 
> >   
> >> +			if (ret)
> >> +				goto out;
> >> +			cpus[i] = st;
> >> +		}
> >> +		goto out;
> >> +	}
> >> +
> >> +	for (i = 0; i < tph_st.count; i++) {
> >> +		if (cpus[i] == U32_MAX) {
> >> +			ret = pcie_tph_set_st_entry(pdev, tph_st.index + i, 0);  
> > 
> > Why do we need a magic value to set it to zero?  Hardware doesn't have
> > a "clear ST register value".  
> 
> So you mean for SET, usespace direct passing steering-tag not cpus in this patch?

How does this work with Zhiping's series where there might be TPH values
exposed through the dma-buf to support p2p DMA?  Where would that level
of TPH programming occur?
 
> >> +			if (ret)
> >> +				goto out;
> >> +			continue;
> >> +		}
> >> +
> >> +		ret = pcie_tph_get_cpu_st(pdev, mtype, cpus[i], &st);
> >> +		if (ret)
> >> +			goto out;
> >> +		ret = pcie_tph_set_st_entry(pdev, tph_st.index + i, st);
> >> +		if (ret)
> >> +			goto out;  
> > 
> > Clearly broken for DS mode.  
> 
> For DS mode:
> 1\ if ST table in CONFIG or MSI-X, then it could done by this way
> 2\ if ST table not in CONFIG or MSI-X, then it only need invoke GET, and set by device-specific way.

The SET only after enable model is broken, DS mode needs to support
SET, where GET returns the host translations.  Thanks,

Alex

next prev parent reply	other threads:[~2026-05-22 14:27 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-18  7:16 [PATCH v11 0/5] vfio/pci: Add PCIe TPH support Chengwen Feng
2026-05-18  7:16 ` [PATCH v11 1/5] PCI/TPH: Fix pcie_tph_get_st_table_loc() field extraction Chengwen Feng
2026-05-18  7:16 ` [PATCH v11 2/5] PCI/TPH: Export pcie_tph_get_st_modes() for external use Chengwen Feng
2026-05-22  4:10   ` Alex Williamson
2026-05-18  7:16 ` [PATCH v11 3/5] PCI/TPH: Add pcie_tph_enabled_mode() helper Chengwen Feng
2026-05-22  4:10   ` Alex Williamson
2026-05-22  9:18     ` fengchengwen
2026-05-22 14:00       ` Alex Williamson
2026-05-18  7:17 ` [PATCH v11 4/5] vfio/pci: Add PCIe TPH configuration space virtualization Chengwen Feng
2026-05-22  4:10   ` Alex Williamson
2026-05-22  9:39     ` fengchengwen
2026-05-22 14:09       ` Alex Williamson
2026-05-18  7:17 ` [PATCH v11 5/5] vfio/pci: Add VFIO_DEVICE_FEATURE_TPH_ST for TPH ST entry management Chengwen Feng
2026-05-22  4:10   ` Alex Williamson
2026-05-22 10:04     ` fengchengwen
2026-05-22 14:27       ` Alex Williamson [this message]
2026-05-26  6:29         ` fengchengwen
2026-05-22  1:31 ` [PATCH v11 0/5] vfio/pci: Add PCIe TPH support fengchengwen

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260522082730.3d843a7f@shazbot.org \
    --to=alex@shazbot.org \
    --cc=fengchengwen@huawei.com \
    --cc=helgaas@kernel.org \
    --cc=jgg@ziepe.ca \
    --cc=kvm@vger.kernel.org \
    --cc=linux-pci@vger.kernel.org \
    --cc=liuyonglong@huawei.com \
    --cc=wangyushan12@huawei.com \
    --cc=wangzhou1@hisilicon.com \
    --cc=wathsala.vithanage@arm.com \
    --cc=wei.huang2@amd.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox